1 node restarted automatically
05-09-2006 08:21 PM
We have a cluster running with 2 nodes and just 1 package.
The problem is that yesterday node1 restarted by itself.
We checked the syslog and OLDsyslog files, event.log, shutdownlog and the tombstones files on both nodes and found:
- It doesn't seem to have crashed due to a HW issue, as there isn't any message in any log on node1.
- We only found that node2 lost its connection to node1 some minutes before the restart. This makes us think node1 restarted at node2's request.
How can I check whether node2 really requested node1 to restart?
Is there any other log file to check?
FYI...
- While the node was rebooting, its status in the cluster was failed and the package was halted (but this is normal, as the auto-run option is disabled).
- OS is HP-UX B.11.00
- ServiceGuard version is 11.13.
Thanks in advance for your help.
Regards,
Carles
05-09-2006 08:34 PM
Re: 1 node restarted automatically
We had this problem a while ago. It was caused by a bottleneck on the heartbeat LAN, which made each server think the other was down and reboot. I cannot remember which version of Serviceguard we were using, but when the error occurred we upgraded to the latest version and made sure that the heartbeat LAN was not used by anything else.
Sorry I cannot be more specific, but I am not working for the company any more.
Regards,
JASH
05-09-2006 08:34 PM
Re: 1 node restarted automatically
Is the heartbeat network working? Are the nodes reachable on that network?
What does cmviewcl show?
Are the latest patches applied?
Regards,
Ninad
05-09-2006 08:46 PM
Re: 1 node restarted automatically
HP-UX 11.00 goes out of support on December 31st this year.
Serviceguard 11.13 has been out of support for over 18 months.
Node 2 would not have requested a node shutdown.
Having said that, it seems that node1 "disappeared", leaving node2 as the cluster. You need to review the OLDsyslog on node1 and look for any hints of network issues. Also (assuming NTP is used to keep the time synced between the two), check the last entry in the OLDsyslog and try to correlate it with the messages logged in node2's syslog.
If nothing obvious appears and, as you say, the shutdown log shows nothing, you may have had a hardware failure that was NOT logged, possibly power related, or a reboot -q was done, or someone reset the server.
05-09-2006 08:49 PM
Re: 1 node restarted automatically
If a heartbeat problem caused a split brain, you should find a message about it in syslog, and normally also a TOC message in shutdownlog.
BTW, is the fail_fast option enabled?
Could you post the syslog messages from node2?
GOOD LUCK!!
05-09-2006 09:47 PM
Re: 1 node restarted automatically
Here are the last lines of the OLDsyslog file from node1:
May 8 08:46:51 ra named[523]: NSTATS 1147070811 1128609976 A=1707 PTR=172154 33=16
May 8 08:46:51 ra named[523]: XSTATS 1147070811 1128609976 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=86769 SAns=172150 SFwdQ=637 SDupQ=1398493 SErr=0 RQ=173877 RIQ=0 RFwdQ=637 RDupQ=1090 RTCP=0 SFwdR=0 SFail=0 SFErr=0 SNaAns=0 SNXD=0
May 8 09:33:27 ra ftpd[5988]: FTP LOGIN FROM ws133.x30-56.santpau.es [172.30.56.133], root
May 8 09:46:51 ra named[523]: NSTATS 1147074411 1128609976 A=1707 PTR=172154 33=16
May 8 09:46:51 ra named[523]: XSTATS 1147074411 1128609976 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=86769 SAns=172150 SFwdQ=637 SDupQ=1398493 SErr=0 RQ=173877 RIQ=0 RFwdQ=637 RDupQ=1090 RTCP=0 SFwdR=0 SFail=0 SFErr=0 SNaAns=0 SNXD=0
May 8 09:54:45 ra ftpd[5988]: FTP session closed
May 8 09:58:07 ra ftpd[9138]: FTP LOGIN FROM ws133.x30-56.santpau.es [172.30.56.133], root
May 8 10:05:00 ra ftpd[9138]: FTP session closed
May 8 10:46:51 ra named[523]: NSTATS 1147078011 1128609976 A=1707 PTR=172154 33=16
May 8 10:46:51 ra named[523]: XSTATS 1147078011 1128609976 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=86769 SAns=172150 SFwdQ=637 SDupQ=1398493 SErr=0 RQ=173877 RIQ=0 RFwdQ=637 RDupQ=1090 RTCP=0 SFwdR=0 SFail=0 SFErr=0 SNaAns=0 SNXD=0
And here are the syslog lines from node2 (from more or less the same time):
May 8 08:37:32 isis vmunix: SCSI: bp: 000000004d698000
May 8 08:37:32 isis vmunix: dev: cd160140
May 8 08:37:32 isis vmunix: cdb: 00 00 00 00 00 00
May 8 08:37:32 isis vmunix: status: (02) Check Condition
May 8 08:37:32 isis vmunix: sense data: 70 00 06 42 55 5a 5a 0a 00 00 00 00 29 00 01 00
May 8 08:37:32 isis vmunix: 00 00
May 8 08:37:32 isis vmunix: sense key: (06) Unit Attention
May 8 08:37:32 isis vmunix: additional sense code: (29)
May 8 08:37:32 isis vmunix: additional sense code qualifier: (00)
May 8 09:04:07 isis vmunix:
May 8 09:04:07 isis vmunix: SCSI: bp: 000000004d7e5c00
May 8 09:04:07 isis vmunix: dev: cd160140
May 8 09:04:07 isis vmunix: cdb: 00 00 00 00 00 00
May 8 09:04:07 isis vmunix: status: (02) Check Condition
May 8 09:04:07 isis vmunix: sense data: 70 00 06 42 55 5a 5a 0a 00 00 00 00 29 00 01 00
May 8 09:04:07 isis vmunix: 00 00
May 8 09:04:07 isis vmunix: sense key: (06) Unit Attention
May 8 09:04:07 isis vmunix: additional sense code: (29)
May 8 09:04:07 isis vmunix: additional sense code qualifier: (00)
May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.
May 8 11:15:44 isis cmcld: Attempting to form a new cluster
May 8 11:15:47 isis cmcld: Obtaining Cluster Lock
May 8 11:15:48 isis cmcld: Turning off safety time protection since the cluster
May 8 11:15:48 isis cmcld: may now consist of a single node. If ServiceGuard
May 8 11:15:48 isis cmcld: fails, this node will not automatically halt
May 8 11:16:15 isis vmunix: NFS server spe not responding still trying
May 8 11:16:47 isis cmcld: 1 nodes have formed a new cluster, sequence #4
May 8 11:16:47 isis cmcld: The new active cluster membership is: isis(id=1)
May 8 11:16:47 isis cmcld: Executing '/etc/cmcluster/spe/spe.sh start' for package spe, as service PKG*23809.
Since node1 restarted, everything has been working fine again.
As for the possibility of a HW failure, I already contacted HP and they don't think that is the cause.
We're planning to move the cluster to new servers, so we will update ServiceGuard and the OS, but until we change systems I need to be sure the system won't restart by itself again.
Thanks in advance for your help.
Regards,
Carles
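One way to do the correlation suggested earlier in the thread is to compare the last timestamp node1 logged with the moment node2's cmcld gave up on it. The following is only a sketch in Python, not anything shipped with Serviceguard: the function names are made up for illustration, the year 2006 is an assumption because syslog lines do not carry one, and it assumes the clocks on both nodes were in sync.

```python
from datetime import datetime

def parse_syslog_time(line, year=2006):
    """Return the timestamp at the start of a syslog line (the year is assumed)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def first_match_time(lines, needle):
    """Timestamp of the first line containing needle, or None if there is none."""
    for line in lines:
        if needle in line:
            return parse_syslog_time(line)
    return None

# Last complete line posted from node1's OLDsyslog.
node1_tail = [
    "May 8 10:46:51 ra named[523]: NSTATS 1147078011 1128609976 A=1707 PTR=172154 33=16",
]

# A few of the lines posted from node2's syslog.
node2_lines = [
    "May 8 09:04:07 isis vmunix: sense key: (06) Unit Attention",
    "May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.",
    "May 8 11:15:44 isis cmcld: Attempting to form a new cluster",
]

last_seen = parse_syslog_time(node1_tail[-1])
timed_out = first_match_time(node2_lines, "cmcld: Timed out node ra")
gap = timed_out - last_seen
print(gap)  # 0:28:53
```

With the lines posted above this gives a gap of just under 29 minutes. Since named on node1 only logs its statistics once an hour, that figure is an upper bound on how long node1 was silent, not a measurement of when it stopped.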
05-09-2006 10:26 PM
Re: 1 node restarted automatically
May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.
It looks like either a network glitch lasting a few seconds or node 1 going through an RS/TOC; check the chassis logs. During that time the other node re-formed the cluster.
Since you said HP did not report anything about a crash, my advice is: don't panic, just monitor it and plan for the upgrade.
Please check your cluster logs; they will provide more information.
Chan
05-10-2006 12:44 AM
Solution
Verify the entries in /var/adm/shutdownlog.
Verify the crash file:
# grep SAVECRASH_DIR /etc/rc.config.d/savecrash
--> # SAVECRASH_DIR=/var/adm/crash
# cd /var/adm/crash
# ll -t crash*
If there is a crash dump, open a software incident with HP service.
rgs,
ran
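The grep step above can also be scripted. This is a minimal sketch in Python under two assumptions that the thread does not confirm: that a commented-out SAVECRASH_DIR line means the directory shown in the comment (/var/adm/crash) is the one in effect, and that the setting is a plain KEY=value line. The function name and the /dumps path are hypothetical.

```python
import re

# Assumption: this is the directory in effect when the setting is commented out.
DEFAULT_DIR = "/var/adm/crash"

def savecrash_dir(config_text):
    """Return the active SAVECRASH_DIR value, or DEFAULT_DIR if none is set."""
    for line in config_text.splitlines():
        # Only uncommented assignments count; "# SAVECRASH_DIR=..." is skipped.
        match = re.match(r"\s*SAVECRASH_DIR=(\S+)", line)
        if match:
            return match.group(1).strip('"')
    return DEFAULT_DIR

# The commented-out line shown in the post falls back to the default.
print(savecrash_dir("# SAVECRASH_DIR=/var/adm/crash\n"))  # /var/adm/crash

# A hypothetical override is picked up as written.
print(savecrash_dir("SAVECRASH_DIR=/dumps\n"))  # /dumps
```

On a real system the text would come from reading /etc/rc.config.d/savecrash, and the crash.N directories under the result could then be listed newest first, as ll -t does.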
05-10-2006 01:30 AM
Re: 1 node restarted automatically
In the cluster logs there's nothing about a system restart, but there is a crash directory that was created at the same date and time the system crashed.
I called HP and opened a case asking about this directory and the information it contains.
As soon as I have news from them I'll let you know.
Thanks to all.
Regards,
Carles
05-10-2006 07:00 PM
Re: 1 node restarted automatically
a) You can confirm from the crash dump whether it was a Serviceguard TOC by checking the INDEX file in the crash directory.
If you cat the INDEX file and the panic string contains "SafetyTimer expired, isr.ior =", that confirms it was an SG issue.
b) You can also check whether a network issue caused the SG TOC. The procedure is:
# netfmt -f /var/adm/nettl.LOG000 > /tmp/net.tmp
-Amit
05-10-2006 07:53 PM
Re: 1 node restarted automatically
I checked the INDEX file in the crash directory and there isn't any panic message. Does that mean the restart was not due to a TOC?
I also checked the network log as you suggested, and here are the messages from before and after the restart:
***********************************STREAMS/UX*******************************@#%
Timestamp : Mon May 08 METDST 2006 10:50:38.496645
Process ID : 4758 Subsystem : STREAMS
User ID ( UID ) : 103 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6251 10:50:38 1846849149 1 T.. 5321 14680 tcp_rput_other: case T_ERROR_ACK, ERROR_prim == 1
***********************************STREAMS/UX*******************************@#%
Timestamp : Mon May 08 METDST 2006 11:37:02.023737
Process ID : 4203 Subsystem : STREAMS
User ID ( UID ) : 103 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 11:37:02 73390 1 T.. 5321 106 tcp_rput_other: case T_ERROR_ACK, ERROR_prim == 1
Thanks for your help.
Regards,
Carles
05-10-2006 07:53 PM
Re: 1 node restarted automatically
I forgot to say that HP didn't help me, as our SG version is out of support.
Regards,
Carles
05-10-2006 08:40 PM
Re: 1 node restarted automatically
If nothing is logged there, then it is most probably some form of hardware failure OR a reboot -q was issued. In both cases you would not get a panic.
Was there an INDEX file in the crash directory? If so, post that.
05-10-2006 09:00 PM
Re: 1 node restarted automatically
Please post the syslog/OLDsyslog/shutdownlog/network log from both nodes, as well as the "INDEX" file from the node that crashed (/var/adm/crash/crash.0).
-Amit
05-10-2006 10:26 PM
Re: 1 node restarted automatically
# more INDEX
comment savecrash crash dump INDEX file
version 2
hostname ra
modelname 9000/800/N4000-75
panic , isr.ior = 0'183405d7.80000000'61befad0
dumptime 1147079804 Mon May 8 11:16:44 METDST 2006
savetime 1147080405 Mon May 8 11:26:45 METDST 2006
release @(#)B2352B/9245XB HP-UX (B.11.00) #1: Wed Nov 5 22:38:19 PST 1997
memsize 2147483648
chunksize 134217728
module /stand/vmunix vmunix 19707144 938661231
image image.1.1 0x0000000000000000 0x0000000007ffc000 0x0000000000000000 0x000000000000885f 2589870395
image image.1.2 0x0000000000000000 0x0000000007ff9000 0x0000000000008860 0x0000000000010857 1714528652
image image.1.3 0x0000000000000000 0x0000000007ffc000 0x0000000000010858 0x00000000000244f7 2041002910
image image.1.4 0x0000000000000000 0x0000000005656000 0x00000000000244f8 0x000000000007ffff 3663242024
image image.2.1 0x0000000000000000 0x0000000007ff6000 0x0000000000100000 0x0000000000141bff 1043799285
image image.2.2 0x0000000000000000 0x00000000035b2000 0x0000000000141c00 0x000000000017ffff 2622514722
image image.3.1 0x0000000000000000 0x000000000524d000 0x0000000000280000 0x00000000002fffff 2486325632
# more /tmp/net.tmp
#
Regards,
Carles
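The INDEX check described earlier in the thread can be sketched in a few lines of Python. This is an illustration only: parse_index is a made-up helper, the sample is a subset of the INDEX lines posted above, and the only thing it tests for is the "SafetyTimer expired" text that Amit mentioned.

```python
from datetime import datetime, timezone

def parse_index(text):
    """Map the first word of each INDEX line to the rest of that line."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        # "image" lines repeat once per dump chunk, so they are left out here.
        if key and key != "image":
            fields.setdefault(key, value.strip())
    return fields

sample = """comment savecrash crash dump INDEX file
version 2
hostname ra
panic , isr.ior = 0'183405d7.80000000'61befad0
dumptime 1147079804 Mon May 8 11:16:44 METDST 2006
"""

info = parse_index(sample)

# Does the panic string carry the Serviceguard safety timer text?
names_safety_timer = "SafetyTimer expired" in info["panic"]
print(names_safety_timer)  # False

# The first dumptime field is seconds since the epoch; show it in UTC.
dump_utc = datetime.fromtimestamp(int(info["dumptime"].split()[0]), tz=timezone.utc)
print(dump_utc.isoformat())  # 2006-05-08T09:16:44+00:00
```

Here the panic string is empty before ", isr.ior", so the safety timer text is absent and this check alone does not confirm a Serviceguard TOC.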
05-10-2006 10:29 PM
Re: 1 node restarted automatically
panic , isr.ior = 0'183405d7.80000000'61befad0
This tells you that you had either an HPMC (hardware) panic or a TOC panic.
The TOC could have been done manually, or it may have been a Serviceguard-induced panic.
You would need to get the dump analysed by HP to be certain.
But my recommendation would be to update Serviceguard and patch it, and then review any hardware logs.
05-10-2006 10:35 PM
Re: 1 node restarted automatically
panic , isr.ior = 0'183405d7.80000000'61befad0 =======> (isr.ior)
dumptime 1147079804 Mon May 8 11:16:44 METDST 2006
Network Logs:
Timestamp : Mon May 08 METDST 2006 10:50:38.496645
Syslog:
May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.
May 8 11:15:44 isis cmcld: Attempting to form a new cluster
May 8 11:15:47 isis cmcld: Obtaining Cluster Lock
May 8 11:15:48 isis cmcld: Turning off safety time protection since the cluster
It's clear from the logs that communication between node2 and node1 broke down at "May 8 11:15:44" due to a network error, and node1 panicked at "May 8 11:16:44".
It seems to be an SG TOC only. Just check the network connection. Also send me the output of the following:
# cmgetconf
-Amit
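To put a number on that timeline, here is a rough Python sketch that subtracts the cmcld timeout on node2 from the dumptime recorded on node1. It assumes METDST is UTC+2 (which agrees with the dumptime line in the INDEX file) and that the two nodes' clocks were synchronized, which the thread does not confirm.

```python
from datetime import datetime, timedelta, timezone

# Assumption: METDST is two hours ahead of UTC.
METDST = timezone(timedelta(hours=2))

# "May 8 11:15:44 isis cmcld: Timed out node ra." from node2's syslog;
# the year is taken from the INDEX file, since syslog does not record it.
timed_out = datetime(2006, 5, 8, 11, 15, 44, tzinfo=METDST)

# "dumptime 1147079804" from node1's INDEX file (seconds since the epoch).
dumped = datetime.fromtimestamp(1147079804, tz=timezone.utc)

delta = dumped - timed_out
print(delta.total_seconds())  # 60.0
```

A 60 second gap fits the order of events described above, but on its own it does not show what caused the timeout.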
05-10-2006 11:01 PM
Re: 1 node restarted automatically
A colleague told me he tried to run the package on node2 when he noticed node1 wasn't responding.
Could the cmrunpkg command on node2 have forced the restart (sent the TOC)?
BTW, I attached the package configuration file extracted with the cmgetconf command.
Regards,
Carles
05-10-2006 11:11 PM
Re: 1 node restarted automatically
You really should request a dump analysis from the HP Response Centre if you want to get to the bottom of the crash, but I suspect the answer is that it was a Serviceguard TOC caused by some form of network issue. But as you are running an unsupported version of SG, that is all you may get.
If it is NOT a Serviceguard TOC, then they should investigate further for you.
05-10-2006 11:16 PM
Re: 1 node restarted automatically
I went through the configuration file; everything seems to be OK.
Regarding your query:
Could the cmrunpkg command on node2 have forced the restart (sent the TOC)?
The answer is no. Although it seems to be an SG TOC, "cmrunpkg" can't initiate it. There can be multiple reasons, ranging from a network issue to improper patching. Dump analysis is the only way to find the root cause of the SG TOC.
-Amit
05-10-2006 11:54 PM
Re: 1 node restarted automatically
I think we have to update SG or move all data and applications to new systems. We wanted to do this within 1 year, but maybe we will do it sooner.
BTW, do you think I can get some information from the service processor?
Someone from my office suggested it to me, but I don't really know much about this tool.
Thanks again for your help.
Regards,
Carles
05-11-2006 12:01 AM
Re: 1 node restarted automatically
You can also get information from the service processor, but if it's a software issue like an SG TOC, as in our case, it won't be of much use. However, it can help determine whether a hardware issue with the system caused the reboot.
To get the logs from the service processor, the procedure is:
Step 1: Press Ctrl+B at the console. This takes you to the GSP prompt.
Step 2: GSP> sl (SL = Service Logs)
Step 3: Type "e" for error logs.
Step 4: Capture the output and post it for our analysis.
However, you need a specialised tool to decode the hex codes in the logs.
-Amit
05-11-2006 03:15 AM
Re: 1 node restarted automatically
Here is the log entry from before the system restart (note that the system restarted at 11h30, more or less):
SYSTEM NAME: lc-ra
DATE: 05/08/2006 TIME: 09:17:39
ALERT LEVEL: 3 = System blocked waiting for operator input
SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 0 = no problem detail
CALLER ACTIVITY: E = system warning STATUS: 0
CALLER SUBACTIVITY: F0 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 01
0xF8E010301100EF00 00000000 0000EF00 type 31 = legacy PA HEX chassis-code
0x58E018301100EF00 00006A04 08091127 type 11 = Timestamp 05/08/2006 09:17:39
That's all I can find. What do you think?
Regards,
Carles
05-11-2006 10:43 AM
Re: 1 node restarted automatically
If you are using 2 nodes with 1 package, why don't you bring down the 2nd node in the meantime so you don't see another TOC?
Best of luck,
Andy
05-11-2006 04:09 PM
Re: 1 node restarted automatically
#netfmt -f /var/adm/nettl.LOG000 >/tmp/net.tmp