Re: 1 node restarted automatically

Carles Viaplana · ‎05-09-2006

Hello,

We have a cluster running with 2 nodes and just 1 package.

The problem is that node1 restarted automatically yesterday.

We checked the syslog and OLDsyslog files, event.log, shutdownlog and the tombstones files on both nodes, and:

- It doesn't seem to have crashed due to a HW issue, as there isn't any message in any log on node1
- We only found that node2 lost its connection to node1 some minutes before the restart. This makes us think node1 restarted at node2's request

How can I check whether node2 really requested node1 to restart?
Is there any other log file to check?

FYI...

- While the node was rebooting, its status in the cluster was failed and the package was halted (but this is normal, as the auto-run option is disabled).
- OS is HP-UX B.11.00
- ServiceGuard version is 11.13.

Thanks in advance for your help.
Regards,

Carles

JASH_2 · ‎05-09-2006

Carles,

We had this problem a while ago. It was caused by a bottleneck on the heartbeat LAN, which made the servers think each other were down and reboot. I cannot remember which version of Serviceguard we were using, but when the error occurred we upgraded to the latest and ensured that the heartbeat LAN was not used by anything else.

Sorry I cannot be more specific, but I am not working for the company any more.

Regards,

JASH

If I can, I will!

Ninad_1 · ‎05-09-2006

Carles,

Is the heartbeat network working? Are the nodes reachable on that network?
What does cmviewcl show?
Are the latest patches applied?

Regards,
Ninad

melvyn burnard · ‎05-09-2006

Couple of points to make first:
HP-UX 11.00 goes out of support December 31st this year
Serviceguard 11.13 has been out of support for over 18 months
Node 2 would not have requested node shutdown
Having said that, it seems that node1 "disappeared", leaving node2 as the cluster. You need to review the OLDsyslog on node1, and look for any hints that there were network issues, also (assuming ntp is used to keep the time synced between the two) check the last entry in the OLDsyslog and try to correlate that with messages logged in node2 syslog.

If nothing appears obvious, and as you say the shutdown log shows nothing, you may have had a hardware failure that was NOT logged, possibly power related, or a reboot -q was done, or someone did a reset on the server.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Warren_9 · ‎05-09-2006

Hi,

If a heartbeat problem caused a split brain, you should find a message in syslog and normally also a TOC message in shutdownlog.

BTW, is the fail_fast option enabled?

Could you post the syslog messages from node2?

GOOD LUCK!!

Carles Viaplana · ‎05-09-2006

Thanks to all for your answers.

Here are the last lines of the OLDsyslog file from node1:

May 8 08:46:51 ra named[523]: NSTATS 1147070811 1128609976 A=1707 PTR=172154 33=16
May 8 08:46:51 ra named[523]: XSTATS 1147070811 1128609976 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=86769 SAns=172150 SFwdQ=637 SDupQ=1398493 SErr=0 RQ=173877 RIQ=0 RFwdQ=637 RDupQ=1090 RTCP=0 SFwdR=0 SFail=0 SFErr=0 SNaAns=0 SNXD=0
May 8 09:33:27 ra ftpd[5988]: FTP LOGIN FROM ws133.x30-56.santpau.es [172.30.56.133], root
May 8 09:46:51 ra named[523]: NSTATS 1147074411 1128609976 A=1707 PTR=172154 33=16
May 8 09:46:51 ra named[523]: XSTATS 1147074411 1128609976 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=86769 SAns=172150 SFwdQ=637 SDupQ=1398493 SErr=0 RQ=173877 RIQ=0 RFwdQ=637 RDupQ=1090 RTCP=0 SFwdR=0 SFail=0 SFErr=0 SNaAns=0 SNXD=0
May 8 09:54:45 ra ftpd[5988]: FTP session closed
May 8 09:58:07 ra ftpd[9138]: FTP LOGIN FROM ws133.x30-56.santpau.es [172.30.56.133], root
May 8 10:05:00 ra ftpd[9138]: FTP session closed
May 8 10:46:51 ra named[523]: NSTATS 1147078011 1128609976 A=1707 PTR=172154 33=16
May 8 10:46:51 ra named[523]: XSTATS 1147078011 1128609976 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=86769 SAns=172150 SFwdQ=637 SDupQ=1398493 SErr=0 RQ=173877 RIQ=0 RFwdQ=637 RDupQ=1090 RTCP=0 SFwdR=0 SFail=0 SFErr=0 SNaAns=0 SNXD=0

And here are the syslog lines from node2 (at more or less the same time):

May 8 08:37:32 isis vmunix: SCSI: bp: 000000004d698000
May 8 08:37:32 isis vmunix: dev: cd160140
May 8 08:37:32 isis vmunix: cdb: 00 00 00 00 00 00
May 8 08:37:32 isis vmunix: status: (02) Check Condition
May 8 08:37:32 isis vmunix: sense data: 70 00 06 42 55 5a 5a 0a 00 00 00 00 29 00 01 00
May 8 08:37:32 isis vmunix: 00 00
May 8 08:37:32 isis vmunix: sense key: (06) Unit Attention
May 8 08:37:32 isis vmunix: additional sense code: (29)
May 8 08:37:32 isis vmunix: additional sense code qualifier: (00)
May 8 09:04:07 isis vmunix:
May 8 09:04:07 isis vmunix: SCSI: bp: 000000004d7e5c00
May 8 09:04:07 isis vmunix: dev: cd160140
May 8 09:04:07 isis vmunix: cdb: 00 00 00 00 00 00
May 8 09:04:07 isis vmunix: status: (02) Check Condition
May 8 09:04:07 isis vmunix: sense data: 70 00 06 42 55 5a 5a 0a 00 00 00 00 29 00 01 00
May 8 09:04:07 isis vmunix: 00 00
May 8 09:04:07 isis vmunix: sense key: (06) Unit Attention
May 8 09:04:07 isis vmunix: additional sense code: (29)
May 8 09:04:07 isis vmunix: additional sense code qualifier: (00)
May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.
May 8 11:15:44 isis cmcld: Attempting to form a new cluster
May 8 11:15:47 isis cmcld: Obtaining Cluster Lock
May 8 11:15:48 isis cmcld: Turning off safety time protection since the cluster
May 8 11:15:48 isis cmcld: may now consist of a single node. If ServiceGuard
May 8 11:15:48 isis cmcld: fails, this node will not automatically halt
May 8 11:16:15 isis vmunix: NFS server spe not responding still trying
May 8 11:16:47 isis cmcld: 1 nodes have formed a new cluster, sequence #4
May 8 11:16:47 isis cmcld: The new active cluster membership is: isis(id=1)
May 8 11:16:47 isis cmcld: Executing '/etc/cmcluster/spe/spe.sh start' for package spe, as service PKG*23809.
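
The correlation suggested earlier in the thread (last entry logged by node1 versus the first cmcld timeout on node2) can be sketched in a few lines of Python. This is only an illustration: the helper names are made up for this example, the year 2006 is assumed because syslog stamps carry no year, and the result is only meaningful if the clocks on both nodes are in sync.

```python
from datetime import datetime

def parse_syslog_time(line, year=2006):
    """Parse the leading 'Mon D HH:MM:SS' stamp of a syslog line.

    Returns None for lines without a valid stamp (for example wrapped
    continuation lines). The year is an assumption passed in by the caller.
    """
    parts = line.split()
    if len(parts) < 3:
        return None
    stamp = f"{year} {parts[0]} {parts[1]} {parts[2]}"
    try:
        return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def last_entry_time(lines, year=2006):
    """Latest timestamp found in a list of syslog lines, or None."""
    stamps = [t for t in (parse_syslog_time(l, year) for l in lines) if t]
    return max(stamps) if stamps else None

def first_match_time(lines, needle, year=2006):
    """Timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            return parse_syslog_time(line, year)
    return None

# A few lines copied from the two excerpts above.
node1_lines = [
    "May 8 10:05:00 ra ftpd[9138]: FTP session closed",
    "May 8 10:46:51 ra named[523]: NSTATS 1147078011 1128609976 A=1707 PTR=172154 33=16",
]
node2_lines = [
    "May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.",
    "May 8 11:15:44 isis cmcld: Attempting to form a new cluster",
]

last_seen = last_entry_time(node1_lines)
timed_out = first_match_time(node2_lines, "Timed out node")
gap = timed_out - last_seen
print(f"node1 last logged at {last_seen}; node2 timed it out at {timed_out}; gap {gap}")
```

A long gap like this one says little by itself, since named only logs statistics once an hour here; it just narrows the window in which node1 went quiet.
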

Since node1 restarted, everything has been working fine again.

About the possibility of a HW failure, I already contacted HP and they don't think that was the cause.

We're planning to move the cluster to new servers, so we will update ServiceGuard and the OS, but until we change systems I need to be sure the system won't restart automatically again.

Thanks in advance for your help.
Regards,

Carles

Chan 007 · ‎05-09-2006

Hi Carles

May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.

Looks like a network glitch lasting a few seconds, or node 1 went through a RS/TOC, during which the other node re-formed the cluster. Check the chassis logs.

As you said HP already found nothing pointing to a crash, my advice is don't panic, just monitor it and plan for the upgrade.

Please check your cluster logs; they will provide more information.

Chan

rariasn · ‎05-10-2006

Hi Carles,

# verify entries in /var/adm/shutdownlog

Verify crash file:

#grep SAVECRASH_DIR /etc/rc.config.d/savecrash

--> # SAVECRASH_DIR=/var/adm/crash

# cd /var/adm/crash

# ll -t crash*

If there is a crash dump, open a software incident with HP service.

rgs,

ran

Carles Viaplana · ‎05-10-2006

Hello,

In the cluster logs there's nothing about the system restart, but there's a crash directory created at the same time and date the system crashed.

I called HP and opened a case asking about this directory and the information it contains.

As soon as I have news from them I'll inform you.

Thanks to all.
Regards,

Carles

Chauhan Amit · ‎05-10-2006

Hi Carles,

a) You can confirm from the crash dump whether it was a Serviceguard TOC by checking the INDEX file in the crash directory.
If you do a cat on the "INDEX" file, it will give you the panic string. If it contains "SafetyTimer expired, isr.ior =", that confirms it to be an SG issue.

b) Another thing you can check is whether a network issue caused the SG TOC. The procedure to check this is:
#netfmt -f /var/adm/nettl.LOG000 >/tmp/net.tmp

-Amit

If you are not a part of solution , then you are a part of problem

Carles Viaplana · ‎05-10-2006

Hello,

I checked the INDEX file in the crash directory and there isn't any panic message. Does this mean the restart was not due to a TOC?

I also checked the network log as you suggested, and here are the messages from before and after the restart:

***********************************STREAMS/UX*******************************@#%
Timestamp : Mon May 08 METDST 2006 10:50:38.496645
Process ID : 4758 Subsystem : STREAMS
User ID ( UID ) : 103 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6251 10:50:38 1846849149 1 T.. 5321 14680 tcp_rput_other: case T_ERROR_ACK, ERROR_prim == 1

***********************************STREAMS/UX*******************************@#%
Timestamp : Mon May 08 METDST 2006 11:37:02.023737
Process ID : 4203 Subsystem : STREAMS
User ID ( UID ) : 103 Log Class : ERROR
Device ID : 0 Path ID : 0
Connection ID : 0 Log Instance : 0
Location : 00123
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 11:37:02 73390 1 T.. 5321 106 tcp_rput_other: case T_ERROR_ACK, ERROR_prim == 1

Thanks for your help.
Regards,

Carles

Carles Viaplana · ‎05-10-2006

Oops!

I forgot to say that HP didn't help me, as our SG version is out of support.

Regards,

Carles

melvyn burnard · ‎05-10-2006

Well, it looks like you may have had a network issue here, but if this caused a TOC panic, you should have something logged in your /etc/shutdownlog.
If nothing is logged there, then it is most probably some form of hardware failure OR a reboot -q was issued. In both cases you would not get a panic.
Was there an INDEX file in the crash directory? If so, post that.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Chauhan Amit · ‎05-10-2006

Can you please zip and attach following files
syslog/OLDsyslog/shutdownlog/Network log from both the nodes as well as "INDEX" file from the node which got crashed. (/var/adm/crash/crash.0)

-Amit

If you are not a part of solution , then you are a part of problem

Carles Viaplana · ‎05-10-2006

Here is the INDEX file:

# more INDEX
comment savecrash crash dump INDEX file
version 2
hostname ra
modelname 9000/800/N4000-75
panic , isr.ior = 0'183405d7.80000000'61befad0
dumptime 1147079804 Mon May 8 11:16:44 METDST 2006
savetime 1147080405 Mon May 8 11:26:45 METDST 2006
release @(#)B2352B/9245XB HP-UX (B.11.00) #1: Wed Nov 5 22:38:19 PST 1997

memsize 2147483648
chunksize 134217728
module /stand/vmunix vmunix 19707144 938661231
image image.1.1 0x0000000000000000 0x0000000007ffc000 0x0000000000000000 0x000000000000885f 2589870395
image image.1.2 0x0000000000000000 0x0000000007ff9000 0x0000000000008860 0x0000000000010857 1714528652
image image.1.3 0x0000000000000000 0x0000000007ffc000 0x0000000000010858 0x00000000000244f7 2041002910
image image.1.4 0x0000000000000000 0x0000000005656000 0x00000000000244f8 0x000000000007ffff 3663242024
image image.2.1 0x0000000000000000 0x0000000007ff6000 0x0000000000100000 0x0000000000141bff 1043799285
image image.2.2 0x0000000000000000 0x00000000035b2000 0x0000000000141c00 0x000000000017ffff 2622514722
image image.3.1 0x0000000000000000 0x000000000524d000 0x0000000000280000 0x00000000002fffff 2486325632
# more /tmp/net.tmp
#

Regards,

Carles

melvyn burnard · ‎05-10-2006

Right, this string:
panic , isr.ior = 0'183405d7.80000000'61befad0

Tells you you have either had an HPMC (Hardware) panic, or a TOC panic.
The TOC could have been done manually, or may have been a Serviceguard induced panic.
You would need to get the dump analysed by HP to be certain.

But my recommendation would be to update Serviceguard and patch it, and then review any hardware logs.
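
For reference, the INDEX fields discussed here can be pulled out with a short Python sketch. It is an illustration only: the function names are made up for this example, the sample text is a subset of the INDEX file posted above, and the "SafetyTimer expired" check simply mirrors the string mentioned earlier in the thread. An empty label, as in this dump, does not settle the cause either way.

```python
def parse_index(text):
    """Map each INDEX keyword to the rest of its line.

    Only the first occurrence of a keyword is kept, so repeated
    'image' lines collapse to the first one.
    """
    fields = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key and key not in fields:
            fields[key] = value.strip()
    return fields

def classify_panic(panic):
    """Rough label for the text that precedes ', isr.ior' in the panic line."""
    label = panic.split(", isr.ior")[0].strip()
    if "SafetyTimer expired" in label:
        return "safety-timer"
    if label:
        return "other"
    return "unlabelled"

# Subset of the INDEX file posted earlier in the thread.
INDEX_TEXT = """\
comment savecrash crash dump INDEX file
version 2
hostname ra
panic , isr.ior = 0'183405d7.80000000'61befad0
dumptime 1147079804 Mon May 8 11:16:44 METDST 2006
savetime 1147080405 Mon May 8 11:26:45 METDST 2006
"""

fields = parse_index(INDEX_TEXT)
panic_kind = classify_panic(fields.get("panic", ""))
dump_epoch = int(fields["dumptime"].split()[0])
save_epoch = int(fields["savetime"].split()[0])
save_delay = save_epoch - dump_epoch  # seconds between the dump and savecrash

print(fields["hostname"], panic_kind, save_delay)
```
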

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Chauhan Amit · ‎05-10-2006

INDEX:

panic , isr.ior = 0'183405d7.80000000'61befad0 =======> (isr.ior)
dumptime 1147079804 Mon May 8 11:16:44 METDST 2006

Network Logs:

Timestamp : Mon May 08 METDST 2006 10:50:38.496645

Syslog:

May 8 11:15:44 isis cmcld: Timed out node ra. It may have failed.
May 8 11:15:44 isis cmcld: Attempting to form a new cluster
May 8 11:15:47 isis cmcld: Obtaining Cluster Lock
May 8 11:15:48 isis cmcld: Turning off safety time protection since the cluster

It's clear from the logs that communication between node2 and node1 broke at "May 8 11:15:44" due to a network error, and node1 panicked at "May 8 11:16:44".
It seems to be an SG TOC only. Just check the network connection. Also send me the following output:
#cmgetconf
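
The timing above can be double-checked from the raw dumptime value in the INDEX file. This is a minimal Python sketch, assuming METDST means UTC+2, that the syslog year is 2006, and that the clocks on both nodes were in sync; the variable names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: METDST (Middle European summer time) is UTC+2.
METDST = timezone(timedelta(hours=2), "METDST")

# dumptime field from the INDEX file on node1 (seconds since the epoch).
dump_epoch = 1147079804
dump_local = datetime.fromtimestamp(dump_epoch, tz=METDST)

# "May 8 11:15:44 isis cmcld: Timed out node ra." from node2's syslog,
# with the year assumed to be 2006.
timeout_local = datetime(2006, 5, 8, 11, 15, 44, tzinfo=METDST)

delta = dump_local - timeout_local
print("dump started :", dump_local.isoformat())
print("node2 timeout:", timeout_local.isoformat())
print("seconds between timeout and dump:", int(delta.total_seconds()))
```

With these assumptions the dump starts 60 seconds after node2 timed out node1. That ordering is consistent with the explanation above, but it does not prove the cause.
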

-Amit

If you are not a part of solution , then you are a part of problem

Carles Viaplana · ‎05-10-2006

Hello again,

A colleague told me he tried to run the package on node2 when he noticed node1 wasn't responding.

Could cmrunpkg command on node2 force node2 restarts (sent the TOC)?

BTW, I attached package configuration file extracted with cmgetconf command.

Regards,

Carles

melvyn burnard · ‎05-10-2006

No, this would not do it.
You really should request a dump analysis from HP Response Centre if you wish to get to the bottom of the crash, but I suspect the answer is it was a Serviceguard TOC, due to some form of network issue. But as you are running an unsupported version of SG, that is all you may get.
If it is NOT a Serviceguard TOC, then they should investigate further for you.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Chauhan Amit · ‎05-10-2006

Hi Carles,

I've gone through the configuration file; everything seems to be OK.

Regarding your query:
Could cmrunpkg command on node2 force node2 restarts (sent the TOC)?

The answer is NO. Although it seems to be an SG TOC, "cmrunpkg" can't initiate it. There can be multiple reasons, ranging from a network issue to improper patching. Dump analysis is the only solution if you need to know the root cause of the SG TOC.

-Amit

If you are not a part of solution , then you are a part of problem

Carles Viaplana · ‎05-10-2006

Thanks for your answers.

I think we have to update SG or move all data and applications to new systems. We wanted to do this within a year, but maybe we will do it sooner.

BTW, do you think I can get some information from the service processor?

Someone from my office suggested it to me, but I don't really know much about this tool.

Thanks again for your help.
Regards,

Carles

Chauhan Amit · ‎05-11-2006

HI Carles,

You can also get information from the Service Processor, but if it's a software issue like an SG TOC, as in our case, then it won't be of much use. However, it can help to determine whether there is any hardware issue with the system that caused the reboot.

To get the Logs from the Service Processor , here is the procedure:
Step 1: Press Ctrl + B at Console
it will take you to GSP Prompt

Step 2: GSP>sl ( Type SL=Service Logs)
Step 3: Type "e" for Error Logs
Step 4: Capture the output and paste for our analysis.

However, you need a specialised tool to decode the hex codes generated in the logs.

-Amit

If you are not a part of solution , then you are a part of problem

Carles Viaplana · ‎05-11-2006

Hello all,

Here is the log entry from before the system restart (note the system restarted at 11h30, more or less):

SYSTEM NAME: lc-ra
DATE: 05/08/2006 TIME: 09:17:39
ALERT LEVEL: 3 = System blocked waiting for operator input

SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 0 = no problem detail

CALLER ACTIVITY: E = system warning STATUS: 0
CALLER SUBACTIVITY: F0 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 01

0xF8E010301100EF00 00000000 0000EF00 type 31 = legacy PA HEX chassis-code
0x58E018301100EF00 00006A04 08091127 type 11 = Timestamp 05/08/2006 09:17:39

That's all I can find. What do you think?
Regards,

Carles

Sheriff Andy · ‎05-11-2006

Carles,

If you are using 2 nodes with 1 package, in the mean time why don't you bring down the 2nd node so you don't see another TOC?

Best of luck,
Andy

Nguyen Anh Tien · ‎05-11-2006

I agree with Chauhan: it's a network issue that caused the SG TOC.
#netfmt -f /var/adm/nettl.LOG000 >/tmp/net.tmp

HP is simple
