<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Help determining cause of reboot in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814064#M827369</link>
    <description>Thanks Tom, that was what I needed to know.&lt;BR /&gt;&lt;BR /&gt;I had forgotten that SG will reboot if it doesn't get the lock.&lt;BR /&gt;&lt;BR /&gt;As a side question, is there a way to give one node priority on the lock over the other?  This company would prefer that one of the two machines be the primary node virtually all the time.  And ALL of the failovers that they have had were the result of network problems.  So the primary server was always working, and always available to run the package.&lt;BR /&gt;&lt;BR /&gt;But it seems that on every failover the alternate machine gets the lock first and we end up halting the package on that node and bringing it back up on the primary machine.&lt;BR /&gt;&lt;BR /&gt;It would be nice if we could set some type of priority to give the primary the first shot at the lock.  Say a 10-second delay on the alternate or something like that.&lt;BR /&gt;</description>
    <pubDate>Thu, 26 Sep 2002 13:02:40 GMT</pubDate>
    <dc:creator>Sean OB_1</dc:creator>
    <dc:date>2002-09-26T13:02:40Z</dc:date>
    <item>
      <title>Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814060#M827365</link>
      <description>Hello.&lt;BR /&gt;&lt;BR /&gt;One of our client sites had some power outages yesterday, approximately 4 in 5 minutes.  The servers are all supposed to be on UPS-supplied circuits.&lt;BR /&gt;&lt;BR /&gt;After the outages the servers did remain running.  However, about 1 minute after the last noticeable outage one of the servers rebooted.&lt;BR /&gt;&lt;BR /&gt;This server is part of a ServiceGuard two-node cluster.  It was the primary node at the time of the outage.&lt;BR /&gt;&lt;BR /&gt;Can someone take a look at the log file entries below and help me determine why the machine rebooted?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Sep 25 14:40:12 cosmo0 : su : + tmc voecksl-root&lt;BR /&gt;Sep 25 16:39:16 cosmo0 telnetd[20105]: getpid: peer died: Connection timed out&lt;BR /&gt;Sep 25 16:40:12 cosmo0 telnetd[20163]: getpid: peer died: Connection timed out&lt;BR /&gt;Sep 25 16:40:12 cosmo0 telnetd[20164]: getpid: peer died: Connection timed out&lt;BR /&gt;Sep 25 16:40:12 cosmo0 telnetd[20165]: getpid: peer died: Connection timed out&lt;BR /&gt;Sep 25 16:40:29 cosmo0 vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/2/0/0/5/0....&lt;BR /&gt;Sep 25 16:40:29 cosmo0 vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/5/0/0/5/0....&lt;BR /&gt;Sep 25 16:40:29 cosmo0 cmcld: lan2 failed&lt;BR /&gt;Sep 25 16:40:29 cosmo0 cmcld: Subnet 148.8.70.0 switched from lan2 to lan3&lt;BR /&gt;Sep 25 16:40:29 cosmo0 cmcld: lan2 switched to lan3&lt;BR /&gt;Sep 25 16:40:29 cosmo0 cmcld: lan6 failed&lt;BR /&gt;Sep 25 16:40:29 cosmo0 cmcld: Package unidata cannot run on this node because switching has been disabled for this node.&lt;BR /&gt;Sep 25 16:40:31 cosmo0 vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/2/0/0/6/0....&lt;BR /&gt;Sep 25 16:40:31 cosmo0 cmcld: lan3 failed&lt;BR /&gt;Sep 25 16:40:31 cosmo0 cmcld: Subnet 148.8.70.0 down&lt;BR /&gt;Sep 25 16:41:39 cosmo0 cmcld: Timed out node cosmo1. It may have failed.&lt;BR /&gt;Sep 25 16:41:39 cosmo0 cmcld: Attempting to form a new cluster&lt;BR /&gt;Sep 25 16:45:01 cosmo0 cmcld: lan2 recovered&lt;BR /&gt;Sep 25 16:45:01 cosmo0 cmcld: Subnet 148.8.70.0 switched from lan3 to lan2&lt;BR /&gt;Sep 25 16:45:01 cosmo0 cmcld: lan3 switched to lan2&lt;BR /&gt;Sep 25 16:45:01 cosmo0 cmcld: Subnet 148.8.70.0 up&lt;BR /&gt;Sep 25 16:45:01 cosmo0 cmcld: Package unidata cannot run on this node because switching has been disabled for this node.&lt;BR /&gt;Sep 25 16:45:03 cosmo0 cmcld: lan6 recovered&lt;BR /&gt;Sep 25 16:46:41 cosmo0 cmcld: Obtaining Cluster Lock&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Cluster lock was denied. Lock was obtained by another node.&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Attempting to form a new cluster&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Daemon exiting due to halt message from node cosmo1&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Halting cosmo0 to preserve data integrity&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Reason: Impossibly long daemon hang detected&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: cl_abort: abort cl_kepd_printf failed: Invalid argument&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Aborting! Impossibly long daemon hang detected (file: utils.c, line: 155)&lt;BR /&gt;Sep 25 16:46:46 cosmo0 cmclconfd[2596]: The ServiceGuard daemon, /usr/lbin/cmcld[2597], died upon receiving the signal 6.&lt;BR /&gt;Sep 25 16:46:53 cosmo0 vmunix:&lt;BR /&gt;Sep 25 16:46:53 cosmo0 vmunix: sync'ing disks (15 buffers to flush): 15 4 1&lt;BR /&gt;Sep 25 16:46:53 cosmo0 vmunix: 0 buffers not flushed&lt;BR /&gt;Sep 25 16:46:53 cosmo0 vmunix: 0 buffers still dirty&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 26 Sep 2002 12:50:36 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814060#M827365</guid>
      <dc:creator>Sean OB_1</dc:creator>
      <dc:date>2002-09-26T12:50:36Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814061#M827366</link>
      <description>Sounds like the switch or hub the LAN cards are attached to lost power, resulting in a loss of LAN connectivity.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 26 Sep 2002 12:54:22 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814061#M827366</guid>
      <dc:creator>Tom Danzig</dc:creator>
      <dc:date>2002-09-26T12:54:22Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814062#M827367</link>
      <description>I should have added that in a two-node cluster, if connectivity between the nodes stops while they are both up, whichever node gets the cluster lock VG will stay up.  The other node will panic and reboot.&lt;BR /&gt;&lt;BR /&gt;Sounds like that's what happened here.  This node lost the race to the lock VG.</description>
      <pubDate>Thu, 26 Sep 2002 12:57:12 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814062#M827367</guid>
      <dc:creator>Tom Danzig</dc:creator>
      <dc:date>2002-09-26T12:57:12Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814063#M827368</link>
      <description>Sorry, I forgot to add the following.&lt;BR /&gt;&lt;BR /&gt;The datacenter and main switches are UPS powered.  However, the switches in the closets throughout the campus are not.&lt;BR /&gt;&lt;BR /&gt;So when we lost power, all of the external switches rebooted and tried to re-establish connectivity to the main bridge switches.&lt;BR /&gt;&lt;BR /&gt;The way they have things set up, if there are successive outages like this in a short period, the main bridges get overloaded and fail, requiring a reboot of them.&lt;BR /&gt;&lt;BR /&gt;So while the center is UPS'd, this type of failure does cause the servers to lose their LAN while the main bridges are rebooting.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Would ServiceGuard for any reason reboot the server when it sees a LAN failure?&lt;BR /&gt;&lt;BR /&gt;TIA,&lt;BR /&gt;&lt;BR /&gt;Sean</description>
      <pubDate>Thu, 26 Sep 2002 12:58:18 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814063#M827368</guid>
      <dc:creator>Sean OB_1</dc:creator>
      <dc:date>2002-09-26T12:58:18Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814064#M827369</link>
      <description>Thanks Tom, that was what I needed to know.&lt;BR /&gt;&lt;BR /&gt;I had forgotten that SG will reboot if it doesn't get the lock.&lt;BR /&gt;&lt;BR /&gt;As a side question, is there a way to give one node priority on the lock over the other?  This company would prefer that one of the two machines be the primary node virtually all the time.  And ALL of the failovers that they have had were the result of network problems.  So the primary server was always working, and always available to run the package.&lt;BR /&gt;&lt;BR /&gt;But it seems that on every failover the alternate machine gets the lock first and we end up halting the package on that node and bringing it back up on the primary machine.&lt;BR /&gt;&lt;BR /&gt;It would be nice if we could set some type of priority to give the primary the first shot at the lock.  Say a 10-second delay on the alternate or something like that.&lt;BR /&gt;</description>
      <pubDate>Thu, 26 Sep 2002 13:02:40 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814064#M827369</guid>
      <dc:creator>Sean OB_1</dc:creator>
      <dc:date>2002-09-26T13:02:40Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814065#M827370</link>
      <description>Hi Sean,&lt;BR /&gt;&lt;BR /&gt;Yes. MC/ServiceGuard TOCs the node that does not hold the cluster lock but has its volume groups activated during the cluster reformation. Your cosmo0 lost the cluster lock to cosmo1 during the outage.&lt;BR /&gt;&lt;BR /&gt;Go through the messages and it will become crystal clear:&lt;BR /&gt;&lt;BR /&gt;Sep 25 16:46:41 cosmo0 cmcld: Obtaining Cluster Lock&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Cluster lock was denied. Lock was obtained by another node.&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Attempting to form a new cluster&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Daemon exiting due to halt message from node cosmo1&lt;BR /&gt;Sep 25 16:46:42 cosmo0 cmcld: Halting cosmo0 to preserve data integrity&lt;BR /&gt;&lt;BR /&gt;-Sri</description>
      <pubDate>Thu, 26 Sep 2002 13:08:13 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814065#M827370</guid>
      <dc:creator>Sridhar Bhaskarla</dc:creator>
      <dc:date>2002-09-26T13:08:13Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814066#M827371</link>
      <description>Sean,&lt;BR /&gt;&lt;BR /&gt;Look at NODE_TIMEOUT and NETWORK_POLLING_INTERVAL in the cluster's ASCII configuration file. The first determines how long to wait before reforming the cluster when the other node has timed out; the second sets how often the network is polled to decide when to call it a network outage, and is particularly helpful for local LAN failovers.&lt;BR /&gt;&lt;BR /&gt;You can increase these values. My settings are 12 secs for both.&lt;BR /&gt;&lt;BR /&gt;-Sri</description>
      <pubDate>Thu, 26 Sep 2002 13:14:06 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814066#M827371</guid>
      <dc:creator>Sridhar Bhaskarla</dc:creator>
      <dc:date>2002-09-26T13:14:06Z</dc:date>
    </item>
    <item>
      <title>Re: Help determining cause of reboot</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814067#M827372</link>
      <description>Sean,&lt;BR /&gt;&lt;BR /&gt;NO!  There is no way to force one node to have any advantage!  I was a bit peeved about this myself when I brought it up in the MC/SG class I attended about 2 months ago.  Seems like HP could put some delay mechanism in place to give one node an advantage.  Alas, there is nothing you can do (at least that's what my instructor said).</description>
      <pubDate>Thu, 26 Sep 2002 13:16:29 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/help-determining-cause-of-reboot/m-p/2814067#M827372</guid>
      <dc:creator>Tom Danzig</dc:creator>
      <dc:date>2002-09-26T13:16:29Z</dc:date>
    </item>
  </channel>
</rss>

