Operating System - HP-UX

Re: Network Failure caused reboot?

 
Greg OBarr
Regular Advisor

Network Failure caused reboot?

I am running a 2 node cluster of L2000s, HP-UX 11.00, ServiceGuard 11.09.

The network guys were hot-inserting a card into the switch this morning and caused the switch to reboot, which temporarily interrupted communication between the primary and takeover nodes. The first thing I notice is that the heartbeat may be running through the network rather than over the crossover cable connecting the two systems. But more importantly, the takeover node rebooted when this happened. Below are the entries from the syslog file. I can find no other information on why it rebooted. Any ideas?



May 16 11:37:18 cadb02a cmcld: Communication to node cadb01a has been interrupted
May 16 11:37:18 cadb02a cmcld: Node cadb01a may have died
May 16 11:37:18 cadb02a cmcld: Attempting to form a new cluster
May 16 11:37:29 cadb02a cmcld: Obtaining Cluster Lock
May 16 11:37:29 cadb02a vmunix: SCSI: Reset requested from above -- lbolt: 53062089, bus: 4
May 16 11:37:30 cadb02a cmcld: Cluster lock was denied. Lock was obtained by another node.
May 16 11:37:30 cadb02a vmunix: SCSI: Resetting SCSI -- lbolt: 53062189, bus: 4
May 16 11:37:30 cadb02a vmunix: SCSI: Reset detected -- lbolt: 53062189, bus: 4
May 16 11:37:34 cadb02a vmunix: NFS server cadb03a not responding still trying
May 16 11:37:30 cadb02a cmcld: Attempting to form a new cluster
May 16 11:37:41 cadb02a cmcld: Cluster lock has been denied
May 16 11:37:41 cadb02a cmcld: Attempting to form a new cluster
7 REPLIES
Greg OBarr
Regular Advisor

Re: Network Failure caused reboot?

I should have mentioned that the syslog above is from cadb02a, the takeover node. Cadb01a is the primary node in the cluster and was not affected by whatever happened this morning.
A. Clay Stephenson
Acclaimed Contributor

Re: Network Failure caused reboot?

Without knowing more it's difficult to say, but it certainly appears that your network is not robust (redundant) enough. You should have been able to handle this with nothing more than a minor hiccup; MC/SG should have simply spun up another network connection and that should have been the end of it. I assume that you have a second switch, because the complete failure (or reboot) of a network switch should be considered a routine event - to be handled automatically. It sounds as though what really happened is that when communications were lost, the node did a TOC (Transfer of Control).
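For illustration only, here is a minimal sketch of what the network section of a cluster ASCII file can look like when each node has a redundant LAN; the interface names and addresses are made up, so substitute whatever lanscan shows on your boxes:

NODE_NAME            cadb02a
  NETWORK_INTERFACE  lan0
  HEARTBEAT_IP       192.168.10.2    # primary heartbeat/data subnet (placeholder address)
  NETWORK_INTERFACE  lan1            # standby interface, no IP - local LAN failover target
  NETWORK_INTERFACE  lan2
  HEARTBEAT_IP       10.10.10.2      # second heartbeat subnet, e.g. the crossover link (placeholder)

With a standby interface in place, losing one switch should show up as a local LAN failover message in syslog rather than a cluster reformation.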
If it ain't broke, I can fix that.
Steven E. Protter
Exalted Contributor

Re: Network Failure caused reboot?

Preface.

I don't know a darned thing about ServiceGuard.

But..

May 16 11:37:30 cadb02a vmunix: SCSI: Resetting SCSI -- lbolt: 53062189, bus: 4
May 16 11:37:30 cadb02a vmunix: SCSI: Reset detected -- lbolt: 53062189, bus: 4


Looks like a common hardware problem.

I've seen this kind of thing triggered by power failures on our switch with one of our older D-class systems.

We ended up figuring out that the NIC card was bad and needed to be replaced.

Perhaps it's time to do a normal hardware investigation on that second card; a couple of quick checks are sketched below.
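Not gospel, but this is the sort of link-level check I'd start with; the PPA number and MAC address below are placeholders you would pull from lanscan output on each box:

# list interfaces with their PPA numbers and station (MAC) addresses
lanscan -v

# link-level loopback test to the peer's NIC, bypassing IP entirely
# (PPA 1 and the MAC address are placeholders)
linkloop -i 1 0x00306E123456

If linkloop fails while the cable and switch port check out, the card itself becomes a much stronger suspect.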

Also, overall, it seems your network configuration isn't all that strong.

We actually plan to bring in ServiceGuard after our training budget is unfrozen. We are going to have a second switch in our HP-9000 rack so that we can have a redundant connection between our machines regardless of whether the core switch is up or down.

Just some things to think about.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
John Poff
Honored Contributor

Re: Network Failure caused reboot?

Hi,

Perfectly normal behavior for an MC/SG cluster. The nodes lost communication with each other, so the cluster reformed. Since the nodes couldn't communicate, they both tried to lock the cluster lock disk. The first one succeeded and reformed the cluster. The node that lost did a TOC so that all its resources and packages would be sure to be free for the other node as needed.

I'd suggest a couple of things. First, slap your network guys for crashing the switch. Next, make sure your heartbeat is configured properly so that the nodes can still see each other in case of a total LAN failure. Are you running a separate LAN just for the heartbeat? I like to do that, just using the built-in LAN cards plugged into a cheap hub. That way they have a fighting chance of seeing each other despite any nonsense that might be happening on the LAN.
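For what it's worth, a rough way to check what the heartbeat is actually riding on today; the cluster ASCII file name below is a guess, so use whatever yours is called under /etc/cmcluster:

# show cluster, node and network status, including which LANs
# ServiceGuard is monitoring and using for the heartbeat
cmviewcl -v

# compare against the interfaces and subnets the hosts actually have
lanscan
netstat -in

# and against what the cluster was configured with (file name is a guess)
egrep 'NETWORK_INTERFACE|HEARTBEAT_IP' /etc/cmcluster/cluster.ascii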

JP
Greg OBarr
Regular Advisor

Re: Network Failure caused reboot?

I agree that it looks like the node tried to do a TOC. Since the primary node was not affected, it still had a lock on the arrays and LVM couldn't get control of them. That's a good thing, of course, because production was not affected other than the temporary network loss (looks good for me :)
But now I see that the array that serves the takeover node is rebuilding. I'm confused. Could the SCSI lbolt errors have been telling me there was a disk going bad in the array, or did I get the SCSI lbolt errors because the node was trying to take over the PROD arrays and could not get a lock on them? Is there any way to tell, based on the SCSI lbolt message, which controller and disk had the problem?
A. Clay Stephenson
Acclaimed Contributor

Re: Network Failure caused reboot?

The lbolt values are simply the number of clock ticks since the last boot and are thus of little value. If you see "device numbers" (maybe in syslog as well) then you have something. A fairly good explanation of decoding them can be found here:

http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0xf84063f96280d711abdc0090277a778c,00.html
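As a sanity check on that, and assuming the standard HP-UX tick rate of 100 per second, the lbolt in the reset message is nothing more than elapsed uptime:

53062189 ticks / 100 ticks per second = about 530,622 seconds
530,622 seconds / 86,400 seconds per day = about 6.1 days since the last boot

So it tells you when the reset happened, not which device it happened to.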

------------------------------

Now, having said all this - it almost certainly ain't disks. Your cluster completely lost network connectivity and tried to reform. As soon as one box locked the disk, the only safe play for the remaining node was a TOC. It done good. Fix your network. When done correctly, you should be able to yank wires and not break a sweat. After you get your network robust, you need to ask yourself "Now what would happen if I yanked this here SCSI cable (or disk, or power cord)?" This is all MC/SG 101 stuff.

If it ain't broke, I can fix that.
Greg OBarr
Regular Advisor

Re: Network Failure caused reboot?

Thanks Clay. Agreed, I need to ask those "what would happen if..." questions, and I have. The problem comes when what actually happens differs from what I expected, either because some little thing changed in the many months since I was last able to test, or because my test window was so short that I never got to that particular scenario, and that turns out to be the one that occurs. I rarely ever get a chance to shut these systems down for any decent amount of time.