HP SG Node Reboot Problem

Hamid Reza
Occasional Visitor

Hi

I have two HP rp5470 systems running in a cold-standby (shared-disk) configuration using Serviceguard (SG). Both hosts are connected to one VA7110 array through two SAN switches (Fibre Channel). When the system status is normal and I restart the primary node, a TOC occurs; a few moments later, while the first node is booting, the second node also restarts. I don't know why. I would appreciate any urgent help.

Regards
Hamid
6 REPLIES
Pedro Cirne
Esteemed Contributor

Re: HP SG Node Reboot Problem

Hi Hamid,

I think you have a problem with your cluster lock disk.

Check the cluster logs and syslog.log, and post any SG-related messages here.

Enjoy :)

Pedro

Re: HP SG Node Reboot Problem

Your problem statement is a little confusing.
If I read correctly, you have 2 nodes running Serviceguard, and all is ok. You then restart the primary node, a TOC occurs.
How do you "restart" the system?

During the boot sequence of the node that you "restarted"/TOC'ed, you say the other node "restarts". Do you mean it TOCs, or does it just reboot?

I would suggest you look VERY closely at the OLDsyslog.log on BOTH nodes to try to see if there are any pointers in there.
I would also suggest you check the patching on both nodes, not forgetting the Serviceguard patch which does NOT normally come with a patch bundle.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Hamid Reza
Occasional Visitor

Re: HP SG Node Reboot Problem

Thanks for your replies

Here is the configuration of my system.
I have found that the TOC completes normally when both SAN switches are powered on. The problem occurs when one of the SAN switches is powered off (which seems very strange to me!).
Mr. Melvyn, I use the "reboot" command to restart the primary node. During the restart I checked the second node, and the TOC completed successfully. But a few moments later, while the first node is booting, the second node restarts abnormally.
When the first node is shut down, everything is OK.

Regards
Hamid
Hamid Reza
Occasional Visitor

Re: HP SG Node Reboot Problem

Sorry, the previous attachment was a large download (662 KB); I have attached a smaller version (23 KB) here.

Regards
Hamid
Bob_Vance
Esteemed Contributor

Re: HP SG Node Reboot Problem

Part of the confusion here is your use of the term "TOC". When we use the 'reboot' command, we do not speak of "TOC". 'reboot' is a graceful shutdown and reboot of the system. "TOC" is not -- the system is immediately reset at the hardware level. SG causes a TOC of one node in possible split-brain scenarios to be sure that running programs cannot do any damage.

When you reboot the first node, the second node loses heartbeat to the first one. The second node then tries to grab the LOCK to see whether he can stay up and avoid a split brain.

If one of the Fibre Channel switches is down, it might very well be the one through which the second node accesses the lock PV, in which case the lock attempt will fail and the node will TOC.

The failure of a node (your reboot of the first node) *and* the failure of an FC switch (being powered off) are *two* failures. SG is designed to protect against a single point of failure (SPOF), but not necessarily more.
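
The arbitration described above can be sketched as a small decision function. This is only a simplified model of the usual quorum rule (more than half of the previous members can re-form the cluster, exactly half need the cluster lock, fewer than half must halt), not Serviceguard's actual code; the function and parameter names are hypothetical.

```python
def node_outcome(reachable_nodes: int, previous_members: int, lock_acquired: bool) -> str:
    """Decide what a surviving group of nodes does after losing heartbeat.

    reachable_nodes  -- nodes in this group, including the deciding node
    previous_members -- size of the cluster before the failure
    lock_acquired    -- True if this group won the race for the cluster lock
    """
    if reachable_nodes * 2 > previous_members:
        # Clear majority: no tie-breaker needed.
        return "stay up"
    if reachable_nodes * 2 == previous_members and lock_acquired:
        # Exactly half: the cluster lock breaks the tie.
        return "stay up"
    # Minority, or a tie that could not be broken: reset to avoid split brain.
    return "TOC"


# Two-node cluster, peer rebooted, lock PV reachable.
print(node_outcome(1, 2, lock_acquired=True))   # stay up
# Same situation, but the only path to the lock PV is through the dead switch.
print(node_outcome(1, 2, lock_acquired=False))  # TOC
```

In a two-node cluster the survivor always holds exactly half of the membership, so the result depends entirely on whether the lock can be reached.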


So, don't reboot while one of the FC switches is off!!


hth
bv
"The lyf so short, the craft so long to lerne." - Chaucer
Sudeesh
Respected Contributor

Re: HP SG Node Reboot Problem

Hi Hamid,
I suspect your issue is related to cluster lock disk.

MC/ServiceGuard does not utilize the LVM layer to get to the cluster lock disk. Hence, it does not avail itself of the PVLink capability of LVM.

If the path to the cluster lock disk specified in the Cluster ASCII file is lost, and another failure occurs requiring a race to the cluster lock disk, this server will be forced to reboot.

When you switched off one of the SAN switches, the system probably lost its connection to the cluster lock disk. Then, when you rebooted the first node, the second node could not get the cluster lock because the lock disk was not accessible. This forces the second node to reboot as well.
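
To illustrate why the data disks can stay reachable while the lock disk does not, here is a rough sketch of the difference between multipathed LVM access and the single lock path. The device files, switch names, and helper functions are all hypothetical and are not taken from this cluster's configuration.

```python
# Hypothetical layout: each device path runs through exactly one FC switch.
PATH_TO_SWITCH = {
    "/dev/dsk/c4t0d0": "switch_A",  # path named as the lock PV in the cluster ASCII file
    "/dev/dsk/c6t0d0": "switch_B",  # alternate path (PVLink) to the same LUN
}


def path_up(path, switches_off):
    """A path works only if the switch it runs through is powered on."""
    return PATH_TO_SWITCH[path] not in switches_off


def lvm_can_reach(paths, switches_off):
    """LVM with PVLinks: any one surviving path is enough."""
    return any(path_up(p, switches_off) for p in paths)


def lock_can_be_reached(lock_path, switches_off):
    """Cluster lock: only the single configured path is tried."""
    return path_up(lock_path, switches_off)


off = {"switch_A"}
print(lvm_can_reach(list(PATH_TO_SWITCH), off))     # True: data still reachable via switch_B
print(lock_can_be_reached("/dev/dsk/c4t0d0", off))  # False: lock path is gone
```

If this model matches your setup, powering off the other switch instead should leave the lock reachable, which is one way to confirm the diagnosis.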

Check your syslog.log file for any errors, which may help us to confirm the root cause.
Look for errors like 'Cluster lock disk /dev/dsk/cxtxdx has failed: I/O error'.
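
If the logs are large, a small filter can pull out the relevant lines. This is a minimal sketch that assumes the message has the form quoted above; the exact wording in your syslog.log may differ, and the function name is hypothetical.

```python
import re

# Pattern built from the message quoted above; the device file is captured.
LOCK_ERROR = re.compile(r"Cluster lock disk\s+(\S+)\s+has failed", re.IGNORECASE)


def find_lock_errors(lines):
    """Return (device, line) pairs for every cluster lock failure message."""
    hits = []
    for line in lines:
        match = LOCK_ERROR.search(line)
        if match:
            hits.append((match.group(1), line.rstrip("\n")))
    return hits
```

It accepts any iterable of lines, so you can pass it an open file object for a copy of syslog.log or OLDsyslog.log from each node and compare the results.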

Sudeesh
The most predictable thing in life is its unpredictability