Serviceguard

SGLX problem on SLES10, when one node reboot

SOLVED
ieeezhang
Occasional Visitor

I have two hosts running Serviceguard on SLES10 SP1. The lock LUN is on /dev/sdc2.

The NFS service on the cluster can be switched to the backup host manually or with the cmhaltnode command.
But when the serving node reboots, both hosts go down. The cluster status and error logs are in the attachment.

I tried SGLX A.11.16 on RHEL4 before and had no such problem. Can anyone help explain why it occurs with SGLX A.11.18 on SLES10 SP1? Thanks.
4 REPLIES
Colin Topliss
Esteemed Contributor

Re: SGLX problem on SLES10, when one node reboot

There can be a number of reasons - I had so much trouble with using lock LUNs that I abandoned the whole idea and switched to using quorum servers instead. In fact that worked so well we have begun to switch our HP nodes to using quorum servers too!

So, has this ever worked? Is this the first time you've rebooted the nodes since you put SG on there? Have you presented more LUNs to this system since the day you installed it? If you have, have you set up persistent binding on the LUNs?

ieeezhang
Occasional Visitor

Re: SGLX problem on SLES10, when one node reboot

I've tried rebooting several times. Sometimes it works: the backup node gets the lock LUN and the serving node goes down. But most of the time it fails.
I have several LUNs mapped to the hosts. I've tried other partitions as the lock LUN, and the problem still exists.
The same LUN gets the same device name on both hosts every time they boot, so I suppose it's not a persistent binding problem.

Is it possible that when the serving node reboots, its Ethernet interface goes down first, so the heartbeat is lost? Then the two nodes begin to contend for the lock LUN, and sometimes the serving node gets the lock LUN before it has really shut down.
If so, is there any configuration setting that postpones node-failure detection when the heartbeat is lost, or any parameter that lets the backup node try more times to obtain the lock LUN? Thanks.
Colin Topliss
Esteemed Contributor
Solution

Re: SGLX problem on SLES10, when one node reboot

Hi,

Yes - look in your cmclconf.ascii file for the line containing the phrase 'Cluster Timing Parameters'. This is the section that deals with the HEARTBEAT_INTERVAL and NODE_TIMEOUT parameters.

This file sits in $SGCONF (which in our case is /opt/cmcluster/conf).
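
To see what those two parameters are currently set to before changing anything, a small script along these lines might help. It is only a sketch: the function name read_timing_parameters and the SAMPLE text are made up for illustration, the numbers are not taken from any real cluster, and the conversion to seconds assumes the values are in microseconds, which should be checked against the comments in your own file.

```python
def read_timing_parameters(text):
    """Return HEARTBEAT_INTERVAL and NODE_TIMEOUT values found in the text."""
    wanted = {"HEARTBEAT_INTERVAL", "NODE_TIMEOUT"}
    found = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        # Keep only "NAME <integer>" lines for the two parameters of interest.
        if len(parts) == 2 and parts[0] in wanted and parts[1].isdigit():
            found[parts[0]] = int(parts[1])
    return found


# Illustrative excerpt only; these values are invented for the example.
SAMPLE = """\
# Cluster Timing Parameters
HEARTBEAT_INTERVAL 1000000
NODE_TIMEOUT 8000000
"""

params = read_timing_parameters(SAMPLE)
for name, value in sorted(params.items()):
    # Assumption: values are microseconds; verify in your own file.
    print(f"{name} = {value} ({value / 1_000_000:.1f} s if microseconds)")
```

To run it against the real file, read the contents of cmclconf.ascii under $SGCONF and pass them in place of SAMPLE. After editing the file, the change is normally verified with cmcheckconf -C and applied with cmapplyconf -C; check the manual for your release for the exact procedure.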

Colin.
ieeezhang
Occasional Visitor

Re: SGLX problem on SLES10, when one node reboot

Thanks for your advice. I've changed the NODE_TIMEOUT parameter in that file to a larger number. It really works.