SGLX problem on SLES10 when one node reboots

ieeezhang · ‎05-17-2008

I have two hosts running Serviceguard on SLES10 SP1. The lock LUN is on the disk partition /dev/sdc2.

The NFS service can be switched to the backup host manually or with the cmhaltnode command.
But when the serving node reboots, both hosts go down. The cluster status and error logs are in the attachment.

I tried SGLX A.11.16 on RHEL4 before and had no such problem. Can anyone help explain why it occurs with SGLX A.11.18 on SLES10 SP1? Thanks.

Colin Topliss · ‎05-17-2008

There can be a number of reasons - I had so much trouble with using lock LUNs that I abandoned the whole idea and switched to using quorum servers instead. In fact that worked so well we have begun to switch our HP nodes to using quorum servers too!

So, has this ever worked? Is this the first time you've rebooted the nodes since you put SG on there? Have you presented more LUNs to this system since the day you installed it? If you have, have you set up persistent binding on the LUNs?

ieeezhang · ‎05-17-2008

I've tried rebooting several times. Sometimes it works: the backup node gets the lock LUN and the serving node goes down. But most of the time it fails.
I have several LUNs mapped to the hosts. I've tried other partitions as the lock LUN, and the problem still exists.
The same LUN gets the same device name on both hosts every time they boot, so I suppose it's not a persistent binding problem.

Is it possible that when the serving node reboots, the Ethernet interface goes down first, so the heartbeat is lost, and then the two nodes contend for the lock LUN? Sometimes the serving node may get the lock LUN before it has really shut down.
If so, is there any configuration setting that postpones node failure detection when the heartbeat is lost, or any parameter that lets the backup node try to obtain the lock LUN more times? Thanks.

Colin Topliss · ‎05-18-2008

Hi,

Yes - look in your cmclconf.ascii file for the line containing the phrase 'Cluster Timing Parameters'. This is the section that deals with the HEARTBEAT_INTERVAL and NODE_TIMEOUT parameters.

This file sits in $SGCONF (which in our case is /opt/cmcluster/conf).

Colin.

ieeezhang · ‎05-18-2008

Thanks for your advice. I've changed the NODE_TIMEOUT parameter in that file to a larger number, and it really works.
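
For anyone finding this thread later, the change can be sketched roughly as follows. The values are illustrative assumptions only (the thread does not say which number was used), and the units are assumed to be microseconds, so check them against the comments in your own cmclconf.ascii before copying anything.

```
# Excerpt from $SGCONF/cmclconf.ascii, "Cluster Timing Parameters" section.
# Illustrative values only; units assumed to be microseconds.

# Time between heartbeat messages (here 1 second).
HEARTBEAT_INTERVAL      1000000

# Time without heartbeats before a node is declared failed (here 8 seconds).
NODE_TIMEOUT            8000000
```

After editing, the file would normally be validated with cmcheckconf -C and distributed with cmapplyconf -C; depending on the release, the cluster may have to be halted first. A larger NODE_TIMEOUT presumably gives the rebooting node more time to finish shutting down before the two nodes start contending for the lock LUN, at the cost of slower failover when a node really fails.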
