Server Reboot

Duncan Edmonstone · 06-07-2011

Yes, it looks like a complete network failure between the 2 nodes in the cluster. This should never be able to happen unless the aggregates for both lan900 and lan901 run through the same networking kit. So the first port of call is to talk to your network team and ask them why all their network switches failed at the same time...

After the network failed, the remote node (bilprddb) was ejected from the cluster following a race for the cluster lock disk. This is normal cluster behaviour when 2 nodes in a cluster cannot communicate over any LAN interface.

bilprdci formed a one-node cluster and attempted to start the dbPRD package, which failed (reason unknown: you would need to look at the package log for this, but most likely it was due to the complete network failure).

Later bilprddb rejoined the cluster, and someone manually stopped and started ciPRD on bilprdci.

So my advice here is:

1. Review your cluster package logs as well, as they may throw more light on the nature of the failure(s) here.

2. You need a ground-up review of the network design within this cluster. A good cluster design should never be able to lose all network links at the same time.

3. There are lots of nasty NFS issues in here too, no doubt caused by the network outage. However, you should check that you are following NFS best practice for use in a cluster.

4. You need to check your name resolution order in /etc/nsswitch.conf. In a cluster you really need to have name resolution handled first by files and only then by DNS, and you need to make sure all the interfaces are consistently named in /etc/hosts on both cluster nodes.
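As a rough illustration of point 4, here is a small Python sketch that checks whether "files" comes before "dns" on the hosts line of an nsswitch.conf. The function names are made up for this example, and the parsing is a simplification that only assumes the usual "hosts: source [criteria] source" layout, so treat it as a starting point rather than a validated tool.

```python
def hosts_sources(nsswitch_text):
    """Return the ordered lookup sources on the 'hosts:' line, or [] if absent.

    Bracketed criteria such as [NOTFOUND=continue] are skipped.
    (Hypothetical helper for illustration only.)
    """
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line.startswith("hosts:"):
            continue
        sources = []
        in_criteria = False
        for token in line[len("hosts:"):].split():
            if token.startswith("["):
                in_criteria = True
            if not in_criteria:
                sources.append(token)
            if token.endswith("]"):
                in_criteria = False
        return sources
    return []


def files_before_dns(nsswitch_text):
    """True if 'files' is consulted, and consulted before 'dns' when dns is present."""
    sources = hosts_sources(nsswitch_text)
    if "files" not in sources:
        return False
    if "dns" not in sources:
        return True
    return sources.index("files") < sources.index("dns")


if __name__ == "__main__":
    with open("/etc/nsswitch.conf") as fh:
        print("files before dns:", files_before_dns(fh.read()))
```

Run it on both cluster nodes and compare the results. It does not check that /etc/hosts is consistent between the nodes, which still needs a separate comparison.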

HTH

Duncan

I am an HPE Employee

kunjuttan · 06-08-2011

Thank you all for the support. It's a 2-node cluster. One more thing: if the HB LAN fails, is it natural for the other node to get rebooted? Here the HB LAN failed and my primary node got rebooted. Is that expected when the HB LAN fails? Or, even if the HB LAN fails, should it only switch over the packages and leave the server intact?

Duncan Edmonstone · 06-08-2011

In a 2-node cluster, if all the heartbeat LANs between the 2 nodes fail, then one of the nodes is going to get rebooted. This is to ensure that your data is not corrupted.

If neither node can talk to the other, how do they know whether the other node is running one of the packages in the cluster or not? They can't, so what happens is that they both try to obtain the cluster lock, and the node that "loses" the race for the cluster lock reboots itself. It could just as easily have been the other node that lost the race for the cluster lock...
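To make the tie-break idea concrete, here is a toy Python model of the race described above. It is only a sketch of the principle (one lock, one winner, the loser resets), not how the cluster software actually implements it; the function name and the use of a random shuffle to stand in for "who reaches the lock disk first" are assumptions made for illustration.

```python
import random


def resolve_split_brain(nodes, rng=None):
    """Toy model: isolated nodes race for a single cluster lock.

    Returns (winner, losers). The winner carries on as the cluster;
    each loser would reboot itself so it cannot touch shared data.
    """
    rng = rng or random.Random()
    contenders = list(nodes)
    rng.shuffle(contenders)   # arrival order at the lock is not predictable
    winner = contenders[0]    # first to grab the lock wins
    losers = sorted(n for n in nodes if n != winner)
    return winner, losers


if __name__ == "__main__":
    winner, losers = resolve_split_brain(["bilprdci", "bilprddb"])
    print(winner, "keeps running; rebooting:", ", ".join(losers))
```

Either node can come out as the winner from one run to the next, which is the point: with no heartbeat there is no "primary" that is guaranteed to survive.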

HTH

Duncan

I am an HPE Employee

kunjuttan · 06-08-2011

Thanks Duncan. That is what I was looking for.
