
SOLVED
kunjuttan
Super Advisor

Server Reboot

Hi,

I have a 2-node active-passive cluster. Last Sunday my primary node rebooted. How can I find the reason for the reboot of the node?

Server Model - Superdome 2
OS - HP-UX 11.31
13 REPLIES
Torsten.
Acclaimed Contributor

Re: Server Reboot

I would first take a look at the shutdownlog, the OLDsyslog and the cluster log.
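For example, the usual HP-UX locations can be checked like this. The sample entries below are only illustrative (real shutdownlog formats vary by release); the same grep works on the real /var/adm/shutdownlog:

```shell
# Where to look after an unexplained reboot (usual HP-UX defaults;
# verify the paths on your own system):
#   /var/adm/shutdownlog           - records shutdowns, reboots and panics
#   /var/adm/syslog/syslog.log     - current syslog
#   /var/adm/syslog/OLDsyslog.log  - syslog from before the last reboot
#   /etc/cmcluster/                - Serviceguard cluster and package logs
#
# Illustrative shutdownlog entries (format is an assumption, real
# entries differ):
cat > /tmp/shutdownlog.sample <<'EOF'
12:01  Fri Jun 05 2015.  Reboot:  (by bilprdci!root)
14:55  Sun Jun 07 2015.  Reboot after panic: SafetyTimer expired
EOF

# A panic entry here usually means the node was taken down by the
# cluster software rather than a clean administrative reboot:
grep -i 'panic' /tmp/shutdownlog.sample
```

On the real system you would run the grep against /var/adm/shutdownlog instead of the sample file.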

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
kunjuttan
Super Advisor

Re: Server Reboot

Thanks for the update.

But may I know the location of these logs, and how to find the exact details in them?
Dennis Handly
Acclaimed Contributor

Re: Server Reboot

>may I know the location

/var/adm/syslog/
/var/adm/shutdownlog
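Note that after a reboot the messages from just before the crash are in OLDsyslog.log, not in the current syslog.log. A quick way to pull out the Serviceguard daemon messages (the sample lines below are illustrative; message text varies by Serviceguard version):

```shell
# Illustrative OLDsyslog.log lines (assumption: wording differs
# between Serviceguard releases):
cat > /tmp/OLDsyslog.sample <<'EOF'
Jun  5 15:18:39 bilprdci cmcld[3803]: Member bilprddb seems unhealthy, not receiving heartbeats from it.
Jun  5 15:19:02 bilprdci cmcld[3803]: Attempting to form a new cluster
EOF

# Filter for the Serviceguard cluster daemon (cmcld) and network
# daemon (cmnetd) messages leading up to the reboot:
grep -E 'cmcld|cmnetd' /tmp/OLDsyslog.sample
```

Against the real file you would use /var/adm/syslog/OLDsyslog.log as the argument.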
Torsten.
Acclaimed Contributor

Re: Server Reboot

and /etc/cmcluster

Hope this helps!
Regards
Torsten.

kunjuttan
Super Advisor

Re: Server Reboot

I have a doubt. The syslog shows something like "heartbeat connection lost". But would that alone cause the server to reboot?
Torsten.
Acclaimed Contributor

Re: Server Reboot

This could be a reason.

Hope this helps!
Regards
Torsten.


Re: Server Reboot

>> In syslog its showing like heartbeat connection lost.

Why not actually post the message(s) from syslog rather than something "like" what it says - then we can give you a better answer...

HTH

Duncan

I am an HPE Employee
Accept or Kudo
kunjuttan
Super Advisor

Re: Server Reboot

Hi,

Please find the attached syslog output. Also, I want to know: in the normal case, if a heartbeat LAN fails, the package will move to the other node. But what happens to the node the package moved away from? Will that node be rebooted?
g3jza
Esteemed Contributor

Re: Server Reboot

Jun 5 14:53:52 bilprdci cmnetd[3812]: 172.16.8.165 failed.
Jun 5 14:53:52 bilprdci cmnetd[3812]: lan900 is down at the IP layer.
Jun 5 14:53:52 bilprdci cmnetd[3812]: lan900 failed.
Jun 5 14:53:52 bilprdci cmnetd[3812]: Subnet 172.16.8.0 down
Jun 5 14:54:43 bilprdci vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xac1008c8 .See ndd -h ip_ire_gw_probe for more info
Jun 5 14:56:46 bilprdci cmnetd[3812]: 172.16.8.165 recovered.
Jun 5 14:56:46 bilprdci cmnetd[3812]: Subnet 172.16.8.0 up
Jun 5 14:56:46 bilprdci cmnetd[3812]: lan900 is up at the IP layer.
Jun 5 14:56:46 bilprdci cmnetd[3812]: lan900 recovered.
Jun 5 15:02:26 bilprdci cmnetd[3812]: 172.16.8.165 failed.
Jun 5 15:02:26 bilprdci cmnetd[3812]: lan900 is down at the IP layer.
Jun 5 15:02:26 bilprdci cmnetd[3812]: lan900 failed.
Jun 5 15:02:26 bilprdci cmnetd[3812]: Subnet 172.16.8.0 down
Jun 5 15:03:43 bilprdci vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xac1008c8 .See ndd -h ip_ire_gw_probe for more info
Jun 5 15:06:43 bilprdci vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xac1008c8 .See ndd -h ip_ire_gw_probe for more info
Jun 5 15:09:43 bilprdci vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xac1008c8 .See ndd -h ip_ire_gw_probe for more info
Jun 5 15:12:52 bilprdci cmnetd[3812]: 172.16.8.165 recovered.
Jun 5 15:12:52 bilprdci cmnetd[3812]: Subnet 172.16.8.0 up
Jun 5 15:12:52 bilprdci cmnetd[3812]: lan900 is up at the IP layer.
Jun 5 15:12:43 bilprdci vmunix: Dead gateway detection can't ping the last remaining default gateway at 0xac1008c8 .See ndd -h ip_ire_gw_probe for more info
Jun 5 15:12:52 bilprdci cmnetd[3812]: lan900 recovered.
Jun 5 15:18:38 bilprdci cmnetd[3812]: Link level address on network interface lan901 has been changed from 0xf4ce46f488fa to 0xf4ce46f48808.
Jun 5 15:18:38 bilprdci cmnetd[3812]: lan901 is down at the data link layer.
Jun 5 15:18:38 bilprdci cmnetd[3812]: lan901 failed.
Jun 5 15:18:38 bilprdci cmnetd[3812]: Subnet 10.10.12.0 down
Jun 5 15:18:39 bilprdci cmcld[3803]: Member bilprddb seems unhealthy, not receiving heartbeats from it.
.......

Looks like network problems: both of your aggregates went down and the node lost the heartbeat with the other node, which is a possible cause of the reboot. You should check with your network team what exactly was going on...
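To see at a glance how often each aggregate flapped, you can tally the failure messages per interface. The sample data below mirrors the syslog excerpt above; on the real system you would run the same pipeline against /var/adm/syslog/syslog.log:

```shell
# Sample cmnetd failure lines in the same shape as the excerpt above:
cat > /tmp/syslog.sample <<'EOF'
Jun  5 14:53:52 bilprdci cmnetd[3812]: lan900 failed.
Jun  5 14:56:46 bilprdci cmnetd[3812]: lan900 recovered.
Jun  5 15:02:26 bilprdci cmnetd[3812]: lan900 failed.
Jun  5 15:18:38 bilprdci cmnetd[3812]: lan901 failed.
EOF

# Count failures per interface: field 6 is the interface name in
# this message format.
grep 'failed' /tmp/syslog.sample | awk '{print $6}' | sort | uniq -c
```

If both lan900 and lan901 show failures in the same window, the node had no working heartbeat path at all, which is what matters to the cluster.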

How many nodes are in the cluster? The logs say the cluster later re-formed with only one node and the other 2 were excluded - is this a 3-node cluster, then?


Re: Server Reboot

Yes, a complete network failure between the 2 nodes in the cluster, by the look of it - this should never be able to happen unless the aggregates from both lan900 and lan901 run through the same networking kit. So the first port of call is to talk to your network team and ask them why all their network switches failed at the same time...

After the network failed, the remote node (bilprddb) was ejected from the cluster following a race for the cluster lock disk - this is normal cluster behaviour when 2 nodes in a cluster cannot communicate over any LAN interface.

bilprdci formed a one-node cluster and attempted to start the dbPRD package, which failed (reason unknown - you would need to look at the package log for this, but most likely due to the complete network failure).

Later, bilprddb rejoined the cluster and someone manually stopped and started ciPRD on bilprdci.

So my advice here is:

1. Review your cluster package logs as well, as they may throw more light on the nature of the failure(s) here.

2. You need a ground up review of the network design within this cluster - a good cluster design should never be able to lose all network links at the same time.

3. There are lots of nasty NFS issues in here too, no doubt caused by the network outage - however, you should review whether you are following NFS best practice for NFS use in a cluster.

4. You need to check your name-resolution standards in /etc/nsswitch.conf. In a cluster you really need to have name resolution handled first by files and only then by DNS, and you need to make sure all the interfaces are consistently named in /etc/hosts on both cluster nodes.
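For point 4, an illustrative /etc/nsswitch.conf hosts setup along those lines would be (syntax per the HP-UX name-service switch; verify against your own file and the nsswitch.conf(4) man page):

```
# Resolve hostnames from /etc/hosts first; fall back to DNS only
# when the name is not found locally:
hosts:   files [NOTFOUND=continue] dns
ipnodes: files [NOTFOUND=continue] dns
```

This keeps cluster-node and heartbeat names resolvable even when the network (and therefore DNS) is down, exactly the situation in this incident.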

HTH

Duncan

kunjuttan
Super Advisor

Re: Server Reboot

Thank you all for the support. It's a 2-node cluster. One more thing: if the heartbeat LAN fails, is it normal for the other node to reboot? Here the heartbeat LAN failed and my primary node rebooted. Is that expected when the heartbeat LAN fails, or should it only switch the packages over and leave the server intact?
Solution

Re: Server Reboot

In a 2 node cluster, if all the heartbeat LANs between the 2 nodes fail, then one of the nodes is going to get rebooted... this is to ensure that your data is not corrupted.

If neither node can talk to the other, how do they know whether the other node is running one of the packages in the cluster or not... they can't, so what happens is they both try and obtain the cluster lock and the node that "loses" the race for the cluster lock reboots itself. It could just as easily have been the other node that lost the race for the cluster lock...
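The cluster lock that decides this race is configured in the Serviceguard cluster ASCII file. A hypothetical excerpt (the volume group and disk names here are made up for illustration - check your own cluster's configuration file, e.g. via cmgetconf):

```
# Illustrative Serviceguard cluster configuration excerpt: both nodes
# point at the same lock disk, so exactly one node can win the race.
CLUSTER_NAME             bilprd

FIRST_CLUSTER_LOCK_VG    /dev/vglock

NODE_NAME                bilprdci
  FIRST_CLUSTER_LOCK_PV  /dev/disk/disk10

NODE_NAME                bilprddb
  FIRST_CLUSTER_LOCK_PV  /dev/disk/disk10
```

The node that fails to obtain the lock deliberately panics itself (a TOC) rather than risk running packages on both nodes at once.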

HTH

Duncan

kunjuttan
Super Advisor

Re: Server Reboot

Thanks Duncan... I was looking for exactly that.