Help determining cause of reboot
09-26-2002 05:50 AM
One of our client sites had several power outages yesterday, roughly four within five minutes. The servers are all supposed to be on UPS-supplied circuits.
After the outages the servers remained running. However, about one minute after the last noticeable outage, one of the servers rebooted.
This server is part of a two-node ServiceGuard cluster and was the primary node at the time of the outage.
Can someone take a look at the log file entries below and help me determine why the machine rebooted?
Sep 25 14:40:12 cosmo0 : su : + tmc voecksl-root
Sep 25 16:39:16 cosmo0 telnetd[20105]: getpid: peer died: Connection timed out
Sep 25 16:40:12 cosmo0 telnetd[20163]: getpid: peer died: Connection timed out
Sep 25 16:40:12 cosmo0 telnetd[20164]: getpid: peer died: Connection timed out
Sep 25 16:40:12 cosmo0 telnetd[20165]: getpid: peer died: Connection timed out
Sep 25 16:40:29 cosmo0 vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/2/0/0/5/0....
Sep 25 16:40:29 cosmo0 vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/5/0/0/5/0....
Sep 25 16:40:29 cosmo0 cmcld: lan2 failed
Sep 25 16:40:29 cosmo0 cmcld: Subnet 148.8.70.0 switched from lan2 to lan3
Sep 25 16:40:29 cosmo0 cmcld: lan2 switched to lan3
Sep 25 16:40:29 cosmo0 cmcld: lan6 failed
Sep 25 16:40:29 cosmo0 cmcld: Package unidata cannot run on this node because switching has been disabled for this node.
Sep 25 16:40:31 cosmo0 vmunix: btlan: NOTE: MII Link Status Not OK - Check Cable Connection to Hub/Switch at 0/2/0/0/6/0....
Sep 25 16:40:31 cosmo0 cmcld: lan3 failed
Sep 25 16:40:31 cosmo0 cmcld: Subnet 148.8.70.0 down
Sep 25 16:41:39 cosmo0 cmcld: Timed out node cosmo1. It may have failed.
Sep 25 16:41:39 cosmo0 cmcld: Attempting to form a new cluster
Sep 25 16:45:01 cosmo0 cmcld: lan2 recovered
Sep 25 16:45:01 cosmo0 cmcld: Subnet 148.8.70.0 switched from lan3 to lan2
Sep 25 16:45:01 cosmo0 cmcld: lan3 switched to lan2
Sep 25 16:45:01 cosmo0 cmcld: Subnet 148.8.70.0 up
Sep 25 16:45:01 cosmo0 cmcld: Package unidata cannot run on this node because switching has been disabled for this node.
Sep 25 16:45:03 cosmo0 cmcld: lan6 recovered
Sep 25 16:46:41 cosmo0 cmcld: Obtaining Cluster Lock
Sep 25 16:46:42 cosmo0 cmcld: Cluster lock was denied. Lock was obtained by another node.
Sep 25 16:46:42 cosmo0 cmcld: Attempting to form a new cluster
Sep 25 16:46:42 cosmo0 cmcld: Daemon exiting due to halt message from node cosmo1
Sep 25 16:46:42 cosmo0 cmcld: Halting cosmo0 to preserve data integrity
Sep 25 16:46:42 cosmo0 cmcld: Reason: Impossibly long daemon hang detected
Sep 25 16:46:42 cosmo0 cmcld: cl_abort: abort cl_kepd_printf failed: Invalid argument
Sep 25 16:46:42 cosmo0 cmcld: Aborting! Impossibly long daemon hang detected (file: utils.c, line: 155)
Sep 25 16:46:46 cosmo0 cmclconfd[2596]: The ServiceGuard daemon, /usr/lbin/cmcld[2597], died upon receiving the signal 6.
Sep 25 16:46:53 cosmo0 vmunix:
Sep 25 16:46:53 cosmo0 vmunix: sync'ing disks (15 buffers to flush): 15 4 1
Sep 25 16:46:53 cosmo0 vmunix: 0 buffers not flushed
Sep 25 16:46:53 cosmo0 vmunix: 0 buffers still dirty
root@cosmo0:/var/adm/syslog->
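A few standard places to start when chasing an unexplained HP-UX reboot like this one (the paths and commands below assume a typical HP-UX 11.x layout and are not taken from this thread):

    last reboot                             # reboot records from wtmp
    cat /var/adm/shutdownlog                # orderly shutdowns/reboots, if the file exists
    ls -l /var/adm/crash                    # savecrash dump directories left by a panic or TOC
    grep cmcld /var/adm/syslog/syslog.log   # ServiceGuard daemon messages, as pasted above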
09-26-2002 05:54 AM
Re: Help determining cause of reboot
09-26-2002 05:57 AM
Sounds like that's what happened here. This node lost the race to the lock VG.
09-26-2002 05:58 AM
Re: Help determining cause of reboot
The datacenter and main switches are UPS-powered. However, the switches in the closets throughout the campus are not.
So when we lose power, all of the external switches reboot and try to re-establish connectivity to the main bridge switches.
The way things are set up, if there are successive outages like this in a short period, the main bridges get overloaded and fail, requiring a reboot of them as well.
So while the datacenter is on UPS, this type of failure does cause the servers to lose their LAN while the main bridges are rebooting.
Would ServiceGuard for any reason reboot the server when it sees a LAN failure?
TIA,
Sean
09-26-2002 06:02 AM
Re: Help determining cause of reboot
I had forgotten that ServiceGuard will reboot the node if it doesn't get the lock.
As a side question, is there a way to give one node priority on the lock over the other? This company would prefer that one of the two machines be the primary node virtually all the time. And ALL of the failovers they have had were the result of network problems, so the primary server was always working and always available to run the package.
But it seems that on every failover the alternate machine gets the lock first, and we end up halting the package on that node and bringing it back up on the primary machine.
It would be nice if we could set some type of priority to give the primary the first shot at the lock, say a 10-second delay on the alternate or something like that.
09-26-2002 06:08 AM
Re: Help determining cause of reboot
Yes. During cluster re-formation, MC/ServiceGuard TOCs the node that does not get the cluster lock but still has the volume groups activated. Your cosmo0 lost the cluster lock to cosmo1 during the outage.
Go through the messages and it becomes crystal clear:
Sep 25 16:46:41 cosmo0 cmcld: Obtaining Cluster Lock
Sep 25 16:46:42 cosmo0 cmcld: Cluster lock was denied. Lock was obtained by another node.
Sep 25 16:46:42 cosmo0 cmcld: Attempting to form a new cluster
Sep 25 16:46:42 cosmo0 cmcld: Daemon exiting due to halt message from node cosmo1
Sep 25 16:46:42 cosmo0 cmcld: Halting cosmo0 to preserve data integrity
-Sri
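As background for the lock race described above: the lock disk the two nodes compete for is defined in the cluster configuration. A rough sketch of how to check it, using a made-up cluster name and disk path (cmgetconf/cmviewcl as in MC/ServiceGuard 11.x; verify against your version):

    # Dump the running cluster configuration to an ASCII file
    cmgetconf -c cosmo_cluster /tmp/cosmo_cluster.ascii

    # Look for the lock VG/PV entries, e.g.
    #   FIRST_CLUSTER_LOCK_VG  /dev/vglock
    #   FIRST_CLUSTER_LOCK_PV  /dev/dsk/c0t6d0   (one per NODE_NAME section)
    grep -i LOCK /tmp/cosmo_cluster.ascii

    # Current cluster, node and package status
    cmviewcl -v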
09-26-2002 06:14 AM
Re: Help determining cause of reboot
Look at NODE_TIMEOUT and NETWORK_POLLING_INTERVAL in the cluster's ASCII file. The first determines how long to wait before re-forming the cluster once the other node has timed out; the second decides how quickly a network outage is declared, which is particularly helpful for local LAN failovers.
You can increase these values. My settings are 12 seconds for both.
-Sri
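For anyone editing these: in the cluster ASCII file both parameters are given in microseconds (at least in the ServiceGuard releases of that era), so a 12-second setting like Sri's would look roughly like the excerpt below; the cluster name is invented for illustration.

    # Excerpt from a ServiceGuard cluster ASCII file (values in microseconds)
    CLUSTER_NAME                cosmo_cluster
    # Wait 12 seconds for a missing heartbeat before re-forming the cluster
    NODE_TIMEOUT                12000000
    # Poll the LAN interfaces every 12 seconds when checking for network failures
    NETWORK_POLLING_INTERVAL    12000000

After editing, the usual sequence is cmcheckconf -C <file> to validate the change and cmapplyconf -C <file> to distribute it to the nodes.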
09-26-2002 06:16 AM
Re: Help determining cause of reboot
NO! There is no way to force one node to have any advantage! I was a bit peeved about this myself when I brought it up in the MC/SG class I attended about 2 months ago. It seems like HP could put some delay mechanism in place to give one node an advantage. Alas, there is nothing you can do (at least that's what my instructor said).