Integrity virtual machine abrupt shutdown.
03-30-2013 01:02 AM - last edited on 04-01-2013 06:39 PM by Cathy_xu
Hi everybody,
Yesterday I ran into an issue. I have two blade machines (BL860c i2), each hosting three virtual machines: the first blade runs the application server and two other VMs, and the second blade runs the database server and two other VMs.
Yesterday the application and database servers both shut down abruptly. On one machine, /etc/shutdownlog shows:
Reboot after panic: SafetyTimer expired, INIT, IIP:0xe000000001f91af0 IFA:0x20000000777d70cc
and on the other machine:
Reboot after panic: SafetyTimer expired, INIT, IIP:0xe00000fffff01cd0 IFA:0xc0000000b9982100
Please let me know what the issue was. If you need any more information, I can provide it.
Regards
somana
Moved from HP-UX > System Administration to HP-UX > Serviceguard
03-31-2013 02:17 PM
Re: Integrity virtual machine abrupt shutdown.
"SafetyTimer" is normally related to HP Serviceguard. When a cluster is running normally, a safety timer is constantly ticking down, but it is reset by each successful cluster heartbeat, so it will never reach zero. But if the heartbeat network connectivity fails, the isolated cluster nodes will all attempt to get the cluster lock for themselves.
If all the heartbeat connections have failed, only the node that successfully gets the cluster lock is allowed to disable the safety timer and continue running. All the other nodes will perform a panic reboot if they are unable to contact the node that holds the cluster lock before the safety timer reaches zero. This is necessary to resolve split-brain scenarios: "A thinks B has failed, but B thinks A has failed. If A and B both attempt to access the shared disks simultaneously without being aware of each other, the data on those disks will be corrupted for sure."
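To make the countdown-and-reset behaviour concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not Serviceguard's actual implementation: the `SafetyTimer` class, the `simulate` function, and the 5-second timeout are all hypothetical, and the cluster lock is deliberately left out (the node is assumed never to obtain it).

```python
class SafetyTimer:
    """Toy countdown that every successful heartbeat resets (hypothetical model)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s

    def heartbeat(self):
        # A successful heartbeat puts the timer back to its full value.
        self.remaining_s = self.timeout_s

    def tick(self, elapsed_s=1):
        # Count down; return True once the timer has reached zero.
        self.remaining_s = max(0, self.remaining_s - elapsed_s)
        return self.remaining_s == 0


def simulate(timeout_s, heartbeat_ok):
    """Return the second at which the node would panic, or None if it never does.

    heartbeat_ok holds one boolean per simulated second. The node is assumed
    never to get the cluster lock, so an expired timer always means a panic.
    """
    timer = SafetyTimer(timeout_s)
    for second, ok in enumerate(heartbeat_ok, start=1):
        if ok:
            timer.heartbeat()
        elif timer.tick():
            return second
    return None


# Heartbeats work for 3 seconds, then the network goes away for good.
print(simulate(5, [True] * 3 + [False] * 10))
```

With a healthy heartbeat the function returns None; a short gap that ends before the timer runs out also returns None, which is why brief network glitches do not crash a node.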
There are several ways to implement a cluster lock: a lock disk, a lock LUN, and a Quorum Server-based cluster lock.
If a Quorum Server is used for cluster lock, then the cluster lock mechanism is network-based. A total network outage can then bring down the entire cluster, if the Quorum Server becomes inaccessible and all the heartbeat connections fail at the same time.
Does this seem possible in your environment?
03-31-2013 10:40 PM
Re: Integrity virtual machine abrupt shutdown.
Thanks MK,
But these two servers are in different clusters. One more thing: both servers went into a complete shutdown state.
The two servers are related as application server and database server.
04-01-2013 04:19 AM
Solution
Yes, a safety timer expiration will cause an abrupt shutdown. This is normal and expected in certain situations.
When a cluster node loses all heartbeat connections and cannot get a cluster lock, it must assume that another node is alive and preparing to take over the clustered services. Therefore, the isolated node *must* stop running the clustered services as fast as possible. The safety timer expiration handles this requirement by intentionally crashing the node.
The maximum value of the safety timer is a cluster-wide configuration item, so all the nodes in a cluster will know the value. If the other node(s) are still eligible for running the cluster (i.e. the node holding the cluster lock, or a node that has a heartbeat connection with another node that holds the cluster lock), they will wait for the isolated node's safety timer to expire before failing over the services from the isolated node.
In your case, two separate clusters each had a total heartbeat connection failure on one node. Try to find a network component that is common to both the failed nodes. Are the failed nodes in the same server room with each other? Are they in the same rack? In the same network segment? Served by the same switch(es)?
If all the heartbeat connections to the two failed nodes go through a single network component, you have found a certain kind of a Single Point of Failure: a point that can disable one node on two clusters if it fails.
If you cannot find a single point whose failure could disable all heartbeat connections to those two nodes, there might have been a fault that affected more than one network component simultaneously.
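The search for a shared component can be sketched as a set intersection. This is only an illustration: the `common_components` function, the node names, and the component names below are hypothetical, and the path lists would have to be filled in by hand from your own network documentation.

```python
def common_components(heartbeat_paths):
    """Return the components that appear in every heartbeat path of every node.

    heartbeat_paths maps a node name to a list of paths; each path is a list
    of component names (NICs, switches, uplinks) that the heartbeat crosses.
    Anything in the result is a single point of failure candidate.
    """
    all_paths = [set(path)
                 for paths in heartbeat_paths.values()
                 for path in paths]
    if not all_paths:
        return set()
    return set.intersection(*all_paths)


# Hypothetical layout: both failed nodes have two heartbeat paths each.
paths = {
    "app-node": [["nic0", "switch-a", "core-1"], ["nic1", "switch-b", "core-1"]],
    "db-node": [["nic0", "switch-a", "core-1"], ["nic1", "switch-c", "core-1"]],
}
print(common_components(paths))
```

An empty result means no single listed component sits on every heartbeat path, which points towards a fault that hit several components at once.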
For example, if your network admins changed Spanning Tree settings in the switches or modified the links between the switches at the time your nodes failed, the failure might have been caused by the time taken by the Spanning Tree Protocol to detect and eliminate loops in your network: a basic STP may take about 30 seconds after a network topology change to converge to a new topology, during which time no regular network traffic may pass. As this may cause problems with clusters, many modern networks will use RSTP (Rapid STP) or MSTP (Multiple STP, an extended version of RSTP) instead.
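As a rough worked example of where the "about 30 seconds" comes from: in classic STP a port spends one forward-delay interval in the listening state and another in the learning state, and the standard default forward delay is 15 seconds. The sketch below compares that outage with a safety timer value. The function names and the 20-second timer are hypothetical; check the real value in your own cluster configuration, and note that the comparison assumes the node cannot obtain the cluster lock during the outage.

```python
def stp_convergence_s(forward_delay_s=15):
    """Listening plus learning time for classic STP, in seconds."""
    return 2 * forward_delay_s


def outage_expires_timer(outage_s, safety_timer_s):
    """True if a heartbeat outage lasts at least as long as the safety timer.

    Assumes the node cannot get the cluster lock or reach the lock holder
    while the outage lasts.
    """
    return outage_s >= safety_timer_s


outage = stp_convergence_s()               # 30 seconds with the default delay
print(outage_expires_timer(outage, 20))    # hypothetical 20-second safety timer
```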
You should talk to your network admins to see if there were any network maintenance operations at the time your nodes failed.
04-01-2013 06:02 AM
Re: Integrity virtual machine abrupt shutdown.
MK, you are awesome!!!
I have one more question: is there any other angle I could check to find out what caused this issue?
thanks
somana
04-02-2013 04:04 AM
Re: Integrity virtual machine abrupt shutdown.
You might look at /var/adm/syslog/syslog.log on every node of each cluster that had a node shut down abruptly.
The logs on the nodes that kept running should indicate that a heartbeat connection to the failed node was lost, providing proof that there really was a network problem of some kind.
On the failed nodes, there might also be some log messages related to the issue, but they might end abruptly at the point where the safety timer crashed the node.
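If the logs are large, a small filter can help narrow them down. The following is a minimal sketch, not a Serviceguard tool: the function names are hypothetical, and the keywords in the usage comment are only guesses at what the relevant messages might contain, so adjust them to whatever your logs actually show.

```python
def matching_lines(lines, keywords):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is case-insensitive; line numbers start at 1.
    """
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(keyword in text.lower() for keyword in lowered):
            hits.append((number, text))
    return hits


def scan_log(path, keywords):
    """Read a log file and return its lines that match any of the keywords."""
    with open(path, errors="replace") as handle:
        return matching_lines(handle.readlines(), keywords)


# Example call on a cluster node (keywords are assumptions, not known message text):
# for number, text in scan_log("/var/adm/syslog/syslog.log", ["heartbeat", "cmcld"]):
#     print(number, text)
```

Running the same filter on every node and comparing the timestamps of the hits should show whether the surviving nodes noticed the heartbeat loss at the same moment in both clusters.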