Integrity virtual machine abrupt shutdown.

somanaboina_522 · ‎03-30-2013

Hi everybody,

Yesterday I ran into a problem. I have two blade servers (BL860c i2), each hosting three virtual machines. The first blade runs the application server and two other VMs; the second blade runs the database server and two other VMs.

Yesterday the application server and the database server both shut down abruptly. On one machine, /etc/shutdownlog shows:

Reboot after panic: SafetyTimer expired, INIT, IIP:0xe000000001f91af0 IFA:0x20000000777d70cc

and on the other machine:

Reboot after panic: SafetyTimer expired, INIT, IIP:0xe00000fffff01cd0 IFA:0xc0000000b9982100

Please let me know what caused this. If you need more information, I can provide it.

Regards

somana

Moved from HP-UX > System Administration to HP-UX > Serviceguard

Matti_Kurkela · ‎03-31-2013

"SafetyTimer" is normally related to HP Serviceguard. When a cluster is running normally, a safety timer is constantly ticking down, but it is reset by each successful cluster heartbeat, so it will never reach zero. But if the heartbeat network connectivity fails, the isolated cluster nodes will all attempt to get the cluster lock for themselves.

If all the heartbeat connections have failed, only the node that successfully gets the cluster lock is allowed to disable the safety timer and continue running. All the other nodes will perform a panic reboot if they are unable to contact the node that holds the cluster lock before the safety timer reaches zero. This is necessary to resolve split-brain scenarios: "A thinks B has failed, but B thinks A has failed. If A and B both attempt to access the shared disks simultaneously without being aware of each other, the data on those disks will be corrupted for sure."
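To make the countdown-and-reset behaviour concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not Serviceguard's actual implementation: the class name, the tick-based clock, and the value of max_ticks are all assumptions made for illustration.

```python
class SafetyTimer:
    """Toy countdown timer (hypothetical; not Serviceguard code)."""

    def __init__(self, max_ticks):
        self.max_ticks = max_ticks
        self.remaining = max_ticks

    def heartbeat_ok(self):
        # A successful heartbeat rearms the timer to its full value.
        self.remaining = self.max_ticks

    def tick(self):
        # One unit of time passes; returns True once the timer has expired.
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def simulate(heartbeats, max_ticks=3):
    """Return the tick index at which the node would panic, or None.

    heartbeats is a sequence of booleans, one per tick, telling whether
    a heartbeat arrived during that tick.
    """
    timer = SafetyTimer(max_ticks)
    for i, ok in enumerate(heartbeats):
        if ok:
            timer.heartbeat_ok()
        if timer.tick():
            return i
    return None
```

With heartbeats arriving on every tick, simulate() returns None (the timer never reaches zero). Once the heartbeats stop for max_ticks consecutive ticks, it returns the index of the tick on which the node would panic.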

There are several ways to implement a cluster lock: a lock disk, a lock LUN, and a Quorum Server-based cluster lock.

If a Quorum Server is used for cluster lock, then the cluster lock mechanism is network-based. A total network outage can then bring down the entire cluster, if the Quorum Server becomes inaccessible and all the heartbeat connections fail at the same time.

Does this seem possible in your environment?

MK

somanaboina_522 · ‎03-31-2013

Thanks MK,

But these two servers are in different clusters. One more thing: both servers went into a completely shut-down state.

The two servers are related as application and database servers.

Matti_Kurkela · ‎04-01-2013

Yes, a safety timer expiration will cause an abrupt shutdown. This is normal and expected in certain situations.

When a cluster node loses all heartbeat connections and cannot get a cluster lock, it must assume that another node is alive and preparing to take over the clustered services. Therefore, the isolated node *must* stop running the clustered services as fast as possible. The safety timer expiration handles this requirement by intentionally crashing the node.

The maximum value of the safety timer is a cluster-wide configuration item, so all the nodes in a cluster will know the value. If the other node(s) are still eligible for running the cluster (i.e. the node holding the cluster lock, or a node that has a heartbeat connection with another node that holds the cluster lock), they will wait for the isolated node's safety timer to expire before failing over the services from the isolated node.
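The two rules above (who may keep running, and when the survivors may fail the services over) can be written down as a small sketch. The function and parameter names are hypothetical, and time is just a number of seconds on a shared clock, which is a simplification.

```python
def may_keep_running(holds_lock, heartbeat_to_lock_holder):
    """A node stays eligible if it holds the cluster lock itself, or if it
    still has a heartbeat connection to the node that holds it."""
    return holds_lock or heartbeat_to_lock_holder


def earliest_failover_time(last_heartbeat_from_isolated, safety_timer_max):
    """Survivors wait until the isolated node's safety timer must have
    expired before taking over its services.

    Both arguments are in seconds; safety_timer_max is the cluster-wide
    maximum that every node knows.
    """
    return last_heartbeat_from_isolated + safety_timer_max
```

The point of the second function is that the survivors never need to ask the isolated node anything: knowing the cluster-wide maximum is enough to know when it is guaranteed to be down.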

In your case, two separate clusters each had a total heartbeat connection failure on one node. Try to find a network component that is common to both the failed nodes. Are the failed nodes in the same server room with each other? Are they in the same rack? In the same network segment? Served by the same switch(es)?

If all the heartbeat connections to the two failed nodes go through a single network component, you have found a certain kind of a Single Point of Failure: a point that can disable one node on two clusters if it fails.
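If you can write down the network components each heartbeat path passes through, finding such a shared point is just a set intersection. A minimal sketch, assuming you collect the paths by hand; the node and component names in the usage note are made up.

```python
def shared_single_points(paths_by_node):
    """Return the components that appear in every heartbeat path of
    every listed node.

    paths_by_node maps a node name to a list of paths, where each path
    is a list of component names (switches, links, and so on).
    """
    all_paths = [set(path) for paths in paths_by_node.values() for path in paths]
    if not all_paths:
        return set()
    return set.intersection(*all_paths)
```

For example, with paths_by_node = {"app_vm": [["switch-a", "core-1"], ["switch-b", "core-1"]], "db_vm": [["switch-c", "core-1"]]}, the result is {"core-1"}: the one component whose failure would cut every heartbeat path of both nodes. An empty result means no single component explains both failures.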

If you cannot find a single point whose failure could disable all heartbeat connections to those two nodes, there might have been a fault that affected more than one network component simultaneously.

For example, if your network admins changed Spanning Tree settings in the switches or modified the links between the switches at the time your nodes failed, the failure might have been caused by the time taken by the Spanning Tree Protocol to detect and eliminate loops in your network: a basic STP may take about 30 seconds after a network topology change to converge to a new topology, during which time no regular network traffic may pass. As this may cause problems with clusters, many modern networks will use RSTP (Rapid STP) or MSTP (Multiple STP, an extended version of RSTP) instead.
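The 30-second figure comes from classic 802.1D STP, where a port spends one forward-delay interval (15 seconds by default) in the listening state and another in the learning state before it forwards traffic. A rough sketch of the arithmetic follows; the cluster timer value you compare against is something you would take from your own cluster configuration, so the 14 seconds in the usage note is only a placeholder.

```python
FORWARD_DELAY_S = 15  # classic 802.1D default forward delay, in seconds


def stp_port_blackout(forward_delay=FORWARD_DELAY_S):
    """Seconds a port spends in listening plus learning before forwarding."""
    return 2 * forward_delay


def outage_expires_timer(outage_s, timer_s):
    """True if a network outage lasts at least as long as the safety timer."""
    return outage_s >= timer_s
```

For instance, outage_expires_timer(stp_port_blackout(), 14) is True: a 30-second blackout outlasts a hypothetical 14-second timer, so a node cut off for the whole convergence period would be expected to panic.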

You should talk to your network admins to see if there were any network maintenance operations at the time your nodes failed.

MK

somanaboina_522 · ‎04-01-2013

MK, you are awesome!

I have one more question: is there any other angle I could check to find out what caused this issue?

thanks

somana

Matti_Kurkela · ‎04-02-2013

You might look at /var/adm/syslog/syslog.log on every node of the clusters that had a node shut down abruptly.

The logs on the nodes that kept running should indicate that a heartbeat connection to the failed node was lost, providing proof that there really was a network problem of some kind.

On the failed nodes, there might also be some log messages related to the issue. However, they might end abruptly at the point where the Safety Timer crashed the node.
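If the logs are large, a small filter can help pull out the relevant lines. This is a generic keyword search, nothing Serviceguard-specific: the keywords you pass in are your own guess (the name of the cluster daemon, cmcld, and the word "heartbeat" are reasonable starting points, but treat them as assumptions and adjust to what your logs actually contain).

```python
def matching_lines(lines, keywords):
    """Return (line_number, text) for each line containing any keyword.

    Matching is case-insensitive; line numbers start at 1.
    """
    wanted = [k.lower() for k in keywords]
    return [
        (n, line.rstrip("\n"))
        for n, line in enumerate(lines, 1)
        if any(k in line.lower() for k in wanted)
    ]


def scan_file(path, keywords):
    """Apply matching_lines() to a log file on disk."""
    with open(path, errors="replace") as f:
        return matching_lines(f, keywords)
```

A call such as scan_file("/var/adm/syslog/syslog.log", ["cmcld", "heartbeat"]) on each node gives you a list to compare side by side. On a failed node, the last tuple returned shows roughly where the log stopped before the crash.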

MK
