06-29-2005 08:35 PM
SG panics when half cluster is down.
I have an SG cluster with 4 nodes. Last Monday I installed patches on two nodes (1 and 2) of that cluster. Beforehand, I had switched the packages to the other nodes (3 and 4). I did not run cmhaltnode on nodes 1 and 2.
When nodes 1 and 2 rebooted as part of the patching, nodes 3 and 4 panicked at the same time. The panic message is:
Reboot after panic: SafetyTimer expired, isr.ior = 0'10340005.0'f83e01d8
As a result, the HA services went down and could not recover until 15 or 20 minutes later, because of the crash dump and the subsequent startup.
Since a panic was reported, I opened a call with HP; I believed this was a patch issue.
But HP answered that this is the LOGICAL behaviour of SG clusters, in order to prevent split-brain.
I am indignant about this.
Do you think that TOC'ing the surviving nodes in a cluster is logical behaviour?
I agree that, in a split-brain situation, an SG cluster should stop services BUT NOT REBOOT the nodes.
I have managed SG clusters with 2 nodes. In all those cases, when a node crashed due to a CPU or memory failure, the surviving node never panicked and rebooted.
06-29-2005 09:07 PM
Re: SG panics when half cluster is down.
Assuming you do not have a cluster lock configured, then the behaviour of your cluster is to be expected.
Your cluster couldn't achieve quorum (exactly half - 50% - of the nodes were still up), so it didn't know and couldn't know what the state of the other two nodes was (it would have known if you had run cmhaltnode, because the nodes would have gracefully left the cluster).
So in this situation Serviceguard MUST protect your data first and worry about availability second.
Your idea of stopping services sounds great, but how long does that take, and doesn't stopping an app usually involve writing data to disk? In these scenarios it's just not possible to know that stopping services won't actually corrupt your data (remember, we don't know what the other two nodes are doing), so the only safe course of action is to stop everything DEAD - and Serviceguard does that by TOC'ing the box.
Of course, if you DO have some form of cluster lock (lock disk or quorum server), then we need to look again at what happened.
If you had rebooted just one node at a time you wouldn't have had this problem, as quorum (>50% of nodes) would have been maintained.
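For reference, a minimal sketch of that graceful-removal sequence, run on one node at a time (node names are placeholders, and exact options can vary between Serviceguard releases):
# Take node1 out of the cluster gracefully before patching
cmhaltnode -f node1     # -f also halts any packages still running on node1
# ... install patches on node1 and reboot it ...
cmrunnode node1         # let node1 rejoin the running cluster
cmviewcl -v             # confirm node1 shows as "running" before touching node2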
HTH
Duncan
I am an HPE Employee
06-29-2005 09:14 PM
Re: SG panics when half cluster is down.
The reboot did a kill -9 on the cmcld process, which is seen as a failure, and this happened on two out of four nodes, leaving you 2 nodes, or 50%. Serviceguard REQUIRES more than a 50% quorum in a failure scenario, or exactly 50% with access to a cluster locking mechanism.
The correct procedure would have been to cmhaltnode the two nodes to be patched; this would have allowed the other two nodes to re-form as a two-node cluster under non-failure conditions.
This would also have prevented the nodes being patched from doing a TOC and dump.
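For completeness, a cluster locking mechanism is declared in the cluster ASCII configuration file. A rough sketch of the relevant entries (volume group, device and host names are placeholders, and the parameter names and default values should be checked against the Managing Serviceguard manual for your release):
# Lock-disk variant: cluster-wide lock VG plus a lock PV entry under each node
FIRST_CLUSTER_LOCK_VG   /dev/vglock
NODE_NAME               node1
  FIRST_CLUSTER_LOCK_PV /dev/dsk/c1t2d0
# Quorum-server variant, used instead of a lock disk
QS_HOST                 qs.example.com
QS_POLLING_INTERVAL     300000000    # microseconds
QS_TIMEOUT_EXTENSION    2000000      # microseconds
Either variant is checked with cmcheckconf and applied with cmapplyconf.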
06-29-2005 09:29 PM
Re: SG panics when half cluster is down.
I side with HP, and recommend you do some reading about "arbitration". It is all available on the HP docs website.
Rebooting two nodes at the same time was not a *single* failure (like one node in a two-node cluster). Still, you could have prevented this by running cmhaltnode before patching them. Then the cluster would have known the nodes *are* down. With that knowledge you could even have stopped a third node as well, and the remaining node would have known the others were down. Afterwards they could have re-entered the cluster.
Your cluster recognized that it did not have *more* than half of the nodes, and since it did not know what was going on, the nodes *needed* to go down: if the other half of the nodes tried to bring up the packages, you have to be sure they can. But if you lose access to the disks you cannot deactivate a VG, and then the other nodes would not be able to activate it exclusively...
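To illustrate the arithmetic, here is a hypothetical shell sketch of the quorum rule (only an illustration of the decision, not what cmcld actually runs):
# 4-node cluster, 2 nodes rebooted without cmhaltnode
TOTAL=4
ALIVE=2                  # nodes this half can still see, itself included
HAVE_TIE_BREAKER=no      # no lock disk or quorum server in this case
if [ $((ALIVE * 2)) -gt $TOTAL ]; then
    echo "strict majority: re-form the cluster and keep running"
elif [ $((ALIVE * 2)) -eq $TOTAL ] && [ "$HAVE_TIE_BREAKER" = "yes" ]; then
    echo "exactly half, but the tie-breaker grants quorum: re-form"
else
    echo "no quorum: TOC immediately to protect the shared disks"
fi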
Regards,
Bernhard
06-29-2005 09:39 PM
Re: SG panics when half cluster is down.
Quoting the earlier reply: "Your idea of stopping services sounds great, but how long does that take, and doesn't stopping an app usually involve writing data to disk? In these scenarios it's just not possible to know that stopping services won't actually corrupt our data (we don't know what the other two nodes are doing, remember)."
In my case, I switched all packages to nodes 3 and 4 and disabled switching before installing the patches.
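To be concrete, that kind of package move and switch-disable looks roughly like this (pkg1 and node3 are just example names, and exact options may differ by Serviceguard release):
# Move a package off a node to be patched and keep it from switching back
cmhaltpkg pkg1                # stop the package on its current node
cmrunpkg -n node3 pkg1        # start it on node3
cmmodpkg -d pkg1              # disable automatic (global) switching for pkg1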
06-29-2005 09:48 PM
Re: SG panics when half cluster is down.
When half of the cluster disappears at once, the surviving nodes face two possibilities:
a) the nodes have failed, or
b) the nodes might be OK, but the network between them has failed.
But how can the cluster determine which of these two situations it is? There must be some sort of tie-breaker; otherwise the cluster has NO WAY of knowing which situation it is in. In SG this tie-breaker is either a lock disk or a quorum server.
In situation a) it is obvious that the remaining nodes should continue processing and, if possible, claim the failed nodes' packages for themselves.
But situation b) is more dangerous. From the point of view of nodes 1 and 2, nodes 3 and 4 have failed and their services should be moved to nodes 1 and 2.
However, from the point of view of nodes 3 and 4, nodes 1 and 2 have failed... which means that BOTH groups of nodes will attempt to claim the other group's package IP addresses and disks. Since there is no connectivity between the groups of nodes, they will succeed, potentially causing data corruption in EVERY package that can be moved between the groups.
So, to answer your question: when there is a possibility that another node is mounting the same disks "this" node is using, and there is no way to communicate with that node, then yes, the only way to avoid data corruption is to stop immediately. The TOC is the fastest way to do that.
War story:
We recently reconfigured all our SG clusters to use a quorum server, to make our major storage system upgrade more painless - no cluster restarts for lock disk reconfiguration.
However, after that the UPS system of one of our two major server rooms malfunctioned. There was a small fire, and the firemen had to power down the UPS system... which caused a total power outage to one server room.
No problem, the services should fail over to the servers in the other room, right...?
But it happened that the quorum server was in the server room that was having the power outage... so all our production SG nodes made a TOC, since each cluster lost exactly one-half of its nodes AND the quorum server was unreachable at the same time.
When the power was back, it took about eight hours to restore the production systems (during a Saturday evening) and more time on Monday to restore the test/development/noncritical systems.
We also had a configuration error where a single FibreChannel disk was simultaneously used in two unrelated systems. One of the systems had a lot of unused space in its VG, so the problem did not show up until both systems actually started writing to the disk. That took about two weeks from the initial configuration error.
Sorting out the resulting mess (make copies of the affected disk, verify the correctness of the data, find and restore the corrupted data from backups and/or regenerate from raw data archived elsewhere) took about 24 hours. Of course, the problem was noticed on a Friday afternoon...
06-29-2005 10:30 PM
Re: SG panics when half cluster is down.
06-30-2005 12:21 AM
Re: SG panics when half cluster is down.
...they are trying to tell you that you created the situation and forced SG (the way it was done) to take the action it did.
So... do you have a lock disk or a Quorum Server running? If you had either one of these working properly, you wouldn't have lost your cluster.
...or, as they have said, if you had simply run cmhaltnode on each server before you began patching it, your cluster would not have failed, because the cluster would have known that those boxes were now outside the cluster. Then, when you were done patching and rebooting, just run cmrunnode.
Or is there something else ???
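If you're not sure whether a lock disk or quorum server is configured at all, a quick way to check is something like this (a sketch; the cluster name is a placeholder):
# Dump the running cluster configuration to an ASCII file and look for lock settings
cmgetconf -c mycluster /tmp/cluster.ascii
grep CLUSTER_LOCK /tmp/cluster.ascii
grep QS_HOST /tmp/cluster.ascii
# cmviewcl -v should also report the lock or quorum server status if one is configured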
Hang in there - SG can be a bit overwhelming at times, but you'll get it!
Rgrds,
Rita
06-30-2005 02:56 AM
Re: SG panics when half cluster is down.
You can't compare a crashed node to your situation, and you can't compare it to a two-node cluster either, because a two-node cluster requires a lock disk or quorum server.
This was nothing more and nothing less than pilot error; learn from this and move on.
07-01-2005 02:13 AM
Re: SG panics when half cluster is down.
The logic behind Serviceguard is explained in the "Managing Serviceguard" manual, which is available to anyone on the internet at this location:
http://docs.hp.com/en/ha.html#Serviceguard
The behavior of Serviceguard when half of the nodes leave unexpectedly is described in the section titled:
"Cluster Quorum to Prevent Split-Brain Syndrome"
--- which is at this link:
http://docs.hp.com/en/B3936-90065/ch03s02.html#d0e1810