topic serviceguard problem in Operating System - HP-UX

serviceguard problem

uvc — Tue, 21 Dec 2004 14:25:46 GMT

Hi,

We just had an awful crash:
We have a 2-node RAC cluster with CVM.
One of the nodes had a file table overflow a few days ago that went unnoticed.
Neither Oracle nor CVM had shown any problems. However, we decided to reboot the node for good measure. (It was the CVM master node, btw.)
While that node was shutting down, the second node panicked and went down in a flash.
Needless to say, we were not prepared for it and the RAC was open on the second node. This caused MAJOR data corruption.

Now we restore.

Why, why, why do these things happen?

Re: serviceguard problem

Steven E. Protter — Tue, 21 Dec 2004 14:43:04 GMT

These things generally happen because ServiceGuard configuration is inadequate and not tested.

I'm in the process of building SG in the lab with two ancient D class servers. The point is to learn how to configure and test the product and develop procedures for coding proper monitoring scripts and other configurations.

I will be able to test this setup without fear of hurting anything. Prior to going live with any SG configuration I will test several failure scenarios.

It does appear that your monitor scripts are a problem, and the alert.ora logs are not being looked at and acted upon. Very simple scripts can check these and email you before trouble happens.
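A minimal sketch of the kind of check described above, written in Python. It scans log lines for Oracle `ORA-nnnnn` error codes and mails whatever it finds. The function names, the five-digit pattern, the addresses, and the SMTP host are all assumptions for illustration; none of this is part of Serviceguard or Oracle tooling.

```python
import re
import smtplib
from email.message import EmailMessage

# Assumption: errors of interest appear as "ORA-" followed by five digits.
ORA_ERROR = re.compile(r"ORA-\d{5}")


def find_ora_errors(lines):
    """Return the lines that contain an ORA-nnnnn error code."""
    return [line.rstrip("\n") for line in lines if ORA_ERROR.search(line)]


def build_alert(errors, sender, recipient, host):
    """Build a plain-text mail message listing the matching log lines."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(errors)} ORA- error(s) in alert log on {host}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(errors))
    return msg


def check_alert_log(path, sender, recipient, host, smtp_host="localhost"):
    """Scan the log at `path`, mail any ORA- lines, and return their count."""
    with open(path, errors="replace") as fh:
        errors = find_ora_errors(fh)
    if errors:
        # Assumption: a mail relay is listening on smtp_host, port 25.
        with smtplib.SMTP(smtp_host) as smtp:
            smtp.send_message(build_alert(errors, sender, recipient, host))
    return len(errors)
```

Something like this could be run from cron every few minutes. As written it rereads the whole file each time, so a real script would also need to remember how far it had read to avoid reporting the same lines again.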

I know you have a crisis now, but once you have everything put back together, write and execute a test plan on this setup.

SEP

Re: serviceguard problem

uvc — Tue, 21 Dec 2004 14:52:15 GMT

What really annoys me is the 'solidarity' of the nodes in serviceguard.
You can't trust it in anything but the simplest situations. If a network segment goes down - poof, you're dead. If a single node is experiencing software problems (CVM) - poof, you're dead. If you get multiple disk failures - poof, you're dead.

All these have happened to me in various S/G configurations. The solution is usually to disable as much functionality as you can, or to tediously get the machine to a 'stable state' in which the cluster works OK - until something (software or hardware) changes, at which point all the cluster testing needs to be done again.

It's just bad. Really. It causes more downtime than it saves.

Re: serviceguard problem

melvyn burnard — Tue, 21 Dec 2004 14:54:36 GMT

Sounds like you do not have your monitoring or configuration set correctly.
It may be worth looking at what happened, using it as a learning experience, and fixing whatever did not work or react correctly.

Re: serviceguard problem

Pete Randall — Tue, 21 Dec 2004 14:55:54 GMT

Gee, I thought this thread had a different subject the first time I saw it!!!

;^)

Pete

P.S. I do offer my commiseration.

Re: serviceguard problem

A. Clay Stephenson — Tue, 21 Dec 2004 15:08:29 GMT

Men at some time are masters of their fates;
The fault is not in our stars,
But in ourselves.

Julius Caesar - I. 2.

I've run one MC/SG production cluster for over 5.5 years without a single package failover that was not intentionally initiated --- and with zero unplanned downtime. If set up correctly, MC/SG is extremely robust. In fact, your carping about losing one network connection and dying is the key. Your configuration is not nearly robust enough. Network cable/NIC/switch failures should be considered routine events and should not cause any failures. Disk replacements should be considered absolutely routine tasks. The whole point is to have your systems so well configured that MC/SG itself very rarely comes into play.

When a cluster and its packages are well constructed and configured robustly, you should be able (and should actually try) to yank any one thing --- including yanking a server's power cable, yanking a disk out, turning off a network switch, ... --- and the package should continue to function with at most a switch to an alternate node.

Re: serviceguard problem

Armin Kunaschik — Wed, 22 Dec 2004 04:15:17 GMT

I saw such "solidarity" problems on a 2-node cluster with a lock disk.
The reason was a total loss of heartbeat, and the 2 nodes started a "fight" for the cluster lock.
There is a 50:50 chance that the "wrong" node gets the cluster lock and the "right" one reboots. Shortly after that the other node gets TOC'ed because of its own problems.

Maybe this happened to you?
Check the OLDsyslog.log for information.

Re: serviceguard problem

Stephen Doud — Fri, 24 Dec 2004 11:00:11 GMT

My condolences on the data corruption.
The other node should have survived the shutdown. Though it is too late to prevent the disaster this time, investigate the OLDsyslog.log on both servers (and potentially the package control logs) to try to determine the source of the problem.
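When comparing the logs from both servers, it can help to pull out only the cluster daemon's messages and merge them into one timeline. A rough Python sketch follows, assuming that the relevant lines mention the Serviceguard daemon name `cmcld` and begin with a standard syslog timestamp (`Mon DD HH:MM:SS`). The function names and the `year` parameter are hypothetical; syslog lines carry no year, so it has to be supplied.

```python
from datetime import datetime


def cluster_events(lines, node, year, tag="cmcld"):
    """Return (timestamp, node, line) for each syslog line mentioning `tag`."""
    events = []
    for line in lines:
        if tag not in line:
            continue
        try:
            # The first 15 characters of a syslog line are the timestamp.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a standard syslog timestamp; skip the line
        events.append((stamp, node, line.rstrip("\n")))
    return events


def merged_timeline(logs_by_node, year):
    """Merge the tagged events from several nodes, oldest first."""
    events = []
    for node, lines in logs_by_node.items():
        events.extend(cluster_events(lines, node, year))
    return sorted(events)
```

To use it, read each node's copy of OLDsyslog.log into a list of lines and pass them in as a dictionary keyed by node name. The ordering is only as good as the clock synchronisation between the two nodes, so treat events a second or two apart as simultaneous.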

As for the issue of proper cluster arbitration when all HB NICs suffer an outage on one node, the online manual discusses how to use the Serial Heartbeat feature to prevent the "dead" node from winning the cluster lock disk arbitration race:
http://docs.hp.com/en/B3936-90079/ch02s02.html
Caveats apply - search the manual for all references to the Serial Heartbeat concepts.

-StephenD.