12-21-2004 06:25 AM
serviceguard problem
We just had an awful crash.
We have a 2-node RAC cluster with CVM.
One of the nodes had a file table overflow a few days ago that went unnoticed; neither Oracle nor CVM showed any problems. We decided to reboot the node anyway, for good measure. (It was the CVM master node, btw.)
When we shut the node down, the second node took a panic and went down in a flash.
Needless to say, we were not prepared for that, and the RAC database was open on the second node. This caused MAJOR data corruption.
Now we restore.
Why, why, why do these things happen?
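For what it's worth, the file table overflow itself is the kind of thing a trivial watchdog can catch before it bites. On HP-UX, `sar -v` reports kernel table usage, with the file table shown in the file-sz column as "used/limit". A minimal sketch; the sar output below is a made-up sample embedded so the script runs standalone (a real monitor would capture `sar -v 1 1` instead):

```shell
#!/bin/sh
# Sketch: warn before the kernel file table (nfile) overflows.  On
# HP-UX, `sar -v 1 1` reports usage in the file-sz column as
# "used/limit".  The sample output below is illustrative; a real
# monitor would instead do:  SAR_OUT=$(sar -v 1 1)
SAR_OUT='HP-UX node1 B.11.11 U 9000/800    12/21/04

07:00:00 text-sz  ov proc-sz    ov inod-sz    ov  file-sz  ov
07:00:01   N/A   N/A 247/4096    0 1523/8040   0 7980/8192  0'

THRESH=90   # warn at 90% of nfile

# Pull "used/limit" from the file-sz column of the last sample line.
FILES=$(echo "$SAR_OUT" | awk 'END { print $(NF-1) }')
USED=${FILES%/*}
LIMIT=${FILES#*/}
PCT=$((USED * 100 / LIMIT))

if [ "$PCT" -ge "$THRESH" ]; then
    echo "WARNING: file table ${PCT}% full ($FILES); consider raising nfile"
fi
```

Run from cron and mail the WARNING line instead of echoing it, and a quiet overflow like this one never goes unnoticed for days.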
12-21-2004 06:43 AM
Re: serviceguard problem
I'm in the process of building SG in the lab with two ancient D-class servers. The point is to learn how to configure and test the product, and to develop procedures for writing proper monitoring scripts and other configuration pieces.
I'll be able to test this setup without fear of hurting anything, and prior to going live with any SG configuration I will test several failure scenarios.
It does appear that your monitoring scripts are part of the problem, and that the alert log (alert.ora) is not being watched and acted upon. Very simple scripts can check these and email you before trouble happens.
I know you have a crisis now, but once you have everything put back together, write and execute a test plan against this setup.
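A minimal sketch of such a check, assuming Bourne shell and a single alert log path (both assumptions; the demo log below is embedded so the sketch runs standalone, and in production you would point ALERT_LOG at the real alert_&lt;SID&gt;.log and mail rather than echo):

```shell
#!/bin/sh
# Sketch of a trivial Oracle alert-log monitor: report ORA- lines
# appended since the last run.  Log path and state file are
# illustrative assumptions.
ALERT_LOG=${ALERT_LOG:-/tmp/alert_DEMO.log}
STATE=${STATE:-/tmp/alertmon.offset}
rm -f "$STATE"                       # start fresh for the demo only

# Demo data so the sketch is self-contained.
cat > "$ALERT_LOG" <<'EOF'
Completed: ALTER DATABASE OPEN
ORA-00942: table or view does not exist
EOF

LAST=0
if [ -f "$STATE" ]; then LAST=$(cat "$STATE"); fi
SIZE=$(wc -c < "$ALERT_LOG")
if [ "$SIZE" -lt "$LAST" ]; then LAST=0; fi   # log recreated: rescan

NEW=$(tail -c +$((LAST + 1)) "$ALERT_LOG" | grep 'ORA-')
echo "$SIZE" > "$STATE"

# In production, mail instead of echoing, e.g.:
#   echo "$NEW" | mailx -s "ORA- errors on $(hostname)" admin
if [ -n "$NEW" ]; then echo "$NEW"; fi
```

Scheduled every few minutes from cron, even something this crude turns "went unnoticed for days" into "paged within minutes".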
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
12-21-2004 06:52 AM
Re: serviceguard problem
You can't trust it in anything but the simplest situations. If a network segment goes down - poof, you're dead. If a single node is having software problems (CVM) - poof, you're dead. If you get multiple disk failures - poof, you're dead.
All of these have happened to me in various S/G configurations. The solution is usually to disable as much functionality as you can, or to tediously coax the machine into a 'stable state' in which the cluster works OK - until something (software or hardware) changes, at which point all the cluster testing needs to be done again.
It's just bad. Really. It causes more downtime than it saves.
12-21-2004 06:54 AM
Re: serviceguard problem
It may be worth looking closely at what happened, treating it as a learning experience, and fixing whatever did not work or react correctly.
12-21-2004 06:55 AM
Re: serviceguard problem
;^)
Pete
P.S. I do offer my commiseration.
12-21-2004 07:08 AM
Re: serviceguard problem
The fault, dear Brutus, is not in our stars,
But in ourselves.
- Julius Caesar, I.2
I've run one MC/SG production cluster for over 5.5 years without a single package failover that was not intentionally initiated --- and with zero unplanned downtime. If set up correctly, MC/SG is extremely robust. In fact, your carping about losing one network connection and dying is the key: that configuration is not nearly robust enough. Network cable/NIC/switch failures should be considered routine events and should not cause any package failures. Disk replacements should be absolutely routine tasks. The whole point is to have your systems so well configured that MC/SG itself very rarely comes into play.
When a cluster and its packages are well constructed and configured robustly, you should be able (and should actually try) to yank any one thing --- including yanking a server's power cable, yanking a disk out, turning off a network switch, ... --- and the package should continue to function with at most a switch to an alternate node.
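One step of such a pull-the-plug drill can even be scripted: after each yank, confirm that every package is still up. `cmviewcl` is Serviceguard's cluster status command; the sample output below is embedded from memory so the sketch runs standalone (verify the exact layout against your release before relying on the parsing):

```shell
#!/bin/sh
# Drill-step sketch: after yanking a component, fail loudly if any
# Serviceguard package is not "up".  STATUS holds sample cmviewcl
# output (illustrative); a real drill would use:  STATUS=$(cmviewcl)
STATUS='CLUSTER      STATUS
rac_cluster  up

  NODE         STATUS       STATE
  node1        down         unknown
  node2        up           running

    PACKAGE      STATUS       STATE        AUTO_RUN     NODE
    pkg_rac      up           running      enabled      node2'

# Package lines are indented four spaces and start lowercase;
# column 2 is the package STATUS.
BAD=$(echo "$STATUS" | awk '/^    [a-z]/ && $2 != "up"')
if [ -n "$BAD" ]; then
    echo "DRILL FAILED: $BAD"
else
    echo "DRILL OK: all packages up"
fi
```

In the sample, node1 has been powered off and pkg_rac has switched to node2, so the drill step passes --- exactly the "at most a switch to an alternate node" behavior described above.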
12-21-2004 08:15 PM
Re: serviceguard problem
We saw something similar in a 2-node cluster with a lock disk.
The reason was a total loss of heartbeat, and the two nodes started a "fight" for the cluster lock.
There is a 50:50 chance that the "wrong" node gets the cluster lock and the "right" one reboots. Shortly after that, the other node gets TOC'ed because of its own problems.
Maybe this happened to you?
Check the OLDsyslog.log for information.
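A quick way to do that check is to grep the old syslog for cluster-membership and lock-arbitration traffic. The cmcld message texts below are illustrative (exact wording varies by Serviceguard release), and a sample log is embedded so the sketch runs standalone; point LOG at the real /var/adm/syslog/OLDsyslog.log:

```shell
#!/bin/sh
# Sketch: pull cluster-membership evidence out of the old syslog
# after an incident.  The cmcld message texts are illustrative.
LOG=${LOG:-/tmp/OLDsyslog.demo}
cat > "$LOG" <<'EOF'
Dec 21 06:20:01 node2 cmcld: Timed out node node1. It may have failed.
Dec 21 06:20:02 node2 cmcld: Attempting to form a new cluster
Dec 21 06:20:03 node2 cmcld: Obtaining Cluster Lock
EOF

# Membership changes, lock arbitration, and TOC hints in one pass.
grep -E 'cmcld|Cluster Lock|Timed out|TOC' "$LOG"
```

Do this on both nodes and line the timestamps up; which node grabbed the lock, and when, usually settles the "who shot whom" question.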
12-24-2004 03:00 AM
Re: serviceguard problem
The other node should have survived the shutdown. Though it is too late to prevent the disaster this time, investigate the OLDsyslog.log on both servers (and potentially the package control logs) to try to determine the source of the problem.
As for the issue of proper cluster arbitration when all HB NICs suffer an outage on one node, the online manual discusses how to use the Serial Heartbeat feature to prevent the "dead" node from winning the cluster lock disk arbitration race:
http://docs.hp.com/en/B3936-90079/ch02s02.html
Caveats apply - search the manual for all references to the Serial Heartbeat concepts.
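For reference, the serial heartbeat is declared per node in the cluster ASCII configuration file produced by cmquerycl. A fragment with hypothetical node names and device files (verify both the device names and the SERIAL_DEVICE_FILE syntax against your Serviceguard release):

```text
# Fragment of a cluster ASCII configuration with a serial heartbeat
# line added per node.  Node names and device files are assumptions.
NODE_NAME               node1
  NETWORK_INTERFACE     lan0
    HEARTBEAT_IP        10.0.0.1
  SERIAL_DEVICE_FILE    /dev/tty0p0

NODE_NAME               node2
  NETWORK_INTERFACE     lan0
    HEARTBEAT_IP        10.0.0.2
  SERIAL_DEVICE_FILE    /dev/tty0p0
```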
-StephenD.