12-21-2004 06:25 AM
serviceguard problem
We just had an awful crash.
We have a 2-node RAC cluster with CVM.
One of the nodes had a file table overflow a few days ago that went unnoticed; neither Oracle nor CVM showed any problems. We decided to reboot the node anyway, for good measure. (It was the CVM master node, btw.)
When we shut the node down, the second node took a panic and went down in a flash.
Needless to say, we were not prepared for that, and the RAC database was open on the second node. This caused MAJOR data corruption.
Now we restore.
Why, why, why do these things happen?
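For what it's worth, the file table overflow itself is the kind of thing a trivial watchdog can catch before it bites. On HP-UX, `sar -v` reports kernel table usage, with the file table shown in the file-sz column as "used/limit". A minimal sketch; the sar output below is a made-up sample embedded so the script runs standalone (a real monitor would capture `sar -v 1 1` instead):

```shell
#!/bin/sh
# Sketch: warn before the kernel file table (nfile) overflows.  On
# HP-UX, `sar -v 1 1` reports usage in the file-sz column as
# "used/limit".  The sample output below is illustrative; a real
# monitor would instead do:  SAR_OUT=$(sar -v 1 1)
SAR_OUT='HP-UX node1 B.11.11 U 9000/800    12/21/04

07:00:00 text-sz  ov proc-sz    ov inod-sz    ov  file-sz  ov
07:00:01   N/A   N/A 247/4096    0 1523/8040   0 7980/8192  0'

THRESH=90   # warn at 90% of nfile

# Pull "used/limit" from the file-sz column of the last sample line.
FILES=$(echo "$SAR_OUT" | awk 'END { print $(NF-1) }')
USED=${FILES%/*}
LIMIT=${FILES#*/}
PCT=$((USED * 100 / LIMIT))

if [ "$PCT" -ge "$THRESH" ]; then
    echo "WARNING: file table ${PCT}% full ($FILES); consider raising nfile"
fi
```

Run from cron and mail the WARNING line instead of echoing it, and a quiet overflow like this one never goes unnoticed for days.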
12-21-2004 06:43 AM
Re: serviceguard problem
I'm in the process of building SG in the lab with two ancient D-class servers. The point is to learn how to configure and test the product, and to develop procedures for writing proper monitoring scripts and other configuration pieces.
I'll be able to test this setup without fear of hurting anything, and prior to going live with any SG configuration I will test several failure scenarios.
It does appear that your monitoring scripts are part of the problem, and that the alert log (alert.ora) is not being watched and acted upon. Very simple scripts can check these and email you before trouble happens.
I know you have a crisis now, but once you have everything put back together, write and execute a test plan against this setup.
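A minimal sketch of such a check, assuming Bourne shell and a single alert log path (both assumptions; the demo log below is embedded so the sketch runs standalone, and in production you would point ALERT_LOG at the real alert_&lt;SID&gt;.log and mail rather than echo):

```shell
#!/bin/sh
# Sketch of a trivial Oracle alert-log monitor: report ORA- lines
# appended since the last run.  Log path and state file are
# illustrative assumptions.
ALERT_LOG=${ALERT_LOG:-/tmp/alert_DEMO.log}
STATE=${STATE:-/tmp/alertmon.offset}
rm -f "$STATE"                       # start fresh for the demo only

# Demo data so the sketch is self-contained.
cat > "$ALERT_LOG" <<'EOF'
Completed: ALTER DATABASE OPEN
ORA-00942: table or view does not exist
EOF

LAST=0
if [ -f "$STATE" ]; then LAST=$(cat "$STATE"); fi
SIZE=$(wc -c < "$ALERT_LOG")
if [ "$SIZE" -lt "$LAST" ]; then LAST=0; fi   # log recreated: rescan

NEW=$(tail -c +$((LAST + 1)) "$ALERT_LOG" | grep 'ORA-')
echo "$SIZE" > "$STATE"

# In production, mail instead of echoing, e.g.:
#   echo "$NEW" | mailx -s "ORA- errors on $(hostname)" admin
if [ -n "$NEW" ]; then echo "$NEW"; fi
```

Scheduled every few minutes from cron, even something this crude turns "went unnoticed for days" into "paged within minutes".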
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
12-21-2004 06:52 AM
Re: serviceguard problem
You can't trust it in anything but the simplest situations. If a network segment goes down - poof, you're dead. If a single node is having software problems (CVM) - poof, you're dead. If you get multiple disk failures - poof, you're dead.
All of these have happened to me in various S/G configurations. The solution is usually to disable as much functionality as you can, or to tediously coax the machine into a 'stable state' in which the cluster works OK - until something (software or hardware) changes, at which point all the cluster testing needs to be done again.
It's just bad. Really. It causes more downtime than it saves.
12-21-2004 06:54 AM
Re: serviceguard problem
It may be worth looking closely at what happened, treating it as a learning experience, and fixing whatever did not work or react correctly.
12-21-2004 06:55 AM
Re: serviceguard problem
;^)
Pete
P.S. I do offer my commiseration.
12-21-2004 07:08 AM
Re: serviceguard problem
The fault, dear Brutus, is not in our stars,
But in ourselves.
- Julius Caesar, I.2
I've run one MC/SG production cluster for over 5.5 years without a single package failover that was not intentionally initiated --- and with zero unplanned downtime. If set up correctly, MC/SG is extremely robust. In fact, your carping about losing one network connection and dying is the key: that configuration is not nearly robust enough. Network cable/NIC/switch failures should be considered routine events and should not cause any package failures. Disk replacements should be absolutely routine tasks. The whole point is to have your systems so well configured that MC/SG itself very rarely comes into play.
When a cluster and its packages are well constructed and configured robustly, you should be able (and should actually try) to yank any one thing --- including yanking a server's power cable, yanking a disk out, turning off a network switch, ... --- and the package should continue to function with at most a switch to an alternate node.
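One step of such a pull-the-plug drill can even be scripted: after each yank, confirm that every package is still up. `cmviewcl` is Serviceguard's cluster status command; the sample output below is embedded from memory so the sketch runs standalone (verify the exact layout against your release before relying on the parsing):

```shell
#!/bin/sh
# Drill-step sketch: after yanking a component, fail loudly if any
# Serviceguard package is not "up".  STATUS holds sample cmviewcl
# output (illustrative); a real drill would use:  STATUS=$(cmviewcl)
STATUS='CLUSTER      STATUS
rac_cluster  up

  NODE         STATUS       STATE
  node1        down         unknown
  node2        up           running

    PACKAGE      STATUS       STATE        AUTO_RUN     NODE
    pkg_rac      up           running      enabled      node2'

# Package lines are indented four spaces and start lowercase;
# column 2 is the package STATUS.
BAD=$(echo "$STATUS" | awk '/^    [a-z]/ && $2 != "up"')
if [ -n "$BAD" ]; then
    echo "DRILL FAILED: $BAD"
else
    echo "DRILL OK: all packages up"
fi
```

In the sample, node1 has been powered off and pkg_rac has switched to node2, so the drill step passes --- exactly the "at most a switch to an alternate node" behavior described above.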
12-21-2004 08:15 PM
Re: serviceguard problem
We saw something similar in a 2-node cluster with a lock disk.
The reason was a total loss of heartbeat, and the two nodes started a "fight" for the cluster lock.
There is a 50:50 chance that the "wrong" node gets the cluster lock and the "right" one reboots. Shortly after that, the other node gets TOC'ed because of its own problems.
Maybe this happened to you?
Check the OLDsyslog.log for information.
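A quick way to do that check is to grep the old syslog for cluster-membership and lock-arbitration traffic. The cmcld message texts below are illustrative (exact wording varies by Serviceguard release), and a sample log is embedded so the sketch runs standalone; point LOG at the real /var/adm/syslog/OLDsyslog.log:

```shell
#!/bin/sh
# Sketch: pull cluster-membership evidence out of the old syslog
# after an incident.  The cmcld message texts are illustrative.
LOG=${LOG:-/tmp/OLDsyslog.demo}
cat > "$LOG" <<'EOF'
Dec 21 06:20:01 node2 cmcld: Timed out node node1. It may have failed.
Dec 21 06:20:02 node2 cmcld: Attempting to form a new cluster
Dec 21 06:20:03 node2 cmcld: Obtaining Cluster Lock
EOF

# Membership changes, lock arbitration, and TOC hints in one pass.
grep -E 'cmcld|Cluster Lock|Timed out|TOC' "$LOG"
```

Do this on both nodes and line the timestamps up; which node grabbed the lock, and when, usually settles the "who shot whom" question.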
12-24-2004 03:00 AM
Re: serviceguard problem
The other node should have survived the shutdown. Though it is too late to prevent the disaster this time, investigate the OLDsyslog.log on both servers (and potentially the package control logs) to try to determine the source of the problem.
As for the issue of proper cluster arbitration when all HB NICs suffer an outage on one node, the online manual discusses how to use the Serial Heartbeat feature to prevent the "dead" node from winning the cluster lock disk arbitration race:
http://docs.hp.com/en/B3936-90079/ch02s02.html
Caveats apply - search the manual for all references to the Serial Heartbeat concepts.
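For reference, the serial heartbeat is declared per node in the cluster ASCII configuration file produced by cmquerycl. A fragment with hypothetical node names and device files (verify both the device names and the SERIAL_DEVICE_FILE syntax against your Serviceguard release):

```text
# Fragment of a cluster ASCII configuration with a serial heartbeat
# line added per node.  Node names and device files are assumptions.
NODE_NAME               node1
  NETWORK_INTERFACE     lan0
    HEARTBEAT_IP        10.0.0.1
  SERIAL_DEVICE_FILE    /dev/tty0p0

NODE_NAME               node2
  NETWORK_INTERFACE     lan0
    HEARTBEAT_IP        10.0.0.2
  SERIAL_DEVICE_FILE    /dev/tty0p0
```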
-StephenD.