Re: hard stopping( failover test) hpux cluster package

CFI-beheer · ‎05-13-2008

Hi All,

In a few weeks I want to test failover on an HP-UX MCSG HA cluster.
What I want to know is whether the following is a good test of failover.

We have two Itanium RX4640 servers which are part of a production cluster. Both run the OS "HP-UX B.11.23".

The nodes are hpux01 and hpux02.
hpux01 is running the hpdb01 cluster package, which contains 20 Oracle databases.
hpux02 is not running any cluster packages and is configured as the failover node for hpux01.
hpux01 is configured to switch only to hpux02.

The hard failover test that we thought of is:

Pull both power cables out of hpux01.
As a result, the hpdb01 cluster package, which is running on hpux01, should fail over to hpux02.

- Is this a good test?
- What will be the result if we pull the power cables without shutting down the system correctly?
- Could this result in filesystem corruption?
- Could it also corrupt the HP-UX 11.23 OS on hpux01?

If this is not a good test of failover, please let me know
what other alternatives there are for doing a hard failover test.

- Would killing the cmcld process be a good failover test?
root 13137 13118 0 Apr 23 ? 64:09 /usr/lbin/cmcld -j

Thanks for your help.

Kind Regards,

Stephen Doud · ‎05-13-2008

Though extreme, a power-failure test is legitimate. HP-UX was designed to recover from a power failure. I suspect Oracle was designed to recover from a power failure as well.

Killing cmcld will force a system to take a memory dump and reboot, which is also a legitimate test, although it will take longer for the server to go through the TOC/reboot cycle (and you should clean up the resulting dump in /var/adm/crash).
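
Whichever of these tests is chosen, it may help to capture the cluster state before and after, so the two can be compared. A minimal sketch, assuming `cmviewcl -v` is available on the node that stays up (hpux02 in this thread); the function name `snapshot_cluster` and the `SNAP_DIR` and `CMVIEWCL` variables are made up for illustration.

```sh
# Save the output of "cmviewcl -v" to a timestamped file and print its path.
# SNAP_DIR and CMVIEWCL are hypothetical overrides, not Serviceguard settings.
snapshot_cluster() {
    label=$1
    dir=${SNAP_DIR:-/var/tmp}
    file="$dir/cmviewcl.$label.$(date +%Y%m%d%H%M%S)"
    "${CMVIEWCL:-/usr/sbin/cmviewcl}" -v > "$file" 2>&1
    echo "$file"
}

# Possible use, run on the surviving node:
#   before=$(snapshot_cluster before)
#   ... pull the power or kill cmcld on the other node, wait for failover ...
#   after=$(snapshot_cluster after)
#   diff "$before" "$after"
```

The diff should then show the package moving from one node to the other, plus anything else that changed unexpectedly.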

A question comes to my mind though, why run all of the databases on one server, leaving the other one totally idle? Wouldn't it be better to distribute packages such that each server were equally loaded?

Rita C Workman · ‎05-13-2008

If you have redundant power, why not check that it's working right? Just drop one power connection: is the box still up?
Then put that plug back and drop the other side of your power. Is the box still holding up? Then, if you want to do a full power loss, drop the remaining power connection.

Another, less forceful test you may wish to add to your list of things to check, might be to drop the network connections and check the results.

You might first check that lan0 fails over to lan1 (or whatever your second LAN failover NIC is), and confirm that it actually did.
Then add the heartbeat for hpvm01 to your network disruption.

You might also add testing your pvlinks or whatever utility you use for I/O redundancy.
Drop one link........is everything still running ok? Put the link back, and then drop the other link.....is everything still running ok?

If you're going to test failover, test every feature you can think of.
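
The "drop one side, check, restore, drop the other side" pattern above can be written down as a checklist so no step is skipped on the day. A small sketch that only prints the steps and touches nothing; the function name `redundancy_steps` and the wording of each step are illustrative assumptions.

```sh
# Print the test steps for one redundant pair (power feeds, NICs, I/O links).
# Usage: redundancy_steps KIND FIRST SECOND
redundancy_steps() {
    kind=$1 first=$2 second=$3
    for side in "$first" "$second"; do
        echo "[$kind] disconnect $side"
        echo "[$kind] check: node and packages still up?"
        echo "[$kind] reconnect $side and wait for it to recover"
    done
}

# Example:
#   redundancy_steps network lan0 lan1
#   redundancy_steps power "supply A" "supply B"
```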

Rgrds,
Rita

Rita C Workman · ‎05-13-2008

...and to add to Stephen's comment about your packages...

I totally agree. I am amazed at how many places out there set up 2 node clusters and totally waste the second node. You could distribute your packages across the two giving failover to each side, thus possibly improving application performance.

Or...another suggestion. Make 1 node your production node and the other your test/dev node. If the production node goes down, it fails over to the test node. Do not give a failover option to your test/dev packages.

When your production packages come up on the test box, at the very start of each production package, just put a simple one-liner to halt its development package FIRST......and voila - dev goes down, production comes up... and both boxes are being fully utilized and mgmt sees a "bang for the buck!!"

The one-liner:

/usr/sbin/cmhaltpkg -n $(/usr/bin/uname -n) dev_pkg

(dev_pkg is a placeholder; put the name of your own dev package there.)

This says that if the dev package is running on this box...turn it off before you start running this production pkg.
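
As a slightly more defensive variant of the one-liner, the halt could be wrapped in a small function called from the production package's start-up script. This is only a sketch: the function name `halt_dev_pkg`, the package name `dev_pkg` and the `CMHALTPKG` override are hypothetical, and it assumes `cmhaltpkg` returns non-zero when it cannot halt the package.

```sh
# Halt the given dev package on the local node before production starts.
# CMHALTPKG is a hypothetical override so the function can be tried
# without a cluster; by default the real command is used.
halt_dev_pkg() {
    pkg=$1
    node=$(uname -n)
    if ! "${CMHALTPKG:-/usr/sbin/cmhaltpkg}" -n "$node" "$pkg"; then
        echo "note: $pkg not halted on $node (perhaps it was not running here)" >&2
    fi
    # Always succeed so that production start-up is not blocked.
    return 0
}

# Example, at the very start of the production package:
#   halt_dev_pkg dev_pkg
```

Always returning 0 is a design choice. If production must not start while dev still holds memory or disks, returning the failure instead would be the safer option.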

Rgrds,
Rita

Armin Kunaschik · ‎05-14-2008

If you like to use the sneaker network, go remove the power cables. The equivalent is "pc -off" on the MP/iLO. You did configure your MP, didn't you? If not, do it!

But this is not the only test case! Most of the time the power test works fine, and everything gets stuck when network errors occur. E.g. the package stop script might hang forever because the application depends on a working network connection while shutting down. Look for potential software problems too, e.g. client issues!
And be critical with your environment and pull any(!!!) cable if in doubt.
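
One way to keep a hung stop command from stalling everything is to bound it with a timer. A minimal sketch in POSIX shell, assuming no `timeout` utility is at hand; the function name `run_with_timeout`, the 1-second polling and the exit code 124 are choices made for illustration, not something Serviceguard provides.

```sh
# Run a command, but terminate it if it is still running after LIMIT seconds.
# Usage: run_with_timeout LIMIT command [args...]
# Returns the command's own exit status, or 124 if it had to be terminated.
run_with_timeout() {
    limit=$1; shift
    "$@" &
    pid=$!
    elapsed=0
    # Poll once per second while the command is still alive.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            kill -TERM "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$pid"
}

# Example (stop_app.sh is a hypothetical application stop script):
#   run_with_timeout 300 /opt/app/bin/stop_app.sh || echo "stop did not finish cleanly"
```

With the network cable pulled, a test run would then show whether the stop command finishes on its own or has to be terminated.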

The other thing is the empty failover node.
As the others state, it's always a good idea to split the application into 2 packages or run more than 1 package on every node.
But don't build asymmetric clusters (e.g. configure the test package to run on 1 node only).
- You'll get different configurations for every cluster node, which makes troubleshooting and recovery difficult.
- Administration (e.g. creation of users) is harder and directory/user id collisions occur more often.
- You'll increase the overall downtime of your production package, because every time your production package fails over you need to switch it back to get the test package up again.

So, from my point of view, always create a cluster with identical configurations, patch levels etc. and configure packages to be able to run on all nodes. It'll make life easier in case of an incident... and you'll find out in the very same second you're responsible for bringing things up again after a crash/failure.

My 2 cents,
Armin

PS: Please assign points if you find answers useful!

And now for something completely different...

Steve Lewis · ‎05-14-2008

Hi,

Please note you are not just testing Serviceguard to see if it works. You are also testing it so that you can learn about the expected behaviour and be prepared for every eventuality. Make a list of every test and document what happens. I can assure you that someone in the future will ask you what will happen under certain circumstances.
The other replies were excellent. I completely agree that you must test gradual failure of network interfaces and of power.
Serviceguard is only designed to protect against a single point of failure (SPOF), so you should also know what happens under various multiple-point-of-failure (MPOF) conditions and how it would react - will it TOC, will it stay up...etc...know the limits of your config.
You must also test failure of a network switch, then several.
Also, know the limits of your Serviceguard protection, for example a single link to an external system.
It is also worth testing to see what happens to the client software - can it cope with a package switchover?

The HA documentation under docs.hp.com gives more detailed information about this vital part of Serviceguard implementations.
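
The list of tests and outcomes suggested above could be kept as a simple text log written during the test window. A minimal sketch; the function name `record_test`, the pipe-separated layout and the log path are assumptions for illustration.

```sh
# Append one line per test: timestamp|node|test|expected|observed
# Usage: record_test LOGFILE TEST EXPECTED OBSERVED
record_test() {
    printf '%s|%s|%s|%s|%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$2" "$3" "$4" >> "$1"
}

# Example:
#   record_test /var/tmp/failover_tests.log \
#       "pull both power cables on hpux01" \
#       "hpdb01 starts on hpux02" \
#       "hpdb01 up on hpux02"
```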
