
testing Serviceguard

 
Hanry Zhou
Super Advisor

testing Serviceguard

I know how to test other things like heartbeat or LAN connections, but what about CPU and memory?

For CPU, I can issue a shutdown command while the package is running on this server, but what about something like a "power down" server test? I don't think we should perform that test, because it is dangerous for the server, right?

For memory, what can we test?

Thanks,
Roger
7 REPLIES
melvyn burnard
Honored Contributor

Re: testing Serviceguard

No, powering off the server is a good and valid test.
You could also try to force a TOC by killing the cmcld process.
There is no real test you should do for the memory.
The OS should take care of that and HPMC or panic the box (the same as a TOC, really).

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
A. Clay Stephenson
Acclaimed Contributor

Re: testing Serviceguard

One of the normal tests in an MC/SG environment is the failure of a node. You do this by yanking the power cord(s). If your box isn't robust enough to handle this then you shouldn't be using it for MC/SG. This isn't nearly as dangerous as it sounds. The most critical filesystem for booting is /stand and after booting it is essentially untouched; it is also read-only for all intents and purposes. /stand is generally the only hfs filesystem on the box; the other filesystems will be vxfs which are very robust and have the intent log so that their fsck's are both quick and almost always successful w/o any intervention.

While yanking the power cord is not intended to be a replacement for the shutdown command, in almost all cases the system will reboot almost normally. Bear in mind this is exactly the kind of event that can happen in real life. It is also a good test of how well applications like databases survive a crash.

If memory problems aren't severe enough to panic the machine then they shouldn't be a problem; if they are severe enough to induce a panic then the box will TOC and you are essentially back to yanking the power cord. There is really no way to test for bad memory.
If it ain't broke, I can fix that.
Carl Munnelly
Frequent Advisor

Re: testing Serviceguard

The only true way to test Serviceguard is to force one of the nodes to TOC. Pulling the power cords is a true test of failure, and it simulates a system failure (memory, CPU, or application).
Steven E. Protter
Exalted Contributor

Re: testing Serviceguard

Shalom Roger,

In HP education classes, failure is simulated by cutting off the heartbeat LAN cable, i.e. unplugging it.

You induce split brain syndrome and make sure one of the two systems does a TOC, which is a pretty hard crash.

You need to test loss of access to shared disk while the cluster is running normally, to make sure that packages configured to fail over actually do so and run correctly.

You can do a user test: fail a node using the methods above and see what kind of delays the users experience.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Hanry Zhou
Super Advisor

Re: testing Serviceguard

What is supposed to happen when I lose access to the shared disks?

I have two HBA cards connected to the shared SAN storage. As the first step of the test, I am going to disconnect both cables. At this point, I would expect the package to fail over to the second node, but I would like to know in a little more depth what exactly causes the failover. Thanks,

Re: testing Serviceguard

Pulling both SAN connections probably won't invoke a failover unless you have set up an EMS monitor to watch for this. If you have ITRC access, look at article UMCSGKBRC00012483 for more details.

A lot of people decide not to set up an EMS monitor for the disk access because:

i) You have 2 connections to your data, so it's not a single point of failure anyway.

ii) If you do use the EMS disk monitoring, you have to set NODE_FAIL_FAST_ENABLED to YES, which means that on a package failure the node just TOCs rather than trying to do a graceful shutdown/switchover of the package. Why is this? Well, if you think about the situation where all disk I/O is effectively hung because of outstanding I/Os, there's no way that the package stop process is going to be able to stop your application and unmount the filesystems, so the only thing to do is a hard reset (TOC).

HTH

Duncan

I am an HPE Employee
A. Clay Stephenson
Acclaimed Contributor

Re: testing Serviceguard

Your testing method is flawed because it is extremely unlikely that you would lose access to all shared disks over multiple data paths. MC/SG is intended to deal with SPOFs, and you are throwing MPOFs at it. You "break" one thing and one thing only at a time -- even though this may be an entire server. In your case, you should have removed one SCSI connection, and that should have been nothing more than an LVM failover w/o MC/SG ever coming into play. Likewise, you remove 1 network cable, power off one network switch, power off one Fabric switch, etc.
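
The "one fault at a time" rule can be sketched as a small check on a test plan. This is only an illustrative Python sketch, not anything MC/SG provides: the `FailureTest` class, the component names (`hba0_link`, `node1`, and so on) and the helper functions are all hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTest:
    name: str
    broken: tuple  # components deliberately failed in this test (hypothetical names)


def is_single_point_test(test):
    """A valid test breaks exactly one component (an SPOF, not an MPOF)."""
    return len(test.broken) == 1


# Hypothetical plan: the single-fault tests named in the thread,
# plus the "pull both SAN cables" test that breaks two things at once.
PLAN = [
    FailureTest("pull one SAN cable", ("hba0_link",)),
    FailureTest("pull one network cable", ("lan0_link",)),
    FailureTest("power off one network switch", ("net_switch_a",)),
    FailureTest("power off one fabric switch", ("fabric_switch_a",)),
    FailureTest("power off one node", ("node1",)),
    FailureTest("pull both SAN cables", ("hba0_link", "hba1_link")),
]


def split_plan(plan):
    """Return (valid, rejected) test names according to the one-fault rule."""
    valid = [t.name for t in plan if is_single_point_test(t)]
    rejected = [t.name for t in plan if not is_single_point_test(t)]
    return valid, rejected


if __name__ == "__main__":
    valid, rejected = split_plan(PLAN)
    print("run these:", valid)
    print("rethink these:", rejected)
```

In this plan the only rejected entry is "pull both SAN cables", because it breaks two components in the same test.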

And even if someone was taught in a class to remove cables to simulate a node failure, that is really not valid. Consider the case where the redundant LAN connections have been yanked but both nodes in a 2-node cluster can access all the disks. The behavior is unpredictable because the box with the cables yanked could very easily be the one that acquires the lock -- so that you have an unusable cluster.
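
As a rough illustration of why the result depends on luck, here is a toy Python model of the lock race in a 2-node cluster. It assumes only what is described in this thread: once the heartbeat is lost, whichever node acquires the lock survives and the other node TOCs. The function and node names are hypothetical, and this is not how MC/SG is implemented.

```python
NODES = ("node1", "node2")  # hypothetical node names


def split_brain_outcome(lock_winner, isolated_node):
    """Toy model: the lock winner survives and the other node TOCs.

    The cluster counts as usable only if the survivor is not the node
    whose LAN cables were pulled.
    """
    if lock_winner not in NODES or isolated_node not in NODES:
        raise ValueError("unknown node")
    loser = NODES[1] if lock_winner == NODES[0] else NODES[0]
    return {
        "survivor": lock_winner,
        "toc": loser,
        "usable": lock_winner != isolated_node,
    }


def all_outcomes(isolated_node):
    """Enumerate both possible lock winners for a given isolated node."""
    return [split_brain_outcome(winner, isolated_node) for winner in NODES]


if __name__ == "__main__":
    for outcome in all_outcomes("node1"):
        print(outcome)
```

With the cables pulled from node1, one of the two possible outcomes leaves node1 as the survivor, which is the unusable case described above.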


If it ain't broke, I can fix that.