<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: fail over test in Operating System - HP-UX</title>
    <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264731#M569296</link>
    <description>Ross,&lt;BR /&gt;&lt;BR /&gt;I have an 11.0 server in the exact same configuration you have and it is running pretty smoothly without any problems. It is one of our legacy applications which has not been migrated to 11i. &lt;BR /&gt;We have however had numerous problems with the VA and have stopped using it in favour of an XP1024. &lt;BR /&gt;Let me poke around my server and see if I can come up with something you can use.</description>
    <pubDate>Fri, 30 Apr 2004 17:30:19 GMT</pubDate>
    <dc:creator>hari jayaram_1</dc:creator>
    <dc:date>2004-04-30T17:30:19Z</dc:date>
    <item>
      <title>fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264726#M569291</link>
      <description>We did a fail-over test of an N4000 server running Informix Dynamic Server 7.3, connected by two FC wires to two Brocade switches with redundant paths going to a VA7400 SAN box. &lt;BR /&gt;We pulled out one controller from the SAN box and crashed the database.&lt;BR /&gt;We then pulled out the other SAN controller and crashed the database again.&lt;BR /&gt;Then we turned off one of the Brocade switches and crashed the database; after that we did the same thing to the other Brocade switch and once &lt;BR /&gt;again the database came down.&lt;BR /&gt;Doing a pvdisplay -v on /dev/vg05 and vg06 shows that alternate paths exist, so does anyone know why Informix, running on HP-UX 11.0, would not fail over properly?</description>
      <pubDate>Fri, 30 Apr 2004 16:52:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264726#M569291</guid>
      <dc:creator>ROSS HANSON</dc:creator>
      <dc:date>2004-04-30T16:52:04Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264727#M569292</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;Could you please advise if you are seeing any error messages?</description>
      <pubDate>Fri, 30 Apr 2004 17:01:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264727#M569292</guid>
      <dc:creator>hari jayaram_1</dc:creator>
      <dc:date>2004-04-30T17:01:53Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264728#M569293</link>
      <description>Hi Ross,&lt;BR /&gt;&lt;BR /&gt;You're under the assumption that a disk I/O failure should cause a failover. Not gonna happen! MC/SG needs to shut the SW down when it wants to fail over &amp;amp; how is it gonna do that if its I/O is gone? The same kind of rule applies to the CPUs - BUT in that case the box is gonna panic &amp;amp; TOC &amp;amp; THEN the other node will pick up the ball &amp;amp; start up the package. Ditto with power supplies.&lt;BR /&gt;Bottom line is that MC/SG is *not* a continuously available solution - it's *highly* available. For continuous you need to spend *far* more $ with HP. They can do it, but not for the price of MC/SG. &lt;BR /&gt;&lt;BR /&gt;My $0.02,&lt;BR /&gt;Jeff</description>
      <pubDate>Fri, 30 Apr 2004 17:02:14 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264728#M569293</guid>
      <dc:creator>Jeff Schussele</dc:creator>
      <dc:date>2004-04-30T17:02:14Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264729#M569294</link>
      <description>I guess I should clarify that a little more...&lt;BR /&gt;When disk I/O goes away a TOC is not the reaction - a hang will occur. IF a TOC finally happens - unlikely - then the failover will occur. &lt;BR /&gt;SO the moral is: only failures that cause TOCs, or failures of monitored HW resources (hint: LAN), will cause a failover.&lt;BR /&gt;NOW... there's nothing stopping you from setting up a monitor script that will watch the disk I/O &amp;amp; IF it sees a failure of *ONE* channel then cause a manual failover. But still - even a monitor script can't save you if BOTH channels fail, because you'd still need to cause a TOC &amp;amp; you can't do that from the command line.&lt;BR /&gt;&lt;BR /&gt;Rgds,&lt;BR /&gt;Jeff</description>
      <pubDate>Fri, 30 Apr 2004 17:12:07 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264729#M569294</guid>
      <dc:creator>Jeff Schussele</dc:creator>
      <dc:date>2004-04-30T17:12:07Z</dc:date>
    </item>
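    Jeff's monitor-script idea above could be sketched roughly like this (a sketch only, assuming HP-UX with LVM PVLinks; the device file, poll interval, and alert action are placeholders, not anything from the thread):

    ```shell
    #!/usr/bin/sh
    # Hypothetical channel monitor: poll the primary PV path and raise an
    # alert when it stops answering, so an operator can force a failover.
    # PRIMARY and INTERVAL are placeholders - substitute your own values.
    PRIMARY=/dev/dsk/c5t2d3
    INTERVAL=30

    while true
    do
        # pvdisplay returns non-zero when the path cannot be queried
        if pvdisplay $PRIMARY 1>/dev/null 2>/dev/null
        then
            : # path is healthy, nothing to do
        else
            # placeholder action: mail the operators (or call cmhaltpkg and
            # cmrunpkg here if Serviceguard is in use)
            echo "primary path $PRIMARY lost" | mailx -s "PV path failure" root
        fi
        sleep $INTERVAL
    done
    ```

    As Jeff notes, a script like this only helps with a single-channel failure; it cannot rescue you once both channels are gone.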
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264730#M569295</link>
      <description>We receive no error messages as far as the O/S is concerned. We only receive the messages from Informix saying Assert Failure I/O,&lt;BR /&gt;and then I am told which chunk is going down.&lt;BR /&gt;&lt;BR /&gt;So, we cannot get the O/S to go down different &lt;BR /&gt;paths if a controller goes out, without spending &lt;BR /&gt;more money, huh?&lt;BR /&gt;By the way, we do not run Service Guard here.</description>
      <pubDate>Fri, 30 Apr 2004 17:25:24 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264730#M569295</guid>
      <dc:creator>ROSS HANSON</dc:creator>
      <dc:date>2004-04-30T17:25:24Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264731#M569296</link>
      <description>Ross,&lt;BR /&gt;&lt;BR /&gt;I have an 11.0 server in the exact same configuration you have and it is running pretty smoothly without any problems. It is one of our legacy applications which has not been migrated to 11i. &lt;BR /&gt;We have however had numerous problems with the VA and have stopped using it in favour of an XP1024. &lt;BR /&gt;Let me poke around my server and see if I can come up with something you can use.</description>
      <pubDate>Fri, 30 Apr 2004 17:30:19 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264731#M569296</guid>
      <dc:creator>hari jayaram_1</dc:creator>
      <dc:date>2004-04-30T17:30:19Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264732#M569297</link>
      <description>Sorry - when you said failover I thought you implied Service Guard.&lt;BR /&gt;You need to check the LVM setup of the VGs that contain that data:&lt;BR /&gt;vgdisplay -v /dev/vg_name&lt;BR /&gt;there *must* be an alternate link there for LVM to handle the link loss.&lt;BR /&gt;If not you need to vgextend that VG to use that extra link.&lt;BR /&gt;You should see something at the end of the output like:&lt;BR /&gt;/dev/dsk/c5t2d3&lt;BR /&gt;/dev/dsk/c9t2d3 Alternate Link&lt;BR /&gt;If you don't - you don't have that LVM protection.&lt;BR /&gt;&lt;BR /&gt;Rgds,&lt;BR /&gt;Jeff&lt;BR /&gt;</description>
      <pubDate>Fri, 30 Apr 2004 17:31:29 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264732#M569297</guid>
      <dc:creator>Jeff Schussele</dc:creator>
      <dc:date>2004-04-30T17:31:29Z</dc:date>
    </item>
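    Jeff's check and fix above, spelled out as commands (the VG and device names are the examples from his post, not an actual configuration):

    ```shell
    # List the physical volumes in the VG; an alternate link shows up as a
    # second device file flagged "Alternate Link" at the end of the output
    vgdisplay -v /dev/vg05

    # If a LUN shows only one path, add its second device file to the VG so
    # LVM has a PVLink to retry on (c9t2d3 is Jeff's example device)
    vgextend /dev/vg05 /dev/dsk/c9t2d3
    ```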
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264733#M569298</link>
      <description>Hi Ross,&lt;BR /&gt;&lt;BR /&gt;I saw this in one of the forum threads&lt;BR /&gt;&lt;A href="http://forums1.itrc.hp.com/service/forums/questionanswer.do?admit=716493758+1083364406421+28353475&amp;amp;threadId=215197" target="_blank"&gt;http://forums1.itrc.hp.com/service/forums/questionanswer.do?admit=716493758+1083364406421+28353475&amp;amp;threadId=215197&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;We have discovered the problem here.&lt;BR /&gt;&lt;BR /&gt;Linkdown-tmo was set to 60 and no_device_delay was set to 30.&lt;BR /&gt;&lt;BR /&gt;The combination of these two delays caused the PVLinks to get into such a state that they never failed over.&lt;BR /&gt;&lt;BR /&gt;I waited for one hour before resetting the kernel parameters to defaults.&lt;BR /&gt;&lt;BR /&gt;All is now working properly.&lt;BR /&gt;&lt;BR /&gt;Thanks for everyone's input.</description>
      <pubDate>Fri, 30 Apr 2004 17:35:39 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264733#M569298</guid>
      <dc:creator>hari jayaram_1</dc:creator>
      <dc:date>2004-04-30T17:35:39Z</dc:date>
    </item>
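    The two parameters named in the quoted thread can be inspected with kmtune on HP-UX 11.0 (a sketch only; the exact parameter spellings vary by driver revision, so verify the names on your own system before changing anything):

    ```shell
    # Query the delay tunables mentioned in the quoted thread
    kmtune -q linkdown_tmo
    kmtune -q no_device_delay
    ```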
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264734#M569299</link>
      <description>When utilizing alternate paths the switch from primary to alternate is NOT instantaneous.  There can be a delay of several seconds while I/O times out on one path before it goes down the other path.  &lt;BR /&gt;&lt;BR /&gt;What you might want to look at is HP's Auto Path VA product.  I think that will give you more of the functionality that you want.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://www.hp.com/products1/storage/products/disk_arrays/modular/autopath/index.html" target="_blank"&gt;http://www.hp.com/products1/storage/products/disk_arrays/modular/autopath/index.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 30 Apr 2004 17:36:01 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264734#M569299</guid>
      <dc:creator>Patrick Wallek</dc:creator>
      <dc:date>2004-04-30T17:36:01Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264735#M569300</link>
      <description>Yes Jeff,&lt;BR /&gt;We do have alternate links already. That is what brought up the whole thing with the fail-over test. As mentioned, we have everything doubled - switches, disks, controllers, etc. - &lt;BR /&gt;and configured to supposedly use alternate links to go to the other side if one should fail. But our database has gone down twice now, because when a SAN controller fails it does not go to the alternate link; it just takes down the database. I thought with the redundancy we have in place that it would "fail over".</description>
      <pubDate>Fri, 30 Apr 2004 17:41:54 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264735#M569300</guid>
      <dc:creator>ROSS HANSON</dc:creator>
      <dc:date>2004-04-30T17:41:54Z</dc:date>
    </item>
    <item>
      <title>Re: fail over test</title>
      <link>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264736#M569301</link>
      <description>OK - then you need to look at the I/O timeout values for either the LV or PV and whether they are longer than what the SW will tolerate. &lt;BR /&gt;For both LV &amp;amp; PV it's the -t value that sets it.&lt;BR /&gt;IF it's higher than the SW can tolerate, then either the HW or SW value needs to change.&lt;BR /&gt;I believe about 90 seconds is the HW default &amp;amp; it may be that the SW pukes &amp;amp; dies at that amount. BUT I would caution you not to decrease it unless you're sure loads will *never* cause that long of a delay - i.e. you may want to lengthen the SW timeout before you shorten the HW timeout.&lt;BR /&gt;&lt;BR /&gt;Rgds,&lt;BR /&gt;Jeff</description>
      <pubDate>Fri, 30 Apr 2004 17:50:37 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-hp-ux/fail-over-test/m-p/3264736#M569301</guid>
      <dc:creator>Jeff Schussele</dc:creator>
      <dc:date>2004-04-30T17:50:37Z</dc:date>
    </item>
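    Jeff's timeout check above might look like this in practice (a sketch; the device file and the 180-second value are illustrative, not values from the thread):

    ```shell
    # Show the current IO timeout on the PV (0 means the driver default,
    # roughly the 90 seconds Jeff mentions)
    pvdisplay /dev/dsk/c5t2d3 | grep -i "IO Timeout"

    # Set an explicit PV timeout in seconds; per Jeff's caution, make sure
    # the application (SW) side can outlast whatever value you pick
    pvchange -t 180 /dev/dsk/c5t2d3
    ```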
  </channel>
</rss>

