
fail over test

 
ROSS HANSON
Regular Advisor

fail over test

We did a failover test of an N4000 server running Informix Dynamic Server ver. 7.3, connected by two FC cables to two Brocade switches, with redundant paths going to a VA7400 SAN box.
We pulled one controller out of the SAN box and crashed the database.
We then pulled out the other SAN controller and crashed the database again.
Then we powered off one of the Brocade switches and crashed the database; after that we did the same thing to the other Brocade switch and once
again the database came down.
A vgdisplay -v of /dev/vg05 and /dev/vg06 shows that alternate paths exist, so does anyone know why Informix, running on HP-UX 11.0, would not fail over properly?
Ross Hanson
10 REPLIES
hari jayaram_1
Frequent Advisor

Re: fail over test

Hi,

Could you please advise whether you are seeing any error messages?
Jeff Schussele
Honored Contributor

Re: fail over test

Hi Ross,

You're under the assumption that a disk I/O failure should cause a failover. Not gonna happen! MC/SG needs to shut the SW down when it wants to fail over & how is it gonna do that if its I/O is gone? The same kind of rule applies to the CPUs - BUT in that case the box is gonna panic & TOC & THEN the other node will pick up the ball & start up the package. Ditto with power supplies.
Bottom line is that MC/SG is *not* a continuously available solution - it's *highly* available. For continuous availability you need to spend *far* more $ with HP. They can do it, but not for the price of MC/SG.

My $0.02,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Jeff Schussele
Honored Contributor

Re: fail over test

I guess I should clarify that a little more...
When disk I/O goes away a TOC is not the reaction - a hang will occur. IF a TOC finally happens - unlikely - then the failover will occur.
SO the moral is: only failures that cause TOCs, or failures of monitored HW resources (hint: LAN), will cause a failover.
NOW... there's nothing stopping you from setting up a monitor script that will watch the disk I/O & IF it sees a failure of *ONE* channel then trigger a manual failover. But still - even a monitor script can't save you if BOTH channels fail, because you'd still need to cause a TOC & you can't do that from the command line.
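A minimal sketch of such a monitor, assuming a successful read from the raw device means the path is healthy (the device paths and the pvchange call are illustrative placeholders, not taken from Ross's box):

```shell
#!/bin/sh
# Watchdog sketch: probe a channel by reading one block from its raw device.
# A failed read is treated as a dead path. Paths below are placeholders.

check_path() {
    # dd one block; the exit status tells us whether the path answered
    dd if="$1" of=/dev/null bs=1024 count=1 2>/dev/null
}

PRIMARY=${PRIMARY:-/dev/rdsk/c5t2d3}

if ! check_path "$PRIMARY"; then
    # Alert here, and optionally detach the dead path so LVM moves to the
    # alternate link, e.g. (illustrative):  pvchange -a n /dev/dsk/c5t2d3
    echo "primary path $PRIMARY is not answering"
fi
```

A real version would loop with a sleep and mail the alert, but as Jeff says, it only helps for single-channel failures.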

Rgds,
Jeff
ROSS HANSON
Regular Advisor

Re: fail over test

We receive no error messages as far as the O/S is concerned. We only get the messages from Informix reporting an Assert Failure I/O
and telling us which chunk is going down.

So, we cannot get the O/S to go down different
paths if a controller goes out, without spending
more money, huh?
By the way we do not run Service Guard here.
Ross Hanson
hari jayaram_1
Frequent Advisor

Re: fail over test

Ross,

I have an 11.0 server in the exact same configuration you have and it is running pretty smooth without any problems. It is one of our legacy applications which has not been migrated to 11i.
We have however had numerous problems with the VA and have stopped using it in favour of an XP1024.
Let me poke around my server and see if I can come up with something you can use.
Jeff Schussele
Honored Contributor

Re: fail over test

Sorry - when you said failover I thought you implied Service Guard.
You need to check the LVM setup of the VGs that contain that data:
vgdisplay -v /dev/vg_name
there *must* be an alternate link there for LVM to handle the link loss.
If not you need to vgextend that VG to use that extra link.
You should see something at the end of the output like:
/dev/dsk/c5t2d3
/dev/dsk/c9t2d3 Alternate Link
If you don't - you don't have that LVM protection.
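For the record, adding the missing alternate link is a one-liner; a sketch using Jeff's example device names (verify the controller/target/disk numbers against your own box before running anything):

```shell
# Add the second path to the VG; LVM records it as an Alternate Link
# and retries I/O down it if the primary channel dies
vgextend /dev/vg05 /dev/dsk/c9t2d3

# Confirm the alternate shows up at the end of the output
vgdisplay -v /dev/vg05 | grep -i "alternate"
```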

Rgds,
Jeff
hari jayaram_1
Frequent Advisor

Re: fail over test

Hi Ross,

I saw this in one of the forum threads
http://forums1.itrc.hp.com/service/forums/questionanswer.do?admit=716493758+1083364406421+28353475&threadId=215197


We have discovered the problem here.

Linkdown-tmo was set to 60 and no_device_delay was set to 30.

The combinations of these two delays caused the PVLinks to get in such a state that they never failed over.

I waited for an hour, then reset the kernel parameters to their defaults.

All is now working properly.
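For anyone hitting the same thing: on HP-UX 11.0 you would inspect such tunables with kmtune (the parameter spellings below follow the quote above; check your FC driver's documentation for the exact names and default values):

```shell
# Query the two delay tunables named in the quote
kmtune -q linkdown_tmo
kmtune -q no_device_delay

# kmtune -s <name>=<value> changes one; rebuild the kernel with
# mk_kernel and reboot for the new value to take effect
```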

Thanks for everyone's input.
Patrick Wallek
Honored Contributor

Re: fail over test

When utilizing alternate paths the switch from primary to alternate is NOT instantaneous. There can be a delay of several seconds while I/O times out on one path before it goes down the other path.

What you might want to look at is HP's Auto Path VA product. I think that will give you more of the functionality that you want.

http://www.hp.com/products1/storage/products/disk_arrays/modular/autopath/index.html

ROSS HANSON
Regular Advisor

Re: fail over test

Yes Jeff,
We do have alternate links already. That is what brought up the whole thing with the failover test. As mentioned, we have everything doubled - switches, disks, controllers, etc. -
and configured, supposedly, to use the alternate links to go to the other side if one should fail. But our database has gone down twice now because when a SAN controller fails it does not go to the alternate link; it just takes down the database. I thought that with the redundancy we have in place it would "fail over".
Ross Hanson
Jeff Schussele
Honored Contributor

Re: fail over test

OK - then you need to look at the I/O timeout values for either the LV or PV and whether they are longer than what the SW will tolerate.
For both LVs & PVs it's the -t value that sets it.
IF it's higher than the SW can tolerate then either the HW or SW value needs to change.
I believe about 90 seconds is the HW default & it may be that the SW pukes & dies at that amount. BUT I would caution you not to decrease it unless you're sure loads will *never* cause that long a delay - i.e. you may want to lengthen the SW timeout before you shorten the HW timeout.
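Concretely, the knobs Jeff is describing would be adjusted something like this (the values and device names are illustrative; check pvdisplay/lvdisplay output for what your box currently uses):

```shell
# Show the current PV timeout - look for "IO Timeout (Seconds)";
# "default" means the underlying driver's value (~90s per the post)
pvdisplay /dev/dsk/c5t2d3

# Lengthen the LV-level (SW side) timeout first, as advised above
lvchange -t 180 /dev/vg05/lvol1

# Only if you are sure loads never hold I/O that long, shorten the
# PV-level (HW side) timeout:
# pvchange -t 60 /dev/dsk/c5t2d3
```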

Rgds,
Jeff