Operating System - OpenVMS

Re: QDSKINTERVAL Setting

 
Jack Trachtman
Super Advisor

QDSKINTERVAL Setting

Two 2-node AlphaServer clusters
OpenVMS V8.3
2 EVA5000 disk storage w/Quorum disks defined

I discovered that attempting to upgrade the disk firmware on our EVA5Ks would cause a quorum disk timeout when the EVA5K would pause at the end of the upgrade. I opened a case & found that HP has a "note" about this and recommends changing QDSKINTERVAL from the default of 3 seconds to 10 seconds.

How would this affect the cluster integrity?

I asked about also changing the RECNXINTERVAL but was told to leave it at 20 seconds.

Thoughts please?

TIA
7 REPLIES
Hoff
Honored Contributor

Re: QDSKINTERVAL Setting

You have a storage controller which can apparently go catatonic for ten seconds and quite possibly more, and which trips up the quorum disk polling processing.

The HP-provided work-around implies that the device might be catatonic for longer than ten seconds, too.

I might well look to connect a shared SCSI bus between the two boxes (if you can get a supported configuration for that) and move the quorum disk off the SAN. Or add a third voting node to the cluster.

RECNXINTERVAL: host-to-host activity
QDSKINTERVAL: host-to-quorum-disk activity

In short: follow what HP recommends. You have paid them for the privilege of calling them for support, and particularly to have them resolve these cases for you, after all.
Dave Sullivan_3
Occasional Advisor

Re: QDSKINTERVAL Setting

Hi,

Raising the value of QDSKINTERVAL causes no cluster integrity issues. The idea is to raise it far enough to avert a quorum disk timeout during your storage bounce.

Additionally, not sure how your nodes are connected to each other. The only way someone would say leave recnxinterval at 20 is if they are hard connected.

The two params, as Hoff stated, are looking for different things. My feeling on 2-node clusters is to keep the nodes close and run a crossover cable between them. Best $7 you will ever spend for SCS communications. SCS always speaks down the path of least resistance; plus, we can use SCACP to prioritize that channel. Thus, as long as both nodes are up... you are good... But that is only for host-to-host.

- Dave Sullivan
Hoff
Honored Contributor

Re: QDSKINTERVAL Setting

Cross-over isn't my choice, but (depending on the controllers and the cabling) it works for communications.

But to be clear, it isn't a replacement for the quorum disk.

The quorum disk is what allows this cluster to survive the failure of either host within the cluster. (The alternative here being the addition of a third voting node.)

The multi-host (shared) SCSI with a quorum disk on that shared SCSI bus allows you to avoid the issues arising from the quorum disk access delay should that EVA 5000 box go walkabout.

As for avoiding switches, that's your call. Some folks do like that configuration. But I prefer having available ports on the core cluster network, and (ignoring the expensive, managed switches, which can and sometimes do fail more often than I'd prefer) the "dumb" and cheap and unmanaged switches tend to be quite reliable.
Jack Trachtman
Super Advisor

Re: QDSKINTERVAL Setting

Our cluster nodes are close enough to each other that we've been able to use cross-over cables for SCS communications. This has worked well for years.

My thinking on the QDSKINTERVAL is that changing the polling time would reduce, but not eliminate, the chance of a quorum disk timeout, i.e. it would still be possible for the poll to occur while the disk controller was "frozen".

Am I thinking correctly here?
Hoff
Honored Contributor
Solution

Re: QDSKINTERVAL Setting

Quorum Disk Processing 101...

Four failed polling I/O requests sent to the quorum disk cause the votes from the quorum disk to be discounted. The quorum disk is effectively ejected from quorum calculations. Basically, you can take three I/O errors (for whatever reason); the fourth one costs you the quorum disk's votes.

This processing means that the maximum delay expected with a departing host with a quorum disk with QDSKINTERVAL set to 10 here is circa 40 seconds. (Clean cluster exits can be and usually are faster than that.) This is also the duration of the quorum hang, during an unclean exit.
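
The arithmetic behind that 40-second figure can be sketched in a few lines of Python. This is only an illustration of the 4x rule described above, not anything taken from OpenVMS itself; the function and constant names are made up for the example.

```python
# Assumption (from this thread): the quorum disk's votes are discounted
# after four failed polls, and polls are QDSKINTERVAL seconds apart.
FAILED_POLLS_BEFORE_EJECT = 4


def quorum_disk_timeout(qdskinterval_s: int,
                        polls: int = FAILED_POLLS_BEFORE_EJECT) -> int:
    """Approximate worst-case seconds before the quorum disk is discounted."""
    return qdskinterval_s * polls


if __name__ == "__main__":
    for interval in (3, 10):
        print(f"QDSKINTERVAL={interval} s -> roughly "
              f"{quorum_disk_timeout(interval)} s")
```

With the default of 3 the window is roughly 12 seconds; with 10 it is roughly 40 seconds.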

If you have tighter timing tolerances and more stringent timing requirements here, then you do have some choices.

- move the quorum disk to an interconnect with sufficiently fast and particularly more consistent response times,

- add additional voting nodes (to total three or more),

- migrate to a different clustering technology that (better) meets your application's needs,

- replace that EVA controller with one that reacts within your timing requirements,

- work with HP to get that EVA to react more quickly in these cases.

FWIW, this EVA (mis)behavior and this pause is hitting all I/O activity, and not just the quorum disk polling. (You're just not seeing that application pause because you're wedged in a cluster hang here, and the applications are apparently not specifically coded to continue operations during the quorum hang. Yes, you can do that to a degree, if you're willing and careful, and play within the rules.)

As a side note, I'd consider asking HP what to expect as the worst-case EVA controller wedge duration. That's the key determinant here, and it looks to be somewhere between 10-ish and 40 seconds.
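
To get a rough feel for how the wedge duration interacts with the polling interval, here is a toy model in Python. It is a deliberate simplification and not how the connection manager is actually implemented: it assumes a poll fails if and only if it is issued while the controller is frozen, and it ignores I/O timeouts and retries. The function names and the `phase_s` parameter are hypothetical.

```python
import math


def failed_polls(freeze_s: float, interval_s: float,
                 phase_s: float = 0.0) -> int:
    """Count polls issued at phase, phase + interval, ... that land
    inside a controller freeze covering the window [0, freeze_s)."""
    if phase_s >= freeze_s:
        return 0
    return math.ceil((freeze_s - phase_s) / interval_s)


def can_lose_quorum_disk(freeze_s: float, interval_s: float,
                         polls_to_eject: int = 4) -> bool:
    """Worst case: the first poll lands just as the freeze begins."""
    return failed_polls(freeze_s, interval_s, 0.0) >= polls_to_eject


if __name__ == "__main__":
    for freeze in (15, 35):
        for interval in (3, 10):
            print(freeze, interval, can_lose_quorum_disk(freeze, interval))
```

Under these assumptions a 15-second wedge is enough to lose the quorum disk at QDSKINTERVAL 3 but not at 10, while a 35-second wedge can still catch four polls at 10. So a larger interval shrinks the exposure without removing it, which is why the worst-case wedge duration matters.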

Prior to about V7.2, the default QDSKINTERVAL was 10.

Here is a write-up on a low-end cluster, specifically a cluster configuration with two voting hosts:

http://labs.hoffmanlabs.com/node/569

My previous reply indicated 3x. That recollection was incorrect, based on what I see in the available low-level cluster docs. It's a 4x poll before a decision is rendered.
Hoff
Honored Contributor

Re: QDSKINTERVAL Setting

ps...

Clean cluster exit: a shutdown, or a crash, or any of the failure paths that send out the so-called last-gasp datagram.

Disconnecting a cable won't send that datagram, for instance, so you'll get the full 4x poll.
Jack Trachtman
Super Advisor

Re: QDSKINTERVAL Setting

Thanks for the responses.

We'll start scheduling downtime to change QDSKINTERVAL.