<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster member crash has high impact on other nodes in the cluster in Operating System - OpenVMS</title>
    <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684214#M100065</link>
    <description>Do also consider re-configuring to put the dump file on local storage (SAN).&lt;BR /&gt;&lt;BR /&gt;Don't rush to make changes.</description>
    <pubDate>Wed, 08 Sep 2010 08:37:48 GMT</pubDate>
    <dc:creator>Ian Miller.</dc:creator>
    <dc:date>2010-09-08T08:37:48Z</dc:date>
    <item>
      <title>Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684204#M100055</link>
      <description>Hello,&lt;BR /&gt;&lt;BR /&gt;We have a VMS cluster of 6 x I64 servers and 2 x Alpha servers. We use Ethernet/LAN as the cluster interconnect.&lt;BR /&gt;One I64 server crashed today, but it had a huge impact on the other members in the cluster. Many processes on the other nodes went into RWSCS state.&lt;BR /&gt;We use HBMM.&lt;BR /&gt;&lt;BR /&gt;Is this normal?&lt;BR /&gt;Which SYSGEN parameter should I change to avoid or minimize this behaviour?&lt;BR /&gt;&lt;BR /&gt;Also, I got a strange message on the I64 console of the failing node:&lt;BR /&gt;&lt;BR /&gt;**** Unable to write header, dump will probably be unusable ****PGQBT-E-Transport Error IO[11]: STS 0x2, SCSISTS 0x0, STSFLG 0x0, STATEFLG 0x0&lt;BR /&gt;&lt;BR /&gt;Does anyone know what this means?&lt;BR /&gt;&lt;BR /&gt;Toine</description>
      <pubDate>Tue, 07 Sep 2010 14:50:44 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684204#M100055</guid>
      <dc:creator>Toine_1</dc:creator>
      <dc:date>2010-09-07T14:50:44Z</dc:date>
    </item>
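    <!-- A minimal DCL sketch of how to check whether that dump is actually usable,
         assuming the default dump file location; ANALYZE/CRASH_DUMP invokes SDA on
         the dump file, and the CLUE extension summarizes the crash:
           $ ANALYZE/CRASH_DUMP SYS$SYSTEM:SYSDUMP.DMP
           SDA> CLUE CRASH     ! crash summary: bugcheck type, failing PC, etc.
           SDA> EXIT
         If the header was never written, as the PGQBT message warns, SDA will fail
         to open the dump. -->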
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684205#M100056</link>
      <description>You do have a SAN here (based on that "PGQBT"), right?&lt;BR /&gt;&lt;BR /&gt;Without the SAN, there is a reasonable chance that the (presumably) gigabit ethernet LAN is overloaded; eight hosts and some number of HBVS full-merge operations is a whole lot of network traffic, after all.&lt;BR /&gt;&lt;BR /&gt;Even with the SAN, you might have a plugged network.&lt;BR /&gt;&lt;BR /&gt;Based on the PGQBT boot driver diagnostic, there was apparently some sort of SAN error here during the crash.  The box apparently couldn't get to the SAN, or to the storage controller, or to the disk.&lt;BR /&gt;&lt;BR /&gt;I'd dispense with the tuning effort, at least temporarily, and start investigating the steady-state and HBVS-recovery network loading (with T4, as well as with a network monitor hanging off a "mirror" port on your network switch), investigate what hardware is present here, and look at adding links and faster interconnects.&lt;BR /&gt;&lt;BR /&gt;Definitely check for ECO kits.&lt;BR /&gt;&lt;BR /&gt;And check the boot device.&lt;BR /&gt;&lt;BR /&gt;And check the error logs.&lt;BR /&gt;&lt;BR /&gt;And if you have support, call HP.</description>
      <pubDate>Tue, 07 Sep 2010 15:11:53 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684205#M100056</guid>
      <dc:creator>Hoff</dc:creator>
      <dc:date>2010-09-07T15:11:53Z</dc:date>
    </item>
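    <!-- Hedged DCL for the checks suggested above; the ELV qualifier syntax varies
         by OpenVMS version, so treat this as illustrative only:
           $ SHOW ERROR                          ! per-device error counts
           $ ANALYZE/ERROR_LOG/ELV TRANSLATE /SINCE=YESTERDAY  ! recent error log entries
           $ PRODUCT SHOW PRODUCT                ! installed kits, including ECOs
    -->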
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684206#M100057</link>
      <description>What is the storage and networking configuration?  Are there multiple network adapters?  Are speed and duplex configured properly for all paths?  What is the network speed?  Is there a SAN, and does each system have HBAs?&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 07 Sep 2010 15:37:27 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684206#M100057</guid>
      <dc:creator>Andy Bustamante</dc:creator>
      <dc:date>2010-09-07T15:37:27Z</dc:date>
    </item>
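    <!-- Illustrative DCL for gathering these answers (the adapter name EWA0 and the
         disk name are assumptions, not taken from the thread):
           $ MCR LANCP SHOW DEVICE EWA0/CHARACTERISTICS   ! LAN speed and duplex
           $ SHOW DEVICE FG                               ! FibreChannel HBAs present
           $ SHOW DEVICE/FULL $1$DGA100:                  ! I/O paths to a SAN disk
    -->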
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684207#M100058</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;I have a SAN of two EVA4400s and two HBA cards in each server, so each server has 4 paths to each disk.&lt;BR /&gt;The EVA boxes are located in two computer rooms.&lt;BR /&gt;&lt;BR /&gt;There are two Brocade switches in each computer room for each EVA4400.&lt;BR /&gt;&lt;BR /&gt;Each server has two Gigabit LAN interfaces.&lt;BR /&gt;One Gigabit interface is connected to a dedicated switch for the cluster communication.&lt;BR /&gt;&lt;BR /&gt;The error count on the PE device increased during this problem.&lt;BR /&gt;&lt;BR /&gt;I use host-based minimerge, but could it be that some I/Os are blocked during a minimerge?&lt;BR /&gt;It was also strange that all processes using socket connections were in RWSCS state.&lt;BR /&gt;Also, no one could log on via Telnet to the remaining members for a short period.&lt;BR /&gt;&lt;BR /&gt;$ show shadow sys$sysdevice&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;_DSA100:  Volume Label: I64VMS&lt;BR /&gt;  Virtual Unit State:   Steady State&lt;BR /&gt;  Enhanced Shadowing Features in use:&lt;BR /&gt;        Host-Based Minimerge (HBMM)&lt;BR /&gt;&lt;BR /&gt;  VU Timeout Value  16777215    VU Site Value          1&lt;BR /&gt;  Copy/Merge Priority   5000    Mini Merge       Enabled&lt;BR /&gt;  Recovery Delay Per Served Member                    30&lt;BR /&gt;  Merge Delay Factor     200    Delay Threshold      200&lt;BR /&gt;&lt;BR /&gt;  HBMM Policy&lt;BR /&gt;    HBMM Reset Threshold: 6000000&lt;BR /&gt;    HBMM Master lists:&lt;BR /&gt;      Up to any 3 of the nodes: NVR,NVC,NVE,NVJ Multiuse: 0&lt;BR /&gt;    HBMM bitmaps are active on NVJ,NVC,NVE&lt;BR /&gt;  HBMM Reset Count      49       Last Reset     7-SEP-2010 12:23:49.90&lt;BR /&gt;    Modified blocks since last bitmap reset: 5976239&lt;BR /&gt;&lt;BR /&gt;  Device $1$DGA100              Master Member&lt;BR /&gt;    Read Cost              2    Site 1&lt;BR /&gt;    Member Timeout       120&lt;BR /&gt;&lt;BR /&gt;  Device $1$DGA200&lt;BR /&gt;    Read Cost             42    Site 2&lt;BR /&gt;    Member Timeout       120&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Toine</description>
      <pubDate>Tue, 07 Sep 2010 16:16:19 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684207#M100058</guid>
      <dc:creator>Toine_1</dc:creator>
      <dc:date>2010-09-07T16:16:19Z</dc:date>
    </item>
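    <!-- One way (a sketch; display formats vary by version) to examine PEdriver
         cluster-interconnect health and the PE device errors mentioned above,
         using the standard SCACP utility:
           $ MCR SCACP
           SCACP> SHOW CHANNEL    ! LAN channels to each remote node, with state
           SCACP> SHOW VC         ! virtual circuits between cluster members
           SCACP> EXIT
    -->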
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684208#M100059</link>
      <description>RWSCS is a cluster wait.&lt;BR /&gt;&lt;BR /&gt;Short RWSCS waits are entirely normal.&lt;BR /&gt;&lt;BR /&gt;Longer RWSCS can indicate a blocked network.  Or blocked locking.  Or cluster credit exhaustion.  Or lock manager flailing.  I'd expect that a stuffed-up SAN could trigger this resource wait state, too.&lt;BR /&gt;&lt;BR /&gt;And that PGQBT SAN error is worth investigation.&lt;BR /&gt;&lt;BR /&gt;You're going to have to instrument the cluster and the LAN, via Wireshark and T4, or analogous tools.&lt;BR /&gt;&lt;BR /&gt;You're also going to have to investigate the error logs.&lt;BR /&gt;&lt;BR /&gt;Also the power stability, and the contents of the network and storage server logs.&lt;BR /&gt;&lt;BR /&gt;If you have HP support available, start down that path now.  (Your management paid good money for that, too.)</description>
      <pubDate>Tue, 07 Sep 2010 16:29:38 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684208#M100059</guid>
      <dc:creator>Hoff</dc:creator>
      <dc:date>2010-09-07T16:29:38Z</dc:date>
    </item>
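    <!-- A hedged sketch of base-system instrumentation along these lines; T4
         collects the same classes over time for trend analysis:
           $ MONITOR CLUSTER      ! cluster-wide CPU, disk, and locking summary
           $ MONITOR DLOCK        ! distributed lock manager activity
    -->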
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684209#M100060</link>
      <description>Thank you Hoff,&lt;BR /&gt;&lt;BR /&gt;What I also saw was that, for about 6 minutes, there was a queue length of 20 on the system disk of the I64 servers.&lt;BR /&gt;We use one system disk for all I64 servers.&lt;BR /&gt;&lt;BR /&gt;I have logged a call with HP and I hope we will find the root cause.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;&lt;BR /&gt;Toine</description>
      <pubDate>Tue, 07 Sep 2010 16:37:21 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684209#M100060</guid>
      <dc:creator>Toine_1</dc:creator>
      <dc:date>2010-09-07T16:37:21Z</dc:date>
    </item>
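    <!-- A sketch of watching that shared system disk across members (the node
         names are taken from the SHOW SHADOW output earlier in the thread):
           $ MONITOR DISK/ITEM=QUEUE_LENGTH/NODE=(NVR,NVC,NVE,NVJ)
         A sustained queue on a shared system disk during merge activity will
         stall logins and image activation cluster-wide, consistent with the
         Telnet symptom described above. -->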
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684210#M100061</link>
      <description>Toine,&lt;BR /&gt;&lt;BR /&gt;  Please post the output of:&lt;BR /&gt;&lt;BR /&gt;$ MCR SYSGEN SHOW/CLUSTER&lt;BR /&gt;&lt;BR /&gt;The tradeoff with failures in a cluster is around how long to wait before deciding an apparently lost node is really lost. One key parameter is RECNXINTERVAL. If another node stops responding, surviving nodes wait that many seconds to see if it reappears.&lt;BR /&gt;&lt;BR /&gt;  If RECNXINTERVAL is too high, the whole cluster can freeze for that long before even attempting to reform without the missing node. If the value is too low, some transient event on your cluster interconnect can cause the cluster to kick a node out unnecessarily.&lt;BR /&gt;&lt;BR /&gt;  Another issue which affects the timing of recovery from failure is where your locks are mastered. This is mostly controlled by LOCKDIRWT. Much of the time in a cluster transition is spent working out which lock resources have been "lost" (because they were mastered on the lost node), deciding which node will take over each resource, and reconciling the states of any interested locks on surviving cluster nodes.&lt;BR /&gt;&lt;BR /&gt;  If you happened to have a large lock tree, with lots of cluster-wide activity, mastered on the node which crashed, then it could take substantial time (on the order of minutes) to reconstruct the tree. Processes waiting on locks against lost resources will wait in RWSCS state while the states are sorted out. There's not a lot you can do about this, except perhaps to make sure, if you have multiple large lock trees, that they are not concentrated on a single node.&lt;BR /&gt;&lt;BR /&gt;  Find out what lock trees normally live on your system, and how they are distributed. If they're all on one node, and that node is lost, you have to rebuild them all.</description>
      <pubDate>Tue, 07 Sep 2010 20:52:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684210#M100061</guid>
      <dc:creator>John Gillings</dc:creator>
      <dc:date>2010-09-07T20:52:04Z</dc:date>
    </item>
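    <!-- The requested command, plus the two parameters John highlights, as DCL:
           $ MCR SYSGEN
           SYSGEN> SHOW/CLUSTER        ! all cluster-related parameters
           SYSGEN> SHOW RECNXINTERVAL  ! reconnection wait, in seconds
           SYSGEN> SHOW LOCKDIRWT      ! lock directory weight for this node
           SYSGEN> EXIT
    -->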
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684211#M100062</link>
      <description>Hello John,&lt;BR /&gt;&lt;BR /&gt;You are correct; perhaps the RECNXINTERVAL is too high: it is 60 seconds in my cluster.&lt;BR /&gt;&lt;BR /&gt;I must also tell you that the two Integrity servers with the highest LOCKDIRWT didn't crash. But I will check the lock remastering.&lt;BR /&gt;&lt;BR /&gt;Below are the SYSGEN parameters:&lt;BR /&gt;&lt;BR /&gt;$ mc sysgen sho/cluster&lt;BR /&gt;&lt;BR /&gt;Parameters in use: Active&lt;BR /&gt;Parameter Name            Current    Default     Min.       Max.   Unit  Dynamic&lt;BR /&gt;--------------            -------    -------   -------    -------  ----  -------&lt;BR /&gt;VAXCLUSTER                      2          1         0          2 Coded-valu&lt;BR /&gt;EXPECTED_VOTES                 10          1         1        127 Votes&lt;BR /&gt;VOTES                           2          1         0        127 Votes&lt;BR /&gt;DISK_QUORUM     "                "    "    "    "    "     "ZZZZ" Ascii&lt;BR /&gt;QDSKVOTES                       1          1         0        127 Votes&lt;BR /&gt;QDSKINTERVAL                    3          3         1      32767 Seconds&lt;BR /&gt;ALLOCLASS                       1          0         0        255 Pure-numbe&lt;BR /&gt;LOCKDIRWT                       6          0         0        255 Pure-numbe&lt;BR /&gt;CLUSTER_CREDITS               128         32        10        128 Credits&lt;BR /&gt;NISCS_CONV_BOOT                 0          0         0          1 Boolean&lt;BR /&gt;NISCS_LOAD_PEA0                 1          0         0          1 Boolean&lt;BR /&gt;NISCS_USE_LAN                   1          1         0          1 Boolean&lt;BR /&gt;NISCS_USE_UDP                   0          0         0          1 Boolean&lt;BR /&gt;MSCP_LOAD                       1          0         0      16384 Coded-valu&lt;BR /&gt;TMSCP_LOAD                      0          0         0          3 Coded-valu&lt;BR /&gt;MSCP_SERVE_ALL                  1          4         0         -1 Bit-Encode&lt;BR /&gt;TMSCP_SERVE_ALL                 0          0         0         -1 Bit-Encode&lt;BR /&gt;MSCP_BUFFER                 16384       1024       256         -1 Coded-valu&lt;BR /&gt;MSCP_CREDITS                  128         32         2       1024 Coded-valu&lt;BR /&gt;TAPE_ALLOCLASS                  0          0         0        255 Pure-numbe&lt;BR /&gt;NISCS_MAX_PKTSZ              8192       8192       576       9180 Bytes&lt;BR /&gt;CWCREPRC_ENABLE                 1          1         0          1 Bitmask    D&lt;BR /&gt;RECNXINTERVAL                  60         20         1      32767 Seconds    D&lt;BR /&gt;NISCS_PORT_SERV                 0          0         0        256 Bitmask    D&lt;BR /&gt;NISCS_UDP_PORT                  0          0         0      65535 Pure-numbe D&lt;BR /&gt;MSCP_CMD_TMO                    0          0         0 2147483647 Seconds    D&lt;BR /&gt;LOCKRMWT                        5          5         0         10 Pure-numbe&lt;BR /&gt;&lt;BR /&gt;Toine</description>
      <pubDate>Tue, 07 Sep 2010 21:05:59 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684211#M100062</guid>
      <dc:creator>Toine_1</dc:creator>
      <dc:date>2010-09-07T21:05:59Z</dc:date>
    </item>
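    <!-- RECNXINTERVAL is dynamic (note the "D" column above), so if analysis
         eventually justifies lowering it, a sketch of the change; the value 30 is
         purely illustrative and, per the reply below, the current value may well
         be deliberate:
           $ MCR SYSGEN
           SYSGEN> USE ACTIVE
           SYSGEN> SET RECNXINTERVAL 30
           SYSGEN> WRITE ACTIVE        ! takes effect immediately
           SYSGEN> EXIT
         Record any such change in MODPARAMS.DAT so AUTOGEN preserves it. -->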
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684212#M100063</link>
      <description>Toine,&lt;BR /&gt;&lt;BR /&gt;&amp;gt;perhaps the RECNXINTERVAL is too high 60 seconds &lt;BR /&gt;&lt;BR /&gt;Don't assume that! Someone has set RECNXINTERVAL up from the default, hopefully with good reason.&lt;BR /&gt;&lt;BR /&gt;That means if a node loses power, is disconnected, or crashes, you will experience a cluster state transition of at least 60 seconds. BUT, depending on your network infrastructure and business needs, that may be perfectly reasonable.&lt;BR /&gt;&lt;BR /&gt;Consider: if cluster nodes are separated by a long distance, the cluster interconnect may go through various network boxes. If the reboot time for one of those boxes is (say) 30 seconds, you may WANT a relatively high RECNXINTERVAL so your cluster will survive an expected network outage.&lt;BR /&gt;&lt;BR /&gt;As long as the business can tolerate up to a 60 second pause if there's a network transient, that may be preferable to having nodes kicked out unnecessarily.&lt;BR /&gt;&lt;BR /&gt;Only you and your internal business customers can decide the best tradeoff for your systems.&lt;BR /&gt;&lt;BR /&gt;If you need shorter transitions, one way to allow you to reduce RECNXINTERVAL is to have multiple cluster interconnect paths. That way, even if you lose connectivity on one path, the remaining one(s) will keep the cluster together. Modern systems often have several network adapters, some of which may be unused. Perhaps you can connect all nodes using "spare" adapters through a private switch. Watch out for single points of failure.&lt;BR /&gt;&lt;BR /&gt;As always, you need to balance costs and business needs.</description>
      <pubDate>Tue, 07 Sep 2010 23:36:57 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684212#M100063</guid>
      <dc:creator>John Gillings</dc:creator>
      <dc:date>2010-09-07T23:36:57Z</dc:date>
    </item>
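    <!-- A sketch of verifying redundant interconnect paths once a second LAN path
         is cabled, using the SHOW CLUSTER utility from the base system (commands
         are typed into the continuous display):
           $ SHOW CLUSTER/CONTINUOUS
           ADD CIRCUITS           ! expect one circuit per LAN path per remote port
    -->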
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684213#M100064</link>
      <description>Toine&lt;BR /&gt;&lt;BR /&gt;You said&lt;BR /&gt;"But I will check the lock remastering."&lt;BR /&gt;&lt;BR /&gt;Be careful with the SDA extension LCK.&lt;BR /&gt;&lt;BR /&gt;Usually, we assume that we can do nearly whatever we want in SDA with no harm. Just looking at memory locations is innocent.&lt;BR /&gt;&lt;BR /&gt;This is not correct: a&lt;BR /&gt;SDA&amp;gt; lck remaster...&lt;BR /&gt;can use a lot of CPU and generate processes in RWSCS, RWCLU...&lt;BR /&gt;&lt;BR /&gt;Of course,&lt;BR /&gt;SDA&amp;gt; lck stat/toptrees=10&lt;BR /&gt;is "innocent".</description>
      <pubDate>Wed, 08 Sep 2010 06:01:52 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684213#M100064</guid>
      <dc:creator>labadie_1</dc:creator>
      <dc:date>2010-09-08T06:01:52Z</dc:date>
    </item>
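    <!-- The "innocent" form spelled out; ANALYZE/SYSTEM runs SDA against the live
         system:
           $ ANALYZE/SYSTEM
           SDA> LCK STAT/TOPTREES=10   ! the ten busiest lock trees
           SDA> EXIT
         As cautioned above, avoid LCK REMASTER on a busy production cluster. -->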
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684214#M100065</link>
      <description>Do also consider re-configuring to put the dump file on local storage (SAN).&lt;BR /&gt;&lt;BR /&gt;Don't rush to make changes.</description>
      <pubDate>Wed, 08 Sep 2010 08:37:48 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684214#M100065</guid>
      <dc:creator>Ian Miller.</dc:creator>
      <dc:date>2010-09-08T08:37:48Z</dc:date>
    </item>
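    <!-- A heavily hedged sketch of moving the dump off the shared system disk
         (dump off system disk, DOSD); the device and size below are placeholders,
         and a real DOSD setup also requires setting DUMPSTYLE bit 2 and, on
         Integrity, a dump device entry via SYS$MANAGER:BOOT_OPTIONS.COM, so follow
         the current System Manager's Manual rather than this outline:
           $ MCR SYSGEN CREATE $1$DGA300:[SYS0.SYSEXE]SYSDUMP.DMP /SIZE=500000
    -->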
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684215#M100066</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;Thank you all for the good answers. Very helpful, as always.&lt;BR /&gt;&lt;BR /&gt;HP advised us to disable the CPE monitoring via a SYSGEN parameter as a workaround.&lt;BR /&gt; &lt;BR /&gt;&lt;BR /&gt;SYSGEN&amp;gt; use active&lt;BR /&gt;SYSGEN&amp;gt; set crd_control %x80016&lt;BR /&gt;SYSGEN&amp;gt; write active  &lt;BR /&gt;SYSGEN&amp;gt; use current&lt;BR /&gt;SYSGEN&amp;gt; set crd_control %x80016&lt;BR /&gt;SYSGEN&amp;gt; show crd_control&lt;BR /&gt;SYSGEN&amp;gt; write current&lt;BR /&gt;SYSGEN&amp;gt; exit&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;&lt;BR /&gt;Toine</description>
      <pubDate>Wed, 08 Sep 2010 13:30:33 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684215#M100066</guid>
      <dc:creator>Toine_1</dc:creator>
      <dc:date>2010-09-08T13:30:33Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684216#M100067</link>
      <description>The suggested remediation from HP implies that there are buckets of memory errors arising here, and that's something I've definitely encountered on a few Integrity and AlphaServer boxes over the years.  RAS features or not, memory errors can cause instabilities on both Integrity and AlphaServer boxes.  On any box, for that matter.  And the errors don't always get overtly logged; you have to go look for them.</description>
      <pubDate>Wed, 08 Sep 2010 14:43:44 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684216#M100067</guid>
      <dc:creator>Hoff</dc:creator>
      <dc:date>2010-09-08T14:43:44Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster member crash has high impact on other nodes in the cluster</title>
      <link>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684217#M100068</link>
      <description>Toine,&lt;BR /&gt;  &lt;BR /&gt;&amp;gt; SYSGEN&amp;gt; set crd_control %x80016&lt;BR /&gt;&lt;BR /&gt;  Hmmm, someone in HP support needs to be taught "Balmer's Rule" (someone, somewhere on the planet may laugh... :-)&lt;BR /&gt;&lt;BR /&gt;You've adjusted the parameter value correctly (USE/SET/WRITE), but if you don't want a surprise sometime in the future when you've forgotten about this thread, you should also add that SET command to MODPARAMS.DAT, commented, with your name, the date, a reference to the HP service case, and maybe even the URL of this thread.&lt;BR /&gt;&lt;BR /&gt;System configurations are complex things, and changes often have unintended consequences. As an example, in this case it may have been helpful to know why your cluster has a non-default RECNXINTERVAL. Who decided on the value, when, and why? If your system doesn't have a clearly documented MODPARAMS.DAT, please start now.&lt;BR /&gt;&lt;BR /&gt;Also, it just occurred to me that, since your cluster has 8 nodes, you should check that RECNXINTERVAL is consistent across all nodes. I'd expect that the resultant delay would be the longest across the cluster (of course, inconsistencies like that SHOULD be detected and notified at cluster formation, but engineering has never considered it important enough to expend resources implementing a proper cluster consistency check :-(</description>
      <pubDate>Wed, 08 Sep 2010 20:38:13 GMT</pubDate>
      <guid>https://community.hpe.com/t5/operating-system-openvms/cluster-member-crash-has-high-impact-on-other-nodes-in-the/m-p/4684217#M100068</guid>
      <dc:creator>John Gillings</dc:creator>
      <dc:date>2010-09-08T20:38:13Z</dc:date>
    </item>
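    <!-- A sketch of the MODPARAMS.DAT entry John describes; the case number is a
         placeholder, and the hex radix uses the same %X notation SYSGEN accepts:
           ! 08-SEP-2010, Toine: disable CPE polling per HP case [number];
           ! see the "Cluster member crash" thread on the HP community forum.
           CRD_CONTROL = %X80016
    -->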
  </channel>
</rss>