09-07-2010 07:50 AM
We have a VMS cluster of six I64 servers and two Alpha servers, using Ethernet/LAN as the cluster interconnect.
One I64 server crashed today, but the crash had a huge impact on the other members of the cluster: many processes on the other nodes went into RWSCS state.
We use HBMM.
Is this normal?
Which SYSGEN parameter should I change to avoid or minimize this behaviour?
I also got a strange message on the I64 console of the failing node:
**** Unable to write header, dump will probably be unusable ****
PGQBT-E-Transport Error IO[11]: STS 0x2, SCSISTS 0x0, STSFLG 0x0, STATEFLG 0x0
Does anyone know what this means?
Toine
09-07-2010 08:11 AM
Re: Cluster member crash has high impact on other nodes in the cluster
Without the SAN, there is a reasonable chance that the (presumably) gigabit ethernet LAN is overloaded; eight hosts and some number of HBVS full-merge operations is a whole lot of network traffic, after all.
Even with the SAN, you might have a plugged network.
Based on the PGQBT boot driver diagnostic, there was apparently some sort of a SAN error here during the crash. The box apparently couldn't get to the SAN or to the storage controller or to the disk.
I'd dispense with the tuning effort, at least temporarily, and start by investigating the steady-state and HBVS-recovery network loading (with T4, as well as with a network monitor hanging off a "mirror" port on your network switch) and what hardware is present here, and I'd look at adding links and faster interconnects.
Definitely check for ECO kits.
And check the boot device.
And check the error logs.
And if you have support, call HP.
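For the error-log and ECO checks above, a few standard OpenVMS commands would be along these lines (exact qualifiers vary by OpenVMS version; the device name is just an example):

$ SHOW ERROR                        ! quick per-device error counts
$ SHOW DEVICE/FULL $1$DGA100:       ! error count and details for one disk
$ ANALYZE/ERROR_LOG/ELV TRANSLATE   ! format recent error-log entries (newer releases)
$ PRODUCT SHOW PRODUCT/FULL         ! installed kits, including ECO patch kits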
09-07-2010 09:16 AM
Re: Cluster member crash has high impact on other nodes in the cluster
I have a SAN with two EVA4400 arrays and two HBA cards in each server, so each server has four paths to each disk.
The EVA boxes are located in two computer rooms, with two Brocade switches in each computer room for each EVA4400.
Each server has two Gigabit LAN interfaces; one is connected to a dedicated switch for the cluster communication.
The error count on the PE device increased during this problem.
I use host-based minimerge, but can it be that some I/Os are blocked during this minimerge?
It was also strange that all processes using socket connections were in RWSCS state.
Also, for a short period no one could log on to the remaining members via Telnet.
$ show shadow sys$sysdevice
_DSA100: Volume Label: I64VMS
Virtual Unit State: Steady State
Enhanced Shadowing Features in use:
Host-Based Minimerge (HBMM)
VU Timeout Value 16777215 VU Site Value 1
Copy/Merge Priority 5000 Mini Merge Enabled
Recovery Delay Per Served Member 30
Merge Delay Factor 200 Delay Threshold 200
HBMM Policy
HBMM Reset Threshold: 6000000
HBMM Master lists:
Up to any 3 of the nodes: NVR,NVC,NVE,NVJ Multiuse: 0
HBMM bitmaps are active on NVJ,NVC,NVE
HBMM Reset Count 49 Last Reset 7-SEP-2010 12:23:49.90
Modified blocks since last bitmap reset: 5976239
Device $1$DGA100 Master Member
Read Cost 2 Site 1
Member Timeout 120
Device $1$DGA200
Read Cost 42 Site 2
Member Timeout 120
Toine
09-07-2010 09:29 AM
Solution
Short RWSCS waits are entirely normal.
Longer RWSCS can indicate a blocked network. Or blocked locking. Or cluster credit exhaustion. Or lock manager flailing. I'd also expect that a stuffed-up SAN could also trigger this resource wait state, too.
And that PGQBT SAN error is worth investigation.
You're going to have to instrument the cluster and the LAN, via Wireshark and T4, or analogous tools.
You're also going to have to investigate the error logs.
Also the power stability, and the contents of the network and storage server logs.
If you have HP support available, start down that path now. (Your management paid good money for that, too.)
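On the VMS side, instrumenting the cluster (to complement T4 and Wireshark) can start with standard monitoring commands such as:

$ MONITOR CLUSTER                   ! cluster-wide CPU, I/O, and locking summary
$ MONITOR DLOCK                     ! distributed lock manager activity rates
$ SHOW CLUSTER/CONTINUOUS           ! live view of members and circuits
$ MCR LANCP SHOW DEVICE/COUNTERS    ! LAN adapter traffic and error counters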
09-07-2010 09:37 AM
Re: Cluster member crash has high impact on other nodes in the cluster
I also saw a queue length of 20 on the system disk of the I64 servers for about six minutes.
We use one system disk for all I64 servers.
I have logged a call with HP and I hope we will find the root cause.
Regards,
Toine
09-07-2010 01:52 PM
Re: Cluster member crash has high impact on other nodes in the cluster
Please post the output of:
$ MCR SYSGEN SHOW/CLUSTER
The tradeoff with failures in a cluster is around how long to wait before deciding an apparently lost node is really lost. One key parameter is RECNXINTERVAL. If another node stops responding, surviving nodes wait that many seconds to see if it reappears.
If RECNXINTERVAL is too high, the whole cluster can freeze for that long before even attempting to reform without the missing node. If the value is too low, some transient event on your cluster interconnect can cause the cluster to kick a node out unnecessarily.
Another issue which affects the timing of recovery from failure is where your locks are mastered. This is mostly controlled by LOCKDIRWT. Much of the time of a cluster transition is working out which lock resources have been "lost" (because they were mastered on the lost node), deciding which node will take over that resource, and reconciling the states of any interested locks on surviving cluster nodes.
If you happened to have a large lock tree, with lots of intra-cluster activity, mastered on the node which crashed, then it could take substantial time (order of minutes) to reconstruct the tree. Processes waiting on locks against lost resources will wait in RWSCS state while the states are sorted out. There's not a lot you can do about this, except perhaps to make sure, if you have multiple large lock trees, that they are not concentrated on a single node.
Find out what lock trees normally live on your system, and how they are distributed. If they're all on one node, and that node is lost, you have to rebuild them all.
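A low-impact way to see which lock trees are busiest, using the SDA LCK extension (available on recent OpenVMS versions), is roughly:

$ ANALYZE/SYSTEM
SDA> LCK STAT/TOPTREES=10    ! ten busiest lock trees and their masters
SDA> EXIT

Run it on each node to see how the trees are distributed across the cluster.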
09-07-2010 02:05 PM
Re: Cluster member crash has high impact on other nodes in the cluster
You are correct; perhaps RECNXINTERVAL is too high at 60 seconds in my cluster.
I must also tell you that the two Integrity servers with the highest LOCKDIRWT didn't crash. But I will check the lock remastering.
Below are the SYSGEN parameters.
$ mc sysgen sho/cluster
Parameters in use: Active
Parameter Name Current Default Min. Max. Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
VAXCLUSTER 2 1 0 2 Coded-valu
EXPECTED_VOTES 10 1 1 127 Votes
VOTES 2 1 0 127 Votes
DISK_QUORUM " " " " " " "ZZZZ" Ascii
QDSKVOTES 1 1 0 127 Votes
QDSKINTERVAL 3 3 1 32767 Seconds
ALLOCLASS 1 0 0 255 Pure-numbe
LOCKDIRWT 6 0 0 255 Pure-numbe
CLUSTER_CREDITS 128 32 10 128 Credits
NISCS_CONV_BOOT 0 0 0 1 Boolean
NISCS_LOAD_PEA0 1 0 0 1 Boolean
NISCS_USE_LAN 1 1 0 1 Boolean
NISCS_USE_UDP 0 0 0 1 Boolean
MSCP_LOAD 1 0 0 16384 Coded-valu
TMSCP_LOAD 0 0 0 3 Coded-valu
MSCP_SERVE_ALL 1 4 0 -1 Bit-Encode
TMSCP_SERVE_ALL 0 0 0 -1 Bit-Encode
MSCP_BUFFER 16384 1024 256 -1 Coded-valu
MSCP_CREDITS 128 32 2 1024 Coded-valu
TAPE_ALLOCLASS 0 0 0 255 Pure-numbe
NISCS_MAX_PKTSZ 8192 8192 576 9180 Bytes
CWCREPRC_ENABLE 1 1 0 1 Bitmask D
RECNXINTERVAL 60 20 1 32767 Seconds D
NISCS_PORT_SERV 0 0 0 256 Bitmask D
NISCS_UDP_PORT 0 0 0 65535 Pure-numbe D
MSCP_CMD_TMO 0 0 0 2147483647 Seconds D
LOCKRMWT 5 5 0 10 Pure-numbe
Toine
09-07-2010 04:36 PM
Re: Cluster member crash has high impact on other nodes in the cluster
>perhaps the RECNXINTERVAL is too high 60 seconds
Don't assume that! Someone has set RECNXINTERVAL up from default, hopefully with good reason.
That means that if a node loses power, is disconnected, or crashes, you will experience a cluster state transition of at least 60 seconds. But depending on your network infrastructure and business needs, that may be perfectly reasonable.
Consider, if cluster nodes are separated by a long distance, the cluster interconnect may go through various network boxes. If the reboot time for one of those boxes is (say) 30 seconds, you may WANT a relatively high RECNXINTERVAL so your cluster will survive an expected network outage.
As long as the business can tolerate up to a 60 second pause if there's a network transient, that may be preferable to having nodes kicked out unnecessarily.
Only you, and your internal business customers can decide the best tradeoff for your systems.
If you need shorter transitions, one way to allow you to reduce RECNXINTERVAL is to have multiple cluster interconnect paths. That way, even if you lose connectivity on one path, the remaining one(s) will keep the cluster together. Often modern systems have several network adapters, some of which may be unused. Perhaps you can connect all nodes using "spare" adapters through a private switch. Watch out for single points of failure.
As always, you need to balance costs and business needs.
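If you do decide to lower RECNXINTERVAL, it is a dynamic parameter, so a change can take effect without a reboot. A sketch (the value 40 is only an example; choose what your expected network outages justify):

$ MCR SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 40    ! example value only
SYSGEN> WRITE ACTIVE            ! dynamic: takes effect immediately
SYSGEN> WRITE CURRENT           ! persist across reboots
SYSGEN> EXIT

Also record the value in MODPARAMS.DAT so AUTOGEN does not revert it on the next feedback run.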
09-07-2010 11:01 PM
Re: Cluster member crash has high impact on other nodes in the cluster
You said:
"But I will check the lock remastering."
Be careful with the SDA extension LCK.
Usually we assume that we can do nearly whatever we want in SDA with no harm, and that just looking at memory locations is innocent.
That is not correct, as a
SDA> lck remaster...
can use a lot of CPU and put processes into RWSCS, RWCLU...
Of course,
SDA> lck stat/toptrees=10
is "innocent".