Quorum disk lost connection every two hours

pcseunix · 06-18-2008

I have a system that is currently a one node cluster, with a quorum disk. Yes, I know that the quorum disk is not necessary -- it's a holdover from earlier days when there were two nodes. We're planning on getting rid of the quorum disk on our next scheduled reboot. But, this is an interesting problem.

Anyway, on to the problem.

From time to time, the system reports "Lost connection to quorum disk", followed a few seconds later by "Quorum regained...". The interesting thing is that this occurs at two-hour intervals, but not at every two-hour interval:

06/17/08 00:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 00:07:48: %CNXMAN, Quorum regained, resuming activity
06/17/08 02:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 02:08:15: %CNXMAN, Quorum regained, resuming activity
06/17/08 04:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 04:07:52: %CNXMAN, Quorum regained, resuming activity
06/17/08 08:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 08:08:15: %CNXMAN, Quorum regained, resuming activity
06/17/08 10:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 10:07:53: %CNXMAN, Quorum regained, resuming activity
06/17/08 14:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 14:08:15: %CNXMAN, Quorum regained, resuming activity
06/17/08 16:07:41: %CNXMAN, Lost "connection" to quorum disk
06/17/08 16:08:15: %CNXMAN, Quorum regained, resuming activity
06/17/08 22:07:45: %CNXMAN, Lost "connection" to quorum disk
06/17/08 22:08:15: %CNXMAN, Quorum regained, resuming activity

No disk errors are reported, and the system is not busy at the times indicated -- actually, it is not very busy at all.

System is ES40, 4 cpus, 4GB memory, CIPCA connected to HSZ50, all disks are RAID5. Has VMS83A_UPDATE V5.0 installed (yes, I see that there is a V6.0).

Ideas, suggestions?

Hoff · 06-18-2008

The usual triggers tend to be I/O errors, periodic I/O floods, or other such. Here, I'd look at the CI, too, as cable faults and termination problems can cause communications issues. Periodic, though, is weird.

Please post the cluster system parameters.

SYSMAN> param show /cluster

Please also post the SHOW DEVICE /FULL from the quorum disk. This disk is typically MOUNT /SYSTEM.

Please do check for errors or restarts or such out at the HSZ, too -- for any disk- or CI-related errors or faults or such that might be logged out on the controller, or elsewhere in the configuration.

Also check the network and other cluster communications controllers that might be present.

FWIW, RAID5 has an enormous I/O load during rebuilds, too. IMHO with modern disk prices, RAID10 is often a better choice. And when you get rid of the quorum disk, I'd take a look at the whole of the CI storage connection, too, as that's old kit. Direct-attached SCSI might be a better choice for a one-node configuration, with a PCI RAID controller.

And yes, do get rid of the quorum disk.

Hein van den Heuvel · 06-18-2008

This would possibly be a nice T4 exercise.
If you have T4 running, zoom in to the 7th minute.
Notably, I would check the minute for 6/17 06:07 and 12:07, because it might show something happening without the lost-quorum noise.
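
For anyone wanting to list the "quiet" slots mechanically, here is a minimal sketch in Python. It assumes the events recur exactly every two hours from the first logged loss, with a 30-second tolerance; the function name and the tolerance are illustrative choices, not anything from CNXMAN or T4. The timestamps are copied from the log posted above.

```python
from datetime import datetime, timedelta

# "Lost connection" timestamps copied from the console log in the first post.
LOSS_TIMES = [
    "06/17/08 00:07:45", "06/17/08 02:07:45", "06/17/08 04:07:45",
    "06/17/08 08:07:45", "06/17/08 10:07:45", "06/17/08 14:07:45",
    "06/17/08 16:07:41", "06/17/08 22:07:45",
]


def missing_slots(stamps, period=timedelta(hours=2),
                  tolerance=timedelta(seconds=30)):
    """Return expected slots (first event + k * period) with no event nearby.

    The period and tolerance are assumptions made for this sketch.
    """
    seen = [datetime.strptime(s, "%m/%d/%y %H:%M:%S") for s in stamps]
    missing = []
    slot = seen[0]
    while slot <= seen[-1]:
        if not any(abs(t - slot) <= tolerance for t in seen):
            missing.append(slot)
        slot += period
    return missing


quiet = [t.strftime("%H:%M") for t in missing_slots(LOSS_TIMES)]
print(quiet)
```

For the posted log this yields 06:07, 12:07, 18:07 and 20:07, so besides the two minutes named above there are two more slots worth zooming in on.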

I would also run a SHOW SYSTEM just at 6 minutes past the hour, and again at 8 minutes and 'subtract' them for a process activity insight for those minutes.
Of course this is not unlikely to influence the problem ... it might even make it go away :-).
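
The "subtract" step could look roughly like the Python sketch below. It assumes the two SHOW SYSTEM listings have already been parsed into dictionaries keyed by PID, holding the process name, I/O count and CPU seconds; the parsing itself is left out, and every PID, name and number in the sample data is made up for illustration.

```python
def diff_snapshots(before, after):
    """Per-process growth in I/O count and CPU seconds between two snapshots.

    Each snapshot maps PID -> (process name, I/O count, CPU seconds).
    A process that only appears in the later snapshot is counted from zero.
    The result is sorted with the largest I/O growth first.
    """
    deltas = []
    for pid, (name, io, cpu) in after.items():
        _, io0, cpu0 = before.get(pid, (name, 0, 0.0))
        deltas.append((pid, name, io - io0, cpu - cpu0))
    deltas.sort(key=lambda d: (d[2], d[3]), reverse=True)
    return deltas


# Hypothetical sample data, NOT taken from the system in this thread.
at_06 = {
    "00000401": ("SWAPPER", 0, 0.0),
    "00000408": ("JOB_CONTROL", 500, 1.0),
    "00000412": ("BATCH_17", 1200, 3.5),
}
at_08 = {
    "00000401": ("SWAPPER", 0, 0.0),
    "00000408": ("JOB_CONTROL", 510, 1.0),
    "00000412": ("BATCH_17", 98200, 41.5),
    "00000420": ("NEW_PROC", 50, 0.1),
}

top = diff_snapshots(at_06, at_08)
for pid, name, d_io, d_cpu in top:
    print(f"{pid} {name:<15} I/O +{d_io:<8} CPU +{d_cpu:.1f}s")
```

Whatever sits at the top of that list in the minutes around :07 is the first candidate to look at.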

Finally, has it been behaving like this 'for ever'? When did it start? What had changed around that time?

Hein.

pcseunix · 06-18-2008

For Hoff's comments:

Parameter Name   Current       Default  Minimum  Maximum     Unit         Dynamic
--------------   -------       -------  -------  -------     ----         -------
VAXCLUSTER       2             1        0        2           Coded-value
EXPECTED_VOTES   2             1        1        127         Votes
VOTES            1             1        0        127         Votes
DISK_QUORUM      "$1$DUA182 "  " "      " "      "ZZZZ"      Ascii
QDSKVOTES        1             1        0        127         Votes
QDSKINTERVAL     3             3        1        32767       Seconds
ALLOCLASS        1             0        0        255         Pure-number
LOCKDIRWT        1             0        0        255         Pure-number
CLUSTER_CREDITS  32            32       10       128         Credits
NISCS_CONV_BOOT  0             0        0        1           Boolean
NISCS_LOAD_PEA0  1             0        0        1           Boolean
MSCP_LOAD        1             0        0        16384       Coded-value
TMSCP_LOAD       0             0        0        3           Coded-value
MSCP_SERVE_ALL   1             4        0        -1          Bit-Encoded
TMSCP_SERVE_ALL  0             0        0        -1          Bit-Encoded
MSCP_BUFFER      1024          1024     256      -1          Coded-value
MSCP_CREDITS     32            32       2        1024        Coded-value
TAPE_ALLOCLASS   1             0        0        255         Pure-number
NISCS_MAX_PKTSZ  8192          8192     576      9180        Bytes
CWCREPRC_ENABLE  1             1        0        1           Bitmask      D
RECNXINTERVAL    20            20       1        32767       Seconds      D
NISCS_PORT_SERV  0             0        0        256         Bitmask      D
MSCP_CMD_TMO     0             0        0        2147483647  Seconds      D
LOCKRMWT         5             5        0        10          Pure-number  D

Disk $1$DUA182: (HSJ004), device type MSCP served SCSI disk array, is online,
mounted, file-oriented device, shareable, served to cluster via MSCP Server,
error logging is enabled.

Error count               0               Operations completed            12140682
Owner process             ""              Owner UIC                       [SYSTEM]
Owner process ID          00000000        Dev Prot                        S:RWPL,O:RWPL,G:R,W
Reference count           1722            Default buffer size             512
Current preferred CPU Id  0               Fastpath                        1
Total blocks              17763835        Sectors per track               64
Total cylinders           6939            Tracks per cylinder             40
Logical Volume Size       17763835        Expansion Size Limit            18505728
Host name                 "HSJ004"        Host type, avail                HSJ5, yes
Alternate host name       "HSJ005"        Alt. type, avail                HSJ5, yes
Allocation class          1

Volume label              "CL1_RD09_182"  Relative volume number          0
Cluster size              18              Transaction count               896
Free blocks               5740218         Maximum files allowed           467469
Extend quantity           5               Mount count                     1
Mount status              System          Cache name                      "_$1$DUA182:XQPCACHE"
Extent cache size         64              Maximum blocks in extent cache  574021
File ID cache size        64              Blocks in extent cache          573444
Quota cache size          0               Maximum buffers in FCP cache    4240
Volume owner UIC          [1,1]           Vol Prot                        S:RWCD,O:RWCD,G:RWCD,W:RWCD

Volume Status: ODS-2, subject to mount verification, protected subsystems
enabled, write-through caching enabled.

No activity on the HSJ50 consoles. No unusual network activity.

This appears to have started around the time that we upgraded from V7.3-2 to V8.3.

The machine is scheduled for a reboot tomorrow evening to remove the quorum disk, and for other changes, so the matter will be, as Spock would say, rendered academic.

Hoff · 06-18-2008

Ok. You have an HSJ, and not an HSZ.

I don't see anything obvious in the settings.

Usual shot-gun for weirdnesses: Check the HSJ firmware, the SRM firmware, and the OpenVMS ECOs.

But then if you're removing the quorum disk, set your votes and expected votes and disk quorum values appropriately, and be done with it.
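
To see why losing the quorum disk stalls even a one-node cluster with the posted settings, here is a small Python sketch of the vote arithmetic. It assumes the usual OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2 with the fraction discarded; the function names are made up for illustration.

```python
def quorum(expected_votes):
    """Quorum as (EXPECTED_VOTES + 2) / 2, truncated (assumed OpenVMS rule)."""
    return (expected_votes + 2) // 2


def has_quorum(expected_votes, node_votes, qdisk_votes, qdisk_reachable):
    """True when the votes currently present meet or exceed quorum."""
    current = node_votes + (qdisk_votes if qdisk_reachable else 0)
    return current >= quorum(expected_votes)


# Posted settings: EXPECTED_VOTES=2, VOTES=1, QDSKVOTES=1.
with_disk = has_quorum(2, 1, 1, qdisk_reachable=True)
without_disk = has_quorum(2, 1, 1, qdisk_reachable=False)

# After removing the quorum disk: EXPECTED_VOTES=1, VOTES=1, no disk votes.
single_node = has_quorum(1, 1, 0, qdisk_reachable=False)

print(with_disk, without_disk, single_node)
```

With EXPECTED_VOTES at 2 the node needs both votes, so any hiccup on the quorum disk path pauses activity; with EXPECTED_VOTES and VOTES both at 1 and DISK_QUORUM blank, the node holds quorum by itself.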

Anton van Ruitenbeek · 06-25-2008

pcseunix,

How many nodes are in the cluster?
Do all nodes run the same VMS version?

AvR

Measurement is knowledge, but you need to know how to measure!

pcseunix · 06-25-2008

We have removed the quorum disk on our 1-node cluster, and now the CNXMAN messages have gone away.
