
A lot of processes in state RWCAP

 
Peter Hofman
Frequent Advisor

Hi again,

Today we analysed a dump from one node of a dual-node cluster that seemed to be frozen.
A lot of processes were in state RWCAP, including the CLUSTER_SERVER and OPCOM processes.
We saw that every process requires the capabilities QUORUM and RUN, but both CPUs only had RUN (and the primary CPU, of course, also had PRIMARY), which explains why so many processes are in state RWCAP.

Now my question is: if the CLUSTER_SERVER process is itself in state RWCAP, will the QUORUM capability of the CPUs ever be set again in the CPU database?
John Gillings
Honored Contributor

Re: A lot of processes in state RWCAP

Peter,

If this is a CLUEXIT bugcheck, then the RWCAP processes are "normal". When quorum is lost, the QUORUM capability is removed from the CPUs; that is how OpenVMS prevents processes from running until quorum has been regained.

Changes to CPU capabilities are made in high-IPL interrupt service routines rather than from the process context of CLUSTER_SERVER, so the fact that CLUSTER_SERVER is in RWCAP is not an issue.

Perhaps there is a comms problem which is causing quorum to be lost?
A crucible of informative mistakes
Peter Hofman
Frequent Advisor

Re: A lot of processes in state RWCAP

I don't think that is what happened.

The system froze; then it was decided to press the HALT button and crash the system from the prompt.

I doubt that it is a CLUEXIT.

The strange thing is that we do not see any errors in ERRLOG.SYS or in OPERATOR.LOG.
Probably because the processes that should log them are in state RWCAP?
Wim Van den Wyngaert
Honored Contributor

Re: A lot of processes in state RWCAP

Keith's words:

If you were seeing excessive remastering activity, you'd likely spot
processes in RWCAP state.
Wim
Peter Hofman
Frequent Advisor

Re: A lot of processes in state RWCAP

I am not familiar with the term remastering.
Can you explain a bit more about what you mean?
Uwe Zessin
Honored Contributor

Re: A lot of processes in state RWCAP

He was talking about moving parts of the lock manager (in-memory) database from one system to another.
Keith Parris
Trusted Contributor

Re: A lot of processes in state RWCAP

Remastering might be indicated as a possibility if you saw processes in RWCLU state.

RWCAP here indicates a state of quorum loss. (We know because the CPUs lack the QUORUM capability bit, and all the processes require both QUORUM and RUN in order to be scheduled.)
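
The scheduling rule described above can be sketched as a bitmask test. This is a minimal illustration, not OpenVMS code: the bit positions and the `can_schedule` helper are hypothetical, chosen only to show why a CPU holding just RUN (or RUN and PRIMARY) cannot run a process that requires both QUORUM and RUN.

```python
# Hypothetical bit positions, for illustration only; the real OpenVMS
# capability bit assignments are not taken from this thread.
PRIMARY = 1 << 0
QUORUM = 1 << 1
RUN = 1 << 2


def can_schedule(process_required: int, cpu_capabilities: int) -> bool:
    """Return True if the CPU offers every capability the process requires."""
    return (process_required & cpu_capabilities) == process_required


process_required = QUORUM | RUN

# CPU capability masks as described for the dump: QUORUM is missing on both.
primary_cpu = PRIMARY | RUN
secondary_cpu = RUN

for name, caps in (("primary", primary_cpu), ("secondary", secondary_cpu)):
    print(name, can_schedule(process_required, caps))

# Once QUORUM is set on a CPU again, the same process becomes schedulable.
print(can_schedule(process_required, secondary_cpu | QUORUM))
```

With neither CPU offering QUORUM, every process that requires it waits in RWCAP until the capability is restored.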

The node could lose quorum if it lost communications with the other node.

What type of cluster interconnects are involved?
Keith Parris
Trusted Contributor

Re: A lot of processes in state RWCAP

It might help to also know your voting configuration (i.e. settings for VOTES and EXPECTED_VOTES and DISK_QUORUM and QDSKVOTES parameters on all nodes), and what version of VMS you are running. RECNXINTERVAL might also be handy to know, and QDSKINTERVAL if you have a quorum disk.
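
To show why those parameters matter, here is a rough sketch of the quorum arithmetic. It assumes the commonly documented formula quorum = (EXPECTED_VOTES + 2) / 2, rounded down, and ignores the other inputs the connection manager takes into account. The vote counts are made-up examples, not the configuration of the cluster in this thread.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(visible_votes: int, expected_votes: int) -> bool:
    """True if the votes this node can currently see meet the quorum."""
    return visible_votes >= quorum(expected_votes)


# Example A (assumed): two nodes with VOTES=1 each, no quorum disk.
# EXPECTED_VOTES=2, so a node that loses contact with its partner
# sees only its own vote and loses quorum.
print(quorum(2), has_quorum(visible_votes=1, expected_votes=2))

# Example B (assumed): the same two nodes plus a quorum disk with QDSKVOTES=1.
# EXPECTED_VOTES=3, so one node plus the quorum disk still meets quorum.
print(quorum(3), has_quorum(visible_votes=2, expected_votes=3))
```

In a two-node cluster without a quorum disk or unequal votes, the loss of either node (or of the link between them) is enough to drop the survivor below quorum.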

Things that might conceivably cause a hang and quorum loss include:
o Bad hardware generating a steady stream of interrupts such that you're stuck up at hardware interrupt IPL
o Software problem or overload that keeps the Primary CPU saturated at or above IPL 8, so things like PEDRIVER Hello messages don't get sent out and communications links look like they're broken as a result. (And if this occurred on the other node, making it uncommunicative, that might cause this node to lose quorum, if it didn't have enough votes by itself.)

It might help to look at this node's console output (if you have a console printer or a console management system that captures it), or at the console output or the OPERATOR.LOG file on the other node in the cluster. I'd be looking for things like messages from the Connection Manager about connection loss, quorum loss or state transition events, or from PEDRIVER (if you use the LAN as your cluster interconnect) about excessive packet loss.

Did you have a performance management data collector (like DECps or ECP or T4) running at the time? Sometimes those can give you clues as to what happened (especially just before the time of the hang -- because during the hang itself, you will probably be missing data).
Peter Hofman
Frequent Advisor

Re: A lot of processes in state RWCAP

Thanks all for the replies.

It turns out I was misinformed when I got access to the dump file: it is from the second node of the dual-node cluster, which was crashed manually just after the first one.
So I have been following the wrong leads.
I was already wondering why I could not see anything in OPERATOR.LOG or ERRLOG.SYS.

Thanks anyway.