A lot of processes in state RWCAP

Peter Hofman · ‎07-08-2004

Hi again,

Today we analysed a dump of a node of a dual node cluster that seemed to be frozen.
A lot of processes were in state RWCAP, including the CLUSTER_SERVER and OPCOM processes.
We saw that every process has capabilities QUORUM and RUN, but both CPUs only had RUN (and the primary CPU of course had PRIMARY), which explains why so many processes are in state RWCAP.

Now my question is: if the CLUSTER_SERVER is in state RWCAP, will the QUORUM capability of the CPUs ever be set again in the CPU database?

John Gillings · ‎07-08-2004

Peter,

If this is a CLUEXIT bugcheck, then the RWCAP processes are "normal". When quorum is lost, the QUORUM capability is removed from the CPUs; that is how OpenVMS prevents processes from running until quorum has been regained.

Changes to CPU capabilities are made in high IPL interrupt service routines, rather than from the process context of CLUSTER_SERVER, so the fact that it's in RWCAP is not an issue.

Perhaps there is a comms problem which is causing quorum to be lost?


Peter Hofman · ‎07-08-2004

I don't think that is what happened.

The system froze, then it was decided to press the HALT button and crash the system from the prompt.

I doubt that it is a CLUEXIT.

The strange thing is that we do not see any errors in ERRLOG.SYS or in OPERATOR.LOG.
Probably because the processes that should write them are in state RWCAP?

Wim Van den Wyngaert · ‎07-08-2004

Keith's words:

If you were seeing excessive remastering activity, you'd likely spot
processes in RWCAP state.

Wim

Peter Hofman · ‎07-08-2004

I am not familiar with the term remastering.
Can you explain a bit more about what you mean?

Uwe Zessin · ‎07-08-2004

He was talking about moving parts of the lock manager (in-memory) database from one system to another.


Keith Parris · ‎07-09-2004

Remastering might be indicated as a possibility if you saw processes in RWCLU state.

RWCAP here indicates a state of quorum loss. (We know because the CPUs lack the QUORUM capability bit, and all the processes require both QUORUM and RUN in order to be scheduled.)
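To make the capability matching concrete, here is a minimal sketch in Python (not OpenVMS code). The bit values and the `can_schedule` helper are hypothetical, chosen only for illustration; the point is that a process can run on a CPU only if that CPU offers every capability the process requires.

```python
from enum import IntFlag, auto


class Capability(IntFlag):
    # Illustrative bit values only; these are NOT the real OpenVMS definitions.
    PRIMARY = auto()
    QUORUM = auto()
    RUN = auto()


def can_schedule(required: Capability, cpu_caps: Capability) -> bool:
    """True only if the CPU offers every capability the process requires."""
    return (required & cpu_caps) == required


# State described in this thread: processes need QUORUM and RUN,
# but neither CPU currently has the QUORUM bit.
process_needs = Capability.QUORUM | Capability.RUN
cpus = {
    "primary": Capability.PRIMARY | Capability.RUN,
    "secondary": Capability.RUN,
}

runnable_on = [name for name, caps in cpus.items()
               if can_schedule(process_needs, caps)]
print(runnable_on)  # [] : no CPU qualifies, so the process waits (RWCAP)
```

Once the QUORUM bit is restored on a CPU, the same check succeeds and the waiting processes become schedulable again.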

The node could lose quorum if it lost communications with the other node.

What type of cluster interconnects are involved?

Keith Parris · ‎07-09-2004

It might help to also know your voting configuration (i.e. settings for VOTES and EXPECTED_VOTES and DISK_QUORUM and QDSKVOTES parameters on all nodes), and what version of VMS you are running. RECNXINTERVAL might also be handy to know, and QDSKINTERVAL if you have a quorum disk.
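As a rough illustration of why the voting configuration matters, the sketch below uses the documented quorum formula, quorum = (EXPECTED_VOTES + 2) / 2 with integer division. The vote counts are hypothetical examples, not this cluster's actual settings, and the real connection manager also adjusts quorum dynamically, which this ignores.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES, using integer division."""
    return (expected_votes + 2) // 2


def has_quorum(visible_votes: int, expected_votes: int) -> bool:
    """A node keeps quorum while the votes it can see reach the quorum value."""
    return visible_votes >= quorum(expected_votes)


# Hypothetical case 1: two nodes with VOTES=1 each, EXPECTED_VOTES=2.
# A node that loses contact with its partner sees only its own vote.
print(quorum(2), has_quorum(1, 2))  # 2 False

# Hypothetical case 2: the same two nodes plus a quorum disk with
# QDSKVOTES=1, EXPECTED_VOTES=3. One node plus the quorum disk is enough.
print(quorum(3), has_quorum(2, 3))  # 2 True
```

In the first case, a lone surviving node would hang with quorum lost, which matches the RWCAP picture described earlier in the thread.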

Things that might conceivably cause a hang and quorum loss include:
o Bad hardware generating a steady stream of interrupts such that you're stuck up at hardware interrupt IPL
o Software problem or overload that keeps the Primary CPU saturated at or above IPL 8, so things like PEDRIVER Hello messages don't get sent out and communications links look like they're broken as a result. (And if this occurred on the other node, making it uncommunicative, that might cause this node to lose quorum, if it didn't have enough votes by itself.)

It might help to look at the console output (if you have a console printer or a console management system that catches that), or at the console output or the OPERATOR.LOG file on the other node in the cluster. I'd be looking for things like messages from the Connection Manager about connection loss or quorum loss or state transition events, or from PEDRIVER (if you use the LAN as your cluster interconnect) about excessive packet loss.

Did you have a performance management data collector (like DECps or ECP or T4) running at the time? Sometimes those can give you clues as to what happened (especially just before the time of the hang -- because during the hang itself, you will probably be missing data).

Peter Hofman · ‎07-11-2004

Thanks all for the replies.

It turns out I was misinformed when I got access to the dump file. It is from the second node of a dual node cluster, which was crashed manually just after the first one.
So, I have been following the wrong leads.
I was already wondering why I could not see anything in the operator.log or the errlog.sys.

Thanks anyway.
