Operating System - OpenVMS

cluster transition timeout??

 
TMcB
Super Advisor


Hi there
Any time we have to remove a node from our cluster, it causes issues on the other nodes. It appears that adding or removing cluster nodes trashes a lot of our detached processes. They all fail with:

%RMS-E-RRF, recovery unit recovery failed
-RMS-F-ACC_AIJ, after image journal can not be accessed
%RMS-F-BUG, fatal RMS condition (00000004), process deleted

We see the same effect every time nodes are added to or removed from the cluster. It seems to depend on which AIJ files are being accessed at the time.
Is there some kind of timeout that we can extend to make things a bit more persistent/resilient during the cluster transitions? Thanks
The Brit
Honored Contributor

Re: cluster transition timeout??

Just a thought.

Are these files located on disks which are being served by the system which is shutting down??
John Gillings
Honored Contributor

Re: cluster transition timeout??

TMcB,

Lots more detail is required here for anything better than wild guesses: OpenVMS version, architecture, number of cluster nodes, storage technology, and the locations of the RMS data files and journals (direct access or served disks?).

I suspect RMS is permanently losing access to the journal files, so timeouts won't help. Also realise that during a cluster state transition, all user mode processes are suspended, so timeouts don't apply.

A crucible of informative mistakes
TMcB
Super Advisor

Re: cluster transition timeout??

thanks for replying.

The cluster has 6 nodes, all Alpha servers running OpenVMS 8.3. The disks are all direct-access disks on the SAN, and the RMS files are on these.

We've also received the same RMS error messages a few other times when not going through a cluster transition.
John Gillings
Honored Contributor

Re: cluster transition timeout??

T,

>We've also received the same RMS error
>messages a few other times when not going
>through cluster transition

In other words, the issue is independent of the cluster state transition... Perhaps a problem with the SAN which is exacerbated by a state transition, maybe because the nodes stop talking to the SAN for a while?

The error message ACC_AIJ means exactly what it says: the system can't see the AIJ disk.

I'd be looking at logs for errors on the SAN, and/or re-checking the SAN configuration.
Volker Halle
Honored Contributor

Re: cluster transition timeout??

TMcB,

From the accounting records, you can find the exact time at which those processes were deleted. Check for messages in OPERATOR.LOG at those times. Also check ERRLOG.SYS; there should be non-fatal RMS bugcheck entries reported.

Find out whether any other events are reported at exactly the same time as those RMS errors.

Volker.
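
The correlation step Volker describes (take the deletion time from accounting, then look for anything logged at about the same moment) can be sketched off-node in Python. This is only a rough illustration under stated assumptions: it assumes OPERATOR.LOG events begin with a banner line of the form `%%%%%%%%%%%  OPCOM  dd-MMM-yyyy hh:mm:ss.cc  %%%%%%%%%%%`, the sample log text and node names are invented, and the function names (`parse_vms_time`, `opcom_events`, `events_near`) are hypothetical helpers, not part of any OpenVMS tool.

```python
import re
from datetime import datetime, timedelta

# VMS absolute time as assumed to appear in OPCOM banners, e.g. 15-JAN-2009 10:15:32.45
VMS_TIME = re.compile(r"(\d{1,2})-([A-Z]{3})-(\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{2})")
MONTHS = {name: num for num, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_vms_time(text):
    """Return the first VMS-style timestamp found in text, or None."""
    m = VMS_TIME.search(text)
    if m is None or m.group(2) not in MONTHS:
        return None
    day, mon, year, hh, mm, ss, cc = m.groups()
    # The two trailing digits are hundredths of a second.
    return datetime(int(year), MONTHS[mon], int(day),
                    int(hh), int(mm), int(ss), int(cc) * 10000)


def opcom_events(log_text):
    """Split log text into (timestamp, message) pairs, one per OPCOM banner."""
    events, stamp, body = [], None, []
    for line in log_text.splitlines():
        if "OPCOM" in line and line.lstrip().startswith("%%%"):
            if stamp is not None:
                events.append((stamp, "\n".join(body).strip()))
            stamp, body = parse_vms_time(line), []
        elif stamp is not None:
            body.append(line)
    if stamp is not None:
        events.append((stamp, "\n".join(body).strip()))
    return events


def events_near(events, when, window_seconds=60):
    """Return the events logged within window_seconds of the given time."""
    window = timedelta(seconds=window_seconds)
    return [(t, msg) for t, msg in events if abs(t - when) <= window]


# Invented sample data, for illustration only.
sample = """\
%%%%%%%%%%%  OPCOM  15-JAN-2009 10:15:32.45  %%%%%%%%%%%
Message from user SYSTEM on NODEA
(hypothetical message text A)

%%%%%%%%%%%  OPCOM  15-JAN-2009 11:40:02.10  %%%%%%%%%%%
Message from user SYSTEM on NODEB
(hypothetical message text B)
"""

# Deletion time as it might be read from an accounting record (made up here).
deleted_at = parse_vms_time("15-JAN-2009 10:15:40.00")
for t, msg in events_near(opcom_events(sample), deleted_at):
    print(t, msg.splitlines()[0])
```

Widening `window_seconds` may help if the clocks on the cluster nodes are not closely synchronised; the same approach could be applied to a translated error log, provided its timestamps are in the same format.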