Operating System - OpenVMS

Re: Cluster member crash has high impact on other nodes in the cluster

 
Ian Miller.
Honored Contributor

Do also consider reconfiguring to put the dump file on local storage (SAN).

Don't rush to make changes.
____________________
Purely Personal Opinion
Toine_1
Regular Advisor

Hi,

Thank you all for the good answers. Very helpful, as always.

HP advised us to disable the CPE monitoring via a SYSGEN parameter as a workaround.


SYSGEN> use active
SYSGEN> set crd_control %x80016
SYSGEN> write active
SYSGEN> use current
SYSGEN> set crd_control %x80016
SYSGEN> show crd_control
SYSGEN> write current
SYSGEN> exit

Regards,

Toine
Hoff
Honored Contributor

The suggested remediation from HP implies that there are buckets of memory errors arising here, and that's something I've definitely encountered on a few Integrity and AlphaServer boxes over the years. RAS features or not, memory errors can cause instabilities on both Integrity and AlphaServer boxes. On any box, for that matter. And the errors don't always get overtly logged; you have to go look for them.
John Gillings
Honored Contributor

Toine,

> SYSGEN> set crd_control %x80016

Hmmm, someone in HP support needs to be taught "Balmer's Rule" (someone, somewhere on the planet may laugh... :-)

You've adjusted the parameter value correctly (USE/SET/WRITE), but if you don't want a surprise sometime in the future when you've forgotten about this thread, you should also add that SET command to MODPARAMS.DAT, commented, with your name, date, reference to the HP service case, and maybe even the URL of this thread.
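
As a rough sketch of what such an entry might look like: MODPARAMS.DAT takes a plain assignment rather than the SYSGEN SET syntax, and comment lines start with "!". The name, date, and case reference below are placeholders to fill in, and this assumes AUTOGEN accepts the %X hex form here as SYSGEN does; check against your own AUTOGEN documentation before relying on it.

```
! CRD_CONTROL changed as a workaround advised by HP support
! (disables CPE monitoring; see SYSGEN commands in this thread)
! Changed by: <your name>    Date: <date>
! HP service case: <case reference>
! Discussion: <URL of this thread>
CRD_CONTROL = %X80016
```

With the entry in place, a later AUTOGEN run should carry the value forward instead of silently reverting it to the default.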

System configurations are complex things, and changes often have unintended consequences. As an example, in this case it may have been helpful to know why your cluster has a non-default RECNXINTERVAL. Who decided on the value, when, and why? If your system doesn't have a clearly documented MODPARAMS.DAT, please start now.

Also, it just occurred to me, that since your cluster has 8 nodes, you should check that RECNXINTERVAL is consistent across all nodes. I'd expect that the resultant delay would be the longest across the cluster (of course, inconsistencies like that SHOULD be detected and notified at cluster formation, but engineering has never considered it important enough to expend resources implementing a proper cluster consistency check :-(
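
One way to do that check from a single node is via SYSMAN, roughly as sketched here. This assumes you have the privileges to run SYSMAN cluster-wide and shows the active (in-memory) values; use PARAMETERS USE CURRENT instead to see what is stored on disk for the next boot.

```
$ MCR SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> PARAMETERS USE ACTIVE
SYSMAN> PARAMETERS SHOW RECNXINTERVAL
SYSMAN> EXIT
```

SYSMAN repeats the display for each node in the environment, so any member with a different value should stand out.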
A crucible of informative mistakes