Re: Cluster member crash has high impact on other nodes in the cluster

Ian Miller. · ‎09-08-2010

Do also consider reconfiguring to put the dump file on local storage (SAN).

Don't rush to make changes.

____________________
Purely Personal Opinion

Toine_1 · ‎09-08-2010

Hi,

Thank you all for the good answers. Very helpful, as always.

HP advised us to disable the CPE monitoring via a SYSGEN parameter as a workaround.

SYSGEN> use active
SYSGEN> set crd_control %x80016
SYSGEN> write active
SYSGEN> use current
SYSGEN> set crd_control %x80016
SYSGEN> show crd_control
SYSGEN> write current
SYSGEN> exit

Regards,

Toine

Hoff · ‎09-08-2010

The suggested remediation from HP implies that there are buckets of memory errors arising here, and that's something I've definitely encountered on a few Integrity and AlphaServer boxes over the years. RAS features or not, memory errors can cause instabilities on both Integrity and AlphaServer boxes. On any box, for that matter. And the errors don't always get overtly logged; you have to go look for them.

John Gillings · ‎09-08-2010

Toine,

> SYSGEN> set crd_control %x80016

Hmmm, someone in HP support needs to be taught "Balmer's Rule" (someone, somewhere on the planet may laugh... :-)

You've adjusted the parameter value correctly (USE/SET/WRITE), but if you don't want a surprise sometime in the future when you've forgotten about this thread, you should also add that SET command to MODPARAMS.DAT, commented, with your name, date, reference to the HP service case, and maybe even the URL of this thread.
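As a rough sketch of what such an entry might look like: MODPARAMS.DAT takes parameter assignments rather than SYSGEN SET commands, and `!` starts a comment. The items in angle brackets are placeholders to fill in, not values taken from this thread.

```
! CRD_CONTROL changed as a workaround advised by HP support
! (disables CPE monitoring; same value as set via SYSGEN).
! Changed by:      <your name>
! Date:            <date of change>
! HP service case: <case reference>
! Discussion:      <URL of this thread>
CRD_CONTROL = %X80016
```

AUTOGEN reads this file on its next run, so the documented value is carried forward instead of being recalculated without explanation.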

System configurations are complex things, and changes often have unintended consequences. As an example, in this case it may have been helpful to know why your cluster has a non-default RECNXINTERVAL. Who decided on the value, when, and why? If your system doesn't have a clearly documented MODPARAMS.DAT, please start now.

Also, it just occurred to me that, since your cluster has 8 nodes, you should check that RECNXINTERVAL is consistent across all nodes. I'd expect that the resultant delay would be the longest across the cluster (of course, inconsistencies like that SHOULD be detected and notified at cluster formation, but engineering has never considered it important enough to expend resources implementing a proper cluster consistency check :-(
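One way to make that comparison, sketched here on the assumption that SYSMAN is run from a suitably privileged account on one member of the cluster:

```
$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> PARAMETERS SHOW RECNXINTERVAL
SYSMAN> EXIT
```

SYSMAN reports the parameter separately for each node in the environment, so a member with a different value should stand out.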

A crucible of informative mistakes
