
Node keeps crashing

 
Geoff Wild
Honored Contributor


I have a call in with HP - but in the meantime I thought I'd share this with you.

2 node cluster - one runs a prod db and the other test.

The test node crashed for no apparent reason, other than this in the syslog of the prod node:

Dec 6 13:09:24 svr3001 cmcld: New node svr3000 is joining the cluster
Dec 6 13:09:24 svr3001 cmcld: Attempting to adjust cluster membership
Dec 6 13:09:24 svr3001 cmcld: Beginning standard partial election
Dec 6 13:09:28 svr3001 cmcld: Enabling safety time protection
Dec 6 13:09:28 svr3001 cmcld: Clearing Cluster Lock
Dec 6 13:09:30 svr3001 cmcld: 2 nodes have formed a new cluster, sequence #18
Dec 6 13:09:30 svr3001 cmcld: The new active cluster membership is: svr3001(id=2), svr3000(id=1)
Dec 6 13:09:30 svr3001 cmcld: Package ilogtest cannot run on this node because switching has been disabled for this node
Dec 6 13:09:31 svr3001 cmcld: One or more packages is not currently running because AUTO_RUN is disabled so that it cannot start automatically. To start these packages, enable AUTO_RUN via cmmodpkg -e .



Dec 6 13:11:20 svr3001 cmcld: Timed out node svr3000. It may have failed.
Dec 6 13:11:20 svr3001 cmcld: Attempting to adjust cluster membership
Dec 6 13:11:20 svr3001 cmcld: Beginning standard partial election
Dec 6 13:11:22 svr3001 cmclconfd[13667]: Updated file /var/adm/cmcluster/frdump.cmcld.3 for node svr3001 (length = 512096).
Dec 6 13:11:30 svr3001 cmcld: Obtaining Cluster Lock
Dec 6 13:11:31 svr3001 cmcld: Turning off safety time protection since the cluster
Dec 6 13:11:31 svr3001 cmcld: may now consist of a single node. If Serviceguard
Dec 6 13:11:31 svr3001 cmcld: fails, this node will not automatically halt
Dec 6 13:11:31 svr3001 cmcld: This will not affect the behavior of Package Failfast
Dec 6 13:11:31 svr3001 cmcld: or Service Failfast. If such a package or service fails,
Dec 6 13:11:31 svr3001 cmcld: safety timer will be re-enabled and this node will
Dec 6 13:11:31 svr3001 cmcld: automatically halt.


The server crashed (the first time) on its own.

Since then, every time I try a cmrunpkg -n svr3000 packtst, it crashes.

And the kicker - NO CRASHDUMP!

/var/adm/crash is configured, as well as dump:


# lvlnboot -v
Boot Definitions for Volume Group /dev/vg00:
Physical Volumes belonging in Root Volume Group:
/dev/dsk/c1t2d0 (0/0/1/1.2.0) -- Boot Disk
/dev/dsk/c2t2d0 (0/0/2/0.2.0) --Boot
Boot: lvol1 on: /dev/dsk/c1t2d0
/dev/dsk/c2t2d0
Root: lvol3 on: /dev/dsk/c1t2d0
/dev/dsk/c2t2d0
Swap: lvol2 on: /dev/dsk/c1t2d0
/dev/dsk/c2t2d0
Dump: lvol2 on: /dev/dsk/c1t2d0, 0



in /etc/rc.config.d/savecrash

SAVECRASH=1

SAVECRASH_DIR=/var/adm/crash

and in /etc/rc.config.d/crashconf

CRASHCONF_ENABLED=1

Last line in svr3000 (test) syslog before crash:

Dec 6 13:10:56 svr3000 CM-packtest[9869]: cmmodnet -a -i 192.44.162.196 192.44.160.0

Last line of package log file - shows it calling another script to startup Oracle...
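
To put the two syslogs on one timeline, here is a minimal Python sketch that measures the gaps between three of the entries quoted above. It assumes the clocks on svr3000 and svr3001 are in sync, which the thread does not confirm. The helper names (event_time, gap_seconds) are made up for this illustration and are not part of Serviceguard.

```python
import re
from datetime import datetime

# Matches the leading "Mon D HH:MM:SS host " part of a syslog line
# and captures only the time of day.
SYSLOG_TIME = re.compile(r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s+\S+\s")


def event_time(line):
    """Return the time of day of a syslog line (hypothetical helper)."""
    match = SYSLOG_TIME.match(line)
    if match is None:
        raise ValueError("not a syslog line: %r" % line)
    return datetime.strptime(match.group(1), "%H:%M:%S")


def gap_seconds(earlier, later):
    """Seconds between two syslog lines logged on the same day."""
    return int((event_time(later) - event_time(earlier)).total_seconds())


# Lines copied from the syslog excerpts in this post.
formed = "Dec 6 13:09:30 svr3001 cmcld: 2 nodes have formed a new cluster, sequence #18"
last_test_entry = "Dec 6 13:10:56 svr3000 CM-packtest[9869]: cmmodnet -a -i 192.44.162.196 192.44.160.0"
timed_out = "Dec 6 13:11:20 svr3001 cmcld: Timed out node svr3000. It may have failed."

print(gap_seconds(formed, timed_out))           # 110
print(gap_seconds(last_test_entry, timed_out))  # 24
```

Under that assumption, svr3000 wrote its last syslog entry 24 seconds before svr3001 declared it timed out, and the two-node cluster lasted 110 seconds after forming.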

In all my years of ServiceGuard I have never seen something like this before...

Rgds...Geoff


Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
5 REPLIES
Patrick Wallek
Honored Contributor

Re: Node keeps crashing

Is your networking on svr3000 OK? It appears that the heartbeat may be timing out.
Geoff Wild
Honored Contributor

Re: Node keeps crashing

As far as I know the network is fine - I did 1000 pings to both the primary and both hb IPs, both ways, with 0% packet loss...

Rgds...Geoff

Sameer_Nirmal
Honored Contributor

Re: Node keeps crashing

Hi Geoff,

It seems that the test node initiated the TOC on its own.

I would just start the node using cmrunnode. This is only to confirm that the node's cluster services work under the configured cluster environment.
Further debugging could then be done at the package level, i.e. cluster services, application startup and monitoring.

I am wondering about the cmmodnet line in syslog. It may be that the command is not completing, perhaps because it hung.
I would check for a package IP conflict. Since the node crashed before, it may be necessary
to clean up the package IP using cmmodnet -r before starting the package.

Did you check the /etc/shutdownlog on this node?

You can check the SGFR as well using cmfmtfr

Geoff Wild
Honored Contributor

Re: Node keeps crashing

Well - turns out I have a bad I/O board: there is an HPMC in /var/tombstones.

Rgds...Geoff

Geoff Wild
Honored Contributor

Re: Node keeps crashing

Case seems solved - up over 30 minutes and no crash....

Rgds...Geoff