venkat_7
Frequent Advisor

Reboot after Panic

My server has crashed many times. Please suggest a solution for this problem. How do I prevent it from rebooting again? Is there a patch I need to install?
My server model: HP 9000, K370, 800 series, HP-UX 11.00

/etc/shutdownlog which says:
=============================

23:22 Sun Apr 14 2002. Reboot after panic: , isr.ior = 0'10340003.0'afebe1c8
06:12 Mon Apr 15 2002. Reboot after panic: , isr.ior = 0'10240023.0'ceb3b210
18:04 Fri Apr 19, 2002. Halt:
02:16 Mon Apr 22 2002. Reboot after panic: , isr.ior = 0'10340003.0'b22c9a20
10:17 Mon Apr 22 2002. Reboot after panic: , isr.ior = 0'240001.0'cff3e438
11:14 Mon Apr 22 2002. Reboot after panic: , isr.ior = 0'240001.0'cff3e438
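
A quick way to summarize a log like the one above is sketched below. It counts the "Reboot after panic" entries and groups them by isr.ior value, which makes repeated panics with the same signature easy to spot. The function name panic_summary is hypothetical, and the default path is taken from the post.

```shell
#!/bin/sh
# Hypothetical helper: summarize panic entries in a shutdownlog-style file.
# Usage: panic_summary /etc/shutdownlog
panic_summary() {
    log="${1:-/etc/shutdownlog}"
    # Count the lines that record a reboot after a panic.
    count=$(grep -c 'Reboot after panic' "$log")
    echo "panic reboots: $count"
    # List each distinct isr.ior value with the number of times it occurred,
    # most frequent first.
    grep 'Reboot after panic' "$log" |
        sed 's/.*isr\.ior = //' |
        sort | uniq -c | sort -rn
}
```

A value that repeats, as in the last two entries above, suggests the same fault is recurring rather than several unrelated ones.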

cat /var/adm/crash/crash.4/INDEX
=================================
modelname 9000/898/K370
panic , isr.ior = 0'240001.0'cff3e438
dumptime 1019484304 Mon Apr 22 10:05:04 EDT 2002
savetime 1019488361 Mon Apr 22 11:12:41 EDT 2002
release @(#)B2352B/9245XB HP-UX (B.11.00) #1: Wed Nov 5 22:38:19 PST 1997

memsize 2147463168
chunksize 33554432
module /stand/vmunix vmunix 16699776 1979218741
image image.1.1 0x0000000000000000 0x0000000001ffd000 0x0000000000000000 0x00000000000026df 3965744714
image image.1.2 0x0000000000000000 0x0000000001ff9000 0x00000000000026e0 0x00000000000046d7 4215789036
image image.1.3 0x0000000000000000 0x0000000001ff9000 0x00000000000046d8 0x00000000000066cf 1487303440
image image.1.4 0x0000000000000000 0x0000000001ff9000 0x00000000000066d0 0x00000000000086c7 1753854333
image image.1.5 0x0000000000000000 0x0000000001fef000 0x00000000000086c8 0x0000000000078bcf 3144202866
image image.1.6 0x0000000000000000 0x0000000001ff9000 0x0000000000078bd0 0x000000000007abc7 3532251890
image image.1.7 0x0000000000000000 0x000000000145a000 0x000000000007abc8 0x000000000007fffa 3875253367


Regards
venkat
Santosh Nair_1
Honored Contributor
Solution

Re: Reboot after Panic

First, you should run a Q4 analysis to find out why your machine is panicking. Unless you know how to read crash dumps, chances are that you'll need to send the dumps to HP for analysis and have them come up with a resolution for you.

Below is the procedure to analyze a dump:

http://us-support2.external.hp.com/cki/bin/doc.pl/sid=8dcc1f6d191d873cd1/screen=ckiSearchResults

-Santosh
Life is what's happening while you're busy making other plans
Helen French
Honored Contributor

Re: Reboot after Panic

Hi Venkat:

Check this document and its solution, which is to apply patch PHKL_25021. It refers to the same problem (TKB # 2200156074):

http://us-support2.external.hp.com/cki/bin/doc.pl/sid=4f2a62b4005f6b3d70/screen=ckiDisplayDocument?docId=200000057223328

HTH,
Shiju
Life is a promise, fulfill it!
pap
Respected Contributor

Re: Reboot after Panic

Hi,
Looking at the error code in /etc/shutdownlog, it seems the error is related to MC/ServiceGuard. I faced the same problem, and I increased the NODE_TIMEOUT parameter from its default value of 2 seconds to 8 seconds.

Please try changing the parameter and you should be fine.

Under heavy network traffic, it is sometimes not possible for a node to transmit heartbeat signals to all the other nodes within the default time of 2 seconds. Failure to do so causes the machine to reboot, as per ServiceGuard functionality.

-pap
"Winners don't do different things , they do things differently"
venkat_7
Frequent Advisor

Re: Reboot after Panic

Hi,

Please let me know where I need to modify the NODE_TIMEOUT parameter.

Regards
venkat
Domenico_5
Respected Contributor

Re: Reboot after Panic

For a ServiceGuard cluster to ensure that all applications are
running, cluster nodes must detect when a member node fails. This is
done by cluster nodes periodically sending a 40-byte heartbeat packet to
the other nodes. If nodes fail to receive a heartbeat from a given node
within NODE_TIMEOUT, a cluster re-formation is initiated.
In some cases, a node may even TOC (reboot) because it is still
too busy to join the re-forming cluster.

In practice, the factory default setting for NODE_TIMEOUT of 2 seconds
(2000000 microseconds per the cluster configuration template file) is
often too short for systems under heavy load. Systems servicing
kernel-priority processes may postpone the lower-priority heartbeat
generation process, which in turn delays transmission of the
heartbeat. To counteract this, simply increase the
NODE_TIMEOUT value.
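
Because the file stores NODE_TIMEOUT in microseconds, it can help to print the current value in seconds before changing it. The sketch below does that. The function name show_node_timeout is hypothetical, and the default path /etc/cmcluster/cmclconfig.ascii is only an assumption, since the file may be named differently on your cluster.

```shell
#!/bin/sh
# Hypothetical helper: print NODE_TIMEOUT from a cluster ASCII file in
# microseconds and in whole seconds. The default path is an assumption.
# Usage: show_node_timeout /etc/cmcluster/cmclconfig.ascii
show_node_timeout() {
    conf="${1:-/etc/cmcluster/cmclconfig.ascii}"
    # Use the first uncommented NODE_TIMEOUT line; field 2 holds the value.
    awk '$1 == "NODE_TIMEOUT" {
             printf "%d microseconds = %d seconds\n", $2, $2 / 1000000
             exit
         }' "$conf"
}
```

Comment lines in the file start with "#", so they are skipped by the exact match on the first field.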

The process:

1) # cd /etc/cmcluster

2) Edit the cluster configuration template file used to configure the
cluster. There are no naming conventions for the file; however, it
is usually found in /etc/cmcluster on one of the cluster nodes. It
may be named cmclconfig.ascii. Its header text contains this
banner:

# **********************************************************************
# ********* HIGH AVAILABILITY CLUSTER CONFIGURATION FILE ***************
# ***** For complete details about cluster parameters and how to ****
# ***** set them, consult the cmquerycl(1m) manpage or your manual. ****
# **********************************************************************

If it cannot be found and ServiceGuard version 10.10 or later
is operating, use this command to build a new cluster ASCII
file:

# cd /etc/cmcluster
# cmgetconf CONF
This command builds a file based on the content of the cluster
binary previously built (with a cmapplyconf).

Validate the original/new ASCII file:
# cmcheckconf -C
If this command fails, the current hardware configuration does not
match that discovered when the binary was built. It will be
necessary to correct either the hardware configuration or the
.

Once the cluster configuration file is validated, proceed.

3) Edit this line in the file:

NODE_TIMEOUT 2000000

The HP Response Center recommends changing the value to
8000000 (8 seconds).

4) Write/close the file.

5) Halt the cluster (the cluster configuration file cannot be
checked or applied while the cluster is up
and the NODE_TIMEOUT value has been changed).

# cmhaltcl -f (-f = force cluster packages down)


6) NOTE: In this step, cmcheckconf or cmapplyconf will fail if
the cluster is still running. You must halt the cluster in
order for this step to succeed. If not, you will see messages
of this sort:

Error: Modifying NODE_TIMEOUT value from 2000000 to 8000000
while cluster hpha1 is running is not supported.
cmcheckconf : Unable to verify cluster file: cmclconfig.ascii.
Invalid argument.

Use the cmapplyconf command to validate the configuration file
and build and distribute a new cluster binary file.

# cmapplyconf -C cmclconfig.ascii

NOTE: For ServiceGuard release 10.06 or lower, include the
package configuration files in the cmcheckconf and cmapplyconf
commands (See the manpage for cmcheckconf/cmapplyconf):

# cmapplyconf -C cmclconfig.ascii -P pkg1/config -P pkg2/config [...]

7) Once applied, start the cluster when ready:

# cmruncl
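
The edit in step 3 and a later check of syslog can be sketched as two small shell functions. Both names, set_node_timeout and count_reformations, are hypothetical. The first writes a modified copy of the cluster ASCII file and leaves the original untouched; it does not halt the cluster or run cmapplyconf, so steps 5 to 7 still apply. The second counts the cmcld messages about a newly formed cluster, assuming the standard HP-UX syslog location /var/adm/syslog/syslog.log.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a cluster ASCII file with a new
# NODE_TIMEOUT value (in microseconds). The original file is not modified.
# Usage: set_node_timeout cmclconfig.ascii 8000000 > cmclconfig.ascii.new
set_node_timeout() {
    conf="$1"
    new_value="$2"
    # Rewrite only the uncommented NODE_TIMEOUT line; pass other lines through.
    awk -v v="$new_value" '
        $1 == "NODE_TIMEOUT" { print "NODE_TIMEOUT\t\t" v; next }
        { print }
    ' "$conf"
}

# Hypothetical helper: count cluster re-formation messages in a syslog file.
# Usage: count_reformations /var/adm/syslog/syslog.log
count_reformations() {
    log="${1:-/var/adm/syslog/syslog.log}"
    grep -c 'nodes have formed a new cluster' "$log"
}
```

Review the new file (for example with diff against the original) before applying it, and compare the re-formation count before and after the change over a similar period of load.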

Over time, syslog.log should no longer contain messages of
this type:

"cmcld[3256]: 2 nodes have formed a new cluster,sequence #47"