how to understand why the OS reboots

mustafa_12 · ‎11-18-2005

Two days ago, I noticed that one node of our cluster had rebooted. When I looked at the operator log, I saw that the server was rebooted on 18.11.2005 between 03:53 AM and 04:03 AM.
In the "serverX::sys$manager:operator.log" file, there is an error message logged just before the server reboot:

%%%%%%%%%%% OPCOM 18-NOV-2005 03:51:00.28 %%%%%%%%%%%
Message from user INTERnet on serverX
INTERnet ACP Error Status = %SYSTEM-F-INSFMEM

This is a fatal error, but I do not think it could cause a system reboot. Has anybody experienced this kind of error followed by a system reboot?

If this error is not the reason, is it possible to find out who or what caused the system reboot?

Thank you very much...

Heinz W Genhart · ‎11-18-2005

I think you have insufficient nonpaged pool.
Increasing it could solve your problem.
Do a SHOW MEMORY/POOL/FULL and check the initial size and the current size. You will probably see that the current size is larger than the initial size.

Heinz

Volker Halle · ‎11-19-2005

Mustafa,

to understand why OpenVMS (VAX or Alpha) crashes (and reboots), you need to look at the information in the system dump file.

You didn't say whether you're running OpenVMS VAX, Alpha or I64. Here are the necessary steps to find out more about the crash reason:

OpenVMS Alpha and I64:

$ TYPE CLUE$HISTORY

Crash history file, 1 line per crash. For each crash, there is an additional CLUE file containing more detailed information in: CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS

OpenVMS VAX:

$ CLUE:==$CLUE
$ CLUE/DISPLAY will show crash history.

File CLUE$OUTPUT:CLUE$LAST_node.LIS will contain information about the most recent crash (or shutdown). You can also extract this kind of information from the last 50 crashes with the CLUE utility.

The most important info about the crash is:

Bugcheck type
Module and offset of crash PC

There is much more information in the CLUE files and even more in the dump file, but just let us know those two items and we can tell you more.

If you like, you could attach the CLUE file as a .TXT attachment in a reply.

A nonpaged pool shortage could be a reason for a system crash, depending on the circumstances under which a packet from nonpaged pool is needed.

Volker.

Ian Miller. · ‎11-19-2005

There may also be an entry in the system error log.

____________________
Purely Personal Opinion

mustafa_12 · ‎11-19-2005

Heinz,
show mem /pool /full gives me

Current Size (MB) 24.18
Initial Size (MB) 6.68
Maximum Size (MB) 33.46
Free Space (MB) 9.15

What happens if the nonpaged dynamic memory reaches its maximum limit? Volker said that it depends on the circumstances. As a programmer, I have written a lot of programs, but I have never seen a case where a program that cannot allocate memory from the system, or that tries to access a NULL address, crashes the system. It only gets a segmentation fault. OpenVMS must be more robust than that.

Volker,

when I enter $ TYPE CLUE$HISTORY, the last line is:

18-NOV-2005 03:53 V7.3-2 AlphaServer DS25 serverX CLUEXIT NULL xxxxxxxx SYS$CLUSTER

From that line, can I conclude that it was definitely a "crash" and not a normal system reboot? I am asking because the SYSTEM password is known not only to me but also to other system admins, so I suspected that one of them did a shutdown. However, when I see the result of "type clue$history", I understand that it was a crash. Is that correct?

Although there is crash information for 18-nov-2005 in "type clue$history", I could not find any file named "clue$serverX_181105_xxxx.lis". What may be the reason?

From the output, you can see it is an AlphaServer DS25. I would be grateful if you could send me more information on how to analyze this further.

Ian,

by system error log, do you mean sys$errorlog:errlog.sys? How can I analyze it?

Thank you all...

Ian Miller. · ‎11-19-2005

A CLUEXIT bugcheck implies that the node lost communication with the other nodes in the cluster and that the other nodes then continued without it. When cluster communication was re-established, this node performed a CLUEXIT bugcheck (node voluntarily leaving the cluster) to prevent data inconsistencies.

I mentioned the error log in case there was no CLUE entry.

There will be a file in SYS$ERRORLOG called CLUE$*.LIS, with the name containing the date and time of the crash. Can you post it here? Also look in the operator logs of the other nodes, as there will be messages about cluster communications being lost.


mustafa_12 · ‎11-19-2005

I included the lines logged in operator.log before the crash in my original posting. There is an insufficient memory error.

Additionally, when I

analy /crash SYS$SYSROOT:[SYSEXE]SYSDUMP.DMP
SDA> show memory /full /pool

Nonpaged Dynamic Memory (Lists + Variable)
Current Size (MB) 33.46
Initial Size (MB) 6.68
Maximum Size (MB) 33.46
Free Space (MB) 0.37

So, nonpaged dynamic memory seems to be full. But I wonder whether that can cause a system crash.

In the previous posting, I said that I could not find the specific CLUE file. But I was wrong: I have found it and posted it as an attachment. I have removed all the processes at the end except one, since its BIO differs from the others: it is 2. What does BIO mean, and what does the value 2 mean? Also in this file, there is

Per-CPU Slot Processor Information:
CPU ID 00 CPU State rc,pa,pp,cv,pv,pmv,pl
CPU Type EV68CB Pass 4.0 (21264C)
PAL Code 1.98-42 Halt PC 00000000.20000000
CPU Revision .... Halt PS 00000000.00001F00
Serial Number JA23803680 Halt Code "Bootstrap or Powerfail"
Console Vers V6.8-18 Halt Request "Warm Bootstrap Request"

CPU ID 01 CPU State rc,pa,pp,cv,pv,pmv,pl
CPU Type EV68CB Pass 4.0 (21264C)
PAL Code 1.98-42 Halt PC FFFFFFFF.800839E0
CPU Revision .... Halt PS 00000000.00001F04
Serial Number JA23803605 Halt Code "Kernel Mode HALT Instruction"
Console Vers V6.8-18 Halt Request "Remain Halted"

What are "halt code" and "halt request", and what do their values mean here?

best regards...

mustafa_12 · ‎11-19-2005

sorry, here is the attachment...

Volker Halle · ‎11-19-2005

Mustafa,

thanks for posting the CLUE file. So this crash is a CLUEXIT crash. Such a crash happens when a node in a cluster loses its connection to the other nodes for more than RECNXINTERVAL seconds and then succeeds in reconnecting to one of them. The other nodes have timed out this node after RECNXINTERVAL seconds and removed it from the cluster. The node has to voluntarily crash with CLUEXIT to be able to join the cluster again after the reboot following the crash.
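
The timeout rule just described can be written down as a very small model. This is only an illustrative Python sketch, not OpenVMS code: the function name `must_cluexit` and the sample RECNXINTERVAL value are hypothetical, and the real connection manager considers much more than a single outage duration.

```python
# Simplified model of the rule described above. NOT OpenVMS code:
# the function name and the sample values are hypothetical.

def must_cluexit(outage_seconds: float, recnxinterval: float) -> bool:
    """Return True if a node that regains contact has to leave the cluster.

    While the outage is no longer than RECNXINTERVAL, the other members
    are still waiting for the node, so the connection can be resumed.
    After that they have removed the node from the cluster, so when it
    reconnects it must crash with CLUEXIT and rejoin through a reboot.
    """
    return outage_seconds > recnxinterval


if __name__ == "__main__":
    recnxinterval = 20  # seconds; example value only, check your own setting
    for outage in (5, 20, 45):
        if must_cluexit(outage, recnxinterval):
            action = "CLUEXIT bugcheck, then reboot and rejoin"
        else:
            action = "connection resumed, no crash"
        print(f"outage of {outage:>2} s: {action}")
```

The point of the model is that the crash is a consequence of the connection loss, not its cause, so the thing to investigate is why the node was unreachable for that long.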

The CLUE file shows (under Memory Management Statistics) lots of pool allocation failures:

Failed Alloc Requests 13930

First, run @SYS$UPDATE:AUTOGEN SAVPARAMS SETPARAMS FEEDBACK and make sure NPAGEDYN gets increased. Nonpaged pool in the running system has also been expanded quite a lot. Consider setting NPAGEDYN=25000000 at least. And watch nonpaged pool usage in the running system to determine whether there may be a memory leak (if free space keeps decreasing and current size keeps increasing).
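
One way to turn the "free space keeps decreasing, current size keeps increasing" rule of thumb into a check is to record the Current Size and Free Space values at intervals and look at the pool actually in use (current minus free). The Python sketch below assumes exactly that; the helper names and all sample readings are invented for illustration and do not come from this system.

```python
# Hypothetical helper for judging a series of nonpaged pool readings.
# Each reading is (current_size_mb, free_space_mb), noted by hand from
# SHOW MEMORY/POOL/FULL at regular intervals. All values are invented.
from typing import List, Tuple

Sample = Tuple[float, float]


def used_mb(sample: Sample) -> float:
    """Pool in use: current size minus free space."""
    current, free = sample
    return round(current - free, 2)


def looks_like_leak(samples: List[Sample]) -> bool:
    """True if the pool in use grew between every pair of readings.

    Steady growth is only a hint, not proof: a rising workload produces
    the same pattern, so the readings should cover comparable load.
    """
    if len(samples) < 3:
        return False  # too few readings to call it a trend
    used = [used_mb(s) for s in samples]
    return all(later > earlier for earlier, later in zip(used, used[1:]))


if __name__ == "__main__":
    growing = [(10.0, 4.0), (12.0, 3.5), (15.0, 3.0), (19.0, 2.0)]
    steady = [(10.0, 4.0), (10.0, 3.8), (10.0, 4.1), (10.0, 3.9)]
    print("growing:", looks_like_leak(growing))
    print("steady: ", looks_like_leak(steady))
```

Looking at the in-use figure rather than free space alone avoids being misled when the pool expands and free space briefly jumps up again.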

As you've seen, nonpaged pool has been completely consumed and cannot be expanded anymore. This will cause drivers etc. to fail when trying to allocate nonpaged pool and could very well be the reason for connection loss.

Nonpaged pool is typically not used 'directly' from an application, it requires kernel mode code to allocate packets from nonpaged pool (drivers etc.).

BIO is the no. of outstanding buffered IOs.

HALT code and HALT request describe the status of the CPUs. The values you see are normal in a crash. The secondary CPU(s) are being stopped by the bugcheck processing code and the primary CPU requests a reboot after writing the dump.

Volker.

mustafa_12 · ‎11-19-2005

Hi Volker,

Thank you for your very informative answer. The relation between nonpaged dynamic memory and the system crash is much clearer now. I think I have found the application that used up the nonpaged dynamic memory. With "anal /crash dump_file" and "sda> show summary", I get the processes and their states just before the crash. The state of one of the processes looks like this:

202D7A04 0204 ProcessX UserX RWNPG 5 82583140 837A8000 224

So, does it mean that it is using the nonpaged dynamic memory? Also, the state of some other processes is "RWCLU", meaning "cluster state transition". Does it mean that they are migrating from the crashed node to the other nodes?

Again, thank you ALL for your sincere answers...

Volker Halle · ‎11-20-2005

Mustafa,

a process in RWNPG is waiting for nonpaged pool, but it's not necessarily the culprit, even if we assume that there is a real nonpaged pool leak. It could be, though...

Processes in RWCLU are waiting for cluster quorum to be restored.

You can check allocated pool (in the dump or in the running system) with:

SDA> SHOW POOL/NONP/SUMM

to find out which types of packets occupy nonpaged pool and how many of them there are. If it's just a badly tuned system, make sure to increase NPAGEDYN and reboot. Then watch pool consumption over time...

If you can identify certain types of packets, which consume MOST of nonpaged pool, post the SHOW POOL/NONP/SUMM output as an attachment.

Volker.

mustafa_12 · ‎11-26-2005

Hi,

with "anal /crash crash_file" and "sda> show pool /nonp /summ", the largest consumers of nonpaged pool are

LCKCTX 34%
FRK 10%
unknown 9%

What can I conclude from these results?

Following the postings here, I have decided to increase the size of NPAGEDYN. But I wonder about something.

When I enter
"$show mem /full /pool"
it says that the maximum size is 33.46 MB.

However, when I enter
"$mc sysgen show NPAGEDYN"
it says:
current is 7004160 bytes (approx. 7 MB)
default is 1048576 bytes (approx. 1 MB)

Which result is correct?

Another thing: how can I change the value of NPAGEDYN? I think it is possible to use both SYSGEN and SYSMAN. Which one should I use to change the value of NPAGEDYN without having to reboot the system?

Does increasing NPAGEDYN decrease system performance, since it is an area that is not paged out to disk? Is there an optimum value for it?

Thank you.

Volker Halle · ‎11-26-2005

Mustafa,

LCKCTX are Lock Context Blocks, so your system seems to use many locks. If this causes nonpaged pool depletion, the system may not be correctly tuned to support all the applications running on the system.

The size of nonpaged pool is variable within certain ranges:

NPAGEDYN = initial size of nonpaged pool at boot time

NPAGEVIR = max. expansion size while the system is running

Your system has already expanded nonpaged pool to its maximum value (NPAGEVIR), so additional requests for nonpaged pool will fail. This seems to have caused your system to crash.
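
The two outputs quoted in the question agree once the units are lined up: SYSGEN reports NPAGEDYN in bytes, while SHOW MEMORY reports megabytes. A quick check in Python, assuming 1 MB = 1024 * 1024 bytes (an assumption, but the only reading under which the figures match):

```python
# Unit check for the figures quoted in this thread.
# Assumption: SHOW MEMORY's "MB" means 1024 * 1024 bytes.
BYTES_PER_MB = 1024 * 1024


def bytes_to_mb(n_bytes: int) -> float:
    """Convert a byte count to MB, rounded to two decimals."""
    return round(n_bytes / BYTES_PER_MB, 2)


if __name__ == "__main__":
    npagedyn_bytes = 7004160  # "current" value shown by SYSGEN
    print(bytes_to_mb(npagedyn_bytes))  # 6.68, the Initial Size in SHOW MEMORY
```

So 7004160 bytes is the 6.68 MB initial size, and the 33.46 MB maximum is the separate NPAGEVIR limit rather than a conflicting value for NPAGEDYN.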

To tune your nonpaged pool size, just set up the new initial size of nonpaged pool by specifying:

MIN_NPAGEDYN = 33460000

in SYS$SYSTEM:MODPARAMS.DAT. This value is based on the fact, that pool HAD to be expanded to at least that size in your system, before it crashed. Then run @SYS$UPDATE:AUTOGEN SAVPARAMS SETPARAMS FEEDBACK to let autogen figure out the corresponding value of NPAGEVIR (and other related parameters).

SYSGEN> and SYSMAN> PARA are just different user interfaces to the SAME data, the system parameters, as stored in SYS$SYSTEM:ALPHAVMSSYS.PAR. The parameters from this file will be used to configure the active system parameters during boot. Some parameters are 'dynamic', i.e. they can be changed in the running system with SYSGEN or SYSMAN PARA and the WRITE ACTIVE command. NPAGEDYN is NOT a dynamic parameter, so you MUST reboot to change it.

Increasing nonpaged pool will reduce physical memory available to the applications. There is only ONE OPTIMAL value for NPAGEDYN, that's the value required to support all the applications running on the system. In general, you would want to set NPAGEDYN, so that nonpaged pool does NOT get expanded during normal load, i.e. $ SHOW MEM/POOL/FULL should show the same values for Current and Initial size.
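
The "Current should equal Initial" check can be scripted against captured output. The Python sketch below parses lines in the form quoted earlier in this thread; the function names are hypothetical, and the layout of real SHOW MEMORY/POOL/FULL output may differ (for example, more than one column per line), so the pattern would need adjusting.

```python
# Sketch: classify nonpaged pool state from text in the form quoted in
# this thread. Function names are hypothetical; real SHOW MEMORY output
# may be laid out differently.
import re
from typing import Dict

SAMPLE = """\
Current Size (MB) 24.18
Initial Size (MB) 6.68
Maximum Size (MB) 33.46
Free Space (MB) 9.15
"""


def parse_pool(text: str) -> Dict[str, float]:
    """Collect 'Name (MB) value' lines into a dict keyed by name."""
    fields: Dict[str, float] = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?) \(MB\)\s+([0-9.]+)\s*$", line)
        if match:
            fields[match.group(1).strip()] = float(match.group(2))
    return fields


def pool_status(fields: Dict[str, float]) -> str:
    """Compare current size with the initial and maximum sizes."""
    current = fields["Current Size"]
    if current >= fields["Maximum Size"]:
        return "at maximum"   # further allocation requests will fail
    if current > fields["Initial Size"]:
        return "expanded"     # NPAGEDYN is too small for this load
    return "not expanded"     # the tuning goal described above


if __name__ == "__main__":
    print(pool_status(parse_pool(SAMPLE)))
```

With the values from the running system the result is "expanded"; with the values from the crash dump (current 33.46, maximum 33.46) it would be "at maximum".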

Volker.

Arch_Muthiah · ‎11-26-2005

Mustafa,

As this CLUEXIT crash looks similar to one I analyzed for an earlier customer, I would like to add this. I have gone through the responses from Heinz, Ian and Volker, and I agree with their points.

The CLUEXIT error is a type of bugcheck initiated by the Connection Manager, the OpenVMS Cluster software component that manages the interaction of cooperating OpenVMS Cluster computers. Most such bugchecks are triggered by conditions resulting from failures in communications paths, configuration errors, system management errors, and hardware failures.

There are bugs in CNX$WAIT_xxx, one of the connection manager routines, and in the quorum disk handling routines. HP has acknowledged them and released the following patches for these bugchecks.

1. Bugcheck in CNX$WAIT_xxx. Patch Details: VMS732_UPDATE-V0100. System can crash with a CLUEXIT bugcheck in CNX$WAIT_xxx. A dump shows that the system is out of non-paged pool. Images Affected: [SYS$LDR]SYS$PIPEDRIVER.EXE

2. Bugcheck in handle Quorum Disks. KIT: VMS73_SYS-V0400. OpenVMS V7.3 nodes do not correctly handle Quorum Disks. The issues can result in either CLUEXIT bugchecks or numerous Connection Manager console messages, such as %CNXMAN, Proposing modification of quorum or quorum disk membership. Any cluster with a quorum disk and V7.3 nodes requires this fix.
Image Affected: [SYS$LDR]SYS$CLUSTER.EXE.

Archunan

Regards
Archie

Arch_Muthiah · ‎11-26-2005

Mustafa,

When I faced this kind of crash earlier, I documented these other possible causes. Please find the time to go through them slowly, and I will be happy if you let us know the reason once your problem is rectified.

This bugcheck CLUEXIT may also have been triggered by the following events in the OpenVMS cluster environment.

1. By the port driver poller, when it discovers a remote system with an SCSSYSTEMID or SCSNODE equal to that of another system in the cluster to which a virtual circuit is already open.

2. The cluster connection between two computers is broken for longer than RECNXINTERVAL seconds. This point has already been discussed by Volker. It can occur under the following conditions:

- Upon recovery with battery backup after a power failure.

- After the repair of an SCS communication link.

- After the computer was halted for a period longer than the number of seconds specified for the RECNXINTERVAL parameter and was restarted with a CONTINUE command entered at the operator console.

3. Cluster partitioning: This condition generally arises when the intracluster communications of one or more systems are delayed in some way. It can also occur if one system is incorrectly booted on a common system disk using another system's SYSROOT identifier, or if a copy of an existing system disk is booted on another machine. If the condition persists, the system may have been incorrectly physically configured as part of two distinct OpenVMS Cluster systems. It also occurs when a member of one cluster discovers or establishes a connection to a member of another cluster, or when a foreign cluster is detected in the quorum file.

4. Incorrect switch and/or jumper settings on the CI interface modules both in the HSC controller and the CI host node can cause CLUEXIT bugcheck.

Though my main suspects are the patches and the nonpaged pool size, I would suggest you try all of the following...

a. Bugs in CNX$WAIT_C and the quorum disk handling routines.
Download and install the following ECO patch kits:
- VMS732_UPDATE-V0100
- VMS73_SYS-V0400

b. Duplicate SCSNODE and SCSSYSTEMID identifiers.
Set unique value for SCSNODE and SCSSYSTEMID parameters

c. Modify RECNXINTERVAL parameter value.
Determine the cause of the interrupted connection and correct the problem. For example, if recovery from a power failure is longer than RECNXINTERVAL seconds, you may want to increase the value of the RECNXINTERVAL parameter on all computers.

d. Duplicate SYSROOT Identifier.
Ensure that the cluster is properly configured or that the proper root and disk are specified in the boot command.

e. Setting unique Node Address
- Unique node addresses are set using the two node address switches on the HSC port LINK module. Use the following guidelines when setting the node address switches on the LINK module:
- If the HSC controller or a new CPU is added to a cluster, set the address switches to the next available node address for the cluster. The HSC controller should be assigned to lower node addresses.

- If the HSC controller is presently part of a cluster to which no new nodes have been added, but a LINK module has been replaced, ensure the node address on the new LINK module is set to the same address as the previous module.

f. Cluster partitioning occurs.
Review the setting of EXPECTED_VOTES on all computers.

If the problem is not resolved, it is best to contact the HP customer support team and have them investigate this crash to determine what steps need to be taken to resolve the issue.

Archunan

Regards
Archie

mustafa_12 · ‎02-20-2006

Dear Archunan,

The CLUEXIT was caused by the small size of NPAGEDYN and the user applications using it up. I think there was a leak in the apps, as Volker said, and these bugs have been fixed.

Since then, I have been monitoring the NPAGEDYN area, and there have been no problems.

Thank you...
