Operating System - OpenVMS

Rick Hayes
New Member

Help with System Crashing

Here is the sequence of events from a system crash. The OS is OpenVMS 7.2-1, running on an AlphaServer DS10 (466 MHz).


-Last non-error message written to log file before cold start @ 17:03 (5-Jun), ‘Time Stamp Entry’
-ServersAlive Notification Omega is down @ 17:14 (5-Jun)
-Next message written to log file 07:47:12 (6-Jun), ‘Unrecognized Configuration Entry’
-Cold Start @ 07:47:21 (cold start + 3 seconds)
-Error message @ 07:48:06 (cold start + 49 seconds), ‘** Error during CTR processing of EVT seg’ (for DQA0)
-Error message @ 07:48:21 (cold start + 64 seconds), ‘** Error during CTR processing of EVT seg’ (for DQA1)
-Volume Mount @ 07:48:24 (cold start + 67 seconds), resuming backup from last night.

Not a whole lot of information, I know, but if you could point me in the right direction I would appreciate any assistance.

Thanks,
Rick
Ian Miller.
Honored Contributor
Solution

Re: Help with System Crashing

Do you have a crash dump? Try
ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP

Is there a file SYS$ERRORLOG:CLUE*.LIS
dated around the time of the crash?
____________________
Purely Personal Opinion
Volker Halle
Honored Contributor

Re: Help with System Crashing

Rick,

welcome to the OpenVMS ITRC forum.

Where did you copy those error messages from ? They do not exactly look like output from DECevent or WEBES (Compaq Analyze/SEA).

If a real system crash was involved, you should also have a bugcheck entry in ERRLOG.SYS. And a valid dump in SYS$SYSTEM:SYSDUMP.DMP - this would lead to a CLUE file being created during startup.

Does $ TYPE CLUE$HISTORY show an entry for 5-JUN-2006 or 6-JUN-2006 ?

Alternatively, the system may have hung and someone pressed the RESTART button. A power failure can also cause a boot with no crash entry. Another possibility is an error halt with AUTO_ACTION not set to RESTART.

What does the following command return:

$ write sys$output f$getenv("AUTO_ACTION")
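
As a rough sketch, the checks suggested so far in this thread can be run together from a privileged account. Only the F$GETSYI("BOOTTIME") line is an addition not mentioned above; it simply shows when the running system was booted, so it can be compared against the timeline.

```dcl
$ ! When was the running system booted?
$ WRITE SYS$OUTPUT F$GETSYI("BOOTTIME")
$ ! Is there a CLUE listing from around the time of the crash?
$ DIRECTORY/DATE SYS$ERRORLOG:CLUE*.LIS
$ ! Does the crash history show an entry for 5-JUN or 6-JUN?
$ TYPE CLUE$HISTORY
$ ! What is the console set to do after an error halt or power-up?
$ WRITE SYS$OUTPUT F$GETENV("AUTO_ACTION")
$ ! Does the dump file hold a valid dump? (Run this last; it starts SDA.)
$ ANALYZE/CRASH_DUMP SYS$SYSTEM:SYSDUMP.DMP
```

If the last command does open a dump, type EXIT at the SDA> prompt to leave the analyzer.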

Volker.
Rick Hayes
New Member

Re: Help with System Crashing

Ian and Volker,

There was no dump file and no clue*.lis file, dated for either yesterday (when I think the crash happened based on the event log) or today.
I checked the sys$system directory and found no .dmp file for yesterday/today. The last .dmp was written on 19-MAY-2000.
There is no entry for $ TYPE CLUE$HISTORY for 5-JUN or 6-JUN. The last entry was 12-OCT-2005.
The $write sys$output f$getenv("AUTO_ACTION") returns 'HALT'.

Thanks to both of you for your help. This server has 'crashed' several times over the last 2-3 weeks. There is no 'system administrator' per se here, so it feels a bit like trying to find your way around in the dark in an unfamiliar house. There is not a terrible lot of documentation. I am computer literate but not familiar with VMS, although I am learning (the hard way).

Thanks,
Rick
Volker Halle
Honored Contributor

Re: Help with System Crashing

Rick,

the SYS$SYSTEM:SYSDUMP.DMP file is created once and is mapped during boot. When the system is crashing, it writes memory to the blocks mapped at boot time. You do not get a new dump file for each crash and the dates on the dump file don't change when a dump is being written.

The fact that there is no 'new' CLUE file suggests that SYSDUMP.DMP does not contain a valid dump. You could check SYS$MANAGER:CLUE$STARTUP_node.LOG for any error messages returned from the analyze/crash command during startup.
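
A minimal sketch of that check, using a wildcard in place of the node name so that it does not depend on how the node is named. The search strings are an assumption about what is worth looking for: OpenVMS messages carry their severity as -W-, -E- or -F- (warning, error, fatal).

```dcl
$ ! List the CLUE startup logs and when each was last written
$ DIRECTORY/DATE=MODIFIED SYS$MANAGER:CLUE$STARTUP_*.LOG
$ ! Show any warning, error or fatal messages they contain
$ SEARCH SYS$MANAGER:CLUE$STARTUP_*.LOG "-W-","-E-","-F-"
```

If SEARCH finds nothing, it is still worth reading the log itself with TYPE, since not every problem is reported as a formatted message.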

Because AUTO_ACTION is set to HALT, an error halt can be ruled out: the CPU would have been left at the console prompt >>>. The same is true for a possible power failure: the system would have remained halted afterwards.

Can you capture the output from the console terminal? From your description of the timing of the events, the system could also have hung starting 5-JUN 17:03, and then someone pressed the RESTART button (or HALT and >>> BOOT) on 6-JUN 7:46. Find out who had physical access to the console and what they did.

If this is true, then - instead of just restarting the node - you should force a crash:

- press HALT button
- issue >>> CRASH

This should cause a system crash to be written and the system will boot automatically. The 'forced' dump will then be available for analysis to find the possible reason for the hang.
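
Before relying on a forced crash, it may be worth confirming that the system is set up to write a dump at all. The sketch below goes beyond what has been discussed in this thread so far: DUMPBUG and BUGREBOOT are the standard system parameters that control whether a dump is written on a bugcheck and whether the system reboots afterwards, and a value of 1 enables each.

```dcl
$ ! Display the active values of the two bugcheck-related parameters
$ RUN SYS$SYSTEM:SYSGEN
USE ACTIVE
SHOW DUMPBUG
SHOW BUGREBOOT
EXIT
```

The commands only display values; nothing is changed unless a SET and WRITE are issued, which this sketch deliberately does not do.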

Volker.
Rick Hayes
New Member

Re: Help with System Crashing

Volker,

There were no messages and nothing to capture from the console terminal. When I went to check it, it was 'locked'/'hung' like all other terminals.
I think also that the system hung starting around 17:00 on 5-JUN. The standard practice for this has been to power the server down and power it back up. Once the server got to the >>> prompt, I just type BOOT. I would rather use the HALT button, but if memory serves this did not work. I'm not a fan of powering down the server unless it is absolutely necessary.
So far the server has stayed up, but the next time it goes 'down' (I am assuming it will), I'll use your CRASH suggestion.

Thanks!
Rick
Volker Halle
Honored Contributor

Re: Help with System Crashing

Rick,

here is a pointer to the DS10 Console Reference Manual:

http://h18002.www1.hp.com/alphaserver/download/ds10cr-d.pdf

There is only a combined Halt/Reset button on the DS10. If you press it, the system may just RESTART. If so, you need to set the Halt/Reset Select jumper on the main board (see appendix A).

If you are using a serial console, you might also be able to HALT the system by typing CTRL-P. It might also be possible to use the RMC commands to halt the system.

Without being able to HALT the system, you can't get to the console prompt and you can't force a system crash.

Volker.