Operating System - OpenVMS

Crash Dump Reset

SOLVED
Richard W Hunt
Valued Contributor

Crash Dump Reset

We have an AS4100 with OpenVMS 7.3-2. Patches are up to date (AFAIK).

We are having crashes after a recent hardware upgrade, so we have a pretty good idea of what changed. We had no crashes for a year or more before that, and the crash last year was when the swap/page file disk died a horrible death. No problems in understanding there, and prior to that, the last crash was in 2001. Very stable system. So it is clear that the new hardware is the culprit.

The problem is that our crash dump won't update. I can't give the tech guys details on the crash dump because I don't have them.

We are NOT set up to use the page file for crash dumps. We have a separate SYS$SYSTEM:SYSDUMP.DMP file. DUMPSTYLE is set to 9 (selective, compressed, use system disk). We took a crash dump a couple of days ago, but we have crashed at least twice since then without another crash dump being written.

I have heard there is some sort of interlock file but I don't know where it is and wading through the manuals has (so far) produced no joy. I've looked in System Manager's Manual, System Management Utilities Manual, and a few other miscellaneous places. I see a lot about how to analyze the dump. I see a lot about copying a dump from the page file via the CDA utility. But if there is a free-standing crash dump file in a valid path, why won't a crash overwrite the previous contents?

Sr. Systems Janitor
10 REPLIES
Allan Bowman
Respected Contributor

Re: Crash Dump Reset

Richard,

What kind of hardware upgrade was done? If it was something that might have an effect on the I/O bus (especially a drive controller or disk drive), it is quite possible that whatever is causing the crash is also preventing any further I/O to the system disk. It could be a problem with the system disk itself - maybe even a bad spot within SYSDUMP.DMP. If you have enough space, you might try creating a new dumpfile (rename the old one, create a new one, reboot) and see if the behavior changes.

Allan in Atlanta
Karl Rohwedder
Honored Contributor

Re: Crash Dump Reset

Richard,

Maybe the system died without being able to write to the dump file, e.g. because of an error on the main board.

Are any messages on the console or in the error log (DIAGNOSE)?

regards Kalle
Richard W Hunt
Valued Contributor

Re: Crash Dump Reset

The memory was upgraded from 2 GB to 8 GB.

We are looking at the IOD board, the chance of having a DOA memory card, and the fact that S3 cards with a big memory & a 64-bit interface can hang a system. The KGPSA we use for our SAN fits that bill perfectly. Our tech guy has found some memories to swap out and he is also bringing a new IOD board.

The DIAGNOSE command reveals some correctable ECC on the original crash, but later crashes go straight from a TIMESTAMP entry to a reboot entry. I can't check for more than that right now because the hardware tech has the system even as I type this.

The crash dump from two days ago was calling out a machine check from KERNEL mode, and the K-stack was indicating (a) no currently running process and (b) the ECC_CORRECTABLE routine was referenced in the bowels of the stack.

We are confident that this problem is either a bad memory or a bad IOD board. (For those who don't know: IOD translates addresses from PCI bus to Alpha memory bus. That's the short answer and don't whack me for simplifying. I know it does more than that.)

My issue remains that I'm disturbed about not getting a crash dump. No, no bad spots reported on the disk. I can do an analyze RMS telling it to verify the blocks of the file and it passes that test, so the SYSDUMP.DMP isn't bad that I know of. I just am worried that I can't provide better info to my tech support guys when they ask me "What's up?" (Though the correct answer is more often "Well, we're not.")
Sr. Systems Janitor
Jon Pinkley
Honored Contributor

Re: Crash Dump Reset

It is possible that the dumpfile is not being written due to the setting of the SRM auto_action environment variable.

For the following bugchecks:

UNKRSTRT
KRNLSTKNV
INVSCBB
HALT
DBLERR
INVPTBR
MCHECKPAL

the dumpfile will not be written unless auto_action is set to "RESTART".
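
Editor's sketch: the rule above can be written down as a small check. This is only an illustration in Python of what the post states, not how OpenVMS implements it. The function name `dump_expected` and the set name are hypothetical, and the sketch ignores hangs and hardware faults that stop I/O before the dump can be written. It also assumes the SYSGEN parameter DUMPBUG must be 1 for any dump to be written.

```python
# Bugchecks from the list above: per the post, these only produce a dump
# when the SRM console variable auto_action is RESTART.
RESTART_ONLY_BUGCHECKS = {
    "UNKRSTRT", "KRNLSTKNV", "INVSCBB", "HALT",
    "DBLERR", "INVPTBR", "MCHECKPAL",
}


def dump_expected(bugcheck, auto_action, dumpbug=1):
    """Return True if, under the stated rule, a dump should be written.

    Hypothetical helper: it encodes only the claims made in this thread.
    """
    if dumpbug != 1:
        # Assumption: DUMPBUG = 0 disables dump writing entirely.
        return False
    if bugcheck.upper() in RESTART_ONLY_BUGCHECKS:
        return auto_action.upper() == "RESTART"
    # Other bugchecks are not restricted by auto_action in this rule.
    return True


if __name__ == "__main__":
    for action in ("RESTART", "BOOT", "HALT"):
        print(action, dump_expected("MCHECKPAL", action))
```

A machine check such as MCHECKPAL would therefore leave the old dump untouched whenever auto_action is BOOT or HALT.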

Check with

$ write sys$output f$getenv("AUTO_ACTION")

or

>>> show auto_action

To change in a supported way:

>>> set auto_action RESTART

If that is already set to restart, then please provide the output of

$ mcr sysgen show dumpbug ! should be 1
$ mcr sysgen show dumpstyle
$ mcr sysgen show savedump

example:

$ write sys$output f$getenv("AUTO_ACTION")
RESTART
$ mcr sysgen show dumpbug ! should be 1
Parameter Name     Current   Default      Min.      Max.  Unit     Dynamic
--------------     -------   -------   -------   -------  ----     -------
DUMPBUG                  1         1         0         1  Boolean
$ mcr sysgen show dumpstyle
Parameter Name     Current   Default      Min.      Max.  Unit     Dynamic
--------------     -------   -------   -------   -------  ----     -------
DUMPSTYLE                9         0         0        -1  Bitmask  D
$ mcr sysgen show savedump
Parameter Name     Current   Default      Min.      Max.  Unit     Dynamic
--------------     -------   -------   -------   -------  ----     -------
SAVEDUMP                 0         0         0         1  Boolean
$
it depends
Jon Pinkley
Honored Contributor

Re: Crash Dump Reset

You may want to connect something to the console to capture output (PC with terminal emulator that has record feature, or hard copy terminal).

Then set bit 1 of DUMPSTYLE which will give verbose console output on crashdump.

e.g. if current value of DUMPSTYLE is 9 (2^0 + 2^3) change to 11 (2^0 + 2^1 + 2^3)
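
Editor's sketch: the bit arithmetic above, written out in Python for anyone who wants to check a DUMPSTYLE value by hand. The meanings of bits 0, 1 and 3 come from this thread. Bit 2 as "dump off system disk (DOSD)" is an assumption inferred from the original post describing 9 as "use system disk". The names `DUMPSTYLE_BITS`, `decode_dumpstyle` and `set_bit` are hypothetical.

```python
# Bit meanings as described in this thread (bit 2 is an inferred assumption).
DUMPSTYLE_BITS = {
    0: "selective dump",
    1: "verbose console output",
    2: "dump off system disk (DOSD)",
    3: "compressed dump",
}


def decode_dumpstyle(value):
    """List the options whose bits are set in a DUMPSTYLE value."""
    return [name for bit, name in sorted(DUMPSTYLE_BITS.items())
            if value & (1 << bit)]


def set_bit(value, bit):
    """Return value with the given bit turned on."""
    return value | (1 << bit)


if __name__ == "__main__":
    current = 9                      # 2^0 + 2^3
    print(decode_dumpstyle(current))
    verbose = set_bit(current, 1)    # add 2^1
    print(verbose, decode_dumpstyle(verbose))
```

Setting bit 1 on the value 9 gives 11, matching the arithmetic in the post.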

Good Luck,

Jon
it depends
Richard W Hunt
Valued Contributor

Re: Crash Dump Reset

OK, follow-up:

This might be due to having an S3 Trio board in the system. Our tech tells me this can cause a HANG, not a crash, and therefore you don't get a crash dump. Two days ago we had a crash dump that led to replacing one of our CPU cards. That didn't bother me because that crash dump occurred. My concern was never that we crashed... I sort of expected problems after a hardware upgrade. It was the lack of a crash dump.

Auto-Action is RESTART so that's why I rather expected a crash dump, but of course if this is a HANG and my ops guys just hit the INIT button, that would do it too. I need to communicate with them the urgency of knowing when the INIT button is hit.

Thanks for the concern, guys. We are going to replace the console video board and see if that clears up the last nagging problem. I won't close this right away, but I'll check in once our system has stabilized.
Sr. Systems Janitor
Andy Bustamante
Honored Contributor
Solution

Re: Crash Dump Reset

The S3 video card and fibre channel HBAs need to be installed on different PCI buses.

If I recall correctly, you can violate this configuration rule if you have 2 GB or less of memory. Since you have a recent memory upgrade, please check the location of HBAs and the video card.


Andy Bustamante
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Wim Van den Wyngaert
Honored Contributor

Re: Crash Dump Reset

You multiplied the memory by 4. Did you also increase the dump file size accordingly?
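
Editor's sketch: a rough sense of scale for that question, in Python. It only converts physical memory into 512-byte disk blocks. A full, uncompressed dump file needs at least that many blocks plus some overhead that is not modeled here, while a selective or compressed dump can be much smaller. The function name `memory_in_blocks` is hypothetical.

```python
BLOCK_BYTES = 512  # size of one OpenVMS disk block


def memory_in_blocks(gib):
    """Convert a memory size in GB (2**30 bytes) to 512-byte blocks."""
    return gib * 1024 ** 3 // BLOCK_BYTES


if __name__ == "__main__":
    before = memory_in_blocks(2)   # memory before the upgrade
    after = memory_in_blocks(8)    # memory after the upgrade
    print(before, after, after // before)
```

For the upgrade in this thread that is 4,194,304 blocks before and 16,777,216 blocks after, so a dump file sized for a full dump of 2 GB would be four times too small.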

Wim
Wim
Richard W Hunt
Valued Contributor

Re: Crash Dump Reset

Didn't have another PCI bus, so we swapped the S3 card for an ELSA Gloria Synergy, which is known to not have the problem.

As to the DUMPFILE - we are currently running a compressed selective dump until I can go to the remote site to issue some console commands. I have a full-sized dump file ready to go but I have to use DOSD, hence the console operations.

As of Saturday, we have had no more problems. I think our issue is resolved. Thanks to all who replied.
Sr. Systems Janitor
Richard W Hunt
Valued Contributor

Re: Crash Dump Reset

After three events, we have stabilized our system.

1. Swapped video cards (see in thread)

2. Swapped out one of our CPU boards, which was doing something vile but unknown to the new memories and busses.

3. Scheduled a visit so we can fix the dump file permanently via DOSD techniques.
Sr. Systems Janitor