System crashes every 3 weeks.

 
Doug_81
Frequent Advisor

System crashes every 3 weeks.

Here's one I've been wrestling with for a few months and would appreciate any assistance.

Basically, I have a two-node Alpha cluster, and one node crashes every 3 weeks. If I reboot this node before it crashes, it runs for 3 weeks from that reboot.

Let's call them node1 and node2.

Node1 crashes every 3 weeks. It is an
AlphaServer 1000 4/233
Main Memory (1024.00Mb)
OpenVMS V7.1-1H2

Node2 is stable - currently up for 120 days.
AlphaServer 1200 5/400 4MB
Main Memory (1024.00Mb)
OpenVMS V7.3
------------------------------
Here's an excerpt from the clue file:

Bugcheck Type: MACHINECHK, Machine check while in kernel mode
Failing PC: FFFFFFFF.80066BCC EXE$GEN_BUGCHK_C+0003C
Failing PS: 10000000.00001F04
Module: EXCEPTION
Offset: 00018BCC
-------------------------------------
Unfortunately, my client has a contract with a 3rd party support company, so I can't contact HP directly to get the crash dump analyzed, and they haven't been very useful with their analysis. So, I've come to the experts....

The system is running OpenVMS V7.1-1H2 and all the patches (for this version) have been installed.

I suspect that I'm running out of some resource after 3 weeks, but I can't figure out which one.

Any ideas/suggestions?

Thanks,
Doug
Ian Miller.
Honored Contributor

Re: System crashes every 3 weeks.

Can you post the clue file, SYS$ERRORLOG:CLUE$*.LIS?

Anything in the errorlog?

What layered products are you running?

____________________
Purely Personal Opinion
Uwe Zessin
Honored Contributor

Re: System crashes every 3 weeks.

A machine check is almost always a hardware problem. I once had a similar problem, around 1987, on a VAX-8650 that went down every 14 days on a Friday afternoon.

After many part replacements and swaps we had to escalate... It turned out to be bad memory. The machine has been running rock-solid since then.
.
Tom O'Toole
Respected Contributor

Re: System crashes every 3 weeks.

100% agree with Uwe, I doubt it's resource exhaustion. Maybe you just happen to be hitting that bad memory after three weeks.
Can you imagine if we used PCs to manage our enterprise systems? ... oops.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

I've attached the clue file.

Uwe:
Your suggestion re. memory is interesting. A month ago our 3rd party hardware support group suggested "so far their thinking is a bad memory stick", but I haven't heard anything since.

Are there some diagnostics I can run to check the memory?
Shouldn't the error log file show any memory errors?

Here's the last entry in the error log prior to the last crash:

******************************* ENTRY 377. *******************************
ERROR SEQUENCE 19263. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 29-MAR-2005 14:54:23.14 SYS_TYPE 00000011
SYSTEM UPTIME: 23 DAYS 10:28:05
SCS NODE: ALPHA2 OpenVMS AXP V7.1-1H2

HW_MODEL: 00000000 Hardware Model = 0.

FATAL BUGCHECK AlphaServer 1000 4/233

MACHINECHK, Machine check while in kernel mode

PROCESS NAME INTEGRA_DF
PROCESS ID 002E0152

ERROR PC FFFFFFFF 80066BD0

Process Status = 10000000 00001F04, SW = 00, Previous Mode = KERNEL
System State = 01, Current Mode = KERNEL
VMM = 00 IPL = 31, SP Alignment = 16

STACK POINTERS

KSP 00000000 7FFA1E90 ESP 00000000 7FFA6000 SSP 00000000 7FFAC100
USP 00000000 7AF76DC0

GENERAL REGISTERS

R0 FFFFFFFF 8A0E01E8 R1 00000000 0000940E R2 FFFFFFFF 839A6EB0
R3 FFFFFFFF 8A0E0000 R4 00000000 00200040 R5 FFFFFFFF FFFFFFFF
R6 00000000 00000001 R7 00000000 00000003 R8 00000000 0000005C
R9 00000000 00000000 R10 00000000 00000006 R11 00000000 00000006
R12 00000000 00000000 R13 00000000 0000001C R14 00000000 00000010
R15 00000000 00000000 R16 00000000 00000215 R17 00000000 00000001
R18 00000000 00000001 R19 00000000 00000001 R20 00000000 00C42414
R21 FFFFFFFF 8A0E0000 R22 FFFFFFFF FFFFFFFF R23 00000000 00000086
R24 00000000 00000086 R25 00000000 00000003 R26 00000000 00000210
R27 FFFFFFFF 839BD680 R28 00000000 00000000 FP 00000000 7FFA1E90
SP 00000000 7FFA1E90 PC FFFFFFFF 80066BD0 PS 10000000 00001F04

SYSTEM REGISTERS

PTBR 00000000 0000F7BF
Page Table Base Register
PCBB 00000000 11B4E080
Privileged Context Block Base
PRBR FFFFFFFF 8100E000
Processor Base Register
VPTB FFFFFFFC 00000000
Virtual Page Table Base Register
SCBB 00000000 000001A0
System Control Block Base
SISR 00000000 00000000
Software Interrupt Summary Register
ASN 00000000 00000006
Address Space Number

V M S SYSTEM ERROR REPORT COMPILED 31-MAR-2005 17:51:38
PAGE 4.

ASTSR_ASTEN 00000000 0000000F
AST Summary/AST Enable
FEN 00000000 00000001
Floating-Point Enable
IPL 00000000 0000001F
Interrupt Priority Level
MCES 00000000 00000000
Machine Check Error Summary

Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

the key to MACHINECHK crashes is the MCHK errlog entries, if there are any.

SDA> CLUE ERRLOG

will list them and extract them from the dump into a file CLUE$ERRLOG.SYS in your login or default directory.

Run this file through ANAL/ERR or, better, DECevent ($ DIAGNOSE).

The CLUE file is not of much help, especially as the MACHINECHK stack is not correctly decoded until V7.3-2.

Volker.
Steve Nimr
Advisor

Re: System crashes every 3 weeks.

Doug,

Since Volker forgot you're new to VMS:

Do
$ analyze/system

to get to the SDA> prompt.

Steve
Steve Nimr
Advisor

Re: System crashes every 3 weeks.

Sorry Doug I got threads mixed up. :(
I guess there is no way to recall a reply once it's submitted.
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Steve,

it's ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP to access a system dump file (in its default location).

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Here's the output from diagnose around the time period of the last crash:

******************************** ENTRY 376 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19262.
Timestamp of occurrence 29-MAR-2005 14:45:02
Time since reboot 23 Day(s) 10:18:45
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 38. Time Stamp Entry

SWI Minor class 7. Timestamp


******************************** ENTRY 377 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19263.
Timestamp of occurrence 29-MAR-2005 14:54:23
Time since reboot 23 Day(s) 10:28:05
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 37. Crash Re-Start

Bugcheck Minor class 1. Crash Re-start

Bugcheck Msg MACHINECHK, Machine check while in kernel
mode
Process ID x002E0152
Process Name
KSP x000000007FFA1E90
ESP x000000007FFA6000
SSP x000000007FFAC100
USP x000000007AF76DC0
R0 xFFFFFFFF8A0E01E8
R1 x000000000000940E
R2 xFFFFFFFF839A6EB0
R3 xFFFFFFFF8A0E0000
R4 x0000000000200040
R5 xFFFFFFFFFFFFFFFF
R6 x0000000000000001
R7 x0000000000000003
R8 x000000000000005C
R9 x0000000000000000
R10 x0000000000000006
R11 x0000000000000006
R12 x0000000000000000
R13 x000000000000001C
R14 x0000000000000010
R15 x0000000000000000
R16 x0000000000000215
R17 x0000000000000001
R18 x0000000000000001
R19 x0000000000000001
R20 x0000000000C42414
R21 xFFFFFFFF8A0E0000
R22 xFFFFFFFFFFFFFFFF
R23 x0000000000000086
R24 x0000000000000086
R25 x0000000000000003
R26 x0000000000000210
R27 xFFFFFFFF839BD680
R28 x0000000000000000
FP x000000007FFA1E90
SP x000000007FFA1E90
PC xFFFFFFFF80066BD0
PS x1000000000001F04
PTBR x000000000000F7BF
Process Ctl Block Base Re x0000000011B4E080
PRBR xFFFFFFFF8100E000
VPTB xFFFFFFFC00000000
System Ctl Block Base Reg x00000000000001A0
Software Interrupt Summar x0000000000000000
ASN x0000000000000006
ASTSR ASTEN x000000000000000F
FEN x0000000000000001
ASN x0000000000000006
IPL x000000000000001F
MCES x0000000000000000


******************************** ENTRY 378 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19263.
Timestamp of occurrence 29-MAR-2005 15:00:11
Time since reboot 0 Day(s) 0:00:17
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 32. Cold Start (ie: System Boot)

SWI Minor class 2. System startup

TODR x3D202445


******************************** ENTRY 379 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19264.
Timestamp of occurrence 29-MAR-2005 15:00:12
Time since reboot 0 Day(s) 0:00:17
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 64. Volume Mount

SWI Minor class 4. Volume mount

Owner UIC x00010001
Error count 0.
OP count 517.
Unit Number 100.
Unit Name ALPHA2$DKA
Volume number 0.
Volumes in set 0.
Volume Label ALPHA2SYS


******************************** ENTRY 380 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19265.
Timestamp of occurrence 29-MAR-2005 15:01:15
Time since reboot 0 Day(s) 0:01:21
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 98. Asynchronous Device Attention


---- Device Profile ----
Unit ALPHA2$PEA0
Product Name NI-SCA Port

---- NISCA Port Data ----
Error Type and SubType x0700 Device Error, Fatal Error Detected by
Datalink
Status x0000120100000500
Datalink Device Name FWA2:
Remote Node Name
Remote Address x0000000000000000
Local Address x00000405000400AA
Error Count 1. Error Occurrences This Entry

----- Software Info -----
UCB$x_ERRCNT 1. Errors This Unit


******************************** ENTRY 381 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19266.
Timestamp of occurrence 29-MAR-2005 15:01:16
Time since reboot 0 Day(s) 0:01:22
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 98. Asynchronous Device Attention


---- Device Profile ----
Unit ALPHA2$PEA0
Product Name NI-SCA Port

---- NISCA Port Data ----
Error Type and SubType x0700 Device Error, Fatal Error Detected by
Datalink
Status x0000120000000400
Datalink Device Name FWA2:
Remote Node Name
Remote Address x0000000000000000
Local Address x00000405000400AA
Error Count 1. Error Occurrences This Entry

----- Software Info -----
UCB$x_ERRCNT 2. Errors This Unit


******************************** ENTRY 382 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19267.
Timestamp of occurrence 29-MAR-2005 15:01:23
Time since reboot 0 Day(s) 0:01:30
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 64. Volume Mount

SWI Minor class 4. Volume mount

Owner UIC x00010004
Error count 0.
OP count 15.
Unit Number 1.
Unit Name 213260$DUA
Volume number 0.
Volumes in set 0.
Volume Label USER2


Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

bad luck - OpenVMS V7.1-1H2 did NOT log any machine check entry.

This is the SAME machine/problem as already discussed in previous thread:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=808549

I keep a database of all crashes, that's why I know ;-)

Could you please try to provide the stack data as requested in the previous thread:

$ ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP
SDA> READ/EXEC
SDA> SHOW STACK/QUAD 7FFA1FC0;40

It may also be possible to find the machine check logout frame in the dump.

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Thanks for the link Volker.
You're absolutely right. Adrian is my hardware support contact and I'm that "sysadmin is in west coast Canada" he referred to.

In any case, I was not aware that they were using this forum to troubleshoot the problem. I thought I'd try, as I'm not getting anywhere following the official channels.

Here's the output from the SHOW STACK/QUAD 7FFA1FC0;40 command:

Specified Stack Range
---------------------
00000000.7FFA1FC0 00000000.0002F030
00000000.7FFA1FC8 00000000.010E0019
00000000.7FFA1FD0 00000000.7AF77A5C
00000000.7FFA1FD8 00000000.7AF78AA0
00000000.7FFA1FE0 00000000.00000001
00000000.7FFA1FE8 00000000.00000003
00000000.7FFA1FF0 00000000.0030F080
00000000.7FFA1FF8 00000000.0000001B
Galen Tackett
Valued Contributor

Re: System crashes every 3 weeks.

Doug,

Just curious--just how precisely do you mean "every 3 weeks":

1) every 3 weeks, within a few milliseconds
2) every 3 weeks, within a couple of hours
3) Every 3 weeks, within a few days

I'll bet your answer is 3. :-)

To hazard a little speculation around each possibility:

1) would be pretty strange, to me at least. Perhaps a flaw in the fabric of space-time. :-)

2) might suggest a link to some calendar-related activity. Perhaps a procedure or device that is used every couple of weeks? But you'd probably have noticed that.

3) suggests something a lot more random or at least aperiodic, which is why I guessed you'd pick this answer.

Just a few thoughts which may at least stimulate some thought, if they're of any use at all...

Galen
Ian Miller.
Honored Contributor

Re: System crashes every 3 weeks.

Volker,
"I keep a database of all crashes, that's why I know"
and I thought you just remembered them all rather than having a private copy of canasta :-)

____________________
Purely Personal Opinion
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

the interrupt/exception stack frame shows that the current PC at the time of the MACHINECHK is in P0 space, and the PS shows user mode at IPL 0:

00000000.7FFA1FF0 00000000.0030F080 <<< PC
00000000.7FFA1FF8 00000000.0000001B <<< PS

SDA> eva/ps 0000001B
MBZ SPAL MBZ IPL VMM MBZ CURMOD INT PRVMOD
0 00 00000000000 00 0 0 USER 0 USER

so whatever the instruction is

SDA> EXA/INS 30F080

it CANNOT have caused a MACHINECHK through a programming error (i.e. access into IO-space), because you can't do that in USER mode. It could have caused access to a bad memory page, but that would be pure speculation !!
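For readers who want to check the PS decoding by hand, here is a minimal Python sketch. The bit positions are an assumption: they follow the Alpha processor status layout as SDA labels it above (PRVMOD, INT, CURMOD, VMM, IPL, SPAL). The function names are made up for illustration. The two input values are the saved PS from the stack frame and the bugcheck PS from the error log entry.

```python
# Assumed Alpha PS layout, matching the field names SDA prints for EVA/PS:
#   bits 1:0   PRVMOD (previous mode)
#   bit  2     INT    (interrupt/system state)
#   bits 4:3   CURMOD (current mode)
#   bit  7     VMM
#   bits 12:8  IPL
#   bits 61:56 SPAL   (stack pointer alignment)
MODES = ("KERNEL", "EXEC", "SUPER", "USER")


def parse_quad(text):
    """Convert SDA-style 'hhhhhhhh.hhhhhhhh' text into an integer."""
    return int(text.replace(".", ""), 16)


def decode_ps(ps):
    """Split a 64-bit PS value into the fields listed above."""
    return {
        "prvmod": MODES[ps & 0x3],
        "int": (ps >> 2) & 0x1,
        "curmod": MODES[(ps >> 3) & 0x3],
        "vmm": (ps >> 7) & 0x1,
        "ipl": (ps >> 8) & 0x1F,
        "spal": (ps >> 56) & 0x3F,
    }


# PS saved in the exception frame (the interrupted user-mode code)
user_ps = decode_ps(parse_quad("00000000.0000001B"))
# PS at the time of the bugcheck, taken from the error log entry
crash_ps = decode_ps(parse_quad("10000000.00001F04"))

print(user_ps)
print(crash_ps)
```

The first value decodes to user mode at IPL 0, and the second to kernel mode at IPL 31 with SP alignment 16, which agrees with the error log entry posted earlier in the thread.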

Please issue the following commands in SDA:

SDA> EXA/INS 30F080-30;40

to examine the instruction stream. If the current instruction includes a memory access and you're able to figure out the address, also do

SDA> SHOW PROC/PAGE address;1000

Otherwise, I'll help you to figure out the page number...

To get an overview of the last couple of crashes on this node, just try TYPE CLUE$HISTORY - if there is something timing related, you might be able to spot a pattern.

Volker.
DICTU OpenVMS
Frequent Advisor

Re: System crashes every 3 weeks.

Doug,

If you really suspect the memory, shut down the machine and bring it to the SRM console. Then start 2 memexers per CPU and let them run for a few hours. If there really is bad RAM, it should show up on the console. To stop the memexers, give the kill_diag command (or init the system). To show the status of the memexers, type show_diag.

(I could be a little off with the commands; look in the manual or try help or man for the exact commands.)

It is possible that the RAM has gone bad. At my current site we have had several issues with bad RAM.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Volker:
SDA> EXA/INS 30F080
00000000.0030F080: BIS R31,#X1D,R7

SDA> EXA/INS 30F080-30;40
00000000.0030F050: CVTDG F3,F3
00000000.0030F054: ADDG F4,F3,F3
00000000.0030F058: CVTGD F3,F3
00000000.0030F05C: STD F3,#X0CF8(FP)
00000000.0030F060: TRAPB
00000000.0030F064: LDA R16,#X0008(FP)
00000000.0030F068: BIS R31,#X01,R25
00000000.0030F06C: LDQ R26,#XFF60(R2)
00000000.0030F070: LDQ R27,#XFF68(R2)
00000000.0030F074: JSR R26,(R26)
00000000.0030F078: JMP R31,(R0)
00000000.0030F07C: TRAPB
00000000.0030F080: BIS R31,#X1D,R7
00000000.0030F084: STL R7,#X0020(FP)
00000000.0030F088: LDL R3,#X0CE0(FP)
00000000.0030F08C: ADDL/V R3,#X01,R3
00000000.0030F090: LDA R16,#X8000(R31)

I looked at the clue$history file and there doesn't appear to be any pattern other than approx every 3 weeks.
e.g. The previous 4 crashes are:
Date Uptime
======== ==========
Dec 29 22 days
Jan 20 25 days
Feb 14 25 days
Mar 29 23 days
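
As a quick check of how tight "approx every 3 weeks" is, the four uptimes in the table can be summarized with a few lines of Python. This is only a sketch over the figures quoted above; the variable names are made up for illustration.

```python
from statistics import mean

# Uptime in days at each crash, copied from the table above
uptimes_days = {"Dec 29": 22, "Jan 20": 25, "Feb 14": 25, "Mar 29": 23}

values = list(uptimes_days.values())
summary = {
    "min": min(values),
    "max": max(values),
    "mean": mean(values),
    "spread": max(values) - min(values),
}
print(summary)
```

The uptimes differ by a few days from crash to crash, not by hours or milliseconds.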

Sorry, I don't know what address to put in the SHOW PROC/PAGE address;1000 command.


Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

the exception PC points to a BIS R31,#X1D,R7 instruction, so executing this instruction involves no memory accesses, except the access to the page where the instruction itself is stored. Please remember to repeat these steps against the next crash(es).

Now let's try to find the machinecheck logout frame in the dump:

SDA> READ SYSDEF
SDA> SHOW STACK @(@smp$gl_cpu_data+CPU$L_PROC_MCHK_ABORT_SVAPTE+4);2F0

You have to enter the command in one line.
(above command only applies to single-CPU system - which this node is).

Try to include the output as a text file attachment in your next reply (or mail it to me - see my forum profile).

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Thanks for your help Volker.
I've attached a text file with the output.
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

thanks for the data:

8A0E0058 00000001.00000205 = mchk code

Could you please compare the data with the same SDA command in the running system ? Sometimes mchk data is left in this buffer from 'expected' machinechecks (like during SYSMAN IO AUTOCONFIGURE when scanning the device configuration).

If the same data exists in the running system, we know that no machine check frame has been logged, and we need to try to find out why OpenVMS has crashed with a MACHINECHK bugcheck.

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Sorry Volker, I don't understand.
These commands were executed on the running system.
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

sorry, my fault for leaving out the obvious ;-(

Please execute the SDA commands against the system dump file from the MACHINECHK crash, e.g.:

$ ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP
SDA> READ SYSDEF
SDA> SHOW STACK ...

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

See attached.
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

so the mchk code is the SAME. You have now documented the machine check logout frame from the running system. After the next MACHINECHK crash, compare the data from the crash against the data just captured from the running system. If the data is IDENTICAL (all quadwords), we can be sure that no mchk frame was logged before the crash.
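
The quadword-by-quadword comparison can be done by eye, or with a small script run over the two saved SDA outputs. The Python sketch below assumes each data line looks like the SHOW STACK/QUAD output posted earlier in this thread (an address, then a quadword value); real SHOW STACK output may carry extra columns. The function names are hypothetical. The sample text reuses two lines from the exception frame only to show the format, and the "crash" copy has one value altered by hand, so neither is real logout frame data.

```python
import re

# An address (with or without the high longword), then a quadword value
QUAD_LINE = re.compile(
    r"^\s*((?:[0-9A-Fa-f]{8}\.)?[0-9A-Fa-f]{8})\s+"
    r"([0-9A-Fa-f]{8}\.[0-9A-Fa-f]{8})"
)


def parse_quads(text):
    """Return {address: value} for every line that looks like stack data."""
    quads = {}
    for line in text.splitlines():
        match = QUAD_LINE.match(line)
        if match:
            addr, value = (int(g.replace(".", ""), 16) for g in match.groups())
            quads[addr] = value
    return quads


def diff_quads(running_text, crash_text):
    """Return {address: (running_value, crash_value)} where the two differ."""
    running = parse_quads(running_text)
    crash = parse_quads(crash_text)
    return {
        addr: (running.get(addr), crash.get(addr))
        for addr in sorted(running.keys() | crash.keys())
        if running.get(addr) != crash.get(addr)
    }


# Format sample only (lines taken from the exception frame shown earlier)
sample_running = """\
Specified Stack Range
---------------------
00000000.7FFA1FF0 00000000.0030F080
00000000.7FFA1FF8 00000000.0000001B
"""
# Hypothetical crash-time copy with one quadword changed by hand
sample_crash = sample_running.replace("0000001B", "0000001F")

identical = diff_quads(sample_running, sample_running)
changed = diff_quads(sample_running, sample_crash)
print(identical)
print({hex(a): (hex(r), hex(c)) for a, (r, c) in changed.items()})
```

An empty result means every quadword matched, which is the case where no new mchk frame was logged before the crash. Any entry in the result points at a quadword worth looking at.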

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Ok, but how does this information help to determine the cause of the crash?