Operating System - OpenVMS

Re: System crashes every 3 weeks.

 
Doug_81
Frequent Advisor

System crashes every 3 weeks.

Here's one I've been wrestling with for a few months and would appreciate any assistance.

Basically, I have a 2 node Alpha cluster and 1 node crashes every 3 weeks. If I reboot this node prior to it crashing, it runs for 3 weeks from that re-boot.

Let's call them node1 and node2.

Node1 crashes every 3 weeks and it is a
AlphaServer 1000 4/233
Main Memory (1024.00Mb)
OpenVMS V7.1-1H2

Node2 is stable - currently up for 120 days.
AlphaServer 1200 5/400 4MB
Main Memory (1024.00Mb)
OpenVMS V7.3
------------------------------
Here's an excerpt from the clue file:

Bugcheck Type: MACHINECHK, Machine check while in kernel mode
Failing PC: FFFFFFFF.80066BCC EXE$GEN_BUGCHK_C+0003C
Failing PS: 10000000.00001F04
Module: EXCEPTION
Offset: 00018BCC
-------------------------------------
Unfortunately, my client has a contract with a 3rd party support company, so I can't contact HP directly to get the crash dump analyzed, and they haven't been very useful with their analysis. So, I've come to the experts....

The system is running OpenVMS V7.1-1H2 and all the patches (for this version) have been installed.

I suspect that I'm running out of some resource after 3 weeks, but I can't figure out which one.

Any ideas/suggestions?

Thanks,
Doug
27 REPLIES
Ian Miller.
Honored Contributor

Re: System crashes every 3 weeks.

Can you post the clue file SYS$ERRORLOG:CLUE$*.LIS?

Anything in the errorlog?

What layered products are you running?

____________________
Purely Personal Opinion
Uwe Zessin
Honored Contributor

Re: System crashes every 3 weeks.

A machine check is almost always a hardware problem. I once had a similar problem, around 1987, on a VAX-8650 which went down every 14 days on a Friday afternoon.

After many parts replacements and swaps we had to escalate... It turned out to be bad memory. The machine has been running rock-solid since then.
Tom O'Toole
Respected Contributor

Re: System crashes every 3 weeks.

100% agree with Uwe, I doubt it's resource exhaustion. Maybe you just happen to be hitting that bad memory after three weeks.
Can you imagine if we used PCs to manage our enterprise systems? ... oops.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

I've attached the clue file.

Uwe:
Your suggestion re. memory is interesting. A month ago our 3rd party hardware support group suggested "so far their thinking is a bad memory stick", but I haven't heard anything since.

Are there some diagnostics I can run to check the memory?
Shouldn't the error log file show any memory errors?

Here's the last entry in the error log prior to the last crash:

******************************* ENTRY 377. *******************************
ERROR SEQUENCE 19263. LOGGED ON: CPU_TYPE 00000006
DATE/TIME 29-MAR-2005 14:54:23.14 SYS_TYPE 00000011
SYSTEM UPTIME: 23 DAYS 10:28:05
SCS NODE: ALPHA2 OpenVMS AXP V7.1-1H2

HW_MODEL: 00000000 Hardware Model = 0.

FATAL BUGCHECK AlphaServer 1000 4/233

MACHINECHK, Machine check while in kernel mode

PROCESS NAME INTEGRA_DF
PROCESS ID 002E0152

ERROR PC FFFFFFFF 80066BD0

Process Status = 10000000 00001F04, SW = 00, Previous Mode = KERNEL
System State = 01, Current Mode = KERNEL
VMM = 00 IPL = 31, SP Alignment = 16

STACK POINTERS

KSP 00000000 7FFA1E90 ESP 00000000 7FFA6000 SSP 00000000 7FFAC100
USP 00000000 7AF76DC0

GENERAL REGISTERS

R0 FFFFFFFF 8A0E01E8 R1 00000000 0000940E R2 FFFFFFFF 839A6EB0
R3 FFFFFFFF 8A0E0000 R4 00000000 00200040 R5 FFFFFFFF FFFFFFFF
R6 00000000 00000001 R7 00000000 00000003 R8 00000000 0000005C
R9 00000000 00000000 R10 00000000 00000006 R11 00000000 00000006
R12 00000000 00000000 R13 00000000 0000001C R14 00000000 00000010
R15 00000000 00000000 R16 00000000 00000215 R17 00000000 00000001
R18 00000000 00000001 R19 00000000 00000001 R20 00000000 00C42414
R21 FFFFFFFF 8A0E0000 R22 FFFFFFFF FFFFFFFF R23 00000000 00000086
R24 00000000 00000086 R25 00000000 00000003 R26 00000000 00000210
R27 FFFFFFFF 839BD680 R28 00000000 00000000 FP 00000000 7FFA1E90
SP 00000000 7FFA1E90 PC FFFFFFFF 80066BD0 PS 10000000 00001F04

SYSTEM REGISTERS

PTBR 00000000 0000F7BF
Page Table Base Register
PCBB 00000000 11B4E080
Privileged Context Block Base
PRBR FFFFFFFF 8100E000
Processor Base Register
VPTB FFFFFFFC 00000000
Virtual Page Table Base Register
SCBB 00000000 000001A0
System Control Block Base
SISR 00000000 00000000
Software Interrupt Summary Register
ASN 00000000 00000006
Address Space Number

V M S SYSTEM ERROR REPORT COMPILED 31-MAR-2005 17:51:38
PAGE 4.

ASTSR_ASTEN 00000000 0000000F
AST Summary/AST Enable
FEN 00000000 00000001
Floating-Point Enable
IPL 00000000 0000001F
Interrupt Priority Level
MCES 00000000 00000000
Machine Check Error Summary
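
For anyone reading the entry above by hand: the decoded line printed under the PS (IPL = 31, Current Mode = KERNEL, SP Alignment = 16) can be reproduced from the raw quadword. Here is a minimal Python sketch of that arithmetic. It is not a VMS tool; `decode_ps` is a made-up helper name, and the bit positions are an assumption based on my understanding of the OpenVMS Alpha PS layout, so check them against the architecture manual before relying on them. The "Previous Mode" item in the log is not decoded here.

```python
# Sketch only: the field positions are ASSUMED from the OpenVMS Alpha
# processor status (PS) layout and should be verified independently.
MODES = ("KERNEL", "EXECUTIVE", "SUPERVISOR", "USER")


def decode_ps(ps: int) -> dict:
    """Split a 64-bit PS value into the fields the error log prints."""
    return {
        "sw": ps & 0x3,                       # bits 1:0, software field
        "interrupt_pending": (ps >> 2) & 0x1, # bit 2 (the log shows "System State = 01")
        "current_mode": MODES[(ps >> 3) & 0x3],  # bits 4:3
        "vmm": (ps >> 7) & 0x1,               # bit 7
        "ipl": (ps >> 8) & 0x1F,              # bits 12:8
        "sp_align": (ps >> 56) & 0x3F,        # bits 61:56
    }


# PS from ENTRY 377, written in the log as two longwords: 10000000 00001F04
ps = int("10000000" + "00001F04", 16)
fields = decode_ps(ps)
print(fields)
```

For the PS in ENTRY 377 this gives IPL 31, kernel mode and SP alignment 16, which matches the decoded line in the error log. An IPL of 31 only says the machine check handler was running at the highest interrupt level; it does not point at the failing component.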

Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

the key to MACHINECHK crashes is the MCHK errlog entries, if there are any.

SDA> CLUE ERRLOG

will list them and extract them from the dump into a file CLUE$ERRLOG.SYS in your login or default directory.

Run this file through ANAL/ERR or, better, DECevent ($ DIAGNOSE).

The CLUE file is not of much help, especially as the MACHINECHK stack is not correctly decoded until V7.3-2.

Volker.
Steve Nimr
Advisor

Re: System crashes every 3 weeks.

Doug,

Since Volker forgot you're new to VMS:

Do
$ analyze/system

to get to the SDA> prompt.

Steve
Steve Nimr
Advisor

Re: System crashes every 3 weeks.

Sorry Doug I got threads mixed up. :(
I guess there is no way to recall a reply once it's submitted.
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Steve,

it's ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP to access a system dump file (in its default location).

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Here's the output from diagnose around the time period of the last crash:

******************************** ENTRY 376 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19262.
Timestamp of occurrence 29-MAR-2005 14:45:02
Time since reboot 23 Day(s) 10:18:45
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 38. Time Stamp Entry

SWI Minor class 7. Timestamp


******************************** ENTRY 377 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19263.
Timestamp of occurrence 29-MAR-2005 14:54:23
Time since reboot 23 Day(s) 10:28:05
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 37. Crash Re-Start

Bugcheck Minor class 1. Crash Re-start

Bugcheck Msg MACHINECHK, Machine check while in kernel
mode
Process ID x002E0152
Process Name
KSP x000000007FFA1E90
ESP x000000007FFA6000
SSP x000000007FFAC100
USP x000000007AF76DC0
R0 xFFFFFFFF8A0E01E8
R1 x000000000000940E
R2 xFFFFFFFF839A6EB0
R3 xFFFFFFFF8A0E0000
R4 x0000000000200040
R5 xFFFFFFFFFFFFFFFF
R6 x0000000000000001
R7 x0000000000000003
R8 x000000000000005C
R9 x0000000000000000
R10 x0000000000000006
R11 x0000000000000006
R12 x0000000000000000
R13 x000000000000001C
R14 x0000000000000010
R15 x0000000000000000
R16 x0000000000000215
R17 x0000000000000001
R18 x0000000000000001
R19 x0000000000000001
R20 x0000000000C42414
R21 xFFFFFFFF8A0E0000
R22 xFFFFFFFFFFFFFFFF
R23 x0000000000000086
R24 x0000000000000086
R25 x0000000000000003
R26 x0000000000000210
R27 xFFFFFFFF839BD680
R28 x0000000000000000
FP x000000007FFA1E90
SP x000000007FFA1E90
PC xFFFFFFFF80066BD0
PS x1000000000001F04
PTBR x000000000000F7BF
Process Ctl Block Base Re x0000000011B4E080
PRBR xFFFFFFFF8100E000
VPTB xFFFFFFFC00000000
System Ctl Block Base Reg x00000000000001A0
Software Interrupt Summar x0000000000000000
ASN x0000000000000006
ASTSR ASTEN x000000000000000F
FEN x0000000000000001
ASN x0000000000000006
IPL x000000000000001F
MCES x0000000000000000


******************************** ENTRY 378 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19263.
Timestamp of occurrence 29-MAR-2005 15:00:11
Time since reboot 0 Day(s) 0:00:17
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 32. Cold Start (ie: System Boot)

SWI Minor class 2. System startup

TODR x3D202445


******************************** ENTRY 379 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19264.
Timestamp of occurrence 29-MAR-2005 15:00:12
Time since reboot 0 Day(s) 0:00:17
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 64. Volume Mount

SWI Minor class 4. Volume mount

Owner UIC x00010001
Error count 0.
OP count 517.
Unit Number 100.
Unit Name ALPHA2$DKA
Volume number 0.
Volumes in set 0.
Volume Label ALPHA2SYS


******************************** ENTRY 380 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19265.
Timestamp of occurrence 29-MAR-2005 15:01:15
Time since reboot 0 Day(s) 0:01:21
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 98. Asynchronous Device Attention


---- Device Profile ----
Unit ALPHA2$PEA0
Product Name NI-SCA Port

---- NISCA Port Data ----
Error Type and SubType x0700 Device Error, Fatal Error Detected by
Datalink
Status x0000120100000500
Datalink Device Name FWA2:
Remote Node Name
Remote Address x0000000000000000
Local Address x00000405000400AA
Error Count 1. Error Occurrences This Entry

----- Software Info -----
UCB$x_ERRCNT 1. Errors This Unit


******************************** ENTRY 381 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19266.
Timestamp of occurrence 29-MAR-2005 15:01:16
Time since reboot 0 Day(s) 0:01:22
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 98. Asynchronous Device Attention


---- Device Profile ----
Unit ALPHA2$PEA0
Product Name NI-SCA Port

---- NISCA Port Data ----
Error Type and SubType x0700 Device Error, Fatal Error Detected by
Datalink
Status x0000120000000400
Datalink Device Name FWA2:
Remote Node Name
Remote Address x0000000000000000
Local Address x00000405000400AA
Error Count 1. Error Occurrences This Entry

----- Software Info -----
UCB$x_ERRCNT 2. Errors This Unit


******************************** ENTRY 382 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1-1H2
Event sequence number 19267.
Timestamp of occurrence 29-MAR-2005 15:01:23
Time since reboot 0 Day(s) 0:01:30
Host name ALPHA2

System Model AlphaServer 1000 4/233

Entry type 64. Volume Mount

SWI Minor class 4. Volume mount

Owner UIC x00010004
Error count 0.
OP count 15.
Unit Number 1.
Unit Name 213260$DUA
Volume number 0.
Volumes in set 0.
Volume Label USER2
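
Since the question is whether the crashes really recur at the same uptime, it may help to pull the "Time since reboot" value out of every crash entry and compare them across several crashes. Below is a rough Python sketch that does this on DIAGNOSE output saved as text. The function name `crash_uptimes` and the sample string are illustrative assumptions; the sample lines are copied from ENTRY 376 and ENTRY 377 above, and the patterns assume the layout shown in this post.

```python
import re
from datetime import timedelta

# Sample lines copied from ENTRY 376 and ENTRY 377 of the output above.
SAMPLE = """\
Time since reboot 23 Day(s) 10:18:45
Entry type 38. Time Stamp Entry
Time since reboot 23 Day(s) 10:28:05
Entry type 37. Crash Re-Start
"""

UPTIME = re.compile(r"Time since reboot\s+(\d+) Day\(s\)\s+(\d+):(\d+):(\d+)")
ENTRY_TYPE = re.compile(r"Entry type\s+\d+\.\s+(.*)")


def crash_uptimes(text):
    """Return the uptime recorded for each 'Crash Re-Start' entry."""
    uptimes = []
    last_uptime = None
    for line in text.splitlines():
        match = UPTIME.search(line)
        if match:
            days, hours, minutes, seconds = map(int, match.groups())
            last_uptime = timedelta(days=days, hours=hours,
                                    minutes=minutes, seconds=seconds)
            continue
        match = ENTRY_TYPE.search(line)
        # The uptime line precedes the entry type within each entry.
        if match and "Crash Re-Start" in match.group(1) and last_uptime is not None:
            uptimes.append(last_uptime)
    return uptimes


uptimes = crash_uptimes(SAMPLE)
for uptime in uptimes:
    print(uptime, "=", round(uptime.total_seconds() / 86400, 2), "days")
```

If the uptimes from several crashes cluster tightly around the same value, that is worth mentioning to the support group; if they drift by days, a fixed counter or resource limit looks less likely. Either way this only summarizes the log and does not replace the MCHK entries Volker asked about.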