Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

 
Max Pierre
Advisor

Hi,

We have an ES45 system crashing at EXE$SYSTEM_CORRECTED_ERROR_C+00768.

The problem is most likely hardware-related. CPU 0 has already been swapped, but that did not solve it.

I cannot find any documentation about EXE$SYSTEM_CORRECTED_ERROR_C.

Could it be a memory problem rather than a CPU problem?

Any input is welcome :-)

Max

VMS Version: V7.3-2
Current Process: NULL
Current Image: <not available>
Failing PC: FFFFFFFF.80018088 EXE$SYSTEM_CORRECTED_ERROR_C+00768
Failing PS: 30000000.00001F04
Module: SYS$CPU_ROUTINES_2608 (Link Date/Time: 1-OCT-2003 21:19:12.40)
Offset: 00008088

Failing Instruction:
EXE$SYSTEM_CORRECTED_ERROR_C+00768: BUGCHK

Instruction Stream (last 20 instructions):
EXE$SYSTEM_CORRECTED_ERROR_C+00718: LDQ R26,#XFF60(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+0071C: BIS R31,#X21,R16
EXE$SYSTEM_CORRECTED_ERROR_C+00720: BIS R31,#X01,R25
EXE$SYSTEM_CORRECTED_ERROR_C+00724: LDBU R3,(R3)
EXE$SYSTEM_CORRECTED_ERROR_C+00728: BLBC R3,#X000002
EXE$SYSTEM_CORRECTED_ERROR_C+0072C: LDQ R27,#XFF68(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+00730: JSR R26,(R26)
EXE$SYSTEM_CORRECTED_ERROR_C+00734: LDL R3,#X0010(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00738: LDQ R26,#XFFE0(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+0073C: BIS R31,#X07,R16
EXE$SYSTEM_CORRECTED_ERROR_C+00740: BIS R31,#X01,R25
EXE$SYSTEM_CORRECTED_ERROR_C+00744: BEQ R3,#X000006
EXE$SYSTEM_CORRECTED_ERROR_C+00748: LDL R3,#XFE98(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+0074C: LDQ R27,#XFFE8(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+00750: JSR R26,(R26)
EXE$SYSTEM_CORRECTED_ERROR_C+00754: BIS R3,#X05,R16
EXE$SYSTEM_CORRECTED_ERROR_C+00758: BUGCHK
EXE$SYSTEM_CORRECTED_ERROR_C+0075C: BR R31,#X000003
EXE$SYSTEM_CORRECTED_ERROR_C+00760: LDL R0,#XFE98(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+00764: BIS R0,#X05,R16
EXE$SYSTEM_CORRECTED_ERROR_C+00768: BUGCHK
EXE$SYSTEM_CORRECTED_ERROR_C+0076C: BIS R31,FP,SP
EXE$SYSTEM_CORRECTED_ERROR_C+00770: LDQ R26,#X0018(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00774: LDQ R2,#X0020(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00778: LDQ R3,#X0028(FP)
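
The tail of this instruction stream can be cross-checked against the register dump that follows. The two instructions before the failing BUGCHK load a longword into R0 and OR it with 5 into R16, and the dump shows R0 = 210 and R16 = 215 (hex). A minimal Python sketch of that arithmetic follows. The variable names are made up for illustration, and reading 210 (hex) as the bugcheck code rather than a genuine RESULTOVF status is an assumption:

```python
# Values copied from the SDA register dump (the low bits are all that matter here).
r0 = 0x210    # loaded by: LDL R0,#XFE98(R2)
r16 = 0x215   # argument register at the time of the BUGCHK

# BIS R0,#X05,R16 is a bitwise OR of R0 with the literal 5.
computed_r16 = r0 | 0x05

# Alpha instructions are 4 bytes long, so the PC saved after the BUGCHK at
# +00768 should point at the next instruction, +0076C.
failing_offset = 0x768
next_offset = failing_offset + 4

print(hex(computed_r16), hex(next_offset))
```

Both results agree with the dump (R16 = 215, PC at +0076C). The %SYSTEM-W-RESULTOVF text next to R0 is probably just SDA symbolizing the number as if it were a condition value.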

OpenVMS (TM) Operating System, Version V7.3-2 -- System Dump Analysis 17-SEP-2016 22:09:46.72 Page 3
Current Registers: Process index: 0000 Process name: NULL PCB: 844DAFC8 (CPU 0)

 

R0 = 00000000.00000210 %SYSTEM-W-RESULTOVF, resultant string overflow
R1 = 00000000.00000000
R2 = FFFFFFFF.8443DC10 SMP_STD$EXTENDED_HW_SETUP+00260
R3 = 00000000.00000000
R4 = 00000000.000000B0
R5 = 00000000.00000000
R6 = FFFFFFFF.81438000 MP_CPU (CPU Id 0)
R7 = 00000000.00000000
R8 = 00000000.00000000
R9 = 00000000.00000000
R10 = 00000000.00000004
R11 = FFFFFFFF.81A72B00 PCB (Username SYSTEM, Procnam WBEM$SERVER)
R12 = 00000000.7BD2F9C8
R13 = FFFFFFFF.844DDF08 SCH$IDLE
R14 = 00000000.00000001
R15 = 00000000.00000000
R16 = 00000000.00000215
R17 = 40000000.00000000
R18 = 40000000.00000000
R19 = 00000000.00000000
R20 = FFFFFFFF.80018054 EXE$SYSTEM_CORRECTED_ERROR_C+00734
R21 = 00000000.0000000D
R22 = FFFFFFFF.00000000 AQB
R23 = 00000000.00000000
R24 = FFFFFFFF.81438000 MP_CPU (CPU Id 0)
AI = 00000000.00000001
RA = FFFFFFFF.80004030 IMG$RUNDOWN_C
PV = FFFFFFFF.84408000 EXE$GR_SYSTEM_DATA_CELLS
R28 = FFFFFFFF.80018054 EXE$SYSTEM_CORRECTED_ERROR_C+00734
FP = FFFFFFFF.85829EB0
PC = FFFFFFFF.8001808C EXE$SYSTEM_CORRECTED_ERROR_C+0076C
PS = 30000000.00001F04 Kernel Mode, IPL 31, Interrupt
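
SDA's annotation of the PS ("Kernel Mode, IPL 31, Interrupt") can be reproduced by splitting the quadword into bit fields. The sketch below assumes the OpenVMS Alpha PS layout as I remember it (SW<1:0>, IP<2>, CM<4:3>, IPL<12:8>, SP_ALIGN<61:56>); the function name `decode_ps` is hypothetical:

```python
def decode_ps(ps: int) -> dict:
    """Split an Alpha processor status quadword into its fields.

    Assumed field positions: SW<1:0>, IP<2>, CM<4:3>, IPL<12:8>,
    SP_ALIGN<61:56>.
    """
    modes = ["Kernel", "Executive", "Supervisor", "User"]
    return {
        "sw": ps & 0x3,                       # software field
        "interrupt": bool((ps >> 2) & 0x1),   # interrupt-in-progress bit
        "mode": modes[(ps >> 3) & 0x3],       # current mode
        "ipl": (ps >> 8) & 0x1F,              # interrupt priority level
        "sp_align": (ps >> 56) & 0x3F,        # stack alignment at the interrupt
    }

# Failing PS from the dump header, and the saved PS from the interrupt frame.
crash_ps = decode_ps(0x30000000_00001F04)
saved_ps = decode_ps(0x00000000_00000303)

print(crash_ps)
print(saved_ps)
```

Under those assumptions the failing PS decodes to kernel mode at IPL 31 with the interrupt bit set, and the saved PS 00000303 further down decodes to kernel mode at IPL 3, which is consistent with a machine check interrupting the idle loop.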

OpenVMS (TM) Operating System, Version V7.3-2 -- System Dump Analysis 17-SEP-2016 22:09:46.72 Page 4
Stack Decoder:

 

System Stack (NULL Process):
Stack Pointer FFFFFFFF.85829EB0
Stack Limits (low) FFFFFFFF.85826000
(high) FFFFFFFF.8582A000

OpenVMS (TM) Operating System, Version V7.3-2 -- System Dump Analysis 17-SEP-2016 22:09:46.72 Page 5
MACHINECHK Stack:

 

Stack Pointer SP => FFFFFFFF.85829EB0

Stack Frame:
PV FFFFFFFF.85829EB0 FFFFFFFF.8443DC10 SMP_STD$EXTENDED_HW_SETUP+00260
Entry Point FFFFFFFF.80017D30 EXE$SYSTEM_CORRECTED_ERROR_C+00410
FFFFFFFF.85829EB8 00000000.00000043
FFFFFFFF.85829EC0 FFFFFFFF.00000000
return PC FFFFFFFF.85829EC8 FFFFFFFF.8001B1F4 SYS$CPU_ROUTINES_2608+0B1F4
saved R2 FFFFFFFF.85829ED0 FFFFFFFF.85829F80
saved R3 FFFFFFFF.85829ED8 FFFFFFFF.8443E000 EXE$SETUP_MEMTEST_ENV+00310
saved FP FFFFFFFF.85829EE0 FFFFFFFF.85829EF0
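
The addresses in this frame are consistent with the failing PC reported at the top of the dump. A short Python sketch of the arithmetic, using only numbers that appear in the SDA output (the variable names are illustrative):

```python
failing_pc = 0xFFFFFFFF_80018088   # Failing PC
module_offset = 0x00008088         # Offset within SYS$CPU_ROUTINES_2608
symbol_offset = 0x768              # EXE$SYSTEM_CORRECTED_ERROR_C+00768

# Work backwards to where the module and the routine symbol start.
module_base = failing_pc - module_offset
symbol_base = failing_pc - symbol_offset

# Recompute two addresses SDA shows in the stack frame.
entry_point = symbol_base + 0x410  # EXE$SYSTEM_CORRECTED_ERROR_C+00410
return_pc = module_base + 0x0B1F4  # SYS$CPU_ROUTINES_2608+0B1F4

for name, value in [
    ("module base", module_base),
    ("symbol base", symbol_base),
    ("entry point", entry_point),
    ("return PC", return_pc),
]:
    # Print in SDA's high.low longword style.
    print(f"{name:12s} {value >> 32:08X}.{value & 0xFFFFFFFF:08X}")
```

The recomputed entry point and return PC match the frame above, so the routine that issued the BUGCHK and its caller both sit inside SYS$CPU_ROUTINES_2608.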

PROC_CORRECTED_ERROR_JACKET saved Scratch Registers:
saved R27 FFFFFFFF.85829EF0 00000000.00002000 IRP$M_ON_ACT_Q
saved R0 FFFFFFFF.85829EF8 00000000.00002000 IRP$M_ON_ACT_Q
saved R1 FFFFFFFF.85829F00 FFFFFFFF.81438000 CPUDB
saved R16 FFFFFFFF.85829F08 08020206.60000003
saved R17 FFFFFFFF.85829F10 00000000.00000000
saved R18 FFFFFFFF.85829F18 00000000.00000000
saved R19 FFFFFFFF.85829F20 00000000.00000000
saved R20 FFFFFFFF.85829F28 FFFFFFFF.8012ADD8 SCH$IDLE_C+00078
saved R21 FFFFFFFF.85829F30 00000000.0000000F
saved R22 FFFFFFFF.85829F38 FFFFFFFF.FFE060D0
saved R23 FFFFFFFF.85829F40 FFFFFFFF.84408000 EXE$GR_SYSTEM_DATA_CELLS
saved R24 FFFFFFFF.85829F48 00000000.00000001
saved R25 FFFFFFFF.85829F50 00000000.00000001
saved R26 FFFFFFFF.85829F58 FFFFFFFF.844100F8 EXE$GL_RADCNT
saved R27 FFFFFFFF.85829F60 00000000.00002000 IRP$M_ON_ACT_Q
saved R28 FFFFFFFF.85829F68 FFFFFFFF.84412C18 MMG$GL_ZERO_LIST_HI_LIM
saved R29 FFFFFFFF.85829F70 FFFFFFFF.85829FC0

Interrupt/Exception Frame:
saved R2 FFFFFFFF.85829F80 FFFFFFFF.84409790 SCH$GQ_ACTIVE_PRIORITY
saved R3 FFFFFFFF.85829F88 FFFFFFFF.844E2A40 SCH$WAIT_ANY_MODE
saved R4 FFFFFFFF.85829F90 FFFFFFFF.81AC6C80
saved R5 FFFFFFFF.85829F98 00000000.00000001
saved R6 FFFFFFFF.85829FA0 FFFFFFFF.81438000 CPUDB
saved R7 FFFFFFFF.85829FA8 00000000.00000000
saved PC FFFFFFFF.85829FB0 FFFFFFFF.8012B15C SCH$IDLE_C+003FC
saved PS FFFFFFFF.85829FB8 00000000.00000303 IPL INT CURR PREV
SP Align = 00(hex) [...............] 03 0 Kern User

Stack Frame:
PV FFFFFFFF.85829FC0 FFFFFFFF.844DDF08 SCH$IDLE
Entry Point FFFFFFFF.8012AD60 SCH$IDLE_C
FFFFFFFF.85829FC8 00000000.00000000
FFFFFFFF.85829FD0 FFFFFFFF.8185E600
return PC FFFFFFFF.85829FD8 FFFFFFFF.80156B1C SCH$INTERRUPT+00C3C
saved R3 FFFFFFFF.85829FE0 FFFFFFFF.844E2A40 SCH$WAIT_ANY_MODE
saved R13 FFFFFFFF.85829FE8 FFFFFFFF.844E18F0 SYS$HIBER
saved R14 FFFFFFFF.85829FF0 00000000.00000001
saved FP FFFFFFFF.85829FF8 00000000.00000000

abrsvc
Respected Contributor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

The short answer is yes.  This can be a memory error.  Have the hardware service organization run diagnostics.

Dan

Max Pierre
Advisor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

Hi Dan,

 

Thanks for your reply.

I'll ask to run memory tests and keep you posted.

Best regards

Volker Halle
Honored Contributor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

Max,

to diagnose the underlying reason for a MACHINECHK crash, you need to look at the errorlog entry immediately preceding the crash.

To extract the errlog entries from the crashdump use

$ ANALYZE/CRASH sysdump.dmp

SDA> CLUE ERRLOG     ! creates CLUE$ERRLOG.SYS and shows the errlog entries saved in the dump.

SDA> EXIT

Then use Compaq Analyze (CA), SEA (System Event Analyzer), or DIAGNOSE to try to decode the errlog entry.

Volker.

Volker Halle
Honored Contributor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

Max,

this MACHINECHK crash is a 'System machine check abort' (SCB Vector 660).

You need to diagnose the errlog entry to try to find the root cause.

Volker.

Max Pierre
Advisor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

Hi,

Someone ran WEBES and the result is:

An uncorrectable CPU access to reserved IO space event has been detected

Full Description:
An uncorrectable CPU0 access to reserved Pci 0 space has been detected.
This may be a result of a previous hardware or software related error.

The onsite engineer will analyze the previous error logs.

I will keep you posted.

Thanks for your input

Max

Max Pierre
Advisor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

Hi,

There was yet another crash two days ago.

It seems to be a Machine Check 660.

The error logged just before the crash is ACPMBFAIL, ACP failure to read mailbox.

I tried several tools (ana/err/elv, Diagnose, Ana/Crash, and CLUE), but none seems able to decode all the information contained in the crash dump and errlog.

Below is the information I collected today. What can I do to solve this problem?

Thanks in advance and best regards

Max

 

Ana/Crash sys$system:sysdump.dmp

OpenVMS (TM) system dump analyzer
...analyzing an Alpha compressed selective memory dump...

Dump taken on 18-OCT-2016 20:56:19.75
MACHINECHK, Machine check while in kernel mode

SDA> clue errlog

Dumpfile Errorlog Entry Information:
------------------------------------
Sequence Date Time Error Message Type
-------- ----------- ----------- --------------------------------
0 8-OCT-2016 22:07:56.00 CRD Throttle Event
1 8-OCT-2016 22:07:56.00 CRD Throttle Event
2 8-OCT-2016 22:07:57.09 Asynch Device Attention
3 8-OCT-2016 22:08:14.24 Device Timeout
4 8-OCT-2016 22:08:15.24 Device Timeout
8885 8-OCT-2016 22:08:17.17 Cold Start (System Boot)
8886 8-OCT-2016 22:08:17.24 Volume Mount
8887 8-OCT-2016 22:08:22.84 Device Timeout
8888 8-OCT-2016 22:08:28.18 Asynch Device Attention
8889 8-OCT-2016 22:08:28.18 Asynch Device Attention
8890 8-OCT-2016 22:08:28.19 Asynch Device Attention
10330 18-OCT-2016 20:56:19.75 Machine Check 660
10331 18-OCT-2016 20:56:19.75 * Crash Entry

Config Entry and Errlog Entries written to CLUE$ERRLOG.SYS file, use COMPAQ Analyze or DECevent to analyze.
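
Picking out "the errorlog entry immediately preceding the crash" from a CLUE ERRLOG listing like the one above can be done mechanically. The following Python sketch works on a few rows copied from that listing; the function name and the four-column row format are assumptions for illustration, not part of SDA:

```python
# Last rows of the CLUE ERRLOG table, copied from the listing.
clue_rows = """\
8890 8-OCT-2016 22:08:28.19 Asynch Device Attention
10330 18-OCT-2016 20:56:19.75 Machine Check 660
10331 18-OCT-2016 20:56:19.75 * Crash Entry
"""

def entry_before_crash(text):
    """Return (sequence, date, time, message) of the row just before the crash entry."""
    rows = []
    for line in text.splitlines():
        # Sequence, date and time are single tokens; the rest is the message.
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1], parts[2], parts[3]))
    for i, row in enumerate(rows):
        if "Crash Entry" in row[3]:
            return rows[i - 1] if i > 0 else None
    return None

preceding = entry_before_crash(clue_rows)
print(preceding)
```

For this dump the entry to decode is sequence 10330, the Machine Check 660, logged with the same timestamp as the crash entry.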

-------------------------------------------------------

$ Diag CLUE$ERRLOG.SYS

Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.3-2
Event sequence number 10330.
Timestamp of occurrence 18-OCT-2016 20:56:19
Time since reboot 9 Day(s) 22:47:38
Host name LAST02

System Model AlphaServer ES45 Model 3B

Entry Type 27. System Uncorrectable Error



========================
Raw Event Data Dump
========================

Entry# (record in file) 13.

Entry Body Size: x00000164
Entry body:

15--<-12 11--<-08 07--<-04 03--<-00 :Byte Order
0000: 00000000 000C0000 0026FFF9 000007DE *......&.........*
0010: 00000000 00000020 20323054 53414C08 *.LAST02 .......*
0020: 285A00B1 0D0A8A37 4C60001B 60030000 *...`..`L7.....Z(*
0030: 000D1E0A 00000000 2020322D 332E3756 *V7.3-2 ........*
0040: 34534520 72657672 65536168 706C4119 *.AlphaServer ES4*
0050: 00000000 00004233 206C6564 6F4D2035 *5 Model 3B......*
0060: 000000A0 00000018 00000000 000000F8 *................*
0070: 00000000 00000000 00000001 00000202 *................*
0080: 00000000 00000000 00000000 00000000 *................*
0090: 00000000 00000000 00000000 00000000 *................*
00A0: 00000000 00000000 00000000 00000000 *................*
00B0: FFFFFFFF 8012AE30 00000000 00000000 *........0.......*
00C0: 00000002 00000000 0000007E FFFE0000 *....~...........*
00D0: 00000000 00008000 00000000 00000000 *................*
00E0: 00000000 00000000 FFFFFEFC 21300386 *..0!............*
00F0: 00000000 00000000 00000000 00000000 *................*
0100: 40000000 00000000 00000000 00000000 *...............@*
0110: 00000000 00000008 00000012 000000C0 *................*
0120: 00000000 00000000 00000000 00000000 *................*
0130: 00000000 00000000 00000000 00000000 *................*
0140: 00000000 00000000 00000000 00000000 *................*
0150: 00010000 00000008 00000000 00000000 *................*
0160: 00000000 * ....*
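
Even without a tool that understands this entry, parts of the raw dump can be read by hand. The header line says bytes 15 down to 0 are printed left to right, so reversing each line byte by byte gives memory order. The Python sketch below decodes two lines copied from the dump; the helper name is hypothetical, and treating offset 0x24 as the entry-type word is an assumption that happens to agree with the "Entry Type 27" shown above:

```python
import struct

def dump_line_bytes(hex_words: str) -> bytes:
    """Return the 16 bytes of one dump line in ascending address order.

    The dump prints bytes 15..0 left to right, so the whole hex string
    is reversed byte by byte.
    """
    return bytes.fromhex(hex_words.replace(" ", ""))[::-1]

line_0020 = dump_line_bytes("285A00B1 0D0A8A37 4C60001B 60030000")
line_0040 = dump_line_bytes("34534520 72657672 65536168 706C4119")

# Assumed entry-type field: little-endian 16-bit word at offset 0x24.
(entry_type,) = struct.unpack_from("<H", line_0020, 0x24 - 0x20)

# Offset 0x40 starts a counted ASCII string: one length byte, then the text.
length = line_0040[0]
text_start = line_0040[1:].decode("ascii")

print(entry_type, length, text_start)
```

The length byte is 25, which is the length of "AlphaServer ES45 Model 3B", and the text continues on the next dump line.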

 


**** V3.4 ********************* ENTRY 14 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.3-2
Event sequence number 10331.
Timestamp of occurrence 18-OCT-2016 20:56:19
Time since reboot 9 Day(s) 22:47:38
Host name LAST02

System Model AlphaServer ES45 Model 3B

Entry Type 37. Crash Re-Start

Bugcheck Minor class 1. Crash Re-start

Bugcheck Msg MACHINECHK, Machine check while in kernel mode
Process ID x00010000
Process Name NULL
KSP xFFFFFFFF85829EB0
ESP xFFFFFFFF8582B000
SSP xFFFFFFFF85825000
USP xFFFFFFFF85825000
R0 x0000000000000210
R1 x0000000000000000
R2 xFFFFFFFF8443DC10
R3 x0000000000000000
R4 x00000000000000B0
R5 x0000000000000000
R6 xFFFFFFFF81438000
R7 x0000000000000000
R8 x0000000000000000
R9 x0000000000000000
R10 x0000000000416010
R11 xFFFFFFFF81A35B00
R12 x000000000041200C
R13 xFFFFFFFF844DDF08
R14 x0000000000416010
R15 x0000000000001000
R16 x0000000000000215
R17 x4000000000000000
R18 x4000000000000000
R19 x0000000000000000
R20 xFFFFFFFF80018054
R21 x000000000000000D
R22 xFFFFFFFF00000000
R23 x0000000000000000
R24 xFFFFFFFF81438000
R25 x0000000000000001
R26 xFFFFFFFF80004030
R27 xFFFFFFFF84408000
R28 xFFFFFFFF80018054
FP xFFFFFFFF85829EB0
SP xFFFFFFFF85829EB0
PC xFFFFFFFF8001808C
PS x3000000000001F04
PTBR x000000000007FFF8
Process Ctl Block Base Re x0000000001838080
PRBR xFFFFFFFF81438000
VPTB xFFFFFEFC00000000
System Ctl Block Base Reg x00000000000003AA
Software Interrupt Summar x0000000000000000
ASN x0000000000000000
ASTSR ASTEN x0000000000000000
FEN x0000000000000000
IPL x000000000000001F
MCES x0000000000000000

------------------------------------------
ana/err CLUE$ERRLOG.cvt

.... some entries dated on 8-oct-2016 ....

******************************* ENTRY 17. *******************************
ERROR SEQUENCE 0. LOGGED ON: CPU_TYPE 0000000C
DATE/TIME 17-NOV-1858 01:15:26.85 SYS_TYPE 00000026
SCS NODE: LAST02 OpenVMS AXP

HW_MODEL: 000007DE Hardware Model = 2014.

"UNKNOWN ENTRY"

ERROR LOG RECORD

ERF$L_SID 000007DE
SYSTEM ID REGISTER
ERL$W_ENTRY 001B
ERROR ENTRY TYPE
EXE$GQ_SYSTIME 8A374C60
0000000A 64 BIT TIME WHEN ERROR LOGGED
ERL$GL_SEQUENCE 0000
UNIQUE ERROR SEQUENCE = 0.

BYTE <65482:0>
00000000 /











%ERF-W-UNKPKTFMT, unknown packet format, entry 18 skipped
%ERF-W-UNKPKTFMT, unknown packet format, entry 19 skipped
%ERF-I-UNKENTRY, unknown entry type, 37
******************************* ENTRY 20. *******************************
ERROR SEQUENCE 40960. LOGGED ON: CPU_TYPE 0000000C
DATE/TIME 17-DEC-1858 14:15:54.76 SYS_TYPE 00000026
SCS NODE: LAST02 OpenVMS AXP ........

HW_MODEL: 000007DE Hardware Model = 2014.

FATAL BUGCHECK

ACPMBFAIL, ACP failure to read mailbox

PROCESS NAME
PROCESS ID 00000000

ERROR PC 00000000 00000000

Process Status = 00000000 00000000, SW = 00, Previous Mode = KERNEL
System State = 00, Current Mode = KERNEL
VMM = 00 IPL = 0, SP Alignment = 0

STACK POINTERS

KSP 00000000 00000000 ESP 00000000 00000000 SSP 00000000 00000000
USP FF8012AE 30000000

GENERAL REGISTERS

R0 7EFFFE00 00FFFFFF R1 02000000 00000000 R2 00000000 00000000
R3 00000080 00000000 R4 FC213003 86000000 R5 00000000 00FFFFFE
R6 00000000 00000000 R7 00000000 00000000 R8 00000000 00000000
R9 00000000 00000000 R10 12000000 C0400000 R11 00000000 08000000
R12 00000000 00000000 R13 00000000 00000000 R14 00000000 00000000
R15 00000000 00000000 R16 00000000 00000000 R17 00000000 00000000
R18 00000000 00000000 R19 00000000 08000000 R20 00000000 00000100
R21 00000000 00000000 R22 00000000 00000000 R23 00000000 00000000
R24 00000000 00000000 R25 00000000 00000000 R26 00000000 00000000
R27 00000000 00000000 R28 00000000 00000000 FP 00000000 00000000
SP 00000000 00000000 PC 00000000 00000000 PS 00000000 00000000

SYSTEM REGISTERS

PTBR 00000000 00000000
Page Table Base Register
PCBB 00000000 00000000
Privileged Context Block Base
PRBR 00000000 00000000
Processor Base Register
VPTB 00000000 00000000
Virtual Page Table Base Register
SCBB 00000000 00000000
System Control Block Base
SISR 00000000 00000000
Software Interrupt Summary Register
ASN 00000000 00000000
Address Space Number
ASTSR_ASTEN 00000000 00000000
AST Summary/AST Enable
FEN 00000000 00000000
Floating-Point Enable
IPL 00000000 00000000
Interrupt Priority Level
MCES 00000000 00000000
Machine Check Error Summary
%ERF-W-UNKPKTFMT, unknown packet format, entry 21 skipped
%ERF-W-UNKPKTFMT, unknown packet format, entry 22 skipped

------------------------------

And what I found about the ACPMBFAIL in the system messages does not help a lot ...

ACPMBFAIL, ACP failure to read mailbox
        Facility: BUGCHECK, System Bugcheck
        Explanation: The OpenVMS software detected an irrecoverable, inconsistent condition. After all physical memory is written to a system dump file, the system automatically reboots if the BUGREBOOT system parameter is set to 1.
        User Action: Submit a Software Performance Report (SPR) that describes the conditions leading to the error. Include a backup save set containing the system dump file and the error log file active at the time of the error. (Use the /IGNORE=NOBACKUP qualifier with the BACKUP command that produces the save set included with the SPR.)

abrsvc
Respected Contributor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

IIRC, the throttle events are many single-bit errors occurring in succession. I would look first at the memory cards in the machine as the culprit here. After all, the mailbox is a memory device.

Try running memory diagnostics.

Dan

Volker Halle
Honored Contributor

Re: Machine Check OpenVMS Alpha 7.3-2 EXE$SYSTEM_CORRECTED_ERROR_C+00768

Max,

the underlying problem is a Machine Check 660! You need to run Compaq Analyze (CA) or SEA (System Event Analyzer) to decode ERRLOG.SYS. Diagnose can't decode ES45 CPU-related errors.

The CRD throttle Events are just indications of some correctable memory errors detected during system bootstrap. These are not necessarily related to the Machine Check.

The ACPMBFAIL is a bogus error, it's a MACHINECHK crash.
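
One sign that ANALYZE/ERROR misread these entries is the date it printed: 17-NOV-1858 is the zero point of OpenVMS system time, which counts 100-nanosecond ticks in a 64-bit quadword. The Python sketch below converts the quadword as ana/err displayed it for entry 17, and then a second value. That second value is my own reading of the eight bytes at offset 0x26 of the raw event dump as one little-endian quadword; it is an assumption, not something any of the tools printed:

```python
from datetime import datetime, timedelta

# OpenVMS system time counts 100 ns ticks from this base date.
VMS_EPOCH = datetime(1858, 11, 17)

def vms_time(quadword: int) -> datetime:
    """Convert a 64-bit OpenVMS system time to a datetime (microsecond precision)."""
    return VMS_EPOCH + timedelta(microseconds=quadword // 10)

# As shown by ana/err: low longword 8A374C60, high longword 0000000A.
as_decoded_by_erf = vms_time(0x0000000A_8A374C60)

# Assumption: the full quadword taken from the raw dump bytes.
from_raw_dump = vms_time(0x00B10D0A_8A374C60)

print(as_decoded_by_erf)
print(from_raw_dump)
```

The first value lands about 75 minutes after the base date, matching the 01:15:26.85 that ana/err printed. The second lands on 18-OCT-2016 20:56:19.75, the time of the crash. That suggests the record holds a valid timestamp and the old error log formatter simply does not understand this entry format, so the ACPMBFAIL text it prints afterwards should not be trusted either.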

If the previous analysis reported "An uncorrectable CPU access to reserved IO space event has been detected", this may very well be a software problem (some driver accessed an incorrect IO space address). Is any special hardware or driver installed? When did this problem first show up?

Volker.