Victor Mendham
Regular Advisor

Server crashing every month, this time with.. could this be memory related?

I originally posted in March as
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=516763.

The server no longer crashes while the backup is running, but it still crashes at random times, so I do not think the problem is device (tape drive) or controller (HSJ) related. Has anyone seen this error before, and could it be memory related?

%%%%%%%%%%% OPCOM 21-MAY-2004 01:30:56.92 %%%%%%%%%%% (from node FREDFLINTSTONE )
Message from user INTERnet on FREDFLINTSTONE
INTERnet ACP Timeout Idle FTP Server

%%%%%%%%%%% SYSLOA 21-MAY-2004 06:20:33.70 %%%%%%%%%%%

MACHINE CHECK @ XMI node E1880000 - CANNOT CONTINUE

Valid mask: FFFFFFFF
000000FF
XDEV = 00058087
XBE = 90042040
XFADR0 = 61E00108
XGPR = 00000036
NSCSR0 = 00000020
XCR0 = 00000000
XFAER0 = 1000000F
XBEER0 = 00000000
WFADR0 = 1D7F7860
WFADR1 = 1D7F8100
NCSR = 00000800
ICSR = 00000001
VMAR = 000007E0
VTAG = 81AFA690
VDATA = EF1652D4
ECR = 000000CA
PAMODE = 00000000
MMEADR = 00019C48
MMEPTE = 911A7B38
MMESTS = 1C00C004
TBADR = 00000000
TBSTS = 800001D0
PCADR = FFFFFFF8
PCSTS = FFFFF830
PCCTL = FFFFFC13
CCTL = 00000037
BCETSTS = 00000140
BCETIDX = 01400020
BCETAG = 8040F600
BCEDSTS = 00000400
BCEDIDX = 00000020
BCEDECC = 00000000
CEFADR = E1E00108
CEFSTS = 0001920A
NESTS = 00000000
NEOADR = 1D601FAC
NEOCMD = 00000F15
NEDATHI = 00018001
NEDATLO = 00018001
NEICMD = 0000000C

Stack frame:

00000018
80060001
00000001
81AF82DF
8000848E
17800000
50E10080
PC = 8320D324
PSL = 04150009



**** Fatal BUG CHECK, version = V5.5-2 MACHINECHK, Machine check while in kee

Crash CPU: 01 Primary CPU: 01

Active/available CPU masks: 0000001E/0000001E

Current process = NULL

Register dump

R0 = 041F0000
R1 = 83B4A1E8
R2 = 859BEB20
R3 = 831B46CC
R4 = 00000000
R5 = 00000000
R6 = 81B2CC50
R7 = 864C2000
R8 = 81B36E00
R9 = 00000000
R10= 00000001
R11= 831B0C00
AP = 859BEAC0
FP = 7FFE7798
SP = 864C3144
PC = 83B42B21
PSL= 041F0004

Kernel/interrupt/boot stack

864C314C 831B7DE0
864C3150 81B07C00
864C3154 0000000E
864C3158 00000000
864C315C 83214EA0
864C3160 83202080
864C3164 82E2E810
864C3168 00000000
864C316C 00000000
864C3170 7FEB2E80
864C3174 7FEB2F3E
864C3178 7FEB2F0A
864C317C 7FFE77BC
864C3180 00000000
864C3184 00000000
864C3188 00000000
864C318C 00000000
864C3190 00000000
864C3194 00000000
864C3198 00000018
864C319C 80060001
864C31A0 00000001
864C31A4 81AF82DF
864C31A8 8000848E
864C31AC 17800000
864C31B0 50E10080
864C31B4 8320D324
864C31B8 04150009
864C31BC 00000008
864C31C0 00000008
864C31C4 00000001
864C31C8 864C2000
864C31CC 0000000E
864C31D0 00000000
864C31D4 83214EA0
864C31D8 83216128
864C31DC 81B343BA
864C31E0 00000001
864C31E4 00000001
864C31E8 0000000E
864C31EC 864C2000
864C31F0 83273840
864C31F4 8B34E200
864C31F8 81AF82DF
864C31FC 04030001


Loaded images

UCX$INTERNET_SERVICES 81969400 8197E000
[SYSMSG]SYSMSG.EXE 819A2C00 819D1A00
[SYS$LDR]SYSLDR_DYN.EXE 81B38E00 81B3AE00
[SYS$LDR]DDIF$RMS_EXTENSION.EXE 81B3B400 81B3C600
[SYS$LDR]RECOVERY_UNIT_SERVICES.EXE 81B3C800 81B3D000
[SYS$LDR]RMS.EXE 819D1A00 819F9800
VAXCLUSTER_CACHE.EXE 81A49800 81A49E00
SYS$NETWORK_SERVICES.EXE 81A4A400 81A4A600
SYS$TRANSACTION_SERVICES.EXE 81A4AC00 81A65E00
CPULOA.EXE 81A66200 81A69400
LMF$GROUP_TABLE.EXE 81A6A400 81A6B800
SYSLICENSE.EXE 81A6BC00 81A6D400
SYSGETSYI.EXE 81A6DA00 81A6F000
SYSDEVICE.EXE 81A6F400 81A71A00
MESSAGE_ROUTINES.EXE 81A72000 81A77200
EXCEPTION.EXE 81A87600 81A90600
LOGICAL_NAMES.EXE 81A90E00 81A92A00
SECURITY.EXE 81A93000 81A95600
LOCKING.EXE 81A95C00 81A9B000
PAGE_MANAGEMENT.EXE 81A9B600 81AA3E00
WORKING_SET_MANAGEMENT.EXE 81AE4800 81AE9800
IMAGE_MANAGEMENT.EXE 81AEA200 81AECE00
EVENT_FLAGS_AND_ASTS.EXE 81AED400 81AEE800
IO_ROUTINES.EXE 81AEEE00 81AF7A00
PROCESS_MANAGEMENT.EXE 81AF8200 81B02000
ERRORLOG.EXE 81B2EA00 81B2F400
PRIMITIVE_IO.EXE 81B2FA00 81B30A00
SYSTEM_SYNCHRONIZATION_MIN.EXE 81B30E00 81B32A00
SYSTEM_PRIMITIVES.EXE 81B33000 81B36600


**** Starting memory dump....

Header and error log buffers dumped...
SPT & GPT dumped...
System space dumped...


**** Memory dump complete....



**** to shadow set member, unit ... 0


?0002 External halt (CTRL/P, break, or external halt)
PC = 83B4D4F0
PSL = 041F8208
ISP = 864C3098

Initializing system

#123456789 0123456789 0123456789 0123456789 012345#

F E D C B A 9 8 7 6 5 4 3 2 1 0 NODE #
A A A A M M M M M M P P P P TYP
+ + + + + + + + + + + + + + STF
. . . . . . . . . . E E E B BPD
. . . . . . . . . . + + + + ETF
. . . . . . . . . . E E E B BPD


. . . . A2 A1 C1 B2 B1 D1 . . . . ILV
. . . . 128 128 64 64 64 32 . . . . 480 Mb

Console = V1.00 RBDs = V1.00 EEPROM = 1.00/1.07 SN = ############


Loading system software

* Initializing adapter
* Specified adapter initialized successfully
* Connecting to storage controller
* Connecting to MSCP server layer
* Connecting to boot disk
* Reading bootblock from disk
* Passing control to transfer address

%SYSBOOT-W-WS default and quota raised to PHD+MINWSCNT
VAX/VMS Version V5.5-2 Major version id = 1 Minor version id = 0

%CNXMAN, Using remote access method for quorum disk
waiting to form or join a VAXcluster system
%CNXMAN, Sending VAXcluster membership request to system FREDFLINTSTONE
%CNXMAN, Now a VAXcluster member -- system BARNEYRUBBLE
%SHADOW-I-VOLPROC, DSA0: contains the member named to VMB. System boot and du.

%SMP-I-CPUBOOTED, CPU #03 has joined the PRIMARY CPU in multiprocessor operatin
%SMP-I-CPUBOOTED, CPU #02 has joined the PRIMARY CPU in multiprocessor operatin
%SMP-I-CPUBOOTED, CPU #04 has joined the PRIMARY CPU in multiprocessor operatin
%SHADOW-I-VOLPROC, DSA0: contains the member named to VMB. System boot and du.

%SHADOW-I-VOLPROC, DSA0: has completed volume processing.

$! Copyright (c) 1992 Digital Equipment Corporation. All rights reserved.
%STDRV-I-STARTUP, VMS startup begun at 21-MAY-2004 06:34:53.93
%SET-I-NEWAUDSRV, identification of new audit server process is 2060020D
%%%%%%%%%%% OPCOM 21-MAY-2004 06:36:20.43 %%%%%%%%%%%
Operator _BARNEYRUBBLE$OPA0: has been enabled, username SYSTEM

%%%%%%%%%%% OPCOM 21-MAY-2004 06:36:21.21 %%%%%%%%%%%
Operator status for operator _BARNEYRUBBLE$OPA0:
CENTRAL, PRINTER, TAPES, DISKS, DEVICES, CARDS, NETWORK, CLUSTER, LICENSE,
OPER1, OPER2, OPER3, OPER4, OPER5, OPER6, OPER7, OPER8, OPER9, OPER10, OPER11,
OPER12

%%%%%%%%%%% OPCOM 21-MAY-2004 06:36:23.66 %%%%%%%%%%%
Logfile has been initialized by operator _BARNEYRUBBLE$OPA0:
Logfile is BARNEYRUBBLE::SYS$SYSROOT:[SYSMGR]OPERATOR.LOG;24

%%%%%%%%%%% OPCOM 21-MAY-2004 06:36:23.66 %%%%%%%%%%%
Operator status for operator BARNEYRUBBLE::SYS$SYSROOT:[SYSMGR]OPERATOR.LOG;24
CENTRAL, PRINTER, TAPES, DISKS, DEVICES, CARDS, NETWORK, CLUSTER, SECURITY,
LICENSE, OPER1, OPER2, OPER3, OPER4, OPER5, OPER6, OPER7, OPER8, OPER9, OPER10,
OPER11, OPER12

Uwe Zessin
Honored Contributor

Victor,
I suggest you get someone to analyze the error log. Many years ago I got crashes, too, and when I checked the error log I saw a big burst of memory errors just before the crash. I can't tell what your problem is, sorry.
Victor Mendham
Regular Advisor

Thanks, you just reminded me that I was going to implement a monthly cleanup of the error log for this server (it's too large).
Uwe Zessin
Honored Contributor

You could also check the accounting and audit files. It is not unusual for me to see files of 100,000 blocks or more in the field. The operator log has just been rolled over, I assume...
Lokesh_2
Esteemed Contributor

Hi Victor,

A machine check normally implies a hardware problem. As Uwe suggested, please check the error log entries for analysis.

HTH,
Thanks & regards,
Lokesh Jain
What would you do with your life if you knew you could not fail?
Victor Mendham
Regular Advisor

The error log has nothing in it for May 18, 19 or 20, when the server crashed. Previously it was crashing in the middle of a backup. I suspected the tape drive or controller, but our h/w support said memory. We moved the tape backups to the other node in the cluster (same drive, same controller, different server) and it worked fine there with no crashes. We moved it back and no more crashes (or at least no crashes while the backups or tape dumps were running).

So the error log gives no indication at this time; next up is the crash dump.
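
Before opening the dump itself, one quick check is whether the addresses on the console fall inside any of the loaded images listed in the crash output. A rough sketch in Python, to be run on any workstation rather than on the VAX. The helper names `parse_images` and `find_image` are hypothetical, and treating each range as start-inclusive and end-exclusive is an assumption about how the list should be read:

```python
# Loaded-image list copied from the console output in the original post.
# Columns: image name, start address, end address (hex).
LOADED_IMAGES = """
UCX$INTERNET_SERVICES 81969400 8197E000
[SYSMSG]SYSMSG.EXE 819A2C00 819D1A00
[SYS$LDR]SYSLDR_DYN.EXE 81B38E00 81B3AE00
[SYS$LDR]DDIF$RMS_EXTENSION.EXE 81B3B400 81B3C600
[SYS$LDR]RECOVERY_UNIT_SERVICES.EXE 81B3C800 81B3D000
[SYS$LDR]RMS.EXE 819D1A00 819F9800
VAXCLUSTER_CACHE.EXE 81A49800 81A49E00
SYS$NETWORK_SERVICES.EXE 81A4A400 81A4A600
SYS$TRANSACTION_SERVICES.EXE 81A4AC00 81A65E00
CPULOA.EXE 81A66200 81A69400
LMF$GROUP_TABLE.EXE 81A6A400 81A6B800
SYSLICENSE.EXE 81A6BC00 81A6D400
SYSGETSYI.EXE 81A6DA00 81A6F000
SYSDEVICE.EXE 81A6F400 81A71A00
MESSAGE_ROUTINES.EXE 81A72000 81A77200
EXCEPTION.EXE 81A87600 81A90600
LOGICAL_NAMES.EXE 81A90E00 81A92A00
SECURITY.EXE 81A93000 81A95600
LOCKING.EXE 81A95C00 81A9B000
PAGE_MANAGEMENT.EXE 81A9B600 81AA3E00
WORKING_SET_MANAGEMENT.EXE 81AE4800 81AE9800
IMAGE_MANAGEMENT.EXE 81AEA200 81AECE00
EVENT_FLAGS_AND_ASTS.EXE 81AED400 81AEE800
IO_ROUTINES.EXE 81AEEE00 81AF7A00
PROCESS_MANAGEMENT.EXE 81AF8200 81B02000
ERRORLOG.EXE 81B2EA00 81B2F400
PRIMITIVE_IO.EXE 81B2FA00 81B30A00
SYSTEM_SYNCHRONIZATION_MIN.EXE 81B30E00 81B32A00
SYSTEM_PRIMITIVES.EXE 81B33000 81B36600
"""


def parse_images(text):
    """Turn the pasted list into (name, start, end) tuples."""
    images = []
    for line in text.strip().splitlines():
        name, start, end = line.split()
        images.append((name, int(start, 16), int(end, 16)))
    return images


def find_image(address, images=None):
    """Return the name of the image whose range holds address, else None.

    Assumption: a range covers start <= address < end.
    """
    if images is None:
        images = parse_images(LOADED_IMAGES)
    for name, start, end in images:
        if start <= address < end:
            return name
    return None


if __name__ == "__main__":
    checks = [
        ("PC from stack frame", 0x8320D324),
        ("PC from register dump", 0x83B42B21),
        ("stack frame longword", 0x81AF82DF),
    ]
    for label, value in checks:
        print(f"{label} {value:08X}: {find_image(value)}")
```

With the list as posted, both PCs come back as None, i.e. outside every image shown, while 81AF82DF from the stack frame lands in PROCESS_MANAGEMENT.EXE. That alone does not say what was executing; it only narrows where to look in the dump.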
Willem Grooters
Honored Contributor

Victor:


moved it back and no more crashes (or at least no crashes while the backups or tape dumps were running).


Am I right that the system crashes once in a while, not just during backup? That's not normal either. If this is true, I would strongly suggest having this system checked thoroughly. There is definitely something wrong, and memory is a likely cause.

BTW, have you physically moved the tape drive to the other system? It could have been the connection between the drive and the machine as well. Since it has been fastened again, the cable may now be lying a bit differently, and that could make a difference, as could some connections (earthing, to mention one).
Willem Grooters
OpenVMS Developer & System Manager