Operating System - OpenVMS

Re: System crashes every 3 weeks.

 
Volker Halle
Honored Contributor

Doug,

bad luck - OpenVMS V7.1-1H2 did NOT log any machine check entry.

This is the SAME machine/problem as already discussed in a previous thread:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=808549

I keep a database of all crashes, that's why I know ;-)

Could you please try to provide the stack data as requested in the previous thread:

$ ANAL/CRASH SYS$SYSTEM:SYSDUMP.DMP
SDA> READ/EXEC
SDA> SHOW STACK/QUAD 7FFA1FC0;40

It may also be possible to find the machine check logout frame in the dump.

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Thanks for the link Volker.
You're absolutely right. Adrian is my hardware support contact and I'm that "sysadmin is in west coast Canada" he referred to.

In any case, I was not aware that they were using this forum to troubleshoot the problem. I thought I'd try as I'm not getting anywhere through the official channels.

Here's the output from the SHOW STACK/QUAD 7FFA1FC0;40 command:

Specified Stack Range
---------------------
00000000.7FFA1FC0 00000000.0002F030
00000000.7FFA1FC8 00000000.010E0019
00000000.7FFA1FD0 00000000.7AF77A5C
00000000.7FFA1FD8 00000000.7AF78AA0
00000000.7FFA1FE0 00000000.00000001
00000000.7FFA1FE8 00000000.00000003
00000000.7FFA1FF0 00000000.0030F080
00000000.7FFA1FF8 00000000.0000001B
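
For anyone following along, here is a minimal Python sketch that parses the listing above and picks out the last two quadwords, which a later reply in this thread identifies as the saved PC and PS. It assumes the 0x40-byte frame holds R2 through R7 followed by PC and PS (my understanding of the Alpha exception frame, not something stated in this thread); the helper names are made up for illustration.

```python
# Stack dump lines as posted above: "<address> <quadword value>", both in
# SDA's HHHHHHHH.LLLLLLLL hexadecimal notation.
DUMP = """\
00000000.7FFA1FC0 00000000.0002F030
00000000.7FFA1FC8 00000000.010E0019
00000000.7FFA1FD0 00000000.7AF77A5C
00000000.7FFA1FD8 00000000.7AF78AA0
00000000.7FFA1FE0 00000000.00000001
00000000.7FFA1FE8 00000000.00000003
00000000.7FFA1FF0 00000000.0030F080
00000000.7FFA1FF8 00000000.0000001B
"""


def parse_quad(text):
    """Convert SDA's 'HHHHHHHH.LLLLLLLL' notation to an integer."""
    return int(text.replace(".", ""), 16)


def parse_stack(dump):
    """Return (address, value) pairs from a SHOW STACK/QUAD listing."""
    rows = []
    for line in dump.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip headings and blank lines
        rows.append((parse_quad(parts[0]), parse_quad(parts[1])))
    return rows


rows = parse_stack(DUMP)

# Assumption: the frame is R2..R7, PC, PS, so PC and PS are the last
# two quadwords of the 8-quadword range.
(pc_addr, pc), (ps_addr, ps) = rows[-2], rows[-1]
print(f"PC = {pc:08X} (stored at {pc_addr:08X})")
print(f"PS = {ps:08X} (stored at {ps_addr:08X})")
```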
Galen Tackett
Valued Contributor

Re: System crashes every 3 weeks.

Doug,

Just curious--just how precisely do you mean "every 3 weeks":

1) every 3 weeks, within a few milliseconds
2) every 3 weeks, within a couple of hours
3) every 3 weeks, within a few days

I'll bet your answer is 3. :-)

To hazard a little speculation around each possibility:

1) would be pretty strange, to me at least. Perhaps a flaw in the fabric of space-time. :-)

2) might suggest a link to some calendar-related activity. Perhaps a procedure or device that is used every couple of weeks? But you'd probably have noticed that.

3) suggests something a lot more random or at least aperiodic, which is why I guessed you'd pick this answer.

Just a few thoughts which may at least stimulate some thought, if they're of any use at all...

Galen
Ian Miller.
Honored Contributor

Re: System crashes every 3 weeks.

Volker,
"I keep a database of all crashes, that's why I know"
and I thought you just remembered them all rather than having a private copy of canasta :-)

____________________
Purely Personal Opinion
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

the interrupt/exception stack frame shows that the PC at the time of the MACHINECHK is in P0 space and the PS shows user mode at IPL 0:

00000000.7FFA1FF0 00000000.0030F080 <<< PC
00000000.7FFA1FF8 00000000.0000001B <<< PS

SDA> eva/ps 0000001B
MBZ  SPAL  MBZ          IPL  VMM  MBZ  CURMOD  INT  PRVMOD
 0    00   00000000000   00   0    0   USER     0   USER

so whatever the instruction is

SDA> EXA/INS 30F080

it CANNOT have caused a MACHINECHK through a programming error (e.g. an access into I/O space), because you can't do that in USER mode. It could have caused an access to a bad memory page, but that would be pure speculation !!
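
The reasoning above can be sketched in Python as a rough cross-check of the EVA/PS output. The bit positions are an assumption based on the field order SDA prints (SPAL, IPL, VMM, CURMOD, INT, PRVMOD) and my reading of the Alpha PS layout; the address ranges are the traditional 32-bit P0/P1 boundaries. The function names are hypothetical.

```python
MODES = ["KERNEL", "EXEC", "SUPER", "USER"]


def decode_ps(ps):
    """Split a saved PS quadword into the fields SDA's EVA/PS displays.

    Assumed layout: SPAL <61:56>, IPL <12:8>, VMM <7>,
    CURMOD <4:3>, INT <2>, PRVMOD <1:0>.
    """
    return {
        "SPAL": (ps >> 56) & 0x3F,
        "IPL": (ps >> 8) & 0x1F,
        "VMM": (ps >> 7) & 0x1,
        "CURMOD": MODES[(ps >> 3) & 0x3],
        "INT": (ps >> 2) & 0x1,
        "PRVMOD": MODES[ps & 0x3],
    }


def address_region(addr):
    """Classify an address using the traditional 32-bit P0/P1 split (assumed)."""
    if addr < 0x40000000:
        return "P0"
    if addr < 0x80000000:
        return "P1"
    return "outside P0/P1"


fields = decode_ps(0x1B)
print(fields["CURMOD"], fields["PRVMOD"], "IPL", fields["IPL"])
print("PC 0030F080 is in", address_region(0x0030F080))
print("Stack 7FFA1FC0 is in", address_region(0x7FFA1FC0))
```

With PS = 0000001B this gives user mode for both the current and previous mode at IPL 0, which agrees with the EVA/PS display above.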

Please issue the following commands in SDA:

SDA> EXA/INS 30F080-30;40

to examine the instruction stream. If the current instruction includes a memory access and you're able to figure out the address, also do

SDA> SHOW PROC/PAGE address;1000

Otherwise, I'll help you to figure out the page number...

To get an overview of the last couple of crashes on this node, just try TYPE CLUE$HISTORY - if there is something timing related, you might be able to spot a pattern.

Volker.
DICTU OpenVMS
Frequent Advisor

Re: System crashes every 3 weeks.

Doug,

If you really suspect the memory, then try to shut down the machine and bring it to the SRM console. Then start 2 memexers per CPU and let them run for a few hours. If there is really bad RAM, it should show on the console. To stop the memexers, give the kill_diag command (or init the system). To show the status of the memexers, type show_diag.

(I could be a little off with the commands; look in the manual or try help or man for the exact commands.)

It is quite possible that the RAM has gone bad. At my current site we have had several issues with bad RAM.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Volker:
SDA> EXA/INS 30F080
00000000.0030F080: BIS R31,#X1D,R7

SDA> EXA/INS 30F080-30;40
00000000.0030F050: CVTDG F3,F3
00000000.0030F054: ADDG F4,F3,F3
00000000.0030F058: CVTGD F3,F3
00000000.0030F05C: STD F3,#X0CF8(FP)
00000000.0030F060: TRAPB
00000000.0030F064: LDA R16,#X0008(FP)
00000000.0030F068: BIS R31,#X01,R25
00000000.0030F06C: LDQ R26,#XFF60(R2)
00000000.0030F070: LDQ R27,#XFF68(R2)
00000000.0030F074: JSR R26,(R26)
00000000.0030F078: JMP R31,(R0)
00000000.0030F07C: TRAPB
00000000.0030F080: BIS R31,#X1D,R7
00000000.0030F084: STL R7,#X0020(FP)
00000000.0030F088: LDL R3,#X0CE0(FP)
00000000.0030F08C: ADDL/V R3,#X01,R3
00000000.0030F090: LDA R16,#X8000(R31)

I looked at the clue$history file and there doesn't appear to be any pattern other than approx every 3 weeks.
e.g. The previous 4 crashes are:
Date      Uptime
========  ==========
Dec 29    22 days
Jan 20    25 days
Feb 14    25 days
Mar 29    23 days
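
To put a number on "approx every 3 weeks", here is a small Python sketch that summarises the four uptimes listed above. It only uses the values from this post; with four data points it is a rough indication of the spread, not evidence of a pattern.

```python
import statistics

# Uptimes in days before each of the four crashes listed above.
uptimes = [22, 25, 25, 23]

mean_days = statistics.mean(uptimes)
spread = statistics.stdev(uptimes)  # sample standard deviation
shortest, longest = min(uptimes), max(uptimes)

print(f"mean uptime : {mean_days:.2f} days")
print(f"std. dev.   : {spread:.2f} days")
print(f"range       : {shortest} to {longest} days")
```

The uptimes vary by a few days around the mean, which fits "every 3 weeks, within a few days" rather than a tightly scheduled event.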

Sorry, I don't know what address to put in the SHOW PROC/PAGE address;1000 command.


Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

the exception PC points to a BIS R31,#X1D,R7 instruction, so no memory access is involved in executing this instruction, except for the access to the page where the instruction itself is stored. Please remember to repeat these steps against the next crash(es).
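
As a rough illustration of that check, the Python sketch below marks which instructions in an excerpt of the listing read or write data memory. The rule is a simplification I am assuming here: mnemonics starting with LD or ST are treated as memory accesses, except LDA and LDAH, which only compute an address. The function names are hypothetical.

```python
# Excerpt of the EXA/INS listing posted earlier in the thread.
LISTING = """\
00000000.0030F06C: LDQ R26,#XFF60(R2)
00000000.0030F074: JSR R26,(R26)
00000000.0030F080: BIS R31,#X1D,R7
00000000.0030F084: STL R7,#X0020(FP)
00000000.0030F08C: ADDL/V R3,#X01,R3
00000000.0030F090: LDA R16,#X8000(R31)
"""


def accesses_data_memory(mnemonic):
    """Simplified rule (assumption): LDx/STx touch memory, LDA/LDAH do not."""
    base = mnemonic.upper().split("/")[0]  # drop qualifiers such as /V
    if base in ("LDA", "LDAH"):
        return False
    return base.startswith(("LD", "ST"))


def classify(listing):
    """Map each instruction address to True if it accesses data memory."""
    result = {}
    for line in listing.splitlines():
        address, _, instruction = line.partition(": ")
        if not instruction:
            continue
        mnemonic = instruction.split()[0]
        result[int(address.replace(".", ""), 16)] = accesses_data_memory(mnemonic)
    return result


table = classify(LISTING)
for address, touches in table.items():
    print(f"{address:08X}: {'memory access' if touches else 'no data access'}")
```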

Now let's try to find the machinecheck logout frame in the dump:

SDA> READ SYSDEF
SDA> SHOW STACK @(@smp$gl_cpu_data+CPU$L_PROC_MCHK_ABORT_SVAPTE+4);2F0

You have to enter the command on one line.
(The above command only applies to a single-CPU system, which this node is.)
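
The address expression in that command is a double indirection. Below is a Python sketch of how I read it, using a toy memory model: every address, offset and value is made up for illustration, and only the shape of the expression comes from the command above. It assumes the unary @ (contents of) binds more tightly than +.

```python
# Hypothetical symbol values, NOT taken from any real system.
SYMBOLS = {
    "smp$gl_cpu_data": 0x1000,
    "CPU$L_PROC_MCHK_ABORT_SVAPTE": 0x100,
}

# Hypothetical memory contents: address -> value stored there.
MEMORY = {
    0x1000: 0x2000,              # pointer to the CPU's data area
    0x2000 + 0x100 + 4: 0x3000,  # pointer stored just past the SVAPTE field
}


def contents(address):
    """Stand-in for SDA's unary @ operator: value stored at an address."""
    return MEMORY[address]


# @(@smp$gl_cpu_data + CPU$L_PROC_MCHK_ABORT_SVAPTE + 4);2F0
cpu_data = contents(SYMBOLS["smp$gl_cpu_data"])           # inner @
field = cpu_data + SYMBOLS["CPU$L_PROC_MCHK_ABORT_SVAPTE"] + 4
start = contents(field)                                   # outer @
length = 0x2F0

print(f"display {length:#x} bytes starting at {start:#x}")
```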

Try to include the output as a text file attachment in your next reply (or mail it to me - see my forum profile).

Volker.
Doug_81
Frequent Advisor

Re: System crashes every 3 weeks.

Thanks for your help Volker.
I've attached a text file with the output.
Volker Halle
Honored Contributor

Re: System crashes every 3 weeks.

Doug,

thanks for the data:

8A0E0058 00000001.00000205 = mchk code

Could you please compare the data with the output of the same SDA command in the running system? Sometimes mchk data is left in this buffer from 'expected' machine checks (like during SYSMAN IO AUTOCONFIGURE when scanning the device configuration).

If the same data exists in the running system, we know that no machine check frame has been logged and need to try to find out why OpenVMS crashed with a MACHINECHK.
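
If both outputs are saved as text, the comparison can be done with a few lines of Python. This is only a sketch: the sample lines are placeholders rather than real SDA output, and the function name is hypothetical.

```python
def same_buffer(dump_lines, running_lines):
    """Return True if both listings contain the same data.

    Whitespace differences and blank lines are ignored.
    """
    def normalise(lines):
        return [" ".join(line.split()) for line in lines if line.strip()]

    return normalise(dump_lines) == normalise(running_lines)


# Placeholder data for illustration only.
from_dump = ["00000000.00000001  00000000.00000002", ""]
from_running = ["00000000.00000001 00000000.00000002"]
from_other = ["00000000.00000001 00000000.FFFFFFFF"]

if same_buffer(from_dump, from_running):
    print("identical: the buffer probably holds stale data from an earlier machine check")
else:
    print("different: the dump may contain a frame logged at crash time")
```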

Volker.