
Re: system crash

 
Sk Noorul  Hassan
Regular Advisor

system crash

One of my VAX machines, the hot standby (all processes remain in HIB mode), went down without any error message in the operator log or the application log. It was an application + VMS crash. According to the error log and the CLUE file, it failed to see the duty machine, so it tried to come up as the new duty machine; when it then saw that a duty machine was still there, it crashed itself. However, there was no network problem. I am unable to understand the reason for this. This is the 2nd time this machine has gone down in a similar situation. I am attaching the CLUE file for your reference.

Pls suggest..
16 REPLIES
Richard White_5
Advisor

Re: system crash

Good Morning Sk...

It would appear that this Halt_Restart is very similar to the crash submitted by Rajarshi Gupta, back on 01-Jul-2005. In fact, in both Clue-Listings, the node name is the same. (TGEV01)

The K-Stk footprint is similar (but not exactly the same) in both of these crashes. My suspicion, based on the same node-name, and your statement-- "This is the 2nd time this machine has gone down in similar situation" is that both you and Rajarshi are trying to troubleshoot and isolate this problem.

If my previous two paragraphs are correct, and this "IS" the same system/VAX-4100A, then we may have to lean towards a hardware failure. I say this because the first Halt that was reported by Rajarshi occurred in the SYSTSG image at approximately PC=7E07 or 7E08 (updated PC reflected?), while your second Halt occurred at PC=891EB or 891EA (again, not sure if the Halt-Restart-Bugcheck displays the Failing-PC or the Updated-PC) in the SYSDSK image.

In other words, I would find it hard to believe that you have two (2) different executable images with similar code-threads that both execute Halt instructions while in Kernel-Mode. If the "same" system has crashed more than once, in different code-streams, it would make more sense that the cause is an internal IC-Chip failure (ALU/Mux/Shift-Reg) on the VAX Processor module.

But if the crashes are occurring on two systems, then it is likely to be a problem that is common, but independent of the actual system-boxes. For example you mention that this system checks to see if there is a "duty-machine", and if not, then this system tries to become the "new duty machine". If there are multiple systems that each check for "duty-machine" (via a keep-alive-broadcast over the network?) and the network-concentrator/switch/hub does not forward the broadcast/multicast, you may have a network-filtering problem...
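To make the scenario above concrete, here is a minimal Python sketch of how a standby could promote itself after a missed keep-alive and then halt when the duty machine shows up again. Everything in it is an assumption for illustration only: the function names, the timeout value, and the idea that the application uses a simple keep-alive timeout are hypothetical and not taken from the actual application on this system.

```python
# Hypothetical hot-standby logic; names and timeout are assumptions,
# not taken from the application discussed in this thread.
TIMEOUT = 5  # seconds without a keep-alive before the standby takes over

def next_action(seconds_since_keepalive, promoted, timeout=TIMEOUT):
    """Decide what a standby node does on each check."""
    fresh = seconds_since_keepalive <= timeout
    if not promoted:
        # No recent keep-alive: assume the duty machine is gone.
        return "stay_standby" if fresh else "promote"
    # Already promoted: a fresh keep-alive means a duty machine still exists.
    return "halt" if fresh else "stay_duty"

def simulate(keepalive_ages):
    """Run the decision over successive checks and collect the actions."""
    promoted, actions = False, []
    for age in keepalive_ages:
        action = next_action(age, promoted)
        actions.append(action)
        if action == "promote":
            promoted = True
        if action == "halt":
            break
    return actions

print(simulate([1, 2, 6, 1]))
```

In this sketch a single delayed or dropped keep-alive (the `6`) is enough to trigger the promote-then-halt sequence, even though the network looks healthy before and after.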

Just a couple of thoughts, not sure if they help or not...

Thanx,
whynot3k
Veli Körkkö
Trusted Contributor

Re: system crash

EXE$GL_MEMERRS -> 800044F4 = 00000001

Would this suggest that you had one memory error somewhere, sometime prior to the crash?

at least worth checking.

_veli
Volker Halle
Honored Contributor

Re: system crash

Hi,

please try to report the instruction at PC = 891EB (or 891EA)

$ ANAL/SYS SYS$SYSTEM:SYSDUMP.DMP
SDA> EXA/INS 891EB
SDA> EXA/INS 891EA

CLUE reported the failing instruction at PC=000C09D4, could you please also examine

SDA> EXA/INS C09D4
SDA> EXA/INS C09D3

Volker.
Volker Halle
Honored Contributor

Re: system crash

This is the same machine as reported before. Last boot time from this crash is just a couple of minutes after the previously reported crash time.

Thanks for providing the CLUE file, this allows at least some educated guesses on what might have happened.

Please also provide the data from SDA, as this may help make a decision between a software or hardware problem...

Volker.
Sk Noorul  Hassan
Regular Advisor

Re: system crash

Hi all, thanks for your suggestions.

Volker,
could you please let me know how to get SDA output which you require, so that I can attach that also.

Richard,
you are right, it is the same machine, crashing with two different image names in the halt crashes. Generally, when a system crashes, it leaves some application error log pointing to a probable reason for the crash. But in these two crashes, this machine has not given any application-level reason.

Pls suggest if you need any other log, which I can attach.
Willem Grooters
Honored Contributor

Re: system crash

Just a thought, correct me if I'm wrong, but AFAIK a VAX _requires_ a connected, switched-on terminal as console. I ran into a VAX system that halted due to the fact the VT200 used as a terminal broke down.

Could it be that the console terminal is broken or has a failing connection? Or could it be that it is switched off by the application (due to the crash), and that this in turn crashes VMS?

Willem
Willem Grooters
OpenVMS Developer & System Manager
Ian Miller.
Honored Contributor

Re: system crash

Some VAXes halt if their VT consoles get switched off (especially VAXstations), but others don't mind at all.
____________________
Purely Personal Opinion
Sk Noorul  Hassan
Regular Advisor

Re: system crash

Volker, pls find the instructions as asked by you.

SDA>EXA/INS 891EB
000891EB: XFC
SDA>EXA/INS 891EA
000891EA: NOP
SDA>EXA/INS C09D4
000C09D4 : RET
SDA>EXA/INS C09D3
000C09D4 : HALT


Pls suggest..
Volker Halle
Honored Contributor

Re: system crash

SDA>EXA/INS C09D3
000C09D4 : HALT
^^^^^^^^ this should be 000C09D3, right ?!
SDA>EXA/INS C09D4
000C09D4 : RET

If this is the real HALT-PC, it makes sense. The other 2 instructions could not have halted the system.
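The reasoning can be sketched in a few lines of Python. This is only an illustration: the table is copied from the SDA output posted in this thread, the function name is made up, and treating the reported PC as the updated PC (address of the HALT plus its one-byte length) is an assumption, not something the dump has confirmed.

```python
# Instructions decoded by SDA at the four candidate addresses (from this thread).
decoded = {
    0x891EA: "NOP",
    0x891EB: "XFC",
    0xC09D3: "HALT",
    0xC09D4: "RET",
}

HALT_LENGTH = 1  # VAX HALT is a single-byte opcode (00)

def halt_candidates(decoded):
    """Return (halt_address, updated_pc) pairs for every decoded HALT."""
    return [(addr, addr + HALT_LENGTH)
            for addr, mnemonic in sorted(decoded.items())
            if mnemonic == "HALT"]

for addr, next_pc in halt_candidates(decoded):
    print(f"HALT at {addr:08X}, updated PC would be {next_pc:08X}")
```

Only C09D3 decodes as a HALT, and the address right after it is C09D4, which matches the failing PC that CLUE reported.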

Could you now also please try to examine the instruction stream leading to 000C09D3 and 000891EA ?

Start with SDA> EXA/INS C09D4-10;10
If this provides a valid instruction stream up to address C09D4, please post it. Otherwise try -11;11 or -A;A - VAX instructions are variable length and you need to find the beginning of a valid instruction to be able to decode the whole instruction stream.

Then please do the same with 891EA-10;10 and so on.
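If it helps, the SDA commands for both addresses can be generated rather than typed by hand. This is just a convenience sketch: the helper name is hypothetical, the offsets are the ones suggested above (10, 11 and A, all hexadecimal), and the output still has to be pasted into SDA manually.

```python
def sda_examine_commands(pc, offsets=(0x10, 0x11, 0xA)):
    """Build EXA/INS commands that start `offset` bytes before `pc`
    and decode `offset` bytes, one command per candidate offset."""
    return [f"EXA/INS {pc:X}-{off:X};{off:X}" for off in offsets]

# The two addresses of interest in this thread.
for pc in (0xC09D4, 0x891EA):
    for cmd in sda_examine_commands(pc):
        print("SDA>", cmd)
```

Whichever offset yields a clean instruction stream ending exactly at the target address is the one worth posting.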

Volker.