Re: system crash

 
Sk Noorul  Hassan
Regular Advisor

system crash

One of my VAX machines, a hot standby (all processes remain in HIB mode), went down without any error message in the operator log or application log. It was an application + VMS crash. According to the error log and the CLUE file, it failed to see the duty machine, so it tried to come up as the new duty machine, but when it saw that a duty machine was already there, it crashed itself. However, there was no network problem. I am unable to understand the reason for it. This is the 2nd time this machine has gone down in a similar situation. I am attaching the CLUE file for your reference.

Pls suggest..
Richard White_5
Advisor

Re: system crash

Good Morning Sk...

It would appear that this Halt_Restart is very similar to the crash submitted by Rajarshi Gupta, back on 01-Jul-2005. In fact, in both Clue-Listings, the node name is the same. (TGEV01)

The K-Stk footprint is similar (but not exactly the same) in both of these crashes. My suspicion, based on the same node-name, and your statement-- "This is the 2nd time this machine has gone down in similar situation" is that both you and Rajarshi are trying to troubleshoot and isolate this problem.

If my previous two paragraphs are correct, and this "IS" the same system/VAX-4100A, then we may have to lean towards a hardware failure. I say this because the first Halt that was reported by Rajarshi occurred in the SYSTSG image at approximately PC=7E07 or 7E08 (updated PC reflected?), while your second Halt occurred at PC=891EB or 891EA (again, not sure if the Halt-Restart-Bugcheck displays the Failing-PC or the Updated-PC) in the SYSDSK image.

In other words, I would find it hard to believe that you have two (2) different executable images with the similar code-threads, that execute Halt instructions while in Kernel-Mode. It would make more sense that if the "same" system has crashed more than once, in different code-streams, that it is likely to be an internal IC-Chip failure (ALU/Mux/Shift-Reg) on the Vax Processor module.

But if the crashes are occurring on two systems, then it is likely to be a problem that is common, but independent of the actual system-boxes. For example you mention that this system checks to see if there is a "duty-machine", and if not, then this system tries to become the "new duty machine". If there are multiple systems that each check for "duty-machine" (via a keep-alive-broadcast over the network?) and the network-concentrator/switch/hub does not forward the broadcast/multicast, you may have a network-filtering problem...
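To make that scenario concrete, here is a minimal sketch of the standby logic as I understand it from your description. Everything in it is assumed for illustration: the class and method names, the 3-second timeout, and the rule that a node which finds a second duty machine takes itself down. It is not your application's code.

```python
from enum import Enum


class Role(Enum):
    STANDBY = "standby"
    DUTY = "duty"
    HALTED = "halted"


class StandbyNode:
    """Hypothetical hot-standby node driven by keep-alive messages."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout      # assumed keep-alive timeout in seconds
        self.role = Role.STANDBY
        self.last_seen = 0.0        # time the duty machine was last heard

    def on_keepalive(self, now):
        """A keep-alive from the duty machine arrived at time `now`."""
        self.last_seen = now
        if self.role is Role.DUTY:
            # Two duty machines at once: this node takes itself down.
            self.role = Role.HALTED

    def on_tick(self, now):
        """Periodic check: promote if the duty machine has gone quiet."""
        if self.role is Role.STANDBY and now - self.last_seen > self.timeout:
            self.role = Role.DUTY


node = StandbyNode(timeout=3.0)
node.on_keepalive(now=0.0)
node.on_tick(now=2.0)       # duty heard recently, node stays standby
node.on_tick(now=5.0)       # keep-alives lost or filtered, node promotes
node.on_keepalive(now=6.0)  # duty reappears, node halts itself
print(node.role)
```

If a switch or hub drops only a few keep-alives, a standby built like this would promote itself and then go down as soon as the next keep-alive gets through, which would look like a crash with no real network outage.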

Just a couple of thoughts, not sure if they help or not...

Thanx,
whynot3k
Veli Körkkö
Trusted Contributor

Re: system crash

EXE$GL_MEMERRS -> 800044F4 = 00000001

Would this suggest that you had one memory error somewhere, sometime prior to the crash?

At least worth checking.

_veli
Volker Halle
Honored Contributor

Re: system crash

Hi,

please try to report the instruction at PC = 891EB (or 891EA)

$ ANAL/SYS SYS$SYSTEM:SYSDUMP.DMP
SDA> EXA/INS 891EB
SDA> EXA/INS 891EA

CLUE reported the failing instruction at PC=000C09D4, could you please also examine

SDA> EXA/INS C09d4
SDA> EXA/INS C09D3

Volker.
Volker Halle
Honored Contributor

Re: system crash

This is the same machine as reported before. Last boot time from this crash is just a couple of minutes after the previously reported crash time.

Thanks for providing the CLUE file, this allows at least some educated guesses on what might have happened.

Please also provide the data from SDA, as this may help make a decision between a software or hardware problem...

Volker.
Sk Noorul  Hassan
Regular Advisor

Re: system crash

Hi all, thanks for your suggestions.

Volker,
could you please let me know how to get SDA output which you require, so that I can attach that also.

Richard,
you are right, it is the same machine crashing with two different image names in the halt crashes. Generally, when a system crashes, it leaves some application error log pointing to a probable reason for the crash. But in these two crashes, the machine did not give any application-level reason.

Pls suggest if you need any other log, which I can attach.
Willem Grooters
Honored Contributor

Re: system crash

Just a thought, correct me if I'm wrong, but AFAIK a VAX _requires_ a connected, switched-on terminal as console. I ran into a VAX system that halted due to the fact the VT200 used as a terminal broke down.

Could it be that the console terminal is broken or has a failing connection? Or could it be that the terminal is switched off by the application (due to the crash), and that this is what crashes VMS?

Willem
Willem Grooters
OpenVMS Developer & System Manager
Ian Miller.
Honored Contributor

Re: system crash

some VAXes halt if their VT consoles get switched off (especially VAXstations) but others don't mind at all.
____________________
Purely Personal Opinion
Sk Noorul  Hassan
Regular Advisor

Re: system crash

Volker, pls find the instructions as asked by you.

SDA>EXA/INS 891EB
000891EB: XFC
SDA>EXA/INS 891EA
000891EA: NOP
SDA>EXA/INS C09D4
000C09D4 : RET
SDA>EXA/INS C09D3
000C09D4 : HALT


Pls suggest..
Volker Halle
Honored Contributor

Re: system crash

SDA>EXA/INS C09D3
000C09D4 : HALT
^^^^^^^^ this should be 000C09D3, right ?!
SDA>EXA/INS C09D4
000C09D4 : RET

If this is the real HALT PC, it makes sense. The other 2 instructions could not have halted the system.

Could you now also please try to examine the instruction stream leading to 000C09D3 and 000891EA ?

Start with SDA> EXA/INS C09D4-10;10
If this provides a valid instruction stream up to address C09D4, please post it. Otherwise try -11;11 or -A;A - VAX instructions are variable length and you need to find the beginning of a valid instruction to be able to decode the whole instruction stream.

Then please do the same with 891EA-10;10 and so on.
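If it helps, the list of commands to try can be generated mechanically. This is only a small helper sketch: the function name and the default offsets (10, 11 and A, the ones mentioned above) are my own choices, and it assumes SDA reads the numbers as hex, which is its default radix.

```python
def sda_examine_commands(halt_pc, back_offsets=(0x10, 0x11, 0xA)):
    """Build SDA EXAMINE/INSTRUCTION commands that start a few bytes
    before halt_pc, one command per candidate offset."""
    commands = []
    for offset in back_offsets:
        # Start address is halt_pc minus offset; the length after ';'
        # covers the bytes from there up to halt_pc. Both are in hex.
        commands.append(f"SDA> EXA/INS {halt_pc:X}-{offset:X};{offset:X}")
    return commands


for pc in (0xC09D4, 0x891EA):
    for line in sda_examine_commands(pc):
        print(line)
```

Whichever offset yields a sensible instruction stream ending exactly at the halt address is the one to post.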

Volker.
Doug Phillips
Trusted Contributor

Re: system crash

Willem & Ian:

You've reminded me of a similar situation. The VAX (don't remember what kind, but it wasn't a VAXstation) would just sometimes crash for no reason.

Finally realized that they had a PC as the console, and were using the console to run a data entry application. The Terminal Emulator had F5 mapped to send <BREAK>, and the operator would sometimes accidentally hit the F5 key.

I don't remember if there was also a console command that set the break condition, but I moved that operator's PC off of the console and put in a dedicated console with break disabled. It never crashed like that again.
Volker Halle
Honored Contributor

Re: system crash

re: console receiving <BREAK>

I'm pretty sure that the system will just HALT and display the console prompt >>>
but it will not crash with a HALT restart bugcheck. The HALT restart crash should only happen, if the console detects, that the operating system has issued a HALT instruction in kernel mode (thus halting the CPU) and the HALT console parameter is set to RESTART. The console would then try to restart OpenVMS via the restart entry point.
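Put as a rough decision rule, the distinction looks like this. It is only a sketch of the statement above, not console firmware logic; the function name and the two cause labels are made up for illustration.

```python
def restart_bugcheck_expected(cause, halt_action):
    """True only for a kernel-mode HALT instruction with HALT set to RESTART.

    cause:       "break" (BREAK received on the console line) or
                 "kernel_halt" (HALT instruction executed in kernel mode)
    halt_action: setting of the HALT console parameter, e.g. "RESTART"
    """
    if cause not in ("break", "kernel_halt"):
        raise ValueError(f"unknown cause: {cause!r}")
    return cause == "kernel_halt" and halt_action.upper() == "RESTART"


print(restart_bugcheck_expected("break", "RESTART"))        # console BREAK
print(restart_bugcheck_expected("kernel_halt", "RESTART"))  # HALT in kernel mode
```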

Volker.
Doug Phillips
Trusted Contributor

Re: system crash

Thanks, Volker. It was just an old fuzzy memory from the distant past, and I think it did just halt to the >>>. <:-\
Sk Noorul  Hassan
Regular Advisor

Re: system crash

Thanks for the suggestion.

Volker,
I will get back after trying your suggestions.
Richard W Hunt
Valued Contributor

Re: system crash

Too many years and too many versions of O/S ago, a company I worked for tried something like this. It is important to know, at least in overview, how the standby machine learns that it needs to become the duty machine - and how it learns later that it should NOT have tried to become the duty machine.

Can you perhaps give us a brief overview of the method you are using to trigger the change in system status? As I recall, there is a possible situation in which you could run afoul of one of the Goedel theorems on computability that would prevent this from being a truly reliable process, depending on exactly how you approach it.

Sr. Systems Janitor
Volker Halle
Honored Contributor

Re: system crash

Please see my entry from today on the other thread:

http://forums2.itrc.hp.com/service/forums/questionanswer.do?threadId=929654

It may not be possible to obtain the HALT PC from the dump, but ONLY from the halt message on the console.

Volker.
Volker Halle
Honored Contributor

Re: system crash

The data in the CLUE file (read from the crash) is inconsistent:

For a valid restart crash (see [SYS]POWERFAIL routine EXE$RESTART_ATT), the following input values are expected:

AP - Halt reason code (a value between 3 and 31.)
R10 - HALT PC
R11 - HALT PSL

These values do not match in this crash, which makes a hardware problem even more likely.
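As a rough illustration of the kind of sanity check meant here, the sketch below tests the three inputs. The function name is hypothetical, the 3 to 31 range is taken as decimal (an assumption), and comparing R10 and R11 against a console halt message is my own addition. The values in the example calls are made up and are not from the CLUE file.

```python
def check_restart_inputs(ap, r10, r11, console_halt_pc=None, console_halt_psl=None):
    """Return a list of inconsistencies in the restart-crash inputs.

    ap  : halt reason code, expected in 3..31 (taken as decimal here)
    r10 : HALT PC saved for the restart
    r11 : HALT PSL saved for the restart
    The console_* values are optional readings from the console halt message.
    """
    problems = []
    if not 3 <= ap <= 31:
        problems.append(f"AP={ap} is outside the halt reason code range 3..31")
    if console_halt_pc is not None and r10 != console_halt_pc:
        problems.append(
            f"R10={r10:08X} differs from console HALT PC {console_halt_pc:08X}")
    if console_halt_psl is not None and r11 != console_halt_psl:
        problems.append(
            f"R11={r11:08X} differs from console HALT PSL {console_halt_psl:08X}")
    return problems


# Made-up values for illustration only, not taken from the CLUE file.
consistent = check_restart_inputs(ap=6, r10=0x000C09D3, r11=0x041F0000,
                                  console_halt_pc=0x000C09D3)
inconsistent = check_restart_inputs(ap=0, r10=0x000891EA, r11=0x041F0000,
                                    console_halt_pc=0x000C09D3)
print(consistent)
print(inconsistent)
```

An empty list means the inputs look like a valid restart; anything else points at corrupted state at the time of the halt.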

Volker.