Operating System - OpenVMS


 
dschwarz
Frequent Advisor

OpenVMS Cluster crash

Hi,

We have an OpenVMS cluster: two DS20E systems running OpenVMS 7.3-2, patched to UPDATE-V1100 (not the latest, I know). The systems are connected to an MSA1000 (active/standby).

Yesterday both nodes crashed/rebooted at the same time. No memory dump, no errorlog entries.

Both systems are connected to a UPS.

These were the only systems that crashed at that time, all other systems in that room kept on running.

What can cause such a problem?
What can we do to find out why this happened?

Dieter
labadie_1
Honored Contributor

Re: OpenVMS Cluster crash

Hello

Are you sure the systems are correctly configured to write a dump?

Have you ever had a valid dump from them?

Are you sure the UPS is OK? Maybe it was just a power failure?
marsh_1
Honored Contributor

Re: OpenVMS Cluster crash

Dieter,

Is the system disk on the MSA1000? If so, is there anything in the MSA1000 event log for that time?

dschwarz
Frequent Advisor

Re: OpenVMS Cluster crash

labadie,

dumpstyle = 9
dumpbug = 1
savedump = 0
bugcheckfatal = 0
bugreboot = 1
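
For readers unfamiliar with DUMPSTYLE: it is a bitmask, so the value 9 above is a combination of flags. A minimal sketch of decoding it follows, assuming the bit meanings for bits 0 to 3 on Alpha as I remember them from the system parameter documentation (please verify against the documentation for your OpenVMS version). The table and the function name `decode_dumpstyle` are made up for illustration.

```python
# Assumed meanings of DUMPSTYLE bits 0-3 on Alpha: (meaning if clear, meaning if set).
# Higher bits are deliberately ignored in this sketch.
DUMPSTYLE_BITS = {
    0: ("full dump", "selective dump"),
    1: ("minimal console output", "full console output"),
    2: ("dump to system disk", "dump off system disk (DOSD)"),
    3: ("uncompressed", "compressed"),
}


def decode_dumpstyle(value):
    """Return the assumed meaning of bits 0-3 of a DUMPSTYLE value, lowest bit first."""
    return [names[(value >> bit) & 1] for bit, names in sorted(DUMPSTYLE_BITS.items())]


if __name__ == "__main__":
    for meaning in decode_dumpstyle(9):
        print(meaning)
```

Under these assumptions, 9 = 8 + 1 would mean a compressed, selective dump written to the system disk.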

Yes, we have had a valid dump from earlier this year.
ANA/CRASH SYS$SYSTEM:SYSDUMP.DMP shows
...
Dump taken on 16-MAR-2008 09:56:18.03
...
So it's obvious that nothing was written yesterday.

The UPS is OK; other systems are connected to the same UPS, and they did not crash.

Power failure was our first idea, too. But we have no idea how this can happen to the cluster nodes without affecting any other system connected to the same UPS/power line.

Dieter
dschwarz
Frequent Advisor

Re: OpenVMS Cluster crash

Mark,

We use separate system disks, both on the MSA1000.

MSA1000 shows >120 days uptime.

There are no events reported on the MSA1000.

Dieter
marsh_1
Honored Contributor

Re: OpenVMS Cluster crash

Dieter,

Are they in the same rack? Although they have redundant power supplies, the DS20 has only one AC input. Was someone perhaps working in the rack?
Jan van den Ende
Honored Contributor

Re: OpenVMS Cluster crash

Dieter,

Did you have any direct logging of console output?

We once had a hardware failure in the cabling (squeezed, and thereby a semi-broken connection). That was on AS2100s connected via HSZ50, so the relevance is questionable, but...

And if the connection between system and disks is gone, how can ANYTHING get written to any disk?
On the console there was an error code (IIRC, error 660). Our field engineer was able to diagnose that as a flaky connection.
And yes, the system DID come back online, only to go down again the next day. It was from that second crash that we were able to trace the error.
Maybe applicable in your case, maybe not, but FWIW.

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
marsh_1
Honored Contributor

Re: OpenVMS Cluster crash

Dieter,

Are there any other common components in the fabric? Is there an embedded switch in the MSA or two separate switches? Dual HBAs in the DS20s?
dschwarz
Frequent Advisor

Re: OpenVMS Cluster crash

Mark,

They are in the same rack.
Nobody was working there, that's for sure.

Jan,

We don't have console logging.
We have two embedded switches in the MSA1000 and two HBAs in each DS20.


Volker Halle
Honored Contributor

Re: OpenVMS Cluster crash

Dieter,

The only chance left is to look in ERRLOG.SYS. You'll need DECevent to decode the errlog file, as ANAL/ERR/ELV will probably not be able to translate the errlog entries from a crash.

What's the setting of the console environment variable AUTO_ACTION?

$ WRITE SYS$OUTPUT F$GETENV("AUTO_ACTION")

Volker.