OpenVMS Cluster crash

dschwarz — Wed, 03 Dec 2008 13:05:22 GMT

Hi,

we have an OpenVMS Cluster: two DS20Es running OpenVMS 7.3-2, patched to UPDATE-V1100 (not the latest, I know). The systems are connected to an MSA1000 active/standby.

Yesterday both nodes crashed/rebooted at the same time. No memory dump, no errorlog entries.

Both systems are connected to a UPS.

These were the only systems that crashed at that time, all other systems in that room kept on running.

What can cause such a problem ?
What can we do to find out why this happened ?

Dieter

Re: OpenVMS Cluster crash

labadie_1 — Wed, 03 Dec 2008 13:12:00 GMT

Hello

Are you sure they are correctly configured in order to take a dump ?

Have you already had a valid dump ?

Are you sure the UPS is ok ? Maybe it is just a power failure ?

Re: OpenVMS Cluster crash

marsh_1 — Wed, 03 Dec 2008 13:37:34 GMT

dieter ,

is the system disk on the msa1000 ? if so, is there anything in the event log on the msa1000 for that time ?

Re: OpenVMS Cluster crash

dschwarz — Wed, 03 Dec 2008 13:46:15 GMT

labadie,

dumpstyle = 9
dumpbug = 1
savedump = 0
bugcheckfatal = 0
bugreboot = 1
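
For readers unfamiliar with DUMPSTYLE: it is a bitmask, so the value 9 above combines two settings. The following is a minimal sketch of decoding it. The bit meanings in the table are written from memory of the OpenVMS system management documentation for Alpha and should be treated as assumptions to verify there. The name `decode_dumpstyle` is hypothetical.

```python
# Assumed bit meanings (verify against the OpenVMS documentation):
# each entry maps a bit number to (meaning when 0, meaning when 1).
DUMPSTYLE_BITS = {
    0: ("full dump", "selective dump"),
    1: ("minimal console output", "full console output"),
    2: ("dump to system disk", "dump off system disk (DOSD)"),
    3: ("uncompressed", "compressed"),
}


def decode_dumpstyle(value):
    """Return the meaning of each known DUMPSTYLE bit, lowest bit first."""
    return [
        names[(value >> bit) & 1]
        for bit, names in sorted(DUMPSTYLE_BITS.items())
    ]


if __name__ == "__main__":
    for meaning in decode_dumpstyle(9):
        print(meaning)
```

Under those assumptions, 9 (bits 0 and 3 set) would mean a compressed selective dump written to the system disk.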

Yes, we have had a valid dump from earlier this year.
ANA/CRASH SYS$SYSTEM:SYSDUMP.DMP shows
...
Dump taken on 16-MAR-2008 09:56:18.03
...
So it's obvious that nothing was written yesterday.

UPS is ok, other systems are connected to the same UPS. These systems did not crash.

Power failure was our first idea, too. But we have no idea how this can happen to the cluster nodes without affecting any other system connected to the same UPS/power line.

Dieter

Re: OpenVMS Cluster crash

dschwarz — Wed, 03 Dec 2008 13:49:13 GMT

mark,

we use separate system disks, both on the MSA1000.

MSA1000 shows >120 days uptime.

There are no events reported on the MSA1000.

Dieter

Re: OpenVMS Cluster crash

marsh_1 — Wed, 03 Dec 2008 13:56:01 GMT

dieter,

are they in the same rack ? although they have redundant power supplies, the ds20 has only one ac input. was someone working in the rack ?

Re: OpenVMS Cluster crash

Jan van den Ende — Wed, 03 Dec 2008 14:02:08 GMT

Dieter,

did you have some direct logging of console output?

We once had (AS2100's connected via HSZ50, so relevance questionable, but...) a hardware failure on the cabling (squeezed, and thereby semi-broken connection).

And if the connection between system and disks is gone, how can ANYTHING get written to any disk?
On the console there was an error code (IIRC, error 660). Our field engineer was able to diagnose that as a flaky connection.
And yes, the system DID get back online, only to go down again the next day. It was from that second crash that we were able to trace the error.
Maybe, maybe not applicable in your case, but, fwiw.

Proost.

Have one on me.

jpe

Re: OpenVMS Cluster crash

marsh_1 — Wed, 03 Dec 2008 14:07:09 GMT

dieter,

any other common components in the fabric ? is there an embedded switch in the msa, or are there two separate switches / dual hbas in the ds20's ?

Re: OpenVMS Cluster crash

dschwarz — Wed, 03 Dec 2008 14:19:13 GMT

Mark,

they are in the same rack.
Nobody was working there, that's sure.

Jan,

we don't have console logging.
We have two embedded switches in the MSA1000 and two HBAs in each DS20.

Re: OpenVMS Cluster crash

Volker Halle — Wed, 03 Dec 2008 14:51:15 GMT

Dieter,

the only chance left is to have a look in ERRLOG.SYS. You'll need DECevent to decode the errlog file, as ANAL/ERR/ELV will probably not be able to translate the errlog entries from a crash.

What's the setting of the console environment variable AUTO_ACTION ?

$ WRITE SYS$OUTPUT F$GETENV("AUTO_ACTION")

Volker.

Re: OpenVMS Cluster crash

Volker Halle — Wed, 03 Dec 2008 14:55:10 GMT

Dieter,

$ ANAL/ERR/ELV TRANSLATE/INCL=BUGCHECK/SINCE=...

Does SYS$SYSTEM:SYS$ERRLOG.DMP exist and is it big enough ?

Volker.

Re: OpenVMS Cluster crash

marsh_1 — Wed, 03 Dec 2008 15:03:36 GMT

dieter,

how do you know they crashed and weren't rebooted by someone ?

Re: OpenVMS Cluster crash

Hoff — Wed, 03 Dec 2008 15:31:52 GMT

Regarding the UPS coverage: both systems, the storage controllers, and the network infrastructure required for clustering? (A network glitch or VLAN outage can crash a cluster.)

There's little evidence here of what transpired: no dump, no error logs, and no controller logs... And even if there were...

Set up logging to capture these events as described elsewhere and also capture via the console serial lines.
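
As one possible shape for the console capture, here is a minimal sketch that copies lines from a console byte stream to a log file and prefixes each with a UTC timestamp. It assumes the console serial line is reachable as a file-like byte stream (for example a serial device opened in binary mode, or a socket to a terminal server wrapped with `makefile("rb")`). The function name `log_console` and the demo input are hypothetical, not taken from any console-management product.

```python
import io
from datetime import datetime, timezone


def log_console(stream, logfile, now=None):
    """Copy lines from a binary console stream to logfile with UTC timestamps.

    Returns the number of lines written.
    """
    if now is None:
        now = lambda: datetime.now(timezone.utc)
    count = 0
    for raw in stream:  # a binary file-like object yields lines split on b"\n"
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        logfile.write(f"{now().isoformat()} {text}\n")
        logfile.flush()  # flush per line so the last output survives a power loss
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder input standing in for a real console line.
    demo = io.BytesIO(b"first console line\r\nsecond console line\r\n")
    out = io.StringIO()
    log_console(demo, out)
    print(out.getvalue(), end="")
```

The per-line timestamp is the point: with both nodes logged against the same clock, it becomes possible to see which one went down first and what its console printed last.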

Ensure both boxes are set to RESTART, and not to HALT, nor to BOOT -- this via the SRM console AUTO_ACTION variable.

Patch to current, on both the hosts and the controllers.

I'd be seriously tempted to tie the MSAs into the logging, as well as the UPS.

Move down the racks, and up-rate the monitoring, the UPS coverage, and the related configurations of the other boxes in a similar fashion. Rack-mount boxes have a nasty habit of incurring unintentional and incremental changes, and this can lead to network switches that aren't covered by UPS, storage controllers that aren't, or any number of other subtle "adjustments" to the intended configuration.

Then wait for the next one.

Re: OpenVMS Cluster crash

dschwarz — Wed, 03 Dec 2008 15:38:07 GMT

Mark,

SYS$MANAGER:OPERATOR.LOG would show entries like this:
_BUCL02$OPA0:, BUCL02 shutdown was requested by the operator.
It doesn't

DIAG/SIN.. would show entries like this:
Entry Type 65. Volume Dismount

SWI Minor class 5. Volume dismount
It doesn't

Volker,
AUTO_ACTION is RESTART

Will ANAL/ERR/ELV..... give me more information than DIAG.... does?
I don't think so. DIAG shows a time stamp entry at 23:50 and configuration information at 23:58.

Re: OpenVMS Cluster crash

marsh_1 — Wed, 03 Dec 2008 15:46:10 GMT

dieter,

i know, had to ask though, wouldn't be the first time somebody who shouldn't have was playing around and it was overlooked ... :-)

Re: OpenVMS Cluster crash

Volker Halle — Wed, 03 Dec 2008 16:17:50 GMT

Dieter,

then these 2 OpenVMS systems really did not 'crash'. They may just have 'booted' without a prior crash. And this most likely can only be caused by a hardware event/signal. Look for something common to those 2 machines, which could have affected both machines at the same time.

Only console information would have been able to provide more info, if there was really anything to tell. If you see just an INIT message on a console of a running system, you still have to wonder about the underlying reason.

Another piece of info to check would be the configuration entries logged at boot time. Maybe there are some status bits, which would tell more about the preceding events...

Volker.

Re: OpenVMS Cluster crash

dschwarz — Thu, 04 Dec 2008 07:49:42 GMT

Conclusion:

We will try to capture as much information as possible (console logging, msa1000 logging,...)
and wait for the next time as Hoff wrote.

We haven't seen anything like this in the last 6 years, so there is a chance of surviving the next decade without seeing it again.

Re: OpenVMS Cluster crash

Peter Elliott — Thu, 04 Dec 2008 11:19:04 GMT

Dieter,
Do you have any CLUE listing files in SYS$ERRORLOG: ?
These should get written at system boot time.

Re: OpenVMS Cluster crash

dschwarz — Thu, 04 Dec 2008 11:36:16 GMT

Peter,
there are only some old CLUE$node_date_time.LIS files. They don't help.
CLUE$HISTORY.DAT does not contain any information related to the problem.

Re: OpenVMS Cluster crash

Robert Gezelter — Thu, 04 Dec 2008 12:13:59 GMT

Dieter,

While you are checking things, please carefully check the grounding of the systems, in addition to the power.

Grounding problems can cause all manner of failures, many of them seemingly mysterious. A grounding failure can be something as simple as a corroded connection.

- Bob Gezelter, http://www.rlgsc.com