
Hung after "has joined VMS cluster"

 
Russ Carraro
Regular Advisor

Customer has a mixed-architecture cluster (2 VAX 7730s, 2 AlphaServer 2100s and 6 AlphaServer 1000s) which has been running for years. The 1000s are running VMS 6.2, the rest are 7.1. The 1000s are basically paired together to serve disks from redundant pairs of HSD controllers. When rebooting one of the 1000s, it gets to where it becomes a member of the cluster and then everything seems to hang. After about 10 minutes I do a Ctrl-P and get back to the boot prompt. It takes another 2-3 minutes for the cluster to start responding. The cluster runs 24x7 shop applications.

Any ideas as to why it's taking so long or how to speed up the SCS communications? Thanks.
Karl Rohwedder
Honored Contributor
Solution

Re: Hung after "has joined VMS cluster"

The wait time before a non-responding node is removed from the cluster is derived from the SYSGEN parameter RECNXINTERVAL. When a busy network serves as the cluster interconnect, it should be increased. How is yours set? A long RECNXINTERVAL may explain the 2-3 minutes after you halted the 1000 again, but not the hang during reboot.

regards Kalle
Volker Halle
Honored Contributor

Re: Hung after "has joined VMS cluster"

Russ,

Any messages on the consoles of the other systems? What's 'hanging' on the other nodes?

Try booting with >>> b -fl n,30000 and capture the console output.

Consider forcing a crash when the node is 'hung'.

You first need to determine what's causing the 'hang'; then you can think about how to prevent it.

Volker.
Russ Carraro
Regular Advisor

Re: Hung after "has joined VMS cluster"

The current RECNXINTERVAL is 120 on all systems.
Volker Halle
Honored Contributor

Re: Hung after "has joined VMS cluster"

Russ,

A RECNXINTERVAL of 2 minutes will cause a 2-minute 'hang' on the rest of the cluster if you just CTRL-P HALT the node. Forcing a crash of the node will allow the cluster to continue immediately...
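
The arithmetic behind this can be sketched in a few lines of Python. This is only a rough model, not how the connection manager is implemented: the function name is made up, and treating a crashed node as causing no wait at all is a simplifying assumption. The 120-second value is the RECNXINTERVAL reported earlier in the thread.

```python
def estimated_stall_seconds(recnxinterval_s: int, node_crashed: bool) -> int:
    """Rough model of how long the other members wait for a lost node.

    Assumption: a halted (CTRL-P) node simply goes silent, so the rest of
    the cluster waits the full RECNXINTERVAL before removing it, while a
    crashing node lets the cluster continue right away (modeled as 0 here).
    """
    if recnxinterval_s < 0:
        raise ValueError("RECNXINTERVAL cannot be negative")
    return 0 if node_crashed else recnxinterval_s


recnxinterval = 120  # value reported for all systems in this thread

halt_stall = estimated_stall_seconds(recnxinterval, node_crashed=False)
crash_stall = estimated_stall_seconds(recnxinterval, node_crashed=True)

print(f"halted node : about {halt_stall} s stall")
print(f"crashed node: about {crash_stall} s stall")
```

With RECNXINTERVAL at 120, this model gives a stall of about 2 minutes after a halt, which is in line with the low end of the 2-3 minutes reported in the original post.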

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Hung after "has joined VMS cluster"

"everything seems to hang"

Is that the cluster or the node?

Never tried it, but is expected_votes correct?

Wim
Russ Carraro
Regular Advisor

Re: Hung after "has joined VMS cluster"

It appears to be the whole cluster. There is no response to a carriage return on the nodes where I have an active session, and trying to telnet to one of the nodes just waits.
labadie_1
Honored Contributor

Re: Hung after "has joined VMS cluster"

Next time it (say alpha2) is hung, do sh sys/node=alpha2 from another node, several times, to see if the startup process uses CPU or I/O.

You can use
$ sh clu/cont
add counters/all
add loc_proc_nam
to see if there is traffic between alpha2 and the other nodes.

But first boot with verbosity, as Volker said; you will see where it hangs.

Use sysman if it works
mc sysman set env/node=alpha2
do sh sys
note the pid of the startup process
do pipe wr sys$output "sh proc/id=pid_of_startup" | ana/sys
to see the opened files and the devices marked busy
and also do
do pipe (wr sys$output "set proc/id=pid_of_startup_process" ; wr sys$output "exam @pc" ) | ana/sys
to see if the program counter "moves"
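
The idea of checking whether the program counter "moves" can be sketched as a comparison of successive samples. This is a minimal illustration in Python, not an OpenVMS tool: the function name and the sample addresses are hypothetical, standing in for values copied by hand from repeated "exam @pc" runs.

```python
from typing import Sequence


def pc_is_moving(pc_samples: Sequence[int]) -> bool:
    """Return True if the sampled program counter changed at least once."""
    if len(pc_samples) < 2:
        raise ValueError("need at least two samples to compare")
    return len(set(pc_samples)) > 1


# Hypothetical PC values; real ones would come from the SDA output.
stuck_samples = [0x80012340, 0x80012340, 0x80012340]
busy_samples = [0x80012340, 0x8001A0C8, 0x80012340]

print(pc_is_moving(stuck_samples))  # False
print(pc_is_moving(busy_samples))   # True
```

An unchanged PC across several samples suggests the process is waiting on something; a changing PC only shows that it is executing, which could still be a loop, so the I/O and CPU counters from sh sys remain worth checking as well.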
Hoff
Honored Contributor

Re: Hung after "has joined VMS cluster"

This smells like a hardware or a network problem, though I could easily see a low-level cluster misconfiguration triggering a similar cluster-wide lock-up.

HSD controllers and DSSI disks and mixed software versions? Time to upgrade to more current hardware (dual-host SCSI would be one obvious target), and to consistent versions of software. Based on what I see here in ITRC, you've been chasing outages on the DSSI gear for a couple of years now, and such outages are likely only going to increase as the gear ages. DIGITAL retired DSSI a very long time ago now.

Do get ready to crash this thing again (e.g. dump files sized correctly, crashdump procedures on the console, etc.) and do (in addition to the other suggestions made here) get the AMDS remote management probes installed wherever you can.

And start planning to replace this gear, and do look at establishing an escalation process; a way for you to get help for yourself, when you have a customer-down issue such as this.

And a couple of Integrity servers spanning and an MSA30MI or other such are going to run rings around this configuration, in terms of physical size, power and cooling, disk capacity, general reliability, and raw performance.

Stephen Hoffman
HoffmanLabs LLC
Dean McGorrill
Valued Contributor

Re: Hung after "has joined VMS cluster"

Hi Russ,
well, that's different. Bringing a node into the cluster hangs it for 10+ minutes; halting puts the cluster in a 2-minute transition. As suggested, do a

sho clus/cont
add vot
add quo
add clus
add cir

see if the values make sense (or post here)
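
As a rough guide to what "make sense" means for the votes and quorum columns: OpenVMS derives quorum from EXPECTED_VOTES as (EXPECTED_VOTES + 2) / 2, rounded down. The Python sketch below just reproduces that arithmetic. The vote assignment (one vote on each of ten nodes) and the helper names are hypothetical, since the thread does not state the real values, and the sketch ignores the other inputs the connection manager considers.

```python
def quorum_for(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) // 2."""
    if expected_votes < 1:
        raise ValueError("expected_votes must be at least 1")
    return (expected_votes + 2) // 2


def cluster_has_quorum(member_votes: dict, expected_votes: int) -> bool:
    """True if the votes of the current members reach the derived quorum."""
    return sum(member_votes.values()) >= quorum_for(expected_votes)


# Hypothetical assignment: ten nodes with one vote each.
votes = {f"node{i}": 1 for i in range(1, 11)}
expected_votes = 10

print(quorum_for(expected_votes))                 # 6
print(cluster_has_quorum(votes, expected_votes))  # True

# Only five of the ten hypothetical voters present: below quorum.
five_nodes = dict(list(votes.items())[:5])
print(cluster_has_quorum(five_nodes, expected_votes))  # False
```

If the displayed quorum does not match what the configured expected votes would give, or a node shows an unexpected votes value, that is worth posting.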

RECNXINTERVAL is dynamic and can be lowered (everywhere) if you want to reduce the hang time. On the offending node, try boot -fl n,1 and at SYSBOOT do a sho/clus. See if the expected votes, quorum, NISCS, MSCP etc. are set right; you can set them there before booting. Also set STARTUP_P2 "YES", which will keep the boot verbose. Dean
Colin Butcher
Esteemed Contributor

Re: Hung after "has joined VMS cluster"

Hi,

Sounds as if you'd benefit from using AMDS to monitor the nodes and be able to make changes to Quorum and Votes when it's in the "hung" state. AMDS might just get you the information you need.

Any evidence from the network layers? What are the cluster interconnections - all single rail, or multi-rail? Common DSSI busses?

What's the quorum scheme to avoid a partitioned cluster? If there's a quorum disc around, is it available during the boot sequence?

Assuming that you can't change the machines just yet, then for a speed-up I'd be tempted to use Nemonix fast ethernet / SCSI boards in the VAXes and bring it all up to 100Mbit/sec fast ethernet. You could also look at moving the storage away from DSSI to something a little more modern, such as HSZ70/80 based arrays. It would also probably help to reduce complexity by reducing the number of nodes and disc servers.

However, it's really difficult to provide much other than general guesses without actually seeing it and understanding it.

Cheers, Colin (http://www.xdelta.co.uk).
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Russ Carraro
Regular Advisor

Re: Hung after "has joined VMS cluster"

The server was rebooted before I had a chance to try any of the recommendations. Apparently this time it booted normally. Thanks for all the replies.