Operating System - OpenVMS

Hung after "has joined VMS cluster"

 
Russ Carraro
Regular Advisor

Customer has a mixed-architecture cluster (2 VAX 7730s, 2 AlphaServer 2100s and 6 AlphaServer 1000s) which has been running for years. The 1000s are running VMS 6.2, the rest are on 7.1. The 1000s are basically paired together to serve disks from redundant pairs of HSD controllers. When rebooting one of the 1000s, it gets to the point where it becomes a member of the cluster, and then everything seems to hang. After about 10 minutes I do a Ctrl-P and get back to the boot prompt. It then takes another 2-3 minutes for the cluster to start responding. The cluster runs 24x7 shop applications.

Any ideas as to why it's taking so long or how to speed up the SCS communications? Thanks.
Karl Rohwedder
Honored Contributor
Solution

Re: Hung after "has joined VMS cluster"

The wait time before a non-responding node is kicked out is derived from the SYSGEN parameter RECNXINTERVAL. When busy networks are used as cluster interconnects, it should be increased. How is yours set? A long RECNXINTERVAL may explain the 2-3 minutes after you halted the 1000 again, but not the hang during reboot.

regards Kalle
Volker Halle
Honored Contributor

Re: Hung after "has joined VMS cluster"

Russ,

Any messages on the consoles of the other systems? What's 'hanging' on the other nodes?

Try booting with >>> b -fl n,30000 and capture the console output.

Consider forcing a crash when the node is 'hung'.

You first need to determine what's causing the 'hang'; then you can think about how to prevent it.

Volker.
Russ Carraro
Regular Advisor

Re: Hung after "has joined VMS cluster"

The current RECNXINTERVAL is 120 on all systems.
Volker Halle
Honored Contributor

Re: Hung after "has joined VMS cluster"

Russ,

A RECNXINTERVAL of 2 minutes will cause a 2-minute 'hang' on the rest of the cluster if you just CTRL-P HALT the node. Forcing a crash of the node will allow the cluster to continue immediately...

Volker.
Wim Van den Wyngaert
Honored Contributor

Re: Hung after "has joined VMS cluster"

"everything seems to hang"

Is that the cluster or the node?

I've never tried it, but is EXPECTED_VOTES correct?
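
For reference, a rough sketch of why a wrong EXPECTED_VOTES could stall a whole cluster: OpenVMS derives quorum as (EXPECTED_VOTES + 2) / 2, rounded down, and cluster activity is suspended while the votes of the current members total less than quorum. The Python below only illustrates that arithmetic. The function names and the vote counts (ten nodes with one vote each) are assumptions for illustration, not this cluster's actual settings.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(member_votes: list[int], expected_votes: int) -> bool:
    """True if the votes held by current members reach the derived quorum."""
    return sum(member_votes) >= quorum(expected_votes)


if __name__ == "__main__":
    # Hypothetical cluster: ten members, one vote each (assumed values).
    votes = [1] * 10

    # EXPECTED_VOTES matching the real total: quorum is reachable.
    print(quorum(10), has_quorum(votes, 10))

    # EXPECTED_VOTES set too high on some node: quorum exceeds the votes present.
    print(quorum(20), has_quorum(votes, 20))
```

With these assumed numbers, an EXPECTED_VOTES of 20 implies a quorum of 11, more than the 10 votes present. As I understand it, the connection manager can raise quorum based on the highest EXPECTED_VOTES it sees, which is one way a joining node could appear to hang every member.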

Wim
Russ Carraro
Regular Advisor

Re: Hung after "has joined VMS cluster"

It appears to be the whole cluster. There is no response to a carriage return on the nodes where I have an active session, and trying to telnet to one of the nodes just waits.
labadie_1
Honored Contributor

Re: Hung after "has joined VMS cluster"

Next time it (say alpha2) is hung, do sh sys/node=alpha2 from another node several times, to see if the startup process uses CPU or I/O.

You can use
$ sh clu/cont
add counters/all
add loc_proc_nam
to see if there is traffic between alpha2 and the other nodes.

But first boot with verbosity, like Volker said; you will see where it hangs.

Use sysman if it works
mc sysman set env/node=alpha2
do sh sys
note the pid of the startup process
do pipe wr sys$output "sh proc/id=pid_of_startup" | ana/sys
to see the opened files and the devices marked busy
and also do
do pipe (wr sys$output "set proc/id=pid_of_startup_process" ; wr sys$output "exam @pc" ) | ana/sys
to see if the program counter "moves"
Hoff
Honored Contributor

Re: Hung after "has joined VMS cluster"

This smells like a hardware or a network problem, though I could easily see a low-level cluster misconfiguration triggering a similar cluster-wide lock-up.

HSD controllers and DSSI disks and mixed software versions? Time to upgrade to more current hardware (dual-host SCSI would be one obvious target), and to consistent versions of software. Based on what I see here in ITRC, you've been chasing outages on the DSSI gear for a couple of years now, and such outages are likely only going to increase as the gear ages. DIGITAL retired DSSI a very long time ago now.

Do get ready to crash this thing again (e.g. dump files to size, crashdump procedures on the console, etc.) and do (in addition to what other topics have been mentioned here) get the AMDS remote management probes installed wherever you can.

And start planning to replace this gear, and do look at establishing an escalation process; a way for you to get help for yourself, when you have a customer-down issue such as this.

And a couple of Integrity servers spanning and an MSA30MI or other such are going to run rings around this configuration, in terms of physical size, power and cooling, disk capacity, general reliability, and raw performance.

Stephen Hoffman
HoffmanLabs LLC
Dean McGorrill
Valued Contributor

Re: Hung after "has joined VMS cluster"

Hi Russ,
well, that's different. Bringing a node into the cluster hangs it for 10 minutes+. Halting puts the cluster in a 2-minute transition. As suggested, do a

sho clus/cont
add vot
add quo
add clus
add cir

See if the values make sense (or post them here).

RECNXINTERVAL is dynamic and can be lowered (everywhere) if you want to reduce the hang time. On the offending node, try boot -fl n,1 and at SYSBOOT do a sho/clus. See if the expected votes, quorum, NISCS, MSCP etc. are set right. You can set them there before booting. Also set STARTUP_P2 "YES"; that will keep a verbose boot. Dean