Hung after "has joined VMS cluster"

Russ Carraro · ‎06-19-2007

The customer has a mixed-architecture cluster (2 VAX7730s, 2 AlphaServer 2100s and 6 AlphaServer 1000s) that has been running for years. The 1000s are running VMS 6.2, the rest are on 7.1. The 1000s are basically paired together to serve disks from redundant pairs of HSD controllers. When one of the 1000s is rebooted, it gets to the point where it becomes a member of the cluster and then everything seems to hang. After about 10 minutes I do a Ctrl-P and get back to the boot prompt. It then takes another 2-3 minutes for the cluster to start responding. The cluster runs 24x7 shop applications.

Any ideas as to why it's taking so long or how to speed up the SCS communications? Thanks.

Karl Rohwedder · ‎06-19-2007

The wait time before a non-responding node is kicked out is derived from the SYSGEN parameter RECNXINTERVAL. On busy networks used as cluster interconnects it should be increased. How is yours set? A long RECNXINTERVAL may explain the 2-3 minutes after you halted the 1000 again, but not the hang during reboot.

regards Kalle

Volker Halle · ‎06-19-2007

Russ,

Are there any messages on the consoles of the other systems? What is 'hanging' on the other nodes?

Try booting with >>> b -fl n,30000 and capture the console output.

Consider forcing a crash while the node is 'hung'.

You first need to determine what's causing the 'hang'; then you can think about how to prevent it.

Volker.

Russ Carraro · ‎06-19-2007

The current RECNXINTERVAL is 120 on all systems.

Volker Halle · ‎06-19-2007

Russ,

RECNXINTERVAL of 2 minutes will cause a 2 minute 'hang' on the rest of the cluster, if you just CTRL-P HALT the node. Forcing a crash of the node will allow the cluster to continue immediately...

Volker.
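As a rough illustration of the arithmetic above, here is a minimal Python sketch of how long the surviving nodes might stall after a node is halted rather than crashed. The function name and the `detection_s` parameter (the time the other nodes need to notice the lost connection before the reconnection timer starts) are assumptions made for this example; only the RECNXINTERVAL value of 120 comes from the thread.

```python
def stall_after_halt(recnxinterval_s: int, detection_s: int = 0) -> int:
    """Rough estimate, in seconds, of the cluster-wide stall after a HALT.

    recnxinterval_s: the SYSGEN RECNXINTERVAL value on the surviving nodes.
    detection_s:     hypothetical delay before the lost connection is noticed
                     (an assumption for illustration, not a measured value).
    """
    return detection_s + recnxinterval_s


# RECNXINTERVAL = 120 as reported in the thread; detection delay is a guess.
print(stall_after_halt(120))        # stall with no detection delay assumed
print(stall_after_halt(120, 30))    # stall with a hypothetical 30 s delay
```

With any detection delay between 0 and 60 seconds, the estimate lands in the 2-3 minute range reported in the original post. A forced crash avoids this wait because the cluster does not have to sit out the reconnection interval.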

Wim Van den Wyngaert · ‎06-19-2007

"everything seems to hang"

Is that the cluster or the node?

Never tried it, but is EXPECTED_VOTES correct?

Wim


Russ Carraro · ‎06-19-2007

It appears to be the whole cluster. There is no response to a carriage return on the nodes where I have an active session, and trying to telnet to one of the nodes just waits.

labadie_1 · ‎06-19-2007

Next time it (say alpha2) is hung, run sh sys/node=alpha2 from another node several times to see whether the startup process is using CPU or I/O.

You can use
$ sh clu/cont
add counters/all
add loc_proc_nam
to see if there is traffic between alpha2 and the other nodes.

But first boot with verbosity, as Volker said; you will see where it hangs.

Use sysman if it works
mc sysman set env/node=alpha2
do sh sys
note the pid of the startup process
do pipe wr sys$output "sh proc/id=pid_of_startup" | ana/sys
to see the opened files and the devices marked busy
and also do
do pipe (wr sys$output "set proc/id=pid_of_startup_process" ; wr sys$output "exam @pc" ) | ana/sys
to see if the program counter "moves"

Hoff · ‎06-19-2007

This smells like a hardware or a network problem, though I could easily see a low-level cluster misconfiguration triggering a similar cluster-wide lock-up.

HSD controllers and DSSI disks and mixed software versions? Time to upgrade to more current hardware (dual-host SCSI would be one obvious target), and to consistent versions of software. Based on what I see here in ITRC, you've been chasing outages on the DSSI gear for a couple of years now, and such outages are likely only going to increase as the gear ages. DIGITAL retired DSSI a very long time ago now.

Do get ready to crash this thing again (e.g. dump files sized correctly, crashdump procedures at the console, etc.) and do (in addition to what has been mentioned in the other replies here) get the AMDS remote management probes installed wherever you can.

And start planning to replace this gear, and do look at establishing an escalation process; a way for you to get help for yourself, when you have a customer-down issue such as this.

And a couple of Integrity servers spanning an MSA30MI or other such are going to run rings around this configuration, in terms of physical size, power and cooling, disk capacity, general reliability, and raw performance.

Stephen Hoffman
HoffmanLabs LLC

Dean McGorrill · ‎06-19-2007

hi Russ,
well, that's different. bringing a node into the cluster hangs it for 10 minutes+. halting puts the cluster in a 2 minute transition. as suggested, do a

sho clus/cont
add vot
add quo
add clus
add cir

see if the values make sense (or post here)

RECNXINTERVAL is dynamic and can be lowered (everywhere) if you want to reduce the hang time. on the offending node, try boot -fl n,1; at sysboot do a sho/clus and see if the expected votes, quorum, niscs, mscp etc. are set right. you can set them there before booting. also set STARTUP_P2 to "YES"; that will keep a verbose boot. Dean
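To check whether the votes and quorum values "make sense", it can help to redo the arithmetic by hand. The sketch below uses the standard OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2 with the fraction dropped. The function names and the vote counts in the example are hypothetical (the thread does not say how votes are assigned in this cluster), and the check is simplified to a plain comparison of member votes against quorum.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(member_votes: list[int], expected_votes: int) -> bool:
    """Simplified check: do the votes of the current members reach quorum?"""
    return sum(member_votes) >= quorum(expected_votes)


# Hypothetical example: 10 nodes with 1 vote each, EXPECTED_VOTES = 10.
print(quorum(10))                     # quorum the cluster must maintain
print(has_quorum([1] * 6, 10))        # six voting members present
print(has_quorum([1] * 5, 10))        # five voting members present

# An EXPECTED_VOTES value far above the real vote total raises quorum
# beyond what the members can supply, which stalls activity.
print(has_quorum([1] * 10, 30))
```

If the values shown by sho clus/cont or at sysboot disagree with this arithmetic on one node, that node's EXPECTED_VOTES or VOTES setting is a reasonable first thing to look at.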
