Help on cluster hang

 
BLANQUART
Occasional Contributor

Help on cluster hang

Hi, I have a two-node cluster of ES47s with a fiber interconnect. Yesterday I lost one of the ES47s because of a CPU problem. While waiting for the replacement, I tried to boot one of my DS25s into the cluster using the same root as my ES47. The boot hangs just after the message: %SYSINIT-I- waiting to form or join an OpenVMS Cluster

Any ideas?
12 REPLIES
Thomas Ritter
Respected Contributor

Re: Help on cluster hang

Do you use a quorum disk?
Do you use something like AMDS?

From the console prompt you can do a minimum boot and adjust quorum using SYSGEN parameters.

I think you may need to revise your cluster configuration.
P Muralidhar Kini
Honored Contributor

Re: Help on cluster hang

Hi,

>> i have a cluster of two node ES47 with fiber interconnect.
Your configuration seems to have only two nodes in the cluster and no
quorum disk.

It looks like, with one node down, the cluster does not have quorum to
continue, so the other cluster node is in a hang state.
In that case, the other node comes out of the hang state only once
cluster quorum is met again.

Example:
Node A - Votes=1
Node B - Votes=1

In this case Quorum = 2.
* When both Node A and Node B are up, Votes = 1 + 1 = 2. Quorum (=2) is met.

* If Node A goes down (for whatever reason), only Node B is up and its
votes = 1. Since Quorum is 2, Node B will now be in a hang state because
quorum is not met.
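
The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: it assumes quorum is derived from EXPECTED_VOTES as (EXPECTED_VOTES + 2) / 2, truncated, and it ignores the dynamic adjustments a running cluster makes. The function names are made up for this sketch.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(votes_present: list[int], expected_votes: int) -> bool:
    """True if the votes of the members currently up reach quorum."""
    return sum(votes_present) >= quorum(expected_votes)


# Node A and Node B each have VOTES=1, so EXPECTED_VOTES=2.
print(quorum(2))              # 2
print(has_quorum([1, 1], 2))  # True: both nodes up
print(has_quorum([1], 2))     # False: only Node B up, so it hangs
```

With the same formula, adding a third vote (for example a quorum disk, EXPECTED_VOTES=3) still gives a quorum of 2, which is why a single surviving node plus the quorum disk can keep running.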


>> The boot hang just after the message: %SYSINIT-I- waiting to form or join an
>> OpenVMS Cluster

You need to give more information about your cluster, such as
the values of
VOTES
EXPECTED_VOTES

The cluster quorum is derived from these.

The most likely cause of the node's hang state is that cluster quorum
is not met. However, we need the values of the above parameters in
order to confirm it.


For more details about OpenVMS Cluster configuration, see:
http://labs.hoffmanlabs.com/node/153

Hope this helps.

Regards,
Murali
Let There Be Rock - AC/DC
P Muralidhar Kini
Honored Contributor

Re: Help on cluster hang

Hi,

To get the cluster-wide details, you need to provide the values of the
CL_VOTES, CL_EXP and CL_QUORUM fields.

Command:

$ SHOW CLUSTER/CONT
Command> add CL_VOTES,CL_EXP,CL_QUORUM

Please post the output of the above command.

Regards,
Murali
Let There Be Rock - AC/DC
Bob Blunt
Respected Contributor

Re: Help on cluster hang

Blanquart, is your cluster still running with the one ES47, or have you rebooted things back to "normal" after the CPU was replaced? If you're still running with one node and the DS25 stops at "form or join" when you try to add it, you might try halting the DS25 and rebooting it with flags that trace what the system is doing during the boot. In effect you're putting the primary bootstrap into verbose mode. The first flag value depends on your CURRENT root. If my system boots from SYS5, I'd use the command boot -fl 5,30000, then watch for the system to hang; the last function displayed should be what caused the hang.
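
For anyone unsure how to read a value like 5,30000: a small Python sketch that splits it into the root number and the flag bits that are set. It assumes both fields are hexadecimal. The meanings attached to bits 16 and 17 are from memory of the Alpha boot flag documentation and should be treated as an assumption; check the manual for your version.

```python
# Bit meanings are an assumption (recalled from Alpha boot flag docs), not verified here.
ASSUMED_BIT_MEANINGS = {
    16: "debug messages during boot (assumed)",
    17: "user-oriented messages during boot (assumed)",
}


def parse_boot_flags(value: str) -> tuple[int, list[int]]:
    """Split 'root,flags' (both hex) into the root number and the set bit positions."""
    root_text, flags_text = value.split(",")
    root = int(root_text, 16)
    flags = int(flags_text, 16)
    bits = [bit for bit in range(flags.bit_length()) if (flags >> bit) & 1]
    return root, bits


for text in ("5,30000", "5,20000"):
    root, bits = parse_boot_flags(text)
    print(f"{text}: root SYS{root:X}, bits set: {bits}")
    for bit in bits:
        print(f"  bit {bit}: {ASSUMED_BIT_MEANINGS.get(bit, 'not described in this sketch')}")
```

So 5,30000 selects root SYS5 with bits 16 and 17 set, while 5,20000 sets only bit 17.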

However, if you've already gotten the cluster back into full operation, we can't reach into the ozone and find out what hung your DS25. There could be a hardware configuration conflict between the DS25 and the ES47 that caused the hang. The only way to identify that would be the verbose boot, and even that might not make it totally clear.

bob
Shriniketan Bhagwat
Trusted Contributor

Re: Help on cluster hang

Hi,

A number of things can cause this problem, including the system parameters
VOTES, EXPECTED_VOTES, DISK_QUORUM and so on. To form a VMS cluster, you need to set VOTES and EXPECTED_VOTES appropriately. These two determine whether a system will actually boot
or wait until sufficient quorum is available.

Below is a link to a thread where a similar problem was discussed earlier.

http://forums13.itrc.hp.com/service/forums/questionanswer.do?admit=109447627+1277094513523+28353475&threadId=1282524

See the link below for some other details.

http://www.itec.suny.edu/scsys/vms/vmsdoc/72final/6534/6534pro_009.html


Regards,
Ketan
labadie_1
Honored Contributor

Re: Help on cluster hang

Try to boot with

b -fl 0,20000

This will add a lot of debugging output. Then post the last lines displayed here.
The Brit
Honored Contributor

Re: Help on cluster hang

This doesn't seem like a voting issue; it seems more like a communication issue.

What do you mean when you say you have a "Fiber Interconnect"?

There is no FC interconnect that I am aware of. I assume that it is Ethernet over fibre. Is the interconnect point-to-point? I guess the question really is whether the DS25 can talk to the ES47. (Layer 2 comms?)

Dave
labadie_1
Honored Contributor

Re: Help on cluster hang

Sorry for the typo, I meant

b -fl 5,20000

If you have enabled startup logging, see if you have a file sys$sysdevice:[SYS5.sysexe]startup.log
Steve Reece_3
Trusted Contributor

Re: Help on cluster hang

Rather than being a voting issue, I would think this is either a communication issue (is the DS25 connected to the same network as the ES47?) or else the DS25 is booting from the wrong root directory.

You can test this by the following:

- put a crossover network cable between the ES47 and the DS25;
- log in to the surviving ES47 (assuming it's booted and running) and type SHOW LOGICAL SYS$SYSROOT

If SYS$SYSROOT ends in SYS0. then you'll need to boot the DS25 off a different root (e.g. SYS1) or the lock manager will step in during boot and prevent the second system from booting.

Steve
BLANQUART
Occasional Contributor

Re: Help on cluster hang

Thanks for all your answers.

I resolved the issue by booting another DS25, with a different fiber cable, into the cluster. It appears that a faulty fiber was preventing the host from correctly accessing the quorum disk on the SAN.

Now my ES47 is back and all is OK.

Rgs
P Muralidhar Kini
Honored Contributor

Re: Help on cluster hang

Hi,

Good that you have posted the solution to the problem.
It was in fact more of a communication issue than a voting problem.

>> It appears that a faulty fiber was preventing the host from correctly
>> accessing the quorum disk on the SAN.
I did not know that you also had a quorum disk in your setup.

>> Thanks for all your answers.
See the following link, which explains how you can thank the forum:
http://forums11.itrc.hp.com/service/forums/helptips.do?#28

Regards,
Murali
Let There Be Rock - AC/DC
Jon Pinkley
Honored Contributor

Re: Help on cluster hang

BLANQUART,

"It appears that a faulty fiber was preventing the host from correctly accessing the quorum disk on the SAN."

But wasn't the "root" you were booting from also on the SAN? My guess is that it is more likely that the quorum disk was not presented (by the unspecified SAN disk controller) to the FC HBA in the non-working DS25.
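
To show why an unreachable quorum disk could give the same "waiting to form or join" hang, here is a rough Python sketch. Everything in it is an assumption, since the real values were never posted in this thread: VOTES=1 per node, QDSKVOTES=1, EXPECTED_VOTES=3, quorum derived as (EXPECTED_VOTES + 2) / 2 truncated, and the booting node being the only voter present.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def can_form_cluster(node_votes: int, qdisk_votes: int,
                     qdisk_reachable: bool, expected_votes: int) -> bool:
    """A lone node counts the quorum disk's votes only if it can reach the disk."""
    total = node_votes + (qdisk_votes if qdisk_reachable else 0)
    return total >= quorum(expected_votes)


# Hypothetical values: VOTES=1, QDSKVOTES=1, EXPECTED_VOTES=3 (quorum = 2).
print(can_form_cluster(1, 1, True, 3))   # True: node + quorum disk = 2 votes
print(can_form_cluster(1, 1, False, 3))  # False: 1 vote only, boot waits
```

Under those assumptions the hang looks identical whether the cause is a bad fiber or a disk that was never presented to the HBA, which is why it is worth finding out which one it was.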

It is worth figuring out the cause, so you will know how to get things working if something similar happens in the future.

Jon
it depends