Operating System - OpenVMS

Cluster hang when one node under SYSBOOT >

 
Philippe Bocher
Advisor

Hello,

We've got a small problem I've never seen before. The cluster is made of two DS25s running VMS 7.3-2 with an MSA1000 and all patches applied.

A quorum disk is defined and all votes are OK (EXPECTED_VOTES 3, VOTES 1, QDSKVOTES 1).

When we shut down both nodes, rebooting one of them to SYSBOOT (b -fl x,1) prevents the other from booting (it hangs just after starting CPU #1).

Continuing the boot of the first member (c at SYSBOOT>) makes everything work fine.

We have no problem if we do this (put the other one at SYSBOOT>) while one node is up... any ideas?

Thx
Volker Halle
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

Philippe,

if the booting node hangs, just force a crash: press HALT, then enter >>> crash

Once everything is up again, you can look at the dump and try to figure out why the node does not continue.

Volker.
Jan van den Ende
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

Philippe,

To me, that sounds just like desired behaviour.

(I am now assuming that you boot from the same system disk).

In the early phases of the bootstrap, the system does a physical access of the system disk. It knows where to find the boot block and, from there, where to find the xVMSSYS.PAR file. After reading that file, the system waits at the SYSBOOT> prompt.
At that moment you are accessing the disk but have NOT mounted it. And you still have the ability to modify system parameters (including VAXCLUSTER) before loading and using them (that is the PURPOSE of SYSBOOT).
So the system also does not yet know about the quorum disk.
Now, if you try to boot the second system, after starting the CPUs it tries to access the system disk. It is a good thing that this is not allowed: it would be two systems accessing the same disk without coordination, an all too easy way to generate corruption.

Soon after you give Continue, the disk is formally mounted as System Disk, and the system "knows" it is to be a cluster, and what the Quorum Disk is. That is also 'accessed', and now it is a valid cluster, in which another booting node is allowed access to the system disk. So, the second node can continue, as you have experienced.

In the case of one node being shut down and rebooting, it directly encounters the situation where the other node and the QD are a valid config, so, it can simply continue.

Hth,

Proost.

Have one on me.

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Mobeen_1
Esteemed Contributor

Re: Cluster hang when one node under SYSBOOT >

Philippe,
I agree with Jan. If your system is at SYSBOOT, I guess the system does not yet know it's a cluster and the system disk is not formally mounted, as a result of which the other node will not boot up.

Jan, thanks a lot for your explanation... I am really becoming your fan :) You have managed to explain it very well...

regards
Mobeen
Philippe Bocher
Advisor

Re: Cluster hang when one node under SYSBOOT >


>(I am now assuming that you boot from the same system disk).

And you're right ;-)

>In the early phases of the bootstrap the system does a physical access of the system disk. It knows where to find the bootblock, and from there, where to find the xVMSSYS.PAR file. After reading that is when the SYSBOOT> wait function is.
>At that moment you are accessing the disk, but have NOT mounted it. And you still have the ability to modify system params (including VAXCLUSTER) before loading them and using them (that is the PURPOSE of SYSBOOT).
>So, the system also does not yet know about the quorum disk.

Yes



>Now, if you try to boot the second system, after starting the CPU's you are trying to access the system disk. It is a good thing that that is not allowed, it would be two systems uncoordinatedly accessing the same disk, an all too easy way to generate corruption.

Yes, but... when you're at SYSBOOT>, the only things you can do are modify system parameters (yes, including VAXCLUSTER, VOTES...) or change the system startup file (SET /STARTUP). So I was thinking (but I may be wrong) that, in that case, the other node would form the cluster (with all the required votes, including the quorum disk vote) and then decide whether it can "allow" the first node to join the cluster?
Uwe Zessin
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

Agreed, that looks strange to me, too. I would pick up Volker's suggestion to take a crash dump from the system that is hanging during boot. I am sure Volker will happily help diagnosing this (seriously).
Kris Clippeleyr
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

Philippe,

Fortunately, we have a similar configuration here (the only difference is the VMS version, V7.3-1 rather than V7.3-2). And since we're in the process of configuring the cluster (no applications or users yet), I took the opportunity to test it. And guess what: with one node at the SYSBOOT> prompt, the other node boots normally and happily forms a cluster with the assistance of the quorum disk.

So Jan,

>To me, that sounds just like desired behaviour.


I think that proves you wrong (no pun intended).

So, IMHO I think either they (=engineering) have added a feature to bootstrapping and/or clustering, or something else is wrong with your configuration. As already advised, you might want to take a crash dump on the "hanging" node if and when this happens.

Regards,
Kris (aka Qkcl)
I'm gonna hit the highway like a battering ram on a silver-black phantom bike...
Jan van den Ende
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >



>in that case, the other node will form the cluster (with all the required votes and quorum disk vote) and then decide if it can "allow" the first node to join the cluster ?


The point is that _BEFORE_ the second node can reach the point where it can form a cluster, it will have to get _ITS_ parameters from disk. And it is not allowed to access that disk... a chicken-and-egg situation.

That is why I concluded that those systems boot from the same disk.

If you have a config with two (or more) system disks, then, starting from cluster-is-down, if you boot one node into SYSBOOT from disk A, you CAN boot another system from another disk, and THAT one can then form the cluster.
Configs like that include multi-architecture clusters, but also mixed-version clusters (e.g. during a rolling upgrade).

Proost.

Have one on me.

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Volker Halle
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

Gentlemen,

there's nothing wrong with the idea of having multiple systems at the SYSBOOT prompt just accessing (reading and/or writing) their system parameter files (using the boot driver) on the same disk. It should not cause any problems/conflicts with other nodes booting or running from that same system disk.

There's NOTHING in OpenVMS that PREVENTS you from booting two systems with VAXCLUSTER=0 from the same disk.

Philippe, before forcing a crash, you can also use the VERBOSE mode boot flags to obtain maximum information during boot. Consider capturing the console output to a file (using a terminal emulator or console management application):

>>> b -fl x,30001

This will make it easier to find out which operations have completed and which have not once the system hangs.
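For readers wondering how 30001 differs from the 1 used in the original post, the flag value is a hexadecimal bit mask. Here is a minimal Python sketch that decodes it, assuming only the three bits relevant to this thread (conversational boot plus the two verbose-message bits). The table is deliberately incomplete, the descriptions are paraphrased, and the function name is made up for illustration.

```python
# Boot flag bits relevant to this thread (OpenVMS Alpha, hexadecimal).
# Assumption: only these three bits are listed; any other bit is
# reported as "unknown" rather than guessed at.
BOOT_FLAG_BITS = {
    0x00001: "conversational boot (stop at SYSBOOT>)",
    0x10000: "detailed debug messages during boot",
    0x20000: "user-oriented boot messages",
}


def decode_boot_flags(flags_hex):
    """Return a description for every bit set in a hex boot-flags string."""
    value = int(flags_hex, 16)
    names = []
    bit = 1
    while bit <= value:
        if value & bit:
            names.append(BOOT_FLAG_BITS.get(bit, "unknown bit {:#x}".format(bit)))
        bit <<= 1
    return names


if __name__ == "__main__":
    # "1" is the flag value from the original post, "30001" the verbose variant.
    for flags in ("1", "30001"):
        print(flags, "->", decode_boot_flags(flags))
```

So 30001 still stops at SYSBOOT> like the original boot command; it only adds the extra console messages.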

Volker.
Volker Halle
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

Philippe,

with VERBOSE boot messages turned on, it should be quite easy to find out where the system is hanging. Just make a note of the last message printed before the system hangs, then enter CONT at the SYSBOOT> prompt of the other node and capture the next message issued by the 'hung' (and now continuing) node.

This 'next' message should relate to the 'blocked' resource/access.

Volker.
Anton van Ruitenbeek
Trusted Contributor

Re: Cluster hang when one node under SYSBOOT >

Philippe,

It looks as if the other node has EXPECTED_VOTES set too high or no quorum disk configured. You mentioned you checked it, but do it again for both nodes.
As posted before, it should be possible to have more than one system at SYSBOOT> without them holding each other up. My guess would be to check VOTES, EXPECTED_VOTES and VAXCLUSTER again (SHOW/CLUSTER).
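As a quick sanity check of the vote arithmetic, here is a minimal Python sketch. It assumes the usual rule that quorum is (EXPECTED_VOTES + 2) / 2 rounded down, and it ignores how a running cluster adjusts these values dynamically. The function names are made up for illustration.

```python
def quorum(expected_votes):
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(expected_votes, member_votes, qdskvotes=0):
    """True if the votes present reach quorum (simplified static check).

    member_votes is a list with the VOTES value of each node that is up;
    qdskvotes is the quorum disk's contribution, 0 if it is not available.
    """
    return sum(member_votes) + qdskvotes >= quorum(expected_votes)


if __name__ == "__main__":
    # Values from the original post: EXPECTED_VOTES 3, VOTES 1, QDSKVOTES 1.
    print("quorum:", quorum(3))
    print("one node + quorum disk:", has_quorum(3, [1], qdskvotes=1))
    print("one node, no quorum disk:", has_quorum(3, [1], qdskvotes=0))
```

With the values Philippe posted, a single node plus the quorum disk reaches quorum, so under this simplified model the votes alone should not stop the second node from forming a cluster.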

AvR
NL: Meten is weten, maar je moet weten hoe te meten! - UK: Measuring is knowing, but you need to know how to measure!
Ian Miller.
Honored Contributor

Re: Cluster hang when one node under SYSBOOT >

If the cluster is running at present, then:
$ MCR SYSMAN
SYSMAN> SET ENV/CLUS
SYSMAN> PARAM SHOW/CLUSTER

and compare the results displayed for each node, in particular EXPECTED_VOTES, DISK_QUORUM and QDSKVOTES.
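If eyeballing the display is error-prone, the comparison can be done mechanically once the values are copied out. This is a rough Python sketch: the node names, the device name and the mismatching value are hypothetical, and the values are typed in by hand rather than parsed from SYSMAN output.

```python
def find_mismatches(params_by_node, names):
    """Return {parameter: {node: value}} for parameters that differ between nodes."""
    mismatches = {}
    for name in names:
        values = {node: params.get(name) for node, params in params_by_node.items()}
        if len(set(values.values())) > 1:
            mismatches[name] = values
    return mismatches


if __name__ == "__main__":
    # Hypothetical values copied by hand from the parameter display.
    # NODEA, NODEB and $1$DGA10 are made-up names; the EXPECTED_VOTES
    # difference is invented to show what a mismatch looks like.
    observed = {
        "NODEA": {"EXPECTED_VOTES": 3, "QDSKVOTES": 1, "DISK_QUORUM": "$1$DGA10"},
        "NODEB": {"EXPECTED_VOTES": 5, "QDSKVOTES": 1, "DISK_QUORUM": "$1$DGA10"},
    }
    to_check = ["EXPECTED_VOTES", "QDSKVOTES", "DISK_QUORUM"]
    for name, values in find_mismatches(observed, to_check).items():
        print(name, "differs:", values)
```

Only parameters that should be identical on every node belong in the list to check; VOTES, for instance, may legitimately differ from node to node.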
____________________
Purely Personal Opinion
Philippe Bocher
Advisor

Re: Cluster hang when one node under SYSBOOT >

We're in the process of configuring the cluster too, so we can test all those things.
I've asked the operator to force a crash, and I will try a "verbose" boot (very good idea, why wasn't it mine ;-)

It is not a quorum problem. I've been working with VMS clusters for too long; it was the first thing I checked, and the hang is not at the point where the cluster is formed (I've checked many, many times with many, many clusters). It's right at the beginning of the boot sequence (starting CPU #1).

Thanks to you all. Stay tuned, please ;-)

Thx
Philippe Bocher
Advisor

Re: Cluster hang when one node under SYSBOOT >

Everything works fine now!!! I reproduced the problem three times before calling you, and this time it works... I'll investigate to see whether a patch was missing.

Thanks for your help and patience