
Clustering question

 
Willem Grooters
Honored Contributor

Clustering question

I have a problem setting up a cluster using shared SCSI as the storage interconnect and a 100 Mb LAN for cluster communication.
Node A was set up from the beginning as a cluster member, booting from [SYS0] over shared SCSI.
Node B is to be added to the cluster. When booted locally and running CLUSTER_CONFIG_LAN on Node B, it will join the cluster.
To boot it from the common system disk, I ran CLUSTER_CONFIG_LAN on Node A, adding Node B. [SYS1] is then created (this has been verified), and next Node A requests Node B to boot (from [SYS1]). Node B comes up and tries to contact Node A to request formation of the cluster. Node A confirms, but for some reason Node B never accepts it. At one point it does signal that it has contacted Node A, but forming the cluster never happens.
When booted from its local system disk, after CLUSTER_AUTHORIZE.DAT has been copied to the system, it boots, but Node A is not contacted and the system starts up as a single-node cluster.
The disks on the shared SCSI that are mounted by Node A are seen as "remote mount" on Node B, with multiple paths. So far, so good - as long as the disks are not mounted locally. If they are, I get "disk offline" and "mount verification" messages on both systems.

The controller might be an issue - I have installed KZPBA-CY in both machines, but I tried KZPSA as well, with the same outcome.
Node A has two NICs, but one is disabled after the system starts since it is not (yet) connected ($ NCL DISABLE CSMA-CD STATION CSMACD-1). Node B has one NIC.

I may have missed something - but what?
Willem Grooters
OpenVMS Developer & System Manager
9 REPLIES
John Gillings
Honored Contributor

Re: Clustering question

Willem,

If it's shared SCSI, make sure you've got Port Allocation Classes enabled, and the ports for the shared bus with matching allocation classes on both systems.

The disks on the shared bus MUST have the same names on both systems. With PAC enabled, all SCSI drives will be named $a$DKAnnn where only "a" varies between busses. (of course, your applications only access disks via logical names, so any change in the physical name of a drive is easily dealt with, right?... ;-)
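The allocation-class entries for the shared bus live in SYS$DEVICES.DAT on each system. A minimal sketch of what that could look like, assuming the shared bus sits on port PKB of both machines and the node names are NODEA and NODEB (all placeholder names):

```
! SYS$SYSTEM:SYS$DEVICES.DAT - same port allocation class on both
! nodes, so disks on the shared bus get identical $116$DKBnnn names
[Port NODEA$PKB]
Allocation Class = 116

[Port NODEB$PKB]
Allocation Class = 116
```

CLUSTER_CONFIG_LAN can maintain these entries for you, but checking the file by hand is a quick way to confirm both nodes agree.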
A crucible of informative mistakes
Willem Grooters
Honored Contributor

Re: Clustering question

John,

I know - and this was set up some time ago already. All disks accessed by Node A (except for the page and swap files) are on the shared SCSI.
Willem Grooters
OpenVMS Developer & System Manager
Martin Hughes
Regular Advisor

Re: Clustering question

Sounds like it might be a quorum problem. You say that node-b can join the cluster when booting from a local system disk, but not from SYS1 on the shared system disk. Are the quorum settings the same for both builds?
For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2
Andy Bustamante
Honored Contributor

Re: Clustering question


Check the SYSGEN parameters VOTES and EXPECTED_VOTES. A system will not join a cluster if its EXPECTED_VOTES setting would cause quorum to be lost.
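To inspect and change them, something along these lines (a sketch; the MODPARAMS values shown are illustrative for a small two-node configuration):

```
$ MCR SYSGEN
SYSGEN> SHOW VOTES
SYSGEN> SHOW EXPECTED_VOTES
SYSGEN> EXIT
$ ! Permanent changes go in SYS$SYSTEM:MODPARAMS.DAT, e.g.
$ !     VOTES = 1
$ !     EXPECTED_VOTES = 2
$ ! followed by a run of AUTOGEN:
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT
```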

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Robert_Boyd
Respected Contributor

Re: Clustering question

Also, are you booting with flags set to 20000 or 10000 so that you can track where in the process things are actually hanging up? Sometimes there are extra clues to be seen in the details of the startup processing.
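On the console that would look something like this (a sketch; DKA100 stands in for whatever the shared system disk is called):

```
>>> boot -flags 1,20000 dka100
```

The first field of -flags is the system root number (1 for [SYS1]); the second is a hex bit mask, where 10000 and 20000 turn on increasingly verbose messages during startup.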

I agree completely with the port allocation classes needing to be set up properly. When I work with them, I usually set up a cluster common file that has all of the settings for every node in the cluster in one SYS$DEVICES.DAT file and propagate that to every SYS$COMMON:[SYSEXE] in the cluster.
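Propagating that common file can be as simple as a copy per system disk (a sketch; CLUSTER$MASTER and DISK$ALPHASYS are placeholder logical names for the master location and a member's system disk):

```
$ ! push the master copy to each system disk in the cluster
$ COPY CLUSTER$MASTER:SYS$DEVICES.DAT -
       DISK$ALPHASYS:[VMS$COMMON.SYSEXE]SYS$DEVICES.DAT
```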

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Willem Grooters
Honored Contributor

Re: Clustering question

I got it booted yesterday, after I added a quorum disk. I reran CLUSTER_CONFIG_LAN, Node B came up, AUTOGEN was run and the node rebooted. That's where the story ended: I ran into some severe problems on Node B's console (which is, BTW, the latest possible). The attached file contains Node B's console log when booted with "-flags 1,20000".
Next, it won't boot anymore - see the same file (second part, as yesterday, after ">>>init").

It _might_ be that the SCSI card (KZPBA-CY) doesn't support shared SCSI, though the documentation isn't clear at all. Some sources state KZPSA (I could try that, but I don't have enough cards for all intended members), some state KZPBA-CB - whereas I use the KZPBA-CY, said by a colleague to be feasible for the job - his machines work with it as well - though he admitted he wasn't sure.

A file containing the output of Node A (the session running CLUSTER_CONFIG_LAN, and an "operator terminal") and Node B's console is attached.
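For reference, the quorum disk I added is declared via SYSGEN parameters. A sketch of the MODPARAMS.DAT lines on each member, assuming $116$DKA200 as the quorum disk (a placeholder name):

```
! SYS$SYSTEM:MODPARAMS.DAT
DISK_QUORUM = "$116$DKA200"
QDSKVOTES = 1
! EXPECTED_VOTES must account for the quorum disk's votes
EXPECTED_VOTES = 3
```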
Willem Grooters
OpenVMS Developer & System Manager
Hoff
Honored Contributor

Re: Clustering question

Is this the same cluster that has mismatched device names for the same physical devices? (There's another ITRC thread going on with an HSZ50 RAID Array 450 configuration.)

(Two different names for the same physical disk is bad. Very bad. This could conceivably lead to disk data corruptions.)

The %x910 errors are File Not Found errors. That's not usually a good sign for a system disk.

I also see a mix of $2$ and $116$. Is this the same bus?

There's no gentle way to ask this: how current is your most recent system disk BACKUP?

Should the node start up as a single node cluster, then it is either the only node with votes (and all other nodes should be wedged), or the configuration is problematic and the settings for VOTES and EXPECTED_VOTES might well be incorrect.
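The arithmetic behind that: a cluster computes quorum as (EXPECTED_VOTES + 2) / 2, using integer division. For example:

```
$ ! quorum = (EXPECTED_VOTES + 2) / 2, integer division
$ ! EXPECTED_VOTES = 2 -> quorum = 2: a lone node with one
$ !   vote hangs waiting for the rest of the cluster
$ ! EXPECTED_VOTES = 1 -> quorum = 1: the same node happily
$ !   boots standalone - the "single-node cluster" symptom
```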

I encountered the results of booting into a partitioned cluster -- where two nodes booted from the same system root on the same disk, and whoever had set it up had not set VOTES and EXPECTED_VOTES correctly -- and the disk corruptions were quite impressive. Each node thought it had quorum, could not reach the other node due to the duplicate network address, and both proceeded to write to the disks sans any and all coordination. The owner (and perpetrator of the incorrect settings) told me he ended up reloading the system disk.

And as for the hardware and based on a quick look, it looks like the KZPBA-CY controller will probably work for the configuration. I don't see immediate evidence that it's officially supported, however, nor do I know that it will work.

Willem Grooters
Honored Contributor

Re: Clustering question

The same indeed - and the naming clash was indeed "DKA100", which was accepted on Node A although no such disk really existed there. However, $ SHOW DEV DKA100 was accepted and showed $116$DKA100. So "DKA100" was added without keeping in mind that DKA100 is a physical disk on Node B.
$116$DKA100 and $2$DKA100 are different disks: the first is the one on the shared SCSI (116 comes straight from the book), the other is the local disk on Node B. I think the system disk is not corrupted, the backup is very recent (just a few days old), and DKA100 - well, I don't really care: it's a newly installed and hardly modified 8.3 system that can easily be rebuilt from scratch.
I will dig into preventing a partitioned cluster some day; it is not a major issue at the moment. Nor is it a problem to reboot all machines if required.
It's quite possible that Node A will address some disks on the shared SCSI and Node B others - but the ability to 'take over' in an emergency is wanted. It doesn't have to be automated now - some manual work is no problem.
So at this moment: no issue ;-)
Willem Grooters
OpenVMS Developer & System Manager
Willem Grooters
Honored Contributor

Re: Clustering question

I have been able to add Node B to the cluster - completely, following the same script as before - after having deleted it first. I made some minor changes to Node A's config, but finally I got it up. It's still not running, though: boot fails - see the attached file.
Willem Grooters
OpenVMS Developer & System Manager