Operating System - OpenVMS

Reconfiguring the Heart-beat connection on cluster

 
SOLVED
Mahmoud_1
Frequent Advisor

Dear All,
I have an existing cluster configured to use the LAN for the cluster heart-beat. Each system has two network cards, one for the public network and the other for the heart-beat.
I am experiencing lost connections between the cluster nodes, and sometimes one of the nodes reboots. When I tested the connections, I found that one of the network cards is used for both the network connections and the heart-beat.
How can I force the cluster to use the other card for the cluster ping or heart-beat?
Regards
5 REPLIES
Volker Halle
Honored Contributor
Solution

Re: Reconfiguring the Heart-beat connection on cluster

Mahmoud,

by default, OpenVMS uses all available LAN adapters to establish SCS channels (point-to-point network connections) to the other cluster members. The so-called 'heart-beat' (the correct term is 'cluster hello message') is sent out about once every 3 seconds on all LAN adapters that are enabled for cluster traffic (SCS protocol). The default is all LAN or FDDI adapters.

The SCS virtual circuit (VC) between each pair of nodes in the cluster will use the preferred channel or can even (depending on VMS version) be multiplexed over the available and working channels.

You can influence the priority of the LAN adapters and channels via SCACP. In general, you should have more than one channel active to prevent CLUEXIT crashes if network communication on the only remaining channel is disrupted for more than RECNXINTERVAL seconds.

'Lost connection' is reported, if a node does NOT receive a cluster hello multicast message from any other member of the cluster within about 9 seconds. 'connection re-established' is reported, once the next hello message is received from that node before the timeout of RECNXINTERVAL has expired. The node would be removed from the cluster, if RECNXINTERVAL seconds have passed before receiving the next hello message from that node.
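
The timing described above can be sketched as a small model. This is only an illustration of the thresholds quoted in this post (about 9 seconds before 'lost connection' is reported, then RECNXINTERVAL before removal), not OpenVMS code. The function name, the state labels, and the RECNXINTERVAL value of 20 used in the example are assumptions. The sketch also assumes the RECNXINTERVAL timer starts when the connection is reported lost, which the post does not state explicitly.

```python
# Approximate silence (in seconds) after which 'lost connection' is
# reported, as quoted in the post. Not an exact OpenVMS constant.
HELLO_TIMEOUT_SECONDS = 9


def connection_state(seconds_since_last_hello, recnxinterval):
    """Classify a remote node by how long it has been silent.

    Simplified model (assumption): the RECNXINTERVAL timer starts
    once the connection has been reported lost.
    """
    if seconds_since_last_hello < HELLO_TIMEOUT_SECONDS:
        return "connected"
    if seconds_since_last_hello < HELLO_TIMEOUT_SECONDS + recnxinterval:
        return "lost connection (can still be re-established)"
    return "removed from cluster"


# Hypothetical example with RECNXINTERVAL = 20 seconds.
for silence in (3, 12, 40):
    print(silence, connection_state(silence, recnxinterval=20))
```

In this model, a node that stays silent long enough to reach the last state is the one that later takes the CLUEXIT crash if it manages to re-establish communication.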

What kind of crashes did you see (check with $ TYPE CLUE$HISTORY) ? CLUEXIT crashes would be a typical symptom of lost network connectivity, but can also indicate periods of high IPL activity on the nodes.

Use MC SCACP SHOW CHANNEL to find out which LAN interfaces are used to form the SCS channels between the nodes.

Volker.
Mahmoud_1
Frequent Advisor

Re: Reconfiguring the Heart-beat connection on cluster

Dear Volker,
Thanks for the valuable message.
All crashes on the system were caused by CLUEXIT, so if you have any suggestions, please let me know.
Note: My cluster consists of three nodes, each with three network cards.
When I disconnect the interconnect between the servers all is, but if I disconnect the LAN the servers crash.
Regards
Mahmoud
Volker Halle
Honored Contributor

Re: Reconfiguring the Heart-beat connection on cluster

Mahmoud,

3 nodes, 3 NW cards each (or 2 as you wrote in your initial problem description ?). Are all the NW cards connected to a switch/hub and do they really work (speed settings) ? What is the 'interconnect', that you can disconnect and 'all is ???' ?

A cluster node will crash with a CLUEXIT crash if it loses cluster communication to any other member of the cluster for more than RECNXINTERVAL seconds (system parameter, should be set to the SAME value on all cluster nodes) and then - for whatever reason - succeeds in re-establishing cluster communications. The other nodes will have timed out and removed that node from the cluster, and the only way for it to get back into the cluster is via a CLUEXIT crash followed by a reboot.

I would suggest that you first check your LAN adapters for any errors:

$ MC LANCP SHOW DEV/COUNT

Then check which adapters are enabled for SCS communication:

$ MC SCACP SHOW LAN

Then check the SCS channels:

$ MC SCACP SHOW CHANNEL

Depending on the actual physical network and the adapters used by SCS, you should see a channel between each local LAN adapter and all the other LAN adapters on the remaining 2 nodes. You will also see error counts and the time of the last error.
Use SET TERM/WIDTH=132, because this information is displayed in a wide screen format.
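
As a rough cross-check of the SHOW CHANNEL output, the number of channels one node should list can be estimated from the rule above. This is a minimal sketch with a hypothetical helper name. It assumes every adapter is enabled for SCS and that all adapters can reach each other on the same LAN. With separate, unconnected networks the real count will be lower.

```python
def expected_channels(local_adapters, remote_nodes, adapters_per_remote_node):
    """Channels seen from one node: each local adapter paired with
    every adapter on every remote node (assumes full LAN connectivity)."""
    return local_adapters * remote_nodes * adapters_per_remote_node


# Three-node cluster, so two remote nodes as seen from any one node.
print(expected_channels(3, 2, 3))  # three adapters per node
print(expected_channels(2, 2, 2))  # two adapters per node
```

If the display shows fewer channels than this estimate, some adapters are either not enabled for SCS or cannot reach each other.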

Consider attaching the output in a .TXT file, so we can have a look.

Volker.
Andy Bustamante
Honored Contributor

Re: Reconfiguring the Heart-beat connection on cluster


Mahmoud,

Please check the speed and duplex settings on the network connections.

$ MCR LANCP
LANCP> SHOW DEV/CHAR

You can use SET (active configuration) and DEFINE (permanent configuration) to make changes to the network interface.

As Volker states, using 2 switches or hubs prevents having a single point of failure for your cluster. Depending on the application, an inexpensive hub can be serviceable for your cluster traffic.

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Richard Brodie_1
Honored Contributor

Re: Reconfiguring the Heart-beat connection on cluster

A description of your network layout would be useful. If you have three cards in each node, does that mean you have two point-to-point connections for a heartbeat? In one post you said two, in another three.

If you have connected them directly to each other, did you remember to use crossover cables?