
Network disruptions cause reboot

 


Hi,
Our OpenVMS 7.3-1 cluster includes two DS10 alphas with Compaq TCPIP 5.3. Recently our institution performed upgrades on some of their network switches over a period of several days. Our two alphas logged hundreds of network error messages during several different time periods. These messages included "carrier check failure" and "unavailable user buffer". Then, occasionally there might be cluster errors like "timed-out operation to quorum disk" and "lost connection to quorum disk". There were three or four instances where the DS10 actually crashed and rebooted in the midst of these disruptions. Our Windows servers logged minor disruptions of a few seconds and then continued without problems.
Does anyone know why this is occurring and is there any way to avoid these problems in the future?
Thanks,
Pat G.
9 REPLIES
Andy Bustamante
Honored Contributor
Solution

Re: Network disruptions cause reboot

You may have two issues here.

I'll assume your DS10s are using the network for cluster traffic. If the network is down for longer than the SYSGEN parameter RECNXINTERVAL (a value in seconds), one of the nodes will crash and reboot. One simple solution is to add a network interface and connect the two nodes with a crossover cable. OpenVMS will automatically use this for cluster traffic. Don't forget to configure both interfaces for speed and duplex.

The "timed-out operation to quorum disk" is curious. Is one of the DS10s a disk server? Or did your networking staff also make changes to a SAN the DS10s are connected to?

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Robert_Boyd
Respected Contributor

Re: Network disruptions cause reboot

Carrier check failures are often related to changes in the line characteristics at one end of a connection that don't match the settings at the other end, for instance if one end is set for full duplex and the other for half. It's quite possible that the switch upgrades reset parameters to their defaults and the permanent device settings were reloaded afterwards, returning operation to normal.

If you read up on how cluster communications work, you'll find that there is a steady flow of traffic to maintain quorum and synchronization at various levels. When the networking hardware was disrupted, this created interruptions in the synchronization of the cluster. The software can be fairly resilient in these situations, but it requires careful tuning to match the characteristics of the communications setup.

Crashes during these incidents are most often a voluntary CLUEXIT situation, where the reconnection interval has passed without any word from one or more other cluster members. In these situations a node will voluntarily crash to prevent a partitioned cluster from trashing storage resources.

You can tune how long an interruption can last before this happens by adjusting RECNXINTERVAL to a value high enough to ride through "normal" interruptions for your site. If you're expecting longer interruptions than usual, you can dynamically increase the value to help out and then reduce it after the work is done.
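
As a rough sketch of the dynamic adjustment described above: this assumes a suitably privileged account and that RECNXINTERVAL can be changed on the running system on your version. The value 120 is only an illustration, and the same value should be set on every cluster member.

```dcl
$! Raise RECNXINTERVAL on the running system before planned network work.
$! At the SYSGEN prompt: USE ACTIVE selects the in-memory parameters,
$! SHOW displays the current value (note it so you can restore it later),
$! SET changes it, and WRITE ACTIVE applies it to the running system only.
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 120
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
```

WRITE ACTIVE does not change the parameter file used at the next boot, so the setting reverts on reboot. Repeating the same steps with the value noted from SHOW restores the original setting once the work is done.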

I have seen times where work on networking hardware triggered some vulnerability in hardware and/or drivers on VMS, leading to a crash that had nothing to do with cluster communication being lost. The most common situation I've seen is where a shielded twisted pair cable was used instead of the preferred unshielded variety.

You might want to read the various manuals that discuss the configuration considerations in setting up a VMS Cluster. They'll help you understand the tradeoffs more fully.

Robert
Master you were right about 1 thing -- the negotiations were SHORT!

Re: Network disruptions cause reboot

Our quorum disk is in a StorageWorks shelf with our other disks. The shelf uses a SCSI Expander Box to enable sharing between the two DS10s. The quorum disk errors may have appeared AFTER one of the nodes crashed.

Anyway, the suggestions about increasing RECNXINTERVAL and cross-connecting the two nodes both sound worthwhile. We currently have RECNXINTERVAL at 20 seconds (the default) on both nodes, so I guess we could raise that to one hour or so during periods when disruptions are anticipated.

Also, each DS10 has dual-port network interfaces and we have enabled only one on each. How would I use the second port for cross-connecting the two machines? If it is not too complicated and/or risky we might try that instead.

Thanks for all the help. - Pat G.
Andy Bustamante
Honored Contributor

Re: Network disruptions cause reboot


Assuming EIB0 is the unused interface, you need to connect a crossover network cable and configure the speed/duplex on each node.

mc lancp set device EIB0 /speed=100/full_duplex
mc lancp define device EIB0 /speed=100/full_duplex

Replace EIB0 with your interface. Use "mc lancp show device" to display interfaces. OpenVMS will see the interface and automatically use it for cluster traffic.
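
One possible way to check the result afterwards, offered as a sketch rather than a required step: it assumes EIB0 is the interface name and that the SCACP utility is present on your OpenVMS version.

```dcl
$! Show the link settings LANCP reports for the interface (EIB0 is an example name).
$ MCR LANCP SHOW DEVICE EIB0 /CHARACTERISTICS
$! List the cluster channels; a channel over the crossover link to the
$! other node should be listed once the cable is connected.
$ MCR SCACP SHOW CHANNEL
```

If the second command shows channels to the other node on both the original interface and the crossover interface, the cluster has a second path available.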

I'd try increasing RECNXINTERVAL to something on the order of 90 - 120 seconds first. One hour is a long time for two nodes to operate on shared storage without coordination. If you have an unused interface available, you've got a better solution.

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Robert_Boyd
Respected Contributor

Re: Network disruptions cause reboot

The worst side effect of a very high RECNXINTERVAL setting is the length of time the system will hang waiting to see if anyone is going to show up. There isn't any risk of corruption from using a high value. If you read the manuals on clustering and system parameter adjustment you will see a full discussion of the effects of setting this parameter higher or lower.

60-180 seconds normally is plenty for stormy network situations. 60-90 is probably better, but it depends on how long it takes your network gear to settle down if a switch or router is rebooted.
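
If a value in that range is to be kept permanently, one common approach is to record it in MODPARAMS.DAT and let AUTOGEN write it to the parameter file. The following is a minimal sketch; the value 90 is purely illustrative and the procedure would be repeated on each node.

```dcl
$! Add this line to SYS$SYSTEM:MODPARAMS.DAT on each node
$! (90 is an illustrative value, in seconds):
$!     RECNXINTERVAL = 90
$! Then run AUTOGEN through the SETPARAMS phase, without feedback data,
$! so the new value is written to the parameter file used at the next boot:
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK
```

AUTOGEN stops at SETPARAMS here, so it does not reboot the system. The running system keeps its current value until the next boot unless it is also changed with SYSGEN.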

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Wim Van den Wyngaert
Honored Contributor

Re: Network disruptions cause reboot

We set our RECNXINTERVAL to 900 on development cluster stations to avoid reboots. Network changes usually take less than 15 minutes (but more than 5), or else they cause multiple shorter interruptions, which do not cause problems.

Fwiw

Wim
Jan van den Ende
Honored Contributor

Re: Network disruptions cause reboot

Pat,

On the other hand,
_IF_ you have redundant interconnects (especially when they are of a different nature, like one 'real' net and one crossover, or an FDDI or SCSI or ...) then you can LOWER your RECNX to shorten the freeze periods if a node crashes for any reason.
We have never yet had our Ethernet and FDDI disrupted at the same time, but we DID have nodes crashing (hardware and software reasons). RECNX at 20 caused some nasty application complications, but since we set it to 5 those have so far been prevented.

YMMV

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Richard Brodie_1
Honored Contributor

Re: Network disruptions cause reboot

"Also, each DS10 has dual-port network interfaces and we have enabled only one on each. How would I use the second port for cross-connecting the two machines?"

If the machines are only a few metres apart then just buy a crossover Ethernet cable, and plug it in between them. The cluster software will dynamically discover the best path.

There are more sophisticated ways to set up redundant network connections but sometimes simplest is best.

Re: Network disruptions cause reboot

Since this seems to have been a one-time occurrence, I have let it slide. If the problem recurs we will try the crossover Ethernet cable suggested above. Thanks to all who contributed ideas. - Pat G.