Network disruptions cause reboot
01-03-2006 03:46 AM
Our OpenVMS 7.3-1 cluster includes two DS10 Alphas with Compaq TCPIP 5.3. Recently our institution performed upgrades on some of their network switches over a period of several days. Our two Alphas logged hundreds of network error messages during several different time periods. These messages included "carrier check failure" and "unavailable user buffer". Occasionally there were also cluster errors such as "timed-out operation to quorum disk" and "lost connection to quorum disk". There were three or four instances where a DS10 actually crashed and rebooted in the midst of these disruptions. Our Windows servers logged minor disruptions of a few seconds and then continued without problems.
Does anyone know why this is occurring and is there any way to avoid these problems in the future?
Thanks,
Pat G.
01-03-2006 04:02 AM
Solution
I'll assume your DS10s are using the network for cluster traffic. If the network is down for longer than the number of seconds set in the sysgen parameter RECNXINTERVAL, one of the nodes will crash and reboot. One simple solution is to add a network interface and use a crossover cable. OpenVMS will automatically use this for cluster traffic. Don't forget to configure both interfaces for speed and duplex.
The "timed-out operation to quorum disk" is curious. Is one of the DS10s a disk server? Or did your networking staff also make changes to a SAN the DS10s are connected to?
Andy
01-03-2006 04:07 AM
Re: Network disruptions cause reboot
If you read up on how cluster communications work, you'll find that there is a steady flow of traffic to maintain quorum and synchronization at various levels. When the networking hardware was disrupted, this created interruptions in the synchronization of the cluster. The software can be fairly resilient in these situations, but it requires careful tuning to match the characteristics of the communications setup.
Crashes during these incidents are most often a voluntary CLUEXIT, where the reconnection interval has passed without any word from one or more other cluster members. In these situations a node will voluntarily crash to prevent a partitioned cluster from trashing storage resources.
You can tune how long an interruption can last before this happens by adjusting RECNXINTERVAL to a value high enough to ride through "normal" interruptions for your site. If you're expecting longer interruptions than usual, you can dynamically increase the value to help out and then reduce it after the work is done.
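As a rough sketch of the dynamic change described above, something along these lines should work from a privileged account. The value 90 is only an illustrative assumption, not a recommendation for any particular site, and the commands need to be repeated on each cluster node because active parameters are set per node.

```
$ ! Sketch only: 90 is an assumed example value, pick one that suits your network
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SET RECNXINTERVAL 90
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
$ ! Repeat with the original value once the network work is finished
```

A change made with WRITE ACTIVE does not survive a reboot. To keep a new value permanently, the usual route is to add it to SYS$SYSTEM:MODPARAMS.DAT and run AUTOGEN.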
I have seen times where work on networking hardware triggered some vulnerability in hardware and/or drivers on VMS, leading to a crash that had nothing to do with cluster communication being lost. The most common situation I've seen is where a shielded twisted-pair cable was used instead of the preferred unshielded variety.
You might want to read the various manuals that discuss the configuration considerations in setting up a VMS Cluster. They'll help you understand the tradeoffs more fully.
Robert
01-03-2006 06:38 AM
Re: Network disruptions cause reboot
Anyway, the suggestions about increasing RECNXINTERVAL and cross-connecting the two nodes both sound worthwhile. We currently have RECNXINTERVAL at 20 seconds (the default) on both nodes, so I guess we could up that to one hour or so during periods when disruptions are anticipated.
Also, each DS10 has dual-port network interfaces and we have enabled only one on each. How would I use the second port for cross-connecting the two machines? If it is not too complicated and/or risky we might try that instead.
Thanks for all the help. - Pat G.
01-03-2006 06:58 AM
Re: Network disruptions cause reboot
Assuming EIB0 is the unused interface, you need to connect a crossover network cable and configure the speed/duplex on each node:
mc lancp set device EIB0 /speed=100/full_duplex
mc lancp define device EIB0 /speed=100/full_duplex
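Replace EIB0 with your interface. Use "mc lancp show device" to display interfaces. The set command changes the running system, and the define command updates the permanent device database so the setting survives a reboot. OpenVMS will see the interface and automatically use it for cluster traffic.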
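A rough sketch of how one might check afterwards that the crossover link is actually in use. It assumes EIB0 is the second port, as above, and that the SCACP utility is available on your OpenVMS version; both are assumptions, so adjust to your configuration.

```
$ ! Confirm the speed and duplex now in effect on the port (EIB0 is assumed)
$ mc lancp show device EIB0 /characteristics
$ ! List the LAN devices used for cluster traffic, then the channels between the nodes
$ mc scacp show lan_device
$ mc scacp show channel
```

If the new port does not show up in the SCACP output, recheck the cable and the speed/duplex settings on both nodes before relying on it.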
I'd try increasing RECNXINTERVAL to something on the order of 90 - 120 seconds first. One hour is a long time for two nodes to operate on shared storage without coordination. If you have an unused interface available, you've got a better solution.
Andy
01-03-2006 10:42 AM
Re: Network disruptions cause reboot
60-180 seconds normally is plenty for stormy network situations. 60-90 is probably better, but it depends on how long it takes your network gear to settle down if a switch or router is rebooted.
Robert
01-03-2006 06:34 PM
Re: Network disruptions cause reboot
Fwiw
Wim
01-03-2006 07:16 PM
Re: Network disruptions cause reboot
On the other hand,
_IF_ you have redundant interconnects (especially when they are of a different nature, like one 'real' net and one crossover, or an FDDI or SCSI or ...) then you can LOWER your RECNXINTERVAL to shorten the freeze periods if a node crashes for any reason.
We have never yet had our Ethernet and FDDI disrupted at the same time, but we DID have nodes crashing (for hardware and software reasons). RECNXINTERVAL at 20 caused some nasty application complications, but since we set it to 5 those have so far been prevented.
YMMV
Proost.
Have one on me.
jpe
01-10-2006 02:02 AM
Re: Network disruptions cause reboot
If the machines are only a few metres apart then just buy a crossover Ethernet cable, and plug it in between them. The cluster software will dynamically discover the best path.
There are more sophisticated ways to set up redundant network connections but sometimes simplest is best.
04-20-2006 02:37 AM