Cluster reconnection interval (RECNXINTERVAL)
10-04-2006 08:38 PM
We will implement a DWDM network to replace the existing FDDI. Could anyone suggest the best value for RECNXINTERVAL (the current setting is 60), given that network failover can be trimmed down to less than 2 seconds with the DWDM?
Also, we are using remote volume shadowing, so can SHADOW_MBR_TMO be trimmed down as well, and if so, what is the best value?
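For reference, this is how we check the values currently in effect; a minimal interactive SYSGEN sketch (nothing site-specific assumed):
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE            ! look at the values currently in use
SYSGEN> SHOW RECNXINTERVAL
SYSGEN> SHOW SHADOW_MBR_TMO
SYSGEN> EXIT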
Many thanks.
Solved! Go to Solution.
10-04-2006 09:24 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
http://www2.openvms.org/kparris/
Purely Personal Opinion
10-08-2006 06:15 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
If increased, cluster reconfigurations will take longer.
I would not immediately reduce it until you have seen the stability of the new connection. If the cluster timeouts are not bothering you now, don't mess with it.
Even in a CI cluster, the default is 10.
10-08-2006 06:19 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
We have RECNXINTERVAL at 180 and SHADOW_MBR_TMO at 240. We use host-based shadowing and have about 70 disks mounted.
10-13-2006 09:57 AM
Solution
So you would need to measure the actual length of traffic disruption as you trigger various failure events (such as rebooting a switch or disconnecting a link to the DWDM).
To be able to accurately measure the duration of communications disruptions, I'd enable LAVC$FAILURE_ANALYSIS if it's not already in place (see my article in the VTJ Volume 2 at http://h71000.www7.hp.com/openvms/journal/v2/articles/lavc.html). That will generate OPCOM messages both when any piece of the LAN configuration you're using as a cluster interconnect fails, and again when it starts working again. With this, you'll be able to get accurate timestamps of how long a disruption appears to last from the VMS cluster's viewpoint.
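Roughly, putting it in place looks like the sketch below; the template ships in SYS$EXAMPLES, but treat the exact file names and build steps as approximate and check the article above for the details:
$ COPY SYS$EXAMPLES:LAVC$FAILURE_ANALYSIS.MAR []
$ ! Edit the network description in LAVC$FAILURE_ANALYSIS.MAR so it matches
$ ! your nodes, LAN adapters and inter-site links
$ MACRO LAVC$FAILURE_ANALYSIS
$ LINK LAVC$FAILURE_ANALYSIS
$ RUN LAVC$FAILURE_ANALYSIS   ! needs suitable privileges; usually run from startup on each node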
Once you know how long a disruption various real failures generate in practice, it's a simple matter to choose a value for RECNXINTERVAL that is larger than the longest of those periods.
The recommendation that SHADOW_MBR_TMO be at least 10 seconds larger than RECNXINTERVAL included the underlying assumption that a VMS node at the remote site is MSCP-serving the disks, so you don't want to throw a disk out of the shadowset before you have made a decision about whether to throw out the VMS node serving that disk. If you have Fibre Channel linked between sites and either don't use MSCP serving (or, better yet, have it enabled but only used as a backup path), then your choice of SHADOW_MBR_TMO would be independent of RECNXINTERVAL, and more dependent on the duration of a potential outage on the SAN rather than the LAN.
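As a sketch of that relationship in MODPARAMS.DAT terms (the numbers here are purely illustrative, and they assume the remote shadowset members are MSCP-served over the inter-site LAN):
! SYS$SYSTEM:MODPARAMS.DAT (illustrative values only)
RECNXINTERVAL = 30     ! longer than the longest measured LAN disruption
SHADOW_MBR_TMO = 40    ! at least 10 seconds more than RECNXINTERVAL when
                       ! the remote shadowset members are MSCP-served
Then run AUTOGEN to apply them, e.g.:
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK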
> Even in a CI cluster, the default is 10.
The default value for RECNXINTERVAL is 20 seconds.
The recommendation of 180 seconds for RECNXINTERVAL that one often sees in disaster-tolerant VMS clusters was originally based on the time required to reboot a GIGAswitch/FDDI. (In real life, after firmware updates and the introduction of newer line cards, the actual need grew to 210 seconds, the time required to reboot a 4-port FDDI line card.)
With dual (completely-independent, not connected together, so that both won't undergo a spanning-tree reconfiguration at the same time) inter-site LAN links, and dual LAN adapters in each VMS node, it is possible to run a disaster-tolerant cluster at the default RECNXINTERVAL value of 20 seconds. Some even run at 10 seconds.
There is some additional detail in my user-group presentation entitled "OpenVMS Connection Manager and the Quorum Scheme" at http://www2.openvms.org/, as well as more detail in the older ones entitled "VMS Cluster State Transitions in Action" and "Understanding VAXcluster State Transitions" at http://www.geocities.com/keithparris/
10-13-2006 08:13 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
When using a SAN-based cluster, the systems can lose connectivity with each other while the SAN disk maintains connectivity. This is all too common. In a short network bump, this can cause a system that holds the quorum disk to decide "I am the cluster", and ALL the other nodes will CLUEXIT.
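If you want to see what the connection manager observes during such a bump, something like the following can help; a rough sketch, and the report classes available may vary by version:
$ SHOW CLUSTER/CONTINUOUS/INTERVAL=2
Command> ADD CIRCUITS       ! per-port virtual circuits and their state
Command> ADD CONNECTIONS    ! SCS connections, including the connection manager's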
10-17-2006 10:28 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
Which system will CLUEXIT depends on a number of factors. Of course it comes down to the voting: after RECNXINTERVAL expires, the cluster reconfigures with whatever member(s) are visible to each other, and that set may or may not have quorum. If a set of systems has quorum, then sure, when the "lesser" removed members reconnect they will voluntarily leave the cluster. Defined behaviour.
If no one has quorum, then processing halts until it is regained. In the case of a quorum disk, if both halves still have access, the continual updating of the quorum disk file will prevent its votes from being counted. If one of the systems stops re-attempting configuration (and updating the SAN-based file), then the remaining member will validate that quorum disk file, take its votes, and complete the reconfiguration without the other member. No matter what happens, the removed system can only rejoin as a new member, not in its "removed" state.
Complex, yes.
This, of course, is the point at which you would halt one of the systems. Just remember that when a system has a vote it effectively has the same right to be a member of, or to be, the cluster as any other member; other factors (in some cases race conditions, in others ID precedence, possibly access to quorum devices) then determine how a reconfiguration is ultimately resolved. The process is multi-layered.
There are rare situations in which non-voting nodes can prevent voting systems from properly reconfiguring, but this only happens when multiple interconnects are involved with complex permutations of failure.
Overall the advice is good. Yes, it's a trade-off, as most things are, but good practice is to understand what you need from the service the VMS systems provide, draw out on paper the possible interconnect failures and how you'd expect (or want) your systems to recover in each situation, and then set the parameters and voting accordingly.
It is also wise to factor in private interconnects that avoid switches, and to remember that if a network card fails as a cluster interconnect it may also have failed as the node's access to the outside world, so decisions about what should survive which type of failure can be complex.
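To make the voting arithmetic concrete, here is a small illustrative two-site example in MODPARAMS.DAT style (node names, device name and vote counts are invented; the standard formula is quorum = (EXPECTED_VOTES + 2) / 2, rounded down):
! Site A: nodes ALPHA1, ALPHA2 with 1 vote each
! Site B: nodes BETA1,  BETA2  with 1 vote each
! Without a quorum disk: EXPECTED_VOTES = 4, quorum = (4 + 2) / 2 = 3,
! so a clean inter-site split leaves each half with 2 votes and BOTH halves hang.
! With a quorum disk worth 1 vote: EXPECTED_VOTES = 5, quorum = (5 + 2) / 2 = 3,
! and the half that can still validate the quorum disk file has 2 + 1 = 3 votes
! and continues, while the other half hangs (and can only rejoin as new members).
VOTES = 1
EXPECTED_VOTES = 5
QDSKVOTES = 1
DISK_QUORUM = "$1$DGA100"   ! hypothetical quorum disk device name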
10-18-2006 03:06 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
In our configuration, it is a 4-member cluster connecting to RA8000 disks with remote volume shadowing across two sites. There is no quorum disk. According to our network team, all of the network components have resilience, and any component failover would take milliseconds, or at most 2 seconds. Therefore, I just want to check whether RECNXINTERVAL can be trimmed down from 60 to 15 seconds on the cluster members in order to enhance system availability.
Many thanks.
10-18-2006 05:53 PM
Re: Cluster reconnection interval (RECNXINTERVAL)
Consider the situations in which RECNXINTERVAL is actually used:
Only if one of the systems abruptly halts without sending the 'last gasp' message do the other nodes have to wait RECNXINTERVAL seconds before removing that node from the cluster.
If your network recovers or fails over to alternate paths within 2 seconds, the connection manager probably won't even notice.
If a node crashes or shuts down, a 'last gasp' message is sent, causing the node to be removed immediately.
Note that there could be extreme cases in OpenVMS, like deleting an extremely large lock/resource tree at elevated IPL, where the system may not be able to send/receive SCS hello messages for some time. You wouldn't want that to cause CLUEXIT crashes (I have seen this with V7.2-1, 30000 locks and RECNXINTERVAL=20).
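If you do decide to lower it, RECNXINTERVAL is a dynamic parameter, so it can be tried on a running node and made permanent later through MODPARAMS.DAT and AUTOGEN. A minimal sketch, using the value of 15 you mentioned:
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET RECNXINTERVAL 15
SYSGEN> WRITE ACTIVE          ! takes effect immediately on this node only
SYSGEN> EXIT
$ ! Add RECNXINTERVAL = 15 to SYS$SYSTEM:MODPARAMS.DAT so AUTOGEN keeps it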
Volker.