Cluster Time out Question.
12-01-2005 02:21 AM
This means (as you are aware) that if a node exits, the other nodes in the cluster will hang for 3 minutes (360 seconds).
We are looking at reducing this value.
We have a reliable network. A link between sites should never be down for more than 60 seconds.
Our cluster consists of Three nodes.
Two at one site (Production) and a third at the DR site.
This is a DTCS cluster.
What values do other people run with?
Does anyone have an opinion on lowering these values?
12-01-2005 03:17 AM
Re: Cluster Time out Question.
RECNX is at 120 seconds.
If you have shadowing, make sure that shadow_*_TMO is higher than the recnx value.
Wim
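The rule of thumb in the reply above (shadowing timeouts higher than RECNXINTERVAL) can be sketched as a quick sanity check. This is only an illustration: the helper name is hypothetical, the parameter names stand in for the `shadow_*_TMO` wildcard, and the values are made up; on a real system they would come from SYSGEN.

```python
def shadow_timeouts_too_low(recnxinterval, shadow_timeouts):
    """Return the names of shadow timeouts that are NOT higher than RECNXINTERVAL."""
    return sorted(
        name for name, seconds in shadow_timeouts.items()
        if seconds <= recnxinterval
    )


# Hypothetical values in seconds, for illustration only.
example_params = {"SHADOW_MBR_TMO": 120, "SHADOW_SYS_TMO": 240}
print(shadow_timeouts_too_low(120, example_params))
```

A timeout equal to RECNXINTERVAL is flagged as well, since the advice is for it to be strictly higher.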
12-01-2005 06:25 AM
Re: Cluster Time out Question.
Oh my!
Where ARE you living!
Any place I know about, 3 minutes only lasts 180 seconds! :-)
We also have FDDI between our two sites (7 KM apart)
To reduce cluster reconfig time, we have RECNXINTERVAL at 5 seconds.
We also have 100 Mb Ethernet.
"Networks" wants to upgrade to Gb, ... by REPLACING the FDDI.
But the (Cisco) Eth is configured using "Spanning Tree", with failover times of up to 45 secs...
That is why we are fighting to keep the FDDI, if only as a fallback interconnect during tree re-builds.
So yes, you can shorten your Cluster Timeout period, as long as you make VERY sure it is longer than the potential network connection interrupt. Redundant (& preferably very different) network connections are helpful.
hth
Proost.
Have one on me.
jpe
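One way to picture why redundant interconnects help: the cluster only loses contact while every path is down at the same moment. The sketch below is a simplified model, not anything from VMS itself; the function name, the path labels, and the outage windows (in seconds on an arbitrary clock) are all hypothetical, and each path's own windows are assumed not to overlap one another.

```python
def longest_total_outage(outages_by_path):
    """Length in seconds of the longest window during which ALL paths are down.

    outages_by_path maps a path name to a list of (start, end) outage windows.
    """
    paths = list(outages_by_path.values())
    if not paths:
        return 0
    common = paths[0]
    for windows in paths[1:]:
        # Keep only the parts of each window that overlap an outage on this path too.
        common = [
            (max(s1, s2), min(e1, e2))
            for s1, e1 in common
            for s2, e2 in windows
            if max(s1, s2) < min(e1, e2)
        ]
    return max((end - start for start, end in common), default=0)


# A 45-second spanning-tree rebuild on Ethernet while the FDDI stays up:
print(longest_total_outage({"ethernet": [(100, 145)], "fddi": [(300, 302)]}))
# The same rebuild overlapping a (hypothetical) FDDI outage:
print(longest_total_outage({"ethernet": [(100, 145)], "fddi": [(140, 150)]}))
```

RECNXINTERVAL would then need to exceed the longest simultaneous outage rather than the longest outage on any single path, which is the argument for keeping a second, different interconnect.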
12-01-2005 09:14 PM
Re: Cluster Time out Question.
Jan: Old fashioned spanning tree is so last year: hold out for RSTP at the very least.
12-01-2005 10:38 PM
Re: Cluster Time out Question.
Richard,
the 'last gasp datagram' is sent from the departing node if it crashes (or shuts down), so the other nodes in the cluster won't have to wait RECNXINTERVAL before timing out the departed node. This does not help if the network connection breaks.
The 'long' (=RECNXINTERVAL) hang should ONLY be seen if a node is HALTed without a crash or shutdown, is just powered down, or the network connection really breaks. If you see a long state transition during normal shutdown, then something is wrong and needs to be diagnosed.
Volker.
12-02-2005 02:27 AM
Re: Cluster Time out Question.
The multiple NICs give us redundant connectivity for Cluster Communications. We normally have 1 NIC for IP, 1 NIC for DECnet, and 1 NIC for the backup network.
In our environments, all 4 nodes are at the same site but in 3 different rooms. We have redundant network paths.
12-05-2005 06:51 PM
Re: Cluster Time out Question.
Sorry about the bit about 360 seconds = 3 minutes. I was entering data in imperial rather than metric ;-) ....
We have the timeout set to 180 seconds = 3 minutes. The high timeout is a throwback to some earlier network kit we used to run.
I'm told the longest we should be out is now less than 60 seconds. Having read some of the feedback, I think we should look at going for 90 seconds (half of what we have).
Regards
Kevin
12-12-2005 11:52 AM
In my case, I was able to run with lower RECNXINTERVAL values because I had two completely separate and independent extended LANs connecting the systems at the two sites, so that a Spanning Tree reconfiguration, which seemed to typically take about 35-40 seconds in those days (which is greater than the default 20-second value for RECNXINTERVAL), would be unlikely to affect both LANs at once.
In the old days, there was a recommendation of 180 seconds for RECNXINTERVAL in disaster-tolerant clusters of the day because that was the time required for a GIGAswitch/FDDI (or one of its line cards) to reboot, as it would have to do after a firmware upgrade. (But even that figure became out-of-date, as with the latest firmware revisions, that time in practice actually increased to 210 seconds for the 4-port FDDI line card to reboot after a firmware upgrade.) Perhaps your figure came from a conservative doubling of this old recommendation after the old figure of 180 seconds proved insufficient at some point in the past.
As another poster pointed out, in addition to the original IEEE 802.1d Spanning Tree Protocol there is now the Rapid Spanning Tree Protocol, IEEE 802.1w, which aims for much shorter reconfiguration times -- on the order of seconds or less rather than 10s of seconds.
I highly recommend that anyone for whom cluster interconnect reliability is critical and where the LAN is used as a cluster interconnect implement LAVC$FAILURE_ANALYSIS. This feature has been in VMS since 6.0 and generates OPCOM messages whenever a piece of the cluster interconnect LAN breaks (or when it is repaired). The EDIT_LAVC.COM tool from the V6 Freeware directory [KP_CLUSTERTOOLS] will help you set this up with minimal effort. LAVC$FAILURE_ANALYSIS is documented in the appendices of the OpenVMS Cluster Systems Manual, and I had an article on the topic in the OpenVMS Technical Journal, V2 - see http://h71000.www7.hp.com/openvms/journal/v2/index.html
Once LAVC$FAILURE_ANALYSIS is in place, you will have a written record in console output and in the OPERATOR.LOG file, with timestamps, of all LAN outages, and thus their durations. If you run with this enabled for a while and see that your expectation of a maximum outage length of 60 seconds is what you're really seeing in practice (and you might even consider inducing some of the likely failure types during off hours while you still have RECNXINTERVAL set to 360 to see how the network really behaves), then you could lower RECNXINTERVAL to a bit longer than your maximum outage times with relative safety.
I once worked with a stock exchange which needed to run with RECNXINTERVAL=10 seconds and we used this technique to identify a LAN outage problem which turned out to be lasting 11 seconds (and simultaneously across three supposedly-independent LANs) and thus causing much grief. Armed with the timestamped info, the proof was available with which to, uh, enlighten the understanding of the network folks.
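The approach described above (collect timestamped outage records, then set RECNXINTERVAL a bit above the longest one) can be sketched as follows. This is a rough illustration under stated assumptions: it does not parse real OPCOM or OPERATOR.LOG output, the `(timestamp, state)` event tuples are a hypothetical pre-extracted form, and the 1.5x margin is an arbitrary choice (it happens to turn a 60-second maximum into the 90 seconds suggested earlier in the thread).

```python
import math
from datetime import datetime


def outage_durations(events):
    """Turn time-ordered (iso_timestamp, state) events into outage lengths in seconds.

    state is "down" when a LAN component fails and "up" when it is repaired.
    """
    durations = []
    down_since = None
    for stamp, state in events:
        when = datetime.fromisoformat(stamp)
        if state == "down" and down_since is None:
            down_since = when
        elif state == "up" and down_since is not None:
            durations.append((when - down_since).total_seconds())
            down_since = None
    return durations


def suggested_recnxinterval(durations, margin=1.5):
    """Longest observed outage times a safety margin, rounded up to whole seconds."""
    if not durations:
        raise ValueError("no completed outages observed yet")
    return math.ceil(max(durations) * margin)


# Hypothetical events for one LAN component.
events = [
    ("2005-12-01T02:00:00", "down"),
    ("2005-12-01T02:00:38", "up"),
    ("2005-12-03T14:10:05", "down"),
    ("2005-12-03T14:10:16", "up"),
]
durations = outage_durations(events)
print(durations, suggested_recnxinterval(durations))
```

The result is only as good as the observation period; a failure type that has not occurred yet (or has not been induced during off hours) will not show up in the maximum.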
12-12-2005 08:48 PM
Re: Cluster Time out Question.
Wim
12-12-2005 09:05 PM