Operating System - OpenVMS

Volume shadowing - member timeout threshold

 
Jim Geier_1
Regular Advisor


We are running a cluster of AlphaServer ES45 systems with OpenVMS Alpha V8.3 and an EVA8000. We are considering using Continuous Access EVA to replicate our EVA storage to an EVA at a remote location. The plan is to run in asynchronous replication mode. Our concern is that if CA-EVA switches to synchronous mode because of a link failure or some other problem, performance will suffer. Beyond that, the LUNs being replicated (shadow set member disks) could have exceedingly slow access times, say 100-200 ms. Will the Shadow Server at some point declare such a disk non-responsive and remove it from the shadow set? What is the threshold at which the Shadow Server considers a disk non-responsive and "failed"?
John Gillings
Honored Contributor

Re: Volume shadowing - member timeout threshold

Jim,

See SYSGEN parameter SHADOW_MBR_TMO. The default is 120 seconds (not msecs!). As long as operations complete within that time, shadowing will continue to use the virtual unit. Your WRITE I/Os to the shadow set will just get slower (reads should be OK, as they can be satisfied from the local member).

On V7.3-2 and above, you can set member timeouts per virtual unit (and possibly per site, see the docs). See SET SHADOW/MEMBER_TIMEOUT
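
As a rough sketch of the commands involved: the device names DSA1: and $1$DGA101: below are made up for illustration, and the value 180 is only an example. My understanding is that /MEMBER_TIMEOUT is applied to an individual member device and overrides SHADOW_MBR_TMO for that device, but check HELP SET SHADOW on your version before relying on that.

```
$! Show the current system-wide member timeout (value is in seconds)
$ RUN SYS$SYSTEM:SYSGEN
SHOW SHADOW_MBR_TMO
EXIT
$!
$! Override the timeout for one shadow set member
$! ($1$DGA101: is a hypothetical member device name)
$ SET SHADOW/MEMBER_TIMEOUT=180 $1$DGA101:
$!
$! Review the shadow set and its members
$! (DSA1: is a hypothetical virtual unit name)
$ SHOW SHADOW DSA1:
```

Either way, an I/O that takes 100-200 ms is several hundred times shorter than the 120 second default, so slow replication alone should not come close to the timeout.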

(IMHO "Continuous Access" is a very poor substitute only necessary for impoverished operating systems that don't have host based shadowing - go with shadow sets across the two sites. With minimerge and minicopy you're WAY ahead of anything CA can provide).
A crucible of informative mistakes
The Brit
Honored Contributor

Re: Volume shadowing - member timeout threshold

Jim,
We are in a similar position. First of all, John's comments are correct, provided you are replicating over a distance of less than, say, 200 km (latency 2-5 ms). For longer distances, however, Volume Shadowing (i.e. host-controlled synchronous replication) could have a significant impact on your production I/O.
The point you need to investigate and understand is the EVA CA response to a link failure.
This is our understanding. The CA process (asynchronous) is effectively a continuous merge from a Log (or Journal) at the source end to the target LUN at the remote site. In the event of a Link Failure, the process does not immediately switch to synchronous. Initially, the process takes no notice of the link failure and continues to accumulate the I/Os in the Log File. When the Log File is full, the LUN is marked for "FULL COPY", which will occur when the Link returns. The remote UNIT is now effectively discarded. When the Link returns, the LUN immediately initiates a full copy to the remote device *in synchronous mode*.
We believe that this can be avoided (and HP seems to be in "fuzzy" agreement), if the Log/Journal is >= the size of the DR Group. In this case, the Log functions something like a bit map (kinda like mini-copy) recording block changes to the volume rather than sequential changes. Because the log never fills up, the LUN is never marked for "FULL COPY" and therefore never goes into "synchronous" mode.

It is very much up to you to research this, and I know we would be extremely interested if you reach a different conclusion. Part of our discussion with HP relates to us using XP storage as well as EVA8000, and there are minimum Firmware requirements.

If I am off the mark with my comments here, please feel free to chime in! I would love a more widespread discussion on this topic.

Dave.
Ed Barnum
Advisor

Re: Volume shadowing - member timeout threshold

For Fibre Channel devices, the recommendation is not to set SHADOW_MBR_TMO to less than 180 seconds. See page 6-8 of "Guidelines for OpenVMS Cluster Configurations" for additional information.
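
One way to make that setting stick across reboots, as a minimal sketch assuming the usual MODPARAMS.DAT and AUTOGEN procedure (180 is simply the minimum quoted above; pick a value that suits your configuration):

```
$! 1. Add this line to SYS$SYSTEM:MODPARAMS.DAT on each node:
$!        SHADOW_MBR_TMO = 180
$!
$! 2. Run AUTOGEN so the new value is written to the current
$!    parameter file and used at the next reboot
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK
```

Setting it through MODPARAMS.DAT rather than directly in SYSGEN keeps later AUTOGEN runs from silently putting the old value back.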