Re: Unexpected host chosen for minicopy HBVS 7.3-2 + HBMM patch

Jon Pinkley · 11-12-2007

We are running Alpha VMS 7.3-2 with patches that include the HBMM feature.

Cluster of 3 with two ES40's (SIGMA and OMEGA) with two shared SCSI buses ($4$DKA and $4$DKB), local non-shared SCSI shelf $4$DKF, and FC connection to EVA6000. There is also an AS2000 satellite with only FDDI and 10Mb ethernet connections. Only one ES40 (OMEGA) has FDDI.

SHADOW_MAX_COPY is set to 4 on the ES40's and to 0 on the AS2000. No SET SHADOW /SITE or /READ_COST commands were in effect.

We have several SCSI disks in the ES40 shelves, and have these shadowed with EVA presented $1$DGA devices. The DKF devices have only a single host connection (non-shared SCSI). We are not using port allocation classes on the DKF shelves, but do not have disks installed in the same slot on both systems (even slots on OMEGA, odd slots on SIGMA), so there is a single instance of $4$DKF200 and $4$DKF500. So $4$DKF200 is local to OMEGA and MSCP served to SIGMA, and $4$DKF500 is local to SIGMA and MSCP served to OMEGA.

I dismounted the locally attached SCSI members from the shadows, and specified /policy=minicopy. I did the dismounts on the node with the local connection, so the master bitmap would be there.

After making changes to the shadow virtual units, I remounted the members, doing the mount on the node with the local connection to the device, which was also the node with the master bitmap. I was surprised to see that the minicopy for the drive locally connected to the SIGMA node was done by the OMEGA node, which then had to access the DK member via the MSCP server. This surprised me for several reasons:

1. The master copy of the WBM was on SIGMA.
2. The mount was initiated from SIGMA (and /CLUSTER was not specified)
3. The cost of doing the copy from OMEGA was higher.
4. Both OMEGA and SIGMA had the same SYSGEN value for SHADOW_MAX_COPY: 4
5. All shadowsets were steady state prior to the addition of the DK members, i.e. no other copy or merge operations were in progress.

Does anyone know why VMS shadowing preferred to do the copy from OMEGA instead of (what to me seems more rational) SIGMA?

I can't understand why OMEGA having FDDI would have caused it to be chosen for the copy. It is possible that SIGMA had a higher workload at the time the mount was issued, but I don't think there were any files open on the shadow set.

Perhaps shadowing was confused by the DKF adapters both having the same allocation class (from the node allocation class), but the only path on OMEGA to the specific device $4$DKF500: was via the MSCP served path.

From node OMEGA:
$ sho dev dkf500

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
DSA6500:                Mounted              0  SIGMA72       64573024     1   3
$1$DGA6500:   (OMEGA)   ShadowSetMember      0  (member of DSA6500:)
$4$DKF500:    (SIGMA)   ShadowSetMember      0  (member of DSA6500:)
$

I suppose I should have set shadow_max_copy to zero on OMEGA, then mounted the member, then set it back to 4. I don't understand what conditions made the shadowing software consider OMEGA as a better node to do the copy on.
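A minimal, untested sketch of that sequence. The MOUNT line is an assumption about how the member would be returned to DSA6500 (the /SYSTEM qualifier in particular depends on how the set was originally mounted); the label SIGMA72 is taken from the SHOW DEVICE output above.

```
$! On OMEGA: make OMEGA ineligible to start copy operations.
$ mcr sysgen
SYSGEN> USE ACTIVE
SYSGEN> SET SHADOW_MAX_COPY 0
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
$!
$! On SIGMA: return the locally attached member to the shadow set.
$! (/SYSTEM is an assumption about how DSA6500 was mounted.)
$ mount/system dsa6500: /shadow=($4$dkf500:) SIGMA72 /policy=minicopy
$ show shadow dsa6500      ! check which node the "MiniCopy Active" line names
$!
$! On OMEGA, once the copy is under way: restore the previous value.
$ mcr sysgen
SYSGEN> USE ACTIVE
SYSGEN> SET SHADOW_MAX_COPY 4
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
```

SHADOW_MAX_COPY is a dynamic parameter, so USE ACTIVE / WRITE ACTIVE changes it without a reboot.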

More info is in the attached file (along with a copy of text to this point).

Jon (still learning about HBVS)

it depends

John Gillings · 11-12-2007

Jon,

You may not get much interest in V7.3-2 behaviour, it may have been changed/fixed in V8.3, and V7.3-2 is too long ago for anyone to remember.

That said, a few comments...

>2. The mount was initiated from SIGMA (and /CLUSTER was not specified)

Adding a member to an existing shadow set, which is already mounted across the cluster is implicitly cluster wide.

>3. The cost of doing the copy from OMEGA was higher

If you haven't defined any sites or specified any costs, how are the systems supposed to know that?

>4. Both OMEGA and SIGMA had the same SYSGEN value for SHADOW_MAX_COPY: 4

Therefore both are equally eligible to perform the shadow copy.

>preferred to do the copy from OMEGA
>instead of (what to me seems more
>rational) SIGMA?

Check Keith Parris' notes about which end of the wire is better for performing the shadow copy. I can't remember the details, but I think it was somewhat non-intuitive due to the asymmetry of read and write operations in MSCP.

One way of looking at this is, if you haven't prioritised shadow operations by specifying sites and costs, then you're effectively leaving shadowing to choose for itself.

If you care about it, you should work out your sites and costs and tell the system. As you've observed, you can also control behaviour by dynamically adjusting SHADOW_MAX_COPY (I think Mr Parris has recommendations about that too).
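As a rough illustration only, site and read-cost settings are applied with SET SHADOW, per node, along these lines. The site number and cost value here are hypothetical, and nothing in this thread confirms that these settings change which node is chosen to perform a copy.

```
$! Hypothetical values; issue on each node that has DSA6500 mounted.
$ set shadow/site=1 dsa6500:          ! site of the virtual unit as seen from this node
$ set shadow/site=1 $1$dga6500:       ! site of the EVA member
$ set shadow/read_cost=2 $4$dkf500:   ! override the automatically assigned read cost
$ show shadow dsa6500                 ! displays "Read Cost" and "Site" for each member
```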

A crucible of informative mistakes

Jon Pinkley · 11-12-2007

John Gillings>>>"You may not get much interest in V7.3-2 behaviour, it may have been changed/fixed in V8.3, and V7.3-2 is too long ago for anyone to remember."

True; I am not expecting a fix for 7.3-2. I was just surprised, and thought others who are still running 7.3-2 might be interested in what I observed, so they can avoid it.

John Gillings>>>"Adding a member to an existing shadow set, which is already mounted across the cluster is implicitly cluster wide."

That is just as true for dismounting a member, yet the master copy of the minicopy WBM is always created on the node initiating the dismount /policy=minicopy.

John Gillings>>>"If you haven't defined any sites or specified any costs, how are the systems supposed to know that?"

Because one node had a direct connect and the other didn't. The shadowing software considered the read cost for the direct connection as 2 (whether a SCSI connection to an HSZ40 or a FC connection to an EVA). For the MSCP served connection the read cost was 501.

This was well hidden in the previous attachment (without my commentary)

This is the device as shown from the SIGMA node (which has a direct SCSI connection to $4$DKF500). Note the Read Cost of 2 for device $4$DKF500:

$ sho shadow dsa6500

_DSA6500: Volume Label: SIGMA72
Virtual Unit State: MiniCopy Active (65%) on OMEGA
Enhanced Shadowing Features in use:
Dissimilar Device Shadowing (DDS)
Host-Based Minimerge (HBMM)

VU Timeout Value 3600 VU Site Value 0
Copy/Merge Priority 5000 Mini Merge Enabled
Recovery Delay Per Served Member 30

HBMM Policy
HBMM Reset Threshold: 50000
HBMM Master lists:
Up to any 2 of the nodes: SIGMA,OMEGA
HBMM bitmaps are active on SIGMA,OMEGA
HBMM Reset Count 247 Last Reset 11-NOV-2007 04:00:04.69
Modified blocks since last bitmap reset: 1016

Device $1$DGA6500 Master Member
Read Cost 2 Site 0
Member Timeout 120

Device $4$DKF500 Copy Target (65%)
Read Cost 2 Site 0
Member Timeout 120
$
SIGMA::_VTA5787: 19:50:08 (DCL) CPU=00:00:39.64 PF=33549 IO=243513 MEM=229

------------------------

The same devices as seen from OMEGA (with an MSCP served path to $4$DKF500). Note the Read Cost of 501 (due to the MSCP served status, the cost is incremented "automatically"):

$ sho shadow dsa6500

_DSA6500: Volume Label: SIGMA72
Virtual Unit State: MiniCopy Active (65%) on OMEGA
Enhanced Shadowing Features in use:
Dissimilar Device Shadowing (DDS)
Host-Based Minimerge (HBMM)

VU Timeout Value 3600 VU Site Value 0
Copy/Merge Priority 5000 Mini Merge Enabled
Recovery Delay Per Served Member 30

HBMM Policy
HBMM Reset Threshold: 50000
HBMM Master lists:
Up to any 2 of the nodes: SIGMA,OMEGA
HBMM bitmaps are active on SIGMA,OMEGA
HBMM Reset Count 247 Last Reset 11-NOV-2007 03:59:56.02
Modified blocks since last bitmap reset: 1016

Device $1$DGA6500 Master Member
Read Cost 2 Site 0
Member Timeout 120

Device $4$DKF500 Copy Target (65%)
Read Cost 501 Site 0
Member Timeout 120
$
OMEGA::JON 19:50:02 (DCL) CPU=00:00:37.96 PF=28650 IO=353461 MEM=275

------------------------

Jon Pinkley>4. Both OMEGA and SIGMA had the same SYSGEN value for SHADOW_MAX_COPY: 4

John Gillings>>>"Therefore both are equally eligible to perform the shadow copy"

I assumed (incorrectly) that minicopy was similar to mini-merge in that the operation had to be done on the node with the master copy of the WBM. When a mount is done with a minicopy, does the "elected" node synchronize its local copy of the bitmap with the master, and take mastership of the WBM?

John Gillings>>>"If you care about it, you should work out your sites and costs and tell the system. As you've observed, you can also control behaviour by dynamically adjusting SHADOW_MAX_COPY (I think Mr Parris has recommendations about that too)."

Good advice. However these two ES40's are in the same rack. It isn't obvious to me how I could use site to do what I want. If I set SIGMA and its local only connections in site 1, and OMEGA and its local connections in SITE 2, and all "shared direct" connections in site 3; then shadowing would prefer to use the local dumb drive over the EVA FC connection. And it appears that read cost alone wasn't enough to encourage the copy to take place on the system that had local connection to all members.

We have only 3 shadowsets containing "dangling" disks with only a single direct connection, and these drives have archival type data (low activity), so it isn't a big issue. I will be more diligent about setting SHADOW_MAX_COPY when I do this in the future, as there was a very noticeable 40 minute spike in Interrupt mode in my T4 CSVPNG graphs on both SIGMA and OMEGA.

Jon

it depends

Robert Brooks_1 · 11-13-2007

I haven't read the entire string of notes, but let me make a couple of quick points.

With the HBMM kit, there are two major "control knobs" that determine which node performs a copy or merge on which shadow set.

The SYSGEN param SHADOW_REC_DLY is effectively a "backoff timer"; once a node figures out that a copy or merge is needed, it determines how long that node waits before it attempts to vie for the lock needed to perform the copy or merge. Similarly, the prioritization mechanism, in conjunction with the _MAX_COPY SYSGEN param, determines the order in which shadow set copies/merges are done on any given node.

In general, the lock manager is a key component in sorting all this out. The node on which a MOUNT is done is irrelevant. The node on which a bitmap exists is irrelevant for a mini-copy (mini-copy bitmaps can be moved; HBMM bitmaps cannot).

In the case of a crash (which precipitates a merge) or a MOUNT (which precipitates a copy), the lock manager is used both to signal that a recovery operation is needed and to arbitrate as to who actually does the work. I don't want to get into a detailed description, but shadowing uses a LOT of locks with blocking ASTs as a signaling mechanism.

-- Rob

Jon Pinkley · 11-15-2007

Rob,

Thanks for the response. I saw in the 8.3 release notes/new features manual that there is now an "automatic" minicopy bitmap on loss of member if the HBMM is set up for multiuse. I couldn't find a lot of documentation about this, and did see that at least on the IA64, there have been patches released to fix problems introduced by the change.

The HP Volume Shadowing for OpenVMS manual needs to be updated. As far as I know, the 7.3-2 manual is the latest version; that has nothing about HBMM, which is only documented in the supplemental documentation.

The "OpenVMS Volume Shadowing Support for Host-Based Minimerge (HBMM)" manual that is part of the 7.3-2 HBMM kit says the following in Section 1.12.8, Managing Transient States in Progress:
----------
"SHADOW_MAX_COPY is a dynamic system parameter that governs the use of system resources by shadowing. Shadowing can be directed to immediately respond to changes in this parameter setting with the following DCL command:

$ SET SHADOW/EVALUATE=RESOURCES

This command stops all the current merge and copy operations on the system on which it is issued. It then restarts the work using the new value of SHADOW_MAX_COPY."
-----------
And Table 1-1 "Visible Impact of Transient State Events" says that a "SET SHADOW /EVAL=RESOURCES command issued on this system" will cancel a prior minicopy in progress, but that it will continue at the LBN where it was cancelled.

It appears that even after I discovered the minicopy was being done where I didn't want it to be done, I could have forced it to move. This does not work for minimerge according to the 7.3-2 HBMM manual.

Pre-existing conditions: OMEGA and SIGMA have DSA6500 mounted. $1$DGA6500: is ShadowSetMember, $4$DKF500 Minicopy target in progress on OMEGA, but this member is only accessible via MSCP server on NODE SIGMA. SIGMA has direct connection to both members, and I would prefer that the copy be done from SIGMA. Both SIGMA and OMEGA have SHADOW_MAX_COPY set to 4; DELTA (third mounter) has SHADOW_MAX_COPY set to zero, as it is a satellite with only MSCP connection to either member.

$ show dev dsa6500

Device                  Device           Error    Volume         Free  Trans Mnt
 Name                   Status           Count     Label        Blocks Count Cnt
DSA6500:                Mounted              0  SIGMA72       64573024     1   3
$1$DGA6500:   (OMEGA)   ShadowSetMember      0  (member of DSA6500:)
$4$DKF500:    (SIGMA)   ShadowCopying        0  (copy trgt DSA6500:  65% copied)

Assume I am on OMEGA, and issue the following commands:

$ set proc/priv=all ! remove any doubt that lack of privilege is changing behavior.
$! in the following we explicitly use ACTIVE because we are going to write ACTIVE.
$ mcr sysgen
SYSGEN> USE ACTIVE
SYSGEN> SET SHADOW_MAX_COPY 0
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
$ set shadow/evaluate=resources

At this point, the shadow minicopy should be cancelled on OMEGA (the node where the set shadow/eval=resources command was issued), then the minicopy should resume on SIGMA, since it has SHADOW_MAX_COPY set to 4.

Once the minicopy has resumed on SIGMA, it should be possible to reset SHADOW_MAX_COPY on OMEGA (even to something higher than what is on SIGMA) and issue another set shadow/evaluate=resources command on OMEGA. (Note that the documentation implies that SET SHADOW /EVALUATE=RESOURCES only affects the system it is executed on, i.e. if other cluster members had their active dynamic SYSGEN parameters changed, only the system where the SET SHADOW /EVAL=RESOURCES command was issued would cancel any copies or merges.)
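A sketch of that follow-up step, mirroring the commands above and equally untested. The value 4 is simply the original setting on OMEGA.

```
$! On OMEGA, after SHOW SHADOW reports the minicopy active on SIGMA:
$ mcr sysgen
SYSGEN> USE ACTIVE
SYSGEN> SET SHADOW_MAX_COPY 4
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
$ set shadow/evaluate=resources
$ show shadow dsa6500      ! confirm the copy is still reported on SIGMA
```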

Is my understanding of the documentation correct?

Jon

it depends

Robert Brooks_1 · 11-15-2007

I'll spend some time going over your last post in more detail tomorrow (Friday), but I believe that your understanding of the behaviour of SET SHAD/EVAL=RESOURCES is correct; it only affects the local node. In fact, it was put in for your exact situation; to move a copy or merge off of a specific node.

While I wasn't directly involved in the development of automatic minicopy, I do think it's a pretty neat feature, even though a patch or two has been needed to tidy things up :-(

There may be more neat shadowing features on the way for V8.4 (both Alpha and I64). It's likely out in late 2008 or early 2009; stay tuned!

The fact that the documentation has not been updated is a longstanding annoyance to the HBMM team! The problem is that the shadowing doc writer was out with a medical condition at the time that the V8.2 doc set was being prepared, and she retired a bit before the V8.3 doc set was done. I've been fighting to get the V7.3-2 HBMM documentation in a more prominent place on the web, to no avail.

-- Rob

Jon Pinkley · 11-15-2007

Robert Brooks>>>"The node on which a bitmap exists is irrelevant for a mini-copy (mini-copy bitmaps can be moved; HBMM bitmaps cannot)."

Is there any way to request that the master mini-copy bitmap be moved, specifically before shutting a system down that currently has the master? If not in 7.3-1, is this possible in 8.2 or 8.3, or planned for a future release?

Perhaps that is what the 8.3 multiuse bitmaps can provide, but I can't easily test that. The documentation I read seemed to imply that the multiuse bitmaps would only come into play when a member timed out, and in that case, although the bitmap may be a bit stale (and have more "dirty bits" set than needed), it would still be very preferable to copy the blocks that had been modified since the last time the HBMM bitmap was zeroed instead of having to do a full copy.

However, when a system is being shut down, and that system has the master copy of a mini-copy bitmap, it should be able to "resign the mastership" and pass the still valid bitmap to another node before it leaves. For example, if a member is removed to do a backup, but the backup of the removed member is not initiated immediately, and you then decide you need to shut down the node that dismounted the member with /policy=minicopy, you should be able to do so without losing the ability to remount the removed member without a full copy.

In another thread, http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1172934 we also discussed the wish that if a system has the last path to a member of a multimember shadowset that is still mounted elsewhere in the cluster, and there is at least one other member in the shadowset that will continue to be accessible after the shutdown, it should be possible for the "dangling member" to be dismounted with a mini-copy bitmap mastered on another node that has the volume mounted.

For example, on the system this thread is discussing, the DSA6500 shadowset has two members. The $4$DKF500 device has only a single direct connection, to a SCSI HBA on the SIGMA node. The other member of the shadowset, $1$DGA6500, is on an EVA, and that member has direct connections to both SIGMA and OMEGA. When SIGMA is shut down, OMEGA will lose its MSCP connection to $4$DKF500, so before SIGMA shuts down, it should request OMEGA to dismount/policy=minicopy $4$DKF500:. Then when SIGMA reboots, it will only need to copy blocks modified on the DSA6500 shadowset while SIGMA was down. The other advantage is that there will be no delay waiting for the $4$DKF500 member to time out.
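Done by hand rather than automatically at shutdown, that idea might look like the untested sketch below. It relies on the observation earlier in this thread that the minicopy master bitmap is created on the node that issues the dismount; the MOUNT form and the /SYSTEM qualifier are assumptions.

```
$! On OMEGA, before SIGMA is shut down: remove the member that only SIGMA
$! can reach directly, so the minicopy master bitmap is created on OMEGA.
$ dismount/policy=minicopy $4$dkf500:
$!
$! After SIGMA has rebooted and $4$DKF500: is visible again:
$ mount/system dsa6500: /shadow=($4$dkf500:) SIGMA72 /policy=minicopy
$ show shadow dsa6500      ! a minicopy, not a full copy, should be reported
```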

The downside is that if you were planning to shut down the whole cluster, it would be better to dismount the DSA device so that both members will have the same generation number. So there needs to be a way to control whether this member dismount occurs at shutdown time.

Note that if a node could request a master bitmap to be moved, then the dismount of the member device could be done locally on the system shutting down (with generation of the master minicopy bitmap on the system being shutdown), followed by the transfer of mastership of the bitmap to another system.

Jon

it depends
