Operating System - OpenVMS

Aaron Sakovich
Super Advisor

New clustered shadowset

Hi all,

I'm about to create a new shadowset, but for the first time (for me), it will be across members in a cluster. I'm trying to get my brain around some of the concepts, think I've got it, but want to run a sanity check back against the collective intelligence of this forum.

The scenario is that I'll have a 3-node Integrity-based VMScluster running 8.2-1, with one drive dedicated per host to the shadowset, and this 3-disk shadowset will be used to hold the common cluster files, user directories, and other things.

To create the shadowset, I'll have all 3 nodes booted into the cluster, then will issue "Init /Shadow=($1$DKB200,$2$DKB200,$3$DKB200) Common /Erase" (note 3 different AlloClass values, plus the Erase switch). No big surprise there, I don't think.

Now, to mount the drives, it's easy when it's all on a single node... But how about on a cluster, where you don't know in what sequence a particular system and its disk will be added?

I think the correct command to put in each node's SyLogicals.com is: "Mount /System DSA10: /Shadow=($1$DKB200) Common /Include /Policy=MiniCopy=Optional" on the node with AlloClass = 1 ($2$ on node 2, etc.)

Now, make sure I understand how this is working, if you please: Since the full list of disks is included in the shadowset's metadata, I don't have to spec all drives, just the drive I'm adding on the currently booting node. This should take care of the case where the entire cluster is booting and the first host is waiting for a second to create quorum, if I understand things correctly. Also, I shouldn't spec /Cluster, because that would cause the shadowset to be remounted on the remote nodes -- or would it cause the mount to fail?

Now, in the SyShutdwn.com file, I add a "Dismount $1$DKB200/Policy=MiniCopy=Optional" on node 1, $2$ on 2, etc., in order to gracefully remove each host's local drive from the shadowset, right? I do NOT dismount the DSA drive, because that would dismount it on the other 2 cluster nodes, right? Or would it just break the shadowset by removing the local disk from its definition?

Any other tricks I should know about this new shadowset? What do I do to recover efficiently/effectively from an unexpected system crash? Where can I find documentation for HBMM on 8.2-1? It's included in the OS, but the Shadowing manual hasn't yet been updated to cover it. (I've never used HBMM before.)

Thanks for any insight you might be able to offer!

Aaron
28 REPLIES
Jan van den Ende
Honored Contributor
Solution

Re: New clustered shadowset

Aaron,

That will work out just fine,
BUT,
in view of potential future developments:
1)
Stay away from $1$ and $2$, as those are now the defined "alloclasses" of SAN disks and tapes, respectively.
2) _DO_ take care to specify /LIMIT (up to 1 TByte) on the INIT! (or, if you use a /CLUSTER_SIZE smaller than 8, then /LIMIT = clustersize/8 * 1 TB)
This way, you allow for future online shadow set expansion.
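
For example (the device names and label here are only illustrative, and do check HELP INITIALIZE for the exact /LIMIT and /CLUSTER_SIZE syntax on your version):

$! Create the shadow set with room for future volume expansion.
$! /LIMIT without a value should default to the 1 TB maximum, if I
$! remember the qualifier correctly; otherwise give it a block count.
$ INITIALIZE /SHADOW=($3$DKB200:,$4$DKB200:,$5$DKB200:) /ERASE -
      /CLUSTER_SIZE=8 /LIMIT COMMON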

And please review any and all INIT qualifiers in the manual or in HELP. Several of them merit consideration. I already mentioned /CLUSTER_SIZE, but others may also prove beneficial.

Success!
... and, have fun!

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Jan van den Ende
Honored Contributor

Re: New clustered shadowset

Sorry,

upon rereading I also noted your question on HBMM.
If you do not have any other info, download the V7.3-2 HBMM patch. It has quite extensive release notes.

Proost.

Have one on me,

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Robert Brooks_1
Honored Contributor

Re: New clustered shadowset

Now, make sure I understand how this is working, if you please: Since the full list of disks is included in the shadowset's metadata, I don't have to spec all drives, just the drive I'm adding on the currently booting node. This should take care of the case where the entire cluster is booting and the first host is waiting for a second to create quorum, if I understand things correctly. Also, I shouldn't spec /Cluster, because that would cause the shadowset to be remounted on the remote nodes -- or would it cause the mount to fail?

---

Using the /CLUSTER qualifier on the mount won't cause a "remount"; there's no such thing. If the shadowset is already mounted, the operation essentially becomes a no-op.

On the other hand, you don't need to specify /CLUSTER if you are simply adding a member to a shadow set that is already mounted somewhere on the cluster. Simply adding the new member on any node will force the shadow set to reevaluate the membership, and the other two nodes will automatically add the new member(s). In general, I avoid using /CLUSTER; MOUNT can tend to get "confused" sometimes, although some effort has been made to bring sanity to the /CLUSTER qualifier; I think that work first showed up in V8.3, although it may have been V8.2.
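
For example, when the third node boots and DSA10: is already mounted elsewhere in the cluster, something like this (using your device and label names, which I haven't checked) is all that's needed; the other nodes pick up the membership change on their own:

$! DSA10: is already mounted cluster-wide; just add the local member here.
$ MOUNT /SYSTEM DSA10: /SHADOW=($3$DKB200:) COMMON /POLICY=MINICOPY=OPTIONAL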

For HBMM, the online help (with extensive examples!) is pretty good, although the stuff we included in the V7.3-2 HBMM kit will explain things in good detail.

The HBMM policy mechanism may look at first glance to be quite complicated, but for a one-site cluster with only a few nodes, your HBMM policy definitions should be quite simple.

-- Rob (part of the HBMM team)
Jack Trachtman
Super Advisor

Re: New clustered shadowset

Aaron,

How are the local disks presented to the other cluster members? Via MSCP?
John Gillings
Honored Contributor

Re: New clustered shadowset

Aaron,

>But on a cluster, where you don't know
>what sequence a particular system and its
>disk will be added?


Exactly! Shadowing in a cluster works best when storage is shared, like a SAN.

The big danger in mounting a three member shadowset in this type of configuration is mounting the wrong (i.e. older) drive first and losing data.

Depending on your exact requirements, my preference is to use /POLICY=REQUIRE_MEMBERS - this means the shadowset will not mount unless all members are available. And also use /CLUSTER. This means the shadowset will be mounted by the LAST node up, and you will need to qualify your application startups to check that storage is available.
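
For example (a sketch only, using your device and label names as given):

$! Mounted cluster-wide, but only once every member is present.
$ MOUNT /CLUSTER DSA10: /SHADOW=($1$DKB200:,$2$DKB200:,$3$DKB200:) COMMON -
      /POLICY=REQUIRE_MEMBERS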

It also means you need some way to force mounting in a disaster recovery scenario, where a node has been lost but you can manually ensure the shadow members are mounted in the correct sequence - maybe use one of the USER SYSGEN parameters?

One step better is to abstract your disks away from physical devices. All application code references devices by logical name, you then build a procedure that knows the mapping from logical name to physical name and how to mount the devices, and any special logic involving host availability. There may be multiple logical devices mapped to a single physical device. Your application code then asks "get my storage", and there's a single place that can check if it's already mounted, knows how many members to mount, how long to wait and what policies to enforce. The same module can be responsible for reducing shadowsets for backup, handle recovery etc...

This approach makes it much easier to manage and maintain your storage, and, most importantly, protect yourself from mistakes.
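
As a rough sketch of the idea (the logical name and layout here are invented purely for illustration; a real version would carry your site's own policies, waits and member checks):

$! "Get my storage" - application startups call this, then use APP_DATA:
$ IF .NOT. F$GETDVI("DSA10:","MNT")
$ THEN
$     MOUNT /SYSTEM DSA10: /SHADOW=($1$DKB200:,$2$DKB200:,$3$DKB200:) COMMON
$ ENDIF
$ DEFINE /SYSTEM /EXEC APP_DATA DSA10:[DATA]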
A crucible of informative mistakes
Martin Hughes
Regular Advisor

Re: New clustered shadowset

> Now, in the SyShutdwn.com file, I add a "Dismount $1$DKB200/Policy=MiniCopy=Optional" on node 1, $2$ on 2, etc., in order to gracefully remove each host's local drive from the shadowset, right? I do NOT dismount the DSA drive, because that would dismount it on the other 2 cluster nodes, right? Or would it just break the shadowset by removing the local disk from its definition?
<

Dismounting the DSA volume would only dismount it on that node, unless you specify /CLUSTER.

Dismounting the member DKBnnn will remove this member from the shadowset cluster wide. The DSA volume would still be available to the node shutting down, via MSCP served from the remaining nodes (assuming you have MSCP enabled).

Be careful with what you are doing dismounting disks in SYSHUTDWN. Consider the cluster shutdown scenario, and whether your system still needs access to this disk during the shutdown process.

What are you planning on putting on this disk? SYSUAF? QMAN$MASTER.DAT, etc.?

Shadowsets with local disks presented via MSCP between cluster members can be a bit tricky to manage. In this configuration I tend to manage the mounting/dismounting manually.
For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2
Jan van den Ende
Honored Contributor

Re: New clustered shadowset

Aaron,

>>>
Now, in the SyShutdwn.com file, I add a "Dismount $1$DKB200/Policy=MiniCopy=Optional" on node 1, $2$ on 2, etc., in order to gracefully remove each host's local drive from the shadowset, right? I do NOT dismount the DSA drive, because that would dismount it on the other 2 cluster nodes, right? Or would it just break the shadowset by removing the local disk from its definition?
<<<

Any _MEMBER_ dismount reduces the shadow set by that member, and therefore operates inherently on ALL nodes that have it mounted.
And, you CANNOT dismount the last remaining member of a set!
OTOH, if you want to dismount on just one node, then, on that node, you DISMOUNT the shadow SET.
In working with shadow sets, I find it helps a lot to start thinking in terms of DRIVES (= members) and VOLUMES (= the data set available to programs etc.).
If you look back through any documentation (and the relevant commands) you will find that that distinction has "always" been consistently made, albeit silently.

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Robert_Boyd
Respected Contributor

Re: New clustered shadowset

Aaron,

Here's a not-so-simple procedure for mounting/dismounting disks that I use on a couple of clusters running Oracle. Each system has its own disks, with 2 of them having some shared SCSI drives. All 3 system disks are separate, so this procedure is copied to each SYS$MANAGER directory.

Before full backups are done, Oracle gets shut down, this procedure gets called to remove the members of the shadowsets that belong to the node running the backup job, and Oracle gets restarted.

There's a subprocedure referenced -- here's a copy of disk_sites.com

I tried to comment this fairly thoroughly -- if you have any questions let me know.

Robert

$!
$! DISK_SITES.COM - set the shadowing /SITE value on each local shadow set
$! member, selecting the devices by allocation class ("site" 1 or 2).
$ set noon
$ call find_site 1
$ call find_site 2
$
$FIND_SITE: subroutine
$ set noon
$ site = p1
$ dev_context = 2*site + 1 + 2048   ! unique F$DEVICE search-stream id per site
$
$DEVICE_LOOP:
$ next_dev = f$device("*:","DISK",,dev_context)
$ IF next_dev .nes. ""
$ THEN
$    if f$verify() then $ show symbol next_dev
$!   Only consider devices carrying this site's allocation class, e.g. "_$1$..."
$    if f$extract(0,4,next_dev) .eqs. "_$''site'$"
$    then
$       next_dev = next_dev - "_"
$!      If the device is currently a shadow set member, tag it with its site
$       if f$getdvi(next_dev,"SHDW_MEMBER") then -
           SET SHADOW /SITE='site' 'next_dev'
$    endif
$    goto DEVICE_LOOP
$!
$ ENDIF
$
$EXIT:
$ exit
$ENDSUBROUTINE
Master you were right about 1 thing -- the negotiations were SHORT!
Robert_Boyd
Respected Contributor

Re: New clustered shadowset

Hmmm, looks like the procedure I gave as an example before was more for the 2-node cluster running Oracle. This one is from the 3-node cluster.

:-)

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

Woot! I knew asking here would elicit tons of good tips!

Jack, et al. -- yes, the disks are MSCP-served to the cluster members; there is no shared storage.

John, you talk about using /Policy=Require_Members. The downside that I see to that is that if our cluster boots and 2 members of the 3 are available, quorum is reached, but we still would not see the volume mounted until the 3rd system booted, right? That's not our goal -- we're happy with 2 members of the cluster. It's a development & QA cluster, so there's a good chance that we'll have at least one system down or rebooting for any variety of reasons. And because of that, I'm loath to do anything that's going to require massive amounts of hand-waving and magical incantations to get the system up and running after just one host reboots.

Martin and Jan -- about dismounting DSA vs. DKB during shutdown. I'm trying to figure out if the best bet would be to first dismount the local drive DKB200 so that it's removed cluster-wide and then dismount the DSA shadowset on the local host, OR should I just dismount the DSA volume on the local host? If I dismount the drive, it doesn't remove it from the permanent definition of the shadowset, it just tells the other hosts that this drive is unavailable until it's mounted again, right? With the common system files on the shadowset, will I even be able to dismount the DSA volume on the local host during SyShutdwn? Is there anything special that I should do (or even could do) to facilitate the clean removal of the local drive with these shared files?

And, yes, the common drive is to be used for the standard complement of shared files: SysUAF, RightsDB, TCPIP$Hosts, Qman, LPD spool directory, etc.
Ian Miller.
Honored Contributor

Re: New clustered shadowset

If you dismount the local drive, the other members of the shadow set will update the definition of the shadow set (in the SCB) so that the local drive will require a full copy if added back into the shadow set.

Just dismount the DSA. There will always be files open (UAF, queue files, etc.), though, so I think it unlikely you can avoid the resulting shadowset merge.
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: New clustered shadowset

This is the mount procedure we use on all our systems. You use it as

@xx single_disk dsa1 disk1 $45$dka401,$45$dka301

Note that some parts may require our context.

Wim
Wim
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

Ah, I found the HBMM docu on one of my 7.3-2 systems -- it was included in a subsequent UPDATE ECO. I'm going to transfer it to my PDA so I can read it at the coffee shop... ;^)

Thanks to all for your input -- I'm going over the info you've provided to see if I can come up with any more questions.
Robert_Boyd
Respected Contributor

Re: New clustered shadowset

Ian,

It depends on how you do the DISMOUNT of the local drive -- from the HELP DISMOUNT /POLICY text:


/POLICY=[NO]MINICOPY[=(OPTIONAL)] (Alpha/I64 only)

Controls the setup and use of the shadowing minicopy function.

Requires LOG_IO (logical I/O) privilege to create bitmaps.

The exact meaning of the MINICOPY keyword depends on the context
of the DISMOUNT command, as follows:

1. If this is a dismount of a single member from a multi-member
shadow set, a write bitmap is created to track all writes
to the shadow set. This write bitmap may be used at a later
time to return the removed member to the shadow set with a
minicopy.

If the write bitmap cannot be initiated and the keyword
OPTIONAL is not specified, the dismount will fail and the
member will not be removed.

If you omit the /POLICY qualifier or if you specify
/POLICY=NOMINICOPY, no bitmap will be created.

Cheers,

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

Hi, me again...

So, going over the stuff about HBMM, the one thing that has me concerned is this one statement in the OVMS FAQ:

"In a regular Merge operation, Shadowing uses an algorithm similar to the Full Copy algorithm (except that it can choose either of the members' contents as the source data, since both are considered equally valid)"

Ref:
http://hoffmanlabs.org/vmsfaq/vmsfaq_011.html#mgmt63_mm

Okay, so in my 3 host scenario, where I hope to have at least 2 hosts up at any time (quorum = 2), do I even need to use HBMM? Will MiniCopy handle both the expected departure of a host via shutdown, and the unexpected crash?

Or do I want to set up HBMM to handle the case when 2 of my hosts die, then at least one comes back? Will enabling HBMM in any way negatively affect the benefits of MiniCopy?

Oh, and I'm still wrestling with doing a "Dismount DSA10:" or "Dismount $10$DKB200:" in my shutdown. Or both. At the present time, I think the correct scenario would be to dismount the shadow set, then if the local drive is still mounted on any of the other hosts, dismount it (remembering to include the MiniCopy policy request). Is that legit, or is there a better/easier way to do it?
Jan van den Ende
Honored Contributor

Re: New clustered shadowset

Aaron,

"merge" kicks in when two or 3 members of a shadowset stay on line, but one node that has the set mounted leaves unexpectedly ("crashes").
Now, if any writes to that VOLUME are in progress (remember I empasised the distinction between drive and volume?) there is NO way for the other nodes to know how far each of the parallel writes to each of the drives has progressed.
In other words, the members are NOT guaranteed to be identical, and that MUST be corrected. A "merge" is needed. As there was no way of knowing WHERE the writes-in-progress are, the WHOLE drive had to be checked. A VERY lengthy process! In HBMM Engeneering has found ways of keeping track of where any write is to occur, and when it has completely finished. And that, in such a way, that the surviving cluster members have access to this info. Now, ONLY those disk blocks that might be affected need re-synching. A matter of seconds (or fractions thereoff).

COPY (and MINICOPY) occurs, when a DRIVE is (re-)added to the volume. (following a MOUNT Of course, MINIcopy is only possible if the drives once WERE identical (ie, they formed a shadowset, and one drive was DISMOUNTed), AND all changes since the separation have been kept track of.

Completely different scenarios, and rather different objects, with different algorithms. Although, of course the global object is to reach the goal where all drives contain reliably the same info in the same location, ie, the Volume is consistent.

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

Jan,

Super -- that made things clear.

I've also just found this thread in Italy:

http://www.eleuteron.it/mess_160141_3319014.html

The same thing is said there, down towards the bottom. There's also an unanswered question in that thread about whether you should only use /Policy=MiniCopy in the dismount command. If I understand it correctly, it would only slow down your shutdown if you had to create the bitmap from scratch on all the nodes when you dismount the volume. It would make more sense to create it during the mount, but still verify that it's there during the dismount. Is that a valid statement?

I'm getting close -- thanks for everyone's help!
Robert_Boyd
Respected Contributor

Re: New clustered shadowset

Aaron,

You definitely want to enable HBMM by using the /POLICY=MINICOPY on the mount.

If you want to speed up shutdown a bit, you can dismount the local member of a shadowset first with /Policy=Minicopy, then dismount the DSA device. Removing the member first will save the other cluster members from having to go through shadow member timeouts after the node leaves.
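
As a sketch of what I mean, on the node with allocation class 3 (your device names; worth testing how the write bitmaps behave in your configuration before relying on it):

$! In SYSHUTDWN.COM: drop the local member first, then release the
$! virtual unit on this node only.
$ DISMOUNT /POLICY=MINICOPY=OPTIONAL $3$DKB200:
$ DISMOUNT DSA10: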

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
Martin Hughes
Regular Advisor

Re: New clustered shadowset

> Oh, and I'm still wrestling with doing a "Dismount DSA10:" or "Dismount $10$DKB200:" in my shutdown. Or both. At the present time, I think the correct scenario would be to dismount the shadow set, then if the local drive is still mounted on any of the other hosts, dismount it (remembering to include the MiniCopy policy request). Is that legit, or is there a better/easier way to do it?
<

If your aim here is to achieve a MINICOPY then you need to dismount the member (DKBnnn) not the volume (DSAnnn).

Unless I am mistaken, if you dismount the DKB member using /POLICY=MINICOPY in SYSHUTDWN, this will create the master write bitmap on the system that is shutting down. This master write bitmap will be deleted when the node shuts down, and therefore a full copy will be required to add the member back into the shadowset.

What I would do is dismount the DKB member using /POLICY=MINICOPY from one of the other cluster members, so that the master write bitmap can be created and maintained there. And I would do it manually rather than trying to automate this process.
For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2
Robert Brooks_1
Honored Contributor

Re: New clustered shadowset


You definitely want to enable HBMM by using the /POLICY=MINICOPY on the mount.

--

HBMM is enabled through a sequence of commands that create an HBMM policy and an association of that policy with a shadow set.

For your system, you may find the DEFAULT policy to be sufficient.

The above statement describes one of the ways that minicopy (not HBMM) can be enabled.
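
As a rough sketch of the kind of sequence involved (the keywords and values here are illustrative -- HELP SET SHADOW on your V8.2-1 system is the authority):

$! Give the shadow set a simple device-specific policy (any node may host a
$! master bitmap, keep two of them), then turn HBMM on for the mounted set.
$ SET SHADOW DSA10: /POLICY=HBMM=((MASTER_LIST=*,COUNT=2))
$ SET SHADOW DSA10: /ENABLE=HBMM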
John Gillings
Honored Contributor

Re: New clustered shadowset

Aaron,

>John, you talk about
>using /Policy=Require_Members. The downside
>that I see to that is that if our cluster
>boots and 2 members of the 3 are
>available, quorum is reached, but we still
>would not see the volume mounted until the
>3rd system booted, right?

Correct! But consider what happens if the 3rd (missing) member has the latest data? Once you've established the shadowset with the older members, the 3rd will be overwritten (full copy) when it's returned. /POLICY=REQUIRE_MEMBERS means you will always use the latest data as the basis of the shadowset.

THE most important issue with shadow sets is keeping track of which one has the latest data, and the order in which the shadow set is formed. If you believe you can manually keep track, and ensure everything is always booted in the correct sequence, that's great. REQUIRE_MEMBERS is a simple way of enforcing it, BUT the downside is you need to have all three nodes up to form the shadowset.

A crucible of informative mistakes
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

Martin said:

"If your aim here is to achieve a MINICOPY then you need to dismount the member (DKBnnn) not the volume (DSAnnn)."

Ah, but you're addressing this as an either/or proposition. I think I should do both. The question is, which first.

My current thinking is that during shutdown, the system should remove first the local drive, then the shadow set. In both cases, the /Policy=MiniCopy=Optional should be issued. However, if the MountCnt is only 1, then the local drive should NOT be dismounted, just the shadow set volume (item #2 in HELP DISMOUNT /POLICY).

This process ensures that the MiniCopy bitmaps are properly written, and when a drive is removed, the local host that is responsible for that drive does the grunt work of ensuring the bitmaps are properly processed.
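
In other words, something like this in SyShutdwn (still assuming F$GETDVI's MOUNTCNT tells me what I think it does about cluster-wide mounts -- I need to verify that):

$! Node with alloclass 3: drop the local member only if another node
$! still has the volume mounted, then dismount the volume locally.
$ IF F$GETDVI("DSA10:","MOUNTCNT") .GT. 1
$ THEN
$     DISMOUNT /POLICY=MINICOPY=OPTIONAL $3$DKB200:
$ ENDIF
$ DISMOUNT /POLICY=MINICOPY=OPTIONAL DSA10: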
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

John said:

"Correct! But consider what happens if the 3rd (missing) member has the latest data?"

Our goal is overall system availability -- it's a development, not production, cluster. If a coder loses a few seconds' worth of edits due to being on a host that was shutting down or crashed before a write was completed, they can live with that, as long as they can still get at their login via another host. Most users won't be on the system during a shutdown anyway, and since we don't usually see crashes, I'm not too worried about data loss. If this were production, that would be an entirely different story.
Aaron Sakovich
Super Advisor

Re: New clustered shadowset

Oh, yet another question. The docu says the best bet is to initially do an Init/Erase on the volume, then after you've mounted it, do a Set Volume/NoErase.

Is it reasonable to shortcut that with just an Init/Erase=Init?