Operating System - OpenVMS

Re: MountVerify after attempt to add third member of shadow set.

DSM_1
Advisor

MountVerify after attempt to add third member of shadow set.

We have a three-node cluster of GS1280s running VMS 7.3-2, with storage divided between a couple of MA8000s and an EVA5000. The system disk is currently a two-member shadow set across two groups in the EVA5000. We purchased an EVA8000 a while ago with the intent of shadowing critical volumes across EVAs. A test with a non-critical, low-activity volume worked fine. So we intended to move data onto the EVA8000 by using VMS volume shadowing.

Our VMS guy attempted to copy the system volume onto the EVA8000, by adding a third member to the existing shadow set. This resulted in two members of the cluster losing contact with the system disk, showing the device as MntVerifyTimeout.

We thought it must have been the EVA8000 and/or communication to it, and waited for the HP support people to upgrade firmware and check the configuration and so forth.

Yesterday, a colleague wanted to clone the system disk, to be used in another node. So he mounted a third shadowset member WITHIN the EVA5000. Once again two of the cluster nodes lost contact with the system disk. We literally had to power them off, as not even the console responded.

Any ideas?

By the way, I am not the VMS guy at this site, I just work closely with him. So my VMS knowledge is patchy.
18 REPLIES
Volker Halle
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

DSM,

welcome to the OpenVMS ITRC forum.

Please carefully check ERRLOG.SYS and OPERATOR.LOG to determine the exact sequence of events. This is a complex scenario and likely requires a lot of configuration information to be able to understand what has been going on...

If you say 'not even the console responded', did you try CTRL-P? If the system disk is offline, you would not expect any response from the console. Did you capture the console output?

Volker.
DSM_1
Advisor

Re: MountVerify after attempt to add third member of shadow set.

Thanks Volker.

Apparently our current VMS guy forgot about CTRL-P, perhaps in the panic of the situation. (Our regular VMS guy is on vacation.) He has opened a point with HP and is sending off the console logs and other assorted pieces.

I guess this is not the place for a complex diagnosis. I just thought it was possible that someone might have seen something like this before. I decided to try this forum partly because I use other (not HP) forums from time to time, while my colleagues don't.

I have attached a small section of the operator log from the node that remained up. Nothing was recorded from the other two nodes. The time frame covered mounting the new member to crashing the other two nodes.

Using "anal/error/elv translate" for a one hour period covering the significant events, there were no messages listed that looked significant to me (just timestamps and volume changes).

Normally I would stick to my own patch (Oracle) and leave VMS to the VMS guys, but this problem worries and frustrates me, and it ought to have been diagnosed properly the last time. If it looks like this is getting too complex, I will close the thread and hope HP come up with something, this time.
Volker Halle
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

The other 2 nodes (VENUS and HOYLE) reported 'DSA1: contains zero working members' 4 and 6 seconds, respectively, after OBERON started the shadow copy from DSA1: to $1$DGA500:.

The shadowing driver would have logged an errlog entry on VENUS and HOYLE about the reason for dropping the members, but as you didn't crash those nodes, that information is lost.

What are the values of SHADOW_MBR_TMO in this cluster? Are the paths to those disks as expected, i.e. does SHOW DEV/MULTI show consistent multipath counts?

Volker.
Volker Halle
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

The system disk shadowset member timeout is SHADOW_SYS_TMO, so please also check that value.

As this is V7.3-2, mount-verification messages will tend to be suppressed. Something must have happened on VENUS and HOYLE to cause the shadowing driver to drop all members from the DSA1: shadowset.

What if VENUS and HOYLE could not access the new DGA500 disk ? And OBERON adds it to the shadowset ? Is MSCP-serving of that disk enabled on OBERON ($ SHO DEV/SERVED) ?
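The checks asked for so far can be gathered with a few commands, run on each node in turn. This is only a sketch: it assumes suitable privileges and uses the parameter names mentioned in this thread.

```dcl
$ ! Display the shadow set member timeouts (data disks and system disk)
$ MCR SYSGEN
SYSGEN> SHOW SHADOW_MBR_TMO
SYSGEN> SHOW SHADOW_SYS_TMO
SYSGEN> EXIT
$ ! List any disks this node is MSCP-serving to the rest of the cluster
$ SHOW DEVICE/SERVED
```

Comparing the output from all three nodes should show whether the timeouts differ between nodes and whether any node is serving the disk.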

Volker.
DSM_1
Advisor

Re: MountVerify after attempt to add third member of shadow set.

SHADOW_MBR_TMO is the default, 120 seconds.

I have attached the results of "show dev/multi dsa1" from all three nodes.

MSCP is not loaded/enabled. Both Venus and Hoyle can see DGA500 (as evidenced by Show Dev). I believe that gets into the intricacies of fibrechannel, which is well outside my comfort zone.

A question: if the MOUNT command was run on one node without a /CLUSTER qualifier, could that cause problems?

Volker Halle
Honored Contributor
Solution

Re: MountVerify after attempt to add third member of shadow set.

When mounting an additional member into the shadowset, you just use a simple MOUNT/SYS DSAx:/SHAD=DGAx: label command without /CLUSTER.

The view of the members in the shadowset must be identical cluster-wide, i.e. every node must be able to see and access every member in the shadowset.

HOYLE and VENUS can see $1$DGA500 NOW, but could they, when the problem happened ???

Chances are that DGA500 had just been added to the EVA and MC SYSMAN IO AUTO had not been run on HOYLE and VENUS. It had to be run on OBERON, as the new member was to be mounted on that node.

And this has happened twice, always with NEW members being added to an existing cluster-wide shadowset. Am I beginning to see a pattern here?
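A sequence along the following lines would configure the new disk on every node and confirm it is visible before the member is added. It is a sketch, not a tested procedure: the device name $1$DGA500: and the shadow set DSA1: are taken from this thread, "label" stands for the real volume label, and the commands assume suitable privileges.

```dcl
$ ! Configure the newly presented FC disk on ALL cluster nodes, then check it
$ MCR SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> IO AUTOCONFIGURE
SYSMAN> DO SHOW DEVICE $1$DGA500:
SYSMAN> EXIT
$ ! Only if every node listed the device: add it as a new shadow set member
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA500:) label
```

If any node reports that it cannot find the device in the DO SHOW DEVICE output, the MOUNT should not be issued until that is resolved.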

Volker.
DSM_1
Advisor

Re: MountVerify after attempt to add third member of shadow set.

I will ask tomorrow my time and post a reply then. Thanks for your time and interest so far!
Wim Van den Wyngaert
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

I don't see any evidence of dga500 visible on all 3 nodes.

Wim
Dean McGorrill
Valued Contributor

Re: MountVerify after attempt to add third member of shadow set.

Neither do I, Wim. And from the log I'd guess that SYSMAN IO AUTO not being run is it, as Volker suggests. Dean
Thomas Ritter
Respected Contributor

Re: MountVerify after attempt to add third member of shadow set.

DSM, what was the actual mount command?
Could the /SYSTEM qualifier have been omitted?
Have you checked on each node that the three disks are visible?
DSM_1
Advisor

Re: MountVerify after attempt to add third member of shadow set.

I don't have a definitive answer for the first event, yet. But for this week's event, my colleague confirms that he did not run SYSMAN IO AUTO on all three nodes. He created the volume in the EVA, presented it to all three nodes from a fibrechannel perspective, ran SYSMAN IO AUTO on the one node (Oberon), only, and then ran the mount command. This sounds like a good explanation to me.

In answer to the other questions: The device is currently visible on all three nodes (using SHOW DEVICE), but not mounted. I have not posted this display. But the other nodes have been bounced since then, so the new device would have been picked up then. The original MOUNT command was done with the /SYSTEM qualifier. It was dismounted immediately after the shadow copy finished. This shows up at 14:54:57.69, in the log portion I posted.

A new question: Ideally, I would like to see this hypothesis tested with another shadow set, before it gets tried again on something critical. But, if we can reproduce this situation, with a volume showing as MntVerifyTimeout, is it easy to recover? Would a SYSMAN IO AUTO fix it at that point?
Volker Halle
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

DSM,

so it does look like my theory did hit the point...

What happens is this: You mount the 3rd member on one node, it does see the new disk (and the old ones, of course) and happily adds it to the shadowset. A shadowset must have the same state across all nodes it's mounted on, so the other 2 nodes, not being able to see the new member, just drop ALL the members from their view of the shadowset and immediately end up with a shadowset with zero members (shown as MntVerifyTimeout). You can NOT recover from that, because it's already too late. If there are no open files on that shadowset on those 2 nodes, you could DISM/ABORT DSA1:, then run SYSMAN IO AUTO and then remount the shadowset with its current members, i.e. MOUNT/SYS DSA1: label. But if this is a system disk, you're out of luck.
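Spelled out in full, the recovery steps for a non-system shadowset would look roughly like this on each affected node. This is a sketch under the conditions stated above (no open files on the shadowset on that node); "label" stands for the real volume label.

```dcl
$ ! Run on each node that shows the shadowset as MntVerifyTimeout
$ DISMOUNT/ABORT DSA1:
$ ! Configure the new member's device so this node can see it
$ MCR SYSMAN IO AUTOCONFIGURE
$ ! Remount the shadowset; it is joined with its current members
$ MOUNT/SYSTEM DSA1: label
```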

If you had enabled MSCP-serving across all nodes, you could have prevented this fatal scenario. OpenVMS supports failover to the MSCP-served path at any time, and V7.3-2 also supports failback, i.e. the path will go back to the FC path if that becomes available. In such a config, you could actually pull out all FC cables from one node and it will continue to access the disks via the MSCP-served path, although it will not be able to boot that way.

Volker.
DSM_1
Advisor

Re: MountVerify after attempt to add third member of shadow set.

Thanks. I am tempted to keep asking questions. I know nothing about MSCP. I shall do some reading and googling.

But I think my original question has been answered.

I am not sure what the norm is in this forum. If it is up to me to close the thread, I will leave it another 24 hours in case there are further comments and then close it.
Volker Halle
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

DSM,

MSCP means 'serving' the disks via the cluster interconnect. This presents a path to a disk 'local' to another system. FC disks are considered 'local' in this context.

You can close the topic at any time. It is still possible to enter replies to a closed topic.

I hope you enjoyed the 'ITRC OpenVMS forum' experience ;-)

Volker.
Dean McGorrill
Valued Contributor

Re: MountVerify after attempt to add third member of shadow set.

So it was your theory, Volker. It would seem to me some user defensive coding by vms could check with the other nodes and return something like "device not configured on other cluster member" in a case like this. (Not that it will get done..)

Dean
Robert Brooks_1
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

it would seem to me some user defensive coding by vms could check with the other nodes and return something like "device not configured on other cluster member" in a case like this. (Not that it will get done..)

---

There has been some talk of doing this. For various reasons, the ability to do that didn't exist until recently. What is needed (and now exists) is a "voting" scheme where one node could "veto" a change that another node is proposing, if it doesn't have the capabilities needed to allow the proposed change. Dissimilar device shadowing uses this construct.

That's not to say that the check to add a new member will be done, but it has been discussed. In any event, it wouldn't happen until V8.4.

-- Rob
Dean McGorrill
Valued Contributor

Re: MountVerify after attempt to add third member of shadow set.

DSM,
I almost always enable MSCP (Mass Storage Control Protocol), and it might
have protected you on this one.

Rob,
I assumed (wrongly) that they
had some defensive protection for a scenario like DSM's, hence my post. Tx for your
post, I hope they implement it. Dean
Volker Halle
Honored Contributor

Re: MountVerify after attempt to add third member of shadow set.

DSM,

now, if you would have been running OpenVMS I64 V8.3, here is the patch you would want to install: VMS83I_SYS-V0300

5.2.11 Verify New Shadow Member

5.2.11.1 Problem Description:

If a disk is not presented to all nodes in the cluster, yet the member is added to a shadow set on all nodes, the shadow set will end up in Mount Verification and unable to recover, even after SHADOW_MBR_TMO seconds.

5.2.11.3 Problem Analysis:

A proposed new member is validated on the node which is performing the actual mount. On other nodes, it is added via a "trigger validate". However, if the UCB for the device does not exist, then the trigger validate can not complete but the member cannot be expelled properly because it is not yet a valid member of the shadow set on that node.


I would not expect this to be ever back-ported to V7.3-2

Volker.