
Notes from an Upgrade

 
SOLVED
Kevin Raven (UK)
Frequent Advisor


Last Saturday we were tasked with upgrading, and migrating to new hardware, a rather old and very neglected Alpha 4100 running OpenVMS 7.2-1.

The plan: upgrade OpenVMS to V7.3-2 and apply ECO Patch Update 15, plus the PCSI and TCPIP patches.

A couple of distractions from the plan.
To keep things brief I have only included the amusing parts of the process.
Some steps are missing, e.g. making the ex-shadow-set-member system disk writable, MOUNT/OVERRIDE=SHADOW_MEMBERSHIP, etc.

1) All disks are shadowed.
Boot from OpenVMS 7.3-2 CD.
Select to upgrade from member 2 of system disk shadow set.
Upgrade goes to plan.
Apply PCSI patch.
Reboot
Apply Update 15 ... gets to 30% after 20 mins ... and declares insufficient disk space to continue.
After the patch rolls back, check free disk space ... shows 800,000 blocks (400 MB-ish). Clear some old log files and free up 200,000 extra blocks.
Reapply Update 15 ... same error at 30%!
set disk/rebuild
ana/disk/norepair <-- Nowt to worry about being reported.
mc sysgen create temp.file/size=300000 <-- on the system disk ... bombs out with a non-contiguous space error.
Check the created temp.file ... only 25,000 blocks in size!
Do a disk-to-disk /IMAGE backup, member A to member B.
Update 15 now goes in .....
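For the record, the sequence that finally cleared the Update 15 failure looks roughly like this in DCL. Device names here are placeholders, not the actual units:

-----------------------------------------------------------
$! Check structure, then probe for contiguous free space -
$! SYSGEN CREATE allocates the file contiguously
$ ANALYZE/DISK_STRUCTURE/NOREPAIR DKA100:
$ MCR SYSGEN
SYSGEN> CREATE DKA100:[000000]TEMP.FIL /SIZE=300000
SYSGEN> EXIT
$ DELETE DKA100:[000000]TEMP.FIL;*
$! Image backup member A to member B - the copy defragments
$! the target as a side effect
$ MOUNT/FOREIGN DKA200:
$ BACKUP/IMAGE/VERIFY DKA100: DKA200:
-----------------------------------------------------------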

2) Es45 .....New disks DGA devices.
All disks shadowed - two member shadow sets - 11 shadow sets
Boot from CD ... init the first member of each of the 10 shadow sets ... mount the first members ... restore data to disk.
Second members all blank - virgin disks
Boot first node in 2 node cluster - with quorum disk
Node A boots and mounts all 10 shadow sets
4 members have shadow copies taking place. Shadow_max_copy set to 4
6 stuck at 0% .....
Boot second node in cluster ....
First 4 disk mount ok
Next 6 error on mount ... with "member is part of another virtual unit"!
Do a manual mount of the failed disks ... MOUNT DSA1:/SHADOW=(member1) label /SYS/CLU
Disk now mounts ok with both members
Repeat for all 6 disks
Reboot node B ....now all ok
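The workaround for the six stuck sets, sketched with an illustrative virtual unit and member name:

-----------------------------------------------------------
$! On node B, after the automatic mount fails with the
$! "member of another virtual unit" error, mount each affected
$! virtual unit naming only its good (source) member:
$ MOUNT/SYS/CLUSTER DSA1:/SHADOW=($1$DGA101:) DISK1
$! The blank copy target joins the set once its full copy starts
-----------------------------------------------------------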

Comments ?
Mine is ....
Why did the Update 15 install not give a more relevant error message?

Would you call the failed mounts a bug, or OpenVMS protecting the disk?
i.e. the disk is still blank, so how can node B know which shadow set member B is part of?

Hoff
Honored Contributor
Solution

Re: Notes from an Upgrade

The system disk was probably fragmented. For reasons probably lost in the mists of time, the error messages for a disk that's full and a disk that's fragmented are unfortunately similar; a program request for storage failed, and the program doesn't bother to differentiate a contiguous file request from a more typical request. This is why the disk-to-disk worked; that defragged the disk.

New disks are seldom (never?) truly blank (and should never be assumed to be blank), and I prefer to use either BACKUP /PHYSICAL or INITIALIZE /ERASE for new disks, followed by a series of MOUNT commands (with the /CONFIRM option) when I first form the shadow sets. And FWIW, FC SAN DG disks are particularly sensitive to correct unit settings; the downside of flexible technologies such as FC SAN storage and of the BladeSystem Virtual Connect is the increased exposure to software-implemented hardware configuration failures.
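That preparation might look something like this (unit numbers and the label are invented for illustration):

-----------------------------------------------------------
$! Wipe any stale metadata from the brand-new disk first
$ INITIALIZE/ERASE $1$DGA201: SCRATCH
$! ...or clone it block-for-block from the restored member
$! (both disks mounted /FOREIGN for BACKUP/PHYSICAL):
$!   $ BACKUP/PHYSICAL $1$DGA101: $1$DGA201:
$! Then form the shadow set, confirming each member as it is added
$ MOUNT/SYSTEM/CONFIRM DSA1:/SHADOW=($1$DGA101:,$1$DGA201:) DISK1
-----------------------------------------------------------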

Without a better idea of the configuration and the MOUNT commands, exactly what happened with the shadow sets isn't certain.
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

Shadow sets ...

20 EMC based disks were presented to OpenVMS.
1) We booted from CD
2) From DCL ... init 10 of the new disks,
i.e. DISK1, DISK2, DISK3 ...
Restore to 10 disks from tape

3) On boot ...disks are mounted
MOUNT/SYS/CLUSTER DSAn: /SHADOW=($1$DGAmem1, $1$DGAmem2) DISK1

All disks mounted on first node ok.
Shadow max copy set to 4 ...thus 4 started to shadow copy onto second members.
Other 6 disks stuck at 0% copy ....
Due to the shadow max copy limit.

Second node booted ...
First 4 shadow sets mount ok
Next 6 fail to mount ...with error ..
something like ... "disk is already a member of another virtual unit" ...

On completion of boot of node B,
log in and find only 4 disks mounted; the other 6 show as remote mounts.
Issue of
$MOUNT/SYS/CLUSTER DSAn: /SHADOW=($1$DGAmem1) DISK1

Gives a message to say the shadow set is mounted, but also displays the same error.
Checking the shadow set shows it is now mounted.

Once the shadow copy starts or completes on all shadow sets, all subsequent boots are OK.



Craig A
Valued Contributor

Re: Notes from an Upgrade

Raven

[Devil's advocate mode ON]
Did you really backup one shadow set member to another (A to B)?

What would you have done if the backup had failed, or the system crashed and left both volumes in an unknown state?

Was making the system disk a 3 member shadow set an option? Then you could have dropped one member out and then backed up the primary volume to it.

Craig
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

Backup Shadow member A to shadow member B....

A - Mounted Write locked
B - To be overwritten by A using backup

Why would a system crash spoil volume A?

I have never, in 25 years, seen a crashing S/A backup destroy the source volume, write-locked or not.
Yes, the one you're writing to would be duff if it failed during the backup.

Has anyone ever had an OpenVMS server crash and corrupt a volume ?
Of course applications might leave DB's in funny states etc ...but will recover on restart. If configured correctly.

ps ...We also had a tape backup of all the shadow sets to hand....

Craig A
Valued Contributor

Re: Notes from an Upgrade

Many years ago (and no, I am not telling you how many :-) disk corruptions used to be pretty regular after power outages. I think I used to see 2 or 3 per year across a large VMS estate.

I'm not criticising your stance - I'm just flagging that there are always risks with any operation.

If you are happy with it, then I guess that is all that matters.

Personally, I wouldn't do it like that, for the simple reason that it introduces a risk that is, in my view, unnecessary when another solution would be more appropriate.

I've never seen an airliner crash but I do believe it does happen from time to time.

Craig
Hoff
Honored Contributor

Re: Notes from an Upgrade

I'm with Craig here; you're rather more aggressive here than I prefer to be with this stuff. You may not have seen problems with HBVS and with odd-ball cluster and controller glitches, but I have. That's only my opinion, and I'm not charged with maintaining and managing and recovering your data in any case.

Here, I'd erase the disks, form the disk shadow sets, then cluster the shadow sets. I'd not look to reconstitute brand-new shadow sets across multiple nodes without having some very explicit partitioning.

I almost never use MOUNT /CLUSTER, as I prefer /SYSTEM in combination with a tool such as MSCPMOUNT.COM or a local analog.

And INITIALIZE sans /ERASE was once hazardous around blank disks, particularly with OpenVMS I64. (I *think* it got fixed to overwrite the lowest and highest block ranges of a disk regardless, but I don't have a way to confirm that. And I use /ERASE as a form of basic operational verification anyway, as I'd rather kick over early-life disk hardware errors earlier rather than later.)

The CD is full BACKUP, not standalone BACKUP. (SAB ceased to exist at V6.1.) I've seen some indications the performance with the CD (Alpha) or the DVD (Integrity) isn't as much as I'd like, but I've not tracked that back. I tend to use a local BACKUP username or a local bootable environment with the V8.3 process quota settings. Not the CD or the DVD.

YMMV, of course, and this is your data and your decision.
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

As I said we had a full backup on tape.
If for any reason both members of the shadow set had been corrupted due to a server crash, then we would have restored from the tape. That would have been mounted /NOWRITE ...
Of course the tape could have also crunched.
But ... we have a mirror of the server in the DR site.
Don't worry ... we don't take risks.
As I said in the original post, I have not given all the details, to keep the post brief.
Back to OpenVMS corrupting disks during crashes ....
The disk being read from was mounted /NOWRITE.
So what you are saying is that ....
When using BACKUP, for example to back up one disk to another, you could end up with data corruption on the disk being read from, if OpenVMS fails.

Better warn customers then, that BACKUP can cause corruption if a server outage occurs while it is running.
I'll pass on the word.

PS The disks were being served by two separate HSZ70s ... with dual power feeds from different source PDUs.
These are supported by battery backup (UPS) and generators ...

When we restored the data to the shadow sets on the new server ... if we had any form of data corruption, then we would have simply restored again.
Still, it's bad the way OpenVMS handled the forming of the shadow sets on the second server.
With the potential data corruptions that BACKUP can cause, and what you are saying about the forming of shadow sets ... maybe it's good we are looking at moving away from VMS :-)

Volker Halle
Honored Contributor

Re: Notes from an Upgrade

You should trust OpenVMS backup to not corrupt disks mounted /NOWRITE !

But did you think about the possible scenario which may arise if the system crashes before finishing the copy and then reboots? No big deal if you booted from the CD. But what if you were running from your original system disk, with SYSTARTUP_VMS.COM enabled and including the MOUNT DSAx:/SHADOW=(mbr1,mbr2) command? Would you want to risk the possibility of the shadow copy operation going in the wrong direction?
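One defensive pattern (illustrative only, device names assumed): mount the known-good member alone, then add the copy target explicitly, so the copy direction can never be ambiguous:

-----------------------------------------------------------
$! Mount the virtual unit with only the trusted source member
$ MOUNT/SYSTEM DSA1:/SHADOW=($1$DGA101:) DISK1
$! Now add the second member - it can only ever be the copy target
$ MOUNT/SYSTEM DSA1:/SHADOW=($1$DGA201:) DISK1
-----------------------------------------------------------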

Volker.
Jon Pinkley
Honored Contributor

Re: Notes from an Upgrade

Craig,

If the problem was a fragmented disk, then shadowing isn't an option that will help the situation. BACKUP/IMAGE is, and as long as there isn't a typo in the backup command, I don't see a big issue. Yes, The Raven will be working with a single member of the shadow set, and therefore there is a possibility of a drive failure, or of a new hard error showing up. But that is a requirement of upgrading, as upgrades aren't allowed with the system disk shadowed. So, other than doing an ANALYZE/DISK/SHADOW prior to splitting the shadow set (and that's a relatively new command), I don't think he had a choice. If your concern is that he is overwriting his "fallback", yes, that is true, but he will still have the original in its unmodified condition. I see no advantage to putting in a third member if you are going to immediately do a BACKUP/IMAGE to it once you remove it from the shadow set. I suppose I should say: I see no benefit other than knowing you can write to every block on the disk without errors, and that can be done more quickly with INIT/ERASE or BACKUP/PHYSICAL.
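The split-and-upgrade sequence described above might be sketched like this (device and label names are assumptions, not the actual configuration):

-----------------------------------------------------------
$! Verify the members are consistent before splitting
$! (qualifier available only on recent versions)
$ ANALYZE/DISK_STRUCTURE/SHADOW DSA0:
$! Drop member B out of the system disk shadow set
$ DISMOUNT $1$DGA2:
$! ...boot the CD and upgrade the remaining single member...
$! Afterwards, re-add B; a full copy brings it back into line
$ MOUNT/SYSTEM DSA0:/SHADOW=($1$DGA2:) SYSDISK
-----------------------------------------------------------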

Raven,

My guess is that the error message you got was WRONGVU, with the second condition listed below applying:

-----------------------------------------------------------
$ help/message wrongvu

WRONGVU, device is already a member of another virtual unit

Facility: MOUNT, Mount Utility

Explanation: This message can occur under any of the following conditions:

o A shadow set member (identified in an accompanying
SHADOWFAIL message) is already mounted on another node
in the cluster as a member of a different shadow set.

o The device is the target of a shadow set copy operation,
and the copy operation has not yet started. In this case,
the storage control block (SCB) of the copy target is not
in the same location as it is on the master node. This
causes MOUNT to read the wrong SCB and fail with this
error.

o The target of the shadow set copy operation is a new,
uninitialized disk. This failure is commonly seen when a
MOUNT/CLUSTER command is issued and one or more of the
members is a new disk. The set is mounted successfully
on the local node, but all of the remote nodes report a
WRONGVU error.


User Action:

o For the first condition, specify a different member for the
shadow set you are mounting, or specify the correct virtual
unit for the member that is already mounted elsewhere.

o For the second condition, wait for the copy operation
to proceed before attempting to remount this device, or
initialize the copy target disk so that the SCB is in the
same location as it is on the master member.

o For the third condition, OpenVMS recommends that all new
disks be initialized prior to mounting them into a shadow
set.
-----------------------------------------------------------

From the description, it seems to me that there really is very little you can do once there are pending copy operations, other than wait, or dismount the member that is pending a full copy and initialize it. I think there is a typo in the explanation: where it says "master node" it should say "master member" (of the shadow set virtual unit). It does seem that the shadowing driver/ACP should verify that the SCB of each copy target is at the same LBN, and if not, initialize the minimum portion of the target disk needed to force the new SCB into the same location. If that were done before placing the disk on the full-copy-pending list, it would be enough to prevent the problem when another node booted and attempted the mount (if the explanation is correct).

Volker,

I thought the only problem case was mounting the system disk in the startup sequence, as described in the CAUTION section on page 44 of the PDF version of HP Volume Shadowing for OpenVMS Alpha 7.3-2 (Chapter 3, under "Booting from a System Disk Shadow Set"). In the case of a non-system disk, the SCBs of each member should provide protection via the generation number. Or am I mistaken?

Jon
it depends