Operating System - OpenVMS

Notes from an Upgrade
SOLVED
Kevin Raven (UK)
Frequent Advisor

Notes from an Upgrade

Last Saturday, we were tasked with upgrading, and migrating to new hardware, a rather old and very neglected Alpha 4100 running OpenVMS 7.2-1.

Upgrade OpenVMS to V7.3-2 and apply the Update 15 ECO. We also applied the PCSI patch and the TCPIP patch.

A couple of distractions from the plan follow.
To keep things brief I have only included the amusing parts of the process.
Some steps are missing, e.g. making the ex-shadow-set-member system disk writable: MOUNT/OVER=SHADOW ....

1) All disks are shadowed.
Boot from OpenVMS 7.3-2 CD.
Select to upgrade from member 2 of system disk shadow set.
Upgrade goes to plan.
Apply PCSI patch.
Reboot
Apply Update 15 ..... gets to 30% after 20 mins ... and declares insufficient disk space to continue.
After the patch rolls back, check free disk space: 800,000 blocks, roughly 400 MB. Clear some old log files and free up 200,000 extra blocks.
Reapply Update 15 ..... same error at 30%!
set disk/rebuild
ana/disk/norepair <-- nowt to worry about being reported.
mc sysgen create temp.file/size=300000 on the system disk <-- bombs out with a non-contiguous-space error ....
Check the created temp.file ... only 25,000 blocks in size!
Do a disk-to-disk /IMAGE backup, member A to member B
Update 15 now goes in .....
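For the record, the sequence that finally got Update 15 in looked roughly like this. This is a sketch only, booted from the V7.3-2 CD; the device names ($1$DGA1:, $1$DGA2:, DSA0:) and label are placeholders, not the literal commands we used:

```dcl
$! Sketch: defragment the system disk by image-copying member A to member B
$! (booted from the CD, so the system disk is not in use)
$ MOUNT/OVERRIDE=SHADOW_MEMBERSHIP $1$DGA1: SYSDISK  ! ex-member A, now writable
$ MOUNT/FOREIGN $1$DGA2:                             ! member B is the copy target
$ BACKUP/IMAGE $1$DGA1: $1$DGA2:   ! image copy lays files out contiguously
$ DISMOUNT $1$DGA1:
$ DISMOUNT $1$DGA2:
```

The image copy is what matters here: it rewrites the file structure on the target, so the contiguous-space request from the patch installer can then succeed.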

2) ES45 ..... New disks, DGA devices.
All disks shadowed - two member shadow sets - 11 shadow sets
Boot from CD ..... init the first member of 10 shadow sets .... mount the first members .... restore data to disk.
Second members all blank - virgin disks
Boot first node in 2 node cluster - with quorum disk
Node A boots and mounts all 10 shadow sets
4 members have shadow copies taking place. Shadow_max_copy set to 4
6 stuck at 0% .....
Boot second node in cluster ....
First 4 disks mount OK
Next 6 error on mount, with "member part of another virtual disk unit"!
Do a manual mount of the failed disks: MOUNT DSA1: /SHADOW=(member1) label /SYS/CLU
Disk now mounts OK with both members
Repeat for all 6 disks
Reboot node B .... now all OK

Comments?
Mine is ....
Why did the Update 15 install not give a more relevant error message?

Would you call the failed mounts a bug, or OpenVMS protecting the disk?
i.e. the disk is still blank, so how can node B know what shadow set member B is part of?

16 REPLIES
Hoff
Honored Contributor
Solution

Re: Notes from an Upgrade

The system disk was probably fragmented. For reasons probably lost in the mists of time, the error messages for a disk that's full and a disk that's fragmented are unfortunately similar; a program request for storage failed, and the program doesn't bother to differentiate a contiguous file request from a more typical request. This is why the disk-to-disk worked; that defragged the disk.

New disks are seldom (never?) truly blank (and should never be assumed to be blank), and I prefer to use either BACKUP /PHYSICAL or INITIALIZE /ERASE for new disks and then followed by a series of MOUNT commands (with the /CONFIRM option) when I first form the shadow sets. And FWIW, FC SAN DG disks are particularly sensitive to correct unit settings; the downside of flexible technologies such as FC SAN storage and of the BladeSystem Virtual Connect is the increased exposure to software-implemented hardware configuration failures.
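A hedged sketch of that preparation, with made-up device names and a made-up label (the point is the /ERASE before the set is formed, and /CONFIRM when it is):

```dcl
$! Illustrative only: erase brand-new members, then form the set with /CONFIRM
$ INITIALIZE/ERASE $1$DGA101: DATA1   ! zero the disk; kicks over early media errors
$ INITIALIZE/ERASE $1$DGA201: DATA1
$ MOUNT/SYSTEM/CONFIRM DSA1: /SHADOW=($1$DGA101:,$1$DGA201:) DATA1
```

/CONFIRM makes MOUNT prompt before any copy operation starts, which is a cheap guard against a copy running in the wrong direction.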

Without having a better idea of the configuration and the MOUNT commands, exactly what happened with the shadowsets isn't certain.
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

Shadow sets ...

20 EMC based disks were presented to OpenVMS.
1) We booted from CD
2) From DCL, init of 10 of the new disks,
i.e. DISK1, DISK2, DISK3 ....
Restore to the 10 disks from tape

3) On boot ...disks are mounted
MOUNT/SYS/CLUSTER DSAn: /SHADOW=($1$DGAmem1, $1$DGAmem2) DISK1

All disks mounted on first node ok.
Shadow max copy set to 4, thus 4 started to shadow-copy onto their second members.
The other 6 disks stuck at 0% copy ....
Due to the shadow max copy limit

Second node booted ...
First 4 shadow sets mount ok
Next 6 fail to mount, with an error
something like ... disk already a member of another virtual disk ...

On completion of the boot of node B,
log in and find only 4 disks mounted; the other 6 show as remotely mounted.
Issuing
$ MOUNT/SYS/CLUSTER DSAn: /SHADOW=($1$DGAmem1) DISK1

gives a message to say the shadow set is mounted, but with the same error also displayed.
Checking the shadow set shows it is now mounted.

Once the shadow copy starts or completes on all shadow sets, all subsequent boots are OK.



Craig A
Valued Contributor

Re: Notes from an Upgrade

Raven

[Devil's advocate mode ON]
Did you really backup one shadow set member to another (A to B)?

What would you have done if the backup had failed, or the system crashed and left both volumes in an unknown state?

Was making the system disk a 3 member shadow set an option? Then you could have dropped one member out and then backed up the primary volume to it.

Craig
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

Backup Shadow member A to shadow member B....

A - Mounted Write locked
B - To be overwritten by A using backup

Why would a system crash spoil volume A?

I have never, in 25 years, seen a crashing S/A backup destroy a volume, write-locked or not.
Yes, the one you're writing to would be duff if the backup failed partway through.

Has anyone ever had an OpenVMS server crash and corrupt a volume ?
Of course applications might leave DBs in funny states etc., but they will recover on restart, if configured correctly.

ps ...We also had a tape backup of all the shadow sets to hand....

Craig A
Valued Contributor

Re: Notes from an Upgrade

Many years ago (and no, I am not telling you how many :-) disk corruptions used to be pretty regular after power outages. I think I used to see 2 or 3 per year across a large VMS estate.

I'm not criticising your stance - I'm just flagging that there are always risks with any operation.

If you are happy with it, then I guess that is all that matters.

Personally, I wouldn't do it like that, for the simple reason that it introduces a risk that is, in my view, unnecessary when another solution would be more appropriate.

I've never seen an airliner crash but I do believe it does happen from time to time.

Craig
Hoff
Honored Contributor

Re: Notes from an Upgrade

I'm with Craig here; you're rather more aggressive with this stuff than I prefer to be. You may not have seen problems with HBVS and with odd-ball cluster and controller glitches, but I have. That's only my opinion, and I'm not charged with maintaining, managing, and recovering your data in any case.

Here, I'd erase the disks, form the disk shadow sets, then cluster the shadow sets. I'd not look to reconstitute brand-new shadow sets across multiple nodes without having some very explicit partitioning.

I almost never use MOUNT /CLUSTER, as I prefer /SYSTEM in combination with a tool such as MSCPMOUNT.COM or a local analog.

And INITIALIZE sans /ERASE was once hazardous around blank disks, particularly with OpenVMS I64. (I *think* it got fixed to overwrite the lowest and highest block ranges of a disk regardless, but I don't have a way to confirm that. And I use /ERASE as a form of basic operational verification anyway, as I'd rather kick over early-life disk hardware errors earlier rather than later.)

The CD is full BACKUP, not standalone BACKUP. (SAB ceased to exist at V6.1.) I've seen some indications the performance with the CD (Alpha) or the DVD (Integrity) isn't as much as I'd like, but I've not tracked that back. I tend to use a local BACKUP username or a local bootable environment with the V8.3 process quota settings. Not the CD or the DVD.

YMMV, of course, and this is your data and your decision.
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

As I said we had a full backup on tape.
If for any reason both members of the shadow set had been corrupted due to a server crash, then we would have restored from the tape. That would have been mounted /NOWRITE .....
Of course the tape could have also crunched.
But .....we a mirror of the server in the DR site.
Don't worry .....we dont take risks ...
As I said in the original post, I have not given all the details, to keep the post brief.
Back to OpenVMS corrupting disks during crashes ....
The disk being read from was mounted /NOWRITE.
So what you are saying is that when using BACKUP, for example to back up a disk to another disk, you could end up with data corruption on the disk being read from if OpenVMS fails.

Better warn customers, then, that BACKUP can cause corruption if a server outage occurs while it is running.
I'll pass on the word.

PS: The disks were being served by two separate HSZ70s, with dual power feeds from different source PDUs.
These are supported by battery-backed PSUs and generators ....

When we restored the data to the shadow sets on the new server, if we had had any form of data corruption, then we would have simply restored again ....
Still, it's bad the way OpenVMS handled the condition of forming the shadow sets on the second server.
With the potential data corruption that Backup can cause, and what you are saying about the forming of shadow sets, maybe it's good we are looking at moving away from VMS :-)

Volker Halle
Honored Contributor

Re: Notes from an Upgrade

You should trust OpenVMS backup to not corrupt disks mounted /NOWRITE !

But did you think about the possible scenario which may arise if the system crashes before finishing the copy and then reboots? No big deal if you booted from the CD. But what if you were running from your original system disk, with SYSTARTUP_VMS.COM enabled and including the MOUNT DSAx:/SHADOW=(mbr1,mbr2) command? Would you want to risk the possibility of the shadow copy operation going in the wrong direction?

Volker.
Jon Pinkley
Honored Contributor

Re: Notes from an Upgrade

Craig,

If the problem was a fragmented disk, then using shadowing isn't an option that will help the situation. BACKUP/IMAGE is, and as long as there isn't a typo in the backup command, I don't see a big issue. Yes, The Raven will be working with a single member of the shadow set, and therefore there is a possibility of a drive failure or a new hard error showing up. That is a requirement of upgrading, though, as upgrades aren't allowed with the system disk shadowed. So other than doing an ANA/DISK/SHADOW prior to splitting the shadow set (and that's a relatively new command), I don't think he had a choice. If your concern is that he is overwriting his "fallback", yes, that is true, but he will still have the original in its unmodified condition. I see no advantage to putting in a third member if you are going to immediately do a BACKUP/IMAGE to it once you remove it from the shadow set. I suppose I should say: I see no benefit other than knowing you can write to every block on the disk without errors, and that can be done more quickly with INIT/ERASE or BACKUP/PHYSICAL.
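The split-and-upgrade sequence described above might be sketched like this. Device names are illustrative, and ANALYZE/DISK_STRUCTURE/SHADOW only exists on relatively recent VMS versions:

```dcl
$! Illustrative: check the set, split it, upgrade one ex-member
$ ANALYZE/DISK_STRUCTURE/SHADOW DSA0:   ! compare members before splitting
$ DISMOUNT/CLUSTER DSA0:                ! drop the virtual unit
$! Boot the V7.3-2 CD and upgrade member 2 ($1$DGA2:) only;
$! member 1 ($1$DGA1:) stays untouched as the fallback.
```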

Raven,

My guess is the error message you got was WRONGVU, with the second condition listed below:

-----------------------------------------------------------
$ help/message wrongvu

WRONGVU, device is already a member of another virtual unit

Facility: MOUNT, Mount Utility

Explanation: This message can occur under any of the following conditions:

o A shadow set member (identified in an accompanying
SHADOWFAIL message) is already mounted on another node
in the cluster as a member of a different shadow set.

o The device is the target of a shadow set copy operation,
and the copy operation has not yet started. In this case,
the storage control block (SCB) of the copy target is not
in the same location as it is on the master node. This
causes MOUNT to read the wrong SCB and fail with this
error.

o The target of the shadow set copy operation is a new,
uninitialized disk. This failure is commonly seen when a
MOUNT/CLUSTER command is issued and one or more of the
members is a new disk. The set is mounted successfully
on the local node, but all of the remote nodes report a
WRONGVU error.


User Action:

o For the first condition, specify a different member for the
shadow set you are mounting, or specify the correct virtual
unit for the member that is already mounted elsewhere.

o For the second condition, wait for the copy operation
to proceed before attempting to remount this device, or
initialize the copy target disk so that the SCB is in the
same location as it is on the master member.

o For the third condition, OpenVMS recommends that all new
disks be initialized prior to mounting them into a shadow
set.
-----------------------------------------------------------

From the description, it seems to me that there really is very little you can do once there are pending copy operations, other than wait, or dismount the member that is pending a full copy and initialize it. I think there is a typo in the explanation: where it says master node it should say master member (of the shadow set virtual unit). It does seem that the shadowing driver/ACP should perhaps verify that the SCBs of all target disks are at the same LBN and, if not, initialize the minimum portion of each target disk needed to force the new SCB into the same location. If this were done before placing the disk on the full-copy-pending list, it would be enough to prevent the problem when another node booted and attempted the mount (if the explanation is correct).

Volker,

I thought the only problem case is mounting the system disk in the startup sequence, as described in the CAUTION section on page 44 of the PDF version of HP Volume Shadowing for OpenVMS Alpha 7.3-2. This is in Chapter 3 under Booting from a System Disk Shadow Set. In the case of a non-system disk, then the SCBs of each member should provide protection, via the generation number. Or am I mistaken?

Jon
it depends
Craig A
Valued Contributor

Re: Notes from an Upgrade

Jon

>If your concern is that he is overwriting his "fallback", yes, that is true, but he will still have the original in its unmodified condition.

Agreed. He will still have it, but whether it will be usable is another thing. If I have to go back to a tape backup as part of a recovery procedure for a piece of work initiated by me, then I would consider that a failure.

In 1995 I remember a client who had run a database backup to tape. It was only when they needed to restore it that they discovered a severe crease in tape 8 of 10. Apparently a /VERIFY (which they didn't do) would have flagged the crease.

It took something like a month to get a database expert to rebuild the database to a known safe point.

Now backup technology has moved on considerably since then BUT it does kind of stick in one's memory.

I guess the other thing to mention is that some sites operate as near to 24 x 7 as possible, so using the downtime of a system upgrade to defrag the system disk might be a useful exercise.

Much of it, I admit, is down to personal preference and whether one is happy with the [perceived] level of risk.

Craig
Volker Halle
Honored Contributor

Re: Notes from an Upgrade

Jon,

assume you do a BACKUP/IMAGE DSAx: mbr2, then DISM/CLUS DSAx: and MOUNT/SYS DSAx:/SHAD=(mbr1,mbr2). In which direction would you expect (and want) the shadow copy to go?

Time for a little experiment with LD devices, maybe ...
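From memory (so check LD HELP for the exact syntax), such an experiment might look like the following; file names, sizes, and unit numbers are all made up:

```dcl
$! Sketch of a copy-direction test using LD container disks
$ LD CREATE SYS$SCRATCH:MBR1.DSK /SIZE=20000
$ LD CREATE SYS$SCRATCH:MBR2.DSK /SIZE=20000
$ LD CONNECT SYS$SCRATCH:MBR1.DSK LDA1:
$ LD CONNECT SYS$SCRATCH:MBR2.DSK LDA2:
$ INITIALIZE LDA1: SHADTEST
$ MOUNT/SYSTEM DSA9: /SHADOW=(LDA1:) SHADTEST   ! single-member set
$! ... create some files, DISMOUNT DSA9:, image-copy LDA1: to LDA2:,
$! then remount with both members and watch which way the copy runs:
$ MOUNT/SYSTEM DSA9: /SHADOW=(LDA1:,LDA2:) SHADTEST
```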

Volker.

Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

I will detail what we had ....

1 - Disk mounted /NOWRITE as source of the backup to a second disk.
2 - Tape backup: production-site copy.
3 - Tape backup sent to the DR site and restored to the DR cluster.

So if for some reason
the server crashed, or both HSZ70s crashed, or all the power supplies, UPS, and generators failed, and VMS BACKUP corrupted a disk mounted /NOWRITE,
then restore from tape.
If the tape crunched, then recall the second tape from the DR site.

If the whole datacentre went up in flames, then we have the upgraded server and the current DR system in the DR site.
We upgraded and prepared the DR server the week prior.

I would still like to know if anyone has ever had a server crash cause OpenVMS BACKUP to corrupt a write-locked source disk.
Jon Pinkley
Honored Contributor

Re: Notes from an Upgrade

Volker,

I would expect the DSA to be restored to the state of the DSA at the time it was dismounted. In other words, I would expect mbr2 to be overwritten. If I wanted to use the backup image target (mbr2) as the master member, then I should dismount the DSA virtual unit and remount with the single member, then add the other members. Even if all I did was dismount the DSA and remount with MOUNT/SYS DSAx:/SHAD=(mbr2), mbr2 would then have the highest generation number, and mbr2 should be used as the master of a shadow set created with a subsequent MOUNT/SYS DSAx:/SHAD=(mbr1,mbr2). If the system crashes while the BACKUP/IMAGE DSAx: mbr2 is active, the SCB on mbr2 will not reflect that it was a member of a shadow set, and I believe its generation number will be 0 (17-NOV-1858).
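Sketching that ordering with illustrative device names, the key step being the single-member mount that bumps mbr2's generation number:

```dcl
$! Illustrative: make the image-copy target (mbr2) the master of the new set
$ DISMOUNT/CLUSTER DSA1:
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA2:) DATA1   ! mbr2 alone; gets highest generation
$ DISMOUNT DSA1:
$ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA1:,$1$DGA2:) DATA1  ! copy should go mbr2 -> mbr1
```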

I would consider a BACKUP/PHYSICAL of a shadow set member to be more dangerous than a BACKUP/IMAGE: if the SCB had been copied to the target before the crash, it is very possible that the mount would succeed without a copy operation.

I agree that a test with LD would be the definitive answer.

Jon

Raven,

Given your scenario, I can think of no way that the disk mounted /NOWRITE would be modified, with the exception of a hardware failure or a bug in kernel-mode code. (I am not sure what would happen if Dump Off System Disk was active and pointed at mbr2, but that's a stretch.) If the unit were write-protected at the HSZ level, then the probability of corruption would be about as close to zero as possible.

The biggest problem I have seen with power outages is disks that fail to spin up once they spin down and cool off.
it depends
Craig A
Valued Contributor

Re: Notes from an Upgrade

I've worked at some sites where I've been up against the clock in terms of the downtime window permitted for certain work.

This is even more relevant when there is little contingency time built into a plan.
Sometimes it is simply impossible to argue that more downtime is needed when the business says "No!".

When the time comes to abort, all I want to do is halt the current activity and reboot the system from an untouched system disk.

I do not want to have to second guess whether that volume is going to be OK or restore from another place (tape or elsewhere).

Craig
Kevin Raven (UK)
Frequent Advisor

Re: Notes from an Upgrade

Craig....

Yes ...I know the situation well.
That's why all our plans have rollback time built into them, and cut-off points.

I could tell you some war stories from what I have seen in the past ....lol

Anyone remember HSC backup ?
Got called to a customer site once where they had done an HSC backup of a disk.
Backup to tape: 4 volumes.
Volume 1 ... remove, mount volume 2 ... remove, mount volume 3 ... remove, mount volume 2 again!!!
They then restored the disk:
mount volume 1 ... restore ... mount volume 2 ... bang.
When I got to site: err, what are you expecting me to do?
A magic wand? Hold on a moment ...

They had to restore from last good backup ...a year ago...


Bengt Torin
Occasional Advisor

Re: Notes from an Upgrade

Hi,
As a funny thing, I have seen the insufficient-disk-space problem on an 8.3 system where I applied UPDATE patches. As you know, you get some questions in the start phase asking if you want files to be renamed to *.*OLD. I answered NO to that question, but it did the renaming anyway! So I ran out of disk space. I did not put any more effort into finding out why it was done this way.


Regards