Operating System - OpenVMS

%BACKUP-F-CLUSTER, unsuitable cluster factor redux

 
SOLVED
BrianT_1
Regular Advisor

%BACKUP-F-CLUSTER, unsuitable cluster factor redux

In another thread, I posted:

I'm still using a VAX with OpenVMS V7.3. I'm getting this same error whether I init the disk first myself or try to allow BACKUP to init it. I have a cluster factor of 8 and the disk had been running for _years_ with no problem (until it failed). The total size is 53294505, so that should allow 53294505/9=5921611 files. I specified only /maximum_files=1500000. No matter what I do, I cannot restore this backup with /IMAGE. I can, of course, restore it without /IMAGE, but how can BACKUP create the saveset without complaint and then not let me restore it exactly as it was?

Hoff responded to that and his responses and my answers to his questions are interspersed.

H> Are you current on your ECOs for OpenVMS
H> VAX V7.3? BACKUP and SCSI and UPDATE would
H> be the obvious targets. (If not, load
H> these now and try again.)

I am up to date on these, according to the master list of ECOs, with the exception of a couple of ECOs that don't apply to me.

H> How big was the old disk, and how big is
H> the new disk? (If I've done the math
H> correctly, it looks like it might be an
H> RZ23 disk? And is this really a 104 MB
H> disk?)

It is an RZ1D, 9GB disk. Both the old disk and the new disk are physically identical. It's an HSD-hosted raidset of four RZ1Ds and three of the four building devices are physically the same devices. I had to replace one.

H> Which VAX? (There are other disk-capacity
H> issues, depending on the VAX model and VAX
H> console.)

VAX 7730. It's the same system that was hosting the disk before the restore and with which the disk was initialized originally and on which the backup was taken.

H> Is this BACKUP /IMAGE a system disk?

No.

H> How many files were on the old disk?

I really don't know. I didn't count.

H> What command(s) did you use to restore the
H> disk?

$ backup/image tape:saveset disk:/init

I also initialized the disk manually with

$ init/sys/own=system/max=1500000 -
/head=3000000/clust=8 disk: label

(which appears to be the original INIT command) and used

$ backup/image tape:saveset disk:/noinit

H> To INITIALIZE the disk?

I tried the above and I also tried just

$ init/sys/own=system disk: label

H> What was the original BACKUP command? (You
H> can get this from the BACKUP /LIST -- just
H> post the whole header.)

$ BACKUP/IMAGE/RECORD/IGNORE=(INTERL,LABEL) -
disk: tape:saveset/BLOCK=32256

H> There are cases where BACKUP cannot
H> restore a disk image due to conflicts in
H> the structures, too.

And this would be a bug. BACKUP should always be able to restore a disk that was running successfully and which it had no problem backing up in the first place. BACKUP should always be able to initialize a disk with the same values INITIALIZE found acceptable.

H> As for the specified maximum file count
H> here, I've yet to encounter an OpenVMS
H> system that has a disk anywhere near full
H> of one-cluster files. Is that really the
H> case here?

No, it's not.

Here's the header information from the BACKUP saveset.

Save set: $1$DUA103.40
Written by: BACKUP
UIC: [000010,000040]
Date: 11-OCT-2008 13:18:52.88
Command: BACKUP/IMAGE/RECORD/IGNORE=(INTERLOCK,LABEL) $1$DUA103: $10$MUA16:$1$DUA103.40/BLOCK=32256
Operating system: OpenVMS VAX version V7.3
BACKUP version: V7.3
CPU ID register: 13000202
Node name: _CASS::
Written on: _$10$MUA16:
Block size: 32256
Group size: 10
Buffer count: 503

Image save of volume set
Number of volumes: 1

Volume attributes
Structure level: 2
Label: AIRBUS_DISK
Owner:
Owner UIC: [000001,000004]
Creation date: 19-APR-1999 15:29:04.89
Total blocks: 53294505
Access count: 3
Cluster size: 8
Data check: No Read, No Write
Extension size: 5
File protection: System:RWED, Owner:RWED, Group:RE, World:
Maximum files: 2960805
Volume protection: System:RWCD, Owner:RWCD, Group:RWCD, World:RWCD
Windows: 7
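
As a cross-check on the header above, the file-count figures can be reproduced with integer arithmetic. This is a minimal sketch, assuming the formulas given in the INITIALIZE /MAXIMUM_FILES documentation (upper bound = volume size / (cluster factor + 1); default = half of that). The function names are illustrative only and are not part of any VMS utility.

```python
# Figures taken from the BACKUP/LIST header in this post.
TOTAL_BLOCKS = 53294505
CLUSTER_FACTOR = 8


def max_files_upper_bound(total_blocks, cluster_factor):
    """Assumed formula: volume size / (cluster factor + 1)."""
    return total_blocks // (cluster_factor + 1)


def max_files_default(total_blocks, cluster_factor):
    """Assumed formula: volume size / ((cluster factor + 1) * 2)."""
    return total_blocks // ((cluster_factor + 1) * 2)


print(max_files_upper_bound(TOTAL_BLOCKS, CLUSTER_FACTOR))  # 5921611
print(max_files_default(TOTAL_BLOCKS, CLUSTER_FACTOR))      # 2960805
```

The second value equals the "Maximum files: 2960805" line in the header, which suggests (but does not prove) that the original volume was initialized with the default /MAXIMUM_FILES for a cluster factor of 8.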

H> CLUSTER, unsuitable cluster factor
H> for 'device-name'
H>
H> Facility: BACKUP, Backup Utility
H>
H> Explanation: During an attempt to
H> initialize an output volume, the Backup
H> utility found that the cluster factor was
H> too large or too small for the specified
H> device.
H>
H> User Action: If the input is a save set,
H> use the BACKUP/LIST command to determine
H> the volume initialization parameters of
H> the input volumes. Refer to the
H> description of the DCL command
H> INITIALIZE, determine a suitable cluster
H> factor, and initialize the output volumes
H> using the INITIALIZE command. Then,
H> reenter the command specifying
H> the /NOINITIALIZE qualifier.

So, based on the volume initialization parameters, what might I be doing wrong, if anything? I may need to restore this or another disk again.

A side question: how does /DIRECTORIES enter into this, if at all? That just controls preallocation of 000000.DIR, correct?
82 REPLIES
Hoff
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

This BACKUP saveset is already potentially corrupt, given the command used for its original creation.

The host controller (in another discussion) was reporting errors. I will presume those have been resolved, though it is not clear whether those errors could also have contributed to saveset corruptions.

If the following DCL command:

BACKUP /IMAGE ddcu:saveset/SAVE ddcu:

doesn't resolve this, I'd try another approach, one not involving this particular OpenVMS VAX version and ECO or this particular OpenVMS VAX box. I'd ask that the BACKUP command not be edited, adjusted or altered; that no command qualifiers or tweaks be applied to the syntax.

As a particular alternative, try OpenVMS Alpha V8.3 or OpenVMS I64 V8.3; both of these releases have newer BACKUP bits.

Or you can call in some help.

BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

H> This BACKUP saveset is already
H> potentially corrupt, given the command
H> used for its original creation.

And why is that? What portion of the original BACKUP command would lead to this corruption? I would contend that the saveset is not corrupt because I was able to restore the data by leaving off the /IMAGE qualifier. It just took much longer to restore.

H> The host controller (in another
H> discussion) was reporting errors. I will
H> presume those have been resolved, though
H> it is not clear whether those errors
H> could also have contributed to saveset
H> corruptions.

I have no way of knowing what the errors even mean, so I can't determine if they've been resolved.

H> If the following DCL command:
H>
H> BACKUP /IMAGE ddcu:saveset/SAVE ddcu:
H>
H> doesn't resolve this,

And since I stated that I already used this command (/SAVE is implied when using a tape device for the saveset, and I included both /INIT and /NOINIT in trials; it's got to be one or the other), we know that it doesn't. For completeness, I tried it with neither /INIT nor /NOINIT and, of course, it didn't change anything.

H> I'd try another
H> approach and an approach not involving
H> this particular OpenVMS VAX version and
H> ECO or this particular OpenVMS VAX box.

Will you give me a VMS system with which to do this? I have no access to anything but what I have.

H> I'd ask that the BACKUP command not be
H> edited, adjusted or altered; that no
H> command qualifiers nor tweaks be applied
H> to the syntax.

With or without tweaks, BACKUP in OpenVMS VAX V7.3 is clearly broken.

H> As a particular alternative, try OpenVMS
H> Alpha V8.3 or OpenVMS I64 V8.3; both of
H> these releases have newer BACKUP bits.

Could you suggest a way to do this? I have no access to any of that hardware or software here. I do have OpenVMS Alpha V7.3-a on an AlphaServer 4/233 but it's a member of a cluster not connected to the one where I must restore the data.

H> Or you can call in some help.

I tried that. I sent HP a message four days ago via the www.openvms.compaq.com website asking that someone contact me. I received a robo-response, but nothing else.
Robert Gezelter
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Brian,

If I am reading the posts correctly, Hoff is referring to the fact that file updates could be in progress during the BACKUP operation. IO errors during the BACKUP could result in an inherently corrupted Save Set.

Restoring a BACKUP Save Set without image could cause a problem if there are any aliased files on the volume (which is why people are warned to be careful when restoring system volumes).

Personally, I would attempt to recreate this with a very small test case and a non-tape Save Set. If there is a small reproducer, then it is far easier to get attention, something I learned many years ago when dealing with various support organizations at client's behest.

- Bob Gezelter, http://www.rlgsc.com
BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

BG> If I am reading the posts correctly,
BG> Hoff is referring to the fact that file
BG> updates could be in progress during the
BG> BACKUP operation. IO errors during the
BG> BACKUP could result in an inherently
BG> corrupted Save Set.

If it were inherently corrupt, it would not restore at all, with or without /IMAGE, I would think.

BG> Restoring a BACKUP Save Set without
BG> image could cause a problem if there
BG> are any aliased files on the volume
BG> (which is why people are warned to be
BG> careful when restoring system volumes).

No aliased files.

BG> Personally, I would attempt to recreate
BG> this with a very small test case and a
BG> non-tape Save Set. If there is a small
BG> reproducer, then it is far easier to
BG> get attention, something I learned many
BG> years ago when dealing with various
BG> support organizations at client's
BG> behest.

Unfortunately, I don't have the luxury of doing this, since I have no spare drives of the size the raidset creates and I'm in a disaster recovery situation where I MUST get this data restored for a multimillion dollar project.
GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Brian,

I have the feeling your HSD-hosted RAID set is hosed. One factor in checking the cluster factor is the total block count of the output device, which BACKUP obtains via SYS$GETCHN.

Check with SHOW DEVICE/FULL in DCL or FORMAT UCB in SDA (let us know if you need the full details here). Ah, and before you do that, mount the disk /FOREIGN, which forces VMS to update the disk geometry info in the UCB.

/Guenther
Hoff
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Ok; so the simple BACKUP command blows up. Then it's likely something within BACKUP itself that is at fault here, or (for whatever reason) the output device doesn't quite match the input.

None of which is news, and none of which is new information.

The saveset structures are likely fine, it's the data in the saveset that is at risk. The file system interlocks that were ignored here are intended to provide for either consistency or an indication of inconsistent data, and not to require folks to add qualifiers on BACKUP. The data corruptions that can arise here can be entirely silent, per discussions with one of the long-time BACKUP maintainers. In other words, it is the contents of the files archived within the saveset that can be suspect. Whether there is a problem here depends on what activity was underway (if any) when BACKUP captured each input file.

I have VAX, Alpha and Itanium systems and storage available, if you'd like to discuss this offline.
Hein van den Heuvel
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

[This topic appears to be a cross-posting of topic 1587 in EISNER::VMS.NOTE]

I have a hard time making the numbers add up... unless this is a raid-5 with 4 members and an effective storage of 3*9,000,000*2 blocks and even then, that seems just close, not exact.

For a hardware RAID set, the individual member disks are invisible, and thus irrelevant, to the OS, as long as the RAID set is happy.

>> 53294505/9=5921611 ... I specified only /maximum_files=1500000.

That still sounds like a lot of files.
How many files do you think there actually were?
How many blocks are used in INDEXF.SYS?

That would be for an average file size of about 3 clusters = 24 blocks. Small, but OK, fine.

backup header>> Maximum files: 2960805
Ok, more, smaller files on the original.

>>> $ init/sys/own=system/max=1,500,000 -
/head=3,000,000

I added those commas.
I sure hope you actually used /head=300,000

If you do specify more than /MAX, then the maximum is silently set.

I seem to recall a note about an odd error with backup 7.3 when the /MAX and /HEAD was set to the absolute max, resulting in immediate failure on restore.

After the manual init, what was the resulting /MAX with $SHOW DEV/FULL?

I suggest trying another manual $INIT, but backing /HEAD off to at least one under /MAX, and realistically probably to half of it.
Then retry the $BACKUP/NOINIT.

Could you use DFU REPORT on a similar disk to see its actual usage?

hth,
Hein.
BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

HV> I have a hard time making the numbers
HV> add up... unless this is a raid-5 with
HV> 4 members and an effective storage of
HV> 3*9,000,000*2 blocks and even then,
HV> that seems just close, not exact.

It's whatever ADD RAIDSET of four RZ1Ds gives on the HSJ. The number is almost exactly 3X the size of one RZ1D as shown by VMS. (17769177 = RZ1D. 17769177*3=53307531. Device on VMS shows total blocks=53294505.)
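
For readers checking the numbers, the "almost exactly 3X" relationship can be sketched as below. It assumes a plain RAID-5 layout in which one member's worth of capacity holds parity; the thread does not say where the remaining blocks go, and controller metadata is only one plausible guess. The function name is illustrative.

```python
MEMBER_BLOCKS = 17769177     # one RZ1D as shown by VMS (from the post)
MEMBERS = 4                  # four-member RAIDset
REPORTED_BLOCKS = 53294505   # total blocks VMS shows for the unit


def raid5_data_blocks(member_blocks, members):
    """Usable capacity if one member's worth of space holds parity."""
    return member_blocks * (members - 1)


raw = raid5_data_blocks(MEMBER_BLOCKS, MEMBERS)
shortfall = raw - REPORTED_BLOCKS
print(raw, shortfall)  # 53307531 13026
```

The unit is 13026 blocks (roughly 0.02%) smaller than three full members, which is consistent with a 3+1 RAID-5 set with a small amount of space held back by the controller.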

HV> For a hardware raid set, the
HV> individual member disks are invisible,
HV> and thus irrelevant to the OS. As long
HV> as the raid set is happy.

Of course.

HV> That still sounds like a lot of files.
HV> How many files do you think there
HV> actually were?
HV> How many blocks used in INDEXF.SYS?

There are 1283598 files on the disk. INDEXF.SYS is 1285765/1500760. According to DFU, I have 303 free headers. On another RAIDSET device that's supposed to be identical to this one, DFU says there are 27192 free headers.

HV> That would be for an average file size
HV> of about 3 clusters = 24 blocks. Small,
HV> but ok fine.

Actually, the average file size is about 29 blocks.

HV> backup header>> Maximum files: 2960805
HV> Ok, more, smaller files on the original.
HV>
HV> $ init/sys/own=system/max=1,500,000 -
HV> /head=3,000,000
HV>
HV> I added those commas.
HV> I sure hope you actually
HV> used /head=300,000
HV>
HV> If you do specify more than /MAX, then
HV> the maximum is silently set.

I did specify that because I thought /HEADERS specified the size of INDEXF.SYS, and I have what's supposed to be an identical RAIDSET device that has INDEXF.SYS at the 2,960,805 value.

HV> I seem to recall a note about an odd
HV> error with backup 7.3 when the /MAX
HV> and /HEAD was set to the absolute max,
HV> resulting in immediate failure on
HV> restore.

Suggest some reasonable values to me and I can see if I can try them (although I'm not sure I'll be able to do that, since I'd have to restore data that's two weeks old - not something the project wants to hear).

HV> After the manual init, what was the
HV> resulting /MAX with $SHOW DEV/FULL?

With or without any qualifiers? I believe that without any qualifiers, I got a cluster size of 51 and fewer than 500,000 max files.

HV> I suggest trying another manual $INIT
HV> but backing out the /HEAD by at least
HV> one under /MAX, and realistically
HV> probably 1/2. Then retry the
HV> $BACKUP/NOINIT

It's dubious whether I'll be able to do this, since if it doesn't work, the project suffers because it will take another four days to restore. That said, I may have to do this anyway because of the forcederror messages I've been getting from many of the files on the disk. I think I have multiple problems. I'm trying to restore a portion of the files that are evincing this error to another disk to see if the saveset contains them or if they developed after the restore. The original backup and the subsequent restore didn't log any errors.
GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

In this old BACKUP code there are only two locations where this error is issued. In both cases the total block count is involved in the calculation. So I would first check the total blocks count the way VMS sees it.

Btw, using /IGNORE=INTERLOCK is never, ever a cause of a corrupted save set. It may save the disk in an inconsistent state, but that's a whole different thing.

/Guenther
BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

GF> In this old BACKUP code there are only
GF> two locations where this error is
GF> issued. In both cases the total block
GF> count is involved in the calculation.
GF> So I would first check the total
GF> blocks count the way VMS sees it.

And I've posted that info a couple of times.
Hein van den Heuvel
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Brian, OK, all right... you DO have a lot of files on that drive! Yikes.

My suggestion was based on a remark in an internal publication "VMSnotes 5776" which read: "Essentially, if a disk is initialized with the maximum allowable values for /HEADERS and /MAXIMUM_FILES, and is then used as the target of a disk-to-disk image backup, the backup will fail almost immediately with the %BACKUP-F-CLUSTER error. If the same target disk is reinitialized using (maximum-1) as the value for /HEADERS and /MAXIMUM_FILES, the backup will proceed and complete normally."

That was on 7.3, but on re-reading, it was Alpha, and the note suggested the problem was not reproducible on VAX 7.2 or 7.3.
I just tried a SMALL reproducer (a few-thousand-block MD device) on VAX 7.1... no failure.

I also tried a full reproducer under Alpha 8.3... no failure. See log below.

So I do NOT think that a slightly reduced /HEAD, to be under /MAX and closer to the needed size, will help after all. (Note: the backup restore phase nicely grows INDEXF.SYS as needed.)

Another old note suggests the following:
"Backup reports the error BACKUP$_CLUSTER under two circumstances:-

1. If the volume_size / cluster_size is greater than 1044480, or
   if the volume_size / cluster_size is less than 50

2. If the maximum_files_allowed is greater than volume_size / cluster_size + 1"

But that was before 7.2, and neither applies here, it seems.
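
The two conditions quoted from that note can be written out and evaluated against the numbers posted in this thread. This is only a sketch of the note as quoted, not of the actual BACKUP source: it assumes integer division, reads "volume_size / cluster_size + 1" as (volume_size / cluster_size) + 1, and uses made-up function and key names. Whether BACKUP V7.3 on VAX still applies these exact tests is not established here.

```python
BITMAP_LIMIT_CLUSTERS = 1044480   # threshold from the note; equals 255 * 4096
MIN_CLUSTERS = 50                 # lower threshold from the note


def backup_cluster_checks(volume_size, cluster_size, maximum_files):
    """Evaluate the note's conditions; True means the condition would trip."""
    clusters = volume_size // cluster_size
    return {
        "too_many_clusters": clusters > BITMAP_LIMIT_CLUSTERS,
        "too_few_clusters": clusters < MIN_CLUSTERS,
        "too_many_files": maximum_files > clusters + 1,
    }


# Values from the saveset header: 53294505 blocks, cluster 8, 2960805 files.
result = backup_cluster_checks(53294505, 8, 2960805)
print(result)
```

With these inputs the volume has 6661813 clusters, so the first test comes out true while the other two come out false. If something like condition 1 is still in effect, a cluster factor of 8 on a volume this size would be enough to produce the error regardless of /MAXIMUM_FILES or /HEADERS.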

Good luck!
Hein

$ ld create $1$dga200:[temp]tillman.dsk/size=53294505
$ ld connec $1$dga200:[temp]tillman.dsk lda2:
$ init /max=1500000 /head=3000000/clust=8 lda2: test
$ moun lda2: test
%MOUNT-I-MOUNTED, TEST mounted on _$10$LDA2: (HEINA)
$ show dev /full lda2
:
Total blocks 53294505 Logical Volume Size 53294505 Expansion Size Limit 53444608
Free blocks 51792456
Maximum files allowed 1500000
:

$ dir/size=all lda2:[000000]*.*
:
BITMAP.SYS;1 1628/1632
:
INDEXF.SYS;1 409/1500400
:
Total of 10 files, 2047/1502056 blocks.
$
$ cre/dir lda2:[test]
$ copy sys$manager:*.dat lda2:[test]
$ dir /grand lda2:[*...]

Grand total of 1 directory, 9 files.
$ dir /grand/size=all lda2:[*...]

Grand total of 1 directory, 9 files, 175/216 blocks.
$ back/image lda2: $1$dga200:[temp]tillman.bck/save
$ dism lda2:
$ init /max=1500000 /head=3000000/clust=8 lda2: test2
$ mount/for lda2:
%MOUNT-I-MOUNTED, TEST2 mounted on _$10$LDA2: (HEINA)

$ back/noinit/image $1$DGA200:[temp]tillman.bck/save lda2:
%BACKUP-I-LOGNOTPRES, logical volume size of volume LDA2: not preserved
%BACKUP-I-LIMITNOTPRES, expansion size limit of volume LDA2: not preserved
%BACKUP-I-ODS2COMPAT, output volume LDA2: structure [ODS-2] is not compatible with OpenVMS versions prior to 7.2

$ back/image $1$DGA200:[temp]tillman.bck/save lda2:
%BACKUP-I-ODS2COMPAT, output volume LDA2: structure [ODS-2] is not compatible with OpenVMS versions prior to 7.2
$

GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Brian,

you didn't mention how you got the total blocks count. I would like to see the output from SHOW DEV/FULL while the disk is mounted foreign (to be sure VMS has it PackAcked).

How do the 9GB drives make up a RAID set with 53294505 blocks?

/Guenther
GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Ok, found it: RAID-5 with 3+1 members.

Using the older BITMAP.SYS size limitation of 255 blocks the maximum disk size for a cluster factor of 8 would have been 255*4096*8 = 8355840 blocks.

A regular INIT gave a cluster factor of 51, which would be 255*4096*51 = 53268480, calculated based on the older BITMAP.SYS limit. Looks like BACKUP is shooting for a 255 block BITMAP.SYS limit.

Ah, according to BACKUP V7.3 code it uses a max BITMAP.SYS size of 255 blocks for an ODS-2 volume and 65535 for ODS-5. Which would explain the error message.
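
The bitmap arithmetic in this post can be checked with a short sketch. It assumes one bit per cluster and 512-byte blocks (4096 clusters per bitmap block), and takes the 255-block and 65535-block limits from the post itself; the function names are illustrative.

```python
BITS_PER_BLOCK = 512 * 8   # 4096 clusters tracked per bitmap block


def max_volume_blocks(bitmap_blocks, cluster_factor):
    """Largest volume a bitmap of the given size can describe."""
    return bitmap_blocks * BITS_PER_BLOCK * cluster_factor


def bitmap_blocks_needed(total_blocks, cluster_factor):
    """Bitmap blocks required for a volume, rounding up at each step."""
    clusters = -(-total_blocks // cluster_factor)   # ceiling division
    return -(-clusters // BITS_PER_BLOCK)


print(max_volume_blocks(255, 8))          # 8355840
print(max_volume_blocks(255, 51))         # 53268480
needed = bitmap_blocks_needed(53294505, 8)
print(needed, needed <= 255, needed <= 65535)  # 1627 False True
```

Under these assumptions a 53294505-block volume with a cluster factor of 8 needs 1627 bitmap blocks, which fits the 65535-block limit but not the 255-block one. If one more block is allowed for the storage control block at the start of BITMAP.SYS, that gives 1628, the used size shown for BITMAP.SYS in Hein's log earlier in the thread.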

It all doesn't make sense, yet. The save set was recorded for an ODS-2 disk WITH A CLUSTER FACTOR OF 8!???

But the INIT/CLUSTER=8 worked on the VAX. What is the disk structure level after the INIT?

What's the link date of SYS$SHARE:INIT$SHR.EXE and BACKUPSHR.EXE?

/Guenther
GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Haven't looked into VAX code for eons.

Here's the story: INIT on VAX/VMS V7.3 allows for a max BITMAP.SYS size of 65535 blocks even for ODS-2 volumes (online help INIT/CLUSTER says something like this at the end). That's why the /CLUSTER=8 worked.

BACKUP on the other hand doesn't do that. If the output volume is ODS-2 it strictly uses a BITMAP.SYS limit of 255 blocks. So this image restore can never work with the original V7.3 BACKUPSHR.EXE (link date 16-MAR-2001).

I wonder whether there ever was an ECO to fix that in BACKUP on VAX. Fixed in the Alpha version.

So one option is to pre-initialize the disk volume and do a non-image restore...and get over it.

/Guenther
Robert Gezelter
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Brian,

Please supply the precise error message that users are seeing (e.g., "Forced Error").

Offhand (I just awakened), I do not see a way that an error in BACKUP can produce a hard IO error in accessing the resulting files.

Offhand, a "forced error" on a RAID-5 volume set would seem to imply that either the underlying RAID set members are having problems or that there is a hardware problem somewhere else in the datapath.

- Bob Gezelter, http://www.rlgsc.com
Jon Pinkley
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Interestingly, Art Wiens reported this same error on Jun 17, 2007 in the following thread:

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1140240

This is a reasonably serious shortcoming of BACKUP, or in Brian's words "With or without tweaks, BACKUP in OpenVMS VAX V7.3 is clearly broken." Besides being much slower when there are a large number of files, and even worse if there are large directories, there are some things that a non-image restore will not do correctly. The primary shortcoming of non-image BACKUP is the inability to restore a bootable system disk, or any disk with alias entries. Other issues are that date stamps of directory files are not restored. Also, you must be much more vigilant to get file ownership correct.

This is not the only time that BACKUP has lagged ODS enhancements. Support for volume expansion was added in Alpha VMS 7.3-2, but backup didn't fully support this until VMS 8.3. Specifically, when doing a backup/image/noinit saveset disk:, the /SIZE of the disk is always maximized prior to VMS 8.3.

This restriction could even potentially affect the ability to restore image backups of system disks with small cluster sizes. It's a good reason to never have a VAX system disk with an extended bitmap.

I would argue that BACKUP enhancements should be in lock step with enhancements to the ODS structures, or possibly come before them.

Jon
it depends
BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

GF> you didn't mention how you got the
GF> total blocks count.

$ show device/full disk:

GF> I would like to see the output from
GF> SHOW DEV/FULL while the disk is
GF> mounted foreign (to be sure VMS has it
GF> PackAcked).

Can't do that just now. People have to use the disk.

GF> Ah, according to BACKUP V7.3 code it
GF> uses a max BITMAP.SYS size of 255
GF> blocks for an ODS-2 volume and 65535
GF> for ODS-5. Which would explain the
GF> error message.

Sure would. BITMAP.SYS on these volumes is 1632

GF> It all doesn't make sense, yet. The
GF> save set was recorded for an ODS-2 disk
GF> WITH A CLUSTER FACTOR OF 8!???

You bet.

GF> But the INIT/CLUSTER=8 worked on the
GF> VAX. What is the disk structure level
GF> after the INIT?

ODS-2, naturally. "Volume Status: ODS-2,
subject to mount verification, write-back caching enabled."

GF> What's the link date of
GF> SYS$SHARE:INIT$SHR.EXE and
GF> BACKUPSHR.EXE?

16-MAR-2001 02:53:20.87 and 16-MAR-2001 03:03:16.95, respectively.

GF> So one option is to pre-initialize the
GF> disk volume and do a non-image
GF> restore...and get over it.

That's what I did and all would have been well, but now I'm battling the FORCEDERROR problems I mentioned. I'm trying to decide if they are in the saveset or developed after the restore. I can deal with the latter, perhaps, by replacing the disk drives and restoring those files that report the error.

RG> Please supply the precise error
RG> message that users are seeing
RG> (e.g., "Forced Error").

Here's one:

%COPY-E-READERR, error reading $1$DU103:[AB139SW.EDPU_OFP.WORK.COVIAK]ASM_OBJ.XLB;90
-RMS-F-RER, file read error
-SYSTEM-F-FORCEDERROR, forced error flagged in last sector read

I found that for many of these files only COPY produces the error. DIFFERENCE, TYPE, and CONVERT do not. I used TYPE/OUT on a text file to make a copy of it and used DIFFERENCE to compare the result and everything was fine. EDIT also works and the data seems intact. We have a program, though, that counts lines of code for Ada programs and that program simply hangs trying to process one of these files. I think you're correct about the underlying hardware.

JP> I would argue that BACKUP enhancements
JP> should be in lock step with
JP> enhancements to the ODS structures, or
JP> possibly come before them.

I'd agree, but we all know that OpenVMS VAX is the poor stepchild and even before it was end-of-life Compaq/HP withdrew life support. I doubt they'd entertain fixing bugs, even ones so blatant as this BACKUP one.
Robert Brooks_1
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

JP> I would argue that BACKUP enhancements
JP> should be in lock step with
JP> enhancements to the ODS structures, or
JP> possibly come before them.

I'd agree, but we all know that OpenVMS VAX is the poor stepchild and even before it was end-of-life Compaq/HP withdrew life support. I doubt they'd entertain fixing bugs, even ones so blatant as this BACKUP one.

--

That's not true; I still see periodic VAX fixes checked into the source pool. Admittedly, not too many, but it's not correct to state that it's been completely abandoned.

We have been known to generate VAX patch kits in the recent past as well.

Assuming you have a current support contract, please log a call and report back either here or on EISNER with what happens.

I'll see if I can generate some interest within VMS Engineering to take a look at this.

-- Rob
BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

RB> That's not true; I still see periodic
RB> VAX fixes checked into the source pool.
RB> Admittedly, not too many, but it's not
RB> correct to state that it's been
RB> completely abandoned.

Those changes don't seem to be making it out to the public.

RB> We have been known to generate VAX
RB> patch kits in the recent past as well.

There have been two patches this year, one so that MONITOR would work in a mixed architecture cluster and one to fix a Screen Management security hole. The most recent one before that was one in 2006, then one in 2005. Not a very active effort.

RB> Assuming you have a current support
RB> contract, please log a call and report
RB> back either here or on EISNER with what
RB> happens.

I don't have a support contract. What would the thousands of dollars over the last ten years have gotten me? Almost nothing. Hardly a wise use of company money. At any rate, I sent a message via HP's OpenVMS support web site five days ago and received nothing but a robo-reply, so can't imagine any better response by phone, especially since I don't have a contract.

RB> I'll see if I can generate some
RB> interest within VMS Engineering to take
RB> a look at this.

I'd really appreciate this. I'm not making demands, but it would be great if some resolution came to fruition.
Robert Brooks_1
Honored Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

I asked someone in VMS Engineering to take a look at this stream, and received this response . . .



Rob - I scanned this quickly. Clearly something is broken. We still
fix bugs in VAX BACKUP. The customer should submit a case with the
relevant information and the BACKUP team will work on it.

BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

RB> Clearly something is broken. We still
RB> fix bugs in VAX BACKUP. The customer
RB> should submit a case with the
RB> relevant information and the BACKUP
RB> team will work on it.

How do I submit a case? Which part of this discussion would they consider "relevant"?

By the way, I am most grateful to all who have contributed to this discussion.
GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

I opened an internal problem report with high priority because you may not be able to restore your system disk. Let's see how this goes. No promises!

About the forced errors: RAID-5 technology can only survive ONE error. For example, if you have lost one disk drive and a rebuild is taking place on the replacement disk drive AND another drive fails, some blocks may not be recoverable, because you have now lost both the disk with the original data and the disk with the XOR/parity data. Some implementations of RAID-5 report this as a forced error.
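
The single-failure limit follows from how XOR parity works, which a toy example can show. This is a generic illustration of the parity idea only; it does not model the HSJ/HSD controller, striping, or how forced errors are flagged, and the data values are arbitrary.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data members and one parity member, as in a 3+1 set.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One member lost: XOR of the survivors and the parity restores it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True

# Two members lost: the survivors only yield the XOR of the two missing
# blocks, not either block on its own.
leftover = xor_blocks([data[0], parity])
print(leftover == xor_blocks([data[1], data[2]]))  # True
```

In the second case the controller has no way to separate the two missing blocks, so the affected area cannot be reconstructed and has to be reported as unreadable in some form.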

My personal opinion about RAID-5: well, it was developed at Berkeley. Great idea, but with some flaws in a real environment. I prefer RAID-0 or some extensions to RAID-5.

/Guenther
BrianT_1
Regular Advisor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

GF> I opened an internal problem report
GF> with high priority because you may not
GF> be able to restore your system disk.
GF> Let's see how this goes. No promises!

I understand and I thank you.

GF> About the forced errors: RAID-5
GF> technology can only survive ONE error.
GF> For example if you have lost one disk
GF> drive and a rebuild is taking place on
GF> the replaced disk drive AND another
GF> drive fails some blocks may not be
GF> recoverable because now you lost the
GF> disk with the original data and the
GF> disk with the XOR/parity data. Some
GF> implementations of RAID-5 do report
GF> this as forced error.

I think something similar had happened. I had lost a disk from the raidset at some point and the HSJ pulled one in from the spareset. Since VMS doesn't know about the substitution and didn't tell me, I had no spares. I think then another failed and that's when the disk went off line. I'm surprised because I thought a raid set was supposed to be able to operate in reduced mode if one of the disks failed and there was no spare, but that didn't happen. I replaced both the failed spare and the second failure, but could not get the HSJ to recognize any of the devices until I had completely removed the unit, the raidset, and the disks from the configuration and readded them with a complete reinitialization of each disk in the set. This, of course, erased the entire contents of the device and that's why I needed to perform the restore. The restore, except for taking a long time, didn't produce any errors on the disk as VMS sees it. It was a couple of days later that we started seeing the FORCEDERROR notices. The error light that each disk has has remained off since I replaced the failed drives.
GuentherF
Trusted Contributor

Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

These forced errors are software errors; they are an inherent part of RAID-5 technology and are more officially called 'write holes'. Some early RAID-5 implementations ignored that, giving you false data without a warning. DEC implementations always return a forced error for such areas.

Once you write to such blocks/areas, it clears the forced error condition, like when you do a restore.

So be careful with the RAID set...we don't want to have another million dollar burndown. ;-)

/Guenther