Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

BrianT_1 · ‎10-29-2008

We've been using these raidsets for literally years with never a problem until recently. I'm not sure what I can do to rid myself of these errors now. Should I replace the devices? If I reinitialize the VMS disk and restore the saveset again, is it likely the errors would recur?

GuentherF · ‎10-29-2008

I don't know what kind of utilities there are for controller-based RAID sets. But if there are any, look for one that does some kind of consistency check (for non-redundancy).

Or, ANALYZE/DISK/READ would tell you all the forced errors there are. And before you replace a RAID set member make sure the RAID set is fully redundant, either with a utility or again ANALYZE/DISK/READ.

/Guenther

BrianT_1 · ‎10-29-2008

GF> Or, ANALYZE/DISK/READ would tell you
GF> all the forced errors there are.

So does COPY. I have a list of all the errors. It's extensive. I may just replace all the drives and start over.

GF> And before you replace a RAID set
GF> member make sure the RAID set is fully
GF> redundant, either with a utility or
GF> again ANALYZE/DISK/READ.

I don't know how to do that. ANALYZE/READ doesn't even know it's a raid set.

GuentherF · ‎10-29-2008

ANALYZE/DISK/READ tells you about any forced error in a file. If it comes back clean then you don't have a problem. Now you might have forced error blocks in unused blocks but you don't care. As soon as a block is written its XOR/redundancy/parity block is also updated.

ANALYZE/DISK/READ is better than using COPY because it just does bulk reads, and it even reads files that are not in a directory... it gets them all.

A forced error on a RAID-5 array is almost always a software error. All of your disk drives are fine. You could have had a power glitch, an infamous system reset (aka crash), or a controller failure/reset/power cycle, in which case all your drives are fine. It could be that after such an event a disk drive didn't make it back into the set, or a metadata write (RAID-5 related) didn't make it to the set. But your disk drives are most likely not the source of the problem.

If a disk drive has been kicked out of the set I typically do a "DUMP disk:/OUT=NLA0:". If that comes out clean I put that very same drive back into the set.

/Guenther

BrianT_1 · ‎10-30-2008

GF> ANALYZE/DISK/READ tells you about any
GF> forced error in a file.

I understand that. It was the other stuff you said that I didn't understand. ANALYZE/READ shows a LOT of FORCEDERRORs.

GF> A forced error on a RAID-5 array is
GF> almost always a software error. All of
GF> your disk drives are fine.

How does one tell that's true, however?

GF> It could be after such an event a disk
GF> drive didn't make it back into the set
GF> or, a meta data write (RAID-5 related)
GF> didn't make it to the set.

When the unit went offline and I noticed the failed drive, I replaced it, then tried to add the new disk to the raidset and couldn't. I don't remember exactly what happened when I did, but to try to fix the problem, I deleted the unit, deleted the raidset, deleted the disks, and tried to add them again. The controller complained that it couldn't determine the format of the disks. It wasn't until I actually inited the disks on the controller (thereby erasing everything) that the HSJ finally accepted the devices.

GF> If a disk drive has been kicked out of
GF> the set I typically do a "DUMP
GF> disk:/OUT=NLA0:". If that comes out
GF> clean I put that very same drive back
GF> into the set.

If it has been kicked out of the set and the HSJ hasn't automatically replaced it with a drive from the SPARESET, you'd have to initialize it on the controller (destroying all the data on it), add it as a separate unit, and mount it on VMS in order to do that. Even if you do that, the dump would probably show nothing because the disk would have been erased to get that far. I imagine you could add it back into the raidset and have the controller rebuild it at that point, but the controller wouldn't touch the individual disks until I had reinited them all. Your descriptions sound like you're using host-based RAID. I'm using controller-based RAID, where VMS never even sees the individual devices.

BrianT_1 · ‎10-30-2008

I'm fairly confident now that the saveset itself is clean, at any rate, because I restored a portion of it to a smaller spare disk and then performed an ANALYZE/READ on that disk. The files on the original disk that report FORCEDERROR aren't showing that error on the partial restore. I think I'm just going to buy four replacement RZ1Ds and start over. Maybe by the time I get the purchase order through Procurement the BACKUP team will have issued an ECO (hope, hope, hope) and I'll be able to use /IMAGE.

Jim_McKinney · ‎10-30-2008

When the forced error flag on a block on disk has been set, it indicates that the data in the block does not pass its parity check and has been flagged as such. It does not necessarily mean that the disk hardware is bad. Things like this occasionally occur during power failures, surges, crashes, etc. The act of writing (again) to a block with the forced error flag set will essentially repair that block - you've now written good data with a good parity check. If the block (hardware) was actually bad the attempt to write would result in a revectoring of the block (on any recent hardware). So, if you're restoring from tape, as long as the content of the tape is good, the content of the disk will be good.

When you have parity errors (forced error flag set) on disk it is possible to just re-write those blocks to repair the errors and make your data whole again. The trick is in identifying the blocks and then deciding what data belongs in them. If you can do that, then you can just poke the data into the blocks and your forced error flags will disappear.

BrianT_1 · ‎10-30-2008

Should I just restore the BACKUP again, then? If the backup is good, how would the errors have developed in the first place?

Jim_McKinney · ‎10-30-2008

In another thread you are asking for an interpretation of an HSJ failure from a couple of weeks ago - is that about the time that you first noticed the forced error flags being set? If so, then that HSJ crash was likely the culprit.

Once you've restored the data from backup, all of the forced error flags will have disappeared, because those blocks will have been rewritten. In this case, since this is a controller-based raidset that you've just created/initialized, the forced error flags would already have disappeared when the controller performed the initialization. As long as your backup is good, what you've just written to disk from it will be good.

Jim_McKinney · ‎10-30-2008

> how would the errors have developed in the first place?

These seem always to be the result of some hardware event that interrupts the writing of the data to your disks. Likely culprits in my experience are either power failures or crashes of the HSx storage controllers while they are active.

Hoff · ‎10-30-2008

The FORCEDERROR status means that bad data was returned from a disk read; that some number of bits are gone from the sector, and any EDC/ECC or RAID or HBVS was not able to recover the bit(s). When the sector(s) involved are rewritten, the block and its ECC are reset or the sector gets revectored.

BACKUP continues onward from this error by its intended design and intended use. Regardless, the FORCEDERROR means that some of the data was not recovered.

GuentherF · ‎10-30-2008

During the backup you get that hint - forced error - that the data read is unreliable. Whatever has been returned from the disk subsystem is stored in the save set without any further tag.

Restoring such a save set just restores these blocks without a message. Be warned!

Hoff, forced errors on RAID devices can be - and mostly are - the detection of a "write hole". That means that at this horizontal level, across all RAID set members, there is not enough redundancy to recover the data. No disk ECC is involved here.
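
[Editor's illustration] The "write hole" described above can be sketched with plain XOR parity. This is a toy model under stated assumptions, not HSJ controller logic: the four-member stripe, the chunk contents, and the helper name xor_blocks are all hypothetical. It shows that a consistent stripe can rebuild a missing member, while a stripe whose parity update was lost cannot.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across a hypothetical 4-member set: three data chunks plus parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Consistent stripe: a missing member (here d1) is rebuilt from the others.
rebuilt_ok = xor_blocks([d0, d2, parity])

# Write hole: d0 is rewritten, but a crash or reset loses the parity update.
d0_new = b"ZZZZ"
stripe_consistent = xor_blocks([d0_new, d1, d2]) == parity

# Rebuilding d1 from the inconsistent stripe now yields wrong data.
rebuilt_bad = xor_blocks([d0_new, d2, parity])

print(rebuilt_ok == d1)     # True
print(stripe_consistent)    # False
print(rebuilt_bad == d1)    # False
```

In this model none of the members has a media defect; the stripe is simply inconsistent, which is why a controller that notices the mismatch can only mark the affected blocks as unreliable rather than repair them.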

Brian, sounds like you are a good customer of that store around the block.

/Guenther

BrianT_1 · ‎10-30-2008

GF> During the backup you get that hint -
GF> forced error - that the data read is
GF> unreliable. Whatever has been returned
GF> from the disk subsystem is stored in
GF> the save set without any further tag.
GF>
GF> Restoring such a save set just restores
GF> these blocks without a message. Be
GF> warned!

I think the saveset is clean because I have a log of the batch job that made it and that log contains no errors.

I think this whole thing started with a HSJ50 error, which I'm discussing in another thread. Thanks all.

BrianT_1 · ‎11-24-2008

GF> I opened an internal problem report
GF> with high priority because you may not
GF> be able to restore your system disk.
GF> Let's see how this goes. No promises!

Has the team mentioned anything to you about this high priority item you submitted because of this error I found?

GuentherF · ‎11-25-2008

"The team" is located halfway around the world. I haven't seen any response yet.

It's like asking FORD to give you a stronger tank for your Pinto.

/Guenther

BrianT_1 · ‎02-02-2009

Still hoping for an update.
--
Brian Tillman

GuentherF · ‎02-02-2009

All I can tell is it had been assigned to the BACKUP team a long time ago but no response so far. They must be very busy then...

/Guenther

BrianT_1 · ‎05-22-2009

Should I assume that the BACKUP team never intends to address this bug in BACKUP? If that's the case, what approaches can I take? One that might be available to me is to join an AlphaServer 1000 4/233 to the VAXcluster where the error is and use the Alpha version of BACKUP, if I can verify that OpenVMS Alpha V7.3-1's BACKUP doesn't contain the bug (I don't have legal access to any later versions of OpenVMS).

comarow · ‎05-24-2009

First you said it works if you don't use the image qualifier? Is it the system disk?
If not, you can do a full backup.
Will it do a backup/list?

When does it fail?

BrianT_1 · ‎05-26-2009

comarow wrote:

> First you said it works if you don't use
> the image qualifier? Is it the system
> disk? If not, you can do a full backup.
> Will it do a backup/list?

Please read the prior messages in this thread. I've covered the details already. BACKUP/IMAGE cannot be used to restore an ODS-2 volume whose BITMAP.SYS exceeds 255 blocks because, apparently, that value is hard-coded into BACKUP despite ODS-2 supporting 64K BITMAP.SYS files since V7.2. As I also said already, not using /IMAGE takes four full days to restore the RAID set I'm using of four 9.2GB RZ1C disks. Other aspects of BACKUP, like /LIST work fine and BACKUP/IMAGE also works just fine to create the unrestorable saveset. This is a major flaw in OpenVMS VAX BACKUP.
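
[Editor's illustration] The 255-block threshold mentioned above can be turned into rough arithmetic. This is a sketch under assumptions: it takes the storage bitmap to hold one bit per cluster, so one 512-byte block maps 4096 clusters, and it ignores the storage control block at the start of BITMAP.SYS. The 50,000,000-block volume is a hypothetical size, not the actual RAID set, and the function names are made up for this example.

```python
BITS_PER_BLOCK = 512 * 8  # assumed: one bit per cluster, 4096 clusters per block

def bitmap_blocks(total_blocks: int, cluster_factor: int) -> int:
    """Blocks of storage bitmap needed to map the volume (SCB not counted)."""
    clusters = -(-total_blocks // cluster_factor)   # ceiling division
    return -(-clusters // BITS_PER_BLOCK)

def min_cluster_factor(total_blocks: int, max_bitmap_blocks: int) -> int:
    """Smallest cluster factor whose bitmap fits in max_bitmap_blocks."""
    max_clusters = max_bitmap_blocks * BITS_PER_BLOCK
    return max(1, -(-total_blocks // max_clusters))

volume = 50_000_000  # hypothetical volume size in 512-byte blocks

print(bitmap_blocks(volume, 16))          # 763, well over 255
print(min_cluster_factor(volume, 255))    # 48
print(bitmap_blocks(volume, 48))          # 255
print(min_cluster_factor(volume, 65535))  # 1
```

Under these assumptions, a volume initialized with a small cluster factor gets a bitmap larger than 255 blocks, and only a cluster factor of 48 or more would keep this hypothetical volume under the older limit.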
--
Brian Tillman

BrianT_1 · ‎10-09-2009

Günther Froehlin wrote:

> All I can tell is it had been assigned to
> the BACKUP team a long time ago but no
> response so far. They must be very busy
> then...

Has fixing this fatal BACKUP error that would prevent the restore of a system disk bubbled up from the bottom yet?

BrianT_1 · ‎06-09-2010

I need to revisit this thread again. While I have been able to recover from the problems described in it, I'd like to avoid them in the future. One of the means I intend to use is an Alpha running a VMS version that doesn't have the backup bug. Here's a quote of Günther Froehlin from one of the messages in this thread:

GF> I wonder whether there ever was an ECO
GF> to fix that in BACKUP on VAX. Fixed in
GF> the Alpha version.

What version of OpenVMS Alpha contains this fix? I want to purchase a license for that version and incorporate the Alpha I have into my VAXcluster, with the Alpha running the version of OpenVMS that has the fix. I'll then use the Alpha as a backup engine.

Also in this thread, Günther said:

GF> I opened an internal problem report with
GF> high priority because you may not be
GF> able to restore your system disk. Let's
GF> see how this goes. No promises!

I was wondering if there were any results of this. I realize, of course, that "no promises" means that, as I suspected, there's no real interest in fixing this bug, but one can always hope.

Shriniketan Bhagwat · ‎06-09-2010

Hi Brian,

According to the data available to me, this problem was reported to engineering sometime during 2002. It was fixed in the OpenVMS V7.3-2 SSB release. The fix is also part of the VMS731_BACKUP-V0200 ECO kit. See the link below for details.

http://ftp.uma.es/Vms/parches/v7.3-1/VMS731_BACKUP-V0200.txt

> What version of OpenVMS Alpha contains this fix? I want to purchase a license for that version
You can purchase any OpenVMS version equal to or higher than V7.3-2. Since the problem was fixed in the V7.3-2 SSB, the fix is available in all higher versions. Higher versions also contain many more fixes and new features.

Regards,
Ketan

Shriniketan Bhagwat · ‎06-09-2010

Hi,

As Hein said:
> Backup reports the error BACKUP$_CLUSTER under two circumstances:-
The BACKUP code checks the cluster factor against its lower bound, so that the storage map does not exceed 65535 blocks, and also checks for a reasonable minimum number of clusters, i.e. 50. BACKUP calculates the maximum number of files on the disk based on the cluster factor. This value is compared against the architectural limit. This is true even in V8.4.

Regards,
Ketan

Shriniketan Bhagwat · ‎06-09-2010

Hi,

I remember facing a similar problem long ago while taking a BACKUP onto an LD disk with a smaller cluster factor.
I tried to reproduce the problem by taking an image BACKUP of a disk with a smaller cluster factor to a disk with a larger cluster factor, and vice versa, but I was unable to reproduce it. I tested this on V8.3-1H1 IA64. The test log is below.

$ LD CREAT REGRESSION3.ISO/SIZE=10000/CONT
$ LD CONNECT REGRESSION3.ISO LDA10:
$ INIT LDA10: BACKUP
$ MOUNT/OVER=ID LDA10:
%MOUNT-I-MOUNTED, BACKUP MOUNTED ON _DADAR$LDA10:
$ SHOW DEV/FULL LDA10:

DISK DADAR$LDA10:, DEVICE TYPE FOREIGN DISK TYPE 1, IS ONLINE, ALLOCATED,
DEALLOCATE ON DISMOUNT, MOUNTED, FILE-ORIENTED DEVICE, SHAREABLE.

ERROR COUNT 0 OPERATIONS COMPLETED 405
OWNER PROCESS "_TNA58:" OWNER UIC [SYSTEM]
OWNER PROCESS ID 000010F4 DEV PROT S:RWPL,O:RWPL,G:R,W
REFERENCE COUNT 2 DEFAULT BUFFER SIZE 512
TOTAL SIZE 4.88MB SECTORS PER TRACK 10
TOTAL CYLINDERS 100 TRACKS PER CYLINDER 10
LOGICAL VOLUME SIZE 4.88MB EXPANSION SIZE LIMIT 6.00MB

VOLUME LABEL "BACKUP" RELATIVE VOLUME NUMBER 0
CLUSTER SIZE 1 TRANSACTION COUNT 1
FREE SPACE 4.83MB MAXIMUM FILES ALLOWED 2500
EXTEND QUANTITY 5 MOUNT COUNT 1
MOUNT STATUS PROCESS CACHE NAME "_DADAR$DKB200:XQPCACHE"
EXTENT CACHE SIZE 64 MAXIMUM BLOCKS IN EXTENT CACHE 989
FILE ID CACHE SIZE 64 BLOCKS IN EXTENT CACHE 0
QUOTA CACHE SIZE 0 MAXIMUM BUFFERS IN FCP CACHE 3534
VOLUME OWNER UIC [SYSTEM] VOL PROT S:RWCD,O:RWCD,G:RWCD,W:RWCD

VOLUME STATUS: ODS-2, SUBJECT TO MOUNT VERIFICATION, FILE HIGH-WATER MARKING,
WRITE-BACK CACHING ENABLED.

$ DISMOUNT DADAR$DKA0:
$ INIT DADAR$DKA0: BACKUP
$ MOUNT/OVER=ID DADAR$DKA0:
%MOUNT-I-MOUNTED, BACKUP MOUNTED ON _DADAR$DKA0:
$ SHOW DEV/FULL DADAR$DKA0:

DISK DADAR$DKA0:, DEVICE TYPE HP 73.4G MAX3073NC, IS ONLINE, ALLOCATED,
DEALLOCATE ON DISMOUNT, MOUNTED, FILE-ORIENTED DEVICE, SHAREABLE, AVAILABLE
TO CLUSTER, ERROR LOGGING IS ENABLED.

ERROR COUNT 0 OPERATIONS COMPLETED 4584474
OWNER PROCESS "_TNA58:" OWNER UIC [SYSTEM]
OWNER PROCESS ID 000010F4 DEV PROT S:RWPL,O:RWPL,G:R,W
REFERENCE COUNT 2 DEFAULT BUFFER SIZE 512
CURRENT PREFERRED CPU ID 0 FASTPATH 1
TOTAL SIZE 68.36GB SECTORS PER TRACK 96
TOTAL CYLINDERS 15558 TRACKS PER CYLINDER 96
LOGICAL VOLUME SIZE 68.36GB EXPANSION SIZE LIMIT 80.71GB

VOLUME LABEL "BACKUP" RELATIVE VOLUME NUMBER 0
CLUSTER SIZE 144 TRANSACTION COUNT 1
FREE SPACE 68.36GB MAXIMUM FILES ALLOWED 494395
EXTEND QUANTITY 5 MOUNT COUNT 1
MOUNT STATUS PROCESS CACHE NAME "_DADAR$DKB200:XQPCACHE"
EXTENT CACHE SIZE 64 MAXIMUM BLOCKS IN EXTENT CACHE 14337316
FILE ID CACHE SIZE 64 BLOCKS IN EXTENT CACHE 0
QUOTA CACHE SIZE 0 MAXIMUM BUFFERS IN FCP CACHE 3534
VOLUME OWNER UIC [SYSTEM] VOL PROT S:RWCD,O:RWCD,G:RWCD,W:RWCD

VOLUME STATUS: ODS-2, SUBJECT TO MOUNT VERIFICATION, FILE HIGH-WATER MARKING,
WRITE-BACK CACHING ENABLED.

$ DISMOUNT DADAR$DKA0:
$ MOUNT/FOR DADAR$DKA0:
%MOUNT-I-MOUNTED, DISK MOUNTED ON _DADAR$DKA0:
$
$ BACKUP/IMAGE/LOG LDA10: DADAR$DKA0:
%BACKUP-S-CREATED, CREATED DADAR$DKA0:[000000]000000.DIR;1
%BACKUP-S-CREATED, CREATED DADAR$DKA0:[000000]BACKUP.SYS;1
%BACKUP-S-CREATED, CREATED DADAR$DKA0:[000000]CONTIN.SYS;1
%BACKUP-S-CREATED, CREATED DADAR$DKA0:[000000]CORIMG.SYS;1
%BACKUP-S-CREATED, CREATED DADAR$DKA0:[000000]SECURITY.SYS;1
%BACKUP-S-CREATED, CREATED DADAR$DKA0:[000000]VOLSET.SYS;1
$
$ DISMOUNT DADAR$DKA0:
$ INIT DADAR$DKA0: BACKUP1
$ MOUNT/OVER=ID DADAR$DKA0:
%MOUNT-I-MOUNTED, BACKUP1 MOUNTED ON _DADAR$DKA0:
$ DISMOUNT LDA10:
$ MOUNT/FOR LDA10:
%MOUNT-I-MOUNTED, BACKUP MOUNTED ON _DADAR$LDA10:
$ BACKUP/LOG/IMAGE DADAR$DKA0: LDA10:
%BACKUP-S-CREATED, CREATED LDA10:[000000]000000.DIR;1
%BACKUP-S-CREATED, CREATED LDA10:[000000]BACKUP.SYS;1
%BACKUP-S-CREATED, CREATED LDA10:[000000]CONTIN.SYS;1
%BACKUP-S-CREATED, CREATED LDA10:[000000]CORIMG.SYS;1
%BACKUP-S-CREATED, CREATED LDA10:[000000]SECURITY.SYS;1
%BACKUP-S-CREATED, CREATED LDA10:[000000]VOLSET.SYS;1
$

Regards,
Ketan
