Re: %BACKUP-F-CLUSTER, unsuitable cluster factor redux

Robert Brooks_1 · ‎10-29-2008

I asked someone in VMS Engineering to take a look at this stream, and received this response . . .

Rob - I scanned this quickly. Clearly something is broken. We still
fix bugs in VAX BACKUP. The customer should submit a case with the
relevant information and the BACKUP team will work on it.

BrianT_1 · ‎10-29-2008

RB> Clearly something is broken. We still
RB> fix bugs in VAX BACKUP. The customer
RB> should submit a case with the
RB> relevant information and the BACKUP
RB> team will work on it.

How do I submit a case? Which part of this discussion would they consider "relevant"?

By the way, I am most grateful to all who have contributed to this discussion.

GuentherF · ‎10-29-2008

I opened an internal problem report with high priority because you may not be able to restore your system disk. Let's see how this goes. No promises!

About the forced errors: RAID-5 technology can only survive ONE error. For example, if you have lost one disk drive and a rebuild is taking place on the replaced disk drive AND another drive fails, some blocks may not be recoverable, because you have now lost both the disk with the original data and the disk with the XOR/parity data. Some implementations of RAID-5 report this as a forced error.
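
To make the single-failure limit concrete, here is a minimal Python sketch of XOR parity on one stripe. It is an illustration only and says nothing about how the HSJ controller is actually implemented; the function names `xor_blocks` and `rebuild_missing` and the tiny 4-byte blocks are assumptions made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (None) in a stripe of data + parity.

    Raises ValueError when more than one block is missing, which is the
    case a controller may report as a forced error.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("two or more members lost: block is unrecoverable")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)
stripe = data + [parity]

# One member lost: the block comes back from the survivors.
one_lost = [stripe[0], None, stripe[2], stripe[3]]
assert rebuild_missing(one_lost)[1] == data[1]

# Two members lost (e.g. a second failure during a rebuild): no recovery.
two_lost = [stripe[0], None, None, stripe[3]]
try:
    rebuild_missing(two_lost)
except ValueError as err:
    print(err)
```

With one member gone, XORing the survivors yields the lost block; with two gone, there is not enough information left, which matches the rebuild-plus-second-failure scenario described above.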

My personal opinion about RAID-5: well, it was developed at Berkeley. Great idea, but with some flaws in a real environment. I prefer RAID-0 or some extensions to RAID-5.

/Guenther

BrianT_1 · ‎10-29-2008

GF> I opened an internal problem report
GF> with high priority because you may not
GF> be able to restore your system disk.
GF> Let's see how this goes. No promises!

I understand and I thank you.

GF> About the forced errors: RAID-5
GF> technology can only survive ONE error.
GF> For example if you have lost one disk
GF> drive and a rebuild is taking place on
GF> the replaced disk drive AND another
GF> drive fails some blocks may not be
GF> recoverable because now you lost the
GF> disk with the original data and the
GF> disk with the XOR/parity data. Some
GF> implementations of RAID-5 do report
GF> this as forced error.

I think something similar had happened. I had lost a disk from the raidset at some point and the HSJ pulled one in from the spareset. Since VMS doesn't know about the substitution and didn't tell me, I had no spares. I think then another failed and that's when the disk went off line. I'm surprised because I thought a raid set was supposed to be able to operate in reduced mode if one of the disks failed and there was no spare, but that didn't happen. I replaced both the failed spare and the second failure, but could not get the HSJ to recognize any of the devices until I had completely removed the unit, the raidset, and the disks from the configuration and readded them with a complete reinitialization of each disk in the set. This, of course, erased the entire contents of the device and that's why I needed to perform the restore. The restore, except for taking a long time, didn't produce any errors on the disk as VMS sees it. It was a couple of days later that we started seeing the FORCEDERROR notices. The error light that each disk has has remained off since I replaced the failed drives.

GuentherF · ‎10-29-2008

These forced errors are software errors and are an inherent part of RAID-5 technology; they are more officially called 'write holes'. Some early RAID-5 implementations ignored this, giving you false data without a warning. DEC implementations always return a forced error for such areas.

Once you write to such blocks/areas, it clears the forced error condition, like when you do a restore.
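
The read-fails-until-rewritten behaviour can be modelled in a few lines of Python. This is only a toy model under stated assumptions: the class `ToyUnit`, the exception `ForcedError`, and the method names are invented for the illustration and do not correspond to any VMS or HSJ interface.

```python
class ForcedError(Exception):
    """Raised when a block flagged as unrecoverable is read."""


class ToyUnit:
    """Toy model of a unit whose controller flags blocks it cannot rebuild."""

    def __init__(self, block_count):
        self.blocks = [b""] * block_count
        self.forced = set()  # block numbers currently flagged

    def mark_unrecoverable(self, lbn):
        """The controller could not rebuild this block, so flag it."""
        self.forced.add(lbn)

    def read(self, lbn):
        """Return the block, or fail rather than hand back bad data."""
        if lbn in self.forced:
            raise ForcedError(f"forced error at block {lbn}")
        return self.blocks[lbn]

    def write(self, lbn, payload):
        """Writing fresh data makes the block valid again and clears the flag."""
        self.blocks[lbn] = payload
        self.forced.discard(lbn)


unit = ToyUnit(block_count=8)
unit.mark_unrecoverable(3)

try:
    unit.read(3)
except ForcedError as err:
    print(err)

unit.write(3, b"restored data")  # e.g. a restore rewrites the block
assert unit.read(3) == b"restored data"
```

The point of the flag is that a read never silently returns stale or invented data; only a fresh write makes the block trustworthy again.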

So be careful with the RAID set...we don't want to have another million dollar burndown. ;-)

/Guenther

BrianT_1 · ‎10-29-2008

We've been using these raidsets for literally years with never a problem until recently. I'm not sure what I can do to rid myself of these errors now. Should I replace the devices? If I reinitialize the VMS disk and restore the saveset again, is it likely the errors would recur?

GuentherF · ‎10-29-2008

I don't know what kind of utilities there are for the controller-based RAID sets. But if there are any, look for one that does some consistency check (non-redundancy).

Or, ANALYZE/DISK/READ would tell you all the forced errors there are. And before you replace a RAID set member make sure the RAID set is fully redundant, either with a utility or again ANALYZE/DISK/READ.

/Guenther

BrianT_1 · ‎10-29-2008

GF> Or, ANALYZE/DISK/READ would tell you
GF> all the forced errors there are.

So does COPY. I have a list of all the errors. It's extensive. I may just replace all the drives and start over.

GF> And before you replace a RAID set
GF> member make sure the RAID set is fully
GF> redundant, either with a utility or
GF> again ANALYZE/DISK/READ.

I don't know how to do that. ANALYZE/READ doesn't even know it's a raid set.

GuentherF · ‎10-29-2008

ANALYZE/DISK/READ tells you about any forced error in a file. If it comes back clean then you don't have a problem. Now you might have forced error blocks in unused blocks but you don't care. As soon as a block is written its XOR/redundancy/parity block is also updated.

ANALYZE/DISK/READ is better than using COPY because it just does bulk reads, even for files not in a directory...it gets them all.

A forced error on a RAID-5 array is almost always a software error. All of your disk drives are fine. You could have a power glitch, an infamous system reset (aka crash), or a controller failure/reset/power cycle, in which case all your drives are fine. It could be that after such an event a disk drive didn't make it back into the set, or a metadata write (RAID-5 related) didn't make it to the set. But your disk drives are mostly not the source of the problem.

If a disk drive has been kicked out of the set I typically do a "DUMP disk:/OUT=NLA0:". If that comes out clean I put that very same drive back into the set.

/Guenther

BrianT_1 · ‎10-30-2008

GF> ANALYZE/DISK/READ tells you about any
GF> forced error in a file.

I understand that. It was the other stuff you said that I didn't understand. ANALYZE/READ shows a LOT of FORCEDERRORs.

GF> A forced error on a RAID-5 array is
GF> almost always a software error. All of
GF> your disk drives are fine.

How does one tell that's true, however?

GF> It could be after such an event a disk
GF> drive didn't make it back into the set
GF> or, a meta data write (RAID-5 related)
GF> didn't make it to the set.

When the unit went off line and I noticed the failed drive, I replaced it, then tried to add the new disk into the raidset and couldn't. I don't remember exactly what happened when I did, but to try to fix the problem, I deleted the unit, deleted the raidset, and deleted the disks and tried to add them again. The controller complained that it couldn't determine the format of the disks. It wasn't until I actually inited the disks on the controller (thereby erasing everything) that the HSJ finally accepted the devices.

GF> If a disk drive has been kicked out of
GF> the set I typically do a "DUMP
GF> disk:/OUT=NLA0:". If that comes out
GF> clean I put that very same drive back
GF> into the set.

If it has been kicked out of the set and the HSJ hasn't automatically replaced it with a drive from SPARESET, you'd have to initialize it on the controller (destroying all the data on it), add it as a separate unit, and mount it on VMS in order to do that. Even if you do that, the dump would probably show nothing because it would have been erased to get that far. I imagine you could add it back into the raidset and have the controller rebuild it at that point, but the controller wouldn't touch the individual disks until I had reinited them all. Your descriptions sound like you're using host-based raid. I'm using controller-based raid where VMS never even sees the individual devices.
