topic Re: Failed read recovery error in Disk

Failed read recovery error

fawrell — Wed, 15 Oct 2008 06:35:24 GMT

Hi.

I have a question about one SCSI S.M.A.R.T. attribute, Fl Rd Recv (in ADU 7.x). As I understand it, this corresponds to failed read recovery (unable to read the recovery info off of this drive). So the error should mean that the drive was unable to read some data (a hard read error) and was also unable to read the recovery info needed to rebuild that data, so the data is irretrievably lost.

My question is: when the drive has such a serious error, why is it not marked as failed in ADU? When you have a mirrored array and one disk in the array goes down while the second disk has failed read recovery errors, you can't rebuild the array. The array will stay in the "Ready for recovery" state, and the only solution is to make a backup, create a new array with new drives, and restore the backup there. But will I be able to make a backup from a drive that has some unreadable data on it?

I think it is a serious error, but I may be wrong in my understanding of it. Can someone tell me more about this error?

Thanks for any answers.

Re: Failed read recovery error

skris — Wed, 15 Oct 2008 21:24:45 GMT

Every RAID unit has its own parity layout, which increases reliability.

The data organization on a RAID volume is something like this, where:

Da, Db ... Dz are data blocks and Dp is parity
D1 ... Dn are the drives

E.g., on a RAID 5 volume, here is how writes are laid out:

D1 D2 D3 (Disks)
------------
Da | Db | Dp (All Dp's are parities)
Dc | Dp | Dd
Dp | De | Df
------------

To see what the Fl Rd Recv error means: after a disk failure (say D3), the controller should be able to re-create the lost data from the remaining data and parity,
which would be something like this:

D1 D2 D3(replacement) (Disks)
---------
Da | Db | >>>> Dp
Dc | Dp | >>>> Dd
Dp | De | >>>> Df

Now let's say there is a read error at location Dc (on Disk 1),
which would be reported as a Fl Rd Recv error. In that case:

1) The rebuild will not complete.
2) A backup may not complete, and some file(s) may
be lost. (Filesystem error correction will
also mask these errors at times.)
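
To illustrate the rebuild described above, here is a minimal Python sketch of XOR parity reconstruction for one three-drive stripe. It is not how the controller firmware is actually implemented; the function names, the two-byte block contents, and the use of `None` to model an unreadable block are all assumptions made for the example.

```python
from functools import reduce
from typing import List, Optional

Block = Optional[bytes]  # None models a block that cannot be read (assumption)


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(stripe: List[Block], failed: int) -> bytes:
    """Recreate the failed drive's block from the blocks on the surviving drives."""
    survivors = [b for i, b in enumerate(stripe) if i != failed]
    if any(b is None for b in survivors):
        raise IOError("unreadable block on a surviving drive; stripe cannot be rebuilt")
    return xor_blocks(survivors)


# Second stripe of the diagram: D1 holds Dc, D2 holds the parity Dp, D3 holds Dd.
dc, dd = b"\x0f\xf0", b"\x33\x55"   # made-up block contents
dp = xor_blocks([dc, dd])           # parity as it was written before the failure

# D3 has failed but D1 and D2 are readable: Dd is recovered from Dc and Dp.
recovered = rebuild_stripe([dc, dp, None], failed=2)

# D3 has failed and Dc on D1 is also unreadable: the rebuild cannot continue.
try:
    rebuild_stripe([None, dp, None], failed=2)
    rebuild_ok = True
except IOError:
    rebuild_ok = False
```

With a single-parity layout like this, one unreadable block on a surviving drive is enough to stop the rebuild of that stripe, which matches point 1 above.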

Since the drive firmware has found an error on that sector, it will redirect all new writes for it to a new sector.

However, the drive may not yet have hit the threshold at which it is marked as failed, hence the failure message not showing up. Once the error count builds up, the drive will be failed or will show predictive failure messages, as applicable.
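
The threshold behaviour can be pictured with a small Python sketch. The class name, the status strings, and the threshold of 3 are hypothetical, chosen only for illustration; real thresholds are defined by the drive vendor and are not given in this thread.

```python
from dataclasses import dataclass


@dataclass
class DriveHealth:
    """Tracks media errors against a failure threshold (illustrative model only)."""
    threshold: int          # hypothetical value; real thresholds are vendor defined
    error_count: int = 0

    def record_error(self) -> str:
        """Count one more error and return the resulting drive status."""
        self.error_count += 1
        return self.status()

    def status(self) -> str:
        if self.error_count >= self.threshold:
            return "PREDICTIVE FAILURE"
        if self.error_count > 0:
            return "OK (errors logged)"
        return "OK"


drive = DriveHealth(threshold=3)
history = [drive.record_error() for _ in range(3)]
```

In this model the first two errors are only logged, and the drive is flagged when the third one arrives, which is why a drive with a few Fl Rd Recv errors can still show as healthy.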

As a best practice, it is better to replace such drives with new ones rather than wait for a failure to show up.

Hope this info was helpful.

Cheers!
Shiva

Re: Failed read recovery error

kris rombauts — Thu, 16 Oct 2008 06:12:49 GMT

Fawrell,

In most situations the continuous surface scan that the RAID controller firmware performs will detect inconsistencies and correct them without the user knowing about them. However, there can be situations where one disk hits a read error on the media and the other drive in the pair (assuming RAID 1 as an example here) has a problem in the area where the other copy of that record is stored. That piece of data is then lost, since it is unreadable on both disks. This is an exceptional case.
This is not specific to the HP Smart Array; it is something any RAID controller in the industry can face.

To answer your question about backing up in such a situation: because the RAID controller mirrors at the block level and is unaware of the user data itself, it may well be that the area with the unreadable data does not contain any user data at all.
If that is the case, your backup will be successful; the problem is in unused space and there is no issue with the customer's data.

It is thanks to the continuous surface scan (some call it a parity check) that such areas are detected before any customer data is written to them. When new data is written and that bad area is hit, the read-after-write verification will fail and the data will be written elsewhere.
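
As a rough picture of what such a scan does on a two-disk mirror, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Smart Array firmware logic: each disk is modelled as a list of blocks, `None` stands for a block the drive cannot read, and `scrub_mirror` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Block = Optional[bytes]  # None models a block the drive cannot read (assumption)


def scrub_mirror(disk_a: List[Block], disk_b: List[Block]) -> Tuple[List[int], List[int]]:
    """Scan a two-disk mirror and return (repaired, lost) block indexes."""
    repaired, lost = [], []
    for i, (a, b) in enumerate(zip(disk_a, disk_b)):
        if a is None and b is None:
            lost.append(i)        # no readable copy is left anywhere
        elif a is None:
            disk_a[i] = b         # rewrite the bad copy from the good one
            repaired.append(i)
        elif b is None:
            disk_b[i] = a
            repaired.append(i)
    return repaired, lost


disk_a = [b"A", None, b"C", None]
disk_b = [b"A", b"B", b"C", None]
repaired, lost = scrub_mirror(disk_a, disk_b)
```

Block 1 is repaired silently from the good copy, while block 3 is unreadable on both disks and is the exceptional case where data is actually lost.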

So when this bad spot on the disk media is detected and retries are not able to recover it, it is marked as 'bad' and not re-used anymore unless the array is initialized from scratch again. After that, those areas are ready to be used again, which is not a problem because this is often a soft error (correctable via reformat) and not a hard media error on the platters of the disk.
When the disk reaches a certain number of errors (a threshold set by the disk manufacturer, not by the RAID controller), it will send out a pre-failure alert, which allows the customer to proactively replace the drive.

The downside in such situations is that a rebuild will either stick in the Ready for recovery status, as you indicated, or start but then stop and abort at the point where it hits the inconsistency and no valid, readable copy of a record is available to rebuild the array with 100% success.

So the bottom line is that if this is the situation you hit, a full backup will tell you whether the issue is within user data or not. If the backup fails with a read error, for example, then it is likely that the inconsistency is in user data, but this is more of an exception because of all the mechanisms in place at the disk drive and RAID controller level.
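
If you want to check this before running the real backup, a read-through of every file gives the same information. The Python sketch below is only one possible way to do it; `find_unreadable_files` is a hypothetical helper and the 1 MiB chunk size is an arbitrary choice.

```python
import os
from typing import List, Tuple


def find_unreadable_files(root: str, chunk_size: int = 1 << 20) -> List[Tuple[str, str]]:
    """Read every file under root to the end and collect the ones that fail."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    while fh.read(chunk_size):   # read to end of file, discard data
                        pass
            except OSError as exc:               # I/O errors (and permission errors) land here
                failures.append((path, str(exc)))
    return failures
```

An empty result suggests the unreadable area is not inside any file that was read; a non-empty result lists the files that a backup would also be likely to fail on. Note that files the script has no permission to open are reported as failures too.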

HTH

Kris

Re: Failed read recovery error

fawrell — Thu, 16 Oct 2008 13:12:52 GMT

Thanks for your answers and time.

Kris, just one more question.

"So when this bad spot on the disk media is detected and retries are not able to recover it, it is marked as 'bad' and not re-used anymore ..."

I know that when this error appears during a rebuild, the rebuild will not continue. But when this error appears before the rebuild, and the bad spot on the disk media is marked as 'bad' and not re-used anymore, will the rebuild complete successfully?

Re: Failed read recovery error

e4services — Thu, 16 Oct 2008 13:36:16 GMT

Not usually.
Back up now.
Replace the disks and restore while you still can.

Re: Failed read recovery error

skris — Thu, 16 Oct 2008 20:34:54 GMT

"But when this error appears before the rebuild, and the bad spot on the disk media is marked as 'bad' and not re-used anymore, will the rebuild complete successfully?"

Yes, for the following reasons:

1) A rebuild is triggered after a drive failure.
2) So if a parity error is detected, the
information is recreated from the bits and
pieces available on the remaining drive(s)
and stored in a new location.
3) MSA does periodic scrubs, which
simulate disk I/O and move/recreate
data if a fault is found.

Cheers!
Shiva

Re: Failed read recovery error

e4services — Fri, 17 Oct 2008 03:10:57 GMT

Or no: as you stated, the rebuild fails because of MORE bad sectors.
Back up and restore to new drives as soon as possible.

Re: Failed read recovery error

fawrell — Fri, 17 Oct 2008 09:15:52 GMT

Thanks for the answers again.

One more question (hopefully the last).

When the error occurs on a drive in a RAID 1 array, but the array has only this one drive left, will the rebuild complete successfully after a new drive is inserted?

Re: Failed read recovery error

kris rombauts — Fri, 17 Oct 2008 09:54:34 GMT

It depends ...

1) If the error is a correctable one and an alternate (spare) location is used to store the data, then there is no problem; it is transparent because this is handled at the hard disk level.

2) If it is a hard (unrecoverable) error, then the rebuild will fail: when the rebuild is copying onto the newly inserted disk and reaches the 'bad spot' on the source drive, the read fails and the rebuild halts. You won't be able to get back to a redundant RAID 1, and a backup/re-init/restore will be needed to fix it.
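
Case 2 can be sketched in a few lines of Python. This is a simplified model, not the controller's actual rebuild code: the source drive is a list of blocks, `None` marks an unrecoverable block, and `RebuildHalted` and `rebuild_mirror` are hypothetical names.

```python
from typing import List, Optional

Block = Optional[bytes]  # None models an unrecoverable (hard error) block (assumption)


class RebuildHalted(Exception):
    """Raised when the source drive cannot supply a block during the rebuild."""

    def __init__(self, block_index: int):
        super().__init__(f"rebuild halted at block {block_index}")
        self.block_index = block_index


def rebuild_mirror(source: List[Block]) -> List[bytes]:
    """Copy the surviving drive block by block onto a newly inserted drive."""
    target: List[bytes] = []
    for i, block in enumerate(source):
        if block is None:            # hard error: there is no second copy to fall back on
            raise RebuildHalted(i)
        target.append(block)
    return target


healthy = rebuild_mirror([b"A", b"B", b"C"])   # completes, mirror is redundant again

try:
    rebuild_mirror([b"A", b"B", None, b"D"])
    halted_at = None
except RebuildHalted as exc:
    halted_at = exc.block_index
```

In case 1 the drive itself has already remapped the sector, so the source never returns an unreadable block and the copy runs to the end.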

HTH

Kris