Disk Arrays

Remapping Media Errors when Rebuilds Fail

Greg Carlson
Honored Contributor

Hello All,

I wanted to get some opinions on this.
The scenario is a hardware RAID in a NetServer with a NetRAID controller, running RAID 5 with 3 or more HDDs:
Ch 0
ID 0 Online A0-0 10 media errors
ID 1 Online A0-1 no errors
ID 2 Failed A0-2 no errors

ID 2 is marked as failed, and the rebuild of ID 2 fails at 30% on multiple replacement HDDs. On closer inspection, one of the HDDs that is still in an Online state, ID 0, has 10 media errors.

Some techs recommended running a disk verify from a Symbios or Adaptec SCSI controller to remap the bad blocks (the verify also shows them at 30%). Once the remapping is complete, the rebuild runs to 100%. This resolves the failed rebuild issue.

However, the question is: what is the status of the data on the bad blocks? If the data on the bad blocks was corrupt, are you introducing corruption into the array, because you essentially have two failed blocks in the data stripe? If the data is OK you should be OK, but how do you know? Any thoughts on this?

Is there a possibility of corruption in the backups themselves as well? I am also looking for thoughts on why the HDD with the media errors stays in an online state while another HDD is being marked as failed.

Cheers,
Greg

P.S. I had also posted this under NetServers but have only had one reply.
Lets Roll!
2 REPLIES
Vincent Fleming
Honored Contributor

Re: Remapping Media Errors when Rebuilds Fail

A successful remapping of a bad block indicates that the data was successfully read from the bad block, or reconstructed using the drive's error-correcting code (ECC), and written to another area of the media.

If the data was too badly corrupted, the remapping would have failed.

The HDD with the media errors can be kept online because media errors are normally recoverable. Non-recoverable errors cause failed drives. Some RAID controllers are more conservative than others, so a different RAID controller might have failed the drive. (For example, XP Disk Arrays are very conservative and will fail any drive with media errors.)

I would replace the drive with the media errors anyway, if I were you; media errors are usually an indication that the drive is going to fail soon.

Good luck!
No matter where you go, there you are.
Michael Lampi
Trusted Contributor

Re: Remapping Media Errors when Rebuilds Fail

The BIOS-level disk verification performed by Adaptec and Symbios SCSI controllers does not care about data. It only cares that each block on the disk drive can be read.

If a given block cannot be read, i.e., media errors are encountered, then that block address is remapped to one of the spare blocks on the drive.

So, yes, it is possible that taking the drive with media errors from your degraded RAID to a PC with an Adaptec or Symbios controller will cause data corruption for those blocks that get remapped by the verification process. The replacement disk block will not have the data from the old, now superseded, RAID-formatted disk block.
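
To make the "two failed blocks in one stripe" concern concrete, here is a minimal sketch of RAID 5 parity arithmetic on a single three-drive stripe. It is only an illustration: the block contents are made up, and it assumes the spare block reads back as zeros after the remap (what a real drive returns there is not something I can vouch for).

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (the RAID 5 parity operation)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# One stripe across three drives. In RAID 5 the three blocks XOR to zero,
# so any single missing block equals the XOR of the other two.
block_id0 = b"block on ID 0"                   # drive with the media errors
block_id1 = b"block on ID 1"                   # healthy drive
block_id2 = xor_blocks(block_id0, block_id1)   # what the failed ID 2 held

# Rebuild with both surviving blocks readable: ID 2 comes back intact.
rebuilt_clean = xor_blocks(block_id0, block_id1)

# Rebuild after the offline verify remapped the unreadable block on ID 0.
# Assumption: the spare block now reads as zeros.
remapped_id0 = bytes(len(block_id0))
rebuilt_after_remap = xor_blocks(remapped_id0, block_id1)

print(rebuilt_clean == block_id2)        # rebuild is correct
print(rebuilt_after_remap == block_id2)  # rebuild "succeeds" with wrong data
```

In this sketch the rebuild finishes in both cases, but after the remap both the ID 0 block and the regenerated ID 2 block differ from what the stripe originally held, and the controller has no way to notice.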

From the sound of it, I would guess that the media error(s) encountered at the 30% threshold are preventing the successful rebuild of the RAID.

If you have irreplaceable data on this RAID, you might consider changing the retry thresholds for the disk drive first, and see if you can recover the RAID. This typically requires SCSI tools unavailable to the normal end user, such as SCSI Toolbox.

Otherwise, do the offline verify, return the drive to the RAID system, and hope for the best.

By the way, the verification process will display which disk blocks are being remapped. It is (remotely) possible for you to track these blocks into the file system to see which file(s) and/or directories are affected by the now corrupt data.
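
As a rough idea of what that tracking involves, the sketch below converts a block number on one member drive into a block number on the logical RAID 5 volume. Everything about the layout is an assumption for illustration: a left-asymmetric parity rotation, a 128-block stripe unit, and no controller metadata at the start of the drives. The real NetRaid layout may well differ, and the function name is hypothetical.

```python
from typing import Optional

def member_to_logical_block(disk: int, member_lba: int,
                            num_disks: int = 3,
                            stripe_unit: int = 128) -> Optional[int]:
    """Map (member disk, block on that disk) to a logical array block.

    Assumes a left-asymmetric RAID 5 layout: parity starts on the last
    disk and moves one disk to the left on each successive stripe row.
    Returns None when the block holds parity rather than data.
    """
    row = member_lba // stripe_unit        # stripe row on the member disk
    offset = member_lba % stripe_unit      # position inside the stripe unit
    parity_disk = (num_disks - 1) - (row % num_disks)
    if disk == parity_disk:
        return None                        # parity block, no logical address
    # Data units fill the non-parity disks in ascending disk order.
    data_index = disk if disk < parity_disk else disk - 1
    return (row * (num_disks - 1) + data_index) * stripe_unit + offset

# Hypothetical remapped blocks reported by the verify on ID 0:
for lba in (5, 130, 300):
    print(lba, member_to_logical_block(0, lba))
```

The logical block number would then still have to be translated through the partition table and the file system's own allocation structures to reach a file name, which is why I call it only remotely possible.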
A journey of 1000 steps ends in a mile.