Re: Problem with re-mirroring RAID 1+0 on MSA 1000 due to bad block

Kelvin K. Y. Tang · ‎05-30-2010

I have an Oracle cluster consisting of 2 ML370 PC servers sharing an MSA 1000 disk array with 10 disks (5+5) configured as RAID 1+0. The logical drives set up on the disk array are OCFS (Oracle Cluster File System; I think it is equivalent to a raw partition from Windows' point of view).

One day, I stopped the entire cluster and took 5 disks out of the MSA 1000 (i.e., broke the mirror) to serve as a backup of the logical drives. Then I upgraded the Oracle software on the cluster and updated the data on the logical drives. After that I inserted 5 new drives into the MSA 1000, expecting the RAID 1+0 mirror to be re-formed. However, one logical drive never finished remirroring. After I sent the ADU log to HP, HP informed me that one of the 5 source disks had a bad block, and the HP engineer said that this was the reason the remirroring failed. I eventually had to replace the disk and rebuild the entire cluster.

My questions are:
1. Is it normal that a bad block could exist on a disk in the MSA 1000 without being detected and reported by the MSA 1000?
2. Is there a way for me to monitor for bad blocks on the disk(s)? I could not rely on ADU, because the ADU output is not directly human-readable and had to be analysed by HP using their own internal software.

Thanks.

Clarete Riana · ‎05-31-2010

Disk drive firmware should ideally detect and report bad blocks. This should then be reflected by the MSA controller in the status of the disks/logical volumes. This status should also ideally be propagated to any utility you may have to monitor the controller. Had the bad block appeared while the mirror set was in place, the controller should have detected and corrected it. In this case, I believe the bad block appeared after the mirror was broken, so the controller could not repair it. It is always risky to update data while the mirror set is broken, as each remaining disk is a single point of failure and data loss is very likely.
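To illustrate the point about the broken mirror, here is a toy Python model. It is not MSA firmware logic; the function names and the use of `None` for an unreadable block are assumptions made for the sketch. It only shows why a bad block can be repaired while the mirror partner is present, but stops a rebuild when the damaged disk is the only source.

```python
from typing import List, Optional

Disk = List[Optional[bytes]]  # None stands for an unreadable (bad) block


def read_block(primary: Disk, mirror: Optional[Disk], i: int) -> bytes:
    """Read block i, falling back to the mirror copy if the primary is unreadable."""
    data = primary[i]
    if data is None and mirror is not None:
        data = mirror[i]
        if data is not None:
            # Mirror intact: the good copy can be written back to the primary.
            primary[i] = data
    if data is None:
        raise IOError(f"unrecoverable read error at block {i}")
    return data


def rebuild(source: Disk) -> Disk:
    """Copy every block from the only surviving disk onto a new blank disk."""
    return [read_block(source, None, i) for i in range(len(source))]
```

With a mirror present, `read_block` returns the good copy and repairs the primary. With the mirror removed, `rebuild` raises as soon as it reaches the bad block, which matches the never-finishing remirror described above.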

Kelvin K. Y. Tang · ‎05-31-2010

1. Does the word "ideally" mean "normally"?

2. Is there a way (preferably a tool that I could run online, without taking down the system) for me to check for bad blocks on the disk(s) before I break the mirror? I could not rely on ADU, because the ADU output is not directly human-readable and had to be analysed by HP using their own internal software.

Thanks.

PVD · ‎05-31-2010

There is a feature in the MSA called Dynamic Sector Repair (DSR), which performs a surface analysis (disk scrubbing) in the background and handles unreadable blocks by reallocating them.
DSR does not start until there have been 3 seconds with no I/O to the controller.
On a system with heavy I/O, this can slow down or even stop the progress of DSR.
You may choose to decrease this delay from 3 seconds to 2 or 1 second (this can be done from the ACU or the CLI).
Alternatively, you could quiesce the I/O to let DSR complete.
Also ensure that you are on a good firmware level on the MSA.
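The effect of the idle delay can be estimated with a small Python sketch. This is only a rough model under stated assumptions: the function name and parameters are hypothetical, the observation period starts at time 0, and the scan is assumed to begin `idle_delay` seconds after the last I/O and to stop at the next one. The real controller behaviour may differ.

```python
from typing import Iterable


def scrub_seconds(io_times: Iterable[float], end_time: float,
                  idle_delay: float = 3.0) -> float:
    """Estimate how many seconds a background scan could run.

    io_times   -- arrival times of I/O requests, in seconds from time 0
    end_time   -- end of the observation period, in seconds
    idle_delay -- idle time required before the scan starts (assumed model)
    """
    total = 0.0
    last = 0.0  # assumption: the period begins with the controller just busy
    for t in sorted(io_times):
        # Only the part of each gap beyond the idle delay is usable.
        total += max(0.0, t - last - idle_delay)
        last = t
    total += max(0.0, end_time - last - idle_delay)
    return total
```

For example, with one I/O every 2 seconds over a minute, `scrub_seconds(range(0, 61, 2), 60)` gives no scan time at all with the 3-second delay, while `idle_delay=1.0` gives about half of the period.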

Kelvin K. Y. Tang · ‎06-03-2010

From the ADU log, it seems that DSR had been running on my MSA1000. Moreover, the ADU log seems to say that the unrecoverable disk sector read error was detected only a few days ago (maybe after the mirror was already broken). Is it normal that the MSA1000, after detecting an unrecoverable disk sector read error, does not report it but only saves that information somewhere for ADU to read? Is there a way to monitor for, or be alerted to, unrecoverable disk sector read errors (rather than waiting until we run ADU and pass the ADU log to HP for analysis)?

Thanks.

gregersenj · ‎06-08-2010

The Smart Array controllers fail to rebuild whenever there's a problem with the source disks.
I would expect most controllers to do so.

The best way (in my opinion) to check the disks prior to breaking the mirror is to check all physical drives using the System Management Homepage.

BR
/jag

Kelvin K. Y. Tang · ‎06-10-2010

Can someone tell me what to look for in the System Management Homepage?

I have checked "Logs -> Integrated Management Log" and could not see any log message reporting the detection of the bad block.

I have gone into "Home -> Storage -> Storage System SGM0743DP5 (1)" (i.e., the MSA1000) and then clicked on each Physical Drive. I could see some drives having a small non-zero value for "Hard Read Errors" and "Recov Read Errors". Is that a problem? Or is it a problem only if I see a non-zero value for "Failed Recovery Reads"?

Thanks.

gregersenj · ‎06-21-2010

The hard read errors could be the root cause of the RAID failing to rebuild.
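As a rough aid for checking before breaking a mirror, the counters shown on each Physical Drive page could be typed into a small Python script that lists drives worth a closer look. This is only a sketch: the values are entered by hand (no HP API is used), the drive labels are hypothetical, and the choice of which counters to watch is an assumption. A flagged drive is a candidate for review, not necessarily for replacement.

```python
from typing import Dict

# Counter names as displayed on the Physical Drive page; watching these
# two is an assumption, not an HP recommendation.
WATCHED = ("Hard Read Errors", "Failed Recovery Reads")


def drives_to_review(counters: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Return, per drive, the watched counters that are non-zero."""
    flagged = {}
    for drive, values in counters.items():
        nonzero = {name: values[name] for name in WATCHED
                   if values.get(name, 0) > 0}
        if nonzero:
            flagged[drive] = nonzero
    return flagged


# Hypothetical readings copied by hand from the drive pages.
readings = {
    "Bay 1": {"Hard Read Errors": 0, "Recov Read Errors": 2, "Failed Recovery Reads": 0},
    "Bay 2": {"Hard Read Errors": 1, "Recov Read Errors": 0, "Failed Recovery Reads": 0},
}
print(drives_to_review(readings))
```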

BR
/jag

Kelvin K. Y. Tang · ‎06-21-2010

In the RAID set that I have finished reinitialising and rebuilding, I still found one disk with a Hard Read Error count of 1 (as shown in the attachment). Does that mean that this particular disk needs to be replaced?

Thanks.

gregersenj · ‎06-29-2010

I would like a second opinion on that.

BR
/jag
