
12H requires manual intervention on failure

SOLVED
Marco Paganini
Respected Contributor


Hello All,

I have a 12H with 4x2GB disks. The configuration as returned by arraydsp is:

Total Physical: 8135MB
Allocated to LUNS: 3837MB
Used as Active Hot Spare: 2033MB
Used for Redundancy: 2229MB
Unallocated: 0MB

If one of the data disks fails on this array (as happened today), the entire array stops and no redundancy kicks in!

I see a lot of "active spare space unavailable" messages in the syslog, but so far nobody has been able to give me a comprehensive explanation of this situation.

I imagine the Hot Spare space is used to make room for RAID 0/1 (dynamically allocated by the disk array) and to serve as a regular disk (with decreased performance) in case of a failure. However, that's not what I observe here at all.

Any ideas?
Keeping alive, until I die.
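
For readers who want to check their own array the same way, the capacity summary quoted in the post can be turned into a quick percentage breakdown. This is only a minimal sketch: it assumes the summary has been saved as plain text in the "label: NNNMB" form shown above, and `parse_capacity` and `shares` are hypothetical helper names, not part of any HP tool.

```python
import re

# Capacity summary exactly as quoted in the post above.
SUMMARY = """\
Total Physical: 8135MB
Allocated to LUNS: 3837MB
Used as Active Hot Spare: 2033MB
Used for Redundancy: 2229MB
Unallocated: 0MB
"""

def parse_capacity(text):
    """Return {label: megabytes} for every 'label: NNNMB' line."""
    sizes = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*(\d+)\s*MB\s*$", line)
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes

def shares(sizes, total_label="Total Physical"):
    """Return each category as a percentage of the total, to one decimal."""
    total = sizes[total_label]
    return {label: round(100.0 * mb / total, 1)
            for label, mb in sizes.items() if label != total_label}

if __name__ == "__main__":
    sizes = parse_capacity(SUMMARY)
    for label, pct in shares(sizes).items():
        print(f"{label}: {sizes[label]} MB ({pct}%)")
```

With the figures from the post, the hot spare takes about a quarter of the physical space and the unallocated share is 0%.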
11 REPLIES
Steve Labar
Valued Contributor

Re: 12H requires manual intervention on failure

The 12H uses Unallocated space for RAID 0/1. Hot Spare space should not be used until a disk fails. Have you verified that Auto Rebuild is enabled? The only other thing I can think of is that the space allocated to the hot spare had too many bad sectors to handle the load. You can run arraylog -d {slot_id} {array_id} to get the status of the installed disks. Another problem could be if you have 2 failed disks in the array.

Good Luck.

Steve
S.K. Chan
Honored Contributor
Solution

Re: 12H requires manual intervention on failure

I would first check (per Steve's suggestion) the "auto rebuild" flag to make sure it's enabled.
# arraydsp -a
The active hot spare, when activated, basically does its rebuild by distributing the data across all disks in the array and using the space for RAID 1 mirroring. I do not know if it would make any difference to reserve some space (say about 1-2% of your total physical space) as "unallocated". My thinking is that it should not, but I always leave some unallocated space behind and so far it hasn't complained. It is also very likely that you have some bad sectors on the disks. This can be determined by running STM's exerciser to make sure all the disk modules are in good shape.
Bill McNAMARA_1
Honored Contributor

Re: 12H requires manual intervention on failure

The active spare needs to be as big as the biggest disk that can fail. If an 18G disk failed (and you had 2 installed) and all the rest were 9G, then on failure of the 18G drive the hot spare is lost, since the capacity of the remaining 18G drive is no longer really 18G but is taken as 9G. You won't lose data, but you will lose redundancy.

Another reason HS will fail is if bad blocks are noticed on rebuild.


Later,
Bill
It works for me (tm)
Marco Paganini
Respected Contributor

Re: 12H requires manual intervention on failure

Hello Steve, Bill and Chan,

AUTOREBUILD is set to on. I noticed however that most of my drives have a "Grown Defect List", which is a Bad Thing(tm).

I suppose having bad blocks in the Hot Spare is dangerous since during a failure no space would be available to rebuild the array. However, my data disks ALSO have bad blocks :( I wonder if I should replace the Hot Spare or ALL my drives with errors...

Paga
Keeping alive, until I die.
S.K. Chan
Honored Contributor

Re: 12H requires manual intervention on failure

You have a very valid reason to be worried. From my understanding, the GDL represents the list of blocks that were relocated, and the more entries you see, the higher the risk of disk failure in the future. I have just run ..
# arraylog -d A1
A1 being the disk module in slot A1. I repeated that on all my disks and I do not see any entries in the GDL. What does your ..
"corrected read/write errors with delay" say? They should be ZERO. If you're seeing some numbers in there, it means read/write recovery on your disk needed delayed retries, maybe due to bad blocks/sectors or whatever. Also take a look at "total uncorrected read/write error"; they should also be zero. If they are not, you have got to give your local HP response center a call and find out what they have to say. I can bet you 2 things ..
- Load the latest firmware if STM does not indicate an I/O error
- Replace the disks if STM sees I/O errors.
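
The checks described in this reply can be written down as a small decision helper. The sketch below only encodes the rules of thumb stated above; it does not read anything from the array. The `DiskCounters` fields and the `review` function are hypothetical names, and the numbers have to be copied by hand from the arraylog output for each slot.

```python
from dataclasses import dataclass

@dataclass
class DiskCounters:
    """Per-disk figures copied by hand from the log (hypothetical field names)."""
    slot: str
    gdl_entries: int            # entries in the Grown Defect List
    corrected_with_delay: int   # corrected read/write errors with delay
    total_uncorrected: int      # total uncorrected read/write errors

def review(disk, stm_reports_io_error=False):
    """Return a list of notes following the rules of thumb from the reply above."""
    notes = []
    if disk.gdl_entries > 0:
        notes.append("GDL has entries: watch whether the list grows")
    if disk.corrected_with_delay > 0 or disk.total_uncorrected > 0:
        notes.append("non-zero error counters: call the HP response center")
        if stm_reports_io_error:
            notes.append("STM sees I/O errors: expect a disk replacement")
        else:
            notes.append("STM shows no I/O error: expect a firmware update first")
    return notes

if __name__ == "__main__":
    # Illustrative values only, not taken from a real array.
    for disk in (DiskCounters("A1", 0, 0, 0), DiskCounters("A2", 12, 0, 3)):
        print(disk.slot, review(disk) or ["nothing to report"])
```

A disk with only GDL entries and clean counters gets a single "watch it" note, which matches the advice to keep monitoring rather than replace immediately.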
Steve Labar
Valued Contributor

Re: 12H requires manual intervention on failure

Good luck finding replacement disks for what you have, too. In the last year, I haven't been able to find any drives smaller than 9.1GB, and even those are becoming hard to come by. Be careful if you do upgrade to a larger drive. If you install dissimilar size drives, you MUST install at least 2 to fully utilize the space. If you have 4 2GB drives and 1 9GB drive, the RAID will only use the first 2GB of the 9GB drive. If you have a large Grown Defect List, I would try to replace those drives as soon as possible. The list is only going to get bigger and eventually the entire drive will fail.
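
One way to read the mixed-size rule, together with Bill's 18G/9G example earlier in the thread, is that each disk is used only up to the size of the second-largest disk in the set. That reading is an assumption made for illustration, not documented array behavior, and `usable_sizes` is a hypothetical helper name. A minimal sketch under that assumption:

```python
def usable_sizes(disk_sizes_gb):
    """Cap every disk at the second-largest size in the set (assumed rule)."""
    if len(disk_sizes_gb) < 2:
        return list(disk_sizes_gb)
    cap = sorted(disk_sizes_gb, reverse=True)[1]  # second-largest disk
    return [min(size, cap) for size in disk_sizes_gb]

if __name__ == "__main__":
    print(usable_sizes([2, 2, 2, 2, 9]))   # one large drive among small ones
    print(usable_sizes([18, 18, 9, 9]))    # two large drives installed
    print(usable_sizes([18, 9, 9]))        # same set after one 18G drive fails
```

Under this assumed rule the single 9GB drive counts as 2GB, two 18G drives are both used in full, and the surviving 18G drive drops to 9G once its partner fails, which is consistent with both descriptions in the thread.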

Good Luck.

Steve
Marco Paganini
Respected Contributor

Re: 12H requires manual intervention on failure

Hello Chan,

All my "Corrected with delay" totals are zeroed. I did notice however that all of the errors have been ECC corrected. I suppose the read (or write) was performed successfully, but the block was subsequently marked as bad and won't be reused. Is that correct?

Cheers,
Paga
Keeping alive, until I die.
S.K. Chan
Honored Contributor

Re: 12H requires manual intervention on failure

All my "Corrected with delay" totals are zeroed. I did notice however that all of the errors have been ECC corrected.
==> That's good, at least the integrity of the drives is still intact.
I suppose the read (or write) was performed successfully, but the block was subsequently marked as bad and won't be reused. Is that correct?
==> That's what I think happened. Still, if a GDL exists you have to monitor it and see whether it grows over time.
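
Watching the GDL over time comes down to comparing entry counts between two checks. The sketch below assumes the counts per slot have been noted by hand after each check; `gdl_growth` is a hypothetical helper and the slot values are illustrative, not taken from this array.

```python
def gdl_growth(previous, current):
    """Return {slot: increase} for every slot whose GDL count went up."""
    return {slot: count - previous.get(slot, 0)
            for slot, count in current.items()
            if count > previous.get(slot, 0)}

if __name__ == "__main__":
    # Illustrative counts from two checks, keyed by slot.
    last_month = {"A1": 3, "A2": 0, "B1": 7}
    this_month = {"A1": 5, "A2": 0, "B1": 7}
    for slot, increase in gdl_growth(last_month, this_month).items():
        print(f"slot {slot}: GDL grew by {increase} entries")
```

A slot that was not recorded in the earlier check is treated as starting from zero, so any entries on a newly tracked disk show up as growth.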

Have you tried running the STM exerciser on one of the disks, just to find out whether the diagnostic tool on the OS side reports any errors?
Marco Paganini
Respected Contributor

Re: 12H requires manual intervention on failure

Hi,

The problem with STM is that it "sees" the array as a single thing. I'm using drivetest/dteststat to test disk by disk.

Paga
Keeping alive, until I die.
S.K. Chan
Honored Contributor

Re: 12H requires manual intervention on failure

Yeah, you are right, the most STM can do is run against each LUN, not each individual disk. I overlooked that fact, my apologies. The most you can find out from STM is which LUN fails the exercise test. Drivetest would be a good one to use, and even if it does not produce any errors, I would still stick to my 2nd reply, i.e. call HP; "no redundancy kicks in" is a good enough reason for you to do that.
Marco Paganini
Respected Contributor

Re: 12H requires manual intervention on failure

Hello,

Oh yes, that's the first thing I did. I called HP and spoke to three technicians about it. They were all very kind as usual, but I couldn't get a simple response like yours (check the number of bad blocks on the disk).

I got responses ranging from "It's like magic inside the array, we can't tell what's going on" to recommendations like "You must have more than 50% of your space unallocated..."

Sad...

Keeping alive, until I die.