Questionable Best Practice MSA2012 disk layout (resilience-wise)

Late last year I read as much as I could find on best practice for configuring an MSA2000 with additional shelves to support a reasonably busy Exchange 2003 cluster. We've now had a rather busy week in DR mode due to multiple PSU failures (which I agree is probably fairly uncommon), and it raises a serious question in my eyes.

Basically, the disks were configured as described in the attached JPG, striping "vertically" across enclosures rather than horizontally. Database VDisks 1 & 2 each had two volumes created and presented to hosts, as did TL VDisks 1 & 2. However, as the disks were there for performance (spindle count) rather than capacity, these VDisks were under 50% utilised from a capacity perspective.

We had an unusual situation: a double PSU failure (one each in enclosures .2 & .3), for which we logged a hardware call. By the time the replacement PSUs turned up, a third had died (the redundant one in enclosure .3), resulting in enclosure .3 taking its disks offline.

When power was restored, all of the disks in enclosure .3 (bottom in the diagram) were in "LeftOver" state and all VDisks were "Critical" but still online.

From the OS perspective (Windows Server 2003), all volumes appeared to be online with the exception of one hosted on DB VDisk 2.

To be fair, for these VDisks to still be online with so many physical disks offline is actually pretty impressive. I would imagine it can only be due to the low capacity utilisation (i.e. plenty of spare capacity to stripe all the data).

However, the fact that all the LeftOver disks had to have their metadata wiped before they could be added back into the VDisks seems to be a massive issue in my opinion! Effectively, following the best-practice disk layout has left us with a "split-brain" type VDisk, which would most likely have failed completely had we utilised more of it.

This makes me wonder if we're not better off (from a resilience perspective) striping horizontally *across* each enclosure rather than vertically *down* through the enclosures.
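To make the trade-off a bit more concrete, here is a rough Python sketch of what a whole-enclosure loss does under each layout. It is only a toy model: the three enclosures, twelve slots per enclosure, one-VDisk-per-column arrangement and single-disk fault tolerance are all my assumptions for illustration, not the actual layout in the JPG, and the function names are made up.

```python
from collections import Counter

def build_layout(enclosures, slots, orientation):
    """Map each (enclosure, slot) position to a VDisk id.

    'horizontal': one VDisk per enclosure (all members in one shelf).
    'vertical':   one VDisk per slot column (one member per shelf).
    Both arrangements are simplified assumptions, not a real MSA config.
    """
    if orientation not in ("horizontal", "vertical"):
        raise ValueError("orientation must be 'horizontal' or 'vertical'")
    layout = {}
    for enc in range(enclosures):
        for slot in range(slots):
            layout[(enc, slot)] = enc if orientation == "horizontal" else slot
    return layout

def disks_lost_per_vdisk(layout, failed_enclosure):
    """Count how many member disks each VDisk loses with one shelf down."""
    return Counter(
        vdisk for (enc, _slot), vdisk in layout.items()
        if enc == failed_enclosure
    )

def classify(layout, failed_enclosure, tolerance):
    """Split affected VDisks into (degraded, offline).

    'tolerance' is the number of member disks a VDisk can lose and stay
    online (assumed 1 here, as with single parity).
    """
    lost = disks_lost_per_vdisk(layout, failed_enclosure)
    degraded = sorted(v for v, n in lost.items() if n <= tolerance)
    offline = sorted(v for v, n in lost.items() if n > tolerance)
    return degraded, offline

if __name__ == "__main__":
    for orientation in ("vertical", "horizontal"):
        layout = build_layout(enclosures=3, slots=12, orientation=orientation)
        degraded, offline = classify(layout, failed_enclosure=2, tolerance=1)
        print(orientation, "degraded:", degraded, "offline:", offline)
```

Under those assumptions the vertical layout leaves every VDisk degraded but none offline, which is roughly what we saw, while the horizontal layout takes one VDisk completely offline and leaves the others untouched. It says nothing about the LeftOver/metadata behaviour, which is the part that actually worries me.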

What experience have others had?

I have an earlier post (with no replies) describing another MSA we have, which is configured with one VDisk per enclosure. I was asking about reconfiguring it to best practice, but now I have serious doubts and wondered what others thought. The problems we had kicked in the day after I posted that message!

We've since had a fourth PSU fail in the same MSA, so I've raised a question with HP about reliability, bad batches etc., as there's no evidence of any underlying power issues in our data centre.

Any opinions appreciated...