HPE EVA Storage

MSA1000 drive failure caused Pool deactivations and Volume dismounts

 
Cameron Todd
Regular Advisor

MSA1000 drive failure caused Pool deactivations and Volume dismounts

I have recently installed and built an entry-level SAN using several DL360 G4 servers (with QLogic HBAs) connected to a new MSA1000 (SAN Switch 2/8 plus two MSA30 shelves).

The servers run NetWare OES (v6.5 SP3), and the firmware on the MSA1000 was updated to the latest versions (FabricOS v3.2.0a, MSA v4.48).

One of the U320 146GB drives failed last night, yet despite the MSA selecting a hot spare and starting an array rebuild as would be expected, every Pool and Volume on the NetWare server deactivated with "device failure" messages.

I was under the impression that the point of having a RAID array was that a drive failure would be seamlessly repaired and that functionality would not be impaired (only slowed a little depending on the priority of the Rebuild setting).
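
To illustrate what I expected: with RAID5, any single missing block in a stripe can be rebuilt from the surviving blocks plus the parity, so a read should still be satisfiable while the hot spare rebuilds. Here is a toy sketch of the parity maths in Python - purely illustrative, with made-up block contents, and obviously nothing to do with how the MSA firmware actually implements it:

from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR across equal-length blocks
    return bytes(reduce(lambda a, b: a ^ b, stripe_bytes) for stripe_bytes in zip(*blocks))

# One stripe on a 4-drive RAID5 set: three data blocks plus their XOR parity
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] "fails"; its block is rebuilt from the survivors and the parity
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]   # the read can still be serviced, just a little more slowly
print(rebuilt)              # b'BBBB'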

I do not understand why Pools residing on totally separate arrays (I have defined 4 separate RAID5 arrays across the MSA cabinets) also failed, nor why the server had to be power cycled in order to allow any volumes at all to be seen and mounted by clients again.

There is no redundancy built into the SAN infrastructure (no secondary SAN switch, duplexed fibres or HBAs), but even if there were, I assume this device failure would still have occurred: the problem appears to have been the MSA controller failing a low-level NetWare OS disk-access request when the drive failed, rather than responding with the requested data while repairing the fault in the background.

Any ideas on what is happening here?

Is this a configuration error or a fault in the MSA?
5 REPLIES
Cameron Todd
Regular Advisor

Re: MSA1000 drive failure caused Pool deactivations and Volume dismounts

Some additional info...
According to a respondent on the Novell forums, this problem has been seen before with MSA1000s, and I was advised to contact HP Support.
I have now done this; however, the HP engineer has not heard of it.

Anyone else ever had a server lockup when a single disk in a RAID5 array has failed?

This fundamental flaw means our whole DAS-to-SAN migration programme is halted until the cause can be determined and the problem resolved.

As a next step the only thing I can think of currently is to deliberately pull another disk to see if the identical fault reappears.
Cameron Todd
Regular Advisor

Re: MSA1000 drive failure caused Pool deactivations and Volume dismounts

Stranger than fiction.

I pulled a disk from one of the arrays deliberately and as expected it started a rebuild using one of the hot spares. This time though there was no 'device failure' error on the host server and read and write access continued unaffected.

A relief, but it doesn't explain the first-time failure.
There has been no word on the HP Support case I raised either.
Guess I'll have to put it down to gamma rays.
SAKET_5
Honored Contributor

Re: MSA1000 drive failure caused Pool deactivations and Volume dismounts

G'day Cameron,

Very strange indeed! The fact that you won't be able to reproduce the problem means it's going to be hard to get a resolution from HP as well. In the early days of our EVA implementations, during an online Vdisk expansion presented to a Windows host, the whole Vdisk was lost - as in disappeared! No response from HP, and we were never able to reproduce that problem either. Mind you, since then I have done the same procedure heaps of times and have never had any problems.

So, not that I am helping you in any way with this message, I just wish you good luck in finding a resolution - let us know how you go with the case. Were there any other hosts (running different OSs) accessing storage from the MSA, or was the error only seen on the Novell boxes? Just trying to isolate whether it really was an issue with the MSA1000 itself rather than with Novell, or with the Novell-plus-MSA1000 combination.

Regards,
Saket.

Cameron Todd
Regular Advisor

Re: MSA1000 drive failure caused Pool deactivations and Volume dismounts

Hi Saket,

At the time there was only one NetWare OES server connected to the SAN switch and powered up. It didn't even have a configuration loaded or saved at that time.

Since then we have added a Win2003 server and an ISL to another MSA1000 system, so there are now about 7 hosts on the one fabric (an equal mix of NetWare 6.x and Win2003).


One other change I had made since the failure was updating the NetWare ACU from V2.76 to V2.77 (the latest), although there were no listed fixes of any magnitude or import in the latter's release info.

Can't see how that would have had much of an effect, although the upgrade did fix one noticeable fault: every time I exited the ACU, one (random) disk in the cabinet would be left flashing its fault light - disconcerting for the other staff, who would see it and panic, thinking a drive had failed.

Rich Chodacki
Occasional Advisor

Re: MSA1000 drive failure caused Pool deactivations and Volume dismounts

Just ran into this older post while searching for the reasons behind a similar problem. I did not have a drive failure, but I started an expansion. The MSA became unresponsive, all the drive lights started flashing (amber, as if Identify Drives had been run) and volumes started being disconnected from their hosts. No errors could be found via HP System Manager either.
One suspicion is that since I have 23 volumes on that controller, over a dozen hosts, AND the rebuild/expand priorities were both set to HIGH, the controller ran out of resources and didn't actually disconnect the hosts; rather, the hosts timed out because of the slow response. Interesting, though, that I then rebooted the controller and the system is running like new. Has anyone heard of memory leaks associated with firmware version 4.32?
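
A rough way to picture that theory - a toy queueing sketch in Python with made-up numbers, purely illustrative and nothing to do with the controller's real internals. Once the rebuild/expand work at HIGH priority eats enough of the controller, host I/O latency blows up past the host's command timeout and the host flags the device as failed, even though the controller never actually dropped it:

HOST_TIMEOUT_MS = 30_000   # assumed host-side I/O timeout, not a measured value
SERVICE_MS = 5             # assumed service time per host I/O on an idle controller

def avg_latency_ms(offered_load, capacity_left):
    # Crude M/M/1-style estimate: latency explodes as load approaches what's left
    utilisation = offered_load / capacity_left
    return float("inf") if utilisation >= 1 else SERVICE_MS / (1 - utilisation)

for rebuild_share in (0.2, 0.8, 0.95):   # fraction of the controller eaten by rebuild/expand
    latency = avg_latency_ms(offered_load=0.6, capacity_left=1 - rebuild_share)
    print(rebuild_share, latency, "host times out:", latency > HOST_TIMEOUT_MS)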