HPE EVA Storage

EVA Disk Failure errors

 
Chikuku
Occasional Contributor


Hello,
We have an EVA 8100 with 2 disk groups of 72 x 300GB drives. A disk failed this weekend, and not only does it appear that the disk group went down ("Disk Group w/no redundancy is inoperative"), but some of the attached servers also saw disk r/w warnings. All errors cleared within minutes and there was no data loss, but my understanding was that attached systems should never see a warning like this.

Can a single disk fail in such a way that the entire disk group becomes inoperative and errors get passed to attached systems? The disk was dropped from the group as expected and no harm was done; however, we don't like clients seeing errors!

Also, getting a warning that a disk group is inoperative can be a bit alarming. Is this normal behavior? Has anyone else experienced this?
Víctor Cespón
Honored Contributor

Re: EVA Disk Failure errors

The message

09c95105; A Disk Group has transitioned to an INOPERATIVE state.

Disk Group with no redundancy is inoperative;

appears on all EVAs after a disk failure. It means that any VRAID 0 (no redundancy) vdisk in that disk group would be inoperative.

This would be an explanation for some servers losing access after a single disk failure.

Another possibility is that the disk was failed because there were a lot of problems on the loops and communication was interrupted several times. The controllers try to reset the loops many times and can fail one or several disks during this process.

Can you get the Controller Event log, compress it and attach it here?
Chikuku
Occasional Contributor

Re: EVA Disk Failure errors

Thanks for your reply. Here is what I received from WEBES:

Evidence:
Local Event Time : Sun 6 Sep 2009 23:16:08 GMT-04:00
Controller Report Time: 06-Sep-2009 22:03:53.123
EVA Log Description:
09c95105: CAC=51 - Storage System Management Interface Entity State Change The state of a Storage System Management Interface entity has changed. A Disk Group has transitioned to an INOPERATIVE state.
Sequence Number: 15756
Rule#: HSV_SCMI_Rule -- V 1.12 Event Code: (09 C9 51 05)


I have attached the controller log. BTW, what do you use to view this? We do not have any VRAID0 disks, so maybe this warning is just informational. Still, why would a server see errors?
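
The WEBES output above shows the same event code in two forms, `09c95105` and `(09 C9 51 05)`. A minimal Python sketch of converting one into the other is below. The function name `split_event_code` is made up for illustration, and nothing here decodes what the individual bytes mean. In this one sample the third byte (51) happens to match the reported CAC=51, but whether that holds for other events is an assumption that would need checking against HP documentation.

```python
def split_event_code(code: str) -> list[str]:
    """Split an 8-hex-digit event code into four upper-case byte strings.

    Hypothetical helper: it only reformats the code and does not
    interpret any of the fields.
    """
    # bytes.fromhex ignores ASCII whitespace, so both "09c95105" and
    # "09 C9 51 05" are accepted; it raises ValueError on non-hex input.
    raw = bytes.fromhex(code.strip())
    if len(raw) != 4:
        raise ValueError(f"expected 4 bytes, got {len(raw)}: {code!r}")
    return [f"{b:02X}" for b in raw]


if __name__ == "__main__":
    # Event code taken from the WEBES report quoted above.
    print(" ".join(split_event_code("09c95105")))
```

This makes it a little easier to search logs or documentation for either spelling of the same code.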
Víctor Cespón
Honored Contributor
Solution

Re: EVA Disk Failure errors

I can't see anything that indicates a loss of access to vdisks.

2009-sep-06 22:02:22 timeouts to 0-01-08
2009-sep-06 22:03:49 0-01-08 missing
2009-sep-06 22:03:53 0-01-08 disappeared.
2009-sep-06 22:06:01 Controllers keep sending LIPs to 0-01-08
2009-sep-06 22:07:05 One virtual disk switches to the other controller
2009-sep-06 22:18:06 Reconstructing or reverting is in progress in LDAD
2009-sep-06 23:21:59 Reconstructing; Status: success;

The whole sequence is very typical, and there are no messages about vdisk data being lost or a RAID reconstruct failing.

As explained above, the "Disk Group with no redundancy is inoperative" message always appears and does not indicate loss of access to VRAID 5 or VRAID 1 LUNs, only to VRAID 0 LUNs.
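
To see how long each phase of the sequence above took, the timestamps can be turned into elapsed times. This is a minimal Python sketch, not the tool used to read the log: the event list is typed in by hand from the summary above (with the month written numerically), and the names `EVENTS` and `elapsed` are made up for illustration.

```python
from datetime import datetime

# Timeline copied by hand from the controller log summary in this reply.
EVENTS = [
    ("2009-09-06 22:02:22", "timeouts to 0-01-08"),
    ("2009-09-06 22:03:49", "0-01-08 missing"),
    ("2009-09-06 22:03:53", "0-01-08 disappeared"),
    ("2009-09-06 22:06:01", "controllers keep sending LIPs to 0-01-08"),
    ("2009-09-06 22:07:05", "one virtual disk switches to the other controller"),
    ("2009-09-06 22:18:06", "reconstructing or reverting in progress"),
    ("2009-09-06 23:21:59", "reconstructing; status: success"),
]


def elapsed(events):
    """Return (timedelta since first event, description) for each event."""
    fmt = "%Y-%m-%d %H:%M:%S"
    times = [datetime.strptime(stamp, fmt) for stamp, _ in events]
    start = times[0]
    return [(t - start, desc) for t, (_, desc) in zip(times, events)]


if __name__ == "__main__":
    for delta, desc in elapsed(EVENTS):
        print(f"+{delta}  {desc}")
```

Host-side error timestamps could be compared against the same window, keeping in mind that the host and controller clocks may not agree.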
Chikuku
Occasional Contributor

Re: EVA Disk Failure errors

Thanks, I am glad to hear confirmation that these are normal warnings. I will take another look at the logs from the server that reported disk errors and try to find other explanations.
Perhaps there is a bad fiber or HBA; it's just strange that the server saw errors at the same time the disk failed.

What did you use to read that logfile? It looks like gibberish in Notepad.
Víctor Cespón
Honored Contributor

Re: EVA Disk Failure errors

It's in binary format. We use an internal tool to translate the events and list them in a table. The software also lets us filter, sort, and analyze the events.
Mel Nugent
Regular Advisor

Re: EVA Disk Failure errors

I had a disk failure in a disk group last night on one of our EVA 8000s. The disk group is 16x450GB and contains 4 LUNs, each of which is VRAID5. These contain one VMware virtual server per LUN. Each of the 4 servers logged disk errors in Event Viewer for approx 45 secs around the time of the disk failure. There is also a 5th LUN, which is a destination LUN for a replicated LUN on our other EVA.
I received "A Disk Group has transitioned to an INOPERATIVE state" and "DR Destination Site preventing acceptable replication throughput" errors, among others.
I have checked the logs and I can see this error for my last failed disk (another newish 450GB) on the other EVA8000. However, going back to other failed disks in the past, I do not see these errors. Shouldn't a disk be able to fail and be replaced without causing errors at the host server level?