HPE EVA Storage

EVA Disk Failure & Corrupt Files

 
Dave La Mar
Honored Contributor


Last week on our EVA 5000 we encountered excessive disk errors on one disk in a particular disk group.
The Vraid level was RAID 5. We have had disk failures in the past without issue, but while this disk was failing we encountered numerous file corruptions on logical volumes in this disk group. The disk was eventually ungrouped, but today we are still encountering corrupted files, which we assume are fallout from last week's disk failure.
Needless to say, this is quite a concern. The EVA log files turned over to HP show nothing definitive other than that the file corruption and the disk failure coincided in time.
No pathing errors were indicated, and there were no related syslog errors.

Has anyone encountered this and found a definitive cause for the file corruptions?

Files were for an Oracle 9i db which has been in place for quite some time.

Thanks for any and all input.

Regards,

dl
"I'm not dumb. I just have a command of thoroughly useless information."
14 REPLIES
Paul Henderson_2
Frequent Advisor

Re: EVA Disk Failure & Corrupt Files

By definition, this shouldn't happen. The EVA is reading and writing bits... it is the filesystem that is controlling the file.

You don't specify the OS or filesystem, but if the filesystem provides the capability, you may be able to rebuild the filesystem and recover the files. Most filesystems provide at least some rudimentary capability to recover corrupted filesystems.
Dave La Mar
Honored Contributor

Re: EVA Disk Failure & Corrupt Files

Thanks Paul. Yes, we are able to recover the files, rebuild indexes, etc.
The OS is HP-UX 11i. The OS sees no issues, but corruption is seen at the Oracle level.
I'm really pressed to understand how this occurred with the EVA.

Regards,

dl

P.S. As is my habit, all points will be assigned at thread closing.
"I'm not dumb. I just have a command of thoroughly useless information."
Prasanth B
Trusted Contributor

Re: EVA Disk Failure & Corrupt Files

Hi,

Have you set the PV timeout value for the EVA LUNs? The default of 60s is not sufficient. You can change it with the pvchange command (assuming you are using LVM). Make it at least 90s.
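
For anyone wanting to apply this across several LUNs, here is a minimal sketch that only builds the command lines and does not run them. It assumes the HP-UX LVM form `pvchange -t <seconds> <pv_path>`; the helper name and the device files are hypothetical, so substitute the PV paths that `vgdisplay -v` reports for your EVA volume groups.

```python
import shlex


def pvchange_timeout_commands(pv_paths, timeout_s=90):
    """Build (but do not run) one `pvchange -t` command per physical volume."""
    if timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [f"pvchange -t {int(timeout_s)} {shlex.quote(path)}" for path in pv_paths]


if __name__ == "__main__":
    # Hypothetical device files, used here for illustration only.
    for cmd in pvchange_timeout_commands(["/dev/dsk/c4t0d1", "/dev/dsk/c4t0d2"]):
        print(cmd)
```

Printing the commands first makes it easy to review them before running anything as root.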

-PB
Take life as it comes
Dave La Mar
Honored Contributor

Re: EVA Disk Failure & Corrupt Files

PB -
Not sure how this plays into it since there were no OS level pv errors of any kind.
And yes, we are set at default.

It appears that the EVA was simply reporting completed writes when they indeed were not successfully completed.

Thanks for the thought though.

-dl
"I'm not dumb. I just have a command of thoroughly useless information."
Thomas Callahan
Valued Contributor

Re: EVA Disk Failure & Corrupt Files

Out of curiosity, what rev of VCS code is your 5000 at?

Tom Callahan
Paul Henderson_2
Frequent Advisor

Re: EVA Disk Failure & Corrupt Files

As you say, "It appears the EVA was reporting completed writes".

Since a write operation is cached (the HSV110 controllers, for example, have a 1GB mirrored cache), the write would occur to cache. Once the cache filled, or was flushed, the write would physically be completed to disk.

One HSV of the redundant pair is responsible for caching the write, which is then mirrored on the second HSV. This is a long shot, but it may be that a physical memory error (coupled bits, for example) on the responsible HSV caused a write corruption. It may be worth running a diag on the HSV to determine the health of its cache.
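
To make the idea concrete, here is a toy model of a mirrored write-back cache. It is only a sketch of the general technique described above, not of the actual HSV firmware; the class, its methods, and the fault-injection hook are all made up for illustration. The point it shows is that the host gets its acknowledgement when the data lands in cache, so a fault that happens later, during destaging, is invisible to the host.

```python
class MirroredWriteBackCache:
    """Toy write-back cache: writes are acknowledged before they reach disk."""

    def __init__(self):
        self.primary = {}  # block -> data held by the owning controller
        self.mirror = {}   # copy held by the partner controller
        self.disk = {}     # what has actually been destaged

    def write(self, block, data):
        self.primary[block] = data
        self.mirror[block] = data
        return "ack"  # host sees completion here, before any disk I/O

    def flush(self, fault=None):
        """Destage cached blocks; `fault` optionally damages data on the way."""
        for block, data in self.primary.items():
            self.disk[block] = fault(data) if fault else data
        self.primary.clear()
        self.mirror.clear()

    def read(self, block):
        if block in self.primary:
            return self.primary[block]
        return self.disk.get(block)


def flip_low_bit(data):
    """Hypothetical fault: flip one bit in the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]


if __name__ == "__main__":
    cache = MirroredWriteBackCache()
    print(cache.write(7, b"ORACLE"))   # acknowledged immediately
    cache.flush(fault=flip_low_bit)    # damage occurs after the ack
    print(cache.read(7))               # no longer matches what was written
```

In this model neither the host nor the OS ever sees an error, which would be consistent with clean syslogs alongside corruption that only the application's own block checks detect.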

The other part of the mystery is the role of Oracle 9i in this, but I have no knowledge of it.

SAKET_5
Honored Contributor

Re: EVA Disk Failure & Corrupt Files

Dave,

This is definitely intriguing, although I am not convinced you should start your troubleshooting by pointing the finger at the EVA as the cause of the corruption you experienced.

Were there any other servers accessing the same EVA at the time the disk failure occurred? I find it hard to believe (not saying it's impossible) that the loss of a single physical disk on an EVA would manifest as data corruption at the Oracle end. Did you experience any such issues on any other servers or in any other data areas, or just the Oracle-level corruption?

Have you also quizzed Oracle for any such issues reported with the particular version of Oracle?

I have assumed that only the loss of a physical disk took place and not any Vdisk.

Interested to follow this one through...

all the best and keep us posted..
Dave La Mar
Honored Contributor

Re: EVA Disk Failure & Corrupt Files

Paul/SAKET -
The issue has been escalated with HP and I will definitely mention the diags on the HSV.
It was HP that pointed the finger at the EVA before we did, although the evidence is circumstantial: the errors in the EVA log coincided with the crash of the db caused by the file corruption.
HP actually pointed out that, though extremely rare, there is evidence of successful writes being reported when they were in fact not successful. Again, let me stress "extremely rare".
Oracle has assisted in troubleshooting and has, as yet, found no 9i issues that would write bad data as we have encountered.
There is one other HP host connected to storage in this same group that has not, as yet, had indications of corruption from the 9i db running on it. The caveat here is that the second host does not see the transaction volume and I/O that are seen on the host showing the issues.
The db involved is slightly over 2 TB in size on this host and experiences a lot of activity.
No vdisks were lost, simply a hardware disk failure.
There has been residual fallout since that Thursday, with more corruption being detected daily.
In tracking this, there is solid evidence that this is not new corruption but a result of last Thursday's issue.
Obviously, there are files and tables that may not get hit every day, so we expect it to take a minimum of a week to find all the corruption, which is quite likely why we have yet to see any on the second host.
The DBA group is using dbverify to perform the cleanup, but running it against all files in a 2 TB db would be detrimental to the business. Thus, they are attacking each instance as found.
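
For readers who have not used the utility, here is a minimal sketch of checking only the suspect datafiles rather than the whole database. It builds `dbv` command lines using the utility's `file`, `blocksize`, and `logfile` parameters and does not run them. The helper name, the datafile paths, the log directory, and the 8192-byte block size are assumptions for illustration; use your database's actual `db_block_size`.

```python
import shlex


def dbv_commands(datafiles, blocksize=8192, logdir="/tmp"):
    """Build (but do not run) one dbv command per suspect datafile."""
    commands = []
    for path in datafiles:
        name = path.rsplit("/", 1)[-1]
        log = f"{logdir}/dbv_{name}.log"
        commands.append(
            f"dbv file={shlex.quote(path)} blocksize={int(blocksize)} "
            f"logfile={shlex.quote(log)}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical datafiles flagged by the latest corruption reports.
    suspects = ["/u01/oradata/PROD/users01.dbf", "/u01/oradata/PROD/index02.dbf"]
    for cmd in dbv_commands(suspects):
        print(cmd)
```

Limiting the list to files already implicated keeps the extra read load small, at the cost of leaving untouched files unverified until something reads them.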

I will definitely post the outcome of this issue and the definitive cause when found.
I would hate to think of this occurring for a colleague without some insight. Since I got no hits for this in the ITRC search (or Google, for that matter), I assume the outcome will be of interest to fellow EVA users.

Regards,

dl
"I'm not dumb. I just have a command of thoroughly useless information."
Dave La Mar
Honored Contributor

Re: EVA Disk Failure & Corrupt Files

For those following this thread, here is the latest update.
HP has provided an escalation team that has pored over the EVA logs, Brocade logs, and syslogs.
At this point, no reasonable explanation has been provided for the corruption.
The one clue has been an "unusual" error reported by the failing drive.
HP is making every attempt to retrieve the drive in question.
Unless the lab is able to examine the drive, it is highly unlikely our company will ever know the cause of the corruption.
Additional updates will be posted as things progress, until all resources have been exhausted.

Regards,

dl
"I'm not dumb. I just have a command of thoroughly useless information."