

 
Paul Hutchings
Super Advisor

Cache-1 Corrupt?

Had an "interesting" experience with our P4000 cluster this weekend.

 

We had an extended power outage, which meant that one of our P4000 sites lost power, so the node there died.

 

The remaining node + FOM did their job and kept quorum, however, when power came back and the lost node booted, it showed in the CMC with a red X and a "Cache-1 Corrupt" status.

 

Running a diagnostic on that node showed that the cache module itself passed, but the cache status check failed with "Corrupt".

 

When I got someone from L2 on the phone, they explained that forcing the node online could corrupt the cluster, because the corrupt cache content might be flushed to the storage. Their plan was to ship out a replacement cache module, which they did; they then did a "node exchange" in the CMC and the volumes restriped overnight.

 

I'd like to get a bit better understanding of what actually happened, because whilst the cluster carried on running, which is great, I'm a little concerned/puzzled that corruption in the cache could render an entire node unusable.

 

Simply put, if replacing the cache module removed the potential for corrupt cache data to be flushed to disk, why isn't there some option to just discard the contents of the existing cache module?

 

Thanks in advance.

 

Paul

17 REPLIES
Johan Guldmyr
Honored Contributor

Re: Cache-1 Corrupt?

An option would be very handy indeed for clustered solutions like this.

Is the RAM and the battery on the same module or would it be possible to just disconnect the battery to the controller and thus clear the cache?
Uwe Zessin
Honored Contributor

Re: Cache-1 Corrupt?

> just disconnect the battery to the controller and thus clear the cache?

Oh, please ... !!! Don't try to be 'creative' !!
That is another way to cause data corruption.

If you are concerned about the integrity of your data:
You MUST NOT and NEVER simply ignore/skip lost cache data !!

.
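
To illustrate the point about lost cache data, here is a minimal sketch of a write-back cache. It is a toy model under stated assumptions, not how the P4000's controller or SAN/iQ actually behaves: the class `WriteBackCache` and its methods are hypothetical names made up for this example. It shows that a write is acknowledged to the host before it reaches disk, so discarding the cache silently rolls the disk back to older data.

```python
class WriteBackCache:
    """Toy write-back cache in front of a block store (hypothetical model)."""

    def __init__(self, disk):
        self.disk = disk   # block number -> data that is durably on disk
        self.dirty = {}    # block number -> acknowledged data not yet on disk

    def write(self, block, data):
        # The write is acknowledged to the host as soon as it is in cache.
        self.dirty[block] = data

    def read(self, block):
        # Reads prefer the cache so the host sees its latest write.
        return self.dirty.get(block, self.disk.get(block))

    def flush(self):
        # Normal path: dirty blocks are written out and the cache is emptied.
        self.disk.update(self.dirty)
        self.dirty.clear()

    def discard(self):
        # "Pull the battery": dirty blocks vanish. Return which blocks were lost.
        lost = sorted(self.dirty)
        self.dirty.clear()
        return lost


disk = {0: "old-A", 1: "old-B"}
cache = WriteBackCache(disk)

cache.write(0, "new-A")
cache.write(1, "new-B")
cache.flush()                # both blocks are now safely on disk

cache.write(0, "newer-A")    # acknowledged, but only in cache
lost = cache.discard()       # cache contents thrown away

# The host was told block 0 holds "newer-A", but the disk still has "new-A".
print(lost, cache.read(0))
```

In this model the discard itself reports no error, yet the host and the disk now disagree about block 0. That is the sense in which skipping lost cache data is a form of data corruption.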
Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

Uwe, I agree with you not to ignore it, but it's a fair question - if LeftHand support say to replace the cache module, how am I better off than if I simply disconnected the battery and cleared the existing cache module?

 

(I wouldn't just do that, but I'm asking about the theory).

Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

I am very surprised that support shipped you a battery and cache. This is a known issue and can be fixed by a patch or by going into a PuTTY session. "Cache corrupt" is a very different error from "Cache faulty"; to determine whether the cache was indeed bad and needed to be replaced, an hpadu.zip file would need to be analyzed. If you have this error, just log a call with support and mention the patch.

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

When they checked, mine apparently has that patch installed already.

Steve Burkett
Valued Contributor

Re: Cache-1 Corrupt?

Out of interest, how long was your node powered down for, Paul?

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

Only around 10 minutes or so.

Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

  1. You will need to install patch 10096; this patch is rolled up in patch 20020.

Resolves a situation where, under certain conditions, on reboots after an upgrade to 9.0, SAN/iQ 9.0.00 will appear to detect and report an erroneous controller cache discard event long after the original condition has been corrected and no longer exists.

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

Apparently not though, from the L2 response:

 

"Upon reviewing the logs this unit already has PS02 installed, so 10096 will be unable to resolve the issue."