Cache-1 Corrupt?

Paul Hutchings
Super Advisor

Had an "interesting" experience with our P4000 cluster this weekend.

 

We had an extended power outage which meant that one of our P4000 sites lost power so the node died.

 

The remaining node + FOM did their job and kept quorum, however, when power came back and the lost node booted, it showed in the CMC with a red X and a "Cache-1 Corrupt" status.
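As a rough illustration of the quorum behaviour described here, assuming quorum means a strict majority of managers (two storage nodes plus the FOM, so three in total). The function below is hypothetical and is not part of SAN/iQ:

```python
def has_quorum(total_managers: int, reachable_managers: int) -> bool:
    """Return True if a strict majority of managers is reachable."""
    return reachable_managers > total_managers // 2


# Two storage nodes plus a Failover Manager (FOM) give three managers.
TOTAL_MANAGERS = 3

# One node lost: the remaining node and the FOM can still see each other.
print(has_quorum(TOTAL_MANAGERS, 2))

# Node and FOM both lost: the survivor is on its own.
print(has_quorum(TOTAL_MANAGERS, 1))
```

Under that assumption, the surviving node plus the FOM make two of three managers, which is why the cluster stayed up.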

 

Running a diagnostic on that node showed the cache module passed the diagnostic, but the status of "Corrupt" failed.

 

When I got someone from L2 on the phone they explained that forcing the node online could corrupt the cluster because the corrupt cache content might be flushed to the storage, so their plan was to ship out a replacement cache module, which they did, and they then did a "node exchange" in the CMC and the volumes restriped overnight.

 

I'd like to get a bit better understanding of what actually happened, because whilst the cluster carried on running, which is great, I'm a little concerned/puzzled that corruption in the cache could render an entire node unusable.

 

Simply put, if replacing the cache module removed the potential for corrupt cache data to be flushed to disk why isn't there some option to just discard the contents of the existing cache module?

 

Thanks in advance.

 

Paul

17 REPLIES
Johan Guldmyr
Honored Contributor

Re: Cache-1 Corrupt?

An option would be very handy indeed for clustered solutions like this.

Is the RAM and the battery on the same module or would it be possible to just disconnect the battery to the controller and thus clear the cache?
Uwe Zessin
Honored Contributor

Re: Cache-1 Corrupt?

> just disconnect the battery to the controller and thus clear the cache?

Oh, please ... !!! Don't try to be 'creative' !!
That is another way to cause data corruption.

If you are concerned about the integrity of your data:
You MUST NOT and NEVER simply ignore/skip lost cache data !!

.
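To illustrate the theory being discussed, here is a toy model of a write-back cache, assuming writes are acknowledged to the host as soon as they land in the cache. The class and its methods are made up for illustration and say nothing about how the Smart Array controller or SAN/iQ is actually implemented:

```python
class WriteBackCache:
    """Toy model: writes are acknowledged once they reach the cache."""

    def __init__(self):
        self.disk = {}   # block number -> data persisted on disk
        self.dirty = {}  # block number -> data acknowledged but not yet flushed

    def write(self, block, data):
        self.dirty[block] = data
        return "ack"  # the host now believes this write is durable

    def read(self, block):
        # Unflushed data in the cache takes precedence over what is on disk.
        return self.dirty.get(block, self.disk.get(block))

    def flush(self):
        self.disk.update(self.dirty)
        self.dirty.clear()

    def discard(self):
        """Drop unflushed data, e.g. by clearing or swapping the cache module."""
        lost = sorted(self.dirty)
        self.dirty.clear()
        return lost


cache = WriteBackCache()
cache.write(7, "old")
cache.flush()                    # block 7 is safely on disk
cache.write(7, "new")            # acknowledged, still only in cache
cache.write(8, "journal entry")  # acknowledged, still only in cache

lost = cache.discard()
print(lost)           # blocks whose acknowledged writes are gone
print(cache.read(7))  # the stale version that was on disk
print(cache.read(8))  # nothing was ever written to disk for this block
```

In this model, discarding the cache leaves the disk stale relative to what the host was told had been written. That is presumably why a node in this state gets rebuilt from its peers rather than simply being brought back online.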
Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

Uwe, I agree with you not to ignore it, but it's a fair question - if Lefthand support say to replace the cache module how am I better off than simply disconnecting the battery and flushing the existing cache module?

 

(I wouldn't just do that, but I'm asking about the theory).

Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

I am very surprised that support shipped you a battery and cache. This is a known issue and can be fixed by a patch or by going in through a PuTTY session. "Cache corrupt" is a very different error from "Cache faulty". To determine whether the cache was actually bad and needed to be replaced, an hpadu.zip file would need to be analyzed. If you have this error, just log a call with support and mention the patch.

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

When they checked, mine apparently has that patch installed already.

Steve Burkett
Valued Contributor

Re: Cache-1 Corrupt?

Out of interest, how long was your node powered down for, Paul?

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

Only around 10 minutes or so.

Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

1. You will need to install patch 10096. This patch is rolled up in patch 20020.

Resolves a situation where, under certain conditions, on reboots after an upgrade to 9.0, SAN/iQ 9.0.00 will appear to detect and report an erroneous controller cache discard event long after the original condition has been addressed and corrected and no longer exists.

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

Apparently not though, from the L2 response:

 

"Upon reviewing the logs this unit already has PS02 installed, so 10096 will be unable to resolve the issue."

Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

What is the exact message you are getting?

Has support gone in and cleared the IML log?

Are your volumes fully replicated?

 

Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

The case of a valid cache discard will probably never happen because to get into this situation the writeback cache must be discarded on the boot following an upgrade. The boot following an upgrade is an orderly one, so the cache is flushed before power is cycled. The nsm is not at risk of losing its writeback cache in an upgrade because it's empty.

Paul Hutchings
Super Advisor

Re: Cache-1 Corrupt?

"Cache-1 Corrupt" following the uncommanded power off.

 

Like I said, they replaced the cache module and removed/re-added the node from within the CMC so it restriped all the volumes (which are all NW RAID10).

 

If I recall, there was no IML log, or it was empty - that was one of the issues, there was nothing in any of their logs to tell them whether the cache was good or bad, only the "cache-1 corrupt" message.


Can I just ask if you work for HP as it isn't clear from your username/avatar?

cuonghs
Occasional Collector

Re: Cache-1 Corrupt?

I'm facing the same problem as Paul Hutchings.

My storage: HP Storage P4300G2 (P/N: BK716A)

I ran diagnostics with the HP P4000 Centralized Management Console software, with the following result:

Cache status: cache 1 fail -> corrupt

 

Then, for testing, I swapped in parts from another P4300G2 that is working fine: the 512MB cache, battery, mainboard, RAM, CPU, and Smart Array P410. But the cache status is still corrupt.

 

Please help me resolve this.

 

Thanks a lot.

David_Tocker
Regular Advisor

Re: Cache-1 Corrupt?

Sounds like you have replaced the whole lot just about...

 

Time to Call HP support?

 

Out of curiosity, is the node offline? Have you applied the latest updates?

 

The 'updates' notice that popped up in the CMC the other day indicated that one of the patches related to the Smart Array adapter would not apply if later updates had already been applied. It sounded like it needed to be applied manually...

 

I'm pretty sure the update is just the latest firmware for the Smart Array controller anyway... Just reboot the node and update the firmware off a USB key.

 

Although you will probably be told off by Uwe Z for being 'creative'

 

Good luck...

Regards.

David Tocker
Emilo
Trusted Contributor

Re: Cache-1 Corrupt?

You will need to contact support.

They will need to take the unit out for repair and reconfigure the RAID.

If you are not fully patched then applying the latest patches should work also.

Either way it is going to be a call to support, as you cannot apply any patches to a non-functioning node.

 

johan duchateau
Occasional Contributor

Re: Cache-1 Corrupt?

I had the same problem.

 

Solution carried out by HP support on our SAN/iQ 9.5:

Pull out all the disks. Power up the node. Power down the node.

Boot the node and remove it from the management group (repair node).

Re-initialise the RAID 5.

Add back to the management group and restripe all data.

 

This solved the problem for us, but it is not an ideal solution...

SLucero
Occasional Visitor

Re: Cache-1 Corrupt?

 

We ran into this problem.

In our case, it was the batteries. They were old and swollen, very puffed up.

As soon as we replaced those, our Cache-1 corrupt error was gone.