StoreVirtual Storage


 
ROnanian
Occasional Advisor

Catch 22 -- can't update because critical alert, can't clear alert because outdated

I've got multiple issues, surely a result of my own lack of education on this system, and it's very frustrating.

  

I have a P4300 G2 with a few bad drives (in a cluster with another identical one and a 4330). It doesn't like new replacement drives (reports them "faulty") and I'm pretty sure it needs a patch or firmware update or something.

  

The CMC's "Upgrades" tab shows a few updates. (To be honest, I'm not confident that these will help, but I have to try.) When I try to install them, it fails with "Pre installation test failed because management group 'HPSANGroup' has the following issues: Critical Alarms exist. Canceling all further installations." I believe the only fix for these Critical Alarms is an update.

  

I did find some patches and firmware updates that look like they somehow get installed directly on the P4000-series nodes but I have no idea how. The patches and firmware I found:

http://h20565.www2.hp.com/hpsc/swd/public/readIndex?sp4ts.oid=4118705&swLangOid=8&swEnvOid=54

The files are .patch and .upgrade files and I can't find any advice on installing them.

  

Can someone help steer me in the right direction so I can bail myself out?

Torsten.
Acclaimed Contributor

Re: Catch 22 -- can't update because critical alert, can't clear alert because outdated

You cannot update anything while the system is in a critical state. You first need to fix the disk issue.

 

What does "a few" bad drives mean? How many exactly?

 

Once the RAID set has lost too many drives, it is gone and you probably need to rebuild the node from scratch after the bad disks are replaced. Then you can sync the nodes and install updates.
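
To make "too many drives" concrete, here is a minimal Python sketch of how many failed disks common RAID levels tolerate. It is an illustration only: the thread does not state which RAID level this node uses, the function name raid_survives is made up for this example, a single RAID set is assumed, and the RAID 10 pairing of bays (1,2), (3,4), ... is an assumption that a real controller may not follow.

```python
# Illustrative only: failure tolerance of common RAID levels for a single RAID set.
# The node's actual RAID level and disk pairing are NOT stated in this thread.

def raid_survives(level: str, total_disks: int, failed_bays: set[int]) -> bool:
    """Return True if the array would still be readable with the given bays failed."""
    if level == "raid5":
        # Single parity: one failed disk at most.
        return len(failed_bays) <= 1
    if level == "raid6":
        # Double parity: two failed disks at most.
        return len(failed_bays) <= 2
    if level == "raid10":
        # Assumed mirror pairs: bays (1,2), (3,4), ...; the array is lost
        # only if both members of the same pair have failed.
        pairs = [(b, b + 1) for b in range(1, total_disks, 2)]
        return not any(a in failed_bays and b in failed_bays for a, b in pairs)
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    for level in ("raid5", "raid6", "raid10"):
        print(level, raid_survives(level, 8, {1, 7}))
```

Under these assumptions, two truly failed disks would already be fatal for a single RAID 5 set, while RAID 6 would survive two and RAID 10 would survive as long as no mirror pair loses both members.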


Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
ROnanian
Occasional Advisor

Re: Catch 22 -- can't update because critical alert, can't clear alert because outdated

Three drives show "Faulty" health. Two are brand new, bought to replace old drives that went "Faulty". The new drives also report "Faulty". I do not know how to fix the disk issue; I've already tried replacing them with brand new disks. Is there something else I can do?

 

Timeline:

 

  1. Disk 7 goes "Faulty". Order new disk, proper HP Spare # 508011-001.
  2. While waiting for that to arrive, Disk 1 goes "Faulty". Order a second one.
  3. First replacement arrives. Install in bay 7. Status reports "Rebuilding" for 3 days, then "Active" with Health as "Faulty".
  4. Second replacement arrives. Install in bay 7, assuming first replacement was DOA.  Status reports "Rebuilding" for 3 days, then "Active" with Health as "Faulty".
  5. Try first replacement disk in bay 1.  Status reports "Rebuilding" for 3 days, then "Active" with Health as "Faulty". Either two brand new drives are bad, the node has failed, or the new drives are incompatible with my old firmware. 
  6. Disk 3 goes "Faulty". 

CMC reports "Safe to Remove" as "Yes" for all disks.
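
The swap test in steps 3 to 5 can be written down as a small Python sketch that tallies which physical drive reported "Faulty" in which bay. The labels "replacement A" and "replacement B" are made up here for the first and second new disks, and the helper names are hypothetical; the observations themselves come from the timeline above.

```python
from collections import defaultdict

# (physical drive, bay, health after rebuild), taken from timeline steps 3-5.
# "replacement A"/"replacement B" are labels invented for this sketch.
observations = [
    ("replacement A", 7, "Faulty"),
    ("replacement B", 7, "Faulty"),
    ("replacement A", 1, "Faulty"),
]


def bays_by_drive(obs):
    """Map each drive to the set of bays in which it reported Faulty."""
    seen = defaultdict(set)
    for drive, bay, health in obs:
        if health == "Faulty":
            seen[drive].add(bay)
    return dict(seen)


def drives_faulty_in_multiple_bays(obs):
    """Drives that reported Faulty in more than one bay."""
    return sorted(d for d, bays in bays_by_drive(obs).items() if len(bays) > 1)


if __name__ == "__main__":
    print(drives_faulty_in_multiple_bays(observations))
```

A drive that reports "Faulty" in two different bays makes a single bad bay an unlikely sole cause. It does not, by itself, separate "bad drive" from "node or firmware problem", which is why all three explanations in step 5 are still open at this point.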

 SANstatus.PNG

ROnanian
Occasional Advisor

Re: Catch 22 -- can't update because critical alert, can't clear alert because outdated

Nobody has any further thoughts on this?

 

In the meantime, Disk 2 now shows "Faulty".

 

Miraculously, I managed to do enough clean up to crowd everything onto the two good nodes and take the bad node offline. I can't sustain it but I can hold it like this long enough to do something with the failing node. 

 

Can anyone suggest how to proceed?

 

Maybe I should remove the failed/replaced disks, reconfigure the RAID on that node to operate on fewer disks (and present less storage), re-join the node to the management group (if it was necessary to remove it), do the updates (since there are no alarms anymore), then see if it can talk to the new replacement disks.

Torsten.
Acclaimed Contributor

Re: Catch 22 -- can't update because critical alert, can't clear alert because outdated

Even though certain versions of LeftHand OS want to manage the firmware from the OS itself, I would try cold-booting the system from the current SPP (2014.09) and updating all the firmware. For MB1000FAMYU disks, firmware version HPD7 is included.

 

Problem Fixed:

  • This firmware prevents a rare condition that may occur during a WRITE SAME command sequence that may result in incorrect data being written to the hard drive. The WRITE SAME command may be used during RAID ARRAY parity initialization.

 

 

 

This could be a reason.


Hope this helps!
Regards
Torsten.

ROnanian
Occasional Advisor

Re: Catch 22 -- can't update because critical alert, can't clear alert because outdated

Just FYI, an update...

 

I was able to acquire the SPP (thanks to a recent server purchase). I ran it and it appeared to update a ton of stuff. Still no good. I also received one more drive from my vendor and that one is good.

 

With all the updates I'm now getting better diagnostic info, and it turns out that the bad drives are reporting SMART predictive failures, not actual failures. Dumb SMART.

 

So, I think I really did get two DOA drives in a row and this whole thing has been an incredibly inconvenient wild goose chase. The good news is that I have done some serious, badly needed cleanup of LUNs.

 

Thanks everyone, I'll update this thread again when there's more news.