BladeSystem - General

Re: Critical Error Redundancy Lost

 
Patrick G.
Advisor

Re: Critical Error Redundancy Lost

"...Remove that blade from the cabinet. Remove cover, and remove thin circular cache battery. Replace after one minute (make sure polarity stays the same!). Re-insert the blade. ..."

It doesn't help.
Joshua Oswald
New Member

Re: Critical Error Redundancy Lost

We have been ignoring these error messages for a couple months as well, until yesterday when it prevented a blade from powering on.

Per the instructions provided, we found one blade requesting 45kW. We removed the CMOS battery for a couple of minutes and that "solved" the problem... for now.
Joshua Oswald
New Member

Re: Critical Error Redundancy Lost

...and after a reboot, the blade is again asking for 44kW.
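For anyone who wants to spot a runaway request like this without clicking through each bay, here is a minimal sketch in Python. It assumes you have already copied the per-bay "Power Allocated" figures into a dictionary by hand. The function name, the `factor` threshold, and the sample readings are all made up for illustration (the figures just reuse numbers mentioned in this thread); nothing here talks to the OA.

```python
from statistics import median


def flag_power_outliers(allocated_watts, factor=10.0):
    """Return the bays whose allocation exceeds factor x the enclosure median.

    allocated_watts: hypothetical mapping of bay number -> allocated watts,
    filled in by hand from the OA's "Power Allocated" figures.
    """
    if not allocated_watts:
        return []
    typical = median(allocated_watts.values())
    return sorted(
        bay for bay, watts in allocated_watts.items() if watts > factor * typical
    )


# Illustrative readings only: three normal blades and one asking for 45 kW.
readings = {1: 541, 2: 45000, 3: 541, 4: 541}
print(flag_power_outliers(readings))
```

Comparing against the median rather than a fixed wattage is just one choice; it keeps a single bad blade from skewing the baseline, but it would not help if most bays were misreporting.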
John Moorhead_2
Advisor

Re: Critical Error Redundancy Lost

Follow-up to my earlier post:

The procedure I outlined above did solve the problem for me, for 1.5 weeks. Then I started getting "Degraded status" again on the same blade.

Interestingly enough, this time my "Power Allocated" figure for this blade is right where it should be (541 watts), so the root cause appears to be different this time, and none of the blades in the cabinet show abnormal readings. Note that between my first post and this one I have NOT performed any firmware updates.

Pulling the blade out and re-seating it cleared the status for this second event. The IML log did not contain any info on this event. I also looked at the Enclosure Information/Enclosure Settings/Configuration Scripts/Current Inventory script, which contains a lot of good info (I highly recommend that you run this script and paste it into a file you can refer to later) including the results of "SHOW SYSLOG OA 1". That log indicates events a few days ago that I had missed:

Apr 5 23:17:42 OA: Blade 2 is reporting degraded health status.
Apr 5 23:17:42 OA: Blade in bay #2 status changed from OK to Degraded

Note the times; they are within the same second! Something about this blade is not happy.
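Since these events are easy to miss, a small script can pull them out of a saved copy of the syslog. The following is only a sketch in Python: it assumes every status-change line looks exactly like the two lines quoted above, and the function name and regular expression are my own, not anything provided by HP.

```python
import re
from collections import defaultdict

# Matches lines shaped like the status-change line quoted above.
# The format is assumed from those two sample lines only.
STATUS_CHANGE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) OA: "
    r"Blade in bay #(?P<bay>\d+) status changed from (?P<old>\w+) to (?P<new>\w+)"
)


def degraded_events(lines):
    """Map bay number -> list of timestamps at which it changed to Degraded."""
    events = defaultdict(list)
    for line in lines:
        match = STATUS_CHANGE.match(line.strip())
        if match and match.group("new") == "Degraded":
            events[int(match.group("bay"))].append(match.group("stamp"))
    return dict(events)


sample = [
    "Apr 5 23:17:42 OA: Blade 2 is reporting degraded health status.",
    "Apr 5 23:17:42 OA: Blade in bay #2 status changed from OK to Degraded",
]
print(degraded_events(sample))
```

Only the "status changed" line is matched, so the paired "reporting degraded health status" line does not produce a duplicate entry for the same event.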
Joshua Oswald
New Member

Re: Critical Error Redundancy Lost

This is anecdotal at this point, so take it for what it's worth.

We upgraded the firmware on the blade (BL680c) that was exhibiting the power issues to a version (2009.2.24) that was released just a few days ago, and we did not experience the power errors. Immediately after we reverted to the previous firmware (for other reasons), the power errors started again. I didn't notice anything in the release notes indicating that this specific bug was resolved.
John Moorhead_2
Advisor

Re: Critical Error Redundancy Lost

For the record, the blade that I've been having all the problems with is also a BL680c G5, same as Joshua's. Hmmm!
saks5th
Occasional Advisor

Re: Critical Error Redundancy Lost

Hi Everyone,

I just wanted to give an update on this issue. I opened a case with HP, and basically the response was that you need to install the latest and greatest firmware on your server, iLO2 and OA. In our case we already had the latest installed (BL460c G1; ROM at 11/02/2008, iLO2 at 1.70), with the exception of the OA, which was sitting at 2.32. After updating to the latest OA version (2.41) we were still experiencing the power allocation problems, but after rebooting all of the individual blade servers the error went away and we have not seen it recur. Hope this helps...

Jesper
Patrick G.
Advisor

Re: Critical Error Redundancy Lost

ROM Version "I17 02/24/2009" for "ProLiant BL680c G5" solved the problem in our case
Joshua Oswald
New Member

Re: Critical Error Redundancy Lost

At this point, I would agree that ROM 2.24.2009 solved the power redundancy problem for us as well.

Unfortunately, for us, that has brought a new host of problems.

Apparently, this ROM enumerates physical devices to the OS differently, which means our HP Teaming definitions are no longer valid. I suspect (although I haven't tested it) that removing and reinstalling the NCU would resolve this, except that our blades are running Server Core... which doesn't have a mechanism for uninstalling the NCU. The only option is to rebuild the server.

So, since these blades are not yet in production, we bit the bullet, upgraded the ROM, and re-installed the OS. Then we discovered that this ROM also breaks dynamic WWN assignment for HBAs... both HBA ports get assigned the same WWN. If we revert to the previous ROM, the WWNs get assigned correctly, but we (again) have invalid Teams and we're back to the (original) power redundancy error.

I have a case open with HP regarding the WWN assignment problem. The last suggestion was to revert our VC firmware version and recreate the domain.
The Brit
Honored Contributor

Re: Critical Error Redundancy Lost

I've been following this thread with increasing interest (after my initial panic), and it seems to me that the focus is moving away from OA firmware and towards ProLiant ROM F/W.

Just to clear up a few things, and maybe focus the discussion: the real question is whether the OA firmware is reading the wrong information from the power supplies, or whether the interaction between the blade and the power supply at power-up is causing the PS to report incorrectly to the OA.

1. Does the initial occurrence of the problem ALWAYS occur when a blade is being powered up?
2. Has the problem ever affected a RUNNING server, i.e. caused a crash or shutdown?
3. Have there been any cases where this issue is FIRST SEEN when powering up an Itanium Blade?
4. It is being implied that ProLiant ROM F/W 2.24 fixes the problem, so has any ProLiant running 2.24 exhibited the problem?

Also, although the emphasis seems to be moving away from OA F/W, and notwithstanding that HP is still recommending a downgrade to 2.25 for those experiencing the problem, I am surprised to see in the most recent "Alerts" e-mail recommendations to upgrade to OA versions later than 2.25 for what appear to be fairly trivial issues that can be resolved manually via the OA CLI.
Since the cause of the PS issue is not fully resolved, it seems a little reckless to be making this recommendation and risk invoking a much more serious problem (in my opinion).

Makes me wonder if anyone at HP is paying attention to the problems that this PS issue is causing. I am running OA 2.32, and I have not experienced the problem at my site. However, I have to admit to a certain level of paranoia, which makes me fearful of performing many simple functions.

1. I am concerned about power cycling any of my servers since that may initiate the problem in my enclosures.
2. I am holding off with my intended upgrade of the OA firmware (to 2.41) because the problem has also been reported with this FW level, and again, I don't want to risk initiating the problem at my site.

Some official (HP) statement or comment, or reassurance would be appropriate at this point.

Dave.