BladeSystem - General

C7000 chassis "Degraded" then "OK" emails.

Mike O.
Regular Advisor

We have a couple of C7000 chassis with 5 DL465C blades in each chassis, all running VMWare ESX 4. The systems have been in place and functioning OK for about a year.

We have email notification set up on the Onboard Administrator (OA) so that it sends out alerts if something goes wrong on the server.

Four times in the last week, we have received an email from the OA that says the enclosure status has changed to "degraded", followed about 25 seconds later by another email that the enclosure status has changed to OK.

The problem is that there is nothing wrong that I can find, and the OA logs don't show anything happening at those times.

Three of the four times (Wednesday, Saturday, and Monday) were at 4:10pm; the other one was at 10:19 on Monday.

In addition to both OA logs, I've checked the IML logs on all the blades, and I can't find anything listed.

Any suggestions would be appreciated.
9 REPLIES
Sheanshar
Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

HP ProLiant BL460c Server - Server Blade Has a "Degraded" Status in Onboard Administrator (OA)
ISSUE:

Onboard Administrator Shows the Server Blade BL460c Server installed with RHEL 5.3 x86 version as "Degraded" Status in spite of having all the BladeSystem Firmware up-to-date.
SOLUTION:

On any LINUX OS or VMware installed Servers please ensure the OS has the Proliant Support Pack/HPASM agents installed, as the ILO driver and agents communicate the server health to the OA. If the Proliant Support Pack/HPASM agents are absent, then the server status would appear degraded in the OA.


Let me know if there are updates.
JDFC
Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

Hi Mike,

We have the exact same issue on several C7000 enclosures running OA 2.51 and OA 2.52. We could not determine the cause, as we also couldn't find any events in the OA syslogs. The same goes for the server syslogs: nothing matches the times of the events.

Our customer updated some enclosures to OA 2.60 and the false alert mails seem to be gone. From the engineering point of view this is not a known issue; they believe there is an OA firmware desynchronisation between the active and standby OA (I could not find any argument supporting that).

I'm very interested in this issue, as until now I couldn't find anyone else having it. I just asked my customer to try rebooting both OAs and then reflashing with the versions already running (2.51 or 2.52) instead of flashing to OA 2.60, as I want to be sure that the problem is really solved by the 2.60 firmware and not just as a side effect of reflashing the OAs. Please keep me updated.

kind regards
Jean-Denis
Mike O.
Regular Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

The HP agents are installed on all the blades. Around December we went through all the blades and chassis and did a complete firmware update on all components, then did a clean install of VMWare 4 with the HP tools.

Both OA modules are at 2.60, the iLO on each blade is at 1.79, and the Virtual Connect modules are at 2.30 (enet) and 1.40 (fiber).

JDFC
Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

Hi Mike

What seems very strange in our case is that since the customer updated from OA 2.5x to OA 2.60, the alert mails have disappeared. I also asked my customer to do the following on one enclosure showing the issue with OA 2.51:
=> restart both OA modules.
For the moment the alert mails have not reoccurred on this single enclosure, but it's too early to say that this solved the problem. If I find something I will let you know. This is a crazy problem, as we cannot capture any relevant information from the enclosure.

regards
Jean-Denis
Mike O.
Regular Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

I thought it was done, but it happened again today.

This is really odd. So far it's happened 5 times, and four of them were at exactly the same time of day, 4:10pm. The number of days between them isn't constant: 3/24, 3/27, 3/29, 4/2.

I can't figure out what might be happening at exactly that time, and it's really frustrating that it won't say what was "degraded".
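The timing pattern described above can be checked with a few lines of Python. This is only a rough sketch: the dates and times are the ones reported in this thread, but the year (2010), the placement of the 10:19 alert on 3/29, and reading "10:19" as a morning time are all assumptions, and the function names are made up for illustration. It counts alerts per time-of-day slot and lists the gaps in days between alert dates, which is the kind of summary that helps when looking for a scheduled job (backup, scan, monitoring poll) that lines up with the alerts.

```python
from collections import Counter
from datetime import datetime

# Alert times as reported in the thread. ASSUMPTIONS: the year is 2010,
# the 10:19 alert fell on Monday 3/29, and "10:19" means 10:19 AM.
alerts = [
    datetime(2010, 3, 24, 16, 10),
    datetime(2010, 3, 27, 16, 10),
    datetime(2010, 3, 29, 10, 19),
    datetime(2010, 3, 29, 16, 10),
    datetime(2010, 4, 2, 16, 10),
]


def time_of_day_counts(events):
    """Count events per (hour, minute) slot, ignoring the date."""
    return Counter((e.hour, e.minute) for e in events)


def day_gaps(events):
    """Gaps in whole days between consecutive distinct alert dates."""
    dates = sorted({e.date() for e in events})
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


counts = time_of_day_counts(alerts)
slot, hits = counts.most_common(1)[0]
print(f"most common slot: {slot[0]:02d}:{slot[1]:02d} ({hits} of {len(alerts)} alerts)")
print("gaps between alert days:", day_gaps(alerts))
```

A fixed time of day with irregular day gaps would point toward something that runs daily at that time but only occasionally trips the status change, rather than a job on a fixed multi-day schedule.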

Next week I'm going to try re-flashing the OA firmware. I'm already at 2.60, but maybe flashing and rebooting the OA will reset something.
Mike O.
Regular Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

I was going to go ahead and re-flash the 2.60 firmware, but now I see that OA 3.00 has just been released on 3/31. Unfortunately, there doesn't appear to be any compatibility info released for the other components' firmware.

If I can verify that we meet the requirements with our other components, I'll go ahead and update to OA 3.00.
Timur S Lukovkin
Regular Advisor
Mike O.
Regular Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

I had looked at the "fixes" section of OA 3.0 when I found the firmware, but I must be missing something; I didn't see anything listed that seemed to address this issue.

With all the other fixes listed, it wouldn't surprise me if it did resolve the issue, but I would still like to get some info on the minimum firmware needed in the other components. The link provided under "firmware dependency" still only references OA 2.6.

It happened again yesterday, at exactly 4:10pm, just like most of the other times.
JDFC
Advisor

Re: C7000 chassis "Degraded" then "OK" emails.

Hi Mike

Agree with you. I don't believe that there is a known issue with OA 2.60; as I already said, I got the same kind of issue with OA 2.51 and OA 2.52, and it seems that after flashing some enclosures to OA 2.60 my customer no longer gets alert mails from them. So I believe that either rebooting both OAs or reflashing the OA firmware "resets something", and the enclosures then stop sending alert mails. Unlike you, we get the alerts on random days and at random times, and the delay between the Degraded and OK status is always less than one minute.
My customer was able to capture a SHOW ALL when one of his enclosures went to Degraded and, as usual, there is no entry in the OA1 and OA2 syslogs. The only thing he noticed is the following information returned by SHOW ENCLOSURE STATUS:
Enclosure:
    Status: Degraded
    Unit Identification LED: Off
    Diagnostic Status:
        Internal Data    OK
        Redundancy       Failed
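Since the degraded state lasts well under a minute, it may help to save the captured output to a file and pull out the failing checks automatically. The following is a minimal sketch, assuming the capture looks like the excerpt above and that each diagnostic state is a single word at the end of its line; the function name `non_ok_diagnostics` is hypothetical and this is not an HP tool.

```python
# Sample text copied from the SHOW ENCLOSURE STATUS excerpt in this thread.
SAMPLE = """\
Enclosure:
Status: Degraded
Unit Identification LED: Off
Diagnostic Status:
Internal Data OK
Redundancy Failed
"""


def non_ok_diagnostics(text):
    """Return (check, state) pairs under 'Diagnostic Status' that are not OK.

    ASSUMPTION: every line after the 'Diagnostic Status:' header is a check
    name followed by a one-word state, as in the excerpt above.
    """
    findings = []
    in_diag = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Diagnostic Status"):
            in_diag = True
            continue
        if in_diag:
            # Split on the last space: name on the left, state on the right.
            name, _, state = line.rpartition(" ")
            if name and state != "OK":
                findings.append((name.strip(), state))
    return findings


print(non_ok_diagnostics(SAMPLE))
```

On the excerpt above this flags only the Redundancy check, which matches what was noticed by eye.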

I asked for the SHOW OA STATUS output, but I'm not sure that my customer had time to capture the full SHOW ALL output. I discussed this kind of error with OA engineering and it does not seem to be a known issue. They said that it could be due to a firmware desynchronisation between both OAs, but that does not seem to be the case for our enclosures. I will let you know if I can identify what is going on.

It seems to be an OA-related problem (when there are redundant OAs in an enclosure), but not specific to a given OA version, as I got the issue with 2.51 and 2.52 and you got it with 2.60 (we no longer have issues on the enclosures we flashed from 2.5x to 2.60). I also asked my customer to restart both OAs on an enclosure which had the issue, and for the moment the alert mails have not reoccurred.

regards
Jean-Denis