C7000 ILO RedX

Donald J Wood · ‎07-01-2009

I have several blades installed in a C7000 enclosure. These are ProLiant BL465c G5 servers. The OA firmware is at 2.41 and the ILO firmware is at 1.77. Quite often we get a Critical Error RedX on the ILO, which renders the ILO inaccessible. On occasion we've also seen some ASRs (not 100% sure this is related). Sometimes the ILO becomes accessible again without any intervention, and other times the blade has to be shut down and reseated to get it working again.

This has been happening for a long time. HP support told us to update the OA and ILO firmware to the new versions, which we have done. The issue is still happening.

Here’s an example of what we see in the OA system log throughout the day.
Jun 28 03:37:37 OA: Management Process on Blade 5 appears responsive again.
Jun 28 03:37:47 OA: Management Process on Blade 1 appears responsive again.
Jun 28 03:39:38 OA: Management Processor on Blade 1 appears unresponsive.
Jun 28 03:39:39 OA: Management Processor on Blade 5 appears unresponsive.
Jun 28 03:39:58 OA: Management Process on Blade 5 appears responsive again.
Jun 28 03:41:30 OA: Management Process on Blade 1 appears responsive again.
Jun 28 03:42:04 OA: Management Processor on Blade 5 appears unresponsive.
Jun 28 03:43:07 OA: Management Process on Blade 5 appears responsive again.
Jun 28 03:43:42 OA: Management Processor on Blade 1 appears unresponsive.
Jun 28 03:45:37 OA: Management Processor on Blade 5 appears unresponsive.
Jun 28 03:46:37 OA: Management Processor on Blade 2 appears unresponsive.
Jun 28 03:46:51 OA: Management Process on Blade 1 appears responsive again.
Jun 28 03:48:30 OA: Management Process on Blade 2 appears responsive again.
Jun 28 03:49:47 OA: Management Process on Blade 5 appears responsive again.
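
A log like the one above can be summarized per blade to show how often each management processor drops out and for how long. What follows is only a minimal sketch in Python, not an HP tool: the function name `summarize_outages`, the regular expression, and the default year (syslog lines carry no year) are assumptions made for illustration, and the pattern is based solely on the message wording shown in this excerpt.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the two message variants seen in the excerpt above
# ("Management Processor ... unresponsive" / "Management Process ... responsive again").
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) OA: Management Process(?:or)? "
    r"on Blade (?P<blade>\d+) appears (?P<state>unresponsive|responsive again)\.$"
)


def summarize_outages(lines, year=2009):
    """Return {blade: {"outages": n, "seconds_down": s}} for completed outages.

    `year` is an assumption: the log timestamps do not include one.
    A "responsive again" line with no earlier "unresponsive" line is ignored.
    """
    down_since = {}
    stats = defaultdict(lambda: {"outages": 0, "seconds_down": 0})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not one of the two messages
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        blade = int(m["blade"])
        if m["state"] == "unresponsive":
            down_since[blade] = ts
            stats[blade]["outages"] += 1
        elif blade in down_since:
            delta = ts - down_since.pop(blade)
            stats[blade]["seconds_down"] += int(delta.total_seconds())
    return dict(stats)
```

Feeding it the saved OA system log (one message per line) gives a per-blade count that can be lined up against the times ASRs were seen or firmware was changed.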

To troubleshoot this further, we ran a network trace to monitor communication across that network during the time a critical error occurred. The analysis stated that there were no requests coming into or out of the ILO for half an hour before or after the ILO went critical. I know that's not a great analysis, but we feel that the ILO going critical severed the communication.

Network settings on all of our blade ILO instances are as follows:
DHCP: enabled
IP address: Has one
Subnet Mask: Has one
Gateway IP address: Has one
ILO 2 subsystem name: the serial number prefixed with ILO, for example ILOUSE324CPM0
Domain Name: has one
Line: is set to Automatic

Adrian Clint · ‎07-01-2009

Get your BIOS up to date and the Power Management firmware to version 3.4B.
See the drivers/firmware download page for both.
This should fix these errors.

Donald J Wood · ‎07-01-2009

Thanks for the response. I'm working with our company's engineers to get the recommendations solidified, so here are a few more questions. Is it okay if I leave the OA (2.41) and ILO (1.77) at their current levels and just move to the next version of System BIOS and Power Management? Are there any company advisories that I can present to my management to explain the need to upgrade and that specifically address the RedX issues?

It's really important to get this type of documentation and only burn in the testing once we know for sure the updated firmware is going to help. The reason is that we seem to be going through a revolving door of "try this firmware or driver", and either the enclosure comes up with another weird issue or the same one over again.

My management and our customers are becoming really frustrated because they have been asking our engineers to implement ILO for a number of years, and now that we've got it, all we have is problems with it. Some are blaming downtime (ASR issues) on the RedX we're seeing with the critical errors on the management processor. That's not good at all.

Another thing we also have to consider here (and it's very important): since we've moved to blades and many of them have ESX with several VMs running, it's rather difficult to schedule a reboot to implement a system ROM update because it affects so many computers at one time. So just fixing the firmware isn't an overnight job anymore.

I really need HP to get this right (with no fallout) on a stable version package across the board (ILO, OA, BIOS and Power Management) before I can move to the next version. If we have to do two, three, four of these and we end up back in the test lab again with another possible fixed version, someone is likely going to be walking back from the woodshed with a limp.

Adrian Clint · ‎07-02-2009

Donald
You want all your firmware levels across the blade chassis to be in the compatibility matrix.
http://h18004.www1.hp.com/products/blades/components/c-class.html#tab3_content

As for how critical they are: there may be some advisories, but I usually look at the Fixes & Revision History tabs for each firmware download.
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=3709945&prodSeriesId=3621769&swItem=MTX-1936238b62874fefb8b99a4d2b&prodNameId=3621782&swEnvOID=4025&swLang=8&taskId=135&mode=5

I know that the Power Management firmware 3.4B upgrade is regarded as critical. Updating to iLO 1.77 and power firmware 3.4B and updating the BIOS has fixed several unwanted automatic server reboots on some of our servers.

Adrian Clint · ‎07-02-2009

The iLO firmware update is non-disruptive, so you can go to 1.78 anytime.

BIOS and Power firmware both need a reboot. But you should be able to run them both and do one reboot.

Can you not VMotion the VMs off while you do a reboot?

Donald J Wood · ‎07-02-2009

Thanks for the additional information. On my wish list is for HP to do proper burn-in and testing before firmware is made available to customers. Right now, it doesn't seem to be happening that way, and given the manpower, timing, and scheduling it takes to update 100+ blade enclosures, it's a major deal here at my company.

Yes, VMotion is an option on 85% of the VMs. It's not for everyone. Because service interruptions do occur, this all has to be part of the planning to reboot an ESX server. In some cases the window is extremely small.

This is why we need to get to a stable version, and why testing and burn-in time are critical. My feeling is that during the burn-in and deployment period, HP will have discovered some new bugs, and we will end up two to three revisions back with some annoying glitch and another upgrade scenario coming down the road.
