Blade failure (after firmware update?)

Ryan22
Occasional Contributor

Hi,

I've got a problem with a blade and I'm looking for any ideas. At the moment it won't POST at all and the System Health light is flashing red. I'll run through the details as best I can recall:

The system was originally:
- HP BL685 G1 w/ ROM 05/04/07 (405661-B21)
- 4 x AMD Opteron
- 20GB RAM
- 2 x 76GB SAS
- QLogic Fibre Mezzanine module
- iLO Firmware v1.50? - unsure what it was exactly
- within a c7000 Enclosure running OA v1.30


The machine had been running fine as an ESX host for I'm not sure how long. As part of repurposing the unit for a new project, I decided to update the firmware (big mistake?). I connected the ISO of the HP Firmware Maintenance DVD v9.30 using iLO's Virtual Media facility and booted from it. I ran a custom firmware update so as to see what was being updated. I left everything to be updated, except I removed the QLogic card, not wanting to cause any SAN issues. The update went through mostly without a hitch, although the 2 SAS drives didn't update; I figured I would get those on a later boot. The machine rebooted fine and got back into the VMware farm.

Part of why we were updating firmware was to ensure maximum compatibility with VMware features, so we rebooted into RBSU to look for any CPU virtualisation features that were not enabled. We found AMD Virtualisation wasn't enabled, so we turned it on and rebooted.

This is where the problem starts. The system never reappeared in vCenter. iLO was still working at this point, but wasn't showing any video in the consoles, so we reset the iLO. Still the same issue. OA is showing the system all green, and so is iLO. On physically examining the machine, the red health light is flashing and the enclosure's fans are all spinning at high speed. Removing the blade from the chassis makes the fans return to normal speed, but when the blade is put back in, the following happens:

- Amber power light while enclosure detects blade
- System Powers on
- Insight display shows blade as green
- All green on Blade lights
- Eventually health light starts flashing red and fans spin up to near full speed
- No video/POST when directly connecting to unit, nor via iLO
- Both iLO and OA report the system is running fine. No logs to show issues whatsoever.

Things I've tried:
- Reseat RAM - no joy
- Removed 2 x CPU and their RAM - no joy
- Also removed QLogic - no joy
- Reset Configuration using DIP switch - no joy
- Tried to enable Redundant ROM, but it doesn't appear to switch to it (looking at iLO readout) - no joy
- Reverted from iLO v2.05 firmware to iLO v1.50. After doing this, I could see iLO was showing red for health but after another blade reseat, iLO now reports everything a-ok again - no joy

Another weird thing I've noticed in iLO is the System Information screen is still showing all 4 CPUs and the original RAM.
Does anyone have any ideas of what to try with this box? I know replacing the system board is one option, but as the machine is out of warranty, that is probably going to be prohibitively expensive for what is really only a test machine. I find it odd that the server was fine enough after the firmware update to boot into VMware and then back into RBSU, but after that one setting change, bang, broken. Is it possibly just bad luck / timing, and a general system board failure?
5 REPLIES
The Brit
Honored Contributor

Re: Blade failure (after firmware update?)

Hi Ryan,
This is just an observation, but your OA firmware is "stone-age". The number of bug fixes since this version would fill a small book.

I have a feeling that I saw an advisory regarding a minimum OA level for a specific iLO version.

Anyway, my recommendation would be to upgrade the OA to at least 2.6 (preferably to the most recent, which is ~3.21 (??)).
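
To make the version gap concrete, here is a minimal Python sketch that compares the firmware versions quoted in this thread against the suggested minimum. It assumes HP's two-digit minor numbering (so "2.6" reads as 2.60). The names `parse_version` and `needs_upgrade` are hypothetical helpers, not part of any HP tool, and the 2.6 threshold is only the recommendation above, not an official compatibility figure.

```python
from decimal import Decimal


def parse_version(text: str) -> Decimal:
    """Parse a version such as 'v1.30' or '~3.21' as a decimal number.

    Assumption: the minor part is a two-digit decimal fraction,
    so '2.6' and '2.60' are treated as the same version.
    """
    return Decimal(text.strip().lstrip("vV~"))


def needs_upgrade(installed: str, minimum: str = "2.6") -> bool:
    """Return True if the installed version is below the suggested minimum."""
    return parse_version(installed) < parse_version(minimum)


if __name__ == "__main__":
    # Versions mentioned in this thread; the labels are illustrative only.
    for label, version in [("OA in the enclosure", "v1.30"),
                           ("most recent OA (approx.)", "~3.21")]:
        status = "below" if needs_upgrade(version) else "at or above"
        print(f"{label}: {version} is {status} the suggested minimum of 2.6")
```

A check like this only flags how far behind a component is; it says nothing about whether a particular OA and iLO pairing is actually supported.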

Good luck.

Dave.
Ryan22
Occasional Contributor

Re: Blade failure (after firmware update?)

Hi Dave,

I know the OA is really old, but in light of what's happened here, we are very loath to touch any other parts of the system for fear of breaking the other business-critical blades running in that enclosure. I've been looking around for things like firmware compatibility or, more importantly, known incompatibility between versions. If anyone comes across such things, letting me know would be awesome.
Torsten.
Acclaimed Contributor

Re: Blade failure (after firmware update?)

>> within a c7000 Enclosure running OA v1.30

Now we are at version 3.30.


If you want, look up all the fixes on the web page and print them, and you have a book.

Honestly speaking, I would upgrade OA and iLO asap.

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
Torsten.
Acclaimed Contributor

Re: Blade failure (after firmware update?)

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=3709945&prodSeriesId=3884098&swItem=MTX-bc11d40cf7b0465eacabefb649&prodNameId=3884099&swEnvOID=1005&swLang=8&taskId=135&mode=5

Hope this helps!
Regards
Torsten.

gregersenj
HPE Pro

Re: Blade failure (after firmware update?)

It could be:
1. Bad CPU (PPM)
2. Bad memory
3. Bad system I/O board
4. Bad mezzanine cards

Never clear the configuration, unless you have a bad CPU in socket 1 or you have messed up the configuration beyond salvation.

If a ProLiant server gets a CPU failure, it will ASR and disable the bad CPU.
But it must have a functional CPU in socket 1.
So, if CPU 1 fails, you must replace it and clear the configuration using the switch.

The first 8 KB memory block must be OK; otherwise you will never get any video, because POST halts before video initialization.

You can try replacing CPU 1 and clearing the config.
Also check the PPMs around CPU 1, if the blade has any. If there's a PPM and it's bad, you must replace it and clear the config.

Try to replace bank 1, and maybe clear config.

Remove the Mezz cards.

If you strip it down to 1 CPU and 1 memory bank and it still doesn't POST after a config clear, then it must be the system I/O board.
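
The elimination steps above can be sketched as a small decision helper. This is only an illustration of the reasoning, assuming the steps are tried in the order listed in this reply; `STEPS` and `diagnose` are hypothetical names, and the input is simply whether the blade POSTed after each step.

```python
# Each entry: (action to try, component implicated if the blade POSTs afterwards).
# The order follows the reply above and is an assumption, not an HP procedure.
STEPS = [
    ("replace CPU 1, then clear the configuration", "CPU 1"),
    ("replace the PPM near CPU 1 (if fitted), then clear the configuration",
     "PPM for CPU 1"),
    ("replace memory bank 1", "memory in bank 1"),
    ("remove the mezzanine cards", "mezzanine card"),
]


def diagnose(posted_after_step):
    """Map a list of POST results (one bool per step tried) to a suspect."""
    for (_action, suspect), posted in zip(STEPS, posted_after_step):
        if posted:
            # The blade came back right after this change, so blame that part.
            return f"suspect: {suspect}"
    if len(posted_after_step) < len(STEPS):
        # Not every step has been tried yet; suggest the next one.
        next_action = STEPS[len(posted_after_step)][0]
        return f"inconclusive, next: {next_action}"
    # Everything swappable was ruled out, which leaves the board itself.
    return "suspect: system I/O board"


if __name__ == "__main__":
    print(diagnose([False, False, True]))
    print(diagnose([False, False, False, False]))
```

Changing one component at a time is what makes the result meaningful; swapping several parts in one go would leave more than one suspect.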

BR
/jag