
os/ilo freezes, possible ilo problem on dl320?

 
MartyB
Frequent Advisor


Sorry for the length, but this is a strange problem. I have two DL320 G5s and four DL360 G5s with the following BIOS/ILO versions installed:

DL320 G5:
BIOS: W04 08/21/2007
ILO: 1.43 12/12/2007

DL360 G5:
BIOS: P58 11/13/2007
ILO: 1.43 12/12/2007

Both DL320s became unresponsive within about 12 hours of each other. Both the ILO and the OS (Linux, bonded NICs) were unpingable. At first I suspected power, but when I went to the datacenter both systems were powered on and the NICs were flickering away. When I hooked up a monitor, there was no video output. When I rebooted via the power button on the front, there was still no video, and the NICs never came up for either the ILO or the OS. I finally pulled the power cord from the back, which did the job. Both the ILO and the OS came up.

The time/date on both the OS and the ILO reverted to 8/20/2007, which is the BIOS build date (minus one day, for some reason). The OS logs showed nothing useful, but the ILO log was a bit more interesting.

I run a script from another system every hour that SSHes into the ILO on each box and gathers HW data to ensure there are no failures (temp, power, etc.). It isn't the most efficient script, as it makes about 10 separate SSH connections to the ILO at the top of every hour, but it works great.
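
A minimal sketch of what an hourly poll like this might look like, one SSH connection per command as described above. The actual script is not shown in this post, so everything here is an assumption: the `monitor` user, the host names, and the command strings are hypothetical placeholders, and the real ILO CLI commands would have to be substituted. `BatchMode=yes` and `ConnectTimeout` are standard OpenSSH client options; the first makes key-authentication failures return an error instead of prompting for a password.

```python
import subprocess
import time
from datetime import datetime, timezone


def ssh_argv(host, user, command, connect_timeout=10):
    """Build the ssh command line for one command on one ILO."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail if key auth fails
        "-o", f"ConnectTimeout={connect_timeout}",
        f"{user}@{host}",
        command,
    ]


def poll(hosts, commands, user="monitor", run=subprocess.run, pause=0.0):
    """Run every command on every host, one SSH connection per command.

    Returns one record per connection with a UTC timestamp, so failures
    can later be lined up against the ILO event log.
    """
    results = []
    for host in hosts:
        for command in commands:
            started = datetime.now(timezone.utc).isoformat()
            proc = run(ssh_argv(host, user, command),
                       capture_output=True, text=True)
            results.append({
                "host": host,
                "command": command,
                "time": started,
                "ok": proc.returncode == 0,
                "output": proc.stdout,
            })
            time.sleep(pause)  # optional gap between connections
    return results


# Example call (hypothetical host names and placeholder commands):
# poll(["ilo-dl320-1", "ilo-dl320-2"], ["<ilo command 1>", "<ilo command 2>"], pause=2.0)
```

Recording the timestamp and exit status of every connection makes it easier to see when the failures began relative to the "server reset" entries. The `pause` argument spaces the connections out rather than firing all of them back to back.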

The ILO log for BOTH DL320s shows that exactly 26 hours before each system went down, my SSH connections were failing. The log claims authentication failed, but since I use SSH keys and not password authentication, I highly doubt it. The final two messages at the time the ILO/OS crashed on both systems are "server reset", then "server power restored".

No other systems in the datacenter logged any power events, and keep in mind the failures for each DL320 were offset by about 12 hours.

The DL360s, on the other hand, are faring only somewhat better. All their OSes are still up and running, but 3 of the 4 systems have broken ILOs (pingable, but I can't log in via HTTP or SSH). The one ILO that's still available had its clock reset and has the same "server reset" and "server power restored" messages as the DL320s (although in this case they were logged 2 days before the first DL320 logged its instance of the messages); in this one case, however, the ILO seems to have survived. Oh, and there were no SSH login failures logged like there were on the DL320s.
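
For an ILO that answers ping but refuses logins, one quick check is whether its SSH and web ports still accept TCP connections at all. The sketch below is only illustrative: it assumes the ILO uses the default ports 22, 80, and 443, and an open port says nothing about whether a login would succeed. It only separates "service not listening" from "listening but login fails".

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def probe(host, ports=(22, 80, 443)):
    """Map each port to whether it currently accepts TCP connections."""
    return {port: port_open(host, port) for port in ports}


# Example call (hypothetical host name):
# probe("ilo-dl360-1")
```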

So, that's pretty much it. If you made it this far, thanks for reading this book of a post ;-) Anyone have any ideas? I have since stopped my script that gets HW vitals and I'll probably open a ticket with HP, but I was hoping someone else has seen similar behavior from their systems and might have some advice.

Thanks!
Brian_Murdoch
Honored Contributor

Re: os/ilo freezes, possible ilo problem on dl320?

Marty,

If the servers are on a UPS, can you try one of them on raw mains just to see if that makes any difference? There have been numerous reports of the G5 models being very fussy about the type of UPS attached.

Regards,

Brian
MartyB
Frequent Advisor

Re: os/ilo freezes, possible ilo problem on dl320?

Thanks for the reply, Brian. These servers are at a colocated datacenter, so my options for testing different power sources are quite limited. Also, the first thing I did when the first server went down was call their NOC to find out if there were any events (power events, specifically) that would've affected my systems. They had nothing to report, so these servers never went to battery power. From what I've read, the issues with the G5s are related to battery power, correct?
Brian_Murdoch
Honored Contributor

Re: os/ilo freezes, possible ilo problem on dl320?

Hi Marty,

Yes, the problems mainly occur when the UPS switches to battery power and the attached G5s reboot rather than staying up.

There are also a number of other PSU issues with the G5s, but those generally affect the DL38X and ML3XX chassis, which use the common PSU, unlike the 1U units (DL320, DL360).

The reset/date issues are certainly strange, just as if the CMOS reset switch had been flicked and the clock had reverted to the CMOS date. Odd.

Sorry I can't be of more use at this point but I'll keep looking.

Regards,

Brian
MartyB
Frequent Advisor

Re: os/ilo freezes, possible ilo problem on dl320?

Thanks for staying on it, Brian. HP tech support is asking that I upgrade the ILOs to the latest revision, which isn't surprising. I guess that's my next step!