Random hangs on DL380p G8 HP servers for more than one year with no solution by HP

 
Jose-gmv
Occasional Visitor


Hi

We are having problems with five HP DL380p G8 servers that hang randomly.

The hangs occur at intervals of weeks or months, and neither we nor HP know the reason.

We are using these servers in a critical environment and this problem is not acceptable.

 

During a hang the servers can only be reached through iLO: the KVM does not work and the servers do not answer ping. ONLY iLO.

 

We have worked with HP for more than one year and they have replaced:

 

- HD firmware

- iLO firmware

- Drive firmware

- ROM

- hpsa firmware

 

They have replaced the system board but the problem persists.

 

 

The servers are located in different physical places, so environmental factors cannot explain the hangs.

The application software is different on each server, so it cannot explain the hangs either.

The servers were all bought at the same time, so it seems there could be some faulty hardware specific to this series of servers.

The OS installed on these servers is SLES 11 SP2.

 

No OS-related information is obtained after the crash, even though we have activated the following:

 

 

1- ASR service up and running

We have checked that ASR was running before and after the crash, but ASR was not triggered automatically after the hang as expected.

ASR was supposed to restart the machine in this kind of situation, so we had to restart it manually using iLO.

 

2- NMI was up and running

We have checked that NMI was up and running before and after the crash, but it was not possible to trigger an NMI restart from iLO.

That problem could be the reason why we cannot obtain kernel dumps for this kind of hang.

 

3- Kernel dump configured

Kernel dump was configured to be generated automatically after an OS hang, but it did not work. We are sure that kernel dump was active because we were able to force a manual kernel dump following the HP procedure (it was executed just before recovering from the first hang).
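For points 2 and 3, one thing that may be worth checking is whether the kernel is set to panic when it receives an NMI, since kdump only runs after a panic. The following is a minimal sketch, not an HP procedure. The sysctl names are standard Linux ones, but whether all three exist on a given SLES 11 SP2 kernel is an assumption, and the function name and its optional directory argument are hypothetical, added only for illustration and testing.

```shell
# Print the state of the NMI-related panic sysctls.
# $1: optional alternate location of /proc/sys (hypothetical, for testing).
check_nmi_panic() {
    base="${1:-/proc/sys}"
    for name in unknown_nmi_panic panic_on_unrecovered_nmi panic_on_io_nmi; do
        f="$base/kernel/$name"
        if [ -r "$f" ]; then
            value=$(cat "$f")
            if [ "$value" = "1" ]; then
                echo "kernel.$name = 1 (enabled)"
            else
                echo "kernel.$name = $value (disabled)"
            fi
        else
            # The file is absent when the running kernel lacks this sysctl.
            echo "kernel.$name not available"
        fi
    done
}
```

Calling `check_nmi_panic` with no argument reads the live values under /proc/sys. If everything is reported as disabled, an NMI sent from iLO may simply be ignored or logged by the kernel instead of leading to a dump.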

 

The hang seems so severe that the OS is totally blocked and cannot generate a log. We can, however, create a crash manually by running

running echo c >/proc/sysrq-trigger
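Before forcing a crash this way, it can help to confirm that the crash kernel is actually in place, because without it the forced panic produces no dump. The sketch below checks two standard Linux locations: the `crashkernel=` parameter in /proc/cmdline and the flag in /sys/kernel/kexec_crash_loaded. The function name and the optional root argument are hypothetical and only there to make the check testable.

```shell
# Report whether a forced crash could produce a kernel dump.
# $1: optional alternate filesystem root (hypothetical, for testing).
# Returns 0 only when both checks pass.
kdump_ready() {
    root="${1:-}"
    status=0
    # Memory for the crash kernel must be reserved at boot time.
    if grep -q 'crashkernel=' "$root/proc/cmdline" 2>/dev/null; then
        echo "crashkernel= present on kernel command line"
    else
        echo "crashkernel= missing from kernel command line"
        status=1
    fi
    # The crash kernel must have been loaded by the kdump service.
    if [ "$(cat "$root/sys/kernel/kexec_crash_loaded" 2>/dev/null)" = "1" ]; then
        echo "crash kernel is loaded"
    else
        echo "crash kernel is NOT loaded"
        status=1
    fi
    return "$status"
}
```

Running `kdump_ready` with no argument inspects the live system. It only confirms the prerequisites; it does not show that a dump will be written during a real hang.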

 

 

We have checked the status of the server when a hang happens and it reports that everything is OK ==> BUT IT IS TOTALLY HUNG

 

We are completely blocked on this issue, and HP does not provide any solution, even when we pay for one.

1- We have requested a contract for a consultant or any similar kind of service, but no solution is provided by HP.

2- We have asked HP to replace the servers, but HP says that is not possible, even though HP has already replaced firmware, fans, system boards....

3- We have paid for different HP Care Packs ==> hardware and software.

Any clue will be welcome; we have been totally blocked for MORE THAN ONE YEAR.


2 REPLIES
waaronb
Respected Contributor

Re: Random hangs on DL380p G8 HP servers for more than one year with no solution by HP

It sounds like another common factor, besides the model of the server, is the OS you're using, SLES 11+sp2.

Do you have any machines of the same model running some OS other than SLES, or have you tried running a flavor of Linux besides SUSE, like Red Hat or Ubuntu?

It kind of sounds like there might be a general issue with the kernel or some driver... to make the whole machine hang it seems more kernel level but I'm not that familiar with the Linux architecture.
Jorge_Gamboa
Occasional Visitor

Re: Random hangs on DL380p G8 HP servers for more than one year with no solution by HP

Try running diagnostics on each memory module one by one; one of them could be damaged. Does the IML say anything?
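
If the HP health tools are installed on the SLES hosts, the IML can probably be read from the OS as well as from iLO. The following is a small sketch for pulling out memory-related entries. It assumes the log text arrives on standard input, and the function name and the keyword list are hypothetical. The thread does not show the exact IML output format, so the filter is deliberately loose.

```shell
# Keep only log lines that mention memory or DIMMs (case-insensitive).
# Reads log text on stdin; prints nothing when there is no match.
iml_memory_entries() {
    grep -i -E 'memory|dimm' || true
}
```

One possible use, assuming `hpasmcli` from the hp-health package is available, is `hpasmcli -s "show iml" | iml_memory_entries`. An empty result does not rule out a bad module, so the one-by-one diagnostics are still worth doing.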