Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

 
SOLVED
Andi Schaefer
Occasional Advisor

HP DL385 G6: Cannot login after uptime of 12-24 hours.

Hello.

We have had a DL385 G6 for a couple of weeks now and are having serious problems with this machine.


The machine is running Windows 2003 x64 R2, which was installed using the SmartStart CD.
The machine is running VMware Server 2.

It has 2 AMD CPUs (6 cores each), 32 GB RAM (8x4 GB) and 16 SAS drives in two cages.

We have updated to the most recent PSP, 8.40.

We also had different firmware on two of the four SAS backplanes; these have now been updated (from 1.14 to 1.16).


Previously we saw ASR reboots because of missing watchdog events.

We have now disabled ASR in RBSU to "see" how the server behaves.

The symptoms are as follows:

After 12-24 hours of uptime it is no longer possible to log on to the machine.
The virtual machines are running fine.

If you try to log in using RDP, the RDP client does not show a login window. The window stays black.

If you try to log in on the local console using the iLO, you can fill in your credentials in the Windows login dialog. But after you click the [OK] button the dialog is grayed out and hangs in this state forever.

During this time the virtual machines behave "normally".

After another 12 hours you can no longer log in to the VMware Server web interface (the login hangs).

Does anyone have a tip on how to narrow down the problem?

Could it be a "memory leak" problem?

10 REPLIES
Jan Soska
Honored Contributor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Hello Andi, I recommend checking whether you have the latest firmware and software installed from this page: http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=3949988&prodTypeId=15351&prodSeriesId=3949986&swLang=13&taskId=135&swEnvOID=1113 . We had a similar problem with almost the same symptoms; it was caused by the server and iLO firmware plus the iLO driver and iLO management driver. After installing the latest firmware and drivers (NEWER than PSP 8.4 from March 2010), it looks to be solved.
Try it and come back with the results.

Jan
Andi Schaefer
Occasional Advisor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Hello.

First of all, I used the latest PSP to update all firmware on the system.

I had a comment from HP support that running VMware Server 2 on this system is not supported. I had assumed that running VMware Server 2 on top of an operating system that is on the list of supported OSes (Windows 2003 x64) would be no problem, like running any other "user application". Well, I've been taught otherwise. The bad thing: when selecting the system I was in contact with an HP salesman online, and I always talked about using VMware Server 2 and never mentioned ESX, and there were no complaints about this configuration.

We are running an older DL380 G5 with this configuration with great success. OK. I reinstalled the OS using the SmartStart 8.40 from the link that Jan supplied in his reply. The machine was set up to do multiple heavy copy operations over the network to the local disks and has now been running for 2 days without any problems.

I will switch to VMware ESX today. The Windows/VMware Server 2 combination was chosen to make handling the system easier for the "Windows admins" (drag and drop a VM from one folder to another, see OS state, health, etc.). This way, as a consolation, we are forced to start using VMware ESX (from my POV the better choice ;-) ).

Thanks to all for your help
Andi
Jan Soska
Honored Contributor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Hello,
did you update the firmware I mentioned? Some of it is newer than PSP 8.4 (8 Mar 2010): iLO2 and BIOS. Generally, I doubt it is related to VMware Server.

Do not forget to assign points if this was helpful.

Jan
Andi Schaefer
Occasional Advisor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

I already updated things using the

Smart Update Firmware DVD 9.00

from the link above, which is dated 12th of April 2010.

I also made a USB key from it containing the most current SAS backplane firmware (1.16) to update the SAS backplanes, two of which had firmware 1.14.

I'm assuming that this update DVD contains the latest version of every firmware released up to the 12th of April. Am I wrong? Am I missing anything?

Jan Soska
Honored Contributor
Solution

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Hello Andi,
you are almost OK. Unfortunately, this DVD does not include iLO2 firmware 1.82, which was one part of our solution; see the contents of the firmware DVD at: ftp://ftp.hp.com/pub/c-products/servers/management/smartstart/FWContent900.pdf .
I recommend updating the iLO2 firmware through the link I provided in my previous post. You can do it from Linux or Windows, or extract the contents of the Windows update package and update the iLO2 firmware directly via the iLO2 interface.

Jan
Andi Schaefer
Occasional Advisor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Oops. You were right, Jan, I missed that detail.

I downloaded the x64 installation exe and ran it on the machine. It updated the iLO2 firmware from 1.81 to 1.82.

I will now install VMware Server 2 again and see what happens in the next 24 hours...

Stay tuned...
Andi Schaefer
Occasional Advisor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.


And here are the points from the German jury!

"10 points for Jan"

Jan, I have to thank you very much for your help. I updated the iLO firmware to 1.82 and we have now been running the server with Windows 2003 x64 and VMware Server 2 for five days without any problems. Great.

Honestly, I did not think that it could be the reason for the effects we had.

Again - thanks...
Jan Soska
Honored Contributor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Andi, thanks for the points.
We faced the same issue, so I knew it... :)

Jan
Andi Schaefer
Occasional Advisor

Re: HP DL385 G6: Cannot login after uptime of 12-24 hours.

Hi.

Sadly enough it is me again.

We still have the issue that the DL385 G6 runs into problems after a period of operation.

Sometimes the machine runs for 4 weeks without problems and sometimes it hangs after 6 days.

When the situation comes up it is no longer possible to do any "GUI" actions.
An already logged-in user session starts hanging (GUI elements stay on screen when they should disappear, parts of menus remain on the screen) and very quickly (within just a couple of clicks) no GUI action can be done at all. At this stage no login (neither local nor via RDP) is possible. However, virtual machines that are already running keep running and it is possible to interact with them (e.g. ssh to them).
The VMware Server 2 web interface is also dead at this stage.

The feeling is that the VMs begin to act "slower" than they normally "feel".
I ran hdparm to test disk throughput. Normally this is around 40-80 MB/s or more, but when the situation comes up some tests show only 460 kB/s disk throughput in the VM (Debian x86 guest)! (floppy disk speed, wow)
The Linux VM guests start to spew kernel messages along the lines of "...cpu stuck..." and the like.
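
To pin down when the throughput collapse starts, a guest could log a rough read rate at regular intervals. The following is only a minimal sketch, not a replacement for hdparm: the function names, the 40 MB/s baseline (taken from the lower end of the range quoted above) and the 10% threshold are assumptions made for illustration. It reads through the normal file API, so cached data can inflate the number.

```python
import os
import tempfile
import time

BASELINE_MB_S = 40.0  # assumed baseline: lower end of the 40-80 MB/s quoted above


def read_throughput_mb_s(path, block_size=1024 * 1024, max_bytes=64 * 1024 * 1024):
    """Sequentially read up to max_bytes from path and return the rate in MB/s."""
    total = 0
    start = time.perf_counter()
    # buffering=0 only disables Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against division by zero
    return total / (1024 * 1024) / elapsed


def is_degraded(mb_s, baseline=BASELINE_MB_S, fraction=0.1):
    """Return True when the rate is below the given fraction of the baseline."""
    return mb_s < baseline * fraction


if __name__ == "__main__":
    # Demo on a throwaway 8 MiB file. On a real guest, point this at a large
    # existing file and run it periodically to build a timestamped history.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        rate = read_throughput_mb_s(tmp.name)
        status = "DEGRADED" if is_degraded(rate) else "ok"
        print(time.strftime("%Y-%m-%d %H:%M:%S"), f"{rate:.1f} MB/s", status)
    finally:
        os.remove(tmp.name)
```

With these thresholds the 460 kB/s reading reported above would be flagged as degraded, while anything in the normal 40-80 MB/s range would not.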

I already used the latest Smart Update Firmware DVD 9.10 (B) and the latest Lights-Out firmware 2.01 (Sept. 7th 2010, cp013600.exe).

To me it looks like the point at which the problem arises depends on the total amount of "I/O" the system is doing (maybe a kernel memory leak?).
Last time it died after 6 days, while I was doing a very large install (Lotus Connections 2.5 Pilot on a Win2003 guest); the install could not finish because the machine became unresponsive on the console session.
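
One way to test the leak hypothesis is to sample a kernel memory counter on the host at intervals (for example the Performance Monitor counter Memory\Pool Nonpaged Bytes) and check whether it grows steadily until the hang. The sketch below only covers the trend calculation: the function name and the (seconds, bytes) sample format are assumptions for illustration, and how the samples are collected is left open.

```python
def growth_per_hour(samples):
    """Least-squares slope of (seconds, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Ordinary least-squares slope: covariance(t, v) / variance(t).
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den * 3600.0


if __name__ == "__main__":
    # Synthetic example: the counter rises by 100 bytes every hour.
    print(growth_per_hour([(0, 100), (3600, 200), (7200, 300)]))
```

A slope that stays clearly positive over days, and that scales with the amount of I/O, would support the leak theory; a flat slope would point elsewhere (cabling, controller or drive firmware).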

One thing that is "special" about the machine is that the standard drive cages were exchanged for the ones that hold more drives (16). The machine already came in this configuration.

Could this be the cause (bad cabling, drive firmware bug, controller firmware bug)?

I'm near the point where I will wipe the Windows OS from the machine and put Linux on it, because I think the Linux kernel would start spitting out more info if something starts going wrong.
BTW: when looking into the Event Viewer System and Application logs, everything looks fine, as if there were no problem at all.

I also looked around the internet and found no reports of this specific problem with this machine. So maybe it is something unique to this single computer?

I'm at the end of my knowledge!

Greetz
Andi