ProLiant Servers (ML,DL,SL)

HELP!! Servers are locking up!!

 
Ryan Hobbs_2
Advisor

We currently have 5 HP DL385 servers. 3 are running VMware Server and two of those (so far -- the ones with the most VMware guests running) randomly lock up. HP Integrated Management detects an issue and the server automatically reboots itself after the default 10-minute ASR time-out expires. When this happens, all VMware guests are abruptly stopped as if the power was pulled on the server:

1 ASR Detected by System ROM 9/5/2006 3:48PM 9/5/2006 3:48PM
2 ASR Detected by System ROM 9/5/2006 1:19AM 9/5/2006 1:19AM

These are both HP DL385's with 2xDual Core 275 AMD Opterons and 8GB RAM. The other vmware DL385 (that has not performed an ASR reboot) is a 2xDual Core 280 AMD Opteron with 10GB RAM (the 10GB was swapped from one of the problem servers to try and rule out RAM issues).

The other 2 DL385's are NOT running vmware and have not had an ASR reboot. 1 is 2xdual core 280 Opteron with 7GB RAM while the other is a 1xdual core 280 Opteron with 4GB of RAM.

I have had time/clock issues on ALL of these servers since day one! With cpuspeed running, the server incorrectly lists the CPU speed as almost half the actual speed (which it may actually be if it is in a lower-powered state). And I have had nothing but issues with the VMware guests keeping their clocks in sync... they tend to speed up quite a bit!

I have contacted HP regarding this issue as we have a support plan with them and they are firm in their stance regarding our non-HP RAM. They won't rule out the RAM as the issue.. even though I have swapped RAM around with our other DL385 systems. All of these modules are name brand, comply with full manufacturer specs, and should be fully compatible with the HP RAM.. but apparently HP will not offer support on non-HP hardware when you call them on the phone (which I think is total B.S. -- they even wanted me to take out the 3rd internal 3com NIC!!!)

INFO:
===========
HP DL385 ProLiants running RHEL ES 4.3 &/or CentOS 4.3 (firmware/BIOS fully up-to-date)

# rpm -q VMware-server
VMware-server-1.0.0-28343

# uname -srvi
Linux 2.6.9-34.0.2.ELsmp #1 SMP Fri Jul 7 18:22:55 CDT 2006 x86_64

I am willing to give you anything you need to know (except my SSN & Mother's maiden name) as this is affecting our production systems!

Anyone have any ideas what is going on or where to begin looking for a resolution to this issue? At this point I don't believe it to be hardware related as it is not isolated to a specific machine.

/var/log/vmwared.log && the /vmware/#####.log's didn't appear to have any entries corresponding to the times the server rebooted.

I also tried installing VMware-server-1.0.1-29996 but had issues loading the guests on it, so I went back to the currently installed 28343 build.

FROM /var/log/messages
Sep 3 08:09:33 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 3 14:27:52 sparvmwp01 ntpd[2739]: no servers reachable
Sep 3 14:44:56 sparvmwp01 ntpd[2739]: synchronized to 172.23.1.253, stratum 11
Sep 3 14:44:43 sparvmwp01 ntpd[2739]: time reset -13.047460 s
Sep 3 14:48:59 sparvmwp01 ntpd[2739]: synchronized to 172.23.1.253, stratum 11
Sep 4 03:06:53 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 4 12:24:04 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 00:52:37 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 01:21:24 sparvmwp01 syslogd 1.4.1: restart.
Sep 5 01:21:24 sparvmwp01 syslog: syslogd startup succeeded
...
Sep 5 14:52:52 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 15:16:51 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 15:51:38 sparvmwp01 syslogd 1.4.1: restart.
Sep 5 15:51:38 sparvmwp01 syslog: syslogd startup succeeded

I don't know if it is just coincidence, but the last messages logged before each reboot are all lost interrupts.. however they are not _just_ prior.. but within an hour or so. I'm starting to think this may be a clock/timing issue that is causing vmware to choke.
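
One way to check that correlation across a longer log is to measure, for every reboot marker, how long before it the last "lost some interrupts" message appeared. The following is only a minimal Python sketch under stated assumptions: the function names (`parse_line`, `gaps_before_restart`) are hypothetical, the year is assumed to be 2006 because syslog lines carry none, and a `syslogd ... restart.` line is treated as the reboot marker even though that line can also show up for other reasons (log rotation, for example). The sample lines are a subset of the /var/log/messages excerpt above.

```python
from datetime import datetime

# Subset of the /var/log/messages excerpt quoted in the post above.
SAMPLE = """\
Sep 3 08:09:33 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 4 03:06:53 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 00:52:37 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 01:21:24 sparvmwp01 syslogd 1.4.1: restart.
Sep 5 01:21:24 sparvmwp01 syslog: syslogd startup succeeded
Sep 5 14:52:52 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 15:16:51 sparvmwp01 kernel: rtc: lost some interrupts at 2048Hz.
Sep 5 15:51:38 sparvmwp01 syslogd 1.4.1: restart.
"""


def parse_line(line, year=2006):
    """Split a syslog line into (timestamp, message); return None if it does not parse.

    Syslog timestamps have no year, so one is assumed (hypothetical default: 2006).
    """
    parts = line.split(None, 4)  # month, day, time, host, rest of the message
    if len(parts) < 5:
        return None
    month, day, clock, _host, message = parts
    try:
        stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
    return stamp, message


def gaps_before_restart(lines, year=2006):
    """For each syslogd restart, return (restart_time, time since last lost-interrupt message)."""
    last_lost = None
    gaps = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        stamp, message = parsed
        if "rtc: lost some interrupts" in message:
            last_lost = stamp
        elif message.startswith("syslogd ") and "restart" in message:
            if last_lost is not None:
                gaps.append((stamp, stamp - last_lost))
            last_lost = None  # start fresh after each restart
    return gaps


if __name__ == "__main__":
    for restart, gap in gaps_before_restart(SAMPLE.splitlines()):
        print(f"restart at {restart:%b %d %H:%M:%S}, last lost-interrupt message {gap} earlier")
```

For the two restarts in the sample, the gaps work out to roughly 29 and 35 minutes. The restart line marks when the box came back up, not when it hung, so with a 10-minute ASR time-out the real gap between the last lost-interrupt message and the lock-up is shorter than these figures.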
8 REPLIES
NMory
Respected Contributor

Re: HELP!! Servers are locking up!!

Ryan:

To be honest, if this is very important to you, and you are willing to do anything, then I recommend you buy some HP original RAM, just the minimum config possible for one of the servers, and test only that one server with HP original RAM.
I have seen many weird issues on ProLiant servers with NON HP ORIGINAL RAM.

I recommend you do that; that way you will have HP Support. And if you are not using the 3COM NIC, take it out of the one server you are going to test as well, and try to use the internal NICs or buy an HP supported NIC.

Well that's just me in your position.

Regards,

LN
david-williams
Occasional Visitor

Re: HELP!! Servers are locking up!!

for your "clock" issues please visit http://kb.vmware.com/vmtnkb/search.do?cmd=displayKC&docType=kc&externalId=892&sliceId=SAL_Public&dialogID=1638096&stateId=0%200%201636916&doctag=Author,%20KB%20Article

It walks you through how to fix it, but it only happens with *NIX. I have the same issue, but it's not critical to me as I update the time from a time server every 10 min via crontab.

With memory you have to be careful how you populate the slots: if not all of your memory is HP memory, you have to populate the HP memory first, then the rest. ProLiants are quite particular about how you populate the memory slots, so pay special attention to it.

DW
Ryan Hobbs_2
Advisor

Re: HELP!! Servers are locking up!!

@LNassar: We have already spent considerable $$ to buy the RAM in the systems. This RAM is not generic and is name brand.. it just doesn't have the HP sticker on it.

> I recommend you to do that, that way you will have HP Support.

I agree that would be the easiest to continue to work with HP.. but it is ridiculous that they do not offer support on their servers unless you get everything in the system directly from them! If I put a 3rd party oil filter on my car, the dealership doesn't void my warranty or refuse to service my issues until I put an OEM filter back on... I understand this is more of a political HP thing than anything you are stating.

>..if you are not using the 3COM NIC, take it out also

We are using it and it is in the system when the ASR is detected.

Thanks for the suggestions.


@david-williams: The issue we are experiencing is not that the time is slow.. but fast.. the time keeps gaining and therefore the vmware-tools don't keep them in sync very well. kernel boot parameters (in the past) have seemed to help.. but overall the time sources are just not very stable.
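
The direction of the drift can also be read from the host's own ntpd messages. This is a rough Python sketch, not a definitive tool: `ntp_steps` and `describe_step` are hypothetical helper names, and it assumes the usual ntpd convention that a negative "time reset" value means the local clock was ahead of the time source and had to be stepped back. The sample lines are the ntpd entries from the /var/log/messages excerpt in the first post, which come from the VMware host rather than from a guest.

```python
import re

# Matches ntpd step messages such as "ntpd[2739]: time reset -13.047460 s".
RESET_RE = re.compile(r"ntpd\[\d+\]: time reset ([+-]?\d+\.\d+) s")

# ntpd entries copied from the log excerpt in the first post.
SAMPLE = [
    "Sep 3 14:27:52 sparvmwp01 ntpd[2739]: no servers reachable",
    "Sep 3 14:44:56 sparvmwp01 ntpd[2739]: synchronized to 172.23.1.253, stratum 11",
    "Sep 3 14:44:43 sparvmwp01 ntpd[2739]: time reset -13.047460 s",
    "Sep 3 14:48:59 sparvmwp01 ntpd[2739]: synchronized to 172.23.1.253, stratum 11",
]


def ntp_steps(lines):
    """Return the step offsets (in seconds) from every ntpd 'time reset' line."""
    steps = []
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            steps.append(float(match.group(1)))
    return steps


def describe_step(offset):
    """Interpret a step offset, assuming negative means the local clock was ahead."""
    if offset < 0:
        return "clock was ahead (running fast)"
    if offset > 0:
        return "clock was behind (running slow)"
    return "no offset"


if __name__ == "__main__":
    for offset in ntp_steps(SAMPLE):
        print(f"{offset:+.6f} s: {describe_step(offset)}")
```

Under that assumption, the single step in the sample (about 13 seconds backwards) would mean the host clock had been gaining time.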

> with memory you have to be careful how you populate the slots, if you do not have HP memory then you will have to first populate with the HP memory then the rest.

I think I have decided to populate one of the DL385's with nothing but HP RAM so I can at least get some support on this issue from HP. The next time this freezes up and it is on the one with the HP RAM, you better believe I will be calling them. I really think this may be a driver or kernel module issue -- that is, if it is not hardware related.. I just don't think it is hardware related, since it is occurring on multiple servers.

Re: HELP!! Servers are locking up!!


I have no particular experience with your exact problem, but I had a very strange and difficult problem with a group of DL380's a year ago. They were all brand new and just locked up and ASR'ed randomly. We chased it around, and around. HP replaced system boards, Memory, CPU, etc. (all were "pure" HP parts). Finally, they blamed the Emulex FC HBA cards -- which we even got Emulex to replace. No fix.

Eventually, we found the fix....

It was the PCI-slot riser card cage. Although screwed in, they were apparently slightly mis-seated. Either that way from the factory, loosened in shipping, or we did it when installing the Emulex cards. If you've added a 3Com nic, I suspect you may have removed the cage to insert the card -- or simply jostled it while inserting the card.

My recommendation: Remove the PCI Cage Riser Card. Remove the 3Com card. Very carefully, firmly, and deliberately reinstall the 3Com card ensuring it's perfectly seated and aligned in the slot. Lock or screw it in place. Then do the same to the riser card cage -- firmly and accurately seat the cage onto the motherboard. It takes a good amount of force to get it in there tightly & fully. Don't be scared -- PUSH! Then screw that sucker down.

It fixed our problems. No magical ASR's in over a year.

Good Luck.
Ryan Hobbs_2
Advisor

Re: HELP!! Servers are locking up!!

Pete: Thanks for the info. I will definitely do that this weekend. This issue is getting to be a pain. So, here is what I have decided.

Again, we have a total of 5 of these DL385's so I will be combining the 'HP' RAM into one machine giving it 4GB (all are only 512MB DIMMs ea) and ordering an HP NIC so I don't get any $#!+ from HP about non-HP parts (which is absolutely ludicrous imho). The only VMware DL385 that has not experienced a lock-up will be moved from a TEST & D/R Server into production swapping it out with the production box that ASRs more often.

We only run 7 vmware guests on the production box with a total allocated virtual RAM of 3GB so we should be okay there. I am also planning on upgrading the kernel to Update 4 as well as getting vmware 1.0.1 installed. Lastly, I will completely disable all of the HP drivers (including bcm5700) & system management software on the machine.

Hopefully I will be able to report back in a few weeks and say I have had no issues. Cross your fingers for me..
Mike Johnson_14
Occasional Advisor

Re: HELP!! Servers are locking up!!

Has anyone found an answer to these lockups and reboots? We have had ~15 ASR reboots in the last 5 months on about 5 different DL385 boxes and 2 DL585 boxes and have gotten nowhere with HP. We got 1 full vmcore out of all of those lockups and 2 partial.

On the full vmcore, Redhat was able to trace the problem down into the zap_page_range portion of the VM subsystem, but couldn't say exactly what made that happen. The partials didn't give any useful info.

Some of these boxes were running support pack 7.52 while others were running 7.60. All of them running RHEL3, but some running 32bit (update5 and update6) and some running 64bit (update 6).

We have disabled ASR on some and they haven't rebooted since, although on others nothing has changed and they haven't rebooted either, but no one knows why.

Has anyone gotten any solid information on this problem?
pj_11
Super Advisor

Re: HELP!! Servers are locking up!!

Hello,

I had the same problem on 2 of my DL385s a while back.

Found out this was due to 3rd party memory.
Best to always buy HP memory; it will save you a headache.

PJ.
Mike Johnson_14
Occasional Advisor

Re: HELP!! Servers are locking up!!

PJ, how did you find 3rd party memory to be the reason?

Thanks