
ML370 W2K3 Random Reboots

 
Aaron Firouz
Occasional Advisor

Hey all.

I've been having the strangest issue for the past few weeks, and was hoping someone could provide some insight on the issue.

Our environment is about 200 users on 6 terminal servers (standard MS-RDP, not Citrix). The servers are a mix of ML370 G3s, G4s, and ML350 G4s. Our DCs and file servers are also ML350s.

Due to various issues, we've rebuilt all of the servers over the last three months. Everything was running fine for about a month, when one of the ML370 G3s started randomly rebooting, sometimes twice a day. We took that server down, and then it began happening to the other servers (all models) as well. Now it is like a game of whack-a-mole: any of the servers may go down at any time during operational hours.

When I say "randomly reboot", I mean that the system just goes straight down, and boots back up; no STOP error, no power down cycle. The system's event log has no entries for the few minutes preceding the reboot, and none of the Insight Agents report anything.

All the servers have the latest drivers, and latest firmware. The reboots primarily tend to happen during standard peak hours (beginning of day, lunchtime, and end of day), but I've run performance monitoring, and seen nothing that should crash a server. I'm now going through and trying to diagnose the various software installed on the machines, but I figure if that was the case, Windows would log something, or get a STOP error.
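One way to check the "peak hours" impression is to bucket the reboot times by hour of day. The snippet below is only a minimal sketch: the timestamps are made-up placeholders, and the function name `reboots_by_hour` is hypothetical. On Windows Server 2003 the real times could be copied by hand from the System log entries with event ID 6008 ("The previous system shutdown ... was unexpected").

```python
from collections import Counter
from datetime import datetime


def reboots_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count reboots per hour of day (0-23) from a list of timestamp strings."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return dict(sorted(hours.items()))


# Placeholder data, NOT taken from the servers discussed in this thread.
sample = [
    "2006-03-01 08:05",
    "2006-03-01 12:10",
    "2006-03-02 08:40",
    "2006-03-02 17:02",
    "2006-03-03 12:31",
]

if __name__ == "__main__":
    # Print a crude text histogram, one row per hour that saw a reboot.
    for hour, count in reboots_by_hour(sample).items():
        print(f"{hour:02d}:00  {'#' * count}")
```

If the counts cluster around logon and logoff times, that points more toward session-related software than toward power or cooling, though a handful of data points proves nothing on its own.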

I'm really all out of ideas. The servers are connected to UPSes, and on a separate circuit from everything else. Some also have redundant PSUs, so I don't think power is the cause. Any suggestions you have would be MUCH appreciated.

Thanks in advance.

13 REPLIES
Prashant (I am Back)
Honored Contributor

Re: ML370 W2K3 Random Reboots

Hi,

In cases like this, where you are not getting a dump, there are two things we can do:
1) Check the IML logs for any errors.
2) Disable ASR on the server. This is a feature of ProLiant servers: if the server stops responding, ASR is triggered, which leads to a reboot, so in that case no dump is saved.

For RCA we will need to think more about this kind of thing.

Regards,
Prashant S.
Nothing is impossible
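Related to the "no dump" point above, it may also be worth confirming that Windows itself is set up to write a dump and log the bug check. The sketch below covers only the Windows side (ASR is configured separately on the ProLiant). The value names `CrashDumpEnabled`, `AutoReboot`, and `LogEvent` under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl` are the documented Windows ones. The function names, and the defaults assumed when a value is missing, are illustrative assumptions.

```python
# Meanings of CrashDumpEnabled as documented by Microsoft.
DUMP_TYPES = {0: "none", 1: "complete", 2: "kernel", 3: "small (minidump)"}


def review_crash_control(values):
    """Return warnings for a dict of CrashControl registry values.

    Missing values fall back to assumed defaults chosen for this sketch.
    """
    warnings = []
    if values.get("CrashDumpEnabled", 0) == 0:
        warnings.append("CrashDumpEnabled=0: no memory dump will be written")
    if values.get("AutoReboot", 1) == 1:
        warnings.append("AutoReboot=1: the STOP screen will not stay visible")
    if values.get("LogEvent", 0) == 0:
        warnings.append("LogEvent=0: no event is logged after a bug check")
    return warnings


def read_crash_control():
    """Read the three values from the local registry (Windows only)."""
    import winreg  # standard library, available only on Windows

    path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    found = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        for name in ("CrashDumpEnabled", "AutoReboot", "LogEvent"):
            try:
                found[name], _type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not present; review_crash_control uses its default
    return found
```

On one of the terminal servers, `review_crash_control(read_crash_control())` would list anything that could hide a STOP error. An empty list means a bug check should at least leave a dump and an event behind.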
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Prashant,
I appreciate your advice. The IML log shows nothing at all. I did disable ASR on one of the servers, so now I'll wait and see what happens. Thanks.
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Just a quick follow-up:

Even with ASR disabled, the systems still reboot. The only change we've made recently is printer drivers, so I'm going to try to roll those back.
Andy_180
Trusted Contributor

Re: ML370 W2K3 Random Reboots

By chance, it does not get hot in the server room, does it? Is it an environmental issue? Humidity, power issues (are they on a reliable UPS)?
thanks.
--Andy
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Andy,
We keep the room at about 70 degrees, and the machines don't report overheating issues. Humidity I don't think is a factor. We've also tested the UPSes that they are connected to, and they seem to be fine. I'm fairly confident that it's not a hardware issue, although that was my initial guess, too.
Mike Strako
Trusted Contributor

Re: ML370 W2K3 Random Reboots

Just a quick question: have you performed a thorough virus and spyware check?
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Mike,
Yes, all the machines have Symantec Enterprise resident on them at all times. I have performed a full scan, and run three different malware removal utilities. The terminal servers are completely locked down (no write access to most of the hard drive), so I don't think it's that, either.

Thanks, and thanks to everyone who has replied.
Andy_180
Trusted Contributor

Re: ML370 W2K3 Random Reboots

We had a similar issue recently where the server would only respond to ping requests. The server would lose its shares and we were unable to remote into it until it was cold booted through the iLO. It happened at any time throughout the day, on several servers, and was not model specific. It turned out to be a corrupt file in the Backup Exec agents. We had to open a ticket with MS to get a utility to make the server blue screen the next time it happened. Then they analyzed the dump file and said we should update a sys file and a dll file. It has been 2 weeks, so it looks like that was it. Very similar to your circumstance: nothing in Event Viewer or the IML, just a lost share, and the server went dumb. Nothing on the screen. Please feel free to assign points...

thanks.
--Andy
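For anyone who needs to force a blue screen on a hung server, Windows has two documented registry switches for it: `CrashOnCtrlScroll` (hold right CTRL and press SCROLL LOCK twice on a PS/2 keyboard) and `NMICrashDump` (crash on a non-maskable interrupt). This is not necessarily the utility Microsoft supplied in the case above. It is only a sketch that renders a .reg file for those two values. The helper name `render_reg` is made up for illustration, and a restart is needed before the settings take effect.

```python
def render_reg(settings):
    """Render {key_path: {value_name: dword}} as .reg file text."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key_path, values in settings.items():
        lines.append(f"[{key_path}]")
        for name, dword in values.items():
            # REG_DWORD values are written as eight hex digits.
            lines.append(f'"{name}"=dword:{dword:08x}')
        lines.append("")
    return "\r\n".join(lines)


# Documented Windows settings for a manually triggered crash dump.
FORCED_CRASH = {
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters": {
        "CrashOnCtrlScroll": 1,
    },
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl": {
        "NMICrashDump": 1,
    },
}

if __name__ == "__main__":
    print(render_reg(FORCED_CRASH))
```

The output can be saved as a .reg file and reviewed before importing. Whether the keyboard or the NMI route is usable depends on the hardware and on how the console is reached, so test it on a non-production box first.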
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Oh, I'm sorry, man. I don't know exactly how the points system works, so I was holding off on them. I just read the forum overview about them, though.

As for your most recent suggestion, we don't have BackupExec on the Terminal Servers. There's no critical data on them, so there's no need. But thanks.
Michel Poirier_1
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Hi, a few weeks ago we had the same problem on a few of our servers: ML370 G3s (3 out of 30).
All the same hardware.
W2K & W2K3 & Nortel Symposium R4.2 & R5.
It stopped as suddenly as it started, after 3 days.
No logs, no IML entries.

Our only common point was the McAfee EPO server was down on the first day.
We are still looking for an answer.
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Quick follow-up:

I opened a ticket with Microsoft regarding the issue, and I'm still waiting to hear back from the guy.

In the meanwhile, one of the terminal servers finally blue-screened (never thought I'd be happy about that), giving me an idea of what the issue is. The STOP code was "SESSION_HAS_VALID_POOL_ON_EXIT", and some research shows that it's related to Terminal Services on W2K3 SP1. Microsoft has an unreleased hotfix for it, so I'm going to ask my MS tech for it. I'll let you guys know, should the issue ever arise for you.

Now, I don't know if it's the same problem I've been having, or if the blue screen was an unrelated, random occurrence. But at least I have a starting point now.

Michel,
I appreciate the reply, but we don't have McAfee anywhere on our network.
Andy_180
Trusted Contributor

Re: ML370 W2K3 Random Reboots

Sounds like you are headed in the right direction. At least it blue screens by itself. We had to get a utility from MS to force a hex dump the next time it locked up, and dial out via a null modem. I tend to believe that if it was bad hardware, the CPQISSE would report it to the event log. I will still be curious to know what fixes your issue. Please post the fix when you find it. Our issue was that the Veritas replication agent needed SP1 for Veritas installed. That's what MS said, anyway, and it seems to have fixed it. Thanks!
--Andy
Mike Strako
Trusted Contributor

Re: ML370 W2K3 Random Reboots