ProLiant Servers (ML,DL,SL)

Aaron Firouz
Occasional Advisor

ML370 W2K3 Random Reboots

Hey all.

I've been having the strangest issue for the past few weeks, and was hoping someone could provide some insight on the issue.

Our environment is about 200 users on 6 terminal servers (standard MS-RDP, not Citrix). The servers are a mix of ML370 G3s, G4s, and ML350 G4s. Our DCs and file servers are also ML350s.

Due to various issues, we've rebuilt all of the servers over the last three months. Everything was running fine for about a month, when one of the ML370 G3s started randomly rebooting, sometimes twice a day. We took that server down, and then it began happening to the other servers (all models) as well. Now it's like a game of whack-a-mole: any of the servers may go down at any time during operational hours.

When I say "randomly reboot", I mean that the system just goes straight down, and boots back up; no STOP error, no power down cycle. The system's event log has no entries for the few minutes preceding the reboot, and none of the Insight Agents report anything.

All the servers have the latest drivers and the latest firmware. The reboots tend to happen mostly during standard peak hours (beginning of day, lunchtime, and end of day), but I've run performance monitoring and seen nothing that should crash a server. I'm now going through and trying to diagnose the various software installed on the machines, but I figure if software were the cause, Windows would log something or throw a STOP error.
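One way to check whether the reboots really cluster in peak hours is to tally them by hour of day. The following is a minimal Python sketch, not anything from the servers themselves: the timestamps are invented for illustration, and the peak windows (`PEAK_WINDOWS`) are an assumption about what "beginning of day, lunchtime, and end of day" means.

```python
from collections import Counter
from datetime import datetime

# Hypothetical reboot times, made up for illustration only.
# Replace them with the times noted for each unexpected restart.
REBOOT_TIMES = [
    "2000-01-03 08:42",
    "2000-01-03 12:15",
    "2000-01-04 08:05",
    "2000-01-04 16:50",
    "2000-01-05 03:20",
]

# Assumed peak windows: (start hour inclusive, end hour exclusive).
PEAK_WINDOWS = [(8, 10), (12, 14), (16, 18)]


def _hours(stamps):
    """Parse 'YYYY-MM-DD HH:MM' strings and return the hour of each."""
    return [datetime.strptime(s, "%Y-%m-%d %H:%M").hour for s in stamps]


def hour_histogram(stamps):
    """Count reboots per hour of day."""
    return Counter(_hours(stamps))


def peak_share(stamps, windows=PEAK_WINDOWS):
    """Fraction of reboots that fall inside any of the peak windows."""
    hours = _hours(stamps)
    if not hours:
        return 0.0
    in_peak = sum(any(lo <= h < hi for lo, hi in windows) for h in hours)
    return in_peak / len(hours)


if __name__ == "__main__":
    for hour, count in sorted(hour_histogram(REBOOT_TIMES).items()):
        print(f"{hour:02d}:00  {'#' * count}")
    print(f"share in peak windows: {peak_share(REBOOT_TIMES):.0%}")
```

A high share inside the peak windows would point toward load or user activity; an even spread across the day would point more toward power or hardware.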

I'm really all out of ideas. The servers are connected to UPSes, and on a separate circuit from everything else. Some also have redundant PSUs, so I don't think power is the problem. Any suggestions you have would be MUCH appreciated.

Thanks in advance.

13 REPLIES
Prashant (I am Back)
Honored Contributor

Re: ML370 W2K3 Random Reboots

Hi,

In such cases, since you are not getting a dump, there are two things we can do:
1) Check the IML logs for any errors.
2) Disable ASR on the server. ASR is a ProLiant feature: if the server stops responding, ASR is triggered and reboots the server, so in that case no dump is saved.

For root cause analysis (RCA) we need to look further into things like this.
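
Besides ASR, Windows' own crash settings also decide whether a STOP screen stays visible and whether a dump file is written. They live under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`. The following is a minimal Python sketch for reviewing them, assuming a Python interpreter is available on the server (the same values can be read in regedit otherwise). The helper names `read_crash_control` and `describe_crash_control` are made up for this example.

```python
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Meaning of the CrashDumpEnabled value on Windows Server 2003.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def read_crash_control(names=("AutoReboot", "CrashDumpEnabled", "LogEvent")):
    """Read the named values from the registry (Windows only)."""
    import winreg  # standard library, available only on Windows

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in names:
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not present; leave it out of the result
    return values


def describe_crash_control(values):
    """Return notes on settings that can hide a crash from the admin."""
    notes = []
    if values.get("AutoReboot") == 1:
        notes.append("AutoReboot=1: the system restarts right after a STOP "
                     "error, so the blue screen may never be seen")
    if values.get("CrashDumpEnabled") == 0:
        notes.append("CrashDumpEnabled=0 (%s): no dump file is written"
                     % DUMP_TYPES[0])
    if values.get("LogEvent") == 0:
        notes.append("LogEvent=0: the crash is not written to the system log")
    return notes
```

On a server, `describe_crash_control(read_crash_control())` returns an empty list when none of these settings would hide a crash. That would not explain a reset that bypasses Windows entirely, but it rules out the case where STOP errors are occurring and simply going unseen.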

Regards,
Prashant S.
Nothing is impossible
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Prashant,
I appreciate your advice. The IML log shows nothing at all. I did disable ASR on one of the servers, so now I'll wait and see what happens. Thanks.
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Just a quick follow-up:

Even with ASR disabled, the systems still reboot. The only change we've made recently is printer drivers, so I'm going to try to roll those back.
Andy_180
Trusted Contributor

Re: ML370 W2K3 Random Reboots

By chance, it doesn't get hot in the server room, does it? Could it be an environmental issue: humidity, or power (are the servers on a reliable UPS)?
thanks.
--Andy
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Andy,
We keep the room at about 70 degrees, and the machines don't report overheating issues. Humidity I don't think is a factor. We've also tested the UPSes that they are connected to, and they seem to be fine. I'm fairly confident that it's not a hardware issue, although that was my initial guess, too.
Mike Strako
Trusted Contributor

Re: ML370 W2K3 Random Reboots

Just a quick question: have you performed a thorough virus and spyware check?
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Mike,
Yes, all the machines have Symantec Enterprise resident on them at all times. I have performed a full scan, and run three different malware removal utilities. The terminal servers are completely locked down (no write access to most of the hard drive), so I don't think it's that, either.

Thanks, and thanks to everyone who has replied.
Andy_180
Trusted Contributor

Re: ML370 W2K3 Random Reboots

We had a similar issue recently where the server would only respond to ping requests. It would lose its shares, and we were unable to remote into it until it was cold booted through the iLO. This happened at any time throughout the day, on several servers, and was not model specific. It turned out to be a corrupt file in the Backup Exec agents. We had to open a ticket with MS to get a utility that would make the server blue screen the next time it happened. They then analyzed the dump file and said we should update a .sys file and a .dll file. It has been 2 weeks, so it looks like that was it. It was very similar to your circumstances: nothing in Event Viewer or the IML, just a lost share, and the server went dumb. Nothing on the screen. Please feel free to assign points...

thanks.
--Andy
Aaron Firouz
Occasional Advisor

Re: ML370 W2K3 Random Reboots

Oh, I'm sorry, man. I don't know exactly how the points system works, so I was holding off on them. I just read the forum overview about them, though.

As for your most recent suggestion, we don't have BackupExec on the Terminal Servers. There's no critical data on them, so there's no need. But thanks.