
Unexpected reboot

 
Ayman Altounji
Valued Contributor


We have a pair of ProLiant 5500s with 1 GB of RAM and RAID 5 on a Smart Array 3200 controller. For about a year now, one of the servers has been rebooting unexpectedly. It runs NT 4.0 and SQL 6.5 and nothing else. I was working on it once and saw it go down: there is no blue screen error, the screen just goes black and then I see the BIOS boot and so on. It is on the same UPS as the other server, and only this one goes down.

I changed the power supply long ago. I swapped the memory with the other server, to no avail. The only error in the NT log is that the shutdown was unexpected and that the Compaq array had valid data and restored it (thank God for that controller). You get the same result if you pull the plug. I ran the diagnostics and the only error I saw was array parity error 4, so I changed the controller. With the new controller it now gives me array parity error 1 (weird).

This machine crashed only once every 2 months or so before we put it into production, but now it goes down every 5-10 days. So far it has always come back OK, but this cannot go on. I have contacted the software manufacturer, and they assure me the software is installed elsewhere and is stable. The machine is hard to work on because it cannot be down for more than 30 minutes (24/7/365), so I must have a plan before I touch it. The program is critical, and I fear the only answer is to move it to another server. Any ideas would be appreciated.
3 REPLIES
Ayman Altounji
Valued Contributor

Re: Unexpected reboot

This sounds to me like you may have a system board problem. Replacing it would be the first thing to try. Good luck, and I know how you feel.

Later
Mike
Ayman Altounji
Valued Contributor

Re: Unexpected reboot

If the problem is still happening, email support@compaq.com. Are there any common software or operational factors around the time of the reboots? Do they happen at various times of day? What software processes are running? Based on the info here, it could be software, memory, or the system/processor board. Extended memory diagnostics can give you confidence in the memory, but they are time-consuming.
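
One way to check for a time-of-day pattern is to copy the unexpected-shutdown timestamps out of the NT event log and tally them. The Python below is only a rough sketch: the function name `summarize_reboots`, the timestamp format, and the sample dates are all made up for illustration and are not from this server.

```python
from collections import Counter
from datetime import datetime


def summarize_reboots(timestamps):
    """Return (reboots per hour of day, gaps in days between consecutive reboots).

    timestamps: strings in "YYYY-MM-DD HH:MM" form (an assumed format;
    adjust the strptime pattern to match however you copy them down).
    """
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps)
    by_hour = Counter(t.hour for t in times)
    gaps_days = [
        round((later - earlier).total_seconds() / 86400, 1)
        for earlier, later in zip(times, times[1:])
    ]
    return by_hour, gaps_days


# Placeholder timestamps for illustration only, not taken from a real log.
example = ["2001-03-02 02:10", "2001-03-09 02:25", "2001-03-17 14:40"]
by_hour, gaps = summarize_reboots(example)
print("Reboots per hour of day:", sorted(by_hour.items()))
print("Days between reboots:", gaps)
```

If most reboots land in the same hour, look at what is scheduled then (backups, SQL maintenance jobs, batch loads). If they are spread evenly and the gaps keep shrinking, hardware or a slow resource leak becomes more likely.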
Ayman Altounji
Valued Contributor

Re: Unexpected reboot

We had a similar problem on a clustered server, where one node kept crashing every week or two with no bugcheck. The frequency seemed to go up over time, as did our usage. I also replaced the memory first, which did not fix it either. I then replaced the system board, along with the CD drive AND cable. The problem went away after this, but I was never 100% sure which of the 3 components was bad. I replaced the CD and cable because we had two CD drives go bad in the same timeframe as our problems. This was the primary node for a 24/7 Active-Active cluster, so I was more interested in fixing it than in finding the exact component that was bad. Compaq is good about giving parts, so I would get them to send a new system board on warranty to see if that fixes the issue. If not, look at cables, etc. that may have been pinched. You do have to fault yourself partly for allowing a server that crashes to go into production. Good quality control would have kept this from happening.

Other notes:

-A memory leak in the software can cause this issue. We found a leak in SQL 7 a couple of years ago which did just this, but then again, running over 10K databases was a little dicey at best. You can detect memory leaks by using perfmon and other tools to look at memory usage, working sets, etc. All companies that have these leaks seem to deny their existence until blatantly obvious proof is thrown in their face.
-Replacing the system board should only take about 10 minutes if you time it right. I would drop the server at 2 AM if needed, and replace the board after doing a dry run on another identical server if possible. Other options are to read through the server hardware manual first, or to schedule the Compaq rep to replace it for you. After a system board install, you have to let the Compaq utilities do hardware discovery again. This should not cause any issues if all the parts are placed back in their original slots. Compaq has done a good job on their design to allow for quick replacement of system boards, etc.
-If this is a real 24/7/365 server, then it should be clustered, mirrored, or hooked up to a load balancer. If a company expects 24/7 from IT, then it needs to invest in the equipment that makes this realistic. One server is not a 24/7/365 solution, regardless of OS, manufacturer, etc. Hardware fails, IT people make mistakes, all software has bugs. Real 24/7 always means redundancy.
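
On the memory leak note above: if you log a process's working set with perfmon at regular intervals, a steady upward slope between reboots is the kind of proof a vendor has trouble arguing with. Here is a minimal Python sketch of that check, assuming you have copied the readings out by hand as (hours elapsed, bytes) pairs. The function name `growth_per_hour` and the sample readings are invented for illustration.

```python
def growth_per_hour(samples):
    """Least-squares slope of working-set size versus time, in bytes per hour.

    samples: list of (hours_elapsed, working_set_bytes) pairs, for example
    readings copied from a perfmon log of a process's working set.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_w = sum(w for _, w in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (w - mean_w) for t, w in samples)
    return cov / var_t


# Made-up readings for illustration only, not measurements from any server.
readings = [(0, 100_000_000), (1, 104_000_000), (2, 108_000_000), (3, 112_000_000)]
slope = growth_per_hour(readings)
print(f"Working set grows about {slope / 1_000_000:.1f} MB per hour")
```

A slope near zero over several days of normal load argues against a leak. A steady positive slope that resets only at reboot is worth taking to the software vendor.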