ProLiant Servers (ML,DL,SL)

ML370 G4 blue screens with 0x0000009c

 
Neal Howard
Advisor

I've got an almost brand-new ML370 G4 with dual 3.2GHz CPUs, 2GB RAM, a SmartArray 6402/128 with 6 x 73GB 15K rpm drives in RAID 5, and redundant power supplies and fans, running Windows Server 2003 Standard. Reportedly, whenever someone plugs in any USB device other than a USB 1.1 pen drive (e.g. a USB modem or USB serial port converter), the machine blue screens.

I don't have the STOP message for that crash handy, but I do have one for the most recent crash, which happened possibly when a software vendor exited his pcAnywhere remote session. It is a STOP 0x0000009c (BugCheck 9c), which is a MACHINE_CHECK_EXCEPTION, generally a hardware fault. All the diagnostics I have run say nothing is wrong with the hardware, and MSDebug, when run on the crashdump file, curiously says that the probable cause is "rdbss.sys", which is a software driver in the middle of the operating system (the redirector driver from the CIFS/SMB filesharing subsystem) and not a hardware driver. This machine has also reportedly had spontaneous bluescreens not attributable to any immediate user action.

I'm getting handed this problem after several other folks have attempted to deal with it, so I don't have a complete history of the bluescreens and memory dumps to work with right now, but I will be collecting all the crashdumps from this point onwards. The machine is running an MS SQL database and several other apps as part of a voice logging system for a 911 dispatch center, so I am unable to take it up and down at will for diagnostics and troubleshooting. When the machine is running it has to be left alone to stay running, and right now it seems to be running alright, so I'm in the predicament of being ordered to find and fix the problem without really being able to touch the machine very much.
We've got 24x7x365x4hr support on this box, but that seems to be worthless based on the last hardware failure we recently had on another ML370G4 box (another gripe story for another time and place).

So, am I probably looking at a genuine hardware problem? One that doesn't readily manifest itself in the diags? Or am I looking at a firmware, device driver or a Windows 2003 O/S bug that is only manifesting itself under the particular set of operating circumstances under which the box is running?

Anybody have any insight into this peculiar bluescreen error code (typically given by bad hardware) coming instead from a software driver within the O/S? And also the apparent unusability of the USB ports on the machine? Have I just got a lemon server? Or are there design faults with the ML370 G4? I never had weird problems with any Proliant servers in past years, but I recently bought 11 ML370G4 boxes and have already suffered 4 hardware failures on them in the past 8 months, and now this weird unstable server is making me wish I had bought Dell PE2800s instead. Help Please!!!
6 REPLIES
Neal Howard
Advisor

Re: ML370 G4 blue screens with 0x0000009c

Discovered with positive certainty that it's a hardware problem this morning.

When all 8 memory sockets are populated (8 x 512MB for a total of 4GB), the machine will not boot up: the memory diags fail and claim that all 8 DIMMs are bad. The very same set of DIMMs works perfectly in a different ML370G4 and passes the diags with flying colors there.

When fewer than 8 DIMMs are installed in this machine, the memory diags might or might not report an error... the test results are totally inconsistent from one bootup to the next. When it does pass the test and boots up, all bets are off as to how long it will stay up before it BSODs.

This makes the 5th confirmed hardware failure in 8 months since we bought these 11 ML370G4's, though two of the failures were hard drives only. Still, that's an unacceptably poor track record compared with what I used to get from previous Proliants of various models. I've been buying and running Proliant servers (dozens of them) over the course of nearly a decade, and until now considered them to be bulletproof juggernauts, but this last batch of them has certainly changed my opinion of that.
Oleg Koroz
Honored Contributor

Re: ML370 G4 blue screens with 0x0000009c

You won't see 0x9C often.
I assume your system ROM is 2005.04.15, and you tried to disable Hyper-Tre ..
Neal Howard
Advisor

Re: ML370 G4 blue screens with 0x0000009c

All 11 of these ML370G4 boxes are kept current with the latest firmware for everything in them. Toggling hyperthreading was one of the very first things I tried. It's definitely a hardware problem, since the memory diagnostics test passes about half the time you run it and fails the other half.
Oleg Koroz
Honored Contributor

Re: ML370 G4 blue screens with 0x0000009c

If there is room for troubleshooting time, run a full memory test with 2-3 loops using the minimum amount of memory, then swap in another pair or set and run it again to see if errors are found this way. Swap the memory into another server and see if it works there. Also, since you have 11 servers, you can swap the HDD set and see if the problem follows the server.
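The swap test described above amounts to asking whether failures follow the chassis or follow the DIMM set. A minimal sketch of that bookkeeping follows, assuming each memory-test run is written down by hand as a (chassis, DIMM set, passed) record. The function name `fault_follows` and the labels "A", "B", "set1", "set2" are hypothetical and only for illustration; this is not an HP diagnostic tool.

```python
def fault_follows(runs):
    """Classify swap-test results.

    runs: list of (chassis, dimm_set, passed) tuples, one per memory-test run.
    Returns "chassis", "dimms", "no fault seen", or "inconclusive".
    """
    failures = [(c, d) for c, d, ok in runs if not ok]
    passes = [(c, d) for c, d, ok in runs if ok]
    if not failures:
        return "no fault seen"

    bad_chassis = {c for c, _ in failures}
    bad_dimms = {d for _, d in failures}

    # DIMM sets that passed in a chassis that never produced a failure.
    dimms_ok_elsewhere = {d for c, d in passes if c not in bad_chassis}
    # Chassis that passed with a DIMM set that never produced a failure.
    chassis_ok_elsewhere = {c for c, d in passes if d not in bad_dimms}

    # Every failure is on one chassis, and the same DIMMs pass elsewhere.
    if len(bad_chassis) == 1 and bad_dimms <= dimms_ok_elsewhere:
        return "chassis"
    # Every failure involves one DIMM set, and those chassis pass with other DIMMs.
    if len(bad_dimms) == 1 and bad_chassis <= chassis_ok_elsewhere:
        return "dimms"
    return "inconclusive"


# Hypothetical log: chassis "A" fails intermittently with either DIMM set,
# while chassis "B" passes with both.
runs = [
    ("A", "set1", False),
    ("A", "set1", True),
    ("A", "set2", False),
    ("B", "set1", True),
    ("B", "set2", True),
]
print(fault_follows(runs))
```

Because intermittent faults can pass on any single run, a "no fault seen" or "inconclusive" result only means more loops are needed, not that the hardware is good.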
Neal Howard
Advisor

Re: ML370 G4 blue screens with 0x0000009c

Already tried swapping out the hard drives and RAID controller with one of the other boxes, and the problem stayed with the bad motherboard chassis. That was an obvious, fundamental troubleshooting step done early on in this ordeal.

The other 10 machines are in live production use and I cannot really take them down anymore for further troubleshooting.
Neal Howard
Advisor

Re: ML370 G4 blue screens with 0x0000009c

I got the new motherboard installed this afternoon, and it powered up and tested all the memory perfectly. It came with an old BIOS, so I had to reflash it with the newest BIOS. Everything seems to be back to normal, except that this new motherboard also bluescreens Windows with a STOP 0x0000007E error immediately upon plugging in a USB modem. This same modem works fine on one of my other ML370G4 boxes. I was afraid to test it on any more of the ML370G4s out of fear it would crash them as well, and since I have users connected and using those other machines, I dared not risk it.
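For anyone collecting the crashdumps from a box like this, it can help to log each STOP code next to its symbolic name. The sketch below covers only the two codes mentioned in this thread; the table and the helper name `describe_stop` are illustrative assumptions, not part of any Microsoft or HP tool.

```python
# Symbolic names for the two bug check (STOP) codes seen in this thread.
BUGCHECK_NAMES = {
    0x0000009C: "MACHINE_CHECK_EXCEPTION",
    0x0000007E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
}


def describe_stop(code_text):
    """Turn a STOP code string such as '0x0000009c' into a labelled line."""
    code = int(code_text, 16)  # base 16 accepts an optional '0x' prefix
    name = BUGCHECK_NAMES.get(code, "UNKNOWN (not in this small table)")
    return f"STOP 0x{code:08X}: {name}"


for seen in ("0x0000009c", "0x0000007E"):
    print(describe_stop(seen))
```

The 0x9C code points at a processor-reported hardware error, while 0x7E is an unhandled exception in a system thread, which more often points at a driver, so the USB crash on the new board may be a separate issue from the memory fault.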