
Re: ML350 G5 Disk Failure

Well, I spent the weekend dinking around with this and here's what I've found. If I use drivers 6.6.2.32 or 6.8.0.32, I get the intermittent 1792's. If I downgrade to 6.6.0.32, I don't. I rebooted 26 times with that driver installed and every boot was clean. Using the other two, I have about a 50% chance of getting the error on any given reboot. I have everything patched and up-to-date with the latest SmartStart and 932755.
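For anyone wondering whether 26 clean reboots could just be luck, here is a rough back-of-the-envelope check. It is only a sketch: it assumes each reboot is an independent trial and takes the roughly 50% error rate seen with the newer drivers at face value. The function name is made up for illustration.

```python
def prob_all_clean(failure_rate: float, reboots: int) -> float:
    """Chance of seeing no error in `reboots` consecutive boots,
    ASSUMING each boot fails independently with probability `failure_rate`."""
    return (1.0 - failure_rate) ** reboots


# Figures from the post: ~50% error rate with the newer drivers, 26 reboots.
p = prob_all_clean(0.5, 26)
print(f"Chance of 26 clean boots at a 50% error rate: {p:.2e}")
```

Under those assumptions the chance works out to about 1 in 67 million, so the clean run with 6.6.0.32 is very unlikely to be coincidence. It says nothing about why the older driver behaves differently.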

HP sent me a new cache board based on output from the ADU, which I knew would not work. It didn't. I'm going to call them back and see what they have to say, which I'm sure will be helpful! ;)

Now I am faced with either leaving the old driver on and dealing with hangs or stops on shutdown, or putting the latest drivers on and waiting for corruption. Great choice.

Scott
Ab Kole
New Member

Re: ML350 G5 Disk Failure

Hi All,

We now have the same problem with 5 ML350 G5 servers. All have SBS 2003 R2 installed and freeze now and again. We have a case open with MS and one with HP. We found out that even with the newest SmartStart 7.91 we get bus errors on our hard drives, even when there is no MS OS on the server! HP now says (after 2 weeks of testing) that this is a known bug in firmware 1.66 of the E200i RAID controller and that it will be solved in the next firmware update....

We suggested upgrading the E200i to a P400 controller to help our customers, but no can do. We have to test and test and test to solve HP's problem. We have spent over 2 weeks of hours on this problem, and 4 servers are sitting in our office waiting to be completed. Installation is postponed... Clients are not very happy.

We see the problem with servers that have a lot of disk activity. I will keep you informed if there is a solution.

fricci
Advisor

Re: ML350 G5 Disk Failure

Scott,
as you can read in my post dated September 9, 2007, you had almost the same experience I had, but it is quite funny to discover that the driver that seems to be stable is 6.6.0.32 (I got these results using release 6.6.2.32).

Anyway, I can confirm that AFTER DISABLING THE ACCELERATOR (I suppose this means using a write-through algorithm in "HP language"), ALL HAS BEEN WORKING without problems since the end of August (but I didn't apply any new patch or driver, waiting for some "official" solution).
I am very interested in knowing if this workaround solves your issues.

After reading all these (dramatic) posts, this is my definite opinion: the E200i controller (hardware+firmware+driver) is BAD, so the best thing you can do is: DON'T BUY IT.

I am very interested in knowing whether using a P400 controller is the ultimate solution, as expected.

If anyone has done any testing with the E200i controller replaced, please let us know.

Franco

Re: ML350 G5 Disk Failure

Franco,

I concur with your findings. Disabling the write cache does indeed stop the 1792's from occurring.

I had a follow-on issue and unfortunately cannot say what caused it, but I am suspicious of the old driver that I was using. I found the machine crashed one night, with no indication of the cause evident in any of the logs. It had corrupted data on the partitions.


I decided to reload everything from scratch and retrace my steps. What I found makes me somewhat suspicious of SBS 2003 vs. the HP controller, but perhaps it is the combination. I found that the 1792's happen after every reboot once the first phase of the SBS install is performed (domain controller, etc.). When the remainder is installed (R2, patches, etc.) the problem becomes intermittent again, which makes me wonder if SBS is not shutting down properly and you only see it with a controller that has a battery-backed cache. At this point, I gave up and purchased a P400 controller w/o bbc (for a lot of reasons) just to get going.

I agree that the e200i hardware-firmware-driver combination is weak. I would add that HP's support is also Very BAD. I sent an engineer the logs he requested approximately three weeks ago and have heard nothing back (and I paid for 7x24-4hr). I am going to call them and complain today.

Scott
Antony Ryan
Occasional Advisor

Re: ML350 G5 Disk Failure

We have experienced the same problem with 2 new servers - ML350 G5 Quad Core with e200i. Both servers completely freeze when under heavy disk i/o.

HP have replaced the mainboard, and the battery backed up cache. One thing we did notice is that if we run a disk i/o stress test, it works on the RAID 1 config, but not on the RAID 5 config (system locks up after 10 seconds). We did this test with the HP tech standing next to us, so he could see the results himself. He is going to source another RAID controller so we are not using the e200i - will let you know once this has been done.
fricci
Advisor

Re: ML350 G5 Disk Failure

Scott,
I get the same warning during POST (1792) on a Windows Server 2003 R2 installation, so I don't think the problem comes from SBS itself, but from all Windows installations. I don't know if the problem exists in Linux too, but anyway HP servers are certified to work with Windows OS.
Yesterday I had a long talk with the reseller's technical support (HP Certified Partner) and we decided to try replacing the internal E200i with a P400 controller, probably with BBC. I hope this will be the ultimate solution.
As I already declared in a previous post, HP technical support is worse: they actively create damage.

Maybe next week they will call you asking if you fixed your issue, like they did with me..... :-(

Franco

Re: ML350 G5 Disk Failure

Franco,

Please let me know what you find if indeed you get to test a P400 w/bbc. I'd be interested in the results.

As to SBS vs. generic Windows being an issue, I did a basic load of Windows 2003 server (no domain, DNS, etc.) as a sort of control for the test. No 1792's. My approach wasn't all that scientific but it makes me wonder about SBS' role in this.

Thanks.

Scott
Henry Boehlert
New Member

Re: ML350 G5 Disk Failure

We have 412645-B21 (ProLiant ML350 G5), 436013-L21 (E5345, Intel Quad-Core Xeon), 351580-B21 (E200 128MB BBWC) and 395473-B21 (500GB 7.2k HP SATA).

We encountered the lockups on heavy disk i/o, too, even after applying all updates from SmartStart, SupportPack and Firmware Maintenance CD.

Also, on one of our servers the RAID array would vanish after a single disk failure and had to be rebuilt from backup.

HP support first assumed the WD SATA drives we're using were not supported by HP, but then had to realize that those are actually what they're shipping.

After exchanging reports from various analysis tools, HP confirmed the bus errors and now blames an inconsistency between the E200 BBWC and SATA drives regarding Native Command Queuing.

Now we're scheduled to get the E200 replaced with something else (most probably an E400) and the SATA drives replaced with SAS drives.

Interesting to learn that it's actually a firmware issue (i.e. easy to fix), looking at the cost this is incurring on us as well as on HP.
Antony Ryan
Occasional Advisor

Re: ML350 G5 Disk Failure

We had the HP tech on-site again today to install a P400 controller - but... he couldn't get it to work!!

The server would start to boot, and then just fail (we had only installed the card and hadn't attached any drives to it as yet, as per HP support's instructions). We tried numerous things, all to no avail. HP are going to come back next week with another P400 and see if they can get this working.
fricci
Advisor

Re: ML350 G5 Disk Failure

Scott,
I will keep you informed about any developments, but I think it will take some time.... I hope to replace the controller before the end of January, but I am not sure about it.

Anyway, I always get the 1792 warning on a Windows Server 2003 *R2* SP1, a different release of Windows from SBS 2003 Standard R2, which runs Windows Server 2003 (R1) SP1 + SP2 update, but in this case the warning *seems* harmless (I have had no problems with the disks).

This server is a Domain Controller (AD+DNS+DHCP+WINS) with two SAS disks (RAID 1); the SBS server uses four SAS disks (RAID 1+0).

Franco