
ML350 G5 Disk Failure

 
fricci
Advisor

ML350 G5 Disk Failure

I had a lot of problems with a new ProLiant ML350 G5: dual Xeon, 4GB RAM, E200i + 128MB BBWC, 4 x 73GB 2.5" 10K SAS (RAID 1+0), Windows Small Business Server 2003 R2 + SP2 for Windows Server 2003.
Although the HP diagnostic software didn't find anything wrong, I got lots of corrupted data files and system warnings (EventID: 55 - NTFS Error) on some partitions.
This server exhibited strange behaviour from the beginning. After a reboot, during POST, I always got a warning from the controller: "1792-Drive Array Reports Valid Data Found in Array Accelerator".
I thought this message was quite anomalous, pointing to a missed flush of the cache at reboot for unknown reasons (buggy firmware, buggy driver, or both), but I didn't understand the real danger...
I installed and configured the system software without any apparent problems.
The real problems began after this server replaced the old one. After a few days, the application software (TeamSystem SysInt/W) stopped working and I discovered lots of errors on the partition where this application is installed.
I fixed the errors, but they reappeared almost every day...
TeamSystem tech support confirmed to me that there weren't any issues with that hardware/software platform, so I asked HP support for help and sent the requested log files.
I suppose the culprit could be buggy firmware or a buggy driver causing unrecoverable damage to the file system, but I have no "verified" response.
I updated all firmware to the latest available release and the RAID controller driver to ver. 6.6.2.32.
The warning after reboot suddenly disappeared and the server ran smoothly for about one month.
40 days after the first call, and after a new crash, I obtained a replacement for the cache module even though the hardware diagnostic software had logged no fault, but the exact cause of this issue remains unknown...
HP technical support was unable to identify the source of the problem, but they asked me to completely reformat and restore the system.
In August (when work is suspended) I completely destroyed the array, rebuilt all partitions, and restored everything!
After the cache module replacement, I also updated all firmware to the latest available release and the RAID controller driver to ver. 6.8.0.32.

Two days later, I noticed a warning in the System event log about an error during POST, related to the RAID controller.
So I launched the controller's diagnostic software and it reported "Data found in the cache - Cache module fault - Replace the cache module". Incredible, I thought: the replacement part was faulty!
I also noticed that the warning "1792-Drive Array Reports Valid Data Found in Array Accelerator" had reappeared after reboot...
I discovered that this warning appears only after a reboot (with this firmware/driver combination), but there are no warnings during POST after a shutdown! The diagnostic software doesn't find anything abnormal after a shutdown, but it always identifies the cache module as damaged after a reboot!
I did several reboots, always getting the POST warning message, and after one reboot I got a corrupted partition!
I am absolutely sure about what I did before the corruption occurred:
1) Reboot - Warning 1792 during reboot
2) I tested all partitions with Chkdsk. All were OK.
3) I zipped two previously saved log files in a folder. The partition contains data as well as the WSUS folder
4) I sent a message with the zipped files attached through Outlook Web Access
5) Reboot - Warning 1792 during reboot
6) I tested all partitions with Chkdsk. The partition containing the log files was corrupted!
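
For anyone who wants to repeat the before/after check from the steps above without relying on memory, here is a minimal Python sketch. It is only an illustration under stated assumptions: it assumes that chkdsk run without /F is read-only and exits with 0 when it finds no errors, and it treats any other exit code as "needs attention". The function names and the drive letters are hypothetical, not something from HP or from this thread.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def run_chkdsk(volume: str) -> int:
    """Run chkdsk on one volume (e.g. 'D:') without /F and return its exit code.

    Assumption: without /F, chkdsk only reports problems and does not repair.
    """
    return subprocess.run(["chkdsk", volume], capture_output=True).returncode


def check_volumes(volumes: Iterable[str],
                  runner: Callable[[str], int] = run_chkdsk) -> Dict[str, bool]:
    """Return {volume: True if the runner exited with 0, else False}."""
    return {vol: runner(vol) == 0 for vol in volumes}


def newly_corrupted(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Volumes that were clean in the 'before' snapshot but not in 'after'."""
    return sorted(v for v, clean in after.items() if before.get(v) and not clean)
```

Take one snapshot with check_volumes(["C:", "D:", "E:"]) after step 2 and another after step 6, then pass both to newly_corrupted. The runner parameter exists only so the logic can be tried with a fake function on a machine that has no chkdsk.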

The warnings at reboot are NOT HARMLESS!!!
Chkdsk was unable to fix the problem, not even during startup, so I had to boot from my RescueCD (PEBuilder based), reformat the partition, and restore it from a previous image.

I also downgraded the RAID controller's driver to ver. 6.6.2.32. The warning after reboot vanished...

I asked HP tech support again. The reply: this is a software issue, so it is my problem...
Maybe I am wrong, but I don't think that a server that repeatedly corrupts disks, and ***MAYBE*** works only with one exact driver release, is MY problem! I cannot write a new driver to safely use YOUR hardware, and without a driver I can't use YOUR hardware, so this is a hardware issue, as is every problem involving firmware and proprietary drivers!

And above all, please: NO ONE CAN KNOW the origin of the problem until it can be reproduced!!!
I am not exactly a newbie in hardware and system management, but this is the first time I have seen or heard of something like this!

Sorry about the long post (and my bad English), but I thought it was necessary.

So, the question now is this:
considering the absence of a meaningful response from HP support, has anyone had issues like the one described, and does anyone have any idea about it?
I cannot run any risky or destructive test....

I will be away on holiday for 10 days, so I don't think I will have the opportunity to reply, but all suggestions are welcome!

Best regards.

Franco
53 REPLIES
KarloChacon
Honored Contributor

Re: ML350 G5 Disk Failure

hi

I hope I got the main idea of your issue...
HDD issues

What did HP tell you about the Array Diagnostic Utility reports? something wrong on the controller? what controller do you have?

what about cabling?
what about the hard drive cage?
even power backplane?

what are the new errors?
Event Viewer, IML, ADU?

regards
Didn't your momma teach you to say thanks!
fricci
Advisor

Re: ML350 G5 Disk Failure

The RAID controller is the E200i (integrated) + 128MB BBWC module. As I described, there are NO errors reported by IML or ADU. The event viewer registers the warning "1792-Drive Array Reports Valid Data Found in Array Accelerator".
After the replacement of the cache module, ADU warns "Cache module fault - replace cache module" every time I launch it IF AND ONLY IF POST reports "1792-Drive Array Reports Valid Data Found in Array Accelerator". After downgrading the driver to 6.6.2.32 there are no more warnings at reboot and ADU doesn't report any errors. When booting from the SmartStart CD, the controller diagnostic doesn't report any errors.
Your questions are the same ones I posed to HP.
The problem could arise from any component of the disk subsystem (controller, cache, disks, cabling, backplane, power supply, firmware, driver).
The real problem is that I can't solve this on a trial-and-error basis because I don't have the necessary components and HP denied my request.
I am a consultant, I am not the hardware manufacturer, I can't (and the customer can't either) buy lots of HP original parts to test their hardware!!!
Sincerely, after this experience I don't think I will recommend HP servers anymore. Technical support is nonexistent.
I'm thinking that one way to solve this could be to suggest that the customer buy a new server to replace the faulty one.
Obviously, this could quickly lead to legal action. I hope this will not be necessary.
Regards.
KarloChacon
Honored Contributor

Re: ML350 G5 Disk Failure

hi

call HP again and explain the situation you've been having all these days, I think

just to be completely sure about the issue, you can ask them for a power backplane, drive cage and cables

try it and give feedback ok

regards
Didn't your momma teach you to say thanks!
fricci
Advisor

Re: ML350 G5 Disk Failure

I opened a new ticket last Friday. They (as usual) told me this is a software issue, so it is not their problem. After a furious conversation they decided to "escalate" the problem to a higher level (this is the second time), but surely they will not reply for at least 10 days. And I know their answer...
If their diagnostic software reports "all right", they will not take charge of the issue.
The only option will be to migrate the installation to a similar new server, demonstrating that there are no software issues, and to ask for a full refund.
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Franco,

Sorry to hear about your problems but glad to know I'm not the only one.

I've been working on the same hardware and having numerous anomalous problems that I just couldn't explain. This is a server migration project. Restoring original data from backup left a bunch of stuff corrupt. I've been through a few OS installs on this server too and they've all failed for some reason or another over time. Active Directory will work today but not tomorrow... that sort of thing.

Finally, today I rebooted the server, got the same error you did (1792) and will be following your path with support. If I can conjure up anything with them I will let you know.

Thanks,

Jim
fricci
Advisor

Re: ML350 G5 Disk Failure

Jim,
Some days ago I received a call from HP 2nd level support.
They sent me a new diagnostic utility (HPSRPT Advanced) to get some diagnostic logs out of the server... but the software apparently doesn't work. It simply starts but does nothing...
Every time I speak with an HP technician I have the clear impression of speaking to a newbie with a very, very, very weak technical background... so they are simply not able to collect and analyze the information I submitted to them.
I am beginning to think this might not (only) be a hardware issue, but could instead be caused by buggy firmware/drivers, so a simple hardware replacement will not solve it...
I'm going to move the "faulty" installation (working without problems in a VMware environment on my workstation) to a DL380 (for the moment only for a few days) and I'm sure it will work!
After disabling the accelerator and installing driver 6.6.0.32, everything seems to work.
The major problem is that the issues appear randomly; maybe it will work for a month and then we get new trouble.
I am very interested in any progress toward a solution, so let me know any news about it.
Regards.

Franco
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Sorry, I've been on the road for a few days and couldn't get back to this right away. I called support with the error reported above. Here are the steps that hopefully resolved it for me.

1. Turn off write cache on controller via ACU.

2. Install the latest driver for the controller (E200i, IIRC).

3. Install firmware version 1.66 on the controller.

4. Reboot.

After this I was finally able to successfully finish an SBS install (9th or 10th try) on this hardware.

That seems to have fixed it but this problem seemed to come up at random intervals so who knows if it will come back or not.

Good luck.

fricci
Advisor

Re: ML350 G5 Disk Failure

Your fix procedure is similar to the steps I walked through, and what you wrote makes me think again about the possibility of buggy firmware/drivers.
Although the firmware upgrade to version 1.66 was a clear step toward a solution, it didn't solve the problem completely.
The first thing I thought of doing for diagnostic purposes, from the beginning of this story (the end of May), was to disable what HP calls the "accelerator" (I suppose this means the write-back algorithm), but when I tried to do it via ACU, I received a very intimidating message about the possibility of losing the contents of the array. I repeatedly asked HP support, and they told me "Don't do it!".

After the last disaster recovery I decided to switch the accelerator off anyway and, as I supposed, nothing dramatic happened...
Anyway, I had to downgrade to driver ver. 6.6.0.32 because the latest version, 6.8.0.32, has the same problem when rebooting (the warning "1792-Drive Array Reports Valid Data Found in Array Accelerator") and it corrupted a partition again (with acceleration enabled).

Since the last fix on August 18 the server has apparently worked without problems, but considering the previous crash occurred about 1 month earlier, I don't think I can consider it fixed.

If this really fixed the issue it is ***absolutely unacceptable*** from a technical point of view to use a server that could destroy your data after a simple driver or firmware update. This is not a server, it is a tool for acrobats!
If disabling the write-back algorithm is the ultimate fix, I think that HP can take the server back and give us our not-bugged money....

So, I hope you fixed the issue, but I am not sure about it...

However, repeatedly rebooting and also shutting down (this often leads to different behaviour from the server) can give you an indication of whether the issue is present.
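
If you keep notes while doing this, a tiny tally makes the reboot/shutdown difference easier to see. The following Python sketch is only an illustration: the function name, the restart labels and the idea of recording one (kind, saw warning) pair per restart are my own assumptions, not an HP tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def warning_rate(observations: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of restarts of each kind that showed the POST 1792 warning.

    observations: (restart_kind, saw_1792) pairs noted by hand,
    e.g. ("reboot", True) or ("shutdown", False).
    """
    seen = defaultdict(int)
    warned = defaultdict(int)
    for kind, saw_1792 in observations:
        seen[kind] += 1
        if saw_1792:
            warned[kind] += 1
    return {kind: warned[kind] / seen[kind] for kind in seen}
```

A rate near 1.0 for "reboot" and 0.0 for "shutdown" would match what I described; a rate that drops to 0.0 after a driver change only says the warning is gone, not that the data is safe.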
I forgot to ask you if you also got EventID 55 - NTFS Error in System Logs.
Regards.

Franco
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

I don't have access to the server right now - it's at the customer's site. He just informed me that the server rebooted itself last night - a sure sign the problem is back.

Will update when I have more.