ProLiant Servers (ML,DL,SL)

ML350 G5 Disk Failure

fricci
Advisor

ML350 G5 Disk Failure

I had a lot of problems with a new ProLiant ML350 G5: dual Xeon, 4GB RAM, E200i + 128MB BBWC, 4 x 73GB 2.5" 10K SAS (RAID 1+0), Windows Small Business Server 2003 R2 + W2K3 SP2.
Although the HP diagnostic software didn't find anything wrong, I got lots of corrupted data files and system warnings (EventID: 55 - NTFS Error) on some partitions.
This server behaved strangely from the beginning. After a reboot, during POST, I always got a warning from the controller: "1792-Drive Array Reports Valid Data Found in Array Accelerator".
I thought this message was quite anomalous, indicating that the cache had not been flushed at reboot for unknown reasons (buggy firmware, buggy driver or both), but I didn't understand the real danger .....
I installed and configured the system software without any apparent problems.
The real problems began after this server replaced the old one. After a few days, the application software (TeamSystem SysInt/W) stopped working and I discovered lots of errors on the partition where this application is installed.
I fixed the errors, but they reappeared almost every day....
TeamSystem tech support confirmed to me that there were no known issues with that hardware/software platform, so I asked HP support for help and sent the requested log files.
I suppose the culprit could be buggy firmware or a buggy driver, causing unrecoverable damage to the file system, but I have no "verified" response.
I updated all firmware to the latest available release and the RAID controller driver to Ver 6.6.2.32.
The warning after reboot suddenly disappeared and the server ran smoothly for about one month.
40 days after the first call, and after a new crash, I obtained a replacement cache module even though the hardware diagnostic software had logged no faults, but the exact cause of this issue remains unknown...
HP technical support was unable to identify the source of the problem, but they asked me to completely reformat and restore the system.
In August (when business activity stops) I completely destroyed the array, rebuilt all partitions and restored everything!
After the cache module replacement, I also updated all firmware to the latest available release and the RAID controller driver to Ver 6.8.0.32.

Two days later, I noticed a warning in the System event log about an error during POST, related to the RAID controller.
So, I launched the controller's diagnostic software and it reported "Data found in the cache - Cache module fault - Replace the cache module". Incredible, I thought, the replacement part was faulty!
I also noted that the warning "1792-Drive Array Reports Valid Data Found in Array Accelerator" after reboot had reappeared....
I discovered that (with this firmware/driver combination) the warning appears only after a reboot; there are no warnings during POST after a shutdown! The diagnostic software doesn't find anything abnormal after a shutdown, but it always identifies the cache module as damaged after a reboot!
I did several reboots, always getting the POST warning message, and after one reboot a partition was corrupted!
I am absolutely sure about what I did before the corruption occurred:
1) Reboot - Warning 1792 during reboot
2) I tested all partitions with Chkdsk. All were OK.
3) I zipped two previously saved log files in a folder. That partition contains data as well as the WSUS folder
4) I sent a message with the zipped files attached through Outlook Web Access
5) Reboot - Warning 1792 during reboot
6) I tested all partitions with Chkdsk. The partition containing the log files was corrupted!
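
For anyone who wants to track this kind of sequence across many restarts, here is a minimal sketch in Python of a hand-filled restart log. It is not an HP or Microsoft tool: the names `RestartRecord` and `summarize` are made up for illustration, and every observation (restart type, driver version, whether POST showed 1792, Chkdsk result before and after) is assumed to be typed in by hand. It only counts how often a 1792 warning and new corruption coincide with each restart type and driver.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartRecord:
    """One manually recorded restart (all fields are entered by hand)."""
    kind: str           # "reboot" (warm restart) or "shutdown" (power off, then on)
    driver: str         # RAID controller driver version in use
    post_1792: bool     # POST showed the 1792 message
    clean_before: bool  # Chkdsk reported all partitions OK before the restart
    clean_after: bool   # Chkdsk reported all partitions OK after the restart


def summarize(records):
    """Group records by (kind, driver) and count restarts, 1792 warnings
    and restarts after which a previously clean system showed corruption."""
    summary = {}
    for rec in records:
        row = summary.setdefault((rec.kind, rec.driver), Counter())
        row["restarts"] += 1
        row["post_1792"] += int(rec.post_1792)
        row["new_corruption"] += int(rec.clean_before and not rec.clean_after)
    return summary


# Hypothetical entries modelled on the sequence described above.
log = [
    RestartRecord("reboot", "6.8.0.32", True, True, False),
    RestartRecord("shutdown", "6.8.0.32", False, True, True),
]
for key, row in summarize(log).items():
    print(key, dict(row))
```

A table like this does not prove a cause, but it makes it easier to show support that corruption follows warm reboots with a given driver and not cold starts.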

The warnings at reboot are NOT HARMLESS!!!
Chkdsk was unable to fix the problem, not even during startup, so I had to boot from my RescueCD (PEBuilder based), reformat the partition and restore it from a previous image.

I also downgraded the RAID controller's driver to ver. 6.6.2.32. The warning after reboot vanished...

I asked HP tech support again. The reply: this is a software issue, so it is my problem....
Maybe I am wrong, but I don't think that a server that repeatedly corrupts disks, and that ***MAYBE*** works only with one exact driver release, is MY problem! I cannot write a new driver to safely use YOUR hardware, and without a driver I can't use YOUR hardware, so this is a hardware issue, as is every problem involving firmware and proprietary drivers!

And above all, please: NO ONE CAN KNOW the origin of the problem as long as it can't be reproduced!!!
I am not exactly a newbie in hardware and system management, but this is the first time I have seen or heard of something like this!

Sorry about the long post (and my bad English), but I thought it was necessary.

So, the question now is this:
considering the absence of a meaningful response from HP support, has anyone had issues like the one described here, and does anyone have an idea about it?
I cannot run any risky or destructive tests....

I will be away on holiday for 10 days, so I don't think I will have the opportunity to reply, but all suggestions are welcome!

Best regards.

Franco
53 REPLIES
KarloChacon
Honored Contributor

Re: ML350 G5 Disk Failure

hi

I hope I got the main idea from your issue...
HDDs issues

What did HP tell you about the Array Diagnostic Utility reports? Is something wrong with the controller? What controller do you have?

what about cabling?
what about the hard drive cage?
even the power backplane?

what are the new errors?
Event Viewer, IML, ADU?

regards
Didn't your momma teach you to say thanks!
fricci
Advisor

Re: ML350 G5 Disk Failure

The RAID controller is the E200i (integrated) + 128MB BBWC module. As I described, there are NO errors reported by IML or ADU. The event viewer registers the warning "1792-Drive Array Reports Valid Data Found in Array Accelerator".
After the replacement of the cache module, ADU warns "Cache module fault - replace cache module" every time I launch it IF AND ONLY IF POST reported "1792-Drive Array Reports Valid Data Found in Array Accelerator". After downgrading the driver to 6.6.2.32 there are no more warnings at reboot and ADU doesn't report any errors. When booting from the SmartStart CD, the controller diagnostic doesn't report any errors.
Your questions are the same ones I posed to HP.
The problem could arise from any component of the disk subsystem (controller, cache, disks, cabling, backplane, power supply, firmware, driver).
The real problem is that I can't solve this on a trial and error basis, because I don't have the necessary components and HP denied my request.
I am a consultant, not the hardware manufacturer; I can't (and neither can the customer) buy lots of original HP parts to test their hardware!!!
Sincerely, after this experience I don't think I will recommend HP servers anymore. Technical support is nonexistent.
I'm thinking that one way to solve this could be to suggest that the customer buy a new server to replace the faulty one.
Obviously, this could quickly lead to legal action. I hope this will not be necessary.
Regards.
KarloChacon
Honored Contributor

Re: ML350 G5 Disk Failure

hi

I think you should call HP again and explain the situation you've been having all these days

just to be completely sure about the issue, you can ask them for a power backplane, drive cage and cables

try it and give feedback, ok

regards
Didn't your momma teach you to say thanks!
fricci
Advisor

Re: ML350 G5 Disk Failure

I opened a new ticket last Friday. As usual, they told me this is a software issue, so it is not their problem. After a furious conversation they decided to "escalate" the problem to a higher level (this is the second time), but they surely will not reply within 10 days. And I know what their answer will be...
if their diagnostic software reports "all right", they will not take charge of the issue.
The only option will be to migrate that installation to a similar new server, demonstrating that there are no software issues, and ask for a full refund.
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Franco,

Sorry to hear about your problems but glad to know I'm not the only one.

I've been working on the same hardware and having numerous and anomalous problems that I just couldn't explain. This is a server migration project. Restoring original data from backup left a bunch of stuff corrupt. I've been through a few OS installs on this server too and they've all failed for some reason or another over time. Active Directory will work today but not tomorrow... that sort of thing.

Finally, today I reboot the server, get the same error you did (1792) and will be following your path with support. If I can conjure up anything with them I will let you know.

Thanks,

Jim
fricci
Advisor

Re: ML350 G5 Disk Failure

Jim,
A few days ago I received a call from HP 2nd level support.
They sent me a new diagnostic utility (HPSRPT Advanced) to get some diagnostic logs out of the server...... but the software apparently doesn't work. It simply starts but does nothing.....
Every time I speak with an HP technician I get the clear impression that I am speaking to a newbie with a very, very, very limited technical background..... so they are simply not able to collect and analyze the information I submitted to them.
I am beginning to think this might not be (only) a hardware issue, but could instead be caused by buggy firmware/drivers, so a simple hardware replacement will not solve it....
I'm going to test the "faulty" installation (which works without problems in a VMWare environment on my workstation) on a DL380 (for the moment only for a few days) and I'm sure it will work!
After disabling the accelerator and installing driver 6.6.0.32, everything seems to work.
The major problem is that the issues appear randomly: maybe it will work for a month and then we get new trouble.
I am very interested in any progress toward a solution of this issue, so let me know any news about it.
Regards.

Franco
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Sorry, I've been on the road for a few days and couldn't get back to this right away. I called support with the error reported above. Here are the steps that hopefully resolved it for me.

1. Turn off write cache on controller via ACU.

2. Install the latest driver for the controller (E200i, IIRC).

3. Install firmware version 1.66 on the controller.

4. Reboot.

After this I was finally able to successfully finish a SBS install (9-10th try) on this hardware.

That seems to have fixed it but this problem seemed to come up at random intervals so who knows if it will come back or not.

Good luck.

fricci
Advisor

Re: ML350 G5 Disk Failure

Your fix procedure is similar to the steps I walked through, and what you wrote makes me think again about the possibility of buggy firmware or a buggy driver.
Although the firmware upgrade to version 1.66 was a clear step toward a solution, it didn't solve the problem completely.
The first thing I thought of doing for diagnostic purposes, from the beginning of this story (the end of May), was to disable what HP calls the "accelerator" (I suppose this means the write-back algorithm), but when I tried to do it via ACU, I received a very intimidating message about the possibility of losing the contents of the array. I repeatedly asked HP support, and they told me "Don't do it!".

After the last disaster recovery I decided to switch the accelerator off anyway and, as I supposed, nothing dramatic happened....
Anyway, I had to downgrade to driver ver 6.6.0.32 because the latest version 6.8.0.32 has the same problem when rebooting (the warning "1792-Drive Array Reports Valid Data Found in Array Accelerator") and it corrupted a partition again (with acceleration enabled).

Since the last fix on August 18 the server has apparently been working without problems, but considering that the previous crash occurred about 1 month earlier, I don't think I can consider it fixed.

Even if this really fixed the issue, it is ***absolutely unacceptable*** from a technical point of view to use a server that could destroy your data after a simple driver or firmware update. This is not a server, it is a tool for acrobats!
If disabling the write-back algorithm is the ultimate fix, I think that HP can take the server back and give us our not-bugged money....

So, I hope you fixed the issue, but I am not sure about it...

However, repeatedly rebooting and also shutting down (which often leads to different behaviour of the server) can give you an indication of whether the issue is present.
I forgot to ask you if you also got EventID 55 - NTFS Error in the System log.
Regards.

Franco
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

I don't have access to the server right now - it's at the customer's site. He just informed me that the server rebooted itself last night - a sure sign the problem is back.

Will update when I have more.
fricci
Advisor

Re: ML350 G5 Disk Failure

They sent me the latest version of HP SRPT Enhanced (finally working), so I sent them the generated logs.
I hope to find someone a bit smarter than the previous one...
I have another ML350 G5 installed that currently works without problems, but it runs W2K3R2 STD (not SBS2003 R2) and it gets the same warnings while rebooting. The firmware/drivers are still an old version, but I didn't make any changes while waiting for a solution to the previous issue...


Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

I'm back and here is the latest.

Called HP Support. While on hold, the server blue screened right in front of me. The stop error was 0x00000077.

They had me send off all the logs, which included a couple of 1792s and a 1794.

Rebooted to Smart Start and ran full diagnostics - came back 100% successful.

HP determined that the array controller is bad based on the stop error. Should have a new one tomorrow. Will update you if the problem continues.

Thanks,

Jim
PinnacleCS
Occasional Advisor

Re: ML350 G5 Disk Failure

Hello, I was reading your post and everything seems to be identical to the issues I am having. If you go into your System Management Homepage, do you have a bunch of SCSI Bus Faults on each of your drives in the server?

I noticed that I had no other errors except for those.

Also, did you order your server on-line from HP? Just wondering if that's a common thread.

BTW HP Support sure does leave a lot to be desired!
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Where are the error codes/event IDs you are getting? Are you looking in Windows Event Log or the HP Logviewer? Windows wasn't telling us anything until I personally saw the blue screen.

The server came through a reseller (us).
PinnacleCS
Occasional Advisor

Re: ML350 G5 Disk Failure

I'm not really getting anything in Windows, with the exception of the SCSI bus errors in the System Management Homepage. If you click on the array controller, then look at each physical disk, under Problem Indicators you will see SCSI bus resets. All four of my drives show the same number of resets (odd to begin with that they would all be the same). As far as any other errors, I have not had any BSODs yet, but randomly on reboot I will get the "1792-Drive Array Reports Valid Data Found in Array Accelerator" message. Fortunately I'm not in production yet, but I'm afraid that I will start to see these issues once I start to put a load on the IO subsystem.
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Do you have an E200 or E200i? i is for integrated. If it isn't integrated, reseat, check the cabling and look for amber lights inside the case.

If it continues, call HP Support - the server is under warranty, yes?
PinnacleCS
Occasional Advisor

Re: ML350 G5 Disk Failure

Yes, mine is the E200i w/BBWC module. I did reseat all of the cables and the BBWC module. The 1792 error is random so I have no idea if the reseat fixed anything. I think since the server is less than 30 days old I'll likely send it back for an ML370 with the P400i controller. That seems to be a slightly more reliable controller. I'm not sure if the E200 has been around long or not. Scary to go into production with issues like this, reminds me of a Dell server!
SPa
Trusted Contributor

Re: ML350 G5 Disk Failure

Hi,

"1792-Drive Array Reports Valid Data Found in Array Accelerator" is just an informational message and not really an error. There was valid data in the accelerator, which the controller then writes back to the drives.

There are several known issues listed on Microsoft support for Event ID 55. One of them is http://support.microsoft.com/kb/932578/en-us

You may want to check whether it applies to your setup.
fricci
Advisor

Re: ML350 G5 Disk Failure

I'm sorry for the long silence, but I'm very busy...

*** SPa,

I thought that "1792-Drive Array Reports Valid Data Found in Array Accelerator" was just an informational message until I got data corruption simply by restarting the server. Unfortunately we can't know what the controller wants to write back to the drives..... and now I am sure it writes something wrong (read my previous posts).

I know KB932578, but this issue affects only systems with a cluster size smaller than 4096 bytes, which is quite uncommon.

The other question is this:
Why did none of the issues described appear after I migrated the installation (via cloning) to different hardware or to VMWare (running on a different server)? I don't think you have to think a lot to find the right answer!

What is absolutely intolerable is the answer I got from HP technical support.
After 3 months of systematic corruption, they invited me to update the controller's driver to 6.8.0.32 and to install KB932755 (in the wrong order!).

The last corruption, on August 18, came just after a reboot with the 6.8.0.32 driver, which is still affected by the "1792-Drive Array Reports Valid Data Found in Array Accelerator" message during POST.
Then I downgraded to ver 6.6.2.32 (in my previous post I wrote 6.6.0.32 - this is wrong) and I disabled the accelerator through ACU.

I know KB932755 and it solves several problems, but they are not related to disk corruption, and I didn't install it because I had to downgrade to driver 6.6.2.32.
As clearly reported in the technical note, you have to update to driver 6.8.0.32 BEFORE installing KB932755, otherwise you can get a BSOD!
"we recommend that you install the updated drivers from HP before you install this Storport update".



*** Jim and PinnacleCS,

I didn't understand whether you got data corruption or only unwanted reboots or BSODs.
Can you please clarify?

Sincerely, after this 3-month nightmare I think the best solution is:

DON'T BUY THAT SERVER!!!
Corollary: If you already bought one, send it back (and ask for your non-bugged money!)
Blazhev_1
Honored Contributor

Re: ML350 G5 Disk Failure

Hi,

the problem with this server is that the power supplies in this and other G5 servers use new technology and some models are buggy. There are cases where the server restarts without reason and the logs contain no errors explaining why it restarted.
The description of 1792 reads: "Possible Cause: Power was interrupted while data was in the array accelerator memory. Power was then restored within several days, and the data in the array accelerator was flushed to the drive array.
Action: No action is required. No data has been lost. Perform orderly system shutdowns to avoid leaving data in the array accelerator."
I think this is the issue.
Replace the power supplies, and I think the problems will go away.
Data can be lost if power is not restored within 3 days (the battery preserves the data for 72 hours).
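
The 72-hour rule above is simple arithmetic, sketched below in Python. This is only an illustration: the function name `cache_contents_survive` and the example timestamps are made up, and the 72-hour hold time is the figure quoted in this post, not a value checked against HP documentation for a particular battery.

```python
from datetime import datetime, timedelta

# Hold time as quoted in this post (assumption, not a verified specification).
BATTERY_HOLD = timedelta(hours=72)


def cache_contents_survive(power_lost, power_restored, hold=BATTERY_HOLD):
    """Return True if power came back before the battery hold time ran out."""
    return power_restored - power_lost <= hold


# Hypothetical outage of two days: still within the quoted 72 hours.
print(cache_contents_survive(datetime(2000, 1, 1, 9, 0), datetime(2000, 1, 3, 9, 0)))
```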

And this HPSRPT is not a diagnostic utility. It is a tool they use to collect all possible logs and configuration from the system, so that 2nd level knows all the details.
After you start it, the output is stored in
systemroot%/HPSreport/HPS*****.cab or something like that (it can be in Program Files), but it is a .cab file, so you can check it. The problem is not with the driver or firmware or SA controller. Just in case, check the cabling and reseat the BBWC, riser cage and controller, but I think the problem is power loss, so the PSU and power backplane are the most likely culprits.
If you replace the PSU, please update the post.

Regards,
Pac
fricci
Advisor

Re: ML350 G5 Disk Failure

Hi, Pac
thank you for your suggestions, but I never observed an unwanted reboot on that server.
Instead, I got corrupted files (and partitions) after a ***regular*** reboot (for example after applying some patches), as well as while using some software applications.
The message "1792-Drive Array Reports Valid Data Found in Array Accelerator" (almost) always appears after a ***regular*** reboot with driver 6.6.0.32 or older, and similar behaviour is observed with driver 6.8.0.32.
The only version that seems not to be affected by this issue is 6.6.2.32.
I also got some file corruption just by rebooting with the latest driver, 6.8.0.32.

Please note: just before clicking "reboot" I ran chkdsk on all volumes and everything was OK.
After rebooting, during POST, the message "1792-Drive Array Reports Valid Data Found in Array Accelerator" appeared; after login I ran chkdsk and found file corruption on a volume. Chkdsk was unable to fix the problem even during the next restart, so I had to reformat the volume and restore it from a previous backup.
I have had 6-7 file/partition corruptions since the end of May.
On August 18 I had to completely restore the whole system......

I know that the PSU could be involved in this issue, but I think that this should be the job of the HP support technician.... I have no PSU to swap in; they have them! Cabling and backplane are also potentially involved, but I have no backplane to swap in, because I don't work at HP....
I saw the output of HPSRPT; I called it a diagnostic utility because it collects logs that are useful for diagnosis.

This was the response of 2nd level support (translated from Italian):
------------------------------------------
To solve the data loss issue, please perform the following operations in the order given
1) Install driver 6.8.0.32
2) Install KB932755, which solves many problems.
------------------------------------------

If you read KB932755 you will notice that it says to install driver 6.8.0.32 BEFORE installing KB932755, otherwise you could get a BSOD. And I can't find anything in KB932755 directly related to this specific issue. They didn't read KB932755 at all!
At the end of August I didn't install KB932755 because after installing driver 6.8.0.32 I got data corruption, so I had to downgrade to ver 6.6.2.32, and if I install KB932755 on that driver I could get a crash.
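
The ordering constraint described above can be written down as a simple check. This is only a sketch in Python: `parse_version` and `storport_update_allowed` are hypothetical helper names, the installed driver version is assumed to be typed in by hand as a dotted string, and the 6.8.0.32 threshold comes from this thread's reading of KB932755, not from any verified compatibility list.

```python
def parse_version(text):
    """Turn a dotted version string such as '6.8.0.32' into a tuple of ints,
    so that versions compare numerically instead of as text."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum driver version, as stated in this thread (assumption).
REQUIRED_DRIVER = parse_version("6.8.0.32")


def storport_update_allowed(installed_driver):
    """Return True only if the controller driver is already at the required
    version or newer, i.e. the driver was updated BEFORE the Storport update."""
    return parse_version(installed_driver) >= REQUIRED_DRIVER


for version in ("6.6.2.32", "6.8.0.32"):
    print(version, storport_update_allowed(version))
```

Comparing tuples of integers avoids the trap of comparing version strings as text, where "6.10.0.0" would sort before "6.8.0.32".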

After 3 months of data corruption this is the only answer I got! An endless trial and error debugging procedure at the customer's expense!

I think these people are not suited to a technical role..... they are unable to work through a technical issue.

Regards.
fricci
Advisor

Re: ML350 G5 Disk Failure

Sorry, in my previous post I wrote the response of 2nd level support in reverse order. This is what they actually wrote:
------------------------------------------
To solve the data loss issue, please perform the following operations in the order given
1) Install KB932755, which solves many problems.
2) Install driver 6.8.0.32
------------------------------------------

Regards.

Re: ML350 G5 Disk Failure

WOW!
I wish that I had found this thread a month ago. I've been investigating an unexpected reboot issue on numerous ML350 G5 servers. Forgive me for not reading this whole thread, but I believe I'm having the same issue. After several conversations with HP support, they kept telling me that it was one of my applications or Microsoft's problem. Only after I found KB932755 did the support person admit that it may be related to the HP E200i driver. After updating all HP drivers, firmware and BIOS, I installed Win2K3 SP2 and my problem seems to have gone away. It's only been 2 days, so I'm crossing my fingers.

Did this solve everyone else's issues?

Thanks
Kris
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Sorry for no update in a while. Here is a brief recap.

Same error as folks above. Called HP and they ultimately sent the correct part and replaced the systemboard/E200i. Diagnosed the controller as bad but it is integrated so the whole board had to go. I wanted to replace the SCSI backplane too but that didn't fit the diagnosis.

System rebooted again, different stop error, sorry I don't have that with me. HP diagnosed as a bad power backplane and replaced.

System rebooted again - no stop error this time, just a complete mystery. Told HP the customer is unwilling to troubleshoot this any further on a new computer. Replace or return are the only options.

They now want to replace the SCSI backplane.

I have had four different cases open on this server and have zero confidence in the box at this point. It may run a week or longer before the problem comes up again. How can I put that into production? How long does it have to run before I can declare it stable? Why would the customer put up with this?

Every computer company makes a lemon or two. I think I've got one of HP's.

Good Luck.
Jason Riebe
Occasional Visitor

Re: ML350 G5 Disk Failure

Hi all, I too have been troubled by a customer's ML350 Exchange server with an E200i controller after installing PSP 7.9, firmware 7.9 and Win2kSP2. I have just installed the KB932755 patch and all has been good for 24 hours, fingers crossed.