Re: ML350 G5 Disk Failure

 
fricci
Advisor

Re: ML350 G5 Disk Failure

They sent me the latest version of HP SRPT Enhanced (finally working), so I sent them the generated logs.
I hope to find someone a bit smarter than the previous one...
I have another ML350G5 installed that is actually working without problems, but it runs W2K3R2 STD (not SBS2003 R2) and it gets the same warnings while rebooting. The firmware/drivers are still an old version, but I haven't made any changes while waiting for a solution to the previous issue...


Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

I'm back and here is the latest.

Called HP Support. While on hold, the server blue screened right in front of me. The stop error was 0x00000077.

They had me send off all the logs, which included a couple of 1792s and a 1794.

Rebooted to Smart Start and ran full diagnostics - came back 100% successful.

HP determined that the array controller is bad based on the stop error. Should have a new one tomorrow. Will update you if the problem continues.

Thanks,

Jim
PinnacleCS
Occasional Advisor

Re: ML350 G5 Disk Failure

Hello, I was reading your post and everything seems to be identical to the issues I am having. If you go into your System Management Homepage, do you have a bunch of SCSI Bus Faults on each of the drives in the server?

I noticed that I had no other errors except for those.

Also, did you order your server on-line from HP? Just wondering if that's a common thread.

BTW HP Support sure does leave a lot to be desired!
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Where are the error codes/event IDs you are getting? Are you looking in Windows Event Log or the HP Logviewer? Windows wasn't telling us anything until I personally saw the blue screen.

The server came through a reseller (us).
PinnacleCS
Occasional Advisor

Re: ML350 G5 Disk Failure

I'm not really getting anything in Windows, with the exception of the SCSI bus errors in the System Management Homepage. If you click on the array controller, then look at each physical disk, under Problem Indicators you will see SCSI bus resets. All four of my drives show the same number of resets (odd to begin with that they would all be the same). As for other errors, I have not had any BSODs yet, but randomly on reboot I get the "1792-Drive Array Reports Valid Data Found in Array Accelerator" message. Fortunately I'm not in production yet, but I'm afraid I will start to see these issues once I put a load on the I/O subsystem.
Jim Philipson
Occasional Advisor

Re: ML350 G5 Disk Failure

Do you have an E200 or E200i? i is for integrated. If it isn't integrated, reseat, check the cabling and look for amber lights inside the case.

If it continues, call HP Support - the server is under warranty, yes?
PinnacleCS
Occasional Advisor

Re: ML350 G5 Disk Failure

Yes, mine is the E200i w/BBWC module. I did reseat all of the cables and the BBWC module. The 1792 error is random, so I have no idea if the reseat fixed anything. Since the server is less than 30 days old, I think I'll likely send it back for an ML370 with the P400i controller. It seems to be a somewhat more reliable controller. I'm not sure whether the E200 has been around long or not. Scary to go into production with issues like this, reminds me of a Dell server!
SPa
Trusted Contributor

Re: ML350 G5 Disk Failure

Hi,

"1792-Drive Array Reports Valid Data Found in Array Accelerator" is just an informational message and not really an error. There was valid data in the accelerator, which the controller then writes back to the drives.

There are several known issues listed on Microsoft support for Event ID 55. One of them is http://support.microsoft.com/kb/932578/en-us

You may want to check this against your setup.
fricci
Advisor

Re: ML350 G5 Disk Failure

I'm sorry for the long silence, but I'm very busy...

*** SPa,

I thought that "1792-Drive Array Reports Valid Data Found in Array Accelerator" was just an informational message until I got data corruption simply by restarting the server. Unfortunately we can't know what the controller wants to restore back to the drives..... and now I am sure it writes something wrong (see my previous posts).

I know KB932578, but this issue affects only systems with a cluster size smaller than 4096 bytes, which is quite uncommon.
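
To check whether a volume even falls in the range KB932578 covers, the cluster size can be read from `fsutil fsinfo ntfsinfo C:` on the server. Below is a rough sketch in Python, assuming the output was saved as text and contains a line of the form `Bytes Per Cluster : 4096`. The function names and the threshold constant are made up for illustration, and the 4096-byte limit is simply the figure quoted above.

```python
import re

# Threshold quoted in the post above: only cluster sizes smaller than
# 4096 bytes are said to be affected by KB932578.
KB932578_CLUSTER_THRESHOLD = 4096


def bytes_per_cluster(ntfsinfo_output: str) -> int:
    """Extract the 'Bytes Per Cluster' value from saved fsutil output."""
    match = re.search(
        r"Bytes Per Cluster\s*:\s*([\d,]+)", ntfsinfo_output, re.IGNORECASE
    )
    if match is None:
        raise ValueError("no 'Bytes Per Cluster' line found")
    return int(match.group(1).replace(",", ""))


def may_be_affected(ntfsinfo_output: str) -> bool:
    """True if the cluster size is below the threshold quoted above."""
    return bytes_per_cluster(ntfsinfo_output) < KB932578_CLUSTER_THRESHOLD
```

A default NTFS format normally reports 4096 here, in which case `may_be_affected` returns False and the KB does not explain the corruption.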

The other question is this:
Why, after migrating the installation (via cloning) to different hardware or to VMware (running on a different server), did I have none of the issues described? I don't think you have to think a lot to find the right answer!

What was absolutely intolerable was the answer I got from HP technical support.
After 3 months of systematic corruption, they invited me to update the controller's driver to 6.8.0.32 and to install KB932755 (in the wrong order!).

The last corruption I got, on August 18, was just after a reboot with the 6.8.0.32 driver, which still shows the "1792-Drive Array Reports Valid Data Found in Array Accelerator" message during POST.
Then I downgraded to ver 6.6.2.32 (in my previous post I wrote 6.6.0.32, which was wrong) and I disabled the accelerator through ACU.

I know KB932755 and it solves several problems, but they are not related to disk corruption, and I didn't install it because I had to downgrade to driver 6.6.2.32.
As clearly stated in the technical note, you have to update to driver 6.8.0.32 BEFORE installing KB932755, otherwise you can get a BSOD!
"we recommend that you install the updated drivers from HP before you install this Storport update".

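The ordering rule above (driver 6.8.0.32 first, then KB932755) can be written as a small pre-install check. This is only a sketch in Python. The function names are hypothetical, and treating any driver version numerically at or above 6.8.0.32 as acceptable is an assumption, since the note only names 6.8.0.32 itself.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '6.8.0.32' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Driver version the technical note says must be in place before KB932755.
REQUIRED_DRIVER = parse_version("6.8.0.32")


def safe_to_install_kb932755(installed_driver: str) -> bool:
    """True if the installed driver is 6.8.0.32 or (assumed) later."""
    return parse_version(installed_driver) >= REQUIRED_DRIVER
```

Comparing tuples of integers rather than raw strings avoids ranking a version such as 6.10.0.0 below 6.8.0.32.
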


*** Jim and PinnacleCS,

I didn't understand whether you got data corruption, or only unwanted reboots or BSODs.
Can you please clarify?

Honestly, after this three-month nightmare I think the best solution is:

DON'T BUY THAT SERVER!!!
Corollary: If you already bought one, send it back (and ask for your non-bugged money!)
Blazhev_1
Honored Contributor

Re: ML350 G5 Disk Failure

Hi,

The problem with this server is that the power supplies in this and other G5 servers use new technology, and some models are buggy. In some cases the server restarts for no apparent reason, with no errors in the logs explaining why.
For 1792, the documentation says: "Possible Cause: Power was interrupted while data was in the array accelerator memory. Power was then restored within several days, and the data in the array accelerator was flushed to the drive array.
Action: No action is required. No data has been lost. Perform orderly system shutdowns to avoid leaving data in the array accelerator.".
I think this is the issue.
Replace the power supplies, and I think the problems will go away.
Data can be lost if power is not restored within 3 days (the battery retains data for 72 hours).
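
To make the 72-hour window concrete, here is a minimal sketch in Python that checks whether a given outage stays within it. The function and constant names are made up for illustration, and the 72-hour figure is simply the one quoted above, not a measured value for any particular battery.

```python
from datetime import datetime, timedelta

# Retention time quoted above for the battery-backed write cache.
BBWC_RETENTION = timedelta(hours=72)


def cache_data_retained(power_lost: datetime, power_restored: datetime) -> bool:
    """True if power came back within the quoted 72-hour retention window."""
    if power_restored < power_lost:
        raise ValueError("power_restored is earlier than power_lost")
    return power_restored - power_lost <= BBWC_RETENTION
```

For example, an outage from Friday 18:00 to Monday 09:00 lasts 63 hours, so under this assumption the cached data should still be there.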

Also, HPSRPT is not a diagnostic utility. It is a tool they use to collect all possible logs and configuration from the system, so that 2nd-level support knows all the details.
After you run it, the output is stored in
%systemroot%/HPSreport/HPS*****.cab or something like that (it can be in Program Files), but it is a .cab file, so you can check it. The problem is not with the driver, the firmware or the SA controller. Just in case, check the cabling and reseat the BBWC, riser cage and controller, but I think the problem is power loss, so the PSU and power backplane are the most likely culprits.
If you replace the PSU, please update the post.

Regards,
Pac