ProLiant Servers (ML,DL,SL)

DL320s P400 array controller operation with SATA

 
Dave Rizer
New Member

I have several DL320s servers running Red Hat Linux with a 2.6 kernel.

I have repeatedly hit an issue when using 12 SATA drives on a P400 array controller with firmware 2.08. When a drive fails, it doesn't fail outright: I/O to the array slows down until the server is unusable. After a shutdown and POST, the array controller reports a failed drive. I replace the drive, it rebuilds, and performance is back where it ought to be.

Is there any reason the P400 doesn't detect a degraded drive during operation and kick it out of the array so the server can continue to function on the parity or mirror copy? It seems to heroically retry I/O when it would make more sense to kick the drive out.

device_point='/dev/cciss/c0d2/part2' recent_max_lat=4387685us ops_out=5 oldest_op_out=10155183us (excessive)
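
A line like the one above can be checked mechanically. The following is a minimal sketch, assuming only the `key=value` layout visible in that line; the function names and the one-second threshold are illustrative assumptions, not part of whatever tool produced the log.

```python
import re

# Matches key=value pairs; a value may be wrapped in single quotes.
_PAIR = re.compile(r"(\w+)=('([^']*)'|\S+)")

def parse_latency_line(line):
    """Return the fields of one log line as a dict.

    Values ending in 'us' are converted to integer microseconds,
    plain digit strings to integers, everything else stays a string.
    """
    fields = {}
    for key, raw, quoted in _PAIR.findall(line):
        value = quoted if raw.startswith("'") else raw
        if value.endswith("us") and value[:-2].isdigit():
            fields[key] = int(value[:-2])
        elif value.isdigit():
            fields[key] = int(value)
        else:
            fields[key] = value
    return fields

def is_excessive(fields, threshold_us=1_000_000):
    """True if the oldest outstanding op exceeds threshold_us (hypothetical cutoff)."""
    return fields.get("oldest_op_out", 0) > threshold_us
```

Applied to the line above, `parse_latency_line` yields an `oldest_op_out` of 10155183 microseconds (about 10 seconds), which `is_excessive` flags with the default threshold.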

Has anyone seen this?

Additionally, the ADU report shows an extremely high "bus faults" count. This may be the nature of SATA, but most of the fields are unpopulated: all zeros except read blocks, write blocks, and bus faults.
James ~ Happy Dude
Honored Contributor

Re: DL320s P400 array controller operation with SATA

Hello Dave,

Refer: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c01068337
I'm aware that you won't experience a BSOD, but updating to the latest drivers and firmware for the controller might help.
Also, "bus faults" or "SCSI bus downshift" errors are more likely caused by the cable than by the HDD.

;) Regards.
Dave Rizer
New Member

Re: DL320s P400 array controller operation with SATA

HappyDude,

Thanks for the suggestions. I am typically a strong believer in updating drivers and firmware. The issue you point out is a problem with the Microsoft Storport driver. I am using Linux, so I don't think it is related.

As for bus faults and speed downshifts being related to the cable, I would tend to agree. There are three components in the cable path from controller to drive. The P400 has a cable routed to the front of the system, and this communicates through the mainboard to the hard drive backplanes. There appear to be two backplanes. So if cabling is suspected, it could be any of those components.

The issue I am experiencing is not necessarily related to bus faults. In the SCSI world, bus faults at the numbers I am talking about would result in an extremely high number of BUS RESETS and it would be clear that there was a cable error (system would likely log parity errors too). I have seen high bus fault counts on every DL320s that I have looked at (at least 6).

The issue I have is an array slowing down to the point that write acknowledgements take far longer than any application or host can tolerate. The problem seems to revolve around retries on a disk. Smart Array controllers are advertised as either failing a drive predictively using SMART data from the drive, or determining that a disk is not responding and failing it. Neither happens until I power the system off and then back on. On POST, the array controller fails the drive. There is no evidence of problems with that drive until I do a cold boot. I believe this is an issue with the array controller. I have been slowly moving my controllers to firmware 4.06; however, the release notes for this version make no mention of this issue. They simply state SATA performance improvements.
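
Until the controller flags the drive itself, the slowdown can at least be observed from the host. The following is a minimal sketch, assuming the standard Linux `/proc/diskstats` field layout and that the logical drive appears there under a name such as `cciss/c0d2` (an assumption; check the name on the system in question). The function names are illustrative.

```python
def parse_diskstats(text):
    """Map device name -> (ios_completed, ms_spent, ios_in_flight).

    Expects the contents of /proc/diskstats. Rows with fewer than 14
    columns (e.g. partition rows on older 2.6 kernels) are skipped.
    """
    stats = {}
    for row in text.splitlines():
        parts = row.split()
        if len(parts) < 14:
            continue
        name = parts[2]
        reads, ms_reading = int(parts[3]), int(parts[6])
        writes, ms_writing = int(parts[7]), int(parts[10])
        in_flight = int(parts[11])
        stats[name] = (reads + writes, ms_reading + ms_writing, in_flight)
    return stats

def average_wait_ms(before, after, device):
    """Average milliseconds per completed I/O between two snapshots.

    Returns None when no I/O completed in the interval.
    """
    ios = after[device][0] - before[device][0]
    ms = after[device][1] - before[device][1]
    return ms / ios if ios else None
```

Taking two snapshots a few seconds apart and comparing `average_wait_ms` across logical drives gives a rough view of which one is stalling. A result of `None` together with a non-zero in-flight count means requests are outstanding but none completed in the interval.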

I am using battery-backed cache on the controller in all cases, split 50% read / 50% write. Unless I am filling up all of that cache and falling back to write-through, I would expect a prompt acknowledgement of a write.

Dave