
ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSDs

 
ZakSmith
Advisor

I have several DL320e v2 servers here.

 

They have the B120i disabled and the on-board controller in AHCI mode.

 

The OS is Debian Linux.  The drives are Seagate 600 PRO SSDs, the 200 GB version.

 

When I have drive write caching enabled (hdparm -W1 or the BIOS setting) and NCQ enabled for the drives, I periodically get these errors in the logs:

 

Mar 22 14:25:19 rack3 kernel: [  905.893359] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.211921] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [  906.212372] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.212377] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [  906.228293] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.547636] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [  906.548075] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.548078] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [  906.577595] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.895341] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [  906.895782] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.895786] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [  906.914616] ata1: limiting SATA link speed to 3.0 Gbps
Mar 22 14:25:20 rack3 kernel: [  906.915488] ata1: hard resetting link
Mar 22 14:25:21 rack3 kernel: [  907.235052] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Mar 22 14:25:21 rack3 kernel: [  907.235509] ata1.00: configured for UDMA/133
Mar 22 14:25:21 rack3 kernel: [  907.235514] ata1: EH complete

tl;dr version: the SATA link gets hard reset repeatedly under heavy workloads, and after several resets the kernel drops the link from 6.0 Gbps to 3.0 Gbps.
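
To tally events like the ones in the log above, here is a minimal Python sketch. It assumes the libata message wording seen in this excerpt (other kernel versions may phrase things differently), and the function name summarize_ata_resets and the shortened sample are only illustrative.

```python
import re
from collections import Counter

# Patterns follow the libata messages quoted in the log excerpt.
# They are an assumption about wording, not a stable kernel interface.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")
LIMIT_RE = re.compile(r"\b(ata\d+): limiting SATA link speed to ([\d.]+) Gbps")


def summarize_ata_resets(lines):
    """Count hard resets per ATA port and record link-speed downgrades."""
    resets = Counter()
    downgrades = {}
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
            continue
        match = LIMIT_RE.search(line)
        if match:
            # Keep the most recent speed limit reported for the port.
            downgrades[match.group(1)] = float(match.group(2))
    return resets, downgrades


# A shortened sample taken from the log lines in the post.
sample = """\
Mar 22 14:25:19 rack3 kernel: [  905.893359] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.212372] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.914616] ata1: limiting SATA link speed to 3.0 Gbps
Mar 22 14:25:20 rack3 kernel: [  906.915488] ata1: hard resetting link
""".splitlines()

resets, downgrades = summarize_ata_resets(sample)
print(dict(resets), downgrades)
```

In practice the lines could come from a saved kernel log file instead of the inline sample.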

 

Googling for this suggests that on generic hardware the most likely causes are poor SATA cables or bugs in the SATA controller.   Suggested work-arounds include disabling NCQ.  However, these machines use the factory-installed mini-SAS cable going from the mainboard to the SAS/SATA backplane, and the problem occurs on every DL320e v2 I have here to test (more than 2).

 

It turns out I can work around the bug by disabling write caching (hdparm -W0) or by disabling NCQ (setting /sys/block/sda/device/queue_depth to 1).  But that is not acceptable because drive performance takes a 2-3x hit.   (Write caching increases effective write speeds by a factor of about 2.5x, and disabling NCQ is about a 50% hit.)
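
The NCQ workaround amounts to writing 1 to the queue_depth file in sysfs. A minimal Python sketch of that step follows, assuming the sysfs path quoted above. The helper name set_queue_depth and its sysfs_root parameter are made up here for illustration, and writing the real file requires root.

```python
from pathlib import Path


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth for `device` and return the previous value.

    A depth of 1 effectively disables NCQ for that drive. The
    `sysfs_root` parameter is hypothetical and exists only so the
    function can be tried against a scratch directory.
    """
    path = Path(sysfs_root) / "block" / device / "device" / "queue_depth"
    previous = int(path.read_text().strip())
    path.write_text(f"{depth}\n")
    return previous


# Example (needs root, and "sda" is the device name used in the post):
#   old = set_queue_depth("sda", 1)    # disable NCQ
#   set_queue_depth("sda", old)        # restore the earlier depth
```

Returning the previous value makes it easy to restore the original depth after a test run. The setting does not persist across reboots.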

 

I've reproduced this error with at least 4 different SSDs of the same model.  Reproduction is fairly easy: run a 100 GB "bonnie++" pass several times and the error will pop up at least once, usually about half a dozen times.
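
For anyone who wants to repeat this kind of test, a rough Python sketch of "run the workload several times and count new resets" is below. It is not the procedure used in this post. The names count_resets and run_trials are invented here, the bonnie++ target directory and user in the usage comment are placeholders, and reading dmesg may require root on some systems.

```python
import re
import subprocess

# Same message wording as in the log excerpt in the post (an assumption).
RESET_RE = re.compile(r"\bata\d+: hard resetting link")


def count_resets(log_text):
    """Return how many hard-reset messages appear in the given log text."""
    return len(RESET_RE.findall(log_text))


def read_kernel_log():
    """Return the current kernel ring buffer as text via dmesg."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


def run_trials(n, run_workload, read_log=read_kernel_log):
    """Run the workload n times and return the new resets seen per run.

    Caveat: the kernel ring buffer can wrap on long runs, which would
    make these counts unreliable. A persistent log file is safer.
    """
    new_resets = []
    for _ in range(n):
        before = count_resets(read_log())
        run_workload()
        after = count_resets(read_log())
        new_resets.append(after - before)
    return new_resets


# Usage sketch (directory and user are placeholders; check the
# bonnie++ man page for the options on your version):
#   def bonnie():
#       subprocess.run(
#           ["bonnie++", "-d", "/mnt/test", "-s", "100g", "-u", "root"],
#           check=True,
#       )
#   print(run_trials(5, bonnie))
```

Passing the workload and the log reader in as callables keeps the counting logic easy to check without touching real hardware.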

 

Now the smoking gun: if I drop an LSI 9207-8i PCIe 3.0 SAS/SATA controller into the left-hand expansion slot, using the same internal cabling inside the DL320e, the errors go away completely and I get about 5% more throughput to boot.   The 9207 is the "IT"/JBOD mode version, so it's not running any RAID stuff either.

 

To me this indicates that the on-board SATA controller, when running in AHCI mode, has bugs or at least an incompatibility with the particular drives I'm using.

 

The DL320e servers are flashed to the latest BIOS and iLO versions.

 

Any ideas?   

 

I will be contacting HP support next week because the situation is unsatisfactory, but I wonder if anyone else has any good ideas.     I sort of suspect they will tell me to buy their HP-branded $900 SSDs.  Not going to happen at 2-3x the cost.


1 REPLY
ZakSmith
Advisor
Solution

Re: ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSDs

Just got off the phone with HP support.

 

Unless the drive is listed on the QuickSpec, they have no clue and won't comment.

 

The closest drive on the QuickSpec is the 691864-B21, a 200 GB SSD that retails for over $1399.

 

Interestingly, HP sells the 240GB version of my 200GB drive (the 200GB model is overprovisioned for drive life), part # EQ51AA, for $700.  Market price for the Seagate 240GB is about $316.

 

But they say that this drive is fine on the HP Z620 family of workstations.   Here's a rhetorical question for the public: if HP advertises both the Z620 series and the DL320e series as having a SATA 3, AHCI-compatible disk controller, and one of them does not work with a SATA 3 drive, doesn't that mean that one of them is NOT really compatible?

 

For the $1399 HP wants for their rebranded (but unknown OEM) SSD, I can buy an LSI 9207 SAS/SATA controller (about $250) and TWO of Seagate's 480GB Enterprise SSDs.   Or I could buy a real hardware RAID controller (LSI 9271-4i) with a BBU and two 240GB-class Enterprise SSDs.   The only downside is that I don't get the blinkenlights on the front of the drives.

 

The LSI controller also actually works with SAS drives (HP wants another $100 to enable SAS mode on the B120i), and it gives a full 6 Gbps on all 4 ports instead of 6 Gbps on the first two and 3 Gbps on the second two (again a limit of the B120i controller).   In benchmarking, the LSI controller is also 5-10% faster on reads/writes than the on-board HP "not quite SATA" controller.

 

Shame on you, HP.