
Re: ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSD's

 
ZakSmith
Advisor

ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSD's

I have several DL320e v2's here.

 

They have the B120i disabled and the drives in AHCI mode.

 

The OS is Debian Linux. The drives are Seagate 600 Pro SSDs, the 200 GB version.

 

When I have drive write caching enabled (hdparm -W1 or the BIOS setting) and NCQ enabled for the drives, I periodically get these errors in the logs:

 

Mar 22 14:25:19 rack3 kernel: [  905.893359] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.211921] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [  906.212372] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.212377] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [  906.228293] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.547636] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [  906.548075] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.548078] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [  906.577595] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [  906.895341] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [  906.895782] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [  906.895786] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [  906.914616] ata1: limiting SATA link speed to 3.0 Gbps
Mar 22 14:25:20 rack3 kernel: [  906.915488] ata1: hard resetting link
Mar 22 14:25:21 rack3 kernel: [  907.235052] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Mar 22 14:25:21 rack3 kernel: [  907.235509] ata1.00: configured for UDMA/133
Mar 22 14:25:21 rack3 kernel: [  907.235514] ata1: EH complete

tl;dr version: the SATA links get hard reset under heavy workloads.
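To get a rough idea of how often this happens, the reset messages can be counted per port. This is a minimal sketch that assumes the syslog layout shown above, where the port name (for example "ata1:") is its own whitespace-separated field. The function name count_hard_resets is made up for this example.

```shell
#!/bin/sh
# Count "hard resetting link" events per ATA port in kernel log text on stdin.
# Assumes the port name (e.g. "ata1:") appears as its own field, as in the
# log excerpt above.
count_hard_resets() {
    awk '/hard resetting link/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ata[0-9]+:$/) {
                port = $i
                sub(/:$/, "", port)   # strip the trailing colon
                count[port]++
            }
        }
    }
    END { for (p in count) print p, count[p] }' | sort
}
```

It can be fed from the kernel ring buffer with `dmesg | count_hard_resets`, or from a saved log file by redirecting the file to stdin.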

 

Googling for this suggests that on generic hardware the most likely causes are poor SATA cables or bugs in the SATA controller. Suggested workarounds include disabling NCQ. However, these machines use the factory-installed mini-SAS cable going from the mainboard to the SAS/SATA backplane, and it happens in all of the DL320e v2's I have here to test (more than 2).

 

It turns out I can work around the bug by disabling write caching (hdparm -W0) or by disabling NCQ (setting /sys/block/sda/device/queue_depth to 1). But that is not acceptable because drive performance suffers a 2-3x hit. (Write caching increases effective write speeds by a factor of about 2.5, and disabling NCQ is about a 50% hit.)
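For anyone who wants to try the same workarounds, here is a small sketch of both. It has to run as root, and "sda" is only an example device name. The function names and the optional sysfs-root argument are hypothetical, added so the sysfs part can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch of the two workarounds. Run as root; the device name is an example.

# Workaround A: turn off the drive's write cache.
disable_write_cache() {
    hdparm -W0 "/dev/$1"
}

# Workaround B: force a queue depth of 1, which effectively disables NCQ.
# The optional second argument overrides the sysfs root (hypothetical
# parameter, only there to allow a dry run against a scratch directory).
disable_ncq() {
    dev=$1
    sysfs=${2:-/sys}
    echo 1 > "$sysfs/block/$dev/device/queue_depth"
}

# Print the current queue depth so the change can be confirmed.
show_queue_depth() {
    cat "${2:-/sys}/block/$1/device/queue_depth"
}
```

Typical use would be `disable_ncq sda` followed by `show_queue_depth sda`. The sysfs value does not survive a reboot, so it would need to be reapplied at boot.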

 

I've used at least 4 different SSDs of the same model to reproduce this error. Reproduction is fairly easy: run a 100 GB "bonnie++" test several times and the error will pop up at least once, usually about half a dozen times.
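A reproduction loop along these lines could look like the sketch below. It is not the exact procedure used above. It assumes bonnie++ is installed, that the script runs as root, and that the kernel ring buffer does not wrap during a pass (otherwise the counts will be off). BONNIE_CMD and KLOG_CMD are illustrative override variables, not anything standard.

```shell
#!/bin/sh
# Hypothetical helper: run bonnie++ repeatedly and report how many new
# "hard resetting link" messages the kernel logged during each pass.
BONNIE_CMD=${BONNIE_CMD:-bonnie++}
KLOG_CMD=${KLOG_CMD:-dmesg}

count_resets_now() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    $KLOG_CMD | grep -c 'hard resetting link' || true
}

repro_resets() {
    dir=$1            # directory on the SSD under test
    passes=${2:-5}    # number of bonnie++ runs
    pass=1
    while [ "$pass" -le "$passes" ]; do
        before=$(count_resets_now)
        # -d: test directory, -s: file size, -u: user to run as
        # (bonnie++ requires -u when started as root).
        $BONNIE_CMD -d "$dir" -s 100g -u root >/dev/null
        after=$(count_resets_now)
        echo "pass $pass: $((after - before)) new hard resets"
        pass=$((pass + 1))
    done
}
```

For example, `repro_resets /mnt/ssd 5` would run five passes against a filesystem mounted at /mnt/ssd (an example path).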

 

Now the smoking gun is that if I drop an LSI 9207-8i PCIe 3.0 SAS/SATA controller into the left-hand expansion slot, using the same internal cabling inside the DL320e's, the errors go away completely and I get about 5% more throughput to boot. The 9207 is the "IT"/JBOD mode version, so it's not running any RAID stuff either.

 

To me this indicates the on-board SATA controller when running in AHCI mode has bugs or at least an incompatibility with the particular drives I'm using. 

 

The DL320e's are flashed to the latest BIOS and iLO versions.

 

Any ideas?   

 

I will be contacting HP support next week because the situation is unsatisfactory, but I wonder if anyone else has any good ideas. I sort of suspect they will tell me to buy their HP-branded $900 SSDs. That's not going to happen at 2-3x the cost.

3 REPLIES
ZakSmith
Advisor
Solution

Re: ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSD's

Just got off the phone with HP support.

 

Unless the drive is listed on the QuickSpec they have no clue and won't comment.

 

The closest drive on the QuickSpec is the 691864-B21, a 200 GB SSD that retails for over $1399.

 

Interestingly, HP sells the 240GB version of my 200GB drive (the 200GB is overprovisioned for drive life), part #  EQ51AA for $700.  Market price for the Seagate 240GB is about $316.

 

But they say that this drive is fine on the HP Z620 family of workstations. Here's a rhetorical question for the public: if HP advertises both the Z620 series and the DL320e series as having a SATA 3, AHCI-compatible disk controller, and one of them does not work with a SATA 3 drive, doesn't that mean that one of them is NOT really compatible?

 

For the $1399 HP wants for their rebranded (but unknown OEM) SSD, I can buy an LSI 9207 SAS/SATA controller (about $250) and TWO of Seagate's 480GB Enterprise SSDs. Or I could buy a real hardware RAID controller (LSI 9271-4i) with a BBU and two 240GB class Enterprise SSDs. The only downside is that I don't get the blinkenlights on the front of the drives.

 

The LSI controller also actually works with SAS drives (HP wants another $100 to enable SAS mode on the B120i), and it gives a full 6 Gbps on all 4 ports instead of 6 Gbps on the first two and 3 Gbps on the second two (again a limit of the B120i controller). In benchmarking, the LSI controller is also 5-10% faster on reads/writes than the on-board HP "not quite SATA" controller.

 

Shame on you HP.  

MichalBohdal
New Member

Re: ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSD's

A few years have passed since this was posted, but I think I have the same issue in my Gen8 MicroServer. It has the B120i RAID controller working in AHCI mode.

I have four 6 TB drives connected to it, with a ZFS raidz1 on top of them. Under high load one disk times out and its SATA link is reset (at random: disk 1, 2 or 4; not yet seen with disk 3). This causes the array to go degraded. When I try to resilver the array, it puts heavy load on the drive that was just dropped from the SATA controller, so the link resets once again, the array goes degraded, and the resilver starts from 0% again. This repeats every 20-60 minutes, which makes the resilver impossible to finish. Previously I just moved the drives into a separate machine for resilvering, and once that was done they could work in the server again.

Now I have tried the method from this thread of disabling the write cache. The resilver has been going for over 4 hours without the drive link resetting, although the speed is terribly slow. Thanks for the temporary solution.

I will buy a used LSI SAS HBA card; hopefully moving the disks to a different (PCIe) controller will fix the issue permanently.

ZakSmith
Advisor

Re: ATA hard reset errors on DL320e v2 in AHCI mode with NCQ/write-cache enabled on SSD's

I'm glad it was helpful. My standard practice for a lot of these servers (and some workstations) is to just throw in an LSI controller and run with it. Performance is uniform. Ports are uniform. ZFS works great on them. On regular workstations, I end up with a bunch more single SATA/SAS connectors if I need them for anything.

 

Heck, some years ago I put the same LSI controller in a DL380 G5 and connected it to the 8xSFF backplane so I could run ZFS on that disk array instead of the super slow and cumbersome (and not ZFS compatible) internal RAID controller.