ProLiant DL385 Gen10 Plus Local Disks ESXi VMs high latency

 
philipp-rohrer
Senior Member

Hello Community

We are experiencing a problem on a server with local disks. VMware ESXi is installed and a couple of VMs are running on it.
The VMs are unresponsive and show very high disk latencies.
It already happened a couple of days ago: after installing the newest SPP and updating iLO, iLO showed a defective disk. Right after removing the drive, latency was back to normal.
Now, a couple of days later, the same problem is back. iLO does not show any errors.

I exported an ADU report to see all the details of the local array and the disks. Drive 3 has 2964 errors logged, while all others have no more than 10. Is this the drive causing the problem? Can someone please help analyse the report?

Thank you very much in advance!
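
The comparison described above (one drive with 2964 logged errors, the rest at 10 or fewer) can be sketched as a simple outlier check. This is only a rough illustration, not an HPE tool: the function name, the `factor` and `floor` thresholds, and every count except the 2964 for drive 3 are assumptions made up for the example.

```python
from statistics import median

def flag_outlier_drives(errors_by_drive, factor=10, floor=50):
    """Return drives whose logged error count is far above the other drives.

    factor and floor are arbitrary illustrative thresholds, not HPE values.
    """
    flagged = []
    for drive, count in errors_by_drive.items():
        # Compare each drive against the median of all the other drives.
        others = [c for d, c in errors_by_drive.items() if d != drive]
        baseline = median(others) if others else 0
        if count >= floor and count > factor * max(baseline, 1):
            flagged.append(drive)
    return flagged

# Only the value 2964 comes from the ADU report; the other counts are
# placeholders standing in for "no more than 10".
counts = {"1I:1:1": 4, "1I:1:2": 10, "1I:1:3": 2964, "1I:1:4": 7}
print(flag_outlier_drives(counts))
```

Comparing against the median of the other drives keeps one bad disk from inflating its own baseline.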

BPSingh
HPE Pro

Re: ProLiant DL385 Gen10 Plus Local Disks ESXi VMs high latency

Greetings!

Please replace the drive showing the high error count, and ensure that the drive firmware, controller driver, and controller firmware are up to date.
Is the ADU report also showing unrecoverable media errors? If so, please refer to the link below:
https://support.hpe.com/hpesc/public/docDisplay?docId=a00104843en_us&docLocale=en_US



I work at HPE
HPE Support Center offers support for your HPE services and products when and how you need it. Get started with HPE Support Center today.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
philipp-rohrer
Senior Member

Re: ProLiant DL385 Gen10 Plus Local Disks ESXi VMs high latency

Hi BPSingh

I'd like to give you a quick update on how the case looks now and what happened:

- 18.04.25 Noticeable latency on the ESXi server, VMs not responding, iLO shows everything as OK

- 19.04.25 ADU report created, showing Errors Logged 2964 (0x00000b94) for disk 3; iLO still green, VMs not usable

- 20.04.25 Same situation

- 21.04.25 iLO shows a critical error, disk 3 went offline (the one with the errors in the ADU report), VM performance good, latencies <1 ms


Here are some snippets from the ADU report:

Errors Logged 2964 (0x00000b94)
Physical Drive Error Log Entries

Error Type  SCSI Operation Code  SCSI Status  CAM Status  Sense Key  ASC   ASCQ  Block Valid  Block       Reference Time  Additional Information
----------  -------------------  -----------  ----------  ---------  ----  ----  -----------  ----------  --------------  ----------------------
0x02        0x28                 0xf0         0x17        0x00       0x00  0x00  0x00         0x00be7be8  0x001dbf05      0x0000
0x02        0x28                 0xf0         0x17        0x00       0x00  0x00  0x00         0x007ad7c0  0x001dbf05      0x0000
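
To make the error log rows easier to read, here is a small sketch that splits one row into named fields and translates the two values that have standard SCSI meanings. The column names are taken from the table header; the lookup tables are short subsets of the standard SCSI operation codes and sense keys. How the controller defines Error Type, SCSI Status 0xf0, and CAM Status 0x17 is not known here, so those stay as raw numbers.

```python
COLUMNS = [
    "error_type", "scsi_opcode", "scsi_status", "cam_status", "sense_key",
    "asc", "ascq", "block_valid", "block", "reference_time", "additional_info",
]

# Subsets of the standard SCSI tables; anything else is reported as "unknown".
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x88: "READ(16)", 0x8A: "WRITE(16)"}
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR", 0xB: "ABORTED COMMAND",
}

def decode_row(line):
    """Turn one whitespace-separated row of hex values into a dict."""
    values = [int(token, 16) for token in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    row["scsi_opcode_name"] = SCSI_OPCODES.get(row["scsi_opcode"], "unknown")
    row["sense_key_name"] = SENSE_KEYS.get(row["sense_key"], "unknown")
    return row

row = decode_row("0x02 0x28 0xf0 0x17 0x00 0x00 0x00 0x00 0x00be7be8 0x001dbf05 0x0000")
print(row["scsi_opcode_name"], row["sense_key_name"], hex(row["block"]))
```

Both rows shown are READ(10) commands that came back without sense data (sense key 0x00, ASC/ASCQ 0x00), so these entries by themselves do not describe a medium error.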


HPE Smart Array P408i-a SR Gen10 in slot 12 : Internal Drive Cage at Port 1I : Box 1 : Physical Drive (600 GB SAS HDD) 1I:1:3 : Monitor and Performance Statistics (Since Reset)

Serial Number **Confidential info erased**
Firmware Revision HPD8
Product Revision HPE EG000600JWJNP
Reference Time 0x00000039
Sectors Read 0x00000000000feea8
Read Errors Hard 0x00000000
Read Errors Retry Recovered 0x00000003
Read Errors ECC Corrected 0x0000000000000000
Sectors Written 0x000000000000fe35
Write Errors Hard 0x00000000
Write Errors Retry Recovered 0x00000000
Seek Count 0xffffffffffffffff
Seek Errors 0xffffffffffffffff
Spin Cycles 0x00000000
Spin Up Time 0x0000
Performance Test 1 0x0000
Performance Test 2 0xffff
Performance Test 3 0xffff
Performance Test 4 0xffff
Reallocation Sectors 0xffffffff
Reallocated Sectors 0xffffffff
DRQ Time Outs 0xffff
Other Time Outs 0x0000
Drive Rebuild Count 0 (0x0000)
Spin Retries 65535 (0xffff)
Recovers Failed Read 0x0000
Recovers Failed Write 0x0000
Format Errors 0x0000
Self Test Failures 0xffff
Not Ready Failures 0x00000000
Remap Abort Failures 0xffffffff
IRQ Deglitch Count 4294967295 (0xffffffff)
Bus Faults 0x00000000
Hot Plug Count 0 (0x00000000)
Track Rewrite Errors 0xffff
Write Errors After Remap 0x0000
Background Firmware Revision 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Media Failures 0x0000
Hardware Errors 0x0000
Aborted Command Failures 0x0000
Spin Up Failures 0x0000
Bad Target Count 0 (0x0000)
Predictive Failure Errors 0x00000000
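
The statistics block above is easier to scan once the zero counters and the all-F values are filtered out. The sketch below does that. It assumes that a value made up entirely of `f` digits (for example 0xffff) means "not reported by this drive"; that reading is a guess and is not confirmed anywhere in the report. The function names are made up for the example.

```python
import re

HEX_VALUE = re.compile(r"0x([0-9a-fA-F]+)")

def parse_counters(text):
    """Map each 'Name 0x...' line to a (value, reported) tuple."""
    counters = {}
    for line in text.splitlines():
        match = HEX_VALUE.search(line)
        if not match:
            continue  # lines such as the firmware revision carry no hex value
        name = line[:match.start()].strip()
        # Drop a decimal rendering such as "65535 (" in front of the hex value.
        name = re.sub(r"\s*\d+\s*\($", "", name)
        digits = match.group(1).lower()
        # ASSUMPTION: all-F values are "not reported" sentinels.
        reported = set(digits) != {"f"}
        counters[name] = (int(digits, 16), reported)
    return counters

def nonzero_reported(counters):
    """Keep only counters that are reported and greater than zero."""
    return {n: v for n, (v, ok) in counters.items() if ok and v != 0}

sample = """\
Sectors Read 0x00000000000feea8
Read Errors Hard 0x00000000
Read Errors Retry Recovered 0x00000003
Seek Errors 0xffffffffffffffff
Spin Retries 65535 (0xffff)
"""
print(nonzero_reported(parse_counters(sample)))
```

On the sample lines this leaves only Sectors Read and Read Errors Retry Recovered, which matches what stands out when reading the block by eye.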


As far as I can interpret the report, there are no unrecoverable media errors. Which parameter would be the right one to look at?

So, as we can see, the report let us identify the faulty disk in advance. My big question is: why does iLO take almost 4 days to take the disk offline when the ADU report clearly shows errors? Can anyone explain the mechanism behind this?

By the way, all firmware has been updated to the newest version available on support.hpe.com.

Thanks to everyone for helping!

BPSingh
HPE Pro

Re: ProLiant DL385 Gen10 Plus Local Disks ESXi VMs high latency


If the ADU report does not explicitly indicate any unrecoverable errors, then there’s no cause for concern. It’s good to know that the firmware and driver versions are up to date. Please ensure there are no "Read Errors Hard" or "Write Errors Hard" reported.

Note that iLO does not flag a drive for a limited number of read/write errors. It will only trigger an alert when the drive encounters excessive errors that lead to a predictive failure or an actual failure. 
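
As a rough illustration of the rule described above, the sketch below checks only the counters named in this reply (hard read/write errors and predictive failure errors). It is not iLO's or the controller's actual logic, which is not documented in this thread; the function name is invented, and the values are copied from the statistics posted for drive 1I:1:3.

```python
def would_raise_alert(counters):
    """Mirror the checklist from this reply; NOT the real iLO algorithm."""
    watched = ("Read Errors Hard", "Write Errors Hard", "Predictive Failure Errors")
    return any(counters.get(name, 0) > 0 for name in watched)

# Values taken from the posted ADU statistics for drive 1I:1:3.
drive_3 = {
    "Read Errors Hard": 0x00000000,
    "Write Errors Hard": 0x00000000,
    "Read Errors Retry Recovered": 0x00000003,
    "Predictive Failure Errors": 0x00000000,
    "Errors Logged": 2964,
}
print(would_raise_alert(drive_3))  # False
```

Under this simplified rule the drive stays unflagged even with 2964 logged errors, which is consistent with iLO showing green until the drive actually went offline.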


