
DL320Gen11 with disconnecting NVMe devices

 
SOLVED
pirx
Valued Contributor

DL320Gen11 with disconnecting NVMe devices

We have a new vSAN cluster with 4 x DL320 Gen11, each with 4 x 3.84 TB NVMe. One of the hosts shows device errors for the NVMe drives in vSphere/ESXi, but there is no error in iLO and, according to HPE, nothing in the AHS logs. The server has had this problem since the beginning of the week; before that it was running fine for a couple of weeks, including some HCIBench runs.

When the issues start, it's not always the same NVMe. And most of the time, after the first device disconnects in the OS (it is no longer shown on the PCI bus), at least a second one follows immediately or a bit later. The frustrating part is that HPE support is pointing at VMware, saying VMware support should identify the broken device/part. VMware support checked the logs and the server in a remote session; the outcome is that some hardware is broken and is responsible for the disconnect of the NVMes. But from the OS side it's not possible to tell whether it's a broken NVMe, the backplane, or something else.
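For anyone hitting the same thing, a rough sketch of the ESXi-side checks (esxcli namespaces as in ESXi 7.x/8.x; the grep filter strings are only examples):

# Is the NVMe controller still visible on the PCI bus?
esxcli hardware pci list | grep -i -B10 -A3 "non-volatile"

# Which NVMe adapters does ESXi currently see?
esxcli nvme device list

# Any disconnect/PCIe errors around the time the device vanished?
grep -iE "nvme|pcie|permanently inaccessible" /var/log/vmkernel.log | tail -n 50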

So there is no real progress in resolving this. Does anyone have any idea how to narrow down the issue? I'm already powering off single NVMes from within the ILO to see if the error reoccurs (funny that HPE support did not suggest that, I'm not on-site). But I've not yet a result. Any tests for the NVMes that can be triggered somewhere in RBSU? And how/where can I disable NVMes in RBSU?
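In the meantime, the per-drive identity and health data can at least be read from the OS side (untested sketch; vmhba2 is an example adapter name taken from "esxcli nvme device list", and the exact log sub-commands may differ per ESXi build):

# Model, serial and firmware of a given NVMe controller -- handy for naming
# the exact physical drive to support.
esxcli nvme device get -A vmhba2

# SMART/health log of the same controller (media errors, temperature, spare).
esxcli nvme device log smart get -A vmhba2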

 

Update: with the drive in bay 3 powered off, the one in bay 2 still failed a few hours later. Now I've disabled both. But I'm not at all happy with how HPE support is handling this. Somehow I'm supposed to prove which device has failed; if it's the backplane, that's nearly impossible. We are not paying _a_lot_ of money for HPE support contracts just so that, in the end, nobody moves or tries to fix this on-site.


4 REPLIES
pirx
Valued Contributor

Re: DL320Gen11 with disconnecting NVMe devices

With 2 NVMes powered down via iLO, there has been no error for 36 hours. So what does this mean? 2 faulty NVMes? A broken backplane?

Embedded:Port=3A:Box=1:Bay=4 Enabled 3.84 TB NVMe SSD
Embedded:Port=3A:Box=1:Bay=3 Disabled 3.84 TB NVMe SSD
Embedded:Port=4A:Box=1:Bay=1 Enabled 3.84 TB NVMe SSD
Embedded:Port=4A:Box=1:Bay=2 Disabled 3.84 TB NVMe SSD
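If it helps to tie a location like Box=1:Bay=2 to a concrete drive, the serial numbers and health status should also be readable over the iLO Redfish API; a rough curl sketch below (hostname, credentials and resource IDs are placeholders, and the exact collection paths can differ by iLO generation/firmware):

# List the storage subsystems iLO exposes for the server.
curl -sk -u admin:password https://ilo-host/redfish/v1/Systems/1/Storage/

# Follow the member links down to the individual drives; each drive resource
# typically carries SerialNumber, PhysicalLocation and Status.Health.
curl -sk -u admin:password https://ilo-host/redfish/v1/Systems/1/Storage/<storage-id>/Drives/<drive-id>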

ngnear
HPE Pro

Re: DL320Gen11 with disconnecting NVMe devices

Hi There, 
Thank you for reaching out.
Could you share the case # or the serial # under which the issue is being handled via private message, so we can check the progress?



I work at HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
pirx
Valued Contributor
Solution

Re: DL320Gen11 with disconnecting NVMe devices

After a week with issues we removed all NVMes and reconnected the cables. The issue has been fixed since then (2 weeks now). A bit surprising, as the server had already been running fine for 4 weeks, and yet a connection problem seems to be the root cause.
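For anyone who wants to watch for a recurrence, a crude watchdog sketch that could be run periodically on each host (assumes the ESXi 7.x "esxcli nvme" namespace; the expected count and log path are arbitrary):

# Log a warning when fewer than the expected four NVMe controllers are visible.
EXPECTED=4
COUNT=$(esxcli nvme device list | grep -c vmhba)
if [ "$COUNT" -lt "$EXPECTED" ]; then
  echo "$(date) only $COUNT of $EXPECTED NVMe controllers visible" >> /var/log/nvme-watch.log
fi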

Sunitha_Mod
Honored Contributor

Re: DL320Gen11 with disconnecting NVMe devices

Hello @pirx,

Perfect! 

We are glad to know the issue has been resolved, and we appreciate you keeping us posted.