HPE Synergy

Re: 3820c disconnecting from VC uplink ports - anyone else?

 
PatrickLong
Respected Contributor

3820c disconnecting from VC uplink ports - anyone else?

Since upgrading two of my 3-frame Synergy environments to OneView 6.20 and SSP 2021.05.01 and 2021.05.03, I have been seeing a few instances of either:

  • my 3820c CNA ports disconnecting from the VC downlink ports (i.e. the host stays running but loses all connectivity until cold-rebooted; in rare cases an e-fuse reset was required). OneView indicates "All links down in adapter in slot 3."
  • my host experiencing a UMCE and PSOD indicating PCI device 0000:36:02.0, which I believe is the PCI bridge device to the 3820c adapter. Again, a cold reboot appears to resolve it, but in rare cases an e-fuse reset is required. IML logs indicate:

"1134","Critical","PCI Bus","Uncorrectable PCI Express Error Detected. Slot 3 (Segment 0x0, Bus 0x36, Device 0x2, Function 0x0).  Uncorrectable Error Status: 0x100000","08/06/2021 21:19:11","1","Hardware",

"1133","Critical","System Error","Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible.","08/06/2021 21:19:11","1","Hardware",

"1132","Critical","CPU","Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000000, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'36100000). ","08/06/2021 21:19:10","1","Hardware",

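For anyone triaging similar IML exports, here is a minimal Python sketch that turns the PCI location in entry 1134 into the segment:bus:device.function form shown on the PSOD, and splits the two status words into named bits. The function names are illustrative and not part of any HPE tool. The sketch assumes that the IML "Uncorrectable Error Status" value mirrors the PCIe AER Uncorrectable Error Status register and that the MCE "Status" value is the raw IA32_MCi_STATUS register; neither is confirmed in this thread.

```python
import re

# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Assumption: the IML value uses the same bit layout.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}

# Architectural flag bits at the top of IA32_MCi_STATUS.
MCE_STATUS_FLAGS = {
    63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
    59: "MISCV", 58: "ADDRV", 57: "PCC",
}

PCI_RE = re.compile(
    r"Segment 0x([0-9A-Fa-f]+), Bus 0x([0-9A-Fa-f]+), "
    r"Device 0x([0-9A-Fa-f]+), Function 0x([0-9A-Fa-f]+)"
)


def iml_to_bdf(message):
    """Return the PCI address as segment:bus:device.function, or None."""
    match = PCI_RE.search(message)
    if match is None:
        return None
    seg, bus, dev, fn = (int(group, 16) for group in match.groups())
    return f"{seg:04x}:{bus:02x}:{dev:02x}.{fn:x}"


def decode_aer_uncorrectable(status):
    """List the names of the AER uncorrectable-error bits set in status."""
    return [name for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
            if (status >> bit) & 1]


def decode_mce_status(text):
    """Split an IML-style status such as 0xBB800000'00000E0B into
    (flag names, low 16-bit MCA error code)."""
    value = int(text.replace("'", ""), 16)
    flags = [name for bit, name in sorted(MCE_STATUS_FLAGS.items(), reverse=True)
             if (value >> bit) & 1]
    return flags, value & 0xFFFF


if __name__ == "__main__":
    entry_1134 = ("Uncorrectable PCI Express Error Detected. Slot 3 "
                  "(Segment 0x0, Bus 0x36, Device 0x2, Function 0x0).  "
                  "Uncorrectable Error Status: 0x100000")
    print(iml_to_bdf(entry_1134))              # 0000:36:02.0
    print(decode_aer_uncorrectable(0x100000))  # ['Unsupported Request Error']
    print(decode_mce_status("0xBB800000'00000E0B"))
```

Under those assumptions, the address in the IML entry is the same device named on the PSOD (0000:36:02.0), and the machine check is flagged as uncorrected (UC) with the address field not valid (ADDRV clear), which is consistent with the all-zero Address value in entry 1132.
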
Yes, I have opened multiple support cases, where I am invariably advised to perform any or all of: reboot, e-fuse reset, re-apply server profile, reseat CNA card, move to other server bay, check profile settings, reflash with SSP firmware, and replace the CNA module. All of these actions are a reactive shotgun approach to an event that has already occurred. What I am looking for is root cause information about what event or incompatibility is actually triggering these failures in the first place.

Is anyone else seeing similar issues? OneView at 6.20; Virtual Connect SE 40Gb F8 Module for Synergy at 1.7.1.1001, Interconnects at 1.18, FLMs at 3.00.00, 3820c at 7.18.82, BIOS at 2.50, iLO at 2.44, Server OS ESXi 7.02a (yeah, I know... micro-SD card boot device nightmare; I'm waiting impatiently for the U3 patch release), qfle3 driver at 1.4.14.0-1OEM.700.1.0.15843807 and qfle3f driver at 2.1.22.0-1OEM.700.1.0.15843807, all per the 2021.05.03 SSP.

tech3d
HPE Pro

Re: 3820c disconnecting from VC uplink ports - anyone else?

Hi Patrick,

We regret hearing about the issue that you are facing.

We understand that our support engineers have recommended various steps to resolve the issue. Customers are entitled to remote phone support, assistance over remote screen sharing, and parts replacement where needed, as per warranty or contractual coverage. Performing an RCA is in fact out of scope for the HPE break-fix technical team.

However, in certain instances where customers face issues with multiple devices, we can have the cases further reviewed by engineering, to check if there are any underlying issues that we might never have encountered at other customer sites.

We request you to log a new support ticket using the serial number of any of the Synergy 12000 frames. Use the subject 'deep dive analysis for multiple instances of server network outage after updating SSP 2021.05.01'. List all the support case numbers logged from your end to date, and also copy and paste this forum message so that our support team knows what to do.

Thanks for being a valued customer of Hewlett Packard Enterprise.


I work for HPE



PatrickLong
Respected Contributor

Re: 3820c disconnecting from VC uplink ports - anyone else?

@tech3d  I created this deep dive ticket as you recommended.  Yesterday morning I had a new variation on this issue on yet another host that has never had the issue before - but this time only HALF of the CNA uplinks disconnected - the side connected to the VC in frame1 slot3.  From IML:

3820c_Controller_not_connected.jpg

and from OneView:

3820c_Controller_not_connected_oneView.jpg

Will update here with whatever support is able to discover.

koilerman
Established Member

Re: 3820c disconnecting from VC uplink ports - anyone else?

I just had this exact issue pop up over the weekend.  Did you make any headway with HPE on it?  We've never had an issue, but recently (maybe a month ago) upgraded to SPP 2021.05.03 and OneView 6.2.0.

tech3d
HPE Pro

Re: 3820c disconnecting from VC uplink ports - anyone else?

Hi Patrick,

Good to hear from you again, although we regret hearing that you are still facing technical issues.

Looking at the screenshots, the server in bay 3 of frame 2 seems to have lost connectivity on Mez3 port1.
Hence ports 1,3,5,7 are reporting down.
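
The port arithmetic behind that statement can be sketched as follows. This assumes, as the reply implies but does not state, a two-port adapter exposing four functions per physical port, numbered so that odd numbers sit on physical port 1 and even numbers on physical port 2. The function name and the 2x4 layout are assumptions for illustration only.

```python
def functions_on_port(port, ports=2, functions_per_port=4):
    """Return the 1-based function numbers that share one physical port,
    assuming functions are interleaved across the physical ports."""
    if not 1 <= port <= ports:
        raise ValueError(f"port must be between 1 and {ports}")
    return [port + ports * i for i in range(functions_per_port)]


if __name__ == "__main__":
    print(functions_on_port(1))  # [1, 3, 5, 7]
    print(functions_on_port(2))  # [2, 4, 6, 8]
```

With that layout, a single physical link going down takes out every other function, which matches half of the adapter's connections dropping while the other half stays up.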

Please log a new support case using the serial number of the affected server, and share the following logs:
LE support dump, AHS log, VMsupport dump.
Also, mention this chat link when logging the case.
Our support engineer will coordinate with me and we will see what is going on.


I work for HPE



DanCernese
HPE Pro

Re: 3820c disconnecting from VC uplink ports - anyone else?

Is it possibly this? https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00118373en_us

Newly published.  



I work at HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
tech3d
HPE Pro

Re: 3820c disconnecting from VC uplink ports - anyone else?

Hi Patrick,

We would also like you to check this customer advisory released recently:

https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00118373en_us

It seems like the issues you are facing might actually be covered in here.


I work for HPE



PatrickLong
Respected Contributor

Re: 3820c disconnecting from VC uplink ports - anyone else?

Thank you to @DanCernese and @tech3d for this Advisory link - it sounds like exactly what I am experiencing, and I will implement the workaround in the Advisory. However, I think the wording in the Description should be a little stronger. I have not only experienced the PSODs, UMCEs and network disconnections mentioned, but in a number of cases I have been unable to get the CNA reconnected at all, either by rebooting the host or by using the method advised in the OneView Resolution recommendation: "Perform refresh of logical interconnect if this connection is associated with a Virtual Connect SE 100Gb F32 module or a Virtual Connect SE 40Gb F8 module." It seems the CNA sometimes suffers catastrophic hardware failure on this firmware/driver combination, because my last-ditch efforts to e-fuse reset the compute module sometimes result in the CNA no longer being properly detected on boot. OneView then reports "The adapter in Mezzanine Slot 3 is present but reported incomplete or invalid configuration data.", which causes a failure to assign the previously linked hardware profile: "Unable to associate a server hardware type with this server, since no port information was discovered for the server."

I will be reverting all of my compute modules with already-upgraded 3820c CNA adapters, but I do have a couple of hosts currently in the catastrophic failure situation I mentioned above for which I have not yet opened Support cases.  Happy to provide HPE with any additional troubleshooting or logging if desired, or I can simply open support cases to get the CNA's replaced if this is already a known possible outcome of the SSP 2021.05.03 firmware/driver combo issue in the Advisory.  @koilerman it looks like we have some work to do - or rather, undo.

koilerman
Established Member

Re: 3820c disconnecting from VC uplink ports - anyone else?

Sigh... yes - unfortunately @PatrickLong it sounds like we do.


I originally updated to 2021.05.03 due to a PSOD issue in the qfle driver that we were running. This led to hitting the sdcard/vsphere issue that was finally resolved in 7.0.2c. Now, hitting this, it's getting frustrating to be unable to find a solid release combo where I don't need to worry about things going belly up in the evening or during a weekend. Fingers crossed this does it and we can stand pat for a while.

PatrickLong
Respected Contributor

Re: 3820c disconnecting from VC uplink ports - anyone else?

@koilerman Sounds like we are in the same boat - managing diskless Synergy compute modules as ESXi hosts on 7.02c with 3820c CNA adapters. I must say, as a longtime VMware admin back to the GSX days, I cannot remember a seemingly interminable stretch of environment instability like these last 6 months have been since the release of 7.02 on 3/9/2021. It is absolutely infuriating and ultimately demoralizing to continually play host whack-a-mole with issues like the 7.02 boot device I/O instability, coupled with what is now officially recognized as an issue resulting from a faulty firmware and driver combo on the 2021.05.01 and 2021.05.03 SSP. Serenity now!

PatrickLong
Respected Contributor

Re: 3820c disconnecting from VC uplink ports - anyone else?

@DanCernese and @tech3d - well, we have ourselves quite a pickle here. I remember now why I did not like the earlier 7.18.77 firmware (from package 1.3.6) combined with the MRVL-E3-Ethernet-iSCSI-FCoE_3.0.135.2-1OEM.700.1.0.15843807_17340313.zip driver set, which includes qfle3 1.4.12.0-1OEM.700.1.0.15843807, qfle3f 2.1.14.2-1OEM.700.1.0.15843807 and qfle3i 2.1.5.0-1OEM.700.1.0.15843807 - the combination recommended in the Advisory https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00118373en_us

I downgraded to those versions on a few of my SY660 servers per the Advisory, and my DBAs immediately reported that their daily dbcc check, which normally takes about 90 minutes, took nearly twice that long to complete today, coincident with VMware latency alarms on the VM. Investigating, I found that when running a high-I/O operation on an SY660 node with a 3820c CNA using the older firmware/driver combo, the KAVG values (the time the I/O spends being processed in the hypervisor kernel) were off the charts. Normally KAVG should not exceed 0.10 ms - today under heavy load I saw I/Os taking in excess of 70 - that's 70 full MILLISECONDS in the ESXi kernel BEFORE that I/O gets sent to the HBA and then out to the array. Notice the ACTV and QUED values, which reflect the impact of the KAVG delay. Here's an esxtop snippet:

old_firmware_drivers.jpg
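
To make the latency relationship concrete, here is a minimal sketch of how the esxtop columns fit together: guest latency (GAVG) is the sum of device latency (DAVG) and kernel latency (KAVG). The 0.10 ms limit and the 70 ms figure come from the post above; the DAVG values, the class and the function names are made up for illustration and are not readings from the screenshot.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    davg_ms: float  # time spent below the kernel: driver, HBA, fabric, array
    kavg_ms: float  # time spent inside the ESXi kernel

    @property
    def gavg_ms(self):
        """Latency seen by the guest: GAVG = DAVG + KAVG."""
        return self.davg_ms + self.kavg_ms


def kernel_share(sample):
    """Fraction of the guest-visible latency that is spent in the kernel."""
    return sample.kavg_ms / sample.gavg_ms if sample.gavg_ms else 0.0


def flag_kernel_latency(samples, kavg_limit_ms=0.10):
    """Return the samples whose KAVG exceeds the given limit."""
    return [s for s in samples if s.kavg_ms > kavg_limit_ms]


if __name__ == "__main__":
    # DAVG values here are hypothetical; 70 ms KAVG is the figure from the post.
    degraded = LatencySample(davg_ms=1.5, kavg_ms=70.0)
    healthy = LatencySample(davg_ms=1.5, kavg_ms=0.02)
    print(degraded.gavg_ms)                          # 71.5
    print(flag_kernel_latency([degraded, healthy]))  # only the degraded sample
```

When nearly all of GAVG is KAVG, the array and fabric are not the bottleneck; the I/O is waiting inside the host, which is why the queue counters climb at the same time.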

Then I moved that VM to a host with the newer 2021.05.03 versions of firmware (7.18.82 from package 1.4.4) and drivers from the MRVL-E3-Ethernet-iSCSI-FCoE_3.0.140.2-1OEM.700.1.0.15843807_17767068.zip driver set, which includes qfle3 1.4.14.0-1OEM.700.1.0.15843807, qfle3f 2.1.22.0-1OEM.700.1.0.1584380 and qfle3i 2.1.8.0-1OEM.700.1.0.15843807. I re-ran dbcc and it completed in the normal 90 minutes. Here's the same esxtop snippet - notice almost NO KAVG in relation to the CMDS/s, and as a result very little GAVG (Guest Average Latency).

new_firmware_drivers.jpg
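
For a side-by-side view of the two driver sets compared in this post, here is a small sketch that diffs them by component. The version numbers are the ones quoted in the post, with the OEM build suffix left off; the dictionary and function names are illustrative only.

```python
# Driver versions quoted in this post, without the OEM build suffix.
ADVISORY_SET = {"qfle3": "1.4.12.0", "qfle3f": "2.1.14.2", "qfle3i": "2.1.5.0"}
SSP_2021_05_03_SET = {"qfle3": "1.4.14.0", "qfle3f": "2.1.22.0", "qfle3i": "2.1.8.0"}


def version_key(version):
    """Turn '1.4.14.0' or '1.4.14.0-1OEM.700...' into a comparable tuple."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


def diff_recipes(old, new):
    """Map each component whose version differs to an (old, new) pair."""
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    for name, (before, after) in diff_recipes(ADVISORY_SET, SSP_2021_05_03_SET).items():
        direction = "newer" if version_key(after) > version_key(before) else "older"
        print(f"{name}: {before} -> {after} ({direction})")
```

Comparing versions as integer tuples rather than strings avoids the trap where "2.1.8.0" sorts after "2.1.22.0" in a plain text comparison.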

So now it appears I have a Sophie's Choice - I can ignore the Advisory and keep running the newer firmware/driver combo and risk spontaneous and random uplink disconnects, or I can downgrade the drivers and firmware per the Advisory resulting in insufferable performance under heavy I/O load.  Any ideas for a way out of this morass?  Any estimate on timeframe for a new SSP-approved 3820c firmware/driver combo?

DanCernese
HPE Pro

Re: 3820c disconnecting from VC uplink ports - anyone else?

I apologize for not having any background on the performance issue being demonstrated.  If I had no further information in your shoes, I'd table the advisory action for now.  The forecast for a full recipe fix is early November. 



I work at HPE
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]
PatrickLong
Respected Contributor

Re: 3820c disconnecting from VC uplink ports - anyone else?

The new 3820c firmware update 1.6.3 (combo image v7.18.85) has been released, as well as the new driver (link is for ESXi 7.x), which includes:

  • qfle3, version 1.4.22.0
  • qfle3f, version 2.1.25.0
  • qfle3i, version 2.1.8.0
  • qcnic, version 2.0.60.0

These can also be found on the new 2021.11.01 SSP (Synergy_Service_Pack_SSP_2021.11.01_Z7550-97261.iso), downloadable from HPE Synergy Software Releases

I have installed on a few Synergy compute nodes and am currently testing to see if the latency issue I observed earlier has been resolved.

 

<EDIT 11/04/2021> Testing of dbcc on two hosts with the above combination of new firmware and drivers on ESXi 7.0U2d shows that the in-kernel latency (viewed as KAVG in screenshots from prior posts) is maintained at expected, normal and steady levels of 0.0x ms