Two node cluster crashes completely during the reboot of one node

AMG-WSI · ‎05-26-2023

Our company runs a lot of 2-node Hyper-V clusters. The 2 nodes of each cluster are interconnected with 4 cross cables: 2 of them are teamed for the cluster interconnect and 2 of them are teamed for live migration. We use HP Ethernet 10Gb 2-port 530T adapters in each node. We sometimes see the message “The iLO health monitoring status of the device / adapter located in Slot ? is not responsive” (? is 3 or 4, which points to the 10Gb adapters) in the iLO log of one node. Several times we have noticed that when the other node of the cluster is rebooted, the node with this message in its iLO log crashes, with the result that all the VMs crash and the cluster disappears! The issue is limited to HP ProLiant DL380 Gen10 hardware. We use SPP 2022.03.0 on all our servers. Sometimes the iLO message only shows up on one node during the reboot of the other node, and then we cannot do anything to avoid the cluster crash! The reboots are needed to install the MS updates and the SPP. We never mix the two kinds of updates, but we noticed that the crash has nothing to do with the update process itself.
It is strange that a cluster which at that moment consists of a single node hosting all the VMs suddenly freezes. The 10Gb adapters should not matter at that moment because everything is running on that single node. We recently saw this on a server that crashed today and also 3 years ago (running an older SPP at that time)! Is this a known problem? To complete the story: we installed SPP 2021.03.0 on our Gen10s because the 10Gb adapters sometimes suddenly “disappeared” from the OS and from the RBSU. Since installing 2021.03.0, that issue has luckily gone away. We intend to roll out SPP 2022.03.0, but there is not much trust in it within the company.

What we have already done:
In this case we shut down the server, removed the power for a short while, and cold-booted the server. We have not seen the messages again on this server.
We analyzed the dumps with WinDbg. It is always BSOD stop code 0x133 (DPC_WATCHDOG_VIOLATION), and the faulting file is always evbda.sys.

There is an HPE Support document describing the message, "Advisory: (Revision) HPE Integrated Lights-Out 5 (iLO 5) - False Caution Message, The iLO Health Monitoring Status of the Device / Adapter Located In Embedded Is Not Responsive May Display Constantly In the iLO 5 Event Log", but it does not offer a solution.

An HPE case didn't help us. HPE asks for usable dumps, but we can't get a "good" dump: the servers have a lot of RAM, so there is not enough disk space to save the dumps, and dumping also takes time. Does anybody have an idea how to move forward with this issue? Does anybody have similar experience with the 530T Ethernet adapters? Maybe there is an issue with the PCIe bus?

Kashyap02 · ‎06-05-2023

Hi
When you reboot the one node in the cluster, does the other node with the IML message (ILO Health Monitoring) reboot or shutdown ?
What is the firmware and driver on the 530T adapter ?
What is the OS version?
What is the ILO Channel Interface driver version ?
What is the ILO Version ?

Since when has this issue been seen? Was this setup working before?

I am a HPE Employee.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]


AMG-WSI · ‎06-06-2023

Thanks for helping us. I'll answer your questions:

When you reboot the one node in the cluster, does the other node with the IML message (ILO Health Monitoring) reboot or shutdown ? The node with the ILO health monitoring warning will always reboot.

What is the firmware and driver on the 530T adapter ? driver version 7.19.2 and firmware version 7.13.206.0

What is the OS version? Windows Server 2019 Standard core

What is the ILO Channel Interface driver version ? 4.71.0

What is the ILO Version ? 2.72 (ILO5)

Since when has this issue been seen? Was this setup working before? We use SPP 2022.03 on all our Gen10 servers, but we had this issue with SPP 2021.04 as well. We noticed the issue for the first time on 26/04/2022. Until now, we have had about 10 cluster crashes. By checking the IML in advance for these warnings, we can prevent a cluster crash by first rebooting the cluster node that shows them. But it can happen that the iLO Health Monitoring warnings only appear after one cluster node has been rebooted, and then it can be too late.
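
The pre-check described above (look for the warning in the IML, then reboot the flagged node first) could be sketched roughly as follows in Python. This is only an illustration: it assumes the IML entries of each node have already been exported as plain text lines (how they are exported is not covered here), and the function names and node names are hypothetical. The only thing taken from this thread is the wording of the warning.

```python
import re

# Warning text as quoted in this thread; the location is "Slot <n>" or "Embedded".
WARNING_PATTERN = re.compile(
    r"iLO health monitoring status of the device / adapter located in "
    r"(?:Slot (\d+)|Embedded) is not responsive",
    re.IGNORECASE,
)


def nodes_with_warning(iml_by_node):
    """Return {node: sorted locations} for every node whose IML lines contain the warning.

    iml_by_node is a hypothetical structure: {node name: list of IML text lines}.
    """
    flagged = {}
    for node, lines in iml_by_node.items():
        locations = set()
        for line in lines:
            match = WARNING_PATTERN.search(line)
            if match:
                # group(1) is the slot number, or None for the "Embedded" variant.
                locations.add(match.group(1) or "Embedded")
        if locations:
            flagged[node] = sorted(locations)
    return flagged


def reboot_order(iml_by_node):
    """List flagged nodes first, then the remaining nodes, keeping input order."""
    flagged = nodes_with_warning(iml_by_node)
    first = [node for node in iml_by_node if node in flagged]
    rest = [node for node in iml_by_node if node not in flagged]
    return first + rest
```

A check like this only helps when the warning is already in the log before the first reboot; it cannot cover the case where the warning appears during the reboot of the other node.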

Vinky_99 · ‎06-07-2023

Hello!

Here are a few considerations and steps you can take to troubleshoot the issue:

* Ensure that your HP ProLiant DL380 Gen10 servers have the latest firmware installed, including the iLO firmware and network adapter firmware.

* Verify that the network configuration of your cluster is correct. Check the teaming configuration for the cluster interconnection and live migration networks, ensuring that it aligns with best practices for your specific hardware and network setup. Double-check the cabling and connections to rule out any physical issues.

* Examine the event logs on both nodes of the cluster for any related error messages or warnings that might provide additional clues about the cause of the crashes. Look for any patterns or recurring events that coincide with the cluster failures.

* Focus on addressing the iLO health monitoring status problem reported in the iLO logs. This could be a separate issue contributing to the cluster crashes.

* Since you mentioned that the crash generates a DPC_WATCHDOG_VIOLATION (error code 133) with the evbda.sys file crashing, it's worth investigating further. Analyzing crash dumps using tools like WinDbg can provide more insights into the root cause. If you're not familiar with analyzing crash dumps, consider engaging the support team or a knowledgeable professional who can assist you.

* If the issue persists and you're unable to identify a resolution, reach out to both HPE and Microsoft support for assistance.

FYI, please remember to always perform thorough backups before making any significant changes or updates to your production environment. This ensures that you can recover in case of any unforeseen issues during the troubleshooting process.

These are my opinions, so use them at your own risk.

Kashyap02 · ‎06-09-2023

The firmware is 2 versions old.
The adapter has been updated with the latest driver: 7.13.206.0
But the firmware is not compatible with this driver. Refer to the release notes of the adapter driver.

https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_37c4cfedc4c84770a9031a0a4d&tab=releaseNotes

HPE recommends the firmware provided in HPE QLogic NX2 Online Firmware Upgrade Utility for Windows Server x64 Editions, version 5.2.7.0 or later, for use with these drivers.

https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_91bbda177c70418e8c5de55395&tab=releaseNotes
Version 5.2.7.0 includes a combo image v7.19.14

Referring to the advisory for the IML messages :

Advisory: https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00102190en_us

The fix was included in iLO version 2.30 or later.
But your server's iLO is already on 2.72.
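
For checking many servers against a minimum version like this, a dotted version string can be compared numerically rather than as text. A minimal Python sketch, assuming versions consist only of dot-separated integers (the helper names are hypothetical, not part of any HPE tool):

```python
def parse_version(text):
    """Turn a dotted version string such as "2.72" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so "5.2.7" and "5.2.7.0" compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Values from this thread: iLO 2.72 installed, fix included in 2.30 or later.
print(meets_minimum("2.72", "2.30"))
```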

I have read somewhere that Innovation Engine version 0.2.3.0 has some fixes for the reboot issues and helps mitigate the message "The iLO health monitoring status of the device / adapter located in Slot ? is not responsive".

Give it a try.

As @Vinky_99 mentioned, you can verify the cluster services, physically check the cables, verify the firmware, and do a complete analysis of the crash dump.

You can raise a warranty case with HPE for assistance. We can validate the hardware and see if we can also replicate the issue in our labs.
I also recommend opening a case with the OS vendor for crash dump analysis.

I am a HPE Employee.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]

