Re: Help finding cause of core switch crash

newnetadmin · ‎03-07-2019

Recently all 4 of our core switches (model H3C S5820) crashed without warning, affecting network connectivity for 500 users. These switches are in an IRF stack. When they came back up, they showed the message "invalid version". We discovered that when they rebooted, they tried to update to the firmware that had been added to the switches 3 years ago. The reason for the invalid version message is that we had to run a "brand" command to change the brand from h3c to hp.

The downtime lasted 2 1/2 hours. We resolved it with assistance from HPE switch support by setting the firmware, as well as the backup firmware, to the previous version.

I had a case open with HPE switch support, but they could not find a root cause in the switch log files. IMC support also looked at the issue and said the following:

"From the events I notice stack port going down and causing the switch reboot. Generally this occurs due to software exception and hence firmware upgrade was recommended to avoid the same in future"

and "Unfortunately the old logs of syslog will be removed if the database does not have enough space. I have reviewed the logfiles and there are 2 scenarios which are occurring.
1. Switch stack reboot
2. Switch stuck at bootrom menu due to invalid image file".

My manager would still like to know a root cause. Does anyone from the community have any idea of something else to check? With IRF we thought we would be protected from all 4 switches going down at the same time.

sskg · ‎03-12-2019

Hi,

Could you please provide the details below:

What software version was running at the time of the crash?
Please upload the output of display diagnostic-information (the latest, as well as any captured right after the issue was recovered).
Confirm whether you can see the log file and diag file in flash (run the dir command to check).

It may not be possible to find the root cause without the relevant logs. We can check whether there was a known issue or bug in the software version running at the time of the crash. Please post the details, and I will check and let you know whether the crash was due to a software bug.

I am an HPE Employee

newnetadmin · ‎03-14-2019

1. H3C Comware Platform Software, Software Version 5.20, Release 1807P022

2. I have already uploaded the logs to the support cases. I do not want to upload them here, as they contain the configuration of our core switches.

3. I can see logfile.log and default.diag files in flash. I do not see any other diag file, and the default.diag file has not been updated since 2012.

Could you answer a question about how important it is to configure MAD for IRF? I have been doing some reading on this, and we do not currently have it configured. I think that we should probably implement BFD MAD, but I do not want it to be a disruptive process. We are also planning on replacing our switches within the next year, so I'm not sure that it is worth the trouble.

sskg · ‎03-17-2019

Hi,

Version 1807P022 is a very old software version and is below the supported level.

5800-5820X_5.20.R1810P16 is the latest version available on the portal. Please download the release notes from the link below. According to the release notes, several bugs related to switch reboots or crashes have been fixed.

https://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-a00061393en_us-1.pdf
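
To make the gap between the two releases concrete, here is a small Python sketch that compares the release strings quoted in this thread. It assumes a release is written as a four-digit release number followed by "P" and a patch number (as in 1807P022 and R1810P16); other naming schemes are not handled, and the helper name parse_release is only illustrative.

```python
import re


def parse_release(text):
    """Extract (release, patch) from strings such as 'Release 1807P022'.

    Assumption: the release is four digits, then 'P', then a patch number.
    """
    match = re.search(r"R?(\d{4})P(\d+)", text)
    if match is None:
        raise ValueError(f"no release string found in {text!r}")
    return int(match.group(1)), int(match.group(2))


running = parse_release("Release 1807P022")
latest = parse_release("5800-5820X_5.20.R1810P16")

# Tuples compare element by element: release number first, then patch level.
is_behind = running < latest
print(running, latest, is_behind)
```

The comparison only tells you that the running release is older than the one on the portal. Which bugs were fixed in between still has to come from the release notes.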

How important is it to configure MAD for IRF?

When you have IRF in place, it is important to configure MAD to avoid a split-brain scenario.
Split-brain: Members of an IRF stack exchange keep-alives over the IRF ports. If for any reason a member misses these keep-alives, it considers the other member to be down. Suppose you have an IRF stack of 2 switches, where member 1 is primary and member 2 is secondary. If keep-alives are missed, member 2 thinks member 1 (the primary) is down and declares itself primary. At this stage, your network has two switches with the same IP address, both claiming to be primary. This is called a split-brain scenario.
To avoid a split-brain scenario, MAD should be configured.
MAD can be configured in different variants, such as BFD MAD, ARP MAD, and LACP MAD.
Please read the documentation to choose the suitable method.
For BFD MAD, you can configure it online and then connect the cable without any downtime.
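
The split-brain scenario described above can be sketched as a toy model. This is only an illustration in Python, not switch configuration: the Member class and the function names are hypothetical, and the tie-break rule (the master with the lowest member ID stays active while the others move to a recovery state) is my understanding of how MAD resolves a collision, so please verify it against the IRF configuration guide for your release.

```python
from dataclasses import dataclass


@dataclass
class Member:
    member_id: int
    role: str              # "master" or "standby"
    state: str = "active"  # "active" or "recovery"


def on_irf_link_loss(members):
    """Each member stops hearing keep-alives, assumes its peers are down,
    and claims the master role for itself."""
    for m in members:
        m.role = "master"


def resolve_with_mad(members):
    """Assumed MAD rule: if several active masters are detected, the one
    with the lowest member ID stays active and the rest enter recovery."""
    masters = [m for m in members if m.role == "master" and m.state == "active"]
    if len(masters) <= 1:
        return
    winner = min(masters, key=lambda m: m.member_id)
    for m in masters:
        if m is not winner:
            m.state = "recovery"


def active_masters(members):
    return [m.member_id for m in members
            if m.role == "master" and m.state == "active"]


stack = [Member(1, "master"), Member(2, "standby")]

on_irf_link_loss(stack)
without_mad = active_masters(stack)  # both members now claim to be master

resolve_with_mad(stack)
with_mad = active_masters(stack)     # only one active master remains

print(without_mad, with_mad)
```

Without the MAD step, both members keep forwarding with the same IP address. With it, only one side stays active, which is the behaviour you would want to confirm when testing in the maintenance window.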

Recommendation:

I suggest you plan a maintenance window of about 1 hour.
Upgrade the switch software to the latest version permitted by your organisation's policy, such as the Nth or N-1th version (read the release notes before selecting a version).
During the downtime, configure BFD MAD and connect the cable.
During the downtime window, you can also test MAD functionality by disconnecting or shutting down the IRF ports.

I hope this information is helpful!

I am an HPE Employee
