Help finding cause of core switch crash

 
SOLVED
Occasional Contributor

Recently all 4 of our core switches (model H3C S5820) crashed without warning, affecting network connectivity for 500 users. These switches are in an IRF stack. When they came back up, they reported the message "invalid version". We discovered that when they rebooted, they tried to load a firmware update that had been copied to the switches 3 years ago. The reason for the "invalid version" message is that we needed to run a "brand" command to change the brand from H3C to HP.

The downtime lasted 2 1/2 hours. We resolved it with HPE switch support's assistance by setting both the main and the backup firmware back to the previous version.

I had a case open with HPE switch support, but they could not find a root cause in the switch log files. IMC support also looked at the issue and said the following:

"From the events I notice stack port going down and causing the switch reboot. Generally this occurs due to software exception and hence firmware upgrade was recommended to avoid the same in future"

and "Unfortunately the old logs of syslog will be removed if the database does not have enough space. I have reviewed the logfiles and there are 2 scenarios which are occurring.
1. Switch stack reboot
2. Switch stuck at bootrom menu due to invalid image file".

My manager would still like to know a root cause. Does anyone from the community have any idea of something else to check? With IRF we thought we would be protected from all 4 switches going down at the same time.

HPE Pro

Re: Help finding cause of core switch crash

Hi,

Could you please provide the details below:

  1. What software version was running at the time of the crash?
  2. Please upload the output of display diagnostic-information (the latest, plus any captured right after recovering from the issue)
  3. Confirm whether you can see the log file and diag file in flash (run the dir command to check)
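
As a rough sketch, these details are usually collected on Comware 5 with commands along the following lines. The <Sysname> prompt is a placeholder, and the lines starting with # are annotations only, not commands to type. Exact output varies by release.

```
# Software version running on the stack
<Sysname> display version
# List the files in flash (look for logfile.log and any .diag files)
<Sysname> dir
# Collect diagnostics; the switch asks whether to save to a file or display on screen
<Sysname> display diagnostic-information
```

If the diagnostic output is saved to a file, that file can be attached to the support case instead of being posted publicly.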

It may not be possible to find the root cause without the relevant logs. We can check whether there was a known issue or bug in the software version running at the time of the crash. Please post these details, and I will check and get back to you on whether the crash was caused by a software bug.

 

I am an HPE Employee

 


Occasional Contributor

Re: Help finding cause of core switch crash

1.  H3C Comware Platform Software, Software Version 5.20, Release 1807P022

2. I have already uploaded the logs to the support cases. I do not want to upload them here, as they contain the configuration of our core switches.

3. I can see logfile.log and default.diag files in flash. I do not see any other diag file, and the default.diag file has not been updated since 2012. 

Could you answer a question about how important it is to configure MAD for IRF? I have been doing some reading on this, and we do not currently have it configured. I think that we should probably implement BFD MAD, but I do not want it to be a disruptive process. We are also planning on replacing our switches within the next year, so I'm not sure that it is worth the trouble. 

HPE Pro
Solution

Re: Help finding cause of core switch crash

Hi,

Version 1807P022 is a very old software version and is below the supported level.

5800-5820X_5.20.R1810P16 is the latest version available on the portal. Please download the release notes from the link below. According to the release notes, several bugs related to switch reboots or crashes have been fixed.

https://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-a00061393en_us-1.pdf

How important is it to configure MAD for IRF?

  1. When you have IRF in place, it is important to configure MAD to avoid a split-brain scenario
  2. Split-brain: IRF members exchange keep-alives over the IRF ports. If members miss these keep-alives for any reason, they assume the other member is down. Suppose you have an IRF stack of 2 switches, where member 1 is the master and member 2 is the slave. If keep-alives are missed, member 2 assumes member 1 (the master) is down and declares itself master. At that point your network has two switches with the same IP address, both claiming to be master. This is called a split-brain scenario.
  3. To avoid a split-brain scenario, MAD should be configured
  4. MAD can be configured in different variants, such as BFD MAD, ARP MAD and LACP MAD
  5. Please read the documentation to choose the most suitable method
  6. For BFD MAD, you can apply the configuration online and then connect the cable, without any downtime
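
To illustrate point 6, here is a minimal sketch of a BFD MAD configuration on Comware 5. The VLAN ID, port names and IP addresses are assumptions chosen for illustration, and only two members are shown; a four-member stack would need a MAD IP address and a MAD link for every member. Lines starting with # are annotations only. Please check the IRF configuration guide for your release before applying anything, including its notes on spanning tree for the MAD ports.

```
<Sysname> system-view
# Dedicated VLAN used only for BFD MAD (VLAN 3 is an assumed, unused VLAN)
[Sysname] vlan 3
[Sysname-vlan3] port GigabitEthernet 1/0/1 GigabitEthernet 2/0/1
[Sysname-vlan3] quit
# Enable BFD MAD on the VLAN interface and assign one MAD IP address per member
[Sysname] interface vlan-interface 3
[Sysname-Vlan-interface3] mad bfd enable
[Sysname-Vlan-interface3] mad ip address 192.168.2.1 24 member 1
[Sysname-Vlan-interface3] mad ip address 192.168.2.2 24 member 2
[Sysname-Vlan-interface3] quit
# Check the MAD status afterwards
[Sysname] display mad verbose
```

The MAD VLAN and its ports should not be used for anything else, which is why a dedicated VLAN and dedicated cable are assumed here.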

Recommendation:

  1. I suggest you plan a maintenance window of about 1 hour
  2. Upgrade the switch software to the latest version allowed by your organisation's policy, such as the Nth or N-1th version (read the release notes before selecting a version)
  3. During the downtime, configure BFD MAD and connect the cable
  4. During the downtime window, you may also test MAD functionality by disconnecting or shutting down the IRF ports
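
A rough sketch of the checks around the maintenance window is shown below. The image file name is a placeholder, not the real file name from the portal, and the lines starting with # are annotations only. Because the stack previously tried to boot an image that was staged years earlier, it seems worth confirming the main and backup boot images on every member before any reboot.

```
# Before the window: see which main and backup images each member will boot
<Sysname> display boot-loader
# Set the new image as the main boot image on all members (placeholder file name)
<Sysname> boot-loader file flash:/<image-file>.bin slot all main
# After the reboot: confirm the stack has re-formed and check MAD status
<Sysname> display irf
<Sysname> display irf topology
<Sysname> display mad verbose
```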

I hope this information is helpful!

 

I am an HPE Employee

 

