
HP 5412 10GbE Module Issues / Troubleshooting Tricks

jondehen
Occasional Advisor


We recently upgraded the firmware on our 5412Rzl2 (J9851A) as such: KB_15_17_0007 > KB_16_02_0013.  All operations are normal except some issues with our single 8-Port 10-GbE SFP+ v2 zl module (J9538A) as follows:

  • Occasional 1-2 second disconnect/reconnects in the switch log from the 10GbE ports
  • DRASTIC, but occasional, slowdown of speeds, which seem to self-correct eventually
  • High numbers of Discard Rxs
  • High numbers of Drop Txs (but not nearly as high as Discard Rxs)

We are looking for any additional troubleshooting commands or techniques (other than show log and show interface <PORT>) which might yield insight into the issue.

Right now we're unsure if the issue is:

  • The new firmware (both primary and secondary were updated but we can still rollback one of them)
  • Failing hardware (cables, module, ports, NICs)
  • Drivers (NICs)
  • Hosts

More Info:

  • All ports in this module are in an unroutable VLAN, so there shouldn't be any communication in or out of it
  • Hosts have Intel X710s (rev 01) NICs (recent but not the latest firmware)
  • I have no pre-upgrade Discard or Drop counts for these ports, so I cannot tell whether the current numbers were already abnormal before the firmware upgrade

Please let me know if I can provide any other details.  Thank you!

18 REPLIES

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

I'd suggest logging a case and getting them to escalate to level2/3, you may find out they already know about this and a fix is slated for release. I've seen a few problems on Kx.16, and have already escalated one issue with a POE software bug - which turned out to be a known bug with unknown fix release date.

I wouldn't use Kx.16 in production yet, maybe test it in a lab for now until those early release cycle problems are resolved.

A useful troubleshooting tool is "show tech", and also have a look at "show instrumentation"

switch# show instrumentation ?

  • cam       Show internal version-dependent counters for debugging.
  • monitor       Show latest values for monitored parameters.
  • port       Show internal version-dependent counters for debugging the specified port.
  • resptime       Show service response time data for performance sensitive operations registered for response time measurement.
  • routing       Show routing related instrumentation parameters.
  • vlan       Show internal version-dependent counters for debugging the specified VLAN.

There is also a debug mode I've used in the past, that goes really deep into the "tech support" areas, but it gets quite complicated and probably is deeper than most customers would want to go.

http://networkgeekstuff.com/networking/procurve-and-hidden-command-line/

Search for term: edomtset

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Curious to know whether (and how) all eight 10GbE ports of the HP 8-port 10-GbE SFP+ v2 zl Module are used concurrently; port overcommitment/oversubscription may or may not play a role in the issue. First of all, start by collecting the status of each transceiver used on those ports. What does the command

show interfaces transceiver n detail (where n is the port number)

report?

Supposing that nothing but the firmware changed, the new firmware is the first culprit one would think of. But to diagnose that without being biased by the "bad new firmware versus good old firmware" idea (that is, without ruling out other possible sources of the issue), you should be reasonably sure that nothing else had changed in your environment before you did the firmware upgrade.

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Thanks for the replies!  I dug through those commands and didn't find anything particularly useful, although perhaps much of it is past my understanding.  I did check the transceiver statuses but not sure what to look for.

The firmware was the only thing which changed, unless you count the loss of network connectivity for the devices.  The devices in question are some VMware hosts and a few NAS devices.  We're going to try a full reboot of everything once we can afford downtime.

Can anyone explain the differences between firmware versions? (Major vs minor vs incremental) <MAJOR>.<MINOR>.<INCREMENTAL>

Perhaps an explanation of the three parts of the version number will help me decide which firmware to choose when upgrading...
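For comparing the release strings mentioned in this thread, a minimal Python sketch that splits a version such as KB_15_17_0007 into its parts may help. The field names (platform, major, minor, incremental) simply follow the <MAJOR>.<MINOR>.<INCREMENTAL> labels used above; they are an assumption, not the vendor's official terminology, and the sketch says nothing about what each part means for release stability.

```python
import re
from typing import NamedTuple


class FirmwareVersion(NamedTuple):
    platform: str      # letter prefix, e.g. "KB"
    major: int         # assumed label for the first number
    minor: int         # assumed label for the second number
    incremental: int   # assumed label for the third number


def parse_version(text: str) -> FirmwareVersion:
    """Parse strings like 'KB_16_02_0013' or 'KB.16.02.0013'."""
    match = re.fullmatch(r"([A-Za-z]+)[._](\d+)[._](\d+)[._](\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    platform, major, minor, incremental = match.groups()
    return FirmwareVersion(platform.upper(), int(major), int(minor), int(incremental))


old = parse_version("KB_15_17_0007")
new = parse_version("KB_16_02_0013")
print(new > old)  # tuples compare field by field, so this prints True
```

Because the result is a tuple, sorting a list of parsed versions orders them from oldest to newest within one platform prefix.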

 

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Mmm...the best document I've read is the HP ProVision Software Release Process (2015): it should explain exactly what you're looking for...

Would you mind posting the output of the command above for each of your 10Gb transceiver interfaces (first trim all serial numbers and any other sensitive information about your products/configurations)?

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Excellent PDF, thank you!  So the show interface transceiver n detail command is identical for all 8 ports, except for the incrementing Interface Index and the Serial Number....

Transceiver in L1
Interface Index : 353 (varies)
Type : SFP+DA7
Model : J9285B
Connector Type : Vendor specific
Wavelength : n/a
Transfer Distance : 7m (copper),
Diagnostic Support : None
Serial Number : <VARIES>

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Are those DAC Cables installed correctly (respecting the minimum bend radius, not below 1")?

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks


parnassus wrote:

Are those DAC Cables installed correctly (respecting the minimum bend radius, not below 1")?


It appears that they are all installed with at least this minimum.

Michael Patmon
Trusted Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

We are in the process of root causing an issue for that specific module (J9538A) on ports 4, 5, and 6.  In the meantime there is a configuration option you can disable that should alleviate the symptoms:

HP-Switch-5406Rzl2(config)# no tcp-push-preserve

There is a low-level issue causing head-of-line blocking on those ports in the presence of large amounts of TCP traffic with the push bit set.

 

HP-Switch-5406Rzl2(config)# tcp-push-preserve help
Usage: [no] tcp-push-preserve

Description: Enable TCP Push Preserve mode. This mode determines the
flow of the TCP packets that have the PUSH flag set. When
this mode is enabled and the egress queue is full, TCP
packets with the PUSH flag set are queued at the head of the
ingress queue for egress queue space. This might delay
subsequent incoming packets in the same queue. When this
mode is disabled and the egress queue is full, TCP packets
with the PUSH flag set are dropped from the head of the
ingress queue.

By default, this mode is enabled. Disable this mode when a
large number of TCP packets with the PUSH flag are being
dropped due to congestion.
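To make the trade-off in that help text concrete, here is a toy Python model of a single ingress queue. It is only a sketch of the behaviour as the help text describes it, not of the real switch hardware: the function name, the one-packet-per-tick timing, and the idea of one congested egress port "A" that clears after a fixed number of ticks are all assumptions made for illustration.

```python
from collections import deque


def run_ingress_queue(packets, congested_port, congested_ticks, preserve):
    """Process one ingress queue, examining the head packet once per tick.

    packets: list of (is_push, egress_port) tuples in arrival order.
    congested_port: egress port whose queue is full for the first
        `congested_ticks` ticks (hypothetical congestion pattern).
    preserve: True models tcp-push-preserve enabled, False models disabled.
    Returns a list of (tick, action, packet) events.
    """
    queue = deque(packets)
    events = []
    tick = 0
    while queue:
        is_push, port = queue[0]
        egress_full = port == congested_port and tick < congested_ticks
        if not egress_full:
            events.append((tick, "forward", queue.popleft()))
        elif is_push and preserve:
            # PUSH packet waits at the head; everything behind it waits too.
            events.append((tick, "stall", queue[0]))
        else:
            events.append((tick, "drop", queue.popleft()))
        tick += 1
    return events


# One PUSH packet for congested port A, followed by two packets for idle port B.
packets = [(True, "A"), (False, "B"), (False, "B")]
for mode in (True, False):
    print("preserve =", mode)
    for event in run_ingress_queue(packets, "A", 3, mode):
        print("  ", event)
```

With preserve enabled, the packets for the idle port B are held back until port A clears, which is the head-of-line blocking described above. With it disabled, the PUSH packet is dropped and the B packets go out immediately.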

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Interesting, Michael.  I think we're going to boot a dormant host on that module with a live DVD, mirror an active port to it, and use wireshark to investigate the actual traffic.  We'll hopefully be able to see what is being dropped or discarded, as well as if any of the TCP packets are indeed using the PSH flag or not.
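For the PSH check, a small Python sketch along these lines could summarise a capture once the per-packet TCP flags values have been exported. The PSH bit is 0x08 in the TCP flags field; how the flag values get out of the capture is left open here, and the sample list is made up purely for illustration. Inside Wireshark itself, the display filter tcp.flags.push == 1 shows the same packets.

```python
TCP_FLAG_PSH = 0x08  # PSH bit in the TCP flags field


def psh_fraction(flag_values):
    """Return the fraction of packets whose TCP flags include PSH.

    flag_values: iterable of integers, one TCP flags value per packet.
    """
    flags = list(flag_values)
    if not flags:
        return 0.0
    with_push = sum(1 for value in flags if value & TCP_FLAG_PSH)
    return with_push / len(flags)


# Hypothetical sample: ACK, PSH+ACK, ACK, ACK
sample = [0x10, 0x18, 0x10, 0x10]
print(psh_fraction(sample))  # prints 0.25
```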

Our issues also seem low level, and we'll likely end up rolling back firmware.  First, to the previous, and then second, to some newer ones (but not the absolute latest 16.02.xxxx).

I'll update here once we find more results.

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Interesting...that global setting was introduced, and enabled by default, in x.15.14.0007 (I recall it was already mentioned in another post). But the OP's switch started from KB.15.17, so it was already running a post-KB.15.14 software version. So: is it possible that the setting was in effect all along (enabled by default on every version since KB.15.14.0007) but didn't produce any negative effects on his network until the upgrade to KB.16.02? Why didn't the negative effects show up before, if no other changes (steadily growing traffic?) were introduced?

Michael Patmon
Trusted Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

We're still in the process of root causing it so I don't want to speculate here, but that "feature" was introduced with v1 modules many years ago.  The CLI command to enable/disable was added a few years back to address a particular issue at the time.  

We believe there was a recent change specific to the J9538A modules that made them more susceptible to HOL blocking in some scenarios, which can cause latency and other performance issues.  Disabling tcp-push-preserve is a workaround.  

I will post back when I have more information.  

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Is it possible that the no tcp-push-preserve will negatively impact other devices or throughput speed?  As it is a global setting I am hesitant to try it during production hours.  Honestly not sure what other devices might be using that setting.  We did find that about 25% of the TCP packets had the push flag set, for what it's worth.

We're planning to try a reboot sequence of the hosts and NAS devices, then those with the switch.  If the reboots do not alleviate the issue, we'll proceed with firmware rollback, as follows: 15.17.0007 (last known working).  If that works, we'll try the versions until we hit an issue: 15.17.0013 > 15.18.0013 > 16.01.0010 > 16.02.0013 (current)

 

Michael Patmon
Trusted Contributor
Solution

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

We believe this issue, specific to the J9538A module ports 4, 5, and 6, was introduced in K/KB.16.01.0008 and K/KB.16.02.0011.  

Disabling that feature should not negatively impact other devices or throughput.  Disabling it basically configures the switch to drop TCP push traffic as it would any other packet when the egress queue fills up.  The egress queue filling up incorrectly is the issue and "no tcp-push-preserve" helps in that condition.

 

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

Thank you for the information!  We'll try releases before those, and maybe with and after those.  Actually, we're seeing issues on ports besides 4, 5, and 6.  (for example, 2).  Switching from 2 to 8 seems to have helped.  Additionally, we're using hardware v2 of module J9538A.

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

That's interesting.

AFAIK the 8-port 10-GbE v2 zl Module (J9538A) has two static Channels built in: Channel 1 groups module ports 1, 4, 6 and 8, and Channel 2 groups module ports 2, 3, 5 and 7.

Each Channel provides a total aggregated bandwidth of 23.4 Gbps (so the total aggregated Channel bandwidth of the entire module is 46.8 Gbps).

The port assignment on each Channel of the J9538A Module is fixed, so the aggregated bandwidth is shared within that specific set of physical ports (as soon as those ports reach a "linked state" and so are active). The Ports versus Channels schema is:

  • Channel 1: Port 1
  • Channel 1: Port 4
  • Channel 1: Port 6
  • Channel 1: Port 8
  • Channel 2: Port 2
  • Channel 2: Port 3
  • Channel 2: Port 5
  • Channel 2: Port 7

Basically this means that if you need full wire-rate transfer speeds (10 Gbps full duplex, so 20 Gbps) you must not connect more than one 10 GbE port per Channel (that is, no more than one of the four ports in each Channel). That's because the module is oversubscribed: it simply cannot sustain 8 x 10 Gbps = 80 Gbps at wire rate [*].

That said, it's important to know which ports to connect (and which to leave unused) so that the connected ports can reach wire speed.
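The channel layout above can be turned into a quick worst-case check. The Python sketch below uses the port-to-channel mapping and the 23.4 Gbps per-channel figure from this post, and assumes every connected port tries to run at 20 Gbps (10 Gbps in each direction) at the same time; the function names are made up for illustration, and real traffic is rarely that demanding.

```python
# Port-to-channel mapping for the J9538A as listed in this post.
CHANNEL_OF_PORT = {1: 1, 4: 1, 6: 1, 8: 1, 2: 2, 3: 2, 5: 2, 7: 2}
CHANNEL_CAPACITY_GBPS = 23.4  # per channel, figure quoted in this post
PORT_DEMAND_GBPS = 20.0       # assumption: 10 Gbps each way, all ports busy


def channel_load(active_ports):
    """Sum the worst-case demand of the connected ports per channel."""
    load = {1: 0.0, 2: 0.0}
    for port in active_ports:
        load[CHANNEL_OF_PORT[port]] += PORT_DEMAND_GBPS
    return load


def oversubscribed_channels(active_ports):
    """Return the channels whose worst-case demand exceeds their capacity."""
    load = channel_load(active_ports)
    return sorted(ch for ch, gbps in load.items() if gbps > CHANNEL_CAPACITY_GBPS)


print(oversubscribed_channels([1, 2]))     # one port per channel: []
print(oversubscribed_channels([4, 5, 6]))  # ports 4 and 6 share channel 1: [1]
```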

Another thing worth paying attention to: v2 zl Modules (like the J9538A) get a maximum bandwidth of 40 Gbps per slot when the 5400R zl2 operates in v2 Compatibility Mode, whereas v3 zl2 Modules (like the J9993A, successor of the J9538A) get a maximum bandwidth of 80 Gbps per slot whether the 5400R zl2 operates in v2 Compatibility Mode or in v3-only Mode (v2 zl Modules are not supported in v3-only Mode).

[*] Question: do those 80 Gbps refer to full duplex or not?

Edit: it's also interesting to read the report for ProVision defect CR_0000213551, which relates to the J9538A Module (this particular issue was declared already fixed). As a workaround for that defect (so it wasn't the cause of the issue we're discussing here!), HPE advised using SFP+ Transceivers instead of SFP Transceivers on ports 4, 5 or 6 or, if possible, using the remaining ports 1, 2, 3, 7 or 8...

jondehen
Occasional Advisor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

We rolled back from 16.02.13 to 16.02.10 and the issues with discards and drops appear to be resolved.  We did not use the workaround command to ignore the PUSH flag.

We also rearranged our connections to more accurately adhere to the channels.  I find it strange that the channels are not 1-4/5-8 OR even/odd ports.

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

That's good to know.

Yeah, the channel <--> port binding order looks strange to me too...but that's how it is, look below (from the glorious HP 5400 zl Switches Installation and Getting Started Guide, Manual Part Number: 5998-2998, June 2013):

Screenshot_1.png

or here (specifically sheet 9).

parnassus
Honored Contributor

Re: HP 5412 10GbE Module Issues / Troubleshooting Tricks

The ArubaOS-Switch KB.16.02.0014 Release Notes are worth reading, especially regarding:

  • Enhancement: TCP Push Preserve mode is set to disabled by default now.
  • Fix: CR_0000216989, "Switch performance degrades when using ports 4, 5, or 6 on J9583A switch" (IMHO partially related to the TCP Push Preserve enhancement above, given the "...may improve..." wording of the workaround).