Switches, Hubs, and Modems

Internal frame loss on Procurve 2848 - need debug help!!!

Dan Fossum
Occasional Visitor

I have a Procurve 2848 switch that is losing Ethernet frames between an Ingress LAG group and an un-lagged Egress port and I don't know why.

Facts:
- I'm losing about 1 frame in 6 somewhere inside the switch
- the traffic rates in question are well below 1GE line rate.
- dropping membership in the LAG down to 1 GE link completely clears the condition.
- no discard counters are incrementing on any interface on the switch
- replacing the 2848 with another produces the same loss
- port mirroring one of the ingress LAG interfaces to a sniffer shows all expected traffic on that link being received
- port mirroring on the egress interface shows that many frames are missing
- comparing Rx interface stats on the ingress ports and Tx interface stats on the egress port confirms the loss, i.e. the sum of Rx stats over both ingress ports is greater than what's transmitted out the egress port.
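
For anyone wanting to repeat the counter comparison in the last point, here is a minimal sketch of the arithmetic. The function name and the counter values are made up for illustration; they are not readings from the switch, and the deltas are assumed to be taken over the same interval on all ports.

```python
def internal_loss(rx_ingress, tx_egress):
    """Return (missing_frames, loss_fraction) for frames received on the
    ingress LAG members but never transmitted on the egress port."""
    total_rx = sum(rx_ingress)
    missing = total_rx - tx_egress
    fraction = missing / total_rx if total_rx else 0.0
    return missing, fraction


# Hypothetical counter deltas (frames), not real data from the 2848
missing, fraction = internal_loss([300_000, 300_000], 500_000)
print(missing, round(fraction, 3))  # 100000 0.167, i.e. about 1 frame in 6
```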

I dug out the "show tech stats" command. The "drops Tx" stat pasted below is a bit noisy, but it does increment very much in line with the discards I'm measuring.

Does anyone know what this discard stat means?
Is there anywhere else in the switch I can look for better information about the discards?

sollabswitch12# show tech stat

internalstatistics


Status and Counters - System Wide Counters

External Totals (Since boot or last clear) :

Drops Tx : 1,319,882,752
Mohieddin Kharnoub
Honored Contributor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

Hi

RX / TX Drops counter indicates that some ports were too busy to receive the data transmitted by the other side.

So this indicates that slower ports could not keep up with the packet stream coming from the other side.

Methods of troubleshooting this scenario include enabling more streamlined packet buffering on the switches by issuing the "qos-passthrough-mode" command at the config level.

More information in the following link: http://www.hp.com/rnd/library/troubleshoot_lan.htm

Good Luck !!!
Science for Everyone
Matt Hobbs
Honored Contributor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

What ports are you testing with? Internally the ports are grouped, so you may want to try testing on a group of ports, like 1-12.

Are you using any 100Mbit devices in this test? If so, enable 'qos-passthrough-mode' (in fact I'd try that anyway).

At what rate are you trying to send this data through and what's your average packet size?

Also what firmware version are you using?
Dan Fossum
Occasional Visitor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

Thanks for your input Mohieddin and Matt.

Additional info based primarily on your questions:

- I have set "qos-passthrough-mode" to each of "typical", "balanced", "one-port" and "optimized". The "balanced" setting results in a 4% reduction in drops, but still 1 in 6 or 1 in 7 messages don't get through. No appreciable difference with any of the other settings.

- The ports in use were somewhat scattered around the switch. I'm now using ports 39, 40 and 44. No difference there.

- All links and peer devices are running at 1000Mbps auto.

- Aggregate steady-state traffic rate is just below 200Mbps

- The data consists of multiple TCP flows. In the direction the loss is occurring in, these consist mainly of ~1400 byte, ~900 byte or ~300 byte frames at layer 2. The reverse path generally has only min-sized TCP ACKs.

- The switch is running this:
Image stamp: /sw/code/build/mako(ts_08_5)
May 5 2006 09:47:52
I.08.98
189


I do not understand the internal architecture of this switch. I find it very curious that none of the port counters are indicating any discard. Is there another command that can tell me in more detail what the global "Drops Tx" counter in 'show tech stat' is trying to say?

Thanks again for your input so far.

Matt Hobbs
Honored Contributor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

What's the device connected to your LAG groups? How is it load balancing? The ProCurve switches use SA/DA mac-address pairs.
Dan Fossum
Occasional Visitor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

Matt,

The traffic being LAGged into the ProCurve is being balanced using a hashing algorithm considering MAC DA/SA, IP DA/SA and TCP/UDP src/dest port. I've validated that the balancing is working properly. That is, individual TCP flows are fixed to the same physical interface (validated via port mirror and packet capture). My original suspicion was that the load balancing wasn't working properly, resulting in TCP re-ordering, the resultant retransmissions, and thus poor end-to-end application behaviour. This investigation did start out with reports of an application layer problem.
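
To illustrate what "fixed to the same physical interface" means, here is a rough sketch of hash-based member selection. It is not the peer device's actual algorithm: the CRC32 hash, the key layout, the function name and the example addresses are all assumptions for illustration. The only point is that a deterministic hash over the flow fields always maps a given TCP flow to the same link.

```python
import zlib


def pick_link(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port, n_links):
    """Map a flow to a LAG member index (hypothetical hash, for illustration)."""
    fields = (src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port)
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % n_links


# Made-up flows: same hosts, different TCP source ports
flows = [
    ("00:00:00:00:00:01", "00:00:00:00:00:02", "10.0.0.1", "10.0.0.2", p, 80)
    for p in (40001, 40002, 40003, 40004)
]
for flow in flows:
    # Every frame of a given flow yields the same index, so no re-ordering
    print(flow[4], "-> link", pick_link(*flow, n_links=2))
```

A packet capture on each member link can then be checked against this property: frames of one flow should only ever show up on one link.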

I have a test client which I can set up to vary the frame size and rate across multiple simultaneous TCP streams. Using this I've been able to determine that the most important factor leading to the loss is frame size. Frames of 350 bytes and above incur loss; frames below 300 do not. The internal loss occurs at offered rates below even 8Mbps.

My working theory is that even though the traffic is being sent at a low rate, there is some burstiness that the switch can't handle when it's passing the frames from the ports to the internal switch fabric. With a single interface, the burstiness is smoothed on the way in by virtue of the fact that frames need to be serialized across the one link. But this is just a theory, and it would surprise me if the switch couldn't accommodate little bursts like this. I would also have thought there would be port counters for such a thing.
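
As a toy model of that theory only (not of the 2848's real internals, which I don't know), the sketch below feeds a short burst into a single egress queue that drains at line rate. Time is counted in byte-times, and `buffer_bytes`, `burst` and `simulate` are hypothetical names and values chosen for illustration. With one ingress link the frames arrive serialized; with two links, two frames can land at the same instant.

```python
def burst(n_frames, size, n_links):
    """Arrival list (time, size): frames are spread round-robin over n_links,
    and each link delivers one frame every `size` byte-times."""
    return [((i // n_links) * size, size) for i in range(n_frames)]


def simulate(arrivals, buffer_bytes):
    """Egress drains 1 byte per byte-time. Returns (sent, dropped)."""
    backlog = 0   # bytes currently waiting in the egress queue
    last_t = 0
    sent = dropped = 0
    for t, size in arrivals:
        backlog = max(0, backlog - (t - last_t))  # drain since last arrival
        last_t = t
        if backlog + size > buffer_bytes:
            dropped += 1          # no room: frame is discarded
        else:
            backlog += size
            sent += 1
    return sent, dropped


# Hypothetical 1500-byte queue, 4-frame burst
print(simulate(burst(4, 1000, 1), 1500))  # (4, 0): one link, serialized
print(simulate(burst(4, 1000, 2), 1500))  # (2, 2): two links, large frames
print(simulate(burst(4, 300, 2), 1500))   # (4, 0): two links, small frames
```

In this toy setup large frames overflow the queue on a two-link burst while small ones fit, which is at least the same shape as what I'm measuring. It proves nothing about the real switch.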

What's more bizarre to me is the fact that we're able to actually capture the frames at one port on the switch and not see them leave another port. Again, not knowing how the internals of the switch are architected, the ability to capture and mirror the 'missing' frames is pretty conclusive proof that they were received fine. No Ethernet problems like IFG violations, FCS errors etc. Also the contents of the frame from the Ethernet header through to the TCP payload all check out.

Do you or anyone else know of any more internal counters I can pull out of this switch to better diagnose what it's doing?

Matt Hobbs
Honored Contributor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

Dan, you've provided some excellent information there. At this point in time I'd recommend you open a case with HP support since this seems like it shouldn't be too hard to reproduce.

They may be able to test the same thing with a different hardware platform (e.g. 5300, 3500, etc.) and see if this still occurs. Depending on the results they'd likely elevate it internally to understand if this is expected behaviour and if anything can be done to rectify it on the 2800.

If you could, provide support with a written summary of the issue and the testing you've done. Also include 'show tech all' reports from the switch, and reference this thread if need be.
Dan Fossum
Occasional Visitor

Re: Internal frame loss on Procurve 2848 - need debug help!!!

Matt, I definitely appreciate your input. Ticket raised with HP. I'll let you know what comes of it.

Dan.