Switches, Hubs, and Modems

5308 mesh woes

Les Ligetfalvy
Esteemed Contributor

I have an existing multi-mode fibre cable plant strung over two kilometres that I wanted to upgrade to gig. Because of the distance limits, I could not simply forklift my existing Cisco core switch, which is running 100 meg over the fibres. Mesh looked like the best way to go, since I have to pass through several switches to make the distance but wanted to limit single points of failure.

My mesh design involves six switches. Each switch has a connection to its nearest upstream and downstream switch, plus a connection that leap-frogs its closest upstream and downstream switch to reach the one beyond. This way, no single switch failure should break connectivity between the two extremes; that would require the loss of two adjacent switches. My server farm is at one end of this string of switches.
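The failure-tolerance claim can be checked with a small graph model. This is only a sketch: the switch numbering (0 as the server farm end, 5 as the far end) and the helper names are assumptions made for illustration, and the 100 meg link that closes the loop is deliberately left out.

```python
from collections import deque
from itertools import combinations

N = 6  # switches numbered 0..5 (assumed numbering, 0 = server farm end)


def build_links(n):
    """Nearest-neighbour links plus leap-frog links that skip one switch."""
    links = set()
    for i in range(n - 1):
        links.add((i, i + 1))  # nearest upstream/downstream switch
    for i in range(n - 2):
        links.add((i, i + 2))  # leap-frog over the closest switch
    return links


def connected(alive, links):
    """True if every switch in `alive` can reach every other one."""
    alive = set(alive)
    if not alive:
        return True
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for a, b in links:
            if node not in (a, b):
                continue
            nxt = b if a == node else a
            if nxt in alive and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == alive


def survives(failed, links, n=N):
    """True if the surviving switches still form one connected network."""
    return connected(set(range(n)) - set(failed), links)


links = build_links(N)

# Any single switch failure leaves the rest connected.
single_ok = all(survives({s}, links) for s in range(N))

# Pairs of failed switches that split the survivors into islands.
breaking_pairs = [p for p in combinations(range(N), 2) if not survives(p, links)]

print(single_ok)
print(breaking_pairs)
```

In this model every pair that splits the network is a pair of adjacent switches in the middle of the string, which matches the reasoning above.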

Trying to be optimistic, though I suspected it was futile, I set up a mesh connection using 100 meg fibre between the two extremes, which basically closed the loop. In theory at least, load balancing should de-prioritize the 100 meg link in favour of the gig links, or so I thought, but that is another story for another day.

A few days after upgrading my entire mesh to 9.03, I got a "bus error" crash on the switch that is one hop away from my server farm.

My mesh network serves the production/business side. I have another network that serves our process control, which connects through a firewall at the server farm end of my mesh (discounting the 100 meg link that turns my daisy-chain into a ring). Two days ago, half of my mesh lost connectivity with the one port that connects to the firewall. I could ping it from three of the six switches. We ended up rebooting the closest switch that could not ping the firewall. This restored service to the rest of the mesh, at least temporarily. I have since experienced loss of connectivity from the further half of the mesh on the same and other ports at the server farm. The only difference is that at times it would recover without my intervention, and another time I forced the mesh links off-line one at a time to get it to relearn the mesh.
12 REPLIES
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

I got an action plan from HP:
1. I remove the 100FX mesh link
2. enable RSTP
3. take a mesh traceroute the next time it behaves badly
4. capture a trace of broadcasts that seem to follow some PC powerups
5. capture a trace of a mesh port when it behaves badly.

At noon today, I also rebooted the switch that had crashed twice. I had earlier turned off LLDP on HP's advice.

I have not experienced any more mesh failures today. Next week might be a different story.

I have not set up RSTP because my Nortel and Cisco switches only support STP, not RSTP. I cannot drop down to using STP because it does not play well with meshing. It was suggested that, instead of partnering the 100FX link in the mesh, I manage it with RSTP. I am not sure how to set up RSTP on the 5308 so that my Nortel edge switches can still function properly.

Has anyone else mixed link speeds on a mesh or have advice on setting up STP/RSTP in a multi-vendor switch network?
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

I still don't know what to make of the problem being the single 100FX partner in the otherwise gig mesh. The manual clearly states on page 7-2:

"Unlike trunked ports, the ports in a switch mesh can be of different types and speeds...".

Now, it might be that there is a bug in the code and that the 100FX link exacerbates it. Maybe unplugging one of my gig links would have yielded the same result. It really bothers me that a subtle change in the mesh-scape could drive this problem underground.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

This morning I reconnected the 100FX mesh link and the mesh did not destabilize. No traffic load balanced over to it either.

On another note, I got E_09_22.swi code today for the bus error issue and will have to schedule the downtime to implement it.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

Blogging on...
Mesh is still stable... well, maybe. This one particular 5308 still has some odd behavior. I recently received some SNMP errors in PCM, so I ran a meshtraceroute and it failed:


TC HP Switch 5308XL$ meshtraceroute 00110a472d00 1
Traceroute to MAC Address: 00110a-472d00 VID: 1
hop Switch Address Hostname inPort outPort inCost outCost Speed
0 00306e-bf1000 TC HP Switch 53 G16 0 231 1000
1 00110a-48c700 ScreenRoom HP 5 A3 0 0 0
Error: Address is unknown at hop 1
TC HP Switch 5308XL$


HP then tried a few days later but got no error. I am not making this stuff up. Unless I have another persistent mesh instability episode that lasts long enough for HP to look at it, this issue is going nowhere.
Nick Hancock
Occasional Visitor

Re: 5308 mesh woes

Les,

Interesting reading your issues. I have avoided switch meshing where possible because I have an aversion to proprietary technologies, although it does have its uses.

Switch meshing appears to be a similar technology to 3Com XRN or Nortel Split-MLT, where the switches use a hashing algorithm (or exchange MAC databases). One of the uses I am considering it for is to avoid the layer 2 spanning tree diameter issues you get when you have large numbers of daisy-chained switches.

You should be careful with deployment of RSTP/802.1w in conjunction with STP. There are a few gotchas:

- Path costs were modified, so calculated preferred paths sometimes go a bit odd.
- You need to look at RSTP/STP compatibility modes, because early RSTP versions (on Cisco; may be true on HP, I don't know) reverted to STP convergence times when they saw an STP BPDU coming in. I believe there are ways round this now.
- Make sure you know which variant of RSTP you are using. The 9300 naming scheme, for example, is a bit weird because of the presence of RSTP draft 3 compatibility. Using "spanning-tree rstp" gives you draft 3 rather than standards-based RSTP, i.e. 802.1w, which you get using "spanning-tree 802.1w".
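To illustrate the path-cost gotcha, here is a rough sketch comparing the default port costs recommended by IEEE 802.1D-1998 (the short, 16-bit values) with the larger values introduced by 802.1t and used by 802.1w. The two example paths and the `path_cost` helper are hypothetical, and whether a given switch actually mixes the two scales depends on its firmware and configuration.

```python
# Default port path costs keyed by link speed in Mbit/s.
SHORT_COST = {10: 100, 100: 19, 1000: 4, 10000: 2}  # IEEE 802.1D-1998
LONG_COST = {10: 2_000_000, 100: 200_000, 1000: 20_000, 10000: 2_000}  # 802.1t / 802.1w


def path_cost(link_speeds, table):
    """Sum the per-link costs along a path, using one cost table."""
    return sum(table[speed] for speed in link_speeds)


gig_path = [1000, 1000]      # hypothetical path: two gigabit links
slow_path = [100, 100, 100]  # hypothetical path: three 100 Mbit links

# When every switch uses the same scale, the gig path wins either way.
short_only = (path_cost(gig_path, SHORT_COST), path_cost(slow_path, SHORT_COST))
long_only = (path_cost(gig_path, LONG_COST), path_cost(slow_path, LONG_COST))

# If the gig path is costed on the long scale and the slow path on the
# short scale, the slow path looks far cheaper than it should.
mixed = (path_cost(gig_path, LONG_COST), path_cost(slow_path, SHORT_COST))

print(short_only, long_only, mixed)
```

The usual remedy is to set port costs explicitly so that all switches in the spanning tree use one scale.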

Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

Thanks for that, Nick.

I will wait to see what HP comes up with in regard to the 100FX. Today I encountered several mesh instability episodes.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

I encountered another mesh instability episode since disconnecting the 100FX.

I took the whole network down and flashed all the switches to E_09_22. Since rebooting them, they have been stable.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

On the 14th, the mesh again became unstable and has not gone a day without some disruption of service.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

We decided to abandon the 100FX mesh partner. Using a pair of long-haul media converters (Transition Networks SGETF1024-105), we upgraded the kilometre-long fibre run to gig, which effectively put the Operations Centre switch closer to the server farm. When counting hops, you count the number of switches, not the number of mesh links (2 switches, 1 mesh link = 2 hops).

With the leap-frog daisy-chain connections of the mesh, it was always 4 or 5 hops away (even though there were 6 switches), and the mesh always messed up its MAC tables at 3 hops, so it affected only those communications that tried to traverse the unstable portion in the middle. Adding the new gig link effectively turns it into a ring, completely changing the mesh topology.
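The hop counts can be reproduced with a breadth-first search over a model of the mesh. This is a sketch under assumptions: switches are numbered 0 (server farm) to 5, the Operations Centre switch is taken to be switch 5 at the far end, and the `hops` helper is made up for illustration. It counts switches on the path, per the convention above.

```python
from collections import deque


def hops(links, src, dst):
    """Number of switches on the shortest path (2 switches, 1 link = 2 hops)."""
    count = {src: 1}  # the source switch itself counts as hop 1
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return count[node]
        for a, b in links:
            if node not in (a, b):
                continue
            nxt = b if a == node else a
            if nxt not in count:
                count[nxt] = count[node] + 1
                queue.append(nxt)
    return None  # destination unreachable


# Nearest-neighbour links plus leap-frog links across six switches.
chain = [(i, i + 1) for i in range(5)] + [(i, i + 2) for i in range(4)]

# The same mesh with the new gig link joining the two extremes.
ring = chain + [(0, 5)]

before = hops(chain, 0, 5)
after = hops(ring, 0, 5)
print(before, after)
```

In this model the shortest route between the extremes drops from 4 hops to 2 once the gig link closes the ring.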

All my tests show that the new (2 hop) route is the only one carrying significant traffic. Now that does not mean that sessions will never try to traverse the previous 5 hop route as load balancing will determine the best route at any given time based on link utilization.

One mesh port was seen toggling off and on repeatedly during an instability episode, so we changed out gig modules as well. We also noticed retransmissions and drops reported on a few mesh ports. It is hard to say if they caused the instability or if the instability caused them. I would have thought that the interswitch communications that determine the mesh reconvergence would have been robust enough to recover from port/link faults. I will have to run end-to-end tests on all the fibre mesh links, but I am unsure how to test the gig modules and GBICs.

Only time will tell if our efforts were well directed or not. I had a Dutch Uncle talk with someone at HP and it was pointed out that my criticism of HP Division was not well received. I apologize for venting my frustration through this forum and thank Division for their help in this matter.
Stuart Teo
Trusted Contributor

Re: 5308 mesh woes

Les,

I didn't have time to read all your posts or read them in detail but I have some environments that run MESH, STP and RSTP.

I'm sorry to tell you that running MESH is a risk you take. HP is secretive about the protocol and troubleshooting it is not easy. With STP, at least you can read the IEEE standard and try to figure out where/what failed by sniffing for packets.
If a problem can be fixed, there's nothing to worry. If a problem can't be fixed, worrying ain't gonna help. Bottom line: don't worry.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

My mesh problems continued to plague me over the past few days. I started disabling redundant mesh links in an effort to ferret out where the root cause is for this instability. Of the 11 mesh links that I have, I have thus far disabled, one after the other, 9 of them and continued to have instability. I disabled them in such a sequence as to remove redundancy but not disrupt network services to the switches or the users. I would re-enable the links to maintain service so that I could disable other redundant links. Where possible, I would disable all but one link on a switch to effectively remove the switch from propagating the mesh yet maintaining service to the switch. This morning I disabled the 10th of 11 possible links.

It is entirely possible that more than one of these links is at fault, and unless I try every possible combination of links, the tests are not conclusive. If there is anyone with experience in troubleshooting mesh problems, I would much appreciate some guidance.
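To give a feel for how large "every possible combination" is with 11 links, here is a quick count. The link labels are placeholders, and the sketch ignores the practical constraint of keeping every switch reachable during a test.

```python
from itertools import combinations
from math import comb

TOTAL_LINKS = 11
link_names = [f"link{i}" for i in range(1, TOTAL_LINKS + 1)]  # placeholder labels

# Every on/off combination of 11 links.
all_combinations = 2 ** TOTAL_LINKS

# Test plans that disable exactly two or exactly three links at once.
pair_plan = list(combinations(link_names, 2))
triple_count = comb(TOTAL_LINKS, 3)

print(all_combinations, len(pair_plan), triple_count)
```

Disabling links one at a time covers only 11 of these cases, so a fault that needs two bad links to show up could easily slip through.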
Stuart Teo
Trusted Contributor

Re: 5308 mesh woes

Meshing has caused many woes in my environment for many years, ever since we first enabled it on the 4000/8000Ms. We've since moved away from meshing to one form of STP or another.

What we've learned is that there seems to be some form of route table on each mesh-enabled switch, and the routing info gets shared between them. Rebooting the correct switches usually revives the mesh.