Les Ligetfalvy
Esteemed Contributor

5308 mesh woes

I have an existing multi-mode fibre cable plant strung over two kilometres that I wanted to upgrade to gig. Since there are distance limits on gig over multi-mode, I could not simply forklift my existing Cisco core switch, which is running 100 meg over the fibres. Mesh looked like the best way to go since I have to pass through several switches to cover the distance, but I wanted to limit single points of failure.

My mesh design involves six switches. Each switch has a connection to its nearest upstream and downstream switch, as well as a connection that leap-frogs its closest upstream and downstream neighbour. This way, no single switch failure should break connectivity between the two extremes; that would require the loss of two adjacent switches. My server farm is at one end of this string of switches.
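
To sanity-check that design, here is a minimal sketch in Python (the six-switch chain with nearest-neighbour plus leap-frog links is my reconstruction of the topology described above, and the numbering is mine) that brute-forces switch failures and confirms the two extremes stay connected:

from itertools import combinations

# Switches 0..5 in a chain; each switch links to its neighbour (i, i+1)
# and leap-frogs one switch (i, i+2). This adjacency is an assumption
# based on the design described above.
LINKS = [(i, i + 1) for i in range(5)] + [(i, i + 2) for i in range(4)]

def connected(a, b, failed):
    # Flood-fill over the links whose endpoints are both still up.
    if a in failed or b in failed:
        return False
    seen, frontier = {a}, [a]
    while frontier:
        node = frontier.pop()
        for u, v in LINKS:
            if u in failed or v in failed:
                continue
            for x, y in ((u, v), (v, u)):
                if x == node and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return b in seen

# Any single switch failure leaves the extremes (0 and 5) connected.
assert all(connected(0, 5, {f}) for f in range(1, 5))
# Losing two adjacent switches partitions the chain...
assert not connected(0, 5, {2, 3})
# ...but losing two non-adjacent switches does not.
assert all(connected(0, 5, set(p))
           for p in combinations(range(1, 5), 2) if p[1] - p[0] > 1)
print("only two adjacent failures can split the mesh")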

Trying to be optimistic, though suspecting it was futile, I set up a mesh connection using 100 meg fibre between the two extremes that basically closed the loop. In theory at least, load balancing should deprecate the 100 meg link in favour of the gig links, or so I thought, but that is another story for another day.

A few days after upgrading my entire mesh to 9.03, I got a "bus error" crash on the switch that is one hop away from my server farm.

My mesh network serves the production/business side. I have another network that serves our process control; it connects through a firewall at the server-farm end of my mesh (discounting the 100 meg link that turns my daisy-chain into a ring).

Two days ago, half of my mesh lost connectivity with the one port that connects to the firewall; I could ping it from only three of the six switches. We ended up rebooting the closest switch that could not ping the firewall, which restored service to the rest of the mesh, at least temporarily. I have since experienced loss of connectivity from the farther half of the mesh on the same and other ports at the server farm. The only difference is that it would at times recover without my intervention; another time, I forced the mesh links off-line one at a time to get it to relearn the mesh.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

I got an action plan from HP:
1. Remove the 100FX mesh link.
2. Enable RSTP.
3. Take a mesh traceroute the next time it misbehaves.
4. Capture a trace of the broadcasts that seem to follow some PC power-ups.
5. Capture a trace of a mesh port when it misbehaves.

At noon today, I also rebooted the switch that had crashed twice. I had earlier turned off LLDP on HP's advice.

I have not experienced any more mesh failures today. Next week might be a different story.

I have not set up RSTP because my Nortel and Cisco switches only support STP, not RSTP, and I cannot drop down to STP because it does not play well with meshing. It was suggested that instead of partnering the 100FX link in the mesh, I manage it with RSTP. I am not sure how to set up RSTP on the 5308 so that my Nortel edge switches can still function properly.

Has anyone else mixed link speeds in a mesh, or does anyone have advice on setting up STP/RSTP in a multi-vendor switch network?
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

I still don't know what to make of the problem being the single 100FX partner in the otherwise all-gig mesh. The manual clearly states on page 7-2:

"Unlike trunked ports, the ports in a switch mesh can be of different types and speeds...".

Now, it might be that there is a bug in the code and that the 100FX link exacerbates it. Maybe unplugging one of my gig links would have yielded the same result. It really bothers me that a subtle change in the mesh-scape could drive this problem underground.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

This morning I reconnected the 100FX mesh link and the mesh did not destabilize. No traffic load balanced over to it either.

On another note, I got E_09_22.swi code today for the bus error issue and will have to schedule the downtime to implement it.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

Blogging on...
Mesh is still stable... well, maybe. This one particular 5308 still has some odd behavior. I recently received some SNMP errors in PCM, so I ran a mesh traceroute and had it fail:


TC HP Switch 5308XL$ meshtraceroute 00110a472d00 1
Traceroute to MAC Address: 00110a-472d00 VID: 1
hop Switch Address Hostname inPort outPort inCost outCost Speed
0 00306e-bf1000 TC HP Switch 53 G16 0 231 1000
1 00110a-48c700 ScreenRoom HP 5 A3 0 0 0
Error: Address is unknown at hop 1
TC HP Switch 5308XL$


HP then tried a few days later but got no error. I am not making this stuff up. Unless I have another persistent mesh instability episode that lasts long enough for HP to look at it, this issue is going nowhere.
Nick Hancock
New Member

Re: 5308 mesh woes

Les,

Interesting reading your issues. I have avoided switch meshing where possible because I have an aversion to proprietary technologies, although it does have its uses.

Switch meshing appears to be a similar technology to 3Com XRN or Nortel Split-MLT, where the switches use a hashing algorithm (or exchange MAC databases). One of the uses I am considering it for is to avoid the layer 2 spanning tree diameter issues you get when you have large numbers of daisy-chained switches.
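
As a toy illustration of the hashing idea (the CRC-based algorithm and MAC-pair inputs here are my assumptions, not the actual ProCurve meshing, XRN, or Split-MLT internals), something like this pins each conversation to one link while spreading different conversations across all of them:

from zlib import crc32

def pick_link(src_mac, dst_mac, links):
    # Deterministically map a MAC pair to one of the available links,
    # so frames for a given conversation always follow the same path
    # and cannot arrive out of order.
    digest = crc32(f"{src_mac}->{dst_mac}".encode())
    return links[digest % len(links)]

links = ["A1", "A2", "B1"]
print(pick_link("00306e-bf1000", "00110a-472d00", links))
# Same pair, same link every time; different pairs spread across links.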

You should be careful deploying RSTP/802.1w in conjunction with STP; there are a few gotchas:

- Path costs were modified, so calculated preferred paths sometimes go a bit odd.
- You need to look at RSTP/STP compatibility modes, because early RSTP versions (on Cisco; this may be true on HP too, I don't know) reverted to STP convergence times when they saw an STP BPDU coming in. I believe there are ways around this now.
- Make sure you know which variant of RSTP you are using. The 9300 naming scheme, for example, is a bit odd because of the presence of RSTP draft 3 compatibility: "spanning-tree rstp" gives you draft 3 rather than standards-based RSTP, i.e. 802.1w, which you get using "spanning-tree 802.1w".

Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

Thanks for that, Nick.

I will wait to see what HP comes up with in regard to the 100FX. Today I encountered several mesh instability episodes.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

I have encountered another mesh instability episode since disconnecting the 100FX.

I took the whole network down and flashed all the switches to E_09_22. Since rebooting them, they have been stable.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

On the 14th, the mesh again became unstable and has not gone a day without some disruption of service.
Les Ligetfalvy
Esteemed Contributor

Re: 5308 mesh woes

We decided to abandon the 100FX mesh partner. Using a pair of longhaul media converters (Transition Networks SGETF1024-105), we upgraded the kilometre-long fibre run to gig, which effectively put the Operations Centre switch closer to the server farm. When counting hops, you count the number of switches, not the number of mesh links (2 switches, 1 mesh link = 2 hops).

With the leap-frog daisy-chain connections of the mesh, the Operations Centre switch was always 4 or 5 hops away (even though there were 6 switches), and the mesh always messed up its MAC tables at 3 hops, so it affected only those communications that tried to traverse the unstable portion in the middle. Adding the new gig link effectively turns the daisy-chain into a ring, completely changing the mesh topology.
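
To make the hop arithmetic concrete, here is a small Python sketch of that counting convention (the adjacency is my reconstruction of the chain-plus-leap-frog topology, with the server farm as switch 0 and the Operations Centre as switch 5):

from collections import deque

def hops(adj, src, dst):
    # Hops = switches traversed, counting the source as the first hop
    # (2 switches, 1 mesh link = 2 hops).
    seen, queue = {src}, deque([(src, 1)])
    while queue:
        node, count = queue.popleft()
        if node == dst:
            return count
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, count + 1))
    return None

# Nearest-neighbour plus leap-frog links, as described earlier.
adj = {i: set() for i in range(6)}
for a, b in [(i, i + 1) for i in range(5)] + [(i, i + 2) for i in range(4)]:
    adj[a].add(b)
    adj[b].add(a)

print(hops(adj, 0, 5))  # 4 hops over the leap-frog chain (0-2-4-5)

adj[0].add(5)
adj[5].add(0)           # the new gig link closes the ring
print(hops(adj, 0, 5))  # 2 hops: server farm to Operations Centre directly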

All my tests show that the new (2 hop) route is the only one carrying significant traffic. That does not mean sessions will never try to traverse the previous 5 hop route, as load balancing determines the best route at any given time based on link utilization.

One mesh port was seen toggling off and on repeatedly during an instability episode, so we changed out the gig modules as well. We also noticed retransmissions and drops reported on a few mesh ports. It is hard to say whether they caused the instability or the instability caused them. I would have thought that the interswitch communications that determine mesh reconvergence would be robust enough to recover from port/link faults. I will have to run end-to-end tests on all the fibre mesh links, but I am unsure how to test the gig modules and GBICs.

Only time will tell if our efforts were well directed or not. I had a Dutch Uncle talk with someone at HP and it was pointed out that my criticism of HP Division was not well received. I apologize for venting my frustration through this forum and thank Division for their help in this matter.