IRF on Blade Switches : IRF port asymmetry

fabbb · ‎10-21-2017

Hello

I have a question about IRF on blade switches; I can't find the answer in the documentation.

Doc 'HP 6125 Switch Series: Configuring IRF' gives some configuration highlights, but I can't figure out if my config is correct or not.

I have 2 C7000 blade enclosures with 2 6125XLG SWs in each, and the 4 SWs of the 2 C7000 enclosures form one single IRF. The "internal" IRF ports (i.e. connecting the SWs of one C7000) are 4x 10G (the internal crosslink ports 1/0/17-20) and the external IRF ports (connecting the C7000s together) are 2x 40G.

SW's are physically connected this way, for IRF:

+-- SW11 --4x10G-- SW12 ------2x40G------ SW21 --4x10G-- SW22 --+
|                                                               |
+-----------------------------2x40G-----------------------------+

Now I have two doubts:

1. Is it a problem not to have the same bandwidth on the 2 IRF ports of one SW (40G vs 80G)?

(Note that the actual traffic is 10G max across the whole system.)

2. Is it a problem not to have the same number of links on the 2 IRF ports of one SW (4 vs 2)?

(I chose 2 external links to have link redundancy, and no more because the bandwidth is already sufficient compared to the other IRF port.)
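To make the arithmetic behind the two questions explicit, here is a minimal Python sketch. It only uses the figures from this post (4x 10G internal, 2x 40G external, 10G peak traffic); the function and variable names are made up for illustration and are not part of any HPE tool.

```python
# Illustrative model only: names are hypothetical, figures come from the post.
LINK_SPEED_GBPS = {"10G": 10, "40G": 40}


def irf_port_bandwidth(link_type, link_count):
    """Aggregate bandwidth (Gbps) of one IRF port built from identical links."""
    return LINK_SPEED_GBPS[link_type] * link_count


def utilization(peak_gbps, capacity_gbps):
    """Fraction of the IRF port capacity consumed at peak traffic."""
    return peak_gbps / capacity_gbps


internal = irf_port_bandwidth("10G", 4)           # crosslink ports 1/0/17-20
external = irf_port_bandwidth("40G", 2)           # links between the enclosures
external_degraded = irf_port_bandwidth("40G", 1)  # one 40G link lost

peak = 10  # Gbps, the stated maximum traffic across the whole system
for name, capacity in [("internal", internal),
                       ("external", external),
                       ("external, one link down", external_degraded)]:
    print(f"{name}: {capacity}G, peak utilization {utilization(peak, capacity):.0%}")
```

Under these assumptions even the smaller IRF port, or the external one with a link down, stays far below saturation, so the asymmetry is a design question rather than a capacity one.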

Thanks for your tip!

parnassus · ‎10-21-2017

It's not clear from the IRF diagram you made: are there 2x40G, or 2x40G + 2x40G (so a grand total of 4x40G)? I'm asking to understand whether your IRF follows a daisy chain topology (SW11<->SW12<=>SW21<->SW22) or a ring topology (SW11<->SW12<=>SW21<->SW22 plus SW22<=>SW11 via another 2x40G).
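Why the distinction matters can be shown with a small Python sketch that counts IRF hops between members in each topology. The switch names are the ones used in this thread; the link lists and the `hop_count` helper are purely illustrative assumptions, not output from any switch.

```python
from collections import deque


def hop_count(links, src, dst):
    """Fewest IRF links to cross from src to dst (breadth-first search)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None  # dst is not reachable from src


# Assumed link lists for the two candidate topologies.
DAISY_CHAIN = [("SW11", "SW12"), ("SW12", "SW21"), ("SW21", "SW22")]
RING = DAISY_CHAIN + [("SW22", "SW11")]

for name, links in [("daisy chain", DAISY_CHAIN), ("ring", RING)]:
    print(name, "SW11 -> SW22:", hop_count(links, "SW11", "SW22"), "hop(s)")
```

In the daisy chain the two end members are three IRF links apart; closing the ring brings them to one link and leaves an alternative path if any single IRF connection fails.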

Generally (not always) an IRF fabric tends to be implemented in a fairly symmetrical way: the same number of IRF member ports between its various IRF members, and the same type of physical ports (read: speed) on each configured IRF logical port. This, I would argue, is to avoid creating bottlenecks along the IRF topology used, ring or daisy chain. As far as I recall, though, it's not strictly mandatory to plan an IRF fabric that way. The restrictions (apart from those related specifically to binding physical interfaces to IRF logical interfaces, which are mostly switch dependent) apply when you plan IRF interconnections between direct neighbours (e.g. between SW11 and SW12, between SW12 and SW21, between SW21 and SW22 and, if you have an IRF ring topology, between SW22 and SW11). So if SW11 has IRF Port 1 with 4x10G member ports, then IRF Port 2 on SW12, its direct neighbour, should have 4x10G member ports too (that's clear). What you do between SW12 and SW21, however, concerns only those two switches, not SW11 or SW22 directly.

Basically, with your (forced?) approach, you're creating IRF bandwidth bottlenecks within each SW pair (SW11/12 and SW21/22) but not between SW12 and SW21: this scenario may or may not cause issues, depending on how much traffic flows between SW12 and SW21 compared with the traffic that flows within each blade enclosure (SWx1 <--> SWx2, x=1 and x=2).

Note for the moderator: this thread should be moved into the Comware-based section for better visibility, since the HPE 6125XLG Ethernet Blade Switch runs the Comware operating system and IRF is a feature of almost all Comware-based switches.

I'm not an HPE Employee

fab2 · ‎10-21-2017

Thanks Parnassus for your reply.

I've attached the figure so it is easier to understand.

To me the potential bottleneck is not an issue because the total traffic will not exceed the available IRF bandwidth.

The reason for this asymmetry: at the beginning I had 2 blade SW IRFs, one for each blade enclosure. Then I changed the setup to make only 1 IRF with the 4 SWs, and for some reason I couldn't use the free SFP+ 10G ports for the new IRF ports; I could use only the 40G ports. Only one 40G port per SW would have been enough for bandwidth, but I put 2 for link redundancy.

This basically works. But as I have some unexpected results (on ping RTTs between blade servers), I am looking further into my config to see whether something is wrong.

Thanks

parnassus · ‎10-21-2017

Understood, totally reasonable.

It would be interesting to understand [*] why you weren't permitted to use the remaining free SFP+ ports on the 6125 (IRF member port restrictions related to port groups?).

Regarding the ICMP ping RTT, my take: ICMP is low-priority traffic (so the CPU, if busy, can discard it if the rate of requests is considered high). If I were you, I wouldn't use it to measure the IRF fabric performance.

[*] A nice blog entry about IRF and the 6125XLG can be found here. (So you weren't able to use the external-facing SFP+ ports numbered 5-12 [**] and had to use the external-facing QSFP+ ports numbered 1-4 instead? Is that right?)

[**] Related to Ports grouping and IRF interfaces restrictions, note this SFP+ Port Group specific restriction:

"The SFP+ ports are grouped by port index into two 4-port groups. One port group contains ports 5, 6, 9 and 10. The other port group contains ports 7, 8, 11 and 12. If you use one port in a group for IRF connection, you must also use all the other ports in the group for IRF connection. However, you can bind the ports to different IRF ports."
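The quoted rule can be checked mechanically before cabling. Below is a minimal Python sketch; the port groups come from the quote above, while the `check_port_groups` helper and the shape of its input (SFP+ port index mapped to IRF port number) are assumptions made for illustration and are not a Comware command or API.

```python
# Port groups as stated in the quoted restriction.
SFP_PLUS_PORT_GROUPS = (frozenset({5, 6, 9, 10}), frozenset({7, 8, 11, 12}))


def check_port_groups(irf_binding):
    """Return a list of rule violations for a planned IRF binding.

    irf_binding: dict mapping SFP+ port index -> IRF port number (1 or 2).
    Ports of one group may be bound to different IRF ports, but a group
    must be used for IRF either entirely or not at all.
    """
    problems = []
    used = set(irf_binding)
    for group in SFP_PLUS_PORT_GROUPS:
        in_use = group & used
        if in_use and in_use != group:
            missing = sorted(group - used)
            problems.append(
                f"group {sorted(group)} is partially used; "
                f"ports {missing} must also be IRF physical ports"
            )
    return problems


# A full group split across both IRF ports is allowed by the rule.
print(check_port_groups({5: 1, 6: 1, 9: 2, 10: 2}))
# Using only two ports of a group violates it.
print(check_port_groups({5: 1, 6: 1}))
```

The check only encodes the sentence quoted above; any further model-specific restrictions would have to be added from the configuration guide.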

QSFP+ Ports (40G) if NOT splitted (break) into 4x10G ports show no restrictions.

I'm not an HPE Employee

fab2 · ‎10-22-2017

Hi Parnassus and thanks for that interesting discussion ;)

Adding 4 SFP+ for the new IRF ports was actually my first choice, to have both IRF ports identical on each SW and simply because that setup is shown in an example proposed in "HP 6125 Switch Series: Configuring IRF"... but by mistake some optional equipment was bought with our HPE boxes, and we already have the QSFP+ and the cables needed on site, not the SFP+, so my cpm asked me to use what was already bought and shipped ;)
The example I mentioned is: "Configure and add additional IRF members. Below is an example of adding additional IRF members. In this example, a pair of 6125XLG in a second enclosure are added to the IRF fabric."

Interesting, what you say about ping. However, our system is far from being loaded. I changed the blade SW IRF because of bad performance (even though ping is not the best way to test that, it shows something strange):

At the beginning, my setup was different: each of the 2 C7000s had its own blade SW IRF, made of the two embedded SWs of the C7000 (see attached figure). If one VM in C7000#1 has to communicate with one VM in C7000#2 (this is the case for redundant VMs), packets go through the ToR SW which connects the 2 pairs of 6125. I noticed a bad RTT between VM1 and VM2, i.e. between the 2 blade enclosures, via BL SW1, ToR SW, BL SW2: 0.35 ms, and not so stable. So I changed the setup to the current one (1 single IRF for the 2 blade enclosures), which looked more applicable to my case and in line with the example mentioned above from the "Configuring IRF" config guide. Surprisingly, the RTT has increased! It is now 0.45 ms (and still not so stable). I was expecting ~0.2 ms, because there are 2 fewer hops (the ToR SW). That's why I'm looking for mistakes in my new IRF setup...

Why do you say "QSFP+ Ports (40G) if NOT splitted (break) into 4x10G ports show no restrictions."? I think splitting the QSFP+ ports is a possibility for IRF. The guide mentioned above says for instance:
"When configuring IRF using a QSFP+ 40GbE to 4x10GbE splitter cable, you must use all or none of the four 10GbE interfaces as IRF physical ports. The four interfaces can be bound to different IRF ports."
Actually I had that possibility in mind in case I wanted the same number of links and the same bandwidth on all IRF ports: split the 2 QSFP+ ports and shut down 2 10G subports on each, so I would have 2x2x10G ports on each new IRF port, the same as the existing IRF ports. This would probably work, provided that carrying the 4 split subports in the same direct cable is OK. I haven't tested that, so I'm not sure I could make it work with my current connectivity.

parnassus · ‎10-22-2017

Interesting. Right now I have very little free time to deep dive into the scenario you described (I'll come back to it ASAP).

Regarding this:

fab2 wrote: Why do you say "QSFP+ Ports (40G) if NOT splitted (break) into 4x10G ports show no restrictions."? I think splitting the QSFP+ ports is a possibility for IRF. The guide mentioned above says for instance:
"When configuring IRF using a QSFP+ 40GbE to 4x10GbE splitter cable, you must use all or none of the four 10GbE interfaces as IRF physical ports. The four interfaces can be bound to different IRF ports."

That's exactly what the IRF Configuration Guide states: only when you split a QSFP+ interface into 4x10Gb ports with a splitter cable do you have to respect some restrictions when binding those split interfaces to a logical IRF port.

If you use the QSFP+ port as a whole (1x40Gb), then there are no restrictions, as far as I know. That is what I meant with the statement you quoted above.

I'm not an HPE Employee

parnassus · ‎10-26-2017

fab2 wrote:
At the beginning, my setup was different: each of the 2 C7000s had its own blade SW IRF, made of the two embedded SWs of the C7000 (see attached figure). If one VM in C7000#1 has to communicate with one VM in C7000#2 (this is the case for redundant VMs), packets go through the ToR SW which connects the 2 pairs of 6125. I noticed a bad RTT between VM1 and VM2, i.e. between the 2 blade enclosures, via BL SW1, ToR SW, BL SW2: 0.35 ms, and not so stable. So I changed the setup to the current one (1 single IRF for the 2 blade enclosures), which looked more applicable to my case and in line with the example mentioned above from the "Configuring IRF" config guide. Surprisingly, the RTT has increased! It is now 0.45 ms (and still not so stable). I was expecting ~0.2 ms, because there are 2 fewer hops (the ToR SW).

That part is interesting. First, I have to admit I'm not able to judge whether 350 microseconds (0.35 milliseconds) is a bad RTT (Round Trip Time; so, basically, the one-way trip time is about half of that value), considering that you're measuring VM to VM (so involving at least two OSes, two NICs and two hypervisors, one on each end) and not something more direct (let's say a bare-metal host connected through a 10Gbps NIC port to the first, or second, 6125XLG in the first chassis, to another bare-metal host connected in exactly the same way to the third, or fourth, 6125XLG in the second chassis). I suspect that not planning a proper measuring method can produce false results: everything involved, from the higher levels (OS, NIC device drivers, network stack optimizations and so on) to the lower levels (10G/40G interfaces, IRF or not, ToR or not, transceivers versus DAC cables and so on), participates in what we see as RTT. (Is data throughput from a VM deployed on the first chassis to a VM deployed on the second chassis problematic? Do you suspect there are bottlenecks? Is there a mathematical way to determine them?)

Have you tried to see which results you are able to obtain using tools like iperf?

A clean, if time-consuming, way would be to skip hypervisors and VMs entirely and measure host-to-host RTT directly, with the hosts placed at various positions in the IRF topology and with the hosts' OS network stack optimized.

It would also be interesting to discuss the RTT in relation to:

IRF implementation
Quantity and quality of switches used (port buffers, CPU, jumbo frames, etc.)
Interfaces used (10G: SFP+ transceivers or DAC cables; the same for 40G)
Optical cable lengths
Hypervisor(s)
VM OS(es)
a lot more...

considering that any item we think of and add to the list is, from my standpoint, a performance inhibitor, because every item necessarily adds a finite delay to processing the data flying from a host on the West side to another host on the East side.
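Whatever the measuring tool, it helps to compare distributions of many RTT samples rather than a single figure. A minimal Python sketch using only the standard library follows; the function names and the sample values are hypothetical, and the one-way estimate assumes a symmetric path (half of the RTT), which may not hold in practice.

```python
import statistics


def summarize_rtt(samples_ms):
    """Summarize RTT samples (milliseconds) collected by any ping-like tool."""
    median = statistics.median(samples_ms)
    return {
        "count": len(samples_ms),
        "min_ms": min(samples_ms),
        "median_ms": median,
        "mean_ms": statistics.mean(samples_ms),
        # Sample standard deviation as a rough indicator of instability.
        "stdev_ms": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
        # Assumes a symmetric path: one-way delay is about half the RTT.
        "one_way_estimate_ms": median / 2,
    }


def relative_change(before_ms, after_ms):
    """Relative RTT change between two setups, e.g. 0.1 means +10%."""
    return (after_ms - before_ms) / before_ms


# Hypothetical samples, for illustration only.
before = summarize_rtt([0.33, 0.35, 0.36, 0.34, 0.41])
after = summarize_rtt([0.44, 0.45, 0.47, 0.43, 0.52])
change = relative_change(before["median_ms"], after["median_ms"])
print(f"median before: {before['median_ms']} ms, after: {after['median_ms']} ms")
print(f"relative change of the median: {change:+.1%}")
```

Comparing medians and spread for the same pair of hosts before and after a topology change gives a steadier picture than one ping reading, though it still includes every delay source in the list above.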

I'm not an HPE Employee

fabbb · ‎10-27-2017

>> (Is data throughput from a VM deployed on the first chassis to a VM deployed on the second chassis problematic? Do you suspect there are bottlenecks? Is there a mathematical way to determine them?)

I have no issue with bandwidth: the estimate is less than 1Gbps, and I have 40G (or 80G) on the IRF links (and 10G on the external links).

>> Have you tried to see which results you are able to obtain using tools like iperf?

I'll think about it.

However, even if ping is not suited to real tests, it is noticeable that the same ping RTT increases by 50% between just before and just after the IRF change, on 2 different platforms, so it is not some local (hardware or other) issue.

That's why I am looking for something wrong in my new IRF setup. I am wondering whether I would have had that issue with 4 SFP+ instead of the QSFP+. Unfortunately I can't test it.

Thanks for sharing your point of view!
