Array Performance and Data Protection

Re: Can replication go faster?

 
Occasional Advisor

Re: Can replication go faster?

I do know for a fact one of the volumes is a special pig that has been trying to replicate about 5TB for ~9 days now. The Labs data bear that out (very cool BTW, didn't know that was there).

I put it in a Google Sheet for the morbidly curious: Replication Timeline - Google Sheets 


Re: Can replication go faster?

As I was reading through this thread, something stood out to me in the screenshot of the replication interfaces. They are labeled Eth1 and Eth2. These are the onboard interfaces, not generally used for data traffic. I will assume you are using Eth1 for management only. Is the second onboard interface Eth2 configured for management failover, dedicated to replication traffic, or is it running iSCSI connections? You mentioned that management is connected at 1Gb. Is Eth2 connected to that same switch, or is it connected to your 10G iSCSI switch?

Occasional Advisor

Re: Can replication go faster?

Eth2 is management failover on a 1G OOB network. I am super thoroughly sure that there is no replication traffic going over any 1G link.

New Member

Re: Can replication go faster?

I actually opened a ticket on this last week. Our current setup is two AF5000s replicating over a 300Mbps fiber link with Riverbed WAN optimizers on both ends. Monitoring through the Nimble replication monitor shows an average transfer of around 200Mbps with peaks up to 400Mbps, but when monitoring from the routers and WAN optimizers it's only peaking at 65Mbps. So I wanted to push the envelope and see if we could get more out of our 300Mbps line (like our other brand of arrays that starts with an "N" does). No such luck; I was told the issue is the limited number of replication streams built into the Nimble replication engine code, which is 8 streams. Our other "N" brand array will do 25 streams, which is why we see it utilize our line more.

I was told it is on someone's plate to work on the replication engine code to allow for more streams, but there is no ETA on it.

Advisor

Re: Can replication go faster?

Brock Benard

Thank you for the reply. 300Mbps is a low amount of bandwidth for any Nimble Storage array to fill with 8 streams (volumes) replicating at once.

There might be some confusion between Mbps and MiBps. 1MiBps = ~8.4Mbps, so using the math you have provided, 400Mbps = ~47MiBps. Strictly speaking, a mebibyte is 1024 kibibytes, while a megabyte is 1,000,000 bytes; however, everyone in the computer industry is used to thinking a megabyte is 1024 x 1024 bytes, which it is not academically. Yet most operating systems still report storage in "mega" while actually using base 2 (1024). (A simple confirmation is to look at the "Bytes" value of the page file on your Windows PC - it will be a power of 2.) This should explain why you could see a peak bandwidth of 400 on the array while a device on the path shows a number roughly 8 times smaller at the same time.
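The unit arithmetic above can be checked in a few lines (a minimal sketch; the 400Mbps figure is taken from the post above):

```python
# Convert between decimal megabits per second (Mbps, network units)
# and binary mebibytes per second (MiBps, what many OS tools show).

MBPS = 1_000_000          # 1 Mbps = 10^6 bits/s
MIBPS = 1024 * 1024 * 8   # 1 MiBps = 2^20 bytes/s = 8,388,608 bits/s

def mbps_to_mibps(mbps: float) -> float:
    """Decimal megabits/s -> binary mebibytes/s."""
    return mbps * MBPS / MIBPS

# 1 MiBps is roughly 8.4 Mbps ...
print(round(MIBPS / MBPS, 2))        # ~8.39
# ... so a 400Mbps peak is only about 47 MiBps.
print(round(mbps_to_mibps(400), 1))  # ~47.7
```

Comparing two monitoring tools without pinning down which of these units each one reports can easily produce an apparent 8x discrepancy.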

In regards to the replication streams: each stream is a volume snapshot being replicated. In Nimble OS 3.x there are 8 streams, as you point out. The reason for multiple streams is to let the system read enough data from disk to saturate the network link across a variety of customized schedules. Each volume's snapshot has its own challenge reading data from disk, since it needs to find the blocks to replicate. However, that challenge only comes into play at much higher bandwidths, such as 2Gbps, and usually on hybrid arrays, where HDD spindles are involved. Regardless of how randomly data is laid out on RAID on the AFAs, 8 streams is definitely enough to saturate a 300Mbps link.
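The saturation argument can be put in rough numbers (a conceptual sketch; the 50Mbps per-stream read rate is an assumption for illustration, not a measured Nimble figure):

```python
def aggregate_mbps(streams: int, per_stream_mbps: float,
                   link_mbps: float) -> float:
    """Aggregate replication throughput: each stream contributes its
    own disk-read rate, and the total is capped by the WAN link."""
    return min(streams * per_stream_mbps, link_mbps)

# Even if each volume snapshot could only be read at a (hypothetical)
# 50Mbps, 8 parallel streams are more than enough for a 300Mbps link:
print(aggregate_mbps(8, 50.0, 300.0))  # 300.0 (link saturated)
# A single slow stream, by contrast, would leave the link mostly idle:
print(aggregate_mbps(1, 50.0, 300.0))  # 50.0
```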

All 8 streams replicate over a single TCP session, and a TCP session is dependent on the reliability of the network path. I believe the other "N" array might be using multiple TCP sessions (regardless of streams). On a non-optimal network path, that would let its replication utilize more of the advertised link bandwidth. A non-optimal network path does not just mean network errors or physical device issues; it can mean other traffic (congestion) or some kind of filtering (proxy, firewall, IPS/IDS, optimizer). There have been quite a few cases where a device meant to improve some aspect of network transmission (proxy, firewall, IPS/IDS, optimizer) interfered with the TCP transmission to the point where it could not saturate the link. I cannot say this is the case in your situation, but what I can say is that a Nimble Storage array replicates compressed data blocks, so there is no reason to add a "WAN optimizer" on the path.
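One concrete way a non-optimal path caps a single TCP session is the bandwidth-delay product: a session cannot move more than one window of data per round trip. A rough sketch (the 64KiB window and the RTT values are assumptions for illustration, e.g. a middlebox that strips TCP window scaling):

```python
def tcp_max_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP session's throughput: one full window
    per round trip, converted to megabits per second."""
    return window_bytes * 8 / rtt_seconds / 1_000_000

# With a classic 64KiB window, a single session barely covers a
# 300Mbps link even at LAN-like latency ...
print(round(tcp_max_mbps(64 * 1024, 0.002), 1))  # ~262.1 at 2ms RTT
# ... and falls far short over a WAN:
print(round(tcp_max_mbps(64 * 1024, 0.020), 1))  # ~26.2 at 20ms RTT
```

Each parallel session gets its own window, which is one reason a many-flow replicator can utilize more of the same path than a single-session one.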

What I would suggest in your situation, if possible, is to bypass the Riverbed device completely as a test (not just set a "bypass" rule, but not route through it at all). After you pause/resume replication (this re-initiates the TCP session), you might see the expected 300Mbps while there is no congestion (such as from the "N" device). If not, then network packet captures are in order, which would help determine if there is something else on the path (look for retransmits, out-of-order packets, TCP resets, etc.).

In the case of the original poster, the apparent 1Gbps limit is interesting. I hope Nimble Storage Support will provide some guidance on why it appears limited to this specific value.

Occasional Advisor

Re: Can replication go faster?

There is no Riverbed device... we don't own any Riverbed at all. This is all on a single flat VLAN. I am not confusing Mb vs MB vs MiB. As you can see from the graphs above, this is indeed pushing Mb.

Advisor

Re: Can replication go faster?

My comment above was directed mostly to the other user and their situation as they explained it.

I understand your situation is different, which is what I state in the last sentence, as you are the original poster.

New Member

Re: Can replication go faster?

See the graphs below; they are all measured in Mbps over the last 24 hours...

Nimble Replication Monitor showing peaks around 300Mbps....

 

Application stats from Riverbed

*NOTE Port 4214 is Nimble Replication, Port 11105 is the "N" array

Riverbed does reduce Nimble replication traffic by over 77%!  So we are reluctant to just bypass it...

 

Netflow graph of WAN line traffic

NOTE:  "N" array replicates from 23:00~03:00.  Nimble AF replicates every 4 hours

As you can see, the 4-hour Nimble snaps that replicate are only using about 1/6th of the line.

Advisor

Re: Can replication go faster?

Brock Benard

With respect to the original poster: if further conversation is required, I would ask that you post another thread where your particular issue can be addressed.

I can see that Riverbed and Nimble are both reporting in Mbps, which is good. The question is: why is Riverbed reporting a much smaller peak than Nimble? Where is the data going? For the GUI graph on the Nimble Storage array to show a value, the replication buffer must be confirmed as sent, which means the TCP buffer was purged, which means something acknowledged that the downstream group received the data.

From further research: Riverbed only reports the amount of traffic after it has "deduped" or otherwise "optimized" the data. From Riverbed: "RiOS collects application statistics for all data transmitted out of the WAN and primary interfaces and commits samples every 5 minutes." (SteelHead EX Management Console User's Guide)

Looking at the per-flow statistics from Riverbed, I believe it shows what I was trying to explain: the Nimble array is sending the maximum it can with a single TCP session, while the "N" device has very low throughput per flow but very many flows to achieve its peak. On average, Nimble actually sends more data (23Mbps). Riverbed reports data after "optimization", so it looks like the data from "N" is simply less "optimizable" because it is different. To be fair in a comparison between, say, the "N" device and a Nimble Storage array, one would have to set up a clean lab environment with completely identical workloads and network paths, not to mention the TCP congestion control algorithms, scaling factors, etc.

I understand you are reluctant to remove Riverbed from the path given the reduction rate, and you may operate as you prefer; there are considerations such as ISP limits on WAN data sent. I would, however, do it at least temporarily to test what changes it brings. If you are interested in pursuing this matter further, please post a new thread and we can move the conversation there.

Sorry for hijacking the thread! I do hope the information provided is good information for anyone stumbling upon this thread.


Re: Can replication go faster?

You do not have to remove the Riverbed appliance from the path to get the results you are looking for. An exception can be configured in the pair of Riverbed appliances allowing replication traffic from the arrays to pass through unoptimized. It's been quite a while since I have done this, so I don't remember exactly where it is configured; I think it was part of the in-path optimization rules. I do remember it was pretty easy to configure, though.