
EliteX2Owner
Advisor

Nimble replication speed not consistent / how to check?

Hi all, I have two AF7k arrays in different data centers, both running the recent 4.5.0.0 code.  There's 10gig linking the two facilities, but due to firewall limitations, there's a soft limit of roughly a gigabit across the VPN.  I've reproduced the gigabit cap consistently with iperf TCP tests, so it's a reliable limit with little variance.

I'm seeing replication of snaps peak at about 500 Mbit/sec, although I haven't seen it get that high since the first day of replication.  Since then it seems to average around 320 Mbit/sec when it's in one of its good phases, but between those phases it may inexplicably run for hours at just 20-50 Mbit/sec.  Based on how snaps have replicated over the past day or two, the slow period may occur towards the end of replicating any one snap, so perhaps this is some checksumming phase and data is not really moving at that point?

Currently I'm getting behind on snaps making it over to the other side because of these weird drops in performance that can last for hours before spinning back up to high speed, which might coincide with moving on to the next snap.  I currently have about 4TB left to replicate to get the two arrays completely in sync, and this is making it difficult to predict just how long that will take, and what the minimum replication interval can be for the future.
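As a rough sanity check on how long the remaining 4TB might take, here is a minimal Python sketch of the arithmetic. It assumes decimal terabytes, a sustained rate, and no savings from compression or dedupe on the wire; the function name `transfer_hours` is purely illustrative.

```python
def transfer_hours(remaining_tb: float, rate_mbit_s: float) -> float:
    """Hours needed to move remaining_tb (decimal TB) at a sustained rate_mbit_s."""
    if rate_mbit_s <= 0:
        raise ValueError("rate must be positive")
    remaining_bits = remaining_tb * 1e12 * 8        # TB -> bytes -> bits
    seconds = remaining_bits / (rate_mbit_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 3600


if __name__ == "__main__":
    # Rates observed in this thread: current, typical "good phase", first-day peak, link cap.
    for rate in (15, 320, 500, 1000):
        print(f"{rate:>5} Mbit/s -> {transfer_hours(4, rate):7.1f} h")
```

At a steady 320 Mbit/sec the 4TB works out to roughly 27.8 hours, while at 15 Mbit/sec it is roughly 593 hours (close to 25 days), which is why the slow phases dominate any estimate.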

The arrays are not particularly loaded.  The primary is averaging <20k IOPS and the target has no activity other than receiving the replication data.  The replication is via the management interface, not the 10gig data interfaces.

Here's the past few days since initiating replication:

Historical

You'll see the throughput just before Dec 30 where it was hitting 500 Mbit, and how it seems to ramp up and plateau since then, when it's in a good mood.  However, currently I'm only seeing 15 Mbit/sec.

Current Speed

 So, questions:

  • Is there any way, perhaps via CLI, to see what is truly going on with any given replication task so I can really judge how far along it is or when it might complete?
  • Any way to speed up the replication to get back to the original 500 Mbit/sec or preferably closer to gigabit?
  • Is there a most efficient way to complete an initial seeding of the target?  For example, currently the production array is set to have hourly snaps and replicate every 8th.  The ~11TB of data on it was loaded over those first few days (VMware live migrations) but has now stabilized.  Currently it's still working on a snapshot from yesterday afternoon, i.e. just over 24 hours old, and it's gone through these weird phases of <20 Mbit and >250 Mbit, but never back to the original 500 Mbit.  I'm happy to kill the replication, delete all snaps, and start over if that would help.
Patrick_Miller
Occasional Advisor

Re: Nimble replication speed not consistent / how to check?

In short, the arrays should replicate as fast as possible so long as it doesn't hinder the array's ability to serve production data.  I also checked to see if we had any known bugs open on this particular issue, which we do not.  This is also something I have not seen in my own customer base, and I have quite a few customers running on 4.x.

Probably not the answer you're hoping for, but this one is going to require a support call to figure out.  Luckily, Nimble has the best support around.  Give them a call.  If you remember to, come back and update us on what the root cause ended up being. 

Patrick Miller | Hewlett Packard Enterprise Employee
Solutions Architect – Storage & Hybrid IT


alriko
Established Member

Re: Nimble replication speed not consistent / how to check?

I know it has been years, but what was the solution? I seem to be running into a similar, or identical, issue with multiple downstream arrays replicating to an upstream array. I'm seeing on average <10Mbps, and it's dropping to 0Mbps and idling for up to 10 seconds before it spikes a little bit and drops back down to nothing. The strange thing is, things were working great for months up until a couple of weeks ago, with peaks of almost 700Mbps on the 1G mgmt interfaces used for repl (like you, not currently using data IPs). Like yours, all of the arrays, both downstream and upstream, are not under any other load from workloads. Behavior is the same after 5PM and on the weekends when business pretty much ceases.

EliteX2Owner
Advisor

Re: Nimble replication speed not consistent / how to check?

This is going to sound incredibly stupid, but you should give it a try.  What "fixed" this issue for me was downing the redundant gigE management ports one at a time.  Nimble support could not figure out what was going on, and it even continued after doing a controller failover.  I even did a firmware update, thinking that if there was some weird state of a variable in there somewhere causing this, that would surely blow it out.  The fact that it followed the controller failover twice (the forced failover and then the upgrade failover) had me thinking maybe there's a weird issue on my ethernet switch.  In my case, the first management port of each controller goes to the same switch, and the second port of each controller goes to a second switch.  So, I turned the ports off on that first switch, and replication immediately went back to normal.  I turned them back on, and did the same on the second switch, and replication continued at high speed.  I turned the secondaries back on and it's been fine ever since.

So, there is some internal state that existed, and even survived code upgrades and failover, but downing the ethernet ports cleared it.  I never got feedback on my ticket if the developers found the source.

alriko
Established Member

Re: Nimble replication speed not consistent / how to check?

First off, thank you SO much for the feedback and response! I've had a ticket open with Nimble support for over a week now and a solution hasn't been reached, hence I'm following up on any leads I can! In my case I have 3 downstream arrays replicating up to 1 upstream array, and they ALL peak/valley at identical values and identical times, which is totally odd as they all serve different workloads. This does 100% appear to be an upstream issue.

As far as bouncing each 1Gb management interface one at a time, I assume you did that on the upstream/target/destination array's switches? Or did you also have to do that to the downstream/source array?

I'm going to give this a shot. Definitely an easy thing to try!

Mahesh202
HPE Pro

Re: Nimble replication speed not consistent / how to check?

Hi alriko

Yes, bouncing the management interface would be done on the upstream/target/destination array's switches. As for the downstream/source array, it may be worth checking the logs or performance statistics to see if there are any indications of network issues or errors that could be causing the replication peaking/valleying issue.

In addition to bouncing the management interfaces, you could also try checking the network configuration and settings on both the upstream and downstream arrays to ensure they are configured correctly and efficiently for replication traffic. It may also be helpful to monitor the network traffic during replication to see if there are any spikes or drops in traffic that correspond to the replication peaking/valleying issue.
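One way to line up the slow periods with other events is to log throughput samples and flag the stretches that stay under a threshold. Below is a minimal Python sketch, assuming you already collect (timestamp in seconds, Mbit/sec) samples from whatever monitoring you use, such as switch port counters. The function name, threshold, and sample format are illustrative and not part of any Nimble tooling.

```python
from typing import List, Tuple


def find_slow_periods(
    samples: List[Tuple[float, float]],
    threshold_mbit_s: float = 50.0,
    min_duration_s: float = 10.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of runs where throughput stayed below
    threshold_mbit_s for at least min_duration_s. Samples must be time-ordered."""
    periods = []
    start = None  # timestamp of the first slow sample in the current run
    last = None   # timestamp of the most recent slow sample in the current run
    for ts, rate in samples:
        if rate < threshold_mbit_s:
            if start is None:
                start = ts
            last = ts
        else:
            # A fast sample ends the current slow run, if there is one.
            if start is not None and last - start >= min_duration_s:
                periods.append((start, last))
            start = None
    # Close out a slow run that lasts until the end of the data.
    if start is not None and last - start >= min_duration_s:
        periods.append((start, last))
    return periods


if __name__ == "__main__":
    # Made-up samples at 5-second intervals, for illustration only.
    demo = [(0, 300), (5, 10), (10, 0), (15, 5), (20, 320), (25, 40), (30, 310)]
    print(find_slow_periods(demo))
```

Comparing the flagged windows across several source arrays, or against switch logs, would show whether the drops happen at the same moments everywhere.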

If these steps don't resolve the issue, I would suggest continuing to work with HPE Nimble Tech support to troubleshoot the issue further.

Hope this helps!

Regards
Mahesh.


I work for HPE.
[Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise]



EliteX2Owner
Advisor

Re: Nimble replication speed not consistent / how to check?

In my case the source array was the problem, but since you can really bounce them all without affecting service, given we're just talking about management/replication not storage I/O, I'd do them all and see what shakes out.

Osmium
Occasional Visitor

Re: Nimble replication speed not consistent / how to check?

I have a pair of HF20s, and my traffic was down to 1-2Mbps after it had consistently been running significantly faster for over a year.  Tried reloading the switch ports per the recommendation above and it also cleared my issue.  Tried the subscriber side first, no joy there, but the publisher side did the trick.  Thank you!