Re: speed up backups

Steve Reece_3
Trusted Contributor

Re: speed up backups

What else is going on with the disks you're reading from at the time you're reading them, and what kinds of software and caching are in use?
I'd expect the read path to be the bottleneck in this. VMS will get a return status for the writes as soon as each write hits the disk controller. The controller can then write the data to physical disk as and when it chooses. The read, on the other hand, has to go through to the disks unless the data have already been cached.
How fragmented are the input files that you're backing up? If they're continually growing, are the disks full of hundreds of small fragments?
Has anyone looked at locking on the files being backed up?
Has anyone looked at the processor modes during the backup?
Is FastPath enabled? If not, is the primary CPU becoming a bottleneck for operations on the node doing the backup?
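A rough way to check both from DCL (a sketch only - exact qualifiers vary a little by VMS version):

$ MC SYSGEN SHOW FAST_PATH        ! 1 = Fast Path enabled
$ MC SYSGEN SHOW FAST_PATH_PORTS  ! which port types are eligible for Fast Path
$ MONITOR MODES/INTERVAL=5        ! processor mode time; add /CPU on an SMP box to watch the primary separately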
Rob Leadbeater
Honored Contributor

Re: speed up backups

Hi Peter,

I've not seen any info on the actual hardware being used on the Alpha side of things, or the SAN switch infrastructure.

Are you still using 1Gb/s HBAs?

Are there lots of interconnected SAN switches?

Cheers,

Rob
Peter Zeiszler
Trusted Contributor

Re: speed up backups

The Alpha servers are GS1280s - the primary systems are 6 CPU with 18 GB RAM, and the other 2 systems in the cluster are secondary partitions of the GS1280s with 2 CPU and 6 GB. Each system has 2 HBAs at 2 Gb each, with each HBA going to a separate switch/fabric. These connect to 2 XP1024s. Most data disks are at the local site, with the destination disk being a shadowed disk with one member at each site.

The disks are set up for Fast Path. The main activity during backups is other processes updating other applications and reporting. Disks are locked by the application, and the application kicks off a backup (file by file). The files are RMS and have pre-allocated space. I checked fragmentation and there are some files that have 50+ extents with extent sizes of 65535 blocks.
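For reference, the extent information lives in the file headers, so something like this shows the retrieval pointers - the file name here is just an example:

$ DUMP/HEADER/BLOCKS=COUNT:0 DISK$DATA1:[APPDATA]BIGFILE.DAT  ! header only, no data blocks; the map area lists the extents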

My test, however, is a single file with 2 extents and a size of 31 million blocks (approx. 15 GB). I also created a new 5 GB file on other disks with SYSGEN, just to try the read on differently formatted files.
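The SYSGEN-created file was something along these lines - the size is in 512-byte blocks, and the path is made up for the example:

$ MC SYSGEN CREATE DKB200:[TEST]READTEST.TMP /SIZE=10485760  ! ~5 GB test file (size in blocks)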

Late yesterday I made all systems use the same path (fabric A or B) for the backup2 disk, which has a member at each site via shadowing (I will have to set up a script so that the paths get set at boot instead of being auto-chosen). The read I/O rate fluctuated between 2,000 and 400 I/Os on the shadowed disk. I grabbed a disk local to each site and tested a backup to the null device on each disk multiple times. On the disks local to the site I was getting 2,000 I/Os, and on the remote-site disk 300-400 I/Os. It didn't matter which site.
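The null-device test is basically the following - device and file names are placeholders, not our real ones:

$! Write the save set to the null device so only the read path matters
$ BACKUP/NOCRC/GROUP_SIZE=0 $1$DGA100:[DATA]BIGFILE.DAT NLA0:TEST.BCK/SAVE_SET
$! In another session, watch the source disk while it runs
$ MONITOR DISK/ITEM=OPERATION_RATE/INTERVAL=5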

So it looks like something slowed down between the sites. I now have the network team involved, looking at the lines between sites. We also contacted our carrier to see if they can find anything.

I went back into older historical performance data too, and found the drop from approximately 1,200 disk I/Os to an average of 800 disk I/Os from one day to the next for the main backup job. I captured that data, plus a few days around that date, and sent it to the network team.
Volker Halle
Honored Contributor

Re: speed up backups

Peter,

you may want to use the HP tool T4 to measure and document performance data. It provides very detailed performance information on Fibre Channel disks as well.

Volker.
Peter Zeiszler
Trusted Contributor

Re: speed up backups

I have T4 running. I also have the CA Performance tool (the old DEC POLYCENTER stuff).
Peter Zeiszler
Trusted Contributor

Re: speed up backups

The T4 data shows similar I/O rates on the disks to the POLYCENTER collector. I had restarted T4 on the main system doing the backup back at the beginning of August. I don't run it on every node all the time. The only one I run it on ALL the time is the system that does the off-system backup (NetBackup), as I have to track whenever we get a slow network on that system.
Colin Butcher
Esteemed Contributor

Re: speed up backups

If you're looking for causes of a slowdown on the inter-site links, maybe a bit more detail would be useful.

How exactly are the discs at the remote site presented to the local site? Is it an extended SAN using DWDM over dark fibre? Is it an extended SAN using FC-IP gateways? Is it just MSCP serving over SCS?

How are the cluster members linked for SCS layer 2 traffic (it won't be IPCI unless you're a very early V8.4 test site) - so how are you extending the layer 2 LAN? Is it DWDM over dark fibre, or is it some form of layer 2 (SCS) encapsulation and an IP (layer 3) managed service?

If it's DWDM over dark fibre for both FC extension and LAN extension, then you shouldn't be seeing a big slow-down, unless the telco has re-configured your inter-site links to use much bigger distances.

If it's an IP managed service with SCS encapsulated in IP packets and you're using MSCP over SCS - it could be going pretty much anywhere at the telco's whim.

What does SCACP show you for delays and round trip times on the virtual circuits? Have those changed significantly? Do you need to increase SCACP's buffering to allow for worst-case round trip delays with highly variable latency (use SCACP CALCULATE)? The SCS algorithms are pretty good for reasonably consistent latency, but I've found it useful to increase buffer counts for intermittently variable latency to get enough packets in flight before you stop sending and have to wait for ACKs back. Conversely, do you have a lot of retransmits going on? Again, SCACP can help you by looking at the error counts and the way they change over time.
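For example, roughly (the exact command set varies a little by VMS version):

$ MC SCACP
SCACP> SHOW CHANNEL   ! per-channel delays and errors
SCACP> SHOW VC        ! virtual circuit state and retransmit counts
SCACP> EXIT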

Are you making use of SCS compression - which can be useful for sending a lot of big packet traffic (such as MSCP serving or lots of mini-copy / mini-merge bitmap data)?

It does sound rather like an inter-site link problem of a randomly variable nature, so a bit of low-level exploration is probably worth doing.

Good luck.

Cheers, Colin (http://www.xdelta.co.uk).
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Peter Zeiszler
Trusted Contributor

Re: speed up backups

The closest description is DWDM (dark fibre) with an extended SAN. There are fabric A switches and fabric B switches, each with two 1 Gb fibre links just for the SAN. We have additional fibre for the DWDM network portion. Yes - we engaged the telco late yesterday.
Peter Zeiszler
Trusted Contributor

Re: speed up backups

FYI - MC SCACP CALCULATE isn't available in 7.3-2; it's available in 8.3. The highest delay in the network is "5077.4 uSec", from doing an MC SCACP SHOW CHANNEL.

Network config info:
Each system has 3 NICs. Two NICs are on the DWDM (one for the main IP and one for DECnet) and are on the same network VLAN. The 3rd NIC is local-site only, on a private VLAN for backups (NetBackup).

I'm setting up for a test run tonight - eliminating the remote-site disks by dismounting them from the shadow sets for a couple of hours. The automatic backups will then run and I will capture the disk I/O via MONITOR. I will pull up the performance data and T4 tomorrow - hoping for quicker speeds. That would then pretty much isolate it.
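Roughly what I plan to run - the shadow set and member names below are placeholders, not our real devices:

$! Drop the remote-site member out of the shadow set for the test window
$ DISMOUNT $1$DGA210:
$! Watch disk I/O while the automatic backups run
$ MONITOR DISK/ITEM=OPERATION_RATE/INTERVAL=10
$! Afterwards, add the member back and let it copy/merge
$ MOUNT/SYSTEM DSA2: /SHADOW=($1$DGA210:) DATA_VOL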
Colin Butcher
Esteemed Contributor

Re: speed up backups

Might be worth checking how the FC switches are set up. Any firmware changes to the FC switches recently? Any changes to FC switch port configurations (especially things like ISL_R_RDY mode)? Anything useful in the FC switch logs? Port resets? Bouncing inter-site FC links? Any multi-path device switching going on at VMS level?
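The thread doesn't say which FC switches these are, but if they happen to be Brocade-based B-series, the sort of thing to look at from the switch CLI would be islshow (ISL state and negotiated speed on the inter-site links), porterrshow (per-port error counters such as CRC and enc-out errors) and portcfgshow (per-port configuration, including the R_RDY setting on the ISL ports).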

Any contention going on within the LAN side or the FC side - people adding new devices / systems to the SAN or LAN that you're now sharing the bandwidth with?

This is just the kind of situation where you really do want to avoid shared use of bandwidth. Give me well-bounded systems any day, especially for trouble-shooting in high-availability environments...

Cheers, Colin (http://www.xdelta.co.uk).
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Peter Zeiszler
Trusted Contributor

Re: speed up backups

The FC switches had a firmware upgrade about a month ago - after we had already started seeing the issue. We have actually turned off more servers, so there is less traffic using the SAN. That will soon change - we're bringing in a blade server and additional storage to support it.

No path switching on VMS - only the manual stuff I have been doing to force the disks all onto the same fabric. (We had that issue on EMC, so I thought it might be similar.)

I had the counters zeroed last night - no update on any errors from today. I'll probably get that tomorrow.

Got an update from the telco: they evaluated a card and are replacing it tomorrow. I will have to cross my fingers in the morning. :D
Steve Reece_3
Trusted Contributor

Re: speed up backups

Fingers crossed for you Peter. Hopefully this is the Telco's fault!
Steve
Colin Butcher
Esteemed Contributor

Re: speed up backups

Depending on which FC switches you have and how your FC switches are configured to drive the multiple 1Gbps inter-site links (do they load-share at packet level or load-distribute at path level?), you might find that you're limited to a maximum of 1Gbps throughput to any one disc at the other site, rather than the 2Gbps you might expect if you have 2x 1Gbps ISLs between a pair of switches. Under normal circumstances it probably wouldn't matter, but if your test consisted of hammering a single remote disc, then you might reach path saturation sooner than you expected.
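As a rough sanity check: 1Gbps of Fibre Channel carries roughly 100MB/s of payload, so even a perfectly clean single ISL puts a floor of about 2.5 minutes under reading a 15GB (31 million block) file from the remote site. Worth keeping in mind when comparing the local and remote test figures.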

Hopefully it's a Telco issue and normal service will be resumed shortly.

Cheers, Colin (http://www.xdelta.co.uk).
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Peter Zeiszler
Trusted Contributor

Re: speed up backups

The test last night - 90 GB in 35 min.
Today I moved the disk's paths to what I think is the non-erroring link - 15 GB in 8 min. Definitely better.
I sure hope the card fixes both paths.

Thanks everyone for the suggestions - doing the read-only dump (backup to null) to find where I get the best read rate was probably the most used one.
Still looking at tweaking an account just for the backups.

I will update once I can test the replaced card.
Colin Butcher
Esteemed Contributor

Re: speed up backups

What happened in the end?
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Peter Zeiszler
Trusted Contributor

Re: speed up backups

There are a total of 4 circuits interconnecting the switches across the sites. Of those, we had one bad card on one circuit and a bad cable on another (it would bring up a link using our switches but wouldn't work with the testing equipment).

I was given a window where we disabled the circuits one at a time and each circuit was tested. I also forced the disk reads through specific fabrics while one path was down. We still have 2 circuits that are slower, one on each fabric. We are trying to get an estimate to test each fibre and see how much light is getting through each cable. We're also looking at where the cables route, to see if there is anything in common between the two slower lines.

So for now it's really the luck of the draw whether I get a slow line or a faster line when traffic goes between sites, because the switch uses whichever path is available. We keep them both up for redundancy.

Using only the local disk, my backup is 42 min.
Using the shadowed disk with the faster line, it's 1:30.
Using the shadowed disk and the slower line, it's 2:30.
The link between switches is a 1 Gb pipe (even though some documentation indicated 2 Gb).
Jon Pinkley
Honored Contributor

Re: speed up backups

RE:"Link between switches is a 1gig pipe (even though some documentation indicated 2gig)."

Sounds like marketing specmanship. It is probably 1Gbs in each direction, therefore 2Gbs is quoted.
it depends
Peter Zeiszler
Trusted Contributor

Re: speed up backups

Even though it's slow, it seems to show up as a performance hit only on VMS. Probably on the Unix hosts too, but I can't think of how to force an I/O through a specific path when the mirror sets include all paths.

But with the redundancy I KNOW we can lose a line/circuit and not impact anything. We originally started with only 1 circuit per switch - but then a backhoe showed us that we needed two per switch, with different physical paths. Of course, that also showed the people who had to pay for the downtime and the circuits that we needed the redundancy :D

I will update more once we actually find out what the final problems are. So far it's multiple lines with different issues, and still 2 other circuits impacted.