
disk bottleneck?

 
SOLVED
Michal Toth
Regular Advisor


hi all,

I've got a customer who runs a nice setup of an SD and an XP12000. The customer complains that his backups take too long (backing up with IBM's Tivoli Storage Manager), and indeed, according to the backup report, average throughput is around 40 MB/s. It is said that it should be at least 60 MB/s. I was very curious, as an XP12000 really should not create any disk bottleneck.

I analyzed the sar output for the two days and I am not sure what is causing the delay. Can you please browse the attached sar report and let me know what you think about it? (I'd say that it doesn't indicate a disk bottleneck.)

Backup session started at 16:06 and ended roughly at 04:00 the next day. It uses 3 read threads and 3 tape writing threads.
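As a rough sanity check on these figures, the backup window and the reported average rate give an estimate of the data volume, and of how long the same volume would take at the expected 60 MB/s. This is only a sketch: it assumes the 40 MB/s average holds across the whole window and that 1 GB = 1024 MB; the function names are made up for illustration.

```python
from datetime import datetime, timedelta


def backup_window_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM times, rolling over midnight if needed."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 <= t0:
        # The end time is on the following day.
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds())


def data_moved_gb(seconds: int, mb_per_s: float) -> float:
    """Data volume in GB moved at a constant average rate (1 GB = 1024 MB)."""
    return seconds * mb_per_s / 1024.0


def hours_at_rate(data_gb: float, mb_per_s: float) -> float:
    """Hours needed to move data_gb at the given average rate."""
    return data_gb * 1024.0 / mb_per_s / 3600.0


window = backup_window_seconds("16:06", "04:00")  # times from the post
volume = data_moved_gb(window, 40.0)              # reported average rate
print(f"window: {window / 3600:.1f} h, volume: {volume:.0f} GB")
print(f"at 60 MB/s: {hours_at_rate(volume, 60.0):.1f} h")
```

If the assumptions hold, the difference between 40 and 60 MB/s is roughly four hours of backup window.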

The customer uses LVM, lvols with distributed allocation policy, distributed across two FC ports to the XP.

regards

Michal
12 REPLIES
Devender Khatana
Honored Contributor

Re: disk bottleneck?

Hi,

A few other questions need answering before I can reply to this.

1. How is VG allocation done for data files ?

2. Which host accesses the tapes and the data? Isn't it a BC?

3. I do not think the choke is at the XP level. Could it be due to some misconfigured resources?

4. How is zoning done?

5. What are the drives and media used for backup?

HTH,
Devender
Impossible itself mentions "I m possible"
Michal Toth
Regular Advisor

Re: disk bottleneck?

Hi Devender,

1) VG setup is stupid (imho), as PE size is set to 4MB for 900GB VGs. Allocation policy is distributed/PVG-strict, distributed across two HBA links to the XP. All PVs have alternate links through the other HBA. Half of the primary paths go through one HBA, the other half through the other HBA.

2) Tape device (lib) is accessed via one of the HBAs, that is also used for accessing XP.

3) What do you mean by misconfigured resources? There are no page-outs and CPU utilization seems fine.

4) no idea about zoning at the moment, but it's a simple SAN and this should not be an issue

5) i'll get the drives list used for backup asap
Eric SAUBIGNAC
Honored Contributor

Re: disk bottleneck?

Hi Michal

I had almost the same behavior with a box under HP-UX 11.11 and an XP512 with direct attachment.

The client also used LVM with distributed allocation policy across the 2 HBAs, with many LUNs.

The PE size was also 4 MB (which is not so obviously stupid, as it gives smaller granularity ;) but the filesystems were smaller than yours.

Data to backup was Oracle Database under SAP. So, big files.

Backup went through the LAN, but we did not detect any network bottleneck.

The backup took about 50% longer than the restore !!! (Did you try to test that?)

During backup, sar -d did not show heavy load on the XP ... unfortunately, I would say. Sometimes activity, sometimes almost nothing ... though there were no small files.

It seemed as if the backup software was not able to get the best from the storage ...

As the client had no need for LVM mirroring, we found a workaround: striping in place of the distributed policy. Then everything was OK: backup time shorter than restore time, better load on the XP ...

I have never found any explanation. I frequently configure distributed allocation with many kinds of SAN storage (VA, EVA, Hitachi, EMC, ...) with no trouble at all ... ???

Hope this will help

Eric

(PBFWME;-)


Michal Toth
Regular Advisor

Re: disk bottleneck?

Thanks for ideas Eric,

Unfortunately, there's no way to make the customer switch from distributed to striped (no downtime possible).

But I've noticed one thing just now: the average buffer cache hit ratios are very low.

%rcache is around 42%
%wcache is around 26%

That certainly is not healthy. Is it possible that Oracle's I/O is so random that it completely defeats the filesystem buffer cache?
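For reference, the %rcache and %wcache columns of sar -b are hit ratios derived from the physical and logical transfer rates in the same report: (1 - bread/lread) * 100 for reads and (1 - bwrit/lwrit) * 100 for writes. A minimal sketch of that arithmetic follows; the sample rates are invented to reproduce the 42% and 26% quoted above and are not taken from the attached report.

```python
def cache_hit_pct(physical_per_s: float, logical_per_s: float) -> float:
    """Buffer cache hit ratio in percent.

    physical_per_s: transfers that actually went to disk (bread/s or bwrit/s)
    logical_per_s:  requests issued against the buffer cache (lread/s or lwrit/s)
    """
    if logical_per_s <= 0:
        # No logical activity in the interval: report 0 rather than divide by zero.
        return 0.0
    return max(0.0, (1.0 - physical_per_s / logical_per_s) * 100.0)


# Hypothetical sample values, chosen only to match the ratios in the post.
rcache = cache_hit_pct(physical_per_s=580.0, logical_per_s=1000.0)
wcache = cache_hit_pct(physical_per_s=740.0, logical_per_s=1000.0)
print(f"%rcache = {rcache:.0f}, %wcache = {wcache:.0f}")
```

A long sequential read of data that is never touched twice, such as a backup, drives bread/s toward lread/s, so a low %rcache during the backup window is expected rather than alarming.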

michal
Eric SAUBIGNAC
Honored Contributor

Re: disk bottleneck?

Well, i have no expertise around Oracle, so ...

Anyway, I cannot imagine for a moment that a read cache could be effective when we read several dozen GB nonstop. So you should not worry about a 42% cache hit ratio during backup.

But I am surprised that there is activity in the write cache. This point, plus the fact that your customer cannot allow downtime, suggests that the backups are made online ... (In my first answer the backups were taken offline.)

So it is not the same thing: Oracle is still operational and adds stress to the storage. If we take a look at the average values of your sar output (I know, averages mean nothing, but you have 136 visible disks and 16528 samples, so a fine-grained analysis is not possible in a few minutes!) we notice an average bandwidth of 57.45 MB/s. Not so far from 60 MB/s ...

Another thing ... I took 4 devices from your sar output and made a graph for each (see attachment). Surprising, isn't it? But as I don't know whether the disks I have chosen are involved in the backup, maybe it does not mean anything ...

Eric

Jeff Schussele
Honored Contributor

Re: disk bottleneck?

Hi Michal,

I don't know that much about Tivoli, but on our BIG db systems we start 16 threads in NetBackup.
I'd start there. It all depends on CPU count however.

HTH,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Michal Toth
Regular Advisor

Re: disk bottleneck?

There are several disks in the sar report that show very high average service times in certain time intervals. What would that mean (XP array)?

Also, when I calculate the cumulative blks/s for each time interval (I know it is very speculative), I see roughly 140 MB/s (r+w together) during the day and only around 24 MB/s during the night, which is when the backup runs.
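The cumulative figure described above can be reproduced with a few lines, assuming the sar -d samples have already been parsed into (timestamp, device, blks/s) tuples and that blks/s counts 512-byte blocks. The device names and rates below are invented so that the totals match the 140 and 24 MB/s mentioned; they are not from the attached report.

```python
from collections import defaultdict

BLOCK_BYTES = 512  # assumption: sar -d reports blks/s in 512-byte units


def throughput_by_interval(samples):
    """Sum blks/s over all devices per timestamp and convert to MB/s.

    samples: iterable of (timestamp, device, blks_per_s) tuples.
    Returns a dict mapping timestamp -> aggregate MB/s (1 MB = 2**20 bytes).
    """
    totals = defaultdict(float)
    for timestamp, _device, blks_per_s in samples:
        totals[timestamp] += blks_per_s
    return {ts: blks * BLOCK_BYTES / (1024 * 1024) for ts, blks in totals.items()}


# Hypothetical pre-parsed samples: two devices, one daytime and one night interval.
samples = [
    ("12:00", "c4t0d1", 143360.0),
    ("12:00", "c6t0d2", 143360.0),
    ("02:00", "c4t0d1", 24576.0),
    ("02:00", "c6t0d2", 24576.0),
]
rates = throughput_by_interval(samples)
for ts, mb_s in sorted(rates.items()):
    print(f"{ts}: {mb_s:.1f} MB/s")
```

Note that blks/s combines reads and writes, so during an online backup the total also includes whatever the database itself is doing.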
Steve Lewis
Honored Contributor

Re: disk bottleneck?

I had the same issue on my SD32/xp12k, but with informix backups. It was fixed by a firmware upgrade on the xp12000 thanks to some in-depth investigation by Wal Brown.

The firmware was released on 6th October 2005 and is version 50-04-31.

The poor performance may be caused by out-of-sequence blocks (based on the timing) causing issues with the XP cache prefetch algorithms on the older version. The fix also includes the setting of an additional System Mode option.

I suggest you check your firmware version and upgrade to 50-04-31.
Alzhy
Honored Contributor
Solution

Re: disk bottleneck?

Michal,

Your sar stats are good, showing no disk bottlenecks at all. And I think the 40 MB/s throughput you're getting is "okay" considering your architecture, meaning you have a common tape and disk SAN. You mentioned that you have 2 FC links going out to the SAN and that one of the two is also used by the tape devices for the Tivoli backup. In this configuration you already have some overhead. A single 2Gb FC link should be able to sustain a throughput of about 80 to 120 MB/s. It should be able to keep 2 to 4 LTO-II drives busy. Have you checked whether the right number of drives is actually engaged for the backups? For large servers I usually keep my multiplexing at one, meaning one save stream to one tape device, so if my server is allocated 4 tape devices I can stream 4 large filesystems or datasets in full.
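The link budget above can be written out as a small calculation. This is a sketch under stated assumptions: the per-drive streaming rates of 30 to 40 MB/s are picked only because they are consistent with the "2 to 4 drives" range quoted here, and each of the 3 tape writing threads mentioned in the original post is assumed to feed exactly one drive.

```python
import math


def drives_sustained(link_mb_s: float, drive_mb_s: float) -> int:
    """Whole number of tape drives a link can keep streaming at drive_mb_s each."""
    return math.floor(link_mb_s / drive_mb_s)


def per_stream_rate(total_mb_s: float, streams: int) -> float:
    """Average rate per stream if the total is shared evenly."""
    return total_mb_s / streams


# Pessimistic case: slow link estimate, fast drives. Optimistic case: the reverse.
low = drives_sustained(link_mb_s=80.0, drive_mb_s=40.0)
high = drives_sustained(link_mb_s=120.0, drive_mb_s=30.0)
print(f"one link keeps {low} to {high} drives streaming")

# Observed: 40 MB/s total, assumed to be spread evenly over 3 tape streams.
observed = per_stream_rate(total_mb_s=40.0, streams=3)
print(f"observed per-stream rate: {observed:.1f} MB/s")
```

Under these assumptions each stream averages well below what a single drive can accept, which is why checking the number of engaged drives and the stream layout is a sensible first step.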

I always set up servers with local SAN backup devices available to them (dynamic drive sharing), so that the tape devices are presented to their own HBA and not served by the same HBA that serves disk LUNs. In short, I have separate tape and disk SANs: 2 FC HBAs for disk LUNs and 1 (or more) FC HBAs for tape LUNs.
Hakuna Matata.