
Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

 
chris huys_4
Honored Contributor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

Hi Martin,

This looks like the disk-to-tape backup (sar columns: device/lunpath, %busy, avque, r/s, w/s, blks/s, avwait, avserv):

09:03:22 disk3_lunpath82 34.34 0.50 0 706 90376 0.00 0.48
[..]
disk182_lunpath7 50.51 0.50 213 0 22562 0.00 2.59
[..]
disk182_lunpath33 62.63 0.50 213 0 22651 0.00 3.16
[..]
disk182_lunpath89 51.52 0.50 213 0 22343 0.00 2.61
[..]
disk182_lunpath108 50.51 0.50 213 0 22570 0.00 2.58
[..]
disk182 100.00 0.50 853 0 90125 0.00 2.73

[..]

tape is "presented" with disk3_lunpath82, disk by the 4 "disk182" lunpaths , i.e. disk182_lunpath7; disk182_lunpath33;disk182_lunpath89;disk182_lunpath108

disk182 varies, during the measured sar period, between 27776 blocks/sec and 156451 blocks/sec

When the vg04 lun, delivers only 27776 blocks to tape, representing 262 read IOs, the other volumegroup luns are requesting 1464 read IOs and 523 write IOs.

When the vg04 lun, delivers 156451 blocs/sec, or 1475 read IOs, the other volumegroup luns are requesting 91 read IOs and 158 write IOs.

summing up the read /write IOs at the "worst for the backup" 27776 = 262 + 1464 + 523 = 2249 IOs

summing up the read/write IOs at the "best for the backup" equals to 1475+91+158=1724 IOs
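
(For scale: sar reports blks/s in 512-byte units, so assuming that, 27776 blks/s is roughly 13.5 MByte/sec and 156451 blks/s roughly 76 MByte/sec.)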

Two things I would do:

1. Move all the LUNs that are not yet under 11.31 native multipathing control under it, i.e. the disks that still show up with their legacy cXtYdZ device files in the sar output (a migration sketch follows the listing below):

Average c1t0d2 9.35 0.50 22 54 1402 0.00 1.27
Average c1t1d3 0.46 0.50 1 1 53 0.00 2.03
Average c1t1d4 5.64 0.50 9 25 577 0.00 1.68
Average c7t2d3 0.53 0.50 1 1 45 0.00 2.74
Average c5t0d2 9.32 0.50 22 54 1326 0.00 1.26
Average c5t1d3 0.49 0.50 1 1 46 0.00 3.06
Average c3t2d3 0.57 0.50 1 1 57 0.00 2.60
Average c10t0d2 9.61 0.50 22 54 1333 0.00 1.31
Average c10t1d3 0.44 0.50 1 1 53 0.00 2.29
Average c14t1d3 0.45 0.50 1 1 52 0.00 2.10
Average c14t1d4 5.60 0.50 9 25 640 0.00 1.70
Average c16t2d3 0.54 0.50 1 1 49 0.00 2.62
Average c1t0d1 2.52 0.50 21 19 617 0.00 0.62
Average c10t0d1 2.38 0.50 21 19 658 0.00 0.59
Average c14t0d1 2.27 0.50 21 19 591 0.00 0.55
Average c14t0d2 9.26 0.50 22 55 1391 0.00 1.25
Average c5t0d1 2.48 0.50 21 19 656 0.00 0.60
Average c10t1d4 5.36 0.50 9 25 601 0.00 1.62
Average c5t1d4 5.21 0.50 8 25 572 0.00 1.56
Average c12t2d3 0.61 0.50 1 2 59 0.00 2.52
Average c1t0d6 0.06 0.50 0 0 0 0.00 19.63
Average c5t0d6 0.06 0.50 0 0 0 0.00 21.17
Average c10t0d6 0.06 0.50 0 0 0 0.00 20.82
Average c14t0d6 0.06 0.50 0 0 0 0.00 18.29
Average c1t1d0 0.01 0.50 0 0 0 0.00 0.50
Average c5t2d2 0.01 0.50 0 0 0 0.00 0.31
Average c14t2d2 0.01 0.50 0 0 0 0.00 0.25
Average c5t2d0 0.01 0.50 0 0 0 0.00 0.20
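
To see which persistent DSF a legacy device file maps to, and as a minimal sketch of one way to re-import a volume group on persistent DSFs (vgXX, the mapfile name, the 0xXX0000 minor number and the diskN/diskM arguments are placeholders, not taken from your config; the VG has to be deactivated for the export/import):

# ioscan -m dsf /dev/rdsk/c1t0d2
# vgchange -a n vgXX
# vgexport -v -m /tmp/vgXX.map vgXX
# mkdir /dev/vgXX
# mknod /dev/vgXX/group c 64 0xXX0000
# vgimport -v -m /tmp/vgXX.map vgXX /dev/disk/diskN /dev/disk/diskM
# vgchange -a y vgXX

(The PV arguments to vgimport are the persistent DSFs that ioscan -m dsf reported for the VG's legacy paths.)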

2. Double max_q_depth for disk182:

# scsimgr set_attr -D /dev/rdisk/disk182 -a max_q_depth=16

Let's see if doubling the IO bandwidth of that particular LUN will get the EVA to favour it more...
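
To check the value currently in effect, before and after the change:

# scsimgr get_attr -D /dev/rdisk/disk182 -a max_q_depth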

If the above doesn't work, post the output of:
# scsimgr lun_map -D /dev/rtape/

Also I would still like to see the original 11.23 sar output. ;)

Greetz,
Chris
Martin Roende
Advisor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

Thanks Chris ..
I will get the change planned and execute your recommendations.

I do not have sar output from the 11.23 installation, and it is no longer running.

As I mentioned, all I have is the collected glance output. I have attached one of the files collected from the 11.23 installation in production (but it contains no detailed disk statistics).
Oracle dump time was 18:40 and backup time 00:00.

regards Martin
Martin Roende
Advisor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

Chris ..
After "scsimgr set_attr -D /dev/rdisk/disk182 -a max_q_depth=16" I have made another sar collection. HEre attached new_sar_out.txt.

regards MRA
Martin Roende
Advisor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

This looks very good, Chris, thank you!
It seems throughput has doubled.

mraMac:Dokumentation mra$ grep Average *sar_out.txt | grep disk182
09:01:44  lunpath/device  %busy  avque  r/s  w/s  blks/s  avwait  avserv
                          %age          num  num  num     msec    msec

new_sar_out.txt:Average disk182_lunpath7 17.04 0.50 0 22 40065 0.00 7.86
new_sar_out.txt:Average disk182_lunpath33 17.10 0.50 0 22 40226 0.00 7.89
new_sar_out.txt:Average disk182_lunpath89 17.01 0.50 0 22 39992 0.00 7.85
new_sar_out.txt:Average disk182_lunpath108 17.01 0.50 0 22 40037 0.00 7.85
new_sar_out.txt:Average disk182 68.15 0.50 0 87 160320 0.00 7.86
old_sar_out.txt:Average disk182_lunpath7 48.97 0.50 252 0 26789 0.00 2.11
old_sar_out.txt:Average disk182_lunpath33 48.93 0.50 252 0 26837 0.00 2.10
old_sar_out.txt:Average disk182_lunpath89 44.12 0.50 252 0 26622 0.00 1.88
old_sar_out.txt:Average disk182_lunpath108 44.60 0.50 252 0 26672 0.00 1.91
old_sar_out.txt:Average disk182 100.00 0.50 1009 0 106919 0.00 2.00

chris huys_4
Honored Contributor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

Hi Martin,

I suppose this should put at least the before-midnight part of the backup back in the "11.23" neighbourhood. ;)

The reason I asked for max_q_depth to be increased was the constant 100% busy of the disk182 disk:

device   %busy  avque  r/s  w/s  blks/s  avwait  avserv
disk182  100.00 0.50   853  0    90125   0.00    2.73

This usually means that, at the current max_q_depth, i.e. IO bandwidth, the EVA disk array was delivering its maximum throughput for that LUN.

However, to check whether that really is the maximum throughput the EVA can be pushed to deliver, max_q_depth needs to be increased until %busy at least drops below 100%.

And as was shown here, the maximum throughput had indeed not been reached: with the increased max_q_depth, not only did %busy drop steeply below 100%, but the IO throughput also shot up by a fair percentage.
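
In numbers, assuming sar's 512-byte blocks: the disk182 Average went from 106919 blks/s (about 52 MByte/sec) before the change to 160320 blks/s (about 78 MByte/sec) after it, roughly 50% more throughput.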

Oh yes, also read this thread: https://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1431801 - it has an interesting whitepaper about max_q_depth on 11.31 in Duncan's reply.

Greetz,
Chris
Martin Roende
Advisor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

The numbers before were from the evening disk-to-disk dump job.

These ones are disk to tape:
Average disk182_lunpath7 45.47 0.50 234 0 24773 0.00 2.10
Average disk182_lunpath33 45.12 0.50 234 0 24748 0.00 2.08
Average disk182_lunpath89 40.76 0.50 234 0 24757 0.00 1.88
Average disk182_lunpath108 40.97 0.50 234 0 24727 0.00 1.89
Average disk182 100.00 0.50 935 0 99005 0.00 1.99
chris huys_4
Honored Contributor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

Hi Martin,

> These ones are disk to tape:
> Average disk182_lunpath7 45.47 0.50 234 0 24773 0.00 2.10
> Average disk182_lunpath33 45.12 0.50 234 0 24748 0.00 2.08
> Average disk182_lunpath89 40.76 0.50 234 0 24757 0.00 1.88
> Average disk182_lunpath108 40.97 0.50 234 0 24727 0.00 1.89
> Average disk182 100.00 0.50 935 0 99005 0.00 1.99
These numbers look very much like the ones from before max_q_depth was doubled.

Check if max_q_depth for disk182 is still 16.

# scsimgr get_attr -D /dev/rdisk/disk182 -a max_q_depth

If the current max_q_depth is back at 8, set it again, and also save it so that it survives reboots:

# scsimgr set_attr -D /dev/rdisk/disk182 -a max_q_depth=16

# scsimgr save_attr -D /dev/rdisk/disk182 -a max_q_depth=16

If the current max_q_depth was still 16, double it again to 32 and save that value as well.
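
That is, the same pair of commands with the doubled value:

# scsimgr set_attr -D /dev/rdisk/disk182 -a max_q_depth=32

# scsimgr save_attr -D /dev/rdisk/disk182 -a max_q_depth=32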

You might also double the max_q_depth value of disk158, as I thought it also showed some 100% busy values.
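
For disk158 that would be the same check/set/save sequence (the 16 assumes it is also still at the default of 8):

# scsimgr get_attr -D /dev/rdisk/disk158 -a max_q_depth

# scsimgr set_attr -D /dev/rdisk/disk158 -a max_q_depth=16

# scsimgr save_attr -D /dev/rdisk/disk158 -a max_q_depth=16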

Greetz,
Chris
Charles McCary
Valued Contributor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

Duncan,

Quick question... you state:
"- it's also not good practice to have VGs with only one LUN, as this absolutely guarantees you will only ever hit one controller on the EVA - you should always have at least 2 LUNs in a VG and then stripe across those so you get the benefit of both controllers - but again, if that's how things were set up before, it's not the source of your problem"

Doesn't the new agile pathing in 11.31 handle load-balancing across controllers automatically?
Martin Roende
Advisor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

If I look at the sar outputs, load balancing works fine when using either type of DSF (legacy or persistent).

See previous response on "sar -LdR 1 100" output on Oct 11, 2010 07:17:59 GMT
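
(For what it's worth, the load balancing policy the native multipathing uses can be checked, and changed, per LUN with scsimgr; I believe the attribute is called load_bal_policy on 11.31, with values such as round_robin and least_cmd_load. Using disk182 as an example:)

# scsimgr get_attr -D /dev/rdisk/disk182 -a load_bal_policy

# scsimgr set_attr -D /dev/rdisk/disk182 -a load_bal_policy=least_cmd_load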


At the moment we have an escalated support case, and VxFS patches are being implemented.

Regards Martin Rønde Andersen
Martin Roende
Advisor

Re: Degraded backup and Oracle dump performance after 11.23 to 11.31 upgrade

After registering a call with HP we installed the following patches, and the Data Protector backup performance is now around 100 MByte/sec; before, the maximum was 60 MByte/sec:

PHKL_41561
PHCO_40290
PHCO_41072

The RMAN dump speed still suffers; I will post further updates here.

Best regards , Martin Rønde Andersen