Bad I/O performance help
05-07-2003 09:04 PM
Throughput on my SAN (XP512) also looks OK - there are no FC problems, and the performance issues are isolated to this one host/application only.
The symptoms are:
* woeful I/O performance, as measured by Oracle seeing waits for I/O requests (sorry - not a DBA, I have no idea how this was measured)
* terrible backup performance (Omniback currently backs up 300GB of data in about 10-12 hours to 3 tape drives, which I would expect to take 5-6 hours).
Attached are sar buffer and disk statistics for a single day. As I read it, my disk stats are OK; I have some busy disks but nothing hot. The rcache stats I find confusing - they indicate poor buffering for large parts of the day. However, some other advice I've received is to mostly ignore rcache/wcache stats for Oracle hosts, as the SGA does most of the buffering for the database. Note that I've included only 03:00-09:00 for 'sar -d' as a typical range of data. Backup completes at about 4:00 am; the reporting jobs run from 3:00 am until late morning.
This host is set to use dynamic buffer cache (dbc_max_pct=20%, or 1.8GB). Other advice received from Oracle Metalink is that the buffer cache shouldn't be more than about 250MB in size. The HP-UX rule of thumb seems to indicate about 400MB. Is it possible that a large buffer cache such as this one is causing performance problems (e.g. poor read stats)?
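As a sanity check on those numbers: dbc_max_pct is a percentage of physical RAM, so 20% coming out at roughly 1.8GB implies a machine with about 9GB of memory. The 9GB figure is my inference, not stated in the post:

```shell
# dbc_max_pct caps the dynamic buffer cache as a percentage of physical RAM.
# 20% of RAM == ~1.8GB implies about 9GB physical memory (inferred, not stated).
ram_mb=9216          # assumed physical memory in MB
dbc_max_pct=20
cache_mb=$((ram_mb * dbc_max_pct / 100))
echo "max buffer cache: ${cache_mb} MB"   # 1843 MB, i.e. ~1.8GB
```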
I've also seen various suggestions about disabling the buffer cache for Oracle index VxFS filesystems. Does anyone have experience of this, and what performance improvements are likely?
Any other feedback you can offer would be appreciated.
05-07-2003 09:15 PM
Re: Bad I/O performance help
If swap is not between 1.0 and 2.0 times physical memory, I've seen Oracle run like a dog. This experience is on systems with less memory.
I've attached a sar script that collects more data into a file, which I'd upload to Oracle and HP support for further input.
With regard to your specific suggestion, I have not tried this.
What is the RAID level on the SAN disks? If it's not RAID 10 (1/0) there is a substantial performance hit.
That kind of RAID consumes a lot of disk, but I've seen a substantial performance boost by going RAID 10 on the data.
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
05-08-2003 12:09 AM
Re: Bad I/O performance help
For the SAN we use disk striping (64KB).
cf. this link
http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0x998c402f24d5d61190050090279cd0f9,00.html
Rgds,
Jean-Luc
05-08-2003 02:49 AM
Re: Bad I/O performance help
Swap is set to 1.5 times physical memory. This should be adequate for this host; we NEVER page.
The RAID level on the array is RAID-5. This is probably irrelevant, as the array is a fairly hefty XP512.
I've heard a lot of talk about setting the buffer cache to about 400MB, but no evidence that setting large values will cause performance issues. Does anyone have any information which will confirm or deny this?
05-08-2003 03:02 AM
Re: Bad I/O performance help
Has this performance problem gradually or suddenly appeared?
Also :-
Recommended Uses: RAID 5 is seen by many as the ideal combination of good performance, good fault tolerance and high capacity and storage efficiency. It is best suited for transaction processing and is often used for "general purpose" service, as well as for relational database applications, enterprise resource planning and other business systems.
For write-intensive applications, RAID 1 or RAID 1+0 are probably better choices (albeit higher in terms of hardware cost), as the performance of RAID 5 will begin to substantially decrease in a write-heavy environment.
Might be worth looking at RAID 1+0.
Paula
05-08-2003 03:58 AM
Re: Bad I/O performance help
First, let's look at your maxssiz, maxdsiz, maxtsiz and their 64-bit counterparts. Here are ours from a small (3TB) server:
maxdsiz 0X50000000
maxdsiz_64bit 0X400000000
maxfiles 2048
maxfiles_lim 2048
maxssiz 0X800000
maxssiz_64bit 0X40000000
maxswapchunks 2048
maxtsiz 0X4000000
maxtsiz_64bit 0X40000000
If yours are smaller than these, you may want to adjust your tuning parms....
05-08-2003 04:17 AM
Re: Bad I/O performance help
I don't have any experience with this, just something I remembered from class.
I would also look at network issues, such as half duplex vs. full duplex, if your backups are going over the network. We have run into this issue in the past.
05-08-2003 05:19 AM
1 - Disk queues. sar's 0.5 disk queue is generally MeasureWare zero; anything above 0.5 is a real queue. Nearly ALL your disks had queues against them, some really quite large.
2 - Service times. The XP512 probably has caching; a typical LUN's service time would be 1-4ms, and nearly ALL of your LUNs were far greater than this. At a guess you are probably using 10,000 or 15,000 rpm disks, which in their RAW state should be producing service times between 8 and 5ms. You have times of up to 25ms!! I know you are using RAID5 LUNs, but in choosing RAID5 someone thought it would be BETTER than RAID1+0 or just RAID1. These service times are not very good. ** see later
3 - Root disks. I'm guessing that your root disks are c1t6d0 & c2t6d0. These are struggling with an average queue of 3-4 and service times of 12ms. I assume they are a straight mirrored pair. I suspect that either swapping or database I/Os are going on on these disks.
A previous reply suggested that RAID5 was OK for OLTP databases. I would disagree. RAID5 is slow for small random writes/updates, because a small write to one disk within a stripe causes the WHOLE RAID5 stripe to be read, and then updated parity and data to be written. As an example, assume 6 disks in RAID5 with a stripe size of, say, 64kB. A 2kB write/update will trigger a full stripe read (384kB), then write back 64kB parity and 2kB data. Thus 2kB has caused 450kB of activity! OLTP (On-Line Transaction Processing) is precisely this: updating customer details, placing orders etc.
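The write-amplification arithmetic in that example can be checked directly (same figures as quoted: 6 disks, 64kB chunk per disk, 2kB update, full-stripe read as described):

```shell
# Worked numbers from the RAID5 example above: a 2kB update triggers a
# full-stripe read plus a parity and data write-back.
disks=6; chunk_kb=64; write_kb=2
stripe_read_kb=$((disks * chunk_kb))                 # 384 kB read
total_kb=$((stripe_read_kb + chunk_kb + write_kb))   # + 64kB parity + 2kB data
echo "${write_kb}kB logical write -> ${total_kb}kB of I/O"   # 450kB total
```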
What could be done?
o Investigate why there are such high queues on the root disks (assuming c1t6d0 & c2t6d0 are the ones).
o Check out the disk activity more thoroughly over time. I would personally use MeasureWare.
o Take some advice (HP) about the best striping & RAID methodology. We use VA74xx and get 1-3ms service times with NO disk queues. The VA74xx is cheaper and not as good as the XP512, so I would expect more out of the XP512.
o If you can get MeasureWare, check out:
GBL_PRI_QUEUE
GBL_RUN_QUEUE
GBL_DISK_SUBSYSTEM_QUEUE
GBL_MEM_QUEUE
GBL_IPC_SUBSYSTEM_QUEUE
GBL_NETWORK_SUBSYSTEM_QUEUE
Generally speaking, if you are bottlenecked on I/O, GBL_RUN_QUEUE will be > 1 and GBL_PRI_QUEUE < 0.5 or 0. This simply means there are lots of runnable processes, but they are NOT in contention for CPU (and thus are contending for I/O).
You can also analyse each of your disks in more depth. I've attached the report file I would use and a script to sort through it.
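That run-queue/priority-queue rule of thumb can be sketched as a small helper. The thresholds (> 1 and < 0.5) are the ones quoted in the post; this is an illustration, not a MeasureWare tool:

```shell
# Heuristic from the post: run queue > 1 with priority queue < 0.5
# suggests processes are runnable but waiting on I/O, not CPU.
classify() {
  awk -v r="$1" -v p="$2" 'BEGIN {
    if (r > 1 && p < 0.5) print "likely I/O bound";
    else print "not clearly I/O bound"
  }'
}
classify 3.2 0.1
classify 0.4 0.1
```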
05-08-2003 05:21 AM
Re: Bad I/O performance help
05-08-2003 05:49 AM
Re: Bad I/O performance help
1. The disk stats show high busy%. This is caused by the number of requests as well as the number of blocks being transferred.
I noticed that they are all going over controller c5. You should set up another controller path to the devices and use both of them for everything, to share the load and incidentally help with redundancy. If it's just 1Gb fibre, then all these requests down a single strand could be a bottleneck; 2Gb fibre is obviously better. However, don't take my word for it - check the fibre stats.
2. The high ratio of requests to blocks may indicate a too-small read-ahead parameter in Oracle. Ask your DBA to try increasing it.
3. RAID 5 is especially awful for sequential operations, and backups are the biggest example, because the seek latency for each block is multiplied by the number of disks it is spread over. A backup will also reduce the efficiency of your XP cache.
4. If you are using filesystem files to store the Oracle data, consider mounting the VxFS filesystems with the options mincache=direct,convosync=direct.
Then your writes will go direct to disk, bypassing the buffer cache. You can then reduce the HP-UX buffer memory (dbc_max_pct) and give it to your Oracle SGA.
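For illustration, such a mount might look like the fragment below. The volume group, lvol and mount point names are placeholders, not taken from this thread:

```shell
# Hypothetical /etc/fstab entry for an Oracle data filesystem with VxFS
# direct I/O (device and mount point are placeholders):
#
#   /dev/vg01/lvol_oradata /u02/oradata vxfs delaylog,mincache=direct,convosync=direct 0 2
#
# Equivalent one-off remount:
#   umount /u02/oradata
#   mount -F vxfs -o mincache=direct,convosync=direct /dev/vg01/lvol_oradata /u02/oradata
```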
HTH, Steve
05-08-2003 06:53 AM
Re: Bad I/O performance help
(sar -d columns: device, %busy, avque, r+w/s, blks/s, avwait, avserv)
Average c5t0d1 92.41 1.74 232 3693 8.55 20.57
Average c5t0d2 60.04 0.64 231 3688 5.35 7.13
Average c5t0d3 92.54 1.79 232 3698 8.76 20.70
Average c5t0d4 45.20 0.64 128 2042 5.26 8.18
Average c5t0d5 79.40 0.74 127 2039 5.68 21.60
Average c5t0d6 48.44 0.64 127 2033 5.29 9.02
Average c5t0d7 72.00 1.66 175 2799 8.35 25.66
Average c5t1d1 71.81 1.68 175 2798 8.38 25.66
Average c5t1d2 33.17 0.70 176 2804 5.39 5.59
Average c5t1d3 63.04 2.24 94 2485 9.95 23.05
The XP512 has a notorious performance characteristic where Processor 1 handles all even-numbered disks in the array and Processor 2 all odd-numbered disks, which MANDATES properly load-balancing the I/Os across even and odd LUNs for good performance. That means striping the LUNs through LVM, as well as a round-robin PV link configuration (Pri Alt Alt Pri) or the AutoPath utility.
Can you attach your I/O stats from an 'fcmsutil' report? You seem to be heavily using only one HBA and ignoring the other. This is the c5 issue.
fcmsutil /dev/td0 (1,2,etc.) stat
This problem with the XP512 exists before cache, so adding more cache will not help. It's all about load balancing, and your load indicates all I/O goes through one HBA, c5.
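As a rough illustration of checking the even/odd balance, one could classify devices by number parity. Note that the d-number in an HP-UX device file is a SCSI LUN, not necessarily the XP LDEV number, so treat this purely as a sketch:

```shell
# Sketch only: tally devices by the parity of their d-number, per the
# even/odd processor split described above. Device names are from the
# sar output in this thread; the parity rule is illustrative.
parity() {
  n=${1##*d}                       # strip everything up to the last 'd'
  if [ $((n % 2)) -eq 0 ]; then echo even; else echo odd; fi
}
for dev in c5t0d1 c5t0d2 c5t0d3 c5t0d4; do
  echo "$dev $(parity "$dev")"
done
```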
05-08-2003 03:02 PM
Re: Bad I/O performance help
It appears that disk layout is probably my issue.
For those of you with XP512 knowledge, I'll describe the layout of the filesystems. Note that we had assistance laying out the filesystems from the disk array vendor, making the assumption they knew more than us...
The XP512 serves 10 LUNs to this volume group; these LUNs are LUSE OPEN-E*4 (4 x 14GB). At least one of these LUSEs has 2 members on the same array group. The filesystems are then striped across 3 LUNs (stripe size 8KB) using LVM. I've since found out that the XP has a cache line size of 64KB, so 8KB stripes probably aren't using the cache effectively. See the following link for more info.
http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0xa5f2e822e739d711abdc0090277a778c,00.html
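The mismatch between the 8KB LVM stripe and the 64KB XP cache line can be put in numbers (figures as above):

```shell
# With an 8KB LVM stripe and a 64KB XP cache line (figures from the post),
# one cache-line-sized sequential I/O is split into 8 separate stripe chunks
# scattered across the 3 LUNs, instead of filling one cache line per LUN.
cache_line_kb=64; stripe_kb=8; luns=3
chunks=$((cache_line_kb / stripe_kb))
echo "a ${cache_line_kb}KB I/O touches ${chunks} stripe chunks across ${luns} LUNs"
```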
The application degradation has been slow. One factor which might account for this is the growth experienced over the last 6-12 months (after the app was moved to the XP512). Growth has been 'steady' rather than 'explosive'.
The array is laid out RAID-5 per HP's advice. As this array forms the backbone of our SAN, and we have already allocated 5 of the 8TB, migrating from RAID-5 to RAID-1 isn't an easy task, and certainly not at the top of my 'quick fix' list. I probably wouldn't even attempt this for one array group.
Backups are not network based, but go to local SCSI DLT7000s. FC is 1Gb; link utilisation from EFC Manager shows a 5% max watermark. Everything goes down c5, as I haven't investigated how to get LVM to do dynamic pathing (i.e. after a reboot c5 ALWAYS becomes the primary). The XP CHIP port utilisation for 'c5' (CHIP port 1A) sometimes bursts to 50%; average load is generally less than 5%. There are 5 HP-UX hosts on the SAN which can access CHIP 1A.
c1t6d0 and c2t6d0 are the mirrored root disks; primary swap is 4GB and is part of root. There is no swapping at any time of the day (sar -w shows zeros across the board). Yes, I'm disturbed by the usage of these 2 PVs, and have yet to identify the cause (this seems to be the norm for all my HP-UX servers).
I have Glance installed, but not MeasureWare. I haven't really got into using Glance yet and find it not very helpful. I'm an old-school admin who gets a lot of info out of a pile of text...
The things I hate about my current setup:
* LVM striping - this gives me no flexibility to alter the disk layout online. For example, if I want to pvmove one of the PVs I need to do it offline.
* The stripe size probably isn't large enough. Our initial testing obviously wasn't robust enough to simulate full user loads.
* LUSE volumes need to be set up very carefully. Once they are allocated and in use you have no easy way to alter the config. For example, I suspect that some of my ldevs on a busy LUSE are on the same array group as another busy LUN on another application/host. Collating this information will be a time-consuming business.
My next course of action is to lay out an alternate disk configuration, arrange some application outage time and run some load tests and compare the results between the 2 layouts.
Things were simpler back in the days of SCSI attached storage.
05-08-2003 04:06 PM
Re: Bad I/O performance help
Also regarding stripes: use PV groups with 4 disks per group, and add more groups as you expand. See /etc/lvmpvg:
vg01
pvg1
c1t9d0, etc.
Regarding alternate PV links after reboot: use pvchange and vgexport / vgimport to get the disks aligned in a pri > alt > alt > pri sequence.
pvchange -s n/y /dev/dsk/cXtYdZ
vgexport -p -v -s -m /tmp/mapfile /dev/vg##
vi /tmp/mapfile (* order your disks pri alt alt pri *)
mkdir /dev/vg##
mknod ....
vgimport -p -v -s -m /tmp/mapfile /dev/vg## (* -p for preview *)
05-08-2003 04:37 PM
Re: Bad I/O performance help
Assume that the fastest way to the LUN is through the controller that owns it. If so, do you get the same performance if you attempt to access the disk through the other controller? Consider this link:
http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0x840e31ec5e34d711abdc0090277a778c,00.html
When the even-disk processor gets a transaction intended for an odd disk on the XP512, does it attempt the write and then query the other processor when it fails to handle the transaction?
Here's a good LVM example:
http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0xd770abe92dabd5118ff10090279cd0f9
05-09-2003 01:38 AM
Re: Bad I/O performance help
Example: if the block size of your DB is 8K (the default) and the block size of your lvol's filesystem is 1K, you will have 8 I/O reads. If your filesystem block size were also 8K, you would have only 1 I/O read. Does this make sense?
Read the man pages for newfs_vxfs and mkfs_vxfs and have a look at the parameter -b (newfs) and bsize (mkfs) for the block size.
So your Oracle DB files should always reside on a filesystem which has the same block size as you have defined for your database. As we all know, you can't change the block size after you have created the DB or filesystem. So my advice is: back up your DB, create a new filesystem with the same block size as the DB, and then restore the data to that filesystem. It will increase your performance a lot.
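The 8-to-1 read amplification in that example is just this division (figures as quoted: 8K DB block, 1K filesystem block):

```shell
# An 8KB Oracle block on a 1KB-block filesystem turns one logical read
# into 8 filesystem reads; matching block sizes reduces it to 1.
db_block_kb=8; fs_block_kb=1
reads=$((db_block_kb / fs_block_kb))
echo "one ${db_block_kb}KB DB read -> ${reads} filesystem reads"
```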
Regards
Roland
PS: As you can see in the man page for mkfs_vxfs, the default block size depends on the size of the file system.
05-11-2003 05:34 PM
Re: Bad I/O performance help
10-28-2004 06:09 AM
Re: Bad I/O performance help
Just stumbled across this old thread, and it was good reading.
Curious how things are going many, many months later.
From this reading I am glad we went with EMC for our disk array.