
SAM_24
Frequent Advisor

IO wait

Hi,

On my database server we see 40-50% I/O wait reported by sar. Using GlancePlus we verified that everything looks normal. What is causing the I/O wait? Could it be the database design? How can we see which file systems are being accessed most often? Any suggestions on how to debug further?
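For context, a sketch of how busy devices can be mapped back to file systems with the standard HP-UX tools (vg01 is just a placeholder volume group name, and the 5-second/12-sample interval is arbitrary):

# per-device activity: %busy, queue length (avque), wait and service times
sar -d 5 12
# which logical volumes live on a busy physical disk
vgdisplay -v /dev/vg01
# which file system each logical volume is mounted as
mount -v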

Thanks.
Never quit
8 REPLIES
harry d brown jr
Honored Contributor

Re: IO wait


If I remember correctly, there are some patches that correct sar reporting incorrect I/O activity. I'd suggest first applying one of the available patch bundles, preferably a newer one.
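A quick way to see what is already on the box (a sketch; the grep pattern just catches the PHCO/PHKL/PHNE/PHSS patch naming):

# installed bundles - the quality pack / GOLD patch bundles show up here
swlist -l bundle
# individual patches currently installed
swlist -l product | grep PH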

live free or die
harry
Live Free or Die
Vincent Fleming
Honored Contributor

Re: IO wait

If glance does not agree with sar about the wait I/O, then try the patch database for patches.

If it does agree, however, then it probably has something to do with either your database design/layout, or your hardware (disk array) or its configuration (including the LVM config).
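For the LVM side, the existing layout can be read off with the standard display commands (a sketch; vg01 and lvol1 are placeholders for your actual volume group and logical volume):

# which logical volumes and physical disks make up the volume group
vgdisplay -v /dev/vg01
# allocation policy and distribution of one logical volume across disks
lvdisplay -v /dev/vg01/lvol1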

Good luck!


No matter where you go, there you are.
SAM_24
Frequent Advisor

Re: IO wait

No, it's not only sar.

In the vmstat output, the b column always shows 10-14 blocked jobs. It is not a patch problem; we are up to date on patches.
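For anyone following along, this is the sampling being referred to (the 5-second interval is arbitrary):

# procs column b = processes blocked waiting on a resource, typically disk I/O
vmstat 5

A consistently non-zero b count alongside the high %wio from sar suggests real waiting rather than just a reporting artifact.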

Thanks.
Never quit
Michael Tully
Honored Contributor

Re: IO wait

You may wish to tell us a little more information (the commands sketched below will gather most of it):

OS and patch level
Model of your server
What type of disk(s) are being utilised
What type of connectivity
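Most of that can be collected with a handful of standard HP-UX commands (a sketch; ioscan generally needs root):

uname -a              # OS release and version
model                 # server model
swlist -l bundle      # installed patch bundles
ioscan -funC disk     # disks and their hardware paths
ioscan -funC fc       # Fibre Channel HBAs, if present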
Anyone for a Mutiny ?
John Poff
Honored Contributor

Re: IO wait

Hi Raj,

What are your mount options on the filesystem(s) you are using for the database?
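For reference, the options actually in effect can be read straight off the mounted file systems (a sketch; the VxFS options often discussed for database volumes, such as mincache=direct and convosync=direct, would show up here if they were set):

# each mounted file system with the options it was mounted with
mount -v
# and what is configured to be mounted at boot
cat /etc/fstab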

JP
Kirby A. Joss
Valued Contributor

Re: IO wait

I'm much interested in the wizards' responses to this as well, being in a similar situation... Management is used to watching I/O wait to project upgrades. Since the last upgrade (AL -> fabric, etc.), throughput and response seem improved, but I/O wait has increased.

HP-UX 11.00 March 2002 bundles applied
N-4000, 4 CPUs (< 50% busy), 6 GB RAM
4 Tachyon A5158A HBAs
EMC DS-16B (Brocade) FC fabric switches (1 Gb)
EMC Symmetrix 8730


My theory:
I think the overall latency to satisfy disk I/O requests has been reduced. I suspect that sar shows the serialization of the I/Os from multiple systems to the Symmetrix FA occurring at the switch level. I don't think the amount of time in I/O wait would be noticeable if the system were busier and had something else to do while the switch merges the traffic into single lanes (6 MB peak switch throughput).
Now I need to prove what's happening (how?) and find a more valid metric, if sar I/O wait is no longer a valid indicator of I/O performance.
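One metric that is more direct than %wio, if it helps, is the per-device queueing and service time reported by sar -d (a sketch; the interval and count are arbitrary):

# avwait = ms a request sat in the queue, avserv = ms the device took to service it
sar -d 5 60

Comparing avwait and avserv before and after the AL -> fabric change would show whether latency really dropped, independently of what %wio does.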
Stefan Farrelly
Honored Contributor

Re: IO wait


Hmm, we run EMC and Brocade switches on lots and lots of HP servers of all classes and OS versions (10.20/11/11i), and we never see wio% above single digits (i.e. < 10) on any server, even at peak times.

Normally a wio% of < 10 is fine, 10-20 means your disk subsystem is having trouble keeping up with I/O requests, and > 20 means you're I/O bound and performance is suffering considerably as a result. If you have a wio% of 40-50, then either something is wrong with sar reporting it (does glance confirm it or not?), or, if sar and glance both show the 40-50, then you're completely I/O bound and should be looking into improving your I/O throughput (EMC cache sizes/config/weighting), more channels, etc.
I would certainly be very worried if my wio% were that high.
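To cross-check what sar is reporting (a minimal sketch; interval and count are arbitrary):

# %usr, %sys, %wio and %idle, sampled every 5 seconds, 12 times
sar -u 5 12

and then watch the same interval in glance.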
Im from Palmerston North, New Zealand, but somehow ended up in London...
Vincent Fleming
Honored Contributor

Re: IO wait

Performance management is the pursuit of bottlenecks. Once you clear one bottleneck, your system will bottleneck on something else - ALWAYS. The trick is to get the system to bottleneck on CPU speed... not something like a lack of memory (which causes paging), disk I/O, LAN, etc. So, you optimize your peripherals as best you can, removing all bottlenecks until you are CPU bound.

When you are CPU bound, your system is going as fast as it can.

So, I think you guys misunderstand what WAIT IO means...

Processes waiting on I/O spin, which means that when they get a timeslice to run, they check whether the I/O has completed, and if not, they idle until the timeslice expires, in the hope that the I/O will complete before the timeslice ends. This behavior consumes CPU time.

WAIT IO is a measurement of this CPU consumption.

Now, WAIT IO time can be caused by several factors. The most common cause is that the disk array is overloaded, or you have configured it in a non-optimal way - such as putting your logs and dataspaces on a single mirror pair.

So, if you are seeing high WAIT IO (over 10% is high in my opinion), you need to take a good look at your disk array and its configuration.

You may not have striped over a sufficient number of volumes (not using enough drives), or the disk array may have an internal bottleneck, such as too many systems hitting the same FC port, a backplane bottleneck, etc.
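For the record, striping a logical volume across several disks at creation time looks roughly like this (a sketch; the volume group name, size and stripe parameters are placeholders, not a recommendation):

# 4-way stripe, 64 KB stripe size, 4096 MB logical volume
lvcreate -i 4 -I 64 -L 4096 -n lvol_data /dev/vgdata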


Let us know how you make out.

Good luck!
No matter where you go, there you are.