Operating System - Tru64 Unix

Christof Schoeman
Frequent Advisor

Collect I/O stats report

Hi

Attached is a graph for the I/O stats of a particular disk on my system, but something does not add up.

The graph shows a wait queue of nearly 6000, yet there are only about 600 writes per second and very few reads. My question is: where did the items in the queue come from?

There seems to be a much stronger correlation between the throughput (KB Written/Sec) and the wait queue than between Reads/Sec or Writes/Sec and the wait queue.

Am I misinterpreting the stats? If so, your advice would be greatly appreciated.

Regards
Venkatesh BL
Honored Contributor

Re: Collect I/O stats report

Did you try 'normalizing' the output?
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

Hi

After normalizing the graph, it is like counting as the Irish do - one, two, many, lots:-)

The graph now says that there were lots of writes, which resulted in lots of items in the wait queue, causing lots of throughput.

I am currently busy troubleshooting a performance issue that requires exact figures, but the stats don't add up.

Here is how I see it, but perhaps you can point out a flaw in my reasoning:
- Each read and write I/O is placed in the queue of a particular LUN for processing.
- If the reads and writes come in faster than the device can process them, the queue will build up, resulting in delays.

Therefore, if there are 6000 items in the queue, there must have been more than 6000 reads plus writes, because the LUN keeps processing them as they come in.
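A minimal sketch of this queueing argument, assuming a LUN completes a fixed number of I/Os per second (real devices are not this simple). With a_t I/Os arriving in second t and a service capacity of c per second, the queue evolves as q_t = max(0, q_{t-1} + a_t - c), so it can never hold more I/Os than have arrived in total. The function name and the load figures are illustrative only.

```python
from typing import List


def simulate_queue(arrivals: List[int], capacity: int) -> List[int]:
    """Return the queue length at the end of each 1-second step.

    arrivals[t] is the number of reads + writes issued during second t;
    capacity is how many I/Os the LUN completes per second (assumed constant).
    """
    queue = 0
    history = []
    for a in arrivals:
        # New requests join the queue; the device drains up to `capacity` of them.
        queue = max(0, queue + a - capacity)
        history.append(queue)
    return history


if __name__ == "__main__":
    # Hypothetical load: 600 I/Os per second against a LUN that completes 500/s.
    load = [600] * 10
    history = simulate_queue(load, capacity=500)
    print(history[-1])  # 1000: the backlog grows by 100 each second
    # The queue can never exceed the total number of I/Os that arrived.
    assert history[-1] <= sum(load)
```

Under this model a queue of 6000 requires arrivals to outrun completions by 6000 over the period, which is the point being made here.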

However, this is not what the "un-normalized" graph says.

Hope you can help.
Mark Poeschl_2
Honored Contributor

Re: Collect I/O stats report

I suspect what you're seeing reflects the fact that some collect data is always normalized over 1 second intervals and other data is an instantaneous snapshot. From the 'collect' man page:

" Normalization of Data

Where appropriate, data is presented in units per second. For example, disk
data such as kilobytes transferred, or the number of transfers, is always
normalized for 1 second. This happens no matter what time interval is
chosen. The same is true for the following data items:

+ CPU interrupts, system calls, and context switches.

+ Memory pages out, pages in, pages zeroed, pages reactivated, and pages
copied on write.

+ Network packets in, packets out, and collisions.

+ Process user and system time consumed.

Other data is recorded as a snapshot value. Examples of this are: free
memory pages, CPU states, disk queue lengths, and process memory."

So your I/O rates and throughput figures are 1-second averages, but the queue depth is an instantaneous snapshot. What interval are you using to collect this data? 'collect' really isn't the ideal tool for short-interval data collection like this. I find 'iostat' or 'advfsstat' (assuming you're on AdvFS) more useful.
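A small worked example of the difference between a normalized rate and a snapshot. The 10-second interval and the counter values are assumptions chosen for illustration, since the actual collect interval was not stated: a rate is the count accumulated over the interval divided by the interval length, while a snapshot such as queue length is reported exactly as observed at the sample instant.

```python
def normalize(delta_count: int, interval_s: float) -> float:
    """Convert a count accumulated over one interval into units per second."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return delta_count / interval_s


# Hypothetical sample: over a 10-second interval the write counter advanced
# by 6000, and the queue length seen at the sample instant was 5800.
interval_s = 10.0
writes_in_interval = 6000
queue_at_sample = 5800

writes_per_sec = normalize(writes_in_interval, interval_s)  # 600.0, a rate
queue_reported = queue_at_sample  # 5800, a snapshot, never divided by the interval
print(writes_per_sec, queue_reported)
```

Under this assumed interval, a rate of 600 writes/sec still means about 6000 writes during the interval, so the two figures are not in the same units and cannot be compared directly. The longer the interval, the more I/Os hide behind a given per-second rate.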
Victor Semaska_3
Esteemed Contributor

Re: Collect I/O stats report

Christof,

Looking at the graph it's hard for me to tell which line represents what. I suggest you produce 3 graphs instead of one as follows:

Graph 1: Active Queue & Wait Queue
Graph 2: Reads/Sec & Writes/Sec
Graph 3: KB Read/Sec & KB Written/Sec

I suspect you may be interpreting the lines incorrectly. What you think is the wait queue may actually be I/Os per sec or KB per sec.

Vic
There are 10 kinds of people, one that understands binary and one that doesn't.
Victor Semaska_3
Esteemed Contributor

Re: Collect I/O stats report

Forgot to mention, don't normalize the data when you produce the three graphs.

Vic
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

Hi

Did some further digging, but the thick only plottens:-(

Some background - the users sometimes complain that their actions take a long time to complete. This is also reflected in the Oracle database, which sometimes has to wait up to 20 seconds for a transaction to complete because it is waiting for an I/O.

So, I am trying to figure out what is happening on the I/O subsystem.

I saw queues forming on some disks, but no sudden burst of I/Os going to those disks that would cause a queue to build up. So I wrote a little script that sends a single I/O to the disk at 1-second intervals, just to see how long each I/O takes to complete. I found that in most cases the I/O completes in a fraction of a second, but when there is a queue it can take up to 20 seconds, which ties in with what Oracle is experiencing.
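The original probe script is not shown in the thread. A minimal Python sketch of the same idea follows, assuming a read is an acceptable probe and that the process has read permission on the target. The block size, read offset, and device path in the usage comment are placeholders, not values from the thread. Raw devices typically require sector-aligned offsets and sizes, and a storage controller cache may still serve repeated reads of the same block quickly.

```python
import os
import sys
import time

BLOCK = 8192  # bytes per probe read (assumed; kept sector-aligned for raw devices)


def probe_loop(path: str, count: int, interval_s: float = 1.0) -> list:
    """Issue one read per interval and return each read's latency in seconds."""
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i in range(count):
            start = time.monotonic()
            os.pread(fd, BLOCK, 0)  # re-read the first block each time
            latencies.append(time.monotonic() - start)
            if i < count - 1:
                time.sleep(interval_s)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python probe.py /dev/rdisk/dskNc   (placeholder device name)
    for lat in probe_loop(sys.argv[1], count=60):
        print(f"{lat:.3f} s")
```

The file descriptor is opened once so that the timing covers only the read itself, not the open call.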

I used iostat to collect information about the load on that disk, and used monitor to get the queue length (if you know of a better way to get queue length information, please let me know).

My question is - if very few I/Os are going to a disk, what can cause a queue to build up so badly that a single I/O takes 20 seconds to complete?

Long story, I know, and I hope it makes sense. Any help will be most welcome.
Ivan Ferreira
Honored Contributor

Re: Collect I/O stats report

If you are collecting statistics using collect, then you should have the "process" statistics (-s p). In the process statistics you can see the IBk and OBk columns, which may help you find the process that is doing the most I/O.
Why do it the hard way, when you can do it the easy way?
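Once the PID, command, IBk and OBk values have been extracted from collect's process report, ranking processes by combined I/O is a simple sort. This sketch skips the parsing step because the exact column layout is not shown in the thread; the ProcIO record and top_io function are hypothetical names, and the sample values are made up.

```python
from typing import List, NamedTuple


class ProcIO(NamedTuple):
    pid: int
    command: str
    ibk: float  # value taken from the IBk column
    obk: float  # value taken from the OBk column


def top_io(records: List[ProcIO], n: int = 5) -> List[ProcIO]:
    """Return the n processes with the largest combined IBk + OBk."""
    return sorted(records, key=lambda r: r.ibk + r.obk, reverse=True)[:n]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        ProcIO(101, "oracle", 12.0, 340.0),
        ProcIO(202, "cron", 0.0, 1.0),
        ProcIO(303, "tar", 90.0, 5.0),
    ]
    print([r.pid for r in top_io(sample, 2)])  # [101, 303]
```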
Christof Schoeman
Frequent Advisor

Re: Collect I/O stats report

That is the problem. Nobody is generating excessive I/O, yet a queue still builds up.

All the disks in question contain raw volumes used by Oracle.

I'm not too comfortable with the queue stats, though. collect shows queues of up to 2000, whereas monitor only shows queues of 20 or so. Are there better ways of getting disk queue information?
Han Pilmeyer
Esteemed Contributor

Re: Collect I/O stats report

Doesn't sound like normal behavior. Perhaps you should start by describing the configuration and the version. Please don't forget to include information about the storage.