
IO performance puzzle

 
SOLVED
Alex Georgiev
Regular Advisor

IO performance puzzle

Alright! This is my chance to learn a little bit about how disk IO is done in 11.23...

I have an rp8420 w/ 12 CPUs and 32GB RAM. It's running 11.23 (Sept 2006 release + all Oracle, network, security & LVM patches).

I had to restore 4 GB of data to the local disk (Data Protector 6.0 if you're curious).

The disk: 72GB, 10K RPM, LVM boot disk, not mirrored at the time of the restore.

LV & file system: 6GB logical volume, VxFS layout v6. Block size of 8 KB. I had resized the VxFS intent log to 64MB on this LV. It was mounted with the delaylog option (as is standard). 4.8GB of buffer cache.

I started the restore, and the "interactive response" from the machine became very slow - you would hit enter, or type a few characters and they wouldn't show up on the screen for about 10 seconds. The only other thing happening on the machine at the time was a script creating printers... so basically the machine was idle with very little other IO going on.

I brought up glance, and I'm attaching a screen shot. My questions are:

1) The first thing that puzzles me is the queue length: 5473. How is that possible? I thought that a disk IO queue can only hold about 256 requests?

2) The screen shot was taken a few seconds after the peak had passed. The number of IOs on the bottom, 1007, was actually 1200+ for a few seconds during the "peak", and then it started going down. So did the KB/sec metric, which was just 10,000 and kept going down. What kind of a saturation point did I reach?

3) Finally, the number of Logical IOs is 10x the number of Physical IOs. Is this a coincidence, or does the buffer cache pack 10 logical IOs into 1 physical?

Basically I'm wondering what was going on. I'll settle for a reasonably logical explanation.

...especially if you can tell me why a 12 CPU machine was slow to respond for a minute. I have never had a restore cause a machine to become "sluggish".

P.S. Finally, what is a good number of KB/sec throughput to a local disk?

Thanks!
Hein van den Heuvel
Honored Contributor

Re: IO performance puzzle

No full answer, but some observations.

- I suspect you are bogging down on the intent log activity.

- What was the tool used for the restore?
Lots of little files? Was it against a clean structure, or overwriting existing files?

- The average IO size from the glance picture is 6513.7/691.7 = 9.4 KB, which is fine.
The instantaneous one is 4669/1058 = 4.4 KB, which is small. So the system is doing lots of little IOs... probably the 1KB intent log writes. Tried logiosize=4096?

>> I had resized the VxFS intent log to 64MB on this LV.
And that was verified, right?
You didn't accidentally fat-finger the option and get nailed with the minimum 256K buffer?
What are the exact mount options in use?
/usr/sbin/mount -v


- If the queue is 5000+ then any other operation on that disk will be delayed by roughly 5000/200 = 25 seconds... noticeably slow.

>>> especially if you can tell me why a 12 CPU machine was slow to respond for a minute.

Well, the glance picture shows more than 8% system time. It's not unlikely there is a high level of kernel serialization happening, making stuff wait on just 1 out of those 12 CPUs (or rather, only 1 can make forward progress on those serialized operations at a time).

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting
Alzhy
Honored Contributor

Re: IO performance puzzle

Greetings!

The HW path you're referring to as having massive queuing is the OS disk... Why are you restoring to such a disk? Do you also use the OS disk (internal to the rp8420) as your DB store? If so, then that's where your problem is -- these internal OS disks can be really slow and can queue up significantly when you're doing massive I/Os to them...
Hakuna Matata.

Re: IO performance puzzle

Let me try and explain why your system is becoming sluggish and what you *could* do and what you *should* do to resolve it.

HP-UX is most definitely a Server OS (and certainly from 11.23 onwards the OS doesn't even run on workstations), and for a long time the developers have been tuning the OS to provide better server performance.

Some of the assumptions we make about Servers is that we have *separate disks for the OS* and don't use those disks for anything much apart from the OS, and that we *break out differing workloads onto separate disks*. Various algorithms in the kernel make these assumptions when endeavouring to provide balanced performance.

So for instance, when HP-UX looks at queues to disks, it sorts those queues based on 2 parameters (in order):

1) The priority of the requesting process
2) The block number of the requested IO in ascending order

The first always has to be the case in a timesharing system; the second is designed to reduce disk head movement and therefore increase overall IO efficiency (by ensuring that for IOs of equal priority in the queue the disk heads scan across the disk just once). This generally works great when we separate out IO loads onto separate spindles, as we would on a well architected server.
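To make that ordering concrete, here's a minimal C sketch of the two-key compare (illustrative only - not the actual HP-UX driver code; the struct and field names are invented):

#include <stdint.h>
#include <time.h>

struct io_req {
    int      prio;       /* requesting process priority (lower number = stronger, assumed) */
    uint64_t block;      /* starting block number of the request */
    time_t   queued_at;  /* when the request entered the queue (used by the variant further down) */
};

/* Default ordering: priority first, then ascending block number (elevator-style scan). */
int disksort_cmp(const struct io_req *a, const struct io_req *b)
{
    if (a->prio != b->prio)
        return (a->prio < b->prio) ? -1 : 1;   /* stronger priority first */
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1; /* then ascending block number */
    return 0;
}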

However one of the issues with this is that when we do a large amount of sequential IO to a disk (e.g. a restore!), any random IO to the same disk is going to suffer as its requests will keep getting pushed down the IO queue by the sorting algorithm preferring the sequential IO. When that disk is the OS disk (which is pretty much always doing small amounts of random IO as OS commands are run and libraries are loaded) the upshot can be a very sluggish system.

Luckily HP know that there are situations where we don't have any choice but to do this, and so provide the kernel parameter disksort_seconds. You won't find much info on this as we'd really rather people did the right thing and didn't do large sequential IOs to the system disk, but here's what it does.

When disksort_seconds is changed from its default value of zero, we change the sorting algorithm as follows:

In order:
1) The priority of the requesting process
2) The time the IO has been on the queue relative to disksort_seconds
3) The block number of the requested IO in ascending order

So now, for any given IO of the same priority I know that my IO will get processed before any IOs that came in 'disksort_seconds' later.

This is fairer on the random IOs, but ultimately delivers less efficient IO.
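Again as a rough sketch (reusing the invented struct from the earlier example, and assuming disksort_seconds != 0), the change amounts to inserting an age tier between the two keys:

/* With disksort_seconds set, an IO that has waited at least disksort_seconds
 * longer than another sorts ahead of it, regardless of block number. */
int disksort_cmp_aged(const struct io_req *a, const struct io_req *b,
                      time_t now, long disksort_seconds)
{
    if (a->prio != b->prio)
        return (a->prio < b->prio) ? -1 : 1;

    long age_a = (long)(now - a->queued_at) / disksort_seconds;
    long age_b = (long)(now - b->queued_at) / disksort_seconds;
    if (age_a != age_b)
        return (age_a > age_b) ? -1 : 1;       /* older bucket first */

    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    return 0;
}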

You don't tend to see this on OSs like Windows and Linux, which ultimately come from a PC lineage where, with one (usually slow - 7200rpm) disk in a system, these sorts of algorithms wouldn't be efficient; so they usually have something similar to disksort_seconds enabled by default (IIRC Linux calls it the deadline IO scheduler).

So you *could*:

Alter disksort_seconds to 0, 1, 2, 4, 8, 16, 32, 64, 128, or 256 seconds. The lower values should certainly have the effect of making the system less sluggish, but will also no doubt slow down your restore to some degree.
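For example, something like this via the 11.23 kctune interface (hedged - check the exact behaviour and whether the tunable is dynamic on your release):

# kctune disksort_seconds            # show the current value
# kctune disksort_seconds=2          # try a small, fairer setting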

But you *should*

Stop doing large sequential IOs to your system disk.

One other thing you could try might be to nice the data protector restore agent, as priority is still what is sorted on first in the sorting algorithm. I can't say what Data Protector would think of that though - it could cause other issues.
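For instance (hedged - the agent process name and PID are hypothetical, and I haven't tested how Data Protector reacts to a reniced agent):

# ps -ef | grep -i bda               # locate the Data Protector disk agent (name varies)
# renice -n 10 -p <PID>              # push it to a weaker priority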

HTH

Duncan


I am an HPE Employee
Alex Georgiev
Regular Advisor

Re: IO performance puzzle

>> What was the tool used for the restore? Lots of little files? Was it against a clean structure, or overwriting existing files? <<

Data Protector 6.0. Lots of little files (50,000 files taking up 4GB of space). The mount point was clean at the time - freshly created to host the restore.

>> Tried logiosize=4096? <<

# fsadm -F vxfs -L /epic
UX:vxfs fsadm: INFO: V-3-25669: logsize=8192 blocks, logvol=""

logsize is 8192 blocks * 8KB = 64MB.


>> And that was verified, right? You didn't accidentally fat-finger the option and get nailed with the minimum 256K buffer? <<

See above.

>> What are the exact mount options in use? <<

/dev/vg00/lvepic on /epic type vxfs ioerror=mwdisable,delaylog,dev=40000009 on Sat Apr 7 18:47:02 2007

>> Why are you restoring to such disk? Do you also use the OS (Internal to the rp8420) as your DB Store? <<

No, local disk is not a DB store. This is binaries & scripts for a custom app. It doesn't really matter why I'm restoring... The question is simply "What happens with the system when I start this restore?"

>> But you *should* stop doing large sequential IOs to your system disk. <<

Agreed, but sometimes you just have to practice DR procedures.
Alzhy
Honored Contributor

Re: IO performance puzzle

Herr G.,

SCSI disks, especially if you are using the same disks as your OS, ARE SLOW and can make your system CRAWL... You can induce queuing with several massive reads/writes, i.e. your DP restore job.


Hakuna Matata.
Hein van den Heuvel
Honored Contributor

Re: IO performance puzzle

[Alex, sorry for possibly hijacking the thread, but we all just want to understand better no?]

Hey Duncan, thanks for the excellent, detailed, description!

This suggests that while a deep queue is visible to the driver, the actual device does not see (much of) a queue ?!?!

0) Good point on the single disk. I forgot to comment on that earlier.

1) I'm surprised to see the unit for disksort_seconds is seconds. Well, not very surprised given a name like that, but a single second already seems an awful long time, let alone settings higher than that!
If an application, after getting a disksort boost, needs a further IO, it could be outside the current 'sweep' and have to wait for a long time again. Yikes.

2) This is a great algorithm for simple, local disks. But when applied to SAN controllers, could it not spell disaster?

2A) With 4 or 8 or maybe 50 disks behind a LUN, don't we want at least as many IOs outstanding to the LUN, to increase the chances of making all the disks busy and contributing?

2B) The EVA will actually postpone allocating actual storage chunks (1MB? 4MB?) until first written. Thus if you have two LVs on a PV which are filled up gradually, the physical blocks on the underlying disks will be interspersed, not ordered by LV.

2C) If you have a striping (RAID-0, RAID-5) controller with a large chunk size, then issuing the IOs strictly in order will focus the IOs on 1 or a few members underlying the LUN, thus letting IO power go to waste?

3) Could it also explain lots of SYSTEM time, as sorting 1000+ IO elements, or at the very least keeping them in order, takes some CPU cycles alright!
Can we readily recognize this particular CPU consumption with a profiling tool?


Thanks!
Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting


Alzhy
Honored Contributor

Re: IO performance puzzle

Hein and Duncan,

This gentleman's problem is rather simple.

He is doing massive I/Os (on account of his restore) to one of the rp8420's internal SCSI drives... And it can croak... really badly, via massive queueing...

That is why we leave the OS disk alone... No binaries, no temp file storage, no 3rd party application binaries (/opt/oracle or /opt/whatever are mounts to non-OS SCSI disks...)
Hakuna Matata.
Alex Georgiev
Regular Advisor

Re: IO performance puzzle

[Nelson, we are not trying to solve a problem! We are trying to gain some good systems knowledge.]

Duncan, thank you so much for the nice & simple explanation! So that's the reason why disksort_seconds doesn't have a man page.

Just like Hein, I too have some follow-up questions:

1) Does the disk IO queuing algorithm apply to SAN disks as well as local disk?

2) It seems that prioritizing IOs based on block number does not ensure in-order writes. Is that correct? Are the O_DSYNC/O_SYNC flags for open(2) the only way to ensure in-order writes?

3) I'm noticing an "sblksched" process running with a priority of 100. Is that the IO request scheduling process?

4) What about the queue length? Doing scsictl on the local disk in question shows me: queue_depth = 8. What is glance showing me with Qlen?

Many thanks!
Solution

Re: IO performance puzzle

Alex, Hein,

OK - you're stretching my knowledge of this subject considerably now, but let me try and answer your questions as best I can:

Hein's questions first:

Q1) I'm surprised to see the unit for disksort_seconds is seconds. Well, not very surprised given a name like that, but a single second already seems an awful long time, let alone settings higher than that!
If an application, after getting a disksort boost, needs a further IO, it could be outside the current 'sweep' and have to wait for a long time again. Yikes.

*A1) No, because any random IOs even on the current sweep will be satisfied before those sequential IOs that came in disksort_seconds later. But you're right to point out that it doesn't help a great deal - that's why it's largely undocumented.

Q2) This is a great algorithm for simple, local disks. But when applied to SAN controllers, could it not spell disaster?

*A2) How so? I think the real question here is "without understanding the geometry of a SAN LUN, can we come up with a better algorithm?", to which I think the answer is no, we can't - we need to know too much about how the LUN is physically made up to come up with any sensible logic, and we just can't do that right now.

Q2A) With 4 or 8 or maybe 50 disks behind a LUN, don't we want at least as many IOs outstanding to the LUN, to increase the chances of making all the disks busy and contributing?

*A2A) Yes we do - and that's what we use queue_depth for, with scsictl(1m). Don't confuse IOs in the disk's IO queue (which disksort_seconds affects) with concurrent outstanding IOs, which queue_depth controls. queue_depth lets us control how many outstanding IOs (IOs that have been pushed out the pipe to the disk) we have at any one time.
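For example (hedged - the device file below is hypothetical; see scsictl(1m) for the exact syntax on your release):

# scsictl -a /dev/rdsk/c2t0d0                   # display parameters, including queue_depth
# scsictl -m queue_depth=16 /dev/rdsk/c2t0d0    # allow more concurrent outstanding IOs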

Q2B) The EVA will actually postpone allocating actual storage chunks (1MB? 4MB?) until first written. Thus if you have two LVs on a PV which are filled up gradually, the physical blocks on the underlying disks will be interspersed, not ordered by LV.

Q2C) If you have a striping (RAID-0, RAID-5) controller with a large chunk size, then issuing the IOs strictly in order will focus the IOs on 1 or a few members underlying the LUN, thus letting IO power go to waste?

*A2B & A2C) Whilst this could be the case, how can the OS know about these specific geometries? Designing for one case could cause issues with another. Given that large amounts of IO to SAN LUNs are typically handled more efficiently by the disk array than a single SCSI disk could hope to provide, why worry about those situations?

Q3) Could it also explain lots of SYSTEM time, as sorting 1000+ IO elements, or at the very least keeping them in order, takes some CPU cycles alright!
Can we readily recognize this particular CPU consumption with a profiling tool?

*A3) I've never profiled CPU utilisation during these types of events so can't comment.


Now Alex's

Q1) Does the disk IO queuing algorithm apply to SAN disks as well as local disk?

*A1) It applies to all LUNs - the OS knows no difference between a local disk and a SAN LUN.

Q2) It seems that prioritizing IOs based on block number does not ensure in-order writes. Is that correct? Are the O_DSYNC/O_SYNC flags for open(2) the only way to ensure in-order writes?

*A2) No - don't worry about that. I think the explanation given may simplify things a little, but it certainly can't lead to corruptions caused by failing to follow write-order rules. Certainly for a filesystem, the filesystem code takes care of all this anyway, and I came across the following on the man page for disk:

"Buffering is done in such a way that concurrent access through multiple opens and mounting the same physical device is correctly handled to avoid operation sequencing errors."

Q3) I'm noticing an "sblksched" process running with a priority of 100. Is that the IO request scheduling process?

*A3) No sblksched are the streams schedulers. See kernel parm NSTRBLKSCHED.

Q4) What about the queue length? Doing scsictl on the local disk in question shows me: queue_depth = 8. What is glance showing me with Qlen?

*A4) See answer to Hein's question A2A above. queue_depth is the number of concurrent outstanding SCSI IOs to the LUN from the OS, not the total number of outstanding IOs for the disk.

Not sure I've helped much there but happy to try anyway.

HTH

Duncan

I am an HPE Employee