IO performance puzzle

 
SOLVED
Go to solution
Alex Georgiev
Regular Advisor

IO performance puzzle

Alright! This is my chance to learn a little bit about how disk IO is done in 11.23...

I have an rp8420 w/ 12 CPUs and 32GB RAM. It's running 11.23 (Sept 2006 release + all Oracle, network, security & LVM patches).

I had to restore 4 GB of data to the local disk (Data Protector 6.0 if you're curious).

The disk: 72GB, 10K RPM, LVM boot disk, not mirrored at the time of the restore.

LV & file system: 6GB logical volume, VxFS layout v6. Block size of 8 KB. I had resized the VxFS intent log to 64MB on this LV. It was mounted with the delaylog option (as is standard). 4.8GB of buffer cache.
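For reference, here is roughly how I checked that setup (just a sketch - the LV and mount point names are the ones from my system, so adjust to taste):

# Intent log size, reported in file system blocks
/usr/sbin/fsadm -F vxfs -L /epic

# Mount options actually in effect (delaylog etc.)
/usr/sbin/mount -v | grep epic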

I started the restore, and the "interactive response" from the machine became very slow - you would hit enter, or type a few characters and they wouldn't show up on the screen for about 10 seconds. The only other thing happening on the machine at the time was a script creating printers... so basically the machine was idle with very little other IO going on.

I brought up glance, and I'm attaching a screen shot. My questions are:

1) The first thing that puzzles me is the queue length: 5473. How is that possible? I thought that a disk IO queue can only have about 256 requests in it?

2) The screen shot was taken a few seconds after the peak had passed. The number of IOs on the bottom, 1007, was actually 1200+ for a few seconds during the "peak", and then it started going down. So did the KB/sec metric, which was just 10,000 and kept going down. What kind of a saturation point did I reach?

3) Finally, the number of Logical IOs is 10x the number of Physical IOs. Is this a coincidence, or does the buffer cache pack 10 logical IOs into 1 physical?
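As a rough check on question 3, I figure the buffer cache hit rates should show up in sar as well (a sketch - the 5-second interval and 6 samples are arbitrary):

# lread/s & lwrit/s are logical transfers, bread/s & bwrit/s are physical,
# %rcache / %wcache are the cache hit percentages
sar -b 5 6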

Basically I'm wondering what was going on. I'll settle for a reasonably logical explanation.

...especially if you can tell me why a 12 CPU machine was slow to respond for a minute. I have never had a restore cause a machine to become "sluggish".

P.S. What is a good KB/sec throughput figure for a local disk?

Thanks!
17 REPLIES
Hein van den Heuvel
Honored Contributor

Re: IO performance puzzle

No full answer, but some observations.

- I suspect you are bogging down on the intent log activity.

- What was the tool used for the restore?
Lots of little files? Was it against a clean structure, or overwriting existing files?

- The average IO size from the glance picture is 6513.7/691.7 = 9.4 KB, which is fine.
The instantaneous one is 4669/1058 = 4.4 KB, which is small. So the system is doing lots of little IOs... probably the 1KB intent log writes. Tried logiosize=4096?

>> I had resized the VxFS intent log to 64MB on this LV.
And that was verified, right?
Did you not accidentally fat-finger the option and get nailed with the minimum 256K buffer?
What are the exact mount options in use?
/usr/sbin/mount -v
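If logiosize is not in there, something along these lines could be worth a quick test (a sketch only - I'm guessing at your LV and mount point names, and you'd want to check mount_vxfs(1M) for the options supported by your VxFS version):

# Remount with a 4 KB intent log I/O size instead of the default
umount /epic
mount -F vxfs -o delaylog,logiosize=4096 /dev/vg00/lvepic /epic

# Confirm what is now in effect
/usr/sbin/mount -v | grep epic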


- If the queue is 5000+ then any other operation on that disk will be delayed by roughly 5000/200 = 25 seconds... noticeably slow.
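You should be able to watch that queue build from outside glance too, for example with sar (a sketch - the interval and count are arbitrary; pick out whichever c#t#d# is the boot disk):

# avque is the average request queue length, avwait the ms spent queued,
# avserv the ms spent at the device
sar -d 5 12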

>>> especially if you can tell me why a 12 CPU machine was slow to respond for a minute.

Well, the glance picture shows more than 8% system time. It's not unlikely there is a high level of kernel serialization happening, making stuff wait on just 1 out of those 12 CPUs (or rather, only 1 can make forward progress on those serialized operations at a time).

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting
Alzhy
Honored Contributor

Re: IO performance puzzle

Greetings!

The HW path you're referring to as having massive queuing is the OS disk.. Why are you restoring to such a disk? Do you also use the OS disk (internal to the rp8420) as your DB store? If so - then that's where your problem is -- these internal OS disks can be really slow and can queue up significantly when you're doing massive I/Os to them...
Hakuna Matata.

Re: IO performance puzzle

Let me try and explain why your system is becoming sluggish and what you *could* do and what you *should* do to resolve it.

HP-UX is most definitely a Server OS (and certainly from 11.23 onwards the OS doesn't even run on workstations), and for a long time the developers have been tuning the OS to provide better server performance.

Some of the assumptions we make about servers are that we have *separate disks for the OS* and don't use those disks for anything much apart from the OS, and that we *break out differing workloads onto separate disks*. Various algorithms in the kernel make these assumptions when endeavouring to provide balanced performance.

So for instance, when HP-UX looks at queues to disks, it sorts those queues based on 2 parameters (in order):

1) The priority of the requesting process
2) The block number of the requested IO in ascending order

The first always has to be the case in a timesharing system; the second is designed to reduce disk head movement and therefore increase overall IO efficiency (by ensuring that for IOs of equal priority in the queue the disk heads scan across the disk just once). This generally works great when we separate out IO loads onto separate spindles, as we would on a well-architected server.

However, one of the issues with this is that when we do a large amount of sequential IO to a disk (e.g. a restore!), any random IO to the same disk is going to suffer, as its requests will keep getting pushed down the IO queue by the sorting algorithm preferring the sequential IO. When that disk is the OS disk (which is pretty much always doing small amounts of random IO as OS commands are run and libraries are loaded), the upshot can be a very sluggish system.

Luckily HP know that there are situations where we don't have any choice but to do this, and so provide the kernel parameter disksort_seconds. You won't find much info on this as we'd really rather people did the right thing and didn't do large sequential IOs to the system disk, but here's what it does.

When disksort_seconds is changed from its default value of zero, we change the sorting algorithm as follows:

In order:
1) The priority of the requesting process
2) The time the IO has been on the queue relative to disksort_seconds
3) The block number of the requested IO in ascending order

So now, for any given IO of the same priority I know that my IO will get processed before any IOs that came in 'disksort_seconds' later.

This is fairer on the random IOs, but ultimately delivers less efficient IO.

You don't tend to see this on OSs like Windows and Linux, which ultimately come from a PC lineage where, with one (usually slow - 7200rpm) disk in a system, these sorts of algorithms wouldn't be efficient, so they usually have something similar to disksort_seconds enabled by default (IIRC Linux calls it the deadline IO scheduler).

So you *could*:

Alter disksort_seconds to 0, 1, 2, 4, 8, 16, 32, 64, 128, or 256 seconds. The lower values should certainly have the effect of making the system less sluggish, but will also no doubt slow down your restore to some degree.
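On 11.23 that would look something like this (a sketch only - I haven't checked whether disksort_seconds is dynamic on every patch level, so it may want a reboot to take effect):

# Show the current value (0 = default, pure block-number sort)
kctune disksort_seconds

# Age queued IOs so nothing of equal priority waits more than ~2 seconds
kctune disksort_seconds=2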

But you *should*

Stop doing large sequential IOs to your system disk.

One other thing you could try might be to nice the Data Protector restore agent, as priority is still what the sorting algorithm sorts on first. I can't say what Data Protector would think of that though - it could cause other issues.
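If you did want to try it, something like this might do (a sketch - I'm assuming the restore disk agent shows up as a vrda process on the client, which you'd want to confirm with ps first):

# Find the Data Protector restore disk agent on the client
ps -ef | grep '[v]rda'

# Lower its priority a little, using the PID from the output above
renice -n 10 -p <pid>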

HTH

Duncan
Alex Georgiev
Regular Advisor

Re: IO performance puzzle

>> What was the tool used for the restore? Lots of little files? Was it against a clean structure, or a overwriting existing files? <<

Data Protector 6.0. Lots of little files (50,000 files taking up 4GB of space). The mount point was clean at the time - freshly created to host the restore.

>> Tried logiosize=4096? <<

# fsadm -F vxfs -L /epic
UX:vxfs fsadm: INFO: V-3-25669: logsize=8192 blocks, logvol=""

logsize is 8192 blocks * 8KB = 64MB.


>> And that was verified right? Did not accidently fatfinger the option and got nailed with the minimum 256K buffer? <<

See above.

>> What are the exact mount options in use? <<

/dev/vg00/lvepic on /epic type vxfs ioerror=mwdisable,delaylog,dev=40000009 on Sat Apr 7 18:47:02 2007

>> Why are you restoring to such disk? Do you also use the OS (Internal to the rp8420) as your DB Store? <<

No, the local disk is not a DB store. This is binaries & scripts for a custom app. It doesn't really matter why I'm restoring... The question is simply "What happens with the system when I start this restore?"

>> But you *should* stop doing large sequential IOs to your system disk. <<

Agreed, but sometimes you just have to practice DR procedures.
Alzhy
Honored Contributor

Re: IO performance puzzle

Herr G.,

SCSI disks, especially if you are using the same disks as your OS, ARE SLOW and can let your system CRAWL... You can induce queuing with several massive reads/writes, i.e. your DP restore job.


Hakuna Matata.
Hein van den Heuvel
Honored Contributor

Re: IO performance puzzle

[Alex, sorry for possibly hijacking the thread, but we all just want to understand better, no?]

Hey Duncan, thanks for the excellent, detailed, description!

This suggests that while a deep queue is visible to the driver, the actual device does not see (much of) a queue?!

0) Good point on the single disk. I forgot to comment on that earlier.

1) I'm surprised to see the unit for disksort_seconds is seconds. Well, not very surprised given a name like that, but a single second already seems an awfully long time, let alone settings higher than that!
If an application, after getting a disksort boost, needs a further IO, it could be outside the current 'sweep' and have to wait a long time again. Yikes.

2) This is a great algorithm for simple, local disks. But when applied to SAN controllers, could it not spell disaster?

2A) With 4 or 8 or maybe 50 disks behind a LUN, don't we want at least as many IOs outstanding to the LUN, so as to increase the chances of making all the disks busy and contributing?

2B) The EVA will actually postpone allocating actual storage chunks (1MB? 4MB?) until first written. Thus if you have two LVs on a PV which are filled up gradually, the physical blocks on the underlying disks will be interspersed, not ordered by LV.

2C) If you have a striping (RAID-0, RAID-5) controller with a large chunk size, then issuing the IOs strictly in order will focus the IOs on one or a few members underlying the LUN, thus letting IO power go to waste?

3) Could it also explain the lots of SYSTEM time? Sorting 1000+ IO elements, or at the very least keeping them in order, takes some CPU cycles alright!
Can we readily recognize this particular CPU consumption with a profiling tool?


Thanks!
Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting


Alzhy
Honored Contributor

Re: IO performance puzzle

Hein and Duncan,

This gentleman's problem is rather simple.

He is doing massive I/Os (on account of his restore) to one of the rp8420's internal SCSI drives... And it can croak... really badly via massive queueing...

That is why we leave the OS disk alone... No binaries, no temp file storage, no 3rd party application binaries (/opt/oracle or /opt/whatever are mounts to non-OS/SCSI disks...)
Hakuna Matata.
Alex Georgiev
Regular Advisor

Re: IO performance puzzle

[Nelson, we are not trying to solve a problem! We are trying to gain some good systems knowledge.]

Duncan, thank you so much for the nice & simple explanation! So that's the reason why disksort_seconds doesn't have a man page.

Just like Hein, I too have some follow-up questions:

1) Does the disk IO queuing algorithm apply to SAN disks as well as local disk?

2) It seems that prioritizing IOs based on block number does not ensure in-order writes. Is that correct? Are the O_DSYNC/O_SYNC flags for open(2) the only way to ensure in-order writes?

3) I'm noticing an "sblksched" process running with a priority of 100. Is that the IO request scheduling process?

4) What about the queue length? Doing scsictl on the local disk in question shows me: queue_depth = 8. What is glance showing me with Qlen?

Many thanks!
Solution

Re: IO performance puzzle

Alex, Hein,

OK - you're stretching my knowledge of this subject considerably now, but let me try and answer your questions as best I can:

Hein's questions first:

Q1) I'm surprised to see the unit for disksort_seconds is seconds. Well, not very surprised given a name like that, but a single second already seems an awfully long time, let alone settings higher than that!
If an application, after getting a disksort boost, needs a further IO, it could be outside the current 'sweep' and have to wait a long time again. Yikes.

*A1) No, because any random IOs, even on the current sweep, will be satisfied before those sequential IOs that came in disksort_seconds later. But you are right to point out it doesn't help a great deal - that's why it's largely undocumented.

Q2) This is a great algorithm for simple, local disks. But when applied to SAN controllers, could it not spell disaster?

*A2) How so? I think the real question here is "without understanding the geometry of a SAN LUN can we come up with a better algorithm", to which I think the answer is no we can't - we need to know too much about how the LUN is physically made up to come up with any sensible logic - and we just can't do that right now.

Q2A) With 4 or 8 or maybe 50 disks behind a LUN, don't we want at least as many IOs outstanding to the LUN, so as to increase the chances of making all the disks busy and contributing?

*A2A) Yes we do - and that's what we use queue_depth for with scsictl(1m). Don't confuse IOs in the disk's IO queue (which disksort_seconds affects) with concurrent outstanding IOs, which queue_depth controls. queue_depth lets us control how many outstanding IOs (IOs that have been pushed out the pipe to the disk) we have at any one time.
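For example (a sketch - the device file is just a placeholder for whichever LUN you are looking at):

# Display the driver settings for the LUN, including the current queue_depth
scsictl -a /dev/rdsk/c2t0d0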

Q2B) The EVA will actually postpone allocating actual storage chunks (1MB? 4MB?) until first written. Thus if you have two LVs on a PV which are filled up gradually, the physical blocks on the underlying disks will be interspersed, not ordered by LV.

Q2C) If you have a striping (RAID-0, RAID-5) controller with a large chunk size, then issuing the IOs strictly in order will focus the IOs on one or a few members underlying the LUN, thus letting IO power go to waste?

*A2B & A2C) Whilst this could be the case, how can the OS know about these specific geometries? Designing for one case could cause issues with another. Given that large amounts of IO to SAN LUNs are typically handled more efficiently by the disk array than a single SCSI disk could ever hope to manage, why worry about those situations?

Q3) Could it also explain the lots of SYSTEM time? Sorting 1000+ IO elements, or at the very least keeping them in order, takes some CPU cycles alright!
Can we readily recognize this particular CPU consumption with a profiling tool?

*A3) I've never profiled CPU utilisation during these types of events, so I can't comment.


Now Alex's

Q1) Does the disk IO queuing algorithm apply to SAN disks as well as local disk?

*A1) It applies to all LUNs - the OS knows no difference between a local disk and a SAN LUN.

Q2) It seems that prioritizing IOs based on block number does not ensure in-order writes. Is that correct? Are the O_DSYNC/O_SYNC flags for open(2) the only way to ensure in-order writes?

*A2) No - don't worry about that. I think the explanation given may simplify things a little, but it certainly can't lead to corruptions caused by failing to follow write-order rules. Certainly for a filesystem, the filesystem code takes care of all this anyway, and I came across the following on the man page for disk:

"Buffering is done in such a way that concurrent access through multiple opens and mounting the same physical device is correctly handled to avoid operation sequencing errors."

Q3) I'm noticing an "sblksched" process running with a priority of 100. Is that the IO request scheduling process?

*A3) No, the sblksched processes are the STREAMS schedulers. See the kernel parameter NSTRBLKSCHED.

Q4) What about the queue length? Doing scsictl on the local disk in question shows me: queue_depth = 8. What is glance showing me with Qlen?

*A4) See answer to Hein's question A2A above. queue_depth is the number of concurrent outstanding SCSI IOs to the LUN from the OS, not the total number of outstanding IOs for the disk.

Not sure I've helped much there but happy to try anyway.

HTH

Duncan
Hein van den Heuvel
Honored Contributor

Re: IO performance puzzle

Thanks Duncan!
Roughly along the lines expected.

>> A1) It applies to all LUNs - the OS knows no difference between a local disk and a SAN LUN.

Suspected as much.
IMHO the OS should not bother for anything other than a direct-connect disk. As soon as there is any controller involved, the OS would best assume that the controller knows best and will apply its own elevator algorithms within the constraints of its caches and its knowledge of which device will actually do the IO.

By increasing queue_depth away from the low default of 8 to something much larger, we can largely get there.
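Something like this, I believe (a sketch - the device file is an example, the right value needs tuning per array, and scsictl changes don't survive a reboot on their own):

# Allow up to 32 IOs outstanding at this LUN at once
scsictl -m queue_depth=32 /dev/rdsk/c4t0d1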

To deal with the problem which started this thread, just doing a 'nice' on a large restore during production seems like a reasonable thing to ask and do.

Thanks!
Hein.
Alex Georgiev
Regular Advisor

Re: IO performance puzzle

Duncan, Hein,

Many thanks for the discussion! It might be time to go pick up the HP-UX Internals book.

I plan on experimenting with various file system options & kernel parameters. I have the test machines, I just need the time.

I'm not worried about DP restores affecting system performance. If I'm ever in a situation where I have to do large sequential restores to a system disk, then it's likely that I'll have bigger problems to worry about.
Alzhy
Honored Contributor

Re: IO performance puzzle

"If I'm ever in a situation where I have to do large sequential restores to a system disk, then it's likely that I'll have bigger problems to worry about."

I don't know if you're a SysAd, a DBA or a hybrid -- but in case you'll ever be in that situation, then blame thyself or whoever decided to put your mega application binaries, logs, etc. on your system disk.

We're using 73 and even 140 GB "system" disks, BUT we do not store anything else outside of the OS, swap, etc. on the same disk, especially if it's SCSI/local.

You can research, understand, etc. all you want - the local SCSI disk is just not that capable, and it can be stressed to the point of massive queuing where even the 12 1GHz CPUs in your rp8420 partition can't have a say...

Cheers.
Hakuna Matata.
Alex Georgiev
Regular Advisor

Re: IO performance puzzle

I very much agree with the philosophy that OS and applications should be isolated, but I don't think there is a cookie-cutter approach to deciding what goes on the SAN and what goes on the local disk.

1) There are situations where it makes more sense to store both OS and app binaries on local disks (such as mine).

2) There are situations where it makes sense to have OS only on the local disk, and keep the App binaries on a SAN (most Oracle installations and clustered apps, for example).

3) And then, there are situations where it makes sense to boot off the SAN, as in the case where you want to replicate your OS disks over the SAN to another data center for DR purposes.

Here is an example of why you would want to store OS & App binaries on the local disk:

Consider the economics of it... If you have an EVA, then you are usually paying about $2,000 for a single FC disk drive. At that price the storage comes out very expensive, and I don't want to waste any of the valuable EVA space on file systems that do not require the high performance, high reliability or convenience features of a SAN.

Either way, the bottom line is that "your mileage may vary!" In other words: what works for me may not work for you.

Cheers!
Alzhy
Honored Contributor

Re: IO performance puzzle

Alex,

Just to make it clear: I am not against having apps/log trees on good old local SCSI disks. Just do not mix them with your OS disks. In fact, our super special app has particularly hyperactive usage of the /tmp and /var/tmp paths. When we had those as partitions/volumes of the OS disk(s), our environments would sometimes crawl to a halt during times of extreme usage, when queuing was in the tens or even hundreds. We ended up segregating those paths, plus an entire /opt/specialapp tree, onto a separate LOCAL SCSI disk pair...

Now if you have SAN boot disks, the game changes entirely. You do not have to relocate your I/O-intensive paths, and you even get to keep your /opt/app or /usr/app_big_binaries or /opt/app/severelogger paths ON YOUR SAN BOOT DISK -- of course it depends on your standard SAN boot sizes. Some of my clients adopted SAN boot disks as large as 160GB...

And SAN boot disks can make your environment noticeably faster too - especially if your apps were used to using SCSI disks for their temp file workspaces.

One thing that I noticed, though: there really is no need to have a separate FC channel for your SAN boot/swap disk/LUN path...


With my existing client, we use both EVA (5000) and XP12000 SAN boot disks with no problems at all...
Hakuna Matata.
Emil Velez
Honored Contributor

Re: IO performance puzzle

Interesting question, and the comments above are interesting too.

1. The intent log is very, very big. Defaults are there for a reason. The intent log may not be flushing data out quickly enough.

2. JFS is not optimal for small files and lots of nested directories. Some people have asked which is faster, JFS or HFS. I can produce a benchmark to make either one look faster, since there are mixes of files that make each filesystem appear slower. JFS does not like small files and many, many nested directories with thousands of entries at each level.

3. You have a single-threaded disk agent doing the restore, so the Data Protector process only runs on one processor. Having a 12-processor box means you can run more programs, not that programs run faster.

4. Are you doing software mirroring to the local SCSI disks? If you are doing software mirroring, then you are generating multiple I/Os for each write (see the sketch below to check).
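To check, lvdisplay shows the mirror copy count for the LV (a sketch - substitute the real LV path):

# "Mirror copies" greater than 0 means every write goes to more than one PV
lvdisplay /dev/vg00/lvepic | grep -i "mirror copies"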