Re: Disk Bottleneck

Jeffrey F. Goldsmith · ‎04-03-2007

My server (rp3440-4 w/ HP-UX11.23) is getting some disk bottleneck and CPU bottleneck alerts. Is there anything that I can do to correct this?

Hein van den Heuvel · ‎04-03-2007

Is the system having a performance problem, or an alert problem? That's the main question.

Maybe it is just a monitor that is nagging.

If you are having a performance problem, then you'll have to get going at who is using what, when, and why. Often a tedious, sometimes a rewarding task.

Maybe the (periods of) overload are perfectly reasonable / designed for; maybe not, and something is broken.

With the amount of detail provided (NONE) the playing field is wide open.

The alert just means a threshold was exceeded. That threshold may be reasonable; for disk busy it often is not.

Ignoring the alert or raising the threshold may be the right thing to do.

google: +"disk busy" +hpux +site:itrc.hp.com

Good luck!
Hein van den Heuvel ( at gmail dot com )
HvdH Performance Consulting.

Bill Hassell · ‎04-03-2007

Glance alerts are virtually useless. Performance issues are measured by end users. Your system can be running with 100% CPU and 100% disk and users are very happy. Or it might be only 50% busy and users are complaining due to very long response times (typically more than 2-3 seconds). The term bottleneck implies that there is something you can remove to fix it. But performance issues are more complex than that.

You can easily correct all the performance issues by locking all the accounts so no one can login or use the computer. But that isn't practical. So you start by determining whether the applications have been adequately configured for your size machine. If you don't have enough memory and the apps are so large that they cause a lot of paging, then performance will be awful (and you won't see many Glance alerts because swapping isn't monitored by default).

So you'll need to research the apps and user activities to determine if any are misconfigured. Glance will show you the top processes that use CPU or disk resources.

Bill Hassell, sysadmin

Asif Sharif · ‎04-03-2007

Hi Jeffrey,

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=961429
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=990138

Regards,
Asif Sharif

Jeffrey F. Goldsmith · ‎04-04-2007

More information on my problem:

This server is an rp3440-4 with two 1GHz CPUs and an MSA30-DB with twelve 72GB hard drives. The MSA30 is connected to the rp3440 via a Smart Array 6404/256 SCSI controller. There are 6 hard drives on the first channel, striped as one drive and mirrored to the other 6 drives on the second channel. I also have 4GB of RAM. Most of the space on the server is set up as raw space used by the IFAS database in Informix.

This server is still in test mode. When I was getting these errors we were doing a volume test on our server. What that means is that we had about 20 users logged on to the server and had them running several reports. During the first 30 minutes of the test there were several TACS reports running that took up a lot of disk usage (it went to 100% and stayed there for about 20 minutes). During that time my users were having trouble signing on to the server and moving between screens within their programs. They were having delays of up to a minute, which is unacceptable.

Once the TACS reports were done my users were able to login and move between screens without any delays.

I took a tuning class several years ago and was trying to figure out if there was anything that I could change to speed up my server. Is there any documentation that HP has that will help me to figure out how to speed up my system?

How is RAM allocated with the server? I would like to give more RAM to the User if possible.

Thanks for any help.

A. Clay Stephenson · ‎04-04-2007

First, 4GiB of memory for serious database applications is considered tiny by today's standards so the first thing that I would consider doing is increasing the memory. This will probably have another desired effect in that the database buffers in shared memory can then be increased which may reduce your i/o. Essentially the only way to give the users more memory is to reduce some of the kernel data -- notably the buffer cache. You have to play a balancing act between shared memory needed by the database, memory needed by the applications, and memory needed by the kernel. If your dbc_max_pct is at the default value of 50%, it should be drastically reduced -- especially if you are using raw devices --- but you still need some buffer cache for ordinary i/o.

Make certain that you are not swapping to any significant degree when under heavy load because this will easily impose a 100x performance hit.

Finally, you may tune this thing, add more memory, processors, and i/o paths and still have a dog because of inefficient or poorly written software. Database applications are notorious for this. If I can assume that you are not swapping then you need to ask yourself this: If I make my machine 2X as fast will it still be a dog? If so, you need to take an extremely hard look at your code because a 2X performance increase is difficult to achieve whereas a 10X increase can often be achieved with better algorithms. With databases, one index may fix a huge chunk of your performance problem.

If it ain't broke, I can fix that.

Jeffrey F. Goldsmith · ‎04-04-2007

My current production server is an L2000 with HP-UX 11.0 and has 4 440MHz CPUs and 2GB of RAM. This server is currently running the same programs and doesn't seem to be having any of the problems that I am getting on the rp3440 server.

I would have thought that with the faster CPUs and twice the RAM that this server would be that much better.

Did I purchase the wrong server platform?

Hein van den Heuvel · ‎04-04-2007

>> This server is still in test mode.

Excellent news.

>> What that means is that we had about 20 users logged on to the server and had them running several reports.

That's a good first step. Now you have to ask yourself whether this will be a typical load, a worst case load, or was it perhaps an unrealistic load?

This will help you define your SLA.

>> During that time my users were having trouble signing on to the server and moving between screens within their programs. They were having delays of up to a minute, which is unacceptable.

Then you will likely need more hardware,
probably CPUs and memory.
How was the system sized?
What were/are the expectations?

>> Once the TACS reports were done my users were able to login and move between screens without any delays.

So now you have learned to limit the number of reports running at the same time. Will that be an acceptable way to deliver service? It might be.

>> I took a tuning class several years ago and was trying to figure out if there was anything that I could change to speed up my server. Is there any documentation that HP has that will help me to figure out how to speed up my system?

It's not easy. It is sort of an art.
That's why they pay folks like me and other performance consultants the big bucks!
:-)

>> How is RAM allocated with the server? I would like to give more RAM to the User if possible.

User processes just get what they ask for, overflowing to swap if need be... which will easily explain minutes of slowdown.

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting

Hein van den Heuvel · ‎04-04-2007

>> My current production server is an L2000 with HP-UX 11.0 and has 4 440MHz CPUs and 2GB of RAM. This server is currently running the same programs and doesn't seem to be having any of the problems that I am getting on the rp3440 server.

In the light of this new info, my prior reply no longer applies to this problem, or at least not as much.

>> I would have thought that with the faster CPUs and twice the RAM that this server would be that much better.

Me too.
But did you stress the old system to the extent it was stressed in the test?

>> Did I purchase the wrong server platform?

No, something is wrong.

Maybe the database lost some indexes as it was moved? Dive in, soliciting help from the DB team and the applications team.

Hein.

A. Clay Stephenson · ‎04-04-2007

At this point, it's difficult to say but you may have fallen into the MHz (or GHz) trap which says that my box has twice as many MHz as yours so it's twice as fast. Now consider that your old clunker L-box might have very good and well-distributed i/o and your new box (by comparison) is not nearly as i/o efficient because of different disk arrays or controllers -- or fewer i/o paths. If you further consider that the vast majority of database code is i/o rather than cpu bound then the impact of 2X CPU performance (which might be 15% of your overall computational time) is more than offset by the faster i/o of the old system.
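
The arithmetic behind that point can be sketched in a few lines of Python. This is only an Amdahl-style estimate: the 15% CPU share and the 2X CPU figure come from the paragraph above, while the function name and the "I/O 20% slower" case are made up for illustration and were not measured on either box.

```python
def overall_speedup(cpu_fraction: float, cpu_speedup: float, io_speedup: float = 1.0) -> float:
    """Amdahl-style estimate of the whole-job speedup.

    cpu_fraction: share of the original run time spent on CPU (0..1).
    cpu_speedup:  how much faster the CPU part runs on the new box.
    io_speedup:   how much faster the I/O part runs (below 1.0 means slower).
    """
    io_fraction = 1.0 - cpu_fraction
    new_time = cpu_fraction / cpu_speedup + io_fraction / io_speedup
    return 1.0 / new_time


# CPU is 15% of the run time, new CPUs are 2X as fast, I/O unchanged.
print(f"I/O unchanged:  {overall_speedup(0.15, 2.0):.3f}x")

# Hypothetical case: same CPUs, but the new I/O path is 20% slower.
print(f"I/O 20% slower: {overall_speedup(0.15, 2.0, io_speedup=0.8):.3f}x")
```

With I/O unchanged, the job only gets about 8% faster overall. In the hypothetical slower-I/O case the result drops below 1.0, meaning the "faster" box finishes later than the old one.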

Assuming that the boxes are tuned roughly the same and that you are doing an apples-to-apples performance comparison, I would next make sure that when the database was exported to the test box all the indices were recreated as well. You may not be aware that SQL code will work perfectly well (in that the same number of rows will be returned) with or without a given index, BUT the time required to return that same number of rows could vary by many orders of magnitude.

If it ain't broke, I can fix that.

Jeffrey F. Goldsmith · ‎04-04-2007

We just finished running a couple of test jobs on the server and I noticed that I was getting a large number of page faults. Then I took a look at both servers via Glance, and while looking at the Memory Report I noticed that the memory is allocated differently on the two servers. Is there a way to change the way the rp3440 server memory is configured? Could this be part of my problem?

L2000 w/ 2GB RAM
Sys Mem: 133.8MB
Buf Cache: 12.2MB
User Mem: 1.81GB
Free Mem: 52.5MB

rp3440 w/ 4GB RAM
Sys Mem: 1.1GB
Buf Cache: 1.5GB
User Mem: 1.1GB
Free Mem: 223MB

A. Clay Stephenson · ‎04-04-2007

Look at the enormous difference in the sizes of the buffer caches. You need to radically reduce max_dbc_pct to something no more than about 15%, and probably 10%. Page faults are normal events and aren't an immediate cause for concern, but you really need to determine if you are seeing pageouts. Ideally, pageouts should be zero even under heavy loads. I'm actually amazed that your old box can perform well with only 12MiB of buffer cache; normally something in the 100MiB range is considered minimal.
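
To put a number on that difference, here is a small Python sketch that turns the two memory reports posted above into "share of RAM held by the buffer cache". The figures are copied from those reports. The dictionary layout, the function name, and the 1GB = 1024MB conversion are assumptions made for the example.

```python
MB_PER_GB = 1024  # assumed conversion for the posted figures

# Buffer cache and installed RAM, taken from the two memory reports above.
reports = {
    "L2000": {"ram_mb": 2 * MB_PER_GB, "buf_cache_mb": 12.2},
    "rp3440": {"ram_mb": 4 * MB_PER_GB, "buf_cache_mb": 1.5 * MB_PER_GB},
}


def buffer_cache_share(ram_mb: float, buf_cache_mb: float) -> float:
    """Percentage of physical RAM currently held by the buffer cache."""
    return 100.0 * buf_cache_mb / ram_mb


for name, report in reports.items():
    share = buffer_cache_share(**report)
    print(f"{name}: {share:.1f}% of RAM in buffer cache")
```

The old box comes out well under 1% of RAM in buffer cache and the new one at 37.5%, which is memory the database and user processes cannot use.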

If it ain't broke, I can fix that.

Jeffrey F. Goldsmith · ‎04-04-2007

Clay,

Where do I make that change? I started SAM and went to Kernel Configuration. The only thing I can start there is the kernel configuration tool (kcweb), and the only items it shows are the tunables. max_dbc_pct doesn't show up there.

Bill Hassell · ‎04-04-2007

The massive difference between the servers is likely the kernel parameter dbc_max_pct. This is one of those cases where two seemingly similar servers behave differently; they are not the same at all. Because you are using Informix, most of your disk I/O is likely raw, which means a large buffer cache is a bad thing, especially in a small RAM configuration. The good news is that you can reduce dbc_max_pct while you are running. Just run SAM and reduce the value to about 8-10%. Then re-run your tests. Also ask your Informix DBA about async I/O and KAIO to dramatically improve disk I/O.

Bill Hassell, sysadmin

A. Clay Stephenson · ‎04-04-2007

The fault (maybe a page fault) is mine; it should be dbc_max_pct, and there is also a dbc_min_pct. With 4GiB of physical memory, I would set dbc_max_pct at 10% and dbc_min_pct at 4%. You may find that higher values are better, but these are reasonable values given your raw i/o.
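
As a rough illustration of what those percentages mean on a 4GiB box, the Python sketch below converts them to sizes. It assumes both tunables are plain percentages of physical memory; the function name is made up for the example.

```python
def buffer_cache_bounds_mib(ram_gib: float, dbc_min_pct: float, dbc_max_pct: float):
    """Return (floor, ceiling) of the dynamic buffer cache in MiB.

    Assumes dbc_min_pct and dbc_max_pct are percentages of physical memory.
    """
    ram_mib = ram_gib * 1024
    return ram_mib * dbc_min_pct / 100, ram_mib * dbc_max_pct / 100


# Suggested settings for the 4GiB rp3440.
floor_mib, ceiling_mib = buffer_cache_bounds_mib(4, dbc_min_pct=4, dbc_max_pct=10)
print(f"suggested: {floor_mib:.0f} MiB floor, {ceiling_mib:.0f} MiB ceiling")

# For comparison, the ceiling with dbc_max_pct left at 50%.
_, default_ceiling_mib = buffer_cache_bounds_mib(4, dbc_min_pct=4, dbc_max_pct=50)
print(f"at 50%:    {default_ceiling_mib:.0f} MiB ceiling")
```

The suggested values cap the cache at roughly 410 MiB instead of 2 GiB, leaving the difference for shared memory and user processes.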

If it ain't broke, I can fix that.

Jeffrey F. Goldsmith · ‎04-05-2007

Bill & Clay,
Thanks for the correction to the kernel parameter name. I was able to change it to 10%. It looks like the extra RAM went into Free Memory. Will that memory get used up by the User Memory?
My next question concerns the System Memory. Is it reasonable to reduce the amount of RAM allocated to System Memory (1.1GB)? If so, how would I do that?

A. Clay Stephenson · ‎04-05-2007

The "Free" memory is just that and will be used on an on-demand basis, so that if user process demands occur, it goes there. Reducing GBL_MEM_SYS is more difficult because it is comprised of many different things -- some of which are dynamic. You also have to be careful when comparing the same metrics from different versions of the same OS because what was included in one version might not be included in the other.
One of the ways that this value can become very large is if you are using the formulae rather than using constants in some of the tunables. For example, ninode is one value that is typically hundreds of times too large based upon the formula. Ninode only applies to hfs filesystems and generally the only hfs filesystem on a box is /stand. The most inode intensive task on /stand is building kernels so that ninode = 800 or so is extremely generous.

Bill's suggestion to use asynchronous i/o is a good one. Now that you have free memory, you need to get more test data.

If it ain't broke, I can fix that.

Jeffrey F. Goldsmith · ‎04-05-2007

We were having some trouble with the test reports running (using too much disk I/O) so I changed the dbc_max_pct to 20% and rebooted the server. When the server was back up I checked the GPM Memory Report.

This is what I now see:
Sys Mem: 572MB
Buf Cache: 320MB
User Mem: 1.1GB
Free Mem: 2.1GB

Is there a way to set the User Memory, or does it use more memory from the free memory when it needs more?

Also, while I was looking at the report I noticed that the Page Faults were quite high.
Current: 6
Cumulative: 52994
Curr Rate: 0.0
Cum Rate: 83.9
High Rate: 742.6

Is the high rate a little excessive? Is that from the reboot or should I be worried?

A. Clay Stephenson · ‎04-05-2007

You have to understand what a page fault is. Consider executing a large complex program. There is a good chance that some of that code will never be executed because none of the special conditions encountered in the data evaluate to true. There is no need to load all of the program initially -- just enough to get the ball rolling. This is "just in time" memory resolution. Now after the program has been executing a few milliseconds, it needs to execute additional code at a location that isn't in memory (a page fault), so the code must be loaded in (a page in). That is why you see lots of page ins but hopefully few page outs. Page outs occur when there is not enough room to page in new code or data, so something has to get moved out of the way and into swap. Another kind of fault is much more common: Translation Lookaside Buffer (TLB) misses, which simply mean that a virtual-to-physical address translation was not in the on-chip TLB and has to be fetched from the page tables in memory. In short, faults are very common and expected, and most are harmless. Fault is another term (like kill) which probably should have been given a different name because it sounds worse than it often is.
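
A quick sanity check on the posted fault numbers, sketched in Python. It assumes the cumulative rate is simply the cumulative count divided by the seconds since the counters started. That is an assumption about how the report computes it, not something confirmed in this thread, and the function name is made up for the example.

```python
def seconds_covered(cumulative_count: int, cumulative_rate_per_sec: float) -> float:
    """Elapsed time implied by a cumulative count and its average rate."""
    return cumulative_count / cumulative_rate_per_sec


# Figures from the GPM report posted above.
cumulative_faults = 52994
cum_rate = 83.9    # faults per second, averaged over the whole interval
high_rate = 742.6  # highest rate seen in any single interval

elapsed = seconds_covered(cumulative_faults, cum_rate)
print(f"interval covered: about {elapsed / 60:.1f} minutes")
print(f"peak rate is {high_rate / cum_rate:.1f}x the average")
```

Under that assumption the counters cover only about ten minutes, so a burst of page ins while everything was starting up after the reboot would be enough to explain the high rate.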

I think what Bill and I are both trying to tell you is that you really need to look at your SQL and especially make sure that your indices are the same. I have yet to hear you verify this.

If it ain't broke, I can fix that.

Bill Hassell · ‎04-05-2007

> Is there a way to set the User Memory, or does it use more memory from the free memory when it needs more?

User programs are totally responsible for the amount of memory they use. Unix is not like MPE where the OS dynamically allocates all available RAM to improve the integrated database. Instead, you free up extra memory and nothing will use it except the kernel's buffer cache. Almost no Unix application looks at RAM and adjusts as needed. One of the reasons is that HP-UX is a virtual memory system, so if you have (for example) 4GB of RAM plus 8GB of swap space, you could run roughly three times as much in programs as you have RAM. Pretty nifty. Increase swap to 30GB and you can run more than 25GB of processes in your 4GB of RAM.

Oh, I forgot to mention that performance will be agonizingly slow. Logins in a few seconds? Not a chance -- logins may take several minutes because everything is being exchanged between RAM and swap -- a waste of time. You can add another 4GB of RAM and adjust the Informix parameters to use more RAM, which will dramatically improve performance.

Bill Hassell, sysadmin

Hein van den Heuvel · ‎04-05-2007

Bill, Clay, let's not forget Jeffrey's comments from Apr 4, 2007 21:11:54 GMT.

He has an environment with half the horsepower, working at acceptable performance levels. So adding still more memory should not be needed as a first step, and could mask the real problem.

Jeffrey needs to focus on comparing system and database parameters. Err on the side of caution at first and make them the same. Then grow some params on the new box to better exploit the additional memory and CPU horsepower. One significantly different param (the buffer cache setting) was already found. How did that happen? What else slipped by?
Maybe shmmax is set to 64MB on the new box and 1GB on the old one? Carefully, tediously, compare all params, system and database. You'd better be able to completely and convincingly defend any variation, or just set it back to the old value.
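
The tedious comparison can be partly automated. The Python sketch below is a minimal example: it assumes each box's tunables have been saved as text with the parameter name in the first column and its value in the second. The sample values are made up to mirror the shmmax example above and are not taken from either server.

```python
def parse_tunables(text: str) -> dict:
    """Parse 'name value ...' lines into {name: value}.

    Assumes the name is the first whitespace-separated field and the value
    the second; blank lines and a 'NAME ...' header line are skipped.
    """
    tunables = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].upper() == "NAME":
            continue
        tunables[fields[0]] = fields[1]
    return tunables


def diff_tunables(old: dict, new: dict) -> dict:
    """Return {name: (old_value, new_value)} for every tunable that differs
    or exists on only one box (the missing side is reported as None)."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


# Made-up sample listings for illustration only.
old_box = """NAME VALUE
dbc_max_pct 1
shmmax 1073741824
nproc 2048
"""
new_box = """NAME VALUE
dbc_max_pct 50
shmmax 67108864
nproc 2048
"""

differences = diff_tunables(parse_tunables(old_box), parse_tunables(new_box))
for name, (old_value, new_value) in differences.items():
    print(f"{name}: old={old_value} new={new_value}")
```

Every line it prints is a variation that either needs a convincing justification or should go back to the old value. The same approach works for a dump of the database configuration.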

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting

Jeffrey F. Goldsmith · ‎04-06-2007

I am having the programmers look into the async i/o and KAIO issues right now. The person that is responsible for that Informix/IFAS database is out until Monday.

I thought I would attach a copy of the tunable parameters (/usr/sbin/sysdef) for both servers so you can see how they are set up right now. I have made some configuration changes already that have been requested by the vendor. Could you take a look at them and let me know if you see anything that should be changed?

Thanks.
