Operating System - HP-UX

can't explain a sudden high load average

 
Marc Ahrendt
Super Advisor

Typically my server runs at a load of 1, but at some point the load goes up to 30, and at times 50 ...yes, 50, while no users seem to notice or complain.

Swap, disk, and CPU usage are low in glance, while memory (4 GB) is at about 90% usage. We are running Java apps that talk over the TIBCO Rendezvous bus (rvd).

I think the sudden jump in the load averages happens when our "rvd" daemon has many open files (~1500 sockets), but I am only guessing.

What causes the load average to go up? I always thought it was directly related to how hard the CPUs were working, but I am not sure that is really correct. I cannot figure out how to correlate the high load average with anything. It may be due to the "rvd" process, but I do not know what to look for ...whether the problem is a kernel parameter, lack of RAM (even though swapping is low at best), etc.!?

FYI: TIBCO told us to increase maxfiles_lim from 2048 to 8096 (glance tells me that is not the problem), increase maxdsiz from 256 MB to 3 GB (glance also tells me that this is not the issue), and add patches (we have the ones they recommend).
hola
3 REPLIES
Bill Hassell
Honored Contributor

Re: can't explain a sudden high load average

Perfectly normal--if that is the way your rvd process works. Let's start with load average: it does not mean what you think...it is simply the average runqueue depth. Now that needs an explanation...

The runqueue is the table where all processes that are running or ready to run are placed. If 4 processes are ready to run and you have 4 processors, then the runqueue depth is 4 and each processor is working as hard as it can. But if 8 processes are ready to run, 4 of them will have to wait. Is that bad? Not at all, especially if the processes last for just a few milliseconds.
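
To make the arithmetic concrete, here is a toy Java sketch (my own illustration on a modern JDK, not anything from HP-UX or rvd): a fixed pool of worker threads sized to the CPU count stands in for the processors, and a burst of short tasks stands in for the ready-to-run processes. The backlog stays deep even though every task needs only a few milliseconds:

import java.util.concurrent.*;

// Toy illustration (assumes a modern JDK; this is not HP-UX kernel code).
// With one worker per "CPU" and far more short-lived tasks, the executor's
// backlog (queued plus running tasks) plays the role of the runqueue depth.
public class RunqueueDemo {
    public static void main(String[] args) throws Exception {
        int cpus = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor pool =
            (ThreadPoolExecutor) Executors.newFixedThreadPool(cpus);

        // Submit 200 tasks that each "run" for about 5 ms.
        for (int i = 0; i < 200; i++) {
            pool.submit(() -> {
                long end = System.nanoTime() + 5_000_000L;
                while (System.nanoTime() < end) { /* busy work */ }
            });
        }

        // Sample the backlog a few times: queued + active is the "runqueue depth".
        for (int s = 0; s < 5; s++) {
            System.out.printf("ready to run: %d (processors: %d)%n",
                    pool.getQueue().size() + pool.getActiveCount(), cpus);
            Thread.sleep(50);
        }
        pool.shutdown();
    }
}

Every task finishes quickly, yet at any instant far more of them are ready to run than there are processors to run them; that is all a load average of 30 or 50 is measuring.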

Network socket processes may be spawned by the hundreds within a second or two, but each process might last for just a fraction of a second. And herein is the problem with (very slow) human perception and statistics. These processes may be rescheduled several times per second based on socket activity, and the computer is happily humming along because there is plenty of time for other tasks. The Load Average looks bad but, as you've seen, it doesn't have much effect on the users. One other characteristic is that the system overhead will go up dramatically (still not bad though).

Now the above is one case (fairly common it turns out) where high runqueues (and high system overhead) are just a measure of very high throughput for short-lived daemons. On the other hand, if 100 users all decide to copy files at the same time, the runqueue will be high, system overhead will be fairly normal and everyone will complain about slow response times. In this case, the high Load Average indicates that indeed, way too many programs are consuming way too much disk and CPU power and this will affect all users.


Bill Hassell, sysadmin
Vytautas Vysniauskas
Occasional Contributor

Re: can't explain a sudden high load average

Hi,

It is very likely that you have a situation with intensive context switching and many system calls. Check the output of vmstat and look at the 'sy' and 'cs' columns. You probably have sy > 10k and cs > 1k; the typical level for these numbers on an idle system is < 400/sec.

In general, the situation you have means your server is running with degraded performance (although it is difficult to say by how much, since it depends on the capabilities of your server). The reason is simple: every running process in the system is frequently interrupted, creating additional overhead that the system spends in kernel mode. The good sign is that CPU utilization is low, meaning the system is handling the situation easily. However, the biggest impact should be on responsiveness (I guess the HP kernel isn't preemptive).

Your assumption about the Java application is right, I think. I have seen very similar behavior from buggy Java stuff that created excessive 'sy' and 'cs' while CPU was below 10% but the system load factor was close to 10.
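
A hypothetical reproducer of that pattern (an assumption about the kind of code involved, not the actual rvd/TIBCO path): two Java threads exchanging single bytes over a loopback socket cause one system call per byte and a context switch on every hand-off, so 'sy' and 'cs' in vmstat climb while CPU utilization stays low.

import java.io.*;
import java.net.*;

// Toy reproducer (needs a reasonably recent JDK; the numbers are made up).
// Run vmstat 5 alongside and watch the 'sy' and 'cs' columns climb while
// the CPU columns stay nearly idle.
public class SyscallChurn {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0); // ephemeral loopback port

        Thread echo = new Thread(() -> {
            try (Socket s = server.accept()) {
                s.setTcpNoDelay(true);
                InputStream in = s.getInputStream();
                OutputStream out = s.getOutputStream();
                int b;
                while ((b = in.read()) != -1) { // one read syscall per byte
                    out.write(b);               // one write syscall per byte
                }
            } catch (IOException ignored) { }
        });
        echo.start();

        try (Socket s = new Socket("localhost", server.getLocalPort())) {
            s.setTcpNoDelay(true);
            OutputStream out = s.getOutputStream();
            InputStream in = s.getInputStream();
            for (int i = 0; i < 100_000; i++) {
                out.write(i & 0xff); // syscall
                in.read();           // blocks, forcing a context switch
            }
        }
        echo.join();
        server.close();
    }
}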

Have a nice day,


Vytas.
Anonymous
Not applicable

Re: can't explain a sudden high load average

Not necessarily related, but probably worth considering:

a) Did you run HPjtune/HPjmeter/HPjconfig? (http://www.hp.com/go/java)
http://www.hp.com/products1/unix/java/java2/hpjtune/index.html

HPjtune is a tool that helps you analyze garbage collection performance by graphically displaying instrumentation data from the garbage collector.
HPjtune lets you view this data in the following ways:

- Several predefined graphs which show the utilization of garbage collector resources and the impact of the garbage collector on application performance.
- User-configurable graphs for access to selected garbage collection metrics.
- Separate predefined graphs for garbage collection behavior pertaining to threads.

HPjtune also includes a unique feature which allows you to use collected data to predict the effect of new garbage collector parameters on future application runs.

HPjtune displays data for any SDK and RTE for Java release 1.2.2 and 1.3 for HP-UX 11.x (PA-RISC and Itanium), HotSpot VM.

(In case HPjtune helps you nail that down, please post it here; I haven't had the time to check this out so far.)

b) Have a look at your installed patches...
http://www.hp.com/products1/unix/java/infolibrary/patches.html

(And this is a shot in the dark ;-) check whether they include:

11.0 PHKL_24064 eventport (/dev/poll) pseudo driver (and dependencies)
11.11 PHKL_25468 eventport (/dev/poll) pseudo driver (and dependencies)

in case you're using select() a lot (the man page is delivered along with these patches).
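
For what it's worth, here is a sketch of the application-level pattern that leans on select()/poll() underneath. It uses java.nio, so it assumes JDK 1.4 or later (not the 1.2.2/1.3 releases mentioned above), and the port number is made up; the point is just that a single thread polling a large set of sockets on every pass, as a daemon like rvd with ~1500 sockets must do, is the kind of workload those eventport patches are aimed at, as far as I understand them.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

// Hedged sketch, not rvd code: one thread multiplexes many sockets through a
// Selector; under the covers the JVM drives this with select()/poll()
// (or /dev/poll where available).
public class SelectLoop {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(5555)); // hypothetical port
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (true) {
            selector.select(); // one poll of the whole descriptor set per pass
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {       // new connection: add it to the set
                    SocketChannel c = server.accept();
                    c.configureBlocking(false);
                    c.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {  // data ready on one socket
                    buf.clear();
                    if (((SocketChannel) key.channel()).read(buf) == -1) {
                        key.cancel();
                        key.channel().close();
                    }
                }
            }
        }
    }
}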