Operating System - HP-UX

J7000 develops increasing load average over time

 
Hugh Smith
Occasional Contributor

We have a J7000 (4GB RAM; 18GB of disk for system and development on vxfs filesystems; 36GB of RAID 5 storage for the mail store on HFS; HP-UX 11, June IPR) which we use as our student mail store. The job mix is mostly imapd and sendmail. The number of processes is fairly steady, averaging about 180-260. When the system is first booted, load averages and performance are what you would expect from this class of machine (LA of 0.03-0.20 with an occasional rise to 0.60).

The problem is that over time (typically 5 to 10 days) the system seems less and less able to sustain this load. Surges in the job mix, typically caused by multi-recipient listserv-type mail, push the system into high load averages for longer and longer periods, at times making it unusable. It reminds me of the symptoms you would see with a resource shortage. At that point the same job mix that produced sub-1 LAs produces LAs anywhere from 6 to 18. After a reboot the symptoms disappear.

I have checked with sar and the other typical Unix tools. The only thing that stands out is %system vs %user (system will be at 30-49% and user at 1-4%), which would make you think it is an I/O bottleneck, but no other information points to that as the problem. Some answers I have had suggested that the vxfs filesystems needed defragmentation. However, since the condition is fixed by rebooting and no defragmentation is done at that time, I'm not sure this is the root cause of the problem.

I have added most of the latest patches (June 2000 IPR, NFS megapatches, etc.) and have reached the end of the things I can tweak. BTW, we have a J5000 running 10.20 configured just the same, and it handles 3x the load with no problems. Has anyone else encountered this type of problem? Suggestions?

Hugh Smith
Computing and Communication Services
University of Guelph
Rick Garland
Honored Contributor

Re: J7000 develops increasing load average over time

One culprit of increasing load is sendmail. Is there an increasing number of messages being sent to/from the system? Are the messages getting larger? Sending/receiving large messages will increase the load.
Bill Hassell
Honored Contributor

Re: J7000 develops increasing load average over time

Start by loading a copy of Glance (or you can use top, but Glance is much more useful). There is an evaluation copy on your Application CD-ROMs. Look at the top processes that are consuming time. A load average (i.e., the figure reported by uptime or top) is not exactly what its name suggests. It is the average size of the RUNQUEUE.

The RUNQUEUE is the number of processes that are ready to run at the same time. This is only a rough indication of the loading on the machine. Because the RUNQUEUE changes dozens of times per second, there are scenarios where short I/O processes can paralyze a system if there are enough of them, yet no single process consumes much time.
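To make the averaging point concrete, here is a minimal Python sketch. It assumes the load average is a plain mean of run-queue samples, which is a simplification, and the sample values and the function name `mean_run_queue` are hypothetical rather than anything HP-UX provides.

```python
def mean_run_queue(samples):
    """Return the plain average of sampled run-queue lengths."""
    return sum(samples) / len(samples)


# Hypothetical samples: the run queue is mostly empty, with two short bursts.
bursty = [0, 0, 40, 0, 0, 40, 0, 0, 0, 0]

# Hypothetical samples: a constant backlog of eight runnable processes.
steady = [8] * 10

print(mean_run_queue(bursty))  # 8.0
print(mean_run_queue(steady))  # 8.0
```

Both series average to the same figure even though the machines behind them would feel very different, which is why the number alone is only a rough guide.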

An example is an X Window program that keeps track of cut buffers in a window manager. Since there is no wake-up call to a remote program indicating that new data exists, such a program must poll the window manager to see if something happened, typically once a second. That's two X messages, one for the request and a second (perhaps more) for the result.

Now multiply that program by 100 copies, and 200 messages per second go over the LAN with 100 context switches per second. The load average could go to 10 or 20, yet the system would still seem to respond rapidly.

Alternatively, you might have dozens of fairly large programs that all run at the same time and are waiting because there are no more processors available. If they don't all fit into RAM at the same time, then paging (swap) begins. vmstat, specifically the po (page out) column, will tell you if RAM is too small.
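If you capture vmstat output to a file, a short script can pull out the po column for you. The following Python sketch is one way to do it, assuming the output has a header row that names a `po` column. The `SAMPLE` text is made-up data in a simplified layout, not output from a real system, and `page_out_values` is a hypothetical helper name.

```python
def page_out_values(vmstat_text):
    """Extract the 'po' column from vmstat-style text, finding it by header name."""
    values = []
    po_index = None
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "po" in fields:
            # Header row: remember which column holds page-outs.
            po_index = fields.index("po")
            continue
        if po_index is None or len(fields) <= po_index:
            continue  # Not a data row we can interpret.
        if fields[po_index].isdigit():
            values.append(int(fields[po_index]))
    return values


# Made-up sample in a simplified layout (assumption, not real output).
SAMPLE = """\
 r b w avm free re at pi po fr de sr in sy cs us sy id
 1 0 0 1000 5000 0 0 0 0 0 0 0 100 200 50 2 30 68
 6 0 0 1200 300 0 0 4 25 0 0 0 100 200 50 3 45 52
"""

po = page_out_values(SAMPLE)
print(po)                      # [0, 25]
print(any(v > 0 for v in po))  # True
```

Sustained non-zero values are the thing to look for. An isolated blip is less interesting than page-outs that persist while the load average is high.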

There is no easy answer as your system is probably doing a lot of things at the same time. Use Glance or top to sort the big items first and see why they are consuming so much time.

It's certainly possible that the programs have a memory leak, which will show up as constantly growing program sizes. That's a programming error, but it will impact the system.
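One rough way to spot a leak is to record process sizes at intervals and look for processes that grow in every interval. The Python sketch below assumes you already have such snapshots as dictionaries mapping PID to size. The function name `steadily_growing`, the PIDs, and the sizes are all hypothetical, and how you collect the snapshots (Glance, top, ps) is left open.

```python
def steadily_growing(snapshots):
    """Return PIDs present in every snapshot whose size rose at every step."""
    if len(snapshots) < 2:
        return []
    # Only consider processes that survived across all snapshots.
    common = set(snapshots[0])
    for snap in snapshots[1:]:
        common &= set(snap)
    growing = []
    for pid in sorted(common):
        sizes = [snap[pid] for snap in snapshots]
        if all(a < b for a, b in zip(sizes, sizes[1:])):
            growing.append(pid)
    return growing


# Hypothetical snapshots taken some hours apart: {pid: size in KB}.
snapshots = [
    {101: 2000, 202: 3500, 303: 1200},
    {101: 2600, 202: 3500, 303: 1100},
    {101: 3400, 202: 3500, 303: 1300},
]

print(steadily_growing(snapshots))  # [101]
```

A process that only ever grows is a candidate, not a verdict. Some programs legitimately grow as they cache data, so it is worth watching a candidate over a longer period before blaming it.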


Bill Hassell, sysadmin