Re: load average of > 80 is ok right?

user001 · ‎10-24-2011

OK, so I have an HP-UX 11.31 system with a load average of 80 or higher. When I googled this I didn't see anyone else complaining about this level of load, and I have another (higher-spec) server which runs at a load average of around 0.06.

It seems to average around 50 but today for about 2 hours it went above 100.

The cpu seems to remain 20% idle during this time.

My first thought is that another CPU would be great for this system.

It's an rx1600 with a single 1.0 GHz CPU (1.5 MB cache).

What's the best way to confirm this?

I'm getting these results from the SNMP OIDs for the 1, 5 and 15 minute load averages, and likewise for the CPU utilization.

thank you.

user001 · ‎10-24-2011

I guess my real question is: do you see such high load on your systems?

I'll look at memory and disk I/O shortly. At the moment I'm just trying to grab disk I/O stats by running a glance script over a period of time.

James R. Ferguson · ‎10-24-2011

Hi:

This is a very high value. Are your end-users satisfied with the performance they are getting?

I'd use 'glance' to see if there is a particular process or set of processes responsible. Have you modified the kernel's default 'timeslice' value? The default is 10.

Regards!

...JRF...

user001 · ‎10-24-2011

We are troubleshooting performance issues at the moment because users are complaining of slowness at times.

I should also note the server has been rebooted.

I'll check the value, but I suspect not; I don't recall changing it.

thanks.

Bill Hassell · ‎10-24-2011

On a single cpu system, 80 is very high but the number can be misleading. As you have seen, your CPU is only 80% used so the load is not related to end user work. What load average means is the number of running and ready-to-run processes. In your case, only one process can run during any given instant while the other 79 are waiting to run. If a process is 100% CPU-bound, you would see a CPU % of 100 using sar or top. But HP-UX will automatically lower the priority (a numerically higher number) of a CPU-bound process to allow for other processes to get a fair share, and especially, to let short-run processes (typically heavy I/O) run.

With a very high load average and less than 100% CPU, this usually indicates a runaway process that is mostly kernel-bound, that is, the process loops as fast as it can to perform some kernel task like getting the time or doing a select or trying to respond to short interrupts.

To create a 100% CPU load, type these commands:

while :
do
:
done &

That generates 100% CPU usage on one CPU. Verify with sar -u 1 2 and also with top.

Here are some examples:

Normal:

System: atl3                                          Mon Oct 24 16:23:30 2011
Load averages: 0.29, 0.45, 0.26
136 processes: 123 sleeping, 13 running
Cpu states:
CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0    0.09   0.0%   0.0%   0.6%  99.4%   0.0%   0.0%   0.0%   0.0%
 1    0.21   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
 2    0.56   0.2%   0.0%   0.4%  99.4%   0.0%   0.0%   0.0%   0.0%
 3    0.29   0.2%   0.0%   2.2%  97.6%   0.0%   0.0%   0.0%   0.0%
---   ----  -----  -----  -----  -----  -----  -----  -----  -----
avg   0.29   0.0%   0.0%   0.8%  99.2%   0.0%   0.0%   0.0%   0.0%
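
A side note on reading this output: the avg row looks like the plain mean of the per-CPU LOAD column (0.09 + 0.21 + 0.56 + 0.29 = 1.15, and 1.15 / 4 is about 0.29), which matches the "Load averages" line at the top. The following is only a sketch for checking that arithmetic on a saved copy of the "Cpu states" rows. The function name avg_cpu_load is made up for illustration, and it assumes the LOAD value is the second field, as in the sample above.

```shell
# avg_cpu_load (hypothetical helper): read top's "Cpu states" rows on stdin
# and print the mean of the LOAD column, rounded to two decimals.
# Only rows whose first field is a CPU number and whose second field looks
# like a decimal load value are counted, so the "---" and "avg" rows and
# the process list are skipped.
avg_cpu_load() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+\.[0-9]+$/ { sum += $2; n++ }
         END { if (n > 0) printf "%.2f\n", sum / n }'
}
```

Pasting the four CPU rows from the sample into a file and running avg_cpu_load < that file should print the same figure as the avg row.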


A single 100% load (while : do : done)

System: atl3                                          Mon Oct 24 16:25:07 2011
Load averages: 0.12, 0.34, 0.23
137 processes: 123 sleeping, 14 running
Cpu states:
CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0    0.04   0.0%   0.0%   0.8%  99.2%   0.0%   0.0%   0.0%   0.0%
 1    0.22   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
 2    0.14   0.4%   0.0%   0.8%  98.8%   0.0%   0.0%   0.0%   0.0%
 3    0.06   0.2%   0.0%   1.6%  98.2%   0.0%   0.0%   0.0%   0.0%
---   ----  -----  -----  -----  -----  -----  -----  -----  -----
avg   0.12   0.2%  25.1%   0.8%  73.9%   0.0%   0.0%   0.0%   0.0%

Memory: 104256K (48024K) real, 179560K (88028K) virtual, 2719952K free  Page# 1/5

CPU TTY    PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU COMMAND
 1 pts/2  3765 root     212 24   592K   212K run      0:17 102.25 58.55 sh

Notice that this sh load was moved to a NICE priority.
Now start 3 more full load processes:

System: atl3                                          Mon Oct 24 16:26:42 2011
Load averages: 0.24, 0.32, 0.24
140 processes: 123 sleeping, 17 running
Cpu states:
CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0    0.03   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
 1    0.33   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
 2    0.08   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
 3    0.53   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%
---   ----  -----  -----  -----  -----  -----  -----  -----  -----
avg   0.24   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%

Memory: 102832K (38896K) real, 178056K (64004K) virtual, 2721404K free  Page# 1/5

CPU TTY    PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU COMMAND
 3 pts/2  3765 root     236 24   592K   212K run      1:51 99.98 99.61 sh
 2 pts/2  3771 root     236 24   592K   212K run      0:05 84.89 22.00 sh
 1 pts/2  3772 root     233 24   592K   212K run      0:03 92.69 16.80 sh
 0 pts/2  3773 root     228 24   592K   212K run      0:03 93.08 12.96 sh

In the second run, 4 copies of the do-nothing-very-fast script are eating up 100% of user CPU cycles. Yet the load average is not even 1. That is because the number of programs in the run queue is very small.
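
For anyone repeating this test, here is a minimal sketch that starts several copies of the same do-nothing loop and cleans them up afterwards. The names start_burners, stop_burners and BURNER_PIDS are made up for illustration and are not HP-UX commands; the sketch assumes a POSIX shell. Run it only on a box where eating the CPUs for a minute is acceptable.

```shell
# Hypothetical helpers: start N busy loops in the background, remember their
# PIDs, and kill them again when the experiment is over.
BURNER_PIDS=""

start_burners() {
    i=0
    while [ "$i" -lt "$1" ]; do
        # Same do-nothing loop as above, run in a background subshell.
        ( while :; do :; done ) &
        BURNER_PIDS="$BURNER_PIDS $!"
        i=$((i + 1))
    done
}

stop_burners() {
    for pid in $BURNER_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    # Reap the killed loops so they do not linger as zombies.
    wait $BURNER_PIDS 2>/dev/null || true
    BURNER_PIDS=""
}
```

Typical use would be start_burners 4, then watching top for a minute or two, then stop_burners.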

So in your case, the high run queue indicates something is not working very well, possibly through a bad design, possibly due to a networking issue, etc. Look for a large number of processes with the same name:

UNIX95=extras ps -e -o comm | sort | uniq -c | sort -n

This will list the number of processes per command name, largest at the bottom:

...

   1 uniq
   1 vhand
   1 vxfsd
   1 xntpd
   2 -sh
   2 sblksched
   2 sort
   2 sshd:
   4 smpsched
   6 lvmkd
  16 biod
  17 nfsd

This will be a start to find the culprit(s).
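
On a busy box that list can be long, so a variation is to show only the names that occur at least some number of times. This is just a sketch: count_by_name is a made-up function name, the default threshold of 2 is arbitrary, and it keeps only the first word of each command name.

```shell
# count_by_name (hypothetical helper): read one command name per line on
# stdin and print "count name" for every name seen at least MIN times
# (first argument, default 2), smallest count first.
count_by_name() {
    min=${1:-2}
    sort | uniq -c | sort -n |
        awk -v min="$min" '$1 >= min { print $1, $2 }'
}

# Example use with the ps command from above (left commented out here):
# UNIX95=extras ps -e -o comm | count_by_name 10
```

A name that shows up dozens of times when only a handful are expected is a reasonable first suspect.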

Bill Hassell, sysadmin

user001 · ‎10-24-2011

Hi Bill,

Thanks for the info. I read some of your docs floating around about load average. Impressive.

Good to know I'm not crazy. I'll gather more stats and report back.

user001 · ‎10-28-2011

Hello,

OK, so I know this much: I've collected some information from SNMP, ssh and glance.

Network util avg - 20Mb

CPU Idle - 20-40%

Memory Util - 70% - 25% Sys, 30% User, 0% cache??

Swap Util - 7.6% w/ 50% reserved

Disk Queue - 0-0.1

Do you know any good glance metrics to look at for disk I/O?

I'm graphing the results from glance into a monitoring system at the moment, and I can see disk I/O on each mount and the queue length, but with nothing to compare against I'm not sure what's bad.

So the above hasn't really told me a lot.

Thank you.

morpheus_online · ‎04-16-2012

Hi guys, I'm still confused about load averages. Maybe it's because I'm trying to relate the load averages from the "top" command to the run queue column (r) from "vmstat".

Here is the thing: we have a server with 6 CPUs, and with the top command I can see that the load average is always between 2 and 4. At the same time, if I run, for example, "vmstat 1 100", I see that column (r) shows 20 to 30 processes waiting, with CPU usage at 85% to 98% and idle at 0 to 10%. Those are the facts.

So, what I would like to know is whether there is a relation between the figures from these two commands.

And how can I reliably see the queued processes?

Could someone help please?

Thanks!

Dennis Handly · ‎04-16-2012

You should be ignoring load average as confusing and only deal with CPU %.
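
On the vmstat question, one possible reading, offered as an assumption rather than a confirmed answer: in Bill's top samples earlier in the thread the overall load average equals the mean of the per-CPU LOAD values, so top's figure may be a per-CPU number while vmstat's r column counts the whole system. If so, 20 to 30 runnable processes spread over 6 CPUs comes to roughly 3 to 5 per CPU, which is in the same range as the 2 to 4 reported by top. A rough sketch for comparing the two on captured vmstat output follows; avg_runq is a made-up name, and it assumes r is the first column.

```shell
# avg_runq (hypothetical helper): read vmstat output on stdin and print two
# numbers: the mean of the r column (field 1) over the data rows, and that
# mean divided by the CPU count given as the first argument (default 1).
# Header rows are skipped because their first field is not a number.
# Note: the first data row of vmstat is usually an average since boot, so
# it may be worth dropping before feeding the output in.
avg_runq() {
    awk -v ncpu="${1:-1}" '
        $1 ~ /^[0-9]+$/ { sum += $1; n++ }
        END { if (n > 0) printf "%.2f %.2f\n", sum / n, sum / n / ncpu }'
}

# Example use on a 6-CPU server (left commented out here):
# vmstat 1 100 | avg_runq 6
```

If the second number tracks top's load average over the same interval, that supports the per-CPU reading; if not, the two tools are measuring something different on that release.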
