
System Load on Linux - why so High yet CPU is Low?

Alzhy
Honored Contributor

On HP-UX, system load approaches or exceeds 1.0 almost exactly when CPU utilization approaches or stays at 100%.

On my 24-core Linux RHEL 5.4 DB-only server (Oracle), system load hovers around 20 when CPU utilization (sys+user) is barely 10%.

Is this the norm on Linux?

Hakuna Matata.
10 REPLIES
Steven E. Protter
Exalted Contributor

Re: System Load on Linux - why so High yet CPU is Low?

Shalom,

The answer to this question is: It depends.

Oracle tends to do a lot of I/O, and having a lot of I/O threads tends to pump up your load factor even if the system itself is not doing CPU-heavy work.

Linux handles I/O differently than HP-UX.

Your HP-UX example is not representative of HP-UX either.

On both OSes, the load factor does not measure CPU utilization. It reflects how many processes are waiting for CPU.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
dirk dierickx
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

The load number you speak of is an indicator of how many processes are waiting for CPU time. The CPU% is how busy a CPU is. Those are two different things, and they tell different stories.
Alzhy
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

But amigos, this is a 24-core system. Why would processes be queuing?

OTOH, I found this:
"An idle computer has a load number of 0 and each process using or waiting for CPU adds to the load number by 1. Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system. This, for example, includes processes blocking due to an NFS server failure or to slow media (e.g., USB 1.x storage devices)."

So is it possible that whatever 2 or 3 processes are on the "run queue" are queued not for lack of CPU but because they are waiting for hardware interrupts -- for example I/O?
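
To make the quoted rule concrete, here is a minimal Python sketch that counts the processes which would contribute to the Linux load figure at one instant: those in state R (running or runnable) plus those in state D (uninterruptible sleep). The function names are made up for illustration. It reads /proc/<pid>/stat, which covers one entry per process, so it will undercount on systems with many threads; treat it as a rough cross-check, not as the kernel's own accounting.

```python
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def load_contributors(states) -> int:
    """Count tasks in R (running/runnable) or D (uninterruptible sleep)."""
    counts = Counter(states)
    return counts["R"] + counts["D"]


def current_states(proc_root: str = "/proc"):
    """Yield the state of every process currently visible under proc_root."""
    for stat in Path(proc_root).glob("[0-9]*/stat"):
        try:
            yield parse_state(stat.read_text())
        except (OSError, IndexError):
            continue  # process exited between listing and reading


if __name__ == "__main__":
    states = list(current_states())
    print(Counter(states))
    print("R + D right now:", load_contributors(states))
```

If the D count is large while R stays small, the load is coming from blocked I/O rather than from CPU contention.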

Here are some stats:

root@sapsrv # sar 5 5
Linux 2.6.18-164.el5 (sapsrv.xyz.com) 03/12/2010

05:17:02 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
05:17:07 AM  all   4.39   0.00     1.56    40.71    0.00  53.34
05:17:12 AM  all   4.37   0.00     1.86    38.84    0.00  54.94
05:17:17 AM  all   6.83   0.00     2.79    39.48    0.00  50.90
05:17:22 AM  all   4.79   0.00     1.12    38.44    0.00  55.65
05:17:27 AM  all   4.63   0.00     1.28    39.93    0.00  54.17
Average:     all   5.00   0.00     1.72    39.48    0.00  53.80

root@sapsrv # vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd     free   buff    cache  si  so    bi    bo   in   cs us sy id wa st
 1 28 2165624 72369488 457200 48163284   0   0   627   935    1    1  3  1 75 21  0
 5 30 2165624 72319024 457216 48163280   6   0 23461 30454 5454 9440  6  2 52 40  0
 3 31 2165620 72324128 457232 48163324   0   0 19464 32710 4506 7602  5  2 56 37  0
 1 30 2165616 72371296 457252 48163360   0   0 13883 35297 4285 7108  5  3 52 40  0
 3 28 2165616 72372144 457268 48163408   0   0 17331 29416 4417 8345  4  3 55 38  0

root@sapsrv # uptime
05:17:58 up 13 days, 20:06, 11 users, load average: 21.21, 20.51, 20.34
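
The three figures uptime prints are not plain averages. On Linux they are exponentially damped averages of the active task count (runnable plus uninterruptible), sampled roughly every 5 seconds over 1, 5 and 15 minute windows. Here is a rough Python sketch of that arithmetic. It uses floating point where the kernel uses fixed point, and the input of 20 constantly active tasks is a made-up illustration, not data from this server.

```python
import math

SAMPLE_INTERVAL_S = 5  # the kernel samples the task count about every 5 seconds


def damped_load(active_counts, window_s=60, start=0.0):
    """Fold a series of per-sample active-task counts into one load figure."""
    decay = math.exp(-SAMPLE_INTERVAL_S / window_s)
    load = start
    for n in active_counts:
        load = load * decay + n * (1.0 - decay)
    return load


# Hypothetical input: 20 tasks constantly runnable or blocked on I/O.
one_minute = damped_load([20] * 12)                        # after 60 s of samples
fifteen_minute = damped_load([20] * 12, window_s=15 * 60)  # same minute, 15-min window
print(round(one_minute, 2), round(fifteen_minute, 2))
```

Starting from an idle system, the 1-minute figure reaches only about 63% of the steady value after one minute. When all three figures sit near the same value, as in the uptime output above, the condition has persisted for a long time.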

From the looks of it, I think my system needs some serious "tuning". This is basically a RHEL 5.4 system with 2 x QLogic 4Gbit HBAs, 24 cores and 128GB RAM, with the kernel tuned for Oracle DB hosting. I think there may be a need to tune the QLogic HBAs (queueing?) and possibly process affinity/binding, as some of the record-breaking TPC runs often do.


Hakuna Matata.
Ivan Ferreira
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

Yes, your system has too much I/O wait. What kind of storage and disk configuration are you using? Do you have direct and async I/O enabled? Are you using raw devices/ASM or a file system for the database?

As the system is not paging, this is not related to memory misconfiguration/tuning on the database.

The output of sar -d or iostat would help. Check your disk service time.
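
If sar -d or iostat is not handy, the average wait per I/O can be approximated from two snapshots of /proc/diskstats. The following is a minimal Python sketch with hypothetical function names. It assumes the full 11-statistic layout after the device name and skips shorter lines (older 2.6 kernels report fewer fields for partitions). The result corresponds to iostat's "await" (queue time plus service time), not to its svctm column.

```python
def parse_diskstats(text):
    """Map device name -> (ios_completed, ms_spent) from /proc/diskstats text."""
    totals = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # short (partition-style) or empty line
        reads, ms_reading = int(f[3]), int(f[6])
        writes, ms_writing = int(f[7]), int(f[10])
        totals[f[2]] = (reads + writes, ms_reading + ms_writing)
    return totals


def average_wait_ms(before, after):
    """Average time per completed I/O between two parse_diskstats() results."""
    result = {}
    for dev, (ios_after, ms_after) in after.items():
        if dev not in before:
            continue  # device appeared between snapshots
        ios_before, ms_before = before[dev]
        done = ios_after - ios_before
        if done > 0:
            result[dev] = (ms_after - ms_before) / done
    return result
```

Typical use would be to read /proc/diskstats, sleep a few seconds, read it again, and pass both parsed snapshots to average_wait_ms().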
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Alzhy
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

Okay..

In this particular scenario -- I am now theorizing the issue is with several vgdisplay and pvdisplay sessions that are hung.

It so happened that a multipath device was not gracefully and properly unpresented, so LVM is still trying to scan those devices -- hence the hung LVM commands.

But I do recall occasions in the past when I noticed a system whose CPUs were not busy showing load averages above the ideal of less than 1.0 for a perfectly balanced and loaded system.

Hakuna Matata.
Ivan Ferreira
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

If "load average" is your concern and not the I/O wait, maybe this link can give you some answers:

http://www.redhat.com/magazine/011sep05/departments/tips_tricks/

Section:

"What is the relation between I/O wait and load average?"

I still want to know your disks' service time. Maybe your load balancing policy is not ideal for your storage.
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Alzhy
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

Ivan,

Our backend is all HP StorageWorks - EVAs and XP Storage. We have the 4.3.0 HPDLM bundle installed.

I just looked at our sysstat sar logs, and queuing at times is through the roof, with service times sometimes breaching 100ms.

My FC links are dual 4Gbit QLogic HBAs. The only tweak I have is:

options qla2xxx ql2xenablemsi=1 ql2xenablezio=1 ql2xintrdelaytimer=1 ql2xmaxqdepth=96 ql2xfailover=0

The latest patching no longer seems to like these options, as the driver seems to have been downgraded (awaiting confirmation from RH though). We're on RHEL 5.4 btw.

Hakuna Matata.
Ivan Ferreira
Honored Contributor

Re: System Load on Linux - why so High yet CPU is Low?

Do all disks show that service time? We had problems with EVA storage when, for some reason, the write-back cache was disabled on the vdisk.

For XP Storage, I would run a dd performance test to identify the response time outside the database.
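
As a rough stand-in for such a test, here is a small Python sketch that times synchronous writes to a scratch file on the file system under suspicion. It is not a replacement for dd with direct I/O: it goes through the page cache and relies on fsync to force each block out. The function name, block size and count are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def timed_sync_writes(directory, count=20, block_size=1024 * 1024):
    """Write `count` blocks, fsync after each, and return per-write latency in ms."""
    block = b"\0" * block_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)  # force the data to storage before stopping the clock
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file even if a write failed
    return latencies
```

Calling timed_sync_writes() with a directory on the LUN in question and looking at the maximum and the average of the returned list gives a first idea of whether the storage itself is slow, independent of the database.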
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Ragu_3
Trusted Contributor

Re: System Load on Linux - why so High yet CPU is Low?

Check the kernel I/O scheduler. Changing it may help a lot. What are your disk I/O devices?
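
Assuming the scheduler meant here is the block I/O scheduler (elevator), the active one for a device can be read from sysfs, where the kernel lists the available schedulers and marks the active one with square brackets. Here is a minimal Python sketch with illustrative function names. The device name passed to read_scheduler() is whatever your disks are called, for example sda.

```python
def active_scheduler(sysfs_text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None


def read_scheduler(device):
    """Read /sys/block/<device>/queue/scheduler and return the active scheduler."""
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return active_scheduler(fh.read())
```

On a line such as "noop anticipatory deadline [cfq]" this returns "cfq". Which scheduler suits a SAN-backed Oracle host best is something to test rather than assume.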
Debian GNU/Linux for the Enterprise! Ask HP ...
MarkSeger
Frequent Advisor

Re: System Load on Linux - why so High yet CPU is Low?

Personally I find I/O wait to be of minimal use other than to tell you there is I/O going on, as all it tells you is that there was a clock tick during I/O. As an experiment, if you generate a lot of I/O with a tool like Robin Miller's dt program (my favorite) and run collectl to watch CPU and disk, you should see almost 100% disk utilization (this is a good thing) and high I/O wait as well, since the CPU doesn't have anything better to do while waiting for I/O. Now fire off something with a heavy CPU load and you should see I/O wait go down, yet the I/O load remains at 100%.
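
To watch this yourself without extra tools, the CPU percentages can be derived from two readings of the aggregate "cpu" line in /proc/stat. Here is a minimal Python sketch with made-up function names. It uses only the first eight counters (user through steal) and ignores any later fields.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}
```

Sampling before and during the CPU-heavy job should show iowait shrinking while user time grows, even though the disks are just as busy. Ticks are only booked as iowait when the CPU is otherwise idle.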

Another counter that can be tricky is the load average. The intent is to show how busy your system is by looking at the number of jobs in the run state, BUT in some cases it can be misleading. Consider a very busy NFS server with many daemons, say over 500. I had a server running flat out and I saw load averages in the 50s and even higher! From this, one would be inclined to say the system was misbehaving, but it wasn't. With this many threads of NFS activity, a lot of the daemons were actively doing work, hence the high average. While this may be an extreme example, my point is that at least in some cases looking at the numbers isn't sufficient and you have to consider what the system is doing at the time you're measuring it.

-mark