K.C. Chan
Trusted Contributor

high load ...

Running RH9 with kernel 2.4.20-31.9smp and experiencing high load: more than 50% of CPU is sys. top sorted by CPU does not show any process hogging the CPU. Only one application is running on the system. I have a feeling that it's not releasing resources properly. There may be zombie processes that were not properly terminated. The application runs for a few days, then the system needs to be rebooted due to high load. After reboot, it operates normally. I am not 100% certain that the zombie processes are the cause of it, but I know for sure something is not being released or terminated properly.
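
One way to confirm or rule out the zombie theory is to list them directly. This is only a rough sketch, assuming a procps-style ps; the count_zombies helper name is made up for illustration. Note that a zombie holds only a process-table entry, so a handful of them would not by itself account for high sys time.

```shell
# Count zombies in `ps` output read from stdin.
# Assumes the first column is STAT and that zombie states start with "Z".
count_zombies() {
    awk 'NR > 1 && $1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Show the header plus any zombie processes and their parent PIDs...
ps -eo stat,pid,ppid,comm | awk 'NR == 1 || $1 ~ /^Z/'

# ...then print just the total.
ps -eo stat,pid,ppid,comm | count_zombies
```

The PPID column matters most here: a zombie goes away only when its parent reaps it, so a growing list under one parent points at that parent.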
Reputation of a thousand years can be determined by the conduct of an hour
4 REPLIES
Mike Jagdis
Advisor

Re: high load ...

What hardware is this, and are you running the HP monitoring daemons?
Mike Jagdis
Advisor

Re: high load ...

Ok, I've posted the same in a couple of other threads. I might as well preempt the reply and just paste it here as well...

----------------------------------------------
The explanation is somewhat involved...

If you trace cma*d you'll find that it doesn't do anything but open the device, ioctl, close. Admittedly rather more times than should be necessary but that's just incidental bad design.

You'll find the delay - and system time consumption - seems to happen on the close. From here you need a fairly good working knowledge of the Linux kernel...
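
For anyone who wants to see this for themselves, a sketch along these lines should do, assuming strace is installed and you know the daemon's PID. The slow_calls helper and the 0.1 second threshold are my own illustration, not anything the daemons ship with. strace -T appends the time spent in each call as <seconds>, so the filter just picks out the slow ones.

```shell
# Print lines of `strace -T` output (read from stdin) whose syscall took
# at least the given number of seconds (default 0.1).
slow_calls() {
    awk -v min="${1:-0.1}" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding < > to get the elapsed time.
            t = substr($0, RSTART + 1, RLENGTH - 2)
            if (t + 0 >= min + 0) print
        }'
}

# Usage (PID is a placeholder for the daemon's process id):
#   strace -T -e trace=open,ioctl,close -p "$PID" 2>&1 | slow_calls 0.1
```

If the explanation here applies to your box, the lines that survive the filter should be mostly close() calls.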

OK? Still with me then?

Run oprofile for a while and you'll find the cpu time is being consumed by invalidate_bdev. Which is interesting :-).

invalidate_bdev is called from kill_bdev. kill_bdev is called from the block device release code. Release is what happens on last close. Now the monitoring daemon is opening the unpartitioned disk device, which it is pretty certain nothing else has open. (Offhand I'm not sure if even having an fs on the device counts as it being open. There are subtle differences and I *think* I'm right in saying that block device access and fs access are considered different at this level. Don't quote me or blame me!)

So, each close triggers invalidate_bdev. Why is this so bad? Well, the idea is that when the last close happens on a device you need to flush any cached data because, with much PC HW, you can't be sure when the media gets changed. invalidate_bdev isn't *meant* to be called often. It works by scanning through the entire list of cached data for block devices to find and drop data related to the device being closed. So it sucks system time, and the amount is proportional to the amount of data you have cached (from any device).
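
Since the cost scales with how much is cached, a rough indicator of how bad each close will be is the size of the buffer and page cache at that moment. A minimal sketch, assuming a Linux /proc/meminfo with "Buffers:" and "Cached:" lines in kB; the cached_kb name is made up for illustration:

```shell
# Sum the Buffers and Cached lines (in kB) from meminfo-style input on stdin.
cached_kb() {
    awk '$1 == "Buffers:" || $1 == "Cached:" { kb += $2 } END { print kb + 0 }'
}

# Total currently held in buffers plus page cache, in kB.
cached_kb < /proc/meminfo
```

This would also fit the pattern of a box that is fine after a reboot and degrades over a few days: the cache is nearly empty at boot and fills up as the application runs.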

WORKAROUND:
All you need to do is to make sure that each time the cma*d daemon closes the device it isn't the *last* close - i.e. some other process has the device open. The other process doesn't even need to *do* anything. Try something along the lines of:

sh -c 'kill -STOP $$' < /dev/cciss/c0d0 > /dev/null 2>&1 &

Hope that's all clear! (As mud... :-) )

(HP: As well as blind debugging I do Linux & OSS consultancy. I happen to know the answer to this one as it came up at a major investment bank...)
K.C. Chan
Trusted Contributor

Re: high load ...

Hardware is a Dell 530 with two CPUs at 2.8GHz, running with the HT option. Kernel is 2.4.20-31.9smp. *nix flavor is Red Hat 9.

I noticed that the load average is high (24-40), yet idle time is in the 80-90% range. Furthermore, swapd is in the top ten lines of top and seems to be more active than usual, yet the free command shows no swap in use. What is unusual is that when the load average is high, idle time should be very low or non-existent. Does anyone know how to track down what is causing this? Would it be a good idea to update the kernel to 2.4.28? Thanks.
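
For reference, on Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable ones, so a high load with mostly idle CPUs can simply mean many blocked tasks. A rough way to check, assuming a procps-style ps; the blocked_tasks helper name is hypothetical:

```shell
# Keep the header line plus any task whose STAT column starts with "D"
# (uninterruptible sleep), reading `ps` output from stdin.
blocked_tasks() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# wchan shows the kernel function each task is sleeping in, where available.
ps -eo stat,pid,wchan,comm | blocked_tasks
```

Running this a few times while the load is high should show whether the same processes keep turning up in state D.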
K.C. Chan
Trusted Contributor

Re: high load ...

The system was bottlenecked on I/O. Now, can someone explain why swapd is so active when this happens? Thanks.