High %CPU sys and context switches

Ninad_1 · 03-05-2006

Hi,

I have been asked to analyse a problem that shows up in the CPU statistics as a very long CPU run queue on a system every day at around 23:30.
Although I do not have root access, I collected the output of sar, top, ps -eafl and vmstat on the system at around 23:30; the outputs are attached.

My observations and questions are as follows:
All observations come from the collected logs and apply to the period around 23:30 each day.
The vmstat output shows a large number of running processes in the "r" field, starting at 80, rising above 1000 and then falling again, along with an increased number of context switches.
The sar output shows the same: high "runq-sz" figures along with around 99% %CPU sys.
The top output also reflects the increased number of running processes and the high %CPU sys, and shows many rm commands in the list of top processes.
ps -eafl, run during that period, took a very long time to respond, and I could get its output only once CPU usage had dropped a little. In short, ps also shows many processes in the "R" state, and many of them are rm commands.

My observation is that at night a scheduled batch job archives older files and then removes them from the original location using a command like find .... -exec rm -f {} \;

Is it that, because so many rm commands are fired, there are a lot of running processes, so the CPU has to do a lot of context switching among them, and this causes the high %CPU sys?

My queries:
1. What is the problem?
2. Is virtually no processing happening because of the context switches?
3. What are acceptable values for "runq-sz", runnable/running processes, context switches and %CPU sys?
4. When find is run with -exec rm, does each rm start only after the previous rm completes, or do they run in parallel? (From ps etc. it seems that they run in parallel, but I would like your comments.) If so, is using find with -exec a bad idea because it hampers the system?

Please advise.

Thanks a lot
Ninad

Bill Hassell · 03-05-2006

A high run queue is not a problem if that is what is expected. If you run a find...exec rm for several million files, then you can expect a high run queue. Add a couple of cpio activities and regular database activities and all is well. Yes, find...exec rm runs a separate rm process for each file, but since each rm takes so little time to run and find can generate thousands of rm commands in less than a second, the run queue will indeed be high.

Now if all of this is slowing the system down at a time when processing (and disk I/O) horsepower is needed for other tasks, then you'll need to reschedule the cleanup. Or better yet, redesign the cleanup task. Perhaps a better solution is to look at the archive process, since it apparently involves a massive number of files, and find a less invasive way to accomplish the tasks.

Bill Hassell, sysadmin

James R. Ferguson · 03-05-2006

Hi Ninad:

If your cleanup process does something like:

# find -xdev -mtime +30 -exec rm {} \;

...then for *each* file meeting the specified criteria, a new process ('rm') is spawned. This can be terribly expensive.

Change the above mechanism to:

# find -xdev -mtime +30 | xargs rm

With this form, 'xargs' collects the file names in an internal buffer and spawns one 'rm' command for each buffered block (list) of files. The number of processes spawned will be greatly reduced, along with your CPU queue depth and utilization.
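
To make the difference concrete, here is a minimal sketch that counts how many commands each form starts. It is an illustration under assumptions, not the actual cleanup job: the scratch directory, the 200 empty files and the `echo run` stand-in for 'rm' are all made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical scratch directory with 200 empty files (names are illustrative).
tmpdir="${TMPDIR:-/tmp}/rmdemo.$$"
mkdir "$tmpdir" || exit 1
i=1
while [ "$i" -le 200 ]; do
    : > "$tmpdir/file$i"
    i=$((i + 1))
done

# Form 1: -exec ... \; starts one command per file.
# "sh -c 'echo run'" stands in for rm and prints one line per process started.
per_file=$(find "$tmpdir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# Form 2: xargs gathers the names and starts one command per batch.
batched=$(find "$tmpdir" -type f | xargs sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "commands started with -exec ... \\;: $per_file"
echo "commands started with xargs:       $batched"

# The real removal, done in batches.
find "$tmpdir" -type f | xargs rm -f
remaining=$(find "$tmpdir" -type f | wc -l | tr -d ' ')
echo "files remaining: $remaining"
rmdir "$tmpdir"
```

One caution: plain 'xargs' splits its input on whitespace, so file names containing spaces or newlines will be mishandled. Where your find supports the POSIX form `-exec rm {} +`, it batches the names in the same way without that hazard; check the find man page on your release before relying on it.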

Regards!

...JRF...

Ninad_1 · 03-05-2006

Thanks Bill and James.
James,
You have mentioned the exact syntax used for the cleanup process and I agree that the solution you have suggested should be a better one.
Currently the system responds very slowly if I log in around that time. I am not sure whether any processing is affected, but it is still not good for the system to be unresponsive, so I think I can suggest the modification.

If you could also guide me on the other part of the queries - points 3 and 4 please.

Thanks,

Ninad

Bill Hassell · 03-06-2006

As for the 'acceptable' value of the run queue, the classic answer is: it depends. First, the run queue is the number of processes currently running or waiting to be run. This is a kernel queue which excludes processes waiting for anything (I/O, memory allocation, semaphores, etc). So one could extrapolate that when the runq equals the number of CPUs, the system is fully loaded. And the corollary might be that a runq larger than the number of CPUs means the system is overloaded.
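
As a rough sketch of that rule of thumb, the run queue can be expressed per CPU. The helper name `runq_ratio` and the CPU count of 4 are assumptions for illustration only; the thread does not say how many CPUs the system has. The run-queue figures 80 and 1000 are the ones reported from vmstat above.

```shell
#!/bin/sh
# runq_ratio RUNQ NCPU: print the run-queue length per CPU, two decimals.
# (Hypothetical helper; the name is not from any standard tool.)
runq_ratio() {
    awk -v q="$1" -v n="$2" 'BEGIN { printf "%.2f\n", q / n }'
}

runq_ratio 2 2      # run queue equal to the CPU count: prints 1.00
runq_ratio 80 4     # assumed 4 CPUs, r = 80:   prints 20.00
runq_ratio 1000 4   # assumed 4 CPUs, r = 1000: prints 250.00
```

By the rule above, a value near 1 would mean fully loaded and anything well above 1 would mean overloaded.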

But this is too simplistic. I've personally seen a system running with 2 processors where the average run queue was 45-50 for hours. Yet no one complained. How can this be? The processes creating the load were extremely short-lived polling processes. They sent one LAN packet, waited for a response, and then went to sleep for 1/2 second. Multiply this by 400 copies of the same program and the runq (and context switches) were extremely high, but because the OS kept things moving and the processes consumed so few total cycles, no one noticed any slowdown.

By changing the polling program to run every 2 seconds (rather than every 1/2 second), the runq dropped to 5-6 and context switches dropped to about 10% of their previous value. Other than the large kernel load numbers, the only real impact was on the LAN, a continuous load of about 2000 packets/second, so the polling was adjusted to lower the network load.

In your case, the rm command (for one file at a time) exhibits a similar load, but instead of loading the LAN with requests, it loads the filesystem with massive numbers of directory requests, and that will indeed impact overall system performance. Change the cleanup process (as James mentioned) to use xargs, which will remove the files more efficiently with fewer rm processes. Note that the apparently parallel rm's seen in ps are an artifact of an extremely fast system and a slow measurement tool (ps). Most kernel measurement tools are misleading because things change very rapidly and the tools can't keep up. Look at the number of context (program) switches. Thousands per second is normal on a busy system.

I would use the metrics as a starting point. If something seems high, do these numbers seem to affect performance? Are the numbers new or have they always been this way?
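
One way to answer the "new or always this way" question is to reduce each night's vmstat capture to a single figure per column and compare the nights. The sketch below is only an assumption about how that could be done: `avg_col` is a hypothetical helper, and the sample rows are made up with a simplified header, not real vmstat output from this system.

```shell
#!/bin/sh
# avg_col NAME: read vmstat-style text on stdin, locate the header row that
# has a column called NAME, and print the mean of that column over the
# numeric rows that follow. Repeated header rows are skipped.
avg_col() {
    awk -v name="$1" '
        col == 0 {
            # Still looking for the header row that names the column.
            for (i = 1; i <= NF; i++) if ($i == name) { col = i; break }
            next
        }
        $col ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n > 0) printf "%.1f\n", sum / n; else print "no data" }
    '
}

# Made-up sample in a simplified vmstat-like layout (illustration only).
sample='procs   memory   faults
 r  b  free   in   cs
 80 0  5000  900 12000
 400 1 4800  950 30000
 1000 2 4700 990 48000'

printf '%s\n' "$sample" | avg_col r    # prints 493.3
printf '%s\n' "$sample" | avg_col cs   # prints 30000.0
```

Against live data this would be run as, for example, `vmstat 5 12 | avg_col cs`. If a column name occurs twice in the header (as "sy" does in some vmstat layouts), the helper uses the first match.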

Bill Hassell, sysadmin
