Operating System - Tru64 Unix


 
Mark Poeschl_2
Honored Contributor

round_robin_switch_rate question

I have a 4 CPU ES45 running Tru64 5.1A / PK6.

The system runs typically around 50% aggregate CPU utilization during the day. I'm seeing 3000 - 4000 context switches per second with occasional spikes of 8000 - 10000 per second. This seems excessive to me and sys_check is recommending changing round_robin_switch_rate.

Having looked at the docs for that parameter, I'm not sure it would accomplish anything. We're currently at the default value of zero which is supposed to result in a time quantum of 10 msec and 100 context switches per second.
Since I'm seeing a much higher number than that, it seems very unlikely that many of the context switches result from an expired quantum. More likely they are "the nature of the beast" with this application - a Cache database back-end.
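The reasoning above can be checked with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, and it assumes the "100 switches per second" figure applies per CPU, so four CPUs running CPU-bound threads flat out would top out at 400 quantum-expiry switches per second.

```python
# Rough sanity check: how much of the observed switch rate could
# quantum expiry explain at all? Names here are illustrative only.

def max_quantum_switches_per_sec(quantum_ms, cpus):
    """Ceiling on quantum-expiry switches, assuming every CPU runs
    CPU-bound threads continuously (an assumption, not a measurement)."""
    return cpus * (1000.0 / quantum_ms)

def unexplained_fraction(observed_per_sec, quantum_ms, cpus):
    """Share of observed switches that quantum expiry cannot account for."""
    ceiling = max_quantum_switches_per_sec(quantum_ms, cpus)
    return max(0.0, (observed_per_sec - ceiling) / observed_per_sec)

# Figures from the post: 10 msec quantum, 4 CPUs, 3000-10000 switches/sec.
for observed in (3000, 4000, 8000, 10000):
    share = unexplained_fraction(observed, quantum_ms=10.0, cpus=4)
    print(f"{observed:>6} sw/sec: at least {share:.0%} not due to quantum expiry")
```

Even at the low end of the observed range, the large majority of switches must have some other cause under these assumptions.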

Can anyone shed light on the use of this parameter or whether my reasoning makes sense?
Hein van den Heuvel
Honored Contributor

Re: round_robin_switch_rate question


FWIW, I agree with your analysis. Changing the RR rate is unlikely to make a difference. I would follow up with the Cache folks (www.intersys.com) to see whether they recommend any settings. Maybe some magic with process binding to stop 'known to run shortly' processes from moving around too much?

Cheers,
Hein.

Mark Poeschl_2
Honored Contributor

Re: round_robin_switch_rate question

Thanks Hein -
Do you agree that 3000-4000 sustained and 8000-10000 peak seems like an excessive amount of context switching?
Hein van den Heuvel
Honored Contributor

Re: round_robin_switch_rate question


That's a lot, but not excessive.

This is a client/server SQL engine of sorts, right? So if the application does many little 'singleton' select queries, then you'll see equally many context switches.
With 3000 queries/second I would expect 6000 context switches/second (two per round trip).

The solution would be to push smarts into the query, avoiding row processing in the client, condensing data to the minimal rows/columns needed, minimizing round trips.

Is the 'pipe' big enough to avoid breaking up client/server communication into fragments, each potentially requiring context switches? (In Oracle terms, the SDU/TDU size.)

Is this application suitable for 'Enterprise Cache Protocol'? Sales quote: "Reduces application-server to database-server network traffic by creating shared data caches on the middle tier of distributed architectures"

Hein.
Mark Poeschl_2
Honored Contributor

Re: round_robin_switch_rate question


We have a mixture of thick and thin clients and are a mixed two/three tier architecture so probably not a particularly good candidate for Intersys' ECP.

Guess we'll just have to keep a close eye on this as utilization grows.

Thanks again for the advice...
Alexey Borchev
Regular Advisor

Re: round_robin_switch_rate question

Thanks, Mark!
Quite an interesting topic!
As far as I can see, your case is:
'ideally' you would have 4 CPUs x 100 sw/sec = 400 switches/sec.
Currently you've got 4000 sw/sec, i.e. 10 times more =>
the average length of your task is 1 ms ~ 1M CPU cycles (assuming a 1 GHz CPU).

My case is: GS1280 8 CPU, load 50%, round_robin_switch_rate = 0.
I see ~80k sw/sec with vmstat!!!
In my case 'ideal'=800 sw/sec.
Avg. task length = 0.1 ms = 100,000 CPU cycles => i.e. 10 times worse than your case!!!
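The per-CPU arithmetic for both machines can be written out as a small Python sketch. The function name is hypothetical, the 1 GHz clock is the assumption already stated above, and the optional utilization factor is an extra assumption: with CPUs only 50% busy, the on-CPU time per run would be about half the wall-clock interval between switches.

```python
# Average run length between context switches, per CPU.
# Illustrative only; the inputs are the figures quoted in this thread.

def avg_run_between_switches(switches_per_sec, cpus,
                             utilization=1.0, clock_hz=1e9):
    """Return (seconds, cycles) of CPU time per run between switches.

    utilization=1.0 reproduces the simple estimate above; pass 0.5 to
    account for CPUs that are only half busy (an added assumption).
    """
    per_cpu_rate = switches_per_sec / cpus
    seconds = utilization / per_cpu_rate
    return seconds, seconds * clock_hz

for label, rate, cpus in (("ES45, 4 CPUs", 4_000, 4),
                          ("GS1280, 8 CPUs", 80_000, 8)):
    secs, cycles = avg_run_between_switches(rate, cpus)
    print(f"{label}: ~{secs * 1000:.2f} ms, ~{cycles:,.0f} cycles per run")
```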

Workload:
It's Oracle, running against an SQL client with very simple SQL statements, approx. 5,000 statements/sec.

Am I thinking in the right direction?
The fire follows schedule...
Hein van den Heuvel
Honored Contributor

Re: round_robin_switch_rate question


80,000 csw/sec for 5000 SQL queries/sec is a bit much.
It warrants an attempt at an explanation.
You are not using MTS, are you? Are those simple statements committed inserts? Those would activate the lgwr process.

You may want to drill down into the per-process (group) context switch counts.
I'd suggest a script around ps -o nvcsw,nivcsw.
I suspect that in both your cases you are seeing voluntary switches, meaning that the RR rate is irrelevant.

It would be good/interesting to see a breakdown: 1000 SQL/sec cause n000 sw/sec in total. x000 in DB slaves, y000 in DB 'system' processes,... wherever you find them. Please do report back!
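One possible shape for such a script, sketched in Python. It does not run ps itself; it takes two saved listings and reports per-command deltas, since the counters are cumulative. The four-column layout (PID, voluntary count, involuntary count, command) is an assumption about how you would extend the ps -o nvcsw,nivcsw listing, so check the ps man page for the exact specifier names. The sample data below is made up purely for illustration.

```python
from collections import defaultdict

def parse_ps(text):
    """Parse 'PID NVCSW NIVCSW COMMAND' lines (assumed layout, header first)."""
    rows = {}
    for line in text.strip().splitlines()[1:]:      # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue                                # ignore malformed lines
        pid, nvcsw, nivcsw, comm = parts
        rows[int(pid)] = (int(nvcsw), int(nivcsw), comm.strip())
    return rows

def switch_deltas(before, after):
    """Per-command (voluntary, involuntary) switch deltas between snapshots."""
    totals = defaultdict(lambda: [0, 0])
    for pid, (nv, niv, comm) in after.items():
        if pid not in before:
            continue                                # started between samples
        nv0, niv0, _ = before[pid]
        totals[comm][0] += nv - nv0
        totals[comm][1] += niv - niv0
    return {comm: tuple(v) for comm, v in totals.items()}

# Made-up sample snapshots, NOT real output from any system.
SAMPLE_T0 = """PID NVCSW NIVCSW COMMAND
101 1000 10 oracle
102 2000 20 oracle
200  500  5 ocxal
"""
SAMPLE_T1 = """PID NVCSW NIVCSW COMMAND
101 1600 12 oracle
102 2900 21 oracle
200  900  5 ocxal
300   50  0 lgwr
"""

deltas = switch_deltas(parse_ps(SAMPLE_T0), parse_ps(SAMPLE_T1))
for comm, (vol, invol) in sorted(deltas.items()):
    print(f"{comm:<10} voluntary={vol:>6} involuntary={invol:>4}")
```

Dividing each delta by the sampling interval gives switches per second per command group, which is the breakdown asked for above.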

Cheers,
Hein.

Alexey Borchev
Regular Advisor

Re: round_robin_switch_rate question

Hi, Hein!
1) You are not using MTS are you? - No MTS.
Are those simple statements committed inserts?
- Mostly selects, but some inserts as well. OLTP type of database.
2) Apart from Oracle, there is the SQL client application ocxal - you'll see it in the log.
3) I've run the ps commands - see attached output. 3 sections - Oracle sys, Oracle user processes, SQL client.
4) What is a voluntary/involuntary switch? Where can I read about it?
5) I'll try to get stats of switches per 1000 SQL statements.
6) 1 timeslice = 10 ms ~ 1E7 CPU cycles. How many cycles does it take to switch tasks?

I mean: the purpose of our study is to see whether there are too many switches, and to try to conserve CPU cycles by reducing the switch rate.
I want to estimate how many cycles are wasted (= may be saved) by our tuning efforts.
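That estimate can be framed as a short Python sketch. The per-switch cost is the unknown here: the value below is a placeholder chosen only to show the calculation, not a measured Tru64 or Alpha figure, and the function name is made up. It also counts only the direct cost of the switch, ignoring indirect effects such as cache refills.

```python
# Fraction of CPU capacity consumed by context switches themselves.
# HYPOTHETICAL_COST is a placeholder, NOT a measured value.

HYPOTHETICAL_COST = 5_000   # cycles per switch (assumption for illustration)

def switch_overhead_fraction(switches_per_sec, cycles_per_switch,
                             cpus, clock_hz=1e9):
    """Cycles spent switching divided by total cycles available per second."""
    spent = switches_per_sec * cycles_per_switch
    available = cpus * clock_hz
    return spent / available

for label, rate, cpus in (("ES45", 4_000, 4), ("GS1280", 80_000, 8)):
    frac = switch_overhead_fraction(rate, HYPOTHETICAL_COST, cpus)
    print(f"{label}: {frac:.2%} of cycles at "
          f"{HYPOTHETICAL_COST} cycles/switch (placeholder)")
```

Plugging in a measured per-switch cost would turn this into an upper bound on what reducing the switch rate could save.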

P.S.
Probably it's not polite to use Mark's thread and we should start our own?
The fire follows schedule...
Mark Poeschl_2
Honored Contributor

Re: round_robin_switch_rate question

As Hein points out - the key in your case (and mine) is that the context switches are "voluntary". That means they occur because the execution thread decides it needs to wait for I/O to complete or for some communication from another process/thread. An involuntary context switch occurs when a thread's time quantum expires - i.e. it's been running for a solid xxx msec and the scheduler forces the process out and onto the run queue. The involuntary switches are what you'd be addressing by mucking with the rr switch rate kernel parameter.
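The distinction can be observed from inside a process on a Unix-like system that fills in the getrusage counters. The Python sketch below is a generic illustration, not something specific to Tru64: it compares a workload that blocks often with one that spins on the CPU. The helper names are made up, and the actual counts depend on the platform and on what else the machine is doing.

```python
import resource
import time

def switch_counts():
    """Return (voluntary, involuntary) context switches for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw

def measure(work):
    """Run work() and return the (voluntary, involuntary) deltas."""
    v0, i0 = switch_counts()
    work()
    v1, i1 = switch_counts()
    return v1 - v0, i1 - i0

def blocking_work():
    # Many tiny waits: each sleep gives up the CPU voluntarily.
    for _ in range(50):
        time.sleep(0.001)

def cpu_bound_work():
    # Spin for about 0.2 s: any switches here are forced by the scheduler.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass

print("blocking  (voluntary, involuntary):", measure(blocking_work))
print("cpu-bound (voluntary, involuntary):", measure(cpu_bound_work))
```

A workload like the first one, doing a little work and then waiting, is the pattern that a shorter or longer time quantum would not change.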

The high switch rates we're both seeing can only be addressed, I suspect, by changing the application(s). Instead of doing a tiny bit of work and then waiting for somebody else to do their tiny piece of work, the workflow of the application would have to be re-designed into bigger chunks. Oracle / database tuning could have some impact on this sort of thing, but I think the big changes would have to occur at even higher levels in the application stack.