
Charles McCary
Valued Contributor

Performance question

Group,

Here's the situation:

Hardware:
N-Class with 4 440 processors (originally)
FC-60 disk array

Scenario:
We have a vendor who is writing code for us to basically read in some data and write this data to an Oracle database (it's more complex than this, but that's it in a nutshell).

During the times when their program is running, the system shows normal performance data except for CPU, which is maxed out (disk, memory, and swap are all OK). Of course we purchased more horsepower in every resource than the benchmark required, but we're still looking at abnormal runtimes for their program.

They suggested that we purchase additional processors, so we did. We now have 8 440 processors, and the run time of their program has remained basically the same.
All performance data is still the same (CPU maxed out, all else OK).

One other note - their program is configured to use the available processors, so it takes advantage of them by spawning more processes.

I'm pretty sure this is a code issue, but I wanted to get the group's opinion on whether I've missed something or not.

thanks,

C



John Bolene
Honored Contributor

Re: Performance question

Sounds like bad code to me.

Doubling the horsepower should halve the time for execution since it sounds like they can keep all the processors busy.

More processes do not necessarily make an application run faster. It sounds like there is a synchronization problem and the processes are all trying to do the same task and running into each other.

Reading data and writing it should make for I/O bottlenecks, not CPU, unless they are computing prime numbers before writing.
It is always a good day when you are launching rockets! http://tripolioklahoma.org, Mostly Missiles http://mostlymissiles.com
Jim Turner
HPE Pro

Re: Performance question

Hi Charles,

So all eight CPUs are running at 100%? What does your run queue look like? Did your I/O rates increase in response to the extra CPUs? Fire up Glance and look at your Global Waits (B). Are you blocked on I/O? Sleep? Semaphore?
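If you want quick numbers on the run queue while the vendor load is running, something like this works (a sketch; the interval and count are arbitrary):

sar -q 5 5     # runq-sz shows how many runnable processes are queued
vmstat 5 5     # the "r" column is the run queue; "cs" under faults shows context switches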

Spawning more procs in response to more CPUs sounds like a rather crude way of going about multithreading. If all eight of your CPUs are genuinely maxed out, and you're not continually waiting on something like I/O, I'd say the developers have something spinning away in their proc(s) that needs to be fixed.

Cheers,
Jim
A. Clay Stephenson
Acclaimed Contributor

Re: Performance question

Hi Charles,

You are now in my area, and I think I can give you a technique to nail down the problem - it's the method I always use on this type of problem.

You need to start graphing some metric (e.g. insertions/s, updates/s, ...) against the quantity of data. It is usually necessary to plot the log of the metric vs. the log of the quantity of data - e.g. log insertions/s vs. log rows of master data.

The slope of this curve can be very revealing about the nature of the problem. For example, if the slope of the log plot is about 2, then you have an N-squared problem. I have sometimes seen performance degrade with the 4th power of the number of rows. Typically, problems like these arise from poor indexing and badly formed joins; many times a single index can fix the problem. Graphing the data tends to reveal the point at which no amount of hardware is going to fix the problem.
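For example, one way to get that slope (a hypothetical sketch - the file name and format are made up): time the load at a couple of data volumes and fit log(elapsed time) against log(rows):

# timings.txt holds two lines of "rows_loaded elapsed_seconds" from two test runs (illustrative)
awk 'NR==1 {n1=$1; t1=$2}
     NR==2 {n2=$1; t2=$2}
     END   {print "slope =", log(t2/t1)/log(n2/n1)}' timings.txt
# slope near 1 => roughly linear; near 2 => an N-squared problem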

Regards, Clay
If it ain't broke, I can fix that.
Charles McCary
Valued Contributor

Re: Performance question

Jim,

Here are some interesting stats from Glance - tell me what you think:


Under the Event column:

Event         %       Time
Pipe          5.4     87.98
Semaphore     3.5     60.41
Sleep        35.4    619.65
Stream       14.6    255.24
Terminal      0.3      5.07
Other        16.1    381.60

Under the Blocked On column:

Blocked On      %       Time    Procs
IO              0.8     13.77     2.7
Priority        2.3     39.84     7.8
System         19.5    340.43    66.5
Virtual Mem     0.0      0.22     0.0


All other rows under both columns were 0.

Looks like things are sleeping waiting on CPU time to me - what do you think?
tx,
C

Kevin Wright
Honored Contributor

Re: Performance question

Looks that way. Do you have sar running and collecting data? If not, you should start it up. Put this in root's crontab:
# collect sar data
0 * * * * /usr/lbin/sa/sa1
20,40 8-17 * * 1-5 /usr/lbin/sa/sa1

#reduce the sar data
5 18 * * * /usr/lbin/sa/sa2 -s 8:00 -e 18:01 -i 900 -A
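Once sa1 has been collecting for a while, you can pull the history back out, e.g. (a sketch - adjust the date suffix to the day you want to look at):

sar -u -f /var/adm/sa/sa`date +%d`    # CPU utilization history for today
sar -q -f /var/adm/sa/sa`date +%d`    # run-queue history for today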

Sridhar Bhaskarla
Honored Contributor

Re: Performance question

Charles,

Since you have Glance+ installed, I would suggest configuring workloads on the system. Check the file /var/opt/perf/parm. Create one application containing your application's executables and collect the data. There are some examples in the file itself that will direct you. You need to restart scopeux so it re-reads the parm file.
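For reference, an application block in the parm file looks roughly like this (a sketch - "vendor_load", "loader" and "dataconv" are made-up names, so substitute the vendor's real executables and check the syntax against the examples already in the file):

application = vendor_load
file = loader, dataconv

Then restart the collector so it re-reads parm (via MeasureWare's mwa script, or by stopping and starting scopeux):

mwa restart scope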

Once the data is collected, you can generate reports on various metrics

* APP_PRI_WAIT_PCT
* APP_DISK_SUBSYSTEM_WAIT_PCT
* APP_MEM_WAIT_PCT
* APP_SEM_WAIT_PCT
* APP_TERM_IO_WAIT_PCT
* APP_OTHER_IO_WAIT_PCT
* APP_NETWORK_SUBSYSTEM_WAIT_PCT
* APP_SLEEP_WAIT_PCT
* APP_IPC_SUBSYSTEM_WAIT_PCT

There are other interesting application metrics. You can see them in the /var/opt/perf/reptall file.

You will get a very good feel of what the application is doing.

-Sri
You may be disappointed if you fail, but you are doomed if you don't try
Wodisch
Honored Contributor

Re: Performance question

Hello Charles,

Have you checked whether your Oracle instance is actually using all those CPUs? There is an "init*.ora" parameter for the number of CPUs used by Oracle - and the default is 1!
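For example (a sketch, assuming you can connect as sysdba; cpu_count and parallel_max_servers are the usual parameters to check, and the exact names may differ by Oracle release):

# ask the instance how many CPUs it thinks it may use
sqlplus -s "/ as sysdba" <<'EOF'
show parameter cpu_count
show parameter parallel_max_servers
EOF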

The second thing to try could be reducing the kernel parameter "timeslice", which defaults to 10 (i.e. 10 x 10 ms per process). In your highly CPU-intensive environment you might gain some advantage by REDUCING it, say to 7 or 8. Batch-oriented jobs will then take a little longer, but the I/O-oriented jobs get a time slice more often...
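On HP-UX 11.x the current setting is a quick check away (a sketch - actually changing it means a kernel rebuild and a reboot):

kmtune -q timeslice       # show the current value (default 10)
# to change it: kmtune -s timeslice=8, then rebuild the kernel and reboot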

HTH,
Wodisch
Printaporn_1
Esteemed Contributor

Re: Performance question

Just want to share my experience - my application vendor has caused this kind of CPU problem for me before.

Event 1: They configured their apps to spawn parallel processes, and I found from Glance that the processes were waiting on something (I can't remember what) - it was a locking problem. When we switched to the option not to run in parallel it was much faster; let's say from many hours down to 5 minutes.

Event 2: They wrote a shell script that consumed lots of CPU just checking and comparing times. I changed that script to run from cron, and CPU usage was reduced by about 40%.
----------------
I don't think buying more CPUs is a good idea.
enjoy any little thing in my life
Charles McCary
Valued Contributor

Re: Performance question

Printaporn,

So you went from running multiple processes in parallel (this is what we're doing now) to running only one process? And this helped?

tx,

C
Stefan Farrelly
Honored Contributor

Re: Performance question


This is all very interesting.

I think the problem is indeed too many processes all trying to do locking at the same time. Remember, on any multi-CPU server HP-UX has to do all critical locking on only ONE CPU, i.e. the first. The kernel carries out locking in a single-threaded way - one process at a time, regardless of which CPU they're running on - so they all have to come back to a single CPU when they get around to doing some locking. Check your system call and context-switching values.
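A couple of quick ways to watch those numbers (a sketch; the intervals are arbitrary):

sar -c 5 5    # system call rates (scall/s)
sar -w 5 5    # process switching (pswch/s)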

So, in theory, knowing that we have a single-threaded part of the kernel running on a single CPU to handle all our critical locking, which is better: more and more CPUs, with more and more processes all trying to access locks through that single CPU,
OR
a single CPU as fast as possible, where everything runs much more sequentially through the single-threaded part of the kernel, so locking system call totals and context switching should be able to run higher?

I think the latter. We've already had an example here of an application which ran much faster on a 2-way 550 N-Class than on a 4x440 L-Class!

Im from Palmerston North, New Zealand, but somehow ended up in London...
John Bolene
Honored Contributor

Re: Performance question

Parallel processing only works if the separate processes can get some work done.

If they are all waiting on synchronization (writing to the same place in memory because they are all working on the same set of data), then the extra overhead for all these processes will eat up the system.

Since it is database work, they may be updating the same areas of disk and you would be waiting on I/O to complete, which does not seem to be the case.
It is always a good day when you are launching rockets! http://tripolioklahoma.org, Mostly Missiles http://mostlymissiles.com
Charles McCary
Valued Contributor

Re: Performance question

Stefan,

I guess I should provide a little more info:

When we were originally testing the vendor code (and we only had 4 processors), we did some tests to gauge run time. We ran the code with 20 parallel processes, then dropped that number by two and re-ran multiple times until we reached 4 parallel processes.

The fastest runtime was seen when running with 8 parallel processes.

After adding the 4 additional processors to the machine, the fastest time should logically be seen when running 16 parallel processes (if 8 was fastest with 4 processors - at least that's my thinking anyway).

Does this change your opinion?

tx,

Charlie
A. Clay Stephenson
Acclaimed Contributor

Re: Performance question

Hi Charles:

I'm still convinced this is poorly written code. Fortunately you should be able to get your software developer to compile everything with -p to enable profiling. You can then use prof to get statistics on which functions are being hammered and zero in on the bad code.
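For example, something along these lines (a sketch - the file and program names are made up, and the exact profiling flags depend on the compiler):

cc -p -o loader loader.c    # build with prof-style profiling enabled ("loader.c" is illustrative)
./loader                    # a run writes mon.out in the current directory
prof loader                 # report the time spent in each function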
If it ain't broke, I can fix that.
Stefan Farrelly
Honored Contributor

Re: Performance question

Hi Charlie,

So the optimum was 4 CPUs and 8 parallel processes (I guess approx 2 per CPU).

Then you upgraded to 8 CPUs. I would not expect this to make your server faster: the overhead of the kernel managing an additional 4 CPUs, plus the overhead of squeezing processes running on those 4 extra CPUs through the single-threaded locking part of the kernel, would in fact slow down your application.

Adding 4 more CPUs should allow more users onto the server, especially if you have more than one application running on it, since they won't necessarily compete for the same resources (hopefully :-) ). But in terms of straight performance I would not expect it to speed things up - if anything, it will slow them down marginally.

Instead of upgrading to 8x440s, if you replaced the existing 4x440s with 4x550s I would expect a 20-25% increase. I think this should be your preferred plan.

Im from Palmerston North, New Zealand, but somehow ended up in London...
Roger Baptiste
Honored Contributor
Solution

Re: Performance question

Hi,

I have faced similar issues with vendor programs which import/export data in a data mining environment. The question which needs to be addressed here is the objective of the "tuning" exercise. Do the vendor/users feel that the response time has to improve further? Or is it a question of pegging the CPU usage below the maximum of 100%?

What are the run-queue and pri-queue values? CPU utilization alone is not a good indicator of system/CPU performance. If the CPU queues (pri_queue is a better indicator than run_queue) are also exceedingly high (anything consistently above 3), then you have a CPU bottleneck. Check the history of these values through MeasureWare as follows:
-----------
Copy the /var/opt/perf/reptall file to /tmp/reptall.
Edit /tmp/reptall and enable GBL_PRI_QUEUE, GBL_RUN_QUEUE and the other CPU usage values.
Then run:
extract -xp -v -gp -r /tmp/reptall
------

The problem here is obviously related to the way the application is coded. If they are running multi-stream jobs, there is a chance that these jobs need to access a common file or resource, which can mean contention.

Since there is no disk or memory bottleneck here, the jobs are free to use the CPUs all the time!

The "data conversion" applications we use are CPU hogs by design. So adding CPUs is just like throwing another rock in the ocean - it may not necessarily help.

Tackle this from the application end. The MeasureWare stats will help you in presenting your case.

Best!
Raj
Take it easy.