Operating System - Tru64 Unix

Re: Performance problem on a GS1280

 
Florent Boucher
Occasional Advisor

Performance problem on a GS1280

Dear support,
I have some trouble with a GS1280 (32 CPUs, 64 GB RAM) that is heavily loaded and shows performance problems. This computer is used to run large parallel applications that use from 1 GB to 2 GB per process. I have observed that the system (Tru64 V5.1) often moves the applications from one processor to another, which is not good at all for performance. I would expect each process to be "nearly bound" to one processor during the whole execution time (or at least not to switch too often), in order to avoid the time spent moving data in cache and memory (am I clear?).
So, should I change something in the sysconfigtab file to avoid such behavior? The round_robin_switch_rate is 25 and sched-min-idle is not defined. I did a vmstat 1 and the output is in the attachment.
For instance, even when the computer is not fully loaded (10 CPUs free, say), I do not get the same elapsed time if I run the same parallel application twice. The difference can be on the order of 20-30%.
The parallel applications I use are very demanding on memory bandwidth but not too much on communication between processors.
If someone can help me, or at least tell me what to look at in order to solve the problem, that would be great.
Regards
Florent Boucher
21 REPLIES 21
Mark Poeschl_2
Honored Contributor

Re: Performance problem on a GS1280

From what I can see you need to look to your application first, Florent. If your round_robin_switch_rate is 25 and your application is really completely compute bound, I would expect to see 40 context switches per second. You are getting 1000 context switches per second. That is indicative of an application that is either very I/O bound or very heavily dependent on interprocess communication. I suspect the latter in your case, because the amount of "system" CPU time is fairly low, but I can't be sure of that. Some output from a tool like 'top' would be useful to see what individual processes seem to be doing.
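To put the two numbers above side by side, here is a minimal Python sketch. It takes the figures quoted in this post at face value (about 40 expected, about 1000 observed); the `tolerance` threshold and the function names are illustrative assumptions, not anything Tru64 provides.

```python
# Figures quoted in the post above; the tolerance is an arbitrary illustration.
EXPECTED_CS_PER_SEC = 40      # estimate given for a purely compute-bound job
OBSERVED_CS_PER_SEC = 1000    # rate read from the vmstat output


def context_switch_ratio(observed, expected):
    """Return how many times higher the observed rate is than the expected one."""
    if expected <= 0:
        raise ValueError("expected rate must be positive")
    return observed / expected


def looks_compute_bound(observed, expected, tolerance=2.0):
    """True when the observed rate is within `tolerance` times the expectation."""
    return context_switch_ratio(observed, expected) <= tolerance


ratio = context_switch_ratio(OBSERVED_CS_PER_SEC, EXPECTED_CS_PER_SEC)
print(f"observed/expected = {ratio:.0f}x")
print("compute bound by this rough test:",
      looks_compute_bound(OBSERVED_CS_PER_SEC, EXPECTED_CS_PER_SEC))
```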
Florent Boucher
Occasional Advisor

Re: Performance problem on a GS1280

Dear Mark,
at the moment there are 4 different applications running on the cluster, and all of them use MPI. I have attached the output from top and also from the ps command with specific options.
In the present case, the output from ps shows that CPU #10 is not used and CPU #12 is used twice. That just means that the system has switched the processes from one processor to another. However, from my point of view there is no reason to do this, and it is not at all efficient when the processes use a lot of memory. Is there a way to reduce this switching from one processor to another, in order to keep optimum cache and memory performance?
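The check described above (one CPU unused, another used twice) can be sketched in Python once the process-to-CPU pairs have been extracted from the ps output. The snapshot below is made up for illustration and the function name is hypothetical; the real ps column layout is not shown in this thread, so parsing is left out.

```python
from collections import Counter


def cpu_occupancy(assignments, n_cpus):
    """Return (idle CPUs, {cpu: count} for CPUs holding more than one process)."""
    counts = Counter(cpu for _pid, cpu in assignments)
    idle = [c for c in range(n_cpus) if counts[c] == 0]
    shared = {c: n for c, n in sorted(counts.items()) if n > 1}
    return idle, shared


# Hypothetical snapshot: one process per CPU on a 32-CPU machine, except that
# the process expected on CPU 10 is reported on CPU 12 instead.
snapshot = [(1000 + cpu, 12 if cpu == 10 else cpu) for cpu in range(32)]

idle, shared = cpu_occupancy(snapshot, n_cpus=32)
print("idle CPUs:", idle)
print("CPUs with more than one process:", shared)
```

Running this on several snapshots taken a few seconds apart would show whether the same CPU stays idle or whether the imbalance moves around.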
Alexey Borchev
Regular Advisor

Re: Performance problem on a GS1280

1) I would look at kernel parameters:
vm:
replicate_user_text
vm_bigpg_enabled - neat big pages feature, seems to fit your case.
(see man sys_attrs_vm)

generic:
sched_distance
(man sys_attrs_generic)

2) Try running sys_check and healthcheck - they can give some ideas.

3) There is a small tool by Hein van den Heuvel which can show how memory is allocated for a particular process.
See attachment.

4) Run xmesh utility.
The fire follows shedule...
Hein van den Heuvel
Honored Contributor

Re: Performance problem on a GS1280



My first step would be to investigate using 'runon -r'

This tells the system to run a command on a selected RAD and stops movement that way.

You can specify multiple RADs, and of course you would select ones that are adjacent.

Hein.
Florent Boucher
Occasional Advisor

Re: Performance problem on a GS1280

Concerning the runon command, how is it compatible with MPI programs? On our system there is no pset and only one RAD (maybe this is not good?). So it seems difficult to me to use the runon command.
Hein van den Heuvel
Honored Contributor

Re: Performance problem on a GS1280


If you have "a GS1280 (32CPU 64Go RAM)" as you indicate, then you will normally have 32 single-CPU RADs. It is possible to set up the system with 16 double-CPU RADs, but that is rarely done/justified.

When I was toying with an application needing about 3 CPUs per instance, I used a script to launch it, and the script parameters looked like:

database_start_prefix = runon -r 0 -r 2 -r 4
central_start_prefix = runon -r 6 -r 7
01_start_prefix = runon -r 1 -r 3 -r 5
02_start_prefix = runon -r 8 -r 10 -r 12
03_start_prefix = runon -r 9 -r 11 -r 13
04_start_prefix = runon -r 16 -r 18 -r 20
05_start_prefix = runon -r 17 -r 19 -r 21
06_start_prefix = runon -r 24 -r 26 -r 28
07_start_prefix = runon -r 25 -r 27 -r 29
08_start_prefix = runon -r 14 -r 15 -r 22
09_start_prefix = runon -r 23 -r 30 -r 31

If you double-check that, then you'll see near-adjacent CPUs being used.

Unlike PSETS, the runon -r is NOT exclusive.
So a single RAD can be assigned to multiple application chunks.
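As a rough illustration, the prefix list above can be generated and double-checked with a short Python sketch. The RAD groups are copied from the list; the helper names are hypothetical, and only the `runon -r` form shown in this post is used. Since runon -r is not exclusive, a repeated RAD would be a choice rather than an error, so the check only reports it.

```python
def runon_prefix(rads):
    """Build a 'runon -r N ...' prefix for a list of RAD numbers."""
    return "runon " + " ".join(f"-r {r}" for r in rads)


# RAD groups copied from the start-prefix list above.
GROUPS = {
    "database": [0, 2, 4],
    "central": [6, 7],
    "01": [1, 3, 5],
    "02": [8, 10, 12],
    "03": [9, 11, 13],
    "04": [16, 18, 20],
    "05": [17, 19, 21],
    "06": [24, 26, 28],
    "07": [25, 27, 29],
    "08": [14, 15, 22],
    "09": [23, 30, 31],
}


def check_partition(groups, n_rads):
    """Return (RADs used by no group, RADs used by more than one group)."""
    seen = [r for rads in groups.values() for r in rads]
    unused = sorted(set(range(n_rads)) - set(seen))
    repeated = sorted({r for r in seen if seen.count(r) > 1})
    return unused, repeated


unused, repeated = check_partition(GROUPS, n_rads=32)
for name, rads in GROUPS.items():
    print(f"{name}_start_prefix = {runon_prefix(rads)}")
print("unused RADs:", unused, "repeated RADs:", repeated)
```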

Unfortunately I know nothing about MPI, so I'll have to refrain from comment. If you cannot split the application into large chunks per system, there may be no hope. But if you have a choice between one 'solution' using a clump of 32 threads and a subdivision into 4 - 8 clumps of 8 - 4 CPUs, then this may lead to happiness.


Check out "vmstat -P", and of course 'man numa_intro'.


Hein.
Florent Boucher
Occasional Advisor

Re: Performance problem on a GS1280

Dear Alexey
I have used the small program you sent me.
At the moment I have 32 processes running on the computer. Using xmesh, I can see a very large transfer between CPU #10 and CPU #12. When I use the ps command with the options given in the attachment, CPU #10 is not shown as running (but xmesh shows that it is), and top shows 32 processes running with a load close to 100% for every process. Furthermore, CPU #12 seems to have two processes, which is quite strange. It seems to me that ps reports the CPU number as the one that has the most pages allocated for the process. I have used the program you sent me, and it is clear that two processes have most of their memory on the same CPU (#12). Do you know how this can happen? And why is the system not able to move the memory of one process to CPU #10, which is not used?
We use an LSF scheduling policy that suspends certain jobs when higher-priority jobs want to run and then restarts them when some CPUs are available. Could that be the problem?
Concerning large-page memory, can you give me more details about how to manage it?
Regards
Florent
Florent Boucher
Occasional Advisor

Re: Performance problem on a GS1280

Alexey,
I have seen another process that now has its memory shared across two processors (#21 and #23)! So xmesh is showing a lot of transfer between these two. Do you think this is expected behavior?
Hein van den Heuvel
Honored Contributor

Re: Performance problem on a GS1280

>> Using xmesh, I can see that I have a very large transfer between cpu #10 and cpu #12. When I use the ps command with the option given in the attachment, the CPU #10 is not seen as running

Could CPU 10 be the home of process 644579, which seems to have started the 7 castepexe_mpi.exe worker processes? If so, would it not be the source for copy-on-write ('cow') pages and so on?

>> It seems to me that ps is reporting the CPU number as the one that has the maximum page allocated for this process.

NO. ps reports whichever CPU the process is running on. But the Tru64 scheduler tries its utmost to keep the CPU and memory together. Your observations confirm that the scheduler/swapper is doing a good job!
Processes have a 'home rad' and a 'current rad'.
The system 'reluctantly' moves processes away from home. It is the idle thread on idle CPUs which pulls in / moves over processes if another RAD is seen as being overloaded.

Is the ps command not a case of the measuring influencing the measurement? When it runs on a CPU, nothing else runs on that CPU.

Looking back at your original vmstat, I really think your system is doing a fine job. There may be a few % more to gain here or there, but in general what you have shown looks pretty good.

Have you gotten a chance to experiment with runon? You could use that to take an 8-thread job and make sure it stays in a 4-hop zone, and such. These jobs run for a while, do they not? You could also use runon to force a child process to stay on a selected CPU after the fact.
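To make the "hop zone" idea concrete, here is a small Python sketch that counts hops between positions on a wrap-around 2D grid. Everything about the layout is an assumption made for illustration: the 8x4 grid size, the coordinates, and the helper names are hypothetical, and the mapping from RAD numbers to grid positions on a real GS1280 is not given in this thread.

```python
from itertools import combinations


def torus_hops(a, b, width, height):
    """Minimum hop count between grid positions a and b on a wrap-around 2D grid."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


def max_hops(positions, width, height):
    """Largest pairwise hop count within a group of positions."""
    return max(torus_hops(a, b, width, height)
               for a, b in combinations(positions, 2))


WIDTH, HEIGHT = 8, 4  # hypothetical grid for 32 CPUs

# An 8-process job placed in a compact 4x2 block...
block = [(x, y) for x in range(4) for y in range(2)]
# ...versus the same job scattered across the grid.
scattered = [(0, 0), (4, 2), (1, 3), (5, 1), (2, 0), (6, 2), (3, 3), (7, 1)]

print("compact block, worst-case hops:", max_hops(block, WIDTH, HEIGHT))
print("scattered, worst-case hops:", max_hops(scattered, WIDTH, HEIGHT))
```

The point is only that a compact placement bounds the worst-case distance between any two processes of a job, which is what choosing adjacent RADs for runon aims at.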

Like the vm/rad program?

Hein.