Re: Performance problem on a GS1280

Florent Boucher · ‎03-30-2005

Dear support,
I am having trouble with a GS1280 (32 CPUs, 64 GB RAM) that is heavily loaded and shows performance problems. This computer is used to run large parallel applications that use from 1 GB to 2 GB per process. I have observed that the system (Tru64 V5.1) often switches the applications from one processor to another, which is not good at all for performance. I would expect each process to be "nearly bound" to one processor during the whole execution time (or at least not to switch too often), in order to avoid the time spent moving data between cache and memory (am I clear?).
So, should I change something in the sysconfigtab file to avoid this behavior?
The round_robin_switch_rate is 25 and sched-min-idle is not defined. I ran vmstat 1
and the output is in the attachment.
For instance, even when the computer is not fully loaded (10 CPUs free, for example), I do not get the same elapsed time if I run the same parallel application twice. The difference can be on the order of 20-30%.
The parallel applications I use are very demanding on memory bandwidth but not much on communication between processors.
If someone can help me, or at least tell me what to look at in order to solve the problem, I would be grateful.
Regards
Florent Boucher

Mark Poeschl_2 · ‎03-30-2005

From what I can see you need to look to your application first, Florent. If your round_robin_switch_rate is 25 and your application is really completely compute bound, I would expect to see 40 context switches per second. You are getting 1000 context switches per second. That is indicative of an application that is either very I/O bound or very heavily dependent on interprocess communication. I suspect the latter in your case, because the amount of "system" CPU time is fairly low, but I can't be sure of that. Some output from a tool like 'top' would be useful to see what individual processes seem to be doing.

Florent Boucher · ‎03-30-2005

Dear Mark,
at the moment there are 4 different applications running on the cluster, and all of them use MPI. I have attached the output from top and also from the ps command with specific options.
In the present case, the ps output shows that CPU #10 is not used and CPU #12 is used twice. That just means that the system has switched processes from one processor to another. However, from my point of view there is no reason to do this, and it is not at all efficient when the processes use a lot of memory. Is there a way to reduce this switching from one processor to another, in order to keep optimum cache and memory performance?
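
A quick way to spot this kind of imbalance without reading the ps listing by eye is to count processes per CPU. The sketch below is only an illustration: it assumes the listing has been reduced to two whitespace-separated columns, processor number first and PID second, and the sample rows are invented (they are not taken from the attachment).

```python
from collections import Counter

def cpu_occupancy(ps_text, n_cpus):
    """Return (idle_cpus, shared_cpus) from lines of 'processor pid'."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (non-numeric first field).
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        counts[int(fields[0])] += 1
    idle = [cpu for cpu in range(n_cpus) if counts[cpu] == 0]
    shared = {cpu: n for cpu, n in counts.items() if n > 1}
    return idle, shared

# Invented sample: 4 CPUs, two processes on CPU 2, none on CPU 1.
SAMPLE = """PSR PID
0 1001
2 1002
2 1003
3 1004
"""

if __name__ == "__main__":
    idle, shared = cpu_occupancy(SAMPLE, 4)
    print("idle CPUs:", idle)
    print("CPUs with more than one process:", shared)
```

A single snapshot only shows where each process was at that instant, so it would need to be repeated over time before drawing conclusions.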

Alexey Borchev · ‎03-30-2005

1) I would look at kernel parameters:
vm:
replicate_user_text
vm_bigpg_enabled - the neat big pages feature, which seems to fit your case.
(see man sys_attrs_vm)

generic:
sched_distance
(man sys_attrs_generic)

2) Try running sys_check and healthcheck - they can give some ideas.

3) There is a small tool by Hein van den Heuvel which can show how memory is allocated for a particular process.
See attachment.

4) Run xmesh utility.

The fire follows schedule...

Hein van den Heuvel · ‎03-30-2005

My first step would be to investigate using 'runon -r'

This tells the system to run a command on a selected RAD (Resource Affinity Domain) and stops movement that way.

You can specify multiple rads, and of course you would select those to be adjacent.

Hein.

Florent Boucher · ‎03-30-2005

Concerning the runon command, how is it compatible with MPI programs? On our system there is no pset and only one RAD (maybe this is not good?). So it seems difficult to me to use the runon command.

Hein van den Heuvel · ‎03-30-2005

If you have "a GS1280 (32CPU 64Go RAM)" as you indicate, then you will normally have 32 single-CPU RADs. It is possible to set up the system with 16 double-CPU RADs, but that is rarely done/justified.

When I was toying with an application needing about 3 CPUs per instance, I used a script to launch it, and the script parameters looked like:

database_start_prefix = runon -r 0 -r 2 -r 4
central_start_prefix = runon -r 6 -r 7
01_start_prefix = runon -r 1 -r 3 -r 5
02_start_prefix = runon -r 8 -r 10 -r 12
03_start_prefix = runon -r 9 -r 11 -r 13
04_start_prefix = runon -r 16 -r 18 -r 20
05_start_prefix = runon -r 17 -r 19 -r 21
06_start_prefix = runon -r 24 -r 26 -r 28
07_start_prefix = runon -r 25 -r 27 -r 29
08_start_prefix = runon -r 14 -r 15 -r 22
09_start_prefix = runon -r 23 -r 30 -r 31

If you double-check that, then you'll see near-adjacent CPUs being used.

Unlike PSETS, the runon -r is NOT exclusive.
So a single RAD can be assigned to multiple application chunks.

Unfortunately I know nothing about MPI, so I'll have to refrain from commenting on that. If you cannot split the application into large chunks per system, there may be no hope. But if you have a choice between one 'solution' using a clump of 32 threads, and a subdivision into 4 - 8 clumps of 8 - 4 CPUs, then this may lead to happiness.

Check out "vmstat -P", and of course 'man numa_intro'.

Hein.
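
The double-check of the prefix list can also be done mechanically. The following Python sketch is purely illustrative: it parses prefix lines in the form shown above and reports RADs that no prefix uses, or that more than one prefix uses. The 32-RAD count assumes one CPU per RAD, as described above, and the function names are made up for this example.

```python
import re
from collections import Counter

# Prefix lines copied from the post above.
PREFIXES = """\
database_start_prefix = runon -r 0 -r 2 -r 4
central_start_prefix = runon -r 6 -r 7
01_start_prefix = runon -r 1 -r 3 -r 5
02_start_prefix = runon -r 8 -r 10 -r 12
03_start_prefix = runon -r 9 -r 11 -r 13
04_start_prefix = runon -r 16 -r 18 -r 20
05_start_prefix = runon -r 17 -r 19 -r 21
06_start_prefix = runon -r 24 -r 26 -r 28
07_start_prefix = runon -r 25 -r 27 -r 29
08_start_prefix = runon -r 14 -r 15 -r 22
09_start_prefix = runon -r 23 -r 30 -r 31
"""

def rad_usage(text):
    """Count how many times each RAD number follows a -r option."""
    usage = Counter()
    for line in text.splitlines():
        usage.update(int(n) for n in re.findall(r"-r\s+(\d+)", line))
    return usage

def check_layout(text, n_rads=32):
    """Return (unassigned RADs, RADs referenced more than once)."""
    usage = rad_usage(text)
    unassigned = [rad for rad in range(n_rads) if usage[rad] == 0]
    shared = sorted(rad for rad, count in usage.items() if count > 1)
    return unassigned, shared

if __name__ == "__main__":
    unassigned, shared = check_layout(PREFIXES)
    print("unassigned RADs:", unassigned)
    print("RADs used by more than one prefix:", shared)
```

For the eleven lines above both lists come back empty: each of the 32 RADs is used exactly once. Since runon -r is not exclusive, a shared RAD is not an error, only something worth knowing about.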

Florent Boucher · ‎03-30-2005

Dear Alexey
I have used the small program you sent to me.
At the moment I have 32 processes running on the computer. Using xmesh, I can see a very large transfer between CPU #10 and CPU #12. When I use the ps command with the options given in the attachment, CPU #10 is not shown as running (but xmesh shows that it is), and top shows 32 processes running with a load close to 100% for every process. Furthermore, CPU #12 seems to have two processes, which is quite strange. It seems to me that ps reports as the CPU number the one that holds the most pages allocated for the process. I have used the program you sent me, and it is clear that two processes have their maximum memory usage on the same CPU (#12). Do you know how this can happen? And why is the system not able to move the memory of one process to CPU #10, which is not used?
We use an LSF scheduling policy that suspends certain jobs when higher-priority jobs want to run and then restarts them when some CPUs are available. Could that be the problem?
Concerning large page memory, can you give me more details about how to manage it?
Regards
Florent

Florent Boucher · ‎03-30-2005

Alexey,
I have seen another process that now shares its memory between two processors (#21 and #23)! So xmesh is showing a lot of transfer between these two. Do you think this is expected behavior?

Hein van den Heuvel · ‎03-30-2005

>> Using xmesh, I can see that I have a very large transfert betwen cpu #10 and cpu #12. When I use the ps command with the option given in the attachement, the CPU #10 is not seen as running

Could CPU 10 be the home of process 644579, which seems to have started the 7 castepexe_mpi.exe worker processes? If so, would it not be the source for copy-on-write ('cow') pages and so on?

>> It seems to me that ps is reporting the CPU number as the one that has the maximum page allocated for this process.

NO. ps reports whichever CPU the process is running on. But the Tru64 scheduler tries its utmost to keep the CPU and memory together. Your observations confirm that the scheduler/swapper is doing a good job!
Processes have a 'home rad' and a 'current rad'.
The system 'reluctantly' moves processes away from home. It is the idle thread on idle CPUs which pulls in / moves over processes if another RAD is seen as being overloaded.

Is the ps command not a case of the measuring influencing the measurement? When it runs on a CPU, nothing else runs on that CPU.

Looking back at your original vmstat, I would really think your system is doing a fine job. There may be a few % more here or there, but in general what you have shown looks pretty good.

Have you gotten a chance to experiment with runon? You could use that to take an 8-thread job and make sure it stays in a 4-hop zone, and such. These jobs run for a while, do they not? You could also use runon to force a child process to stay on a selected CPU after the fact.

Like the vm/rad program?

Hein.

Alexey Borchev · ‎04-01-2005

Hi, Florent! You've made lots of wonderful observations!
>>Do you know how it can happen ? - No...

>>And why the system is not able to switch the memory of one process to the CPU #10 that is not used ? - As far as I know, Tru64 does its best to schedule a process close to its memory. But I have never heard of Tru64 relocating a process's memory to another RAD...

>>Concerning the large page memory, can you give me more details about the way to manage ?
1) #man sys_attrs_vm, read all around vm_bigpg_*
2) Run Kernel tuner, section vm, set vm_bigpg_enabled=1
and reboot.
Your application seems to be memory-intensive, i.e. a good candidate for the feature. Please tell us if you get performance benefits from big pages.
I've just enabled the feature, and got results (see attachment, it's Oracle dbwriter process).
But this will not resolve 'Foreign RAD' problem.

3) >>I have seen another process that now shares its memory between two processors (#21 and #23)! So xmesh is showing a lot of transfer between these two. Do you think this is expected behavior? - Yes, definitely.

4) if You want to pin process to memory, then either go for 'runon' or sched_distance (but sched_distance<=1 can harm performance).

5) Sorry, I am not an HP person, I am just selling lipsticks for Avon :-)

The fire follows schedule...

Han Pilmeyer · ‎04-02-2005

If you do decide to try VM:vm_bigpg_enabled=1, be sure to also set vm:vm_segmentation=0. We're working on an official message about that.

Florent Boucher · ‎04-04-2005

Dear Hein and Alexey,
I have not had time yet to test the vm_bigpg option. For this I have to reboot the system, and I should send a notification to the users. I think I will make this change in the middle of the week. In the meantime, I would like to come back to the difference between "home rad" and "current rad". On our system, jobs are often submitted that run for many hours (or days). So, with a scheduler policy that can suspend one job to start another, it seems possible that two (or even more) heavy jobs end up with the same "home rad". Am I right? Of course, Unix will try to give a different "current rad" to every very demanding process.
Is there a way to optimize how the "home rads" are distributed? One could imagine that, in my case, Unix could move the "home rad" of process 688877 to RAD #10 in order to avoid the large transfer between processors #12 and #10.
Concerning runon, it is impossible to use with MPI jobs, so I do not think I will keep this solution.
Regards
Florent

Han Pilmeyer · ‎04-05-2005

Oops. That should be vm:vm_bigpg_seg=0 (not vm:vm_segmentation=0) when using big pages (vm:vm_bigpg_enabled=1).

Joerg Schulenburg · ‎04-17-2005

I administer a GS1280 and also see performance problems. Some RADs spend nearly 99% in system time and 1% in user time if the free memory of that RAD is too low. I have also observed that 2 different jobs get memory from the same RAD. Maybe you have a similar problem, but not fully developed. Unfortunately I don't see the vmstat -R or the ps output mentioned in this thread.
Please have a quick look at
http://www.uni-magdeburg.de/urzs/marvel/vmbug3.html
to see what I am talking about.

Fighting for a better world with more penguins.

Joerg Schulenburg · ‎04-17-2005

I am also very interested in the tool mentioned for seeing where the memory of a process is located. I asked Google about it, but found nothing.

Fighting for a better world with more penguins.

Florent Boucher · ‎04-18-2005

Dear Joerg,
it seems to me that we have exactly the same problem. So far, no news at all from HP support. I have attached the first output from vmstat -R 5 and the information about memory allocation for the two processes that have the problem.
I hope somebody will give us a "good" answer to solve the problem.
Regards
Florent

Florent Boucher · ‎04-18-2005

Dear Joerg
The tool for analyzing VM allocation was provided by Alexey. You can find it at the beginning of the thread.
I have attached it again.
Regards
Florent

Joerg Schulenburg · ‎04-18-2005

Dear Florent,
I am happy that I am not alone. Thanks for the attachment. I had just overlooked the paperclip symbol on the replies.
I will try out the program together with my test program tomorrow on the empty machine. Today it is too late for long experiments.
Best regards,
Joerg.

Fighting for a better world with more penguins.

Joerg Schulenburg · ‎04-20-2005

Dear Florent, it is not easy to run successful tests. I did some bad things. First, I called date, vmstat and ps from the program using the system() call. As I recall, that is not very clever, because usually fork + exec is called, which means the big multi-GB memory process is (virtually) doubled for a short time.
I saw that date, ps, etc. took a long time instead of responding quickly. So the results of my tests are not optimal.
I am trying to give more details on the page mentioned above and in another forum thread (subject: slow down (swapping) on a GS1280 with lot of free memory). As you can understand, it is not my job to use our expensive machine as a test machine and reboot it every day.
So far I have seen the system become very slow when the free pages of one RAD dropped below 3000, down to 10, which usually happened at the 16th GB (with and without swap).
Today swap was growing very slowly, and speed was not as bad as some days ago, which could be a result of the other users (I did not reboot before the new tests).

Fighting for a better world with more penguins.

Hein van den Heuvel · ‎04-20-2005

Joerg, I see that Florent helped you find the VM tool. I made it mostly to show the bigpage effect, but it will also nicely show vm per rad in general.

The program was originally posted in:
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=644238

I found it back using google: "+tru64 +rad +gs1280 +site:itrc.hp.com"

imho this currently is (unfortunately) the best way to search the ITRC forum:
Google: + + + +site:itrc.hp.com

Regards,
Hein.

Joerg Schulenburg · ‎05-11-2005

I found time to look at your outputs and compare them with my experiences.
If you read the comments on my thread, please keep in mind that I don't use big pages, which probably makes things more complicated.
If you look at your vmstat -R output, you see that RAD0.free=491, RAD5.free=500 and RAD6.free=3299. At least the values for RAD0 and RAD5 are too low and trigger paging (high acti, and pin/pout greater than 0). That causes the performance loss.
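
A minimal Python sketch of that check. The per-RAD numbers are the three values quoted above; the 3000-page threshold is only an assumption, taken from the slowdown point reported earlier in this thread on a different machine, and the right value for another system may differ.

```python
# Free pages per RAD, as quoted from the vmstat -R output above.
FREE_PAGES = {0: 491, 5: 500, 6: 3299}

# Assumed low-water mark (pages); not an official Tru64 tunable.
LOW_WATER = 3000

def low_rads(free_pages, threshold=LOW_WATER):
    """Return the RAD numbers whose free page count is below threshold."""
    return sorted(rad for rad, free in free_pages.items() if free < threshold)

if __name__ == "__main__":
    for rad in low_rads(FREE_PAGES):
        print(f"RAD{rad}: only {FREE_PAGES[rad]} free pages")
```

With the quoted values this flags RAD0 and RAD5 but not RAD6, which matches the reading above.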
I guess that the following stupid things happened:
First, RAD0 has some memory in use for all the unnecessary daemons, maybe running Java, etc.
If you start MPI, the first job will probably be started on RAD0, consuming the rest of the memory and stealing memory from its neighbor, which gives you only a small performance loss. Maybe this process then waits for other MPI threads. The next (or a later) MPI thread is also started on RAD0, because RAD0 is idling and the system does not know that the new thread also needs memory, which is not available locally. No problem, it takes it from the neighbors again. And so on. Now the management of the stolen memory also needs memory (wired memory), and RAD0.free gets low enough to trigger paging. The same happens on other RADs.
So you have a high RSS (try ps -o psr,pid,pcpu,vsz,rss,minflt,majflt,cmd).
If the system started stealing pages from other RADs before memory got so low, everything would be fine, though not perfect.
I think you get optimum speed if you are able to tell each MPI thread where to start. With a good MPI implementation I would expect the system/library to do that for you. In that case each process would consume local memory and never have to steal pages from neighbors. But if page stealing worked well, you would only have the data flow over the processor links, which is also very fast.
Do you use an MPI library delivered by HP?
Try to check where each job is started (CPU + RAD), and if some jobs are started on the same processor, ask HP what the hell the designer of the HP MPI implementation was thinking when he adapted it to HP's NUMA.
I would not be surprised if the MPI package has no adaptations for the GS1280 NUMA technology *sigh*.
Probably they think that the load manager will do the job of the MPI manager.
But you could probably use a trick to outwit that balancing. Add a CPU-consuming function to each MPI thread. For example, calculate pi for 10 seconds, giving the load balancer enough time to put each job on another RAD, and only after that start to consume memory.
Maybe that fails too, but it's an easy test.
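
The warm-up trick could look roughly like the sketch below. It is written in Python only to show the order of operations (spin first, allocate afterwards); a real MPI code would do the same in its own language. The function names and the allocate callback are invented for this illustration, and whether the scheduler actually spreads the processes during the spin is untested.

```python
import time

def burn_cpu(seconds):
    """Busy-loop summing the Leibniz series for pi until the deadline.

    Returns the current estimate of pi and the number of terms summed.
    """
    deadline = time.monotonic() + seconds
    total, k = 0.0, 0
    while time.monotonic() < deadline:
        total += (-1.0) ** k / (2 * k + 1)
        k += 1
    return 4.0 * total, k

def run_worker(warmup_seconds, allocate):
    """Spin first so the process can be rebalanced, then allocate.

    allocate is a caller-supplied function that grabs the large
    buffers; it is only called after the warm-up has finished.
    """
    burn_cpu(warmup_seconds)
    return allocate()
```

Each worker would call something like run_worker(10.0, allocate_buffers) at startup, where allocate_buffers stands for whatever routine the application uses to obtain its memory.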

Fighting for a better world with more penguins.
