slow down (swapping) on a GS1280 with lot of free memory

Joerg Schulenburg
Frequent Advisor


We have a GS1280 with 128 GB of memory, 36 GB of swap in lazy mode, and 32 CPUs, and we run scientific calculations under Tru64-5.1B-PK4.
I frequently observe the machine swapping while a lot of memory is free (up to 80 GB).
The full story is at http://www.uni-magdeburg.de/urzs/marvel/vmbug3.html
What I learned from support is that every RAD has its own memory scheduler. I have 4 GB of memory per RAD. If I use more than 4 GB, memory is stolen from the neighbouring RADs, but the free memory of the mother RAD can keep going down and cause swapping while a lot of free memory remains on the other RADs.
Stealing also fails at the 16th GB, and the machine then spends 99% of its CPU time in system mode.
To me everything looks like a kernel bug, but support tells me it isn't.
I hope I can get some help here.
Fighting for a better world with more penguins.
12 REPLIES
Alexey Borchev
Regular Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

1) A crazy idea, but maybe a workaround: use shared memory.
I mean, request the memory as shared memory.
Every process will still have its own memory, but declared as shared, and each process is the only one attached to its segment.
(In fact, no memory sharing occurs.)

I've got a 32 GB system with Oracle9 and a > 20 GB SGA. No problem like the one you've mentioned.

But Tru64 tends to allocate shared memory evenly across the whole system => performance will suffer.

2) For the case where memory use is between 4 and 16 GB: maybe tuning the kernel parameters that control when swapping starts will remedy the swap start.
The fire follows shedule...
Aaron Biver_2
Frequent Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

Can you tell me what you observed with the cpus_per_rad and vm_overflow options? What settings did you try and with what effects, improvement or otherwise?

It sounds like vm_overflow improved your situation, except for the problem you are seeing with the last GB of memory.

Hein van den Heuvel
Honored Contributor

Re: slow down (swapping) on a GS1280 with lot of free memory


I'm sorry to hear you've been struggling for so long. My first impression is that you were unfortunate enough to run into a specific condition where powerful, and generally helpful and good, memory management algorithms start to work against you. A classic 'pathological' problem. Per some online dictionary (http://dictionary.reference.com/search?q=pathological):
"1. [scientific computation] Used of a data set that is grossly
atypical of normal expected input, especially one that exposes
a weakness or bug in whatever algorithm one is using. An
algorithm that can be broken by pathological inputs may still
be useful if such inputs are very unlikely to occur in
practice."

It also seems to me, judging by your timeline, that the support efforts are actually speeding up and getting close to a resolution. So this does not seem like the best time (duplication of efforts) to bring up such a complex problem.

I'll make an effort (no guarantee, but I'll walk the hallowed hallways of engineering in Nashua, and/or talk to my running buddies :-) to get some useful comments here, to help other users who may be close to having similar problems, or at least to identify the trigger points clearly.

Regards,
Hein.
Joerg Schulenburg
Frequent Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

@Alexey_1: I'll try it if I find time. HP support gave me a workaround for the scientific programs using nmadvise, which also stripes memory across all RADs and reduces performance.
shmget would be better because it's portable.

@Alexey_2: But which options? HP didn't give me any hints, and I cannot try all possible variants. It's not a test machine; it was expensive and our scientists need it.

@Aaron: I am not really sure, because patches were installed between tests and I cannot test under clean conditions (rebooting after every test). Each test is also very time consuming. You will find some statements and tests on the page linked above.
It looks like vm_overflow reduces swap use, but the process becomes slower than without vm_overflow because the kernel needs more CPU.
For the cpus_in_rad option, which can only be activated by a reboot, I only have older tests from when I knew less about the problem than I do today. Reviewing that data today, I think the trigger value is not shifted.
I don't know enough about the VM management of Tru64, and I don't know what vmstat is telling me. I always have to guess what is happening, ask support, and get only partial answers. It's very frustrating.

@Hein:
Having found the simple malloc + memset program, I know this is not pathological.
I am sure it would run successfully on any non-HP NUMA machine.
There is no abnormal input and no weakness or bug in the algorithm. Instead, the weakness is in the system's VM algorithm for the RADs. And the bug occurs when I use less than 13% of the total memory.

I also don't have the impression that support is speeding up. They always tell me the system is OK and that I have to change something.
Only because I don't give up demanding a fix for that bug do I get some reactions.
I know that fixing a bug in a Unix kernel is not simple, but leaving the paying customer standing in the rain is not gentlemanlike. I expected some happiness, "hey, we found a bug and can fix it to make our product better for our customers", but nothing like this. Instead I got something like, "Hey, what do you want? Our system is ideal for Oracle users. Adapt your crazy software to our perfect system."

It would be great if you could help me understand the problem better and identify the trigger point. I will do my best.
Fighting for a better world with more penguins.
Hein van den Heuvel
Honored Contributor

Re: slow down (swapping) on a GS1280 with lot of free memory


>> I also don't have the impression that support speeds up.

Well, I guess the forum topic worked. When I saw the engineer I had in mind (quite literally in the hallway!) he already knew what I was about to ask him.

And let me also reassure you that this guy appreciates customers who put in their time and effort to help understand why a system is not behaving the way they would like. He will not brush it away as a 'bad test' but will be thankful for an investigation starting point.

Please realize that at this point in time this will be a best-effort investigation with no promise of a fix, tool, timeline or whatever. Engineering will surely have to review the full case and its support escalation process to determine the appropriate support priority and time.

Hope this helped some,
Hein.

Florian Heigl (new acc)
Honored Contributor

Re: slow down (swapping) on a GS1280 with lot of free memory

I know that on HP-UX systems with OLA/R some special features were added for NUMA usage on Itanium Superdomes.
A tech can specify which rows of memory are assigned to which cell (I think that's your RAD).
Maybe there's an analogous way on your Alpha?

Also, I'd check whether the swap allocation rate per second decreases when using eager swap. It depends on how fast the application allocates memory, but I'd think the performance problems start at one specific moment, when the memory management decides it's out of memory NOW, and with eager mode you might be able to ease that effect.

Note that we have only a few systems with more than 32 GB of RAM and none of them use NUMA (yet), so this is mostly guesswork, but maybe something helps. :/
yesterday I stood at the edge. Today I'm one step ahead.
Joerg Schulenburg
Frequent Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

@Florian: I am sorry, but switching to eager mode is of no help, because we have less swap than memory, which makes sense for 128 GB of memory! Our application won't need swap, and we won't waste disks just for some tests.
We need good explanations, some knowledge about VM on RADs, and the right test programs to find out what happens.

@experts: I found out that when the process accesses its pages, the number of free pages on the local RAD goes down until 2*vm_page_prewrite_target (if large enough! about 16K..32K) is reached. Then pages are stolen from the neighbor RADs, but this increases the number of wired pages on the local RAD by about 1K pages per 4 GB of stolen memory, which lowers the free pages of the local RAD further.
When free pages touch the 6K limit, active pages become inactive and paging/swapping starts. Unfortunately I don't know how to bring swapped or inactive pages back to active, and I cannot reboot just to test everything. At least Linux shows that bringing swap back into memory (swapoff) should be no problem without rebooting.
Further, I guess that inactive pages can be paged out, which does not necessarily lead to a swap-disk access if the UBC can hold the page. But extending the UBC will probably also lower the number of free pages and speed up the paging (probably with deadlock).
Because I cannot reboot all the time, I have to do most experiments on the living object, which makes conclusions more difficult.
So it would be wonderful to have a tool which brings swapped-out pages, or if possible inactive pages, back when enough memory is free, to avoid rebooting. Do such tools exist?
Could you write one (swapoff at minimum)?

But now another strange thing:
I also tested on a GS160 (OSF1 V5.1 732).
Seeing a lot of paging, I ran
ps -o "psr,pid,time,systime,pcpu,vsz,rss,minflt,majflt,cmd" -a
three times (some seconds to minutes apart)
to find out who is paging/swapping and why.
But I was surprised to see that minflt and majflt can decrease. What does that mean? Is it another bug, or do I just have to update the ps utility?


PSR    PID        TIME     SYSTEM  %CPU   VSZ  RSS  MINFLT MAJFLT CMD
 15 217821 35-10:23:42 1-00:52:45 282.9  389M 276M  973822  63445 ./fe30_abel_
  7 302145  8-19:18:25   05:57:15 611.8  270M 208M 1531818  59365 ./x1_fe30_c
 11 315623     0:00.17    0:00.15   0.0 6.14M 352K     -99   1104 vmubc-real -t

 14 217821 35-16:10:05 1-01:04:34 373.4  389M 276M 1003962  95227 ./fe30_abel
  5 302145  9-03:02:09   06:12:44 386.0  270M 208M 1529546  62449 ./x1_fe30_c
 11 315623     0:00.31    0:00.28   0.0 6.14M 352K       2   1816 vmubc-real -t

 14 217821 35-16:18:03 1-01:04:56 352.1  389M 276M 1003916  95322 ./fe30_abel
    ^ minflt+majflt lowered?
  7 302145  9-03:15:02   06:13:09 250.5  270M 208M 1529498  62515 ./x1_fe30_c
  8 315623     0:00.31    0:00.28   0.0 6.14M 352K       6   1831 vmubc-real -t

 15 217821 35-17:07:16 1-01:07:05 378.5  389M 276M 1003590  95873 ./fe30_abel
    ^ minflt lowered?
  4 302145  9-04:32:19   06:14:33 372.1  270M 208M 1556827  65560 ./x1_fe30_c
 11 315623     0:00.34    0:00.30   0.0 6.14M 352K      22   1891 vmubc-real -t

By the way, can someone explain to me what exactly vm_overflow does?
Fighting for a better world with more penguins.
Hein van den Heuvel
Honored Contributor

Re: slow down (swapping) on a GS1280 with lot of free memory

Joerg,
Thanks for the update. I checked with engineering and I am convinced that the official case for this, which is active in addition to this topic here, gets the appropriate attention. It just takes time.

>> @Florian: I am sorry, but switching to eager mode is of no help, because we have less swap than memory, which makes sense for 128 GB of memory! Our application won't need swap, and we won't waste disks just for some tests.

I beg to differ, but my information may be incorrect or too old. When I discussed 'memory overflow' in a single RAD, and process movement (runon -P), it was explained to me a few years ago that the swapping mechanism is used to move pages over to a new current RAD. They are not copied, if I recall, but paged out to the swap file and paged back into the right RAD when used again.

Normally I would defend your non-eager, lazy swap approach. I too find it silly to be told by some vendor to allocate 3 times my physical memory as swap space because that seemed reasonable 10+ years ago. I often know how to configure my system such that I will not need more physical memory than I have, so why have excessive unused swap space?! (I still like to have a good chunk there, a couple of GB, so that the system has a modest chance of dealing with an unexpected demand, and slows down before giving up.)
In this case, considering the problems you are trying to explain, I would not 'take the risk': I would put the swap space there and maybe even switch to eager mode.
No real knowledge here, just gut feel.

Hope this helps a little,
Regards,
Hein.
Joerg Schulenburg
Frequent Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

@Hein: I don't think that a RAD memory overflow causes a page-out to the swap file and a page-in to the new RAD. That would be slow. Consider that a swap disk can write/read about 71 MB/s, which is 14 s per GB. Writing and reading back 1 GB would take more than 28 s. I see 2 s for writing 1 GB to another RAD, which is little more than writing to local memory (thanks to Alpha technology).
Also, majflt is 0 and minflt is the same for local and remote memory (131K pages of 8 KB each). OK, you can argue that it's not really written to swap because of the UBC, but I also see no significant increase in UBC hits or misses. I am not 100% sure, because the machine is active at the moment, but I think transferring 1 GB via the UBC would be easy to see. It looks like your information is incorrect.
By the way, when I did the test some minutes ago, I switched off vm_overflow. The test again triggered swapping of another process, which did not stop after I set vm_overflow back to 1. RAD17.free was low and did not rise back to a minimum value. Because I saw actu=140K, I set ubc_maxpercent to 1, and actu went down until free reached its minimum value. After that, swapping stopped, and I could set ubc_maxpercent back to 50 without swap starting to grow again.
The documentation tells me that the UBC should give up its pages if memory is needed, but that's not true.

Let me also comment on eager mode. If I am not wrong, swap is only reserved, not used, when an application is started. So it's just a number within the kernel which grows. Nothing is swapped. In lazy mode nothing is swapped either, but no number is increased. So the differences between lazy and eager mode are only virtual, not real.
Swapping of some pages occurs only if memory runs out (in both cases).
In lazy mode I can see from the swap space that swapping starts; in eager mode I don't see it so easily, so I have less information. Also, for eager mode I need as much swap as memory; having less swap makes part of the memory useless for applications.
As I see it, eager mode should be (a bit) slower, because the kernel has to increase the used (reserved) swap space number.
Do you have in mind what a good SCSI disk costs? Paying for disks of 3 times the memory just for a good feeling?
I don't understand why the default mode is eager. I think it's just because the programmer of the lazy feature does not believe in its capabilities.
Also, I have to reboot to switch to lazy.
Why does HP always want me to reboot the machine?
Hey, we live in a UNIX world, not Windows.
Reboots and crashes are undesirable.
That's why I mentioned swapoff. Do you know swapoff from Linux? Switch off swap, switch to the damn eager mode, switch swap on. But I could never swapoff again, because in eager mode swap is reserved and will never be freed.
OK, I see it does not work, because it contradicts the 10-year-old idea of the eager concept.
Why don't I have such problems with other Unixes? So I won't rearrange the filesystem and reboot again just for a feeling. Give me some hard arguments, and I will try!
Fighting for a better world with more penguins.
Peter Quodling
Trusted Contributor

Re: slow down (swapping) on a GS1280 with lot of free memory

Joerg,


The way I read your responses, you don't appear to be willing to follow the instructions of the support engineer working on your problem. While forums like this may provide some additional knowledge, that person is the absolute expert in the area, and will be asking you things that are relevant to the problem. Questioning or challenging his/her responses in a forum like this (or on a web page) doesn't add to the solution.

I am also intrigued by your comment about not wanting to devote a couple of disks to the problem (May 11 18:37). If I had a GS1280 that wasn't functioning as expected, I would be trying each and every suggestion of the support people, rather than arguing the esoterics of NUMA. (I'd do it for an ES47 that was playing up...) The relative cost of interim use of a couple of disks versus the desired functionality of a GS1280 is chalk and cheese to me.

I am sure that the support person working on this has had to move heaven and earth to get a GS1280 to replicate your problem - they don't grow on trees, even inside HP. All I have seen in your posting is a small sample program that creates a symptom. More information about the configuration and the real client application that is causing this problem (not just an example piece of code) would do more to get a real solution to the problem.

I also noticed that you are assigning 0 and 1 points for the detailed responses that you are getting - you may care to consult the tips for the ITRC forum. While your problem may not be resolved, the likes of Hein, Alexey and Florian are going well out of their way to assist, and a 0/1 point value is really a slap in the face for their efforts and advice (including Hein tracking down people for status). If you are looking for help here, show a touch more appreciation for the efforts.
The Tips are at http://forums1.itrc.hp.com/service/forums/helptips.do?#33


Peter.
Leave the Money on the Fridge.
Joerg Schulenburg
Frequent Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

@Peter: I am willing to follow instructions if they make sense. It is easy to answer my questions with "just try this, it may help", but trying costs money, and it does not solve the problem.
Our machine cost about 1 million dollars, plus some thousands a year for support. After 6 years we can throw the machine away.
So every day costs us about $406 for the hardware alone, as a linear approximation.
Every job runs for about 30 days on that machine. Statistically, a reboot destroys 15 days of scientific work, which is about $6090 per crash or reboot.
This is only an approximation. I am paid to administer that machine so that the number of reboots and crashes is minimized.
So I have to weigh each reboot; that's my job.
If you read my story, you will see that the cause of the trouble is a bug in the kernel, or a bad implementation, or whatever you call it. So I think it's HP's turn to help.
My impression was that either support was unable to follow my conclusions, or they followed them and were unwilling to ask engineering (that has hopefully changed now).
So I had to spend much of my time learning a lot about paging and swapping to have enough arguments to make support change their mind (successfully?).
I asked a lot of questions and got only a few answers back, with no details that would help to understand Tru64's VM and NUMA.
I am not what you would probably call an expert, but I am able to think and to draw conclusions. And until now none of the experts has been able to show me that I am on the wrong track.

>> I am sure that the support person working on this has had to move heaven and earth to get a GS1280 to replicate your problem

Did they? The information I got by email was: "Our engineering did some tests with your reproducer on our test system and found some problems. The problems are solved by changes to the kernel. Further tests were successful. ..." (translated from German to English).
There was no information about which system they used or which problems arose. Did they reproduce my results, or did they get different ones?
Weeks (or months?) ago I offered support a login on our machine to see what happens, but nobody asked for it! Support does not seem interested.

The patch I got seems to improve the situation on the GS1280, but I still do not know whether that is because the problem is solved or because they found a workaround which triggers other problems.

>> arguing the esoterics of NUMA

If support tells you that a (non-NUMA-aware) application can only use 1/32 of the overall memory on an HP NUMA machine without swapping, and that there is no need for support, you have to argue!
That statement made me very angry, and I now understand why people say that HP does not play a role in the HPC (High Performance Computing) world.
By the way, the person who made that stupid statement to me was also said to be an expert *sigh*.

>>noticed that you are assigning 0 and 1 points for the detailed responses

Sorry that I am so severe (right word?), but the answers did not help me. Other users who have the same problem and search this forum for it can save time (and reboots) by ignoring 0-point answers, regardless of whether the answers came from experts or not.
Be sure that I value answers which really help. I also put effort into the explanations and suggestions I make here, in the hope that they will be read and finally solve the problem.

Trying to make this world better ...
Fighting for a better world with more penguins.
Joerg Schulenburg
Frequent Advisor

Re: slow down (swapping) on a GS1280 with lot of free memory

I have some news on the web page mentioned above. It seems that the UBC plays an important role in the swapping, because borrowed UBC pages cannot be stolen by other RADs.
Comments are welcome.
Fighting for a better world with more penguins.