Slowdown (swapping) on a GS1280 with a lot of free memory

Joerg Schulenburg · 04-18-2005

We have a GS1280 with 128 GB of memory, 36 GB of swap in lazy mode, and 32 CPUs, and we run scientific calculations under Tru64-5.1B-PK4.
I frequently observe the machine swapping although a lot of memory is free (up to 80 GB).
The full story is at http://www.uni-magdeburg.de/urzs/marvel/vmbug3.html.
What I learned from support is that every RAD has its own memory scheduler. I have 4 GB of memory per RAD. If a process uses more than 4 GB, memory is stolen from the neighbouring RADs, but the free memory of the home RAD can keep going down and cause swapping even though a lot of free memory remains on the other RADs.
Stealing also fails at the 16th GB, and the machine then spends 99% of its CPU time in system mode.
All of this looks like a kernel bug to me, but support tells me it isn't.
I hope to get some help here.
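
As a rough check of the figures in this post, here is a minimal Python sketch of the arithmetic. It assumes one RAD per CPU with memory spread evenly, which is consistent with the 4 GB per RAD stated above but is not confirmed in the thread; the function names are made up for illustration.

```python
# Figures quoted in the post; "one RAD per CPU" is an assumption.
TOTAL_MEMORY_GB = 128
CPU_COUNT = 32
STEAL_FAILURE_GB = 16   # stealing reportedly fails at the 16th GB
PEAK_FREE_GB = 80       # free memory observed while the machine was swapping


def memory_per_rad(total_gb, rad_count):
    """Memory local to each RAD if it is spread evenly across RADs."""
    return total_gb / rad_count


def rads_spanned(process_gb, per_rad_gb):
    """Number of RADs a process of this size has to draw memory from."""
    return -(-process_gb // per_rad_gb)  # ceiling division


per_rad = memory_per_rad(TOTAL_MEMORY_GB, CPU_COUNT)
print(f"memory per RAD: {per_rad:g} GB")
print(f"RADs spanned at the failure point: {rads_spanned(STEAL_FAILURE_GB, per_rad):g}")
print(f"share of memory in use at the failure point: {STEAL_FAILURE_GB / TOTAL_MEMORY_GB:.1%}")
print(f"share of memory free while swapping: {PEAK_FREE_GB / TOTAL_MEMORY_GB:.1%}")
```

Under these assumptions a 16 GB process spans the memory of four RADs and uses 12.5% of the machine's memory.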

Fighting for a better world with more penguins.

Alexey Borchev · 04-19-2005

1) A crazy idea, but maybe a workaround:
use shared memory.
I mean, request the memory as shared memory.
Every process will still have its own memory, but declared as shared, with each process being the only one attached to its segment.
(In effect, no memory sharing takes place.)

I have a 32 GB system running Oracle9 with an SGA of more than 20 GB, and no problem like the one you mention.

But Tru64 tends to spread shared memory evenly across the whole system, so performance will suffer.

2) For the case where memory use is between 4 and 16 GB: maybe tuning the kernel parameters that control when swapping starts will remedy the early swapping.

The fire follows shedule...

Aaron Biver_2 · 04-19-2005

Can you tell me what you observed with the cpus_per_rad and vm_overflow options? What settings did you try, and with what effects, improvement or otherwise?

It sounds like vm_overflow improved your situation, except for the problem you are seeing with the last GB of memory.

Hein van den Heuvel · 04-19-2005

I'm sorry to hear you've been struggling for so long. My first impression is that you were unfortunate enough to run into a specific condition where powerful, and generally helpful, memory management algorithms start to work against you. A classic 'pathological' problem. Per an online dictionary (http://dictionary.reference.com/search?q=pathological):
"1. [scientific computation] Used of a data set that is grossly
atypical of normal expected input, especially one that exposes
a weakness or bug in whatever algorithm one is using. An
algorithm that can be broken by pathological inputs may still
be useful if such inputs are very unlikely to occur in
practice."

It also seems to me, judging by your timeline, that the support efforts are actually speeding up and getting close to a resolution. So this does not seem like the best time to bring up such a complex problem here (duplication of effort).

I'll make an effort (no guarantee, but I'll walk the hallowed hallways of engineering in Nashua, and/or talk to my running buddies :-) to get some useful comment here, to help other users who may be close to having similar problems, or at least to identify the trigger points clearly.

Regards,
Hein.

Joerg Schulenburg · 04-20-2005

@Alexey_1: I'll try it if I find the time. HP support gave me a workaround for the scientific programs using nmadvise, which also stripes memory across all RADs and so reduces performance.
shmget would be better, because it's portable.

@Alexey_2: But which options? HP didn't give me any hints, and I cannot try all possible variants. It's not a test machine; it was expensive and our scientists need it.

@Aaron: I am not really sure, because patches were installed between tests and I cannot test under clean conditions (rebooting after every test). Each test is also very time-consuming. You will find some statements and tests on the page linked above.
It looks like vm_overflow reduces swap use, but the process becomes slower than without vm_overflow because the kernel needs more CPU.
For the cpus_in_rad option, which can only be activated by a reboot, I only have older tests from a time when I knew less about the problem than I do today. Reviewing that data now, I think the trigger value is not shifted.
I don't know enough about the VM management of Tru64, and I don't know what vmstat is telling me. I always have to guess what is happening, ask support, and get only partial answers. It's very frustrating.

@Hein:
Since finding the simple malloc + memset program, I know this is not pathological.
I am sure it would run successfully on every other non-HP NUMA machine.
There is no abnormal input and no weakness or bug in the algorithm. Instead, the weakness is in the system's VM algorithm for the RADs. And the bug occurs when I use less than 13% of the total memory.
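
The malloc + memset program itself is not shown in this thread. As a rough Python analogue (not the author's program), the sketch below allocates anonymous memory and writes one byte into every page so that each page really has to be backed by physical memory. The function name `touch_memory` and the 16 MiB demo size are assumptions for illustration; interpreter overhead dominates the timing, so it says nothing about Tru64 performance.

```python
import mmap
import time


def touch_memory(size_bytes, page_size=mmap.PAGESIZE):
    """Allocate an anonymous mapping and write one byte into every page.

    Returns (pages_touched, elapsed_seconds). Writing to each page forces
    the kernel to provide a physical page, as memset does after malloc.
    """
    buf = mmap.mmap(-1, size_bytes)  # anonymous, zero-filled memory
    start = time.perf_counter()
    touched = 0
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1  # first write to the page triggers a page fault
        touched += 1
    elapsed = time.perf_counter() - start
    buf.close()
    return touched, elapsed


# Small demo size; the problem in the thread shows up at several GB per process.
pages, seconds = touch_memory(16 * 1024 * 1024)
print(f"touched {pages} pages in {seconds:.3f} s")
```

To approach the situation described here, the size would have to exceed the 4 GB local to one RAD, while watching the free pages per RAD and the swap usage.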

I also don't have the impression that support is speeding up. They always tell me the system is OK and that I have to do something.
Only because I don't give up demanding a fix for that bug do I get some reactions.
I know that fixing a bug in a Unix kernel is not simple, but leaving the paying customer standing in the rain is not gentlemanlike. I expected some happiness, "hey, we found a bug and can fix it to make our product better for our customers", but nothing like this. Instead I got something like, "Hey, what do you want? Our system is ideal for Oracle users. Adapt your crazy software to our perfect system."

It would be great if you could help me understand the problem better and identify the trigger point. I will do my best.

Fighting for a better world with more penguins.

Hein van den Heuvel · 04-20-2005

>> I also don't have the impression that support speeds up.

Well, I guess the forum topic worked. When I saw the engineer I had in mind (quite literally in the hallway!) he already knew what I was about to ask him.

And let me also reassure you that this guy appreciates customers who put in their time and effort to help understand why a system is not behaving the way they would like. He will not brush it away as a 'bad test' but will be thankful for a starting point for the investigation.

Please realize that at this point in time this will be a best-effort investigation with no promise of a fix, tool, timeline or whatever. Engineering will surely have to review the full case and its support escalation process to determine the appropriate support priority and time.

Hope this helped some,
Hein.

Florian Heigl (new acc) · 04-27-2005

I know that on HP-UX systems with OLA/R some special features were added for NUMA usage with Superdomes on Itanium.
A technician can specify which rows of memory are assigned to which cell (I think that's your RAD).
Maybe there's an analogous way on your Alpha?

Also, I'd check whether the swap allocation rate per second decreases when using eager swap. It depends on how fast the application allocates memory, but I'd think the performance problems start at one particular moment, when the memory management decides it's out of memory NOW, and with eager mode you might be able to ease that effect.

Note that we only have a few systems with more than 32 GB of RAM and none of them use NUMA (yet), so this is mostly guesswork, but maybe something helps. :/

yesterday I stood at the edge. Today I'm one step ahead.

Joerg Schulenburg · 05-11-2005

@Florian: I am sorry, but switching to eager mode is of no help, because we have less swap than memory, which makes sense for 128 GB of memory! Our application won't need swap and we won't waste disks just for some tests.
We need good explanations, some knowledge about VM on RADs, and the right test programs to find out what happens.

@experts: I found out that when the process accesses its pages, the number of free pages on the local RAD goes down until 2*vm_page_prewrite_target (if large enough; about 16K..32K) is reached. Then pages are stolen from the neighbouring RADs, but this increases the number of wired pages on the local RAD by about 1K pages per 4 GB of stolen memory, which lowers the free pages of the local RAD further.
If free pages reach the 6K limit, active pages become inactive and paging/swapping starts. Unfortunately I don't know how to bring swapped or inactive pages back to active ones, and I cannot reboot for every test. At least Linux shows that bringing swap back into memory (swapoff) should be no problem without rebooting.
Further, I guess that inactive pages can be paged out, which does not necessarily lead to a swap-disk access if the UBC can hold the page. But extending the UBC will probably also lower the number of free pages and speed up the paging (probably ending in deadlock).
Because I cannot reboot all the time, I have to do most experiments on the live system, which makes conclusions more difficult.
So it would be wonderful to have a tool that brings swapped-out pages, or if possible inactive pages, back in when enough memory is free, to avoid rebooting. Do such tools exist?
Could you write one (swapoff at minimum)?
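
The observation above can be read as a simple linear model; the following Python sketch works it out under loudly stated assumptions. "K" is taken as 1024 pages, the wired-page cost is treated as exactly 1K pages per full 4 GB stolen, and nothing else (UBC growth, other processes on the RAD) is allowed to consume free pages. The function name is hypothetical and the results are what this reading implies, not measurements.

```python
K = 1024  # "K" in the post is read as 1024 pages (assumption)


def stolen_gb_before_paging(floor_pages, limit_pages=6 * K,
                            wired_pages_per_step=1 * K, step_gb=4):
    """GB that can be stolen before local free pages reach the paging limit.

    floor_pages: free pages on the local RAD when stealing begins
                 (2*vm_page_prewrite_target in the post).
    limit_pages: free-page level at which paging/swapping starts (6K).
    Each full step_gb of stolen memory wires wired_pages_per_step local pages.
    """
    headroom = floor_pages - limit_pages
    if headroom <= 0:
        return 0  # already at or below the limit: paging starts immediately
    steps = headroom // wired_pages_per_step
    return steps * step_gb


for floor in (16 * K, 32 * K):
    gb = stolen_gb_before_paging(floor)
    print(f"floor {floor // K}K free pages: about {gb} GB can be stolen")
```

The point of the sketch is only the direction of the effect: the smaller the gap between the stealing floor and the 6K limit, the less remote memory a process can take before its home RAD starts paging.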

But now another strange thing:
I also tested on a GS160 (OSF1 V5.1 732).
Seeing a lot of paging, I ran
ps -o "psr,pid,time,systime,pcpu,vsz,rss,minflt,majflt,cmd" -a
three times (with some seconds to minutes in between)
to find out who is paging/swapping and why.
But I was surprised to see that minflt and majflt can decrease. What does that mean? Is it another bug, or do I just have to update the ps utility?

PSR  PID     TIME         SYSTEM      %CPU   VSZ    RSS   MINFLT   MAJFLT  CMD
15   217821  35-10:23:42  1-00:52:45  282.9  389M   276M  973822   63445   ./fe30_abel_
7    302145  8-19:18:25   05:57:15    611.8  270M   208M  1531818  59365   ./x1_fe30_c
11   315623  0:00.17      0:00.15     0.0    6.14M  352K  -99      1104    vmubc-real -t

14   217821  35-16:10:05  1-01:04:34  373.4  389M   276M  1003962  95227   ./fe30_abel
5    302145  9-03:02:09   06:12:44    386.0  270M   208M  1529546  62449   ./x1_fe30_c
11   315623  0:00.31      0:00.28     0.0    6.14M  352K  2        1816    vmubc-real -t

14   217821  35-16:18:03  1-01:04:56  352.1  389M   276M  1003916  95322   ./fe30_abel
(minflt+majflt lowered?)
7    302145  9-03:15:02   06:13:09    250.5  270M   208M  1529498  62515   ./x1_fe30_c
8    315623  0:00.31      0:00.28     0.0    6.14M  352K  6        1831    vmubc-real -t

15   217821  35-17:07:16  1-01:07:05  378.5  389M   276M  1003590  95873   ./fe30_abel
(minflt lowered?)
4    302145  9-04:32:19   06:14:33    372.1  270M   208M  1556827  65560   ./x1_fe30_c
11   315623  0:00.34      0:00.30     0.0    6.14M  352K  22       1891    vmubc-real -t
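
To make the anomaly easier to see, here is a small Python sketch that compares consecutive samples and reports every fault counter that went down. The (minflt, majflt) values are copied from the four ps listings above; the helper name `find_decreases` is made up for illustration.

```python
# (minflt, majflt) per PID, copied from the four ps samples above.
SAMPLES = [
    {217821: (973822, 63445), 302145: (1531818, 59365), 315623: (-99, 1104)},
    {217821: (1003962, 95227), 302145: (1529546, 62449), 315623: (2, 1816)},
    {217821: (1003916, 95322), 302145: (1529498, 62515), 315623: (6, 1831)},
    {217821: (1003590, 95873), 302145: (1556827, 65560), 315623: (22, 1891)},
]
COUNTERS = ("minflt", "majflt")


def find_decreases(samples):
    """Return (pid, counter, later_sample_index, drop) for each counter that fell."""
    found = []
    for index in range(1, len(samples)):
        previous, current = samples[index - 1], samples[index]
        for pid in sorted(current):
            if pid not in previous:
                continue  # process not present in the earlier sample
            for name, before, after in zip(COUNTERS, previous[pid], current[pid]):
                if after < before:
                    found.append((pid, name, index, before - after))
    return found


for pid, counter, index, drop in find_decreases(SAMPLES):
    print(f"pid {pid}: {counter} fell by {drop} between samples {index} and {index + 1}")

# Counters that are negative, which a cumulative count should never be.
negatives = [(pid, values) for sample in SAMPLES
             for pid, values in sample.items() if min(values) < 0]
print("negative counters:", negatives)
```

In these four samples only minflt ever decreases (for PIDs 217821 and 302145); majflt grows in every step for all three processes, and the first minflt value for vmubc-real is negative (-99).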

By the way, can someone explain to me what exactly vm_overflow does?

Fighting for a better world with more penguins.

Hein van den Heuvel · 05-11-2005

Joerg,
Thanks for the update. I checked with engineering and I am convinced that the official case for this, which is active in addition to this topic here, is getting the appropriate attention. It just takes time.

>> @Florian: I am sorry, but switching to eager mode is of no help, because we have less swap than memory, which makes sense for 128 GB of memory! Our application won't need swap and we won't waste disks just for some tests.

I beg to differ, but my information may be incorrect or too old. When I discussed 'memory overflow' in a single RAD, and process movement (runon -P), it was explained to me a few years ago that the swapping mechanism is used to move pages over to a new current RAD. They are not copied, if I recall, but paged out to the swap file and paged back into the right RAD when used again.

Normally I would defend your non-eager, lazy swap approach. I too find it silly to be told by some vendor to allocate 3 times my physical memory as swap space because that seemed reasonable 10 years ago. I often know how to configure my system such that I will not need more physical memory than I have, so why have excessive unused swap space?! (I still like to have a good chunk, a couple of GB, there, so that the system has a modest chance of dealing with an unexpected demand, and slows down before giving up.)
In this case, considering the problems you are trying to explain, I would not 'take the risk': I would put the swap space there and maybe even switch to eager mode.
No real knowledge here, just gut feeling.

Hope this helps a little,
Regards,
Hein.

Joerg Schulenburg · 05-11-2005

@Hein: I don't think that a RAD memory overflow causes a page-out to the swap file and a page-in to the new RAD. That would be slow. Consider that a swap disk can write/read about 71 MB/s, which is 14 s per GB. Writing and reading back 1 GB would take more than 28 s. I see 2 s for writing 1 GB to another RAD, which is little more than writing to local memory (thanks to Alpha technology).
Also, majflt is 0 and minflt is the same for local and remote memory (131K pages of 8 KB). OK, you can argue that it's not really written to swap because of the UBC, but I also see no significant increase in UBC hits or misses. I am not 100% sure, because the machine is active at the moment, but I think transferring 1 GB via the UBC would be easy to see. It looks like your information is incorrect.
By the way, when I did the test some minutes ago, I switched off vm_overflow. The test again triggered swapping of another process, which did not stop after I set vm_overflow back to 1. RAD17.free was low and did not rise back to a minimum value. Because I saw actu=140K, I set ubc_maxpercent to 1, and actu went down until free reached its minimum value. After that, swapping stopped, and I could set ubc_maxpercent back to 50 without swap starting to grow again.
The documentation tells me that the UBC should give up its pages if memory is needed, but that's not true.
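
The timing argument above can be checked with a few lines of Python. The sketch only redoes the arithmetic from the post: 71 MB/s is the quoted disk throughput, 131K pages of 8 KB is the test size, and 2 s is the observed time. Treating MB and GB as binary units is an assumption that barely changes the result.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

DISK_MB_PER_S = 71        # swap-disk throughput quoted in the post
PAGE_BYTES = 8 * 1024     # 8 KB pages
PAGES_TOUCHED = 131_072   # "131K" minor faults in the test
OBSERVED_SECONDS = 2      # measured time for writing 1 GB to another RAD


def transfer_seconds(size_bytes, mb_per_s):
    """Time to move size_bytes once at the given throughput."""
    return size_bytes / (mb_per_s * MIB)


test_bytes = PAGES_TOUCHED * PAGE_BYTES
one_way = transfer_seconds(test_bytes, DISK_MB_PER_S)
round_trip = 2 * one_way  # page out to swap, then page back in

print(f"test size: {test_bytes / GIB:g} GB")
print(f"one pass over the swap disk: {one_way:.1f} s")
print(f"page-out plus page-in: {round_trip:.1f} s")
print(f"slowdown versus the observed {OBSERVED_SECONDS} s: {round_trip / OBSERVED_SECONDS:.0f}x")
```

The test size comes out at exactly 1 GB, and a round trip through the swap disk would take roughly 29 s, an order of magnitude more than the 2 s observed.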

Let me also comment on eager mode: if I am not wrong, swap is only reserved, not used, when an application is started. So it's just a number within the kernel that grows. Nothing is swapped. In lazy mode nothing is swapped either, but no number is increased. So the differences between lazy and eager mode are only virtual, not real. Swapping of some pages will occur only if memory runs out (in both cases).
In lazy mode I can see from the swap space when swapping starts; in eager mode I don't see it so easily. So I have less information. Also, for eager mode I need as much swap as memory; having less swap makes part of the memory useless for applications. As I see it, eager mode should be (a bit) slower, because the kernel has to increase the used (reserved) swap space number.
Do you have in mind what a good SCSI disk costs? Paying for disks of 3 times the memory just for a good feeling?
I don't understand why the default mode is eager. I think it's just because the programmer of the lazy feature does not believe in his own capabilities.
Also, I have to reboot to switch to lazy. Why does HP always want me to reboot the machine? Hey, we live in a UNIX world, not Windows. Reboots and crashes are undesirable.
That's why I mentioned swapoff. Do you know swapoff from Linux? Switching off swap, switching to the damn eager mode, switching swap back on. But I could never swapoff again, because in eager mode swap is reserved and will never be freed. OK, I see it does not work, because it contradicts the 10-year-old idea of the eager concept.
Why don't I have such problems with other Unixes? So I won't rearrange the filesystem and reboot again just for a feeling. Give me some hard arguments, and I will try!
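
The claim above that eager mode makes part of the memory useless when swap is smaller than memory can be put into numbers. This Python sketch follows the reading given in this post: in eager mode every allocated page needs a swap reservation, in lazy mode none is needed up front. It ignores file-backed memory and kernel use, and the function name is hypothetical.

```python
def memory_usable_by_applications(memory_gb, swap_gb, mode):
    """Physical memory that application allocations can fill, per the post.

    eager: each allocation reserves swap, so allocations stop once the
           swap space is fully reserved.
    lazy:  no reservation, so all physical memory can be allocated.
    """
    if mode == "eager":
        return min(memory_gb, swap_gb)
    if mode == "lazy":
        return memory_gb
    raise ValueError(f"unknown swap mode: {mode!r}")


MEMORY_GB, SWAP_GB = 128, 36  # configuration from the first post

for mode in ("lazy", "eager"):
    usable = memory_usable_by_applications(MEMORY_GB, SWAP_GB, mode)
    print(f"{mode}: {usable} GB usable, {MEMORY_GB - usable} GB idle")
```

With the 128 GB / 36 GB configuration from the first post, this reading leaves 92 GB of memory unusable for applications in eager mode unless at least another 92 GB of swap is added.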

Fighting for a better world with more penguins.
