Swap Reserved by Shared Memory?

Craig Johnson_1 · 06-13-2007

We have clustered Serviceguard setups with lots and lots of RAM. These run multiple packages with lots of databases in each package.

What we're seeing is that when we bring the packages up, they are eating up all of the available swap space, even though there is free RAM and no apparent swapping going on. One white paper we read indicates that the apps are reserving shared memory, and thus eating up the "virtual swap".

A swapinfo tells the story. No real swapping is going on, but vswap is so full that additional processes cannot be spawned. This kinda sorta shows it - we had to move packages around the cluster to get them up, so the swap usage went down a bit:

# swapinfo -tam
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        1024       0    1024    0%       0       -    1  /dev/vg00/lvol2
dev        8192       0    8192    0%       0       -    1  /dev/vg00/lvol9
reserve       -    9216   -9216
memory    19200   16327    2873   85%
total     28416   25543    2873   90%       -       0    -

What do we do to fix this?

Don Morris_1 · 06-13-2007

If it is truly Shared Memory (i.e. System V shared memory), add more swap. There's no way to prevent a SysV shmem object from reserving swap. Alternately, add more RAM [which will have representation in the memory/pseudo-swap... and will equate to adding more swap].

If the swap reservations are really for private objects in the manifold packages, you could try chatr'ing them with +z enable to get the Lazy Swap behavior (waits to reserve until memory fault time), but that runs the risk of the applications being terminated if insufficient swap resources are available. There's also a slim chance that lowering shmmax, shmmni or maxdsiz might have an effect (depending on exactly what is consuming the virtual address space in the applications and how they handle allocation failure), but I would frankly be adding swap first if you want these applications to run before throttling them.

A. Clay Stephenson · 06-13-2007

All of this looks normal and is the expected behavior when using pseudoswap (which you are). Shared memory certainly counts as process reservation space so that behavior is normal. You have to understand that pseudoswap isn't really swap; it's simply kernel math that allows you to count 3/4 of your physical memory as though it were swap space for process reservation purposes. You have 9GiB of device swap and if pseudoswap were not enabled, no matter how much free physical memory you had, you could not start more than 9GiB of processes. I'm guessing that you actually have about 28GiB of physical memory; it's possible that when you attempt to start a group of large processes, you exceed the available reservation space.
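As a rough illustration of that kernel math, here is a minimal Python sketch that recomputes the "total" line from the per-type lines of the swapinfo -tam output quoted earlier in the thread. The figures are copied from that listing; the dictionary layout and the function name are assumptions made for the example, and this only mirrors the bookkeeping that swapinfo reports, not how the kernel does it internally.

```python
# Figures (in MB) copied from the `swapinfo -tam` output quoted in the thread.
# The dictionary layout and names are illustrative, not an HP-UX API.
SWAPINFO_MB = {
    "dev": [(1024, 0), (8192, 0)],  # (AVAIL, USED) for each device swap area
    "reserve_used": 9216,           # device swap reserved but never written to
    "memory": (19200, 16327),       # pseudo-swap (AVAIL, USED)
}


def reservation_summary(info):
    """Recompute the 'total' line from the dev, reserve and memory lines."""
    dev_avail = sum(avail for avail, _ in info["dev"])
    paged_out = sum(used for _, used in info["dev"])
    mem_avail, mem_used = info["memory"]
    total_avail = dev_avail + mem_avail
    # Reserved space counts against the total even though nothing is paged out.
    total_used = paged_out + info["reserve_used"] + mem_used
    return {
        "dev_avail": dev_avail,
        "paged_out": paged_out,
        "total_avail": total_avail,
        "total_used": total_used,
        "total_free": total_avail - total_used,
        "pct_used": round(100 * total_used / total_avail),
    }


summary = reservation_summary(SWAPINFO_MB)
print(summary)
```

The point of the exercise is that "paged_out" stays at 0 while "total_free" shrinks: new processes fail when the reservation total runs out, not when the disks fill up.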

I'm going to guess that your fundamental problem is that you have some 32-bit processes. All 32-bit processes share a common 4GiB address space (which can disappear really fast) unless you use memory windows. In that case, each group of related 32-bit processes gets its own 4GiB address space to play in.

If 32-bit processes are not your problem then you are going to have to do some combination of the following: 1) Add more swap space 2) Add more physical memory 3) Reduce your memory footprint by tuning back SGA's and possibly buffer cache.

Finally, one of the most common ways that memory disappears is via shared memory segments that are not properly removed. This is trivially easy to cause if you have done some kill -9's. Do an "ipcs -ma" and look for any zero values in the NATTCH column. NATTCH = 0 is a necessary but not necessarily sufficient condition for the removal of a shmid; it depends upon the design of your applications.
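The NATTCH check can be scripted. Below is a minimal Python sketch, assuming the 15-column layout that "ipcs -ma" prints on this system (T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME). The four sample rows are taken from the listing posted later in this thread; the function name is hypothetical, and real output may need more careful parsing (for example if columns run together).

```python
# Sample rows in the `ipcs -ma` column layout; a subset of the thread's listing.
SAMPLE_ROWS = """\
m 1 0x41180262 --rw-rw-rw- root root root root 0 348 709 709 15:40:43 15:40:43 15:40:37
m 2 0x4e0c0002 --rw-rw-rw- root root root root 1 61760 709 709 15:40:39 15:40:43 15:40:37
m 26 0xa3c86030 --rw-r----- oracle dba oracle dba 19 1007685632 16624 23012 13:37:10 13:37:10 15:52:17
m 27 0x6e5884c8 --rw-r----- oracle dba oracle dba 0 423632896 19658 14873 11:29:34 11:29:34 15:54:27
"""


def unattached_segments(ipcs_text):
    """Return (shmid, owner, size_bytes) for shared memory rows with NATTCH == 0."""
    found = []
    for line in ipcs_text.splitlines():
        fields = line.split()
        # Shared memory rows start with 'm' and carry 15 whitespace-separated fields.
        if len(fields) != 15 or fields[0] != "m":
            continue
        shmid, owner = int(fields[1]), fields[4]
        nattch, segsz = int(fields[8]), int(fields[9])
        if nattch == 0:
            found.append((shmid, owner, segsz))
    return found


candidates = unattached_segments(SAMPLE_ROWS)
for shmid, owner, size in candidates:
    print(f"shmid {shmid} owner {owner}: {size} bytes, no attachments")
```

Treat the result only as a list of candidates to review with the application owners; as noted above, a zero attach count alone does not prove a segment is safe to remove.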

If it ain't broke, I can fix that.

Craig Johnson_1 · 06-14-2007

IPC status from /dev/kmem as of Thu Jun 14 13:38:25 2007
T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
Shared Memory:
m 0 0x000056ce --rw-r--r-- root root root root 1 200012 383 383 15:40:31 no-entry 15:40:31
m 1 0x41180262 --rw-rw-rw- root root root root 0 348 709 709 15:40:43 15:40:43 15:40:37
m 2 0x4e0c0002 --rw-rw-rw- root root root root 1 61760 709 709 15:40:39 15:40:43 15:40:37
m 3 0x411c0e49 --rw-rw-rw- root root root root 1 8192 709 721 15:40:39 15:40:37 15:40:37
m 9220 0x0c6629c9 --rw-r----- root root root root 2 18995752 1550 8467 14:47:50 14:49:50 15:41:38
m 5 0x06347849 --rw-rw-rw- root root root root 1 65626 1550 1609 15:41:43 15:41:39 15:41:39
m 1030 0x491010e3 --rw-r--r-- root root root root 0 22908 1535 1609 13:38:00 13:38:00 15:41:40
m 54279 0x5e1406cb --rw------- root root root root 1 512 1845 1845 15:42:09 no-entry 15:42:09
m 16392 0x011c1f99 --rw-rw-rw- root root root root 9 1160 1973 23746 13:38:18 no-entry 15:42:17
m 1033 0x0000cace --rw-rw-rw- root root root root 0 2 2541 2541 0:12:39 0:12:39 15:42:48
m 4106 0xcb53096c --rw-r----- oracle dba oracle dba 28 706707456 12589 11451 13:22:24 13:26:21 15:49:48
m 11 0xd2d59a0c --rw-r----- oracle dba oracle dba 6 315777024 12999 13054 15:50:04 15:50:10 15:50:03
m 12 0x5560aeac --rw-r----- oracle dba oracle dba 9 617881600 13177 6419 9:41:52 9:41:58 15:50:13
m 17 0xd795f464 --rw-r----- oracle dba oracle dba 9 122691584 14220 10114 10:21:13 10:54:31 15:50:56
m 18 0x3f0d0c90 --rw-r----- oracle dba oracle dba 7 220168192 14395 12413 14:55:25 14:56:19 15:51:03
m 19 0x73d4ba80 --rw-r----- oracle dba oracle dba 9 220168192 15059 23750 13:38:23 13:38:23 15:51:10
m 21 0xb3c69f70 --rw-r----- oracle dba oracle dba 12 572489728 15333 23430 13:38:04 13:38:05 15:51:26
m 22 0x072f7918 --rw-r----- oracle dba oracle dba 8 555712512 15435 15898 15:51:41 15:51:43 15:51:35
m 23 0x85dc29a0 --rw-r----- oracle dba oracle dba 9 617881600 15968 16078 15:51:50 15:51:52 15:51:43
m 24 0x43236490 --rw-r----- oracle dba oracle dba 12 168820736 16155 12869 13:14:57 13:14:57 15:51:53
m 25 0xe99aa82c --rw-r----- oracle dba oracle dba 35 505380864 16362 22098 13:34:15 13:34:15 15:52:02
m 26 0xa3c86030 --rw-r----- oracle dba oracle dba 19 1007685632 16624 23012 13:37:10 13:37:10 15:52:17
m 27 0x6e5884c8 --rw-r----- oracle dba oracle dba 0 423632896 19658 14873 11:29:34 11:29:34 15:54:27
m 29 0x66d9a908 --rw-r----- oracle dba oracle dba 11 438272000 24800 12452 18:27:13 18:27:14 15:59:18
m 30 0x833f27a4 --rw-r----- oracle dba oracle dba 52 580395008 25018 17642 13:24:06 13:25:06 15:59:28
m 31 0x9bb784d0 --rw-r----- oracle dba oracle dba 12 1663016960 25602 4271 12:00:07 12:00:35 15:59:37
m 32 0xca93a178 --rw-r----- oracle dba oracle dba 40 484212736 25782 25818 15:59:54 15:59:59 15:59:52
m 33 0x13122090 --rw-r----- oracle dba oracle dba 29 639598592 26050 7869 13:20:03 13:28:27 16:00:01
m 34 0x770c1450 --rw-rw-rw- hyperion dba hyperion dba 28 37120 29915 23231 8:46:58 no-entry 16:02:45

Don Morris_1 · 06-14-2007

OK, so 9.2GB of your virtual address consumption is in SysV shmem objects, only 4 of which have a NATTCH of 0 (and as such may be viable for removal). The other 19.2GB has to be virtual address space consumption by either the kernel (pseudo-swap can be consumed by kernel "locked" pages, some kernel structures are in fact swappable as well) or [and I expect primarily] by non SysV shmem user space objects.
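The shared memory figure can be reproduced by adding up the SEGSZ column of the ipcs listing above. A minimal Python sketch follows; the values are copied from that listing in order, and the variable names are just for illustration.

```python
# SEGSZ values (bytes) copied in order from the ipcs listing in the thread.
SEGSZ_BYTES = [
    200012, 348, 61760, 8192, 18995752, 65626, 22908, 512, 1160, 2,
    706707456, 315777024, 617881600, 122691584, 220168192, 220168192,
    572489728, 555712512, 617881600, 168820736, 505380864, 1007685632,
    423632896, 438272000, 580395008, 1663016960, 484212736, 639598592,
    37120,
]

total_bytes = sum(SEGSZ_BYTES)
total_gib = total_bytes / 2**30  # binary gigabytes, the unit used in the thread

print(f"{len(SEGSZ_BYTES)} segments, {total_gib:.1f} GB of SysV shared memory")
```

Nearly all of it comes from the oracle-owned segments, which is consistent with the SGAs being the main shared memory consumers.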

Again, if this is the workload you want to run -- you simply need more RAM or swap to meet the virtual address space requirements of the extra processes that are currently failing when you hit swap reservation exhaustion. I seriously doubt anything else will be an effective solution. Lazy Swap is the only other alternative, but it requires a chatr of several binaries in this case and runs the risk of causing spurious failures when physical memory is actually allocated and the deferred swap reservation occurs. Adding another disk/FS for swap is safer and easier.

Dennis Handly · 06-14-2007

For NATTCH of 0, if you are interested in seeing how old these are, see my program in:
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1136341

Craig Johnson_1 · 06-15-2007

Can someone explain to me why, then, none of the disk swap is being used? The system has over 1GB of free RAM, and several gigs of free disk swap, yet it hits this "pseudo swap reserve" limit and suddenly can't spawn a process?

I appreciate all the helpful feedback, but I'm still confused.

Don Morris_1 · 06-15-2007

Sorry, thought I'd alluded to that in my original reply.

Long version: Read the Memory Management Whitepaper (http://docs.hp.com/en/1218/mem_mgt.html).

Short and sweet version: UNIX virtual memory systems typically work by having each physical page in use have a potential page of swap ready in case it needs to be paged out. [This is swap reservation, as opposed to swap *consumption*, where the data is actually written out to the swap device/FS and swapinfo reports bytes USED.] There are a few algorithms on how and when to do this. Linux in particular waits until physical memory is allocated (on a page fault, usually) to reserve the corresponding swap -- which runs the risk of swap exhaustion during the fault, at which time killing a process (well, or suspending it in this odd state if you're crafty) is the only option. This is trading off application stability/predictability for ease of resource management.

HP-UX chose a different algorithm. In HP-UX, all virtual objects in a process will already have the corresponding swap reserved... ready for whatever corresponding physical RAM they consume to be swapped out. This can reserve more swap than will actually be needed, but moves the point of failure to reserve swap (and the need to handle that failure) to system calls that create new virtual objects [mmap, malloc/sbrk/brk, etc.] which can return failure so the application can handle it gracefully. Stability/predictability over resource consumption minimization.

There is one exception to this, and that is explicit "Lazy Swap Reservation" (which is pretty much the model I described for Linux), enabled via explicit mmap() flags, by use of chatr (+z enable) on a binary, or for some objects where the kernel may set it internally.

I think you can see where this is going -- your workload fits in your RAM for the physical data set as you'd mentioned. However, the virtual consumption of your applications (and their mmap's, etc.) is reserving all of your swap space. You won't actually use any of it until paging is needed... but it will be reserved anyway.
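The difference between the two reservation policies can be sketched with a toy model in Python. Everything here is hypothetical (the class, the function names, the 28 GB pool and the workload sizes); it is not how either kernel is implemented, only a way to see where the failure shows up under each policy.

```python
class SwapPool:
    """Toy model of a swap reservation pool; sizes are in MB."""

    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.reserved_mb = 0   # promised to virtual objects
        self.paged_out_mb = 0  # actually written to swap (never changes here)

    def reserve(self, size_mb):
        """Reserve space, or report failure instead of over-committing."""
        if self.reserved_mb + size_mb > self.total_mb:
            return False
        self.reserved_mb += size_mb
        return True


def create_eager(pool, size_mb):
    """Eager policy: reserve when the object is created, so a shortage
    shows up as a failed allocation call the application can handle."""
    return pool.reserve(size_mb)


def touch_lazy(pool, touched_mb):
    """Lazy policy: reserve only when pages are first touched, so a
    shortage shows up in the middle of a page fault."""
    if not pool.reserve(touched_mb):
        raise MemoryError("swap exhausted at page-fault time")


GB = 1024

# Eager: the databases reserve 27 GB up front; a further 2 GB request fails
# cleanly even though nothing has been paged out.
eager = SwapPool(total_mb=28 * GB)
databases_ok = create_eager(eager, 27 * GB)
new_process_ok = create_eager(eager, 2 * GB)
print(databases_ok, new_process_ok, eager.paged_out_mb)

# Lazy: creation costs nothing, but touching more than the pool holds fails
# at fault time, when there is no error return to hand back.
lazy = SwapPool(total_mb=28 * GB)
touch_lazy(lazy, 20 * GB)
try:
    touch_lazy(lazy, 10 * GB)
    lazy_failed_at_fault = False
except MemoryError:
    lazy_failed_at_fault = True
print(lazy_failed_at_fault)
```

In the eager case the pool is exhausted with zero MB paged out, which is the situation shown by the swapinfo output in this thread.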

Hope that made sense -- if not, you may want to try reading the whitepaper [which had the advantage of more review and less 6AM, first cup of coffee], or follow up with clarifying questions.

Craig Johnson_1 · 06-15-2007

By adding more disk swap, can I improve my situation?

Craig Johnson_1 · 06-15-2007

Because RAM is so cheap these days, it isn't uncommon to have servers with 32GB of RAM but 73GB disks (that's what most of our Itaniums and Linux DLs have). So allocating 1.5x RAM for swap on the boot drive is pretty much out of the question...

A. Clay Stephenson · 06-15-2007

Nobody said that all your swap has to be on the boot disk or even in vg00. The only requirement is that primary swap must be on the boot disk. You can add multiple secondary swap devices or even filesystem swap.


Craig Johnson_1 · 06-15-2007

I know that, but my question is, will adding more disk swap help any?

Don Morris_1 · 06-15-2007

Yes, it will. The "total" line in swapinfo is effectively your maximum virtual address space which can be used on your machine. Increase that, you can have more virtual objects (processes, mallocs, mmaps, etc.). In your case -- that really seems to be your throttling point to achieve the workload that you want, so this is what you want to do.

As mentioned earlier, it can be device swap or even filesystem swap. Since it seems unlikely you'll actually be swapping out, the performance characteristics of the I/O to the device aren't what's important, just adding resources to the reservation layer.

RAM is also acceptable since a percentage of it adds to the pseudo-swap line, but since you don't believe your workload is going to consume all of your existing RAM... and disk space is relatively cheap... I'd just add a secondary swap from a spare disk or FS space.
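To put a rough number on how much secondary swap to add, one can work backwards from the "total" line. The Python sketch below is only an estimate under stated assumptions: the 28416 MB and 25543 MB figures come from the swapinfo output earlier in the thread, while the 6000 MB of extra demand (say, a package failing over to this node) and the 10% margin are made-up planning inputs, and the function name is hypothetical.

```python
import math


def extra_swap_needed_mb(total_avail_mb, total_used_mb, planned_demand_mb, margin=0.10):
    """Estimate how many MB of swap to add so that current reservations plus
    the planned demand fit under the 'total' line with some margin to spare."""
    needed = total_used_mb + planned_demand_mb
    target_total = needed * (1 + margin)
    return max(0, math.ceil(target_total - total_avail_mb))


# AVAIL and USED from the swapinfo 'total' line; demand and margin are hypothetical.
extra_mb = extra_swap_needed_mb(28416, 25543, planned_demand_mb=6000)
print(f"add roughly {extra_mb} MB of secondary swap")
```

Because the space is only being reserved, not written to, the estimate is about capacity alone; the speed of the device behind it matters little for this purpose.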

Craig Johnson_1 · 06-15-2007

OK, thanks! That's what I needed to know! A relatively simple fix for what appeared to be a very complex problem!

A. Clay Stephenson · 06-15-2007

Define "help". Adding more swap will not make your box run any faster but it will make your box run more safely and reliably -- especially when package switches occur so that more packages than expected are running on a single node.

You've already seen that you are getting into situations where new processes can't be spawned --- adding more swap will make those situations more rare. Of course, even falling back on the decades-obsolete standard of 2-3x memory as swap, one could still find situations where it would not be possible to spawn more processes because of lack of virtual memory --- but at that point the performance would be terrible anyway.

If you can trim back your application memory usage (especially SGA's) do that first; it's free and the performance impacts may be minimal if done wisely. If you can, buy more memory (and increase swap in proportion to the new memory); finally, add more swap --- it's cheap.

