Operating System - HP-UX

2-second delays in fsync/msync/munmap

 
SOLVED
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

There is a lot of information in this post and I hope I've read it right, but there seems to be some conflicting evidence in there (at least to me). In any case, here is some theory that may or may not be relevant.....

When you say

"I was able to reduce the problem somewhat with one test variation of our software by avoiding mixing of real I/O and writes to the correspondnog mmap'd pages of the file"

I know this sounds obvious, especially with the focus on fsync, but can you confirm you are doing reads/writes on the same file as you are mmap'ing? If so - do not - at least before 11.23. The reason is that hpux, before this version, uses a split buffer and page cache. Any pages that are referenced in one cache have to be scanned and invalidated in the other. At 11.00 this is accentuated because the page cache is searched using the vnode/file offset and the buffer cache using the device/block number, and at 11.00 the buffer cache does not keep track of which files have buffers in the cache - hence a sequential search of the cache is required. Initially, with the buffer cache set to 600MB, this would have been reasonably slow. Interesting, then, is your statement that when you mistakenly sized the buffer cache small the situation improved. To test the effect of this your best bet is to do what has already been suggested - bypass the buffer cache for the filesystem(s) in question using the specified mount options.

There are also various other problems with the data being in two caches at once which may or may not contribute to the page-ins during msyncs. For instance, if you are using the write system call to extend the file while using mmap, those pages will need to be read in during an msync. Add to that the invalidation of the buffer cache pages and you'll probably see where I'm going here....
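To illustrate the kind of pattern I mean, a stripped-down sketch (not your code - the file name and sizes are made up):

// Sketch of the pattern to avoid before 11.23: the same file is touched
// through both the buffer cache (write/pwrite) and the page cache (mmap).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

int main()
{
    const size_t len = 8192;   // two 4K pages, just as an example
    int fd = open("index.dat", O_RDWR | O_CREAT, 0644);

    // 1) Extend the file with a real write - this goes through the buffer cache.
    char zeros[8192];
    memset(zeros, 0, sizeof(zeros));
    write(fd, zeros, sizeof(zeros));

    // 2) Modify the very same offsets through a mapping - this goes through
    //    the page cache, so the kernel has to keep the two caches coherent,
    //    scanning and invalidating buffers on every sync.
    char *map = static_cast<char *>(
        mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    memcpy(map, "record", 6);
    msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return 0;
}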

But why does hfs perform better than vxfs in this scenario? One possible reason is that one of vxfs's tricks is to cluster the IO to a filesystem to avoid random IO....if you are syncing all the time you are penalising the filesystem's design. I would also suggest regular de-fragmentation of the filesystem using online JFS if you have it. I also noticed nfile was set reasonably - but what of vxfs_ninode? Can you run the following command:

# echo vxfs_ninode/D|adb -k /stand/vmunix /dev/kmem

Your best bet to nail this would be to contact HP to get vmtrace set up - the engineers will be able to determine which kernel function is causing the delay and hence offer far more insight into the problem.

Cheers,

James.


Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Thanks for the analysis and info, James.

The mix of real and mapped I/O I was doing was not only to the same file but to the same offsets. I was using pwrite to extend and mmap to write. What I did to improve the performance about 90% was to change (hack) my "mapped I/O" software to just do the pwrite. BTW, I've yet to change the code to prevent a later read via mmap since this doesn't yield the right answer on HP (a constraint I was already aware of from reading the man page).
For the main file (an index), I "extend ahead" so that there are far fewer pwrites for the amount of map writes. However, I still see a performance hit of probably 3-6x (hard to say exactly since fsync also has a problem and so my only baseline is a Sun box).
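For reference, the extend-ahead logic is roughly this (a stripped-down sketch, not the real code; the chunk size is arbitrary and the region start is assumed page-aligned):

// Extend the file well past the current end with pwrite, so that many
// subsequent mmap writes land on already-allocated pages without another
// write() call mixing into the same region.
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstring>

static const size_t kExtendChunk = 256 * 1024;   // extend 256K at a time (arbitrary)

char *extend_and_map(int fd, off_t current_end)
{
    // One burst of real writes to grow the file by a whole chunk...
    char zeros[4096];
    memset(zeros, 0, sizeof(zeros));
    for (size_t done = 0; done < kExtendChunk; done += sizeof(zeros))
        pwrite(fd, zeros, sizeof(zeros), current_end + done);

    // ...then all index updates within the chunk go through the mapping.
    void *p = mmap(0, kExtendChunk, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, current_end);
    return static_cast<char *>(p);
}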

I don't suppose HP-UX has a handy way to extend a file without using write? :-\

Your comment about HFS sounds right, since the difference in performance is only at most 1.5x, and the pattern of delays is different.

vxfs_ninode: 128000
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

BTW, I'm currently unable to try the mincache or convosync settings since we don't have the needed version of VxFS.

James, while the split buffer might explain the performance problems I'm seeing with the mmap/msync version of our code, do you have any theories why the write/fsync version would also exhibit a similar problem? Is it possible that, once the kernel "knows" that a file has pages in both buffers, it will forever have to do expensive scans for any I/O operation to the file?

FYI, I have a small test program that recreates the performance problems (even without a single extend of the file, i.e. after an initial extend to create it, all subsequent tests use the existing file), and the write/fsync test is generally worse than the map/msync test, sometimes as much as 5x worse. It has also revealed that calling msync(MS_ASYNC) to schedule all the I/O before all the msync(MS_SYNC) calls is generally a bad idea. However, results have proven wildly inconsistent: sometimes turning off the MS_ASYNC pass avoids the 2-second delays (e.g. msync total elapsed drops from 10s to less than 1s); sometimes recreating the file causes subsequent test runs to exhibit very different performance (as much as 2-4x, primarily due to "read" times); and some test attributes (i.e. #writes/syncs) avoid the 2-second delays in msync (but not fsync).
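For clarity, the "schedule then sync" pattern the test uses per commit looks roughly like this (a simplified sketch, names made up):

// One "commit": a batch of writes through the mapping, an optional MS_ASYNC
// pass to schedule the I/O, then MS_SYNC calls that must block until the
// dirty pages are on disk. The MS_ASYNC pass is what sometimes makes the
// following MS_SYNC calls stall for ~2 seconds.
#include <sys/mman.h>
#include <unistd.h>

void commit(char *map, size_t maplen,
            const size_t *dirty_off, int ndirty, bool schedule_first)
{
    long pg = sysconf(_SC_PAGESIZE);

    if (schedule_first)                      // the "-m" style variant
        msync(map, maplen, MS_ASYNC);        // schedule everything, don't wait

    for (int i = 0; i < ndirty; ++i) {
        // sync each dirtied page synchronously, as the commit requires
        char *page = map + (dirty_off[i] / pg) * pg;
        msync(page, pg, MS_SYNC);
    }
}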
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Does 11i v1 (for PA-RISC) have the unified buffer cache?
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

11i v1 (11i) doesn't have the unified buffer/page cache. However, the buffer cache has been optimised in at least two ways at that release, to my knowledge:

1) There was a re-write of the filesystem locking mechanism whereby the delay in processing other filesystems' IO was reduced.
2) The buffer cache keeps a record of buffers allocated per file rather than having to scan the whole cache.

The first release of the unified buffer cache is 11.23 on Itanium. It will be ported to 11.31 for PA. Also, Solaris introduced the unified cache in 2.7 I believe, so you can expect different results from that version onwards....however this doesn't explain the difference in fsync. This is where we have to be careful....for instance, are your Sun servers using the same vxfs version, or are they attached to a high-end array whereas the hpux servers are not? The effect of array caching should never be overlooked in these circumstances.

A bit more about hfs/vxfs....vxfs uses a more advanced mechanism for determining which memory regions are dirty and which are not - hence it will flush whenever a region is stale, while hfs may let the region become dirty and just flush when the file is closed. It's a pity you don't have online JFS, this would really have helped.

Your last point is more complex, I think I'll need to think on this one....but reads/writes will always be slower than mmap as the kernel has to do an extra copy each time from the buffer cache into the process's address space and back again. So that's two extra copies which may account for some of the performance difference. I don't think you can extend a file without a write call though! :-)

All I would say is to try and use some programming techniques to get around this.....I'm sure you'll be a lot more advanced than me at this. However, if you are using vxfs anyway the logging mechanism should take a lot of the burden off the syncs you are doing....and it doesn't seem quite right (to me, again) that a file that is constantly being extended is one that has to be constantly flushed to maintain consistency.

However, please feel free to explain as much as you can about the program and what/how you are trying to achieve and I'll do my best to help. Also, if you can attach the test programs that would be good....if not I'll try to write some and test on 11.00, 11i and 11.23 to compare the differences.

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Hi James,

We appear to use ufs(?) with RAID rather than vxfs on our Sun boxes.

We're probably going to upgrade to 11.11 and see if that helps. Does that by chance come with OnlineJFS?

I've attached my test program.

To build, I use:
aCC -g -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE tmmap.cxx -o tmmap

Some tests I've done:
tmmap -s 80 1000 200 - smaller test
tmmap -sm 80 1000 200 - smaller test where calling MS_ASYNC is bad
tmmap -s 80 10000 1000 - larger test
tmmap -sf 80 10000 1000 - write/fsync version

It creates a data file in the default directory and then simulates index-insert and commit activity. The timer stuff is somewhat hacked as needed. BTW, I usually run it under time(1) to get a total elapsed time to compare to the reported times.

Have fun,

Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

UFS is the Unix File System and is equivalent to HFS in hpux. Online JFS comes with certain Operating Environments on 11i, the Enterprise and Mission Critical OEs, although you can install it on the others.....see this document for a full list of the software per OE:

http://www.docs.hp.com/hpux/pdf/5187-3623.pdf

What 11i does give you though is vxfs version 4 by default, which has a lot more scope to be tuned (see "man vxtunefs"). It's also the OS I do most work on, so hopefully I'll be able to help a bit more than with 11.00.

I'll do some testing then with your program and see what I find....I'll try to have fun! :-)

Cheers,

James.
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

Just to let you know I'm still looking into this but the analysis has led to some confusing evidence. I've attached some tests I've done on various systems. I've only included the total times but I am seeing the same symptoms you have mentioned.

Initially when I did the test 11.23 was so much faster in all variations I just assumed this was down to the unified buffer cache (UFC). However, even though the times were substantially lower than other releases, there was still a marked difference in some vxfs/hfs times. I then activated kernel tracing to see which functions were causing the delay.....but there was nothing obvious at all. This led me to believe a lot of the time was spent after the IO was passed to the driver (even basic timex logging seemed to agree with this), hence I tested the program on one of my own systems (c3000 in the attachment) and found that, using very poor disks conveniently attached by differing scsi interfaces, the difference between hfs and vxfs was even more marked. The jbod is attached by an ultra-narrow scsi to a fast-wide and the disks are also very old. Here hfs performed about six times better than vxfs.

With some basic performance tuning I did shave about 20 seconds off the vxfs times but I couldn't get anywhere near the hfs times. Hence I believe that another major issue is the way hfs and vxfs flush the data to disk, which I touched upon before. The biggest improvement came in the msync test using an mmap write size equal to the filesystem block size. Surprisingly, the direct IO route provided no benefit at all.

Some notes:

1) If you can get an Itanium 2 system with 11.23, do so.....even a workstation will dramatically improve the performance as shown.
2) If not, use hfs filesystems if you can live with the potentially poor data block integrity in the event of a crash (I suspect not).
3) Try and have the data being mapped in its own filesystem - the buffer cache is indexed from a hash of the block/device and then a linked list is followed - the fewer blocks to search down the list the better.
4) Use a filesystem block size equal to the write size.
5) The new interface at 11.23, and indeed that used by Solaris, does not use the buffer cache at all except for filesystem metadata. All reads/writes are mapped at a lower level to the mmap calls, hence allowing backward compatibility while retaining the POSIX standards.

I know there are still a lot of unanswered questions - I'll keep on it and let you know.

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Thanks for the info. By "same symptoms", do you mean the occasional 2-second delays as well? These are still the most mysterious part of the problem and I'd be relieved if they showed up elsewhere. :-}

Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

Got a bit sidetracked last time, hope I get back to the issue at hand here! :-)

I've looked further into what your program is actually doing but I believe it does break down into a comparison between the filesystems again. I'll look at the mmap issues first and then the fsync (where I do see the near two second delays on hfs).

Using your program without any options yields similar results for both hfs and vxfs apart from one area in certain situations - the read times. I used glance to find that both filesystems are faulting on every page that is accessed, but hfs is resolving some faults in memory during the first invocation of the program and then on subsequent invocations all faults are resolved in memory. On the other hand vxfs never resolves faults in memory, hence it has to go to disk (via VM) to resolve the faults. I think there may be two issues here - how the filesystems initially map the memory, and whether it is retained after the program completes and the process dies. Although I don't know the specifics at this time, I can see these falling into two traditional unix areas. For the first, it is possible vxfs is not paging in any of the mmapped area until it is actually required. This would be helpful in situations where only a small portion of the mapped area is actually ever accessed, but not in a program like yours which accesses every page. The second may come down to invalidation of pages that have been freed by a process, or simply that vxfs requires every page to be re-read from the file if it is accessed again. I did notice slight delays in every nth-or-so msync but not to the extent you did, i.e. 112ms vs 40ms average.

However, for the async test vxfs shows no noticeable difference as it needs to fault in every page anyway. I think the msync time goes up a bit as msync will block waiting for IO to complete - if async is called before it, some pages may be locked waiting to be synced, hence the slight delay.

For the fsync problem it appears things are slightly more obvious. Using "sar -b" you can see vxfs is flushing all write buffers to disk straight away, hence when fsync is called it returns almost straight away. In contrast hfs is only flushing some of the buffers each time, until fsync is called, when it has to block until all are synced. I suspect when you lowered the buffer cache to be extremely small, the number of buffers allowed to be out of sync fell proportionally, hence the fsync had less to do when it was called.
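For reference, the write/fsync variant being measured is essentially just this loop (a simplified sketch):

// Each "commit" is a batch of pwrite calls followed by one fsync. On hfs the
// buffers accumulate in the cache, so the fsync has to block while they are
// all flushed; on vxfs most of the data has already gone to disk by then.
#include <sys/types.h>
#include <unistd.h>

void commit_with_fsync(int fd, const char *buf, size_t reclen,
                       const off_t *offsets, int nwrites)
{
    for (int i = 0; i < nwrites; ++i)
        pwrite(fd, buf, reclen, offsets[i]);   // buffered writes

    fsync(fd);                                 // block until everything is on disk
}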

Obviously a lot of this is just theory at the moment but I wanted to give you some results - I'm going to do a bit more testing for more conclusive analysis.

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

FYI, the 2-second delays that I got didn't actually appear in my HFS tests. They seem to appear only for VxFS.
Something I have observed more recently is the appearance of these delays during memcpy (reading from the map). These instances usually occurred less than 0.01% of the time (at times even 1 in 100000) and were always _exactly_ or just above 2 seconds (the normal times being 0-30 ms). This particular test was one where there were no munmaps (as is the case with the test program, BTW) and few mmaps.

A victory...
Since the munmaps were the major victims of the 2-sec delays, I changed some related software to greatly minimize the need for them (i.e. by preventing index nodes, and thus reads or writes, from crossing a page boundary - see the sketch below). With this change I was able to bring the VxFS times down to what I'm seeing for HFS (i.e. close to "normal" and using >50% CPU). This will hopefully be "good enough for prime time" and I can go on to something else. :-)
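The change amounts to rounding node placement so a node never straddles a page boundary - roughly this (a simplified sketch with made-up names, not the actual code):

// If an index node would cross a page boundary, bump it to the start of the
// next page so any single read/write (and hence any msync/munmap) only ever
// touches one page.
#include <sys/types.h>
#include <unistd.h>

off_t place_node(off_t next_free, size_t node_size)
{
    long pg = sysconf(_SC_PAGESIZE);
    off_t page_end = ((next_free / pg) + 1) * pg;

    if (next_free + (off_t)node_size > page_end)
        next_free = page_end;        // start the node on the next page instead

    return next_free;
}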

BTW, we decided not to upgrade to 11.11 since your test results indicated that it really wouldn't help. We'll probably wait for 11.23 (on PA-RISC).

If you decide to (also) punt on this, I'd just like to say thanks for all your help. Your input was informed, intelligent, and appreciated.
If not, I'm still going to be tinkering a little to try and find out where those seconds are going.

Later,
Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Correction on the "victory":
Due to a bug in one of my debug hooks, the times I was seeing were a little better than reality. My change does help, but HFS is still about 3x better than VxFS. Oh well, back to banging my head against the wall...

Cheerless in Rockville,
Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

FYI, I've determined that the "2-second delays" that I observe in my VxFS mmap tests (primarily affecting msync) account for about 2/3 of the elapsed time (i.e. roughly the difference between VxFS and HFS).

Do you get these same 2-second delays in msync (they show up in the output from the test tool)?

I had thought at one point that these were just the "actual work" of the msyncs, deferred by the VxFS layer for whatever reason until it decides to do the real flushes. In favor of this theory is the observation that most of the msyncs take less than 1 ms. However, I'm skeptical of this theory since the non-zero times are very consistently right around 2 seconds. Plus, why would it make most of them "asynchronous" and then do all of them synchronously? The fact that there's never more than one of them per "commit" is also a mystery.
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

I got delays but not in the 2 second range, say 0.8 secs on average. I'm currently ploughing my way through memory management and vxfs internals to try and find an answer to this. :-)

However, more news on the vxfs/hfs differences which I'm sure will help. The two differ in the way they actually commit the writes - vxfs will write any changes to disk during a munmap or msync, which I think is the behaviour we expect. On the other hand, hfs will only commit when the file descriptor is closed! What I noticed in my testing was that at the end of the hfs tests there was massive activity on the disks and the program paused for several seconds before exiting, which didn't show up in the times for the calls. However, if you run a timex on the program you will see the total time is several seconds longer than the elapsed time from your program's results. It doesn't bridge the gap between the two fs types, because the one-time flush of data is still better than the constant commits, but it does add quite a lot to the perceived hfs times.

I'll try and shed some light on the delays soon.

cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Hi James,

About those 2-second delays. They apparently only happen when I call msync(MS_ASYNC) in advance of all the msync(MS_SYNC) calls. If I don't, I get what look to be more "normal" numbers. This would lead me to believe that the theory I had discounted may be correct.
What's more interesting, though, is that even though the total time for all the msyncs drops quite a bit, the total time for the test drops no more than maybe 10%. I suspect that the "lost time" shifts to the reads (page-ins) of the modified pages.
In light of the fact that the overall read times are much higher for the VxFS tmmap test than the HFS test, my guess would be that at msync a lot of pages are released (thus requiring page-ins again after a "commit") EVEN THOUGH the process still has the region mmap'ed and the file is open.
If this is true, I'd guess that either VxFS itself is doing something very brain-dead OR the kernel logic that makes sure that the buffer cache and the mmap regions are consistent does not "play well" with VxFS files for some reason.

Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

Those are pretty much my conclusions too. I'm guessing that by using MS_ASYNC before a MS_SYNC, the sync is finding certain pages locked by the async, hence causing a stall. If the async has already flushed a few pages in the background, the sync on those pages will return quickly, but every nth page that is still locked will experience a delay.

For the second part I would go even further - I suspect ALL vxfs pages are read back in after a write. This is accentuated if madvise isn't used and mmap reads ahead many more pages than it needs, or vice versa, depending on whether the access is random or sequential. You could test this by invalidating the vxfs pages with MS_INVALIDATE and seeing if this makes any appreciable difference to the time. However, I suspect it's the hfs behaviour which is most in question - it may not be maintaining cache coherency. Again, this will need to be tested by analysing the contents of a file once the program is over.
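i.e. something along these lines in the test, after the synchronous flush (a rough sketch, not vxfs-specific):

// After committing a page-aligned region, also invalidate the mapped pages so
// the next access is guaranteed to re-read from the file. If the times barely
// change on vxfs, it suggests vxfs was effectively re-reading the pages anyway.
#include <sys/mman.h>

void commit_and_invalidate(char *page, size_t len)
{
    // flush the dirty pages and drop the cached copies in one call
    msync(page, len, MS_SYNC | MS_INVALIDATE);
}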

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Hi James,

Some more info (that seemed "interesting" and that I didn't have a ready explanation for):
I also get the 2-second msyncs (w/o ASYNC) for a high batch count (e.g. 10000 writes/commit). Maybe VxFS itself is doing some asynch stuff?
If I run 'sar -d' during the tests (to get an average over several tests and multiply by average test duration):
- I generally see a total #blocks of close to the size of the test file for HFS (e.g. 160K blocks for an 80MB file). For VxFS though, I get a total block count of almost twice this (and it doesn't increase much if I increase the #commits). Could it be that HFS doesn't have to read any pages because they're already cached, whereas VxFS needs to read them all in again?
- If I delete the test file first, the VxFS times drop by about 20%. The HFS times INCREASE by 20-30%. The VxFS #blocks and r+w go UP. The HFS #blocks and r+w stay about the same.
- The HFS I/O was much more consistently busy than VxFS.
In glance (Process Resources):
- the "Phys. Reads" for VxFS was twice the "Phys. Writes" (for a "50K 100" test). For HFS, the Writes were similar but the Reads were nearly zero.
- the "Total IO Bytes" for HFS was 0. For VxFS it was near 160M.
- I creep toward 100% CPU utilization as the system gets quieter.

OK, that's enough for me, my eyes are starting to cross. Does any of this support or refute your understanding/theory of what's going on?

Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

I need to mention one important piece of info regarding the tests I did for my last post. Since I used "-S1", the test wrote to the file in roughly sequential order and RARELY wrote to the same page in 2 different batches. While this does simulate the usage pattern of our software, it skews the results in some ways.

For comparison, I collected some more data with random reads+writes. In a set of "batch count = 1K" tests, VxFS actually performs BETTER than HFS on average (HFS is more sensitive to total write count - see below).

- In the VxFS tests the total times for reads was always around 30 seconds. The msync times accounted for the difference when the total #writes was increased from 10K to 50K or 100K.

- For the "100K by 1K" test, the only non-trivial differences in the Process Resources info for VxFS and HFS were:
                  VxFS    HFS
VM/Phys Reads:    4496    2
VM/Phys Writes:   92.5K   91.4K
Phys IO Rate:     1145    460
Total IO Bytes:   635MB   0
Vir Faults:       4K      19.7K
Mem Faults:       26      19.5K
Disk Faults:      3.9K    7
Elapsed time:     65      199

- For the "10K by 1K" test:
                   VxFS    HFS
VM/Phys Reads:     4K      1
VM/Phys Writes:    9K      9K
Phys IO Rate:      330     451
Total IO Bytes:    129MB   0
Vir Faults:        3.6K    11.8K
Mem Faults:        36      11.8K
Disk Faults:       3.5K    6
Elapsed time:      41      20
Elapsed (real IO): 200     60

BTW, I didn't see ANY of the 2-second delays during the mmap tests. They did show up in the "real IO" tests (e.g. fsyncs often take about 18 or 20 or 22 seconds).

Hope this helps.

Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

I think we can see it's all about caching. When you deleted the hfs file first, the page cache entries, indexed by the vnode, were invalid, hence will have produced more physical IO. When you deleted the vxfs file first the same happened - but I assume the better results were achieved because there were no dual cache consistency problems to address. In all my tests vxfs outperformed hfs in any physical IO by some margin. Also, your virtual/memory/disk faults are the same as I got. Notice that any virtual faults (pages not referenced in the processor cache) are always resolved in memory (page cache) by hfs. On the other hand vxfs always resolves them by disk IO. For the disk block results I agree - it's no doubt a result of the caching plus the block size/read-ahead factor.

At the moment I'm working my way through a crash dump trying to find the evidence to support some claims. I ran an hfs and vxfs test in parallel and TOC'd my system. I've worked my way through the virtual memory structures but they seem to be the same for both tests. I think the answer will be in the page directory entries and hardware-dependent structures that are close to the TLB entries on the cpu. I'm hoping to see the vxfs pages marked as dirty/invalid here. Tricky business though, hoping to spend some time on it tomorrow. I'm also going to run a more intense version of the kernel tracing and set up some VM tracing too.

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

I tried another series of tests, this time with just 1 msync per "commit" over the entire mmap (i.e. file). The results were similar to before. The total msync time per commit seems to be limited to about 2 seconds (and almost always IS about 2 seconds). The total read time for the test is (again) always around 30 seconds.

This suggests that the close of the file (or unmapping) releases all the cached pages for the file. Since the mapping is global, I did a couple more tests where I held another test process (with gdb) just after the mmap call so as to keep the file mapped. The read times for the 2nd test dropped from about 28 seconds to 3! A lot of the time was shifted to msync (maybe because the 1 msync has to read in pages for all the gaps?? ...or else the extra buffer cache scanning?), but the overall time for the "10K by 1K" test dropped from 35 to about 22 seconds (comparable to HFS). Glance confirmed the drop in Phys. Reads and Total IO and the shift from Disk faults to Mem faults.

To lower the time even further, I changed the test back to "normal" msyncs (vs. just 1) and the overall time dropped to 11!

This may just be good enough for a workaround design...
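The workaround would just be a trivial "holder" process that keeps the file mapped for the lifetime of the application - something along these lines (a hypothetical sketch, nothing is designed yet):

// Keep the index file mapped (read-only is enough) so its pages stay
// associated with a live mapping and are not thrown away between the
// commits done by the real worker processes.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "tmmap.dat";   // file to hold
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 1;

    struct stat st;
    fstat(fd, &st);
    mmap(0, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    pause();        // just sit here keeping the mapping alive
    return 0;
}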

Later,
Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Do you know why VxFS-file pages are NOT retrieved from memory? If this is something that can be affected by a kernel parameter or newer release, that would be preferable to writing extra "workaround" code AND may solve (or at least reduce) the fsync problem as well.

Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Regarding HFS and [mf]syncs, I don't observe the same I/O patterns you do (i.e. from your first 11/14 post, a delay at process exit while pages are written to disk). Since this is relevant to the required semantics of a commit, I'm wondering if you have any ideas why it doesn't work the same way for me.

Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

It will be in low-level vxfs code and I don't believe you can modify the behaviour with a kernel parameter change. I think the philosophy is "if we don't have an open file descriptor we assume the pages are invalid, hence require a full read-in - we can't tell if there has been a write() since the last mmap". However, if you have the file open, the timestamp information will be in the active inode entry, hence the kernel will know if the data is valid or stale. The vxfs and hfs times are pretty much identical on the first invocation of your test program. Curiously, I ran the hfs test a second time, ensuring the inode for the tmmap.dat file was different, and no memory or disk faults occurred. Very strange.

As for my hfs tests, I still get the same results:

timer name              (#starts)  total msec
----------------------  ---------  ----------
create+extend file      (1)                44
reads                   (50000)           700
writes                  (10000)           738
10000 msyncs were done (100%)
msyncs                  (10000)         40431

real 1:10.73
user 0.46
sys 1.88

I assume this won't be pretty with the alignment issues, but the time recorded for the actual functions was ~42 secs. The total time is 1 min 10 secs, so my hfs test(s) took 28 seconds to close the file. I have no idea why you are not seeing this - the hfs code hasn't changed much between the releases (I'm on 11i) to my knowledge.

I have to apologise - it seems I am being less and less help here as we go on! :-(

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

FYI, if I only open the test file in the test (under gdb), it doesn't help. It has to also have the file mapped (to retain the pages in memory).

Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

I was reading a memory management white paper (circa 1997 - 10.30) and convinced myself that if the descriptions of kernel structures were still accurate, there's no way for the kernel to reclaim a mapped page once it's been freed (the only file info regards swap-backed pages). However, what I did read got me thinking...

Since I was out of GOOD ideas, I wondered what would happen if I used MAP_PRIVATE for the mappings instead of MAP_SHARED (allowable in my app since there's only ever one writer at a time, and mods are msynced before allowing another writer). My hopeful reasoning was that, somehow, the fact that modified pages would have to have swap-space backing would give the kernel a way to hang onto them...
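The change itself is just the mapping flag (a minimal sketch; note that stores to a MAP_PRIVATE mapping are copy-on-write and private to the process, so they aren't written back to the file by msync):

// Same mapping code as before, only the flag changes: MAP_PRIVATE gives
// copy-on-write pages backed by swap rather than by the file itself.
// Modifications stay private to the process and never reach the file, so
// this only makes sense as an experiment unless the data is pushed out
// some other way.
#include <sys/mman.h>
#include <sys/types.h>

void *map_region(int fd, size_t len, off_t off, bool use_private)
{
    int flags = use_private ? MAP_PRIVATE : MAP_SHARED;
    return mmap(0, len, PROT_READ | PROT_WRITE, flags, fd, off);
}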

Well, the times for one test proceeded to drop from 45 seconds to about 20 (after a couple of runs) and a few runs later to 10! I don't yet understand exactly what's happening (for example, there seem to be plateaus that take a while to get past), and will investigate further.

Later,
Kris