Operating System - HP-UX

2-second delays in fsync/msync/munmap

 
SOLVED
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

There's a slightly newer version of the white paper (Sept 2000) here:

http://www.docs.hp.com/hpux/onlinedocs/os/11i/mem_mgt.html

As you say the kernel structures are still very similar.

In terms of the kernel reclaiming mapped pages, what I have found is that when running an HFS test for the second time, nvalid (the number of valid pages) in the pregion is almost the size of the region; on VxFS it is reset to 0. That is why HFS is resolving the page faults in memory. The pages in the page cache will remain, of course, until they are reused by another process. What seems to happen when releasing the VxFS mmap is in effect the same as the invalidate flag in msync. If you were to test HFS using this flag to invalidate all pages, I assume subsequent tests would show the same results as VxFS; I'll confirm this if I get a chance today. If you create a new map for the VxFS file and leave it set between tests, the number of valid pages resembles that of HFS on subsequent invocations.

The use of MAP_PRIVATE basically negates a lot of the problems we have talked about; I never mentioned it as I assumed you needed the segment shared. It takes care of the cache coherency issues because the segment is in the process's data segment space and will not be in the page cache to be shared. However, I'm not aware of a way to save the changes back to the file, at least without an extra copy, which would just be like using the buffer cache anyway; and even if you can, I think there would be added complexity in growing the underlying file.

Cheers,

James.

Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Hi. I'm now looking at the MAP_PRIVATE option. I hadn't wanted to go that way partly because of the extra swap allocations and the (supposed) inefficiency of not sharing the pages. However, the performance actually seems to be better (with VxFS at least) and it may avoid some other "features" of mapping shared. I wasn't sure though how it took care of the "cache coherency issues" or which other "problems". For example, might cache coherency actually become more complicated since more copies of the same file page may exist? Also, why "added complexity" in extending the file?

FYI, I've just spent a couple days tracking down a bug caused by a disappointing "feature" of HP-UX (at least at my release and patch level).
If you pwrite something, then later mmap(MAP_SHARED) that region and read from the map, you're not guaranteed that you'll see what you pwrote. I had tried to test for this but apparently didn't test enough. Oh well.

My main concern right now is whether an msync of MAP_PRIVATE pages will invalidate other processes' MAP_PRIVATE pages. It wasn't clear from the man pages and I'm a little gun-shy now about determining semantics via testing. Do you know anything definitive?

Thanks,
Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

If I haven't understood your questions here, let me know; I've got a feeling I haven't.

When you map private, the process maps the file into its data quadrant (q2) instead of the shared quadrants (q3/q4). As the data is now private to the process, the backing store is swap space, hence the extra reservation. However, this also means that any changes to the mapped area will not be written back to the front store (i.e. the file you mapped from). Once the process exits, the data is discarded. My point was that you would need extra copying within the user's address space to write the data back to the file.

For the pwrite: does using O_SYNC not help? This sounds exactly like the symptoms a dual cache will bring.

Following on from my first point, I don't think msync on a private region will do anything. Also, each process mapping a file private gets its own copy; I believe it uses copy-on-write from the shared region to improve performance.

As I said, I might be misinterpreting the question, let me know if I am.

Cheers,

James.
Tim Sanko
Trusted Contributor

Re: 2-second delays in fsync/msync/munmap

Kris,

I have a question: are you on an EMC RAID, in particular a Symmetrix?

At one time HFS had a major performance advantage over VxFS on EMC. The configuration of EMC RAIDs (bin files, etc.) can also be done with several newer products such as ECC or SymmConsole.

The real issue may be coordinating the hardware details with the software advice. You mention the RAID, but not the make or model.

What do you have, and what other third-party software?

Tim
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

You're right. My understanding of MAP_PRIVATE was wrong. I hadn't verified that the changes were actually flushed to disk. No wonder it was so much faster. :-(

Maybe I'll try substituting pwrites for msyncs and see what kind of performance I get. The mapped pages will then simply be acting as a local cache, swapped as needed by the kernel.

Later,
Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Well, so much for that idea. I forgot how bad the performance gets when you mix real writes with mmaps.

Back to MAP_SHARED and coding workarounds for all of HP's "idiosyncrasies".

Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Hi James,

Another interesting tidbit.
I put an msync(MS_SYNC) before my munmap call (if there were modified pages) so that I would know that any dirty pages had been flushed. I had also hoped that this might abolish the 2-second delay that I see for almost every munmap. Unfortunately, it didn't. What's interesting is that it doesn't matter whether there were any dirty pages in the map or not (I had previously thought that it was the flushes done by munmap that somehow caused the delay). The more I research, the more confounded I get...

Later,
Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Hi James,

I hope you're still reading these.
I've run into another wall. I've been adding workaround code to handle certain HP limitations with mmap(MAP_SHARED), such as (FYI):
- (architectural) if an mmap is requested that only partially overlays an existing region (e.g. due to another process's maps), ENOMEM is returned.
- (bug?) a second mmap attempt on the same region (i.e. not overlapping the first mmap) will sometimes fail with ENOMEM. If the first map is deleted [and recreated], the second mmap works OK.

The new problem (bug) I've run into doesn't seem to have a workaround. What happens is that at some point my process ends up with a region (according to pstat_getprocvm) for no apparent reason (i.e. no mmaps exist on it, thus presumably no pregions). I'm then unable to create an mmap on this region. Some more details...

In one test, I had never created ANY mmaps on the region (in that process). In another, I had created several, then deleted all of them (to see if this would then allow the mmap).

For the latter case, I probed the (now unmapped) region in gdb and found a set of qualities that were true for the first page of the large (1176K) region but weren't for (apparently) the rest of the pages:
- an mmap works
- an msync(MS_INVALIDATE) doesn't work
- a read reference fails

My best WAG would be that somehow only the first page of the region was "removed" from the process' virtual address space (vas), thus causing that page to act correctly but leaving the rest of the pages in an unusable state (for this process). I tried to find patch reports or ITRC posts along these lines but was unsuccessful.
Do you have ANY idea what's going on here?!

Frustrated in Rockville.
Jim Butler
Valued Contributor
Solution

Re: 2-second delays in fsync/msync/munmap

Kris
It's snowing here, and at this point you may not care, but if you are still having the problem, I will give you my 2 cents.

First, dbc_max_pct controls the buffering, right, and you have 4 GB of RAM and a proprietary DBMS (which could mean anything). From my recollection, Oracle, for example, recommends that you never have over 128 MB of buffer space in the OS, and I believe HP recommends never exceeding 300 MB. So if you set your dbc_max_pct to 10, you would have 400 MB of buffer space, which is too much. That is the first thing: calculate what roughly 256 MB would be (e.g. 7% = 280 MB), set that, and forget about it.

Next, look at maxdsiz, maxtsiz and maxssiz.

The most important, IMO, is maxdsiz; on DB servers I like to bump that to a range of 500 MB to 800 MB. I believe that HP-UX 11 will only address 800 MB at 32 bits and 2 GB at 64 bits, but check that parameter.

The stack and text parameters can be bumped if needed (I usually double or quadruple the default sizes).

Hey - good luck
Man The Bilge Pumps!
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

I did actually miss your last few posts, I've put email notification on now though. :-)

In regards to your last reply, I have no idea what's going on. Maybe post some output or the source you used to investigate this? I might be telling you stuff you already know, but it may be worth clarifying a few points:

> You are correct in saying that MAP_SHARED regions cannot overlap, even between different processes.
> Any attempt to map a subsection of an existing map within the same process will fail with ENOMEM, while a different process can map it without problem.
> You can map different ranges within the same process, including holes between existing maps.
> If you unmap a section of an existing map, I believe the pregion is left intact but the protection is changed on those page(s). This should also keep the reference bit set in the region.

Also, can I assume you are working exclusively with vxfs in all these tests? Can you also let me know your current fs/vm patch levels?

Cheers,

James.
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Jim N,

Thanks for the guidance on dbc_max_pct. I had reduced it from 50% to 15% earlier in this process, but this time I reduced it to 3% (min = 2%). I also changed some other params that I had planned to (once I had a window of opportunity), mostly spinlocks that were not at the default values. I didn't change maxdsiz as it was already at 1 GB.

The result was that my worst-case times were reduced by about a factor of 3! Also, the 2-second delays that I've seen at times seem to have vanished! I'm still not getting quite the performance I'd expect, but it's a good deal closer.

Thanks,
Kris
Jim Butler
Valued Contributor

Re: 2-second delays in fsync/msync/munmap

Here is a link to an old post - you can reference some other kernel params that another user found helpful

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=15856

Good Luck

Jim
Man The Bilge Pumps!
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

I have some good news and bad news.
The good news is that I'm no longer seeing the 2-second delays in msync/munmap/fsync (see prior post), and running on VxFS is now as good as or FASTER than on HFS.
The bad news is that I don't know what happened to make VxFS run faster. I noticed it right after Thanksgiving, and proceeded to do a bunch more testing to make sure I didn't make some other blunder that caused artificial results and to see why exactly the times had improved. I don't recall making any real changes during that time, and the other 2 guys here who might have changed something said they didn't (e.g. no patches were loaded that week). I'm a little frustrated that I don't understand what happened, but it's better. I may take some time to research more once I get past our next release date (i.e. real soon).

The VxFS improvement seems to be caused by VxFS now being able to find pages in memory rather than having to go out to disk again. For example, comparing some old and new glance metrics for the "10K by 1K" test:
                  old      new
VM/Phys Reads:    4K       356
VM/Phys Writes:   9K       10K
Phys IO Rate:     330      704
Total IO Bytes:   129MB    52MB
Vir Faults:       3.6K     9.8K
Mem Faults:       36       9.2K
Disk Faults:      3.5K     307
Elapsed time:     41       15
These numbers are especially good considering the test files I'm using have grown about 150%.

As for replies to your last post, if you'd still like some more info, let me know. Otherwise, I'll consider this thread done!

Thanks,
Kris
James Murtagh
Honored Contributor

Re: 2-second delays in fsync/msync/munmap

Hi Kris,

It looks like the delays were caused by the scanning of the buffer cache after an munmap, or possibly even msync; it depends which system calls called the function that does it. If you reduced the buffer cache, then it has a lot less to scan. If you want to test conclusively, I would set your cache to be static and try low/medium/high settings for it while measuring the delays. I had just assumed it was already set very low, as you found the performance a lot better when you mistakenly lowered the buffer cache to a negligible figure.

For the vxfs memory faults - have you still got another process holding the file open before your main process runs? As long as someone has the file mapped the shared region will be preserved and page faults serviced from memory. It seems only when the file is completely unmapped that it will have to fault each page in from disk.

cheers,

James.

Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

Hi James,

I had tried setting bufpages to 75000 (300 MB) at one time but it didn't help. When I started getting the ENOMEM errors, I switched to a dbc_max_pct of 15% (thinking that the smaller buffer was somehow to blame). I'm not sure now why the 300 MB cache size didn't help. I'm probably going to experiment some more (when I have the machine to myself) to get more info on the effect of the cache size on both the 2-second delays and the penalty for mixed real and mapped I/O. I'll let you know what I find out.

As for the VxFS times, there isn't any other process holding onto any file maps (just being open isn't enough BTW).

Later,
Kris
Kris Kelley
Frequent Advisor

Re: 2-second delays in fsync/msync/munmap

James,

Just an FYI. After some further testing, I determined that an fsync (to flush pages modified via pwrite) will NOT invalidate any corresponding mapped pages (i.e. when currently mapped by a different process). Since various HP quirks prevent me from being able to mmap at times, thus forcing me to use pread/pwrite, I had hoped that the fsync would do this invalidation. Since that is apparently not the case, I have no choice but to scrap use of mmap on HP. Hopefully things will work much better in 11.31.

Thanks,
Kris