The information contained within this document is subject to change without notice.
HEWLETT-PACKARD MAKES NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Hewlett-Packard shall not be liable for errors contained herein nor for incidental consequential damages in connection with the furnishing, performance, or use of this material.
Warranty. A copy of the specific warranty terms applicable to your Hewlett-Packard product and replacement parts can be obtained from your local Sales and Service Office.
Restricted Rights Legend. Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 for DOD agencies, and subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software Restricted Rights clause at FAR 52.227-19 for other agencies.
Copyright Notices. (C)copyright 1983-2000 Hewlett-Packard Company, all rights reserved.
This documentation contains information that is protected by copyright. All rights are reserved. Reproduction, adaptation, or translation without written permission is prohibited except as allowed under the copyright laws.
(C)Copyright 1981, 1984, 1986 UNIX System Laboratories, Inc.
(C)copyright 1986-1992 Sun Microsystems, Inc.
(C)copyright 1985-86, 1988 Massachusetts Institute of Technology.
(C)copyright 1989-93 The Open Software Foundation, Inc.
(C)copyright 1986 Digital Equipment Corporation.
(C)copyright 1990 Motorola, Inc.
(C)copyright 1990, 1991, 1992 Cornell University.
(C)copyright 1989-1991 The University of Maryland.
(C)copyright 1988 Carnegie Mellon University.
Trademark Notices. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.
NFS is a trademark of Sun Microsystems, Inc.
OSF and OSF/1 are trademarks of the Open Software Foundation, Inc. in the U.S. and other countries.
First Edition: April 1997 (HP-UX Release 10.30)
Second Edition: September 2000 (HP-UX Release 11.11)
The memory management system is designed to make memory resources available safely and efficiently to threads and processes:
The data and instructions of any process (a program in execution) or thread of execution within a process must be available to the CPU by residing in physical memory at the time of execution.
To execute a process, the kernel creates a per-process virtual address space that is set up by the kernel; portions of the virtual space are mapped onto physical memory. Virtual memory allows the total size of user processes to exceed physical memory. Through "demand paging", HP-UX enables you to execute threads and processes by bringing virtual pages into main memory only as needed (that is, "on demand") and pushing out portions of a process's address space that have not been recently used.
The term "memory management" refers to the rules that govern physical and virtual memory and allow for efficient sharing of the system's resources by user and system processes.
The system uses a combination of pageout and deactivation to manage physical memory. Paging involves periodically writing pages that have not been recently referenced from main memory to disk. A page is the smallest unit of physical memory that can be mapped to a virtual address with a given set of access attributes. On a loaded system, such unreferenced pages might be a large fraction of memory.
Deactivation takes place if the system is unable to maintain a large enough free pool of physical memory. When an entire process is deactivated, the pages associated with the process can be written out to secondary storage, since they are no longer referenced. A deactivated process cannot run, and therefore, cannot reference its data.
Secondary storage supplements physical memory. The memory management system monitors available memory and, when it is low, writes out pages of a process or thread to a secondary storage device called a swap device. The data is read from the swap device back into physical memory when it is needed for the process to execute.
On a PA-RISC system, every page of physical memory is addressed by a physical page number (PPN), which is a software "reduction" of the physical address. Access to pages (and thus to the data they contain) is done through virtual addresses, except under specific circumstances: when virtual translation is turned off (the D and I bits are off), pages are accessed by their absolute addresses.
When a program is compiled, the compiler generates virtual addresses for the code. Virtual addresses represent a location in memory. These virtual addresses must be mapped to physical addresses (locations of the physical pages in memory) for the compiled code to execute. User programs use virtual addresses only.
The kernel and the hardware coordinate a mapping of these virtual and physical addresses for the CPU, called "address translation," to locate the process in memory.
The PA-RISC architecture is segmented; a complete virtual address consists of a space identifier (SID) and an offset within that space.
The offset may be 32 or 64 bits wide; earlier PA-RISC processors (before PA-RISC 2.0) support only 32-bit offsets.
From the point of view of a user program, the segmentation is not obvious; instead, user programs see an almost flat address space with either 32-bit or 64-bit virtual addresses (depending on how the process was compiled).
The kernel, however, deals in the full complexity of space and offset.
From the kernel point of view, every process running on a PA-RISC processor shares a single global virtual address space, with global virtual addresses (GVAs) composed of both space and offset. (These GVAs are 96 bits wide on PA-RISC 2.0 processors running in 64-bit (wide) mode, and smaller on earlier processors.) This global virtual address space is also shared by the kernel.
Although any process can create and attempt to read or write any global virtual address, the kernel uses page granularity access control mechanisms to prevent unwanted interference between processes.
When a virtual page is "paged" into physical memory, free physical pages are allocated to it by the physical memory allocator. These pages may be randomly scattered throughout the memory depending on their usage history. Translations are needed to tell the processor where the virtual pages are loaded. The process of translating the virtual into physical address is called virtual address translation.
Potentially, the virtual address space can be much greater than the physical address space. The virtual memory system enables the CPU to execute programs much larger than the available physical memory and allows you to run many more programs at a time than you could without a virtual memory system.
The more main memory in the system, the more data the system can access and the more (or larger) processes it can retain and execute without having to page or cause deactivation as frequently. Memory-resident resources (such as page tables) also take up space in main memory, reducing the space available to applications.
At boot time, the system loads HP-UX from disk into RAM, where it remains memory-resident until the system is shut down.
User programs and commands are also loaded from disk into RAM, but in small portions as they are needed. When a program terminates, the operating system frees the memory used by the process.
Disk access is slow compared to RAM access. Excessive disk access can lead to increased latency or reduced throughput and can lead to the disk access becoming the bottleneck in the system. To avoid this, you need to do some sort of buffering. Buffering, paging, and deactivation algorithms optimize disk access and determine when data and code for currently running programs are returned from RAM to disk. When a user or system program writes data to disk, the data is either written directly from the program's RAM (e.g. if writing to a "raw" device) or buffered in what is called the buffer cache and written to disk in relatively big chunks. Programs also read files and database structures from disk into RAM. When you issue the sync command before shutting down a system, all modified buffers of the buffer cache are flushed (written) out to disk.
On each processor, there are also registers and cache, which are even faster than main memory. Program execution actually happens in registers, which get data from the cache and other registers. The cache contains the current working copy of parts of main memory. Most of the time when discussing memory management, cache and registers will be completely ignored; data and instructions will be treated as being accessed directly from main memory. They are mentioned here in an attempt to reduce confusion:
From this point on, this section only discusses "main memory".
(Figure: physical memory at bootup. The HP-UX kernel occupies part of physical memory; the remainder is available memory, a portion of which is lockable memory.)
Not all physical memory is available to user processes. Kernel text and initialized data occupy about 10 MB of RAM; additional memory is used by kernel bss (uninitialized data), and (especially) various structures allocated during kernel boot. Many of the structures allocated during kernel boot can be quite large. The sizes of some are determined by kernel tunables, but many are sized based on the amount of physical memory in the system, e.g. such a structure might have one 96 byte entry for every 4096 byte page of physical memory.
Instead of allocating all its data structures at system initialization, the HP-UX kernel dynamically allocates and releases some kernel structures as needed by the system during normal operation. This allocation comes from the available memory pool; thus, at any given time, part of the available memory is used by the kernel and the remainder is available for user programs.
Physical address space is the entire range of addresses used by hardware (4 GB on 32-bit (narrow mode) kernels), and is divided into memory address space, processor-dependent code (PDC) address space, and I/O address space. The next figure shows the expanse of memory available for computation. Memory address space takes up 15/16 of the system address space, while the address space allotted to PDC and I/O consumes a relatively small range of addresses.
+-----------+
0x00000000| page zero |
+-----------+
| |
| | +-----------------------+
| Memory | /| PDC address space |0xF0000000
| address | / | |
| space | / +-----------------------+
| | / | |0xF1000000
| | / | |
| | / | I/O Register |
0xF0000000+-----------+/ | address |
| PDC & I/O | | space |
0xFFFFFFFF+-----------+ | |
\ | |
\ +.......................+
\ | Central bus |
\ | address space |
\ +.......................+
\ | Broadcast address |0xFFFC0000
\| space (local, global) |0xFFFFFFFF
+-----------------------+
+-----------------------+
0x00000000 00000000| page zero |
+.......................+
| |
| |
| |
| |
| |
| |
| Memory |
| address |
| space |
| |
| |
| |
| |
| |
| |
| |
| |
+-----------------------+
0xF0000000 00000000| PDC address space |
0xF1000000 00000000| |
+-----------------------+
| I/O Register |
| address |
| space |
+.......................+
| Central bus |
| address space |
+.......................+
0xFFFFFFFF FFFC0000| Broadcast address |
0xFFFFFFFF FFFFFFFF| space (local, global) |
+-----------------------+
Pages kept in memory for the lifetime of a process by means of a system call
(such as mlock, plock, or shmctl) are
termed locked memory. Locked memory cannot be paged and processes with locked
memory cannot be deactivated. Typically, locked memory holds frequently accessed
programs or data structures, such as critical sections of application code.
Keeping them memory-resident improves application performance.
The lockable_mem variable tracks how much memory can be locked. Available memory is the portion of physical memory that remains after subtracting the space required for the kernel and its data structures. The initial value of lockable_mem is the available memory on the system after boot-up, minus the value of the system parameter unlockable_mem.
The value of lockable memory depends on several factors:
- unlockable_mem is a kernel tunable parameter. Changing the value of unlockable_mem also alters the initial value of lockable_mem. HP-UX places no explicit limit on the amount of available memory you may lock down; instead, HP-UX restricts how much memory cannot be locked.
- Other kernel resources that use memory (such as the dynamic buffer cache) can cause changes.
As the amount of memory that has been locked down increases, existing processes compete for a smaller and smaller pool of usable memory. If the number of pages in this remaining pool falls below the paging threshold called lotsfree, the system activates its paging mechanism by scheduling vhand, in an attempt to keep a reasonable amount of memory free for general system use.
Care must be taken to allow sufficient space for processes to make forward progress; otherwise, the system is forced into paging and deactivating processes constantly, to keep a reasonable amount of memory free.
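Conceptually, the trigger reduces to a comparison against the threshold, as in the sketch below; freemem and the wakeup call are illustrative stand-ins for the kernel's actual bookkeeping, with lotsfree playing the role named above.

```c
/* Conceptual sketch only: freemem, lotsfree, and wakeup_vhand() are
 * stand-ins, not the kernel's real interfaces. */
extern unsigned long freemem;   /* pages currently free       */
extern unsigned long lotsfree;  /* paging threshold (tunable) */

void wakeup_vhand(void);        /* schedule the pageout daemon */

void check_paging_threshold(void)
{
    /* When the free pool shrinks below lotsfree, vhand is scheduled
     * to age and steal pages until enough memory is free again. */
    if (freemem < lotsfree)
        wakeup_vhand();
}
```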
When the system is short of main memory, data is moved out to secondary storage to make room for active processes. The data is typically stored on disks accessible either via system buses or over the network.
Swap refers to a physical memory management strategy (predating UNIX) where entire processes are moved between main memory and secondary storage. Modern virtual memory systems today no longer swap entire processes, but rather use a paging scheme, where individual pages of data and instructions can be paged in from secondary storage as needed, or paged out again to free up memory for other uses. This is backed up by a deactivation scheme that allows whole processes to be pushed out if the system is desperately short of memory. However, the secondary storage dedicated to storing paged out data is still referred to as "swap space".
Device swap can take the form of an entire disk or LVM(1)
logical volume of a disk. A file system can be configured to offer free space
for swap; this is termed file-system swap. If more swap space is required, it
can be added dynamically to a running system, as either device swap or
file-system swap. The swapon command is used to allocate disk space
or a directory in a file system for swap.
(1) Logical Volume Manager (LVM) is a set of commands and underlying software to handle disk storage resources with more flexibility than offered by traditional disk partitions.
A computer has a finite amount of RAM available, but each 32-bit HP-UX process has a 4 GB virtual address space apportioned in four one-gigabyte quadrants. (64-bit HP-UX processes have an even larger virtual address space, though they cannot actually use the full 16-exabyte range of virtual addresses addressable with 64 bits. It too is broken into four equal-sized quadrants.) This is termed virtual memory.
Virtual memory is the software construct that allows each process sufficient computational space in which to execute. It is accomplished with hardware support.
As software is compiled and run, it generates virtual addresses that provide programmers with memory space many times larger than physical memory alone.
HP-UX is a Shared Address Space (SAS) operating system. A given virtual address (including space ID) refers to the same page of memory for all processes; translations are not changed when the process context changes.
Thus, the number of bits available for the space ID (segment) and offset (often simply called "virtual address") determines the ultimate size of the total virtual address space available to the kernel and all processes together.
As PA-RISC evolved, the number of bits usable for space and offset has increased. On PA-RISC 2.0, the space ID is 32 bits (18 bits actually used in HP-UX 11.11) and the offset is effectively 42 bits (though stored in a 64-bit field). (PA-RISC 1.1 systems, and PA-RISC 2.0 running in narrow (32-bit) mode, have a smaller offset.)
NOTE: Understand, however, that a single process has significant
limitations on the virtual address space it is allowed to access. For example, a
32-bit SHARE_MAGIC executable text is limited to 1 GB and data is
limited to 1 GB. Also, the total amount of shared virtual address space in the
system is limited to much less than theoretically addressable; without using
memory windows, the total shared space on a wide mode (64-bit) system is limited
to approximately 8 TB (that is, two quadrants).
A physical address points to a page in memory that represents 4096 bytes of data. The physical address also contains an offset into this page. Thus, the complete physical address is composed of a physical page number (PPN) and a page offset. The PPN is the 20 or 52 most significant bits of the physical address where the page is located. These bits are concatenated with a 12-bit page offset to form the 32-bit or 64-bit physical address.
     Page Number       Page Offset
+--------------------+------------+
|00000000000000000100|100001110011|
+--------------------+------------+
 0                 19 20        31
                     Page Number                          Page Offset
+----------------------------------------------------+------------+
|0000000000000000000000000000000000000000000000000100|100001110011|
+----------------------------------------------------+------------+
 0                                                 51 52        63
To handle the translation of a virtual address to a physical address, the virtual address also needs to be viewed as a virtual page number (VPN) and a page offset. Since the page size is 4096 bytes, the low-order 12 bits of the offset are taken to be the offset into the page. The space ID and the high-order bits of the offset are the VPN.
For any given address you can determine the page number by discarding the least significant 12 bits. What remains is the virtual page number for a virtual address or the physical page number for the physical address.
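As a concrete illustration, the split is a shift and a mask; the constants below follow the 4096-byte page size described above (this snippet is illustrative, not kernel code).

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                     /* 4096-byte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

int main(void)
{
    uint32_t vaddr  = 0x4873;              /* the example address below    */

    uint32_t vpn    = vaddr >> PAGE_SHIFT; /* discard low 12 bits -> 0x4   */
    uint32_t offset = vaddr & PAGE_MASK;   /* keep low 12 bits    -> 0x873 */

    printf("VPN = 0x%x, page offset = 0x%x\n", vpn, offset);
    return 0;
}
```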
The next figure shows the bit layout of a 32-bit virtual address of 0x0.4873.
32-bit Space ID 32-bit Offset
+--------------------------------+--------------------+------------+
|00000000000000000000000000000000|00000000000000000100|100001110011|
+--------------------------------+--------------------+------------+
| | | |
+----------------------------------------------------+ +-----------+
| |
VPN = 0x4 Page Offset
0x873
The virtual page number must be translated to obtain the associated physical page number; the page offset, 0x873, is carried over unchanged.
+---------------------------------------------------+
| +--------------------+ |
| | Central Processing | |
| | Unit (CPU) | +-------------------+ |
| +--------------------+ | Floating Point | |
| |-------------->| Coprocessor | |
| | +-------------------+ |
| |------------------------+ |
| | | |
| V V |
| +--------------------+ +-------------------+ |
| | | | Translation | |
| | Cache | | Lookaside Buffer | |
| | | | (TLB) | |
| +--------------------+ +-------------------+ |
| | | |
| |<-----------------------+ |
| +--------------------+ |
| | System Interface | |
| | Unit (SIU) | |
| +--------------------+ |
| | |
+------------V--------------------------------------+
| Central Bus
==================================================================
The figure above and the table that follows name the principal processor components; of these, the registers, translation lookaside buffer, and cache are crucial to memory management, and are discussed in greater detail following the table.
| Component | Purpose |
|---|---|
| Central Processing Unit (CPU) | The main component, responsible for reading programs and data from memory and executing program instructions. The cache and translation lookaside buffer described below are part of the CPU. |
| Instruction and Data Cache | The cache is a portion of high-speed memory used by the CPU for quick access to data and instructions. The most recently accessed data is kept in the cache. |
| Translation Lookaside Buffer (TLB) | The processor component that enables the CPU to access data through virtual address space by translating virtual addresses to physical addresses. |
| Floating Point Coprocessor | An assist processor that carries out specialized tasks for the CPU. |
| System Interface Unit (SIU) | Bus circuitry that allows the CPU to communicate with the central (native) bus. |
The translation lookaside buffer (TLB) translates virtual addresses to physical addresses.
(Figure: the TLB translates between virtual address space and physical address space.)
Address translation is handled from the top of the memory hierarchy hitting
the fastest components first (such as the TLB on the processor) and then moving
on to the page directory table (pdir in main memory) and lastly to
secondary storage.
The TLB looks up the translation for the virtual page numbers (VPNs) and gets the physical page numbers (PPNs) used to reference physical memory.
Virtual address Main Memory
+-------------------+-----------+ +--------+
|Virtual Page Number|Byte Offset| | 0 |
+-------------------+-----------+ | |
| | | |
| +-------------------+ | |
V | | |
VPN PPN Rights ID O U T D P | | |
+------------+-------+----+---+-+-+-+-+-+ | | |
| | | | | | | | | | | +------>[] |
+------------+-------+----+---+-+-+-+-+-+ | PPN | | |
T| | | | | | | | | | | + | | |
L+------------+-------+----+---+-+-+-+-+-+ | Offset| | |
B| | | | | | | | | | | | | |
+------------+-------+----+---+-+-+-+-+-+ | | | |
| | | | |
V Physical address V | | |
+--------------------+-----------+ | | |
|Physical Page Number|Byte Offset|---+ |physmem |
+--------------------+-----------+ +--------+
Ideally, the TLB would be large enough to hold translations for every page of physical memory; however, this is prohibitively expensive. Instead, the TLB holds a subset of entries from the page directory table (PDIR) in memory. The TLB speeds up the process of examining the PDIR by caching copies of its most recently used translations.
Because the purpose of the TLB is to satisfy virtual to physical address translation, the TLB is only searched when memory is accessed while in virtual mode. This condition is indicated by the D-bit in the PSW (or the I-bit for instruction access).
Depending on model, the TLB may be organized on the processor in one of two ways: as a single unified TLB holding both instruction and data translations, or as separate instruction and data TLBs.
The advantage of having a split Data TLB (DTLB) and Instruction TLB (ITLB) is that it is possible to account for the different characteristics of data and instruction locality and type of access (frequent random access of data versus relatively sequential single usage of instructions).
Because TLB size is limited, it is desirable to use as few entries as possible to translate the largest possible amount of memory. PA-RISC 2.0 processors provide a variable page size, and memory is organized to use large page sizes wherever this is reasonable. In particular, the memory initially allocated for the kernel at boot time is mapped with the largest possible page size that fits it. (Other memory will be mapped with large pages if possible, but there are tradeoffs that may make this impractical, especially on small memory systems.)
PA-RISC processors before PA-RISC 2.0 do not support a general-purpose variable page size. Instead, they may provide a block TLB. The block TLB is quite small, but its entries can map more than a single 4K page (i.e. multiple hpdes). Block TLB entries are used to reference kernel memory that remains resident. (Memory referenced by a block TLB entry cannot be paged out.) The block TLB is typically used for graphics, because graphics data is accessed in huge chunks. It is also used for mapping other static areas such as kernel text and data.
Since the TLB translates virtual to physical addresses, each entry contains both the Virtual Page Number (VPN) and the Physical Page Number (PPN). Entries also contain Access Rights, an Access Identifier, and five flags.
| Flag | Name | Meaning |
|---|---|---|
| O | Ordered | Accesses to data for load and store are ranked by strength -- strongly ordered, ordered, and weakly ordered. (See PA-RISC 2.0 specifications for model and definitions.) |
| U | Uncacheable | Determines whether data references to a page from memory address space may be moved into the cache. Typically set to 1 for data references to a page that maps to the I/O address space or for memory address space that must not be moved into cache. |
| T(1) | Page Reference Trap | If set, any access to this page causes a reference trap to be handled either by hardware or software trap handlers. |
| D | Dirty | When set, this bit indicates that the associated page in memory differs from the same page on disk. The page must be flushed before being invalidated. |
| B | Break | This bit causes a trap on any instruction that is capable of writing to this page. |
| P | Prediction method for branching | Optional, used for performance tuning. |
(1) The T, D, and B flags are present only in data or unified TLBs.
In PA 1.x architecture, an E bit (or "valid" bit) indicates that the TLB entry reflects the current attributes of the physical page in memory.
The operating system maintains a table in memory called the Page Directory (PDIR) which keeps track of all virtual pages currently in memory. When a page is mapped in some virtual address space, it is allocated an entry in the PDIR. The PDIR is what links a virtual address to a physical page in memory.
The PDIR is implemented as a memory-resident table of software structures called hashed page directory entries (HPDEs), which contain virtual and physical addresses. When the processor needs to find a physical page not indexed in the TLB, it can search the PDIR with a virtual address to find the matching address.
The PDIR table is a hash table with collision chains. The virtual address is used to hash into one of the buckets in the hash table and the corresponding chain is searched until a chain entry with a matching virtual address is found.
Note that the page table is not a purely software construct. On systems that provide hardware for TLB miss handling, this is the table examined by the hardware to attempt to find an appropriate translation to insert in the TLB when resolving a TLB miss fault.
A trap occurs when a translation is missing from the translation lookaside buffer (TLB). If the processor can find the missing translation in the PDIR, it installs it in the TLB and allows execution to continue. If not, a page fault occurs.
A page fault is a trap taken when the address needed by a process is missing from main memory. This occurrence is also known as a PDIR miss. A PDIR miss indicates that the page is either on the free list, in the page cache, or on disk; the memory management system must then find the requested page on the swap device or in the file system and bring it into main memory.
Conversely, a PDIR hit indicates that a translation exists for the virtual address in the PDIR; the translation is inserted into the TLB and execution continues.
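In outline, resolving a TLB miss against the PDIR is a hash-and-walk over collision chains, as the sketch below shows; the hash function, bucket count, and pde fields here are simplified stand-ins for the real kernel structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified page directory entry: just enough to show the walk. */
struct pde {
    uint32_t    space;     /* virtual space ID     */
    uint32_t    vpage;     /* virtual page number  */
    uint32_t    ppn;       /* physical page number */
    struct pde *next;      /* collision chain      */
};

#define PDIR_BUCKETS 1024  /* illustrative table size */
extern struct pde *pdir[PDIR_BUCKETS];

/* Illustrative hash; the kernel's actual hash function differs. */
static size_t pdir_hash(uint32_t space, uint32_t vpage)
{
    return (space ^ vpage) % PDIR_BUCKETS;
}

/* On success the entry can be installed in the TLB (a PDIR hit);
 * NULL means a PDIR miss, so the page must be faulted in. */
struct pde *pdir_lookup(uint32_t space, uint32_t vpage)
{
    struct pde *p = pdir[pdir_hash(space, vpage)];

    while (p != NULL && (p->space != space || p->vpage != vpage))
        p = p->next;
    return p;
}
```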
The Hashed Page Directory Entry (hpde and hpde2_0) Structure

Each PDE contains information on the virtual-to-physical address translation, along with other information necessary for the management of each page of virtual memory.
PA-RISC 1.1 and PA-RISC 2.0 systems use different hashed page directory entry structures, with mostly similar field names and purposes. The following table combines the structural elements of the PA-RISC 1.1 hashed page directory entry (struct hpde) and the PA-RISC 2.0 hashed page directory entry (struct hpde2_0).
struct hpde and struct hpde2_0, the Hashed Page Directory
| Element | PA-RISC Version | Meaning |
|---|---|---|
| pde_valid | PA-RISC 1.1 | Flag set by the kernel to indicate a valid pde entry. |
| pde_invalid | PA-RISC 2.0 | Flag set by the kernel to indicate an invalid pde entry. |
| pde_vpage | both | Virtual page - the virtual offset divided by 4096. |
| pde_space | both | Contains the complete virtual space ID. |
| pde_rtrap | both | Data reference trap enable bit; when set, any access to the page causes a page reference trap interruption. |
| pde_dirty | both | Dirty bit; marked if the page differs in memory from what is on disk. |
| pde_dbrk | both | Data break; used by the TLB. |
| pde_ar | both | Access rights; used by the TLB.(1) |
| pde_uncache | both | Uncache bit. |
| pde_order | PA-RISC 2.0 | Strong ordering bit. |
| pde_br_predict | PA-RISC 2.0 | Branch prediction bit. |
| pde_ref_trickle | both | Trickle-up bit for references. Used with pde_ref on systems whose hardware can search the htbl directly. |
| pde_block_mapped | both | Block mapping flag; indicates the page is mapped by the block TLB and cannot be aliased. |
| pde_executed | both | Used by the stingy cache flush algorithm to indicate that the page is referenced as text.(2) |
| pde_ref | both | Reference bit set by the kernel when it receives certain interrupts; used by vhand to tell if a page has been used recently. |
| pde_accessed | both | Used by the stingy cache flush algorithm to indicate that the page may be in the data cache. |
| pde_modified | both | Indicator to the high-level virtual memory routines as to whether the page has been modified since last written to a swap device. |
| pde_uip | both | Lock flag used by trap-handling code. |
| pde_protid | both | Protection ID, used by the TLB. |
| pde_os | PA-RISC 2.0 | Entry in use. |
| pde_alias | both | Virtual alias field. If set, the pde has been allocated from elsewhere in kernel memory, rather than as a member of the sparse PDIR. |
| pde_wx_demote | PA-RISC 2.0 (64-bit kernels only) | User space fic. |
| pde_phys | PA-RISC 1.1 | Physical page number; the physical memory address divided by the page size (4096 bytes). |
| pde_phys_u | PA-RISC 2.0 | Physical page number: most significant 25 bits. |
| pde_phys | PA-RISC 2.0 | Physical page number: least significant 27 bits of the physical address divided by the page size. |
| var_page | PA-RISC 2.0 | Page size. |
| pde_next | both | Pointer to next entry, or null if end of list. |
(1) For detailed information on access rights, see the PA-RISC
2.0 Architectural reference, chapter 3, "Addressing and Access Control."
For information about how programs can manipulate this field, see
mmap(2) and mprotect(2) manpages.
(2) Stingy cache flush is a performance enhancement by which the kernel determines whether or not it needs to flush the cache.
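For orientation, the fields above might be pictured as a C bitfield structure along the following lines; the widths and ordering are illustrative assumptions, not the actual kernel layout.

```c
#include <stdint.h>

/* Illustrative sketch only: field widths and order are assumed,
 * not copied from the kernel headers. */
struct hpde_sketch {
    uint32_t pde_valid    : 1;    /* PA-RISC 1.1: entry is valid         */
    uint32_t pde_ref      : 1;    /* referenced recently (read by vhand) */
    uint32_t pde_dirty    : 1;    /* memory copy differs from disk       */
    uint32_t pde_modified : 1;    /* modified since last written to swap */
    uint32_t pde_ar       : 7;    /* access rights (used by the TLB)     */
    uint32_t pde_vpage;           /* virtual offset / 4096               */
    uint32_t pde_space;           /* complete virtual space ID           */
    uint32_t pde_protid;          /* protection ID (used by the TLB)     */
    uint32_t pde_phys;            /* physical page number                */
    struct hpde_sketch *pde_next; /* next entry on the hash chain        */
};
```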
Cache is fast, associative memory on the processor module that stores recently accessed instructions and data. From it, the processor learns whether it has immediate access to data or needs to go out to (slower) main memory for it.
Cacheable data going to the CPU from main memory passes through the cache. Conversely, the cache serves as the means by which the CPU passes data to and from main memory. Cache reduces the time required for the CPU to access data by maintaining a copy of the data and instructions most recently requested.
A cache improves system performance because most memory accesses are to addresses that are very close to, or the same as, previously accessed addresses. The cache takes advantage of this property by bringing a block of data into the cache whenever the CPU requests an address. Though the hit rate depends on the size of the cache, its associativity, and the workload, performance measurements show that the vast majority of accesses find their data already in the cache.
Depending on model, PA-RISC processors are equipped with either a unified cache or separate caches for instructions and data (for better locality and faster performance). In multiprocessing systems, each processor has its own cache, and a cache controller maintains consistency.
Cache memory itself is organized as follows:
Cache Tag
+---------------------------+-+-+--------------------+ /|\
| |v|d| | |
| |a|i| | |
|Physical Page Number (PPN) |l|r| Tag Parity Bits | |
| |i|t| | |
| |d|y| | |
+---------------------------+-+-+--------------------+ |
|Cache
Cache Line |entry
+----------------------------------+-----------------+ |
| | | |
| | | |
| Data words |Data parity bits | |
| | | |
| | | |
+----------------------------------+-----------------+ \|/
When a process executes, its code (text) and data are brought into processor registers for referencing. If the data or code is not present in the registers, the CPU supplies the virtual address of the desired data to the TLB and to the cache controller. Depending on implementation, caches can be direct mapped, set associative, or fully associative. Recent PA-RISC implementations use direct-mapped caches and fully associative TLBs. Virtual addresses can be sent in parallel to the TLB and cache because the cache is virtually indexed.
A physical page may not be referenced by more than one virtual page, and a virtual address cannot translate to two different physical addresses; that is, PA-RISC does not support hardware address aliasing, although HP-UX implements software address aliasing for text only in EXEC_MAGIC executables.
The cache controller uses the low-order bits of the virtual address to index into the direct-mapped cache. Each index in the cache finds a cache tag containing a physical page number (PPN) and a cache line of data. If the cache controller finds an entry at the cache location, the cache line is checked to see whether it is the right one by looking at the PPN in the cache tag and the one returned by the TLB, because blocks from many different locations in main memory can be mapped legitimately to a given cache location. If the data is not in cache but the page is translated, the resultant data cache miss is handled completely by the hardware. A TLB miss occurs if the page is not translated in the TLB; if the translation is also not in the PDIR, HP-UX uses the page fault code to fault it in. If not in RAM, the data and code might have to be paged from disk, in which case the disk-to-memory transaction must be performed.
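The index-and-compare step can be sketched as follows; the line size, cache geometry, and helper names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 5                       /* assumed 32-byte cache lines    */
#define NUM_LINES  2048                    /* assumed direct-mapped geometry */

struct cache_line {
    bool     valid;
    uint32_t tag_ppn;                      /* PPN stored in the cache tag */
    uint8_t  data[1 << LINE_SHIFT];
};

extern struct cache_line cache[NUM_LINES];

/* tlb_ppn is the physical page number the TLB returned in parallel. */
bool cache_hit(uint32_t vaddr, uint32_t tlb_ppn)
{
    /* Low-order virtual address bits index the direct-mapped cache... */
    uint32_t index = (vaddr >> LINE_SHIFT) % NUM_LINES;
    const struct cache_line *line = &cache[index];

    /* ...then the PPN in the tag is compared with the TLB's PPN, since
     * many memory blocks legitimately map to this cache location. */
    return line->valid && line->tag_ppn == tlb_ppn;
}
```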
+---------------------------------+
|+-------+ processor |
|| CPU | |
|+-------+ |
| | : virtual address | +------------------+
| | :..................... | | RAM |
| | V V | | |
|+-------+ +-------+ | |page directory |
|| CPU | | TLB | | | +-----+ |
|+-------+ +-------+ | | |-----| |
| | : : | | |-----| |
| | : PPN PPN : | | +-----+ |
| | ....> <.... | | |
+---|-----------------------------+ +------------------+
| bus
===============================================================
|
+--------+
| disk |
+--------+
On a more detailed level, the next figure demonstrates the mapping of virtual and physical address components.
Virtual address
+-----------+-------------+
+--------------------| virtual | offset in |-----------------+
| | page # | page | |
| +-----------+-------------+ |
| |
| Address translation in |
| Translation Lookaside Buffer Physical address in Cache |
| +-------------+-------------+ +-------------+---------+ |
+->| Virtual | Physical |----->| Physical | Offset |<-+
| page number | page number | +->| page number | in page |<-+
+-------------+-------------+ | +-------------+---------+ |
| |
| Physical address in RAM |
| +-------------+---------+ |
+->| Physical | Offset |<-+
| page number | in page |
+-------------+---------+
The sequence followed by the processor as it validates addresses is one of "hit or miss".
In addition to assisting in virtual address translation, the translation lookaside buffer (TLB) serves a security function on behalf of the processor, by controlling access and ensuring that a user process sees only data for which it has privilege rights.
The TLB contains access rights and protection identifiers. PA-RISC 2.0 allows up to eight protection IDs to be associated with each process. These IDs are held in control registers CR-8, CR-9, CR-12, and CR-13 (2 per register). (PA-RISC 1.1 only allows four protection IDs to be associated with each process.)
| Security check | Purpose |
|---|---|
| Protection Checks | The P-bit (Protection ID Validation Enable bit) of the Processor
Status Word (PSW) is checked:
|
| Access Rights Check | Access Rights are stored in a seven-bit field containing permissible
access type and two privilege levels affecting the executing
instruction:
|
The following figure shows the checkpoints for controlling access to a page of data through the TLB. Two checks are performed for controlling access to a page of data through the TLB: protection check and access rights check. If both checks pass, access is granted to the page referenced by the TLB.
Control Registers
+-----------------+
| | TLB Entry
CR 8|Protection ID 1,2|-+ +---------------+
CR 9|Protection ID 3,4| | | |
| | +-+ +---------------+
CR 12|Protection ID 5,6| | | +----------------| Access ID |
CR 13|Protection ID 7,8|-+ | | +---------------+
| | | | +--| Access Rights |
+-----------------+ | | | +---------------+
| | Type of | | |
PSW | | Access | +---------------+
+-------+-+---+ | | | |
| |P| | | | / \ | IA Queue
+-------+-+---+ | | / \ | +------+--+
| | | / \ | | +---------+--+
+---------------+ | | | | | +-| | |
V V V V V V +---------+--+
+------------+ +--------+ |
| | | Access |<-------------+
| Protection | | Rights |
| Check | | Check |
+------------+ +--------+
| |
+---+ +---+
| |
V V
+-------------+
| Both Checks |
| Passed? |
+-------------+
|
V
Access Granted
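Schematically, the two checks combine as in the sketch below; the representation of the control registers, the rights-field decode, and the helper names are illustrative assumptions, not the hardware's actual encoding.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PROT_IDS 8    /* PA-RISC 2.0: eight per process (four on 1.1) */

/* Stand-in for the protection IDs held in CR 8, 9, 12, and 13. */
extern uint32_t prot_ids[NUM_PROT_IDS];

struct tlb_entry_sketch {
    uint32_t access_id;       /* protection ID the page is tagged with */
    uint32_t access_rights;   /* 7-bit rights field (type + privilege) */
};

/* Protection check: with the PSW P-bit set, the page's access ID must
 * match one of the per-process protection IDs. */
static bool prot_check(const struct tlb_entry_sketch *e, bool psw_p_bit)
{
    if (!psw_p_bit)                     /* validation disabled */
        return true;
    for (int i = 0; i < NUM_PROT_IDS; i++)
        if (prot_ids[i] == e->access_id)
            return true;
    return false;
}

/* Access rights check: illustrative decode only. We assume the low two
 * bits hold the least-privileged level allowed (on PA-RISC, PL 0 is the
 * most privileged). The real 7-bit encoding is richer than this. */
static bool rights_check(const struct tlb_entry_sketch *e, unsigned priv_level)
{
    return priv_level <= (e->access_rights & 0x3);
}

/* The page may be referenced only if both checks pass. */
bool access_granted(const struct tlb_entry_sketch *e,
                    bool psw_p_bit, unsigned priv_level)
{
    return prot_check(e, psw_p_bit) && rights_check(e, priv_level);
}
```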
If the two PPNs do not match (assuming a TLB hit), the cache line is loaded, because the bytes referenced on the virtual page are not yet in the cache. The time it takes to service a cache miss varies depending on whether the data already present in the cache is clean or dirty. (When the cache line is dirty, the old contents are written out to memory and the new contents are read in from memory.) If the cache line is "clean" (that is, not modified), it does not have to be written back to main memory, and the penalty is fewer instruction cycles than if the line is dirty and must be written back to main memory.
Page found in PDIR (deposit in TLB)
+-----------+-------------------------+
V | +--|---+
+--------+ V | | |
+->| hashes | +-----+ TLB miss | [ ] | Not Found
| +--------+ | |------------------>| |-----------+
| /| | | TLB | TLB Hit +------+ |
| | VPN------>| |-----------+ PDIR V
| | | +-----+ |PPN s/w
| | | | (cache line) handler
| +- | -----------+------------ | -------------------+
| | | | |
| | | V |
| | V / \ |
CPU | | +-------+ PPN / \ No/Cache Miss +-----+
requests| +------->| Cache |------> =? -------------->| |
virtual | +-------+ \ / +-----+
address | \ / RAM
| |Yes/cache hit
| Return data to |
+-----+ CPU from cache |
| CPU |<-----------------------------+
+-----+
Registers, high-speed memory in the processor's CPU, are used by the software as storage elements that hold data for instruction control flow, computations, interruption processing, protection mechanisms, and virtual memory management.
All computations are performed between registers or between a register and a constant (embedded in an instruction), which minimizes the need to access main memory for data or code. This register-intensive approach accelerates performance of a PA-RISC system. This memory is much faster than conventional main memory but it is also much more expensive, and is therefore used for processor-specific purposes.
Registers are classified as privileged or non-privileged, depending on the privilege level of the instruction being executed.
| Type of Register | Purpose |
|---|---|
| 32 General Registers, each 64 bits in size (non-privileged) | Used to hold immediate results or data that is accessed frequently, such as the passing of parameters. Some have uses specified by PA-RISC or HP-UX. |
| 7 Shadow Registers (privileged) | Store the contents of GR 1, 8, 9, 16, 17, 24, and 25 on interrupt, so that they can be restored on return from interrupt. Numbered SHR0-SHR6. |
| 8 Space Registers (SR5-SR7 are privileged) | Hold the space IDs for the currently running process. |
| 25 Control Registers (numbered CR0, and CR8 through CR31), each 64 bits (most are privileged) | Used to reflect different states of the system, many related primarily to interrupt handling. |
| 32 Floating Point Registers of 64 bits each (or 64 of 32 bits each) | Data registers used to hold computations. |
| 2 Instruction Address Queues | Two queues, each two elements deep. The front elements of the queues (IASQ_Front and IAOQ_Front) form the virtual address of the current instruction, while the back elements (IASQ_Back and IAOQ_Back) contain the address of the following instruction. |
| 1 Processor Status Word (PSW), 64 bits (privileged) | Contains the current processor state. When an interruption occurs, the PSW is saved into the Interrupt Processor Status Word (IPSW), to be restored later. The low-order five bits of the PSW are the system mask, and are defined as mask/unmask or enable/disable. Interrupts disabled by a PSW bit are ignored by the processor; interrupts masked remain pending until unmasked. |
uarea vas
+-------+ +---------------->+-----+
| | | +--------->| |<--------------+
+-------+ proc | | pregion +-----+ |
|u_procp|---->+-----+ | +->+-----+<->+--+<->+--+<->+--+<-+
+-------+ | | | | | | | | | | |
| | +-----+ | +-----+ +--+ +--+ +--+
+-------+ |p_vas|-+ |p_reg|--+
+-----+ +-----+ |
| | | | |
+-----+ +-----+ |
Process resources |
=========================================|============================
System resources | region
+--->+------+
| |
+------+ broot
|r_root|--->+------+
+------+ | |
chunk | | +------+
+-----+<----+ +------+ +-|b_root|
| | | | +------+
+-----+ | +--+<---------+ | |
hpde RAM <--| vfd | | B-tree | | +------+
+--------+ | dbd | | +--+
| | /|\ +-----+ | / | \
+--------+ | | | | V V \|
|pde_phys|--+ | | | +--+ +--+ +--+
+--------+ +-----+ | | | | | | |
| | | +--+ +--+ +--+
+--------+ | / | \
| |/ V \|
| +--+ +--+ +--+
+---| | | | | |
+--+ +--+ +--+
Process management uses kernel structures down to the pregions
to execute the threads of a process. The uarea, proc
structure, vas, and pregion are per-process
resources, because each process has its own unique copies of these structures,
which are not shared among multiple processes.
Below the pregion level are the systemwide resources. These
structures can be shared among multiple processes (although they are not
required to be shared).
Memory management kernel structures map pregions to physical memory and provide support for the processor's ability to translate virtual addresses to physical memory. The table that follows introduces the structures involved in memory management; these are discussed later in detail.
| Kernel structure | Purpose |
|---|---|
| vas | Keeps track of the structural elements associated with a process in memory. One vas is maintained per process. |
| pregion | A per-process resource that describes the regions attached to the process. |
| region | A memory-resident system resource that can be shared among processes. Points to the process's B-tree, vnode, and pregions. |
| B-tree | Balanced tree that stores pairs of page indices and chunk addresses. At the root of a B-tree of VFDs and DBDs is struct broot. |
| hpde | Contains information for virtual-to-physical translation (that is, from VFD to physical memory). |
The Virtual Address Space (vas)

The vas represents the virtual address space of a process and serves as the head of a doubly linked list of process region data structures called pregions. The vas data structure is always memory resident.
When a process is created, the system allocates a vas structure and puts
its address in p_vas, a field in the proc structure.
The virtual address space of a process is broken down into logical chunks
of virtually contiguous pages. (See the Process Management white paper for a
table of vas entries.)
The Pregion (pregion)

Each pregion represents a process's view of a particular portion of its virtual address space and information on getting to those pages. The pregion points to the region data structure that describes the pages' physical locations in memory or in secondary storage. The pregion also contains the virtual addresses to which the process's pages are mapped, the page usage (text, data, stack, and so forth), and page protections (read, write, execute, and so on).
pregion
                  +---------+
   +------------->|   vas   |<--------------+
   |              +---------+               |
   |             /           \              |
   |            /             \             |
   V           V               V            V
+---------+   +---------+   +---------+   +---------+
| pregion |<->| pregion |<->| pregion |<->| pregion |
+---------+   +---------+   +---------+   +---------+
     ^
     |
     V
+---------+
| region  |
+---------+
The following elements of a per-process pregion structure are
important to the virtual memory subsystem.
struct pregion
| Element | Purpose |
|---|---|
| p_type | Type of pregion. |
| *p_reg | Pointer to the region attached by the pregion. |
| p_space, p_vaddr | Virtual address of the pregion, based on virtual space and virtual offset. |
| p_off | Offset into the region, specified in pages. |
| p_count | Number of pages mapped by the pregion. |
| p_ageremain, p_agescan, p_stealscan, p_bestnice | Used in the vhand algorithm to age and steal pages of memory (discussed later). |
| *p_vas | Pointer to the vas to which the pregion is linked. |
| p_forw, p_back | The doubly linked list used by vhand to walk the active pregions. |
| p_deactsleep | The address at which a deactivated process is sleeping. |
| p_pagein | Size of an I/O, used for scheduling when moving data into memory. |
| p_strength, p_nextfault | Used to track the ratio between sequential and random faults; used to adjust p_pagein. |
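As an example of how these linkage fields are used, a pass over a process's pregions (the kind of walk vhand makes) reduces to a linked-list traversal like the sketch below; the layouts and the list-head field name are simplified assumptions.

```c
#include <stddef.h>

/* Simplified layouts: only the fields needed for the walk. */
struct pregion {
    struct pregion *p_forw;    /* doubly linked list walked by vhand */
    struct pregion *p_back;
    int             p_count;   /* pages mapped by this pregion       */
};

struct vas {
    struct pregion *va_head;   /* list head (illustrative name)      */
};

/* Sum the pages mapped by every pregion of a process, assuming the
 * pregion list is circular through the vas as in the figure above. */
int count_mapped_pages(const struct vas *vas)
{
    int total = 0;
    const struct pregion *p = vas->va_head;

    if (p == NULL)
        return 0;
    do {
        total += p->p_count;
        p = p->p_forw;
    } while (p != vas->va_head);
    return total;
}
```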
The region is a system-wide kernel data structure that associates groups of
pages with a given process. Regions can be one of two types, private (used by
a single process) or shared (able to be used by more than one process). Space
for a region data structure is allocated as needed. The region structure is
never written to a swap device, although its B-tree may be.
Regions are pointed to by pregions, which are a per-process resource.
Regions point to the vnode where the blocks of data reside when
not in memory.
The Region (struct region)
| Element | Meaning |
|---|---|
| r_flags | Region flags (enumerated shortly). |
| r_type | Type of region (private or shared). |
| r_pgsz | Size of region in pages (not just those presently in memory). |
| r_nvalid | Number of valid pages in region. This equals the number of valid vfds in the B-tree or r_chunk. |
| r_dnvalid | Number of pages in swapped region. If the system swaps the entire process, the value of r_nvalid is copied here to later calculate how many pages the process will need when it faults back in. This information is used to decide which process to reactivate. |
| r_swalloc | Total number of pages reserved and allocated for this region on the swap device. Does not account for swap space allocated for vfd/dbd pairs. |
| r_swapmem, r_vfd_swapmem | Memory reserved for pseudo-swap or vfd swap. |
| r_lockmem | Number of pages currently allocated to the region for lockable memory, including lockable memory allocated for vfd/dbd pairs. |
| r_pswapf, r_pswapb | Forward and backward pointers to the list of regions using pseudo-swap pages (pswaplist). |
| r_refcnt | Number of pregions pointing at the region. |
| r_zomb | Set to indicate modified text. If an executing a.out file on a remote system has changed, the pages are flushed from the processor's cache, causing the next attempted access to fault. The fault handler finds that r_zomb is non-zero, prints the message "Pid %d killed due to text modification or page I/O error", and sends the process a SIGKILL. |
| r_off | Offset into the page-aligned vnode, specified in pages; valid only if RF_UNALIGNED is not set. Page r_off of the vnode is referenced by the first entry of the first chunk of the region's B-tree. |
| r_incore | Number of pregions sharing the region whose associated processes have the SLOAD flag set. |
| r_dbd | Disk block descriptor for B-tree pages written to a swap device. Specifies the location of the first page; the pages are stored together in a contiguous area of swap space. |
| r_fstore, r_bstore | Pointers to vnode of origin and destination of block. This data depends on the type of pregion above the region. In most cases, r_bstore is set to the paging system vnode, the global swapdev_vp that is initialized at system startup. |
| r_forw, r_back | Pointers to linked list of all active regions. |
| r_lock | Region lock structure used to get read or read/write locks to modify the region structure. |
| r_mlock | Lock used to serialize mlock operations on this region. |
| r_poip | Number of page I/Os in progress. |
| r_root | Root of B-tree; if referencing more than one chunk, r_key is set to DONTUSE_IDX. |
| r_key, r_chunk | Used instead of a B-tree search (r_root) if only a single chunk of vfddbds is needed (referencing 32 or fewer pages on a 32-bit kernel, or 64 or fewer pages on a 64-bit kernel). |
| r_next, r_prev | Circularly linked list of all regions sharing a vnode. |
| r_preg_un | pregion(s) pointing to the region. |
| r_excproc | Pointer to the proc table entry, if the process has RF_EXCLUSIVE set in r_flags. |
| r_lchain | Linked list of memory lock ranges. |
| r_mlockswap | Swap reserved to cover locks. |
| r_pgszhint | Page size hint. |
| r_hdl | Hardware-dependent layer structure. |
a.out Support for Unaligned Pages

Text and data of most executables start on a four-kilobyte page boundary. HP-UX can treat these as memory-mapped files, because a page in the file maps directly to a page in memory.
In addition to the fields shown above, struct region has fields to support executables compiled on older versions of HP-UX whose text and data do not align on a (4 KB) page boundary. These executables are referenced by regions whose r_flags has RF_UNALIGNED set.
a.out Support by Regions
| Element | Meaning |
|---|---|
| r_byte, r_bytelen | Offset into the a.out file and length of its text. |
| r_hchain | Hash list of unaligned regions. |
The state of a region is recorded in its flag field, r_flags. Here are some of the possible flag values:
| Region Flag | Meaning |
|---|---|
| RF_ALLOC | Always set because HP-UX regions are allocated and freed on demand; there is no free list. |
| RF_UNALIGNED | Set if text of an executable does not start on a page boundary. In this case, the text is read through the buffer cache to align it, and the vfds are pointed at the buffer cache pages. |
| RF_WANTLOCK | Set if a thread wanted to lock a vfd of this region (to do I/O on the page), but found it already locked and went to sleep. After the vfd is unlocked, this flag ensures that wakeup() is called so the waiting thread(s) can proceed. |
| RF_HASHED | The text is unaligned (RF_UNALIGNED) and thus is on a hash chain. The region is hashed with r_fstore and r_byte; the head of each hash chain is in texts[]. The RF_UNALIGNED flag may be set without the RF_HASHED flag (if the system tries to get the hashed region but it is locked, the system will create a private one), but the RF_HASHED flag will never be set without the RF_UNALIGNED flag. |
| RF_EVERSWP, RF_NOWSWP | Set if the B-tree has ever been, or is now, written to a swap device. These flags are used for debugging. |
| RF_IOMAP | This region was created with an iomap() system call, and thus requires special handling when calling exit(). |
| RF_LOCAL | Remote file using local swap space. |
| RF_EXCLUSIVE | The mapping process is allowed exclusive access to the region. This flag is set, and r_excproc is set to the proc table pointer. |
| RF_STATIC_PREDICT | Text object uses static branch prediction for compiler optimization. |
| RF_ALL_MLOCKED | Entire region is memory locked. |
| RF_SWAPMEM | Region is using pseudo-swap; that is, a portion of memory is being held for swap use. |
| RF_LOCKED_LARGE | Region is locked using large pages. |
| RF_SUPERPAGE_TEXT | Text region using large pages. |
| RF_FLIPPER_DISABLE | Disable kernel assist prediction; a flag used for performance profiling. |
| RF_MPROTECTED | Some part of the region is subject to the system call mprotect, which is performed on a memory-mapped file. |
r_key, r_chunk, and
r_root are used to find information about the individual pages of
a region.
Each page is represented by a vfd (if it's in memory) or
dbd (if it's on disk).
For each page, the vfd and dbd are grouped
together into a struct vfddbd. By definition, if the
vfd's pg_v bit is set, the vfd is used;
if not, the dbd is used.
Since information is typically needed about groups of (rather than individual) pages, pages are grouped into chunks. A chunk contains 32 or 64 pairs of virtual frame descriptors and disk block descriptors:
Virtual Frame Descriptor (vfd)

A one-word structure called a virtual frame descriptor enables processes to reference pages of memory. The vfd is used when the page is in memory, and can be used to refer to the page of physical memory described in the pfdat table (pfdat_ptr[], described below).
+----------+---------------------+
|  flags   |  page frame number  |
+----------+---------------------+
         11                    31
Elements of the vfd (struct vfd)
| Element | Meaning |
|---|---|
| pg_v | Valid flag. If set, this page of memory contains valid data and pg_pfnum is valid. If not set, the page's valid data is on a swap device. |
| pg_cw | Copy-on-write flag. If set, a write to the page causes a data protection fault, at which time the system copies the page. |
| pg_lock | Lock flag. If set, raw I/O is occurring on this page: either the data is being transferred between the page and the disk, or data is being transferred between two memory pages. The kernel sleeps waiting for completion of I/O before launching further raw I/O to or from this page. Nothing can read the page while it is being written to disk. |
| pg_mlock | If set, the page is locked in memory and cannot be paged out. |
| pg_pfnum (aliased as pg_pfn) | Page frame number, from which the correct pfdat entry for this page can be accessed. |
Disk Block Descriptor (dbd)

When the pg_v bit in a vfd is not set, the vfd is invalid and the page of data is not in memory but on disk. In this case, the disk block descriptor (dbd) gives a valid reference to the data. Like the vfd structure, the dbd is one word long.
+----+---------------------------+
|type|            data           |
+----+---------------------------+
 0  3                          31
Elements of the dbd (struct dbd)
| Element | Meaning |
|---|---|
| dbd_type | Type of data (for example, DBD_FSTORE or DBD_BSTORE).(1) |
| dbd_data | vnode type (jfs, nfs, ufs, swap space) specific data. Used by the file system (or swap space management) code to find the data in a file pointed to by a vnode. |

(1) When the dbd_type is DBD_FSTORE, the page of data resides in the file pointed to by r_fstore (typically a file system). When the dbd_type is DBD_BSTORE, the page of data resides in the file or device pointed to by r_bstore (typically a swap device).
Since information is typically needed about groups of (rather than individual) pages, pages are grouped into chunks. A chunk contains 32 or 64 pairs of virtual frame descriptors (vfds) and disk block descriptors (dbds). A one-to-one correspondence is maintained between vfd and dbd through the vfddbd structure, which simply contains one vfd (c_vfd) and one dbd (c_dbd). If the vfd's pg_v bit is set, the vfd is used; if not, the dbd is used.
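The pairing can be pictured with the sketch below; the bitfield widths mirror the one-word figures above but are assumptions, not the kernel's exact layout.

```c
#include <stdint.h>

/* One-word virtual frame descriptor (field widths assumed). */
struct vfd {
    uint32_t pg_v     : 1;    /* valid: the page is in memory      */
    uint32_t pg_cw    : 1;    /* copy-on-write                     */
    uint32_t pg_lock  : 1;    /* raw I/O in progress on the page   */
    uint32_t pg_mlock : 1;    /* locked in memory; cannot be paged */
    uint32_t pg_pfnum : 28;   /* page frame number (width assumed) */
};

/* One-word disk block descriptor. */
struct dbd {
    uint32_t dbd_type : 4;    /* DBD_FSTORE, DBD_BSTORE, ...       */
    uint32_t dbd_data : 28;   /* vnode-type-specific location data */
};

/* A vfddbd pairs one vfd with one dbd for a single page. */
struct vfddbd {
    struct vfd c_vfd;
    struct dbd c_dbd;
};

/* If pg_v is set the vfd applies (the page is in memory); otherwise
 * the dbd says where the page's data lives on disk. */
int page_in_memory(const struct vfddbd *vd)
{
    return vd->c_vfd.pg_v;
}
```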
(Figure: chunks are arrays of vfddbd pairs; each pair holds one vfd and one dbd.)
HP-UX regions use chunks of vfds and dbds to keep track of page ownership. Each region contains either a single array of vfddbds (a chunk) or a pointer to a B-tree. The structure called a B-tree allows for quick searches and efficient storage of sparse data. A bnode is the same size as a chunk; both can be allocated from the same source of memory. The region's B-tree stores pairs of page indices and chunk addresses. HP-UX uses an order-29 B-tree.
A B-tree is searched with a key and yields a value. In the
region B-tree, the key is the page number in the region divided
by the number of vfddbds in a chunk.
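A short sketch of the key arithmetic just described, assuming 32 pairs per chunk (the text allows 32 or 64); vfddbd_sketch and VFDDBDS_PER_CHUNK are illustrative names, not the kernel's.
```c
#define VFDDBDS_PER_CHUNK 32   /* assumed; the text says 32 or 64 */

/* One vfd/dbd pair, as described above. */
struct vfddbd_sketch {
    unsigned int c_vfd;
    unsigned int c_dbd;
};

/* The B-tree is searched with a key and yields a chunk; the key is
 * the page number within the region divided by the pairs per chunk. */
unsigned int btree_key(unsigned int region_pageno) {
    return region_pageno / VFDDBDS_PER_CHUNK;
}

/* Once the chunk is found, the pair for this page is at this offset. */
unsigned int chunk_index(unsigned int region_pageno) {
    return region_pageno % VFDDBDS_PER_CHUNK;
}
```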
B-tree (order = 3, depth = 3) ++-+-+-+-++
||9| | | ||
+++++++++++
| | | | | |
+-+-+-+-+-+
| |
+-----+ +-----+
| |
V V
++-+-+-+-++ ++-+--+-+-++
||4|7| | || ||9|11| | ||
+++++++++++ +++++-++++++
| | | | | | | | | | | |
+-+-+-+-+-+ +-+-+--+-+-+
| | | | |
+-------------------+ | | | |
| +---------+ | | +---------+
| | | | |
V V V V V
++-+-+-+-++ ++-+-+-+-++ ++-+-+-+-++ ++-+--+-+-++ ++--+--+-+-++
||1|3| | || ||4|6| | || ||7|8| | || ||9|10| | || ||11|12| | ||
+++++++++++ +++++++++++ +++++++++++ +++++-++++++ +++-++-++++++
| |G|H| | | | |D|E| | | | |J|I| | | | |F| B| | | | | C| A| | |
+-+-+-+-+-+ +-+-+-+-+-+ +-+-+-+-+-+ +-+-+--+-+-+ +-+--+--+-+-+
Each node of a B-tree contains room for order+1 keys (or index
numbers) and order+2 values. If a node grows to contain more than order keys,
it is split into two nodes; half of the pairs are kept in the original node
and the other half are copied to the new node. The B-tree node
data also includes the number of valid elements contained in that node.
B-tree Node Description (struct bnode)
| Element | Meaning |
|---|---|
| b_key[B_SIZE] | The array of keys used for each page index of the bnode. |
| b_nelem | Number of valid keys/values in the bnode. |
| b_down[B_SIZE+1] | The array of values in the bnode, either pointers to another bnode (if this is an interior bnode) or pointers to chunks (if this is a leaf bnode). |
The B-tree Root (struct broot)
A structure of type struct broot points to the start of the B-tree.
| Element | Meaning |
|---|---|
| b_root | Pointer to the initial point of the B-tree. |
| b_depth | Number of levels in the B-tree. |
| b_npages | Pages used to construct the B-tree, counting both pages used for chunks and bnodes. |
| b_rpages | Number of swap pages reserved for the B-tree by the kernel, using the routine grow_vfdpgs(). Amount of swap allocated for the vfd/dbd pairs in the B-tree structure. |
| b_list | Pointer to a linked list of memory pages used for bnodes or chunks in this region. The first page in this list usually has free space available (if b_nfrag is non-zero). New bnodes or chunks can be allocated from here and added to the B-tree. |
| b_nfrag | Number of chunks available (not yet allocated) in b_list. Since chunks are allocated from the end of the page, this is also the index of the most recently allocated chunk in the page (decrement it to get the next available one). |
| b_rp | Pointer to the region using the B-tree. |
| b_protoidx, b_proto1, b_proto2 | Two prototype dbd values, and the page index at which we switch from b_proto1 to b_proto2. This is used to minimize time and memory costs when allocating chunk space. |
| b_vproto | List of page ranges which are copy-on-write. This allows pages to be set copy-on-write without having to immediately allocate the actual B-tree entries. This is used to determine the vfd prototype. (See "vfd Prototypes" below.) |
| b_key_cache[], b_val_cache[] | Caches of most recently used keys and pointers to chunks associated with the keys; checked first when looking for a particular struct vfddbd (before searching the B-tree). |
vfd Prototypes
The b_vproto field of the struct broot contains a list of ranges of pages to be treated as copy-on-write. This allows pages to be set copy-on-write without their B-tree entries being allocated immediately. The list is of type struct vfdcw and is sorted by starting page index. When creating vfds, the prototype is determined by checking whether the page is present in this list.
Table 15 struct vfdcw
| Element | Meaning |
|---|---|
| v_start[MAXVPROTO] | Page that indexes start of copy-on-write range; set to -1 if unused. |
| v_end[MAXVPROTO] | End of copy-on-write range. |
pseudo-vas for Text and Shared Library pregions
When a file is opened as an a.out or shared library, the easiest way to keep track of the region is to create a pseudo-vas the first time the file is opened as an executable. This is done by calling mapvnode() and storing the vas pointer in the vnode's v_vas element. On subsequent opens of the file as an executable, the non-NULL value in v_vas aids in finding the region to which the virtual address space is being attached.
The pseudo-vas is type PT_MMAP, and the associated pregion has PF_PSEUDO set in p_flags. This pregion is attached to the region for this vnode. All the processes that use this executable or shared library (non-pseudo pregions) then attach to the region with type PT_TEXT (a.out) or PT_MMAP (shared library). The number of processes using a particular vnode as an executable is kept in the pseudo-vas in va_refcnt.
All pregions associated with a region are connected with a doubly-linked list that begins with the region element r_pregs. The list is defined in the pregions by p_prpnext and p_off (the pregion's offset into the region), and is NULL-terminated.
Even after all processes using the a.out or shared library exit, the handle to the region remains; its pages can be disposed of at that time.
Figure 21 Mapping the
pseudo-vas Structures a.out shlib
vnode vnode
+-----+ +---->+-------+ +-----+ +---->+-------+
| | | |pseudo | | | | |pseudo |
+-----+ | +>| vas |<+ +-----+ | +>| vas |<+
|v_vas|-+ | +-------+ | |v_vas|-+ | +-------+ |
+-----+ | | +-----+ | |
| | | +-------+ | | | | +-------+ |
+-----+ +>| MMAP |<+ +-----+ +>| MMAP |<+
.............|pregion| ................|pregion|
+-----------------| | : | |--------+
| : +-------+ : +-------+ |
| : : |
| : proc[n].p_vas--+ : |
| : V : V
| : +-------+ : +-------+
| : | vas | +----------------------------->| MMAP |
| : +--------->| |<-----------+ |region |
| : | +-------+ | : | +-------+
| V V V V V /|\
| +-------+ +-------+ +-------+ +-------+ |
| | TEXT |<->| |<->| MMAP |<->| | proc[m].pvas |
| |pregion| | | |pregion| | | | |
| +-------+ +-------+ +-------+ +-------+ | |
| : | :............. V |
| : | : +-------+ |
| : | r_prpnext +------------------->| vas |<---+ |
| :...|............. | : | | | |
| | : | : +-------+ | |
| V V V V V |
| +-------+ +-------+ +-------+ +-------+ +-------+ |
+->| TEXT | | TEXT |<->| |<->| MMAP |<->| | |
|region |<-------|pregion| | | |pregion| | | |
+-------+ +-------+ +-------+ +-------+ +-------+ |
| |
+---------------+
Hardware-Independent Page Information Table (pfdat)
The page frame data (pfdat) table is a two-level table which represents all reallocatable pages of physical memory. (Memory permanently allocated at kernel boot time is not represented.) Conceptually it may be imagined as a giant array indexed by the page frame number (pfn, i.e. the physical page number).
If physical memory addresses always started with page zero and increased in a continuous sequence, the table would be implemented as a single-level array. (Indeed, it was implemented this way in older HP-UX releases, as the hardware they ran on had such a continuous address range.) However, some recent systems have huge gaps in their physical addresses (e.g. one might have memory from page 0 to page 0x1000, and then from page 0x20000 to 0x21000); a table that represented all addresses would be much larger than actually needed.
Consequently the first layer (pfdat_ptr) is basically an array of pointers to sub-tables. Each pointer represents PFN_CONTIGUOUS_PAGES (0x1000) pages of possible physical address space, but the pointers are NULL unless there's actual physical memory in that range. (As a memory-saving optimization, memory allocated permanently at boot is treated as nonexistent for purposes of this table.)
The pfdat structures themselves are used for several purposes, described in the table and lookup sketch that follow.
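A minimal sketch of the two-level lookup just described. PFN_CONTIGUOUS_PAGES (0x1000) is from the text; the _sketch names and the indexing arithmetic are assumptions.
```c
#include <stddef.h>

#define PFN_CONTIGUOUS_PAGES 0x1000  /* pages covered per first-level pointer */

struct pfdat_sketch { unsigned long pf_pfn; /* ... fields per Table 16 ... */ };

/* First level: one pointer per 0x1000 pages of possible physical
 * address space; NULL where no memory exists in that range. */
extern struct pfdat_sketch *pfdat_ptr_sketch[];

/* Return the pfdat entry for a page frame number, or NULL if the
 * pfn falls in a hole in the physical address space. */
struct pfdat_sketch *pfn_to_pfdat(unsigned long pfn) {
    struct pfdat_sketch *sub = pfdat_ptr_sketch[pfn / PFN_CONTIGUOUS_PAGES];
    if (sub == NULL)
        return NULL;                 /* no memory in this range */
    return &sub[pfn % PFN_CONTIGUOUS_PAGES];
}
```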
Table 16 Principal Entries in struct pfdat (Page Frame Data)
| Element | Meaning |
|---|---|
| pf_hchain | Hash chain link. |
| pf_devvp (1) | vnode for device. |
| pf_next, pf_prev | Next and previous free pfdat entries. |
| pf_vnext, pf_vprev | Links for linked list of pages associated with the same vnode. |
| pf_lock | Lock pfdat entry (beta semaphore), used to lock the page while modifying the pde (physical-to-virtual translation, access rights, or protection ID). |
| pf_pfn | Physical page frame number. |
| pf_use | Number of regions sharing the page; when pf_use drops to zero, the page can be placed on the free linked list. |
| pf_cache_waiting | If set, this element means that a thread is waiting to grab the pf_lock on that page. Required for synchronization. |
| pf_data | Disk block number or other data to uniquely identify this page within pf_devvp. |
| pf_sizeidx | Identifies the page size for the base page of a large page in a physical memory free list. That size determines which free list it's placed in. |
| pf_size | Page size of a variable sized page that's in use. |
| pf_flags | Page frame data flags (shown in the next table). |
| pf_hdl | Hardware dependent layer elements (see the hdlpfdat discussion, shortly). |
(1) Hashing is done on the tuple (pf_devvp, pf_data).
Flags Showing the Status of the Page
Table 17 Principal pf_flag Values
| Flag | Meaning |
|---|---|
| P_FREE | Page is free (available for allocation). |
| P_BAD | Page is marked as bad by the memory deallocation subsystem. |
| P_HASH | Page is on a hash queue. |
| P_SYS | Page is being used by the kernel rather than by a user process. Pages marked with this flag include dynamic buffer cache pages, B-tree pages and the results of dynamic kernel memory allocation. |
| P_DMEM | Page is locked by the memory diagnostics subsystem; set and cleared with an ioctl() call to the dmem driver. |
| P_LCOW | Page is being remapped by copy-on-write. |
| P_UAREA | Page is used by a pregion of type PT_UAREA. |
| P_KERN_DYNAMIC | Page is used for kernel dynamic memory. (Subset of P_SYS.) This includes pages in the kernel dynamic memory free lists. |
| P_KERN_NO_LGPG | Page is allocated (as kernel dynamic memory) by a user who intends to remap it. (Thus, it cannot be part of a large page.) Subset of P_KERN_DYNAMIC. |
| P_SP_POOL | Page is in kernel dynamic memory allocator's superpage pool free list. (Subset of P_KERN_DYNAMIC.) |
Hardware-Dependent Layer Page Frame Data Entry
The pf_hdl field of the struct pfdat contains hardware dependent information associated with each page. It is of type struct hdlpfdat, defined in hdl_pfdat.h.
Table 18 struct hdlpfdat
| Element | Meaning |
|---|---|
| hdlpf_flags | Flags that show the HDL status of the page. Values include: HDLPF_TRANS: A virtual address translation exists for this page. HDLPF_PROTECT: Page is protected from user access. If this flag is set, the saved values (below) are valid unless HDLPF_STEAL is also set. HDLPF_STEAL: Virtual translation should be removed when pending I/O is complete. HDLPF_MOD: Analogous to changing the pde_modified flag in the hpde. HDLPF_REF: Analogous to changing the pde_ref flag in the hpde. HDLPF_READA: Read-ahead page in transit; used to indicate to the hdl_pfault() routine that it should start the next I/O request before waiting for the current I/O request to complete. |
| hdlpf_savear | Saved page access rights. |
| hdlpf_saveprot | Saved page protection ID. |
MAPPING VIRTUAL TO PHYSICAL MEMORY
The HTBL
HP-UX uses a hashed page directory to translate from virtual to physical address. The PA-RISC hardware attempts to convert a virtual address to a physical address by looking in the TLB. If it cannot resolve the address, it generates a page fault (interrupt type 6 for an instruction TLB miss fault; interrupt type 15 for a data TLB miss fault). The kernel must then handle this fault.
PA-RISC uses a hashed page table (htbl) of page directory entries (hpdes) to pinpoint an address in the enormous virtual address space. Control register 25 (CR25) contains the hash table address (see reg.h). See "The Page Table or PDIR" above for additional discussion of this table, and "The Hashed Page Directory (struct hpde and struct hpde2_0) Structure" above for details of the contents of each table entry.
NOTE: For historical reasons, the entries of this table can be referred to as pdes, hpdes, or pdirs.
To find an address in the htbl:
The space and offset of the virtual address are hashed to produce an htbl index.
The index selects an entry in the htbl. Each entry in the table is referred to as a pde (page directory entry), and is of type struct hpde.
The tag of the faulting address is compared with the pde to verify the entry.
The physical page number is taken from the pde to complete the translation from virtual address to physical address.
Figure 22 Mapping from the htbl Entry to the Page Directory Entry
htbl +-----+
| |
| |
| |
+-----+ +------+ | |
|Space| |Offset| | |
+-----+ +------+ | |
\ / | |
\ / | |
\ / | |
_/ \_ | |
----------- | |
\ hash / | |
\ / | |
| | |
V | |
+----------+ +-----+
|htbl index|------> htbl[n] | pde | ----> RAM
+----------+ +-----+
| |
| |
| |
+-----+
htbl[nhtbl-1] | pde |
+-----+
When Multiple Addresses Hash to the Same htbl Entry
As with any hash algorithm, multiple addresses can map to the same htbl index. The entry in htbl is actually the starting point for a linked list of pdes. Each entry has a pde_next pointer that points to another pde, or contains NULL if it is the last item of the linked list.
In practice, htbl contains sufficient entries that the linked lists seldom grow beyond three links.
Each htbl entry can point to two other collections of pdes, ranging from base_pdir to htbl and from pdir (which is also the end of htbl) to max_pdir. The entirety of the htbl and surrounding pdes is referred to collectively as the sparse pdir.
The htbl is always aligned to begin at an address that is a multiple of its size (that is, a multiple of nhtbl * sizeof(struct hpde)).
pdir_free_list or pd_fl2_0->head points to a linked list of sparse pdir entries that are not being used and are available for use. pdir_free_list_tail or pd_fl2_0->tail points to the last pde on that linked list. (The variable names changed slightly from the PA-RISC 1.1 pdir implementation to the PA-RISC 2.0 pdir implementation.)
Figure 23 How Multiple Addresses Hash to the Same htbl Entry
           +------------+
base_pdir | |
| |
| |
...> ============== -------> RAM
: | |\ |
: | \|
: | |\
: | | \
: | | pde
: | | /
: | |/
: | /|
: | |/ |
: ============== ..
: | | :
:....|............|..:
| |
| |
pdir +------------+
| |
| |
| |
| |
| |
| |
max_pdir | |
+------------+
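The lookup and collision handling shown in Figures 22 and 23 can be sketched in C as follows. The hash function here is a placeholder (the real PA-RISC hash is defined by the hardware), and hpde_sketch carries only the fields this walk needs.
```c
#include <stddef.h>

/* Minimal stand-in for struct hpde, with only the fields the walk uses. */
struct hpde_sketch {
    unsigned int        space;          /* tag: space ID                 */
    unsigned int        offset;         /* tag: page-aligned offset      */
    unsigned long       pfn;            /* physical page number          */
    struct hpde_sketch *pde_next;       /* collision chain, NULL at end  */
};

extern struct hpde_sketch *htbl_sketch[];
extern unsigned int nhtbl_sketch;       /* number of htbl entries */

/* Placeholder hash of (space, offset); the real function differs. */
static unsigned int htbl_hash(unsigned int space, unsigned int offset) {
    return (space ^ (offset >> 12)) % nhtbl_sketch;
}

/* Walk the pde chain that starts at the hashed htbl slot, comparing
 * tags until the translation is found or the chain ends. */
struct hpde_sketch *find_pde(unsigned int space, unsigned int offset) {
    struct hpde_sketch *pde = htbl_sketch[htbl_hash(space, offset)];
    while (pde != NULL) {
        /* stored offsets are assumed page-aligned (4 KB pages) */
        if (pde->space == space && pde->offset == (offset & ~0xFFFu))
            return pde;                 /* tag matches: translation found */
        pde = pde->pde_next;            /* follow the collision chain */
    }
    return NULL;                        /* miss: kernel must handle fault */
}
```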
Mapping Physical to Virtual Addresses
Translations from physical to virtual use the pfn_to_virt table. Like the pfdat table, this is a two-level table that can be imagined as a giant array containing one pfn_to_virt_entry_t entry for each page of physical memory. The first level table is called pfn_to_virt_ptr[].
Each pfn_to_virt_entry_t contains either the space and offset of the virtual page (in the case of a single translation to a page) or a list of alias structures (when the physical page has more than one virtual address translation).
Figure 24
Physical-to-virtual Address Translation
pfn_to_virt_ptr[] pfn_to_virt_entry_t
+-----+ >+------------+
| | / | |
| | / +------------+
| | / | |
| | / | | struct alias entries
+-----+/ +------------+ +------+ +------+ +------+ +------+
pfn.>| | +..>| *alias |->|alias1|<->|alias2|<->|alias3|<->|aliasn|
: +-----+ : +------------+ +------+ +------+ +------+ +------+
: | | : | | |space.offset
: | | : +------------+ |vtopde()
: | | : |space.offset| |
: +-----+ : +------------+ V
: : | | +-----------------------------------------+
+...........+ +------------+ | hpde corresponding to this space.offset |
| | +-----------------------------------------+
+------------+
A pfn_to_virt_entry_t may contain the space.offset (virtual address) corresponding to a physical address, or it may have a pointer to a linked list of alias structures, each of which has a space.offset pair, as the sketch below illustrates.
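A sketch of the entry's two interpretations, assuming a sentinel value marks the space field invalid; all _sketch names and SPACE_INVALID are illustrative, not the kernel's declarations.
```c
/* One alias: a (space, offset) pair on a doubly linked chain. */
struct alias_sketch {
    unsigned int         space, offset;
    struct alias_sketch *next, *prev;
};

#define SPACE_INVALID 0xFFFFFFFFu        /* assumed sentinel */

/* Stand-in for pfn_to_virt_entry_t: either a single translation or,
 * when space is marked invalid, a pointer to an alias chain. */
struct pfn_to_virt_sketch {
    unsigned int space;                  /* SPACE_INVALID => alias list */
    union {
        unsigned long        offset;     /* single translation          */
        struct alias_sketch *aliases;    /* head of alias chain         */
    } u;
};

/* Return the first virtual translation recorded for the entry. */
void first_translation(const struct pfn_to_virt_sketch *e,
                       unsigned int *space, unsigned long *offset) {
    if (e->space != SPACE_INVALID) {     /* single translation */
        *space  = e->space;
        *offset = e->u.offset;
    } else {                             /* multiple: walk alias chain */
        *space  = e->u.aliases->space;
        *offset = e->u.aliases->offset;
    }
}
```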
Address Aliasing
HP-UX supports software address aliasing on most platforms. (Whereas the hardware implements address aliasing on 16 MB boundaries, software address aliasing is implemented on a per-page basis; pages are 4KB apart.) This is not used as much as it might be in other operating systems; HP-UX doesn't generally map the same object at multiple virtual addresses.
When a text segment is first translated, it has no alias. However, if a process or thread attaches to the same text segment, it may require another translation. Processes sharing text segments do not use aliases. Only processes with private text segments that share data pages using copy-on-write use aliases. Aliases may also be used to add kernel translations of user pages.
When multiple virtual addresses translate to the same physical address, HP-UX uses alias structures to keep track of them. Aliases for a page frame (pfn) are maintained via alias chains off the pfn_to_virt_entry_t. (With large pages, the aliases are linked from the pfn_to_virt_entry_t corresponding to the base pfn of the page.) When a pfn_to_virt_entry_t's space field is invalid and the offset field is non-zero, the non-zero value points to the beginning of a linked list of alias structures. Each alias structure contains the space and offset of the alias, and a temporary hold field for a pde's access rights and protection ID. The pf_lock of the alias's base pfn's pfdat protects the alias chain from being read and modified.
To locate the hpde for a particular alias space and offset, the space and offset are hashed for the hpde chain and its corresponding pd_lock. Once the pd_lock is obtained, the vtopde() routine walks the hpde hash chain to find a match of the tag.
The global variable aa_entfreelist is the head of the doubly-linked list of free alias entries. The system gets an alias structure from aa_entfreelist, in which it stores the information for this new virtual-to-physical translation.
The global variable max_aapdir contains the total number of alias hpdes on the system. Once a page is allocated for use as alias hpdes, it is not returned, so the value of max_aapdir may grow over time but will never shrink.
The number of available alias hpdes is stored in aa_pdircnt. When an alias hpde is used or reserved (we reserve one if we include an htbl hpde in an alias linked list, in case we have to move it later), aa_pdircnt is decremented. When an alias hpde is returned to aa_pdirfreelist or unreserved, aa_pdircnt is incremented.
The number of available alias structures is kept in aa_entcnt. Once a page is allocated for use as a group of alias structures, it is not returned. We do not keep track of the total number of alias structures on the system, just the number of available structures.
MAINTAINING PAGE AVAILABILITY
Two computational elements maintain page availability: the vhand and sched daemons (system processes) handle the actual paging and deactivation. vhand monitors free pages to keep their number above a threshold and ensure sufficient memory for demand paging. vhand governs the overall state of the paging system. sched becomes operative when the number of pages available in memory diminishes below a certain level. vhand and sched will be described in the context of their work shortly.
NOTE: The sched process is known colloquially as the swapper.
Paging Thresholds
Memory management uses paging thresholds that trigger various paging activities. The figure shows the full range of available memory and indicates what paging activity occurs when memory level falls below each paging threshold.
Figure 25 Available Memory in the System
total memory at boot-up --> +------------------------+ phys_mem_pages
| kernel static memory |
| |
freemem at boot --> +------------------------+
| |
. .
. .
| |
+------------------------+ lotsfree
| |
| |
| |
vhand begins paging  --> +........................+ gpgslim*
| page |
+------------------------+ desfree
| |
sched begins deactivating --> +------------------------+ minfree
| deactivate |
+------------------------+ 0
* fluctuates between desfree and lotsfree
The value termed freemem represents the total number of free pages.
Three tunable paging thresholds are initialized by the setmemthresholds() routine.
Table 19 setmemthresholds() Paging Thresholds
| Paging threshold | Meaning |
|---|---|
| lotsfree | Plenty of free memory, specified in pages. The upper bound from which the paging daemon begins to steal pages. |
| desfree | Amount of memory desired free, specified in pages. This is the lower bound at which the paging daemon begins stealing pages. |
| minfree | The minimal amount of free memory tolerable, specified in pages. If free memory drops below this boundary, sched() recognizes the system is desperate for memory and deactivates entire processes whether they are runnable or not. |
The gpgslim Paging Threshold
The gpgslim paging threshold is the point at which
vhand starts paging. gpgslim adjusts dynamically
according to the needs of the system. It oscillates between an upper bound
called lotsfree and a lower bound called desfree.
Both lotsfree and desfree are calculated when the
system boots up and are based on the size of system memory.
When the system boots, gpgslim is set to 1/4 of the distance between lotsfree and desfree (desfree + (lotsfree - desfree)/4). As the system runs, this value fluctuates between desfree and lotsfree. When the sum of available memory and the number of pages scheduled for I/O (soon to be freed) falls below gpgslim, vhand begins aging and stealing little-used pages in an attempt to increase the available memory above this threshold.
The system wants to keep free memory at gpgslim. If the system is not stressed, gpgslim starts falling, because it does not need to have a lot more pages freed. As memory becomes more scarce (defined as freemem reaching zero too often), the system increases gpgslim so that it will page earlier, and hopefully not have freemem reach zero as often. A sketch of this arithmetic follows.
If freemem decreases to minfree, the system starts to deactivate entire processes.
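The initialization and bounds just described reduce to simple arithmetic; a minimal sketch, with all values in pages.
```c
/* Initial value: one quarter of the way from desfree up to lotsfree. */
unsigned long initial_gpgslim(unsigned long lotsfree, unsigned long desfree) {
    return desfree + (lotsfree - desfree) / 4;
}

/* At run time gpgslim oscillates, but never outside [desfree, lotsfree]. */
unsigned long clamp_gpgslim(unsigned long gpgslim,
                            unsigned long lotsfree, unsigned long desfree) {
    if (gpgslim < desfree)  return desfree;
    if (gpgslim > lotsfree) return lotsfree;
    return gpgslim;
}
```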
How Memory Thresholds are Tuned
The paging thresholds are set as follows:
Table 20 Paging Threshold Values
| Threshold | Basic Value | Limit if Initial freemem < 2 GB | Additional Amount per 2G of Initial freemem |
|---|---|---|---|
| lotsfree | 1/16 freemem | 32 MB | 32 MB |
| desfree | 1/64 freemem | 4 MB | 8 MB |
| minfree | 1/4 desfree | 1 MB | 4 MB |
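Table 20 can be read as the following computation, assuming 4 KB pages and taking "additional amount per 2 GB" to apply to initial freemem beyond the first 2 GB. This is one reading of the table, not kernel code.
```c
#define MB_PAGES  (1024UL * 1024 / 4096)   /* pages per MB at 4 KB/page */
#define GB2_PAGES (2048 * MB_PAGES)        /* pages per 2 GB */

static unsigned long min_ul(unsigned long a, unsigned long b) {
    return a < b ? a : b;
}

/* threshold = min(fraction of base, cap) + step per extra 2 GB of freemem */
static unsigned long threshold(unsigned long base, unsigned long divisor,
                               unsigned long cap_mb, unsigned long step_mb,
                               unsigned long freemem) {
    unsigned long extra = (freemem > GB2_PAGES)
                        ? (freemem - GB2_PAGES + GB2_PAGES - 1) / GB2_PAGES
                        : 0;
    return min_ul(base / divisor, cap_mb * MB_PAGES) + extra * step_mb * MB_PAGES;
}

void set_thresholds_sketch(unsigned long freemem,
                           unsigned long *lotsfree, unsigned long *desfree,
                           unsigned long *minfree) {
    *lotsfree = threshold(freemem,  16, 32, 32, freemem);  /* 1/16 freemem */
    *desfree  = threshold(freemem,  64,  4,  8, freemem);  /* 1/64 freemem */
    *minfree  = threshold(*desfree,  4,  1,  4, freemem);  /* 1/4 desfree  */
}
```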
How Paging is Triggered
The routine schedpaging() runs periodically and wakes up vhand whenever it finds that the sum of free memory and paroled memory (freemem + parolemem) is less than lotsfree. The rate at which schedpaging() runs is termed vhandrunrate, a tunable parameter (set to run by default at eight times per second).
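The wakeup test reduces to a one-line comparison, evaluated vhandrunrate times per second; a minimal sketch.
```c
extern unsigned long freemem, parolemem, lotsfree;

/* Called vhandrunrate times per second (default 8). */
int should_wake_vhand(void) {
    return (freemem + parolemem) < lotsfree;
}
```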
vhand can also be awakened by reserve_freemem()
and allocate_page().
reserve_freemem() is a routine that is called to reserve
memory. It will wake vhand if it can't reserve sufficient memory
and finds freemem + parolemem < gpgslim.
allocate_page() is a routine that is called to actually
allocate memory. If it is called by code that cannot wait (e.g. because it is
running on the interrupt stack), and cannot find the requested memory, it will
wake up vhand. Also, regardless of whether its caller can wait,
if it can't find the requested memory it will wake up the
unhashdaemon, which removes pages from the page cache.
vhand, the Pageout Daemon
vhand's function is to keep memory available by freeing up the
least recently referenced pages. It also performs other functions related to
maintaining memory availability, such as garbage collection of the kernel
memory allocator free lists.
Two-Handed Clock Algorithm
vhand uses a two-handed clock algorithm to decide which pages
to free. Conceptually, it has two hands (called the "age hand" and the "steal
hand") passing through all of memory. One hand marks each page as "not
recently referenced". The other hand follows after a delay, and checks each
page to see whether it's been accessed (and so marked as recently referenced)
since the first hand cleared its referenced bit. Those which have not been
accessed may be stolen (paged out and the memory made available to other
users).
In actual implementation, vhand steps through memory by following a doubly linked list of pregions, called the active pregion list. It doesn't step through all pregions each time it is woken, and normally looks at only a portion of the pages in each pregion. Since memory used for the file system buffer cache isn't associated with any pregion, a special dummy pregion called bufcache_preg is used to put it in the list of things for vhand to scan.
Using pregions rather than simply scanning all pages (e.g. using the pfdats) has the advantage of automatically skipping kernel memory, and memory that's already free.
However, it has the disadvantage of putting all the memory belonging to a single process together. Thus, when the steal hand reached that process' pregions, all the pages it stole would come from that one process, leaving it frantically paging back in its working set ... essentially thrashing. (This is particularly ugly if the process happens to be interactive and awaiting user input ... the user doesn't want to wait for large numbers of pageins before his program responds to his mouse movement.) This is why only a portion of each pregion is aged or stolen on each pass, and vhand thus needs multiple passes through the active pregion list to visit all of pagable memory.
It's important to keep an appropriate distance between the hands. Too close, and pages are stolen that are in fact in regular use. Too far, and the hands have to move faster to keep the same steal rate; this means that vhand will consume more CPU time. The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency with which vhand runs.
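A schematic sketch of the two hands' per-page actions, assuming each page exposes a reference bit; the quotas and hand positions tracked in the table below are omitted.
```c
#include <stdbool.h>

struct page_sketch {
    bool referenced;   /* set on access, cleared by the age hand */
    bool in_use;
};

/* Age hand: clear the reference bit so a later access can re-set it. */
void age_page(struct page_sketch *p) {
    p->referenced = false;
}

/* Steal hand: a page still unreferenced since it was aged is idle
 * enough to be paged out. Returns true if the page was stolen. */
bool steal_page(struct page_sketch *p) {
    if (p->in_use && !p->referenced) {
        /* schedule pageout and free the frame (not shown) */
        p->in_use = false;
        return true;
    }
    return false;
}
```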
Table 21 pregion Elements used by vhand
| Element | Purpose |
|---|---|
| p_agescan | Last age hand location |
| p_stealscan | Last steal hand location |
| p_ageremain | Remaining pages to be aged |
| p_bestnice | Best nice value of all processes sharing the underlying region |
| p_forw, p_back | Links in active pregion list |
The two hands cycle through the active pregion linked list of physical memory to look for memory pages that have not been referenced recently and move them to secondary storage - the swap space. Pages that have not been referenced from the time the age hand passes to the time the steal hand passes are pushed out of memory. The hands rotate at a variable rate determined by the demand for memory.
The vhand daemon decides when to start paging by determining how much free memory is available. Once free memory drops below the gpgslim threshold, paging occurs. vhand attempts to free enough pages to bring the supply of memory back up to gpgslim. The page daemon continues to age pages (that is, clear their reference bits) when woken even if there's enough memory that it doesn't need to steal pages; of course, it won't be woken very often in that situation.
Factors Affecting vhand
vhand responds to various workloads, transient situations, and memory configurations. When aging and stealing from pregions, vhand:
Uses the pregion field p_agescan to track the last age hand location.
Uses the pregion field p_ageremain to track remaining pages to be aged.
Uses the pregion field p_stealscan to track the last steal hand location.
Pushes a region's vfd/dbd pairs to swap if they have no valid pages.
When the age hand arrives at a pregion, it ages some constant fraction of pages before moving to the next region (by default 1/16 of the region's total pages). The p_agescan tag enables the age hand to move to the location within a pregion where it left off during its previous pass, while p_ageremain charts how many pages must be aged to fill the 1/16 quota before moving on to the next pregion.
The steal hand uses the pregion field p_stealscan to locate itself within a pregion and resume taking pages that have not been referenced since last aged. If no valid pages remain, vhand pushes out of memory the vfd/dbd pairs associated with the region.
How much to age and steal depends on several factors, including how often vhand runs (by default eight times per second) and how far available memory has fallen below gpgslim.
vhand is biased against threads that have nice priorities: the
nicer a thread, the more likely vhand will steal its pages. The
pregion field p_bestnice reflects the best
(numerically, the smallest value) nice value of all threads sharing a
pregion.
What Happens when
vhand Wakes Up
vhand uses the SCRITICAL flag to get access to
the system critical memory pool. (The SCRITICAL flag for the
vhand process is set when the process starts running for the
first time.)
vhand establishes pagecounts for pages to age and pages to
steal.
vhand updates the value of gpgslim, based on the value of memzeroperiod.
vhand updates pageoutrate, using
pageoutcnt.
vhand updates targetlaps, the number of desired laps between the age and steal hands. If fewer CPU cycles are being used than the value of targetcpu, vhand increases the value of targetlaps (up to a maximum of 15); if more CPU cycles are being used than targetcpu, targetlaps is decreased.
vhand updates agerate, the number of pages to
age per second.
If vhandinfoticks is non-zero, diagnostic information prints to the console.
Refer to the table that follows for explanations of the vhand variables.
NOTE: None of the variables in the table that follows may be tuned.
Table 22 Variables Affecting vhand
| Variable | Purpose |
|---|---|
| memzeroperiod | Minimum time period (default=3 seconds) permissible between freemem-reaches-zero events; determines how often gpgslim is adjusted when vhand() is running. gpgslim is incremented if freemem reaches zero twice within memzeroperiod; it is decremented if not. |
| pageoutrate | Current pageout rate, calculated empirically from number of pageouts completed. |
| pageoutcnt | Recent count of pageouts completed. |
| targetlaps | Ideal gap between steal and age hands for handlaps; adapts at run time. During normal operation, the hands should be as far apart as possible to give processes maximum time to reset a cleared reference bit being used by a page. targetlaps is defined in the kernel as a static variable; it does not appear in the symbol table. |
| targetcpu | Maximum percentage of CPU vhand should spend paging (default=10%). |
| handlaps | Actual number of laps between the age and steal hands. |
| agerate | Number of pages the age hand visits to age per second; adapts continually to system load. agerate is defined in the kernel as a static variable (meaning that it does not appear in the symbol table). |
| stealrate | How many pages the steal hand visits per second; adapts continually to system load. stealrate is defined in the kernel as a static variable (meaning that it does not appear in the symbol table). |
How vhand Steals and Ages Pages
Once vhand establishes its criteria, it proceeds to traverse the linked list of pregions. Continuing in the clock-hands analogy, vhand is ready to move its hands. Note that the steal hand is moved first, to keep it behind the age hand and prevent aging and stealing a page in the same cycle.
First, vhand determines how many pages and what pages are available to steal.
When the steal hand reaches bufcache_preg, vhand steals buffers from the buffer cache with the stealbuffers() routine. The global parameter dbc_steal_factor determines how much more aggressively to steal buffer cache pages than pregion pages. If dbc_steal_factor has a value of 16, buffer cache pages are treated no differently than pregion pages; the default value of 48 means that buffer cache pages are stolen three times as aggressively as pregion pages.
When the steal hand reaches a pregion whose region has no valid pages (that is, r_nvalid == 0), and none of the processes using the region are loaded in memory (that is, r_incore == 0), vhand pushes its B-tree out to the swap device.
Otherwise, vhand steals all unreferenced pages between p_stealhand and (p_agescan - p_count/16 * handlaps), up to the steal quota; the sketch after this list shows the window computation. vhand then updates p_stealscan to the page number following the last stolen page of the affected pregion.
If vhand has not stolen as many pages as permissible, it moves to the next pregion and repeats the process until it satisfies the system's demand.
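The steal window quoted above reduces to this arithmetic (p_count is the pregion's page count); wraparound of the hands within the pregion is ignored in this sketch.
```c
/* Upper end of the steal window for one pregion: the age hand's
 * position minus one 1/16 aging quantum per lap separating the hands. */
long steal_window_end(long p_agescan, long p_count, long handlaps) {
    return p_agescan - (p_count / 16) * handlaps;
}
```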
Next, vhand moves the age hand to clear the reference bit from a selected number of pages.
When the age hand reaches bufcache_preg, vhand ages one sixteenth of the pages in the buffer cache with the agebuffers() routine.
For each region, vhand determines the best nice value (that is, the lowest number) of all the pregions using the region. For each page in the region, if the nice value is less than a randomly generated number, vhand does not age the page. (That is, pages belonging to higher priority processes (numerically low nice values) are less likely to be aged.)
vhand ages all pages between p_agehand and (p_agehand + p_ageremain) by clearing the pde_ref bit and purging the TLB. vhand then updates p_agescan to be the page number after the last page scanned (and potentially aged) in the affected pregion.
The sched() Routine
The sched() routine (colloquially termed "the swapper") handles the deactivation and reactivation of processes when free memory falls below minfree, or when the system appears to be thrashing. sched() chooses to deactivate on a process level and then deactivates each thread.
NOTE: Deactivation occurs on a per-thread basis.
Deactivation occurs when sched() determines the system is thrashing, or when freemem falls below the deactivation threshold minfree and more than one process is running.
Reactivation occurs when the system is no longer low on memory or thrashing.
What to Deactivate or Reactivate
sched() deactivates processes and prevents them from running, thus reducing the rate at which new pages are accessed. Once sched() detects that available memory has risen above minfree and the system is not thrashing, sched() reactivates the deactivated processes and continues monitoring memory availability.
Deactivation and reactivation are determined by the following:
If the system appears to be thrashing or experiencing memory pressure, the sched() routine walks through the active process list calculating each process's deactivation priority based on type, state, length of time in memory, and how long it has been sleeping. (Batch processes and processes marked for serialization by the serialize() command are more likely to be deactivated than interactive processes.) The best candidate is then marked for deactivation.
If the system is not thrashing or experiencing memory pressure, the sched routine walks through the active process list calculating each deactivated process' reactivation priority based on how long it has been deactivated, its size, state, and type. Batch processes and those marked by the serialize() command are less likely to be reactivated than is an interactive process. Once the most deserving process has been determined, it is reactivated.
When a Process is Deactivated
Once a process is chosen for deactivation, sched():
Sets the SDEACT flag in the proc struct and the TSDEACT flag in each thread struct.
Adds the uareas to the active pregion list so that vhand can page them out.
Positions the pregions associated with the target process in front of the steal hand, so that vhand can steal from them immediately.
Allows vhand to scan and steal pages from the entire pregion, instead of 1/16.
Eventually, vhand pushes the deactivated process's pages to secondary storage.
Processes stay deactivated until the system has freed up enough memory and the paging rate has slowed sufficiently to reactivate processes. The process with the highest reactivation priority is then reactivated.
Once a process is chosen for reactivation, sched():
Removes the uareas from the active pregion list.
Brings the uareas back into memory.
Earlier HP-UX implementations did not permit a process to be swapped out if
it was holding a lock, doing I/O, or was not at a signalable priority. Even if
priority made it most likely to be deactivated, vhand bypassed
the process.
Now, if the most deserving process cannot be deactivated immediately, it is
marked for self-deactivation; that is, sched() sets the
SDEACTSELF flag on its proc struct and the TSDEACTSELF flag on each of its thread structs. The next
time one of the threads must fault in a page, the thread deactivates the
process.
Thrashing is defined as low CPU usage with high paging rate. Thrashing might occur when several processes are running, several processes are waiting for I/O to complete, or active processes have been marked for serialization.
On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy deactivating/reactivating, and swapping pages in and out that the system spends too much time paging and not enough time running processes.
When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, because it is doing more overhead than productive work.
If your working set is larger than physical memory, the system will thrash. To solve the problem, reduce the number of processes competing for memory (for example, by serializing large processes, as described below) or add physical memory.
If you are left with one huge process constrained by physical memory and the system still thrashes, you will need to rewrite the application so that it uses fewer pages simultaneously, by grouping data structures according to access, for example.
All processes marked by the serialize command are run serially. This functionality unjams the bottleneck (recognizable by process throughput degradation) caused by groups of large processes contending for the CPU. By running large processes one at a time, the system can make more efficient use of the CPU as well as system memory since each process does not end up constantly faulting in its working set, only to have the pages stolen when another process starts running.
As long as there is enough memory in the system, processes marked by
serialize() behave no differently than other processes in the
system. However, once memory becomes tight, processes marked by serialize are
run one at a time in priority order. Each process runs for a finite interval
of time before another serialized process may run. The user cannot enforce an
execution order on serialized processes.
serialize() can be run from the command line or with a
PID value. serialize() also has a timeshare option
that returns the PID specified to normal timeshare scheduling
algorithms.
If serialization is insufficient to eliminate thrashing, you will need to add more main memory to the system.
Since vhand() is tuned to be nice regarding I/O usage and CPU
usage, it allows the pager to fault out swapped processes. The swapper marks
the process to be swapped for deactivation, and takes its threads off the run
queue. Since it cannot run, once its pages are aged, they cannot be referenced
again. When the steal hand comes around, it steals all the pages in the
region.
When memory pressure is high, sched() selects a process to
swap using the routine choose_deactivate(). This routine is
biased to choose non-interactive processes over interactive ones, sleeping
processes over running ones, and long-running processes over newer ones.
Once a process has been chosen to be deactivated, the following actions occur:
The process's SDEACT flag and its threads' TSDEACT flags are set.
If the process cannot be deactivated immediately, its SDEACTSELF flag and its threads' TSDEACTSELF flags are set. When I/O completes, the process deactivates in the paging routines.
p_deactime in the proc structure and the threads' kt_deactime in the kthread structure are set to the current time to establish a record of how long the process is deactivated.
The process's pregions are positioned in the active pregion chain to ready it for the steal hand.
The uarea pregions are added to the list of active pregions for them to get paged out.
The global deactive_cnt is incremented.
A process that has been inactive long enough for all its pages to have been aged and stolen is virtually swapped out already. The global deactprocs points to the head of a list of inactive processes, its chain running through the pregion element p_nextdeact.
When memory pressure eases, a deactivated process is reactivated. The choose_reactivate() routine is biased to choose interactive processes over non-interactive ones, runnable processes over sleeping ones, and processes that have been deactivated longest over those more recently deactivated.
Now, however, HP-UX provides the option of using Memory Resource Groups to
assign a group of processes their own memory pool. These processes are in
effect given their own physmem_pages, freemem,
minfree, desfree, lotsfree,
gpgslim, and so on.
This allows groups of processes to page independently, producing a lot less interference between them. This may be useful for server consolidation, where several applications originally written for individual servers are instead run together on a single larger server.
With Memory Resource Groups, vhand and sched
behave almost as if each MRG were completely separate, with its own individual
pager and swapper. (The actual implementation is a bit more complex, as it
must account for processes and memory moving between MRGs, the ability for one
MRG to borrow memory from another, memory use that can't be assigned to any
single process (or any MRG), and the need to maintain global memory
availability as well as individual MRG memory availability.) The global
variables discussed above are still present, and act as a summary of the
overall system state.
Swap space is an area on a high-speed storage device (almost always a disk drive), reserved for use by the virtual memory system for deactivation and paging processes. At least one swap device (primary swap) must be present on the system.
During system startup, the location (disk block number) and size of each swap device is displayed in 512-KB blocks. You can add swap as needed (that is, dynamically) while the system is running, without having to regenerate the kernel.
The swapper reserves swap space at process creation time, but does not allocate swap space from the disk until pages need to go out to disk. Reserving swap at process creation protects the swapper from running out of swap space.
HP-UX uses both physical and pseudo-swap to enable efficient execution of programs.
System memory used for swap space is called pseudo-swap space. It allows
users to execute processes in memory without allocating physical swap.
Pseudo-swap is controlled by an operating-system parameter; by default,
swapmem_on is set to 1, enabling pseudo-swap.
Typically, when the system executes a process, swap space is reserved for the entire process, in case it must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. Although this protects the system from running out of swap space, disk space reserved for swap is under-utilized if minimal or no swapping occurs.
To avoid such waste of resources, HP-UX is configured to access up to 7/8 of system memory capacity as pseudo-swap. This means that system memory serves two functions: as process-execution space and as swap space. By using pseudo-swap space, a two-gigabyte memory system with two-gigabyte of swap can run up to 3.75 GB of processes. As before, if a process attempts to grow or be created beyond this extended threshold, it will fail.
When using pseudo-swap for swap, the pages are locked; as the amount of pseudo-swap increases, the amount of lockable memory decreases.
For factory-floor systems (such as controllers), which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: you can either lock the application in memory or make sure the total number of processes created does not exceed 7/8 of system memory.
When the number of processes created approaches capacity, the system might
exhibit thrashing and a decrease in system response time. If necessary, you
can disable pseudo-swap space by setting the tunable parameter
swapmem_on in /usr/conf/master.d/core-hpux to zero.
A NULL-terminated, doubly linked list of regions that have pseudo-swap allocated begins at pswaplist.
File-system swap space is located on a mounted file system and can vary in size with the system's swapping activity. However, its throughput is slower than device swap, because free file-system blocks may not always be contiguous, leading to extra read/write requests, and because of the extra overhead of an additional layer of code.
To optimize system performance, file-system swap space is allocated and
de-allocated in swchunk-sized chunks. swchunk is a
configurable operating system parameter; its default is 2048 KB (2 MB). Once a
chunk of file system space is no longer being used by the paging system, it is
released for file system use, unless it has been preallocated with swapon.
If swapping to file-system swap space, each chunk of swap space is a file
in the file system swap directory, and has a name constructed from the system
name and the swaptab index (such as becky.6 for
swaptab[6] on a system named becky).
Several configurable parameters deal with swap space:
| Parameter | Purpose |
|---|---|
| swchunk | The number of DEV_BSIZE blocks in a unit of swap space; by default, 2 MB on all systems. |
| maxswapchunks | Maximum number of swap chunks allowed on a system. |
| swapmem_on | Parameter allowing creation of more processes than you have physical swap space for, by using pseudo-swap. |
There are a number of kernel global variables related to swap space, shown in the next table. The most important to swap space reservation are swapspc_cnt, swapspc_max, swapmem_cnt, swapmem_max, and sys_mem.
| Variable | Meaning |
|---|---|
| bswlist | Head of free swap header list. |
| swdevt[] | Device swap table. |
| fswdevt[] | File system swap table. |
| swaptab[] | Table of swap chunks. |
| swapphys_cnt | Pages of available physical swap space on disk. This counts unallocated pages, whether or not they've been reserved; swapspc_cnt (below) counts only unreserved pages. |
| swapphys_buf | Pages of physical swap space to keep available. (If swapphys_cnt becomes less than this, vhand's age hand will free swap space when it finds that the in-memory copy of a page is newer than the on-disk copy. Of course this means that swap space will need to be allocated again when the page needs to be paged out.) |
| swapspc_cnt | Total amount of swap currently available on all devices and file systems enabled, in units of pages. Updated each time swap is reserved or released, as well as each time a device or file system is enabled for swapping. |
| swapspc_max | Total amount of device and file-system swap currently enabled on the system, in units of pages. Updated each time a device or file system is enabled for swapping. |
| swapmem_cnt | Total number of pages of pseudo-swap currently available. Initialized to swapmem_max. |
| swapmem_max | Maximum number of pages of pseudo-swap enabled. Initialized to 7/8 of available system memory. |
| pswaplist | Linked list of regions using pseudo-swap. |
| maxdev_pri | Highest available swap device priority. |
| maxfs_pri | Highest available swap file system priority. |
| phys_mem_pages | Page count of physical memory on the system. |
| sys_mem | Number of pages of memory not available for use as pseudo-swap. Normally initialized to 1/8 available system memory + 25 pages + sysmem_max pages. |
| sysmem_max | Added to sys_mem (number of pages not available for pseudo-swap) during system initialization on systems with device swap available, provided this leaves swapmem_max > 0. |
| maxmem | Set to the initial value of freemem after allocation of the initial dbc_min_pct of phys_mem_pages for the dynamic buffer cache. maxmem - swapmem_max is used as an upper limit for sys_mem when the kernel is returning pages stolen from pseudo-swap. |
| freemem | Page count of total remaining unreserved blocks of free memory. |
| freemem_cnt | Number of threads sleeping on global_freemem to wait for memory. (There are other ways to wait for memory which are not counted here.) |
System swap space values are calculated as follows:
Total swap enabled: swapspc_max (for device swap and file system swap) + swapmem_max (for pseudo-swap).
Swap currently in use: swapspc_max - [sum(swdevt[n].sw_nfpgs) + sum(fswdevt[n].fsw_nfpgs)] (for device swap and file system swap) + (swapmem_max - swapmem_cnt) (for pseudo-swap).
In HP-UX, only data area growth (using sbrk()) or stack growth will cause a process to die for lack of swap space. Program text does not use swap.
Swap reservation is a numbers game. The system has a finite number of pages of physical swap space. By decrementing the appropriate counters, HP-UX reserves space for its processes.
Most UNIX systems and UNIX-like systems allocate swap when needed. However,
if the system runs out of swap space but needs to write a process' page(s) to
a swap device, it has no alternative but to kill the process. To alleviate
this problem, HP-UX reserves swap at the time the process is
forked or exec'd. When a new process is forked or
executed, if insufficient swap space is available and reserved to handle the
entire process, the process may not execute.
At system startup, swapspc_cnt and swapmem_cnt
are initialized to the total amount of swap space and pseudo-swap available.
Whenever the swapon() call is made to add device or file
system swap, the amount of swap newly enabled is converted to units of pages
and added to the two global swap-reservation counters swapspc_max
(total enabled swap) and swapspc_cnt (available swap space).
Each time swap space is reserved for a process (that is, at process
creation or growth time), swapspc_cnt is decremented by the
number of pages required. The kernel does not actually assign disk blocks
until needed.
Once swap space is exhausted (that is, swapspc_cnt == 0), any
subsequent request to reserve swap causes the system to allocate additional
chunks of file-system swap space. If successful, both swapspc_max
and swapspc_cnt are updated and the current (and subsequent
requests) can be satisfied. If a file-system chunk cannot be allocated, the
request fails, unless pseudo-swap is available.
When swap space is no longer needed (due to process termination or
shrinkage), swapspc_cnt is incremented by the number of pages
freed. swapspc_cnt never exceeds swapspc_max and is
always greater than or equal to zero. If a chunk of file-system swap is no
longer needed, it is released back to the file system and
swapspc_max and swapspc_cnt are updated.
If no device or file system swap space is available, the system uses
pseudo-swap as a last resort. It decrements swapmem_cnt and locks
the pages into memory. Pseudo-swap is either free or allocated; it is never
reserved.
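The reservation order described above (device and file-system swap first, then a newly allocated file-system chunk, then pseudo-swap) can be sketched as follows; alloc_fs_chunk_sketch stands in for the chunk-allocation path and is not a real kernel routine.
```c
extern long swapspc_cnt, swapspc_max, swapmem_cnt;

/* Try to enable one more chunk of file-system swap; placeholder. */
extern int alloc_fs_chunk_sketch(long *pages_added);

/* Reserve npg pages of swap for a process; returns 0 on success. */
int reserve_swap_sketch(long npg) {
    long added;
    if (swapspc_cnt < npg && alloc_fs_chunk_sketch(&added) == 0) {
        swapspc_max += added;            /* grow file-system swap */
        swapspc_cnt += added;
    }
    if (swapspc_cnt >= npg) {            /* device/file-system swap */
        swapspc_cnt -= npg;
        return 0;
    }
    if (swapmem_cnt >= npg) {            /* last resort: pseudo-swap */
        swapmem_cnt -= npg;              /* these pages get locked in memory */
        return 0;
    }
    return -1;                           /* fork/exec/growth fails */
}
```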
The rswap_lock spinlock guards the swap reservation structures
swapspc_cnt, swapspc_max, swapmem_cnt,
swapmem_max, sys_mem, and pswaplist.
Approximately 7/8 of available system memory is available as pseudo-swap
space if the tunable parameter swapmem_on is set to 1.
Pseudo-swap is tracked in the global pseudo-swap reservation counters
swapmem_max (enabled pseudo-swap) and swapmem_cnt
(currently available pseudo-swap). If physical swap space is exhausted and no
additional file-system swap can be acquired, pseudo-swap space is reserved for
the process by decrementing swapmem_cnt.
For example, on a 256 MB system, swapmem_max and
swapmem_cnt track approximately 224 MB of pseudo-swap space, the
remainder tracked by the global sys_mem, which represents the
number of pages reserved for system use only.
Processes track the number of pseudo-swap pages allocated to them by
incrementing a per region counter r_swapmem. All regions using
pseudo swap are linked on the pseudo-swap list pswaplist. Once
both device swap and pseudo-swap are exhausted (that is,
swapspc_cnt==0 and swapmem_cnt==0), attempts at
process creation or growth will fail.
Once a process no longer needs its allocated pseudo-swap space,
swapmem_cnt is incremented by the amount released and
r_swapmem is updated.
Pseudo-swap consumes memory that could otherwise be used for other purposes
(see the sections below), so it is used sparingly. The operating system
periodically checks to see if physical swap space has been recently freed. If
it has, the system attempts to migrate processes using pseudo-swap only to use
the available physical swap by walking the doubly linked list of pseudo-swap
regions. swapspc_cnt is decremented by the r_swapmem
value for each region on the list until either swapspc_cnt drops
to zero or no other regions utilize pseudo-swap. swapmem_cnt is
then incremented by the amount of pseudo-swap successfully migrated.
Pseudo-Swap competes with the kernel for the use of system memory. 1/8 of
available memory (sys_mem pages) is initially made unavailable for
pseudo-swap use; however, this is nowhere near enough to handle both kernel
dynamic memory and buffer cache space. Instead, the kernel "steals" memory
from pseudo-swap for these purposes, decrementing swapmem_cnt
when it steals a page; once swapmem_cnt reaches zero, it starts
taking pages from sys_mem until that too reaches zero.
When "stolen" pseudo-swap is returned, the amount being released is first
added to sys_mem. Once sys_mem grows to its maximum
value (maxmem - swapmem_max), any additional pages returned are
used to increase swapmem_cnt.
Because pseudo-swap is related to system memory usage, the swap reservation scheme reflects lockable memory policies.
Although the system is not necessarily allocating additional memory when a
process locks itself into memory, locked pages are no longer available for
general use. This causes swapmem_cnt to be decremented to account
for the pages. swapmem_cnt is also decremented by the size of the
entire process if that process gets plocked in memory.
All swap devices and file systems enabled for swap have an associated
priority, ranging from 0 to 10, indicating the order that swap space from a
device or file system is used. System administrators can specify swap-space
priority using a parameter of the swapon(1M) command.
Swapping rotates among both devices and file systems of equal priority. Given equal priority, however, devices are swapped to by the operating system before file systems, because devices make more efficient use of CPU time.
We recommend that you assign the same priority to most swap devices, unless a device is significantly slower than the rest. Assigning equal priorities limits disk head movement, which improves paging performance.
swdev_pri swdevt swaptab
+---------+ +--------+ /+--------+
0| |----->| dev1 |-----> +--------+
+---------+ +-| | \+--------+
1| |\ | +--------+ /+--------+
+---------+ \ +>| dev2 |-----> +--------+
| | \ | | | +--------+
| | \ +--------+ | +--------+
| | \>| dev3 |\ \+--------+
10+---------+ | | \ /+--------+
+--------+ \ > +--------+
| | \. \+--------+
| | \ /+--------+
+--------+ . > +--------+
: | +--------+
swfs_pri . | +--------+
+---------+ : \+--------+
0| | fswdevt . | |
+---------+ +--------+: | |
1| |----->| fs1 |. | |
+---------+ | | | |
| | +--------+ | |
| | | | | |
| | | | | |
10+---------+ +--------+ +--------+
Swap space is allocated on HP-UX using the following data structures:
A device swap priority table (swdev_pri[]), used to link together swap devices with the same priority. That is, the entry in swdev_pri[n] is the head of a list of swap devices having priority n. The first field in the swdev_pri[] structure is the head of the list; the sw_next field in the swdevt[] structure links each device into the appropriate priority list.
A file-system swap priority table (swfs_pri[]), which serves the same purpose as swdev_pri[], but for file system swap priority.
The device swap table (swdevt[], struct swdevt), used to establish the fundamental swap device information.
The file-system swap table (fswdevt[], struct fswdevt), for supplementary file-system swap.
The swap table (swaptab[], struct swaptab), which keeps track of the available free pages of swap space.
The swap map (struct swapmap), whose entries together with swaptab combine for a swap disk block descriptor.
The following table details the elements of the struct swdevt.
swdevt[] (struct swdevt)
| Element | Meaning |
|---|---|
| sw_dev | Actual swap device, as defined by its major (upper 8 bits) and minor (lower 24 bits) numbers. |
| sw_flags | Several flags. The SW_ENABLE flag indicates that swap has been enabled on this device. |
| sw_start | Offset into the swap area on disk, in kilobytes. |
| sw_nblksavail | Size of swap area, in kilobytes. |
| sw_nblksenabled | Number of blocks enabled for swap. Must be a multiple of swchunk (2 MB default). |
| sw_nfpgs | Number of free swap pages on the device. Updated whenever a page is used or freed. |
| sw_priority | Priority of swap device (0-10). |
| sw_head, sw_tail | Indexes of first and last swaptab[] entry associated with this swap device. |
| sw_next | Pointer to the next device swap entry (swdevt) at this priority; implemented as a circular list used to update the pointer in swdev_pri for round-robin use of all devices at a particular priority. |
The following table details the elements of the struct fswdevt.
fswdevt[] (struct fswdevt)
| Element | Meaning |
|---|---|
| fsw_next | Pointer to next file system swap (fswdevt entry) at this priority; implemented as a circular list. |
| fsw_flags | Several flags. The FSW_ENABLE flag indicates that swap has been enabled on this file system. |
| fsw_nfpgs | Number of free swap pages in this file system swap; updated whenever a page is used or freed. |
| fsw_allocated | Number of swchunks allocated on this file system for swap. |
| fsw_min | Minimum swchunks to be preallocated when file system swap is enabled. |
| fsw_limit | Maximum swchunks allowed on file system; unlimited if set to zero. |
| fsw_reserve | Minimum blocks (of size fsw_bsize) reserved for non-swap use on this file system. |
| fsw_priority | Priority of file system (0-10). |
| fsw_vnode | vnode of the file system swap directory (/paging) under which the swap files are created. |
| fsw_bsize | Block size used on this file system; used to determine how much space fsw_reserve is reserving. |
| fsw_head, fsw_tail | Index into swaptab[] of first and last entry associated with this file system swap. |
| fsw_mntpoint | File system mount point; character representation of fsw_vnode, used for utilities (such as swapinfo(1M)) and error messages. |
swaptab and swapmap Structures

Two structures track swap space. The swaptab[] array tracks chunks of swap space; swapmap entries hold swap information on a per-page level. By default, a swaptab entry tracks a 2MB chunk of space, and swapmap tracks each page within that 2MB chunk.

Each entry in the swaptab[] array has a pointer (called st_swpmp) to a unique swapmap. swapmap entries have backwards pointers to the swaptab index. There is one entry in the swapmap for each page represented by the swaptab entry (default 2 MB, or 512 pages); that is, swapmap conforms in size to swchunk.

A linked list of free swap pages begins at the swaptab entry's st_free and uses each free swapmap entry's sm_next. When a page of swap is needed, the kernel walks the structures (using the get_swap() routine in vm_swalloc.c), which calls other routines that actually locate the chunk, and so forth. The walk proceeds as follows (a pseudocode sketch follows the list):
- The kernel starts with swdev_pri[].curr, which points to a swdevt entry.
- If sw_nfpgs is zero (no free pages), we follow the pointer sw_next to get the next swdevt entry at this priority.
- If no device swap with free pages is found, we check swfs_pri[].curr, the file system swap at this priority, checking fsw_nfpgs for free pages.
- Once we find a swdevt or fswdevt with free pages, we walk that device's swaptab list, starting with sw_head or fsw_head and using st_next in each swaptab entry, until we find a swaptab entry with non-zero st_nfpgs.
- That entry's st_free points to the first free swapmap entry (and thus the first free page) in this swaptab chunk.
- The get_swchunk() routine creates a disk block descriptor (dbd) using 14 bits of dbd_data for the swaptab index and 14 bits for the swapmap index.
- The r_bstore in the region is set to the disk device swapdev_vp and the dbd is marked DBD_BSTORE.
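A hedged pseudocode rendering of this walk, reusing the simplified structures from the earlier sketch (find_free_chunk is an invented name, not the actual get_swap() source):

```c
/* Sketch of the allocation walk above; not the real get_swap() source. */
extern struct swpri   swdev_pri[];   /* priority heads (invented wrapper) */
extern struct swaptab swaptab[];

int find_free_chunk(int pri)
{
    struct swdevt *start = swdev_pri[pri].curr;
    struct swdevt *dev = start;

    while (dev != NULL) {
        if (dev->sw_nfpgs > 0) {
            /* Walk this device's swaptab list from sw_head via st_next. */
            int st;
            for (st = dev->sw_head; st != -1; st = swaptab[st].st_next)
                if (swaptab[st].st_nfpgs > 0)
                    return st;       /* st_free locates the first free page */
        }
        dev = dev->sw_next;          /* next device at this priority */
        if (dev == start)
            break;                   /* circular list: back at the start */
    }
    /* Nothing free on device swap at this priority; the real code would
     * now check swfs_pri[pri].curr (file-system swap) the same way. */
    return -1;
}
```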
When faulting in from swap, the same process is followed as for faulting
in from the file system: r_bstore and dbd_data are
hashed together and checked for a soft fault, then
devswap_pagein() is called. The devswap_pagein()
routine uses the dbd_data as a 14-bit swaptab
index and a 14-bit swapmap index to determine the location of
the page on disk.
Now all information needed to retrieve the page from swap has been stored.
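A hedged sketch of that decoding; the macro names, and the assumption that the swaptab index occupies the upper 14 bits, are invented for illustration:

```c
/* Illustrative decode of a DBD_BSTORE descriptor's dbd_data field:
 * a 14-bit swaptab index plus a 14-bit swapmap index, per the text. */
#define DBD_SWPTB(data) (((data) >> 14) & 0x3fff)   /* swaptab index */
#define DBD_SWPMP(data) ((data) & 0x3fff)           /* swapmap index */

/* devswap_pagein()-style use: locate the page within the swap area.
 * A default swchunk holds 512 four-KB pages. */
unsigned long swap_page_number(unsigned long dbd_data)
{
    unsigned long st = DBD_SWPTB(dbd_data);   /* which 2MB chunk */
    unsigned long sm = DBD_SWPMP(dbd_data);   /* which page in the chunk */
    return st * 512 + sm;
}
```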
swaptab and swapmap Structures swapmap
>+---------+
/ | |
/ | |
/ | |
swaptab entry / | |
+-->+------------+ / | |
| | | / | |
| +------------+ / | |
| | st_swpmp |/ +---------+
| | st_free |-------->| sm_next |---+
| +------------+ +---------+ |
| | | | | |
| +------------+ +---------+<--+
| | sm_next |---+
| +---------+ |
| | | |
| | | |
| | | |
| +---------+<--+
| | sm_next |---+
+---+-+--------------+--------------+ +---------+ |
| | | dbd_swptb | dbd_swpmp |->| | -----
| | | (14 bits) | (14 bits) | +---------+ ---
+---+-+--------------+--------------+ | | -
| | |
+--- dbd_type (3 bits) = DBD_BSTORE +---------+
swaptab[] (struct swaptab)

| Element | Meaning |
|---|---|
| st_free | Index to the first free page in the chunk. Each entry maps to a 4KB page of swap. |
| st_next | Index to the next swaptab entry for the same device or file-system swap; at the end of the list, st_next is -1. |
| st_flags | ST_INDEL: file-system swap flag, indicating the chunk is being deleted; do not allocate pages from it. Set only by the swapdel() routine. ST_FREE: file-system swap flag, indicating the chunk may be deleted, because none of its pages are in use. In the case of remote swap, the chunk should not be deleted immediately; set st_free_time to the current time plus 30 minutes (1800 seconds) when setting this flag. Once 30 minutes have elapsed, the chunk can be freed. If the chunk is needed during the interim, the flag can be cleared. ST_INUSE: the swaptab entry is being changed. |
| st_dev, st_fsp | Pointers to the swdevt[] or fswdevt[] entry that references the swaptab entry. |
| st_vnode | Vnode of device or swap file. |
| st_nfpgs | Number of free pages in this (swchunk) swaptab entry. |
| st_swpmp | Pointer to the swapmap[] array that defines this swchunk of swap pages. |
| st_free_time | Indicates when a remote file-system chunk can be freed (see the explanation of the ST_FREE flag). |
swapmap[] (struct swapmap)

| Element | Meaning |
|---|---|
| sm_ucnt | Number of threads using the page. When decremented to zero, the swap page is free and the free-pages linked list can be updated. |
| sm_next | Index of the next free page in the swapmap[]. This is valid only if sm_ucnt is zero; that means this swapmap entry is included in the linked list beginning with swaptab's st_free. |
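As a worked example of how st_free, sm_next, sm_ucnt, and st_nfpgs cooperate, here is a sketch of taking one page from a chunk's free list; the update logic is inferred from the field descriptions above and reuses the earlier simplified structures, so it is not kernel source:

```c
/* Worked example: take one page off a chunk's free list. */
int alloc_swap_page(struct swaptab *st)
{
    struct swapmap *sm;
    int page;

    if (st->st_nfpgs == 0)
        return -1;                /* no free pages in this chunk */

    page = st->st_free;           /* first free swapmap entry */
    sm = &st->st_swpmp[page];

    st->st_free = sm->sm_next;    /* unlink the page from the free list */
    sm->sm_ucnt = 1;              /* one thread now uses the page */
    st->st_nfpgs--;               /* chunk has one fewer free page */
    return page;                  /* swapmap index within the chunk */
}
```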
Recall that for a process to execute, all the regions (for
data, text, and so forth) have to be set up; yet pages are not loaded into
memory until the process demands them. Only when the actual page is accessed
is a translation established.
A compiled program has a header containing information on the size of the
data and code regions. As a process is created from the compiled code by fork
and exec, the kernel sets up the process's data structures and the process
starts executing its instructions from user mode. When the process tries to
access an address that is not currently in main memory, a page fault occurs.
(For example, you might attempt to execute from a page not in memory.) The
kernel switches execution from user mode to kernel mode and tries to resolve
the page fault by locating the pregion containing the
sought-after virtual address. The kernel then uses the pregion's
offset and region to locate information needed for reading in the
page.
If the translation is not already present and the page is required, the
pdapage() routine executes to add the translation (space ID,
offset into the page, protection ID and access permissions assigned the page,
and logical frame number of the page), and then on demand brings in that page
and sets up the translation, hashes in the table, and all the rest.
In main memory, the kernel also looks for a free physical page in which to load the requested page. If no free page is available, the system pages out selected used pages to make room for the requested page. The kernel then retrieves (pages in) the required page from file space on disk. It also often pages in additional (adjacent) pages that the process might need.
Then the kernel sets up the page's permissions and protections, and exits back to user mode. The process executes the instruction again, this time finding the page and continuing to execute.
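The sequence can be summarized in outline form; every helper name below is invented for illustration and merely stands in for the kernel machinery just described:

```c
/* Outline of demand-paging fault resolution as described above.
 * All names here are invented; none is an actual HP-UX routine. */
typedef unsigned long vaddr_t;
struct pregion;
struct page;

struct pregion *find_pregion(vaddr_t addr);   /* search process pregions */
struct page    *get_free_page(void);          /* may page out others first */
void page_in(struct pregion *prp, vaddr_t addr, struct page *pg);
void enter_translation(vaddr_t addr, struct page *pg);  /* pde into PDIR */

void resolve_fault(vaddr_t addr)
{
    struct pregion *prp = find_pregion(addr); /* which pregion owns addr */
    struct page *pg = get_free_page();        /* find or make a free page */
    page_in(prp, addr, pg);                   /* read from disk, plus
                                                 adjacent read-ahead */
    enter_translation(addr, pg);              /* permissions, protections */
    /* return to user mode; the faulting instruction is retried */
}
```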
The flexibility of demand paging lies in the fact that it allows a process to be larger than physical memory. Its disadvantage lies in the degree of complexity paging requires of the processor; instructions must be restartable to handle page faults.
By default, all HP-UX processes are load-on-demand. A demand paged process does not preload a program before it is executed. The process code and data are stored on disk and loaded into physical memory on demand in page increments. (Programs often contain routines and code that are rarely accessed. For example, error handling routines might constitute a large percentage of a program and yet may never be accessed.)
HP-UX now implements copy-on-write of EXEC_MAGIC processes, to
enable the system to manipulate processes more efficiently. The system used to
copy the entire data segment of a process every time the process
fork'd, increasing fork time as the size of the data
and code segments increased. Only one translation of a physical page is
maintained; a parent process can point to and read a physical page, but copies
it only when writing on the page. The child process does not have a page
translation and must copy the page for either read or write access.
Copy-on-write means that pages in the parent's region are not
copied to the child's region until needed. Both parent and child
can read the pages without being concerned about sharing the same page.
However, as soon as either parent or child writes to the page, a new copy is
written, so that the other process retains the original view of the page.
For more information about the implementation of EXEC_MAGIC,
see the HP-UX Process Management white paper.
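The semantics can be demonstrated from user space; this small program shows only the observable behavior, not the kernel mechanism:

```c
/* User-space demonstration of copy-on-write semantics after fork().
 * Both processes initially read the same physical page; the child's
 * write forces a private copy, so the parent's view is unchanged. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int shared_value = 42;              /* lands in the data segment */

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {                 /* child: write triggers the copy */
        shared_value = 99;
        printf("child sees %d\n", shared_value);    /* prints 99 */
        _exit(0);
    }
    wait(NULL);                     /* parent: original page untouched */
    printf("parent sees %d\n", shared_value);       /* still 42 */
    return 0;
}
```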
When a process is fork'd, a duplicate copy of its parent
process forms the basis of the child process.
Under the kernel procdup() routine, the system walks the
pregion list of the parent process, duplicating each
pregion for the child process. How this is done is dictated by
the region type.
- If the region is type RT_SHARED, a new pregion is created that attaches to the parent's region.
- If the region is type RT_PRIVATE, the region is duplicated first, and then a new pregion is created and attached to the new region.
pregions for Shared regions

Because a region of type RT_SHARED is shared by parent and child, fewer changes occur to the pregions and region. Only a new pregion must be created and attached to the shared region.

- A new pregion is allocated and fields are copied from the parent pregion to the child pregion.
- The pregion elements used by vhand (p_agescan, p_ageremain, and p_stealscan) are initialized to zero, and the child pregion is added to the active pregion chain just before the stealhand, to prevent it from being stolen yet.
- The region elements r_incore and r_refcnt are incremented to reflect the number of in-core pregions accessing the region and the number of pregions, in-core or paged, accessing the region.
pregions with Shared regions parent pregion child pregion
+------------+ +------------+
| | | |
| | \ | |
+------------+ =========+ +------------+
| p_reg |-+ / | p_reg |-+
+------------+ | +------------+ |
| | | | | |
| | | | | |
+------------+ | +------------+ |
| |
Per-process resources | |
======================|=================================|==========
System resources | |
| shared region |
+->+------------+<----------------+
| |
+------------+
| RT_SHARED |
+------------+
| |
| |
+------------+
pregions for Private regions

The procedure is considerably more complex when an RT_PRIVATE region is copied.

- A new region is allocated.
- The new region's pointers are set: r_fstore, the forward store pointer, is pointed to the same value as the parent's, and the vnode's reference count (v_count) is incremented; r_bstore, the backward store pointer, is set to the kernel global swapdev_vp, and its v_count is incremented also.
- The region is attached to the end of the linked list of active regions.
- If the needed resources cannot be allocated, fork() fails and returns the error ENOMEM.
- The new region's B-tree structures are initialized and sufficient swap space is reserved for a completely filled B-tree.
- The parent's vfd and dbd proto values are copied to the child's B-tree root.
- The vfd proto values in both the parent region and the child region are set so that all pages of the region are copy-on-write.
- The B-tree element b_vproto is set to indicate that the copy-on-write flag (pg_cw) must be set in the vfd for any new vfddbd pair added to the B-tree.
- A chunk of vfddbds is created for the child's B-tree (equal to each chunk of vfddbds in the parent's B-tree) and filled with proto values. The pg_cw bit is already set to copy-on-write for all default vfds in the child B-tree's chunk.

region of Type RT_PRIVATE parent pregion child pregion
+------------+ +------------+
| | | |
| | \ | |
+------------+ =========+ +------------+
| p_reg |-+ / | p_reg |-+
+------------+ | +------------+ |
| | | | | |
| | | | | |
+------------+ | +------------+ |
| |
Per-process resources | |
======================|=================================|================
System resources | |
                      | private region                 | private region
+->+------------+ +->+------------+
| | | |
+------------+ \ +------------+
| RT_PRIVATE | =========+ | RT_PRIVATE |
+------------+ / +------------+
| | | |
| | | |
+------------+ +------------+
copy-on-write When the vfd is Valid

Before the chunks of vfddbds in the child region can be used, the validity of every entry must be checked.

If a vfd is not valid (that is, its pg_v is not set), the pg_cw of the parent's vfd must be set and copied to the child. If pg_lock is set in the parent, it must be unset in the child, as locks are not inherited.

Once the vfd is valid, further modifications are made to the low-level structures:

- The r_nvalid element in the child region is incremented to reflect the number of valid pages.
- The vfd contains a pfn (page frame number), which indexes into the pfdat[] array. The pfdat entry's pf_use count (number of regions using this page) must be incremented.
- If the vfd's copy-on-write bit isn't set, the pde must be set for translations to the page to behave as copy-on-write.
If a page has been written to a swap device, but has since been modified,
the swap-device data now differs from the data in memory. The disk page must
be disassociated from the page in memory by setting the dbd type
to DBD_NONE. Then, the next time the page is written to a swap
device, it will be assigned a new location.
Everything is now set up from the perspective of the parent's
B-tree for copy-on-write.
region's copy-on-write Status

- r_swalloc is set to the number of region and B-tree pages reserved.
- r_prev and r_next are set to link the child region to the parent region.
- A new space is allocated for the child's pregion, rather than copying it from the parent pregion. This establishes two ranges of virtual addresses (different space, same offset) translating to the single range of physical addresses.
- The translations are entered in the HTBL.
procdup() creates a duplicate copy of a process based on forktype, parent process (pp), child process (cp), parent thread (pt), and child thread (ct).

procdup() allocates memory for the uarea of the child. (In fact, procdup() is the routine that calls createU() to create the uarea too.)

procdup() calls dupvas() to duplicate the parent's virtual address space, based on the kind of process (fork vs. vfork) being executed. If the process was created with fork, dupvas() duplicates the parent process's virtual address space; if the process was vfork'd, the parent's virtual address space is used.
dupvas() looks for and finds each private data object, does
whatever each requires to be duplicated (there are special considerations
required for text, memory mapping, data objects, graphics), and when it
finishes duplicating the special objects, calls private_copy()
or shared_copy(), depending on whether it is dealing with a
private or shared region.
If the region is shared, shared_copy() increments the reference count on the region to indicate it is being shared.

If the region is private, private_copy() locks the region and enables the region to be duplicated by calling dupreg(). dupreg() allocates a new region for the child, duplicates the parent's vfds and the entire region structure, then calls do_dupc() to duplicate entries under the region.
do_dupc() sets up a parent-child relationship and, by duplicating the relationship, sets up the child to be copy-on-write. It makes sure the parent's region is valid, sets copy-on-write for the child, sets the translation as rx (read-execute) only, and duplicates information for every vfddbd combination in the region.

do_dupc() then calls hdl_cw() to update the child's access rights and make the child copy-on-write. Once this is completed, the child process exists as a duplicated version of the parent process. The child process is attached to the child's address space and is no longer dependent on the parent.
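A minimal sketch of this dispatch, assuming invented struct layouts; only the shared/private split and the dupreg() call come from the text:

```c
/* Sketch of the dupvas() dispatch described above; not HP-UX source. */
enum rtype { RT_SHARED, RT_PRIVATE };

struct region  { enum rtype r_type; int r_refcnt; };
struct pregion { struct region *p_reg; struct pregion *p_next; };

struct region *dupreg(struct region *rp);   /* duplicate the region, then
                                               do_dupc() for each vfddbd */

void dup_one_pregion(struct pregion *parent, struct pregion *child)
{
    struct region *rp = parent->p_reg;

    if (rp->r_type == RT_SHARED) {
        rp->r_refcnt++;              /* shared_copy(): just share it */
        child->p_reg = rp;
    } else {
        child->p_reg = dupreg(rp);   /* private_copy(): duplicate; the
                                        child becomes copy-on-write */
    }
}
```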
uarea for the Child Process

Each thread of a process has its own uarea. When a process fork()s, the new process has only a single thread, and that thread needs a uarea. procdup() creates this uarea by calling createU(). (uarea pregions aren't copied by dupvas(), so the child will have only one uarea, no matter how many threads (and associated uareas) the parent had.)
The createU() routine builds a uarea and address
space for the child process. The uarea is set up last for a
fork'd process, to prevent the child process from resuming in the
middle of pregion duplication code. If the process is
vfork'd, the uarea is created during
exec(). Until then, the child uses the parent thread's
uarea.
If the fork type is FORK_PROCESS, a temporary space is allocated for a working copy of the parent's uarea to be modified into the child's uarea. The temporary space will be freed after the uarea is copied to the new region. fork() updates the savestate in the parent uarea's u_pcb just before copying the data. (vfork() does not do this because it creates the uarea during exec(), and the savestate will change immediately.)
A region is allocated for the new uarea, its data structure is initialized, its r_bstore value is set to the swap device, and the new region is added to the list of active regions. The uarea has no r_fstore value, since it comes with ready-made data.
A pregion is allocated for the uarea and initialized. Each uarea has a unique space ID. The new pregion is marked with the PF_NOPAGE flag.
uarea pregions are unaffected by
vhand because they are not added to the list of active
pregions. Only if an entire process is swapped out are the
uarea's pages written to a swap device.
The pregion is attached into the linked list of pregions connected to the vas. Its pointer is stored in r_pregs, its p_prpnext is set to NULL, and its r_incore and r_refcnt are set to one.
After swap is reserved for the uarea and B-tree pages and the default dbd is set to DBD_DFILL, the uarea pages (UPAGES) are allocated. Each page requires a page of physical memory (sleeping if none is available immediately). The pfn is stored in the vfd, the pg_v is set as valid, r_nvalid is incremented, and a pde is created for the physical-to-virtual translation. The pfdat entry's P_UAREA and HDLPF_TRANS flags are set, and the dbd is set to DBD_NONE.
The kt_upreg pointer in the child's thread structure is pointed to the child thread's uarea pregion.
Conceivably, the child can now run successfully. The current state is
therefore saved in the copied uarea with a setjmp()
call and pointed to with pcb_sswap. Thus, when the child first
calls the resume() routine, it detects that
pcb_sswap is non-zero and does a longjmp() to get
back here. The child then returns from procdup() with the value
FORKRTN_CHILD.
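The save-and-resume trick is analogous to the user-space setjmp()/longjmp() pattern sketched below; the kernel's pcb_sswap mechanism operates on the saved uarea state rather than a jmp_buf:

```c
/* User-space analogy for the pcb_sswap trick: save state with setjmp(),
 * later "resume" by longjmp()ing back with a distinguishing value. */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf saved_state;

static void resume_child(void)
{
    longjmp(saved_state, 1);    /* like resume() seeing pcb_sswap != 0 */
}

int main(void)
{
    if (setjmp(saved_state) == 0) {   /* state saved: "parent" path */
        puts("state saved; dispatching");
        resume_child();               /* never returns */
    }
    puts("resumed: \"child\" path");  /* like returning FORKRTN_CHILD */
    return 0;
}
```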
The parent's open file table is copied to the child and the copied uarea is
copied into the actual pregion. This copy causes TLB miss faults
that cause the pregion's pdes to be written to the TLB, thus
associating the uarea's virtual address with the physical pages
just set up. The process completes by returning from procdup()
with the return value FORKRTN_PARENT.
Parent Reading a copy-on-write Page

When the parent accesses one of its RT_PRIVATE pages for read,
the processor generates a TLB miss fault, which the kernel handles as an
interrupt. The TLB miss fault handler finds the hpde and inserts
the information (including the new access rights) into the processor's TLB. On
return from the interrupt, the processor retries the read and is successful,
since PDE_AR_CW allows user-mode read access.
copy-on-write Page address = space.offset address = spacep.offset
| +-----------------------------------------------------+ |
| | Situation: | |
+->| * No translation exists | |
| (miss handler cannot find pde). | |
+-----------------------------------------------------+ |
| +----------------------------------------------------+ |
| | Actions: | |
+->| * Create alias translation | |
| * Retry instruction. | |
+----------------------------------------------------+ |
| +---------------------------------------------------+ |
| | Situation: | |
+->| * Translation exists (miss handler finds pde). |<-+
| * Translation is marked invalid |
+---------------------------------------------------+
| +--------------------------------------------------+
| | Actions: |
+->| * Update TLB with PDE_AR_CW permissions. |
| * Retry instruction. |
+--------------------------------------------------+
Child Reading a copy-on-write Page

When the child accesses one of its pages for read, the TLB miss handler does not find an hpde for the virtual address, because none has been set up yet. The virtual address was set up in the pregion structure. If you are not doing copy-on-access (which is now the default) and the page is needed, the aliased translation must be made.

- A save_state is created.
- The vas pointer is taken and the skip list is searched to find the pregion containing the page with this address.
When regions are initialized, the disk block descriptor (dbd) dbd_data field is set to DBD_DINVAL (0x1fffffff) in all cases. The prototype dbd_type values are set as follows: DBD_FSTORE for text and initialized data, and DBD_DZERO for stack and uninitialized data.

When a page is read for the first time, a TLB miss fault results because
the physical page (and therefore its translation in the sparse PDIR) does not
yet exist. The fault handler is responsible for bringing in the page and
restarting the instruction that faulted. In determining whether or not the
page is valid, the fault handler determines which pregion in the
faulting process contains the faulting address. The fault code eventually
calls virtual_fault(), the primary virtual-fault handling routine. The arguments passed to this routine are the virtual address causing the fault, the pregion, and a flag indicating read or write access.
The kernel searches the B-tree for the vfd and
dbd of the page. If the valid bit in the vfd flag is
set, another process has read the address into memory already. If the
r_zomb flag is set in the region, the kernel prints the message "Pid %d killed due to text modification or page I/O error" and returns SIGKILL, which the handler sends to the process.
If the dbd_type value is set to DBD_DZERO (as is
the case for stack and uninitialized data), the process sets the
copy-on-write bit to zero. The kernel then checks to determine
whether the page pertains to a system process or to a high-priority thread. If
neither and memory is tight, the process sleeps until free memory is driven
down to the priority associated with the process. (In worst case, a thread
might wait until memory is above desfree.)
Once the process is restarted, vfd and dbd
pointers are examined to ensure their continued accuracy. A free
pfdat entry is acquired from the physical memory allocator, its
pfn (pf_pfn) placed in the vfd, the
vfd's valid bit set, and the region's r_nvalid
counter (number of valid pages) incremented. The page is zeroed, and its
virtual-to-physical translation is added to the sparse PDIR. Finally, the
kernel changes dbd_type to DBD_NONE and
dbd_data to 0xfffff0c.
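A sketch of the zero-fill path under assumed helper names and field layouts; the state changes mirror the description above:

```c
/* Sketch of the DBD_DZERO zero-fill path. Helper names and constants
 * are invented for this sketch; the state changes follow the text. */
struct vfd { unsigned pg_v; unsigned pf_pfn; };
struct dbd { unsigned dbd_type; unsigned dbd_data; };

#define DBD_NONE 0xf                 /* value assumed for this sketch */

unsigned alloc_zeroed_page(void);    /* physical allocator plus zeroing;
                                        may sleep when memory is tight */
void add_translation(unsigned pfn);  /* pde into the sparse PDIR */

void zero_fill_fault(struct vfd *v, struct dbd *d, int *r_nvalid)
{
    unsigned pfn = alloc_zeroed_page();
    v->pf_pfn = pfn;                 /* record the frame in the vfd */
    v->pg_v = 1;                     /* mark the vfd valid */
    (*r_nvalid)++;                   /* region has one more valid page */
    add_translation(pfn);            /* virtual-to-physical mapping */
    d->dbd_type = DBD_NONE;          /* page no longer tied to disk copy */
}
```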
If a process has a virtual fault on a DBD_FSTORE page, the
kernel uses the r_fstore pointer to the vnode, to
determine which file-system specific pagein() routine (for
example, ufs_pagein(), nfs_pagein(),
cdfs_pagein(), vx_pagein()) to call. The
pagein() routines are used to recover the correct page from a
free list of memory pages or to read in a correct page from disk.
The pagein() routine gets information about the page being
faulted from the vm_pagein_init() routine, which gets the
vfd/dbd pairs, sets up the region index, and ascertains that no
valid page already exists.
One page must be reserved. Then vm_no_io_required() is called
to determine if the page fault can be satisfied locally, either by a
zero-filled page (sparse file) or from the page cache.
vm_no_io_required() checks for the faulted page in the page
cache by calling lgpg_cache_lookup().
lgpg_cache_lookup() uses pageincache() to find
the base page, and then uses lgpg_lookup() to find whether it's
part of a suitable large page.
pageincache() hashes on the vnode pointer and
data to choose a pfdat pointer in phash[]. The
routine walks the pf_hchain chain of pfdat entries
looking for a matching vnode pointer (pf_devvp) and
data value (pf_data). If it finds a match, it removes it from the
free list.
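A hedged sketch of such a lookup; the hash function and table size are invented, and only pf_devvp, pf_data, pf_hchain, and phash[] come from the text:

```c
/* Sketch of a pageincache()-style lookup: hash the vnode pointer and
 * data value, then walk the pf_hchain collision chain for a match. */
struct vnode;
struct pfdat {
    struct vnode *pf_devvp;       /* vnode this page caches */
    unsigned      pf_data;        /* block/offset identity */
    struct pfdat *pf_hchain;      /* hash collision chain */
};

#define PHASH_SIZE 1024
struct pfdat *phash[PHASH_SIZE];

struct pfdat *page_in_cache(struct vnode *vp, unsigned data)
{
    unsigned long h = ((unsigned long)vp ^ data) % PHASH_SIZE;
    struct pfdat *pf;

    for (pf = phash[h]; pf != NULL; pf = pf->pf_hchain)
        if (pf->pf_devvp == vp && pf->pf_data == data)
            return pf;            /* hit: caller removes it from free list */
    return NULL;                  /* miss: page must be read from disk */
}
```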
If the page is found in the page cache, the region's valid page count
(r_nvalid) is incremented, the vfd is updated with
the pfn (pf_pfn), and a virtual-to-physical
translation for the page to the sparse PDIR is added (if it had been removed).
DBD_FSTORE Page
pfdat
+--------------+
hash linked list +---->|P_HASH|P_FREE |<---+
(pf_hchain) | +--| |<-+ | free linked list
| | +--------------+ | |(pf_next, pf_prev)
| +->|P_HASH|P_FREE |<-+ |
| +--| |<-+ |
| | +--------------+ | |
| | | | | |
| | +--------------+ | |
devvp dbd_data | +->| P_HASH | | |
\ / | +--| | | |
\ / | V +--------------+ | |
\ / | --- | | | |
_/ \_ | - | | | |
----------- | +--------------+ | |
\ / phash | | P_FREE |<-|-+
\ / +-----+ | | |<-|-+
| | | | +--------------+ | |
V +-----+ | +--------------+ | |
index---->| |------->|P_HASH|P_FREE |<-+ |
+-----+ | +--| |<---|-+
| | | | +--------------+ | |
| | | | | P_FREE |<---+ |
| | | | | |<-----+
| | | | +--------------+
| | | | | |
| | | | +--------------+
+-----+ | +->| P_HASH |
+-----| |
+--------------+
| |
| |
+--------------+
If the required page is not found in the page cache, the
pagein() routines refer to the dbd to ascertain
which page to fetch. (The information had been stored in the dbd
by vm_no_io_required().) The pagein() routines will
generally try to read more than just the single page where the fault occurred;
they try both to use larger than 4K pages (where that's appropriate, given
memory availability, file attributes, etc.) and to simply read-ahead extra
pages from a file that's being accessed sequentially, so that they'll be
already available at the time of the next page fault on that file.
A page (or more) of memory is allocated from the physical memory allocator,
a virtual-to-physical translation added to the sparse PDIR, the I/O scheduled
from the disk to the page, and the process put to sleep awaiting the
non-read-ahead I/O to complete (the process does not await read-ahead I/O to
complete). The vfd is marked valid. The dbd is left
with dbd_type set to DBD_FSTORE and
dbd_data set to the block address on the disk.
Regardless of whether the page data is retrieved from zero-fill, free list,
or disk, the page directory entry (pde) has been touched. The
instruction is retried and gets a TLB miss fault; the miss handler writes the
modified pde data into the TLB; the instruction is retried again and succeeds.
exec()

When the system performs an exec(), the virtual memory system
concerns itself with cleaning up old pregions/regions and setting
up new ones.
vfork()

Cleanup in the vfork() case is simple.

- A new vas is allocated and attached to the child process (p_vas).
- The uarea and stack of the parent process are copied, the pregion and region are created for the child uarea just as for a FORK_PROCESS fork type, and the thread switches from using the parent's kernel stack to the new child kernel stack.
Disposing of pregions: dispreg()

If exec() is called after a FORK_PROCESS fork, several regions must be disposed of first. Typically, all pregions are disposed of except for the PT_UAREA pregion, which is still needed. If the process is exec()ing the same file, we save a little processing and keep the PT_TEXT and PT_NULLDREF regions, too.
deactivate_preg() is used to deactivate the
pregion by removing it from the active pregion
list. If the agehand is pointing to the pregion
being deactivated and stealhand is pointing to the next region
in the active pregion list, the agehand is moved
back one pregion to prevent the agehand from
exceeding the stealhand in sequence. Otherwise, if the
agehand or stealhand is pointing to the
pregion being deactivated, both hands are moved forward one
pregion.
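A sketch of the hand adjustment, assuming a doubly linked active list; the rule follows the description above:

```c
/* Sketch of the clock-hand adjustment in deactivate_preg(); the list
 * representation is an assumption for illustration. */
struct pregion { struct pregion *p_next, *p_prev; };
extern struct pregion *agehand, *stealhand;

void adjust_hands(struct pregion *dying)
{
    if (agehand == dying && stealhand == dying->p_next) {
        agehand = dying->p_prev;        /* keep agehand behind stealhand */
    } else if (agehand == dying || stealhand == dying) {
        agehand = agehand->p_next;      /* move both hands forward one */
        stealhand = stealhand->p_next;
    }
}
```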
hdl_detach() is called to handle hardware-dependent aspects of detaching the region from the process's address space. In particular, if this is the last reference to the address space, its resources must be freed up:

- It calls wait_for_io() to await completion of any pending I/O to the region (that is, r_poip = 0), so that no I/O request returns to modify a page now assigned a different purpose.
- It calls do_deltransc() on each chunk of the region's B-tree to delete all the virtual address translations. That is, for each valid vfd, do_deltransc() calls hdl_deletetrans(), which calls pddpage() to:
  - Invalidate the hpde (set space to -1, address to 0, pde_phys (pfn) to 0, pde_ref to 0, pde_os to 0).
  - If the hpde is not the htbl entry, move the hpde from the hash list to the free list. If it is the HTBL hpde and it is unused, make an effort to fill it with a translation down its linked list, and then free the copied hpde.
  - Clear the page's entry in the pfn_to_virt table.
The pregion pointer is removed from the r_pregs list and the memory used by the pregion is freed (that is, returned to the kernel memory allocator).

The region's r_incore and r_refcnt elements are decremented. If r_refcnt equals zero, the region is freed also. The routine freereg() (called if the region is to be freed) does the following:
- It calls pgfree() to:
  - Call wait_for_io() (again) to await completion of any pending I/O to the region (that is, r_poip = 0), so that no I/O request returns to modify a page now assigned a different purpose.
  - Walk the region's B-tree (again), calling do_freepagesc() on each chunk of the B-tree to free (freepfd()) all the valid pages of the region. The pf_use field of each page's pfdat is decremented. If this region was the last user, pf_use will now be 0, and the page can be freed for other uses: its P_FREE flag is set and the page is returned to the physical memory allocator. The kernel global freemem is incremented. If any other processes are waiting for memory, we wake them all up so that the first one here can have the page (the losers of the race will go to sleep again).
- If r_bstore is swapdev_vp, the reserved swap pages (r_swalloc) are released, as are the swap pages reserved for the B-tree structure (r_root->b_rpages).
- The r_root and r_chunk region elements are returned to the kernel memory allocator.
- The kernel global activeregions is decremented; the region is removed from the active region list and the list of regions associated with its vnode, and the region struct itself is returned to the kernel memory allocator.
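The reference-count bookkeeping amounts to the following sketch (the field names come from the text; the control flow is a simplification):

```c
/* Sketch of detach-time bookkeeping: decrement the counts and free
 * the region when the last reference is gone. */
struct region {
    int r_incore;      /* in-core pregions using this region */
    int r_refcnt;      /* all pregions, in-core or paged, using it */
};

void freereg(struct region *rp);   /* pgfree(), release swap, unlink */

void detach_region(struct region *rp)
{
    rp->r_incore--;
    if (--rp->r_refcnt == 0)
        freereg(rp);               /* last user: free pages and struct */
}
```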
If the process for which memory structures are being created is the first
to use the file as an executable, the executable file's vnode's
v_vas is NULL, and requires creating the pseudo-vas,
pseudo-pregion, and region. Otherwise, the
pseudo-vas' reference count is updated.
How the PT_TEXT pregion is attached depends on the type of executable.
- If the executable is not EXEC_MAGIC, a PT_TEXT pregion is attached to the pseudo-vas' region.
- If the executable is EXEC_MAGIC, VA_WRTEXT is set in the process vas, the pseudo-vas' region is duplicated as a type RT_PRIVATE region (performing all the steps discussed for an RT_PRIVATE region), RF_SWLAZYWRT is set in the new region so that no swap is reserved before needed, and a PT_TEXT pregion is attached to it.
pregion's
virtual address.
A PT_NULLDREF pregion is attached to the global region (globalnullrp), using the same space as PT_TEXT.
The pseudo-vas' region is duplicated as a type RT_PRIVATE region using r_off to point to the beginning of the data portion of the executable file. A PT_DATA pregion is attached to it. If this is an EXEC_MAGIC executable, we use the PT_TEXT pregion's space; otherwise a new space is assigned.

The PT_DATA pregion is incremented by the size of bss (uninitialized data area), using dbd type DBD_DZERO. This sets b_protoidx to the end of the initialized data area and b_proto2 to DBD_DZERO. More swap is reserved.
A region of (SSIZE + 1) pages is created for the user stack. The dbd proto value is set to DBD_DZERO, and a PT_STACK pregion is attached at USRSTACK. The PT_UAREA pregion's space is used.
For shared libraries, PT_MMAP pregions are created: an RT_SHARED pregion containing text mapped into the third quadrant with a space of KERNELSPACE, and an RT_PRIVATE pregion containing associated data (such as library global variables) with the PT_DATA pregion's space. If VA_WRTEXT is set, the data pregion takes the first available address above where the text ends (in the first or second quadrant); otherwise it is assigned the first available address in the second quadrant.
exit()

From the virtual memory perspective, an exit() resembles the first part of an exec(). All virtual memory resources associated with the process are discarded, but no new ones are allocated.
Thus, when exiting from a vfork child before the
child has performed an exec(), nothing needs to be cleaned up
from virtual memory except to return resources to the parent process. If
exiting from a non-vfork child, the virtual memory resources are
discarded by calling dispreg().