Modern Operating Systems - Virtual Memory

Why Use Virtual Memory#

In a typical operating system, multiple processes are running and sharing the same physical memory resources. If the operating system allows processes to directly use physical memory, there will be some issues:

  • The space the operating system initially allocates to a process may prove insufficient, and when the process requests more, the new memory may not be contiguous with what it already holds.
  • Due to sharing the same physical address space, Process A may write to an address used by Process B.

For efficiency and security, virtual memory is introduced: application processes read and write memory through virtual addresses, and the MMU (Memory Management Unit) inside the CPU translates each virtual address into the physical address that is actually read or written. The operating system is responsible for establishing the virtual-to-physical mapping for each process, so every process can behave as if it owned a contiguous, complete address space of its own.

In the early days, most systems used segmentation to manage virtual memory: the operating system divided the virtual and physical address spaces into segments of varying sizes, such as code segments and data segments, and used a segment table to map virtual addresses to physical addresses. However, this scheme easily leads to external fragmentation between physical segments.

  • Virtual addresses include: segment number + offset within the segment.
  • The MMU finds the segment table based on the segment table base register.
  • Find the starting physical address of the corresponding segment in the segment table based on the segment number.
  • The physical segment starting address + offset gives the physical address.

Virtual Memory Paging#

Modern operating systems typically use a paging mechanism to manage virtual memory.

The operating system first divides the virtual address space and physical address space into smaller, equally sized pages, and uses a page table to implement the mapping from virtual addresses to physical addresses. User processes can use the entire virtual address space, and the kernel maintains a page table for each process. Even if two processes use the same virtual address, there will be no conflict if the mapped physical addresses in the page table are different.

  • Virtual addresses include: page number + page offset.
  • The MMU finds the page table of the corresponding process based on the page table base register.
  • Find the starting physical address of the physical page corresponding to the page number in the page table.
  • The physical page starting address + page offset gives the physical address.
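A minimal sketch of this single-level lookup; the page-table contents are made-up example values:

```python
PAGE_SIZE = 4096

# Hypothetical per-process page table: virtual page number -> physical page number.
page_table = {0: 7, 1: 3}

def translate_paged(vaddr):
    """Translate a virtual address using a single-level page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    ppn = page_table[vpn]            # a missing key would be a page fault
    return ppn * PAGE_SIZE + offset
```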

Multi-Level Page Table#

To compress the physical memory occupied by page table entries (on a 64-bit system with 4KB pages, a single flat page table covering the whole virtual address space would be enormous), multi-level page tables are usually used. The virtual page number is divided into several fields, each of which indexes one level of page table. Taking the AArch64 architecture as an example, the virtual address uses 48 valid bits (i.e., the virtual address space is 2^48 bytes), of which the lower 12 bits are the page offset, and the remaining 36 bits are split across 4 levels of page tables (9 bits per level).

  • First, find the physical address of the level 0 page table through the TTBR0_EL1 register.
  • Find the item in the level 0 page table based on bits 47-39: the physical address of the level 1 page table.
  • Find the item in the level 1 page table based on bits 38-30: the physical address of the level 2 page table.
  • Find the item in the level 2 page table based on bits 29-21: the physical address of the level 3 page table.
  • Find the corresponding physical page number in the level 3 page table based on bits 20-12.
  • The physical page base address + the page offset (bits 11-0) gives the final physical address.
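The bit fields involved in this walk can be extracted with a small helper; `split_va48` is a hypothetical name, and the function only slices the address into indices, without modeling the memory accesses of the actual walk:

```python
def split_va48(vaddr):
    """Split a 48-bit virtual address (AArch64, 4 KB pages) into the four
    9-bit page-table indices (levels 0..3) and the 12-bit page offset."""
    offset = vaddr & 0xFFF                      # bits 11-0
    indices = [(vaddr >> shift) & 0x1FF         # 9 bits per level
               for shift in (39, 30, 21, 12)]   # bits 47-39 ... 20-12
    return indices, offset
```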

Usually, in addition to a physical address, page table entries also store some flag bits, such as a valid (present) bit indicating whether the entry actually maps a physical page, a dirty bit, and so on. While walking a multi-level page table, if an entry is marked invalid, the search can be terminated immediately.

Not all page table entries in a multi-level page table point to a page table of the next level. For example, if a process uses only a single 4KB page, its level 0 page table has just one valid entry pointing to a level 1 page table, which in turn has one valid entry pointing to a level 2 page table, and so on, for a total of only 4 page table pages. Even a contiguous 4KB*64 region whose pages all fall under the same level 3 page table (so the upper three levels are shared) still needs only those same 4 page table pages, which saves a great deal of space. Refer to the following figure:

Address Translation
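The page counting in the example above can be verified with a short calculation. This sketch assumes 512 eight-byte entries per 4 KB page-table page and a single contiguous, aligned mapping that shares no tables with other mappings:

```python
ENTRIES_PER_TABLE = 512   # a 4 KB table holds 512 eight-byte entries

def page_table_pages(num_pages):
    """Page-table pages needed to map `num_pages` contiguous, aligned 4 KB
    pages under a 4-level scheme (single table chain, no sharing)."""
    total, n = 0, num_pages
    for _ in range(4):                     # count level 3 tables first, then parents
        n = -(-n // ENTRIES_PER_TABLE)     # ceiling division: tables at this level
        total += n
    return total
```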

Translation Lookaside Buffer (TLB) and Page Table Switching#

To speed up address translation, the MMU includes a TLB (Translation Lookaside Buffer), a hardware cache of recently used virtual-page-number-to-physical-page-number mappings. Like the CPU's data caches, the TLB is itself organized in levels, e.g. an L1 instruction TLB, an L1 data TLB, and a unified L2 TLB, and it can generally hold only a few thousand entries. During a translation, the MMU first queries the TLB: on a hit it returns the cached mapping directly; on a miss it performs the multi-level page table walk and writes the result back to the TLB, evicting an existing entry according to the hardware's replacement policy if the TLB is full. In addition, since multiple processes usually run on one system and different processes may use the same virtual addresses, the TLB contents must stay consistent with the page table of the current process: the operating system actively flushes the TLB when switching processes (i.e., when switching page tables).
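The hit/miss/evict behavior described above can be modeled with a toy cache. The FIFO eviction and the `capacity=4` default are simplifying assumptions, not what real hardware does:

```python
class SimpleTLB:
    """Toy TLB: caches vpn -> ppn with FIFO eviction.

    Real hardware uses its own replacement policy and associativity."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}              # dict insertion order doubles as FIFO order
        self.hits = self.misses = 0

    def lookup(self, vpn, page_table):
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1           # stands in for the full multi-level walk
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))   # evict the oldest
            self.entries[vpn] = page_table[vpn]
        return self.entries[vpn]
```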

Considering that flushing (clearing) the TLB on every process switch causes a burst of TLB misses when the next process starts running, modern CPUs provide a process-tagging feature, such as the PCID (Process-Context IDentifier) on x86-64. The operating system assigns different PCIDs to different processes and writes them into the page table base register and into the TLB's cache entries. Even after a process switch, the TLB can then distinguish entries belonging to different processes by comparing the PCID in the current page table base register against the PCID in each cache entry, without being cleared. The operating system must still flush the affected TLB entries promptly when it modifies a page table, to maintain consistency. The drawback of this approach is that the number of TLB entries available to any single process is further reduced.

In the AArch64 architecture, the kernel and user processes generally use different page table base registers, so there is no need to switch page tables when switching to kernel mode (executing system calls). In the x86-64 architecture, although there is only one page table base register, the kernel does not use a separate page table. Instead, it maps its own address space to the high part of each process's address space (equivalent to multiple user processes sharing the same kernel address space), so there is no need to switch page tables.

Page Fault and Page Replacement#

Note: In the following text, allocation refers to a process requesting a certain amount of space from the operating system, and the operating system allocates a segment of space in the virtual address space and returns the starting and ending addresses to the process.

While an application process runs, it reserves memory in advance (e.g., via the brk or mmap system calls on Linux), but the operating system does not back those addresses with real physical pages until they are used; that is, the page table entries for these addresses initially map to nothing. When a virtual page is read or written for the first time, the CPU consults the page table, finds no corresponding physical page, and triggers a page fault exception. The CPU then executes the page fault handler installed by the operating system, which finds a suitable physical page and writes its address into the page table. In practice, to reduce the number of page fault exceptions, the operating system may map several adjacent virtual pages to physical pages in a single invocation of the handler.
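A sketch of demand paging in miniature, with a Python `bytearray` standing in for a physical page and a dictionary for the page table; all names are illustrative:

```python
PAGE_SIZE = 4096

class LazyMemory:
    """Demand-paging sketch: a physical page is attached only on first access."""

    def __init__(self):
        self.mapped = {}     # vpn -> bytearray standing in for a physical page
        self.faults = 0

    def write(self, addr, value):
        page, off = self._page(addr)
        page[off] = value

    def read(self, addr):
        page, off = self._page(addr)
        return page[off]

    def _page(self, addr):
        vpn, off = divmod(addr, PAGE_SIZE)
        if vpn not in self.mapped:           # "page fault": map on demand
            self.faults += 1
            self.mapped[vpn] = bytearray(PAGE_SIZE)
        return self.mapped[vpn], off
```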

During its execution, a process may use more memory than physical memory can hold. In that case, when a page fault occurs, the operating system can evict some physical pages to a lower-level storage device such as a disk, freeing idle physical pages for the currently running process; this is called swapping out (page-out). When swapping out, the operating system writes the contents of the physical pages to disk, clears the corresponding page table entries, and records where on disk each page now lives. When a swapped-out virtual page is accessed again, the operating system copies the saved contents back into a physical page and updates the process page table to map the virtual page to that new physical page; this is called swapping in (page-in).

A swap partition is a disk partition dedicated to this paging traffic. Because disk I/O is slow, the operating system may predict which nearby pages are likely to be accessed and swap them in together, reducing the number of I/O operations.

When an application program accesses a virtual page and triggers a page fault exception, the operating system needs to determine whether the virtual page is in an unallocated state or an allocated but unmapped state. The operating system only performs the mapping from an allocated virtual page to a physical page. In Linux, the virtual address space of a process is divided into multiple Virtual Memory Areas (VMA), and each VMA has a certain range (from the starting address to the ending address), such as code, stack, and heap, which correspond to different VMAs. When a page fault exception is triggered, the operating system can determine whether it has been allocated by checking whether the virtual page is under a certain VMA.
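The VMA check can be sketched as a range lookup; the address ranges below are made-up examples:

```python
# Hypothetical VMA list: half-open (start, end) ranges of allocated regions.
vmas = [(0x400000, 0x401000),       # e.g. the code region
        (0x7ffde000, 0x7ffff000)]   # e.g. the stack region

def classify_fault(addr):
    """Decide how the page fault handler should treat a faulting address."""
    for start, end in vmas:
        if start <= addr < end:
            return "map"        # allocated but unmapped: attach a physical page
    return "segfault"           # unallocated: deliver SIGSEGV to the process
```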

Virtual Memory Features#

Shared Memory#

Since virtual pages are mapped to physical pages through a process's page table, the operating system can achieve memory sharing between two processes, A and B, by mapping two virtual pages, A1 and B1, to the same physical page, P. Any write to shared memory by one process can be read by the other process.
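Python's standard `multiprocessing.shared_memory` module exposes exactly this mechanism; opening the same segment twice here stands in for two separate processes mapping it:

```python
from multiprocessing import shared_memory

# Create a shared segment, then attach to it a second time by name,
# as a second process would; both handles map the same physical pages.
a = shared_memory.SharedMemory(create=True, size=16)
b = shared_memory.SharedMemory(name=a.name)

a.buf[0] = 42          # a write through one mapping...
value = b.buf[0]       # ...is immediately visible through the other

b.close()
a.close()
a.unlink()             # remove the segment once both sides are done
```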

Copy-on-Write#

Through shared memory, the operating system can implement copy-on-write (COW). For example, on Linux a child process is created by forking: immediately after the fork, the memory contents of parent and child are identical, so Linux only copies the page table for the child without copying any data, and marks the shared pages read-only in both page tables. When either process tries to write to such a page, the write faults due to insufficient permissions; the operating system then copies the faulting page's data to a new physical page, installs the new physical page in that process's page table entry, and restores write permission. Besides fork, multiple processes can also share memory by mapping the same dynamic (shared) library, reducing overall memory usage.

Shared Memory and Copy-on-Write
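The fork semantics that COW implements behind the scenes can be observed directly (POSIX only, since this relies on `os.fork`): the child's write lands on its own copy of the page, and the parent's data is untouched:

```python
import os

def fork_demo():
    data = bytearray(b"parent")     # after fork, both processes share these
                                    # pages copy-on-write under the hood
    pid = os.fork()
    if pid == 0:                    # child: the first write copies the page
        data[0:6] = b"child!"
        os._exit(0)
    os.waitpid(pid, 0)
    return bytes(data)              # the parent's view is unchanged
```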

Memory Deduplication#

Building on COW, the operating system can (periodically) scan for physical pages with identical content, merge them into a single page, and redirect all associated page table mappings to that one physical page, improving memory utilization. This feature is called memory deduplication; in Linux it is implemented as "Kernel Same-page Merging" (KSM). The memory it frees can in turn reduce the need for page swapping and the disk I/O that swapping causes.
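A sketch of the merging step, using equal byte strings in place of equal physical pages; note that KSM verifies candidate pages byte-for-byte before merging them:

```python
# Hypothetical pages: three virtual pages, two of which have identical content.
pages = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096}

unique = {}      # content -> the single shared copy
mapping = {}     # vpn -> the shared copy it now maps to
for vpn, content in pages.items():
    # setdefault returns the already-stored copy when the content matches,
    # so duplicate pages end up pointing at one shared object
    mapping[vpn] = unique.setdefault(content, content)
```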

Memory Compression#

The operating system can reduce the usage of physical pages by compressing less frequently used memory data. In Linux, there is a zswap area in memory used to store compressed memory data. After compression, the memory data is first placed in the zswap area. When physical memory resources are insufficient, zswap will batch swap out the compressed data to the disk. This can reduce or even avoid the disk I/O caused by immediate swapping and improve memory utilization efficiency.
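The trade-off can be illustrated with the standard `zlib` module: a compressible cold page shrinks dramatically, so keeping the compressed form in RAM defers the disk write (zswap's actual compressors and memory pools differ from this sketch):

```python
import zlib

page = bytes(4096)                      # a cold, zero-filled 4 KB page
compressed = zlib.compress(page)        # keep this in RAM instead of swapping
restored = zlib.decompress(compressed)  # "swap in" without touching the disk
```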

Huge Pages#

As mentioned earlier, TLB entries are limited, and with a 4KB page size the TLB may cover too little memory, causing frequent TLB misses and low efficiency. Many operating systems therefore provide huge pages, increasing the page size to 2MB or even 1GB so that each TLB entry covers far more memory. Linux provides a "transparent huge page" mechanism that automatically merges contiguous 4KB pages into a 2MB page.

Using huge pages can greatly reduce the occupancy of the TLB cache entries and improve the TLB hit rate. However, huge pages may also reduce memory utilization efficiency. For example, if a 2MB huge page is actually only using 256KB, it will cause significant resource waste.
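A back-of-the-envelope calculation of TLB reach, assuming a hypothetical 1024-entry TLB:

```python
TLB_ENTRIES = 1024                       # hypothetical TLB capacity

# Memory reachable without a single TLB miss ("TLB reach"):
coverage_4k = TLB_ENTRIES * 4 * 1024     # with 4 KB pages
coverage_2m = TLB_ENTRIES * 2 * 1024**2  # with 2 MB huge pages
```

With 4 KB pages the same TLB covers only 4 MB of memory, versus 2 GB with 2 MB huge pages, which is why huge pages cut miss rates so sharply for large working sets.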

Reference
Modern Operating Systems: Principles and Implementation
