From 383ff21823ede2b2456205a1cae382d56bc0f688 Mon Sep 17 00:00:00 2001 From: Philipp Oppermann Date: Tue, 11 Dec 2018 23:54:26 +0100 Subject: [PATCH 01/21] Begin new post about paging --- .../second-edition/posts/09-paging/index.md | 104 ++++++++++++++++++ .../posts/09-paging/paging-fragmentation.svg | 2 + .../segmentation-fragmentation-compacted.svg | 2 + .../09-paging/segmentation-fragmentation.svg | 2 + .../segmentation-same-program-twice.svg | 2 + 5 files changed, 112 insertions(+) create mode 100644 blog/content/second-edition/posts/09-paging/index.md create mode 100644 blog/content/second-edition/posts/09-paging/paging-fragmentation.svg create mode 100644 blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg create mode 100644 blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg create mode 100644 blog/content/second-edition/posts/09-paging/segmentation-same-program-twice.svg diff --git a/blog/content/second-edition/posts/09-paging/index.md b/blog/content/second-edition/posts/09-paging/index.md new file mode 100644 index 00000000..13b0a428 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/index.md @@ -0,0 +1,104 @@ ++++ +title = "Paging" +order = 9 +path = "paging" +date = 0000-01-01 +template = "second-edition/page.html" ++++ + +In this post we start exploring memory management. We explore the difference between virtual memory and physical memory and learn how the translation process works. TODO + + + +This blog is openly developed on [Github]. If you have any problems or questions, please open an issue there. You can also leave comments [at the bottom]. + +[Github]: https://github.com/phil-opp/blog_os +[at the bottom]: #comments + +## Memory Protection + +One main task of an operating system is to isolate programs from each other. Your web browser shouldn't be able to interfere with your text editor, for example. 
To achieve this goal, operating systems utilize hardware functionality to ensure that memory areas of one process are not accessible by other processes. There are different approaches, depending on the hardware and the OS implementation. + +As an example, some ARM Cortex-M processors (used for embedded systems) have a [_Memory Protection Unit_] (MPU), which allows you to define a small number (e.g. 8) of memory regions with different access permissions (e.g. no access, read-only, read-write). On each memory access the MPU ensures that the address is in a region with correct access permissions and throws an exception otherwise. By changing the regions and access permissions on each process switch, the operating system can ensure that each process only accesses its own memory, and thus isolate processes from each other. + +[_Memory Protection Unit_]: https://developer.arm.com/docs/ddi0337/e/memory-protection-unit/about-the-mpu + +On x86, two different techniques are supported: [segmentation] and [paging]. + +[segmentation]: https://en.wikipedia.org/wiki/X86_memory_segmentation +[paging]: https://en.wikipedia.org/wiki/Virtual_memory#Paged_virtual_memory + +## Segmentation + +Segmentation was already introduced in 1978, originally to increase the amount of addressable memory. The situation back then was that CPUs only used 16-bit addresses, which limited the amount of addressable memory to 64KiB. To make more than these 64KiB accessible, additional segment registers were introduced that each contain an offset address. This offset is added on each memory access, which results in a 20-bit address so that up to 1MiB of memory is accessible. + +The CPU chooses a segment register automatically, depending on the kind of memory access: For fetching instructions the code segment `CS` is used and for stack operations (push/pop) the stack segment `SS` is used. Other instructions use the data segment `DS` or the extra segment `ES`. 
Later, two additional segment registers `FS` and `GS` were added, which can be used freely. + +In the first version of segmentation, the segment registers directly contained the offset and no access control was performed. This was changed later with the introduction of the [_protected mode_]. When the CPU runs in this mode, the segment registers contain an index into a local or global [_descriptor table_], which contains in addition to an offset address the segment size and access permissions. The OS can utilize this to isolate processes from each other by loading separate global/local descriptor tables for each process that confine memory accesses to the process's own memory areas. + +[_protected mode_]: https://en.wikipedia.org/wiki/X86_memory_segmentation#Protected_mode +[_descriptor table_]: https://en.wikipedia.org/wiki/Global_Descriptor_Table + +Although segmentation is no longer used on modern systems, it already used a technique that is now used everywhere: _virtual memory_. + +### Virtual Memory + +The idea behind virtual memory is to abstract away the memory addresses from the underlying physical storage device. Instead of directly accessing the storage device, a translation step is performed. For segmentation, this translation step is to add the offset address of the active segment. Imagine a program accessing memory address `0x1234000` in a segment with offset `0x1111000`: The translated address is then `0x2345000`. + +To differentiate the two address types, addresses before the translation are called _virtual_ and addresses after the translation are called _physical_. One important difference between these two kinds of addresses is that physical addresses are unique and always refer to the same, distinct memory location. Virtual addresses on the other hand depend on the translation function. It is entirely possible that two identical virtual addresses refer to different physical addresses when different translation functions are used. 
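The offset-based translation described above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the `Segment` type and its `translate` method are hypothetical names made up for this sketch, not part of any real crate, and real hardware performs this step in silicon rather than in software.

```rust
/// A hypothetical, simplified segment: a base offset plus a size limit.
/// (Illustrative only; real descriptors also carry access permissions.)
#[derive(Debug, Clone, Copy)]
pub struct Segment {
    /// Physical start address of the segment.
    pub offset: u64,
    /// Number of addressable bytes in the segment.
    pub size: u64,
}

impl Segment {
    /// Translates a virtual address into a physical one by adding the
    /// segment offset. Returns `None` if the address lies outside the
    /// segment, which models the size check of protected mode.
    pub fn translate(&self, virt: u64) -> Option<u64> {
        if virt < self.size {
            self.offset.checked_add(virt)
        } else {
            None
        }
    }
}
```

With a segment offset of `0x1111000`, translating `0x1234000` yields `0x2345000`. Two segments with offsets 100 and 300 translate the same virtual address 0 to the different physical addresses 100 and 300.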
+ +An example where this property is useful is running the same program twice in parallel: + + +![Two virtual address spaces with address 0–150, one translated to 100–250, the other to 300–450](segmentation-same-program-twice.svg) + +Here the same program runs twice, but with different translation functions. The first instance has a segment offset of 100, so that its virtual addresses 0–150 are translated to the physical addresses 100–250. The second instance has offset 300, which translates its virtual addresses to physical addresses 300–450. The important thing here is that both programs can run the same code and use the same virtual addresses without interfering with each other. + +Similarly, we can place programs at arbitrary physical memory locations now, even if they use completely different virtual addresses. Thus we can utilize the full amount of available memory without recompiling any program. + +### Fragmentation + +The differentiation between virtual and physical addresses makes segmentation really powerful. However, it has the problem of fragmentation. As an example, imagine that we want to run a third copy of the program we saw above: + +![Three virtual address spaces, but there is not enough continuous space for the third](segmentation-fragmentation.svg) + +There is no way to map the third instance of the program to physical memory without overlapping, even though there is more than enough free memory available. The problem is that we need _continuous_ memory and can't use the small free chunks. + +One way to combat this fragmentation is to pause execution, move the used parts of the memory closer together, update the translation, and then resume execution: + +![Three virtual address spaces after defragmentation](segmentation-fragmentation-compacted.svg) + +Now there is enough continuous space to start the third instance of our program. 
+ +The disadvantage of this defragmentation process is that it needs to copy large amounts of memory, which decreases performance. It also needs to be done regularly before the memory becomes too fragmented. This makes performance unpredictable, since programs are paused at random times and might become unresponsive. + +The fragmentation problem is one of the reasons that segmentation is no longer used by most systems. Segmentation is not even supported in 64-bit mode on x86 anymore. Instead _paging_ is used, which completely avoids the fragmentation problem. + +## Paging + +The idea is to divide both the virtual and the physical memory space into small, fixed-size blocks. The blocks of the virtual memory space are called _pages_ and the blocks of the physical address space are called _frames_. Each page can be individually mapped to a frame, which makes it possible to split larger memory regions across non-continuous physical frames. + +The advantage of this becomes visible if we recap the example of the fragmented memory space, but use paging instead of segmentation this time: + +![With paging the third program instance can be split across many smaller physical areas](paging-fragmentation.svg) + +In this example we have a page size of 50 bytes, which means that each memory region is split across three pages. Each page is mapped to a frame individually, so a continuous virtual memory region can be mapped to non-continuous memory frames. This allows us to start the third instance of the program without performing any defragmentation beforehand. + +### Hidden Fragmentation + +Compared to segmentation, paging uses lots of small, fixed sized memory regions instead of a few large, variable sized regions. Since every frame has the same size, there are no frames that are too small to be used so that no fragmentation occurs. + +Or it _seems_ like no fragmentation occurs. There is still some hidden kind of fragmentation, the so-called _internal fragmentation_. 
Internal fragmentation occurs because not every memory region is an exact multiple of the page size. Imagine a program of size 101 in the above example: It would still need three pages of size 50, so it would occupy 49 bytes more than needed. + +Internal fragmentation is unfortunate, but often better than the external fragmentation that occurs with segmentation. It still wastes memory, but does not require defragmentation and makes the amount of fragmentation predictable (on average half a page per memory region). + +### How does it work? + +We saw that each of the potentially thousands of pages is individually mapped to a frame. This mapping information needs to be stored somewhere. Segmentation uses an individual segment selector register for each active memory region, which is not possible for paging since there are way more pages than registers. So the mapping information needs to be stored somewhere else. + + + + +------- + +TODO: update date \ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/paging-fragmentation.svg b/blog/content/second-edition/posts/09-paging/paging-fragmentation.svg new file mode 100644 index 00000000..171855b2 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/paging-fragmentation.svg @@ -0,0 +1,2 @@ + +
[SVG figure text (paging-fragmentation.svg): "Physical Memory" and three "Virtual Memory" regions; address labels 0, 100, 150, 250, 300, 450, 550]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg new file mode 100644 index 00000000..b49e8813 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg @@ -0,0 +1,2 @@ + +
[SVG figure text (segmentation-fragmentation-compacted.svg): "Physical Memory" and three "Virtual Memory" regions labeled 0–150; address labels 0, 150, 300, 550; segment offsets 0 and 150; one region marked "?"]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg new file mode 100644 index 00000000..95a143ca --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg @@ -0,0 +1,2 @@ + +
[SVG figure text (segmentation-fragmentation.svg): "Physical Memory" and three "Virtual Memory" regions labeled 0–150; address labels 0, 100, 250, 300, 450, 550; segment offsets 100 and 300; one region marked "?"]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/segmentation-same-program-twice.svg b/blog/content/second-edition/posts/09-paging/segmentation-same-program-twice.svg new file mode 100644 index 00000000..ca42bff5 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/segmentation-same-program-twice.svg @@ -0,0 +1,2 @@ + +
[SVG figure text (segmentation-same-program-twice.svg): "Physical Memory" and two "Virtual Memory" regions labeled 0–150; address labels 0, 100, 250, 300, 450; segment offsets 100 and 300]
\ No newline at end of file From 164ae2ba322c9ab856458f32f4679da273a2973a Mon Sep 17 00:00:00 2001 From: Philipp Oppermann Date: Thu, 13 Dec 2018 17:03:59 +0100 Subject: [PATCH 02/21] Finish first draft of post --- .../second-edition/posts/09-paging/index.md | 157 +++++++++++++++++- .../posts/09-paging/multilevel-page-table.svg | 2 + .../multilevel-paging-motivation.svg | 2 + .../posts/09-paging/paging-page-tables.svg | 2 + .../segmentation-fragmentation-compacted.svg | 2 +- .../09-paging/segmentation-fragmentation.svg | 2 +- .../09-paging/single-level-page-table.svg | 2 + ...86_64-page-table-translation-addresses.svg | 2 + .../x86_64-page-table-translation.svg | 2 + .../x86_64-table-indices-from-address.svg | 2 + 10 files changed, 170 insertions(+), 5 deletions(-) create mode 100644 blog/content/second-edition/posts/09-paging/multilevel-page-table.svg create mode 100644 blog/content/second-edition/posts/09-paging/multilevel-paging-motivation.svg create mode 100644 blog/content/second-edition/posts/09-paging/paging-page-tables.svg create mode 100644 blog/content/second-edition/posts/09-paging/single-level-page-table.svg create mode 100644 blog/content/second-edition/posts/09-paging/x86_64-page-table-translation-addresses.svg create mode 100644 blog/content/second-edition/posts/09-paging/x86_64-page-table-translation.svg create mode 100644 blog/content/second-edition/posts/09-paging/x86_64-table-indices-from-address.svg diff --git a/blog/content/second-edition/posts/09-paging/index.md b/blog/content/second-edition/posts/09-paging/index.md index 13b0a428..1e9c38f9 100644 --- a/blog/content/second-edition/posts/09-paging/index.md +++ b/blog/content/second-edition/posts/09-paging/index.md @@ -6,10 +6,12 @@ date = 0000-01-01 template = "second-edition/page.html" +++ -In this post we start exploring memory management. We explore the difference between virtual memory and physical memory and learn how the translation process works. 
TODO +This post introduces _paging_, a very common memory management scheme that we will also use for our operating system. It explains why memory isolation is needed, how _segmentation_ works, what _virtual memory_ is, and how paging solves memory fragmentation issues. It also explores (multilevel) page tables and explains how paging works on the x86_64 architecture. +In contrast to the other posts on this blog, this post does not contain any code. The reason for this is that a working paging implementation already needs many advanced features, which might be overwhelming if introduced right at the beginning. By focusing on the fundamentals first we can explain the concepts step by step, before we introduce the advanced features and the implementation in the next post. + This blog is openly developed on [Github]. If you have any problems or questions, please open an issue there. You can also leave comments [at the bottom]. [Github]: https://github.com/phil-opp/blog_os @@ -88,16 +90,165 @@ In this example we have a page size of 50 bytes, which means that each memory re Compared to segmentation, paging uses lots of small, fixed sized memory regions instead of a few large, variable sized regions. Since every frame has the same size, there are no frames that are too small to be used so that no fragmentation occurs. -Or it _seems_ like no fragmentation occurs. There is still some hidden kind of fragmentation, the so-called _internal fragmentation_. Internal fragmentation occurs because not every memory region is an exact multiple of the page size. Imagine a program of size 101 in the above example: It would still need three pages of size 50, so it would occupy 49 bytes more than needed. +Or it _seems_ like no fragmentation occurs. There is still some hidden kind of fragmentation, the so-called _internal fragmentation_. Internal fragmentation occurs because not every memory region is an exact multiple of the page size. 
Imagine a program of size 101 in the above example: It would still need three pages of size 50, so it would occupy 49 bytes more than needed. To differentiate the two types of fragmentation, the kind of fragmentation that happens when using segmentation is called _external fragmentation_. Internal fragmentation is unfortunate, but often better than the external fragmentation that occurs with segmentation. It still wastes memory, but does not require defragmentation and makes the amount of fragmentation predictable (on average half a page per memory region). ### How does it work? -We saw that each of the potentially thousands of pages is individually mapped to a frame. This mapping information needs to be stored somewhere. Segmentation uses an individual segment selector register for each active memory region, which is not possible for paging since there are way more pages than registers. So the mapping information needs to be stored somewhere else. +We saw that each of the potentially thousands of pages is individually mapped to a frame. This mapping information needs to be stored somewhere. Segmentation uses an individual segment selector register for each active memory region, which is not possible for paging since there are way more pages than registers. Instead paging uses a table structure called _page table_ to store the mapping information. +For our above example the page tables would look like this: +![Three page tables, one for each program instance. For instance 1 the mapping is 0->100, 50->150, 100->200. For instance 2 it is 0->300, 50->350, 100->400. For instance 3 it is 0->250, 50->450, 100->500.](paging-page-tables.svg) +We see that each program instance has its own page table. A pointer to the currently active table is stored in a special CPU register. On `x86`, this register is called `CR3`. It is the job of the operating system to load this register with the correct value before running each program instance. 
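The lookup that such a single-level table enables can be sketched as follows. This is a minimal model under stated assumptions: `SimplePageTable`, its `frames` field, and the `PAGE_SIZE` constant of 50 bytes are hypothetical names chosen to mirror the toy example above. They are not the x86_64 table format, and on real hardware the CPU performs this lookup itself.

```rust
/// Page size of the toy example above (assumption: 50 bytes).
pub const PAGE_SIZE: u64 = 50;

/// A hypothetical single-level page table for the toy example.
pub struct SimplePageTable {
    /// Entry `i` holds the start address of the frame that page `i`
    /// is mapped to, or `None` if the page is unmapped.
    pub frames: Vec<Option<u64>>,
}

impl SimplePageTable {
    /// Translates a virtual address by splitting it into a page index
    /// and an offset within the page, then looking up the frame.
    pub fn translate(&self, virt: u64) -> Option<u64> {
        let page_index = (virt / PAGE_SIZE) as usize;
        let offset = virt % PAGE_SIZE;
        // `get` handles indices beyond the table, `flatten` unmapped pages.
        let frame_start = self.frames.get(page_index).copied().flatten()?;
        Some(frame_start + offset)
    }
}
```

For the third program instance, a table with the frames 250, 450, and 500 translates virtual address 75 (page 1, offset 25) to physical address 475. Switching to another instance only means switching to a different table, which is what loading the table pointer register achieves.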
+ +On each memory access, the CPU reads the table pointer from the register and looks up the mapped frame for the accessed page in the table. This is entirely done in hardware and completely transparent to the running program. To speed up the translation process, many CPU architectures have a special cache that remembers the results of the last translations. + +Depending on the architecture, page table entries can also store attributes such as access permissions in a flags field. In the above example, the "r/w" flag makes the page both readable and writable. + +### Multilevel Page Tables + +The simple page tables we just saw have a problem in larger address spaces: they waste memory. For example, imagine a program that uses the four virtual pages 0, 1 000 000, 1 000 050, and 1 000 100: + +![Page 0 mapped to frame 0 and pages 1 000 000–1 000 150 mapped to frames 100–250](single-level-page-table.svg) + +It only needs 4 physical frames, but the page table has over a million entries. We can't omit the empty entries because then the CPU would no longer be able to jump directly to the correct entry in the translation process (e.g. it is no longer guaranteed that the fourth page uses the fourth entry). + +To reduce the wasted memory, we can use a **two-level page table**. The idea is that we use different page tables for different address regions. An additional table called _level 2_ page table contains the mapping between address regions and (level 1) page tables. + +This is best explained by an example. Let's define that each level 1 page table is responsible for a region of size 10 000. Then the following tables would exist for the above example mapping: + +![Page 0 points to entry 0 of the level 2 page table, which points to the level 1 page table T1. The first entry of T1 points to frame 0, the other entries are empty. Pages 1 000 000–1 000 150 point to the 100th entry of the level 2 page table, which points to a different level 1 page table T2. 
The first three entries of T2 point to frames 100–250, the other entries are empty.](multilevel-page-table.svg) + +Page 0 falls into the first 10 000 byte region, so it uses the first entry of the level 2 page table. This entry points to level 1 page table T1, which specifies that page 0 points to frame 0. + +The pages 1 000 000, 1 000 050, and 1 000 100 all fall into the 100th memory region (which goes from 1 000 000 to 1 010 000), so they use the 100th entry of the level 2 page table. This entry points at a different level 1 page table T2. The T2 table maps the three pages to frames 100, 150, and 200. Note that the page number does not include the region offset, so e.g. for page 1 000 050 we use the T2 entry for page 50 (= 1 000 050 - 1 000 000). + +We still have 100 empty entries in the level 2 table, but much fewer than the million empty entries before. The reason for these savings is that we don't need to create level 1 page tables for the unmapped memory regions between 10 000 and 1 000 000. + +The principle of two-level page tables can be extended to three, four, or more levels. Then the page table register points at the highest level table, which points to the next lower level table, which points to the next lower level, and so on. The level 1 page table then points at the mapped frame. The principle in general is called a _multilevel_ or _hierarchical_ page table. + +Now that we know how paging and multilevel page tables work, we can look at how paging is implemented in the x86_64 architecture (we assume in the following that the CPU runs in 64-bit mode). + +## Paging on x86_64 + +The x86_64 architecture uses a 4-level page table and a page size of 4KiB. Each page table, independent of the level, has a fixed size of 512 entries. Each entry has a size of 8 bytes, so each table is 512 * 8B = 4KiB large and thus fits exactly into one page. 
+ +The page table index for each level is derived directly from the virtual address: + +![Bits 0–12 are the page offset, bits 12–21 the level 1 index, bits 21–30 the level 2 index, bits 30–39 the level 3 index, and bits 39–48 the level 4 index](x86_64-table-indices-from-address.svg) + +We see that each table index consists of 9 bits, which makes sense because each table has 2^9 = 512 entries. The lowest 12 bits are the offset in the 4KiB page (2^12 bytes = 4KiB). Bits 48 to 64 are discarded, which means that x86_64 is not really 64-bit and only supports 48-bit addresses. There are plans to extend the address size to 57 bits through a [5-level page table], but no processors that support this feature exist yet. + +[5-level page table]: https://en.wikipedia.org/wiki/Intel_5-level_paging + +Even though bits 48 to 64 are discarded, they can't be set to arbitrary values. Instead all bits in this range have to be copies of bit 47 in order to keep addresses unique and allow future extensions like the 5-level page table. This is called _sign-extension_ because it's very similar to the [sign extension in two's complement]. When an address is not correctly sign-extended, the CPU throws an exception. + +[sign extension in two's complement]: https://en.wikipedia.org/wiki/Two's_complement#Sign_extension + +### Example Translation + +Let's go through an example to understand how the translation process works in detail: + +![An example 4-level page hierarchy with each page table shown in physical memory](x86_64-page-table-translation.svg) + +The physical address of the currently active level 4 page table, which is the root of the 4-level page table, is stored in the `CR3` register. Each page table entry then points to the physical frame of the next level table. The entry of the level 1 table then points to the mapped frame. 
Note that all addresses in the page tables are physical instead of virtual, because otherwise the CPU would need to translate those addresses too (which could cause a never-ending recursion). + +The above page table hierarchy maps two pages (in blue). The start addresses of these pages are `0x803FE7F000` and `0x803FE00000`. Let's see what happens when the program tries to read from address `0x803FE7F5CE`. First, we convert the address to binary and determine the page table indices and the page offset for the address: + +![An example 4-level page hierarchy with each page table shown in physical memory](x86_64-page-table-translation-addresses.svg) + +With these indices, we can now walk the page table hierarchy to determine the mapped frame for the address: + +- We start by reading the address of the level 4 table out of the `CR3` register. +- Then we look at the entry with index 1 of that table, which tells us that the level 3 table is stored at address 16KiB. +- We load the level 3 table from that address and look at the entry with index 0, which points us to the level 2 table at 14KiB. +- The level 2 index is 511, so we look at the last entry of that table to find out the address of the level 1 table. +- From the entry with index 127 of the level 1 table we finally find out that the page is mapped to frame 12KiB, or 0x3000 in hexadecimal. +- The final step is to add the page offset to the frame address to get the physical address 0x3000 + 0x5ce = 0x35ce. + +The permissions for the page in the level 1 table are `r`, which means read-only. The hardware enforces these permissions and would throw an exception if we tried to write to that page. Permissions in higher level pages restrict the possible permissions in lower levels, so if we set the level 3 entry to read-only, no pages that use this entry can be writable, even if lower levels specify read/write permissions. 
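The index arithmetic used in the walk above can be checked with a short Rust sketch. The names `TableIndices`, `table_indices`, `is_canonical`, and `physical_address` are hypothetical helpers invented for this illustration. They only reproduce the bit slicing that the CPU performs in hardware and do not walk any real page tables.

```rust
/// Table indices and page offset of a virtual address under 4-level
/// paging with 4KiB pages (hypothetical helper type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub struct TableIndices {
    pub level_4: u16,
    pub level_3: u16,
    pub level_2: u16,
    pub level_1: u16,
    pub page_offset: u16,
}

/// Slices a virtual address into four 9-bit table indices
/// and the 12-bit page offset.
pub fn table_indices(virt: u64) -> TableIndices {
    TableIndices {
        level_4: ((virt >> 39) & 0x1ff) as u16,
        level_3: ((virt >> 30) & 0x1ff) as u16,
        level_2: ((virt >> 21) & 0x1ff) as u16,
        level_1: ((virt >> 12) & 0x1ff) as u16,
        page_offset: (virt & 0xfff) as u16,
    }
}

/// Returns true if bits 48 to 63 are copies of bit 47 (sign extension).
pub fn is_canonical(virt: u64) -> bool {
    // Shifting left drops bits 48..64; the arithmetic right shift on the
    // signed value then refills them with copies of bit 47.
    let sign_extended = (((virt << 16) as i64) >> 16) as u64;
    sign_extended == virt
}

/// Combines the start address of the mapped frame with the page offset.
pub fn physical_address(frame_start: u64, page_offset: u16) -> u64 {
    frame_start + u64::from(page_offset)
}
```

Applied to `0x803FE7F5CE`, `table_indices` yields the indices 1, 0, 511, and 127 with page offset `0x5ce`, matching the walk above. Passing the frame start `12 * 1024` and that offset to `physical_address` gives the final physical address.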
+ +It's important to note that even though this example used only a single instance of each table, there are typically multiple instances of each level in each address space. At maximum, there are: + +- one level 4 table, +- 512 level 3 tables (because the level 4 table has 512 entries), +- 512 * 512 level 2 tables (because each of the 512 level 3 tables has 512 entries), and +- 512 * 512 * 512 level 1 tables (512 entries for each level 2 table). + +### Page Table Format + +Page tables on the x86_64 architecture are basically an array of 512 entries. In Rust syntax: + +```rust +#[repr(align(4096))] +pub struct PageTable { + entries: [PageTableEntry; 512], +} +``` + +As indicated by the `repr` attribute, page tables need to be page aligned, i.e. aligned on a 4KiB boundary. This requirement guarantees that a page table always fills a complete page and allows an optimization that makes entries very compact. + +Each entry is 8 bytes (64 bits) large and has the following format: + +Bit(s) | Name | Meaning +------ | ---- | ------- +0 | present | the page is currently in memory +1 | writable | it's allowed to write to this page +2 | user accessible | if not set, only kernel mode code can access this page +3 | write through caching | writes go directly to memory +4 | disable cache | no cache is used for this page +5 | accessed | the CPU sets this bit when this page is used +6 | dirty | the CPU sets this bit when a write to this page occurs +7 | huge page/null | must be 0 in P1 and P4, creates a 1GiB page in P3, creates a 2MiB page in P2 +8 | global | page isn't flushed from caches on address space switch (PGE bit of CR4 register must be set) +9-11 | available | can be used freely by the OS +12-51 | physical address | the page aligned 52-bit physical address of the frame or the next page table +52-62 | available | can be used freely by the OS +63 | no execute | forbid executing code on this page (the NXE bit in the EFER register must be set) + +We see that only bits 12–51 
are used to store the physical frame address, the remaining bits are used as flags or can be freely used by the operating system. This is possible because we always point to a 4096-byte aligned address, either to a page-aligned page table or to the start of a mapped frame. This means that bits 0–11 are always zero, so there is no reason to store these bits because the hardware can just set them to zero before using the address. The same is true for bits 52–63, because the x86_64 architecture only supports 52-bit physical addresses (similar to how it only supports 48-bit virtual addresses). + +Let's take a closer look at the available flags: + +- The `present` flag differentiates mapped pages from unmapped ones. It can be used to temporarily swap out pages to disk when main memory becomes full. When the page is accessed subsequently, a special exception called _page fault_ occurs, to which the operating system can react by reloading the missing page from disk and then continuing the program. +- The `writable` and `no execute` flags control whether the contents of the page are writable or contain executable instructions respectively. +- The `accessed` and `dirty` flags are automatically set by the CPU when a read or write to the page occurs. This information can be leveraged by the operating system e.g. to decide which pages to swap out or whether the page contents were modified since the last save to disk. +- The `write through caching` and `disable cache` flags allow controlling the caches for every page individually. +- The `user accessible` flag makes a page available to userspace code, otherwise it is only accessible when the CPU is in kernel mode. This feature can be used to make [system calls] faster by keeping the kernel mapped while a userspace program is running. However, the [Spectre] vulnerability can allow userspace programs to read these pages nonetheless. 
+- The `global` flag signals to the hardware that a page is available in all address spaces and thus does not need to be removed from the translation cache (see the section about the TLB below) on address space switches. This flag is commonly used together with a cleared `user accessible` flag to map the kernel code to all address spaces. +- The `huge page` flag allows creating pages of larger sizes by letting the entries of the level 2 or level 3 page tables directly point to a mapped frame. With this bit set, the page size increases by a factor of 512 to either 2MiB = 512 * 4KiB for level 2 entries or even 1GiB = 512 * 2MiB for level 3 entries. The advantage of using larger pages is that fewer lines of the translation cache and fewer page tables are needed. + +[system calls]: https://en.wikipedia.org/wiki/System_call +[Spectre]: https://en.wikipedia.org/wiki/Spectre_(security_vulnerability) + +The `x86_64` crate provides types for [page tables] and their [entries], so we don't need to create these structures ourselves. + +[page tables]: https://docs.rs/x86_64/0.3.4/x86_64/structures/paging/struct.PageTable.html +[entries]: https://docs.rs/x86_64/0.3.4/x86_64/structures/paging/struct.PageTableEntry.html + +### The Translation Lookaside Buffer + +A 4-level page table makes the translation of virtual addresses expensive, because each translation requires 4 memory accesses. To improve performance, the x86_64 architecture caches the last few translations in the so-called _translation lookaside buffer_ (TLB). This allows skipping the translation when it is still cached. + +Unlike the other CPU caches, the TLB is not fully transparent since it does not update or remove translations when a mapping in the page table changes. Instead it must be manually updated whenever a mapping in the page table changes. 
To do this, there is a special CPU instruction called [`invlpg`] (“invalidate page”) that removes the translation for the specified page from the TLB, so that it is loaded again from the page table on the next access. The TLB can also be flushed completely by reloading the `CR3` register, which simulates an address space switch. The `x86_64` crate provides Rust functions for both variants in the [`tlb` module].
+
+[`invlpg`]: https://www.felixcloutier.com/x86/INVLPG.html
+[`tlb` module]: https://docs.rs/x86_64/0.3.4/x86_64/instructions/tlb/index.html
+
+It is important to remember to flush the TLB on each page table modification because otherwise the CPU might keep using the old translation, which can lead to non-deterministic bugs that are very hard to debug.
+
+## Summary
+
+This post introduced two memory protection techniques: segmentation and paging. While the former uses variable-sized memory regions and suffers from external fragmentation, the latter uses fixed-sized pages and allows much more fine-grained control over access permissions.
+
+Paging stores the mapping information for pages in page tables with one or more levels. The x86_64 architecture uses 4-level page tables and a page size of 4KiB. The hardware automatically walks the page tables and caches the resulting translations in the translation lookaside buffer (TLB). This buffer is not updated transparently and needs to be flushed manually on page table changes.
+
+## What's next?
+
+The next post will build upon the fundamentals we learned in this post. It will introduce an advanced technique called _recursive page tables_ and then use that feature to implement a software-based translation function and mapping functions.
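
As a recap of the entry format described in this post, the following is a minimal sketch of how a raw 64-bit page table entry could be decoded by hand. The helper names (`frame_address`, `has_flag`, `huge_page_size`) are hypothetical and are not the `x86_64` crate's API; the bit positions are those of the standard x86_64 entry format, and the example entry value is made up for illustration.

```rust
// Hypothetical helpers for illustration; this is NOT the `x86_64` crate's API.
// Bit positions follow the x86_64 page table entry format.
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const USER_ACCESSIBLE: u64 = 1 << 2;
const HUGE_PAGE: u64 = 1 << 7;
const NO_EXECUTE: u64 = 1 << 63;

/// Mask selecting bits 12–51, which hold the physical frame address.
const ADDR_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Returns the 4096-byte aligned physical address stored in a raw entry.
fn frame_address(entry: u64) -> u64 {
    entry & ADDR_MASK
}

/// Returns whether the given flag bit is set in a raw entry.
fn has_flag(entry: u64, flag: u64) -> bool {
    entry & flag != 0
}

/// Size in bytes of the page mapped by a `huge page` entry at the given
/// table level. Only level 2 and level 3 entries can map huge pages.
fn huge_page_size(level: u8) -> Option<u64> {
    const PAGE_SIZE: u64 = 4096;
    match level {
        2 => Some(512 * PAGE_SIZE),       // 2MiB
        3 => Some(512 * 512 * PAGE_SIZE), // 1GiB
        _ => None,
    }
}

fn main() {
    // Made-up entry: frame at 0x5000, present, writable, not executable.
    let entry: u64 = 0x5000 | PRESENT | WRITABLE | NO_EXECUTE;

    println!("frame address:   {:#x}", frame_address(entry));
    println!("present:         {}", has_flag(entry, PRESENT));
    println!("writable:        {}", has_flag(entry, WRITABLE));
    println!("user accessible: {}", has_flag(entry, USER_ACCESSIBLE));
    println!("huge page:       {}", has_flag(entry, HUGE_PAGE));
    println!("no execute:      {}", has_flag(entry, NO_EXECUTE));
    println!("level 2 huge page size: {:?}", huge_page_size(2));
}
```

In real kernel code the `PageTableEntry` type of the `x86_64` crate would be the better choice; the sketch only makes the bit layout concrete.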
------- diff --git a/blog/content/second-edition/posts/09-paging/multilevel-page-table.svg b/blog/content/second-edition/posts/09-paging/multilevel-page-table.svg new file mode 100644 index 00000000..e9333357 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/multilevel-page-table.svg @@ -0,0 +1,2 @@ + +
[Figure multilevel-page-table.svg: virtual memory and physical memory connected through a Level 2 Page Table (columns: Memory Region, Level 1 Table) and two level 1 page tables, T1 and T2 (columns: Page, Frame, Flags)]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/multilevel-paging-motivation.svg b/blog/content/second-edition/posts/09-paging/multilevel-paging-motivation.svg new file mode 100644 index 00000000..c7500104 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/multilevel-paging-motivation.svg @@ -0,0 +1,2 @@ + +
[Figure multilevel-paging-motivation.svg: virtual memory with regions at 0 and at 1 000 000 to 1 000 150, mapped to physical memory at 0 and at 100 to 250]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/paging-page-tables.svg b/blog/content/second-edition/posts/09-paging/paging-page-tables.svg new file mode 100644 index 00000000..0674e2f2 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/paging-page-tables.svg @@ -0,0 +1,2 @@ + +
[Figure paging-page-tables.svg: three virtual memory regions (each 0 to 150) mapped into physical memory, each through its own page table (columns: Page, Frame, Flags)]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg index b49e8813..f71d3f97 100644 --- a/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg +++ b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation-compacted.svg @@ -1,2 +1,2 @@ -
[Figure segmentation-fragmentation-compacted.svg, old version: three virtual memory regions (each 0 to 150) and physical memory (0 to 550); two regions placed at Offset 0 and Offset 150, the third marked "?"]
\ No newline at end of file +
[Figure segmentation-fragmentation-compacted.svg, new version: three virtual memory regions (each 0 to 150) and physical memory (0 to 550); regions placed at Offset 0, Offset 150, and Offset 300]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg index 95a143ca..6d4d726e 100644 --- a/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg +++ b/blog/content/second-edition/posts/09-paging/segmentation-fragmentation.svg @@ -1,2 +1,2 @@ -
[Figure segmentation-fragmentation.svg, old version: three virtual memory regions (each 0 to 150) and physical memory (0 to 550); two regions placed at Offset 100 and Offset 300, the third marked "?"]
\ No newline at end of file +
[Figure segmentation-fragmentation.svg, new version: three virtual memory regions (each 0 to 150) and physical memory (0 to 550); two regions placed at Offset 100 and Offset 300, the third marked "?"]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/single-level-page-table.svg b/blog/content/second-edition/posts/09-paging/single-level-page-table.svg new file mode 100644 index 00000000..8dbefd7e --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/single-level-page-table.svg @@ -0,0 +1,2 @@ + +
[Figure single-level-page-table.svg: virtual memory and physical memory with a single page table (columns: Page, Frame, Flags). Entries: 0 → 0 (r/w), 50 → -, …, 1 000 000 → 100 (r/w), 1 000 050 → 150 (r/w), 1 000 100 → 200 (r/w), 1 000 150 → -]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/x86_64-page-table-translation-addresses.svg b/blog/content/second-edition/posts/09-paging/x86_64-page-table-translation-addresses.svg new file mode 100644 index 00000000..09aae1dc --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/x86_64-page-table-translation-addresses.svg @@ -0,0 +1,2 @@ + +
[Figure x86_64-page-table-translation-addresses.svg: Virtual Address 00000000 00000000 000000001 000000000 111111111 001111111 0101 1100 1110, split into Sign Extension, Level 4 Index = 1, Level 3 Index = 0, Level 2 Index = 511, Level 1 Index = 127, Offset = 0x5CE]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/x86_64-page-table-translation.svg b/blog/content/second-edition/posts/09-paging/x86_64-page-table-translation.svg new file mode 100644 index 00000000..719fa4aa --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/x86_64-page-table-translation.svg @@ -0,0 +1,2 @@ + +
[Figure x86_64-page-table-translation.svg: physical memory in 4KiB frames containing the Level 4, Level 3, Level 2, and Level 1 Page Tables (columns: Index, Frame, Flags), with the CR3 Register pointing to the level 4 page table]
\ No newline at end of file diff --git a/blog/content/second-edition/posts/09-paging/x86_64-table-indices-from-address.svg b/blog/content/second-edition/posts/09-paging/x86_64-table-indices-from-address.svg new file mode 100644 index 00000000..18bcdd21 --- /dev/null +++ b/blog/content/second-edition/posts/09-paging/x86_64-table-indices-from-address.svg @@ -0,0 +1,2 @@ + +
[Figure x86_64-table-indices-from-address.svg: bit boundaries 0, 12, 21, 30, 39, 48, and 64 of a virtual address, labeled from the lowest bits upward as Page Offset, Level 1 Index, Level 2 Index, Level 3 Index, and Level 4 Index]
\ No newline at end of file From 60e84bd65938ee3e7d329cae2693b8a49a077d97 Mon Sep 17 00:00:00 2001 From: Philipp Oppermann Date: Thu, 13 Dec 2018 17:04:27 +0100 Subject: [PATCH 03/21] Rewrite summary and `what's next?` section of previous post --- .../second-edition/posts/08-hardware-interrupts/index.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/blog/content/second-edition/posts/08-hardware-interrupts/index.md b/blog/content/second-edition/posts/08-hardware-interrupts/index.md index d60dadad..f86e707a 100644 --- a/blog/content/second-edition/posts/08-hardware-interrupts/index.md +++ b/blog/content/second-edition/posts/08-hardware-interrupts/index.md @@ -596,10 +596,12 @@ It's possible to configure some aspects of a PS/2 keyboard, for example which sc ## Summary -In this post we learned how to enable and handle external interrupts. We learned about the 8259 PIC and its primary/secondary layout, the remapping of the interrupt numbers, and the "end of interrupt" signal. We saw that the hardware timer and the keyboard controller are active by default and start to send interrupts as soon as we enable them in the CPU. We learned about the `hlt` instruction, which halts the CPU until the next interrupt, and about the scancode sets of PS/2 keyboards. +This post explained how to enable and handle external interrupts. We learned about the 8259 PIC and its primary/secondary layout, the remapping of the interrupt numbers, and the "end of interrupt" signal. We implemented handlers for the hardware timer and the keyboard and learned about the `hlt` instruction, which halts the CPU until the next interrupt. Now we are able to interact with our kernel and have some fundamental building blocks for creating a small shell or simple games. ## What's next? -As already mentioned, the 8259 PIC has been superseded by the [APIC], a controller with more capabilities and multicore support. 
In the next post we will explore this controller and learn how to use its integrated timer and how to set interrupt priorities. +Timer interrupts are essential for an operating system, because they provide a way to periodically interrupt the running process and regain control in the kernel. The kernel can then switch to a different process and create the illusion that multiple processes run in parallel. + +But before we can create processes or threads, we need a way to allocate memory for them. The next posts will explore memory management to provide this fundamental building block. From cdd0328f032c1d6b6596369e18a16ad39db353a8 Mon Sep 17 00:00:00 2001 From: Philipp Oppermann Date: Thu, 13 Dec 2018 17:13:43 +0100 Subject: [PATCH 04/21] Add paging post to index page --- blog/templates/second-edition/index.html | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/blog/templates/second-edition/index.html b/blog/templates/second-edition/index.html index ccc75151..85fb7442 100644 --- a/blog/templates/second-edition/index.html +++ b/blog/templates/second-edition/index.html @@ -41,6 +41,11 @@ {{ macros::post_link(page=posts.7) }} + +
+ {{ macros::post_link(page=posts.8) }} +
+