Revise post and add new introduction

Philipp Oppermann
2019-01-10 16:05:38 +01:00
parent c285ac7c4f
commit 6d5ebf56a4
3 changed files with 148 additions and 78 deletions


## Introduction
In the [previous post] we learned about the principles of paging and how the 4-level page tables on the x86_64 architecture work. We also found out that the bootloader already set up a 4-level page table hierarchy for our kernel, since paging is mandatory on x86_64 in 64-bit mode. This means that our kernel already runs on virtual addresses.
[previous post]: ./second-edition/posts/09-paging-introduction/index.md
["A minimal Rust kernel"]: ./second-edition/posts/02-minimal-rust-kernel/index.md#creating-a-bootimage
This makes our kernel much safer, since every memory access that is out of bounds causes a page fault exception instead of writing to random physical memory. The bootloader even set the correct access permissions for each page, which means that only the pages containing code are executable and only data pages are writable.
### Page Faults
Let's try to cause a page fault by accessing some memory outside of our kernel! First, we create a page fault handler and register it in our IDT, so that we see a page fault exception instead of a generic [double fault]:
[double fault]: ./second-edition/posts/07-double-faults/index.md
```rust
// in src/interrupts.rs

lazy_static! {
    static ref IDT: InterruptDescriptorTable = {
        let mut idt = InterruptDescriptorTable::new();
        […]
        idt.page_fault.set_handler_fn(page_fault_handler); // new
        idt
    };
}

use x86_64::structures::idt::PageFaultErrorCode;

extern "x86-interrupt" fn page_fault_handler(
    stack_frame: &mut ExceptionStackFrame,
    _error_code: PageFaultErrorCode,
) {
    use crate::hlt_loop;
    use x86_64::registers::control::Cr2;

    println!("EXCEPTION: PAGE FAULT");
    println!("Accessed Address: {:?}", Cr2::read());
    println!("{:#?}", stack_frame);
    hlt_loop();
}
```
The [`CR2`] register is automatically set by the CPU on a page fault and contains the accessed virtual address that caused the page fault. We use the [`Cr2::read`] function of the `x86_64` crate to read and print it. Normally the [`PageFaultErrorCode`] type would provide more information about the type of memory access that caused the page fault, but there is currently an [LLVM bug] that passes an invalid error code, so we ignore it for now. We can't continue execution without resolving the page fault, so we enter a [`hlt_loop`] at the end.
[`CR2`]: https://en.wikipedia.org/wiki/Control_register#CR2
[`Cr2::read`]: https://docs.rs/x86_64/0.3.5/x86_64/registers/control/struct.Cr2.html#method.read
[`PageFaultErrorCode`]: https://docs.rs/x86_64/0.3.4/x86_64/structures/idt/struct.PageFaultErrorCode.html
[LLVM bug]: https://github.com/rust-lang/rust/issues/57270
[`hlt_loop`]: ./second-edition/posts/08-hardware-interrupts/index.md#the
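Once the bug is fixed, the error code will tell us for example whether the fault was caused by a read or a write and whether the accessed page was present at all. The following is only a rough sketch of how such a check could look, using the `CAUSED_BY_WRITE` and `PROTECTION_VIOLATION` flags of the `PageFaultErrorCode` type; the `describe_error_code` function is a made-up name and not part of our kernel:

```rust
// Sketch only: how the error code could be evaluated once the LLVM bug is
// fixed. `PageFaultErrorCode` is a bitflags type, so we can test single bits.
fn describe_error_code(error_code: PageFaultErrorCode) {
    if error_code.contains(PageFaultErrorCode::CAUSED_BY_WRITE) {
        println!("caused by a write access");
    } else {
        println!("caused by a read access");
    }
    if error_code.contains(PageFaultErrorCode::PROTECTION_VIOLATION) {
        println!("page is present, but the access was not allowed");
    } else {
        println!("page is not present");
    }
}
```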
Now we can try to access some memory outside our kernel:
```rust
// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    use blog_os::interrupts::PICS;

    println!("Hello World{}", "!");

    // set up the IDT first, otherwise we would enter a boot loop instead of
    // invoking our page fault handler
    blog_os::gdt::init();
    blog_os::interrupts::init_idt();
    unsafe { PICS.lock().initialize() };
    x86_64::instructions::interrupts::enable();

    // new
    let ptr = 0xdeadbeaf as *mut u32;
    unsafe { *ptr = 42; }

    println!("It did not crash!");
    blog_os::hlt_loop();
}
```
When we run it, we see that our page fault handler is called:
![EXCEPTION: Page Fault, Accessed Address: VirtAddr(0xdeadbeaf), ExceptionStackFrame: {…}](qemu-page-fault.png)
The `CR2` register indeed contains `0xdeadbeaf`, the address that we tried to access. This virtual address has no mapping in the page tables, so a page fault occurred.
We see that the current instruction pointer is `0x20430a`, so we know that this address points to a code page. Code pages are mapped read-only by the bootloader, so reading from this address works but writing causes a page fault. You can try this by changing the `0xdeadbeaf` pointer:
```rust
// Note: The actual address might be different for you. Use the address that
// your page fault handler reports.
let ptr = 0x20430a as *mut u32;
// read from a code page -> works
unsafe { let x = *ptr; }
// write to a code page -> page fault
unsafe { *ptr = 42; }
```
You will see that the read works, but the write results in another page fault exception.
### The Problem
We just saw that it is a good thing that our kernel already runs on virtual addresses, as it improves safety. (In case you are wondering how we can still access the physical address `0xb8000` in order to print to the [VGA text buffer]: The bootloader identity mapped this frame, which means that it set up a page at the virtual address `0xb8000` that points to the physical frame with the same address.)

[VGA text buffer]: ./second-edition/posts/03-vga-text-buffer/index.md

However, running on virtual addresses also leads to a problem when we try to access the page tables from our kernel: Page tables use physical addresses internally that we can't access directly. We experienced that problem already [at the end of the previous post] when we tried to inspect the page tables from our kernel. The question is: How do we access the page tables that our kernel runs on in order to create new page mappings? The next section discusses this problem in detail and presents different approaches to a solution.

[at the end of the previous post]: ./second-edition/posts/09-paging-introduction/index.md#try-it-out
## Accessing Page Tables
Accessing the page tables from our kernel is not as easy as it may seem. To understand the problem, let's take a look at the example 4-level page table hierarchy from the previous post again:
![An example 4-level page hierarchy with each page table shown in physical memory](../paging-introduction/x86_64-page-table-translation.svg)
The important thing here is that each page table entry stores the _physical_ address of the next table. This avoids the need to run a translation for these addresses too, which would be bad for performance and could easily cause endless translation loops.
The problem for us is that we can't directly access physical addresses from our kernel, since our kernel also runs on top of virtual addresses. For example, when we access address `4KiB`, we access the _virtual_ address `4KiB`, not the _physical_ address `4KiB` where the level 4 page table lives. When we want to access the physical address `4KiB`, we can only do so through some virtual address that maps to it.
So in order to access page table frames, we need to map some virtual pages to them. There are different ways to create these mappings that all allow us to access arbitrary page table frames:
- A simple solution is to **identity map all page tables**:
![A virtual and a physical address space with various virtual pages mapped to the physical frame with the same address](identity-mapped-page-tables.svg)
In this example we see various identity-mapped page table frames. This way the physical addresses in the page tables are also valid virtual addresses, so that we can easily access the page tables of all levels starting from the CR3 register.
However, it clutters the virtual address space and makes it more difficult to find contiguous memory regions of larger sizes. For example, imagine that we want to create a virtual memory region of size 1000KiB in the above graphic, e.g. for [memory-mapping a file]. We can't start the region at `26KiB` because it would collide with the already mapped page at `1004KiB`. So we have to look further until we find a large enough unmapped area, for example at `1008KiB`. This is a similar fragmentation problem as with [segmentation].
[memory-mapping a file]: https://en.wikipedia.org/wiki/Memory-mapped_file
[segmentation]: ./second-edition/posts/09-paging-introduction/index.md#fragmentation
Equally, it makes it much more difficult to create new page tables, because we need to find physical frames whose corresponding pages aren't already in use. For example, let's assume that we reserved the _virtual_ 1000KiB memory region starting at `1008KiB` for our memory-mapped file. Now we can't use any frame with a _physical_ address between `1008KiB` and `2008KiB` anymore, because we can't identity map it.
- Alternatively, we could **map the page table frames only temporarily** when we need to access them. To be able to create the temporary mappings, we only need a single identity-mapped level 1 table:
![A virtual and a physical address space with an identity mapped level 1 table, which maps its 0th entry to the level 2 table frame, thereby mapping that frame to the page with address 0](temporarily-mapped-page-tables.svg)
The level 1 table in this graphic controls the first 2MiB of the virtual address space. This is because it is reachable by starting at the CR3 register and following the 0th entry in the level 4, level 3, and level 2 page tables. The entry with index `8` maps the virtual page at address `32KiB` to the physical frame at address `32KiB`, thereby identity mapping the level 1 table itself. The graphic shows this identity-mapping by the horizontal arrow at `32KiB`.
By writing to the identity-mapped level 1 table, our kernel can create up to 511 temporary mappings (512 minus the entry required for the identity mapping). In the above example, the kernel mapped the 0th entry of the level 1 table to the frame with address `24KiB`. This created a temporary mapping of the virtual page at `0KiB` to the physical frame of the level 2 page table, indicated by the dashed arrow. Now the kernel can access the level 2 page table by writing to the page starting at `0KiB`.
The process for accessing an arbitrary page table frame with temporary mappings would be:
1. Find a free entry in the identity-mapped level 1 table.
2. Map that entry to the physical frame of the page table that we want to access.
3. Access the target frame through the virtual page that corresponds to the entry.
4. Set the entry back to unused, thereby removing the temporary mapping again.

(A rough code sketch of these steps follows after this list.)
This approach keeps the virtual address space clean, since it reuses the same 512 virtual pages for creating the mappings. The drawback is that it is a bit cumbersome, especially since a new mapping might require modifications of multiple table levels, which means that we would need to repeat the above process multiple times.
- While both of the above approaches work, there is a third technique called **recursive page tables** that combines their advantages: It keeps all page table frames mapped at all times so that no temporary mappings are needed, and it also keeps the mapped pages together to avoid fragmentation of the virtual address space. This is the technique that we will use for our implementation, so it is described in detail in the following section.
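To make the temporary-mapping approach more concrete, here is a rough sketch of the four steps listed above. It is not part of our kernel; the table address, the chosen entry index, and the flag bits are assumptions based on the example graphic:

```rust
/// Sketch only: perform the temporary-mapping steps through an identity-mapped
/// level 1 table. We assume that the table lives at 32 KiB like in the graphic
/// above and that it controls the first 2 MiB of the virtual address space.
const LEVEL_1_TABLE: *mut [u64; 512] = 0x8000 as *mut [u64; 512]; // 32 KiB
const PRESENT_WRITABLE: u64 = 0b11; // `present` and `writable` flag bits

unsafe fn with_temporary_mapping<R>(
    frame_addr: u64, // physical address of the page table frame to access
    f: impl FnOnce(*mut u64) -> R,
) -> R {
    let level_1_table = &mut *LEVEL_1_TABLE;
    // steps 1 and 2: use a free entry (here: index 1) to map the target frame
    level_1_table[1] = frame_addr | PRESENT_WRITABLE;
    // (a real implementation would also need to flush the TLB entry here)
    // step 3: entry 1 of this level 1 table controls the virtual page at 4 KiB
    let result = f(0x1000 as *mut u64);
    // step 4: set the entry back to unused to remove the temporary mapping
    level_1_table[1] = 0;
    result
}
```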
## Recursive Page Tables
The idea behind this approach sounds simple: _Map some entry of the level 4 page table to the frame of the level 4 table itself_. By doing this, we effectively reserve a part of the virtual address space and map all current and future page table frames to that space. Thus, the single entry makes every table of every level accessible through a calculatable address.
Let's go through an example to understand how this all works:
![An example 4-level page hierarchy with each page table shown in physical memory. Entry 511 of the level 4 page is mapped to frame 4KiB, the frame of the level 4 table itself.](recursive-page-table.svg)
The only difference to the [example at the beginning of this post] is the additional entry at index `511` in the level 4 table, which is mapped to physical frame `4KiB`, the frame of the level 4 table itself.
[example at the beginning of this post]: #accessing-page-tables
It might take some time to wrap your head around the concept, but it works quite well in practice.
We saw that we can access tables of all levels by following the recursive entry once or multiple times before the actual translation. Since the indexes into the tables of the four levels are derived directly from the virtual address, we need to construct special virtual addresses for this technique. Remember, the page table indexes are derived from the address in the following way:
![Bits 0-12 are the page offset, bits 12-21 the level 1 index, bits 21-30 the level 2 index, bits 30-39 the level 3 index, and bits 39-48 the level 4 index](../paging-introduction/x86_64-table-indices-from-address.svg)
Let's assume that we want to access the level 1 page table that maps a specific page. As we learned above, this means that we have to follow the recursive entry one time before continuing with the level 4, level 3, and level 2 indexes. To do that we move each block of the address one block to the right and set the original level 4 index to the index of the recursive entry:
![Bits 0-12 are the offset into the level 1 table frame, bits 12-21 the level 2 index, bits 21-30 the level 3 index, bits 30-39 the level 4 index, and bits 39-48 the index of the recursive entry](table-indices-from-address-recursive-level-1.svg)
For accessing the level 2 table of that page, we move each index block two blocks to the right and set both the blocks of the original level 4 index and the original level 3 index to the index of the recursive entry:
![Bits 0-12 are the offset into the level 2 table frame, bits 12-21 the level 3 index, bits 21-30 the level 4 index, and bits 30-39 and bits 39-48 are the index of the recursive entry](table-indices-from-address-recursive-level-2.svg)
Accessing the level 3 table works by moving each block three blocks to the right and using the recursive index for the original level 4, level 3, and level 2 address blocks:
![Bits 0-12 are the offset into the level 3 table frame, bits 12-21 the level 4 index, and bits 21-30, bits 30-39 and bits 39-48 are the index of the recursive entry](table-indices-from-address-recursive-level-3.svg)
Finally, we can access the level 4 table by moving each block four blocks to the right and using the recursive index for all of the original index blocks:
![Bits 0-12 are the offset into the level 4 table frame and bits 12-21, bits 21-30, bits 30-39 and bits 39-48 are the index of the recursive entry](table-indices-from-address-recursive-level-4.svg)
The page table index blocks are 9 bits, so moving each block one block to the right means a bitshift by 9 bits: `address >> 9`. To derive the 12-bit offset field from the shifted index, we need to multiply it by 8, the size of a page table entry. Through this operation, we can calculate virtual addresses for the page tables of all four levels.
The table below summarizes the address structure for accessing the different kinds of frames:
Virtual address for | Address structure ([octal])
------------------- | -------------------------------
Page | `0o_SSSSSS_AAA_BBB_CCC_DDD_EEEE`
Level 1 Table | `0o_SSSSSS_RRR_AAA_BBB_CCC_DDDD`
Level 2 Table | `0o_SSSSSS_RRR_RRR_AAA_BBB_CCCC`
Level 3 Table | `0o_SSSSSS_RRR_RRR_RRR_AAA_BBBB`
Level 4 Table | `0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA`
[octal]: https://en.wikipedia.org/wiki/Octal
Whereas `AAA` is the level 4 index, `BBB` the level 3 index, `CCC` the level 2 index, and `DDD` the level 1 index of the mapped frame, and `EEEE` the offset into it. `RRR` is the index of the recursive entry. When an index (three digits) is transformed to an offset (four digits), it is done by multiplying it by 8 (the size of a page table entry). With this offset, the resulting address directly points to the respective page table entry.
`SSSSSS` are sign extension bits, which means that they are all copies of bit 47. This is a special requirement for valid addresses on the x86_64 architecture. We explained it in the [previous post][sign extension].
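As a small illustration of this calculation, the following sketch computes such addresses in code. It is not part of our kernel; the `recursive_table_entry_addr` function is a made-up name, and it returns the virtual address of the page table entry of the given level that is responsible for `addr`:

```rust
/// Sketch only: calculate the virtual address of the page table entry of the
/// given level that maps `addr`, assuming a recursive entry at `recursive_index`.
fn recursive_table_entry_addr(level: u8, addr: u64, recursive_index: u64) -> u64 {
    let r = recursive_index & 0o777;
    // extract the four 9-bit table indexes from the virtual address
    let l4 = (addr >> 39) & 0o777;
    let l3 = (addr >> 30) & 0o777;
    let l2 = (addr >> 21) & 0o777;
    let l1 = (addr >> 12) & 0o777;

    // move the index blocks to the right and fill the freed blocks with the
    // recursive index; the last remaining index becomes the offset (times 8)
    let (a, b, c, d, offset) = match level {
        1 => (r, l4, l3, l2, l1 * 8),
        2 => (r, r, l4, l3, l2 * 8),
        3 => (r, r, r, l4, l3 * 8),
        4 => (r, r, r, r, l4 * 8),
        _ => panic!("page table level must be between 1 and 4"),
    };
    let without_sign = (a << 39) | (b << 30) | (c << 21) | (d << 12) | offset;
    // sign extension: bits 48-63 must be copies of bit 47
    ((without_sign << 16) as i64 >> 16) as u64
}
```

For example, `recursive_table_entry_addr(4, 0, 511)` yields `0xffff_ffff_ffff_f000`, the address of the first level 4 entry, which we will meet again below.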
## Implementation
After all this theory we can finally start our implementation. We already mentioned that the bootloader created page tables for our kernel, but it also created a recursive mapping in the last entry of the level 4 table for us. The bootloader did this because otherwise there would be a [chicken or egg problem]: We need to access the level 4 table to create a recursive mapping, but we can't access it without some kind of mapping.
[chicken or egg problem]: https://en.wikipedia.org/wiki/Chicken_or_the_egg
We already used this recursive mapping [at the end of the previous post] to access the level 4 table, through the hardcoded address `0xffff_ffff_ffff_f000`. When we convert this address to [octal] and compare it with the above table, we can see that it exactly follows the structure of a level 4 table address, with `RRR` = `0o777` = 511, `AAAA` = 0, and all sign extension bits set to `1`:
```
structure: 0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA
address: 0o_177777_777_777_777_777_0000
```
Using hardcoded addresses is rarely a good idea, since they might become outdated. For example, our code would break if a future bootloader version used a different entry for the recursive mapping. Fortunately, the bootloader tells us the recursive entry by passing a _boot information structure_ to our kernel.
### Boot Information
To communicate the index of the recursive entry and other information to our kernel, the bootloader passes a reference to a boot information structure as an argument when calling our `_start` function. Right now we don't have this argument declared in our function, so we just ignore it. Let's add the argument to the signature of our `_start` function:
```rust
// in src/main.rs

use bootloader::bootinfo::BootInfo;

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start(boot_info: &'static BootInfo) -> ! {
    println!("Hello World{}", "!");
    println!("boot_info: {:x?}", boot_info);

    […]
}
```
The [`BootInfo`] struct is still in an early stage, so expect some breakage when updating to future [semver-incompatible] bootloader versions. When we print it, we see that it currently has the three fields `p4_table_addr`, `memory_map`, and `package`:
[`BootInfo`]: https://docs.rs/bootloader/0.3.11/bootloader/bootinfo/struct.BootInfo.html
[semver-incompatible]: https://doc.rust-lang.org/stable/cargo/reference/specifying-dependencies.html#caret-requirements
![QEMU printing a `BootInfo` struct: "boot_info: BootInfo { p4_table_addr: fffffffffffff000, memory_map: […], package: […] }"](qemu-bootinfo-print.png)
The most interesting field for us right now is `p4_table_addr`, as it contains a virtual address that is mapped to the physical frame of the level 4 page table. As we can see, this address is `0xfffffffffffff000`, which is the same as the hardcoded address we used before.
The `memory_map` field will become relevant later in this post. The `package` field is an in-progress feature to bundle additional data with the bootloader. The implementation is not finished, so we can ignore this field for now.
### Accessing the Level 4 Page Table
We can now try to access the level 4 page table:
```rust
// inside our `_start` function
[…]
let level_4_table_pointer = boot_info.p4_table_addr as *const u64;
let entry_0 = unsafe { *level_4_table_pointer };
println!("Entry 0: {:#x}", entry_0);
let entry_1 = unsafe { *level_4_table_pointer.offset(1) };
println!("Entry 1: {:#x}", entry_1);
let entry_2 = unsafe { *level_4_table_pointer.offset(2) };
println!("Entry 2: {:#x}", entry_2);
let entry_511 = unsafe { *level_4_table_pointer.offset(511) };
println!("Entry 511: {:#x}", entry_511);
[…]
```
This code casts the `p4_table_addr` to a pointer to a `u64`. As we saw in the [previous post][page table format], each page table entry is 8 bytes (64 bits), so a `u64` represents exactly one entry. We use unsafe blocks to read from the raw pointers and the [`offset` method] to perform pointer arithmetic. When we run it, we see the following output:
[page table format]: ./second-edition/posts/09-paging-introduction/index.md#page-table-format
[`offset` method]: https://doc.rust-lang.org/std/primitive.pointer.html#method.offset
![QEMU printing "Hello world! Entry 0: 0x2023 Entry 1: 0x6d8063 Entry 2: 0x0 Entry 511: 0x1063 It did not crash!](qemu-print-p4-entries.png)
When we look at the [format of page table entries][page table format], we see that the value `0x2023` of entry 0 means that the entry is `present`, `writable`, was `accessed` by the CPU, and is mapped to frame `0x2000`. Entry 1 is mapped to frame `0x6d8000` and has the same flags as entry 0, with the addition of the `dirty` flag, which indicates that the page was written to.
Entry 2 is not `present`, so this virtual address range is not mapped to any physical addresses. Entry 511 is mapped to frame `0x1000` with the same flags as entry 1. This is the recursive entry, which means that `0x1000` is the physical frame that contains the level 4 page table.
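To make this flag interpretation a bit more tangible, here is a tiny sketch (not part of our kernel) that splits such a raw entry value into its flag bits and the mapped frame address. The `decode_entry` function is a made-up name; the bit positions are the ones from the entry format described in the previous post:

```rust
// Sketch only: decode a raw page table entry like the `0x2023` we saw above.
fn decode_entry(entry: u64) {
    let present = entry & (1 << 0) != 0;
    let writable = entry & (1 << 1) != 0;
    let accessed = entry & (1 << 5) != 0;
    let dirty = entry & (1 << 6) != 0;
    // the physical frame address is stored in bits 12-51
    let frame_addr = entry & 0x000f_ffff_ffff_f000;

    println!(
        "present: {}, writable: {}, accessed: {}, dirty: {}, frame: {:#x}",
        present, writable, accessed, dirty, frame_addr
    );
}
```

For the value `0x2023` of entry 0, this prints `present: true, writable: true, accessed: true, dirty: false, frame: 0x2000`.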
### Page Table Types
While accessing the page tables through raw pointers is possible, it is cumbersome and requires many uses of `unsafe`. As always, we want to avoid that by creating safe abstractions.
- TODO: `x86_64` `PageTable` type
- TODO: directly pass `&PageTable` in `boot_info`?
- TODO: introduce `boot_info` earlier?
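As a rough preview of the direction these notes point in, the `x86_64` crate provides a `PageTable` type that represents a page table as an array of typed entries. The following is only a sketch of how it could replace the raw pointer code above, assuming that `p4_table_addr` can be cast and the entries indexed and printed like this:

```rust
// Sketch only: access the level 4 table through the `PageTable` type of the
// `x86_64` crate instead of raw `u64` pointers.
use x86_64::structures::paging::PageTable;

let level_4_table_ptr = boot_info.p4_table_addr as *const PageTable;
let level_4_table = unsafe { &*level_4_table_ptr };
for &i in [0usize, 1, 2, 511].iter() {
    println!("Entry {}: {:?}", i, level_4_table[i]);
}
```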
### The `RecursivePageTable` Type
TODO:
- Map address 0 to the VGA buffer
- We need free physical frames for creating new page tables -> memory map
## A Physical Memory Map