+++
title = "Advanced Paging"
order = 10
path = "advanced-paging"
date = 0000-01-01
template = "second-edition/page.html"
+++

This post explores techniques to make the physical page table frames accessible to our kernel. Building on them, it implements a function that translates virtual to physical addresses and a function that creates new mappings in the page tables.

This blog is openly developed on [Github]. If you have any problems or questions, please open an issue there. You can also leave comments [at the bottom].

[Github]: https://github.com/phil-opp/blog_os
[at the bottom]: #comments

## Introduction

In the [previous post] we learned about the principles of paging and how the 4-level page tables on the x86_64 architecture work. We also found out that the bootloader already set up a 4-level page table hierarchy for our kernel, since paging is mandatory on x86_64 in 64-bit mode. This means that our kernel already runs on virtual addresses.

[previous post]: ./second-edition/posts/09-paging-introduction/index.md

The problem is that page tables use physical addresses internally, which we can't access directly from our kernel. We experienced that problem already [at the end of the previous post] when we tried to inspect the active page tables. The next section discusses the problem in detail and provides different approaches to a solution.

[at the end of the previous post]: ./second-edition/posts/09-paging-introduction/index.md#try-it-out

## Accessing Page Tables

Accessing the page tables from our kernel is not as easy as it may seem. To understand the problem, let's take a look at the example 4-level page table hierarchy of the previous post again:

![An example 4-level page hierarchy with each page table shown in physical memory](../paging-introduction/x86_64-page-table-translation.svg)

The important thing here is that each page table entry stores the _physical_ address of the next table. This avoids the need to run a translation for these addresses too, which would be bad for performance and could easily cause endless translation loops.

The problem for us is that we can't directly access physical addresses from our kernel, since our kernel also runs on top of virtual addresses. For example, when we access address `4 KiB`, we access the _virtual_ address `4 KiB`, not the _physical_ address `4 KiB` where the level 4 page table lives. When we want to access the physical address `4 KiB`, we can only do so through some virtual address that maps to it.

So in order to access page table frames, we need to map some virtual pages to them. There are different ways to create these mappings that all allow us to access arbitrary page table frames:

- A simple solution is to **identity map all page tables**:

  ![A virtual and a physical address space with various virtual pages mapped to the physical frame with the same address](identity-mapped-page-tables.svg)

  In this example we see various identity-mapped page table frames. This way the physical addresses in the page tables are also valid virtual addresses, so that we can easily access the page tables of all levels starting from the CR3 register.

  However, it clutters the virtual address space and makes it more difficult to find continuous memory regions of larger sizes. For example, imagine that we want to create a virtual memory region of size 1000 KiB in the above graphic, e.g. for [memory-mapping a file]. We can't start the region at `26 KiB` because it would collide with the already mapped page at `1004 KiB`. So we have to look further until we find a large enough unmapped area, for example at `1008 KiB`. This is a similar fragmentation problem as with [segmentation].
  [memory-mapping a file]: https://en.wikipedia.org/wiki/Memory-mapped_file
  [segmentation]: ./second-edition/posts/09-paging-introduction/index.md#fragmentation

  Equally, it makes it much more difficult to create new page tables, because we need to find physical frames whose corresponding pages aren't already in use. For example, let's assume that we reserved the _virtual_ 1000 KiB memory region starting at `1008 KiB` for our memory-mapped file. Now we can't use any frame with a _physical_ address between `1008 KiB` and `2008 KiB` anymore, because we can't identity map it.

- Alternatively, we could **map the page table frames only temporarily** when we need to access them (see the sketch after this list). To be able to create the temporary mappings we only need a single identity-mapped level 1 table:

  ![A virtual and a physical address space with an identity mapped level 1 table, which maps its 0th entry to the level 2 table frame, thereby mapping that frame to the page with address 0](temporarily-mapped-page-tables.svg)

  The level 1 table in this graphic controls the first 2 MiB of the virtual address space. This is because it is reachable by starting at the CR3 register and following the 0th entry in the level 4, level 3, and level 2 page tables. The entry with index `8` maps the virtual page at address `32 KiB` to the physical frame at address `32 KiB`, thereby identity mapping the level 1 table itself. The graphic shows this identity-mapping by the horizontal arrow at `32 KiB`.

  By writing to the identity-mapped level 1 table, our kernel can create up to 511 temporary mappings (512 minus the entry required for the identity mapping). In the above example, the kernel mapped the 0th entry of the level 1 table to the frame with address `24 KiB`. This created a temporary mapping of the virtual page at `0 KiB` to the physical frame of the level 2 page table, indicated by the dashed arrow. Now the kernel can access the level 2 page table by writing to the page starting at `0 KiB`.

  The process for accessing an arbitrary page table frame with temporary mappings would be:

  - Search for a free entry in the identity-mapped level 1 table.
  - Map that entry to the physical frame of the page table that we want to access.
  - Access the target frame through the virtual page that maps to the entry.
  - Set the entry back to unused, thereby removing the temporary mapping again.

  This approach keeps the virtual address space clean, since it reuses the same 512 virtual pages for creating the mappings. The drawback is that it is a bit cumbersome, especially since a new mapping might require modifications of multiple table levels, which means that we would need to repeat the above process multiple times.

- While both of the above approaches work, there is a third technique called **recursive page tables** that combines their advantages: It keeps all page table frames mapped at all times so that no temporary mappings are needed, and also keeps the mapped pages together to avoid fragmentation of the virtual address space. This is the technique that we will use for our implementation, therefore it is described in detail in the following section.
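To make the temporary mapping procedure from the second approach more concrete, here is a minimal sketch. The `IDENTITY_MAPPED_LEVEL_1_TABLE` constant and the `with_temporary_mapping` helper are hypothetical illustrations based on the graphic above (a level 1 table identity-mapped at `32 KiB`), not code that we will actually use:

```rust
use x86_64::PhysAddr;
use x86_64::structures::paging::{PageTable, PageTableFlags as Flags};

/// Hypothetical: the level 1 table is identity-mapped at 32 KiB.
const IDENTITY_MAPPED_LEVEL_1_TABLE: *mut PageTable = 0x8000 as *mut PageTable;

/// Temporarily maps the given page table frame and passes a pointer to the
/// mapped table to the closure `f`.
unsafe fn with_temporary_mapping(frame: PhysAddr, f: impl FnOnce(*mut PageTable)) {
    let level_1_table = &mut *IDENTITY_MAPPED_LEVEL_1_TABLE;
    // search for a free entry in the identity-mapped level 1 table
    let index = (0..512)
        .find(|&i| level_1_table[i].is_unused())
        .expect("no free level 1 entry");
    // map that entry to the physical frame of the page table we want to access
    level_1_table[index].set_addr(frame, Flags::PRESENT | Flags::WRITABLE);
    // access the target frame through the virtual page that maps to the entry;
    // this level 1 table controls the first 2 MiB, so entry i maps the page at i * 4096
    f((index * 4096) as *mut PageTable);
    // set the entry back to unused, thereby removing the temporary mapping
    // (a real implementation would also flush the affected TLB entry)
    level_1_table[index].set_unused();
}
```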
## Recursive Page Tables

The idea behind this approach sounds simple: _Map some entry of the level 4 page table to the frame of the level 4 table itself_. By doing this, we effectively reserve a part of the virtual address space and map all current and future page table frames to that space. Thus, the single entry makes every table of every level accessible through a calculable address.

Let's go through an example to understand how this all works:

![An example 4-level page hierarchy with each page table shown in physical memory. Entry 511 of the level 4 page is mapped to frame 4KiB, the frame of the level 4 table itself.](recursive-page-table.svg)

The only difference from the [example at the beginning of this post] is the additional entry at index `511` in the level 4 table, which is mapped to physical frame `4 KiB`, the frame of the level 4 table itself.

[example at the beginning of this post]: #accessing-page-tables

By letting the CPU follow this entry on a translation, it doesn't reach a level 3 table, but the same level 4 table again. This is similar to a recursive function that calls itself, therefore this table is called a _recursive page table_. The important thing is that the CPU assumes that every entry in the level 4 table points to a level 3 table, so it now treats the level 4 table as a level 3 table. This works because tables of all levels have the exact same layout on x86_64.

By following the recursive entry one or multiple times before we start the actual translation, we can effectively shorten the number of levels that the CPU traverses. For example, if we follow the recursive entry once and then proceed to the level 3 table, the CPU thinks that the level 3 table is a level 2 table. Going further, it treats the level 2 table as a level 1 table, and the level 1 table as the mapped frame. This means that we can now read and write the level 1 page table, because the CPU thinks that it is the mapped frame. The graphic below illustrates the 5 translation steps:

![The above example 4-level page hierarchy with 5 arrows: "Step 0" from CR3 to level 4 table, "Step 1" from level 4 table to level 4 table, "Step 2" from level 4 table to level 3 table, "Step 3" from level 3 table to level 2 table, and "Step 4" from level 2 table to level 1 table.](recursive-page-table-access-level-1.svg)

Similarly, we can follow the recursive entry twice before starting the translation to reduce the number of traversed levels to two:

![The same 4-level page hierarchy with the following 4 arrows: "Step 0" from CR3 to level 4 table, "Steps 1&2" from level 4 table to level 4 table, "Step 3" from level 4 table to level 3 table, and "Step 4" from level 3 table to level 2 table.](recursive-page-table-access-level-2.svg)

Let's go through it step by step: First the CPU follows the recursive entry on the level 4 table and thinks that it reaches a level 3 table. Then it follows the recursive entry again and thinks that it reaches a level 2 table. But in reality, it is still on the level 4 table. When the CPU now follows a different entry, it lands on a level 3 table, but thinks that it is already on a level 1 table. So while the next entry points at a level 2 table, the CPU thinks that it points to the mapped frame, which allows us to read and write the level 2 table.

Accessing the tables of levels 3 and 4 works in the same way. For accessing the level 3 table, we follow the recursive entry three times, tricking the CPU into thinking it is already on a level 1 table. Then we follow another entry and reach a level 3 table, which the CPU treats as a mapped frame. For accessing the level 4 table itself, we just follow the recursive entry four times until the CPU treats the level 4 table itself as the mapped frame (in blue in the graphic below).
![The same 4-level page hierarchy with the following 3 arrows: "Step 0" from CR3 to level 4 table, "Steps 1,2,3" from level 4 table to level 4 table, and "Step 4" from level 4 table to level 3 table. In blue the alternative "Steps 1,2,3,4" arrow from level 4 table to level 4 table.](recursive-page-table-access-level-3.svg)

It might take some time to wrap your head around the concept, but it works quite well in practice.

### Address Calculation

We saw that we can access tables of all levels by following the recursive entry once or multiple times before the actual translation. Since the indexes into the tables of the four levels are derived directly from the virtual address, we need to construct special virtual addresses for this technique. Remember, the page table indexes are derived from the address in the following way:

![Bits 0–12 are the page offset, bits 12–21 the level 1 index, bits 21–30 the level 2 index, bits 30–39 the level 3 index, and bits 39–48 the level 4 index](../paging-introduction/x86_64-table-indices-from-address.svg)

Let's assume that we want to access the level 1 page table that maps a specific page. As we learned above, this means that we have to follow the recursive entry one time before continuing with the level 4, level 3, and level 2 indexes. To do that, we move each block of the address one block to the right and set the original level 4 index to the index of the recursive entry:

![Bits 0–12 are the offset into the level 1 table frame, bits 12–21 the level 2 index, bits 21–30 the level 3 index, bits 30–39 the level 4 index, and bits 39–48 the index of the recursive entry](table-indices-from-address-recursive-level-1.svg)

For accessing the level 2 table of that page, we move each index block two blocks to the right and set both the block of the original level 4 index and the block of the original level 3 index to the index of the recursive entry:

![Bits 0–12 are the offset into the level 2 table frame, bits 12–21 the level 3 index, bits 21–30 the level 4 index, and bits 30–39 and bits 39–48 are the index of the recursive entry](table-indices-from-address-recursive-level-2.svg)

Accessing the level 3 table works by moving each block three blocks to the right and using the recursive index for the original level 4, level 3, and level 2 address blocks:

![Bits 0–12 are the offset into the level 3 table frame, bits 12–21 the level 4 index, and bits 21–30, bits 30–39 and bits 39–48 are the index of the recursive entry](table-indices-from-address-recursive-level-3.svg)

Finally, we can access the level 4 table by moving each block four blocks to the right and using the recursive index for all address blocks except for the offset:

![Bits 0–12 are the offset into the level 4 table frame and bits 12–21, bits 21–30, bits 30–39 and bits 39–48 are the index of the recursive entry](table-indices-from-address-recursive-level-4.svg)

The page table index blocks are 9 bits wide, so moving each block one block to the right means a bitshift by 9 bits: `address >> 9`. To derive the 12-bit offset field from the 9-bit index that is shifted into it, we need to multiply it by 8, the size of a page table entry. Through this operation, we can calculate virtual addresses for the page tables of all four levels.
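As a sketch in code (assuming a recursive entry at index `511`; the helper functions are ours for illustration, not part of our kernel yet), the virtual address of the level 1 table that maps a given address could be computed like this:

```rust
const RECURSIVE_INDEX: u64 = 0o777; // index 511

/// Copies bit 47 into bits 48–63, as required for valid x86_64 addresses.
fn sign_extend(addr: u64) -> u64 {
    (((addr << 16) as i64) >> 16) as u64
}

/// Returns the virtual address at which the level 1 table for `addr` is mapped.
fn level_1_table_addr(addr: u64) -> u64 {
    let level_4_index = (addr >> 39) & 0o777;
    let level_3_index = (addr >> 30) & 0o777;
    let level_2_index = (addr >> 21) & 0o777;
    // move each index block one block to the right and put the
    // recursive index into the level 4 slot
    sign_extend(
        (RECURSIVE_INDEX << 39)
            | (level_4_index << 30)
            | (level_3_index << 21)
            | (level_2_index << 12),
    )
}
```

The level 1 entry for the page then sits at `level_1_table_addr(addr) + 8 * ((addr >> 12) & 0o777)`, since each page table entry is 8 bytes large.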
The table below summarizes the address structure for accessing the different kinds of frames:

Mapped Frame for | Address Structure ([octal])
---------------- | -------------------------------
Page | `0o_SSSSSS_AAA_BBB_CCC_DDD_EEEE`
Level 1 Table | `0o_SSSSSS_RRR_AAA_BBB_CCC_DDDD`
Level 2 Table | `0o_SSSSSS_RRR_RRR_AAA_BBB_CCCC`
Level 3 Table | `0o_SSSSSS_RRR_RRR_RRR_AAA_BBBB`
Level 4 Table | `0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA`

[octal]: https://en.wikipedia.org/wiki/Octal

Here, `AAA` is the level 4 index, `BBB` the level 3 index, `CCC` the level 2 index, and `DDD` the level 1 index of the mapped frame, and `EEEE` the offset into it. `RRR` is the index of the recursive entry. When an index (three digits) is transformed to an offset (four digits), it is multiplied by 8, the size of a page table entry. With this offset, the resulting address directly points to the respective page table entry.

`SSSSSS` are sign extension bits, which means that they are all copies of bit 47. This is a special requirement for valid addresses on the x86_64 architecture. We explained it in the [previous post][sign extension].

[sign extension]: ./second-edition/posts/09-paging-introduction/index.md#paging-on-x86

## Implementation

After all this theory we can finally start our implementation. Conveniently, the bootloader not only created page tables for our kernel, it also created a recursive mapping in the last entry of the level 4 table. The bootloader did this because otherwise there would be a [chicken or egg problem]: We need to access the level 4 table to create a recursive mapping, but we can't access it without some kind of mapping.

[chicken or egg problem]: https://en.wikipedia.org/wiki/Chicken_or_the_egg

We already used this recursive mapping [at the end of the previous post] to access the level 4 table. We did this through the hardcoded address `0xffff_ffff_ffff_f000`. When we convert this address to [octal] and compare it with the above table, we can see that it exactly follows the structure with `RRR` = `0o777`, `AAAA` = 0, and the sign extension bits all set to `1`:

```
structure: 0o_SSSSSS_RRR_RRR_RRR_RRR_AAAA
address:   0o_177777_777_777_777_777_0000
```

With our knowledge about recursive page tables, we can now create virtual addresses to access all active page tables. This allows us to create a translation function in software.
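If you want to check the conversion yourself, a quick standalone snippet (not part of our kernel) confirms that both notations denote the same address:

```rust
fn main() {
    let addr: u64 = 0xffff_ffff_ffff_f000;
    assert_eq!(addr, 0o_177777_777_777_777_777_0000);
    println!("{:o}", addr); // prints 1777777777777777770000
}
```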
### Translating Addresses

As a first step, let's create a function that translates a virtual address to a physical address by walking the page table hierarchy:

```rust
// in src/lib.rs

pub mod memory;
```

```rust
// in src/memory/mod.rs

use x86_64::PhysAddr;
use x86_64::structures::paging::PageTable;

/// Returns the physical address for the given virtual address, or `None` if the
/// virtual address is not mapped.
pub fn translate_addr(addr: usize, level_4_table_addr: usize) -> Option<PhysAddr> {
    // retrieve the page table indices of the address that we want to translate
    let level_4_index = (addr >> 39) & 0o777;
    let level_3_index = (addr >> 30) & 0o777;
    let level_2_index = (addr >> 21) & 0o777;
    let level_1_index = (addr >> 12) & 0o777;
    let page_offset = addr & 0o7777;

    // check that level 4 entry is mapped
    let level_4_table = unsafe { &*(level_4_table_addr as *const PageTable) };
    if level_4_table[level_4_index].addr().is_null() {
        return None;
    }

    // check that level 3 entry is mapped
    let level_3_table_addr = (level_4_table_addr << 9) | (level_4_index << 12);
    let level_3_table = unsafe { &*(level_3_table_addr as *const PageTable) };
    if level_3_table[level_3_index].addr().is_null() {
        return None;
    }

    // check that level 2 entry is mapped
    let level_2_table_addr = (level_3_table_addr << 9) | (level_3_index << 12);
    let level_2_table = unsafe { &*(level_2_table_addr as *const PageTable) };
    if level_2_table[level_2_index].addr().is_null() {
        return None;
    }

    // check that level 1 entry is mapped and retrieve the physical address from it
    let level_1_table_addr = (level_2_table_addr << 9) | (level_2_index << 12);
    let level_1_table = unsafe { &*(level_1_table_addr as *const PageTable) };
    let phys_addr = level_1_table[level_1_index].addr();
    if phys_addr.is_null() {
        return None;
    }

    Some(phys_addr + page_offset)
}
```

First, we calculate the page table indices and the page offset from the address through bitwise operations, as specified in the graphic:

![Bits 0–12 are the page offset, bits 12–21 the level 1 index, bits 21–30 the level 2 index, bits 30–39 the level 3 index, and bits 39–48 the level 4 index](../paging-introduction/x86_64-table-indices-from-address.svg)

Then we transform the `level_4_table_addr` to a `&PageTable` reference, which is an `unsafe` operation since the compiler can't know that the address is valid. We use the indexing operator to look at the entry with `level_4_index`. If that entry is null, there is no level 3 table for this level 4 entry, which means that `addr` is not mapped to any physical memory, so we return `None`.

If the entry is not null, we know that a level 3 table exists. We calculate its virtual address by shifting the level 4 table address 9 bits to the left and setting the address bits 12–21 to the level 4 index (see the section about [address calculation]). We can do that because the recursive index is `0o777`, so the all-ones bits shifted into the sign extension area keep the address validly sign-extended. We then do the same cast and entry check as with the level 4 table.

[address calculation]: #address-calculation

After we have checked the three higher level tables, we can finally read the entry of the level 1 table, which tells us the physical frame that the address is mapped to. As a last step, we add the page offset to that address and return it.

If we knew that the address is mapped, we could directly access the level 1 table without looking at the higher level tables first. But since we don't know this, we have to check whether the level 1 table exists first, otherwise we would cause a page fault for unmapped addresses.
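To see the shift from the function in action, here is a small worked example (assuming the recursive index `0o777`): following entry `0` of the recursive level 4 table address yields the corresponding level 3 table address:

```rust
let level_4_table_addr: usize = 0o_177777_777_777_777_777_0000;
let level_4_index = 0;
let level_3_table_addr = (level_4_table_addr << 9) | (level_4_index << 12);
// the bits shifted out at the top were all ones (part of the recursive
// index), so the result is still a valid sign-extended address
assert_eq!(level_3_table_addr, 0o_177777_777_777_777_000_0000);
```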
#### Try it out

We can use our new translation function to translate some virtual addresses in our `_start` function:

```rust
// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    […] // initialize GDT, IDT, PICS

    use blog_os::memory::translate_addr;

    const LEVEL_4_TABLE_ADDR: usize = 0o_177777_777_777_777_777_0000;

    // the identity-mapped vga buffer page
    println!("0xb8000 -> {:?}", translate_addr(0xb8000, LEVEL_4_TABLE_ADDR));
    // some code page
    println!("0x20010a -> {:?}", translate_addr(0x20010a, LEVEL_4_TABLE_ADDR));
    // some stack page
    println!("0x57ac001ffe48 -> {:?}", translate_addr(0x57ac001ffe48, LEVEL_4_TABLE_ADDR));

    println!("It did not crash!");
    blog_os::hlt_loop();
}
```

When we run it, we see the following output:

![0xb8000 -> 0xb8000, 0x20010a -> 0x40010a, 0x57ac001ffe48 -> 0x27be48](qemu-translate-addr.png)

As expected, the identity-mapped address `0xb8000` translates to the same physical address. The code page and the stack page translate to some arbitrary physical addresses, which depend on how the bootloader created the initial mapping for our kernel.

#### The `RecursivePageTable` Type

The `x86_64` crate provides a [`RecursivePageTable`] type that implements safe abstractions for various page table operations. By using this type, we can reimplement our `translate_addr` function in a much cleaner way:

[`RecursivePageTable`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/struct.RecursivePageTable.html

```rust
// in src/memory/mod.rs

use x86_64::{VirtAddr, PhysAddr};
use x86_64::structures::paging::{Mapper, Page, PageTable, RecursivePageTable};

/// Creates a RecursivePageTable instance from the level 4 address.
// TODO call only once
pub fn init(level_4_table_addr: usize) -> RecursivePageTable<'static> {
    let level_4_table_ptr = level_4_table_addr as *mut PageTable;
    let level_4_table = unsafe { &mut *level_4_table_ptr };
    RecursivePageTable::new(level_4_table).unwrap()
}

/// Returns the physical address for the given virtual address, or
/// `None` if the virtual address is not mapped.
pub fn translate_addr(addr: u64, recursive_page_table: &RecursivePageTable) -> Option<PhysAddr> {
    let addr = VirtAddr::new(addr);
    let page: Page = Page::containing_address(addr);

    // perform the translation
    let frame = recursive_page_table.translate_page(page);
    frame.map(|frame| frame.start_address() + u64::from(addr.page_offset()))
}
```

The `RecursivePageTable` type encapsulates the unsafety of the page table walk completely. We only need a single `unsafe` block in the `init` function to create a `&mut PageTable` from the level 4 page table address. Also, we no longer need to perform any bitwise operations.

Our `_start` function needs to be updated for the new function signature in the following way:

```rust
// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    […] // initialize GDT, IDT, PICS

    use blog_os::memory::{self, translate_addr};

    const LEVEL_4_TABLE_ADDR: usize = 0o_177777_777_777_777_777_0000;
    let recursive_page_table = memory::init(LEVEL_4_TABLE_ADDR);

    // the identity-mapped vga buffer page
    println!("0xb8000 -> {:?}", translate_addr(0xb8000, &recursive_page_table));
    // some code page
    println!("0x20010a -> {:?}", translate_addr(0x20010a, &recursive_page_table));
    // some stack page
    println!("0x57ac001ffe48 -> {:?}", translate_addr(0x57ac001ffe48, &recursive_page_table));

    println!("It did not crash!");
    blog_os::hlt_loop();
}
```

Instead of passing the `LEVEL_4_TABLE_ADDR` to `translate_addr` and accessing the page tables through unsafe raw pointers, we now pass references to a `RecursivePageTable` instance. By doing this, we have a safe abstraction and clear ownership semantics.
This ensures that we can't accidentally modify the page table concurrently, because modifying it requires an exclusive (mutable) borrow of the `RecursivePageTable`.

After reading the page tables and creating a translation function, the next step is to create a new mapping in the page table hierarchy.

### Creating a new Mapping

The difficulty of creating a new mapping depends on the virtual page that we want to map. In the easiest case, the level 1 page table for the page already exists and we just need to write a single entry. In the most difficult case, the page is in a memory region for which no level 3 table exists yet, so that we need to create new level 3, level 2 and level 1 page tables first.

For creating a new page table we need to find an unused physical frame where the page table will be stored. We then initialize the frame to zero and create a mapping for it in the higher level page table. At this point, we can access the new page table through the recursive page table and continue with the next level.

Let's start with the simple case and assume that we don't need to create new page tables. The bootloader loads itself in the first megabyte of the virtual address space, so we know that a valid level 1 table exists for this region. We can choose any unused page in this memory region for our example mapping, for example the page at address `0x1000`. As the target frame we use `0xb8000`, the frame of the VGA text buffer. This way we can easily test whether our mapping worked.

We implement the mapping in a new `create_example_mapping` function in our `memory` module:

```rust
// in src/memory/mod.rs

use x86_64::structures::paging::{FrameAllocator, PhysFrame, Size4KiB};

pub fn create_example_mapping(
    recursive_page_table: &mut RecursivePageTable,
    frame_allocator: &mut impl FrameAllocator<Size4KiB>,
) {
    use x86_64::structures::paging::PageTableFlags as Flags;

    let page: Page = Page::containing_address(VirtAddr::new(0x1000));
    let frame = PhysFrame::containing_address(PhysAddr::new(0xb8000));
    let flags = Flags::PRESENT | Flags::WRITABLE;

    recursive_page_table.map_to(page, frame, flags, frame_allocator)
        .expect("map_to failed").flush();
}
```

The function takes a mutable reference to the `RecursivePageTable` because it needs to modify it. It then uses the [`map_to`] function of the [`Mapper`] trait to map the page at `0x1000` to the physical frame at address `0xb8000`. The `PRESENT` flag is required for all valid entries and the `WRITABLE` flag makes the mapping writable.

[`map_to`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/trait.Mapper.html#tymethod.map_to
[`Mapper`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/trait.Mapper.html

The fourth argument needs to be some structure that implements the [`FrameAllocator`] trait. The `map_to` method needs this argument because it might need unused frames for creating new page tables. Since we know that no new page tables are required for the address `0x1000`, we pass an `EmptyFrameAllocator` type that always returns `None`. The `Size4KiB` parameter in the trait bound is needed because the [`Page`] and [`PhysFrame`] types are [generic] over the [`PageSize`] trait to work with both standard 4 KiB pages and huge 2 MiB/1 GiB pages.
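For illustration, the same generic types also work for huge pages (a standalone sketch; the mapping code in this post only uses 4 KiB pages):

```rust
use x86_64::VirtAddr;
use x86_64::structures::paging::{Page, Size2MiB};

// a 2 MiB page containing the given address; the `Page` type takes care
// of aligning the page start down to a 2 MiB boundary
let huge_page: Page<Size2MiB> = Page::containing_address(VirtAddr::new(0x40_0000));
```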
[`FrameAllocator`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/trait.FrameAllocator.html
[`Page`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/struct.Page.html
[`PhysFrame`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/struct.PhysFrame.html
[generic]: https://doc.rust-lang.org/book/ch10-00-generics.html
[`PageSize`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/trait.PageSize.html

The [`map_to`] function can fail, so it returns a [`Result`]. Since this is just some example code that does not need to be robust, we just use [`expect`] to panic when an error occurs. On success, the function returns a [`MapperFlush`] type that provides an easy way to flush the newly mapped page from the translation lookaside buffer (TLB) with its [`flush`] method. Like `Result`, the type uses the [`#[must_use]`] attribute to emit a warning when we accidentally ignore the return value.

[`Result`]: https://doc.rust-lang.org/core/result/enum.Result.html
[`expect`]: https://doc.rust-lang.org/core/result/enum.Result.html#method.expect
[`MapperFlush`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/struct.MapperFlush.html
[`flush`]: https://docs.rs/x86_64/0.3.5/x86_64/structures/paging/struct.MapperFlush.html#method.flush
[`#[must_use]`]: https://doc.rust-lang.org/std/result/#results-must-be-used

The `EmptyFrameAllocator` that we pass as frame allocator is a struct whose `alloc` method always returns `None`:

```rust
// in src/memory/mod.rs

/// A FrameAllocator that always returns `None`.
pub struct EmptyFrameAllocator;

impl FrameAllocator<Size4KiB> for EmptyFrameAllocator {
    fn alloc(&mut self) -> Option<PhysFrame> {
        None
    }
}
```

We can test the new mapping function in our `main.rs`:

```rust
// in src/main.rs

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start() -> ! {
    […] // initialize GDT, IDT, PICS

    use blog_os::memory::{self, create_example_mapping, EmptyFrameAllocator};

    const LEVEL_4_TABLE_ADDR: usize = 0o_177777_777_777_777_777_0000;
    let mut recursive_page_table = memory::init(LEVEL_4_TABLE_ADDR);

    create_example_mapping(&mut recursive_page_table, &mut EmptyFrameAllocator);
    unsafe { (0x1c00 as *mut u64).write_volatile(0xffff_ffff_ffff_ffff) };

    println!("It did not crash!");
    blog_os::hlt_loop();
}
```

We first create the mapping for the page at `0x1000` by calling our `create_example_mapping` function with a mutable reference to the `RecursivePageTable` instance. This maps the page at `0x1000` to the VGA text buffer, so we should see any write to it on the screen. We don't write to the very beginning of the page, since the top line is directly shifted off the screen by the next `println`. As we learned [in the _“VGA Text Mode”_ post], writes to the VGA buffer should be volatile.

[in the _“VGA Text Mode”_ post]: ./second-edition/posts/03-vga-text-buffer/index.md#volatile

When we run it in QEMU, we see the following output:

![QEMU printing "It did not crash!" with four completely white cells in the middle of the screen](qemu-new-mapping.png)

The white block in the middle of the screen is caused by our write to `0x1c00`, which means that we successfully created a new mapping in the page tables.

This only worked because there was already a level 1 table for mapping the page at `0x1000`. If we try to map a page for which no level 1 table exists yet, the `map_to` function tries to allocate frames from the `EmptyFrameAllocator` in order to create new page tables, which fails.
We can see that happen when we try to map, for example, the page `0xdeadbeaf000` instead of `0x1000`:

```rust
// in src/memory/mod.rs (in create_example_mapping)

let page: Page = Page::containing_address(VirtAddr::new(0xdeadbeaf000));
```

```rust
// in src/main.rs (in _start)

unsafe { (0xdeadbeafc00 as *mut u64).write_volatile(0xffff_ffff_ffff_ffff) };
```

When we run it, a panic with the following error message occurs:

```
panicked at 'map_to failed: FrameAllocationFailed', /…/result.rs:999:5
```

To map pages that don't have a level 1 page table yet, we need to create a proper `FrameAllocator`. But how do we know which frames are unused and how much physical memory is available?

### Boot Information

The amount of physical memory and the physical memory regions reserved by devices like the VGA hardware vary between different machines. Only the BIOS or UEFI firmware knows exactly which memory regions can be used by the operating system and which regions are reserved. Both firmware standards provide functions to retrieve the memory map, but they can only be called very early in the boot process. For this reason, the bootloader already queries this and other information from the firmware.

To communicate this information to our kernel, the bootloader passes a reference to a boot information structure as an argument when calling our `_start` function. Right now we don't have this argument declared in our function, so we just ignore it. Let's add it:

```rust
// in src/main.rs

use bootloader::bootinfo::BootInfo;

#[cfg(not(test))]
#[no_mangle]
pub extern "C" fn _start(boot_info: &'static BootInfo) -> ! {
    […]
}
```

The [`BootInfo`] struct is still in an early stage, so expect some breakage when updating to future [semver-incompatible] bootloader versions. It currently has the three fields `p4_table_addr`, `memory_map`, and `package`:

[`BootInfo`]: https://docs.rs/bootloader/0.3.11/bootloader/bootinfo/struct.BootInfo.html
[semver-incompatible]: https://doc.rust-lang.org/stable/cargo/reference/specifying-dependencies.html#caret-requirements

- The `p4_table_addr` field contains the recursive virtual address of the level 4 page table. By using this field we can avoid hardcoding the address `0o_177777_777_777_777_777_0000`.
- The `memory_map` field is most interesting to us, since it contains a list of all memory regions and their type (i.e. unused, reserved, or other).
- The `package` field is an in-progress feature to bundle additional data with the bootloader. The implementation is not finished, so we can ignore this field for now.

Before we use the `memory_map` field to create a proper `FrameAllocator`, we want to ensure that we can't use a `boot_info` argument of the wrong type.

#### The `entry_point` Macro

Since our `_start` function is called externally from the bootloader, no checking of our function signature occurs. This means that we could let it take arbitrary arguments without any compilation errors, but it would fail or cause undefined behavior at runtime.

To make sure that the entry point function always has the correct signature that the bootloader expects, the `bootloader` crate provides an [`entry_point`] macro that provides a type-checked way to define a Rust function as the entry point.

[`entry_point`]: https://docs.rs/bootloader/0.3.12/bootloader/macro.entry_point.html
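Conceptually, `entry_point!(kernel_main)` generates something like the following for us (a simplified sketch, not the macro's literal expansion):

```rust
// sketch of what `entry_point!(kernel_main)` expands to
#[no_mangle]
pub extern "C" fn _start(boot_info: &'static BootInfo) -> ! {
    // this coercion to a function pointer only compiles if `kernel_main`
    // has the expected signature `fn(&'static BootInfo) -> !`
    let f: fn(&'static BootInfo) -> ! = kernel_main;
    f(boot_info)
}
```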
Let's rewrite our entry point function to use this macro:

```rust
// in src/main.rs

use bootloader::entry_point;

entry_point!(kernel_main);

#[cfg(not(test))]
fn kernel_main(boot_info: &'static BootInfo) -> ! {
    […] // initialize GDT, IDT, PICS

    let mut recursive_page_table = memory::init(boot_info.p4_table_addr as usize);

    […] // create and test example mapping

    println!("It did not crash!");
    blog_os::hlt_loop();
}
```

We no longer need to use `extern "C"` or `no_mangle` for our entry point, as the macro defines the real lower level `_start` entry point for us. The `kernel_main` function is now a completely normal Rust function, so we can choose an arbitrary name for it. The important thing is that it is type-checked now, so that a compilation error occurs when we try to modify the function signature in any way, for example by adding an argument or changing the argument type.

Note that we now pass `boot_info.p4_table_addr` instead of a hardcoded address to `memory::init`. Thus our code continues to work even if a future version of the bootloader chooses a different entry of the level 4 page table for the recursive mapping.

### Allocating Frames

Now that we have access to the memory map through the boot information, we can create a proper frame allocator on top of it. We start with a generic skeleton:

```rust
// in src/memory/mod.rs

pub struct BootInfoFrameAllocator<I>
where
    I: Iterator<Item = PhysFrame>,
{
    frames: I,
}

impl<I> FrameAllocator<Size4KiB> for BootInfoFrameAllocator<I>
where
    I: Iterator<Item = PhysFrame>,
{
    fn alloc(&mut self) -> Option<PhysFrame> {
        self.frames.next()
    }
}
```

The `frames` field can be initialized with an arbitrary [`Iterator`] of frames. This allows us to just delegate `alloc` calls to the [`Iterator::next`] method.

[`Iterator`]: https://doc.rust-lang.org/core/iter/trait.Iterator.html
[`Iterator::next`]: https://doc.rust-lang.org/core/iter/trait.Iterator.html#tymethod.next

The initialization of the `BootInfoFrameAllocator` happens in a new `init_frame_allocator` function:

```rust
// in src/memory/mod.rs

use bootloader::bootinfo::{MemoryMap, MemoryRegionType};

/// Creates a FrameAllocator from the passed memory map.
pub fn init_frame_allocator(
    memory_map: &'static MemoryMap,
) -> BootInfoFrameAllocator<impl Iterator<Item = PhysFrame>> {
    // get usable regions from memory map
    let regions = memory_map
        .iter()
        .filter(|r| r.region_type == MemoryRegionType::Usable);
    // map each region to its address range
    let addr_ranges = regions.map(|r| r.range.start_addr()..r.range.end_addr());
    // transform to an iterator of frame start addresses
    let frame_addresses = addr_ranges.flat_map(|r| r.into_iter().step_by(4096));
    // create `PhysFrame` types from the start addresses
    let frames = frame_addresses.map(|addr| {
        PhysFrame::containing_address(PhysAddr::new(addr))
    });

    BootInfoFrameAllocator { frames }
}
```

This function uses iterator combinator methods to transform the initial `MemoryMap` into an iterator of usable physical frames:

- First, we call the `iter` method to convert the memory map to an iterator of [`MemoryRegion`]s. Then we use the [`filter`] method to skip any reserved or otherwise unavailable regions. The bootloader updates the memory map for all the mappings it creates, so frames that are used by our kernel (code, data or stack) or to store the boot information are already marked as `InUse` or similar. Thus we can be sure that `Usable` frames are not used somewhere else.
- In the second step, we use the [`map`] combinator and Rust's [range syntax] to transform our iterator of memory regions to an iterator of address ranges.
- The third step is the most complicated: We convert each range to an iterator through the `into_iter` method and then choose every 4096th address using [`step_by`]. 4096 bytes (= 4 KiB) is the page size, so by doing this we get the start address of each frame. The bootloader page-aligns all usable memory areas, so that we don't need any alignment or rounding code here. By using [`flat_map`] instead of [`map`], we get an `Iterator<Item = u64>` instead of an iterator of iterators.
- In the final step, we convert the start addresses to `PhysFrame` types to construct the desired `Iterator<Item = PhysFrame>`. We then use this iterator to create and return a new `BootInfoFrameAllocator`.
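To convince ourselves that the allocator yields sensible frames, we can temporarily print the first few allocations (a hypothetical debugging snippet, not part of the final code):

```rust
// e.g. in kernel_main, for debugging only
use x86_64::structures::paging::FrameAllocator;

let mut frame_allocator = memory::init_frame_allocator(&boot_info.memory_map);
for _ in 0..3 {
    // each call returns the next usable frame from the memory map
    println!("{:?}", frame_allocator.alloc());
}
```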
[`MemoryRegion`]: https://docs.rs/bootloader/0.3.12/bootloader/bootinfo/struct.MemoryRegion.html
[`filter`]: https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.filter
[`map`]: https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.map
[range syntax]: https://doc.rust-lang.org/core/ops/struct.Range.html
[`step_by`]: https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.step_by
[`flat_map`]: https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.flat_map

We can now modify our `kernel_main` function to pass a `BootInfoFrameAllocator` instance instead of an `EmptyFrameAllocator`:

```rust
// in src/main.rs

#[cfg(not(test))]
fn kernel_main(boot_info: &'static BootInfo) -> ! {
    […] // initialize GDT, IDT, PICS

    let mut recursive_page_table = memory::init(boot_info.p4_table_addr as usize);
    let mut frame_allocator = memory::init_frame_allocator(&boot_info.memory_map);

    blog_os::memory::create_example_mapping(&mut recursive_page_table, &mut frame_allocator);
    unsafe { (0xdeadbeafc00 as *mut u64).write_volatile(0xffff_ffff_ffff_ffff) };

    println!("It did not crash!");
    blog_os::hlt_loop();
}
```

Now the mapping succeeds and we see the white block on the screen again:

![]() TODO

The `map_to` method was now able to create the missing page tables using frames allocated by our `BootInfoFrameAllocator`. You can insert a `println` message into the `BootInfoFrameAllocator::alloc` method to see how it's called.

We're now able to map arbitrary pages and to allocate new physical frames when we need them.

# TODO

## Allocating Stacks

## Summary

## What's next?

---

TODO spellcheck
TODO update post date