In this post we explore double faults in detail. We also set up an _Interrupt Stack Table_ to catch double faults on a separate kernel stack. This way, we can completely prevent triple faults, even on kernel stack overflow.
<!--more--><aside id="toc"></aside>
As always, the complete source code is available on [GitHub]. Please file [issues] for any problems, questions, or improvement suggestions. There is also a [gitter chat] and a [comment section] at the end of this page.
In simplified terms, a double fault is a special exception that occurs when the CPU fails to invoke an exception handler. For example, it occurs when a page fault is triggered but there is no page fault handler registered in the [Interrupt Descriptor Table][IDT] (IDT). So it's kind of similar to catch-all blocks in programming languages with exceptions, e.g. `catch(...)` in C++ or `catch(Exception e)` in Java or C#.
A double fault behaves like a normal exception. It has the vector number `8` and we can define a normal handler function for it in the IDT. It is really important to provide a double fault handler, because if a double fault is unhandled a fatal _triple fault_ occurs. Triple faults can't be caught and most hardware reacts with a system reset.
### Triggering a Double Fault
Let's provoke a double fault by triggering an exception for which we didn't define a handler function:
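Something like the following could work (a sketch; it assumes the `rust_main` entry point and the argument-less `interrupts::init` from the previous posts, and that the `int!` macro of the `x86` crate is imported):

```rust
// in src/lib.rs (sketch)

#[no_mangle]
pub extern "C" fn rust_main(multiboot_information_address: usize) {
    // [...] memory setup omitted

    // initialize our IDT
    interrupts::init();

    // trigger an exception for which no handler is registered
    unsafe { int!(1) };

    println!("It did not crash!");
    loop {}
}
```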
We use the [int! macro] of the [x86 crate] to trigger the exception with vector number `1`, which is the [debug exception]. The debug exception occurs for example when a breakpoint defined in the [debug registers] is hit. Like the [breakpoint exception], it is mainly used for [implementing debuggers].
We haven't registered a handler function for the debug exception in our [IDT], so the `int!(1)` line should cause a double fault in the CPU.
When we start our kernel now, we see that it enters an endless boot loop:

The reason for the boot loop is the following:
1. The CPU executes the [int 1] instruction, which causes a software-invoked `Debug` exception.
2. The CPU looks at the corresponding entry in the IDT and sees that the present bit isn't set. Thus, it can't call the debug exception handler and a double fault occurs.
3. The CPU looks at the IDT entry of the double fault handler, but this entry is also non-present. Thus, a _triple_ fault occurs.
4. A triple fault is fatal. QEMU reacts to it like most real hardware and issues a system reset.
So in order to prevent this triple fault, we need to either provide a handler function for debug exceptions or a double fault handler. We will do the latter, since we want to avoid triple faults in all cases.
### A Double Fault Handler
A double fault is a normal exception with an error code, so we can use our `handler_with_error_code` macro to create a wrapper function:
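A sketch of how this could look, assuming the `Idt` type, the `ExceptionStackFrame` struct, and the `handler_with_error_code!`/`set_handler` API from the previous posts:

```rust
// in src/interrupts/mod.rs (sketch)

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();
        // [...] the handlers registered in the previous posts

        idt.set_handler(8, handler_with_error_code!(double_fault_handler)); // new

        idt
    };
}

extern "C" fn double_fault_handler(stack_frame: &ExceptionStackFrame,
                                   _error_code: u64)
{
    println!("\nEXCEPTION: DOUBLE FAULT\n{:#?}", stack_frame);
    loop {}
}
```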
Our handler prints a short error message and dumps the exception stack frame. The error code of the double fault exception is always zero, so there's no reason to print it.
When we start our kernel now, we should see that the double fault handler is invoked:

It worked! Here is what happens this time:
1. The CPU executes the `int 1` instruction, which causes a software-invoked `Debug` exception.
2. Like before, the CPU looks at the corresponding entry in the IDT and sees that the present bit isn't set. Thus, it can't call the debug exception handler and a double fault occurs.
3. The CPU jumps to the – now present – double fault handler.
The triple fault (and the boot-loop) no longer occurs, since the CPU can now call the double fault handler.
That was quite straightforward! So why do we need a whole post for this topic? Well, we're now able to catch _most_ double faults, but there are some cases where our current approach doesn't suffice.
## Causes of Double Faults
Before we look at the special cases, we need to know the exact causes of double faults. Above, we used a pretty vague definition:
> A double fault is a special exception that occurs when the CPU fails to invoke an exception handler.
What does _“fails to invoke”_ mean exactly? The handler is not present? The handler is [swapped out]? And what happens if a handler causes exceptions itself?
Fortunately, the AMD64 manual ([PDF][AMD64 manual]) has an exact definition (in Section 8.2.9). According to it, a “double fault exception _can_ occur when a second exception occurs during the handling of a prior (first) exception handler”. The _“can”_ is important: Only very specific combinations of exceptions lead to a double fault. These combinations are:
First Exception | Second Exception
----------------|-----------------
[Divide-by-zero],<br>[Invalid TSS],<br>[Segment Not Present],<br>[Stack-Segment Fault],<br>[General Protection Fault] | [Invalid TSS],<br>[Segment Not Present],<br>[Stack-Segment Fault],<br>[General Protection Fault]
[Page Fault] | [Page Fault],<br>[Invalid TSS],<br>[Segment Not Present],<br>[Stack-Segment Fault],<br>[General Protection Fault]
So for example a divide-by-zero fault followed by a page fault is fine (the page fault handler is invoked), but a divide-by-zero fault followed by a general-protection fault leads to a double fault.
With the help of this table, we can answer the first three of the above questions:
1. If a divide-by-zero exception occurs and the corresponding handler function is swapped out, a _page fault_ occurs and the _page fault handler_ is invoked.
2. If a page fault occurs and the page fault handler is swapped out, a _double fault_ occurs and the _double fault handler_ is invoked.
3. If a divide-by-zero handler causes a breakpoint exception, the CPU tries to invoke the breakpoint handler. If the breakpoint handler is swapped out, a _page fault_ occurs and the _page fault handler_ is invoked.
In fact, even the case of a non-present handler follows this scheme: A non-present handler causes a _segment-not-present_ exception. We didn't define a segment-not-present handler, so another segment-not-present exception occurs. According to the table, this leads to a double fault.
### Kernel Stack Overflow
Let's look at the fourth question:
> What happens if our kernel overflows its stack and the [guard page] is hit?
When our kernel overflows its stack and hits the guard page, a _page fault_ occurs. The CPU looks up the page fault handler in the IDT and tries to push the [exception stack frame] onto the stack. However, our current stack pointer still points to the non-present guard page. Thus, a second page fault occurs, which causes a double fault (according to the above table).
So the CPU tries to call our _double fault handler_ now. However, on a double fault the CPU tries to push the exception stack frame, too. Our stack pointer still points to the guard page, so a _third_ page fault occurs, which causes a _triple fault_ and a system reboot. So our current double fault handler can't avoid a triple fault in this case.
Let's try it ourselves! We can easily provoke a kernel stack overflow by calling a function that recurses endlessly:
```rust
// in rust_main in src/lib.rs

fn stack_overflow() {
    stack_overflow(); // for each recursion, the return address is pushed
}

// trigger a stack overflow
stack_overflow();

println!("It did not crash!");
loop {}
```
When we try this code in QEMU, we see that the system enters a boot-loop again.
So how can we avoid this problem? We can't omit the pushing of the exception stack frame, since the CPU itself does it. So we need to ensure somehow that the stack is always valid when a double fault exception occurs. Fortunately, the x86_64 architecture has a solution to this problem.
## Switching Stacks
The x86_64 architecture is able to switch to a predefined, known-good stack when an exception occurs. This switch happens at hardware level, so it can be performed before the CPU pushes the exception stack frame.
This switching mechanism is implemented as an _Interrupt Stack Table_ (IST). The IST is a table of 7 pointers to known-good stacks. In Rust-like pseudo code:
```rust
struct InterruptStackTable {
    stack_pointers: [Option<StackPointer>; 7],
}
```
For each exception handler, we can choose a stack from the IST through the `options` field in the corresponding [IDT entry]. For example, we could use the first stack in the IST for our double fault handler. Then the CPU would automatically switch to this stack whenever a double fault occurs. This switch would happen before anything is pushed, so it would prevent the triple fault.
In order to fill an Interrupt Stack Table later, we need a way to allocate new stacks. Therefore we extend our `memory` module with a new `stack_allocator` submodule:
```rust
// in src/memory/mod.rs
mod stack_allocator;
```
First, we create a new `StackAllocator` struct and a constructor function:
```rust
// in src/memory/stack_allocator.rs
use memory::paging::PageIter;

pub struct StackAllocator {
    range: PageIter,
}

impl StackAllocator {
    pub fn new(page_range: PageIter) -> StackAllocator {
        StackAllocator { range: page_range }
    }
}
```
We create a simple `StackAllocator` that allocates stacks from a given range of pages (`PageIter` is an iterator over a range of pages; we introduced it [in the kernel heap post]).
[in the kernel heap post]: {{% relref "08-kernel-heap.md#mapping-the-heap" %}}
We add an `alloc_stack` method that allocates a new stack:
```rust
// in src/memory/stack_allocator.rs
use memory::paging::{self, Page, ActivePageTable};
use memory::{PAGE_SIZE, FrameAllocator};

impl StackAllocator {
    pub fn alloc_stack<FA: FrameAllocator>(&mut self,
                                           active_table: &mut ActivePageTable,
                                           frame_allocator: &mut FA,
                                           size_in_pages: usize)
                                           -> Option<Stack> {
        if size_in_pages == 0 {
            return None; /* a zero sized stack makes no sense */
        }

        // clone the range, since we only want to change it on success
        let mut range = self.range.clone();

        // try to allocate the stack pages and a guard page
        let guard_page = range.next();
        let stack_start = range.next();
        let stack_end = if size_in_pages == 1 {
            stack_start
        } else {
            // choose the (size_in_pages-2)th element, since index
            // starts at 0 and we already allocated the start page
            range.nth(size_in_pages - 2)
        };

        match (guard_page, stack_start, stack_end) {
            (Some(_), Some(start), Some(end)) => {
                // success! write back the updated range
                self.range = range;
                // map the stack pages to physical frames
                for page in Page::range_inclusive(start, end) {
                    active_table.map(page, paging::WRITABLE, frame_allocator);
                }
                // create a new stack
                let top_of_stack = end.start_address() + PAGE_SIZE;
                Some(Stack::new(top_of_stack, start.start_address()))
            }
            _ => None, /* not enough pages */
        }
    }
}
```
The method takes mutable references to the [ActivePageTable] and a [FrameAllocator], since it needs to map the new virtual stack pages to physical frames. We define that the stack size is a multiple of the page size.
Instead of operating directly on `self.range`, we [clone] it and only write it back on success. This way, subsequent stack allocations can still succeed if there are pages left (e.g., a call with `size_in_pages = 3` can still succeed after a failed call with `size_in_pages = 100`).
In order to be able to clone `PageIter`, we add a `#[derive(Clone)]` to its definition in `src/memory/paging/mod.rs`. We also need to make the `start_address` method of the `Page` type public (in the same file).
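For reference, a minimal sketch of these two changes; the field names of `PageIter` and the body of `start_address` follow the earlier paging and heap posts and should be treated as assumptions:

```rust
// in src/memory/paging/mod.rs (sketch)

#[derive(Clone)] // new: allow cloning the page range
pub struct PageIter {
    start: Page,
    end: Page,
}

impl Page {
    pub fn start_address(&self) -> usize { // now public
        self.number * PAGE_SIZE
    }
}
```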
The actual allocation is straightforward: First, we choose the next page as [guard page]. Then we choose the next `size_in_pages` pages as stack pages using [Iterator::nth]. If all three variables are `Some`, the allocation succeeded and we map the stack pages to physical frames using [ActivePageTable::map]. The guard page remains unmapped.
Finally, we create and return a new `Stack`, which we define as follows:
```rust
// in src/memory/stack_allocator.rs
#[derive(Debug)]
pub struct Stack {
    top: usize,
    bottom: usize,
}

impl Stack {
    fn new(top: usize, bottom: usize) -> Stack {
        assert!(top > bottom);
        Stack {
            top: top,
            bottom: bottom,
        }
    }

    pub fn top(&self) -> usize {
        self.top
    }

    pub fn bottom(&self) -> usize {
        self.bottom
    }
}
```
The `Stack` struct describes a stack through its top and bottom addresses.
#### The Memory Controller
Now we're able to allocate a new double fault stack. However, we add one more level of abstraction to make things easier. For that we add a new `MemoryController` type to our `memory` module:
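A sketch of how this type might look; it assumes the `ActivePageTable` and `AreaFrameAllocator` types from the memory management posts (the exact frame allocator type is an assumption):

```rust
// in src/memory/mod.rs (sketch)

pub use self::stack_allocator::Stack;

pub struct MemoryController {
    active_table: paging::ActivePageTable,
    frame_allocator: AreaFrameAllocator,
    stack_allocator: stack_allocator::StackAllocator,
}

impl MemoryController {
    /// Allocates a new stack of `size_in_pages` pages (plus a guard page).
    pub fn alloc_stack(&mut self, size_in_pages: usize) -> Option<Stack> {
        // destructure self to borrow all three fields mutably at once
        let &mut MemoryController { ref mut active_table,
                                    ref mut frame_allocator,
                                    ref mut stack_allocator } = self;
        stack_allocator.alloc_stack(active_table, frame_allocator,
                                    size_in_pages)
    }
}
```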
The `MemoryController` struct holds the three types that are required for `alloc_stack` and provides a simpler interface (only one argument). The `alloc_stack` wrapper just takes the three types as `&mut` through [destructuring] and forwards them to the `stack_allocator`. The [ref mut]s are needed to take the inner fields by mutable reference. Note that we're re-exporting the `Stack` type since it is returned by `alloc_stack`.
We then change our `interrupts::init` function to take a `&mut MemoryController` and use it to allocate the double fault stack:

```rust
// in src/interrupts/mod.rs

use memory::MemoryController;

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    IDT.load();
}
```
We allocate a 4096-byte stack (one page) for our double fault handler. Now we just need some way to tell the CPU that it should use this stack for handling double faults.
### The IST and TSS
The Interrupt Stack Table (IST) is part of an old legacy structure called _[Task State Segment]_ \(TSS). The TSS used to hold various information (e.g. processor register state) about a task in 32-bit mode and was for example used for [hardware context switching]. However, hardware context switching is no longer supported in 64-bit mode and the format of the TSS changed completely.
[Task State Segment]: https://en.wikipedia.org/wiki/Task_state_segment
On x86_64, the TSS no longer holds any task specific information at all. Instead, it holds two stack tables (the IST is one of them). The only common field between the 32-bit and 64-bit TSS is the pointer to the [I/O port permissions bitmap].
[I/O port permissions bitmap]: https://en.wikipedia.org/wiki/Task_state_segment#I.2FO_port_permissions
The _Privilege Stack Table_ is used by the CPU when the privilege level changes. For example, if an exception occurs while the CPU is in user mode (privilege level 3), the CPU normally switches to kernel mode (privilege level 0) before invoking the exception handler. In that case, the CPU would switch to the 0th stack in the Privilege Stack Table (since 0 is the target privilege level). We don't have any user mode programs yet, so we ignore this table for now.
#### Creating a TSS
Let's create a new TSS that contains our double fault stack in its interrupt stack table. For that we need a TSS struct. Fortunately, the `x86` crate already contains a [`TaskStateSegment` struct] that we can use:
```rust
// in src/interrupts/mod.rs

// (TaskStateSegment is imported from the x86 crate)

const DOUBLE_FAULT_IST_INDEX: usize = 0;

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let mut tss = TaskStateSegment::new();
    tss.ist[DOUBLE_FAULT_IST_INDEX] = double_fault_stack.top() as u64;

    IDT.load();
}
```
We define that the 0th IST entry is the double fault stack (any other IST index would work too). We create a new TSS through the `TaskStateSegment::new` function and load the top address (stacks grow downwards) of the double fault stack into the 0th entry.
#### Loading the TSS
Now that we created a new TSS, we need a way to tell the CPU that it should use it. Unfortunately, this is a bit cumbersome, since the TSS is a Task State _Segment_ (for historical reasons). So instead of loading the table directly, we need to add a new segment descriptor to the [Global Descriptor Table] (GDT). Then we can load our TSS by invoking the [`ltr` instruction] with the respective GDT index.
The Global Descriptor Table (GDT) is a relic that was used for [memory segmentation] before paging became the de facto standard. It is still needed in 64-bit mode for various things such as kernel/user mode configuration or TSS loading.
We already created a GDT [when switching to long mode]. Back then, we used assembly to create valid code and data segment descriptors, which were required to enter 64-bit mode. We could just edit that assembly file and add an additional TSS descriptor. However, we now have the expressiveness of Rust, so let's do it in Rust instead.
[when switching to long mode]: {{% relref "02-entering-longmode.md#the-global-descriptor-table" %}}
We start by creating a new `interrupts::gdt` submodule:
```rust
// in src/interrupts/mod.rs
mod gdt;
```
```rust
// in src/interrupts/gdt.rs

pub struct Gdt {
    table: [u64; 8],
    next_free: usize,
}

impl Gdt {
    pub fn new() -> Gdt {
        Gdt {
            table: [0; 8],
            next_free: 1,
        }
    }
}
```
We create a simple `Gdt` struct with two fields. The `table` field contains the actual GDT modeled as a `[u64; 8]`. Theoretically, a GDT can have up to 8192 entries, but this doesn't make much sense in 64-bit mode (since there is no real segmentation support). Eight entries should be more than enough for our system.
The `next_free` field stores the index of the next free entry. We initialize it with `1` since the 0th entry always needs to be 0 in a valid GDT.
#### User and System Segments
There are two types of GDT entries in long mode: user and system segment descriptors. Descriptors for code and data segments are user segment descriptors. They contain no addresses since segments always span the complete address space on x86_64 (real segmentation is no longer supported). Thus, user segment descriptors only contain a few flags (e.g. present or user mode) and fit into a single `u64` entry.
System descriptors such as TSS descriptors are different. They often contain a base address and a limit (e.g. TSS start and length) and thus need more than 64 bits. Therefore, system segments are 128 bits. They are stored as two consecutive entries in the GDT.
Consequently, we model a `Descriptor` as an `enum`:
```rust
// in src/interrupts/gdt.rs
pub enum Descriptor {
    UserSegment(u64),
    SystemSegment(u64, u64),
}
```
The flag bits are common between all descriptor types, so we create a general `DescriptorFlags` type (using the [bitflags] macro):
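A sketch of how this could look (assuming the bitflags 1.x syntax), together with a constructor for the kernel code segment that we use later; the bit positions follow the standard x86_64 segment descriptor layout:

```rust
// in src/interrupts/gdt.rs (sketch)

bitflags! {
    struct DescriptorFlags: u64 {
        const EXECUTABLE   = 1 << 43; // code segment
        const USER_SEGMENT = 1 << 44; // descriptor type: user (1) vs. system (0)
        const PRESENT      = 1 << 47; // segment is present
        const LONG_MODE    = 1 << 53; // 64-bit code segment
    }
}

impl Descriptor {
    pub fn kernel_code_segment() -> Descriptor {
        let flags = DescriptorFlags::USER_SEGMENT | DescriptorFlags::PRESENT
            | DescriptorFlags::EXECUTABLE | DescriptorFlags::LONG_MODE;
        Descriptor::UserSegment(flags.bits())
    }
}
```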
We set the `USER_SEGMENT` bit to indicate a 64-bit user segment descriptor (otherwise the CPU expects a 128-bit system segment descriptor). The `PRESENT`, `EXECUTABLE`, and `LONG_MODE` bits are also needed for a 64-bit mode code segment.
The data segment registers `ds`, `ss`, and `es` are completely ignored in 64-bit mode, so we don't need any data segment descriptors in our GDT.
#### TSS Segments
A TSS descriptor is a system segment descriptor whose 128 bits encode the TSS's base address and limit together with a type field and a present bit.
We only need the limit, base, type, and present fields for our TSS descriptor. For example, we don't need the upper limit bits (`limit 16-19`) since a TSS has a fixed size that is smaller than `2^16`.
Let's add a function to our descriptor that creates a TSS descriptor for a given TSS:
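A sketch of such a function, using plain shifts to assemble the descriptor according to the x86_64 system-descriptor layout (the exact bit-manipulation style is a choice, not fixed by the post):

```rust
// in src/interrupts/gdt.rs (sketch)

impl Descriptor {
    pub fn tss_segment(tss: &'static TaskStateSegment) -> Descriptor {
        use core::mem::size_of;

        let ptr = tss as *const _ as u64;

        // limit: size of the TSS minus 1 (the limit field is inclusive)
        let mut low = (size_of::<TaskStateSegment>() - 1) as u64 & 0xffff;
        // base address bits 0..24 go into bits 16..40
        low |= (ptr & 0xff_ffff) << 16;
        // base address bits 24..32 go into bits 56..64
        low |= ((ptr >> 24) & 0xff) << 56;
        // type: 0b1001 = available 64-bit TSS
        low |= 0b1001 << 40;
        // present bit
        low |= 1 << 47;

        // the second u64 only holds base address bits 32..64
        let high = ptr >> 32;

        Descriptor::SystemSegment(low, high)
    }
}
```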
For a user segment we just push the `u64` and remember the index. For a system segment, we push the low and high `u64` and use the index of the low value. We then use this index to return a new [SegmentSelector].
The method just writes to the `next_free` entry and returns the corresponding index. If there is no free entry left, we panic since this likely indicates a programming error (we should never need to create more than two or three GDT entries for our kernel).
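A sketch of the two methods described above; it assumes a `SegmentSelector::new` constructor that takes the index and the requested privilege level (the exact signature depends on the `x86` crate version):

```rust
// in src/interrupts/gdt.rs (sketch)

impl Gdt {
    pub fn add_entry(&mut self, entry: Descriptor) -> SegmentSelector {
        let index = match entry {
            Descriptor::UserSegment(value) => self.push(value),
            Descriptor::SystemSegment(value_low, value_high) => {
                let index = self.push(value_low);
                self.push(value_high);
                index
            }
        };
        // assumed constructor; adjust to the x86 crate version in use
        SegmentSelector::new(index as u16, PrivilegeLevel::Ring0)
    }

    fn push(&mut self, value: u64) -> usize {
        if self.next_free < self.table.len() {
            let index = self.next_free;
            self.table[index] = value;
            self.next_free += 1;
            index
        } else {
            panic!("GDT full");
        }
    }
}
```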
We use the [`DescriptorTablePointer` struct] and the [`lgdt` function] provided by the `x86` crate to load our GDT. Again, we require a `'static` reference since the GDT possibly needs to live for the rest of the run time.
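A sketch of the `load` method; the module path and field types of `DescriptorTablePointer` differ between `x86` crate versions, so treat the details as assumptions:

```rust
// in src/interrupts/gdt.rs (sketch)

impl Gdt {
    pub fn load(&'static self) {
        use core::mem::size_of;
        // (DescriptorTablePointer and lgdt are imported from the x86 crate)

        let ptr = DescriptorTablePointer {
            base: self.table.as_ptr() as u64,
            limit: (self.table.len() * size_of::<u64>() - 1) as u16,
        };

        unsafe { lgdt(&ptr) };
    }
}
```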
We now have a double fault stack and are able to create and load a TSS (which contains an IST). So let's put everything together to catch kernel stack overflows.
We already created a new TSS in our `interrupts::init` function. Now we can load this TSS by creating a new GDT:
```rust
// in src/interrupts/mod.rs

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let mut tss = TaskStateSegment::new();
    tss.ist[DOUBLE_FAULT_IST_INDEX] = double_fault_stack.top() as u64;

    let mut gdt = gdt::Gdt::new();
    let code_selector = gdt.add_entry(gdt::Descriptor::kernel_code_segment());
    let tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
    gdt.load();

    IDT.load();
}
```
However, when we try to compile it, the following errors occur:
```
error: `tss` does not live long enough
--> src/interrupts/mod.rs:118:68
|
118 | let tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
    |                                                                    ^^^ does not live long enough
...
122 | }
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
error: `gdt` does not live long enough
--> src/interrupts/mod.rs:119:5
|
119 | gdt.load();
| ^^^ does not live long enough
...
122 | }
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
```
The problem is that we require that the TSS and GDT are valid for the rest of the run time (i.e. for the `'static` lifetime). But our created `tss` and `gdt` live on the stack and are thus destroyed at the end of the `init` function. So how do we fix this problem?
We could allocate our TSS and GDT on the heap using `Box` and use [into_raw] and a bit of `unsafe` to convert it to `&'static` references ([RFC 1233] was closed unfortunately).
Alternatively, we could store them in a `static` somehow. The [`lazy_static` macro] doesn't work here, since we need access to the `MemoryController` for initialization. However, we can use its fundamental building block, the [`spin::Once` type].
Let's try to solve our problem using [`spin::Once`][`spin::Once` type]:
```rust
// in src/interrupts/mod.rs
use spin::Once;

static TSS: Once<TaskStateSegment> = Once::new();
static GDT: Once<gdt::Gdt> = Once::new();
```
The `Once` type allows us to initialize a `static` at runtime. It is safe because the only way to access the static value is through the provided methods ([call_once][Once::call_once], [try][Once::try], and [wait][Once::wait]). Thus, no value can be read before initialization and the value can only be initialized once.
We rewrite our `interrupts::init` function to initialize the statics through `call_once`:

```rust
// in src/interrupts/mod.rs

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let tss = TSS.call_once(|| {
        let mut tss = TaskStateSegment::new();
        tss.ist[DOUBLE_FAULT_IST_INDEX] = double_fault_stack.top() as u64;
        tss
    });

    let gdt = GDT.call_once(|| {
        let mut gdt = gdt::Gdt::new();
        let code_selector = gdt.add_entry(gdt::Descriptor::kernel_code_segment());
        let tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
        gdt
    });
    gdt.load();

    IDT.load();
}
```
Now it should compile again!
#### The Final Steps
We're almost done. We successfully loaded our new GDT, which contains a TSS descriptor. Now there are just a few steps left:
1. We changed our GDT, so we should reload `cs`, the code segment register. This is required since the old segment selector could now point to a different GDT descriptor (e.g. a TSS descriptor).
2. We loaded a GDT that contains a TSS selector, but we still need to tell the CPU that it should use that TSS.
3. As soon as our TSS is loaded, the CPU has access to a valid interrupt stack table (IST). Then we can tell the CPU that it should use our new double fault stack by modifying our double fault IDT entry.
For the first two steps, we need access to the `code_selector` and `tss_selector` variables outside of the closure. We can achieve this by moving the `let` declarations out of the closure:
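A sketch of how this might look; it assumes that `SegmentSelector::empty()`, `set_cs`, and `load_tr` are available from the `x86` crate (their exact module paths depend on the crate version):

```rust
// in src/interrupts/mod.rs (sketch)

pub fn init(memory_controller: &mut MemoryController) {
    let double_fault_stack = memory_controller.alloc_stack(1)
        .expect("could not allocate double fault stack");

    let tss = TSS.call_once(|| {
        let mut tss = TaskStateSegment::new();
        tss.ist[DOUBLE_FAULT_IST_INDEX] = double_fault_stack.top() as u64;
        tss
    });

    // declared outside the closure so that we can use them afterwards
    let mut code_selector = SegmentSelector::empty();
    let mut tss_selector = SegmentSelector::empty();

    let gdt = GDT.call_once(|| {
        let mut gdt = gdt::Gdt::new();
        code_selector = gdt.add_entry(gdt::Descriptor::kernel_code_segment());
        tss_selector = gdt.add_entry(gdt::Descriptor::tss_segment(&tss));
        gdt
    });
    gdt.load();

    unsafe {
        // reload the code segment register
        set_cs(code_selector);
        // load the TSS
        load_tr(tss_selector);
    }

    IDT.load();
}
```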
We first set the descriptors to `empty` and then update them from inside the closure (which implicitly borrows them as `&mut`). Now we're able to reload the code segment register using [`set_cs`] and to load the TSS using [`load_tr`].
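For the third step, we point our double fault IDT entry at the 0th IST slot. A sketch, assuming that `set_handler` returns a mutable reference to the entry's options and that the options type offers a `set_stack_index` method (as set up in the previous post):

```rust
// in src/interrupts/mod.rs (sketch)

lazy_static! {
    static ref IDT: idt::Idt = {
        let mut idt = idt::Idt::new();
        // [...] the other handlers

        idt.set_handler(8, handler_with_error_code!(double_fault_handler))
            .set_stack_index(DOUBLE_FAULT_IST_INDEX as u16); // new

        idt
    };
}
```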
We also adjust the `set_stack_index` method of our IDT entry options so that callers can pass our zero-based index directly:

```rust
// in src/interrupts/idt.rs

pub fn set_stack_index(&mut self, index: u16) -> &mut Self {
    // The hardware IST index starts at 1, but our software IST index
    // starts at 0. Therefore we need to add 1 here.
    self.0.set_range(0..3, index + 1);
    self
}
```
The only change is that we're now adding `1` to the passed `index`, because the hardware expects an IST index between 1 and 7. A zero means “no stack switch”.
That's it! Now the CPU should switch to the double fault stack whenever a double fault occurs. Thus, we are able to catch _all_ double faults, including kernel stack overflows:

From now on we should never see a triple fault again!
## Safety Problems
In this post, we needed a few `unsafe` blocks to load the GDT and TSS structures. We always used `'static` references, so the passed addresses should always be valid.
However, the IST entries (stored in the TSS) are used as stack pointers by the CPU. This can lead to various memory safety violations:
- The CPU writes to whatever address we store in the IST, without any checks. This way, we can easily circumvent Rust's safety guarantees and e.g. overwrite a `&mut` reference on some random stack space.
- If we use the same stack index for multiple exceptions, memory safety might be violated too. For example, imagine that the double fault handler and the breakpoint handler use the same IST index. If the double fault handler causes a breakpoint exception, the breakpoint handler overwrites the stack frame of the double fault handler. When the breakpoint handler returns, the CPU jumps back to the double fault handler and undefined behavior occurs.
- If we accidentally use an empty IST entry, the CPU uses the stack pointer `0`. This is really bad, since we overwrite our [recursively mapped] page tables this way.
To observe this, we point the double fault entry at an unused (empty) IST index and print the address of the exception stack frame in our double fault handler:

```rust
// in src/interrupts/mod.rs

extern "C" fn double_fault_handler(stack_frame: &ExceptionStackFrame,
                                   _error_code: u64)
{
    println!("\nEXCEPTION: DOUBLE FAULT\n{:#?}", stack_frame);
    println!("exception stack frame at {:#p}", stack_frame); // new
    loop {}
}
```
When we start our kernel now, we see that the exception stack frame was written to `0xffffffffffffffd8`:

This address is part of the [recursive mapping][recursively mapped], so the CPU just overwrote some random page table entries.
### Possible Solutions?
Normally, the type system allows us to make things safer. Unfortunately, there are many difficulties in this case. For example, we need to be able to create multiple task state segments since we have multiple CPUs (each CPU has its own TSS). Each IST index is only valid if the corresponding TSS is loaded in the CPU. This makes compile-time abstractions very difficult.
So this might be a case where we have to tolerate the safety dangers [^fn-ideas]. At least, they are limited to the `interrupts` module, where we already have lots of dangerous inline assembly. From outside, only the safe `interrupts::init` function is visible, so it is similar to an [unsafe abstraction]. We only need to be aware of the safety dangers when we edit the `interrupts` module in the future.
[^fn-ideas]: If somebody has a good solution for this problem, please tell me :).
## What's next?
Now that we mastered exceptions, it's time to explore another kind of interrupts: interrupts from external devices such as timers, keyboards, or network controllers. These hardware interrupts are very similar to exceptions, e.g. they are also dispatched through the IDT.
However, unlike exceptions, they don't arise directly on the CPU. Instead, an _interrupt controller_ aggregates these interrupts and forwards them to the CPU depending on their priority. In the next posts we will explore the two interrupt controller variants on x86: the [Intel 8259] \(“PIC”) and the [APIC]. This will allow us to react to keyboard and mouse input.