Stack unwinding

The main usage of stack unwinding is:

Stack unwinding tasks can be divided into two categories:

Frame pointer

The most classic and simplest stack unwinding is based on the frame pointer: fix a register as the frame pointer (RBP on x86-64), put the frame pointer in the stack frame at the function prologue, and update the frame pointer to the address of the saved frame pointer. The frame pointer and its saved values in the stack form a singly linked list. After obtaining the initial frame pointer value (__builtin_frame_address), dereference the frame pointer continuously to get the frame pointer values of all stack frames. This method is not applicable to some instructions in the prologue/epilogue.

pushq %rbp
movq %rsp, %rbp # after this, RBP references the current frame
...
popq %rbp
retq  # RBP references the previous frame
#include <stdio.h>
[[gnu::noinline]] void qux() {
  void **fp = __builtin_frame_address(0);
  for (;;) {
    printf("%p\n", fp);
    void **next_fp = *fp;
    if (next_fp <= fp) break;
    fp = next_fp;
  }
}
[[gnu::noinline]] void bar() { qux(); }
[[gnu::noinline]] void foo() { bar(); }
int main() { foo(); }

The frame pointer-based method is simple, but has several drawbacks.

When the above code is compiled with -O1 or above, foo and bar will have tail calls, and the program output will not include the stack frame of foo and bar (-fomit-leaf-frame-pointer does not hinder the tail call).

In practice, it is not guaranteed that all libraries contain frame pointers. When unwinding a thread, it is necessary to check whether next_fp is like a stack address before dereferencing it to prevent segfaults. One way to check page accessibility is to parse /proc/*/maps to determine whether the address is readable (slow). There is a smart trick:

// Or use the write end of a pipe.
int fd = open("/dev/random", O_WRONLY);
if (write(fd, address, 1) < 0)
  // not readable

In addition, reserving a register for the frame pointer will increase text size and have negative performance impact (prologue, epilogue additional instruction overhead and register pressure caused by one fewer register), which may be quite significant on x86-32 which lack registers. On an architecture with relatively sufficient registers, e.g. x86-64, the performance loss can be more than 1%.

Compiler behavior

libunwind

C++ exception and stack unwinding of profiler/crash reporter usually use libunwind API and DWARF Call Frame Information. In the 1990s, Hewlett-Packard defined a set of libunwind API, which is divided into two categories:

Hewlett-Packard has open sourced https://www.nongnu.org/libunwind/ (in addition to many projects called “libunwind”). The common implementations of this API on Linux are:

DWARF Call Frame Information

The unwind instructions required by different areas of the program are described by DWARF Call Frame Information (CFI) and stored by .eh_frame on the ELF platform. Compiler/assembler/linker/libunwind provides corresponding support.

.eh_frame is composed of Common Information Entry (CIE) and Frame Description Entry (FDE). CIE has these fields:

Each FDE has an associated CIE. FDE has these fields:

A CIE may optionally refer to a personality routine in the text section. A FDE may optionally refer to its associated LSDA in .gcc_except_table. The personality routine and LSDA are used in Level 2: C++ ABI of Itanium C++ ABI.

.eh_frame is based on .debug_frame introduced in DWARF v2. They have some differences, though:

For the following function:

void f() {
  __builtin_unwind_init();
}

The compiler produces .cfi_* (CFI directives) to annotate the assembly, .cfi_startproc and .cfi_endproc annotate the FDE area, and other CFI directives describe CFI instructions. A call frame is indicated by an address on the stack. This address is called Canonical Frame Address (CFA), and is usually the stack pointer value of the call site. The following example demonstrates the usage of CFI instructions:

f:
# At the function entry, CFA = rsp+8
    .cfi_startproc
# %bb.0:
    pushq   %rbp
# Redefine CFA = rsp+16
    .cfi_def_cfa_offset 16
# rbp is saved at the address CFA-16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
# CFA = rbp+16. CFA does not needed to be redefined when rsp changes
    .cfi_def_cfa_register %rbp
    pushq   %r15
    pushq   %r14
    pushq   %r13
    pushq   %r12
    pushq   %rbx
# rbx is saved at the address CFA-56
    .cfi_offset %rbx, -56
    .cfi_offset %r12, -48
    .cfi_offset %r13, -40
    .cfi_offset %r14, -32
    .cfi_offset %r15, -24
    popq    %rbx
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    popq    %rbp
# CFA = rsp+8
    .cfi_def_cfa %rsp, 8
    retq
.Lfunc_end0:
    .size   f, .Lfunc_end0-f
    .cfi_endproc

The assembler parses CFI directives and generates .eh_frame (this mechanism was introduced by Alan Modra in 2003). Linker collects .eh_frame input sections in .o/.a files to generate output .eh_frame. In 2006, GNU as introduced .cfi_personality and .cfi_lsda.

.eh_frame_hdr and PT_EH_FRAME

To locate the FDE where a pc is located, you need to scan .eh_frame from the beginning to find the appropriate FDE (whether the pc falls in the interval indicated by initiallocation and addressrange). The time spent is proportional to the number of scanned CIE and FDE records. https://sourceware.org/pipermail/binutils/2001-December/015674.html introduced .eh_frame_hdr, a binary search index table describing (initial_location, FDE address) pairs.

The linker collects all .eh_frame input sections. With --eh-frame-hdr, ld generates .eh_frame_hdr and creates a program header PT_EH_FRAME to describe .eh_frame_hdr. An unwinder can parse the program headers and look for PT_EH_FRAME to locate .eh_frame_hdr. Please check out the example below.

__register_frame_info

Before .eh_frame_hdr and PT_EH_FRAME were invented, there was a static constructor frame_dummy in crtbegin (crtstuff.c): calling __register_frame_info to register the executable file .eh_frame.

Now __register_frame_info is only used by programs linked with -static. Correspondingly, if you specify -Wl,--no-eh-frame-hdr when linking, you cannot unwind (if you use a C++ exception, the program will call std::terminate).

libunwind example

#include <libunwind.h>
#include <stdio.h>

void backtrace() {
  unw_context_t context;
  unw_cursor_t cursor;
  // Store register values into context.
  unw_getcontext(&context);
  // Locate the PT_GNU_EH_FRAME which contains PC.
  unw_init_local(&cursor, &context);
  size_t rip, rsp;
  do {
    unw_get_reg(&cursor, UNW_X86_64_RIP, &rip);
    unw_get_reg(&cursor, UNW_X86_64_RSP, &rsp);
    printf("rip: %zx rsp: %zx\n", rip, rsp);
  } while (unw_step(&cursor) > 0);
}

void bar() {backtrace();}
void foo() {bar();}
int main() {foo();}

If you use llvm-project/libunwind:

$CC a.c -Ipath/to/include -Lpath/to/lib -lunwind

If you use nongnu.org/libunwind, there are two options: (a) Add #define UNW_LOCAL_ONLY before #include <libunwind.h> (b) Link one more library, on x86-64 it is -l:libunwind-x86_64.so. If you use Clang, you can also use clang --rtlib=compiler-rt --unwindlib=libunwind -I path/to/include a.c, in addition to providing unw_*, it can ensure that libgcc_s.so is not linked

An unwinder locates the applicable FDE according to the program counter, and executes all the CFI instructions before the program counter.

There are several important

A -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 clang, .text 51.7MiB, .eh_frame 4.2MiB, .eh_frame_hdr 646, 2 CIE, 82745 FDE.

Remarks

CFI instructions are suitable for the compiler to generate code, but cumbersome to write in hand-written assembly. In 2015, Alex Dowad contributed an awk script to musl libc to parse the assembly and automatically generate CFI directives. In fact, generating precise CFI instructions is challenging for ompilers as well. For a function that does not use a frame pointer, adjusting SP requires outputting a CFI directive to redefine CFA. GCC does not parse inline assembly, so adjusting SP in inline assembly often results in imprecise CFI.

void foo() {
  asm("subq $128, %rsp\n"
  // Cannot unwind if -fomit-leaf-frame-pointer
      "nop\n"
      "addq $128, %rsp\n");
}

int main() {
  foo();
}

The CFIInstrInserter pass in LLVM can insert .cfi_def_cfa_* .cfi_offset .cfi_restore to adjust the CFA and callee-saved registers.

The DWARF scheme also has very low information density. The various compact unwind schemes have made improvement on this aspect. To list a few issues:

Callee-saved registers other than FP are oftentimes unneeded but there is no compiler option to drop them.

SHT_X86_64_UNWIND

.eh_frame has special processing in linker/dynamic loader, so conventionally it should use a separate section type, but SHT_PROGBITS was used in the design. In the x86-64 psABI, the type of .eh_frame is SHT_X86_64_UNWIND (influenced by Solaris).

.section .eh_frame,"a",@unwind is rare (glibc’s x86 port, libffi, LuaJIT and other packages), so checking the type of .eh_frame is a good way to distinguish Clang/GCC object file :) For LLD 11.0.0, I contributed https://reviews.llvm.org/D85785 to allow mixed types for .eh_frame in a relocatable link ;-)

Suggestion to future architectures: When defining processor-specific section types, please do not use 0x70000001 (SHT_ARM_EXIDX=SHT_IA_64_UNWIND=SHT_PARISC_UNWIND=SHT_X86_64_UNWIND=SHT_LOPROC+1) for purposes other than unwinding :) SHT_CSKY_ATTRIBUTES=0x70000001 :)

Linker perspective

Usually in the case of COMDAT group and -ffunction-sections, .data/.rodata needs to be split like .text, but .eh_frame is monolithic. Like many other metadata sections, the main problem with the monolithic section is that garbage collection is challenging in the linker. Unlike some other metadata sections, simply abandoning garbage collecting is not a choice: .eh_frame_hdr is a binary search index table and duplicate/unused entries can confuse the customers.

When a linker processes .eh_frame, it needs to conceptually split .eh_frame into CIE/FDE. During --gc-sections, the conceptual reference relationship is reversed considering the actual relocation: a FDE has a relocation referencing the text section; during GC, if the pointed text section is discarded, the FDE that references it should also be discarded.

LLD has some special handling for .eh_frame:

Compact unwind descriptors

On macOS, Apple designed the compact unwind descriptors mechanism to accelerate unwinding. In theory, this technique can be used to save some space in __eh_frame, but it has not been implemented. The main idea is:

Only 64-bit will be discussed below. A descriptor occupies 32 bytes

.quad _foo
.set L1, Lfoo_end-_foo
.long L1
.long compact_unwind_description
.quad personality
.quad lsda_address

If you study .eh_frame_hdr (binary search index table) and .ARM.exidx, you can know that the length field is redundant.

The Compact unwind descriptor is encoded as:

uint32_t : 24; // vary with different modes
uint32_t mode : 4;
uint32_t flags : 4;

Five modes are defined:

FP-based frame (UNWIND_MODE_BP_FRAME)

The compact unwind encoding is:

uint32_t regs : 15;
uint32_t : 1; // 0
uint32_t stack_adjust : 8;
uint32_t mode : 4;
uint32_t flags : 4;

The callee-saved registers on x86-64 are: RBX, R12, R13, R14, R15, RBP. 3 bits can encode a register, 15 bits are enough to represent 5 registers except RBP (whether to save and where). stack_adjust records the extra stack space outside the save register.

SP-based frame (UNWIND_MODE_STACK_IMMD)

The compact unwind encoding is:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t : 3;
uint32_t size : 8;
uint32_t mode : 4;
uint32_t flags : 4;

cnt represents the number of saved registers (maximum 6). reg_permutation indicates the sequence number of the saved register. size*8 represents the stack frame size.

Large SP-based frame (UNWIND_MODE_STACK_IND)

Compact unwind descriptor编码为:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t adj : 3;
uint32_t size_offset : 8;
uint32_t mode : 4;
uint32_t flags : 4;

Similar to SP-based frame. In particular: the stack frame size is read from the text section. The RSP adjustment is usually represented by subq imm, %rsp, and size_offset is used to represent the distance from the instruction to the beginning of the function. The actual stack size also includes adj*8.

DWARF CFI escape

If for various reasons, the compact unwind descriptor cannot be expressed, it must fall back to DWARF CFI.

In the LLVM implementation, each function is represented by only a compact unwind descriptor. If asynchronous stack unwinding occurs in epilogue, existing implementations cannot distinguish it from stack unwinding in function body. Canonical Frame Address will be calculated incorrectly, and the caller-saved register will be read incorrectly. If it happens in prologue, and the prologue has other instructions outside the push register and subq imm, $rsp, an error will occur. In addition, if shrink wrapping is enabled for a function, prologue may not be at the beginning of the function. The asynchronous stack unwinding from the beginning to the prologue also fails. It seems that most people don’t care about this issue. It may be because the profiler loses a few percentage points of the profile.

In fact, if you use multiple descriptors to describe each area of a function, you can still unwind accurately. OpenVMS proposed [RFC] Improving compact x86-64 compact unwind descriptors in 2018, but unfortunately there is no relevant implementation.

ARM exception handling

Divided into .ARM.exidx and .ARM.extab

.ARM.exidx is a binary search index table, composed of 2-word pairs. The first word is 31-bit PC-relative offset to the start of the region. The second word uses the program description more clearly:

if (indexData == EXIDX_CANTUNWIND)
  return false;  // like an absent .eh_frame entry. In the case of C++ exceptions, std::terminate
if (indexData & 0x80000000) {
  extabAddr = &indexData;
  extabData = indexData; // inline
} else {
  extabAddr = &indexData + signExtendPrel31(indexData);
  extabData = read32(&indexData + signExtendPrel31(indexData)); // stored in .ARM.extab
}

tableData & 0x80000000 means a compact model entry, otherwise means a generic model entry.

.ARM.exidx is equivalent to enhanced .eh_frame_hdr, compact model is equivalent to inlining the personality and lsda in .eh_frame. Consider the following three situations:

.ARM.extab is equivalent to the combined .eh_frame and .gcc_except_table.

Generic model

uint32_t personality; // bit 31 is 0
uint32_t : 24;
uint32_t num : 8;
uint32_t opcodes[];   // opcodes, variable length
uint8_t lsda[];       // variable length

In construction.

Windows ARM64 exception handling

See https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling, this is my favorite coding scheme. Support the unwinding of mid-prolog and mid-epilog. Support function fragments (used to represent unconventional stack frames such as shrink wrapping).

Saved in two sections .pdata and .xdata.

uint32_t function_start_rva;
uint32_t Flag : 2;
uint32_t Data : 30;

For canonical form functions, Packed Unwind Data is used, and no .xdata record is required; for descriptors that cannot be represented by Packed Unwind Data, it is stored in .xdata.

Packed Unwind Data

uint32_t FunctionStartRVA;
uint32_t Flag : 2;
uint32_t FunctionLength : 11;
uint32_t RegF : 3;
uint32_t RegI : 4;
uint32_t H : 1;
uint32_t CR : 2;
uint32_t FrameSize : 9;

MIPS compact exception tables

In construction.

Linux kernel ORC unwind tables

For x86-64, the Linux kernel uses its own unwind tables: ORC. You can find its documentation on https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html and there is an lwn.net introduction The ORCs are coming.

objtool orc generate a.o parses .eh_frame and generates .orc_unwind and .orc_unwind_ip. For an object file assembled from:

.globl foo
.type foo, @function
foo:
  ret

At two addresses the unwind information changes: the start of foo and the end of foo, so 2 ORC entries will be produced. If the DWARF CFA changes (e.g. due to push/pop) in the middle of the function, there may be more entries.

.orc_unwind_ip contains two entries, representing the PC-relative addresses.

Relocation section '.rela.orc_unwind_ip' at offset 0x2028 contains 2 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000000  0000000500000002 R_X86_64_PC32          0000000000000000 .text + 0
0000000000000004  0000000500000002 R_X86_64_PC32          0000000000000000 .text + 1

.orc_unwind contains two entries of type orc_entry. The entries encode how IP/SP/BP of the previous frame are stored.

struct orc_entry {
  s16 sp_offset; // sp_offset and sp_reg encode where SP of the previous frame is stored
  s16 bp_offset; // bp_offset and bp_reg encode where BP of the previous frame is stored
  unsigned sp_reg:4;
  unsigned bp_reg:4;
  unsigned type:2; // how IP of the previous frame is stored
  unsigned end:1;
} __attribute__((__packed__));

You may find similarities in this scheme and UNWIND_MODE_BP_FRAME and UNWIND_MODE_STACK_IMMD in Apples’s compact unwind descriptors. The ORC scheme uses 16-bit integers so assumably UNWIND_MODE_STACK_IND will not be needed. During unwinding, most callee-saved registers other than BP are unneeded, so ORC does not bother recording them.

The linker will resolve relocations in .orc_unwind_ip and create __start_orc_unwind_ip/__stop_orc_unwind_ip/__start_orc_unwind/ __stop_orc_unwind delimiter the section contents. Then, a host utility scripts/sorttable sorts the contents of .orc_unwind_ip and .orc_unwind. To unwind a stack frame, unwind_next_frame * performs a binary search into the .orc_unwind_ip table to figure out the relevant ORC entry * retrieves the previous SP with the current SP, orc->sp_reg and orc->sp_offset. * retrieves the previous IP with orc->type and other values. * retrieves the previous BP with the currrent BP, the previous SP, orc->bp_reg and orc->bp_offset. bp->reg can be ORC_REG_UNDEFINED/ORC_REG_PREV_SP/ORC_REG_BP.