Linkers part 3

Continuing notes on linkers.

Address Spaces

An address space is simply a view of memory, in which each byte has an address. The linker deals with three distinct types of address space.

Every input object file is a small address space: the contents have addresses, and the symbols and relocations refer to the contents by addresses.

The output program will be placed at some location in memory when it runs. This is the output address space, which I generally refer to as using virtual memory addresses.

The output program will be loaded at some location in memory. This is the load memory address. On typical Unix systems virtual memory addresses and load memory addresses are the same. On embedded systems they are often different; for example, the initialized data (the initial contents of global or static variables) may be loaded into ROM at the load memory address, and then copied into RAM at the virtual memory address.
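The ROM-to-RAM copy can be sketched in a few lines. This is only an illustration, not how any particular startup code does it: the memory map, the names DATA_LMA and DATA_VMA, and the helper copy_initialized_data are all hypothetical, and ROM and RAM are simulated with byte arrays.

```python
# Hypothetical memory map for a small embedded target (all addresses are made up).
ROM_BASE = 0x0800_0000   # where ROM starts in the address space
RAM_BASE = 0x2000_0000   # where RAM starts in the address space

DATA_LMA = 0x0800_1000   # load memory address: where the initial values are stored (ROM)
DATA_VMA = 0x2000_0000   # virtual memory address: where the program expects them (RAM)


def copy_initialized_data(rom, ram, size):
    """Copy `size` bytes of initialized data from its LMA in ROM to its VMA in RAM."""
    src = DATA_LMA - ROM_BASE   # offset of the data within the simulated ROM
    dst = DATA_VMA - RAM_BASE   # offset of the data within the simulated RAM
    ram[dst:dst + size] = rom[src:src + size]


# Build a fake ROM image with four bytes of initial data stored at the LMA.
rom = bytearray(0x2000)
rom[DATA_LMA - ROM_BASE:DATA_LMA - ROM_BASE + 4] = b"\x01\x02\x03\x04"
ram = bytearray(0x100)

# Startup code would do this once, before the program proper begins.
copy_initialized_data(bytes(rom), ram, 4)
```

The program's symbols refer to DATA_VMA throughout; only the copy step needs to know about DATA_LMA.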

Shared libraries can normally be run at different virtual memory addresses in different processes. A shared library has a base address when it is created; this is often simply zero. When the dynamic linker copies the shared library into the virtual memory space of a process, it must apply relocations to adjust the shared library to run at its virtual memory address. Shared library systems minimize the number of relocations which must be applied, since they take time when starting the program.
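As a rough sketch of what applying relocations at load time means, assume a library created with base address zero and a list of offsets, each naming a 32-bit little-endian word that holds an address inside the library. The function relocate and this relocation list format are hypothetical; real dynamic linkers use richer relocation records.

```python
import struct


def relocate(image, reloc_offsets, load_base):
    """Add load_base to each 32-bit little-endian word named by reloc_offsets."""
    for off in reloc_offsets:
        (value,) = struct.unpack_from("<I", image, off)
        # Wrap to 32 bits, as the target machine would.
        struct.pack_into("<I", image, off, (value + load_base) & 0xFFFFFFFF)


# A library created at base address zero: the word at offset 4 holds the
# address 0x10 of something else inside the library.
image = bytearray(16)
struct.pack_into("<I", image, 4, 0x10)

# This process happens to map the library at 0x70000000.
relocate(image, [4], 0x7000_0000)
relocated = struct.unpack_from("<I", image, 4)[0]
```

Every entry in the relocation list costs work at program startup, which is why shared library systems try to keep that list short.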

Object File Formats

As I said above, an assembler turns human readable assembly language into an object file. An object file is a binary data file written in a format designed as input to the linker. The linker generates an executable file. This executable file is a binary data file written in a format designed as input for the operating system or the loader (this is true even when linking dynamically, as normally the operating system loads the executable before invoking the dynamic linker to begin running the program). There is no logical requirement that the object file format resemble the executable file format. However, in practice they are normally very similar.

Most object file formats define sections. A section typically holds memory contents, or it may be used to hold other types of data. Sections generally have a name, a type, a size, an address, and an associated array of data.
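One way to picture a section is as a small record with those fields. The class below is a hypothetical model for illustration, not the layout of any real object file format; the field kind stands in for the section type.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    kind: str          # the section's type; named "kind" to avoid the Python builtin
    address: int
    size: int
    data: bytes = b""  # may be empty when the section stores no bytes in the file

    def contains(self, addr):
        """Return True if addr falls inside this section's address range."""
        return self.address <= addr < self.address + self.size


text = Section(".text", "code", 0x1000, 4, b"\x90\x90\x90\xc3")
# A section can occupy address space without storing bytes in the file,
# as uninitialized data typically does.
bss = Section(".bss", "uninitialized", 0x2000, 64)
```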

Object file formats may be classed in two general types: record oriented and section oriented.

A record oriented object file format defines a series of records of varying size. Each record starts with some special code, and may be followed by data. Reading the object file requires reading it from the beginning and processing each record. Records are used to describe symbols and sections. Relocations may be associated with sections or may be specified by other records. IEEE-695 and Mach-O are record oriented object file formats used today.
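To make the reading pattern concrete, here is a minimal sketch of a reader for an invented record format: a one-byte code, a two-byte little-endian length, then the payload. The format, the codes TAG_SECTION and TAG_SYMBOL, and both functions are assumptions for illustration; this is not the encoding used by IEEE-695 or Mach-O.

```python
import struct

# Hypothetical record codes.
TAG_SECTION = 0x01
TAG_SYMBOL = 0x02


def make_record(tag, payload):
    """Encode one record: 1-byte code, 2-byte little-endian length, payload."""
    return struct.pack("<BH", tag, len(payload)) + payload


def read_records(blob):
    """Yield (tag, payload) pairs by walking the blob from the beginning."""
    pos = 0
    while pos < len(blob):
        tag, length = struct.unpack_from("<BH", blob, pos)
        pos += 3
        payload = blob[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated record")
        yield tag, payload
        pos += length


blob = make_record(TAG_SECTION, b".text") + make_record(TAG_SYMBOL, b"main")
records = list(read_records(blob))
```

Note that there is no way to find the second record without first reading the length of the first one.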

In a section oriented object file format the file header describes a section table with a specified number of sections. Symbols may appear in a separate part of the object file described by the file header, or they may appear in a special section. Relocations may be attached to sections, or they may appear in separate sections. The object file may be read by reading the section table, and then reading specific sections directly. ELF, COFF, PE, and a.out are section oriented object file formats.
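By contrast, a section oriented reader can jump straight to the section it wants. The sketch below uses an invented layout (a count, then a table of name, file offset, and size entries, then the section contents); it is much simpler than ELF or COFF, and every name in it is hypothetical.

```python
import struct

HEADER = struct.Struct("<I")      # file header: number of sections
ENTRY = struct.Struct("<8sII")    # table entry: name, file offset, size


def build_file(sections):
    """Lay out a header, a section table, and then the section contents."""
    table = b""
    body = b""
    data_start = HEADER.size + ENTRY.size * len(sections)
    for name, data in sections.items():
        table += ENTRY.pack(name.encode(), data_start + len(body), len(data))
        body += data
    return HEADER.pack(len(sections)) + table + body


def read_section_table(blob):
    """Read only the header and table, returning name -> (offset, size)."""
    (count,) = HEADER.unpack_from(blob, 0)
    table = {}
    for i in range(count):
        name, offset, size = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        table[name.rstrip(b"\0").decode()] = (offset, size)
    return table


def read_section(blob, table, name):
    """Fetch one section directly, without touching the others."""
    offset, size = table[name]
    return blob[offset:offset + size]


blob = build_file({".text": b"\xc3", ".data": b"hello"})
table = read_section_table(blob)
data = read_section(blob, table, ".data")
```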

Every object file format needs to be able to represent debugging information. Debugging information is generated by the compiler and read by the debugger. In general the linker can just treat it like any other type of data. However, in practice the debugging information for a program can be larger than the actual program itself. The linker can use various techniques to reduce the amount of debugging information, thus reducing the size of the executable. This can speed up the link, but requires the linker to understand the debugging information.

The a.out object file format stores debugging information using special strings in the symbol table, known as stabs. These special strings are simply the names of symbols with a special type. This technique is also used by some variants of ECOFF, and by older versions of Mach-O.

The COFF object file format stores debugging information using special fields in the symbol table. This type information is limited, and is completely inadequate for C++. A common technique to work around these limitations is to embed stabs strings in a COFF section.

The ELF object file format stores debugging information in sections with special names. The debugging information can be stabs strings or the DWARF debugging format.
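As a small illustration, a tool could pick out debugging sections purely by name. This sketch assumes the common conventions that DWARF sections start with .debug_ and that stabs in ELF live in .stab and .stabstr; the function is_debug_section and the list of names are hypothetical, and a real tool would need to handle more cases.

```python
def is_debug_section(name):
    """Guess from the name alone whether a section holds debugging information."""
    return name.startswith(".debug_") or name in (".stab", ".stabstr")


names = [".text", ".data", ".debug_info", ".debug_line", ".stab", ".stabstr"]
debug_names = [n for n in names if is_debug_section(n)]
```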

More next week.