Most programmers spend their lives working on top of layers and layers of abstraction, and I want to give you a peek inside things that we usually take for granted. Like how we usually just start our applications with int main(void) and assume the runtime environment is already configured to execute our code. But main() is not the actual starting point of a program. Before the first line of main() runs, hardware must be initialized, memory must be mapped, and the C/C++ runtime environment must be constructed.
It was a mystery to me how our applications just know that execution starts at main(), until I decided to write everything from scratch for a microcontroller to see it hands on. In this post, I will explain the exact execution path a program takes.
To illustrate this, I will use a custom startup file and a bare metal task scheduler which I wrote for an STM32F407, a microcontroller built around the ARM Cortex-M4 core. I am using this project as an example because bare metal systems have fewer abstractions, making it much easier to see the underlying mechanics. Once we look at how it works on the MCU, I will show how these same concepts translate to programs executing on x86 machines, which I am assuming most of us use.
I will be using C on Linux throughout, but these ideas are shared beyond specific languages. The complete source code for the scheduler, linker script, and startup files can be found on Link to Code.
Understanding the Project Files
Before we start, it helps to understand the files that constitute the task scheduler example. When we build this application, it is divided into individual pieces that eventually merge into our final executable:
- main.c: This is the application logic. It contains the task scheduler and the main() function.
- stm32_startup.c: This file contains the very first instructions the CPU will execute. It holds the vector table and the Reset Handler, which is responsible for setting up the environment before calling main().
- stm32_ls.ld: This is the linker script. It tells the toolchain exactly where the physical RAM and ROM are located on the chip, and where to place different parts of our program.
- .o (Object Files): When we compile main.c or stm32_startup.c, the compiler generates object files (main.o, stm32_startup.o). These contain raw machine code and data, but they are not fully stitched together yet.
- .elf (Executable and Linkable Format): This is the final executable file produced by the linker. It combines all the object files and assigns absolute physical addresses to everything based on the linker script.
What Constitutes a Final Executable?
Before we trace the execution path from the moment we power on our PC until main(), we need to understand what is actually inside the binary file we are trying to run. We cannot just feed a raw C file to a processor.
When you compile a source file, you use a toolchain. For my ARM project, I used arm-none-eabi-gcc to compile my application into an object file using the command:
arm-none-eabi-gcc -c main.c -o main.o
The compiler then divides the variables and executable code into standard logical sections:
- .text: The actual executable CPU instructions (the code section).
- .rodata: Read only data. Any const data with static storage in the application goes here.
- .data: Initialized global and static variables. They go here because they require a user defined, non zero initial value.
- .bss: Global and static variables that are uninitialized or explicitly initialized to zero. The startup code later zero-initializes this region at runtime.
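To make the mapping concrete, here is a small sketch of where typical declarations usually end up. The names (lookup_table, boot_count, rx_buffer, next_id) are made up for illustration, and the exact placement can vary with the compiler, its flags, and the optimization level.

```c
#include <stdint.h>

/* const data with static storage: typically placed in .rodata */
const uint8_t lookup_table[4] = {1, 2, 4, 8};

/* initialized, non-zero global: typically placed in .data */
uint32_t boot_count = 3;

/* uninitialized global: typically placed in .bss (zero at startup) */
uint8_t rx_buffer[64];

/* function body: machine code placed in .text */
uint32_t next_id(void)
{
    /* initialized static local: also .data, despite its function scope */
    static uint32_t id = 100;
    return id++;
}
```

If you compile this with gcc -c and run nm on the resulting object file, the symbols should typically show up with type letters such as T (text), R (read only data), D (data), and B (bss), with lowercase letters for local symbols.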
To see this for yourself, you can run the objdump tool on an executable. The output looks like this:
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00003444  08000000  08000000  00010000  2**6
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000994  20000000  08003450  00020000  2**3
                  CONTENTS, ALLOC, LOAD, DATA
  2 .init_array   00000004  20000994  08003de4  00020994  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  3 .bss          00000158  200009a0  08003df0  000209a0  2**2
                  ALLOC
You can try it right now actually. Just write a simple program:
int main(void) {
    return 0;
}
Now compile it with gcc -o main main.c and run objdump -h main to see the sections. You can see the size of each section and the address it is assigned (on a desktop system these are virtual addresses rather than physical RAM or ROM locations).
Why is this sectioning important, and what does it have to do with the execution of main()?
Because different types of data have different hardware requirements. Your code (.text) needs to sit in executable read only memory so it doesn't get overwritten, but your variables (.data and .bss) need to sit in writable RAM so they can be modified. If the compiler just mixed all our code and data together into one giant blob, the startup sequence wouldn't know which parts to copy to RAM and which parts to zero out. Without this strict organization, main() would execute with garbage data, and we all know "garbage in, garbage out."
Partitioning the code into sections is not the only thing that happens. We can also ask the toolchain to run specific functions before main(). This is done by placing function pointers into a specialized initialization section called .init_array. In GCC, this is achieved using the constructor attribute. The linker stores the contents of .init_array in Flash memory, so the pointers are ready to be read when the application starts executing.
Here is an example of how code can be injected before the execution of main():
#include <stdio.h>

/* Apply the constructor attribute so this runs before main() */
void myStartupFun(void) __attribute__((constructor));

void myStartupFun(void) {
    printf("This runs before main()\n");
}

int main(void) {
    printf("Hello from main\n");
    return 0;
}
When you compile this, the compiler places a pointer to myStartupFun into the .init_array section.
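When more than one constructor exists, GCC also accepts an optional priority argument. The sketch below assumes GCC or Clang; the function names and the init_order array are hypothetical and exist only to record which constructor ran first. Lower numbers run earlier, and priorities 0 to 100 are reserved for the implementation.

```c
/* Records the order in which the constructors ran (illustrative only). */
static int init_order[2];
static int init_count = 0;

/* Priority 101 runs before priority 102. */
__attribute__((constructor(101)))
static void init_clocks(void)
{
    init_order[init_count++] = 101;
}

__attribute__((constructor(102)))
static void init_logging(void)
{
    init_order[init_count++] = 102;
}

/* Hypothetical helper: reports how many constructors ran before main(). */
int constructors_seen(void)
{
    return init_count;
}
```

By the time main() starts, constructors_seen() should report 2 and init_order should hold 101 followed by 102, because both pointers were already called from the initialization array.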
At this stage, the compiler has organized our code into these logical buckets, but they only have temporary, relative addresses. Because the compiler only looks at one C file at a time, it does not know where the final program will reside in memory. It just assigns offset addresses starting from zero. The processor, however, needs absolute addresses to execute instructions or fetch variables. This is where the linker and memory layout come in.
Linking and Memory Layout: LMA vs. VMA
In a microcontroller, memory is typically split between non volatile Flash (ROM) and volatile SRAM. When power is disconnected, the contents of RAM are lost. Because of this, when power is applied, the program must initially execute from Flash. However, mutable variables (variables that will change during execution, like those in .data and .bss) must reside in RAM.
This introduces two very important concepts:
- Load Memory Address (LMA): The address where the data is physically stored in Flash.
- Virtual Memory Address (VMA): The address where the data must reside at runtime for the CPU to access it correctly.
We define this mapping explicitly using a linker script. In the stm32_ls.ld file, I first define the physical memory regions of the microcontroller:
MEMORY
{
    FLASH(rx) : ORIGIN = 0x08000000, LENGTH = 1024K
    SRAM(rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}
Next, we tell the linker exactly where to place our sections, for example:
_la_data = LOADADDR(.data);
.data :
{
    _sdata = .;
    *(.data)
    *(.data.*)
    . = ALIGN(4);
    _edata = .;
} > SRAM AT> FLASH
This tells the linker: "Assign the runtime addresses (VMA) for the .data variables to SRAM starting at the symbol _sdata. But store their initial, power on values (LMA) in FLASH at the symbol _la_data."
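The relationship between the two addresses is plain arithmetic. As a rough sketch, the helpers below (hypothetical names, not part of the project) plug in the .data numbers from the objdump listing earlier in this post and compute where in Flash the initial value of a given SRAM variable is stored.

```c
#include <stdint.h>

/* Values taken from the objdump listing earlier in this post. */
#define DATA_VMA  0x20000000u  /* _sdata: runtime address in SRAM */
#define DATA_LMA  0x08003450u  /* _la_data: load address in Flash */
#define DATA_SIZE 0x00000994u  /* _edata - _sdata                 */

/* Hypothetical helper: Flash address holding the initial value of a .data variable. */
uint32_t data_lma_for(uint32_t vma)
{
    return DATA_LMA + (vma - DATA_VMA);
}

/* Hypothetical helper: does a runtime address fall inside .data? */
int in_data_section(uint32_t vma)
{
    return vma >= DATA_VMA && vma < DATA_VMA + DATA_SIZE;
}
```

For example, a variable living at 0x20000010 at runtime has its initial value stored at 0x08003460 in Flash. The end of the section, 0x20000994, maps to 0x08003de4, which is exactly the LMA of the next section in that listing.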
Visually, the final memory mapping of our executable looks like this:
   FLASH (LMA)                           SRAM (VMA)
+---------------+ 0x08000000          +---------------+ 0x20000000 <- _sdata
| Vector Table  |                     |     .data     |
+---------------+                     |               |
| .text (code)  |                     |               |
+---------------+                     +---------------+ <- _edata
|    .rodata    |                     |     .bss      | <- _sbss
+---------------+ <- _la_data         |               |
| .data (init   |                     |               |
|   values)     |                     +---------------+ <- _ebss
+---------------+                     |     Heap      |
                                      +---------------+
                                      |       |       |
                                      |       v       |
                                      |               |
                                      |       ^       |
                                      |       |       |
                                      +---------------+
                                      |     Stack     |
                                      +---------------+ 0x20020000 (SRAM_END)
I also want to quickly discuss the Vector Table. It is essentially an array of function pointers. It acts as a directory for the CPU, telling it where to find critical routines, like the addresses of interrupt handlers, exception handlers, and most importantly, the very first instruction to run when the chip is powered on.
Now that we know what our compiled executable looks like in memory, let's finally see how we reach main().
How is main() Executed?
When power is applied to the processor, it executes a strict sequence of operations before running any of our C code.
On ARM Cortex-M processors, the CPU starts by reading the Vector Table, which it expects at address 0x00000000 after reset. On STM32 chips, this address is typically aliased or remapped to Flash memory at 0x08000000, where our Vector Table resides.
The ARM hardware specification mandates that:
- The first 32-bit word in this table contains the initial Main Stack Pointer (MSP) value. The stack pointer tells the CPU exactly where the stack memory starts in RAM. This is crucial because without a stack, the CPU cannot push local variables to memory or make function calls.
- The second 32-bit word contains the address of the Reset_Handler. The processor hardware takes this address and loads it into the Program Counter (PC). The Program Counter is the register that tells the CPU exactly which instruction to fetch and execute next. By loading this address, the hardware forces the CPU to jump to our startup code.
In my stm32_startup.c file, I explicitly defined the vector_table array and placed it at the top of the .isr_vector section so the linker puts it at the beginning of Flash memory:
uint32_t vector_table[] __attribute__((section(".isr_vector"))) = {
    STACK_START,                  // Hardware loads this into the Stack Pointer
    (uint32_t)&Reset_Handler,     // Hardware loads this into the Program Counter and branches here
    (uint32_t)&NMI_Handler,
    (uint32_t)&HardFault_Handler,
    // ... other exception handlers ...
};
When the CPU reads the second word, it branches to the Reset_Handler function. Software execution has now begun.
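To make the hardware's first two reads concrete, here is a small host-side sketch that decodes them from a vector table image. The reset_state struct and decode_reset helper are hypothetical names, not part of the project. One detail worth knowing: on Cortex-M, bit 0 of the reset vector must be 1 to indicate Thumb state, and the toolchain sets it for you when you take the address of a Thumb function.

```c
#include <stdint.h>

/* What the core picks up from the first two vector table words at reset. */
typedef struct {
    uint32_t initial_sp;  /* word 0: initial Main Stack Pointer         */
    uint32_t reset_addr;  /* word 1 with bit 0 cleared: handler address */
    int      thumb;       /* bit 0 of word 1: must be 1 on Cortex-M     */
} reset_state;

/* Hypothetical helper that mimics the reset fetch on a table image. */
reset_state decode_reset(const uint32_t *table)
{
    reset_state s;
    s.initial_sp = table[0];
    s.reset_addr = table[1] & ~1u;
    s.thumb      = (int)(table[1] & 1u);
    return s;
}
```

Given an image whose first two words are 0x20020000 and 0x08000199 (made-up values), the sketch reports a stack starting at the end of SRAM and a handler located at 0x08000198 with the Thumb bit set.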
However, the C runtime is not ready yet. If we were to jump directly to main() right now, any attempt to read a global variable would return absolute garbage. Writing to it might cause a fault. The Reset_Handler must fulfill the guarantees expected by the C runtime before calling main(). It does this in three steps:
Step 1: Copy initialized data (.data) from Flash to SRAM.
We use those symbols we defined in the linker script (_sdata, _edata, _la_data) to find the initialization values in ROM and copy them into RAM.
uint32_t size = (uint32_t)&_edata - (uint32_t)&_sdata;
uint8_t *pDst = (uint8_t*)&_sdata;   // Destination: VMA in SRAM
uint8_t *pSrc = (uint8_t*)&_la_data; // Source: LMA in FLASH
for (uint32_t i = 0; i < size; i++) {
    *pDst++ = *pSrc++;
}
Step 2: Zero initialize the .bss section in SRAM.
The C standard guarantees that uninitialized global variables start at zero. We do that manually here.
size = (uint32_t)&_ebss - (uint32_t)&_sbss; // Size in bytes, computed the same way as in Step 1
pDst = (uint8_t*)&_sbss;
for (uint32_t i = 0; i < size; i++) {
    *pDst++ = 0;
}
Step 3: Call C library initialization and jump to main().
Now we call __libc_init_array(). In embedded C libraries like newlib, this function iterates through the .init_array section and executes all the functions marked with __attribute__((constructor)).
    __libc_init_array();
    main();
}
Finally, we branch to main(). At this point, the application can safely assume that all global and static variables hold their expected initial values and that the stack is fully functional.
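If you do not have a board at hand, the three steps can be tried on a desktop machine. The sketch below is a simulation under stated assumptions: plain arrays stand in for Flash and SRAM, and every name in it (copy_data, zero_bss, run_init_array, fake_flash_data, fake_sram, fake_ctor, simulate_reset) is hypothetical. The init array walk is a simplification; newlib's real __libc_init_array does more than this.

```c
#include <stdint.h>
#include <stddef.h>

typedef void (*init_fn)(void);

/* Step 1: copy initial values from the load image (LMA) to runtime memory (VMA). */
static void copy_data(uint8_t *dst, const uint8_t *src, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        dst[i] = src[i];
    }
}

/* Step 2: clear the region that holds uninitialized statics. */
static void zero_bss(uint8_t *start, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        start[i] = 0;
    }
}

/* Step 3: call every function pointer in [start, end), in order. */
static void run_init_array(init_fn *start, init_fn *end)
{
    for (init_fn *fn = start; fn < end; fn++) {
        (*fn)();
    }
}

/* Stand-ins for Flash, SRAM and a constructor (all hypothetical). */
static const uint8_t fake_flash_data[4] = {0xDE, 0xAD, 0xBE, 0xEF};
static uint8_t fake_sram[8] = {0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55};
static int ctor_calls = 0;
static void fake_ctor(void) { ctor_calls++; }

/* Simulated reset sequence; returns how many constructors ran. */
int simulate_reset(void)
{
    init_fn init_array[] = { fake_ctor, fake_ctor };

    copy_data(fake_sram, fake_flash_data, sizeof fake_flash_data); /* .data */
    zero_bss(fake_sram + 4, 4);                                    /* .bss  */
    run_init_array(init_array, init_array + 2);                    /* ctors */
    return ctor_calls;
}
```

The 0x55 fill plays the role of the unpredictable contents of SRAM after power on. After simulate_reset() returns, the first four bytes hold the values copied from the fake Flash image and the last four are zero, which is the state main() expects to find.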
Transitioning the Concepts: x86 Machines
The bare metal task scheduler example illustrates the exact steps required to execute a binary. On a standard x86 Linux machine, the same fundamental steps occur, but the responsibility is split between the Operating System and the C Standard Library (glibc).
When you execute an ELF binary on Linux (like running ./my_program in the terminal), the PC is already running. The hardware initialization, where the BIOS or UEFI firmware sets up the RAM timings, configures the PCI buses, and establishes basic clock speeds, has happened long before you opened the terminal.
But the runtime for your specific process is constructed now:
- Process Creation: The OS kernel parses your ELF file. Instead of manually copying sections from Flash into SRAM like we did on the microcontroller, the kernel maps the ELF segments into your process's virtual address space using the virtual memory subsystem.
- Dynamic Linking: The kernel hands control to the dynamic linker (ld-linux.so on Linux systems), which finds the shared libraries your application needs, loads them into memory, and resolves external symbol addresses required by the program.
- The Entry Point (_start): Just like on the microcontroller, execution does not begin at main(). It begins at the ELF entry point, which usually defaults to an assembly symbol named _start.
- C Runtime Initialization: The _start routine calls a function such as __libc_start_main in glibc. This function prepares the runtime environment, extracts command-line arguments (argc and argv) from the stack, initializes threading and TLS structures, and executes constructor functions before calling main().
- Calling main(): Finally, __libc_start_main calls your main(argc, argv). If your main() function returns, the runtime passes its return value to exit(), which runs cleanup routines and then asks the kernel to terminate the process.
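The last two steps can be pictured with a heavily simplified stand-in. The sketch below is not glibc's real __libc_start_main, which takes more parameters and does far more work; sketch_start_main and the demo functions are hypothetical and only show the ordering: setup hooks first, then main(), then cleanup.

```c
#include <stddef.h>

typedef int  (*main_fn)(int argc, char **argv);
typedef void (*hook_fn)(void);

/* Simplified stand-in for the runtime's startup routine (not glibc's API). */
int sketch_start_main(main_fn app_main, int argc, char **argv,
                      hook_fn init, hook_fn fini)
{
    if (init != NULL) {
        init();                 /* constructors and library setup */
    }
    int status = app_main(argc, argv);
    if (fini != NULL) {
        fini();                 /* cleanup; the real runtime does this inside exit() */
    }
    return status;              /* the real runtime calls exit(status) instead */
}

/* Hypothetical application pieces used to exercise the sketch. */
static int trace[3];
static int trace_len = 0;

static void demo_init(void) { trace[trace_len++] = 1; }
static void demo_fini(void) { trace[trace_len++] = 3; }
static int  demo_main(int argc, char **argv)
{
    (void)argv;
    trace[trace_len++] = 2;
    return argc;                /* echo argc as the exit status */
}
```

Calling sketch_start_main(demo_main, 2, args, demo_init, demo_fini) with a two element argument vector records 1, 2, 3 in trace and returns 2, mirroring the init, main, cleanup order described in the list above.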
Whether you are writing bare metal code or a user space application for desktop PCs, having an understanding of memory sections, linker scripts, and the startup routines that run before main() can help you debug early startup crashes, write custom bootloaders, and gain a much deeper appreciation for how computers actually execute software.
Hope you enjoyed reading this :)