Bare Metal Task Scheduler: ARM Cortex M4 (STM32F407)


In this mini blog, I'll share what I learned building a simple round robin preemptive task scheduler from scratch. It's written in C with some inline ARM assembly, targeting the STM32F407 Discovery board. There's no RTOS and no HAL: I manipulate registers and stacks directly and use the exception handlers. To understand every detail of what's happening, I relied heavily on the STM32F407 reference manual and the ARM Cortex-M4 user manual.

What It Does

To test the scheduler, I wrote 4 tasks that blink LEDs at different rates (1s, 500ms, 250ms, 125ms). There's also an idle task that runs when all other tasks are blocked, so the CPU always has something to execute. A SysTick interrupt fires at 1kHz and drives the scheduling decisions. The actual context switch is handled by PendSV, which we'll get into later.

Implementation Details

Stack Layout

The STM32F407 has 128KB of main SRAM starting at 0x20000000. The stack is full descending: it grows from high addresses down toward lower ones, and each push decrements the stack pointer before writing.

I allocated 1KB per task (4 tasks + 1 idle task) and another 1KB for the scheduler itself, all sitting at the top of SRAM:

[Figure: Stack Layout]
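
One way to write this layout down in C is sketched below. The macro and function names are made up for illustration, and the ordering (first task stack at the very top of SRAM, scheduler stack just below the last task stack) is an assumption, so treat the exact addresses as an example rather than the project's real map.

```c
#include <stdint.h>

#define SRAM_START       0x20000000u
#define SRAM_SIZE        (128u * 1024u)
#define SRAM_END         (SRAM_START + SRAM_SIZE)  /* one past the last byte: 0x20020000 */

#define STACK_SIZE       1024u                     /* 1KB per stack */
#define NUM_TASK_STACKS  5u                        /* 4 user tasks + idle */

/* Initial (empty) stack pointer for task stack `index`, counting down from
 * the top of SRAM. Assumed ordering: index 0 sits highest. */
static inline uint32_t task_stack_start(uint32_t index)
{
    return SRAM_END - index * STACK_SIZE;
}

/* Initial MSP for the scheduler, assumed to sit below all task stacks. */
static inline uint32_t sched_stack_start(void)
{
    return SRAM_END - NUM_TASK_STACKS * STACK_SIZE;
}
```

Because the stack is full descending, the initial stack pointer is the address one past the region: the first push decrements it before writing.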

Each task gets its own Process Stack Pointer (PSP), which holds the current top of that task's stack. The scheduler, which runs inside exception handlers, uses the Main Stack Pointer (MSP). The distinction matters: thread mode code uses PSP, while handler mode uses MSP. This separation means the scheduler's stack is completely isolated from the tasks it manages. To switch to PSP in thread mode, you write to the CONTROL register:

MOV R0, #0x02
MSR CONTROL, R0
ISB               // make sure the new stack pointer selection takes effect before the next instruction

Stack Frame Initialization

Before any task can run, I need to fake a stack frame for each one. This is necessary because the context switch logic expects to find a valid saved register state on the stack when it restores a task. Without it, the switch would load garbage values into the CPU registers.

The Cortex-M4 hardware automatically pushes 8 registers on exception entry (xPSR, PC, LR, R12, R3–R0). To complete the fake frame, I manually push zeros for the remaining callee saved registers (R4–R11).

A few fields you have to get exactly right:

  • xPSR = 0x01000000: The Thumb bit must be set. If it isn't, the processor faults immediately when it tries to execute the task.
  • PC = address of the task handler function: This is where execution begins when the task first runs.
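  • LR = 0xFFFFFFFD: This is the EXC_RETURN value for "return to thread mode using PSP". Strictly speaking, the hardware restores this stacked slot into the task's own LR; the EXC_RETURN that steers the exception exit is the one sitting in LR inside the handler. Since the task functions never return, the stacked value mostly serves as a recognizable placeholder.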
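
Here is a minimal sketch of building that fake frame in C. The function and macro names are hypothetical, and the entry point is passed as a plain 32-bit value so the code can also be tried off-target; on the board it would be the task handler's address cast to uint32_t. Clearing bit 0 of the PC is a precaution I've added here, not something the text above requires.

```c
#include <stdint.h>

#define DUMMY_XPSR  0x01000000u  /* only the Thumb (T) bit set */
#define INITIAL_LR  0xFFFFFFFDu  /* EXC_RETURN pattern: thread mode, PSP */

/* Builds the initial frame below `stack_top` and returns the task's first PSP. */
static uint32_t *init_task_stack(uint32_t *stack_top, uint32_t entry)
{
    uint32_t *sp = stack_top;

    /* Part the hardware pops on exception return (highest address first). */
    *(--sp) = DUMMY_XPSR;        /* xPSR */
    *(--sp) = entry & ~1u;       /* PC: Thumb state comes from xPSR, not bit 0 */
    *(--sp) = INITIAL_LR;        /* LR */
    for (int i = 0; i < 5; i++) {
        *(--sp) = 0;             /* R12, R3, R2, R1, R0 */
    }

    /* Part the context switch pops itself: R11 down to R4. */
    for (int i = 0; i < 8; i++) {
        *(--sp) = 0;
    }

    return sp;                   /* 16 words below stack_top */
}
```

The returned pointer is what goes into the task's saved PSP, so the first restore of that task finds R4 at the lowest address and xPSR at the highest.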

Context Switching (PendSV)

PendSV is where the actual context switch happens. It's configured at the lowest exception priority so it never preempts other interrupt handlers; it always waits until higher priority work is done before running.

[Figure: Context Switch Flow]

Here's the sequence:

  1. Read the current PSP → MRS R0, PSP
  2. Save R4–R11 onto the current task's stack → STMDB R0!, {R4-R11}
  3. Store the updated PSP back into the task's TCB so we can restore it later
  4. Pick the next ready task (round robin, skipping blocked ones)
  5. Load the new task's saved PSP from its TCB into R0, then pop its R4–R11 → LDMIA R0!, {R4-R11}
  6. Write the updated R0 back to PSP → MSR PSP, R0
  7. Return with BX LR. On exception exit the hardware automatically restores xPSR, PC, LR, R12, and R0–R3 from the new task's stack
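
To make the pointer arithmetic in steps 2 and 5 concrete, here is a small host-side model in plain C. It is not the handler itself, which has to be assembly inside a naked function. The function names are made up, and an array stands in for the real R4–R11.

```c
#include <stdint.h>

#define NUM_CALLEE_REGS 8  /* R4..R11 */

/* Models STMDB R0!, {R4-R11}: decrement first, lowest register at lowest address. */
static uint32_t *save_callee_regs(uint32_t *psp, const uint32_t regs[NUM_CALLEE_REGS])
{
    psp -= NUM_CALLEE_REGS;
    for (int i = 0; i < NUM_CALLEE_REGS; i++) {
        psp[i] = regs[i];          /* regs[0] is R4, regs[7] is R11 */
    }
    return psp;                    /* value stored in the TCB (step 3) */
}

/* Models LDMIA R0!, {R4-R11}: load first, then increment. */
static uint32_t *restore_callee_regs(uint32_t *psp, uint32_t regs[NUM_CALLEE_REGS])
{
    for (int i = 0; i < NUM_CALLEE_REGS; i++) {
        regs[i] = psp[i];
    }
    return psp + NUM_CALLEE_REGS;  /* value written to PSP (step 6) */
}
```

After the restore, the returned pointer sits exactly at the 8-word frame that the hardware unstacks on exception exit.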

[Figure: Context Saving and Retrieving]

One thing that confused me at first: the naked attribute on PendSV_Handler is not optional. Without it, the compiler inserts its own prologue and epilogue (register saves and stack adjustments), which can clobber registers before we've saved them and disturb the stack layout we're managing by hand.

SysTick as Scheduler Tick

SysTick is configured to fire every 1ms using the 16MHz HSI clock:

count = (SYSTICK_CLK / TICK_HZ) - 1
      = (16000000 / 1000) - 1
      = 15999

The minus one matters: SysTick counts down to zero then reloads, and the interrupt fires on the transition to zero, not after the reload. Loading 15999 gives exactly 16000 clock cycles per tick, 1ms at 16MHz.
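
The same calculation as a small C helper, with a range check added because the SysTick reload register is only 24 bits wide. The helper name and the "return 0 when no usable value exists" convention are my own choices for this sketch.

```c
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* reload register is 24 bits wide */

/* Returns the reload value for `tick_hz` ticks per second, or 0 if no usable
 * value exists (tick_hz is 0, not slower than the clock, or needs > 24 bits). */
static uint32_t systick_reload_value(uint32_t clk_hz, uint32_t tick_hz)
{
    if (tick_hz == 0 || tick_hz >= clk_hz) {
        return 0;
    }
    uint32_t reload = (clk_hz / tick_hz) - 1u;  /* N + 1 cycles per period */
    return (reload > SYSTICK_MAX_RELOAD) ? 0 : reload;
}
```

With the 16MHz HSI and a 1kHz tick this gives 15999, matching the arithmetic above. Asking for a 1Hz tick from a 168MHz clock would not fit in 24 bits, so the helper reports 0 instead.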

Every tick, the SysTick handler does three things:

  1. Increments a global tick counter
  2. Checks if any blocked task's wake up tick has been reached and marks it ready
  3. Pends PendSV to trigger a context switch

SysTick itself doesn't perform the context switch. It just sets the PendSV pending bit and lets PendSV handle it at the appropriate priority level.
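
A sketch of the tick logic, written so it can run off-target. The trimmed-down TaskTiming struct and the systick_tick name are stand-ins for illustration, and the ICSR register is passed in as a pointer. On the board that pointer would be (volatile uint32_t *)0xE000ED04, where bit 28 is PENDSVSET.

```c
#include <stdint.h>

#define MAX_TASKS       5u
#define ICSR_PENDSVSET  (1u << 28)  /* PENDSVSET bit in SCB->ICSR */

typedef enum { TASK_READY, TASK_BLOCKED } TaskState;

typedef struct {
    uint32_t  block_count;          /* tick at which the task should wake */
    TaskState current_state;
} TaskTiming;                       /* trimmed-down stand-in for the TCB */

static TaskTiming tasks[MAX_TASKS];
static volatile uint32_t global_tick_count;

/* Body of the tick handler; `icsr` points at the interrupt control register. */
static void systick_tick(volatile uint32_t *icsr)
{
    global_tick_count++;                           /* 1. advance time */

    for (uint32_t i = 1; i < MAX_TASKS; i++) {     /* 2. wake due tasks (0 is idle) */
        if (tasks[i].current_state == TASK_BLOCKED &&
            tasks[i].block_count == global_tick_count) {
            tasks[i].current_state = TASK_READY;
        }
    }

    *icsr = ICSR_PENDSVSET;                        /* 3. request a context switch */
}
```

The equality comparison keeps working when the 32-bit tick counter wraps, but it relies on the handler seeing every single tick.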

Task Blocking

Initially I was using a busy wait delay(), a simple for loop spinning until enough cycles passed. It works, but it completely defeats the purpose of a scheduler. Every task just spins doing nothing, the scheduler round robins between tasks that are all wasting CPU cycles, and no real concurrency is happening.

The correct approach is task_delay(tick_count):

  1. Set the task's wake up tick = global_tick_count + tick_count
  2. Mark the task as TASK_BLOCKED
  3. Pend PendSV immediately to yield the CPU to another ready task

Now blocked tasks are skipped entirely during scheduling. If all tasks are blocked simultaneously, the idle task runs. Right now it's just a while(1), but inserting a WFI (Wait For Interrupt) instruction there would let the CPU sleep and save power until the next SysTick fires.
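
Roughly, task_delay() could look like the sketch below. The struct is a trimmed-down stand-in for the TCB, current_task is an assumed global, and the ICSR register is passed as a pointer so the logic can be exercised off-target. The two early-return guards are additions of mine: the idle task must never block, and a zero delay would set a wake up tick that has already passed.

```c
#include <stdint.h>

#define MAX_TASKS       5u
#define ICSR_PENDSVSET  (1u << 28)  /* PENDSVSET bit in SCB->ICSR */

typedef enum { TASK_READY, TASK_BLOCKED } TaskState;

typedef struct {
    uint32_t  block_count;          /* tick at which the task should wake */
    TaskState current_state;
} TaskTiming;                       /* trimmed-down stand-in for the TCB */

static TaskTiming tasks[MAX_TASKS];
static volatile uint32_t global_tick_count;
static uint32_t current_task = 1;   /* index of the running task (assumed global) */

static void task_delay(uint32_t tick_count, volatile uint32_t *icsr)
{
    if (current_task == 0 || tick_count == 0) {
        return;                     /* idle never blocks; zero delay is a no-op */
    }

    /* Unsigned addition wraps modulo 2^32, which is fine for a tick counter. */
    tasks[current_task].block_count = global_tick_count + tick_count;
    tasks[current_task].current_state = TASK_BLOCKED;

    *icsr = ICSR_PENDSVSET;         /* yield right away */
}
```

On real hardware the two TCB writes would also need protecting from the tick interrupt.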

Race Conditions

task_delay() modifies shared state (the TCB) from thread mode, while the SysTick handler modifies it from handler mode. This creates a race condition: if a tick fires in the middle of a multi step update, the scheduler can observe a partially written TCB and act on corrupted data.

The fix is to disable interrupts around the critical section using PRIMASK:

MOV R0, #0x1
MSR PRIMASK, R0   // disable all maskable interrupts
// ... critical section ...
MOV R0, #0x0
MSR PRIMASK, R0   // re-enable

PRIMASK is a blunt instrument: it blocks every interrupt with a configurable priority, so only NMI and HardFault still get through. For finer control, BASEPRI masks only interrupts at or below a chosen priority level, leaving higher priority ones unaffected.
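
One caveat with the snippet above is that it always clears PRIMASK on exit, so it would re-enable interrupts even if they were already disabled when the critical section was entered. A common pattern is to save the previous value and restore it. The sketch below uses hypothetical helper names, and the non-ARM branch is only a stand-in so the pattern can be tried on a PC.

```c
#include <stdint.h>

#if defined(__arm__)
static inline uint32_t primask_read(void)
{
    uint32_t value;
    __asm volatile ("MRS %0, PRIMASK" : "=r" (value));
    return value;
}
static inline void primask_write(uint32_t value)
{
    __asm volatile ("MSR PRIMASK, %0" : : "r" (value) : "memory");
}
#else
/* Off-target stand-in: a plain variable plays the role of PRIMASK. */
static uint32_t fake_primask;
static inline uint32_t primask_read(void)        { return fake_primask; }
static inline void primask_write(uint32_t value) { fake_primask = value & 1u; }
#endif

/* Disable interrupts and return the previous PRIMASK so it can be restored. */
static inline uint32_t critical_enter(void)
{
    uint32_t previous = primask_read();
    primask_write(1u);
    return previous;
}

/* Restore whatever state was active before critical_enter(). */
static inline void critical_exit(uint32_t previous)
{
    primask_write(previous);
}
```

Usage is a matched pair around the shared update: keep the value returned by critical_enter() in a local and hand it back to critical_exit(). Nested sections then behave correctly, because only the outermost exit actually re-enables interrupts.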

Inline Assembly

Several operations require direct register access that C can't express, so I used GCC's __asm volatile(...) with extended inline assembly syntax. The key instructions used:

  • MSR MSP, R0 — set the main stack pointer
  • MSR PSP, R0 — set the process stack pointer
  • MSR CONTROL, R0 — switch between MSP and PSP
  • MRS R0, PSP — read the current PSP value
  • STMDB / LDMIA — bulk register save/restore used in the context switch
  • MSR PRIMASK, R0 — enable/disable all maskable interrupts

Task Control Block

Each task is represented by a simple struct:

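typedef struct {
  uint32_t* psp_value;         // saved PSP: current top of this task's stack
  uint32_t  block_count;       // tick at which a blocked task wakes up
  TaskState current_state;     // TASK_READY or TASK_BLOCKED
  void (*task_handler)(void);  // task entry point
} TCB_t;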

The scheduler maintains an array of 5 TCBs (4 tasks + idle). On each scheduling decision, it scans the array in a circular fashion, skipping blocked tasks and skipping task 0 (idle) unless no other task is ready.
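
The circular scan can be sketched as a pure function over the task states. The function name and signature are hypothetical; the real scheduler presumably reads the states straight from the TCB array.

```c
#include <stdint.h>

#define MAX_TASKS 5u  /* index 0 is the idle task */

typedef enum { TASK_READY, TASK_BLOCKED } TaskState;

/* Returns the index of the next task to run, starting the scan just after
 * `current`. Falls back to 0 (idle) when no user task is ready. */
static uint32_t pick_next_task(const TaskState states[MAX_TASKS], uint32_t current)
{
    uint32_t candidate = current;

    for (uint32_t i = 0; i < MAX_TASKS; i++) {
        candidate = (candidate + 1u) % MAX_TASKS;
        if (candidate != 0 && states[candidate] == TASK_READY) {
            return candidate;
        }
    }
    return 0;
}
```

The loop visits every index once, ending on the current task itself, so a task that is the only ready one simply keeps running.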

Learnings

Context switching is really just pointer swapping. The whole mechanism boils down to saving the current stack pointer, switching to another one, and restoring. The CPU doesn't know or care which task is "running"; it simply resumes whatever context sits on the stack that PSP currently points to.

The hardware does a meaningful share of the work. On Cortex-M, exception entry and exit automatically save and restore R0–R3, R12, LR, PC, and xPSR. You only need to manually handle R4–R11. The EXC_RETURN value in LR tells the hardware which stack pointer to use on return, making the MSP/PSP handoff clean.

Naked functions require full ownership of the stack frame. The compiler adds no prologue or epilogue to a naked function: no stack setup, no register saves. You're responsible for everything, including the explicit BX LR to return. Getting this wrong silently corrupts the context and is painful to debug.

Shared state between thread mode and interrupt handlers requires explicit protection. Even a simple multi field struct update isn't atomic. If a tick fires mid update, the scheduler can see a half written TCB. PRIMASK is the straightforward fix; BASEPRI gives you more fine grained control when you need it.

PendSV exists specifically for deferred, low priority work. You don't want a context switch happening in the middle of a higher priority interrupt handler. By pending PendSV from SysTick instead of switching there directly, the actual switch is deferred to the lowest priority exception slot where it's safe to run.

Concepts

The ideas here (yielding, preemptive scheduling, per task stacks, blocking vs busy waiting) are the same foundations used by operating systems, thread libraries, and coroutine frameworks.

When you call std::this_thread::sleep_for() in C++, something very similar happens under the hood: your thread gets marked as blocked, removed from the run queue, and the scheduler picks the next ready thread. The abstraction is higher, but the underlying mechanism is the same.