In this mini blog I'll share what I learned building a round-robin preemptive task scheduler. It is written in C with some inline ARM assembly. There is no RTOS and no HAL; I access registers, stacks and exception handlers directly, working from the reference and user manuals.
What It Does
To test the scheduler, I wrote 4 tasks that blink LEDs at different rates (1s, 500ms, 250ms, 125ms). There's also an idle task that runs when every other task is blocked. A SysTick interrupt fires at 1kHz and drives the scheduling; PendSV does the actual context switch.
Implementation Details
Stack Layout
The STM32F407 has 128KB of SRAM starting at 0x20000000. The stack is full-descending, which means it grows from high addresses down to low.
I allocated 1KB per task (4 tasks + 1 idle) and another 1KB for the scheduler, all sitting at the top of SRAM.
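A minimal sketch of how that layout could be written down as macros. The names (SRAM_END, STACK_SIZE, task_stack_start and so on) and the ordering of the stacks are assumptions for illustration, not necessarily what main.h contains:

```c
#include <stdint.h>

#define SRAM_START       0x20000000u
#define SRAM_SIZE        (128u * 1024u)
#define SRAM_END         (SRAM_START + SRAM_SIZE)  /* one past the last byte */

#define STACK_SIZE       1024u  /* 1KB per stack */
#define NUM_TASK_STACKS  5u     /* 4 tasks + idle */

/* Initial stack pointer of task stack n (n = 0 .. NUM_TASK_STACKS - 1).
   Stack 0 starts at the very top of SRAM, each next one sits 1KB lower. */
static inline uint32_t task_stack_start(uint32_t n)
{
    return SRAM_END - n * STACK_SIZE;
}

/* The scheduler stack (MSP) is assumed to sit right below the task stacks. */
static inline uint32_t sched_stack_start(void)
{
    return SRAM_END - NUM_TASK_STACKS * STACK_SIZE;
}
```

Because the stack is full-descending, each start address is the first address above that stack's region: the first push decrements the pointer before storing.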

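Each task gets its own Process Stack Pointer (PSP). The scheduler, which runs inside exception handlers, uses the Main Stack Pointer (MSP). Handler mode always uses MSP; thread mode uses MSP after reset and switches to PSP once bit 1 (SPSEL) of the CONTROL register is set. PSP has to hold a valid stack address before you make the switch:
MOV R0, #0x02
MSR CONTROL, R0
ISB // make sure the new stack pointer is used by the instructions that follow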
Stack Frame Initialization
Before any task runs I need to fake a stack frame for each one, because the context switch logic expects something to restore. The Cortex-M4 hardware automatically pushes 8 registers on exception entry (xPSR, PC, LR, R12, R3-R0). For the initial frame I write those 8 by hand and then push the rest (R4-R11) as zeros.
The parts you have to get right:
- xPSR = 0x01000000: the Thumb bit must be set, otherwise you get a hardfault immediately
- PC = address of the task handler function
- LR = 0xFFFFFFFD: this is an EXC_RETURN value that tells the processor to return to thread mode using PSP
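The frame described above could be built roughly like this. It is a sketch: init_task_stack and the macro names are made up for illustration, and the task address is passed as a plain 32-bit value so the snippet also compiles on a PC:

```c
#include <stdint.h>

#define DUMMY_XPSR             0x01000000u  /* Thumb bit (bit 24) set */
#define EXC_RETURN_THREAD_PSP  0xFFFFFFFDu  /* LR value from the list above */

/* Builds the initial 16-word frame below stack_top and returns the
   resulting stack pointer, which is what would be stored in the TCB. */
uint32_t *init_task_stack(uint32_t *stack_top, uint32_t task_addr)
{
    uint32_t *sp = stack_top;

    /* Full-descending stack: decrement first, then store. */
    *(--sp) = DUMMY_XPSR;             /* xPSR */
    *(--sp) = task_addr;              /* PC   */
    *(--sp) = EXC_RETURN_THREAD_PSP;  /* LR   */

    /* R12, R3, R2, R1, R0 (hardware part), then R11..R4 (software part). */
    for (int i = 0; i < 13; i++) {
        *(--sp) = 0u;
    }
    return sp;
}
```

The returned pointer sits 16 words below stack_top, which is exactly where PSP would be after a real exception entry followed by STMDB R0!, {R4-R11}.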
Context Switching (PendSV)
PendSV is where the actual context switch happens. It is set to the lowest exception priority, so it never preempts other interrupts and only runs once they have all finished.

- Read current PSP → MRS R0, PSP
- Save R4-R11 onto the current task's stack → STMDB R0!, {R4-R11}
- Store the updated PSP back into the task's TCB
- Pick the next ready task (round-robin, skip blocked ones)
- Load saved R4-R11 from the new task's stack → LDMIA R0!, {R4-R11}
- Set PSP to the new task's saved stack pointer
- Return: hardware restores the remaining registers on its own
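To make the save and restore steps concrete, here is a plain-C model of what STMDB and LDMIA do to the stack. This is not the real handler (that one has to be assembly); cpu_model_t, save_context and restore_context are hypothetical names, and the model runs on a PC:

```c
#include <stdint.h>

/* Host-side model of the CPU state PendSV has to deal with. */
typedef struct {
    uint32_t  r4_r11[8];  /* R4..R11, not stacked by hardware */
    uint32_t *psp;        /* process stack pointer */
} cpu_model_t;

/* Models: MRS R0, PSP ; STMDB R0!, {R4-R11}.
   Returns the updated stack pointer that goes into the TCB. */
uint32_t *save_context(const cpu_model_t *cpu)
{
    uint32_t *sp = cpu->psp - 8;     /* decrement before storing */
    for (int i = 0; i < 8; i++) {
        sp[i] = cpu->r4_r11[i];      /* R4 lands at the lowest address */
    }
    return sp;
}

/* Models: LDMIA R0!, {R4-R11} ; MSR PSP, R0. */
void restore_context(cpu_model_t *cpu, uint32_t *saved_sp)
{
    for (int i = 0; i < 8; i++) {
        cpu->r4_r11[i] = saved_sp[i];
    }
    cpu->psp = saved_sp + 8;         /* increment after loading */
}
```

The only thing a task has to remember between switches is the pointer returned by save_context; everything else lives on that task's own stack.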

One thing that confused me at first: the naked attribute on PendSV_Handler is not optional. Without it the compiler adds prologue/epilogue code that messes up the stack frame.
SysTick as Scheduler Tick
SysTick fires every 1ms (16MHz HSI, reload value 16000 - 1). Every tick it does three things:
- Increments a global tick counter
- Checks if any blocked task should wake up
- Pends PendSV to trigger a context switch
The reload value:
count = (SYSTICK_CLK / TICK_HZ) - 1
= (16000000 / 1000) - 1
= 15999
That minus one matters because SysTick counts from the reload value down to 0 and then reloads, so one full period takes reload + 1 clock cycles. A reload value of 15999 gives exactly 16000 cycles, which is 1ms at 16MHz.
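A sketch of a SysTick setup using that formula. The struct mirrors the four SysTick registers, which live at 0xE000E010 on Cortex-M; systick_regs_t and systick_init are names I made up, and the register block is passed in as a pointer so the function can also be exercised off-target:

```c
#include <stdint.h>

/* SysTick register block (base address 0xE000E010 on the target). */
typedef struct {
    volatile uint32_t CSR;    /* control and status */
    volatile uint32_t RVR;    /* reload value (24 bits) */
    volatile uint32_t CVR;    /* current value */
    volatile uint32_t CALIB;  /* calibration */
} systick_regs_t;

#define SYSTICK_ENABLE     (1u << 0)
#define SYSTICK_TICKINT    (1u << 1)
#define SYSTICK_CLKSOURCE  (1u << 2)  /* use the processor clock */

/* Returns 0 on success, -1 if the reload value cannot be represented. */
int systick_init(systick_regs_t *st, uint32_t clk_hz, uint32_t tick_hz)
{
    if (tick_hz == 0u || clk_hz < tick_hz) {
        return -1;
    }
    uint32_t reload = (clk_hz / tick_hz) - 1u;
    if (reload > 0x00FFFFFFu) {       /* the counter is only 24 bits wide */
        return -1;
    }
    st->RVR = reload;
    st->CVR = 0u;                     /* any write clears the counter */
    st->CSR = SYSTICK_CLKSOURCE | SYSTICK_TICKINT | SYSTICK_ENABLE;
    return 0;
}
```

On the board the call would look something like systick_init((systick_regs_t *)0xE000E010, 16000000, 1000).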
Task Blocking
Initially I was using a busy-wait delay(), a for-loop politely called a 'software delay'. It works, but it completely defeats the purpose of having a scheduler. Every task just spins doing nothing and the scheduler round-robins between tasks that are all just wasting time.
The better way is task_delay(tick_count):
- Set the task's wake-up tick = global_tick_count + tick_count
- Mark the task as TASK_BLOCKED
- Pend PendSV to yield the CPU right away
Now blocked tasks get skipped during scheduling. If all tasks are
blocked the idle task runs. Right now it's just a while(1) but
you could put a WFI in there to save power.
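Here is a small host-runnable model of the blocking logic, just the bookkeeping without the hardware parts. delay_tcb_t is a trimmed-down, hypothetical TCB, tick_and_unblock is a made-up name for what the tick handler does, and I am assuming block_count holds the wake-up tick:

```c
#include <stdint.h>

typedef enum { TASK_READY, TASK_BLOCKED } TaskState;

typedef struct {
    uint32_t  block_count;    /* assumed: tick at which the task wakes up */
    TaskState current_state;
} delay_tcb_t;

uint32_t g_tick_count;        /* incremented once per SysTick */

void task_delay(delay_tcb_t *task, uint32_t tick_count)
{
    task->block_count   = g_tick_count + tick_count;
    task->current_state = TASK_BLOCKED;
    /* On the target, PendSV would be pended here to yield the CPU. */
}

/* What the tick handler does before pending PendSV. */
void tick_and_unblock(delay_tcb_t *tasks, uint32_t n)
{
    g_tick_count++;
    for (uint32_t i = 0; i < n; i++) {
        /* The signed difference keeps the comparison correct even when
           the 32-bit tick counter wraps around. */
        if (tasks[i].current_state == TASK_BLOCKED &&
            (int32_t)(g_tick_count - tasks[i].block_count) >= 0) {
            tasks[i].current_state = TASK_READY;
        }
    }
}
```

A task that calls task_delay(t, 2) stays blocked through the next tick and becomes ready on the one after that.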
Race Conditions
task_delay() modifies shared state (the TCB) from thread mode while SysTick modifies it from handler mode, so it is susceptible to a race condition. The fix I implemented was to disable interrupts around the critical section using PRIMASK:
MOV R0, #0x1
MSR PRIMASK, R0 // disable
// ... critical section ...
MOV R0, #0x0
MSR PRIMASK, R0 // re-enable
For something more fine-grained we can use BASEPRI instead, which masks only interrupts at or below a chosen priority level and leaves the more urgent ones enabled.
Inline Assembly
Some inline assembly to manipulate critical registers:
- MSR MSP, R0: set the main stack pointer
- MSR PSP, R0: set the process stack pointer
- MSR CONTROL, R0: switch between MSP and PSP
- MRS R0, PSP: read current PSP
- STMDB / LDMIA: bulk save/restore registers
- MSR PRIMASK, R0: enable/disable interrupts
I used __asm volatile(...) with GCC's extended inline assembly syntax.
Task Control Block
Each task is just a struct:
typedef struct {
    uint32_t* psp_value;
    uint32_t block_count;
    TaskState current_state; // TASK_READY or TASK_BLOCKED
    void (*task_handler)(void);
} TCB_t;
The scheduler loops over an array of 5 TCBs (4 tasks + idle) to find the next ready task. It's a circular scan that skips task 0 (idle) unless nothing else is ready.
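The circular scan could look something like this. TCB_t matches the struct above; the TaskState enum, MAX_TASKS and next_task are my own names, so treat this as a sketch rather than the exact code from main.c:

```c
#include <stdint.h>

#define MAX_TASKS 5u  /* task 0 is idle, tasks 1..4 are user tasks */

typedef enum { TASK_READY, TASK_BLOCKED } TaskState;

typedef struct {
    uint32_t* psp_value;
    uint32_t block_count;
    TaskState current_state;
    void (*task_handler)(void);
} TCB_t;

/* Returns the index of the next task to run, starting the scan right
   after the current one. Falls back to 0 (idle) if nothing is ready. */
uint32_t next_task(const TCB_t *tasks, uint32_t current)
{
    uint32_t candidate = current;

    /* MAX_TASKS steps visit every slot once, ending on current itself. */
    for (uint32_t i = 0; i < MAX_TASKS; i++) {
        candidate = (candidate + 1u) % MAX_TASKS;
        if (candidate != 0u && tasks[candidate].current_state == TASK_READY) {
            return candidate;
        }
    }
    return 0u;  /* everyone is blocked: run the idle task */
}
```

Starting the scan at current + 1 is what makes it round-robin: the task that just ran is the last one to be considered again.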
Learnings
The stack is really useful. The whole context switch is really just saving a pointer, swapping it with another one, and restoring. The CPU doesn't know or care which task is running; it just follows whatever the stack tells it.
Hardware also does some work out of the box. On Cortex-M, exception entry and exit automatically save and restore R0-R3, R12, LR, PC and xPSR. You only deal with R4-R11. The EXC_RETURN value in LR tells the hardware which stack pointer to use on return.
Naked functions are tricky. The compiler generates no stack frame and saves no registers for a naked function, and it doesn't know what you're doing inside it. Wrong assumptions can corrupt the context. You have to handle everything yourself, including the return (BX LR).
Shared state and interrupts can cause race conditions. Even something as simple as incrementing a variable isn't atomic. If a tick interrupt fires in the middle of a read-modify-write you get corrupted data. PRIMASK is the quick and dirty fix.
PendSV for deferring the context switch. You don't want context switches happening inside higher priority interrupts. PendSV runs at the lowest priority so it naturally defers the switch to a safe point. SysTick only pends it and leaves the actual context switch to PendSV.
Concepts
The ideas here, like yielding, preemptive scheduling, stack-per-task and blocking vs busy-waiting, are the same ones used by operating systems, thread libraries, and coroutine frameworks.
When you call std::this_thread::sleep_for() in C++ something very
similar happens under the hood. Your thread gets marked as blocked,
removed from the run queue, and the scheduler picks the next one.
Files
| File | Purpose |
|---|---|
| main.c | Scheduler, task handlers, exception handlers |
| main.h | Stack layout, macros, constants |
| led.c | GPIO setup and control for STM32F4 Discovery |
| led.h | LED pin definitions, delay constants |
Hardware
- STM32F407 Discovery Board
- ARM Cortex-M4, 128KB SRAM
- 4 onboard LEDs (PD12–PD15)
- HSI clock at 16MHz