LockFreeTaskRunner Design Document
Overview
base::LockFreeTaskRunner is a cross-platform, lock-free, multi-producer single-consumer task execution engine that underpins most of Perfetto's code, both SDK and on-device services. It provides thread-safe task posting from multiple threads while ensuring that all task execution happens on a single designated thread, eliminating the need for traditional mutex-based synchronization.
Key properties:
- No mutexes or spinlocks: bounded time for PostTask() and Run().
- In the absence of bursts (i.e. no more than 2 x 512 = 1024 tasks outstanding), no allocations are performed, other than the ones that might be required to copy around non-trivial std::function<void()>s.
- Compatible behaviour with the legacy UnixTaskRunner: tasks are extracted and processed in the same order.
On top of avoiding lock contention, this new task runner is ~2x faster than our legacy UnixTaskRunner:
$ out/rel/perfetto_benchmarks --benchmark_filter='.*BM_TaskRunner.*'
...
-------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
-------------------------------------------------------------------------------------------
BM_TaskRunner_SingleThreaded<UnixTaskRunner>        27778190 ns     27772029 ns           25
BM_TaskRunner_SingleThreaded<LockFreeTaskRunner>    10381056 ns     10375656 ns           67
BM_TaskRunner_MultiThreaded<UnixTaskRunner>           567794 ns       344625 ns         2033
BM_TaskRunner_MultiThreaded<LockFreeTaskRunner>       265943 ns       265754 ns         2749
Architecture
In the rest of this document we are going to refer to threads as:
- Writers: the N threads that invoke PostTask().
- Reader: the one thread that runs tasks in the Run() loop.
This document focuses only on the design of PostTask() and does not discuss PostDelayedTask() or Add/RemoveFileDescriptorWatch() at all. The logic of these functions is unchanged from the legacy UnixTaskRunner and they are simply implemented by hopping first to the task runner thread, and manipulating the delayed task list and FD set on the main thread. This involves an extra hop (if called from other threads) vs the legacy UnixTaskRunner. However: (1) in practice most calls to PostDelayedTask() and Add/RemoveFileDescriptorWatch() happen on the main thread in our codebase; (2) they are almost never hot paths.
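To make the hop concrete, here is a minimal sketch of how PostDelayedTask() can be expressed in terms of PostTask(). This is illustrative only: the member delayed_tasks_ and the helpers RunsTasksOnCurrentThread() and NowMs() are assumptions made for the example, not a description of the actual source.

// Hedged sketch, not the actual implementation. Assumes delayed_tasks_ is a
// FlatSet<DelayedTask> touched only by the task runner thread.
void LockFreeTaskRunner::PostDelayedTask(std::function<void()> task,
                                         uint32_t delay_ms) {
  if (RunsTasksOnCurrentThread()) {
    // Already on the task runner thread: manipulate the (non thread-safe)
    // delayed task list directly.
    delayed_tasks_.insert(DelayedTask{/*run_time_ms=*/NowMs() + delay_ms,
                                      std::move(task)});
    return;
  }
  // Called from another thread: hop to the task runner thread first.
  // This is the extra hop mentioned above; it costs one extra PostTask().
  PostTask([this, task = std::move(task), delay_ms]() mutable {
    PostDelayedTask(std::move(task), delay_ms);
  });
}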
Slab-Based Architecture
The LockFreeTaskRunner implements a Multi-Producer Single-Consumer (MPSC) queue using a slab-based approach.
A Slab contains (see the sketch after this list):
- Task array: a fixed-size array of 512 task slots (kSlabSize). These are written by the writer threads and consumed by the reader thread.
- Slot counter: an atomic counter next_task_slot for reserving slots. This is used to identify which slot in the array a writer should take (or to figure out that the slab is full). This is only accessed by the writer threads, never by the reader. When the slab is full this can grow beyond kSlabSize in case of races (which is fine: once all slots are full, the value of next_task_slot becomes useless).
- Publication bitmap: tasks_written. A fixed-size bitmap of 512 bits, one per task. Bits are flipped with an atomic release-OR operation by the writer thread to indicate that the task in the i-th slot is ready and can be consumed. The reader thread never alters this bitmap. Eventually it becomes 0xff..ff and stays like that for the rest of the Slab's lifetime.
- Consumption bitmap: tasks_read. Similar to the above, but only accessed by the reader thread. Bits are flipped to 1 as tasks are consumed. The writer threads never access this. Eventually this also becomes 0xff..ff. A Slab can be deleted only when both bitmaps are filled (all task slots have been written by the writers and consumed by the reader).
- Linked list pointer: prev, pointing to the previous Slab. This is traversed only by the reader thread. The writers only look at the latest slab and never access the prev pointer (other than when constructing a new Slab).
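The following sketch shows the Slab layout just described. The field names (tasks, next_task_slot, tasks_written, tasks_read, prev) and the constants kSlabSize and kNumWords come from this document; the exact types and the BitWord definition are illustrative assumptions.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>

// Illustrative sketch of a Slab, not the actual source.
using BitWord = uint64_t;

struct Slab {
  static constexpr size_t kSlabSize = 512;               // Task slots per Slab.
  static constexpr size_t kBitsPerWord = sizeof(BitWord) * 8;
  static constexpr size_t kNumWords = kSlabSize / kBitsPerWord;

  std::function<void()> tasks[kSlabSize];        // Written by writers, run by the reader.
  std::atomic<uint32_t> next_task_slot{0};       // Writers only: slot reservation counter.
  std::atomic<BitWord> tasks_written[kNumWords]{};  // Release-OR'd by writers on publish.
  BitWord tasks_read[kNumWords]{};               // Reader only: set as tasks are consumed.
  Slab* prev = nullptr;                          // Older Slab; traversed only by the reader.
};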
Slabs are arranged as a singly-linked list. Note that this list is NOT atomic; only the tail_ pointer is. The reader thread is the only one traversing the list; the writers only access the latest Slab and occasionally append new Slabs, replacing the tail_.
     tail_ (std::atomic<Slab*>)
       |
       ▼
+-----------------+      +-----------------+      +-----------------+
|     Slab N      |      |    Slab N-1     |      |     Slab 0      |
|  tasks: [....]  |      |  tasks: [....]  |      |  tasks: [....]  |
| next_task_slot  |      | next_task_slot  |      | next_task_slot  |
| prev -----------+----->| prev -----------+----->| prev = nullptr  |
+-----------------+      +-----------------+      +-----------------+
- Unidirectional Access: Producer threads only access the tail slab and never walk backwards.
- Consumer Ownership: Only the main thread follows prev pointers and drains tasks.
- Burst Handling: New slabs are allocated automatically by writers when the current slab fills.
In nominal conditions (i.e. in the absence of bursts of thousands of tasks) we will have only two slabs. A freelist of size 1 (free_slab_) avoids pressure on the allocator, effectively flipping between the two slabs without new/delete.
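As an illustration, a single-entry freelist along these lines could look as follows. The member name free_slab_ comes from the document; the exact reuse and reset logic shown here is an assumption.

#include <atomic>
#include <new>

// Illustrative sketch, assuming free_slab_ is a std::atomic<Slab*> member
// initialized to nullptr. Not the actual source.
Slab* LockFreeTaskRunner::AllocNewSlab() {
  // Try to reuse the cached Slab first, so that in nominal conditions we keep
  // flipping between two Slabs without touching the allocator.
  Slab* cached = free_slab_.exchange(nullptr, std::memory_order_acq_rel);
  if (cached) {
    cached->~Slab();
    new (cached) Slab();  // Reset counters, bitmaps and task slots.
    return cached;
  }
  return new Slab();
}

void LockFreeTaskRunner::DeleteSlab(Slab* slab) {
  // Park the Slab in the single-entry freelist if the slot is free,
  // otherwise fall back to the allocator.
  Slab* expected = nullptr;
  if (free_slab_.compare_exchange_strong(expected, slab,
                                         std::memory_order_acq_rel)) {
    return;  // Cached for reuse by the next AllocNewSlab().
  }
  delete slab;
}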
A singly-linked list with only a tail pointer suggests that the reader has a worst-case complexity of O(N), as it has to traverse the whole list to reach the oldest tasks (it must run tasks in FIFO order). However, in practice we expect to have only two slabs ever (and if we ever have a queue of 10k-100k tasks, walking the list is the least of our problems).
The main compromise of this design is that it scales poorly with a large number of tasks, as Run() becomes both slower (to traverse the list) and stack-greedy (it uses recursion on the stack to walk the list without using the heap). We don't expect a large number of outstanding tasks in Perfetto (with the exception of known issues like b/330580374 which should be fixed regardless).
Threading Considerations
Producer Thread Workflow
The PostTask() operation follows this lock-free protocol (a condensed sketch follows the list):
- Load Tail: Atomically load the current tail_ slab pointer.
- Acquire Refcount: Increment the refcount bucket for this Slab (discussed later).
- Reserve Slot: Atomically increment next_task_slot to reserve a position.
- Handle Overflow: If the slab is full, allocate a new slab and try to atomically update tail_.
- Write Task: Store the task in the reserved slot.
- Publish: Set the corresponding bit in the tasks_written bitmask with release semantics.
- Release Refcount: Automatically decremented when the ScopedRefCount destructor runs.
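Putting the steps together, a condensed sketch of the writer-side path is shown below. This is illustrative, not the actual source: TrySwapTail() and WakeUpReader() are hypothetical helpers standing in for the overflow handling and the reader wake-up mechanism, the Slab field names reuse the earlier sketch, and ScopedRefCount is sketched in the "Reference Counting System" section below.

// Condensed, illustrative sketch of PostTask(); not the actual source.
void LockFreeTaskRunner::PostTask(std::function<void()> task) {
  for (;;) {
    Slab* slab = tail_.load(std::memory_order_acquire);
    // Pin the Slab's refcount bucket so the reader defers deleting it while
    // we might still dereference it (see "Reference Counting System" below).
    ScopedRefCount ref(refcounts_, slab);
    // Reserve a slot.
    size_t slot = slab->next_task_slot.fetch_add(1, std::memory_order_relaxed);
    if (slot >= Slab::kSlabSize) {
      // Slab full: allocate a new Slab and try to CAS it into tail_ (see
      // "Overflow Handling" below). On a lost race, retry from the top.
      if (!TrySwapTail(&slab, &slot))
        continue;
    }
    // Write the task into the reserved slot, then publish it with release
    // semantics so the reader sees the slot contents before it sees the bit.
    slab->tasks[slot] = std::move(task);
    size_t word = slot / Slab::kBitsPerWord;
    BitWord bit = BitWord(1) << (slot % Slab::kBitsPerWord);
    slab->tasks_written[word].fetch_or(bit, std::memory_order_release);
    WakeUpReader();  // E.g. poke an event fd watched by Run().
    return;          // ~ScopedRefCount releases the refcount bucket here.
  }
}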
Overflow Handling
When a slab becomes full (slot >= kSlabSize):
Slab* new_slab = AllocNewSlab();
new_slab->prev = slab;
new_slab->next_task_slot.store(1, std::memory_order_relaxed);
slot = 0;
if (!tail_.compare_exchange_strong(slab, new_slab)) {
  // Another thread won the race, retry with their slab
  new_slab->prev = nullptr;
  DeleteSlab(new_slab);
  continue;
}
Consumer Thread Workflow
The main thread in Run() performs the following (a sketch follows the list):
- Task Draining: PopNextImmediateTask() to get the next task.
- Delayed Task Processing: Check for expired delayed tasks.
- File Descriptor Polling: Handle I/O events with fairness.
- Task Execution: Run tasks with watchdog protection.
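For orientation, here is a highly simplified sketch of one Run() iteration following these steps. Apart from PopNextImmediateTask(), every helper name (quit_, RunExpiredDelayedTasks, PollFileDescriptorWatches, MsUntilNextDelayedTask, RunTaskUnderWatchdog) and the exact ordering of the steps are assumptions for illustration purposes only.

// Highly simplified, illustrative sketch of the Run() loop; not the actual
// source. Error handling, quit signalling and FD bookkeeping are omitted.
void LockFreeTaskRunner::Run() {
  while (!quit_) {
    // Task draining: fetch the next published immediate task, if any.
    std::function<void()> task = PopNextImmediateTask();
    // Delayed task processing: run delayed tasks whose deadline has expired.
    RunExpiredDelayedTasks();
    // File descriptor polling: one poll() per iteration. The timeout is 0 if
    // an immediate task is pending, so FD watches keep being serviced fairly.
    PollFileDescriptorWatches(task ? 0 : MsUntilNextDelayedTask());
    // Task execution: run the task under the watchdog.
    if (task)
      RunTaskUnderWatchdog(std::move(task));
  }
}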
In the current design the run-loop performs one poll() per task. This is arguably optimizable: if we know that we have a burst of tasks, we could run them back-to-back without wasting syscall time on a poll(timeout=0).
Of course that would require some limit, to prevent livelocks in case a (badly designed) function keeps re-posting itself until a socket has received data (which would require an FD watch task to fire).
Unfortunately, however, through the years our tests have accumulated
dependencies on the strict fairness of the legacy UnixTaskRunner. They expect to
be able to tell through IsIdleForTesting()
if there is any upcoming FD watch
on the event horizon. As Hyrum's Law teaches, this is now an API of our
TaskRunner and will be until several tests get rewritten and de-flaked.
Task Consumption Algorithm
PopTaskRecursive() implements the consumption logic:
- It walks back the list of Slabs using recursion (in practice only going back by one Slab in nominal conditions).
- It scans all the bits in the tasks_written bitmap and ANDs them with the negation of the tasks_read bitmap to extract unconsumed tasks in order.
- If all the tasks have been read, it proceeds with the deletion of the Slab (more below).
std::function<void()> PopTaskRecursive(Slab* slab, Slab* next_slab) {
  // First, recursively check older slabs (FIFO ordering)
  Slab* prev = slab->prev;
  if (prev) {
    auto task = PopTaskRecursive(prev, slab);
    if (task) return task;
  }
  // Then check current slab for published tasks
  for (size_t w = 0; w < Slab::kNumWords; ++w) {
    BitWord wr_word = slab->tasks_written[w].load(std::memory_order_acquire);
    BitWord rd_word = slab->tasks_read[w];
    BitWord unread_word = wr_word & ~rd_word;
    // Find and consume first unread task...
  }
  // Safe slab deletion logic...
}
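For illustration, the "find and consume first unread task" step elided above could be implemented with a bit scan along these lines. This is a hedged sketch, not the real implementation; __builtin_ctzll is the GCC/Clang count-trailing-zeros builtin, and kBitsPerWord comes from the earlier Slab sketch.

    // Inside the per-word loop above: consume the lowest published-but-unread
    // slot, i.e. the oldest unconsumed task in this word.
    if (unread_word != 0) {
      size_t bit = static_cast<size_t>(__builtin_ctzll(unread_word));
      size_t slot = w * Slab::kBitsPerWord + bit;
      slab->tasks_read[w] = rd_word | (BitWord(1) << bit);  // Mark consumed.
      return std::move(slab->tasks[slot]);
    }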
Reference Counting System
At first glance, the simple access pattern, where writers only access the tail Slab and never walk back the list, seems to remove the need for complex synchronization primitives. However, there is one subtle race that requires some extra machinery.
Consider the following scenario, where two writer threads are invoking PostTask() and the reader is simultaneously running and deleting a slab.
Initial conditions:
The task runner contains only one Slab S0, which happens to be full:
tail_ -> S0 (full) -> nullptr
Race:
Thread A reads the tail_ pointer and obtains the address of S0. Before proceeding with the atomic increment of next_task_slot (which would disclose that the Slab is full) it gets pre-empted, suspending for a bit.
slab = tail_.load();  // Pre-emption happens here
slab->next_task_slot.fetch_add(1);
...
Thread B does the same, but doesn't get pre-empted. So it reads S0, figures out it is full, allocates a new Slab S1 and replaces the tail. Thread B is happy and now:
tail_ -> S1 -> S0 -> nullptr
The Run() thread starts looping. It notices that there are two slabs, and that S0 is full and is NOT the tail, hence (seemingly!) safe to delete.
At this point Thread A resumes its execution and tries to increment S0->next_task_slot, but S0 has been deleted, causing a use-after-free.
What is causing this race?
It is true that it is safe to delete a non-tail Slab, as writers do not traverse the linked list. However, a writer might still hold a pointer to a Slab that it observed back when that Slab was the tail, and the reader thread has no way to know that.
Adding a refcount (or any other property) to the Slab itself is useless, because it doesn't solve the key problem that the Slab might be gone. The mitigation needs to happen outside of the Slab.
In an intermediate design of LockFreeTaskRunner, shared_ptr<Slab> was used to mitigate this. The non-intrusive STL shared_ptr introduces an intermediate control block which decouples the Slab from its refcount.
Unfortunately, it turns out that libcxx implements the atomic accessors to shared_ptr (required to swap the shared_ptr<Slab> tail_ from different threads) with a hashed pool of 32 mutexes, in practice defeating our lock-free intentions (see __get_sp_mut).
Initial simplistic mitigation:
An initial, simplistic mitigation would be the following: imagine every writer increments a global refcount (e.g. task_runner.num_writers_active) before starting and decrements it after having finished its PostTask(). This would allow the reader to know if, at any point in time, any writers are active.
On the reader side we could skip the deletion of a slab - and try again on the next task - if num_writers_active > 0. Note that this is NOT a mutex, nor a spinlock, as nobody waits for anybody else.
It is based on this principle:
- Writers can only observe a Slab through the tail_ pointer.
- When the reader decides to delete a slab, it deletes only non-tail Slabs, so it knows that tail_ points to a slab different from the one being deleted.
- If no writers are active, nobody could have observed any Slab, let alone the Slab being deleted.
- If a writer becomes active immediately after the num_writers_active > 0 check, it will necessarily observe the new tail Slab (assuming sequential consistency) and cannot observe the older Slab being deleted.
Now, while this would solve our race, it would expose us to a problematic scenario: if a writer thread happens to be posting tasks every time Run() gets to that check, we might never be able to delete slabs.
To be honest, this scenario is quite unrealistic: if writers are always active, we are likely going to explode the task runner, assuming tasks take more time to run than what it takes to call PostTask().
Current mitigation:
In principle we would want a refcount for each Slab. But, as discussed before, the refcount cannot live on the Slab itself, because it is precisely what gates access to (and deletion of) the Slab.
We could hold a refcount-per-slab in the task runner, using a map<Slab*, atomic<int>>, but that would cause heap churn and also require a lock-free map.
What we opted for is a middle-ground solution: we have a fixed number (32) of refcount buckets and map each Slab to a bucket via a hash function.
Of course, two slabs could end up sharing the same refcount, creating a false positive: due to a hash collision, we might think a Slab is referenced even when it's not. But false positives, in this context, are harmless: in the absolute worst case we degenerate to the simplistic mitigation described above, which is still correct from a race standpoint.
In practice, the bucketing reduces the probability of deferring a Slab deletion by roughly 32x.
This is the logic that underpins the LockFreeTaskRunner.refcounts_ array of atomic integers and the ScopedRefCount class used by the writers.
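A minimal sketch of this scheme is shown below. Only the names refcounts_ and ScopedRefCount come from the document; the bucket hash, the memory orderings and the reader-side helper are illustrative assumptions.

#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative sketch, not the actual source.
static constexpr size_t kRefcountBuckets = 32;
std::atomic<int> refcounts_[kRefcountBuckets]{};

// Maps a Slab pointer to one of the 32 buckets. Collisions are harmless:
// they can only make the reader defer a deletion that was actually safe.
size_t RefcountBucket(const Slab* slab) {
  return (reinterpret_cast<uintptr_t>(slab) >> 6) % kRefcountBuckets;
}

// RAII guard held by writers for the duration of PostTask().
class ScopedRefCount {
 public:
  ScopedRefCount(std::atomic<int>* refcounts, const Slab* slab)
      : bucket_(&refcounts[RefcountBucket(slab)]) {
    bucket_->fetch_add(1, std::memory_order_acq_rel);
  }
  ~ScopedRefCount() { bucket_->fetch_sub(1, std::memory_order_acq_rel); }

 private:
  std::atomic<int>* bucket_;
};

// Reader side: a fully-consumed, non-tail Slab is deleted only if no writer
// might still hold a pointer to it; otherwise retry on a later iteration.
bool MaybeDeleteSlab(Slab* slab) {
  if (refcounts_[RefcountBucket(slab)].load(std::memory_order_acquire) > 0)
    return false;  // Defer: a writer may still reference this Slab.
  DeleteSlab(slab);
  return true;
}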
Delayed Task Handling
Delayed tasks use a separate FlatSet<DelayedTask> container. This incurs some cost to keep the entries sorted (we expect only a handful of delayed tasks, as they are mostly used for timeouts), but avoids allocations in most cases (FlatSet is based on a vector and allocates only if growth is necessary). In exchange, the reverse time ordering keeps the next task to expire at the end of the underlying vector, so Run() can pull it in O(1).
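A hedged sketch of what this could look like is shown below. The DelayedTask fields, the descending-time comparator and the use of a plain std::vector as a stand-in for the FlatSet-backed storage are assumptions made to illustrate the O(1) pull from the back of the vector.

#include <cstdint>
#include <functional>
#include <vector>

// Illustrative sketch, not the actual Perfetto types.
struct DelayedTask {
  uint64_t run_time_ms;             // Absolute wall time at which to run.
  std::function<void()> task;
  // Reverse ordering: the task that expires soonest sorts last, so it sits
  // at the end of the underlying vector and can be popped in O(1).
  bool operator<(const DelayedTask& other) const {
    return run_time_ms > other.run_time_ms;
  }
};

// Reader-side helper: pop the next expired delayed task, if any.
// `delayed_tasks` stands in for the FlatSet-backed vector storage.
std::function<void()> PopExpiredDelayedTask(
    std::vector<DelayedTask>& delayed_tasks, uint64_t now_ms) {
  if (delayed_tasks.empty() || delayed_tasks.back().run_time_ms > now_ms)
    return nullptr;
  std::function<void()> task = std::move(delayed_tasks.back().task);
  delayed_tasks.pop_back();  // O(1): the soonest task is always at the back.
  return task;
}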