TraceBuffer V2 Design document

Overview

This document covers the design of TraceBufferV2, the 2025 rewrite of the core trace buffer code, done on the occasion of ProtoVM.

TraceBuffer is the non-shared userspace buffer used by the tracing service to hold traced data in memory until it is either read back or written into a file. There is one TraceBuffer instance for each buffers section of the trace config.

Basic operating principles

NOTE: This section assumes you are familiar with the core concepts exposed in Buffers and dataflow.

TraceBuffer is a ring buffer on steroids. Unfortunately, due to the complications of the protocol (see the Key challenges section), it is far from a plain byte-oriented FIFO ring buffer when it comes to readback and deletions.

Before delving into its complications, let's explore its key operations.

Logically TraceBuffer deals with overlapping streams of data, called TraceWriter Sequences, or in short just Sequences:

From a TraceBuffer viewpoint, the only visible abstractions are the Producer (identified by a uint16_t ProducerID) and the TraceWriter (identified by a uint16_t WriterID, scoped within a producer). The 32-bit tuple {ProducerID, WriterID} constitutes the unique Sequence ID for TraceBuffer. Everything in TraceBuffer is keyed by it.
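As a concrete illustration of the keying scheme, the two 16-bit IDs can be packed into one 32-bit key. This is a minimal sketch; the helper name and bit layout are illustrative, not the actual TraceBufferV2 code:

```cpp
#include <cassert>
#include <cstdint>

using ProducerID = uint16_t;
using WriterID = uint16_t;

// Hypothetical helper: producer in the high 16 bits, writer in the low 16.
constexpr uint32_t MakeSequenceKey(ProducerID producer_id, WriterID writer_id) {
  return (static_cast<uint32_t>(producer_id) << 16) | writer_id;
}
```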

Basic operation:

Readback gives the following guarantees:

Readback happens in the following cases:

Code-wise there are four main entry-points:

Writer-side (Producer-side):

Reader-side:

Key challenges

RING_BUFFER vs DISCARD

TraceBuffer can operate in two modes.

RING_BUFFER

This is the mode used by the majority of the traces. It is also the one with the most complications. This document focuses on the operation in RING_BUFFER mode, unless otherwise specified. This mode can be used for pure ring buffer tracing, or can be coupled with write_into_file to have long traces streamed into disk, in which case the ring buffer serves mainly to decouple the SMB and the I/O activity (and to handle fragment reassembly).

DISCARD

This mode is used for one-shot traces where the user cares about the initial part of the trace. It is conceptually easier: once the end of the buffer is reached, TraceBuffer stops accepting data.

There is a slight behavioural change from the V1 implementation. V1 tried to be (too) smart about DISCARD and allowed writers to keep writing data into the buffer as long as the write and read cursors never crossed (i.e. as long as the reader kept up). This turned out to be useless and confusing: coupling DISCARD with write_into_file leads to a scenario where DISCARD behaves almost like a RING_BUFFER; however, if the reader doesn't keep up (e.g. due to lack of CPU bandwidth), TraceBuffer stops accepting data (forever). We later realized this was a confusing feature (a ring buffer that suddenly stops) and added warnings when trying to combine the two.

V2 doesn't try to be smart about readbacks and simply stops once we reach the end of the buffer, whether it has been read or not.
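The V2 policy described above can be sketched as follows. This is a toy model, not the real implementation; DiscardWriter and its members are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the V2 DISCARD policy: once a write would run past the end of
// the buffer, the buffer stops accepting data permanently, regardless of
// whether the existing data has been read back.
class DiscardWriter {
 public:
  explicit DiscardWriter(size_t buffer_size) : size_(buffer_size) {}

  // Returns true if the chunk was accepted.
  bool TryWrite(size_t chunk_size) {
    if (stopped_ || wr_ + chunk_size > size_) {
      stopped_ = true;  // Once full, reject everything from now on.
      return false;
    }
    wr_ += chunk_size;
    return true;
  }

 private:
  size_t size_;
  size_t wr_ = 0;
  bool stopped_ = false;
};
```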

Fragmentation

Packet fragmentation is the cause of most of TraceBuffer's design complexity.

Simple Fragmentation Example:

```
Chunk A (ChunkID=100)     Chunk B (ChunkID=101)     Chunk C (ChunkID=102)
┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
│[Packet1: Complete]  │   │[Packet2: Continue]  │   │[Packet2: End]       │
│[Packet2: Begin]     │   │ flags: kContFromPrev│   │ flags: kContFromPrev│
│ flags: kContOnNext  │   │ flags: kContOnNext  │   │[Packet3: Complete]  │
└─────────────────────┘   └─────────────────────┘   └─────────────────────┘
```

Fragmentation Chain: Packet2 = [Begin] → [Continue] → [End]

Key Fragmentation Challenges:

Out-of-order commits

Out-of-order commits are rare but do happen regularly in practice. They are caused by a feature called SMB Scraping, introduced in the early years of Perfetto.

SMB scraping happens when TracingServiceImpl, upon a Flush, forcefully reads the chunks in the SMB, even if they are not marked as completed, and writes them into the trace buffer.

This was necessary to deal with data sources like TrackEvent that can be used on arbitrary threads that don't have a TaskRunner, where it would be impossible to issue a PostTask(FlushTask) upon a trace protocol Flush request.

The challenge is that TracingServiceImpl, when scraping, scans the SMB in linear order and commits chunks as found. But that linear order translates into "chunk allocation order", which is unpredictable, effectively causing chunks to be committed in random order.

In practice, these instances are relatively rare, as they happen:

Important note: TraceBuffer assumes that all out-of-order commits are batched together atomically. The only known use case for OOO is SMB scraping, which commits all scraped chunks in one go within a single TaskRunner task.

Hence we assume that the following cannot happen:

After sorting chunks by ChunkID, the logic in TraceBufferV2 treats any gap it identifies in the ChunkID sequence as a data loss.
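A minimal sketch of this gap-detection idea, assuming non-wrapping ChunkIDs for brevity (the real code also has to deal with ChunkID wrap-around; HasChunkIdGaps is a hypothetical helper):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using ChunkID = uint32_t;

// After a batch of out-of-order commits, sort the chunks by ChunkID and
// treat any hole in the sequence as a data loss.
bool HasChunkIdGaps(std::vector<ChunkID> ids) {
  std::sort(ids.begin(), ids.end());
  for (size_t i = 1; i < ids.size(); ++i) {
    if (ids[i] != ids[i - 1] + 1)
      return true;  // A missing ChunkID means a chunk was lost.
  }
  return false;
}
```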

Tracking of data losses

There are several paths that can lead to a data loss, and TraceBuffer must track and report all of them. Debugging data losses is a very common activity. It's extremely important that TraceBuffer signals any case of data loss.

There are several different types and causes of data losses:

Patches

When a packet spans several fragments, it almost always involves patching, due to the nature of protobuf encoding. The problem is the following:

From a protocol viewpoint, only the last fragment of a chunk can be patched:

The information about "needs patching" is held in the SMB's Chunk flags (kChunkNeedsPatching).

The kChunkNeedsPatching state is cleared by the CommitData IPC, which contains, alongside the patches offset and payload, the bool has_more_patches flag. When false, it causes the kChunkNeedsPatching state to be cleared.
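Conceptually, applying one of those patches boils down to a bounds-checked overwrite of previously reserved bytes in the chunk. A hedged sketch with illustrative names (the real patch plumbing in the tracing service is more involved):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Perfetto patches are fixed-size 4-byte overwrites (e.g. a placeholder size
// field written before the final fragment size was known).
constexpr size_t kPatchSize = 4;

// Illustrative helper: applies one (offset, payload) patch from a CommitData
// request onto a chunk payload, rejecting out-of-bounds (untrusted) offsets.
bool ApplyPatch(std::vector<uint8_t>& chunk_payload, size_t offset,
                const uint8_t (&patch)[kPatchSize]) {
  if (offset + kPatchSize > chunk_payload.size())
    return false;  // Untrusted offset: reject out-of-bounds patches.
  memcpy(chunk_payload.data() + offset, patch, kPatchSize);
  return true;
}
```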

From a TraceBuffer viewpoint patching has the following implications:

Recommits

Recommit means committing a chunk again when a chunk with the same ChunkID already exists in the buffer. The only legitimate case of recommit is SMB scraping followed by an actual commit. We neither expect nor support producers trying to re-commit the same chunk N times, as that unavoidably leads to undefined behaviour (what if the tracing service has already written the packets to a file?).

This is the scenario when recommit can legitimately happen:

NOTE: kChunkNeedsPatching and kIncomplete are two different and orthogonal chunk states. kIncomplete has nothing to do with fragments and is purely about SMB scraping (and the fact that we have to be conservative and ignore the last fragment).

Implications:

The main complication of incomplete chunks is that we cannot know the size of their payload upfront. Because of this, we have to conservatively copy the whole chunk and reserve its full size in the buffer.

Buffer cloning

Buffer cloning happens via the CloneReadOnly() method. As the name suggests, it creates a new TraceBuffer instance which contains the same contents but can only be read from. This exists to support CLONE_SNAPSHOT triggers.

Architecturally buffer cloning is not particularly complicated, at least in the current design. The main design implications are:

ProtoVM

ProtoVM is an upcoming feature of TracingService. It's a non-Turing-complete VM language to describe proto merging operations, used to keep track of the state of arbitrary proto-encoded data structures as we overwrite data in the trace buffer. ProtoVM is the reason that triggered the redesign of TraceBuffer V2.

Without getting into its details, the primary requirement of ProtoVM is the following: when overwriting chunks in the trace buffer, we must pass valid packets from these soon-to-be-deleted chunks to ProtoVM. We must do so in order, hence replicating the same logic that we would use when doing a readback.

Internal docs about ProtoVM:

Overwrites

For the aforementioned ProtoVM reasons, in the V2 design the logic that deals with ring buffer overwrites (DeleteNextChunksFor()) is almost identical to the readback logic, and shares most of its code with it.

I say almost because there is (of course) a subtle difference: when deleting chunks, stalling (whether due to pending patches or incompleteness) is NOT an option. Old chunks MUST go to make space for new chunks, no matter what.

So overwrites are the equivalent of a no-stalling, force-delete readback.

Core design

There are two main data structures involved:

TBChunk

TBChunk is the struct stored in the trace buffer memory as a result of calling CopyChunkUntrusted() on a SMB chunk.

A TBChunk is very similar to a SMB chunk with the following caveats:

In a nutshell TBChunk is:

SequenceState

It maintains the state of a {ProducerID, WriterID} sequence.

Its most important feature is maintaining a (logically) ordered list of TBChunk(s) for that sequence, sorted by ChunkID. The "list" is actually a CircularQueue of offsets, which has O(1) push_back() and pop_front() operations.

The lifetime of a SequenceState has a subtle tradeoff:

TraceBufferV2 balances this using a lazy sweeping approach: it allows the most recently deleted SequenceStates to stay alive, up to kKeepLastEmptySeq = 1024. See DeleteStaleEmptySequences().

FragIterator

A simple class that tokenizes fragments in a chunk and allows forward-only iteration.

It deals with untrusted data, detecting malformed / out of bounds scenarios.

It does not alter the state of the buffer.
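The idea can be sketched as follows. This toy version uses a 2-byte little-endian size prefix rather than the real chunk fragment encoding, and all names are illustrative; the point is forward-only iteration with strict bounds checks on untrusted sizes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Forward-only tokenizer over an untrusted chunk payload. A malformed size
// can never cause an out-of-bounds read: iteration just stops.
class FragIterator {
 public:
  FragIterator(const uint8_t* data, size_t size) : data_(data), size_(size) {}

  // Returns the size of the next fragment, or 0 on end / malformed input.
  // On success, *payload points at the fragment bytes.
  size_t Next(const uint8_t** payload) {
    if (off_ + 2 > size_)
      return 0;  // Not enough bytes left even for the size prefix.
    size_t frag_size =
        data_[off_] | (static_cast<size_t>(data_[off_ + 1]) << 8);
    if (frag_size == 0 || off_ + 2 + frag_size > size_)
      return 0;  // Out-of-bounds / malformed: stop iterating.
    *payload = data_ + off_ + 2;
    off_ += 2 + frag_size;
    return frag_size;
  }

 private:
  const uint8_t* data_;
  size_t size_;
  size_t off_ = 0;
};
```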

ChunkSeqIterator

A simple utility class that iterates over the ordered list of TBChunk for a given SequenceState. It merely follows the SequenceState.chunks queue and detects gaps.

ChunkSeqReader

Encapsulates most of the readback complexity. It reads and consumes chunks in sequence order, as follows:

Buffer order vs Sequence order

Chunks can be visited in two different ways:

  1. Buffer order: in the order they have been written in the buffer. In the example below: A1, B1, B3, A2, B2

  2. In Sequence order: in the order they appear in the SequenceState's list.

Writing chunks

When chunks are written via CopyChunkUntrusted() a new TBChunk is allocated in the buffer's PagedMemory using the usual bump-pointer pattern you'd expect from a ring-buffer. Chunks are variable-size, and are stored contiguously with 32-bit alignment.
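The bump-pointer-with-alignment pattern can be sketched as follows (illustrative helpers, not the actual code; wrapping and overwrite handling are omitted):

```cpp
#include <cassert>
#include <cstdint>

// Rounds a value up to the next multiple of 4 (32-bit alignment).
constexpr uint32_t AlignUp4(uint32_t v) {
  return (v + 3u) & ~3u;
}

// Returns the offset at which a chunk of `payload_size` bytes is placed and
// advances the write cursor `wr` to the next aligned position.
uint32_t BumpAlloc(uint32_t& wr, uint32_t payload_size) {
  uint32_t off = wr;
  wr = AlignUp4(wr + payload_size);
  return off;
}
```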

The offset of the chunk is also appended in the SequenceState.chunks list.

After the first wrap, writing a chunk involves deleting one or more existing chunks. The deletion operation, DeleteNextChunksFor(), is as complex as a readback, because it reconstructs the packets being deleted, in order, to pass them to ProtoVM.

So the writing itself is straightforward, but the deletion (overwrite) of existing chunks is where most of the complexity lies. This is described in the next section.

DeleteNextChunksFor() Flow

```mermaid
flowchart TD
    A[DeleteNextChunksFor bytes_to_clear] --> B[Initialize: off = wr_, clear_end = wr_ + bytes_to_clear]
    B --> C{off < clear_end?}
    C -->|No| M[Create padding chunks for partial deletions]
    C -->|Yes| D{off >= used_size_?}
    D -->|Yes| N[Break - nothing to delete in unused space]
    D -->|No| E[chunk = GetTBChunkAt off]
    E --> F{chunk.is_padding?}
    F -->|Yes| G[Update padding stats, off += chunk.outer_size]
    F -->|No| H[Create ChunkSeqReader in kEraseMode]
    H --> I[ReadNextPacketInSeqOrder loop]
    I --> J{Packet found?}
    J -->|Yes| K[Pass packet to ProtoVM, has_cleared_fragments = true]
    J -->|No| L{has_cleared_fragments?}
    K --> I
    L -->|Yes| O[Mark sequence data_loss = true]
    L -->|No| P[No data loss]
    O --> Q[Update overwrite stats, off += chunk.outer_size]
    P --> Q
    Q --> R{More chunks in range?}
    R -->|Yes| C
    R -->|No| M
    G --> C
    M --> S[Scan remaining range for padding]
    S --> T{Partial chunk at end?}
    T -->|Yes| U[Create new padding chunk for remaining space]
    T -->|No| V[End]
    U --> V
    N --> V
    style A fill:#e1f5fe
    style K fill:#fff3e0
    style O fill:#ffcdd2
    style V fill:#c8e6c9
```

Key Differences from ReadNextTracePacket:

Reading back packets

Readback (ReadNextTracePacket()) is where most of TraceBuffer's complexity lies, as it needs to reassemble packets from fragments, deal with gaps / data losses, handle the interleaving of chunks from different sequences, and cope with out-of-order commits.

```mermaid
flowchart TD
    A[ReadNextTracePacket Start] --> B{chunk_seq_reader_ exists?}
    B -->|No| C[Get chunk at rd_]
    C --> D{Is chunk padding?}
    D -->|Yes| E[rd_ += chunk.outer_size, check wrap around]
    D -->|No| F[Create ChunkSeqReader for this chunk]
    B -->|Yes| G[ReadNextPacketInSeqOrder]
    F --> G
    G --> H{Packet found?}
    H -->|Yes| I[Set sequence properties, set data_loss flag, return packet]
    H -->|No| J[Get end chunk from reader, rd_ = end_offset + size]
    J --> K{rd_ == wr_ OR wrapped to wr_?}
    K -->|Yes| L[Return false - no more data]
    K -->|No| M[Reset chunk_seq_reader_, handle wrap around]
    E --> N{rd_ wrapped around?}
    N -->|Yes| O[rd_ = 0]
    N -->|No| P[Continue with new rd_]
    O --> K
    P --> K
    M --> B
    style A fill:#e1f5fe
    style I fill:#c8e6c9
    style L fill:#ffcdd2
```

ChunkSeqReader Internal Flow

This is how ChunkSeqReader::ReadNextPacketInSeqOrder() works:

```mermaid
flowchart TD
    A[ReadNextPacketInSeqOrder] --> B{skip_in_generation?}
    B -->|Yes| C[Return false - stalled]
    B -->|No| D[NextFragmentInChunk]
    D --> E{Fragment found?}
    E -->|Yes| F{Fragment type?}
    F -->|kFragWholePacket| G[ConsumeFragment, return packet]
    F -->|kFragBegin| H[ReassembleFragmentedPacket]
    F -->|kFragEnd/Continue| I[Data loss - unexpected, ConsumeFragment, continue loop]
    E -->|No| J{Chunk corrupted?}
    J -->|Yes| K[Mark data_loss = true]
    J -->|No| L{Chunk incomplete?}
    L -->|Yes| M[Set skip_in_generation, return false]
    L -->|No| N[EraseCurrentChunk]
    K --> N
    N --> O{Reached end chunk?}
    O -->|Yes| P[Return false]
    O -->|No| Q[NextChunkInSequence]
    Q --> R{Next chunk exists?}
    R -->|No| P
    R -->|Yes| S[iter_ = next_chunk, create new FragIterator]
    S --> D
    H --> T{Reassembly result?}
    T -->|Success| U[Return reassembled packet]
    T -->|NotEnoughData| V[Set skip_in_generation, return false]
    T -->|DataLoss| W[Mark data_loss = true, continue loop]
    style A fill:#e1f5fe
    style G fill:#c8e6c9
    style U fill:#c8e6c9
    style C fill:#ffcdd2
    style P fill:#ffcdd2
    style V fill:#fff3e0
```

Dealing with out-of-order chunks

But things are more complicated. Let's first bring only out-of-ordering into the picture. With reference to the drawing above, let's imagine the write cursor is at offset=48, right before B3.

If we proceeded simply in buffer order we would break FIFO-ness, as we would first emit the packets contained in B3, then A2 (this is fine) and ultimately B2 (this is problematic).

The only valid linearizations that preserve in-sequence FIFO-ness are [A2,B2,B3], [B2,B3,A2] or [B2,A2,B3].
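To make the FIFO-ness property precise: a linearization is valid iff, within each sequence, chunks appear in increasing ChunkID order. A toy checker (hypothetical code, just to pin down the invariant):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// A chunk is identified here by a {sequence, chunk_id} pair.
bool PreservesSequenceFifo(
    const std::vector<std::pair<uint32_t, uint32_t>>& order) {
  std::map<uint32_t, uint32_t> last_seen;  // sequence -> last ChunkID emitted.
  for (const auto& [seq, chunk_id] : order) {
    auto it = last_seen.find(seq);
    if (it != last_seen.end() && chunk_id <= it->second)
      return false;  // Would emit an older chunk after a newer one.
    last_seen[seq] = chunk_id;
  }
  return true;
}
```

For instance, with sequence A = 0 and B = 1, the order [A2, B2, B3] passes the check, while the raw buffer order [B3, A2, B2] from the example above fails it.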

In order to deal with this we introduce a two layer walk in the readback code:

In the code, the outer-layer walk is implemented by TraceBufferV2::ReadNextTracePacket(), while the inner walk is implemented by ChunkSeqReader::ReadNextPacketInSeqOrder().

Benchmarks

Apple MacBook (M4)

```
BM_TraceBuffer_WR_SingleWriter<TraceBufferV1>     bytes_per_second=9.77742G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV2>     bytes_per_second=12.6395G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV1>  bytes_per_second=8.65385G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV2>  bytes_per_second=11.7582G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV1>     bytes_per_second=4.27694G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV2>     bytes_per_second=4.35475G/s
```

Google Pixel 7

```
BM_TraceBuffer_WR_SingleWriter<TraceBufferV1>     bytes_per_second=4.4379G/s
BM_TraceBuffer_WR_SingleWriter<TraceBufferV2>     bytes_per_second=3.7931G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV1>  bytes_per_second=3.19148G/s
BM_TraceBuffer_WR_MultipleWriters<TraceBufferV2>  bytes_per_second=3.47354G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV1>     bytes_per_second=1.26698G/s
BM_TraceBuffer_RD_MixedPackets<TraceBufferV2>     bytes_per_second=1.35394G/s
```