ProtoZero design document

ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library purposefully built for Perfetto's tracing use cases.

Motivations

ProtoZero has been designed and optimized for proto serialization, which is used by all Perfetto tracing paths. Deserialization was introduced only at a later stage of the project and is mainly used by offline tools (e.g., TraceProcessor). The zero-copy zero-alloc zero-syscall statement applies only to the serialization code.

Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace event in Perfetto is a proto (see TracePacket reference). This allows events to be strongly typed and makes it easier for the team to maintain backwards compatibility using a language that is understood across the board.

Tracing fast-paths need to have very little overhead, because instrumentation points are sprinkled all over the codebase of projects like Android and Chrome and are performance-critical.

Overhead here is not just defined as the CPU time (or instructions retired) it takes to execute the instrumentation point. A big source of overhead in a tracing system is the working set of the instrumentation points, specifically the extra I-cache and D-cache misses that slow down the non-tracing code after the tracing instrumentation point.

The major design departures of ProtoZero from canonical C++ protobuf libraries like libprotobuf are:

Usage

At the build-system level, ProtoZero is extremely similar to the conventional libprotobuf library. The ProtoZero .proto -> .pbzero.{cc,h} compiler is built on top of the libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a protoc compiler plugin.

ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends on libprotobuf's parser and compiler). The .pbzero.{cc,h} code generated by it, however, has no runtime dependency (not even header-only dependencies) on libprotobuf.

In order to generate ProtoZero stubs from proto you need to:

  1. Build the ProtoZero compiler plugin, which lives in src/protozero/protoc_plugin/.

    tools/ninja -C out/default protozero_plugin protoc
  2. Invoke the libprotobuf protoc compiler passing the protozero_plugin:

    out/default/protoc \
        --plugin=protoc-gen-plugin=out/default/protozero_plugin \
        --plugin_out=wrapper_namespace=pbzero:/tmp/ \
        test_msg.proto

    This generates /tmp/test_msg.pbzero.{cc,h}.

    NOTE: The .cc file is always empty. ProtoZero-generated code is header only. The .cc file is emitted only because some build systems' rules assume that protobuf codegens generate both a .cc and a .h file.

Proto serialization

The quickest way to understand ProtoZero design principles is to start from a small example and compare the generated code between libprotobuf and ProtoZero.

    syntax = "proto2";

    message TestMsg {
      optional string str_val = 1;
      optional int32 int_val = 2;
      repeated TestMsg nested = 3;
    }

libprotobuf approach

The libprotobuf approach is to generate a C++ class that has one member for each proto field, with dedicated serialization and de-serialization methods.

    out/default/protoc --cpp_out=. test_msg.proto

generates test_msg.pb.{cc,h}. With many degrees of simplification, it looks as follows:

    // This class is generated by the standard protoc compiler in the .pb.h source.
    class TestMsg : public protobuf::MessageLite {
     private:
      int32 int_val_;
      ArenaStringPtr str_val_;
      RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>

     public:
      const std::string& str_val() const;
      void set_str_val(const std::string& value);

      bool has_int_val() const;
      int32_t int_val() const;
      void set_int_val(int32_t value);

      ::TestMsg* add_nested();
      ::TestMsg* mutable_nested(int index);
      const TestMsg& nested(int index);

      std::string SerializeAsString();
      bool ParseFromString(const std::string&);
    };

The main characteristics of these stubs are:

ProtoZero approach

    // This class is generated by the ProtoZero plugin in the .pbzero.h source.
    class TestMsg : public protozero::Message {
     public:
      void set_str_val(const std::string& value) {
        AppendBytes(/*field_id=*/1, value.data(), value.size());
      }
      void set_str_val(const char* data, size_t size) {
        AppendBytes(/*field_id=*/1, data, size);
      }
      void set_int_val(int32_t value) {
        AppendVarInt(/*field_id=*/2, value);
      }
      TestMsg* add_nested() {
        return BeginNestedMessage<TestMsg>(/*field_id=*/3);
      }
    };

The ProtoZero-generated stubs are append-only. As the set_*, add_* methods are invoked, the passed arguments are directly serialized into the target buffer. This introduces some limitations:

This has a number of advantages:

Scattered buffer writing

A key part of the ProtoZero design is supporting direct serialization on non-globally-contiguous sequences of contiguous memory regions.

This happens by decoupling protozero::Message, the base class for all the generated classes, from the protozero::ScatteredStreamWriter. The problem it solves is the following: ProtoZero is based on direct serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in most cases. At the same time, there is no limit on how much data the caller will try to write into an individual message: a trace event can be up to 256 MiB big.

ProtoZero scattered buffers diagram

Fast-path

At all times the underlying ScatteredStreamWriter knows the bounds of the current buffer. All write operations are bounds-checked and hit a slow-path when crossing the buffer boundary.

Most write operations can be completed within the current buffer boundaries. In that case, the cost of a set_* operation is in essence a memcpy() with the extra overhead of var-int encoding for protobuf preambles and length-delimited fields.
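To make the fast-path cost concrete, the sketch below shows what a set_* call roughly boils down to at the wire level: a varint preamble, an optional varint length, and a memcpy(). It is not ProtoZero's code. ToyWriter and all of its methods are hypothetical names, the buffer is a fixed 64-byte array, and the bounds check simply reports failure where a real writer would enter its slow-path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-buffer writer, for illustration only (not ProtoZero API).
struct ToyWriter {
  uint8_t buf[64];
  size_t pos = 0;

  // Appends `value` as a base-128 varint: 7 payload bits per byte, with the
  // most significant bit set on every byte except the last one.
  bool WriteVarInt(uint64_t value) {
    uint8_t tmp[10];  // A 64-bit varint needs at most 10 bytes.
    size_t n = 0;
    while (value >= 0x80) {
      tmp[n++] = static_cast<uint8_t>((value & 0x7f) | 0x80);
      value >>= 7;
    }
    tmp[n++] = static_cast<uint8_t>(value);
    return WriteRaw(tmp, n);
  }

  // Bounds check followed by a plain memcpy().
  bool WriteRaw(const void* src, size_t size) {
    if (size > sizeof(buf) - pos)
      return false;  // A real writer would enter its slow path here.
    std::memcpy(buf + pos, src, size);
    pos += size;
    return true;
  }

  // Preamble = (field_id << 3) | wire_type. Wire type 0 is "varint".
  bool AppendVarIntField(uint32_t field_id, uint64_t value) {
    assert(field_id > 0);
    return WriteVarInt(static_cast<uint64_t>(field_id) << 3) &&
           WriteVarInt(value);
  }

  // Wire type 2 is "length-delimited": preamble, varint length, payload.
  bool AppendBytesField(uint32_t field_id, const void* data, size_t size) {
    assert(field_id > 0);
    return WriteVarInt((static_cast<uint64_t>(field_id) << 3) | 2) &&
           WriteVarInt(size) && WriteRaw(data, size);
  }
};
```

With this sketch, AppendVarIntField(2, 42) emits the two bytes 10 2a, and AppendBytesField(1, "foo", 3) emits 0a 03 66 6f 6f.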

Slow-path

When crossing the boundary, the slow-path asks the ScatteredStreamWriter::Delegate for a new buffer. The implementation of GetNewBuffer() is up to the client. In tracing use-cases, that call will acquire a new thread-local chunk from the tracing shared memory buffer.

Other heap-based implementations are possible. For instance, the ProtoZero sources provide a helper class HeapBuffered<TestMsg>, mainly used in tests (see scattered_heap_buffer.h), which allocates a new heap buffer when crossing the boundaries of the current one.

Consider the following example:

    TestMsg outer_msg;
    for (int i = 0; i < 1000; i++) {
      TestMsg* nested = outer_msg.add_nested();
      nested->set_int_val(42);
    }

At some point one of the set_int_val() calls will hit the slow-path and acquire a new buffer. The overall idea is to have a serialization mechanism that is extremely lightweight most of the time and that requires some extra function calls when crossing a buffer boundary, so that their cost gets amortized across all trace events.

In the context of the overall Perfetto tracing use case, the slow-path involves grabbing a process-local mutex and finding the next free chunk in the shared memory buffer. Hence writes are lock-free as long as they happen within the thread-local chunk and require a critical section to acquire a new chunk once every 4KB-32KB (depending on the trace configuration).

The assumption is that the likelihood that two threads will cross the chunk boundary and call GetNewBuffer() at the same time is extremely low, and hence the critical section is uncontended most of the time.
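The fast-path and slow-path split described above can be sketched as follows. This is a minimal model and not the real ScatteredStreamWriter: ToyRange, ToyDelegate, ToyHeapDelegate and ToyScatteredWriter are hypothetical names, and only the GetNewBuffer() name is taken from the text (its signature here is an assumption). The delegate is heap-backed and does no locking, whereas the tracing delegate would take a mutex to find the next free shared memory chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// A contiguous memory region: [begin, end).
struct ToyRange {
  uint8_t* begin;
  uint8_t* end;
};

// Hypothetical stand-in for the buffer delegate. Only the GetNewBuffer() name
// comes from the text; the signature is an assumption.
class ToyDelegate {
 public:
  virtual ~ToyDelegate() = default;
  virtual ToyRange GetNewBuffer() = 0;
};

// Heap-backed delegate: hands out a fresh fixed-size chunk on every request.
class ToyHeapDelegate : public ToyDelegate {
 public:
  explicit ToyHeapDelegate(size_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size_ > 0);
  }
  ToyRange GetNewBuffer() override {
    chunks_.push_back(std::make_unique<uint8_t[]>(chunk_size_));
    uint8_t* start = chunks_.back().get();
    return ToyRange{start, start + chunk_size_};
  }
  size_t num_chunks() const { return chunks_.size(); }
  const uint8_t* chunk(size_t i) const { return chunks_[i].get(); }

 private:
  size_t chunk_size_;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};

class ToyScatteredWriter {
 public:
  explicit ToyScatteredWriter(ToyDelegate* delegate)
      : delegate_(delegate), cur_(delegate->GetNewBuffer()), pos_(cur_.begin) {}

  void WriteBytes(const uint8_t* src, size_t size) {
    // Fast path: one bounds check and a memcpy() into the current chunk.
    if (size <= static_cast<size_t>(cur_.end - pos_)) {
      std::memcpy(pos_, src, size);
      pos_ += size;
      return;
    }
    WriteBytesSlowPath(src, size);
  }

  size_t slow_path_hits() const { return slow_path_hits_; }

 private:
  // Slow path: fill what is left of the current chunk, then keep asking the
  // delegate for new chunks until the whole payload has been written.
  void WriteBytesSlowPath(const uint8_t* src, size_t size) {
    ++slow_path_hits_;
    while (size > 0) {
      if (pos_ == cur_.end) {
        cur_ = delegate_->GetNewBuffer();
        pos_ = cur_.begin;
      }
      const size_t n = std::min(size, static_cast<size_t>(cur_.end - pos_));
      std::memcpy(pos_, src, n);
      pos_ += n;
      src += n;
      size -= n;
    }
  }

  ToyDelegate* delegate_;
  ToyRange cur_;
  uint8_t* pos_;
  size_t slow_path_hits_ = 0;
};
```

With 16-byte chunks and ten 6-byte writes, this sketch allocates four chunks and takes the slow path three times. The other seven writes are a single bounds check plus memcpy().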

    sequenceDiagram
      participant C as Call site
      participant M as Message
      participant SSR as ScatteredStreamWriter
      participant DEL as Buffer Delegate
      C->>M: set_int_val(...)
      activate C
      M->>SSR: AppendVarInt(...)
      deactivate C
      Note over C,SSR: A typical write on the fast-path

      C->>M: set_str_val(...)
      activate C
      M->>SSR: AppendString(...)
      SSR->>DEL: GetNewBuffer(...)
      deactivate C
      Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.

Deferred patching

Nested messages in the protobuf binary encoding are prefixed with their varint-encoded size.

Consider the following:

    TestMsg* nested = outer_msg.add_nested();
    nested->set_int_val(42);
    nested->set_str_val("foo");

The canonical encoding of this protobuf message, using libprotobuf, would be:

    1a 07 0a 03 66 6f 6f 10 2a
    ^-+-^ ^-----+------^ ^-+-^
      |         |          |
      |         |          +--> Field ID: 2 [int_val], value = 42.
      |         |
      |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
      |
      +------> Field ID: 3 [nested], length: 7  # !!!

The second byte in this sequence (07) is problematic for direct encoding. At the point where outer_msg.add_nested() is called, we can't possibly know upfront what the overall size of the nested message will be (in this case, 5 + 2 = 7).

The way we get around this in ProtoZero is by reserving four bytes for the size of each nested message and back-filling them once the message is finalized (or when we try to set a field in one of the parent messages). We do this by encoding the size of the message using redundant varint encoding, in this case: 87 80 80 00 instead of 07.
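The redundant encoding can be illustrated with a short sketch. The function and constant names below are hypothetical and are not ProtoZero's own helpers. The idea is to always emit four bytes, setting the continuation bit on the first three even when the remaining payload bits are zero, so that a standard varint decoder still reads back the same value. Four bytes carry 4 x 7 = 28 payload bits, so the sketch asserts that the size is below 2^28 (256 MiB), which is consistent with the message size limit quoted earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes reserved for a nested message size (from the text above).
constexpr size_t kToySizeFieldBytes = 4;

// Hypothetical helper: writes `size` as a fixed-width 4-byte varint. The
// first three bytes always carry the continuation bit (0x80), even when the
// remaining payload bits are all zero.
inline void ToyWriteRedundantVarInt(uint32_t size, uint8_t* dst) {
  assert(size < (1u << 28));  // 4 bytes x 7 payload bits each.
  for (size_t i = 0; i < kToySizeFieldBytes; ++i) {
    const uint8_t continuation = (i + 1 < kToySizeFieldBytes) ? 0x80 : 0x00;
    dst[i] = static_cast<uint8_t>((size & 0x7f) | continuation);
    size >>= 7;
  }
}

// Plain varint decoder, unaware of the redundancy. Stores the number of
// bytes consumed in `consumed`.
inline uint64_t ToyReadVarInt(const uint8_t* src, size_t* consumed) {
  uint64_t value = 0;
  size_t i = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    const uint8_t byte = src[i++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0)
      break;
  }
  *consumed = i;
  return value;
}
```

Encoding 7 with ToyWriteRedundantVarInt() yields 87 80 80 00, and ToyReadVarInt() decodes both that sequence and the canonical single byte 07 to the value 7.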

At the C++ level, the protozero::Message class holds a pointer to its size field, which typically points to the beginning of the message, where the four bytes are reserved, and back-fills it in the Message::Finalize() pass.
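For the contiguous-buffer case, the reserve-then-back-fill sequence might look like the sketch below. ToyNestedWriter is a hypothetical class, not protozero::Message. It stores an offset into a std::vector rather than a raw pointer (so that vector growth cannot invalidate it), supports one level of nesting, and restricts field IDs, values and string lengths to what fits in a single byte to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical single-level writer that demonstrates size back-filling.
class ToyNestedWriter {
 public:
  // Emits the length-delimited preamble and reserves 4 bytes for the size.
  void BeginNested(uint32_t field_id) {
    assert(!open_ && field_id > 0 && field_id < 16);
    buf_.push_back(static_cast<uint8_t>((field_id << 3) | 2));
    size_offset_ = buf_.size();
    buf_.insert(buf_.end(), 4, 0);  // Placeholder, patched in Finalize().
    open_ = true;
  }

  // Varint field (wire type 0), limited to single-byte values.
  void AppendVarInt(uint32_t field_id, uint8_t value) {
    assert(field_id > 0 && field_id < 16 && value < 0x80);
    buf_.push_back(static_cast<uint8_t>(field_id << 3));
    buf_.push_back(value);
  }

  // Length-delimited field (wire type 2), limited to short strings.
  void AppendString(uint32_t field_id, const std::string& s) {
    assert(field_id > 0 && field_id < 16 && s.size() < 0x80);
    buf_.push_back(static_cast<uint8_t>((field_id << 3) | 2));
    buf_.push_back(static_cast<uint8_t>(s.size()));
    buf_.insert(buf_.end(), s.begin(), s.end());
  }

  // Back-fills the reserved bytes with the now-known nested size, encoded
  // as a 4-byte redundant varint.
  void Finalize() {
    assert(open_);
    size_t size = buf_.size() - (size_offset_ + 4);
    assert(size < (1u << 28));
    for (size_t i = 0; i < 4; ++i) {
      const uint8_t continuation = (i < 3) ? 0x80 : 0x00;
      buf_[size_offset_ + i] =
          static_cast<uint8_t>((size & 0x7f) | continuation);
      size >>= 7;
    }
    open_ = false;
  }

  const std::vector<uint8_t>& bytes() const { return buf_; }

 private:
  std::vector<uint8_t> buf_;
  size_t size_offset_ = 0;
  bool open_ = false;
};
```

Calling BeginNested(3), AppendVarInt(2, 42), AppendString(1, "foo") and Finalize() on this sketch produces 1a 87 80 80 00 10 2a 0a 03 66 6f 6f. Unlike the canonical encoding shown earlier, fields appear in the order they were appended and the size occupies four bytes instead of one.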

This works fine for cases where the entire message lies in one contiguous buffer but opens a further challenge: a message can be several MBs big. Looking at this from the overall tracing perspective, the shared memory buffer chunk that holds the beginning of a message can be long gone (i.e. committed in the central service buffer) by the time we get to the end.

In order to support this use case, at the tracing code level (outside of ProtoZero), when a message crosses the buffer boundary, its size field gets redirected to a temporary patch buffer (see patch_list.h). This patch buffer is then sent out-of-band, piggybacking over the next commit IPC (see Tracing Protocol ABI).

Performance characteristics

NOTE: For the full code of the benchmark see /src/protozero/test/protozero_benchmark.cc

We consider two scenarios: writing a simple event and writing a nested event.

Simple event

Consists of filling a flat proto message with 4 integers (2 x 32-bit, 2 x 64-bit) and a 32-byte string, as follows:

    void FillMessage_Simple(T* msg) {
      msg->set_field_int32(...);
      msg->set_field_uint32(...);
      msg->set_field_int64(...);
      msg->set_field_uint64(...);
      msg->set_field_string(...);
    }

Nested event

Consists of filling a similar message which is recursively nested 3 levels deep:

    void FillMessage_Nested(T* msg, int depth = 0) {
      FillMessage_Simple(msg);
      if (depth < 3) {
        auto* child = msg->add_field_nested();
        FillMessage_Nested(child, depth + 1);
      }
    }

Comparison terms

We compare, for the same message type, the performance of ProtoZero, libprotobuf and a speed-of-light serializer.

The speed-of-light serializer is a very simple C++ class that just appends data into a linear buffer, making all sorts of favourable assumptions: it does not use any binary-stable encoding, it does not perform bounds checking, all writes are 64-bit aligned, and it does not deal with thread-safety.

    struct SOLMsg {
      template <typename T>
      void Append(T x) {
        // The memcpy will be elided by the compiler, which will emit just a
        // 64-bit aligned mov instruction.
        memcpy(reinterpret_cast<T*>(ptr_), &x, sizeof(x));
        ptr_ += sizeof(x);
      }

      void set_field_int32(int32_t x) { Append(x); }
      void set_field_uint32(uint32_t x) { Append(x); }
      void set_field_int64(int64_t x) { Append(x); }
      void set_field_uint64(uint64_t x) { Append(x); }
      void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }

      char storage_[sizeof(g_fake_input_simple)];
      char* ptr_ = &storage_[0];
    };

The speed-of-light serializer serves as a reference for how fast a serializer could be if argument marshalling and bound checking were zero cost.

Benchmark results

Google Pixel 3 - aarch64

    $ cat out/droid_arm64/args.gn
    target_os = "android"
    is_clang = true
    is_debug = false
    target_cpu = "arm64"

    $ ninja -C out/droid_arm64/ perfetto_benchmarks && \
      adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
      adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'

    ----------------------------------------------------------------------
    Benchmark                               Time          CPU   Iterations
    ----------------------------------------------------------------------
    BM_Protozero_Simple_Libprotobuf       402 ns       398 ns      1732807
    BM_Protozero_Simple_Protozero         242 ns       239 ns      2929528
    BM_Protozero_Simple_SpeedOfLight      118 ns       117 ns      6101381
    BM_Protozero_Nested_Libprotobuf      1810 ns      1800 ns       390468
    BM_Protozero_Nested_Protozero         780 ns       773 ns       901369
    BM_Protozero_Nested_SpeedOfLight      138 ns       136 ns      5147958

HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux

    $ cat out/linux_clang_release/args.gn
    is_clang = true
    is_debug = false

    $ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
      out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*

    ----------------------------------------------------------------------
    Benchmark                               Time          CPU   Iterations
    ----------------------------------------------------------------------
    BM_Protozero_Simple_Libprotobuf       428 ns       428 ns      1624801
    BM_Protozero_Simple_Protozero         261 ns       261 ns      2715544
    BM_Protozero_Simple_SpeedOfLight      111 ns       111 ns      6297387
    BM_Protozero_Nested_Libprotobuf      1625 ns      1625 ns       436411
    BM_Protozero_Nested_Protozero         843 ns       843 ns       849302
    BM_Protozero_Nested_SpeedOfLight      140 ns       140 ns      5012910