Batch Trace Processor

This document describes the overall design of Batch Trace Processor and aids in integrating it into other systems.

Motivation

The Perfetto trace processor is the de-facto way to perform analysis on a single trace. Using the trace processor Python API, traces can be queried interactively, plots can be made from the results, and so on.

While queries on a single trace are useful when debugging a specific problem in that trace or in the very early stages of understanding a domain, it soon becomes limiting. One trace is unlikely to be representative of the entire population and it's easy to overfit queries, i.e. spend a lot of effort breaking down a problem in that trace while neglecting other, more common issues in the population.

Because of this, what we actually want is to be able to query many traces (usually on the order of 250-10000+) and identify the patterns which show up in a significant fraction of them. This ensures that time is being spent on issues which are affecting user experience instead of just a random problem which happened to show up in the trace.

One low-effort option for solving this problem is simply to ask people to use utilities like Python's concurrent.futures executors with the Python API to load multiple traces and query them in parallel. Unfortunately, there are several downsides to this approach:

While we've discussed querying traces so far, the experience of loading traces from different sources should be just as good. This has historically been a big reason why the Python API has not gained as much adoption as we would have liked.

Especially internally in Google, we should not be relying on engineers knowing where traces live on the network filesystem or what the directory layout is. Instead, they should simply be able to specify the data source (i.e. lab, testing population) and some parameters (e.g. build id, date, kernel version) that traces should match; traces meeting these criteria should then be found and loaded.

Putting all this together, we want to build a library which can:

Design Highlights

In this section, we briefly discuss some of the most impactful design decisions taken when building batch trace processor and the reasons behind them.

Language

The choice of language is pretty straightforward. Python is already the go-to language for data analysis in a wide variety of domains and our problem is not unique enough to warrant making a different decision. Another point in Python's favour is the existence of the Python API for trace processor; this further eases the implementation as we do not have to start from scratch.

The main downside of choosing Python is performance but, given that all the data crunching happens in C++ inside trace processor, this is not a big factor.

Trace URIs and Resolvers

Trace URIs are an elegant solution to the problem of loading traces from a diverse range of public and internal sources. As with web URIs, the idea with trace URIs is to describe both the protocol (i.e. the source) from which traces should be fetched and the arguments (i.e. query parameters) which the traces should match.

Batch trace processor should integrate tightly with trace URIs and their resolvers. Users should be able to pass either just the URI (which is really just a string, for maximum flexibility) or a resolver object which can yield a list of trace file paths.

To handle URI strings, there should be some mechanism for "registering" resolvers to make them eligible to resolve a certain "protocol". By default, we should provide a resolver which handles the local filesystem. We should also ensure that the resolver design allows resolvers to be closed source while the rest of batch trace processor remains open.

Along with the job of yielding a list of traces, resolvers should also be responsible for creating metadata for each trace: these are pieces of information about the trace that the user might be interested in, e.g. OS version, device name, collection date. The metadata can then be used when "flattening" results across many traces, as discussed below.
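
For illustration, a resolver and its registration could look something like the sketch below. The class names, the RESOLVERS registry, the resolve_uri helper and the "localfile" protocol are assumptions made for this example rather than the actual batch trace processor API.

# A minimal sketch of a resolver and its registration; all names here
# (ResolvedTrace, FilesystemResolver, RESOLVERS, resolve_uri) and the
# "localfile" protocol are illustrative assumptions.
from dataclasses import dataclass, field
from pathlib import Path
from urllib.parse import parse_qs, urlparse

@dataclass
class ResolvedTrace:
  path: str
  # Resolver-provided metadata (OS version, device name, etc.) which can
  # later be attached to query results when flattening.
  metadata: dict = field(default_factory=dict)

class FilesystemResolver:
  def __init__(self, root):
    self.root = root

  def resolve(self):
    # Yield every trace under the root; a real resolver could also derive
    # per-trace metadata from filenames or sidecar files.
    for p in Path(self.root).glob('**/*.pftrace'):
      yield ResolvedTrace(path=str(p), metadata={'source': 'filesystem'})

# Registering a resolver against a "protocol" lets closed source resolvers
# be slotted in without changing the open source core.
RESOLVERS = {'localfile': FilesystemResolver}

def resolve_uri(uri):
  parsed = urlparse(uri)
  args = {k: v[0] for k, v in parse_qs(parsed.query).items()}
  return RESOLVERS[parsed.scheme](**args).resolve()

traces = list(resolve_uri('localfile:?root=/tmp/traces'))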

Persisting loaded traces

Optimizing the loading of traces is critical for the O(s) query performance we want out of batch trace processor. Traces are often accessed over the network, meaning fetching their contents has high latency. Traces also take at least a few seconds to parse, eating into the O(s) budget before even accounting for the running time of the queries themselves.

To address this issue, we took the decision to keep all traces fully loaded in memory in trace processor instances. That way, instead of loading the traces for every query or set of queries, we can issue queries against the already-loaded instances directly.
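
To make the idea concrete, here is a minimal sketch of load-once, query-many, assuming the existing single-trace TraceProcessor Python API and placeholder trace paths; it illustrates the design rather than the real batch trace processor implementation.

# Sketch: pay the fetch + parse cost exactly once per trace, then run any
# number of queries against the resident instances. Trace paths are
# placeholders.
from concurrent.futures import ThreadPoolExecutor
from perfetto.trace_processor import TraceProcessor

trace_paths = ['/tmp/traces/trace_1.pftrace', '/tmp/traces/trace_2.pftrace']
tps = [TraceProcessor(trace=p) for p in trace_paths]

def query_all(sql):
  # Each query runs against the already-loaded instances, keeping
  # per-query latency in the O(s) range.
  with ThreadPoolExecutor() as pool:
    return list(pool.map(lambda tp: list(tp.query(sql)), tps))

slice_results = query_all('SELECT ts, dur FROM slice')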

For the moment, we restrict the loading and querying of traces to a single machine. While querying n traces is "embarrassingly parallel" and shards perfectly across multiple machines, introducing distributed systems to any solution simply makes everything more complicated. The move to multiple machines is explored further in the "Future plans" section.

Flattening query results

The naive way to return the result of querying n traces is a list of n elements, with each element being the result for a single trace. However, after performing several case-study performance investigations using batch trace processor, it became obvious that this answer was not the most convenient for the end user.

Instead, a pattern which proved very useful was to "flatten" the results into a single table, containing the results from all the traces. However, simply flattening causes us to lose the information about which trace a row originated from. We can deal with this by allowing resolvers to silently add columns with the metadata for each trace.

So suppose we query three traces with:

SELECT ts, dur FROM slice

Then the flattening operation might do something like this behind the scenes:
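
(The sketch below uses pandas with made-up per-trace rows and a device_name metadata column purely for illustration; it is not the actual implementation.)

# Flattening sketch: one dataframe per trace plus its resolver-provided
# metadata, combined into a single table.
import pandas as pd

per_trace_results = [
    (pd.DataFrame({'ts': [100, 200], 'dur': [10, 20]}), {'device_name': 'deviceA'}),
    (pd.DataFrame({'ts': [150], 'dur': [5]}), {'device_name': 'deviceB'}),
    (pd.DataFrame({'ts': [300, 400], 'dur': [7, 9]}), {'device_name': 'deviceC'}),
]

# Tag each trace's rows with its metadata, then concatenate everything
# into one flat table with columns: ts, dur, device_name.
flat_df = pd.concat(
    [df.assign(**metadata) for df, metadata in per_trace_results],
    ignore_index=True)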

Integration points

Batch trace processor needs to be open source yet allow deep integration with Google internal tooling. Because of this, there are various integration points built into the design to allow closed-source components to be slotted in place of the default, open source ones.

The first point is the formalization of the idea of "platform" code. Ever since the beginning of the Python API, there has been a need for code internally to run slightly differently to the open source code. For example, Google's internal Python distribution does not use Pip, instead packaging dependencies into a single binary. The notion of a "platform" loosely existed to abstract these sorts of differences but it was very ad-hoc. As part of the batch trace processor implementation, this notion has been retroactively formalized.

Resolvers are another big point of pluggability. By allowing registration of a "protocol" for each internal trace source (e.g. lab, testing population), we allow for trace loading to be neatly abstracted.

Finally, for batch trace processor specifically, we abstract the creation of the thread pools used for loading traces and running queries. The parallelism and memory available to programs internally often does not correspond 1:1 with the CPUs/memory available on the system: internal APIs need to be consulted to find out this information.
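
A sketch of what such a platform hook could look like is shown below; the class and method names are assumptions for illustration, not the actual interface.

# Illustrative platform hook: open source sizes pools from the CPUs
# visible to the process, while a closed source platform would consult
# internal APIs for the allowed parallelism/memory instead.
import os
from concurrent.futures import ThreadPoolExecutor

class OpenSourcePlatform:
  def create_load_pool(self):
    # Loading traces is I/O and parse heavy.
    return ThreadPoolExecutor(max_workers=os.cpu_count())

  def create_query_pool(self):
    return ThreadPoolExecutor(max_workers=os.cpu_count())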

Future plans

One common problem when running batch trace processor is that we are constrained by a single machine and so can only load O(1000) traces. For rare problems, there might only be a handful of traces matching a given pattern even in such a large sample.

A way around this would be to build a "no trace limit" mode. The idea here is that you would develop queries as usual with batch trace processor operating on O(1000) traces with O(s) performance. Once the queries are relatively finalized, we could then "switch" the mode of batch trace processor to operate closer to a "MapReduce"-style pipeline which goes over O(10000)+ traces, loading only O(n cpus) traces at any one time.
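
A rough sketch of how such a mode could behave is shown below; the query_streaming function is hypothetical and only illustrates bounding the number of traces resident at any one time.

# Hypothetical "no trace limit" mode: stream over the full trace set,
# keeping only a bounded number of traces loaded at any one time.
import os
from concurrent.futures import ThreadPoolExecutor
from perfetto.trace_processor import TraceProcessor

def query_streaming(trace_paths, sql):
  def load_query_unload(path):
    # Each worker pays the load cost, runs the finalized query and then
    # frees the trace, so memory stays bounded by O(n cpus) traces.
    tp = TraceProcessor(trace=path)
    try:
      return list(tp.query(sql))
    finally:
      tp.close()

  with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    return list(pool.map(load_query_unload, trace_paths))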

This allows us to retain the quick iteration speed while developing queries while also allowing for large-scale analysis, without needing to move code to a pipeline model. However, this approach does not really resolve the root cause of the problem, which is that we are restricted to a single machine.

The "ideal" solution here is to, as mentioned above, shard batch trace processor across >1 machine. When querying traces, each trace is entirely independent of any other so paralleising across multiple machines yields very close to perfect gains in performance at little cost.

This would, however, be quite a complex undertaking. We would need to design the API in such a way that it allows for pluggable integration with various compute platforms (e.g. GCP, Google internal, your custom infra). Even restricting to just Google infra and leaving others open for contribution, internal infra's ideal workload does not match the approach of "have a bunch of machines tied to one user waiting for their input". There would need to be significant research and design work before going here but it would likely be worthwhile.