What is Tracing?

NOTE: the word "tracing" in this document is used in the context of client-side side software (e.g. programs running on a single machine). In the server world, tracing is usually short for distributed tracing, a way to collect data from many different servers to understand the flow of a "request" throughout multiple services. As such, this document will not be useful to you if you are interested in such traces.

This page provides a birds-eye view of performance analysis and tracing. The aim is to orient people who have no idea what "tracing" is.

Introduction to Performance

Performance analysis is concerned with making software run better. The definition of better varies widely and depends on the situation. Examples include:

performing the same work using fewer resources (CPU, memory, network, battery, etc.)
increasing utilization of available resources
identifying and eliminating unnecessary work altogether

Much of the difficulty in improving performance comes from identifying the root cause of performance issues. Modern software systems are complicated, having a lot of components and a web of cross-interactions. Techniques which help engineers understand the execution of a system and pinpoint issues that are critical.

Tracing and profiling are two such widely-used techniques for performance analysis.

Introduction to Tracing

Tracing involves collecting highly detailed data about the execution of a system. A single continuous session of recording is called a trace file or trace for short.

Traces contain enough detail to fully reconstruct the timeline of events. They often include low-level kernel events like scheduler context switches, thread wakeups, syscalls, etc. With the "right" trace, reproduction of a performance bug is not needed as the trace provides all necessary context.

Application code is also instrumented in areas of the program which are considered to be important. This instrumentation keeps track of what the program was doing over time (e.g. which functions were being run, or how long each call took) and context about the execution (e.g. what were the parameters to a function call, or why was a function run).

The level of detail in traces makes it impractical to read traces directly like a log file in all but the simplest cases. Instead, a combination of trace analysis libraries and trace viewers are used. Trace analysis libraries provide a way for users to extract and summarize trace events in a programmatic manner. Trace viewers visualize the events in a trace on a timeline which give users a graphical view of what their system was doing over time.

Logging vs tracing

A good intuition is that logging is to functional testing what tracing is to performance analysis. Tracing is, in a sense, "structured" logging: instead of having arbitrary strings emitted from parts of the system, tracing reflects the detailed state of a system in a structured way to allow reconstruction of the timeline of events.

Moreover, tracing frameworks (like Perfetto) place heavy emphasis on having minimal overhead. This is essential so that the framework does not significantly disrupt whatever is being measured: modern frameworks are fast enough that they can measure execution at the nanosecond level without significantly impacting the execution speed of the program.

Small aside: theoretically, tracing frameworks are powerful enough to act as a logging system as well. However, the utilization of each in practice is different enough that the two tend to be separate.

Metrics vs tracing

Metrics are numerical values which track the performance of a system over time. Usually metrics map to high-level concepts. Examples of metrics include: CPU usage, memory usage, network bandwidth, etc. Metrics are collected directly from the app or operating system while the program is running.

After glimpsing the power of tracing, a natural question arises: why bother with high level metrics at all? Why not instead just use tracing and compute metrics on resulting traces? In some settings, this may indeed be the right approach. In local and lab situations using trace-based metrics, where metrics are computed from traces instead of collecting them directly, is a powerful approach. If a metric regresses, it's easy to open the trace to root cause why that happened.

However, trace-based metrics are not a universal solution. When running in production, the heavyweight nature of traces can make it impractical to collect them 24/7. Computing a metric with a trace can take megabytes of data vs bytes for direct metric collection.

Using metrics is the right choice when you want to understand the performance of a system over time but do not want to or can not pay the cost of collecting traces. In these situations, traces should be used as a root-causing tool. When your metrics show there is a problem, targeted tracing can be rolled out to understand why the regression may have happened.

Introduction to Profiling

Profiling involves sampling some usage of a resource by a program. A single continuous session of recording is known as a profile.

Each sample collects the function callstack (i.e. the line of code along with all calling functions). Generally this information is aggregated across the profile. For each seen callstack, the aggregation gives the percentage of usage of the resource by that callstack. By far the most common types of profiling are memory profiling and CPU profiling.

Memory profiling is used to understand which parts of a program are allocating memory on the heap. The profiler generally hooks into malloc (and free) calls of a native (C/C++/Rust/etc.) program to sample the callstacks calling malloc. Information about how many bytes were allocated is also retained. CPU profiling is used for understanding where the program is spending CPU time. The profiler captures the callstack running on a CPU over time. Generally this is done periodically (e.g. every 50ms), but can be also be done when certain events happen in the operating system.

Profiling vs tracing

There are two main questions for comparing profiling and tracing:

Why profile my program statistically when I can just trace everything?
Why use tracing to reconstruct the timeline of events when profiling gives me the exact line of code using the most resources?

When to use profiling over tracing

Traces cannot feasibly capture execution of extreme high frequency events e.g. every function call. Profiling tools fill this niche: by sampling, they can significantly cut down on how much information they store. The statistical nature of profilers are rarely a problem; the sampling algorithms for profilers are specifically designed to capture data which is highly representative of the real resource use.

Aside: a handful of very specialized tracing tools exist which can capture every function call (e.g. magic-trace) but they output gigabytes of data every second which make them impractical for anything beyond investigating tiny snippets of code. They also generally have higher overhead than general purpose tracing tools.

When to use tracing over profiling

While profilers give callstacks where resources are being used, they lack information about why that happened. For example, why was malloc being called by function foo() so many times? All they say is foo() allocated X bytes over Y calls to malloc. Traces are excellent at providing this exact context: application instrumentation and low-level kernel events together provide deep insight into why code was run in the first place.

NOTE: Perfetto supports collecting, analyzing and visualizing both profiles and traces at the same time so you can have the best of both worlds!

Next Steps

Now that you have a better understanding of tracing and profiling, you can use Perfetto to:

Record a trace of your application and system to understand its behavior.
Analyze a trace to identify performance bottlenecks.
Visualize a trace to see a timeline of events.

To learn how to do this, head to our How do I start using Perfetto? page.