Kernel track events: format and conventions

This page describes a convention for structuring Linux kernel tracepoints in a way that enables perfetto to automatically present them as slice/counter tracks at the UI and SQL levels, without having to change or rebuild perfetto code.

This is a perfetto convention, and does not have (or need) any dedicated upstream kernel code. It's best used when hacking on a local kernel, or writing a self-contained module that won't be upstreamed. It is also not tied to static tracepoints: a dynamic probe (e.g. kprobe) that creates a tracefs entry with the relevant fields will also work.

This page is structured as a reference. For an introduction with examples and screenshots of the resulting UI, see "Instrumenting the Linux kernel with ftrace".

This convention is still malleable. If you end up using it or find issues with the design, please send an email to our mailing list or file a GitHub issue.

Slices and instants

Perfetto looks for fields with specific types and names in the event's data representation. This is defined by TP_STRUCT__entry() when using the TRACE_EVENT() macro to define the tracepoint.

For representing slices (begin + end) and instants, grouped by tracks, the well-known fields are:

required?	type	name
required	char	track_event_type
required	__string	slice_name
optional	intX	scope_{...}
optional	__string	track_name

Where intX represents any integral type, and __string is the kernel type used for storing dynamically-sized strings in tracing events.

At runtime, the event payloads will be interpreted as follows:

track_event_type:
- 'B' opens a named slice.
- 'E' ends the last opened slice within the track.
- 'I' sets a named instant (zero duration) event.
slice_name: the name of the slice for begin ('B') and instant ('I') events, ignored for end events.
track_name: if set, overrides the track's name. The default is the tracepoint's name.
scope_{...}: if set, specifies the scoping id of the track, which is used as a grouping key for the tracks. The field name can have an arbitrary suffix that makes sense within your subsystem, but there are also a few well-known names that perfetto can use as a hint when presenting the tracks in the UI. The id does not have to be related to an OS-level concept.
- scope_tgid: for process-scoped tracks, where the value must identify a valid process (though the calling thread does not need to be within that process).
- scope_cpu: for cpu-scoped tracks (emitting code does not need to be running on that cpu).
- scope_your_feature_idx: for your own track id assignments.
- default: thread-scoped track (using the thread id of the thread hitting the tracepoint, as recorded by the ftrace system itself).
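
The scope resolution described above can be sketched in Python as follows. This is only an illustrative model of the rules on this page, not perfetto's actual implementation: the function name resolve_scope, the dictionary representation of an event's fields, and the returned kind labels are all hypothetical. The page does not say what happens if an event carries several scope_ fields, so the sketch assumes at most one and uses the first it finds.

```python
def resolve_scope(fields, recording_tid):
    """Return a (kind, scope_id) pair for one event payload.

    fields: dict mapping field name -> value (hypothetical representation).
    recording_tid: thread id recorded by ftrace for the emitting thread.
    """
    for name, value in fields.items():
        if not name.startswith("scope_"):
            continue
        if name == "scope_tgid":
            return ("process", value)  # well-known: process-scoped track
        if name == "scope_cpu":
            return ("cpu", value)      # well-known: cpu-scoped track
        return ("custom", value)       # any other suffix: your own id space
    # No scope_ field: fall back to the thread hitting the tracepoint.
    return ("thread", recording_tid)


# Example payloads (made-up values).
print(resolve_scope({"track_event_type": "B", "slice_name": "work"}, 4242))
print(resolve_scope({"track_event_type": "B", "slice_name": "work", "scope_tgid": 100}, 4242))
print(resolve_scope({"counter_value": 7, "scope_my_queue_idx": 3}, 4242))
```

Note that the custom case treats the id as opaque: as stated above, it does not have to correspond to any OS-level concept.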

Additionally:

The tracepoint name and the subsystem can be arbitrary. Your headers can declare any number of tracepoints that match these templates. Each tracepoint will be processed independently.

There are no constraints on having additional fields, the field order or other parts of the TRACE_EVENT() declaration. Note that this includes the printk specifier, so the textual formatting of the tracepoint can be arbitrary (you don't even need to print the perfetto-specific fields).

Counters

For representing counter values, grouped by tracks, the well-known fields are:

required?	type	name
required	intX	counter_value
optional	intX	scope_{...}
optional	__string	track_name
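
As a rough illustration of how counter events with these fields could turn into counter tracks, here is a small Python model. It is a sketch under assumptions, not perfetto's code: the event dictionaries, the key layout, and the name build_counter_tracks are hypothetical, and it applies the same grouping rules that this page gives for slice tracks.

```python
from collections import defaultdict


def build_counter_tracks(events):
    """Group counter events into per-track series of (timestamp, value).

    events: iterable of dicts with keys "tracepoint", "tid", "ts" and
    "fields" (hypothetical representation of a recorded event).
    """
    series = defaultdict(list)
    for ev in events:
        fields = ev["fields"]
        # Scope id: value of the first scope_ field, else the recording thread id.
        scope = next(
            (v for k, v in fields.items() if k.startswith("scope_")),
            ev["tid"],
        )
        # Track name defaults to the tracepoint name.
        track_name = fields.get("track_name", ev["tracepoint"])
        series[(ev["tracepoint"], track_name, scope)].append(
            (ev["ts"], fields["counter_value"])
        )
    return dict(series)


# Made-up example: one tracepoint reporting a per-cpu queue depth.
events = [
    {"tracepoint": "queue_depth", "tid": 11, "ts": 100, "fields": {"counter_value": 3, "scope_cpu": 0}},
    {"tracepoint": "queue_depth", "tid": 12, "ts": 105, "fields": {"counter_value": 9, "scope_cpu": 1}},
    {"tracepoint": "queue_depth", "tid": 12, "ts": 110, "fields": {"counter_value": 4, "scope_cpu": 0}},
]
for key, samples in build_counter_tracks(events).items():
    print(key, samples)
```

In this example the two samples with scope_cpu set to 0 land on the same track even though they were emitted by different threads.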

Details on scoping (grouping) events

This section explains the rules of how the recorded events get grouped into tracks, as generally a trace recorded using a single tracepoint can result in N separate tracks. The grouping rules are the same for slice and counter tracks.

NB: slices on slice tracks must have strict nesting - all slices must terminate before their parents (see the concept of async slices for more details). You need to use track naming or scoping to ensure that this invariant is preserved.
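
To decide whether a set of operations can share one track, it can help to check whether their intended intervals nest. If they do not, an 'E' event would close the wrong slice, because it always ends the last opened slice on the track. The following Python sketch performs that check; the function name is_strictly_nested and the (begin, end) tuple representation are hypothetical and only serve this illustration.

```python
def is_strictly_nested(slices):
    """Return True if every pair of slices is either disjoint or nested.

    slices: list of (begin_ts, end_ts) tuples intended for a single track.
    """
    open_ends = []  # end timestamps of the slices that are currently open
    # Sort by begin time; for equal begins, visit the longer (parent) slice first.
    for begin, end in sorted(slices, key=lambda s: (s[0], -s[1])):
        # Drop slices that have already ended before this one begins.
        while open_ends and open_ends[-1] <= begin:
            open_ends.pop()
        # A still-open parent must outlive (or end together with) its child.
        if open_ends and end > open_ends[-1]:
            return False
        open_ends.append(end)
    return True


print(is_strictly_nested([(0, 10), (2, 5), (6, 9)]))  # children inside a parent
print(is_strictly_nested([(0, 10), (5, 15)]))         # partial overlap
```

If the check fails for your workload, the overlapping operations need separate tracks, for example via distinct track_name values or a scope_ field.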

The default behaviour (if you only specify the mandatory fields) is thread-scoped. Events are grouped by the thread id of the thread(s) hitting the tracepoints. There will be one track per thread with events. The end ('E') events will terminate the last opened slice on that thread.

If the event has a field prefixed with scope_, the events will be grouped by the value of that field, with some predefined names having special meaning (see above). For example, specifying a scope_tgid makes the track process-scoped - all events sharing the same scope_tgid value will be put on the same track. Further, the UI will present that track in the process's group.

If your events include the track_name field, then events become grouped by that name as an additional dimension to the above. That is, the end ('E') event will terminate the last opened slice with that exact track name, even if there are multiple named tracks within the same thread/process/cpu/etc scope.

The net effect is that recorded events are grouped by the unique combination of: {tracepoint} x {track name} x {scope id}, with the last two defaulting to the tracepoint name and thread id respectively.
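
The grouping key and the per-track handling of 'B', 'E' and 'I' events can be modelled in a few lines of Python. This is a minimal sketch of the rules on this page, not perfetto's implementation: the event dictionaries and the name build_slice_tracks are hypothetical, and silently ignoring an 'E' event with no open slice is an assumption of the sketch.

```python
from collections import defaultdict


def build_slice_tracks(events):
    """Replay track events (in timestamp order) into per-track slices.

    events: iterable of dicts with keys "tracepoint", "tid", "ts" and
    "fields" (hypothetical representation of a recorded event).
    Returns {(tracepoint, track name, scope id): [(name, begin_ts, end_ts)]}.
    """
    open_slices = defaultdict(list)  # key -> stack of (slice name, begin_ts)
    tracks = defaultdict(list)       # key -> completed slices and instants
    for ev in events:
        fields = ev["fields"]
        # Scope id defaults to the thread id; track name to the tracepoint name.
        scope = next(
            (v for k, v in fields.items() if k.startswith("scope_")),
            ev["tid"],
        )
        key = (ev["tracepoint"], fields.get("track_name", ev["tracepoint"]), scope)
        kind = fields["track_event_type"]
        if kind == "B":
            open_slices[key].append((fields["slice_name"], ev["ts"]))
        elif kind == "E" and open_slices[key]:
            # Ends the last opened slice on this exact track.
            name, begin = open_slices[key].pop()
            tracks[key].append((name, begin, ev["ts"]))
        elif kind == "I":
            # Instants have zero duration.
            tracks[key].append((fields["slice_name"], ev["ts"], ev["ts"]))
    return dict(tracks)


# Made-up example: two overlapping operations on one thread, kept apart by track_name.
events = [
    {"tracepoint": "my_io", "tid": 7, "ts": 10, "fields": {"track_event_type": "B", "slice_name": "read", "track_name": "reads"}},
    {"tracepoint": "my_io", "tid": 7, "ts": 12, "fields": {"track_event_type": "B", "slice_name": "flush", "track_name": "writes"}},
    {"tracepoint": "my_io", "tid": 7, "ts": 15, "fields": {"track_event_type": "E", "track_name": "reads"}},
    {"tracepoint": "my_io", "tid": 7, "ts": 20, "fields": {"track_event_type": "E", "track_name": "writes"}},
]
for key, slices in build_slice_tracks(events).items():
    print(key, slices)
```

Without the track_name field, both operations in this example would share a single thread-scoped track, and the first 'E' event would end "flush" rather than "read".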