Perfetto CI design document

This CI is used on top of (not as a replacement for) AOSP's TreeHugger. It gives early testing signals and adds coverage for other OSes and for older Android devices that TreeHugger does not support.

See the Testing page for more details about the project testing strategy.

Architecture diagram

There are four major components:

  1. Frontend: AppEngine.
  2. Controller: AppEngine BG service.
  3. Workers: Compute Engine + Docker.
  4. Database: Firebase realtime database.

They are coupled via the Firebase DB. The DB is the source of truth for the whole CI.

Controller

The Controller orchestrates the CI. It's the most trusted piece of the system.

It is based on a background AppEngine service. This service is triggered only by deferred tasks and periodic cron jobs.

The Controller is the only entity that performs authenticated access to Gerrit. It uses a non-privileged Gmail account and has no meaningful voting power.

The controller loop does mainly the following:

Frontend

The frontend is an AppEngine service that hosts the CI website at ci.perfetto.dev. Unlike the Controller, it is exposed to the public via HTTP.

Worker GCE VM

The actual testing job happens inside these Google Compute Engine VMs. The GCE instance is running a CrOS-based Container-Optimized OS.

The whole system image is read-only. The VM itself is stateless. No state is persisted outside of the DB and Google Cloud Storage (only for UI artifacts). The SSD is used only as a scratch disk and is cleared on each reboot.

VMs are dynamically spawned by the Google Cloud Autoscaler, which uses a Stackdriver Custom Metric pushed by the Controller as its cost function. This metric is the number of queued plus running jobs.
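As an illustration of that metric, the sketch below computes "queued plus running" from a snapshot of the DB, using the jobs_queued and jobs_running nodes described in the data model. The function name and the in-memory snapshot are assumptions made for this example; it does not show how the Controller actually pushes the value to Stackdriver.

```python
def autoscaler_load(db_snapshot: dict) -> int:
    """Return the number of queued + running jobs in a snapshot of /ci.

    Hypothetical helper: Firebase returns null (None) for empty nodes,
    so a missing or None node is treated as an empty set of jobs.
    """
    queued = db_snapshot.get("jobs_queued") or {}
    running = db_snapshot.get("jobs_running") or {}
    # Only the keys matter in these nodes; the values are always 0.
    return len(queued) + len(running)


snapshot = {
    "jobs_queued": {
        "20190708153242--cls-1000515-65--android-clang-arm-debug": 0,
    },
    "jobs_running": {
        "20190707123422--cls-1000515-66--android-clang-arm-rel": 0,
    },
}
load = autoscaler_load(snapshot)
print(load)
```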

Each VM runs two types of Docker containers: the worker and the sandbox. They are in a 1:1 relationship: each worker controls at most one associated sandbox. Workers are always alive (they work in polling mode), while sandboxes are started and stopped by the worker on demand.

On each GCE instance there are M (currently 10) worker containers running and hence up to M sandboxes.
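The worker/sandbox relationship can be pictured with a minimal polling loop. This is only a sketch under stated assumptions: claim_job and run_sandbox are hypothetical callables standing in for the DB query and the docker run invocation, and max_polls exists only so the example terminates. It is not the logic of worker.py.

```python
import time
from typing import Callable, Optional


def worker_loop(
    claim_job: Callable[[], Optional[str]],
    run_sandbox: Callable[[str], None],
    poll_interval_s: float = 1.0,
    max_polls: Optional[int] = None,
) -> int:
    """Poll for jobs and run at most one sandbox at a time.

    Returns the number of jobs that were run. All parameter names are
    hypothetical and chosen for this illustration.
    """
    jobs_run = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        job_id = claim_job()
        if job_id is None:
            # Nothing queued: the worker stays alive and polls again.
            time.sleep(poll_interval_s)
            continue
        # Blocks until the sandbox exits, so a worker never has more
        # than one sandbox associated with it.
        run_sandbox(job_id)
        jobs_run += 1
    return jobs_run


queue = ["job-a", "job-b"]
ran = []
count = worker_loop(
    claim_job=lambda: queue.pop(0) if queue else None,
    run_sandbox=ran.append,
    poll_interval_s=0.0,
    max_polls=4,
)
```

Because run_sandbox blocks, running M such loops side by side gives at most M concurrent sandboxes per instance.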

Worker containers

Worker containers are trusted entities. They can impersonate the GCE service account and have R/W access to the DB. They can also spawn sandbox containers.

Their behavior depends only on code that is manually deployed and doesn't depend on the checkout under test. Workers run as Docker containers NOT for security but only for reproducibility and ease of maintenance.

Each worker does the following:

Sandbox containers

Sandbox containers are untrusted entities. They can access the internet (for git pull / install-build-deps) but they cannot impersonate the GCE service account, write into the DB, or write into GCS buckets. Docker here is used both as an isolation boundary and for reproducibility / debugging.

Each sandbox does the following:

A sandbox container is almost completely stateless with the only exception of the semi-ephemeral /ci/cache mount-point. This mount-point is tmpfs-based (hence cleared on reboot) but is shared across all sandboxes. It's used only to maintain the shared ccache.

Data model

The whole CI is based on Firebase Realtime DB. It is a high-scale JSON object accessible via a simple REST API. Clients can GET/PUT/PATCH/DELETE individual sub-nodes without having a local full-copy of the DB.
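To make the sub-node access pattern concrete, here is a minimal sketch that builds (but does not send) a REST request for a single node. The Firebase REST API addresses a node by appending .json to its path. The DB_URL value, the db_request helper and the job key are assumptions made for this example, and authentication is deliberately left out.

```python
import json
import urllib.request
from typing import Any, Optional

# Hypothetical database URL, used only for illustration.
DB_URL = "https://example-ci.firebaseio.com"


def db_request(
    method: str, path: str, body: Optional[Any] = None
) -> urllib.request.Request:
    """Build a GET/PUT/PATCH/DELETE request for one sub-node of the DB."""
    url = f"{DB_URL}/{path.strip('/')}.json"
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# PATCH updates only the listed fields of the node, leaving the rest
# of the DB untouched.
req = db_request("PATCH", "/ci/jobs/some-job-id", {"status": "STARTED"})
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would perform the call; that step is omitted so the sketch runs without network access.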

    /ci
        # For post-submit jobs.
        /branches
            /main-20190626000853
            # ┃   ┗━ Committer-date of the HEAD of the branch.
            # ┗━ Branch name
            {
                author: "primiano@google.com"
                rev: "0552edf491886d2bb6265326a28fef0f73025b6b"
                subject: "Cloud-based CI"
                time_committed: "2019-07-06T02:35:14Z"
                jobs:
                {
                    20190708153242--branches-main-20190626000853--android-...: 0
                    20190708153242--branches-main-20190626000853--linux-...: 0
                    ...
                }
            }
            /main-20190701235742 {...}

        # For pre-submit jobs.
        /cls
            /1000515-65
            {
                change_id: "platform%2F...~I575be190"
                time_queued: "2019-07-08T15:32:42Z"
                time_ended: "2019-07-08T15:33:25Z"
                revision_id: "18c2e4d0a96..."
                wants_vote: true
                voted: true
                jobs: {
                    20190708153242--cls-1000515-65--android-clang: 0
                    ...
                    20190708153242--cls-1000515-65--ui-clang: 0
                }
            }
            /1000515-66 {...}
            ...
            /1011130-3 {...}

        /cls_pending
            # Effectively this is an array of pending CLs that we might need to
            # vote on at the end. Only the keys matter, the values have no
            # semantic and are always 0.
            /1000515-65: 0

        /jobs
            /20190708153242--cls-1000515-65--android-clang-arm-debug:
            #  ┃             ┃               ┗━ Job type.
            #  ┃             ┗━ Path of the CL or branch object.
            #  ┗━ Datetime when the job was created.
            {
                src: "cls/1000515-66"
                status: "QUEUED" | "STARTED" | "COMPLETED" | "FAILED" | "TIMED_OUT" | "CANCELLED" | "INTERRUPTED"
                time_ended: "2019-07-07T12:47:22Z"
                time_queued: "2019-07-07T12:34:22Z"
                time_started: "2019-07-07T12:34:25Z"
                type: "android-clang-arm-debug"
                worker: "zqz2-worker-2"
            }
            /20190707123422--cls-1000515-66--android-clang-arm-rel {..}

        /jobs_queued
            # Effectively this is an array. Only the keys matter, the values
            # have no semantic and are always 0.
            /20190708153242--cls-1000515-65--android-clang-arm-debug: 0

        /jobs_running
            # Effectively this is an array. Only the keys matter, the values
            # have no semantic and are always 0.
            /20190707123422--cls-1000515-66--android-clang-arm-rel

        /logs
            /20190707123422--cls-1000515-66--android-clang-arm-rel
                /00a053-0000: "+ chmod 777 /ci/cache /ci/artifacts"
                # ┃     ┗━ Monotonic counter to establish total order on log lines
                # ┃        retrieved within the same read() batch.
                # ┃
                # ┗━ Hex-encoded timestamp, relative since start of test.
                /00a053-0001: "+ chown perfetto.perfetto /ci/ramdisk"
                ...
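The job key layout (creation datetime, path of the CL or branch object, job type, joined by a double hyphen) lends itself to a small parser. The sketch below is an assumption-laden illustration: the JobId type and parse_job_id are hypothetical names, and mapping the first "-" of the middle component back to "/" is inferred from the example keys rather than taken from the CI code.

```python
from datetime import datetime
from typing import NamedTuple


class JobId(NamedTuple):
    created: datetime  # Datetime when the job was created.
    src: str           # Path of the CL or branch object, e.g. "cls/1000515-65".
    job_type: str      # Job type, e.g. "android-clang-arm-debug".


def parse_job_id(job_id: str) -> JobId:
    """Split a job key into its three "--"-separated components."""
    stamp, src_key, job_type = job_id.split("--", 2)
    created = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    # Assumption: the key flattens "cls/..." and "branches/..." by
    # replacing the path separator with a hyphen.
    src = src_key.replace("-", "/", 1)
    return JobId(created, src, job_type)


job = parse_job_id("20190708153242--cls-1000515-65--android-clang-arm-debug")
print(job.src, job.job_type)
```

Because the datetime comes first and has a fixed width, sorting job keys lexicographically also sorts them by creation time.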

Sequence Diagram

This is what happens, in order, on a worker instance from boot to the test run.

    make -C /infra/ci worker-start
    ┗━ gcloud start ...

    [GCE] # From /infra/ci/worker/gce-startup-script.sh
    docker run worker-1 ...
    ...
    docker run worker-N ...

    [worker-X] # From /infra/ci/worker/Dockerfile
    ┗━ /infra/ci/worker/worker.py
       ┗━ docker run sandbox-X ...

    [sandbox-X] # From /infra/ci/sandbox/Dockerfile
    ┗━ /infra/ci/sandbox/init.sh
       ┗━ /infra/ci/sandbox/testrunner.sh
          ┣━ git fetch refs/changes/...
          ┇  ...
          ┇  # This env var is passed by the test definition
          ┇  # specified in /infra/ci/config.py .
          ┗━ $PERFETTO_TEST_SCRIPT
             ┣━ # Which is one of these:
             ┣━ /test/ci/android_tests.sh
             ┣━ /test/ci/fuzzer_tests.sh
             ┣━ /test/ci/linux_tests.sh
             ┗━ /test/ci/ui_tests.sh
                ┣━ ninja ...
                ┗━ out/dist/{unit,integration,...}test

gce-startup-script.sh

worker.py

testrunner.sh

{android,fuzzer,linux,ui}_tests.sh

Playbook

Frontend (JS/HTML/CSS) changes

Test locally with make -C infra/ci/frontend test

Deploy with make -C infra/ci/frontend deploy

Controller changes

Deploy with make -C infra/ci/controller deploy

It is possible to test locally via make -C infra/ci/controller test, but this involves:

Worker/Sandbox changes

  1. Build and push the new docker containers with:

    make -C infra/ci build push

  2. Restart the GCE instances, either manually or via:

    make -C infra/ci restart-workers

Purging the job queue

This can be useful when there is an outage and too many jobs pile up.

Security considerations