Benchmarking

Posted by Chris Warburton

There is a lot of folk wisdom among programmers when it comes to speeding up code. The advice I try to follow can be summed up as:

This closely mirrors testing practices, in particular regression tests. We might also want benchmarks which run the whole system, to get a more realistic view of performance, but we may want to keep those separate from our microbenchmarks so that they don’t slow down our iteration cycle. This is similar to integration testing vs unit testing, and the same solutions can be used for both (e.g. build servers watching for changes in our code repositories).

Whilst testing is very well supported by frameworks and tooling, benchmarking seems to be in a worse state. The two main problems I see are:

Tooling

There are many tools for running benchmarks, measuring times, performing simple statistics, etc. These are equivalent to test frameworks, like the “xUnit” family, and likewise are usually tailored to a particular language, e.g. Criterion for Haskell. In principle they’re all generic, since our benchmarks can invoke programs written in other languages, but that’s not the nicest idea (in the same way that we don’t tend to write tests which run other test suites).

When it comes to continuous integration (continuous benchmarking?), there seem to be far fewer options. This sort of tooling is important for benchmarking, since we can rarely judge benchmark results in isolation: we need to see graphs of how they’ve changed over time. There are some “bare bones” generic tools, notably gipeda, but that still requires bespoke scripting to run the benchmarks and store the results, and leaves the user to organise and track details like which machine was used, and so on.

The best approach to running and tracking benchmarks that I’ve seen is Airspeed Velocity (ASV). Out of the box it is designed for benchmarking Python packages, but my Nix plugin lets us use it for any language or system.

Each ASV benchmark is a Python function (or, if you must, a method in a class). For non-Python projects we can use these in two ways: if we only need crude time measurements of some script, we can use Nix to put that script in the environment, and use ASV to time a function which invokes it. More powerfully, we can write “tracking” benchmarks which just return numbers for ASV to track. We can take these numbers from anywhere, e.g. from the output of some more specialised tool like Criterion. In this case we’re still stuck using Python, but only as a thin layer, more like a configuration language.
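
To illustrate, here is roughly what those two kinds of benchmark might look like. The commands and the JSON field are hypothetical; the real part is ASV’s convention of discovering benchmarks by their `time_` and `track_` name prefixes.

```python
# benchmarks/benchmarks.py -- the commands and JSON field are made up,
# but the time_/track_ naming convention is how ASV finds benchmarks.
import json
import subprocess

def time_some_script():
    """ASV times any function whose name starts with `time_`; here it
    just invokes a script which Nix has put into the environment."""
    subprocess.run(["some-script"], check=True)

def track_criterion_mean():
    """Functions named `track_*` return a number for ASV to record;
    here we pull one out of another tool's (hypothetical) JSON output."""
    result = subprocess.run(["run-criterion-suite"],
                            check=True, capture_output=True)
    return json.loads(result.stdout)["mean"]

track_criterion_mean.unit = "seconds"  # how ASV should label the graphs
```

Running `asv run` records these numbers per commit, and `asv publish` renders them as graphs over time.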

Using ASV, either to perform benchmarks itself or as a wrapper around some other measurement tool, gives us some nice benefits:

One difficulty is ensuring that we don’t run multiple benchmark suites at once. I use Laminar for continuous integration, and Nix to define my Laminar configuration. To prevent benchmarks from running concurrently, I’ve used flock in two ways:

It is important that we use flock on all jobs, not just the benchmarks, so that nothing else runs while a benchmark is in progress. I use git hooks to trigger each repo’s build/test Laminar job, but not its benchmarks, since there’s little point benchmarking a project which doesn’t build or whose test suite fails. Instead, the benchmark jobs are triggered by the build/test jobs, iff the build/test succeeds. We can do this thanks to the simplicity of Laminar, which is designed to be easily scripted.
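
The locking itself is simple: the flock command used in those job scripts wraps the kernel’s flock(2) primitive, which Python also exposes, so a rough sketch of the idea looks like this (the lock path and commands are hypothetical):

```python
# A sketch of the exclusive-locking idea; the real jobs call the flock(1)
# command, which uses the same flock(2) primitive as Python's fcntl module.
import fcntl
import subprocess

LOCK_PATH = "/tmp/benchmark.lock"  # hypothetical lock file shared by all jobs

def run_under_lock(cmd):
    """Block until no other job holds the lock, then run cmd exclusively."""
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file is closed
        subprocess.run(cmd, check=True)

# e.g. a benchmark job might run_under_lock(["asv", "run"]), while a
# build/test job wraps its own commands the same way.
```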