Benchmarking Rust Compiler Settings with Criterion

Controlling Criterion with Scripts and Environment Variables

Published in Towards Data Science, Dec 15, 2023

Timing a crab race — Source: https://openai.com/dall-e-2/. All other figures from the author.

This article first explains how to benchmark with the popular criterion crate. It then shows how to benchmark across compiler settings. Although each combination of compiler settings requires re-compilation and a separate run, we can still tabulate and analyze the results. The article is a companion to Nine Rules for SIMD Acceleration of Your Rust Code, also in Towards Data Science.

We’ll apply this technique to the range-set-blaze crate. Our goal is to measure the performance effects of various SIMD (Single Instruction, Multiple Data) settings. We also want to compare performance across different CPUs. The same approach is useful for understanding the benefit of different optimization levels.

In the context of range-set-blaze, we evaluate:

3 SIMD extension levels: sse2 (128 bit), avx2 (256 bit), avx512f (512 bit)
10 element types: i8, u8, i16, u16, i32, u32, i64, u64, isize, usize
5 lane numbers: 4, 8, 16, 32, 64
2 CPUs: AMD 7950X with avx512f, Intel i5-8250U with avx2
5 algorithms: Regular, Splat0, Splat1, Splat2, Rotate
4 input lengths: 1024; 10,240; 102,400; 1,024,000

Of these, we adjust the first four variables (SIMD extension level, element type, lane number, CPU) externally, from outside the benchmark code. We control the final two variables (algorithm and input length) with loops inside regular Rust benchmark code.
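To get a feel for the size of this experiment, the counts in the list can be multiplied out. The sketch below is only arithmetic. It assumes that the Intel machine, which lacks avx512f, is benchmarked at just two SIMD levels. The function and constant names are illustrative.

```rust
/// Counts taken from the list of variables above.
const ELEMENT_TYPES: usize = 10;
const LANE_NUMBERS: usize = 5;
const ALGORITHMS: usize = 5;
const INPUT_LENGTHS: usize = 4;

/// Number of `cargo bench` invocations (one compilation each) on a CPU
/// that supports `simd_levels` of the three SIMD extension levels.
fn external_runs(simd_levels: usize) -> usize {
    simd_levels * ELEMENT_TYPES * LANE_NUMBERS
}

/// Measurements produced by the loops inside a single run.
fn measurements_per_run() -> usize {
    ALGORITHMS * INPUT_LENGTHS
}

fn main() {
    // Assumption: the AMD 7950X runs sse2, avx2, and avx512f (3 levels);
    // the Intel i5-8250U runs only sse2 and avx2 (2 levels).
    let runs = external_runs(3) + external_runs(2);
    let rows = runs * measurements_per_run();
    println!("{runs} runs, {rows} measurements");
}
```

Under that assumption the total is 250 runs and 5000 measurements, which agrees with the 5000-line results file mentioned in the Analysis section.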

Getting Started with Criterion

To add benchmarking to your project, add this dev dependency and create a subfolder:

cargo add criterion --dev --features html_reports
mkdir benches

In Cargo.toml add:

[[bench]]
name = "bench"
harness = false

Create benches/bench.rs. Here is a sample:

#![feature(portable_simd)]
#![feature(array_chunks)]
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use is_consecutive1::*;

// create a string from the SIMD extension used
const SIMD_SUFFIX: &str = if cfg!(target_feature = "avx512f") {
    "avx512f,512"
} else if cfg!(target_feature = "avx2") {
    "avx2,256"
} else if cfg!(target_feature = "sse2") {
    "sse2,128"
} else {
    "error"
};
type Integer = i32;
const LANES: usize = 64;
// compare against this
#[inline]
pub fn is_consecutive_regular(chunk: &[Integer; LANES]) -> bool {
    for i in 1..LANES {
        if chunk[i - 1].checked_add(1) != Some(chunk[i]) {
            return false;
        }
    }
    true
}
// define a benchmark called "simple"
fn simple(c: &mut Criterion) {
    let mut group = c.benchmark_group("simple");
    group.sample_size(1000);
    // generate about 1 million aligned elements
    let parameter: Integer = 1_024_000;
    let v = (100..parameter + 100).collect::<Vec<_>>();
    let (prefix, simd_chunks, remainder) = v.as_simd::<LANES>(); // split off the unaligned ends
    let v = &v[prefix.len()..v.len() - remainder.len()]; // keep aligned part
    group.bench_function(format!("regular,{}", SIMD_SUFFIX), |b| {
        b.iter(|| {
            let _: usize = black_box(
                v.array_chunks::<LANES>()
                    .map(|chunk| is_consecutive_regular(chunk) as usize)
                    .sum(),
            );
        });
    });
    group.bench_function(format!("splat1,{}", SIMD_SUFFIX), |b| {
        b.iter(|| {
            let _: usize = black_box(
                simd_chunks
                    .iter()
                    .map(|chunk| IsConsecutive::is_consecutive(*chunk) as usize)
                    .sum(),
            );
        });
    });
    group.finish();
}
criterion_group!(benches, simple);
criterion_main!(benches);

If you want to run this example, the code is on GitHub.

Run the benchmark with the command cargo bench. A report will appear in target/criterion/simple/report/index.html. It includes plots like this one, showing Splat1 running many times faster than Regular.

Thinking Outside the Criterion Box

We have a problem. We want to benchmark sse2 vs. avx2 vs. avx512f, which generally requires a separate compilation and criterion run for each.

Here’s our approach:

Use a Bash script to set environment variables and call benchmarking.
For example, bench.sh:

#!/bin/bash
SIMD_INTEGER_VALUES=("i64" "i32" "i16" "i8" "isize" "u64" "u32" "u16" "u8" "usize")
SIMD_LANES_VALUES=(64 32 16 8 4)
RUSTFLAGS_VALUES=("-C target-feature=+avx512f" "-C target-feature=+avx2" "")

for simdLanes in "${SIMD_LANES_VALUES[@]}"; do
    for simdInteger in "${SIMD_INTEGER_VALUES[@]}"; do
        for rustFlags in "${RUSTFLAGS_VALUES[@]}"; do
            echo "Running with SIMD_INTEGER=$simdInteger, SIMD_LANES=$simdLanes, RUSTFLAGS=$rustFlags"
            SIMD_LANES=$simdLanes SIMD_INTEGER=$simdInteger RUSTFLAGS="$rustFlags" cargo bench
        done
    done
done

Aside: You can easily use Bash on Windows if you have Git and/or VS Code.
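If Bash is not an option, the same sweep can be sketched as a small Rust program built on std::process::Command. This is an illustrative translation of bench.sh, not code from the project. The DRY_RUN switch and the function names are hypothetical, and as written the program only prints each combination.

```rust
use std::process::Command;

const SIMD_INTEGER_VALUES: [&str; 10] = [
    "i64", "i32", "i16", "i8", "isize", "u64", "u32", "u16", "u8", "usize",
];
const SIMD_LANES_VALUES: [u32; 5] = [64, 32, 16, 8, 4];
const RUSTFLAGS_VALUES: [&str; 3] = [
    "-C target-feature=+avx512f",
    "-C target-feature=+avx2",
    "",
];

/// Hypothetical switch: set to `false` to actually launch `cargo bench`.
const DRY_RUN: bool = true;

/// Every (lanes, integer, rustflags) combination, in the script's loop order.
fn combinations() -> Vec<(u32, &'static str, &'static str)> {
    let mut combos = Vec::new();
    for lanes in SIMD_LANES_VALUES {
        for integer in SIMD_INTEGER_VALUES {
            for rustflags in RUSTFLAGS_VALUES {
                combos.push((lanes, integer, rustflags));
            }
        }
    }
    combos
}

/// Build (but do not run) the `cargo bench` command for one combination.
fn bench_command(lanes: u32, integer: &str, rustflags: &str) -> Command {
    let mut command = Command::new("cargo");
    command
        .arg("bench")
        .env("SIMD_LANES", lanes.to_string())
        .env("SIMD_INTEGER", integer)
        .env("RUSTFLAGS", rustflags);
    command
}

fn main() {
    for (lanes, integer, rustflags) in combinations() {
        println!(
            "Running with SIMD_INTEGER={integer}, SIMD_LANES={lanes}, RUSTFLAGS={rustflags}"
        );
        let mut command = bench_command(lanes, integer, rustflags);
        if !DRY_RUN {
            let status = command.status().expect("failed to start cargo");
            if !status.success() {
                eprintln!("cargo bench failed for this combination");
            }
        }
    }
}
```

Keeping the command construction separate from the loop makes it easy to inspect a single combination before committing to the full sweep.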

Use a build.rs to turn these environment variables into Rust configurations:

use std::env;

fn main() {
    // Re-run this script whenever either variable changes,
    // even if it was unset during the previous build.
    println!("cargo:rerun-if-env-changed=SIMD_LANES");
    println!("cargo:rerun-if-env-changed=SIMD_INTEGER");

    // Pass each variable, if set, to rustc as a cfg key/value pair.
    if let Ok(simd_lanes) = env::var("SIMD_LANES") {
        println!("cargo:rustc-cfg=simd_lanes=\"{}\"", simd_lanes);
    }
    if let Ok(simd_integer) = env::var("SIMD_INTEGER") {
        println!("cargo:rustc-cfg=simd_integer=\"{}\"", simd_integer);
    }
}

In benches/bench.rs, turn these configurations into Rust constants and types:

const SIMD_SUFFIX: &str = if cfg!(target_feature = "avx512f") {
    "avx512f,512"
} else if cfg!(target_feature = "avx2") {
    "avx2,256"
} else if cfg!(target_feature = "sse2") {
    "sse2,128"
} else {
    "error"
};

#[cfg(simd_integer = "i8")]
type Integer = i8;
#[cfg(simd_integer = "i16")]
type Integer = i16;
#[cfg(simd_integer = "i32")]
type Integer = i32;
#[cfg(simd_integer = "i64")]
type Integer = i64;
#[cfg(simd_integer = "isize")]
type Integer = isize;
#[cfg(simd_integer = "u8")]
type Integer = u8;
#[cfg(simd_integer = "u16")]
type Integer = u16;
#[cfg(simd_integer = "u32")]
type Integer = u32;
#[cfg(simd_integer = "u64")]
type Integer = u64;
#[cfg(simd_integer = "usize")]
type Integer = usize;
#[cfg(not(any(
    simd_integer = "i8",
    simd_integer = "i16",
    simd_integer = "i32",
    simd_integer = "i64",
    simd_integer = "isize",
    simd_integer = "u8",
    simd_integer = "u16",
    simd_integer = "u32",
    simd_integer = "u64",
    simd_integer = "usize"
)))]
type Integer = i32;
const LANES: usize = if cfg!(simd_lanes = "2") {
    2
} else if cfg!(simd_lanes = "4") {
    4
} else if cfg!(simd_lanes = "8") {
    8
} else if cfg!(simd_lanes = "16") {
    16
} else if cfg!(simd_lanes = "32") {
    32
} else {
    64
};
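One consequence of the fallbacks above is that an unrecognized value, such as a misspelled SIMD_INTEGER or a SIMD_LANES of 128, silently benchmarks i32 or 64 lanes instead. A possible guard is to validate the value before emitting the cfg directive. The helper below is a hypothetical sketch, not part of the original build.rs. In a real build script the values would come from env::var rather than string literals.

```rust
/// Integer names that the benchmark code has a `#[cfg(simd_integer = "...")]` alias for.
const ALLOWED_INTEGERS: [&str; 10] = [
    "i8", "i16", "i32", "i64", "isize", "u8", "u16", "u32", "u64", "usize",
];
/// Lane counts that the benchmark code has a `cfg!(simd_lanes = "...")` branch for.
const ALLOWED_LANES: [&str; 6] = ["2", "4", "8", "16", "32", "64"];

/// Return the `cargo:rustc-cfg` directive for `key`, or an error message
/// if `value` is not one of the `allowed` spellings.
fn cfg_directive(key: &str, value: &str, allowed: &[&str]) -> Result<String, String> {
    if allowed.contains(&value) {
        Ok(format!("cargo:rustc-cfg={key}=\"{value}\""))
    } else {
        Err(format!("{key}={value:?} is not one of {allowed:?}"))
    }
}

fn main() {
    // Stand-ins for env::var("SIMD_INTEGER") and env::var("SIMD_LANES").
    let settings = [
        ("simd_integer", "i16", &ALLOWED_INTEGERS[..]),
        ("simd_lanes", "128", &ALLOWED_LANES[..]),
    ];
    for (key, value, allowed) in settings {
        match cfg_directive(key, value, allowed) {
            Ok(line) => println!("{line}"),
            // A real build script could panic here to stop the build.
            Err(message) => eprintln!("rejected: {message}"),
        }
    }
}
```

Failing the build on a bad value costs a few lines and avoids a results file whose labels do not match what was actually measured.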

In benches/bench.rs, create a benchmark id that records the combination of variables you are testing, separated by commas. This can be either a string or a criterion BenchmarkId. I created a BenchmarkId by calling create_benchmark_id::<Integer>("regular", LANES, *parameter), which is defined as:

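// imports this function needs
use criterion::BenchmarkId;
use std::any::type_name;
use std::mem;
use std::simd::SimdElement;

fn create_benchmark_id<T>(name: &str, lanes: usize, parameter: usize) -> BenchmarkId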
where
    T: SimdElement,
{
    BenchmarkId::new(
        format!(
            "{},{},{},{},{}",
            name,
            SIMD_SUFFIX,
            type_name::<T>(),
            mem::size_of::<T>() * 8,
            lanes,
        ),
        parameter,
    )
}
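To see what the id looks like, here is a standalone sketch of just the string-building part, with the criterion BenchmarkId and the SimdElement bound removed so that it runs on stable Rust. The name id_string is hypothetical. The exact text returned by type_name is not formally guaranteed by the standard library, although for primitive integers it is the plain type name in practice.

```rust
use std::any::type_name;
use std::mem;

/// Build the comma-separated function name used in the benchmark id,
/// mirroring the `format!` call in `create_benchmark_id`.
fn id_string<T>(name: &str, simd_suffix: &str, lanes: usize) -> String {
    format!(
        "{},{},{},{},{}",
        name,
        simd_suffix,
        type_name::<T>(),
        mem::size_of::<T>() * 8, // element width in bits
        lanes,
    )
}

fn main() {
    // The suffix is passed in here; in the benchmark it is the SIMD_SUFFIX constant.
    let id = id_string::<i16>("regular", "avx2,256", 16);
    println!("{id}");
}
```

Because the suffix itself contains a comma, the id expands to six comma-separated fields: algorithm, SIMD level, SIMD width in bits, element type, element width in bits, and lanes.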

For tabulation and analysis, I like benchmark results as comma-separated values (CSVs). Criterion has moved away from *.csv files and toward *.json files. To extract *.csv from *.json, I created a new cargo command that you can use: criterion-means.

Install:

cargo install cargo-criterion-means

Run:

cargo criterion-means > results.csv

Output Example:

Group,Id,Parameter,Mean(ns),StdErr(ns)
vector,regular,avx2,256,i16,16,16,1024,291.47,0.080141
vector,regular,avx2,256,i16,16,16,10240,2821.6,3.3949
vector,regular,avx2,256,i16,16,16,102400,28224,7.8341
vector,regular,avx2,256,i16,16,16,1024000,287220,67.067
# ...
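Because the id contains commas, each data row has ten fields even though the header names only five columns. The sketch below shows one way to split a row into typed fields using only the standard library. The struct and its field names are illustrative, and nanoseconds per element is an assumed metric rather than one defined by criterion-means.

```rust
/// One data row, with the comma-separated id expanded into its parts.
#[derive(Debug, PartialEq)]
struct Row {
    group: String,
    algorithm: String,
    simd_level: String,
    simd_bits: u32,
    element_type: String,
    element_bits: u32,
    lanes: u32,
    input_length: u64,
    mean_ns: f64,
    std_err_ns: f64,
}

impl Row {
    /// Assumed metric: mean nanoseconds spent per input element.
    fn ns_per_element(&self) -> f64 {
        self.mean_ns / self.input_length as f64
    }
}

/// Parse one line; returns `None` for the header, comments, or malformed rows.
fn parse_row(line: &str) -> Option<Row> {
    let fields: Vec<&str> = line.trim().split(',').collect();
    if fields.len() != 10 {
        return None;
    }
    Some(Row {
        group: fields[0].to_string(),
        algorithm: fields[1].to_string(),
        simd_level: fields[2].to_string(),
        simd_bits: fields[3].parse().ok()?,
        element_type: fields[4].to_string(),
        element_bits: fields[5].parse().ok()?,
        lanes: fields[6].parse().ok()?,
        input_length: fields[7].parse().ok()?,
        mean_ns: fields[8].parse().ok()?,
        std_err_ns: fields[9].parse().ok()?,
    })
}

fn main() {
    let line = "vector,regular,avx2,256,i16,16,16,1024,291.47,0.080141";
    match parse_row(line) {
        Some(row) => println!("{:?}: {:.4} ns/element", row, row.ns_per_element()),
        None => println!("skipped: {line}"),
    }
}
```

A spreadsheet or a data frame library handles the same split automatically. The point here is only to make the column layout explicit.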

Analysis

A CSV file is suitable for analysis via spreadsheet pivot tables or data frame tools such as Polars.

For example, here is the top of my 5000-line Excel data file:

Columns A to J came from the benchmark. Columns K to N are calculated by Excel.

Here is a pivot table (and chart) based on the data. It shows the effect of varying the number of SIMD lanes on throughput. The chart averages across element type and input length. The chart suggests that for the best algorithms, either 32 or 64 lanes is best.
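The same kind of pivot can be sketched in code. The example below groups rows by lane count and averages throughput, taken here to mean elements processed per nanosecond, which is an assumption about the metric behind the chart. It uses only the four sample rows from the output example, which all have 16 lanes, so it produces a single group. The function name is illustrative.

```rust
use std::collections::BTreeMap;

/// Average throughput (elements per nanosecond) for each lane count.
/// Each input tuple is (lanes, input_length, mean_ns).
fn mean_throughput_by_lanes(rows: &[(u32, u64, f64)]) -> BTreeMap<u32, f64> {
    // For each lane count, accumulate (sum of throughputs, number of rows).
    let mut sums: BTreeMap<u32, (f64, usize)> = BTreeMap::new();
    for &(lanes, input_length, mean_ns) in rows {
        let entry = sums.entry(lanes).or_insert((0.0, 0));
        entry.0 += input_length as f64 / mean_ns;
        entry.1 += 1;
    }
    sums.into_iter()
        .map(|(lanes, (total, count))| (lanes, total / count as f64))
        .collect()
}

fn main() {
    // The four sample rows from the output example (regular, avx2, i16, 16 lanes).
    let rows = [
        (16, 1_024, 291.47),
        (16, 10_240, 2821.6),
        (16, 102_400, 28224.0),
        (16, 1_024_000, 287220.0),
    ];
    for (lanes, throughput) in mean_throughput_by_lanes(&rows) {
        println!("{lanes} lanes: {throughput:.3} elements/ns");
    }
}
```

With the full results file, the map would hold one entry per lane count, which is the series a chart like the one described would plot.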

With this analysis, we can now choose our algorithm and decide how we want to set the LANES parameter.

Conclusion

Thank you for joining me for this journey into Criterion benchmarking.

If you’ve not used Criterion before, I hope this encourages you to try it. If you’ve used Criterion but couldn’t get it to measure everything you cared about, I hope this gives you a path forward. Embracing Criterion in this expanded manner can unlock deeper insights into the performance characteristics of your Rust projects.

Please follow Carl on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.