Nine Rules for Creating Fast, Safe, and Compatible Data Structures in Rust (Part 1)

Lessons from RangeSetBlaze

Carl M. Kadie
Towards Data Science

--

Storing numbers in a tree — Source: Stable Diffusion

This year, I developed a new Rust crate called range-set-blaze that implements the range set data structure. A range set is a useful (if obscure) data structure that stores sets of integers as sorted and disjoint ranges. For example, it stores these three ranges:

100..=2_393, 20_303..=30_239_000, 501_000_013..=501_000_016

rather than 30,220,996 individual integers. In addition to potential memory savings, range-set-blaze provides efficient set operations such as union, intersection, complement, difference, and symmetric difference.

While creating range-set-blaze, I learned nine rules that can help you create data structures in Rust. Beyond data structures, many of the rules can help you improve the performance and compatibility of any Rust code.

The rules are:

  1. Plagiarize your API, documentation, and even code — from the standard library.
  2. Design constructors for ease-of-use, compatibility, and speed.
  3. Create more Rust iterators than you expect.
  4. Make illegal values unrepresentable with traits.
  5. Define generic iterators with guaranteed properties and useful methods.

Covered in Part 2:

6. Define operators and fast operations.

7. Follow the “Nine Rules of Good API Design” especially “write good documentation”.

8. Optimize performance using representative data, Criterion Benchmarking, and profiling.

9. Test coverage, docs, traits, compiler errors, and correctness.

Before looking at the first five rules, let’s see when range-set-blaze might be useful, how its set operations work, and how it compares to other range set crates.

Usefulness: Imagine running ten billion statistical experiments on an unreliable cluster. Each task on the cluster runs a few experiments. Each experiment produces one line of output with an experiment number. So, one task might put this into a file:

What data structure would you use to find which experiments are missing and need to be re-submitted? One option: store the outputted experiment numbers in a BTreeSet and then do a linear scan for gaps.

A faster and more memory-efficient option: use a range set. Eight years ago, I created IntRangeSet, a range set in Python to solve this problem. Now, I would use range-set-blaze in Rust (example code).

Set Operations: Here is a simple example of the union operator (|):

use range_set_blaze::RangeSetBlaze;
// a is the set of integers from 100 to 499 (inclusive) and 501 to 1000 (inclusive)
let a = RangeSetBlaze::from_iter([100..=499, 501..=999]);
// b is the set of integers -20 and the range 400 to 599 (inclusive)
let b = RangeSetBlaze::from_iter([-20..=-20, 400..=599]);
// c is the union of a and b, namely -20 and 100 to 999 (inclusive)
let c = a | b;
assert_eq!(c, RangeSetBlaze::from_iter([-20..=-20, 100..=999]));

Aside: See the project’s README.md for another set operator example from biology. That example uses the RangeSetBlaze struct to find the intron regions of a gene from the transcription region and the exon regions.

Comparison with other range-related crates

Benefits: Although Rust’s crates.io already contains several range set crates, I hoped my version could offer full set operations while being performant. Through optimizations both fowl and fair, I believe it achieves these goals (see the benchmark report). For example, it can intake individual integers 75 times faster than the most popular range set crate (because the other crate doesn’t special-case individual intake — but it could easily add this optimization). On another benchmark, range-set-blaze — using a hybrid algorithm — is 30% to 600% faster at unioning two sets.

Deficits: Compared to other range-related crates, range-set-blaze has two important deficits. First, it works only with Rust integer types. Most other crates handle any element that can be sorted (dates, floats, IP addresses, strings, etc.). Second, it only offers sets. Many other crates also handle mappings. With interest (and perhaps help) these deficits might be addressed in the future.

Creating a data structure requires many decisions. Based on my experience with range-set-blaze, here are the decisions I recommend. To avoid wishy-washiness, I’ll express these recommendations as rules. Of course, every data structure is different, so not every rule will apply to every data structure.

This article covers rules 1 to 5. Part 2 covers rules 6 to 9.

Rule 1: Plagiarize the API, Documentation, and even Code — from the Standard Library

Find a similar data structure in the standard library and study its documentation line-by-line. I picked BTreeSet as my model. It can store sets of integers in a cache-efficient balanced tree.

Aside: Later, in benchmarking (Rule 8), we’ll see that range_set_blaze::RangeSetBlaze can be 1000’s of times faster than BTreeSet on some ‘clumpy’ integer sets.

BTreeSet offers 28 methods, for example, clear and is_subset. It also implements 18 traits, for example, FromIterator<T>. Here is the BTreeSet documentation for clear and the RangeSetBlaze documentation for clear:

You can see that I mostly just copied. I changed “elements” to “integer elements” to remind users what RangeSetBlaze supports. I removed where A: Clone because all integers are necessarily cloneable. Notice that Rust documentation includes a “source” link. This makes copying easy.

Copying offers these advantages:

  • It tells you what methods to provide. In other words, it gives you a starting point for your API (application programming interface). This saves design time. Moreover, users will understand and expect these methods. You may even be able to make your data structure a drop-in replacement for a standard data structure.
  • You get documentation text and documentation tests for almost free.
  • You may even be able to copy code. For example, here is the code for is_superset for BTreeSet and, then, RangeSetBlaze :
#[must_use]
#[stable(feature = "rust1", since = "1.0.0")]
pub fn is_superset(&self, other: &BTreeSet<T, A>) -> bool
where
T: Ord,
{
other.is_subset(self)
}
#[must_use]
pub fn is_superset(&self, other: &RangeSetBlaze<T>) -> bool {
other.is_subset(self)
}

The BTreeSet code reminded me that superset can be defined in terms of subset and that #[must use] is a thing and appropriate here.

You may decide not to support everything in the standard data structure. For example, I skipped new_in, an experimental feature. Likewise, the standard library supports maps (not just sets), any sortable element (not just integers), and Serde serialization. For me, these are possible future features.

You may also decide to support something differently. For example, BTreeSet::first returns an Option<&T> but RangeSetBlaze::first returns an Option<T>. I know T is a cheap-to-clone integer and so it doesn’t need to be a reference.

Aside: Doesn’t Rust have a generic Set trait, that tells what methods all sets should implement and even provides some default implementations (for example, is_superset in terms of is_subset?) No, but it is being discussed.

You will likely also decide to support more methods than the standard data structure. For example, RangeSetBlaze::len, like BTreeSet::len, returns the number of elements in the set. However, RangeSetBlaze also offers ranges_len, which returns the number of sorted, disjoint ranges in the set.

Rule 2: Design Constructors for Ease-of-Use, Compatibility, and Speed

If it makes sense to have an empty version of your data structure, you’ll want to define a new method and a Default::default method.

Similarly, if it makes sense to fill your data structure from an iterator, you’ll want to define FromIterator::from_iter methods. These automatically define collect methods, too. Like BTreeSet, RangeSetBlaze accepts iterators of integers and references to integers. (Accepting references is important because many Rust iterators provide references.) Here is an example of from_iter and collect in use:

let a0 = RangeSetBlaze::from_iter([3, 2, 1, 100, 1]);
let a1: RangeSetBlaze<i32> = [3, 2, 1, 100, 1].into_iter().collect();
assert!(a0 == a1 && a0.to_string() == "1..=3, 100..=100");

RangeSetBlaze also accepts iterators of inclusive ranges and references to such ranges. It places no constraints on the input ranges. They can be out-of-order, overlapping, empty, or repeated.

#[allow(clippy::reversed_empty_ranges)]
let a0 = RangeSetBlaze::from_iter([1..=2, 2..=2, -10..=-5, 1..=0]);
#[allow(clippy::reversed_empty_ranges)]
let a1: RangeSetBlaze<i32> = [1..=2, 2..=2, -10..=-5, 1..=0].into_iter().collect();
assert!(a0 == a1 && a0.to_string() == "-10..=-5, 1..=2");

Finally, consider defining additional From::from methods. These automatically define an into methods. For example, for compatibility with BTreeSet, RangeSetBlaze defines a From::from method on arrays.

let a0 = RangeSetBlaze::from([3, 2, 1, 100, 1]);
let a1: RangeSetBlaze<i32> = [3, 2, 1, 100, 1].into();
assert!(a0 == a1 && a0.to_string() == "1..=3, 100..=100")

RangeSetBlaze also defines from_sorted_disjoint/into_range_set_blaze for iterators of ranges that are guaranteed to be sorted and disjoint. (We’ll see in Rule 5, how we enforce this guarantee with special traits and the Rust compiler.)

let a0 = RangeSetBlaze::from_sorted_disjoint(CheckSortedDisjoint::from([-10..=-5, 1..=2]));
let a1: RangeSetBlaze<i32> = CheckSortedDisjoint::from([-10..=-5, 1..=2]).into_range_set_blaze();
assert!(a0 == a1 && a0.to_string() == "-10..=-5, 1..=2");

Aside: Why from_sorted_disjoint/into_range_set_blaze instead of from_iter /collect or from/into? See this discussion and this discussion.

For each of your constructors, think about possible speed ups and optimizations. RangeSetBlaze implements this optimization in from_iter:

  • collects adjacent (possibly unsorted) integers/ranges into disjoint ranges, O(n₁)
  • sort the disjoint ranges by their start, O(n₂ log n₂)
  • merge adjacent ranges, O(n₂)
  • create a BTreeMap from the now sorted & disjoint ranges, O(n₃ log n₃)

where n₁ is the number of input integers/ranges, n₂ is the number of disjoint & unsorted ranges, and n₃ is the final number of sorted & disjoint ranges.

What is the effect of ‘clumpy’ integers? If n₂ ≈ sqrt(n₁), then construction is O(n₁). (Indeed, as long as n₂n₁/ln(n₁), construction is O(n₁).) In benchmarks, this turns into a 700 time speed up over HashSet and BTreeSet on clumpy integer iterators.

Rule 3: Create More Rust Iterators than You Expect

How many different iterator types would you guess the standard BTreeSet defines?

The answer is eight: Iter, IntoIter, DrainFilter, Range, Difference, SymmetricDifference, Intersection, and Union. Many non-Rust programming languages can turn any method into an iterator/generator, with a few “yield” statements. Rust, however, doesn’t offer this (but it is under discussion). So, pretty much every method related to iteration will require you to define a new iterator struct type. These structs will, at a minimum, implement a next method that returns either Some(value) or None.

RangeSetBlaze and its related types define 13 iterators structs. Let’s look at two of them.

First, a user can call ranges and iterate through the integers as a sequence of sorted, disjoint ranges. (Recall that RangeSetBlaze accepts unsorted, overlapping ranges but stores sorted, disjoint ranges.)

use range_set_blaze::RangeSetBlaze;
let set = RangeSetBlaze::from_iter([30..=40, 15..=25, 10..=20]);
let mut ranges = set.ranges();
assert_eq!(ranges.next(), Some(10..=25));
assert_eq!(ranges.next(), Some(30..=40));
assert_eq!(ranges.next(), None);

Internally, RangeSetBlaze uses a standard BTreeMap to hold the range information. So, the RangeSetBlaze::ranges method constructs a RangesIter struct containing a BTreeMap::Iter. We then have the RangesIter::next method call BTreeMap::Iter's next method and translates the result into the desired type. Here is the code:

impl<T: Integer> RangeSetBlaze<T> {
pub fn ranges(&self) -> RangesIter<'_, T> {
RangesIter {
iter: self.btree_map.iter(),
}
}
}

#[derive(Clone, Debug)]
#[must_use = "iterators are lazy and do nothing unless consumed"]
pub struct RangesIter<'a, T: Integer> {
pub(crate) iter: btree_map::Iter<'a, T, T>,
}

impl<'a, T: Integer> Iterator for RangesIter<'a, T> {
type Item = RangeInclusive<T>;
fn next(&mut self) -> Option<Self::Item> {
self.iter.next().map(|(start, end)| *start..=*end)
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.iter.size_hint()
}
}

Second, a user may want to call iter and iterate through the integers one at a time, in sorted order. In that case, we’ll return a struct called Iter that contains a RangeIter and then iterates the integers inside the ranges one at a time. Here is the original code for Iter::next. It is followed by a discussion of points of interest.

impl<T: Integer, I> Iterator for Iter<T, I>
where
I: Iterator<Item = RangeInclusive<T>> + SortedDisjoint,
{
type Item = T;
fn next(&mut self) -> Option<T> {
loop {
if let Some(range) = self.option_range.clone() {
let (start, end) = range.into_inner();
debug_assert!(start <= end && end <= T::safe_max_value());
if start < end {
self.option_range = Some(start + T::one()..=end);
} else {
self.option_range = None;
}
return Some(start);
} else if let Some(range) = self.iter.next() {
self.option_range = Some(range);
continue;
} else {
return None;
}
}
}

The SortedDisjoint trait relates to guaranteeing that the inside iterator provides sorted, disjoint ranges. We’ll discuss it in Rule 5.

The option_range field holds the range (if any) from which we are currently returning integers. We use loop and continue to fill option_range when it is empty. This loop never loops more than twice, so I could have used recursion. However, some of the other iterators recurse enough to cause a stack overflow. So, …

Tail recursion optimization is not guaranteed in Rust. My policy: Never use recursion in next functions.

Aside: Thanks to Michael Roth, the current version of Iter::next is now shorter. His pull request is here.

Both BTreeSet and RangeSetBlaze define an into_iter iterator method in addition to the iter method. Similarly, RangeSetBlaze defines an into_ranges iterator method in addition to its ranges method. These into_whatever methods take ownership of the underlying RangeSetBlaze which is sometimes useful.

Rule 4: Make Illegal Values Unrepresentable with Traits

I said that RangeSetBlaze only works on integers, but what stops you applying it to characters like so?

use range_set_blaze::RangeSetBlaze;

fn _some_fn() {
let _char_set = RangeSetBlaze::from_iter(['a', 'b', 'c', 'd']);
}

The answer? The compiler stops you. It returns this error message:

let _char_set = RangeSetBlaze::from_iter(['a', 'b', 'c', 'd']);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| the trait `Integer` is not implemented for `char`
|
= help: the following other types implement trait `Integer`:
i128
i16
i32
i64
i8
isize
u128
u16
and $N others

To make this happen, RangeSetBlaze defines a trait it calls Integer. Here is that definition (with all the super traits that I found that I needed):

pub trait Integer:
num_integer::Integer
+ FromStr
+ fmt::Display
+ fmt::Debug
+ std::iter::Sum
+ num_traits::NumAssignOps
+ FromStr
+ Copy
+ num_traits::Bounded
+ num_traits::NumCast
+ Send
+ Sync
+ OverflowingSub
+ SampleUniform
{
// associated type SafeLen definition not shown ...
fn safe_len(range: &RangeInclusive<Self>) -> <Self as Integer>::SafeLen;
fn safe_max_value() -> Self { Self::max_value() }
fn f64_to_safe_len(f: f64) -> Self::SafeLen;
fn safe_len_to_f64(len: Self::SafeLen) -> f64;
fn add_len_less_one(a: Self, b: Self::SafeLen) -> Self;
fn sub_len_less_one(a: Self, b: Self::SafeLen) -> Self;
}

I, next, implemented trait Integer on all the integer types of interest (u8 to u128 including usize, i8 to i128 including isize). For example,

impl Integer for i32 {
#[cfg(target_pointer_width = "64")]
type SafeLen = usize;
fn safe_len(r: &RangeInclusive<Self>) -> <Self as Integer>::SafeLen {
r.end().overflowing_sub(*r.start()).0 as u32 as <Self as Integer>::SafeLen + 1
}
fn safe_len_to_f64(len: Self::SafeLen) -> f64 {len as f64}
fn f64_to_safe_len(f: f64) -> Self::SafeLen {f as Self::SafeLen}
fn add_len_less_one(a: Self, b: Self::SafeLen) -> Self {a + (b - 1) as Self}
fn sub_len_less_one(a: Self, b: Self::SafeLen) -> Self {a - (b - 1) as Self}
}

With this, I could make the code generic on <T: Integer>, as seen in the code sample in Rule 3.

Aside: Why doesn’t Rust offer a single standard ‘integer’ trait that does everything? Here is a discussion.

Rule 5: Define Generic Iterators with Guaranteed Properties and Useful Methods

RangeSetBlaze’s from_sorted_disjoint constructor assumes an input of sorted and disjoint ranges. This lets RangeSetBlaze avoid work. But what if the assumption is wrong? For example, what happens if we give it unsorted ranges, like so?

use range_set_blaze::RangeSetBlaze;

fn _some_fn() {
let not_guaranteed = [5..=6, 1..=3, 3..=4].into_iter();
let _range_set_int = RangeSetBlaze::from_sorted_disjoint(not_guaranteed);
}

As in Rule 4, the compiler catches the error and returns a useful message:

7 |     let _range_set_int = RangeSetBlaze::from_sorted_disjoint(not_guaranteed);
| ----------------------------------- ^^^^^^^^^^^^^^
the trait `SortedDisjoint<_>` is not implemented for `std::array::IntoIter<RangeInclusive<{integer}>, 3>`
| |
| required by a bound introduced by this call
|
= help: the following other types implement trait `SortedDisjoint<T>`:
CheckSortedDisjoint<T, I> ...

To make this happen, RangeSetBlaze defines trait SortedDisjoint. Here is the relevant definition:

pub trait SortedStarts<T: Integer>: Iterator<Item = RangeInclusive<T>> {}
pub trait SortedDisjoint<T: Integer>: SortedStarts<T> {
// methods not shown, yet
}

This says SortedDisjoint is generic across integers and that every SortedDisjoint must be a SortedStarts. Moreover, all SortedStarts are iterators of ranges of integers.

Aside: My project needs two new traits because I need to guarantee two different properties. A project that needs to guarantee only one property will need only one new trait.

So, what’s the point? Why introduce new traits when we could just use Iterator<Item = RangeInclusive<T> directly? As I learned from Rüdiger Klaehn’s wonderful sorted-iter crate, we can use these new traits to enforce guarantees. For example, this constructor function uses a where clause to accept only guaranteed (sorted and disjoint) iterators of integers:

impl<T: Integer> RangeSetBlaze<T> {
pub fn from_sorted_disjoint<I>(iter: I) -> Self
where
I: SortedDisjoint<T>,
{
// ... code omitted ...
}
}

And how do guaranteed iterators get the required SortedDisjoint trait? They implement it! For example, we know the RangeSetBlaze::ranges method returns a iterator RangesIter of sorted and disjoint ranges, so we have RangesIter implement the SortedDisjoint trait like so:

impl<T: Integer> SortedStarts for RangesIter<'_, T> {}
impl<T: Integer> SortedDisjoint for RangesIter<'_, T> {}

That’s it. We have just marked RangesIter as being SortedDisjoint. The compiler will do the rest.

Not guaranteed to guaranteed: I also marked an iterator called CheckSortedDisjoint as being SortedDisjoint. Interestingly, it iterates over a not guaranteed internal iterator. How can that be OK? Well, while it iterates, it also checks — panicking if it finds any unsorted or overlapping ranges. The result is a guaranteed iterator.

Sometimes guaranteed outside iterators: What about an iterator that is sometimes SortedDisjoint and sometimes not? The popular Itertools::tee method, for example, turns any iterator into two iterators with the same contents. Here is how we say that if its input iterator is SortedDisjoint, its output iterators will be, too:

impl<T: Integer, I: SortedDisjoint<T>> SortedDisjoint<T> for Tee<I> {}

Defining methods: Almost as a bonus, we can define methods on the generic SortedDisjoint iterator. For example, here we define complement, a method that produces every range of sorted & disjoint integers not in the current iterator.

pub trait SortedDisjoint<T: Integer>: SortedStarts<T> {
fn complement(self) -> NotIter<T, Self>
where
Self: Sized,
{
NotIter::new(self)
}
}

And here is a use example from the complement documentation:

use range_set_blaze::prelude::*;

let a = CheckSortedDisjoint::from([-10i16..=0, 1000..=2000]);
let complement = a.complement();
assert_eq!(complement.to_string(), "-32768..=-11, 1..=999, 2001..=32767");

The complement method uses NotIter, yet another iterator (see Rule 3). NotIter also implements SortedDisjoint. The example additionally uses to_string, another SortedDisjoint method.

To use complement and to_string, the user must either
use range_set_blaze::SortedDisjoint;
or use range_set_blaze::prelude::*;.
The prelude works because the project’s lib.rs says
pub mod prelude; and prelude.rs file contains this pub use that includes SortedDisjoint:

pub use crate::{
intersection_dyn, union_dyn, CheckSortedDisjoint, DynSortedDisjoint, MultiwayRangeSetBlaze,
MultiwayRangeSetBlazeRef, MultiwaySortedDisjoint, RangeSetBlaze, SortedDisjoint,
};

Those are the first five rules for creating a data structure in Rust. See Part 2 for rules 6 to 9.

Aside: If you’re interested in future articles, please follow me on Medium. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write about one article per month.

--

--

Ph.D. in CS and Machine Learning. Retired Microsoft & Microsoft Research. Volunteer, open-source projects related to ML, Genomics, Rust, and Python.