Rust: The Next Big Thing in Data Science

A Contextual Guide for Data Scientists and Analysts

Mahmoud Harmouch
Towards Data Science
25 min read · Apr 24, 2023

Image by Yvette W from Pixabay

TL;DR

Rust stands out as a practical choice in data science thanks to its exceptional performance and strong safety guarantees. While it may not possess all the bells and whistles that Python does, Rust offers outstanding efficiency when handling large datasets. Additionally, developers can draw on an array of libraries explicitly designed for data analysis to streamline their workflow further. With proper mastery of the language’s complexities, those working in the field can gain significant advantages by incorporating Rust into their toolkit.

This article will delve into a range of Rust tools and their application in analyzing the Iris dataset. Despite being less popular than Python or R, Rust is a powerful language for data science projects, and its capabilities make it an excellent option for those seeking to elevate their data science endeavors beyond conventional means.

Note: This article assumes you are familiar with Rust and its ecosystem.

You can find the notebook developed for this article in the following repo:

Who Is This Article For?

Photo by Sigmund on Unsplash

This article was written for developers who prefer Rust as their primary programming language and want to kick off their data science journey. Its purpose is to equip them with the essential tools for exploratory data analysis, including loading, transforming, and visualizing data. Whether you are a beginner seeking to learn more about Rust or an experienced data scientist or analyst eager to employ Rust for your projects, this article will be a valuable resource.

Why Rust?

Photo by Brett Jordan on Unsplash

Over the decades, computer scientists have committed themselves to tackling the security concerns stemming from programming languages like C and C++. Their endeavors have given rise to a class of systems programming languages known as “memory-safe” languages, explicitly designed to prevent the memory-related errors that can pave the way for malicious cyber attacks. Rust is undoubtedly an advanced tool among these options; it enjoys widespread usage and recognition in contemporary times.

For those not in the know, memory-safety concerns refer to a category of vulnerabilities that stem from programming mistakes linked to the mismanagement of memory. These issues can result in security breaches, data corruption, and system crashes. Consequently, there has been a growing emphasis on programming languages specifically crafted to ensure memory safety.

Tech giants like Google have recognized the outsized impact that memory-related problems can have on software security, emphasizing the absolute necessity of utilizing these languages to safeguard against such vulnerabilities¹. This recognition is a powerful testament to the importance of taking proactive steps to protect software from potential threats. It highlights these languages’ role in ensuring a more secure future for software development.

Meta is embracing Rust because of its benefits in terms of performance and security, signaling a new era in software engineering. By leveraging Rust’s modern features and capabilities, Meta has ensured robust product security while achieving greater efficiency and scalability².

The open-source community has warmly welcomed Rust, as evidenced by the Linux kernel’s adoption³. This development allows developers to utilize Rust for crafting dependable and secure software on systems based on Linux.

Rust is a remarkably adaptable programming language that suits a wide range of applications. Whether crafting low-level system code or constructing an OS kernel, Rust can produce high-performance, secure software. Unsurprisingly, IEEE Spectrum recently ranked Rust 20th in its top programming languages for 2022⁴! It is also no surprise that Rust ranked 14th among the most popular languages in the recent Stack Overflow developer survey⁵!

Microsoft, a prominent computer technology company, has likewise expressed the need for a programming language that surpasses the security guarantees of current systems languages⁶. Among the available options, Rust stands out as an open-source language worth choosing for development, with remarkable achievements in terms of safety and speed.

Mozilla partnered with Samsung to create a web browser engine called Servo because of Rust’s aptitude for crafting secure web browsers⁷. The objective of Servo was to develop a pioneering browser engine in Rust, merging Mozilla’s proficiency with web browsers and Samsung’s expertise in hardware. The initiative aimed to produce an innovative web engine that could be used on both desktop computers and mobile devices. By capitalizing on the strengths of both companies, Servo had the potential to deliver unparalleled performance compared to existing web browsers.

Tragically, what was once a promising collaboration came to an abrupt halt as Mozilla unveiled its restructuring strategy in response to the pandemic of 2020⁸. With the disbandment of the Servo crew, many became anxious about the potential impact on Rust’s forward momentum, as the language has become such a critical component in developing secure and resilient applications.

Nevertheless, despite this setback, Rust has emerged as one of today’s most sought-after programming languages and continues to garner more acclaim among developers worldwide. By prioritizing dependability, safety, and efficiency, it is undeniable that Rust will remain a reliable language for crafting secure web applications well into the future.

Pydantic, a well-known open source project, has rewritten its core implementation in Rust, resulting in a significant increase in performance⁹. Pydantic V2 is between 4x and 50x faster than its predecessor, Pydantic V1.9.1, with around a 17x improvement when validating a model with common fields¹⁰.

In a recent announcement, Microsoft revealed its plans to rewrite parts of the Windows kernel in Rust after successfully porting the DWrite font-parsing code to Rust¹¹. This bold move by Microsoft signifies a shift towards programming practices that prioritize safety and efficiency.

As Rust continues to assert its dominance as the language of choice for crafting robust and secure applications across various industries, we can confidently expect a significant reduction in security issues going forward.

So, in short, the primary reasons for using Rust are enhanced safety, speed, and concurrency, that is, the ability to run multiple computations simultaneously.

Rust Advantages.

Photo by Den Harrson on Unsplash

1. C-like Speed.

Rust has been developed to offer lightning-fast performance similar to the C programming language. In addition, it provides the added advantages of memory and thread safety. This makes Rust an ideal option for high-performance gaming, data processing, or networking applications. To illustrate this point further, consider the following code snippet, which efficiently calculates the Fibonacci sequence using Rust:

use std::hint::black_box;

fn fibonacci(n: u64) -> u64 {
    match n {
        0 | 1 => n,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    let mut total: f64 = 0.0;
    for _ in 1..=20 {
        let start = std::time::Instant::now();
        black_box(fibonacci(black_box(40)));
        let elapsed = start.elapsed().as_secs_f64();
        total += elapsed;
    }
    let avg_time = total / 20.0;
    println!("Average time taken: {} s", avg_time);
}

// Average time taken: 0.3688494305 s

The above code snippet calculates the 40th number in the Fibonacci sequence using recursion. It executes in well under a second, much faster than equivalent code in many other languages. Consider Python, for example: computing the same Fibonacci number took approximately 22.2 seconds, which is far slower than the Rust version.

>>> import timeit
>>> def fibonacci(n):
...     if n < 2:
...         return n
...     return fibonacci(n-1) + fibonacci(n-2)
...
>>> timeit.Timer("fibonacci(40)", "from __main__ import fibonacci").timeit(number=1)
22.262923367998155

2. Type Safety.

Rust is designed to catch many errors at compile time rather than runtime, reducing the likelihood of bugs in the final product. Take the following example of Rust code that demonstrates its type safety:

fn add_numbers(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    let a = 1;
    let b = "2";
    let sum = add_numbers(a, b); // Compile error: expected `i32`, found `&str`
    println!("{} + {} = {}", a, b, sum);
}

The above code snippet attempts to add an integer and a string together, which is not allowed in Rust due to type safety. The code fails to compile with a helpful error message that points to the problem.
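
For completeness, one way to make the example compile is to convert the string into an integer explicitly, so the mismatch is handled rather than hidden. The following is a minimal sketch, not part of the original snippet:

fn add_numbers(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    let a = 1;
    let b = "2";
    // parse() returns a Result, so a bad input must be handled explicitly.
    let b_parsed: i32 = b.parse().expect("not a valid integer");
    let sum = add_numbers(a, b_parsed);
    println!("{} + {} = {}", a, b, sum); // 1 + 2 = 3
}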

3. Memory safety.

Rust has been meticulously developed to prevent prevalent memory errors, including buffer overflows and null pointer dereferences, thereby reducing the probability of security vulnerabilities. This is exemplified by the following scenario that showcases Rust’s memory safety measures:

fn main() {
    let mut v = vec![1, 2, 3];
    let first = v.get(0);    // Compile error: immutable borrow occurs here
    v.push(4);               // Compile error: mutable borrow occurs here
    println!("{:?}", first); // Compile error: immutable borrow later used here
}

The above code attempts to append an element to a vector while holding an immutable reference to its first element. This is not allowed in Rust due to memory safety, and the code fails to compile with a helpful error message.
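
One way to satisfy the borrow checker, sketched below on the same example, is to copy the value out of the vector before mutating it, or simply to finish using the immutable borrow before the mutable one begins:

fn main() {
    let mut v = vec![1, 2, 3];
    // Copy the value out instead of holding a borrow across the mutation.
    let first = v.first().copied(); // Option<i32>
    v.push(4);
    println!("{:?}", first); // Some(1)

    // Alternatively, keep the borrow but finish using it before mutating.
    let first_ref = v.get(0);
    println!("{:?}", first_ref); // Some(1)
    v.push(5);
}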

4. True and safe parallelism.

The ownership model of Rust provides a secure and proficient means of parallelism, eliminating data races and other bugs related to concurrency. An illustrative example of Rust’s parallelism is presented below:

use std::thread;

fn main() {
    let mut handles = vec![];
    let mut x = 0;
    for i in 0..10 {
        handles.push(thread::spawn(move || {
            x += 1;
            println!("Hello from thread {} with x = {}", i, x);
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

// Output

// Hello from thread 0 with x = 1
// Hello from thread 1 with x = 1
// Hello from thread 2 with x = 1
// Hello from thread 4 with x = 1
// Hello from thread 3 with x = 1
// Hello from thread 5 with x = 1
// Hello from thread 6 with x = 1
// Hello from thread 7 with x = 1
// Hello from thread 8 with x = 1
// Hello from thread 9 with x = 1

The above code creates ten threads that each print a message to the console. Because x is moved into each closure, every thread receives its own copy of the counter, which is why each one prints x = 1. Rust’s ownership model guarantees that each thread has exclusive access to the data it works with, effectively preventing data races and other concurrency-related bugs.
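
If the threads genuinely need to share and update the same value, Rust makes that intent explicit with the standard library’s Arc and Mutex types. The following is a minimal sketch, not part of the original example:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Arc provides shared ownership across threads; Mutex guarantees
    // exclusive access to the data it guards.
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];
    for _ in 0..10 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            *counter.lock().unwrap() += 1;
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("Final count: {}", *counter.lock().unwrap()); // Final count: 10
}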

5. Rich Ecosystem.

Rust offers a thriving and dynamic ecosystem with diverse libraries and tools catering to a wide range of domains. For instance, Rust provides powerful data analysis tools such as ndarray and polars, and its serde library is dramatically faster at serialization and deserialization than pure-Python JSON libraries.
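
To give a flavor of serde in practice, the sketch below deserializes a JSON string into a strongly typed struct. The struct and field names are made up purely for illustration, and it assumes serde (with the derive feature) and serde_json are added as dependencies:

use serde::Deserialize;

// A hypothetical record type used only for this illustration.
#[derive(Debug, Deserialize)]
struct Measurement {
    sepal_length: f64,
    sepal_width: f64,
}

fn main() -> Result<(), serde_json::Error> {
    let raw = r#"{"sepal_length": 5.1, "sepal_width": 3.5}"#;
    // serde_json parses the JSON straight into the typed struct.
    let m: Measurement = serde_json::from_str(raw)?;
    println!("{:?}", m);
    Ok(())
}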

These advantages and others make Rust an attractive option for developers such as data scientists seeking a convenient programming language that equips them with an extensive list of tools.

Now, with that in mind, let’s explore the different data analysis tools that can be leveraged in Rust to help you efficiently perform exploratory data analysis (EDA).

Rusty Notebooks

Photo by Christopher Gower on Unsplash

Programming enthusiasts will agree that Rust has become a top-tier programming language for several reasons, such as its blazing speed, reliability, and unparalleled flexibility. Nonetheless, novice Rust developers have faced a daunting challenge for a long time: the absence of an easily accessible development environment.

Fortunately, with sheer perseverance and determination, Rust developers have broken through this barrier by providing a groundbreaking solution: accessing Rust through Jupyter Notebook. This is made possible by a phenomenal open-source project known as evcxr_jupyter. It equips developers with the ability to write and execute Rust code in the Jupyter Notebook environment, elevating their programming experience to the next level.

To use evcxr_jupyter, you must first have Jupyter installed. With Jupyter in place, the next step is to install the Rust Jupyter kernel; before doing so, however, make sure Rust itself is installed on your machine.

Getting Started.

The first step is to set up and install Rust on your machine. To do so, head over to the rustup website and follow the instructions, or run the following command:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain nightly

Once Rust is installed, executing the following commands will install the Rust Jupyter Kernel, and you will be on your way to unleashing the full potential of Rust on Jupyter Notebook.

cargo install evcxr_jupyter
evcxr_jupyter --install

Once done, run the following command to start a Jupyter notebook:

jupyter notebook

Now, it is time for exploratory data analysis (EDA).

Required Dependencies

If you are familiar with the Python kernel, you will know its remarkable flexibility in installing libraries using !pip. You will be glad that a similar feature is available in the Rust Jupyter kernel: you can use :dep to install the crates required for EDA.

The installation process is a breeze, as demonstrated by the following code snippet:

:dep polars = { version = "0.28.0", features = ["describe", "ndarray"] }

This crate offers an array of capabilities, including loading and transforming data, among many other functionalities (the describe and ndarray feature flags enable the describe and to_ndarray methods used later on). Now that you have installed the necessary tools, it’s time to select a dataset that will showcase the true power of Rust in EDA. For simplicity, I have opted for the Iris dataset, a popular and easily accessible dataset that provides a solid foundation for demonstrating Rust’s data manipulation capabilities.

About the Dataset

Photo by Pawel Czerwinski on Unsplash

The Iris dataset is essential in data science due to its extensive usage across diverse applications, from statistical analyses to machine learning. With six columns full of information, it is an ideal dataset for exploratory data analysis. Every column offers unique insights into various aspects of the Iris flower’s characteristics and helps us gain a deeper knowledge of this magnificent plant.

  • Id: A unique row identifier. Although it may be significant, we do not need it for our upcoming analyses. Thus, this column will be eliminated from the dataset to streamline our research process effectively.
  • SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm: These columns describe the dimensions of each flower sample’s sepals and petals. The values may include fractional parts, making it necessary to store them as a floating-point data type like f32 for precise calculations.
  • Species: This column holds the specific type of Iris flower being gathered. These values are categorical and need to be treated differently in our analysis. We can convert them into numerical (integer) values, like u32, or leave them as strings for easier handling. For now, we will use the String type to keep things simple (a small struct sketch follows this list).
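
To make these type choices concrete, one row could be modeled as a plain Rust struct. The following is only a hypothetical sketch and is not used in the analysis below:

// Hypothetical representation of a single Iris sample, mirroring the
// column types discussed above (Id is omitted on purpose).
struct IrisRecord {
    sepal_length_cm: f32,
    sepal_width_cm: f32,
    petal_length_cm: f32,
    petal_width_cm: f32,
    species: String,
}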

As you can see, the Iris dataset helps us unravel the distinctive characteristics of the Iris flower, and its potential for providing us with valuable insights is boundless. Our subsequent analyses will harness Rust’s capabilities and those of Polars crate to conduct data manipulations that yield significant findings.

Read CSV files

Photo by Mika Baumeister on Unsplash

To begin with, we need to import the essential modules by utilizing Rust’s remarkable feature of selectively importing necessary components. The following code snippet accomplishes such a task with ease.

use polars::prelude::*;
use polars::frame::DataFrame;
use std::path::Path;

Now that we have everything set up, it’s time to take charge and handle our dataset with precision and effectiveness. Thanks to the comprehensive tools provided by polars, working with data has never been easier; all the necessary components are included in its prelude, which can be imported with a single line of code. Let us begin by importing and processing our data through this powerful tool!

Loading a CSV file into a DataFrame

Photo by Markus Spiske on Unsplash

Let’s dive into the process of loading our CSV file into Polars’ DataFrame through the following snippet of code:

fn read_data_frame_from_csv(csv_file_path: &Path) -> DataFrame {
    CsvReader::from_path(csv_file_path)
        .expect("Cannot open file.")
        .has_header(true)
        .finish()
        .unwrap()
}

let iris_file_path: &Path = Path::new("dataset/Iris.csv");
let iris_df: DataFrame = read_data_frame_from_csv(iris_file_path);

The code first defines a function read_data_frame_from_csv that takes in the CSV file path and returns a DataFrame. Within this function, the code creates a CsvReader object using the from_path method; expect panics with a message if the file cannot be opened, and has_header(true) tells the reader that the first row contains column names. Finally, finish loads the CSV file and returns the resulting DataFrame, which is unwrapped from a PolarsResult.

With this code, we can effortlessly load our CSV dataset into a Polars DataFrame and begin our exploratory data analysis.
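
Panicking with expect and unwrap is perfectly fine in a notebook. In a longer program, you would more likely propagate the error instead; a minimal sketch of the same function returning a PolarsResult (assuming the same polars 0.28 API) could look like this:

fn try_read_data_frame_from_csv(csv_file_path: &Path) -> PolarsResult<DataFrame> {
    // The ? operator forwards any I/O or parsing error to the caller.
    CsvReader::from_path(csv_file_path)?
        .has_header(true)
        .finish()
}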

Dataset Dimensions

Photo by Lewis Guapo on Unsplash

Once we have loaded the dataset into a DataFrame, we can use the shape() method to promptly obtain information about its rows and columns. This lets us determine the number of samples (rows) and features (columns), which forms the basis for further investigation and modeling.

println!("{}", iris_df.shape());
(150, 6)

We can see that shape() returns a tuple, where the first element indicates the number of rows and the second element indicates the number of columns. If you have prior knowledge of the dataset, this is a good indicator of whether your dataset has loaded correctly. This information will also be helpful later when we initialize a new array.

Head

  • Statement:

iris_df.head(Some(5))
  • Output:
shape: (5, 6)
┌─────┬───────────────┬──────────────┬───────────────┬──────────────┬─────────────┐
│ Id ┆ SepalLengthCm ┆ SepalWidthCm ┆ PetalLengthCm ┆ PetalWidthCm ┆ Species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞═════╪═══════════════╪══════════════╪═══════════════╪══════════════╪═════════════╡
│ 1 ┆ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ Iris-setosa │
│ 2 ┆ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ Iris-setosa │
│ 3 ┆ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ Iris-setosa │
│ 4 ┆ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ Iris-setosa │
│ 5 ┆ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ Iris-setosa │
└─────┴───────────────┴──────────────┴───────────────┴──────────────┴─────────────┘

Tail

  • Statement:
iris_df.tail(Some(5))
  • Output:
shape: (5, 6)
┌─────┬───────────────┬──────────────┬───────────────┬──────────────┬────────────────┐
│ Id ┆ SepalLengthCm ┆ SepalWidthCm ┆ PetalLengthCm ┆ PetalWidthCm ┆ Species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞═════╪═══════════════╪══════════════╪═══════════════╪══════════════╪════════════════╡
│ 146 ┆ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ Iris-virginica │
│ 147 ┆ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ Iris-virginica │
│ 148 ┆ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ Iris-virginica │
│ 149 ┆ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ Iris-virginica │
│ 150 ┆ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ Iris-virginica │
└─────┴───────────────┴──────────────┴───────────────┴──────────────┴────────────────┘

Describe

  • Statement:
iris_df.describe(None)
  • Output:
Ok(shape: (9, 7)
┌────────────┬───────────┬────────────┬───────────────┬──────────────┬──────────────┬──────────────┐
│ describe ┆ Id ┆ SepalLengt ┆ SepalWidthCm ┆ PetalLengthC ┆ PetalWidthCm ┆ Species │
│ --- ┆ --- ┆ hCm ┆ --- ┆ m ┆ --- ┆ --- │
│ str ┆ f64 ┆ --- ┆ f64 ┆ --- ┆ f64 ┆ str │
│ ┆ ┆ f64 ┆ ┆ f64 ┆ ┆ │
╞════════════╪═══════════╪════════════╪═══════════════╪══════════════╪══════════════╪══════════════╡
│ count ┆ 150.0 ┆ 150.0 ┆ 150.0 ┆ 150.0 ┆ 150.0 ┆ 150 │
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0 │
│ mean ┆ 75.5 ┆ 5.843333 ┆ 3.054 ┆ 3.758667 ┆ 1.198667 ┆ null │
│ std ┆ 43.445368 ┆ 0.828066 ┆ 0.433594 ┆ 1.76442 ┆ 0.763161 ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 25% ┆ 38.25 ┆ 5.1 ┆ 2.8 ┆ 1.6 ┆ 0.3 ┆ null │
│ 50% ┆ 75.5 ┆ 5.8 ┆ 3.0 ┆ 4.35 ┆ 1.3 ┆ null │
│ 75% ┆ 112.75 ┆ 6.4 ┆ 3.3 ┆ 5.1 ┆ 1.8 ┆ null │
│ max ┆ 150.0 ┆ 7.9 ┆ 4.4 ┆ 6.9 ┆ 2.5 ┆ Iris-virgini │
│ ┆ ┆ ┆ ┆ ┆ ┆ ca │
└────────────┴───────────┴────────────┴───────────────┴──────────────┴──────────────┴──────────────┘

Columns

  • Statement:
let column_names = iris_df.get_column_names();
column_names
  • Output:
["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

Drop Species Column

  • Statement:
let numeric_iris_df: DataFrame = iris_df.drop("Species").unwrap();

Mean

  • Statement:
println!("{}", numeric_iris_df.mean());
  • Output:
shape: (1, 5)
┌──────┬───────────────┬──────────────┬───────────────┬──────────────┐
│ Id ┆ SepalLengthCm ┆ SepalWidthCm ┆ PetalLengthCm ┆ PetalWidthCm │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪═══════════════╪══════════════╪═══════════════╪══════════════╡
│ 75.5 ┆ 5.843333 ┆ 3.054 ┆ 3.758667 ┆ 1.198667 │
└──────┴───────────────┴──────────────┴───────────────┴──────────────┘

Max

  • Statement:
println!("{}", numeric_iris_df.max());
  • Output:
shape: (1, 5)
┌─────┬───────────────┬──────────────┬───────────────┬──────────────┐
│ Id ┆ SepalLengthCm ┆ SepalWidthCm ┆ PetalLengthCm ┆ PetalWidthCm │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═══════════════╪══════════════╪═══════════════╪══════════════╡
│ 150 ┆ 7.9 ┆ 4.4 ┆ 6.9 ┆ 2.5 │
└─────┴───────────────┴──────────────┴───────────────┴──────────────┘

Convert To ndarray

  • Statement:
let numeric_iris_ndarray: ArrayBase<_, _> = numeric_iris_df.to_ndarray::<Float64Type>().unwrap();
numeric_iris_ndarray
  • Output:
[[1.0, 5.1, 3.5, 1.4, 0.2],
[2.0, 4.9, 3.0, 1.4, 0.2],
[3.0, 4.7, 3.2, 1.3, 0.2],
[4.0, 4.6, 3.1, 1.5, 0.2],
[5.0, 5.0, 3.6, 1.4, 0.2],
...,
[146.0, 6.7, 3.0, 5.2, 2.3],
[147.0, 6.3, 2.5, 5.0, 1.9],
[148.0, 6.5, 3.0, 5.2, 2.0],
[149.0, 6.2, 3.4, 5.4, 2.3],
[150.0, 5.9, 3.0, 5.1, 1.8]], shape=[150, 5], strides=[1, 150], layout=Ff (0xa), const ndim=2

In the following sections, we will explore the ndarray crate and use its different methods on our dataset.

Numpy Equivalent

Photo by Nick Hillier on Unsplash

In Rust, there is a robust crate (the Rust term for a package) equivalent to NumPy that allows us to store and manipulate data easily. It is called ndarray, and it provides a multidimensional container that can hold categorical or numerical elements.

It’s worth noting that in Rust, packages are called crates and are published to a registry. The ndarray crate can be found on crates.io, Rust’s equivalent of PyPI.

With ndarray, we can create n-dimensional arrays, perform slicing and views, conduct mathematical operations, and more. These features will be essential when we load our datasets into containers that we can operate on and conduct our analysis.
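
If you are following along in the evcxr notebook, ndarray can be pulled in the same way polars was earlier. The exact version below is an assumption; any recent release should work for the examples that follow:

:dep ndarray = { version = "0.15" }

use ndarray::prelude::*;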

Shared Similarities

Photo by Jonny Clow on Unsplash

The ArrayBase type from the ndarray crate is an essential tool for data manipulation in Rust, equipped with plenty of powerful features. It shares similarities with NumPy’s array type, numpy.ndarray: a single element type per array, an arbitrary number of dimensions, and arbitrary strides. If you want to work with large amounts of data efficiently, ndarray is the way to go.

One fundamental likeness shared by ndarray and NumPy’s array type is that indexing starts from zero, not one. Do not underestimate this seemingly trivial characteristic, as it can have a considerable impact when manipulating extensive datasets.

Let us not overlook another significant similarity: the default memory layout of both ndarray and NumPy’s array type is row-major. In other words, the default iterators follow the logical order of rows. This is valuable when dealing with arrays that exceed memory capacity and cannot be loaded entirely at once.

Arithmetic operators operate on each element individually in both ndarray and NumPy’s array types. In simpler terms, performing a * b leads to element-wise multiplication, not matrix multiplication. The beauty of this functionality is that one can effortlessly execute computations on relatively large arrays.

Owned arrays are contiguous in memory in both ndarray and NumPy’s array type. This means that they are stored in a single block of memory, which can improve performance when accessing elements of the array.

Many operations, such as slicing and arithmetic operations, are also supported by both ndarray and NumPy’s array type. This makes switching between the two array types easy, depending on your needs.

Efficiently performing operations is a crucial aspect that significantly affects processing time and resource usage in the computational data manipulation domain. Slicing, one such operation, is an excellent example due to its low cost — returning only a view of an array instead of duplicating the entire dataset.

At the time of writing, some essential NumPy functionality cannot be found in ndarray. In particular, for binary operations, ndarray can broadcast only the right-hand array to match the left-hand one, whereas NumPy can broadcast both operands simultaneously.

Key Differences

Photo by Eric Prouzet on Unsplash

There are several critical differences between NumPy and ndarray. For one, in NumPy there is no distinction between owned arrays, views, and mutable views; multiple arrays (instances of numpy.ndarray) can mutably reference the same data. In ndarray, on the other hand, all arrays are instances of ArrayBase, but ArrayBase is generic over the ownership of the data: Array owns its data; ArrayView is a view; ArrayViewMut is a mutable view; CowArray either owns its data or is a view (with copy-on-write mutation of the view variant); and ArcArray has a reference-counted pointer to its data (with copy-on-write mutation). Arrays and views follow Rust’s aliasing rules.

Another difference is that in NumPy all arrays have a dynamic number of dimensions, whereas ndarray lets you create arrays with a dimensionality fixed at compile time, such as Array2. This helps the type system catch dimension mismatches and eliminates unnecessary heap allocations for shape and strides.

Finally, when slicing in NumPy, the indices are start, start + step, start + 2 * step, and so on until the end (exclusive). When slicing in ndarray, the axis is first sliced with start..end; then, if the step is positive, the first index is the front of the slice, and if the step is negative, the first index is the back of the slice. This means the behavior is the same as NumPy except when step < -1. Refer to the docs for the s! macro for more details.
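
As a quick illustration of that last point, the sketch below uses the s! macro with a negative step to walk a one-dimensional array from the back:

use ndarray::prelude::*;

fn main() {
    let a = array![1, 2, 3, 4, 5];
    // A negative step starts from the back of the sliced range.
    println!("{}", a.slice(s![..;-1]));   // [5, 4, 3, 2, 1]
    println!("{}", a.slice(s![1..4;-1])); // [4, 3, 2]
}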

Why ndarray?

For seasoned Rust developers, the argument could be made that the language already offers data structures such as vectors, making a third-party crate for handling data unnecessary. However, this assertion fails to recognize the specialized nature of ndarray, which is designed to handle n-dimensional arrays with a mathematical focus.

Rust is undoubtedly a strong programming language that can tackle diverse coding challenges effortlessly. However, regarding complex operations on multidimensional arrays, ndarray is the ultimate solution. Its specialized design enables the seamless execution of advanced data manipulation tasks in scientific computing and analytical contexts, making it an essential tool for any programmer seeking optimal results.

To illustrate this point, consider an example where a researcher needs to manipulate a large amount of multidimensional data from a scientific experiment. Rust’s built-in data structures, such as vectors, may not be optimal for this task, as they lack the advanced features necessary for complex array manipulations. In contrast, ndarray provides an extensive range of functionalities, including slicing, broadcasting, and element-wise operations, that can simplify and expedite data manipulation tasks when analyzing data, as we will explore in the following sections.

Array creation

This section provides plenty of techniques for creating arrays from scratch, enabling users to generate arrays tailored to their specific needs. However, it is worth noting that there are other means of creating arrays beyond this section. For example, arrays can also be generated by performing arithmetic operations on pre-existing arrays.

Now, let’s explore the different functionalities provided by ndarray; a short runnable sketch follows each group of examples:

  • 2 rows × 3 columns Floating-Point Array Literal:
array![[1.,2.,3.], [4.,5.,6.]]
// or
arr2(&[[1.,2.,3.], [4.,5.,6.]])
  • 1-D Range Of Values:
Array::range(0., 10., 0.5) // 0.0, 0.5, 1.0, ... 9.5
  • 1-D array with n elements within a range:
Array::linspace(0., 10., 11)
  • 3×4×5 Ones Array:
Array::ones((3, 4, 5))
  • 3×4×5 Zeros Array:
Array::zeros((3, 4, 5))
  • 3×3 Identity Matrix:
Array::eye(3)
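
Putting a few of these constructors together, here is a small, self-contained sketch (assuming ndarray was added with :dep as shown earlier):

use ndarray::prelude::*;

fn main() {
    let literal = array![[1., 2., 3.], [4., 5., 6.]]; // 2 × 3 array from a literal
    let steps = Array::range(0., 2., 0.5);            // [0.0, 0.5, 1.0, 1.5]
    let ones = Array::<f64, _>::ones((3, 4, 5));      // 3 × 4 × 5 array of ones
    let eye = Array::<f64, _>::eye(3);                // 3 × 3 identity matrix

    println!("{}", literal);
    println!("{}", steps);
    println!("{:?}", ones.shape()); // [3, 4, 5]
    println!("{}", eye);
}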

Indexing and slicing

  • Last Element:
arr[arr.len() - 1]
  • Row 1, Column 4:
arr[[1, 4]]
  • First 5 rows:
arr.slice(s![0..5, ..])
// or
arr.slice(s![..5, ..])
// or
arr.slice_axis(Axis(0), Slice::from(0..5))
  • Last 5 rows:
arr.slice(s![-5.., ..])
// or
arr.slice_axis(Axis(0), Slice::from(-5..))
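
A short sketch showing these indexing and slicing forms on a small 3 × 3 array:

use ndarray::prelude::*;

fn main() {
    let arr = array![[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]];

    println!("{}", arr[[1, 2]]);             // element at row 1, column 2 -> 6
    println!("{}", arr.slice(s![..2, ..]));  // first 2 rows
    println!("{}", arr.slice(s![-2.., ..])); // last 2 rows
    println!("{}", arr.slice(s![.., 0]));    // first column -> [1, 4, 7]
}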

Mathematics

  • Sum:
arr.sum()
  • Sum Along Axis:
// first axis
arr.sum_axis(Axis(0))
// second axis
arr.sum_axis(Axis(1))
  • Mean:
arr.mean().unwrap()
  • Transpose:
arr.t()
// or
arr.reversed_axes()
  • 2-D matrix multiply:
mat1.dot(&mat2)
  • Square Root:
data_2D.mapv(f32::sqrt)
  • Arithmetic:
&a + 1.0
&mat1 + &mat2
&mat1_2D + &mat2_1D
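
And a final sketch exercising the mathematical operations above on two small matrices:

use ndarray::prelude::*;

fn main() {
    let mat1 = array![[1., 2.], [3., 4.]];
    let mat2 = array![[5., 6.], [7., 8.]];

    println!("{}", mat1.sum());             // 10
    println!("{}", mat1.sum_axis(Axis(0))); // column sums -> [4, 6]
    println!("{}", mat1.mean().unwrap());   // 2.5
    println!("{}", mat1.t());               // transpose
    println!("{}", mat1.dot(&mat2));        // 2-D matrix product
    println!("{}", &mat1 + 1.0);            // element-wise addition of a scalar
    println!("{}", mat1.mapv(f64::sqrt));   // element-wise square root
}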

In this section, we have explored various functionalities that ndarray provides: a robust tool that operates on multidimensional containers and offers an array of functions for streamlined data handling. Our exploration has covered the critical elements of using ndarray: creating arrays, determining their dimensions, accessing them via indexing techniques, and executing basic mathematical operations efficiently.

To sum up, ndarray is a valuable asset for developers and data analysts. It offers plenty of methods that efficiently handle multidimensional arrays with ease and accuracy. By mastering the techniques discussed in this section and harnessing the potential of ndarray, users can carry out complex data processing tasks effortlessly while generating faster yet precise insights based on their findings.

Plotters

Photo by Lukas Blazek on Unsplash

Having processed and manipulated our data using ndarray, the next logical step is to gain valuable insights by visualizing it using the Plotters library. This powerful library enables us to create stunning and informative visualizations of our data with ease and precision.

To make the most of the Plotters library alongside jupyter-evcxr, it is necessary to import it beforehand by executing the following command:

:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }

Since evcxr only renders SVG images, no additional backend is needed; we therefore disable the default features and enable just the evcxr backend along with all series types:

default_features = false, features = ["evcxr", "all_series"]

After importing the library, we can utilize its extensive visualization tools to craft captivating and enlightening visuals such as graphs, charts, and other forms. With these visualizations in place, we can easily detect patterns, trends, or insights. This enables data-based decision-making, which yields valuable results.

Let’s first start by drawing a scatter plot of the sepal features.

Scatter Plot

Let’s divide the scatter plot code into chunks for easier reading. Take the following as an instance:

let sepal_samples: Vec<(f64, f64)> = {
    let sepal_length_cm: DataFrame = iris_df.select(vec!["SepalLengthCm"]).unwrap();
    let mut sepal_length = sepal_length_cm.to_ndarray::<Float64Type>().unwrap().into_raw_vec().into_iter();
    let sepal_width_cm: DataFrame = iris_df.select(vec!["SepalWidthCm"]).unwrap();
    let mut sepal_width = sepal_width_cm.to_ndarray::<Float64Type>().unwrap().into_raw_vec().into_iter();
    sepal_width.zip(sepal_length).collect()
};

This code block creates a vector of tuples called sepal_samples, where each tuple represents a sample of sepal length and sepal width measurements from the iris dataset. Now, Let’s go over what each line of the code does:

  • let sepal_samples: Vec<(f64,f64)> = {…}: A variable named sepal_samples is defined and assigned a code block enclosed in curly brackets {…}. The Vec<(f64,f64)> datatype annotation indicates that the vector contains tuples consisting of two 64-bit floating-point numbers. This declaration empowers Rust to effectively identify and handle each tuple within the given dataset.
  • let sepal_length_cm: DataFrame = iris_df.select(vec![“SepalLengthCm”]).unwrap();: To extract the SepalLengthCm column from the iris_df DataFrame, we utilize a select function and store it in a new DataFrame object named sepal_length_cm.
  • let mut sepal_length = sepal_length_cm.to_ndarray::<Float64Type>().unwrap().into_raw_vec().into_iter();: With the to_ndarray method, we can transform the DataFrame object for sepal_length_cm into a ndarray of type Float64Type. From there, using the into_raw_vec method allows us to convert this new array into a raw vector format. By calling upon the iterator generated from running through our now-raw vector with into_iter, we can consume and utilize each element in turn; exciting stuff!
  • let sepal_width_cm: DataFrame = iris_df.select(vec![“SepalWidthCm”]).unwrap();: selects the SepalWidthCm column from the iris_df DataFrame and stores the resulting DataFrame object in a new variable called sepal_width_cm.
  • let mut sepal_width = sepal_width_cm.to_ndarray::<Float64Type>().unwrap().into_raw_vec().into_iter();: With the to_ndarray method, the DataFrame object named sepal_width_cm is converted into a ndarray object with a data type of Float64Type. The resulting ndarray is then transformed into a raw vector through the application of into_raw_vec and finally generates an iterator that can be utilized for consuming its elements by calling on it via .into_iter().
  • sepal_width.zip(sepal_length).collect(): A new iterator is generated by invoking the zip function on sepal_width, with sepal_length passed as an argument. The resulting iterator yields tuples, each comprising one element from sepal width and another from sepal length. These tuples are then gathered using the collect method to form a new vector of type Vec<(f64,f64)>, stored in the variable named sepal_samples.

The next code block looks like the following:

evcxr_figure((640, 480), |root| {
    let mut chart = ChartBuilder::on(&root)
        .caption("Iris Dataset", ("Arial", 30).into_font())
        .x_label_area_size(40)
        .y_label_area_size(40)
        .build_cartesian_2d(1f64..5f64, 3f64..9f64)?;

    chart.configure_mesh()
        .x_desc("Sepal Length (cm)")
        .y_desc("Sepal Width (cm)")
        .draw()?;

    chart.draw_series(sepal_samples.iter().map(|(x, y)| Circle::new((*x, *y), 3, BLUE.filled())));

    Ok(())
}).style("width:60%")
  • evcxr_figure((640, 480), |root| {: A new Evcxr figure is initiated with dimensions of 640 pixels in width and 480 pixels in height. Additionally, a closure that accepts the root parameter which signifies the fundamental drawing region of the declared figure is also passed along.
  • let mut chart = ChartBuilder::on(&root): This creates a new chart builder object using the root drawing area as the base.
  • .caption(“Iris Dataset”, (“Arial”, 30).into_font()): This adds a caption to the chart with the text Iris Dataset and a font Arial with a size of 30.
  • .x_label_area_size(40): This sets the size of the X-axis label area to 40 pixels.
  • .y_label_area_size(40): This sets the size of the Y-axis label area to 40 pixels.
  • .build_cartesian_2d(1f64..5f64, 3f64..9f64)?;: This line of code builds a 2D Cartesian chart with the X-axis ranging from 1 to 5 and the Y-axis ranging from 3 to 9, and returns a Result type which is unwrapped with the ? operator.
  • chart.configure_mesh(): This configures the chart’s mesh, which is the grid lines and ticks of the chart
  • .x_desc(“Sepal Length (cm)”): This sets the X-axis description to Sepal Length (cm).
  • .y_desc(“Sepal Width (cm)”): This sets the Y-axis description to Sepal Width (cm).
  • .draw()?;: This draws the mesh and returns a Result type which is unwrapped with the ? operator.
  • chart.draw_series(sepal_samples.iter().map(|(x, y)| Circle::new((*x,*y), 3, BLUE.filled())));: Using the sepal_samples vector as input, a sequence of data points is plotted on the chart. The iter() function is invoked to iterate over each element in sepal_samples, and the map() method creates an iterator that transforms every point into a Circle object with a blue fill color and radius 3. Finally, this series of Circle objects is passed to chart.draw_series(), which renders them onto the graph canvas.

Running the above code chunks will result in the following being drawn in your notebook:

Iris dataset Sepal scatter plot (Image by author)

Conclusion

Photo by Aaron Burden on Unsplash

Throughout this article, we have delved into three tools in Rust and applied them to analyze data from the iris dataset. Our findings reveal that Rust is a robust language with immense potential for executing data science projects effortlessly. Although not as prevalent as Python or R, its capabilities make it an excellent option for individuals seeking to significantly elevate their data science endeavors.

We have seen that Rust is a fast and efficient language, with a type system that makes debugging relatively easy. Furthermore, numerous libraries and frameworks tailored to data science tasks are available in Rust, like Polars and ndarray, which enable the seamless handling of massive datasets.

Overall, Rust is an exceptional programming language for data science projects: it delivers remarkable performance and makes managing complex datasets relatively easy. Aspiring data science developers should consider Rust among their choices to embark on a successful journey in this domain.

Closing Note

As we conclude this tutorial, I would like to express my sincere appreciation to all those who have dedicated their time and energy to completing it. It has been an absolute pleasure to demonstrate the extraordinary capabilities of the Rust programming language with you.

Being passionate about data science, I promise you that I am going to write at least one comprehensive article every week or so on related topics from now on. If staying updated with my work interests you, consider connecting with me on various social media platforms or reach out directly if anything else needs assistance.

Thank You!

References

[1] Queue the Hardening Enhancements. (2019, May 09). In Google Security Blog. https://security.googleblog.com/2019/05/queue-hardening-enhancements.html

[2] A brief history of Rust at Facebook. (2021, April 29). In Engineering.fb Blog. https://engineering.fb.com/2021/04/29/developer-tools/rust

[3] Linux 6.1 Officially Adds Support for Rust in the Kernel. (2022, Dec 20). In infoq.com. https://www.infoq.com/news/2022/12/linux-6-1-rust

[4] Top Programming Languages 2022. (2022, Aug 23). In spectrum.ieee.com https://spectrum.ieee.org/top-programming-languages-2022

[5] Programming, scripting, and markup languages. (2022, May). In StackOverflow Developer Survey 2022. https://survey.stackoverflow.co/2022/#programming-scripting-and-markup-languages

[6] We need a safer systems programming language. (2019, July 18). In Microsoft security response center blog. https://msrc.microsoft.com/blog/2019/07/we-need-a-safer-systems-programming-language/

[7] Mozilla and Samsung Collaborate on Next Generation Web Browser Engine. (2013, April 3). In Mozilla blog. https://blog.mozilla.org/en/mozilla/mozilla-and-samsung-collaborate-on-next-generation-web-browser-engine/

[8] Mozilla lays off 250 employees due to the pandemic. (2020, Aug 11). In Engadget. https://www.engadget.com/mozilla-firefox-250-employees-layoffs-151324924.html

[9] How Pydantic V2 leverages Rust’s Superpowers. (2023, Feb 4 & 5). In fosdem.org. https://fosdem.org/2023/schedule/event/rust_how_pydantic_v2_leverages_rusts_superpowers/

[10] pydantic-v2 Performance. (2022, Dec 23). In docs.pydantic.dev. https://docs.pydantic.dev/blog/pydantic-v2/#performance

[11] BlueHat IL 2023 - David Weston-Default Security. (2023, Apr 19). On youtube.com. https://www.youtube.com/watch?v=8T6ClX-y2AE&t=2703s


Senior Blockchain Rust Enjoyer at GigaDAO - I occasionally write articles about data science, machine learning and Blockchain in Rust - Currently Writing Books