The world’s leading publication for data science, AI, and ML professionals.

Introducing NumPy, Part 4: Doing Math with Arrays

Plus reading and writing array data!

Quick Success Data Science

An array doing math as imagined by DALL-E3
An array doing math as imagined by DALL-E3

Welcome to the fourth and final edition of the beginner series, Introducing NumPy! In the previous articles, we reviewed NumPy’s workhorse arrays: what they are and how to create them (Part 1); how to index and slice them (Part 2); and how to manipulate them (Part 3). Now it’s time to apply them to their main purpose: mathematical operations.

NumPy uses two internal implementations to perform math on arrays efficiently: vectorization and broadcasting. Vectorization supports operations between equal-sized arrays, and broadcasting extends this behavior to arrays with different shapes.


Vectorization

One of the most powerful features of ndarrays, vectorization lets you perform batch operations on data without the need for explicit for loops. This means you can apply an operation on an entire array at once without selecting each element from it.

Arithmetic operations are applied elementwise for equal-sized arrays, as shown in the following figure:

Mathematical operations involving equal-sized arrays are performed on corresponding elements (from Python Tools for Scientists) (this and future links to my book represent affiliate links)
Mathematical operations involving equal-sized arrays are performed on corresponding elements (from Python Tools for Scientists) (this and future links to my book represent affiliate links)

Because looping takes place behind the scenes with code implemented in C, vectorization leads to faster processing. Let’s look at an example in which we compare looping in Python to vectorization in Numpy.

Start by creating two datasets of 100,000 randomly selected integers between 0 and 500:

In [1]: import numpy as np

In [2]: data_a = np.random.randint(500, size=100_000)

In [3]: data_b = np.random.randint(500, size=100_000)

Now, make an empty list and then loop through the two datasets, appending each item in data_a to the list if it also occurs in data_b:

In [4]: shared_list = []

In [5]: for item in data_a:
   ...:     if item in data_b:
   ...:         shared_list.append(item)

Note that this can also be written using list comprehension:

shared_list = [item for item in data_a if item in data_b]

Depending on your hardware, you’ll need to wait around five seconds or more for this loop to complete.

Here are the first three values in the list (yours may differ, as these were randomly generated):

In [6]: shared_list[:3]
Out[6]: [383, 40, 144]

Let’s repeat this exercise using the NumPy isin() method. This optimized method compares each element in a target array to another array and returns a Boolean. We can combine this with indexing to return the elements with values of True:

In [7]: data_a[np.isin(data_a, data_b)]
Out[7]: array([383,  40, 144, ..., 469, 199, 411])

This computation ran almost instantly compared to the previous standard Python loop.

Vectorization also permits more concise and readable code that can resemble mathematical expressions. For example, to multiply two arrays together, you can forgo writing nested loops and just state arr1 * arr2, as follows:

In [8]: arr1 = np.array([[1, 1, 1], [2, 2, 2]])

In [9]: arr1
Out[9]: 
array([[1, 1, 1],
       [2, 2, 2]])

In [10]: arr2 = np.array([[3, 3, 3], [4, 4, 4]])

In [11]: arr2
Out[11]: 
array([[3, 3, 3],
       [4, 4, 4]])

In [12]: arr1 * arr2
Out[12]: 
array([[3, 3, 3],
       [8, 8, 8]])

This behavior applies to all basic arithmetic operations, such as adding, subtracting, multiplying, and dividing.


Broadcasting

The broadcasting technique allows operations on arrays of different shapes. It comes up quite often in real-world problems, such as image processing, data normalization, vector quantization, machine learning, and geospatial analysis.

To understand how it works, consider the next figure, in which a 1D array of four elements is multiplied by a 1D array of a single element.

An example of broadcasting when multiplying a 1D ndarray by a scalar (from Python Tools for Scientists)
An example of broadcasting when multiplying a 1D ndarray by a scalar (from Python Tools for Scientists)

As you can see, the smaller array is stretched across the larger array until they have compatible shapes. The array of shape (1,) becomes an array of shape (4,) with its single value repeated so that element-by-element multiplication can occur. This same behavior applies to operations between scalars and arrays.

For broadcasting to work, the dimensions of the two arrays must be compatible. Two dimensions are compatible when they are equal or one of them is 1. NumPy determines this compatibility by comparing the array shape tuples, starting with the trailing (rightmost) dimension and moving left.

For example, to check whether different 24-element 3D arrays are broadcastable, NumPy would compare their shape tuples, as shown in this figure:

Checking 3D array dimensions for compatibility (gray-shaded values) (from Python Tools for Scientists)
Checking 3D array dimensions for compatibility (gray-shaded values) (from Python Tools for Scientists)

Starting with the trailing dimension (circled 1), NumPy determines that both pairs of arrays are compatible, as at least one is equal to 1. This holds for the next comparison (circled 2), but the bottom pair fails in the last comparison (circled 3) because 6 and 3 are unequal. Consequently, we can’t perform any mathematical operations between these two arrays.

By contrast, in the following figure, a 2D and 1D array are compatible, so the 1D array can broadcast down to fill in the missing rows.

An example of broadcasting when adding a 2D array to a 1D array (from Python Tools for Scientists)
An example of broadcasting when adding a 2D array to a 1D array (from Python Tools for Scientists)

This allows for element-by-element addition. Broadcasting can occur along rows, columns, or plans, as needed. For more on broadcasting, including an interesting practical example, visit the official docs.


The Matrix Dot Product

In NumPy, basic multiplication between arrays is executed element for element. In other words, each element in one array is multiplied by the corresponding element in a second array. This includes the multiplication of 2D arrays, also known as matrices.

However, you might remember from math class that proper matrix multiplication involves performing operations on rows and columns, not elements. This is the matrix dot product, in which the horizontals in the first matrix are multiplied by the verticals in the second matrix. The results are then added, as shown by the gray-shaded values in the next figure. Not only is this process not by element, but it’s also noncommutative, as arr1 * arr2 is not equal to arr2 * arr1.

The matrix dot product (from Python Tools for Scientists)
The matrix dot product (from Python Tools for Scientists)

NumPy provides the dot() method for multiplying in this manner. Here’s an example using the matrices in the previous figure:

In [13]: arr1 = np.array([[0, 1], [2, 3]])

In [14]: arr2 = np.array([[4, 5], [6, 7]])

In [15]: np.dot(arr1, arr2)
Out[15]: 
array([[ 6,  7],
       [26, 31]])

You can also use the alternate syntax arr1.dot(arr2) to compute the dot product.

In addition to the dot product, NumPy ships with other methods for performing linear algebra. To see the full list, visit the linear algebra page in the official docs.


Incrementing and Decrementing Arrays

You can use augmented operators such as += to change the values in an array without creating a new array. Here are some examples using a 1D array:

In [16]: arr1d = np.array([0, 1, 2, 3])

In [17]: arr1d += 10

In [18]: arr1d
Out[18]: array([10, 11, 12, 13])

In [19]: arr1d -= 10

In [20]: arr1d
Out[20]: array([0, 1, 2, 3])

In [21]: arr1d *= 2

In [22]: arr1d
Out[22]: array([0, 2, 4, 6])

In these cases, the scalar value is applied to every element in the array.


Using NumPy Functions

Like Python’s standard math module, NumPy comes with its own set of mathematical functions. These include universal functions and aggregate functions. A universal function, also known as a ufunc, acts in an element-by-element fashion and generates a new array with the same size as the input. Aggregate functions act on a whole array and produce a single value, such as the sum of the elements in the array

Universal Functions

Universal functions that perform simple element-by-element transformations, such as taking the log or squaring an element, are referred to as unary ufuncs. To use them, call the function and pass it an ndarray, as follows:

In [23]: arr1d = np.array([10, 20, 30, 40])

In [24]: np.log10(arr1d)
Out[24]: array([1.        , 1.30103   , 1.47712125, 1.60205999])

In [25]: np.square(arr1d)
Out[25]: array([ 100,  400,  900, 1600], dtype=int32)

Some of the more useful unary ufuncs are listed below. You can find a complete list in the docs.

Useful NumPy unary universal functions (from Python Tools for Scientists)
Useful NumPy unary universal functions (from Python Tools for Scientists)

Universal functions that accept two arrays as input and return a single array are called binary ufuncs. The following binary functions find the maximum and minimum values in two arrays and return them in a new array:

In [26]: a = np.array([1, 2, 500])

In [27]: b = np.array([0, 2, -1])

In [28]: np.maximum(a, b)
Out[28]: array([  1,   2, 500])

In [29]: np.minimum(a, b)
Out[29]: array([  0,   2, -1])

Some other binary functions are listed below:

Useful NumPy binary universal functions (from Python Tools for Scientists)
Useful NumPy binary universal functions (from Python Tools for Scientists)

Statistical Methods

NumPy also comes with methods that compute statistics for an entire array or for data along an axis. Reducing the elements in an array to a single value can be referred to as aggregation or reduction.

Let’s try out some of these using a 2D array of randomly generated integers:

In [30]: arr = np.random.randint(100, size=(3, 5))

In [31]: arr
Out[31]: 
array([[13, 22, 16, 30, 32],
       [11,  1,  1, 35, 68],
       [84,  2, 18, 84, 30]])

To calculate the mean value for all the elements in this array, call mean() on the array using dot notation:

In [32]: arr.mean()
Out[32]: 29.8

You can also pass the array to the mean() function, like so:

In [33]: np.mean(arr)
Out[33]: 29.8

The optional axis argument lets you specify the axis over which to compute the statistics. For example, specifying axis 1 means that the calculation is performed across the columns, producing a 1D array with the same number of elements as rows in the array:

In [34]: arr.mean(axis=1)
Out[34]: array([22.6, 23.2, 43.6])

Specifying axis 0 tells the method to compute the down the rows. In the following example, this yields a 1D array of five elements, equal to the number of columns:

In [35]: arr.sum(axis=0)
Out[35]: array([108,  25,  35, 149, 130])

These methods can also be called without the axis keyword:

In [36]: arr.mean(1)
Out[36]: array([22.6, 23.2, 43.6])

The following table lists some useful statistical methods for arrays. You can use the whole array or specify an axis.

Useful NumPy statistical methods (from Python Tools for Scientists)
Useful NumPy statistical methods (from Python Tools for Scientists)

Note that NumPy also comes with the apply_along_axis() aggregate method that lets you supply the statistical method, axis, and array as arguments. Here’s an example using the previous array:

In [37]: np.apply_along_axis(np.mean, axis=1, arr=arr)
Out[37]: array([22.6, 23.2, 43.6])

You can also define your own functions and pass them to apply_along _axis():

In [39]: np.apply_along_axis(cube, axis=1, arr=arr)
Out[39]: 
array([[  2197,  10648,   4096,  27000,  32768],
       [  1331,      1,      1,  42875, 314432],
       [592704,      8,   5832, 592704,  27000]], dtype=int32)

Notice how, in these examples, you were able to work with the array without explicitly iterating over every element. Again, this is one of the great strengths of NumPy.

Generating Pseudorandom Numbers

NumPy comes with functions for creating arrays from different types of probability distributions. These are useful for tasks such as generating randomized data to test machine learning models, creating data distributions with a known shape or distribution, randomly drawing data for a Monte Carlo simulation, and so on. They’re also at least an order of magnitude faster than similar functions in Python’s built-in random module.

The following table lists some of the functions you can find in np.random. For the full list, visit the docs.

Useful NumPy pseudorandom functions (from Python Tools for Scientists)
Useful NumPy pseudorandom functions (from Python Tools for Scientists)

Reading and Writing Array Data

NumPy can load and save data in both binary and text formats. Supported text formats are *.txt and *.csv. Generally, you’ll want to use the pandas library, built on NumPy, to work with text or tabular data.

An array writing data as imagined by DALL-E3
An array writing data as imagined by DALL-E3

For storing and retrieving data in binary format, NumPy provides the save() and load() functions. To save an array to disk, pass a filename and the array as arguments, as shown here:

In [40]: arr = np.arange(8).reshape(2, 4)

In [41]: arr
Out[41]: 
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [42]: np.save('my_array', arr)

This will produce the binary file _myarray.npy (the .npy extension is added automatically).

To reload this file, enter the following:

In [43]: np.load('my_array.npy')
Out[43]: 
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

The np.savez() method lets you save several arrays into a single file in uncompressed .npz format. Providing keyword arguments lets you store them under the corresponding name in the output file:

In [44]: arr1 = np.arange(5)

In [45]: arr2 = np.arange(4)

In [46]: np.savez('arr_arch.npz', a=arr1, b=arr2)

In [47]: archive = np.load('arr_arch.npz')

In [48]: archive['a']
Out[48]: array([0, 1, 2, 3, 4])

If arrays are specified as positional arguments (no keywords), their names will be _arr0, _arr1, and so on, by default.

To compress data when archiving, use the savez_compressed() method:

In [49]: np.savez_compressed('arr_arch_compressed.npz', a=arr1, b=arr2)

In the event you do want to read-in a text file, NumPy provides the genfromtxt() (generate from text) method. To load a .csv file, for example, you would pass the method the file path, the character (comma) that separates the values, and whether the data columns have headers, as follows:

arr = np.genfromtxt('my_data.csv', delimiter=',', names=True)

This will produce a structured array that contains records rather than individual items. We haven’t discussed structured arrays, because they are a low-level tool, and you’ll generally use pandas for operations such as loading .csv files. However, you can read more about structured arrays here.


Summary

When working with uniform datasets, NumPy’s ndarrays represent a faster, more efficient alternative to competing data structures such as Python lists. Complex computations can be performed without the use of for loops, and ndarrays require significantly less memory than other Python data types.

Part 1 in this series covered creating arrays. Part 2 covered indexing and slicing arrays, and Part 3 covered manipulating arrays.

While this series touched on many NumPy basics, there’s still more to learn. To expand your knowledge of NumPy, I recommend NumPy’s "Beyond the Basics" page and Wes McKinney’s Python for Data Analysis: Data Wrangling with Pandas, NumPy, and Jupyter, 3rd edition (O’Reilly, 2022).


Test Your Knowledge

Testing yourself on newly acquired knowledge is a great way to lock in what you’ve learned. Here’s a quick quiz to help you on your way. The answers are at the end of the article.

Question 1: In NumPy, array multiplication is done:

a. Row by column

b. Column by row

c. Element by element

d. Row by row then column by column

Question 2: Which array is broadcastable with a shape of (4, 3, 6, 1)?

a. (4, 6, 6, 1)

b. (1, 6, 3, 1)

c. (4, 1, 6, 6)

d. (6, 3, 1, 6)

Question 3: Subtract array_2 from array_1:

In [1]: array_1 = np.array([10, 20, 30, 40]).reshape(2, 2)

In [2]: array_2 = np.array([5, 10, 15, 20]).reshape(2, 2)

Question 4: How do you calculate the mean of array_2?

Question 5: What NumPy feature lets you eliminate for loops:

a. Dot matrix multiplication

b. Vectorization

c. Enumeration

d. Broadcasting


Further Reading

If you’re a beginner curious about Python’s essential libraries, like NumPy, Matplotlib, pandas, and more, check out my latest book, Python Tools for Scientists (it’s not just for scientists):

Python Tools for Scientists: An Introduction to Using Anaconda, JupyterLab, and Python’s Scientific…


Answers to Quiz

  1. c
  2. c
In [4]: array_1 - array_2

Out[4]: 
array([[ 5, 10],
       [15, 20]])

4.

In [8]: array_2.mean()
Out[8]: 12.5
  1. b

If you found this article useful and would like to support me, please buy me a coffee. I promise I’ll drink it.


Thanks!

Thanks for reading and clapping and please follow me for more Quick Success Data Science projects in the future.


Related Articles