NumPy is critical if you want to do Data Science. There’s just no way around it. It’s also a bit awkward if you’re just starting out. Even today, I prefer Pandas to NumPy because it looks nicer (when displayed), handles non-numeric data better, and is much more user friendly.
But Pandas is built on NumPy and opting to do your computations in NumPy can make your code run significantly faster. Moreover, some of Python’s popular data science libraries take NumPy arrays as inputs and spit them out as outputs. So it’s best to get comfortable working with them.
Today, we will go over some NumPy array basics and tips to get you started on your data science journey on the right foot.
Why NumPy
NumPy is Python’s go-to library for working with vectors and matrices. You might ask, why can’t I just use a list of numbers? Matrices have their own mathematical properties, and lists were not designed with those properties in mind. For example, let’s try incrementing all elements of my_list by 1:
my_list = [0,1,2,3,4,5,6,7]
my_list + 1
Produces:
TypeError: can only concatenate list (not "int") to list
For lists, + concatenates instead of adding element by element like we hoped. The code below appends a 1 to the end of my_list, which is not what we want. We could use a list comprehension to increment every value in the list by 1, but that’s overkill when we can just use NumPy.
my_list + [1]
Produces:
[0, 1, 2, 3, 4, 5, 6, 7, 1]
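For comparison, here is a small sketch of the list comprehension route mentioned above. The name incremented is just an illustrative choice, not something from the original code.

```python
# Increment every element with a plain list comprehension (no NumPy needed)
my_list = [0, 1, 2, 3, 4, 5, 6, 7]

# Build a new list; my_list itself is left unchanged
incremented = [x + 1 for x in my_list]
print(incremented)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

It works, but every new operation needs its own loop, which is the overhead NumPy removes.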
By changing our list into a vector, we get the desired behavior. Adding 1 to my_vec successfully increments every element of my_vec by 1. When you apply an arithmetic operation to a NumPy array, it is applied to every element of the array.
import numpy as np
my_vec = np.array(my_list)
my_vec + 1
Produces:
array([1, 2, 3, 4, 5, 6, 7, 8])
my_vec / 2
Produces:
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5])
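The same element-wise rule applies when both operands are arrays of the same length. The sketch below assumes a second, made-up vector called other_vec purely for illustration.

```python
import numpy as np

my_vec = np.array([0, 1, 2, 3, 4, 5, 6, 7])
other_vec = np.arange(8, 0, -1)   # hypothetical second vector: 8, 7, ..., 1

summed = my_vec + other_vec       # pairs are added position by position
product = my_vec * other_vec      # element-wise product, not a dot product

print(summed)    # [8 8 8 8 8 8 8 8]
print(product)   # [ 0  7 12 15 16 15 12  7]
```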
NumPy Array Properties
Let’s make an array so we can mess around with it. Let’s say we have three lists of numbers that we would like to combine into a matrix. While this might seem random, often in data science our data starts off in lists. For example, you might have just finished running a for loop where in each iteration of the loop, you append the calculated values that you want to analyze to one of several lists.
import random
# Create 3 lists, each with 8 random numbers
list_1 = [random.random() for i in range(8)]
list_2 = [random.random() for i in range(8)]
list_3 = [random.random() for i in range(8)]
We have several options for sticking them together. Let’s try concatenate first and see what happens:
A = np.concatenate([list_1,list_2,list_3])
A
Produces:
array([0.74659783, 0.85622431, 0.07997548, 0.19562556, 0.13491446,
0.52157905, 0.19910613, 0.90290861, 0.03734927, 0.19297984,
0.6303639 , 0.20805754, 0.38080754, 0.34661514, 0.99061519,
0.09003996, 0.65690984, 0.34980775, 0.00567288, 0.28209292,
0.57405002, 0.38603164, 0.6880161 , 0.07709272])
The dimensions of a matrix are really important: they govern whether two matrices can be added together, matrix multiplied, inverted, etc. Let’s check the dimensions of A using the shape attribute:
A.shape
Produces:
(24,)
Concatenate stuck our lists together one after another (like a train) and created a long vector of length 24. The (24,) means that the array is a flat one with only a single dimension, a.k.a. a vector. It’s important to note that (24,) is not the same as (1,24). So if we wanted to do some matrix multiplication, we would first need to reshape the array:
A = A.reshape(1,-1)
A.shape
Produces:
(1, 24)
A = A.reshape(-1,1)
A.shape
Produces:
(24, 1)
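To see why the distinction between a row and a column matters, here is a rough sketch of matrix multiplication with both shapes. It uses a stand-in vector built with np.arange rather than the random data above, and the names row, col, inner and outer are illustrative only.

```python
import numpy as np

v = np.arange(24, dtype=float)   # flat vector, shape (24,)
row = v.reshape(1, -1)           # shape (1, 24)
col = v.reshape(-1, 1)           # shape (24, 1)

# The inner dimensions must match, so the order of the operands matters
inner = row @ col                # (1, 24) @ (24, 1) -> (1, 1)
outer = col @ row                # (24, 1) @ (1, 24) -> (24, 24)

print(inner.shape, outer.shape)  # (1, 1) (24, 24)
```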
Multi-dimensional Arrays
However, we usually don’t want one long vector. It’s much more likely that we want our lists to be stacked against each other so that each column (or row) corresponds to a feature. We can do this using NumPy’s stack function:
B = np.stack([list_1, list_2, list_3], axis=1)
B
Produces (an 8 by 3 array):
array([[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.07997548, 0.6303639 , 0.00567288],
[0.19562556, 0.20805754, 0.28209292],
[0.13491446, 0.38080754, 0.57405002],
[0.52157905, 0.34661514, 0.38603164],
[0.19910613, 0.99061519, 0.6880161 ],
[0.90290861, 0.09003996, 0.07709272]])
C = np.stack([list_1, list_2, list_3], axis=0)
C
Produces (a 3 by 8 array):
array([[0.74659783, 0.85622431, 0.07997548, 0.19562556, 0.13491446,
0.52157905, 0.19910613, 0.90290861],
[0.03734927, 0.19297984, 0.6303639 , 0.20805754, 0.38080754,
0.34661514, 0.99061519, 0.09003996],
[0.65690984, 0.34980775, 0.00567288, 0.28209292, 0.57405002,
0.38603164, 0.6880161 , 0.07709272]])
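If the axis argument feels abstract, the following sketch with two tiny made-up lists may help. It also checks, as far as I can tell from NumPy's behavior, that axis=1 matches np.column_stack and that the two results are transposes of each other.

```python
import numpy as np

# Small illustrative lists (not the random data used above)
list_1 = [1, 2, 3]
list_2 = [10, 20, 30]

as_columns = np.stack([list_1, list_2], axis=1)   # each list becomes a column
as_rows = np.stack([list_1, list_2], axis=0)      # each list becomes a row

print(as_columns.shape, as_rows.shape)            # (3, 2) (2, 3)
print(np.array_equal(as_columns, np.column_stack([list_1, list_2])))  # True
print(np.array_equal(as_columns, as_rows.T))                          # True
```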
Let’s go with matrix B, where each column is a feature and each row is an observation, since that’s the conventional format. Arrays are indexed, meaning that, like a list, if we know the row (and/or column) positions of the values we want, we can use those positions to select them:
# Select the last row of B
B[-1,:]
Produces:
array([0.90290861, 0.09003996, 0.07709272])
# Select the first 5 values of the second column of B
B[:5,1]
Produces:
array([0.03734927, 0.19297984, 0.6303639 , 0.20805754, 0.38080754])
Note that when we selected array elements from a single row or column like we just did, we get back vectors that only have a single dimension. If we want to do matrix math on these, we will have to reshape them:
reshaped_B = B[:5,1].reshape(-1,1)
reshaped_B.shape
Produces:
(5, 1)
Reshape is a really useful function. When we select just a part of an array or dataframe, we generally end up with a single dimension array (only length is defined). But some of Python’s math and Machine Learning functions don’t take single dimension arrays and will give an error. So in those cases, you will have to reshape your array first like we did above before using it further.
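As an aside, reshape is not the only way to end up with a two-dimensional result. The sketch below, using a stand-in 8 by 3 array, shows two alternatives I believe are common: slicing with a width-1 range, and adding an axis with np.newaxis.

```python
import numpy as np

B = np.arange(24, dtype=float).reshape(8, 3)   # stand-in for the 8 by 3 matrix

flat = B[:5, 1]                       # integer column index drops a dimension
kept_slice = B[:5, 1:2]               # a slice of width 1 keeps the column axis
kept_axis = B[:5, 1][:, np.newaxis]   # or add the axis back explicitly

print(flat.shape, kept_slice.shape, kept_axis.shape)   # (5,) (5, 1) (5, 1)
```

All three hold the same values; only the number of dimensions differs.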
Reshape can reform our matrix into more than two dimensions also. For example, let’s say the first four rows of our data came from a different data source than the last four and we want to make sure they are separated when we run any analyses. We can split our array into two separate 4 by 3 arrays:
C = B.reshape(2,4,3)
C
Produces:
array([[[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.07997548, 0.6303639 , 0.00567288],
[0.19562556, 0.20805754, 0.28209292]],
[[0.13491446, 0.38080754, 0.57405002],
[0.52157905, 0.34661514, 0.38603164],
[0.19910613, 0.99061519, 0.6880161 ],
[0.90290861, 0.09003996, 0.07709272]]])
# Get only the first part
C[0]
Produces:
array([[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.07997548, 0.6303639 , 0.00567288],
[0.19562556, 0.20805754, 0.28209292]])
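One way to put the split to use is to compute statistics per block. This is only a sketch on a stand-in array (np.arange values instead of the random data), and block_means is a name I made up for the example.

```python
import numpy as np

B = np.arange(24, dtype=float).reshape(8, 3)   # stand-in 8 by 3 matrix
C = B.reshape(2, 4, 3)                         # 2 blocks, each 4 rows by 3 columns

# Each block is just a run of consecutive rows of B
print(np.array_equal(C[0], B[:4]))   # True
print(np.array_equal(C[1], B[4:]))   # True

# Column means computed separately for each block -> shape (2, 3)
block_means = C.mean(axis=1)
print(block_means)
```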
Slicing And Sorting
A nice thing about NumPy arrays is they are easy to slice and filter. If we had a list of lists instead, we would have to loop through each list, check the relevant elements and then append the lists that meet our criteria to a new list.
With NumPy, we can do it all in one line. NumPy allows us to get a boolean index (a sequence of Trues and Falses) that tells us which rows meet our criteria (by the way, we can do the same for columns also). Here we find all the rows whose value in the third column is between 0.3 and 0.7:
slice_index = (B[:,2] >= 0.3) & (B[:,2] <= 0.7)
slice_index
Produces:
array([ True, True, False, False, True, True, True, False])
The result, slice_index, can be used to slice our array and get back just the rows that meet our criteria:
B[slice_index]
Produces:
array([[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.13491446, 0.38080754, 0.57405002],
[0.52157905, 0.34661514, 0.38603164],
[0.19910613, 0.99061519, 0.6880161 ]])
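Two details are easy to miss here: each comparison needs its own parentheses because & binds more tightly than >= and <=, and the mask can also be turned into row positions. The sketch below uses a small made-up array; mask, filtered and row_positions are illustrative names.

```python
import numpy as np

# Small made-up array; the third column is the one we filter on
B = np.array([[0.1, 0.9, 0.65],
              [0.2, 0.8, 0.05],
              [0.3, 0.7, 0.35],
              [0.4, 0.6, 0.95]])

# Parentheses are required around each comparison
mask = (B[:, 2] >= 0.3) & (B[:, 2] <= 0.7)

filtered = B[mask]                    # keep only the rows where mask is True
row_positions = np.flatnonzero(mask)  # integer positions of those rows

print(row_positions)   # [0 2]
print(filtered)
```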
Sorting
NumPy has several ways to sort, but my favorite is argsort for its versatility. Argsort gives us back the indexes that allow us to order our data from smallest to largest along our chosen sorting axis. Let’s sort down the rows (axis=0), which orders each column independently.
sort_index = np.argsort(B, axis=0)
sort_index
Produces:
array([[2, 0, 2],
[4, 7, 7],
[3, 1, 3],
[6, 3, 1],
[5, 5, 5],
[0, 4, 4],
[1, 2, 0],
[7, 6, 6]])
The output, sort_index, for each column gives the indexes that allow us to sort our array from smallest to largest by the values in that column. For example, look at the first column of values – it means that for our first column, the third element is the smallest (Python indexing starts at 0), the fifth element is the next smallest, and so on. We can sort by the first column by giving our matrix the first column of sort_index:
B[sort_index[:,0]]
Produces:
array([[0.07997548, 0.6303639 , 0.00567288],
[0.13491446, 0.38080754, 0.57405002],
[0.19562556, 0.20805754, 0.28209292],
[0.19910613, 0.99061519, 0.6880161 ],
[0.52157905, 0.34661514, 0.38603164],
[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.90290861, 0.09003996, 0.07709272]])
To sort by the third column instead we just give it the third column of sort_index:
B[sort_index[:,2]]
Produces:
array([[0.07997548, 0.6303639 , 0.00567288],
[0.90290861, 0.09003996, 0.07709272],
[0.19562556, 0.20805754, 0.28209292],
[0.85622431, 0.19297984, 0.34980775],
[0.52157905, 0.34661514, 0.38603164],
[0.13491446, 0.38080754, 0.57405002],
[0.74659783, 0.03734927, 0.65690984],
[0.19910613, 0.99061519, 0.6880161 ]])
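If you only ever sort by one column, you can argsort just that column. The sketch below, on a tiny made-up array, also contrasts this with np.sort along axis 0, which sorts every column on its own and so breaks rows apart. The variable names are illustrative.

```python
import numpy as np

B = np.array([[3.0, 10.0],
              [1.0, 30.0],
              [2.0, 20.0]])

order = np.argsort(B[:, 0])        # argsort only the column we care about
by_first_col = B[order]            # whole rows move together
independent = np.sort(B, axis=0)   # each column sorted separately

print(by_first_col)   # rows: [1, 30], [2, 20], [3, 10]
print(independent)    # rows: [1, 10], [2, 20], [3, 30]
```

Indexing with the argsort result is what keeps each observation intact.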
And if we just want the rows that contain the first column’s three largest values, we can use the following line of code:
B[sort_index[-3:,0]]
Produces:
array([[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.90290861, 0.09003996, 0.07709272]])
If we want to sort a matrix from largest to smallest instead, we can still use our sort_index array, but just in reverse order:
B[sort_index[::-1,0]]
Produces:
array([[0.90290861, 0.09003996, 0.07709272],
[0.85622431, 0.19297984, 0.34980775],
[0.74659783, 0.03734927, 0.65690984],
[0.52157905, 0.34661514, 0.38603164],
[0.19910613, 0.99061519, 0.6880161 ],
[0.19562556, 0.20805754, 0.28209292],
[0.13491446, 0.38080754, 0.57405002],
[0.07997548, 0.6303639 , 0.00567288]])
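Putting the last two ideas together, here is a compact sketch on a small made-up array showing a descending sort and a top-two selection from the same index. The names order, descending and top_two are mine, not part of the original walkthrough.

```python
import numpy as np

B = np.array([[3.0, 10.0],
              [1.0, 30.0],
              [4.0, 20.0],
              [2.0, 40.0]])

order = np.argsort(B[:, 0])    # ascending positions: [1, 3, 0, 2]

descending = B[order[::-1]]    # reverse the index to go largest to smallest
top_two = B[order[-2:]]        # two largest first-column values, still ascending

print(descending)
print(top_two)
```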
Transpose
Transposing a matrix swaps its rows and columns, which you can picture as knocking it over on its side. Transposing B, an 8 by 3 matrix, results in a 3 by 8 matrix:
B.T
Produces:
array([[0.74659783, 0.85622431, 0.07997548, 0.19562556, 0.13491446,
0.52157905, 0.19910613, 0.90290861],
[0.03734927, 0.19297984, 0.6303639 , 0.20805754, 0.38080754,
0.34661514, 0.99061519, 0.09003996],
[0.65690984, 0.34980775, 0.00567288, 0.28209292, 0.57405002,
0.38603164, 0.6880161 , 0.07709272]])
# Undo the transpose by transposing again:
B.T.T
Produces:
array([[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.07997548, 0.6303639 , 0.00567288],
[0.19562556, 0.20805754, 0.28209292],
[0.13491446, 0.38080754, 0.57405002],
[0.52157905, 0.34661514, 0.38603164],
[0.19910613, 0.99061519, 0.6880161 ],
[0.90290861, 0.09003996, 0.07709272]])
Transposing is useful not only for matrix math, which requires the dimensions to line up (so you will often have to transpose one of the arrays before say multiplying), but also for formatting data for plots and analytics. Often a model will output your data in a certain format, but you might want to see it another way – in those cases, you will find transpose quite useful.
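As a sketch of the matrix math case, multiplying an 8 by 3 array by itself fails because the inner dimensions do not match, while transposing the first operand makes them line up. A stand-in array is used here, and gram is just an illustrative name.

```python
import numpy as np

B = np.arange(24, dtype=float).reshape(8, 3)   # stand-in 8 by 3 matrix

try:
    B @ B                      # (8, 3) @ (8, 3): inner dimensions 3 and 8 differ
except ValueError as err:
    print("shapes do not line up:", err)

gram = B.T @ B                 # (3, 8) @ (8, 3) -> (3, 3)
print(gram.shape)              # (3, 3)
```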
Adding New Values To Our Array
Finally, you might find that you need to add more values to the end of your array. For example, we might have gotten some new observations that we would like to include in our analysis. My preferred way is to use vstack (the v stands for vertical). As its name implies, vstack stacks the arrays in the list it is given one on top of the other (there is an hstack as well for horizontal stacking). In order to use vstack, the arrays must have the same size along every axis except the one we are stacking on. For example, in our data, B is an 8 by 3 matrix, so any new data stacked onto it must be N by 3, or of length 3 if it’s a single list.
# Generate new data
new_list = []
new_list.append([random.random() for i in range(3)])
new_list.append([random.random() for i in range(3)])
# Add new data to array B
Z = np.vstack([B, new_list])
Z
Produces:
array([[0.74659783, 0.03734927, 0.65690984],
[0.85622431, 0.19297984, 0.34980775],
[0.07997548, 0.6303639 , 0.00567288],
[0.19562556, 0.20805754, 0.28209292],
[0.13491446, 0.38080754, 0.57405002],
[0.52157905, 0.34661514, 0.38603164],
[0.19910613, 0.99061519, 0.6880161 ],
[0.90290861, 0.09003996, 0.07709272],
[0.42474853, 0.53725651, 0.35893413], # new row
[0.47844915, 0.71047911, 0.36029152]]) # new row
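To make the shape requirement concrete, here is a minimal sketch using placeholder arrays of zeros and ones instead of real data. The names new_rows, new_col, taller and wider are assumptions made for the example.

```python
import numpy as np

B = np.zeros((8, 3))                             # placeholder 8 by 3 matrix
new_rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # 2 by 3: same column count as B
new_col = np.ones((8, 1))                        # 8 by 1: same row count as B

taller = np.vstack([B, new_rows])   # rows are appended      -> (10, 3)
wider = np.hstack([B, new_col])     # a column is appended   -> (8, 4)
print(taller.shape, wider.shape)

try:
    np.vstack([B, [1.0, 2.0]])      # only 2 values per row, B has 3 columns
except ValueError as err:
    print("vstack failed:", err)
```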
Conclusion
Hopefully some of these tips give you a head start in terms of learning how to manipulate NumPy arrays. NumPy truly is the backbone of data science in Python. So becoming fluent in it will definitely pay dividends. Cheers!