4 Fundamental NumPy Properties Every Data Scientist Must Master

What makes NumPy the ultimate power tool for matrix manipulations in Python

NumPy, or Numerical Python, is an open-source Python library that provides common mathematical and numerical functions. It introduces a data structure called the NumPy array, which is optimized for numerical calculations and is generally faster and more memory-efficient than Python lists. In this article, we look at four useful traits of NumPy that make it a powerful tool in data science.

You can import the module in a variety of ways, as shown below. For this article, we use the first method, where the np alias represents the NumPy module; this is the recommended convention for importing NumPy. The third method, from numpy import *, is not recommended because it clutters the namespace, among other reasons.

import numpy as np
import numpy
from numpy import *

1. Homogeneous

A NumPy array can only hold elements of one datatype. These are usually numerical data such as integers and floats, but an array can also hold strings.

The code below creates a NumPy array using np.array(list). See the NumPy documentation for the other ways to create an array.

array_1 = np.array([1,2,3,4])
array_1
###Results
array([1, 2, 3, 4])

Note that if the list contains a mix of numerical and string types, as below, NumPy converts all the elements into strings.

array_2 = np.array([1,2,'three'])
array_2
###Results
array(['1', '2', 'three'], dtype='<U11')

To check the datatype of the array elements, use array.dtype. To check the type of the array object itself, use type(array), which returns numpy.ndarray as below. The exact integer type reported (int32 or int64) depends on your platform and NumPy version.

print(array_1.dtype)
print(array_2.dtype)
print(type(array_1))
###Results
int32
<U11
<class 'numpy.ndarray'>
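The same single-datatype rule applies when integers and floats are mixed. The snippet below is a minimal sketch of that behaviour; the variable names mixed and as_float are illustrative and do not come from the article.

```python
import numpy as np

# Mixing integers and floats: NumPy upcasts every element to a float.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)     # float64

# You can also request a datatype explicitly with the dtype argument.
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float.dtype)  # float64
```

Strings are the widest common type, which is why array_2 above ends up holding only strings.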

Indexing and slicing

You can access the elements of a NumPy array the same way you would a list: array[index]. You can also slice it to return a subset, as in array[start_index:stop_index]. The start_index is inclusive, while the stop_index is exclusive. Another option is to return the elements at several separate indices: array[[index_1, index_3]]. Here you pass a list that holds the indices.

array_1 = np.array([1,2,3,4])
print(array_1[2])
print(array_1[1:4])
print(array_1[[0,3]])
###Results
3
[2 3 4]
[1 4]

Missing data and NaN (Not a Number)

NumPy lets you represent missing data in an array with np.nan while still keeping a single datatype for all elements. Numeric arrays (integers or floats) that contain nan elements automatically take the float datatype, because nan is itself a float.

array_4 = np.array([2, 5, 6, np.nan, np.nan])
array_4.dtype
###Results
dtype('float64')
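Once an array contains nan values, it helps to know how to find them and how they affect calculations. The sketch below reuses array_4 and relies on two NumPy functions the article does not cover, np.isnan and np.nanmean; the variable name missing is illustrative.

```python
import numpy as np

array_4 = np.array([2, 5, 6, np.nan, np.nan])

# np.isnan returns a boolean array marking the missing entries.
missing = np.isnan(array_4)
print(missing.sum())        # 2 missing values

# A plain mean propagates nan; np.nanmean ignores the missing entries.
print(np.mean(array_4))     # nan
print(np.nanmean(array_4))  # mean of 2, 5 and 6, roughly 4.33
```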

2. Multi-dimensional

So far, we have only created 1-dimensional arrays, but NumPy arrays can have more than one dimension. One way to create them is to pass a list of lists, as in np.array(list_of_lists, dtype).

array_5 = np.array([[1,2,3],[4,5,6]])
print(array_5)
###Results
[[1 2 3]
 [4 5 6]]

Array dimensions and shape

You can get the dimensions of a NumPy array using array.shape. The shape attribute returns a tuple. The number of elements in the tuple represents the number of dimensions; for example, the shape of array_5 in the code below is (2, 3), which makes it a 2-dimensional array, while array_1 has one dimension.


array_1 = np.array([1,2,3,4])
array_5 = np.array([[1,2,3],[4,5,6]])
print(array_1.shape)
print(array_5.shape)
###Results
(4,)
(2, 3)

For a two-dimensional array, you can interpret the shape as (rows, columns), meaning that array_5 above has 2 rows and 3 columns.

You can also get the dimensions by the length of the shape of the array: len(array.shape). This is because array.shape returns a tuple and len(tuple) returns the number of elements in the tuple. Check also the number of square brackets enclosing the lists below for an indication of the dimensions.

a_1dim = np.array([1,2,3,4])
a_2dim1 = np.array([[1,2,3],[4,5,6]])
a_2dim2 = np.array([[1,2,3],[4,5,6],[6,7,8]])
a_3dim = np.array([[[1,2,3],[4,5,6]]])
print(len(a_1dim.shape))
print(len(a_2dim1.shape))
print(len(a_2dim2.shape))
print(len(a_3dim.shape))
###Results
1
2
2
3
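NumPy arrays also expose the number of dimensions directly through the ndim attribute, and the total number of elements through size. The short sketch below checks that ndim agrees with len(array.shape) for two of the arrays above.

```python
import numpy as np

a_1dim = np.array([1, 2, 3, 4])
a_3dim = np.array([[[1, 2, 3], [4, 5, 6]]])

# ndim is the number of dimensions, the same value as len(shape).
print(a_1dim.ndim, len(a_1dim.shape))  # 1 1
print(a_3dim.ndim, a_3dim.shape)       # 3 (1, 2, 3)

# size is the total number of elements, the product of the shape.
print(a_3dim.size)                     # 6
```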

Reshaping the array

You can re-arrange the array into any shape, as long as the total number of elements (or the product of the shape) remains the same. We use array.reshape(rows,columns). For example if the shape is (2,3), you can only reshape to an array with 6 elements such as (1,6), (3,2) or even (6,). Also note that the order of the elements remains the same after reshaping and you will see soon why this is important.

arr_3 = np.array([[1,2,3], [4,5,6]])
print(arr_3.shape)
print(arr_3.reshape(1,6))
print(arr_3.reshape(3,2))
print(arr_3.reshape(6,))
###Results
(2, 3)
[[1 2 3 4 5 6]]
[[1 2]
 [3 4]
 [5 6]]
[1 2 3 4 5 6]
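To illustrate the rule that the element count must stay the same, here is a small sketch. It uses -1 as a dimension, which asks NumPy to infer that dimension, and shows the error raised for an incompatible shape. The names inferred and reshape_failed are illustrative.

```python
import numpy as np

arr_3 = np.array([[1, 2, 3], [4, 5, 6]])

# -1 asks NumPy to infer that dimension from the total element count.
inferred = arr_3.reshape(3, -1)
print(inferred.shape)  # (3, 2)

# A shape whose product is not 6 cannot hold the same elements.
reshape_failed = False
try:
    arr_3.reshape(4, 2)  # 4 * 2 = 8 slots for 6 elements
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```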

Transposing

Not to be confused with reshaping, transposing means flipping the array so that the rows become columns and the columns become rows. We use array.transpose(). This is demonstrated below.

Let’s start with a 2 by 3 array.

arr_3 = np.array([[1,2,3], [4,5,6]])
print(arr_3)
###Results
[[1 2 3]
 [4 5 6]]

We then reshape the array to 3 by 2. See how the order of the elements is maintained.

print(arr_3.reshape(3,2))
###Results
[[1 2]
 [3 4]
 [5 6]]

Now we transpose the same original array to get a 3 by 2 matrix. Note the order of the elements.

print(arr_3.transpose())
###Results
[[1 4]
 [2 5]
 [3 6]]
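The difference between the two results can also be checked in code. The sketch below uses array.T, the shorthand attribute for array.transpose(); the names reshaped and transposed are illustrative.

```python
import numpy as np

arr_3 = np.array([[1, 2, 3], [4, 5, 6]])

reshaped = arr_3.reshape(3, 2)  # keeps the reading order 1, 2, 3, ...
transposed = arr_3.T            # shorthand for arr_3.transpose()

# Same shape, different contents.
print(reshaped.shape == transposed.shape)    # True
print(np.array_equal(reshaped, transposed))  # False

# Transposing swaps indices: element [i, j] moves to [j, i].
print(transposed[2, 0] == arr_3[0, 2])       # True
```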

Flattening the array

This refers to reducing any array to 1 dimension using array.flatten().

a_2dim = np.array([[1,2,3],[4,5,6],[6,7,8]])
print(a_2dim.flatten())
###Results
[1 2 3 4 5 6 6 7 8]

Indexing and slicing a multi dimensional array

We can select elements from a multi-dimensional array using the indices: use array[row_index, column_index]. Remember that array indices start at 0. To select all elements along an axis, provide just a colon (:) for that axis. For example, array[:, column_index] returns everything in the column column_index. The NumPy documentation covers slicing in more detail.
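As a minimal sketch of these indexing forms, the example below uses a hypothetical 3 by 3 array named grid, which is not part of the original article.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

print(grid[1, 2])      # 6, the element at row 1, column 2
print(grid[:, 0])      # [1 4 7], every row of column 0
print(grid[0, :])      # [1 2 3], every column of row 0

# Slices work per axis: rows 0-1 and columns 1-2.
block = grid[0:2, 1:3]
print(block.shape)     # (2, 2)
```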

3. Operations are Element wise

NumPy functions are applied to each element of the array. For example, np.sqrt(array) returns a new array containing the square root of each element. For operations between arrays, the math is between corresponding elements of each array.

For example below we create 2 arrays and then perform operations between them.

arr_1 = np.array([[1,2],[3,4]])
arr_2 = np.array([[5,6],[7,8]])
print(arr_1)
print(arr_2)
###Results
[[1 2]
 [3 4]]
[[5 6]
 [7 8]]

Notice the element-wise operations: the element at index [1, 1] of one array is combined with the element at index [1, 1] of the second array, and so on.

print(arr_1 + arr_2)
print(arr_1 * arr_2)
print(arr_1 / arr_2)
###Results
[[ 6  8]
 [10 12]] 

[[ 5 12]
 [21 32]] 

[[0.2        0.33333333]
 [0.42857143 0.5       ]] 

You can also perform operations between an array and a scalar (a single number). For example, array + 3 adds 3 to every element of the array.

print(arr_1 + 3)
print(arr_1 / 3)
print(arr_1 - 3)
###Results
[[4 5]
 [6 7]] 

[[0.33333333 0.66666667]
 [1.         1.33333333]] 

[[-2 -1]
 [ 0  1]]

The NumPy library is packed with helper functions for performing mathematical operations, as listed in the NumPy documentation.

Some functions perform element-wise operations, such as np.sqrt(), np.abs(), np.power(), np.exp() and np.log().

arr_5 = np.array([2.4, 7.5, 2.8, 1.9]).reshape(2,2)
print(np.power(arr_5, 2), '\n')
print(np.log(arr_5), '\n')
print(np.power(arr_5, 3), '\n')
###Results
[[ 5.76 56.25]
 [ 7.84  3.61]] 

[[0.87546874 2.01490302]
 [1.02961942 0.64185389]] 

[[ 13.824 421.875]
 [ 21.952   6.859]]

Other functions are aggregation functions, which summarize the array and return one number. You can also aggregate along an axis by passing the axis parameter: axis=0 collapses the rows and returns one value per column, while axis=1 collapses the columns and returns one value per row. Common aggregation functions are np.mean(), np.median(), np.min(), np.max() and np.std().

print(np.max(arr_5))
print(np.max(arr_5, axis=1))
###Results
7.5
[7.5 2.8]
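The axis parameter is easy to mix up, so here is a small sketch that computes the maximum along both axes of arr_5. The names col_max and row_max are illustrative.

```python
import numpy as np

arr_5 = np.array([2.4, 7.5, 2.8, 1.9]).reshape(2, 2)
# arr_5 is [[2.4 7.5]
#           [2.8 1.9]]

# axis=0 collapses the rows: one result per column.
col_max = np.max(arr_5, axis=0)
print(col_max)  # [2.8 7.5]

# axis=1 collapses the columns: one result per row.
row_max = np.max(arr_5, axis=1)
print(row_max)  # [7.5 2.8]
```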

4. Dependably Random

NumPy comes with a randomization module called numpy.random for sampling or generating a random sample of data.

To create an array of random integers, use np.random.randint(low, high, size). low is the lower bound and is inclusive; high is the upper bound and is exclusive. size sets the number of elements: provide a single number such as 16 to create a one-dimensional array, or a shape such as (4, 4) to create a two-dimensional array. See the NumPy documentation for all the random sampling routines.

arr_1 = np.random.randint(0,5,6)
arr_3 = np.random.randint(0,5,(3,2))
print(arr_1)
print(arr_3)
###Results
[0 4 1 3 1 2]
[[3 3]
 [1 1]
 [3 2]]

Random seed

Every time you run the code above, NumPy generates a new random sample. To get the same random array on every run, set a seed using np.random.seed(number). Any non-negative integer works as the seed.

np.random.seed(123)
arr_3 = np.random.randint(0,5,(3,2))
print(arr_3)
###Results
[[2 4]
 [2 1]
 [3 2]]
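To see the effect of the seed, the sketch below draws the same array twice after resetting the seed. It also shows np.random.default_rng, the newer Generator interface that the NumPy documentation recommends for new code; the article itself uses only np.random.seed. The names first, second, rng_a and rng_b are illustrative.

```python
import numpy as np

# Same seed, same draw: the two arrays are identical.
np.random.seed(123)
first = np.random.randint(0, 5, (3, 2))

np.random.seed(123)  # reset to the same seed
second = np.random.randint(0, 5, (3, 2))

print(np.array_equal(first, second))  # True

# Generator interface: each generator carries its own seed.
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)
print(np.array_equal(rng_a.integers(0, 5, (3, 2)),
                     rng_b.integers(0, 5, (3, 2))))  # True
```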

Random choice

You can select a random sample from an array using np.random.choice(array, number, replace). The array parameter can be any one-dimensional array or a one-dimensional slice of an array. number is the number of elements you want to sample. replace takes a boolean value and is optional. replace=True (the default) means sampling with replacement, so sampled elements can be repeated, as is the case on line 4 of the code below. replace=False means sampling without replacement, so an element cannot be picked more than once (line 5).

np.random.seed(123)
arr_4 = np.random.randint(0,10,(4,4))
print(arr_4)
print(np.random.choice(arr_4[1,:], 3))
print(np.random.choice(arr_4[1,:], 3, replace=False))
###Results
[[2 2 6 1]
 [3 9 6 1]
 [0 1 9 0]
 [0 9 3 4]]
[3 3 3]
[1 3 6]
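A consequence of sampling without replacement is that you cannot ask for more elements than the array holds. The sketch below checks both points, using a hypothetical array named pool that matches the second row of arr_4; the names sample and too_many_failed are also illustrative.

```python
import numpy as np

np.random.seed(123)
pool = np.array([3, 9, 6, 1])

# Without replacement, every sampled position is used at most once.
sample = np.random.choice(pool, 3, replace=False)
print(len(set(sample.tolist())) == 3)  # True, since pool has no duplicates

# Asking for 5 elements from a pool of 4 is not possible.
too_many_failed = False
try:
    np.random.choice(pool, 5, replace=False)
except ValueError as err:
    too_many_failed = True
    print("cannot sample:", err)
```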

Those are the four NumPy properties. NumPy contains many other built-in functions not covered here, such as linear algebra operations, histograms, Fourier transforms, and splitting and joining arrays. Check out the NumPy documentation, and as always remember to practice, practice, practice!

