Getting started with NumPy

NumPy is one of the most important and fundamental libraries in data science and machine learning. It provides multidimensional arrays and high-level mathematical functions such as:
- Linear algebra operations
- Fourier transform
- Random generators
The NumPy array is also the fundamental data structure for scikit-learn. The core of NumPy is well-optimized C code, so array operations run much faster than equivalent pure-Python loops.
The fundamental package for scientific computing with Python – NumPy
This article covers the basic and most frequently used operations in NumPy. It is beginner-friendly and also serves as a refresher for intermediate and advanced users.
Let’s start by importing NumPy:
import numpy as np
The as keyword makes np an alias for numpy, so we can write np instead of numpy. This is a common convention that saves typing and makes the code easier to read.
NumPy Arrays
To create a NumPy array, we use the np.array function. Its optional dtype argument sets the array to the required type; the NumPy documentation lists the available array data types. When the elements of the input have different types, they are upcast to the most general type. For example, if the input mixes int and float elements, all the integers are cast to their floating-point equivalents. If it mixes int, float, and string elements, everything is cast to strings.
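The upcasting rules can be checked with a short sketch. The variable names and sample values below are illustrative assumptions, not taken from the original article.

```python
import numpy as np

# Integers only: NumPy picks an integer dtype.
ints = np.array([1, 2, 3])

# Mixed int and float: every element is upcast to float.
mixed = np.array([1, 2.5, 3])

# Explicit dtype: the values are stored as 32-bit floats.
floats32 = np.array([1, 2, 3], dtype=np.float32)

# Mixed numbers and strings: everything becomes a string.
strings = np.array([1, 2.5, "a"])

print(ints.dtype, mixed.dtype, floats32.dtype, strings.dtype)
```

The exact integer and string dtypes printed can vary by platform and NumPy version, but the direction of the upcast stays the same.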
To cast an array to another type, we use the astype function. Its required argument is the new type of the array. To find out the current type of an array, we use .dtype.
To copy an array, we use its built-in copy function. NumPy also provides the NaN (Not a Number) value as np.nan. It acts as a placeholder for missing data and is a floating-point value, so it cannot be stored as an integer: creating an integer-typed array that contains np.nan results in an error.
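A minimal sketch of casting, copying, and np.nan follows. The array contents and variable names are assumptions made for illustration.

```python
import numpy as np

arr = np.array([1.7, 2.2, 3.9])
print(arr.dtype)                 # float64

# astype returns a new array of the requested type; the floats are truncated.
as_int = arr.astype(np.int32)

# copy returns an independent array, so changing it leaves arr untouched.
duplicate = arr.copy()
duplicate[0] = 100.0

# np.nan is a float, so it fits in a float array...
with_nan = np.array([np.nan, 1, 2])

# ...but forcing an integer dtype raises an error.
try:
    np.array([np.nan, 1, 2], dtype=np.int32)
    nan_error = None
except ValueError as exc:
    nan_error = exc

print(as_int, arr, with_nan, nan_error)
```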
NumPy Basics
NumPy can create arrays of evenly spaced values using [np.arange](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html). The function behaves much like Python's range function and returns a 1D array. If a single number n is passed, it returns the numbers from 0 to n-1. If two numbers m and n are passed, it returns the numbers from m to n-1. If three arguments m, n, and s are passed, it returns the numbers from m up to, but not including, n in steps of s.
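The three calling forms of np.arange can be compared side by side. The specific numbers are arbitrary choices for illustration.

```python
import numpy as np

up_to_five = np.arange(5)        # one argument: 0 up to 4
two_to_six = np.arange(2, 7)     # two arguments: 2 up to 6
stepped = np.arange(1, 10, 3)    # three arguments: 1, 4, 7

print(up_to_five, two_to_six, stepped)
```

Note that the stepped call stops at 7, not at n-1 = 9, because the next value (10) would reach the excluded end point.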
The shape attribute gives the shape of an array. The np.reshape function takes the input array and the new shape as arguments. The new shape must hold exactly the same number of elements: for example, an array of 10 elements can be reshaped to (5, 2) or (2, 5), because the product of the dimensions is 10 in both cases. We are allowed to use the special value -1 in at most one dimension of the new shape. The dimension with -1 takes on whatever value is necessary for the new shape to contain all the elements of the array.
The flatten function reshapes an array of any shape into a 1D array.
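Here is a small sketch of shape, reshape with -1, and flatten on a 10-element array. The names grid, auto, and flat are illustrative.

```python
import numpy as np

arr = np.arange(10)
print(arr.shape)                  # (10,)

grid = np.reshape(arr, (5, 2))    # 5 rows, 2 columns
auto = np.reshape(arr, (2, -1))   # -1 is inferred as 5
flat = grid.flatten()             # back to a 1D array

print(grid.shape, auto.shape, flat.shape)
```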
Math Operations
With NumPy arrays, we can apply arithmetic, logical, and other operations to every element of an array at once. This lets us modify a large amount of numeric data with only a few operations. Apart from basic arithmetic, NumPy also provides trigonometric, hyperbolic, exponential, logarithmic, and many other functions, which are listed in the NumPy documentation.
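As a quick illustration of element-wise behaviour, the following sketch applies a few operators to a small 2D array. The array values are assumptions chosen to keep the results easy to check by hand.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

plus_one = arr + 1      # add 1 to every element
doubled = arr * 2       # multiply every element by 2
squared = arr ** 2      # square every element
halves = arr / 2        # true division gives floats
sines = np.sin(arr)     # trigonometric functions are also element-wise

print(plus_one, doubled, squared, halves, sines, sep="\n")
```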
The function [np.exp](https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html) computes the base e exponential of each element in an array, and [np.log10](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log10.html) computes the base 10 logarithm of each element. To do a regular power operation with any base, we use [np.power](https://docs.scipy.org/doc/numpy/reference/generated/numpy.power.html). The first argument to the function is the base, while the second is the power. To perform matrix multiplication between two arrays, we use the np.matmul function. The dimensions of the two input matrices must obey the rule of matrix multiplication: the second dimension of the first matrix must equal the first dimension of the second matrix, otherwise np.matmul raises a ValueError.
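The functions described above can be tried on small inputs. The matrices a and b are hypothetical examples with shapes (2, 3) and (3, 2), picked so that one product is valid and the other is not.

```python
import numpy as np

exps = np.exp(np.array([0.0, 1.0]))                 # e**0 and e**1
logs = np.log10(np.array([1.0, 10.0, 100.0]))       # 0, 1, 2
powers = np.power(2, np.array([1, 2, 3]))           # 2, 4, 8

a = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3)
b = np.array([[1, 0], [0, 1], [1, 1]])    # shape (3, 2)
product = np.matmul(a, b)                 # shape (2, 2)

# (2, 3) times (2, 3) breaks the dimension rule.
try:
    np.matmul(a, a)
    matmul_error = None
except ValueError as exc:
    matmul_error = exc

print(exps, logs, powers, product, matmul_error, sep="\n")
```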
Random Generator
NumPy has a module called np.random for pseudo-random number generation, which works on anything from 1D to multidimensional arrays. The [np.random.randint](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randint.html#numpy.random.randint) function generates random integers. It takes a single required argument. When only that argument is given, it is treated as high and the integers are drawn from 0 (inclusive) to high (exclusive); when both low and high are given, the integers are drawn from low (inclusive) to high (exclusive). The size argument sets the shape of the returned array. The [np.random.seed](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.seed.html#numpy.random.seed) function sets the random seed, which lets us control the outputs of the pseudo-random functions. It takes a single argument that represents the random seed. With the same seed set, the random functions produce identical output on every run. To shuffle the elements of an array, we use [np.random.shuffle](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html), which reorders the array in place; for a multidimensional array, shuffling occurs only along the first axis.
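A short sketch of randint, seed, and shuffle is given below. The seed value, bounds, and shapes are arbitrary assumptions; the actual numbers drawn depend on the seed and NumPy version, so none are shown.

```python
import numpy as np

np.random.seed(1)
first = np.random.randint(5, size=(2, 3))        # integers from 0 to 4
ranged = np.random.randint(3, high=10, size=4)   # integers from 3 to 9

# Resetting the same seed reproduces the same numbers.
np.random.seed(1)
second = np.random.randint(5, size=(2, 3))
print((first == second).all())                   # True

# shuffle works in place; for a 2D array only the rows are reordered.
matrix = np.arange(6).reshape(3, 2)
np.random.shuffle(matrix)
print(matrix)
```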
np.random can also draw samples from different probability distributions. The most commonly used are np.random.uniform and np.random.normal; the others are listed in the NumPy documentation. The np.random.uniform function draws random samples from the provided range and returns an array of the requested size. np.random.normal draws samples from a normal distribution, where loc and scale represent the mean and standard deviation respectively. Neither function has any required arguments.
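The two distributions can be sampled as follows. The range, mean, standard deviation, and sample sizes are illustrative assumptions.

```python
import numpy as np

np.random.seed(0)

single = np.random.uniform()     # no arguments: one float in [0, 1)
uniform = np.random.uniform(low=-2.0, high=2.0, size=(2, 3))
normal = np.random.normal(loc=5.0, scale=2.0, size=1000)

# With many samples, the statistics land close to loc and scale.
print(normal.mean(), normal.std())
```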
Accessing array elements
Accessing array elements in NumPy is similar to accessing elements in Python lists. For a multidimensional array, it is the same as accessing elements in a Python list of lists. Slicing works as in Python, using the : (colon) operator; arr[:] slices out the entire array. In a multidimensional array, a , (comma) separates the slices for each dimension. Negative indexes and slices count backward from the end.
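The indexing and slicing forms mentioned above look like this in practice. The arrays arr and grid are made-up examples.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[0], arr[-1])    # first and last element: 10 50
print(arr[1:4])           # elements at indexes 1 to 3
print(arr[:])             # the whole array
print(arr[-2:])           # the last two elements

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
print(grid[1][2], grid[1, 2])   # list-of-lists style and comma style: 6 6
print(grid[:, 0])               # every row, first column
print(grid[1:, :2])             # rows 1 onward, first two columns
```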
To find the indexes of the minimum and maximum elements in an array, we can use np.argmin and np.argmax. Note that these functions return the index of the element, not its value. The only required argument for both functions is the input array. With axis=0, np.argmin returns, for each column, the row index of its minimum element. With axis=1, np.argmax returns, for each row, the column index of its maximum element.
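The original code listing is not available here, so the following is a stand-in sketch with an assumed 2x3 array showing np.argmin and np.argmax with and without the axis argument.

```python
import numpy as np

arr = np.array([[3, 9, 1],
                [8, 2, 7]])

flat_min = np.argmin(arr)          # 2: index into the flattened array
flat_max = np.argmax(arr)          # 1
col_min = np.argmin(arr, axis=0)   # [0 1 0]: row of the minimum in each column
row_max = np.argmax(arr, axis=1)   # [1 0]: column of the maximum in each row

print(flat_min, flat_max, col_min, row_max)
```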
Data Refining
Out of a large amount of data, we usually need to keep only the data required for the analysis. This can be done with basic relational operations such as ==, !=, >, and <, which NumPy performs element-wise on arrays. The ~ operation represents a boolean negation, i.e. it flips each truth value in the array.
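A brief sketch of element-wise comparisons and negation follows. The array is an assumed example, and the last line uses boolean-mask indexing, which the text above does not spell out, to show how a comparison result can filter an array.

```python
import numpy as np

arr = np.array([[0, 2, 3],
                [1, 3, -6]])

is_three = arr == 3        # True where an element equals 3
positive = arr > 0         # True where an element is greater than 0
not_positive = ~positive   # flips every truth value

# A boolean array can be used as an index to keep only matching elements.
kept = arr[positive]       # [2 3 1 3]

print(is_three, positive, not_positive, kept, sep="\n")
```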
The np.isnan function determines which locations of an array contain the nan value: it returns True where a nan value is present and False elsewhere. The [np.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function takes a required first argument, which is the input boolean array. It returns the locations of the elements where the condition is satisfied, i.e. where the boolean value is True. When it is applied with only the first argument, it returns a tuple of index arrays, one per dimension, so a 1-D input gives a tuple containing a single 1-D array. The printed output may also show the data type of that array. The np.where function must be called with exactly 1 or 3 arguments. With 3 arguments, the 1st argument must be the input boolean array, the 2nd argument gives the replacement values used where it is True, and the 3rd argument gives the replacement values used where it is False.
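The one-argument and three-argument forms of np.where can be compared on an array containing nan values. The data and the choice of 0.0 as a replacement value are assumptions for illustration.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan])
nan_mask = np.isnan(arr)                # [False  True False  True]

# One argument: a tuple with one index array per dimension.
locations = np.where(nan_mask)          # indexes 1 and 3

# Three arguments: take the 2nd where True, the 3rd where False.
filled = np.where(nan_mask, 0.0, arr)   # [1. 0. 3. 0.]

print(nan_mask, locations, filled, sep="\n")
```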
If we want to filter based on rows or columns of data, we can use the [np.any](https://docs.scipy.org/doc/numpy/reference/generated/numpy.any.html) and [np.all](https://docs.scipy.org/doc/numpy/reference/generated/numpy.all.html) functions. Both functions take the same arguments and return a boolean value, or a boolean array when the axis argument is used. The required argument for both functions is a boolean array. np.any returns True if at least one element in the array satisfies the provided condition, otherwise it returns False. np.all returns True if all elements in the array satisfy the provided condition, otherwise it returns False. np.any and np.all behave like the logical OR (||) and logical AND (&&) operators respectively, applied across all the elements.
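The sketch below shows np.any and np.all on a whole array and per row, then uses the per-row result to keep only the rows where every value is positive. The array and the "greater than 0" condition are illustrative assumptions.

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
positive = arr > 0

print(np.any(positive))                # True: at least one positive element
print(np.all(positive))                # False: not every element is positive

rows_any = np.any(positive, axis=1)    # [False  True  True]
rows_all = np.all(positive, axis=1)    # [False False  True]

all_positive_rows = arr[rows_all]      # keeps only the last row
print(all_positive_rows)
```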
Aggregation Techniques
Aggregation involves techniques such as summation, concatenation, cumulative sums, and many more. To sum the elements of a single array, we can use the np.sum function. With the axis argument, we can instead obtain the sums across rows or columns. To perform cumulative addition, we can use the np.cumsum function. Not setting axis returns a cumulative sum across all the values of the flattened array. Setting axis=0 returns an array with cumulative sums down each column, while axis=1 returns an array with cumulative sums along each row.
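The effect of the axis argument on np.sum and np.cumsum is easiest to see on a small array. The 2x3 array below is an assumed example.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

total = np.sum(arr)                # 21
col_sums = np.sum(arr, axis=0)     # [5 7 9]
row_sums = np.sum(arr, axis=1)     # [ 6 15]

running = np.cumsum(arr)               # [ 1  3  6 10 15 21]
running_cols = np.cumsum(arr, axis=0)  # rows: [1 2 3] and [5 7 9]
running_rows = np.cumsum(arr, axis=1)  # rows: [1 3 6] and [4 9 15]

print(total, col_sums, row_sums, running, running_cols, running_rows, sep="\n")
```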
To concatenate multiple arrays of matching shape, we can use the np.concatenate function (NumPy 2.0 and later also accept np.concat as an alias). The default value of axis is 0, so concatenation takes place vertically. With axis=1, concatenation takes place horizontally.
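Vertical and horizontal concatenation can be sketched with two assumed 2x2 arrays. The example uses np.concatenate, which is available in every NumPy version.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical = np.concatenate([a, b])             # axis=0: shape (4, 2)
horizontal = np.concatenate([a, b], axis=1)   # shape (2, 4)

print(vertical)
print(horizontal)
```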
Statistical Operation
To inspect the data in an array, we can perform some statistical operations using NumPy. To obtain the minimum and maximum values in a NumPy array, we can use its min and max functions respectively. The axis keyword argument works the same way as it does for np.argmin and np.argmax in the topic ‘Accessing array elements’. Here, axis=0 gives an array of the minimum values in each column of an array arr1, and axis=1 gives an array of the maximum values in each row of arr1.
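A minimal sketch of min and max follows. The contents of arr1 are an assumption, since the original listing that defined it is not part of this text.

```python
import numpy as np

# Hypothetical stand-in for the article's arr1.
arr1 = np.array([[3, 9, 1],
                 [8, 2, 7]])

smallest = arr1.min()          # 1
largest = arr1.max()           # 9
col_mins = arr1.min(axis=0)    # [3 2 1]: minimum of each column
row_maxs = arr1.max(axis=1)    # [9 8]: maximum of each row

print(smallest, largest, col_mins, row_maxs)
```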
We can also find the mean, median, variance, and standard deviation with np.mean, np.median, np.var, and np.std respectively. Each accepts the axis keyword argument, so the metrics can be computed across the rows or columns of the array. The NumPy documentation lists all the statistical operations that NumPy provides.
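These four statistics can be computed on a small assumed array as shown below. By default np.var and np.std divide by the number of elements (the population form) rather than by n-1.

```python
import numpy as np

arr = np.array([[2, 4],
                [6, 8]])

mean = np.mean(arr)              # 5.0
median = np.median(arr)          # 5.0
variance = np.var(arr)           # 5.0: mean of the squared deviations
std_dev = np.std(arr)            # square root of the variance

col_means = np.mean(arr, axis=0)   # [4. 6.]
row_means = np.mean(arr, axis=1)   # [3. 7.]

print(mean, median, variance, std_dev, col_means, row_means)
```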
End Note!!
In this article, we covered the basic and most frequently used operations in NumPy. Getting started with NumPy is a good first step in a Data Science or Machine Learning career. To master NumPy, I suggest reading the full NumPy documentation. I hope you found this article helpful.