
Diving into NumPy

A glance over an important library required for Data Science – NumPy

Getting started with NumPy

Photo by Myriam Jessier on Unsplash

NumPy is one of the most important and basic libraries used in data science and machine learning. It provides multidimensional arrays and high-level mathematical functions such as:

  • Linear algebra operations
  • Fourier transform
  • Random generators

The NumPy array is also the fundamental data structure for scikit-learn. The core of NumPy is well-optimized C code, so numerical operations on NumPy arrays run much faster than the equivalent pure-Python loops.

The fundamental package for scientific computing with Python – NumPy

This article covers the basic and most frequently used operations in NumPy. It is meant to be beginner-friendly and to serve as a refresher for intermediate and advanced users.

Let's start with NumPy by importing it:

import numpy as np

The as keyword makes np an alias for numpy, so we can write np instead of numpy. This is a common convention that saves typing and makes the code easier to read.

NumPy Arrays

To create a NumPy array we use the np.array function. Its optional dtype argument sets the element type of the array. The NumPy documentation lists the available array data types. When the input elements have different types, they are upcast to the most general type among them. This means that if the input mixes int and float elements, all the integers are cast to their floating-point equivalents. If the input mixes int, float, and string elements, everything is cast to strings.
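As a small sketch of the upcasting rules described above (the variable names and values are illustrative, not from the original article):

```python
import numpy as np

# Integers only: NumPy picks an integer dtype.
ints = np.array([1, 2, 3])

# Mixed int and float: the integers are upcast to floats.
mixed = np.array([1, 2.5, 3])

# Mixed numbers and strings: every element becomes a string.
with_text = np.array([1, 2.5, "three"])

# The dtype argument forces a specific element type at creation time.
as_float32 = np.array([1, 2, 3], dtype=np.float32)

print(ints.dtype, mixed.dtype, with_text.dtype, as_float32.dtype)
```

The exact integer and string dtypes printed depend on the platform and on the input, which is why the sketch prints them instead of stating them.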

To cast an array to another type we use the astype function. Its required argument is the new element type. To check the current type of an array we use its .dtype attribute.

To copy an array we use its built-in copy function. The NaN (Not a Number) value is available as np.nan. It acts as a placeholder for missing data and is a floating-point value, so it cannot be stored in an integer array. Creating an array that contains nan with an integer type results in an error.
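A minimal sketch of casting, copying, and np.nan, assuming a small float array (names and values are illustrative):

```python
import numpy as np

arr = np.array([0.9, 1.5, 2.7])

# astype returns a new array; float-to-int casting truncates toward zero.
as_int = arr.astype(np.int32)
print(arr.dtype, as_int.dtype)

# copy() gives an independent array, so editing it leaves arr unchanged.
dup = arr.copy()
dup[0] = 100.0

# np.nan is a float, so an array holding it gets a floating-point dtype.
with_nan = np.array([np.nan, 1.0, 2.0])

# Requesting an integer dtype while nan is present raises ValueError.
try:
    np.array([np.nan, 1, 2], dtype=np.int32)
    nan_as_int_failed = False
except ValueError:
    nan_as_int_failed = True
print(nan_as_int_failed)
```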

NumPy Basics

NumPy can create ranged arrays with [np.arange](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html). The function behaves much like the range function in Python and returns a 1D array. If a single number n is passed, it returns the numbers from 0 to n-1. If two numbers m and n are passed, it returns the numbers from m to n-1. If three arguments m, n, and s are passed, it returns the numbers from m up to but not including n, using step size s.
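The three call forms can be sketched as follows (the argument values are arbitrary):

```python
import numpy as np

one_arg = np.arange(5)          # 0 up to 4
two_args = np.arange(2, 6)      # 2 up to 5
with_step = np.arange(1, 10, 3) # 1, 4, 7 (stops before 10)

print(one_arg, two_args, with_step)
```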

The shape attribute gives the shape of an array. The reshape function takes the input array and the new shape as its arguments. The new shape must hold exactly as many elements as the array: for example, an array of 10 elements can be reshaped to (5, 2) or (2, 5), because those dimensions multiply to 10. We are allowed to use the special value -1 in at most one dimension of the new shape. The dimension with -1 takes on whatever value is needed for the new shape to contain all the elements of the array.

The flatten function reshapes an array of any shape into a 1D array.
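A short sketch of shape, reshape with -1, and flatten on a 10-element array (the array itself is an illustrative choice):

```python
import numpy as np

arr = np.arange(10)
print(arr.shape)                 # (10,)

# Both call styles are valid: the function form and the array method.
two_by_five = np.reshape(arr, (2, 5))
five_by_two = arr.reshape(5, 2)

# -1 lets NumPy infer the missing dimension: 10 elements / 2 columns = 5 rows.
inferred = arr.reshape(-1, 2)

# flatten always returns a 1D copy.
flat = two_by_five.flatten()
print(two_by_five.shape, inferred.shape, flat.shape)
```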

Math Operations

With NumPy arrays, we can apply arithmetic, logical, and other operations to every element of an array at once. This lets us modify a large amount of numeric data with only a few operations. Apart from basic arithmetic, NumPy provides trigonometric, hyperbolic, exponential, logarithmic, and many other functions. The NumPy documentation lists all of these mathematical functions.
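As an illustration of element-wise operations, here is a sketch on a small 2x2 array (values chosen for readability):

```python
import numpy as np

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

plus_one = arr + 1     # adds 1 to every element
doubled = arr * 2      # multiplies every element by 2
squared = arr ** 2     # squares every element
is_big = arr > 2       # element-wise comparison gives a boolean array
sines = np.sin(arr)    # mathematical functions are applied per element too

print(plus_one)
print(is_big)
```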

The function [np.exp](https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html) raises e to the power of each element of an array, and [np.log10](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log10.html) computes the base-10 logarithm of each element. To do a regular power operation with any base, we use [np.power](https://docs.scipy.org/doc/numpy/reference/generated/numpy.power.html). The first argument to the function is the base, while the second is the power. To perform matrix multiplication between two arrays we use the np.matmul function. The shapes of the two input matrices must obey the rule of matrix multiplication: the second dimension of the first matrix must equal the first dimension of the second matrix, otherwise np.matmul raises a ValueError.
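The functions above can be sketched on small inputs as follows (array values are illustrative):

```python
import numpy as np

logs = np.log10(np.array([1.0, 10.0, 100.0]))   # approximately [0, 1, 2]
exps = np.exp(np.array([0.0, 1.0]))             # approximately [1, e]
powers = np.power(2, np.array([1, 2, 3]))       # 2**1, 2**2, 2**3

a = np.arange(6).reshape(2, 3)   # shape (2, 3)
b = np.arange(6).reshape(3, 2)   # shape (3, 2)
product = np.matmul(a, b)        # inner dimensions match, result is (2, 2)

# (2, 3) times (2, 3) has mismatched inner dimensions, so matmul fails.
try:
    np.matmul(a, a)
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(product)
print(mismatch_failed)
```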

Random Generator

NumPy has a module called np.random for pseudo-random number generation, which works on anything from 1D to multidimensional arrays. The [np.random.randint](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randint.html#numpy.random.randint) function generates random integers. It takes a single required argument. When only that argument is given, it acts as the exclusive upper bound (high) and the lower bound defaults to 0. When both low and high are given, the integers are drawn from low (inclusive) to high (exclusive). The size argument sets the shape of the returned array. The [np.random.seed](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.seed.html#numpy.random.seed) function sets the random seed, which lets us control the outputs of the pseudo-random functions. It takes a single argument that represents the random seed. With the same seed set through np.random.seed, the random functions produce identical output on every run. To shuffle the elements of an array in place we use [np.random.shuffle](https://numpy.org/doc/stable/reference/random/generated/numpy.random.shuffle.html). For a multidimensional array, shuffling occurs only along the first axis.
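A minimal sketch of seeding, randint, and shuffle; the seed value and bounds are arbitrary choices for illustration:

```python
import numpy as np

# Same seed, same call: the two draws are identical.
np.random.seed(1)
first = np.random.randint(5, high=10, size=(2, 3))
np.random.seed(1)
second = np.random.randint(5, high=10, size=(2, 3))
print(np.array_equal(first, second))

# With one argument, integers are drawn from 0 up to (not including) 10.
single = np.random.randint(10)

# shuffle works in place; for a 2D array only the row order changes.
matrix = np.arange(6).reshape(3, 2)
np.random.shuffle(matrix)
print(matrix)
```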

np.random can also draw samples from different probability distributions. The most commonly used are np.random.uniform and np.random.normal, and the NumPy documentation lists the others. The np.random.uniform function draws samples uniformly from the provided range and returns an array of the requested size. np.random.normal draws samples from a normal distribution, where loc and scale represent the mean and standard deviation respectively. Neither function has any required arguments.
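As a sketch of both distributions (the bounds, mean, standard deviation, and seed are illustrative values):

```python
import numpy as np

np.random.seed(0)

# Uniform samples in the half-open range [-1.5, 2.0), returned as a 2x2 array.
u = np.random.uniform(low=-1.5, high=2.0, size=(2, 2))

# 1000 samples from a normal distribution with mean 1.5 and std dev 3.5.
n = np.random.normal(loc=1.5, scale=3.5, size=1000)

# With no arguments: uniform draws from [0, 1), normal uses mean 0 and std dev 1.
default_u = np.random.uniform()
default_n = np.random.normal()

print(u)
print(n.mean(), n.std())
```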

Accessing array elements

Accessing array elements in NumPy is similar to accessing elements in Python lists. For a multidimensional array, it works the same way as accessing elements in a Python list of lists. Slicing works as in Python, using the : (colon) operator. arr[:] selects the entire array. In a multidimensional array, a , (comma) separates the slices for each dimension. Negative indices and slices count backward from the end.
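The indexing and slicing rules above can be sketched on a 3x3 array (values are illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

first_row = arr[0]           # [1 2 3]
element = arr[1, 2]          # 6, same element as arr[1][2]
middle_col = arr[:, 1]       # [2 5 8], every row, column 1
sub_block = arr[1:, :2]      # rows 1 onward, first two columns
last_row = arr[-1]           # [7 8 9], negative index counts from the end
last_two_cols = arr[:, -2:]  # last two columns of every row

print(sub_block)
print(last_two_cols)
```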

To find the indexes of the minimum and maximum elements in an array we can use np.argmin and np.argmax. Note that these functions return the index of the element, not the element itself. The required argument for both functions is the input array. With axis=0, np.argmin returns the row index of the minimum element in each column. With axis=1, np.argmax returns the column index of the maximum element in each row.
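A small sketch of np.argmin and np.argmax with and without axis; the array is a made-up example, not the one from the article's original code:

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [ 4,  5, -6],
                [-3,  9,  1]])

# Without axis, the index refers to the flattened array.
flat_min = np.argmin(arr)           # 5, the position of -6

# axis=0: for each column, the row index of its minimum.
min_rows = np.argmin(arr, axis=0)   # [2 0 1]

# axis=1: for each row, the column index of its maximum.
max_cols = np.argmax(arr, axis=1)   # [1 1 1]

print(flat_min, min_rows, max_cols)
```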

Data Refining

A dataset usually contains more than we need, so we filter out only the data required for the analysis. This can be done with basic relational operations such as ==, !=, >, and <. NumPy performs these operations element-wise on arrays. The ~ operation represents a boolean negation, i.e. it flips each truth value in a boolean array.
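As a sketch of element-wise comparisons and negation, here is a boolean mask used to filter an illustrative array:

```python
import numpy as np

arr = np.array([[ 0,  2,  3],
                [ 1,  3, -6],
                [-3, -2,  1]])

mask = arr > 0          # True where the element is positive
not_one = arr != 1      # True where the element differs from 1
inverted = ~mask        # flips every truth value in the mask

# Indexing with a boolean mask keeps only the elements marked True.
positives = arr[mask]   # [2 3 1 3 1]

print(mask)
print(positives)
```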

The np.isnan function determines which locations of an array contain a nan value: it returns True where a nan value is present and False elsewhere. The [np.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function takes a required first argument, which is the input boolean array. It returns the locations of the elements where the condition is satisfied, i.e. where the boolean value is True. When it is called with only the first argument it returns a tuple of 1-D index arrays, one per dimension of the input. The printed output may also show the data type of these index arrays. The np.where function must be called with exactly 1 or 3 arguments. In the case of 3 arguments, the 1st argument must be the input boolean array, the 2nd argument gives the replacement values where it is True, and the 3rd argument gives the replacement values where it is False.
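A minimal sketch of np.isnan and both forms of np.where, using made-up arrays:

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0, np.nan])

nan_mask = np.isnan(arr)            # [False  True False  True]

# One argument: a tuple holding one index array per dimension.
nan_positions = np.where(nan_mask)  # (array([1, 3]),)

# Three arguments: take 0.0 where the mask is True, else the original value.
filled = np.where(nan_mask, 0.0, arr)

# For a 2D input the tuple has two arrays: row indices and column indices.
grid = np.array([[0, 5], [7, 0]])
rows, cols = np.where(grid > 0)

print(nan_positions)
print(filled)
print(rows, cols)
```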

If we want to filter based on rows or columns of data, we can use the [np.any](https://docs.scipy.org/doc/numpy/reference/generated/numpy.any.html) and [np.all](https://docs.scipy.org/doc/numpy/reference/generated/numpy.all.html) functions. Both functions take the same arguments and return a boolean value, or a boolean array when the axis argument is set. The required argument for both functions is a boolean array. np.any returns True if at least one element in the array satisfies the condition, otherwise it returns False. np.all returns True if all elements in the array satisfy the condition, otherwise it returns False. np.any and np.all behave like a logical OR and a logical AND applied across the elements, respectively.
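The following sketch shows np.any and np.all on a whole array and per row, then uses the result to filter rows (the array is illustrative):

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [ 4,  5, -6],
                [ 3,  9,  1]])

any_positive = np.any(arr > 0)          # True: at least one positive element
all_positive = np.all(arr > 0)          # False: some elements are not positive

# axis=1 evaluates the condition separately for each row.
row_has_positive = np.any(arr > 0, axis=1)   # [False  True  True]
row_all_positive = np.all(arr > 0, axis=1)   # [False False  True]

# Keep only the rows in which every element is positive.
kept_rows = arr[row_all_positive]

print(row_has_positive, row_all_positive)
print(kept_rows)
```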

Aggregation Techniques

Aggregation involves techniques such as summation, concatenation, cumulative sums, and many more. To sum the elements of a single array we can use the np.sum function. With the axis argument we can obtain the sums over rows or over columns instead of a single total. To perform cumulative addition we can use the np.cumsum function. Not setting axis returns a cumulative sum over all the values of the flattened array. Setting axis=0 returns an array with cumulative sums running down each column, while axis=1 returns an array with cumulative sums running along each row.
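A short sketch of np.sum and np.cumsum with each axis setting, on an illustrative 2x3 array:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

total = np.sum(arr)               # 21, sum of every element
col_sums = np.sum(arr, axis=0)    # [5 7 9], one sum per column
row_sums = np.sum(arr, axis=1)    # [ 6 15], one sum per row

running = np.cumsum(arr)          # [ 1  3  6 10 15 21], flattened array
down_cols = np.cumsum(arr, axis=0)   # [[1 2 3], [5 7 9]]
along_rows = np.cumsum(arr, axis=1)  # [[1 3 6], [4 9 15]]

print(total, col_sums, row_sums)
print(down_cols)
print(along_rows)
```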

To concatenate multiple arrays we can use the np.concatenate function. The arrays must have the same shape except along the axis being joined. The default value of axis is 0, so for 2D arrays concatenation takes place vertically. With axis=1, concatenation takes place horizontally.
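As a sketch of vertical and horizontal concatenation with np.concatenate (the two 2x2 arrays are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Default axis=0: b's rows are appended below a's rows.
vertical = np.concatenate([a, b])

# axis=1: b's columns are appended to the right of a's columns.
horizontal = np.concatenate([a, b], axis=1)

print(vertical.shape, horizontal.shape)   # (4, 2) (2, 4)
```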

Statistical Operation

To inspect the data in an array we can perform some statistical operations using NumPy. To obtain the minimum and maximum values in a NumPy array we can use its min and max functions respectively. The axis keyword argument works the same way as in np.argmin and np.argmax from the topic ‘Accessing array elements’. For a 2D array arr1, axis=0 gives an array of the minimum values in each column and axis=1 gives an array of the maximum values in each row.

We can also find the mean, median, variance, and standard deviation with np.mean, np.median, np.var, and np.std respectively. These functions accept the axis keyword argument as well, so we can obtain the metrics across the rows or columns of the array. The NumPy documentation lists all the statistical operations it provides.
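A minimal sketch of these statistical functions; the contents of arr1 here are hypothetical values chosen so the results are easy to check by hand:

```python
import numpy as np

# Hypothetical data standing in for arr1.
arr1 = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 9.0]])

overall_min = arr1.min()          # 1.0
overall_max = arr1.max()          # 9.0
col_mins = arr1.min(axis=0)       # [1. 2. 3.], minimum of each column
row_maxs = arr1.max(axis=1)       # [3. 9.], maximum of each row

mean = np.mean(arr1)              # 4.0
median = np.median(arr1)          # 3.5
variance = np.var(arr1)           # population variance, 40 / 6
std_dev = np.std(arr1)            # square root of the variance

col_means = np.mean(arr1, axis=0)      # [2.5 3.5 6. ]
row_medians = np.median(arr1, axis=1)  # [2. 5.]

print(mean, median, variance, std_dev)
```

Note that np.var and np.std divide by the number of elements by default; passing ddof=1 gives the sample versions instead.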


End Note!!

In this article, we covered the basic and most frequently used operations in NumPy. Learning NumPy is a good starting point for a career in data science or machine learning. To master NumPy, I suggest reading through the official NumPy documentation. I hope you found this article helpful.

