
10 NumPy functions you should know

with data science and artificial intelligence examples

NumPy is a Python package for scientific computing that provides high-performance multidimensional array objects. The library is widely used for numerical analysis, matrix computations, and mathematical operations. In this article, we present 10 useful NumPy functions along with data science and artificial intelligence applications. Let’s get started! 🍀

1. numpy.linspace

The numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0) function returns evenly spaced numbers over the interval defined by its first two arguments (start and stop, both required). The third argument, num, specifies the number of samples to generate; if omitted, 50 samples are generated. One important thing to bear in mind while working with this function is that the stop value is included in the returned array (endpoint=True by default), unlike in the built-in Python function range.
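As a minimal sketch of the endpoint behaviour described above (the interval 0 to 1 and the 5 samples are arbitrary values chosen for illustration):

```python
import numpy as np

# Five evenly spaced samples from 0 to 1; the stop value is included by default.
with_stop = np.linspace(0, 1, num=5)

# With endpoint=False the stop value is left out, as range() would do.
without_stop = np.linspace(0, 1, num=5, endpoint=False)

# retstep=True additionally returns the spacing between consecutive samples.
samples, step = np.linspace(0, 1, num=5, retstep=True)

print(with_stop)     # [0.   0.25 0.5  0.75 1.  ]
print(without_stop)  # [0.  0.2 0.4 0.6 0.8]
print(step)          # 0.25
```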

Example

The linspace function can be used to generate evenly spaced samples for the x-axis. For instance, if we want to plot a mathematical function, we can easily generate samples for the x-axis by using the numpy.linspace function. In reinforcement learning, we can employ this function for discretization: given the lowest and highest values of a continuous space (states or actions), it generates a uniformly spaced discrete space.

The following plot shows 4 mathematical functions: (1) sine, (2) cosine, (3) exponential, and (4) logarithmic. To generate the x-axis data, we employ the linspace function, generating 111 data points from 0 to 100, both included. You may notice that we have used NumPy again to generate the mathematical functions. The documentation lists the wide range of mathematical functions that NumPy provides 🙂
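The data behind such a plot could be generated roughly as follows. This is only a sketch of the NumPy part, not the original plotting code; the variable names are illustrative.

```python
import numpy as np

# 111 samples from 0 to 100, both ends included
x = np.linspace(0, 100, 111)

sine = np.sin(x)
cosine = np.cos(x)
exponential = np.exp(x)

# log(0) is undefined, so NumPy returns -inf for the first sample;
# errstate silences the corresponding divide-by-zero warning.
with np.errstate(divide="ignore"):
    logarithm = np.log(x)

print(x.shape)      # (111,)
print(x[0], x[-1])  # 0.0 100.0
```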

2. numpy.digitize

Maybe you have never heard of this function, but it can be really useful when working with continuous spaces in reinforcement learning. The numpy.digitize(x, bins, right=False) function takes two main arguments, (1) an input array x and (2) an array of bins, and returns the index of the bin to which each value in the input array belongs. Confusing? Let’s see an example 👌

Example
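A small sketch of the call, assuming the bin edges 0, 1, 2, and 3 (the input values are made up for illustration):

```python
import numpy as np

# Four edges define five bins: (-inf, 0), [0, 1), [1, 2), [2, 3), [3, inf)
bins = np.array([0, 1, 2, 3])

single = np.digitize(0.5, bins)
several = np.digitize([-2.0, 0.5, 1.0, 2.7, 8.0], bins)

print(single)   # 1
print(several)  # [0 1 2 3 4]
```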

With the bin edges [0, 1, 2, 3], we have 5 bins in total:

  • x < 0 → Index 0
  • 0 ≤ x < 1 → Index 1
  • 1 ≤ x < 2 → Index 2
  • 2 ≤ x < 3 → Index 3
  • 3 ≤ x → Index 4

Therefore, if we provide as an input 0.5, the function returns 1, since that is the index of the bin to which 0.5 belongs.

In reinforcement learning, we can discretize state spaces by using uniformly-spaced grids. Discretization allows us to apply algorithms designed for discrete spaces such as Sarsa, Sarsamax, or Expected Sarsa to continuous spaces.

Imagine we have the following continuous space. The agent can be in any position (x, y), where 0 ≤ x ≤ 5 and 0 ≤ y ≤ 5. We can discretize the position of the agent by providing a tuple that indicates the grid cell in which the agent is located.

We can easily achieve this discretization by using the numpy.digitize function as follows:
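One possible sketch, assuming a 5×5 grid with cell boundaries at 1, 2, 3, and 4. The helper name discretize is hypothetical and introduced only for illustration.

```python
import numpy as np

# Inner boundaries of the grid cells along each axis
bins = np.array([1, 2, 3, 4])

def discretize(position, bins=bins):
    """Map a continuous (x, y) position to a tuple of grid indices."""
    return tuple(int(i) for i in np.digitize(position, bins))

print(discretize((3.2, 0.7)))  # (3, 0)
print(discretize((4.0, 5.0)))  # (4, 4)
```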

We will consider that any value lower than 1 belongs to bin index 0 and any value larger than or equal to 4 belongs to bin index 4. And voilà! We have transformed a continuous space into a discrete one.

3. numpy.repeat

The numpy.repeat(a, repeats, axis=None) function repeats the elements of an array. The number of repetitions is specified by the second argument, repeats.
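A few illustrative calls (the input values are arbitrary) show how repeats and axis interact:

```python
import numpy as np

scalar_repeated = np.repeat(7, 3)            # repeat a single value
each_twice = np.repeat([1, 2], 2)            # repeat every element twice
per_element = np.repeat([1, 2], [3, 1])      # a separate count per element

m = np.array([[1, 2], [3, 4]])
rows_repeated = np.repeat(m, 2, axis=0)      # repeat every row twice

print(scalar_repeated)  # [7 7 7]
print(each_twice)       # [1 1 2 2]
print(per_element)      # [1 1 1 2]
print(rows_repeated)
```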

Example

Let’s say we have two different data frames, containing the sales in 2017 and 2018, but we want only one data frame, including all the information.

[Figure: sales in 2017]
[Figure: sales in 2018]

Before merging both data frames, we need to add a column, specifying the year in which the products were sold. We can add this information by using the numpy.repeat function. Subsequently, we concatenate both data frames by using the pandas.concat function.
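The steps above might look like the following sketch. The column names and sales figures are invented for illustration, since the original tables are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data for each year
sales_2017 = pd.DataFrame({"product": ["A", "B", "C"], "units": [10, 25, 7]})
sales_2018 = pd.DataFrame({"product": ["A", "B", "C"], "units": [12, 30, 9]})

# Add a year column with one repeated value per row
sales_2017["year"] = np.repeat(2017, len(sales_2017))
sales_2018["year"] = np.repeat(2018, len(sales_2018))

# Stack both data frames into a single one with a fresh index
sales = pd.concat([sales_2017, sales_2018], ignore_index=True)
print(sales)
```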

[Figure: sales]

4. numpy.random

4.1. numpy.random.randint

The numpy.random.randint(low, high=None, size=None, dtype=int) function returns random integers from the interval [low, high). If the high parameter is missing (None), the random numbers are selected from the interval [0, low). By default, a single random integer is returned. To generate an ndarray of random integers, the shape of the array is provided in the size parameter.

Example

This function can be used to simulate random events such as tossing a coin or rolling a die, as shown below.
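A minimal sketch of both simulations. The encoding 0 = heads and 1 = tails is an assumption, and the seed is fixed only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the run is repeatable

# Coin toss: a single integer from [0, 2), i.e. 0 (heads) or 1 (tails)
coin = np.random.randint(2)

# Die roll: a single integer from [1, 7), i.e. 1 to 6
die = np.random.randint(1, 7)

# Ten die rolls at once, returned as an array of shape (10,)
rolls = np.random.randint(1, 7, size=10)

print(coin, die, rolls)
```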

4.2. numpy.random.choice

The numpy.random.choice(a, size=None, replace=True, p=None) function returns a random sample from a given array. By default, a single value is returned. To return more elements, the output shape can be specified in the size parameter, as we did before with the numpy.random.randint function.

Example

The random events shown above can also be simulated by using the numpy.random.choice function.

By default, elements have equal probability of being selected. To assign different probabilities to each element, an array of probabilities p can be provided. Using this parameter p, we can simulate a biased coin flip as follows:
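The following sketch covers both cases. The 80/20 bias and the sample sizes are arbitrary values chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the run is repeatable

# Fair coin and fair die: every element is equally likely
coin = np.random.choice(["heads", "tails"])
die_rolls = np.random.choice(np.arange(1, 7), size=10)

# Biased coin: heads with probability 0.8, tails with probability 0.2
biased = np.random.choice(["heads", "tails"], size=10_000, p=[0.8, 0.2])
heads_share = np.mean(biased == "heads")

print(coin, die_rolls)
print(heads_share)  # close to 0.8
```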

4.3. numpy.random.binomial

We can simulate a wide variety of statistical distributions with numpy.random, such as the normal, beta, binomial, uniform, gamma, or Poisson distributions.

The numpy.random.binomial(n, p, size=None) function draws samples from a binomial distribution. The binomial distribution is used when there are two mutually exclusive outcomes; it gives the number of successes in n trials, where p is the probability of success on a single trial.
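As an illustrative sketch, we can draw the number of heads in 10 tosses of a fair coin and repeat the experiment 1000 times (all three numbers are arbitrary choices):

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the run is repeatable

# Each sample is the number of successes (heads) in n=10 trials with p=0.5
heads = np.random.binomial(n=10, p=0.5, size=1000)

print(heads[:5])
print(heads.mean())  # close to the expected value n * p = 5
```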

I recommend reading the documentation to discover the wide range of functions that the numpy.random module provides.

5. numpy.polyfit

The numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False) function returns the coefficients of a polynomial of degree deg that fits the points (x, y), minimizing the squared error.

This function can be very useful in linear regression problems. Linear regression models the relationship between a dependent variable and an independent variable, obtaining a line that best fits the data.

y = a + bx

where x is the independent variable, y is the dependent variable, b is the slope, and a is the intercept. To obtain both coefficients a and b, we can use the numpy.polyfit function as follows.

Example

Let’s say we have a data frame containing the heights and weights of 5000 men.

As we can observe, both variables present a linear relation.

We obtain the best-fit linear equation with the numpy.polyfit function in the following manner:
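The original data set is not available here, so this sketch fits a synthetic stand-in. The heights and weights are generated around the line reported in this article with added noise; the mean, spread, and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 5000 height (inches) / weight (pounds) pairs
height = rng.normal(69.0, 3.0, size=5000)
weight = 5.96 * height - 224.50 + rng.normal(0.0, 10.0, size=5000)

# deg=1 fits a straight line; coefficients come back highest degree first
slope, intercept = np.polyfit(height, weight, deg=1)

print(round(slope, 2), round(intercept, 2))
```

Because the coefficients are returned from the highest degree down, the slope comes first and the intercept second.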

The function returns the slope (5.96) and intercept (-224.50) of the linear model. Now, we can employ the obtained model (y=5.96x-224.50) to predict the weight of a man (unseen data). This prediction can be obtained by using the numpy.polyval function.

6. numpy.polyval

The numpy.polyval(p, x) function evaluates a polynomial at specific values. Previously, we obtained a linear model to predict the weight of a man (weight = 5.96 × height − 224.50) by using the numpy.polyfit function. Now, we use this model to make predictions with the numpy.polyval function. Let’s say we want to predict the weight of a man 70 inches tall. As arguments, we provide the polynomial coefficients (obtained with polyfit) from the highest degree to the constant term (p=[5.96, -224.50]), and the value at which to evaluate p (x=70).
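A short sketch of this prediction, using the rounded coefficients quoted in the text (the extra heights 65 and 75 are added only for illustration):

```python
import numpy as np

# Coefficients from highest degree to constant term: slope, then intercept
p = [5.96, -224.50]

# Predicted weight for a single height of 70 inches
predicted = np.polyval(p, 70)
print(round(float(predicted), 2))  # 192.7

# polyval also accepts an array and evaluates the polynomial element-wise
batch = np.polyval(p, np.array([65, 70, 75]))
print(np.round(batch, 2))
```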

The following plot shows the regression line as well as the predicted weight.

7. numpy.nan

The NumPy library includes several constants such as not a number (nan), infinity (inf), and pi. In computing, not a number is a special floating-point value that represents an undefined or unrepresentable result. We can use it to represent missing or null values in Pandas. Unfortunately, dirty data sets contain null values under other names (e.g. Unknown, –, and n/a), which makes them difficult to detect and drop.

Example

Let’s say we have the following data set, containing information about houses in the city of Madrid (this data set is reduced for explanatory purposes).

[Figure: data frame with non-standard missing values]

We can easily analyze missing values by using the pandas.DataFrame.info method. This method prints information about the data frame including column types, number of non-null values, and memory usage.

[Figure: output of the info method]

As we can observe, the info method does not detect unexpected null values (Unknown and -). We have to convert those values into null values that Pandas can detect. We can achieve that by using the numpy.nan constant.

Before analysing the data, we have to handle missing values. To do so, there are different approaches: (1) assign missing values manually (in case we know the data), (2) replace missing values with the mean/median value, or (3) delete rows with missing data, among other approaches.

After replacing Unknown and - with standard null values, two missing values are detected, in the columns num_bedrooms and num_balconies. Now, the rows containing those missing values can be easily deleted by using the pandas.DataFrame.dropna method (approach 3).
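The clean-up described above can be sketched as follows. The data frame is a small invented stand-in: only the column names num_bedrooms and num_balconies come from the text, while the price column and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data with non-standard missing-value markers
houses = pd.DataFrame({
    "num_bedrooms": [3, "Unknown", 2, 4],
    "num_balconies": [1, 0, "-", 2],
    "price": [250000, 180000, 210000, 320000],
})

# Convert the non-standard markers into nulls that Pandas can detect
cleaned = houses.replace(["Unknown", "-"], np.nan)
print(cleaned.isna().sum())  # one null in each of the first two columns

# Approach 3: drop every row that contains at least one null value
without_missing = cleaned.dropna()
print(without_missing)
```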

[Figure: data frame before dropping null values]
[Figure: data frame after dropping null values]

8. numpy.argmax

The numpy.argmax(a, axis=None, out=None) function returns the indices of the maximum values along an axis.

In a 2D array, we can easily obtain the index of the maximum value as follows. By default (axis=None), the returned index refers to the flattened array.

We can obtain the indices of the maximum values along a specified axis by passing 0 or 1 to the axis parameter.
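A brief sketch on a small, made-up 2D array shows the default behaviour and both axis options:

```python
import numpy as np

a = np.array([[1, 9, 3],
              [7, 2, 8]])

# Without an axis, the index refers to the flattened array
flat_index = np.argmax(a)
# unravel_index converts it back into a (row, column) pair
row, col = np.unravel_index(flat_index, a.shape)

per_column = np.argmax(a, axis=0)  # row index of the maximum in each column
per_row = np.argmax(a, axis=1)     # column index of the maximum in each row

print(flat_index)           # 1
print(int(row), int(col))   # 0 1
print(per_column)           # [1 0 1]
print(per_row)              # [1 2]
```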

Example

The numpy.argmax function can be very useful in reinforcement learning tasks. The Q-table is an estimate of the action-value function: it contains the expected return for each state-action pair, assuming the agent is in state s, takes action a, and then follows policy π until the end of the episode.

[Figure: Q table]

We can easily obtain the policy by choosing the action a that provides maximum expected return for each state s.

[Figure: policy from the Q table]

In the above example, the numpy.argmax function returns the policy: state 0 → action 0, state 1 → action 2, and state 2 → action 1.
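The original Q-table values are not reproduced here, so this sketch uses invented numbers chosen to yield the same policy:

```python
import numpy as np

# Hypothetical Q-table: one row per state, one column per action
q_table = np.array([
    [0.9, 0.1, 0.3],  # state 0
    [0.2, 0.4, 0.8],  # state 1
    [0.1, 0.7, 0.5],  # state 2
])

# Greedy policy: for every state, pick the action with the highest value
policy = np.argmax(q_table, axis=1)

print(policy)  # [0 2 1]
```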

9. numpy.squeeze

The numpy.squeeze(a, axis=None) function removes axes of length one from the shape of an array. The axis argument specifies the axis we want to squeeze out. If the length of the selected axis is greater than 1, a ValueError is raised. An example of how to use the numpy.squeeze function is shown below.
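A minimal sketch, assuming an array of shape (1, 3, 1):

```python
import numpy as np

a = np.arange(3).reshape(1, 3, 1)

print(a.shape)                      # (1, 3, 1)
print(np.squeeze(a).shape)          # (3,)   all length-one axes removed
print(np.squeeze(a, axis=0).shape)  # (3, 1) only axis 0 removed
print(np.squeeze(a, axis=2).shape)  # (1, 3) only axis 2 removed

# Axis 1 has length 3, so it cannot be squeezed out
try:
    np.squeeze(a, axis=1)
except ValueError as error:
    print("ValueError:", error)
```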

As we can observe, only axes 0 and 2 can be removed, since both have length 1. Axis 1 has 3 elements; therefore, a ValueError is raised.

Example

PyTorch is an open source machine learning library based on the Torch library. The library provides multiple data sets such as MNIST, Fashion-MNIST, or CIFAR that we can use for training neural networks. First, we download the data set (e.g. MNIST) with the torchvision.datasets module. Then, we create an iterable by using torch.utils.data.DataLoader. This iterable is passed to the iter() function, generating an iterator. Finally, we get each element of the iterator by using the next() function. Those elements are tensors of shape [N, C, H, W], where N is the batch size, C the number of channels, H the height of the input planes in pixels, and W the width in pixels.

To visualize an element of the previous batch, we have to eliminate the first axis since the matplotlib.pyplot.imshow function accepts as an input an image of shape (H,W).
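To avoid downloading a data set, the sketch below uses a NumPy array of zeros as a stand-in for one MNIST batch. The batch size of 64 is an assumption; MNIST images have 1 channel and 28×28 pixels.

```python
import numpy as np

# Stand-in for one batch of shape [N, C, H, W]
batch = np.zeros((64, 1, 28, 28), dtype=np.float32)

# Selecting one element leaves the channel axis in place: (1, 28, 28)
first = batch[0]

# Squeeze out the channel axis to obtain an (H, W) image
image = np.squeeze(first, axis=0)

print(first.shape)  # (1, 28, 28)
print(image.shape)  # (28, 28)
```

The resulting two-dimensional array has the (H, W) shape that matplotlib.pyplot.imshow expects.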

[Figure: first image of the batch]

10. numpy.histogram

The numpy.histogram(a, bins=10, range=None, density=None, weights=None) function computes the histogram of a set of data. The function returns 2 values: (1) the frequency counts, and (2) the bin edges.
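A tiny sketch with made-up data shows both return values:

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4]

# Three equal-width bins between the minimum (1) and the maximum (4)
counts, edges = np.histogram(data, bins=3)

print(counts)  # [1 2 4]
print(edges)   # [1. 2. 3. 4.]
```

Note that there is always one more edge than there are bins, and that the last bin also includes its right edge, which is why the value 4 is counted in it.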

Example

The following data frame contains the height of 5000 men. We create a histogram plot, passing kind='hist' to the plot method.

By default, the histogram method breaks up the data set into 10 bins. Notice that the x-axis labels do not match the bin edges. This can be fixed by passing in an xticks parameter containing the list of bin edges, in the following manner:
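The bin edges can be obtained with numpy.histogram. The sketch below uses synthetic heights as a stand-in for the original data frame; the mean of 69 inches and the spread of 3 inches are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the heights (in inches) of 5000 men
heights = rng.normal(69.0, 3.0, size=5000)

# Same default as the plot: 10 bins, hence 11 bin edges
counts, edges = np.histogram(heights, bins=10)

print(counts.sum())  # 5000
print(len(edges))    # 11
print(np.round(edges, 1))
```

The edges array can then be passed as the xticks argument of the plot call, so that the tick labels line up with the bars.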

Thanks for reading!! 🍀 🍀 💪 And use NumPy!

