NumPy is one of the most widely used scientific computing libraries for Python. It serves as a basis for many other libraries such as Pandas.
NumPy makes it simple and fast to operate on large arrays of numbers. Since we are likely to have lots of data, having an efficient tool like NumPy is of great importance.
In this article, we will go over 5 operations that are essential in the analysis of large arrays. These operations provide some statistics and characteristics of arrays.
1. Count_nonzero
The name is quite descriptive. It counts the number of non-zero elements in an array. There are many ways to do this operation but the count_nonzero function might be the simplest one.
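To make the "many ways" concrete, here is a minimal sketch comparing count_nonzero with two alternatives. The array `sample` and the variable names are made up for illustration and are not part of the example that follows.

```python
import numpy as np

# Hypothetical small array, used only for illustration
sample = np.array([0, 3, 0, 5, 7, 0])

n_builtin = np.count_nonzero(sample)      # the dedicated function
n_mask = int((sample != 0).sum())         # sum a boolean mask (True counts as 1)
n_indices = len(np.flatnonzero(sample))   # count the indices of non-zero entries

print(n_builtin, n_mask, n_indices)       # 3 3 3

# count_nonzero also works on a boolean condition: it counts the True entries
n_greater_than_3 = np.count_nonzero(sample > 3)
print(n_greater_than_3)                   # 2
```

All three approaches give the same count; count_nonzero simply states the intent most directly.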
Let’s create an array of 10000 random integers between 0 and 4 (randint excludes the upper bound, so 5 itself never appears). We can then use our straightforward function to count the non-zero elements.
import numpy as np
arr = np.random.randint(5, size=10000)
np.count_nonzero(arr)
8033
We now know that there are 8033 non-zero numbers in this array. What if we need their indices? The answer is the next operation.
2. Argwhere
The argwhere function returns the indices of the non-zero elements in an array.
nonzero = np.argwhere(arr)
len(nonzero)
8033
We can confirm the result by checking some of the values in the array.
arr[:8]
array([4, 2, 1, 4, 3, 0, 3, 2])
nonzero[:8]
array([[0],
[1],
[2],
[3],
[4],
[6],
[7],
[8]])
As you can see, the index 5 is skipped because the value at that index is zero.
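The output above has one row per non-zero element, which is why each index appears in its own brackets. The sketch below, using a small made-up array `sample` and a made-up 2-D array `grid`, shows the shape of the result and one way to get plain 1-D indices instead (np.flatnonzero).

```python
import numpy as np

# Hypothetical arrays, used only for illustration
sample = np.array([4, 2, 0, 1, 0, 3])

idx_2d = np.argwhere(sample)        # shape (4, 1): one row per non-zero element
idx_flat = np.flatnonzero(sample)   # shape (4,): plain 1-D indices

print(idx_2d.shape)   # (4, 1)
print(idx_flat)       # [0 1 3 5]

# On a 2-D array, each row of the result is a (row, column) pair
grid = np.array([[0, 7],
                 [5, 0]])
pairs = np.argwhere(grid)
print(pairs.tolist())  # [[0, 1], [1, 0]]
```

The row-per-element layout is most useful for multi-dimensional arrays; for a 1-D array the flat form is often easier to work with.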
3. Argmin and argmax
These functions are used to find the index of the minimum or maximum value in an array.
Let’s create a smaller array this time and apply the argmin and argmax functions.
arr2 = np.array([4,3,1,6,1,2,6])
np.argmin(arr2)
2
np.argmax(arr2)
3
The minimum value is 1, so argmin returns the index of the first occurrence of 1. Similarly, argmax returns the index of the first occurrence of the maximum value.
However, these minimum and maximum values occur more than once in the array. If we need to find the indices of all occurrences of these values, we can use the where function of NumPy.
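As a side note on the first-occurrence behaviour described above, the following sketch checks it on the same values and also shows the axis parameter on a small 2-D array. The array `grid` and the variable names are made up for illustration.

```python
import numpy as np

arr2 = np.array([4, 3, 1, 6, 1, 2, 6])

# With ties, only the first matching index is returned
first_min = int(np.argmin(arr2))
first_max = int(np.argmax(arr2))
print(first_min, first_max)   # 2 3

# Hypothetical 2-D array: axis=1 gives the column of the maximum in each row
grid = np.array([[4, 9, 1],
                 [7, 2, 8]])
col_of_row_max = np.argmax(grid, axis=1)
print(col_of_row_max)         # [1 2]
```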
4. Where
The where function can be used to find the indices of values that fit the specified condition.
In case of minimum and maximum values, we can set the condition as being equal to these values.
np.where(arr2 == arr2.min())
(array([2, 4]),)
np.where(arr2 == arr2.max())
(array([3, 6]),)
The returned arrays contain the indices of all the occurrences of the minimum and maximum values.
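Note the extra parentheses and trailing comma in the outputs above: called with only a condition, where returns a tuple with one index array per dimension. A minimal sketch of unpacking that tuple and using the indices (the variable names are made up for illustration):

```python
import numpy as np

arr2 = np.array([4, 3, 1, 6, 1, 2, 6])

result = np.where(arr2 == arr2.min())
print(type(result).__name__, len(result))  # tuple 1

# Take the first (and only) element to get the index array itself
min_idx = result[0]
print(min_idx)         # [2 4]

# The indices can be used directly to select the matching values
print(arr2[min_idx])   # [1 1]
```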
The where function can also build a new array whose values depend on a condition; the original array is left unchanged. Let’s do an example.
arr2 = np.array([4,3,1,6,1,2,6])
np.where(arr2 > 3, 1, 0)
array([1, 0, 0, 1, 0, 0, 1])
In the returned array, the positions where the value is greater than 3 hold 1 and all other positions hold 0. Thus, the second parameter is the value used where the condition is true, and the third parameter is the value used where it is false.
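The second and third parameters do not have to be constants. As a minimal sketch (the names `flags` and `capped` are made up for illustration), passing the array itself as the third parameter keeps the original values wherever the condition is false:

```python
import numpy as np

arr2 = np.array([4, 3, 1, 6, 1, 2, 6])

flags = np.where(arr2 > 3, 1, 0)
print(flags)   # [1 0 0 1 0 0 1]
print(arr2)    # [4 3 1 6 1 2 6]  (the input array is not modified)

# Cap values above 3 at 3 and keep everything else as it was
capped = np.where(arr2 > 3, 3, arr2)
print(capped)  # [3 3 1 3 1 2 3]
```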
5. Argsort and sort
Both of these functions can be used to sort an array.
- Argsort returns the indices that would sort the array.
- Sort returns a sorted copy of the array's values.
arr2 = np.array([4,3,1,6,1,2,6])
np.sort(arr2)
array([1, 1, 2, 3, 4, 6, 6])
np.argsort(arr2)
array([2, 4, 5, 1, 0, 3, 6])
The sort function returns the values in ascending order. Comparing the two outputs, argsort returns the positions of these sorted values in the original array.
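The relationship between the two functions can be checked directly: indexing the array with the argsort result gives the same output as sort. The sketch below also shows one common way to get descending order by reversing the result. The `kind="stable"` argument is my addition so that tied values keep their original relative order; the variable names are made up for illustration.

```python
import numpy as np

arr2 = np.array([4, 3, 1, 6, 1, 2, 6])

# A stable sort keeps equal values in their original relative order
order = np.argsort(arr2, kind="stable")
print(order)         # [2 4 5 1 0 3 6]

# Indexing with the argsort result reproduces the sorted array
print(arr2[order])   # [1 1 2 3 4 6 6]

# Descending order: reverse the ascending result
descending = np.sort(arr2)[::-1]
print(descending)    # [6 6 4 3 2 1 1]

# Indices of the three largest values, largest first
top3 = order[::-1][:3]
print(arr2[top3])    # [6 6 4]
```

This is why argsort is handy when a second array has to be reordered in the same way as the first.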
We have covered only a small part of the NumPy operations on arrays. However, these are operations that you are likely to use in a typical data analysis and manipulation process.
NumPy is a very flexible and efficient scientific computing library. It serves as a base for many Python libraries such as Pandas. Thus, it is a fundamental tool to learn for aspiring data scientists.
Thank you for reading. Please let me know if you have any feedback.