
Why NumPy Is So Fundamental

We could not do without it

Photo by Piron Guillaume on Unsplash

NumPy. This library has become so fundamental that it is hard to imagine a world without it, or what things were like before its birth. NumPy has been around since 2005, and if you have ever worked with data in Python, you must have used it, one way or another.

The library recently received the recognition it deserves. A paper¹ was published by a group of researchers, many of them active in maintaining and developing NumPy, in the renowned scientific journal Nature. In this paper, the authors review NumPy’s history, its key features and its importance in today’s scientific community, and discuss its future.

What Makes NumPy So Good?

NumPy has a syntax which is simultaneously compact, powerful and expressive. It allows users to manage data in vectors, matrices and higher dimensional arrays. Within those data structures, it allows users to:

  • Access,
  • Manipulate,
  • Compute

Do you recall this picture?

Photo of the researcher Katie Bouman on PBS

The effort to obtain the first image of a black hole was made possible not only by the hard work and dedication of a team of researchers, but also by the support of NumPy².

The NumPy project has formal governance structures¹. The team adopted distributed revision control and code review early on to improve collaboration. This in turn meant a clear roadmap and a clear process to discuss large changes. Those may not seem like much, but they help inspire confidence and trust in users. It is an assurance that every change is carefully thought through and that the library will be maintained over time.

The project is sponsored by NumFOCUS¹, a nonprofit promoting an open-source mindset in research. In recent years it also obtained funding from multiple foundations. This in turn allows the project to maintain focus on improvements and the development of new features.

Before NumPy, two main array packages were present in the scientific Python world: Numeric, developed in the mid-90s, and Numarray, which followed later.

  • Numarray was an array processing package designed to efficiently manipulate large multi-dimensional arrays.
  • Numeric was efficient for small-array handling and had a rich C API.

But over time they grew apart. This is when NumPy surfaced in 2005 as the best of both worlds¹. It was initially developed by students, faculty and researchers to bring to life an advanced open-source array library for Python, free to use and free from license servers and software protection. It gained popularity, a massive following and enthusiastic volunteers. The rest is history.

Fast forward 15 years, and NumPy has become essential. Today it underpins most Python libraries doing scientific or numerical computation, including SciPy, Matplotlib, pandas, scikit-learn and scikit-image. The authors of the paper created the following visual; a picture is worth a thousand words:

Visual by Harris, C.R., Millman, K.J., van der Walt, S.J. et al. on Nature "Array programming with NumPy"

NumPy Foundations

Let’s first look at NumPy’s core data structure, the array. NumPy is a library that makes the handling of arrays easy. But what is an array?

In computer science, an array is a data structure. This data structure contains elements of the same data type. The data types can be numbers (integer, float, …), strings (text, anything stored between quotes " "), timestamps (dates) or pointers to Python objects. Arrays are useful because they help organize data. With this structure, elements can easily be sorted or searched.
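As a minimal sketch of the "same data type" idea, the snippet below builds a few small arrays and inspects their `dtype` attribute. The variable names are only illustrative, and the integer type is set explicitly because the default integer size can vary by platform.

```python
import numpy as np

# Every element of a NumPy array shares one data type (dtype).
ints = np.array([1, 2, 3], dtype=np.int64)                # integers
floats = np.array([1.0, 2.5, 3.0])                        # floating point
texts = np.array(["a", "bc"])                             # Unicode strings
dates = np.array(["2020-09-16"], dtype="datetime64[D]")   # dates

print(ints.dtype)        # int64
print(floats.dtype)      # float64
print(texts.dtype.kind)  # U (Unicode string)
print(dates.dtype)       # datetime64[D]

# Mixing integers and floats yields a single common type.
mixed = np.array([1, 2.5])
print(mixed.dtype)       # float64
```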

Shape

Every array has a shape. For a two-dimensional array, the shape is defined by (n,m), with n the number of rows and m the number of columns.

# This array is of shape (2,2)
[1,2
 3,4]
# This array is of shape (4,2)
[1,2
 3,4
 5,6
 7,8]
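The two arrays above can be written as runnable NumPy code. This is a small sketch using `np.array` and the `shape` attribute; the names `a` and `b` are only illustrative.

```python
import numpy as np

# Two rows, two columns
a = np.array([[1, 2],
              [3, 4]])

# Four rows, two columns
b = np.array([[1, 2],
              [3, 4],
              [5, 6],
              [7, 8]])

print(a.shape)  # (2, 2)
print(b.shape)  # (4, 2)
```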

If you think the above structures look like matrices, you would be right. In mathematics, each of them is a matrix. When working with NumPy, the terms matrix and array are often used interchangeably, which can lead to confusion.

A one-dimensional array is also called a vector:

# Horizontal, a row vector
[1,2,3]
# Vertical, a column vector
[1,
 2,
 3]

Lastly, a single value is called a scalar:

# Scalar
7
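In NumPy itself, these cases differ in their number of dimensions. The sketch below (with illustrative variable names) shows that a plain 1D array has one axis, a column vector is stored as a two-dimensional array with a single column, and a true scalar has no axes at all, while a one-element array such as `np.array([7])` still has one dimension.

```python
import numpy as np

row = np.array([1, 2, 3])     # plain 1D array
col = row.reshape(3, 1)       # column vector: 3 rows, 1 column
scalar = np.array(7)          # zero-dimensional array
one_elem = np.array([7])      # one-element 1D array

print(row.shape, row.ndim)            # (3,) 1
print(col.shape, col.ndim)            # (3, 1) 2
print(scalar.shape, scalar.ndim)      # () 0
print(one_elem.shape, one_elem.ndim)  # (1,) 1
```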

Strides

Strides help us interpret computer memory. The elements of a multidimensional array are stored linearly in memory, and strides describe the number of bytes to step forward in memory to reach the next element along each dimension, for example the next row or the next column. Let’s take this array of shape (3,3), where each element takes 8 bytes in memory.

# Array of shape (3,3)
[1,2,3,
 4,5,6,
 7,8,9]

  • To move to another column: it’s enough to move by 8 bytes, the amount of memory taken by one element in our array.
  • To move to another row: we start on the first row, first column, at 1. To reach the second row, we need to jump three times in memory. We jump to 2, then 3, we reach 4. This represents 24 bytes (3 jumps * 8 bytes per element).

To understand this concept, the array can also be visualized as a one-dimensional array. NB: The extra spaces are for reading purposes, to help visualize where the different rows were.

[1,2,3,  4,5,6,  7,8,9]
# Moving from 5 to 7 would require 2 jumps in memory, so 16 bytes.
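The byte counts above can be checked with the `strides` attribute. This is a minimal sketch that assumes 8-byte integers (`int64`) stored row by row, which is NumPy's default layout; the helper `byte_offset` is hypothetical and only there to illustrate the arithmetic.

```python
import numpy as np

# 3x3 array of 8-byte integers, stored row by row
x = np.arange(1, 10, dtype=np.int64).reshape(3, 3)

print(x.itemsize)  # 8 bytes per element
print(x.strides)   # (24, 8): 24 bytes to the next row, 8 to the next column


def byte_offset(arr, row, col):
    """Hypothetical helper: byte position of arr[row, col] from the start."""
    return row * arr.strides[0] + col * arr.strides[1]


# Moving from 5 (row 1, column 1) to 7 (row 2, column 0)
jump = byte_offset(x, 2, 0) - byte_offset(x, 1, 1)
print(jump)  # 16

# Transposing swaps the strides instead of moving any data
print(x.T.strides)  # (8, 24)
```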

Operations Using NumPy

Multiple operations and manipulations can be performed on NumPy arrays:

  • Indexing
1. Array x, of shape (3,3)
x = [1,2,3,
     4,5,6,
     7,8,9]
2. Indexing the array to return '6'
#IN
x[1,2]
#OUT
6

Indexing of NumPy arrays starts at 0. In x[1,2], 1 stands for the row and 2 for the column. Because counting starts at 0, we mentally add 1 to each index: we reach for the second row and the third column. The element at this index is 6.
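Here is the same indexing written as runnable code, as a small sketch. It also shows where the element 5 sits, at row index 1 and column index 1.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(x[1, 2])  # 6: second row, third column
print(x[1, 1])  # 5: second row, second column
print(x[0, 0])  # 1: first row, first column
```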

  • Slicing
1. Using the original x-array from above, we perform a slice
#IN
x[:,1:]
#OUT
[[2 3]
 [5 6]
 [8 9]]
  • Copying

Beware, slicing creates a view, not a copy. As explained by NumPy’s official documentation³:

NumPy slicing creates a view instead of a copy as in the case of builtin Python sequences such as string, tuple and list. Care must be taken when extracting a small portion from a large array which becomes useless after the extraction, because the small portion extracted contains a reference to the large original array whose memory will not be released until all arrays derived from it are garbage-collected. In such cases an explicit copy() is recommended.

1. Using the original x-array from above, we perform a slice then copy it
#IN
y = x[:, 1:].copy()
#OUT
[[2 3]
 [5 6]
 [8 9]]
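The difference between a view and a copy becomes visible as soon as one of them is modified. The sketch below uses `np.shares_memory` to check whether two arrays overlap in memory; the names `view` and `independent` are only illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

view = x[:, 1:]                # a view: shares memory with x
independent = x[:, 1:].copy()  # a copy: owns its own memory

# Writing through the view changes the original array
view[0, 0] = 99
print(x[0, 1])            # 99
print(independent[0, 0])  # 2, the copy is unaffected

print(np.shares_memory(x, view))         # True
print(np.shares_memory(x, independent))  # False
```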
  • Vectorization
1. Using the original x-array, we perform an addition
x = [1,2,3,
     4,5,6,
     7,8,9]
y = [1,1,1,
     1,1,1,
     1,1,1]
#IN
z = x + y
#OUT
z = [2,3,4,
     5,6,7,
     8,9,10]

With this addition of two arrays of the same shape, the result is intuitive: the elements are added pairwise. The beauty is that it happens in one line, without writing any loop.
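To make the "one line" point concrete, the sketch below compares the vectorized addition with an explicit Python loop that computes the same result. The loop version is only there for comparison.

```python
import numpy as np

x = np.arange(1, 10).reshape(3, 3)
y = np.ones_like(x)

# Vectorized: one expression, no explicit loop
z = x + y

# Equivalent element-by-element loop, for comparison only
z_loop = np.empty_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        z_loop[i, j] = x[i, j] + y[i, j]

print(z)
print(np.array_equal(z, z_loop))  # True
```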

  • Broadcasting

This happens when dealing with arrays of different shapes. The arrays are ‘broadcast’ against each other: dimensions of size 1 are stretched so that the shapes match. The result can be an array with a shape different from both initial arrays, as shown below with the example of a multiplication:

1. Using a column array s of shape (3,1) and a 1D array t of shape (2,), we perform a multiplication
s = [0,
     1,
     2]
t = [1,2]
#IN
u = s*t
#OUT
[0,0,
 1,2,
 2,4]
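For this multiplication to work in NumPy, `s` has to be a column of shape (3,1); two plain 1D arrays of lengths 3 and 2 would raise an error. A minimal sketch:

```python
import numpy as np

s = np.array([[0],
              [1],
              [2]])      # shape (3, 1)
t = np.array([1, 2])     # shape (2,)

u = s * t                # broadcast to shape (3, 2)

print(s.shape, t.shape, u.shape)  # (3, 1) (2,) (3, 2)
print(u)
# [[0 0]
#  [1 2]
#  [2 4]]
```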
  • Reduction
1. Array x, of shape (3,3)
x = [1,2,3,
     4,5,6,
     7,8,9]
2. We reduce each row to its sum, which leaves one column of values:
#IN
y = np.sum(x, axis=1)
#OUT
array([6, 15, 24])
3. We can also reduce our new y array to a single value:
#IN
z = np.sum(y, axis=0)
#OUT
45

We used np.sum() on our arrays to reduce them. The axis argument defines the axis along which the operation takes place: axis=0 sums down the rows, producing one value per column, and axis=1 sums across the columns, producing one value per row.
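The effect of the `axis` argument is easiest to see side by side. This sketch runs the reductions from above and adds the `axis=0` case and the full reduction for comparison.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row_sums = np.sum(x, axis=1)  # one value per row
col_sums = np.sum(x, axis=0)  # one value per column
total = np.sum(x)             # no axis: reduce over everything

print(row_sums)  # [ 6 15 24]
print(col_sums)  # [12 15 18]
print(total)     # 45
```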

NumPy Current Shortcomings

Because of its in-memory data model, NumPy isn’t capable of directly leveraging accelerator hardware such as graphics processing units (GPUs), tensor processing units (TPUs) and field-programmable gate arrays (FPGAs).

This led to the emergence of new array frameworks such as PyTorch and TensorFlow. They can do distributed training on CPUs and GPUs.

NumPy’s Future

With formal governance, a roadmap and thorough discussions for large changes, the operational side of NumPy looks bright. The team has implemented best practices of software development, such as distributed revision control and code review, to improve collaboration.

Funding is always the tricky part. While the project has secured funding from different foundations, this also creates a dependency on the funders. Fortunately, a solid community of enthusiasts has been around since the early days, when proper funding was not a thing. As reported by the paper’s authors¹, development still depends mostly on contributions made by graduate students and researchers in their free time.

The authors discuss¹ the several challenges which NumPy will be facing over the next decade:

1. New devices will be developed, and existing specialized hardware will evolve to meet diminishing returns on Moore’s law.

2. The scale of scientific data gathering will continue to increase, with the adoption of devices and instruments such as light-sheet microscopes and the Large Synoptic Survey Telescope (LSST).

3. New generation languages, interpreters and compilers, such as Rust, Julia and LLVM, will create new concepts and data structures, and determine their viability.

Last but not least, NumPy will need a new generation of volunteers and contributors to help it move forward. Maybe you?

How Can You Help?

It’s possible to become a contributor to NumPy. Being a coder is not a requirement, as the NumPy project has evolved and is multi-faceted. The team is looking for help in these areas:

  • Code maintenance and development
  • Community coordination
  • DevOps
  • Developing educational content & narrative documentation
  • Fundraising
  • Marketing
  • Project management
  • Translating content
  • Website design and development
  • Writing technical documentation

More information can be found here

Going Further With NumPy

If you want to dive deeper, I found these resources interesting:

  • Broadcasting

Broadcasting – NumPy v1.19 Manual

  • Advanced lectures on NumPy with code snippets

2.2. Advanced NumPy – Scipy lecture notes

Final Words

With NumPy having such a critical place within the scientific and Data Science community, we should be grateful for the hard work put together over the years. The project has challenges ahead, but also has a solid community and leverages best practices to stay organized and deliver. Long live NumPy!

Happy Coding!




References

[1] Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array Programming with NumPy. Nature 585, 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2

[2] NumPy Team (checked in Sept. 2020), "CASE STUDY: THE FIRST IMAGE OF A BLACK HOLE", https://numpy.org/case-studies/blackhole-image/

[3] NumPy Team (checked in Sept. 2020), "Indexing", https://numpy.org/doc/stable/reference/arrays.indexing.html

