Speed Up your Algorithms Part 2 — Numba

Get C++/Fortran-like speed for your functions with Numba

Puneet Grover
Towards Data Science
8 min readOct 12, 2018

“brown snake” by Duncan Sanchez on Unsplash

This is the second post in a series I am writing. All posts are here:

  1. Speed Up your Algorithms Part 1 — PyTorch
  2. Speed Up your Algorithms Part 2 — Numba
  3. Speed Up your Algorithms Part 3 — Parallelization
  4. Speed Up your Algorithms Part 4 — Dask

And these go with Jupyter Notebooks available here:

[Github-SpeedUpYourAlgorithms] and [Kaggle]

Index

  1. Introduction
  2. Why Numba?
  3. How does Numba work?
  4. Using basic numba functionalities (Just @jit it!)
  5. The @vectorize wrapper
  6. Running your functions on GPU
  7. Further Reading
  8. References
NOTE:
This post goes with Jupyter Notebook available in my Repo on Github:[SpeedUpYourAlgorithms-Numba]

1. Introduction

Numba is a just-in-time compiler for Python, i.e. whenever you make a call to a Python function, all or part of your code is converted to machine code "just in time" for execution, and it will then run at your native machine code speed! It is sponsored by Anaconda Inc. and has been/is supported by many other organisations.

With Numba, you can speed up all of your calculation-focused and computationally heavy Python functions (e.g. loops). It also has support for the NumPy library! So you can use NumPy in your calculations too, and speed up the overall computation, as loops in Python are very slow. You can also use many of the functions from Python's standard-library math module, like sqrt, etc. For a comprehensive list of all compatible functions, look here.
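
To give a quick taste, here is a minimal sketch (the function name and the random arrays are made up for illustration) that mixes a plain Python loop, the standard-library math module and NumPy inside a single Numba-compiled function:

import math
import numpy as np
from numba import jit

@jit
def hypotenuses(xs, ys):
    # plain Python loop over NumPy arrays, compiled to machine code
    out = np.empty_like(xs)
    for i in range(xs.shape[0]):
        out[i] = math.sqrt(xs[i] ** 2 + ys[i] ** 2)
    return out

xs = np.random.rand(1_000_000)
ys = np.random.rand(1_000_000)
print(hypotenuses(xs, ys)[:5])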

2. Why Numba?

[Source]

So, why Numba, when there are many other compilers like Cython, similar tools, or alternatives such as PyPy?

For a simple reason: here you don't have to leave the comfort zone of writing your code in Python. Yes, you read that right: you don't have to change your code at all to get a basic speedup, which is comparable to the speedup you get from similar Cython code with type definitions. Isn't that great?

You just have to add a familiar python functionality, a decorator (a wrapper) around your functions. A wrapper for a class is also under development.

So, you just have to add a decorator and you are done. For example:

from numba import jit

@jit
def function(x):
    # your loop or numerically intensive computations
    return x

It still looks like a pure python code, doesn’t it?

3. How does numba work?

“question mark neon signage” by Emily Morter on Unsplash

Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. The speed of code run using Numba is comparable to that of similar code in C, C++ or Fortran.

Here is how the code is compiled:

[Source]

First, the Python function is taken, optimized and converted into Numba's intermediate representation; then, after type inference (which works like NumPy's type inference, so a Python float becomes a float64), it is converted into LLVM IR. This code is then fed to LLVM's just-in-time compiler to produce machine code.

You can generate code at runtime or at import time, on CPU (the default) or GPU, as you prefer.
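
A small way to see this pipeline in action (the add function below is just an illustrative example): by default Numba compiles lazily at the first call, and you can then inspect which signatures were inferred. Passing Python floats shows up as float64, as described above.

from numba import jit

@jit(nopython=True)
def add(a, b):
    return a + b

add(1.0, 2.5)            # first call triggers compilation
print(add.signatures)    # e.g. [(float64, float64)]
# add.inspect_types()    # prints the fully type-annotated version of the code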

4. Using basic numba functionalities (Just @jit it!)

Photo by Charles Etoroma on Unsplash

Piece of cake!

For best performance Numba recommends using the nopython=True argument with your jit wrapper, with which it won't use the Python interpreter at all. You can also use @njit, which is equivalent. If your wrapper with nopython=True fails with an error, you can fall back to the plain @jit wrapper, which will compile the parts of your code it can (the loops it is able to handle) into machine code and hand the rest to the Python interpreter.
So, you just have to do:

from numba import njit, jit

@njit      # or @jit(nopython=True)
def function(a, b):
    # your loop or numerically intensive computations
    result = a + b
    return result

When using @jit, make sure your code has something Numba can compile, like a compute-intensive loop, possibly using libraries (NumPy) and functions it supports. Otherwise it won't be able to compile anything.

To put a cherry on top, Numba also caches the compiled machine code after a function's first use. So after the first time it will be even faster, because it doesn't need to compile the code again, given that you are calling it with the same argument types that you used before.

And if your code is parallelizable, you can also pass parallel=True as an argument, but it must be used in conjunction with nopython=True. For now, it only works on the CPU.
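
Here is a minimal sketch of what that can look like, using numba.prange (Numba's parallel range) so loop iterations can be spread across CPU threads; the function name and array size are just illustrative:

import numpy as np
from numba import njit, prange

@njit(parallel=True)       # @njit implies nopython=True
def parallel_sum_of_squares(a):
    total = 0.0
    # prange tells Numba this loop may run in parallel;
    # the reduction on 'total' is handled automatically
    for i in prange(a.shape[0]):
        total += a[i] * a[i]
    return total

a = np.random.rand(10_000_000)
print(parallel_sum_of_squares(a))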

You can also specify the function signature you want your function to have, but then it won't compile for any other types of arguments you give to it. For example:

from numba import jit, int32

@jit(int32(int32, int32))
def function(a, b):
    # your loop or numerically intensive computations
    result = a + b
    return result

# or if you haven't imported the type names,
# you can pass them as a string
@jit('int32(int32, int32)')
def function(a, b):
    # your loop or numerically intensive computations
    result = a + b
    return result

Now your function will only take two int32s and return an int32. This gives you more control over your functions. You can even pass multiple function signatures if you want.
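
For instance, a sketch of passing a list of signatures, so the function is compiled eagerly for both integer and floating-point arguments (anything else will be rejected):

from numba import jit, int32, float64

@jit([int32(int32, int32), float64(float64, float64)])
def add(a, b):
    return a + b

print(add(2, 3))      # uses the int32 version
print(add(2.0, 3.0))  # uses the float64 version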

You can also use other wrappers provided by numba:

  1. @vectorize: allows functions that take scalar arguments to be used as NumPy ufuncs,
  2. @guvectorize: produces NumPy generalized ufuncs (see the sketch after this list),
  3. @stencil: declare a function as a kernel for a stencil-like operation,
  4. @jitclass: for jit aware classes,
  5. @cfunc: declare a function for use as a native call back (to be called from C/C++ etc),
  6. @overload: register your own implementation of a function for use in nopython mode, e.g. @overload(scipy.special.j0).
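
As a small illustration of the second item, here is a hedged sketch of a generalized ufunc (the name cumulative_mean is made up) that works row by row on a 2-D array; the '(n)->(n)' layout string says it maps a 1-D input of length n to a 1-D output of the same length:

import numpy as np
from numba import guvectorize, float64

@guvectorize([(float64[:], float64[:])], '(n)->(n)')
def cumulative_mean(row, out):
    # running mean along the last axis of the input
    total = 0.0
    for i in range(row.shape[0]):
        total += row[i]
        out[i] = total / (i + 1)

a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(cumulative_mean(a))   # applied to each row, like any NumPy gufunc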

Numba also has ahead-of-time (AOT) compilation, which produces a compiled extension module that does not depend on Numba. But:

  1. It allows only regular functions (not ufuncs),
  2. You have to specify a function signature, and you can only specify one per exported function; to support more types, export the function under different names.

It also produces generic code for your CPU’s architectural family.
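
A rough sketch of what AOT compilation looks like with numba.pycc (note that this module has been deprecated in recent Numba releases, so check the docs for your version; the module and function names here are made up):

# compile_module.py
from numba.pycc import CC

cc = CC('fast_ops')               # name of the generated extension module

@cc.export('square', 'f8(f8)')    # one exported name per signature
def square(x):
    return x * x

if __name__ == '__main__':
    cc.compile()                  # writes the fast_ops extension next to this file

# afterwards, from any interpreter, without importing numba:
#   import fast_ops; fast_ops.square(3.0)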

5. The @vectorize wrapper

“gray solar panel lot” by American Public Power Association on Unsplash

By using the @vectorize wrapper you can convert functions which operate only on scalars, for example if you are using Python's math module, which only works on scalars, so that they work on arrays. This gives speeds similar to NumPy array operations (ufuncs). For example:

from numba import vectorize

@vectorize
def func(a, b):
    # some operation on scalars
    result = a + b
    return result
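
For example, a small sketch (the function name gaussian is made up) that takes a scalar-only function from the math module and turns it into something that broadcasts over whole NumPy arrays:

import math
import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def gaussian(x, sigma):
    # math.exp only accepts scalars, but the resulting ufunc accepts arrays
    return math.exp(-x * x / (2.0 * sigma * sigma))

x = np.linspace(-3.0, 3.0, 1_000_000)
print(gaussian(x, 1.0)[:3])   # the scalar sigma is broadcast, like any ufunc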

You can also pass a target argument to this wrapper, which can be "parallel" to parallelize the code or "cuda" to run it on the GPU.

from numba import vectorize, float64

# the "parallel" and "cuda" targets need an explicit signature
@vectorize([float64(float64, float64)], target="parallel")
def func(a, b):
    # some operation on scalars
    result = a + b
    return result

Vectorizing with target="parallel" or "cuda" will generally run faster than the NumPy implementation, provided your code is sufficiently compute-intensive or the array is sufficiently large. If not, it comes with the overhead of creating threads and splitting elements between them, which can be larger than the actual compute time for the whole process. So, the work should be sufficiently heavy to get a speedup.

This great video has an example of speeding up the Navier-Stokes equations for computational fluid dynamics with Numba:

6. Running your functions on GPU

“time-lapsed of street lights” by Marc Sendra martorell on Unsplash

You can also use @jit-like wrappers to run functions on CUDA/GPUs. For that, you will have to import cuda from the numba library. But running your code on the GPU is not going to be as easy as before. There are some initial computations that need to be done to run a function on hundreds or even thousands of GPU threads. You have to declare and manage a hierarchy of grids, blocks and threads. But it's not that hard.

To execute a function on the GPU, you have to define either something called a kernel function or a device function. First let's look at a kernel function.

Some points to remember about kernel functions:

a) kernels explicitly declare their thread hierarchy when called, i.e. the number of blocks and number of threads per block. You can compile your kernel once, and call it multiple times with different block and grid sizes.

b) kernels cannot return a value. So either you have to make the changes on the original array, or pass another array for storing the result. For computing a scalar, you will have to pass a one-element array.

# Defining a kernel function
from numba import cuda

@cuda.jit
def func(a, result):
    # some CUDA-related computation, then
    # your computationally intensive code
    # (your answer is stored in 'result')
    ...

So for launching a kernel, you will have to pass two things:

  1. Number of threads per block,
  2. Number of blocks.

For example:

threadsperblock = 32
blockspergrid = (array.size + (threadsperblock - 1)) // threadsperblock
func[blockspergrid, threadsperblock](array, result)   # 'result' is an output array of the same size

Each thread running the kernel has to know which thread it is, so that it knows which elements of the array it is responsible for. Numba makes it easy to get these element positions with just one call.

@cuda.jit
def func(a, result):
    pos = cuda.grid(1)      # for a 1D array
    # x, y = cuda.grid(2)   # for a 2D array
    if pos < a.shape[0]:
        result[pos] = a[pos] * 2   # some computation on a[pos]

To save the time that would be wasted copying a NumPy array to the device and then copying the result back into a NumPy array, Numba provides functions to declare and send arrays to a specific device, such as numba.cuda.device_array, numba.cuda.device_array_like and numba.cuda.to_device, avoiding needless copies back to the CPU (unless necessary).
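
A minimal sketch of that pattern, reusing the illustrative func kernel from above: copy the input to the device once, allocate the output directly on the device, and only copy the result back at the end:

import numpy as np
from numba import cuda

array = np.arange(1_000_000, dtype=np.float64)

d_array = cuda.to_device(array)              # host -> device copy
d_result = cuda.device_array_like(array)     # allocated on the device, no copy

threadsperblock = 32
blockspergrid = (array.size + (threadsperblock - 1)) // threadsperblock
func[blockspergrid, threadsperblock](d_array, d_result)

result = d_result.copy_to_host()             # device -> host only when needed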

On the other hand, a device function can only be invoked from inside the device (by a kernel or another device function). The plus point is that a device function can return a value, so you can use that return value to compute something inside a kernel function or another device function.

from numba import cuda

@cuda.jit(device=True)
def device_function(a, b):
    return a + b
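
For instance, a hedged sketch of a kernel that calls such a device function for each element (the kernel name add_one_kernel is made up; the device function is repeated so the snippet stands alone):

from numba import cuda

@cuda.jit(device=True)
def device_function(a, b):
    return a + b

@cuda.jit
def add_one_kernel(a, result):
    pos = cuda.grid(1)
    if pos < a.shape[0]:
        # the device function's return value is used inside the kernel
        result[pos] = device_function(a[pos], 1.0)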

You should also look into supported functionality of Numba’s cuda library, here.

Numba also has implementations of atomic operations, random number generators, shared memory (to speed up access to data), etc. within its cuda library.
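
As one concrete example of those extras, here is a hedged sketch of a histogram kernel that uses cuda.atomic.add so threads updating the same bin don't race each other (the names and sizes are illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def histogram_kernel(values, nbins, hist):
    pos = cuda.grid(1)
    if pos < values.shape[0]:
        # values are assumed to lie in [0, 1); map each one to a bin
        bin_idx = int(values[pos] * nbins)
        if bin_idx < nbins:
            # atomic add avoids lost updates when threads hit the same bin
            cuda.atomic.add(hist, bin_idx, 1)

values = np.random.rand(1_000_000)
hist = np.zeros(32, dtype=np.int32)
threadsperblock = 128
blockspergrid = (values.size + threadsperblock - 1) // threadsperblock
histogram_kernel[blockspergrid, threadsperblock](values, 32, hist)
print(hist.sum())   # should equal the number of input values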

ctypes/cffi/cython interoperability:

  • cffi — Calling CFFI functions is supported in nopython mode.
  • ctypes — Calling ctypes wrapper functions is supported in nopython mode (see the sketch after this list).
  • Cython exported functions are callable.
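
As a quick illustration of the ctypes point, here is a sketch (assuming a Linux system where the C math library is available as libm.so.6) of calling a C function from inside a nopython-mode function:

import ctypes
from numba import njit

# load the C math library and describe the C signature of cos()
libm = ctypes.CDLL("libm.so.6")          # the library path is platform specific
c_cos = libm.cos
c_cos.restype = ctypes.c_double
c_cos.argtypes = (ctypes.c_double,)

@njit
def call_c_cos(x):
    # the ctypes wrapper is callable directly in nopython mode
    return c_cos(x)

print(call_c_cos(0.0))   # 1.0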
