Predefined sparsity

Sparse Matrices in Pytorch

Part 1: CPU runtimes

Sourya Dey
Towards Data Science
8 min read · Jun 27, 2019


This is part 1 of a series of articles which will analyze execution times of sparse matrices and their dense counterparts in Pytorch. Part 1 deals with CPU execution times, while part 2 extends to GPUs. Let me first give a quick introduction to concepts before diving into the meat.

Pytorch is a library for deep learning written in the Python programming language. Deep learning is a branch of science which has gained a lot of prominence in recent years because it powers ‘smart’ technologies such as self-driving cars, speech recognition, and so on. At the core of deep learning lies a lot of matrix multiplication, which is time-consuming and is the major reason why deep learning systems need significant amounts of computational power to become good. Not surprisingly, a key area of research is simplifying these systems so that they can be quickly deployed. One way to simplify them is by making the matrices sparse, so that most of their elements are 0s and can be ignored when doing math. For example, here’s a sparse matrix which we’ll call S:

S = [[0, 0, 1, 2],
     [1, 0, 0, 0],
     [0, 0, 0, 3]]

You might be wondering where and how such matrices occur. Matrices are generally used to depict interactions between entities. For example, the rows of S might indicate different people and the columns different places, with the numbers indicating how many times each person visited each place in the last week. Having several 0s makes sense, since each person visited only a place or two. The density of a sparse matrix is its fraction of non-zero elements, such as 1/3 in S (4 non-zero values out of 12 elements). Now the question is, is there a better way to store sparse matrices so as to avoid all the 0s?

There are several sparse formats; the one Pytorch uses is called the COOrdinate format. It stores the indices, values, size, and number of non-zero elements (nnz) of a sparse matrix. Here’s one way to construct S in Pytorch (outputs follow each statement, and comments start with #):

S = torch.sparse_coo_tensor(indices = torch.tensor([[0,0,1,2],[2,3,0,3]]), values = torch.tensor([1,2,1,3]), size=[3,4])
#indices has x and y values separately along the 2 rows
print(S)
tensor(indices=tensor([[0, 0, 1, 2],
                       [2, 3, 0, 3]]),
       values=tensor([1, 2, 1, 3]),
       size=(3, 4), nnz=4, layout=torch.sparse_coo)
print(S.to_dense()) #this shows S in the regular (dense) format
tensor([[0, 0, 1, 2],
        [1, 0, 0, 0],
        [0, 0, 0, 3]])
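
To see what the COO format stores, you can read the components back off the tensor. This is a small illustrative addition of my own; note that on recent Pytorch versions indices() and values() require the tensor to be coalesced first.

Sc = S.coalesce() #coalesce sorts the indices and sums any duplicate entries
print(Sc.indices())
tensor([[0, 0, 1, 2],
        [2, 3, 0, 3]])
print(Sc.values())
tensor([1, 2, 1, 3])
print(Sc.size())
torch.Size([3, 4])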

Pytorch has the torch.sparse API for dealing with sparse matrices. It includes sparse counterparts of regular functions, such as mm for multiplying a sparse matrix by a dense matrix:

D = torch.ones(4,2, dtype=torch.int64) #note the shape: S is 3x4, so D must have 4 rows
torch.sparse.mm(S,D) #sparse by dense multiplication
tensor([[3, 3],
        [1, 1],
        [3, 3]])
torch.mm(S.to_dense(),D) #dense by dense multiplication
tensor([[3, 3],
        [1, 1],
        [3, 3]])
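
As a quick sanity check (my own addition, not from the original article), the two results can be compared directly:

torch.equal(torch.sparse.mm(S,D), torch.mm(S.to_dense(),D))
True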

Now we come to the meat of this article. Does using sparse matrices and functions save time in Pytorch? In other words, how good is the torch.sparse API? The answer depends on a) matrix size, and b) density. The CPU I used to measure runtimes is in my mid-2014 MacBook Pro, which has a 2.2 GHz Intel Core i7 processor and 16 GB of RAM. So, let’s dive in!

Both size and density varying

A diagonal matrix is sparse since it contains non-zero elements only along the diagonal. The density will always be 1/n, where n is the number of rows (or columns). Here are my 2 experimental cases:

  • Sparse: Diagonal matrix in the sparse format multiplied by a dense square matrix
  • Dense: The same diagonal matrix converted to dense format using to_dense(), then multiplied by the same dense square matrix

All elements are drawn from a standard normal distribution, generated using torch.randn. Here are the runtimes as n varies over powers of 2:

Left: Complete size range from 2²=4 to 2¹³=8192. Right: Zoomed in on the x-axis up to 2⁸=256
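
For reference, here is a minimal sketch of the kind of timing loop behind these plots. The time_mm helper, the use of time.time, and the number of repetitions are my own assumptions rather than the article’s exact benchmark code:

import time
import torch

def time_mm(fn, reps=10):
    #average wall-clock time of a matrix multiplication over a few repetitions
    start = time.time()
    for _ in range(reps):
        fn()
    return (time.time() - start) / reps

for p in range(2, 14): #n = 2**2 up to 2**13
    n = 2**p
    idx = torch.arange(n)
    S = torch.sparse_coo_tensor(indices=torch.stack([idx, idx]),
                                values=torch.randn(n), size=[n, n]) #sparse diagonal matrix
    D = S.to_dense() #the same diagonal matrix in dense format
    M = torch.randn(n, n) #dense square matrix to multiply with
    print(n, time_mm(lambda: torch.sparse.mm(S, M)), time_mm(lambda: torch.mm(D, M)))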

Computation time for the dense case grows roughly on the order of O(n³). This shouldn’t come as a surprise since matrix multiplication is O(n³). Calculating the order of growth for the sparse case is trickier since we are multiplying 2 matrices whose numbers of elements grow at different rates. Every time n doubles, the number of non-zero elements quadruples for the dense matrix, but only doubles for the sparse diagonal matrix. This gives an overall computation time on an order between O(n²) and O(n³).

From the right-hand figure, we see that the initial growth for the sparse case is slow. This is because access overheads dominate the actual computation for small matrices. However, beyond the n=64 mark (i.e. density ≤ 1.5%), sparse matrices compute faster than dense ones.

Density fixed, size varying

Keep in mind that the density of sparse diagonal matrices drops as size grows, since density = 1/n. A fairer comparison would keep the density constant. The following plots compare the 2 cases again, except now the sparse matrices have their density fixed at 1/8, i.e. 12.5%. So for example, the n=2¹³ sparse case will have 2¹³ x 2¹³/8 = 2²³ non-zero elements.

Left: Complete size range from 2²=4 to 2¹³=8192. Right: Zoom in on the x-axis up to 2⁷=128
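
One way to construct such a fixed-density sparse matrix is to draw random coordinates for the non-zero entries. Here is a sketch under my own assumptions (the article doesn’t show its construction code, and duplicate coordinates would be merged on coalescing, so the effective nnz can be slightly lower):

n = 2**13
nnz = n * n // 8 #density fixed at 1/8 = 12.5%
rows = torch.randint(0, n, (nnz,))
cols = torch.randint(0, n, (nnz,))
S = torch.sparse_coo_tensor(indices=torch.stack([rows, cols]),
                            values=torch.randn(nnz), size=[n, n])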

This time, as n doubles, the number of elements quadruples for both the sparse and dense matrices. The COO format also needs extra time to access elements through their separate index-value pairs. This is why the sparse computation times grow at greater than O(n³), and why they are always worse than their dense counterparts here.

Size fixed, density varying

Finally, let’s investigate the effect of density on sparse matrix computation times while keeping the size fixed at different values. The pseudocode for this experiment is:

cases = [2**4, 2**7, 2**10, 2**13] #16, 128, 1024, 8192
for n in cases:
    Form a dense random square matrix of size n
    for nnz in powers of 2 from 1 to n**2:
        Form a sparse square matrix of size n with nnz non-zero values
        Compute the time to multiply these matrices
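
A runnable version of this pseudocode might look like the following. The random_sparse helper is my own hypothetical construction (not from the article), and time_mm is the timing helper sketched earlier:

def random_sparse(n, nnz):
    #sparse n x n matrix with (up to) nnz randomly placed non-zero values
    idx = torch.stack([torch.randint(0, n, (nnz,)), torch.randint(0, n, (nnz,))])
    return torch.sparse_coo_tensor(indices=idx, values=torch.randn(nnz), size=[n, n])

for n in [2**4, 2**7, 2**10, 2**13]:
    M = torch.randn(n, n) #dense random square matrix of size n
    nnz = 1
    while nnz <= n**2:
        S = random_sparse(n, nnz)
        print(n, nnz / n**2, time_mm(lambda: torch.sparse.mm(S, M))) #density and runtime
        nnz *= 2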

I kept n fixed at 4 different cases — 16, 128, 1024 and 8192, and plotted computation time vs density for each case. Density can be obtained by dividing nnz by n². I have marked the 2 previous experiments — diagonal matrix and 12.5% density — as vertical dashed lines in each plot. The time to multiply 2 dense matrices is the red horizontal dashed line.

The n=16 case is a bit different since it is small enough for access overheads to dominate the computation time. For the other 3 cases, computation time doubles as nnz doubles, i.e. it is O(nnz). This is not surprising since the matrix size stays the same, so the only growth comes from nnz.

The major conclusion is that 2 dense matrices always multiply faster than a sparse and dense matrix unless the sparse matrix has very low density. ‘Very low’ seems to be 1.5% and below.

So there you have it. There isn’t a lot to be gained from using the sparse libraries in their current state in Pytorch, unless you are dealing with very sparse cases (like diagonal matrices of size greater than 100). My research deals with pre-defined sparsity in neural networks (IEEE, arXiv), where the weight matrix is sparse and the input matrix is dense. Pre-defined sparsity gives promising results at densities as low as 20%, but performance does degrade at densities of 1.5% and below. So unfortunately Pytorch’s sparse libraries aren’t a great fit at the moment. Having said that, the Pytorch sparse API is experimental and under active development, so here’s hoping that new pull requests improve its performance.

The code and images for this article can be found on Github here.

Appendix: Storing sparse matrices

Tensors in Pytorch can be saved using torch.save(). The size of the resulting file is roughly the size of an individual element multiplied by the number of elements. The dtype of a tensor gives the number of bits in an individual element. For example, a dense 1000x1000 matrix of data type float32 has size (32 bits x 1000 x 1000) = 4 MB. (Recall that 8 bits = 1 byte.)
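
As a quick check, here is a small sketch of my own (the file name and the small serialization overhead are not from the article):

import os
import torch

A = torch.randn(1000, 1000) #float32 by default, so 4 bytes per element
torch.save(A, 'dense.pt')
print(os.path.getsize('dense.pt')) #roughly 4,000,000 bytes plus a small overhead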

Unfortunately, sparse tensors do not support the .save() feature. There are 2 workarounds to save them: (a) convert to dense and store that, or (b) store the indices(), values(), and size() in separate files and reconstruct the sparse tensor from these. For example, say spmat is a sparse diagonal matrix of size 1000x1000, i.e. it has 1000 non-zero elements, and assume its data type is float32. Using (a), the stored matrix has file size = (32 bits x 1000 x 1000) = 4 MB. Using (b), the indices() are integers of data type int64, and there are 2000 of them (1 row index and 1 column index for each of the 1000 non-zero elements). The 1000 non-zero values() are all float32. The size() is of a special data type called torch.Size, which is a tuple of 2 integers. So the total file size is approximately (64 x 2000) + (32 x 1000) + (64 x 2) bits ≈ 20 KB. This is way less than 4 MB. More generally, the file size grows as O(n²) for (a), and O(nnz) for (b). But with (b) you need to reconstruct the sparse tensor every time you load it.
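
Here is a sketch of workaround (b), under the assumption that the tensor is coalesced so that indices() and values() can be read (the file names are my own):

#saving
sp = spmat.coalesce()
torch.save(sp.indices(), 'indices.pt')
torch.save(sp.values(), 'values.pt')
torch.save(sp.size(), 'size.pt')

#loading and reconstructing
spmat2 = torch.sparse_coo_tensor(indices=torch.load('indices.pt'),
                                 values=torch.load('values.pt'),
                                 size=torch.load('size.pt'))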

Sourya Dey is pursuing a PhD at the University of Southern California. His research deals with exploring complexity reduction in deep learning. You can read more about him on his website.
