
Understanding DeepMind matrix multiplication

DeepMind matrix multiplications on NVIDIA V100, Tesla T4 and a look at FBHHRBNRSSSHK – which is not me typing random letters!

Image by Ivan Diaz on Unsplash

In the previous post, we learned the maths behind the Strassen algorithm and wrote Python code to test it against matrices of different sizes. Moreover, we learned that the Holy Grail of linear algebra is finding an optimal algorithm for matrix multiplication. Normally, we would write matrix multiplication as three nested for-loops:

def matmul(mat1, mat2, mat3):
    r""" Function to multiply mat1 and mat2,
    storing the result in mat3

    Parameters
    ----------
    mat1: np.array, matrix A
    mat2: np.array, matrix B
    mat3: np.array, pre-allocated matrix C

    Return
    ------
    mat3: np.array, matmul between A & B
    """
    # loop over the rows of A, the columns of B and the shared inner dimension
    for i in range(mat1.shape[0]):
        for j in range(mat2.shape[1]):
            mat3[i][j] = 0.0
            for k in range(mat1.shape[1]):
                mat3[i][j] += mat1[i][k]*mat2[k][j]
    return mat3
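
As a quick sanity check, we can compare this naive implementation against NumPy’s optimized matmul (the matrix size here is just an example):

import numpy as np

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
C = np.zeros((64, 64))

# the naive triple loop and NumPy's optimized matmul should agree
assert np.allclose(matmul(A, B, C), A @ B)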

Thus the computational complexity is O(n³). Strassen improved on this calculation, finding the following relationships:

Fig.1: Strassen algorithm as proposed in his paper "Gaussian Elimination is not Optimal". [Image by the author]

This algorithm is applied to block matrices and the total complexity is reduced to O(n²·⁸⁰⁸). Although 2.808 may seem a very small improvement over 3, we saw that for square matrices of size 4096 the standard NumPy matmul takes about 454.37 +/- 6.27 s, while Strassen takes 31.57 +/- 1.01 s, a difference of about one order of magnitude.
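
To make Fig.1 concrete, here is a minimal sketch of one level of Strassen’s recursion applied to 2 x 2 block matrices in NumPy (the seven block products are computed here with the standard matmul; a full implementation would recurse on them):

import numpy as np

def strassen_one_level(a, b):
    """One level of Strassen's scheme: 7 block multiplications instead of 8."""
    n = a.shape[0] // 2
    a11, a12, a21, a22 = a[:n, :n], a[:n, n:], a[n:, :n], a[n:, n:]
    b11, b12, b21, b22 = b[:n, :n], b[:n, n:], b[n:, :n], b[n:, n:]
    # the seven Strassen products of Fig.1
    p1 = (a11 + a22) @ (b11 + b22)
    p2 = (a21 + a22) @ b11
    p3 = a11 @ (b12 - b22)
    p4 = a22 @ (b21 - b11)
    p5 = (a11 + a12) @ b22
    p6 = (a21 - a11) @ (b11 + b12)
    p7 = (a12 - a22) @ (b21 + b22)
    # recombine the products into the four blocks of C
    c11 = p1 + p4 - p5 + p7
    c12 = p3 + p5
    c21 = p2 + p4
    c22 = p1 - p2 + p3 + p6
    return np.block([[c11, c12], [c21, c22]])

a = np.random.rand(128, 128)
b = np.random.rand(128, 128)
assert np.allclose(strassen_one_level(a, b), a @ b)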

We saw that the matrix multiplication problem can be reduced to a tensor product, with the tensoring operation:

Fig.2: Definition of the matrix multiplication tensor as a triad, as defined in the DeepMind paper. [Image by the author].

Fig.2 expresses matrix multiplication as a triad, namely a set of three elements. The minimal number of triads needed defines the minimal number of operations required to compute the product matrix. This minimal number is the tensor’s rank R(t). Working on the tensor rank is an efficient way to find new multiplication algorithms, as explained in the DeepMind papers.
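
As an illustration of the object in Fig.2, here is a minimal NumPy sketch that builds the matrix multiplication tensor for n x n matrices (the function name and the row-major flattening convention are my own choices; the DeepMind paper uses an equivalent definition):

import numpy as np

def matmul_tensor(n):
    """Build the n x n matrix multiplication tensor.

    Entry [a, b, c] is 1 when the a-th entry of A times the b-th entry of B
    contributes to the c-th entry of C = A @ B (matrices flattened row-major).
    """
    t = np.zeros((n * n, n * n, n * n), dtype=int)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # A[i, k] * B[k, j] contributes to C[i, j]
                t[i * n + k, k * n + j, i * n + j] = 1
    return t

# the 2x2 case: a 4x4x4 tensor whose rank is 7, matching Strassen's algorithm
print(matmul_tensor(2).shape)  # (4, 4, 4)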

Over the last few years DeepMind has shown how mathematical problems, from theory to application, can be tackled with machine learning, in particular with reinforcement learning (RL). This time they’ve adapted AlphaZero to find the best strategy for multiplying matrices, and the result is AlphaTensor. I think it’s worth stating right away what we can appreciate in this paper:

  1. DeepMind proved again that RL can be a formidable ally in tackling complicated mathematical problems;
  2. AlphaTensor found algorithms that can be even better than Strassen’s for multiplying 4 x 4 and 5 x 5 block matrices;
  3. Moreover, AlphaTensor can find an optimal solution for specific hardware requirements. As they show in the paper, there can be algorithms specifically tuned for TPUs and for GPUs (a V100 in the paper’s case);
  4. Although these may not be the best possible results, mathematicians now have a completely new set of equations for the matrix multiplication problem that can serve as a new starting point for finding more optimal solutions.

Testing AlphaTensor on a V100 GPU

Luckily for us, DeepMind provides the implementation of AlphaTensor on its GitHub, ready to be tested and all written in JAX. The code is tailored for multiplying 4 x 4 block matrices, with a total of 47 multiplications, already expressed in the tensor notation, as you can see here.
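
To give an idea of how such a factorization translates into an algorithm, here is a rough sketch of how a rank-R decomposition (u, v, w) of the block-multiplication tensor can be applied to block matrices. The function and variable names are my own, and the exact flattening/transposition convention of the DeepMind factorizations may differ, so treat this as illustrative rather than the repository’s code:

import numpy as np

def block_matmul_from_factors(a, b, u, v, w, s):
    """Multiply A and B seen as s x s grids of blocks, given factors (u, v, w).

    u, v, w have shape (s*s, R): column r gives the coefficients of the r-th
    of the R block multiplications (e.g. R = 47 for the 4 x 4 scheme).
    """
    n = a.shape[0] // s  # size of each block
    # flatten A and B into lists of s*s blocks, row-major
    a_blk = [a[i*n:(i+1)*n, j*n:(j+1)*n] for i in range(s) for j in range(s)]
    b_blk = [b[i*n:(i+1)*n, j*n:(j+1)*n] for i in range(s) for j in range(s)]
    rank = u.shape[1]
    # each of the R products is (combination of A blocks) @ (combination of B blocks)
    prods = [sum(u[p, r] * a_blk[p] for p in range(s*s))
             @ sum(v[q, r] * b_blk[q] for q in range(s*s))
             for r in range(rank)]
    # the blocks of C are linear combinations of those R products
    c = np.zeros_like(a)
    for idx in range(s*s):
        i, j = divmod(idx, s)
        c[i*n:(i+1)*n, j*n:(j+1)*n] = sum(w[idx, r] * prods[r] for r in range(rank))
    return c

With s = 2 and Strassen’s seven products encoded in u, v and w, this reproduces the scheme of Fig.1; with s = 4 and a rank-47 factorization it gives a 47-multiplication block algorithm, up to the convention caveat above.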

The benchmark is run on a TPU and a V100 GPU, testing matrix multiplications for the following sizes: 8192, 10240, 12288, 14336, 16384, 18432 and 20480, for the standard jax.numpy.dot multiplication, the Strassen algorithm and the AlphaTensor one.
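
The timing loop is essentially the following (a minimal sketch in JAX; the function name and exact setup are illustrative, not the repository’s code):

import timeit

import jax
import jax.numpy as jnp

def benchmark_dot(size, num_runs=10):
    """Time jax.numpy.dot on random square matrices of the given size."""
    key1, key2 = jax.random.split(jax.random.PRNGKey(0))
    a = jax.random.normal(key1, (size, size), dtype=jnp.float32)
    b = jax.random.normal(key2, (size, size), dtype=jnp.float32)
    # warm-up run, so that compilation time is not included in the timings
    jnp.dot(a, b).block_until_ready()
    timings = timeit.repeat(
        lambda: jnp.dot(a, b).block_until_ready(), number=1, repeat=num_runs)
    return sum(timings) / num_runs

for size in [8192, 10240, 12288, 14336, 16384, 18432, 20480]:
    print(size, benchmark_dot(size))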

I forked the code and added two tiny modifications:

  1. I am printing the timings for each approach
  2. Since we are running the multiplication 10 times for each algorithm, I changed the averaging factor in the benchmark functions to 10, rather than keeping it at 20.

Having free access to the Google Cloud Console (still relying on the $300 free credits), I created a Virtual Machine in GCP Compute Engine to test AlphaTensor, with the following specifications:

  • Running region europe-west4
  • Selected GPU machine with 1 NVIDIA V100 GPU
  • Automatic selection for the CPU platform
  • n1-standard-4 machine type (4 vCPU and 15 GB RAM)
  • I changed the OS image to the "Deep Learning on Linux" operating system, version "Debian 10 based Deep Learning VM for TensorFlow Enterprise 1.15 with CUDA 11.0 M100", with a 50 GB disk

The total cost is $1.94 per hour – so be careful and do not let this machine run indefinitely.

Once the machine is created, you can access it directly via SSH and download the repo with git clone https://github.com/Steboss/alphatensor.git. You’ll then have to set up the Python environment and install JAX with pip3 install -r alphatensor/benchmarking/requirements.txt -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html. Finally, you can run the test with python -m alphatensor.benchmarking.run_gpu_benchmark.

Fig.3: Comparison between Jax matrix multiplication, Strassen algorithm and AlphaTensor on V100 GPU. [Image by the author].

Fig.3 reports the performance time against the matrix size for each algorithm. We can see that for small matrix sizes, 8192 and 10240, Strassen achieves an improvement of about 6.5% with respect to standard JAX, comparable with the AlphaTensor improvement of about 8.5%. Better results are achieved for bigger matrices, so that for 18432 square matrices Strassen improves the calculation by 15% (7.37 +/- 0.01 s) and AlphaTensor achieves an improvement of 16% (7.31 +/- 0.01 s) compared to JAX (8.53 +/- 0.01 s).

What if I don’t have free access to V100?

Another test I ran was on Google Colab, where we can rely on a Tesla T4 GPU. Although the algorithm has been tuned on a V100, it’s worth investigating its transferability and comparing the results. Similarly to the V100 test, I replicated these calculations in a Google Colab notebook, removing the few V100-specific lines from the benchmark script.

Fig.4: Comparison between Jax matrix multiplication, Strassen algorithm and AlphaTensor on Tesla T4. [Image by the author].

As you can see, there is more variability in the results: in particular, for matrices of size 16384 all the algorithms achieve the same timings. This result is probably not reliable, as it may be due to slowdowns on Google Colab that we cannot control. Tab.1 summarises all the findings on the Tesla T4:

Sizes 12288 and 16384 are tricky points, where we don’t see a real improvement with respect to the standard JAX multiplication. On the other hand, we do see an improvement for very large matrices: at 18432 Strassen achieves a 20% speedup and AlphaTensor 22%.

Is this the end of the story?

Just a few days after DeepMind’s paper was published, Manuel Kauers and Jakob Moosbauer wrote a great reply paper, proposing the FBHHRBNRSSSHK algorithm. Starting from DeepMind’s findings – in particular the 4×4 scheme with 47 multiplications – they improved the multiplication of 5×5 matrices to 95 multiplications, rather than the 96 proposed by AlphaTensor. This is great news and it shows how humans can collaborate productively with ML algorithms. After this reply, Kauers and Moosbauer published a fantastic mathematical paper called "Flip Graphs for Matrix Multiplication". In this paper the authors showed the technique they used to further improve matrix multiplication schemes. The core of the technique is to start from known matrix multiplication schemes and organise them in a graph:

We define a graph whose vertices are correct matrix multiplication schemes and where there is an edge from one scheme to another if the second can be obtained from the first by some kind of transformation. We consider two transformations. One is called a flip and turns a given scheme to a different one with the same number of multiplications, and the other is called a reduction and turns a given scheme to one with a smaller number of multiplications

The navigation across all the matrix multiplication schemes is done through a random walk. Then, the flip can be done using the following idea:

The idea is to subtract something from one rank-one tensor and add it to one of the others.

Fig.5: Example of a flip graph, showing how all the algorithms live as nodes in the graph, flips are undirected edges and reduction transformations are directed edges. [Image by Manuel Kauers and Jakob Moosbauer].

Fig.5 reports the paper’s image, where the authors depict all the known schemes and how they are linked to each other through flip and reduction transformations. Thus, this is not the end of the story, but another great starting point, which will bring more and more efficient algorithms for matrix multiplication.
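
To make the flip idea concrete, here is a small NumPy check of one way such a move can work: when two rank-one terms share a factor, part of one term can be shifted onto the other without changing their sum (this is my own toy illustration of the identity, not the authors’ code):

import numpy as np

rng = np.random.default_rng(0)
a, b, b2, c, c2 = (rng.standard_normal(4) for _ in range(5))

def rank_one(x, y, z):
    """Outer product x ⊗ y ⊗ z as a 3-dimensional array."""
    return np.einsum('i,j,k->ijk', x, y, z)

# two rank-one terms sharing the first factor a
before = rank_one(a, b, c) + rank_one(a, b2, c2)
# after the flip: subtract something from one term and add it to the other
after = rank_one(a, b + b2, c) + rank_one(a, b2, c2 - c)
assert np.allclose(before, after)  # the sum of the two terms is unchanged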

Conclusions

Today we’ve concluded the review of the DeepMind paper "Discovering faster matrix multiplication algorithms with Reinforcement Learning". This paper has attracted a lot of interest to the matrix multiplication field and, obviously, raised lots of questions.

We defined 4 main points that we can draw from the paper:

  1. RL is a formidable tool that can help us in tackling mathematical problems
  2. AlphaTensor has found new formulations for multiplying 5×5 and 4×4 matrices
  3. AlphaTensor can find optimal solutions for specific hardware (e.g. GPU or TPU)
  4. This is a great starting point for new research on the matrix multiplication problem

Then, we ran AlphaTensor on an NVIDIA V100 GPU and on a Tesla T4. Despite some ups and downs, we can see that overall AlphaTensor improves the calculation, with a speedup of up to 16% on the V100 and 22% on the Tesla T4 – even though the algorithm is not optimized for that GPU.

Finally, we saw that this is not the end of the story, but rather a beautiful start of a new one. An example is given by the FBHHRBNRSSSHK algorithm, which proves how the DeepMind solution can be further exploited with a pure mathematical formalism to find new and more efficient matrix multiplication techniques.

Support my writing:

Join Medium with my referral link – Stefano Bosisio

Please, feel free to send me an email for questions or comments at: [email protected] or directly here in Medium.

