
How to Use Singular Value Decomposition (SVD) for Image Classification in Python

Demystifying the Linear Algebra concepts behind SVD with a simple example

Photo by Marcel Strauß on Unsplash

One of the most elusive topics in Linear Algebra is the Singular Value Decomposition (SVD) method. It is also one of the most fundamental techniques, because it paves the way for understanding Principal Component Analysis (PCA), Latent Dirichlet Allocation (LDA) and the concept of matrix factorization in general.

The elusiveness of SVD stems from the fact that, while the method requires a fair amount of maths and linear algebra to grasp, its practical applications are often overlooked. Many people think they understand SVD when in fact they don’t. It is not only a dimensionality reduction technique: the magic of SVD is that any matrix A can be written as a sum of rank-1 matrices! This will become apparent later.

The purpose of this article is to show the usefulness and the underlying mechanisms of SVD by applying it to a well-known example: handwritten digits classification.

A quick reminder (optional for advanced readers)

The basic relation of SVD is

A = USVᵀ

where U and V are orthogonal matrices and S is a diagonal matrix.

More specifically:

A = USVᵀ = σ₁u¹(v¹)ᵀ + σ₂u²(v²)ᵀ + … + σᵣuʳ(vʳ)ᵀ

where r is the rank of A. This shows the aforementioned claim: any matrix A can be written as a sum of rank-1 matrices.
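As a quick numerical sanity check (not part of the original derivation; the small random matrix below is used purely for illustration), the rank-1 expansion can be verified with NumPy:

import numpy as np

# a small random matrix, purely for illustration
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# rebuild A as a sum of rank-1 matrices: sigma_i * u_i * (v_i)^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

print(np.allclose(A, A_rebuilt))  # True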

A few useful properties of SVD:

  1. The U and V matrices are constructed from the eigenvectors of AAᵀ and AᵀA respectively.
  2. Any matrix has an SVD decomposition. This is because the AAᵀ and AᵀA matrices have a special property (among others): they are at least positive semidefinite (which means their eigenvalues are either positive or zero).
  3. The S matrix is diagonal, and its entries are the square roots of the non-negative eigenvalues of AᵀA (equivalently, of AAᵀ). These are called the singular values.
  4. In most programming languages, including Python, the columns of U and V are arranged so that columns associated with larger singular values precede those associated with smaller ones.
  5. The u¹, u², … vectors are called left singular vectors and they form an orthonormal basis. Correspondingly, the v¹, v², … vectors are called right singular vectors.
  6. The rank of matrix A equals the number of non-zero singular values in S.
  7. Eckart–Young–Mirsky theorem: the best rank-k approximation (k < r) of a rank-r matrix A, in both the 2-norm and the Frobenius norm, is

Aₖ = σ₁u¹(v¹)ᵀ + σ₂u²(v²)ᵀ + … + σₖuᵏ(vᵏ)ᵀ

In other words:

If you want to approximate a matrix A with one of lower rank k, the optimal way to do so is to apply SVD to A and keep only the first k basis vectors, those associated with the k largest singular values.
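Here is a minimal sketch of the theorem in code (again on a random matrix, an assumption made just for illustration): truncating the SVD after k terms gives the best rank-k approximation, and its 2-norm error equals the (k+1)-th singular value.

import numpy as np
from scipy.linalg import svd, norm

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 30))
U, s, Vt = svd(A, full_matrices=False)

k = 5
# best rank-k approximation: keep only the first k singular triplets
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# the 2-norm error of the approximation equals the (k+1)-th singular value
print(norm(A - A_k, 2), s[k])  # the two numbers coincide (up to floating point)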

SVD in Python

For this example, we will use the USPS (U.S. Postal Service) handwritten digits dataset. The dataset contains 7291 train and 2007 test images of handwritten digits between 0 and 9. Each image is a 16×16 grayscale image. First, we load the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.linalg import svd, norm
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import h5py
import os
# define class labels
labels = {
    0: "0", 
    1: "1", 
    2: "2", 
    3: "3", 
    4: "4", 
    5: "5", 
    6: "6", 
    7: "7", 
    8: "8",
    9: "9"
}
# load the dataset
with h5py.File(os.path.join(os.getcwd(), 'usps.h5'), 'r') as hf:
        train = hf.get('train')
        test = hf.get('test')
        x_train = pd.DataFrame(train.get('data')[:]).T
        y_train = pd.DataFrame(train.get('target')[:]).T
        x_test = pd.DataFrame(test.get('data')[:]).T
        y_test = pd.DataFrame(test.get('target')[:]).T
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
#Output:
#(256, 7291)
#(1, 7291)
#(256, 2007)
#(1, 2007)

The data have been loaded in such a way as to match the dimensions from the quick reminder section above. The columns of x_train and x_test contain the digits as vectors, flattened into arrays of size 256 (since each digit is 16×16). On the other hand, y_train and y_test are row vectors containing the actual class (a value between 0 and 9) of each digit, for the train and test dataset respectively.

Figure 1 displays the first image in the training dataset:

# display the first training image (column 0), reshaped back to 16x16
digit_image = x_train[0]
plt.imshow(digit_image.to_numpy().reshape(16, 16), cmap='binary')
Figure 1: Image from train dataset

The methodology for digit classification is organised in the following steps:

  1. We split the x_train dataframe into 10 matrices (column-wise), one for each digit 0–9. These are the A matrices mentioned previously; the goal is to apply SVD to each of them separately. For instance, the A0 matrix contains only images of the digit 0, and its shape is (256, 1194) since there are 1194 zeros in the training set.
  2. Next, we apply SVD to each of the [A0, A1, …, A9] matrices. For each A matrix we store the corresponding U, S and V matrices. We will mostly work with the U matrix.
  3. Each data matrix A, representing a single digit, has a ‘distinctive characteristic’. This differentiation is reflected in its first few left singular vectors (u¹, u², …). Since these singular vectors form an orthonormal basis, if an unknown digit is better approximated in the basis of a particular digit (e.g. the digit 3) than in any other basis, then we classify the unknown digit as that digit (as a 3). This will become more apparent later in the programming example.

Step 1

This is fairly easy. We just create the [A0, A1, …, A9] matrices and store them in a dictionary called alpha_matrices:

alpha_matrices = {}
for i in range(10):
    # select the columns of x_train whose label equals digit i
    alpha_matrices.update({"A"+str(i): x_train.loc[:, list(y_train.loc[0, :] == i)]})
print(alpha_matrices['A0'].shape)
#(256, 1194)

Step 2

This step is also straightforward. We store the U, S and V matrices in the left_singular, singular_matrix and right_singular dictionaries respectively:

left_singular = {}
singular_matrix = {}
right_singular = {}
for i in range(10):
    # economy-size SVD of each digit matrix
    u, s, v_t = svd(alpha_matrices['A'+str(i)], full_matrices=False)
    left_singular.update({"u"+str(i): u})
    singular_matrix.update({"s"+str(i): s})
    right_singular.update({"v_t"+str(i): v_t})
print(left_singular['u0'].shape)
#(256, 256)

Let’s display what information is contained in those matrices. We will use as an example the U, S and V matrices of digit 3, which correspond to the following variables in our example:

# left_singular['u3']
# singular_matrix['s3']
# right_singular['v_t3']
plt.figure(figsize=(20, 10))
columns = 5
# plot the first 10 left singular vectors of digit 3 as 16x16 images
for i in range(10):
    plt.subplot(10 // columns + 1, columns, i + 1)
    plt.imshow(left_singular["u3"][:, i].reshape(16, 16), cmap='binary')
Figure 2: The first 10 singular vectors of the U3 matrix

Figure 2 displays as images the first 10 left singular vectors [u¹, u², … u¹⁰] (out of 256). All of them depict the digit 3, with the first one (the u¹ vector) being the clearest.

Figure 3 shows the singular values of digit 3 from the S matrix, in log scale:

plt.figure(figsize = (9, 6))
plt.plot(singular_matrix['s3'], color='coral', marker='o')
plt.title('Singular values for digit $3$',fontsize=15,weight="bold",pad=20)
plt.ylabel('Singular values' ,fontsize=15)
plt.yscale("log")
plt.show()
Figure 3: All singular values of digit 3

Since the singular values are sorted in decreasing order, the first few are by far the largest, and the curve drops off steeply.

Taking Figure 2 and Figure 3 into account, we can graphically confirm the matrix-approximation property of SVD for digit 3 (recall the Eckart–Young–Mirsky theorem): the first left singular vector captures the intrinsic properties of matrix A3. Indeed, in Figure 2 the first singular vector u¹ looks like the digit 3, and the following left singular vectors represent the most important variations of the training set around u¹.

The question is whether we can use only the first k singular vectors and still obtain a good approximation of the basis. We can test this experimentally.
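One way to do so (a small sketch reusing the dictionaries defined above; the k values below are arbitrary choices) is to check how quickly the relative Frobenius error of the rank-k approximation of A3 decays:

# relative Frobenius error of the rank-k approximation of A3 for a few k values
A3 = alpha_matrices['A3'].to_numpy()
u3, s3, v3_t = left_singular['u3'], singular_matrix['s3'], right_singular['v_t3']

for k in [5, 10, 20, 50]:
    A3_k = u3[:, :k] @ np.diag(s3[:k]) @ v3_t[:k, :]
    print("k =", k, "relative error =", round(norm(A3 - A3_k) / norm(A3), 3))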

Step 3

Given an unknown digit represented by a 256-dimensional vector z, and ten sets of left singular vectors [u¹, u², …, uᵏ], one set per digit matrix A, what is the label of z? Notice that the index runs up to k (the first k dominant left singular vectors) and not up to n (all of them). To solve this problem, all we have to do is the following:

The goal is to compute how well a digit from the test set can be represented in the 10 different orthonormal bases.

This can be done by computing, for each digit’s basis Uₖ = [u¹, u², …, uᵏ], the residual of the least squares problem

min ‖z − Uₖy‖₂ over all coefficient vectors y

Because the columns of Uₖ are orthonormal, the least squares solution is y = Uₖᵀz, and the residual norm becomes

‖(I − UₖUₖᵀ)z‖₂

The digit whose basis yields the smallest residual is taken as the predicted label.
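For concreteness, here is what this residual looks like for a single test digit against a single basis (an illustrative sketch only; the test image index and the choice k = 12 are arbitrary):

k = 12
z = x_test[0].to_numpy()               # one test image as a 256-dimensional vector
U_k = left_singular['u3'][:, :k]       # first k left singular vectors of digit 3
residual = norm(z - U_k @ (U_k.T @ z)) # equals the norm of (I - U_k U_kᵀ) z
print(residual)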

And that’s it! We now have everything we need. Using this formula, we calculate the test accuracy for different values of k:

I = np.eye(x_test.shape[0])
kappas = np.arange(5, 21)          # numbers of basis vectors k to try
len_test = x_test.shape[1]
predictions = np.empty((y_test.shape[1], 0), dtype=int)
for t in list(kappas):
    prediction = []
    for i in range(len_test):
        residuals = []
        for j in range(10):
            # residual of test digit i against the first t basis vectors of digit j
            u = left_singular["u"+str(j)][:, 0:t]
            res = norm(np.dot(I - np.dot(u, u.T), x_test[i]))
            residuals.append(res)
        # predict the digit whose basis gives the smallest residual
        index_min = np.argmin(residuals)
        prediction.append(index_min)

    prediction = np.array(prediction)
    predictions = np.hstack((predictions, prediction.reshape(-1, 1)))
scores=[]
for i in range(len(kappas)):
    score=accuracy_score(y_test.loc[0,:],predictions[:,i])
    scores.append(score)
data={"Number of basis vectors":list(thresholds), "accuracy_score":scores}
df=pd.DataFrame(data).set_index("Number of basis vectors")
Table 1: Test accuracy score for different values of k

We could also show this result graphically:
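A minimal plotting sketch that produces such a figure (assuming the kappas and scores variables computed above) could look like this:

plt.figure(figsize=(9, 6))
plt.plot(list(kappas), scores, color='coral', marker='o')
plt.xlabel('Number of basis vectors $k$', fontsize=15)
plt.ylabel('Test accuracy', fontsize=15)
plt.title('Test accuracy for different values of $k$', fontsize=15, weight="bold", pad=20)
plt.show()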

Figure 4: Test accuracy score for different values of k

Both Table 1 and Figure 4 display the accuracy score for different numbers of basis vectors. The best score is achieved with k = 12.

Next, the confusion matrix is displayed:

# column 7 of `predictions` corresponds to k = 12 (kappas runs from 5 to 20)
pd.set_option('display.max_colwidth', 12)
confusion_matrix_df = pd.DataFrame(confusion_matrix(y_test.loc[0, :], predictions[:, 7]))
confusion_matrix_df = confusion_matrix_df.rename(columns=labels, index=labels)
confusion_matrix_df

And the f1-score (both macro-average and per class):

print(classification_report(y_test.loc[0,:],predictions[:,7]))

Comments:

  • Digits 0, 1, 6 and 7 perform the best in terms of f1-score.
  • Digits 5 and 3 perform the worst in terms of f1-score.

Let’s look at some examples of misclassified images:

misclassified = np.where(y_test.loc[0, :] != predictions[:, 7])
plt.figure(figsize=(20, 10))
columns = 5
for i in range(2, 12):
    misclassified_id = misclassified[0][i]
    image = x_test[misclassified_id]

    plt.subplot(10 // columns + 1, columns, i - 1)
    plt.imshow(image.to_numpy().reshape(16, 16), cmap='binary')
    plt.title("True label:" + str(y_test.loc[0, misclassified_id]) + '\n' + "Predicted label:" + str(predictions[misclassified_id, 7]))

Clearly, some of these digits are very poorly written.

Conclusion

In practice, the data matrix A is essentially a low-rank matrix plus noise: A = A’ + N. By applying the Eckart–Young–Mirsky theorem, we approximate the data matrix A with a matrix of the correct rank k. This keeps the intrinsic properties of A intact while simultaneously removing the extra noise. But how do we find the correct rank k?

In our case, we found k empirically, in terms of test accuracy, by experimenting with different numbers of dominant singular vectors. It should be noted that there are other algorithms which outperform this technique. However, the focus of this case study is to show an alternative way of classifying handwritten digits by leveraging a well-known matrix factorization technique. For more information on the theoretical techniques and principles of this example, check this book.
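As a final illustration of the denoising interpretation above, a short sketch (reusing the dictionaries from the training step; the image index and k = 12 are arbitrary choices) compares an original digit-3 image with its projection onto the first k left singular vectors:

k = 12
U_k = left_singular['u3'][:, :k]
z = alpha_matrices['A3'].iloc[:, 0].to_numpy()   # one original image of digit 3
z_k = U_k @ (U_k.T @ z)                          # its projection onto the rank-k basis

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
axes[0].imshow(z.reshape(16, 16), cmap='binary')
axes[0].set_title('original')
axes[1].imshow(z_k.reshape(16, 16), cmap='binary')
axes[1].set_title('rank-' + str(k) + ' projection')
plt.show()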

