Turn to your nearest window and gaze outside at any object your eyes encounter. Now ask yourself: while identifying the object, did your brain perceive it as a whole, or were certain parts or features of the object enough for you to decide what it was? The ability of our brain to identify an object by recognizing its individual parts, without having to see the entire object, is a fascinating phenomenon. While looking at an abstract-shaped cloud, our brain automatically associates the floating shape with the closest-resembling object. Psychological studies [1], physiological studies [2], and computational studies such as in-silico neural network architectures for computer vision [3] investigate and support the idea of part-based representation in the brain.
One such computer vision technique is Non-negative Matrix Factorization (NMF), which cleverly learns the features of an object and is arguably one of the simplest and easiest-to-implement algorithms for the task. NMF was introduced as an exploratory factor-analysis method in 1994 by Paatero and Tapper and was further popularized in 1999 by Daniel D. Lee & H. Sebastian Seung [4]. It has remained a popular algorithm for image and text mining. The fundamental principle of NMF is part-wise learning based on a non-negativity constraint, meaning NMF is only applicable when the data matrix of interest has no negative entries.
_Readers interested in the implementation and code are encouraged to give the next section a quick read in order to understand the dimensional notation that I am going to use in the data preparation and the code. Also, feel free to explore Part I, Part II, and Part III of the Hands-on Vanilla Modeling series. Cheers 🙂_
Understanding NMF
Let’s jump right into the mathematics behind NMF. Essentially, NMF reduces the data by decomposing it into weights and factors such that the distance (Euclidean/Frobenius) between the original matrix and the product of the newly formed matrices is minimized.
Let V (m × n) be our data matrix. The goal is to factorize V into a basis matrix W (m × k) of lower dimension k and an encoding matrix H (k × n) such that the product of W and H gives a close approximation of V. Each column of H is in one-to-one correspondence with a column of V. Basically, the encoding columns hold the coefficients that, in a linear combination with the basis columns, produce the corresponding column of V.
For example, when each column of the data matrix V represents a facial image, the basis columns are basis images, features such as eyes, noses, and lips, while the columns of H indicate which feature is present in which image and with what intensity. The dimension k represents the features (parts) of the data and is usually chosen such that (m + n)k ≤ mn. It wouldn’t be wrong to call the product W × H a compressed form of the data V.
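As a quick sanity check of this compression claim, here is a minimal R sketch using the sizes from the implementation section below (m = 150 × 150 pixels, n = 500 images, k = 100):

# Illustration of (m + n)k <= mn: V stores m*n numbers,
# while W and H together store only (m + n)*k numbers.
m <- 150 * 150   # rows of V: pixels per flattened image
n <- 500         # columns of V: number of images
k <- 100         # number of basis images (parts)
c(original = m * n, factorized = (m + n) * k)   # 11,250,000 vs 2,300,000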
The underlying NMF objective is to minimize the distance between V and W x H with respect to W and H while preserving the nonnegativity of W and H.
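In matrix notation, writing ‖·‖_F for the Frobenius norm, this objective is:

$$\min_{W \ge 0,\; H \ge 0} \; \lVert V - WH \rVert_F^{2} \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n}\bigl(V_{ij} - (WH)_{ij}\bigr)^{2}$$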
The non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.
This minimization problem cannot be solved analytically: although the objective is convex in W alone or in H alone, it is not convex in W and H jointly, and finding the global minimum is, in general, NP-hard. Simply put, it is difficult, if not impossible, to find the global minimum. The non-convexity can already be seen in the scalar case where m and n are equal to 1 (so that V, W, and H reduce to scalars v, w, and h). Then the objective would be:
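$$F(w, h) \;=\; \tfrac{1}{2}\,(v - wh)^{2}, \qquad w, h \ge 0$$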
The gradient of this objective function would be:
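$$\frac{\partial F}{\partial w} \;=\; -(v - wh)\,h, \qquad \frac{\partial F}{\partial h} \;=\; -(v - wh)\,w$$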
And, likewise, the Hessian would be:
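$$\nabla^{2} F \;=\; \begin{pmatrix} h^{2} & 2wh - v \\ 2wh - v & w^{2} \end{pmatrix}$$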
Here, it can easily be observed that the Hessian is not positive semi-definite for all values of w and h; for instance, at w = h = 0 with v > 0, its determinant is −v² < 0.
Based on the knowledge of gradient descent, taking small steps in the direction opposite to the gradient should lead us toward a minimum of the objective function:
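$$H \leftarrow H + \alpha \,\bigl[W^{T}V - W^{T}WH\bigr], \qquad W \leftarrow W + \beta \,\bigl[VH^{T} - WHH^{T}\bigr]$$

The bracketed terms are the negative gradients of the objective with respect to H and W.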
Here α and β are the learning rates. In principle, these learning rates can be chosen as:
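$$\alpha \;=\; \frac{H}{W^{T}WH}, \qquad \beta \;=\; \frac{W}{WHH^{T}}$$

where the divisions are element-wise.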
Substituting these learning rates and the gradients into the additive rules, the multiplicative update rules become:
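$$H \leftarrow H \odot \frac{W^{T}V}{W^{T}WH}, \qquad W \leftarrow W \odot \frac{VH^{T}}{WHH^{T}}$$

Here ⊙ and the fraction bars denote element-wise multiplication and division, which are exactly the operations carried out in the code below.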
Starting from non-negative initial conditions for W and H, iterating these update rules on a non-negative V finds an approximate factorization V ≈ WH by converging to a local minimum of the objective function.
Implementation
To learn the facial features we use the LFW (Labeled Faces in the Wild) dataset, a database of face photographs designed for studying the problem of unconstrained face recognition. The dataset contains more than 13,000 facial images collected from the web.
The code for NMF is as follows. The W and H matrices are initialized with values drawn from a uniform distribution. The number of iterations is set to 500 by default; however, about 50 iterations are usually enough to reach an appreciable convergence.
#---------- NMF ---------------
# X: non-negative data matrix (N x M), K: number of basis components,
# ite: number of multiplicative-update iterations, verbose: print progress
NMF<-function(X, K, ite = 500, verbose = FALSE){
  N <- nrow(X); M <- ncol(X)
  W <- matrix(runif(N*K), nrow = N)   # basis matrix, initialized uniformly at random
  H <- matrix(runif(K*M), nrow = K)   # encoding matrix, initialized uniformly at random
  loss <- numeric(ite)
  #-------Update------------#
  for(i in 1:ite){
    H_num  <- t(W)%*%X
    H_deno <- t(W)%*%W%*%H + 1e-10    # small constant avoids division by zero
    H <- H*(H_num/H_deno)             # multiplicative update of H
    W_num  <- X%*%t(H)
    W_deno <- W%*%H%*%t(H) + 1e-10
    W <- W*(W_num/W_deno)             # multiplicative update of W
    loss[i] <- sum((X - W%*%H)^2)     # squared Frobenius reconstruction error
    if(i %% 500 == 0 && isTRUE(verbose)){
      print(paste("iteration----------", i, sep = " "))
      print(loss[i])
    }
  }
  return(list(Basis = W, Encoding = H, loss = loss))
}
Here, we will use only 500 randomly selected images. In step one, the facial images were scaled to 150 × 150 pixels. Then each image’s pixels were standardized so that the mean and standard deviation become equal to 0.25. Finally, the image pixels were rescaled (min-max) to the range [0, 1] to ensure non-negativity.
In step 2, each image’s pixels were flattened into a column vector. Combining all the flattened images as columns, we get our data matrix V.
In step 3, NMF was performed over V with the iterative algorithm described above with randomly initialized W and H.
setwd(".mediumNMFlfw")
set.seed(123)
main.dir<-".mediumNMFlfw"
my.list<-list(NULL)
index<-1
for(j in seq_along(dir())){ #reading all the image and storing into a list
paste(main.dir,"",dir()[j], sep="")%>%setwd()
for(i in seq_along(dir())){
my.list[[index]]<-readImage(dir()[i])
index<-index+1
}
setwd(main.dir)
}
#-------------------------Functions------------------------#
zscale<-function(x, u, sigma){ (x-mean(x, na.rm=T))*sigma/sd(x, na.rm = T) + u} #standardization function
minmax<-function(x){ (x-min(x, na.rm = T))/(max(x, na.rm = T)-min(x, na.rm = T)) } #min max function
#---------------------------------#
images <- 500   # number of images to be selected from the database
pixel  <- 150   # images are resized to pixel x pixel
sample.image <- sample(c(1:length(my.list)), size = images, replace = F)
my.new.list  <- my.list[c(sample.image)]
#-------Step 1: grey-scale, resize, standardize, rescale to [0, 1]-------#
for(i in 1:images){
  my.new.list[[i]] <- c(channel(my.new.list[[i]], "grey") %>% resize(pixel, pixel)) %>%
    zscale(u = 0.25, sigma = 0.25) %>%
    minmax()
}
#---------------Step 2: flatten each image into one column of V--------------------------#
V <- unlist(my.new.list) %>% matrix(nrow = pixel*pixel, byrow = F)
#---------------Step 3: factorize V into 100 basis images------------------#
a <- NMF(X = V, K = 100, ite = 1000, verbose = F)
The NMF performs both learning and inference simultaneously. That is, it both learns a set of basis images and infers values for the hidden variables from the visible variables.
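As a minimal sketch of inspecting the result (using the objects a and pixel created above, and base R’s image() purely for a quick look), each column of the returned Basis matrix can be reshaped back into a 150 × 150 image, and a face is approximately reconstructed as Basis %*% Encoding[, j]:

# Reshape the first learned basis column back into a pixel x pixel "part" image
basis.img <- matrix(a$Basis[, 1], nrow = pixel, ncol = pixel)
# Approximate reconstruction of the first face as a linear combination of parts
recon.img <- matrix((a$Basis %*% a$Encoding)[, 1], nrow = pixel, ncol = pixel)
image(basis.img, col = grey.colors(256))   # learned feature
image(recon.img, col = grey.colors(256))   # reconstructed face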
Limitations
Although NMF is successful in learning facial parts, this does not imply that the method can learn parts from any database, such as images of objects viewed from extremely different angles or highly articulated objects. Although non-negativity constraints can help such models learn part-based representations, the inventors of NMF do not claim that they are sufficient in and of themselves. Also, NMF does not learn anything about the "syntactic" relationships between parts: it assumes the hidden variables are non-negative but makes no further assumptions about their statistical dependencies. One must also recognize that a technology that can remotely identify or classify people without their knowledge is fundamentally dangerous. Follow this for a better understanding of the ethical ramifications of facial recognition and feature-learning models.
I hope this read shed some light on NMF and that you, as a reader, enjoyed the content flow. Access the code from my GitHub. Stick around for more of the HANDS-ON VANILLA MODELLING series. Cheers!
Logistic Regression Explained from Scratch (Visually, Mathematically, and Programmatically)
Understanding a Neural Network through scratch coding in R; A novice guide
A Layman’s Guide to Building Your First Image Classification Model in R Using Keras
Resources
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.