A primer on word embeddings

Statistical Learning Theory

The basis for neural networks

Jon Gimpel
Towards Data Science
14 min read · Jul 26, 2022


Birds optimizing flight with a V formation (photo by Howie Mapson on Unsplash)

This article is the 3ʳᵈ in the series A primer on word embeddings:
1. What’s Behind Word2vec | 2. Words into Vectors |
3. Statistical Learning Theory | 4. The Word2vec Classifier |
5. The Word2vec Hyperparameters | 6. Characteristics of Word Embeddings

In this article, we’ll review how a linear statistical model works, how it can be generalized into a classification model, and how machine learning using a simple neural network can be used to determine the coefficients for these models.

In the previous article, Words into Vectors, we saw how word distribution data for a corpus can be tabulated in a matrix, reweighted to increase the value of the information provided for a particular application, and dimensionally reduced to shorten the word vector length. We also looked at distance measures to compare word vectors.

Johnson refers to these count-based approaches as the “statistical revolution” in NLP (Johnson, 2009). But the statistical revolution didn’t end there. As machine learning and AI techniques developed and computing power grew, opportunities to apply statistical learning concepts to NLP grew.

Before we walk through the machine learning method employed by Word2vec in the next article in this series, The Word2vec Classifier, we’ll first examine how statistics applies to machine learning and establish the nomenclature of machine learning. Machine learning is based on statistical learning theory (Stewart, 2019), but machine learning’s terminology can be quite different from that of statistics.

Statistics and Machine Learning

Statistics is the mathematical study of data. Using statistics, an interpretable statistical model is created to describe the data, and this model can then be used to infer something about the data or even to predict values that are not present in the sample data used to create the model. The ‘accuracy’ of prediction is not the focus of statistics.

Machine learning, on the other hand, is about results. It uses the data and statistical mathematics primarily for their predictive power. In machine learning, results are the focus more than the interpretability of the model. Often the underlying statistical model is considered irrelevant (that is, a ‘black box’) as long as the predictive results are useful. As Domingos (2012) puts it, “Machine learning systems automatically learn programs from data.”

Since machine learning is so capable of modeling the data, a challenge is to avoid overfitting (Rojas, 1996). The model should operate sufficiently well to produce an accurate prediction without being so specifically tailored to the sampled data that the model predicts poorly on new data.

To avoid overfitting the data when using machine learning methods, and indeed often in statistics, a portion of the observed data (known as the test set) is separated out to confirm the strength of the model built from the remaining majority of the data (known as the training set). Often, a validation set within the training set is used to assess the predictive model before it is confirmed on the test set.

Stewart summarizes the different approaches taken by machine learning and statistics nicely as follows:

“It should be clear that these two approaches are different in their goal, despite using similar means to get there. The assessment of the machine learning algorithm uses a test set to validate its accuracy. Whereas, for a statistical model, analysis of the regression parameters via confidence intervals, significance tests, and other tests can be used to assess the model’s legitimacy.” (Stewart, 2019)

Word2vec’s shallow neural network and particular learning algorithm will be discussed in the fourth article in this series, The Word2vec Classifier. To understand the concepts and terms of machine learning and neural networks from the statistician’s point of view, we’ll review how linear regression is performed using machine learning and how that process is applied to logistic regression using a neural network.

Linear Regression in Statistics

For a linear regression of statistical data with multiple predictors, let’s begin with a linear equation to represent the relationship between y=(yᵢ) and X=(xᵢⱼ):
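
yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₚxᵢₚ + εᵢ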

where yᵢ is the dependent response variable and xᵢⱼ are the observed values of each independent variable j, of which there are p, for each statistical unit i, of which there are n. The error term is εᵢ. The coefficients are βⱼ, of which there are p+1 (including the intercept β₀).

Here’s a view of linear data when there is one predictor variable (p=1).

Linear regression, bivariate: a straight line approximating a series of data points (image by author)

We can also use vectors and matrices to represent the linear equation. The vector y = (y₁,…,yᵢ,…,yₙ)⊤ represents the values taken by the response variable. X of dimension n×(p+1) is the matrix of xᵢⱼ predictor values, with the first column defined as a constant, meaning that xᵢ₀ ≔ 1.

Representing the linear equation with vectors and matrices gives us:
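
y = Xβ + ε

where β = (β₀, β₁, …, βₚ)⊤ is the coefficient vector and ε = (ε₁, …, εₙ)⊤ is the vector of error terms.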

For the linear regression of y on X with error vector ε, the coefficient vector β is derived by minimizing the sum of squares of the residuals, or errors:
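
∑ᵢ₌₁ⁿ εᵢ² = ∑ᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ₁ − ⋯ − βₚxᵢₚ)²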

Or in vector and matrix form:
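
ε⊤ε = (y − Xβ)⊤(y − Xβ)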

Taking the partial derivatives with respect to the vector β and setting them equal to zero yields the value of β that minimizes the sum of squares, which we will name β̂ₒₗₛ because we are using the ordinary least squares (OLS) method to derive the estimator for β:
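
β̂ₒₗₛ = (X⊤X)⁻¹X⊤y

(assuming X⊤X is invertible, which holds when X is full rank)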

At this value, β̂ₒₗₛ is a true minimum of the sum of squares because the Hessian matrix of second derivatives is positive definite.

From β̂ₒₗₛ, we can compute the predicted values of y, denoted ŷ, using the following equation:
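
ŷ = Xβ̂ₒₗₛ = X(X⊤X)⁻¹X⊤y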

Statisticians use the above geometric derivations when investigating a linear statistical model, a model that is tested before being put to use to make predictions. An example basic model is (Tillé, 2019):
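
y = Xβ + ε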

where this model is formalized as follows:

  • y is the vector of the n observed outcomes, treated as constants
  • X is a n×(p+1) full-rank matrix of non-random constants containing the observed independent data, xᵢⱼ, with an added first column of 1s
  • β is the vector of the p+1 unknown coefficients (the parameters to be estimated) in ℝᵖ⁺¹
  • ε is a vector of size n containing unknown random variables, or error terms, εᵢ

Typical hypotheses of the model are as follows:

  • The matrix X is not random and is full rank. If the matrix X is not full rank, then at least one of the columns of the matrix (that is, of the covariates) can be written as a linear combination of other columns, suggesting a reconsideration of the data
  • The expectation of the error terms is zero: 𝔼(ε) = 0
  • The variance of the error terms is constant: Var(εᵢ) = σ² for all i, that is, homoscedastic
  • The covariance of the error terms is zero: Cov(εᵢ, εⱼ) = 0 for all i ≠ j

The Gauss-Markov theorem states that, under these hypotheses on the error terms (normality is not required), the ordinary-least-squares estimator of β is the best linear unbiased estimator. So we get:
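
β̂ = β̂ₒₗₛ = (X⊤X)⁻¹X⊤y, with Var(β̂ₒₗₛ) = σ²(X⊤X)⁻¹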

A prediction can then be made for a new set of independent variables x:
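
ŷ = xβ̂ₒₗₛ

where the new vector x of observed values includes a leading 1 for the intercept term.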

After testing to ensure that the model fits the data, statistical theory then defines other important values, such as confidence intervals for the estimators, based on their variance, and prediction intervals for the model’s predictions.

Logistic Regression in Statistics

We can generalize the above linear statistical model, with Normal (Gaussian) error terms, via mathematical transformations into the generalized linear model (GLM) in statistics, allowing regression, estimator tests, and analysis of the exponential family of conditional distributions of y given X, such as Binomial, Multinomial, Exponential, Gamma, and Poisson.

The parameters are estimated using the maximum likelihood method. For logistic regression, when there is a binomial response, y ∈ {0,1}, the logistic function defines the probability of a successful outcome, 𝜋 = P(y=1|x), where x is the vector of observed predictive variables, of which there are p. If β is the vector of the unknown coefficients, of which there are p+1, and using z = xβ, then (Matei, 2019):
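
𝜋 = P(y=1|x) = eᶻ / (1 + eᶻ)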

Logistic Function
(Qef, Public domain, via Wikimedia Commons)

We can apply this function via the log-odds, or logit, to a linear model as follows:
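
logit(𝜋ᵢ) = log( 𝜋ᵢ / (1 − 𝜋ᵢ) ) = xᵢβ = β₀ + β₁xᵢ₁ + ⋯ + βₚxᵢₚ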

where 𝜋ᵢ = P(yᵢ=1|xᵢ) and xᵢ is the vector of observed values for the ith observation, of which there are n. The logistic function above allows us to apply the theory behind linear regression to a probability, between 0 and 1, of predicting a successful outcome. Statistical tests and data measures, such as deviance, goodness-of-fit measures, the Wald test, and the Pearson 𝜒² statistic, can be applied using this model.

Machine learning has its own nomenclature for these equations, as we’ll see in the next section.

Linear Regression using Machine Learning

From the machine learning point of view, predictive models are often considered too complicated or computationally intensive to solve analytically. Instead, very small steps are taken on portions of the data, cycling through the data iteratively to converge on the solution.

We’ll walk through the solution to linear regression using machine learning. Before we proceed, however, it is important to understand that in machine learning the function to be solved isn’t typically predefined. In our case, we already know that we want to perform only a linear regression, but typically in machine learning various models (or functions) of the data are compared until the best trade-off between being too general and imprecise, on the one hand, and overfitting the data, on the other hand, is found empirically.

In the case of solving for linear regression using machine learning, we want to find the regression coefficients on the complete data set, so we begin with the same observed data X and y defined in the linear regression model in the section above.

The objective function to be minimized is the sum of the squared residuals from ordinary least squares, which we will use in the machine learning algorithm as the loss function, L. This is more generally called the cost function, J(θ), where θ represents the parameter values being optimized. For linear regression, the parameter values θ are the values of the vector β.

Note that in machine learning, to help normalize and compare models, one typically minimizes the mean squared error, which is 1/n of the sum of the squared error values we derived in the previous section. For our linear regression case, we’ll continue with the sum of the squared error values, noting that the constant 1/n won’t affect the predicted β coefficient values, and therefore can be disregarded (Aggarwal, 2018):
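
J(θ) = L(β) = ∑ᵢ₌₁ⁿ (yᵢ − xᵢβ)²

where xᵢ denotes the ith row of X.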

Taking the derivative of the loss function and setting it equal to zero yields the coefficient values, but we’re going to perform the calculation stepwise, with one calculation for each training instance since the machine learning algorithm will pass through the data multiple times.

To find the direction of the stepwise updates, we’ll take the derivative of the loss function, and use that direction to move our learning a step towards the minimum:
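
θ ← θ − 𝛼 ∂J(θ)/∂θ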

This process is known as gradient descent, and 𝛼, known as the learning rate, defines the length of each small step.

In machine learning, we consider the training pairs (x₁,y₁), …, (xₙ,yₙ) and we cycle through updates to each pair multiple times when optimizing θ. Let’s look at the derivative of the squared error for each training instance:
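
∂/∂βⱼ (yᵢ − xᵢβ)² = −2 (yᵢ − xᵢβ) xᵢⱼ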

This equation gives us the direction in which to move the β values, which are also known as weights, towards their minimum. The constant 2 is typically disregarded because it doesn’t affect the optimal values of β (Ng, 2018). So in our case, for each mth training instance, β is updated as follows:
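
βⱼ ← βⱼ + 𝛼 (y⁽ᵐ⁾ − x⁽ᵐ⁾β) xⱼ⁽ᵐ⁾,  for j = 0, …, p

where (x⁽ᵐ⁾, y⁽ᵐ⁾) is the mth training instance.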

We start the learning process by assigning random values to each weight in β and then begin the algorithm. The learning rate 𝛼 should be set so that progress towards the minimum of the loss function is sufficiently fast without overshooting and making the minimum impossible to reach. A dynamic learning rate, where 𝛼 is reduced as the function nears its minimum, is often implemented.

Assuming a well-chosen learning rate, this machine learning algorithm will calculate the values of the coefficients β as precisely as desired, reaching the same values derived mathematically in the section above.
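
To make this concrete, here is a minimal NumPy sketch of the stochastic gradient descent loop described above; the generated data set, learning rate, and number of passes are illustrative assumptions rather than values from this series:

import numpy as np

# A minimal sketch of per-instance (stochastic) gradient descent for linear regression.
rng = np.random.default_rng(0)

n, p = 100, 2                                 # n observations, p predictors
X = np.hstack([np.ones((n, 1)),               # first column of 1s for the intercept (x_i0 := 1)
               rng.normal(size=(n, p))])
true_beta = np.array([2.0, -1.0, 0.5])        # illustrative "true" coefficients
y = X @ true_beta + rng.normal(scale=0.1, size=n)   # linear data with noise

beta = rng.normal(size=p + 1)                 # start from random weight values
alpha = 0.01                                  # learning rate

for epoch in range(200):                      # multiple passes through the data
    for i in range(n):                        # one small update per training instance
        error = y[i] - X[i] @ beta            # residual for this instance
        beta += alpha * error * X[i]          # step the weights against the gradient

print(beta)   # approaches the OLS solution np.linalg.solve(X.T @ X, X.T @ y)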

Logistic Regression with a Neural Network

The idea of neural networks came from how neurons work in living animals: a nerve signal is either amplified or dampened by each neuron it passes through, and it is the combined action of many neurons in series and in parallel, each filtering multiple inputs and feeding its signal on to further neurons, that eventually produces the desired output. The feed-forward neural network is the simplest form of a neural network, where the calculation is done only in the forward direction, from input to output.

Neural networks allow for the use of multiple layers of neurons, where each layer provides specific functions. A simple linear regression neural network, however, can be constructed with a single layer of neurons operating linearly.

The figure below shows the framework for a simple feed-forward neural network that provides logistic regression:

Neural Network Framework for a Binomial Classifier
(image by author, inspired by Raschka, 2020)

In a simple feed-forward neural network for classification, the weights wⱼ and ‘bias’ term w₀ represent the coefficients of β from the linear regression method and are trained by the network using the Error (ε) as shown in the figure.

The general neural network function takes the following form (Bishop, 2006):
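
y(x, w) = f( ∑ⱼ wⱼ φⱼ(x) )

where the bias term w₀ can be included by defining a constant basis function φ₀(x) = 1.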

where f(·) is a nonlinear activation function and φⱼ(x) is a basis function. The basis function can transform the inputs x before the weights w are applied. In the case of logistic regression, the basis function is simply the identity, φⱼ(x) = xⱼ, so that the inputs remain linear.

The activation function f(·) is likewise the identity for linear regression. With logistic regression, however, a specific activation function is needed to convert the output of the linearly weighted inputs into a predicted probability of the binomial response, 0 or 1. That activation function is the sigmoid function, which is equivalent to the logistic function defined for logistic regression in statistics. The sigmoid function, in contrast to the logistic function 𝜋(z), is rearranged to have only one exponential term to simplify programming, as shown in the following equation:
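
σ(z) = 1 / (1 + e⁻ᶻ)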

where z = xβ. The sigmoid activation function provides the probability of the prediction.

More generally in machine learning, however, a nonlinear activation function is used when we do not need the prediction to depend linearly on the inputs. In this case, a variety of activation functions can be tested.

To train the weights w, at each step the neural network algorithm calculates the Error value, which is the difference between the calculated prediction and the actual outcome. Using backpropagation, the weights are then updated according to the learning rate.
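
As a concrete illustration, here is a minimal sketch of such a single-layer binary classifier: a sigmoid activation applied to a weighted sum of inputs, with per-instance weight updates driven by the Error value. The generated data, learning rate, and number of passes are illustrative assumptions; the update rule shown corresponds to gradient descent on the logistic (cross-entropy) loss:

import numpy as np

def sigmoid(z):
    # the sigmoid activation function, converting a linear score to a probability
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, p = 200, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])          # bias input x_0 := 1
y = (X @ np.array([0.5, 2.0, -1.0]) + rng.normal(scale=0.5, size=n) > 0).astype(float)

w = rng.normal(size=p + 1)            # weights w_0 (bias), w_1, ..., w_p
alpha = 0.1                           # learning rate

for epoch in range(100):
    for i in range(n):
        y_hat = sigmoid(X[i] @ w)     # forward pass: predicted probability of class 1
        error = y[i] - y_hat          # Error = actual outcome - prediction
        w += alpha * error * X[i]     # update the weights according to the learning rate

print(w)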

We’ll go into more detail on backpropagation in the next article in this series.

Multinomial Logistic Regression

Previously we used the generalized linear model in statistics to expand linear regression to logistic regression for a binomial response. We can do a similar transformation for situations where the response is multinomial, i.e., multiclass. The key difference is that instead of using the sigmoid activation function to provide a probability for the prediction, the softmax function is used:
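
softmax(z)ₖ = e^(zₖ) / ∑ⱼ e^(zⱼ),  for k = 1, …, K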

where z = xβ and K is the number of classes.

The neural network model for multinomial logistic regression works similarly to binary logistic regression. The softmax function is more computationally intensive than the sigmoid function.

Nonlinear Applications of Neural Networks

As mentioned earlier, the generalized linear model (GLM) in statistics allows regression of the exponential family of Binomial and Multinomial distributions, providing prediction confidence intervals and other statistical tests based on theory.

But how do we get confidence intervals for predictions and other statistics when the neural network is generalized to predict nonlinearly? In this situation, computational methods including bootstrap, jackknife, and cross-validation can be applied (Rojas, 1996).

Summary

In this article, we learned how linear regression can be generalized to predict a binary or multiclass response, and how machine learning can be used to derive the prediction parameters, using the example of a shallow neural network.

We also learned that machine learning is more generally used to automatically find the best function (usually nonlinear) to predict an output, whereas statistics generally attempts to validate a (usually simpler) model of the data and uses that model to make predictions.

In the next article, The Word2vec Classifier, we’ll look at how Word2vec leverages these concepts to train its word embeddings.

This article was the 3ʳᵈ in the series A primer on word embeddings:
1. What’s Behind Word2vec | 2. Words into Vectors |
3. Statistical Learning Theory | 4. The Word2vec Classifier |
5. The Word2vec Hyperparameters | 6. Characteristics of Word Embeddings

More on this Topic: A resource I recommend for learning more about the building blocks of machine learning is this online computer science class at Stanford University: Ng, A. (2018). CS229 Machine Learning.

References

Aggarwal, C. (2018). Neural Networks and Deep Learning: A Textbook. Cham, Switzerland: Springer International Publishing.

Bishop, C. (2006). Pattern Recognition and Machine Learning. New York, New York: Springer Science+Business Media.

Domingos, P. (2012). A Few Useful Things to Know About Machine Learning. Communications of the ACM, 55(10):78–87.

Johnson, M. (2009). How the Statistical Revolution Changes (Computational) Linguistics. Proceedings of the European Chapter of the Association of Computational Linguistics 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious, or Vacuous?, pages 3–11. PDF.

Matei, A. (2019). Generalized Linear Model. Statistics course. Neuchâtel, Switzerland: University of Neuchâtel.

Ng, A. (2018). CS229 Machine Learning, Online computer science course. Stanford, California: Stanford University.

Raschka, S. (n.d.). What Is the Relation between Logistic Regression and Neural Networks and When to Use Which? Sebastian Raschka.

Rojas, R. (1996). Neural Networks: A Systematic Introduction. Berlin, Germany: Springer-Verlag.

Stewart, M. (2019). The Actual Difference Between Statistics and Machine Learning. Towards Data Science.

Tillé, Y. (2019). Advanced Regression Methods. Statistics course. Neuchâtel, Switzerland: University of Neuchâtel.

*Figures and images are by the author, unless otherwise noted.


