Fitting Without Training
In this article a surprising result is demonstrated using the neural tangent kernel [1,2]. This kernel is used to perform kernel regression. The surprising thing is that the accuracy of that regression is independent of the accuracy of the underlying network.
In order to get to that result, and appreciate the surprise it entails, we need to work through the construction of the kernel and how it is used in a kernel regression model.
Gradient kernel regression, because it can be constructed without the hard work of training by gradient descent, can serve as a useful tool for model architecture design. This is demonstrated with an example using transfer learning.
Gradient Kernel
In [1] the author introduces what he terms the path kernel in the context of supervised learning via deep neural networks trained using stochastic gradient descent. The path kernel measures the similarity of two input points, _x_ᵢ and _x_ⱼ, as
K_{\text{path}}(x_i, x_j) = \int_{c(t)} \nabla_w f(x_i; w) \cdot \nabla_w f(x_j; w) \, dt
the inner product between the gradients of the network evaluated at the two points, integrated over c(t), the path taken by the parameters w of the network f during gradient descent. The path kernel is a special case of the neural tangent kernel [3],
K_{\text{NTK}}(x_i, x_j) = \sum_p \frac{\partial f(x_i; w)}{\partial w_p} \, \frac{\partial f(x_j; w)}{\partial w_p}
where the parameters over which the sum occurs, _w_ₚ, are more general.
Given a data set {_x_ᵢ, _y_ᵢ} for 1 ≤ i ≤ n, and a model f(x; w) parameterized by w (for example a neural network), the gradient kernel at w, _x_ᵢ, _x_ⱼ is calculated as,
K_w(x_i, x_j) = \nabla_w f(x_i; w) \cdot \nabla_w f(x_j; w)
This is simply the neural tangent kernel evaluated at a single value of the parameter vector w. The term gradient kernel is used since there is no specific requirement that the underlying model be a neural network. Note that we are working with univariate (scalar-valued) functions f (e.g. a binary classifier); however, it is straightforward to extend the methods described to multivariate f.
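As a concrete illustration, the sketch below computes this kernel for a pair of inputs with PyTorch autograd. It is a minimal sketch, not the code used for the experiments; `model`, `x_i`, and `x_j` are placeholder names, and the model is assumed to produce a scalar output.

```python
import torch

def gradient_kernel(model, x_i, x_j):
    """K_w(x_i, x_j): inner product of the parameter gradients of the
    (scalar) model output evaluated at the two inputs."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Two independent forward/backward passes, one per input point.
    g_i = torch.autograd.grad(model(x_i).sum(), params)
    g_j = torch.autograd.grad(model(x_j).sum(), params)
    # Sum of elementwise products over all parameter tensors = inner product.
    return sum((a * b).sum() for a, b in zip(g_i, g_j)).item()
```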
Gradient Kernel Regression
Once in possession of a kernel function it is immediately tempting to use that kernel function in a regression context. That is, to fit a function of the form,
g(x) = \sum_j \beta_j K(x, x_j)
to a data set. However, there are a few practical questions that need to be addressed.
Modern neural network models can have data sets with millions of examples, and forming the full kernel matrix requires O(n²) space. A well-known way to reduce this requirement is to select a much smaller subset of examples to serve as basis examples, and that is the approach taken here. The selection of the basis examples is an interesting and deep research topic in itself, but that is ignored here and the basis examples are simply selected at random from the set of training examples.
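One simple way to realize this is sketched below: compute a flattened gradient feature vector for each example, pick a random basis subset, and take inner products. The names (`model`, `x_train`) and sizes are illustrative assumptions rather than the article's actual code, and materializing the full feature matrix is only practical for modestly sized models.

```python
import numpy as np
import torch

def gradient_features(model, xs):
    """One row per example: the flattened gradient of the scalar model
    output with respect to all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x in xs:
        grads = torch.autograd.grad(model(x.unsqueeze(0)).sum(), params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]).detach().numpy())
    return np.stack(rows)

# Random basis selection from the training set (sizes are illustrative).
rng = np.random.default_rng(0)
basis_idx = rng.choice(len(x_train), size=100, replace=False)

G_train = gradient_features(model, x_train)   # shape (n_train, n_params)
G_basis = G_train[basis_idx]                  # shape (n_basis, n_params)
# Gradient kernel matrix between training and basis examples.
# (For large models this matrix is big; it can also be accumulated row by row.)
K = G_train @ G_basis.T                       # shape (n_train, n_basis)
```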
We are interested in examining how a kernel regression using the gradient kernel performs as the parameters w are modified by gradient descent applied to the network. Unfortunately, gradient descent drives the gradient kernel into an ill-conditioned and numerically unstable regime. To combat this, the gradient kernel is not used directly, but rather it is normalized into cosine similarity form,
\tilde{K}_w(x_i, x_j) = \frac{K_w(x_i, x_j)}{\sqrt{K_w(x_i, x_i) \, K_w(x_j, x_j)}}
which has much better numerical properties.
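A small sketch of that normalization, assuming a rectangular kernel block between training and basis points (the helper and names are illustrative):

```python
import numpy as np

def cosine_normalize(K_cross, diag_rows, diag_cols):
    """K(x_i, x_j) / sqrt(K(x_i, x_i) * K(x_j, x_j)) for a rectangular kernel
    block, given the diagonal (self-similarity) values for rows and columns."""
    return K_cross / np.sqrt(np.outer(diag_rows, diag_cols))

# With gradient feature rows G_train / G_basis as in the previous sketch,
# the diagonal entries are just squared row norms:
# K_tilde = cosine_normalize(G_train @ G_basis.T,
#                            (G_train ** 2).sum(axis=1),
#                            (G_basis ** 2).sum(axis=1))
```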
Gradient kernel regression is performed and evaluated as follows. Given a data set of example points {_x_ᵢ, _y_ᵢ} and a function f(x; w) parameterized by w, we separate the examples into training and testing sets. A set of basis examples is chosen from the training set. The linear regression model,
g(x) = \sum_{j \in \text{basis}} \beta_j \tilde{K}_w(x, x_j),
is fit by least squares to the training data,
\hat{\beta} = \arg\min_{\beta} \sum_{i \in \text{train}} \big( y_i - g(x_i) \big)^2,
and its error is measured on the test data,
\text{Err}(w) = \frac{1}{|\text{test}|} \sum_{i \in \text{test}} \big( y_i - g(x_i) \big)^2.
The dependence of this error on the underlying model parameters w is what is of interest.
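A minimal NumPy sketch of the fit-and-evaluate step is below; treating the test error as mean squared error is an assumption, since the text does not specify the exact error measure.

```python
import numpy as np

def fit_and_score(K_train, y_train, K_test, y_test):
    """Least-squares fit of y on the normalized kernel columns, scored by
    mean squared error on held-out data (squared error is an assumption)."""
    beta, *_ = np.linalg.lstsq(K_train, y_train, rcond=None)
    test_err = np.mean((y_test - K_test @ beta) ** 2)
    return beta, test_err
```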
MNIST Example
The well-known MNIST data set [4] is used to study the performance of gradient kernel regression. The MNIST data consists of images of handwritten digits and a class label for each image.
The underlying neural network model was taken from the examples provided with the PyTorch machine learning library [5]. It consists of two convolutional layers and two fully connected layers. As given, this model produces a 10-dimensional output corresponding to the multi-class digit classification problem; here it was changed to a binary classifier. The data were restricted to two digits ("1" and "7"), and the classification task was to distinguish a 1 from a 7.
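A sketch of the modified network is shown below. The layer widths follow the PyTorch MNIST example, and the single sigmoid output is an assumption consistent with the 0.5 threshold used later; the article does not list these details explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryNet(nn.Module):
    """Two convolutional and two fully connected layers, ending in a single
    output for the 1-vs-7 task instead of the original 10-way classifier."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 1)   # single output instead of 10 classes

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        # Output in [0, 1], thresholded at 0.5 for classification (assumption).
        return torch.sigmoid(self.fc2(x))
```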
For the experiments, 1000 training examples and 1000 testing examples were selected. Both sets were balanced, with 500 1’s and 500 7’s in each. The basis examples were chosen at random from the training set, and these too were balanced, with 50 1’s and 50 7’s. The network was initialized with random values using the default PyTorch settings.
Repeated epochs of gradient descent were performed. Each epoch corresponds to a complete pass through the training data; in this case that involved 10 gradient descent steps, each driven by a random batch of 100 training examples. For each epoch a linear regression was fit to the training data and tested on the testing data as described in the previous section. The test result was summarized through an accuracy score, which assigned class "1" to examples where the regression function was greater than 0.5 and class "7" to examples where it was less than 0.5.
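In code, that accuracy score is a simple threshold at 0.5; the label coding (1 for digit "1", 0 for digit "7") is an assumption.

```python
import numpy as np

def accuracy(regression_values, labels, threshold=0.5):
    """Fraction of examples whose thresholded regression value matches the
    label (label 1 = digit "1", label 0 = digit "7" under the assumed coding)."""
    predictions = (regression_values > threshold).astype(labels.dtype)
    return float(np.mean(predictions == labels))
```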

Surprisingly, the accuracy of the gradient kernel regression is independent of the accuracy of the underlying network. The underlying network (blue line) starts with random parameters and so has a random 50% accuracy at the start. But the gradient kernel regression (orange line) is working at the full 99% accuracy right from the start. The underlying network gets progressively better as the gradient descent epochs proceed, but at its best it only matches the level of accuracy that the gradient kernel regression holds over the whole path.
The accuracy of the gradient kernel regression does not depend on the quality of the underlying model parameters. It works as well for random parameter settings as it does for trained parameter settings.
CIFAR10 Example
The result from the previous section shows that gradient kernel regression can reveal the accuracy inherent in a network without requiring training and all the time, effort, and uncertainty that entails. Because of this it can serve as a powerful tool in designing a network architecture.
For example, Transfer Learning is a method that takes an existing model and uses it as the starting point for a new model. Specifically, a large complex deep neural network trained on millions of examples can be modified by replacing its final layer with a layer customized to a new problem. Then only this new layer is trained on some smaller set of data. Gradient kernel regression can be used to explore the possible forms of this final layer efficiently.
Here transfer learning is applied to the ResNet-50 [6] deep neural network (as provided in PyTorch). The final layer of the ResNet-50 network is replaced with two fully connected layers and the final output is modified to be a binary classifier.
This modified network is trained to classify bird versus cat images from the CIFAR10 [7] data set of images (also provided in PyTorch). Transfer learning provides an interesting use case for the gradient kernel regression method. The gradient for very wide or deep networks can involve a huge number of parameters (ResNet-50 has over 20 million). Taking inner products across this many parameters for each training example is computationally expensive (although, it must be noted, much less expensive than gradient descent training). Transfer learning only trains the additional layers added to the network; for the example here, that is just over 1 million parameters.
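A sketch of this modification using torchvision's ResNet-50 is below. The hidden width of 512 is an assumption, chosen so that the new head has just over one million parameters.

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained ResNet-50 and freeze its original parameters.
backbone = models.resnet50(pretrained=True)
for p in backbone.parameters():
    p.requires_grad = False

# Replace the final layer with two fully connected layers and a single
# (binary) output; only these layers are trained / differentiated.
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 512),  # 2048 -> 512 (hidden width assumed)
    nn.ReLU(),
    nn.Linear(512, 1),
)

# The gradient kernel for the transfer-learning setup is taken over just
# these head parameters (~1M) rather than all of ResNet-50's 20M+.
head_params = [p for p in backbone.parameters() if p.requires_grad]
```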
The experimental setup is the same as for the MNIST example: 1000 training and 1000 testing examples were randomly selected from the CIFAR10 data. Both sets were balanced, with 500 bird and 500 cat images each. 100 training examples were taken as basis examples, also balanced with 50 bird and 50 cat images. Again, 9 gradient descent epochs were performed and the kernels were constructed at each epoch.

The plot shows the accuracy for the kernel regressions and the neural network by epoch. As before, the gradient kernel regression dominates the performance of the trained neural network.
The accuracy of the gradient kernel regression can be used as an easily computed benchmark for comparing different architectures of the trained layer in the transfer learning problem.
Conclusion
The examples presented here demonstrate that gradient kernel regression can result in models with performance as good as, or better than, that obtained by actually going through the gradient descent training process.
Gradient kernel regression provides a mechanism for testing the performance of a network without going through the gradient descent training process. This sidesteps a number of complexities that arise when training using gradient descent. Learning rate selection and scheduling, stopping rules, and lack of convergence all go away when using a simple linear regression based on the gradient kernel.
[1] Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
[2] Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks
[3] Neural Tangent Kernel: Convergence and Generalization in Neural Networks
[4] Gradient-Based Learning Applied to Document Recognition (MNIST)
[5] PyTorch: An Imperative Style, High-Performance Deep Learning Library
[6] Deep Residual Learning for Image Recognition (ResNet-50)
[7] Learning Multiple Layers of Features from Tiny Images (CIFAR10)