"I have not failed. I’ve just found 10,000 ways that won’t work." – Edison

Science is messy. I don’t think people outside the science field appreciate the ratio of failures to successes. In my work, I very often fail to develop an idea to completion. Sometimes the model doesn’t work. Sometimes the idea is just wrong. Sometimes the idea needs to change. Artificial Intelligence is more about experimentation and iteration than it is about carrying strong, clear solutions from paper to production.
The great tragedy of Science – the slaying of a beautiful hypothesis by an ugly fact. – T.H. Huxley
There is a fundamental aspect to science called the null hypothesis. If you have no chance of accepting the null hypothesis when developing your idea, then, simply put, you are not doing science. Many experiments were performed to try and observe the ether, all in vain, as there is no such thing. Similarly, just because you have an idea for some elegant application in artificial intelligence, that doesn’t mean it’s going to work. It just means you have a hypothesis to test.
In the next part of this article I’m going to share with you two negative results I had. Hopefully sharing these bad results will give you a better sense of how science in artificial intelligence can lead to failure, and how to prepare for it, react to it, and push through it.
Unsupervised Learning Machines
The difference between supervised and unsupervised learning is pretty basic: a supervised learning model learns (is trained) to map inputs to outputs, while in unsupervised learning the task is to learn the underlying distribution of the data without labels. The details of that distinction are not the point of this article, so just keep in mind that unsupervised learning is about learning how things work, rather than learning what box to put them in. Given some creative license, I would say that the goal in unsupervised learning is the act of learning itself: to learn about the dataset the model is shown.
Introduction to Failure (Part 1): CAE for Face Embeddings
In some recent research work, I have been trying to build a Convolutional AutoEncoder (CAE) to learn embeddings for faces. In simple terms, I am trying to find meaningful small vectors to represent images of faces. Why? Here is a motivating recent example of using face embeddings to generate content:
My original motivation was the DCGAN paper, a predecessor of the work described above. See figures 7 and 8 on pages 10 and 11 of this paper (DCGAN) to get an idea for why face embedding is so cool. Can a CAE do the same thing as DCGAN?
Below is a video from 2017 that shows the basics of the idea with a DNN (dense) autoencoder and PCA, motivating me to believe that a CAE should work (2018 update here):
Last year I published a paper with some colleagues on CAEs for processing multi-microphone audio signals, and so I have enough of a background to try out ideas. Never mind how the CAE model works for now; it struck me that a picture is worth a thousand words, and images are a great medium for showing what a neural network is learning. So why not use a CAE instead of a GAN to learn face embeddings? A CAE is supposed to reconstruct an image back into the original after compressing it down to some bottleneck size. My hypothesis was that this would "work" for me on face images. I grabbed a pretty good face image dataset and started programming a CAE.
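To make the setup concrete, here is a minimal sketch of the kind of convolutional autoencoder I had in mind, written with Keras. The 64x64 image size, the layer widths, and the 128-dimensional bottleneck are illustrative assumptions, not the exact architecture I trained:

```python
# Minimal convolutional autoencoder (CAE) sketch for face images.
# Assumptions: 64x64 RGB images scaled to [0, 1], a 128-d bottleneck.
from tensorflow.keras import layers, models

def build_cae(img_size=64, bottleneck=128):
    inp = layers.Input(shape=(img_size, img_size, 3))

    # Encoder: convolutions + downsampling squeeze the image down.
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)                      # 32x32
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)                      # 16x16
    x = layers.Flatten()(x)
    code = layers.Dense(bottleneck, activation="relu", name="embedding")(x)

    # Decoder: expand the bottleneck back out to the original image size.
    x = layers.Dense(16 * 16 * 64, activation="relu")(code)
    x = layers.Reshape((16, 16, 64))(x)
    x = layers.UpSampling2D(2)(x)                      # 32x32
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)                      # 64x64
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```

The layer named "embedding" is the bottleneck: the idea is that once the reconstructions look good, the activations of that layer serve as the face embedding.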
Around this time, I was reading a related paper called "Comparison of Unsupervised Modulation Filter Learning Methods for ASR," which mentions a few types of models for unsupervised learning (GANs, CAEs, and other networks). The idea is that several types of model can learn, without supervision, about the structure of data. This gave me some confidence that a CAE could do what GANs are doing in terms of semantic image embedding.
Let’s see what that learning process looks like during my first day of model training after the initial model development. The first row in the video below is the input face images, and the second row is the output (for test data, not training data), which is supposed to look the same:
I could see that the model learned some stuff and was getting better over time. Notice that the output was saturated or cut off on various channels: white is saturation on all 3 channels, yellow is saturation on 2 channels (RED=255, GREEN=255, BLUE=0), black is all 0s, and so forth. There are no middle-ground pixel values like 123 or 65, just 0s and 255s. The loss was going down over time, but was delivering diminishing returns at about 0.5. And so I experimented with longer training times, batch normalization, tanh activation instead of sigmoid, different sized networks and bottlenecks, different sizes of images, and LOTS of other approaches. Here were the results for day 2:
And for day 3:
At this point, I was ready to rip my hair out and throw my computer out the window. I was following the same approaches that worked for GAN papers like batch normalization, and failing miserably. I was running experiments in parallel, and I started to rope in colleagues to join my failing quest. This led to an aha moment. Perhaps my idea was a bit off. Let’s take a step back. Why not use a VAE or a GAN instead of a CAE?
Changing My Idea: How About a VAE or Something Else?
I had to face the fact that my CAE approach, and basically all of my work to this point, was not going to work, and I didn’t know why. I had some ideas, but no proof of what was going wrong. Throwing away several days of work is not a good feeling, but I had to embrace the null hypothesis: my CAE was not good enough at reconstructing face images to be used for semantic embedding operations. GANs are popular for this task because they DO work (e.g. DCGAN), and I had just set a few days of my time on fire trying to do something cool that ultimately did not work. Maybe a Variational AutoEncoder (VAE) would work better? That’s what the literature was indicating…
At this lowest point, I heard back from Mary Kate MacPherson. Not only did she get a CAE model to work, but her CAE code looks almost exactly like my code. This is the point where I look back and think to myself that there really are a LOT of ways to not make a light bulb. Here is a view of her results after a couple hours of training:

The lessons learned here are that sometimes you really shouldn’t give up, and you should definitely hang out with smart people. At the end of the day it was all her. My effort failed as far as the programming goes, but with some collaboration and cooperation we were able to pull out a win. I’m planning to write a whole separate article just on the stuff that worked, as a followup to my article with Mary Kate MacPherson on generating anime girls with deep learning.

Another Epic Fail (Part 2): Growing Neural Nets
I was working on a project with Herschel Caytak PhD where we were investigating the merits of growing a deep neural network during training. NVIDIA reported some amazing results using this technique to grow a GAN during training to get bigger and bigger output images. Here is a github repository of some similar code that uses keras. Again, this is all related to faces data.
Growing a neural network during training is discussed in several resources we have come across, but in all our Googling, we could not find a paper that uses modern tools like keras to prove the merits of the approach for dense networks (not CNNs/GANs). We are not talking here about genetic algorithms or other stochastic approaches to growing neural networks. Instead, the idea is to add untrained neurons to a trained network, train that new bigger network, and then do it again. In this work, we were also not interested in transfer learning between problems. We were narrowly focused on the benefit (or not) of growing a deep neural network during classifier training on any given dataset.
People have been thinking about this growing neural network idea for a long time. A great review of approaches prior to the 2000s is this one. There are also many discussions and ideas in the literature that evaluate growing neural networks. See, for example, "Dynamic Network Expansion" in this blog post based on this research article. It is still a research topic as you can see from this Quora thread. More here.

Qualitatively speaking, growing a neural network layer by layer addresses the problem of having too many parameters to tune. It allows the learning to happen on a small network, and then, when the network grows by adding a layer, most of the network's parameters start off in a good place for the next part of the training procedure. This is a good idea, in theory, because each layer added to the neural network can build finer-grained features on top of the generalizations learned by the layer(s) trained before it. However, modern backpropagation is fantastic. Do we really get any benefits from growing a neural network? The null hypothesis here is that training a DNN from scratch is just as performant as this fancy layerwise training.
For a Convolutional Neural Network (CNN), the idea of growing a neural network can be explained as first detecting gross features and then building layers that activate for finer and finer features.
The problem we ran into was a lack of scientific evidence for this approach in general, and a lack of evidence for DNNs specifically. Just because something compiles does not mean it works better than the good old-fashioned way. What about growing a fully connected DNN? Can we reject the null hypothesis that growing a network performs the same as just training the big network from the start? Should the network grow in layer width each training iteration, or is growing in depth (the number of layers) the way to go? The lack of proof led us to run the experiments described here.
It is important to note that growing a neural network is not at all similar to our understanding of biological neural networks, where networks tend to be pruned and have a critical period. Learning new things in biological NNs can involve recruiting existing neurons to form new circuits, but let us avoid the spiral into discussions about cortical competition, and focus back on the original question: when training a DNN, should I try growing it? Is it worth it?
Frustratingly, we did some testing of this idea last year and failed to get the approach working. It turns out we had a programming error. This time around we tried again, but more carefully. Be prepared for a negative result. Now here we go…
Growing NN Metrics
Faster does not mean better. For example, Adam typically trains faster than SGD, but SGD can generalize better on test data (SOURCE). Similarly, in this work we examine the implications of growing a neural network on standard test-set performance.
We used standard metrics (accuracy, precision, recall, f1-score and loss curve) to provide a comprehensive evaluation and comparison of the DNNs.
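For reference, here is a minimal sketch of how those metrics can be computed with scikit-learn from one-hot labels and predicted class probabilities; the arrays here are random placeholders standing in for a real model's output:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder one-hot labels and predicted probabilities (8 classes, 100 samples).
y_true = np.eye(8)[np.random.randint(1, 8, size=100)]
y_prob = np.random.rand(100, 8)

# Convert one-hot / probability vectors back to class indices.
y_true_cls = y_true.argmax(axis=1)
y_pred_cls = y_prob.argmax(axis=1)

acc = accuracy_score(y_true_cls, y_pred_cls)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true_cls, y_pred_cls, average="weighted", zero_division=0
)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```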
Growing NN Dataset
We used a Colab notebook for running our models (our motivation for using this platform was to accelerate model training with the GPU hardware-acceleration option). We imported the covertype dataset from the UCI Machine Learning Repository into a dataframe using the pandas library. The dataset comprises 581,012 data points of unscaled quantitative and binary data across 54 features related to forest land characteristics. The classification task is to predict the forest cover type for a given observation (a 30 x 30 meter cell) in four wilderness areas located in the Roosevelt National Forest of northern Colorado. The forest cover type designation is provided as an integer in the range 1–7.
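As a sketch of that ingestion step: the notebook pulled the data from UCI with pandas, but scikit-learn's built-in fetch_covtype wraps the same dataset and makes a self-contained example easier, so I use it as a stand-in here:

```python
import pandas as pd
from sklearn.datasets import fetch_covtype

# Fetch the UCI covertype data through scikit-learn (same dataset as the UCI download).
cov = fetch_covtype()
df = pd.DataFrame(cov.data)          # 581012 rows x 54 feature columns
df["Cover_Type"] = cov.target        # integer cover type labels in the range 1-7

print(df.shape)                      # (581012, 55)
```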
Growing NN Procedure
After data ingestion, we split the dataset into "state" information (features) and classes. We scaled all the features into the range 0–1 using MinMaxScaler (imported from sklearn.preprocessing) and one-hot encoded the classes. A quick check of the data dimensionality shows that x (features) has the shape (581012, 54) and y (classes) has the shape (581012, 8). We divided the data into an 80/20 train/test split, meaning that the model trains on 80% of the data and tests/validates its predictions on the remaining 20%. K-fold cross validation is used to randomly split the data 3 different ways; this ensures that the train/test split occurs in several combinations across the whole dataset, reducing problems related to unevenly sampled or biased data.
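A minimal sketch of that preprocessing, assuming the dataframe from the previous sketch; note that one-hot encoding the 1–7 labels with to_categorical yields 8 columns because column 0 is unused:

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold
from tensorflow.keras.utils import to_categorical

# Split features ("state" information) from the class labels.
x = df.drop(columns=["Cover_Type"]).values
y_int = df["Cover_Type"].values                    # integers 1-7

# Scale every feature into 0-1 and one-hot encode the classes.
x = MinMaxScaler().fit_transform(x)                # shape (581012, 54)
y = to_categorical(y_int)                          # shape (581012, 8), column 0 unused

# 80/20 train/test split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# 3-fold cross validation over the training data.
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(x_train):
    pass  # fit and evaluate a model on each fold here
```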
Growing NN Model Preparation
The classification task was implemented using 3 different models that differed only in layer architecture (number of layers, layer width, and which layers were trainable). We used "categorical_crossentropy" as the model loss and "adam" as the optimizer. Model performance was measured on the validation dataset (data the model was not trained on); the specific metrics were loss, accuracy, precision, recall, and F1 score (see [here](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) and here for a detailed explanation of these metrics and how to calculate them).
Growing NN Base Model 1
The initial model comprises 3 dense layers plus dropout. The input has the shape of the x features, (54,); data then passes to a fully connected layer of width 30, followed by a dropout layer (0.5), another dense layer of width 25, another dropout layer (0.5), and a final output layer of width 8. We use the 'relu' activation function for the input and hidden layers, and a softmax function is applied to the final output layer. Training is set for 10 epochs with a batch size of 128.
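Here is a sketch of that base model in Keras, following the layer sizes described above and the loss/optimizer from the previous section (the x_train/y_train arrays are the ones from the preprocessing sketch):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Base model: 54 inputs -> 30 -> dropout -> 25 -> dropout -> 8-way softmax.
model_1 = Sequential([
    Dense(30, activation="relu", input_shape=(54,)),
    Dropout(0.5),
    Dense(25, activation="relu"),
    Dropout(0.5),
    Dense(8, activation="softmax"),
])
model_1.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model_1.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))
model_1.save("model_1.h5")     # saved to HDF5 so model 2 can reuse it
```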
Growing NN Model 2 Incorporating Frozen Base Model 1
The second model comprises the first model with 2 additional layers added to the input side of the network. The trained first model is loaded from an HDF5 file. We "freeze" the loaded first model by setting its trainable attribute to false. The additional layers are set to widths of 100 and 54, respectively, with relu activation functions. (The first width is a somewhat arbitrary but reasonable increase of model width. The second is chosen to interface correctly with the input dimension of the frozen model.)
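A sketch of that growing step, reusing the saved base model as a frozen block (the file name and training arrays are the same illustrative ones as above):

```python
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# Load the trained base model and freeze all of its weights.
base = load_model("model_1.h5")
base.trainable = False

# Grow the network on the input side: 54 -> 100 -> 54 -> frozen base model.
model_2 = Sequential([
    Dense(100, activation="relu", input_shape=(54,)),
    Dense(54, activation="relu"),   # matches the frozen model's expected input width
    base,
])
model_2.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model_2.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))
```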
Growing NN Model 3
Finally, we generate a third model that is equivalent in size and design to the combined model 1 and model 2 DNN. Model 3 consists of fully trainable layers. Performance metrics are evaluated for all 3 DNNs to see whether there is an advantage to reusing and freezing a pretrained model as part of a larger model.
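And a sketch of the fully trainable model 3, which stacks the same layer widths as models 1 and 2 combined without any frozen block:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Same architecture as model 1 + model 2 combined, but every layer is trainable.
model_3 = Sequential([
    Dense(100, activation="relu", input_shape=(54,)),
    Dense(54, activation="relu"),
    Dense(30, activation="relu"),
    Dropout(0.5),
    Dense(25, activation="relu"),
    Dropout(0.5),
    Dense(8, activation="softmax"),
])
model_3.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model_3.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))
```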
Growing NN Observations and Discussion
Results are shown in the series of figures below:
[Figure 1: mean accuracy with error bars for the three models. Figures 2 and 3: recall, accuracy, and F1 score for a single k-fold iteration.]
Results of these experiments are very interesting, if somewhat inconclusive. Our initial null hypothesis is that using pre-trained models as frozen layers in a larger model does not provide any advantage as compared to training the larger model in a single step.
Figure 1 shows that all models achieved a similar level of mean accuracy. Model 2 appears to have the highest mean accuracy; however, the size of the error bars indicates that the differences between the models are unlikely to be statistically significant.
Looking at Figures 2 and 3, the second model shows the best performance for recall, accuracy, and F1 score. However, these graphs only show a single k-fold iteration and, as discussed earlier, the differences may not be statistically significant (all the bar chart values are around the same height in Figure 1, so take it with a grain of salt).
In any case, we can see that the results shown above did not disprove the potential utility of using frozen pretrained models, although any advantage is likely to be small and may not warrant the effort required for implementation of this idea. It looks like there might be something there, but if there is, it’s really small. I’m going to call that a negative result. More research is needed to see if growing DNNs is a good idea, but so far, no luck.
Conclusion
In this article you followed along to see some negative results and extreme frustration in the day to day life of an artificial intelligence researcher. Face embedding and growing neural networks are definitely real, and definitely hard. It’s science, and science is messy!
If you liked this article, then have a look at some of my most read past articles, like "How to Price an AI Project" and "How to Hire an AI Consultant." And hey, join our newsletter!
Until next time!
-Daniel Lemay.ai [email protected]