When Machine Learning tries to predict the performance of Machine Learning…

I tried to ‘predict’ the quality of a deep learning network using another machine learning algorithm. The results were not encouraging.

Those who have read my previous piece on Medium know that I am no machine learning or data science expert. I come from a semiconductor technology background and have only recently started tinkering with machine learning techniques.

Being a newbie, I did not dare venture into Deep Learning for as long as I could restrain myself. In all fairness, I did code up simple 1- or 2-layer perceptrons in MATLAB when I took Prof. Ng’s Coursera class, and I even tried the first couple of examples from his newer Deeplearning.ai course using the NumPy stack. But they involved very little ‘deep stuff’ and very few hyperparameters.

It is this issue of hyperparameters in deep learning that fascinates me. I suppose that is natural given my technology design background: I also play with lots of high-level parameters every day in my semiconductor process or device design work, and they often combine in mind-bogglingly complex and (sometimes) unpredictable ways to lead my design to success or failure.

I try to tell myself that it is just a big soup of simple ingredients (quantum mechanics, Kirchhoff’s laws, Maxwell’s equations) and that I should be able to predict the outcome with very high confidence. And often I am able to. But there are instances where the quality of the design is not quite what was expected, and the root cause is hard to find in the maze of hyperparameters used throughout the whole development pipeline.

So, when I finally started tinkering with TensorFlow, I dove straight into a self-learning exercise where I played with various hyperparameters and their impact on the final quality of the predictor. Now, what does that phrase “various parameters and their impact” sound like to you? A problem just waiting to be analyzed by machine learning, right?

That’s right. I tried to ‘learn’ what drives the performance of a Deep Learning model up, by Deep Learning (and simple Logistic Regression too)!


I won’t go into too much technical and implementation detail. Interested readers can simply refer to or download the code from my GitHub repo. Instead, I will just describe the high-level steps in simple terms:

  • First, I chose a simple and well-known data set — the famous Iris species data, which is also used in the TensorFlow Estimator API tutorial page. It is a multinomial classification problem with real-valued features. Standard and clean.
  • Then I decided on a particular optimizer: ProximalAdagradOptimizer. I am sure you can choose any other optimizer without losing generality, and I encourage you to try.
  • Thereafter, I chose 4 hyperparameters, which, in my opinion, are general enough to appear in any high-level or even array-level implementation of a deep learning problem that one can work on:

a) learning rate, b) dropout rate/probability, c) L1 (LASSO) regularization strength, and d) number of training steps of the network.

  • Having chosen the hyperparameters, I spaced their values logarithmically rather than linearly, so as to cover a wide range with a small number of points.
  • Then, I constructed a full-factorial loop (all levels of hyperparameters are crossed with all other levels) and started training a 3-layer (5,10,5) fully-connected feed-forward neural network by spanning that multi-level loop. I kept the network small enough to achieve decent training speed on my simple laptop :)
  • Within one execution of the loop, I extracted and saved the ‘accuracy’ of the prediction based on a hold-out set.
  • At the end of the whole ‘loopy’ execution, I collected the parameters and accuracy scores (the indicator of the predictive quality of the DNN classifier) into a pandas DataFrame for later analysis.
  • And I repeated this process for 4 types of activation functions: a) sigmoid(x), b) rectified linear unit (ReLU), c) tanh(x), and d) exponential linear unit (ELU).
  • Now I had 4 DataFrames for the 4 activation functions, each with rows containing the hyperparameters of the neural network as features and the accuracy as the output. Ripe for analysis by another machine learning algorithm, or by a neural network itself!
  • To convert the accuracy scores into categorical data, I arbitrarily partitioned them into three classes: low, medium, and high. This simply denotes the ‘quality’ of the deep learning network. For example, an accuracy below 0.5 is classified as low, and an accuracy above 0.8 is classified as high.
  • Thereafter, I constructed a simple logistic regression classifier (from scikit-learn) and tried to predict the quality classes from the hyperparameters.
  • I also tried the same type of prediction using another DNN classifier model. For this step I used a bigger neural net (20–20–20 hidden neurons) and trained it for more steps (5000 minimum), as I had to run it only once.
  • And I repeated the whole thing for all 4 activation functions.
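
The sweep described in the steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the repo: the ranges, the number of levels, and the helper names (`log_levels`, `full_factorial`, `quality_class`, `run_sweep`) are my assumptions, and `train_and_score` is a hypothetical callback that stands in for training the TensorFlow classifier and returning its hold-out accuracy.

```python
import itertools


def log_levels(low_exp, high_exp, n):
    """Return n values spaced evenly in log10 between 10**low_exp and 10**high_exp."""
    if n == 1:
        return [10.0 ** low_exp]
    step = (high_exp - low_exp) / (n - 1)
    return [10.0 ** (low_exp + i * step) for i in range(n)]


# Illustrative ranges and level counts (assumptions, not the values used in the article).
grid = {
    "learning_rate": log_levels(-4, 0, 5),
    "dropout": log_levels(-3, -0.3, 4),
    "l1_strength": log_levels(-4, 0, 5),
    "train_steps": [int(round(v)) for v in log_levels(2, 4, 3)],
}


def full_factorial(grid):
    """Yield one dict per combination: every level crossed with every other level."""
    names = list(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


def quality_class(accuracy, low=0.5, high=0.8):
    """Bin an accuracy score into 'low', 'medium', or 'high'."""
    if accuracy < low:
        return "low"
    if accuracy > high:
        return "high"
    return "medium"


def run_sweep(grid, train_and_score):
    """Call the (hypothetical) training callback once per grid point and collect rows."""
    rows = []
    for params in full_factorial(grid):
        accuracy = train_and_score(**params)
        rows.append({**params, "accuracy": accuracy, "quality": quality_class(accuracy)})
    return rows
```

The list of dicts returned by `run_sweep` can be passed directly to `pandas.DataFrame(rows)`, giving one such table per activation function.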

The results were, at best, a mixed bag. I got accuracy and F1-score as high as 0.8, but when I tried various random splits and cross-validation, the average performance was 0.4–0.6, sometimes as low as 0.35.
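
To see how much a single lucky split can flatter the result, one can score the same model over several shuffled cross-validation splits. Here is a minimal sketch with scikit-learn, assuming a feature matrix `X` of hyperparameters and a label vector `y` of quality classes. The function name `cv_scores`, the fold count, and the seeds are my choices for illustration, and the demo data is random placeholder data, not my actual results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def cv_scores(X, y, n_splits=5, seeds=(0, 1, 2)):
    """Accuracy of scaled logistic regression over several shuffled k-fold splits."""
    # Putting the scaler inside the pipeline means it is fitted on the training folds only.
    model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
    scores = []
    for seed in seeds:
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(model, X, y, cv=cv, scoring="accuracy"))
    return np.array(scores)


# Placeholder data only: random features and random 3-class labels.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(120, 4))
y_demo = rng.integers(0, 3, size=120)

scores = cv_scores(X_demo, y_demo)
print("mean:", scores.mean(), "min:", scores.min(), "max:", scores.max())
```

Reporting the mean together with the minimum and maximum across splits gives a fairer picture than the best single split.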

The data set was simple and the DNN classifier’s accuracy scores were in general satisfactory, but a machine learning model could not capture very well how the neural network’s performance depends on the various hyperparameters. The only apparent signature was the choice of activation function: tanh and ELU gave distinctly better results than sigmoid or ReLU. But that realization could be reached simply by looking at the accuracy score tables and did not warrant a machine learning model. Other than this, the hyperparameter space yielded no definite lesson on how to achieve a better accuracy score with the DNN model. Not even a higher number of training steps showed any definite correlation with a higher accuracy score.

What’s also noteworthy is that the DNN classifier did not perform better than a simple logistic regression model (with a MinMaxScaler transformation on the data) when it came to learning from the hyperparameter data sets. So, the limitation was probably not the model; rather, the data set itself was full of uncertainty and unexplained variance. The following figure shows a correlation heatmap for the data set corresponding to the ELU activation function (one of the better performers on the original Iris data).
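
The numbers behind such a heatmap are just a Pearson correlation matrix over the hyperparameter and accuracy columns. Below is a minimal sketch with NumPy. The column names, the tiny two-factor grid, and the use of log10-transformed hyperparameters are my assumptions for illustration, and the accuracy values are random placeholders rather than measured scores.

```python
import itertools

import numpy as np


def correlation_table(rows, columns):
    """Pearson correlation matrix of the given columns (rows are dicts)."""
    data = np.array([[float(row[col]) for col in columns] for row in rows])
    return np.corrcoef(data, rowvar=False)


# Hypothetical small full-factorial design on log10-transformed hyperparameters.
levels = {
    "log10_learning_rate": [-3.0, -2.0, -1.0],
    "log10_l1_strength": [-4.0, -2.0, 0.0],
}

rng = np.random.default_rng(0)
rows = []
for lr, l1 in itertools.product(*levels.values()):
    rows.append({
        "log10_learning_rate": lr,
        "log10_l1_strength": l1,
        "accuracy": rng.uniform(),  # placeholder, NOT a real accuracy score
    })

columns = ["log10_learning_rate", "log10_l1_strength", "accuracy"]
corr = correlation_table(rows, columns)
print(np.round(corr, 3))
```

In a balanced full-factorial design the hyperparameter columns are uncorrelated with each other by construction, so the informative part of the matrix is the accuracy row (or column).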


Hyperparameter optimization, or even homing in on a good set of hyperparameters, still feels like “black magic” in the field of deep learning.

Even for a relatively simple problem like Iris species classification, the impact of the hyperparameters of a fully-connected neural net on the final accuracy could not be understood very well by a machine learning method.

Disclaimer: Please feel free to fork/pull the code from GitHub, experiment on your own, and let me know if you find something interesting :) I can be reached at: tirthajyoti[AT]gmail[DOT]com.