10 Things to Think About Before Starting to Code Your Deep Neural Network

Gal Yona
Sep 22, 2017 · 9 min read

After reading an MNIST tutorial (or ten) and brushing up on some Tensorflow/Keras best practices, you might be tricked into thinking that applying a neural network for your prediction task is a “plug and play” operation.

Unfortunately, things are often not that easy. In practice, getting even a simple architecture (no GANs and fancy stuff) to work well on your data can be a challenging task.

So, before you jump into

import tensorflow as tf

I gathered 10 issues that I believe are important in making the step from reading and understanding online tutorials, to actually being able to engineer architectures that solve your own problems.

(Do keep in mind that the order is somewhat random — different issues might turn out to be more important than others, depending on the application at hand).

#1: Do you have enough data?

Much of the current popularity of Deep Learning techniques can be traced to the remarkable (and still not 100% explainable) ability of huge, over-parameterized networks to generalize well to unseen data. But this ability is not all dark magic — you still need enough data, roughly on the same order as the number of parameters of your network, for this generalization phenomenon to “kick in”. If you’ve been scraping the web and annotating images for days only to come up with around 5K images, you’re going to have a very challenging time not overfitting to your small dataset.

In comes Transfer Learning to the rescue. The idea behind transfer learning is that a model trained on one task (e.g., face detection) can be useful for another task (e.g., facial attribute recognition). Because of the hierarchical structure of the network, the two models will tend to perform similar operations in their first, “feature extracting” layers. This suggests “borrowing” the parameters of a network that was trained on the other task, for which more data is available.

In computer vision this is incredibly easy, since pre-trained models are in great abundance. But the same principle can be employed in other domains, and you can even do transfer learning from another network you’ve trained yourself. In general, if your prediction task is very specific and relevant data is therefore scarce (e.g., predicting the heart rate of cancer patients), try to come up with a similar but slightly more general task (e.g., predicting heart rate for the general population) for which data is easier to collect. Begin by training the general model, and later fine-tune it on your original data set.
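For instance, here is a minimal sketch of the fine-tuning recipe in Keras, assuming an image task; the choice of backbone, the input shape and the number of classes are placeholders for your own setting:

import tensorflow as tf

# Borrow the "feature extracting" layers of a network pre-trained on ImageNet.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the borrowed layers at first

num_classes = 5  # hypothetical number of classes for your specific task
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# Once the new head has converged, unfreeze a few of the top layers of `base`
# and continue training with a small learning rate to fine-tune.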

#2: Normalize your input (not like a robot)

Input normalization is an important subject: for technical reasons it is crucial for guaranteeing a stable learning process and faster convergence to a good local minimum. However, it’s important to keep in mind that there’s much more to input normalization than just applying your favorite data-scaling transformation. Without being too “new-age”-y, try to use this opportunity to do proper exploratory analysis of your data. It’s a simple process that, despite being very useful for improving the performance of any ML model, is often overlooked by DL practitioners. You can gather many insights by looking at very simple statistics of your data. Do you have a lot of outliers in your data? Do you have mislabeled examples? Are your classes balanced? etc.

exploratory data analysis. histograms are your friend
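As a minimal sketch (the random arrays below are placeholders for your own data), a few lines of pandas go a long way before you settle on a scaling transformation:

import numpy as np
import pandas as pd

np.random.seed(0)
X_train = np.random.randn(1000, 5)            # placeholder: (num_samples, num_features)
y_train = np.random.randint(0, 2, size=1000)  # placeholder labels

df = pd.DataFrame(X_train)
print(df.describe())                      # outliers show up in the min/max/percentiles
print(pd.Series(y_train).value_counts())  # are the classes balanced?
df.hist(bins=50)                          # histograms are your friend (needs matplotlib)

# Simple z-score normalization, using statistics computed on the training set only.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_normalized = (X_train - mean) / (std + 1e-8)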

#3: Match the architecture to your setting

Most DL tutorials for multi-class classification are suited for the single-label setting (mutually exclusive labels — only one label can be true), and are therefore composed of the standard softmax + categorical cross entropy combo. It’s so common that many practitioners actually lose sight of the fact that it’s not mandatory for every network to end with a softmax layer. Recall that the softmax acts — as its name suggests — in a manner similar to a smoothed maximum. This means that in most cases, it will “flag” a single class as being the one with the highest probability. This type of behavior can be bad in two common scenarios: multi-label classification (in which you want to allow multiple classes to get a high probability) and when you know you will be predicting on new data that doesn’t necessarily belong to any of the classes (in which case the softmax will still give some class a high probability, since it only looks at how probable a class is compared to the other classes). The alternative in these cases is to switch to a sigmoid (calculated per class) as the final layer’s activation function, together with binary cross entropy as the loss function.
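As a sketch of this alternative ending in Keras (the layer sizes and input shape here are placeholders), the change is one line in the final layer plus the matching loss:

import tensorflow as tf

num_classes = 10  # hypothetical
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    # One independent probability per class, instead of a softmax over all classes.
    tf.keras.layers.Dense(num_classes, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])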

#4: Gradually work up your model’s capacity

One of the things that might make you regret your choice of a deep learning solution is the inevitable process of fine-tuning the hyper-parameters of your network. “Hyper-parameters” is a general term for the set of settings used to control the behavior of a learning algorithm; in DL these include the depth and width of the model, the amount of regularization, the learning rate, and numerous others. The problem with hyper-parameters is that they are parameters which, by definition, you can’t learn from the training set. This often makes for quite a tedious process, since the search space for the optimal values can be huge, and every iteration takes a long time.

One practical piece of advice I can therefore give is to employ any strategy you can, ahead of time, to make this search less gruesome. One such strategy is the “under-fit before you over-fit” approach for the hyper-parameters that control the capacity of the model. The idea is simple: start with the smallest reasonable network (small both in depth and in width), and gradually work up the capacity of your model only if it’s required. As a rule of thumb, for each architecture I first try to fit a small subset of my data (say 10%), and then my entire data. If I can’t fit the data with my network (i.e., reach a training error of zero given sufficient training time), it means I must increase the capacity — I enlarge either the width or the depth of the network using some heuristic choice, and then repeat. This approach aligns us with the Occam’s razor principle (find the simplest model that can explain your data) and helps in the over-fitting battle you’ll end up fighting later on.
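A rough sketch of that loop might look like the following (the toy data, the widths and the stopping threshold are all placeholders; the point is the structure, not the numbers):

import numpy as np
import tensorflow as tf

np.random.seed(0)
X = np.random.randn(5000, 20).astype("float32")  # placeholder data
y = (X[:, 0] > 0).astype("float32")              # placeholder labels
subset = np.random.choice(len(X), size=len(X) // 10, replace=False)  # ~10% of the data

for width in [8, 32, 128]:  # gradually work up the capacity only if needed
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X[subset], y[subset], epochs=50, verbose=0)
    train_acc = history.history["accuracy"][-1]  # key may be "acc" on older Keras versions
    print("width={}: final training accuracy {:.3f}".format(width, train_acc))
    if train_acc > 0.999:  # this width can already fit the subset; no need to grow further
        break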

#5: Choose the right type of data augmentation

A lot of practitioners wrongfully think of data augmentation as just a means of “getting more data”. In reality, data augmentation should be thought of as a form of regularization — a way to introduce the right types of invariances into your model. Think about it this way: when you train your model for a large number of epochs, you’re essentially going over the entire training set multiple times. Performing data augmentation means that every time you show an example to the model, you show it in a slightly altered manner. You should use this as a way to teach the network which kinds of transformations you don’t care about — e.g., those that might greatly change the pixel representation of an image, but don’t affect the label. Do you want the network to recognize images if they are blurry? If they are rotated 90 degrees clockwise? If they are mirrored? Choose the type of augmentation using your knowledge of the task.

an augmented cat is still a cat (image is from this great blog-post)
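In Keras, for example, this kind of task-specific choice can be expressed with an ImageDataGenerator; the particular invariances below (small rotations and horizontal mirroring, but no vertical flips) are just one hypothetical choice:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,      # small rotations don't change the label for this (hypothetical) task
    horizontal_flip=True,   # a mirrored cat is still a cat
    vertical_flip=False,    # but an upside-down digit may change its label, so leave this off
    width_shift_range=0.1,
    height_shift_range=0.1,
)
# Each epoch then sees freshly augmented copies of the training images, e.g.:
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=...)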

#6: Regression or Classification?

The difference between classification and regression is pretty clear-cut and well established: if your output variable takes class labels you should be solving a classification task, and if it takes continuous values you should be solving a regression task. Right? Well, sort of. Many people actually report better results for predicting continuous values by first performing a classification task on the binned values (e.g., dividing [0,10] into 10 different classes: [0,1), [1,2), …), and then fine-tuning with a regression model. This is not so much an issue in itself as an important reminder: DL is still a field with more empirical “do’s and don’ts” than solid theoretical foundations. This means that whenever there’s more than one way to solve a task (which, come to think of it, is most cases), you should definitely consider trying more than one option. You might be pleasantly surprised. And worst case, you’ve gained some more experience for next time.
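For concreteness, here is a tiny sketch of the binning step (the targets below are random placeholders):

import numpy as np

y = np.random.uniform(0, 10, size=1000)  # placeholder continuous targets in [0, 10)
bin_edges = np.arange(1, 10)             # edges at 1, 2, ..., 9
y_class = np.digitize(y, bin_edges)      # class labels 0..9 for [0,1), [1,2), ..., [9,10)
# Train a classifier on y_class first, then fine-tune the same network with a
# regression head (e.g., a single linear output and an MSE loss) on y itself.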

#7: Meditate on the right loss function

Meditation is good for you, and meditation on the right loss function is even better. Why would you want to use a “non-standard” loss function, you wonder? It’s actually much more common than you might think, even though it’s often overlooked in tutorials. One classic example is the imbalanced-classes scenario (which you’d know about if you followed up on #2, ehm, exploratory data analysis). This is the very common case of having more labeled examples in some of the classes. It can be problematic, because if you give the issue no thought you might find yourself in a situation where your network fits the majority class but performs very badly on the minority class. One easy way to fix this is to explicitly “tell” the network to give more importance to the examples from the minority class, by placing more weight on errors made on examples from that class. This is essentially equivalent to over-sampling the minority class. It’s a common practice in ML that is also very useful in DL, and to be perfectly honest, I’m constantly surprised at how much it can boost accuracy in real-world applications. A more advanced practitioner can take the weighted loss function to the next level by placing different weights not only on different classes, but also on different types of error, to balance the recall-precision objective. Be creative.
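In Keras this kind of class re-weighting is a single argument to fit; a minimal sketch, with a toy 9:1 imbalanced dataset and hand-picked weights as placeholders:

import numpy as np
import tensorflow as tf

np.random.seed(0)
X_train = np.random.randn(1000, 20).astype("float32")  # placeholder features
y_train = np.array([0] * 900 + [1] * 100)               # 9:1 imbalance, placeholder labels

class_weight = {0: 1.0, 1: 9.0}  # an error on the rare class costs 9x more

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_train, y_train, epochs=5, class_weight=class_weight, verbose=0)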

#8: Model evaluation

Another thing that is often overlooked is the fact that the evaluation metric can be different from the loss function you’re using for the optimization process. The “right” order of business is to first think of the most appropriate metric for evaluating the performance of your algorithms — this is the metric you’ll be using, for example, to choose the best set of hyper-parameters — and only then to figure out what the best loss function is. In many scenarios you’ll end up using a different loss function for numerical or computational reasons. For example, you might be working on a time-series prediction task in which you decide the most appropriate metric is the Pearson correlation between your prediction and the ground truth, but use the MSE as a proxy, since directly optimizing the Pearson correlation over mini-batches is problematic. So, it’s OK (and even recommended!) to use different metrics for the training and the evaluation of your model, so long as you keep it in mind.
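As a small sketch of the distinction (the arrays below stand in for validation-set ground truth and model predictions):

import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
y_true = np.random.randn(200)                 # placeholder ground truth
y_pred = y_true + 0.5 * np.random.randn(200)  # placeholder model predictions

mse = np.mean((y_true - y_pred) ** 2)  # what the optimizer saw during training
r, _ = pearsonr(y_true, y_pred)        # what you report and use to pick hyper-parameters
print("MSE (training proxy): {:.3f}, Pearson r (evaluation metric): {:.3f}".format(mse, r))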

#9: Read the literature!

Are you struggling with something I didn’t mention here? Chances are you aren’t the first, or even the second. Don’t try to reinvent the wheel! Even a seemingly simple experiment can be very time-consuming to check for yourself; it’s much better to use the collective wisdom of others who have already done the dirty work for you. So, a great recommendation I can give to practitioners is to not be afraid of getting your hands dirty and searching the web (or arXiv) for literature relevant to the problem you’re contemplating. It will save you time in the long run, and open up your mind to new ideas, even if you’re not from the academic community.

#10: Have fun

Just kidding, but I actually have 9 things, not 10 (I just wanted to be in line with the popular culture that counts in base 10). Feel free to suggest some more topics in the comments, or share your thoughts/corrections/remarks.
