…And distill their knowledge instead.

During college, one gets exposed to a variety of personalities. Of them all, I was particularly envious of one type: the people who wouldn't study throughout the semester. A day before the examination they would take a high-level download of the subject, turn it into a pointed view of the answers, and pass with flying colors.
I always thought they were blessed. Having the aptitude to distill learning at a high level, and the ability to convert it into a specific viewpoint, was simply admirable.
I could never do that, nor could I even muster the courage to try.
Recently, during my encounter with a Student-Teacher-based framework, it felt like déjà vu: a vicarious meeting with the same kind of personality.
In simple words, in a Student-Teacher framework, a Student model (a light Neural Network) learns the "distilled" knowledge from the Teacher model (a complex Neural Network). This leads to a "light" Student model with the knowledge of the "heavy" Teacher model, at similar or only slightly lower accuracy levels.
In this article, we will cover the need for such a framework, how the distillation process works, and finally how the Student model is fine-tuned.
Why Student-Teacher Framework?
"If it doesn’t fit your pocket, it’s not saleable"
With the advent of Transfer Learning and other complex Neural Network (NN) models, the accuracy of models has increased significantly. Not surprisingly, it comes at a cost. Multiple NN layers and numerous parameters usually lead to huge models which require high computational power for deployment. This limits their usage and marketability significantly.
There has been a clear need for lighter models with similar levels of accuracy. A way was required to somehow compress these huge complex models into smaller sizes without compromising on accuracy significantly.
And this is what the Student-Teacher Framework aims to achieve.
What is Student-Teacher Framework and how does it work?
In a Student-Teacher framework, a complex NN model (new or pre-trained) is leveraged to train a lighter NN model. The process of imparting knowledge from the complex NN (Teacher) to the lighter NN (Student) is called "Distillation".
Before deep-diving into the Distillation process – it is important to understand two terminologies:
- Softmax Temperature: This is a tweaked version of Softmax. A Softmax function converts the model output into a probability distribution across the classification categories, ensuring that the probabilities sum to one. Softmax Temperature does the same, but additionally spreads the probabilities out more across the categories. (See the image below and the short code sketch after this list.) The level of spread is controlled by a parameter "T": the higher the T, the more the probabilities are spread across the categories. T = 1 is the same as normal Softmax.

Note that, though counter-intuitive, in many cases "soft" probabilities are preferred in order to get a variety of potential outcomes. For example, in text generation one would expect a larger set of potential options in order to generate diverse text.
In the distillation process, instead of using Softmax in the model, Softmax temperature is used.
- Dark Knowledge: In the Softmax Temperature example, the "additional knowledge" we get from the spread of probabilities caused by the temperature parameter "T" is called Dark Knowledge. In other words, the incremental knowledge gained by using Softmax Temperature (with T > 1) instead of plain Softmax is Dark Knowledge.
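To make the effect of "T" concrete, here is a minimal sketch in PyTorch. The logits and the temperature value are made-up illustrations, not numbers from any real model; the point is simply that dividing the logits by T before applying Softmax is all that Softmax Temperature does.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs (logits) of a 4-class classifier
logits = torch.tensor([4.0, 2.0, 1.0, 0.5])

# Standard Softmax (T = 1): the distribution concentrates on the top class
p_hard = F.softmax(logits / 1.0, dim=-1)   # roughly [0.82, 0.11, 0.04, 0.02]

# Softmax Temperature with T = 4: the same logits give a "softer",
# more spread-out distribution that also reveals how the model
# ranks the non-winning classes
p_soft = F.softmax(logits / 4.0, dim=-1)   # roughly [0.40, 0.24, 0.19, 0.17]

print(p_hard, p_soft)
```

The extra information sitting in those non-winning probabilities is exactly the Dark Knowledge the Student will learn from.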
The Distillation Process works as follows:
- Train (or use a pre-trained) complex, let's say 10-layer, "Teacher" model. Expectedly, this is a computationally intensive step. The output of the Teacher model is the Softmax Temperature probabilities spread across the categories/labels.
- Create and train a simpler, let's say 2-layer, "Student" model such that the Softmax Temperature output of the Student is as close as possible to that of the "Teacher" model. For this purpose, the cross-entropy loss between the Student and the Teacher, called the "Distillation loss", is minimized. This is called "Soft Targeting". In this manner we transfer the "Dark Knowledge" from Teacher to Student. Note that knowledge is transferred across all the categories, not just the single predicted label.
The obvious question here is why we aim for Soft Targeting (a spread-out probability distribution). It is because a soft target provides more information than a hard target: it tells the Student not only which category is correct, but also how the Teacher ranks the remaining categories. This makes the smaller model more knowledgeable, thereby requiring less additional training.
- In parallel, train the "Student" model with the Softmax activation (not Softmax Temperature) to get the model output as close as possible to the actuals. **For this purpose, the cross-entropy loss between the Student and the actuals, called the "Student loss", is minimized. This is called "Hard Targeting".**
- For steps 2 and 3 above, minimize the overall loss, which is a weighted sum of the Distillation Loss and the Student Loss (see the code sketch right after this list):
Loss = a × Student Loss + b × Distillation Loss
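As a rough sketch of how the Distillation Loss and the Student Loss combine into a single objective, here is what one training step could look like in PyTorch. The function and variable names, the temperature T, and the weights a and b are illustrative placeholders; the Distillation Loss is written as the soft cross-entropy between the Teacher's and the Student's tempered probabilities, as described above.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, labels, T=4.0, a=0.5, b=0.5):
    """One illustrative training step of the Student-Teacher framework.

    `student` and `teacher` are assumed to return raw logits.
    T (temperature), a and b (loss weights) are hyperparameters to tune.
    """
    with torch.no_grad():                 # the Teacher is only used for inference
        teacher_logits = teacher(x)
    student_logits = student(x)

    # Student Loss ("Hard Targeting"): ordinary cross entropy against the
    # actual labels, i.e. plain Softmax with T = 1
    student_loss = F.cross_entropy(student_logits, labels)

    # Distillation Loss ("Soft Targeting"): cross entropy between the
    # Teacher's and the Student's Softmax Temperature outputs
    teacher_soft = F.softmax(teacher_logits / T, dim=-1)
    student_log_soft = F.log_softmax(student_logits / T, dim=-1)
    distillation_loss = -(teacher_soft * student_log_soft).sum(dim=-1).mean()

    # Weighted sum: Loss = a * Student Loss + b * Distillation Loss
    return a * student_loss + b * distillation_loss
```

In practice the distillation term is often scaled by T² (as suggested in the original distillation paper) so its gradients stay comparable to the hard-label term, and KL divergence is frequently used in place of the plain cross entropy; both are minor variations of the same idea.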
Diagrammatically, the process described above can be depicted as follows:

Conclusion: Distillation (the Student-Teacher framework) is a great model-agnostic tool for deploying models on edge devices with limited computational power. Though theoretically the framework can be applied to any large NN model, I personally have seen it used for a 6-layer CNN model and for DistilBERT. With the large number of edge devices, I expect this framework to become omnipresent in no time.
Happy Learning !!
PS: Cheers to all those amazing college personalities who helped me put pen to paper for this article!
Disclaimer: The views expressed in this article are the opinions of the author in his personal capacity and not of his employer.