
The concept of probability is ubiquitous in the Machine Learning field. In classification tasks, probabilities are the output scores of almost every predictive model, together with the predicted labels. Showing them together is more informative than providing only a raw classification report. In this way, we use probabilities as confidence scores to approximate the uncertainty of our model predictions. This is a common practice and reflects how we reason.
The problem is that our algorithms aren’t smart enough to provide probabilities in the form of trustworthy confidence scores. They tend to produce uncalibrated results, i.e. they overestimate or underestimate probabilities when we compare them with the expected accuracy. This results in misleading reliability, corrupting our decision policy.
In this post, we play with probabilities while facing the problem of neural network calibration. As we’ll see, building a calibrated model is not obvious and we must take care during its development. We apply an adequate technique to calibrate the scores in a real classification problem: not getting kicked in auto auctions, where the risk is to purchase a ‘lemon’.
THE DATA
The dataset comes from Kaggle, from the past competition ‘Don’t Get Kicked!‘. The challenge of this competition is to predict whether a car purchased at auction is a kick (bad buy). Kicked cars often have serious issues or other unforeseen problems, causing high costs to dealers. We have to figure out which cars have a higher risk of being a kick, providing real value to dealerships. The problem can be framed as a binary classification task (good or bad buy?).

The variables at our disposal are a mix of categorical and numerical:

We apply a standard preprocessing: NaN values are filled with the median (for numerical features) or with an ‘unknown’ class (for categorical ones). The numerical variables are then standard-scaled and the categorical ones are ordinally encoded.
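Below is a minimal sketch of this step with scikit-learn, assuming df is the raw dataframe and num_ / cat_feat are the lists of numerical and categorical columns also used by the model code in the next section:

import pandas as pd
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

# df, num_ (numerical columns) and cat_feat (categorical columns) are assumed names
df[num_] = df[num_].fillna(df[num_].median())       # median imputation
df[cat_feat] = df[cat_feat].fillna('unknown')       # explicit 'unknown' class

df[num_] = StandardScaler().fit_transform(df[num_])                       # standard scaling
df[cat_feat] = OrdinalEncoder().fit_transform(df[cat_feat].astype(str))   # ordinal encoding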
THE MODEL
The model, used to predict whether our purchase is a good deal, is a feed-forward neural network with categorical embeddings. The categorical embeddings are concatenated at the second level, while in the final part we keep the raw output layer (logits) separate from the activation (softmax) used for the probability computation. This little trick will be useful when we have to calibrate our model.
import tensorflow as tf
from tensorflow.keras.layers import (Input, Dense, Embedding, Flatten, Concatenate,
                                     BatchNormalization, Dropout, Activation)
from tensorflow.keras.models import Model

def get_model(cat_feat, emb_dim=8):

    # Embedding branch for a single categorical feature
    def get_embed(inp, size, emb_dim, name):
        emb = Embedding(size, emb_dim)(inp)
        emb = Flatten(name=name)(emb)
        return emb

    # Numerical features enter through a dense branch
    inp_dense = Input(shape=(len(num_),))
    embs, inps = [], [inp_dense]
    x = Dense(128, activation='relu')(inp_dense)

    # One input and one embedding per categorical feature
    for f in cat_feat:
        inp = Input((1,), name=f + '_inp')
        embs.append(get_embed(inp, cat_[f] + 1, emb_dim, f))
        inps.append(inp)

    x = Concatenate()([x] + embs)
    x = BatchNormalization()(x)
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(32, activation='relu')(x)

    # Keep raw logits separate from the softmax: we need them later for calibration
    logits = Dense(2, name='logits')(x)
    out = Activation('softmax')(logits)

    model = Model(inps, out)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=[tf.keras.metrics.AUC()])
    return model
We split our initial dataset into train, validation, and test sets. The validation set is used first for network tuning and, in a second stage, to fit the temperature coefficient for the calibration procedure. After fitting, we reach an accuracy of around 0.90 with good precision on the test data, which is better than a naive model that classifies all cars as a good deal.
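For reference, a minimal sketch of the split and training, assuming X is the preprocessed dataframe and y the binary target (the to_inputs helper, the split proportions, and the training hyperparameters are illustrative choices, not taken from the original setup):

from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# 70% train, 15% validation, 15% test, stratified on the target
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                stratify=y_tmp, random_state=42)

def to_inputs(df):
    # Order must match the model inputs: numerical block first, then one array per categorical
    return [df[num_].values] + [df[f].values for f in cat_feat]

model = get_model(cat_feat)
model.fit(to_inputs(X_train), to_categorical(y_train, 2),
          validation_data=(to_inputs(X_val), to_categorical(y_val, 2)),
          epochs=20, batch_size=512)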
NEURAL NETWORK CALIBRATION
As we can see, our model achieves good performance on unseen data. But how reliable is this result? The advantage of predicting well-calibrated probabilities is that we can be confident when the predicted probability is close to 1 or 0, and not so confident otherwise. In our case, if the probabilities are not calibrated, our classifier may over-predict similar cars as ‘lemons‘, which may lead to decisions not to buy cars that are actually in good condition.
By definition, a model is perfectly calibrated if, for any probability value p, the predictions made with confidence p are correct p*100% of the time. To check whether our classifier is well calibrated, all we need to do is make a plot! At the same time, producing calibrated probabilities is as simple as applying a post-processing technique.
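Formally, following the definition given by Guo et al. (see the references), perfect calibration can be written as

P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1]

where \hat{Y} is the predicted class and \hat{P} the confidence associated with it.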
The first step is to plot the reliability diagram, where we compare the expected accuracy against the prediction confidence. The fraction of positives and the mean predicted value are computed, for convenience, over equal-width groups/bins. This can be done for binary or multiclass problems and results in a curve for each class.
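A minimal sketch of how such a diagram can be produced with scikit-learn for the positive class, reusing the hypothetical to_inputs helper from the training sketch above:

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Predicted probability of the positive class ('kick') on the test set
probs = model.predict(to_inputs(X_test))[:, 1]

# Fraction of positives vs. mean predicted probability over 10 equal-width bins
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10, strategy='uniform')

plt.plot([0, 1], [0, 1], 'b--', label='perfectly calibrated')
plt.plot(mean_pred, frac_pos, marker='o', label='neural network')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()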

The optimal situation is when there is a perfect linear relationship between the computed probability and the fraction of positives (blue dashed line). In our case, our neural network tends to slightly overestimate the probabilities of one class and underestimate those of the other in the middle range. We can quantify the goodness of the calibration with a proper index, the so-called Expected Calibration Error (ECE): the average of the differences between expected accuracy and prediction confidence, weighted by the proportion of samples in each bin (the lower the better).
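A minimal sketch of this binned ECE computation, assuming probs is the N x 2 matrix of predicted probabilities and y_true the vector of true class indices:

import numpy as np

def expected_calibration_error(y_true, probs, n_bins=10):
    # Weighted average of |accuracy - confidence| over equal-width confidence bins
    confidences = np.max(probs, axis=1)          # confidence of the predicted class
    predictions = np.argmax(probs, axis=1)       # predicted class
    accuracies = (predictions == y_true).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece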
We can reduce the ECE by applying Temperature scaling, a single-parameter extension of Platt scaling. The neural network outputs a vector known as the logits. Temperature scaling simply divides the logits vector by a learned scalar parameter T, before passing it through the softmax function to get the class probabilities.
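In formulas, with z_i the logits vector of sample i and T > 0 the learned temperature, the calibrated probabilities are

\hat{q}_i = \mathrm{softmax}\!\left(\frac{z_i}{T}\right)

For T > 1 the softmax output is softened, mitigating over-confident predictions, while T = 1 gives back the original probabilities.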

Practically speaking, with a few lines of code we can build our own function to compute the Temperature scaling. The scaling factor T is learned on a predefined validation set, where we minimize a mean cost function (in TensorFlow: tf.nn.softmax_cross_entropy_with_logits). The inputs and outputs are, respectively, our logits scaled by the learnable T and the true labels in the form of dummy vectors. The training is carried out classically with stochastic gradient descent.
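A minimal sketch of this procedure in TensorFlow 2, under the assumptions made so far (the 'logits' layer name comes from get_model, to_inputs is the hypothetical helper defined earlier, and the learning rate and number of steps are illustrative):

import tensorflow as tf

# Extract the raw logits on the validation set from the 'logits' layer of the fitted model
logit_model = tf.keras.Model(model.inputs, model.get_layer('logits').output)
val_logits = tf.constant(logit_model.predict(to_inputs(X_val)), dtype=tf.float32)
val_labels = tf.keras.utils.to_categorical(y_val, 2)

# Single learnable temperature, optimized with SGD on the validation cross-entropy
T = tf.Variable(1.0, dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for _ in range(300):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=val_labels,
                                                    logits=val_logits / T))
    grads = tape.gradient(loss, [T])
    optimizer.apply_gradients(zip(grads, [T]))

# Calibrated probabilities on the test set
test_logits = tf.constant(logit_model.predict(to_inputs(X_test)), dtype=tf.float32)
calibrated_probs = tf.nn.softmax(test_logits / T).numpy()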

In the end, the ECE score computed on the scaled probabilities shows an improvement for both classes. As we can also see above in the new reliability diagrams, we can now produce better-calibrated probabilities.
SUMMARY
In this post, we examined a post-processing technique to calibrate the probabilities of our neural network and make it, when possible (temperature scaling is not always effective), a more trustworthy instrument. It’s a very simple trick, applicable anywhere, which becomes useful whenever we care about the probabilities our model computes to make a decision.
Keep in touch: Linkedin
REFERENCES
- Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger. On Calibration of Modern Neural Networks.