
Dropout is an effective regularization technique for preventing overfitting in deep neural networks. It works very simply: a random portion of neurons is discarded during training, which improves generalization because neurons can no longer co-adapt and depend on one another. Dropout can be applied to virtually any neural network architecture, and it can also serve other purposes: for example, in a previous post I used it to estimate neural network uncertainty with a Bayesian approach.
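For reference, this is all it takes to add a standard dropout layer in Keras (a generic illustration, not code from this post's model):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

model = Sequential([
    Input(shape=(100,)),
    Dense(64, activation='relu'),
    Dropout(0.5),   # each training step randomly zeroes 50% of these activations
    Dense(2, activation='softmax'),
])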
In this post, I try to reproduce the results presented in this paper, which introduced a technique called Multi-Sample Dropout. As stated by the author, its aims are to:
- accelerate training and improve generalization over the original dropout;
- do so at little extra computational cost, because most of the computation time is consumed in the layers below (often convolutional or recurrent) and the weights in the duplicated layers at the top are shared;
- achieve lower error rates and losses.
THE DATA
I want to apply the procedure introduced in the paper to an NLP problem. I found a suitable dataset on Kaggle. The News Headlines Dataset was released for sarcasm detection in sentences. Its texts were collected from the web, each with a corresponding annotation of sarcastic or not sarcastic.
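The post does not show the preprocessing step, so here is a minimal sketch of how the data could be prepared; the file and field names follow the Kaggle dataset, while choices such as max_len and the train/test split are my own assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# the Kaggle file is in JSON-lines format with 'headline' and 'is_sarcastic' fields
df = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)

max_len = 20                                    # assumed maximum sequence length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['headline'])
X = pad_sequences(tokenizer.texts_to_sequences(df['headline']), maxlen=max_len)
y = to_categorical(df['is_sarcastic'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)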

THE MODEL
I chose an NLP task because it lets us try a recurrent model structure, whose training time is dominated by the recurrent layers. Since the dropout and fully connected layers sit near the end of the network, duplicating them adds negligible execution time.
The basic idea is quite simple: create multiple dropout samples instead of only one, as depicted in the figure below. The dropout layer and the layers that follow it, i.e. the "dropout", "fully connected" and "softmax + loss func" layers, are duplicated for each dropout sample. A different mask is used for each dropout sample in the dropout layer, so that a different subset of neurons is used for each dropout sample. In contrast, the parameters (i.e., the connection weights) are shared between the duplicated fully connected layers. The loss is computed for each dropout sample using the same loss function, e.g., cross entropy, and the final loss value is obtained by averaging the loss values of all dropout samples. This final loss value is used as the objective function for optimization during training. The class label with the highest value in the average of the outputs from the last fully connected layer is taken as the prediction.
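To make the recipe concrete before moving to Keras, here is a toy numpy sketch of the loss-averaging scheme just described (my own illustration, not the author's code); the shared head is reduced to a single softmax layer with hypothetical weights W and b:

import numpy as np

def multi_sample_dropout_loss(features, labels, W, b, num_samples=4, rate=0.5):
    # `features` come from the shared lower layers and are computed only once;
    # only the dropout + classifier head is evaluated num_samples times.
    losses = []
    for _ in range(num_samples):
        mask = (np.random.rand(*features.shape) > rate) / (1.0 - rate)  # fresh mask per sample
        logits = (features * mask) @ W + b                              # shared weights W, b
        logits -= logits.max(axis=1, keepdims=True)                     # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        losses.append(-np.log(probs[np.arange(len(labels)), labels]).mean())
    return np.mean(losses)    # the averaged cross-entropy is the training objective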

In Keras, all of these concepts can be implemented with a few lines of code:
import numpy as np
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D,
                                     GRU, Dense, Dropout, Average)
from tensorflow.keras.models import Model

def get_model(num):
    inp = Input(shape=(max_len,))
    emb = Embedding(len(tokenizer.word_index) + 1, 64)(inp)
    x = SpatialDropout1D(0.2)(emb)
    x = GRU(128, return_sequences=True, activation='relu')(x)
    out = GRU(32, activation='relu')(x)

    dense = []
    FC = Dense(32, activation='relu')       # fully connected layer shared by all samples
    for p in np.linspace(0.1, 0.5, num):    # one dropout sample per rate
        x = Dropout(p)(out)                 # different mask (and rate) for each sample
        x = FC(x)
        x = Dense(y_train.shape[1], activation='softmax')(x)
        dense.append(x)
    out = Average()(dense)                  # average the per-sample outputs

    model = Model(inp, out)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
As the recurrent block at the bottom of the model, I've used GRU layers. The number of dropout samples is defined as a parameter when the model is initialized. The fully connected layer at the top shares its weights across all dropout samples. I've built every dropout sample with a different mask and a different dropout probability, ranging from 0.1 to 0.5 rather than always 0.5 (a small change that I allowed myself to adopt). The network has as many output layers as dropout samples. The final step is an average layer that combines the outputs of the previous fully connected layers as a simple average. In this way, the final loss value is computed on the average of the outputs of the dropout samples.
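As a side note, if you prefer to follow the paper literally and average the losses rather than the outputs, one possible variant (my sketch, not the code used here) is to return all per-sample outputs from get_model and let Keras weight the per-output cross-entropy losses:

# inside get_model, instead of the final Average() layer:
model = Model(inp, dense)                      # `dense` is the list of per-sample outputs
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              loss_weights=[1.0 / num] * num,  # averaged per-sample losses
              metrics=['accuracy'])
# the label must then be provided once per output:
# model.fit(X_train, [y_train] * num, ...)

At prediction time you would then combine the per-sample outputs yourself, for example by averaging them as the original model already does.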
At the end of the training phase, we are able to achieve a final accuracy of around 85% on the test set.
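For completeness, a hypothetical training run might look like the following; the number of dropout samples, epochs and batch size are illustrative choices, not the exact settings behind the reported figure:

model = get_model(num=8)                       # 8 dropout samples (illustrative)
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=10, batch_size=256)

loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f'test accuracy: {acc:.3f}')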

SUMMARY
In this post, I tried to reproduce a technique called multi-sample dropout, an upgrade of classical dropout. It can easily be applied to any neural network architecture (including deep ones) and offers an essentially cost-free way to generalize better and improve results, especially in classification tasks.
REFERENCES
Hiroshi Inoue, "Multi-Sample Dropout for Accelerated Training and Better Generalization," IBM Research, Tokyo.