[Learning Note] Dropout in Recurrent Networks — Part 2

Recurrent Dropout Implementations in Keras and PyTorch

Ceshine Lee
Towards Data Science


Before going into the experiments, I’d like to examine the implementations in detail for better understanding and future reference. Because of space limitations, I use #... to omit lines that aren’t essential to the current discussion.


Keras

Here I use the Keras version that ships with TensorFlow 1.3.0.

The implementation mainly resides in the LSTM class. We start with the LSTM.get_constants method, which is invoked for every batch in Recurrent.call to provide the dropout masks. (The input dropout and recurrent dropout rates have been stored as instance attributes in __init__.)
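The mask-creation logic looks roughly like this (a paraphrased sketch of the bundled Keras source, not a verbatim copy; the helper name gate_masks is mine):

```python
from keras import backend as K


def gate_masks(inputs, dim, rate):
    """Sample one (samples, dim) dropout mask per LSTM gate, four in total.

    `inputs` is the (samples, time, input_dim) batch; only its first
    time step is used to obtain a (samples, 1) tensor of ones to tile.
    """
    ones = K.ones_like(K.reshape(inputs[:, 0, 0], (-1, 1)))  # (samples, 1)
    ones = K.tile(ones, (1, dim))                            # (samples, dim)
    # Four independently dropped-out copies of the ones tensor: one per gate.
    # At test time, in_train_phase falls back to the all-ones tensor.
    return [K.in_train_phase(lambda: K.dropout(ones, rate), ones)
            for _ in range(4)]


# In LSTM.get_constants this logic is used (roughly) twice per batch:
#   dp_mask     = gate_masks(inputs, input_dim,  self.dropout)            # input masks
#   rec_dp_mask = gate_masks(inputs, self.units, self.recurrent_dropout)  # recurrent masks
```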

The inputs are arranged in the form (samples, time (padded with zeros), input_dim). The code block above creates input masks with shape (samples, input_dim) and then randomly sets elements to zero. So new masks are sampled for every sequence/sample, consistent with what is described in [1].

Note that four different masks are created, corresponding to the four gates of the LSTM. Only the untied-weights LSTM supports this setting (more details below).

Similarly, the above creates four recurrent masks with shape (samples, hidden_units).

Next we turn to the LSTM.step method, which is executed sequentially for each time step.

Keras has 3 implementations of LSTM, with implementation 0 as default:

implementation: one of {0, 1, or 2}. If set to 0, the RNN will use an implementation that uses fewer, larger matrix products, thus running faster on CPU but consuming more memory. If set to 1, the RNN will use more matrix products, but smaller ones, thus running slower (may actually be faster on GPU) while consuming less memory. If set to 2 (LSTM/GRU only), the RNN will combine the input gate, the forget gate and the output gate into a single matrix, enabling more time-efficient parallelization on the GPU. Note: RNN dropout must be shared for all gates, resulting in a slightly reduced regularization.
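A paraphrased sketch of the dropout-related part of step under implementation 2 (simplified, not the verbatim source):

```python
# Paraphrased from LSTM.step, implementation 2 (tied weights):
# only the first input mask and the first recurrent mask are used,
# applied to the whole input / hidden state before one big matrix product.
z = K.dot(inputs * dp_mask[0], self.kernel)
z += K.dot(h_tm1 * rec_dp_mask[0], self.recurrent_kernel)
z = z + self.bias  # (bias add)

# z is then split into the pre-activations of the four gates (i, f, c, o),
# each of width self.units.
z0 = z[:, :self.units]
z1 = z[:, self.units: 2 * self.units]
z2 = z[:, 2 * self.units: 3 * self.units]
z3 = z[:, 3 * self.units:]
```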

Implementation 2 corresponds to the tied-weights LSTM, and the code block above implements dropout exactly as in this formula from [1]:

Dropout in Tied-weight LSTM
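In LaTeX, that tied-weights formula from [1] reads roughly as follows, where $z_x$ and $z_h$ are dropout masks that are resampled for each sequence but repeated at every time step:

$$
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
\!\left(
  \begin{pmatrix} x_t \circ z_x \\ h_{t-1} \circ z_h \end{pmatrix} \cdot W
\right)
$$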

Note how it takes only the first mask and discards the rest (three masks). That is because this formulation requires the RNN dropout to be shared across all gates.

Implementations 0 and 1 appear to differ in how input dropout is applied: in implementation 0 the dropped-out, transformed inputs are precomputed outside the step method, while in implementation 1 the inputs are dropped out and transformed inside step.
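For reference, a paraphrased sketch of the per-gate computation in the implementation 1 style (implementation 0 instead precomputes the dropped-out, transformed inputs before step):

```python
# Paraphrased from LSTM.step, implementations 0/1 (untied weights):
# each gate gets its own input mask and its own recurrent mask.
x_i = K.dot(inputs * dp_mask[0], self.kernel_i) + self.bias_i
x_f = K.dot(inputs * dp_mask[1], self.kernel_f) + self.bias_f
x_c = K.dot(inputs * dp_mask[2], self.kernel_c) + self.bias_c
x_o = K.dot(inputs * dp_mask[3], self.kernel_o) + self.bias_o

i = self.recurrent_activation(x_i + K.dot(h_tm1 * rec_dp_mask[0], self.recurrent_kernel_i))
f = self.recurrent_activation(x_f + K.dot(h_tm1 * rec_dp_mask[1], self.recurrent_kernel_f))
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1 * rec_dp_mask[2], self.recurrent_kernel_c))
o = self.recurrent_activation(x_o + K.dot(h_tm1 * rec_dp_mask[3], self.recurrent_kernel_o))
h = o * self.activation(c)
```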

(Note how each gate uses its own dropout mask, and how the transformed inputs and hidden states are combined for each gate.)

That's it. The implementation holds no surprises, so you can use the dropout and recurrent_dropout parameters with confidence. The only thing you might need to consider is whether to use implementation 2 instead of 0 to speed things up.
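For example (the rates and unit count here are arbitrary):

```python
from keras.layers import LSTM

# 25% input dropout and 25% recurrent (variational) dropout;
# implementation=2 uses the tied-weights formulation with a single
# shared mask for all gates.
layer = LSTM(128, dropout=0.25, recurrent_dropout=0.25, implementation=2)
```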

(Currently Keras doesn’t seem to provide embedding dropout as described in [1]. I think you can definitely write a custom layer for that, though.)


PyTorch

As mentioned in Part 1, PyTorch doesn’t provide native support for variational dropout, so we’re going to use the implementation from the salesforce/awd-lstm-lm project. (This part targets PyTorch version 0.2.0.)

LockedDropout can be used to apply the same dropout mask to every time step (as in input dropout):
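A sketch in the spirit of the project's LockedDropout (paraphrased, not the verbatim source):

```python
import torch.nn as nn
from torch.autograd import Variable  # PyTorch 0.2-era API


class LockedDropout(nn.Module):
    """One dropout mask per sequence, shared across all time steps."""

    def forward(self, x, dropout=0.5):
        # x: (time, samples, input_dim)
        if not self.training or not dropout:
            return x
        # Sample a mask for a single time step: shape (1, samples, input_dim).
        m = x.data.new(1, x.size(1), x.size(2)).bernoulli_(1 - dropout)
        mask = Variable(m, requires_grad=False) / (1 - dropout)
        # Broadcast the same mask over every time step.
        return mask.expand_as(x) * x
```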

PyTorch generally supports two sequence tensor arrangements: (samples, time, input_dim) and (time, samples, input_dim). The code block above is written for the latter arrangement; you can easily modify it to support both. m is created as a dropout mask for a single time step, with shape (1, samples, input_dim), so a new mask is sampled for each sequence, the same as in Keras.

Next is the WeightDrop class. This form of dropout, proposed in [2], is simpler, performs better, and allows a different dropout mask for each gate even in the tied-weights setting. In contrast, implementing the traditional variational dropout would require splitting the LSTM/RNN into individual time steps and iterating over them in a for loop.
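A condensed sketch of the WeightDrop class (simplified and paraphrased from the project, not the verbatim source):

```python
import torch.nn as nn
from torch.autograd import Variable  # PyTorch 0.2-era API
from torch.nn import Parameter


class WeightDrop(nn.Module):
    """Apply dropout to selected weight matrices of a wrapped module."""

    def __init__(self, module, weights, dropout=0.0, variational=False):
        super(WeightDrop, self).__init__()
        self.module = module        # e.g. an nn.LSTM
        self.weights = weights      # e.g. ['weight_hh_l0']
        self.dropout = dropout
        self.variational = variational
        self._setup()

    def _setup(self):
        # cuDNN's flattened parameters don't play well with weights that are
        # recreated on every forward pass, so flatten_parameters is disabled.
        if issubclass(type(self.module), nn.RNNBase):
            self.module.flatten_parameters = lambda *args, **kwargs: None
        for name_w in self.weights:
            w = getattr(self.module, name_w)
            del self.module._parameters[name_w]
            # The *registered* parameter carries the `_raw` suffix; dropout
            # never touches it directly.
            self.module.register_parameter(name_w + '_raw', Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self.module, name_w + '_raw')
            if self.variational:
                # One mask entry per row of the weight matrix: (4 * hidden, 1).
                mask = Variable(raw_w.data.new(raw_w.size(0), 1).fill_(1))
                mask = nn.functional.dropout(mask, p=self.dropout, training=True)
                w = mask.expand_as(raw_w) * raw_w
            else:
                # An independent mask entry for every element of the matrix.
                w = nn.functional.dropout(raw_w, p=self.dropout,
                                          training=self.training)
            # The dropped-out copy is assigned back under the original name.
            setattr(self.module, name_w, w)

    def forward(self, *args):
        self._setweights()
        return self.module.forward(*args)
```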

In _setup, WeightDrop disables parameter flattening (otherwise it won’t work with CUDA) and re-registers the target weight matrix (usually weight_hh_l0) under a name with a _raw suffix. In forward, a dropout mask is applied to that raw weight, and the result is assigned back under the original attribute name (weight_hh_l0). Note that the registered parameter is the weight matrix with the _raw suffix, so the dropout operation never modifies that underlying weight matrix.

There are two types of weight dropout, controlled by the variational parameter. Before we can understand the code, we need to know how the weight matrix works. An nn.LSTM has four associated parameters per layer: weight_ih_l0, weight_hh_l0, bias_ih_l0, and bias_hh_l0. The naming should be obvious enough: ih means input-to-hidden, hh means hidden-to-hidden, and l0 means first layer. Taking nn.LSTM(2, 8, num_layers=1) as an example, weight_hh_l0 (U) has a shape of (32, 8), corresponding to the four gates and eight hidden units (32 = 8 * 4). You should recognize this as a tied-weights LSTM. This implies the hidden state matrix (h) is shaped (8, batch_size). The matrix multiplication Uh then produces a 32 x batch_size matrix, in which each column represents the transformed recurrent inputs to the eight hidden units and their four internal gates for a single sequence (note that PyTorch uses Uh instead of hU):

Internal Definition of LSTM
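For reference, the gate definitions are roughly as in the PyTorch documentation; weight_hh_l0 is the vertical concatenation of $W_{hi}$, $W_{hf}$, $W_{hg}$ and $W_{ho}$, which is why its first dimension is 4 * hidden_units:

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \circ c_{t-1} + i_t \circ g_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$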

For variational=True, a mask of shape (4 * hidden_units, 1) is created. Setting any one of its elements to zero cuts all recurrent connections into one gate of one hidden unit. This seems a bit weird. If we wanted the dropout to be consistent with the Keras tied-weights implementation (the formula below), we’d want a mask of shape (1, hidden_units), where setting an element to zero cuts all recurrent connections originating from a hidden unit. (Remember that a single recurrent dropout mask in Keras is shaped (samples, hidden_units).) It’s possible I’m mistaken, or it’s a bug, or the author did it intentionally. I’m not sure yet.

(Reprise) Dropout in Tied-weight LSTM

For variational=False, a mask of shape (4 * hidden_units, hidden_units) is created. Hence different masks for different hidden units and different gates.
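A tiny illustration of the two mask shapes, using hypothetical numbers and omitting the 1/(1-p) rescaling for brevity:

```python
import torch

hidden_units = 8
raw_w = torch.randn(4 * hidden_units, hidden_units)  # weight_hh_l0, i.e. U

# variational=True: one mask entry per *row* of U, shape (4 * hidden_units, 1);
# a zero removes all recurrent connections into one gate of one hidden unit.
row_mask = torch.bernoulli(0.5 * torch.ones(4 * hidden_units, 1))
w_variational = row_mask.expand_as(raw_w) * raw_w

# variational=False: an independent mask entry for every element of U,
# shape (4 * hidden_units, hidden_units).
full_mask = torch.bernoulli(0.5 * torch.ones(4 * hidden_units, hidden_units))
w_untied = full_mask * raw_w
```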

One important difference between WeightDrop and the Keras implementation is that the dropout mask applied to the weight matrix can only be sampled once per mini-batch. If you were to sample a new mask for every sequence, you would essentially be running mini-batches of size 1, defeating the purpose of mini-batching. This restriction reduces the variation of the masks within each mini-batch to a degree that depends on the mini-batch size. (Remember that in Keras new dropout masks are sampled for each sequence.) Therefore it seems a good idea to always start with the variational=False configuration.

Finally, the embedding dropout:
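A paraphrased sketch (the lookup at the end is simplified; the original calls the embedding backend directly):

```python
import torch.nn as nn
from torch.autograd import Variable  # PyTorch 0.2-era API


def embedded_dropout(embed, words, dropout=0.1):
    """Sketch in the spirit of awd-lstm-lm's embedded_dropout.

    `embed` is an nn.Embedding, `words` holds word ids with shape
    (time, samples). Whole rows of the embedding matrix are dropped, so a
    dropped word is zeroed everywhere it appears in the batch.
    """
    if dropout:
        # One mask entry per word in the vocabulary: shape (num_words, 1).
        mask = embed.weight.data.new(embed.weight.size(0), 1)
        mask = mask.bernoulli_(1 - dropout).expand_as(embed.weight) / (1 - dropout)
        masked_weight = Variable(mask) * embed.weight
    else:
        masked_weight = embed.weight
    # Look up the (possibly masked) rows for the input word ids; a plain
    # index_select expresses the lookup for the simple case (no padding_idx).
    emb = masked_weight.index_select(0, words.view(-1))
    return emb.view(words.size(0), words.size(1), -1)
```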

This should be quite straightforward. The dropout mask is shaped (num_words, 1), and dropout is applied at the word level. As mentioned in [1], this implementation is likely to have some performance issues when the number of words and the embedding dimension are large. But I guess the author considers this an adequate trade-off between code simplicity and performance.

Putting it together

An example model using all three discussed dropouts:
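Below is a minimal sketch of how the three pieces might be wired together. ExampleModel and all its hyper-parameters are hypothetical and only meant to illustrate the structure; it reuses the LockedDropout, WeightDrop and embedded_dropout sketches from above:

```python
import torch.nn as nn


class ExampleModel(nn.Module):
    """Hypothetical model combining embedding dropout, WeightDrop and LockedDropout."""

    def __init__(self, num_words, emb_dim, hidden_dim,
                 dropout_emb=0.1, dropout_inp=0.3,
                 dropout_hid=0.3, dropout_w=0.5):
        super(ExampleModel, self).__init__()
        self.encoder = nn.Embedding(num_words, emb_dim)
        self.lockdrop = LockedDropout()
        # One WeightDrop-wrapped nn.LSTM per layer, kept in a list instead of
        # a single multi-layer nn.LSTM, so LockedDropout can be applied
        # between the layers.
        self.rnns = nn.ModuleList([
            WeightDrop(nn.LSTM(emb_dim, hidden_dim),
                       ['weight_hh_l0'], dropout=dropout_w),
            WeightDrop(nn.LSTM(hidden_dim, hidden_dim),
                       ['weight_hh_l0'], dropout=dropout_w),
        ])
        self.dropout_emb = dropout_emb
        self.dropout_inp = dropout_inp
        self.dropout_hid = dropout_hid

    def forward(self, words):
        # words: (time, samples) word ids
        # 1) Embedding dropout: drop whole words from the vocabulary.
        emb = embedded_dropout(self.encoder, words,
                               dropout=self.dropout_emb if self.training else 0)
        # 2) LockedDropout on the inputs: one mask per sequence,
        #    shared across time steps.
        output = self.lockdrop(emb, self.dropout_inp)
        for rnn in self.rnns:
            output, _ = rnn(output)
            # 3) LockedDropout between recurrent layers.
            output = self.lockdrop(output, self.dropout_hid)
        return output
```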

(In the constructor) Although WeightDrop doesn’t require splitting up time steps, it does require splitting up the RNN layers, one wrapped nn.LSTM per layer; that’s the only way we can apply LockedDropout between the layers.

(In forward) Embedding dropout is applied when the words are looked up.

(Also in forward) LockedDropout is applied by simply passing it the tensor and the dropout rate.

To be continued

This post is quite messy, sorry about that. Writing about technical topics in plain English is hard… Anyway, in the last part I’ll document some empirical results. The results are somewhat different from what the paper [1] obtained, and I’ll try to provide some explanations for that.

References

  1. Gal, Y., & Ghahramani, Z. (2015). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.
  2. Merity, S., Keskar, N. S., & Socher, R. (2017). Regularizing and Optimizing LSTM Language Models.
