In this article, I am going to summarize an influential paper in the field of Natural Language Processing, Toward Controlled Generation of Text[1]. The paper was published at ICML 2017 and has been cited more than 800 times at the time of writing.
Prerequisites – Neural Networks, Autoencoders, Variational Autoencoders (VAEs).
Goal – The goal of this paper is to generate realistic sentences whose attributes can be controlled. An example would be generating sentences with a positive/happy sentiment. Not only do we want to generate grammatically correct, realistic sentences, we also want to control certain user-defined attributes of the generated sentences (e.g., sentiment, tense). The paper proposes a model and training procedure to achieve this goal, along with ideas to work around the non-differentiability of discrete text data during training.
The key idea behind achieving controlled text generation is to learn a disentangled latent representation of the input text. The model has an autoencoder/Variational Autoencoder (VAE) that reconstructs the input sentence from a lower-dimensional representation of the input (usually referred to as the latent code). Usually, a VAE[3] is preferred over a vanilla autoencoder because its latent representation is smooth, continuous, and can be sampled from.
Disentangling the latent representation means dividing it into different components, where each component stores information about a specific attribute and can be used to control that attribute in the reconstructed sentence. The primary challenge is to design a model and training mechanism that can effectively learn such a disentangled representation. If the latent code is disentangled, modifying certain components of the latent code results in a predictable change in the generated sentence, giving us controlled text generation capabilities.
For details about the history of text generation and controlled text generation, please refer to the related work of the original paper[1] or other survey articles[4]. In this article, I am going to focus on the model and its specifics, and how the above-mentioned problems are solved.
In the next section let’s discuss the model in more detail. Shown below is a diagrammatic representation of the model.
![Model for controlled text generation[1]. Source link](https://towardsdatascience.com/wp-content/uploads/2021/12/15zMpypn5tHciN_UFwMEtRQ.png)
The model consists of an encoder and a generator. The encoder and the generator are exactly the same as in an autoencoder/VAE architecture. The generator acts as a decoder while reconstructing the input sentences, but it is also used to generate the final output sentences and is therefore called a generator.
The latent code of an autoencoder/VAE is unstructured. The latent code can be sampled from (in the case of a VAE), and the generator takes this sampled latent code to generate a new sentence. However, the attributes of the generated sentence cannot be controlled in a normal VAE.
In this model, the latent code is split into two components, z and c. Conceptually, the component z holds the unstructured attributes, whereas c encodes the structured attributes used to control the generated sentences. The sentence can be controlled by more than one factor, and therefore c can have multiple parts c1, c2, and so on. For example, to generate a sentence controlled by sentiment and tense, c1 would control the sentiment and c2 would control the tense. Theoretically, multiple attributes can be added, but training gets more complex and harder as the number of controlled attributes increases.
Additionally, there are discriminators. The discriminators are the primary networks that help in learning the disentangled latent representation and in encoding the specific attributes in the structured component c of the latent code. The discriminator sends signals that guide the generator to produce sentences consistent with the input code c.
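To make the architecture concrete, here is a minimal PyTorch sketch of the three components. The layer sizes, the GRU encoder/decoder, and the small convolutional discriminator are illustrative assumptions, not the paper's exact architectures.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, Z_DIM, C_DIM = 10000, 128, 256, 64, 2   # assumed sizes

class Encoder(nn.Module):
    """Maps a token sequence x to the parameters of q_E(z|x)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.mu, self.logvar = nn.Linear(HID, Z_DIM), nn.Linear(HID, Z_DIM)

    def forward(self, tokens):                      # tokens: (batch, seq_len) ids
        _, h = self.rnn(self.emb(tokens))
        h = h.squeeze(0)
        return self.mu(h), self.logvar(h)

class Generator(nn.Module):
    """Decoder conditioned on the concatenated latent code [z, c]."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.init_h = nn.Linear(Z_DIM + C_DIM, HID)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)            # per-step logits o_t over the vocabulary

    def forward(self, tokens, z, c):                # teacher-forced reconstruction
        h0 = torch.tanh(self.init_h(torch.cat([z, c], dim=-1))).unsqueeze(0)
        out, _ = self.rnn(self.emb(tokens), h0)
        return self.out(out)                        # (batch, seq_len, VOCAB)

class Discriminator(nn.Module):
    """Predicts the structured code c (e.g. sentiment) from a (soft) sentence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VOCAB, EMB)           # accepts one-hot or soft token distributions
        self.conv = nn.Conv1d(EMB, HID, kernel_size=3, padding=1)
        self.cls = nn.Linear(HID, C_DIM)

    def forward(self, token_probs):                 # (batch, seq_len, VOCAB)
        e = self.proj(token_probs).transpose(1, 2)  # (batch, EMB, seq_len)
        h = torch.relu(self.conv(e)).max(dim=2).values
        return self.cls(h)                          # attribute logits
```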
Now that we have a high-level view of the model and its components, let's dive into the details of the training and the loss functions that accomplish the disentanglement.
The training is separated into generator learning, i.e., the loss functions with which the parameters of the generator are learned, and discriminator learning, i.e., the loss functions with which the parameters of the discriminator are learned.
Generator Learning
This part describes the components of training the generator.
Equation 1, shown below, is the standard VAE loss, whose goal is to generate realistic sentences. The KL term forces the latent code learned by the encoder, q_E(z|x), to be close to a prior p(z) (typically a Gaussian distribution). The second component is the reconstruction loss, which pushes the generator to produce plausible English text learned from the input dataset.
$$\mathcal{L}_{VAE}(\theta_G, \theta_E; x) = \mathrm{KL}\big(q_E(z|x)\,\|\,p(z)\big) \;-\; \mathbb{E}_{q_E(z|x)\,q_D(c|x)}\big[\log p_G(x\,|\,z,c)\big]$$
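As a concrete illustration, here is one way Equation 1 could be written in code, continuing the hypothetical modules sketched earlier. It assumes a Gaussian q_E(z|x), a standard normal prior, and per-token cross-entropy as the reconstruction term; for unlabeled sentences the code c is taken from the discriminator's prediction q_D(c|x).

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, generator, q_disc, tokens):
    """Equation 1 (sketch): KL(q_E(z|x) || N(0, I)) + reconstruction negative log-likelihood."""
    mu, logvar = encoder(tokens)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization trick
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

    # For unlabeled data, c is inferred from the input sentence by the discriminator q_D(c|x).
    with torch.no_grad():
        c = F.softmax(q_disc(F.one_hot(tokens, VOCAB).float()), dim=-1)

    logits = generator(tokens[:, :-1], z, c)                     # teacher forcing
    recon = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    return kl + recon
```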
Equation 2. The generated sentence from the generator (in fact, the soft generated sentence, which is an approximation of the discrete sentence) is passed to a discriminator, which tries to predict the code from the generated sentence. The predicted code from the discriminator is used as a signal for training the parameters of the generator.
The soft generated sentence G~_τ(z, c) is used instead of the discrete sentence to avoid differentiability issues. The approximation is based on the softmax relaxation described later.
To elaborate, let's assume that the input latent code used by the generator network to generate a sentence is c1. The discriminator takes the generated sentence and predicts that the sentence has been generated using code c2. If the generator trusts the discriminator's decision, it now tries to modify the sentence generated for input code c1 so that the discriminator's prediction moves toward c1. Thus, the generator uses the discriminator's prediction to modify its generated sentence so that the input code used to generate the sentence and the discriminator's prediction for that sentence agree. The equation shown below is the loss function encapsulating this idea: q_D(c | G~_τ(z, c)) is the discriminator prediction, which is used to train the generator parameters θ_G.
$$\mathcal{L}_{Attr,c}(\theta_G) = -\,\mathbb{E}_{p(z)p(c)}\big[\log q_D\big(c \,\big|\, \widetilde{G}_\tau(z,c)\big)\big]$$
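A sketch of how Equation 2 might look in code, reusing the hypothetical modules above. `soft_generate` is the differentiable "soft" decoding helper sketched after the softmax approximation below; z and c are sampled from their priors by the caller, and the discriminator's parameters are held fixed during this step.

```python
def attr_c_loss(generator, q_disc, z, c_idx, tau=0.5, max_len=20):
    """Equation 2 (sketch): train the generator so that q_D recovers the code c
    it was conditioned on, using the soft (differentiable) generated sentence."""
    c = F.one_hot(c_idx, C_DIM).float()
    soft_sent = soft_generate(generator, z, c, tau, max_len)   # (batch, max_len, VOCAB)
    pred = q_disc(soft_sent)                                   # discriminator's guess of c
    # Gradients flow through soft_sent into the generator; the discriminator is not updated here.
    return F.cross_entropy(pred, c_idx)
```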
The learning signals for training the generator parameters are backpropagated through the discriminator to the generator. However, there is a major roadblock here: the generator output is text, which is discrete, so backpropagation is not directly possible, because sampling discrete tokens is not differentiable.
Solution – Instead of the discrete output, the softmax over the logit vectors is used, as shown in the equation below, with its sharpness controlled by a temperature variable. This is the so-called soft generated sentence passed as input to the discriminator.
$$\widetilde{G}_\tau(z,c):\quad \tilde{x}_t = \mathrm{softmax}(o_t/\tau),\qquad \tau \to 0$$

where $o_t$ is the logit vector at decoding step $t$ and $\tau$ is the temperature, annealed toward 0 as training proceeds.
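Here is a minimal sketch of that soft decoding step, again using the hypothetical Generator module from above. At each step the probability vector softmax(o_t / τ) replaces a sampled discrete token, and the next input is the probability-weighted average of word embeddings; the start-token id and decoding length are assumptions.

```python
def soft_generate(generator, z, c, tau=0.5, max_len=20, bos_id=1):
    """Soft sentence G~_tau(z, c): a sequence of softmax(o_t / tau) vectors, kept differentiable."""
    batch = z.size(0)
    h = torch.tanh(generator.init_h(torch.cat([z, c], dim=-1))).unsqueeze(0)
    inp = generator.emb(torch.full((batch, 1), bos_id, dtype=torch.long))
    soft_tokens = []
    for _ in range(max_len):
        out, h = generator.rnn(inp, h)
        probs = F.softmax(generator.out(out[:, -1]) / tau, dim=-1)    # softmax(o_t / tau)
        soft_tokens.append(probs)
        inp = (probs @ generator.emb.weight).unsqueeze(1)             # expected embedding as next input
    return torch.stack(soft_tokens, dim=1)                            # (batch, max_len, VOCAB)
```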
Equation 3. Using Equation 2, we enforce that certain attributes of the generated sentence (e.g., sentiment) are encoded in the latent code c. However, that's not enough, as the latent code c may also unknowingly encode other attributes along with the desired one. Suppose, for example, that the latent code c encodes sentiment (desired) but accidentally encodes tense as well. If we then modify the value of the latent code c, we expect the generated sentence to have a different sentiment with all other attributes unchanged, but we might instead see a different tense along with the different sentiment. This is something we should avoid. To do that, we need to enforce that all attributes other than the controllable ones are encoded in the latent code z.
Instead of designing a new discriminator for this task, the encoder is reused: it takes the generated sentence as input and predicts the unstructured latent code z. This forces the generator to capture the unstructured attributes in the latent code z, thereby also ensuring that the latent code c doesn't have any unstructured attribute entangled with it.
$$\mathcal{L}_{Attr,z}(\theta_G) = -\,\mathbb{E}_{p(z)p(c)}\big[\log q_E\big(z \,\big|\, \widetilde{G}_\tau(z,c)\big)\big]$$
Overall generator learning – The generator is then trained on the loss function in Equation 4, which is the combination of the three equations described above. So overall, the generator should generate plausible English sentences (Equation 1), whose controlled attributes depend on the latent code c (Equation 2) and whose unstructured attributes depend on the latent code z (Equation 3).
$$\min_{\theta_G}\; \mathcal{L}_G = \mathcal{L}_{VAE} + \lambda_c\,\mathcal{L}_{Attr,c} + \lambda_z\,\mathcal{L}_{Attr,z}$$
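The independence constraint of Equation 3 and the combined objective of Equation 4 could then be sketched as follows. The Gaussian log-likelihood used for z and the λ values are illustrative choices; since the earlier Encoder sketch takes token ids, its embedding matrix is applied to the soft sentence directly here.

```python
def attr_z_loss(generator, encoder, z, c, tau=0.5, max_len=20):
    """Equation 3 (sketch): reuse the encoder to recover z from the soft generated
    sentence, pushing everything not captured by c into z."""
    soft_sent = soft_generate(generator, z, c, tau, max_len)
    emb = soft_sent @ encoder.emb.weight                 # expected embeddings, (batch, max_len, EMB)
    _, h = encoder.rnn(emb)
    h = h.squeeze(0)
    mu, logvar = encoder.mu(h), encoder.logvar(h)
    # Negative Gaussian log-likelihood of the original z under q_E(z | soft sentence), up to a constant
    return 0.5 * (((z - mu) ** 2) / logvar.exp() + logvar).sum(dim=1).mean()

def generator_loss(encoder, generator, q_disc, tokens, lambda_c=0.1, lambda_z=0.1):
    """Equation 4 (sketch): L_G = L_VAE + lambda_c * L_Attr,c + lambda_z * L_Attr,z."""
    batch = tokens.size(0)
    z = torch.randn(batch, Z_DIM)                        # z ~ p(z)
    c_idx = torch.randint(0, C_DIM, (batch,))            # c ~ p(c)
    c = F.one_hot(c_idx, C_DIM).float()
    return (vae_loss(encoder, generator, q_disc, tokens)
            + lambda_c * attr_c_loss(generator, q_disc, z, c_idx)
            + lambda_z * attr_z_loss(generator, encoder, z, c))
```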
In generator learning, we saw that the generator's ability to produce sentences consistent with specific attributes depends on the signals from the discriminator. The generator trusts the discriminator to tell it how well the generated sentences carry the required attributes. So it's important to train the discriminator well so that its predictions are correct. In the next section, we discuss how the discriminator is trained.
Discriminator Learning
It is crucial for the discriminator to infer the sentence attributes correctly and thus help the generator evaluate how well it generates sentences with the desired attributes.
The discriminator is trained in a semi-supervised manner. It first takes labeled examples (x_L, c_L) and trains the discriminator parameters according to Equation 5.
$$\mathcal{L}_s(\theta_D) = -\,\mathbb{E}_{(x_L,\,c_L)}\big[\log q_D(c_L\,|\,x_L)\big]$$
It also makes use of sentences generated by the generator from a sampled code c, forming a dataset of pairs (x̂, c), and uses Equation 6 to train the discriminator on this generated dataset. Since the generated dataset is noisy, minimum entropy regularization[2] is used, which is the second component of Equation 6.
$$\mathcal{L}_u(\theta_D) = \mathbb{E}_{p_G(\hat{x}|z,c)\,p(z)\,p(c)}\big[-\log q_D(c\,|\,\hat{x}) + \beta\,\mathcal{H}\big(q_D(c'\,|\,\hat{x})\big)\big]$$

where $\mathcal{H}(\cdot)$ is the entropy of the discriminator's predicted distribution over codes.
Combining Equations 5 and 6, the overall discriminator objective is shown in Equation 7.
$$\min_{\theta_D}\; \mathcal{L}_D = \mathcal{L}_s + \lambda_u\,\mathcal{L}_u$$
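Putting Equations 5 to 7 together, a sketch of the discriminator objective might look like the following. The labeled batch (x_L, c_L), the λ_u and β weights, and the reuse of the soft decoding helper (the paper uses sentences x̂ sampled from the generator; the soft decoder is reused here only for simplicity) are all simplifying assumptions.

```python
def discriminator_loss(q_disc, generator, x_labeled, c_labeled,
                       batch_size, lambda_u=0.1, beta=0.1, tau=0.5):
    """Equations 5-7 (sketch): supervised loss on labeled data plus a loss on
    generated samples with minimum-entropy regularization."""
    # Eq. 5: cross-entropy on the small labeled set (x_L, c_L)
    loss_s = F.cross_entropy(q_disc(F.one_hot(x_labeled, VOCAB).float()), c_labeled)

    # Eq. 6: sample (z, c), generate sentences with the (fixed) generator, and train q_D to recover c
    z = torch.randn(batch_size, Z_DIM)
    c_idx = torch.randint(0, C_DIM, (batch_size,))
    c = F.one_hot(c_idx, C_DIM).float()
    with torch.no_grad():
        x_hat = soft_generate(generator, z, c, tau)
    logits = q_disc(x_hat)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    loss_u = F.cross_entropy(logits, c_idx) + beta * entropy     # minimum-entropy regularizer

    # Eq. 7: combined discriminator objective
    return loss_s + lambda_u * loss_u
```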
Training Algorithm
Dataset – A large set of unlabeled sentences (X = x) and a few labeled sentences (X = {x, c}).
1. Initialize the base VAE (the encoder and the generator) by minimizing Equation 1.
2. Train the discriminator by minimizing Equation 7.
3. Train the generator by minimizing Equation 4 and the encoder by minimizing Equation 1.

Repeat steps 2 and 3 until convergence (a minimal training-loop sketch is shown below).
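Here is a minimal sketch of that loop, reusing the hypothetical losses above; the data loaders, epoch count, and optimizer settings are placeholders. For simplicity, the encoder and generator share one optimizer, so the Equation 1 term inside `generator_loss` also updates the encoder.

```python
import torch.optim as optim

enc, gen, disc = Encoder(), Generator(), Discriminator()
opt_vae = optim.Adam(list(enc.parameters()) + list(gen.parameters()), lr=1e-3)
opt_disc = optim.Adam(disc.parameters(), lr=1e-3)

# Step 1: initialize the base VAE on unlabeled sentences (Equation 1)
for tokens in unlabeled_loader:                       # assumed DataLoader yielding token-id batches
    opt_vae.zero_grad()
    vae_loss(enc, gen, disc, tokens).backward()
    opt_vae.step()

# Steps 2 and 3, repeated until convergence
for epoch in range(num_epochs):                       # num_epochs: placeholder stopping criterion
    for tokens, (x_l, c_l) in zip(unlabeled_loader, labeled_loader):
        # Step 2: update the discriminator (Equation 7)
        opt_disc.zero_grad()
        discriminator_loss(disc, gen, x_l, c_l, batch_size=tokens.size(0)).backward()
        opt_disc.step()

        # Step 3: update the generator (Equation 4) and the encoder (Equation 1)
        opt_vae.zero_grad()
        generator_loss(enc, gen, disc, tokens).backward()
        opt_vae.step()
```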
Wake-Sleep Procedure

The training procedure of the model uses a wake-sleep procedure, which is essentially a distinction between when the model is trained on real samples from the training data and when it is trained on samples generated by the model itself. The wake procedure corresponds to the training equations that use the real training data. In our case, it corresponds to Equation 1, where the encoder and generator are trained on real data x. The left image of the diagram shows the forward propagation (black arrows) and the gradient propagation (dotted red arrows) for the wake phase of training.
The sleep procedure corresponds to the training equations that use generated samples from the generator. In this case, it includes Equations 2 and 3, where the discriminator and the encoder take the generated sample to predict the codes c and z respectively, giving feedback to train the generator. It also includes Equation 6, where the discriminator is trained on data samples produced by the generator. The right image of the diagram shows the forward propagation (black arrows) and the gradient propagation (red arrows) for the sleep phase of training.
The sleep phase reduces the need for supervision and a large amount of labeled training data.
Conclusion
I hope this explanation helps in an intuitive understanding of the paper. In the next articles, I will explain the implementation of the critical parts of the paper and also look into the experimental results. Here's my article summarizing the paper Disentangled Representation Learning for Non-Parallel Text Style Transfer[5].
References
[1] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning – Volume 70 (ICML'17). JMLR.org, 1587–1596.
[2] Grandvalet, Yves, Bengio, Yoshua, et al. Semi-supervised learning by entropy minimization. In NIPS, volume 17, pp. 529–536, 2004.
[3] Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes." CoRR abs/1312.6114 (2014).
[4] https://github.com/ChenChengKuan/awesome-text-generation
[5] Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2018. Disentangled representation learning for text style transfer. arXiv preprint arXiv:1808.04339