
Baseline Walkthrough for the Machine Translation Task of the Shifts Challenge at NeurIPS 2021

Get yourself up and running!

Image by Yandex

Distributional shift (mismatch between training and deployment data) is ubiquitous in real-world tasks and represents a significant challenge to the safe and reliable usage of AI systems. This year at NeurIPS we are organizing the Shifts Challenge, where we investigate robustness to distributional shift and uncertainty quality in real, industrial-scale applications. This blog only targets the Machine Translation task of this challenge and is a complement to the instructions on the official GitHub page, with additional explanations of the datasets, training, and evaluation process. This tutorial is a follow-up to Haiwen Huang’s tutorial on the Vehicle Motion Prediction task of the Shifts Challenge.

This tutorial is structured as follows:

  1. Overview of the task
  2. Setting up the environment and repositories
  3. Downloading the data and baselines
  4. Pre-processing the data into Fairseq format
  5. Model Training
  6. Getting Predictions and Uncertainty estimates
  7. Model Evaluation and Submission
  8. Directions for Improvements

1. Overview of the task

As part of the Shifts Dataset we examine the task of machine translation. Translation services, such as Google Translate or Yandex Translate, often encounter atypical and unusual use of language in their translation queries. This typically includes slang, profanities, poor grammar, orthography and punctuation, as well as emojis. This poses a challenge to modern translation systems, which are typically trained on corpora with a more "standard" use of language. Therefore, it is important for models both to be robust to atypical language use, so that they can provide high-quality translations, and to be able to indicate when they are unable to provide a quality translation. This is especially important when machine translation systems are used to translate sensitive legal or diplomatic documents, where it is crucial that meaning is not "lost in translation". In short, we want models to perform well in a range of scenarios and to indicate when they are unable to do so; this corresponds to the robustness and the uncertainty of predictions, respectively. In the Shifts Challenge, we evaluate both prediction robustness and uncertainty under distributional shift.

In most prior work, uncertainty estimation and robustness to distributional shift are assessed separately. Robustness is typically assessed via metrics of predictive performance on a particular task, such as classification error rate. At the same time, the quality of uncertainty estimates is often assessed via the ability to discriminate between an "in-domain" dataset that is matched to the training data and a shifted or "out-of-domain" (OOD) dataset based on measures of uncertainty. However, we believe that these two problems are two halves of a common whole and must therefore be assessed jointly. Furthermore, they must be evaluated jointly on a dataset which contains both a large chunk of matched or ‘in-domain’ data and a large chunk of shifted data [1]. We will describe more details in the Evaluation section.

Translation is inherently a structured prediction task, as there are dependencies between the tokens in the output sequence. Often we must make assumptions about the form of these dependencies; for example, most modern translation systems are left-to-right autoregressive. However, we could consider conditionally independent predictions or other factorization orders. The nature of these assumptions makes it challenging to obtain a theoretically sound measure of uncertainty. Only recently has work been done on developing principled uncertainty measures for structured prediction [2, 3, 4, 5, 6]. Nevertheless, this remains an unsolved task and a fruitful area for research. This tutorial examines how to get a baseline system based on Uncertainty Estimation in Autoregressive Structured Prediction [2] up and running.

2. Setting up the environment and repositories

To get started on this task you must first set up all the necessary packages and an appropriate environment. Please note that the code is a little outdated and uses Fairseq 0.9 and PyTorch 1.6.0; we plan to create a cleaner, up-to-date implementation soon. You will need Python 3.7 and CUDA 10.2.

First, fire up your shell and clone and install the Shifts Challenge repository:

git clone https://github.com/yandex-research/shifts.git

Inside the directory you will find a ‘requirements.txt’ – go ahead and pip install those packages:

pip install matplotlib numpy torch==1.6.0 sacrebleu==1.4.3 nltk==3.4.5

This also installs versions of all required packages which are compatible with Fairseq 0.9. Finally, clone and install an implementation of Uncertainty Estimation in Autoregressive Structured Prediction [2]:

git clone https://github.com/KaosEngineer/structured-uncertainty.git
cd structured-uncertainty 
python3 -m pip install --user --no-deps --editable .

Now you’ve set up the environment, the Shifts repository, and the Structured Uncertainty repository, and you should be ready for the next step.

3. Downloading the data and baselines

Now that you’ve set up your repositories, you can download the training and development data as well as the baseline models. Note that the script which downloads the data also does some initial pre-processing.

To download the training and development data run the preprocess script:

chmod +x ./shifts/translation/data/prepare_data.sh 
./shifts/translation/data/prepare_data.sh

This pre-processes the data, combines all the training data into a single corpus, and removes duplicate and copy-through examples from the training data. Next, download the baseline models:

wget https://storage.yandexcloud.net/yandex-research/shifts/translation/baseline-models.tar
tar -xf baseline-models.tar

This should yield the following top-level directory structure:

./
├── baseline-models
├── mosesdecoder
├── orig
├── shifts
├── structured-uncertainty
├── subword-nmt
└── wmt20_en_ru

The orig, wmt20_en_ru and baseline-models directories should contain the following:

orig
├── dev-data
│ ├── LICENSE.md
│ ├── newstest2019-enru-ref.ru.sgm
│ ├── newstest2019-enru-src.en.sgm
│ ├── reddit_dev.en
│ ├── reddit_dev.meta
│ └── reddit_dev.ru
└── train-data
 ├── 1mcorpus
 │ ├── corpus.en_ru.1m.en
 │ └── corpus.en_ru.1m.ru
 ├── WikiMatrix.v1.en-ru.langid.en
 ├── WikiMatrix.v1.en-ru.langid.ru
 ├── WikiMatrix.v1.en-ru.langid.tsv
 ├── commoncrawl.ru-en.en
 ├── commoncrawl.ru-en.ru
 ├── en-ru
 │ ├── DISCLAIMER
 │ ├── README
 │ ├── UNv1.0.en-ru.en
 │ ├── UNv1.0.en-ru.ids
 │ ├── UNv1.0.en-ru.ru
 │ └── UNv1.0.pdf
 ├── extra
 ├── news-commentary-v15.en-ru.en
 ├── news-commentary-v15.en-ru.ru
 ├── news-commentary-v15.en-ru.tsv
 ├── news.en
 ├── news.en.translatedto.ru
 ├── news.ru
 ├── news.ru.translatedto.en
 ├── paracrawl-release1.en-ru.zipporah0-dedup-clean.en
 ├── paracrawl-release1.en-ru.zipporah0-dedup-clean.ru
 ├── readme.txt
 ├── wikititles-v2.ru-en.en
 ├── wikititles-v2.ru-en.ru
 └── wikititles-v2.ru-en.tsv
wmt20_en_ru
├── code
├── reddit_dev.en
├── reddit_dev.ru
├── test19.en
├── test19.ru
├── tmp
│ ├── bpe.reddit_dev.en
│ ├── bpe.reddit_dev.ru
│ ├── bpe.test19.en
│ ├── bpe.test19.ru
│ ├── bpe.train.en
│ ├── bpe.train.ru
│ ├── bpe.valid.en
│ ├── bpe.valid.ru
│ ├── reddit_dev.en
│ ├── reddit_dev.ru
│ ├── test19.en
│ ├── test19.ru
│ ├── train.en
│ ├── train.en-ru
│ ├── train.ru
│ ├── train.tags.en-ru.clean.tok.en
│ ├── train.tags.en-ru.clean.tok.ru
│ ├── train.tags.en-ru.tok.en
│ ├── train.tags.en-ru.tok.ru
│ ├── valid.en
│ └── valid.ru
├── train.en
├── train.ru
├── valid.en
└── valid.ru
baseline-models/
├── dict.en.txt
├── dict.ru.txt
├── model1.pt
├── model2.pt
├── model3.pt
├── model4.pt
└── model5.pt

4. Pre-processing the data into Fairseq format

Now that you have downloaded, cleaned, and pre-processed the data, you must convert it into a Fairseq-specific format. This can be done using the following command:

python3 structured-uncertainty/preprocess.py --source-lang en --target-lang ru --trainpref wmt20_en_ru/train --validpref wmt20_en_ru/valid --testpref wmt20_en_ru/test19,wmt20_en_ru/reddit_dev --destdir data-bin/wmt20_en_ru --thresholdtgt 0 --thresholdsrc 0 --workers 24

If you are using the provided baseline models, please pre-process using the following command instead:

python3 structured-uncertainty/preprocess.py --srcdict baseline-models/dict.en.txt --tgtdict baseline-models/dict.ru.txt --source-lang en --target-lang ru --trainpref wmt20_en_ru/train --validpref wmt20_en_ru/valid --testpref wmt20_en_ru/test19,wmt20_en_ru/reddit_dev --destdir data-bin/wmt20_en_ru --thresholdtgt 0 --thresholdsrc 0 --workers 24

The above command uses dictionaries which come with the baseline models and which are slightly different from the ones you will get by running the scripts in the previous sections. Both of the above commands should create a new directory ‘data-bin‘ with the following structure:

data-bin/
└── wmt20_en_ru
 ├── dict.en.txt
 ├── dict.ru.txt
 ├── test.en-ru.en.bin
 ├── test.en-ru.en.idx
 ├── test.en-ru.ru.bin
 ├── test.en-ru.ru.idx
 ├── test1.en-ru.en.bin
 ├── test1.en-ru.en.idx
 ├── test1.en-ru.ru.bin
 ├── test1.en-ru.ru.idx
 ├── train.en-ru.en.bin
 ├── train.en-ru.en.idx
 ├── train.en-ru.ru.bin
 ├── train.en-ru.ru.idx
 ├── valid.en-ru.en.bin
 ├── valid.en-ru.en.idx
 ├── valid.en-ru.ru.bin
 └── valid.en-ru.ru.idx

Here, test is the in-domain development dataset (newstest19) and test1 is the shifted development data (reddit_dev).

5. Model Training

Now, if you want to re-create the baselines you can run the following command:

python3 structured-uncertainty/train.py data-bin/wmt20_en_ru --arch transformer_wmt_en_de_big --share-decoder-input-output-embed --fp16 --memory-efficient-fp16 --num-workers 16 --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.1 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 5120 --save-dir MODEL_DIR --max-update 50000 --update-freq 16 --keep-last-epochs 10 --seed 0

This was used to produce the baselines, with only the seed varying between models. If you are training on a GPU which isn’t a V100 or A100, then it is likely that you shouldn’t use FP16, as this mode may not be numerically stable. Note that you may want to train your model with different settings, for a different number of epochs, or using a different architecture, so the details of this command may vary.

6. Getting Predictions and Uncertainty Estimates

Now that you have an ensemble of baseline models or your own models, you can run inference on the individual models or on the ensemble jointly as follows. To run the single model baseline:

mkdir single 
for i in test test1; do 
   python3 structured-uncertainty/generate.py data-bin/wmt20_en_ru --path baseline-models/model1.pt --max-tokens 4096 --remove-bpe --nbest 5 --gen-subset ${i} >& single/results-${i}.txt
done

To run the ensemble baseline:

mkdir ensemble
for i in test test1; do 
 python3 structured-uncertainty/generate.py data-bin/wmt20_en_ru --path baseline-models/model1.pt:baseline-models/model2.pt:baseline-models/model3.pt --max-tokens 1024 --remove-bpe --nbest 5 --gen-subset ${i} --compute-uncertainty >& ensemble/results-${i}.txt
done

Note that you should only use the "--compute-uncertainty" flag if you are using an ensemble. This produces the raw output of the translations and associated uncertainty scores, which then need to be processed further. All files are saved into the ‘single’ and ‘ensemble’ directories with the following structure:

ensemble
├── results-test.txt
└── results-test1.txt
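
If you want a quick look at the raw generation output before running the assessment scripts, the sketch below collects the hypothesis lines. It assumes the standard Fairseq-style format, where each hypothesis appears on a line starting with H-<sentence id>, followed by a tab-separated score and the text; any other lines (including anything extra written by the structured-uncertainty fork) are simply ignored.

# A minimal sketch for inspecting raw generation output (assumes Fairseq-style
# "H-<id>\t<score>\t<text>" hypothesis lines; all other lines are skipped).
from collections import defaultdict

hypos = defaultdict(list)
with open("ensemble/results-test.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("H-"):
            tag, score, text = line.rstrip("\n").split("\t", 2)
            hypos[int(tag[2:])].append((float(score), text))

# With --nbest 5, each sentence id should map to 5 (score, translation) pairs
print(len(hypos), "sentences;", len(hypos[0]), "hypotheses for sentence 0")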

7. Model Evaluation and Submission

Now that you have successfully run inference with your models, we need to evaluate the models’ output and create a submission file. Please remember that test is newstest19, the non-shifted development data, and test1 is reddit_dev, the shifted development data. We will evaluate the model on both.

This can easily be done by running the following script:

chmod +x ./shifts/translation/assessment/eval_single.sh
chmod +x ./shifts/translation/assessment/eval_ensemble.sh
./shifts/translation/assessment/eval_single.sh
./shifts/translation/assessment/eval_ensemble.sh

These scripts will populate the single and ensemble directories as follows:

ensemble
├── results-test.txt
├── results-test1.txt
├── test
│ ├── aep_du.txt
│ ├── aep_tu.txt
│ ├── entropy_expected.txt
│ ├── ep_entropy_expected.txt
│ ├── ep_epkl.txt
│ ├── ep_mkl.txt
│ ├── ep_mutual_information.txt
│ ├── epkl.txt
│ ├── expected_entropy.txt
│ ├── hypo_ids.txt
│ ├── hypo_likelihoods.txt
│ ├── hypos.txt
│ ├── log_probs.txt
│ ├── logcombo.txt
│ ├── logvar.txt
│ ├── mkl.txt
│ ├── mutual_information.txt
│ ├── npmi.txt
│ ├── ref_ids.txt
│ ├── refs.txt
│ ├── score.txt
│ ├── score_npmi.txt
│ ├── tmp
│ ├── var.txt
│ └── varcombo.txt
└── test1
 ├── aep_du.txt
 ├── aep_tu.txt
 ├── entropy_expected.txt
 ├── ep_entropy_expected.txt
 ├── ep_epkl.txt
 ├── ep_mkl.txt
 ├── ep_mutual_information.txt
 ├── epkl.txt
 ├── expected_entropy.txt
 ├── hypo_ids.txt
 ├── hypo_likelihoods.txt
 ├── hypos.txt
 ├── log_probs.txt
 ├── logcombo.txt
 ├── logvar.txt
 ├── mkl.txt
 ├── mutual_information.txt
 ├── npmi.txt
 ├── ref_ids.txt
 ├── refs.txt
 ├── score.txt
 ├── score_npmi.txt
 ├── tmp
 ├── var.txt
 └── varcombo.txt

Most of the outputs are different measures of uncertainty produced by the code, in addition to hypothesis ids (hypo_ids.txt), references (refs.txt), hypotheses (hypos.txt) and hypothesis likelihoods (hypo_likelihoods.txt). The hypothesis file should be longer than the reference file by a factor of the beam size. For example, if the beam width was 5, then the hypothesis file should be 5 times longer than the reference file.
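
As a quick sanity check (assuming the directory layout produced by the assessment scripts above), you can verify this relationship directly:

# Quick sanity check: with a beam width of 5 there should be 5 hypotheses per reference.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

n_refs = count_lines("ensemble/test/refs.txt")
n_hypos = count_lines("ensemble/test/hypos.txt")
assert n_hypos == 5 * n_refs, f"expected {5 * n_refs} hypotheses, found {n_hypos}"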

Once we have obtained the hypotheses and (many different) uncertainty scores, we can evaluate the predictions relative to the references and then create a submission file. First, let’s discuss how we evaluate the models.

A model’s performance is evaluated using the BLEU and GLEU metrics. However, BLEU is robust only at the corpus level, while GLEU is robust at the sentence level. Since we are interested in assessing the quality of each translated sentence, our joint evaluation of robustness and uncertainty quality uses GLEU. We compute the GLEU of each hypothesis the model produces relative to the appropriate reference translation. Recall that the model can produce multiple translation hypotheses for each input (e.g., 5 per input if the beam width is 5). We compute maximum (best) GLEU by taking the maximum across all hypotheses, and expected GLEU by taking a weighted average across them. The weights for each hypothesis can be computed in different ways; one way is to exponentiate and normalize across hypotheses the length-normalized likelihood of each hypothesis, which yields a set of 5 positive weights that sum to 1. Finally, we take the average maxGLEU or expectedGLEU across the joint in-domain and shifted datasets (test + test1).
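
To make this concrete, here is a minimal sketch of maxGLEU and expectedGLEU for a single source sentence. It uses nltk’s sentence-level GLEU and assumes you have 5 detokenised hypotheses together with their length-normalised log-likelihoods; the values below are purely illustrative, and the official assessment scripts should be treated as the reference implementation.

# Max/expected GLEU for one sentence (illustrative values; nltk returns GLEU in
# [0, 1], so we scale by 100 to match the range used in the text above).
import numpy as np
from nltk.translate.gleu_score import sentence_gleu

reference = "the cat sat on the table".split()
hypotheses = [h.split() for h in [
    "the cat sat on the table",
    "a cat sat on the table",
    "the cat is sitting on the table",
    "the cat sat on a desk",
    "cat on table",
]]
log_likelihoods = np.array([-0.21, -0.35, -0.48, -0.62, -1.10])  # length-normalised

gleus = 100.0 * np.array([sentence_gleu([reference], hyp) for hyp in hypotheses])

# Exponentiate and normalise the log-likelihoods to obtain 5 positive weights summing to 1
weights = np.exp(log_likelihoods - log_likelihoods.max())
weights /= weights.sum()

max_gleu = gleus.max()                   # maximum (best) GLEU
expected_gleu = float(weights @ gleus)   # expected GLEU
print(f"maxGLEU={max_gleu:.1f}, expectedGLEU={expected_gleu:.1f}")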

Now that we have evaluated the predictive quality of our models on the joint in-domain and shifted sets, we can evaluate the quality of the uncertainty estimates by constructing an error-retention curve. To do this, we first turn our GLEU metric into a "GLEU error" by subtracting it from 100 (so that lower is better). Next, we construct an error-retention curve by replacing a model’s predictions with ground-truth translations obtained from an oracle, in order of decreasing uncertainty, thereby decreasing the error. Ideally, a model’s uncertainty is correlated with its error, so the most erroneous predictions are replaced first, yielding the greatest reduction in mean error as more predictions are replaced. This represents a hybrid human-AI scenario, where a model can consult an oracle (human) for assistance in difficult situations and obtain a perfect prediction on those examples.

As the fraction of original predictions retained decreases, so does the mean GLEU error. Finally, we measure the area under the error-retention curve (R-AUC), which is a metric for jointly assessing robustness to distributional shift and the quality of the uncertainty estimates. R-AUC can be reduced either by improving the predictions of the model, so that it has lower overall error at any given retention rate, or by providing uncertainty estimates which better correlate with error, so that the most incorrect predictions are rejected first.
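
Below is a minimal sketch of how an error-retention curve and R-AUC can be computed, assuming you have a NumPy array of per-sentence GLEU errors (100 - GLEU) and a matching array of uncertainty scores. The official assessment scripts compute this for you; this is only meant to illustrate the mechanics.

# Error-retention curve and R-AUC: replace the k most-uncertain predictions with
# a perfect oracle (zero error) and integrate mean error over the retention fraction.
import numpy as np

def retention_curve_auc(errors, uncertainties):
    order = np.argsort(-uncertainties)            # most uncertain predictions first
    sorted_errors = errors[order].astype(float)
    n = len(sorted_errors)
    # Mean error over the whole set when the k most-uncertain predictions are replaced
    mean_errors = np.array([sorted_errors[k:].sum() / n for k in range(n + 1)])
    retention = 1.0 - np.arange(n + 1) / n        # fraction of original predictions kept
    # Integrate mean error over increasing retention fraction (trapezoidal rule)
    return float(np.trapz(mean_errors[::-1], retention[::-1]))

errors = np.array([80.0, 10.0, 40.0, 5.0])        # toy GLEU errors
print(retention_curve_auc(errors, uncertainties=errors))                       # optimal ranking
print(retention_curve_auc(errors, uncertainties=np.random.rand(len(errors))))  # random ranking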

On the error-retention curve, in addition to the uncertainty-based ranking, we also include curves which represent "random" ranking, where uncertainty estimates are entirely non-informative, and "optimal" ranking, where uncertainty estimates perfectly correlate with error. These represent the upper (random) and lower (optimal) bounds on R-AUC as a function of uncertainty quality. Submissions which achieve the lowest R-AUC on the evaluation data (released on October 17th) will win.


Now that we’ve discussed assessment, let’s discuss how to create a submission file. If you have run everything as described above, then you should already have a submission file in your top-level directory. It is created by running the following command:

python shifts/translation/assessment/create_submission.py ensemble/test ensemble/test1 --save_path ./submission-ensemble.json --beam_width 5 --nbest 5 --ensemble --uncertainty_metric SCR-PE

This script takes the in-domain outputs (test), the shifted outputs (test1), the save path, the beam width used during decoding (beam_width), how many of those hypotheses to use (nbest), whether an ensemble output is used, and which measure of uncertainty to include (uncertainty_metric). It then processes the output into a JSON-list file with the following structure:

jsn_list = [jsn0, jsn1, ..., jsnN]
jsn0 = {'id': '001',
        'hypos': [hypo1, hypo2, hypo3],
        'uncertainty': 9000}

hypo1 = {'text': "Кошка сидела на столе",
         'confidence': 0.3}

This file can then be submitted on the Shifts website. Note, if you are using custom measures of uncertainty, then you will have to modify this script.
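
If you do compute your own uncertainty scores, rather than modifying create_submission.py you can also write the file yourself. Below is a minimal sketch under the assumption that the submission is a plain JSON list following the structure shown above; my_outputs is a hypothetical container holding, for each sentence, an id, a list of (translation, confidence) pairs, and a sentence-level uncertainty.

# Hand-writing a submission file (sketch; `my_outputs` is a hypothetical iterable of
# (sentence_id, [(translation, confidence), ...], uncertainty) tuples).
import json

def write_submission(my_outputs, save_path="submission-custom.json"):
    records = []
    for sent_id, hypos, uncertainty in my_outputs:
        records.append({
            "id": sent_id,
            "hypos": [{"text": text, "confidence": conf} for text, conf in hypos],
            "uncertainty": uncertainty,
        })
    with open(save_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)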

8. Directions for Improvements

This blog only covers the baseline method for the Machine Translation task in the Shifts challenge. There are a few directions that can lead to interesting discoveries and improvements:

  • Enhance the diversity of the models in the ensemble. This can be done via test-time data augmentation, such as BPE-dropout, or by enabling ‘normal’ dropout at test time in each of the ensemble members (see the sketch after this list).
  • Consider combining ensembles of models which make different assumptions about the data, such as left-to-right and right-to-left autoregressive models, non-autoregressive models, and models which allow arbitrary factorisation orders, such as XLNet [7].
  • Consider adapting and evaluating deterministic methods [8] which do not require an ensemble of models.
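
For the first direction, here is a minimal PyTorch sketch of enabling dropout at test time; it is purely illustrative (the function name is our own), and wiring it into Fairseq’s generation loop is left to you.

# Keep the model in eval mode but put Dropout modules back in training mode so
# they keep sampling masks during generation, increasing diversity across
# ensemble members (illustrative sketch, not part of the baseline code).
import torch.nn as nn

def enable_test_time_dropout(model: nn.Module) -> nn.Module:
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    return model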

However, you are free to find your own solutions to the problem, as long as they abide by the rules and computational limitations described here.


References

[1] A. Malinin, N. Band, A. Ganshin, G. Chesnokov, Y. Gal, M. J. F. Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova, I. Provilkov, V. Raina, V. Raina, D. Roginskiy, M. Shmatova, P. Tigas, and B. Yangel, "Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks", 2021.

[2] A. Malinin and M. Gales, "Uncertainty Estimation in Autoregressive Structured Prediction", ICLR 2021.

[3] T. Z. Xiao, A. N. Gomez, and Y. Gal, "Wat heb je gezegd? Detecting Out-of-Distribution Translations with Variational Transformers", Bayesian Deep Learning Workshop, NeurIPS 2019.

[4] S. Wang, Y. Liu, C. Wang, H. Luan, and M. Sun, "Improving Back-Translation with Uncertainty-based Confidence Estimation", 2019.

[5] M. Fomicheva, S. Sun, L. Yankovskaya, F. Blain, F. Guzman, M. Fishel, N. Aletras, V. Chaudhary, and L. Specia, "Unsupervised Quality Estimation for Neural Machine Translation", 2020.

[6] P. Notin, J. M. Hernandez-Lobato, and Y. Gal, "Principled Uncertainty Estimation for High Dimensional Data", Uncertainty & Robustness in Deep Learning Workshop, ICML 2020.

[7] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", NeurIPS 2019.

[8] J. van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal, "Uncertainty Estimation Using a Single Deep Deterministic Neural Network", ICML 2020.

