Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy

Source code for our EMNLP 2021 paper "Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy" [PDF]

Our method is implemented on top of the open-source toolkit Fairseq.

Requirements and Installation

  • Python version = 3.6

  • PyTorch version = 1.7

  • Install fairseq:

    git clone https://github.com/ictnlp/MoE-Waitk.git
    cd MoE-Waitk
    pip install --editable ./
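    # optional sanity check (an assumption, not part of the original instructions):
    # the editable install should make the local fairseq package importable
    python -c "import fairseq; print(fairseq.__version__)"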

Quick Start

Data Pre-processing

We use the data of IWSLT15 English-Vietnamese (download here), WMT16 English-Romanian (download here), and WMT15 German-English (download here).

For WMT16 English-Romanian and WMT15 German-English, we tokenize the corpus via mosesdecoder/scripts/tokenizer/normalize-punctuation.perl and apply BPE with 32K merge operations via subword_nmt/apply_bpe.py.
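
A minimal sketch of this preprocessing step is shown below, assuming checkouts of mosesdecoder and subword_nmt and German-English data in train.de/train.en. The tokenizer.perl and learn_bpe.py calls and all file names are illustrative assumptions; only normalize-punctuation.perl, apply_bpe.py, and the 32K merge operations come from the description above.

# hypothetical tokenization + BPE sketch (file names and the extra steps are assumptions)
for lang in de en; do
    perl mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ${lang} < train.${lang} > train.norm.${lang}
    perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l ${lang} < train.norm.${lang} > train.tok.${lang}
done

# learn a joint BPE code with 32K merge operations, then apply it to both sides
cat train.tok.de train.tok.en | python subword_nmt/learn_bpe.py -s 32000 > bpe.codes
for lang in de en; do
    python subword_nmt/apply_bpe.py -c bpe.codes < train.tok.${lang} > train.bpe.${lang}
done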

Then, we process the data into the fairseq format, adding --joined-dictionary for WMT16 English-Romanian and WMT15 German-English:

src=SOURCE_LANGUAGE
tgt=TARGET_LANGUAGE
train_data=PATH_TO_TRAIN_DATA
valid_data=PATH_TO_VALID_DATA
test_data=PATH_TO_TEST_DATA
data=PATH_TO_DATA

# add --joined-dictionary for WMT16 English-Romanian and WMT15 German-English
fairseq-preprocess --source-lang ${src} --target-lang ${tgt} \
--trainpref ${train_data} --validpref ${valid_data} \
--testpref ${test_data} \
--destdir ${data} \
--workers 20
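
For example, a hypothetical concrete invocation for IWSLT15 English-Vietnamese might look as follows; the language codes and paths are placeholders, not from the original instructions.

# hypothetical example with placeholder paths
fairseq-preprocess --source-lang en --target-lang vi \
--trainpref iwslt15.en-vi/train.bpe --validpref iwslt15.en-vi/valid.bpe \
--testpref iwslt15.en-vi/test.bpe \
--destdir data-bin/iwslt15.en-vi \
--workers 20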

Training

Train the MoE Wait-k policy in two stages, following the commands below:

  • For Transformer-Small with 4 attention heads: we set expert lagging = 1,6,11,16
  • For Transformer-Base with 8 attention heads: we set expert lagging = 1,3,5,7,9,11,13,15
  • For Transformer-Big with 16 attention heads: we set expert lagging = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
  1. First stage: fix the expert weights to be equal and pre-train the expert parameters.
export CUDA_VISIBLE_DEVICES=0,1,2,3
data=PATH_TO_DATA
modelfile=PATH_TO_SAVE_MODEL
expert_lagging=SET_EXPERT_LAGGING  # e.g., 1,3,5,7,9,11,13,15 for Transformer-Base

# First stage: pre-train an equal-weight MoE Wait-k
python train.py --ddp-backend=no_c10d ${data} --arch transformer --share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.3 \
--criterion label_smoothed_cross_entropy \
--reset-dataloader --reset-lr-scheduler --reset-optimizer \
--label-smoothing 0.1 \
--encoder-attention-heads 8 \
--decoder-attention-heads 8 \
--left-pad-source False \
--fp16 \
--equal-weight \
--expert-lagging ${expert_lagging} \
--save-dir ${modelfile} \
--max-tokens 4096 --update-freq 2
  2. Second stage: jointly fine-tune the parameters of the experts and their weights. Since both stages share the same --save-dir, this stage resumes from the first stage's last checkpoint, while the --reset-* flags reset the dataloader, learning-rate scheduler, and optimizer.
# Second stage: fine-tune MoE Wait-k with learnable expert weights
python train.py --ddp-backend=no_c10d ${data} --arch transformer --share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.3 \
--criterion label_smoothed_cross_entropy \
--reset-dataloader --reset-lr-scheduler --reset-optimizer \
--label-smoothing 0.1 \
--encoder-attention-heads 8 \
--decoder-attention-heads 8 \
--left-pad-source False \
--fp16 \
--expert-lagging ${expert_lagging} \
--save-dir ${modelfile} \
--max-tokens 4096 --update-freq 2

Inference

Evaluate the model with the following command:

export CUDA_VISIBLE_DEVICES=0
data=PATH_TO_DATA
modelfile=PATH_TO_SAVE_MODEL
ref_dir=PATH_TO_REFERENCE
testk=TEST_WAIT_K

# average last 5 checkpoints
python scripts/average_checkpoints.py --inputs ${modelfile} --num-update-checkpoints 5 --output ${modelfile}/average-model.pt

# generate translation
python generate.py ${data} --path ${modelfile}/average-model.pt --batch-size 250 --beam 1 --left-pad-source False --fp16 --remove-bpe --test-wait-k ${testk} > pred.out

grep ^H pred.out | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.translation
multi-bleu.perl -lc ${ref_dir} < pred.translation
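
Note that multi-bleu.perl ships with Moses rather than being installed on PATH; assuming the same mosesdecoder checkout used for tokenization, it can be invoked explicitly:

# invoke multi-bleu.perl from the Moses checkout (the path is an assumption)
perl mosesdecoder/scripts/generic/multi-bleu.perl -lc ${ref_dir} < pred.translation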

Citation

If this repository is useful for you, please cite as:

@inproceedings{zhang-feng-2021-universal,
    title = "Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy",
    author = "Zhang, Shaolei and Feng, Yang",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.581",
    doi = "10.18653/v1/2021.emnlp-main.581",
    pages = "7306--7317",
}
