- The paper presents a novel open-vocabulary NMT (Neural Machine Translation) system that translates mostly at the word level and falls back to character-level models for rare words.
- Advantages:
- Faster and easier to train than purely character-level models.
- Does not produce unknown words in its translations, so no unk-replacement post-processing is needed.
- Link to the paper
- Most NMT systems operate on a constrained vocabulary and represent unknown words with the unk token.
- A post-processing step replaces unk tokens with actual words using alignment information.
- Disadvantages:
- These systems treat words as independent entities even when they are morphologically related.
- They find it difficult to capture phenomena such as name translation.
- Deep LSTM encoder-decoder.
- Global attention mechanism and bilinear attention scoring function.
- Similar to a regular NMT system, except in the way unknown words are handled.
- A deep LSTM is used to generate on-the-fly representations of rare words, using the final hidden state from its top layer (see the sketch after this list).
- Advantages:
- Simplified architecture.
- Efficiency through precomputation: representations for all rare source words can be computed at once before each mini-batch.
- The model can be trained easily in an end-to-end fashion.
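A minimal PyTorch sketch of two of the pieces above, under assumed sizes and hypothetical names (`CharEncoder`, `W_a`; none of this is the paper's code): the character-level LSTM that builds on-the-fly embeddings for rare source words from the final hidden state of its top layer, plus the bilinear ("general") attention score.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character-level LSTM that builds embeddings for rare words on the fly.
    Sizes and names here are illustrative, not taken from the paper."""
    def __init__(self, n_chars=128, char_dim=64, hidden_dim=512, n_layers=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden_dim, n_layers, batch_first=True)

    def forward(self, char_ids):          # char_ids: (n_words, word_len)
        x = self.char_emb(char_ids)       # (n_words, word_len, char_dim)
        _, (h, _) = self.lstm(x)          # h: (n_layers, n_words, hidden_dim)
        return h[-1]                      # final hidden state of the top layer

# Precomputation: every rare word in a mini-batch is encoded in one call,
# then the results are consumed like ordinary word embeddings.
encoder = CharEncoder()
rare_words = torch.randint(0, 128, (5, 12))   # 5 rare words, 12 chars each
embeddings = encoder(rare_words)              # (5, 512)

# Bilinear ("general") attention score between decoder state h_t and
# source state h_s: score(h_t, h_s) = h_t^T W_a h_s.
W_a = nn.Linear(512, 512, bias=False)
h_t, h_s = torch.randn(1, 512), torch.randn(1, 512)
score = (W_a(h_s) * h_t).sum(-1)
```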
Hidden-state Initialization
- For the source representation, all layers of the LSTM are initialized with zero hidden states and cell values.
- For the target representation, the same strategy is followed, except for the hidden state of the first layer, where one of the following approaches is used (see the sketch after this list):
- same-path target generation approach
- Use the context vector just before the softmax (of the word-level NMT).
- separate-path target generation approach
- Learn a new weight matrix W that is used to generate a separate context vector.
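A hedged sketch of the two initialization strategies; `W_c` stands in for the word model's existing projection (reused by same-path) and `W_new` for the extra matrix learned by separate-path (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

hidden_dim = 512
W_c = nn.Linear(2 * hidden_dim, hidden_dim)    # word model's matrix (same-path reuses it)
W_new = nn.Linear(2 * hidden_dim, hidden_dim)  # extra matrix for separate-path only

def init_char_decoder_state(h_t, c_t, separate_path=True):
    """h_t: top-layer decoder hidden state; c_t: attention context vector."""
    concat = torch.cat([c_t, h_t], dim=-1)
    if separate_path:
        return torch.tanh(W_new(concat))  # separate-path: its own context vector
    return torch.tanh(W_c(concat))        # same-path: the vector fed to the word softmax

h0 = init_char_decoder_state(torch.randn(1, hidden_dim), torch.randn(1, hidden_dim))
```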
- Training objective:
- J = J_w + α * J_c
- J - total loss
- J_w - loss of the regular word-level NMT
- J_c - loss of the character-level NMT, weighted by α
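In code, the objective is just a weighted sum of the two cross-entropy losses; a one-function sketch (the value of alpha here is illustrative, not the paper's tuned setting):

```python
alpha = 1.0  # relative weight of the character-level loss (illustrative)

def total_loss(word_loss, char_loss):
    # J = J_w + alpha * J_c: word-level cross-entropy plus the weighted
    # character-level cross-entropy over rare-word spellings.
    return word_loss + alpha * char_loss
```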
- The final hidden state from the character-level decoder could be interpreted as the representation of the unk token, but using it that way would not be efficient.
- Instead, unk itself is fed to the word-level decoder, which decouples the two models: the character-level model can run as soon as the word-level model finishes.
- During testing, a beam search decoder is first run at the word level to find the best translation using the word-level NMT alone.
- Next, a character-level decoder with its own beam search generates actual words in place of each unk token, so as to minimise the combined loss (see the sketch below).
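A sketch of this two-stage decoding, with `word_beam_search` and `char_beam_search` as placeholder callables rather than the paper's actual API:

```python
def translate(src, word_beam_search, char_beam_search, beam_size=12):
    """Two-stage decoding sketch. word_beam_search returns the best word
    sequence plus the decoder state at each position; char_beam_search
    spells out a word starting from one of those states."""
    words, states = word_beam_search(src, beam_size)
    # Replace every <unk> with a word generated character by character,
    # seeded from the decoder state at that position.
    return " ".join(
        char_beam_search(states[i], beam_size) if w == "<unk>" else w
        for i, w in enumerate(words)
    )
```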
- WMT’15 translation task from English into Czech, with newstest2013 (3000 sentences) as the dev set and newstest2015 (2656 sentences) as the test set.
- Case-sensitive NIST BLEU.
- chrF3 (character n-gram F-score; a simplified sketch follows below).
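A simplified chrF3 computation (character n-gram F-score with beta = 3, averaged over n-gram orders 1 to 6; this ignores tokenization details of the official metric):

```python
from collections import Counter

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Character n-gram F-score; chrF3 when beta=3 (recall weighted 3x).
    Simplified single-reference version."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        overlap = sum((hyp & ref).values())          # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf("the cat sat", "the cat sat on"))  # illustrative call
```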
- Purely word-based
- Purely character-based
- Hybrid (proposed model)
- The hybrid model surpasses all other systems (neural and non-neural) and establishes a new state-of-the-art result for English-Czech translation in WMT’15, with 19.9 BLEU.
- Character-level models, when used as a replacement for the standard unk-replacement technique in NMT, yield an improvement of up to +7.9 BLEU points.
- Attention is very important for character-based models, as non-attentional character models perform poorly.
- Character models trained with shorter backpropagation-through-time windows perform worse than those trained with longer ones.
- Separate-path strategy outperforms same-path strategy.
- The character-level encoder can also be used to obtain representations for rare words.
- These are evaluated by comparing the Spearman correlation between similarity scores assigned by humans and by the model (see the sketch below).
- Outperforms the recursive neural network model (which also uses a morphological analyser) on this task.
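A small illustrative harness for this evaluation, assuming a `word_vecs` dictionary of embeddings (not the paper's code):

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(word_vecs, pairs, human_scores):
    """word_vecs: dict word -> np.ndarray; pairs: list of (w1, w2) tuples.
    Returns Spearman rho between model cosine similarities and human scores."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    model_scores = [cosine(word_vecs[w1], word_vecs[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```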