December 11-16 2016 Osaka Japan
W16-46
Model                  RIBES(%)  BLEU(%)
Baseline model         79.49     29.32
Enlarged decoder RNN   79.60     30.24
Stacked decoder RNN    79.25     29.07
Residual decoder RNN   79.88     30.75

[Figure: validation cross-entropy loss (1.0–5.0) vs. iterations (1500x) for the four model variants]
Table 1: Automatic evaluation results on the English-Japanese translation task on the ASPEC corpus. The baseline model and the enlarged decoder RNN each have a single-layer decoder RNN, with 1000 and 1400 hidden units respectively. The stacked decoder RNN and the residual decoder RNN are both composed of two-layer RNNs with 1000 hidden units each.
Figure 2: Validation loss (cross-entropy) over the first three epochs. The validation loss of the model with residual decoder RNNs is consistently lower than that of any other model variant, while naively stacking decoder RNNs slows down training significantly.
is described in Section 3. All RNNs in the network are LSTMs with 1200 hidden units. The pre-processing and training procedure is exactly the same as that in Section 4.2.
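The residual decoder RNN can be sketched as follows. This is a minimal illustration only: simple tanh recurrent cells stand in for the paper's LSTMs, and all names (`rnn_cell`, `residual_decoder_step`) are invented here for exposition.

```python
import numpy as np

def rnn_cell(x, h, W, U):
    # toy recurrent cell standing in for an LSTM step
    return np.tanh(W @ x + U @ h)

def residual_decoder_step(x, h1, h2, params):
    # one decoder time step of a two-layer RNN with a residual
    # connection: the second layer's output is added to its input
    W1, U1, W2, U2 = params
    h1_new = rnn_cell(x, h1, W1, U1)
    h2_new = rnn_cell(h1_new, h2, W2, U2) + h1_new  # residual add
    return h1_new, h2_new
```

The residual add lets the second layer learn only a correction to the first layer's output, which is one common explanation for why such stacks train faster than naive stacking.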
In order to test the performance with system combination, we trained a T2S transducer with Travatar (Neubig, 2013) on the WAT2016 En-Ja dataset, but without forest inputs in the training phase. We used a 9-gram language model with T2S decoding. We also evaluated the results of the T2S system with NMT reranking, where the reranking scores are given by the aforementioned NMT ensemble.
We tested the performance of system combination with two approaches: GMBR system combination and a simple heuristic. The simple heuristic performs system combination by choosing the result of the NMT model ensemble when the input has fewer than 40 words; otherwise, the result of the reranked T2S system is chosen.
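The length-based heuristic amounts to a one-line rule. The sketch below is illustrative; the function and argument names are invented here rather than taken from any released code, and only the 40-word threshold comes from the text.

```python
def combine_by_length(source, nmt_output, reranked_t2s_output, threshold=40):
    """Length-based system combination: prefer the NMT ensemble for
    short inputs, the reranked T2S system otherwise."""
    if len(source.split()) < threshold:
        return nmt_output
    return reranked_t2s_output
```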
For GMBR system combination, we used 500 sentences from the development data for GMBR training. We obtain a 20-best list from each system, so our system combination task involves selecting a hypothesis out of 40 hypotheses. The sub-components of the loss function used for GMBR are derived from n-gram precisions, the brevity penalty, and Kendall's tau.
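As a rough illustration of MBR-style selection over a merged n-best list (not the paper's GMBR: the gain below is a crude average clipped n-gram precision standing in for the trained combination of n-gram precisions, brevity penalty, and Kendall's tau), one can pick the hypothesis with the highest expected gain against all other hypotheses:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pairwise_gain(hyp, other, max_n=2):
    # crude similarity: average clipped n-gram precision of hyp against other
    hyp_t, other_t = hyp.split(), other.split()
    precisions = []
    for n in range(1, max_n + 1):
        h, o = ngram_counts(hyp_t, n), ngram_counts(other_t, n)
        total = max(sum(h.values()), 1)
        matched = sum(min(c, o[g]) for g, c in h.items())
        precisions.append(matched / total)
    return sum(precisions) / max_n

def mbr_select(hypotheses):
    # choose the hypothesis with the highest expected gain
    # (consensus) against every other hypothesis in the list
    def expected_gain(i):
        return sum(pairwise_gain(hypotheses[i], o)
                   for j, o in enumerate(hypotheses) if j != i)
    return hypotheses[max(range(len(hypotheses)), key=expected_gain)]
```

In the actual setup described above, `hypotheses` would be the 40 entries from the two 20-best lists, and the gain function's weights would be tuned on the 500 development sentences.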
5.2 Official Evaluation Results in WAT2016
In Table 2, we show the automatic evaluation scores together with pairwise crowdsourcing evaluation scores for our submitted systems¹ in WAT2016. The first submission, which is a two-model ensemble of NMT models with residual decoder RNNs, achieved a comparatively high RIBES (Isozaki et al., 2010) score.
For system combination, although we gained an improvement by applying the GMBR method², the naive method of system combination based on the length of the input sentence works better in the evaluation on the test data. Considering the scores of the human evaluation, the system combination does make a significant difference compared to the NMT model ensemble.
¹ We found a critical bug in our implementation after submitting the results. However, the implementations evaluated in Section 4.2 are bug-free. Hence, the results of the NMT ensemble in Section 4.2 and the results of the NMT model with residual decoder RNN in Section 5.2 are not comparable.
² We did not submit the result of GMBR system combination for human evaluation, as the implementation was unfinished before the submission deadline.