Training Phrase Translation Models with Leaving-One-Out - Quaero

Joern Wuebker and Arne Mauser and Hermann Ney
Human Language Technology and Pattern Recognition Group
RWTH Aachen University, Germany
@cs.rwth-aachen.de

Abstract

Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with over-fitting. We describe a novel leaving-one-out approach to prevent over-fitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task. In contrast to most previous work where phrase models were trained separately from other models used in translation, we include all components such as single word lexica and reordering models in training. Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU. As a side effect, the phrase table size is reduced by more than 80%.

1 Introduction

A phrase-based SMT system takes a source sentence and produces a translation by segmenting the sentence into phrases and translating those phrases separately (Koehn et al., 2003). The phrase translation table, which contains the bilingual phrase pairs and the corresponding translation probabilities, is one of the main components of an SMT system. The most common method for obtaining the phrase table is heuristic extraction from automatically word-aligned bilingual training data (Och et al., 1999). In this method, all phrases of the sentence pair that match constraints given by the alignment are extracted. This includes overlapping phrases. At extraction time it does not matter whether the phrases are extracted from a highly probable phrase alignment or from an unlikely one.

Phrase model probabilities are typically defined as relative frequencies of phrases extracted from word-aligned parallel training data.
The joint counts $C(\tilde{f}, \tilde{e})$ of the source phrase $\tilde{f}$ and the target phrase $\tilde{e}$ in the entire training data are normalized by the marginal counts of source and target phrase to obtain a conditional probability

$$p_H(\tilde{f}|\tilde{e}) = \frac{C(\tilde{f}, \tilde{e})}{C(\tilde{e})}. \tag{1}$$

The translation process is implemented as a weighted log-linear combination of several models $h_m(e_1^I, s_1^K, f_1^J)$ including the logarithm of the phrase probability in source-to-target as well as in target-to-source direction. The phrase model is combined with a language model, word lexicon models, word and phrase penalty, and many others (Och and Ney, 2004). The best translation $\hat{e}_1^{\hat{I}}$ as defined by the models can then be written as

$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, s_1^K, f_1^J) \right\} \tag{2}$$

In this work, we propose to directly train our phrase models by applying a forced alignment procedure where we use the decoder to find a phrase alignment between source and target sentences of the training data and then update phrase translation probabilities based on this alignment. In contrast to heuristic extraction, the proposed method provides a way of consistently training and using phrase models in translation. We use a modified version of a phrase-based decoder to perform the forced alignment. This way we ensure that all models used in training are identical to the ones used at decoding time. An illustration of the basic

To limit the effects of over-fitting, we apply the leaving-one-out and cross-validation methods in training. In addition, we do not restrict the training to phrases consistent with the word alignment, as was done in (DeNero et al., 2006). This allows us to recover from flawed word alignments.

In (Liang et al., 2006) a discriminative translation system is described. For training of the parameters for the discriminative features they propose a strategy they call bold updating. It is similar to our forced alignment training procedure described in Section 3.

For the hierarchical phrase-based approach, (Blunsom et al., 2008) present a discriminative rule model and show the difference between using only the Viterbi alignment in training and using the full sum over all possible derivations.

Forced alignment can also be utilized to train a phrase segmentation model, as is shown in (Shen et al., 2008). They report small but consistent improvements by incorporating this segmentation model, which works as an additional prior probability on the monolingual target phrase.

In (Ferrer and Juan, 2009), phrase models are trained by a semi-hidden Markov model. They train a conditional “inverse” phrase model of the target phrase given the source phrase. In addition to the phrases, they model the segmentation sequence that is used to produce a phrase alignment between the source and the target sentence. They used a phrase length limit of 4 words, with longer phrases not resulting in further improvements. To counteract over-fitting, they interpolate the phrase model with IBM Model 1 probabilities that are computed on the phrase level. We also include these word lexica, as they are standard components of the phrase-based system.

It is shown in (Ferrer and Juan, 2009) that Viterbi training produces almost the same results as full Baum-Welch training.
They report improvements over a phrase-based model that uses an inverse phrase model and a language model. Experiments are carried out on a custom subset of the English-Spanish Europarl corpus.

Our approach is similar to the one presented in (Ferrer and Juan, 2009) in that we compare Viterbi and a training method based on the Forward-Backward algorithm. But instead of focusing on the statistical model and relaxing the translation task by using monotone translation only, we use a full and competitive translation system as starting point with reordering and all models included.

In (Marcu and Wong, 2002), a joint probability phrase model is presented. The learned phrases are restricted to the most frequent n-grams up to length 6 and all unigrams. Monolingual phrases have to occur at least 5 times to be considered in training. Smoothing is applied to the learned models so that probabilities for rare phrases are non-zero. In training, they use a greedy algorithm to produce the Viterbi phrase alignment and then apply a hill-climbing technique that modifies the Viterbi alignment by merge, move, split, and swap operations to find an alignment with a better probability in each iteration. The model shows improvements in translation quality over the single-word-based IBM Model 4 (Brown et al., 1993) on a subset of the Canadian Hansards corpus.

The joint model by (Marcu and Wong, 2002) is refined by (Birch et al., 2006) who use high-confidence word alignments to constrain the search space in training. They observe that due to several constraints and pruning steps, the trained phrase table is much smaller than the heuristically extracted one, while preserving translation quality.

The work by (DeNero et al., 2008) describes a method to train the joint model described in (Marcu and Wong, 2002) with a Gibbs sampler. They show that by applying a prior distribution over the phrase translation probabilities they can prevent over-fitting.
The prior is composed of IBM1 lexical probabilities and a geometric distribution over phrase lengths which penalizes long phrases. The two approaches differ in that we apply the leaving-one-out procedure to avoid over-fitting, as opposed to explicitly defining a prior distribution.

3 Alignment

The training process is divided into three parts. First we obtain all models needed for a normal translation system. We perform minimum error rate training with the downhill simplex algorithm (Nelder and Mead, 1965) on the development data to obtain a set of scaling factors that achieve a good BLEU score. We then use these models and scaling factors to do a forced alignment, where we compute a phrase alignment for the training data. From this alignment we then estimate new phrase models, while keeping all other models unchanged. In this section we describe our forced alignment procedure that is the basic training procedure for the models proposed here.

3.1 Forced Alignment

The idea of forced alignment is to perform a phrase segmentation and alignment of each sentence pair of the training data using the full translation system as in decoding. What we call segmentation and alignment here corresponds to the “concepts” used by (Marcu and Wong, 2002). We apply our normal phrase-based decoder on the source side of the training data and constrain the translations to the corresponding target sentences from the training data.

Given a source sentence $f_1^J$ and target sentence $e_1^I$, we search for the best phrase segmentation and alignment that covers both sentences. A segmentation of a sentence into $K$ phrases is defined by

$$k \rightarrow s_k := (i_k, b_k, j_k), \quad \text{for } k = 1, \ldots, K$$

where for each segment $i_k$ is the last position of the $k$th target phrase, and $(b_k, j_k)$ are the start and end positions of the source phrase aligned to the $k$th target phrase. Consequently, we can modify Equation 2 to define the best segmentation of a sentence pair as:

$$\hat{s}_1^{\hat{K}} = \operatorname*{argmax}_{K, s_1^K} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, s_1^K, f_1^J) \right\} \tag{3}$$

The identical models as in search are used: conditional phrase probabilities $p(\tilde{f}_k|\tilde{e}_k)$ and $p(\tilde{e}_k|\tilde{f}_k)$, within-phrase lexical probabilities, distance-based reordering model as well as word and phrase penalty. A language model is not used in this case, as the system is constrained to the given target sentence and thus the language model score has no effect on the alignment.

In addition to the phrase matching on the source sentence, we also discard all phrase translation candidates that do not match any sequence in the given target sentence.

Sentences for which the decoder cannot find an alignment are discarded for the phrase model training.
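The constrained search described above can be illustrated with a much simplified sketch. It assumes monotone alignment (no reordering) and a single phrase score instead of the full log-linear model; the function `forced_align`, the dictionary `phrase_logprob` and the limit `max_len` are hypothetical names introduced only for this illustration and are not the decoder used in this work.

```python
import math
from functools import lru_cache

def forced_align(src, tgt, phrase_logprob, max_len=6):
    """Return (score, segments) of the best monotone phrase segmentation
    that covers both sentences completely, or None if none exists."""
    J, I = len(src), len(tgt)

    @lru_cache(maxsize=None)
    def best(j, i):
        # Best way to cover the remaining words src[j:] and tgt[i:].
        if j == J and i == I:
            return 0.0, ()
        result = None
        for j2 in range(j + 1, min(J, j + max_len) + 1):
            for i2 in range(i + 1, min(I, i + max_len) + 1):
                pair = (tuple(src[j:j2]), tuple(tgt[i:i2]))
                # Candidates that do not match the target are never proposed.
                if pair not in phrase_logprob:
                    continue
                rest = best(j2, i2)
                if rest is None:
                    continue
                score = phrase_logprob[pair] + rest[0]
                if result is None or score > result[0]:
                    result = (score, (pair,) + rest[1])
        return result

    return best(0, 0)

# Toy phrase table (assumed values) keyed by (source phrase, target phrase).
table = {
    (("das",), ("the",)): math.log(0.5),
    (("haus",), ("house",)): math.log(0.5),
    (("das", "haus"), ("the", "house")): math.log(0.1),
}
alignment = forced_align(["das", "haus"], ["the", "house"], table)
```

A return value of `None` corresponds to a sentence pair for which no alignment is found.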
In our experiments, this is the case for roughly 5% of the training sentences.

3.2 Leaving-one-out

As was mentioned in Section 2, previous approaches found over-fitting to be a problem in phrase model training. In this section, we describe a leaving-one-out method that can improve the phrase alignment in situations where the probability of rare phrases and alignments might be overestimated. The training data that consists of $N$ parallel sentence pairs $f_n$ and $e_n$ for $n = 1, \ldots, N$ is used for both the initialization of the translation model $p(\tilde{f}|\tilde{e})$ and the phrase model training. While this way we can make full use of the available data and avoid unknown words during training, it has the drawback that it can lead to over-fitting. All phrases extracted from a specific sentence pair $f_n, e_n$ can be used for the alignment of this sentence pair. This includes longer phrases, which only match in very few sentences in the data. Therefore those long phrases are trained to fit only a few sentence pairs, strongly overestimating their translation probabilities and failing to generalize. In the extreme case, whole sentences will be learned as phrasal translations. The average length of the used phrases is an indicator of this kind of over-fitting, as the number of matching training sentences decreases with increasing phrase length. We can see an example in Figure 2. Without leaving-one-out the sentence is segmented into a few long phrases, which are unlikely to occur in data to be translated. Phrase boundaries seem to be unintuitive and based on some hidden structures. With leaving-one-out the phrases are shorter and therefore better suited for generalization to unseen data.

Previous attempts have dealt with the over-fitting problem by limiting the maximum phrase length (DeNero et al., 2006; Marcu and Wong, 2002) and by smoothing the phrase probabilities by lexical models on the phrase level (Ferrer and Juan, 2009).
However, (DeNero et al., 2006) experienced similar over-fitting with short phrases due to the fact that the same word sequence can be segmented in different ways, leading to specific segmentations being learned for specific training sentence pairs. Our results confirm these findings. To deal with this problem, instead of simple phrase length restriction, we propose to apply the leaving-one-out method, which is also used for language modeling techniques (Kneser and Ney, 1995).

When using leaving-one-out, we modify the phrase translation probabilities for each sentence pair. For a training example $f_n, e_n$, we have to remove all phrases $C_n(\tilde{f}, \tilde{e})$ that were extracted from this sentence pair from the phrase counts that

ment and the counting of phrases are done separately for each block and then accumulated to build the updated phrase model.

4 Phrase Model Training

The produced phrase alignment can be given as a single best alignment, as the n-best alignments or as an alignment graph representing all alignments considered by the decoder. We have developed two different models for phrase translation probabilities which make use of the force-aligned training data. Additionally we consider smoothing by different kinds of interpolation of the generative model with the state-of-the-art heuristics.

4.1 Viterbi

The simplest of our generative phrase models estimates phrase translation probabilities by their relative frequencies in the Viterbi alignment of the data, similar to the heuristic model but with counts from the phrase-aligned data produced in training rather than computed on the basis of a word alignment. The translation probability of a phrase pair $(\tilde{f}, \tilde{e})$ is estimated as

$$p_{FA}(\tilde{f}|\tilde{e}) = \frac{C_{FA}(\tilde{f}, \tilde{e})}{\sum_{\tilde{f}'} C_{FA}(\tilde{f}', \tilde{e})} \tag{5}$$

where $C_{FA}(\tilde{f}, \tilde{e})$ is the count of the phrase pair $(\tilde{f}, \tilde{e})$ in the phrase-aligned training data. This can be applied to either the Viterbi phrase alignment or an n-best list. For the simplest model, each hypothesis in the n-best list is weighted equally. We will refer to this model as the count model as we simply count the number of occurrences of a phrase pair. We also experimented with weighting the counts with the estimated likelihood of the corresponding entry in the n-best list. The sum of the likelihoods of all entries in an n-best list is normalized to 1. We will refer to this model as the weighted count model.

4.2 Forward-backward

Ideally, the training procedure would consider all possible alignment and segmentation hypotheses, where alternatives are weighted by their posterior probability. As discussed earlier, the run-time requirements for computing all possible alignments are prohibitive for large data tasks.
However, we can approximate the space of all possible hypotheses by the search space that was used for the alignment. While this might not cover all phrase translation probabilities, it allows the search space and translation times to be feasible and still contains the most probable alignments. This search space can be represented as a graph of partial hypotheses (Ueffing et al., 2002) on which we can compute expectations using the Forward-Backward algorithm. We will refer to this alignment as the full alignment. In contrast to the method described in Section 4.1, phrases are weighted by their posterior probability in the word graph. As suggested in work on minimum Bayes-risk decoding for SMT (Tromble et al., 2008; Ehling et al., 2007), we use a global factor to scale the posterior probabilities.

4.3 Phrase Table Interpolation

As (DeNero et al., 2006) have reported improvements in translation quality by interpolation of phrase tables produced by the generative and the heuristic model, we adopt this method and also report results using log-linear interpolation of the estimated model with the original model.

The log-linear interpolations $p_{int}(\tilde{f}|\tilde{e})$ of the phrase translation probabilities are estimated as

$$p_{int}(\tilde{f}|\tilde{e}) = \left( p_H(\tilde{f}|\tilde{e}) \right)^{1-\omega} \cdot \left( p_{gen}(\tilde{f}|\tilde{e}) \right)^{\omega} \tag{6}$$

where $\omega$ is the interpolation weight, $p_H$ the heuristically estimated phrase model and $p_{gen}$ the count model. The interpolation weight $\omega$ is adjusted on the development corpus. When interpolating phrase tables containing different sets of phrase pairs, we retain the intersection of the two.

As a generalization of the fixed interpolation of the two phrase tables we also experimented with adding the two trained phrase probabilities as additional features to the log-linear framework. This way we allow different interpolation weights for the two translation directions and can optimize them automatically along with the other feature weights. We will refer to this method as feature-wise combination.
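The fixed interpolation of Equation 6 can be sketched in a few lines. This is a minimal illustration under assumed data structures: phrase tables are plain dictionaries keyed by (source phrase, target phrase), and the function name `interpolate_phrase_tables` and the toy probabilities are hypothetical. As in Equation 6, the result is not renormalized.

```python
def interpolate_phrase_tables(p_heur, p_gen, omega):
    """Log-linear interpolation p_H^(1 - omega) * p_gen^omega,
    restricted to phrase pairs present in both tables."""
    shared = p_heur.keys() & p_gen.keys()  # retain only the intersection
    return {
        pair: (p_heur[pair] ** (1.0 - omega)) * (p_gen[pair] ** omega)
        for pair in shared
    }

# Toy tables with assumed values; each value stands for p(f|e).
p_heur = {("das haus", "the house"): 0.25, ("haus", "house"): 0.5}
p_gen = {("das haus", "the house"): 1.0, ("das", "the"): 0.3}
p_int = interpolate_phrase_tables(p_heur, p_gen, omega=0.5)
```

Only the phrase pair shared by both toy tables survives, which mirrors the intersection rule stated above.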
Again, we retain the intersection of the two phrase tables. With good log-linear feature weights, feature-wise combination should perform at least as well as fixed interpolation. However, the results presented in Table 5 show a slightly lower performance. This illustrates that a higher number of features results in a less reliable optimization of the log-linear parameters.

Table 2: Statistics for the Europarl German-English data

                        German       English
TRAIN  Sentences             1 311 815
       Run. Words   34 398 651    36 090 085
       Vocabulary      336 347       118 112
       Singletons      168 686        47 507
DEV    Sentences                 2 000
       Run. Words       55 118        58 761
       Vocabulary        9 211         6 549
       OOVs                284            77
TEST   Sentences                 2 000
       Run. Words       56 635        60 188
       Vocabulary        9 254         6 497
       OOVs                266            89

5 Experimental Evaluation

5.1 Experimental Setup

We conducted our experiments on the German-English data published for the ACL 2008 Workshop on Statistical Machine Translation (WMT08). Statistics for the Europarl data are given in Table 2.

We are given the three data sets TRAIN, DEV and TEST. For the heuristic phrase model, we first use GIZA++ (Och and Ney, 2003) to compute the word alignment on TRAIN. Next we obtain a phrase table by extraction of phrases from the word alignment. The scaling factors of the translation models have been optimized for BLEU on the DEV data.

The phrase table obtained by heuristic extraction is also used to initialize the training. The forced alignment is run on the training data TRAIN from which we obtain the phrase alignments. Those are used to build a phrase table according to the proposed generative phrase models. Afterward, the scaling factors are trained on DEV for the new phrase table. By feeding back the new phrase table into forced alignment we can reiterate the training procedure. When training is finished the resulting phrase model is evaluated on DEV and TEST.

Table 3: Comparison of different training setups for the count model on DEV.

leaving-one-out   max phr.len.   BLEU   TER
baseline          6              25.7   61.1
none              2              25.2   61.3
                  3              25.7   61.3
                  4              25.5   61.4
                  5              25.5   61.4
                  6              25.4   61.7
standard          6              26.4   60.9
length-based      6              26.5   60.6
Additionally, we can apply smoothing by interpolation of the new phrase table with the original one estimated heuristically, retrain the scaling factors and evaluate afterwards.

The baseline system is a standard phrase-based SMT system with eight features: phrase translation and word lexicon probabilities in both translation directions, phrase penalty, word penalty, language model score and a simple distance-based reordering model. The features are combined in a log-linear way. To investigate the generative models, we replace the two phrase translation probabilities and keep the other features identical to the baseline. For the feature-wise combination the two generative phrase probabilities are added to the features, resulting in a total of 10 features. We used a 4-gram language model with modified Kneser-Ney discounting for all experiments. The metrics used for evaluation are the case-sensitive BLEU (Papineni et al., 2002) score and the translation edit rate (TER) (Snover et al., 2006) with one reference translation.

5.2 Results

In this section, we investigate the different aspects of the models and methods presented before. We will focus on the proposed leaving-one-out technique and show that it helps in finding good phrasal alignments on the training data that lead to improved translation models. Our final results show an improvement of 1.4 BLEU over the heuristically extracted phrase model on the test data set.

In Section 3.2 we have discussed several methods which aim to overcome the over-fitting problems described in (DeNero et al., 2006). Table 3 shows translation scores of the count model on the development data after the first training iteration for both leaving-one-out strategies we have introduced and for training without leaving-one-out with different restrictions on phrase length. We can see that by restricting the source phrase length to a maximum of 3 words, the trained model is close to the performance of the heuristic phrase model. With the application of leaving-one-out, the trained model is superior to the baseline, the length-based strategy performing slightly better than standard leaving-one-out. For these experiments the count model was estimated with a 100-best list.

Figure 3: Performance on DEV in BLEU of the count model plotted against size n of n-best list on a logarithmic scale.

The count model we describe in Section 4.1 estimates phrase translation probabilities using counts from the n-best phrase alignments. For smaller n the resulting phrase table contains fewer phrases and is more deterministic. For higher values of n more competing alignments are taken into account, resulting in a bigger phrase table and a smoother distribution. We can see in Figure 3 that translation performance improves by moving from the Viterbi alignment to n-best alignments. The variations in performance with sizes between n = 10 and n = 10000 are less than 0.2 BLEU. The maximum is reached for n = 100, which we used in all subsequent experiments. An additional benefit of the count model is the smaller phrase table size compared to the heuristic phrase extraction. This is consistent with the findings of (Birch et al., 2006). Table 4 shows the phrase table sizes for different n. With n = 100 we retain only 17% of the original phrases.
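The count model and the weighted count model of Section 4.1, whose dependence on the n-best list size is discussed above, can be sketched as follows. The input format is an assumption made for this illustration: each sentence pair contributes one n-best list of (alignment, likelihood) entries, where an alignment is a list of (source phrase, target phrase) pairs. The name `count_model` is hypothetical.

```python
from collections import defaultdict

def count_model(nbest_lists, weighted=False):
    """Relative-frequency estimate of p(f|e) from n-best phrase alignments.

    With weighted=False every n-best entry counts equally (count model);
    with weighted=True each entry is weighted by its likelihood, normalized
    to sum to 1 within its n-best list (weighted count model)."""
    counts = defaultdict(float)
    for nbest in nbest_lists:
        total = sum(likelihood for _, likelihood in nbest)
        for alignment, likelihood in nbest:
            weight = likelihood / total if weighted else 1.0
            for f, e in alignment:
                counts[(f, e)] += weight
    # Normalize over all source phrases seen with the same target phrase.
    target_totals = defaultdict(float)
    for (f, e), c in counts.items():
        target_totals[e] += c
    return {(f, e): c / target_totals[e] for (f, e), c in counts.items()}

# One sentence pair with a 2-best list (assumed toy likelihoods).
nbest_lists = [[([("a", "x")], 0.75), ([("b", "x")], 0.25)]]
p_count = count_model(nbest_lists)
p_weighted = count_model(nbest_lists, weighted=True)
```

In this sketch a larger n-best list adds more distinct phrase pairs to `counts`, which is the mechanism behind the growth in phrase table size with n.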
Even for the full model, we do not retain all phrase table entries. Due to pruning in the forced alignment step, not all translation options are considered. As a result experiments can be done more rapidly and with less resources than with the heuristically extracted phrase table. Also, our experiments show that the increased performance of the count model is partly derived from the smaller phrase table size. In Table 5 we can see that the performance of the heuristic phrase model can be increased by 0.6 BLEU on TEST by filtering the phrase table to contain the same phrases as the count model and reoptimizing the log-linear model weights. The experiments on the number of different alignments taken into account were done with standard leaving-one-out.

Table 4: Phrase table size of the count model for different n-best list sizes, the full model and for heuristic phrase extraction.

N           # phrases   % of full table
1           4.9M          5.3
10          8.4M          9.1
100         15.9M        17.2
1000        27.1M        29.2
10000       40.1M        43.2
full        59.6M        64.2
heuristic   92.7M       100.0

The final results are given in Table 5. We can see that the count model outperforms the baseline by 0.8 BLEU on DEV and 0.9 BLEU on TEST after the first training iteration. The performance of the filtered baseline phrase table shows that part of that improvement derives from the smaller phrase table size. Application of cross-validation (cv) in the first iteration yields a performance close to training with leaving-one-out (l1o), which indicates that cross-validation can be safely applied to higher training iterations as an alternative to leaving-one-out. The weighted count model clearly under-performs the simpler count model. A second iteration of the training algorithm shows nearly no changes in BLEU score, but a small improvement in TER.
Here, we used the phrase table trained with leaving-one-out in the first iteration and applied cross-validation in the second iteration. Log-linear interpolation of the count model with the heuristic yields a further increase, showing an improvement of 1.3 BLEU on DEV and 1.4 BLEU on TEST over the baseline. The interpolation weight is adjusted on the development set and was set to ω = 0.6. Integrating both models into the log-linear framework (feat. comb.) yields a BLEU score slightly lower than with fixed interpolation on both DEV and TEST. This might be attributed to deficiencies in the tuning procedure. The full model, where we extract all phrases from the search graph, weighted with their posterior probability, performs comparably to the count model with a slightly worse BLEU and a slightly better TER.

Table 5: Final results for the heuristic phrase table filtered to contain the same phrases as the count model (baseline filt.), the count model trained with leaving-one-out (l1o) and cross-validation (cv), the weighted count model and the full model. Further, scores for fixed log-linear interpolation of the count model trained with leaving-one-out with the heuristic as well as a feature-wise combination are shown. The results of the second training iteration are given in the bottom row.

                   DEV            TEST
                   BLEU   TER     BLEU   TER
baseline           25.7   61.1    26.3   60.9
baseline filt.     26.0   61.6    26.9   61.2
count (l1o)        26.5   60.6    27.2   60.5
count (cv)         26.4   60.7    27.0   60.7
weight. count      25.9   61.4    26.4   61.3
full               26.3   60.0    27.0   60.2
fixed interpol.    27.0   59.4    27.7   59.2
feat. comb.        26.8   60.1    27.6   59.9
count, iter. 2     26.4   60.3    27.2   60.0

6 Conclusion

We have shown that training phrase models can improve translation performance on a state-of-the-art phrase-based translation model. This is achieved by training phrase translation probabilities in a way that they are consistent with their use in translation. A crucial aspect here is the use of leaving-one-out to avoid over-fitting.
We have shown that the technique is superior to limiting phrase lengths and smoothing with lexical probabilities alone.

While models trained from Viterbi alignments already lead to good results, we have demonstrated that considering the 100-best alignments allows us to better model the ambiguities in phrase segmentation.

The proposed techniques are shown to be superior to previous approaches that only used lexical probabilities to smooth phrase tables or imposed limits on the phrase lengths. On the WMT08 Europarl task we show improvements of 0.9 BLEU points with the trained phrase table and 1.4 BLEU points when interpolating the newly trained model with the original, heuristically extracted phrase table. In TER, improvements are 0.4 and 1.7 points.

In addition to the improved performance, the trained models are smaller, leading to faster and smaller translation systems.

Acknowledgments

This work was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation, and also partly based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the DARPA.

References

Alexandra Birch, Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Constraining the phrase-based, joint probability statistical translation model. In smt2006, pages 154–157, Jun.

Phil Blunsom, Trevor Cohn, and Miles Osborne. 2008. A discriminative latent variable model for statistical machine translation. In Proceedings of ACL-08: HLT, pages 200–208, Columbus, Ohio, June. Association for Computational Linguistics.

P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–312, June.

John DeNero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 25–28, Morristown, NJ, USA. Association for Computational Linguistics.

John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why Generative Phrase Models Underperform Surface Heuristics. In Proceedings of the Workshop on Statistical Machine Translation, pages 31–38, New York City, June.

John DeNero, Alexandre Bouchard-Côté, and Dan Klein. 2008. Sampling Alignment Structure under a Bayesian Translation Model. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 314–323, Honolulu, October.

Nicola Ehling, Richard Zens, and Hermann Ney. 2007. Minimum bayes risk decoding for bleu. In ACL ’07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 101–104, Morristown, NJ, USA. Association for Computational Linguistics.

Jesús-Andrés Ferrer and Alfons Juan. 2009. A phrase-based hidden semi-markov approach to machine translation. In Proceedings of European Association for Machine Translation (EAMT), Barcelona, Spain, May. European Association for Machine Translation.

Reinhard Kneser and Hermann Ney. 1995. Improved Backing-Off for M-gram Language Modelling. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 181–184, Detroit, MI, May.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics.

F.J. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP99), pages 20–28, University of Maryland, College Park, MD, USA, June.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA. Association for Computational Linguistics.

Wade Shen, Brian Delaney, Tim Anderson, and Ray Slyh. 2008. The MIT-LL/AFRL IWSLT-2008 MT System. In Proceedings of IWSLT 2008, pages 69–76, Hawaii, U.S.A., October.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. of AMTA, pages 223–231, Aug.

Roy Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice Minimum Bayes-Risk decoding for statistical machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 620–629, Honolulu, Hawaii, October. Association for Computational Linguistics.

N. Ueffing, F.J. Och, and H. Ney. 2002. Generation of word graphs in statistical machine translation. In Proc. of the Conference on Empirical Methods for Natural Language Processing, pages 156–163, Philadelphia, PA, USA, July.

Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An End-to-End Discriminative Approach to Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 761–768, Sydney, Australia.

Daniel Marcu and William Wong. 2002. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), July.

J.A. Nelder and R. Mead. 1965. A Simplex Method for Function Minimization. The Computer Journal, 7:308–313.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449, December.
