13.07.2015 Views

Integration of an Arabic Transliteration Module into a Statistical ...


we computed the BLEU score on two different sub-portions of the test set: first, on the sentences with OOVs; second, only on the sentences containing OOV named entities. The BLEU increase on different portions of the test set is shown in table 3.

Test set   Portion             Baseline   Method 4
dev        OOV sentences       39.17      40.02
dev        OOV-NE sentences    44.56      46.31
blind      OOV sentences       43.93      45.07
blind      OOV-NE sentences    42.32      44.87

Table 3: BLEU score on different portions of the test sets.

To set an upper bound on how much applying any transliteration module can contribute to the overall results, we developed an oracle-like dictionary for the OOVs in the test sets, which was then used to create a markup Arabic text. By feeding this markup input to the MT system we obtained the result shown in column 5 of table 2. This is the performance our system would achieve if it had perfect accuracy in transliteration, including correctly guessing what errors the human translators made in the references. Method 4 achieves 70% of this maximum gain on dev, and 53% on blind.

5 Conclusion

This paper has described the integration of a transliteration module into a state-of-the-art statistical machine translation (SMT) system for the Arabic to English task. The final version of the transliteration module operates in three phases. First, it generates English letter sequences corresponding to the Arabic letter sequence; for the typical case where the Arabic omits diacritics, this often means that the English letter sequence is incomplete (e.g., vowels are often missing).
In the next phase, the module tries to guess the missing English letters. In the third phase, the module uses a huge collection of English unigrams to filter out improbable or impossible English words and names. We described four possible methods for integrating this module in an SMT system. Two of these methods require NE taggers of higher quality than those available to us, and were not explored experimentally. Method 3 inserts the top-scoring candidate from the transliteration module in the translation wherever there was an Arabic OOV in the source. Method 4 outputs multiple candidates from the transliteration module, each with a score; the SMT system combines these scores with language model scores to decide which candidate will be chosen. In our experiments, Method 4 consistently outperformed Method 3. Note that although we used BLEU as the metric for all experiments in this paper, BLEU greatly understates the importance of accurate transliteration for many practical SMT applications.
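
As a rough illustration of the three transliteration phases and of the Method 4 score combination summarized in the conclusion, the following Python sketch runs a toy version of the pipeline on a single romanized consonant skeleton. It is not the system described in this paper: the letter table, the vowel fills, the unigram counts, the language model probabilities, the log-linear weights, and all function names are hypothetical and chosen only to make the example self-contained.

```python
import math
from itertools import product

# All data below is hypothetical toy data, not taken from the paper.
# Source letters are written in a Buckwalter-style romanization (assumption).
CONSONANT_MAP = {"m": ["m"], "H": ["h"], "d": ["d"]}
VOWEL_FILLS = ["", "a", "e", "o", "u"]  # "" means no letter is inserted
UNIGRAM_COUNTS = {"mohamed": 120, "mohamad": 30, "mahmud": 25, "the": 5000}


def phase1_skeletons(source):
    """Phase 1: map each source letter to its possible English letters."""
    options = [CONSONANT_MAP[ch] for ch in source]
    return ["".join(letters) for letters in product(*options)]


def phase2_fill_vowels(skeleton):
    """Phase 2: guess missing letters by trying a fill in every gap."""
    if not skeleton:
        return []
    words = []
    for fills in product(VOWEL_FILLS, repeat=len(skeleton) - 1):
        word = skeleton[0] + "".join(v + c for v, c in zip(fills, skeleton[1:]))
        words.append(word)
    return words


def phase3_filter(words):
    """Phase 3: keep only words seen in the unigram list, with a
    relative-frequency score normalized over the surviving candidates."""
    kept = {w: UNIGRAM_COUNTS[w] for w in set(words) if w in UNIGRAM_COUNTS}
    total = sum(kept.values())
    return {w: count / total for w, count in kept.items()}


def transliterate(source):
    """Run the three phases and return {candidate: score}."""
    words = []
    for skeleton in phase1_skeletons(source):
        words.extend(phase2_fill_vowels(skeleton))
    return phase3_filter(words)


def top_candidate(candidates):
    """Method 3 analogue: take the best transliteration score alone."""
    return max(candidates, key=candidates.get)


def pick_with_lm(candidates, lm_prob, w_translit=1.0, w_lm=1.0, floor=1e-9):
    """Method 4 analogue: log-linear mix of the transliteration score and a
    language model probability for each candidate in its context."""
    def combined(word):
        return (w_translit * math.log(candidates[word])
                + w_lm * math.log(lm_prob.get(word, floor)))
    return max(candidates, key=combined)


if __name__ == "__main__":
    cands = transliterate("mHmd")
    # Toy language model scores for the candidates in some context.
    toy_lm = {"mohamed": 0.01, "mohamad": 0.01, "mahmud": 0.2}
    print(top_candidate(cands), pick_with_lm(cands, toy_lm))
```

The two selection functions differ only in whether context is consulted: `top_candidate` always returns the candidate the transliteration module prefers, while `pick_with_lm` can be overruled by the language model when the surrounding words favor another spelling.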
