Proceedings of the Workshop on Discourse in Machine Translation


phrase-table (Meyer and Popescu-Belis, 2012) or by using factored decoding (Meyer et al., 2012) to disambiguate connectives, with small improvements. Lexical consistency has been addressed by the use of post-processing (Carpuat, 2009), multi-pass decoding (Xiao et al., 2011; Ture et al., 2012), and cache models (Tiedemann, 2010; Gong et al., 2011). Gong et al. (2012) addressed the issue of tense selection for translation from Chinese, by the use of inter-sentential tense n-grams, exploiting information from previously translated sentences. Another way to use a larger context is by integrating word sense disambiguation and SMT. This has been done by re-initializing phrase probabilities for each sentence (Carpuat and Wu, 2007), by introducing extra features in the phrase table (Chan et al., 2007), or as a k-best re-ranking task (Specia et al., 2008). Another type of approach is to integrate topic modeling into phrase tables (Zhao and Xing, 2010; Su et al., 2012). For a more thorough overview of discourse in SMT, see Hardmeier (2012).

Here we instead choose to work with the recent document-level SMT decoder Docent (Hardmeier et al., 2012). Unlike in traditional decoding, where documents are generated sentence by sentence, feature models in Docent always have access to the complete discourse context, even before decoding is finished. It implements the phrase-based SMT approach (Koehn et al., 2003) and is based on local search, where a state consists of a full translation of a document, which is improved by applying a series of operations. A hill-climbing strategy is used to find a (local) maximum. The operations allow changing the translation of a phrase, changing the word order by swapping the positions of two phrases, and resegmenting phrases. The initial state can either be initialized randomly in monotonic order, or be based on an initial run from a standard sentence-based decoder. The number of iterations in the decoder is controlled by two parameters, the maximum number of iterations and a rejection limit, which stops the decoder if no change was made in a certain number of iterations. This setup is not limited by dynamic programming constraints, and enables the use of the translated target document to extract features. It is thus easy to directly integrate discourse-level features into Docent. While we use this specific decoder in our experiments, the method proposed for document-level feature weight optimization is not limited to it. It can be used with any decoder that outputs feature values at the document level.
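The search strategy described above can be sketched as a generic hill-climbing loop. This is a simplified illustration, not Docent's actual API: `state`, `score`, and the operation functions are hypothetical placeholders for the document translation state, the model score, and the change/swap/resegment operations.

```python
import random

# Sketch of document-level hill climbing in the style of Docent.
# A state is a full translation of a document; each iteration proposes
# a local change and keeps it only if the model score improves.
def hill_climb(state, score, propose_ops, max_iters=100000, rejection_limit=10000):
    """state: full document translation; score: model score function;
    propose_ops: functions that return a modified copy of a state."""
    best_score = score(state)
    rejections = 0
    for _ in range(max_iters):
        op = random.choice(propose_ops)        # e.g. change a phrase translation,
        candidate = op(state)                  # swap two phrases, or resegment
        cand_score = score(candidate)
        if cand_score > best_score:            # hill climbing: accept improvements only
            state, best_score = candidate, cand_score
            rejections = 0
        else:
            rejections += 1
            if rejections >= rejection_limit:  # rejection limit: stop if no change
                break                          # was made for this many iterations
    return state, best_score
```

The two stopping parameters mirror the decoder's configuration: the loop ends either after the maximum number of iterations or once the rejection limit is reached.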

3 Sentence-Level Tuning

Traditionally, feature weight optimization, or tuning, for SMT is performed by an iterative process where a development set is translated to produce a k-best list. The parameters are then optimized using some procedure, generally to favor translations in the k-best list that have a high score on some MT metric. The translation step is then repeated using the new weights for decoding, and optimization is continued on a new k-best list, or on a combination of all k-best lists. This is repeated until some end condition is satisfied, for instance for a set number of iterations, until there are only very small changes in parameter weights, or until there are no new translations in the k-best lists.
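The iterative loop above can be sketched as follows. This is an illustrative skeleton, not a specific toolkit's interface: `decode_kbest` and `optimize_weights` are hypothetical stand-ins for a decoder that produces metric-scored k-best entries and an inner optimizer such as MERT or PRO.

```python
# Sketch of the standard outer tuning loop for SMT feature weights.
def tune(weights, decode_kbest, optimize_weights, max_rounds=20, tol=1e-4):
    """weights: initial feature weights;
    decode_kbest(weights) -> list of k-best entries (features, metric score);
    optimize_weights(weights, entries) -> new weights favoring high-metric entries."""
    all_kbest = []
    for _ in range(max_rounds):
        kbest = decode_kbest(weights)            # translate the dev set
        new_entries = [e for e in kbest if e not in all_kbest]
        if not new_entries:
            break                                # stop: no new translations produced
        all_kbest.extend(new_entries)            # optimize on the combined k-best lists
        new_weights = optimize_weights(weights, all_kbest)
        delta = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights
        if delta < tol:
            break                                # stop: only very small weight changes
    return weights
```

The three exit conditions correspond to the end conditions named in the text: a set number of rounds, negligible weight changes, or no new translations appearing in the k-best lists.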

SMT tuning is a hard problem in general, partly because the correct output is unreachable and also because the translation process includes latent variables, which means that many efficient standard optimization procedures cannot be used (Gimpel and Smith, 2012). Nevertheless, there are a number of techniques including MERT (Och, 2003), MIRA (Chiang et al., 2008; Cherry and Foster, 2012), PRO (Hopkins and May, 2011), and Rampion (Gimpel and Smith, 2012). All of these optimization methods can be plugged into the standard optimization loop. All of the methods work relatively well in practice, even though there are limitations, for instance that many methods are non-deterministic, meaning that their results are somewhat unstable. However, there are some important differences. MERT is based on scores for the full test set, whereas the other methods are based on sentence-level scores. MERT also has the drawback that it only works well for small sets of features. In this paper we are not concerned with the actual optimization algorithm and its properties, though; instead we focus on the integration of document-level decoding into the existing optimization frameworks.

In order to adapt sentence-level frameworks to our needs we need to address the granularity of scoring and the process of extracting k-best lists. For document-level features we do not have meaningful scores on the sentence level, which are required in standard optimization frameworks. Furthermore, the extraction of k-best lists is not as

