
Proceedings of the Workshop on Discourse in Machine Translation


author uses over a large corpus to show what words and what parts-of-speech are more likely to be translated consistently than others. However, Melamed’s analysis ignores any segmentation of the corpus by document, topic, speaker/writer or translator, considering only overall translational distributions. It is therefore similar to that which can be gleaned from the phrase table in a modern SMT system.

2.2 Enforcing and Encouraging Consistency

A number of approaches have been taken to both encourage and enforce lexical consistency in SMT. These range from the cache-based model approaches of Tiedemann (2010a; 2010b) and Gong et al. (2011), to the post-editing approach of Xiao et al. (2011) and the discriminative learning approaches of Ma et al. (2011) and He et al. (2011).

Carpuat (2009) and Ture et al. (2012) suggested that the one sense per discourse constraint (Gale et al., 1992) might apply equally well to translation, as one sense per translation. Both demonstrated that exploiting this constraint in SMT led to better quality translations. Ture et al. (2012) themselves encourage consistency using soft constraints implemented as additional features in a hierarchical phrase-based translation model.

What has not been adequately addressed in the available MT literature is where and when lexical consistency is desirable in translation.

2.3 Measuring Consistency

In contrast with entropy following from lexical properties of words (i.e. how many senses a word has, and how many different possible ways there are of translating each sense in a given target language), as explored by Melamed (1997), Itagaki et al. (2007) developed a way to measure the terminological consistency of a single document. They define consistency as a measure of the number of translation variations for a term and the frequency of each variation. They adapted the Herfindahl-Hirschman Index (HHI), typically used to measure market concentration, to measure the consistency of a single term in a single document. HHI is defined as:

\[
\mathrm{HHI} = \sum_{i=1}^{n} s_i^2
\]

where $i$ ranges over the $n$ different ways that the given term has been translated in the document, and $s_i$ is the ratio of the number of times the term has been translated as $i$ to the total number of times it has been translated. The lower the index, the more variation there is in the translation of the term, i.e. the less consistent the translation. The maximum index is 10,000 with shares expressed as percentages (or 1 using the normalised scale) for a completely consistent translation.
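A short worked bound, not stated in the original text but following directly from the definition, may help when reading scores on the normalised scale. Because the shares $s_i$ sum to 1, a term with $n$ observed translations cannot score below $1/n$, and it reaches that floor only when all $n$ translations are equally frequent:

```latex
\[
\sum_{i=1}^{n} s_i = 1
\quad\Longrightarrow\quad
\frac{1}{n} \;\le\; \mathrm{HHI} = \sum_{i=1}^{n} s_i^2 \;\le\; 1
\]
% Lower bound: Cauchy-Schwarz, with equality iff s_i = 1/n for every i.
% Upper bound: equality iff n = 1, i.e. the term has a single translation.
```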

HHI is best illustrated with examples of distributions over a single document. An English word with two French translations that are observed with equal frequency will receive a score of $0.50^2 + 0.50^2 = 0.5$. A different English word with two French translations observed 90% and 10% of the time will receive a score of $0.90^2 + 0.10^2 = 0.82$, representing a more consistent translation of the English word. When the number of possible French translations increases, the HHI score will likely decrease unless one translation is much more frequent, as in the previous example. An English word with three translations observed with equal frequency (33.3% each) will have a score of $0.33^2 + 0.33^2 + 0.33^2 \approx 0.33$, representing a word that is translated with lower consistency.
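The worked examples can be checked with a few lines of Python. This is a minimal sketch of the normalised HHI as defined above, not code from Itagaki et al.; the function name `hhi` and the French words used as sample translations are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Iterable


def hhi(translations: Iterable[str]) -> float:
    """Normalised HHI (0 < HHI <= 1) over the observed translations of one term."""
    counts = Counter(translations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the term has no observed translations")
    # s_i = count_i / total; the index is the sum of the squared shares.
    return sum((c / total) ** 2 for c in counts.values())


# Hypothetical observations of one English word within a single document.
even_split = ["maison"] * 5 + ["domicile"] * 5   # shares 0.50 / 0.50
skewed = ["maison"] * 9 + ["domicile"] * 1       # shares 0.90 / 0.10
three_way = ["maison", "domicile", "foyer"]      # shares 1/3 each

print(round(hhi(even_split), 2))  # 0.5
print(round(hhi(skewed), 2))      # 0.82
print(round(hhi(three_way), 2))   # 0.33
```

Multiplying the result by 10,000 gives the index on the unnormalised scale.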

Itagaki et al. incorporate these HHI scores (one score per term, per document) in a wider calculation that measures inter-document consistency of a set of documents that all use the same term. As the analyses presented in this paper are concerned with single documents and their translations, the per term, per document HHI scores are sufficient.
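One way to obtain a score per term for a single document is sketched below. It assumes that pairs of (source term, observed translation) have already been extracted from the document, for example from word alignments; that input format, the function name `hhi_per_term` and the sample data are hypothetical and are not taken from the paper. The inter-document calculation of Itagaki et al. is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def hhi_per_term(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each source term in one document to the normalised HHI of its translations."""
    by_term: Dict[str, Counter] = defaultdict(Counter)
    for source_term, translation in pairs:
        by_term[source_term][translation] += 1

    scores: Dict[str, float] = {}
    for term, counts in by_term.items():
        total = sum(counts.values())
        scores[term] = sum((c / total) ** 2 for c in counts.values())
    return scores


# Hypothetical aligned pairs taken from a single document.
document_pairs = [
    ("house", "maison"), ("house", "maison"), ("house", "domicile"),
    ("bank", "banque"), ("bank", "banque"),
]

scores = hhi_per_term(document_pairs)
for term, score in sorted(scores.items()):
    print(f"{term}: {score:.2f}")
```

Here "bank" is always translated the same way and scores 1, while "house" has shares of 2/3 and 1/3 and scores 5/9.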

3 Methodology

This section describes analyses of manual (human) translation and automated translation (by a phrase-based SMT system). The data used is described in Section 3.1 and the methods for analysing consistency in human and automated translation are described in Sections 3.2 and 3.3.

3.1 Data

As the focus of the analysis is lexical consistency, it was important to select texts that were written/translated by the same author. The typical corpora used in training SMT systems were dismissed: Europarl because speakers change frequently, and news-crawl because the articles are typically too short to exhibit much lexical repetition. Instead I selected the INTERSECT corpus (Salkie, 2010), which contains a collection of sentence-aligned parallel texts from different genres. From this corpus I extracted a number of texts from the English-
