27.02.2014 Views

Proceedings of the Workshop on Discourse in Machine Translation

new better than old     24
old better than new     13
both equally wrong       9
both equally correct     4

Table 3: The results of manual evaluation conducted on 50 sentences translated by TectoMT in the original settings (old) and with the new translation model for it (new).

similar score, which correlates with the findings from the intrinsic evaluation (see Table 2). This accords with the similar experience of Le Nagard and Koehn (2010) and Guillou (2012) and provides further evidence that the BLEU metric is inaccurate for measuring pronoun translation.

Manual evaluation gives a more realistic view. We randomly sampled 50 of the 166 sentences that differ, and one annotator assessed which of the two systems gave a better translation. Table 3 shows that in almost half of the cases the change was an improvement. Including the sentences that are acceptable in both settings, the new approach picked the correct Czech counterpart of it in 22% more sentences than the original approach. Since the changed sentences account for almost 39% of all sentences containing it, the overall proportion of improved sentences with it is around 8.5% in total.
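The percentages in this paragraph can be recomputed from the counts in Table 3. The short sketch below shows one reading of how they fit together; the variable names are ours, and the 39% share is taken from the text as a rounded value, so the final figure only approximately matches the reported 8.5%.

```python
# Counts copied from Table 3 (50 manually evaluated sentences).
counts = {
    "new better than old": 24,
    "old better than new": 13,
    "both equally wrong": 9,
    "both equally correct": 4,
}
total = sum(counts.values())

# A system counts as correct on a sentence if it is the better one
# or if both systems are equally correct.
new_correct = counts["new better than old"] + counts["both equally correct"]
old_correct = counts["old better than new"] + counts["both equally correct"]

# Net gain of the new model within the sample of changed sentences.
net_gain = (new_correct - old_correct) / total

# Share of sentences containing "it" whose translation changed
# ("almost 39%" in the text, so this is a rounded assumption).
changed_share = 0.39

# Estimated gain over all sentences containing "it".
overall_gain = net_gain * changed_share

print(f"net gain on changed sentences: {net_gain:.1%}")
print(f"estimated overall gain: {overall_gain:.1%}")
```

Under this reading, the 22% figure is the difference between 28 and 17 correct sentences out of 50, and the overall estimate is obtained by scaling it by the share of changed sentences.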

8 Discussion

Inspecting the manually evaluated translations for types of improvements and losses, we found that in none of the changed sentences did the original system decide to omit ten (obtained by the rule) on the surface. This shows that the new approach agrees with the original one on how to omit personal pronouns and mainly addresses the overly simplistic assignment of the demonstrative ten.

The distribution of target classes over corrected sentences is almost uniform. In 13 out of 24 improvements, the new system succeeded in correctly resolving the Null class, while in the remaining 11 cases the corrected class was PersPron. It took advantage mostly of the syntax-based features in the former and of suggestions given by the NADA anaphoricity resolver in the latter.

Examining the errors, we observed that the majority of them occur in structures with “it is”. These errors stem mostly from incorrect activation of syntactic features due to parsing and POS tagging errors. Example 11 (the Czech sentence is an MT output) shows the latter: the POS tagger erroneously labeled the word soy as an adjective. That resulted in activating the feature for adjectival predicates followed by that (Figure 2c) instead of a feature indicating cleft structures (Figure 2d), thus preferring the label Null to the correct To.

(11) SOURCE: It is just soy that all well-known manufacturers use now.
     TECTOMT: Je to jen sójové, že známí výrobci všech používají teď.
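To make the failure mode in Example 11 concrete, the toy sketch below shows how a single POS tag can decide which feature fires. It is an illustration only: the function, the feature names and the tag set are hypothetical, and the actual features are syntax-based rather than computed over a flat tag sequence as done here.

```python
# Toy illustration only: the feature names and the tag set are hypothetical
# and do not reproduce the feature extractor used in the system.
def it_is_feature(tokens, tags):
    """Return the name of the "it is ... that" feature that fires, or None."""
    lowered = [t.lower() for t in tokens]
    if lowered[:2] != ["it", "is"] or "that" not in lowered:
        return None
    that_pos = lowered.index("that")
    if that_pos < 3:
        return None  # no predicate word between "it is" and "that"
    pred_tag = tags[that_pos - 1]  # tag of the word right before "that"
    if pred_tag == "ADJ":
        return "adj_pred_that"  # pattern of Figure 2c, favours Null
    if pred_tag == "NOUN":
        return "cleft"  # pattern of Figure 2d, favours To
    return None


tokens = "It is just soy that all well-known manufacturers use now".split()
correct_tags = ["PRON", "VERB", "ADV", "NOUN", "SCONJ",
                "DET", "ADJ", "NOUN", "VERB", "ADV"]

# Simulate the tagging error from Example 11: "soy" labeled as an adjective.
wrong_tags = list(correct_tags)
wrong_tags[tokens.index("soy")] = "ADJ"

print(it_is_feature(tokens, correct_tags))
print(it_is_feature(tokens, wrong_tags))
```

With the correct tags the cleft feature fires, while the mistagged input activates the adjectival-predicate feature instead, which mirrors the preference for Null over To described above.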

9 Conclusion

In this work we presented a novel approach to dealing with the translation of the English personal pronoun it. We have shown that the mapping between the categories of it and the ways of translating it to Czech is not one-to-one. In order to deal with this, we designed a discriminative translation model of it for the TectoMT deep syntax MT framework.

We have built a system that outperforms its predecessor in 8.5% of sentences containing it, taking advantage of features based on the rich syntactic annotation the MT system provides, external tools for anaphoricity resolution, and features capturing lexical co-occurrence in a massive parallel corpus.

The main bottleneck that hampered bigger improvements is the manual annotation of the training data. We managed to annotate just 1/6 of the data, which did not provide sufficient evidence for some specific features.

Our main objective for future work is thus to reduce the need for manual annotation by discovering ways of automatically extracting reliable classes from a semi-manually annotated corpus such as PCEDT.

Acknowledgments<br />

This work has been supported by the Grant Agency of the Czech Republic (grants P406/12/0658 and P406/2010/0875), the grant GAUK 4226/2011 and the EU FP7 project Khresmoi (contract no. 257528). This work has used language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013).
