
Proceedings of the Workshop on Discourse in Machine Translation


ased synthesis stage, when a surface Czech sentence is generated from its deep syntactic structure.

Deep syntactic representation of a sentence follows the Prague tectogrammatics theory (Sgall, 1967; Sgall et al., 1986). It is a dependency tree whose nodes correspond to the content words in the sentence. Personal pronouns missing on the surface are reconstructed in special nodes. Nodes are assigned semantic roles (called functors), and grammatical information is stored in so-called grammatemes. Furthermore, the tectogrammatical representation is the place where coreference relations are annotated.
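The representation described above can be pictured as a small data structure. The following is a minimal sketch only: the class `TNode`, its field names, and the grammateme values are hypothetical illustrations and do not reproduce the TectoMT data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity-based equality, since nodes hold back-references
class TNode:
    """Hypothetical, simplified tectogrammatical node (not the TectoMT API)."""
    t_lemma: str                       # content-word lemma, or "#PersPron"
    functor: str                       # semantic role label, e.g. "ACT" or "PAT"
    grammatemes: Dict[str, str] = field(default_factory=dict)
    children: List["TNode"] = field(default_factory=list)
    parent: Optional["TNode"] = field(default=None, repr=False)
    coref: Optional["TNode"] = field(default=None, repr=False)  # antecedent, if any

    def add_child(self, child: "TNode") -> "TNode":
        """Attach child under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child


# Czech "Přišel." ("He came."): the subject is missing on the surface,
# so it is reconstructed as a special #PersPron node under the verb.
# The grammateme values below are illustrative assumptions.
verb = TNode("přijít", "PRED")
subject = verb.add_child(
    TNode("#PersPron", "ACT", {"gender": "anim", "number": "sg"})
)
```

The coreference link is kept as a plain reference to another node, which is enough to express that a pronoun points back to its antecedent.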

4.1 Model of it within TectoMT

The transfer stage, which maps an English tectogrammatical tree to a Czech one, is where the translation model of it is applied. For every English node corresponding to it, a feature vector is extracted and fed into a discriminative resolver that assigns it one of three classes: PersPron, To and Null, corresponding to the main Czech types introduced in Section 3.

If labeled as PersPron, the English node is mapped to a Czech #PersPron node and the English coreference link is projected. During synthesis, it is decided whether the pronoun should be expressed on the surface, its gender and number are copied from the antecedent’s head, and finally the correct form (if any) is generated.

Obtaining class To makes things easier. The English node is simply mapped to a Czech node containing the pronoun ten with its gender and number set to neuter singular, so that the correct form to is generated later.

Last, if it is assigned Null, no corresponding node is generated on the Czech side, but the Czech counterpart of the governing verb is forced to be in neuter singular.
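The handling of the three classes can be summarized in a short sketch. It assumes nodes are plain dictionaries; the function names `transfer_it` and `copy_agreement` and all dictionary keys are hypothetical and are not taken from TectoMT.

```python
from typing import Optional


def transfer_it(label: str, cs_verb: dict,
                antecedent: Optional[dict] = None) -> Optional[dict]:
    """Build the Czech node for one English "it", given the resolver's label."""
    if label == "PersPron":
        # Project the coreference link; gender and number stay open
        # until synthesis copies them from the antecedent's head.
        return {"t_lemma": "#PersPron", "coref": antecedent}
    if label == "To":
        # Neuter singular of "ten" later surfaces as the form "to".
        return {"t_lemma": "ten", "gender": "neut", "number": "sg"}
    if label == "Null":
        # No Czech node; only the governing verb is constrained.
        cs_verb["gender"] = "neut"
        cs_verb["number"] = "sg"
        return None
    raise ValueError(f"unknown class: {label}")


def copy_agreement(pronoun: dict) -> dict:
    """Synthesis step: copy gender and number from the antecedent's head."""
    head = pronoun.get("coref")
    if head is not None:
        pronoun["gender"] = head["gender"]
        pronoun["number"] = head["number"]
    return pronoun


# Usage: "it" referring to "kniha" (book, feminine singular in Czech).
kniha = {"t_lemma": "kniha", "gender": "fem", "number": "sg"}
pronoun = copy_agreement(transfer_it("PersPron", {}, kniha))
```

Whether the pronoun is actually expressed on the surface is a separate synthesis decision that this sketch leaves out.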

5 Prague Czech-English Dependency Treebank as a source of data

The Prague Czech-English Dependency Treebank (Hajič et al., 2011, PCEDT) is a manually parsed Czech-English parallel corpus comprising over 1.2 million words for each language in almost 50,000 sentence pairs. The English part contains the entire Penn Treebank–Wall Street Journal Section (Linguistic Data Consortium, 1999). The Czech part consists of translations of all the texts from the English part. The data from both parts are annotated on three layers following the theory of Prague tectogrammatics: the morphological layer (where each token from the sentence gets a lemma and a POS tag), the analytical layer (surface syntax in the form of a dependency tree, where each node corresponds to a token in the sentence) and the tectogrammatical representation (see Section 4).

Sentences of PCEDT have been automatically morphologically annotated and parsed into analytical dependency trees.⁴ The tectogrammatical trees in both language parts have been annotated manually (Hajič et al., 2012). The nodes of Czech and English trees have been automatically aligned on the analytical as well as the tectogrammatical layer (Mareček et al., 2008).

5.1 Extraction of Classes

The shortcomings of the automatic alignment are particularly harmful for pronouns and zero anaphors, which can replace a whole range of content words and whose meaning is inferred mainly from the context. The situation is better for verbs, their usual parents in dependency trees: since they carry meaning to a greater extent, their automatic alignment is of higher quality.

Thus, we did not search for a Czech counterpart of it by following the alignment of it itself. Using the fact that the verb alignment is more reliable and functors in tectogrammatical trees have been manually corrected, we followed the alignment of the parent of it (a verb) and selected the Czech subtree with the same tectogrammatical functor as it had on the English side. If the obtained subtree was a single node of type #PersPron or ten, we assigned class PersPron or To, respectively, to the corresponding it. This approach also relies on the assumption that semantic roles do not change in translation.
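The extraction procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: nodes are plain dictionaries, `verb_alignment` is a hypothetical mapping from English verb ids to aligned Czech verb nodes, and the treatment of ambiguous cases (several children with the same functor) is an assumption of this sketch.

```python
from typing import Dict, Optional

# Czech node types that yield an automatic label.
CLASS_BY_LEMMA = {"#PersPron": "PersPron", "ten": "To"}


def extract_class(it_functor: str, en_verb_id: str,
                  verb_alignment: Dict[str, dict]) -> Optional[str]:
    """Return "PersPron" or "To", or None when no label can be derived."""
    # Follow the alignment of the parent verb, not of "it" itself.
    cs_verb = verb_alignment.get(en_verb_id)
    if cs_verb is None:
        return None
    # Select the Czech subtree carrying the same functor as "it".
    matches = [child for child in cs_verb.get("children", [])
               if child["functor"] == it_functor]
    if len(matches) != 1:
        return None  # assumption: ambiguous or missing subtree is left unlabeled
    node = matches[0]
    if node.get("children"):
        return None  # the subtree is larger than a single node
    return CLASS_BY_LEMMA.get(node["t_lemma"])


# Usage with a toy alignment: English verb "v1" is aligned to a Czech verb
# whose ACT child is the demonstrative pronoun "ten".
alignment = {
    "v1": {"t_lemma": "být",
           "children": [{"t_lemma": "ten", "functor": "ACT"}]},
}
label = extract_class("ACT", "v1", alignment)
```

Instances for which the function returns None correspond to those that cannot be labeled by this heuristic alone.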

The automatic acquisition of classes covered more than 60% of instances; the rest had to be labeled manually. During the annotation, we followed these rules:

1. If a demonstrative pronoun to is present in the Czech sentence or if a personal pronoun is either present or unexpressed, assign the instance to the corresponding class.

⁴ The English dependency trees were built by automatically transforming the original phrase-structure annotation of the Penn Treebank.
