Proceedings - Translation Concepts
MuTra 2006 – Audiovisual Translation Scenarios: Conference Proceedings
Stephen Armstrong & Colm Caffrey & Marian Flanagan
rely on a bilingual corpus aligned at the sentential level. EBMT then goes a step further, beneath the sentence level, 'chunking' each sentence pair and producing an alignment of sub-sentential chunks. Going beneath the sentence gives more scope for capturing useful matches that might otherwise be missed. EBMT is based on the principle of recombining these chunks to produce an automatic translation.
4 Our System
Our first step was to gather a suitable corpus (described in section 4.1). We clean the data, split it into sentences, and store these sentences for later use. Our alignment program is then run on these files, resulting in a sententially aligned bilingual corpus. The next step is to split the sentences into smaller units. For this we implement the Marker Hypothesis, which states that 'all natural languages are marked for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes which appear in a limited set of grammatical contexts and which signal that context' (Green, 1979). Closed-class words can thus predict or indicate which word class will appear next. This is the basis of our system and of how we break up sentences and recombine them to generate new output, for example:
EN: Did you ever consult that private detective?
DE (German subtitle): Waren sie eigentlich bei diesem Privatdetektiv?
The resulting sub-sentential chunks would be the following, given that a chunk must contain at least one non-marker word, and that a marker word followed by another marker word is combined with it into one segment:

EN: that private detective?
DE: bei diesem Privatdetektiv?
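As a rough illustration, the segmentation rule just described (a chunk must contain at least one non-marker word, and consecutive marker words stay together in one segment) can be sketched as follows. The marker list here is a small hypothetical subset for this example only, not the system's actual marker set:

```python
# Hypothetical English marker words (closed-class items); the actual
# marker sets used by the system are not listed here.
MARKERS = {"did", "you", "that", "the", "a", "this"}

def marker_chunk(tokens, markers=MARKERS):
    """Split a token list at marker words, per the Marker Hypothesis.

    A new chunk starts at each marker word, but only once the current
    chunk already holds a non-marker word; this keeps runs of
    consecutive markers together in a single segment.
    """
    chunks, current = [], []
    for tok in tokens:
        is_marker = tok.lower() in markers
        has_content = any(t.lower() not in markers for t in current)
        if is_marker and has_content:
            chunks.append(" ".join(current))
            current = []
        current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

On the (pre-tokenized) example sentence, this yields the two chunks shown above: `['Did you ever consult', 'that private detective ?']`.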
These smaller units are stored in the corpus as aligned segments. If an input sentence cannot be matched against a complete sentence stored in the parallel corpus, it is broken up into smaller segments or chunks, and the system checks whether these input chunks are already stored in the corpus. If so, the corresponding target-language segments are retrieved and recombined with other segments or words to produce suitable output in the target language.
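A minimal sketch of this retrieval step, using hypothetical memories populated from the example pair above (the real system's matching is statistically and linguistically motivated, not a plain dictionary lookup):

```python
def translate(sentence, chunks, sentence_memory, chunk_memory):
    """EBMT retrieval sketch: prefer an exact full-sentence match; fall
    back to looking up each sub-sentential chunk and concatenating the
    retrieved target segments. Unmatched chunks are left as-is here."""
    if sentence in sentence_memory:
        return sentence_memory[sentence]
    return " ".join(chunk_memory.get(c, c) for c in chunks)

# Aligned segments as they might be stored in the corpus.
chunk_memory = {
    "Did you ever consult": "Waren sie eigentlich",
    "that private detective ?": "bei diesem Privatdetektiv ?",
}
result = translate(
    "Did you ever consult that private detective ?",
    ["Did you ever consult", "that private detective ?"],
    {},  # no full-sentence match stored
    chunk_memory,
)
```

With no full-sentence match available, the two chunk translations are recombined into `"Waren sie eigentlich bei diesem Privatdetektiv ?"`.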
We use a number of statistically and linguistically motivated methods to find the most probable matches and recombine them to produce a successful translation. We also use a modular architecture, which should make it easy to adapt the system to new language pairs: all that has to be changed is the bilingual corpus, along with a new set of marker words for the particular language pair. Figure 1 shows what happens when an English sentence is entered into the EBMT system.
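This modularity can be pictured as a configuration sketch; the paths and marker words below are hypothetical, and only illustrate that adapting the system means supplying a new aligned corpus and marker sets rather than new code:

```python
# Hypothetical per-pair resources: swapping language pairs requires
# only a different corpus file and different closed-class marker sets.
LANGUAGE_PAIRS = {
    ("en", "de"): {
        "corpus": "corpora/en-de.aligned",          # sententially aligned bitext
        "markers": {
            "en": {"did", "you", "that", "the"},    # English closed-class markers
            "de": {"bei", "diesem", "sie", "der"},  # German closed-class markers
        },
    },
}

def resources_for(src, tgt):
    """Return the corpus path and marker sets for one language pair."""
    return LANGUAGE_PAIRS[(src, tgt)]
```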