20.11.2014 Views

Translation Universals.pdf - ymerleksi - home


Corpora, universals and interference

gather more data. Translation into Finnish is very much dominated by English sources – in fact the literary genre is an exception, having by far the greatest variety of SLs.

As pointed out above, there is no reliable measure of overall similarity and difference between corpora. I therefore developed a tentative solution for comparing the four subcorpora to one another, based on comparing lexis on a rank order basis. The point of departure was a frequency-based wordlist of each corpus. The lists consisted of individual word forms, not lemmata. Lemmatisation seemed unnecessary, even pointless, since it is by no means clear that great differences in the frequencies of typical forms are trivial from the point of view of translation – quite the contrary: in a highly inflectional language such differences can point to an untypical usage pattern (see, for example, Mauranen 1999b).
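To make the starting point concrete, the following is a minimal sketch of a frequency-ranked list of word forms without lemmatisation. It is not the procedure used in the study: the function name `ranked_wordlist`, the lowercasing step, the letter-based tokenisation and the alphabetical tie-breaking are all assumptions made for illustration.

```python
import re
from collections import Counter


def ranked_wordlist(text):
    """Return (word form, count) pairs ordered by descending frequency.

    Assumptions for illustration: text is lowercased, tokens are runs of
    letters, and ties in frequency are broken alphabetically.
    """
    # No lemmatisation: inflected forms such as "talo" and "talossa"
    # are counted as separate items.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Because the forms are not collapsed into lemmata, a form that is unusually frequent or unusually rare in one subcorpus stays visible in its own right in the ranking.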

Since the method of comparison needed for this study had no precedent to go on, I started from some general principles that we already know from corpus linguistics (for alternative solutions, see Jantunen, this volume). First, the most frequent items in different reasonably-sized corpora tend to be fairly consistently the same in a given language. In fact, frequent items can show remarkable consistency even across highly unequal corpus sizes, up to a 100-fold difference (see, e.g., Mauranen 1999b). Therefore, the fact that the Russian SL subcorpus was only about two thirds the size of the others should not influence the results dramatically, especially as rank order is not very sensitive to size. Second, fairly soon after the top frequency words, corpora begin to show their differences, and the high frequencies peter out and tail off into a very long list of items with few, and finally single, occurrences.

My solution was to try out three different frequency bands of thirty words each: the first from rank 1 to 30, the second from 50 to 79, and the third from 100 to 129 (with some adjustment on account of excluding proper names). The bands were chosen conservatively in that they were all fairly high in frequency: this meant it was possible that there would not be much variation. My assumption was that, given the distributions of monolingual corpora, there would be very little variation in the top frequency band and somewhat more in the second, but that the last one, which was picked from just below the top 100 frequency level, would be the least predictable. I also assumed that the best guess for finding meaningful variation would be in the middle band. Of course, although we might reasonably expect this to be the case, the exact optimal place for the middle band is ultimately an empirical matter, and no precedents existed to look into.
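The three bands can be sketched in code as slices of a rank-ordered list of word forms. This is a rough illustration only: the names `BANDS`, `rank_band` and `band_overlap` are hypothetical, the shared-item count is an assumed stand-in for whatever comparison was actually carried out, and the adjustment for excluded proper names is not modelled.

```python
# Rank bands as described in the text (1-based, inclusive).
BANDS = {"top": (1, 30), "middle": (50, 79), "lower": (100, 129)}


def rank_band(ranked_forms, first, last):
    """Return the word forms at ranks first..last (1-based, inclusive)."""
    return ranked_forms[first - 1:last]


def band_overlap(ranked_a, ranked_b, first, last):
    """Count word forms that two ranked lists share within one band.

    Assumption: a plain shared-item count, used here only to illustrate
    a rank-band comparison between two subcorpora.
    """
    band_a = set(rank_band(ranked_a, first, last))
    band_b = set(rank_band(ranked_b, first, last))
    return len(band_a & band_b)
```

With two ranked lists, one per subcorpus, `band_overlap(list_a, list_b, *BANDS["middle"])` gives the number of shared forms in the middle band, which can range from 0 to 30.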
