
Building an Old Occitan Corpus via Cross-Language Transfer




very interesting linguistic document consisting of 8097 lines of the “universally acknowledged masterpiece of Old Occitan narrative” (Fleischmann, 1995). Multiple styles, such as internal monologues, dialogues and narratives, provide a rich lexical, morphological and syntactic database of a language spoken in southern France.

3 Linguistic Annotation via Cross-Language Transfer

Corpus-based approaches often require a large amount of parallel data or manual labor. In contrast, cross-language transfer, as proposed by Hana et al. (2006), is a resource-light approach. That is, this method requires no resources in the target language: neither training data, nor a large lexicon, nor time-consuming manual annotation.

While cross-language transfer had previously been applied to languages with parallel corpora and bilingual lexica (Yarowsky and Ngai, 2001; Hwa et al., 2005), Hana et al. (2006) introduced a method for settings in which these additional resources are not available. Feldman and Hana (2010) performed several experiments with Romance and Slavic languages. The only resources they used were i) POS-tagged data of the source language, ii) raw data in the target language, and iii) a resource-light morphological analyzer for the target language. For POS tagging, the Markov model tagger TnT (Brants, 2000) was trained on the source languages, namely Spanish and Czech, in order to obtain transition probabilities. The reasoning is that the word order patterns of the source and target languages are very similar, so that, given the same tagset, the transition probabilities should be similar, too. Since the languages differ in their morphological characteristics, a direct transfer of the lexical probabilities was not possible. Instead, a shallow morphological analyzer was developed for the target languages, using cognate information, among other similarities. The trained models were then applied to the target languages, Portuguese, Catalan, and Russian. Tagging accuracies for Catalan, Portuguese, and Russian were 70.7%, 77.2%, and 78.6%, respectively.
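
The division of labor described above, with tag transitions taken from the source language and candidate tags supplied by a target-language analyzer, can be illustrated with a small sketch. This is not the TnT tagger or the setup of Feldman and Hana (2010): it is a simplified bigram Viterbi decoder in which the analyzer only restricts the candidate tags and no lexical probabilities are modeled. The toy Spanish training data, the Catalan lexicon, the tagset, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

def train_transitions(tagged_sentences, smoothing=0.1):
    """Estimate smoothed bigram probabilities P(tag | previous tag) from source data."""
    counts = defaultdict(Counter)
    tags = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for _, tag in sentence:
            counts[prev][tag] += 1
            tags.add(tag)
            prev = tag
    tags = sorted(tags)
    probs = {}
    for prev in ["<s>"] + tags:
        total = sum(counts[prev].values()) + smoothing * len(tags)
        probs[prev] = {t: (counts[prev][t] + smoothing) / total for t in tags}
    return probs, tags

def viterbi(words, transitions, tags, analyze):
    """Pick the best tag sequence; the analyzer limits which tags each word may take."""
    columns = []  # columns[i][tag] = (log score, best previous tag)
    for i, word in enumerate(words):
        candidates = analyze(word) or tags  # unknown word: allow every tag
        column = {}
        for tag in candidates:
            if i == 0:
                column[tag] = (math.log(transitions["<s>"][tag]), None)
            else:
                prev_tag, score = max(
                    ((p, s + math.log(transitions[p][tag]))
                     for p, (s, _) in columns[-1].items()),
                    key=lambda item: item[1],
                )
                column[tag] = (score, prev_tag)
        columns.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = columns[i][tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical POS-tagged source-language (Spanish) sentences.
SOURCE_TAGGED = [
    [("el", "DET"), ("gato", "NOUN"), ("duerme", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("ella", "PRON"), ("come", "VERB")],
]

# Hypothetical output of a shallow target-language (Catalan) analyzer.
TARGET_LEXICON = {"el": ["DET", "PRON"], "gat": ["NOUN"], "dorm": ["VERB", "NOUN"]}

def analyze(word):
    return TARGET_LEXICON.get(word, [])

transitions, tagset = train_transitions(SOURCE_TAGGED)
print(viterbi("el gat dorm".split(), transitions, tagset, analyze))
```

In this toy setting the ambiguity of "el" and "dorm" is resolved purely by the source-language transition statistics, which is the intuition behind reusing them for a closely related target language.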

In contrast, syntactic transfer is mainly used in machine translation. This approach requires a bilingual corpus aligned at the sentence level, in which words in the source language are mapped to words in the target language. Dien et al. (2004) applied this method to an English-Vietnamese corpus. They extracted a set of syntactic trees from English and transferred it to the target language. The two resulting sets of parsed trees, English and Vietnamese, were then used as training data to extract transfer rules. In contrast, Hanneman et al. (2009) extracted unique grammar rules from an English-French parallel parsed corpus and selected high-frequency rules to reorder the position of constituents. Hwa et al. (2005) describe an approach that focuses on syntax projection per se, but their approach also relies on word alignment in a parallel corpus. They show that the approach works better for closely related languages (English to Spanish) than for languages as different as English and Chinese. It is widely agreed that word alignment, and thus syntactic transfer, works best for similar languages because of their similar word order patterns (Watanabe and Sumita, 2003). Therefore, genetically related Romance languages should be well suited for syntactic cross-language transfer. McDonald et al. (2011) and Naseem et al. (2012) describe novel approaches that use more than one source language, reaching results similar to those of a supervised parser for the source language.
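
To make the role of word alignment in syntax projection concrete, the following sketch copies dependency heads from a parsed source sentence to its aligned target sentence. It is a deliberately reduced illustration that assumes a one-to-one alignment, not the full projection algorithm of Hwa et al. (2005); the function name, the data layout, and the example sentence pair are all hypothetical.

```python
def project_dependencies(source_heads, alignment, target_length):
    """Project head indices from source to target through a one-to-one alignment.

    source_heads[i] is the index of the head of source word i, or -1 for the root.
    alignment maps a source word index to a target word index.
    Returns a list of target heads; None marks words for which nothing was projected.
    """
    target_heads = [None] * target_length
    for src_dep, src_head in enumerate(source_heads):
        tgt_dep = alignment.get(src_dep)
        if tgt_dep is None:
            continue  # source word has no counterpart in the target sentence
        if src_head == -1:
            target_heads[tgt_dep] = -1  # the root stays the root
        elif src_head in alignment:
            target_heads[tgt_dep] = alignment[src_head]
    return target_heads

# English "the white house": "the" and "white" depend on "house", which is the root.
english_heads = [2, 2, -1]
# Spanish "la casa blanca": the-la, white-blanca, house-casa.
alignment = {0: 0, 1: 2, 2: 1}
print(project_dependencies(english_heads, alignment, 3))
```

Any gap in the alignment leaves a target word without a head, which illustrates why this kind of transfer depends on the quality of the word alignment.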

While cross-language transfer has been applied successfully to modern languages, we decided to use it to transfer linguistic annotation to a historical corpus. The choice of source languages was based on the availability of annotated resources and the similarity of language characteristics. Thus, an Old French corpus (Martineau et al., 2007) was selected as the source for the morphosyntactic annotation of Occitan. To transfer syntactic information, however, we used the Catalan dependency treebank (Civit et al., 2006), since modern Catalan, like Old Occitan, is a pro-drop language with a relatively free word order.

4 Corpus Pre-Processing

The romance ‘Flamenca’ is available only as scanned images; therefore, the initial step was conversion to an electronic version via OCR and manual correction. Figure 2 shows a sample of the manuscript.


Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012
