Building an Old Occitan Corpus via Cross-Language Transfer

is mainly used with parallel corpora and large bilingual lexicons (Yarowsky and Ngai, 2001; Hwa et al., 2005), experiments by Hana et al. (2006) have demonstrated the usability of this method in situations where there are no parallel corpora but instead resources for a closely related language.

Our goal is to build a corpus of Old Occitan in a resource-light manner, by using cross-language transfer. This approach will be used not only for part-of-speech tagging, but also for syntactic annotation.

The remainder of the paper is organized as follows: Section 2 provides a brief description of Old Occitan. Section 3 reviews the concept of a resource-light approach in corpus linguistics. Section 4 provides details on corpus pre-processing. The methods for cross-linguistic part-of-speech (POS) tagging and cross-linguistic parsing are described in Sections 5 and 6. Finally, the conclusions and directions for further work are presented in Section 7.

2 Old Occitan (Provençal)

Occitan, often referred to as Provençal, constitutes an important element of the literary, linguistic, and cultural heritage in the history of Romance languages. Provençal (Occitan) poetry was a predecessor of French lyrics. Moreover, Occitan was the only administrative language in medieval France besides Latin (Belasco, 1990). While the historical importance of this language is indisputable, Occitan remains linguistically understudied. Compared to Old French, Provençal still lacks digitized copies of scanned manuscripts, as well as annotated corpora for morpho-syntactic or syntactic research.

Typologically, Old Occitan is classified as one of the Gallo-Roman languages, together with French and Catalan (Bec, 1973). On the one hand, Old Occitan, Old French, and Old Catalan share a striking number of lexical and morphological characteristics. For example, French and Occitan have rich verbal inflection and a two-case nominal system (nominative and accusative), illustrated with the word ‘wall’ in (2):

Figure 1: Linguistic map of France, from Bec (1973)

(2)  Case        Old Occitan  Old French
     Nominative  lo murs      li murs
     Accusative  lo mur       lo mur

On the other hand, Occitan has syntactic traits similar to Catalan, such as a relatively free word order and null subjects, illustrated in (3) and (4).

(3)  Gran   honor  nos     fai
     great  honor  us.Dat  does
     ‘He grants us a great honor’ (Old Occitan - Flamenca)

(4)  molt  he    gran  desig
     much  have  big   desire
     ‘I have a lot of great desire...’ (Old Catalan - Ramon Llull) 3

The close relationship between these three languages is also marked geographically. The northern border of the Occitan-speaking area is adjacent to the French linguistic domain, whereas in the south, Occitan borders on the Catalan-speaking area, as shown in Figure 1.

This project focuses on the 13th-century Old Occitan romance “Flamenca”, from the edition by Meyer (1901). Apart from a very intriguing fable of beautiful Flamenca imprisoned in a tower by her jealous husband, this story presents a asp

3 http://orbita.bib.ub.edu/ramon/velec.

Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012
