Building an Old Occitan corpus via cross-language transfer
is mainly used with parallel corpora and large
bilingual lexicons (Yarowsky and Ngai, 2001;
Hwa et al., 2005), experiments by Hana et al.
(2006) have demonstrated the usability of this
method in situations where there are no parallel
corpora but instead resources for a closely related
language.
Our goal is to build a corpus of Old Occitan in
a resource-light manner, by using cross-language
transfer. This approach will be used not only for
part-of-speech tagging, but also for syntactic annotation.
The organization of the remainder of the paper
is as follows: Section 2 provides a brief description
of Old Occitan. Section 3 reviews the
concept of a resource-light approach in corpus
linguistics. Section 4 provides details on corpus
pre-processing. The methods for cross-linguistic
part-of-speech (POS) tagging and cross-linguistic
parsing are described in Sections 5 and 6. Finally,
the conclusions and directions for further work
are presented in Section 7.
2 Old Occitan (Provençal)
Occitan, often referred to as Provençal, constitutes
an important element of the literary, linguistic,
and cultural heritage in the history of
Romance languages. Provençal (Occitan) poetry
was a predecessor of French lyrics. Moreover,
Occitan was the only administrative language in
Medieval France, besides Latin (Belasco, 1990).
While the historical importance of this language
is indisputable, Occitan, as a language, remains
linguistically understudied. Compared to Old
French, Provençal is still lacking digitized copies
of scanned manuscripts, as well as annotated corpora
for morpho-syntactic or syntactic research.
Typologically, Old Occitan is classified as one
of the Gallo-Roman languages, together with
French and Catalan (Bec, 1973). On the one hand,
Old Occitan, Old French, and Old Catalan share a
striking number of lexical and morphological
characteristics.
For example, Old French and Old Occitan have rich verbal
inflection and a two-case nominal system (nominative
and accusative), illustrated with the
word ‘wall’ in (2):
Figure 1: Linguistic map of France, from (Bec, 1973)
(2)
Case        Old Occitan   Old French
Nominative  lo murs       li murs
Accusative  lo mur        lo mur
On the other hand, Occitan has syntactic traits
similar to Catalan, such as a relatively free word
order and null subjects, illustrated in (3) and (4).
(3) Gran   honor  nos     fai
    great  honor  us.DAT  does
    ‘He grants us a great honor’ (Old Occitan - Flamenca)
(4) molt  he    gran  desig
    much  have  big   desire
    ‘I have a lot of great desire...’ (Old Catalan - Ramon Llull) 3
The close relationship between these three languages
is also marked geographically. The northern
border of the Occitan-speaking area is adjacent
to the French linguistic domain, whereas
in the south, Occitan borders on the Catalan-speaking
area, as shown in Figure 1.
This project focuses on the 13th-century Old
Occitan romance “Flamenca”, from the edition
by Meyer (1901). Apart from a very intriguing
fable of beautiful Flamenca imprisoned in a
tower by her jealous husband, this story presents a
asp<br />
3 http://orbita.bib.ub.edu/ramon/velec.<br />
Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 2012