25.08.2013 Views

PDF (Online Text) - EURAC

A corpus for Sardinian: problems and prospects

Nicoletta Puddu

Corpora of minority languages have proved to be useful tools for both studying and preserving these languages (see, for example, the DoBeS project at MPI Nijmegen). Sardinian, as an endangered language, could certainly profit from a well-designed corpus. The first digital collection of Sardinian texts was the Sardinian Text Database; however, it cannot be considered a corpus: it is not normalized, and the user can only search for exact matches. In this paper, I discuss the main problems in designing and developing a corpus for Sardinian.

Kennedy (1998) identifies three main stages in compiling a corpus: (1) corpus design; (2) text collection and capture; and (3) text encoding or mark-up. As for the first stage, I propose that a Sardinian corpus should be mixed, monolingual, synchronic, balanced, and annotated, and I discuss the reasons for these choices throughout the paper. Text collection seems to be a minor problem in the case of Sardinian: both written and spoken texts are available, and the number of speakers is still large enough to collect a sufficient amount of data. The major problems arise at the third stage. Sardinian is fragmented into different varieties and has no standard variety (not even a standard orthography). Several proposals for standardization have recently been made, but without success (see the discussion in Calaresu 2002; Puddu 2003). First of all, I suggest using a standard orthography that allows us to group Sardinian dialects into macro varieties. It will then be possible to divide the corpus into sub-corpora, each representative of one variety. The creation of an adequate morphological tag system will be fundamental: with a homogeneous tag system, it will be possible to perform searches throughout the corpus and to study linguistic phenomena both within a single macro variety and in the language as a whole.
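The benefit of a homogeneous tag system can be illustrated with a minimal sketch. It assumes that every token carries a tag from one shared tagset and a label for the sub-corpus it belongs to. The `Token` class, the tag names (`NOUN`, `VERB`), the variety labels, and the token forms are all hypothetical placeholders, not the tagset or data proposed in this paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str      # surface form (placeholder strings, not real Sardinian words)
    tag: str       # morphological tag from a shared, hypothetical tagset
    variety: str   # macro variety of the sub-corpus the token comes from


# Toy corpus: two hypothetical sub-corpora annotated with the same tagset.
CORPUS = [
    Token("form_a", "NOUN", "variety_1"),
    Token("form_b", "VERB", "variety_1"),
    Token("form_c", "NOUN", "variety_2"),
    Token("form_d", "NOUN", "variety_2"),
    Token("form_e", "VERB", "variety_2"),
]


def search(corpus, tag, variety=None):
    """Return tokens with the given tag, optionally limited to one macro variety."""
    return [
        t for t in corpus
        if t.tag == tag and (variety is None or t.variety == variety)
    ]


def tag_counts_by_variety(corpus, tag):
    """Count how often a tag occurs in each sub-corpus."""
    return Counter(t.variety for t in corpus if t.tag == tag)


# The same query runs over the whole language or over a single variety.
whole_language = search(CORPUS, "NOUN")
single_variety = search(CORPUS, "NOUN", variety="variety_2")
print(len(whole_language), len(single_variety))
print(dict(tag_counts_by_variety(CORPUS, "NOUN")))
```

Because the tag is independent of the variety label, one query serves both levels of analysis; with variety-specific tagsets, each sub-corpus would need its own query.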

Finally, I propose a morphological tag system and a first tagged pilot corpus of Sardinian, based on written samples and following the EAGLES and XCES standards.

1. Why create corpora for minority languages

Corpus linguistics, or linguistica dei corpora (henceforth LC), is of particular interest, especially to those who adopt a functionalist approach, in that it “studies the