25.08.2013 Views

PDF (Online Text) - EURAC

oriented dictionary that would explicitly store such properties. As far as the corpus is concerned, a multilevel annotation would be more appropriate than the current monodimensional view: without changes to the current annotation, extra layers may be added for the above-mentioned features of nouns and verbs, but also for an appropriate treatment of fused forms (cf. dirang, ‘do what?’ from dira + eng) and of multiword items, for example, idiomatic expressions (cf. bona kgwedi ‘see the moon’ i.e., ‘menstruate’). As Northern Sotho orthography is not yet fully standardised, a distinction between standard orthography and observed (possibly deviant) orthography may be introduced through additional layers.
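
The layered view can be illustrated with a small sketch. It is not the project's data model: the class names, the layer names ("components", "standard") and the helper add_layer are assumptions made for illustration, and the three forms are listed as isolated examples from the text, not as a sentence. The sketch only shows that new layers can be attached without changing what is already stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """One corpus token; each annotation layer is stored under its own name."""
    observed: str                                  # orthography as found in the text
    layers: Dict[str, object] = field(default_factory=dict)


@dataclass
class Span:
    """A layer entry covering several tokens, e.g. a multiword item."""
    start: int                                     # index of first token (inclusive)
    end: int                                       # index after last token (exclusive)
    label: str
    gloss: str = ""


def add_layer(token: Token, name: str, value: object) -> None:
    """Attach a new layer; refuse to overwrite an existing one."""
    if name in token.layers:
        raise KeyError(f"layer {name!r} already set")
    token.layers[name] = value


# Isolated example forms taken from the text (not a real sentence).
tokens: List[Token] = [Token("bona"), Token("kgwedi"), Token("dirang")]

# Fused form: dirang = dira + eng.
add_layer(tokens[2], "components", ["dira", "eng"])

# Standard-orthography layer; assumed identical to the observed form here.
for tok in tokens:
    add_layer(tok, "standard", tok.observed)

# Multiword items live in a separate span layer over token indices.
multiword: List[Span] = [Span(0, 2, "idiom", "menstruate")]
```

Keeping spans in a separate list, rather than on individual tokens, is one possible design choice: it lets a multiword layer be added or dropped without touching token-level annotation.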

5. Conclusions and Future Work

We have reported on an ongoing research and development project for the creation of tagging resources for Northern Sotho. In this context, we created modular components of a two-layered architecture. They are needed in the first place for the preparation of a training corpus for statistical tagging, but we hope that they will prove equally useful for the later development of larger corpora.

We bootstrap the training corpus and the tagger lexicon in parallel, using semi-automatic procedures that consist of a rule-based automatic pre-classification followed by manual validation. The procedures concern the identification of verbal and nominal forms and the disambiguation of closed-class items. They are applied one after the other in order of their expected precision (‘easy-first’, ‘safety-first’), which leads to a partly disambiguated corpus. For the creation of the training corpus, the remaining ambiguities are removed manually, whereas in the later creation of larger corpora this task will be left to the statistical tagger. Linguistic knowledge about the language is used extensively in the definition of the automatic procedures: morphological and morpho-syntactic regularities in the local context provide the starting point for their formulation.
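
The ordering principle can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the tags X, Y, Z, the forms w1 to w4, both toy rules and their precision values are invented, and "by order of expected precision" is read here as "most precise rule first". Rules only narrow the set of candidate tags and never add to it, so whatever they cannot decide remains ambiguous.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Token:
    form: str
    candidates: Set[str]              # tags still possible for this token


@dataclass
class Rule:
    name: str
    expected_precision: float         # invented values, for ordering only
    apply: Callable[[List[Token], int], Optional[Set[str]]]


def run_cascade(tokens: List[Token], rules: List[Rule]) -> None:
    """Apply each rule once over the corpus, most precise rule first."""
    for rule in sorted(rules, key=lambda r: r.expected_precision, reverse=True):
        for i, tok in enumerate(tokens):
            if len(tok.candidates) <= 1:
                continue                          # already disambiguated
            proposal = rule.apply(tokens, i)
            if proposal is None:
                continue                          # rule does not fire here
            kept = tok.candidates & proposal      # narrow, never add tags
            if kept:
                tok.candidates = kept


def after_unambiguous_x(tokens: List[Token], i: int) -> Optional[Set[str]]:
    """Toy rule: directly after an unambiguous X, keep only Y."""
    if i > 0 and tokens[i - 1].candidates == {"X"}:
        return {"Y"}
    return None


def corpus_initial(tokens: List[Token], i: int) -> Optional[Set[str]]:
    """Toy rule: the first token may only be X or Y."""
    return {"X", "Y"} if i == 0 else None


rules = [
    Rule("corpus_initial", 0.90, corpus_initial),
    Rule("after_unambiguous_x", 0.99, after_unambiguous_x),
]
tokens = [
    Token("w1", {"X", "Z"}),
    Token("w2", {"X"}),
    Token("w3", {"Y", "Z"}),
    Token("w4", {"Y", "Z"}),
]
run_cascade(tokens, rules)
```

After the run, w1 and w3 are resolved while w4 keeps both of its candidate tags. In the setting described above, such a residue would be resolved manually for the training corpus, or left to the statistical tagger for larger corpora.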

Future work on the tools described in this paper will be devoted to the development of further disambiguation rules, to the finalisation of a fully disambiguated training corpus, and to tagger training and tests. This will allow us (i) to assess tagging quality as obtained by the use of the statistical tagger only in a setup with our rule-based pre-processing, (ii) to stabilise the proposed tagset on the basis of experience with statistical tagging, and (iii) to undertake tagging of the PSC, which could then serve for lexicographic exploration.
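
One simple way to quantify tagging quality is token-level accuracy against a manually validated reference. The sketch below assumes this measure; the text does not state which metric will be used, and the function name and the toy tag sequences are invented for illustration.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of tokens whose predicted tag equals the reference tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must align token by token")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented toy data: one reference sequence and two tagger outputs.
gold = ["X", "Y", "X", "Z"]
run_a = ["X", "Z", "X", "Z"]
run_b = ["X", "Y", "X", "Z"]

score_a = tagging_accuracy(gold, run_a)
score_b = tagging_accuracy(gold, run_b)
```

Scoring two setups against the same reference corpus makes their results directly comparable.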

A well-designed POS-tagger for Northern Sotho would provide a flying start to the development of similar taggers for the other Sotho languages, the Nguni languages,
