25.08.2013 Views

PDF (Online Text) - EURAC


Creating Word Class Tagged Corpora for Northern Sotho by Linguistically Informed Bootstrapping

Danie J. Prinsloo and Ulrich Heid

To bootstrap tagging resources (tagger lexicon and training corpus) for Northern Sotho, a tagset and a number of modular and reusable corpus processing tools are being developed. This article describes the tagset and routines for identifying verbs and nouns, and for disambiguating closed class items. All of these are based on morphological and morphosyntactic specificities of Northern Sotho.

1. Introduction

In this paper, we report on ongoing work towards the parallel creation of computational linguistic resources for Northern Sotho, on the basis of linguistic knowledge about the language. Northern Sotho is one of the eleven official languages of South Africa, spoken by about 4.2 million people in the northeastern part of the country. It belongs to the Sotho family of the Bantu languages (S32; Guthrie 1971). The three Sotho languages are closely related.

The creation of Natural Language Processing (NLP) resources is part of an effort towards an infrastructure for corpus linguistics and computational lexicography and terminology for Northern Sotho, which is seen as an element of a broader action for the development of Human Language Technology (HLT) and NLP applications for the South African languages.

Parallel resource creation has been attempted as part of our research and development agenda in order to speed up the resource building process, in the sense of rapid prototyping of a part-of-speech (POS) tagset; a tagger lexicon and (manually corrected) reference corpus; and a statistical tagger. These constitute the first set of corpus linguistic tools to be developed (we report on the first three tools here). At the same time, we intend to verify to what extent ‘traditional’ corpus linguistic methods and tools (as used for European languages) can be applied to a Bantu language, an attempt that, to our knowledge, has not been made before.

Two text corpora are used as input to the study. The first is a 43,000-token corpus, a selection from the Northern Sotho novel Tša ka Mafuri (Matsepe 1974), and the second is the Pretoria Sepedi Corpus (PSC) of 6 million tokens, a collection of 327
