25.08.2013 Views

PDF (Online Text) - EURAC


Table 4: Alternatives in Tagging the Verb reka

02 ANA  rekana   ‘V’  rek ‘Vroot’  an ‘Rec’  a
        rekane   ‘V’  rek ‘Vroot’  an ‘Rec’  e ‘Per’
        rekanwa  ‘V’  rek ‘Vroot’  an ‘Rec’  w ‘Pas’  a
        rekanwe  ‘V’  rek ‘Vroot’  an ‘Rec’  w ‘Pas’  e ‘Per’

2.3 Quantitative Aspects of the Lexicon

There are a few marked tendencies in the quantitative distribution of lexical items in Northern Sotho, especially with respect to the relationship between frequency of use and ambiguity.

In our 43,000-word corpus sample, we counted types and tokens, distinguishing nouns, verbs and closed class items. In Northern Sotho, only nouns and verbs allow for productive word formation (i.e., are open word classes), whereas function words, adverbs and adjectives are listed (i.e., belong to closed classes). Note that we did not consider numerals at all; the figures given are to be taken as tendencies. We separately counted forms that can be unambiguously identified as nouns, verbs or elements of one of the closed classes, as opposed to ambiguous forms to which more than one word class can be assigned, depending on the context.
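The counting procedure described above can be sketched as follows. This is a minimal illustration and not the authors' actual tooling: the function name `ambiguity_profile`, the toy lexicon, the placeholder word forms (`noun_a`, `ambig_a`, ...) and the class labels are all assumptions made for the example. For each word class, it counts the candidate tokens (all occurrences of forms that may belong to that class), the unambiguous tokens, and the unambiguous and ambiguous types.

```python
from collections import defaultdict


def ambiguity_profile(tokens, lexicon):
    """Per word class, count candidate and unambiguous tokens and types.

    A form is unambiguous if the lexicon lists exactly one class for it.
    Forms missing from the lexicon are skipped.
    """
    stats = defaultdict(lambda: {
        "candidate_tokens": 0,
        "unambiguous_tokens": 0,
        "unambiguous_types": set(),
        "ambiguous_types": set(),
    })
    for form in tokens:
        classes = lexicon.get(form, set())
        for cls in classes:
            entry = stats[cls]
            entry["candidate_tokens"] += 1
            if len(classes) == 1:
                entry["unambiguous_tokens"] += 1
                entry["unambiguous_types"].add(form)
            else:
                entry["ambiguous_types"].add(form)
    # Replace the sets of forms by their sizes (type counts).
    return {
        cls: {
            "candidate_tokens": e["candidate_tokens"],
            "unambiguous_tokens": e["unambiguous_tokens"],
            "unambiguous_types": len(e["unambiguous_types"]),
            "ambiguous_types": len(e["ambiguous_types"]),
        }
        for cls, e in stats.items()
    }


# Hypothetical toy data: placeholder forms, not real Northern Sotho items.
LEXICON = {
    "noun_a": {"noun"},
    "noun_b": {"noun"},
    "verb_a": {"verb"},
    "func_a": {"closed"},
    "ambig_a": {"verb", "closed"},  # ambiguous between two classes
}
TOKENS = ["noun_a", "ambig_a", "verb_a", "ambig_a",
          "noun_b", "func_a", "ambig_a", "noun_a"]

profile = ambiguity_profile(TOKENS, LEXICON)
for cls, counts in sorted(profile.items()):
    print(cls, counts)
```

In this toy sample, a single frequent ambiguous form accounts for three of the four verb candidates and three of the four closed class candidates, while all noun candidates are unambiguous.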

All three categories have many more unambiguous types than ambiguous ones. As is likely in most languages, however, high-frequency items are also highly ambiguous (cf. Table 5 below). Whereas only slightly more than half of the potential verb occurrences in the sample are unambiguous (ca. 5000 tokens), the percentage of unambiguous occurrences of noun candidates is as high as 90% (5800 out of 6300 tokens). Ambiguity with nouns is restricted to rather infrequent items. For closed class items, however, the inverse situation is observed: only a little more than 20% of the occurrences of closed class items in our sample are unambiguous, and a small set of closed class item types (88 types), with an average frequency of two hundred or more, constitutes about 40% of the total number of word forms in the sample. We expect that this distribution will be more or less generalisable to larger data sets of Northern Sotho. It will have an impact on our approach to the bootstrapping of linguistic resources for this language. Table 5 lists the most frequent (and at the same time most ambiguous) items from the 43,000-word corpus sample with their tags (according to the tagset described in section 3.2) and their absolute frequency in the sample.
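The proportions quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are our own, and the closed class figure is a lower bound that assumes an average frequency of exactly two hundred.

```python
CORPUS_TOKENS = 43_000

# Nouns: unambiguous occurrences out of all noun candidates.
noun_unambiguous = 5_800
noun_candidates = 6_300
noun_share = noun_unambiguous / noun_candidates

# Closed class items: 88 types with an average frequency of 200 or more.
frequent_closed_types = 88
min_average_frequency = 200  # assumption: the lower bound stated in the text
closed_tokens_lower_bound = frequent_closed_types * min_average_frequency
closed_share = closed_tokens_lower_bound / CORPUS_TOKENS

print(f"unambiguous noun share: {noun_share:.1%}")                      # 92.1%
print(f"tokens covered by 88 closed types: {closed_tokens_lower_bound}")  # 17600
print(f"share of the corpus sample: {closed_share:.1%}")                # 40.9%
```

Both results agree with the rounded values in the text: roughly 90% unambiguous noun occurrences, and about 40% of all word forms covered by the 88 frequent closed class types.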
