PDF (Online Text) - EURAC


• Nouns and verbs can be guessed in the text on the basis of their morphological properties; thus, separate rule-based guessers were developed, and their results were manually corrected in the training corpus; and,

• The disambiguation of closed-class items in context is, to a considerable extent, possible on the basis of rules similar to ‘local grammars.’ A certain number of ambiguities in the training corpus have to be resolved manually.

In the remainder of this section, we report on tagset design (section 3.2); on an architecture for the creation of a tagger lexicon and a training corpus (section 3.3); and on verb and noun guessing and the disambiguation of closed-class items (sections 3.4 to 3.6).

3.2 Tagset Design

The tagset designed for Northern Sotho is organised as a logical tagset (similar to a type hierarchy); this makes it possible to formulate underspecified queries to the corpus.

The tagset mirrors some of the linguistic specificities of Northern Sotho, but is also conditioned by considerations of automatic processability with a statistical tagger.

The tagset reflects properties of the nominal system of classes and concords: as they are (mostly) lexically distinct, we introduced class-based subtypes for nouns, pronouns and concords, as well as for adjectives: N, ADJ, C (for concord) and PRO (for pronoun) have such subtypes. As concords and pronouns have functionally and/or semantically defined subtypes, we apply the class-based subdivision in fact to the types listed in Table 6:

Table 6: Nominal Categories that have Class-related Subtypes

N      nouns                     CPOSS      possessive concords
ADJ    adjectives                EMPRO      emphatic pronouns
CS     subject concords          POSSPRO    possessive pronouns
CO     object concords           QUANTPRO   quantifying pronouns
CDEM   demonstrative concords
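To illustrate how a logical tagset supports underspecified queries, the following minimal sketch arranges the categories of Table 6 in a small type hierarchy and selects tokens by any type that subsumes their tag. It is an illustration under stated assumptions, not the tagset implementation described in this paper: the top type "TOP", the notation "CATEGORY.CLASS" for class-based subtypes (e.g. "CS.01"), the function names, and the placeholder tokens "w1" to "w5" are all hypothetical.

```python
# Parent links of a small type hierarchy built from the categories in Table 6.
# Assumption: "TOP" is a hypothetical root type introduced for this sketch.
PARENT = {
    "N": "TOP",
    "ADJ": "TOP",
    "C": "TOP",         # concords
    "PRO": "TOP",       # pronouns
    "CS": "C",          # subject concords
    "CO": "C",          # object concords
    "CDEM": "C",        # demonstrative concords
    "CPOSS": "C",       # possessive concords
    "EMPRO": "PRO",     # emphatic pronouns
    "POSSPRO": "PRO",   # possessive pronouns
    "QUANTPRO": "PRO",  # quantifying pronouns
}


def supertypes(tag):
    """Return the tag followed by every more general type that subsumes it.

    Assumption: a class-based subtype is written "CATEGORY.CLASS",
    e.g. "CS.01"; this notation is hypothetical.
    """
    category, _, noun_class = tag.partition(".")
    chain = [tag] if noun_class else []
    while category != "TOP":
        chain.append(category)
        category = PARENT[category]  # raises KeyError for unknown categories
    return chain


def query(tagged_tokens, underspecified_tag):
    """Select the tokens whose tag is the queried type or one of its subtypes."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if underspecified_tag in supertypes(tag)
    ]


# Placeholder tokens: "w1" to "w5" stand in for word forms, not real data.
tokens = [
    ("w1", "N.01"),
    ("w2", "CS.01"),
    ("w3", "CO.02"),
    ("w4", "EMPRO.01"),
    ("w5", "ADJ.09"),
]

all_concords = query(tokens, "C")          # any concord, any class
subject_concords = query(tokens, "CS")     # subject concords of any class
class_1_subject = query(tokens, "CS.01")   # fully specified tag
```

In this sketch, a query for "C" returns the tokens tagged "CS.01" and "CO.02", while a query for "CS" returns only the former; the more general the queried type, the less the query has to specify.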

Given the complexity of the system of verbal derivation (cf. Table 3 above), an attempt to subclassify verbal forms accordingly would have led to a number of tags (i.e., of distinctions) that would not be manageable with a statistical tagger. Furthermore, as, according to Northern Sotho orthographic conventions, concords, adjectives and pronouns are written separately from the nouns and verbs to which they are grammatically related (disjunctive writing), these elements receive their
