APT: Arabic Part-of-speech Tagger - Computer Science

APT: Arabic Part-of-speech TaggerShereen KHOJAComputing Department, Lancaster UniversityLancasterLA1 4YR, UK,s.khoja@lancaster.ac.ukAbstractArabic is the language of millions of peopleall over the world, yet a publicly availablegrammatically tagged corpus of Arabic stilldoes not exist. In this paper, I describe someof the initial findings in the development of anArabic part-of-speech tagger. For this tagger Ihave compiled a tagset containing 131 tagsthat is derived from traditional Arabicgrammatical theory. I have used this tagger tomanually tag a corpus and I have extracted alexicon from this corpus. Because Arabic is amorphologically complex language, some preprocessingis necessary before automatictagging can take place. The results I haveobtained so far highlight some of thecharacteristics of the Arabic language thatneed to be addressed to improve tagging.IntroductionArabic is the official language of twenty MiddleEast and African countries, and is the religiouslanguage of all Muslims, regardless of their origin.It is therefore surprising that very little work hasbeen done on Arabic corpus linguistics.Arabic differs from Indo-European languagessyntactically, morphologically and semantically. Itis a Semitic language whose main characteristicfeature is that most words are built up from rootsby following certain fixed patterns 1 and addinginfixes, prefixes and suffixes. It is an oldlanguage, and what is now known as ClassicalArabic was standardised around fourteen centuriesago.The modern form of Arabic is called ModernStandard Arabic (MSA) and it is the form used byall Arabic-speaking countries in publications, themedia and academic institutions. MSA is spoken1Also called measures or binyan.by people from different Arab countries where thelocal dialect may not be mutually intelligible.MSA is a simplified form of Classical Arabic, andfollows its grammar. The main differencesbetween Classical and MSA are that MSA has alarger (more modern) vocabulary, and does notuse some of the more complicated forms ofgrammar found in Classical Arabic.Statistical and hybrid part-of-speech taggershave been used very successfully for English(Garside, 1997) (Church, 1988), and with varyingdegrees of success for other European languages(Sanchez, 1995), (Tzoukermann, 1995). I aminterested in seeing how these techniques can beapplied to a language with a radically differentmorphology and syntax, and what results can beobtained.In the following sections I will describe anArabic part-of-speech tagger that I have calledAPT that uses statistical and rule-basedtechniques. I will also describe the tagset that Ihave derived from traditional Arabic grammaticaltheory, and that is still used today in ModernStandard Arabic. Finally, I will show some of theresults that I have obtained so far.1 Tagging TechniquesMany techniques have been used to tag Englishand other European language corpora. The firsttechnique to be developed was the rule-basedtechnique used by Greene and Rubin in 1970 totag the Brown corpus (Greene, 1971). Their tagger(called TAGGIT) used context-frame rules toselect the appropriate tag for each word. Itachieved an accuracy of 77%. More recently,interest in rule-based taggers has re-emerged withEric Brill's tagger that achieved an accuracy of96% (Brill, 1992). The accuracy of this tagger waslater improved to 97.5% (Brill, 1994).In the 1980s more research effort went intotaggers that used hidden Markov models to select

the appropriate tag. Such taggers include CLAWS,which was developed at Lancaster University andachieved an accuracy of 97% (Garside, 1987)(Garside, 1997), Church's PARTS tagger (Church,1988) and the Xerox tagger, which was developedby Doug Cutting and achieved an accuracy of 96%(Cutting, 1992).A combination of both statistical and rule-basedmethods has also been used to develop hybridtaggers. These seem to produce a higher rate ofaccuracy. An accuracy of 98% has been reportedby Tapanainen and Voultilainen (1994) whenusing both techniques separately, then aligning theoutput. In fact, both Brill's tagger and CLAWS arein essence a combination of both techniques. InBrill's tagger, rules are selected that have themaximum net improvement rate that is based onstatistics, and CLAWS contains a rule-basedcomponent that handles idioms.More recently, taggers that use artificialintelligence techniques have been developed. Onesuch tagger (Daelemans, 1996) uses machinelearning, and is a form of supervised learningbased on similarity-based reasoning. The accuracyrates of this tagger for English reached 97%. Also,neural networks have been used in developingPart-of-Speech taggers. The tagger developed atthe University of Memphis (Olde, 1999) is anexample of such a tagger. It achieved an accuracyof 91.6%. Another neural network taggerdeveloped for Portuguese achieved an accuracy of96% (Marques, 1996).2 Arabic TagsetSince the grammar of Arabic has beenstandardised for centuries, I decided to derive myinitial tagset from this grammatical tradition ratherthan from an Indo-European based tagset. Thereason for this is that Arabic is a very differentlanguage from Indo-European languages, andshould have its own tagset. Also, Arabic linguistswill be basing their studies on a traditional Arabicgrammar rather than an Indo-European grammar.Arabic grammarians traditionally analyse allArabic words into three main parts-of-speech.These parts-of-speech are further sub-categorisedinto more detailed parts-of-speech whichcollectively cover the whole of the Arabiclanguage (Haywood, 1962). The three main partsof-speechare:1. Noun: A noun in Arabic is a name or a wordthat describes a person, thing, or idea.Traditionally the Noun class in Arabic issubdivided into Derivatives (that is, nounsderived from verbs, nouns derived from othernouns, and nouns derived from particles) andPrimitives (nouns not so derived). Thesenouns could be further sub-categorised bynumber, gender and case. This class alsoincludes what, in traditional Europeangrammatical theory, would be classified asParticiples, Pronouns, Relatives,Demonstratives and Interrogatives.2. Verb: The verb classification in Arabic issimilar to that in English, although the tensesand aspects are different. The Verb tag can besub-categorised into Perfect, Imperfect, andImperative. Further sub-categorisation of theVerb class are possible using number, personand gender.3. Particle: The Particle class includes:Prepositions, Adverbs, Conjunctions,Interrogative Particles, Exceptions 2 , andInterjections.Initial experiments used a tagset that had fivemain tags (Noun, Verb, Particle, Residual,Punctuation), which needed to be extended to 35tags taking into account clitics (see section 3). Thetagset has now been revised and contains 131 tags.These are all subcategories of the five maincategories.The verbs have been sub-categorised by ‘type’(perfect, imperfect, imperative), person, numberand gender, and the tag name reflects this subcategorisation.For example, the word ksrtm 3 “you[plural, masculine] broke”, which is a perfect verbin the second person masculine plural form, hasthe tag VPPl2M. An indicative imperfect secondperson feminine singular verb such as tktbyn “you[singular, feminine] are writing” would be taggedVISg2FI.Similarly, personal pronouns are tagged fornumber, person and gender. The tag for hma thatis a personal pronoun for the third person dualneutral (masculine or feminine) is tagged asNPrPDu3. As well as personal pronouns, there are2These include the Arabic words that are equivalent tothe word except and the prefixes non-, un-, and im-.3Words in italic are tranliterations of the Arabic word.

elative and demonstrative pronouns, which arealso classed by number and gender.Nouns are classed by number (compare ktab,ktaban, ktb, meaning “one book”, “two books”,and “books”) and gender (compare ktab[masculine] meaning “book” and mdrst [feminine]meaning “school”)). Foreign and proper nounsreceive separate tags.The category of particle includes prepositions,adverbs and conjunctions, all of which appear inArabic either as individual words, or as cliticsattached to the following word (see section 3).Other particles are interjections, exceptions andnegative particles.Dates, numbers, punctuations and abbreviationsare also tagged separately. Dates are classified aseither an Arabic date such as mhrm which is thefirst month in the Islamic calendar, or a date fromthe Gregorian calendar such as atar which is theArabic word for the month “March” 4 .3 The Training Corpus and LexiconA corpus of 50,000 words in Modern StandardArabic (an extract from the Saudi Al-Jazirahnewspaper, date 03/03/1999) was tagged using thesmaller tagset described above, and is currentlybeing retagged with the more detailed tagset.For morphologically complex words acombination of tags was used. For example, theword walktab “and the book” is given the tagPC+NCSgMND, where PC indicates a particlethat is a conjunction, and NCSgMND indicates asingular, masculine, nominative, definite noun.This corpus was used to derive the variouslexicons described below. It was also used to trainthe tagger; that is, to extract the statistical data thatis needed for the automatic tagging of untagged orraw corpora (see section 4.4).The first version of the lexicon lists every wordthat appears in the training corpus together with allthe tags it has received. Because of the frequentoccurrence of clitics in Arabic, many wordsappear in the lexicon several times with differentclitics. For example, the word mstsfat “hospital”appears twice in the lexicon, once with the definitearticle, and once without.4Some Arabic texts use a transcription of the Englishmonths. For example, March transcribed in Arabic asmars.An example of a clitic in English would be“she'll”, where two words have been mergedtogether, “she” and “will”. In Arabic the definitearticle, equivalent to “the” in English, appears as atwo-letter proclitic at the beginning of the noun.This is similar to the definite article in Frenchthat appears as a proclitic when attached to a wordthat starts with a vowel. An example is l'idée,where l' is the definite article, and idée is theFrench word for “idea”. Another example of aclitic is the Italian word muoviamoci “let us go”.The clitic here is the first person plural pronoun ci.Similarly the conjunction “and” appears inArabic as a one-letter proclitic attached to the firstword of the coordinated sequence; this word couldbe a noun (including a noun with prefixed definitearticle), a verb, particle or even a number. OtherArabic clitics include further inseparableconjunctions, pronouns, and some prepositions.For the second version of the lexicon, the cliticsare removed from the words before they areplaced in the lexicon. This allows a rule-basedcomponent of the tagger to match the un-cliticisedword in the lexicon to the word in the runningtext, whatever clitics it is encountered with. In theoriginal lexicon derived from the training corpus,for instance, the ambiguous Arabic word marsappears without any clitics meaning the foreignmonth name “March”, and also with theconjunction “and” to mean the verb “and hepracticed”.The initial version of the lexicon containing allwords from the corpus without removing anyclitics contained 13,912 words. After removingclitics, the lexicon was reduced to 9,986 words.Table 1 shows an extract from the lexicon. Thefirst column contains a transcription of the Arabicword, the second column contains its meaning,and the other columns contain all that words tags.This small extract from the lexicon highlightsthe fact that the lexicon does not cover all thepossible tags of some words. For example, theArabic word hrs appears in the corpus onlymeaning the noun “guards”, but more generally itcould also be the verb “he guarded”. Anotherexample is the word hsn that appears three timesin the training corpus as a proper noun, but it doesnot appear in the training corpus with its othermeaning “good”. It will be necessary in the futureto go through the lexicon and manually add all thepossible tags to each word, or use a larger training

corpus, thought that might still not find allpossible tags.thml carrying, carry VISG2M VISG3Ftwjyhat instructions NPLFkadm servant NSGM?qd convene, contract VPSG3MNSGMhrmyn mosques [dual] NDUMsryfyn distinguished [dual] NDUMmustsfa hospitalNSGFan truly, not, letter N P PS NFmtnql mobile NSGMadwyt medicines NPLFila to PPSahd one, Sunday NSGM NDnql move(V), move(N) VPSG3MNSGM?ly Ali(name), on NP PPSt?ml doing VISG2M VISG3Fhrs guard NPLMhsn Hassan(name) NPTable 1: Extract from the lexicon4 The Arabic POS TaggerAPT was developed using a combination of bothstatistical and rule-based techniques since hybridtaggers seem to produce the highest accuracy rates(see section 1).4.1 Initial TaggingThe first step of the tagger is the initial tagging.This is basically the look-up component of thetagger. Here every word is looked up in thelexicon, and if it is found, then it is given all thepossible tags of that word as specified in thelexicon.In early versions of the tagger, clitics would bestripped off words in the running text before lookupin the lexicon, but this is now incorporated inthe more general morphological analysis describedin the section 4.2.Since the lexicon is so small, the initial look-upis unlikely to find many of the words. Most of thewords that it would find might be particles and ifthe text is from the same genre, then maybe someof the proper nouns and verbs.It is obvious then that because of Arabic'scomplex morphology, some pre-processing ormorphological analysis is required. This isachieved by using the stemmer.4.2 StemmingStemming is the process of removing all of aword's affixes to produce the stem or root. InArabic this means the removal of prefixes,suffixes and infixes. The stemming component isthe rule-based part of the tagger, since rules areused to determine what the affixes are.After the initial tagging, if a word is not found inthe lexicon, it is stemmed. The affixes are used tohelp determine the tag of the word. Sometimesone affix can determine the tag of a word: forexample, if the prefix is the definite article, thenthe word is a noun. On the other hand, acombination of affixes is most commonly used todetermine the tag of the word. For example, if theprefix is a t “which indicates the imperfect verb”and the suffix is wn “which indicates themasculine plural”, then the word is likely to be asecond person plural masculine imperfect verb,such as tdrswn which means “you [pluralmasculine] are studying”.Since the stemming algorithm also uses theArabic word patterns, these can be used todetermine the tag of the word. Most words inArabic are formed using fixed patterns, and thesepatterns have predictable properties and meanings.For example, words that follow a certain patternare plural nouns. This type of plural is called thebroken plural, and it is different from the sound orperfect plural because it is not formed by thesimple addition of suffixes.Some of the problems that the stemmer faces arethat some letters that appear to be affixes are infact part of the word. Another problem is thatsome letters (for instance the long vowels) maychange to other letters when an affix is added, andso the letters should be changed back when thataffix is removed.Tests of the stemmer over Arabic words showthat it achieves an accuracy of 97%, using adictionary of 4,748 triliteral and quadriliteralroots.4.3 Results after Initial TaggingFour corpora were compiled and used for testing.The corpora were chosen because they are fromdifferent countries, and although they all useModern Standard Arabic, some local colloquialdifferences do appear. The last corpus was chosen

probabilities that were calculated can be seen inTable 2. Here, the probability of a Noun beingfollowed by another Noun is 0.711, and theprobability of a Verb being followed by a Noun is0.926.N V P No. Pu.N 0.711 0.065 0.143 0.010 0.071V 0.926 0.037 0.0 0.008 0.029P 0.689 0.199 0.085 0.016 0.011No. 0.509 0.06 0.098 0.009 0.324Pu. 0.492 0.159 0.152 0.046 0.151Table 2: Contextual Probabilities of Tagged CorpusThe statistical tagger achieved an accuracy ofaround 90% when disambiguating ambiguouswords with this tagset. On the other hand, thetagger tagged all unknown words as nouns. Thereason for this is that over 30,000 words (67%) inthe training corpus were nouns, so the probabilityof a word being a noun is much higher than theprobability of it being anything else.ConclusionI am currently recalculating the lexical andcontextual probabilities that will be used by thetagger for the larger tagset. The results of thestatistical tagger will be presented in the nearfuture.Another improvement that is necessary here is togo through the lexicon manually and add all thepossible tags that a word can take.A component that still needs to be added is apre-processing component to handle commonerrors found in Arabic. These include the errors inthe placement of the hamza on the alif (glottalstops), and also placing the dots under the letteryah. An example of such an error can be seen inTable 1, where the name ali should have the dots,while the preposition meaning “on” should nothave the dots.ReferencesRoger Garside, Geoffrey Leech, and Anthony McEnery(1997) Corpus Annotation: Linguistic Informationfrom Computer Text Corpora. Addison WesleyLongman Inc., New York.Kenneth Church (1988) A Stochastic Parts Programand Noun Phrase Parser for Unrestricted Text. In“Proceedings of the Second Conference on AppliedNatural Language Processing” (ACL) Austin, Texes,pp. 136-143.Fernando Sanchez Leon and Amalio F. Nieto Serrano.(1995) Public Domain POS Tagger for Spanish.CRATER Final Deliverable Report No. 22.Evelyne Tzoukermann, Dragomir R. Radev, andWillian A. Gale (1995) Combining LinguisticKnowledge and Statistical Learning in French Partof-SpeechTagging. In EACL SIGDAT Workshop.Association for Computational Linguistics -European Chapter, Dublin, Ireland.B.B. Greene and G.M. Rubin (1971) AutomaticGrammatical Tagging of English. Department ofLinguistics, Brown University, Providence, R.I.Eric Brill (1992) A Simple Rule-Based Part of SpeechTagger. In “Proceedings of the Third Conference onApplied Natural Language Processing”, Trento, Italy,pp. 152-155.Eric Brill (1994) Some Advances in TransformationBased Part of Speech Tagging. In “Proceedings ofthe Twelfth International Conference on ArtificialIntelligence” (AAAI-94), Seattle, WA.Roger Garside, Geoffrey Leech, and Geoffrey Sampson(1987) The Computational Analysis of English: acorpus-based approach. Longman Group UKLimited.Doug Cutting, Julian Kupiec, Jan Pederson, andPenelope Sibun (1992) A Practical Part-of-SpeechTagger. In “Proceedings of the Third Conference onApplied Natural Language Processing”, Trento, Italy.Pasi Tapanainen and Atro Voutilainen (1994) Taggingaccurately - Don't guess if you know. In “Proceedingsof the Fourth ACL Conference on Applied NaturalLanguage Processing”, Stuttgart.Walter Daelemans, Jakub Zavrel, Peter Berck, andSteven Gillis (1996) MBT: A Memory-Based Part ofSpeech Tagger-Generator. In “Proceedings of theFourth Workshop on Very Large Corpora”,Copenhagen, Denmark, pp. 14-27.Brent A. Olde, James Hoener, Patrick Chipman, ArthurC. Graesser, and the Tutoring Research Group (1999)A Connectionist Model for Part of Speech Tagging.In “Proceedings of the 12th International FloridaArtificial Intelligence Research Society Conference”,Menlo Park, CA, pp. 172-176.Nuno Marques and Jos Gabriel Lopes (1996) UsingNeural Nets for Portuguese Part-of-Speech Tagging.In “Proceedings of the Fifth International Conferenceon The Cognitive Science of Natural LanguageProcessing”, Dublin City University.JA Haywood and H M Nahmad (1962) A new Arabicgrammar. Lund Humphries Publishers Ltd.F. Jelinek (1976) Continuous Speech Recognition byStatistical Methods. In “Proceedings of the IEEEE”,pp. 532-556.

APT: Arabic Part-of-speech Tagger - Computer Science

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?