
Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach

Aniket Dalal, Kumar Nagaraj, Uma Sawant, Sandeep Shelke
CSE Department, IIT Bombay, Mumbai
(aniketd,kumar,uma,sandy)@cse.iitb.ac.in

Abstract

We present a statistical Part-of-Speech (POS) tagger and chunker for the Hindi language. Our system employs a Maximum Entropy Markov Model (MEMM), trains on an annotated Hindi corpus and assigns tags (POS tags and chunk labels) to previously unseen text. The model uses multiple features simultaneously to predict the tag for a word. The feature set is broadly classified into context-based features, word features, dictionary features and corpus-based features. Apart from contextual features, which are language independent, we discuss the use of specialized features that capture lexical and morphological properties of the Hindi language. We evaluated our approach on the corpus of the NLPAI-ML 2006 contest, consisting of around 35,000 words annotated with 29 different POS tags and 6 chunk tags. The best per-word accuracies achieved by our method on the development data are 89.346% for POS tagging and 87.399% for chunk labelling. When the models trained on this data were applied to the final evaluation data of the contest, the F1-measure was 82.22% for POS tagging and 82.40% for chunking.

1 Introduction

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech label, such as noun, verb, pronoun or another lexical class marker, to each word in a sentence. POS tagging is a necessary precursor to other natural language processing tasks such as parsing, information retrieval and information extraction.

A word can occur with different lexical class tags in different contexts. The main challenge in POS tagging is resolving this ambiguity among the possible POS tags for a word. Several approaches have been proposed and successfully implemented for English POS tagging; these systems can be grouped as rule-based, statistical and hybrid.

POS tagging can be modelled as a sequence labelling task. Given an input sequence of words $W^n = w_1 w_2 \ldots w_n$, the task is to construct a label sequence $L^n = l_1 l_2 \ldots l_n$, where each label $l_i$ belongs to the set of POS tags.
The generated label sequence $L^n$ should have the highest probability of occurring for the word sequence $W^n$ among all possible label sequences, that is

$$\hat{L}^n = \operatorname*{argmax}_{L^n} \Pr(L^n \mid W^n)$$

Statistical POS tagging methods take this approach. In particular, a Maximum Entropy Markov Model (MEMM) builds a model which captures known information and applies it to obtain the best label sequence (Ratnaparkhi, 1996; Ratnaparkhi, 1997).

After POS tags are identified, the next step is chunking, which divides sentences into non-overlapping, non-recursive phrases. In general, full parsing is expensive and not very robust. Chunking, on the other hand, can be much faster and more robust, yet sufficient for many applications such as information extraction and question answering. It can also serve as a first step towards full parsing. Our system uses six kinds of chunk labels, namely noun phrase (NP), verb phrase (VG), adjective phrase (JJP), adverb phrase (RBP), conjunct phrase (CP) and others (BLK). The task of identifying chunks and their labels is modelled in the same way as that of identifying POS tags.


In this paper, we present a statistical POS tagger and chunker for the Hindi language. We have built separate models for the two tasks which satisfy the maximum entropy principle and can be used to tag unseen text. Our system is tailored to the NLPAI-ML 2006 contest.

This paper is organized as follows. Section 2 gives an overview of maximum entropy models. Feature functions used in Hindi POS tagging and chunking are presented in Section 3. Section 4 provides experimental details and results.

2 Maximum Entropy Markov Model

The maximum entropy (ME) principle states that the least biased model which considers all known information is the one which maximizes entropy. The ME technique builds a model which assumes nothing beyond the imposed constraints. To build such a model, we define feature functions. A feature function is a boolean function which captures some aspect of the language relevant to the sequence labelling task. An example feature function for POS tagging is

$$f_j(l, c) = \begin{cases} 1 & \text{if the current word is alphanumeric,} \\ 0 & \text{otherwise} \end{cases}$$

Here, $l$ is one of the possible labels and $c$ is the context (a set of words surrounding the current word and/or labels of previous words). The relationship between feature functions and labels, as evidenced in the training corpus, is expressed as constraints. The probability distribution satisfying these constraints while making no other assumptions has maximum entropy, is unique, and can be expressed as (Berger et al., 1996)

$$\Pr(l \mid c) = \frac{1}{z(c)} \exp\left(\sum_{j=1}^{k} \lambda_j f_j(l, c)\right)$$

where $z(c)$ is a normalizing constant. The problem of estimating the parameters $\lambda_j$ is solved using the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972). The learnt model is then used to tag unseen text; during tagging, our system applies a beam search to find the most promising label sequence.
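To make the model concrete, here is a minimal Python sketch of the conditional distribution and the beam-search tagging step described above. It is illustrative only: the label subset, the weight table and the feature extractor are placeholder assumptions, not the system's actual implementation, and training (which fills `weights`) is omitted.

```python
import math

# Illustrative label subset and weight table; in the real system the weights
# lambda_j are estimated by Generalized Iterative Scaling over the corpus.
LABELS = ["NN", "VFM", "PRP", "JJ", "SYM"]
weights = {}  # maps (feature, label) -> lambda_j; empty here, so the model is uniform

def active_features(context):
    """Boolean feature functions that fire for this context (assumed examples)."""
    word = context["word"]
    feats = ["word=" + word, "prev_tag=" + context["prev_tag"]]
    if any(ch.isdigit() for ch in word):
        feats.append("is_alphanumeric")  # the example feature from the text
    return feats

def prob(label, context):
    """Pr(l | c) = exp(sum_j lambda_j f_j(l, c)) / z(c)."""
    def score(l):
        return math.exp(sum(weights.get((f, l), 0.0)
                            for f in active_features(context)))
    z = sum(score(l) for l in LABELS)  # normalizing constant z(c)
    return score(label) / z

def beam_search(words, beam_width=5):
    """Keep the beam_width most promising partial label sequences at each word."""
    beam = [([], 0.0)]  # (labels so far, cumulative log-probability)
    for word in words:
        candidates = []
        for labels, logp in beam:
            context = {"word": word, "prev_tag": labels[-1] if labels else "<s>"}
            for l in LABELS:
                candidates.append((labels + [l], logp + math.log(prob(l, context))))
        beam = sorted(candidates, key=lambda c: -c[1])[:beam_width]  # prune
    return beam[0][0]  # highest-probability label sequence

print(beam_search(["rAma", "Para", "2"]))  # ['NN', 'NN', 'NN'] with empty weights
```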
3 Feature Functions

3.1 POS tagging features

For the task of Hindi POS tagging, the main feature functions used in our system are listed below.

Context-based features:
From our empirical analysis, we found that a context window of size four gives the best performance. For a word, the context consists of:

• POS tag of the previous word.
• Combination of the POS tags of the previous two words.
• Current word.
• Next word.

Word features:
Word features capture lexical and morphological properties of the word being tagged. They are:

• Suffixes: whether the word ends with a given suffix.
• Digits: whether the word contains any digits, or is completely numeric.
• Special characters: whether the word contains special characters such as '-'.
• Root of the current word, or of the next word (e.g. 'KaRa').
• English word: to handle English words that occasionally appear in Hindi text.

Dictionary feature:
This feature utilizes information present in a standard Hindi dictionary. We define a feature function for each POS tag: for a POS tag l, if the word being tagged can occur with label l according to the dictionary, then the corresponding feature is true.

Corpus-based features:
These features rely on information extracted from the training corpus. They are:

• Whether the word occurred as a proper noun in training.
• All possible tags of the current word, as seen in training.
• Whether the word occurred with only a single tag in the training corpus.
• All possible tags of the next word, as seen in training.
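As an illustration of how these POS-tagging features could be assembled for a single word, consider the following sketch. The dictionary, corpus statistics and proper-noun list are assumed inputs, the feature names are invented for the example, and the root and English-word features are omitted since they would need a morphological analyser.

```python
def pos_features(words, i, prev_tags, dictionary, corpus_tags, proper_nouns):
    """Collect feature strings for words[i]; inputs besides words/i are assumed.

    dictionary:   word -> set of POS tags allowed by a standard Hindi dictionary
    corpus_tags:  word -> set of POS tags seen for it in the training corpus
    proper_nouns: set of words seen tagged as proper nouns in training
    """
    w = words[i]
    feats = [
        "prev_tag=" + (prev_tags[-1] if prev_tags else "<s>"),  # context window
        "prev_two_tags=" + "+".join(prev_tags[-2:]),
        "word=" + w,
        "next_word=" + (words[i + 1] if i + 1 < len(words) else "</s>"),
    ]
    for k in (1, 2, 3):                        # suffix features
        feats.append("suffix=" + w[-k:])
    if any(ch.isdigit() for ch in w):          # digit features
        feats.append("has_digit")
    if w.isdigit():
        feats.append("all_digits")
    if "-" in w:                               # special-character feature
        feats.append("has_hyphen")
    for tag in sorted(dictionary.get(w, ())):  # dictionary feature, one per tag
        feats.append("dict_allows=" + tag)
    if w in proper_nouns:                      # corpus-based features
        feats.append("seen_as_proper_noun")
    seen = corpus_tags.get(w, set())
    if seen:
        feats.append("seen_tags=" + "|".join(sorted(seen)))
        if len(seen) == 1:
            feats.append("unambiguous_in_training")
    return feats
```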


3.2 Chunking features

The main feature functions used in Hindi chunking are listed below.

Context-based features:
For chunking, the most suitable context window was empirically found to consist of the words, POS tags and chunk labels of the current word and the two words on either side of it. On the lines of (Singh et al., 2005), we found that for words having specific POS tags (JJ, NN, VFM, PREP, SYM, QF, NEG and RP), adding the current word and the word/POS-tag combination as features reduces the performance of the chunker. We call such a POS tag a nonessential-word tag. For a word, the context-based features consist of:

• Current word and the word/POS-tag combination, if the POS tag of the current word is not in the list of nonessential-word tags.
• POS tags of all words in the context, individually.
• Combinations of the POS tags of the next two words, of the previous two words, and of the current and previous words, separately.
• Chunk labels of the previous two words, independently.

Current POS tag based features:
For each tag, the list of possible chunk labels for that tag is identified; these chunk labels are used as features. Another feature based on the POS tag of the current word utilizes what we call a tag class. POS tags are classified into groups based on the most likely chunk label for that POS tag, as seen in the training corpus. For example, all POS tags which are most likely to occur in a noun phrase are grouped under one class. The class of the current word's POS tag is used as a feature, as sketched below.
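To illustrate the tag-class idea, the sketch below derives the classes from an assumed corpus of (word, POS tag, chunk label) triples; the toy corpus and transliterations are invented for the example.

```python
from collections import Counter, defaultdict

def tag_classes(training):
    """Map each POS tag to its most likely chunk label in training.

    `training` is an assumed iterable of (word, pos_tag, chunk_label) triples;
    the returned class (e.g. 'NN' -> 'NP') is used as a feature value.
    """
    counts = defaultdict(Counter)
    for _word, pos, chunk in training:
        counts[pos][chunk] += 1
    return {pos: chunks.most_common(1)[0][0] for pos, chunks in counts.items()}

# Toy example (invented): noun-like tags group into the NP class, VFM into VG.
corpus = [("rAma", "NNP", "NP"), ("Gara", "NN", "NP"), ("gayA", "VFM", "VG")]
print(tag_classes(corpus))  # {'NNP': 'NP', 'NN': 'NP', 'VFM': 'VG'}
```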
4 Experiments

Our system was built for the NLPAI-ML task of POS tagging Indian languages. The tagset of the contest specifies 29 POS tags and 6 chunk labels, and the development corpus for the task was provided by the contest organizers. We conducted experiments with different splits of training and test data.

[Figure 1: POS tagging accuracy with varying training-test data split. Accuracy (y-axis, roughly 0.85-0.90) plotted against training data size in percent (x-axis, 55-95).]

As can be seen in Figure 1, POS tagging accuracy increases with the proportion of training data until it reaches 75%, after which accuracy drops because the trained model overfits the training corpus. Beyond an 85-15 split, increasing the training proportion raises accuracy again, but only because the test corpus becomes very small. This prompted us to use a 75-25 split of training and test data in our experiments. Results were averaged across different runs, each time randomly picking the training and test data.

[Figure 2: POS tagging and chunking accuracy across 10 runs with a 75-25 split.]

Figure 2 shows results using a 75-25 split of training and test data across 10 different runs. Our chunker depends heavily on POS tags, and hence in most cases its accuracy closely tails the POS tagging accuracy. The best POS tagging accuracy of the system in these runs was 89.34% and the lowest was 87.04%, with an average over the 10 runs of 88.4%. For chunking, the best per-word accuracy of chunk labels in these runs was 87.29% and the lowest was 83.25%, with the average being 86.45%.
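The experimental protocol above can be summarized in a short sketch: draw a random 75-25 split, train, measure per-word accuracy, and average over runs. Here `train_model` is a placeholder for the MEMM training step, assumed to return an object with a `tag(words)` method; it is not part of the original system's code.

```python
import random

def average_accuracy(sentences, train_model, runs=10, train_frac=0.75):
    """Average per-word tagging accuracy over `runs` random train/test splits.

    sentences:   list of (words, gold_tags) pairs
    train_model: placeholder for MEMM training; returns an object with .tag(words)
    """
    accuracies = []
    for _ in range(runs):
        data = sentences[:]
        random.shuffle(data)                 # fresh random split each run
        cut = int(train_frac * len(data))
        train, test = data[:cut], data[cut:]
        model = train_model(train)           # placeholder training step
        correct = total = 0
        for words, gold in test:
            predicted = model.tag(words)     # e.g. the beam-search tagger above
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        accuracies.append(correct / total)   # per-word accuracy for this run
    return sum(accuracies) / len(accuracies)
```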


Detailed statistics for a run of the POS tagger with a 75-25 split are provided in Tables 1 and 2.

Tag     Precision    Recall       Instances
CC      0.9159091    0.9482353    425
NN      0.8416244    0.9221357    1798
PREP    0.95674485   0.9804658    1331
QFN     0.897541     0.93191487   235
JJ      0.77710843   0.73295456   352
VFM     0.9081081    0.8993576    934
PRP     0.9819277    0.9702381    840
QF      0.74285716   0.7819549    133
NLOC    0.90384614   0.8468468    111
JVB     0.68         0.6296296    108
VAUX    0.94126505   0.9272997    674
SYM     0.9758389    0.9972565    729
QW      0.9285714    0.8125       48
INTF    0.64285713   0.5869565    46
NNC     0.6839623    0.6415929    226
RP      0.91056913   0.8924303    251
NVB     0.64880955   0.5369458    203
RB      0.8695652    0.7619048    105
VNN     0.91907513   0.9137931    174
VJJ     0.5555556    0.20833333   24
VRB     0.8333333    0.41666666   24
NEG     0.9894737    0.9791667    96
NNPC    0.88         0.6984127    126
NNP     0.7904762    0.53205127   156
RBVB    0.0          0.0          1
UH      0.0          0.0          3
VV      0.0          0.0          1

Table 1: Statistics for individual POS tags in a run with a 75-25 split.

Number of words tagged            9154
Number of words wrongly tagged     975
Correctness accuracy (%)           89.3489

Table 2: Overall statistics for a run with a 75-25 split.

From Table 1, we can observe that our system performs well on the more frequently occurring verb forms (VAUX, VFM, VNN), on postpositions and on pronouns. However, performance on proper nouns is not satisfactory, because a considerable number of proper nouns are tagged as common nouns; in most cases the ambiguity between the two can be resolved only at the semantic level. We also find that compound tags (NNC, NNPC) are incorrectly tagged as the corresponding non-compound tags (NN, NNP).

5 Conclusion

We have presented a part-of-speech tagger and chunker for Hindi based on the maximum entropy framework. We discussed language-dependent as well as language-independent features suitable for Hindi POS tagging and chunking, and showed that such a system performs well, with average accuracies of 88.4% for POS tagging and 86.45% for chunking, and best accuracies of 89.35% and 87.39%, respectively. We believe that further error analysis and more language-specific features would improve system performance, particularly for chunking.

6 Acknowledgment

We would like to thank Dr. Pushpak Bhattacharyya for his guidance. We would also like to thank Manish Shrivastava for many helpful suggestions and comments.

References

Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

J. N. Darroch and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470-1480.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133-142. Association for Computational Linguistics, Somerset, New Jersey.
Adwait Ratnaparkhi. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, May.

Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM based chunker for Hindi. In Proceedings of IJCNLP-05, Jeju Island, Republic of Korea, October.
