
Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach

Aniket Dalal, Kumar Nagaraj, Uma Sawant, Sandeep Shelke
CSE Department, IIT Bombay, Mumbai
(aniketd,kumar,uma,sandy)@cse.iitb.ac.in

Abstract

We present a statistical Part-of-Speech (POS) tagger and chunker for the Hindi language. Our system employs a Maximum Entropy Markov Model (MEMM), trains on an annotated Hindi corpus and assigns tags (POS tags and chunk labels) to previously unseen text. The model uses multiple features simultaneously to predict the tag for a word. The feature set is broadly classified into context-based features, word features, dictionary features and corpus-based features. Apart from contextual features, which are language independent, we discuss the use of specialized features that capture lexical and morphological properties of the Hindi language. We evaluated our approach on the corpus of the NLPAI-ML 2006 contest, consisting of around 35,000 words annotated with 29 different POS tags and 6 chunk tags. The best per-word accuracies achieved by our method on the development data are 89.346% for POS tagging and 87.399% for chunk labelling. When the models trained on this data were applied to the final evaluation data of the contest, the F1-measure was 82.22% for POS tagging and 82.40% for chunking.

1 Introduction

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech label, such as noun, verb, pronoun or another lexical class marker, to each word in a sentence. POS tagging is a necessary precursor to other natural language processing tasks such as parsing, information retrieval and information extraction.

A word can occur with different lexical class tags in different contexts. The main challenge in POS tagging is resolving this ambiguity among the possible POS tags for a word. Several approaches have been proposed and successfully implemented for English POS tagging; these systems can be grouped as rule-based, statistical and hybrid.

POS tagging can be modelled as a sequence labelling task. Given an input sequence of words $W^n = w_1 w_2 \ldots w_n$, the task is to construct a label sequence $L^n = l_1 l_2 \ldots l_n$, where each label $l_i$ belongs to the set of POS tags.
The generated label sequence $L^n$ should have the highest probability of occurring for the word sequence $W^n$ among all possible label sequences, that is

$$\hat{L}^n = \operatorname*{argmax}_{L^n} \Pr(L^n \mid W^n)$$

Statistical POS tagging methods take this approach. In particular, a Maximum Entropy Markov Model (MEMM) builds a model which captures known information and applies it to obtain the best label sequence (Ratnaparkhi, 1996; Ratnaparkhi, 1997).

After POS tags are identified, the next step is chunking, which divides sentences into non-overlapping, non-recursive phrases. In general, full parsing is expensive and not very robust. Chunking, on the other hand, can be much faster and more robust, yet sufficient for many applications such as information extraction and question answering. It can also serve as a first step towards full parsing. Our system uses six kinds of chunk labels, namely noun phrase (NP), verb phrase (VG), adjective phrase (JJP), adverb phrase (RBP), conjunct phrase (CP) and others (BLK). The task of identifying chunks and their labels is modelled in the same way as that of identifying POS tags.


In this paper, we present a statistical POS tagger and chunker for the Hindi language. We have built separate models for the two tasks which satisfy the maximum entropy principle and can be used to tag unseen text. Our system is tailored to the NLPAI-ML 2006 contest.

This paper is organized as follows. Section 2 gives an overview of maximum entropy models. Feature functions used in Hindi POS tagging and chunking are presented in Section 3. Section 4 provides experimental details and results.

2 Maximum Entropy Markov Model

The maximum entropy (ME) principle states that the least biased model which considers all known information is the one which maximizes entropy. The ME technique builds a model which assumes nothing beyond the imposed constraints. To build such a model, we define feature functions. A feature function is a boolean function which captures some aspect of the language relevant to the sequence labelling task. An example feature function for POS tagging is

$$f_j(l, c) = \begin{cases} 1 & \text{if the current word is alphanumeric,} \\ 0 & \text{otherwise} \end{cases}$$

Here, $l$ is one of the possible labels and $c$ is the context (a set of words surrounding the current word and/or labels of previous words). The relationship between feature functions and labels, as evidenced in the training corpus, is expressed as constraints. The probability distribution satisfying these constraints while making no other assumptions has maximum entropy, is unique, and can be expressed as (Berger et al., 1996)

$$\Pr(l \mid c) = \frac{1}{z(c)} \exp\left(\sum_{j=1}^{k} \lambda_j f_j(l, c)\right)$$

where $z(c)$ is a normalizing constant. The problem of estimating the parameters $\lambda_j$ is solved using the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972). The learnt model is then used to tag unseen text; during tagging, our system applies a beam search to find the most promising label sequence.
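To make the model concrete, here is a minimal Python sketch of the conditional distribution and the beam-search tagging step described above. It is illustrative only: the label subset, the weight table and the feature extractor are placeholder assumptions, not the system's actual implementation, and training (which fills `weights`) is omitted.

```python
import math

# Illustrative label subset and weight table; in the real system the weights
# lambda_j are estimated by Generalized Iterative Scaling over the corpus.
LABELS = ["NN", "VFM", "PRP", "JJ", "SYM"]
weights = {}  # maps (feature, label) -> lambda_j; empty here, so the model is uniform

def active_features(context):
    """Boolean feature functions that fire for this context (assumed examples)."""
    word = context["word"]
    feats = ["word=" + word, "prev_tag=" + context["prev_tag"]]
    if any(ch.isdigit() for ch in word):
        feats.append("is_alphanumeric")  # the example feature from the text
    return feats

def prob(label, context):
    """Pr(l | c) = exp(sum_j lambda_j f_j(l, c)) / z(c)."""
    def score(l):
        return math.exp(sum(weights.get((f, l), 0.0)
                            for f in active_features(context)))
    z = sum(score(l) for l in LABELS)  # normalizing constant z(c)
    return score(label) / z

def beam_search(words, beam_width=5):
    """Keep the beam_width most promising partial label sequences at each word."""
    beam = [([], 0.0)]  # (labels so far, cumulative log-probability)
    for word in words:
        candidates = []
        for labels, logp in beam:
            context = {"word": word, "prev_tag": labels[-1] if labels else "<s>"}
            for l in LABELS:
                candidates.append((labels + [l], logp + math.log(prob(l, context))))
        beam = sorted(candidates, key=lambda c: -c[1])[:beam_width]  # prune
    return beam[0][0]  # highest-probability label sequence

print(beam_search(["rAma", "Para", "2"]))  # ['NN', 'NN', 'NN'] with empty weights
```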
3 Feature Functions

3.1 POS tagging features

For the task of Hindi POS tagging, the main feature functions used in our system are listed below.

Context-based features:
From our empirical analysis, we found that a context window of size four gives the best performance. For a word, the context consists of:

• POS tag of the previous word.
• Combination of the POS tags of the previous two words.
• Current word.
• Next word.

Word features:
Word features capture lexical and morphological properties of the word being tagged. They are:

• Suffixes: whether the word ends with a given suffix.
• Digits: whether the word contains any digits, or is completely numeric.
• Special characters: whether the word contains special characters such as '-'.
• Root of the current word, or of the next word (e.g. 'KaRa').
• English word: to handle English words that occasionally appear in Hindi text.

Dictionary feature:
This feature utilizes information present in a standard Hindi dictionary. We define a feature function for each POS tag: for a POS tag l, if the word being tagged can occur with label l according to the dictionary, then the corresponding feature is true.

Corpus-based features:
These features rely on information extracted from the training corpus. They are:

• Whether the word occurred as a proper noun in training.
• All possible tags of the current word, as seen in training.
• Whether the word occurred with only a single tag in the training corpus.
• All possible tags of the next word, as seen in training.
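As an illustration of how these POS-tagging features could be assembled for a single word, consider the following sketch. The dictionary, corpus statistics and proper-noun list are assumed inputs, the feature names are invented for the example, and the root and English-word features are omitted since they would need a morphological analyser.

```python
def pos_features(words, i, prev_tags, dictionary, corpus_tags, proper_nouns):
    """Collect feature strings for words[i]; inputs besides words/i are assumed.

    dictionary:   word -> set of POS tags allowed by a standard Hindi dictionary
    corpus_tags:  word -> set of POS tags seen for it in the training corpus
    proper_nouns: set of words seen tagged as proper nouns in training
    """
    w = words[i]
    feats = [
        "prev_tag=" + (prev_tags[-1] if prev_tags else "<s>"),  # context window
        "prev_two_tags=" + "+".join(prev_tags[-2:]),
        "word=" + w,
        "next_word=" + (words[i + 1] if i + 1 < len(words) else "</s>"),
    ]
    for k in (1, 2, 3):                        # suffix features
        feats.append("suffix=" + w[-k:])
    if any(ch.isdigit() for ch in w):          # digit features
        feats.append("has_digit")
    if w.isdigit():
        feats.append("all_digits")
    if "-" in w:                               # special-character feature
        feats.append("has_hyphen")
    for tag in sorted(dictionary.get(w, ())):  # dictionary feature, one per tag
        feats.append("dict_allows=" + tag)
    if w in proper_nouns:                      # corpus-based features
        feats.append("seen_as_proper_noun")
    seen = corpus_tags.get(w, set())
    if seen:
        feats.append("seen_tags=" + "|".join(sorted(seen)))
        if len(seen) == 1:
            feats.append("unambiguous_in_training")
    return feats
```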


3.2 Chunking features

The main feature functions used in Hindi chunking are listed below.

Context-based features:
For chunking, the most suitable context window was empirically found to consist of the words, POS tags and chunk labels of the current word and the two words on either side of it. On the lines of (Singh et al., 2005), we found that for words having specific POS tags (JJ, NN, VFM, PREP, SYM, QF, NEG and RP), adding the current word and the word/POS-tag combination as features reduces the performance of the chunker. We call such a POS tag a nonessential-word tag. For a word, the context-based features consist of:

• Current word and the word/POS-tag combination, if the POS tag of the current word is not in the list of nonessential-word tags.
• POS tags of all words in the context, individually.
• Combinations of the POS tags of the next two words, of the previous two words, and of the current and previous words, separately.
• Chunk labels of the previous two words, independently.

Current POS tag based features:
For each tag, the list of possible chunk labels for that tag is identified; these chunk labels are used as features. Another feature based on the POS tag of the current word utilizes what we call a tag class. POS tags are classified into groups based on the most likely chunk label for that POS tag, as seen in the training corpus. For example, all POS tags which are most likely to occur in a noun phrase are grouped under one class. The class of the current word's POS tag is used as a feature, as sketched below.
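To illustrate the tag-class idea, the sketch below derives the classes from an assumed corpus of (word, POS tag, chunk label) triples; the toy corpus and transliterations are invented for the example.

```python
from collections import Counter, defaultdict

def tag_classes(training):
    """Map each POS tag to its most likely chunk label in training.

    `training` is an assumed iterable of (word, pos_tag, chunk_label) triples;
    the returned class (e.g. 'NN' -> 'NP') is used as a feature value.
    """
    counts = defaultdict(Counter)
    for _word, pos, chunk in training:
        counts[pos][chunk] += 1
    return {pos: chunks.most_common(1)[0][0] for pos, chunks in counts.items()}

# Toy example (invented): noun-like tags group into the NP class, VFM into VG.
corpus = [("rAma", "NNP", "NP"), ("Gara", "NN", "NP"), ("gayA", "VFM", "VG")]
print(tag_classes(corpus))  # {'NNP': 'NP', 'NN': 'NP', 'VFM': 'VG'}
```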
4 Experiments

Our system was built for the NLPAI-ML task of POS tagging Indian languages. The tagset of the contest specifies 29 POS tags and 6 chunk labels, and the development corpus for the task was provided by the contest organizers. We conducted experiments with different splits of training and test data.

[Figure 1: POS tagging accuracy with varying training-test data split. Accuracy (y-axis, roughly 0.85-0.90) plotted against training data size in percent (x-axis, 55-95).]

As can be seen in Figure 1, POS tagging accuracy increases with the proportion of training data until it reaches 75%, after which accuracy drops because the trained model overfits the training corpus. Beyond an 85-15 split, increasing the training proportion raises accuracy again, but only because the test corpus becomes very small. This prompted us to use a 75-25 split of training and test data in our experiments. Results were averaged across different runs, each time randomly picking the training and test data.

[Figure 2: POS tagging and chunking accuracy across 10 runs with a 75-25 split.]

Figure 2 shows results using a 75-25 split of training and test data across 10 different runs. Our chunker depends heavily on POS tags, and hence in most cases its accuracy closely tails the POS tagging accuracy. The best POS tagging accuracy of the system in these runs was 89.34% and the lowest was 87.04%, with an average over the 10 runs of 88.4%. For chunking, the best per-word accuracy of chunk labels in these runs was 87.29% and the lowest was 83.25%, with the average being 86.45%.
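The experimental protocol above can be summarized in a short sketch: draw a random 75-25 split, train, measure per-word accuracy, and average over runs. Here `train_model` is a placeholder for the MEMM training step, assumed to return an object with a `tag(words)` method; it is not part of the original system's code.

```python
import random

def average_accuracy(sentences, train_model, runs=10, train_frac=0.75):
    """Average per-word tagging accuracy over `runs` random train/test splits.

    sentences:   list of (words, gold_tags) pairs
    train_model: placeholder for MEMM training; returns an object with .tag(words)
    """
    accuracies = []
    for _ in range(runs):
        data = sentences[:]
        random.shuffle(data)                 # fresh random split each run
        cut = int(train_frac * len(data))
        train, test = data[:cut], data[cut:]
        model = train_model(train)           # placeholder training step
        correct = total = 0
        for words, gold in test:
            predicted = model.tag(words)     # e.g. the beam-search tagger above
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        accuracies.append(correct / total)   # per-word accuracy for this run
    return sum(accuracies) / len(accuracies)
```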


Detailed statistics for a run of the POS tagger with a 75-25 split are provided in Tables 1 and 2.

Tag     Precision    Recall       Instances
CC      0.9159091    0.9482353    425
NN      0.8416244    0.9221357    1798
PREP    0.95674485   0.9804658    1331
QFN     0.897541     0.93191487   235
JJ      0.77710843   0.73295456   352
VFM     0.9081081    0.8993576    934
PRP     0.9819277    0.9702381    840
QF      0.74285716   0.7819549    133
NLOC    0.90384614   0.8468468    111
JVB     0.68         0.6296296    108
VAUX    0.94126505   0.9272997    674
SYM     0.9758389    0.9972565    729
QW      0.9285714    0.8125       48
INTF    0.64285713   0.5869565    46
NNC     0.6839623    0.6415929    226
RP      0.91056913   0.8924303    251
NVB     0.64880955   0.5369458    203
RB      0.8695652    0.7619048    105
VNN     0.91907513   0.9137931    174
VJJ     0.5555556    0.20833333   24
VRB     0.8333333    0.41666666   24
NEG     0.9894737    0.9791667    96
NNPC    0.88         0.6984127    126
NNP     0.7904762    0.53205127   156
RBVB    0.0          0.0          1
UH      0.0          0.0          3
VV      0.0          0.0          1

Table 1: Statistics for individual POS tags in a run with a 75-25 split.

Number of words tagged            9154
Number of words wrongly tagged     975
Correctness accuracy (%)           89.3489

Table 2: Overall statistics for a run with a 75-25 split.

From Table 1, we can observe that our system performs well on the more frequently occurring verb forms (VAUX, VFM, VNN), on postpositions and on pronouns. However, performance on proper nouns is not satisfactory, because a considerable number of proper nouns are tagged as common nouns; in most cases the ambiguity between the two can be resolved only at the semantic level. We also find that compound tags (NNC, NNPC) are incorrectly tagged as the corresponding non-compound tags (NN, NNP).

5 Conclusion

We have presented a part-of-speech tagger and chunker for Hindi based on the maximum entropy framework. We discussed language-dependent as well as language-independent features suitable for Hindi POS tagging and chunking, and showed that such a system performs well, with average accuracies of 88.4% for POS tagging and 86.45% for chunking, and best accuracies of 89.35% and 87.39%, respectively. We believe that further error analysis and more language-specific features would improve system performance, particularly for chunking.

6 Acknowledgment

We would like to thank Dr. Pushpak Bhattacharyya for his guidance. We would also like to thank Manish Shrivastava for many helpful suggestions and comments.

References

Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

J. N. Darroch and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470-1480.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133-142. Association for Computational Linguistics, Somerset, New Jersey.
Adwait Ratnaparkhi. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, May.

Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM based chunker for Hindi. In Proceedings of IJCNLP-05, Jeju Island, Republic of Korea, October.
