Hindi Part-of-Speech Tagging and Chunking : A Maximum Entropy ...


• Whether the word has occurred with only a single tag in the training corpus.
• All possible tags of the next word, as seen in training.

3.2 Chunking features

The main feature functions used in Hindi chunking are listed below.

Context-based features:
For chunking, the most suitable context window was empirically found to consist of the words, POS tags and chunk labels of the current word and two words on either side of it. Along the lines of (Singh et al., 2005), we found that for words having specific POS tags (JJ, NN, VFM, PREP, SYM, QF, NEG and RP), adding the current word and the combination of the word and its POS tag as features reduces the performance of the chunker. We call such a POS tag a nonessential-word tag. For a word, the context-based features consist of:

• The current word and the (word, POS tag) combination, if the POS tag of the current word is not in the list of nonessential-word tags.
• POS tags of all words in the context, individually.
• Combinations of the POS tags of the next two words, of the previous two words, and of the current word and previous word, separately.
• Chunk labels of the previous two words, independently.

Current POS tag based features:
For each tag, the list of possible chunk labels for that tag is identified. These chunk labels are used as features. Another feature based on the POS tag of the current word uses what we call the tag class. POS tags are classified into groups based on the most likely chunk label for that POS tag, as seen in the training corpus. For example, all POS tags that are most likely to occur in a noun phrase are grouped under one class. The class of the current word's POS tag is used as a feature.

4 Experiments

Our system is built for the NLPAI-ML task of POS tagging Indian Languages. The tagset of the contest specifies 29 POS tags and 6 chunk labels. The development corpus for the task was provided by the contest organizers. We conducted experiments for different splits of training and test data.

[Figure 1: POS tagging accuracy with varying training-test data split. Plot title: Accuracy v/s Training Data Size; x-axis: Training Data Size (%); y-axis: Accuracy.]

[Figure 2: Accuracy across runs. Series: chunking accuracy and POS tagging accuracy; x-axis: Run; y-axis: Accuracy.]

As can be seen in Figure 1, POS tagging accuracy increases with the proportion of training data until it reaches 75%, after which accuracy drops due to overfitting of the trained model to the training corpus. Beyond a split of 85-15, increasing the training corpus proportion increases the accuracy, as the test corpus becomes very small. This prompted us to use a 75-25 split of training and test data in our experiments. The results were averaged across different runs, each time randomly picking the training and test data. Figure 2 shows results using a 75-25 split of training and test data across 10 different runs. Our chunker depends heavily on POS tags and hence, in most cases, its accuracy closely tails the POS tagging accuracy. The best POS tagging accuracy of the system in these runs was 89.34% and the lowest was 87.04%. The average accuracy over 10 runs was 88.4%. For chunking, the best accuracy of chunk labels
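
The evaluation protocol used in the experiments (repeated random 75-25 splits, with the best, lowest and average accuracy reported over 10 runs) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the seed, and the toy corpus are hypothetical, and the model is a simple most-frequent-tag baseline standing in for the maximum entropy tagger, which is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def train_baseline(train_sents):
    """Stand-in model (assumption): most frequent tag per word.

    This is NOT the paper's maximum entropy tagger; it only gives the
    evaluation loop something to train and score.
    """
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen words
    lookup = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lookup.get(word, default_tag)


def accuracy(tagger, test_sents):
    """Fraction of test tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sent in test_sents:
        for word, tag in sent:
            total += 1
            correct += tagger(word) == tag
    return correct / total


def evaluate_random_splits(sentences, runs=10, train_fraction=0.75, seed=0):
    """Shuffle, split train/test, train, score; repeat and summarize.

    Assumes the corpus is large enough that both parts are non-empty.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        tagger = train_baseline(shuffled[:cut])
        scores.append(accuracy(tagger, shuffled[cut:]))
    return {
        "best": max(scores),
        "worst": min(scores),
        "mean": sum(scores) / len(scores),
        "scores": scores,
    }


# Toy corpus (hypothetical data, for illustration only).
toy_corpus = [
    [("ghar", "NN"), ("mein", "PREP")],
    [("bada", "JJ"), ("ghar", "NN")],
    [("ghar", "NN"), ("gaya", "VFM")],
    [("bada", "JJ"), ("ghar", "NN"), ("mein", "PREP")],
    [("mein", "PREP"), ("gaya", "VFM")],
    [("ghar", "NN"), ("mein", "PREP"), ("gaya", "VFM")],
    [("bada", "JJ"), ("gaya", "VFM")],
    [("ghar", "NN"), ("bada", "JJ")],
]

summary = evaluate_random_splits(toy_corpus)
print(summary["best"], summary["worst"], summary["mean"])
```

Shuffling whole sentences rather than individual tokens keeps each sentence's context intact within either the training or the test portion, which matters for a tagger that uses neighbouring words as features; whether the original experiments split at sentence level is an assumption here.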
