11.07.2015 Views

Hindi Part-of-Speech Tagging and Chunking : A Maximum Entropy ...


identifying chunks and their labels is modeled in the same way as that of identifying POS tags.

In this paper, we present a statistical POS tagger and chunker for the Hindi language. We have built separate models for the two tasks; both satisfy the maximum entropy principle and can be used to tag unseen text. Our system is tailored for the NLPAI-ML contest 2006.

This paper is organized as follows. Section 2 gives an overview of maximum entropy models. Feature functions used in Hindi POS tagging and chunking are presented in Section 3. Section 4 provides experimental details and results.

2 Maximum Entropy Markov Model

The maximum entropy (ME) principle states that the least biased model which considers all known information is the one which maximizes entropy. The ME technique builds a model which assumes nothing other than the imposed constraints. To build such a model, we define feature functions. A feature function is a boolean function which captures some aspect of the language that is relevant to the sequence labelling task. An example feature function for POS tagging is

$$
f_j(l, c) =
\begin{cases}
1 & \text{if current word is alphanumeric,} \\
0 & \text{otherwise}
\end{cases}
$$

Here, $l$ is one of the possible labels and $c$ is the context¹. The relationship between feature functions and labels, as evidenced in the training corpus, is expressed as constraints. The probability distribution which satisfies these constraints and makes no other assumptions has maximum entropy, is unique, and can be expressed as (Berger et al., 1996)

$$
\Pr(l \mid c) = \frac{1}{z(c)} \exp\left( \sum_{j=1}^{k} \lambda_j f_j(l, c) \right)
$$

where $z(c)$ is a normalizing constant.
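The distribution above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names (`NN`, `VB`, `JJ`, `DT`), the two feature functions, the example word and the weights are all hypothetical and are not taken from the system described here. The features are made label-dependent, since a feature that ignores the label cancels out in the normalization by $z(c)$.

```python
import math

def me_probabilities(labels, context, features, weights):
    """Return Pr(l | c) for every label l: exp of the weighted feature sum, divided by z(c)."""
    scores = {}
    for label in labels:
        weighted_sum = sum(w * f(label, context) for f, w in zip(features, weights))
        scores[label] = math.exp(weighted_sum)
    z = sum(scores.values())  # normalizing constant z(c)
    return {label: score / z for label, score in scores.items()}

# Hypothetical boolean feature functions f_j(l, c); not the paper's actual features.
def f_alnum_noun(label, context):
    return 1 if context["word"].isalnum() and label == "NN" else 0

def f_after_det_adj(label, context):
    return 1 if context["prev_tag"] == "DT" and label == "JJ" else 0

labels = ["NN", "VB", "JJ"]                      # assumed tag set, for illustration only
context = {"word": "ghar", "prev_tag": "DT"}     # assumed context representation
weights = [1.0, 0.5]                             # assumed lambda_j values (normally learnt)

probs = me_probabilities(labels, context, [f_alnum_noun, f_after_det_adj], weights)
print(probs)
```

The returned values sum to one by construction, and a label receives more probability mass the larger the weighted sum of the features that fire for it.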
The problem of estimating the $\lambda_j$ parameters is solved by using the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972). This learnt model is used for tagging unseen text. In our system, during tagging, the Beam Search algorithm is applied to find the most promising label sequence.

¹ Context is a set of words surrounding the current word and/or labels of previous words.

3 Feature Functions

3.1 POS tagging features

For the task of Hindi POS tagging, the main feature functions used in our system are listed below:

Context-based features:
From our empirical analysis, we found that a context window of size four gives the best performance. For a word, the context consists of:
• POS tag of previous word.
• Combination of POS tags of previous two words.
• Current word.
• Next word.

Word features:
Word features capture lexical and morphological properties of the word being tagged. They are:
• Suffixes: If the word suffix is same as a given suffix.
• Digits: Does the word have any digits, or is the word completely numeric.
• Special characters: Are there any special characters like ‘-’ in the word.
• Root of current word, or the next word (e.g. ‘KaRa’)
• English word: To handle English words that occasionally appear in Hindi text.

Dictionary feature:
This feature utilizes information present in a standard Hindi dictionary. We define a feature function for each POS tag. For a POS tag l, if the word being tagged can occur with label l according to the dictionary, then the corresponding feature is true.

Corpus-based features: These features rely on information extracted from the training corpus.
They are:
• Has the word occurred as a proper noun in training.
• All possible tags of the current word, as seen in training.
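The feature groups listed above can be sketched as a single extraction routine. This is a minimal sketch under assumptions, not the system's actual implementation: the function name, the string encoding of features, the boundary markers `<S>` and `</S>`, the tag names (`NNP`, `QC`, `NN`), the example tokens and the lookup tables are all hypothetical. The root and English-word features are omitted because they would need a morphological analyzer and a language check that the text does not specify.

```python
def extract_features(words, i, prev_tags, suffixes, dictionary_tags, training_tags):
    """Return the active features for words[i] as a list of strings."""
    word = words[i]
    prev1 = prev_tags[-1] if len(prev_tags) >= 1 else "<S>"
    prev2 = prev_tags[-2] if len(prev_tags) >= 2 else "<S>"
    next_word = words[i + 1] if i + 1 < len(words) else "</S>"

    # Context-based features: previous tag, previous two tags, current word, next word.
    feats = [
        f"prev_tag={prev1}",
        f"prev_two_tags={prev2}+{prev1}",
        f"word={word}",
        f"next_word={next_word}",
    ]

    # Word features: suffix match, digits, special characters.
    feats += [f"suffix={s}" for s in suffixes if word.endswith(s)]
    if any(ch.isdigit() for ch in word):
        feats.append("has_digit")
    if word.isdigit():
        feats.append("all_digits")
    if any(not ch.isalnum() for ch in word):
        feats.append("has_special_char")

    # Dictionary feature: one feature per tag the dictionary allows for this word.
    feats += [f"dict_tag={t}" for t in sorted(dictionary_tags.get(word, ()))]

    # Corpus-based features: seen as proper noun, and all tags seen in training.
    seen = training_tags.get(word, set())
    if "NNP" in seen:  # "NNP" as the proper-noun tag is an assumption
        feats.append("seen_as_proper_noun")
    feats += [f"train_tag={t}" for t in sorted(seen)]
    return feats

# Hypothetical data, for illustration only.
words = ["mohana", "12", "kitaabeM", "laayaa"]
suffixes = ["eM", "aa"]
dictionary_tags = {"kitaabeM": {"NN"}}
training_tags = {"mohana": {"NNP"}, "kitaabeM": {"NN"}}

feats_first = extract_features(words, 0, [], suffixes, dictionary_tags, training_tags)
feats_number = extract_features(words, 1, ["NNP"], suffixes, dictionary_tags, training_tags)
feats_third = extract_features(words, 2, ["NNP", "QC"], suffixes, dictionary_tags, training_tags)
print(feats_third)
```

In a maximum entropy tagger, each such string would typically be paired with a candidate label to form one boolean feature function, so the same extracted list serves every label being scored.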
