
Morpheme based Language Model for Tamil Part-of-Speech Tagging

S. Lakshmana Pandian and T. V. Geetha

Department of Computer Science and Engineering, Anna University, Chennai, India (lpandian72@yahoo.com)

Manuscript received May 12, 2008; accepted for publication October 25, 2008.

Abstract—This paper describes Tamil part-of-speech (POS) tagging using a corpus-based approach, formulating a language model from the morpheme components of words. Rule-based tagging, Markov model taggers, Hidden Markov Model taggers, and transformation-based learning taggers are some of the methods available for part-of-speech tagging. In this paper, we present a language model based on the stem type, the last morpheme, and the morpheme preceding the last morpheme of a word for categorizing its part of speech. For estimating the contribution factors of the model, we follow a generalized iterative scaling technique. The presented model achieves an overall F-measure of 96%.

Index Terms—Bayesian learning, language model, morpheme components, generalized iterative scaling.

I. INTRODUCTION

Part-of-speech tagging, i.e., the process of assigning a part-of-speech label to each word in a given text, is an important aspect of natural language processing. The first task of any POS tagging process is to choose the set of POS tags. A tag set is normally chosen based on the language technology application for which the POS tags are used. In this work, we have chosen a tag set of 35 categories for Tamil, keeping in mind applications like named entity recognition and question answering systems. The major complexity in the POS tagging task is choosing the tag for a word by resolving ambiguity in cases where the word can occur with different POS tags in different contexts. Rule-based approaches, statistical approaches, and hybrid approaches combining the two have been used for POS tagging. In this work, we have used a statistical language model for assigning part-of-speech tags, exploiting the role of morphological context in choosing POS tags. The paper is organized as follows. Section II gives an overview of existing POS tagging approaches. The language characteristics used for POS tagging and the list of POS tags used are described in Section III. Section IV describes the effect of morphological context on tagging, Section V describes the design of the language model, Section VI the estimation of its contribution factors, and Section VII the evaluation and results.

II. RELATED WORK

The earliest taggers used a rule-based approach, assigning tags on the basis of word patterns and of the tags assigned to the preceding and following words [8], [9]. The Brill tagger used for English is a rule-based tagger that uses hand-written rules to distinguish tag entities [16]. It generates lexical rules automatically from input tokens annotated with the most likely tags, and then employs these rules to identify the tags of unknown words. Hidden Markov Model taggers [15] have been widely used for English POS tagging, where the main emphasis is on maximizing the product of the word likelihood and the tag sequence probability. In essence, these taggers exploit the fixed word order of English to find the tag probabilities. TnT [14] is a stochastic Hidden Markov Model tagger that estimates the lexical probability of unknown words from their suffixes, by comparison with the suffixes of words in the training corpus. TnT has been used to build suffix-based language models for German and English. However, most Hidden Markov Models are based on sequences of words and are better suited to languages with relatively fixed word order.

A POS tagger for the relatively free word order Indian languages needs more than word-based POS sequences. A POS tagger for Hindi, a partially free word order language, has been designed on the Hidden Markov Model framework proposed by Brants [14]. This tagger chooses the best tag sequence for a given word sequence; however, language-specific features and context were not considered to tackle the partially free word order of Hindi. Aniket Dalal et al. [1] describe a Hindi part-of-speech tagger using a Maximum Entropy model, in which the main POS tagging features are word-based context, one-level suffixes, and dictionary-based features. A word-based hybrid model [7] has been used for POS tagging of Bengali, where the Hidden Markov Model probabilities of words are updated using both tagged and untagged corpora; in the case of an untagged corpus, the Expectation Maximization algorithm is used to update the probabilities.

III. PARTS OF SPEECH IN TAMIL

Tamil is a morphologically rich language, which gives it relatively free word order. Most Tamil words take on more than one morphological suffix; the number of suffixes is often 3, with a maximum of up to 13. The role of the sequence of morphological suffixes attached to a word in determining its part-of-speech tag is an interesting property of the Tamil language. In this work, we have identified 79 morpheme components, which can combine to form about 2,000 possible integrated suffixes. The two basic parts of speech, noun and verb, are mutually distinguished by their grammatical inflections. In Tamil, nouns grammatically mark number and case; Tamil nouns take on eight basic cases. The normal morphological derivative of a Tamil noun is as follows:

Stem Noun + [Plural Marker] + [Oblique] + [Case Marker]

The normal morphological derivative of a Tamil verb is as follows:

Stem Verb + [Tense Marker] + [Verbal Participle Suffix] + [Auxiliary Verb + [Tense Marker]] + [Person, Number, Gender]

In addition, adjectives, adverbs, pronouns, and postpositions are also stems that take suffixes. In this work, we have used a tagged corpus of 470,910 words, tagged with 35 POS categories in a semiautomatic manner using an available morphological analyzer [5], which separates the stem and all morpheme components. It also provides the type of the stem using a lexicon. This tagged corpus is the basis of our language model.
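To make the counting used in later sections concrete, the sketch below (Python; the class and field names are our own illustration, not from the paper) shows one way an analyzed corpus entry of the kind just described could be represented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaggedWord:
    """One corpus entry: a word split by the morphological analyzer."""
    word: str             # surface form
    stem: str             # stem produced by the analyzer
    stem_type: str        # lexical category of the stem, from the lexicon
    morphemes: List[str]  # suffix morpheme components, in order
    pos: str              # one of the 35 POS tags

    @property
    def final(self) -> str:
        """Final morpheme component e_l, or '' for a bare stem."""
        return self.morphemes[-1] if self.morphemes else ""

    @property
    def prefinal(self) -> str:
        """Pre-final morpheme component e_{l-1}, if present."""
        return self.morphemes[-2] if len(self.morphemes) > 1 else ""
```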

IV. EFFECT OF MORPHOLOGICAL INFORMATION IN TAGGING

In the relatively free word order of Tamil, the main verb normally occupies the terminating position, and words of all other categories can occur in any position in the sentence. Pure word-based approaches to POS tagging are therefore not effective. As described in the previous section, Tamil has a rich multi-level morphology. This work exploits this multi-level morphology in determining the POS category of a word. The stem and the pre-final and final morpheme components attached to the word, that is, the word's derivative form, normally contribute to choosing the POS category of the word. Certain sequences of morphemes can be attached only to certain stem types. Moreover, the context in which morphological components occur, and their combinations, also contribute to choosing the POS tag for a word. In this work, language models have been used to determine the probabilities of a word derivative form functioning as a particular POS category.

TABLE I.
LIST OF TAGS FOR TAMIL AND THEIR DESCRIPTION

     Tag    Description
 1.  N      Noun
 2.  NP     Noun Phrase
 3.  NN     Noun + Noun
 4.  NNP    Noun + Noun Phrase
 5.  IN     Interrogative Noun
 6.  INP    Interrogative Noun Phrase
 7.  PN     Pronominal Noun
 8.  PNP    Pronominal Noun Phrase
 9.  VN     Verbal Noun
10.  VNP    Verbal Noun Phrase
11.  Pn     Pronoun
12.  PnP    Pronoun Phrase
13.  Nn     Nominal Noun
14.  NnP    Nominal Noun Phrase
15.  V      Verb
16.  VP     Verbal Phrase
17.  Vinf   Verb Infinitive
18.  Vvp    Verb Verbal Participle
19.  Vrp    Verbal Relative Participle
20.  AV     Auxiliary Verb
21.  FV     Finite Verb
22.  NFV    Negative Finite Verb
23.  Adv    Adverb
24.  SP     Subordinate Clause Conjunction Phrase
25.  SCC    Subordinate Clause Conjunction
26.  Par    Particle
27.  Adj    Adjective
28.  Iadj   Interrogative Adjective
29.  Dadj   Demonstrative Adjective
30.  Inter  Interjection
31.  Int    Intensifier
32.  CNum   Character Number
33.  Num    Number
34.  DT     Date/Time
35.  PO     Postposition

V. LANGUAGE MODEL

Language models, in general, predict the occurrence of a unit based on statistical information about the unit's category and the context in which the unit occurs. Language models can differ in the basic units used for building the model, the features attached to the units, and the length of the context used for prediction. The units used for building language models depend on the application for which the model is used. Bayes' theorem is usually used for estimating posterior probabilities from statistical information. Bayes' theorem relates the conditional and marginal probabilities of stochastic events A and B as follows:

$\Pr(A \mid B) = \dfrac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)} \propto L(A \mid B)\,\Pr(A)$   (5.1)

where L(A|B) is the likelihood of A given fixed B. Each term in Bayes' theorem is described as follows.

Pr(A) is the prior or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.

Pr(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Pr(A|B) is the conditional probability of A given B. It is also called the posterior probability because it is derived from, or depends upon, the specified value of B.

Pr(B|A) is the conditional probability of B given A.

In this paper, we have used a language model that considers the lexical category of the stem along with the morphological components of a word in order to determine its POS tag.

If the word is identical to a stem, then its POS category is the same as the stem's lexical category. If the word consists of a stem and a single morphological component, then the language model is designed by considering both the lexical category of the stem and the morphological component, and is given by equation 5.2:

$P(pos \mid root\_type, e_l) = \alpha_1\, p(pos \mid root\_type) + \alpha_2\, p(pos \mid e_l)$   (5.2)

where P(pos | root_type, e_l) gives the probability of the word being tagged with a particular POS tag given a particular stem type and a particular morphological ending e_l. The two factors used for this probability calculation are p(pos | root_type), the prior probability of a particular POS tag given a particular stem type, and p(pos | e_l), the prior probability of a particular POS tag given a particular morphological ending e_l. In addition, α1 and α2 are contribution factors, with α1 + α2 = 1.

In order to calculate the prior probability p(pos | e_l), we use equation 5.3:

$p(pos \mid e_l) = \dfrac{P(pos)\, p(e_l \mid pos)}{P(e_l)}$   (5.3)

In equation 5.3, P(pos) is the independent probability of the particular POS tag in the given corpus, p(e_l | pos) is the probability of the final morphological component given the particular POS tag, calculated from the tagged corpus, and P(e_l) is the independent probability of the final morphological component, calculated from the tagged corpus. The latter is calculated using equation 5.4:

$P(e_l) = \sum_{i=1}^{k} p(pos_i)\, p(e_l \mid pos_i)$   (5.4)

That is, P(e_l) is a summation over all k possible POS types of the product of p(pos_i), the probability of POS type i in the corpus, and p(e_l | pos_i), the probability of the final morpheme component given the POS tag i.
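As an illustration of equations 5.2 through 5.4, the following sketch derives the required probabilities from corpus counts. It assumes the TaggedWord records from the earlier sketch; the counting scheme and all function names are our own, and smoothing of unseen events is omitted:

```python
from collections import Counter

pos_count = Counter()        # C(pos)
ending_given_pos = Counter() # C(pos, e_l)
pos_given_stem = Counter()   # C(root_type, pos)
stem_type_count = Counter()  # C(root_type)
total = 0                    # corpus size

def train(corpus):
    """Collect counts from a list of TaggedWord records."""
    global total
    for w in corpus:
        total += 1
        pos_count[w.pos] += 1
        ending_given_pos[(w.pos, w.final)] += 1
        pos_given_stem[(w.stem_type, w.pos)] += 1
        stem_type_count[w.stem_type] += 1

def p_pos(pos):                        # P(pos)
    return pos_count[pos] / total

def p_el_given_pos(e, pos):            # p(e_l | pos)
    return ending_given_pos[(pos, e)] / pos_count[pos] if pos_count[pos] else 0.0

def p_el(e):                           # P(e_l), equation 5.4
    return sum(p_pos(t) * p_el_given_pos(e, t) for t in pos_count)

def p_pos_given_el(pos, e):            # p(pos | e_l), equation 5.3
    denom = p_el(e)
    return p_pos(pos) * p_el_given_pos(e, pos) / denom if denom else 0.0

def p_pos_given_stem(pos, st):         # p(pos | root_type)
    return pos_given_stem[(st, pos)] / stem_type_count[st] if stem_type_count[st] else 0.0

def score_single(pos, st, e, a1, a2):  # equation 5.2, with a1 + a2 = 1
    return a1 * p_pos_given_stem(pos, st) + a2 * p_pos_given_el(pos, e)
```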

In case the word whose POS tag is to be determined consists of more than one morphological component, the language model is designed by considering three factors: the lexical category of the stem and the pre-final and final morphological components. It is given by equation 5.5:

$P(pos \mid root\_type, e_{l-1}, e_l) = \alpha_1\, p(pos \mid root\_type) + \alpha_2\, p(pos \mid e_{l-1}) + \alpha_3\, p(pos \mid e_l)$   (5.5)

where P(pos | root_type, e_{l-1}, e_l) gives the probability of the word being tagged with a particular POS tag given a particular stem type and particular pre-final and final morphological components e_{l-1} and e_l. Two of the factors used for this probability calculation, p(pos | root_type), the prior probability of a particular POS tag given a particular stem type, and p(pos | e_l), the prior probability of a particular POS tag given a particular morphological component e_l, are calculated as in equations 5.2 and 5.3. The remaining factor, p(pos | e_{l-1}), is the prior probability of a particular POS tag given a particular morphological component e_{l-1}. In addition, α1, α2, and α3 are contribution factors, with α1 + α2 + α3 = 1. In order to calculate the prior probability p(pos | e_{l-1}), we use the equation:

$p(pos \mid e_{l-1}) = \dfrac{P(pos)\, p(e_{l-1} \mid pos)}{P(e_{l-1})}$   (5.6)

In equation 5.6, P(pos) is the independent probability of the particular POS tag in the given corpus, p(e_{l-1} | pos) is the probability of the pre-final morphological component given the particular POS tag, calculated from the tagged corpus, and P(e_{l-1}) is the independent probability of the pre-final morphological component, calculated from the tagged corpus. The latter is calculated using equation 5.7:

$P(e_{l-1}) = \sum_{i=1}^{k} p(pos_i)\, p(e_{l-1} \mid pos_i)$   (5.7)

That is, P(e_{l-1}) is a summation over all k possible POS types of the product of p(pos_i), the probability of POS type i in the corpus, and p(e_{l-1} | pos_i), the probability of the pre-final morpheme component given the POS tag i.
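Continuing the same sketch for the multi-morpheme case, tagging reduces to scoring every tag with equation 5.5 and taking the maximum; the pre-final counts are assumed to be gathered in the same training pass:

```python
prefinal_given_pos = Counter()  # C(pos, e_{l-1}); add in train():
                                #   prefinal_given_pos[(w.pos, w.prefinal)] += 1

def p_em1_given_pos(e, pos):    # p(e_{l-1} | pos)
    return prefinal_given_pos[(pos, e)] / pos_count[pos] if pos_count[pos] else 0.0

def p_pos_given_em1(pos, e):    # equation 5.6, with 5.7 as the denominator
    denom = sum(p_pos(t) * p_em1_given_pos(e, t) for t in pos_count)
    return p_pos(pos) * p_em1_given_pos(e, pos) / denom if denom else 0.0

def tag_word(w, a1, a2, a3):
    """Best tag under equation 5.5 (a1 + a2 + a3 = 1)."""
    def score(pos):
        return (a1 * p_pos_given_stem(pos, w.stem_type)
                + a2 * p_pos_given_em1(pos, w.prefinal)
                + a3 * p_pos_given_el(pos, w.final))
    return max(pos_count, key=score)
```

The next section describes how the contribution factors α1, α2, and α3 are estimated.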

VI. ESTIMATION OF CONTRIBUTION FACTORS

Using a generalized iterative scaling technique, the contribution factors α1, α2, and α3 are calculated so as to maximize the likelihood of the rest of the corpus. Coarse and fine steps are used for finding the parameters. For a fixed value of α1, as the α2 value is raised and the α3 value is simultaneously decreased, the correctness exhibits the characteristic of an inverted parabola in the first quadrant of the graph. This inherent property is used in designing the following algorithm. We estimate the parameters in two steps, namely a coarse step and a fine step. In the coarse step, the value of the increment and decrement is 0.1, giving a coarse optimal value of the parameters. In the fine step, the changes are in quantities of 0.01, giving a fine optimal value of the parameters.

Algorithm for estimating α1, α2 and α3:

Constraint: α1 + α2 + α3 = 1
Max = 0.0
Optimal_solution = set(0, 0, 1)
Coarse step:
    Initialize α1 = 0.0
    Step: increment α1 by 0.1
    Repeat the following step until α1 ...
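A minimal sketch of this coarse-to-fine search, assuming a held-out scorer correctness(a1, a2, a3) that returns tagging accuracy on a development portion of the corpus; the sweep bounds are our own reading of the algorithm, not taken from the paper:

```python
def estimate_alphas(correctness):
    """Two-stage grid search over a1 + a2 + a3 = 1:
    coarse 0.1 steps, then fine 0.01 steps near the coarse optimum."""
    best, best_score = (0.0, 0.0, 1.0), 0.0

    def search(c1, c2, step, width):
        nonlocal best, best_score
        a1 = max(0.0, c1 - width)
        while a1 <= min(1.0, c1 + width) + 1e-9:
            a2 = max(0.0, c2 - width)
            while a2 <= min(1.0 - a1, c2 + width) + 1e-9:
                a3 = 1.0 - a1 - a2
                trial = (round(a1, 2), round(a2, 2), round(a3, 2))
                s = correctness(*trial)
                if s > best_score:
                    best, best_score = trial, s
                a2 += step
            a1 += step

    search(0.5, 0.5, 0.1, 0.5)            # coarse sweep of the whole simplex
    search(best[0], best[1], 0.01, 0.05)  # fine sweep near the coarse optimum
    return best, best_score
```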


Fig. 2. Correctness in % for the parameter sets from Table III.

Fig. 2 shows the graphical representation of the correctness for each set of parameter values listed in Table III.

VII. EVALUATION AND RESULTS

The implemented system is evaluated as in information retrieval, which makes frequent use of precision and recall. Precision is defined as the proportion of the selected items that the system got right:

$\text{precision} = \dfrac{\text{No. of items tagged correctly}}{\text{No. of items tagged}}$   (6.1)

Recall is defined as the proportion of the target items that the system selected:

$\text{recall} = \dfrac{\text{No. of items tagged by the system}}{\text{No. of items to be tagged}}$   (6.2)

To combine precision and recall into a single measure of overall performance, the F-measure is defined as

$F = \dfrac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$   (6.3)

where P is precision, R is recall, and α is a factor that determines the relative weighting of precision and recall. A value of α = 0.5 is often chosen, giving equal weight to precision and recall. With this α value, the F-measure simplifies to

$F = \dfrac{2PR}{P + R}$   (6.4)

Perplexity is a useful metric of how well the language model matches a test corpus. Perplexity is a variant of entropy. The entropy is measured by the following equation:

$H(tag\_type) = -\dfrac{1}{n} \sum_{i=1}^{n} \log p(tag(w_i))$   (6.5)

$\text{perplexity}(tag\_type) = 2^{H(tag\_type)}$   (6.6)

The system is evaluated on a test corpus of 43,678 words, of which 36,128 are morphologically analyzed within our tag set; 6,123 words are named entities, and the remaining words are unidentified. The morphologically analyzed words are passed to the tagger designed with our language model. The following tables and figures present the results obtained by the language model for determining the tags.
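The measures above are straightforward to compute; a sketch, assuming gold and predicted tag sequences in which None marks a word the tagger left untagged, and base-2 logarithms so that perplexity is 2 raised to the entropy:

```python
import math

def evaluate(gold, pred):
    """Precision (6.1), recall (6.2) and F-measure (6.4) over tag sequences."""
    tagged = [(g, p) for g, p in zip(gold, pred) if p is not None]
    correct = sum(1 for g, p in tagged if g == p)
    precision = correct / len(tagged) if tagged else 0.0
    recall = len(tagged) / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def perplexity(tag_probs):
    """Equations 6.5 and 6.6: 2**H over the model's per-word tag probabilities."""
    h = -sum(math.log2(p) for p in tag_probs) / len(tag_probs)
    return 2 ** h
```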

TABLE IV.
NOUN CATEGORIES

Postag   Recall   Precision   F-measure   Perplexity
  -       0.98     0.99        0.99        1.91
  -       0.98     0.98        0.98        2.61
  -       1.00     1.00        1.00        1.63
  -       0.48     0.97        0.64        8.58
  -       0.99     0.87        0.93        3.29
  -       0.81     1.00        0.89        1.91
  -       0.69     0.47        0.56        9.18
  -       0.01     1.00        0.02        1.57
  -       0.00     0.00        0.00        -
  -       0.15     0.97        0.27        1.87
  -       0.60     0.71        0.65        3.95
  -       0.00     0.00        0.00        -
  -       0.81     0.96        0.88        1.63
  -       0.18     1.00        0.33        3.63

Table IV shows the noun category type POS tags. Among them are tags that belong to the noun category but whose stem is of type verb; for this reason, words of these tag types are wrongly tagged. These cases can be rectified by transformation-based learning rules.

Fig. 3. F-measure for POS tags of noun categories.

TABLE V.
VERB CATEGORIES

Postag   Recall   Precision   F-measure   Perplexity
  -       0.99     0.93        0.95        3.88
  -       1.00     1.00        1.00        1.00
  -       1.00     0.99        0.99        1.00
  -       1.00     0.90        0.95        1.00
  -       0.00     0.00        0.00        -
  -       0.93     0.92        0.92        2.74
  -       0.81     0.87        0.84        3.66
  -       0.71     0.88        0.79        3.98
  -       0.99     1.00        0.99        19.40
  -       0.98     0.78        0.87        1.81
  -       0.99     0.97        0.98        1.13
  -       0.42     0.91        0.57        3.47
  -       0.00     0.00        0.00        -

Table V shows the verb category type POS tags. The word 'anRu' in Tamil occurs in two types of context, with the meanings (i) 'that day' (DT) and (ii) 'not' (NFV). These cases have to be rectified by word sense disambiguation rules. Words of some tag types are relatively very few; by considering further morpheme components, these tag types can be identified.

Fig. 4. F-measure for POS tags of verb categories.

TABLE VI.
OTHER CATEGORIES

Postag   Recall   Precision   F-measure   Perplexity
  -       1.00     1.00        1.00        1.00
  -       1.00     1.00        1.00        1.00
  -       1.00     0.99        0.99        1.00
  -       0.98     0.99        0.99        3.57
  -       0.99     1.00        0.99        1.00
  -       1.00     0.97        0.99        1.00
  -       1.00     1.00        1.00        1.00
  -       0.96     1.00        0.98        1.00
  -       1.00     1.00        1.00        1.02
  -       1.00     1.00        1.00        1.00
  -       0.92     0.97        0.94        3.80
  -       1.00     1.00        1.00        9.22
  -       1.00     0.86        0.92        9.63
  -       1.00     1.00        1.00        1.00
  -       0.97     1.00        0.99        9.07

Table VI shows POS tags of the other category types. Categories of this type occur less frequently than the noun and verb types.

Fig. 5. F-measure for POS tags of other categories.

The same test is repeated on another corpus of 62,765 words, in which the analyzer identifies 52,889 words; 49,340 of these are correctly tagged using the suggested language model. The two tests are tabulated in Table VII.

TABLE VII.
OVERALL ACCURACY AND PERPLEXITY

      Words    Correct   Error   Accuracy (%)   Perplexity
T1    36,128   34,656    1,472   95.92          153.80
T2    36,889   34,840    2,049   94.45          153.96

As the perplexity values for both test cases are more or less equal, we can conclude that the formulated language model is independent of the type of untagged data.

VIII. CONCLUSION AND FUTURE WORK

We have presented a part-of-speech tagger for Tamil that uses a specialized language model. The input of the tagger is a string of words and a specified tag set similar to the one described; the output is a single best tag for each word. The overall accuracy of the tagger is 95.92%. The system can be improved by adding a transformation-based learning module and word sense disambiguation. The same approach can be applied to any morphologically rich natural language, and the morpheme-based language model approach can be adapted for chunking, named entity recognition, and shallow parsing.

REFERENCES

[1] Aniket Dalal, Kumar Nagaraj, Uma Sawant, and Sandeep Shelke. Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach. In: Proceedings of the NLPAI Machine Learning Contest 2006, NLPAI, 2006.
[2] Nizar Habash and Owen Rambow. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In: Proceedings of the 43rd Annual Meeting of the ACL, pages 573-580, Association for Computational Linguistics, June 2005.
[3] D. Hiemstra. Using Language Models for Information Retrieval. PhD thesis, University of Twente, 2001.
[4] S. Armstrong, G. Robert, and P. Bouillon. Building a Language Model for POS Tagging (unpublished), 1996. http://citeseer.ist.psu.edu/armstrong96building.html
[5] P. Anandan, K. Saravanan, Ranjani Parthasarathi, and T. V. Geetha. Morphological Analyzer for Tamil. In: International Conference on Natural Language Processing, 2002.
[6] Thomas Lehmann. A Grammar of Modern Tamil. Pondicherry Institute of Linguistics and Culture.
[7] Sandipan Dandapat, Sudeshna Sarkar, and Anupam Basu. A Hybrid Model for Part-of-Speech Tagging and Its Application to Bengali. In: Transactions on Engineering, Computing and Technology, vol. 6, December 2004.
[8] Barbara B. Greene and Gerald M. Rubin. Automated Grammatical Tagging of English. Department of Linguistics, Brown University, 1971.
[9] S. Klein and R. Simmons. A Computational Approach to Grammatical Coding of English Words. JACM, 10:334-337, 1963.
[10] Theologos Athanaselis, Stelios Bakamidis, and Ioannis Dologlou. Word Reordering Based on a Statistical Language Model. In: Transactions on Engineering, Computing and Technology, vol. 12, March 2006.
[11] Sankaran Baskaran. Hindi POS Tagging and Chunking. In: Proceedings of the NLPAI Machine Learning Contest, 2006.
[12] Lluís Màrquez and Lluís Padró. A Flexible POS Tagger Using an Automatically Acquired Language Model. In: Proceedings of ACL/EACL'97, 1997.
[13] K. Rajan. Corpus Analysis and Tagging for Tamil. In: Proceedings of the Symposium on Translation Support Systems, STRANS-2002, 2002.
[14] T. Brants. TnT: A Statistical Part-of-Speech Tagger. User manual, 2000.
[15] Scott M. Thede and Mary P. Harper. A Second-Order Hidden Markov Model for Part-of-Speech Tagging. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 175-182, Association for Computational Linguistics, June 1999.
[16] Eric Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543-565, 1995.
