
Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines

Tetsuji Nakagawa, Taku Kudoh and Yuji Matsumoto
Graduate School of Information Science,
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0101, Japan
{tetsu-na,taku-ku,matsu}@is.aist-nara.ac.jp

Abstract

The accuracy of part-of-speech (POS) tagging for unknown words is substantially lower than that for known words. Considering the high accuracy of up-to-date statistical POS taggers, unknown words account for a non-negligible portion of the errors. This paper describes POS prediction for unknown words using Support Vector Machines. We achieve high accuracy in POS tag prediction using substrings and the surrounding context as features. Furthermore, we integrate this method into a practical English POS tagger and achieve an accuracy of 97.1%, higher than conventional approaches.

1 Introduction

Part-of-speech (POS) tagging is fundamental in natural language processing. Many statistical POS taggers use text manually annotated with POS tags as training data, from which they obtain the statistical information or rules needed to perform tagging. However, in part-of-speech tagging we frequently encounter words that do not exist in the training data. Such unknown words are usually handled by exceptional processing, because the statistical information or rules for those words are unavailable. Many machine learning methods have been applied to part-of-speech tagging, such as the hidden Markov model (HMM) (Charniak et al., 1993), the transformation-based error-driven system (Brill, 1995), the decision tree (Schmid, 1994) and the maximum entropy model (Ratnaparkhi, 1996). Though these methods perform well overall, their accuracy for unknown words is much lower than that for known words, and this is a non-negligible problem when training data is limited (Brants, 2000).

One known approach to unknown word guessing is to use suffixes or the surrounding context of the unknown word (Thede, 1998). Weischedel estimated the conditional probability of an unknown word w given a tag t using the ending form of w and the existence of hyphenation and capitalization (Weischedel et al., 1993):

    p(w|t) = p(unknown word|t) · p(capital|t) · p(ending/hyphenation|t)

Although this method has the merit of handling unknown words within the framework of probability theory, ending forms like "-ed" and "-ion" are selected heuristically, so applying the method to other languages is not straightforward. Brants used linear interpolation of fixed-length suffix models for unknown word handling in his part-of-speech tagger TnT (Brants, 2000). This method achieves relatively high accuracy and is reported to be effective for other languages (Džeroski et al., 2000).
Cucerzan and Yarowsky proposed paradigmatic similarity measures and showed good results for highly inflectional languages using a large amount of unannotated text (Cucerzan and Yarowsky, 2000). Other methods for unknown word guessing have also been studied, such as the rule-based method (Mikheev, 1997) and the decision tree-based method (Orphanos and Christodoulakis, 1999).

In this paper, we propose a method to predict the POS tags of unknown English words as a post-processing step of POS tagging, using Support Vector Machines (SVMs). SVMs (Cortes and Vapnik, 1995; Vapnik, 1999) are a supervised machine learning algorithm for binary classification, known to have good generalization performance. SVMs can handle a large number of features and hardly overfit; consequently, they have been applied successfully to natural language processing tasks (Joachims, 1998; Kudoh and Matsumoto, 2000). In this paper, we show how to apply SVMs to more general POS tagging as well as to unknown word guessing, and report some experimental results.

The paper is organized as follows: we start by presenting Support Vector Machines in the next section. We then describe our methods for unknown word guessing and POS tagging in sections 3 and 4. In section 5, we describe the results of some experiments.

2 Support Vector Machines

Support Vector Machines (SVMs) are a supervised machine learning algorithm for binary classification on a feature vector space x ∈ R^L. Consider a hyperplane

    w · x + b = 0,   w ∈ R^L,  b ∈ R,                    (1)

and suppose it separates the training data {(x_i, y_i) | x_i ∈ R^L, y_i ∈ {±1}, 1 ≤ i ≤ l} into two classes such that

    y_i · (w · x_i + b) ≥ 1.                             (2)

While several such separating hyperplanes may exist (Figure 1, left), SVMs find the optimal hyperplane, the one that maximizes the margin (the distance between the hyperplane and the nearest points) (Figure 1, right). Such a hyperplane is known to have the minimum expected test error, and it can be found by quadratic programming. Given a test example x, its label y is decided by the sign of the discriminant function f(x):

    f(x) = w · x + b,                                    (3)
    y = sgn(f(x)).                                       (4)

[Figure 1: Maximize the Margin]

For linearly non-separable cases, feature vectors are mapped into a higher dimensional space by a nonlinear function Φ(x) and linearly separated there. In the SVM formulation, all data points appear only in the form of inner products, so we need only the inner product of two points in the higher dimensional space. Those values can be computed in R^L, without mapping to the higher dimensional space, by a function K(x_i, x_j) called a kernel function:

    Φ(x_i) · Φ(x_j) = K(x_i, x_j).                       (5)

In this paper, we use the polynomial kernel function

    K(x_i, x_j) = (x_i · x_j + 1)^d.                     (6)

This function virtually maps the original input space into a higher dimensional space in which all combinations of up to d features are taken into consideration.

Since SVMs are binary classifiers, we must extend them to multi-class classifiers to predict k > 2 POS tags. Among the several methods of multi-class classification for SVMs (Weston and Watkins, 1999), we employ the one-versus-rest approach. In training, k classifiers f_i(x) (1 ≤ i ≤ k) are created, each separating class i from all other classes:

    f_i(x) ≥ +1   if x belongs to class i,
    f_i(x) ≤ −1   otherwise.                             (7)

Given a test example x, its class c is determined by the classifier that gives the largest discriminant function value:

    c = argmax_i f_i(x).                                 (8)
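The setup above can be illustrated with a short sketch. This is only an illustration under the assumption that scikit-learn is available; it is not the implementation used in this paper, and the toy vectors and labels are invented:

# Sketch of one-versus-rest SVMs with the polynomial kernel of equation (6).
# Assumption: scikit-learn; the data below are toy values for illustration.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy binary feature vectors (rows) and POS labels.
X = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
y = np.array(["NN", "VB", "NN", "JJ"])

# K(x_i, x_j) = (x_i . x_j + 1)^d with d = 2: gamma=1 and coef0=1 give
# exactly the form of equation (6).
clf = OneVsRestClassifier(SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0))
clf.fit(X, y)

# decision_function returns one f_i(x) per class; the predicted tag is the
# argmax over classes, as in equation (8).
scores = clf.decision_function(np.array([[1, 0, 1, 1]]))
print(clf.classes_[np.argmax(scores)])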


3 Predicting POS Tags of Unknown Words

In order to predict the POS tag of an unknown word, the following features are used:

(1) POS context: the POS tags of the two words on each side of the unknown word.

(2) Word context: the lexical forms of the two words on each side of the unknown word.

(3) Substrings: prefixes and suffixes of up to four characters of the unknown word, and the existence of numerals, capital letters and hyphens in the unknown word.

For the example sentence

    ... she/PRP returned/VBD to/TO Greenville/(Unknown Word) two/CD days/NNS before/IN ...

the features for the unknown word "Greenville" are shown in Table 1, where "^" and "$" mark the beginning and the end of the word, respectively.

Table 1: Example of Features for Unknown Word Guessing

  POS context    t-1 = TO,  t-2 = VBD,  t+1 = CD,  t+2 = NNS
  Word context   w-1 = to,  w-2 = returned,  w+1 = two,  w+2 = days
  Substrings     ^g, ^gr, ^gre, ^gree, e$, le$, lle$, ille$, 〈Capital〉

These features are almost the same as those used by Ratnaparkhi (Ratnaparkhi, 1996), but combinations of POS tags are not included as explicit features, because the polynomial kernel considers such combinations automatically.

SVM classifiers are created for each POS tag using all words in the training data. The POS tags of unknown words are then predicted using those classifiers.
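As an illustration of this feature template (a sketch only; the feature-name strings and the helper function are our own choices, not the paper's code), the features for the word at position i of a partially tagged sentence could be collected as follows:

# Sketch of the section 3 feature template. The naming conventions here
# ("t-1=...", "prefix=...", "<Capital>", etc.) are illustrative choices.
def unknown_word_features(words, tags, i):
    """Features for the (unknown) word at position i of a tagged sentence."""
    w = words[i]
    feats = []
    # (1) POS context and (2) word context: two words on each side.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(words):
            feats.append("t%+d=%s" % (offset, tags[j]))
            feats.append("w%+d=%s" % (offset, words[j].lower()))
    # (3) Substrings: prefixes and suffixes of up to four characters,
    # and the existence of numerals, capital letters and hyphens.
    for k in range(1, min(4, len(w)) + 1):
        feats.append("prefix=" + w[:k])
        feats.append("suffix=" + w[-k:])
    if any(c.isdigit() for c in w):
        feats.append("<Numeral>")
    if any(c.isupper() for c in w):
        feats.append("<Capital>")
    if "-" in w:
        feats.append("<Hyphen>")
    return feats

words = ["she", "returned", "to", "Greenville", "two", "days", "before"]
tags  = ["PRP", "VBD", "TO", None, "CD", "NNS", "IN"]
print(unknown_word_features(words, tags, 3))  # reproduces the Table 1 example

In a real system, these feature strings would be mapped to dimensions of a sparse binary vector before being passed to the SVMs.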
4 Part-of-Speech Tagging

In unknown word guessing, the POS tag of an unknown word is predicted from the POS context, the word context and the substrings. This method can be extended to more general POS tagging by predicting the POS tags of all words in a given sentence. Unlike unknown word guessing performed as post-processing of POS tagging, the POS tags of the succeeding words are usually not yet known during POS tagging. Therefore, two methods for this task are tested, as described in the following subsections.

4.1 Using Only the Preceding POS Tags

The first method uses only the POS tags of the preceding words. The features are (1) the POS tags of the two preceding words, (2) the two preceding and two succeeding words, and (3) prefixes and suffixes of up to four characters, together with the existence of numerals, capital letters and hyphens.

In probabilistic models such as HMMs, the generative probabilities of all tag sequences are considered and the most likely path is selected by the Viterbi algorithm. Since SVMs do not produce probabilities, we instead employ a deterministic method that predicts the POS tag of each word in turn from the features above. This method has the merit of a small computational cost, but the demerit of not using information from the succeeding POS tags.

A tag dictionary, which provides the list of possible POS tags for each known word (i.e., a word that appeared in the training data), is used. Such a dictionary was also used by Ratnaparkhi to reduce the number of candidate POS tags for known words (Ratnaparkhi, 1996). For unknown words, all POS tags are taken as candidates. This method requires no exceptional processing to handle unknown words.
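The left-to-right pass can be sketched as follows. This is a schematic reading of the description above, not the paper's code: extract_features, vectorize, classifiers and tag_dict are hypothetical names for components assumed to exist.

# Schematic of the deterministic left-to-right tagging of section 4.1.
# Assumptions: `extract_features(words, tags, i)` builds feature strings for
# position i using only already-predicted preceding tags, `vectorize` maps
# them to a vector, and `classifiers` maps each POS tag to a trained binary
# SVM exposing decision_function. All of these names are hypothetical.
def tag_sentence(words, classifiers, tag_dict, extract_features, vectorize):
    tags = []
    for i, w in enumerate(words):
        x = vectorize(extract_features(words, tags, i))
        # Known words are restricted to the tags listed in the tag
        # dictionary; unknown words may take any tag.
        candidates = tag_dict.get(w, list(classifiers.keys()))
        # Deterministically pick the candidate whose one-versus-rest
        # classifier gives the largest decision value.
        best = max(candidates,
                   key=lambda t: classifiers[t].decision_function([x])[0])
        tags.append(best)
    return tags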


4.2 Using the Preceding and Succeeding POS Tags

The second method uses the POS tag information on both sides of the word being tagged, with the same features as shown in Table 1. In general, the POS tags of the succeeding words are not known during tagging. Roth and Zelenko addressed POS tagging with no unknown words and used a dictionary giving the most frequent POS tag for each word, which they used as the tag of the succeeding words (Roth and Zelenko, 1998); they report that incorrectly attached POS tags cause a decrease in accuracy of about 2%. We use a similar two-pass method, but without such a dictionary. In the first pass, all POS tags are predicted without using the POS tag information of the succeeding words (i.e., using the same features as in section 4.1). In the second pass, POS tagging is performed using the POS tags predicted in the first pass as the succeeding context (i.e., using the same features as in section 3). This method has the advantage of handling known and unknown words in the same way.

The tag dictionary is also used in this method.
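The two-pass procedure can be sketched on top of the hypothetical tag_sentence function given after section 4.1. Again this is only one reading of the description, with invented names; the text does not spell out whether the second pass should take its preceding context from its own predictions or from the first pass, and the sketch uses the tags already predicted in the second pass for the preceding context:

# Schematic of the two-pass tagging of section 4.2 (hypothetical names).
def tag_sentence_two_pass(words, classifiers, tag_dict,
                          features_pass1, features_pass2, vectorize):
    # Pass 1: same features as section 4.1 (no succeeding POS tags).
    first_pass = tag_sentence(words, classifiers, tag_dict,
                              features_pass1, vectorize)
    # Pass 2: same features as section 3, with succeeding tags taken from
    # the first pass and preceding tags from the current (second) pass.
    tags = []
    for i, w in enumerate(words):
        context = tags + first_pass[len(tags):]
        x = vectorize(features_pass2(words, context, i))
        candidates = tag_dict.get(w, list(classifiers.keys()))
        best = max(candidates,
                   key=lambda t: classifiers[t].decision_function([x])[0])
        tags.append(best)
    return tags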
5 Evaluation

Experiments on unknown word guessing and POS tagging were performed using the Penn Treebank WSJ corpus, which has 50 POS tags. Four training data sets were constructed by randomly selecting approximately 1,000, 10,000, 100,000 and 1,000,000 tokens.

The test data for unknown word guessing consists of words that do not appear in the training data. The POS tags on both sides of each unknown word were assigned by TnT.

The test data for POS tagging consists of about 285,000 tokens disjoint from the training data. The numbers of known and unknown words and the percentage of unknown words in the test data are shown in Table 2.

Table 2: Test Data for POS Tagging

  Training Tokens   Known Words / Unknown Words   Percentage of Unknown Words
        1,000           153,492 / 131,316                   46.1%
       10,000           218,197 /  66,611                   23.4%
      100,000           261,786 /  23,022                    8.1%
    1,000,000           278,535 /   6,273                    2.2%

The accuracies are compared with TnT, which is based on a second-order Markov model.

5.1 Unknown Word Guessing

The accuracy of unknown word guessing is shown in Table 3, together with the degree d of the polynomial kernel used in the experiments. Our method has higher accuracy than TnT for every training data set.

Accuracies under various settings are shown in Table 4. The first column shows the case where the correct POS context on both sides of the word being guessed is given. In the second to fourth columns, individual features are deleted so as to see their contribution to the accuracy. The decrease in accuracy caused by errors in the POS tagging by TnT is about 1%. Information from substrings (prefixes, suffixes and the existence of numerals, capital letters and hyphens) plays the most important role in predicting POS tags. On the other hand, the words themselves contribute much less, while the POS context makes a moderate contribution to the final accuracy.

In general, features that rarely appear in the training data are statistically unreliable and often decrease the performance of the system. Ratnaparkhi ignored features that appeared fewer than 10 times in the training data (Ratnaparkhi, 1996). We examined the behavior of our method when such sparse features are removed. Table 5 shows the results for 10,000 training tokens. Ignoring the features that appeared only once improves the accuracy slightly.
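Such a cutoff is straightforward to apply before training. A minimal sketch, with invented toy data:

# Sketch of a frequency cutoff on features: drop every feature that occurs
# no more than `cutoff` times in the training data.
from collections import Counter

def apply_cutoff(feature_lists, cutoff):
    counts = Counter(f for feats in feature_lists for f in feats)
    kept = {f for f, c in counts.items() if c > cutoff}
    return [[f for f in feats if f in kept] for feats in feature_lists]

train_features = [["suffix=ed", "<Capital>"], ["suffix=ed"], ["suffix=xyz"]]
print(apply_cutoff(train_features, cutoff=1))
# -> [['suffix=ed'], ['suffix=ed'], []]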


Table 3: Performance of Unknown Word Guessing vs. Size of Training Tokens

  Training Tokens   SVMs    d   TnT
        1,000       69.8%   1   69.4%
       10,000       82.3%   2   81.5%
      100,000       86.7%   2   83.3%
    1,000,000       87.1%   2   84.2%

Table 4: Performance of Unknown Word Guessing for Different Sets of Features

  Training Tokens   Correct POS   No POS   No Word   No Substrings
        1,000          71.1%       64.2%    68.7%       33.7%
       10,000          82.9%       75.8%    80.2%       37.1%
      100,000          87.5%       80.1%    85.2%       33.8%
    1,000,000          88.5%       80.8%    86.0%       30.0%

Table 5: Performance of Unknown Word Guessing along Reduction of Features
(features occurring no more than "cutoff" times are not taken into consideration)

  Cutoff                  0        1       10      100
  Accuracy              82.3%    82.5%    81.7%   74.8%
  Number of Features   18,936    7,683    1,314     201

Table 6: Performance of Unknown Word Guessing for Different Kernel Functions

  Training Tokens   Degree of Polynomial Kernel
                       1       2       3       4
        1,000        69.8%   69.8%   61.2%   34.8%
       10,000        82.0%   82.3%   80.3%   79.5%
      100,000        85.0%   86.7%   86.0%   84.3%
    1,000,000          –     87.1%     –       –

Table 7: Performance of POS Tagging (for Known Words / for Unknown Words)

  Training Tokens   SVMs, Preceding POS   d   SVMs, Preceding & Succeeding POS   d   TnT
        1,000       83.4% (96.3/68.3)     1   83.9% (96.4/69.3)                  1   83.8% (96.0/69.4)
       10,000       92.1% (95.5/81.2)     2   92.5% (95.7/82.2)                  2   92.3% (95.7/81.5)
      100,000       95.6% (96.5/85.7)     2   95.9% (96.7/86.7)                  2   95.4% (96.4/83.3)
    1,000,000       97.0% (97.3/86.3)     2   97.1% (97.3/86.9)                  2   96.6% (96.9/84.2)

Table 8: Performance of POS Tagging for Different Sets of Features (for Known Words / for Unknown Words)

  Training Tokens   No POS              No Word             No Substrings
        1,000       81.3% (96.0/64.2)   83.2% (96.2/68.0)   65.2% (96.2/29.0)
       10,000       90.3% (94.7/75.8)   91.9% (95.5/79.9)   80.7% (94.8/34.4)
      100,000       93.9% (95.1/80.1)   95.4% (96.4/85.0)   82.4% (87.0/30.7)
    1,000,000       95.0% (95.3/80.8)   96.7% (96.9/85.7)   74.4% (75.6/24.1)


However, even if a large number of features are used without a cutoff, SVMs hardly overfit and keep good generalization performance.

The performance for different degrees of the polynomial kernel is shown in Table 6. The best degree seems to be 1 or 2 for this task, and it tends to increase as the amount of training data grows.

5.2 Part-of-Speech Tagging

The accuracies of POS tagging are shown in Table 7. The table shows two cases: one refers only to the preceding POS tags, and the other refers to both the preceding and succeeding POS tags with the two-pass method. The results show that the performance is comparable to TnT in the first case and better in the second case. Between the two cases, the accuracies for known words are almost equal, but the accuracy for unknown words in the first case is lower than in the second case.

Accuracies measured by deleting each feature are shown in Table 8. The contribution of each feature shows the same tendency as in the unknown word guessing experiments of section 5.1. The biggest difference in features between our method and TnT is the use of word context. Although using many features such as word context is difficult in a Markov model, it is easy in SVMs, as seen in section 5.1. For small training data, the accuracies without the word context are lower than those of TnT. This suggests that one reason for the better performance of our method is the use of word context.

6 Conclusion and Future Work

In this paper, we applied SVMs to unknown word guessing and showed that they perform quite well using context and substring information. Furthermore, by extending the method to POS tagging, the resulting tagger achieves higher accuracy than a state-of-the-art HMM-based tagger. Compared with other machine learning algorithms, SVMs have the advantage of considering combinations of features automatically through the kernel function, and they seldom overfit with a large set of features. Our methods do not depend on particular characteristics of English, and can therefore be applied to other languages such as German and French. However, for languages like Japanese and Chinese it is difficult to apply our methods straightforwardly, because words are not separated by spaces in those languages.

One problem with our methods is computational cost. It took about 16.5 hours for training with 100,000 tokens and 4 hours for testing with 285,000 tokens in POS tagging using the POS tags on both sides, on an Alpha 21164A 500MHz processor. Although SVMs have good properties and performance, their computational cost is large. It is difficult to train on a large amount of training data, and testing time increases in more complex models.

Another point to be improved is the search algorithm for POS tagging. In this paper, a deterministic method is used as the search algorithm. This method does not consider the overall likelihood of a whole sentence and uses only local information, compared to probabilistic models. The accuracy may be improved by incorporating some beam search scheme. Furthermore, our method outputs only the best answer and cannot output the second or third best answers. There is a way to interpret the outputs of SVMs as probabilities (Platt, 1999), which may be applied directly to remedy this problem.
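As an aside on the last point, Platt's method is available in common SVM toolkits; for instance, under the assumption that scikit-learn is used, setting probability=True fits a sigmoid to the decision values (following Platt, 1999) and yields class probabilities rather than a bare sign. A sketch with toy data:

# Sketch of Platt-style probability outputs for an SVM (assumption:
# scikit-learn; toy one-dimensional data for illustration).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4],
              [0.6], [0.7], [0.8], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, probability=True)
clf.fit(X, y)
print(clf.predict_proba(np.array([[0.6]])))  # class probabilities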
References

T. Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied Natural Language Processing Conference (ANLP-2000), pages 224–231.

E. Brill. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4), pages 543–565.


E. Charniak, C. Hendrickson, N. Jacobson and M. Perkowitz. 1993. Equations for Part-of-Speech Tagging. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), pages 784–789.

C. Cortes and V. Vapnik. 1995. Support Vector Networks. Machine Learning, 20, pages 273–297.

S. Cucerzan and D. Yarowsky. 2000. Language Independent, Minimally Supervised Induction of Lexical Probabilities. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 270–277.

S. Džeroski, T. Erjavec and J. Zavrel. 2000. Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), pages 1099–1104.

T. Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning (ECML-98), pages 137–142.

T. Kudoh and Y. Matsumoto. 2000. Use of Support Vector Learning for Chunk Identification. In Proceedings of the Fourth Conference on Computational Natural Language Learning (CoNLL-2000), pages 142–144.

A. Mikheev. 1997. Automatic Rule Induction for Unknown-Word Guessing. Computational Linguistics, 23(3), pages 405–423.

G. Orphanos and D. Christodoulakis. 1999. POS Disambiguation and Unknown Word Guessing with Decision Trees. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), pages 134–141.

A. Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-1), pages 133–142.

D. Roth and D. Zelenko. 1998. Part of Speech Tagging Using a Network of Linear Separators. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (ACL/COLING-98), pages 1136–1142.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP-1), pages 44–49.

S. Thede. 1998. Predicting Part-of-Speech Information about Unknown Words using Statistical Methods. In Proceedings of the joint 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (ACL/COLING-98), pages 1505–1507.

V. Vapnik. 1999. The Nature of Statistical Learning Theory. Springer.

R. Weischedel, M. Meteer, R. Schwartz, L. Ramshaw and J. Palmucci. 1993. Coping with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics, 19(2), pages 359–382.
J. Weston and C. Watkins. 1999. Support Vector Machines for Multi-Class Pattern Recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN-99).

J. Platt. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers. MIT Press.
