A Morphology-based Chinese Word Segmentation Method

klmp.pku.edu.cn


Conditional random fields (CRFs) were first proposed by Lafferty et al. [16]. Because of their great flexibility in integrating a variety of interacting features of the sequence data, and because training is guaranteed to converge to the global optimum, CRFs have been successfully applied to a number of real-world sequence labeling tasks in many fields. For CWS in particular, CRF-based methods have obtained state-of-the-art performance.

3.1. Traditional CRFs-based method

In the sequence labeling approach to CWS, the Chinese character sequence is treated as the observation sequence, and each character is labeled with a tag indicating its position in a word. In the general label set {B, M, E, S}, B marks the beginning of a word, M the inside of a word, E the end of a word, and S a word consisting of a single character. Treating CWS as a sequence labeling problem, we use CRFs to segment Chinese text.

Generally speaking, a sequence labeling problem can be formalized as predicting the label sequence $Y = y_1 \ldots y_n$ given an observation sequence $X = x_1 \ldots x_n$. CRFs solve the problem by directly modeling the conditional distribution $p(Y \mid X)$, which is defined by Lafferty et al. [16] as

$$p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{c \in C} \sum_{k} \lambda_k f_k(Y_c, X, c) \right) \tag{1}$$

where $C$ is the set of cliques, $Y_c$ is the set of vertices in clique $c$, and $f_k(Y_c, X, c)$ is the feature function, which is defined as

$$f_k(Y_c, X, c) = \begin{cases} 1 & \text{if } Y_c \text{ and } X \text{ match a fixed value} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

Here $k$ is the index of the features, $\lambda_k$ is the weight of feature $k$, and $Z(X) = \sum_{Y} \exp\left( \sum_{c \in C} \sum_{k} \lambda_k f_k(Y_c, X, c) \right)$ is a normalization factor over state sequences for the sequence $X$.

In linear-chain CRFs in particular, a first-order Markov assumption is made on the hidden variables. Each clique of the linear-chain CRF model therefore consists of two neighboring vertices and the edge linking them. We write the corresponding features as $f_k(y_{i-1}, y_i, X, i)$, with $y_i \in \mathcal{Y}$ (the label set) and $1 \le i \le n$; they are used in Eq. (3).
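The {B, M, E, S} labeling and the indicator features of Eq. (2) can be made concrete with a small sketch. The example sentence, the `START` placeholder for $y_0$, and the two feature templates (`transition_feature`, `char_feature`) are illustrative assumptions, not the feature set used in the paper.

```python
from typing import Callable, List

START = "<s>"  # assumed placeholder for y_0, the label before the first character


def words_to_bmes(words: List[str]) -> List[str]:
    """Map a segmented sentence to one {B, M, E, S} tag per character."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# A feature f_k(y_prev, y_cur, X, i) returns 1 when its pattern matches, else 0.
Feature = Callable[[str, str, str, int], int]


def transition_feature(prev_tag: str, cur_tag: str) -> Feature:
    """Hypothetical template: fires on a given pair of neighboring labels."""
    def f(y_prev: str, y_cur: str, x: str, i: int) -> int:
        return int(y_prev == prev_tag and y_cur == cur_tag)
    return f


def char_feature(char: str, tag: str) -> Feature:
    """Hypothetical template: fires when character x[i] carries a given label."""
    def f(y_prev: str, y_cur: str, x: str, i: int) -> int:
        return int(x[i] == char and y_cur == tag)
    return f


def feature_count(f: Feature, chars: str, tags: List[str]) -> int:
    """Sum a feature over all positions of one labeled sequence."""
    total = 0
    for i in range(len(chars)):
        y_prev = tags[i - 1] if i > 0 else START
        total += f(y_prev, tags[i], chars, i)
    return total


words = ["我们", "爱", "语言学"]  # illustrative segmentation, not from the paper
chars = "".join(words)
tags = words_to_bmes(words)

count_be = feature_count(transition_feature("B", "E"), chars, tags)
count_ai = feature_count(char_feature("爱", "S"), chars, tags)
print(list(zip(chars, tags)), count_be, count_ai)
```

Each feature's count is multiplied by its weight $\lambda_k$ and summed inside the exponent of the model; real systems typically generate many such features from templates over a window of neighboring characters.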
The linear-chain CRF therefore has the following form:

$$p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \right) \tag{3}$$

Given training data, the parameters $\theta$ of the CRF (the weights $\lambda_k$) can be estimated by optimization methods such as Conjugate Gradient (CG), quasi-Newton methods, Improved Iterative Scaling (IIS) [17], and Generalized Iterative Scaling (GIS) [18]. More recently, limited-memory BFGS [19], [20] has shown much better performance in time and space. Given an observation sequence $X$, we can compute the value of $p(Y \mid X)$, and the label sequence can be predicted through

$$\hat{Y} = \arg\max_{Y} p(Y \mid X) \tag{4}$$

In the conventional CRF-based method for CWS, the observation is a sequence of Chinese characters, and each label indicates the position of a character in a word.

3.2. The CRFs-based method combined with morphological information

The traditional CRF-based method outperforms the dictionary-based method because it performs well on in-vocabulary words and can also identify some new words, namely out-of-vocabulary (OOV) words. However, characters alone cannot sufficiently describe the inner structure of Chinese words, and there is still much room for improvement on new words. The method is therefore still under investigation, and other information, such as morphological information, is being introduced. Following this trend, an improved method combined with morphological information is described below, and a morpheme dictionary is built for this purpose.

As described for the morphology of Chinese in Section 2, the inner structure of a Chinese word can be expressed by morphemes. Additionally, CRFs are able to combine a variety of overlapping features of the sequence data. Therefore, given the morphology labels and the CWS labels, it is natural to model this task with a factorial CRF consisting of two linear chains. Cascaded CRFs may be employed as an implementation. However, the cascaded strategy suffers from cascading errors.
Since a joint framework has obtained better performance for named entity recognition (NER), we apply it here by means of an extended label set. To reflect the dependencies between the CWS labels {B, M, E, S} and the morphological labels {n, v, a, ...}, the extended label set is generated by concatenating the two kinds of labels. The CWS labels are still assigned to characters, but they are now distinguished by the suffixes -n, -v, -a, and so on, resulting in the extended label set {B-n, B-v, M-a, ...}. An example of CWS tagging with an extended label set is illustrated in Table 2. The model thus becomes a linear-chain CRF that can be trained with the corpus built in Section 3.3. However, joint training is also costly in time and space. A parallel conditional random field implementation is employed to solve this problem [21].

Table 2: An example of CWS tagging with an extended label set for a character sequence
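The construction of the extended label set described in Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three morphological labels are an assumed subset of {n, v, a, ...}, the example characters and their morpheme categories are chosen only for illustration, and the helper names are hypothetical.

```python
from itertools import product
from typing import List

CWS_LABELS = ["B", "M", "E", "S"]
MORPH_LABELS = ["n", "v", "a"]  # assumed subset of the morphological labels


def extended_label_set(cws: List[str], morph: List[str]) -> List[str]:
    """Concatenate every CWS label with every morphological label."""
    return [f"{c}-{m}" for c, m in product(cws, morph)]


def join_labels(cws_tags: List[str], morph_tags: List[str]) -> List[str]:
    """Build one extended label per character, e.g. 'B' + 'v' -> 'B-v'."""
    if len(cws_tags) != len(morph_tags):
        raise ValueError("need one CWS tag and one morphological tag per character")
    return [f"{c}-{m}" for c, m in zip(cws_tags, morph_tags)]


def labels_to_words(chars: str, extended: List[str]) -> List[str]:
    """Recover the segmentation by reading only the position part of each label."""
    words: List[str] = []
    current = ""
    for ch, label in zip(chars, extended):
        position, _morph = label.split("-", 1)
        current += ch
        if position in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a label sequence that ends mid-word
        words.append(current)
    return words


chars = "学习语言"  # illustrative input, not taken from Table 2
extended = join_labels(["B", "E", "B", "E"], ["v", "v", "n", "n"])
label_set = extended_label_set(CWS_LABELS, MORPH_LABELS)
words = labels_to_words(chars, extended)
print(extended, words, len(label_set))
```

With a single chain over the extended labels, one decoding pass yields both the word boundaries and the morphological tags. The label set grows as the product of the two label inventories, which is one plausible reason for the training cost noted above.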


and word boundary in the same state level and complete the word segmentation and morphology tag identification complementarily. Experimental results show that the morphology information is of great use to word segmentation. Further analysis of the results shows that our method tends to perform well on out-of-vocabulary words. This is evidence that our method successfully incorporates word structure information.

References

[1] Jin Kiat Low, Hwee Tou Ng and Wenyuan Guo, "A Maximum Entropy Approach to Chinese Word Segmentation", Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp. 161-164, 2005.
[2] Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning, "A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005", Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp. 168-171, 2005.
[3] Hai Zhao, Chang-Ning Huang and Mu Li, "An Improved Chinese Word Segmentation System with Conditional Random Field", Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, pp. 162-165, 2006.
[4] Keh-Jiann Chen and Wei-Yun Ma, "Unknown Word Extraction for Chinese Documents", Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, pp. 169-175, 2002.
[5] Andi Wu, "Chinese Word Segmentation in MSR-NLP", Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, pp. 172-175, 2003.
[6] Yun Liu, Shiwen Yu and Xuefeng Zhu, "Construction of the Contemporary Chinese Compound Words Database and its Application", Modern Education Technology and Chinese Teaching to Foreigners: Proceedings of the Second International Conference on New Technologies in Teaching and Learning Chinese, Guangxi, China, pp. 273-278, 2000.
[7] Li Wang, "Problems with the boundary between words and word groups", Zhongguo Yuwen, pp. 3-8, 1953.
[8] Jerome L. Packard, The Morphology of Chinese, Cambridge University Press, Cambridge, UK, 2000.
[9] Elizabeth O. Selkirk, The Syntax of Words, MIT Press, 1982.
[10] Jerrold M. Sadock, The autolexical classification of lexemes, 1988.
[11] Jerrold M. Sadock, Autolexical Syntax, University of Chicago Press, Chicago, USA, 1991.
[12] S. Scalise, Generative Morphology, Foris, Dordrecht, Netherlands, 1984.
[13] Ting-chi Charles Tang, "The relation between word-syntax and sentence-syntax in Chinese: a case study in compound verbs", Second International Conference on Chinese Linguistic, Paris, France, pp. 23-5, 1993.
[14] Ting-chi Charles Tang, "More on the relation between word-syntax and sentence-syntax in Chinese: case study in compound nouns", Proceedings of the Sixth North American Conference on Chinese Linguistics, Los Angeles: GSIL, University of Southern California, America, pp. 195-248, 1995.
[15] Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning, "A Conditional Random Field Word Segmenter for SIGHAN Bakeoff 2005", Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing (SIGHAN), 2005.
[16] John Lafferty, Andrew McCallum and Fernando Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", ICML-2001, pp. 282-289, 2001.
[17] Adam Berger, "The IIS Algorithm: A gentle Introduction", Ann. Math. Statistics, 1997.
[18] J. Darroch and D. Ratcliff, "Generalized iterative scaling for log-linear models", Ann. Math. Statistics, pp. 1470-1480, 1972.
[19] Jorge Nocedal, "Updating quasi-Newton matrices with limited storage", Mathematics of Computation, pp. 773-782, 1980.
[20] Dong C. Liu and Jorge Nocedal, "On the limited memory BFGS method for large scale optimization", Mathematical Programming, pp. 503-528, 1989.
[21] Xiaojun Lin, Liang Zhao, Dianhai Yu and Xihong Wu, "Distributed Training for Conditional Random Fields", IEEE NLP-KE'10, to be published.
