
An Unsupervised Approach to Domain-Specific Term Extraction

scored extracted terms for subtle differences in domain specificity, we opted for a simple annotation process involving non-expert annotators. We asked three human annotators to answer “yes” or “no” when given a term and its domain, as predicted by the two methods. Note that before annotating the actual data set, we trained the human annotators in the context of a pilot annotation test. In our human verification process, we attained an accuracy of 40.59% and 36.59% for D1 and D2, respectively, with initial inter-annotator agreement of 69.61% and 73.04%, respectively. Thus, we cautiously conclude that our proposed method performs better than Park et al. (2008). Note that as the work in Park et al. (2008) was originally developed to extract domain-specific terms for use in correcting domain term errors, the authors did not discuss the performance of domain-specific term extraction in isolation.

We made a few observations during the manual verification process. Despite the strict annotation guidelines (which were further adjusted after the pilot test), the agreement actually dropped between the pilot and the final annotation (especially with one of the annotators, namely A3). We asked the individual annotators about the ease of the verification procedure and the notion of domain specificity, from their individual perspective. It turned out that although the choice of domain specificity was guided by statistical usage, word senses were involved in the decision process to some degree. Additionally, the annotators commented on the subjectivity of the statistical markedness of the terms.
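The agreement figures above are simple percentage agreements averaged over annotator pairs. A minimal sketch of that computation, with invented toy labels (the helper function and data are ours, not the paper's):

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Average percentage agreement over all annotator pairs.

    `annotations` maps an annotator id to a list of yes/no labels,
    one label per (term, domain) item shown to the annotators.
    """
    scores = []
    for a, b in combinations(annotations, 2):
        la, lb = annotations[a], annotations[b]
        scores.append(sum(x == y for x, y in zip(la, lb)) / len(la))
    return sum(scores) / len(scores)

# Toy labels for three annotators (A3 diverging, as observed in the paper).
labels = {
    "A1": ["yes", "yes", "no", "yes", "no"],
    "A2": ["yes", "yes", "no", "no", "no"],
    "A3": ["no", "yes", "yes", "no", "no"],
}
print(round(pairwise_agreement(labels), 3))  # → 0.6
```

A chance-corrected statistic such as Cohen's kappa would discount agreement expected by chance; the paper reports raw percentages, so the sketch does the same.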
The average pairwise correlations between annotators are .738 and .673 for D1 and D2, respectively.

3 Applying Domain-Specific Terms

In this section, we evaluate the utility of the collected domain-specific terms via two tasks: text categorization and keyphrase extraction. Our motivation in selecting these tasks is that domain-specific terms should offer a better representation of document topic than general terms.

3.1 Text Categorization

Automatic text categorization is the task of classifying documents into a set of predefined categories.

Type        F1 TF   F1 TF-IDF   F3 TF   F3 TF-IDF
Baseline    .473    .660        .477    .677
Domain      .536    .587        –       –
Combined    .583    .681        .579    .681

Table 2: Performance of text categorization

To build a text categorization model, we first preprocess the documents, perform part-of-speech (POS) tagging using a probabilistic POS tagger,2 and lemmatization using morpha (Minnen et al., 2001). We then build an SVM-based classifier (Joachims, 1998).3 We use TF-IDF for feature weighting, and all unigram terms. Note that when domain-specific terms are multiword noun phrases (NPs), we break them down into unigrams based on the findings of Hulth and Megyesi (2006). As baselines, we built systems using unigram terms which occur above a threshold frequency (i.e. frequency ≥ 1, 2 and 3, or F1, F2 and F3 in Table 2) after removing stop words. Table 2 shows the micro-averaged F-scores for the text categorization task. Note that since using F2 results in the lowest performance, we report only results over thresholds F1 and F3.

Table 2 shows that domain-specific terms alone do not perform well, since only a relatively small number of domain-specific indexing terms is extracted, compared to the number of unigram terms. However, when combined with a unigram model, they help the unigram models to improve the overall performance.
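The Combined setting can be sketched as plain TF-IDF unigram vectors augmented with unigrams drawn from the extracted domain-specific terms. This is a simplified illustration (no POS tagging, lemmatization or stop-word removal), and the function names are ours, not the paper's:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF over whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency of each token
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def add_domain_features(vectors, domain_terms):
    """Break multiword domain terms into unigrams (after Hulth and
    Megyesi, 2006) and duplicate matching tokens under a dedicated
    feature prefix, so the classifier can weight them separately."""
    unigrams = {u for term in domain_terms for u in term.lower().split()}
    for vec in vectors:
        for t in list(vec):
            if t in unigrams:
                vec["DOMAIN=" + t] = vec[t]
    return vectors

docs = ["the kernel schedules processes", "the recipe uses fresh basil"]
vecs = add_domain_features(tfidf_vectors(docs), ["kernel", "process scheduler"])
print(sorted(t for t in vecs[0] if t.startswith("DOMAIN=")))  # → ['DOMAIN=kernel']
```

Duplicating the token under a `DOMAIN=` prefix is one plausible way to combine the two feature sets; the paper does not spell out the exact combination mechanism.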
Despite showing only a small improvement, given the relatively small number of domain-specific terms extracted by our method, we confirm that domain-specific terms are useful for categorizing (monolingual) texts, just as domain specificity has been shown to help in cross-lingual text categorization (Rigutini et al., 2005).

3.2 Automatic Keyphrase Extraction

Keyphrases are simplex nouns or NPs that represent the key ideas of a document. They can serve as a representative summary of the document and also as high-quality index terms. In the past, various attempts have been made

2 Lingua::EN::Tagger
3 http://svmlight.joachims.org/svm_multiclass.html


to boost automatic keyphrase extraction performance, based primarily on statistics (Frank et al., 1999; Witten et al., 1999) and a rich set of heuristic features (Nguyen and Kan, 2007).

Type           L    Boolean   TF     TF-IDF
KEA            NB   .200      –      –
               ME   .249      –      –
KEA + Domain   NB   .204      .200   .197
               ME   .260      .261   .267

Table 3: Performance of keyphrase extraction

To collect the gold-standard keyphrases, we hired two human annotators to manually assign keyphrases to 210 test articles in the same 23 selected domains. In summary, we collected a total of 1,339 keyphrases, containing 911 simplex keyphrases and 428 NPs. We checked the keyphrases found after applying the candidate selection method employed from Nguyen and Kan (2007). The final number of keyphrases found in our data was only 750 (56.01% of all annotated keyphrases), among which 158 (21.07%) were NPs.

To build a keyphrase extractor, we first preprocessed the documents with a POS tagger and lemmatizer, and applied the candidate selection method of Nguyen and Kan (2007) to extract candidates. Then, we adopted two features from KEA (Frank et al., 1999; Witten et al., 1999), as well as the domain-specific terms collected by our method. KEA uses two commonly used features: TF-IDF for document cohesion, and distance to model the locality of keyphrases. Finally, we used the features to build a maxent classifier4 and a Naïve Bayes (NB) model. To represent the domain specificity of the keyphrase candidates, we simply presented the 23 domains as three separate sets of features with differing values (Boolean, TF and TF-IDF), set when a given keyphrase candidate is indeed a domain-specific term. Finally, with KEA as a baseline, we compared the systems over the top-7 candidates using the current standard evaluation method (i.e. the exact matching scheme).
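The candidate representation just described, KEA's TF-IDF and distance features plus a domain-specificity feature under Boolean, TF or TF-IDF weighting, might be sketched as follows. The paper uses one feature per domain (23 in all); we collapse this to a single indicator for brevity, and the function name is ours:

```python
def candidate_features(candidate, doc_tokens, tfidf_score,
                       domain_terms, mode="boolean"):
    """KEA-style features (TF-IDF, first-occurrence distance) plus a
    domain-specificity feature under one of three weighting schemes."""
    head = candidate.split()[0]
    # Distance: relative position of the candidate's first occurrence,
    # modelling the locality of keyphrases.
    distance = doc_tokens.index(head) / len(doc_tokens)
    feats = {"tfidf": tfidf_score, "distance": distance}
    if candidate in domain_terms:
        if mode == "boolean":
            feats["domain"] = 1.0
        elif mode == "tf":
            feats["domain"] = float(doc_tokens.count(head))
        else:  # "tfidf"
            feats["domain"] = tfidf_score
    return feats

tokens = "the scheduler assigns each process a time slice".split()
print(candidate_features("scheduler", tokens, 0.42, {"scheduler"}))
# → {'tfidf': 0.42, 'distance': 0.125, 'domain': 1.0}
```

Feature dictionaries of this shape can be fed directly to a maxent or Naïve Bayes learner after vectorization.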
Table 3 shows the micro-averaged F-scores. In the results, we first notice that our test system outperformed KEA with ME, but that with NB only the Boolean variant of our test system produced better performance than KEA. The maximum improvement in F-score is about 1.8%, in the best configuration, where TF-IDF weighting is used in conjunction with an ME learner. This is particularly notable because: (a) the average performance of current keyphrase extraction systems is a little more than 3 matching keyphrases over the top 15 candidates, but we produce only 7 candidates; and (b) the candidate selection method we employed (Nguyen and Kan, 2007) found only 56.01% of keyphrases as candidates. Finally, we note with cautious optimism that domain-specific terms can help in keyphrase extraction, since keyphrases are similar in the same or similar domains.

4 http://maxent.sourceforge.net/index.html

4 Conclusions

In this work, we have presented an unsupervised method which automatically extracts domain-specific terms based on term and document statistics, using a simple adaptation of TF-IDF. We compared our method with the benchmark method of Park et al. (2008) using human judgements. Although our method did not extract a large number of domain-specific terms, the quality of the terms is high and they are well distributed over all domains. In addition, we have confirmed the utility of domain-specific terms in both text categorization and keyphrase extraction tasks. We empirically verified that domain-specific terms are indeed useful in keyphrase extraction, and to a lesser degree, text categorization. Although we could not conclusively prove the higher utility of these terms, there is a strong indicator that they are useful and deserve further analysis.
Additionally, given the small number of domain-specific terms we extracted and used, we conclude that they are useful for text categorization.

Acknowledgements

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

K.V. Chandrinos, I. Androutsopoulos, G. Paliouras and C.D. Spyropoulos. Automatic Web Rating: Filtering obscene content on the Web. In Proceedings of ECDL. 2000, pp. 403–406.

P. Drouin. Detection of Domain Specific Terminology Using Corpora Comparison. In Proceedings of LREC. 2004, pp. 79–82.

G. Escudero, L. Marquez and G. Rigau. Boosting applied to word sense disambiguation. In Proceedings of 11th ECML. 2000, pp. 129–141.

E. Frank, G.W. Paynter, I. Witten, C. Gutwin and C.G. Nevill-Manning. Domain-Specific Keyphrase Extraction. In Proceedings of 16th IJCAI. 1999, pp. 668–673.

A. Hulth and B. Megyesi. A Study on Automatically Extracted Keywords in Text Categorization. In Proceedings of COLING/ACL. 2006, pp. 537–544.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML. 1998, pp. 137–142.

M. Kida, M. Tonoike, T. Utsuro and S. Sato. Domain Classification of Technical Terms Using the Web. Systems and Computers. 2007, 38(14), pp. 2470–2482.

P. Lafon. Sur la variabilité de la fréquence des formes dans un corpus. In MOTS. 1980, pp. 128–165.

B. Magnini, C. Strapparava, G. Pezzulo and A. Gliozzo. The role of domain information in word sense disambiguation. NLE. 2002, 8(4), pp. 359–373.

D. Milne, O. Medelyan and I.H. Witten. Mining Domain-Specific Thesauri from Wikipedia: A case study. In Proceedings of the International Conference on Web Intelligence. 2006.

G. Minnen, J. Carroll and D. Pearce. Applied morphological processing of English. NLE. 2001, 7(3), pp. 207–223.

T.D. Nguyen and M.Y. Kan. Keyphrase Extraction in Scientific Publications. In Proceedings of ICADL. 2007, pp. 317–326.

Y. Park, S. Patwardhan, K. Visweswariah and S.C. Gates. An Empirical Analysis of Word Error Rate and Keyword Error Rate. In Proceedings of ICSLP. 2008.

L. Rigutini, B. Liu and M. Maggini. An EM based training algorithm for cross-language text categorization. In Proceedings of WI. 2005, pp. 529–535.

L. Rigutini, E. Di Iorio, M. Ernandes and M. Maggini. Automatic term categorization by extracting knowledge from the Web. In Proceedings of 17th ECAI. 2006.

G. Salton, A. Wong and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM. 1975, 18(11), pp. 613–620.

I. Witten, G. Paynter, E. Frank, C. Gutwin and G. Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL. 1999, pp. 254–256.
