Tag-LDA for Scalable Real-time Tag Recommendation

X. Si et al. / Journal of Information & Computational Science 6:1 (2009) 23-31

Text Categorization: Tag recommendation can be seen as a text categorization task, with each tag as a category and all documents carrying that tag as its training samples. Text classification is a fundamental task in text mining [6]. Current approaches using supervised classifiers such as SVM or k-NN with word features achieve reasonably good performance [7][8]. Unbalanced training samples across categories can greatly hurt the performance of text categorization [21], and such imbalance is common for tags, since tag frequencies follow an exponential distribution. Consequently, text categorization can only recommend a small number of tags, and requires training many classifiers [9].

Collaborative Filtering: Collaborative filtering (CF) [10] is a commonly used method in recommender systems. To find items a user may like, CF finds users similar to that user, looks at the items they have chosen before, and recommends those items. Using CF for tag recommendation is a straightforward approach and has been shown to work well in collaborative tagging scenarios, where multiple users can label the same entity [11][12]. When collaborative tagging is not available, as in tag recommendation for blog posts, tags used by similar posts can be recommended instead [13][14], with similar posts retrieved from an inverted index of the post archive. Searching a large index can lead to long response times; our method involves no disk-based searches, so client-side responsiveness can be guaranteed.
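As a rough sketch of the search-based CF baseline described above: find the posts most similar to a new post and vote with their tags. The toy data, similarity measure, and function names below are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse word-count dicts."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cf_recommend(query_words, archive, k_neighbors=2, n_tags=3):
    """Recommend tags for a new post: rank archived posts by
    similarity, then vote with the tags of the top neighbors."""
    sims = sorted(archive, key=lambda p: cosine(query_words, p["words"]),
                  reverse=True)
    votes = Counter(t for p in sims[:k_neighbors] for t in p["tags"])
    return [t for t, _ in votes.most_common(n_tags)]

# Toy archive of tagged posts (word counts stand in for post content).
archive = [
    {"words": {"topic": 2, "model": 1, "lda": 3}, "tags": ["lda", "ml"]},
    {"words": {"lda": 1, "inference": 2}, "tags": ["lda", "bayes"]},
    {"words": {"soccer": 3, "goal": 2}, "tags": ["sports"]},
]
print(cf_recommend({"lda": 2, "model": 1}, archive))
```

In a real system the linear scan over the archive is replaced by an inverted-index search, which is exactly the disk-bound step the paper's method avoids.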
Our method compares samples in a low-dimensional dense topic space instead of a high-dimensional sparse word space. It can therefore better exploit the latent semantic relations between tags and posts, and yields better recommendation results.

3 Recommending Tags with the Tag-LDA Model

3.1 Tag-LDA Model

To find the most likely tags for a document, we have to find the link between tags and documents. In collaborative tag recommendation, where multiple users can tag the same entity, collaborative filtering (CF) is often used, with users serving as the link between tags and documents. For unseen text content, we have no previous tags, so the words in the text are the main bridge between tags and documents. The user's role may provide extra information, but here we only consider w ∈ d. We model documents, words and tags in a unified probabilistic model called tag-LDA, in which tags and documents are connected by latent topics, so that we can infer the conditional probability between them.

Figure 1: Latent topic models: (A) Latent Dirichlet Allocation Model; (B) Tag-LDA Model. D is the number of documents, N_w is the number of words, N_t is the number of tags, and T is the number of latent topics.
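The recommendation step described above, scoring each tag by its conditional probability given the document, amounts to marginalising over the shared latent topics: p(tag | d) = Σ_z p(tag | z) p(z | d). A minimal sketch, assuming hypothetical trained parameters (the matrix name `phi_tags` and the toy values are not from the paper):

```python
import numpy as np

# Hypothetical trained tag-LDA parameters (T=2 topics, 3 tags):
# phi_tags[z, t] = p(tag t | topic z); each row sums to 1.
phi_tags = np.array([
    [0.7, 0.2, 0.1],   # topic 0 favours tag 0
    [0.1, 0.2, 0.7],   # topic 1 favours tag 2
])

def recommend(theta_doc, n_tags=2):
    """Score each tag by p(tag | d) = sum_z p(tag | z) p(z | d),
    then return the indices of the highest-scoring tags."""
    scores = theta_doc @ phi_tags          # marginalise over topics
    return np.argsort(scores)[::-1][:n_tags]

# theta_doc = p(z | d) for a new document, e.g. inferred by folding
# the unseen document into the trained model.
theta_doc = np.array([0.9, 0.1])
print(recommend(theta_doc))                # tag 0 ranks first
```

Because only the trained parameter matrices are needed at serving time, this scoring is a small dense matrix-vector product, which is what makes the method real-time.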


Our tag-LDA model performs better than the other methods in most cases. As the number of recommended tags grows, the precision of all algorithms drops, since more noisy tags are introduced. Recall grows with the number of recommended tags; our method reaches 71% of the best possible performance when 20 tags are recommended, which is significantly higher than the other methods. Take recommending 10 tags as an example, which is closer to the real-world situation: our method achieves a 32% improvement over search-based collaborative filtering and a 293% improvement over the naïve keyword-based method, measured in F1.

5 Discussion and Future Work

In this paper, we proposed a scalable and real-time method for tag recommendation. We add the role of tags to the Latent Dirichlet Allocation model, thus creating the tag-LDA model. Tag-LDA can bridge words, documents and tags through the same set of latent topics. For a new document, we recommend the tags with the highest likelihood given the document. Compared with search-based collaborative filtering, which needs to search a large document index, our method only needs the model parameters to run, so recommendation can be done in real time. To handle large-scale data sets from the web, we implemented a distributed training tool for the tag-LDA model using the open source Hadoop MapReduce framework. We evaluated our method on a large real-world blog data set: it achieves a 32% improvement over search-based collaborative filtering and a 293% improvement over the naïve keyword-occurrence-based method, measured in F1.

Work remains to be done. We have already observed some problems with the current method. The recommender may provide nearly duplicated tags.
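Before turning to those problems, the F1-based comparison above can be made concrete. A small sketch of per-document precision, recall, and F1 over recommended tags; the gold and recommended tag sets are toy examples, not the paper's data:

```python
def precision_recall_f1(recommended, gold):
    """Precision/recall/F1 for one document's recommended tags
    against its gold (user-assigned) tags."""
    hits = len(set(recommended) & set(gold))
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: 10 recommended tags, 4 gold tags, 3 hits.
gold = ["lda", "topic-model", "nlp", "hadoop"]
recommended = ["lda", "nlp", "hadoop", "svm", "cf", "blog",
               "web", "search", "index", "tagging"]
p, r, f1 = precision_recall_f1(recommended, gold)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.3 0.75 0.429
```

Averaging these per-document scores over the test set gives the aggregate F1 used in the comparison.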
At finer granularities of concept, the recommender may provide related but not directly relevant tags. We will explore solutions to these problems. Further, we plan to make the training process of the tag-LDA model incremental, so that it can handle continuously generated data from the web. The role of the user is not used in the current model; we will consider the tagging history of the user and recommend personalized tags.

Acknowledgement

This work is supported by the National Science Foundation of China under Grants No. 60621062 and 60873174, and the National 863 High-Tech Project under Grant No. 2007AA01Z148.

References

[1] Y. Matsuo, M. Ishizuka, Keyword extraction from a single document using word co-occurrence statistical information, International Journal on Artificial Intelligence Tools, Vol. 13, No. 1, 2004.
[2] L. Hunyadi, Keyword extraction: aims and ways today and tomorrow, Proceedings of the Keyword Project: Unlocking Content through Computational Linguistics, 2001.
[3] A. Hulth, Improved automatic keyword extraction given more linguistic knowledge, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216-223, 2003.
[4] P. D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval, 2(4), 2000.
[5] H. S. Al-Khalifa, H. C. Davis, Folksonomies versus automatic keyword extraction: an empirical study, Proceedings of IADIS Web Applications and Research, 2006.
[6] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 34(1), 2002.
