Tag-LDA for Scalable Real-time Tag Recommendation

X. Si et al. / Journal of Information & Computational Science 6:1 (2009) 23-31

Text Categorization: Tag recommendation can be seen as a text categorization task, with each tag as a category and all documents carrying that tag as its training samples. Text classification is a fundamental task in text mining [6]. Current approaches using supervised classifiers such as SVM or k-NN with word features achieve reasonably good performance [7][8]. Unbalanced training samples across categories can greatly hurt the performance of text categorization [21], and such imbalance is common for tags, since tag frequencies follow an exponential distribution. Consequently, text categorization can only recommend a small number of tags, and requires training many classifiers [9].

Collaborative Filtering: Collaborative filtering (CF) [10] is a commonly used method in recommender systems. To find items a user may like, CF finds users similar to that user, looks at the items they have chosen before, and recommends those items. Using CF for tag recommendation is a straightforward approach and has been shown to work well in collaborative tagging scenarios, where multiple users can label the same entity [11][12]. When collaborative tagging is not available, as in tag recommendation for blog posts, tags used by similar posts can be recommended instead [13][14], with similar posts retrieved from an inverted index of the post archive. Searching a large index can lead to long response times; our method involves no disk-based searches, so client-side responsiveness can be guaranteed.
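As a rough sketch of the search-based CF baseline described above: find the posts most similar to a new post and vote with their tags. The toy data, similarity measure, and function names below are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse word-count dicts."""
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cf_recommend(query_words, archive, k_neighbors=2, n_tags=3):
    """Recommend tags for a new post: rank archived posts by
    similarity, then vote with the tags of the top neighbors."""
    sims = sorted(archive, key=lambda p: cosine(query_words, p["words"]),
                  reverse=True)
    votes = Counter(t for p in sims[:k_neighbors] for t in p["tags"])
    return [t for t, _ in votes.most_common(n_tags)]

# Toy archive of tagged posts (word counts stand in for post content).
archive = [
    {"words": {"topic": 2, "model": 1, "lda": 3}, "tags": ["lda", "ml"]},
    {"words": {"lda": 1, "inference": 2}, "tags": ["lda", "bayes"]},
    {"words": {"soccer": 3, "goal": 2}, "tags": ["sports"]},
]
print(cf_recommend({"lda": 2, "model": 1}, archive))
```

In a real system the linear scan over the archive is replaced by an inverted-index search, which is exactly the disk-bound step the paper's method avoids.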
Our method compares samples in a low-dimensional dense topic space instead of a high-dimensional sparse word space. It can therefore better exploit the latent semantic relations between tags and posts, and yields better recommendation results.

3 Recommending Tags with the Tag-LDA Model

3.1 Tag-LDA Model

To find the most likely tags for a document, we have to find the link between tags and documents. In collaborative tag recommendation, where multiple users can tag the same entity, collaborative filtering (CF) is often used, with users serving as the link between tags and documents. For unseen text content, we have no previous tags, so the words in the text are the main bridge between tags and documents. The user's role may provide extra information, but here we only consider w ∈ d. We model documents, words and tags in a unified probabilistic model called tag-LDA, in which tags and documents are connected by latent topics, so that we can infer the conditional probability between them.

Figure 1: Latent topic models: (A) Latent Dirichlet Allocation Model; (B) Tag-LDA Model. D is the number of documents, N_w is the number of words, N_t is the number of tags, and T is the number of latent topics.
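The recommendation step described above, scoring each tag by its conditional probability given the document, amounts to marginalising over the shared latent topics: p(tag | d) = Σ_z p(tag | z) p(z | d). A minimal sketch, assuming hypothetical trained parameters (the matrix name `phi_tags` and the toy values are not from the paper):

```python
import numpy as np

# Hypothetical trained tag-LDA parameters (T=2 topics, 3 tags):
# phi_tags[z, t] = p(tag t | topic z); each row sums to 1.
phi_tags = np.array([
    [0.7, 0.2, 0.1],   # topic 0 favours tag 0
    [0.1, 0.2, 0.7],   # topic 1 favours tag 2
])

def recommend(theta_doc, n_tags=2):
    """Score each tag by p(tag | d) = sum_z p(tag | z) p(z | d),
    then return the indices of the highest-scoring tags."""
    scores = theta_doc @ phi_tags          # marginalise over topics
    return np.argsort(scores)[::-1][:n_tags]

# theta_doc = p(z | d) for a new document, e.g. inferred by folding
# the unseen document into the trained model.
theta_doc = np.array([0.9, 0.1])
print(recommend(theta_doc))                # tag 0 ranks first
```

Because only the trained parameter matrices are needed at serving time, this scoring is a small dense matrix-vector product, which is what makes the method real-time.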


Our tag-LDA model performs better than the other methods in most cases. As the number of recommended tags grows, the precision of all algorithms drops, since more noisy tags are introduced. Recall grows with the number of recommended tags; our method reaches 71% of the best possible performance when 20 tags are recommended, which is significantly higher than the other methods. Take recommending 10 tags as an example, which is closer to the real-world situation: our method achieves a 32% improvement over search-based collaborative filtering and a 293% improvement over the naïve keyword-based method, measured in F1.

5 Discussion and Future Work

In this paper, we proposed a scalable and real-time method for tag recommendation. We add the role of tags to the Latent Dirichlet Allocation model, thus creating the tag-LDA model. Tag-LDA can bridge words, documents and tags through the same set of latent topics. For a new document, we recommend the tags with the highest likelihood given the document. Compared with search-based collaborative filtering, which needs to search a large document index, our method only needs the model parameters to run, so recommendation can be done in real time. To handle large-scale data sets from the web, we implemented a distributed training tool for the tag-LDA model using the open source Hadoop MapReduce framework. We evaluated our method on a large real-world blog data set: it achieves a 32% improvement over search-based collaborative filtering and a 293% improvement over the naïve keyword-occurrence-based method, measured in F1.

Work remains to be done. We have already observed some problems with the current method. The recommender may provide nearly duplicated tags.
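Before turning to those problems, the F1-based comparison above can be made concrete. A small sketch of per-document precision, recall, and F1 over recommended tags; the gold and recommended tag sets are toy examples, not the paper's data:

```python
def precision_recall_f1(recommended, gold):
    """Precision/recall/F1 for one document's recommended tags
    against its gold (user-assigned) tags."""
    hits = len(set(recommended) & set(gold))
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: 10 recommended tags, 4 gold tags, 3 hits.
gold = ["lda", "topic-model", "nlp", "hadoop"]
recommended = ["lda", "nlp", "hadoop", "svm", "cf", "blog",
               "web", "search", "index", "tagging"]
p, r, f1 = precision_recall_f1(recommended, gold)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.3 0.75 0.429
```

Averaging these per-document scores over the test set gives the aggregate F1 used in the comparison.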
At finer granularities of concept, the recommender may provide related but not directly relevant tags. We will explore solutions to these problems. Further, we plan to make the training process of the tag-LDA model incremental, so that it can handle continuously generated data from the web. The role of the user is not used in the current model; we will consider the tagging history of the user and recommend personalized tags.

Acknowledgement

This work is supported by the National Science Foundation of China under Grants No. 60621062 and 60873174, and the National 863 High-Tech Project under Grant No. 2007AA01Z148.

References

[1] Y. Matsuo, M. Ishizuka, Keyword extraction from a single document using word co-occurrence statistical information, International Journal on Artificial Intelligence Tools, Vol. 13, No. 1, 2004.
[2] L. Hunyadi, Keyword extraction: aims and ways today and tomorrow, Proceedings of the Keyword Project: Unlocking Content through Computational Linguistics, 2001.
[3] A. Hulth, Improved automatic keyword extraction given more linguistic knowledge, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 216-223, 2003.
[4] P. D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval, 2(4), 2000.
[5] H. S. Al-Khalifa, H. C. Davis, Folksonomies versus automatic keyword extraction: an empirical study, Proceedings of IADIS Web Applications and Research, 2006.
[6] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 34(1), 2002.
