
Web Mining and Social Networking: Techniques and ... - tud.ttu.ee

4 Web Content Mining

In recent years the growth of the World Wide Web exceeded all expectations. Today there are several billion HTML documents, pictures and other multimedia files available via the Internet, and the number is still rising. But considering the impressive variety of the Web, retrieving interesting content has become a very difficult task. Web Content Mining uses the ideas and principles of data mining and knowledge discovery to screen more specific data. The use of the Web as a provider of information is unfortunately more complex than working with static databases. Because of its very dynamic nature and its vast number of documents, there is a need for new solutions that do not depend on accessing the complete data at the outset. Another important aspect is the presentation of query results. Due to the Web's enormous size, a Web query can retrieve thousands of resulting Web pages. Thus meaningful methods for presenting these large results are necessary to help a user select the most interesting content. In this chapter we will discuss several basic topics of Web document representation, Web search, short text processing, topic extraction and Web opinion mining.

4.1 Vector Space Model

The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering. We first develop the notion of a document vector that captures the relative importance of the terms in a document.

Towards this end, we assign to each term in a document a weight that depends on the number of occurrences of the term in the document. We would like to compute a score between a query term t and a document d based on the weight of t in d. The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d. This weighting scheme is referred to as term frequency and is denoted $\mathrm{tf}_{t,d}$, with the subscripts denoting the term and the document in order.

Raw term frequency as above suffers from a critical problem: all terms are considered equally important when it comes to assessing relevancy on a query. In fact certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document. To this end, we introduce a mechanism for attenuating the effect of terms that occur too often in the ...

G. Xu et al., Web Mining and Social Networking, DOI 10.1007/978-1-4419-7735-9_4, © Springer Science+Business Media, LLC 2011
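The raw term-frequency weighting of Section 4.1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the book: the toy documents, the lowercase whitespace tokenizer, and the helper names `term_frequencies`, `to_vector` and `score`. Each document becomes a vector of term counts over a shared vocabulary, and a query is scored by summing the counts of its terms.

```python
from collections import Counter


def term_frequencies(document: str) -> Counter:
    """Return raw term frequencies tf_{t,d} for one document d.

    Assumption: terms are lowercase, whitespace-separated tokens.
    """
    return Counter(document.lower().split())


def to_vector(tf: Counter, vocabulary: list) -> list:
    """Lay out tf_{t,d} as a vector, one component per vocabulary term."""
    return [tf[term] for term in vocabulary]


def score(query: str, tf: Counter) -> int:
    """Score a document by summing tf_{t,d} over the query terms t."""
    return sum(tf[term] for term in query.lower().split())


# Hypothetical toy collection about the auto industry.
docs = [
    "auto makers report auto sales",
    "auto insurance rates rise",
    "new auto engine design",
]

tfs = [term_frequencies(d) for d in docs]

# Shared vocabulary: every distinct term in the collection, sorted.
vocabulary = sorted(set().union(*tfs))
vectors = [to_vector(tf, vocabulary) for tf in tfs]

# Number of documents that contain the term "auto".
docs_with_auto = sum(1 for tf in tfs if tf["auto"] > 0)

scores = [score("auto insurance", tf) for tf in tfs]
```

In this toy collection the term "auto" occurs in every document. For the query "auto insurance", the first document (two occurrences of "auto", none of "insurance") ties with the second, which matches both query terms. This is the weakness of raw term frequency noted above: a term with no discriminating power still contributes fully to the score.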
