10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee


4 Web Content Mining

[Fig. 4.1. Framework of the traditional pseudo relevance based suggestion using the term feature space. Figure labels: Offline step (Past Web queries q_a, q_b, q_c; Search Engine; q_a: term_1, ..., term_x; q_b: term_1, ..., term_x), Online step (Input query; Search Engine; q_input: term_1, ..., term_x), Similarity computation, Suggestion.]

of Web users, but it has a critical drawback. If the input query submitted by a user is totally new to the current Web logs, its click information is not accessible from the Web query logs for query similarity computation. In this case, pseudo relevance can be used to expand the expressions of queries with features extracted from the top search results returned by search engines. The two traditional feature spaces are the terms and the URLs of these search results.

Suppose that we have already obtained a set of candidate queries, which are past queries submitted by previous users. The task of query suggestion is to suggest the most related candidates for a target input query. Figure 4.1 gives an overview of the traditional pseudo relevance based suggestion, which uses the term feature space.

In the offline step, the top search results of past queries are retrieved by a search engine, and terms are extracted from the search result pages. In this manner, each past query is represented by the terms of its search result pages. In the online step, the same process is used to get the term expression of an input query. The query-to-query similarity is computed based on their term expressions. The past queries with the highest similarity scores are recommended to users. Pseudo relevance based suggestion using the URL feature space works in a way similar to that shown in Figure 4.1. The only difference is replacing the terms with URLs in the query expression.

If we use the terms of the search results as features, we can obtain the following vector expression: $V(q_i) = \{t_1, t_2, \cdots, t_k\}$. The similarity function can be the well-known cosine coefficient, defined as

$$\mathrm{Sim}(q_i, q_j) = \frac{V(q_i) \cdot V(q_j)}{|V(q_i)|\,|V(q_j)|}. \tag{4.5}$$

URLs can also be used to express queries and compute query-to-query similarity. Let $q_i$ and $q_j$ be two queries, and let $U(q_i)$ and $U(q_j)$ be the two sets of URLs returned by a search engine (here, Google) in response to $q_i$ and $q_j$ respectively. In this case, Jaccard is an intuitive and suitable similarity measure, defined as:

$$\mathrm{Jaccard}(q_i, q_j) = \frac{|U(q_i) \cap U(q_j)|}{|U(q_i) \cup U(q_j)|}. \tag{4.6}$$

The value of this similarity between two queries lies in the range [0, 1]: it is 1 if the two queries return exactly the same URLs, and 0 if they have no URLs in common.

After the feature enrichment of a query, we can apply traditional text processing approaches to the enriched queries, such as text classification, text clustering, topic detection and so forth.
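The two similarity measures and the suggestion step can be illustrated with a minimal Python sketch. It is not the book's implementation: the function names (`cosine_similarity`, `jaccard_similarity`, `suggest`), the toy queries, and the choice of raw term counts as vector weights are all assumptions made for illustration, since the text does not specify a weighting scheme. The search-engine retrieval of the offline and online steps is replaced by hard-coded term lists.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine coefficient in the spirit of Eq. (4.5) for two term lists.

    Assumption: each term is weighted by its raw count.
    """
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty expression
    return dot / (norm_a * norm_b)


def jaccard_similarity(urls_a, urls_b):
    """Jaccard coefficient in the spirit of Eq. (4.6) for two URL collections."""
    ua, ub = set(urls_a), set(urls_b)
    union = ua | ub
    if not union:
        return 0.0  # convention chosen here when both sets are empty
    return len(ua & ub) / len(union)


def suggest(input_features, past_features, similarity, top_n=3):
    """Rank past queries by similarity to the input query's features."""
    scored = [(similarity(input_features, feats), query)
              for query, feats in past_features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(query, score) for score, query in scored[:top_n]]


# Hypothetical term expressions; a real system would extract these
# from the top search results of each query.
past_terms = {
    "jaguar car": ["jaguar", "car", "dealer", "price"],
    "jaguar animal": ["jaguar", "cat", "habitat", "wildlife"],
    "python tutorial": ["python", "code", "tutorial"],
}
input_terms = ["jaguar", "car", "price", "review"]

ranking = suggest(input_terms, past_terms, cosine_similarity, top_n=2)
```

Passing `jaccard_similarity` together with URL collections instead of term lists gives the URL feature space variant, which mirrors the remark that the two approaches differ only in the features used to express a query.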
