10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee


6 Web Usage Mining

be determined [56]. The navigation path of Web users, if available to the server, carries valuable information about user interests.

The purpose of finding similar interests among Web users is to discover knowledge from the user profiles. If a Web site is well designed, there will be a strong correlation between the similarity of the navigation paths and the similarity of the user interests. Therefore, clustering of the former could be used to cluster the latter.

The definition of similarity is application dependent. The similarity function can be based on visits to the same or similar pages, on the frequency of access to a page [77, 140], or even on the visiting order of links (i.e., clients’ navigation paths). In the latter case, two users who access the same pages can be mapped into different interest groups if they access the pages in distinct visiting orders. In [256], Xiao et al. propose several similarity measures to capture the users’ interests. A matrix-based algorithm is then developed to cluster Web users such that the users in the same cluster are closely related with respect to the similarity measure.

Problem Definitions

The structure of an Internet server site S can be abstracted as a directed graph called the connectivity graph: the node set of the graph consists of all Web pages of the site, and the hypertext links between pages are taken as directed edges, since each link has a starting page and an ending page. For some links, the starting point or the end point may be a page outside the site. The connectivity graph can be quite complicated. For simplicity, the concern here is limited to the part of clients’ navigation paths inside a particular site.
From the Internet browsing logs, the following information about a Web user can be gathered: the frequency of hyper-page usage, the list of links he or she selected, the elapsed time between two links, and the order of pages accessed by individual Web users.

Similarity Measures

Suppose that, for a given Web site S, there are m users $U = \{u_1, u_2, \ldots, u_m\}$ who accessed n different Web pages $P = \{p_1, p_2, \ldots, p_n\}$ in some time interval. Each page $p_i$ and each user $u_j$ is associated with a usage value, denoted $use(p_i, u_j)$ and defined as

$$use(p_i, u_j) = \begin{cases} 1 & \text{if } p_i \text{ is accessed by } u_j \\ 0 & \text{otherwise} \end{cases}$$

The use vector can be obtained by retrieving the access logs of the site. If two users accessed the same pages, they might have some similar interests in the sense that they are interested in the same information (e.g., news, electrical products, etc.). The number of common pages they accessed can measure this similarity. The measure is defined by

$$Sim1(u_i, u_j) = \frac{\sum_k use(p_k, u_i) \ast use(p_k, u_j)}{\sqrt{\sum_k use(p_k, u_i) \ast \sum_k use(p_k, u_j)}} \tag{6.1}$$

where $\sum_k use(p_k, u_i)$ is the total number of pages that were accessed by user $u_i$, and the numerator $\sum_k use(p_k, u_i) \ast use(p_k, u_j)$ is the number of common pages accessed by both users $u_i$ and $u_j$. If two users access exactly the same pages, their similarity will be 1. The similarity measure defined in this way is called the Usage Based (UB) measure.
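
The Usage Based measure of Eq. (6.1) can be illustrated with a minimal Python sketch. It is not the implementation from [256]. The function names `build_use` and `sim1`, the `(user, page)` record format, and the sample log are assumptions made for illustration only. Because use(p, u) is binary, each user's use vector is stored as the set of pages that user accessed, so the numerator of Eq. (6.1) becomes the size of a set intersection.

```python
from math import sqrt


def build_use(access_log):
    """Collect, for each user, the set of pages with use(p, u) = 1.

    access_log is assumed to be an iterable of (user, page) pairs;
    repeated visits to the same page count only once.
    """
    pages_by_user = {}
    for user, page in access_log:
        pages_by_user.setdefault(user, set()).add(page)
    return pages_by_user


def sim1(pages_by_user, u_i, u_j):
    """Usage Based similarity of Eq. (6.1) for two users."""
    a = pages_by_user.get(u_i, set())
    b = pages_by_user.get(u_j, set())
    if not a or not b:
        # Assumption: Eq. (6.1) is undefined when a user accessed no
        # pages, so this sketch returns 0 in that case.
        return 0.0
    common = len(a & b)  # sum_k use(p_k, u_i) * use(p_k, u_j)
    return common / sqrt(len(a) * len(b))


# Hypothetical access records, not taken from any real log.
log = [
    ("u1", "/news"), ("u1", "/sports"),
    ("u2", "/news"), ("u2", "/sports"),
    ("u3", "/news"), ("u3", "/shop"), ("u3", "/news"),
]
use = build_use(log)
print(sim1(use, "u1", "u2"))  # users with identical page sets
print(sim1(use, "u1", "u3"))  # users sharing one of two pages each
```

In this sample, u1 and u2 accessed exactly the same pages, so their similarity is 1. u1 and u3 share one page out of two each, giving 1/√(2·2) = 0.5.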
