Web Mining and Social Networking: Techniques and ... - tud.ttu.ee

6.5 Web Usage Mining Applications

Step 3. Create the graph corresponding to the matrix, and employ a clique-finding (or connected-components) algorithm on the graph.
Step 4. For each cluster found, create new index Web pages by synthesizing the links to the documents of the pages contained in the cluster.

In the first step, an access log, containing a sequence of hits, or requests to the Web server, is taken for processing. Each request typically consists of the time-stamp of the request, the URL requested, and the IP address from which the request originated. The IP address in this case is treated as identifying a single user. Thus a series of hits made within a one-day period, ordered by their time-stamps, is collected as a single session for that user. The obtained user sessions, in the form of session vectors of requested URLs, are used in the second step. To compute the co-occurrence frequencies between pages, the conditional probability of each pair of pages P1 and P2 is calculated. Pr(P1|P2) denotes the probability of a user visiting P1 after having already visited P2, while Pr(P2|P1) is the probability of a user visiting P2 after having visited P1. The co-occurrence frequency between P1 and P2 is the minimum of these two values. The minimum of the two conditional probabilities is used in order to avoid the problem of asymmetrical relationships between two pages that play distinct roles in the Web site. Finally, a matrix of the calculated co-occurrence frequencies is created, and in turn a graph equivalent to the matrix is built up to reflect the connections between pages derived from the log. In the third step, a clique-finding algorithm is applied to the graph to reveal its connected components.
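Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes sessions are given as lists of requested URLs, and the edge threshold used to turn co-occurrence frequencies into a graph is a hypothetical parameter.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(sessions):
    """Step 2: co-occurrence frequency of each page pair as the
    minimum of the two conditional probabilities Pr(P1|P2), Pr(P2|P1)."""
    visits = defaultdict(int)   # number of sessions containing page P
    pairs = defaultdict(int)    # number of sessions containing both P1 and P2
    for session in sessions:
        pages = set(session)
        for p in pages:
            visits[p] += 1
        for p1, p2 in combinations(sorted(pages), 2):
            pairs[(p1, p2)] += 1
    freq = {}
    for (p1, p2), both in pairs.items():
        # Pr(P1|P2) = both / visits[P2], Pr(P2|P1) = both / visits[P1];
        # take the minimum to avoid asymmetric relationships.
        freq[(p1, p2)] = min(both / visits[p2], both / visits[p1])
    return freq

def connected_components(freq, threshold=0.5):
    """Step 3: build a graph from pairs above the (hypothetical)
    threshold and return its connected components as page clusters."""
    adj = defaultdict(set)
    for (p1, p2), f in freq.items():
        if f >= threshold:
            adj[p1].add(p2)
            adj[p2].add(p1)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:            # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

sessions = [["/home", "/a", "/b"], ["/home", "/a", "/b"], ["/home", "/c"]]
freq = cooccurrence(sessions)
clusters = connected_components(freq, threshold=0.9)
```

Step 4 would then emit, for each set in `clusters`, an index page listing links to its members.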
In this manner, a clique (also called a cluster) is a collection of nodes (i.e. pages) whose members are directly connected by edges. In other words, the subgraph of the clique, in which each pair of nodes is joined by a path, satisfies the property that every node in the clique or cluster is related to at least one other node in the subgraph. Eventually, for each found cluster of pages, a new index page containing links to all the documents in the cluster is generated. From the above description, we can see that the added index pages represent the coherent relationships between pages from the user navigational perspective, thereby providing an additional way for users to visually learn the access intents of other users and to browse directly to the needed pages from the generated index pages. Figure 6.7 depicts an example of an index page derived by the PageGather algorithm [204].

Mining Web Logs to Improve Website Organization

In [233], the authors proposed a novel algorithm to automatically find the Web pages in a website whose location differs from where users expect to find them. The motivation behind the approach is that users backtrack if they do not find a page where they expect it, and the backtrack point is where the expected location of the page should be. Mining Web logs thus provides a possible way to identify the backtrack points and, in turn, to improve the website organization.

The model of a user's search pattern usually follows the procedure below [233]:

Algorithm 6.12: backtrack path finding

For a single target page T, the user is expected to execute the following search strategy:

1. Start from the root.
2.
While (current location C is not the target page T) do
   (a) If any of the links from C is likely to reach T, follow the link that appears most likely
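Under this search model, a backtrack appears in the log as a return to a previously visited page within a session. The following is a minimal detection sketch, not the full algorithm of [233]: it assumes each session is an ordered list of requested URLs and simply flags returns to already-seen pages.

```python
def backtrack_points(session):
    """Return (backtrack_page, abandoned_page) pairs: positions where
    the user returned to an already-visited page, suggesting the
    abandoned page was not where they expected the target to be."""
    seen = set()
    points = []
    prev = None
    for page in session:
        # A revisit (other than a plain reload) means the user backed
        # up from `prev`, so `prev` did not lead toward the target.
        if page in seen and page != prev:
            points.append((page, prev))
        seen.add(page)
        prev = page
    return points

# Example session: the user tries /products, backs up to /home,
# then finds the target under /services.
session = ["/home", "/products", "/home", "/services", "/target"]
```

Here `backtrack_points(session)` flags `/home` as the backtrack point and `/products` as the page where the user gave up, hinting that a link expected under `/products` actually lives elsewhere.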
