10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee



\[ R_{\mathrm{merge}}(c(t_{k-1}), c(t_k)) = \frac{N_{mg}(c(t_{k-1}))}{t_k - t_{k-1}}. \qquad (7.4) \]

The split rate, $R_{\mathrm{split}}(c(t_{k-1}), c(t_k))$, is the number of URLs split from $c(t_{k-1})$ per unit time. When the split rate is low, $c(t_k)$ is larger than the other split communities. Otherwise, $c(t_k)$ is smaller than the other split communities. The split rate is defined as follows.

\[ R_{\mathrm{split}}(c(t_{k-1}), c(t_k)) = \frac{N_{sp}(c(t_{k-1}))}{t_k - t_{k-1}}. \qquad (7.5) \]

By combining these metrics, some complex evolution patterns can be represented. For example, a community has grown stably when its growth rate is positive and its disappearance and split rates are low. Similar evolution patterns can be defined for shrinkage.

Longer-range metrics (spanning more than one unit of time) can be calculated for main lines. For example, the novelty metric of a main line $(c(t_i), c(t_{i+1}), \ldots, c(t_j))$ is calculated as follows. Other metrics can be calculated similarly.

\[ R_{\mathrm{novelty}}(c(t_i), c(t_j)) = \frac{\sum_{k=i}^{j} N_{ap}(c(t_k))}{t_j - t_i}. \qquad (7.6) \]

7.1.3 Web Archives and Graphs

For experiments, M. Toyoda and M. Kitsuregawa used four Web archives of Japanese Web pages (in the jp domain) crawled in 1999, 2000, 2001, and 2002 (see Table 1). The same Web crawler was used in 1999 and 2000, and it collected about 17 million pages in each year. In 2001, the number of pages grew to more than twice that of the 2000 archive because the crawling rate was improved. The crawlers collected pages in breadth-first order.

From each archive, a Web graph is built with URLs and links by extracting anchors from all pages in the archive. The graph included not only URLs inside the archive, but also URLs outside the archive that are pointed to by inside URLs.
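The graph construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `archive` layout (a dictionary from crawled URL to HTML text), the class `AnchorExtractor`, and the function `build_web_graph` are hypothetical names chosen for illustration, and indexing links in both directions is a convenience of the sketch.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.targets.append(urljoin(self.base_url, href))


def build_web_graph(archive):
    """Build a graph from `archive`, a dict of crawled URL -> HTML text.

    The dict layout is an assumption made for this sketch.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    urls = set(archive)
    for url, html in archive.items():
        parser = AnchorExtractor(url)
        parser.feed(html)
        for target in parser.targets:
            urls.add(target)  # targets outside the archive become nodes too
            out_links[url].add(target)
            in_links[target].add(url)
    return urls, out_links, in_links


# Toy archive with two crawled pages; the example.com page is never crawled.
archive = {
    "http://a.example.jp/": (
        '<a href="/about.html">About</a> '
        '<a href="http://www.example.com/">Partner</a>'
    ),
    "http://a.example.jp/about.html": '<a href="http://a.example.jp/">Home</a>',
}
urls, out_links, in_links = build_web_graph(archive)
print(sorted(urls))
```

In this toy archive the uncrawled page `http://www.example.com/` still appears as a node, because an archived page links to it.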
Because outside URLs pointed to from archived pages were kept, the graph included URLs outside the jp domain, such as com and edu. Table 1 also shows the number of links and the total number of URLs. For efficient link analysis, each Web graph was stored in a main-memory database that provided the out-links and in-links of a given URL. Its implementation was similar to the connectivity server [30]. The whole system was implemented on a Sun Enterprise Server 6500 with 8 CPUs and 4 GB of memory. Building the connectivity database of 2002 took about one day.

Comparing these graphs shows that the Web was extremely dynamic. More than half of the URLs disappeared or changed location within one year. The authors first examined how many URLs in their Web graphs changed over time by counting the number of URLs shared between these graphs. About 60% of URLs disappeared from both the 1999 and 2000 graphs. From 2001 to 2002, about 30% of URLs disappeared in four months. The number of URLs surviving through all four archives was only about 5 million. In [63], Cho reported that more than 70% of pages survived more than one month in their four-month observation. This is close to the authors' result from 2001 to 2002 (30% disappearance in four months). Although it is not easy to estimate the one-year rate from [63], the authors state that their results do not deviate much from it.

7.1.4 Evolution of Web Community Charts

The global behavior of community evolution is now described. From the above four Web graphs, four community charts are built using the technique described in Chapter 5 (Section 5.4).
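The rate metrics of Equations (7.4) to (7.6) are plain ratios of URL counts to elapsed time, which a short Python sketch can make concrete. The function names, the choice of months as the time unit, and all counts below are hypothetical. The novelty sum follows Equation (7.6) as printed, running over every community on the main line from $k = i$ to $k = j$.

```python
def _elapsed(t_start, t_end):
    """Return the elapsed time, rejecting empty or reversed intervals."""
    if t_end <= t_start:
        raise ValueError("the later time must be greater than the earlier one")
    return t_end - t_start


def merge_rate(n_merged, t_prev, t_curr):
    """Equation (7.4): merged URLs per unit time."""
    return n_merged / _elapsed(t_prev, t_curr)


def split_rate(n_split, t_prev, t_curr):
    """Equation (7.5): split URLs per unit time."""
    return n_split / _elapsed(t_prev, t_curr)


def novelty(n_appeared, times):
    """Equation (7.6): appeared URLs summed along a main line, per unit time.

    n_appeared[k] is the count for the community at times[k].
    """
    if len(n_appeared) != len(times):
        raise ValueError("one count is needed per time point")
    return sum(n_appeared) / _elapsed(times[0], times[-1])


# Hypothetical counts, with time measured in months since the first archive.
print(merge_rate(6, 0, 12))             # 6 merged URLs over 12 months
print(split_rate(3, 12, 24))            # 3 split URLs over 12 months
print(novelty([3, 4, 7], [0, 12, 28]))  # 14 new URLs over 28 months
```

Dividing by the elapsed time keeps rates comparable when archives are unevenly spaced, such as a four-month gap next to a one-year gap.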
