Web Mining and Social Networking: Techniques and ... - tud.ttu.ee



5.4.2 Network Flow/Cut-based Notions of Communities

Bipartite-core-based communities are usually very small and do not represent full communities. Flake et al. [93] defined a more general community as a subgraph whose internal link density exceeds the density of connections to nodes outside it by some margin. Formally, a community is defined as a subset C ⊂ V such that each c ∈ C has at least as many neighbors in C as in V − C. This is an NP-complete graph partitioning problem. Hence, we need to approximate and recast it into a less stringent definition, based on the network flow model from operations research.

The max-flow/min-cut problem [66] is posed as follows. We are given a graph G = (V, E) with a source node s ∈ V and a target node t ∈ V. Each edge (u, v) is like a water pipe with a positive integer maximum flow capacity c(u, v). The max-flow algorithm finds the maximum rate of flow from s to t without exceeding the capacity constraints on any edge. It is known that the value of this maximum flow equals the capacity of a minimum-capacity cut (min-cut) separating s and t.

Flake et al. [93] applied the above concept to the Web context as follows. Suppose we are given the Web graph G = (V, E) with a node subset S ⊂ V identified as seed URLs, which are examples of the community the user wishes to discover. An artificial source s is created and connected to all seed nodes u ∈ S, setting c(s, u) = ∞. Then, we connect all v ∈ V − S − {s, t} to an artificial target t with c(v, t) = 1. Each original edge is made undirected and its capacity is heuristically set to k (usually |S|). The s → t max-flow algorithm is applied on the resulting graph.
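The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Flake et al.: the function name `max_flow_community`, the labels `__source__` and `__target__`, and the toy graph are all hypothetical, and the max-flow routine is a plain Edmonds-Karp search (shortest augmenting paths found by BFS) chosen for brevity. The nodes still reachable from s in the residual graph once no augmenting path remains form the s-side of a min-cut.

```python
from collections import defaultdict, deque


def max_flow_community(edges, seeds, k=None):
    """Return the s-side of a minimum s-t cut for the seed-based construction."""
    seeds = set(seeds)
    k = len(seeds) if k is None else k   # heuristic edge capacity, usually |S|
    s, t = "__source__", "__target__"    # hypothetical labels for the artificial nodes

    # Residual capacities; every edge also gets a reverse entry.
    cap = defaultdict(lambda: defaultdict(int))
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = set(seeds)
    for pair in undirected:
        u, v = tuple(pair)
        nodes.update((u, v))
        cap[u][v] = k                    # original edges become undirected
        cap[v][u] = k
    for u in seeds:
        cap[s][u] = float("inf")         # c(s, u) = infinity for each seed
        cap[u][s] = 0
    for v in nodes - seeds:
        cap[v][t] = 1                    # c(v, t) = 1 for each non-seed node
        cap[t][v] = 0

    def reachable_from_source():
        """BFS over edges with positive residual capacity; returns a parent map."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, residual in cap[u].items():
                if residual > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if t not in parent:
            break                        # no augmenting path left: flow is maximal
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # Nodes still reachable from s lie on the s-side of a min-cut.
    return set(parent) - {s}


# Toy graph (made up for illustration): two 4-cliques joined by one bridge d-w.
edges = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("w", "x"), ("w", "y"), ("w", "z"), ("x", "y"), ("x", "z"), ("y", "z"),
    ("d", "w"),
]
community = max_flow_community(edges, seeds=["a", "b"])
print(sorted(community))
```

On this toy graph the cheapest cut severs the bridge d-w plus the two unit edges from c and d to the target, so the seeds pull in the rest of their clique and nothing from the other one.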
All nodes on the s-side of the min-cut are defined as members of the community C.

In reality, we do not have the whole Web graph and must collect the necessary portions by crawling. The crawler begins with the seed set S and finds all in- and out-neighbors of the seed nodes to some fixed depth. The crawled nodes together with the seed set S are then used to set up the max-flow problem described above. The process can then be repeated until some stopping condition is satisfied. This crawling process can be thought of as a different form of focused crawling that is driven not by textual content but by considerations based solely on hyperlinks.

Compared with the bipartite-core-based approach, max-flow-based community discovery can extract larger, more complete communities. However, it cannot find the theme, the hierarchy, or the relationships of Web communities.

5.4.3 Web Community Chart

The two community finding algorithms described earlier can only identify groups of pages that belong to web communities. They cannot derive or infer the relationships between extracted communities. M. Toyoda and M. Kitsuregawa [243] have proposed a technique for constructing a web community chart that provides not only a set of web communities but also the relationships between them.

Their technique is based on a link-based related page algorithm that gives related pages to a given input page. The main idea is to apply a related page algorithm to a number of pages, and investigate how each page derives other pages as related pages. If a page s derives a page t as a related page and the page t also derives s as a related page, then we say that there is a symmetric derivation relationship between s and t.
For example, a fan page i of a baseball team derives other fan pages as related pages. When we apply the related page algorithm to another fan page j, the page j also derives the original fan page i as its related page. A symmetric derivation relationship between two pages often means that they are both pointed to by a similar set of hubs.
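The notion of symmetric derivation can be made concrete with a small sketch. Note the assumption: the text does not specify the related page algorithm here, so the function `related_pages` below is a stand-in that simply ranks pages by co-citation (the number of hubs linking to both pages); it is not the algorithm of Toyoda and Kitsuregawa. The page names, the hub names, and the cutoff `top_n` are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def related_pages(page, inlinks, top_n=2):
    """Stand-in related page algorithm: rank other pages by shared in-linking hubs."""
    hubs = inlinks.get(page, set())
    counts = Counter()
    for other, other_hubs in inlinks.items():
        shared = len(hubs & other_hubs)
        if other != page and shared:
            counts[other] = shared
    # Highest co-citation first; ties broken by name so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


def symmetric_derivations(pages, inlinks, top_n=2):
    """Return pairs (a, b) where a derives b and b derives a as a related page."""
    derived = {p: set(related_pages(p, inlinks, top_n)) for p in pages}
    return {
        (a, b)
        for a, b in combinations(sorted(pages), 2)
        if b in derived[a] and a in derived[b]
    }


# Made-up data: each page maps to the set of hubs that link to it.
inlinks = {
    "fan_i": {"h1", "h2", "h3"},
    "fan_j": {"h1", "h2", "h3"},
    "fan_k": {"h2", "h3"},
    "news": {"h3", "h4"},
    "portal": {"h4", "h5"},
}
pairs = symmetric_derivations(list(inlinks), inlinks)
print(sorted(pairs))
```

In this toy data the three fan pages share most of their hubs and derive one another, so every pair among them is symmetric. The page `news` derives fan pages, but no fan page derives it back within the cutoff, so that relationship is one-directional and is not reported.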
