Web Mining and Social Networking: Techniques and ... - tud.ttu.ee
5.4.2 Network Flow/Cut-based Notions of Communities

Communities based on bipartite cores are usually very small and do not represent full communities. Flake et al. [93] defined a more general community as a subgraph whose internal link density exceeds the density of connections to nodes outside it by some margin. Formally, a community is a subset C ⊂ V such that each c ∈ C has at least as many neighbors in C as in V − C. Finding such a subset is an NP-complete graph partitioning problem. Hence, we need to approximate it by recasting it into a less stringent definition based on the network flow model from operations research.

The max-flow/min-cut problem [66] is posed as follows. We are given a graph G = (V, E) with a source node s ∈ V and a target node t ∈ V. Each edge (u, v) is like a water pipe with a positive integer maximum flow capacity c(u, v). The max-flow algorithm finds the maximum rate of flow from s to t without exceeding the capacity constraint on any edge. It is known that the value of this maximum flow equals the capacity of a minimum-capacity cut (min-cut) separating s and t.

Flake et al. [93] applied the above concept to the Web context as follows. Suppose we are given the Web graph G = (V, E) with a node subset S ⊂ V identified as seed URLs, which are examples of the community the user wishes to discover. An artificial source s is created and connected to all seed nodes u ∈ S, with c(s, u) = ∞. Then, we connect all v ∈ V − S − {s, t} to an artificial target t with c(v, t) = 1. Each original edge is made undirected, and its capacity is heuristically set to k (usually |S|). The s → t max-flow algorithm is applied on the resulting graph.
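The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of Flake et al.: it assumes the graph is given as a list of page pairs, uses a plain Edmonds-Karp max-flow (a choice made here for brevity), and takes the source side of the resulting min-cut as the community. The function name `max_flow_community` and the toy page labels are hypothetical.

```python
from collections import defaultdict, deque


def max_flow_community(edges, seeds, k=None):
    """Return the pages on the source side of a min s-t cut (hypothetical helper)."""
    seeds = set(seeds)
    k = len(seeds) if k is None else k      # heuristic capacity, usually |S|
    source, sink = object(), object()       # artificial s and t
    cap = defaultdict(lambda: defaultdict(int))

    def add_edge(u, v, c):
        cap[u][v] += c
        cap[v][u] += 0                      # ensure the residual reverse edge exists

    # Each original link becomes one undirected edge of capacity k.
    undirected = {frozenset((u, v)) for u, v in edges if u != v}
    nodes = set(seeds)
    for edge in undirected:
        u, v = tuple(edge)
        nodes.update((u, v))
        add_edge(u, v, k)
        add_edge(v, u, k)
    for u in seeds:
        add_edge(source, u, float("inf"))   # c(s, u) = infinity
    for v in nodes - seeds:
        add_edge(v, sink, 1)                # c(v, t) = 1

    def reachable_from_source():
        """Breadth-first search over edges with remaining capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break                           # no augmenting path left: flow is maximal
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck

    # Nodes still reachable from s in the residual graph form the s-side of a min-cut.
    return set(parent) - {source}


# Toy graph: seeds a and b, a shared neighbor c, and a loosely attached tail d-e.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
community = max_flow_community(links, seeds={"a", "b"})
print(sorted(community))  # ['a', 'b', 'c']
```

In this toy graph the tail pages d and e fall on the target side because the single link between c and d is saturated, while c stays with the seeds because both seeds link to it.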
All nodes on the s-side of the min-cut are defined as members of the community C.

In reality, we do not have the whole Web graph and must collect the necessary portions by crawling. The crawler begins with the seed set S and finds all in- and out-neighbors of the seed nodes to some fixed depth. The crawled nodes, together with the seed set S, are then used to set up the max-flow problem described above. The process can be repeated until some stopping conditions are satisfied. This crawling process can be thought of as a different form of focused crawling, one that is driven not by textual content but by considerations based solely on hyperlinks.

Compared with the approach based on bipartite cores, max-flow based community discovery can extract larger, more complete communities. However, it cannot find the theme, the hierarchy, or the relationships of Web communities.

5.4.3 Web Community Chart

The two community finding algorithms described earlier can only identify groups of pages that belong to web communities. They cannot derive or infer the relationships between the extracted communities. M. Toyoda and M. Kitsuregawa [243] have proposed a technique for constructing a web community chart that provides not only a set of web communities but also the relationships between them.

Their technique is based on a link-based related page algorithm, which returns the pages related to a given input page. The main idea is to apply a related page algorithm to a number of pages and investigate how each page derives other pages as related pages. If a page s derives a page t as a related page and the page t also derives s as a related page, then we say that there is a symmetric derivation relationship between s and t.
For example, a fan page i of a baseball team derives other fan pages as related pages. When we apply the related page algorithm to another fan page j, the page j in turn derives the original fan page i as its related page. A symmetric derivation relationship between two pages often means that they are both pointed to by a similar set of hubs.
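The symmetric derivation relationship can be illustrated with a short sketch. The related page algorithm itself is not specified in this passage, so the sketch assumes its output is already available as a dictionary mapping each page to the pages it derives. The function name `symmetric_derivations` and the page labels are hypothetical.

```python
def symmetric_derivations(related):
    """Return unordered page pairs {s, t} where s derives t and t derives s.

    `related` maps each page to the pages a related page algorithm
    returned for it (assumed to be precomputed elsewhere).
    """
    derived = {page: set(targets) for page, targets in related.items()}
    pairs = set()
    for s, targets in derived.items():
        for t in targets:
            # Keep the pair only if the derivation also holds in the other direction.
            if t != s and s in derived.get(t, ()):
                pairs.add(frozenset((s, t)))
    return pairs


# Toy input: two fan pages derive each other; the team site does not derive either.
related = {
    "fan_i": ["fan_j", "team_site"],
    "fan_j": ["fan_i"],
    "team_site": ["league"],
}
print(symmetric_derivations(related))
```

Here only the pair of fan pages is returned, since "team_site" is derived from "fan_i" but does not derive it back.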