11.07.2015 Views

Upgrade Report - Department of Informatics - King's College London


on this, new vertices emerging in Q will have an edge-copying flavor, as described in Section 5.3. In a similar fashion, when a new topic emerges it is likely to be based on, or inspired by, an existing topic.

Based on the above example, the graph G(Q, E) is constructed on the assumption that when an author adds references to a new paper, that author will cite most or all of the papers on that topic and some papers of general interest. It is mentioned that this intuition can be applied to other social graphs as well: we can consider Q to be a set of interests a person may have and G(Q, E) to be a social interaction graph between people. People who share the same interest are more likely to be connected in G; in addition, people without common interests may be connected as a result of the popularity of one or both of them.

From this intuitive understanding two factors emerge: an edge-copying flavor and a flavor of preferential attachment based on degree. The suggested model contains both of these elements in some way. Since graph B(Q, U) uses the edge-copying mechanism heavily, it exhibits a power-law degree distribution and a community structure, as is proven in the paper. The graph G(Q, E), which also includes the degree-based preferential attachment mechanism as well as common affiliation links between vertices of Q, also exhibits this phenomenon.
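The two mechanisms named above, edge copying and degree-based preferential attachment, can be illustrated with a small growth process. This is only a toy sketch on a plain undirected graph, not the bipartite model B(Q, U) of the paper under discussion; the function name `grow_graph`, the parameter `copy_prob`, and the rule of one prototype plus one preferential edge per new vertex are assumptions made for illustration.

```python
import random


def grow_graph(n, copy_prob, rng):
    """Grow an undirected graph on vertices 0..n-1 (toy model, see lead-in).

    Each new vertex picks a uniformly random existing "prototype" vertex,
    links to it, copies each of the prototype's edges independently with
    probability copy_prob (edge copying), and adds one more edge to a vertex
    chosen with probability proportional to its degree (preferential
    attachment).
    """
    adj = {0: {1}, 1: {0}}   # start from a single edge
    endpoints = [0, 1]       # each vertex appears once per incident edge
    for v in range(2, n):
        prototype = rng.randrange(v)
        targets = {prototype}
        for u in adj[prototype]:
            if rng.random() < copy_prob:
                targets.add(u)              # copied edge
        targets.add(rng.choice(endpoints))  # degree-proportional choice
        adj[v] = set()
        for u in targets:
            adj[v].add(u)
            adj[u].add(v)
            endpoints.extend((u, v))
    return adj


graph = grow_graph(200, 0.5, random.Random(0))
degrees = sorted((len(nbrs) for nbrs in graph.values()), reverse=True)
print(degrees[:5], degrees[-5:])  # a few large degrees, many small ones
```

Sampling from the `endpoints` list is a common way to pick a vertex with probability proportional to its degree, since every vertex appears in the list once for each incident edge.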
In addition, this model claims a densification power law and bounded diameter.

This model is worth noting because all the above properties have been observed in most online graphs (social, citation, Peer-To-Peer (P2P), etc.), but more importantly because it provides a proven power-law degree distribution, densification and bounded diameter.

6 Graph sampling and crawling

Generally speaking, graph sampling is a very underdeveloped topic. Put simply, the big question is: ‘How can one sample only a part of a graph and yet maintain certain structural information that is present in the entire graph?’ This question has many interpretations and shades of meaning. In our case, for example, we are interested in getting a crawled sample of a preferential attachment graph which is a good model of the graph in its entirety. Thus, this sample, studied on its own, must have certain required properties. In particular, we want the degree distribution that is observed in the entire network to be, at scale, observed in our sample of the network. Other typical examples might be clustering coefficient or diameter.

Work on efficient sampling of network characteristics arises in many areas. In the context of search engine design, studies on optimally sampling the URL crawl frontier to rapidly sample (e.g.) high page-rank vertices, based on knowledge of vertex degree in the current sample, can be found in e.g. [4].

Within the random graph community, trace-route sampling was used to estimate cumulative degree distributions, and methods of removing the high-degree bias from this process were studied in e.g. [1], [25].

Another approach, analyzed in [10], is the jump and crawl method to find (e.g.) all very high degree vertices. The method uses a mixture of uniform sampling followed by inspection of the neighboring vertices, in time sub-linear in the network size.

In the context of online social networks, exploration often focused on how to discover the entire network more efficiently. Until recently this was feasible for many real-world networks, before they exploded to their current size. It is no longer feasible to get a consistent snapshot of the Facebook network, for example.¹

Methods based on SRW are commonly used for graph searching and crawling, and such methods have been used and analyzed extensively. Stutzbach et al. [53] compare the performance of BFS with a SRW and a MHRW [28,43] on various classes of random graphs as a basis for sampling the degree distribution of the underlying networks. The purpose of the investigation was to sample from dynamic P2P networks. In a related study, M. Gjoka et al. [27] made extensive use of the above methods to collect a sample of Facebook users. As Simple Random Walks (SRWs) are degree biassed, they used a re-weighting technique to unbias the sampled degree sequence output by the random walk. This is referred to as a Re-Weighted Random Walk (RWRW) in [27]. In both the above cases it was shown the bias could be removed dynamically by

¹ According to the Facebook statistics page at http://www.facebook.com/press/info.php?statistics (retrieved on 02 June 2011), there were over 500 million active users at the time of retrieval (the exact number was not mentioned) and the average user had 130 friends.
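The degree bias of a simple random walk, and its removal by re-weighting, can be sketched as follows. On a connected undirected graph the walk visits a vertex with long-run frequency proportional to its degree, so the sketch weights every visit by the inverse of the visited vertex's degree. This inverse-degree correction is assumed here to be the kind of re-weighting meant above; the function names and the star-shaped test graph are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def simple_random_walk(adj, start, steps, rng):
    """Return the vertices visited by a simple random walk of `steps` steps."""
    v = start
    visited = []
    for _ in range(steps):
        v = rng.choice(adj[v])  # move to a uniformly random neighbour
        visited.append(v)
    return visited


def raw_degree_distribution(adj, visited):
    """Degree frequencies among visited vertices, with no correction."""
    counts = Counter(len(adj[v]) for v in visited)
    return {k: c / len(visited) for k, c in counts.items()}


def reweighted_degree_distribution(adj, visited):
    """Degree frequencies with each visit weighted by 1/degree (assumed scheme)."""
    weights = Counter()
    for v in visited:
        d = len(adj[v])
        weights[d] += 1.0 / d
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Toy graph: a star with hub 0 and nine leaves (10 vertices in total).
adj = {0: list(range(1, 10))}
for leaf in range(1, 10):
    adj[leaf] = [0]

walk = simple_random_walk(adj, start=0, steps=1000, rng=random.Random(1))
raw = raw_degree_distribution(adj, walk)
est = reweighted_degree_distribution(adj, walk)
print("raw:", raw)
print("re-weighted:", est)
```

On this star the walk alternates between the hub and a leaf, so the raw sample reports the hub's degree half of the time even though the hub is only one vertex in ten. After re-weighting, the estimated fraction of degree-9 vertices is close to the true value of 0.1.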
