Upgrade Report - Department of Informatics - King's College London

6 GRAPH SAMPLING AND CRAWLING

using a MHRW and selecting an appropriate target distribution. This indicates that there are application-specific or network-specific optimizations that can be applied to random walks in order to tune them to the required task.

An interesting experimental analysis of sampling methods such as Respondent-Driven Sampling (RDS) and the Metropolis-Hastings Random Walk (MHRW) has been done by A. H. Rasti et al. [50], showing the effect that graph structure and size have on the efficiency of these methods. Several graph types were used, including the Erdős-Rényi random graph, the Small World graph, the Barabási-Albert (preferential attachment) graph and the Hierarchical Scale-Free graph, the latter being a scale-free graph with a structure of clusters within clusters. It was shown that the above methods had reduced efficiency when applied to the Hierarchical Scale-Free graph.

The work of F. Wu et al. [23] has provided us with the framework for the definitions and the methods of crawling. For each method several properties were measured, such as:

Efficiency: How fast new vertices are discovered.
Sensitivity: How much the results vary depending on the target social network and the percentage of protected users within that network.
Bias: How different statistical properties are distorted in the samples.

These are the general concerns of graph crawling, as we require our methods to be unbiased, successful and independent of the entry point.

In a related study conducted by A. Mislove et al. [45] the problem area is very well described. According to [45], the major issues that arise in sampling have to do with getting an unbiased sample quickly and effectively. Several methods are discussed, such as Breadth-first search (BFS) and the snowball method. The snowball method is effectively a BFS which terminates prematurely for each vertex, i.e. it samples only n out-edge targets for each discovered vertex rather than all of them and rejects the rest.
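The relationship between BFS and the snowball method can be illustrated with a minimal Python sketch. It is not taken from [45]: the function name `snowball_sample`, the adjacency-dictionary input `adj`, and the `max_vertices` crawl budget are all hypothetical, and choosing the n retained out-edge targets uniformly at random is an assumption, since the text does not say how they are selected. Passing `n=None` keeps every target, which reduces the crawl to a plain BFS.

```python
import random
from collections import deque


def snowball_sample(adj, start, n, max_vertices, rng=None):
    """Crawl `adj` from `start`, following at most `n` out-edges per vertex.

    adj          -- dict mapping each vertex to a list of its out-neighbours
    n            -- out-edge targets kept per vertex (None keeps all, i.e. BFS)
    max_vertices -- hypothetical crawl budget: stop once this many are found
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < max_vertices:
        v = queue.popleft()
        targets = list(adj[v])
        if n is not None and len(targets) > n:
            # Assumption: the n retained targets are picked uniformly at
            # random; the remaining targets are rejected.
            targets = rng.sample(targets, n)
        for u in targets:
            if u not in visited and len(visited) < max_vertices:
                visited.add(u)
                queue.append(u)
    return visited


# A star graph: vertex 0 is linked to four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(len(snowball_sample(star, 0, None, 10)))  # BFS reaches all 5 vertices
print(len(snowball_sample(star, 0, 1, 10)))     # snowball with n = 1 reaches 2
```

On the star graph the snowball crawl with n = 1 discovers the centre and a single leaf, whereas BFS discovers every vertex, which gives a small-scale picture of how discarding out-edges changes what the sample contains.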
It is experimentally determined that sampling the graph with these methods introduces a bias. According to their work, additional challenges arise when the underlying network itself imposes limitations, such as not allowing retrieval of backward links (as is the case in Formspring) or imposing a strict limit on the rate at which data can be acquired.

The data rate limitations can be overcome by scraping the web interface of OSNs rather than using the APIs. Web scraping successively reads web pages from the OSN's web interface and strips them down to contain only the relevant information. However, as pointed out in [27], this introduces a large overhead, because useless information such as web headers and other hypertext elements has to be downloaded. The size of this overhead must therefore be evaluated to determine whether scraping or the API achieves the higher throughput.

Methods such as the RWRW and MHRW are often used to gather unbiased samples of graphs. In particular, MHRW is evaluated by Rasti et al. [50] and is also used by Krishnamurthy et al. [3] to sample the Twitter network. Krishnamurthy et al. used the method to obtain a ground truth against which to compare the results acquired by their other methods. This shows a certain level of confidence in the unbiased nature of RWRW and MHRW.

In [35] Leskovec mentions that the goals in sampling a network could include:

Back-in-time goal: Where one is interested in obtaining a sample of a network which has the same properties as that network in a previous epoch.
Scale-down goal: Where one is interested in getting a sample of the network which has the same properties as the network in the current epoch.

These are of course only some of the goals one could consider when sampling a network.
In addition to the above problem there is also the more general problem of when to stop the sampling. When do we know that we've sampled enough of the network? This question is critical, as it has many side effects. What if our method is appropriate but we are stopping it too soon? What if we sampled too much and distorted our perfect image? The answer to this question is still largely an open problem. However, in [27] some convergence tactics are discussed and analyzed. These offer the crawler a tool for determining when it is best to halt the crawling.

In general we can consider numerous other sampling goals, which are sometimes case specific. For example, one could consider the goal of measuring the cover time of a graph, or of determining where a SRW of $s$ steps starting from a vertex $u$ is most likely to terminate. These are examples of interesting questions whose relevance depends on the problem at hand. In general, sampling with SRWs is used extensively in the context of property testing (see Section 4.8); for example, in the work of Czumaj et al. [18] SRWs were used to determine whether or not a graph is an $\alpha$-expander (defined in Section 4.3).

6.1 Simple Random Walk

The most popular crawling method used in practice is the SRW. The reason for this is its simplicity and surprising efficiency, which have made it a very attractive method to study and analyze.

Let $G = (V, E)$ be a connected graph. A SRW $W_u$, $u \in V$, on the undirected graph $G = (V, E)$ is a Markov chain $X_0 = u, X_1, \ldots, X_t, \ldots$ on the vertices $V$ associated to a particle that moves from vertex to vertex according to a transition rule. The probability of a transition from vertex $i$ to vertex $j$ is $p(i, j)$ if $\{i, j\} \in E$, and 0 otherwise.

Let $d(v) = d(v, t)$ be the degree of vertex $v \in G(t)$, and let $N(v)$ denote the neighbours of $v$ in this graph.
A vertex $u$ is a neighbor of $v$ if there exists an edge $e = (v, u)$.

In the above general case the stopping criterion can be arbitrary. The most common use case is to stop when a sample of sufficient size and quality has been obtained; in other cases the method may stop when all the vertices have been visited (covered).

The SRW on graphs is in general the algorithm presented in Algorithm 1.

Algorithm 1 SRW
  v ← start vertex
  while not done do
    Visit(v)
    v ← random vertex from N(v)
  end while

This simple method has certain properties:

Transition matrix: Given a graph $G(N, M)$, the transition matrix $P$ gives the probability that each transition will occur in a random walk. Let $P_{uv}$ denote the element at $(u, v)$ of $P$, that is, $P_{uv} = \Pr[\text{given that we are at vertex } u \text{, we move to vertex } v]$:

\[
P_{uv} =
\begin{cases}
\frac{1}{d_u}, & v \in N(u) \\
0, & \text{otherwise}
\end{cases}
\]

where $d_u$ is the degree of vertex $u$ and $N(u)$ denotes the neighborhood of $u$.

Stationary distribution: Given a distribution $\pi$ over the vertices of a graph, which is the proportion of time a random walk has spent at each vertex, the distribution at the next step of a SRW is $\pi' = P^T \pi$. The stationary distribution $\pi_s$ is the distribution with the property $P^T \pi_s = \pi_s$.

Mixing time: The mixing time $t_m$ is the number of steps after which the distribution $\pi_{t_m} \to \pi_s$.

It is proven that in SRWs, as the number of steps $t \to \infty$, the stationary probability of a vertex $u$ is:

\[
\pi_u = \frac{d_u}{2M} \tag{6.1}
\]
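As an illustration of Algorithm 1 and of equation (6.1), the following Python sketch runs a SRW on a small graph and compares the fraction of time spent at each vertex with $d_u / 2M$. It is a sketch under stated assumptions rather than the implementation used in this report: the function names, the adjacency-dictionary representation and the example graph `ADJ` are hypothetical, the "done" criterion is assumed to be a fixed number of steps, and $M$ is assumed to denote the number of edges.

```python
import random
from collections import Counter


def simple_random_walk(adj, start, steps, rng=None):
    """Algorithm 1, with 'done' assumed to mean `steps` visits were made."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    v = start
    visits = Counter()
    for _ in range(steps):
        visits[v] += 1            # Visit(v)
        v = rng.choice(adj[v])    # v <- random vertex from N(v)
    return visits


def transition_matrix(adj):
    """P[u][v] = 1 / d_u if v is a neighbour of u, and 0 otherwise."""
    return {u: {v: (1.0 / len(adj[u]) if v in adj[u] else 0.0) for v in adj}
            for u in adj}


def stationary_distribution(adj):
    """pi_u = d_u / 2M, where 2M is the sum of all vertex degrees."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    return {u: len(nbrs) / two_m for u, nbrs in adj.items()}


# Hypothetical example: a triangle (0, 1, 2) with a pendant vertex 3.
# Undirected, so every edge appears in both adjacency lists; M = 4.
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

pi = stationary_distribution(ADJ)
print(pi)  # {0: 0.25, 1: 0.25, 2: 0.375, 3: 0.125}

# Empirical visit frequencies of a long walk, to compare against pi.
steps = 100_000
visits = simple_random_walk(ADJ, start=0, steps=steps)
print({v: visits[v] / steps for v in sorted(ADJ)})
```

The example graph contains a triangle so that the walk is not periodic, and the empirical frequencies can then be compared directly with the values of $\pi_u$ printed above. The helper `transition_matrix` makes it possible to verify the fixed-point property $P^T \pi_s = \pi_s$ numerically by summing $\pi_u P_{uv}$ over $u$ for each $v$.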