Upgrade Report - Department of Informatics - King's College London
King's College London
Department of Informatics
Algorithm Design Group

Report Submitted for Upgrade to PhD

Author: Yiannis Siantos (S.N: 0972443)
Supervisors: Dr. Colin Cooper, Dr. Tomasz Radzik

February 28, 2012


Contents

Part I: Introduction
  1 Online Social Networks (OSNs)
  2 Large Online Social Networks and Generative Models
  3 Graph sampling and crawling

Part II: Related Work
  4 Network properties and metrics
    4.1 Diameter and degree distribution
    4.2 Degree correlations
    4.3 Expander graphs
    4.4 Connected Components And Community Structure
    4.5 Communities: Conductance
    4.6 Communities: Centrality and modularity
    4.7 Communities: Overlapping communities
    4.8 Property testing
  5 Graph Generation Models
    5.1 Erdős-Rényi Graph
    5.2 Preferential attachment model
    5.3 Edge Copying model
    5.4 Triangle Closing Model
    5.5 Forest Fire Model
    5.6 Stochastic Kronecker Graph
    5.7 Affiliation Networks
  6 Graph sampling and crawling
    6.1 Simple Random Walk
    6.2 Weighted Random Walk
    6.3 Metropolis Hastings Random Walks (MHRWs)

Part III: Our Work
  7 Graph generation
    7.1 Random Walk Graph
    7.2 Preferential attachment and message propagation
    7.3 Grow, Back-Connect, Densify
    7.4 Implicit Graph Model
    7.5 Partitioned Preferential Attachment
  8 Existing Data Set Analysis
    8.1 Ground truth: Complete Twitter snapshot
    8.2 Other Real Networks
      8.2.1 Degree Distributions
      8.2.2 Degree-Based Cut Conductance
      8.2.3 Degree Correlations
  9 Graph Sampling
    9.1 Sampling Real World Networks: Case study of Twitter
      9.1.1 Our initial crawling
      9.1.2 Second crawling, unbiased sampling
    9.2 Uniform Sampling (UNI): A study of efficiency
    9.3 Sampling Manufactured Graphs: Weighted Random Walk
      9.3.1 Theoretical foundation
      9.3.2 Experimental results

Part IV: Future Work
  10 Graph Analysis
    10.1 Real world network characteristics
    10.2 Property Testing and Estimation Algorithms
  11 Graph sampling
    11.1 Algorithm Optimizations
      11.1.1 Algorithm complexity reduction
      11.1.2 Run-time reduction
      11.1.3 Approximation
    11.2 Improvement Of Sampling Efficiency
  12 Graph Generation models

Part V: References
  13 Acronyms

Part VI: Appendix
  A Sampling Manufactured Graphs: BFS Tree Filtering
    A.1 Crawler Definitions
    A.2 Measured properties
      A.2.1 Crawlers With Memory
    A.3 Selection Policies
    A.4 Method Comparison


List of Figures

1 Erdős-Rényi G(n, p) graph with n = 50000, p = 0.0001
2 Degree distribution of the preferential attachment graph
3 Degree distribution of the edge copying graph with γ = 0.5
4 Degree distribution of the triangle closing graph with random-random selection policy
5 Degree distribution of the forest fire model with n = 10^5, m = 1, p = 0.6, r = 0.3
6 Random Walk Models (log-log scales). m = 3, s = 200, N = 50000, power-law coefficient c as noted
7 Comparison of different parameter effects on undirected random walk graphs, N = 50000
8 Preferential Message Propagation (log-log scales). m = 2, p = 0.2, N = 50000
9 Grow-Back Connect-Densify Model (log-log scales). a = b = c = 1/3, p = 0.1, N = 200000
10 Implicit Graph Model. α = 1.25, N = 10000, c = 1.8
11 Partitioned Preferential Attachment. n = 7.5 × 10^5, m = 3, s = 500
12 Twitter 2009 data on log-log scales
13 (In-degrees, Out-degrees) to frequency plot on a log scale
14 Degree distribution of various real networks
15 Degree-Based Cut Conductance
16 Degree Correlations
17 Real Twitter data vs crawled data
18 In-degrees (red), out-degrees (green), total degrees (blue) and tweets (fuchsia) as a function of frequency on a log scale
19 Real 2009 Twitter distribution (green) compared to sampled Twitter distribution (red)
20 KV test on a preferential attachment graph
21 Plots of experimental data showing cover time of all vertices of degree at least t^a as a function of a
22 Plots of experimental data showing cover time of all vertices of degree at least t^a (ignoring a < 0.25) as a function of a in the SlashDot graph
23 Plots of experimental data showing cover time of all vertices of degree at least t^a (ignoring a < 0.35) as a function of a in a sample of the Google web-graph
24 Growth pattern of all methods at 1000, 2000, ..., 64000 vertices
25 Comparison of the degree frequency distributions of the different methods at 64000 visited vertices


Abstract

In this report we describe our progress to date. The topic of this research is the large graphs observed in the World Wide Web (WWW) and OSNs: their structure, generative models and other special or unique characteristics. In particular we discuss Online Social Networks and their representation as graphs. These networks have some very interesting properties, and although they have been analyzed in the past, many open problems remain regarding their local and global structure. Our graph analysis is performed both on theoretically generated and well-understood graphs and on graphs observed in Online Social Networks such as Twitter.

Many new features have been discovered in these graphs, and we hope to present some previously unreported findings regarding their structure and some of their static and dynamic properties. Many of these properties are shared across online networks, although some are expressed to a different degree, if at all.

Based on these observations we present and discuss the models which have been created to simulate these phenomena, together with their strengths and weaknesses. We also suggest some models of our own, in an attempt to better describe the generative process of real world networks.

An important part of what will be presented also covers methods to sample these networks, which is a vital part of our research. Sampling these networks is important because, at their current size and state, it is impossible to obtain them in their entirety. We present some approaches to sampling these networks under specific assumptions and with specific goals, and we discuss the results, including where our methods succeeded and where they failed to achieve the goals we set.

In addition, we present some of our objectives for the remainder of our research, which we believe are critical and achievable goals that will help us further evolve our work and further contribute to the field.


Part I
Introduction

1 Online Social Networks (OSNs)

The reason that OSNs are interesting from a computer science and mathematical perspective may not be obvious. Upon deeper examination, however, this is an issue which has motivated increasing interest over the past few years, due to the explosive growth of OSNs and the relative power that people have within these networks to affect the world around them.

Recent developments in technology have allowed the creation of large networks, available globally via personal computers or, more recently, mobile phones. The original and most outstanding examples of such networks are the WWW and the email network. Very recently, many novel OSNs such as Twitter and Facebook, and online video repositories such as YouTube, have sprung up. These networks, extensively interleaved with each other and the WWW, have a substantial impact on the way our lives are lived. Nobody who has followed the political unrest in North Africa during February of 2011 can be unaware of the importance of, e.g., Twitter in galvanizing and coordinating social behavior.

The WWW is remarkable in its own right, and its structure has been the subject of considerable research aimed at understanding the novel phenomena which are observed, such as the tendency of WWW pages to link to pages to which many other pages have already linked. More recently, through the development of the topic of Web Science, there has been interest in the evolution of such networks as eco-spheres in a biological sense. However, much of the value of the web comes from our ability to search and index it rapidly, through the development of ranking and retrieval algorithms such as those offered by Google.

Our project examines properties of OSNs, and in particular Twitter, which has not yet been examined in detail, to determine whether they share similar structural properties with the WWW or other social networks. The WWW is known to exhibit a scale-free structure, with small-world properties such as small diameter. By understanding graph structure, smaller networks similar to the WWW can be simulated, to allow evaluation of search algorithms.

Our specific interest in this topic has three parts:

1. Finding ways to obtain representative samples of massive networks with limited resources. This will allow us to get a meaningful snapshot of a network without the expense of an extensive sampling of hundreds of millions of home pages.
2. Investigating the graph theoretic structure of the networks.
3. Generating artificial graphs which fit the structure of the examined networks, and analyzing the generation processes.

The results that we will present from crawling these networks, combined with a presentation of the theoretical background, show a similarity of basic structure between the WWW and these other social networks with respect to properties such as the degree frequency distribution. This similarity may arise because of the way these networks were generated. The similarity of structure may allow the creation of a generative model which describes them and exhibits the same characteristics.

One major problem is the very large size of these networks, and the limited amount of resources that are usually available when accessing them for research purposes. To solve this, a way to take representative samples will need to be found, which ideally will produce relatively small samples that give a correct indication of specific properties of interest.

The areas of online social graph structure, unknown graph sampling and graph generation mentioned above are unresolved topics, and many open questions remain. Solving these problems, by developing good theoretical graph generation methods and proposing an effective and efficient way to sample online graphs, is of interest and would provide a useful tool for the community.


2 Large Online Social Networks and Generative Models

While the intuition for modeling an OSN as a graph is rather simple and may even seem trivial, upon interpreting such a structure one discovers that many questions automatically arise. Such questions include:

• What are the underlying processes that take place and contribute to the visible structure of the network?
• Why does it have this rate of growth?
• How do individual actors interact within the network, and how does this affect the network structure?
• What is the most likely structure of the network after some time has elapsed?
• Who is most likely to create more links on the network, and who is more likely to receive those links?
• How can we get a sample, and how do the properties of this sample relate to the properties of the entire network?

These are just a few of the questions that arise. In order to answer them, our work focuses on the three main aspects mentioned in Section 1. In general, modeling real world networks is still an open question. There are numerous models available to generate graphs which match a wide range of properties of real world networks (such as OSNs), but there is still work to be done, with new observations constantly being made.

Generative models work based on various intuitions and can be seen as network evolution models, in the sense that they initialize small graphs and then build upon them, in the same way that a real network starts with a single node and evolves over time. It is believed that this aspect of growth is essential in modeling these sorts of networks, but it is not the only aspect which needs to exist in these models.

3 Graph sampling and crawling

Another important aspect of our research focuses on graph sampling and crawling. In practice we do not differentiate between these two terms, since they are often used to express the same goal. The remarkable size to which real world networks have grown means that it is no longer feasible to obtain entire networks. Even if it were feasible, it might not be desirable, since ideally we would like to minimize the "cost" of obtaining these networks, and most of the time this means obtaining a precise and small sample.

Graph sampling and crawling are similar in many ways: both terms are usually used in the same context, and while sampling is the goal, crawling is the way to achieve it. There are numerous methods available to gather samples from graphs, and many of them are applicable to real world networks. These methods include Simple Random Walks (SRWs) and their variations, as well as Breadth-First Search (BFS) and other similar methods. Many of these methods will be presented throughout this report, as they play an important role in achieving one of our fundamental goals, namely graph sampling.
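As an illustration of the two crawling primitives just mentioned, the sketch below implements a simple random walk and a BFS crawler over an abstract `neighbors` function (a stand-in for whatever API the crawled network exposes; the function name and the restart-on-dead-end behavior are our own assumptions, not part of any cited method):

```python
import random
from collections import deque

def simple_random_walk(neighbors, start, steps):
    """Crawl by a simple random walk: at each step, move to a neighbour of
    the current vertex chosen uniformly at random."""
    walk = [start]
    current = start
    for _ in range(steps):
        nbrs = neighbors(current)
        if not nbrs:               # dead end: restart from the start vertex
            current = start
            continue
        current = random.choice(nbrs)
        walk.append(current)
    return walk

def bfs_crawl(neighbors, start, budget):
    """Crawl in breadth-first order until `budget` vertices have been visited."""
    seen, queue, visited = {start}, deque([start]), []
    while queue and len(visited) < budget:
        v = queue.popleft()
        visited.append(v)
        for u in neighbors(v):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return visited
```

Note that the random walk revisits vertices (its stationary distribution is degree-biased, as discussed later in this report), whereas the BFS crawler visits each vertex at most once.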


Part II
Related Work

4 Network properties and metrics

Our particular area of research is relatively new, mainly because these network structures have only recently emerged. It started with the emergence of the WWW; however, work on graph generation models greatly predates the rise of the WWW, and was previously done to model social networks in the domain of the social sciences.

The WWW graph in particular has received a great deal of attention. This graph is the result of modeling the WWW as a set of pages, which are the vertices of the graph, and a set of directed links between the pages, which are the edges of the graph [32]. Some very interesting properties have been discovered in this graph; for example, the observed degree distribution follows a long-tail distribution [11, 49]. This appears to be a common attribute, present in the WWW graph as well as the Internet (autonomous systems) [24], citation graphs [51], OSNs [24] and many others.

4.1 Diameter and degree distribution

The diameter of the aforementioned networks seems to be relatively small, and a "small world" phenomenon has been observed. This means that the diameter d, or effective diameter [52], is of the scale d ∝ log N, where N is the size of the graph. However, a study by J. Leskovec et al [31] proposed that the diameter is even smaller than this. It is unclear whether or not this observed characteristic of certain networks is related to another observed characteristic, an edge densification power law [31], where |E(t)| ∝ |V(t)|^α, with E(t) and V(t) the sets of edges and vertices at epoch t respectively, and α non-trivial (α > 1).

Graphs with the above properties are commonly referred to as power-law graphs or scale-free graphs, although the latter term is controversial. For simplicity's sake we will use the terms power-law graph and scale-free graph interchangeably, to refer to all graphs which have a power-law degree distribution and a diameter less than or equal to the expected diameter of a small-world network. Additionally, we will require our scale-free graphs to exhibit scale invariance, where these basic properties remain true even at very different sizes of the graph. According to the work of Dill et al [19], the web graph presents a self-similar structure which may be the cause of the aforementioned scale invariance of the web-graph; it may be reasonable to assume that many other observed scale-free graphs, such as OSN graphs, share this characteristic with the web-graph. We believe this is an interesting research objective for our current work.

4.2 Degree correlations

The term "degree correlations" denotes the relationship between the degree of a vertex and the degrees of its neighbors. In the work of Pastor-Satorras et al [48] it has been suggested that there are greater than expected degree correlations in real world networks. The metric defined was based on the conditional probability P_c(k′|k), which denotes the probability that a node of degree k is connected to a node of degree k′. As stated by Pastor-Satorras, due to statistical fluctuations, measuring this probability directly is a rather complex task. He suggested the following alternative, the average degree of the nearest neighbors of a node of degree k:

⟨k_nn⟩ = Σ_{k′} k′ P_c(k′|k)    (4.1)

4.3 Expander graphs

In general, when we talk about the expansion property of a graph, or describe a graph as an expander, any of several different properties may be meant. Expander graphs were first defined by


Bassalygo and Pinsker in the early 70s. There are several expansion properties that we can take into account:

Edge Expansion: The edge expansion h(G), sometimes referred to as the isoperimetric number or Cheeger constant, is a metric which corresponds to graph conductance, described in more detail in Section 4.5.

Vertex Expansion: The vertex isoperimetric numbers (or vertex expansions) h_out(G) and h_in(G) are defined in terms of the outer boundary ∂_out(S) (the vertices outside S with a neighbor in S) and the inner boundary ∂_in(S) (the vertices of S with a neighbor outside S):

h_out(G) = min_{0 < |S| ≤ n/2} |∂_out(S)| / |S|
h_in(G)  = min_{0 < |S| ≤ n/2} |∂_in(S)| / |S|
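For intuition, the outer vertex expansion h_out can be computed by brute force on small graphs; the sketch below uses a dictionary-of-neighbour-sets representation of our own choosing and is purely illustrative, since it enumerates all subsets:

```python
from itertools import combinations

def h_out(adj):
    """Brute-force outer vertex expansion of a small undirected graph.

    `adj` maps each vertex to the set of its neighbours.  We minimise
    |boundary(S)| / |S| over all sets S with 0 < |S| <= n/2, where
    boundary(S) is the set of vertices outside S adjacent to S.
    Enumerating all subsets takes exponential time in n.
    """
    vertices = list(adj)
    best = float("inf")
    for size in range(1, len(vertices) // 2 + 1):
        for subset in combinations(vertices, size):
            s = set(subset)
            boundary = {u for v in s for u in adj[v]} - s
            best = min(best, len(boundary) / len(s))
    return best
```

On a 6-cycle, for example, the minimum is attained by three consecutive vertices, whose outer boundary has only two vertices.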


φ(V_c) = ( Σ_{i ∈ V_c} Σ_{j ∈ V∖V_c} a_ij ) / α(V_c)    (4.6)

where a_ij is the entry (i, j) in the adjacency matrix of the graph and α(V_c) is the number of edges incident to V_c. The conductance of a graph G is defined as:

φ_G = min_{V_c ⊂ V, |V_c| ≤ |V|/2} φ(V_c)    (4.7)

4.6 Communities: Centrality and modularity

As suggested by the work of M. Girvan et al [26], we can detect community structures using alternative network measures such as betweenness centrality, closeness centrality, etc. The idea is to partition the graph by removing edges with high betweenness centrality, where the betweenness centrality C_B measures how many optimal paths go through a specific vertex/edge; more formally:

C_B(u) = Σ_{s ≠ u ≠ t ∈ V} σ_st(u) / σ_st    (4.8)

where σ_st is the number of shortest paths from s to t, and σ_st(u) is the number of shortest paths from s to t that pass through the vertex u.

There is also a measure suggested by J. Newman et al [47], called modularity, which gives a quantitative measurement of the quality of a given community partitioning of a graph. This measure effectively compares the number of edges present within a given partition of the graph with the expected number of edges that would be found in a random graph with the same number of vertices and edges. It is worth mentioning the textbook measurement of modularity, as defined in [47] and shown in Equation 4.9:

Q = Σ_i (e_ii − a_i²),  where  a_i = Σ_j e_ij    (4.9)

In the above, a_i indicates the fraction of edge endpoints that attach to vertices in community i, and e_ii the fraction of edges which are within a community i (i.e. with both source and target within community i).
As stated in the literature, in a network where the number of within-community edges is no better than would be expected in a graph with the same vertices and community split but random edges, the modularity is 0, while values approaching 1.0 indicate very strong community structure. In practice we will consider values over 0.3 to be significantly higher than the expected values, denoting community structure.

The problem with utilizing the modularity measurement, or such partitioning in general, is that finding the optimal separation is an NP-complete problem. Recent work on approximation techniques [6] has made the estimation of near-optimal communities feasible even on very large graphs. This was achieved by iteratively optimizing the modularity of a given partition by moving vertices into other communities, effectively producing a near-optimal partitioning in polynomial time.

4.7 Communities: Overlapping communities

In related work by N. Mishra et al [44] it was suggested that the above criteria are not necessarily sufficient to effectively represent the community structure which is apparent in OSNs. In their work it was suggested that communities need not be disjoint and may, in fact, be heavily overlapping. We find this intuitively reasonable, since an average user of such a network has an array of interests, each belonging to a different category (e.g. sports, politics, region, religion), and in fact OSN users belong to multiple communities based on their interests. They introduce the concept of (α − β)-communities, which are formally defined as follows:

Definition 1. Given a graph G = {V, E} in which every vertex has a self-loop, C ⊂ V is an (α − β)-cluster if it is:


1. Internally dense: ∀u ∈ C, |E(u, C)| ≥ β|C|;
2. Externally sparse: ∀u ∈ V ∖ C, |E(u, C)| ≤ α|C|.

Additionally, given 0 ≤ α < β ≤ 1, the (α − β)-clustering problem is defined as the problem of finding all (α − β)-clusters.

This concept led to an alternative way of describing communities and analyzing OSNs. An example of this is seen in the work of J. He et al [29], who suggested a method to detect such clusters and showed that the community structure present in OSNs is a feature not observed in generated random graphs such as the preferential attachment graph (see Section 5.2).

4.8 Property testing

In general, the term property tester denotes a method which, given an input object, determines whether or not that object satisfies a specific property, or how "far" it is from satisfying that property. In the context of graph algorithms the input object is a graph, and possible properties that could be tested include (e.g.) being bipartite or being an α-expander. In addition, property testers should be able to determine whether a graph is ε-far from having a property, ε being the fraction of edges which would need to be rearranged in order to make the property hold for that graph.

Property testers in general make use of randomized algorithms to make a decision on the problem. We require any property tester to accept graphs for which the property in question holds with probability at least 2/3, and to reject graphs which are ε-far from having the property with the same probability. Since these methods are designed to make a decision in time sub-linear in the input size, it is possible to perform multiple runs to reach an answer within the desired accuracy bounds.

These methods have seen extensive use due to their efficiency and reasonably good accuracy.
For example, Czumaj et al [18] make use of such a method to test a graph's α-expansion property. It has been shown that it is possible to get an answer to such a question in sub-linear time, whereas in general this is a hard problem to compute deterministically.

5 Graph Generation Models

5.1 Erdős-Rényi Graph

In graph theory, the Erdős-Rényi model (or Poisson graph model) refers to either of two models for generating random graphs, including one that sets an edge between each pair of nodes with equal probability, independently of the other edges. It is used in the probabilistic method to prove the existence of graphs satisfying various properties, and to provide a rigorous definition of what it means for a property to hold for almost all graphs. It is the simplest of all graph generation models, and it can be seen from two different aspects, the first called the G(n, M) model and the second the G(n, p) model. These models were examined by P. Erdős et al [22].

G(n, M) model: From all possible graphs with n vertices and M edges, we choose one uniformly at random (UAR).

G(n, p) model: Consider a graph with n vertices. For each vertex pair u, v, the edge (u, v) exists with probability p.

These models are very well analyzed and understood; however, they lack the basic properties that we need to successfully model an OSN graph. For example, in the average case they do not have a power-law degree distribution. An example degree distribution of this graph can be seen in Figure 1.
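Both variants of the Erdős-Rényi model are straightforward to implement; the sketch below is a minimal illustration (the vertex labels 0..n−1 and the edge-list representation are our own conventions):

```python
import random
from itertools import combinations

def gnp(n, p, rng=random):
    """G(n, p): each of the n(n-1)/2 possible edges is present
    independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng=random):
    """G(n, M): choose M distinct edges uniformly at random from all
    possible edges on n vertices."""
    return rng.sample(list(combinations(range(n), 2)), m)
```

The degrees of a G(n, p) graph are binomially distributed and concentrate around (n − 1)p, with no heavy tail, which is consistent with the observation above that this model is a poor fit for OSN degree distributions.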


Figure 1: Erdős-Rényi G(n, p) graph with n = 50000, p = 0.0001

5.2 Preferential attachment model

The preferential attachment model is a graph process used to generate graphs with degree distributions which follow a power law. This generative method was proposed by Barabási and Albert [5] as a generative procedure for a model of the WWW. Surveys by Bollobás and Riordan [7] and Drinea, Enachescu and Mitzenmacher [21] give many related generative procedures to obtain graphs with power-law degree sequences.

In this model, the graph G(t) = G(m, t) is obtained from G(t−1) by adding a new vertex v_t with m edges between v_t and G(t−1). The end points of these edges are chosen preferentially, that is to say proportionally to the existing degrees of the vertices in G(t−1). Thus the probability p(x, t) that vertex x ∈ G(t−1) is chosen as the end point of a given edge is p(x, t) = d(x, t−1)/(2m(t−1)), and this choice is made independently for each of the m edges added. A model generated in this way has a power law of 3 for the degree sequence, irrespective of the number of edges m ≥ 1 added at each step.

It was empirically shown by Barabási that two elements are required in order to obtain a power-law degree distribution, and two models were defined:

Model A: This model retains growth, but each new vertex is attached UAR to existing vertices.

Model B: This model consists of a graph with a static vertex count, in which at each epoch (or time frame) an edge is added preferentially.

Each of the individual models above fails to create scale-free graphs, and it is shown in [5] that only by a combination of models A and B do we obtain a power-law degree distribution.

Preferential attachment graphs have a heavy-tailed degree sequence.
Thus, although the majority of the vertices have constant degree, a very distinct minority have very large degrees. This property is the defining feature of such graphs. A log-log plot of the degree sequence breaks naturally into three parts: the lower range (small constant degree), where there may be curvature, as the power-law approximation is incorrect there; the middle range, of large but well-represented vertex degrees, which gives the characteristic straight line with slope equal to the power-law coefficient; and the upper tail, where the sequence is far from concentrated and the plot is spiky. Due to the structure of the model, as well as the extensive analysis that has been done on it, this will be the main reference model for our experiments.
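The degree-proportional choice p(x, t) = d(x, t − 1)/(2m(t − 1)) can be implemented with the standard endpoint-list trick; keeping one list entry per edge endpoint makes a uniform draw from the list a degree-proportional draw over vertices. A minimal sketch (our illustration, not the report's implementation; the seed vertex and multi-edge behaviour are assumptions):

```python
import random

def preferential_attachment(n, m, rng=random.Random(0)):
    """Grow G(m, t): each new vertex v attaches m edges to existing
    vertices chosen with probability proportional to current degree."""
    endpoints = [0]          # one entry per edge endpoint; seed vertex 0
    edges = []
    for v in range(1, n):
        # draw all m targets before updating, so each draw uses G(t-1)
        targets = [rng.choice(endpoints) for _ in range(m)]
        for u in targets:
            edges.append((v, u))
            endpoints.extend((v, u))   # both ends gain one degree unit
    return edges
```

A log-log histogram of the resulting degrees shows the three regimes described above, with exponent near 3 for large n.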


The preferential attachment model was refined by Bollobás and Riordan [8,9], who introduced the scale-free model to make detailed calculations of the degree sequence and diameter. The model was generalized by many authors, including the web-graph model of Cooper and Frieze [15]. The web-graph model is very general: it allows the number of edges added at each step to vary, allows edges from new vertices to choose their end points preferentially or uniformly at random, and allows insertion of edges between existing vertices. By varying these parameters, preferential attachment graphs with degree sequences exhibiting power laws c in the interval (2, ∞) are obtained. Assuming that m edges are added at every step, we refer to this generalized (web-graph) process with power law c as G(c, m, t).

In [13], Cooper noted the result that the power law c for preferential attachment graphs and web-graphs can be written explicitly as

c = 1 + 1/η,   (5.1)

where η is the expected proportion of edge end points added preferentially. In the Barabási–Albert model η = 1/2, as each new edge chooses one existing neighbor vertex preferentially, thus explaining the power law of 3 for this model.

The value η occurs naturally in such models in the expression for the expected degree of a vertex. Let d(s, t) denote the degree at step t of the vertex v_s added at step s. The expected value of d(s, t) is given by

E d(s, t) ∼ m (t/s)^η,   (5.2)

where η is the parameter defined above (see e.g. [17]). Thus, in the preferential attachment model of [5], E d(s, t) ∼ m (t/s)^{1/2}.

The actual value of d(s, t) is not particularly concentrated around E d(s, t), but the following inequalities, proved in e.g. [17] and [13], are adequate for our proofs.
The inequalities hold With High Probability (whp) for all vertices in G(c, m, t):

(t/s)^{η(1−ε)} ≤ d(s, t) ≤ (t/s)^η log² t,   (5.3)

where ε > 0 is some arbitrarily small positive constant (e.g. ε = 0.00001). The upshot of this, and our reason for explaining it to the reader, is that all vertices v added after step log^{2/η+1} t have degree d(v, t) = o(t^η) whp.

Preferential attachment graphs have diameter

Diam(G(m, t)) = O(log t)   (5.4)

whp. This was improved for scale-free graphs by Bollobás and Riordan, but crude proofs can be made for the general web-graph model based on the expansion properties of the graph. The resulting degree distribution of such a graph can be seen in Figure 2.

5.3 Edge Copying model
Like the preferential attachment model, the edge copying model [32] produces scale-free graphs; however, there is no explicit concept of preferential attachment involved. The model works as follows:

• Starting with an initial graph, at each step a new vertex v arrives.
• For this vertex v we select a vertex u UAR, which will serve as a proxy for v.
• For each edge (u, w_i), with probability 1 − γ we direct an edge from v to w_i.
• With probability γ we instead select a vertex v′ UAR and direct an edge from v to v′.

This model has been shown to produce power laws with a coefficient of 1 + 1/(1 − γ), and community structure has been observed in the resulting graphs. The resulting degree distribution from the above generation model can be seen in Figure 3.
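The four steps above can be sketched directly. The following is our illustration under stated assumptions (a two-vertex seed graph, and the γ branch applied per copied edge, as the bullets describe); it is not the reference implementation of [32]:

```python
import random

def edge_copying(n, gamma, rng=random.Random(0)):
    """Edge copying sketch: each arriving vertex v picks a proxy u UAR
    and, for each out-edge (u, w) of the proxy, either copies it
    (edge v -> w, probability 1 - gamma) or instead links v to a
    vertex chosen UAR (probability gamma)."""
    out = {0: [], 1: [0]}              # assumed seed graph
    for v in range(2, n):
        u = rng.randrange(v)           # proxy chosen UAR
        out[v] = []
        for w in out[u]:
            if rng.random() < 1 - gamma:
                out[v].append(w)               # copy the proxy's edge
            else:
                out[v].append(rng.randrange(v))  # uniform fallback
    return out
```

The copying step is what produces the implicit rich-get-richer effect: a vertex with many in-edges is proportionally likely to be copied again.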


Figure 2: Degree distribution of the preferential attachment graph

Figure 3: Degree distribution of the edge copying graph with γ = 0.5


Figure 4: Degree distribution of the triangle closing graph with random-random selection policy

5.4 Triangle Closing Model
This model was created based on the observation that most links in OSNs are local, no more than 2 hops apart [36]. Several variations of this model are discussed, but all follow the same idea: a vertex arrives in the network and creates an edge to another vertex at random, then proceeds to form additional links by selecting vertices which are 2 hops away from it. How such a vertex is selected depends on a given policy, and several policies have been proposed. It was suggested by Leskovec that the random-neighbor-of-random-neighbor policy is the simplest one, and it worked surprisingly well when the generated graphs were compared to some real OSNs with respect to how each vertex directs edges to other vertices. While other policies, such as most-active-neighbor-of-most-active-neighbor, may work better, the increase in accuracy is marginal [36]. The resulting degree distribution from the above generation model can be seen in Figure 4.

5.5 Forest Fire Model
This model, first proposed by Leskovec et al [39], is an intuitive model of how OSNs evolve over time. It is very similar to the triangle closing model when viewed from the perspective of the locality of new links. Additionally, it has the advantage of generating not only power-law degree distributions in both directions but also communities.
Moreover, the edge densification power-law [31] holds at each step of the generative process, and the diameter decreases at each step. The method is as follows:

• At each step we add a new vertex v and direct m edges to other vertices u_1, ..., u_m selected UAR. We call these vertices the ambassadors of v.
• Recursively, for each new edge e_i = (v, u_n) added, we generate two random numbers x and y which follow geometric distributions with means p/(1 − p) and rp/(1 − rp) respectively, where 0 ≤ p, r < 1.
• We obtain two sets of vertices S_1 = {s_1, ..., s_x} and S_2 = {s_{x+1}, ..., s_{x+y}}, where ∃e_1 = (u_n, s_i) ∀s_i ∈ S_1 and ∃e_2 = (s_j, u_n) ∀s_j ∈ S_2.


Figure 5: Degree distribution of the forest fire model with n = 10^5, m = 1, p = 0.6, r = 0.3

• We proceed to create x edges e_i = (v, s_i) for i ∈ [1, ..., x], and y edges e_j = (s_j, v) for j ∈ [x + 1, ..., x + y].
• We continue this process until no new edges are added, at which point we add a new vertex and repeat the above process.

While this method has very good intuition behind why it should generate graphs which look like OSNs, and in practice it does generate such graphs under certain conditions, the complexity of the method has so far prevented its analysis. There are still uncertainties about why the generated graphs look the way they do and how the input parameters (p, r, m) affect the properties of the generated graph. For a range of those parameters the graph may tend towards a complete graph, where by definition the desired properties are not met.

Additionally, Leskovec has shown that a specific range of parameters, which he called the "sweet spot", is required to generate graphs which actually have the properties described, and this range of parameters is very narrow. Our experience has shown that even small deviations of the parameters may lead to chaotic behavior of the generative process. The degree distribution of a graph generated using this method can be seen in Figure 5.

5.6 Stochastic Kronecker Graph
This method as a generation model was first proposed by Leskovec et al [38].
This method takes advantage of the self-similar structure observed in many real-world graphs and uses the Kronecker matrix product (a form of tensor product applied to matrices) in order to generate graphs. The method requires a good initiator matrix to be set; given that matrix, the properties of a generated graph of any size can be calculated from the properties of the initiator matrix product. However, choosing an appropriate initiator matrix is still an open problem, which makes the method less effective for generating graphs than it could potentially be.

The major advantage and use of this method, however, is not to generate graphs but to do exactly the opposite: to find which initiator matrix could have produced a graph similar to a given graph. It was suggested by Leskovec that finding a good initiator which is most likely to have produced a


given graph and then applying the Kronecker multiplication will produce a graph very similar to the original graph. This can be used to model the given network at a different scale, or to determine how it may look when it grows. The problem of finding the initiator, while NP-complete in general, was proven to be solvable by approximation in O(n) time [37], and this may prove to be a valuable tool in analyzing networks and their temporal evolution.

The Kronecker product is an operation on two matrices of arbitrary size resulting in a block matrix; it is completely unrelated to the normal matrix product. Assume we have two matrices M_A = (a_{i,j}) and M_B = (b_{i,j}) of dimensions m × n and p × q respectively. The Kronecker product of these matrices is symbolised as M_A ⊗ M_B and is the block matrix

             ⎛ a_{1,1} M_B   a_{1,2} M_B   ···   a_{1,n} M_B ⎞
M_A ⊗ M_B =  ⎜      ⋮             ⋮         ⋱         ⋮      ⎟
             ⎝ a_{m,1} M_B   a_{m,2} M_B   ···   a_{m,n} M_B ⎠

of dimensions mp × nq. In the case of the Stochastic Kronecker Graph we require that M_A = M_B, m = n and 0 < a_{ij} ≤ 1. We write M_A ⊗ M_A as M_A^(2), with elements a_{ij}^(2). The matrix M_A^(i) is taken to be the adjacency probability matrix of a graph: to generate the actual graph which results from the n-th Kronecker multiplication, we create a graph where for each vertex pair u_i, u_j the probability that there is an edge e_{ij} = (u_i, u_j) is P(e_{ij}) = a_{ij}^(n). Further analysis of this model was done by M.
Mahdian et al [42], and it was shown that there are phase transitions for the emergence of a giant component and for connectivity. Additionally, Mahdian proved that the diameter beyond the connectivity threshold is constant.

5.7 Affiliation Networks
In the work of S. Lattanzi et al [33] it is pointed out that the theoretical understanding of pre-existing graph generation models fails to explain the properties which were recently observed in social networks [31,38]. They propose a new method, based on previous work on bipartite models of social networks, which aims to "capture the affiliation of agents to societies".

In this model there are two distinct graphs:

• A bipartite graph that represents the affiliation network, which in [33] is referred to as B(Q, U).
• The social network graph, which in [33] is referred to as G(Q, E).

The set Q is shared between both graphs. The intuition behind this model is based on observations of social phenomena in online graphs (such as the citation network). In their example, Q is the set of papers and U the set of topics those papers are about. When a new paper emerges it is likely to be based on an older paper, referred to as the prototype, and it is also likely to focus on (a subset of) the topics on which the prototype focuses. Based


on this, new vertices emerging in Q carry an edge-copying flavor, as described in Section 5.3. In a similar fashion, when a new topic emerges it is likely to be based on, or inspired by, an existing topic.

Based on the above example, the graph G(Q, E) is constructed taking into consideration that when an author adds references to a new paper, that author will cite most or all of the papers on that topic and some papers of general interest. It is mentioned that this intuition can be applied to other social graphs as well: we can consider Q to be a set of interests a person may have and G(Q, E) to be a social interaction graph between people. It is more likely for people who share an interest to be connected in G, and in addition people without common interests may be connected as a result of the popularity of one or both of them.

From this intuitive understanding two factors emerge: an edge-copying flavor and a flavor of preferential attachment based on degree. The suggested model contains both these elements. Since the graph B(Q, U) uses the edge copying mechanism heavily, it exhibits a power-law degree distribution and a community structure, as proven in the paper. The graph G(Q, E), which also includes the degree-preferential attachment mechanism as well as common affiliation links between vertices of Q, also exhibits this phenomenon.
In addition, this model claims a densification power-law and bounded diameter.

This model is worth noting because all of the above properties have been observed in most online graphs (social, citation, Peer-To-Peer (P2P), etc.), but more importantly because it provides a proven power-law degree distribution, densification and bounded diameter.

6 Graph sampling and crawling
Generally speaking, graph sampling is a very underdeveloped topic. Put simply, the big question is: "How can one sample only a part of a graph and yet maintain certain structural information that is present in the entire graph?" This question has many interpretations and shades of meaning. In our case, for example, we are interested in obtaining a crawled sample of a preferential attachment graph which is a good model of the graph in its entirety, so that the sample studied on its own exhibits certain required properties. In particular, we want the degree distribution observed in the entire network to be observed, at scale, in our sample of the network. Other typical examples might be the clustering coefficient or the diameter.

Work on efficient sampling of network characteristics arises in many areas. In the context of search engine design, studies on optimally sampling the URL crawl frontier to rapidly sample (e.g.) high PageRank vertices, based on knowledge of vertex degree in the current sample, can be found in e.g. [4]. Within the random graph community, trace-route sampling was used to estimate cumulative degree distributions, and methods of removing the high-degree bias from this process were studied in e.g. [1], [25]. Another approach, analyzed in [10], is the jump-and-crawl method to find (e.g.)
all very high degree vertices. The method uses a mixture of uniform sampling followed by inspection of the neighboring vertices, in time sub-linear in the network size.

In the context of online social networks, exploration has often focused on how to discover the entire network more efficiently. Until recently this was feasible for many real-world networks, before they exploded to their current size. It is no longer feasible to get a consistent snapshot of the Facebook network, for example.¹

Methods based on the SRW are commonly used for graph searching and crawling, and such methods have been used and analyzed extensively. Stutzbach et al [53] compare the performance of BFS with an SRW and an MHRW [28,43] on various classes of random graphs as a basis for sampling the degree distribution of the underlying networks. The purpose of the investigation was to sample from dynamic P2P networks. In a related study, M. Gjoka et al [27] made extensive use of the above methods to collect a sample of Facebook users. As Simple Random Walks (SRWs) are degree biased, they used a re-weighting technique to unbias the sampled degree sequence output by the random walk. This is referred to as a Re-Weighted Random Walk (RWRW) in [27]. In both the above cases it was shown the bias could be removed dynamically by

¹ According to the Facebook statistics page at http://www.facebook.com/press/info.php?statistics (retrieved on 02 June 2011), at the time there were over 500 million active users (the exact number was not mentioned) and the average user had 130 friends.


using an MHRW and selecting an appropriate target distribution. This indicates that there are application- or network-specific optimizations that can be done on random walks in order to tune them to the required task.

An interesting experimental analysis of sampling methods such as Respondent-Driven Sampling (RDS) and the Metropolis-Hastings Random Walk has been done by A. H. Rasti et al [50], showing the effect that graph structure and size have on the efficiency of these methods. Several graph types were used, including the Erdős–Rényi random graph, the Small World graph, the Barabási–Albert (preferential attachment) graph and the Hierarchical Scale-Free graph, the latter being a scale-free graph with a structure of clusters within clusters. It was shown that the above methods had reduced efficiency when applied to the Hierarchical Scale-Free graph.

Some work has been done by F. Wu et al [23], whose work has provided us with the framework for the definitions and the methods of crawling. For each method several properties were measured, such as:

Efficiency How fast new vertices are discovered.

Sensitivity How much the results vary depending on the target social network and the percentage of protected users within that network.

Bias How different statistical properties are distorted in the samples.

The above are the general concerns of graph crawling, as we require our methods to be unbiased, successful and independent of the entry point.

In a related study conducted by A. Mislove et al [45] the problem area is very well described. According to [45], the major issues that arise in sampling have to do with getting an unbiased sample quickly and effectively.
Several methods are discussed, such as Breadth-First Search (BFS) and the snowball method. The snowball method is effectively a BFS which terminates prematurely for each vertex, i.e. it samples only n out-edge targets for each discovered vertex rather than all of them, and rejects the rest. It is experimentally determined that sampling the graph with these methods introduces a bias. According to their work, additional challenges arise when the underlying network itself imposes limitations, such as not allowing retrieval of backwards links (as is the case in Formspring) or imposing a strict limit on the rate at which data can be acquired.

Data rate limitations can be overcome by scraping the web interface of OSNs rather than using the APIs: web scraping successively reads web pages from the OSN's web interface and strips them down to only the relevant information. However, as pointed out in [27], this introduces a large overhead, since useless information such as web headers and other hypertext elements has to be downloaded. We must be very careful, and the size of the overhead must be evaluated, to determine whether we will achieve a higher throughput via scraping or by using the API.

Methods such as the RWRW and MHRW are often used to gather unbiased samples of graphs; in particular, the MHRW is evaluated by Rasti et al [50] and also used by Krishnamurthy et al [3] to sample the Twitter network. The work of Krishnamurthy used the method to obtain a ground truth to compare against the results acquired by their other methods.
This shows a certain level of confidence in the unbiased nature of RWRW and MHRW.

In [35] Leskovec mentions that the goals in sampling a network could include:

Back-in-time goal Where one is interested in obtaining a sample of a network which has the same properties as that network in a previous epoch.

Scale-down goal Where one is interested in getting a sample of the network which has the same properties as the network in the current epoch.

These are of course only some of the goals one could consider when sampling a network.


In addition to the above problems there is also the more general problem of when to stop sampling. When do we know that we have sampled enough of the network? This question is critical, as it has many side-effects. What if our method is appropriate but we stop it too soon? What if we sample too much and distort our picture? The answer to this question is still largely an open problem. However, in [27] some convergence tactics are discussed and analyzed, which offers the crawler a great tool for determining when it is best to halt the crawling.

In general we can consider numerous other sampling goals, which are sometimes case-specific. For example, one could consider the goal of measuring the cover time of a graph, or of determining where an SRW of s steps starting from a vertex u is most likely to terminate. These are some examples of possible interesting questions, which always depend on the problem at hand. In general, sampling with SRWs is used extensively in the context of property testing, seen in Section 4.8; for example, in the work of Czumaj et al [18] SRWs were used to determine whether or not a graph is an α-expander (defined in Section 4.3).

6.1 Simple Random Walk
The most popular crawling method used in practice is the SRW. The reasons for this are its simplicity and its surprising efficiency, which have made it a very attractive method to study and analyze.

Let G = (V, E) be a connected graph. An SRW W_u, u ∈ V, on the undirected graph G = (V, E) is a Markov chain X_0 = u, X_1, ..., X_t, ... on the vertices V, associated with a particle that moves from vertex to vertex according to a transition rule.
The probability of a transition from vertex i to vertex j is p(i, j) if {i, j} ∈ E, and 0 otherwise. Let d(v) = d(v, t) be the degree of vertex v ∈ G(t), and let N(v) denote the neighbours of v in this graph; a vertex u is a neighbour of v if there exists an edge e = (v, u).

In the above general case the stopping criterion can be arbitrary. The most common use case is to stop when a sample of sufficient size and quality has been obtained; in other cases the method may stop when all vertices have been visited (covered). The SRW on graphs is, in general, the algorithm presented in Algorithm 1.

Algorithm 1 SRW
v ← start vertex
while not done do
    Visit(v)
    v ← random vertex from N(v)
end while

This simple method has certain properties:

Transition matrix Given a graph G(N, M), the transition matrix P holds the probability of each transition in the random walk. Let P_uv denote the element at (u, v), so that P_uv = Pr["given that we are at vertex u we move to vertex v"]:

P_uv = 1/d_u if v ∈ N(u), and 0 otherwise,

where d_u is the degree of vertex u and N(u) denotes the neighborhood of u.

Stationary distribution Given a distribution π over the vertices of a graph, being the proportion of time a random walk has spent at each vertex, the distribution at the next step of an SRW is π′ = Pᵀπ. The stationary distribution π_s is the distribution with the property Pᵀπ_s = π_s.

Mixing time The mixing time t_m is the number of steps after which the distribution π_{t_m} → π_s.

It is proven that for SRWs, as the number of steps t → ∞, the stationary probability of a vertex u is:

π_u = d_u / (2M).   (6.1)
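Algorithm 1, together with the correction that Equation (6.1) suggests (weight each visit to v by 1/d(v), as in the RWRW), can be sketched as follows. This is our illustration, assuming graphs stored as adjacency-list dicts; `srw_sample` and `reweighted_degree_dist` are hypothetical names, not from the report:

```python
import random
from collections import Counter

def srw_sample(adj, start, steps, rng=random.Random(0)):
    """Run a simple random walk for `steps` steps and return the
    list of visited vertices (the raw, degree-biased sample)."""
    v = start
    visits = [v]
    for _ in range(steps):
        v = rng.choice(adj[v])        # uniform over neighbours: the SRW rule
        visits.append(v)
    return visits

def reweighted_degree_dist(adj, visits):
    """RWRW-style correction: weight each visit to v by 1/d(v),
    cancelling the stationary bias pi_u = d_u / 2M of Equation (6.1)."""
    weight = Counter()
    for v in visits:
        weight[v] += 1.0 / len(adj[v])
    total = sum(weight.values())
    dist = Counter()
    for v, w in weight.items():
        dist[len(adj[v])] += w / total   # estimated degree distribution
    return dist
```

On a star graph, for example, the raw walk spends about half its time at the hub, while the re-weighted estimate recovers the true vertex proportions.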


This property shows the bias of the SRW towards high degree vertices. Using Equation 6.1 we can easily correct for the bias of the random walk by normalizing each vertex's weight with its stationary probability, as shown by E. Volz et al [54]. This method of re-weighting the vertices is the idea behind the RWRW.

Additionally, Stutzbach et al [53] use a random walk method based on the Metropolis sampling method, more commonly known as the MHRW (see Section 6.3), to gather unbiased samples of P2P networks. This method is similar to a simple random walk with re-weighting, but instead of estimating the bias of the walk after it has been completed, it uses a rejection policy to force the sampling to follow a given distribution, in most cases the uniform distribution. This indicates that there are application- or network-specific optimizations that can be done on random walks in order to tune them to the required task.

6.2 Weighted Random Walk
As we will see in Equation (6.6), there is more to random walks than just a uniform random selection of the next target. To this end, the idea of Weighted Random Walks (WRWs) comes as a natural next step. There are cases where the edges are weighted, usually denoting edge significance, and an SRW is not sufficient to capture the transitions in such a model: the transition probability of the walk should take these weights into account. In this case we define a WRW with the following transition probability:

P_uv = w(u, v) / Σ_{k∈N(u)} w(u, k) if ∃e = (u, v) ∈ E, and 0 otherwise.

We next note some facts about random walks, which can be found either in Aldous and Fill [2] or Lovász [41].
The weight w(e) of an edge e has the meaning of conductance in electrical networks, and the resistance r(e) of e is given by r(e) = 1/w(e). The general theory of weighted random walks is given in Chapter 3 of [2].

The commute time K(u, v) between vertices u and v is the expected number of steps taken to travel from u to v and back to u. The commute time for a weighted walk is given by

K(u, v) = w(G) R_eff(u, v).   (6.2)

Here w(G) = 2 Σ_{e∈E(G)} w(e), where E(G) is the set of edges of a graph G, and R_eff(u, v) is the effective resistance between u and v when G is taken as an electrical network with edge e having resistance r(e). For our proofs we do not need to calculate R_eff(u, v) very precisely, but rather note that if uPv is any path between u and v then

R_eff(u, v) ≤ Σ_{e∈uPv} r(e).

For u ∈ V and a subset of vertices S ⊆ V, let C_u(S) be the expected time taken for W_u to visit every vertex of S. The cover time C_S of S is defined as C_S = max_{u∈V} C_u(S). We define a walk as seeded if it starts in S, and the seeded cover time C*_S of S as C*_S = max_{u∈S} C_u(S). For a random walk starting in a set S, the cover time of S satisfies the following Matthews bound:

C_S ≤ max_{u,v∈S} H(u, v) log |S|.   (6.3)

For u ≠ v, the variable H(u, v) is the expected time to reach v starting from u (the hitting time). The commute time K(u, v) is given by K(u, v) = H(u, v) + H(v, u), so K(u, v) > H(u, v).

6.3 Metropolis Hastings Random Walks (MHRWs)
The MHRW is another form of random walk, first proposed by Metropolis et al [43] and later analyzed and generalized by Hastings [28]. The idea is to manufacture a walk which samples from a desired distribution.
Assume that while running the random walk method we reach vertex X and generate a random neighbor Y as the target for the next step. The MHRW accepts Y as the next vertex with probability:


a(u, v) = min{1, [π(v) q(u/v)] / [π(u) q(v/u)]},   (6.4)

where π(u) is the desired distribution we want for vertex u, and q(u/v) is the probability that we select u as the next potential target given that we are currently at v.

The transition probability of the MHRW is defined in Equation 6.5:

P_uv = (1/d_u) a(u, v)                  if v ∈ N(u),
       1 − Σ_{w∈N(u)} (1/d_u) a(u, w)   if v = u,
       0                                otherwise.   (6.5)

The general algorithm implementing this walk on a graph is as follows.

Algorithm 2 MHRW
v ← start vertex
Visit(v)
while not done do
    next ← random vertex from N(v)
    p ← number ∈ [0, 1] generated UAR
    if p < a(v, next) then
        v ← next
        Visit(v)
    else
        v ← v
    end if
end while

The MHRW is manufactured in such a way that we can obtain any desired distribution when sampling our graph. This can be seen in the work of M. Gjoka et al [27], where the MHRW is used to gather samples from Facebook. Because they required the distribution over vertices to be π(u) = 1/|V| (uniform vertex selection), the acceptance probability a(u, v) was:

a(u, v) = min(1, d_u/d_v) if ∃e = (u, v) ∈ E, and 0 otherwise.   (6.6)
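Algorithm 2 with the uniform target of Equation (6.6) can be sketched as follows. This is our illustration (adjacency-list dict representation and the function name are assumptions), not the crawler of [27]:

```python
import random

def mhrw_sample(adj, start, steps, rng=random.Random(0)):
    """Metropolis-Hastings random walk targeting the uniform
    distribution: a proposed move u -> v is accepted with probability
    min(1, d_u / d_v), as in Equation (6.6); on rejection the walk
    stays at u (a self-loop step)."""
    v = start
    visits = [v]
    for _ in range(steps):
        cand = rng.choice(adj[v])            # propose a random neighbour
        if rng.random() < min(1.0, len(adj[v]) / len(adj[cand])):
            v = cand                         # accept: move to the candidate
        visits.append(v)                     # on rejection v is unchanged
    return visits
```

Unlike the SRW, whose visit frequencies converge to π_u = d_u/2M, the visit frequencies here converge to 1/|V| for every vertex; on a 5-vertex star the hub is visited about a fifth of the time rather than half.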


Part III
Our Work

7 Graph generation
As part of our ongoing work we have implemented some known graph generation models, as well as some novel ones, with which we have experimented in order to determine the model best suited to simulating the Twitter network. Additionally, the generated graphs have been a useful tool for testing our graph crawling techniques, without the limitations that come with crawling real OSNs.

Our immediate goals in simulating the Twitter network include the creation of a model which generates graphs whose degree distributions follow a power law in both in- and out-degrees and which have at most a small-world diameter. In addition, we need our model to create a community structure. We have implemented some existing models, some of which achieve the above goals partially. We also propose a few new methods which we believe have a great deal of potential for simulating real-world networks.

7.1 Random Walk Graph
This is a very simple model which intuitively simulates a user searching through the network via his/her local links. At each step the method creates a new vertex u and attaches it to a random existing vertex v in the graph. It then performs m SRWs of s steps each and stores all visited vertices in a list L, allowing repeated entries. After the end of each walk a vertex u_r is selected at random from the list and an edge (u, u_r) is created. Ideally this should generate power laws, since it has both growth and some flavor of preferential attachment, and experimentally it does. Additionally, it generates a good community structure (modularity around 0.4) due to the locality of the links. There are two variations of the model: using undirected random walks (i.e.
ignoringedge direction) or using directed random walks and walking only along out-edges. The power-law coefficientis very steep, with the directed variation having an in-degree power-law coefficient <strong>of</strong> around 3.2. The modelis directed but only presents with power-laws in the in direction, which by design is what it is intended todo at present. It is unclear on how the parameters affect it therefore it may be a good candidate model butneeds further analysis and modifications.In Figure 6 we can see the degree distributions <strong>of</strong> graphs generated using both variations <strong>of</strong> this model:(a) Directed SRW. c = 3.2 (b) Undirected SRW. c = 2.25 (c) Comparisson On In-DegreesFigure 6: Random Walk Models (log-log scales). m = 3, s = 200, N = 50000, Power-law Co-Efficient c asnotedFrom further experimentation we have determined that the number <strong>of</strong> steps in both variations <strong>of</strong> modeldo not seem to affect the generated power-law co-efficient. However what the number <strong>of</strong> steps does affect isthe modularity Q <strong>of</strong> the graph, which depending on s when m was kept constant (m = 3) was determinedto vary from Q = 0.5 for s = 10 to Q = 0.25 for s = 200. Furthermore what does affect the power-lawco-efficient is m which larger m means smaller co-efficients. Additionally m seems to affect Q which whenwe kept s constant (s = 50) Q appears to drop from Q = 0.6 for m = 2 down to Q = 0.25 for m = 5The degree distributions <strong>of</strong> both the above experiments can be seen in Figures 7.
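The generation procedure of this model can be sketched as follows. This is a minimal illustrative implementation of the undirected variation; the choice to start each walk at the new vertex u, and the dict-of-adjacency-lists representation, are our own assumptions:

```python
import random

def random_walk_graph(n, m, s, seed=None):
    """Grow a graph: each new vertex u attaches to a random existing
    vertex, then runs m simple random walks of s steps; each walk
    contributes one extra edge to a vertex sampled from the visited
    list (repeats allowed, giving a preferential-attachment flavor)."""
    rng = random.Random(seed)
    adj = {0: [1], 1: [0]}              # start from a single edge
    for u in range(2, n):
        v = rng.randrange(u)            # attach to a random existing vertex
        adj[u] = [v]
        adj[v].append(u)
        for _ in range(m):
            visited, cur = [], u
            for _ in range(s):          # simple random walk, repeats kept
                cur = rng.choice(adj[cur])
                visited.append(cur)
            w = rng.choice(visited)     # pick the endpoint from the visited list
            adj[u].append(w)
            adj[w].append(u)
    return adj
```

Each new vertex contributes exactly 1 + m edges, so the walk length s influences only which endpoints are chosen (and hence the locality of links), not the edge density, consistent with the observation above that s affects Q but not the power-law coefficient.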


Figure 7: Comparison of different parameter effects on undirected random walk graphs, N = 50000: (a) constant m = 3, varying s; (b) constant s = 50, varying m.

7.2 Preferential attachment and message propagation

This model is essentially a combination of the well-known preferential attachment model with an intuitive flavor of the Twitter link-creation procedure. The idea is that there are two ways of creating links. The first way is traditional undirected preferential attachment on a directed graph, where a new vertex joins and connects to m other vertices with probability proportional to each vertex's total degree. The alternative way is that an existing vertex activates and begins transmitting a message down its out-edges. Each recipient of that message has probability p of retransmitting it. After the process has died out (or reached a maximum number of hops), m of the vertices which retransmitted choose to create a link to the originator of the message.

The first variation of this model consists of two phases: a growth phase, which is the same process as the preferential attachment model, followed by a message propagation phase. The second variation alternates between a preferential attachment step and a message propagation step until the graph reaches a desired size.

The message propagation (or densify) phase, which consists of N steps (N being the number of vertices of the graph), performs the following per step:

1. Given the vertex u_i, where i is the current step (i ≤ N), add <u_i, 0> to a queue Q, where the second component h = 0 represents the current number of hops.

2. While the queue Q is not empty, let q = <u, h> be an entry in the queue, and do the following:
• Remove q from Q.
• For every out-edge e_j = (u, v_j): with probability p, add v_j to a list L and <v_j, h + 1> to Q.

3. Select m vertices {w_1, ..., w_m} UAR from L and create edges e_j = (w_j, u_i), 0 < j ≤ m.

Here we would like to point out that during the first step of the message propagation (or densify) procedure we expect pm vertices to retransmit the message. If pm ≥ 1 then with high probability the process will run for a very long time or never terminate. For that reason we have two different ways of ensuring termination. The first is to never add the same vertex to Q or L more than once, which means the process terminates after at most NM steps (where M is the number of edges). The second is to put an upper bound h_max on the maximum number of hops for which we will add items to the queue, which constrains the number of steps to p Σ_{i=1}^{h_max} m^i. In practice both methods are used, to optimize for both time and space, with h_max = 16 in the example we will be presenting.

The first variation of the above process generates power-law degree distributions in both the out-degrees and in-degrees, as seen below. Additionally some community structure is created, with the modularity Q ranging between 0.25 and 0.5. It is empirically observed that p affects Q, but we have not yet determined how and why this occurs. Intuitively, we assume that the aspect of the model which generates the community structure is the densification, and we would expect values of p such that pm = 1 to result in a higher success rate of message propagations while limiting the number of hops that the message


would reach, giving more local links and therefore a better community structure. The power-law coefficient of the model presented below, with the given parameters, is c = 2.7, and approaches c = 3 for larger graph sizes in the in-direction, as expected from the preferential attachment portion. The out-degree power-law coefficient also appears to approach the same value as the in-degree one, although it is not clear why this occurs.

The resulting degree distribution plot of the variation of the model in which the message propagations are performed after the preferential attachment is complete is seen in Figure 8.

Figure 8: Preferential Message Propagation (log-log scales). m = 2, p = 0.2, N = 50000

7.3 Grow, Back-Connect, Densify

This model adds an additional flavor to the preferential attachment with message propagation model described above. In the context of this model we call the preferential attachment step the grow step and the message propagation step the densify step. In addition we have another element, called the back-connect element. The parameters of the model are a, b, c and p, where a + b + c = 1, a, b, c ≥ 0 and 0 ≤ p < 1.

Starting with a complete graph of 3 vertices, we perform the following at each step:

• With probability a we add a new vertex u, choose a vertex v preferentially, and create an edge (v, u) (Grow).
• With probability b we choose a vertex u UAR and a vertex v preferentially, and create an edge (u, v) (Back-Connect).
• With probability c we perform a message propagation similar to the one described in Section 7.2, where p is the message transmission probability. The difference in this case is that all new edges created at the last step of the propagation are e_j = (u_i, w_j), j ∈ [1, ..., m].

Given the parameters a = b = c = 1/3 and p = 0.1 we have obtained the power-law degree distribution seen in Figure 9, where the resulting coefficient is c_PL = 2.
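One step of this process might be sketched as follows. This is our own illustrative code: the bounded propagation (h_max), the repeat-free queue, and the adjacency representation are assumptions based on the descriptions in Sections 7.2 and 7.3:

```python
import random

def gbd_step(out_adj, in_adj, degs, a, b, p, m=2, h_max=16):
    """One step of the grow / back-connect / densify process.
    out_adj, in_adj: dicts of out- and in-neighbor lists.
    degs: list of total degrees, used for preferential selection."""
    def pref():                       # pick a vertex proportional to total degree
        return random.choices(range(len(degs)), weights=degs)[0]

    def add_edge(u, v):
        out_adj[u].append(v)
        in_adj[v].append(u)
        degs[u] += 1
        degs[v] += 1

    r = random.random()
    if r < a:                                     # Grow
        v = pref()
        u = len(degs)
        out_adj[u], in_adj[u] = [], []
        degs.append(0)
        add_edge(v, u)
    elif r < a + b:                               # Back-Connect
        add_edge(random.randrange(len(degs)), pref())
    else:                                         # Densify
        src = random.randrange(len(degs))
        retransmitters, queue, seen = [], [(src, 0)], {src}
        while queue:
            u, h = queue.pop(0)
            if h >= h_max:
                continue
            for v in out_adj[u]:                  # retransmit w.p. p per out-edge
                if v not in seen and random.random() < p:
                    seen.add(v)
                    retransmitters.append(v)
                    queue.append((v, h + 1))
        for w in random.sample(retransmitters, min(m, len(retransmitters))):
            add_edge(src, w)                      # edges (u_i, w_j), as in 7.3
```

The `seen` set implements the first termination guarantee (no vertex enters the queue twice) and `h_max` the second, mirroring the two mechanisms described in Section 7.2.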


Figure 9: Grow-Back Connect-Densify Model (log-log scales). a = b = c = 1/3, p = 0.1, N = 200000

7.4 Implicit Graph Model

In general an implicit graph is a graph which is not known in advance, i.e. there is no explicit graph structure defined. The structure can, however, be determined when certain rules or formulas, given as part of the graph's definition, are applied. In this case we define a graph with the following properties:

• The graph consists of N vertices.
• Each vertex is labeled u_i, where i ∈ [1, ..., N].
• Each vertex u_i has an out-degree given by the formula

d+(u_i) = ⌈A / i^α⌉    (7.1)

Theorem 1. The out-degree sequence produced by formula 7.1 follows a power law with coefficient c = 1 + 1/α.

Proof. We need to determine the number of natural numbers which produce the same degree, i.e. we need to know the interval on which equation 7.1 produces the same result.

Let i, j ∈ [1, ..., N] be such that d+(u_i) = y and d+(u_j) = y + 1, where y ∈ N, there is no k < i with d+(u_k) = y, and no l < j with d+(u_l) = y + 1. Ignoring the ceiling,

y = A / i^α ⟺ i = (A/y)^{1/α}
y + 1 = A / j^α ⟺ j = (A/(y+1))^{1/α}

The number of vertices with degree y, |D_y|, is given by the difference |i − j|. Because d+ is decreasing in i, we have j < i, therefore:


i − j = (A/y)^{1/α} − (A/(y+1))^{1/α}
      = (A/y)^{1/α} [1 − (y/(y+1))^{1/α}]
      = (A/y)^{1/α} [1 − (1 + 1/y)^{−1/α}]
      ≈ (A/y)^{1/α} [1 − (1 − 1/(αy))]
      = (A/y)^{1/α} (1/(αy))
      = (A^{1/α}/α) (1/y^{1+1/α})

Therefore the number of vertices which have degree y is:

|D_y| ≈ (A^{1/α}/α) (1/y^{1+1/α})    (7.2)

From equation 7.2 we can conclude that |D_y| ∝ y^{−(1+1/α)} as y → ∞, which means the function d+ will follow a power-law in the midrange with coefficient

c = 1 + 1/α    (7.3)

The only remaining issue for this model is determining the actual edges of the graph. What we currently do is the following:

• For vertex u_i, let edge e_ij = (u_i, v), 0 < j ≤ d+(u_i), be the j-th out-edge of u_i.
• To determine v we use a seeded pseudo-random number generator with the product ij as the seed.
• We generate a single random number 0 < n ≤ N using the random number generator.
• The target vertex is v = u_n, the n-th vertex of our graph.

The above process guarantees that while the edges are random, they follow a specific rule, and the same result set can be reproduced on demand. As shown above, the model generates graphs with a power-law degree distribution with coefficients c ≤ 2 (given by equation 7.3 for α ≥ 1), a range of coefficients that is hard to obtain using generative processes.

For values of α < 1 we must set N sufficiently large to observe power-law degree distributions, because the formula in 7.1 then decreases much more slowly. We therefore present this model as an interesting alternative to generative models and plan to analyze it further soon.

The degree distribution we obtain from this graph model is seen in Figure 10.
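The edge rule above can be sketched as follows; Python's seeded Mersenne Twister stands in for the pseudo-random number generator, and the helper names are our own:

```python
import math
import random

def out_degree(i, A, alpha):
    """Out-degree of vertex u_i per Equation 7.1: ceil(A / i^alpha)."""
    return math.ceil(A / i ** alpha)

def out_edges(i, N, A, alpha):
    """The j-th out-edge of u_i is reproducible on demand: a PRNG
    seeded with the product i*j yields the target vertex index."""
    targets = []
    for j in range(1, out_degree(i, A, alpha) + 1):
        rng = random.Random(i * j)          # seed with the product ij
        targets.append(rng.randint(1, N))   # target v = u_n, 1 <= n <= N
    return targets

# the same edges are regenerated identically on every call
assert out_edges(5, 10000, 100, 1.25) == out_edges(5, 10000, 100, 1.25)
```

Note that the product-seed scheme produces colliding seeds whenever two pairs (i, j) share the same product, so some edges are generated from identical random streams; this may be one source of the in-degree anomalies discussed below.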


Figure 10: Implicit Graph Model. α = 1.25, N = 10000, c = 1.8

We can observe that the out-degree distribution follows a power-law, as expected by definition. The in-degree distribution approaches a log distribution but has some anomalies, which may be a direct effect of the way we determine the target vertices of the model's edges. The modularity is Q = 0.6, which is significantly higher than expected; we currently consider this an artifact of the edge endpoint selection, in conjunction with the fact that there are several disconnected components within the graph. It is worth noting, however, that the graph mainly consists of a single large component, which in this case contains over 99% of the vertices, while the second largest component consists of only 8 vertices.

7.5 Partitioned Preferential Attachment

This model is essentially a combination of the well-known preferential attachment model with an additional element of periodic partitioning of the graph. This partitioning may happen at random or at fixed intervals, and it intuitively simulates the cell reproduction procedure. When a partition of the graph being generated reaches a "critical mass" s, we split it into two parts; the split is performed at random. We treat each partition as a new graph and continue to add vertices to random partitions, attaching edges to each new vertex preferentially. The preferential selection is based on the degrees of the vertices in the current partition. During the split, the edges whose endpoints belong to different partitions are maintained, which preserves the connected nature of the graph.

The resulting degree distribution from this generative process is seen in Figure 11.
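The partitioning mechanism can be sketched as follows. This is our own illustrative code: the uniformly random split and per-partition preferential selection follow the description above, while the representation details are assumptions:

```python
import random

def partitioned_pa(n, m, s, seed=None):
    """Preferential attachment with periodic random splits: when a
    partition reaches size s it is split uniformly at random in two.
    Cross-partition edges are kept, so the graph stays connected."""
    rng = random.Random(seed)
    adj = {0: [1], 1: [0]}
    partitions = [[0, 1]]                     # lists of vertex ids
    for u in range(2, n):
        part = rng.choice(partitions)         # new vertex joins a random partition
        adj[u] = []
        # preferential selection restricted to the current partition
        weights = [len(adj[v]) for v in part]
        for v in set(rng.choices(part, weights=weights, k=m)):
            adj[u].append(v)
            adj[v].append(u)
        part.append(u)
        if len(part) >= s:                    # "critical mass": random split
            rng.shuffle(part)
            half = len(part) // 2
            partitions.remove(part)
            partitions.extend([part[:half], part[half:]])
    return adj
```

Because every new vertex attaches to vertices already present, and cross-partition edges survive each split, the generated graph remains connected throughout.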


Figure 11: Partitioned Preferential Attachment. N = 7.5 · 10^5, m = 3, s = 500

8 Existing Data Set Analysis

8.1 Ground truth: complete Twitter snapshot

Before proceeding to the presentation of our results, it is worth noting some information about the data we will be using as the oracle (or ground truth). The work of M. Cha et al [12], while not immediately relevant to our ongoing work, has provided us and many other researchers with a rich resource of crawled Twitter data. They used a massively parallel array of crawling systems which crawled the Twitter network in its entirety in August 2009. While the crawl does not include any isolated vertices (in- and out-degree of zero), it does provide a rich bundle of information which we can use as an oracle. Their work mainly focused on measuring user influence in Twitter, based on the effect which "tweets" had on user behavior. Our work is oriented around measuring structural information of the Twitter network, such as degree sequence, community structure and diameter. Using their data, we were able to extract the degree sequence of the Twitter network as it was at the time of their crawl.

Figure 12: Twitter 2009 data on log-log scales: (a) the total-degree, out-degree and in-degree distributions of the real Twitter data; (b) the in-degree and out-degree relationship in the real Twitter data.


From Figure 12 there are several features we can observe regarding Twitter:

• The degree distributions of both in- and out-degrees follow a power-law in the midrange with an exponent of around −2, as seen in Figure 12a.
• There is an anomaly at in-degree 20. There are other points where the in-degree plot seems "bumpy", but as the plot is on a log-log scale, the anomaly at in-degree 20 is much greater in magnitude.
• As we can observe in Figure 12b, there is a strong connection between in-degree and out-degree at larger degrees. In most cases (around 70% of the time) the following holds: 0.7 d−(u) ≤ d+(u) ≤ 1.3 d−(u).
• While there are exceptions to the above in the case of the out-degree, i.e. vertices of high out-degree have been observed with the entire possible range of in-degrees, the same is not observed for the in-degree. This means that above a certain threshold (in-degree of 2000) there are nearly no observations of users who subscribe to many other users but do not have many followers.

Some additional observations regarding the structure of these data have revealed a further phenomenon unique to the Twitter network, visually observable in Figure 13.

Figure 13: (in-degree, out-degree) to frequency plot on a log scale

What is plotted is the frequency of observation of a given (in-degree, out-degree) combination. We can observe that the plot does not follow a power-law in the way observed in all manufactured graphs.
We can observe a higher-than-average frequency in the range of values where the in-degree is close to the out-degree, which may be a direct result of the observation, mentioned above, that the in-degree and out-degree tend to be within a certain range of each other.

8.2 Other Real Networks

At this point we would like to present some results obtained from measurements on datasets from publicly available sources (datasets obtained from http://snap.stanford.edu/data/index.html). The networks we experimented with include:

• SlashDot Zoo 2008 dataset


• SlashDot Zoo 2009 dataset
• LiveJournal
• Google Web Graph

The quantities measured in these networks include:

• Degree distribution
• Degree-based cut conductance φ
• Degree correlations

8.2.1 Degree Distributions

The results presented here are interesting, and based on them we can draw several conclusions.

Figure 14: Degree distributions of various real networks: (a) SlashDot 08; (b) SlashDot 09 all; (c) SlashDot positive links; (d) SlashDot negative links; (e) Google web graph; (f) LiveJournal.
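One common way to estimate the power-law coefficient of an observed degree sequence is the continuous maximum-likelihood estimator of Clauset, Shalizi and Newman; the report does not state which estimator it uses, so the following is only a generic sketch:

```python
import math

def powerlaw_mle(degrees, d_min):
    """Continuous-approximation MLE for the power-law exponent:
    c_hat = 1 + n / sum(ln(d_i / d_min)), over degrees >= d_min."""
    tail = [d for d in degrees if d >= d_min]
    return 1.0 + len(tail) / sum(math.log(d / d_min) for d in tail)
```

The cutoff d_min restricts the fit to the midrange tail, which matters here because, as noted throughout, these empirical distributions only follow a power-law in the midrange.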


What we can observe in Figure 14 is that the common characteristic of a power-law degree distribution is indeed present. The coefficient of this distribution seems to be a distinct characteristic of these networks, or categories of networks. In general, the web-graph seems to have a coefficient of c = 3, which agrees with the web-graph model suggested by Cooper et al [13–15]. However, some of the social networks seem to have a coefficient c < 2, which is hard to reproduce with existing generative models.

8.2.2 Degree-Based Cut Conductance

For the following plots it is worth clarifying what is actually drawn. The main idea of the plots is the conductance φ; however, our main interest is the conductance between subsets of the graph which are partitioned based on their degree.

Given a graph G = {V, E}, we denote by G[S] the subgraph induced by a vertex set S ⊆ V together with all edges e = {u, v} : u, v ∈ S. In our case we define the following set of partitions:

S(a) = {u : d(u) ≥ |V|^a}
G_a = G[S(a)]

The plots presented below have a on the x-axis and, on the y-axis, the conductance φ between the sets S(a) and V \ S(a), using the formula defined in Equation 4.6.

Figure 15: Degree-Based Cut Conductance

In Figure 15 we can see the conductance of various real-world networks compared to manufactured networks.
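The degree-based cut and its conductance can be computed as in the following sketch; we assume the common normalization φ(S) = cut(S) / min(vol(S), vol(V \ S)), which may differ slightly from Equation 4.6:

```python
def degree_cut_conductance(adj, a):
    """Conductance of the degree-based cut S(a) = {u : d(u) >= |V|^a}.
    Assumes the common definition phi = cut(S) / min(vol(S), vol(V\\S));
    the report's Equation 4.6 may be normalized differently."""
    n = len(adj)
    threshold = n ** a
    S = {u for u, nbrs in adj.items() if len(nbrs) >= threshold}
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else float("inf")
```

Sweeping a from 0 to 1 then traces out one curve per network, which is what Figure 15 plots.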
From the plot we can infer that each real-world network traces a different curve, and we may consider this an indication of how well vertices of specific degrees connect among themselves. For example, the SlashDot Zoo networks, when compared to preferential attachment graphs, show a much better (lower) conductance at reasonably high degrees; we do not consider the extreme cases of very high degrees, as those usually involve only one or two isolated vertices. What this means for SlashDot is that (reasonably) high-degree vertices are better inter-connected than we would expect in a preferential attachment network. This may indicate additional features in the generative process of these networks which are not limited to preferential attachment. It is of interest to us to perform additional measurements on other generated networks to determine how well they compare. Another interesting observation is that networks such as LiveJournal and Google have a bad conductance at reasonably high degrees


comparable to the preferential attachment graphs. As we have seen above, these two networks have a degree distribution similar to the preferential attachment model.

8.2.3 Degree Correlations

The degree correlations measure we used is the one presented in Equation 4.1. However, due to the great variation in the sizes of the networks, and consequently in the maximum degrees observed, we have rescaled the results of the measure with respect to the number of vertices of the graph. This means that instead of d(u) we use a = log_|V| d(u), and instead of ⟨k_nn⟩ we use log_|V| ⟨k_nn⟩.

Figure 16: Degree Correlations

What we can conclude from the plot is that the correlations observed in real-world networks vary greatly, and that the generative model parameters also affect this measure.

9 Graph Sampling

9.1 Sampling Real World Networks: Case study of Twitter

In the initial stages of this research degree we experimented with crawling the Twitter OSN. The initial approach was a simple random walk performed on a single lab computer. The inherent limitation was that the Twitter API only allowed 150 requests per hour. The way the API was utilized in this initial phase was, starting from a specific vertex (my own account), to retrieve all the in- and out-links and then randomly select one of those to continue the crawl, while rejecting all previously visited ones. This family of methods is well known as exploration without replacement, or graph traversal. We would like to clarify what we mean by the in-degree and out-degree of the Twitter graph, as the terms are ambiguous and depend on the point of view of the network.
We define the degrees in terms of the direction of the flow of information, in our case "tweets": a user's out-degree is the number of followers of that user, who are the recipients of tweets from the user, while the user's in-degree is the number of users that user is following, i.e. the number of people that user receives tweets from.

The above process had a lot of challenges, namely the limitation of the API to return only 5000 in- or out-links per request, and the requirement that the requests for in-links and out-links be made as separate API


calls. We present the example of the well-known Twitter account of the comedian Stephen Fry, who has approximately 50,000 friends and 2,500,000 followers. Due to the API limitation, fully acquiring the friends and followers of this specific user would have required 520 separate requests, taking around 4 hours to complete.

Due to the inherent bias of the random walk process, we were very likely to encounter high-degree vertices, as a consequence of Equation 6.1, which would take hours to fully crawl. As a result, over a period of two weeks we only fully crawled around 5000 users. However this gave us a strong indication of the small-world structure of Twitter, since those 5000 users (our core graph) not only were well connected between themselves, but their combined in- and out-edges numbered over 20 million. The results of this crawling are further analyzed in Section 9.1.1.

9.1.1 Our initial crawling

As mentioned, we crawled a much smaller part of the Twitter network using a random walk without replacement. Generally, methods without replacement, or graph traversals, are methods which do not visit the same vertex multiple times. The sub-graph of Twitter we acquired using this crawl had the properties presented below.

Figure 17: Real Twitter data vs crawled data: (a) total (red), in- (blue) and out- (green) degree distributions; (b) in-degree to out-degree observations; (c) out-degrees, crawled data (red) and real data (green).

From the above we can observe the relative bias of the random walk method, especially in Figure 17c.
As we can see, the plot of the crawled data tends to approach the real data at higher degrees; however, at lower degrees the observed frequency is much lower than the real frequency.

Additionally, it is worth noting that the relative bias of our crawling was relatively small. This may in part be a consequence of the fact that the above data consist of the merging of several random walks. Each random walk started from the same location and crawled the graph for approximately the same amount of time; the results of each crawler were merged to create the graph image presented above.

The problems with the initial crawling were not only the API limitations but also the hardware limitations imposed on us. Generally, there were two options available for storing the intermediate graph: either in main memory or on the hard drive. The first option limited the amount of data we could store (the memory of the computer used was limited), and the second proved unfeasible due to the high access time of the hard disk.

9.1.2 Second crawling, unbiased sampling

Our second crawling was mainly focused on discovering the actual degree distribution of the Twitter network. This was in part because it took place after we had analyzed the real Twitter data and had a good idea of how the network looked in 2009. That however was not enough, because we wanted to know whether the structure of the network had remained unchanged, at least with respect to the degree distribution.
Due to this we started a second crawl of the Twitter network, this time using an unbiased crawling method called Uniform Sampling (UNI).

In order to do our sampling, we took advantage of the fact that each Twitter user, when created, is assigned a unique user id. To the best of our knowledge, and backed by observation of a subset of users, the user ids are assigned incrementally according to the order in which users were created. This suggested that there


was a very strong possibility that, if we generated a random number ranging from the smallest to the largest observed user id at the time, we would have a good chance of obtaining a valid user id. The method we used was to generate random numbers between 15 and 300,000,000, which were the smallest and largest Twitter ids at the time. We then performed a user info request through the Twitter API: a single request that returns a great deal of data regarding a Twitter user, such as in-degree, out-degree, number of tweets sent, age of the account, etc. This request, while not providing the actual in-edges and out-edges (i.e. the other endpoints), did give us a good picture of the degree distributions.

We observed that valid user ids made up around 75% of all the numbers tried, which agrees with our intuitive assumption. The above method of rejecting a proportion of the generated samples is known as rejection sampling [34], where the desired sampling distribution is the uniform distribution.

In the process of this crawling we gathered data on 378,303 users; the resulting degree distributions are presented in Figure 18.

Figure 18: In-degrees (red), out-degrees (green), total degrees (blue) and tweets (fuchsia) as a function of frequency on a log scale

From what is seen in Figure 18 we can determine that the degree distribution follows a power-law for both the in- and out-degrees. In addition, the power-law coefficient seems to be the same for both distributions, and also for the total degree. A unique observation in this plot is that the number of statuses (or tweets) that users have sent also follows a power law. To elaborate on what is plotted in Figure 18:
Let f(x) be the number of users who have sent x tweets and N the total number of users; then what is plotted above is t(x) = f(x)/N. This distribution seems similar to the degree distributions, with a slightly lower slope. A rough estimate of the power-law in the midrange indicates a coefficient of c ≈ 1.6, which is significantly lower than what the currently analyzed generative models are able to reproduce.

These data have also given us the opportunity to make some reasonable comparisons to the dataset from the 2009 Twitter crawl. The result of this comparison can be seen below. We note that only the in-degree comparison is presented, because it includes a certain anomaly which we hoped to examine in order to determine whether it was an artifact of the crawling or of the network itself. The comparisons of the other distributions gave similar results.
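The UNI procedure amounts to rejection sampling over the id space. A sketch follows, where `is_valid_user` is a hypothetical stand-in for the Twitter user-info request, and the id bounds are the ones quoted above:

```python
import random

def uni_sample(is_valid_user, num_samples, lo=15, hi=300_000_000, rng=random):
    """Rejection sampling for uniform user selection: draw ids UAR
    from [lo, hi] and keep only the ids that resolve to real users.
    The accepted ids form a uniform sample of the existing user ids."""
    accepted = []
    while len(accepted) < num_samples:
        uid = rng.randint(lo, hi)
        if is_valid_user(uid):       # in practice: one user-info API request
            accepted.append(uid)
    return accepted
```

With roughly 75% of candidate ids valid, as observed above, the expected cost is about 1/0.75 ≈ 1.33 API requests per accepted sample.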


Figure 19: Real 2009 Twitter distribution (green) compared to sampled Twitter distribution (red)

From the above we can observe that the degree distribution appears to be the same; however, this cannot be said with certainty, as the scales of the two data sets are largely different. It is worth mentioning that the large data-set we acquired does not include vertices of total degree 0, which seem to make up a significant portion of the network. Also, viewing both Figure 18 and Figure 12a, we can see that the anomaly at in-degree 20 is in fact an artifact of the Twitter network and still remains today. From empirical observation, a number of the users with in-degree 20 do not correspond to people but to certain services, sometimes malicious. It seems that there might be a limit, imposed either by Twitter or by other constraints, which restricts these kinds of accounts to an in-degree of 20.

All the above have led us to the conclusion that the actual Twitter network today, while much larger, has a degree distribution similar to the network as it was in 2009, although the slope appears slightly different.

9.2 Uniform Sampling (UNI): A study of efficiency

In this section we present a small case study of UNI. We performed tests by sampling vertices UAR from a preferential attachment network and recording their degrees. This study was made in order to accurately measure the efficiency of the UNI method depending on the size of the sample.
The resulting picture was mostly what we expected (a near-accurate degree distribution sample even for small sample sizes); however, we did make some important observations.

We will present our findings here, but before doing so we describe the method we used to test the efficiency of UNI: the widely used Kolmogorov-Smirnov test (KS test) [46]. This goodness-of-fit test is defined as follows. Assuming that we are sampling from a distribution with cumulative distribution function (c.d.f.) F(·), and denoting the empirical c.d.f. of a sample of size n by F_n(·), the KS statistic is

D_n = sup_{−∞ < x < ∞} |F_n(x) − F(x)|    (9.1)

The null hypothesis is that our sample is drawn from the degree distribution of the


graph. Under this hypothesis, F(x) = P(d(u) ≤ x), the probability that a vertex has degree at most x, while F_n(x) is the actual proportion of sampled vertices with d(u) ≤ x.

The results of the above test are shown in Figure 20.

Figure 20: KS test on a preferential attachment graph

Figure 20 shows three different KS tests. Each test uses the null hypothesis that the sample is taken from a c.d.f. restricted to a specific degree range. The red part of the plot is for degrees between 3 (the minimum degree for m = 3) and 12, the green part is for degrees between 12 and 110 (the point where the degree distribution of PA starts getting somewhat noisy, as seen in Figure 2), and the blue part is for degrees above 111.

The reason we decided to partition the tests into cases was the nature of the distribution. This separation was needed in order to determine how well a uniform sampling process would perform in finding authorities in a real-network situation. As we can see in Figure 20, if we assume a confidence range of ±0.1, the KS test shows that even very small samples are representative of the vertices with degree d(u) < 110; however, this is not the case for higher-degree vertices. Indeed, UNI is not a good method for sampling the c.d.f. of large degrees, and this important observation is what inspired a new method which performs this sampling better.
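For reference, the KS statistic of Equation 9.1 can be computed for an empirical sample as in the following generic sketch (F here is any reference c.d.f., not the specific degree ranges used in Figure 20):

```python
def ks_statistic(sample, F):
    """Kolmogorov-Smirnov statistic D_n = sup_x |F_n(x) - F(x)| for a
    one-dimensional sample against a reference c.d.f. F; the supremum
    is attained at the sorted sample points, checked just before and
    at each jump of the empirical c.d.f."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, abs((i + 1) / n - F(x)), abs(i / n - F(x)))
    return d
```

Restricting both the sample and F to a degree range, as done for Figure 20, amounts to calling this with the conditional c.d.f. of that range.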
This method will be described in Section 9.3.

9.3 Sampling Manufactured Graphs: Weighted Random Walk

9.3.1 Theoretical foundation

The simplest way to generate a graph with a power law degree sequence is the preferential attachment method described in Section 5.2.

Let S be a subset of the vertices of a graph G = (V, E), where S is defined in terms of some property, such as the set of vertices with degree at least d. We suppose the content of S is unknown, and that we wish to discover all vertices in S by searching G using a SRW. We say a SRW is seeded if the walk starts from some vertex s of S. In the context of searching networks such as Facebook, Twitter or the WWW, it is not unreasonable to suppose we know some high degree vertex without supposing we know all of them.

The basis of this sub-linear algorithm is Equation 5.3. Our algorithm is a degree-biassed random walk, with transition probability p(u, v) given by

    p(u, v) = (d(v))^b / Σ_{w ∈ N(u)} (d(w))^b,


where b > 0 is a constant. The value b = (1/η − 1)/2 that we choose in the proof below is optimized to depend on η. Using (5.1), this value can be expressed directly as a function of the degree-sequence power law c.

The easiest way to reason about biassed random walks is to give each edge e a weight w(e), so that transitions along edges are made proportional to this weight. In the case above the weight of the edge e = (u, v) is w(e) = (d(u)d(v))^b, so that the transition probability can be written

    p(u, v) = (d(u)d(v))^b / Σ_{w ∈ N(u)} (d(u)d(w))^b.

The inspiration for a degree-biassed walk with parameter b comes from the β-walks of Ikeda, Kubo, Okumoto and Yamashita [30], which use an edge weight w(x, y) = 1/(d(x)d(y))^β. When β = 1/2 this gives an improved worst-case bound of O(n^2 log n) for the cover time of connected n-vertex graphs.

Theorem 2. Let G(m, t) be a graph generated in the preferential attachment model. Then whp we can find all vertices in G(m, t) of degree at least t^a in O(t^{1−a(1−δ)}) steps, using a biassed seeded random walk (BSRW) with transition probability along edge {x, y} proportional to √(d(x)d(y)). Here δ > 0 is a small positive constant (e.g. δ = 0.00001).

Proof. We consider now the preferential attachment graph G(t) ≡ G(m, t). In this special case, η = 1/2 and the whp bounds (5.3) on the degree of vertex s are

    (t/s)^{(1−ε)/2} ≤ d(s, t) ≤ (t/s)^{1/2} log^2 t.    (9.2)

We define a graph G* on vertices 1, 2, . . . , t, which has the same vertex degrees as graph G(t), and is built by a similar iterative process: for each v = t_0 + 1, . . . , t, add m edges from vertex v to some earlier vertices. Graph G(t_0) is the same constant-size starting graph for both G(t) and G*.
In graph G(t), edges are selected according to a random preferential process, while in graph G* they are chosen by the deterministic process which greedily fills the in-degrees of vertices, giving preference to the older vertices. In both graphs, if {x, y} is an edge and x > y, then this edge was added to the graph when vertex x was considered. Graph G* can be obtained from graph G(t) by swapping edges: whenever there is a pair of edges {x, y} and {u, v} such that u > x > y > v, replace these edges with the edges {x, v} and {u, y}.

Assume b > 0 and define

    d̄(v) = (t/v)^{1/2},

    w̄(G) = 2 Σ_{{x,y} ∈ E(G)} (d̄(x)d̄(y))^b ≥ w(G) log^{−4b} t,

where G is any graph with vertices 1, 2, . . . , t whose degrees satisfy the bounds (9.2).

If we view G* as obtained by swapping edges in G = G(t), then it is easy to see that each such swap increases w̄(G). Indeed, if u > x > y > v, then

    (d̄(x))^b > (d̄(u))^b  and  (d̄(v))^b > (d̄(y))^b,

implying that

    (d̄(x))^b (d̄(v))^b + (d̄(u))^b (d̄(y))^b > (d̄(x))^b (d̄(y))^b + (d̄(u))^b (d̄(v))^b.

Therefore, w̄(G*) ≥ w̄(G(t)).

Next we derive an upper bound on w̄(G*). Because of the greedy process of adding edges to G*, a vertex x in G* has "incoming" edges which originate from vertices first(x), first(x) + 1, . . . , last(x). All m edges


outgoing from each vertex y = first(x) + 1, . . . , last(x) − 1 point to x. Thus we have

    w̄(G*) = 2 Σ_{{y,x} ∈ E(G*)} (d̄(x)d̄(y))^b
          ≤ 2 Σ_{x=1}^{t} Σ_{y=first(x)}^{last(x)} m (d̄(x)d̄(y))^b
          ≤ 2 Σ_{x=1}^{t} d(x) (d̄(x)d̄(first(x)))^b
          ≤ 2 log^2 t Σ_{x=1}^{t} (d̄(x))^{1+b} (d̄(first(x)))^b.    (9.3)

Now we calculate first(x). The m·first(x) edges outgoing from vertices 1, 2, . . . , first(x) fully fill the in-degrees of vertices 1, 2, . . . , x − 1 (the greedy process), so

    2m·first(x) ≥ m·first(x) + mx ≥ Σ_{z=1}^{x−1} d(z) ≥ Σ_{z=1}^{x−1} (t/z)^{(1−ε)/2} ≥ t^{(1−ε)/2} x^{1−(1−ε)/2}.

Thus

    d̄(first(x)) = (t/first(x))^{1/2} ≤ (2mt / (t^{(1−ε)/2} x^{1−(1−ε)/2}))^{1/2} = (2m)^{1/2} (t/x)^{(1+ε)/4}.    (9.4)

Using (9.3) and (9.4), we get

    w̄(G*) ≤ 2(2m)^{b/2} log^2 t Σ_{x=1}^{t} (t/x)^{(1+b)/2} (t/x)^{b(1+ε)/4}
          ≤ 2(2m)^{b/2} log^2 t Σ_{x=1}^{t} (t/x)^{1/2 + (3/4)b(1+ε)}.    (9.5)

Choosing b so that

    1/2 + (3/4)b(1+ε) ≤ 1,    (9.6)

the sum in (9.5) is O(t log t), and we have

    w(G) ≤ log^{4b} t · w̄(G) ≤ log^{4b} t · w̄(G*) = O(t log^{4b+3} t).

Proceeding now as in the proof of Theorem 3, we get a similar bound on the seeded cover time of S(a):

    C*_{S(a)} = O(t^{1−2ba(1−ε)} polylog t).

We take b = (2/3)(1 − ε) to satisfy (9.6) and obtain

    C*_{S(a)} = O(t^{1−(4/3)a(1−ε)^2}).

Similarly as in the proof of Theorem 3, we can conclude that all vertices in S(a) are discovered in O(t^{1−(4/3)a(1−δ)}) steps whp, for δ = 3ε, and that the cover time of the graph G(t) is O(t polylog t).
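Operationally, the biassed walk of Theorem 2 simply picks the next neighbour v with probability proportional to d(v)^b: the edge-weight form (d(u)d(v))^b gives the same transition probabilities, since the factor (d(u))^b is common to all choices at u. A minimal sketch on a toy graph (the graph and all names are ours, for illustration only):

```python
import random

def biassed_walk(adj, start, steps, b, seed=0):
    """Degree-biassed random walk: from u, move to neighbour v with
    probability proportional to d(v)**b.  b = 0 gives a simple random
    walk; Theorem 2 corresponds to b = 1/2."""
    rng = random.Random(seed)
    u, visited = start, {start}
    for _ in range(steps):
        nbrs = adj[u]
        weights = [len(adj[v]) ** b for v in nbrs]
        u = rng.choices(nbrs, weights=weights)[0]
        visited.add(u)
    return visited

# Toy graph: vertex 0 plays the role of the high degree "authority".
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
seen = biassed_walk(adj, 1, 50, b=0.5)
```

On power-law graphs the bias pulls the walk towards hubs, which is exactly the behaviour the theorems quantify.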


Theorem 3. For c > 2, whp we can find all vertices in G(c, m, t) of degree at least t^a in O(t^{1−a(c−2)(1−δ)}) steps, using a BSRW with transition probability along edge {x, y} proportional to (d(x)d(y))^{(c−2)/2}. Here δ > 0 is a small positive constant (e.g. δ = 0.00001). The cover time of G(c, m, t) by this BSRW is O(t log^7 t).

Proof. Suppose we want to find all vertices of degree at least t^a for some a > 0 in G(t) ≡ G(c, m, t). Let S(a) = {v : d(v, t) ≥ t^a}. Recall that G(t) is generated by a process of attaching v_t to G(t − 1). At what steps were the vertices v ∈ S(a) added to G(t)? The expected degree of v at step t is given by (5.2), i.e. E d(v, t) = (1 + o(1)) m (t/v)^η. This function is monotone decreasing in v. Let σ be given by

    t^a = (t/σ)^η, which implies σ = t^{1−a/η}.    (9.7)

Let s = σ · log^{2/η+1} t. Then using (5.3), all vertices added at steps w ≥ s have d(w, t) = o(t^a). On the other hand, using (5.3) again, all vertices v added at steps 1, . . . , s have degree d(v, t) ≥ (t/s)^{η(1−ε)}.

We want to apply the Matthews bound (6.3). Clearly log |S(a)| ≤ log t. It remains to find

    max_{u,v ∈ S} H(u, v) ≤ max_{u,v ∈ S} K(u, v).

To calculate K(u, v) in (6.2), we first need to bound w(G):

    w(G) = 2 Σ_{{x,y} ∈ E(G)} w(x, y)
         = Σ_{x ∈ V} Σ_{y ∈ N(x)} (d(x)d(y))^b
         ≤ Σ_{x ∈ V} Σ_{y ∈ N(x)} ((d(x))^{2b} + (d(y))^{2b}) / 2
         = Σ_{x ∈ V} (d(x))^{2b+1}
         ≤ Σ_{x=1}^{t} (t/x)^{η(2b+1)} log^4 t.

The upper bound on vertex degree in the last line comes from (5.3). Thus, choosing η(2b + 1) = 1, that is, b = (1/η − 1)/2, we have

    w(G) = O(t log^5 t).    (9.8)

Because Diam(G(s)) = O(log s) (see (5.4)), we know that for any u, v ∈ S(a) there is a path uPv of length O(log t) from u to v in G(t) contained in G(s), and thus consisting of vertices w of degree d(w, t) ≥ (t/s)^{η(1−ε)} = d*. Thus all edges of this path have resistance at most 1/(d(x)d(y))^b ≤ 1/(d*)^{2b}.
From (5.3), d* satisfies

    d* ≥ (t / (t^{1−a/η} log^{1+2/η} t))^{η(1−ε)} ≥ t^{a(1−ε)} / log^3 t.

By the discussion above,

    R_eff(u, v) ≤ Σ_{e ∈ uPv} r(e) = O(log t / (d*)^{2b}).

Using (6.2), and the value of d*, we have

    K(u, v) ≤ K* = O(t^{1−2ba(1−ε)} log^{12} t).

The bound in Theorem 3 on finding all vertices of degree at least t^a is now obtained as follows. The Matthews bound (6.3) gives the (expected) cover time C*_{S(a)} = O(K* log t). Let δ = 3ε, an arbitrary but small constant. We use one of the ε to absorb the polylog term in K*, and the other to apply the Markov inequality (Pr(X > A·EX) ≤ 1/A), with EX = C*_{S(a)}, to give a whp result.


Finally we establish the cover time of the graph G(t). This is done by using (6.3) with S = V(t), the vertex set of G(t), i.e.

    C_{V(t)} ≤ max_{u,v ∈ V(t)} H(u, v) · log t.    (9.9)

We bound H(u, v) by (6.2) as usual. The resistance r(e) of any edge e = {x, y} is

    r(e) = 1/(d(x)d(y))^b ≤ 1/m^{2b} = O(1).

From (5.4) the diameter of G(t) is O(log t), so R_eff(u, v) = O(log t), since the effective resistance between u and v is at most the resistance of a shortest path between u and v. This and (9.8) give K(u, v) = O(t log^6 t). Thus the cover time of the graph G(t) is O(t log^7 t).

Basically, Theorem 2 and its generalization Theorem 3 say that if we search this type of graph using a SRW with a bias b = (c − 2)/2 proportional to the power law c then, (i) we can find all high degree vertices quickly, and (ii) the time to discover all vertices is of about the same order as for a simple random walk.

Figure 21: Plots of experimental data showing cover time of all vertices of degree at least t^a as a function of a

9.3.2 Experimental results

Preferential Attachment Graph Theorem 2 gives an encouraging upper bound of the order of around t^{1−(4/3)a} for a biassed random walk to cover all vertices of degree at least t^a in the t-vertex preferential attachment graph G(m, t). Our experiments, summarized in Figure 21, suggest that the actual bound is stronger than this. The experiments were made on G(m, t) with m = 3 and t = 10^7 vertices. The representative degree distribution of such graphs is given in Figure 2, with both axes in logarithmic scale. More precisely, the x-axis is the exponent a in the degree d = t^a, i.e.
x = log d / log t, while the y-axis is the frequency of the vertices of degree t^a.

In Figure 21, plot SRW shows the average cover time τ(a) of all vertices of degree at least t^a by the simple random walk (uniform transition probabilities). Plot WRW shows the average cover


times by the biassed random walk with b = 1/2. Both axes are in logarithmic scale. The y-axis is y = (log τ(a))/log t.

Figure 22: Plots of experimental data showing cover time of all vertices of degree at least t^a (ignoring a < 0.25) as a function of a in the SlashDot graph.

There are also three reference lines drawn in Figure 21. These lines have slopes −a, −3a/2 and −2a; they are included for discussion purposes only, and their intercepts have no meaning.

Before discussing Figure 21 in greater detail, we remark that it broadly confirms the implications of our theoretical analysis: for random preferential attachment graphs, biassed random walks quickly discover all higher degree vertices while not increasing the cover time of the whole graph by much. For example, by checking the exact cover times, we observed that the biassed random walk with b = 1/2 took on average 2.7 times longer than a simple random walk to cover the whole graph G(3, 10^7), but discovered the 100 highest degree vertices 10 times faster than a simple random walk.

The cover time C_G of a simple random walk on G(m, t) is known and has value C_G ∼ (2m/(m−1)) t log t; see [16]. The intercept of the y-axis at m = 3 predicted by this is 1 + log(3 log 10^7)/log(10^7) = 1.24, and this agrees well with the experimental intercept of 1.22. This agreement helps confirm our experimental results.

For a weighted random walk, the stationary distribution π(v) of vertex v is given by

    π(v) = (1/w(G)) Σ_{x ∈ N(v)} w(v, x),

where w(G) is the sum of the edge weights of G, each edge counted twice. Thus for a simple random walk on G(m, t), π_S(v) = d(v)/(2mt). For the weighted random walk of Theorem 2 (η = 1/2 and b = 1/2 for G(m, t)) we have the following lower bound:

    π_W(v) = Ω(d(v)^{3/2} / (t log^5 t)).


This bound holds because we know from (9.8) that w(G) = O(t log^5 t), and

    Σ_{x ∈ N(v)} w(v, x) = Σ_{x ∈ N(v)} (d(v)d(x))^{1/2} ≥ (d(v))^{1/2} Σ_{x ∈ N(v)} m^{1/2} ≥ (d(v))^{3/2}.

We can give an informal explanation of Figure 21 as follows. In the long run, the number of visits to vertex v in T steps approaches Tπ(v), so the first visit to v should occur at about T(v) = 1/π(v). As π(v) increases with increasing degree d(v), if h > a we should expect to see all vertices of degree t^h before all vertices of degree t^a.

For a simple random walk, let v be a vertex of degree t^a; then T(v) ≈ 1/π_S(v) = 2mt/t^a ≈ t^{1−a}. So the SRW plot in Figure 21 should have slope −a, and this is indeed the case.

For a weighted random walk, the same argument gives

    1/π_W(v) = O(t log^5 t / (t^a)^{3/2}) = Õ(t^{1−(3/2)a}),

which explains the slope of −3a/2 for the WRW plot.

The total number n(a) of vertices of degree at least t^a is approximated by σ = t^{1−a/η} = t^{1−2a}, where the value of σ from (9.7) is the expected step at which a vertex of degree t^a is added, and η = 1/2 for preferential attachment. As no walk-based process can visit σ vertices in fewer than σ steps, this explains the line with slope −2a in Figure 21.

Figure 23: Plots of experimental data showing cover time of all vertices of degree at least t^a (ignoring a < 0.35) as a function of a in a sample of the Google web-graph

Real World Networks In this section we present our experimental results on real world graphs: a sample of the Google web-graph and the SlashDot Zoo (November 2008 dataset). The datasets for these networks were obtained through the Stanford Network Analysis Project site, which contains a


wide range of resources as well as data sets.³ While both these graphs are directed, we ignore edge direction and treat them as undirected for the purpose of our experiments. In the case of the Google dataset we only take into account the largest weakly connected component, while SlashDot is already weakly connected.

The Google web-graph sample has a power-law degree distribution in the mid-range with a coefficient close to c = 3, as seen in Figure 14e. As we can see in Figure 23, our method outperforms a SRW in discovering all high degree vertices, a strong indication of the effectiveness of our method even on real world networks. The value of b used in this case was b = 2/3.

In addition, as we can see in Figure 14a, the degree distribution of the SlashDot graph follows a power law with a coefficient of approximately 1.8. This is lower than the power-law range in which our method was proven to work. However, as seen in Figure 22, the biassed random walk is still quicker than a SRW to cover all high degree vertices. The value of b used in this case was b = 1/2.

In conclusion, we have analyzed the number of steps required by biassed random walks to discover all higher degree vertices in random t-vertex preferential attachment graphs, and we have proven sub-linear upper bounds for discovering all vertices with degree at least t^a, for 0 < a < 1/2. Our experimental results confirm the good performance of biassed random walks on such graphs. Our theoretical analysis applies also to generalized web-graph processes.

Our theoretical bounds are probably not tight, and it would be interesting to see whether better bounds can be proven. What is the best value for the parameter b of biassed random walks?
From the practical point of view, it would be interesting to investigate the performance of biassed random walks on additional real networks which exhibit the power law.

In addition, we have presented some interesting results on both generated and real networks. The results in both cases are very promising, with the proposed WRW showing an improvement over a SRW in the efficiency of discovering high degree vertices. There are, however, many open questions which remain, especially in the case of real world networks. Some questions which arise include:

• What are the features of these networks which make the WRW behave less efficiently than expected?

• What would be the optimal value of β on these networks?

• Is there a single measure which would indicate how effective a WRW would be on any given network?

We believe that these questions are very interesting and should receive a great deal of attention as part of our ongoing research in this area.

³ http://snap.stanford.edu/data/index.html, datasets retrieved 15/12/2011. SlashDot consists of 77360 vertices and 905468 edges, while the largest weakly connected component of the Google sample consists of 855802 vertices and 5066842 edges.


Part IV

Future Work

In this section we present an outline of the work we intend to do in the future. As we have seen from the work so far, there are still many open problems and unanswered questions.

In general our work focuses on three aspects related to graph theory and the large, scale-free graphs that appear in the WWW and OSNs. These networks receive great interest and are what some would call a "hot topic". This is due to the great deal of information one can extract about social behavior by observing these networks, observations that are of interest from many different perspectives, such as the social sciences, algorithm design, mathematics and economics. While our work focuses on mathematics and algorithm design, we believe it can also contribute to other sciences, since it gives answers to the same problems viewed from a different angle.

Based on our research interests, we will further expand our research in the following areas:

• Graph analysis

• Graph sampling

• Graph generation models

These areas are in some ways different sides of the same coin. We cannot generate a graph if we do not know what properties must hold for that graph; we cannot know what properties hold if we do not analyze the graph; and we cannot analyze the graph if we do not obtain the graph, or at least a sample of it.
We have already obtained various data-sets; however, in an expanding world such as the WWW, even a data-set only a few months old can be seen as outdated. We will proceed to outline our goals in more detail, keeping in mind the work that has already been done and the unanswered questions that have arisen.

10 Graph Analysis

10.1 Real world network characteristics

Through our work so far we have determined that, other than degree distributions, we cannot really draw solid conclusions about real world network characteristics. In Section 8.2 we saw some interesting results when trying to measure certain network properties and compare them to the widely used preferential attachment model, presented in Section 5.2. However, the outcome of these measurements shows that while the preferential attachment graph models some properties of these networks well, it does not model them as well as we would like. In addition, further reflection on the WRW method we suggested in Section 9.3 shows that while the preferential attachment graph and real world networks have similar properties, the walk behaves very differently on them. This may indicate additional properties we need to measure, or measures we need to propose, to obtain a clear and precise metric of the performance of our walk on these graphs.

Our plan for this aspect of our research is to continue the analysis of the real world networks we have obtained, and to use our current sources to obtain additional ones.
Some of the things we would be interested in measuring include:

• Optimal conductance on the induced graphs from our degree based cuts (or conductance of a "meta cut")

• Cumulative stationary distributions of the SRW and of WRWs with different values of β

• A comparative analysis of which community measure and community split tactic are most appropriate for discovering communities.
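For the first item, the conductance of any single candidate cut is cheap to evaluate. Assuming the usual definition φ(S) = e(S, V\S) / min(vol S, vol V\S) from Section 4.5, a minimal sketch (the function name and the toy graph are ours):

```python
def conductance(adj, S):
    """phi(S) = e(S, V\\S) / min(vol S, vol V\\S), where vol X is the sum
    of degrees over X and e(S, V\\S) counts edges crossing the cut."""
    S = set(S)
    vol_s = sum(len(adj[v]) for v in S)
    vol_rest = sum(len(adj[v]) for v in adj) - vol_s
    cut = sum(1 for v in S for u in adj[v] if u not in S)
    return cut / min(vol_s, vol_rest)

# 4-cycle: the cut {0, 1} crosses 2 edges, and each side has volume 4.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
phi = conductance(adj, {0, 1})
```

The hard part, of course, is not evaluating φ for one cut but searching over cuts; the degree-based cuts above restrict that search space.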


In addition to the above properties, there are many real world network characteristics which may be unknown. This claim is backed by observations of the efficiency of sampling methods such as the WRW, which differs from what the theory suggests. An important aspect of our research will be to determine what these properties are and to formalize a measure for graphs which would indicate "ease of sampling".

10.2 Property Testing and Estimation Algorithms

In graph analysis we have encountered the need to measure properties which are computationally hard. Problems such as expansion testing are, on graphs of this scale, impossible to compute with an exhaustive algorithm, so proper estimations need to be made. Ideal candidate methods for such estimations fall within the field of property testing, seen in Section 4.8, as well as approximation algorithms like the modularity algorithm discussed in Section 4.6. We intend to implement such algorithms to test properties and measure quantities. We believe the scope of the application of such algorithms is currently limited, and we intend to introduce new applications by making use of them. As part of our work we intend to investigate possible algorithms to solve the graph partitioning problems which we often encounter.

There are numerous algorithms which offer approximate solutions to the above problems, and we have already implemented and tested several.
We intend to continue implementing such algorithms, and in addition we will investigate possible improvements in aspects such as accuracy and complexity.

There are many possibilities in this area, and we have already started work on approximate minimum k-cut⁴ algorithms, in order to provide our own estimate of the community structure of graphs, as well as to determine whether this cut would yield the conductance and modularity quantities provided by other, similar algorithms.

11 Graph sampling

In Section 9.3 we saw a WRW method which discovers vertices of degree d > n^a in sub-linear time. There is always additional room for improvement in graph crawling and sampling. We must stress that the purpose of our crawling is to obtain representative samples of a real world network, not to sample the networks in their entirety. As we saw in Section 9.2, uniform sampling is successful at this task but may be infeasible in practice. Additionally, it does not work well for discovering the high degree vertices of power-law graphs. This may indicate that a combination of the two methods would work better for sampling such graphs. However, the main focus of our future work will be on methods based on random walks to sample efficiently and effectively from real world networks. Random walks work surprisingly well, and this is backed by theory. In general, WRWs can be a very important tool in graph sampling if we dynamically adjust the bias according to the desired outcomes.
This is an area of great interest and part of our ongoing research.

However, there are some drawbacks of the WRW method which have to do with its algorithmic complexity. The method requires additional quantities to be computed in order to determine the next vertex to traverse, and these quantities take O(d(u)) time per step. While it is possible to pre-process some of these quantities, the time does not decrease enough to make the WRW cost per step comparable to that of the SRW, which takes O(1) time per step.

We know that C_v ∝ Diam(G)·n is the cover time of the graph, and in our case Diam(G) ∝ log n. While this cover time is of the same order as for the SRW, the expected O(n log n) run-time of a SRW is significantly lower than the run-time of a WRW. We have already implemented several optimizations of the algorithm, including a preprocessing of the graph to determine vertex and edge weights. This process takes O(mn) time but reduces the overall runtime of the walk. For the applicability of the algorithm it is important, first, to analyze the complexity of the method and then to further reduce the run-time to quantities comparable to the SRW complexity.

⁴ The minimum k-cut problem is the problem of finding the minimum set of edges which, when removed, partitions the graph into k connected components.
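One concrete way to attack the per-step cost, sketched below under our own naming, is to store prefix sums of the neighbour weights for each vertex and pick the next vertex by binary search, so that each step costs O(log d(u)) rather than O(d(u)). (Our table construction below is linear in the number of edges; the report's own O(mn) preprocessing may compute different quantities.)

```python
import bisect
import random

def preprocess(adj, b):
    """For each vertex, store its neighbour list together with prefix sums
    of the weights d(v)**b; built once, in time proportional to the
    number of edges."""
    table = {}
    for u, nbrs in adj.items():
        prefix, total = [], 0.0
        for v in nbrs:
            total += len(adj[v]) ** b
            prefix.append(total)
        table[u] = (nbrs, prefix)
    return table

def step(table, u, rng):
    """One WRW step from u in O(log d(u)) via binary search on prefix sums."""
    nbrs, prefix = table[u]
    r = rng.random() * prefix[-1]
    return nbrs[bisect.bisect_right(prefix, r)]

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
table = preprocess(adj, b=0.5)
v = step(table, 0, random.Random(1))
```

This sketch assumes a static graph; on a graph discovered during crawling, the prefix sums would have to be built lazily as new adjacency lists are fetched.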


However, the SRW and WRW are not the only methods we intend to use. There is a plethora of methods which have their uses, including:

Uniform Sampling (UNI) This is the ideal method for unbiased sampling. While we realise that it is infeasible to use in some cases, we will continue to make extensive use of it where possible. In Section 9.2 we saw some practical experiments using this method and concluded that small sample sizes suffice to sample the lower and mid-range of the degree distribution.

Weighted Random Walk (WRW) This method has been extensively analyzed. In Section 9.3 we made use of it to sample all high degree vertices in time sub-linear in the size of the graph. This, in conjunction with UNI sampling, could efficiently be used to sample from the entire graph in sub-linear time. We intend to make further use of these methods by relaxing the WRW requirements to only partially sample the high degree vertices rather than fully acquire them.

Simple Random Walk (SRW) The SRW method is well known and well analyzed. It is important to realise its potential and value and to take advantage of the existing analyses of this method in order to perform appropriate re-weighting (as in the RWRW) to meet our requirements. One of the major advantages of this method is its very low computational time and space.

Sampling With Memory This is a general category of sampling methods, including BFS, Depth-First Search (DFS) and RDS.
At this point we make special reference to a method used in the past and described in a footnote of this report: BFS queue filtering. This is a family of methods based on applying certain selection policies to a BFS queue, i.e. the queue containing the order in which vertices will be visited during a BFS traversal. We obtained many interesting results when using these methods to sample from manufactured graphs, and we intend to further use them to sample from real graphs.

All the methods mentioned above have been used to some extent, and most of their results have been at least discussed. We believe that proper utilization of these methods will further expand our capabilities in graph sampling.

11.1 Algorithm Optimizations

In the context of optimization there are several aspects one should consider:

• Algorithm complexity reduction

• Run-time reduction

• Approximation

11.1.1 Algorithm complexity reduction

The most obvious way to reduce the run-time of an algorithm is to reduce its overall complexity. However, this is not always easy, and sometimes it is impossible. In the case of the WRW it is possible, with appropriate modifications, to use other methods of searching the adjacency lists of the graph, such as a binary search instead of a linear search. This would reduce the complexity of each step to O(log d(u)).

11.1.2 Run-time reduction

Run-time reduction of the algorithms is something provided by default by most compilers, so we will not analyze it in depth here.


11.1.3 Approximation

In cases where a precise result is not required, just a "good guess", approximations can be used to calculate the results of certain operations. These approximations have the advantage of very low run-times, and when we do not require exact results they are usually favorable. In our case we can use approximate weights and approximate selection of the next target in order to speed up the methods.

11.2 Improvement Of Sampling Efficiency

At this point we must stress that sampling needs to be as efficient as possible: the sample size we require should be the smallest possible. There are several trade-offs to consider for our sampling methods. SRWs are, in theory, sampling methods which require no memory. Certain optimizations can be achieved by using some limited memory, but in general the SRW and WRW do not require memory in order to function. This has many advantages, since the graphs we are working with are too large to assume that even a good sample would fit in memory; however, it has the disadvantage of frequent vertex re-visits. These can be avoided in other methods such as BFS and similar techniques, but keeping track of a BFS tree in graphs with high branching factors is very limiting.

In order to answer fully the question of how we can improve the sampling efficiency of our methods, we need to carefully consider these trade-offs together with the sampling goals. When we sample, we need to store the sample somewhere, since this is our end goal.
This allows us to make some run-time optimizations to methods such as the SRW by "simulating" revisits rather than performing actual revisits, using information obtained in previous steps. This may violate the "no memory" policy of the SRW, but in practice we choose to do so in order to achieve our sampling goals.

Keeping the above in mind, we can better understand our goals, our limitations, and what we consider "reasonable" trade-offs when designing our methods. An important part of our future work is to apply these methods, in their more relaxed form, to real networks, with the confidence that our experimental and theoretical results provide the guarantees we need to claim that the samples we obtain achieve the goals our methods are designed for.

In addition, we will verify the applicability of other methods, such as the modified BFS methods, which would allow us to achieve different sampling goals. At this point we should elaborate on what we really mean by "sampling goal". Up to now our sampling goals were to obtain unbiased samples of the mid-range of the degree distribution and, in addition, to obtain all of the high-degree vertices of a graph. As part of our future work we will consider relaxing the latter goal to obtaining "most" of the high-degree vertices, but also expanding our goals to sampling triangles, clusters and expansion properties.

12 Graph Generation Models

In Section 5.2 we saw the preferential attachment model. This is a well-analyzed and well-understood model for generating graphs which share characteristics with many real online networks.
However, this model does not simulate all the characteristics apparent in such networks, such as community structure, constant-bounded diameter and edge densification. Models have been proposed that simulate these characteristics, and we have also suggested some models of our own. However, we believe there is much work to be done in this area, as most of the existing models still do not simulate some key characteristics of real-world networks, and in addition many of the models have not yet been analyzed. The analysis of these models is a very important aspect of our future research goals and, as we saw with the proposed WRW method, a good analysis of such models would allow us to transfer existing knowledge of graph sampling to those graph generation models.

Special interest will be given to models such as the implicit graph model seen in Section 7.4, since this model is able to produce power-law degree distributions with coefficient c < 2, which are hard to generate using other generative models. In addition, other graph generation methods based on affiliations will be used more extensively, since they tend to better match real-world network characteristics.
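For concreteness, the basic preferential attachment process recalled at the start of this section can be sketched as follows. This is a minimal undirected variant with one edge per new vertex; the function and parameter names are ours, not the report's:

```python
import random

def preferential_attachment(n, m=1, rng=random):
    """Grow an undirected graph by preferential attachment.

    Each new vertex attaches m edges to existing vertices chosen with
    probability proportional to their current degree. Choosing a
    uniform entry of the edge-endpoint multiset implements the
    degree-proportional selection without explicit degree bookkeeping.
    """
    endpoints = [0, 1]          # edge-endpoint multiset; start from edge (0, 1)
    edges = [(0, 1)]
    for v in range(2, n):
        for _ in range(m):
            u = rng.choice(endpoints)   # degree-proportional choice
            edges.append((v, u))
            endpoints.extend((v, u))
    return edges
```

Each vertex appears in `endpoints` once per incident edge, which is exactly what makes the uniform `choice` degree-proportional.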


In general there are several models which have shown promising results in the context of graph generation, including:

• Implicit graph model (Section 7.4)
• Random Walk Graph (Section 7.1)
• Grow-Back Connect-Densify (Section 7.3)
• Partitioned Preferential Attachment Model (Section 7.3)

Some additional methods have been proposed and, if need be, more models will be proposed during the rest of our research; however, we believe a simple and useful analysis can be carried out on the above models, which will provide proofs of the observed properties of these networks.

We now have a better understanding of the underlying processes that generate these networks; they include preferential attachment, triangle closing, edge copying and community creation around specific areas of interest. The areas of interest each person links to may include many different unrelated topics, which results in overlapping communities. It is our suspicion that the observed edge densification derives from a combination of the above phenomena and is a natural side-effect of user activity, an aspect of the generation process which current generation models cannot easily reproduce. Using these intuitions we can attempt to propose our own models. We must keep in mind, however, that a good generation model has the following general properties:

• It is as simple as possible
• It is rigorously analyzed
• It exhibits the required phenomena naturally rather than artificially

In general we are confident that some of our proposed methods will be analyzed and the interesting properties they exhibit will be proven. This would allow us to further control these generative models and build upon them.




13 Acronyms

SRW Simple Random Walk
P2P Peer-To-Peer
OSN Online Social Network
WWW World Wide Web
MHRW Metropolis-Hastings Random Walk
RWRW Re-Weighted Random Walk
WRW Weighted Random Walk
UAR Uniformly At Random
SCC Strongly Connected Component
WCC Weakly Connected Component
RDS Respondent-Driven Sampling
BFS Breadth-First Search
DFS Depth-First Search
whp With High Probability
UNI Uniform Sampling
KS Test Kolmogorov-Smirnov Test
BSRW Biased Seeded Random Walk
c.d.f. Cumulative Distribution Function


Part VI
Appendix

A Sampling Manufactured Graphs: BFS Tree Filtering

In this section we present some results on crawling artificially generated graphs. The first graph crawled was generated with the directed preferential attachment model and had 1M vertices and 15M edges (m = 1 edge per step, new-vertex probability p = 0.05).

This sampling was done during the initial steps of the research process and is something we wish to revisit in our next steps; however, as no progress has been made between the 9-month report and today, we include it in the appendix.

A.1 Crawler Definitions

For clarification, we define the following notation, commonly used in the literature:

Inspected vertices set V_i The set of vertices u_i which have been examined using a specific policy to determine their adjacent vertices. These policies include:

• Out-Edge Direction
• Bidirectional
• Undirected

Observed vertices list V_o The list of vertices u_o which have been 'found' adjacent to an inspected vertex u_i.
Vertices are considered 'found' if, under the current inspection policy, an edge is found between u_i and u_o.

Observed edges E_s The set of all edges e_s which have been obtained by inspecting vertices u_i.

Adjacency list L The list containing all inspected vertices u_i; for each vertex u_i it holds a list of all vertices u ∈ V_i ∪ V_o for which there exists an edge e_s = (u_i, u).

Crawler steps s The number of times the crawler has obtained a vertex and attempted to inspect it.

Vertex frequency f(v) The number of times vertex v occurs in V_o.

A.2 Measured properties

We define the following properties:

Induced sub-graph G[V_i] The graph induced by the inspected vertices.

Derived sub-graph G_s The graph G[V_i] extended to include all vertices and edges in E_s.

Graph density D The fraction of edges present in the graph over the number of edges in a directed, loopless, simple complete graph with the same number of vertices. More formally:

D = |E| / (|V|(|V| − 1))    (A.1)


Vertex coverage C_V The percentage of vertices discovered over the total number of vertices in the target graph:

C_V = |V_s| / |V|    (A.2)

Edge coverage C_E The percentage of edges discovered over the total number of edges in the target graph:

C_E = |E_s| / |E|    (A.3)

Success rate R_s The number of inspected vertices |V_i| over the total number of crawler steps s:

R_s = |V_i| / s    (A.4)

Density proximity C_D We define density proximity as:

C_D = D_s / D    (A.5)

where D is the density of the original graph and D_s is the density of the sampled graph. We measure D_s as:

D_s = |E_s| / (|V_i|(|V_i| − 1) + |V_s \ V_i||V_i|)    (A.6)

where |V_i|(|V_i| − 1) + |V_s \ V_i||V_i| is the maximum possible number of edges we could have observed via crawling.

A.2.1 Crawlers With Memory

A crawling method with memory uses some form of secondary storage to record information about the data it has received so far, and decides on its next step based on all stored information.

• Our crawler stores the sets V_i, V_o and L.
• We start with an initial set of vertices s_0, ..., s_i, which are our seeds, and add them to V_o. In this particular case only a single seed was used, identical for all methods.
• At each step we select a vertex v from V_o using a certain selection policy (Section A.3).
• If v ∉ V_i, we remove v from V_o and inspect it using a specific inspection policy (Section A.1).
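The coverage and density measures defined above can be computed directly from a crawl's output. A sketch in our own naming, which assumes V_i ⊆ V_s (every inspected vertex appears in at least one observed edge):

```python
def sample_metrics(n_vertices, n_edges, inspected, observed_edges, steps):
    """Compute the crawl-quality measures (A.1)-(A.6).

    n_vertices, n_edges: |V| and |E| of the directed target graph.
    inspected:      set V_i of inspected vertices.
    observed_edges: set E_s of observed edges (u, v).
    steps:          crawler steps s.
    """
    # V_s: every vertex appearing in an observed edge (assumes V_i subset of V_s).
    seen = {u for edge in observed_edges for u in edge}
    vi, vs = len(inspected), len(seen)
    density = n_edges / (n_vertices * (n_vertices - 1))   # (A.1)
    coverage_v = vs / n_vertices                          # (A.2)
    coverage_e = len(observed_edges) / n_edges            # (A.3)
    success = vi / steps                                  # (A.4)
    max_observable = vi * (vi - 1) + (vs - vi) * vi       # denominator of (A.6)
    density_s = len(observed_edges) / max_observable      # (A.6)
    proximity = density_s / density                       # (A.5)
    return {"C_V": coverage_v, "C_E": coverage_e,
            "R_s": success, "C_D": proximity}
```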
V_o may include vertices which are already in V_i, and may also contain duplicates.
• We then add v to V_i and consider it inspected.
• We repeat the process until |V_i| reaches a satisfactory size.

Our crawler has no concept of position: it simply selects a vertex it has seen at some point during its inspections and chooses to inspect it, which can be seen as prioritizing the BFS tree. Additionally, the sampling does not allow vertices to be revisited.
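The crawl loop described by the steps above can be sketched as follows; `inspect` stands in for whatever inspection policy is in use, and `select` for a selection policy such as those of Section A.3 (both callback names, and the convention that `select` returns a list index, are our own assumptions):

```python
def crawl(seeds, inspect, select, target_size):
    """Memory-based crawler: BFS-queue traversal with a pluggable
    selection policy deciding which observed vertex to inspect next."""
    observed = list(seeds)   # V_o: may hold duplicates and inspected vertices
    inspected = set()        # V_i
    adjacency = {}           # L: inspected vertex -> neighbours found for it
    while observed and len(inspected) < target_size:
        v = observed.pop(select(observed))   # selection policy picks an index
        if v in inspected:                   # no revisits
            continue
        neighbours = inspect(v)              # inspection policy query
        observed.extend(neighbours)          # newly observed vertices
        adjacency[v] = list(neighbours)
        inspected.add(v)
    return inspected, adjacency
```

With `select = lambda vo: 0` this is a plain BFS; `lambda vo: len(vo) - 1` gives a DFS.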


A.3 Selection Policies

By a selection policy we mean the method used to select a vertex from the list of seen vertices. We have experimented with several, including:

Breadth-First Samples the first item in V_o; essentially the well-known Breadth-First Search.

Depth-First Samples the last item in V_o; essentially the well-known Depth-First Search.

Biased Random Samples an item from V_o uniformly at random.

Unbiased Random Samples uniformly at random from the distinct items of V_o; only the unique items in the list are considered.

Hypothetical Greedy Samples the item u with the highest frequency f(u) in V_o. If multiple items have the same frequency, the first one is selected.

Least Discovered Samples the item u with the lowest frequency f(u) in V_o. If multiple items have the same frequency, the first one is selected.

Figure 24: Growth pattern of all methods at 1000, 2000, ..., 64000 vertices. (Panels: (a) Breadth-First, (b) Depth-First, (c) Biased Random, (d) Unbiased Random, (e) Hypothetical Greedy, (f) Least Observed.)
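The six policies above can be written as small functions that each return an index into the observed list V_o, matching the definitions just given. This is a sketch in our own naming, not the report's code:

```python
import random
from collections import Counter

def breadth_first(vo):                 # first item: BFS order
    return 0

def depth_first(vo):                   # last item: DFS order
    return len(vo) - 1

def biased_random(vo, rng=random):     # uniform over list positions, hence
    return rng.randrange(len(vo))      # biased toward frequently seen items

def unbiased_random(vo, rng=random):   # uniform over *distinct* items
    return vo.index(rng.choice(sorted(set(vo))))

def hypothetical_greedy(vo):           # highest-frequency item;
    freq = Counter(vo)                 # ties broken by list position
    return vo.index(max(vo, key=lambda u: freq[u]))

def least_discovered(vo):              # lowest-frequency item;
    freq = Counter(vo)                 # ties broken by list position
    return vo.index(min(vo, key=lambda u: freq[u]))
```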


A.4 Method Comparison

In the following table we present the results of the measurements of the methods described above.

C_D     Method                C_V %    C_E %    R_s %
38.96   BFS                   40.14    88.01    28.02
38.87   Biased Random         40.10    87.71    20.98
33.25   DFS                   37.28    69.75    80.38
40.09   Hypothetical Greedy   40.61    91.63    48.74
12.22   Least Discovered      10.58     7.27    74.66
21.28   Unbiased Random       25.29    30.28    93.62

Figure 25: Comparison of the degree-frequency distributions of the different methods at 64000 visited vertices.

The figure shows the resulting degree-frequency plot; the red curve is the real graph. In the lower degrees, Depth-First (fuchsia) is the most effective, followed by Unbiased Random (black) and Least Discovered (yellow), which also seem to stabilize at a slope similar to the real graph's. Biased Random (blue) and BFS (green) are very similar: both seem to avoid low-degree vertices and share the same high-degree bias. Hypothetical Greedy also succeeds in avoiding low-degree vertices and has a bias towards high degrees, as expected, since the graph generation method has the property of nearly symmetric in- and out-degrees. The general elevation observed at the ends of the plots, compared to the plot of the original graph, is because the degree frequency for degrees with a single observation depends on the total number of vertices.
