Download Chapters 3-6 (.PDF) - ODBMS

6. DISCUSSION—POWER LAWS AND DEVIATIONS

where μ and σ are parameters and A(μ, σ) is a constant (used for normalization if y(x) is a probability distribution). The DGX distribution has been used to fit the degree distribution of a bipartite “clickstream” graph linking websites and users (Figure 2.2(c)), telecommunications data, and other data.

6.2.3 DOUBLY-PARETO LOGNORMAL (DPLN)

Another deviation is well modeled by the so-called Doubly Pareto Lognormal (dPln) distribution. Mitzenmacher [210] obtained good fits for file size distributions using dPln. Seshadri et al. [245] studied the distribution of phone calls per customer and also found dPln to be a good fit. We describe the results of Seshadri et al. below.

Informally, a random variable that follows the dPln distribution looks like the plots of Figure 6.1: in log-log scales, the distribution is approximated by two lines that meet in the middle of the plot. More specifically, Figure 6.1 shows the empirical pdf (that is, the density histogram) for a switch in a telephone company, over a time period of several months.

Plot (a) gives the distribution of the number of distinct partners (“callees”) per customer. The overwhelming majority of customers called only one person; up to about 80-100 “callees,” a power law seems to fit well; beyond that, there is a sudden drop, following a power law with a different slope. This is exactly the behavior of the dPln: piece-wise linear, in log-log scales.

Similarly, Figure 6.1(b) shows the empirical pdf for the count of phone calls per customer: again, the vast majority of customers make just one phone call, and the behavior is piece-wise linear, with the “knee” at around 200 phone calls. Figure 6.1(c) shows the empirical pdf for the count of minutes per customer. The qualitative behavior is the same: piece-wise linear, in log-log scales. Additional plots from the same source ([245]) showed similar behavior for several other switches and several other time intervals.
In fact, the dataset in [245] included four switches, over month-long periods; each switch recorded calls made to and from callers who were physically present in a contiguous geographical area.

Figure 6.1: Results of using dPln to model (a) the number of call-partners, (b) the number of calls made, and (c) the total duration (in minutes) talked, by users at a telephone-company switch, during a given time period. [Three log-log panels of PDF versus Partners, Calls, and Duration, each showing the data and a fitted curve: (a) pdf of partners, DPLN[2.8, 0.01, 0.35, 3.8]; (b) pdf of calls, DPLN[2.8, 0.01, 0.55, 5.6]; (c) pdf of minutes, DPLN[2.5, 0.01, 0.45, 6.5].]
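The empirical pdfs of Figure 6.1 are density histograms viewed in log-log scales. The following is a minimal sketch of how such a histogram might be computed. It assumes logarithmically spaced bins, a common choice for heavy-tailed counts that the text does not specify; the function name `log_binned_pdf`, the parameter `bins_per_decade`, and the synthetic Pareto sample are all illustrative and are not the telephone data of [245].

```python
import math
import random


def log_binned_pdf(values, bins_per_decade=5):
    """Density histogram over logarithmically spaced bins.

    Returns (edges, density), where density[i] estimates the pdf on
    the interval from edges[i] to edges[i + 1].
    """
    data = [v for v in values if v > 0]  # a log scale needs positive values
    lo = math.floor(math.log10(min(data)))
    hi = max(math.ceil(math.log10(max(data))), lo + 1)
    n_bins = (hi - lo) * bins_per_decade
    edges = [10 ** (lo + i / bins_per_decade) for i in range(n_bins + 1)]

    counts = [0] * n_bins
    for v in data:
        i = int((math.log10(v) - lo) * bins_per_decade)
        counts[min(i, n_bins - 1)] += 1  # the largest value goes in the last bin

    total = len(data)
    # Divide by the bin width so that the histogram integrates to 1.
    density = [c / (total * (edges[i + 1] - edges[i]))
               for i, c in enumerate(counts)]
    return edges, density


# Synthetic heavy-tailed "counts per customer" (illustrative only).
random.seed(0)
sample = [int(random.paretovariate(1.5)) for _ in range(10_000)]
edges, density = log_binned_pdf(sample)
```

Plotting `density` against the geometric midpoints of `edges` on log-log axes gives a plot in the style of Figure 6.1; a change of slope in such a plot is the “knee” discussed above.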
For each customer, the following counts were computed:

Partners: The total number of unique callers and callees associated with each user. Note that this is essentially the degree of nodes in the (undirected and unweighted) social graph, which has an edge between two users if either called the other.

Calls: The total number of calls made or received by each user. In graph-theoretic terms, this is the weighted degree in the social graph where the weight of an edge between two users is equal to the number of calls that involved them both.

Duration: The total duration of calls for each customer in minutes. This is the weighted degree in the social graph where the weight of the edge between two users is the total duration of the calls between them.

The dPln Distribution

For the intuition and the fitting of the dPln distribution, see the work of Reed [238]. Here, we mention only the highlights. If a random variable X follows the four-parameter dPln distribution dPln(α, β, ν, τ), then the complete distribution is given by:

$$ f_X(x) = \frac{\alpha\beta}{\alpha+\beta} \left[ e^{\alpha\nu + \alpha^2\tau^2/2} \, x^{-\alpha-1} \, \Phi\!\left(\frac{\log x - \nu - \alpha\tau^2}{\tau}\right) + x^{\beta-1} \, e^{-\beta\nu + \beta^2\tau^2/2} \, \Phi^c\!\left(\frac{\log x - \nu + \beta\tau^2}{\tau}\right) \right], \qquad (6.5) $$

where Φ and Φ^c are the CDF and complementary CDF of N(0, 1) (Gaussian with zero mean and unit variance). Intuitively, the pdf looks like a piece-wise linear curve in log-log scales; the parameters α and β determine the slopes of the (asymptotically) linear parts, and ν and τ determine the knee of the curve.

One easier way of understanding the double-Pareto nature of X is by observing that X can be written as

$$ X = S_0 \, \frac{V_1}{V_2}, $$

where S_0 is lognormally distributed, and V_1 and V_2 are Pareto-distributed with parameters α and β. Note that X has a mean that is finite only if α > 1, in which case it is given by

$$ E[X] = \frac{\alpha\beta}{(\alpha - 1)(\beta + 1)} \, e^{\nu + \tau^2/2}. $$

In conclusion, the distinguishing features of the dPln distribution are two linear sub-plots in the log-log scale and a hyperbolic middle section.
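The density of Eq. (6.5), the mean formula, and the product representation of X can be checked against each other numerically. The sketch below uses only the Python standard library and rests on several assumptions: it reads the representation as the ratio X = S0 · V1 / V2 with independent factors, takes the exponent in the second term of Eq. (6.5) to be −βν + β²τ²/2 (the form symmetric to the first term), and uses Pareto variables supported on [1, ∞). The function names and the parameter values in the usage lines are illustrative and are not the fitted values of Figure 6.1.

```python
import math
import random


def std_normal_cdf(z):
    """CDF of N(0, 1)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def std_normal_ccdf(z):
    """Complementary CDF of N(0, 1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def dpln_pdf(x, alpha, beta, nu, tau):
    """Density of dPln(alpha, beta, nu, tau) at x > 0, as read from Eq. (6.5)."""
    log_x = math.log(x)
    # Term that dominates the right (large-x) part of the curve.
    upper = (math.exp(alpha * nu + alpha**2 * tau**2 / 2)
             * x ** (-alpha - 1)
             * std_normal_cdf((log_x - nu - alpha * tau**2) / tau))
    # Term that dominates the left (small-x) part of the curve.
    lower = (math.exp(-beta * nu + beta**2 * tau**2 / 2)
             * x ** (beta - 1)
             * std_normal_ccdf((log_x - nu + beta * tau**2) / tau))
    return alpha * beta / (alpha + beta) * (upper + lower)


def dpln_mean(alpha, beta, nu, tau):
    """Mean of dPln; finite only if alpha > 1."""
    if alpha <= 1:
        return math.inf
    return (alpha * beta / ((alpha - 1) * (beta + 1))
            * math.exp(nu + tau**2 / 2))


def sample_dpln(alpha, beta, nu, tau, rng):
    """One draw of X = S0 * V1 / V2 (assumed independent factors)."""
    s0 = rng.lognormvariate(nu, tau)  # lognormal: log S0 ~ N(nu, tau^2)
    v1 = rng.paretovariate(alpha)     # Pareto on [1, inf) with shape alpha
    v2 = rng.paretovariate(beta)      # Pareto on [1, inf) with shape beta
    return s0 * v1 / v2


# Illustrative parameters; alpha > 2 keeps the variance finite.
params = (3.0, 1.5, 1.0, 0.5)
rng = random.Random(42)
draws = [sample_dpln(*params, rng) for _ in range(100_000)]
print("empirical mean:", sum(draws) / len(draws))
print("analytic mean: ", dpln_mean(*params))
```

Under these assumptions, the empirical mean of the draws should approach the analytic mean as the sample grows, and the density should integrate to approximately 1; evaluating `dpln_pdf` on a logarithmic grid shows the two asymptotically linear parts in log-log scales.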
