02.04.2013 Views

Download Chapters 3-6 (.PDF) - ODBMS

Download Chapters 3-6 (.PDF) - ODBMS

Download Chapters 3-6 (.PDF) - ODBMS

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

CHAPTER 6<br />

Discussion—Power Laws and<br />

Deviations<br />

6.1 POWER LAWS—SLOPE ESTIMATION<br />

We saw many power laws in the previous sections. Here we describe how to estimate the slope of a<br />

power law, and how to estimate the goodness of fit. We discuss these issues below, using the detection<br />

of power laws in degree distributions as an example.<br />

Computing the power-law exponent: This is no simple task: the power law could be only in the<br />

tail of the distribution and not over the entire distribution, estimators of the power-law exponent<br />

could be biased, some required assumptions may not hold, and so on. Several methods are currently<br />

employed, though there is no clear “winner” at present.<br />

1. Linear regression on the log-log scale: We could plot the data on a log-log scale, then optionally<br />

“bin” them into equal-sized buckets, and finally find the slope of the linear fit. However, there<br />

are at least three problems: (i) this can lead to biased estimates [130], (ii) sometimes the power<br />

law is only in the tail of the distribution, and the point where the tail begins needs to be<br />

hand-picked, and (iii) the right end of the distribution is very noisy [215]. However, this is<br />

the simplest technique, and seems to be the most popular one.<br />

2. Linear regression after logarithmic binning: This is the same as above, but the bin widths increase<br />

exponentially as we go toward the tail. In other words, the number of data points in each bin<br />

is counted, and then the height of each bin is divided by its width to normalize. Plotting the<br />

histogram on a log-log scale would make the bin sizes equal, and the power law can be fitted to<br />

the heights of the bins.This reduces the noise in the tail buckets, fixing problem (iii). However,<br />

binning leads to loss of information; all that we retain in a bin is its average. In addition, issues<br />

(i) and (ii) still exist.<br />

3. Regression on the cumulative distribution: We convert the pdf p(x) (that is, the scatter plot) into<br />

a cumulative distribution F(x):<br />

∞ ∞<br />

F(x) = P(X≥ x) = p(z) = Az −γ<br />

z=x<br />

z=x<br />

(6.1)<br />

35

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!