08.10.2016 Views

Foundations of Data Science

2dLYwbK


12.4.7 Median

One often calculates the average value of a random variable to get a feeling for the magnitude of the variable. This is reasonable when the probability distribution of the variable is Gaussian, or has a small variance. However, if there are outliers, the average may be distorted by them. An alternative to calculating the expected value is to calculate the median, the value for which half of the probability is above and half is below.
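As a small illustration of how an outlier affects the two summaries, the sketch below compares the sample mean and sample median of a made-up data set. The numbers in `values` are hypothetical, and the sample median is used here as an empirical stand-in for the median of a distribution.

```python
import statistics

# Hypothetical sample: nine typical values plus one large outlier.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]

mean_value = statistics.mean(values)      # pulled upward by the outlier
median_value = statistics.median(values)  # middle of the sorted values

print("mean:  ", mean_value)
print("median:", median_value)
```

The single outlier moves the mean far from the typical values, while the median stays among them.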

12.4.8 The Central Limit Theorem

Let $s = x_1 + x_2 + \cdots + x_n$ be a sum of $n$ independent random variables where each $x_i$ has probability distribution

$$x_i = \begin{cases} 0 & \text{with probability } \tfrac{1}{2} \\ 1 & \text{with probability } \tfrac{1}{2}. \end{cases}$$

The expected value of each $x_i$ is $1/2$ with variance

$$\sigma_i^2 = \left(\tfrac{1}{2} - 0\right)^2 \tfrac{1}{2} + \left(\tfrac{1}{2} - 1\right)^2 \tfrac{1}{2} = \tfrac{1}{4}.$$

The expected value of $s$ is $n/2$ and, since the variables are independent, the variance of the sum is the sum of the variances and hence is $n/4$. How concentrated $s$ is around its mean depends on the standard deviation of $s$, which is $\frac{\sqrt{n}}{2}$. For $n = 100$ the expected value of $s$ is 50 with a standard deviation of 5, which is 10% of the mean. For $n = 10{,}000$ the expected value of $s$ is 5,000 with a standard deviation of 50, which is 1% of the mean. Note that as $n$ increases, the standard deviation increases, but the ratio of the standard deviation to the mean goes to zero. More generally, if the $x_i$ are independent and identically distributed, each with standard deviation $\sigma$, then the standard deviation of $x_1 + x_2 + \cdots + x_n$ is $\sqrt{n}\,\sigma$. So $\frac{x_1 + x_2 + \cdots + x_n}{\sqrt{n}}$ has standard deviation $\sigma$. The central limit theorem makes a stronger assertion: for large $n$, the distribution of $\frac{x_1 + x_2 + \cdots + x_n}{\sqrt{n}}$ is in fact approximately Gaussian with standard deviation $\sigma$.
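The arithmetic for the sum of fair 0/1 variables can be checked with a short sketch. The function name `bernoulli_sum_stats` is an illustrative assumption and not part of the text. It simply evaluates the formulas $n/2$ and $\sqrt{n}/2$ for a few values of $n$.

```python
import math

def bernoulli_sum_stats(n):
    """Mean, standard deviation, and their ratio for a sum of n fair 0/1 variables."""
    mean = n / 2                    # each x_i has expected value 1/2
    variance = n / 4                # variances of independent variables add
    std_dev = math.sqrt(variance)   # equals sqrt(n) / 2
    return mean, std_dev, std_dev / mean

for n in (100, 10_000, 1_000_000):
    mean, std_dev, ratio = bernoulli_sum_stats(n)
    print(f"n={n}: mean={mean}, std={std_dev}, std/mean={ratio:.4f}")
```

The standard deviation grows like $\sqrt{n}$ while the mean grows like $n$, so their ratio shrinks like $1/\sqrt{n}$.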

Theorem 12.2 Suppose $x_1, x_2, \ldots, x_n$ is a sequence of identically distributed independent random variables, each with mean $\mu$ and variance $\sigma^2$. The distribution of the random variable

$$\frac{1}{\sqrt{n}}\left(x_1 + x_2 + \cdots + x_n - n\mu\right)$$

converges to the distribution of the Gaussian with mean 0 and variance $\sigma^2$.
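A simulation can make the statement of the theorem more concrete. The sketch below is only an empirical check under stated assumptions, not a proof. It takes each $x_i$ uniform on $[0, 1]$ (so $\mu = 1/2$ and $\sigma^2 = 1/12$), and the values of `n`, `trials`, and the seed are arbitrary choices. It then compares the normalized sum with a Gaussian of mean 0 and variance $\sigma^2$, for which about 68.3% of the probability lies within one standard deviation of the mean.

```python
import math
import random
import statistics

MU = 0.5                      # mean of a uniform [0, 1] variable
SIGMA = math.sqrt(1 / 12)     # its standard deviation

def normalized_sum(n, rng):
    """Return (x_1 + ... + x_n - n*MU) / sqrt(n) for uniform [0, 1] draws."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * MU) / math.sqrt(n)

rng = random.Random(0)        # fixed seed so repeated runs agree
n, trials = 50, 5000
samples = [normalized_sum(n, rng) for _ in range(trials)]

sample_mean = statistics.mean(samples)
sample_var = statistics.pvariance(samples)
# Fraction of samples within one standard deviation of zero.
within_one_sigma = sum(abs(z) <= SIGMA for z in samples) / trials

print("sample mean:     ", sample_mean)
print("sample variance: ", sample_var, "(sigma^2 =", SIGMA ** 2, ")")
print("within one sigma:", within_one_sigma)
```

The individual $x_i$ are far from Gaussian, yet the sample mean, sample variance, and one-sigma fraction of the normalized sum should land near the Gaussian values 0, $1/12$, and roughly 0.683.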

12.4.9 Probability Distributions<br />

The Gaussian or normal distribution<br />
