Strategic Practice and Homework 10 - Projects at Harvard
Stat 110 Strategic Practice 10 Solutions, Fall 2011
Prof. Joe Blitzstein (Department of Statistics, Harvard University)

1 Conditional Expectation & Conditional Variance

1. Show that $E((Y - E(Y|X))^2 \mid X) = E(Y^2|X) - (E(Y|X))^2$, so these two expressions for $\mathrm{Var}(Y|X)$ agree.

This is the conditional version of the fact that
$$\mathrm{Var}(Y) = E((Y - E(Y))^2) = E(Y^2) - (E(Y))^2,$$
and so must be true since conditional expectations are expectations, just as conditional probabilities are probabilities. Algebraically, letting $g(X) = E(Y|X)$, we have
$$E((Y - g(X))^2 \mid X) = E(Y^2 - 2Yg(X) + g(X)^2 \mid X) = E(Y^2|X) - 2E(Yg(X)|X) + E(g(X)^2|X),$$
and $E(Yg(X)|X) = g(X)E(Y|X) = g(X)^2$ and $E(g(X)^2|X) = g(X)^2$ by taking out what's known, so the right-hand side above simplifies to $E(Y^2|X) - g(X)^2$.

2. Prove Eve's Law.

We will show that $\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X)) + \mathrm{Var}(E(Y|X))$. Let $g(X) = E(Y|X)$. By Adam's Law, $E(g(X)) = E(Y)$. Then
$$E(\mathrm{Var}(Y|X)) = E(E(Y^2|X) - g(X)^2) = E(Y^2) - E(g(X)^2),$$
$$\mathrm{Var}(E(Y|X)) = E(g(X)^2) - (E(g(X)))^2 = E(g(X)^2) - (E(Y))^2.$$
Adding these equations, we have Eve's Law.

3. Let $X$ and $Y$ be random variables with finite variances, and let $W = Y - E(Y|X)$. This is a residual: the difference between the true value of $Y$ and the predicted value of $Y$ based on $X$.

(a) Compute $E(W)$ and $E(W|X)$.

Adam's Law (iterated expectation), taking out what's known, and linearity give
$$E(W) = E(Y) - E(E(Y|X)) = E(Y) - E(Y) = 0,$$
$$E(W|X) = E(Y|X) - E(E(Y|X)|X) = E(Y|X) - E(Y|X) = 0.$$
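As a numerical sanity check of Eve's Law and the residual computations above, here is a minimal Monte Carlo sketch (assuming NumPy; the model $Y = X^2 + \epsilon$ is an arbitrary illustrative choice, for which $E(Y|X) = X^2$ and $\mathrm{Var}(Y|X) = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Illustrative model: X ~ N(0,1), Y = X^2 + eps with eps ~ N(0,1) independent,
# so E(Y|X) = X^2 and Var(Y|X) = 1.
X = rng.standard_normal(n)
Y = X**2 + rng.standard_normal(n)

g = X**2      # E(Y|X), known in closed form for this model
W = Y - g     # the residual from problem 3

# Eve's Law: Var(Y) = E(Var(Y|X)) + Var(E(Y|X)) = 1 + Var(X^2) = 1 + 2 = 3.
print(np.var(Y))      # ~ 3.0
print(1 + np.var(g))  # ~ 3.0, matching the decomposition
print(np.mean(W))     # ~ 0.0, as in problem 3(a)
```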


Stat 110 Penultimate Homework, Fall 2011
Prof. Joe Blitzstein (Department of Statistics, Harvard University)

1. Judit plays in a total of $N \sim \mathrm{Geom}(s)$ chess tournaments in her career. Suppose that in each tournament she has probability $p$ of winning the tournament, independently. Let $T$ be the number of tournaments she wins in her career.

(a) Find the mean and variance of $T$.

(b) Find the MGF of $T$. What is the name of this distribution (with its parameters)?

2. Let $X_1, X_2$ be i.i.d., and let $\bar{X} = \frac{1}{2}(X_1 + X_2)$. In many statistics problems, it is useful or important to obtain a conditional expectation given $\bar{X}$. As an example of this, find $E(w_1 X_1 + w_2 X_2 \mid \bar{X})$, where $w_1, w_2$ are constants with $w_1 + w_2 = 1$.

3. A certain stock has low volatility on some days and high volatility on other days. Suppose that the probability of a low volatility day is $p$ and of a high volatility day is $q = 1 - p$, and that on low volatility days the percent change in the stock price is $\mathcal{N}(0, \sigma_1^2)$, while on high volatility days the percent change is $\mathcal{N}(0, \sigma_2^2)$, with $\sigma_1 < \sigma_2$.

Let $X$ be the percent change of the stock on a certain day. The distribution is said to be a mixture of two Normal distributions, and a convenient way to represent $X$ is as $X = I_1 X_1 + I_2 X_2$, where $I_1$ is the indicator r.v. of having a low volatility day, $I_2 = 1 - I_1$, $X_j \sim \mathcal{N}(0, \sigma_j^2)$, and $I_1, X_1, X_2$ are independent.

(a) Find the variance of $X$ in two ways: using Eve's Law, and by calculating $\mathrm{Cov}(I_1 X_1 + I_2 X_2, I_1 X_1 + I_2 X_2)$ directly.

(b) The kurtosis of a r.v. $Y$ with mean $\mu$ and standard deviation $\sigma$ is defined by
$$\mathrm{Kurt}(Y) = \frac{E(Y - \mu)^4}{\sigma^4} - 3.$$
This is a measure of how heavy-tailed the distribution of $Y$ is. Find the kurtosis of $X$ (in terms of $p, q, \sigma_1^2, \sigma_2^2$, fully simplified). The result will show that even though the kurtosis of any Normal distribution is 0, the kurtosis of $X$ is positive and in fact can be very large depending on the parameter values.

4. We wish to estimate an unknown parameter $\theta$, based on a r.v. $X$ we will get to observe. As in the Bayesian perspective, assume that $X$ and $\theta$ have a joint distribution. Let $\hat{\theta}$ be the estimator (which is a function of $X$). Then $\hat{\theta}$ is said to be unbiased if $E(\hat{\theta}|\theta) = \theta$, and $\hat{\theta}$ is said to be the Bayes procedure if $E(\theta|X) = \hat{\theta}$.


(a) Let $\hat{\theta}$ be unbiased. Find $E(\hat{\theta} - \theta)^2$ (the average squared difference between the estimator and the true value of $\theta$), in terms of marginal moments of $\hat{\theta}$ and $\theta$.

Hint: condition on $\theta$.

(b) Repeat (a), except in this part suppose that $\hat{\theta}$ is the Bayes procedure rather than assuming that it is unbiased.

Hint: condition on $X$.

(c) Show that it is impossible for $\hat{\theta}$ to be both the Bayes procedure and unbiased, except in silly problems where we get to know $\theta$ perfectly by observing $X$.

Hint: if $Y$ is a nonnegative r.v. with mean 0, then $P(Y = 0) = 1$.

5. The surprise of learning that an event with probability $p$ happened is defined as $\log_2(1/p)$ (where the log is base 2 so that if we learn that an event of probability 1/2 happened, the surprise is 1, which corresponds to having received 1 bit of information). Let $X$ be a discrete r.v. whose possible values are $a_1, a_2, \dots, a_n$ (distinct real numbers), with $P(X = a_j) = p_j$ (where $p_1 + p_2 + \dots + p_n = 1$).

The entropy of $X$ is defined to be the average surprise of learning the value of $X$, i.e., $H(X) = \sum_{j=1}^n p_j \log_2(1/p_j)$. This concept was used by Shannon to create the field of information theory, which is used to quantify information and has become essential for communication and compression (e.g., MP3s and cell phones).

(a) Explain why $H(X^3) = H(X)$, and give an example where $H(X^2) \neq H(X)$.

(b) Show that the maximum possible entropy for $X$ is when its distribution is uniform over $a_1, a_2, \dots, a_n$, i.e., $P(X = a_j) = 1/n$. (This should make sense intuitively: learning the value of $X$ conveys the most information on average when $X$ is equally likely to take any of its values, and the least possible information if $X$ is a constant.)

Hint: this can be done by Jensen's inequality, without any need for calculus. To do so, consider a r.v. $Y$ whose possible values are the probabilities $p_1, \dots, p_n$, and show why $E(\log_2(1/Y)) \le \log_2(E(1/Y))$ and how to interpret it.

6. In a national survey, a random sample of people are chosen and asked whether they support a certain policy. Assume that everyone in the population is equally likely to be surveyed at each step, and that the sampling is with replacement (sampling without replacement is typically more realistic, but with replacement will be a good approximation if the sample size is small compared to the population size). Let $n$ be the sample size, and let $\hat{p}$ and $p$ be the proportion of people who support the policy in the sample and in the entire population, respectively. Show that for every $c > 0$,
$$P(|\hat{p} - p| > c) \le \frac{1}{4nc^2}.$$
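To make the entropy definition in problem 5 concrete before turning to the solutions, here is a small sketch of computing $H(X)$ for a few distributions (assuming NumPy; the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """H(X) = sum_j p_j * log2(1/p_j): the average surprise of learning X."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p_j = 0 contribute 0 by convention
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))                # 1.0 bit: a fair coin flip
print(entropy([0.25] * 4))                # 2.0 bits: uniform on 4 values, log2(4)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits, far below log2(4) = 2
```

This matches the intuition in 5(b): the uniform distribution on 4 values attains the maximum $\log_2(4) = 2$ bits, while a highly concentrated distribution conveys little information on average.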


Stat 110 Penultimate Homework Solutions, Fall 2011
Prof. Joe Blitzstein (Department of Statistics, Harvard University)

1. Judit plays in a total of $N \sim \mathrm{Geom}(s)$ chess tournaments in her career. Suppose that in each tournament she has probability $p$ of winning the tournament, independently. Let $T$ be the number of tournaments she wins in her career.

(a) Find the mean and variance of $T$.

We have $T|N \sim \mathrm{Bin}(N, p)$. By Adam's Law,
$$E(T) = E(E(T|N)) = E(Np) = p(1-s)/s.$$
By Eve's Law,
$$\mathrm{Var}(T) = E(\mathrm{Var}(T|N)) + \mathrm{Var}(E(T|N)) = E(Np(1-p)) + \mathrm{Var}(Np)$$
$$= p(1-p)(1-s)/s + p^2(1-s)/s^2 = \frac{p(1-s)(s + (1-s)p)}{s^2}.$$

(b) Find the MGF of $T$. What is the name of this distribution (with its parameters)?

Let $I_j \sim \mathrm{Bern}(p)$ be the indicator of Judit winning the $j$th tournament. Then, writing $q = 1 - p$,
$$E(e^{tT}) = E(E(e^{tT}|N)) = E((pe^t + q)^N) = s \sum_{n=0}^{\infty} (pe^t + 1 - p)^n (1-s)^n = \frac{s}{1 - (1-s)(pe^t + 1 - p)}.$$
This is reminiscent of the Geometric MGF used on HW 9. The famous discrete distributions we have studied whose possible values are $0, 1, 2, \dots$ are the Poisson, Geometric, and Negative Binomial, and $T$ is clearly not Poisson since its variance doesn't equal its mean. If $T \sim \mathrm{Geom}(\theta)$, we have $\theta = \frac{s}{s + p(1-s)}$, as found by setting $E(T) = \frac{1-\theta}{\theta}$ or by finding $\mathrm{Var}(T)/E(T)$. Writing the MGF of $T$ as
$$E(e^{tT}) = \frac{s}{s + (1-s)p - (1-s)pe^t} = \frac{s}{s + (1-s)p} \cdot \frac{1}{1 - \frac{(1-s)p}{s + (1-s)p} e^t},$$
we see that $T \sim \mathrm{Geom}(\theta)$, with $\theta = \frac{s}{s + (1-s)p}$. Note that this is consistent with (a).

The distribution of $T$ can also be obtained by a story proof. Imagine that just before each tournament she may play in, Judit retires with probability $s$ (if she retires, she does not play in that or future tournaments). Her tournament history can be written as a sequence of W (win), L (lose), R (retire), ending in the first R, where the probabilities of W, L, R are $(1-s)p$, $(1-s)(1-p)$, $s$ respectively. For calculating $T$, the losses can be ignored: we want to count the number of W's before the R. The probability that a result is R given that it is W or R is $\frac{s}{s + (1-s)p}$, so we again have $T \sim \mathrm{Geom}\left(\frac{s}{s + (1-s)p}\right)$.
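As a quick check of both the mean/variance in (a) and the Geometric distribution found in (b), here is a simulation sketch (assuming NumPy, with illustrative parameter values; NumPy's geometric counts trials up to and including the first success, so we subtract 1 to match the $\{0, 1, 2, \dots\}$ convention used here):

```python
import numpy as np

rng = np.random.default_rng(0)
trials, s, p = 10**6, 0.1, 0.3   # illustrative parameter values

# N ~ Geom(s) on {0,1,2,...}: numpy counts the success itself, so subtract 1.
N = rng.geometric(s, size=trials) - 1
T = rng.binomial(N, p)           # T | N ~ Bin(N, p)

theta = s / (s + (1 - s) * p)
print(T.mean(), p * (1 - s) / s)                        # mean from (a): ~2.7
print(T.var(), p * (1 - s) * (s + (1 - s) * p) / s**2)  # variance from (a): ~9.99
print((T == 0).mean(), theta)                           # P(T = 0) = theta if T ~ Geom(theta)
```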


2. Let $X_1, X_2$ be i.i.d., and let $\bar{X} = \frac{1}{2}(X_1 + X_2)$. In many statistics problems, it is useful or important to obtain a conditional expectation given $\bar{X}$. As an example of this, find $E(w_1 X_1 + w_2 X_2 \mid \bar{X})$, where $w_1, w_2$ are constants with $w_1 + w_2 = 1$.

By symmetry $E(X_1|\bar{X}) = E(X_2|\bar{X})$, and by linearity and taking out what's known,
$$E(X_1|\bar{X}) + E(X_2|\bar{X}) = E(X_1 + X_2 \mid \bar{X}) = X_1 + X_2.$$
So $E(X_1|\bar{X}) = E(X_2|\bar{X}) = \bar{X}$ (this was also derived in class). Thus,
$$E(w_1 X_1 + w_2 X_2 \mid \bar{X}) = w_1 E(X_1|\bar{X}) + w_2 E(X_2|\bar{X}) = w_1 \bar{X} + w_2 \bar{X} = \bar{X}.$$

3. A certain stock has low volatility on some days and high volatility on other days. Suppose that the probability of a low volatility day is $p$ and of a high volatility day is $q = 1 - p$, and that on low volatility days the percent change in the stock price is $\mathcal{N}(0, \sigma_1^2)$, while on high volatility days the percent change is $\mathcal{N}(0, \sigma_2^2)$, with $\sigma_1 < \sigma_2$.

Let $X$ be the percent change of the stock on a certain day. The distribution is said to be a mixture of two Normal distributions, and a convenient way to represent $X$ is as $X = I_1 X_1 + I_2 X_2$, where $I_1$ is the indicator r.v. of having a low volatility day, $I_2 = 1 - I_1$, $X_j \sim \mathcal{N}(0, \sigma_j^2)$, and $I_1, X_1, X_2$ are independent.

(a) Find the variance of $X$ in two ways: using Eve's Law, and by calculating $\mathrm{Cov}(I_1 X_1 + I_2 X_2, I_1 X_1 + I_2 X_2)$ directly.

By Eve's Law,
$$\mathrm{Var}(X) = E(\mathrm{Var}(X|I_1)) + \mathrm{Var}(E(X|I_1)) = E(I_1^2 \sigma_1^2 + (1 - I_1)^2 \sigma_2^2) + \mathrm{Var}(0) = p\sigma_1^2 + (1-p)\sigma_2^2,$$
since $I_1^2 = I_1$, $I_2^2 = I_2$. For the covariance method, expand
$$\mathrm{Var}(X) = \mathrm{Cov}(I_1 X_1 + I_2 X_2, I_1 X_1 + I_2 X_2) = \mathrm{Var}(I_1 X_1) + \mathrm{Var}(I_2 X_2) + 2\,\mathrm{Cov}(I_1 X_1, I_2 X_2).$$
Then $\mathrm{Var}(I_1 X_1) = E(I_1^2 X_1^2) - (E(I_1 X_1))^2 = E(I_1)E(X_1^2) = p\,\mathrm{Var}(X_1)$, since $E(I_1 X_1) = E(I_1)E(X_1) = 0$. Similarly, $\mathrm{Var}(I_2 X_2) = (1-p)\mathrm{Var}(X_2)$. And $\mathrm{Cov}(I_1 X_1, I_2 X_2) = E(I_1 I_2 X_1 X_2) - E(I_1 X_1)E(I_2 X_2) = 0$ since $I_1 I_2$ always equals 0. So again we have
$$\mathrm{Var}(X) = p\sigma_1^2 + (1-p)\sigma_2^2.$$
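The result $E(X_1|\bar{X}) = \bar{X}$ in problem 2 can also be seen empirically by binning on $\bar{X}$ and averaging $X_1$ within each bin (a sketch assuming NumPy; the Expo(1) distribution is an arbitrary choice, since the result holds for any i.i.d. $X_1, X_2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# X1, X2 i.i.d.; Expo(1) chosen arbitrarily, as any distribution works.
X1 = rng.exponential(1.0, n)
X2 = rng.exponential(1.0, n)
Xbar = (X1 + X2) / 2

# Approximate E(X1 | Xbar) by averaging X1 within narrow bins of Xbar.
bins = np.linspace(0, 3, 31)
idx = np.digitize(Xbar, bins)
for j in range(5, 26, 5):
    mask = idx == j
    # Within each bin, the mean of X1 should track the bin's mean of Xbar.
    print(round(X1[mask].mean(), 3), round(Xbar[mask].mean(), 3))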


(b) The kurtosis of a r.v. $Y$ with mean $\mu$ and standard deviation $\sigma$ is defined by
$$\mathrm{Kurt}(Y) = \frac{E(Y - \mu)^4}{\sigma^4} - 3.$$
This is a measure of how heavy-tailed the distribution of $Y$ is. Find the kurtosis of $X$ (in terms of $p, q, \sigma_1^2, \sigma_2^2$, fully simplified). The result will show that even though the kurtosis of any Normal distribution is 0, the kurtosis of $X$ is positive and in fact can be very large depending on the parameter values.

Note that $(I_1 X_1 + I_2 X_2)^4 = I_1 X_1^4 + I_2 X_2^4$, since the cross terms disappear (because $I_1 I_2$ is always 0) and any positive power of an indicator r.v. is that indicator r.v.! So, using the Normal fourth moment $E(X_j^4) = 3\sigma_j^4$,
$$E(X^4) = E(I_1 X_1^4 + I_2 X_2^4) = 3p\sigma_1^4 + 3q\sigma_2^4.$$
Alternatively, we can use $E(X^4) = E(X^4|I_1 = 1)p + E(X^4|I_1 = 0)q$ to find $E(X^4)$. The mean of $X$ is $E(I_1 X_1) + E(I_2 X_2) = 0$, so the kurtosis of $X$ is
$$\mathrm{Kurt}(X) = \frac{3p\sigma_1^4 + 3q\sigma_2^4}{(p\sigma_1^2 + q\sigma_2^2)^2} - 3.$$
This becomes 0 if $\sigma_1 = \sigma_2$, since then we have a Normal distribution rather than a mixture of two different Normal distributions. For $\sigma_1 < \sigma_2$, the kurtosis is positive since $p\sigma_1^4 + q\sigma_2^4 > (p\sigma_1^2 + q\sigma_2^2)^2$, as seen by a Jensen's inequality argument, or by interpreting this as saying $E(Y^2) > (EY)^2$, where $Y$ is $\sigma_1^2$ with probability $p$ and $\sigma_2^2$ with probability $q$.
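A simulation sketch of the kurtosis formula (assuming NumPy; the parameter values are illustrative, and the sample kurtosis is computed directly from the definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sig1, sig2 = 10**6, 0.7, 1.0, 3.0   # illustrative parameters, sig1 < sig2
q = 1 - p

# Sample from the mixture X = I1*X1 + I2*X2.
I1 = rng.random(n) < p
X = np.where(I1, sig1 * rng.standard_normal(n), sig2 * rng.standard_normal(n))

# Sample excess kurtosis: E(X - mu)^4 / sigma^4 - 3.
mu, sig = X.mean(), X.std()
kurt_hat = np.mean((X - mu) ** 4) / sig**4 - 3

kurt_formula = (3*p*sig1**4 + 3*q*sig2**4) / (p*sig1**2 + q*sig2**2) ** 2 - 3
print(kurt_hat, kurt_formula)   # both ~ 3.49 for these parameter values
```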


4. We wish to estimate an unknown parameter $\theta$, based on a r.v. $X$ we will get to observe. As in the Bayesian perspective, assume that $X$ and $\theta$ have a joint distribution. Let $\hat{\theta}$ be the estimator (which is a function of $X$). Then $\hat{\theta}$ is said to be unbiased if $E(\hat{\theta}|\theta) = \theta$, and $\hat{\theta}$ is said to be the Bayes procedure if $E(\theta|X) = \hat{\theta}$.

(a) Let $\hat{\theta}$ be unbiased. Find $E(\hat{\theta} - \theta)^2$ (the average squared difference between the estimator and the true value of $\theta$), in terms of marginal moments of $\hat{\theta}$ and $\theta$.

Hint: condition on $\theta$.

Conditioning on $\theta$, we have
$$E(\hat{\theta} - \theta)^2 = E(E(\hat{\theta}^2 - 2\hat{\theta}\theta + \theta^2 \mid \theta))$$
$$= E(E(\hat{\theta}^2|\theta)) - E(E(2\hat{\theta}\theta|\theta)) + E(E(\theta^2|\theta))$$
$$= E(\hat{\theta}^2) - 2E(\theta E(\hat{\theta}|\theta)) + E(\theta^2)$$
$$= E(\hat{\theta}^2) - 2E(\theta^2) + E(\theta^2)$$
$$= E(\hat{\theta}^2) - E(\theta^2).$$

(b) Repeat (a), except in this part suppose that $\hat{\theta}$ is the Bayes procedure rather than assuming that it is unbiased.

Hint: condition on $X$.

By the same argument as (a), except now conditioning on $X$, we have
$$E(\hat{\theta} - \theta)^2 = E(E(\hat{\theta}^2 - 2\hat{\theta}\theta + \theta^2 \mid X))$$
$$= E(E(\hat{\theta}^2|X)) - E(E(2\hat{\theta}\theta|X)) + E(E(\theta^2|X))$$
$$= E(\hat{\theta}^2) - 2E(\hat{\theta}^2) + E(\theta^2)$$
$$= E(\theta^2) - E(\hat{\theta}^2).$$

(c) Show that it is impossible for $\hat{\theta}$ to be both the Bayes procedure and unbiased, except in silly problems where we get to know $\theta$ perfectly by observing $X$.

Hint: if $Y$ is a nonnegative r.v. with mean 0, then $P(Y = 0) = 1$.

Suppose that $\hat{\theta}$ is both the Bayes procedure and unbiased. By the above, we have $E(\hat{\theta} - \theta)^2 = a$ and $E(\hat{\theta} - \theta)^2 = -a$, where $a = E(\hat{\theta}^2) - E(\theta^2)$. But that implies $a = 0$, which means that $\hat{\theta} = \theta$ (with probability 1). That can only happen in the extreme situation where the observed data reveal the true $\theta$ perfectly; in practice, nature is much more elusive and does not reveal its deepest secrets with such alacrity.
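The identities in (a) and (b) can be checked in a conjugate Normal-Normal model (an illustrative assumption: $\theta \sim \mathcal{N}(0,1)$ and $X|\theta \sim \mathcal{N}(\theta, 1)$, for which $\hat{\theta} = X$ is unbiased and the Bayes procedure is $E(\theta|X) = X/2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Illustrative conjugate model: theta ~ N(0,1), X | theta ~ N(theta, 1).
theta = rng.standard_normal(n)
X = theta + rng.standard_normal(n)

unbiased = X    # E(X | theta) = theta, so X is unbiased
bayes = X / 2   # posterior mean E(theta | X) = X/2 in this model

# (a): E(thetahat - theta)^2 = E(thetahat^2) - E(theta^2) for unbiased thetahat.
print(np.mean((unbiased - theta)**2), np.mean(unbiased**2) - np.mean(theta**2))  # ~1.0, ~1.0

# (b): E(thetahat - theta)^2 = E(theta^2) - E(thetahat^2) for the Bayes procedure.
print(np.mean((bayes - theta)**2), np.mean(theta**2) - np.mean(bayes**2))        # ~0.5, ~0.5
```

Consistent with (c), neither estimator satisfies both properties here: the unbiased estimator is not the posterior mean, and the posterior mean is biased toward 0.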


5. The surprise of learning that an event with probability $p$ happened is defined as $\log_2(1/p)$ (where the log is base 2 so that if we learn that an event of probability 1/2 happened, the surprise is 1, which corresponds to having received 1 bit of information). Let $X$ be a discrete r.v. whose possible values are $a_1, a_2, \dots, a_n$ (distinct real numbers), with $P(X = a_j) = p_j$ (where $p_1 + p_2 + \dots + p_n = 1$). The entropy of $X$ is defined to be the average surprise of learning the value of $X$, i.e., $H(X) = \sum_{j=1}^n p_j \log_2(1/p_j)$.

(a) Explain why $H(X^3) = H(X)$, and give an example where $H(X^2) \neq H(X)$.

Note that the definition of $H(X)$ depends only on the probabilities of the distinct values of $X$, not on what the values are. Since the function $g(x) = x^3$ is one-to-one, $P(X^3 = a_j^3) = p_j$ for all $j$. So $H(X^3) = H(X)$. For a simple example where $H(X^2) \neq H(X)$, let $X$ be a "random sign," i.e., let $X$ take values $-1$ and $1$ with probability 1/2 each. Then $X^2$ has entropy 0, whereas $X$ has entropy $\log_2(2) = 1$.

(b) Show that the maximum possible entropy for $X$ is when its distribution is uniform over $a_1, a_2, \dots, a_n$, i.e., $P(X = a_j) = 1/n$.

Hint: consider a r.v. $Y$ whose possible values are the probabilities $p_1, \dots, p_n$, and show why $E(\log_2(1/Y)) \le \log_2(E(1/Y))$ and how to interpret it.

Let $Y$ be as in the hint. Then $H(X) = E(\log_2(1/Y))$ and $E(1/Y) = \sum_{j=1}^n p_j \cdot \frac{1}{p_j} = n$. By Jensen's inequality, $E(\log_2(T)) \le \log_2(E(T))$ for any positive r.v. $T$, since $\log_2$ is a concave function. Therefore,
$$H(X) = E(\log_2(1/Y)) \le \log_2(E(1/Y)) = \log_2(n).$$
Equality holds if and only if $1/Y$ is a constant, which is the case where $p_j = 1/n$ for all $j$. This corresponds to $X$ being equally likely to take on each of its values.

6. In a national survey, a random sample of people are chosen and asked whether they support a certain policy. Assume that everyone in the population is equally likely to be surveyed at each step, and that the sampling is with replacement. Let $n$ be the sample size, and let $\hat{p}$ and $p$ be the proportion of people who support the policy in the sample and in the entire population, respectively. Show that for every $c > 0$,
$$P(|\hat{p} - p| > c) \le \frac{1}{4nc^2}.$$

We can write $\hat{p} = X/n$ with $X \sim \mathrm{Bin}(n, p)$. So $E(\hat{p}) = p$ and $\mathrm{Var}(\hat{p}) = p(1-p)/n$. Then by Chebyshev's inequality,
$$P(|\hat{p} - p| > c) \le \frac{\mathrm{Var}(\hat{p})}{c^2} = \frac{p(1-p)}{nc^2} \le \frac{1}{4nc^2},$$
where the last inequality is because $p(1-p)$ is maximized at $p = 1/2$.
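A simulation sketch comparing the actual tail probability with the Chebyshev bound (assuming NumPy; the survey parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, c, reps = 400, 0.35, 0.05, 10**5   # illustrative survey parameters

# phat = X/n with X ~ Bin(n, p), as in the solution above.
phat = rng.binomial(n, p, size=reps) / n

print((np.abs(phat - p) > c).mean())   # empirical P(|phat - p| > c), ~0.04 here
print(1 / (4 * n * c**2))              # Chebyshev bound: 0.25 for these values
```

The bound holds for every $c > 0$ and every $p$, but as the output suggests it can be quite conservative; its appeal is that it uses nothing about the distribution beyond the mean and variance.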
