330.104, Statistische Methoden der Ingenieurwissenschaften - IMW


Introduction: Literature

Please find the pdf of the slides in the TUWIS information system. Additional literature for the course is
◮ M.H. DeGroot and M.J. Schervish (2002) Probability and Statistics, Addison Wesley,
◮ G. Casella and R.L. Berger (2001) Statistical Inference, 2nd edition, Duxbury,
◮ J. Johnston and J. DiNardo (1997) Econometric Methods, 4th edition, McGraw-Hill,
some more advanced textbooks are
◮ W.H. Greene (2008) Econometric Analysis, 6th edition, Prentice Hall,
◮ D. Pollard (2001) A User's Guide to Measure Theoretic Probability, Cambridge University Press,
◮ P. Billingsley (1995) Probability and Measure, 3rd edition, Wiley.


Introduction: Classes

The classes take place in HS11 every Friday, 8.30 a.m. to 10.00 a.m.
There will be a mid-term and a final exam. Your scores in both exams contribute with equal weight to the final score.
There will be a third exam in autumn. Its score can be substituted for either (i) one missed test or (ii) a negative test.
Active class participation, e.g., pointing out typos or errors in the slides, can earn up to an additional 10% score.


Introduction: Course Outline

1 Introduction
2 Probability
3 Random Variables and Distributions
4 Expectation
5 Special Distributions
6 Estimation
7 Hypothesis Testing
8 Linear Statistical Models


Probability: Set Theory (1)

This section develops the formal mathematical model for events, namely the theory of sets. The concepts of element, subset, empty set, intersection, union, complement, and disjoint sets are introduced.

Def: Sample Space
The collection of all possible outcomes of a random experiment is called the sample space. We denote the sample space by Ω.

Def: Element
Each possible individual outcome is an element of the sample space, i.e., for each possible outcome ω of the experiment we have ω ∈ Ω.


Probability: Set Theory (3)

Ex: Partial Information
Suppose a die is rolled and we do not immediately see the final outcome but receive the information that the outcome is even. Then this is an event B with
B = {2, 4, 6} ⊂ Ω.

Ex cont.: Sequential Revelation of Information
If we then receive the additional information that the outcome is greater than 3, we have
A = {4, 6} ⊂ B = {2, 4, 6} ⊂ Ω.
That means: events can be used to model the gradual revelation of information about the outcome of a random experiment.


Probability: Set Theory (4)

Some events are impossible. These events are subsets of the sample space which contain no element; they are empty.

Def: Empty Set
An event which contains no elements is called the empty set, denoted by ∅. For every event A it is true that ∅ ⊆ A, in particular ∅ ⊆ Ω.

Ex: Impossible Event
When rolling a die, the event "The outcome is negative." is not possible; thus it is represented by the empty set ∅.


Probability: Set Theory (5)

Not every experiment has a sample space that can easily be enumerated.

Ex: Flip a Coin
Flip a coin repeatedly until it shows heads and report the number of flips it takes. The sample space of this experiment cannot be finite! Why?

Def: Finite and Infinite Sets
A set that contains only finitely many elements is called a finite set.
An infinite set A is called countable if there is a one-to-one relationship between the elements of A and the set of natural numbers {1, 2, 3, . . . }.
A set is uncountable if it is neither finite nor countable.


Probability: Operations of Set Theory (1)

Def: Union
Let A and B be any two events. The union of A and B, denoted by A ∪ B, is defined to be the event that contains all outcomes that belong to A alone, to B alone, or to both A and B.
(Venn diagram of two events A and B and their union A ∪ B.)


Probability: Operations of Set Theory (2)

The union of two events (sets) has the following properties:
A ∪ B = B ∪ A,  A ∪ A = A,  A ∪ ∅ = A,  A ∪ Ω = Ω.
Furthermore, if A ⊆ B then A ∪ B = B.
The notation for the union of a sequence of events is
finite union: A_1 ∪ A_2 ∪ · · · ∪ A_n = ⋃_{i=1}^{n} A_i,
infinite union: A_1 ∪ A_2 ∪ · · · = ⋃_{i=1}^{∞} A_i.


Probability: Operations of Set Theory (3)

Def: Intersection
Let A and B be any two events. The intersection of A and B, denoted by A ∩ B, is defined to be the event that contains all outcomes that belong both to A and to B.
(Venn diagram of two events A and B and their intersection A ∩ B.)


Probability: Operations of Set Theory (4)

The intersection of two events (sets) has the following properties:
A ∩ B = B ∩ A,  A ∩ A = A,  A ∩ ∅ = ∅,  A ∩ Ω = A.
Furthermore, if A ⊆ B then A ∩ B = A.
The notation for the intersection of a sequence of events is
finite intersection: A_1 ∩ A_2 ∩ · · · ∩ A_n = ⋂_{i=1}^{n} A_i,
infinite intersection: A_1 ∩ A_2 ∩ · · · = ⋂_{i=1}^{∞} A_i.


Probability: Operations of Set Theory (5)

Def: Complement
The complement of an event A is the event, denoted by A^c, which contains all outcomes in the sample space Ω that do not belong to A.
(Venn diagram of an event A and its complement A^c.)


Probability: Operations of Set Theory (6)

The complement of an event (set) has the following properties:
(A^c)^c = A,  A ∪ A^c = Ω,  A ∩ A^c = ∅,  ∅^c = Ω,  Ω^c = ∅.


Probability: Operations of Set Theory (7)

Def: Difference
The difference of event A and event B is the event that contains all elements which are in A and not in B. It is denoted by A\B, read "A minus B".
(Venn diagram of the difference A\B.)


Probability: Operations of Set Theory (8)

Def: Disjoint Events (Mutually Exclusive Events)
Two events A and B are said to be disjoint events or mutually exclusive events if they have no outcome in common, i.e., if A ∩ B = ∅.
(Venn diagram of two disjoint events A and B.)


Probability: Operations of Set Theory (9)

Ex: Toss a Coin
Toss a coin three times. Define the sample space of this experiment and state the following four events:
Event A: At least one head is obtained.
Event B: A head is obtained on the second toss.
Event C: A tail is obtained on the third toss.
Event D: No heads are obtained at all.
Each of the tosses has 2 distinct possible outcomes (head: H, tail: T); thus we have 2^3 = 8 possible outcomes of the experiment. Enumerating the outcomes defines the sample space Ω = {ω_1, . . . , ω_8} with
ω_1 = HHH,  ω_3 = HTH,  ω_5 = THH,  ω_7 = TTH,
ω_2 = HHT,  ω_4 = HTT,  ω_6 = THT,  ω_8 = TTT.


Probability: Operations of Set Theory (10)

Ex: Uncountable Sample Space
The installed capacity of water and electricity supply for an office building is able to serve between 4,000 and 200,000 liters of water and between 1 million and 150 million kilowatt-hours of electricity per day. Illustrate the sample space for daily supply.
(Figure: the sample space, a rectangle spanned by the admissible water and electricity ranges.)


Probability: Sets and R (1)

The R language has no data type set. The components of data structures like vector, matrix, data.frame, etc. can be interpreted as elements of sets. A simple and convenient syntax allows for selecting subsets and doing set operations.
Define vectors a and b by
> a <- 1:6
> b <- c(2, 4, 6)
Then the element-wise membership test gives
> a %in% b
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE


Probability: Sets and R (2)

If we assign this selector to a named structure, we may use it for selecting elements from a.
> sel <- a %in% b
> a[sel]
[1] 2 4 6
To select the complement, use the logical negation !.
> a[!sel]
[1] 1 3 5


Probability: The Definition of Probability (1)

Now we begin with the mathematical definition of probability and then present some useful results that follow easily from the definition.
In a given experiment, we assign to each event A in the sample space Ω a number Pr(A) which indicates the probability that A will occur.

Def: Axioms of Probability
A function Pr which assigns to each event A ⊆ Ω a real number Pr(A) and satisfies the following three axioms is called a probability:
(i) Pr(A) ≥ 0 ∀A ⊆ Ω,
(ii) Pr(Ω) = 1,
(iii) for disjoint events A_1, A_2, . . . , the probability of their union, Pr(⋃_i A_i), is equal to Σ_i Pr(A_i), the sum of the probabilities of the individual events.


Probability: The Definition of Probability (2)

From the axioms of probability it follows that
◮ Pr(∅) = 0.
◮ For every two events A and B, B = (B ∩ A) ∪ (B ∩ A^c), i.e., the disjoint events A and A^c partition B into two disjoint events.
◮ For every event A, Pr(A^c) = 1 − Pr(A).
◮ If A ⊂ B, then Pr(A) ≤ Pr(B).
◮ For every event A, 0 ≤ Pr(A) ≤ 1.
◮ For every two events A and B, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
  Note: A ∪ B can be partitioned into three disjoint events, A ∪ B = (A ∩ B^c) ∪ (A ∩ B) ∪ (A^c ∩ B).


Probability: The Cost and Benefit of Rigor (1)

Note: Probabilities are defined on events, not on the elements of the sample space! This is an issue when dealing with uncountable sample spaces.
In Axiom (iii), we are a little vague about the meaning of the dots (. . . ).
Axiom (iii) only states that we want to add up probabilities in the same way as lengths, areas, volumes, and masses, simply letting us calculate the probability of an event by breaking it into disjoint pieces whose probabilities are summed.
Sometimes we need a countable infinity of pieces; (iii) is then called countable additivity.
It turns out that, in general, the collection of events to which countably additive probabilities are assigned cannot include all subsets of the sample space.
The domain of the function P is usually just a sigma-field F, i.e., a collection of events (subsets of Ω).


Probability: The Cost and Benefit of Rigor (2)

Def: Sigma-Field
Call a class F a sigma-field of subsets of Ω if:
(i) the empty set ∅ and the whole sample space Ω both belong to F,
(ii) if A belongs to F then so does its complement Ω\A,
(iii) if A_1, A_2, . . . is a countable collection of sets in F then both the union ⋃_i A_i and the intersection ⋂_i A_i are also in F.
I.e., probabilities are defined on so-called probability spaces (Ω, F, P), which consist of a sample space Ω, a sigma-field F, and a probability measure P : F → [0, 1] defined on the sigma-field F.


Probability: The Cost and Benefit of Rigor (3)

Note: If Ω consists of only a finite number of elements, then F can consist of all subsets of Ω (though it need not).
Sigma-fields are used to define so-called information sets that specify what is known (or will be known) at a certain time t (see the introductory example of the die, where information is revealed sequentially).
The simplest information set is F = {∅, Ω}.


Probability: Finite Sample Spaces (1)

The simplest experiments in which to determine and derive probabilities are those that involve only finitely many possible outcomes. This section will introduce the important concepts.
Here we follow a bottom-up approach, i.e., we define probabilities on the single elements of the sample space Ω and derive the probability of more complex events from them.
Let Ω consist of n outcomes, Ω = {ω_1, . . . , ω_n}. Then with p_i = Pr(ω_i) the following conditions must hold:
p_i ≥ 0 for i = 1, . . . , n,  and  Σ_{i=1}^{n} p_i = 1.


Probability: Finite Sample Spaces (2)

Def: Simple Sample Space
A sample space Ω containing n outcomes ω_1, . . . , ω_n is called a simple sample space if the probability assigned to each of the outcomes is 1/n.
Then, in a simple sample space, the probability of an event A that consists of exactly m outcomes is
Pr(A) = m/n.

Ex: Tossing Three Coins
Toss three coins simultaneously. What is the probability of obtaining exactly two heads?


Probability: Finite Sample Spaces (3)

Ex: Rolling Two Dice
Consider an experiment where two dice are rolled. Calculate the probability of each of the possible values of the sum of the two numbers that may appear.
The corresponding simple sample space Ω is:
(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)


Probability: Finite Sample Spaces (4)

Note: The sum of the numbers obtained is constant along 45-degree lines through the sample space matrix.
Thus the assigned probabilities are
Pr(2) = Pr(12) = 1/36,  Pr(5) = Pr(9) = 4/36,
Pr(3) = Pr(11) = 2/36,  Pr(6) = Pr(8) = 5/36,
Pr(4) = Pr(10) = 3/36,  Pr(7) = 6/36.
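These probabilities are easy to verify in R by enumerating all 36 outcomes; the following quick check is not part of the original slides:

# enumerate all 36 outcomes of two dice and tabulate the sums
sums <- outer(1:6, 1:6, "+")   # 6 x 6 matrix of all possible sums
table(sums) / 36               # relative frequencies: 1/36, 2/36, ..., 6/36, ..., 1/36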


Probability: Counting Methods (1)

The following slides present some counting methods that exploit the special structure which exists in many common experiments.

Th: Multiplication Rule
Consider an experiment that has the following characteristics:
(i) The experiment is performed in two parts.
(ii) The first part has m possible outcomes, the second part has n possible outcomes.
Then there are mn final outcomes, and in a simple sample space the probability of each of the final outcomes is 1/(mn).
Note: For the multiplication rule to hold, it is necessary that each outcome in one part can occur completely independently of what occurred in the other part of the experiment (the sub-experiments must be independent, see later).


Probability: Counting Methods (2)

The multiplication rule can easily be extended to more than two parts.

Ex: Toss Ten Coins
When tossing ten coins, the sample space consists of 2^10 = 1024 outcomes.

Def: Permutation
Let S be a set of n elements and assign to each of the elements one number out of the sequence 1, . . . , n. Each possible different assignment is called a permutation.


Probability: Counting Methods (3)

The number of permutations of a set of n elements is
n(n − 1)(n − 2) · · · 1 = n!   (read: "n factorial").

Ex: Seating Arrangements
Consider a class with 20 students. How many different seating arrangements are possible?
The number of possible different arrangements is calculated as
20 · 19 · 18 · · · 1 = 20! = 2,432,902,008,176,640,000.
In R:
> factorial(20)


Probability: Counting Methods (4)

More generally, we consider an experiment where we select a sample of k elements out of a sample space with n elements, k ≤ n. How many different selections are possible?
We call this procedure sampling without replacement since each element can only be selected once, i.e., it is not replaced by an identical element once selected.

Th: Sampling Without Replacement
If we sample k elements from a sample space of n elements (i.e., without replacement), and the order in the selected sample is relevant, there are
n · (n − 1) · · · (n − k + 1) = [n · (n − 1) · · · 1] / [(n − k)(n − k − 1) · · · 1] = n!/(n − k)!
possible outcomes.


Probability: Counting Methods (5)

Ex: The Winners' Rostrum
Consider a contest of 20 skiers. How many different constellations of the winners' rostrum are possible?
At the rostrum, the order certainly matters:
20!/(20 − 3)! = 20!/17! = 6840.
If there is no additional information about the ability of the individual skiers (i.e., in a simple sample space), we assign a probability of 1/6840 to each constellation.
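In R this count can be checked directly (a quick aside, not from the original slides):

# number of ordered podium constellations among 20 skiers
factorial(20) / factorial(17)   # 6840
prod(20:18)                     # the same without large factorials: 20 * 19 * 18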


Probability: Counting Methods (6)

If we sample k elements out of a set of n elements and the order within the selected sample does not matter, then apparently there are k! different permutations in the chosen sample that are "identical". The statement that a certain draw qualifies for a selection criterion then corresponds to an event that consists of k! elements.

Ex: Austrian Lotto, 6 out of 45
You win if your selection of 6 numbers out of {1, 2, . . . , 45} matches exactly the drawing of this round. The order in which the numbers are chosen is completely irrelevant (compare this to the winners' rostrum).


Probability: Counting Methods (7)

There are
45!/(45 − 6)! = 5,864,443,200
different ways to fill out the Lotto ticket.
The order does, however, not matter. I.e., the event that you have selected the correct group of 6 numbers consists of m = 6! = 720 different elements.
Remember: We are in a simple sample space. It contains n = 5,864,443,200 different elements. The event A consists of m = 720 elements that qualify for a first prize.
Then the probability of winning a first prize is
Pr(A) = m/n = 6! (45 − 6)!/45! = 1/8,145,060.
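A quick numerical check in R (not part of the original slides); choose() is the built-in binomial coefficient introduced on the next slide:

factorial(6) * factorial(39) / factorial(45)   # 1.227738e-07, i.e. 1 / 8,145,060
1 / choose(45, 6)                              # the same probability via the binomial coefficient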


Probability: Counting Methods (8)

If the order of selection does not matter (as is the case in the example above), the number of different selections of k elements from a set of n elements is
n!/[(n − k)! k!].

Def: Binomial Coefficient
The binomial coefficient (n choose k) is defined as
(n choose k) = n!/[(n − k)! k!].


Probability: Counting Methods (9)

Why "binomial coefficient"? If you choose n times up or down (0 or 1, true or false), then the number of possibilities to choose k times up is (n choose k).
(Illustration: Pascal's triangle of up/down paths; for n = 4 the path counts are 1, 4, 6, 4, 1.)


Probability: Counting Methods (10)

If it is possible to sample a selected element repeatedly, we call it sampling with replacement.

Th: Sampling With Replacement
When sampling k elements from a set of n elements with replacement, there are
n · n · · · n (k times) = n^k
possibilities.
This is a generalization of the multiplication rule.


Probability: Counting Methods (11)

Ex: Joker
Consider a box with n = 10 balls numbered 0, . . . , 9. Select k = 6 balls from the box and put each selected ball back into the box. Write the selections down such that they form a 6-digit number. What is the probability of obtaining a certain number?
There are n^k = 1,000,000 different choices.
The order of the drawing matters!
The probability of winning a first prize is
(1/n)^k = 1/1,000,000.


Probability: Counting Methods (12)

Ex: The Birthday Problem
Consider a group of k people with 2 ≤ k ≤ 365. What is the probability that at least two of these people have the same birthday (i.e., celebrate their birthday on the same day of the same month, but are not necessarily born in the same year)?
For simplicity we ignore leap years.
We assume a simple sample space and abstract from seasonalities, i.e., we ignore the fact that the birth rate is not exactly constant throughout the year.
Then the sample space of this experiment contains 365^k different elements.


Probability: Counting Methods (13)

Furthermore, among these 365^k elements there are
365!/(365 − k)!
that contain only different birthdays.
Let B denote the event that all k persons have different birthdays; then
Pr(B) = 365!/[(365 − k)! 365^k].
If we define A = Ω\B, then A is the desired event that at least two birthdays in the selection coincide, and
Pr(A) = Pr(Ω\B) = 1 − Pr(B) = 1 − 365!/[(365 − k)! 365^k].


Probability: Counting Methods (14)

A naive attempt to calculate Pr(A) for, e.g., a group of 10 persons fails!
> 1 - factorial(365) / (factorial(365 - 10) * 365^10)
yields
[1] NaN
Warning messages:
1: In factorial(365) : value out of range in 'gammafn'
2: In factorial(365 - 10) : value out of range in 'gammafn'
This is so because
> factorial(365)
[1] Inf


Probability: Counting Methods (15)

Only few programs, like Mathematica, are able to handle numbers like 365! exactly; written out in full, 365! is an integer with 779 decimal digits (roughly 2.51 × 10^778).
Are we lost?


Probability: Counting Methods (16)

Fortunately not!
Focus on Pr(B); it can be calculated in an iterative way.
Let Pr(B_k) denote the probability that a group of k persons has different birthdays.
Write Pr(B_k) as
Pr(B_k) = 365!/[(365 − k)! 365^k]
        = [365 · 364 · · · (365 − (k − 1))] / [365 · 365 · · · 365]   (k factors in each product)
        = (365/365) · (364/365) · · · ((365 − (k − 2))/365) · ((365 − (k − 1))/365)
        = Pr(B_{k−1}) · (365 − (k − 1))/365.


Probability: Counting Methods (17)

R code can be found in the file BirthdayProblem.R; it allocates the vectors sampleSize and probA and fills them using the recursion from the previous slide.
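A minimal sketch of such an iterative computation, assuming the recursion from the previous slide; the variable names sampleSize and probA follow the slides, everything else is illustrative and need not match BirthdayProblem.R exactly:

# iterative computation of the birthday probabilities (sketch)
nDays      <- 365
sampleSize <- 1:nDays                # group sizes k = 1, ..., 365
probB      <- numeric(nDays)         # Pr(all k birthdays are different)
probB[1]   <- 1
for (k in 2:nDays) {
  probB[k] <- probB[k - 1] * (nDays - (k - 1)) / nDays
}
probA <- 1 - probB                   # Pr(at least two birthdays coincide)
plot(sampleSize, probA, t = "l", xlab = "k", ylab = "probA")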


Probability: Counting Methods (18)

The probability of at least two coinciding birthdays over the size of the group k.
(Figure: probA plotted against k for k = 0, . . . , 365; the curve rises from 0 to 1.)


Probability: Counting Methods (19)

If we want to see only the relevant range k ≤ 60:
# open a new window for the second plot
X11()
plot(sampleSize[1:60], probA[1:60], t = "l", xlab = "k", ylab = "ProbA")


Probability: Counting Methods (20)

The probability of at least two coinciding birthdays over the size of the group, for k ≤ 60.
(Figure: ProbA plotted against k for k = 0, . . . , 60.)


Probability: Conditional Probability (1)

A major use of probability in statistical inference is the updating of probabilities when certain events are observed.

Def: Conditional Probability
The updated probability of an event A after we learn that an event B has occurred is called the conditional probability of A given B, given by
Pr(A|B) = Pr(A ∩ B) / Pr(B).
Note: The conditional probability Pr(A|B) is not defined if Pr(B) = 0.


Probability: Conditional Probability (2)

The conditional probability Pr(A|B) is the proportion of the total probability Pr(B) that is represented by the probability Pr(A ∩ B).
In other words, when we know that B has occurred, the sample set in which the remaining uncertainty is realized is no longer Ω but B, and thus we normalize Pr(B) to 1.
(Venn diagram: conditioning on B rescales the probability of A ∩ B relative to B.)


Probability: Conditional Probability (3)

Th: The Multiplication Rule for Conditional Probabilities
From the definition of conditional probabilities it follows that
Pr(A ∩ B) = Pr(A) Pr(B|A) = Pr(B) Pr(A|B).
Since A can be partitioned by B and B^c into two disjoint events,
A = (A ∩ B) ∪ (A ∩ B^c),
it follows that
Pr(A) = Pr(A|B) Pr(B) + Pr(A|B^c) Pr(B^c).
This is a special case of the law of total probability (see later).


Probability: Conditional Probability (4)

Ex: Selecting Two Balls
Suppose there is a box with r red balls and b blue balls. Two balls are selected at random (without replacement). What is the probability that the first ball is red and the second is blue?
Let A be the event that the first ball is red and B the event that the second ball is blue. Then
Pr(A) = r/(r + b),  Pr(B|A) = b/(r + b − 1),
and
Pr(A ∩ B) = r/(r + b) · b/(r + b − 1).


Probability: Conditional Probability (5)

Th: The Multiplication Rule for Conditional Probabilities, cont.
Consider events A_1, A_2, . . . , A_n with Pr(A_1 ∩ A_2 ∩ · · · ∩ A_{n−1}) > 0. Then
Pr(A_1 ∩ A_2 ∩ · · · ∩ A_n) = Pr(A_1) Pr(A_2|A_1) Pr(A_3|A_1 ∩ A_2) · · · Pr(A_n|A_1 ∩ A_2 ∩ · · · ∩ A_{n−1}).
To see this, write the right hand side according to the definition of conditional probability as the telescoping product
Pr(A_1) · [Pr(A_1 ∩ A_2)/Pr(A_1)] · [Pr(A_1 ∩ A_2 ∩ A_3)/Pr(A_1 ∩ A_2)] · · · [Pr(A_1 ∩ A_2 ∩ · · · ∩ A_n)/Pr(A_1 ∩ A_2 ∩ · · · ∩ A_{n−1})].


Probability: Conditional Probability (6)

Ex: Selecting k Balls
Suppose there is a box with r red balls and b blue balls. Select k balls (without replacement), with k ≤ r. What is the probability that at least one blue ball is selected?
Event A: At least one blue ball is selected.
Event B = A^c: Only red balls are selected.
Event B_i: A red ball is selected with draw number i.
Then B = ⋂_{i=1}^{k} B_i, and according to the multiplication rule of conditional probability we can determine Pr(B) by
Pr(B) = Pr(B_1) · Pr(B_2|B_1) · Pr(B_3|B_1 ∩ B_2) · · · Pr(B_k | ⋂_{i=1}^{k−1} B_i).


Probability: Conditional Probability (7)

Since we have a simple sample space, the conditional probabilities can be determined by counting the possible draws that qualify for the event.
Apparently we have
Pr(B_i | ⋂_{j=1}^{i−1} B_j) = (r − (i − 1)) / (r − (i − 1) + b)
and
Pr(B) = [r!/(r − k)!] · [(r + b − k)!/(r + b)!].
Set r = b = 10 and determine the probability that, when selecting 4 balls, at least one blue ball is selected. Then
Pr(B) = (10/20) · (9/19) · (8/18) · (7/17) = (10! · 16!)/(6! · 20!) = 0.04334  ⇒  Pr(A) = 0.95666.
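The numerical value can be confirmed in R (a quick check, not from the slides):

# r = b = 10, k = 4 draws: probability that only red balls are selected
probB <- prod((10:7) / (20:17))   # 10/20 * 9/19 * 8/18 * 7/17
probB                             # 0.04334365
1 - probB                         # 0.9566563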


Probability: Conditional Probability (8)

Def: Partition
Consider a sample space Ω and k disjoint events B_1, . . . , B_k such that ⋃_{i=1}^{k} B_i = Ω. Then these events form a partition of Ω.
This family of events B_i covers the entire sample space and the events are disjoint. I.e., we know with certainty that exactly one of the B_i will occur.

Th: Law of Total Probability
Suppose the events B_1, . . . , B_k form a partition of the sample space Ω and Pr(B_i) > 0 ∀i. Then for an arbitrary event A we have
Pr(A) = Σ_{i=1}^{k} Pr(A|B_i) Pr(B_i).


Probability: Conditional Probability (9)

Since the B_i form a partition of Ω, they also partition A into disjoint events A ∩ B_i.
(Venn diagram: the sample space partitioned into B_1, . . . , B_5 with the event A cut into the corresponding pieces.)


Probability: Conditional Probability (10)

Ex: Selecting Balls
Suppose there is a box with r ≥ 1 red balls and b ≥ 1 blue balls. Select two balls (without replacement). What is the probability that the second ball is red?
Event A: The second ball is red.
Event B: The first ball is red.
Partition A with respect to B and B^c,
A = (A ∩ B) ∪ (A ∩ B^c),
and from the multiplication rule of conditional probability it follows that
Pr(A) = (r − 1)/(r + b − 1) · r/(r + b) + r/(r + b − 1) · b/(r + b) = r/(r + b).


Probability: Conditional Probability (11)

Note: Assume we draw the balls with replacement. Then every draw faces exactly the same situation and the probability of drawing a red ball constantly equals r/(r + b).
Without replacement: For the first ball, the probability of choosing a red ball is apparently r/(r + b).
As we have already calculated, the probability that the second ball is red (without replacement) is also r/(r + b).

Ex: Selecting Balls (cont.)
Suppose there is a box with r ≥ 2 red balls and b ≥ 2 blue balls. Select three balls (without replacement). What is the probability that the third ball is red?


Probability: Conditional Probability (12)

Let A denote the event that the third ball is red.
Let k denote the number of red balls selected with the first two draws. The events k = 0, k = 1, and k = 2 form a partition of the sample space.
Thus we have
Pr(A) = Pr(A|k = 0) Pr(k = 0) + Pr(A|k = 1) Pr(k = 1) + Pr(A|k = 2) Pr(k = 2).
Substituting the expressions for the stated probabilities gives
Pr(A) = r/(r + b).


Probability: Conditional Probability (13)

The law of total probability has an analogous version that is conditional on another event C with Pr(C) > 0.

Th: The Conditional Version of the Law of Total Probability
Suppose an event C with Pr(C) > 0 and events B_1, . . . , B_k that form a partition of the sample space Ω with Pr(B_i) > 0 ∀i. Then for an arbitrary event A we have
Pr(A|C) = Σ_{i=1}^{k} Pr(A|B_i ∩ C) Pr(B_i|C).


Probability: Independent Events (1)

If learning that an event B has occurred does not change the probability of an event A, then A and B are independent.

Def: Independence
Two events A and B are called independent if
Pr(A ∩ B) = Pr(A) Pr(B).
From independence of A and B it follows:
◮ If Pr(A) > 0 then Pr(B|A) = Pr(B).
◮ If Pr(B) > 0 then Pr(A|B) = Pr(A).
This means we do not update the probability assigned to A when observing that B has occurred.


Probability: Independent Events (2)

Ex: Quality Control
Suppose a production process produces widgets. It is followed by two tests aimed at identifying inoperable widgets. Test a detects an inoperable widget with 70% probability, test b detects an inoperable widget with 80% probability. The results of the tests are independent. What is the probability that an inoperable widget is identified by quality control?
Let A be the event that an inoperable widget passes both tests undetected, A_a the event that it passes test a undetected, and A_b the event that it passes test b undetected.
Then Pr(A_a) = 30% and Pr(A_b) = 20%.
Since the tests are independent,
Pr(A) = Pr(A_a ∩ A_b) = Pr(A_a|A_b) Pr(A_b) = Pr(A_a) Pr(A_b) = 0.06.
The probability that it is detected is then 94%.


Probability: Bayes' Theorem (1)

Consider a partition B_1, . . . , B_k of the sample space.
If some information is then revealed (say we learn that some event A has occurred), we still do not know which of the B_i has occurred, but we can update the probabilities we assign to each of the B_i in the partition.

Th: Bayes' Theorem
Let the events B_1, . . . , B_k be a partition of Ω such that Pr(B_i) > 0 ∀i, and let A be an event with Pr(A) > 0. Then for i = 1, . . . , k,
Pr(B_i|A) = Pr(A|B_i) Pr(B_i) / Σ_{j=1}^{k} Pr(A|B_j) Pr(B_j).
Note that the term in the denominator equals Pr(A) (by the law of total probability).


Probability: Bayes' Theorem (2)

Ex: Identifying the Source of a Defective Item
Three different machines M_1, M_2, and M_3 were used for producing a large batch of similar items. Suppose the fractions produced by the respective machines are 20%, 30%, and 50%.
We further know that 1% of the items produced by machine M_1 are defective, and 2% of M_2's and 3% of M_3's products are inoperable.
Suppose that one item is selected at random from the entire batch, and it is found to be defective. What is the probability that this item was produced by M_2?
Let B_i be the event that the selected item was produced by M_i, and A the event that the selected item is defective.
Pr(B_1) = 0.2, Pr(B_2) = 0.3, Pr(B_3) = 0.5,
Pr(A|B_1) = 0.01, Pr(A|B_2) = 0.02, Pr(A|B_3) = 0.03.


Probability: Bayes' Theorem (3)

By the law of total probability,
Pr(A) = Σ_{i=1}^{3} Pr(A|B_i) Pr(B_i) = 0.023.
Then by Bayes' theorem the updated probability of B_2, after we have learned that the selected item is defective, is
Pr(B_2|A) = Pr(A|B_2) Pr(B_2) / Pr(A) = (0.02 · 0.3)/0.023 = 0.2609.
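A short R check of this computation (not part of the slides):

# Bayes' theorem for the three machines
prior <- c(0.2, 0.3, 0.5)          # fractions produced by M1, M2, M3
pDef  <- c(0.01, 0.02, 0.03)       # defect rates of M1, M2, M3
pA    <- sum(pDef * prior)         # law of total probability: 0.023
pDef * prior / pA                  # posteriors: 0.0870 0.2609 0.6522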


Probability: Bayes' Theorem (4)

Ex: Unfair Coin
Suppose a box contains one fair coin and one coin with a head on each side. One coin is selected at random and, when it is tossed, a head is obtained. What is the probability that the coin is the fair coin?
Let B_1 be the event that the coin is fair, B_2 the event that the selected coin is the unfair coin, and H_1 the event that a head is obtained when the coin is tossed.
Then by Bayes' theorem,
Pr(B_1|H_1) = Pr(H_1|B_1) Pr(B_1) / [Pr(H_1|B_1) Pr(B_1) + Pr(H_1|B_2) Pr(B_2)].


Probability: Bayes' Theorem (5)

This yields
Pr(B_1|H_1) = (1/2 · 1/2) / (1/2 · 1/2 + 1 · 1/2) = 1/3.

Ex: Unfair Coin (cont.)
Suppose the selected coin is tossed once more and again a head is obtained. What is now the probability that the coin is the fair coin?
There are two ways to calculate the updated probability:
◮ Use Bayes' theorem once after observing both heads.
◮ Use the conditional version of Bayes' theorem to update Pr(B_1|H_1) after the second head is observed.


Probability: Bayes' Theorem (6)

Let H_2 denote the event that the second toss shows a head.
Then by Bayes' theorem,
Pr(B_1|H_1 ∩ H_2) = Pr(H_1 ∩ H_2|B_1) Pr(B_1) / [Pr(H_1 ∩ H_2|B_1) Pr(B_1) + Pr(H_1 ∩ H_2|B_2) Pr(B_2)]
                  = (1/4 · 1/2) / (1/4 · 1/2 + 1 · 1/2) = 1/5.


Probability: Bayes' Theorem (7)

The second way to calculate Pr(B_1|H_1 ∩ H_2) is to update Pr(B_1|H_1) with the conditional version of Bayes' theorem.
From the definition of conditional probability we know that
Pr(B_1 ∩ H_2|H_1) = Pr(B_1|H_2 ∩ H_1) Pr(H_2|H_1) = Pr(H_2|B_1 ∩ H_1) Pr(B_1|H_1),
and thus we get
Pr(B_1|H_2 ∩ H_1) = Pr(H_2|B_1 ∩ H_1) Pr(B_1|H_1) / Pr(H_2|H_1),
where
Pr(H_2|H_1) = Pr(H_2|B_1 ∩ H_1) Pr(B_1|H_1) + Pr(H_2|B_2 ∩ H_1) Pr(B_2|H_1).


Probability: Bayes' Theorem (8)

So we finally get
Pr(B_1|H_2 ∩ H_1) = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 2/3) = 1/5.
Note: In the first step of updating, Pr(B_1) is called the prior probability and Pr(B_1|H_1) is the posterior probability, i.e., the probability after updating.
In the second step of updating, the prior probability is Pr(B_1|H_1) and Pr(B_1|H_2 ∩ H_1) is the posterior probability.
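The two-step updating can be mirrored in R; a minimal sketch (not from the slides), where prior, pHead, and the posterior names are illustrative:

# sequential Bayesian updating for the fair vs. two-headed coin
prior      <- c(fair = 0.5, unfair = 0.5)
pHead      <- c(fair = 0.5, unfair = 1.0)                    # Pr(head | coin)
posterior1 <- prior * pHead / sum(prior * pHead)             # after the first head: 1/3, 2/3
posterior2 <- posterior1 * pHead / sum(posterior1 * pHead)   # after the second head: 1/5, 4/5
posterior2["fair"]                                           # 0.2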


Random Variables and Distributions (1)

Def: Random Variable
A random variable is a real-valued function defined on a sample space Ω. I.e., a random variable X is some function that assigns a real number X(ω) to each ω ∈ Ω.

Ex: Rolling Two Dice
Roll two dice and report the sum of the two outcomes. We had this experiment already.

Ex: Tossing a Coin Three Times
Toss a coin three times and report the number of heads you obtained.


Random Variables and Distributions (2)

Ex: Demand for Water and Electricity
Consider the example from above and suppose the cost of one liter of water is q_W and the cost of one kilowatt-hour of electricity is q_E. Observe the daily consumption and report the associated costs.
(Figure: the rectangular sample space of daily water and electricity supply from the earlier example.)


Random Variables and Distributions (3)

The prices for water and electricity define a function on the sample space Ω which assigns to each observation (ω_W, ω_E) ∈ Ω the associated costs via
cost(ω_W, ω_E) = q_W ω_W + q_E ω_E.
While the first two examples consider discrete random variables, the last example illustrates a continuous random variable.

Def: Distribution Function
The distribution function (df) F of a random variable X is a function defined on the real numbers as
F(x) = Pr(X ≤ x) for −∞ < x < ∞.
Sometimes the distribution function is also called the cumulative distribution function (cdf).


Random Variables and Distributions: Discrete Distributions (1)

Def: Discrete Random Variables
A discrete random variable X is a random variable that can take only a finite number k of different values x_1, x_2, . . . , x_k or, at most, a countable sequence of different values x_1, x_2, . . . .
Consider the "Tossing a Coin Three Times" example: the random variable X (the number of heads obtained) has four possible outcomes 0, 1, 2, 3 with
Pr(X = 0) = Pr(X = 3) = 1/8,  Pr(X = 1) = Pr(X = 2) = 3/8.


Random Variables and Distributions: Discrete Distributions (2)

Illustration of the df in the "Tossing a Coin Three Times" example.
(Figure: a step function F(x) rising from 0 to 1 with jumps at x = 0, 1, 2, 3.)


Random Variables and Distributions: Discrete Distributions (3)

The distribution function is right continuous with existing left limits.
The jumps in the distribution function represent the atomic probability mass that is situated at the discrete values which are possible outcomes of the random variable.
Since the distribution function F evaluated at x gives the probability that X ≤ x, it is right continuous.

Def: Probability Function
For discrete distributions we define the probability function pf as
pf(x) = Pr(X = x).


Random Variables and Distributions: Discrete Distributions (4)

Graphical illustration of the probability function for the "Tossing a Coin Three Times" example.
(Figure: points at x = 0, 1, 2, 3 with heights 1/8, 3/8, 3/8, 1/8; pf(x) = 0 elsewhere.)


Random Variables and Distributions: Discrete Distributions (5)

Def: Binomial Distribution
Consider an experiment that consists of n independent sub-experiments, where each sub-experiment can only take on two possible outcomes, success and no success.
The probability of success is p, equal for all sub-experiments.
A random variable X is assigned the number of times success is observed.
Then the distribution function of X is called a binomial distribution with parameters n and p.


Random Variables and Distributions: Discrete Distributions (6)

Def: Binomial Distribution (alternative)
Consider a discrete random variable X with possible values 0, . . . , n. If the probability of taking a value x is given by
Pr(X = x) = (n choose x) p^x (1 − p)^{n−x},
then X is called binomially distributed with parameters n and p.
The distribution function of an (n, p)-binomial distribution is then
F(x) = Σ_{i ∈ {0,...,n}, i ≤ x} (n choose i) p^i (1 − p)^{n−i}.


Random Variables and Distributions: Discrete Distributions (7)

R code for generating a graph of the probability distribution of a binomially distributed random variable can be found in BinomialDistribution.R.
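The file itself is not reproduced here; a minimal sketch of such a plot, assuming the three-coin example (n = 3, p = 0.5) and base-R graphics, could look as follows:

# distribution function of a binomial(n, p) random variable at its jump points
n <- 3; p <- 0.5
x <- 0:n
plot(x, pbinom(x, n, p), xlim = c(-1, n + 1), ylim = c(0, 1),
     xlab = "x", ylab = "F(x)")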


Random Variables and Distributions: Discrete Distributions (8)

The distribution function for tossing a coin three times:
(Figure: step-shaped F(x) with jumps at x = 0, 1, 2, 3.)


Random Variables and Distributions: Discrete Distributions (9)

The distribution function for tossing a coin a hundred times:
(Figure: the step function F(x) for x = 0, . . . , 100, rising from 0 to 1 around x = 50.)


Random Variables and Distributions: Discrete Distributions (10)

Ex: Rolling a Die
When rolling a die, the numbers 1 to 6 have equal probability 1/6. What is the distribution function of this experiment?
(Figure: step function F(x) with jumps of height 1/6 at x = 1, . . . , 6.)


Random Variables and Distributions: Continuous Distributions (1)

A continuous random variable is one that can assume every value in an interval (bounded or unbounded). Its distribution can be characterized by a density function.

Def: Continuous Random Variable
A random variable X is called a continuous random variable, or X is said to have a continuous distribution, if there exists a function f defined on the real numbers such that the probability that X takes a value in A ⊂ R is given by
Pr(X ∈ A) = ∫_A f(x) dx.

Def: Probability Density Function
The function f from above is called the probability density function of X, also denoted pdf.


Random Variables and Distributions: Continuous Distributions (2)

In order to conform with the axioms of probability, a pdf f must satisfy
f(x) ≥ 0 ∀x,  and  ∫_{−∞}^{∞} f(x) dx = 1.
The probability that X is in the interval (a, b] is then
Pr(a < X ≤ b) = ∫_a^b f(x) dx.
What is the probability that X assumes a value in [a, b)?


Random Variables and Distributions: Continuous Distributions (3)

Note: For a continuous random variable it is true that
Pr(X = x) = 0 ∀x ∈ R,
i.e., the probability of obtaining exactly some pre-specified value is zero. But this does not mean that a particular outcome x is impossible.
Non-uniqueness of the pdf: As a consequence, we might change the pdf of a random variable X at a finite number of points (or even at a countably infinite number of points) without changing the value of the integral. I.e., the pdf is not uniquely defined.


Random Variables and Distributions: Continuous Distributions (4)

Def: Uniform Distribution on an Interval
Consider the interval [a, b] with a < b and select a random number X from the interval such that the probability that X belongs to an arbitrary subinterval of [a, b] is proportional to the length of the subinterval.
Then X is called a uniformly distributed variable and the distribution of X is called a uniform distribution.
In this case f has to be constant over [a, b], and since ∫_{−∞}^{∞} f(x) dx = ∫_a^b f(x) dx = 1 we get
f(x) = 1/(b − a) for a ≤ x ≤ b,  and  f(x) = 0 otherwise.


Random Variables and Distributions: Continuous Distributions (5)

Ex: Uniform Distribution Over [−2, 3]
The probability density function of a random variable that is uniformly distributed over the interval [−2, 3].
(Figure: f(x) = 0.2 on [−2, 3] and 0 elsewhere.)


Random Variables and Distributions: Distribution Function (1)

We have already defined the distribution function F of a random variable X as
F(x) = Pr(X ≤ x) for −∞ < x < ∞.
If the distribution of the variable X is characterized by a probability density function f, then the distribution function F and the probability density function f show the following dependence:
F(x) = ∫_{−∞}^{x} f(u) du  and  f(x) = dF/dx = F′(x).


Random Variables and Distributions: Distribution Function (2)

Ex: Uniform Distribution Over [a, b]
The distribution function F for a uniform distribution over the interval [a, b].
Since there is no probability mass outside the interval [a, b], we know that F is flat there.
The distribution function is then given by
F(x) = 0 for x < a,
F(x) = ∫_a^x 1/(b − a) du = (x − a)/(b − a) for a ≤ x < b,
F(x) = 1 for b ≤ x.


Random Variables and Distributions: Distribution Function (3)

Ex: Uniform Distribution Over [−2, 3]
Determine the distribution function of a uniform distribution over the interval [−2, 3].
(Figure: F(x) rises linearly from 0 at x = −2 to 1 at x = 3.)
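The same distribution function is available in base R; a quick check (not from the slides):

# F(x) of a uniform distribution over [-2, 3] at a few points
punif(c(-3, 0, 2, 4), min = -2, max = 3)   # 0.0 0.4 0.8 1.0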


Random Variables and Distributions: Distribution Function (4)

Th: Determining Probabilities from the Distribution Function
From the definition of the distribution function it follows directly that
Pr(X > x) = 1 − F(x),
Pr(x_1 < X ≤ x_2) = F(x_2) − F(x_1),
Pr(X < x) = F(x⁻) = lim_{u→x⁻} F(u),
Pr(X = x) = F(x) − F(x⁻).
If F is continuous at x, then it follows that Pr(X = x) = 0. Therefore, atomic probability mass is characterized by jumps in the distribution function.


Random Variables and Distributions: Distribution Function (5)

Ex: Uniform Distribution
What is the probability that a random variable X, uniformly distributed over [−2, 3], assumes a value in (0, 2]?
(Figure: the density f(x) = 0.2 on [−2, 3].)


Random Variables and Distributions: Distribution Function (6)

We can read the value of the integral over the density function directly from the distribution function F:
Pr(0 < X ≤ 2) = F(2) − F(0) = 0.8 − 0.4 = 0.4.
(Figure: the corresponding distribution function.)


Random Variables and Distributions: Distribution Function (7)

Illustration of a general distribution function:
(Figure: a distribution function rising from 0 to 1, with continuous parts and jumps marked by dots.)


Random Variables and Distributions: Distribution Function (8)

In many cases we are interested in a critical threshold that is situated such that a certain probability mass lies below the threshold.

Def: The Quantile Function
The p quantile of a distribution function F is denoted F^{−1}(p); the quantile function F^{−1} assigns to each p ∈ [0, 1] the smallest x such that F(x) ≥ p.
Note: We write F^{−1} despite the fact that in general F cannot be inverted. When do we run into problems when trying to invert F?

Ex: The Quantile Function of the Uniform Distribution
What is the p quantile of a uniform distribution over [a, b]?


Random Variables and Distributions: Distribution Function (9)

In this case we are able to invert the distribution function:
F(x) = (x − a)/(b − a) = p,
x − a = p(b − a),
x = pb + (1 − p)a.
Note: Distribution functions and quantiles depend on the distribution only. Any two random variables with the same distribution have the same distribution function and the same quantile function.
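The closed form pb + (1 − p)a agrees with the base-R quantile function; a quick check (not from the slides):

# 5% quantile of a uniform distribution over [-2, 3]
a <- -2; b <- 3; p <- 0.05
qunif(p, min = a, max = b)   # -1.75
p * b + (1 - p) * a          # -1.75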


Random Variables and Distributions: Distribution Function (10)

Illustration of the 5% quantile of an arbitrary distribution:
(Figure: a density f(x) with the leftmost 5% of the probability mass marked at the 5% quantile.)


Random Variables and Distributions: Distribution Function (11)

The same 5% quantile determined from the distribution function:
(Figure: the distribution function F(x) with the level 0.05 and the corresponding 5% quantile marked.)


Random Variables and Distributions: Bivariate Distributions (1)

We generalize the concept of the distribution of a random variable to the joint distribution of two (and later more) random variables.
In doing so, we introduce the joint pf for two discrete random variables, the joint pdf of two continuous random variables, and the joint df for any two random variables.

Def: Joint Probability Function for Two Discrete Random Variables
The joint probability function, or joint pf, of two discrete random variables X and Y is defined by a function pf that assigns every point (x, y) in the xy-plane the value
pf(x, y) = Pr(X = x ∧ Y = y).
pf must satisfy
Σ_{all (x,y)} pf(x, y) = 1.


Random Variables and Distributions: Bivariate Distributions (2)

Ex: Bivariate Probability Function
Consider a bivariate discrete distribution, characterized by its probability function as specified in the following table:
         Y = 1   Y = 2   Y = 3   Y = 4
X = 1     0.1     0.0     0.1     0.0
X = 2     0.3     0.0     0.1     0.2
X = 3     0.0     0.2     0.0     0.0


Random Variables and Distributions: Bivariate Distributions (3)

Illustration of the joint pf of two discrete random variables:
(Figure: a 3D bar plot of the joint pf over the grid of (x, y) values.)


Random Variables and Distributions: Bivariate Distributions (4)

Def: Continuous Joint Distributions
Two continuous random variables X and Y have a continuous joint distribution if there exists a nonnegative function f defined over the xy-plane such that for every subset A of the plane
Pr((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.
The function f is called the joint probability density function, or joint pdf, of X and Y.
The density must satisfy
f(x, y) ≥ 0 for −∞ < x < ∞ and −∞ < y < ∞,  and  ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.


Random Variables and Distributions: Bivariate Distributions (5)

Illustration of the joint pdf of two continuous random variables:
(Figure: a 3D surface plot of a joint density over the xy-plane.)


Random Variables and Distributions: Bivariate Distributions (6)

Def: Bivariate Distribution Function
The joint distribution function, or joint df, of two random variables X and Y is defined as the function F defined on the entire xy-plane such that
F(x, y) = Pr(X ≤ x ∧ Y ≤ y).
For continuous random variables X and Y,
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(r, s) ds dr.
The probability that a pair (X, Y) lies in a specific rectangle a < x ≤ b and c < y ≤ d is given by
Pr(a < X ≤ b ∧ c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c).


Random Variables and Distributions: Bivariate Distributions (7)

The probability of a rectangle:
(Figure: the rectangle A = (a, b] × (c, d] in the xy-plane.)


Random Variables and Distributions: Bivariate Distributions (8)

It might also be necessary to consider the joint distribution of a discrete variable and a continuous variable.

Def: The Joint pf-pdf of a Bivariate Discrete-Continuous Distribution
Let X denote a discrete random variable with probability mass on I = {x_1, x_2, . . . } and Y a continuous random variable. Their joint distribution is characterized by a joint pf/pdf function f(x, y) such that for A_x ⊆ I and A_y ⊆ R we have
Pr((X, Y) ∈ A_x × A_y) = ∫_{A_y} Σ_{x_i ∈ A_x} f(x_i, y) dy.
And again, it must be true that
∫_{−∞}^{∞} Σ_{x_i ∈ I} f(x_i, y) dy = 1.


Random Variables and Distributions: Marginal Distributions (1)

Def: Marginal Probability Functions for Discrete Variables
If X and Y have a discrete joint distribution for which the joint probability function is pf, the marginal probability function pf_1 of X is defined as
pf_1(x) = Pr(X = x) = Σ_y Pr(X = x ∧ Y = y) = Σ_y pf(x, y).
The marginal probability function pf_2 of Y is determined in an analogous way:
pf_2(y) = Pr(Y = y) = Σ_x Pr(X = x ∧ Y = y) = Σ_x pf(x, y).


Random Variables and Distributions: Marginal Distributions (2)

Calculating pf_1 from the joint pf:
(Figure: the grid of points (x_i, y_j); pf_1(x) is obtained by summing the joint pf over all points above x.)


Random Variables and Distributions: Marginal Distributions (3)

Consider the bivariate probability function defined in the example above:
pf_1(x) = 0.2 for X = 1,  0.6 for X = 2,  0.2 for X = 3,  0.0 otherwise,
and
pf_2(y) = 0.4 for Y = 1,  0.2 for Y = 2,  0.2 for Y = 3,  0.2 for Y = 4,  0.0 otherwise.
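In R the two marginals are simply the row and column sums of the joint pf; a small check using the table from the example (not part of the slides):

# joint pf from the bivariate example; rows: X = 1, 2, 3; columns: Y = 1, 2, 3, 4
pf <- matrix(c(0.1, 0.0, 0.1, 0.0,
               0.3, 0.0, 0.1, 0.2,
               0.0, 0.2, 0.0, 0.0),
             nrow = 3, byrow = TRUE)
rowSums(pf)   # marginal pf1 of X: 0.2 0.6 0.2
colSums(pf)   # marginal pf2 of Y: 0.4 0.2 0.2 0.2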


Random Variables and Distributions: Marginal Distributions (4)

Def: Marginal Probability Functions for Continuous Variables
If X and Y have a continuous joint distribution for which the joint probability density function is f, the marginal probability density function f_1 of X is defined as
f_1(x) = ∫_{−∞}^{∞} f(x, y) dy for −∞ < x < ∞.
The marginal probability density function f_2 of Y is determined in an analogous way:
f_2(y) = ∫_{−∞}^{∞} f(x, y) dx for −∞ < y < ∞.


Random Variables and Distributions: Conditional Distributions (1)

This is a generalization of the concept of conditional probability.
The idea is that there will typically be many random variables of interest in an applied problem. After we have observed some of these variables, we want to be able to update the distributions of the remaining variables conditional on the observations.

Def: Discrete Conditional Distributions
Suppose that X and Y have a joint discrete distribution with probability function pf. Let pf_1 and pf_2 denote the marginal probability functions of X and Y, respectively.
After the value y of the variable Y has been observed, the distribution of X is characterized by the conditional probability function pf_1(x|y), given by
pf_1(x|y) = Pr(X = x|Y = y) = Pr(X = x ∧ Y = y)/Pr(Y = y) = pf(x, y)/pf_2(y).


Random Variables and Distributions: Conditional Distributions (2)

The conditional distribution of X given another variable Y is the distribution we have to use for X after we have observed the value of Y.
Analogously, the conditional probability function pf_2(y|x) is defined by
pf_2(y|x) = Pr(Y = y|X = x) = Pr(X = x ∧ Y = y)/Pr(X = x) = pf(x, y)/pf_1(x).
The conditional probability functions pf_1(x|y) and pf_2(y|x) are probability functions, i.e.,
Σ_x pf_1(x|y) = 1,  Σ_y pf_2(y|x) = 1.


Random Variables and Distributions: Conditional Distributions (3)

Consider the bivariate probability function defined in the example above:
pf_1(x|y = 1) = 0.25 for X = 1,  0.75 for X = 2,  0.00 otherwise,
pf_1(x|y = 2) = 1.00 for X = 3,  0.00 otherwise,
pf_1(x|y = 3) = 0.50 for X = 1,  0.50 for X = 2,  0.00 otherwise,
pf_1(x|y = 4) = 1.00 for X = 2,  0.00 otherwise.
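Continuing the small R check from the marginal-distribution example, the conditional pf of X given Y is obtained by dividing each column of the joint pf matrix by its column sum (again not part of the slides):

# conditional pf of X given Y = y, one column per value of y
condX <- sweep(pf, 2, colSums(pf), "/")
condX[, 1]   # pf1(x | y = 1): 0.25 0.75 0.00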


Random Variables and Distributions: Conditional Distributions (4)

Def: Continuous Conditional Distributions
Suppose that X and Y have a joint continuous distribution with probability density function f. Let f_1 and f_2 denote the marginal probability density functions of X and Y, respectively.
Let y be a value such that f_2(y) > 0; then the conditional probability density function f_1(x|y) is given by
f_1(x|y) = f(x, y)/f_2(y) for −∞ < x < ∞.
Analogously, for f_1(x) > 0,
f_2(y|x) = f(x, y)/f_1(x) for −∞ < y < ∞.


Random Variables and Distributions: Conditional Distributions (5)

Th: Generalization of the Multiplication Rule of Conditional Probability
Consider X and Y, two random variables with a joint continuous distribution. For all y with f_2(y) > 0 we have
f(x, y) = f_1(x|y) f_2(y).
At the same time, for all x with f_1(x) > 0 we have
f(x, y) = f_2(y|x) f_1(x).
This equality is key for phrasing a Bayes' theorem for random variables.


Random Variables and Distributions: Bayes' Theorem for Random Variables (1)

For discrete random variables we do not actually need an extra theorem, because all we need is already contained in the theorem we know. Only the notation has to be adapted.

Th: Law of Total Probability for Random Variables
Let pf(x, y) denote the joint probability function of two discrete random variables X and Y. Then
pf_1(x) = Σ_y pf_1(x|y) pf_2(y).
Let f(x, y) be the joint probability density of two continuous random variables X and Y. Then
f_1(x) = ∫_{−∞}^{∞} f_1(x|y) f_2(y) dy.


Random Variables and Distributions: Bayes' Theorem for Random Variables (2)

Attention: From here on we will not distinguish between f and pf and will call both probability functions and probability density functions f. Where necessary, we discuss the differences between discrete and continuous distributions.

Th: Bayes' Theorem for Random Variables
If the joint distribution of X and Y is constructed from f_1(x) and f_2(y|x), then Bayes' Theorem for random variables states that

    f_1(x|y) = f_2(y|x) f_1(x) / f_2(y).

If the joint distribution of X and Y is constructed from f_2(y) and f_1(x|y), then Bayes' Theorem for random variables states that

    f_2(y|x) = f_1(x|y) f_2(y) / f_1(x).


Random Variables and Distributions: Bayes' Theorem for Random Variables (3)

Ex: Clinical Trial
Consider a clinical trial which can either be a success (X = 1) or no success (X = 0). The probability P of success is unknown, and it is assumed that P is uniformly distributed over [0, 1]. Assume the trial is a success. What is the posterior distribution of P?

This is a joint distribution of a discrete variable (X) and a continuous variable (P).
First construct the joint pf-pdf f(x, p).
The marginal distribution of P is uniform over [0, 1]:

    f_2(p) = 1 for p ∈ [0, 1],  0 otherwise.

For given P = p, the conditional probability of X = 1 is p and Pr(X = 0) = 1 − p.


Random Variables and Distributions: Bayes' Theorem for Random Variables (4)

More generally: for given P = p the conditional probability function f_1(x|p) is given by (note: p^0 = 1)

    f_1(x|p) = p^x (1 − p)^{1−x}.

Therefore

    f(x, p) = f_1(x|p) f_2(p) = p^x (1 − p)^{1−x} · 1  for x ∈ {0, 1} ∧ p ∈ [0, 1],  and 0 otherwise.


Random Variables and Distributions: Bayes' Theorem for Random Variables (5)

[Figure: illustration of the joint pf-pdf f(x, p) over x ∈ {0, 1} and p ∈ [0, 1].]


Random Variables and Distributions: Bayes' Theorem for Random Variables (6)

What is the unconditional probability of success?

    f_1(x = 1) = ∫_0^1 f(x = 1, p) dp = ∫_0^1 p dp = p²/2 |_0^1 = 1/2.

That is, with this very diffuse prior about the probability of success, the unconditional probability of success is 50%, which conforms very much with our intuition.
The posterior distribution of the probability of success is characterized by the conditional density

    f_2(p|x) = f_1(x|p) f_2(p) / f_1(x) = p^x (1 − p)^{1−x} / (1/2).

And since we have assumed that the first treatment was a success,

    f_2(p|x = 1) = f_1(x = 1|p) f_2(p) / f_1(x = 1) = p / (1/2) = 2p.

A small numerical check of this posterior is sketched below.
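The following R sketch checks the result f_2(p|x = 1) = 2p numerically with a simple grid approximation; the grid size is an arbitrary choice, not part of the course material.

    ## Numerical check of the posterior 2p, assuming a uniform prior and one observed success.
    p.grid    <- seq(0, 1, length.out = 1001)
    prior     <- rep(1, length(p.grid))        # f_2(p) = 1 on [0, 1]
    likeli    <- p.grid                        # f_1(x = 1 | p) = p
    unnorm    <- likeli * prior
    posterior <- unnorm / sum(unnorm * diff(p.grid)[1])   # normalize numerically

    ## compare with the analytic result 2p at a few points
    round(cbind(p        = c(0.25, 0.5, 0.9),
                numeric  = approx(p.grid, posterior, c(0.25, 0.5, 0.9))$y,
                analytic = 2 * c(0.25, 0.5, 0.9)), 3)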


Random Variables and Distributions: Bayes' Theorem for Random Variables (7)

[Figure: the prior marginal density f_2(p), uniform on [0, 1].]


Random Variables and Distributions: Bayes' Theorem for Random Variables (8)

[Figure: the posterior marginal density f_2(p|x = 1) = 2p on [0, 1].]


Random Variables and Distributions: Bayes' Theorem for Random Variables (9)

Ex: Clinical Trial (cont.)
What if we now treat another patient? What is the posterior distribution of the probability of success P if the second medication is
a) again successful?
b) not successful?

Let g be the joint posterior pf-pdf of X and P, i.e., after we have already observed the first success.
Then we already know that

    g_2(p) = 2p   for 0 ≤ p ≤ 1.

The conditional distribution of X given P is unchanged (why?), i.e.,

    g_1(x|p) = f_1(x|p) = p^x (1 − p)^{1−x}.


Random Variables and Distributions: Bayes' Theorem for Random Variables (10)

The joint pf-pdf of X and P, after we have already observed the first success, is then

    g(x, p) = g_1(x|p) g_2(p) = p^x (1 − p)^{1−x} · 2p  for x ∈ {0, 1} ∧ p ∈ [0, 1],  and 0 otherwise.


Random Variables and Distributions: Bayes' Theorem for Random Variables (11)

[Figure: illustration of the joint pf-pdf g(x, p) of X and P after the first observed success.]


Random Variables and Distributions: Bayes' Theorem for Random Variables (12)

What is the probability of success in the second round?

    g_1(x = 1) = ∫_0^1 g_1(x = 1|p) g_2(p) dp = ∫_0^1 p · 2p dp = 2 p³/3 |_0^1 = 2/3.

The probability of no success in the second trial is therefore

    g_1(x = 0) = ∫_0^1 g_1(x = 0|p) g_2(p) dp = ∫_0^1 (1 − p) · 2p dp = 2 ( p²/2 |_0^1 − p³/3 |_0^1 ) = 1/3.


Random Variables and Distributions: Bayes' Theorem for Random Variables (13)

The posterior distribution of P after we have observed the second trial is characterized by the conditional pdf

    g_2(p|x) = g_1(x|p) g_2(p) / g_1(x).

In the case we observe success for a second time this is

    g_2(p|x = 1) = g_1(x = 1|p) g_2(p) / g_1(x = 1) = p · 2p / (2/3) = 3p².

In the case the second trial is no success we get

    g_2(p|x = 0) = g_1(x = 0|p) g_2(p) / g_1(x = 0) = (1 − p) · 2p / (1/3) = 6(p − p²).


Random Variables and Distributions: Bayes' Theorem for Random Variables (14)

[Figure: the conditional density g_2(p|x) of P, shown for x = 0 and x = 1.]


Random Variables and Distributions: Independence of Random Variables (15)

Def: Independence
Two random variables X and Y with joint pdf (pf) f(x, y) and marginal pdfs (pfs) f_1(x) and f_2(y) are independent if and only if for all real numbers x and y it is true that

    f(x, y) = f_1(x) f_2(y).

Then for all x with f_1(x) > 0 we have

    f_2(y|x) = f_2(y),

and for all y with f_2(y) > 0 we have

    f_1(x|y) = f_1(x).


Random Variables and Distributions: Functions of Random Variables (1)

Suppose you know the distribution of some basic underlying random variable X. The variable of interest is, however, not directly this basic variable but some function of X, i.e., Y = h(X). What is the distribution of Y?

Ex: Y = X², when X is Uniformly Distributed
Suppose that X has a uniform distribution on [−1, 1] and Y is defined by Y = X². What is the distribution of Y?

Since Y = X², Y lies in [0, 1].
Let G denote the df of Y; then

    G(y) = Pr(Y ≤ y) = Pr(X² ≤ y) = Pr(−√y ≤ X ≤ √y).


Random Variables and Distributions: Functions of Random Variables (2)

This implies

    G(y) = Pr(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} f(x) dx = ∫_{−√y}^{√y} (1/2) dx = √y.

If g(y) denotes the probability density of Y, then we know that g can be derived from the distribution function G by

    g(y) = dG(y)/dy = 1/(2√y).


Random Variables and Distributions: Functions of Random Variables (3)

The density g(y) is unbounded at y = 0(!)

[Figure: the density g(y) = 1/(2√y) on (0, 1].]


Random Variables and Distributions: Functions of Random Variables (4)

The transformation of densities is, apparently, not so straightforward. The safe way: transform the distribution function!
Suppose Y = h(X) and X lies in a certain interval (a, b). Let h be differentiable with h'(x) > 0 in (a, b). With F and G the distribution functions of X and Y, respectively,

    G(y) = Pr(Y ≤ y) = Pr(h(X) ≤ y) = Pr(X ≤ h⁻¹(y)) = F(h⁻¹(y)).

The density g(y) then follows from

    g(y) = dG(y)/dy = dF(h⁻¹(y))/dy = f(h⁻¹(y)) · d(h⁻¹(y))/dy.


Random Variables and Distributions: Functions of Random Variables (5)

Suppose now that h'(x) < 0 in (a, b). Then

    G(y) = Pr(Y ≤ y) = Pr(h(X) ≤ y) = Pr(X ≥ h⁻¹(y)) = 1 − F(h⁻¹(y)).

The density g(y) then is

    g(y) = dG(y)/dy = −dF(h⁻¹(y))/dy = −f(h⁻¹(y)) · d(h⁻¹(y))/dy,

and since d(h⁻¹(y))/dy < 0 we can generally write

    g(y) = f(h⁻¹(y)) | d(h⁻¹(y))/dy |.


Random Variables and Distributions: Functions of Random Variables (6)

In general we have to separate domains where h is strictly increasing from domains where h is strictly decreasing and treat them separately. Domains where h is flat have to be treated separately as well!

Y = X², when X is Uniformly Distributed (cont.)
Use the presented framework.
The function y = h(x) = x² is strictly decreasing in [−1, 0) and strictly increasing in (0, 1].
First treat x ∈ [−1, 0), which contributes

    g_a(y) = f(h⁻¹(y)) | d(h⁻¹(y))/dy | = (1/2) · 1/(2√y)   for y ∈ (0, 1].


Random Variables and Distributions: Functions of Random Variables (7)

In the next step we treat x ∈ (0, 1], which again contributes

    g_b(y) = f(h⁻¹(y)) | d(h⁻¹(y))/dy | = (1/2) · 1/(2√y)   for y ∈ (0, 1].

The density of Y is then the sum of all sub-parts, which is

    g(y) = 1/(2√y)   for y ∈ (0, 1].

It is unbounded for y → 0, since the derivative of √y is unbounded for y → 0.
However, ∫_0^1 g(y) dy exists and equals 1, so g is a properly defined probability density. A short simulation check is sketched below.
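A minimal Monte Carlo sketch of this result: simulate X uniform on [−1, 1], transform to Y = X², and compare the histogram with the derived density. The sample size and seed are arbitrary.

    ## Simulation check of g(y) = 1 / (2 * sqrt(y)) for Y = X^2, X ~ U(-1, 1)
    set.seed(1)
    x <- runif(1e5, min = -1, max = 1)
    y <- x^2
    hist(y, breaks = 100, freq = FALSE, main = "Density of Y = X^2")
    curve(1 / (2 * sqrt(x)), from = 0.001, to = 1, add = TRUE, lwd = 2)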


Expectation: The Expectation of a Random Variable (1)

The distribution function contains the entire probabilistic information about a random variable. It is now our goal to capture part of this information with the help of different measures, without having to describe the entire distribution.
First, we want to characterize the central location of a random variable with a measure called the expectation of a random variable.

Def: Expectation of a Discrete Random Variable
Consider a random variable X with a discrete distribution given by the probability function f. The expectation of X, denoted by E(X), is defined as

    E(X) = Σ_x x f(x).


Expectation: The Expectation of a Random Variable (2)

Def: Expectation of a Continuous Random Variable
Consider a random variable X with a continuous distribution given by the probability density function f. The expectation of X, denoted by E(X), is defined as

    E(X) = ∫_{−∞}^{∞} x f(x) dx.

The number E(X) is also called the expected value of X or the mean of X.
The expectation is the probability-weighted sum (integral) of the possible outcomes of a random variable.


Expectation: The Expectation of a Random Variable (3)

The relation of the expectation to the center of gravity: assume a long rod (along the x-axis) over which the mass varies continuously such that f(x) is the mass density at x. Then the center of gravity is located at the point E(X), and the rod will be balanced if it is supported at that point.

Consider a random variable X (represented by the pdf f) and Y which is derived from X by Y = r(X). Then

    E(Y) = E(r(X)) = ∫_{−∞}^{∞} r(x) f(x) dx.


Expectation: The Expectation of a Random Variable (4)

Ex: Y = X², when X is Uniformly Distributed
Suppose that X has a uniform distribution on [−1, 1] and Y is defined by Y = X². What is the expectation of Y?

Excursus: What is the expectation of X?

    E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{−1}^{1} x · (1/2) dx = (1/2) x²/2 |_{−1}^{1} = 0.

There are two different possibilities to calculate E(Y):

    E(Y) = ∫_{−1}^{1} x² f(x) dx = ∫_{−1}^{1} x² · (1/2) dx = (1/2) x³/3 |_{−1}^{1} = 1/3,

    E(Y) = ∫_0^1 y g(y) dy = ∫_0^1 y · 1/(2√y) dy = (1/2) · (2/3) y^{3/2} |_0^1 = 1/3.


Expectation: Properties of Expectations (1)

Th: Expectation of a Linear Function
Consider a random variable X for which the expectation E(X) exists. If Y = aX + b with a and b constant, then

    E(Y) = aE(X) + b.

Suppose f(x) is the pdf of X; then

    E(Y) = E(aX + b) = ∫_{−∞}^{∞} (ax + b) f(x) dx
         = a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx
         = aE(X) + b.


Expectation: Properties of Expectations (2)

Th: Expectation of a Sum of Random Variables
Consider random variables X_i with i ∈ {1, . . . , n} such that the expectations E(X_i) exist. Then

    E(X_1 + · · · + X_n) = E(X_1) + · · · + E(X_n).

As a special case of the theorems presented we have, for constants a_1, . . . , a_n, b,

    E(a_1 X_1 + · · · + a_n X_n + b) = a_1 E(X_1) + · · · + a_n E(X_n) + b.


Expectation: Properties of Expectations (3)

Ex: The Mean of the Binomial Distribution
Consider a random variable X that is binomially distributed with parameters n and p. What is the expectation of X?

Remember: X consists of n sub-experiments which are independent and where each sub-experiment has a constant probability of success, p.
So consider X_i a random variable, with X_i = 1 if sub-experiment i is a success and X_i = 0 otherwise.
The expectation of X_i is

    E(X_i) = 1 · p + 0 · (1 − p) = p.

Then apparently we have X = Σ_{i=1}^{n} X_i and, thus,

    E(X) = Σ_{i=1}^{n} E(X_i) = np.


Expectation: Properties of Expectations (4)

Since we also know the pf of the binomial distribution, we can calculate the expectation also as

    E(X) = Σ_{x=0}^{n} x (n choose x) p^x (1 − p)^{n−x} = np.

Th: Expectation of Independent Random Variables
If X_1, . . . , X_n are independent random variables such that all expectations E(X_i) exist, then

    E( Π_{i=1}^{n} X_i ) = Π_{i=1}^{n} E(X_i).

Attention: The assumption of independence is crucial in this theorem!


Expectation: Properties of Expectations (5)

Ex: Expected Number of Trials
Suppose you try to perform a certain task until you are successful. Assume that the trials are independent and that the probability of success is p < 1 per trial. What is the expected value of the number of trials it takes until you succeed?

First step: What is the probability that you succeed at trial x?
This means that you do not have success in the first x − 1 consecutive trials and that in the last trial (i.e., trial number x) you have success.
Let q = 1 − p denote the probability of non-success. Since the trials are independent, the probability that you have success at trial x (i.e., the pf of the experiment) is then

    f(x) = Pr(X = x) = q^{x−1} p.


Expectation: Properties of Expectations (6)

Then the expected value of the number of trials it takes until you have success is

    E(X) = Σ_{x=1}^{∞} x f(x) = Σ_{x=1}^{∞} x q^{x−1} p = p Σ_{x=1}^{∞} x q^{x−1}.

To solve this sum, let

    a = Σ_{x=1}^{∞} x q^{x−1}.

Then

    a − a·q = 1 + 2q + 3q² + 4q³ + · · ·
                −  q − 2q² − 3q³ − · · ·
    (1 − q) a = 1 + q + q² + q³ + · · · = Σ_{i=0}^{∞} q^i = 1/(1 − q).


Expectation: Properties of Expectations (7)

That means

    a = 1/(1 − q)² = 1/p²,

and thus

    E(X) = p a = p · 1/p² = 1/p.

Ex: Rolling a Double-6
Roll two dice simultaneously. Repeat the experiment until you obtain a double-6. What is the expected number of trials needed until you succeed?
Since p = 1/36 we have E(X) = 1/p = 36. A quick simulation check is sketched below.
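The following R sketch simulates the double-6 experiment; the number of replications is an arbitrary choice.

    ## Count the number of rolls of two dice until both show 6; the sample mean
    ## of the trial counts should be close to 1/p = 36.
    set.seed(42)
    rolls.until.double6 <- function() {
      n <- 0
      repeat {
        n <- n + 1
        if (all(sample(1:6, 2, replace = TRUE) == 6)) return(n)
      }
    }
    mean(replicate(1e4, rolls.until.double6()))   # roughly 36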


Expectation: The Variance (1)

The expectation is the most common measure of central location of a distribution. However, it does not convey very much information about the distribution.
We define the associated measure of dispersion, the variance.

Def: Variance
Suppose X is a random variable with mean µ = E(X). The variance of X is defined as

    σ² = Var(X) = E[(X − µ)²].

I.e., the variance is the expected quadratic deviation of X from its mean.
It is true that Var(X) ≥ 0.
If the expectation of the quadratic deviation is unbounded, the variance does not exist.


Expectation: The Variance (3)

The variance calculated step by step (for the fair die, with µ = 3.5):

    x         1      2      3      4      5      6
    x − µ    −2.50  −1.50  −0.50   0.50   1.50   2.50
    (x − µ)²  6.25   2.25   0.25   0.25   2.25   6.25
    f(x)      1/6    1/6    1/6    1/6    1/6    1/6

The variance is then

    σ² = Var(X) = Σ_{x=1}^{6} (x − µ)² f(x) = 17.5/6 = 35/12 = 2.917.

The standard deviation is

    σ = sqrt( Σ_{x=1}^{6} (x − µ)² f(x) ) = sqrt(35/12) = 1.708.

This calculation is reproduced in the short R sketch below.
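A minimal R sketch reproducing the table above exactly:

    ## Exact mean, variance, and standard deviation of a fair die
    x  <- 1:6
    fx <- rep(1/6, 6)
    mu <- sum(x * fx)                  # 3.5
    sigma2 <- sum((x - mu)^2 * fx)     # 35/12 = 2.9167
    c(mean = mu, variance = sigma2, sd = sqrt(sigma2))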


Expectation: Properties of the Variance (1)

Th: Variance of a Constant
Var(X) = 0 if and only if there exists some constant c such that Pr(X = c) = 1.
I.e., only a constant has zero variance.

Th: Variance of a Linear Function
For constants a and b it is true that

    Var(aX + b) = a² Var(X).

To prove this theorem, consider that the mean of aX + b equals aE(X) + b and substitute into the definition of the variance.


Expectation: Properties of the Variance (2)

Th: Alternative Calculation of the Variance
For every random variable X it is true that

    Var(X) = E(X²) − [E(X)]².

Let µ = E(X); then

    Var(X) = E[(X − µ)²]
           = E(X² − 2µX + µ²)
           = E(X²) − 2µE(X) + µ²
           = E(X²) − µ².


Expectation: Properties of the Variance (3)

Th: Variance of the Sum of Independent Random Variables
Consider X_1, . . . , X_n, n independent random variables. For the variance of the sum of these variables it is true that

    Var(X_1 + · · · + X_n) = Var(X_1) + · · · + Var(X_n).

Attention: The assumption of independence is crucial to the validity of the stated equation!
Proof: Substitute into the definition of the variance and use the fact that for independent X_i and X_j we have E(X_i X_j) = E(X_i)E(X_j).

Th: Variance of the Sum of Independent Random Variables (cont.)
Consider X_1, . . . , X_n, n independent random variables, and a_1, . . . , a_n, b arbitrary constants. Then

    Var(a_1 X_1 + · · · + a_n X_n + b) = a_1² Var(X_1) + · · · + a_n² Var(X_n).


Expectation: Properties of the Variance (4)

Ex: Mean and Variance of Portfolio Returns
Assume you have two investment opportunities a and b with independent future returns r_a and r_b with

    E(r_a) = µ_a,  E(r_b) = µ_b,  Var(r_a) = σ_a²,  Var(r_b) = σ_b².

What is the expected return and return variance of a portfolio with weights w_a ∈ [0, 1] and w_b = 1 − w_a?

The portfolio return r_p is given by

    r_p = w_a r_a + w_b r_b.

Since the two investments are independent, the portfolio return satisfies

    µ_p = w_a µ_a + w_b µ_b,   σ_p² = w_a² σ_a² + w_b² σ_b².


Expectation: Properties of the Variance (5)

Consider the example with parameters

    µ_a = 3%,  µ_b = 6%,  σ_a² = 0.0025,  σ_b² = 0.01.

[Figure: E(r_p) plotted against Var(r_p) for portfolio weights ranging from pure investment a to pure investment b, with the minimum-variance portfolio marked.]


Expectation: Properties of the Variance (6)

You see: investment a is less risky than investment b. However, holding only investment a in the portfolio is not optimal, since adding a small fraction of the riskier investment b to the portfolio leads to
◮ an increase of the expected portfolio return,
◮ a decrease in the portfolio risk.
This is the concept of diversification!
Find the minimum-variance portfolio (a numerical sketch follows below):

    σ_p² = (1 − w_b)² σ_a² + w_b² σ_b²,
    dσ_p²/dw_b = −2(1 − w_b) σ_a² + 2 w_b σ_b² = 0,
    w_b = σ_a² / (σ_a² + σ_b²) > 0.
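A minimal R sketch of the two-asset frontier for independent returns, using the example parameters above; the grid of weights is an arbitrary choice.

    ## Mean/variance frontier and minimum-variance weight for independent returns
    mu.a <- 0.03; mu.b <- 0.06
    s2.a <- 0.0025; s2.b <- 0.01
    w.b  <- seq(0, 1, by = 0.01)
    mu.p  <- (1 - w.b) * mu.a + w.b * mu.b
    var.p <- (1 - w.b)^2 * s2.a + w.b^2 * s2.b     # independence: no covariance term
    plot(var.p, mu.p, type = "l", xlab = "Var(r_p)", ylab = "E(r_p)")
    w.b.min <- s2.a / (s2.a + s2.b)                # minimum-variance weight, here 0.2
    points((1 - w.b.min)^2 * s2.a + w.b.min^2 * s2.b,
           (1 - w.b.min) * mu.a + w.b.min * mu.b, pch = 19)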


Expectation: Properties of the Variance (7)

Ex: Variance of the Binomial Distribution
Consider a random variable X which is generated by a binomial distribution with parameters n and p. What is the variance of X?

We can write X = X_1 + · · · + X_n, with n independent sub-experiments, where X_i is 1 with probability p.
We know already that for independent variables we have

    Var(X) = Σ_{i=1}^{n} Var(X_i).

For each of the sub-experiments calculate the variance in three steps:

    E(X_i) = 1 · p + 0 · (1 − p) = p,
    E(X_i²) = 1² · p + 0² · (1 − p) = p,
    Var(X_i) = E(X_i²) − [E(X_i)]² = p − p² = p(1 − p).


Expectation: Properties of the Variance (8)

Then it follows for the variance of the binomial distribution that

    Var(X) = np(1 − p).

An alternative interpretation of the variance:

Ex: Minimizing the Mean Squared Error
Consider a random variable X with mean µ and variance σ². Make a prediction m for the outcome of X. Measure the quality of the prediction with respect to its mean squared error (MSE): E[(X − m)²]. What is the prediction with the best quality, i.e., with minimum MSE?

For an arbitrary m the MSE is

    MSE = E[(X − m)²] = E(X² − 2mX + m²) = E(X²) − 2mµ + m².


Expectation: Properties of the Variance (9)

Find the optimum by considering the first derivative:

    dMSE/dm = −2µ + 2m = 0,
    m = µ.

In order to minimize the MSE, the predicted value of X should be its mean µ.


Expectation: Covariance and Correlation (1)

If we are interested in the joint distribution of two random variables, we need a measure of how the two random variables depend on each other.

Def: Covariance
Consider two random variables X and Y with E(X) = µ_x, E(Y) = µ_y, Var(X) = σ_x², Var(Y) = σ_y².
Then the covariance of X and Y, Cov(X, Y), is defined by

    Cov(X, Y) = E[(X − µ_x)(Y − µ_y)].

If σ_x² < ∞ and σ_y² < ∞, then Cov(X, Y) exists and is finite.
If Y tends to take a value that exceeds its mean whenever X lies above its mean, and if Y tends to take a value below its mean whenever X is below its mean, then the covariance of X and Y is positive.


Expectation: Covariance and Correlation (2)

Th: Bounds for the Covariance
By the Schwarz inequality we have

    [Cov(X, Y)]² ≤ σ_x² σ_y².

Since the covariance is somewhat hard to interpret, the dependence of two random variables is frequently expressed by a scale-independent measure, the correlation.

Def: Correlation
Consider two random variables X and Y with 0 < σ_x² < ∞ and 0 < σ_y² < ∞; then the correlation ρ(X, Y) of the two variables is defined by

    ρ(X, Y) = Cov(X, Y) / (σ_x σ_y).


Expectation: Covariance and Correlation (3)

Th: Bounds for the Correlation
From the bounds for the covariance it follows that

    −1 ≤ ρ(X, Y) ≤ 1.

X and Y are positively correlated if ρ(X, Y) > 0, they are negatively correlated if ρ(X, Y) < 0, and they are uncorrelated if ρ(X, Y) = 0.

Th: Alternative Definition of the Covariance
For X and Y with 0 < σ_x² < ∞ and 0 < σ_y² < ∞ we have

    Cov(X, Y) = E(XY) − E(X)E(Y).

This leads directly to the following theorem.

Th: Independence and Correlation
Independent random variables are uncorrelated.


Expectation: Covariance and Correlation (4)

This can easily be seen. Since independence of X and Y implies

    E(XY) = E(X)E(Y),

the covariance of X and Y vanishes:

    Cov(X, Y) = E(XY) − E(X)E(Y) = 0 = ρ(X, Y).

Attention: Uncorrelated random variables are not necessarily independent.

Ex: Counterexample – Uncorrelated but not Independent
Consider X that can assume the three discrete values −1, 0, 1 with equal probability. Let Y = X². Show that ρ(X, Y) = 0.


Expectation: Covariance and Correlation (5)

Th: Correlation as a Measure of Linear Dependence
Consider X with 0 < σ_x² < ∞ and Y = aX + b for constants a ≠ 0 and b.
If a > 0, then ρ(X, Y) = 1.
If a < 0, then ρ(X, Y) = −1.

Substitute into the definition of the covariance:

    Cov(X, Y) = E(XY) − E(X)E(Y)
              = E[X(aX + b)] − E(X)E(aX + b)
              = aE(X²) + bE(X) − a[E(X)]² − bE(X)
              = a(E(X²) − [E(X)]²) = a σ_x².

Since σ_y = |a| σ_x, the statement of the theorem follows.


Expectation: Covariance and Correlation (6)

Th: Variance of the Sum of Random Variables
Consider X and Y with finite variance; then

    Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

Substitute into the definition of the variance:

    Var(X + Y) = E[(X − µ_x + Y − µ_y)²]
               = E[(X − µ_x)² + (Y − µ_y)² + 2(X − µ_x)(Y − µ_y)]
               = Var(X) + Var(Y) + 2 Cov(X, Y).

A more general formulation, for constants a, b, and c, is

    Var(aX + bY + c) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y),

in particular

    Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y).


Expectation: Covariance and Correlation (7)

Ex: Mean and Variance of Portfolio Returns Revisited
As in the previous portfolio example, we have two investment opportunities a and b with future returns r_a and r_b with correlation ρ(r_a, r_b). What is the expected return and return variance of a portfolio with weights w_a ∈ [0, 1] and w_b = 1 − w_a?

The portfolio return r_p is given by

    r_p = w_a r_a + w_b r_b.

Since the two investments are correlated, the portfolio return satisfies

    µ_p = w_a µ_a + w_b µ_b,   σ_p² = w_a² σ_a² + w_b² σ_b² + 2 w_a w_b σ_a σ_b ρ(r_a, r_b).


Expectation: Covariance and Correlation (8)

[Figure: mean/variance characteristic E(r_p) against Var(r_p) for different levels of return correlation, ρ = −0.8, ρ = 0, and ρ = 0.8.]


Expectation: Properties of the Sample Mean (1)

This is now the first step towards making estimates.
In the final state, we want to estimate the mean of a distribution from a sample of realizations drawn from this distribution.
The first step is to get some understanding of the interdependence between the expectation of the distribution and the sample mean.

Def: Sample Mean
Draw a sample of n realizations (observations) X_1, . . . , X_n from a given distribution. The sample mean is defined as

    X̄_n = (1/n) Σ_{i=1}^{n} X_i.


Expectation: Properties of the Sample Mean (2)

We need a short detour with two theorems.

Th: Markov Inequality
Consider a non-negative random variable X, i.e., Pr(X ≥ 0) = 1. Then for every given number δ > 0 it follows that

    Pr(X ≥ δ) ≤ E(X)/δ.

I.e., we can set limits for the probability of outliers.

Ex: Limits to the Probability of Outliers
Suppose the non-negative random variable X has an expectation of E(X) = 1. What can we say about the probability of observing X ≥ 100?
By use of the Markov inequality we can state

    Pr(X ≥ 100) ≤ 1/100 = 1.00%.


Expectation: Properties of the Sample Mean (3)

Th: Chebyshev Inequality
Consider X, a random variable with finite variance. For every δ > 0,

    Pr(|X − E(X)| ≥ δ) ≤ Var(X)/δ².

This can be derived from the Markov inequality applied to Y = [X − E(X)]², which is a non-negative random variable.

Ex: 3σ
What is the probability that a random variable X deviates from its expectation by more than three times the standard deviation?
Without specific knowledge of the distribution we can propose

    Pr(|X − E(X)| ≥ 3σ) ≤ σ²/(3σ)² = 1/9.

A comparison with the exact value for a normal distribution is sketched below.
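The Chebyshev bound holds for any distribution with finite variance; for a specific distribution the actual tail probability can be much smaller. A one-line R comparison for the normal case:

    ## Distribution-free Chebyshev bound vs. the exact 3-sigma tail of a normal
    2 * pnorm(-3)    # about 0.0027 for a normal distribution
    1 / 9            # the Chebyshev bound, about 0.111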


Expectation: Properties of the Sample Mean (4)

Th: Properties of the Sample Mean
Draw a sample of n independent observations X_1, . . . , X_n from a distribution with mean µ and variance σ². The sample mean

    X̄_n = (1/n)(X_1 + · · · + X_n)

is a random variable with mean and variance

    E(X̄_n) = µ   and   Var(X̄_n) = σ²/n,

respectively. More precisely, for δ > 0,

    Pr(|X̄_n − µ| ≥ δ) ≤ σ²/(nδ²).


Expectation: Properties of the Sample Mean (5)

I.e., we now have a procedure to estimate the mean of a distribution and we are able to assess the quality of the estimate.
The variance of the estimate decreases linearly with the sample size n.
Only drawback: we must have an idea about the variance of the distribution.

Ex: Clinical Trial
Consider the example we already had. Each trial is an independent draw from a binomial distribution with unknown probability of success p. How many trials do we need to perform in order to be 99% confident that the absolute error of the estimation is less than 5 percentage points?

Let X_i be the outcome of trial i, which equals 1 in the case of success and 0 otherwise.


Expectation: Properties of the Sample Mean (6)

Then the individual trial is characterized by

    E(X_i) = p,   Var(X_i) = p(1 − p) ≤ 0.25.

Having a sample of n observations, we know that

    E(X̄_n) = p,   Var(X̄_n) = p(1 − p)/n ≤ 0.25/n,

and

    Pr(|X̄_n − p| ≥ 0.05) ≤ p(1 − p)/(n · 0.05²) ≤ 0.25/(0.0025 n) = 100/n ≤ 0.01
    ⇒ n ≥ 10,000.


Expectation: Properties of the Sample Mean (7)

If taking a sample of at least 10,000 trials, we know that the estimate of the probability of success from the sample mean deviates by more than 5 percentage points from the true p with less than 1% probability.
This estimate of the probability of an error is very conservative, since it does not use any information about the distribution, with the exception of a rough estimate of an upper bound for the variance.
The experiment in NumberOfTrials.R demonstrates that the actual probability of an error in excess of 5 percentage points is considerably lower; a sketch of such an experiment follows below.
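A sketch of the kind of experiment NumberOfTrials.R performs (the actual course file is not reproduced here, and the true p is an assumed value): with n = 10,000 Bernoulli trials, how often does the sample mean miss p by 5 percentage points or more?

    ## Empirical frequency of an error of 5 percentage points or more, n = 10,000
    set.seed(123)
    p <- 0.3                                     # assumed true success probability
    n <- 10000
    miss <- replicate(1000, abs(mean(rbinom(n, 1, p)) - p) >= 0.05)
    mean(miss)                                   # far below the 1% Chebyshev bound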


Expectation: Properties of the Sample Mean (8)

Th: Law of Large Numbers
Let X_1, . . . , X_n be a random sample from a distribution with mean µ and for which the variance exists. Let X̄_n denote the sample mean; then for every ε > 0

    lim_{n→∞} Pr(|X̄_n − µ| < ε) = 1.

It is also said that X̄_n converges to µ in probability,

    X̄_n →^p µ   as n → ∞.


Special Distributions: Binomial Distribution

We have already discussed the binomial distribution, see above.


Special Distributions: Normal Distribution (1)

The normal distribution is by far the single most important probability distribution in statistics.
This is true despite the fact that its density cannot be integrated in closed form.
The reason is the central limit theorem (which will follow soon).

Def: Normal Distribution
A variable X has a normal distribution with mean µ and variance σ² (with −∞ < µ < ∞, σ > 0) if its density is given by

    f(x) = 1/(√(2π) σ) · exp[ −(1/2) (x − µ)²/σ² ].

Then X is said to be N(µ, σ²) distributed.


Special Distributions: Normal Distribution (2)

[Figure: the shape of the density of an N(µ, σ²) distributed random variable, centered at µ with scale σ.]


Special Distributions: Normal Distribution (3)

Th: Sum of Normally Distributed Variables
Let X and Y be two normally distributed random variables with µ_x and σ_x² the mean and variance of X, and µ_y and σ_y² the mean and variance of Y. Then the sum Z = X + Y is also normally distributed, with mean µ_z = µ_x + µ_y and variance σ_z² = σ_x² + σ_y² + 2 Cov(X, Y).

Def: Standard Normal Distribution
A random variable X has a standard normal distribution if it is normally distributed with mean µ_x = 0 and variance σ_x² = 1.
If X is normally distributed with mean µ_x and variance σ_x², then

    Y = (X − µ_x)/σ_x

has a standard normal distribution.


Special Distributions: Normal Distribution (4)

For a normally distributed variable, 68.3% of all observations lie in the interval [µ − σ, µ + σ].

[Figure: normal density with 68.3% of the probability mass between µ − σ and µ + σ and 15.85% in each tail.]


Special Distributions: Normal Distribution (5)

For a normally distributed variable, 95.5% of all observations lie in the interval [µ − 2σ, µ + 2σ].

[Figure: normal density with 95.5% of the probability mass between µ − 2σ and µ + 2σ and 2.25% in each tail.]


Special Distributions: Normal Distribution (6)

The distribution function of the standard normal distribution is denoted by N(x). I.e., if X has a standard normal distribution,

    Pr(X ≤ x) = N(x).

In R the density, distribution function, quantile, and random number generator can be called using:
dnorm(x), pnorm(x), qnorm(x), rnorm(n)

The quantile N⁻¹(α) evaluated on a selection of values:

    α        N⁻¹(α)      α       N⁻¹(α)
    10.0%    −1.282      1.0%    −2.326
    5.0%     −1.645      0.5%    −2.576
    2.5%     −1.960
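The quantile table can be reproduced directly with qnorm:

    ## Standard normal quantiles for the alpha values in the table above
    alpha <- c(0.10, 0.05, 0.025, 0.01, 0.005)
    round(qnorm(alpha), 3)    # -1.282 -1.645 -1.960 -2.326 -2.576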


Special Distributions: Central Limit Theorem (1)

We know already that the sum of normally distributed variables is again normally distributed. Thus, the sample mean is also normally distributed.
This "stability" property of the normal distribution is unique.

Th: Central Limit Theorem
Let X_1, . . . , X_n form a random sample of size n (of independent observations) drawn from a given distribution with mean µ and variance 0 < σ² < ∞. Then for each fixed number x

    lim_{n→∞} Pr[ (X̄_n − µ)/(σ/√n) ≤ x ] = N(x).

I.e., the distribution of sample means converges to a normal distribution. The normal distribution "attracts" the distribution of sample means.


Special Distributions: Central Limit Theorem (2)

Ex: Estimate Average Income
Suppose wages in a town are uniformly distributed between EUR 2,000 and EUR 3,000.
Asking one person about her salary, each value between EUR 2,000 and EUR 3,000 is equally probable.
Estimate the mean income by taking samples. What is the quality of your estimate?

If you take the smallest possible sample, i.e., ask only one person, and take this observation as your estimate of the average income, this may result in a huge error.
To illustrate this fact, we simulate the experiment and repeat it 10,000 times.


Special Distributions: Central Limit Theorem (3)

The mean and variance of the income distribution:

    E(X²) = ∫_{2,000}^{3,000} x² · 1/(3,000 − 2,000) dx = (1/1,000) [x³/3]_{2,000}^{3,000} = (1/1,000) (3,000³ − 2,000³)/3,

    E(X) = (2,000 + 3,000)/2 = 2,500,

    Var(X) = E(X²) − [E(X)]² = (1/1,000) (3,000³ − 2,000³)/3 − 2,500² = 83,333.3̇.


Special Distributions: Central Limit Theorem (4)

[Figure: histogram of the distribution of the sample mean for n = 1 (10,000 experiments).]


Special Distributions: Central Limit Theorem (5)

[Figure: histogram of the distribution of the sample mean for n = 2 (10,000 experiments).]


Special Distributions: Central Limit Theorem (6)

[Figure: histogram of the distribution of the sample mean for n = 3 (10,000 experiments).]


Special Distributions: Central Limit Theorem (7)

[Figure: histogram of the distribution of the sample mean for n = 10 (10,000 experiments).]


Special Distributions: Central Limit Theorem (8)

[Figure: histogram of the distribution of the sample mean for n = 100 (10,000 experiments).]


Special Distributions: Central Limit Theorem (9)

The implementation of this simulation can be found in the file CentralLimitUniform.R; a similar sketch is given below.
The same convergence result is achieved if the income is binomially distributed, either EUR 2,000 or EUR 3,000 with equal probability.
Since the variance of this distribution is larger than the variance of the uniform distribution, convergence is slower.
See CentralLimitBinom.R for an implementation.
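A sketch in the spirit of CentralLimitUniform.R (the course file itself is not shown here): the distribution of the sample mean of n uniform incomes on [2,000, 3,000], repeated 10,000 times for several n.

    ## Histograms of the sample mean for increasing sample size n
    set.seed(1)
    clt.hist <- function(n) {
      est <- replicate(10000, mean(runif(n, 2000, 3000)))
      hist(est, breaks = 50, freq = FALSE,
           main = paste("Sample mean, n =", n), xlim = c(2000, 3000))
    }
    par(mfrow = c(2, 2))
    for (n in c(1, 3, 10, 100)) clt.hist(n)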


Special Distributions: Exponential Distribution (1)

Imagine a situation where we wait for an event to occur and assume there is a constant intensity of arrival, i.e., in any infinitesimal time interval [t, t + dt] the probability of arrival (conditional on the fact that the event has not already occurred) is the constant λ dt.
The arrival time in such a setup is then exponentially distributed.
This is the continuous counterpart to the discrete example we discussed, where we try some task repeatedly until we have success.

Def: Exponential Distribution
A random variable X is said to be exponentially distributed with intensity λ if its pdf is

    f(x) = λ e^{−λx}  for x ∈ [0, ∞),  and 0 otherwise.


Special Distributions: Exponential Distribution (2)

The expectation of an exponentially distributed variable is

    E(X) = ∫_0^∞ x f(x) dx = ∫_0^∞ x λ e^{−λx} dx.

Integration by parts yields

    E(X) = 1/λ.

Ex: Radioactive Decay
The stability (i.e., the time until decay) of a radioactive element with half-life T_{1/2} is distributed with density

    f(t) = λ e^{−λt},  for t ∈ [0, ∞),  with λ = ln(2)/T_{1/2}.

What is the probability that one particular atom exists longer than 2 · T_{1/2}?


Special Distributions: Exponential Distribution (3)

Let τ denote the lifetime of a particular atom.
The probability that τ exceeds some constant δ ≥ 0 is

    Pr(τ ≥ δ) = ∫_δ^∞ λ e^{−λt} dt = λ [ −(1/λ) e^{−λt} ]_δ^∞ = e^{−λδ}.

Substituting δ = 2 · T_{1/2} and λ = ln(2)/T_{1/2} yields

    Pr(τ ≥ 2 · T_{1/2}) = e^{−2 ln(2)} = 1/4.
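A quick numerical check of the decay example in R; the unit of time for T_{1/2} is arbitrary.

    ## Survival probability beyond two half-lives: exp(-2 * log(2)) = 0.25
    T.half <- 1                            # arbitrary unit of time
    lambda <- log(2) / T.half
    1 - pexp(2 * T.half, rate = lambda)    # 0.25
    exp(-2 * log(2))                       # the same value in closed form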


Special Distributions: Chi-Square Distribution (1)

Assume that X_1, . . . , X_n is a sample drawn from a normal distribution with mean µ and variance σ².
We already know that the sample mean X̄_n is normally distributed with mean µ and variance σ²/n, i.e., X̄_n will be a perfect candidate to estimate the (unknown) µ from a sample (see later).
One problem: the precision of the estimate of µ depends on σ²!
When we want to estimate µ from an observed sample, we generally do not know σ² either!
I.e., we know that X̄_n is centered at the true µ, but we have no idea of how to assess the precision of that estimate!


Special Distributions: Chi-Square Distribution (2)

Therefore, we also have to estimate σ² from the sample data X_1, . . . , X_n.
. . . let's approach this idea step by step!

Def: χ² Distribution
Let X_1, . . . , X_n be a sample drawn from a normal distribution with mean µ and variance σ²; then

    Σ_{i=1}^{n} ((X_i − µ)/σ)²

has a χ² distribution with n degrees of freedom.


Special Distributions: Chi-Square Distribution (3)

More technically . . .

Def: χ² Distribution (technically)
A random variable Y is said to have a χ² distribution with n degrees of freedom if its density f(y) is given by

    f(y) = 1/(2^{n/2} Γ(n/2)) · y^{(n/2)−1} e^{−y/2}  for y > 0,  and 0 otherwise,

where

    Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.

In R the density, distribution function, quantile, and random number generator for a χ² distribution with df degrees of freedom are:
dchisq(x,df), pchisq(x,df), qchisq(x,df), rchisq(n,df)


Special Distributions: Chi-Square Distribution (4)

[Figure: the density of the χ² distribution for different degrees of freedom, n = 3, 5, 10.]


Special Distributions: Chi-Square Distribution (5)

Th: Sum of χ² Distributed Variables
Let X be a χ² distributed variable with m degrees of freedom and Y a χ² distributed variable with n degrees of freedom; then the sum X + Y is χ² distributed with m + n degrees of freedom.

Ex: Simulate a χ² Distribution
Generate a series of n standard normally distributed variables X_1, . . . , X_n and calculate Y = X_1² + · · · + X_n². To get an idea of the distribution of Y, repeat this exercise 100,000 times and plot a histogram of the generated realizations of Y. Compare the shape of the histogram to the density of the χ² distribution with n degrees of freedom. A possible solution is sketched below.
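One possible solution sketch for the simulation exercise, with n = 5 as an arbitrary choice:

    ## Sums of n squared standard normals vs. the chi-square density with n df
    set.seed(7)
    n <- 5
    y <- replicate(1e5, sum(rnorm(n)^2))
    hist(y, breaks = 100, freq = FALSE, main = "Simulated chi-square, n = 5")
    curve(dchisq(x, df = n), from = 0, to = max(y), add = TRUE, lwd = 2)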


Special Distributions: t Distribution (1)

Now we take the next step towards an estimation of the parameters of the normal distribution.

Def: t Distribution
Consider two independent random variables Y and Z, where Y has a χ² distribution with n degrees of freedom and Z has a standard normal distribution; then

    X = Z / (Y/n)^{1/2}

has a t distribution with n degrees of freedom.
The density of the t distribution is given by

    f(x) = Γ((n+1)/2) / (√(nπ) Γ(n/2)) · (1 + x²/n)^{−(n+1)/2}.


Special Distributions: t Distribution (2)

[Figure: the density of the t distribution for different degrees of freedom, n = 1, 2, 5, 20.]


Special Distributions: t Distribution (3)

For low values of the degrees of freedom, the t distribution has "fat tails".
For high values of the degrees of freedom, the t distribution converges to a standard normal distribution.
In R the density, distribution function, quantile, and random number generator for a t distribution with df degrees of freedom are:
dt(x,df), pt(x,df), qt(x,df), rt(n,df)
Now we are prepared to start estimating parameters of a normal distribution!


Estimation: Confidence Interval (1)

For a random variable X we search for an interval that contains realizations of X at a certain confidence level. I.e., only with some residual probability does X lie outside the confidence interval.

Def: Confidence Interval
Let X be a random variable. The (1 − α) confidence interval of X is an interval centered around the mean such that a realization of X is inside the confidence interval with probability (1 − α).

We denote (1 − α) the confidence level and α the significance level.
E.g., if we want to determine the 95% confidence interval for a variable X with distribution function F, we have to determine the lower bound of the interval such that 2.5% of the probability mass is below that bound, and the upper bound such that 2.5% of the probability mass is above that bound.


Estimation: Confidence Interval (2)

Consider X with a normal distribution with mean µ and variance σ². We know that

    Y = (X − µ)/σ

has a standard normal distribution; then the confidence interval with confidence level (1 − α) is given by

    [ µ + N⁻¹(α/2)σ , µ + N⁻¹(1 − α/2)σ ].

Since the standard normal distribution is symmetric, we have N⁻¹(α) = −N⁻¹(1 − α), thus

    [ µ − c·σ , µ + c·σ ],

where the critical multiplier is c = |N⁻¹(α/2)|.


Estimation: Confidence Interval (3)

[Figure: normal density with the (1 − α) confidence interval [µ − |N⁻¹(α/2)|σ, µ + |N⁻¹(α/2)|σ] and probability mass α/2 in each tail.]


Estimation: Confidence Interval (4)

The critical multiplier c = |N⁻¹(α/2)| specifies the distance between the mean and the limits of the confidence interval in multiples of the standard deviation.

    (1 − α)   α      |N⁻¹(α/2)|
    90%       10%    1.645
    95%       5%     1.960
    99%       1%     2.576


Estimation: Estimating the Mean of a Distribution (1)

Recall the clinical trial example. We have observed a collection of patients and recognized whether they responded successfully or not.
What can we say about the probability that a patient will recover after studying the historical observations?
Or in other words: how can we infer properties of the underlying distribution of the population from observing a finite random sample?
We already saw that the sample mean X̄_n is centered around the population mean µ and that the variance of X̄_n decreases with increasing sample size.
I.e., the sample mean is a suitable estimate of the distribution mean µ.


Estimation: Estimating the Mean of a Distribution (2)

Def: Unbiased Estimator of the Distribution Mean
Let µ̂ be an estimator of the mean µ of a distribution inferred from a random sample X_1, . . . , X_n, defined by

    µ̂ = X̄_n = (1/n) Σ_{i=1}^{n} X_i.

Then µ̂ is an unbiased estimator of the distribution mean, since it is true that

    E(µ̂) = µ.

This result is true independent of the underlying distribution.
Independent of n, the sample mean is the best estimate of the population mean.


Estimation: Estimating the Mean of a Normal Distribution (1)

If the sample is drawn from a normal distribution, we know that µ̂ is also a normally distributed variable with expectation µ.
If we know the variance σ² of the distribution, we are also able to specify a confidence interval.

Th: Confidence Interval for µ̂ when σ is Known
When σ is known, the confidence interval for µ̂ estimated from a sample of size n at a confidence level of (1 − α) is given by

    [ µ − c·σ/√n , µ + c·σ/√n ]

with critical multiplier c = |N⁻¹(α/2)|.


Estimation: Estimating the Mean of a Normal Distribution (2)

Ex: Sample Size
Assume that the height of students is normally distributed with a standard deviation of 16 cm. What is the required sample size if we want to estimate the mean height of students with 95% confidence within an interval of 1 cm length?

Remember: the central limit theorem guarantees that for sufficiently large n, µ̂ is normally distributed.
How large must n be in order to guarantee the normal distribution?


Estimation: Estimating the Parameters of a Distribution (3)

Encouraged by the nice and simple result for µ̂, we ask: what is a natural estimate for the distribution variance?
Since the distribution variance is defined by

    σ² = E([X − µ]²),

we might be tempted to try the following estimator from an observed sample:

    s² = (1/n) Σ_{i=1}^{n} (X_i − X̄_n)².

Unfortunately, this estimator is biased.


Estimation: Estimating the Parameters of a Distribution (4)

To see this, let us compute the expectation of s², which can be written as

    (1/n) Σ (X_i − X̄)² = (1/n) Σ (X_i − µ + µ − X̄)²
        = (1/n) Σ (X_i − µ)² + 2 (1/n) Σ (X_i − µ)(µ − X̄) + (1/n) Σ (µ − X̄)²
        = (1/n) Σ (X_i − µ)² + 2(µ − X̄) (1/n) Σ (X_i − µ) + (µ − X̄)²
        = (1/n) Σ (X_i − µ)² − (µ − X̄)²,

where the last step uses (1/n) Σ (X_i − µ) = X̄ − µ.


Estimation: Estimating the Parameters of a Distribution (5)

Taking expectations and using the fact that Var(X̄_n) = σ²/n yields

    E(s²) = E[ (1/n) Σ (X_i − µ)² ] − E[ (µ − X̄)² ]
          = (1/n) Σ E[(X_i − µ)²] − E[(X̄ − µ)²]
          = (1/n) Σ σ² − σ²/n
          = σ² − σ²/n
          = (n − 1)/n · σ².


Estimation: Estimating the Parameters of a Distribution (6)

Thus, we can simply construct an unbiased estimator for the variance by adapting s².

Def: Unbiased Estimator of the Distribution Variance
Consider a sample X_1, . . . , X_n drawn from a distribution with finite mean and variance σ²; then the sample variance σ̂² is defined by

    σ̂² = n/(n − 1) · s² = 1/(n − 1) Σ_{i=1}^{n} (X_i − X̄_n)².

Then σ̂² is called an unbiased estimator of the distribution variance, since it is true that

    E(σ̂²) = σ².

This result is independent of the underlying distribution.


Estimation: Estimating Mean and Variance of a Normal Dist. (1)

Th: Joint Distribution of the Sample Mean and Sample Variance
Consider a sample X_1, . . . , X_n from a normal distribution with mean µ and variance σ². Then the sample mean µ̂ and the sample variance σ̂² are independent random variables.

    (µ̂ − µ)/(σ/√n) has a standard normal distribution,
    (n − 1) σ̂²/σ² has a χ² distribution with n − 1 degrees of freedom.

Although µ̂ and σ̂² are estimated from the same sample, they are independent!


Estimation: Estimating Mean and Variance of a Normal Dist. (2)

Then, according to the definition of the t distribution and with σ̂ = √(σ̂²),

    Z = [ (µ̂ − µ)/(σ/√n) ] / (σ̂²/σ²)^{1/2} = (µ̂ − µ)/(σ̂/√n)

has a t distribution.
Please note: Z is completely independent of σ and depends only on µ and measures which are computed from the observed sample!
Therefore, Z can be used to assess the quality of the estimate µ̂.


Estimation: Estimating Mean and Variance of a Normal Dist. (3)

Let T_n(x) be the distribution function of the t distribution with n degrees of freedom.

Th: Confidence Interval for µ̂ when σ is Unknown
The confidence interval for µ̂ estimated from a sample of size n at a confidence level of (1 − α) is given by

    [ µ − c·σ̂/√n , µ + c·σ̂/√n ]

with critical multiplier c = |T⁻¹_{n−1}(α/2)|.

I.e., not knowing the variance results in a larger confidence interval compared to the case with known σ, because of the "fat tails" of the t distribution (at least if n is small).


Estimation: Estimating Mean and Variance of a Normal Dist. (4)

If we thus have no idea about µ and σ, then our best estimate for the mean is µ = µ̂ and the best estimate for the variance is σ² = σ̂².
What about the (1 − α) confidence interval for µ = µ̂?
Please note: now the argument runs in the opposite direction! We look for the interval around µ̂ such that µ is contained in the interval with probability (1 − α).

Th: Confidence Interval for the Estimate of the Mean
Estimating µ of a normal distribution from a sample of n observations, the best estimate is µ = µ̂.
The (1 − α) confidence interval for the estimate is

    [ µ̂ − c·σ̂/√n , µ̂ + c·σ̂/√n ]

with critical multiplier c = |T⁻¹_{n−1}(α/2)|.


Estimation: Estimating Mean and Variance of a Normal Dist. (5)

Def: Standard Error of the Estimate of the Mean
The standard deviation of µ̂ estimated from a sample of size n,

    σ̂/√n,

is called the standard error of the estimate µ̂.

Ex: Tensile Strength
A firm produces steel cables. The file TensileStrength.csv contains a sample of 120 test results (approximately normally distributed, strength in kN). What is the estimate of the mean tensile strength of the population and what is the 95% confidence interval of the estimate? A sketch of the computation follows below.
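A sketch of the confidence-interval computation for the tensile-strength sample; the column name "Strength" is an assumption about the layout of TensileStrength.csv, which is not reproduced here.

    ## Estimate of the mean and its 95% confidence interval (sigma unknown)
    strength <- read.csv("TensileStrength.csv")$Strength
    n      <- length(strength)
    mu.hat <- mean(strength)
    se     <- sd(strength) / sqrt(n)           # standard error of the estimate
    c.crit <- qt(0.975, df = n - 1)            # |T^{-1}_{n-1}(alpha/2)| for alpha = 5%
    c(estimate = mu.hat,
      lower    = mu.hat - c.crit * se,
      upper    = mu.hat + c.crit * se)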


Hypothesis Testing: Simple Hypothesis Test (1)

Consider the case where a parameter θ of a distribution is unknown. We know that θ must lie in some parameter space Ω.
The goal is to perform a test in order to determine whether the parameter θ is in some specified subset Ω_0 or in the complement Ω_1.

Def: Null Hypothesis H_0 and Alternative Hypothesis H_1
The null hypothesis claims that θ lies in Ω_0; the alternative hypothesis H_1 claims that θ lies in Ω_1 = Ω\Ω_0.

Def: Simple Hypothesis
The null hypothesis is said to be simple if Ω_0 contains only one distinct value θ_0.


Hypothesis Testing: Simple Hypothesis Test (2)

A hypothesis test will be performed after observing a sample X_1, . . . , X_n.
Let X = (X_1, . . . , X_n) be an n-dimensional vector of outcomes in an n-dimensional sample space S. Then a test specifies a critical region C such that for all X ∈ C the null hypothesis will be rejected!

Def: Critical Region
A hypothesis test specifies a critical region C ⊆ S such that for all X ∈ C the null hypothesis will be rejected.

Def: Test Statistic
Statistical tests usually define a test statistic τ = τ(X) and set the critical region as

    C = {X ∈ S | τ(X) ≥ c}

for some critical value c.


Hypothesis Testing: Simple Hypothesis Test (3)

Let us now test the simple hypothesis

    H_0: µ = µ_0

about the mean of a normal distribution against the alternative hypothesis

    H_1: µ ≠ µ_0.

Now we ask the following question: given that the null hypothesis is true, is it likely to observe X as we have observed it?
If we conclude that under the null hypothesis it is very unlikely to observe what we have observed, then we use the data to reject the null hypothesis and conclude that H_1 must be true!
What does "very unlikely" mean? Let us formalize it!


Hypothesis Testing: Simple Hypothesis Test (4)

Assume H_0 is valid and consider a certain significance level α.
Then α defines what "unlikely" means!
With probability (1 − α) the estimate µ̂ lies in the (1 − α) confidence interval.
I.e., at a significance level of α, an estimate µ̂ inside the (1 − α) confidence interval is interpreted as a "likely" outcome, and H_0 is accepted.
An estimate µ̂ that is outside the (1 − α) confidence interval is observed with probability α, which is interpreted as "unlikely" under H_0; thus, H_0 is rejected in this case.


Hypothesis Testing: Simple Hypothesis Test (5)

The critical region of the test:

[Figure: acceptance region around µ_0 with boundaries µ_0 ± σ̂/√n · |T⁻¹_{n−1}(α/2)|; H_0 is rejected outside these boundaries and accepted between them.]


Hypothesis Testing: Simple Hypothesis Test (6)

Ex: Tensile Strength (cont.)
A firm produces steel cables with a desired average tensile strength of 25,000 N. An average tensile strength below this value does not fully satisfy customers' needs. Higher tensile strength unnecessarily increases production costs. Fluctuations are, however, unavoidable, but there should not be systematic deviations. Therefore, the firm takes a random sample of 120 cables (see TensileStrength.csv) and performs a test.
Do you agree with the hypothesis that the average tensile strength of the cables equals the desired value of 25,000 N at a significance level of 5%?
See TensileStrength.R for an implementation of the test.


Hypothesis Testing: Simple Hypothesis Test (7)

[Figure: illustration of the result, showing the sampling density around µ_0 = 25,000 with the acceptance region for H_0 in the center and rejection regions in both tails (x-axis roughly from 24,970 to 25,030).]


Hypothesis Testing: Simple Hypothesis Test (8)

We reformulate the test such that we have a test statistic τ that determines the acceptance or rejection of H_0.

Def: Z Value
Define the statistic Z as

    Z = (µ̂ − µ_0)/(σ̂/√n).

Then the hypothesis test about the mean of a normal distribution has the following form:

Def: t-Test About the Mean of a Normal Dist. (Unknown Variance)
Reject H_0 if it is true that

    |Z| > c = |T⁻¹_{n−1}(α/2)|.


Hypothesis Testing: Simple Hypothesis Test (9)

Using Z to test H_0:

[Figure: the distribution of Z with rejection regions |Z| > |T⁻¹_{n−1}(α/2)| in both tails and the acceptance region around 0.]


Hypothesis Testing: Simple Hypothesis Test (10)

Ex: Tensile Strength (cont.)
Calculate the Z statistic for the given data and perform the hypothesis test at a significance level of 5% and at a significance level of 1%.

Def: Power Function
Let π(θ|τ) denote the probability that a specified test τ will reject H_0 in the case that the true parameter vector equals θ, i.e.,

    π(θ|τ) = Pr(X ∈ C | θ).


Hypothesis Testing: Simple Hypothesis Test (11)
Note: the critical multiplier c is the respective multiplier given α.
Def: Size of a Significance Test
The size α(τ) of a significance test τ is defined by
α(τ) = sup_{µ ∈ Ω0} π(µ|τ).
The size of a test is also called the significance level of the test.
Certainly, the critical threshold of the t-test is chosen such that using c(α) results in a test with size equal to α.
I.e., α defines the tolerance with respect to rejecting H0 when it is true!


Hypothesis Testing: Simple Hypothesis Test (12)
Power of a simple hypothesis test (n = 10):
[Figure: power π as a function of (µ − µ0)/(σ̂/√n), plotted for α = 0.01 and α = 0.1 over the range −4 to 4.]
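A sketch that should reproduce such a power curve, using the noncentral t distribution; treating σ̂ ≈ σ, the value on the horizontal axis can be used directly as the noncentrality parameter, so this is an approximation, not the exact power of the t-test.

# Approximate power of the two-sided t-test with n = 10 as a function of
# d = (mu - mu_0)/(sigma_hat/sqrt(n)), used here as the noncentrality parameter.
n <- 10
d <- seq(-4, 4, by = 0.05)
power <- function(alpha) {
  c_a <- qt(1 - alpha / 2, df = n - 1)
  1 - pt(c_a, df = n - 1, ncp = d) + pt(-c_a, df = n - 1, ncp = d)
}
plot(d, power(0.1), type = "l", ylim = c(0, 1),
     xlab = "(mu - mu_0)/(sigma_hat/sqrt(n))", ylab = "power")
lines(d, power(0.01), lty = 2)   # alpha = 0.01 (dashed)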


Hypothesis Testing: Simple Hypothesis Test (13)
What would be the perfect test? A test that satisfies
π(θ|τ) = 0 for all θ ∈ Ω0, and
π(θ|τ) = 1 for all θ ∈ Ω1.
Unfortunately, in general such a perfect test does not exist!
Def: Type 1 and Type 2 Errors
The erroneous decision to reject a true null hypothesis is called a type 1 error. The erroneous decision to accept a false null hypothesis is called a type 2 error.


Hypothesis Testing: Simple Hypothesis Test (14)
Consequently, given a test τ: if θ ∈ Ω0, then π(θ|τ) is the probability of making a type 1 error; if θ ∈ Ω1, then 1 − π(θ|τ) is the probability of making a type 2 error.
And, as already mentioned, α(τ) is the maximum probability over all θ ∈ Ω0 of making a type 1 error. Thus, α expresses how tolerant one is with respect to falsely rejecting a true null hypothesis.
Note: The significance level α must be determined exogenously, depending on the consequences of type 1 and type 2 errors, respectively.
Note: Two-Sided Hypothesis Test
The test of a simple hypothesis is also called a two-sided test.


Hypothesis Testing: Simple Hypothesis Test (15)
We see that for large values of α the test tends to reject the null. As α decreases, the result eventually jumps to "do not reject H0".
Therefore, it is natural to ask: what is the critical α such that we are indifferent between accepting and rejecting the null? This leads to the definition of the p-value.
Def: p-Value
The p-value of a sample X is the critical significance level α* = p at which the test is indifferent between rejecting and accepting the null hypothesis, i.e.,
τ(X) = c(p).
The null hypothesis is then rejected whenever α > p; otherwise, it is accepted.


Hypothesis Testing: Simple Hypothesis Test (16)
Th: p-Value for the t-Test
The p-value of a t-test about the mean of a normal distribution is given by
p = 2 [1 − T_{n−1}(|Z|)] = 2 [1 − T_{n−1}(|µ̂ − µ0| / (σ̂/√n))].
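Continuing the by-hand sketch above (which defined Z and n for the tensile-strength data), the p-value follows directly from the t distribution function:

# p-value of the two-sided t-test: p = 2 * (1 - T_{n-1}(|Z|))
p <- 2 * (1 - pt(abs(Z), df = n - 1))
p   # reject H0 at level alpha whenever alpha > p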


Hypothesis Testing: One-Sided Hypothesis Test (1)
If the null hypothesis has the form
H0: µ ≥ (≤) µ0,
this is no longer a simple hypothesis. The corresponding alternative hypothesis is then
Ha: µ < (>) µ0.
Def: One-Sided t-Test About the Mean of a Normal Dist. (σ Unknown)
The one-sided t-test about the mean of a normal distribution rejects the null hypothesis if
◮ µ̂ ∉ Ω0, and
◮ |Z| = |µ̂ − µ0| / (σ̂/√n) > c = |T⁻¹_{n−1}(α)|.


Hypothesis Testing: One-Sided Hypothesis Test (2)
Illustration of the one-sided t-test of H0: µ ≥ µ0:
[Figure: the Z axis with the single critical value T⁻¹_{n−1}(α); H0 is rejected to the left of it and accepted to the right.]


Hypothesis Testing: One-Sided Hypothesis Test (3)
Power of the one-sided hypothesis test (n = 10):
[Figure: power π as a function of (µ − µ0)/(σ̂/√n), plotted for α = 0.01 and α = 0.1 over the range −4 to 4.]


Hypothesis Testing: One-Sided Hypothesis Test (4)
Ex: Lamp
The producer of projector lamps claims that his lamps have an average durability of at least 1000 hours. A test of 100 lamps performed by a competitor reveals that the average durability in the sample is only 958 hours. Is this statistically significant evidence that the claim is not true?
Use the data in ProjectorLamps.csv, which contains the lifetimes [h] of 100 observations.
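A one-sided version of the test can again be sketched with t.test; the column name "Lifetime" is an assumption about the file:

# Sketch of the one-sided test H0: mu >= 1000 against H1: mu < 1000.
lamps <- read.csv("ProjectorLamps.csv")   # column name "Lifetime" is an assumption
t.test(lamps$Lifetime, mu = 1000, alternative = "less")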


Linear Statistical Models: The Linear Regression Model (1)
Linear models are used to study the relationship between a dependent variable and one or more independent variables. More precisely, we want to study models of a linear relationship of the form
y = x1 β1 + x2 β2 + · · · + xk βk + ε,
where y is the dependent variable and the variables x1, . . . , xk are called the independent or explanatory variables.


Linear Statistical Models: The Linear Regression Model (2)
A1: Linear Relationship
Suppose the following functional relationship between the dependent variable y and the vector of k independent variables X:
y = X′β + ε,
where
β is a (fixed) vector of coefficients,
ε is a normally distributed disturbance.
A2: Expectation of Disturbances
The conditional expectation of the disturbance ε is zero,
E(ε|X) = 0.


Linear Statistical Models: The Linear Regression Model (3)
A3: Spherical Disturbances
Taking a sample of disturbances ε1, . . . , εn, we have
Var(εi) = σ², and
Cov(εi, εj) = 0, i ≠ j.
Constant variance is called homoscedasticity; uncorrelatedness across observations is called nonautocorrelation.
Assumption A3 can also be written as
E(εε′|X) = σ²I,
with ε the vector of sampled disturbances, X the (n × k) matrix of corresponding explanatory variables, and I the (n × n) identity matrix.


Linear Statistical Models: The Linear Regression Model (4)
A4: Explanatory Variables
The explanatory variables X are either fixed or random, but in any case they are independent of ε.
These are the general assumptions which underlie a linear regression model. Some of these assumptions may be relaxed, e.g. the assumption of nonautocorrelation of disturbances, the assumption of homoscedasticity, or the assumption of normality, but this requires special treatment.


Linear Statistical Models: OLS, Ordinary Least Squares (1)
From the assumptions of the linear model it follows that
E(y|X) = X′β.
Goal: estimate the linear relationship β from observing a sample of corresponding values of y and X.
Let b denote our estimate of the coefficients β. Then our estimate of E(y|X), called ŷ, is given by
ŷ = X′b.
Let e denote our estimate of the disturbances ε (also called the regression residuals). These are consequently given by
e = y − ŷ = y − X′b.


Linear Statistical Models: OLS, Ordinary Least Squares (2)
From the definitions it follows that
y = X′β + ε = X′b + e.
Now take a sample of n observations, i.e., we have a vector y of observations of the dependent variable and an (n × k) matrix X of observations of the explanatories, where
◮ each row of X corresponds to one observation of all k explanatories,
◮ each column of X contains a sample of n observations of the respective explanatory variable.
We refer to a specific row of X by X_i. and to a column of X by X_.j.


Linear Statistical Models: OLS, Ordinary Least Squares (3)
For a given estimate b0 of the vector of regression coefficients, the sum of squared residuals is given by
S(b0) = Σ_{i=1..n} e_{i0}² = Σ_{i=1..n} (y_i − X_i. b0)²,
or, in more compact matrix form,
S(b0) = e′e = (y − Xb0)′(y − Xb0) = y′y − 2b0′X′y + b0′X′Xb0.
We minimize S(b0) with respect to the coefficient vector b0 by deriving the gradient (as a column vector)
∇S = ∂S(b0)/∂b0 = −2X′y + 2X′Xb0.


Linear Statistical Models: OLS, Ordinary Least Squares (4)
Setting ∇S(b0) = 0 leads to the optimal coefficient vector b, which satisfies
X′Xb = X′y.
We can solve this equation for a unique b if X′X has an inverse. We know that (X′X)⁻¹ exists if X has full rank k.
Def: OLS Regression Coefficients
If the matrix of explanatories has full rank k, then the vector of OLS regression coefficients is uniquely defined by
b = (X′X)⁻¹X′y.
The second derivative of S with respect to b0,
∂²S(b0)/(∂b0 ∂b0′) = 2X′X,
is positive definite if X has full rank, i.e., b is a unique minimum.
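As a minimal sketch (on simulated data, so the snippet is self-contained), the normal equations can be solved directly in R; solve(crossprod(X), ...) avoids forming the inverse explicitly:

# OLS coefficients from the normal equations X'X b = X'y (simulated example).
set.seed(1)
n <- 50
X <- cbind(1, runif(n, 80, 120))            # intercept column plus one explanatory
beta_true <- c(25, 0.75)
y <- X %*% beta_true + rnorm(n, sd = 7)
b <- solve(crossprod(X), crossprod(X, y))   # equals (X'X)^{-1} X'y
b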


Linear Statistical Models: OLS, Ordinary Least Squares (5)
Th: The OLS Estimator b is an Unbiased Estimate of β
The ordinary least squares estimate b is unbiased in every sample, i.e., E(b) = β.
Write b as
b = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε.
The conditional expectation with respect to X is
E(b|X) = β + E((X′X)⁻¹X′ε|X) = β + (X′X)⁻¹X′E(ε|X) = β.
Therefore,
E(b) = E_X(E(b|X)) = E_X(β) = β.


Linear Statistical Models: OLS, Ordinary Least Squares (6)
Th: Variance / Covariance of Estimated Coefficients
The variance / covariance matrix of the estimated coefficient vector b, conditional on the observed sample of explanatories, is
Var(b|X) = σ²(X′X)⁻¹.
To see this, write
Var(b|X) = E((b − β)(b − β)′|X)
         = E((X′X)⁻¹X′εε′X(X′X)⁻¹|X)
         = (X′X)⁻¹X′E(εε′|X)X(X′X)⁻¹
         = (X′X)⁻¹X′(σ²I)X(X′X)⁻¹
         = σ²(X′X)⁻¹.


Linear Statistical Models: OLS, Ordinary Least Squares (7)
Th: Distribution of b
Conditional on the observed explanatories X, the estimated coefficient vector b has a joint multivariate normal distribution:
b|X ∼ N(β, σ²(X′X)⁻¹).


Linear Statistical Models: OLS, Ordinary Least Squares (8)
Def: OLS Estimate of the Regression Variance
The OLS estimate of the regression variance is defined by
σ̂² = e′e / (n − k),
which is an unbiased estimate of σ², i.e.,
E(σ̂²|X) = E(e′e/(n − k)|X) = σ²,
and
E(σ̂²) = E_X(E(σ̂²|X)) = E_X(σ²) = σ².


Linear Statistical Models: OLS, Ordinary Least Squares (9)
Th: Estimated Variance of the OLS Regressor
The estimated variance / covariance matrix of the regression coefficients b is
V̂ar(b|X) = σ̂²(X′X)⁻¹.
Def: Standard Error of a Coefficient
The standard error of the coefficient b_j is given by
se_j = {[σ̂²(X′X)⁻¹]_jj}^(1/2).


Linear Statistical Models: OLS, Ordinary Least Squares (10)
Ex: The Impact of Promotion
DD-Drugstores is a chain of drugstores that operates around the US. To see how effective its advertising and other promotional activities are, the firm has collected data from 50 randomly selected regions. In each of these regions the firm divides its promotional expenses and its sales by the respective figures of its leading competitor and computes two index values, 'Promote' and 'Sales'.
How effective is promotion? Assume that in one region DD-Drugstores uses 110% of promotional expenses compared to its leading competitor. What is the predicted 'Sales' figure?
See DD-Drugstores.csv for the data collected by the firm.


Linear Statistical Models: OLS, Ordinary Least Squares (11)
The data:
[Figure "Effectiveness of Promotion": scatter plot of Sales against Promote, both ranging roughly from 80 to 120.]


Linear Statistical Models: OLS, Ordinary Least Squares (12)
Some descriptive statistics:
sample mean Sales . . . 99.74
σ̂² Sales . . . 97.91
σ̂ Sales . . . 9.90
sample mean Promote . . . 97.88
Cov(Promote, Sales) . . . 58.17
ρ(Promote, Sales) . . . 0.67
First, define the variables:
◮ the dependent variable y is the vector of observations of 'Sales',
◮ the matrix of independent variables X is built by merging a column of constant 1s and the vector of observations of 'Promote' into a (50 × 2) matrix.
The OLS estimate of the regression coefficients is given by
b = (X′X)⁻¹X′y.


Linear Statistical Models: OLS, Ordinary Least Squares (13)
This results in
> b
             Sales
Const   25.1264201
Promote  0.7622965
The estimated regression variance is
σ̂² = e′e / (n − k) = 54.68, and σ̂ = 7.39.
The estimated variance / covariance matrix of the regression coefficients is
V̂ar(b|X) = σ̂²(X′X)⁻¹ = ( 141.195831  −1.43136687
                          −1.431367    0.01462369 ).
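A sketch that should reproduce these figures from the raw data (column names "Sales" and "Promote" as used in the slides):

# Reproduce b, sigma_hat^2 and Var(b|X) for the DD-Drugstores data.
dat    <- read.csv("DD-Drugstores.csv")
y      <- dat$Sales
X      <- cbind(Const = 1, Promote = dat$Promote)
b      <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
e      <- y - X %*% b
n      <- nrow(X); k <- ncol(X)
sigma2 <- drop(crossprod(e)) / (n - k)           # e'e / (n - k)
vcov_b <- sigma2 * solve(crossprod(X))           # sigma_hat^2 (X'X)^{-1}
se     <- sqrt(diag(vcov_b))
list(b = b, sigma2 = sigma2, vcov_b = vcov_b)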


Linear Statistical Models: OLS, Ordinary Least Squares (14)
Th: Independence of b and σ̂²
If ε is normally distributed, the least squares coefficient estimator b is statistically independent of the residual vector e and therefore of all functions of e, including σ̂².
Therefore, the ratio
t_j = (b_j − β_j) / se_j
has a t-distribution with (n − k) degrees of freedom.
When testing the null hypothesis that a coefficient β_j equals zero at a significance level of α, the null is rejected if
|t_j| = |(b_j − 0) / se_j| ≥ |T⁻¹_{n−k}(α/2)|.


Linear Statistical Models: OLS, Ordinary Least Squares (15)
The t statistics of the coefficients b are then
t1 = 25.13 / 11.88 = 2.11,  t2 = 0.76 / 0.12 = 6.30.
For n − k degrees of freedom, the critical multipliers for different levels of α are
α      |T⁻¹_{n−k}(α/2)|
10%    1.68
5%     2.01
1%     2.68
Are the regression coefficients significantly different from zero?
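Continuing the sketch above (which defined b, se, n and k for the DD-Drugstores data), the t statistics and critical multipliers can be checked directly:

# t statistics of the coefficients and the critical multipliers |T^{-1}_{n-k}(alpha/2)|.
t_stats <- drop(b) / se
t_stats
qt(1 - c(0.10, 0.05, 0.01) / 2, df = n - k)   # alpha = 10%, 5%, 1%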


Linear Statistical Models: OLS, Ordinary Least Squares (16)
R has a built-in regression tool:
> lm(dat$Sales ~ dat$Promote)
Call:
lm(formula = dat$Sales ~ dat$Promote)
Coefficients:
(Intercept)  dat$Promote
    25.1264       0.7623
More detailed information is provided by
> summary(lm(dat$Sales ~ dat$Promote))


Linear Statistical Models: OLS, Ordinary Least Squares (17)
The regression line: least squares relative to a straight line.
[Figure "Effectiveness of Promotion": the Sales/Promote scatter plot with the fitted regression line.]


Linear Statistical Models: OLS, Ordinary Least Squares (18)
The unconditional mean: least squares relative to a constant.
[Figure "Effectiveness of Promotion": the same scatter plot with the constant fit, i.e., the sample mean of Sales.]


Linear Statistical Models: OLS, Ordinary Least Squares (19)
Def: Projector and Residual Maker
We define the projector matrix P as
P = X(X′X)⁻¹X′,
and the residual maker matrix as
M = I − X(X′X)⁻¹X′.
Then we have
P y = X(X′X)⁻¹X′y = X [(X′X)⁻¹X′y] = Xb = ŷ,
and
M y = (I − P)y = y − ŷ = e.


Linear Statistical Models: OLS, Ordinary Least Squares (20)
Both matrices P and M are symmetric and idempotent, i.e.,
P′ = P, M′ = M, and P² = P, M² = M.
P and M are orthogonal, i.e.,
P M = M P = 0.
And it follows from the definition that
P X = X, M X = 0, P e = 0, M e = e.
This means that all columns of X are eigenvectors of P with eigenvalue equal to 1, and the residual vector e is also an eigenvector of P, with eigenvalue equal to 0.
For symmetric, idempotent matrices all eigenvalues are either equal to 0 or to 1.
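These identities are easy to verify numerically; a small self-contained sketch:

# Numerical check of the projector / residual-maker identities on a tiny design matrix.
set.seed(2)
X <- cbind(1, rnorm(6))
P <- X %*% solve(crossprod(X)) %*% t(X)
M <- diag(nrow(X)) - P
all.equal(P %*% P, P)        # idempotent
all.equal(M %*% M, M)
max(abs(P %*% M))            # P M = 0 up to rounding error
all.equal(P %*% X, X)        # P X = X
max(abs(M %*% X))            # M X = 0 up to rounding error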


Linear Statistical Models: OLS, Ordinary Least Squares (21)
Since the rank of P is equal to k (why? hint: (X′X)⁻¹ is (k × k)), the k columns of X are the only eigenvectors of P that have nonzero eigenvalues.
Furthermore, we know that the eigensystem of symmetric matrices is orthogonal; thus, the vector of residuals e is orthogonal to the explanatory variables!
Th: Regression is an Orthogonal Projection
The linear regression separates the vector y of dependent variables into
◮ a component ŷ, which is a linear combination of the explanatory variables,
◮ a component e, which is orthogonal to the explanatory variables.
y = ŷ + e = P y + M y = projection + orthogonal residual.


Linear Statistical Models: OLS, Ordinary Least Squares (22)
We can see the Pythagorean theorem at work if we consider the sum of squares
y′y = (P y + M y)′(P y + M y) = y′P′P y + y′M′M y = ŷ′ŷ + e′e.
Consider the special set of explanatory variables that consists only of a vector with constant components equal to 1. We call this set X0, and define P0 and M0 as the respective projector and residual maker.
What can we say about P0 and M0? For the DD-Drugstores example, P0 is a (50 × 50) matrix with all elements equal to 1/50.


Linear Statistical Models: OLS, Ordinary Least Squares (23)
M0 then transforms the vector y into a vector that contains y_i − ȳ as its components. We call e0 = M0 y the residuals with respect to the constant sample mean.
Def: SST, the Total Variation of y
The total variation of the dependent variable y is defined as the sum of squared deviations of the components of y from the sample mean ȳ:
SST = Σ_{i=1..n} (y_i − ȳ)² = (M0 y)′(M0 y) = y′M0′M0 y = y′M0 y = e0′e0.
The question is: how much of the variation in e0 can be explained by the explanatories X?


Linear Statistical Models: OLS, Ordinary Least Squares (24)
Def: SSR, the Regression Sum of Squares
The regression sum of squares SSR of a regression of y on X is defined as the total variation of ŷ (note that the sample mean of ŷ equals ȳ):
SSR = Σ_{i=1..n} (ŷ_i − ȳ)² = (M0 Xb)′(M0 Xb) = b′X′M0′M0 Xb = b′X′M0 Xb.
Def: SSE, the Sum of Squared Errors
The sum of squared errors SSE of a regression of y on X is defined as the total variation of e:
SSE = Σ_{i=1..n} (e_i − ē)² = Σ_{i=1..n} e_i² = e′e.


Linear Statistical Models: OLS, Ordinary Least Squares (25)
We use the derived separation of y into ŷ and e to decompose the residuals e0:
e0 = M0 y = M0(ŷ + e) = M0 Xb + M0 e.
Since e has mean equal to zero, we have M0 e = e; thus
SST = y′M0 y = b′X′M0 Xb + e′e = SSR + SSE.
In the case of the DD-Drugstores example we have
SST = 4797.62 = SSR + SSE = 2172.88 + 2624.74.
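A short sketch of the decomposition for the DD-Drugstores data (self-contained, using lm for the fit):

# Variance decomposition SST = SSR + SSE for the DD-Drugstores regression.
dat <- read.csv("DD-Drugstores.csv")
fit <- lm(Sales ~ Promote, data = dat)
SST <- sum((dat$Sales - mean(dat$Sales))^2)
SSE <- sum(resid(fit)^2)
SSR <- SST - SSE
c(SST = SST, SSR = SSR, SSE = SSE)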


Linear Statistical Models: OLS, Ordinary Least Squares (26)
What is the fraction of variance that can be explained by the regression analysis?
Def: R², the Coefficient of Determination
The coefficient of determination R² is defined by
R² = SSR/SST = 1 − SSE/SST.
It expresses the fraction of the total variance of y which can be explained by the regression analysis.
In the DD-Drugstores example we have
R² = 2172.88 / 4797.62 = 0.4529.
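Continuing the decomposition sketch above, R² follows immediately and should match the Multiple R-squared value reported by summary(lm(...)):

SSR / SST                  # coefficient of determination R^2
summary(fit)$r.squared     # reported by R's summary.lm, should agree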


Linear Statistical Models: Predictions (1)
Consider the case where we have observed a sample X and y of independent and dependent variables and, for a linear regression, estimated the coefficients b and the variance of the disturbance σ̂².
We now want to use these estimates to predict the (so far unknown) value y0 for a set of explanatory variables x0 (already observed). Our best prediction is then
ŷ0 = (x0)′b.
The prediction error is
e0 = y0 − ŷ0 = (x0)′(β − b) + ε0.


Linear Statistical Models: Predictions (2)
The precision of the prediction is inversely related to the variance of e0:
Var(e0|X, x0) = Var[(x0)′(β − b)|X, x0] + Var(ε0|X, x0)
             = (x0)′ Var[(β − b)|X, x0] x0 + σ²
             = σ² (x0)′(X′X)⁻¹x0 + σ².
Thus the estimated variance of the prediction error is
V̂ar(e0|X, x0) = σ̂² (1 + (x0)′(X′X)⁻¹x0).
The additional term comes from possible errors in estimating the coefficient vector b. It is a quadratic form in the positive definite matrix (X′X)⁻¹, i.e., it is convex and has its minimum at mean(X).


Linear Statistical Models: Predictions (3)
Predicting y when accounting for estimation errors:
[Figure "Effectiveness of Promotion": fitted regression line with a 95% confidence interval for the prediction, over Promote from 0 to 150; mean(Sales) is marked.]


Linear Statistical Models: Predictions (4)
Now back to the DD-Drugstores example: what is the predicted value of Sales when Promote is at 110?
ŷ0 = (1, 110) (25.13, 0.76)′ = 108.98 (using the unrounded coefficients),
and the associated variance V̂ar(e0) of this prediction is
V̂ar(e0) = 54.68 · [1 + (1, 110) ( 2.58212  −0.02618
                                  −0.02617   0.00027 ) (1, 110)′]
        = 54.68 · (1 + 0.0593) = 57.924.
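The prediction and its estimated error variance can be checked with a few lines of R; a self-contained sketch (column names as before):

# Predicted Sales at Promote = 110 and the estimated variance of the prediction error.
dat    <- read.csv("DD-Drugstores.csv")
y      <- dat$Sales
X      <- cbind(1, dat$Promote)
b      <- solve(crossprod(X), crossprod(X, y))
e      <- y - X %*% b
sigma2 <- drop(crossprod(e)) / (nrow(X) - ncol(X))
x0     <- c(1, 110)
y0_hat <- drop(crossprod(x0, b))                                   # (x0)' b
var_e0 <- sigma2 * (1 + drop(t(x0) %*% solve(crossprod(X)) %*% x0))
c(prediction = y0_hat, prediction_error_variance = var_e0)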


Linear Statistical Models: Diagnostics (1)
What is the overall fit of the regression model?
Similar to R², the F statistic compares the reduction in variance that can be achieved by means of the regression with the unconditional model (i.e., with the variation with respect to the unconditional mean).
Consider the null hypothesis that all slope coefficients are equal to zero; this corresponds to the hypothesis that the true model has the form
Y = µ + ε, Var(ε) = σ².
By construction, SSR and SSE are independent. Under this null, SSR/σ² is χ² distributed with k − 1 degrees of freedom and SSE/σ² is χ² distributed with n − k degrees of freedom; thus,
f = [SSR/(k − 1)] / [SSE/(n − k)] = [(SST − SSE)/(k − 1)] / [SSE/(n − k)]
has an F distribution with (k − 1) and (n − k) degrees of freedom.


Linear Statistical Models: Diagnostics (2)
If now the model is able to reduce the unexplained variation, i.e., SSE < SST, then f becomes large. The null hypothesis of no explanatory power is therefore rejected at the significance level α whenever f exceeds the critical value F⁻¹_{k−1,n−k}(1 − α).
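Continuing the decomposition sketch (SSR, SSE and fit from above), the F statistic, its critical value, and its p-value can be computed directly:

# Overall F test of the DD-Drugstores regression (k = 2, including the intercept).
k <- 2
n <- nrow(dat)
f <- (SSR / (k - 1)) / (SSE / (n - k))
f                                        # compare with the F-statistic in summary(fit)
qf(1 - 0.05, df1 = k - 1, df2 = n - k)   # critical value at alpha = 5%
1 - pf(f, df1 = k - 1, df2 = n - k)      # p-value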


Linear Statistical Models: Diagnostics (3)
The density of the F distribution with 1 and 48 degrees of freedom:
[Figure: density of F(1, 48) over x from 0 to 5, with the critical value F⁻¹_{1,48}(1 − α) marked.]


Linear Statistical Models: Diagnostics (4)
Once again, use the lm function of R:
> summary(lm(dat$Sales ~ dat$Promote))

Call:
lm(formula = dat$Sales ~ dat$Promote)

Residuals:
     Min       1Q   Median       3Q      Max
-17.3069  -4.9544  -0.4544   5.1214  17.7423

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.1264    11.8826   2.115   0.0397 *
dat$Promote   0.7623     0.1209   6.304  8.6e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.395 on 48 degrees of freedom
Multiple R-squared: 0.4529,  Adjusted R-squared: 0.4415
F-statistic: 39.74 on 1 and 48 DF,  p-value: 8.601e-08


Linear Statistical Models: Final Example
The file OECD.csv contains data on life expectancy [a], GDP per capita [USD, current prices and PPPs], health expenditures [% of GDP], and the fraction of self-employment [% of employees] for selected OECD member countries.
Ex: Determinants of Life Expectancy
Use the data in OECD.csv and analyze the impact of the given variables on life expectancy. Do a stepwise regression and interpret the outcome.
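A possible starting point is sketched below; the column names are assumptions and have to be adapted to the actual headers in OECD.csv:

# Sketch: backward stepwise regression on the OECD data (hypothetical column names).
oecd <- read.csv("OECD.csv")
full <- lm(LifeExpectancy ~ GDPperCapita + HealthExpenditure + SelfEmployment,
           data = oecd)
summary(full)
step(full, direction = "backward")   # drops regressors that do not improve the AIC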
