3. Basic probability concepts

proportion of that outcome that we observe in a sample of observations. The long-run relative frequency is the proportion that we observe as the sample becomes very large. In most circumstances, as the sample size increases toward infinity, the relative frequency of the outcome in the sample will converge on its true value in the population. After estimating the long-run relative frequency, we then turn the situation around and interpret the frequency as the probability of observing the outcome of interest by sampling a single observation from the population.

Say that we want to estimate the allele frequencies for the MN blood group in a particular human population. The three possible genotypes are MM, MN, and NN. We could sample a number of individuals (the more, the better for this purpose, as long as the sample is representative of the population) and determine the genotype of each. From these we could in turn estimate the frequencies of the alleles M and N. For example, the following table summarizes the classic data of Race and Sanger (1962) on MN genotypes from a British population.

Phenotype                      M       MN      N
Genotype                       MM      MN      NN      Total
Number (absolute frequency)    363     634     282     1279
Relative frequency             0.284   0.496   0.220   1.000

Based on these data, the relative frequencies of the M and N alleles can be estimated as the homozygote frequencies plus half of the heterozygote frequency:

$$p_M = f_{MM} + \tfrac{1}{2} f_{MN} = 0.284 + \tfrac{1}{2}(0.496) = 0.532$$

$$p_N = f_{NN} + \tfrac{1}{2} f_{MN} = 0.220 + \tfrac{1}{2}(0.496) = 0.468$$

where $p$ indicates the relative frequency of an allele and $f$ the relative frequency of a genotype (its count divided by the sample total of 1279). The sample sizes are rather large, so we can assume (for now) that these estimates are reasonably precise. We can now interpret the relative frequencies as probabilities. The value $p_M = 0.532$ indicates that our best estimate of the probability of drawing a single M allele from the gene pool is 53.2%. In other words, for every 1000 alleles drawn from the gene pool, 532 will, on average, be M alleles.
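The allele-frequency arithmetic can be checked directly from the genotype counts quoted above. The following is a minimal sketch; the function and variable names are illustrative and do not come from the text.

```python
# Genotype counts quoted in the text (Race and Sanger 1962).
genotype_counts = {"MM": 363, "MN": 634, "NN": 282}


def allele_frequencies(counts):
    """Return (p_M, p_N) estimated from MM, MN and NN genotype counts."""
    total = sum(counts.values())
    # Relative genotype frequencies: count divided by sample size.
    f = {genotype: n / total for genotype, n in counts.items()}
    # Each homozygote carries two copies of its allele and each heterozygote
    # carries one, so an allele's frequency is its homozygote frequency plus
    # half of the heterozygote frequency.
    p_m = f["MM"] + 0.5 * f["MN"]
    p_n = f["NN"] + 0.5 * f["MN"]
    return p_m, p_n


p_m, p_n = allele_frequencies(genotype_counts)
print(f"p_M = {p_m:.3f}, p_N = {p_n:.3f}")  # prints: p_M = 0.532, p_N = 0.468
```

Because every allele in the sample is either M or N, the two estimates necessarily sum to 1.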
Derivation of probabilities from relative frequencies is logically circular, of course, but still valuable. The strong assumption of fairness also applies to empirical probabilities: all entities are assumed to be equally likely to occur. Thus empirical probabilities are based on logical probabilities assigned to individual observations. For example, if we use a mark-recapture method to estimate the number of fish living within a lake, the estimate will most certainly be based on the strong assumption that every fish in the lake stands an equal chance of being caught. A sample in which every individual has an equal probability of selection is a random sample. We will see that this “assumption of randomness” is basic to many statistical procedures. If the sample is not truly random in this sense, then the long-run relative frequency of the outcome in the sample will not converge on its true value in the population. The relative frequency (and thus estimated probability) of the outcome will then be biased, but we will be unaware of the nature and extent of the bias. Minimizing bias is one of the main objectives of sampling theory.

The classical and empirical definitions of probability are basically identical, but because the total number of potential observations can be large, the terminology is modified slightly: The empirical probability of an outcome is the observed frequency of that outcome, relative to the total number of observations.

$$\Pr(\text{outcome}) = \frac{\text{observed frequency}}{\text{total frequency}}, \qquad \Pr(A) = \frac{n(A)}{n}$$

The variable $n$ in this case represents the number of observations rather than the number of distinct outcomes. For obvious reasons this is known as the frequency (or frequentist) definition of probability. As a definition, it is equally valid no matter what the observed and total frequencies. However, as an estimate of some “true” population frequency, it is more precise for larger than for smaller total frequencies.

For example, say that we are trying to estimate the sex ratio of a (biological) population of field mice living in a particular large field, which we define to be our statistical population. The sex ratio (expressed in this case as the ratio of males to total number of individuals) can be viewed as the probability of randomly drawing a male from the population, assuming that all individuals are equally likely to be caught. (Is this a reasonable assumption?) We might estimate this probability by sampling 10 individuals from the population and discovering that 6 are males, giving Pr(male) = 6/10 = 0.6. Or we might sample 1000 individuals and discover that 513 are males, giving Pr(male) = 513/1000 = 0.513. It is reasonable to conclude that the second estimate is somehow better than the first, because it is based on a larger sample. We will show later (when discussing confidence intervals) in what sense the second estimate is better than the first.
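One informal way to see why the larger sample gives a more precise estimate is to simulate repeated random samples of each size and compare how much the estimates vary. The sketch below assumes, purely for illustration, a hypothetical true proportion of males of 0.5 and truly random sampling; the function names, the number of replicates, and the seed are arbitrary choices rather than anything specified in the text.

```python
import random
import statistics


def estimate_pr_male(n, true_pr_male, rng):
    """Draw n individuals at random and return the observed proportion of males."""
    males = sum(rng.random() < true_pr_male for _ in range(n))
    return males / n


def spread_of_estimates(n, true_pr_male=0.5, replicates=500, seed=1):
    """Standard deviation of the estimate over repeated random samples of size n.

    true_pr_male = 0.5 is an assumed value chosen for illustration only.
    """
    rng = random.Random(seed)
    estimates = [estimate_pr_male(n, true_pr_male, rng) for _ in range(replicates)]
    return statistics.stdev(estimates)


# The two empirical probabilities from the field-mouse example.
print("n = 10:  ", 6 / 10)
print("n = 1000:", 513 / 1000)

# Variability of the estimate at each of the two sample sizes.
spread_small = spread_of_estimates(10)
spread_large = spread_of_estimates(1000)
print(f"spread of estimates, n = 10:   {spread_small:.4f}")
print(f"spread of estimates, n = 1000: {spread_large:.4f}")
```

Under these assumptions the estimates from samples of 1000 cluster much more tightly around the true value than those from samples of 10, which is the sense of "more precise" used here.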
One potential use of empirical probabilities is to test null hypotheses that specify explicit values for population parameters. For example, our null hypothesis might be that the population of field mice consists of half females and half males – i.e., that the probabilities of randomly drawing a female or drawing a male from the population are both ½. We might posit these probabilities simply because there are two classes of individuals, which are assumed to be equally likely (the classical frequencies). Or, if we are more informed, we might posit the probabilities because Fisher’s (1930) sex-ratio model predicts for most populations a stable equilibrium with equal numbers of females and males. In either case, the null-hypothesis values of ½ and ½ can be tested by randomly sampling individuals from the population and determining the relative frequencies, which are then used as estimates of probabilities to compare against the null-hypothesis values. This raises the issue of just how different the observed frequencies have to be from the postulated values to reject the null hypothesis. We will return to this example when discussing statistical tests of hypotheses.
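As a preview of the kind of comparison involved, the sketch below computes, under the null hypothesis that males and females are equally likely, the probability of a sample at least as far from an even split as the one observed. This is an assumed illustration based on the binomial distribution, not necessarily the test developed later in the text, and the function name is hypothetical.

```python
from math import comb


def prob_at_least_as_extreme(males, n):
    """P(|X - n/2| >= |males - n/2|) for X ~ Binomial(n, 1/2).

    Assumes a random sample of n individuals, each male with probability 1/2
    under the null hypothesis.
    """
    # Work with twice the deviation so that everything stays an integer.
    observed_dev = abs(2 * males - n)
    # Count the equally likely sex sequences whose male count deviates from
    # n/2 by at least as much as the observed count does.
    favourable = sum(
        comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= observed_dev
    )
    return favourable / 2 ** n


# The two field-mouse samples used earlier: 6 of 10 and 513 of 1000 males.
print(f"6 of 10 males:     {prob_at_least_as_extreme(6, 10):.3f}")
print(f"513 of 1000 males: {prob_at_least_as_extreme(513, 1000):.3f}")
```

A large probability means that the observed frequencies are unremarkable if the null-hypothesis values are true; how small the probability must be before the null hypothesis is rejected is the question the text defers to the discussion of statistical tests.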
Page 1 and 2: From: R.E. Strauss, Statistical Hyp
