3. Basic probability concepts
proportion of that outcome that we observe in a sample of observations. The long-run relative<br />
frequency is the proportion that we observe as the sample becomes very large. In most<br />
circumstances, as the sample size increases toward infinity, the relative frequency of the outcome<br />
in the sample will converge on its true value in the population. After estimating the long-run<br />
relative frequency, we then turn the situation around and interpret the frequency as the<br />
probability of observing the outcome of interest by sampling a single observation from the<br />
population.<br />
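The convergence described above can be illustrated with a small simulation. This is only a sketch: the population value `TRUE_P = 0.3`, the function name `relative_frequency`, and the fixed seed are assumptions chosen for illustration, not values from the text.

```python
import random


def relative_frequency(true_p, n, seed=0):
    """Return the proportion of n simulated observations showing the outcome."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    # Each observation shows the outcome with probability true_p.
    hits = sum(1 for _ in range(n) if rng.random() < true_p)
    return hits / n


TRUE_P = 0.3  # hypothetical population value, assumed for illustration
for n in (10, 100, 1_000, 10_000, 100_000):
    # The printed proportions should settle near TRUE_P as n grows.
    print(n, round(relative_frequency(TRUE_P, n), 4))
```

Small samples can give proportions far from the assumed population value, while the largest samples should lie close to it.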
Say that we want to estimate the allele frequencies for the MN blood group in<br />
a particular human population. The three possible genotypes are MM, MN, and NN. We could<br />
sample a number of individuals (the more, the better for this purpose, as long as the sample is<br />
representative of the population) and determine the genotype of each. From these we could in<br />
turn estimate the frequencies of the alleles M and N. For example, the following table<br />
summarizes the classic data of Race and Sanger (1962) on MN genotypes from a British<br />
population.<br />
| Phenotype | M | MN | N | Total |
|---|---|---|---|---|
| Genotype | MM | MN | NN | |
| Number (absolute frequency) | 363 | 634 | 282 | 1279 |
| Relative frequency | 0.284 | 0.496 | 0.220 | 1.000 |
Based on these data, the relative frequencies of the M and N alleles can be estimated as the<br />
homozygote frequencies plus half of the heterozygote frequency:<br />
\[
p_M = f_{MM} + \tfrac{1}{2} f_{MN} = 0.284 + \tfrac{1}{2}(0.496) = 0.532
\]
\[
p_N = f_{NN} + \tfrac{1}{2} f_{MN} = 0.220 + \tfrac{1}{2}(0.496) = 0.468
\]
where $p$ indicates the relative frequency of an allele and $f$ the relative frequency of a<br />
genotype, taken from the table. The sample size is rather large, so we can assume (for now) that<br />
these estimates are reasonably precise. We can now interpret the relative frequencies as<br />
probabilities. The $p_M$ value of 0.532 indicates that our best estimate of the probability of<br />
drawing a single M allele from the gene pool is 53.2%. In<br />
other words, for every 1000 alleles drawn from the gene pool, 532 will, on average, be M alleles.<br />
Derivation of probabilities from relative frequencies is logically circular, of course, but still<br />
valuable.<br />
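As a cross-check on the rule of homozygote frequency plus half the heterozygote frequency, the same estimates can be obtained by counting alleles directly from the genotype counts in the table, since each individual carries two alleles. The following is a minimal sketch; the function name `allele_frequencies` and its parameter names are hypothetical.

```python
def allele_frequencies(n_mm, n_mn, n_nn):
    """Estimate the M and N allele frequencies from genotype counts."""
    total_alleles = 2 * (n_mm + n_mn + n_nn)  # two alleles per individual
    # An MM individual carries two M alleles; an MN individual carries one.
    p_m = (2 * n_mm + n_mn) / total_alleles
    # An NN individual carries two N alleles; an MN individual carries one.
    p_n = (2 * n_nn + n_mn) / total_alleles
    return p_m, p_n


# Genotype counts from the table: MM = 363, MN = 634, NN = 282.
p_m, p_n = allele_frequencies(363, 634, 282)
print(round(p_m, 3), round(p_n, 3))  # prints: 0.532 0.468
```

Counting alleles gives 1360 M alleles out of 2558 in total, which agrees with the value of 0.532 obtained from the relative genotype frequencies.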
The strong assumption of fairness also applies to empirical probabilities: all entities are<br />
assumed to be equally likely to occur. Thus empirical probabilities are based on logical<br />
probabilities assigned to individual observations. For example, if we use a mark-recapture<br />
method to estimate the number of fish living within a lake, the estimate will most certainly be<br />
based on the strong assumption that every fish in the lake stands an equal chance of being<br />
caught. A sample in which every individual has an equal probability of selection is a random<br />
sample. We will see that this “assumption of randomness” is basic to many statistical<br />
procedures. If the sample is not truly random in this sense, then the long-run relative frequency<br />
of the outcome in the sample will not converge on its true value in the population. The relative<br />