05.01.2015 Views

3. Basic probability concepts



proportion of that outcome that we observe in a sample of observations. The long-run relative frequency is the proportion that we observe as the sample becomes very large. In most circumstances, as the sample size increases toward infinity, the relative frequency of the outcome in the sample will converge on its true value in the population. After estimating the long-run relative frequency, we then turn the situation around and interpret the frequency as the probability of observing the outcome of interest by sampling a single observation from the population.
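The convergence described above can be illustrated with a small simulation. This is only a sketch: the true probability of 0.3, the fixed seed, and the function name `relative_frequency` are assumptions chosen for illustration and do not come from the text.

```python
import random


def relative_frequency(p_true, n, seed=0):
    """Return the proportion of n simulated draws in which the outcome occurs.

    p_true is a hypothetical 'true' population probability of the outcome.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    hits = sum(rng.random() < p_true for _ in range(n))
    return hits / n


if __name__ == "__main__":
    # The relative frequency should settle near p_true as n grows.
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(n, round(relative_frequency(0.3, n), 4))
```

With small samples the printed proportion can be far from 0.3; with large samples it should lie close to it, which is the sense in which the long-run relative frequency estimates the probability.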

Say that we want to estimate the allele frequencies for the MN blood group in a particular human population. The three possible genotypes are MM, MN, and NN. We could sample a number of individuals (the more, the better for this purpose, as long as the sample is representative of the population) and determine the genotype of each. From these we could in turn estimate the frequencies of the alleles M and N. For example, the following table summarizes the classic data of Race and Sanger (1962) on MN genotypes from a British population.

Phenotype                      M        MN       N
Genotype                       MM       MN       NN       Total
Number (absolute frequency)    363      634      282      1279
Relative frequency             0.284    0.496    0.220    1.000

Based on these data, the relative frequencies of the M and N alleles can be estimated as the homozygote frequencies plus half of the heterozygote frequency:

\[
p_M = f_{MM} + \tfrac{1}{2} f_{MN} = 0.284 + \tfrac{1}{2}(0.496) = 0.532
\]

\[
p_N = f_{NN} + \tfrac{1}{2} f_{MN} = 0.220 + \tfrac{1}{2}(0.496) = 0.468
\]

where p indicates the relative frequency of an allele and f the relative frequency of a genotype (its count divided by the total). The sample size is rather large, so we can assume (for now) that these estimates are reasonably precise. We can now interpret the relative frequencies as probabilities. The $p_M$ value of 0.532 indicates that our best estimate of the probability of drawing a single M allele from the gene pool is 53.2%. In other words, for every 1000 alleles drawn from the gene pool, 532 will, on average, be M alleles. Derivation of probabilities from relative frequencies is logically circular, of course, but still valuable.
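The allele-frequency calculation above can be reproduced from the genotype counts in the table. The following is a minimal sketch; the function and variable names are chosen here for illustration and are not part of the original text.

```python
# Genotype counts from the Race and Sanger (1962) table.
counts = {"MM": 363, "MN": 634, "NN": 282}


def allele_frequencies(genotype_counts):
    """Estimate (p_M, p_N) as homozygote frequency plus half the heterozygote frequency."""
    total = sum(genotype_counts.values())
    # Relative genotype frequencies: each count divided by the sample size.
    f = {genotype: n / total for genotype, n in genotype_counts.items()}
    p_m = f["MM"] + 0.5 * f["MN"]
    p_n = f["NN"] + 0.5 * f["MN"]
    return p_m, p_n


p_M, p_N = allele_frequencies(counts)
print(f"p_M = {p_M:.3f}, p_N = {p_N:.3f}")  # prints "p_M = 0.532, p_N = 0.468"
```

The same values follow from counting alleles directly: each of the 1279 individuals carries two alleles, so the estimate for M is (2 × 363 + 634) / (2 × 1279).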

The strong assumption of fairness also applies to empirical probabilities: all entities are assumed to be equally likely to occur. Thus empirical probabilities are based on logical probabilities assigned to individual observations. For example, if we use a mark-recapture method to estimate the number of fish living within a lake, the estimate will most certainly be based on the strong assumption that every fish in the lake stands an equal chance of being caught. A sample in which every individual has an equal probability of selection is a random sample. We will see that this “assumption of randomness” is basic to many statistical procedures. If the sample is not truly random in this sense, then the long-run relative frequency of the outcome in the sample will not converge on its true value in the population. The relative
