Gao X, Starmer J, Martin ER. A multiple testing correction method for ...
Gao X, Starmer J, Martin ER. A multiple testing correction method for ...
Gao X, Starmer J, Martin ER. A multiple testing correction method for ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Genetic Epidemiology 32: 361–369 (2008)<br />
A Multiple Testing Correction Method <strong>for</strong> Genetic Association<br />
Studies Using Correlated Single Nucleotide Polymorphisms<br />
Xiaoyi <strong>Gao</strong>, 1<br />
Joshua <strong>Starmer</strong>, 2,3 and Eden R. <strong>Martin</strong> 1<br />
1<br />
Center <strong>for</strong> Genetic Epidemiology and Statistical Genetics, Miami Institute <strong>for</strong> Human Genomics, University of Miami Miller School of Medicine,<br />
Miami, Florida<br />
2<br />
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina<br />
3<br />
Curriculum in Toxicology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina<br />
Multiple <strong>testing</strong> is a challenging issue in genetic association studies using large numbers of single nucleotide polymorphism<br />
(SNP) markers, many of which exhibit linkage disequilibrium (LD). Failure to adjust <strong>for</strong> <strong>multiple</strong> <strong>testing</strong> appropriately may<br />
produce excessive false positives or overlook true positive signals. The Bonferroni <strong>method</strong> of adjusting <strong>for</strong> <strong>multiple</strong><br />
comparisons is easy to compute, but is well known to be conservative in the presence of LD. On the other hand,<br />
permutation-based <strong>correction</strong>s can correctly account <strong>for</strong> LD among SNPs, but are computationally intensive. In this work,<br />
we propose a new <strong>multiple</strong> <strong>testing</strong> <strong>correction</strong> <strong>method</strong> <strong>for</strong> association studies using SNP markers. We show that it is simple,<br />
fast and more accurate than the recently developed <strong>method</strong>s and is comparable to permutation-based <strong>correction</strong>s using<br />
both simulated and real data. We also demonstrate how it might be used in whole-genome association studies to control<br />
type I error. The efficiency and accuracy of the proposed <strong>method</strong> make it an attractive choice <strong>for</strong> <strong>multiple</strong> <strong>testing</strong> adjustment<br />
when there is high intermarker LD in the SNP data set. Genet. Epidemiol. 32:361–369, 2008. r 2008 Wiley-Liss, Inc.<br />
Key words: single nucleotide polymorphism; composite linkage disequilibrium; <strong>multiple</strong> <strong>testing</strong> <strong>correction</strong>; principal<br />
component analysis; eigenvalues<br />
Contract grant sponsor: NIH; Contract grant numbers: NS39764, AG019757, AG20135; Contract grant sponsor: NIEHS; Contract grant<br />
numbers: T32 ES007126.<br />
Correspondence to: Xiaoyi <strong>Gao</strong>, Center <strong>for</strong> Genetic Epidemiology and Statistical Genetics, Miami Institute <strong>for</strong> Human Genomics,<br />
University of Miami Miller School of Medicine, Miami, FL 33136. E-mail: xgao@med.miami.edu<br />
Received 10 July 2007; Revised 28 November 2007; Accepted 20 December 2007<br />
Published online 12 February 2008 in Wiley InterScience (www.interscience.wiley.com).<br />
DOI: 10.1002/gepi.20310<br />
INTRODUCTION<br />
Multiple <strong>testing</strong> is a challenging issue <strong>for</strong> genetic<br />
data analysis. Candidate gene and genome-wide<br />
association studies involve statistical <strong>testing</strong> of not<br />
just a single hypothesis, but many. Even when the<br />
point-wise error rate (PW<strong>ER</strong>, ap) is set to a low level,<br />
the experiment-wise error rate (EW<strong>ER</strong>, ae) increases<br />
with the number of tests carried out. For this reason,<br />
strict significance thresholds have been recommended<br />
to control EW<strong>ER</strong> [Risch and Merikangas,<br />
1996]. However, an overly conservative approach<br />
may result in overlooking true positive signals,<br />
while an overly liberal criterion could produce<br />
excessive false positives. Sˇ idák and Bonferroni<br />
<strong>correction</strong>s are popular approaches <strong>for</strong> controlling<br />
ae by specifying what ap values should be used <strong>for</strong><br />
each individual test. The Sˇ idák <strong>correction</strong> is calculated<br />
as ap ¼ 1 ð1 aeÞ 1=N , where N is the number<br />
of individual hypotheses to be tested [Sˇ idák, 1967].<br />
This <strong>correction</strong> assumes that the hypothesis tests<br />
are independent. Noting that ð1 apÞ N<br />
1 Nap<br />
<strong>for</strong> small ap, we obtain the Bonferroni <strong>correction</strong> as<br />
r 2008 Wiley-Liss, Inc.<br />
ap ¼ ae=N [Bonferroni, 1935, 1936], which is an<br />
approximation to the S ˇ idák <strong>correction</strong>.<br />
Recently, single nucleotide polymorphisms<br />
(SNPs), which are often densely genotyped, have<br />
become popular markers <strong>for</strong> genetic association<br />
studies. The closely spaced SNPs frequently yield<br />
high correlation because of extensive linkage disequilibrium<br />
(LD) among them [Wall and Pritchard,<br />
2003]. There<strong>for</strong>e, when association studies are conducted<br />
with many SNPs, the tests per<strong>for</strong>med on<br />
each SNP are usually not independent, depending<br />
on the correlation structure among the SNPs. This<br />
violation of the independence assumption limits the<br />
S ˇ idák and Bonferroni <strong>correction</strong>s’ ability to control<br />
type I error effectively, and the PW<strong>ER</strong> has to be<br />
adjusted in order to keep the EW<strong>ER</strong> at a nominal<br />
level.<br />
In practice, many researchers use permutationbased<br />
<strong>method</strong>s to control the EW<strong>ER</strong> when the tests<br />
are correlated. For example, Churchill and Doerge<br />
[1994] used a permutation test <strong>for</strong> estimating threshold<br />
values in quantitative trait mapping. Ritchie<br />
et al. [2001] and Hoh et al. [2001] used a permutation
362 <strong>Gao</strong> et al.<br />
test to control significance level <strong>for</strong> dichotomous<br />
traits. Permutation test <strong>correction</strong> is very robust and<br />
has the advantage of drawing the threshold directly<br />
from the experimental data [Cheverud, 2001]. However,<br />
permutation tests are computationally intensive.<br />
Churchill and Doerge [1994] suggested that at<br />
least 10,000 shuffles are needed to estimate a 0.01<br />
threshold and 1,000 shuffles to estimate a 0.05<br />
threshold.<br />
If the number of independent tests can be correctly<br />
inferred, we can still use the standard Bonferroni<br />
<strong>correction</strong> to rapidly adjust <strong>for</strong> <strong>multiple</strong> <strong>testing</strong>.<br />
Based on this idea, several researchers have tried to<br />
derive the effective number of independent tests,<br />
Meff [Cheverud, 2001; Nyholt, 2004; Li and Ji, 2005].<br />
Cheverud [2001] was the first to propose this idea<br />
<strong>for</strong> <strong>multiple</strong> <strong>testing</strong> <strong>correction</strong> and published a<br />
<strong>for</strong>mula <strong>for</strong> calculating Meff when SNP markers are<br />
correlated. However, Cheverud’s Meff is still overly<br />
conservative when there is high LD among SNPs [Li<br />
and Ji, 2005; Salyakina et al., 2005]. Nyholt [2005]<br />
suggested excluding all SNPs in perfect LD except<br />
one prior to using Cheverud’s Meff as a means to<br />
improve the adjustment accuracy, but this <strong>method</strong><br />
remains overly conservative. Li and Ji [2005]<br />
proposed another Meff <strong>for</strong>mula and demonstrated<br />
its improvement over Cheverud’s. However, Li and<br />
Ji’s approach, partitioning eigenvalues into integer<br />
and fractional parts, is an intuitive solution, and it<br />
was tested only on a small number of SNPs (o15 <strong>for</strong><br />
each gene) in their single-locus analyses [Li and Ji,<br />
2005]. It is not clear how their <strong>method</strong> per<strong>for</strong>ms in<br />
relatively large SNP data sets (Z100) SNPs. With<br />
these limitations in mind, we have developed a new<br />
approach <strong>for</strong> estimating Meff, and denote it as Meff G,<br />
which improves on existing <strong>method</strong>s.<br />
The first step in calculating Meff <strong>for</strong> SNP data is<br />
constructing a correlation matrix, along with the<br />
corresponding eigenvalues, <strong>for</strong> the SNP loci. For<br />
example, Nyholt [2004] used LD correlation. However,<br />
a problem with calculating LD correlation is<br />
that the haplotype phase in<strong>for</strong>mation is not usually<br />
available and needs to be derived. A common<br />
technique <strong>for</strong> inferring LD when the haplotype<br />
phase is unknown is to use the expectation-maximization<br />
algorithm under the assumption of Hardy-<br />
Weinberg equilibrium (HWE) [Excoffier and Slatkin,<br />
1995]. The potential problem with this approach is<br />
that HWE may not hold when sample individuals<br />
are chosen based on phenotypes [Zaykin et al., 2006],<br />
and HWE can be distorted between cases and<br />
controls in regions of association [Nielsen et al.,<br />
1999; Wittke-Thompson et al., 2005]. Furthermore,<br />
this <strong>method</strong> requires the additional step of estimating<br />
haplotype frequencies, which may not be<br />
necessary if our goal is only to capture the correlation<br />
structure of SNPs. In contrast, the composite LD<br />
(CLD) correlation, which is calculated directly from<br />
SNP genotypes, describes the SNP correlation well<br />
Genet. Epidemiol.<br />
and is simpler to calculate. Recently, Weir et al.<br />
[2004], Schaid [2004], and Zaykin [2004] and Zaykin<br />
et al. [2006] showed that CLD can capture the<br />
relationship among SNPs comparable to those of<br />
gametic LD without requiring HWE.<br />
With the above improvements in mind, we<br />
propose a new <strong>multiple</strong> <strong>testing</strong> <strong>correction</strong>,<br />
simpleM, which uses CLD to create the correlation<br />
matrix and Meff G to calculate the effective number<br />
of independent tests. We then show that the new<br />
approach can successfully control the type I error<br />
rate based on both simulated and real data.<br />
Compared with either Bonferroni or Li and Ji’s<br />
approach, the adjusted thresholds from simpleM are<br />
more accurate, i.e., closer to the permutation-based<br />
<strong>correction</strong>s. Moreover, the proposed <strong>method</strong> can also<br />
be used to address <strong>multiple</strong> <strong>testing</strong> <strong>correction</strong> issues<br />
in genome-wide association studies using correlated<br />
SNPs.<br />
METHODS<br />
NOTATION<br />
For SNP markers, we consider only biallelic cases.<br />
Each SNP marker has two alleles and correspondingly<br />
three genotypes. For example, take two SNPs,<br />
A and B, which both have two alleles: A and a <strong>for</strong><br />
SNP A, and B and b <strong>for</strong> SNP B. The three genotypes<br />
are AA, Aa and aa <strong>for</strong> SNP A, and similarly BB, Bb<br />
and bb <strong>for</strong> SNP B. For SNP A, the allele frequencies<br />
are denoted as pA and pa <strong>for</strong> the A and a alleles,<br />
respectively, and the genotype frequencies are<br />
denoted as PAA, PAa and Paa <strong>for</strong> the AA, Aa and aa<br />
genotypes, respectively. Similarly pB, pb, PBB, PBb and<br />
Pbb are the respective frequencies <strong>for</strong> SNP B. PAB<br />
denotes the gametic frequency and P A=B denotes the<br />
non-gametic frequency between SNPs A and B. Li<br />
and Ji’s Meff and the proposed Meff are denoted as<br />
Meff L and Meff G, respectively. Keeping the EW<strong>ER</strong><br />
(ae) at a nominal significance level, the adjusted<br />
PW<strong>ER</strong>s are denoted as aL and aG calculated using<br />
Meff L, and Meff G, respectively. The Bonferroni and<br />
the permutation-based point-wise <strong>correction</strong> thresholds<br />
are denoted as aB and aP, respectively. M is the<br />
total number of SNPs in the data set.<br />
COMPOSITE LD<br />
The CLD coefficient is defined as<br />
DAB ¼ PAB þ P A=B<br />
¼ DAB þ D A=B;<br />
2pApB<br />
where DAB ¼ PAB pApB and D A=B ¼ P A=B pApB<br />
[Weir, 1996].<br />
The composite correlation is defined as<br />
DAB<br />
rAB ¼ p ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ;<br />
ðpAð1 pAÞþDAÞðpBð1 pBÞþDBÞ
where DA ¼ PAA p 2 A and DB ¼ PBB p 2 B [Weir,<br />
1979, 2004]. The composite estimator works well in<br />
capturing the LD correlation among SNPs and it is<br />
robust to violations in HWE [Weir et al., 2004;<br />
Schaid, 2004]. It is also easier to compute than LD<br />
correlation (see the Appendix). The CLD correlation<br />
can be calculated in R using the cor() function [Team,<br />
2007] when the SNP genotypes are numerically<br />
coded as 2, 1 and 0 <strong>for</strong> wild-type allele homozygotes,<br />
heterozygotes and variant-type allele homozygotes,<br />
respectively, which is shown in the Appendix.<br />
Meff_G ESTIMATION<br />
Principal component analysis (PCA) is a classical<br />
statistical approach <strong>for</strong> reducing dimensionality in<br />
multivariate analysis [Mardia et al., 1979]. The PCA<br />
approach has been applied to many recent genetic<br />
studies, such as haplotype tagging SNP selection<br />
[Meng et al., 2003; Lin and Altman, 2004] and<br />
<strong>correction</strong> <strong>for</strong> population stratification [Price et al.,<br />
2006]. It is a data-driven approach that allows<br />
researchers to consider all the SNPs simultaneously,<br />
which is ideal <strong>for</strong> inferring Meff <strong>for</strong> a particular data<br />
set. For simpleM, we compute eigenvalues from the<br />
pair-wise SNP correlation matrix created with CLD<br />
and then derive Meff G using PCA. Each eigenvalue<br />
can be interpreted as the amount of variance<br />
explained by the corresponding principal component.<br />
The eigenvalues, flig, are usually arranged in<br />
descending order, l1 l2 lM, where M is the<br />
number of SNPs in the data. Generally, a relatively<br />
small number of eigenvalues, x, contribute a high<br />
percentage of the sum of the variances <strong>for</strong> all of the<br />
components <strong>for</strong> correlated data. That is to say that<br />
only the first x eigenvalues are needed,<br />
Px i¼1 li= PM i¼1 li4C, where the percentage cutoff,<br />
C, is determined by the researcher. There are rules of<br />
thumb <strong>for</strong> choosing this threshold in PCA [Mardia<br />
et al., 1979], and in line with these, we propose<br />
defining x so that the corresponding eigenvalues<br />
explain 99.5% of the variation <strong>for</strong> SNP data and<br />
Meff G ¼ x. It should be noted, however, that too<br />
large or too small C may cause Meff G to be either<br />
overly conservative or overly liberal.<br />
THE simpleM METHOD<br />
The simpleM <strong>method</strong> involves four steps:<br />
Step 1: Derive the CLD correlation matrix from the<br />
SNP data set. This can be done using the cor()<br />
function in R (see the Appendix).<br />
Step 2: Calculate the eigenvalues, <strong>for</strong> example, by<br />
the R function eigen().<br />
Step 3: Infer Meff G through PCA to estimate the<br />
effective number of independent tests (see the Meff G<br />
ESTIMATION section).<br />
Step 4: Apply the Bonferroni <strong>correction</strong> <strong>for</strong>mula to<br />
calculate the adjusted point-wise significance level<br />
as aG ¼ ae=Meff G.<br />
Multiple Testing Correction Method<br />
363<br />
P<strong>ER</strong>MUTATION TESTS<br />
All adjusted thresholds were validated by permutation<br />
tests. Because we wanted to test the validity of<br />
our algorithm at significance levels of both ae ¼ 0:05<br />
and 0.01, we per<strong>for</strong>med 100,000 permutations on our<br />
data sets. In each permutation shuffle, half of the<br />
individuals were randomly assigned as cases and<br />
the other half were assigned as controls in the<br />
balanced data sets we simulated. For each permuted<br />
case-control sample, Armitage’s trend [Armitage,<br />
1955; Sasieni, 1997] tested association <strong>for</strong> each SNP.<br />
Thus, a total of M test statistics and their corresponding<br />
P-values were calculated <strong>for</strong> each permutation<br />
repeat and the smallest P-value was recorded.<br />
The smallest P-values were then arranged in<br />
ascending order and the fifth percentile was the<br />
permutation-based empirical experiment-wise critical<br />
value <strong>for</strong> the overall 0.05 type I error rate.<br />
Similarly, the first percentile <strong>for</strong> the threshold of the<br />
overall 0.01 type I error rate.<br />
SIMULATION DATA<br />
Four simulation studies were designed to study<br />
the per<strong>for</strong>mance of simpleM. Our simulation was<br />
similar to Rinaldo et al. [2005] except that the<br />
simulated regions were larger. To be more specific,<br />
we used Wall and Pritchard’s simulation program<br />
[Wall and Pritchard, 2003], which is a variation of<br />
Hudson’s MS program [Hudson, 2002], to generate<br />
recombination cold regions and hot spots/hot regions.<br />
In simulation 1, eight cold regions (10 kb each)<br />
were separated by hot spots (1 kb each) giving<br />
recombination cold regions interlaced with hot<br />
spots. The mutation rate was y ¼ 4Nem, where Ne<br />
is the effective population size, set to be 10,000, and<br />
m is the mutation rate per basepair, per generation,<br />
set to be 1.4 10 8 . The recombination rate was<br />
r ¼ 4Ned, where d ¼ 2:5 10 8 is the recombination<br />
rate per basepair, per generation. Values <strong>for</strong> m and d<br />
were chosen because they yield results similar to the<br />
empirical data in the SeattleSNP database [Rinaldo<br />
et al., 2005]. The corresponding scaled recombination<br />
rate <strong>for</strong> the entire simulated region was the<br />
product of r and the length of the region in<br />
basepairs. The per basepair recombination rate in<br />
hot spots was chosen to be 100 times greater than in<br />
cold regions.<br />
In contrast to simulation 1, simulation 2 had four<br />
cold regions (10 kb each) separated by hot regions<br />
(15 kb each); thus, cold regions are interlaced with<br />
hot regions. The recombination rate, d, was chosen to<br />
be 9 10 8 =bp <strong>for</strong> the hot regions. Again, the<br />
population parameters were chosen because they<br />
generate LD patterns similar to that observed in the<br />
SeattleSNP database [Rinaldo et al., 2005].<br />
For simulations 1 and 2, 100 SNP data sets were<br />
generated, each with 400 individuals (200 cases vs.<br />
200 controls, randomly assigned in the permutation<br />
Genet. Epidemiol.
364 <strong>Gao</strong> et al.<br />
tests). We used only common SNPs in each data set,<br />
which had a minor allele frequency 40.10. The<br />
resulting number of SNPs ranged from approximately<br />
70 to 140, with about 100 SNPs per simulation<br />
on average.<br />
Simulations 3 and 4 were identical to simulations 1<br />
and 2, except the number of individuals simulated<br />
increased to 1,000 (500 cases vs. 500 controls,<br />
randomly assigned in the permutation tests) to<br />
address how sample size affects simpleM’s per<strong>for</strong>mance.<br />
REAL DATA<br />
To evaluate simpleM with real data, we used a<br />
partial SNP data set from an Alzheimer wholegenome<br />
association project. We randomly chose<br />
1,723 SNPs spanning a region of 8 Mb on chromosome<br />
22. Five hundred unrelated unaffected individuals<br />
were used. The missing values in the data set<br />
were less than 1% <strong>for</strong> each SNP and each individual,<br />
respectively. The minor allele frequency <strong>for</strong> each<br />
SNP was 40.05. The total missing value rate <strong>for</strong> this<br />
data was 0.065%.<br />
RESULTS<br />
SIMULATION RESULTS<br />
For simulations 1 and 2, the CLD correlation<br />
matrices were calculated <strong>for</strong> each simulated SNP<br />
data set, as well as the eigenvalues, and aL and aG.<br />
The permutation-based <strong>correction</strong> threshold, aP, was<br />
derived using 100,000 random shuffles to serve as<br />
the true cutoff. The Bonferroni <strong>correction</strong>, aB, was<br />
also calculated <strong>for</strong> comparison purposes. The results<br />
<strong>for</strong> the 100 simulations are plotted in Figure 1, (a)<br />
and (b) <strong>for</strong> ae ¼ 0:05 and (c) and (d) <strong>for</strong> ae ¼ 0:01. For<br />
each simulation data set, the adjusted PW<strong>ER</strong> <strong>for</strong><br />
each <strong>method</strong> was plotted in separate colors: black,<br />
red, blue and purple, and marked with different<br />
letters: B, P, G and L in the figure to denote the<br />
estimated aB, aP, aG and aL, respectively. The<br />
number of SNPs in the data sets was sorted and<br />
arranged in ascending order. Finally, all the points<br />
(adjusted PW<strong>ER</strong> thresholds) <strong>for</strong> each <strong>method</strong> were<br />
connected to aid visualization. In all plots, the<br />
Bonferroni <strong>correction</strong> cutoff is too conservative<br />
relative to the permutation-based <strong>correction</strong> threshold.<br />
aG gives the adjusted PW<strong>ER</strong> closest to the<br />
permutation-based threshold, almost overlapping it,<br />
while aL is too liberal. It appears that aL is sensitive<br />
to whether or not cold regions are interspersed with<br />
hot spots (Fig. 1(a)), or cold regions are interspersed<br />
with hot regions (Fig. 1(b)), and similarly <strong>for</strong> Figure<br />
1(c) vs. Figure 1(d).<br />
The adjusted PW<strong>ER</strong>s from simulations 3 and 4 are<br />
plotted in Figure 2 in the same way the results from<br />
simulations 1 and 2 were. Our metric, aG, continued<br />
to be the closest adjustment to aP. The general trends<br />
Genet. Epidemiol.<br />
between simulation 3 and 1 and between simulation<br />
4 and 2 are in agreement with each other, which<br />
indicates that the simpleM <strong>method</strong> is not sensitive<br />
to the sample size used in these examples. Again, we<br />
observed that aL is sensitive to the relationship<br />
between cold regions interlaced with hot regions<br />
and cold regions interlaced with hot spots. Both<br />
Figures 1 and 2 show that aL is sensitive to the<br />
underlying LD structure.<br />
The results from simulations 1, 2, 3 and 4 show<br />
that aG can give a <strong>multiple</strong> <strong>testing</strong> adjustment that<br />
nearly overlaps aP. Furthermore, the simpleM<br />
<strong>method</strong> tremendously reduces the computing time<br />
<strong>for</strong> each data set compared with the permutation<br />
<strong>method</strong>. Calculating aP required over 3 hr per data<br />
set (1,000 individuals) using 100,000 permutation<br />
shuffles, whereas calculating aG only required about<br />
0.1 sec on our desktop computer (Intel Core2 2.4G<br />
CPU with 2 GB memory), which is at least 100,000<br />
times faster. Increasing the number of SNPs in the<br />
data sets makes the differences in speed even more<br />
dramatic. For example, a 100,000 shuffle permutation<br />
test on a data set of 1,000 individuals with 1,000<br />
SNPs takes more than a day to finish, while it takes<br />
only about a second <strong>for</strong> the simpleM <strong>method</strong> to<br />
derive the adjustment threshold. Assuming the<br />
computation time <strong>for</strong> the permutation is proportional<br />
to the number of SNPs, it will then consume<br />
over 12 days, 4 months, 20 months and 3 years <strong>for</strong><br />
10K, 100K, 500K and 1M SNP data sets with 1,000<br />
individuals using 100,000 shuffles on a single PC,<br />
while the simpleM <strong>method</strong> could finish any of these<br />
calculations in less than 1 hr.<br />
EXP<strong>ER</strong>IMENTAL DATA RESULTS<br />
We used an Alzheimer whole-genome association<br />
SNP data set to both validate the simpleM <strong>method</strong><br />
on real data and show how it can be applied to<br />
whole-genome analysis. In the presence of a large<br />
number of SNPs, it is challenging to calculate<br />
eigenvalues efficiently and effectively. There<strong>for</strong>e,<br />
we partitioned the large SNP data set into 13 small<br />
sets, each with 133 SNPs. Set 1 consisted of SNPs<br />
1–133, set 2 consisted of SNPs 134–266, and so on, set<br />
13 consisted of SNPs 1,597–1,723. Because PCA<br />
requires complete data matrices, we filled the<br />
missing data in the Alzheimer SNP data using the<br />
K nearest-neighbor algorithm [Hastie et al., 2001].<br />
We then applied the simpleM <strong>method</strong> to each set<br />
and got the following series of Meff G: 95 (133), 101<br />
(133), 92 (133), 90 (133), 91 (133), 92 (133), 68 (133), 71<br />
(133), 90 (133), 85 (133), 85 (133), 89 (133) and 83<br />
(127), where the number outside of parenthesis is the<br />
adjusted effective number of independent SNPs and<br />
the number within parenthesis is the original<br />
number of SNPs in the set. Thus, <strong>for</strong> the entire set<br />
of 1,723 SNPs, Meff G suggests that there are 1,132<br />
independent SNPs, while Meff L gives 837. To
compare the quality of our <strong>method</strong> to the permutation<br />
critical value, we calculated the permutation test<br />
threshold with 100,000 shuffles. If we set the<br />
nominal significance level to be 0.05, the derived<br />
permutation-based PW<strong>ER</strong>, aP ¼ 4:58 10 5 , the adjusted<br />
PW<strong>ER</strong> thresholds aG ¼ 0:05=1132 ¼<br />
4:42 10 5 and aL ¼ 0:05=837 ¼ 5:97 10 5 , and<br />
the Bonferroni <strong>correction</strong> aB ¼ 0:05=1723 ¼<br />
2:90 10 5 . In this case aG is very close to aP, while<br />
the Bonferroni <strong>correction</strong> is much more conservative<br />
and Li and Ji’s <strong>method</strong> is too liberal. If we set the<br />
nominal significance level to be 0.01, then<br />
aP ¼ 9:01 10 6 , aG ¼ 0:01=1132 ¼ 8:83 10 6 and<br />
aL ¼ 0:01=837 ¼ 1:19 10 5 and the Bonferroni <strong>correction</strong><br />
is aB ¼ 0:01=1723 ¼ 5:80 10 6 . Again, aG is<br />
very close to aP, while aB is too conservative and aL<br />
is too liberal.<br />
Because data with large numbers of SNPs need to<br />
be divided into sets <strong>for</strong> PCA, we investigated the use<br />
Multiple Testing Correction Method<br />
Fig. 1. Adjusted PW<strong>ER</strong> thresholds comparison <strong>for</strong> simulation 1 and 2 (400 individuals, 200 cases vs. 200 controls). The adjusted PW<strong>ER</strong><br />
thresholds <strong>for</strong> Bonferroni, permutation, Li and Ji’s approach, and the proposed <strong>method</strong> are marked by black, red, purple and blue,<br />
respectively. The data sets are sorted in order of ascending number of SNPs. Data points are connected to aid visualization. (a)<br />
corresponds to simulation 1 with the EW<strong>ER</strong> equal to 0.05. SNPs were generated as recombination cold regions interlaced with hot spots.<br />
Four hundred individuals were simulated. (b) corresponds to simulation 2 with EW<strong>ER</strong> 5 0.05. SNPs were generated as recombination<br />
cold regions interlaced with hot regions. Four hundred individuals were simulated. (c) and (d) are duplicates of (a) and (b), respectively,<br />
except EW<strong>ER</strong> is equal to 0.01. PW<strong>ER</strong>, point-wise error rate; SNPs, single nucleotide polymorphisms; EW<strong>ER</strong>, experiment-wise error rate.<br />
365<br />
of alternative <strong>method</strong>s <strong>for</strong> choosing block sizes. The<br />
simplest <strong>method</strong> was to define a fixed size <strong>for</strong> the<br />
blocks, as seen in the preceding results. We also used<br />
the Haploview software [Barrett et al., 2005] and<br />
Gabriel et al.’s definition on haplotype blocks<br />
[Gabriel et al., 2002] because they take advantage<br />
of the LD structure among SNPs. Based on the<br />
haplotype block output from the Haploview software,<br />
we divided the SNP data into 13 blocks, each<br />
with about 100–140 SNPs (cutting at block boundaries),<br />
and then applied our simpleM <strong>method</strong> to<br />
each block. This gave us 102 (141), 103 (141), 98 (146),<br />
84 (119), 94 (140), 86 (128), 65 (133), 57 (110), 96 (142),<br />
92 (142), 88 (138), 88 (132) and 71 (111), where the<br />
number outside of parenthesis is the adjusted<br />
effective number of independent SNPs and within<br />
parenthesis is the original number of SNPs in the<br />
block. The sum of the inferred Meff Gs <strong>for</strong> each block<br />
was 1,124 <strong>for</strong> the whole data set, while Meff L gave<br />
Genet. Epidemiol.
366 <strong>Gao</strong> et al.<br />
Fig. 2. Adjusted PW<strong>ER</strong> thresholds comparison <strong>for</strong> simulation 3 and 4 (1,000 individuals, 500 cases vs. 500 controls). The adjusted PW<strong>ER</strong><br />
thresholds <strong>for</strong> Bonferroni, permutation, Li and Ji’s approach, and the proposed <strong>method</strong> are marked by black, red, purple and blue,<br />
respectively. The data sets are sorted in order of ascending number of SNPs. Data points are connected to aid visualization. (a)<br />
corresponds to simulation 3 with the EW<strong>ER</strong> equal to 0.05. SNPs were generated as recombination cold regions interlaced with hot spots.<br />
One thousand individuals were simulated. (b) corresponds to simulation 4 with EW<strong>ER</strong> 5 0.05. SNPs were generated as recombination<br />
cold regions interlaced with hot regions. One thousand individuals were simulated. (c) and (d) are duplicates of (a) and (b), respectively,<br />
except EW<strong>ER</strong> is equal to 0.01. PW<strong>ER</strong>, point-wise error rate; SNPs, single nucleotide polymorphisms; EW<strong>ER</strong>, experiment-wise error rate.<br />
818. For a nominal significance level of 0.05, the<br />
adjusted PW<strong>ER</strong> threshold aG is equal to 0:05=1124 ¼<br />
4:45 10 5 and aL ¼ 0:05=818 ¼ 6:11 10 5 (compared<br />
to aP ¼ 4:58 10 5 and, <strong>for</strong> the fixed length<br />
blocks, aG ¼ 0:05=1132 ¼ 4:42 10 5 and<br />
aL ¼ 0:05=837 ¼ 5:97 10 5 ). If we set the nominal<br />
significance level to be 0.01, then aG ¼ 0:01=1124 ¼<br />
8:90 10 6 and aL ¼ 0:01=818 ¼ 1:22 10 5 (compared<br />
to aP ¼ 9:01 10 6 and, <strong>for</strong> the fixed length<br />
blocks, aG ¼ 0:01=1132 ¼ 8:83 10 6 and<br />
aL ¼ 0:01=837 ¼ 1:19 10 5 ). With variable block<br />
sizes, the aG improved slightly over that from the<br />
fixed length partition. This improvement, however,<br />
is mitigated by the fact that Haploview assumes<br />
HWE in the estimation of gamete frequencies.<br />
DISCUSSION<br />
In our simpleM approach, we use CLD correlation<br />
instead of LD correlation. The advantages that CLD<br />
Genet. Epidemiol.<br />
have over LD correlation are that CLD does not<br />
require HWE and the calculation is simpler and<br />
faster. The correlation structure among SNPs can be<br />
derived from their genotypes directly and no<br />
haplotype frequency estimation is necessary. Cheverud’s<br />
Meff may not capture the correlation among<br />
SNPs well [Salyakina et al., 2005; Li and Ji, 2005]. To<br />
improve the Meff, Nyholt suggested removing all<br />
SNPs in perfect correlation except one from the data<br />
set [Nyholt, 2005]. Although this remedy may be<br />
effective on some small SNP data sets, it shows that<br />
Cheverud’s Meff does not adjust effectively in many<br />
situations. In contrast, Meff L showed a significant<br />
improvement over Cheverud’s Meff C [Li and Ji,<br />
2005]. In our adjustment comparisons we used<br />
Cheverud’s Meff, but it did not offer much improvement<br />
over the Bonferroni <strong>correction</strong> on our data<br />
(results not shown). There<strong>for</strong>e, we did not include it<br />
in our comparison. Another <strong>multiple</strong> <strong>testing</strong> <strong>method</strong><br />
that we did not consider is the false discovery rate
(FDR) approach. FDR is commonly used in microarray<br />
data analysis, where studies involve a large<br />
amount of true alternative hypothesis (genes differently<br />
expressed). However, in genetic association<br />
studies, most of the hypotheses are null (SNPs not<br />
associated with the disease). Moreover, FDR assumes<br />
that the P-values corresponding to true null<br />
hypothesis tests are independent and uni<strong>for</strong>mly<br />
distributed or can be considered as approximately<br />
independent [Benjamini and Hochberg, 1995; Storey<br />
et al., 2004], which is likely to be violated when there<br />
is high LD among SNPs in genetic association<br />
studies. There<strong>for</strong>e, we compared our <strong>method</strong> only<br />
to the permutation <strong>method</strong> which is considered as a<br />
gold standard in <strong>multiple</strong> <strong>testing</strong> <strong>correction</strong>, Bonferroni<br />
and Li and Ji’s approach. Among all the<br />
adjustment <strong>method</strong>s considered, the simpleM <strong>method</strong><br />
gave the best approximation to the permutationbased<br />
<strong>correction</strong> threshold using either the simulated<br />
or the real data set in the presence of high<br />
intermarker LD correlations. In the extreme case, if<br />
the SNPs are nearly independent, there should not<br />
be much difference in using these adjustment<br />
<strong>method</strong>s.<br />
There are two possible ways to compare different<br />
<strong>multiple</strong> <strong>testing</strong> <strong>method</strong>s. First, fix the EW<strong>ER</strong> at a<br />
nominal value and try to find the corresponding<br />
PW<strong>ER</strong> threshold. Whichever adjustment is closest to<br />
the permutation-based PW<strong>ER</strong> is considered the best,<br />
as we did in our comparison (see Figs. 1 and 2).<br />
Second, given the PW<strong>ER</strong>, derive the corresponding<br />
EW<strong>ER</strong> and then compare it to the nominal type I<br />
error rate. These two <strong>method</strong>s are equivalent, but<br />
calculated in opposite directions. While PW<strong>ER</strong> is<br />
useful <strong>for</strong> determining the threshold <strong>for</strong> accepting or<br />
rejecting hypothesis tests, EW<strong>ER</strong> can be useful <strong>for</strong><br />
appreciating the how significant small changes in<br />
PW<strong>ER</strong> are. For example, with the Alzheimer SNP<br />
data we calculated aP ¼ 4:58 10 5 , which corresponds<br />
to the permutation-based EW<strong>ER</strong> of 0.05. We<br />
then approximate the ‘‘true’’ effective number of<br />
independent tests with Meff P, using the <strong>for</strong>mula<br />
Meff P ¼ 0:05=ð4:58 10 5 Þ¼1; 092. From the same<br />
data set we calculated aG ¼ 4:42 10 5 , aL ¼ 5:97<br />
10 5 and aB ¼ 2:90 10 5 . We can now derive the<br />
EW<strong>ER</strong>s by multiplying Meff P by the various values<br />
<strong>for</strong> a giving us 0.048, 0.065 and 0.032 <strong>for</strong> the<br />
simpleM, Li and Ji’s <strong>method</strong> and Bonferroni<br />
approach, respectively. Thus, the small differences<br />
in the PW<strong>ER</strong>s resulted in rather large changes in<br />
EW<strong>ER</strong>s. Since the difference between 0.05 and the<br />
EW<strong>ER</strong> <strong>for</strong> aG is the smallest of the three <strong>method</strong>s, we<br />
conclude that it is the most accurate.<br />
In analyzing SNP markers, several tests have been<br />
proposed, such as the w 2 test, the allelic-based test<br />
and Armitage’s trend test. There are several suggestions<br />
<strong>for</strong> which test procedure should be used<br />
[Sasieni, 1997; Schaid and Jacobsen, 1999; Deng,<br />
2000; Knapp, 2001; Zou, 2006]. Here, we adopted the<br />
Multiple Testing Correction Method<br />
367<br />
suggestion by Sasieni [1997] to analyze by genotypes<br />
and used Armitage’s trend test. Our Meff estimation<br />
should also apply to the w 2 test based on alleles.<br />
However, realizing the relationship w2 G ¼ w2a =ð1 þ ^ fÞ,<br />
where w2 a is the allele-based w2 test statistic, w2 G is the<br />
trend test statistic and ^ f is the estimated inbreeding<br />
coefficient [Sasieni, 1997; Zou, 2006], the permutation-based<br />
<strong>correction</strong> threshold, aP, may vary with<br />
the different tests employed slightly because ^ f is<br />
unlikely to be equal from locus to locus. The aG<br />
calculated here is only an approximation of the<br />
permutation-based <strong>correction</strong> thresholds and not a<br />
replacement. We should be aware that the precision<br />
of the permutation-based critical value is associated<br />
with the number of shuffles used. For more precise<br />
estimates, a larger number of shuffles should be<br />
per<strong>for</strong>med. For example, if the permutation critical<br />
value is set at 0.05, 10,000 shuffles gave a good<br />
permutation estimate on our data. However, it<br />
required 100,000 shuffles to get a relatively stable<br />
permutation estimate <strong>for</strong> a critical value of 0.01 in<br />
our tests.<br />
There may be several limitations <strong>for</strong> the simpleM<br />
<strong>method</strong>. It is not uncommon to have missing data in<br />
SNP data sets. However, PCA requires complete<br />
data matrices; otherwise, the CLD correlation matrix<br />
may not be positive semi-definite. For the Alzheimer<br />
data, the missing value rate is only 0.065%. Be<strong>for</strong>e<br />
using PCA, we filled the missing values with<br />
inferred genotypes using the K nearest-neighbor<br />
<strong>method</strong>. Missing genotypes can also be filled with<br />
re-genotyping. With new developments in genotyping<br />
technology and statistical imputation <strong>method</strong>s, a<br />
small amount of missing values is unlikely to hinder<br />
simpleM’s inference. Another problem with PCA is<br />
that it becomes inefficient with a large number SNPs<br />
(41,000). In such situations, the eigenvalue calculation<br />
is unable to produce enough non-zero eigenvalues<br />
<strong>for</strong> either Meff G or Meff L to work well. This is<br />
not surprising since it was pointed out by Schäfer<br />
and Strimmer [2005] that a growing number of zero<br />
eigenvalues will be observed in situations where<br />
there are a small number of samples and a large<br />
number of variables, i.e., the usual ‘‘small n and<br />
large p’’ hurdle. In practice, when using a large<br />
number of SNPs, the data set has to be divided into<br />
smaller blocks. We tested two <strong>method</strong>s <strong>for</strong> dividing<br />
data sets: a simple fixed block size and Haploview<br />
with Gabriel et al.’s definition of haplotype blocks.<br />
Blocks created with Haploview resulted in a slightly<br />
better adjusted cutoff than with fixed blocks. This<br />
improvement, however, is mitigated by the fact that<br />
Haploview requires HWE to estimate haplotype<br />
frequencies.<br />
This study is concerned mainly with single-locus<br />
association tests. Here, we give some suggestions <strong>for</strong><br />
two popular designs: candidate gene and genomewide<br />
association studies in human genetics. In<br />
candidate gene association studies, we can analyze<br />
Genet. Epidemiol.
368 <strong>Gao</strong> et al.<br />
the SNPs all together if the eigenvalues can be<br />
derived. In the situations where the high dimensionality<br />
prohibits the calculation of eigenvalues, we can<br />
analyze the SNPs on each chromosome separately or<br />
according to the gene functions and then sum all of<br />
the Meff values together. The total Meff can be used to<br />
calculate the adjusted PW<strong>ER</strong>. In genome-wide<br />
association studies, we have to partition the SNPs<br />
into several parts and analyze them separately. Since<br />
SNPs on different chromosomes are expected to be<br />
in linkage equilibrium in general populations, the<br />
genome-wide effective number of independent tests<br />
can be obtained by summing the chromosome<br />
specific Meff values. For each chromosome, we may<br />
use the partition-ligation approach by dividing the<br />
SNPs into several parts, and then sum the Meff<br />
values from each partition, similar to how we tested<br />
our Alzheimer SNP data set. The total Meff is used in<br />
the final adjustment calculation. Due to the interblock<br />
correlations that are unlikely to be captured in<br />
this partition-ligation approach, the total Meff may<br />
be slightly conservative. However, the interblock<br />
correlations may be reduced if we partition SNPs<br />
according to their haplotype block structure.<br />
In summary, the simpleM algorithm provides a<br />
highly accurate approximation to the permutationbased<br />
<strong>correction</strong> threshold and is easily implemented.<br />
Itisshowntobesimple,fastandmoreaccuratethan<br />
recently developed <strong>method</strong>s and is comparable to the<br />
permutation-based <strong>correction</strong> threshold using both<br />
simulated and real SNP data. The efficiency and<br />
accuracy of the simpleM <strong>method</strong> make it an attractive<br />
choice <strong>for</strong> <strong>multiple</strong> <strong>testing</strong> adjustment when there is<br />
high intermarker LD in the SNP data set as in<br />
candidate gene or genome-wide association studies.<br />
ACKNOWLEDGMENTS<br />
This work was supported in part by NIH grants<br />
NS39764, AG019757 and AG20135 and NIEHS T32<br />
ES007126. We thank Dr. Gary Beecham who prepared<br />
the Alzheimer data <strong>for</strong> us. We thank Dr.<br />
Richard Morris <strong>for</strong> initial inspiration.<br />
REF<strong>ER</strong>ENCES<br />
Armitage P. 1955. Tests <strong>for</strong> linear trends in proportions and<br />
frequencies. Biometrics 11:375–386.<br />
Barrett JC, Fry B, Maller J, Daly MJ. 2005. Haploview: analysis and<br />
visualization of LD and haplotype maps. Bioin<strong>for</strong>matics<br />
21:263–265.<br />
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery<br />
rate: a practical and powerful approach to <strong>multiple</strong> <strong>testing</strong>. J R<br />
Stat Soc B 57:289–300.<br />
Bonferroni CE. 1935. Il calcolo delle assicurazioni su gruppi di<br />
teste, chapter ‘‘Studi in Onore del Professore Salvatore ortu<br />
Carboni’’. Rome. p 13–60.<br />
Bonferroni CE. 1936. Teoria statistica delle classi e calcolo delle<br />
probabilitá. Pubblicazioni del Istituto Superiore di Scienze<br />
Economiche e Commerciali di Firenze 8:3–62.<br />
Genet. Epidemiol.<br />
Cheverud JM. 2001. A simple <strong>correction</strong> <strong>for</strong> <strong>multiple</strong> comparisons<br />
in interval mapping genome scans. Heredity 87:52–58.<br />
Churchill GA, Doerge RW. 1994. Empirical threshold values <strong>for</strong><br />
quantitative trait mapping. Genetics 138:963–971.<br />
Deng HW. 2000. Re: ‘‘biased tests of association: comparisons of<br />
allele frequencies when departing from Hardy-Weinberg<br />
proportions’’. Am J Epidemiol 151:335–336.<br />
Excoffier L, Slatkin M. 1995. Maximum-likelihood estimation of<br />
molecular haplotype frequencies in a diploid population. Mol<br />
Biol Evol 12:921–927.<br />
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel<br />
B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero<br />
SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly<br />
MJ, Altshuler D. 2002. The structure of haplotype blocks in the<br />
human genome. Science 296:2225–2229.<br />
Hastie T, Tibshirani R, Friedman J. 2001. The Elements of<br />
Statistical Learning. Berlin: Springer.<br />
Hoh J, Wille A, Ott J. 2001. Trimming, weighting, and grouping<br />
SNPs in human case-control association studies. Genome Res<br />
11:2115–2119.<br />
Hudson RR. 2002. Generating samples under a Wright-Fisher<br />
neutral modal of genetic variation. Bioin<strong>for</strong>matics 18:337–338.<br />
Knapp M. 2001. Re:‘‘biased tests of association: comparisons of<br />
allele frequencies when departing from Hardy-Weinberg<br />
proportions’’. Am J Epidemiol 154:287–288.<br />
Li J, Ji L. 2005. Adjusting <strong>multiple</strong> <strong>testing</strong> in multilocus analyses using<br />
the eigenvalues of a correlation matrix. Heredity 95:221–227.<br />
Lin Z, Altman RB. 2004. Finding haplotype tagging SNPs by use of<br />
principal components analysis. Am J Hum Genet 75:850–861.<br />
Mardia KV, Kent JT, Bibby JM. 1979. Multivariate Analysis.<br />
London: Academic Press.<br />
Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG. 2003. Selection<br />
of genetic markers <strong>for</strong> association analyses, using linkage<br />
disequilibrium and haplotypes. Am J Hum Genet 73:115–130.<br />
Nielsen DM, Ehm MG, Weir BS. 1999. Detecting marker-disease<br />
association by <strong>testing</strong> <strong>for</strong> Hardy-Weinberg disequilibrium at a<br />
marker locus. Am J Hum Genet 63:1531–1540.<br />
Nyholt DR. 2004. A simple <strong>correction</strong> <strong>for</strong> <strong>multiple</strong> <strong>testing</strong> <strong>for</strong><br />
single-nucleotide polymorphisms in linkage disequilibrium<br />
with each other. Am J Hum Genet 74:765–769.<br />
Nyholt DR. 2005. Evaluation of Nyholt’s procedure <strong>for</strong> <strong>multiple</strong><br />
<strong>testing</strong> <strong>correction</strong>—author’s reply. Hum Hered 60:61–62.<br />
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich<br />
D. 2006. Principal components analysis corrects <strong>for</strong> stratification in<br />
genome-wide association studies. Nat Genet 38:904–909.<br />
Rinaldo A, Bacanu SA, Devlin B, Sonpar V, Wasserman L, Roeder<br />
K. 2005. Characterization of multilocus linkage disequilibrium.<br />
Genet Epidemiol 28:193–206.<br />
Risch N, Merikangas K. 1996. The future of genetic studies of<br />
complex human diseases. Science 273:1516–1517.<br />
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD,<br />
Parl FF, Moore JH. 2001. Multifactor-dimensionality reduction<br />
reveals high-order interactions among estrogen-metabolism<br />
genes in sporadic breast cancer. Am J Hum Genet 69:138–147.<br />
Salyakina D, Seaman SR, Browning BL, Dudbridge F, Muller-<br />
Myhsok B. 2005. Evaluation of Nyholt’s procedure <strong>for</strong> <strong>multiple</strong><br />
<strong>testing</strong> <strong>correction</strong>. Hum Hered 60:19–25.<br />
Sasieni PD. 1997. From genotypes to genes: doubling the sample<br />
size. Biometrics 53:1253–1261.<br />
Schäfer J, Strimmer K. 2005. A shrinkage approach to large scale<br />
covariance-matrix estimation and implications <strong>for</strong> functional<br />
genomics. Stat Appl Genet Mol Biol 4:32.<br />
Schaid DJ. 2004. Linkage disequilibrium <strong>testing</strong> when linkage<br />
phase is unknown. Genetics 166:505–512.
Schaid DJ, Jacobsen SJ. 1999. Biased tests of association:<br />
comparisons of allele frequencies when departing from<br />
Hardy-Weinberg proportions. Am J Epidemiol 149:706–711.<br />
Storey JD, Taylor JE, Siegmund D. 2004. Strong control, conservative<br />
point estimation, and simultaneous conservative consistency of<br />
false discovery rates: A unified approach. J R Stat Soc B 66:187–205.<br />
Team RDC. 2007. R: A Language and Environment <strong>for</strong> Statistical<br />
Computing. Vienna, Austria: R Foundation <strong>for</strong> Statistical<br />
Computing, ISBN 3-900051-07-0.<br />
S ˇ idák Z. 1967. Rectangular confidence regions <strong>for</strong> the means of<br />
multivariate normal distributions. J Am Stat Assoc 62:626–633.<br />
Wall JD, Pritchard JK. 2003. Assessing the per<strong>for</strong>mance of the<br />
haplotype block model of linkage disequilibrium. Am J Hum<br />
Genet 73:502–515.<br />
Weir BS. 1979. Inferences about linkage disequilibrium. Biometrics<br />
35:235–254.<br />
APPENDIX<br />
CALCULATING PAIR-WISE CLD<br />
CORRELATION FOR BIALLELIC SNPS<br />
We first trans<strong>for</strong>m the genotypes into numerical<br />
coding as<br />
8<br />
if the genotype is variant<br />
0<br />
><<br />
type allele homozygote;<br />
if the genotype is<br />
numerical coding ¼ 1<br />
heterozygote;<br />
>:<br />
if the genotype is wild<br />
2<br />
type allele homozygote:<br />
A pair of SNPs, A and B, are represented by two<br />
vectors, x and y. Denote the number of individuals<br />
as n and nuv is the number of individuals who carry<br />
uv genotypes.<br />
The covariance between x and y is<br />
X xiyi<br />
X X<br />
xi yi<br />
covðx; yÞ ¼ 1 1<br />
n n2 ¼ 1<br />
n ð2nAaBB þ nAaBb þ 4nAABB þ 2nAABbÞ<br />
1<br />
n2 ðnAa þ 2nAAÞðnBb þ 2nBBÞ;<br />
and the CLD coefficient is<br />
DAB ¼ PAB þ P A=B<br />
2pApB<br />
¼ 2P AB<br />
1<br />
AB þ PAB Ab þ PAB aB þ<br />
2 ðPAB ab<br />
¼ 2nAABB<br />
n<br />
þ nAABb<br />
n<br />
þ nAaBB<br />
n<br />
2 2nAA þ nAa 2nBB þ nBb<br />
2n 2n<br />
þ PAb aB Þ 2pApB<br />
1 nAaBb<br />
þ<br />
2 n<br />
Multiple Testing Correction Method<br />
Weir BS. 1996. Genetic Data Analysis, vol. II. Sunderland, MA:<br />
Sinauer Associates Inc.<br />
Weir BS, Hill WG, Cardon LR. 2004. Allelic association patterns <strong>for</strong><br />
a dense snp map. Genet Epidemiol 27:442–450.<br />
Wittke-Thompson JK, Pluzhnikov A, Cox NJ. 2005. Rational<br />
inferences about departures from Hardy-Weinberg equilibrium.<br />
Am J Hum Genet 76:967–986.<br />
Zaykin DV. 2004. Bounds and normalization of the composite<br />
linkage disequilibrium coefficient. Genet Epidemiol<br />
27:252–257.<br />
Zaykin DV, Meng Z, Ehm MG. 2006. Contrasting linkagedisequilibrium<br />
patterns between cases and controls as a<br />
novel association-mapping <strong>method</strong>. Am J Hum Genet<br />
78:737–746.<br />
Zou GY. 2006. Statistical <strong>method</strong>s <strong>for</strong> the analysis of genetic<br />
association studies. Ann Hum Genet 70:262–276.<br />
¼ 1<br />
2n ð4nAABB þ 2nAABb þ 2nAaBB þ nAaBbÞ<br />
1<br />
2n2 ð2nAA þ nAaÞð2nBB þ nBbÞ:<br />
There<strong>for</strong>e, covðx; yÞ ¼2DAB.<br />
We know that pA ¼ð2nAA þ nAaÞ=2n and<br />
pB ¼ð2nBB þ nBbÞ=2n, and then the variance of x is<br />
varðxÞ ¼ 1 X<br />
P 2<br />
2 xi<br />
xi n<br />
n<br />
¼ 1<br />
n ðnAa þ 4nAAÞ<br />
¼ nAa þ 2nAA<br />
n<br />
þ 2nAA<br />
n<br />
nAa þ 2nAA<br />
n<br />
2<br />
nAa þ 2nAA<br />
n<br />
¼ 2pA þ 2PAA 4p 2 A<br />
¼ 2½pAð1 pAÞþPAA p 2 AŠ ¼ 2½pAð1 pAÞþDAŠ:<br />
Similarly, the variance of y is<br />
varðyÞ ¼2½pBð1 pBÞþDBŠ:<br />
There<strong>for</strong>e, the CLD correlation given by Weir [1979,<br />
2004], as<br />
DAB<br />
rAB ¼ p ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ;<br />
ðpAð1 pAÞþDAÞðpBð1 pBÞþDBÞ<br />
can be computed from the 0, 1 and 2 genotype<br />
numerical coding, as correlation<br />
covðx; yÞ<br />
rAB ¼ p ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi :<br />
varðxÞvarðyÞ<br />
The CLD correlation can be calculated simply by<br />
using the R function cor().<br />
2<br />
369<br />
Genet. Epidemiol.