
Gao X, Starmer J, Martin ER. A multiple testing correction method for ...


Genetic Epidemiology 32: 361–369 (2008)

A Multiple Testing Correction Method for Genetic Association Studies Using Correlated Single Nucleotide Polymorphisms

Xiaoyi Gao,¹ Joshua Starmer,²,³ and Eden R. Martin¹

¹ Center for Genetic Epidemiology and Statistical Genetics, Miami Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, Florida

² Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina

³ Curriculum in Toxicology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina

Multiple testing is a challenging issue in genetic association studies using large numbers of single nucleotide polymorphism (SNP) markers, many of which exhibit linkage disequilibrium (LD). Failure to adjust for multiple testing appropriately may produce excessive false positives or overlook true positive signals. The Bonferroni method of adjusting for multiple comparisons is easy to compute, but is well known to be conservative in the presence of LD. On the other hand, permutation-based corrections can correctly account for LD among SNPs, but are computationally intensive. In this work, we propose a new multiple testing correction method for association studies using SNP markers. We show that it is simple, fast and more accurate than the recently developed methods and is comparable to permutation-based corrections using both simulated and real data. We also demonstrate how it might be used in whole-genome association studies to control type I error. The efficiency and accuracy of the proposed method make it an attractive choice for multiple testing adjustment when there is high intermarker LD in the SNP data set. Genet. Epidemiol. 32:361–369, 2008. © 2008 Wiley-Liss, Inc.

Key words: single nucleotide polymorphism; composite linkage disequilibrium; multiple testing correction; principal component analysis; eigenvalues

Contract grant sponsor: NIH; Contract grant numbers: NS39764, AG019757, AG20135; Contract grant sponsor: NIEHS; Contract grant numbers: T32 ES007126.

Correspondence to: Xiaoyi Gao, Center for Genetic Epidemiology and Statistical Genetics, Miami Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL 33136. E-mail: xgao@med.miami.edu

Received 10 July 2007; Revised 28 November 2007; Accepted 20 December 2007

Published online 12 February 2008 in Wiley InterScience (www.interscience.wiley.com).

DOI: 10.1002/gepi.20310

INTRODUCTION

Multiple testing is a challenging issue for genetic data analysis. Candidate gene and genome-wide association studies involve statistical testing of not just a single hypothesis, but many. Even when the point-wise error rate (PWER, $\alpha_p$) is set to a low level, the experiment-wise error rate (EWER, $\alpha_e$) increases with the number of tests carried out. For this reason, strict significance thresholds have been recommended to control EWER [Risch and Merikangas, 1996]. However, an overly conservative approach may result in overlooking true positive signals, while an overly liberal criterion could produce excessive false positives. Šidák and Bonferroni corrections are popular approaches for controlling $\alpha_e$ by specifying what $\alpha_p$ values should be used for each individual test. The Šidák correction is calculated as $\alpha_p = 1 - (1 - \alpha_e)^{1/N}$, where $N$ is the number of individual hypotheses to be tested [Šidák, 1967]. This correction assumes that the hypothesis tests are independent. Noting that $(1 - \alpha_p)^N \approx 1 - N\alpha_p$ for small $\alpha_p$, we obtain the Bonferroni correction as $\alpha_p = \alpha_e / N$ [Bonferroni, 1935, 1936], which is an approximation to the Šidák correction.
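
As a small numerical illustration of the two formulas, the following Python sketch computes both point-wise thresholds for a given experiment-wise level and number of independent tests. The function names are ours and are used only for illustration.

```python
def sidak_threshold(alpha_e, n_tests):
    """Point-wise level from the Sidak formula: 1 - (1 - alpha_e)^(1/N)."""
    return 1.0 - (1.0 - alpha_e) ** (1.0 / n_tests)


def bonferroni_threshold(alpha_e, n_tests):
    """Point-wise level from the Bonferroni formula: alpha_e / N."""
    return alpha_e / n_tests


if __name__ == "__main__":
    # Compare the two corrections for a few numbers of tests.
    for n in (1, 10, 100, 1000):
        print(n, sidak_threshold(0.05, n), bonferroni_threshold(0.05, n))
```

For any number of tests greater than one, the Šidák level is slightly larger than the Bonferroni level, and the two agree closely when the experiment-wise level is small.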

Recently, single nucleotide polymorphisms (SNPs), which are often densely genotyped, have become popular markers for genetic association studies. The closely spaced SNPs frequently yield high correlation because of extensive linkage disequilibrium (LD) among them [Wall and Pritchard, 2003]. Therefore, when association studies are conducted with many SNPs, the tests performed on each SNP are usually not independent, depending on the correlation structure among the SNPs. This violation of the independence assumption limits the Šidák and Bonferroni corrections' ability to control type I error effectively, and the PWER has to be adjusted in order to keep the EWER at a nominal level.

In practice, many researchers use permutation-based methods to control the EWER when the tests are correlated. For example, Churchill and Doerge [1994] used a permutation test for estimating threshold values in quantitative trait mapping. Ritchie et al. [2001] and Hoh et al. [2001] used a permutation test to control significance level for dichotomous traits. Permutation test correction is very robust and has the advantage of drawing the threshold directly from the experimental data [Cheverud, 2001]. However, permutation tests are computationally intensive. Churchill and Doerge [1994] suggested that at least 10,000 shuffles are needed to estimate a 0.01 threshold and 1,000 shuffles to estimate a 0.05 threshold.

If the number of independent tests can be correctly inferred, we can still use the standard Bonferroni correction to rapidly adjust for multiple testing. Based on this idea, several researchers have tried to derive the effective number of independent tests, $M_{\mathrm{eff}}$ [Cheverud, 2001; Nyholt, 2004; Li and Ji, 2005]. Cheverud [2001] was the first to propose this idea for multiple testing correction and published a formula for calculating $M_{\mathrm{eff}}$ when SNP markers are correlated. However, Cheverud's $M_{\mathrm{eff}}$ is still overly conservative when there is high LD among SNPs [Li and Ji, 2005; Salyakina et al., 2005]. Nyholt [2005] suggested excluding all SNPs in perfect LD except one prior to using Cheverud's $M_{\mathrm{eff}}$ as a means to improve the adjustment accuracy, but this method remains overly conservative. Li and Ji [2005] proposed another $M_{\mathrm{eff}}$ formula and demonstrated its improvement over Cheverud's. However, Li and Ji's approach, partitioning eigenvalues into integer and fractional parts, is an intuitive solution, and it was tested only on a small number of SNPs (<15 for each gene) in their single-locus analyses [Li and Ji, 2005]. It is not clear how their method performs in relatively large SNP data sets (≥100 SNPs). With these limitations in mind, we have developed a new approach for estimating $M_{\mathrm{eff}}$, denoted $M_{\mathrm{eff\_G}}$, which improves on existing methods.

The first step in calculating $M_{\mathrm{eff}}$ for SNP data is constructing a correlation matrix, along with the corresponding eigenvalues, for the SNP loci. For example, Nyholt [2004] used LD correlation. However, a problem with calculating LD correlation is that the haplotype phase information is not usually available and needs to be derived. A common technique for inferring LD when the haplotype phase is unknown is to use the expectation-maximization algorithm under the assumption of Hardy-Weinberg equilibrium (HWE) [Excoffier and Slatkin, 1995]. The potential problem with this approach is that HWE may not hold when sample individuals are chosen based on phenotypes [Zaykin et al., 2006], and HWE can be distorted between cases and controls in regions of association [Nielsen et al., 1999; Wittke-Thompson et al., 2005]. Furthermore, this method requires the additional step of estimating haplotype frequencies, which may not be necessary if our goal is only to capture the correlation structure of SNPs. In contrast, the composite LD (CLD) correlation, which is calculated directly from SNP genotypes, describes the SNP correlation well and is simpler to calculate. Recently, Weir et al. [2004], Schaid [2004], Zaykin [2004], and Zaykin et al. [2006] showed that CLD can capture the relationships among SNPs comparably to gametic LD without requiring HWE.

With the above improvements in mind, we propose a new multiple testing correction, simpleM, which uses CLD to create the correlation matrix and $M_{\mathrm{eff\_G}}$ to calculate the effective number of independent tests. We then show that the new approach can successfully control the type I error rate based on both simulated and real data. Compared with either Bonferroni or Li and Ji's approach, the adjusted thresholds from simpleM are more accurate, i.e., closer to the permutation-based corrections. Moreover, the proposed method can also be used to address multiple testing correction issues in genome-wide association studies using correlated SNPs.

METHODS

NOTATION

For SNP markers, we consider only biallelic cases. Each SNP marker has two alleles and correspondingly three genotypes. For example, take two SNPs, A and B, which both have two alleles: $A$ and $a$ for SNP A, and $B$ and $b$ for SNP B. The three genotypes are $AA$, $Aa$ and $aa$ for SNP A, and similarly $BB$, $Bb$ and $bb$ for SNP B. For SNP A, the allele frequencies are denoted as $p_A$ and $p_a$ for the $A$ and $a$ alleles, respectively, and the genotype frequencies are denoted as $P_{AA}$, $P_{Aa}$ and $P_{aa}$ for the $AA$, $Aa$ and $aa$ genotypes, respectively. Similarly $p_B$, $p_b$, $P_{BB}$, $P_{Bb}$ and $P_{bb}$ are the respective frequencies for SNP B. $P_{AB}$ denotes the gametic frequency and $P_{A/B}$ denotes the non-gametic frequency between SNPs A and B. Li and Ji's $M_{\mathrm{eff}}$ and the proposed $M_{\mathrm{eff}}$ are denoted as $M_{\mathrm{eff\_L}}$ and $M_{\mathrm{eff\_G}}$, respectively. Keeping the EWER ($\alpha_e$) at a nominal significance level, the adjusted PWERs are denoted as $\alpha_L$ and $\alpha_G$, calculated using $M_{\mathrm{eff\_L}}$ and $M_{\mathrm{eff\_G}}$, respectively. The Bonferroni and the permutation-based point-wise correction thresholds are denoted as $\alpha_B$ and $\alpha_P$, respectively. $M$ is the total number of SNPs in the data set.

COMPOSITE LD

The CLD coefficient is defined as

$$\Delta_{AB} = P_{AB} + P_{A/B} - 2 p_A p_B = D_{AB} + D_{A/B},$$

where $D_{AB} = P_{AB} - p_A p_B$ and $D_{A/B} = P_{A/B} - p_A p_B$ [Weir, 1996].

The composite correlation is defined as

$$r_{AB} = \frac{\Delta_{AB}}{\sqrt{\left(p_A(1 - p_A) + D_A\right)\left(p_B(1 - p_B) + D_B\right)}},$$

where $D_A = P_{AA} - p_A^2$ and $D_B = P_{BB} - p_B^2$ [Weir, 1979, 2004]. The composite estimator works well in capturing the LD correlation among SNPs and it is robust to violations in HWE [Weir et al., 2004; Schaid, 2004]. It is also easier to compute than LD correlation (see the Appendix). The CLD correlation can be calculated in R using the cor() function [Team, 2007] when the SNP genotypes are numerically coded as 2, 1 and 0 for wild-type allele homozygotes, heterozygotes and variant-type allele homozygotes, respectively, which is shown in the Appendix.
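
To make the genotype-coding remark concrete, here is a minimal Python sketch of the same calculation: an ordinary Pearson correlation applied to genotypes coded 2, 1 and 0, which the text states yields the CLD correlation. The function names and the list-of-lists data layout are illustrative assumptions, not part of the paper, whose own route is the R function cor().

```python
import math


def genotype_correlation(snp_a, snp_b):
    """Pearson correlation between two equally long lists of 0/1/2 genotype codes."""
    n = len(snp_a)
    if n != len(snp_b) or n < 2:
        raise ValueError("need two genotype vectors of equal length >= 2")
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("monomorphic SNP: correlation is undefined")
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(snps):
    """Pairwise correlation matrix for a list of SNP genotype vectors."""
    m = len(snps)
    return [[1.0 if i == j else genotype_correlation(snps[i], snps[j])
             for j in range(m)] for i in range(m)]
```

Each inner list holds one SNP's genotype codes across individuals, so a data set with M SNPs produces an M by M symmetric matrix with ones on the diagonal.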

Meff_G ESTIMATION

Principal component analysis (PCA) is a classical statistical approach for reducing dimensionality in multivariate analysis [Mardia et al., 1979]. The PCA approach has been applied to many recent genetic studies, such as haplotype tagging SNP selection [Meng et al., 2003; Lin and Altman, 2004] and correction for population stratification [Price et al., 2006]. It is a data-driven approach that allows researchers to consider all the SNPs simultaneously, which is ideal for inferring $M_{\mathrm{eff}}$ for a particular data set. For simpleM, we compute eigenvalues from the pair-wise SNP correlation matrix created with CLD and then derive $M_{\mathrm{eff\_G}}$ using PCA. Each eigenvalue can be interpreted as the amount of variance explained by the corresponding principal component. The eigenvalues, $\{\lambda_i\}$, are usually arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$, where $M$ is the number of SNPs in the data. Generally, a relatively small number of eigenvalues, $x$, contribute a high percentage of the sum of the variances for all of the components for correlated data. That is to say that only the first $x$ eigenvalues are needed, $\sum_{i=1}^{x} \lambda_i \big/ \sum_{i=1}^{M} \lambda_i > C$, where the percentage cutoff, $C$, is determined by the researcher. There are rules of thumb for choosing this threshold in PCA [Mardia et al., 1979], and in line with these, we propose defining $x$ so that the corresponding eigenvalues explain 99.5% of the variation for SNP data and $M_{\mathrm{eff\_G}} = x$. It should be noted, however, that too large or too small a value of $C$ may cause $M_{\mathrm{eff\_G}}$ to be either overly conservative or overly liberal.
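
The cutoff rule can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the eigenvalues have already been computed; the function name `meff_g` and the strict inequality at the cutoff follow our reading of the inequality in the text.

```python
def meff_g(eigenvalues, cutoff=0.995):
    """Smallest x whose x largest eigenvalues explain more than `cutoff` of the total."""
    if not eigenvalues:
        raise ValueError("eigenvalues must be non-empty")
    ordered = sorted(eigenvalues, reverse=True)   # descending order
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for x, value in enumerate(ordered, start=1):
        running += value
        if running / total > cutoff:
            return x
    return len(ordered)
```

Two limiting cases show the intended behavior: M uncorrelated SNPs give M equal eigenvalues and an effective number of M, whereas M perfectly correlated SNPs give a single non-zero eigenvalue and an effective number of 1.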

THE simpleM METHOD

The simpleM method involves four steps:

Step 1: Derive the CLD correlation matrix from the SNP data set. This can be done using the cor() function in R (see the Appendix).

Step 2: Calculate the eigenvalues, for example, by the R function eigen().

Step 3: Infer $M_{\mathrm{eff\_G}}$ through PCA to estimate the effective number of independent tests (see the Meff_G ESTIMATION section).

Step 4: Apply the Bonferroni correction formula to calculate the adjusted point-wise significance level as $\alpha_G = \alpha_e / M_{\mathrm{eff\_G}}$.
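
The four steps can be strung together as in the following Python sketch. It is not the authors' implementation: the paper uses R's cor() and eigen(), whereas this sketch substitutes a plain Pearson correlation and a textbook cyclic Jacobi eigenvalue routine so that it runs with the standard library alone. All function names and the data layout (one list of 0/1/2 codes per SNP) are assumptions made for illustration.

```python
import math


def correlation_matrix(snps):
    """Step 1: Pearson correlations between SNP genotype vectors coded 0/1/2."""
    n = len(snps[0])
    scaled = []
    for snp in snps:
        mean = sum(snp) / n
        dev = [g - mean for g in snp]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])  # assumes no monomorphic SNP
    return [[sum(x * y for x, y in zip(u, v)) for v in scaled] for u in scaled]


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=100):
    """Step 2: eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def meff_g(eigenvalues, cutoff=0.995):
    """Step 3: number of leading eigenvalues needed to explain more than `cutoff`."""
    total = sum(eigenvalues)
    running = 0.0
    for x, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total > cutoff:
            return x
    return len(eigenvalues)


def simple_m_threshold(snps, alpha_e=0.05, cutoff=0.995):
    """Step 4: Bonferroni-style threshold alpha_e / Meff_G; returns (Meff_G, alpha_G)."""
    eigenvalues = jacobi_eigenvalues(correlation_matrix(snps))
    m_eff = meff_g(eigenvalues, cutoff)
    return m_eff, alpha_e / m_eff
```

For example, three SNPs of which two are exact duplicates yield an effective number of 2, so a nominal level of 0.05 gives a point-wise threshold of 0.025 instead of the Bonferroni value 0.05/3. For realistic numbers of SNPs a dedicated linear algebra library would be a more sensible choice than the Jacobi routine shown here.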


PERMUTATION TESTS

All adjusted thresholds were validated by permutation tests. Because we wanted to test the validity of our algorithm at significance levels of both $\alpha_e = 0.05$ and 0.01, we performed 100,000 permutations on our data sets. In each permutation shuffle, half of the individuals were randomly assigned as cases and the other half were assigned as controls in the balanced data sets we simulated. For each permuted case-control sample, Armitage's trend test [Armitage, 1955; Sasieni, 1997] was used to test association for each SNP. Thus, a total of $M$ test statistics and their corresponding P-values were calculated for each permutation repeat and the smallest P-value was recorded. The smallest P-values were then arranged in ascending order and the fifth percentile was taken as the permutation-based empirical experiment-wise critical value for the overall 0.05 type I error rate. Similarly, the first percentile was taken as the threshold for the overall 0.01 type I error rate.
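
The final step of this procedure, reading the threshold off the sorted minimum P-values, can be sketched as follows. The function name is hypothetical, and the exact percentile convention (taking the k-th smallest value with k equal to the nominal level times the number of shuffles) is our assumption, since the text does not spell it out.

```python
def permutation_threshold(min_p_values, alpha_e):
    """Empirical alpha_e quantile of the per-shuffle minimum P-values."""
    if not min_p_values:
        raise ValueError("need at least one permutation")
    ordered = sorted(min_p_values)                    # ascending order
    k = max(int(round(alpha_e * len(ordered))), 1)    # rank of the percentile
    return ordered[k - 1]
```

With 100,000 shuffles and a nominal level of 0.05 this returns the 5,000th smallest minimum P-value; with a nominal level of 0.01 it returns the 1,000th smallest.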

SIMULATION DATA

Four simulation studies were designed to study the performance of simpleM. Our simulation was similar to Rinaldo et al. [2005] except that the simulated regions were larger. To be more specific, we used Wall and Pritchard's simulation program [Wall and Pritchard, 2003], which is a variation of Hudson's MS program [Hudson, 2002], to generate recombination cold regions and hot spots/hot regions.

In simulation 1, eight cold regions (10 kb each) were separated by hot spots (1 kb each) giving recombination cold regions interlaced with hot spots. The mutation rate was $\theta = 4N_e\mu$, where $N_e$ is the effective population size, set to be 10,000, and $\mu$ is the mutation rate per basepair, per generation, set to be $1.4 \times 10^{-8}$. The recombination rate was $\rho = 4N_e\delta$, where $\delta = 2.5 \times 10^{-8}$ is the recombination rate per basepair, per generation. Values for $\mu$ and $\delta$ were chosen because they yield results similar to the empirical data in the SeattleSNP database [Rinaldo et al., 2005]. The corresponding scaled recombination rate for the entire simulated region was the product of $\rho$ and the length of the region in basepairs. The per basepair recombination rate in hot spots was chosen to be 100 times greater than in cold regions.
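
The scaled rates implied by these parameter values are easy to work out. The short Python sketch below does so under the assumption that the stated per-basepair recombination rate applies to the cold regions; the names are ours.

```python
def scaled_rate(effective_size, per_bp_rate):
    """Population-scaled rate 4 * Ne * rate, per basepair."""
    return 4 * effective_size * per_bp_rate


def region_rate(per_bp_scaled_rate, length_bp):
    """Scaled rate for a region: per-basepair rate times length in basepairs."""
    return per_bp_scaled_rate * length_bp


NE = 10_000            # effective population size from the text
MU = 1.4e-8            # mutation rate per basepair, per generation
DELTA_COLD = 2.5e-8    # recombination rate per basepair (assumed: cold regions)

theta_per_bp = scaled_rate(NE, MU)            # about 5.6e-4
rho_per_bp = scaled_rate(NE, DELTA_COLD)      # about 1.0e-3
rho_hot_spot_per_bp = 100 * rho_per_bp        # hot spots: 100 times the cold rate
```

Under these values a single 10 kb cold region has a scaled recombination rate of about 10, obtained as `region_rate(rho_per_bp, 10_000)`.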

In contrast to simulation 1, simulation 2 had four cold regions (10 kb each) separated by hot regions (15 kb each); thus, cold regions are interlaced with hot regions. The recombination rate, $\delta$, was chosen to be $9 \times 10^{-8}$/bp for the hot regions. Again, the population parameters were chosen because they generate LD patterns similar to that observed in the SeattleSNP database [Rinaldo et al., 2005].

For simulations 1 and 2, 100 SNP data sets were generated, each with 400 individuals (200 cases vs. 200 controls, randomly assigned in the permutation tests). We used only common SNPs in each data set, which had a minor allele frequency >0.10. The resulting number of SNPs ranged from approximately 70 to 140, with about 100 SNPs per simulation on average.

Simulations 3 and 4 were identical to simulations 1 and 2, except the number of individuals simulated increased to 1,000 (500 cases vs. 500 controls, randomly assigned in the permutation tests) to address how sample size affects simpleM's performance.

REAL DATA

To evaluate simpleM with real data, we used a partial SNP data set from an Alzheimer whole-genome association project. We randomly chose 1,723 SNPs spanning a region of 8 Mb on chromosome 22. Five hundred unrelated unaffected individuals were used. Less than 1% of the values were missing for each SNP and for each individual. The minor allele frequency for each SNP was >0.05. The total missing value rate for this data set was 0.065%.

RESULTS

SIMULATION RESULTS

For simulations 1 and 2, the CLD correlation matrices were calculated for each simulated SNP data set, as well as the eigenvalues, and $\alpha_L$ and $\alpha_G$. The permutation-based correction threshold, $\alpha_P$, was derived using 100,000 random shuffles to serve as the true cutoff. The Bonferroni correction, $\alpha_B$, was also calculated for comparison purposes. The results for the 100 simulations are plotted in Figure 1, (a) and (b) for $\alpha_e = 0.05$ and (c) and (d) for $\alpha_e = 0.01$. For each simulation data set, the adjusted PWER for each method was plotted in separate colors: black, red, blue and purple, and marked with different letters: B, P, G and L in the figure to denote the estimated $\alpha_B$, $\alpha_P$, $\alpha_G$ and $\alpha_L$, respectively. The data sets were sorted by number of SNPs and arranged in ascending order. Finally, all the points (adjusted PWER thresholds) for each method were connected to aid visualization. In all plots, the Bonferroni correction cutoff is too conservative relative to the permutation-based correction threshold. $\alpha_G$ gives the adjusted PWER closest to the permutation-based threshold, almost overlapping it, while $\alpha_L$ is too liberal. It appears that $\alpha_L$ is sensitive to whether cold regions are interspersed with hot spots (Fig. 1(a)) or with hot regions (Fig. 1(b)), and similarly for Figure 1(c) vs. Figure 1(d).

The adjusted PWERs from simulations 3 and 4 are plotted in Figure 2 in the same way as the results from simulations 1 and 2. Our metric, $\alpha_G$, continued to be the closest adjustment to $\alpha_P$. The general trends between simulation 3 and 1 and between simulation 4 and 2 are in agreement with each other, which indicates that the simpleM method is not sensitive to the sample size used in these examples. Again, we observed that $\alpha_L$ is sensitive to whether cold regions are interlaced with hot regions or with hot spots. Both Figures 1 and 2 show that $\alpha_L$ is sensitive to the underlying LD structure.

The results from simulations 1, 2, 3 and 4 show that $\alpha_G$ can give a multiple testing adjustment that nearly overlaps $\alpha_P$. Furthermore, the simpleM method tremendously reduces the computing time for each data set compared with the permutation method. Calculating $\alpha_P$ required over 3 hr per data set (1,000 individuals) using 100,000 permutation shuffles, whereas calculating $\alpha_G$ only required about 0.1 sec on our desktop computer (Intel Core2 2.4G CPU with 2 GB memory), which is at least 100,000 times faster. Increasing the number of SNPs in the data sets makes the differences in speed even more dramatic. For example, a 100,000 shuffle permutation test on a data set of 1,000 individuals with 1,000 SNPs takes more than a day to finish, while it takes only about a second for the simpleM method to derive the adjustment threshold. Assuming the computation time for the permutation is proportional to the number of SNPs, it will then consume over 12 days, 4 months, 20 months and 3 years for 10K, 100K, 500K and 1M SNP data sets, respectively, with 1,000 individuals using 100,000 shuffles on a single PC, while the simpleM method could finish any of these calculations in less than 1 hr.

EXPERIMENTAL DATA RESULTS

We used an Alzheimer whole-genome association SNP data set to both validate the simpleM method on real data and show how it can be applied to whole-genome analysis. In the presence of a large number of SNPs, it is challenging to calculate eigenvalues efficiently and effectively. Therefore, we partitioned the large SNP data set into 13 small sets of up to 133 SNPs each. Set 1 consisted of SNPs 1–133, set 2 consisted of SNPs 134–266, and so on, with set 13 consisting of SNPs 1,597–1,723. Because PCA requires complete data matrices, we filled the missing data in the Alzheimer SNP data using the K nearest-neighbor algorithm [Hastie et al., 2001]. We then applied the simpleM method to each set and got the following series of $M_{\mathrm{eff\_G}}$: 95 (133), 101 (133), 92 (133), 90 (133), 91 (133), 92 (133), 68 (133), 71 (133), 90 (133), 85 (133), 85 (133), 89 (133) and 83 (127), where the number outside the parentheses is the adjusted effective number of independent SNPs and the number within the parentheses is the original number of SNPs in the set. Thus, for the entire set of 1,723 SNPs, $M_{\mathrm{eff\_G}}$ suggests that there are 1,132 independent SNPs, while $M_{\mathrm{eff\_L}}$ gives 837. To compare the quality of our method to the permutation critical value, we calculated the permutation test threshold with 100,000 shuffles. If we set the nominal significance level to be 0.05, the derived permutation-based PWER is $\alpha_P = 4.58 \times 10^{-5}$, the adjusted PWER thresholds are $\alpha_G = 0.05/1132 = 4.42 \times 10^{-5}$ and $\alpha_L = 0.05/837 = 5.97 \times 10^{-5}$, and the Bonferroni correction is $\alpha_B = 0.05/1723 = 2.90 \times 10^{-5}$. In this case $\alpha_G$ is very close to $\alpha_P$, while the Bonferroni correction is much more conservative and Li and Ji's method is too liberal. If we set the nominal significance level to be 0.01, then $\alpha_P = 9.01 \times 10^{-6}$, $\alpha_G = 0.01/1132 = 8.83 \times 10^{-6}$ and $\alpha_L = 0.01/837 = 1.19 \times 10^{-5}$, and the Bonferroni correction is $\alpha_B = 0.01/1723 = 5.80 \times 10^{-6}$. Again, $\alpha_G$ is very close to $\alpha_P$, while $\alpha_B$ is too conservative and $\alpha_L$ is too liberal.
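
The block-wise bookkeeping behind these numbers can be reproduced with a short Python sketch. The per-block values and the totals 837 and 1,723 are taken from the text; the helper name `fixed_blocks` is ours, and the sketch only re-derives the reported arithmetic rather than recomputing any eigenvalues.

```python
def fixed_blocks(n_snps, block_size):
    """Sizes of consecutive blocks of `block_size` SNPs; the last may be smaller."""
    full, rest = divmod(n_snps, block_size)
    return [block_size] * full + ([rest] if rest else [])


# Per-block Meff_G values reported in the text for the 13 fixed-size blocks.
MEFF_G_BLOCKS = [95, 101, 92, 90, 91, 92, 68, 71, 90, 85, 85, 89, 83]
MEFF_L_TOTAL = 837     # Li and Ji's total as reported in the text

sizes = fixed_blocks(1723, 133)        # block sizes for 1,723 SNPs
meff_g_total = sum(MEFF_G_BLOCKS)      # block-wise values are summed

for alpha_e in (0.05, 0.01):
    print(alpha_e,
          "alpha_G = %.2e" % (alpha_e / meff_g_total),
          "alpha_L = %.2e" % (alpha_e / MEFF_L_TOTAL),
          "alpha_B = %.2e" % (alpha_e / sum(sizes)))
```

Summing the per-block values treats SNPs in different blocks as independent of one another, which is the convention the text itself uses to arrive at 1,132.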

Because data with large numbers of SNPs need to<br />

be divided into sets <strong>for</strong> PCA, we investigated the use<br />

Multiple Testing Correction Method<br />

Fig. 1. Adjusted PW<strong>ER</strong> thresholds comparison <strong>for</strong> simulation 1 and 2 (400 individuals, 200 cases vs. 200 controls). The adjusted PW<strong>ER</strong><br />

thresholds <strong>for</strong> Bonferroni, permutation, Li and Ji’s approach, and the proposed <strong>method</strong> are marked by black, red, purple and blue,<br />

respectively. The data sets are sorted in order of ascending number of SNPs. Data points are connected to aid visualization. (a)<br />

corresponds to simulation 1 with the EW<strong>ER</strong> equal to 0.05. SNPs were generated as recombination cold regions interlaced with hot spots.<br />

Four hundred individuals were simulated. (b) corresponds to simulation 2 with EW<strong>ER</strong> 5 0.05. SNPs were generated as recombination<br />

cold regions interlaced with hot regions. Four hundred individuals were simulated. (c) and (d) are duplicates of (a) and (b), respectively,<br />

except EW<strong>ER</strong> is equal to 0.01. PW<strong>ER</strong>, point-wise error rate; SNPs, single nucleotide polymorphisms; EW<strong>ER</strong>, experiment-wise error rate.<br />

365<br />

of alternative <strong>method</strong>s <strong>for</strong> choosing block sizes. The<br />

simplest <strong>method</strong> was to define a fixed size <strong>for</strong> the<br />

blocks, as seen in the preceding results. We also used<br />

the Haploview software [Barrett et al., 2005] and<br />

Gabriel et al.’s definition on haplotype blocks<br />

[Gabriel et al., 2002] because they take advantage<br />

of the LD structure among SNPs. Based on the<br />

haplotype block output from the Haploview software,<br />

we divided the SNP data into 13 blocks, each<br />

with about 100–140 SNPs (cutting at block boundaries),<br />

and then applied our simpleM <strong>method</strong> to<br />

each block. This gave us 102 (141), 103 (141), 98 (146),<br />

84 (119), 94 (140), 86 (128), 65 (133), 57 (110), 96 (142),<br />

92 (142), 88 (138), 88 (132) and 71 (111), where the number outside the parentheses is the adjusted effective number of independent SNPs and the number within parentheses is the original number of SNPs in the block. The sum of the inferred Meff-G values for each block was 1,124 for the whole data set, while Meff-L gave 818. For a nominal significance level of 0.05, the adjusted PWER threshold is $\alpha_G = 0.05/1124 = 4.45 \times 10^{-5}$ and $\alpha_L = 0.05/818 = 6.11 \times 10^{-5}$ (compared to $\alpha_P = 4.58 \times 10^{-5}$ and, for the fixed length blocks, $\alpha_G = 0.05/1132 = 4.42 \times 10^{-5}$ and $\alpha_L = 0.05/837 = 5.97 \times 10^{-5}$). If we set the nominal significance level to be 0.01, then $\alpha_G = 0.01/1124 = 8.90 \times 10^{-6}$ and $\alpha_L = 0.01/818 = 1.22 \times 10^{-5}$ (compared to $\alpha_P = 9.01 \times 10^{-6}$ and, for the fixed length blocks, $\alpha_G = 0.01/1132 = 8.83 \times 10^{-6}$ and $\alpha_L = 0.01/837 = 1.19 \times 10^{-5}$). With variable block sizes, $\alpha_G$ improved slightly over that from the fixed length partition. This improvement, however, is mitigated by the fact that Haploview assumes HWE in the estimation of gamete frequencies.

Fig. 2. Adjusted PWER thresholds comparison for simulations 3 and 4 (1,000 individuals, 500 cases vs. 500 controls). The adjusted PWER thresholds for Bonferroni, permutation, Li and Ji’s approach, and the proposed method are marked by black, red, purple and blue, respectively. The data sets are sorted in order of ascending number of SNPs. Data points are connected to aid visualization. (a) corresponds to simulation 3 with the EWER equal to 0.05. SNPs were generated as recombination cold regions interlaced with hot spots. One thousand individuals were simulated. (b) corresponds to simulation 4 with EWER = 0.05. SNPs were generated as recombination cold regions interlaced with hot regions. One thousand individuals were simulated. (c) and (d) are duplicates of (a) and (b), respectively, except EWER is equal to 0.01. PWER, point-wise error rate; SNPs, single nucleotide polymorphisms; EWER, experiment-wise error rate.
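The adjusted PWER thresholds quoted for the Alzheimer data all come from dividing the nominal EWER by an effective number of independent tests. The following is a minimal sketch of that arithmetic in Python, used here only for illustration; the function name `adjusted_pwer` and the dictionary labels are hypothetical, while the Meff values are the ones reported in the text.

```python
def adjusted_pwer(ewer, meff):
    """Point-wise threshold: nominal EWER divided by the effective number of tests."""
    return ewer / meff


# Effective numbers of independent tests reported in the text.
meff_values = {
    "simpleM, variable-size blocks": 1124,
    "Li and Ji, variable-size blocks": 818,
    "simpleM, fixed-length blocks": 1132,
    "Li and Ji, fixed-length blocks": 837,
}

for ewer in (0.05, 0.01):
    for label, meff in meff_values.items():
        print(f"EWER={ewer}: {label}: {adjusted_pwer(ewer, meff):.2e}")
```

A larger Meff gives a smaller, more stringent point-wise threshold, which is why Meff-L (818) yields a more liberal cutoff than Meff-G (1,124).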

DISCUSSION

In our simpleM approach, we use CLD correlation instead of LD correlation. The advantages that CLD has over LD correlation are that CLD does not require HWE and that the calculation is simpler and faster. The correlation structure among SNPs can be derived from their genotypes directly and no haplotype frequency estimation is necessary. Cheverud’s Meff may not capture the correlation among SNPs well [Salyakina et al., 2005; Li and Ji, 2005]. To improve the Meff, Nyholt suggested removing all SNPs in perfect correlation except one from the data set [Nyholt, 2005]. Although this remedy may be effective on some small SNP data sets, it shows that Cheverud’s Meff does not adjust effectively in many situations. In contrast, Meff-L showed a significant improvement over Cheverud’s Meff-C [Li and Ji, 2005]. We also evaluated Cheverud’s Meff in our adjustment comparisons, but it did not offer much improvement over the Bonferroni correction on our data (results not shown). Therefore, we did not include it in our comparison. Another multiple testing method that we did not consider is the false discovery rate (FDR) approach. FDR is commonly used in microarray data analysis, where studies involve a large number of true alternative hypotheses (differentially expressed genes). However, in genetic association studies, most of the hypotheses are null (SNPs not associated with the disease). Moreover, FDR assumes that the P-values corresponding to true null hypothesis tests are independent and uniformly distributed or can be considered as approximately independent [Benjamini and Hochberg, 1995; Storey et al., 2004], which is likely to be violated when there is high LD among SNPs in genetic association studies. Therefore, we compared our method only to the permutation method, which is considered the gold standard in multiple testing correction, the Bonferroni correction and Li and Ji’s approach. Among all the adjustment methods considered, the simpleM method gave the best approximation to the permutation-based correction threshold using either the simulated or the real data set in the presence of high intermarker LD correlations. In the extreme case, if the SNPs are nearly independent, there should not be much difference among these adjustment methods.

There are two possible ways to compare different multiple testing methods. First, fix the EWER at a nominal value and find the corresponding PWER threshold. Whichever adjustment is closest to the permutation-based PWER is considered the best, as we did in our comparison (see Figs. 1 and 2). Second, given the PWER, derive the corresponding EWER and then compare it to the nominal type I error rate. These two methods are equivalent, but calculated in opposite directions. While PWER is useful for determining the threshold for accepting or rejecting hypothesis tests, EWER can be useful for appreciating how significant small changes in PWER are. For example, with the Alzheimer SNP data we calculated $\alpha_P = 4.58 \times 10^{-5}$, which corresponds to the permutation-based EWER of 0.05. We then approximate the "true" effective number of independent tests with Meff-P, using the formula Meff-P $= 0.05/(4.58 \times 10^{-5}) = 1{,}092$. From the same data set we calculated $\alpha_G = 4.42 \times 10^{-5}$, $\alpha_L = 5.97 \times 10^{-5}$ and $\alpha_B = 2.90 \times 10^{-5}$. We can now derive the EWERs by multiplying Meff-P by the various values for $\alpha$, giving us 0.048, 0.065 and 0.032 for the simpleM, Li and Ji’s method and Bonferroni approach, respectively. Thus, the small differences in the PWERs resulted in rather large changes in EWERs. Since the difference between 0.05 and the EWER for $\alpha_G$ is the smallest of the three methods, we conclude that it is the most accurate.
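The second comparison route (PWER to EWER) can be reproduced in a few lines. This is a minimal sketch in Python using only the thresholds quoted in the paragraph above; the variable names are hypothetical and chosen for illustration.

```python
NOMINAL_EWER = 0.05
alpha_p = 4.58e-5  # permutation-based PWER threshold quoted in the text

# "True" effective number of independent tests implied by the permutation
# threshold, rounded to a whole number of tests as in the text.
meff_p = round(NOMINAL_EWER / alpha_p)

# PWER thresholds quoted in the text for the three adjustment methods.
pwer = {"simpleM": 4.42e-5, "Li and Ji": 5.97e-5, "Bonferroni": 2.90e-5}

# Implied experiment-wise error rate for each method.
ewer = {name: meff_p * alpha for name, alpha in pwer.items()}

# Method whose implied EWER is closest to the nominal level.
closest = min(ewer, key=lambda name: abs(ewer[name] - NOMINAL_EWER))

for name, value in ewer.items():
    print(f"{name}: EWER = {value:.3f}")
print("closest to nominal:", closest)
```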

In analyzing SNP markers, several tests have been proposed, such as the $\chi^2$ test, the allelic-based test and Armitage’s trend test. There are several suggestions for which test procedure should be used [Sasieni, 1997; Schaid and Jacobsen, 1999; Deng, 2000; Knapp, 2001; Zou, 2006]. Here, we adopted the suggestion by Sasieni [1997] to analyze by genotypes and used Armitage’s trend test. Our Meff estimation should also apply to the $\chi^2$ test based on alleles. However, given the relationship $\chi^2_G = \chi^2_a/(1 + \hat{f})$, where $\chi^2_a$ is the allele-based $\chi^2$ test statistic, $\chi^2_G$ is the trend test statistic and $\hat{f}$ is the estimated inbreeding coefficient [Sasieni, 1997; Zou, 2006], the permutation-based correction threshold, $\alpha_P$, may vary slightly with the test employed because $\hat{f}$ is unlikely to be equal from locus to locus. The $\alpha_G$ calculated here is only an approximation of the permutation-based correction thresholds and not a replacement. We should be aware that the precision of the permutation-based critical value is associated with the number of shuffles used. For more precise estimates, a larger number of shuffles should be performed. For example, if the permutation critical value is set at 0.05, 10,000 shuffles gave a good permutation estimate on our data. However, it required 100,000 shuffles to get a relatively stable permutation estimate for a critical value of 0.01 in our tests.
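To make the role of the number of shuffles concrete, here is a hedged sketch of a permutation-based threshold in Python. It assumes a minimum-P-value scheme: case/control labels are shuffled, the smallest single-SNP P-value is recorded for each shuffle, and the threshold is the empirical EWER quantile of those minima. The paper does not spell out its implementation in this section, so the procedure, the function names `trend_p` and `permutation_threshold`, and the toy data are all assumptions. The trend test is computed through the standard identity $\chi^2 = n r^2$, with $r$ the correlation between genotype code and case status.

```python
import random
from math import erfc, sqrt


def trend_p(genotypes, status):
    """P-value of the Armitage trend test via chi2 = n * r^2 (1 df)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_s = sum(status) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_ss = sum((s - mean_s) ** 2 for s in status)
    if s_gg == 0 or s_ss == 0:
        return 1.0  # monomorphic SNP or single-class sample: no evidence
    s_gs = sum((g - mean_g) * (s - mean_s) for g, s in zip(genotypes, status))
    chi2 = n * s_gs * s_gs / (s_gg * s_ss)
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


def permutation_threshold(snps, status, ewer=0.05, shuffles=1000, seed=0):
    """Empirical EWER quantile of the minimum P-value over label shuffles."""
    rng = random.Random(seed)
    labels = list(status)
    min_ps = []
    for _ in range(shuffles):
        rng.shuffle(labels)
        min_ps.append(min(trend_p(g, labels) for g in snps))
    min_ps.sort()
    k = max(int(ewer * shuffles) - 1, 0)  # index of the quantile
    return min_ps[k]


# Toy data (hypothetical): 5 SNPs coded 0/1/2 for 20 cases (1) and 20 controls (0).
rng = random.Random(1)
status = [1] * 20 + [0] * 20
snps = [[rng.choice((0, 1, 2)) for _ in status] for _ in range(5)]
threshold = permutation_threshold(snps, status, ewer=0.05, shuffles=200)
print(threshold)
```

With an EWER of 0.01 the quantile rests on only the smallest 1% of the recorded minima, so far more shuffles are needed for a stable estimate, which is consistent with the 10,000 versus 100,000 shuffles reported above.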

There may be several limitations for the simpleM method. It is not uncommon to have missing data in SNP data sets. However, PCA requires complete data matrices; otherwise, the CLD correlation matrix may not be positive semi-definite. For the Alzheimer data, the missing value rate is only 0.065%. Before using PCA, we filled the missing values with inferred genotypes using the K nearest-neighbor method. Missing genotypes can also be filled in by re-genotyping. With new developments in genotyping technology and statistical imputation methods, a small amount of missing values is unlikely to hinder simpleM’s inference. Another problem with PCA is that it becomes inefficient with a large number of SNPs (>1,000). In such situations, the eigenvalue calculation is unable to produce enough non-zero eigenvalues for either Meff-G or Meff-L to work well. This is not surprising since it was pointed out by Schäfer and Strimmer [2005] that a growing number of zero eigenvalues will be observed in situations where there are a small number of samples and a large number of variables, i.e., the usual "small n and large p" hurdle. In practice, when using a large number of SNPs, the data set has to be divided into smaller blocks. We tested two methods for dividing data sets: a simple fixed block size and Haploview with Gabriel et al.’s definition of haplotype blocks. Blocks created with Haploview resulted in a slightly better adjusted cutoff than with fixed blocks. This improvement, however, is mitigated by the fact that Haploview requires HWE to estimate haplotype frequencies.

This study is concerned mainly with single-locus association tests. Here, we give some suggestions for two popular designs: candidate gene and genome-wide association studies in human genetics. In candidate gene association studies, we can analyze the SNPs all together if the eigenvalues can be derived. In situations where the high dimensionality prohibits the calculation of eigenvalues, we can analyze the SNPs on each chromosome separately or according to the gene functions and then sum all of the Meff values together. The total Meff can be used to calculate the adjusted PWER. In genome-wide association studies, we have to partition the SNPs into several parts and analyze them separately. Since SNPs on different chromosomes are expected to be in linkage equilibrium in general populations, the genome-wide effective number of independent tests can be obtained by summing the chromosome-specific Meff values. For each chromosome, we may use the partition-ligation approach by dividing the SNPs into several parts, and then sum the Meff values from each partition, similar to how we tested our Alzheimer SNP data set. The total Meff is used in the final adjustment calculation. Because interblock correlations are unlikely to be captured in this partition-ligation approach, the total Meff may be slightly conservative. However, the interblock correlations may be reduced if we partition SNPs according to their haplotype block structure.
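The partition-and-sum strategy described above can be outlined as follows. This is only a sketch in Python under stated assumptions: the helper names `fixed_blocks` and `total_meff` are hypothetical, and the per-block estimator is passed in as a function because the eigenvalue-based Meff calculation itself is not reproduced here. The stand-in estimator `bonferroni_meff` simply counts the SNPs in a block, which treats every marker as independent and therefore reduces the result to the Bonferroni correction.

```python
def fixed_blocks(n_snps, block_size):
    """Half-open (start, end) index ranges that cover n_snps markers."""
    return [
        (start, min(start + block_size, n_snps))
        for start in range(0, n_snps, block_size)
    ]


def total_meff(blocks, meff_fn):
    """Sum the per-block effective numbers of independent tests."""
    return sum(meff_fn(start, end) for start, end in blocks)


def bonferroni_meff(start, end):
    # Placeholder estimator (assumption): every SNP in the block counts as
    # one independent test. A real analysis would plug in an
    # eigenvalue-based estimate for the block instead.
    return end - start


blocks = fixed_blocks(n_snps=10, block_size=4)
meff = total_meff(blocks, bonferroni_meff)
threshold = 0.05 / meff  # adjusted PWER from the summed Meff
print(blocks, meff, threshold)
```

Keeping the estimator pluggable makes it straightforward to swap fixed-length blocks for haplotype-based blocks, or to sum over chromosomes, without changing the final adjustment step.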

In summary, the simpleM algorithm provides a highly accurate approximation to the permutation-based correction threshold and is easily implemented. It is shown to be simple, fast and more accurate than recently developed methods and is comparable to the permutation-based correction threshold using both simulated and real SNP data. The efficiency and accuracy of the simpleM method make it an attractive choice for multiple testing adjustment when there is high intermarker LD in the SNP data set, as in candidate gene or genome-wide association studies.

ACKNOWLEDGMENTS

This work was supported in part by NIH grants NS39764, AG019757 and AG20135 and NIEHS T32 ES007126. We thank Dr. Gary Beecham, who prepared the Alzheimer data for us. We thank Dr. Richard Morris for initial inspiration.

REFERENCES

Armitage P. 1955. Tests for linear trends in proportions and frequencies. Biometrics 11:375–386.

Barrett JC, Fry B, Maller J, Daly MJ. 2005. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265.

Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300.

Bonferroni CE. 1935. Il calcolo delle assicurazioni su gruppi di teste, chapter "Studi in Onore del Professore Salvatore Ortu Carboni". Rome. p 13–60.

Bonferroni CE. 1936. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8:3–62.

Cheverud JM. 2001. A simple correction for multiple comparisons in interval mapping genome scans. Heredity 87:52–58.

Churchill GA, Doerge RW. 1994. Empirical threshold values for quantitative trait mapping. Genetics 138:963–971.

Deng HW. 2000. Re: "biased tests of association: comparisons of allele frequencies when departing from Hardy-Weinberg proportions". Am J Epidemiol 151:335–336.

Excoffier L, Slatkin M. 1995. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921–927.

Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D. 2002. The structure of haplotype blocks in the human genome. Science 296:2225–2229.

Hastie T, Tibshirani R, Friedman J. 2001. The Elements of Statistical Learning. Berlin: Springer.

Hoh J, Wille A, Ott J. 2001. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11:2115–2119.

Hudson RR. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

Knapp M. 2001. Re: "biased tests of association: comparisons of allele frequencies when departing from Hardy-Weinberg proportions". Am J Epidemiol 154:287–288.

Li J, Ji L. 2005. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95:221–227.

Lin Z, Altman RB. 2004. Finding haplotype tagging SNPs by use of principal components analysis. Am J Hum Genet 75:850–861.

Mardia KV, Kent JT, Bibby JM. 1979. Multivariate Analysis. London: Academic Press.

Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG. 2003. Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 73:115–130.

Nielsen DM, Ehm MG, Weir BS. 1999. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am J Hum Genet 63:1531–1540.

Nyholt DR. 2004. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet 74:765–769.

Nyholt DR. 2005. Evaluation of Nyholt’s procedure for multiple testing correction - author’s reply. Hum Hered 60:61–62.

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909.

Rinaldo A, Bacanu SA, Devlin B, Sonpar V, Wasserman L, Roeder K. 2005. Characterization of multilocus linkage disequilibrium. Genet Epidemiol 28:193–206.

Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases. Science 273:1516–1517.

Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69:138–147.

Salyakina D, Seaman SR, Browning BL, Dudbridge F, Muller-Myhsok B. 2005. Evaluation of Nyholt’s procedure for multiple testing correction. Hum Hered 60:19–25.

Sasieni PD. 1997. From genotypes to genes: doubling the sample size. Biometrics 53:1253–1261.

Schäfer J, Strimmer K. 2005. A shrinkage approach to large scale covariance-matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4:32.

Schaid DJ. 2004. Linkage disequilibrium testing when linkage phase is unknown. Genetics 166:505–512.

Schaid DJ, Jacobsen SJ. 1999. Biased tests of association: comparisons of allele frequencies when departing from Hardy-Weinberg proportions. Am J Epidemiol 149:706–711.

Storey JD, Taylor JE, Siegmund D. 2004. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc B 66:187–205.

R Development Core Team. 2007. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, ISBN 3-900051-07-0.

Šidák Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc 62:626–633.

Wall JD, Pritchard JK. 2003. Assessing the performance of the haplotype block model of linkage disequilibrium. Am J Hum Genet 73:502–515.

Weir BS. 1979. Inferences about linkage disequilibrium. Biometrics 35:235–254.

Weir BS. 1996. Genetic Data Analysis, vol. II. Sunderland, MA: Sinauer Associates Inc.

Weir BS, Hill WG, Cardon LR. 2004. Allelic association patterns for a dense SNP map. Genet Epidemiol 27:442–450.

Wittke-Thompson JK, Pluzhnikov A, Cox NJ. 2005. Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 76:967–986.

Zaykin DV. 2004. Bounds and normalization of the composite linkage disequilibrium coefficient. Genet Epidemiol 27:252–257.

Zaykin DV, Meng Z, Ehm MG. 2006. Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet 78:737–746.

Zou GY. 2006. Statistical methods for the analysis of genetic association studies. Ann Hum Genet 70:262–276.

APPENDIX

CALCULATING PAIR-WISE CLD CORRELATION FOR BIALLELIC SNPS

We first transform the genotypes into numerical coding as

\[
\text{numerical coding} =
\begin{cases}
0 & \text{if the genotype is variant type allele homozygote;} \\
1 & \text{if the genotype is heterozygote;} \\
2 & \text{if the genotype is wild type allele homozygote.}
\end{cases}
\]

A pair of SNPs, A and B, are represented by two vectors, $x$ and $y$. Denote the number of individuals as $n$, and let $n_{uv}$ be the number of individuals who carry the $uv$ genotypes.

The covariance between $x$ and $y$ is

\[
\begin{aligned}
\operatorname{cov}(x, y) &= \frac{1}{n}\sum x_i y_i - \frac{1}{n^2}\sum x_i \sum y_i \\
&= \frac{1}{n}\left(2n_{AaBB} + n_{AaBb} + 4n_{AABB} + 2n_{AABb}\right) - \frac{1}{n^2}\left(n_{Aa} + 2n_{AA}\right)\left(n_{Bb} + 2n_{BB}\right),
\end{aligned}
\]

and the CLD coefficient is

\[
\begin{aligned}
\Delta_{AB} &= P_{AB} + P_{A/B} - 2p_A p_B \\
&= 2P^{AB}_{AB} + P^{AB}_{Ab} + P^{AB}_{aB} + \frac{1}{2}\left(P^{AB}_{ab} + P^{Ab}_{aB}\right) - 2p_A p_B \\
&= \frac{2n_{AABB}}{n} + \frac{n_{AABb}}{n} + \frac{n_{AaBB}}{n} + \frac{1}{2}\,\frac{n_{AaBb}}{n} - 2\,\frac{2n_{AA} + n_{Aa}}{2n}\cdot\frac{2n_{BB} + n_{Bb}}{2n} \\
&= \frac{1}{2n}\left(4n_{AABB} + 2n_{AABb} + 2n_{AaBB} + n_{AaBb}\right) - \frac{1}{2n^2}\left(2n_{AA} + n_{Aa}\right)\left(2n_{BB} + n_{Bb}\right).
\end{aligned}
\]

Therefore, $\operatorname{cov}(x, y) = 2\Delta_{AB}$.

We know that $p_A = (2n_{AA} + n_{Aa})/2n$ and $p_B = (2n_{BB} + n_{Bb})/2n$, and then the variance of $x$ is

\[
\begin{aligned}
\operatorname{var}(x) &= \frac{1}{n}\sum x_i^2 - \left(\frac{\sum x_i}{n}\right)^2 \\
&= \frac{1}{n}\left(n_{Aa} + 4n_{AA}\right) - \left(\frac{n_{Aa} + 2n_{AA}}{n}\right)^2 \\
&= \frac{n_{Aa} + 2n_{AA}}{n} + \frac{2n_{AA}}{n} - \left(\frac{n_{Aa} + 2n_{AA}}{n}\right)^2 \\
&= 2p_A + 2P_{AA} - 4p_A^2 \\
&= 2\left[p_A(1 - p_A) + P_{AA} - p_A^2\right] = 2\left[p_A(1 - p_A) + D_A\right].
\end{aligned}
\]

Similarly, the variance of $y$ is

\[
\operatorname{var}(y) = 2\left[p_B(1 - p_B) + D_B\right].
\]

Therefore, the CLD correlation given by Weir [1979, 2004], as

\[
r_{AB} = \frac{\Delta_{AB}}{\sqrt{\left(p_A(1 - p_A) + D_A\right)\left(p_B(1 - p_B) + D_B\right)}},
\]

can be computed from the 0, 1 and 2 genotype numerical coding, as correlation

\[
r_{AB} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\operatorname{var}(y)}}.
\]

The CLD correlation can be calculated simply by using the R function cor().
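The equivalence derived in this appendix can be checked numerically. The sketch below is written in Python rather than R purely for illustration; the function names `pearson` and `cld_correlation` and the toy genotype vectors are hypothetical. It computes the CLD correlation once from genotype counts, following the count-based formulas above, and once as the ordinary Pearson correlation of the 0/1/2 codes, which is the quantity R's `cor()` returns for two numeric vectors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    return cov / sqrt(var_x * var_y)


def cld_correlation(x, y):
    """CLD correlation from two-locus genotype counts (codes 0, 1, 2)."""
    n = len(x)
    pairs = list(zip(x, y))
    n_AA, n_Aa = x.count(2), x.count(1)
    n_BB, n_Bb = y.count(2), y.count(1)
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    p_B = (2 * n_BB + n_Bb) / (2 * n)
    # Composite LD coefficient from the two-locus genotype counts.
    delta = (
        4 * pairs.count((2, 2))
        + 2 * pairs.count((2, 1))
        + 2 * pairs.count((1, 2))
        + pairs.count((1, 1))
    ) / (2 * n) - 2 * p_A * p_B
    # Hardy-Weinberg disequilibrium coefficients at each locus.
    d_A = n_AA / n - p_A ** 2
    d_B = n_BB / n - p_B ** 2
    return delta / sqrt((p_A * (1 - p_A) + d_A) * (p_B * (1 - p_B) + d_B))


# Toy genotype vectors for two SNPs in eight individuals (illustrative only).
x = [0, 1, 2, 2, 1, 0, 2, 1]
y = [0, 1, 2, 1, 1, 0, 2, 2]
print(pearson(x, y), cld_correlation(x, y))
```

The two values agree up to floating-point error, so no haplotype frequency estimation is needed to obtain the CLD correlation from coded genotypes.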
