13.07.2015 Views

The Genom of Homo sapiens.pdf


GENETICS OF COMMON DISEASES 399

SELECTING AND EVALUATING TAGGING SNPS

A typical way to currently apply tSNPs is to define them on the basis of incomplete genotype data (perhaps 1 SNP every 5 kb or so) available in a relatively small population sample (up to about 60 unrelated individuals, ethnically matched to case/control cohort), and then to apply them in usually much larger phenotyped populations. Increasing the number of tags increases performance, but also increases expense. A consensus seems to be emerging that the coefficient of determination should be at least 0.80, which means that the increased sample size is n/0.85, in comparison with exhaustive typing. It is a striking demonstration of how fast the field has advanced to note that only 4 years ago a coefficient of determination of 0.2 was seen as a reasonable goal (Kruglyak 1999). It is necessary, however, to test whether the selected tSNPs will (1) represent other SNPs not yet known and (2) tag as efficiently in a new sample of individuals from the same population.

To address the first point, we introduced a SNP dropping procedure (Goldstein et al. 2003; Weale et al. 2003). The basic approach is to take the set of known SNPs and for each SNP i drop it from the analysis in turn. For each reduced set of N–1 SNPs, new tags are selected, and their ability to represent (predict) the dropped SNP i is assessed. In this way, a statistical estimate is obtained of how well the tSNPs can represent SNPs that are not observed (for example, SNPs that are not yet discovered) in the region.

Goldstein et al. (2003) carried out analysis along these lines on the Gabriel et al. data set with SNP densities of up to 4 SNPs per 10 kb using SNPs with a minor allele frequency above 8%.
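
The SNP dropping procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by Goldstein et al. (2003) or Weale et al. (2003). The greedy pairwise tag selection in `select_tags`, the use of the best single-tag r² as the prediction score, the 0/1/2 genotype coding, and the toy data are all assumptions made for the example.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def select_tags(snps, threshold=0.8):
    """Greedy pairwise tagging (an assumed stand-in, not the authors' method).

    snps maps a SNP name to its genotype vector. At each step the untagged
    SNP that covers the most untagged SNPs at r^2 >= threshold becomes a tag.
    """
    untagged = set(snps)
    tags = []
    while untagged:
        best = max(
            sorted(untagged),  # sorted so that ties are broken reproducibly
            key=lambda s: sum(
                r_squared(snps[s], snps[t]) >= threshold for t in untagged
            ),
        )
        tags.append(best)
        covered = {t for t in untagged
                   if r_squared(snps[best], snps[t]) >= threshold}
        untagged -= covered | {best}
    return tags


def snp_dropping(snps, threshold=0.8):
    """Drop each SNP in turn, reselect tags from the remaining N-1 SNPs,
    and score how well those tags represent the dropped SNP."""
    scores = {}
    for dropped in snps:
        reduced = {k: v for k, v in snps.items() if k != dropped}
        tags = select_tags(reduced, threshold)
        # Assumed score: best r^2 between the dropped SNP and any single tag.
        scores[dropped] = max(
            (r_squared(snps[dropped], reduced[t]) for t in tags), default=0.0
        )
    return scores


# Hypothetical genotypes (0/1/2 copies of the minor allele) for 8 individuals.
toy = {
    "snp1": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp2": [0, 1, 2, 0, 1, 2, 0, 1],  # identical to snp1
    "snp3": [2, 1, 0, 2, 1, 0, 2, 1],  # mirror image of snp1
    "snp4": [0, 0, 1, 1, 2, 2, 0, 1],  # only weakly correlated with the rest
}
scores = snp_dropping(toy)
print(scores)
```

On this toy data the three mutually correlated SNPs are each represented perfectly when dropped, whereas `snp4` is poorly predicted by the tags chosen without it. That second case is the one the procedure is meant to expose: an unobserved SNP that the selected tags would miss.
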
Averaged over the regions considered, we found that performance increased up to a SNP density of 1.5 SNPs per 10 kb, after which no further improvement was noted (Goldstein et al. 2003). We estimated using this preliminary evidence that an average marker density of somewhere between 1 and 2 SNPs per 10 kb would be sufficient to identify tSNPs that capture most of the common allelic variation in the human genome (Goldstein et al. 2003).

The regions Goldstein et al. (2003) studied from the original Gabriel et al. (2002) paper, however, did not cover a sufficiently broad range of densities, and the average SNP density was 1 SNP approximately every 7 kb. The question still remains, therefore, of how tSNP performance changes as density of the original genotype data set increases to densities higher than 1 every 7 kb. A direct way to address this is by assessing the performance of tSNPs in data sets with manually adjusted densities, similar to the way in which Ke et al. (2004) have approached the question of block boundary identifications. If the performance of tSNPs improves (for example, assessed by SNP dropping and testing in independent samples, as described below) only modestly when densities are adjusted from 1 SNP every 5 kb to higher densities, as is implied by the asymptotic performance for the less dense regions studied in Goldstein et al. (2003), this would imply that higher densities may not be justified for most genomic regions.
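
One simple way to produce data sets with manually adjusted densities is to thin a densely genotyped region to a chosen minimum spacing. The sketch below is an assumption about how such thinning might be done and is not the procedure of Ke et al. (2004). The function names and the evenly spaced toy positions are hypothetical.

```python
def thin_to_spacing(positions, min_spacing_bp):
    """Greedily keep SNPs, left to right, so that each kept SNP lies at
    least min_spacing_bp from the previously kept one."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_spacing_bp:
            kept.append(pos)
    return kept


def snps_per_10kb(positions, region_length_bp):
    """Marker density expressed as SNPs per 10 kb of the region."""
    return 10_000 * len(positions) / region_length_bp


# Hypothetical 100 kb region genotyped at one SNP every 2.5 kb.
region_length_bp = 100_000
dense = list(range(0, region_length_bp, 2_500))

for spacing in (2_500, 5_000, 7_000):
    thinned = thin_to_spacing(dense, spacing)
    print(spacing, len(thinned), snps_per_10kb(thinned, region_length_bp))
```

Each thinned set would then go through tag selection and SNP dropping, so that tSNP performance can be compared across densities within the same region and sample.
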
These experiments, however, have not yet been reported, although the HapMap project will shortly make available data sets ideal for this purpose: that is, data sets in which all SNPs have been genotyped (International HapMap Consortium 2003).

One shortcoming of our original implementation of the SNP dropping procedure was that it ignored LD that may have been generated by the sampling procedure itself. More recently, we have carried out a similar experiment but extended it by selecting the tSNPs in one population sample, in which SNP i has been dropped, and then evaluating the performance of the tSNP set in predicting the state of SNP i in a second, independent population sample of the same size (K.R. Ahmadi et al., unpubl.). This approach more closely mimics the real situation. We found that the results of this experiment and the earlier version of the SNP dropping procedure are similar for SNPs with MAF greater than about 6% and a sample size of 32 individuals. For SNPs with lower MAF, however, the performance of the tSNPs in one sample is not a good guide to their performance in a new sample (K.R. Ahmadi et al., unpubl.).

Some critics of haplotype mapping have argued that rare variants may be the main genetic factor influencing disease, and that these will be difficult to document using haplotype mapping. This raises the question of how well tags can capture rare variation. Most analyses to date have simply discarded SNPs with low MAF. Although rare alleles are often young, and thus resident on relatively long haplotypes, this does not mean that a set of tSNPs provides high power of detection when used in the ordinary way.
Although we consider it well established that tSNPs can be used to adequately represent common variation, we share the concerns of some critics that it is unclear how well rare variants can be represented, although adjustments in tSNP selection and use may help. Considerably more work will be required to resolve the issues of how well SNPs with low MAF can be represented by tSNPs.

Finally, although tagging rare variants may be possible in one population, the tSNP sets that tag such variants are expected to behave largely as private rather than cosmopolitan, and therefore, tagging rare variation in multiple populations of close ancestry may not be possible, with each population requiring a unique set of tSNPs (see below).

GENOME-WIDE TAG

A map-based approach incorporating tags can be applied either at a local (tagging variation across a gene/region) or genomic (tagging variation across the entire genome) level. Various estimates of the number of tags required to cover the genome have been made (Judson et al. 2002). More recently, using the same data set, Gabriel et al. (2002) and Goldstein et al. (2003) reached different conclusions regarding the number of tSNPs required to tag the human genome; Gabriel et al. predicted that approximately 300,000 are needed and Goldstein et al. predicted that only about 170,000 tags are sufficient for tagging a European population sample.
