13.07.2015 Views

The Genom of Homo sapiens.pdf


398 GOLDSTEIN, CAVALLERI, AND AHMADI

populations. Interestingly, the size of blocks varied across populations, reflecting differing population histories, with European and Chinese populations having, on average, larger block sizes than African-American and Yoruban samples (maximum block size 173 kb vs. 94 kb) (Gabriel et al. 2002).

TAGGING METHODOLOGY AND RELATED ISSUES

As noted, the first use of the term “tagging” was by Johnson and colleagues, who suggested the term htSNPs. In the case of all 9 genes examined in the Johnson paper, 2–5 htSNPs per gene were sufficient to tag the common haplotypes. That is, instead of typing the full set of 122 SNPs, close to the same haplotypic variation could be captured by typing a subset of 34 htSNPs (Johnson et al. 2001).

Tagging common haplotypes is only one of many possible ways to select a subset of SNPs that retain as much information as possible about the other SNPs. Broadly speaking, the approaches that have been evaluated can be divided into two groups (Weale et al. 2003): those based on maximizing the haplotype diversity present in the tagging set compared to the tagged set (diversity based) and those based on establishing as high an association as possible between the “tagging” and “tagged” set (association based). To avoid the close identification with haplotype diversity in the selection of tags, some have suggested that tags be referred to as tSNPs rather than htSNPs (see, e.g., Weale et al. 2003).

The primary motivation for tSNP selection is their application in LD-based gene mapping. For this reason, a tSNP selection criterion focused on the $r^2$ measure of LD seems the most directly relevant because it allows quantification of the loss of power in typing the tSNPs instead of all the SNPs.
Pritchard and Przeworski showed that for two biallelic loci, power scales with $r^2$, such that typing the associated marker in $n/r^2$ individuals would have approximately the same power as typing the causative variant itself in $n$ individuals, where $r^2$ is the association between the two variants (Pritchard and Przeworski 2001). This finding has been extended by Chapman et al. (2003) to include generalized $r^2$, including haplotype $r^2$ (see below).

MULTIMARKER VERSUS PAIR-WISE APPROACHES

There still remains the question of how to define the $r^2$ value. Relying on pair-wise measures is straightforward, but may be inefficient. This is because pairs of SNPs will only have high pair-wise association when their minor allele frequencies are very closely matched, meaning that SNPs which exhibit a full range of frequencies will need to be selected as tags. This can be overcome if combinations of the tSNPs are used to predict the other SNPs. One approach to this is to use the haplotype $r^2$ value (Chapman et al. 2003; Goldstein et al. 2003; Weale et al. 2003; and the D. Clayton Web site: http://wwwgene.cimr.cam.ac.uk/clayton/software).

Haplotype $r^2$ is defined as the proportion of variance in a “tagged” SNP of interest that is explained by an analysis of variance based on the $G$ haplotypes formed by the set of tSNPs.

$$Y_i = x_{i1} b_1 + x_{i2} b_2 + \cdots + x_{iG} b_G$$

where $Y_i$ is the predicted state of the tagged SNP of interest on the $i$th chromosome, $x_{i1}, \ldots, x_{iG}$ are indicator variables for the $G$ haplotypes, and $b_1, \ldots, b_G$ are coefficients estimated by standard least squares from the observed data.

This approach is more efficient in the sense of requiring fewer tags because it relies on combinations of haplotypes generated by tagging SNPs to predict the state of tagged SNPs. These combinations are identified by selecting the appropriate coefficients in a linear regression. The haplotype $r^2$ criterion therefore appears an appropriate measure if one of the aims is to reduce the number of tags that must be typed in phenotyped material (e.g., cases and controls).

The haplotype $r^2$ approach focuses on the prediction of haploid allelic state (0 or 1) on the basis of the tSNP haplotype observed on a given chromosome. As such, it does not explicitly address the issue of haplotype inference in phenotyped individuals. One approach that simultaneously considers both aspects is from Stram et al. (2003), who define a coefficient of determination for predicting the haplotypes observed in an individual based on the tSNP configuration. At present, it is hard to predict which approaches to tSNP selection will prove the most useful in practice.

BLOCK-BASED AND BLOCK-FREE SELECTION OF TAGS

Although the discovery of the block-like nature of LD and its effect on haplotype distribution inspired the idea of tags for given haplotypes (Johnson et al. 2001), the use of tSNPs in no way depends on blocks of LD. Indeed, even if there is such a block structure, it is not apparent that tag selection should make reference to blocks.
As noted above, the early suggestions for the definition of htSNPs did not address this issue directly. More recently, however, we have argued that block-based identification of tags will always be less efficient (sometimes considerably) than methods that select across block boundaries. This is because tagging within blocks limits the effective range of a set of tags, and means that cross-block associations cannot be exploited (Goldstein et al. 2003). For this reason, we advocate the selection of tSNPs across large contiguous sequence stretches, independently of any underlying block structure in the region.

Even so, some questions remain in this approach. For example, computational issues make it difficult to select across very large regions without some sort of subdivision. In addition, selecting across large regions may result in a set of tSNPs that are not optimized for specific subregions, such as a candidate gene (Goldstein et al. 2003). These are just some of the issues that will need to be addressed in order to develop appropriate, efficient strategies for genome-wide tSNP selection.
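
To make the difference between pair-wise and haplotype $r^2$ concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in any of the cited papers. The function names (`pairwise_r2`, `haplotype_r2`, `required_sample_size`) and the eight-chromosome toy data are hypothetical and invented for this example. The sketch relies on the fact that least squares on one indicator variable per haplotype yields, as the fitted value for each chromosome, the mean allelic state of the tagged SNP within that chromosome's haplotype group.

```python
from collections import defaultdict


def pairwise_r2(a, b):
    """Squared correlation between two biallelic SNPs coded 0/1 per chromosome."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb  # the usual LD coefficient D
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def haplotype_r2(tag_haplotypes, tagged):
    """Proportion of variance in `tagged` explained by the tSNP haplotype groups.

    Assumes `tagged` is polymorphic (otherwise the total variance is zero).
    """
    groups = defaultdict(list)
    for hap, y in zip(tag_haplotypes, tagged):
        groups[hap].append(y)
    mean = sum(tagged) / len(tagged)
    ss_total = sum((y - mean) ** 2 for y in tagged)
    # Residuals are taken around each haplotype group's mean, which is the
    # least-squares fit on the haplotype indicator variables.
    ss_resid = sum(
        (y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys
    )
    return 1 - ss_resid / ss_total


def required_sample_size(n, r2):
    """Approximate sample size needed at the marker to match n at the causal variant."""
    return n / r2


# Hypothetical toy data: two tSNPs at frequency 0.5 and a tagged SNP at
# frequency 0.25 that carries allele 1 only on the (1, 1) haplotype.
tag1 = [0, 0, 0, 0, 1, 1, 1, 1]
tag2 = [0, 0, 1, 1, 0, 0, 1, 1]
tagged = [0, 0, 0, 0, 0, 0, 1, 1]

r2_pair = pairwise_r2(tag1, tagged)
r2_hap = haplotype_r2(list(zip(tag1, tag2)), tagged)

print(round(r2_pair, 3))                             # 0.333
print(round(r2_hap, 3))                              # 1.0
print(round(required_sample_size(1000, r2_pair)))    # 3000
```

In this toy case neither tSNP alone predicts the tagged SNP well, because the allele frequencies are mismatched, yet the two-SNP haplotype predicts it perfectly. Under the $n/r^2$ approximation, relying on a single tSNP with $r^2 = 1/3$ would require roughly three times as many individuals as typing the tagged SNP directly.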
