11.07.2015 Views

2DkcTXceO


194 Journey into genetics and genomics

environmental risk factors, such as cigarette smoking, radon and asbestos exposures.

In contrast to the “common disease, common variants” hypothesis behind GWAS, the hypothesis of “common disease, multiple rare variants” has been proposed (Dickson et al., 2010; Robinson, 2010) as a complementary approach to search for the missing heritability.

The recent development of Next Generation Sequencing (NGS) technologies provides an exciting opportunity to improve our understanding of complex diseases, their prevention and treatment. As shown by the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010) and the NHLBI Exome Sequencing Project (ESP) (Tennessen et al., 2012), the vast majority of variants in the human genome are rare variants. Numerous candidate gene, whole exome, and whole genome sequencing studies are being conducted to identify disease-susceptibility rare variants. However, the analysis of rare variants in sequencing association studies presents substantial challenges (Bansal et al., 2010; Kiezun et al., 2012) due to the presence of a large number of rare variants.

Individual SNP based analysis, commonly used in GWAS, has little power to detect the effects of rare variants. SNP set analysis has been advocated to improve power by assessing the effects of a group of SNPs in a set, e.g., a gene, a region, or a pathway. Several rare variant association tests have been proposed recently, including burden tests (Morgenthaler and Thilly, 2007; Li and Leal, 2008; Madsen and Browning, 2009) and non-burden tests (Lin and Tang, 2011; Neale et al., 2011; Wu et al., 2011; Lee et al., 2012). A common theme of these methods is to aggregate individual variants or individual test statistics within a SNP set. However, these tests suffer from power loss when a SNP set has a large number of null variants. For example, a large gene often has a large number of rare variants, many of which are likely to be null.
Aggregating individual variant test statistics is likely to introduce a large amount of noise when the number of causal variants is small.

To formulate the problem in a statistical framework, assume $n$ subjects are sequenced in a region, e.g., a gene, with $p$ variants. For the $i$th subject, let $Y_i$ be a phenotype (outcome variable), let $G_i = (G_{i1}, \ldots, G_{ip})^\top$ be the genotypes of the $p$ variants in a SNP set, e.g., a gene/region, where $G_{ij} = 0, 1, 2$ denotes 0, 1, or 2 copies of the minor allele, and let $X_i = (x_{i1}, \ldots, x_{iq})^\top$ be a covariate vector. Assume the $Y_i$ are independent and follow a distribution in the exponential family with $E(Y_i) = \mu_i$ and $\mathrm{var}(Y_i) = \phi v(\mu_i)$, where $v$ is a variance function. We model the effects of the $p$ SNPs $G_i$ in a set, e.g., a gene, and the covariates $X_i$ on a continuous/categorical phenotype using the generalized linear model (GLM) (McCullagh and Nelder, 1989),

$$g(\mu_i) = X_i^\top \alpha + G_i^\top \beta, \tag{18.1}$$

where $g$ is a monotone link function, and $\alpha = (\alpha_1, \ldots, \alpha_q)^\top$ and $\beta = (\beta_1, \ldots, \beta_p)^\top$ are vectors of regression coefficients for the covariates and the genetic variants, respectively. The $n \times p$ design matrix $G = (G_1, \ldots, G_n)^\top$ is very sparse, with each column containing only a very small number of 1s or 2s and the rest being
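To make model (18.1) and the sparsity of $G$ concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes a binary phenotype with a logit link (one possible choice of the monotone link $g$). The sample size, minor allele frequency, coefficient values, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_genotypes(n, p, maf, rng):
    """Draw an n x p genotype matrix; each entry counts minor alleles (0, 1, or 2)."""
    return [[(rng.random() < maf) + (rng.random() < maf) for _ in range(p)]
            for _ in range(n)]


def linear_predictor(x_i, g_i, alpha, beta):
    """Right-hand side of (18.1): X_i' alpha + G_i' beta."""
    return (sum(x * a for x, a in zip(x_i, alpha))
            + sum(g * b for g, b in zip(g_i, beta)))


def inverse_logit(eta):
    """Inverse of the logit link g(mu) = log(mu / (1 - mu))."""
    return 1.0 / (1.0 + math.exp(-eta))


rng = random.Random(0)   # fixed seed so the sketch is reproducible
n, p = 500, 20           # hypothetical number of subjects and variants
maf = 0.005              # hypothetical minor allele frequency (rare variants)

G = simulate_genotypes(n, p, maf, rng)
X = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(n)]  # intercept plus one covariate

alpha = [-1.0, 0.5]      # hypothetical covariate effects
beta = [0.0] * p         # most variants are null ...
beta[0], beta[1] = 1.2, 0.8  # ... and only two are causal (assumed values)

# Mean of each phenotype under (18.1) with a logit link, then a Bernoulli draw.
mu = [inverse_logit(linear_predictor(X[i], G[i], alpha, beta)) for i in range(n)]
Y = [1 if rng.random() < mu[i] else 0 for i in range(n)]

# Sparsity of the design matrix G: share of nonzero entries, carriers per column.
nonzero_fraction = sum(g != 0 for row in G for g in row) / (n * p)
carriers_per_variant = [sum(row[j] != 0 for row in G) for j in range(p)]

print("fraction of nonzero genotype entries:", nonzero_fraction)
print("carriers per variant:", carriers_per_variant)
```

With a rare minor allele, almost every entry of $G$ is zero and each column has only a handful of carriers, which is why a single-variant test has so little information to work with.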
