11.07.2015 Views

Fitting Statistical models to data with many more variables than ...


Talk outline
1. Introduction, motivation and issues when p>>n
2. Response modelling
   Likelihood with sparsity prior (penalty)
3. Examples
   Microarrays
   SNP chips




Introduction: Response modelling
Each sample has a characteristic or response that we would like to predict from our measurements "inside" the cell
y (n by 1), X (n by p)
Say n = 100 and p = 30,000


p>>n: some possible approaches
1. Ignore p>>n
2. Reduce the number of variables, e.g. PCA, t-tests
3. Forward stepwise variable selection
4. Sparsity prior/penalty


Some issues
1. Artefacts: apparent but spurious good fit of models
2. Likelihood functions are constant over (p−n)-dimensional subspaces
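The second issue can be illustrated numerically. The sketch below assumes a model whose likelihood depends on the parameters only through the linear predictor Xβ, and uses a small simulated X (the sizes and random data are illustrative, not from the talk). Any move of β inside the null space of X leaves Xβ, and hence such a likelihood, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 12                       # toy sizes with p > n (illustrative only)
X = rng.standard_normal((n, p))    # simulated design matrix
beta = rng.standard_normal(p)

# Rows n..p-1 of Vt span the null space of X when X has full row rank.
_, _, Vt = np.linalg.svd(X, full_matrices=True)
null_basis = Vt[n:]                # shape (p - n, p)

# Move beta anywhere inside the null space: the linear predictor is unchanged.
shift = null_basis.T @ rng.standard_normal(p - n)
eta_original = X @ beta
eta_shifted = X @ (beta + 10.0 * shift)
print(np.allclose(eta_original, eta_shifted))
```

The null space here has dimension p − n = 7, so the data alone cannot distinguish between parameter vectors that differ along any of those directions.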


Spurious results – "sparse is good"
Linear regression, random y, real X, n = 32, p = 14,000+
[Figure: boxplots of apparent R-squared (0.0 to 1.0) against subset size (1 to 38)]


Cross validation
[Figure: boxplots of cross-validated R-squared (0.0 to 0.7) against subset size (1 to 38)]
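The gap between apparent and cross-validated fit can be reproduced in a small simulation. This is a minimal sketch under stated assumptions: X is simulated Gaussian noise rather than the real expression data, p is reduced to 2000, and `forward_select` is a simple greedy (matching-pursuit style) stand-in for forward stepwise selection, not the procedure used for the slides. The key point is that variable selection is repeated inside every fold.

```python
import numpy as np

def forward_select(X, y, size):
    """Greedy forward selection on centred data; returns column indices."""
    norms = np.linalg.norm(X, axis=0)
    selected = []
    resid = y.copy()
    for _ in range(size):
        score = np.abs(X.T @ resid) / norms
        if selected:
            score[selected] = -np.inf      # never pick a column twice
        selected.append(int(np.argmax(score)))
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        resid = y - X[:, selected] @ coef
    return selected

def fit_predict(X_train, y_train, X_test, size):
    """Select and fit on the training rows only, then predict the test rows."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - mu_x, y_train - mu_y
    cols = forward_select(Xc, yc, size)
    coef, *_ = np.linalg.lstsq(Xc[:, cols], yc, rcond=None)
    return mu_y + (X_test[:, cols] - mu_x[cols]) @ coef

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n, p, size = 32, 2000, 5
X = rng.standard_normal((n, p))   # simulated stand-in for the real X
y = rng.standard_normal(n)        # random response, unrelated to X

apparent = r_squared(y, fit_predict(X, y, X, size))

folds = np.array_split(rng.permutation(n), 8)
y_cv = np.empty(n)
for test in folds:
    train = np.setdiff1d(np.arange(n), test)
    y_cv[test] = fit_predict(X[train], y[train], X[test], size)
cross_validated = r_squared(y, y_cv)

print(f"apparent R^2 = {apparent:.2f}, cross-validated R^2 = {cross_validated:.2f}")
```

With this definition the cross-validated R-squared can be negative, which simply means the selected model predicts held-out samples worse than the training mean does.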


2. A Bayesian (penalised likelihood) approach (p>>n)
Posterior ∝ Likelihood × Prior
Likelihood:
• Linear regression
• GLMs
• Survival analysis
• Multinomial regression
• …
Prior: strongly favours parameters = 0
Takes care of flat spots, artefacts, and computational load in subset selection


Parameters in the models
The models have a linear predictor ($\eta = X\beta$)
Each variable/gene $i$ has a parameter $\beta_i$ associated with it


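A sparsity prior
Draw $\nu_i^2$ from $p(\nu_i^2) = \mathrm{Gamma}(b, k)$, $0 < k \le 1$
Draw $\beta_i$ from $p(\beta_i \mid \nu_i^2) = N(0, \nu_i^2)$

$$p(\beta_i) = \int p(\beta_i \mid \nu_i^2)\, p(\nu_i^2)\, d\nu_i^2 \qquad (1)$$

Explicitly

$$p(\beta_i) = \frac{\delta\, 2^{(0.5-k)}}{\sqrt{\pi}\,\Gamma(k)}\, \frac{K_{0.5-k}(\delta|\beta_i|)}{(\delta|\beta_i|)^{(0.5-k)}}, \qquad \delta = \sqrt{2/b} \qquad (2)$$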


The prior - special cases
k = 1 (Lasso, Tibshirani, 1996): $p(\beta_i) = (\delta/2)\exp(-\delta|\beta_i|)$
k = 0: $p(\beta_i) \propto \delta \exp\{-\delta|\beta_i|\} / (\delta|\beta_i|)$
k = 0, δ = 0 (Jeffreys hyperprior, Figueiredo, 2001): $p(\beta_i) \propto 1/|\beta_i|$
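The k = 1 special case can be checked by simulation. The sketch below assumes that b is the scale and k the shape of the Gamma distribution and that δ = √(2/b); these conventions are assumptions chosen so that the hierarchical draw reproduces the Lasso (double exponential) density, under which E|β| = 1/δ and Var(β) = 2/δ².

```python
import numpy as np

rng = np.random.default_rng(1)
b, k = 0.5, 1.0                  # assumed convention: scale b, shape k
delta = np.sqrt(2.0 / b)         # assumed relation between delta and b

# Hierarchical draw: nu_i^2 ~ Gamma, then beta_i | nu_i^2 ~ N(0, nu_i^2).
nu2 = rng.gamma(shape=k, scale=b, size=200_000)
beta = rng.normal(0.0, np.sqrt(nu2))

mean_abs = np.abs(beta).mean()   # double exponential: E|beta| = 1 / delta
var_beta = beta.var()            # double exponential: Var(beta) = 2 / delta^2

print(mean_abs, 1.0 / delta)
print(var_beta, 2.0 / delta**2)
```

Smaller values of k put more prior mass near zero, which is what makes the resulting estimates sparser than in the Lasso case.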


A visual description
Posterior ∝ likelihood × prior


An EM algorithm
Want to find a (local) maximum of

$$p(\beta, \varphi \mid y, X) \propto L(y \mid X, \beta, \varphi) \int p(\beta \mid \nu^2)\, p(\nu^2)\, d\nu^2$$

where the first factor is the likelihood and the integral is the hierarchical prior
Treat $\nu^2$ as missing data
$\beta$ and $\varphi$ are to be estimated


E-step
1. Initialisation
Set initial values for $\varphi^{(old)}$, $\beta^{(old)}$, $k$, $\delta$
2. E-step
Calculate

$$Q(\beta \mid y, \beta^{(old)}, \varphi^{(old)}) = L(y \mid \beta, \varphi^{(old)}) - 0.5\,\|\beta / b^{(old)}\|^2$$

where $b^{(old)}$ has components

$$b_i^{(old)} = E\{\nu_i^{-2} \mid \beta_i^{(old)}, k, \delta\}^{-0.5}$$


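E-step continued
The conditional expectation is:

$$E\{\nu_i^{-2} \mid \beta_i^{(old)}, k, \delta\} = \frac{\delta}{|\beta_i^{(old)}|}\, \frac{K_{3/2-k}(\delta|\beta_i^{(old)}|)}{K_{1/2-k}(\delta|\beta_i^{(old)}|)}$$

For k = 1:

$$E\{\nu_i^{-2} \mid \beta_i^{(old)}, k=1, \delta\} = \delta / |\beta_i^{(old)}|$$

For k = 0:

$$E\{\nu_i^{-2} \mid \beta_i^{(old)}, k=0, \delta\} = \delta / |\beta_i^{(old)}| + 1 / |\beta_i^{(old)}|^2$$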


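M-step
Maximise

$$Q(\beta \mid y, \beta^{(old)}, \varphi^{(old)}) = L(y \mid \beta, \varphi^{(old)}) - 0.5\,\|\beta / b^{(old)}\|^2$$

Many possibilities, e.g. NR iterations

$$\beta_{r+1} = \beta_r + \alpha_r \delta_r$$

$$\delta_r = \Delta(b^{(old)}) \left[ -\Delta(b^{(old)})\, \frac{\partial^2 L}{\partial \beta^2}\, \Delta(b^{(old)}) + I \right]^{-1} \left( \Delta(b^{(old)})\, \frac{\partial L}{\partial \beta} - \frac{\beta_r}{b^{(old)}} \right)$$

where $\alpha_r$ is chosen by a line search algorithm to ensure

$$Q(\beta_{r+1} \mid y, \beta^{(old)}, \varphi^{(old)}) > Q(\beta_r \mid y, \beta^{(old)}, \varphi^{(old)})$$

Produce $\beta^{(new)}$, check convergence etc.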
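The E- and M-steps can be sketched end to end for the simplest case. The code below is a minimal illustration, not the GeneRave implementation: it assumes k = 1, a Gaussian log-likelihood L = −‖y − Xβ‖²/(2σ²) with σ² held fixed, and simulated data. Under those assumptions the E-step gives b_i = (|β_i|/δ)^0.5, and Q is quadratic in β, so its maximiser is available in closed form (this coincides with one Newton-Raphson step of length 1). The function name `em_lasso_case` and all data settings are hypothetical.

```python
import numpy as np

def em_lasso_case(X, y, delta, sigma2, n_iter=200):
    """EM iterations for the k = 1 prior with a Gaussian likelihood."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = Xty / n                      # non-zero start; exact zeros stay zero
    objective = []
    for _ in range(n_iter):
        # E-step (k = 1): E{nu_i^-2 | beta_i} = delta / |beta_i|,
        # so b_i = sqrt(|beta_i| / delta).
        b = np.sqrt(np.abs(beta) / delta)
        # M-step: maximise L - 0.5 * ||beta / b||^2 in closed form,
        # beta = D (D X'X D / sigma2 + I)^{-1} D X'y / sigma2 with D = diag(b).
        M = (b[:, None] * XtX * b[None, :]) / sigma2 + np.eye(p)
        beta = b * np.linalg.solve(M, b * Xty / sigma2)
        # Penalised log-likelihood (log posterior up to a constant).
        resid = y - X @ beta
        objective.append(-resid @ resid / (2 * sigma2) - delta * np.abs(beta).sum())
    return beta, np.array(objective)

rng = np.random.default_rng(0)
n, p = 100, 300
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[5, 50, 150]] = [3.0, -2.0, 2.5]      # three real effects
sigma = 0.5
y = X @ beta_true + sigma * rng.standard_normal(n)

beta_hat, objective = em_lasso_case(X, y, delta=60.0, sigma2=sigma**2)
top3 = np.sort(np.argsort(-np.abs(beta_hat))[:3])
print(top3, np.round(beta_hat[top3], 2))
```

Writing the update in terms of D = diag(b) rather than D⁻¹ means coefficients that have shrunk to zero cause no division by zero and simply remain at zero, so estimation and variable elimination proceed together.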


Comments
Structure can be used to make the matrix to be inverted of size min(n, p)
In the outer iteration, parameters tend to zero at a quadratic rate
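One standard way to obtain a min(n, p) sized system is the Woodbury identity; whether this is exactly the structure exploited in the talk is an assumption. The sketch below takes a Gaussian likelihood, for which the p × p matrix in the Newton step has the form AᵀA + I with A = X diag(b)/σ of size n × p, and compares a direct p × p solve with the equivalent n × n solve. All data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.standard_normal((n, p))
b = np.abs(rng.standard_normal(p))     # stand-in for the E-step weights b
v = rng.standard_normal(p)             # arbitrary right-hand side
sigma2 = 1.0                           # assumed known noise variance

A = (X * b) / np.sqrt(sigma2)          # n x p, equals X @ diag(b) / sigma

# Direct p x p solve of (A^T A + I_p) z = v.
z_direct = np.linalg.solve(A.T @ A + np.eye(p), v)

# Woodbury identity: (I_p + A^T A)^{-1} v = v - A^T (I_n + A A^T)^{-1} A v,
# which only requires solving an n x n system.
z_small = v - A.T @ np.linalg.solve(np.eye(n) + A @ A.T, A @ v)

print(np.allclose(z_direct, z_small))
```

When p is in the tens of thousands and n around 100, the n × n form is the only one of the two that is practical.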


Comments
Parameter estimation and variable selection happen simultaneously
No pre-selection of variables


GeneRave in Action





St Jude's Leukaemia data - Probes and genes
Affy chips collect data for 500,000 probes
Probes are collected into "probe sets" of 11-14 probes to give "gene expression" values


Example 1: St Jude's Leukaemia data
p = 44,000 "genes" or >500,000 probes (Affymetrix U133A/B)
n = 104 samples
6 leukaemia subtypes
Results
44,000 "genes": 6-gene classification model, cross-validated error < 5%
500,000 probes: 5-probe classification model, cross-validated error < 4%


Example 2: Perlegen SNP data
Reference: Hinds et al. (2005), Whole-Genome Patterns of Common DNA Variation in Three Human Populations, Science.
http://genome.perlegen.com/browser/download.html
71 individuals, ~1.5 million SNPs
33 males, 38 females
23 African Americans, 24 European Americans, 24 Han Chinese


Single Nucleotide Polymorphisms (SNP)
AGCTCCTAAGCTTAAGCTACT
AGCTCCTAACCTTAAGCTACT
AGCTCCTAAGCTTAAGCTACT
AGCTCCTAAGCTTAAGCTACT
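The variant site in this illustration can be located programmatically. The sketch assumes the slide's run of letters splits into four aligned sequences of 21 bases each; the helper name `variant_positions` is hypothetical.

```python
sequences = [
    "AGCTCCTAAGCTTAAGCTACT",
    "AGCTCCTAACCTTAAGCTACT",
    "AGCTCCTAAGCTTAAGCTACT",
    "AGCTCCTAAGCTTAAGCTACT",
]

def variant_positions(seqs):
    """Return 0-based positions where the aligned sequences are not all identical."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

positions = variant_positions(sequences)
alleles = {i: sorted(set(s[i] for s in sequences)) for i in positions}
print(positions, alleles)
```

Only one column differs across the four sequences, and the two bases observed there are the alleles of the SNP.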


SNPs are a major determinant of phenotype
Quantitative traits
• Strength
• Intelligence
• Response to drugs


Data and model
We fit a sparse "main effects" model to the data using the GeneRave algorithm
On an appropriate scale each SNP genotype has an additive effect on the probability of race or sex.
Most effects are expected to be zero and the effects of a small number of SNP genotypes will dominate
For the Perlegen SNP data there are 71 samples and 3,096,617 variables!


GeneRave – Perlegen SNP Data
1,548,308 SNPs on chromosomes 1 to 22
Race data: 23 African Americans, 24 European Americans, 24 Han Chinese
Sex data: 33 males, 38 females
Results
Race: 3 SNPs (0.082)
Sex: 2 SNPs (0.00)


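Two SNP race classifier
[Figure: decision tree. Root node afd0860639 with branches ≠TT and =TT; leaf African American; second node afd3693051 with branches ≠CC and =CC; leaves European American and Han Chinese]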


Validation data - HapMap data set
http://www.hapmap.org
270 individuals, ~5 million SNPs
142 males, 128 females
90 Utah residents (European Americans)
45 Han Chinese
45 Japanese
90 Yoruba in Ibadan, Nigeria


Independent validation of results
The SNPs picked up in the GeneRave analysis have been genotyped in the HapMap project
The SNP on chromosome 1 classifies males and females in the HapMap data set with zero error
The SNP on chromosome 15 doesn't
The SNP from the Perlegen analysis which classifies Han Chinese and European Americans works in the validation data with zero error


SNP Analysis Conclusion
• The sex SNP on chromosome 1 is highly likely to be a cross-hybridisation problem with the SNP chips
• The race SNP is associated with a gene which codes for skin colour


Code
R library available for download at
https://www.bioinformatics.csiro.au/GeneRave/index.shtml
New version soon


Contact
Name: Harri Kiiveri
Title: Research Scientist
Phone: 61 8 9332 3317, +61 2 9325 3256
Email: Harri.Kiiveri@csiro.au
Web: www.cmis.csiro.au/BHI

Thank You

Contact CSIRO
Phone: 1300 363 400, +61 3 9545 2176
Email: enquiries@csiro.au
Web: www.csiro.au
