06.08.2015 Views

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Clemson version - Ben Bolstad

Clemson version - Ben Bolstad

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Probe</strong>-<strong>Level</strong> <strong>Analysis</strong> <strong>of</strong><strong>Affymetrix</strong> <strong>GeneChip</strong><strong>Microarray</strong> <strong>Data</strong>Ben Bolstadhttp://www.stat.berkeley.edu/~bolstadClemson UniversityJanuary 6, 2005


Outline for Today's Talk• A brief introduction to the <strong>Affymetrix</strong><strong>GeneChip</strong> Technology• Pre-processing– Background Correction– Normalization• <strong>Probe</strong> <strong>Level</strong> Modeling– Models– Applications


The Human Genome• The cell is the fundamental working unit <strong>of</strong> everyliving organism.• Humans: trillions <strong>of</strong> cells (metazoa);other organisms like yeast: one cell (protozoa).• Cells are <strong>of</strong> many different types (e.g. blood, skin,nerve cells), but all can be traced back to a singlecell, the fertilized egg.


Genes• The human genome is distributed along 23 pairs <strong>of</strong>chromosomes.– 22 autosomal pairs;– the sex chromosome pair, XX for females and XYfor males.• In each pair, one chromosome is paternallyinherited, the other maternally inherited.• Chromosomes are made <strong>of</strong> compressed andentwined DNA.• A (protein-coding) gene is a segment <strong>of</strong>chromosomal DNA that directs the synthesis <strong>of</strong> aprotein.


DNA• A deoxyribonucleic acid or DNA molecule is adouble-stranded polymer composed <strong>of</strong> fourbasic molecular units called nucleotides.• Each nucleotide comprises a phosphate group,a deoxyribose sugar, and one <strong>of</strong> four nitrogenbases: adenine (A), guanine (G), cytosine (C),and thymine (T).• The two chains are held together by hydrogenbonds between nitrogen bases.• Base-pairing occurs according to the followingrule: G pairs with C, and A pairs with T.


Exons and introns• Genes comprise only about 2% <strong>of</strong> the humangenome; the rest consists <strong>of</strong> non-codingregions, whose functions may includeproviding chromosomal structural integrityand regulating when, where, and in whatquantity proteins are made (regulatoryregions).• The terms exon and intron refer to coding(translated into a protein) and non-codingDNA, respectively.


Differential expression• Each cell contains a complete copy <strong>of</strong> theorganism's genome.• Cells are <strong>of</strong> many different types and statesE.g. blood, nerve, and skin cells, dividingcells, cancerous cells, etc.• What makes the cells different?• Differential gene expression, i.e., when,where, and in what quantity each gene isexpressed.• On average, 40% <strong>of</strong> our genes are expressedat any given time.


Functional genomics• The various genome projects have yieldedthe complete DNA sequences <strong>of</strong> manyorganisms. E.g. human, mouse, yeast, fruitfly,etc.• Human: 3 billion base-pairs, 30-40 thousandgenes.• Challenge: go from sequence to function, i.e.,define the role <strong>of</strong> each gene and understandhow the genome functions as a whole.


Central dogma• The expression <strong>of</strong> the genetic information stored in theDNA molecule occurs in two stages:(i) transcription, during which DNA is transcribed intomRNA;(ii) translation, during which mRNA is translated toproduce a protein.DNA mRNA protein• Other important aspects <strong>of</strong> regulation: methylation,alternative splicing, etc.• The correspondence between DNA's four-letteralphabet and a protein's twenty-letter alphabet isspecified by the genetic code, which relates nucleotidetriplets to amino acids.


RNA• A ribonucleic acid or RNA molecule is a nucleicacid similar to DNA, but– single-stranded;– ribose sugar rather than deoxyribose sugar;– uracil (U) replaces thymine (T) as one <strong>of</strong> thebases.• RNA plays an important role in protein synthesisand other chemical activities <strong>of</strong> the cell.• Several classes <strong>of</strong> RNA molecules, includingmessenger RNA (mRNA), transfer RNA (tRNA),ribosomal RNA (rRNA), and other small RNAs.


Idea: measure the amount <strong>of</strong> mRNA to see which genes arebeing expressed in (used by) the cell.Measuring protein might be better, but is currently harder.


Brief TechnologyOverview• High densityoligonucleotide arraytechnology as developedby <strong>Affymetrix</strong>http://www.affymetrix.com• Known as the <strong>GeneChip</strong>Some overview images courtesy <strong>of</strong> <strong>Affymetrix</strong>.


<strong>Probe</strong>s and <strong>Probe</strong>sets


Two <strong>Probe</strong> TypesReference SequenceTAGGTCTGTATGACAGACACAAAGAAGATGCAGACATAGTGTCTGTGTTTCTTCTCAGACATAGTGTGTGTGTTTCTTCTPM: the Perfect MatchMM: the Mismatch


Constructing the Chip


Sample Preparation


Hybridization to the Chip


The Chip is Scanned


Chip dat file – checkered board – close up pixel selection


Chip cel file – checkered boardCourtesy: F. Colin


<strong>Level</strong>s <strong>of</strong> <strong>Analysis</strong>• <strong>Probe</strong> level– Unprocessed data (or minimal processing)– <strong>Data</strong>set is larger– Often ignored or viewed as a “black box”.• High level– Processed data– <strong>Data</strong>set is smaller– Most analysis seems to start here


High-<strong>Level</strong> <strong>Analysis</strong>• Clustering/Classification• Pathway <strong>Analysis</strong>• Cell Cycle• Gene function• Anything where a more biologicalinterpretation is desiredThese are important topics but such matterswill not be discussed further in today's talk.


<strong>Probe</strong>-<strong>Level</strong> <strong>Analysis</strong>• What is <strong>Probe</strong>-level analysis?– <strong>Analysis</strong> and manipulation <strong>of</strong> probe intensity data• Expression calculation: Background, Normalization,Summarization• Fitting models to probe intensity data• Quality control diagnostics• Why do we do it?– Hopefully it will allow us to produce better, morebiologically meaningful gene expression values– We want accurate (low bias) and precise (lowvariance) gene expression estimates– Is there additional information at the probe-levelthat we might otherwise throw away?


Background/SignalAdjustment• A method which does some or all <strong>of</strong> thefollowing– Corrects for background noise, processing effects– Adjusts for cross hybridization– Adjust estimated expression values to fall onproper scale• <strong>Probe</strong> intensities are used in backgroundadjustment to compute correction (unlikecDNA arrays where area surrounding spotmight be used)


RMA Background Approach• Convolution Model= +ObservedOSignalS( )NoiseNExp α (2N μ,σ )


ECorrection is given by( )SO= o = a+ba= o− − , b=2μ σ α σφa o−a−φb b⎛ ⎞ ⎛ ⎞⎜⎝⎟⎠⎜⎝⎟⎠⎛a⎞ ⎛o−a⎞Φ ⎜ ⎟+Φ⎜ ⎟−1b b⎝ ⎠ ⎝ ⎠


Normalization“Non-biological factors can contribute to thevariability <strong>of</strong> data ... In order to reliablycompare data from multiple probe arrays,differences <strong>of</strong> non-biological origin must beminimized.“ 1• Normalization is a process <strong>of</strong> reducingunwanted variation across chips. It may useinformation from multiple chips1 <strong>GeneChip</strong> 3.1 Expression <strong>Analysis</strong> Algorithm Tutorial, <strong>Affymetrix</strong> technical support


Non-Biological Variability5 scanners for 6 dilution groups


Non-linear normalizationneededUnnormalizedScaledA Non-linearNormalization


Quantile Normalization• Normalize so that the quantiles <strong>of</strong> each chip areequal. Simple and fast algorithm. Goal is togive same distribution to each chip.OriginalDistributionTargetDistribution


Sort Sort columns <strong>of</strong> <strong>of</strong> originalmatrixTake averages across rowsTake averages across rowsSet Set average as as value for forAll All elements in in the the row rowUnsort columns <strong>of</strong> <strong>of</strong>matrix to to original order⎡1 5 3 5⎤ ⎡1 1 1 5⎤⎢2 1 6 7⎥ ⎢2 2 2 6⎥⎢ ⎥ → ⎢ ⎥⎢3 2 2 6⎥ ⎢3 5 3 7⎥⎢ ⎥ ⎢ ⎥⎣4 6 1 8⎦ ⎣4 6 6 8⎦⎡1 1 1 5⎤ ⎡ 2 ⎤⎢2 2 2 6⎥ ⎢3⎥⎢ ⎥ → ⎢ ⎥⎢3 5 3 7⎥ ⎢4.5⎥⎢ ⎥ ⎢ ⎥⎣4 6 6 8⎦ ⎣ 6 ⎦⎡ 2 ⎤ ⎡ 2 2 2 2 ⎤⎢3⎥ ⎢3 3 3 3⎥⎢ ⎥ → ⎢ ⎥⎢4.5⎥ ⎢4.5 4.5 4.5 4.5⎥⎢ ⎥ ⎢ ⎥⎣ 6 ⎦ ⎣ 6 6 6 6 ⎦⎡ 2 2 2 2 ⎤ ⎡ 2 4.5 4.5 2 ⎤⎢3 3 3 3⎥ ⎢3 2 6 4.5⎥⎢ ⎥ → ⎢ ⎥⎢4.5 4.5 4.5 4.5⎥ ⎢4.5 3 3 3 ⎥⎢ ⎥ ⎢ ⎥⎣ 6 6 6 6 ⎦ ⎣ 6 6 2 6 ⎦


It Reduces VariabilityExpression ValuesFold changeAlso no serious bias effects. For more see Bolstad et al (2003)


General <strong>Probe</strong> <strong>Level</strong> Modelyij= f( X)+εij• Where f(X) is function <strong>of</strong> factor (and possiblycovariate) variables (our interest will be inlinear functions)• y ij is a pre-processed probe intensity(usually log scale)• Assume that E⎡ ⎤ = 0Var⎣ε ij⎦⎡⎣ε⎤ij ⎦ = σ2


Parallel Behavior SuggestsMulti-chip Model


<strong>Probe</strong> Pattern SuggestsIncluding <strong>Probe</strong>-Effect


Also Want Robustness


Summarization• Problem: Calculating gene expression values.• How do we reduce the 11-20 probe intensities foreach probeset on to a gene expression value?• Our Approach– RMA – a robust multi-chip linear model fit on the log scale• Some Other Approaches– Single chip• AvDiff (<strong>Affymetrix</strong>) – no longer recommended for use due to manyflaws• Mas 5.0 (<strong>Affymetrix</strong>) – use a 1 step Tukey&( biweight to combinethe probe intensities in log scale– Multiple Chip• MBEI (Li-Wong dChip) – a multiplicative model on natural scale


The Three Steps <strong>of</strong> RMA1. Convolution Background2. Quantile Normalization3. Linear model on the log2 scale fit robustly.• S<strong>of</strong>tware– Bioconductor affy packagewww.bioconductor.org– RMAExpresswww.stat.berkeley.edu/~bolstad/RMAExpress– Also available in some commercial s<strong>of</strong>tware egS+ ArrayAnalzyer, Iobian GeneTraffic


RMA mostly does well inpracticeDetecting Differential ExpressionNot noisy in low intensitiesRMAMAS 5.0


One DrawbackRMA MAS 5.0Some fixes for this are being developed see GCRMA(Irizarry and Wu, JHU)


The RMA modely = m+ α + β + εij i j ijwhereyijα iβj= log N B2( ( PM ))ijis a probe-effect i= 1,…,Iis chip-effect ( m + βj is log2 geneexpression on array j) j=1,…,J


Median Polish Algorithmy11 L y1J0M O M MyI1L yIJ00 L 0 0ImposesConstraintsSweep ColumnsIteratemedianα= median β = 0iSweep Rowsmedian ε = median ε = 0i ij j ijjˆ ε L ˆ ε αˆ11 1J1M O M Mˆ ε ˆ ˆI1L ε αIJ Iˆ ˆ β L β mˆ1J


• Advantages–Fast– Very robust• DisadvantagesMedian Polish– No algorithmic flexibility to fit alternativemodels– No standard error estimates


An Alternative Method forFitting a PLM• Robust regression using M-estimation• In this talk, we will use Huber’s influencefunction ψ . The s<strong>of</strong>tware handles manymore.• Fitting algorithm is IRLS with weightsdependent on current residualsψ rr( )• S<strong>of</strong>tware for fitting such models is part <strong>of</strong>affyPLM package <strong>of</strong> Bioconductor


Variance CovarianceEstimates• Suppose model is Y = Xβ + ε• Huber (1981) gives three forms for estimatingvariance covariance matrix2κ1/( n−p) ψ ( r)⎡⎢1/n⎣∑i∑iψ ′( r)i⎤⎥⎦i22(TX X )−11/( n−p)ψ riiκ1/ n ψ ′ r∑i∑( )( )i2W−11 1/−( ) ( ) (Tn−p ∑ψr )iW X X Wκi2 1 −1We will use this formT 'W = X Ψ X


We Will Focus on theSummarization PLM• Array effect modelArray EffectPre-processedLog PM intensityy = α + β + εij i j ijWith constraintI∑i=1αi=<strong>Probe</strong> Effect0


Detecting DifferentialExpression• Problem: Given an experiment with twotreatment groups correctly identify thedifferential genes without incorrectlychoosing non differential genes.• Question 1: How do different methods forDDE compare?• Question 2: Can we do better using PLM’s?


How Do We Know WhichGenes are Differential?• Spike-in datasets.– Transcripts at known concentrationsdiffering across arrays. Commonbackground cRNA.– Typically, Latin Square design• <strong>Affymetrix</strong> HG-U95A• GeneLogic AML – bacterial spike-ins• GeneLogic Tonsil – bacterial spike-ins


<strong>Affymetrix</strong> Spike-in <strong>Data</strong>• 59 chips. All but 1 <strong>of</strong> the rows are done as triplicates37777 684 1597 38734 39058 36311 36889 1024 36202 36085 40322 407 1091 1708A 0 0.25 0.5 1 2 4 8 16 32 64 128 0 512 1024B 0.25 0.5 1 2 4 8 16 32 64 128 256 0.25 1024 0C 0.5 1 2 4 8 16 32 64 128 256 512 0.5 0 0.25D 1 2 4 8 16 32 64 128 256 512 1024 1 0.25 0.5E 2 4 8 16 32 64 128 256 512 1024 0 2 0.5 1F 4 8 16 32 64 128 256 512 1024 0 0.25 4 1 2G 8 16 32 64 128 256 512 1024 0 0.25 0.5 8 2 4H 16 32 64 128 256 512 1024 0 0.25 0.5 1 16 4 8I 32 64 128 256 512 1024 0 0.25 0.5 1 2 32 8 16J 64 128 256 512 1024 0 0.25 0.5 1 2 4 64 16 32K 128 256 512 1024 0 0.25 0.5 1 2 4 8 128 32 64L 256 512 1024 0 0.25 0.5 1 2 4 8 16 256 64 128M 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256N 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256O 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256P 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256Q 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512R 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512S 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512T 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512


Testing for DifferentialExpression• On a probeset by probeset basis compute atest statistic• Should include something related to observedFC and some variability estimate• A “t” statistic: something <strong>of</strong> the formt=XSE


Fold ChangeFC = X − XlmWhereXl=∑∑βjInd( j∈group l)Ind( j∈group l)


Simple t-statistict=Xlsnm2s2l ml− X+nm


“Robust” t-statistict=X% − X%lm2s2l ms% %+n nlm• Use medians in place <strong>of</strong> means• Use MAD in place <strong>of</strong> standard deviation


Simple Moderated t-statistict=snl2s2l mlX− Xmm+ +nsmed• s2 2slsmmed is median + across all genes• Analogous to SAMnlnm


Limma “ebayes” t-statistic• Generalization <strong>of</strong> Bayesian method <strong>of</strong>Lonnstadt and Speed (2002) to thegeneral linear model case• An alternative and much moresophisticated moderated t-statistic.Variability is estimated using a posterior• From the limma Bioconductor package.


<strong>Probe</strong> <strong>Level</strong> Model teststatisticsΣ• Suppose that is component <strong>of</strong> the variancecovariancematrix related to β• Let c be the contrast vector defined such thatthe j th element <strong>of</strong> c is−1n l1n mif array j is in group lif array j is in group m0 otherwise


<strong>Probe</strong> <strong>Level</strong> Model teststatisticstPLM.1=J∑j=1Tc βc2jΣjjPLM.2Tt = c βTc Σc


A First Comparison• 8 chips from <strong>Affymetrix</strong> HG-U95A spike-indataset– 4 arrays for each <strong>of</strong> two concentration pr<strong>of</strong>iles• Fit an array effect model to all 8 chips– Compare the performance <strong>of</strong> the differentmethods by looking at all comparisons• 1 vs 1• 2 vs 2• 3 vs 3• 4 vs 4


What Happens as the Number<strong>of</strong> Arrays Increases?• Expand comparison to all 24 Arrays withsame concentration pr<strong>of</strong>iles from<strong>Affymetrix</strong> HG-U95A spike-in dataset• Fit an array effect model to all 24 arrays• Look at comparisons between equalnumber <strong>of</strong> chips


A Larger Comparison• Look at the entire 59 chips for<strong>Affymetrix</strong> HG-U95A spike-in dataset• Examine two cases. After standardpreprocessing– Fit models for each pairwise comparison(individual models)– Fit a model to all 59 chips (single model)• There are 91 pairwise comparisions


ResultsMethod Individual Models Single Model0% FP 5% FP AUC 0% FP 5% FP AUCFC 0.451 0.985 0.975 0.444 0.982 0.971Std 0.323 0.982 0.956 0.301 0.975 0.952Robust 0.16 0.939 0.857 0.144 0.935 0.852Mod 0.437 0.987 0.975 0.413 0.98 0.97PLM.1 0.653 0.991 0.979 0.54 0.951 0.93PLM.2 0.657 0.991 0.979 0.539 0.951 0.93Ebayes 0.514 0.988 0.978 0.45 0.986 0.974


More Spike-in <strong>Data</strong>sets• Two GeneLogic Spike-in datasets– Tonsil dataset (36 arrays)– AML dataset (34 arrays)• In each case use single models fitted toall arrays


What is Going On Here?• Examine residuals stratified byconcentration group– Spike-ins– Randomly chosen non-differential probesets atlow, medium and high average expression


<strong>Affymetrix</strong> Spike-ins


Low Non-Differential


Middle Non-Differential


High Non-Differential


GeneLogic AML Spike-ins


GeneLogic Tonsil dataset


How About With More“Real” <strong>Data</strong>?GeneLogic Dilution/Mixture <strong>Data</strong>setLearning SetLiver30xVSCNS30x400 top genes“truth”75:255xVS25:755xTest SetMixture <strong>Data</strong>Dilution Series <strong>Data</strong>


Mixture <strong>Data</strong> ResultsMethod 3 vs 3 4 vs 4 5 vs 50% FP 5% FP AUC 0% FP 5% FP AUC 0% FP 5% FP AUCFC 0.007 0.886 0.697 0.008 0.888 0.703 0.005 0.888 0.708Std 0.004 0.793 0.53 0.008 0.872 0.626 0.018 0.902 0.675Robust 0.002 0.485 0.271 0.005 0.747 0.49 0.01 0.743 0.488Mod 0.007 0.908 0.697 0.002 0.932 0.735 0 0.948 0.76PLM.1 0.056 0.943 0.751 0.057 0.947 0.756 0.056 0.95 0.76PLM.2 0.057 0.943 0.752 0.057 0.948 0.758 0.058 0.95 0.761Ebayes 0.001 0.918 0.744 0 0.933 0.761 0 0.943 0.776


Moderation for the PLMtest statisticMethod 5 vs 50% FP 5% FP AUCPLM.2 0.058 0.95 0.761Ebayes 0 0.943 0.776PLM Moderated 0.053 0.963 0.795


Quality Assessment using PLM• PLM quantities useful for assessing chipquality– Weights– Residuals– Standard Errors• Expression values relative to median chip


Pseudo-chip imagesWeightsResidualsPositiveResidualsNegativeResiduals


An Image Gallery“Crop Circles”“Ring <strong>of</strong> Fire”“Tricolor”http://www.stat.berkeley.edu/~bolstad/PLMImageGallery/


NUSE PlotsNormalizedUnscaledStandardErrors


RLE PlotsRelativeLogExpression


Discordant <strong>Probe</strong>s


Discordant Arrays


Morals for Today’s Talk• The “black box” that is low-level processingis important. Your processing can have asignificant effect on your final expressionvalues and resulting conclusions.• Testing for differential expression at theprobe level can improve upon probeset-leveltesting.• There is interesting quality assessmentinformation at the probe-level


Ongoing Work in this Area• Technology changes: What still works?What doesn’t?• Other probe-level models (eg Introducesequence information, MM’s)• Can we introduce the biology into theprocess (gene families? etc)


Acknowledgements• Terry Speed (UC Berkeley)• Francois Colin (UC Berkeley/UCSF)• Julia Brettschneider (UC Berkeley)• Rafael Irizarry (Johns Hopkins)• Bioconductor Corehttp://www.bioconductor.org


Additional Slides


Focusing on a Single<strong>GeneChip</strong> Cell Location


Chip dat file – checkered board – oligo B2Courtesy: F. Colin


Chip dat file – checkered board – close up w/ gridCourtesy: F. Colin


From Chip To <strong>Data</strong>


Constructing a geneexpression measure


Computing ExpressionMeasures:A Three Step Procedure1. Background/Signal adjustment (B)2. Normalization (N)3. Summarization (S)Let be cel file data from multiple arrays thenXXExpression values = S(N(B( )))


Background Methods• <strong>Affymetrix</strong>– Location dependent– Ideal mismatch• RMA– Convolution model• Other– Standard curve adjustment– GCRMA (Wu, Irizzary et al 2003)+ =


Background Signal Methods• <strong>Affymetrix</strong>– Location dependent background based on grids• I will refer to this as the MAS 5 background– Originally proposed subtracting MM from PM butthis is problematic because as many as a third <strong>of</strong>MM’s are greater than the respective PM• No longer used– Now uses what they refer to as the IdealMismatch which is MM when possible andsomething else when not possible (designed sothat there is now no negatives)• Call this IMM


Original RMA Background• Convolution model is suggested by lookingat density <strong>of</strong> observed empirical distributions


Convolution Model• O = S + N– O is observed PM, S is signal (assumedexponential), N is noise (assumed normal,truncated at zero)• Correction is then⎛ a ⎞ ⎛ o − a ⎞φ ⎜ ⎟ − φ ⎜ ⎟E ( )⎝ b= = +⎠ ⎝ bS O o a b⎠⎛ a ⎞ ⎛ o − a ⎞Φ ⎜ ⎟ −Φ⎜ ⎟ −⎝ b ⎠ ⎝ b ⎠a = o − − , b =2μ σ α σ1


A Standard Curve AdjustmentBased on Spike-in Information• Observes that there is acurve that relatesobserved expressionand spike-inconcentration. The idealwould be to have alinear relationshipbetween concentrationand computedexpression. The curvegives us aconcentrationdependent adjustment


What About Non Spike-ins?• We don’t know a concentration for mostprobesets. If we did, or if we had a variablethat related to concentration, the adjustmentwould be easy to perform• Fit the following modelyy( k )= α + ε( k) ( k)1i i i( k) k ( k)= + +( k) ( )2 iαi γ ε'i•Whereyy=log( k) ( k)1i2 i=logPMMM( k) ( k)2i2 i


γRelates to Concentration


Establishing a RelationshipγBetween and Concentration


The Two Curves Yield anAdjustment Curve


Normalization Methods• Methods already compared in Bolstad et al(2003)• Baseline (normalized using reference chip)– Scaling (<strong>Affymetrix</strong>)– Non linear (Li-Wong)• Complete data (no reference chip,information from all arrays used)– Quantile normalization (Bolstad et al 2003)– Contrast (Åstrand, 2003)– Cyclic Loess


Why Quantile Normalization?• Quantile normalization found to performacceptably in reducing variance withoutdrastic bias effects• Quantile normalization is fast


RMA Model• To each probeset (k), with i being number <strong>of</strong>probes and j being number <strong>of</strong> chips, fit the model:y= α + β + ε( k) ( k) ( k) ( k)ij i j ijα( k) ( k)whereiis a probe effect andjis the( k )log gene expression. yijis the log2 backgroundadjusted and normalized PM intensity• Different ways to fit this model– Median polish – quick– Robust linear model – yields some good qualitydiagnostic toolsβ


Basic RMA modelLetthenyij= log N B2( ( PM ))ijy = m+ α + β + εij i j ijwhereα iβ jis probe-effectis chip-effect ( m + β jis log2 geneexpression on array j)Median-polish imposes constraintsmedianα= median β = 0imedian ε = median ε = 0i ij j ijj


Advantages/Disadvantages <strong>of</strong> RMA/Median polish• Advantages–Fast– Gene expression measures perform favorablywhen compared with MAS 5.0, Li-Wong MBEI– Robust against outliers• Disadvantages– No standard error estimates– No algorithmic flexibility to fit alternative models


<strong>Probe</strong> <strong>Level</strong> Models are• RMA methodbased on RMA– Convolution Model Background– Quantile Normalization– Summarization using a robust multi-chipmodel on the log scale. Model is fittedusing the median polish algorithm on aprobeset by probeset basis


Comparing the backgroundmethods• Using an <strong>Affymetrix</strong> spike-in experiment weshall examine– Observed vs spike-in concentration– Observed vs expected fold change– Composite M vs A plots– ROC curves• In each case we will compute expressionvalues use standard RMA methodology. (iequantile normalization, median polishsummarization)


Assessing Bias: ObservedExpression vs Spike-inConcentrationSlope None RMA MAS 5 IMM MAS5/IMMAll 0.493 0.63 0.589 0.69 0.695Mid 0.665 0.784 0.751 0.82 0.82Low 0.184 0.376 0.318 0.52 0.563High 0.329 0.33 0.327 0.295 0.291S.C.A.0.8561.0410.6310.256


Assessing Bias: ObservedFold-change versusExpected Fold-changeSlope None RMA MAS 5 IMM MAS5/ S.C.A.IMMAll 0.484 0.624 0.583 0.683 0.692 0.847


Assessing Variability:M vs A plots• Vertical Axis is M: log2 fold-change.• Horizontal Axis is A: average expression value (onlog2 scale) aka geometric average.• Ideally non differential genes tight about M=0


Detecting DifferentialExpression: ROC Curves


Summary <strong>of</strong> Trade-<strong>of</strong>fsBackgroundMethodDetect DifferentialGenesAccurateestimates <strong>of</strong> FoldchangeNo Background Good PoorRMA Good PoorMAS 5.0 Good PoorIdeal Mismatch Poor GoodMAS5.0/IdealMM Poor GoodStandard CurveAdjustmentGoodGood


Comparing theNormalization Methods• Want to reduce variation but at the sametime we do not want to introduce any bias• First a quick examination <strong>of</strong> the expressionvalues by array• Using same spike-in experiment as before,this time no background correction, onlynormalization and median polishsummarization.


Scaling is Not Sufficient


Variability <strong>of</strong> Non-DifferentialGenes is Reduced


Little effect on Spike-insMethod All Low Mid High FCNoNormalization0.493(0.845)0.185(0.148)0.664(0.733)0.328(0.207)0.484(0.952)Quantile0.493(0.851)0.184(0.153)0.665(0.741)0.329(0.224)0.484(0.955)Scaling0.493(0.852)0.186(0.156)0.663(0.742)0.33(0.225)0.484(0.954)


ROC Curves


Comparing EstablishedExpression Measures


Slope ValueAll 0.493Mid 0.665Low 0.184High 0.329


Slope ValueAll 0.63Mid 0.784Low 0.376High 0.33


Slope ValueAll 0.589Mid 0.751Low 0.318High 0.327


Slope ValueAll 0.69Mid 0.82Low 0.52High 0.295


Slope ValueAll 0.695Mid 0.82Low 0.563High 0.291


Slope ValueAll 0.856Mid 1.041Low 0.631High 0.256


Slope: 0.484


Slope: 0.624


Slope: 0.583


Slope: 0.683


Slope: 0.692


Slope: 0.847


What About the TreatmentEffect Model?• With treatment effect model, have moreobservations contributing to estimationfor each element <strong>of</strong> the variancecovariance matrix• Much quicker to fit


We Will Focus on TwoParticular PLM• Array effect modelArray EffectPre-processedLog PM intensity• Treatment effect modely = α + β + εij i j ijy = α + τ + εij i l ijjIn both casesI∑i=1αi=0<strong>Probe</strong> EffectTreatment Effect

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!