Probe-Level Analysis of Affymetrix GeneChip Microarray Data
Clemson version - Ben Bolstad
Clemson version - Ben Bolstad
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>Probe</strong>-<strong>Level</strong> <strong>Analysis</strong> <strong>of</strong><strong>Affymetrix</strong> <strong>GeneChip</strong><strong>Microarray</strong> <strong>Data</strong>Ben Bolstadhttp://www.stat.berkeley.edu/~bolstadClemson UniversityJanuary 6, 2005
Outline for Today's Talk• A brief introduction to the <strong>Affymetrix</strong><strong>GeneChip</strong> Technology• Pre-processing– Background Correction– Normalization• <strong>Probe</strong> <strong>Level</strong> Modeling– Models– Applications
The Human Genome• The cell is the fundamental working unit <strong>of</strong> everyliving organism.• Humans: trillions <strong>of</strong> cells (metazoa);other organisms like yeast: one cell (protozoa).• Cells are <strong>of</strong> many different types (e.g. blood, skin,nerve cells), but all can be traced back to a singlecell, the fertilized egg.
Genes• The human genome is distributed along 23 pairs <strong>of</strong>chromosomes.– 22 autosomal pairs;– the sex chromosome pair, XX for females and XYfor males.• In each pair, one chromosome is paternallyinherited, the other maternally inherited.• Chromosomes are made <strong>of</strong> compressed andentwined DNA.• A (protein-coding) gene is a segment <strong>of</strong>chromosomal DNA that directs the synthesis <strong>of</strong> aprotein.
DNA• A deoxyribonucleic acid or DNA molecule is adouble-stranded polymer composed <strong>of</strong> fourbasic molecular units called nucleotides.• Each nucleotide comprises a phosphate group,a deoxyribose sugar, and one <strong>of</strong> four nitrogenbases: adenine (A), guanine (G), cytosine (C),and thymine (T).• The two chains are held together by hydrogenbonds between nitrogen bases.• Base-pairing occurs according to the followingrule: G pairs with C, and A pairs with T.
Exons and introns• Genes comprise only about 2% <strong>of</strong> the humangenome; the rest consists <strong>of</strong> non-codingregions, whose functions may includeproviding chromosomal structural integrityand regulating when, where, and in whatquantity proteins are made (regulatoryregions).• The terms exon and intron refer to coding(translated into a protein) and non-codingDNA, respectively.
Differential expression• Each cell contains a complete copy <strong>of</strong> theorganism's genome.• Cells are <strong>of</strong> many different types and statesE.g. blood, nerve, and skin cells, dividingcells, cancerous cells, etc.• What makes the cells different?• Differential gene expression, i.e., when,where, and in what quantity each gene isexpressed.• On average, 40% <strong>of</strong> our genes are expressedat any given time.
Functional genomics• The various genome projects have yieldedthe complete DNA sequences <strong>of</strong> manyorganisms. E.g. human, mouse, yeast, fruitfly,etc.• Human: 3 billion base-pairs, 30-40 thousandgenes.• Challenge: go from sequence to function, i.e.,define the role <strong>of</strong> each gene and understandhow the genome functions as a whole.
Central dogma• The expression <strong>of</strong> the genetic information stored in theDNA molecule occurs in two stages:(i) transcription, during which DNA is transcribed intomRNA;(ii) translation, during which mRNA is translated toproduce a protein.DNA mRNA protein• Other important aspects <strong>of</strong> regulation: methylation,alternative splicing, etc.• The correspondence between DNA's four-letteralphabet and a protein's twenty-letter alphabet isspecified by the genetic code, which relates nucleotidetriplets to amino acids.
RNA• A ribonucleic acid or RNA molecule is a nucleicacid similar to DNA, but– single-stranded;– ribose sugar rather than deoxyribose sugar;– uracil (U) replaces thymine (T) as one <strong>of</strong> thebases.• RNA plays an important role in protein synthesisand other chemical activities <strong>of</strong> the cell.• Several classes <strong>of</strong> RNA molecules, includingmessenger RNA (mRNA), transfer RNA (tRNA),ribosomal RNA (rRNA), and other small RNAs.
Idea: measure the amount <strong>of</strong> mRNA to see which genes arebeing expressed in (used by) the cell.Measuring protein might be better, but is currently harder.
Brief TechnologyOverview• High densityoligonucleotide arraytechnology as developedby <strong>Affymetrix</strong>http://www.affymetrix.com• Known as the <strong>GeneChip</strong>Some overview images courtesy <strong>of</strong> <strong>Affymetrix</strong>.
<strong>Probe</strong>s and <strong>Probe</strong>sets
Two <strong>Probe</strong> TypesReference SequenceTAGGTCTGTATGACAGACACAAAGAAGATGCAGACATAGTGTCTGTGTTTCTTCTCAGACATAGTGTGTGTGTTTCTTCTPM: the Perfect MatchMM: the Mismatch
Constructing the Chip
Sample Preparation
Hybridization to the Chip
The Chip is Scanned
Chip dat file – checkered board – close up pixel selection
Chip cel file – checkered boardCourtesy: F. Colin
<strong>Level</strong>s <strong>of</strong> <strong>Analysis</strong>• <strong>Probe</strong> level– Unprocessed data (or minimal processing)– <strong>Data</strong>set is larger– Often ignored or viewed as a “black box”.• High level– Processed data– <strong>Data</strong>set is smaller– Most analysis seems to start here
High-<strong>Level</strong> <strong>Analysis</strong>• Clustering/Classification• Pathway <strong>Analysis</strong>• Cell Cycle• Gene function• Anything where a more biologicalinterpretation is desiredThese are important topics but such matterswill not be discussed further in today's talk.
<strong>Probe</strong>-<strong>Level</strong> <strong>Analysis</strong>• What is <strong>Probe</strong>-level analysis?– <strong>Analysis</strong> and manipulation <strong>of</strong> probe intensity data• Expression calculation: Background, Normalization,Summarization• Fitting models to probe intensity data• Quality control diagnostics• Why do we do it?– Hopefully it will allow us to produce better, morebiologically meaningful gene expression values– We want accurate (low bias) and precise (lowvariance) gene expression estimates– Is there additional information at the probe-levelthat we might otherwise throw away?
Background/SignalAdjustment• A method which does some or all <strong>of</strong> thefollowing– Corrects for background noise, processing effects– Adjusts for cross hybridization– Adjust estimated expression values to fall onproper scale• <strong>Probe</strong> intensities are used in backgroundadjustment to compute correction (unlikecDNA arrays where area surrounding spotmight be used)
RMA Background Approach• Convolution Model= +ObservedOSignalS( )NoiseNExp α (2N μ,σ )
ECorrection is given by( )SO= o = a+ba= o− − , b=2μ σ α σφa o−a−φb b⎛ ⎞ ⎛ ⎞⎜⎝⎟⎠⎜⎝⎟⎠⎛a⎞ ⎛o−a⎞Φ ⎜ ⎟+Φ⎜ ⎟−1b b⎝ ⎠ ⎝ ⎠
Normalization“Non-biological factors can contribute to thevariability <strong>of</strong> data ... In order to reliablycompare data from multiple probe arrays,differences <strong>of</strong> non-biological origin must beminimized.“ 1• Normalization is a process <strong>of</strong> reducingunwanted variation across chips. It may useinformation from multiple chips1 <strong>GeneChip</strong> 3.1 Expression <strong>Analysis</strong> Algorithm Tutorial, <strong>Affymetrix</strong> technical support
Non-Biological Variability5 scanners for 6 dilution groups
Non-linear normalizationneededUnnormalizedScaledA Non-linearNormalization
Quantile Normalization• Normalize so that the quantiles <strong>of</strong> each chip areequal. Simple and fast algorithm. Goal is togive same distribution to each chip.OriginalDistributionTargetDistribution
Sort Sort columns <strong>of</strong> <strong>of</strong> originalmatrixTake averages across rowsTake averages across rowsSet Set average as as value for forAll All elements in in the the row rowUnsort columns <strong>of</strong> <strong>of</strong>matrix to to original order⎡1 5 3 5⎤ ⎡1 1 1 5⎤⎢2 1 6 7⎥ ⎢2 2 2 6⎥⎢ ⎥ → ⎢ ⎥⎢3 2 2 6⎥ ⎢3 5 3 7⎥⎢ ⎥ ⎢ ⎥⎣4 6 1 8⎦ ⎣4 6 6 8⎦⎡1 1 1 5⎤ ⎡ 2 ⎤⎢2 2 2 6⎥ ⎢3⎥⎢ ⎥ → ⎢ ⎥⎢3 5 3 7⎥ ⎢4.5⎥⎢ ⎥ ⎢ ⎥⎣4 6 6 8⎦ ⎣ 6 ⎦⎡ 2 ⎤ ⎡ 2 2 2 2 ⎤⎢3⎥ ⎢3 3 3 3⎥⎢ ⎥ → ⎢ ⎥⎢4.5⎥ ⎢4.5 4.5 4.5 4.5⎥⎢ ⎥ ⎢ ⎥⎣ 6 ⎦ ⎣ 6 6 6 6 ⎦⎡ 2 2 2 2 ⎤ ⎡ 2 4.5 4.5 2 ⎤⎢3 3 3 3⎥ ⎢3 2 6 4.5⎥⎢ ⎥ → ⎢ ⎥⎢4.5 4.5 4.5 4.5⎥ ⎢4.5 3 3 3 ⎥⎢ ⎥ ⎢ ⎥⎣ 6 6 6 6 ⎦ ⎣ 6 6 2 6 ⎦
It Reduces VariabilityExpression ValuesFold changeAlso no serious bias effects. For more see Bolstad et al (2003)
General <strong>Probe</strong> <strong>Level</strong> Modelyij= f( X)+εij• Where f(X) is function <strong>of</strong> factor (and possiblycovariate) variables (our interest will be inlinear functions)• y ij is a pre-processed probe intensity(usually log scale)• Assume that E⎡ ⎤ = 0Var⎣ε ij⎦⎡⎣ε⎤ij ⎦ = σ2
Parallel Behavior SuggestsMulti-chip Model
<strong>Probe</strong> Pattern SuggestsIncluding <strong>Probe</strong>-Effect
Also Want Robustness
Summarization• Problem: Calculating gene expression values.• How do we reduce the 11-20 probe intensities foreach probeset on to a gene expression value?• Our Approach– RMA – a robust multi-chip linear model fit on the log scale• Some Other Approaches– Single chip• AvDiff (<strong>Affymetrix</strong>) – no longer recommended for use due to manyflaws• Mas 5.0 (<strong>Affymetrix</strong>) – use a 1 step Tukey&( biweight to combinethe probe intensities in log scale– Multiple Chip• MBEI (Li-Wong dChip) – a multiplicative model on natural scale
The Three Steps <strong>of</strong> RMA1. Convolution Background2. Quantile Normalization3. Linear model on the log2 scale fit robustly.• S<strong>of</strong>tware– Bioconductor affy packagewww.bioconductor.org– RMAExpresswww.stat.berkeley.edu/~bolstad/RMAExpress– Also available in some commercial s<strong>of</strong>tware egS+ ArrayAnalzyer, Iobian GeneTraffic
RMA mostly does well inpracticeDetecting Differential ExpressionNot noisy in low intensitiesRMAMAS 5.0
One DrawbackRMA MAS 5.0Some fixes for this are being developed see GCRMA(Irizarry and Wu, JHU)
The RMA modely = m+ α + β + εij i j ijwhereyijα iβj= log N B2( ( PM ))ijis a probe-effect i= 1,…,Iis chip-effect ( m + βj is log2 geneexpression on array j) j=1,…,J
Median Polish Algorithmy11 L y1J0M O M MyI1L yIJ00 L 0 0ImposesConstraintsSweep ColumnsIteratemedianα= median β = 0iSweep Rowsmedian ε = median ε = 0i ij j ijjˆ ε L ˆ ε αˆ11 1J1M O M Mˆ ε ˆ ˆI1L ε αIJ Iˆ ˆ β L β mˆ1J
• Advantages–Fast– Very robust• DisadvantagesMedian Polish– No algorithmic flexibility to fit alternativemodels– No standard error estimates
An Alternative Method forFitting a PLM• Robust regression using M-estimation• In this talk, we will use Huber’s influencefunction ψ . The s<strong>of</strong>tware handles manymore.• Fitting algorithm is IRLS with weightsdependent on current residualsψ rr( )• S<strong>of</strong>tware for fitting such models is part <strong>of</strong>affyPLM package <strong>of</strong> Bioconductor
Variance CovarianceEstimates• Suppose model is Y = Xβ + ε• Huber (1981) gives three forms for estimatingvariance covariance matrix2κ1/( n−p) ψ ( r)⎡⎢1/n⎣∑i∑iψ ′( r)i⎤⎥⎦i22(TX X )−11/( n−p)ψ riiκ1/ n ψ ′ r∑i∑( )( )i2W−11 1/−( ) ( ) (Tn−p ∑ψr )iW X X Wκi2 1 −1We will use this formT 'W = X Ψ X
We Will Focus on theSummarization PLM• Array effect modelArray EffectPre-processedLog PM intensityy = α + β + εij i j ijWith constraintI∑i=1αi=<strong>Probe</strong> Effect0
Detecting DifferentialExpression• Problem: Given an experiment with twotreatment groups correctly identify thedifferential genes without incorrectlychoosing non differential genes.• Question 1: How do different methods forDDE compare?• Question 2: Can we do better using PLM’s?
How Do We Know WhichGenes are Differential?• Spike-in datasets.– Transcripts at known concentrationsdiffering across arrays. Commonbackground cRNA.– Typically, Latin Square design• <strong>Affymetrix</strong> HG-U95A• GeneLogic AML – bacterial spike-ins• GeneLogic Tonsil – bacterial spike-ins
<strong>Affymetrix</strong> Spike-in <strong>Data</strong>• 59 chips. All but 1 <strong>of</strong> the rows are done as triplicates37777 684 1597 38734 39058 36311 36889 1024 36202 36085 40322 407 1091 1708A 0 0.25 0.5 1 2 4 8 16 32 64 128 0 512 1024B 0.25 0.5 1 2 4 8 16 32 64 128 256 0.25 1024 0C 0.5 1 2 4 8 16 32 64 128 256 512 0.5 0 0.25D 1 2 4 8 16 32 64 128 256 512 1024 1 0.25 0.5E 2 4 8 16 32 64 128 256 512 1024 0 2 0.5 1F 4 8 16 32 64 128 256 512 1024 0 0.25 4 1 2G 8 16 32 64 128 256 512 1024 0 0.25 0.5 8 2 4H 16 32 64 128 256 512 1024 0 0.25 0.5 1 16 4 8I 32 64 128 256 512 1024 0 0.25 0.5 1 2 32 8 16J 64 128 256 512 1024 0 0.25 0.5 1 2 4 64 16 32K 128 256 512 1024 0 0.25 0.5 1 2 4 8 128 32 64L 256 512 1024 0 0.25 0.5 1 2 4 8 16 256 64 128M 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256N 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256O 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256P 512 1024 0 0.25 0.5 1 2 4 8 16 32 512 128 256Q 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512R 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512S 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512T 1024 0 0.25 0.5 1 2 4 8 16 32 64 1024 256 512
Testing for DifferentialExpression• On a probeset by probeset basis compute atest statistic• Should include something related to observedFC and some variability estimate• A “t” statistic: something <strong>of</strong> the formt=XSE
Fold ChangeFC = X − XlmWhereXl=∑∑βjInd( j∈group l)Ind( j∈group l)
Simple t-statistict=Xlsnm2s2l ml− X+nm
“Robust” t-statistict=X% − X%lm2s2l ms% %+n nlm• Use medians in place <strong>of</strong> means• Use MAD in place <strong>of</strong> standard deviation
Simple Moderated t-statistict=snl2s2l mlX− Xmm+ +nsmed• s2 2slsmmed is median + across all genes• Analogous to SAMnlnm
Limma “ebayes” t-statistic• Generalization <strong>of</strong> Bayesian method <strong>of</strong>Lonnstadt and Speed (2002) to thegeneral linear model case• An alternative and much moresophisticated moderated t-statistic.Variability is estimated using a posterior• From the limma Bioconductor package.
<strong>Probe</strong> <strong>Level</strong> Model teststatisticsΣ• Suppose that is component <strong>of</strong> the variancecovariancematrix related to β• Let c be the contrast vector defined such thatthe j th element <strong>of</strong> c is−1n l1n mif array j is in group lif array j is in group m0 otherwise
<strong>Probe</strong> <strong>Level</strong> Model teststatisticstPLM.1=J∑j=1Tc βc2jΣjjPLM.2Tt = c βTc Σc
A First Comparison• 8 chips from <strong>Affymetrix</strong> HG-U95A spike-indataset– 4 arrays for each <strong>of</strong> two concentration pr<strong>of</strong>iles• Fit an array effect model to all 8 chips– Compare the performance <strong>of</strong> the differentmethods by looking at all comparisons• 1 vs 1• 2 vs 2• 3 vs 3• 4 vs 4
What Happens as the Number<strong>of</strong> Arrays Increases?• Expand comparison to all 24 Arrays withsame concentration pr<strong>of</strong>iles from<strong>Affymetrix</strong> HG-U95A spike-in dataset• Fit an array effect model to all 24 arrays• Look at comparisons between equalnumber <strong>of</strong> chips
A Larger Comparison• Look at the entire 59 chips for<strong>Affymetrix</strong> HG-U95A spike-in dataset• Examine two cases. After standardpreprocessing– Fit models for each pairwise comparison(individual models)– Fit a model to all 59 chips (single model)• There are 91 pairwise comparisions
ResultsMethod Individual Models Single Model0% FP 5% FP AUC 0% FP 5% FP AUCFC 0.451 0.985 0.975 0.444 0.982 0.971Std 0.323 0.982 0.956 0.301 0.975 0.952Robust 0.16 0.939 0.857 0.144 0.935 0.852Mod 0.437 0.987 0.975 0.413 0.98 0.97PLM.1 0.653 0.991 0.979 0.54 0.951 0.93PLM.2 0.657 0.991 0.979 0.539 0.951 0.93Ebayes 0.514 0.988 0.978 0.45 0.986 0.974
More Spike-in <strong>Data</strong>sets• Two GeneLogic Spike-in datasets– Tonsil dataset (36 arrays)– AML dataset (34 arrays)• In each case use single models fitted toall arrays
What is Going On Here?• Examine residuals stratified byconcentration group– Spike-ins– Randomly chosen non-differential probesets atlow, medium and high average expression
<strong>Affymetrix</strong> Spike-ins
Low Non-Differential
Middle Non-Differential
High Non-Differential
GeneLogic AML Spike-ins
GeneLogic Tonsil dataset
How About With More“Real” <strong>Data</strong>?GeneLogic Dilution/Mixture <strong>Data</strong>setLearning SetLiver30xVSCNS30x400 top genes“truth”75:255xVS25:755xTest SetMixture <strong>Data</strong>Dilution Series <strong>Data</strong>
Mixture <strong>Data</strong> ResultsMethod 3 vs 3 4 vs 4 5 vs 50% FP 5% FP AUC 0% FP 5% FP AUC 0% FP 5% FP AUCFC 0.007 0.886 0.697 0.008 0.888 0.703 0.005 0.888 0.708Std 0.004 0.793 0.53 0.008 0.872 0.626 0.018 0.902 0.675Robust 0.002 0.485 0.271 0.005 0.747 0.49 0.01 0.743 0.488Mod 0.007 0.908 0.697 0.002 0.932 0.735 0 0.948 0.76PLM.1 0.056 0.943 0.751 0.057 0.947 0.756 0.056 0.95 0.76PLM.2 0.057 0.943 0.752 0.057 0.948 0.758 0.058 0.95 0.761Ebayes 0.001 0.918 0.744 0 0.933 0.761 0 0.943 0.776
Moderation for the PLMtest statisticMethod 5 vs 50% FP 5% FP AUCPLM.2 0.058 0.95 0.761Ebayes 0 0.943 0.776PLM Moderated 0.053 0.963 0.795
Quality Assessment using PLM• PLM quantities useful for assessing chipquality– Weights– Residuals– Standard Errors• Expression values relative to median chip
Pseudo-chip imagesWeightsResidualsPositiveResidualsNegativeResiduals
An Image Gallery“Crop Circles”“Ring <strong>of</strong> Fire”“Tricolor”http://www.stat.berkeley.edu/~bolstad/PLMImageGallery/
NUSE PlotsNormalizedUnscaledStandardErrors
RLE PlotsRelativeLogExpression
Discordant <strong>Probe</strong>s
Discordant Arrays
Morals for Today’s Talk• The “black box” that is low-level processingis important. Your processing can have asignificant effect on your final expressionvalues and resulting conclusions.• Testing for differential expression at theprobe level can improve upon probeset-leveltesting.• There is interesting quality assessmentinformation at the probe-level
Ongoing Work in this Area• Technology changes: What still works?What doesn’t?• Other probe-level models (eg Introducesequence information, MM’s)• Can we introduce the biology into theprocess (gene families? etc)
Acknowledgements• Terry Speed (UC Berkeley)• Francois Colin (UC Berkeley/UCSF)• Julia Brettschneider (UC Berkeley)• Rafael Irizarry (Johns Hopkins)• Bioconductor Corehttp://www.bioconductor.org
Additional Slides
Focusing on a Single<strong>GeneChip</strong> Cell Location
Chip dat file – checkered board – oligo B2Courtesy: F. Colin
Chip dat file – checkered board – close up w/ gridCourtesy: F. Colin
From Chip To <strong>Data</strong>
Constructing a geneexpression measure
Computing ExpressionMeasures:A Three Step Procedure1. Background/Signal adjustment (B)2. Normalization (N)3. Summarization (S)Let be cel file data from multiple arrays thenXXExpression values = S(N(B( )))
Background Methods• <strong>Affymetrix</strong>– Location dependent– Ideal mismatch• RMA– Convolution model• Other– Standard curve adjustment– GCRMA (Wu, Irizzary et al 2003)+ =
Background Signal Methods• <strong>Affymetrix</strong>– Location dependent background based on grids• I will refer to this as the MAS 5 background– Originally proposed subtracting MM from PM butthis is problematic because as many as a third <strong>of</strong>MM’s are greater than the respective PM• No longer used– Now uses what they refer to as the IdealMismatch which is MM when possible andsomething else when not possible (designed sothat there is now no negatives)• Call this IMM
Original RMA Background• Convolution model is suggested by lookingat density <strong>of</strong> observed empirical distributions
Convolution Model• O = S + N– O is observed PM, S is signal (assumedexponential), N is noise (assumed normal,truncated at zero)• Correction is then⎛ a ⎞ ⎛ o − a ⎞φ ⎜ ⎟ − φ ⎜ ⎟E ( )⎝ b= = +⎠ ⎝ bS O o a b⎠⎛ a ⎞ ⎛ o − a ⎞Φ ⎜ ⎟ −Φ⎜ ⎟ −⎝ b ⎠ ⎝ b ⎠a = o − − , b =2μ σ α σ1
A Standard Curve AdjustmentBased on Spike-in Information• Observes that there is acurve that relatesobserved expressionand spike-inconcentration. The idealwould be to have alinear relationshipbetween concentrationand computedexpression. The curvegives us aconcentrationdependent adjustment
What About Non Spike-ins?• We don’t know a concentration for mostprobesets. If we did, or if we had a variablethat related to concentration, the adjustmentwould be easy to perform• Fit the following modelyy( k )= α + ε( k) ( k)1i i i( k) k ( k)= + +( k) ( )2 iαi γ ε'i•Whereyy=log( k) ( k)1i2 i=logPMMM( k) ( k)2i2 i
γRelates to Concentration
Establishing a RelationshipγBetween and Concentration
The Two Curves Yield anAdjustment Curve
Normalization Methods• Methods already compared in Bolstad et al(2003)• Baseline (normalized using reference chip)– Scaling (<strong>Affymetrix</strong>)– Non linear (Li-Wong)• Complete data (no reference chip,information from all arrays used)– Quantile normalization (Bolstad et al 2003)– Contrast (Åstrand, 2003)– Cyclic Loess
Why Quantile Normalization?• Quantile normalization found to performacceptably in reducing variance withoutdrastic bias effects• Quantile normalization is fast
RMA Model• To each probeset (k), with i being number <strong>of</strong>probes and j being number <strong>of</strong> chips, fit the model:y= α + β + ε( k) ( k) ( k) ( k)ij i j ijα( k) ( k)whereiis a probe effect andjis the( k )log gene expression. yijis the log2 backgroundadjusted and normalized PM intensity• Different ways to fit this model– Median polish – quick– Robust linear model – yields some good qualitydiagnostic toolsβ
Basic RMA modelLetthenyij= log N B2( ( PM ))ijy = m+ α + β + εij i j ijwhereα iβ jis probe-effectis chip-effect ( m + β jis log2 geneexpression on array j)Median-polish imposes constraintsmedianα= median β = 0imedian ε = median ε = 0i ij j ijj
Advantages/Disadvantages <strong>of</strong> RMA/Median polish• Advantages–Fast– Gene expression measures perform favorablywhen compared with MAS 5.0, Li-Wong MBEI– Robust against outliers• Disadvantages– No standard error estimates– No algorithmic flexibility to fit alternative models
<strong>Probe</strong> <strong>Level</strong> Models are• RMA methodbased on RMA– Convolution Model Background– Quantile Normalization– Summarization using a robust multi-chipmodel on the log scale. Model is fittedusing the median polish algorithm on aprobeset by probeset basis
Comparing the backgroundmethods• Using an <strong>Affymetrix</strong> spike-in experiment weshall examine– Observed vs spike-in concentration– Observed vs expected fold change– Composite M vs A plots– ROC curves• In each case we will compute expressionvalues use standard RMA methodology. (iequantile normalization, median polishsummarization)
Assessing Bias: ObservedExpression vs Spike-inConcentrationSlope None RMA MAS 5 IMM MAS5/IMMAll 0.493 0.63 0.589 0.69 0.695Mid 0.665 0.784 0.751 0.82 0.82Low 0.184 0.376 0.318 0.52 0.563High 0.329 0.33 0.327 0.295 0.291S.C.A.0.8561.0410.6310.256
Assessing Bias: ObservedFold-change versusExpected Fold-changeSlope None RMA MAS 5 IMM MAS5/ S.C.A.IMMAll 0.484 0.624 0.583 0.683 0.692 0.847
Assessing Variability:M vs A plots• Vertical Axis is M: log2 fold-change.• Horizontal Axis is A: average expression value (onlog2 scale) aka geometric average.• Ideally non differential genes tight about M=0
Detecting DifferentialExpression: ROC Curves
Summary <strong>of</strong> Trade-<strong>of</strong>fsBackgroundMethodDetect DifferentialGenesAccurateestimates <strong>of</strong> FoldchangeNo Background Good PoorRMA Good PoorMAS 5.0 Good PoorIdeal Mismatch Poor GoodMAS5.0/IdealMM Poor GoodStandard CurveAdjustmentGoodGood
Comparing theNormalization Methods• Want to reduce variation but at the sametime we do not want to introduce any bias• First a quick examination <strong>of</strong> the expressionvalues by array• Using same spike-in experiment as before,this time no background correction, onlynormalization and median polishsummarization.
Scaling is Not Sufficient
Variability <strong>of</strong> Non-DifferentialGenes is Reduced
Little effect on Spike-insMethod All Low Mid High FCNoNormalization0.493(0.845)0.185(0.148)0.664(0.733)0.328(0.207)0.484(0.952)Quantile0.493(0.851)0.184(0.153)0.665(0.741)0.329(0.224)0.484(0.955)Scaling0.493(0.852)0.186(0.156)0.663(0.742)0.33(0.225)0.484(0.954)
ROC Curves
Comparing EstablishedExpression Measures
Slope ValueAll 0.493Mid 0.665Low 0.184High 0.329
Slope ValueAll 0.63Mid 0.784Low 0.376High 0.33
Slope ValueAll 0.589Mid 0.751Low 0.318High 0.327
Slope ValueAll 0.69Mid 0.82Low 0.52High 0.295
Slope ValueAll 0.695Mid 0.82Low 0.563High 0.291
Slope ValueAll 0.856Mid 1.041Low 0.631High 0.256
Slope: 0.484
Slope: 0.624
Slope: 0.583
Slope: 0.683
Slope: 0.692
Slope: 0.847
What About the TreatmentEffect Model?• With treatment effect model, have moreobservations contributing to estimationfor each element <strong>of</strong> the variancecovariance matrix• Much quicker to fit
We Will Focus on TwoParticular PLM• Array effect modelArray EffectPre-processedLog PM intensity• Treatment effect modely = α + β + εij i j ijy = α + τ + εij i l ijjIn both casesI∑i=1αi=0<strong>Probe</strong> EffectTreatment Effect