Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Ben Bolstad
http://www.stat.berkeley.edu/~bolstad
Clemson University
January 6, 2005

Outline for Today's Talk
• A brief introduction to the Affymetrix GeneChip Technology
• Pre-processing
– Background Correction
– Normalization
• Probe Level Modeling
– Models
– Applications

The Human Genome
• The cell is the fundamental working unit of every living organism.
• Humans: trillions of cells (metazoa); other organisms like yeast: one cell (unicellular).
• Cells are of many different types (e.g. blood, skin, nerve cells), but all can be traced back to a single cell, the fertilized egg.

Genes
• The human genome is distributed along 23 pairs of chromosomes.
– 22 autosomal pairs;
– the sex chromosome pair, XX for females and XY for males.
• In each pair, one chromosome is paternally inherited, the other maternally inherited.
• Chromosomes are made of compressed and entwined DNA.
• A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.

DNA
• A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides.
• Each nucleotide comprises a phosphate group, a deoxyribose sugar, and one of four nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
• The two chains are held together by hydrogen bonds between nitrogen bases.
• Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.

Exons and introns
• Genes comprise only about 2% of the human genome; the rest consists of non-coding regions, whose functions may include providing chromosomal structural integrity and regulating when, where, and in what quantity proteins are made (regulatory regions).
• The terms exon and intron refer to coding (translated into a protein) and non-coding DNA, respectively.

Differential expression
• Each cell contains a complete copy of the organism's genome.
• Cells are of many different types and states, e.g. blood, nerve, and skin cells, dividing cells, cancerous cells, etc.
• What makes the cells different?
• Differential gene expression, i.e., when, where, and in what quantity each gene is expressed.
• On average, 40% of our genes are expressed at any given time.

Functional genomics
• The various genome projects have yielded the complete DNA sequences of many organisms, e.g. human, mouse, yeast, fruit fly, etc.
• Human: 3 billion base-pairs, 30-40 thousand genes.
• Challenge: go from sequence to function, i.e., define the role of each gene and understand how the genome functions as a whole.

Central dogma
• The expression of the genetic information stored in the DNA molecule occurs in two stages:
(i) transcription, during which DNA is transcribed into mRNA;
(ii) translation, during which mRNA is translated to produce a protein.
DNA → mRNA → protein
• Other important aspects of regulation: methylation, alternative splicing, etc.
• The correspondence between DNA's four-letter alphabet and a protein's twenty-letter alphabet is specified by the genetic code, which relates nucleotide triplets to amino acids.

RNA
• A ribonucleic acid or RNA molecule is a nucleic acid similar to DNA, but
– single-stranded;
– ribose sugar rather than deoxyribose sugar;
– uracil (U) replaces thymine (T) as one of the bases.
• RNA plays an important role in protein synthesis and other chemical activities of the cell.
• Several classes of RNA molecules, including messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and other small RNAs.

Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell.
Measuring protein might be better, but is currently harder.

Brief Technology Overview
• High density oligonucleotide array technology as developed by Affymetrix, http://www.affymetrix.com
• Known as the GeneChip
Some overview images courtesy of Affymetrix.

Probes and Probesets

Two Probe Types
Reference Sequence: TAGGTCTGTATGACAGACACAAAGAAGATG
Probe sequences (two 25-mers that differ only at the middle, 13th base):
CAGACATAGTGTCTGTGTTTCTTCT
CAGACATAGTGTGTGTGTTTCTTCT
PM: the Perfect Match
MM: the Mismatch
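The relationship between the two probe types can be sketched in a few lines. This is an illustration only: it assumes the first 25-mer on the slide is the PM probe, and the helper name `mismatch_probe` is hypothetical. The MM probe is obtained by replacing the middle (13th) base of the PM probe with its complement.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm: str) -> str:
    """Return the MM probe: the PM probe with its middle base complemented.

    Hypothetical helper for illustration; for a 25-mer the middle base
    is the 13th base (index 12).
    """
    middle = len(pm) // 2
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

# Assumption: the first 25-mer shown on the slide is the PM probe.
pm = "CAGACATAGTGTCTGTGTTTCTTCT"
mm = mismatch_probe(pm)
```

Under that assumption, `mm` reproduces the second 25-mer shown on the slide.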

Constructing the Chip

Sample Preparation

Hybridization to the Chip

The Chip is Scanned

Chip dat file – checkered board – close up pixel selection

Chip cel file – checkered board
Courtesy: F. Colin

Levels of Analysis
• Probe level
– Unprocessed data (or minimal processing)
– Dataset is larger
– Often ignored or viewed as a “black box”.
• High level
– Processed data
– Dataset is smaller
– Most analysis seems to start here

High-Level Analysis
• Clustering/Classification
• Pathway Analysis
• Cell Cycle
• Gene function
• Anything where a more biological interpretation is desired
These are important topics but such matters will not be discussed further in today's talk.

Probe-Level Analysis
• What is Probe-level analysis?
– Analysis and manipulation of probe intensity data
  • Expression calculation: Background, Normalization, Summarization
  • Fitting models to probe intensity data
  • Quality control diagnostics
• Why do we do it?
– Hopefully it will allow us to produce better, more biologically meaningful gene expression values
– We want accurate (low bias) and precise (low variance) gene expression estimates
– Is there additional information at the probe-level that we might otherwise throw away?

Background/Signal Adjustment
• A method which does some or all of the following
– Corrects for background noise, processing effects
– Adjusts for cross hybridization
– Adjusts estimated expression values to fall on proper scale
• Probe intensities are used in background adjustment to compute the correction (unlike cDNA arrays, where the area surrounding the spot might be used)

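RMA Background Approach
• Convolution Model: Observed = Signal + Noise
$$O = S + N, \qquad S \sim \mathrm{Exp}(\alpha), \qquad N \sim N(\mu, \sigma^2)$$

• Correction is given by
$$E(S \mid O = o) = a + b\,\frac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{o-a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{o-a}{b}\right) - 1}, \qquad a = o - \mu - \sigma^2\alpha, \quad b = \sigma$$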
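A minimal sketch of the correction E(S | O = o), assuming the parameters α, μ and σ have already been estimated (the slides do not show how). The function name `rma_background_correct` is hypothetical and is not the affy package API; φ and Φ are the standard normal density and distribution function.

```python
import math

def norm_pdf(z: float) -> float:
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z: float) -> float:
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rma_background_correct(o: float, alpha: float, mu: float, sigma: float) -> float:
    """E(S | O = o) under the convolution model O = S + N.

    alpha, mu and sigma are assumed to be given; estimating them is a
    separate step not covered by this sketch.
    """
    a = o - mu - sigma ** 2 * alpha
    b = sigma
    numerator = norm_pdf(a / b) - norm_pdf((o - a) / b)
    denominator = norm_cdf(a / b) + norm_cdf((o - a) / b) - 1.0
    return a + b * numerator / denominator

# Illustrative (made-up) parameter values.
high = rma_background_correct(1000.0, alpha=0.01, mu=100.0, sigma=20.0)
low = rma_background_correct(50.0, alpha=0.01, mu=100.0, sigma=20.0)
```

For intensities far above the noise level the correction is close to a = o − μ − σ²α; for intensities below μ it still returns a positive signal estimate.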

Normalization
“Non-biological factors can contribute to the variability of data ... In order to reliably compare data from multiple probe arrays, differences of non-biological origin must be minimized.” [1]
• Normalization is a process of reducing unwanted variation across chips. It may use information from multiple chips.
[1] GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support

Non-Biological Variability
5 scanners for 6 dilution groups

Non-linear normalization needed
[Figure panels: Unnormalized; Scaled; A Non-linear Normalization]

Quantile Normalization
• Normalize so that the quantiles of each chip are equal. Simple and fast algorithm. Goal is to give the same distribution to each chip.
[Figure: Original Distribution → Target Distribution]

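• Sort columns of original matrix
$$\begin{bmatrix} 1 & 5 & 3 & 5 \\ 2 & 1 & 6 & 7 \\ 3 & 2 & 2 & 6 \\ 4 & 6 & 1 & 8 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 1 & 1 & 5 \\ 2 & 2 & 2 & 6 \\ 3 & 5 & 3 & 7 \\ 4 & 6 & 6 & 8 \end{bmatrix}$$
• Take averages across rows
$$\begin{bmatrix} 1 & 1 & 1 & 5 \\ 2 & 2 & 2 & 6 \\ 3 & 5 & 3 & 7 \\ 4 & 6 & 6 & 8 \end{bmatrix} \rightarrow \begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 6 \end{bmatrix}$$
• Set average as value for all elements in the row
$$\begin{bmatrix} 2 \\ 3 \\ 4.5 \\ 6 \end{bmatrix} \rightarrow \begin{bmatrix} 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 \\ 4.5 & 4.5 & 4.5 & 4.5 \\ 6 & 6 & 6 & 6 \end{bmatrix}$$
• Unsort columns of matrix to original order
$$\begin{bmatrix} 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 \\ 4.5 & 4.5 & 4.5 & 4.5 \\ 6 & 6 & 6 & 6 \end{bmatrix} \rightarrow \begin{bmatrix} 2 & 4.5 & 4.5 & 2 \\ 3 & 2 & 6 & 4.5 \\ 4.5 & 3 & 3 & 3 \\ 6 & 6 & 2 & 6 \end{bmatrix}$$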
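The four steps can be sketched in plain Python, using the 4 × 4 example from the slide (rows are probes, columns are chips). The function name `quantile_normalize` is hypothetical, and ties within a column are not given any special treatment in this sketch.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns of a list-of-rows matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Step 1: for each column, the row indices in increasing order of value.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Step 2: average across the rows of the column-sorted matrix.
    row_means = [
        sum(columns[j][orders[j][r]] for j in range(n_cols)) / n_cols
        for r in range(n_rows)
    ]
    # Steps 3 and 4: assign each average to its rank, then put the
    # values back in the original row order of every column.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for rank, i in enumerate(orders[j]):
            result[i][j] = row_means[rank]
    return result

original = [
    [1, 5, 3, 5],
    [2, 1, 6, 7],
    [3, 2, 2, 6],
    [4, 6, 1, 8],
]
normalized = quantile_normalize(original)
```

After normalization every column contains the same set of values (2, 3, 4.5, 6), arranged according to the original ranks within that column.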

It Reduces Variability
[Figure panels: Expression Values; Fold change]
Also no serious bias effects. For more see Bolstad et al (2003)

General Probe Level Model
$$y_{ij} = f(X) + \varepsilon_{ij}$$
• Where $f(X)$ is a function of factor (and possibly covariate) variables (our interest will be in linear functions)
• $y_{ij}$ is a pre-processed probe intensity (usually log scale)
• Assume that $E[\varepsilon_{ij}] = 0$ and $\mathrm{Var}[\varepsilon_{ij}] = \sigma^2$

Parallel Behavior Suggests Multi-chip Model

Probe Pattern Suggests Including Probe-Effect

Also Want Robustness

Summarization
• Problem: Calculating gene expression values.
• How do we reduce the 11-20 probe intensities for each probeset to a single gene expression value?
• Our Approach
– RMA – a robust multi-chip linear model fit on the log scale
• Some Other Approaches
– Single chip
  • AvDiff (Affymetrix) – no longer recommended for use due to many flaws
  • MAS 5.0 (Affymetrix) – uses a 1-step Tukey biweight to combine the probe intensities on the log scale
– Multiple Chip
  • MBEI (Li-Wong dChip) – a multiplicative model on the natural scale

The Three Steps of RMA
1. Convolution Background
2. Quantile Normalization
3. Linear model on the log2 scale fit robustly.
• Software
– Bioconductor affy package, www.bioconductor.org
– RMAExpress, www.stat.berkeley.edu/~bolstad/RMAExpress
– Also available in some commercial software, e.g. S+ArrayAnalyzer, Iobion GeneTraffic

RMA mostly does well in practice
• Detecting Differential Expression
• Not noisy in low intensities
[Figure panels: RMA; MAS 5.0]

One Drawback
[Figure panels: RMA; MAS 5.0]
Some fixes for this are being developed; see GCRMA (Irizarry and Wu, JHU)

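The RMA model
$$y_{ij} = m + \alpha_i + \beta_j + \varepsilon_{ij}$$
where
• $y_{ij} = \log_2 N(B(PM_{ij}))$
• $\alpha_i$ is a probe-effect, $i = 1, \ldots, I$
• $\beta_j$ is a chip-effect, $j = 1, \ldots, J$ ($m + \beta_j$ is the log2 gene expression on array $j$)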

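Median Polish Algorithm
Start from the matrix of probe intensities, augmented with a column and a row of zeros:
$$\begin{bmatrix} y_{11} & \cdots & y_{1J} & 0 \\ \vdots & \ddots & \vdots & \vdots \\ y_{I1} & \cdots & y_{IJ} & 0 \\ 0 & \cdots & 0 & 0 \end{bmatrix}$$
• Sweep Rows
• Sweep Columns
• Iterate
The result is
$$\begin{bmatrix} \hat{\varepsilon}_{11} & \cdots & \hat{\varepsilon}_{1J} & \hat{\alpha}_1 \\ \vdots & \ddots & \vdots & \vdots \\ \hat{\varepsilon}_{I1} & \cdots & \hat{\varepsilon}_{IJ} & \hat{\alpha}_I \\ \hat{\beta}_1 & \cdots & \hat{\beta}_J & \hat{m} \end{bmatrix}$$
Imposes Constraints
$$\operatorname{median}_i \alpha_i = \operatorname{median}_j \beta_j = 0, \qquad \operatorname{median}_i \varepsilon_{ij} = \operatorname{median}_j \varepsilon_{ij} = 0$$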
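The row and column sweeps can be sketched as follows. This is a generic Tukey median polish under the assumption that rows are probes and columns are arrays; the function name `median_polish`, the iteration cap and the tolerance are illustrative choices, not the exact implementation in the affy package or RMAExpress.

```python
from statistics import median

def median_polish(y, max_iter=10, tol=1e-8):
    """Fit y[i][j] = m + alpha[i] + beta[j] + resid[i][j] by median polish."""
    n_probes, n_arrays = len(y), len(y[0])
    resid = [[float(v) for v in row] for row in y]
    m = 0.0
    alpha = [0.0] * n_probes   # probe effects
    beta = [0.0] * n_arrays    # chip effects
    for _ in range(max_iter):
        # Sweep rows: move each row median into the probe effect.
        row_med = [median(row) for row in resid]
        for i in range(n_probes):
            alpha[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(beta)
        beta = [b - shift for b in beta]
        m += shift
        # Sweep columns: move each column median into the chip effect.
        col_med = [median(resid[i][j] for i in range(n_probes))
                   for j in range(n_arrays)]
        for j in range(n_arrays):
            beta[j] += col_med[j]
            for i in range(n_probes):
                resid[i][j] -= col_med[j]
        shift = median(alpha)
        alpha = [a - shift for a in alpha]
        m += shift
        # Stop once the sweeps no longer change anything.
        if max(abs(d) for d in row_med + col_med) < tol:
            break
    return m, alpha, beta, resid

# Made-up, exactly additive data: m = 10, probe effects (-1, 0, 2),
# chip effects (-3, 0, 1).
y = [[10 + a + b for b in (-3, 0, 1)] for a in (-1, 0, 2)]
m_hat, alpha_hat, beta_hat, resid = median_polish(y)
expression = [m_hat + b for b in beta_hat]   # log2 expression per array
```

On exactly additive data the fit recovers the generating effects and leaves zero residuals; on real probe intensities the residuals carry the outlier information.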

Median Polish
• Advantages
– Fast
– Very robust
• Disadvantages
– No algorithmic flexibility to fit alternative models
– No standard error estimates

An Alternative Method for Fitting a PLM
• Robust regression using M-estimation
• In this talk, we will use Huber’s influence function ψ. The software handles many more.
• Fitting algorithm is IRLS with weights ψ(r)/r dependent on current residuals r
• Software for fitting such models is part of the affyPLM package of Bioconductor
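To illustrate IRLS with Huber weights ψ(r)/r, here is a deliberately reduced sketch that estimates a single location parameter rather than a full probe-level model. The tuning constant k = 1.345, the MAD-based scale and the function names are assumptions made for this example; none of this is the affyPLM implementation.

```python
from statistics import median

def huber_weight(r: float, k: float = 1.345) -> float:
    """Huber weight psi(r)/r: 1 for small residuals, k/|r| for large ones."""
    return 1.0 if abs(r) <= k else k / abs(r)

def huber_location(x, k=1.345, max_iter=50, tol=1e-8):
    """Robust location estimate by iteratively reweighted least squares."""
    mu = median(x)
    # Scale from the median absolute deviation, held fixed during iteration.
    scale = median(abs(v - mu) for v in x) / 0.6745
    if scale == 0.0:
        return mu
    for _ in range(max_iter):
        weights = [huber_weight((v - mu) / scale, k) for v in x]
        new_mu = sum(w * v for w, v in zip(weights, x)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu

# Made-up data with one gross outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
estimate = huber_location(data)
```

The outlier receives a small weight, so the estimate stays near the bulk of the data instead of being pulled toward the ordinary mean of 22.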

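Variance Covariance Estimates
• Suppose model is $Y = X\beta + \varepsilon$
• Huber (1981) gives three forms for estimating the variance covariance matrix
$$\kappa^2\,\frac{\frac{1}{n-p}\sum_i \psi(r_i)^2}{\left[\frac{1}{n}\sum_i \psi'(r_i)\right]^2}\,(X^T X)^{-1}$$
$$\kappa\,\frac{\frac{1}{n-p}\sum_i \psi(r_i)^2}{\frac{1}{n}\sum_i \psi'(r_i)}\,W^{-1}$$
$$\frac{1}{\kappa}\,\frac{1}{n-p}\sum_i \psi(r_i)^2\;W^{-1}(X^T X)W^{-1}$$
We will use this form
where $W = X^T \Psi' X$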

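We Will Focus on the Summarization PLM
• Array effect model
$$y_{ij} = \alpha_i + \beta_j + \varepsilon_{ij}$$
where $y_{ij}$ is the pre-processed log PM intensity, $\alpha_i$ is the probe effect and $\beta_j$ is the array effect
With constraint
$$\sum_{i=1}^{I} \alpha_i = 0$$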

Detecting Differential Expression
• Problem: Given an experiment with two treatment groups, correctly identify the differential genes without incorrectly choosing non-differential genes.
• Question 1: How do different methods for DDE compare?
• Question 2: Can we do better using PLMs?

How Do We Know Which Genes are Differential?
• Spike-in datasets.
– Transcripts at known concentrations differing across arrays. Common background cRNA.
– Typically, Latin Square design
  • Affymetrix HG-U95A
  • GeneLogic AML – bacterial spike-ins
  • GeneLogic Tonsil – bacterial spike-ins

Affymetrix Spike-in Data
• 59 chips. All but 1 of the rows are done as triplicates
Spike-in concentration by group (rows) and probeset (columns):
Group  37777   684  1597  38734  39058  36311  36889  1024  36202  36085  40322   407  1091  1708
A          0  0.25   0.5      1      2      4      8    16     32     64    128     0   512  1024
B       0.25   0.5     1      2      4      8     16    32     64    128    256  0.25  1024     0
C        0.5     1     2      4      8     16     32    64    128    256    512   0.5     0  0.25
D          1     2     4      8     16     32     64   128    256    512   1024     1  0.25   0.5
E          2     4     8     16     32     64    128   256    512   1024      0     2   0.5     1
F          4     8    16     32     64    128    256   512   1024      0   0.25     4     1     2
G          8    16    32     64    128    256    512  1024      0   0.25    0.5     8     2     4
H         16    32    64    128    256    512   1024     0   0.25    0.5      1    16     4     8
I         32    64   128    256    512   1024      0  0.25    0.5      1      2    32     8    16
J         64   128   256    512   1024      0   0.25   0.5      1      2      4    64    16    32
K        128   256   512   1024      0   0.25    0.5     1      2      4      8   128    32    64
L        256   512  1024      0   0.25    0.5      1     2      4      8     16   256    64   128
M        512  1024     0   0.25    0.5      1      2     4      8     16     32   512   128   256
N        512  1024     0   0.25    0.5      1      2     4      8     16     32   512   128   256
O        512  1024     0   0.25    0.5      1      2     4      8     16     32   512   128   256
P        512  1024     0   0.25    0.5      1      2     4      8     16     32   512   128   256
Q       1024     0  0.25    0.5      1      2      4     8     16     32     64  1024   256   512
R       1024     0  0.25    0.5      1      2      4     8     16     32     64  1024   256   512
S       1024     0  0.25    0.5      1      2      4     8     16     32     64  1024   256   512
T       1024     0  0.25    0.5      1      2      4     8     16     32     64  1024   256   512

Testing for Differential Expression
• On a probeset by probeset basis compute a test statistic
• Should include something related to observed FC and some variability estimate
• A “t” statistic: something of the form
$$t = \frac{X}{SE}$$

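Fold Change
$$FC = \bar{X}_l - \bar{X}_m$$
Where
$$\bar{X}_l = \frac{\sum_j \beta_j \,\mathrm{Ind}(j \in \text{group } l)}{\sum_j \mathrm{Ind}(j \in \text{group } l)}$$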

Simple t-statistic
$$t = \frac{\bar{X}_l - \bar{X}_m}{\sqrt{\frac{s_l^2}{n_l} + \frac{s_m^2}{n_m}}}$$

“Robust” t-statistic
$$t = \frac{\tilde{X}_l - \tilde{X}_m}{\sqrt{\frac{\tilde{s}_l^2}{n_l} + \frac{\tilde{s}_m^2}{n_m}}}$$
• Use medians in place of means
• Use MAD in place of standard deviation

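Simple Moderated t-statistic
$$t = \frac{\bar{X}_l - \bar{X}_m}{\sqrt{\frac{s_l^2}{n_l} + \frac{s_m^2}{n_m}} + s_{med}}$$
• $s_{med}$ is the median of $\sqrt{\frac{s_l^2}{n_l} + \frac{s_m^2}{n_m}}$ across all genes
• Analogous to SAM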
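A small sketch of the fold change, the simple t-statistic and the simple moderated t-statistic, computed from per-array log2 expression estimates. The function names and the toy numbers are made up for illustration, and the moderated form assumes the denominator is the standard error plus its median across genes.

```python
from math import sqrt
from statistics import mean, median, variance

def fold_change(group_l, group_m):
    """FC: difference of group means of log2 expression."""
    return mean(group_l) - mean(group_m)

def standard_error(group_l, group_m):
    """sqrt(s_l^2 / n_l + s_m^2 / n_m) using sample variances."""
    return sqrt(variance(group_l) / len(group_l)
                + variance(group_m) / len(group_m))

def simple_t(group_l, group_m):
    return fold_change(group_l, group_m) / standard_error(group_l, group_m)

def moderated_t(genes):
    """genes: list of (group_l, group_m) pairs, one pair per gene."""
    ses = [standard_error(l, m) for l, m in genes]
    s_med = median(ses)   # median standard error across all genes
    return [fold_change(l, m) / (se + s_med)
            for (l, m), se in zip(genes, ses)]

# Toy log2 expression values for three genes, 3 vs 3 arrays.
genes = [
    ([8.0, 9.0, 10.0], [5.0, 6.0, 7.0]),
    ([5.0, 5.1, 5.2], [5.0, 5.1, 5.2]),
    ([7.0, 9.0, 11.0], [6.0, 8.0, 10.0]),
]
fc_first = fold_change(*genes[0])
t_first = simple_t(*genes[0])
t_mod = moderated_t(genes)
```

Adding the median standard error to every denominator damps statistics that are large only because a gene happens to have a very small variance estimate.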

Limma “ebayes” t-statistic
• Generalization of the Bayesian method of Lönnstedt and Speed (2002) to the general linear model case
• An alternative and much more sophisticated moderated t-statistic. Variability is estimated using a posterior variance
• From the limma Bioconductor package.

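Probe Level Model test statistics
• Suppose that $\Sigma$ is the component of the variance covariance matrix related to $\beta$
• Let $c$ be the contrast vector defined such that the $j$th element of $c$ is
$$c_j = \begin{cases} \dfrac{1}{n_l} & \text{if array } j \text{ is in group } l \\ -\dfrac{1}{n_m} & \text{if array } j \text{ is in group } m \\ 0 & \text{otherwise} \end{cases}$$

Probe Level Model test statistics
$$t_{PLM.1} = \frac{c^T \beta}{\sqrt{\sum_{j=1}^{J} c_j^2\, \Sigma_{jj}}}$$
$$t_{PLM.2} = \frac{c^T \beta}{\sqrt{c^T \Sigma\, c}}$$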
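The two PLM test statistics differ only in whether the off-diagonal elements of Σ are used. The sketch below assumes β and Σ are already available from a fitted model; the function names, the sign convention of the contrast (group l minus group m) and the toy numbers are assumptions for illustration.

```python
from math import sqrt

def contrast_vector(groups, l, m):
    """c_j = 1/n_l for arrays in group l, -1/n_m for group m, 0 otherwise."""
    n_l, n_m = groups.count(l), groups.count(m)
    return [1.0 / n_l if g == l else -1.0 / n_m if g == m else 0.0
            for g in groups]

def t_plm1(c, beta, sigma):
    """Uses only the diagonal of sigma (the variances)."""
    numerator = sum(cj * bj for cj, bj in zip(c, beta))
    return numerator / sqrt(sum(c[j] ** 2 * sigma[j][j] for j in range(len(c))))

def t_plm2(c, beta, sigma):
    """Uses the full quadratic form c' sigma c."""
    n = len(c)
    numerator = sum(cj * bj for cj, bj in zip(c, beta))
    quad = sum(c[i] * sigma[i][j] * c[j] for i in range(n) for j in range(n))
    return numerator / sqrt(quad)

# Toy example: four arrays, two per group, uncorrelated chip effects.
groups = ["l", "l", "m", "m"]
beta = [9.0, 9.0, 6.0, 6.0]
sigma = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]
c = contrast_vector(groups, "l", "m")
t1 = t_plm1(c, beta, sigma)
t2 = t_plm2(c, beta, sigma)
```

With a diagonal Σ the two statistics coincide; they differ once the chip-effect estimates are correlated.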

A First Comparison
• 8 chips from Affymetrix HG-U95A spike-in dataset
– 4 arrays for each of two concentration profiles
• Fit an array effect model to all 8 chips
– Compare the performance of the different methods by looking at all comparisons
  • 1 vs 1
  • 2 vs 2
  • 3 vs 3
  • 4 vs 4

What Happens as the Number of Arrays Increases?
• Expand comparison to all 24 arrays with the same concentration profiles from the Affymetrix HG-U95A spike-in dataset
• Fit an array effect model to all 24 arrays
• Look at comparisons between equal numbers of chips

A Larger Comparison
• Look at the entire 59 chips of the Affymetrix HG-U95A spike-in dataset
• Examine two cases. After standard preprocessing
– Fit models for each pairwise comparison (individual models)
– Fit a model to all 59 chips (single model)
• There are 91 pairwise comparisons

Results
          Individual Models          Single Model
Method    0% FP   5% FP   AUC        0% FP   5% FP   AUC
FC        0.451   0.985   0.975      0.444   0.982   0.971
Std       0.323   0.982   0.956      0.301   0.975   0.952
Robust    0.16    0.939   0.857      0.144   0.935   0.852
Mod       0.437   0.987   0.975      0.413   0.98    0.97
PLM.1     0.653   0.991   0.979      0.54    0.951   0.93
PLM.2     0.657   0.991   0.979      0.539   0.951   0.93
Ebayes    0.514   0.988   0.978      0.45    0.986   0.974

More Spike-in Datasets
• Two GeneLogic Spike-in datasets
– Tonsil dataset (36 arrays)
– AML dataset (34 arrays)
• In each case use single models fitted to all arrays

What is Going On Here?
• Examine residuals stratified by concentration group
– Spike-ins
– Randomly chosen non-differential probesets at low, medium and high average expression

Affymetrix Spike-ins

Low Non-Differential

Middle Non-Differential

High Non-Differential

GeneLogic AML Spike-ins

GeneLogic Tonsil dataset

How About With More “Real” Data?
GeneLogic Dilution/Mixture Dataset
• Learning Set (Dilution Series Data): Liver 30x vs CNS 30x; 400 top genes taken as “truth”
• Test Set (Mixture Data): 75:25 5x vs 25:75 5x

Mixture Data Results
          3 vs 3                   4 vs 4                   5 vs 5
Method    0% FP   5% FP   AUC      0% FP   5% FP   AUC      0% FP   5% FP   AUC
FC        0.007   0.886   0.697    0.008   0.888   0.703    0.005   0.888   0.708
Std       0.004   0.793   0.53     0.008   0.872   0.626    0.018   0.902   0.675
Robust    0.002   0.485   0.271    0.005   0.747   0.49     0.01    0.743   0.488
Mod       0.007   0.908   0.697    0.002   0.932   0.735    0       0.948   0.76
PLM.1     0.056   0.943   0.751    0.057   0.947   0.756    0.056   0.95    0.76
PLM.2     0.057   0.943   0.752    0.057   0.948   0.758    0.058   0.95    0.761
Ebayes    0.001   0.918   0.744    0       0.933   0.761    0       0.943   0.776

Moderation for the PLM test statistic
                 5 vs 5
Method           0% FP   5% FP   AUC
PLM.2            0.058   0.95    0.761
Ebayes           0       0.943   0.776
PLM Moderated    0.053   0.963   0.795

Quality Assessment using PLM
• PLM quantities useful for assessing chip quality
– Weights
– Residuals
– Standard Errors
• Expression values relative to median chip

Pseudo-chip images
[Figure panels: Weights; Residuals; Positive Residuals; Negative Residuals]

An Image Gallery
“Crop Circles”, “Ring of Fire”, “Tricolor”
http://www.stat.berkeley.edu/~bolstad/PLMImageGallery/

NUSE Plots
Normalized Unscaled Standard Errors

RLE Plots
Relative Log Expression
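One way to read “expression values relative to median chip” is sketched below: for each probeset, subtract the median log2 expression across arrays, then inspect the distribution of these values for each array. This is an interpretation for illustration; the function name `relative_log_expression` and the toy numbers are hypothetical.

```python
from statistics import median

def relative_log_expression(expr):
    """expr[g][j]: log2 expression of probeset g on array j.

    Returns values relative to the per-probeset median across arrays
    (the "median chip").
    """
    result = []
    for row in expr:
        center = median(row)
        result.append([v - center for v in row])
    return result

# Toy data: two probesets measured on three arrays.
expr = [
    [7.0, 8.0, 9.0],
    [5.0, 5.0, 6.5],
]
rle = relative_log_expression(expr)
# Values for the third array, which runs high for both probesets.
third_array = [row[2] for row in rle]
```

An array whose values are centered away from zero, or are unusually spread out, stands out as a candidate for closer quality inspection.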

Discordant Probes

Discordant Arrays

Morals for Today’s Talk
• The “black box” that is low-level processing is important. Your processing can have a significant effect on your final expression values and resulting conclusions.
• Testing for differential expression at the probe level can improve upon probeset-level testing.
• There is interesting quality assessment information at the probe-level

Ongoing Work in this Area
• Technology changes: What still works? What doesn’t?
• Other probe-level models (e.g. introduce sequence information, MMs)
• Can we introduce the biology into the process (gene families, etc.)?

Acknowledgements
• Terry Speed (UC Berkeley)
• Francois Colin (UC Berkeley/UCSF)
• Julia Brettschneider (UC Berkeley)
• Rafael Irizarry (Johns Hopkins)
• Bioconductor Core, http://www.bioconductor.org

Additional Slides

Focusing on a Single GeneChip Cell Location

Chip dat file – checkered board – oligo B2
Courtesy: F. Colin

Chip dat file – checkered board – close up w/ grid
Courtesy: F. Colin

From Chip To Data

Constructing a gene expression measure

Computing Expression Measures: A Three Step Procedure
1. Background/Signal adjustment (B)
2. Normalization (N)
3. Summarization (S)
Let $X$ be cel file data from multiple arrays, then
$$\text{Expression values} = S(N(B(X)))$$

Background Methods
• Affymetrix
– Location dependent
– Ideal mismatch
• RMA
– Convolution model
• Other
– Standard curve adjustment
– GCRMA (Wu, Irizarry et al 2003)

Background Signal Methods
• Affymetrix
– Location dependent background based on grids
  • I will refer to this as the MAS 5 background
– Originally proposed subtracting MM from PM, but this is problematic because as many as a third of MMs are greater than the respective PM
  • No longer used
– Now uses what they refer to as the Ideal Mismatch, which is MM when possible and something else when not possible (designed so that there are no negative values)
  • Call this IMM

Original RMA Background
• Convolution model is suggested by looking at the density of observed empirical distributions

Convolution Model
• $O = S + N$
– $O$ is observed PM, $S$ is signal (assumed exponential), $N$ is noise (assumed normal, truncated at zero)
• Correction is then
$$E(S \mid O = o) = a + b\,\frac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{o-a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{o-a}{b}\right) - 1}, \qquad a = o - \mu - \sigma^2\alpha, \quad b = \sigma$$

A Standard Curve Adjustment Based on Spike-in Information
• Observe that there is a curve that relates observed expression and spike-in concentration. The ideal would be to have a linear relationship between concentration and computed expression. The curve gives us a concentration dependent adjustment

What About Non Spike-ins?
• We don’t know a concentration for most probesets. If we did, or if we had a variable that related to concentration, the adjustment would be easy to perform
• Fit the following model
$$y_{1i}^{(k)} = \alpha_i^{(k)} + \varepsilon_i^{(k)}$$
$$y_{2i}^{(k)} = \alpha_i^{(k)} + \gamma^{(k)} + \varepsilon_i'^{(k)}$$
• Where
$$y_{1i}^{(k)} = \log_2 PM_i^{(k)}, \qquad y_{2i}^{(k)} = \log_2 MM_i^{(k)}$$

γ Relates to Concentration

Establishing a Relationship Between γ and Concentration

The Two Curves Yield an Adjustment Curve

Normalization Methods
• Methods already compared in Bolstad et al (2003)
• Baseline (normalized using reference chip)
– Scaling (Affymetrix)
– Non linear (Li-Wong)
• Complete data (no reference chip, information from all arrays used)
– Quantile normalization (Bolstad et al 2003)
– Contrast (Åstrand, 2003)
– Cyclic Loess

Why Quantile Normalization?
• Quantile normalization found to perform acceptably in reducing variance without drastic bias effects
• Quantile normalization is fast

RMA Model
• To each probeset (k), with i indexing probes and j indexing chips, fit the model:
$$y_{ij}^{(k)} = \alpha_i^{(k)} + \beta_j^{(k)} + \varepsilon_{ij}^{(k)}$$
where $\alpha_i^{(k)}$ is a probe effect and $\beta_j^{(k)}$ is the log2 gene expression. $y_{ij}^{(k)}$ is the log2 background adjusted and normalized PM intensity
• Different ways to fit this model
– Median polish – quick
– Robust linear model – yields some good quality diagnostic tools

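Basic RMA model
Let
$$y_{ij} = \log_2 N(B(PM_{ij}))$$
then
$$y_{ij} = m + \alpha_i + \beta_j + \varepsilon_{ij}$$
where $\alpha_i$ is a probe-effect and $\beta_j$ is a chip-effect ($m + \beta_j$ is the log2 gene expression on array $j$)
Median-polish imposes constraints
$$\operatorname{median}_i \alpha_i = \operatorname{median}_j \beta_j = 0, \qquad \operatorname{median}_i \varepsilon_{ij} = \operatorname{median}_j \varepsilon_{ij} = 0$$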

Advantages/Disadvantages of RMA/Median polish
• Advantages
– Fast
– Gene expression measures perform favorably when compared with MAS 5.0, Li-Wong MBEI
– Robust against outliers
• Disadvantages
– No standard error estimates
– No algorithmic flexibility to fit alternative models

Probe Level Models are based on RMA
• RMA method
– Convolution Model Background
– Quantile Normalization
– Summarization using a robust multi-chip model on the log scale. Model is fitted using the median polish algorithm on a probeset by probeset basis

Comparing the background methods
• Using an Affymetrix spike-in experiment we shall examine
– Observed vs spike-in concentration
– Observed vs expected fold change
– Composite M vs A plots
– ROC curves
• In each case we will compute expression values using standard RMA methodology (i.e. quantile normalization, median polish summarization)

Assessing Bias: Observed Expression vs Spike-in Concentration
Slope   None    RMA     MAS 5   IMM     MAS5/IMM   S.C.A.
All     0.493   0.63    0.589   0.69    0.695      0.856
Mid     0.665   0.784   0.751   0.82    0.82       1.041
Low     0.184   0.376   0.318   0.52    0.563      0.631
High    0.329   0.33    0.327   0.295   0.291      0.256

Assessing Bias: Observed Fold-change versus Expected Fold-change
Slope   None    RMA     MAS 5   IMM     MAS5/IMM   S.C.A.
All     0.484   0.624   0.583   0.683   0.692      0.847

Assessing Variability: M vs A plots
• Vertical Axis is M: log2 fold-change.
• Horizontal Axis is A: average expression value (on log2 scale), aka geometric average.
• Ideally non-differential genes are tight about M=0
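The M and A coordinates are simple to compute from two sets of log2 expression values. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def ma_values(y1, y2):
    """M and A for paired log2 expression values y1, y2.

    M is the log2 fold-change; A is the average log2 expression.
    """
    m = [a - b for a, b in zip(y1, y2)]
    a = [(u + v) / 2.0 for u, v in zip(y1, y2)]
    return m, a

# Toy log2 expression values for three probesets on two arrays.
array_1 = [10.0, 8.0, 6.5]
array_2 = [8.0, 8.0, 6.0]
M, A = ma_values(array_1, array_2)
```

Here the first probeset shows a 4-fold difference (M = 2) while the second is unchanged (M = 0).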

Detecting Differential Expression: ROC Curves

Summary of Trade-offs
Background Method            Detect Differential Genes   Accurate estimates of Fold change
No Background                Good                        Poor
RMA                          Good                        Poor
MAS 5.0                      Good                        Poor
Ideal Mismatch               Poor                        Good
MAS5.0/IdealMM               Poor                        Good
Standard Curve Adjustment    Good                        Good

Comparing the Normalization Methods
• Want to reduce variation but at the same time we do not want to introduce any bias
• First a quick examination of the expression values by array
• Using the same spike-in experiment as before, this time with no background correction, only normalization and median polish summarization.

Scaling is Not Sufficient

Variability of Non-Differential Genes is Reduced

Little effect on Spike-ins
Method             All             Low             Mid             High            FC
No Normalization   0.493 (0.845)   0.185 (0.148)   0.664 (0.733)   0.328 (0.207)   0.484 (0.952)
Quantile           0.493 (0.851)   0.184 (0.153)   0.665 (0.741)   0.329 (0.224)   0.484 (0.955)
Scaling            0.493 (0.852)   0.186 (0.156)   0.663 (0.742)   0.33 (0.225)    0.484 (0.954)

ROC Curves

Comparing Established Expression Measures

Slope   Value
All     0.493
Mid     0.665
Low     0.184
High    0.329






Slope: 0.484

Slope: 0.624

Slope: 0.583

Slope: 0.683

Slope: 0.692

Slope: 0.847

What About the Treatment Effect Model?
• With the treatment effect model, we have more observations contributing to estimation for each element of the variance covariance matrix
• Much quicker to fit

We Will Focus on Two Particular PLMs
• Array effect model
$$y_{ij} = \alpha_i + \beta_j + \varepsilon_{ij}$$
• Treatment effect model
$$y_{ij} = \alpha_i + \tau_{l_j} + \varepsilon_{ij}$$
where $y_{ij}$ is the pre-processed log PM intensity, $\alpha_i$ is the probe effect, $\beta_j$ is the array effect and $\tau_{l_j}$ is the treatment effect
In both cases
$$\sum_{i=1}^{I} \alpha_i = 0$$
