Lecture: Microarray Data Analysis, Chapter 1: Introduction ... - Lectures

Lecture

Microarray Data Analysis

Chapter 1: Introduction, Normalization, Differential Genes, Multiple Testing

Chapter 2: Clustering and Classification

What are DNA microarrays?

[Figure: DNA, mRNA, protein]

• Microarrays are technology platforms for measuring the activity of a large number of genes.
• To do so, the gene products (usually mRNA) are quantified.
• The measurement uses DNA sequences that are immobilized on a surface (which surface depends on the platform).



... microarrays ... measure the activity of genes (... mRNA).

Which other methods do you know that pursue this goal?


Northern blot

[Figure: Northern blot with several RNA lanes]

RT-PCR

[Figure: RNA (5' to 3') is reverse-transcribed into cDNA and then amplified as dsDNA]

• Because RNA cannot be amplified directly by PCR, it must first be transcribed into cDNA (reverse transcription, RT).
• Two approaches to quantification are possible:
  1. Internal endogenous standard (e.g. a housekeeping gene)
  2. Competitive RT-PCR: so-called mimic fragments are added to the reaction and amplified together with the actual target sequence.


SAGE = Serial Analysis of Gene Expression

1. Isolate cells
2. Isolate mRNA and synthesize cDNA
3. Cut the transcript with the anchoring enzyme
4. "Tagging"
5. Ligate the tags
6. Sequencing
7. Quantification


WHAT FOR? Classic example:

[Figure: healthy tissue (normal kidney) and diseased tissue (kidney tumor), each followed by RNA preparation and MEASUREMENT ?!]

What distinguishes "tumor" from "normal"?


WITH WHAT? Platforms

Filters, glass chips, Affymetrix

Platforms

1991: Filters (Lennon & Lehrach, 1991)
1995: Glass chips (Stanford University; Schena et al., 1995)
1996: Affymetrix (Lockhart et al., 1996)


Platforms

Nylon filter
• one sample
• radioactive signal
• many spots possible
• large area / local effects
• bleed-over of strong signals into neighboring spots
• only one sample per hybridization

Glass slide
• red and green sample
• fluorescence signal
• up to ~20,000 spots possible
• simultaneous hybridization of sample and control (red/green)

Chip
• one probe consisting of 16-20 repeats and the corresponding mismatches
• commercial chip
• good, reproducible data
• only one sample per hybridization


Basic principle

[Figure: sequence 1, sequence 2, ..., sequence n (cDNAs or oligos) on the array; RNA from sample 1 and sample 2]



Basic assumption

After suitable "clean-up", the measured signal reflects the amount of RNA in the sample.


Processing of microarray data:

[Figure: workflow. Question (?), experimental design, experiment (microarray), image processing, raw intensity values, normalization, expression levels, analysis (clustering; class discovery; classification; differential genes; ...), biological verification, result (!): biology, diagnostics, therapy, ...]


Which normalization methods are there?

User-defined sets (useful in "most genes changed" settings)
• Housekeeping genes (?!)
• Internal controls, etc.

Whole data set (useful in "most genes unchanged" settings)
• Scaling methods: mean, median, shorth, z-score
• Regression methods: global linear/polynomial, local linear/polynomial, qspline
• Transformation methods: variance stabilization
• Analysis of variance / ML-based methods: ANOVA
• Distribution-based: quantile normalization
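Two of the methods listed above, median scaling and quantile normalization, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `median_scale` and `quantile_normalize` and the toy intensities are made up for this example, every array is assumed to contain the same genes in the same order, and ties are broken by position rather than averaged.

```python
from statistics import median


def median_scale(arrays):
    """Scale each array so that its median equals the median of all array medians."""
    medians = [median(a) for a in arrays]
    target = median(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Give all arrays (equal length) the same empirical distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the r-th smallest value across arrays.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda k: a[k])  # gene indices by increasing intensity
        normalized = [0.0] * n
        for rank, k in enumerate(order):
            normalized[k] = reference[rank]
        result.append(normalized)
    return result


# Hypothetical intensities: three arrays, four genes each.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
scaled = median_scale(arrays)
qn = quantile_normalize(arrays)
```

Median scaling only shifts the overall level of each array, while quantile normalization forces the complete distributions to agree. Both rely on the "most genes unchanged" assumption.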

Observation

The variance of the measured intensity depends on the absolute intensity.

[Figure: for every spot k, the variance (R_k - G_k)²/2 was plotted against the mean (R_k + G_k)/2. The red line shows the moving average.]
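The quantities in this plot are easy to compute. The following is a minimal sketch, assuming two lists of red and green intensities per spot; the function names and the window-based moving average are illustrative choices, not necessarily what was used for the original figure.

```python
def spot_mean_and_variance(red, green):
    """Per-spot mean (R + G) / 2 and two-point variance (R - G)^2 / 2."""
    return [((r + g) / 2, (r - g) ** 2 / 2) for r, g in zip(red, green)]


def moving_average(points, window):
    """Average mean and variance over a sliding window of spots sorted by mean."""
    pts = sorted(points)
    out = []
    for start in range(len(pts) - window + 1):
        chunk = pts[start:start + window]
        avg_mean = sum(m for m, _ in chunk) / window
        avg_var = sum(v for _, v in chunk) / window
        out.append((avg_mean, avg_var))
    return out
```

If the moving average of the variance rises with the mean intensity, the variance is not constant, which motivates the error model discussed next.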


Error model notation

k = 1, ..., n genes
i = 1, ..., d samples

[Figure: data matrix indexed by gene k and sample i]

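Error model

\[ Y_{ik} = (a_i + \varepsilon_{ik}) + (b_i b_k \exp(\eta_{ik}))\, x_{ik} \]

\[ \frac{Y_{ik} - a_i}{b_i} = \frac{\varepsilon_{ik} + b_i b_k \exp(\eta_{ik})\, x_{ik}}{b_i} \]

\[ \frac{Y_{ik} - a_i}{b_i} = \underbrace{(\varepsilon_{ik} / b_i)}_{\nu_{ik}} + \underbrace{(b_k x_{ik})}_{m_{ik}} \exp(\eta_{ik}) \]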

Example: error model

Rocke and Durbin (J. Comput. Biol. 2001):

\[ Y_k = \alpha + \beta_k e^{\eta} + \nu \]

Y_k: measured intensity of gene k
β_k: true expression level of gene k
α: offset
η, ν: multiplicative and additive error, independent and normally distributed

For large expression values β_k the multiplicative error dominates.
For small β_k the additive error dominates.

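\[ \frac{Y_{ik} - a_i}{b_i} = \nu_{ik} + m_{ik} \exp(\eta_{ik}) \]

\[ E(Y_{ik}) = a_i + b_i m_{ik}\, E(\exp(\eta_{ik})) \]

\[ \operatorname{Var}(Y_{ik}) = \operatorname{Var}(\nu_{ik} b_i) + \operatorname{Var}(b_i m_{ik} \exp(\eta_{ik})) \]

\[ = c'^2_{\eta}\, b_i^2 m_{ik}^2 + b_i^2 \sigma_{\nu}^2, \qquad c'^2_{\eta} = \operatorname{Var}(\exp(\eta_{ik})), \quad \eta_{ik} \sim N(0, \sigma_{\eta}^2), \quad \nu_{ik} \sim N(0, \sigma_{\nu}^2) \]

\[ = c_{\eta}^2 \left(E(Y_{ik}) - a_i\right)^2 + b_i^2 \sigma_{\nu}^2, \qquad c_{\eta}^2 = c'^2_{\eta} / E^2(\exp(\eta_{ik})) \]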

This yields

\[ \operatorname{Var}(Y_{ik}) = c^2 \left(E(Y_{ik}) - a\right)^2 + b^2 \]

that is, the variance is a quadratic function of the mean.

Now transform the data so that the variance is constant and no longer depends on the mean.

Variance-stabilizing transformation

Let Y_u be a family of random variables with E(Y_u) = u and Var(Y_u) = v(u). Define the transformation

\[ h(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du \]

⇒ Var h(Y_u) is approximately independent of u.

Variance-stabilizing transformation: the "generalized log" transformation

\[ \operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right) \]

[Figure: f(x) = log(x) compared with h_s(x) = arsinh(x/s), plotted against intensity from about -200 to 1000]

W. Huber et al., ISMB 2002
D. Rocke & B. Durbin, ISMB 2002
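A small simulation can make the effect of the arsinh transformation tangible. The sketch below is an illustration under assumptions, not the procedure from the cited papers: it draws intensities from the additive-multiplicative error model with offset 0 and scale 1, uses made-up error sizes `sigma_nu` and `sigma_eta`, and assumes the scale s of h_s is known instead of estimated from data.

```python
import math
import random
from statistics import stdev


def simulate(m, n, sigma_nu, sigma_eta, rng):
    """Draw n intensities Y = nu + m * exp(eta) (offset a = 0, scale b = 1)."""
    return [rng.gauss(0, sigma_nu) + m * math.exp(rng.gauss(0, sigma_eta))
            for _ in range(n)]


def glog(y, s):
    """Generalized log h_s(y) = arsinh(y / s)."""
    return math.asinh(y / s)


rng = random.Random(0)
sigma_nu, sigma_eta = 20.0, 0.2  # hypothetical additive / multiplicative error sizes
# s = sigma_nu / c with c^2 = Var(exp(eta)) / E(exp(eta))^2 = exp(sigma_eta^2) - 1
s = sigma_nu / math.sqrt(math.exp(sigma_eta ** 2) - 1)

raw_sd, glog_sd = [], []
for m in [0.0, 100.0, 1000.0, 10000.0]:
    y = simulate(m, 5000, sigma_nu, sigma_eta, rng)
    raw_sd.append(stdev(y))
    glog_sd.append(stdev([glog(v, s) for v in y]))
    print(f"m={m:8.0f}  sd(Y)={raw_sd[-1]:9.1f}  sd(h(Y))={glog_sd[-1]:.3f}")
```

On the raw scale the standard deviation grows strongly with the expression level m, whereas after the transformation it stays roughly the same across all levels.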

Variance stabilizing transformations

\[ f(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du \]

1.) constant CV ('multiplicative'): \( v(u) \propto u^2 \Rightarrow f \propto \log u \)

2.) offset: \( v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0) \)

3.) additive and multiplicative: \( v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\dfrac{u + u_0}{s} \)
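As a worked step for case 3, the integral can be evaluated with the substitution t = (u + u_0)/s. This is the standard calculation, written out here for illustration; proportionality constants are dropped.

```latex
\begin{align*}
f(x) &= \int^{x} \frac{du}{\sqrt{(u + u_0)^2 + s^2}}
      && \text{substitute } t = \frac{u + u_0}{s},\; du = s\, dt \\
     &= \int^{(x + u_0)/s} \frac{s\, dt}{s \sqrt{t^2 + 1}}
      = \operatorname{arsinh}\!\left(\frac{x + u_0}{s}\right) + \text{const},
\end{align*}
% because d/dt arsinh(t) = 1 / sqrt(t^2 + 1).
% For x + u_0 much larger than s, arsinh(t) is close to log(2t),
% so case 3 approaches the logarithm of case 2 at high intensities.
```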

Robust parameter estimation

\[ \operatorname{arsinh}\frac{Y_{ik} - a_i}{b_i} = \mu_k + \varepsilon_{ik}, \qquad \varepsilon_{ik} \sim N(0, c^2) \]

• Robust maximum likelihood estimation of the model parameters

\[ M = \{ \{a_i\}, \{b_i\}, c, \{\mu_k\} \} \]



Finding differential genes

[Figure: expression matrix with genes on one axis and patients, samples, timepoints, ... on the other]

Two cell/tissue/disease types:
• wildtype / mutant
• control / treated
• disease A / disease B
• responding / non-responding
• etc.

For every sample (cell line/patient) we have the expression levels of thousands of genes and the information whether it is A or B.

Is a threefold induced gene more trustworthy than a twofold induced gene?

[Figure: log-ratio plotted against product intensity (log scale)]

[Figure: expression values in groups A and B]

Conclusion: In addition to the differences in gene expression, you also have a vital interest in their variability. This information is needed to obtain meaningful lists of genes.

Standard Deviation and Standard Error

Standard Deviation (SD): Variability of the measurement

Standard Error (SE): Variability of the mean of several measurements; with n replications, SE = SD / √n

Normally distributed data:
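The difference between the two quantities can be seen in a short sketch. It assumes a handful of made-up replicate measurements of one gene; `standard_error` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


replicates = [2.0, 4.0, 4.0, 6.0]  # hypothetical replicate measurements
print("mean:", mean(replicates))
print("SD:  ", stdev(replicates))           # variability of a single measurement
print("SE:  ", standard_error(replicates))  # variability of the mean
```

More replications leave the SD roughly unchanged but shrink the SE, which is why replicated experiments give more trustworthy mean differences.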

Questions:

Which genes are differentially expressed?
-> Ranking

Are these results "significant"?
-> Statistical analysis

That means: Is the probability that the result arose "by chance" sufficiently small?

Ranking:

Problem: Produce an ordered list of differentially expressed genes, starting with the most up-regulated gene and ending with the most down-regulated gene.

Ranking means finding the right genes and drawing our attention to them.

In many applications it is the most important step.

Ranking is not Testing

Ranking: Finding the right genes

Testing: Deciding whether genes are significant

There is more than one way to rank.

There is more than one way to test.

The criteria for choosing the best ranking differ from the criteria for choosing the best test; power is often no argument.

Ranking: Order the genes by fold change or another score. -> The list may contain some genes that are not differential in reality (false positives).

[Figure: ranked list, from gene candidate 1 down to gene ....]

Order by some score (intuitively: fold change):
1st: most differential,
2nd: second most differential
...

Testing: Find the genes whose fold change or score is significant, such that there are fewer than 5% false positives. -> You may miss some (false negatives).

[Figure: ranked list of genes with a significance cutoff]

Order by some score (intuitively: fold change):
1st: most differential,
2nd: second most differential
...

Which gene is more differentially 

expressed?

Ranking is Scoring 

You need to score differential 

gene expression 

Different scores lead to different 

rankings 

What scores are there?

T-score

Idea: Take variances into account

[Figure: three example genes]
Change:    low    high   high
Variance:  high   low    high

Change: HIGH, Variance: SMALL -> T huge
Change: SMALL, Variance: HIGH -> T ~ 0
Change: HIGH, Variance: HIGH -> T ?
Change: SMALL, Variance: SMALL -> T ?
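To make the cases above concrete, here is a minimal sketch of a two-sample T-score. It assumes the pooled-variance (Student) form, which is one common choice; the slides do not state which variant is meant, and the function name and example values are hypothetical.

```python
import math
from statistics import mean


def t_score(group_a, group_b):
    """Two-sample T-score with pooled variance: mean difference / standard error."""
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = mean(group_a), mean(group_b)
    ss = sum((x - m1) ** 2 for x in group_a) + sum((x - m2) ** 2 for x in group_b)
    pooled_se = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / pooled_se  # fails with ZeroDivisionError if all values are constant


# Change high, variance small: T is huge.
t_clear = t_score([10.0, 10.1, 9.9], [5.0, 5.1, 4.9])
# Change small, variance high: T is close to 0.
t_noisy = t_score([6.0, 10.0, 2.0], [5.0, 9.0, 1.0])
print(t_clear, t_noisy)
```

The two remaining cases (both high, or both small) depend on the ratio of change to variability, which is exactly what the score measures.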

Compute T-scores for a random experiment.

Draw a histogram of the T-scores and mark the 5% highest and lowest (red).

Compute the T-score for gene x and mark it in the histogram (green).

T-score, t-test, p-value

How large is the probability of being at least as extreme as the green arrow?

T-test PROBLEMS

• There are many genes (-> many tests) but only few repetitions
• Is "s" a good estimate of the standard deviation?
• If the measured variance is small, T easily becomes very large

Therefore: for microarrays it is reasonable to use a modified version of the t-test

Fudge Factors:

You need to estimate the variance from the data.

You might underestimate an already small variance (constantly expressed genes).

The denominator in T then becomes very small.

Constantly expressed genes show up at the top of the list.

Correction: Add a constant fudge factor s_0

Regularized T-score
-> limma
-> SAM
-> twilight

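SAM: Significance Analysis of Microarrays

\[ d(i) = \frac{\bar{X}_1 - \bar{X}_2}{s(i) + s_0} \]

\[ s(i) = \sqrt{a \left( \sum_{m} \left(x_m(i) - \bar{X}_1\right)^2 + \sum_{n} \left(x_n(i) - \bar{X}_2\right)^2 \right)} \]

\[ a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2} \]

The sums run over the samples m in group 1 and n in group 2; \bar{X}_1 and \bar{X}_2 are the group means of gene i.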
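The effect of the fudge factor s_0 can be illustrated with a short sketch of the score d(i). The function name `sam_score` and the value s_0 = 0.5 are assumptions made for this example; in practice s_0 is chosen from the data, which is not shown here.

```python
import math
from statistics import mean


def sam_score(group_a, group_b, s0):
    """Regularized score d = (mean_a - mean_b) / (s + s0) for one gene."""
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = mean(group_a), mean(group_b)
    ss = sum((x - m1) ** 2 for x in group_a) + sum((x - m2) ** 2 for x in group_b)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    s = math.sqrt(a * ss)
    return (m1 - m2) / (s + s0)


# A nearly constant gene: tiny difference, even tinier measured variance.
flat_a, flat_b = [5.00, 5.01, 5.02], [5.03, 5.04, 5.05]
plain = sam_score(flat_a, flat_b, s0=0.0)        # ordinary pooled T-score
regularized = sam_score(flat_a, flat_b, s0=0.5)  # hypothetical fudge factor
print(plain, regularized)
```

With s_0 = 0 the nearly constant gene receives a large score in absolute value; with a positive s_0 the score shrinks towards zero, so such genes no longer rise to the top of the list.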

More Scores:

• Wilcoxon score (robust)
• pAUC score (separation)
• paired t-score (paired data)
• F-score (more than 2 conditions)
• Correlation to a reference gene
• etc.

Different scores give different 

rankings 

Disease 1 vs. disease 2

(Golub et al.)

Which Score is the best 

one? 

That depends on your 

problem ...

Next Question: 

Ok, I chose a score and found a set of 

candidate genes 

Can I trust the observed expression 

differences? 

Statistical Analysis

P-values

Everyone knows that the p-value must be below 0.05.

0.05 is a holy number in both medicine and biology.

... what else should you know about p-values?

Rumors

If the gene is not differentially expressed, the p-value is high.

If the gene is differentially expressed, the p-value is low.

Both of these statements are wrong!

Reminder: Type I and Type II ERROR

H0, null hypothesis: gene NOT differential
H1, alternative hypothesis: NOT H0

Positive: rejected H0 (differential gene)
Negative: accepted H0

[Figure: H0 and H1]

The basic idea behind p-values:

We observe a score S = 1.27.

Can this be just a random fluctuation?

Assume: It is a random fluctuation
= The gene is not differentially expressed
= The null hypothesis holds

Theory gives us the distribution of the score under this assumption.

P-value: Probability that a random score is, in absolute value, equal to or higher than S = 1.27 (two-sided test)

Permutations and empirical p-values
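When no theoretical null distribution is available, the distribution of the score can be approximated by permuting the sample labels. The following is a minimal sketch under assumptions: the score is the plain difference of means, labels are shuffled at random rather than enumerated, and all names are hypothetical.

```python
import random
from statistics import mean


def mean_difference(a, b):
    """Simple score: difference of the group means."""
    return mean(a) - mean(b)


def permutation_p_value(group_a, group_b, n_perm, rng, score=mean_difference):
    """Two-sided empirical p-value from random permutations of the sample labels."""
    observed = abs(score(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n1 = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(score(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


rng = random.Random(0)
p_separated = permutation_p_value([10, 11, 12, 13, 14, 15], [1, 2, 3, 4, 5, 6], 1000, rng)
p_identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4], 1000, rng)
print(p_separated, p_identical)
```

The smallest p-value that can be reported is limited by the number of permutations, and with very few samples per group also by the number of distinct label assignments.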

If a gene is not differentially expressed:

The p-value is a uniformly distributed random number between 0 and 1!

It is unlikely that such a number is below 0.05 (5% probability).

If a gene is differentially expressed:

The p-value has no meaning, since it was computed under the assumption that the gene is not differentially expressed.

We hope that it is small because the score is high, but there is absolutely no theoretical support for this.

Testing only one gene:

If the gene is not differentially expressed, a small p-value is unlikely, so we should be surprised if we observe one.

If we make it a rule to discard the gene whenever the p-value is above 0.05, it is unlikely that a random score will pass this filter.

Multiple testing with only non-induced genes

[Figure: three panels showing p-values for 1 gene, 10 genes and 30,000 genes]

The Multiple Testing Problem

Under the null hypothesis, p-values are random numbers between 0 and 1. A single such number is unlikely to fall into the small interval below 0.05, but if we have 30,000 such numbers, many of them will.
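A quick simulation illustrates the problem. The sketch assumes that none of the genes is differentially expressed, so every p-value is drawn uniformly from [0, 1]; the function name and the seed are arbitrary.

```python
import random


def count_false_positives(n_genes, alpha, rng):
    """Count null p-values (uniform on [0, 1]) that fall below alpha."""
    return sum(1 for _ in range(n_genes) if rng.random() < alpha)


rng = random.Random(1)
for n_genes in [1, 10, 30000]:
    hits = count_false_positives(n_genes, 0.05, rng)
    print(f"{n_genes:6d} null genes: {hits} with p < 0.05")
```

On average 5% of the null genes pass the cutoff, so for 30,000 genes roughly 1,500 false positives are expected even though no gene is truly differential.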

We test m hypotheses

                   H0 accepted               H0 rejected
H0 true            correct                   Error = false positive
H0 false (H1)      Error = false negative    correct

FWER = family-wise error rate:

Probability of at least one type I error (false positive) among the genes declared significant (rejected null hypotheses)

[Figure: table of accepted/rejected hypotheses against H0 true/false]

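FDR = False Discovery Rate

Expected proportion of type I errors (false positives) among the rejected hypotheses:

\[ \mathrm{FDR} = E(Q) \quad \text{with} \quad Q = \begin{cases} V / R & \text{if } R > 0 \\ 0 & \text{if } R = 0 \end{cases} \]

where R is the number of rejected hypotheses and V is the number of false positives among them.

[Figure: table of accepted/rejected hypotheses against H0 true/false]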

Controlling the family-wise error rate (FWER)

If we want to avoid random numbers in the interval below the cutoff, we need to make the interval smaller. The more numbers we have, the smaller it must be; for 30,000 numbers it must be very small.

This strategy is called controlling the family-wise error rate.

How to control the FWER?

Note that instead of adjusting the interval border, we can adjust the p-values and leave the cutoff at 0.05.

There are many ways to adjust p-values for multiple testing:

Bonferroni: adjusted p-value = min(1, m · p) for m tests

Better: Westfall and Young
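The Bonferroni adjustment is simple enough to write down directly: each p-value is multiplied by the number of tests and capped at 1. The sketch below uses a hypothetical function name and made-up p-values; the Westfall and Young procedure is resampling-based and is not shown here.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


adjusted = bonferroni_adjust([0.01, 0.04, 0.5])  # hypothetical p-values
significant = [p_adj < 0.05 for p_adj in adjusted]
print(adjusted, significant)
```

With 30,000 genes, a raw p-value has to be below 0.05 / 30,000 to stay significant after this adjustment, which shows how strict the criterion becomes.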

In microarray studies controlling the 

FWER is not a good idea ... It is too 

conservative. 

A different type of error measure 

became more popular 

The False Discovery Rate 

What is the idea?

The FDR

• Score genes and rank them
• Choose a cutoff
• Loosely speaking: The FDR is the best guess for the proportion of false positives among the genes that score above the cutoff

The confusing literature:

There are many different definitions of the false discovery rate in the literature:

• Original: Benjamini-Hochberg
• Positive FDR
• Conditional FDR
• Local FDR

There is also a fundamental difference between controlling and estimating an FDR.
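As an illustration of the original Benjamini-Hochberg approach, the sketch below computes BH-adjusted p-values with the usual step-up rule. It covers only this one definition, controls rather than estimates the FDR, and uses a hypothetical function name and made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down and enforce monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


bh = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])  # hypothetical p-values
print(bh)
```

Selecting all genes whose adjusted value is below a chosen level, say 0.1, gives a list whose expected proportion of false positives is bounded by that level, under the assumptions of the Benjamini-Hochberg procedure.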

In microarray analysis it became popular to use estimated FDRs.

Differences to p-values:

The FDR refers to a list of genes. The p-value refers to a single gene.

The p-value is based on the assumption that the gene is not differentially expressed; the FDR makes no such assumption.

P-values need to be corrected for multiplicity, FDRs do not!

Another difference in concept:

If a 4x change has a small p-value, this means that a 4x change is too high to be a random fluctuation.

Conclusion: The 4x change is significant.

If a list of 150 genes with a 4x change or more has a small estimated FDR, this means that we have more genes on this level than would be expected by chance.

Conclusion: A 4x change can be noise, but 150 genes on that level are too many to be explained by random fluctuation alone.

In FWER analysis the fold change 4x is significant; in FDR analysis it is the number 150 that is significant.

Histograms of the p-values of all genes on the array

FWER: Vertical cutoff 

FDR: Horizontal cutoff
