29.01.2013 Views


Lecture
Microarray Data Analysis
Chapter 1: Introduction, Normalization, Differential Genes, Multiple Testing
Chapter 2: Clustering and Classification


What are DNA microarrays?

[Figure: DNA, mRNA, protein]


What are DNA microarrays?

• Microarrays are technology platforms for measuring the activity of a large number of genes.
• To do this, the gene products (usually mRNA) are quantified.
• The measurement uses DNA sequences immobilized on a surface (the type of surface depends on the platform).

Lecture: Microarray Data Analysis, Chapter 1


What are DNA microarrays?

... microarrays ... measure the activity of genes (... mRNA).

Which other methods do you know that pursue this goal?


Northern blot

[Figure: four lanes labeled RNA]


RT-PCR

[Figure: RNA → cDNA → dsDNA, strands drawn 5′ to 3′]

• Because RNA cannot be amplified directly by PCR, it must first be transcribed into cDNA (reverse transcription, RT).
• Two approaches to quantification are possible:
• 1. Internal endogenous standard (e.g. a housekeeping gene)
• 2. Competitive RT-PCR: so-called mimic fragments are added to the reaction and amplified together with the actual target sequence.


SAGE = Serial Analysis of Gene Expression

Isolate cells
Isolate mRNA and synthesize cDNA
Cut the transcripts with the anchoring enzyme
"Tagging"
Ligate the tags
Sequencing
Quantification


WHAT FOR? A classic example:

[Figure: normal kidney (healthy) and kidney tumor (diseased) → RNA preparation → MEASUREMENT?!]

What distinguishes "tumor" from "normal"?


WITH WHAT? Platforms

Filters, glass chips, Affymetrix


Platforms

Filters: 1991 (Lennon & Lehrach, 1991)
Glass chips: 1995 (Stanford University; Schena et al., 1995)
Affymetrix: 1996 (Lockhart et al., 1996)


Platforms

Filters
- Nylon filter
- one sample
- radioactive signal
- many spots possible
- large area / local effects
- signal bleeding between neighboring spots
- only one sample per hybridization

Glass chips
- glass slide
- red and green sample
- fluorescence signal
- up to ~20,000 spots possible
- simultaneous hybridization of sample and control (red/green)

Affymetrix
- chip
- one probe consisting of 16-20 repeats plus the associated mismatches
- commercial chip
- good, reproducible data
- only one sample per hybridization


Basic principle

[Figure: sequence 1, sequence 2, ..., sequence n (cDNAs or oligos); RNA from sample 1 and sample 2]


Basic principle

[Figure: filters, glass chips, Affymetrix]


Basic assumption

After suitable "purification", the measured signal essentially reflects the amount of RNA in the sample.


Processing microarray data:

[Workflow diagram: experimental design → experiment (microarray) → image processing → raw intensity values → normalization → expression levels → analysis (clustering; class discovery; classification; differential genes; ...) → biological verification → biology, diagnostics, therapy, ...]


Which normalization methods are there?

User-defined sets (useful in "most genes changed" settings)
• Housekeeping genes (?!)
• Internal controls etc.

Whole data set (useful in "most genes unchanged" settings)
• Scaling methods: mean, median, shorth, z-score
• Regression methods: global linear/polynomial, local linear/polynomial, qspline
• Transformation methods: variance stabilization
• Analysis of variance / ML-based methods: ANOVA
• Distribution-based: quantile normalization
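As an illustration of the distribution-based entry, here is a minimal sketch of quantile normalization in plain Python. It is not taken from the lecture: the function name `quantile_normalize`, the toy intensities and the tie handling (ties broken by position) are assumptions made for the example.

```python
from statistics import mean

def quantile_normalize(columns):
    """Give every array (column) the same empirical distribution.

    columns: list of equally long lists, one list of intensities per array.
    Ties are broken by position, which is a simplification.
    """
    n = len(columns[0])
    # Reference distribution: mean of the r-th smallest value across arrays.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[r] for col in sorted_cols) for r in range(n)]

    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda g: col[g])  # gene indices by rank
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized

# Hypothetical intensities: 3 arrays, 4 genes each.
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.5, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized_arrays = quantile_normalize(arrays)
for col in normalized_arrays:
    print([round(v, 2) for v in col])
```

After the call every array contains the same set of values, while the rank order of the genes within each array is unchanged. This is why the method belongs to the "most genes unchanged" group: it assumes the overall distribution should be identical across arrays.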


Observation

The variance of the measured intensity depends on the absolute intensity.

For each spot k, the variance (R_k − G_k)²/2 was plotted against the mean (R_k + G_k)/2. The red line shows the moving average.


Error model: notation

k = 1, …, n genes
i = 1, …, d samples

[Figure: data matrix indexed by gene k and sample i]


Error model

$$Y_{ik} = (a_i + \varepsilon_{ik}) + \bigl(b_i\, b_k \exp(\eta_{ik})\bigr)\, x_{ik}$$

$$\frac{Y_{ik} - a_i}{b_i} = \frac{\varepsilon_{ik} + b_i\, b_k \exp(\eta_{ik})\, x_{ik}}{b_i} = \underbrace{(\varepsilon_{ik} / b_i)}_{\nu_{ik}} + \underbrace{(b_k\, x_{ik})}_{m_{ik}} \exp(\eta_{ik})$$


Example: error model

Rocke and Durbin (J. Comput. Biol. 2001):

$$Y_k = \alpha + \beta_k e^{\eta} + \nu$$

Y_k: measured intensity of gene k
β_k: true expression level of gene k
α: offset
η, ν: multiplicative and additive error, independent and normally distributed

For large expression values β_k the multiplicative error dominates.
For small β_k the additive error dominates.
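The two regimes can be made visible with a small simulation of the model Y = α + β e^η + ν. This is a sketch under assumed parameter values (α = 20, σ_η = 0.2, σ_ν = 10 are invented for illustration, as is the helper name `simulate_intensities`); it is not part of the original slides.

```python
import math
import random
from statistics import mean, stdev

def simulate_intensities(beta, rng, alpha=20.0, sigma_eta=0.2,
                         sigma_nu=10.0, n=5000):
    """Draw n intensities Y = alpha + beta * exp(eta) + nu (assumed parameters)."""
    return [alpha + beta * math.exp(rng.gauss(0.0, sigma_eta))
            + rng.gauss(0.0, sigma_nu)
            for _ in range(n)]

rng = random.Random(1)  # fixed seed so the run is repeatable
sd_by_level = {}
for beta in (0.0, 100.0, 10000.0):
    y = simulate_intensities(beta, rng)
    sd_by_level[beta] = stdev(y)
    print(f"beta={beta:8.0f}  mean={mean(y):10.1f}  sd={sd_by_level[beta]:8.1f}")
```

For β = 0 the spread is set by the additive error alone (close to σ_ν), while for large β it grows roughly in proportion to β, which is the intensity-dependent variance described in the observation slide.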


$$\frac{Y_{ik} - a_i}{b_i} = \nu_{ik} + m_{ik} \exp(\eta_{ik}), \qquad \eta_{ik} \sim N(0, \sigma_\eta^2), \quad \nu_{ik} \sim N(0, \sigma_\nu^2)$$

$$E(Y_{ik}) = a_i + b_i\, m_{ik}\, E(\exp(\eta_{ik}))$$

$$\operatorname{Var}(Y_{ik}) = \operatorname{Var}(\nu_{ik}\, b_i) + \operatorname{Var}(b_i\, m_{ik} \exp(\eta_{ik}))$$

$$= c'^{\,2}_{\eta}\, b_i^2\, m_{ik}^2 + b_i^2\, \sigma_\nu^2$$

$$= c_\eta^2\, \bigl(E(Y_{ik}) - a_i\bigr)^2 + b_i^2\, \sigma_\nu^2$$

with

$$c'^{\,2}_{\eta} = \operatorname{Var}(\exp(\eta_{ik})), \qquad c_\eta^2 = c'^{\,2}_{\eta} / E^2(\exp(\eta_{ik}))$$


This gives the variance as a quadratic function of the mean:

$$\operatorname{Var}(Y_{ik}) = c^2\, \bigl(E(Y_{ik}) - a\bigr)^2 + b^2$$

Now transform the data so that the variance is constant and no longer depends on the mean.


Variance-stabilizing transformation

Let Y_u be a family of random variables with E Y_u = u and Var Y_u = v(u). Define the transformation

$$h(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du$$

⇒ Var h(Y_u) ≈ independent of u
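Why this choice of h works can be seen from a first-order Taylor expansion (the delta method). The following short derivation is added for illustration and assumes that h is smooth and that Y_u stays close to its mean u:

```latex
h(Y_u) \approx h(u) + h'(u)\,(Y_u - u)
\quad\Longrightarrow\quad
\operatorname{Var} h(Y_u) \approx h'(u)^2 \, v(u)
= \left(\frac{1}{\sqrt{v(u)}}\right)^{2} v(u) = 1 .
```

The approximate variance is 1 for every u, so it no longer depends on the mean.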


Variance-stabilizing transformation

$$\operatorname{arsinh}(x) = \log\bigl(x + \sqrt{x^2 + 1}\bigr)$$


The "generalized log" transformation

[Figure: f(x) = log(x) (dashed line) and h_s(x) = arsinh(x/s) (solid line), plotted against intensity from −200 to 1000]

$$\operatorname{arsinh}(x) = \log\bigl(x + \sqrt{x^2 + 1}\bigr)$$

W. Huber et al., ISMB 2002
D. Rocke & B. Durbin, ISMB 2002
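The relation between arsinh and the ordinary logarithm can be checked numerically. The sketch below uses only `math.asinh` from the Python standard library; the function name `glog` and the test intensities are chosen for this example.

```python
import math

def glog(x, s=1.0):
    """Generalized log h_s(x) = arsinh(x / s)."""
    return math.asinh(x / s)

for x in (-200.0, 0.0, 10.0, 1000.0):
    closed_form = math.log(x + math.sqrt(x * x + 1.0))
    # log(2x) only exists for positive intensities
    log_2x = math.log(2.0 * x) if x > 0 else float("nan")
    print(f"x={x:7.1f}  arsinh={glog(x):8.4f}  "
          f"closed form={closed_form:8.4f}  log(2x)={log_2x:8.4f}")
```

For large x the transformation behaves like log(2x), so it keeps the familiar log scale at high intensities, while it remains defined and nearly linear around zero and for negative (background-corrected) values, where the plain logarithm fails.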


Variance-stabilizing transformations

$$f(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du$$

1.) constant CV ('multiplicative'): $v(u) \propto u^2 \;\Rightarrow\; f \propto \log u$
2.) offset: $v(u) \propto (u + u_0)^2 \;\Rightarrow\; f \propto \log(u + u_0)$
3.) additive and multiplicative: $v(u) \propto (u + u_0)^2 + s^2 \;\Rightarrow\; f \propto \operatorname{arsinh}\dfrac{u + u_0}{s}$
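Case 3 can be checked by simulation. The sketch below assumes u_0 = 0 and a variance function v(u) = (C·u)² + S² with invented noise levels C = 0.1 and S = 10; the noise is drawn as Gaussian for simplicity, which is an assumption of the example and not a statement about real array data.

```python
import math
import random
from statistics import stdev

C, S = 0.1, 10.0  # assumed multiplicative and additive noise levels

def draw(u, rng):
    """One observation with mean u and variance (C*u)**2 + S**2."""
    return u + rng.gauss(0.0, math.sqrt((C * u) ** 2 + S ** 2))

def h(y):
    """arsinh transform, scaled so the delta method predicts variance 1."""
    return math.asinh(C * y / S) / C

rng = random.Random(42)  # fixed seed so the run is repeatable
raw_sd, stabilized_sd = {}, {}
for u in (0.0, 100.0, 1000.0, 10000.0):
    ys = [draw(u, rng) for _ in range(20000)]
    raw_sd[u] = stdev(ys)
    stabilized_sd[u] = stdev([h(y) for y in ys])
    print(f"u={u:8.0f}  sd(Y)={raw_sd[u]:8.1f}  sd(h(Y))={stabilized_sd[u]:5.2f}")
```

The raw standard deviation grows by about two orders of magnitude over this range, while the standard deviation after the transformation stays close to 1 at every intensity level.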


Robust parameter estimation

$$\operatorname{arsinh}\frac{Y_{ki} - a_i}{b_i} = \mu_k + \varepsilon_{ki}, \qquad \varepsilon_{ki} \sim N(0, c^2)$$

• Robust maximum likelihood estimation of the parameters

$$M = \bigl\{ \{a_i\},\ \{b_i\},\ c,\ \{\mu_k\} \bigr\}$$




Finding differential genes

[Figure: expression matrix, genes by patients, samples, time points ...]

Two cell/tissue/disease types:
- wild-type / mutant
- control / treated
- disease A / disease B
- responding / non-responding
- etc.

For every sample (cell line/patient) we have the expression levels of thousands of genes and the information whether it is A or B.


Is a three-fold induced gene more trustworthy than a two-fold induced gene?

[Figure: log ratio plotted against product intensity (log scale)]


[Figure: expression values in groups A and B, two panels]

Conclusion: In addition to the differences in gene expression, you also have a vital interest in their variability. This information is needed to obtain meaningful lists of genes.


Standard Deviation and Standard Error

Standard deviation (SD): variability of the measurement
Standard error (SE): variability of the mean of several measurements

n replications
Normally distributed data:
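For n independent replicates the two quantities are linked by SE = SD / √n. A minimal sketch with invented replicate values (the numbers are hypothetical log intensities, not data from the lecture):

```python
import math
from statistics import mean, stdev

replicates = [7.9, 8.4, 8.1, 7.6, 8.0]  # hypothetical log intensities of one gene

n = len(replicates)
sd = stdev(replicates)   # sample SD, uses n - 1 in the denominator
se = sd / math.sqrt(n)   # standard error of the mean
print(f"mean={mean(replicates):.2f}  SD={sd:.3f}  SE={se:.3f}")
```

More replicates do not reduce the SD, which describes the measurement itself, but they shrink the SE, which describes how precisely the mean is known.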


Questions:

Which genes are differentially expressed?
-> Ranking

Are these results "significant"?
-> Statistical analysis

That means: Is the probability that the result arose "by chance" sufficiently small?


Ranking:

Problem: Produce an ordered list of differentially expressed genes, starting with the most up-regulated gene and ending with the most down-regulated gene.

Ranking means finding the right genes … drawing our attention to them.

In many applications it is the most important step.


Ranking is not Testing

Ranking: Finding the right genes
Testing: Deciding whether genes are significant

There is more than one way to rank.
There is more than one way to test.

The criteria for the best ranking differ from the criteria for the best test … power is often no argument.


Ranking: Order the genes by fold change or score -> the list may include some genes that are not differential in reality (false positives)

Gene candidate 1
Gene candidate 2
Gene candidate 3
Gene candidate 4
Gene candidate 5
Gene candidate 6
Gene candidate 7
Gene candidate 8
Gene candidate 9
Gene ....

Order by some score; intuitively: fold change
1st: most differential,
2nd: second most differential
...


Testing: Find the genes whose fold change or score is significant, such that fewer than 5% are false positives -> you may miss some (false negatives)

Gene candidate 1
Gene candidate 2
Gene candidate 3
Gene candidate 4
Gene candidate 5
Gene candidate 6
Gene candidate 7
Gene candidate 8
Gene candidate 9
Gene ....

Order by some score; intuitively: fold change
1st: most differential,
2nd: second most differential
...


Which gene is more differentially expressed?


Ranking is Scoring

You need to score differential gene expression.

Different scores lead to different rankings.

What scores are there?


T-Score

Idea: Take variances into account

[Figure, three panels: change low, variance high | change high, variance low | change high, variance high]


Change: HIGH, variance: SMALL -> T huge
Change: SMALL, variance: HIGH -> T ~ 0


Change: HIGH, variance: HIGH -> T ?
Change: SMALL, variance: SMALL -> T ?
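These cases can be tried out with a few invented numbers. The sketch below uses the Welch form of the t-score (difference of group means divided by its estimated standard error); the slides do not state which variant is meant, so this choice and the toy data are assumptions.

```python
import math
from statistics import mean, variance

def t_score(a, b):
    """Welch t-score: difference of means over its estimated standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical expression values, three replicates per group.
high_change_low_var = t_score([10.0, 10.1, 9.9], [8.0, 8.1, 7.9])
low_change_high_var = t_score([9.0, 11.0, 7.0], [8.8, 6.9, 10.9])
high_change_high_var = t_score([10.0, 13.0, 7.0], [8.0, 11.0, 5.0])

print(f"change high, variance low : T = {high_change_low_var:6.2f}")
print(f"change low,  variance high: T = {low_change_high_var:6.2f}")
print(f"change high, variance high: T = {high_change_high_var:6.2f}")
```

The same two-unit difference in means gives a very large T when the replicates agree and an unremarkable T when they scatter, so the answer to "T ?" depends on the ratio of change to variability, not on either quantity alone.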


T-score, t-test, p-value

Compute the t-scores for a random experiment.

Draw a histogram of the t-scores and mark the 5% highest and lowest (red).

Compute the t-score for gene x and draw it in (green).

What is the probability of being at least as extreme as the green arrow?
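One way to obtain the "random experiment" is to shuffle the sample labels. The sketch below estimates a two-sided empirical p-value this way; the function names, the number of permutations and the expression values are assumptions made for the example, and the Welch t-score stands in for whichever score is used.

```python
import math
import random
from statistics import mean, variance

def t_score(a, b):
    """Welch t-score: difference of means over its estimated standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided empirical p-value from random relabelings of the samples."""
    rng = random.Random(seed)
    observed = abs(t_score(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random assignment of samples to the two groups
        null_t = t_score(pooled[:len(a)], pooled[len(a):])
        if abs(null_t) >= observed:
            hits += 1
    # The +1 counts the observed labeling itself and avoids p = 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: gene x differs between groups, gene y does not.
p_x = permutation_p_value([8.1, 8.4, 7.9, 8.6, 8.3, 8.2],
                          [6.9, 7.2, 7.0, 7.4, 6.8, 7.1])
p_y = permutation_p_value([7.0, 7.5, 6.8, 7.2, 7.1, 7.3],
                          [7.2, 6.9, 7.4, 7.0, 7.3, 7.2])
print(f"gene x: p = {p_x:.4f}")
print(f"gene y: p = {p_y:.4f}")
```

The empirical p-value is simply the fraction of shuffled experiments whose score is at least as extreme as the observed one, which is the area beyond the green arrow in the histogram.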


T-Test PROBLEMS

• There are many genes (-> tests) but only a few replicates
• Is "s" a good estimate of the variability?
• If the measured variance is small, T easily becomes very large

Therefore: for microarrays it is reasonable to use a modified version of the t-test


Fudge Factors:

You need to estimate the variance from the data.
You might underestimate an already small variance (constantly expressed genes).
The denominator in T becomes really small.
Constantly expressed genes show up at the top of the list.

Correction: Add a constant fudge factor s_0

Regularized t-score
-> Limma
-> SAM
-> Twilight


SAM: Significance Analysis of Microarrays

$$d(i) = \frac{\bar{X}_1(i) - \bar{X}_2(i)}{s(i) + s_0}$$

$$s(i) = \sqrt{a \left( \sum_{m} \bigl(x_m(i) - \bar{X}_1(i)\bigr)^2 + \sum_{n} \bigl(x_n(i) - \bar{X}_2(i)\bigr)^2 \right)}$$

$$a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2}$$
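The effect of the fudge factor can be seen by computing d(i) for one gene with and without s_0. This is a sketch only: the function name `sam_score`, the value s_0 = 0.5 and the expression values are invented for the example (SAM itself chooses s_0 from the data, which is not reproduced here).

```python
import math
from statistics import mean

def sam_score(x1, x2, s0):
    """Regularized t-score d = (mean1 - mean2) / (s + s0) for one gene."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    s = math.sqrt(a * ss)  # pooled standard error of the mean difference
    return (m1 - m2) / (s + s0)

# Hypothetical genes: an almost constant one and a clearly changed one.
flat_gene = ([5.00, 5.01, 5.02], [4.98, 4.99, 5.00])
real_gene = ([8.0, 8.6, 7.7], [6.1, 5.6, 6.4])

for name, (g1, g2) in (("flat", flat_gene), ("real", real_gene)):
    print(f"{name} gene: d(s0=0) = {sam_score(g1, g2, 0.0):5.2f}   "
          f"d(s0=0.5) = {sam_score(g1, g2, 0.5):5.2f}")
```

Without s_0 the nearly constant gene reaches a respectable score purely because its variance estimate is tiny; with s_0 added to the denominator its score collapses, while the gene with a real change keeps a clearly non-zero score.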


More Scores:

- Wilcoxon score (robust)
- pAUC score (separation)
- paired t-score (paired data)
- F-score (more than 2 conditions)
- Correlation to a reference gene
- etc.


Different scores give different rankings

[Figure: disease 1 vs. disease 2 (Golub et al.)]


Which score is the best one?

That depends on your problem ...


Next Question:

OK, I chose a score and found a set of candidate genes.

Can I trust the observed expression differences?

Statistical analysis


P-Values

Everyone knows that the p-value must be below 0.05.

0.05 is a holy number in both medicine and biology.

... what else should you know about p-values?


Rumors

If the gene is not differentially expressed, the p-value is high.

If the gene is differentially expressed, the p-value is low.

Both these statements are wrong!


Reminder: Type I and Type II ERROR

H0 (null hypothesis): the gene is NOT differential
H1 (alternative hypothesis): NOT H0

Positive: H0 rejected (differential gene)
Negative: H0 accepted

Type I error: H0 is rejected although it is true (false positive).
Type II error: H0 is accepted although it is false (false negative).


Reminder: Type I and Type II ERROR

[Figure: H0 and H1]


The basic idea behind p-values:

We observe a score S = 1.27.
Can this be just a random fluctuation?

Assume: It is a random fluctuation
= The gene is not differentially expressed
= The null hypothesis holds

Theory gives us the distribution of the score under this assumption.

P-value: Probability that a random score is equal to or higher than S = 1.27 in absolute value (two-sided test)


Permutations and empirical p­values


If a gene is not differentially expressed:

The p-value is a random number between 0 and 1 (uniformly distributed)!

It is unlikely that such a number is below 0.05 (5% probability).


If a gene is differentially expressed:

The p-value has no meaning, since it was computed under the assumption that the gene is not differentially expressed.

We hope that it is small since the score is high, but there is absolutely no theoretical support for this.


Testing only one gene:

If the gene is not differentially expressed, a small p-value is unlikely, hence we should be surprised by this observation.

If we make it a rule to discard the gene whenever the p-value is above 0.05, it is unlikely that a random score will pass this filter.


Multiple testing with only non-induced genes

[Figure, three panels: 1 gene, 10 genes, 30,000 genes]


The Multiple Testing Problem

P-values are random numbers between 0 and 1. A single such number is unlikely to fall into the small interval below 0.05, but if we have 30,000 such numbers, many will be in there (on average 5% of them, about 1,500).
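The effect is easy to reproduce: draw uniform random numbers as stand-ins for the p-values of non-induced genes and count how many fall below 0.05. A minimal sketch; the function name and seed are assumptions of the example.

```python
import random

def count_false_positives(n_genes, alpha=0.05, seed=0):
    """Count null p-values below alpha; under H0 they are uniform on [0, 1]."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_genes))

for n_genes in (1, 10, 30000):
    hits = count_false_positives(n_genes)
    print(f"{n_genes:6d} non-induced genes -> {hits} p-values below 0.05")
```

With 30,000 genes and no real effect at all, roughly 30,000 × 0.05 = 1,500 genes still pass the 0.05 threshold.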


We test m hypotheses

H0 true, H0 accepted: correct
H0 true, H0 rejected: error = false positive
H0 false (H1), H0 accepted: error = false negative
H0 false (H1), H0 rejected: correct


FWER = family-wise error rate:

Probability of at least one type I error (false positive) among the genes accepted as significant

[Table: H0 true/false versus hypothesis accepted/rejected]


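FDR = False Discovery Rate

Expected proportion of type I errors (false positives) among the rejected hypotheses:

$$\mathrm{FDR} = E(Q) \quad \text{with} \quad Q = \begin{cases} V / R & \text{if } R > 0 \\ 0 & \text{if } R = 0 \end{cases}$$

where V is the number of false positives and R the number of rejected hypotheses.

[Table: H0 true/false versus hypothesis accepted/rejected]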


Controlling the family-wise error rate (FWER)

If we want to avoid random numbers in this interval, we need to make it smaller. The more numbers, the smaller the interval must be. For 30,000 numbers it must be very small.

This strategy is called: controlling the family-wise error rate


How to control the FWER?

Note that adjusting the interval border can also be done by adjusting the p-values and leaving the cutoff at 0.05.

There are many ways to adjust p-values for multiple testing:

Bonferroni: p_adj = min(1, m · p) for m tests

Better: Westfall and Young
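The Bonferroni adjustment is a one-liner. The sketch below applies it to five invented p-values; the Westfall and Young procedure needs resampling of the data and is not shown.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.00001, 0.0004, 0.003, 0.04, 0.2]  # hypothetical raw p-values
adjusted = bonferroni(raw)
significant = [p <= 0.05 for p in adjusted]
for p, q, s in zip(raw, adjusted, significant):
    print(f"raw={p:<8g} adjusted={q:<8g} significant={s}")
```

The cutoff stays at 0.05, but the raw p-value of 0.04 no longer passes after adjustment. With m = 30,000 genes a raw p-value would have to be below about 1.7e-6 to survive, which illustrates how strict this control is.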


In microarray studies controlling the FWER is not a good idea ... It is too conservative.

A different type of error measure became more popular:

The False Discovery Rate

What is the idea?


The FDR

• Score genes and rank them
• Choose a cutoff
• Loosely speaking: The FDR is the best guess for the fraction of false positives among the genes that score above the cutoff


The confusing literature:

There are many different definitions of the false discovery rate in the literature:

- Original: Benjamini-Hochberg
- Positive FDR
- Conditional FDR
- Local FDR

There is also a fundamental difference between controlling and estimating an FDR.


In microarray analysis it became popular to use estimated FDRs

Differences to p-values:

The FDR refers to a list of genes. The p-value refers to a single gene.

The p-value is based on the assumption that the gene is not differentially expressed; the FDR makes no such assumption.

P-values need to be corrected for multiplicity, FDRs do not!


Another difference in concept:

If a 4x change has a small p-value, this means that a 4x change is too high to be a random fluctuation.
Conclusion: A 4x change is significant.

If a list of 150 genes with a 4x change or more has a small estimated FDR, this means that we have more genes at this level than would be expected by chance.
Conclusion: A 4x change can be noise, but 150 genes at that level are too many to be explained by random fluctuation alone.

In FWER analysis the fold change 4x is significant; in FDR analysis it is the number 150 that is significant.


Histograms of the p-values of all genes on the array


FWER: Vertical cutoff
FDR: Horizontal cutoff
