Lecture Microarray Data Analysis Chapter 1: Introduction ... - Lectures
<strong>Lecture</strong><br />
<strong>Microarray</strong> <strong>Data Analysis</strong><br />
<strong>Chapter</strong> 1: <strong>Introduction</strong>, Normalization,<br />
Differential Genes, Multiple Testing<br />
<strong>Chapter</strong> 2: Clustering and Classification
What are DNA <strong>Microarray</strong>s?<br />
Protein<br />
mRNA<br />
DNA
What are DNA <strong>Microarray</strong>s?<br />
• <strong>Microarray</strong>s are technology platforms for measuring the activity of a large number of genes.<br />
• To do so, the gene products (usually mRNA) are quantified.<br />
• For this purpose, DNA sequences are immobilized on a surface (which differs by platform).<br />
<strong>Lecture</strong>: <strong>Microarray</strong> <strong>Data Analysis</strong> <strong>Chapter</strong> 1
What are DNA <strong>Microarray</strong>s?<br />
... <strong>Microarray</strong>s ... measuring the activity of genes (... mRNA).<br />
Which other methods do you know that pursue this goal?
Northern Blot<br />
[Figure: Northern blot, labeled RNA]
RT-PCR<br />
[Figure: RNA (5' to 3') is reverse-transcribed into cDNA and then amplified as dsDNA]<br />
• Because RNA cannot be amplified directly by PCR, it must first be transcribed into cDNA (reverse transcription, RT).<br />
• Two approaches to quantification are possible:<br />
• 1. Internal endogenous standard (e.g. a housekeeping gene)<br />
• 2. Competitive RT-PCR: so-called mimic fragments are added to the reaction and amplified together with the actual target sequence
SAGE = Serial Analysis of Gene Expression<br />
Isolate cells<br />
Isolate mRNA and synthesize cDNA<br />
Cut the transcript with the anchoring enzyme<br />
"Tagging"<br />
Ligate the tags<br />
Sequencing<br />
Quantification
WHAT FOR? Classic example:<br />
Normal kidney<br />
diseased / healthy<br />
RNA preparation<br />
Tumor (kidney)<br />
MEASUREMENT ?!<br />
What distinguishes "Tumor" from "Normal"?
WITH WHAT? Platforms<br />
Filters, glass chips, Affymetrix
Platforms<br />
Filters: 1991 (Lennon & Lehrach, 1991)<br />
Glass chips: 1995 (Stanford University, Schena et al., 1995)<br />
Affymetrix: 1996 (Lockhart et al., 1996)
Platforms<br />
Nylon filter: one sample; radioactive signal; many spots possible; large area / local effects; signal bleeding between spots; only one sample per hybridization run<br />
Glass slide: red and green sample; fluorescence signal; up to ~20,000 spots possible; simultaneous hybridization of sample and control (red/green)<br />
Chip: one probe consisting of 16-20 repeats and the corresponding mismatches; commercial chip; good, reproducible data; only one sample per hybridization run
Basic principle<br />
[Figure: Sequence 1, Sequence 2, ..., Sequence n (cDNAs or oligos); RNA from Sample 1 and Sample 2]
Basic principle<br />
Filters, glass chips, Affymetrix
Basic assumption<br />
After suitable "clean-up", the measured signal fundamentally reflects the amount of RNA in the sample.
Processing of <strong>Microarray</strong> data:<br />
Experiment design → Experiment (<strong>Microarray</strong>) → Image processing → Raw intensity values → Normalization → Expression levels<br />
→ Analysis: Clustering; Class Discovery; Classification; Differential Genes; ...<br />
→ Biological verification<br />
→ Biology, Diagnostics, Therapy, ...
Which normalization methods are there?<br />
User-defined sets (useful in "Most Genes Changed" settings):<br />
• Housekeeping genes (?!)<br />
• Internal controls etc.<br />
Whole data set (useful in "Most Genes Unchanged" settings):<br />
• Scaling methods: Mean, Median, Shorth, Z-score<br />
• Regression methods: global linear/polynomial, local linear/polynomial, qspline<br />
• Transformation methods: variance stabilization<br />
• Analysis of Variance / ML-based methods: ANOVA<br />
• Distribution-based: quantile normalization
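To make two of the listed method families concrete, here is a minimal sketch of median scaling and quantile normalization. It is an illustration only: the function names and the toy intensities are hypothetical, each inner list stands for one array, and ties between values are not treated specially.

```python
from statistics import median

def median_scale(arrays):
    """Scaling method: rescale each array so that all arrays share one median."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

def quantile_normalize(arrays):
    """Distribution-based method: give every array the same value distribution."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda k: a[k])  # gene indices, lowest first
        normalized = [0.0] * n
        for rank, k in enumerate(order):
            normalized[k] = reference[rank]
        result.append(normalized)
    return result

# Toy data (hypothetical): three arrays, four genes each.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
scaled = median_scale(arrays)
qn = quantile_normalize(arrays)
print(scaled)
print(qn)
```

Both functions rely on the "most genes unchanged" assumption named on the slide: they remove array-wide differences, so they would also remove a genuine global shift in expression.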
Observation<br />
The variance of the measured intensity depends on the absolute intensity.<br />
For each spot k, the variance (R_k − G_k)²/2 was plotted against the mean (R_k + G_k)/2.<br />
The red line shows the moving average.
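The variance-versus-mean diagnostic described on this slide can be sketched as follows. The data are simulated under an assumed additive-plus-multiplicative noise model, and the function name, window size, and noise levels are hypothetical choices for illustration, not values from the lecture.

```python
import math
import random

def variance_vs_mean(red, green, window=101):
    """Per-spot mean (R+G)/2 and variance (R-G)^2/2, plus a moving average."""
    points = sorted(((r + g) / 2, (r - g) ** 2 / 2) for r, g in zip(red, green))
    means = [m for m, _ in points]
    variances = [v for _, v in points]
    half = window // 2
    smooth = []
    for k in range(len(points)):
        lo, hi = max(0, k - half), min(len(points), k + half + 1)
        smooth.append(sum(variances[lo:hi]) / (hi - lo))
    return means, variances, smooth

# Simulated two-channel data (assumption): the same true level in both
# channels, disturbed by multiplicative and additive noise.
random.seed(0)
truth = [random.uniform(10, 1000) for _ in range(2000)]
red = [t * math.exp(random.gauss(0, 0.2)) + random.gauss(0, 5) for t in truth]
green = [t * math.exp(random.gauss(0, 0.2)) + random.gauss(0, 5) for t in truth]

means, variances, smooth = variance_vs_mean(red, green)
# Moving average at the low-intensity and the high-intensity end.
print(round(smooth[0], 1), round(smooth[-1], 1))
```

In this simulation the smoothed variance grows strongly with intensity, which is the behavior the slide reports for real arrays.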
Error model: notation<br />
k = 1, ..., n genes<br />
i = 1, ..., d samples
Error model<br />
\[ Y_{ik} = (a_i + \varepsilon_{ik}) + (b_i\, b_k \exp(\eta_{ik}))\, x_{ik} \]
\[ \frac{Y_{ik} - a_i}{b_i} = \frac{\varepsilon_{ik} + b_i\, b_k \exp(\eta_{ik})\, x_{ik}}{b_i} \]
\[ \frac{Y_{ik} - a_i}{b_i} = \underbrace{(\varepsilon_{ik} / b_i)}_{\nu_{ik}} + \underbrace{(b_k\, x_{ik})}_{m_{ik}} \exp(\eta_{ik}) \]
Example: error model<br />
Rocke and Durbin (J. Comput. Biol. 2001):<br />
\[ Y_k = \alpha + \beta_k\, e^{\eta} + \nu \]
Y_k: measured intensity of gene k<br />
β_k: true expression level of gene k<br />
α: offset<br />
η, ν: multiplicative/additive error, independent and normally distributed<br />
For large expression values β_k the multiplicative error dominates.<br />
For small β_k the additive error dominates.
\[ \frac{Y_{ik} - a_i}{b_i} = \nu_{ik} + m_{ik} \exp(\eta_{ik}) \]
\[ E(Y_{ik}) = a_i + b_i\, m_{ik}\, E(\exp(\eta_{ik})) \]
\[ \operatorname{Var}(Y_{ik}) = \operatorname{Var}(\nu_{ik}\, b_i) + \operatorname{Var}(b_i\, m_{ik} \exp(\eta_{ik})) \]
\[ = (c'_\eta)^2\, b_i^2\, m_{ik}^2 + b_i^2\, \sigma_\nu^2 \]
\[ = c_\eta^2\, (E(Y_{ik}) - a_i)^2 + b_i^2\, \sigma_\nu^2 \]
with<br />
\[ (c'_\eta)^2 = \operatorname{Var}(\exp(\eta_{ik})), \qquad c_\eta^2 = (c'_\eta)^2 / E^2(\exp(\eta_{ik})) \]
\[ \eta_{ik} \sim N(0, \sigma_\eta^2), \qquad \nu_{ik} \sim N(0, \sigma_\nu^2) \]
This yields a variance that is a quadratic function of the mean:<br />
\[ \operatorname{Var}(Y_{ik}) = c^2\, (E(Y_{ik}) - a)^2 + b^2 \]
Now transform the data so that the variance is constant and no longer depends on the mean.
Variance-stabilizing transformation<br />
Let Y_u be a family of random variables with E Y_u = u and Var Y_u = v(u). Define the transformation<br />
\[ h(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du \]
⇒ Var h(Y_u) is approximately independent of u
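Why this choice of h works can be seen from a standard first-order (delta-method) argument. This is a sketch that assumes h is smooth, h'(x) = 1/\sqrt{v(x)}, and Y_u is concentrated near its mean u:

```latex
h(Y_u) \approx h(u) + h'(u)\,(Y_u - u)
\quad\Longrightarrow\quad
\operatorname{Var} h(Y_u) \approx h'(u)^2\, v(u)
= \left( \frac{1}{\sqrt{v(u)}} \right)^{2} v(u) = 1 .
```

The approximation is only as good as the linearization, so the stabilized variance is roughly, not exactly, constant.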
Variance-stabilizing transformation<br />
\[ \operatorname{arsinh}(x) = \log\left( x + \sqrt{x^2 + 1} \right) \]
The "generalized log" transformation<br />
[Figure: f(x) = log(x) and h_s(x) = arsinh(x/s) plotted against intensity]<br />
\[ \operatorname{arsinh}(x) = \log\left( x + \sqrt{x^2 + 1} \right) \]
W. Huber et al., ISMB 2002<br />
D. Rocke & B. Durbin, ISMB 2002
Variance-stabilizing transformations<br />
\[ f(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du \]
1.) constant CV ('multiplicative'): \( v(u) \propto u^2 \Rightarrow f \propto \log u \)<br />
2.) offset: \( v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0) \)<br />
3.) additive and multiplicative: \( v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s} \)
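Case 3 can be checked numerically. The sketch below simulates intensities Y = a + m·exp(η) + ν at several true levels m and compares the variance before and after the arsinh transformation. The constants A, SD_ADD, and SD_MULT are hypothetical, and the scale is set to s = SD_ADD / SD_MULT, which uses the approximation that the coefficient of variation of exp(η) is close to the standard deviation of η.

```python
import math
import random
from statistics import variance

A, SD_ADD, SD_MULT = 50.0, 20.0, 0.2   # hypothetical offset and noise levels
S = SD_ADD / SD_MULT                    # scale s in arsinh((y - a) / s)

def simulate(m, n=4000):
    """Draw n intensities Y = a + m * exp(eta) + nu for a true level m."""
    return [A + m * math.exp(random.gauss(0, SD_MULT)) + random.gauss(0, SD_ADD)
            for _ in range(n)]

random.seed(1)
levels = [0.0, 100.0, 1000.0, 10000.0]
raw_var, stab_var = [], []
for m in levels:
    y = simulate(m)
    raw_var.append(variance(y))
    stab_var.append(variance([math.asinh((v - A) / S) for v in y]))

for m, rv, sv in zip(levels, raw_var, stab_var):
    print(f"m={m:>8.0f}  Var(Y)={rv:14.1f}  Var(arsinh)={sv:.4f}")
```

The raw variance spans several orders of magnitude across the levels, while the transformed variance stays close to SD_MULT squared. In practice a, s, and the remaining parameters are unknown and have to be estimated from the data.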
Robust parameter estimation<br />
\[ \operatorname{arsinh}\frac{Y_{ik} - a_i}{b_i} = \mu_k + \varepsilon_{ik}, \qquad \varepsilon_{ik} \sim N(0, c^2) \]
• Robust maximum likelihood estimation of the parameters<br />
\[ M = \{ \{a_i\}, \{b_i\}, c, \{\mu_k\} \} \]
Finding differential genes<br />
Genes<br />
Two cell/tissue/disease types:<br />
wildtype / mutant<br />
control / treated<br />
disease A / disease B<br />
responding / non responding<br />
etc. etc....<br />
Patients, Samples, Timepoints ...<br />
For every sample (cell line/patient) we have the<br />
expression levels of thousands of genes and<br />
the information whether it is A or B
Is a threefold induced gene more trustworthy<br />
than a twofold induced gene?<br />
[Figure: log-ratio plotted against product intensity (log scale)]
A B<br />
Conclusion: In addition to the<br />
differences in gene expression you<br />
also have a vital interest in its<br />
variability ... This information is<br />
needed to obtain meaningful lists<br />
of genes<br />
A B
Standard Deviation and Standard<br />
Error<br />
Standard Deviation (SD): Variability of the<br />
measurement<br />
Standard Error (SE): Variability of the mean of<br />
several measurements<br />
n replicates<br />
Normally distributed data:
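The two quantities differ only by a factor of the square root of n: SE = SD / √n. A minimal sketch with hypothetical replicate values:

```python
import math
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))

# Hypothetical replicate measurements of one gene.
replicates = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sd = stdev(replicates)            # variability of a single measurement
se = standard_error(replicates)   # variability of the mean of n measurements
print(mean(replicates), sd, se)
```

More replicates leave the SD roughly unchanged but shrink the SE, which is why replication makes the estimated mean expression more reliable.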
Questions:<br />
Which genes are differentially expressed?<br />
→ Ranking<br />
Are these results "significant"?<br />
→ Statistical Analysis<br />
That means: Is the probability sufficiently<br />
small that the result is "by chance"?
Ranking:<br />
Problem: Produce an ordered list of<br />
differentially expressed genes starting<br />
with the most up-regulated gene and<br />
ending with the most down-regulated<br />
gene<br />
Ranking means finding the right genes<br />
… drawing our attention to them<br />
In many applications it is the most<br />
important step
Ranking is not Testing<br />
Ranking: Finding the right genes<br />
Testing: Deciding whether genes are<br />
significant<br />
There is more than one way to rank<br />
There is more than one way to test<br />
The criteria for the best ranking<br />
differ from the criteria for the best<br />
test … power is often no argument
Ranking: Order genes by amount of fold<br />
change/score → the list may contain some that are not<br />
differential in reality (False Positives)<br />
Gene candidate 1<br />
Gene candidate 2<br />
Gene candidate 3<br />
Gene candidate 4<br />
Gene candidate 5<br />
Gene candidate 6<br />
Gene candidate 7<br />
Gene candidate 8<br />
Gene candidate 9<br />
Gene ....<br />
Ordered by some score,<br />
Intuitively: Fold change<br />
1st: most differential,<br />
2nd: second most diff<br />
...
Testing: Select genes by amount of fold<br />
change/score that are significant, such that there are fewer<br />
than 5% False Positives → you may miss some<br />
(False Negatives)<br />
Gene candidate 1<br />
Gene candidate 2<br />
Gene candidate 3<br />
Gene candidate 4<br />
Gene candidate 5<br />
Gene candidate 6<br />
Gene candidate 7<br />
Gene candidate 8<br />
Gene candidate 9<br />
Gene ....<br />
Ordered by some score,<br />
Intuitively: Fold change<br />
1st: most differential,<br />
2nd: second most diff<br />
...
Which gene is more differentially<br />
expressed?
Ranking is Scoring<br />
You need to score differential<br />
gene expression<br />
Different scores lead to different<br />
rankings<br />
What scores are there?
T-Score<br />
Idea: Take variances into account<br />
[Figure panels: (1) change low, variance high; (2) change high, variance low; (3) change high, variance high]
Change: HIGH<br />
Variance: SMALL<br />
T huge<br />
Change: SMALL<br />
Variance: HIGH<br />
T ~ 0
Change: HIGH<br />
Variance: HIGH<br />
T ?<br />
Change: SMALL<br />
Variance: SMALL<br />
T ?
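The first two cases can be reproduced with a small sketch. The slides do not state which form of the T-score is meant; the version below uses the unequal-variance (Welch) form, T = (mean_A − mean_B) / sqrt(s_A²/n_A + s_B²/n_B), as an assumption. The gene values are hypothetical.

```python
import math
from statistics import mean, variance

def t_score(a, b):
    """Difference of group means divided by its estimated standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Change HIGH, variance SMALL: T is huge.
t_clear = t_score([10.0, 10.1, 9.9], [5.0, 5.1, 4.9])
# Change SMALL, variance HIGH: T is close to 0.
t_noisy = t_score([5.5, 9.0, 1.0], [5.0, 8.0, 1.5])
print(t_clear, t_noisy)
```

For the two remaining cases (both high, or both small) the score depends on the ratio of change to variability, which is why the slide leaves T open.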
T-Score, T-test, P-value<br />
Compute T-scores for a random experiment<br />
Draw a histogram of the T-scores and mark the 5% highest and lowest (red)<br />
Compute the T-score for gene x and draw it in (green)<br />
What is the probability of being at least as extreme as the green arrow?
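One way to obtain the "random experiment" is to shuffle the sample labels. The sketch below estimates an empirical two-sided p-value for a single gene this way. It is an assumption-laden illustration: the Welch-type T-score, the number of permutations, the add-one correction, and the example values are all hypothetical choices.

```python
import math
import random
from statistics import mean, variance

def t_score(a, b):
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Share of label shufflings whose |T| is at least the observed |T|."""
    rng = random.Random(seed)
    observed = abs(t_score(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = t_score(pooled[:len(a)], pooled[len(a):])
        if abs(t) >= observed - 1e-12:   # tolerance for rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)     # add-one keeps the estimate above 0

# Hypothetical genes: one clearly shifted between groups, one not.
p_shifted = permutation_p_value([8.1, 7.9, 8.4, 8.0, 8.3], [5.2, 4.8, 5.1, 5.5, 4.9])
p_flat = permutation_p_value([5.0, 6.1, 4.7, 5.6, 5.3], [5.2, 4.9, 5.8, 5.5, 5.1])
print(p_shifted, p_flat)
```

With five samples per group there are only 252 distinct label splits, so the smallest attainable p-value is limited; this is one face of the "few repetitions" problem.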
T-Test PROBLEMS<br />
• There are many genes (→ many tests) but only<br />
few repetitions<br />
• Is the sample standard deviation "s" a good estimate?<br />
• If the measured variance is small, T<br />
easily becomes very large<br />
Therefore: for microarrays it is reasonable<br />
to use a modified version of the T-test
Fudge Factors:<br />
You need to estimate the variance from data<br />
You might underestimate an already small variance<br />
(constantly expressed genes)<br />
The denominator in T becomes really small<br />
Constantly expressed genes show up on top of the list<br />
Correction: Add a constant fudge factor s_0<br />
Regularized T-score<br />
→ Limma<br />
→ SAM<br />
→ Twilight
SAM: Significance Analysis of <strong>Microarray</strong>s<br />
\[ d(i) = \frac{\bar{X}_1 - \bar{X}_2}{s(i) + s_0} \]
\[ s(i) = \sqrt{ a \left( \sum_{m} (x_m(i) - \bar{X}_1)^2 + \sum_{n} (x_n(i) - \bar{X}_2)^2 \right) } \]
\[ a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2} \]
The sums run over the samples m in group 1 and the samples n in group 2.
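The effect of the fudge factor can be illustrated with a small sketch of the statistic d(i) for one gene. The function name, the example values, and the choice s_0 = 0.5 are hypothetical; SAM itself derives s_0 from the data, which is not reproduced here.

```python
import math
from statistics import mean

def sam_d(x1, x2, s0):
    """Regularized score d = (mean1 - mean2) / (s + s0) for one gene."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    s = math.sqrt(a * ss)
    return (m1 - m2) / (s + s0)

# Nearly constant gene with a tiny measured variance (hypothetical values).
flat = ([5.010, 5.011, 5.010], [5.000, 5.000, 5.001])
# Gene with a large expression difference (hypothetical values).
changed = ([8.0, 8.5, 7.5], [5.0, 5.5, 4.5])

for s0 in (0.0, 0.5):
    print(s0, sam_d(*flat, s0), sam_d(*changed, s0))
```

With s_0 = 0 the nearly constant gene outranks the truly changed one because its denominator is tiny; with s_0 = 0.5 the order is reversed.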
More Scores:<br />
Wilcoxon Score (robust)<br />
pAUC Score (separation)<br />
paired t-Score (paired data)<br />
F-Score (more than 2 conditions)<br />
Correlation to a reference gene<br />
etc.
Different scores give different<br />
rankings<br />
Disease 1 vs. Disease 2<br />
(Golub et al.)
Which Score is the best<br />
one?<br />
That depends on your<br />
problem ...
Next Question:<br />
Ok, I chose a score and found a set of<br />
candidate genes<br />
Can I trust the observed expression<br />
differences?<br />
Statistical Analysis
P-Values<br />
Everyone knows that the p-value must<br />
be below 0.05<br />
0.05 is a holy number both in medicine<br />
and biology<br />
... what else should you know about p-values?
Rumors<br />
If the gene is not differentially<br />
expressed, the p-value is high<br />
If the gene is differentially expressed,<br />
the p-value is low<br />
Both these statements are wrong!
Reminder: Type I and Type II ERROR<br />
H0<br />
Null Hypothesis:<br />
Gene NOT<br />
differential<br />
H1<br />
Alternative<br />
Hypothesis:<br />
NOT H0<br />
Positive: rejected H0 (differential gene)<br />
Negative: accepted H0
Reminder: Type I and Type II ERROR<br />
H0 H1
The basic idea behind p-values:<br />
We observe a score S = 1.27<br />
Can this be just a random fluctuation?<br />
Assume: It is a random fluctuation<br />
= The gene is not differentially<br />
expressed<br />
= The null hypothesis holds<br />
Theory gives us the distribution of the<br />
score under this assumption<br />
P-value: Probability that a random<br />
score is at least as large as S = 1.27<br />
in absolute value (two-sided test)
Permutations and empirical p-values
If a gene is not differentially expressed:<br />
The p-value is a uniformly distributed random number between 0 and<br />
1!<br />
It is unlikely that such a number is<br />
below 0.05 (5% probability)
If a gene is differentially expressed:<br />
The p-value has no meaning, since it was<br />
computed under the assumption that the gene is<br />
not differentially expressed.<br />
We hope that it is small since the score<br />
is high, but there is absolutely no<br />
theoretical support for this
Testing only one gene:<br />
If the gene is not differentially<br />
expressed, a small p-value is unlikely,<br />
hence we should be surprised by this<br />
observation.<br />
If we make it a rule that we discard the gene if<br />
the p-value is above 0.05, it is unlikely that a<br />
random score will pass this filter
Multiple testing with only non-induced genes<br />
1 gene<br />
10 genes<br />
30,000 genes
The Multiple Testing Problem<br />
P-values are random numbers between 0 and 1. For a single<br />
such number it is unlikely to fall into the small interval [0, 0.05], but if we<br />
have 30,000 such numbers, many will be in there.
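The size of the problem follows from simple arithmetic. The sketch below assumes that all m genes are non-induced and that the tests are independent, which is an idealization; the function names are hypothetical.

```python
def expected_false_positives(m, alpha=0.05):
    """Expected number of p-values below alpha when all m nulls are true."""
    return m * alpha

def prob_at_least_one(m, alpha=0.05):
    """Probability of at least one p-value below alpha among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 30000):
    print(m, expected_false_positives(m), prob_at_least_one(m))
```

For a single gene a p-value below 0.05 is a surprise; for 30,000 non-induced genes about 1,500 such p-values are expected, so one of them is no surprise at all.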
We test m hypotheses<br />
[Table: rows Accepted / Rejected; columns H0 TRUE / H0 FALSE (H1); sets: true hypotheses, rejected hypotheses]<br />
H0 TRUE and rejected: Error = false positive<br />
H0 FALSE (H1) and accepted: Error = false negative
FWER = Family-wise error rate:<br />
Probability of at least one Type I error (False Positive) among<br />
the genes called significant (rejected H0)
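FDR = False Discovery Rate<br />
Expected proportion of Type I errors (False Positives) among the rejected hypotheses<br />
\[ \mathrm{FDR} = E(Q) \quad \text{with} \quad Q = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0 \end{cases} \]
where V is the number of false positives and R the number of rejected hypotheses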
Controlling the family-wise error rate<br />
(FWER)<br />
If we want to avoid random numbers in the interval [0, 0.05],<br />
we need to make it smaller. The more numbers, the<br />
smaller. For 30,000 numbers, very small.<br />
This strategy is called: Controlling the family-wise<br />
error rate
How to control the FWER?<br />
Note that adjusting the interval border can also be<br />
done by adjusting the p-values and leaving the cutoff<br />
at 0.05.<br />
There are many ways to adjust p-values for multiple<br />
testing:<br />
Bonferroni:<br />
Better: Westfall and Young
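The Bonferroni adjustment multiplies each p-value by the number of tests m and caps the result at 1, which is equivalent to lowering the cutoff from 0.05 to 0.05/m. A minimal sketch with a hypothetical function name and example values; the Westfall and Young procedure, which is resampling-based, is not shown.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: min(1, m * p) for m tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
# A gene is called significant if its adjusted p-value is below 0.05.
print([p < 0.05 for p in adjusted])
```

With 30,000 genes a raw p-value has to be below roughly 0.0000017 to survive, which shows how strict this control is.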
In microarray studies controlling the<br />
FWER is not a good idea ... It is too<br />
conservative.<br />
A different type of error measure<br />
became more popular<br />
The False Discovery Rate<br />
What is the idea?
The FDR<br />
• Score genes and rank them<br />
• Choose a cutoff<br />
• Loosely speaking: The FDR is the<br />
best guess for the fraction of<br />
false positives among the genes that score<br />
above the cutoff
The confusing literature:<br />
There are many different definitions of the false<br />
discovery rate in the literature:<br />
Original: Benjamini-Hochberg<br />
Positive FDR<br />
Conditional FDR<br />
Local FDR<br />
There is also a fundamental difference between<br />
controlling and estimating an FDR
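As an example of controlling (rather than estimating) the FDR, here is a sketch of the original Benjamini-Hochberg step-up procedure written as adjusted p-values. The function name and example values are hypothetical, and the other FDR variants listed above are not covered.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted

bh = benjamini_hochberg([0.001, 0.5, 0.02, 0.9])
print(bh)
# Genes with adjusted value below 0.05 form a list with FDR controlled at 5%.
print([q < 0.05 for q in bh])
```

Unlike the Bonferroni factor m, the multiplier m/rank shrinks for genes further down the ranked list, which makes the procedure less conservative.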
In microarray analysis it became<br />
popular to use estimated FDRs<br />
Differences to p-values:<br />
The FDR refers to a list of genes. The p-value<br />
refers to a single gene.<br />
The p-value is based on the assumption that the<br />
gene is not differentially expressed; the FDR<br />
makes no such assumption.<br />
P-values need to be corrected for multiplicity;<br />
FDRs do not!
Another difference in concept:<br />
If a 4x change has a small p-value, this means that 4x change<br />
is too high to be random fluctuation<br />
Conclusion: 4x change is significant<br />
If a list of 150 genes with 4x change or more has a small<br />
estimated FDR this means that we have more genes on this<br />
level than would be expected by chance.<br />
Conclusion: 4x change can be noise, but 150 genes on that<br />
level are too many to be explained just by random fluctuation.<br />
In FWER Analysis the fold change 4x is significant, in FDR<br />
Analysis it is the number 150 that is significant.
Histograms of the p-values of all<br />
genes on the array
FWER: Vertical cutoff<br />
FDR: Horizontal cutoff