QC and normalisation of microarray experiments - BiGCaT
QC and normalisation<br />
of microarray experiments<br />
Lars Eijssen<br />
POT course toxicogenomics 22-02-2010
Contents<br />
• Background on quality control (QC)<br />
– Examples based on two channel data sets<br />
• Background on further data preprocessing<br />
• Application of a Genepattern QC module for Affymetrix data<br />
– settings<br />
– illustration on data sets<br />
– interpretation of outcome<br />
• Preprocessing Affymetrix data using a Genepattern module<br />
– settings<br />
• Introduction to the afternoon session and the data set to be used<br />
2
Quality Control
One and two channel arrays<br />
• In this course, we will focus on gene expression arrays<br />
• Specific details of QC and normalisation methods are<br />
different for one and two channel arrays<br />
• The principles, however, are similar<br />
• I will cover the principles first and specific application<br />
to Affymetrix data (using a predefined workflow) later<br />
4
QC<br />
• Quality control is an important part of data processing<br />
• It operates at two levels: array level and spot level<br />
• Array level: discard aberrant arrays<br />
– Low quality material<br />
– Failed hybridisation<br />
– Overall intensity too low or too high<br />
• Spot level: discard aberrant spots (or regions of spots)<br />
– Stains on the array<br />
– Specific spots not fulfilling quality criteria<br />
5
Examples<br />
• To illustrate some QC aspects to consider, I will show<br />
examples from the output of our in-house developed<br />
workflow ArrayQC (taken from different data sets)<br />
– This workflow has been developed to handle one and two<br />
channel array experiments, with predefined settings for<br />
Agilent/Genepix data<br />
– Later I will show QC results for Affymetrix chips in more<br />
detail<br />
6
Global array QC<br />
Foreground intensity<br />
Background intensity<br />
7
Spot specific QC<br />
Red background intensity<br />
Green background intensity<br />
8
Red number of pixels<br />
Green number of pixels<br />
9
Red foreground intensity<br />
Green foreground intensity<br />
10
Spot selection<br />
• Based on the aspects shown in the previous images, one<br />
can determine which spots to include in and exclude from<br />
further analysis<br />
• Example criteria<br />
– Spot size (number of pixels) above threshold<br />
– Sufficient spot uniformity<br />
– No saturation<br />
– Above intensity threshold<br />
• For two channel arrays this may involve considering<br />
both channels together; this may be complicated<br />
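The example criteria above can be sketched as a simple per-spot filter. This is only an illustration: the `Spot` fields, the threshold values and the use of the coefficient of variation as a uniformity measure are assumptions made for this example, not the settings of ArrayQC or of any image analysis package.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    name: str
    n_pixels: int          # spot size in pixels
    fg_mean: float         # mean foreground intensity
    fg_sd: float           # pixel-to-pixel standard deviation within the spot
    pct_saturated: float   # percentage of saturated pixels


# Hypothetical thresholds; sensible values depend on scanner and platform.
MIN_PIXELS = 50
MAX_CV = 0.5               # uniformity: coefficient of variation of the pixels
MAX_PCT_SATURATED = 0.0
MIN_INTENSITY = 100.0


def passes_qc(spot):
    """Return True if the spot meets all four example criteria."""
    cv = spot.fg_sd / spot.fg_mean if spot.fg_mean > 0 else float("inf")
    return (
        spot.n_pixels >= MIN_PIXELS
        and cv <= MAX_CV
        and spot.pct_saturated <= MAX_PCT_SATURATED
        and spot.fg_mean >= MIN_INTENSITY
    )


spots = [
    Spot("geneA", 80, 1500.0, 300.0, 0.0),    # meets all criteria
    Spot("geneB", 20, 1500.0, 300.0, 0.0),    # too few pixels
    Spot("geneC", 80, 60.0, 10.0, 0.0),       # below intensity threshold
    Spot("geneD", 80, 65000.0, 500.0, 12.0),  # saturated pixels
    Spot("geneE", 80, 1000.0, 900.0, 0.0),    # not uniform
]
kept = [s.name for s in spots if passes_qc(s)]
print(kept)
```

For two channel arrays one would typically evaluate such a function per channel and then decide how to combine the two outcomes, which is where the complication mentioned above comes in.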
11
Filtering spots (e.g. low intensity)<br />
• Plots of the difference between channels versus intensity, before and after filtering<br />
• Note that low intensity spots are much more subject to noise!<br />
12
Further data preprocessing
Data analysis overview<br />
• Samples: untreated (control) versus exposed to compound<br />
• Workflow: microarray scans → image analysis → raw data → quality control → further preprocessing (background correction, normalisation) → normalised data → statistical analysis → list of regulated genes → pattern analysis, pathway analysis, literature data → results<br />
Slide based on a slide from J. Pennings, RIVM, NL<br />
14
Background correction<br />
• Background signal needs to be corrected for<br />
– For example, signal from remaining non-hybridised mRNA<br />
• Three types of background<br />
– Overall slide background<br />
• Can be corrected for by subtracting the mean background, or by<br />
subtracting the mean of empty spots<br />
– Local slide background<br />
• Same as previous, but per slide region<br />
– Specific background<br />
• For example cross-hybridisation; can be corrected for using mismatch<br />
probes (in the case of Affymetrix chips)<br />
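As a minimal sketch of the first two background types, the functions below subtract either one mean background for the whole slide or a mean background per slide region. The function names, the region labels and the choice to floor corrected values at zero are assumptions made for this illustration; real pipelines often use more refined corrections.

```python
from statistics import mean


def subtract_overall_background(foreground, background):
    """Subtract the mean background of the whole slide from every spot."""
    bg = mean(background)
    # Flooring at zero is an illustrative choice to avoid negative intensities.
    return [max(f - bg, 0.0) for f in foreground]


def subtract_regional_background(foreground, background, regions):
    """Subtract the mean background of the region each spot belongs to."""
    per_region = {}
    for b, r in zip(background, regions):
        per_region.setdefault(r, []).append(b)
    region_bg = {r: mean(values) for r, values in per_region.items()}
    return [max(f - region_bg[r], 0.0) for f, r in zip(foreground, regions)]


foreground = [500.0, 520.0, 900.0, 880.0]
background = [100.0, 120.0, 300.0, 280.0]
regions = ["A", "A", "B", "B"]   # hypothetical slide regions

overall = subtract_overall_background(foreground, background)
regional = subtract_regional_background(foreground, background, regions)
print(overall)
print(regional)
```

In this toy example region B has a much higher background than region A, so the regional correction removes more signal from the spots in B than the overall correction does.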
15
Normalisation<br />
• After discarding bad arrays and spots, remaining influences due<br />
to any differences related to the procedure followed need to be<br />
corrected for as much as possible<br />
• Between-slide normalisation: correct for experimental differences<br />
between slides<br />
– E.g. one slide may have an overall higher signal due to differences in<br />
hybridisation<br />
• Within-slide normalisation: correct for within-slide variations<br />
– By applying normalisation per region, per spot group etc.<br />
• For two-channel arrays: between-channel normalisation<br />
16
Normalisation procedures<br />
• Scale intensities to the same mean/median value for all slides<br />
– Only valid if just a small fraction of genes is expected to change (not<br />
always the case!)<br />
• Normalise based on values for housekeeping genes…<br />
– Genes that are assumed to have the same expression in all samples<br />
…or spikes<br />
– Unique transcripts added in known concentrations<br />
• Normalise dependent on intensity or on most similar spots<br />
(LOESS normalisation)<br />
• Force the distribution of intensities to be the same for all slides<br />
– Quantile normalisation<br />
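Two of the procedures listed above, scaling to a common median and quantile normalisation, are simple enough to sketch directly. The code below is a simplified illustration with hypothetical function names: it assumes all arrays have the same number of spots, and it breaks ties by position, whereas established implementations handle ties and missing values more carefully.

```python
from statistics import mean, median


def median_scale(arrays):
    """Scale each array so that all arrays share the same median."""
    medians = [median(a) for a in arrays]
    target = mean(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalise(arrays):
    """Give every array the same intensity distribution.

    The value of rank i in each array is replaced by the mean of the
    rank-i values across all arrays.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean([s[i] for s in sorted_arrays]) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        result.append(out)
    return result


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
scaled = median_scale(arrays)
normalised = quantile_normalise(arrays)
print([median(a) for a in scaled])
print(normalised)
```

After quantile normalisation both toy arrays contain exactly the same set of values (1.5, 3.5, 5.5); only the assignment of those values to spots differs.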
17
Log-transformation<br />
• Generally, for one channel arrays the intensities are first log2-transformed.<br />
• After log transformation and normalisation one can compute the difference<br />
in means (‘logFC’) between experimental groups.<br />
– The difference is much easier to handle statistically<br />
– Also, the distribution of the logged intensities is closer to ‘normal’ than on<br />
the original scale<br />
– 2^logFC corresponds to the ratio on the original scale<br />
• The same procedure is used to compute Cy5/Cy3 ratios for two<br />
channel arrays; the ratio is computed as 2^(log2(Cy5) - log2(Cy3))<br />
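A small worked example of the logFC computation, using made-up intensities for a single gene (powers of two are chosen only to keep the arithmetic easy to follow):

```python
import math
from statistics import mean

# Hypothetical intensities of one gene in two experimental groups.
control = [256.0, 512.0, 1024.0]
exposed = [1024.0, 2048.0, 4096.0]

log_control = [math.log2(x) for x in control]   # 8, 9, 10
log_exposed = [math.log2(x) for x in exposed]   # 10, 11, 12

# logFC: difference in mean log2 intensity between the groups.
log_fc = mean(log_exposed) - mean(log_control)
fold_change = 2.0 ** log_fc                     # ratio on the original scale

# Two channel case for a single spot: log ratio of Cy5 over Cy3.
cy5, cy3 = 1024.0, 256.0
log_ratio = math.log2(cy5) - math.log2(cy3)
ratio = 2.0 ** log_ratio                        # equals cy5 / cy3

print(log_fc, fold_change, log_ratio, ratio)
```

Note that the logFC is a difference of means of logs, so 2^logFC is the ratio of the geometric means of the two groups rather than of their arithmetic means.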
18
Why log two-color ratio data? (2 channel example)<br />
• This ‘spreads out’ the data<br />
• And offers symmetry<br />
• And makes subsequent statistical analysis easier<br />
• ‘raw’ ratio scale: ½, 1, 2<br />
• log ratio scale: log2 of ½, 1, 2 (that is, -1, 0, 1)<br />
19
Red and green foreground intensity<br />
20
LogFC values after LOESS normalisation<br />
For two channel arrays, it is relevant to check whether effects cancel out between channels<br />
21
Preprocessing Affymetrix data<br />
• For Affymetrix arrays there are multiple spots (probes) for one<br />
single transcript (together forming a probeset)<br />
– these must be summarised into one value<br />
• Well-known methods for preprocessing Affymetrix<br />
chips<br />
– MAS5.0 (uses mismatch intensities)<br />
– RMA (Robust Multiarray Average, does not use mismatches)<br />
• Includes both background correction and (quantile) normalisation<br />
– GCRMA (like RMA, but also takes into account GC content)<br />
– dChip (model-based)<br />
– For exonST and geneST arrays, of these methods only RMA can be used (another<br />
option is PLIER, which is based on an error model)<br />
22
A QC module in Genepattern
The Affymetrix QC workflow<br />
• For QC, you will be using the Genepattern module<br />
MADMAXArrayQualityAnalysis,<br />
which connects to a dedicated R server in Wageningen<br />
– The module was previously called NuGOArrayQualityAnalysis<br />
• I will first explain the options you can select and how<br />
to interpret the outcome<br />
• The images are taken from data sets other than the one<br />
you will be using<br />
24
QC: what do the options mean?<br />
1. Provide a zip file that contains all CEL files<br />
2. Provide email address<br />
3. Select a normalisation method<br />
• RMA, gcRMA, VSN, or MAS5<br />
4. Select a distance measure<br />
• Pearson, Spearman, Euclidean<br />
5. Select a clustering method<br />
• McQuitty, Ward, average, centroid, complete, median, single<br />
6. Select the number of images per page<br />
7. All further questions are yes/no questions to decide which output to receive (all ‘yes’ by default)<br />
25
Quality Control Overview plot<br />
Default criteria:<br />
• Percentage present within 10%<br />
• Background within 20 units<br />
• Scaling factors within 3-fold from the average<br />
• GAPDH 3’/5’ ≤ 1.25<br />
• Actin 3’/5’ ≤ 3<br />
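The default criteria can be read as simple checks on a handful of per-array summary values. The sketch below shows one possible reading, not the module's actual code: it assumes that "within 10%" and "within 20 units" refer to the spread (maximum minus minimum) across the arrays in the experiment, and the function name and example numbers are made up.

```python
from statistics import mean


def check_default_criteria(percent_present, background, scale_factors,
                           gapdh_ratio, actin_ratio):
    """Return True/False per criterion for a set of arrays.

    Every argument is a list with one value per array. The interpretation
    of the first two criteria as a spread across arrays is an assumption.
    """
    avg_sf = mean(scale_factors)
    return {
        "percent_present": max(percent_present) - min(percent_present) <= 10.0,
        "background": max(background) - min(background) <= 20.0,
        "scale_factors": all(avg_sf / 3.0 <= s <= avg_sf * 3.0
                             for s in scale_factors),
        "gapdh_3_5": all(r <= 1.25 for r in gapdh_ratio),
        "actin_3_5": all(r <= 3.0 for r in actin_ratio),
    }


# Made-up values for three arrays.
report = check_default_criteria(
    percent_present=[45.0, 48.0, 43.0],
    background=[60.0, 65.0, 95.0],      # third array stands out
    scale_factors=[1.0, 1.2, 0.8],
    gapdh_ratio=[1.0, 1.1, 1.2],
    actin_ratio=[1.5, 2.0, 3.5],        # third array exceeds 3
)
print(report)
```

A failed check is a reason to look at the array more closely in the other plots, not necessarily a reason to discard it outright.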
26
Image plot<br />
27
FitPLM weights image<br />
28
FitPLM residuals image<br />
29
Density plot<br />
30
Box-and-whisker plot<br />
31
Relative Log Expression plot<br />
32
Normalised Unscaled Standard Error plot<br />
33
RNA digestion plot<br />
34
MA plot<br />
Axes: M = logFC, A = average<br />
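The two axes of an MA plot can be computed from any pair of intensity vectors, for example the two channels of one array, or one array against a reference. A minimal sketch with made-up intensities and a hypothetical function name:

```python
import math


def ma_values(x, y):
    """Return (M, A) for paired intensities x and y.

    M is the log2 fold change (logFC) and A the average log2 intensity.
    """
    m = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    a = [(math.log2(a_) + math.log2(b_)) / 2.0 for a_, b_ in zip(x, y)]
    return m, a


# Made-up intensities for three spots.
sample = [1024.0, 256.0, 64.0]
reference = [256.0, 256.0, 256.0]

M, A = ma_values(sample, reference)
print(M)
print(A)
```

Plotting M against A shows whether the fold change depends on intensity, which is the kind of trend that intensity-dependent (LOESS) normalisation aims to remove.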
35
MA plot<br />
36
Correlation plot/heatmap<br />
37
Clustering dendrogram<br />
38
A data preprocessing module in Genepattern
Data preprocessing<br />
• For data preprocessing we will again use a Genepattern<br />
module, NuGOExpressionFileCreator<br />
• This module performs:<br />
– background correction<br />
– normalisation<br />
– summarisation (combining all probe signals that<br />
belong to the same probe set into one value for that probe set)<br />
• The outcome is a table of preprocessed data that can<br />
be used for further (statistical) analysis<br />
– this will be discussed tomorrow<br />
40
NuGOExpressionFileCreator<br />
• Provide a zip file that contains all CEL files<br />
• Select the CDF file to be used (next slide)<br />
• Select a normalisation method<br />
– RMA, gcRMA, MAS5, or dChip<br />
• Do you want to perform quantile normalisation (RMA, gcRMA)<br />
• Do you want to perform background correction (RMA)<br />
• Some more options for normalisation<br />
• Some fields to select options for input/output files<br />
41
Intermezzo: custom CDF files<br />
• Affymetrix provides annotation files for their probesets (CDF files)<br />
• When these get outdated, one can of course update the probeset<br />
annotations<br />
• But it may be even better to disassemble these sets into the separate<br />
probes, reannotate the probes, and assemble them into new (different)<br />
probesets<br />
• This is exactly what custom CDF files do<br />
• Note that reassembled probesets no longer contain the same number<br />
of probes<br />
42
Intermezzo: BrainArray CDF files [1]<br />
• Reannotation based on one of several genome databases<br />
• IDs are created as follows: the ID of the gene the probeset refers<br />
to, followed by ‘_at’ to resemble an Affymetrix ID<br />
• When using these annotations in other tools, you have to remove<br />
the ‘_at’ suffix in order to get recognisable IDs<br />
• Note that when using Entrez Gene this means that the ID is<br />
composed of a number (Entrez Gene ID) followed by ‘_at’, and<br />
as such looks exactly like a normal Affymetrix ID, but IT IS<br />
NOT<br />
[1] http://arrayanalysis.mbni.med.umich.edu/arrayanalysis.html<br />
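Removing the ‘_at’ suffix before passing IDs to other tools is a one-line string operation. The sketch below uses a hypothetical helper name and made-up example IDs.

```python
def strip_at_suffix(probeset_id):
    """Remove a trailing '_at' from a custom CDF probeset ID."""
    suffix = "_at"
    if probeset_id.endswith(suffix):
        return probeset_id[:-len(suffix)]
    return probeset_id


# Made-up IDs in the style of a custom CDF based on Entrez Gene or Ensembl.
custom_ids = ["24379_at", "ENSRNOG00000000001_at", "25283"]
gene_ids = [strip_at_suffix(i) for i in custom_ids]
print(gene_ids)
```

This should only be applied to IDs that come from a custom CDF: stripping the suffix from an original Affymetrix probeset ID would leave a number that looks like a gene ID but is not one.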
43
The afternoon session and the data set
The afternoon session<br />
• In the afternoon session, you will be performing QC<br />
and further data preprocessing yourself<br />
• You will follow a stepwise guide available online<br />
– http://www.bigcat.unimaas.nl/wiki/index.php/PET_course<br />
• You will use an Affymetrix data set and make use of the<br />
Genepattern QC and data preprocessing workflows<br />
discussed<br />
45
Short description of the data set (1)<br />
• Microarray experiments have to be uploaded to online<br />
repositories such as Gene Expression Omnibus (GEO,<br />
NCBI) or ArrayExpress (AE, EBI) upon publication<br />
• We will use a published dataset available from AE<br />
46
Short description of the data set (2)<br />
• The course Wiki contains instructions on how to proceed<br />
• This data set is taken from Ezendam et al., 2004<br />
• Hexachlorobenzene (HCB) is a persistent pollutant that is toxic<br />
to the liver, neurons and the reproductive and immune systems<br />
• In this study, Brown Norway rats were fed a diet supplemented<br />
with 0, 150, or 450 mg/kg HCB<br />
• Spleen, mesenteric lymph nodes (MLN), thymus, blood, liver,<br />
and kidney were analysed using the Affymetrix rat RG_U34A<br />
GeneChip microarray<br />
– 13-17 arrays per tissue, max 6 per concentration<br />
• We will be primarily considering the liver data<br />
47