01.11.2014 Views

QC and normalisation of microarray experiments - BiGCaT


QC and normalisation of microarray experiments

Lars Eijssen

POT course toxicogenomics 22-02-2010


Contents

• Background on quality control (QC)
  – Examples based on two channel data sets
• Background on further data preprocessing
• Application of a Genepattern QC module for Affymetrix data
  – settings
  – illustration on data sets
  – interpretation of outcome
• Preprocessing Affymetrix data using a Genepattern module
  – settings
• Introduction to the afternoon session and the data set to be used


Quality Control


One and two channel arrays

• In this course, we will focus on gene expression arrays
• Specific details of QC and normalisation methods are different for one and two channel arrays
• The principles, however, are similar
• I will cover the principles first and specific application to Affymetrix data (using a predefined workflow) later


QC

• Quality control is an important part of data processing
• It includes two levels: array level and spot level
• Array level: discard aberrant arrays
  – Low quality material
  – Failed hybridisation
  – Too low or too high overall intensity
• Spot level: discard spots (or regions of spots) that are aberrant
  – Stains on the array
  – Specific spots not fulfilling quality criteria


Examples

• To illustrate some QC aspects to consider, I will show examples from output of our in-house developed workflow ArrayQC (taken from different data sets)
  – This workflow has been developed to handle one and two channel array experiments, with predefined settings for Agilent/Genepix data
  – Later I will show QC results for Affymetrix chips in more detail


Global array QC

[Figures: foreground intensity; background intensity]


Spot specific QC

[Figures: red and green background intensity; red and green number of pixels; red and green foreground intensity]


Spot selection

• Based on aspects as shown in the previous images, one can determine which spots to include in and exclude from further analysis
• Example criteria
  – Spot size (number of pixels) above threshold
  – Sufficient spot uniformity
  – No saturation
  – Above intensity threshold
• For two channel arrays this may involve considering both channels together; this may be complicated
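
The example criteria above can be sketched as a simple filter. This is only an illustration: the `Spot` fields and all four thresholds are assumptions made for this sketch, not values used by ArrayQC or taught in the course, and requiring both channels to pass is just one possible rule for two channel arrays.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    name: str
    n_pixels: int       # spot size in pixels
    foreground: float   # mean foreground intensity
    pixel_sd: float     # standard deviation of the pixel intensities in the spot


# Illustrative thresholds (assumptions, to be tuned per platform and scanner).
MIN_PIXELS = 50
MAX_CV = 0.5            # uniformity: pixel SD relative to the mean intensity
SATURATION = 65535.0    # maximum value of a 16-bit scanner
MIN_INTENSITY = 200.0


def passes_qc(spot: Spot) -> bool:
    """Apply the four example criteria to one channel of one spot."""
    if spot.n_pixels < MIN_PIXELS:
        return False                      # spot too small
    if spot.foreground <= 0 or spot.pixel_sd / spot.foreground > MAX_CV:
        return False                      # spot not uniform enough
    if spot.foreground >= SATURATION:
        return False                      # saturated
    return spot.foreground >= MIN_INTENSITY


def keep_two_channel(red: Spot, green: Spot) -> bool:
    """One possible two channel rule: keep the spot only if both channels pass."""
    return passes_qc(red) and passes_qc(green)
```

A stricter or more lenient two channel rule (for example keeping a spot when only one channel is above the intensity threshold) would change `keep_two_channel` only.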


Filtering spots (e.g. low intensity)

[Figures: difference between channels plotted against intensity, before filtering and after filtering]

• Note that low intensity spots are much more subject to noise!


Further data preprocessing


Data analysis overview

Samples: untreated (control) and exposed to compound
→ Microarray scans
→ Image analysis
→ Raw data
→ Quality control
→ Further preprocessing
   • Background correction
   • Normalisation
→ Normalised data
→ Statistical analysis
→ List of regulated genes
→ Pattern analysis, pathway analysis, literature data
→ Results

Slide based on a slide from J. Pennings, RIVM, NL


Background correction

• Background signal needs to be corrected for
  – For example signal of remaining non-hybridised mRNA
• Three types of background
  – Overall slide background
    • Can be corrected for by subtracting mean background, or by subtracting mean of empty spots
  – Local slide background
    • Same as previous, but per slide region
  – Specific background
    • For example cross-hybridization, can be corrected for by mismatch probes (in case of Affymetrix chips)
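
The first two background types can be illustrated with a minimal sketch. The function names, the region labels and the intensities are made up for this example; real image analysis software usually estimates the background per spot and may handle negative corrected values differently.

```python
from statistics import mean


def subtract_global_background(foreground, background):
    """Overall slide background: subtract the mean background of the whole slide."""
    bg = mean(background)
    return [f - bg for f in foreground]


def subtract_local_background(foreground, background, region):
    """Local slide background: subtract the mean background per slide region."""
    by_region = {}
    for b, r in zip(background, region):
        by_region.setdefault(r, []).append(b)
    region_mean = {r: mean(values) for r, values in by_region.items()}
    return [f - region_mean[r] for f, r in zip(foreground, region)]


# Four spots in two slide regions; region B has a clearly higher background.
fg = [500.0, 620.0, 900.0, 1100.0]
bg = [100.0, 120.0, 300.0, 340.0]
blocks = ["A", "A", "B", "B"]

print(subtract_global_background(fg, bg))         # [285.0, 405.0, 685.0, 885.0]
print(subtract_local_background(fg, bg, blocks))  # [390.0, 510.0, 580.0, 780.0]
```

The global correction leaves the spots in region B too bright relative to region A, which is what the per-region correction addresses.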


Normalisation

• After discarding bad arrays and spots, remaining influences due to any differences related to the procedure followed need to be corrected for as much as possible
• Between-slide normalisation: correct for experimental differences between slides
  – E.g. one may have an overall higher signal due to differences in hybridisation
• Within-slide normalisation: correct for within slide variations
  – By applying normalisation per region, per spot group etc.
• For two-channel arrays: between-channel normalisation


Normalisation procedures

• Scale intensities to the same mean/median value for all slides
  – Only if just a small fraction of genes is expected to be changed (not always the case!)
• Normalise based on values for housekeeping genes (genes that are assumed to have the same expression in all samples) or spikes (unique transcripts added in known concentrations)
• Normalise dependent on intensity or on most similar spots (LOESS normalisation)
• Force distribution of intensities to be the same for all slides
  – Quantile normalisation
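
Two of these procedures, scaling to a common median and quantile normalisation, are short enough to sketch. This is a simplified illustration, not the implementation used by any particular package: the function names are made up, ties are not averaged, and all arrays are assumed to have the same number of values.

```python
from statistics import mean, median


def scale_to_common_median(arrays):
    """Scale each array so that all arrays share the same median intensity."""
    target = mean(median(a) for a in arrays)
    return [[x * target / median(a) for x in a] for a in arrays]


def quantile_normalise(arrays):
    """Force the distribution of intensities to be the same for all arrays."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from low to high
        normalised = [0.0] * n
        for rank, i in enumerate(order):
            normalised[i] = reference[rank]           # replace value by reference quantile
        result.append(normalised)
    return result


print(scale_to_common_median([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]))
# [[1.5, 3.0, 4.5], [1.5, 3.0, 4.5]]
print(quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After quantile normalisation every array contains exactly the same set of values; only the assignment of values to spots differs.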


Log-transformation

• Generally, for one channel arrays the intensities are first log2-transformed.
• After logging and normalisation one can compute the difference in means (‘logFC’) between several experimental groups.
  – A difference is much easier to handle statistically than a ratio
  – Also the distribution of the logged intensities is more ‘normal’ than on the original scale
  – 2^logFC corresponds to the ratio on the original scale
• The same procedure is used to compute Cy5/Cy3 ratios for two channel arrays; the ratio is computed as 2^(log2(Cy5) - log2(Cy3))
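
A small numeric sketch of these computations, with made-up intensities and hypothetical function names. Note that 2^logFC obtained from means of logged values is the ratio of the geometric means of the two groups.

```python
from math import log2
from statistics import mean


def log_fc(treated, control):
    """Difference in mean log2 intensity between two experimental groups."""
    return mean(log2(x) for x in treated) - mean(log2(x) for x in control)


def log_ratio(cy5, cy3):
    """log2(Cy5/Cy3) for one spot, computed as a difference of logs."""
    return log2(cy5) - log2(cy3)


# One channel example: the treated group is four times higher on the original scale.
fc = log_fc([400.0, 1600.0], [100.0, 400.0])
print(round(fc, 6), round(2 ** fc, 6))

# Two channel example: swapping the channels only flips the sign.
print(log_ratio(800.0, 200.0), log_ratio(200.0, 800.0))
```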


Why log two-color ratio data? (2 channel example)

• This ‘spreads out’ the data
• And offers symmetry
• And makes subsequent statistical analysis easier

• ‘raw’ ratio: ½, 1, 2
• log2 ratio: log2 of ½, 1, 2, which gives -1, 0, 1


Red and green foreground intensity


LogFC values after LOESS normalisation

For two channel arrays, it is relevant to check whether effects cancel out between channels


Preprocessing Affymetrix data

• For Affymetrix we have more spots (probes) for one single transcript (probeset)
  – these must be summarised into one value
• Well-known methods for preprocessing Affymetrix chips
  – MAS5.0 (uses mismatch intensities)
  – RMA (Robust Multiarray Average, does not use mismatches)
    • Includes both background correction and (quantile) normalisation
  – GCRMA (like RMA, but also takes into account GC content)
  – dChip (model-based)
  – For exon ST and gene ST arrays, only RMA can be used (another option is PLIER, error-model)
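
To make the summarisation step concrete: RMA summarises the log2 probe intensities of a probeset with a median polish, which fits a probe effect and an array effect and is robust to single outlying probes. The sketch below is a simplified illustration with a hypothetical function name and a fixed number of sweeps rather than a convergence check; it is not the Bioconductor implementation.

```python
from statistics import median


def median_polish_summary(rows, n_iter=10):
    """Summarise one probeset.

    rows: probes x arrays matrix of log2 intensities.
    Returns one value per array (overall level plus array effect).
    """
    resid = [list(r) for r in rows]
    n_probes, n_arrays = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_probes   # probe effects
    col_eff = [0.0] * n_arrays   # array effects
    for _ in range(n_iter):
        # Sweep the median of each probe (row) out of the residuals.
        for i in range(n_probes):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the median of each array (column) out of the residuals.
        for j in range(n_arrays):
            m = median(resid[i][j] for i in range(n_probes))
            col_eff[j] += m
            for i in range(n_probes):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return [overall + c for c in col_eff]


# Three probes on three arrays; the second probe has one outlying value (20.0).
probeset = [[8.0, 9.0, 10.0],
            [9.0, 10.0, 20.0],
            [7.0, 8.0, 9.0]]
print(median_polish_summary(probeset))  # [8.0, 9.0, 10.0]
```

The outlying value ends up in the residuals and does not pull the summary for the third array upwards, which a plain mean over probes would do.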


A QC module in Genepattern


The Affymetrix QC workflow

• For QC, you will be using the Genepattern module MADMAXArrayQualityAnalysis, which connects to a dedicated R server in Wageningen
  – The module was called NuGOArrayQualityAnalysis before
• I will first explain the options you can select, and how to interpret the outcome
• The images are taken from data sets other than the one you will be using


QC: what do the options mean?

1. Provide a zip file that contains all CEL files
2. Provide email address
3. Select a normalisation method
   • RMA, gcRMA, VSN, or MAS5
4. Select a distance measure
   • Pearson, Spearman, Euclidean
5. Select a clustering method
   • McQuitty, Ward, average, centroid, complete, median, single
6. Select the number of images per page
7. All further questions are yes/no questions to decide which output to receive (all ‘yes’ by default)
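
The choice of distance measure in option 4 matters for the clustering. A minimal sketch of two of the measures is given below; converting a correlation r into a distance as 1 - r is an assumption of this sketch, and the module may use a different conversion.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between the expression profiles of two arrays."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pearson_distance(x, y):
    """Correlation-based distance (assumed here to be 1 - r)."""
    return 1 - pearson(x, y)


def euclidean_distance(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two arrays with the same profile, one shifted by 1 on the log scale.
a = [6.0, 8.0, 10.0]
b = [7.0, 9.0, 11.0]
print(round(pearson_distance(a, b), 6))    # 0.0
print(round(euclidean_distance(a, b), 6))  # 1.732051
```

A correlation-based distance ignores an overall shift between arrays, whereas the Euclidean distance does not, so the two can produce different dendrograms for the same data.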


Quality Control Overview plot

Default criteria:
• Percentage present within 10%
• Background within 20 units
• Scaling factors within 3-fold from the average
• GAPDH 3’/5’ ≤ 1.25
• Actin 3’/5’ ≤ 3
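
As a sketch of how these default criteria could be checked on a table of QC metrics: the dictionary keys are hypothetical, and reading ‘within’ as the spread between the lowest and highest array is an assumption of this example rather than the documented behaviour of the module.

```python
from statistics import mean


def qc_report(metrics):
    """metrics: dict mapping array name to a dict of QC values.

    Returns (issues for the data set as a whole, issues per array).
    """
    names = list(metrics)
    pp = [metrics[n]["percent_present"] for n in names]
    bg = [metrics[n]["background"] for n in names]
    sf_mean = mean(metrics[n]["scale_factor"] for n in names)

    dataset_issues = []
    if max(pp) - min(pp) > 10:
        dataset_issues.append("percentage present differs by more than 10%")
    if max(bg) - min(bg) > 20:
        dataset_issues.append("background differs by more than 20 units")

    array_issues = {}
    for n in names:
        m = metrics[n]
        issues = []
        fold = m["scale_factor"] / sf_mean
        if fold > 3 or fold < 1 / 3:
            issues.append("scaling factor")
        if m["gapdh_3_5"] > 1.25:
            issues.append("GAPDH 3'/5'")
        if m["actin_3_5"] > 3:
            issues.append("actin 3'/5'")
        array_issues[n] = issues
    return dataset_issues, array_issues


# Made-up metrics for two arrays; array "b" has a high actin 3'/5' ratio.
example = {
    "a": {"percent_present": 45.0, "background": 60.0, "scale_factor": 1.0,
          "gapdh_3_5": 1.0, "actin_3_5": 1.5},
    "b": {"percent_present": 48.0, "background": 65.0, "scale_factor": 1.2,
          "gapdh_3_5": 1.1, "actin_3_5": 3.5},
}
print(qc_report(example))
```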


Image plot


FitPLM weights image


FitPLM residuals image


Density plot


Box-and-whisker plot


Relative Log Expression plot


Normalised Unscaled Standard Error plot


RNA digestion plot


MA plot

M = logFC, A = average


Correlation plot/heatmap


Clustering dendrogram
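
Two of the plots above are based on simple quantities that can be sketched directly. The function names and numbers are made up for illustration; in an MA plot M is the log fold change and A the average log intensity, and the Relative Log Expression values are the log expression values minus the median of the same probeset over all arrays.

```python
from math import log2
from statistics import median


def ma_values(x, y):
    """M (log fold change) and A (average log intensity) per probe."""
    m = [log2(a) - log2(b) for a, b in zip(x, y)]
    a_avg = [(log2(a) + log2(b)) / 2 for a, b in zip(x, y)]
    return m, a_avg


def rle(log_expr):
    """log_expr: probesets x arrays matrix of log expression values."""
    return [[v - median(row) for v in row] for row in log_expr]


m, a_avg = ma_values([8.0, 4.0], [2.0, 4.0])
print(m, a_avg)                                    # [2.0, 0.0] [2.0, 2.0]
print(rle([[7.0, 8.0, 9.0], [5.0, 5.0, 6.5]]))     # [[-1.0, 0.0, 1.0], [0.0, 0.0, 1.5]]
```

For a good quality array the RLE values are expected to be centred around zero with a spread similar to that of the other arrays.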


A data preprocessing module in Genepattern


Data preprocessing

• For data preprocessing we will use a Genepattern module again, NuGOExpressionFileCreator
• This module performs:
  – background correction
  – normalisation
  – summarisation (the combination of all probe signals that belong to the same reporter into one value for that probe set)
• The outcome is a table of preprocessed data that can be used for further (statistical) analysis
  – this will be discussed tomorrow


NuGOExpressionFileCreator

• Provide a zip file that contains all CEL files
• Select the CDF file to be used (next slide)
• Select a normalisation method
  – RMA, gcRMA, MAS5, or dChip
• Do you want to perform quantile normalisation? (RMA, gcRMA)
• Do you want to perform background correction? (RMA)
• Some more options for normalisation
• Some fields to select options for input/output files


Intermezzo: custom CDF files

• Affymetrix provides annotation files for their probesets (CDF file)
• When these get outdated, one can of course update probeset annotations
• But it may be even better to disassemble these sets into the separate probes, reannotate probes, and assemble into new probesets (different ones)
• This is exactly what custom CDF files do
• Note that reassembled probesets do not contain the same number of probes anymore


Intermezzo: BrainArray CDF files [1]

• Reannotation based on one of several genome databases
• IDs are created as follows: the ID of the gene the probeset refers to, followed by ‘_at’ to resemble an Affymetrix ID
• When using these annotations in other tools, you have to remove the ‘_at’ additions, in order to get recognisable IDs
• Note that when using Entrez gene this means that the ID is composed of a number (Entrez gene ID) followed by ‘_at’, and as such looks exactly like a normal Affymetrix ID, but IT IS NOT

[1] http://arrayanalysis.mbni.med.umich.edu/arrayanalysis.html
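
Removing the ‘_at’ addition before passing IDs to other tools is a one-line operation. The function name and the example IDs below are made up for illustration.

```python
def strip_at_suffix(custom_cdf_id: str) -> str:
    """Remove a trailing '_at' so that the underlying gene ID remains."""
    if custom_cdf_id.endswith("_at"):
        return custom_cdf_id[:-3]
    return custom_cdf_id


# Hypothetical custom CDF identifiers (Entrez-style and Ensembl-style).
ids = ["24153_at", "ENSRNOG00000000001_at"]
print([strip_at_suffix(i) for i in ids])  # ['24153', 'ENSRNOG00000000001']
```

This should only be applied to IDs that come from a custom CDF; for original Affymetrix probeset IDs the ‘_at’ is part of the identifier itself.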


The afternoon session <strong>and</strong> the data set


The afternoon session

• In the afternoon session, you will be performing QC and further data preprocessing yourself
• You will follow a stepwise guide available online
  – http://www.bigcat.unimaas.nl/wiki/index.php/PET_course
• You will use an Affymetrix data set and make use of the Genepattern QC and data preprocessing workflows discussed


Short description of the data set (1)

• Microarray experiments have to be uploaded to online repositories such as Gene Expression Omnibus (GEO, NCBI) or ArrayExpress (AE, EBI) upon publication
• We will use a published dataset available from AE


Short description of the data set (2)

• The course Wiki contains instructions on how to proceed
• This data set is taken from Ezendam et al., 2004
• Hexachlorobenzene (HCB) is a persistent pollutant that is toxic to the liver, neurons, and the reproductive and immune systems
• In this study, Brown Norway rats were fed a diet supplemented with 0, 150, or 450 mg/kg HCB
• Spleen, mesenteric lymph nodes (MLN), thymus, blood, liver, and kidney were analyzed using the Affymetrix rat RGU-34A GeneChip microarray
  – 13-17 arrays per tissue, max 6 per concentration
• We will be primarily considering the liver data
