QC and normalisation of microarray experiments - BiGCaT

QC and normalisation of microarray experiments

Lars Eijssen 

POT course toxicogenomics 22-02-2010

Contents 

• Background on quality control (QC) 

– Examples based on two channel data sets 

• Background on further data preprocessing 

• Application of a Genepattern QC module for Affymetrix data 

– settings 

– illustration on data sets 

– interpretation of outcome 

• Preprocessing Affymetrix data using a Genepattern module 

– settings 

• Introduction to the afternoon session and the data set to be used 

2

Quality Control

One and two channel arrays 

• In this course, we will focus on gene expression arrays 

• Specific details of QC and normalisation methods are 

different for one and two channel arrays 

• The principles, however, are similar 

• I will cover the principles first and specific application 

to Affymetrix data (using a predefined workflow) later 

4

QC 

• Quality control is an important part of data processing 

• It includes two levels: array level and spot level 

• Array level: discard aberrant arrays

– Low quality material 

– Failed hybridisation 

– Too low or high overall intensity 

• Spot level: discard spots (or regions of spots) that are aberrant

– Stains on the array

– Specific spots not fulfilling quality criteria

5

Examples 

• To illustrate some QC aspects to consider, I will show 

examples from output of our in-house developed 

workflow ArrayQC (taken from different data sets) 

– This workflow has been developed to handle one and two 

channel array experiments, with predefined settings for 

Agilent/Genepix data 

– Later I will show QC results for Affymetrix chips in more 

detail 

6

Global array QC 

Foreground intensity 

Background intensity 

7

Spot specific QC 

Red background intensity 

Green background intensity 

8

Red number of pixels 

Green number of pixels 

9

Red foreground intensity 

Green foreground intensity 

10

Spot selection 

• Based on aspects as shown in the previous images, one 

can determine which spots to include and exclude from 

further analysis 

• Example criteria 

– Spot size (number of pixels) above threshold 

– Sufficient spot uniformity 

– No saturation 

– Above intensity threshold 

• For two channel arrays this may involve considering 

both channels together; this may be complicated 
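The example criteria above can be sketched in Python as follows. This is a minimal illustration only: the field names and all thresholds are hypothetical (they are not the ArrayQC settings), and requiring both channels to pass is just one simple way of combining channels.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    name: str
    n_pixels: int       # spot size in pixels
    foreground: float   # mean foreground intensity
    background: float   # mean local background intensity
    pixel_sd: float     # standard deviation of the pixel intensities


# Hypothetical thresholds; suitable values depend on scanner and platform.
MIN_PIXELS = 30
MAX_CV = 0.5             # uniformity: pixel SD relative to the foreground
SATURATION = 65535       # maximum value of a 16-bit scanner
MIN_SIGNAL_RATIO = 1.5   # foreground must be at least 1.5 x background


def passes_qc(spot):
    """Apply the four example criteria to one spot in one channel."""
    big_enough = spot.n_pixels >= MIN_PIXELS
    uniform = spot.foreground > 0 and spot.pixel_sd / spot.foreground <= MAX_CV
    not_saturated = spot.foreground < SATURATION
    above_threshold = spot.foreground >= MIN_SIGNAL_RATIO * spot.background
    return big_enough and uniform and not_saturated and above_threshold


def passes_qc_two_channel(red, green):
    """Strict choice for two channel arrays: both channels must pass."""
    return passes_qc(red) and passes_qc(green)


spots = [
    Spot("good", 80, 5000.0, 300.0, 900.0),
    Spot("small", 12, 5000.0, 300.0, 900.0),
    Spot("saturated", 80, 65535.0, 300.0, 100.0),
    Spot("dim", 80, 350.0, 300.0, 60.0),
]
kept = [s.name for s in spots if passes_qc(s)]
print(kept)
```

A less strict alternative for two channel data would be to keep a spot when at least one channel passes; which choice is appropriate depends on the experiment.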

11

Filtering spots (e.g. low intensity) 

• Plots before and after filtering: difference between channels against intensity

• Note that low intensity spots are much more subject to noise!

12

Further data preprocessing

Data analysis overview 

1. Samples: untreated (control) and exposed to compound

2. Microarray scans

3. Image analysis

4. Raw data

5. Quality control

6. Further preprocessing

• Background correction

• Normalisation

7. Normalised data

8. Statistical analysis

9. List of regulated genes

10. Pattern analysis, pathway analysis, literature data

11. Results

Slide based on a slide from J. Pennings, RIVM, NL

Background correction 

• Background signal needs to be corrected for 

– For example signal of remaining non-hybridised mRNA 

• Three types of background 

– Overall slide background 

• Can be corrected for by subtracting mean background, or by 

subtracting mean of empty spots 

– Local slide background 

• Same as previous, but per slide region 

– Specific background 

• For example cross-hybridisation; can be corrected for using mismatch probes (in case of Affymetrix chips)
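A minimal Python sketch of the first two background types (overall and local). The function names, the region labels and the floor value of 1.0 are assumptions made for this illustration; the floor is only there to keep corrected intensities positive for a later log-transformation.

```python
from statistics import mean


def subtract_global_background(intensities, empty_spot_intensities, floor=1.0):
    """Overall slide background: subtract the mean signal of the empty spots."""
    background = mean(empty_spot_intensities)
    return [max(x - background, floor) for x in intensities]


def subtract_local_background(spots, floor=1.0):
    """Local slide background: same idea, but with one mean per slide region.

    spots is a list of (region, foreground, background) tuples.
    """
    per_region = {}
    for region, _, background in spots:
        per_region.setdefault(region, []).append(background)
    region_mean = {region: mean(values) for region, values in per_region.items()}
    return [max(fg - region_mean[region], floor) for region, fg, _ in spots]


corrected_global = subtract_global_background([500, 120, 90], [100, 110, 90])
corrected_local = subtract_local_background(
    [("A", 500, 100), ("A", 300, 120), ("B", 500, 200), ("B", 250, 240)]
)
print(corrected_global, corrected_local)
```

Plain subtraction is the simplest option; methods such as RMA use a model-based correction instead.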

15

Normalisation 

• After discarding bad arrays and spots, remaining technical differences introduced by the experimental procedure need to be corrected for as much as possible

• Between-slide normalisation: correct for experimental differences 

between slides 

– E.g. one may have an overall higher signal due to differences in 

hybridisation 

• Within-slide normalisation: correct for within slide variations 

– By applying normalisation per region, per spot group etc. 

• For two-channel arrays: between-channel normalisation 

16

Normalisation procedures 

• Scale intensities to the same mean/median value for all slides 

– Only if just a small fraction of genes is expected to be changed (not 

always the case!) 

• Normalise based on values for housekeeping genes…

– Genes that are assumed to have the same expression in all samples

…or spikes

– Unique transcripts added in known concentrations

• Normalise dependent on intensity or on most similar spots 

(LOESS normalisation) 

• Force distribution of intensities to be the same for all slides 

– Quantile normalisation
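Two of the procedures above, scaling to a common median and quantile normalisation, can be sketched in plain Python. This is an illustration under simplifying assumptions (all arrays have the same number of values, ties are broken by position); it is not the implementation used by any particular package.

```python
from statistics import mean, median


def median_scale(arrays):
    """Scale every array so that its median equals the mean of all medians."""
    medians = [median(a) for a in arrays]
    target = mean(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalise(arrays):
    """Force the distribution of intensities to be the same for all arrays."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value over all arrays.
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, low to high
        normalised = [0.0] * n
        for rank, i in enumerate(order):
            normalised[i] = reference[rank]
        result.append(normalised)
    return result


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
scaled = median_scale(arrays)
quantiled = quantile_normalise(arrays)
print(scaled, quantiled)
```

After quantile normalisation every array contains exactly the same set of values; only their assignment to genes differs.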

17

Log-transformation 

• Generally, for one channel arrays the intensities are first log2-transformed.

• After logging and normalisation one can compute the difference 

in means (‘logFC’) between several experimental groups. 

– A difference is much easier to handle statistically than a ratio

– Also the distribution of the logged intensities is more ‘normal’ than on 

the original scale 

– 2^logFC corresponds to the ratio on the original scale 

• The same procedure is used to compute Cy5/Cy3 ratios for two channel arrays; the ratio is computed as 2^(log2(Cy5) - log2(Cy3))
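A small Python illustration of these steps. The function names and intensity values are made up for the example; the log ratio is taken here as log2(Cy5) - log2(Cy3), so that 2 to the power of it gives the Cy5/Cy3 ratio.

```python
import math
from statistics import mean


def log2_fold_change(treated, control):
    """logFC: difference in mean log2 intensity (treated minus control)."""
    return mean(math.log2(x) for x in treated) - mean(math.log2(x) for x in control)


def two_channel_log_ratio(cy5, cy3):
    """log2(Cy5/Cy3), written as a difference of logs."""
    return math.log2(cy5) - math.log2(cy3)


log_fc = log2_fold_change(treated=[800.0, 1600.0], control=[200.0, 400.0])
ratio = 2 ** log_fc  # back to the ratio on the original scale

up = two_channel_log_ratio(cy5=2000.0, cy3=1000.0)
down = two_channel_log_ratio(cy5=1000.0, cy3=2000.0)
print(log_fc, ratio, up, down)
```

Note that a two-fold increase and a two-fold decrease give log ratios of equal size and opposite sign.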

18

Why log two-color ratio data? (2 channel example) 

• This ‘spreads out’ the data 

• And offers symmetry 

• And makes subsequent statistical analysis easier 

• ‘raw’ ratio: ½, 1, 2

• log2 ratio (log2 of ½, 1, 2): -1, 0, 1

19

Red and green foreground intensity 

20

LogFC values after LOESS normalisation 

For two channel arrays, it is relevant to check whether effects cancel out between channels

21

Preprocessing Affymetrix data 

• For Affymetrix we have several spots (probes) for one single transcript (probeset)

– these must be summarised into one value 

• Well-known methods for preprocessing Affymetrix 

chips 

– MAS5.0 (uses mismatch intensities) 

– RMA (Robust Multiarray Average, does not use mismatches) 

• Includes both background correction and (quantile) normalisation 

– GCRMA (like RMA, but also takes into account GC content) 

– dChip (model-based) 

– For Exon ST and Gene ST arrays, of these methods only RMA can be used (another option is PLIER, which is based on an error model)
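To make the summarisation step concrete, here is a deliberately simplified Python sketch that takes the median of the log2 probe intensities per probeset. It is not the algorithm of MAS5.0, RMA, GCRMA, dChip or PLIER; the probe and probeset names are hypothetical.

```python
import math
from statistics import median


def summarise_probesets(probe_intensities, probe_to_probeset):
    """Combine all probes of a probeset into one log2 value (median)."""
    grouped = {}
    for probe, intensity in probe_intensities.items():
        probeset = probe_to_probeset[probe]
        grouped.setdefault(probeset, []).append(math.log2(intensity))
    return {probeset: median(values) for probeset, values in grouped.items()}


probe_intensities = {"p1": 256.0, "p2": 512.0, "p3": 1024.0, "p4": 64.0, "p5": 16.0}
probe_to_probeset = {"p1": "setA", "p2": "setA", "p3": "setA", "p4": "setB", "p5": "setB"}
summary = summarise_probesets(probe_intensities, probe_to_probeset)
print(summary)
```

The median is used here because it is less sensitive to a single outlying probe than the mean.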

22

A QC module in Genepattern

The Affymetrix QC workflow 

• For QC, you will be using the Genepattern module MADMAXArrayQualityAnalysis, which connects to a dedicated R server in Wageningen

– The module was called NuGOArrayQualityAnalysis before 

• I will first explain the options you can select and how to interpret the outcome

• The images are taken from other data sets than the one 

you will be using 

24

QC: what do the options mean? 

1. Provide a zip file that contains all CEL files

2. Provide email address

3. Select a normalisation method

• RMA, gcRMA, VSN, or MAS5

4. Select a distance measure

• Pearson, Spearman, Euclidean

5. Select a clustering method

• McQuitty, Ward, average, centroid, complete, median, single

6. Select the number of images per page

7. All further questions are yes/no questions to decide which output to receive (all ‘yes’ by default)

25

Quality Control Overview plot 

Default criteria:

• Percentage present within 10%

• Background within 20 units

• Scaling factors within 3-fold from the average

• GAPDH 3’/5’ ≤ 1.25

• Actin 3’/5’ ≤ 3
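As an illustration, these default criteria could be checked in Python as sketched below. The interpretation is an assumption: "within 10%" and "within 20 units" are read as the spread (maximum minus minimum) over all arrays in the data set, and the QC values themselves are made up.

```python
from statistics import mean

# Hypothetical QC values per array (not taken from a real data set).
arrays = {
    "chip1": {"present": 45.0, "background": 60.0, "scale": 1.0, "gapdh": 1.0, "actin": 1.5},
    "chip2": {"present": 47.0, "background": 65.0, "scale": 1.2, "gapdh": 1.1, "actin": 2.0},
    "chip3": {"present": 33.0, "background": 70.0, "scale": 0.2, "gapdh": 1.6, "actin": 3.5},
}


def check_arrays(arrays):
    """Return (data set level issues, issues per array)."""
    dataset_issues = []
    presents = [a["present"] for a in arrays.values()]
    backgrounds = [a["background"] for a in arrays.values()]
    if max(presents) - min(presents) > 10:
        dataset_issues.append("percentage present")
    if max(backgrounds) - min(backgrounds) > 20:
        dataset_issues.append("background")

    mean_scale = mean(a["scale"] for a in arrays.values())
    array_issues = {}
    for name, a in arrays.items():
        issues = []
        # Fold difference from the average, in either direction.
        fold = max(a["scale"], mean_scale) / min(a["scale"], mean_scale)
        if fold > 3:
            issues.append("scaling factor")
        if a["gapdh"] > 1.25:
            issues.append("GAPDH 3'/5'")
        if a["actin"] > 3:
            issues.append("actin 3'/5'")
        array_issues[name] = issues
    return dataset_issues, array_issues


dataset_issues, array_issues = check_arrays(arrays)
print(dataset_issues, array_issues)
```

An array that is flagged is not necessarily unusable; the flags indicate where to look more closely in the other QC plots.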

26

Image plot 

27

FitPLM weights image 

28

FitPLM residuals image 

29

Density plot 

30

Box- and whisker plot 

31

Relative Log Expression plot 

32

Normalised Unscaled Standard Error plot 

33

RNA digestion plot 

34

MA plot 

M = logFC (difference in log intensity); A = average (log intensity)
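The two quantities in an MA plot can be computed as in the following Python sketch: M is the log2 fold change between two intensity vectors and A is their average log2 intensity. Which two vectors are compared is an assumption of the example (two channels of one array, or one array against a reference array); the values are made up.

```python
import math


def ma_values(x, y):
    """Return (M, A) per probe for two intensity vectors on the original scale."""
    m = [math.log2(a) - math.log2(b) for a, b in zip(x, y)]
    a_values = [(math.log2(a) + math.log2(b)) / 2 for a, b in zip(x, y)]
    return m, a_values


m, a_values = ma_values([1024.0, 256.0], [256.0, 256.0])
print(m, a_values)
```

For well normalised data the cloud of points is expected to be centred around M = 0 over the whole range of A.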

35

MA plot 

36

Correlation plot/heatmap 

37

Clustering dendrogram 

38

A data preprocessing module in Genepattern

Data preprocessing 

• For data preprocessing we will use a Genepattern 

module again, NuGOExpressionFileCreator 

• This module performs: 

– background correction 

– normalisation 

– summarisation (the combination of all probe signals that belong to the same probe set into one value for that probe set)

• The outcome is a table of preprocessed data that can be used for further (statistical) analysis

– this will be discussed tomorrow 

40

NuGOExpressionFileCreator 

• Provide a zip file that contains all CEL files

• Select the CDF file to be used (next slide)

• Select a normalisation method

– RMA, gcRMA, MAS5, or dChip

• Do you want to perform quantile normalisation (RMA, gcRMA)

• Do you want to perform background correction (RMA)

• Some more options for normalisation

• Some fields to select options for input/output files

Intermezzo: custom CDF files 

• Affymetrix provides annotation files for their probesets (CDF files)

• When these get outdated, one can of course update probeset 

annotations 

• But it may be even better to disassemble these sets into the separate 

probes, reannotate probes, and assemble into new probesets 

(different ones) 

• This is exactly what custom CDF files do 

• Note that reassembled probesets do not contain the same number 

of probes anymore 

42

Intermezzo: BrainArray CDF files 1 

• Reannotation based on one of several genome databases 

• IDs are created as follows: ID from the gene the probeset refers to, followed by ‘_at’ to resemble an Affymetrix ID

• When using these annotations in other tools, you have to remove 

the ‘_at’ additions, in order to get recognisable IDs 

• Note that when using Entrez gene this means that the ID is 

composed of a number (Entrez gene ID) followed by ‘_at’, and 

as such looks exactly like a normal Affymetrix ID, but IT IS 

NOT 
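Removing the ‘_at’ addition is a simple string operation; a minimal Python sketch is given below. The function name and the example IDs are hypothetical.

```python
def strip_at_suffix(probeset_id):
    """Return the gene ID contained in a custom CDF probeset ID."""
    suffix = "_at"
    if probeset_id.endswith(suffix):
        return probeset_id[: -len(suffix)]
    return probeset_id


# Hypothetical IDs as produced with an Entrez gene based custom CDF.
custom_ids = ["24186_at", "81822_at"]
gene_ids = [strip_at_suffix(i) for i in custom_ids]
print(gene_ids)
```

Only apply this to IDs that come from a custom CDF file: original Affymetrix IDs also end in ‘_at’, but the part before it is not a gene ID.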

1 

http://arrayanalysis.mbni.med.umich.edu/arrayanalysis.html 

43

The afternoon session and the data set

The afternoon session 

• In the afternoon session, you will be performing QC 

and further data preprocessing yourself 

• You will follow a stepwise guide available online 

– http://www.bigcat.unimaas.nl/wiki/index.php/PET_course 

• You will use an Affymetrix data set and make use of the 

Genepattern QC and data preprocessing workflows 

discussed 

45

Short description of the data set (1) 

• Microarray experiments have to be uploaded to online 

repositories such as Gene Expression Omnibus (GEO, 

NCBI) or ArrayExpress (AE, EBI) upon publication 

• We will use a published dataset available from AE 

46

Short description of the data set (2) 

• The course Wiki contains instructions on how to proceed 

• This data set is taken from Ezendam et al., 2004 

• Hexachlorobenzene (HCB) is a persistent pollutant that is toxic to the liver, neurons, and the reproductive and immune systems

• In this study, Brown Norway rats were fed a diet supplemented 

with HCB of 0, 150, or 450 mg/kg 

• Spleen, mesenteric lymph nodes (MLN), thymus, blood, liver, 

and kidney were analyzed using the Affymetrix rat RGU-34A 

GeneChip microarray 

– 13-17 arrays per tissue, max 6 per concentration 

• We will be primarily considering the liver data 

47
