
Paterson Institute for Cancer Research SCIENTIFIC REPORT 2005


GROUP LEADER
Crispin Miller

Bioinformatics Group
http://www.paterson.man.ac.uk/groups/bioinf.jsp

Modern molecular biology generates enormous data sets that are too big to analyse by hand. Instead, computers must be used to sift through the information and find the relevant patterns and signals in the data. Bioinformatics is the study of how these programs must be written and applied to biological data. Our own research is focused on developing novel techniques and software tools for analysing microarray data and, increasingly, quantitative proteomics arising from iTRAQ based mass spectrometry.

Both microarrays and high throughput quantitative proteomics measure the expression of many thousands of genes in parallel, raising issues of data management, statistics and computing as well as biology and biochemistry. Successful analysis relies on understanding how each of these contributes to the data produced by an experiment. In addition, each sample can result in large amounts of data. Recent Affymetrix chips, for example, use ~500,000 features to probe for ~54,000 different transcripts; the simplest experiment comparing between two samples in triplicate generates data for about 3,000,000 features at once. A clinical study might involve hundreds of samples, and generate many millions of data points for further analysis.
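The feature counts quoted above follow from simple multiplication, which the short Python sketch below makes explicit. The chip size is the approximate figure from the text; the 200-sample clinical study is a purely illustrative assumption standing in for "hundreds of samples", and the function name is hypothetical.

```python
# Back-of-envelope data volumes for Affymetrix-style experiments.
FEATURES_PER_CHIP = 500_000    # approximate figure quoted in the text
GROUPS = 2                     # two samples being compared
REPLICATES_PER_GROUP = 3       # each run in triplicate


def total_features(features_per_chip: int, n_arrays: int) -> int:
    """Number of feature measurements produced by n_arrays chips."""
    return features_per_chip * n_arrays


# Simplest experiment: two samples, each in triplicate, gives six arrays.
n_arrays = GROUPS * REPLICATES_PER_GROUP
simple_total = total_features(FEATURES_PER_CHIP, n_arrays)
print(f"{n_arrays} arrays, {simple_total:,} features")

# Hypothetical clinical study (assumed size, one array per sample).
CLINICAL_SAMPLES = 200
clinical_total = total_features(FEATURES_PER_CHIP, CLINICAL_SAMPLES)
print(f"{CLINICAL_SAMPLES} arrays, {clinical_total:,} features")
```

Six arrays of roughly 500,000 features each account for the 3,000,000 figure; the clinical total scales linearly with the number of samples.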

POSTDOCTORAL FELLOWS
Michal Okoniewski
Claire Wilson

RESEARCH APPLICATIONS PROGRAMMER
Tim Yates

SYSTEM ADMINISTRATOR/SCIENTIFIC PROGRAMMER
Zhi Cheng Wang

GRADUATE STUDENTS
Laura Edwards (nee Hollins, with Lez Fairbairn)
Graeme Smethurst (with Peter Stern)

Data management and analysis

Microarray analysis relies on the use of statistical tests to assess the significance of each change in gene expression; experiments are repeated a number of times to generate replicates, and the replicate data used to evaluate the consistency of the observed differences. These tests are often accompanied by calculations of fold-change, produced from the mean values for each set of samples. The majority of our data analysis work uses BioConductor (www.bioconductor.org), a collection of analysis tools built with the statistical programming language R. We contribute code to BioConductor and have also been continuing to develop our own package, ‘simpleaffy’, which implements a variety of analysis algorithms for Affymetrix data, including Quality Control, signal detection, expression level generation and a set of graph plotting and visualisation functions, and ‘plier’, which uses a wrapper around Affymetrix’s SDK to provide access to their ‘plier’ algorithm.
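The pairing of a replicate-based statistical test with a fold-change computed from group means can be illustrated with a minimal sketch. This is not code from simpleaffy or BioConductor, and it is written in Python rather than R purely for illustration. It assumes expression values are already on a log2 scale, uses Welch's t-statistic as one possible choice of test, and the replicate values are invented.

```python
from math import sqrt
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Difference of group means; assumes values are already log2-scaled."""
    return mean(treated) - mean(control)


def welch_t(control, treated):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical log2 expression values for one probeset,
# three replicates per group (illustrative numbers only).
control = [7.0, 7.2, 6.8]
treated = [9.0, 9.3, 8.7]

lfc = log2_fold_change(control, treated)
t_stat = welch_t(control, treated)
print(f"log2 fold change = {lfc:.2f}, t = {t_stat:.2f}")
```

The fold-change reports the size of the difference between the group means, while the t-statistic weighs that difference against the spread of the replicates, which is why the two are usually reported together. Turning the statistic into a p-value would additionally require the t-distribution, which is omitted here.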

Knowledge of the replicate structure of a microarray experiment is fundamental to its correct interpretation. We have developed a large MIAME-compliant database that provides access to expression data via a Web interface. To allow the database to be searched for experiments in which specified genes are differentially expressed, the database must have access to information describing the replicate structure of each experiment, and use this to guide the statistical tests that underpin the search. We have developed an annotation system with a ‘drag-and-drop’ interface that allows users to build a pictorial representation of their experiment using a set of icons that represent the different stages of the experimental process. The system makes use of this apparently informal interaction to build a structured, machine-readable representation of experimental design. This is subsequently used by the database to group samples together to support a variety of tasks including data visualization and gene-centred searches. As datasets become larger and more complex to man-
