Bioinformatics and the big data era<br />
<strong>Sarah</strong> <strong>Hunter</strong>, European Bioinformatics Institute<br />
EBI is an Outstation of the European Molecular Biology Laboratory.
Meteorology<br />
Bioinformatics<br />
Astrophysics<br />
Particle physics<br />
2<br />
15.12.2008
Wired<br />
June 2008<br />
Nature<br />
September 2008<br />
“The Petabyte Age”<br />
Hypothesis-driven science<br />
Data-driven science<br />
- Chris Anderson<br />
Hypothesis-driven science<br />
• Gene knock-outs<br />
• Protein assays<br />
• Point mutations<br />
Data-driven science<br />
• Microarray<br />
• Genomics<br />
• Metagenomics<br />
• HT proteomics<br />
Big data in physics<br />
• The Large Hadron Collider (LHC) will<br />
produce 15 petabytes of data per<br />
year from a single source<br />
• Multi-million pound computing grid<br />
infrastructure specifically built for LHC<br />
• Consists of 3 “tiers” of ~150 computer<br />
centres in >30 countries connected<br />
by high-speed networks<br />
Big data in biology<br />
• The 1000 genomes project will<br />
produce 1 petabyte of data per year<br />
from multiple sources in multiple<br />
countries<br />
• Computing infrastructure is… umm…<br />
• www.1000genomes.org<br />
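To put the two annual volumes quoted above (15 petabytes for the LHC, 1 petabyte for the 1000 genomes project) on a common footing, the following minimal sketch converts them into an average sustained data rate. It assumes a decimal petabyte (10^15 bytes) and a 365-day year; the function name `sustained_rate` is invented for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # non-leap year (assumption)
BYTES_PER_PETABYTE = 10**15          # decimal petabyte (assumption)


def sustained_rate(petabytes_per_year):
    """Return (megabytes per second, gigabits per second) averaged over one year."""
    bytes_per_second = petabytes_per_year * BYTES_PER_PETABYTE / SECONDS_PER_YEAR
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e9


for name, petabytes in [("LHC", 15), ("1000 genomes", 1)]:
    mb_per_s, gbit_per_s = sustained_rate(petabytes)
    print(f"{name}: {mb_per_s:.0f} MB/s, {gbit_per_s:.2f} Gbit/s")
```

Under these assumptions the LHC figure corresponds to roughly 476 MB/s and the 1000 genomes figure to roughly 32 MB/s, averaged over the whole year. Real traffic arrives in bursts, so peak rates would be higher.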
Biology-wide problem<br />
We are faced with a deluge of data:<br />
• Increasing number of fully sequenced genomes<br />
• Increasing amount of genome variation data<br />
• Increasing number of meta-genomic sequencing projects<br />
• Increasing amount of transcriptomic data<br />
• Increasing amount of protein identification data<br />
• Increasing amount of protein interaction data<br />
Challenge 1: Storing data<br />
If you ever feel chilly…<br />
… just spend some time in the EBI’s machine room!<br />
Challenge 2: Moving data around<br />
[Figure: Data through the Hinxton Router; labelled events: April Data Push, June Data Transfer]<br />
(The solution if all else fails…)<br />
image courtesy http://www.simbaint.com<br />
Challenge 2: Moving data around (securely)<br />
Challenge 3: Analysing and interpreting data<br />
Sources of analysed data sets<br />
• Scientific journals<br />
• Bioinformatics databases<br />
How do you analyse over 15 million proteins?<br />
Bioinformatics analyses<br />
• Multiple Sequence Alignments and Assembly<br />
• Phylogenetic tree building<br />
• Peptide identification<br />
• 3-D Protein Structure prediction<br />
• Protein function classification<br />
• e.g. InterPro and its member databases -<br />
http://www.ebi.ac.uk/interpro/<br />
Build models describing protein(s)<br />
• Position Specific Scoring Matrices<br />
• Profiles<br />
• Hidden Markov Models<br />
• Regular Expressions<br />
• Sequence clusters<br />
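As an illustration of the simplest model type listed above, a regular expression, the sketch below scans a sequence for the PROSITE-style pattern N-{P}-[ST]-{P} (the N-glycosylation site motif). The sequence is invented, and `find_sites` is a hypothetical helper, not part of any EBI tool.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a Python regex.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sites(sequence):
    """Return (1-based start position, matched residues) for every match."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]


toy_sequence = "MKNGSAANPTLNVTQ"  # invented sequence, for illustration only
print(find_sites(toy_sequence))
```

Regular expressions are cheap to evaluate but only capture short, well-conserved motifs; profiles and hidden Markov models trade extra compute for sensitivity to more divergent sequences.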
Associate annotation describing the models<br />
• Name<br />
• e.g. Phosphofructokinase family, biotin<br />
binding-site, PAS domain<br />
• Type of functional feature<br />
• e.g. Family, Domain, Repeat, Binding Site,<br />
Active Site, etc.<br />
• Abstract<br />
• Longer description precisely describing what<br />
the model is representing<br />
• Rules are written based on these models<br />
in order to associate annotations on a<br />
large scale.<br />
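The idea of attaching annotation to models and then transferring it by rule can be sketched as a simple lookup. The model identifiers, the placeholder abstracts and the `annotate` function are all hypothetical; only the example names and feature types come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    name: str
    feature_type: str
    abstract: str


# Hypothetical model identifiers mapped to curated annotation.
RULES = {
    "MODEL_A": Annotation("Phosphofructokinase family", "Family", "Placeholder abstract."),
    "MODEL_B": Annotation("PAS domain", "Domain", "Placeholder abstract."),
    "MODEL_C": Annotation("Biotin binding-site", "Binding Site", "Placeholder abstract."),
}


def annotate(hits):
    """Map each protein to the annotations of the models it matched.

    `hits` is a dict of protein identifier -> list of matching model identifiers.
    Models without a rule are skipped.
    """
    return {
        protein: [RULES[model] for model in models if model in RULES]
        for protein, models in hits.items()
    }


result = annotate({"protein_1": ["MODEL_B", "MODEL_X"], "protein_2": ["MODEL_A"]})
for protein, annotations in result.items():
    print(protein, [a.name for a in annotations])
```

The curation effort goes into the model and its annotation once; every protein that later matches the model inherits that annotation automatically.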
Searching unknown sequences<br />
• Unknown sequence → search against HMMs → protein classified<br />
• Using these models allows annotation to be associated with<br />
proteins more quickly than traditional manual curation.<br />
• http://www.ebi.ac.uk/interproscan/<br />
However…<br />
• Searching 10,000 average proteins (281 aa) against 1 average HMM (239 states) = 1.18 CPUsec<br />
However…<br />
• Searching 17,159,442 proteins against 44,117 HMMs = 1.4 million CPU hrs<br />
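To give a feel for what 1.4 million CPU hours means in practice, the sketch below counts the protein-model comparisons involved and converts the quoted total into elapsed time for a few farm sizes. The core counts are hypothetical, and the calculation assumes the work divides perfectly across cores, which real clusters only approximate.

```python
PROTEINS = 17_159_442        # from the figures above
HMMS = 44_117                # from the figures above
TOTAL_CPU_HOURS = 1.4e6      # from the figures above


def wall_clock_days(cpu_hours, cores):
    """Elapsed days if the work divides evenly over `cores` (ideal scaling)."""
    return cpu_hours / cores / 24


print(f"protein-model comparisons: {PROTEINS * HMMS:,}")
for cores in (1, 1_000, 10_000):  # hypothetical farm sizes
    days = wall_clock_days(TOTAL_CPU_HOURS, cores)
    print(f"{cores:>6} cores: {days:,.1f} days")
```

Under these assumptions the full search involves about 7.6 × 10^11 comparisons and would take roughly 160 years on a single core, or about 58 days on 1,000 cores.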
Protein analysis problems and solutions<br />
• Problems:<br />
• For small-scale users this analysis is computationally very<br />
expensive<br />
• In the long-term, it will not scale<br />
• Short term solutions for this problem at EBI:<br />
• Purchase of specialised accelerated software (LDHmmer)<br />
• Investigation of Cell processor and GPU implementations<br />
• Increasing the size of compute farms<br />
• Longer term solutions for this problem:<br />
• Re-write of HMM searching software (HMMER3)<br />
• Using GRID or cloud computing resources to spread analysis<br />
load<br />
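Spreading the analysis load works because each protein can be searched independently of the others. A minimal sketch of the first step, splitting the input into fixed-size batches that could each become a separate job, is shown below. The `batches` helper and the batch size are illustrative, and job submission is left out because it depends on the scheduler or cloud service in use.

```python
from itertools import islice


def batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Invented protein identifiers, for illustration only.
protein_ids = [f"protein_{i}" for i in range(1, 8)]
for job_number, batch in enumerate(batches(protein_ids, 3), start=1):
    print(f"job {job_number}: {batch}")
```

Because the helper consumes its input lazily, it also works on a stream of sequences read from a file too large to hold in memory.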
Challenge 4: Access to data<br />
Infrastructure for protein data access<br />
• Various infrastructures exist for protein data: PRIDE, GPMDB, PeptideAtlas<br />
• Important that data exchange is facilitated<br />
• ProteomExchange consortium aims to encourage data sharing<br />
between these repositories<br />
• Standard formats – HUPO PSI (see http://psidev.info/)<br />
• Data security is becoming more of an issue<br />
Improving proteomic data access<br />
• Encourage submission<br />
• Some journals require submission to proteomics databases such<br />
as PRIDE before accepting manuscripts; others strongly<br />
encourage it.<br />
• Strong focus on improving data submission tools (spreadsheets;<br />
XML formats; web wizards)<br />
• Standardise nomenclature and identifiers<br />
• Use of structured ontologies and vocabularies eases searching<br />
• PICR (protein identifier cross-reference) service for ID look-ups<br />
• Provide multiple access points<br />
• BioMart<br />
• Web services<br />
• DAS<br />
Conclusions<br />
4 Main Bioinformatics Challenges<br />
• Storing data<br />
• Amount of data being generated is increasing dramatically<br />
• Moving data<br />
• Data transfers are not increasing at the same rate as storage<br />
capacity or computing power<br />
• We may need to move compute to data rather than vice versa<br />
• Analysing data<br />
• Improvements in algorithm performance are necessary as well as<br />
utilisation of the latest hardware solutions<br />
• Providing access to data<br />
• Data security is becoming increasingly important<br />
• Use of centralised databases, standards and ontologies allows<br />
easier querying of and access to data<br />
Questions?<br />