Sarah Hunter

Bioinformatics and the big data era 

Sarah Hunter, European Bioinformatics Institute 

EBI is an Outstation of the European Molecular Biology Laboratory.

Meteorology 

Bioinformatics 

Astrophysics 

Particle physics 


15.12.2008

Wired, June 2008

Nature, September 2008


“The Petabyte Age” 

Hypothesis-driven science 

Data-driven science 

- Chris Anderson 


Hypothesis-driven science: gene knock-outs, protein assays, point mutations

Data-driven science: microarray, genomics, metagenomics, HT proteomics


Big data in physics 

• The Large Hadron Collider (LHC) will produce 15 petabytes of data per year from a single source

• Multi-million pound computing grid infrastructure specifically built for LHC

• Consists of 3 “tiers” of ~150 computer centres in >30 countries connected by high-speed networks


Big data in biology 

• The 1000 genomes project will produce 1 petabyte of data per year from multiple sources in multiple countries

• Computing infrastructure is… umm… 

• www.1000genomes.org 
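
To put the yearly volumes quoted for the LHC and the 1000 genomes project into network terms, the sketch below converts them into an average sustained transfer rate. It assumes decimal petabytes (10^15 bytes) and perfectly even production over a 365-day year; neither assumption comes from the slides, and the function name is purely illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
PETABYTE = 10**15  # assumption: decimal petabyte, not 2**50 bytes


def sustained_rate(petabytes_per_year):
    """Return the average rate as (megabytes per second, gigabits per second)."""
    bytes_per_second = petabytes_per_year * PETABYTE / SECONDS_PER_YEAR
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e9


for name, volume in [("LHC", 15), ("1000 genomes project", 1)]:
    mb_per_s, gbit_per_s = sustained_rate(volume)
    print(f"{name}: {mb_per_s:.0f} MB/s, {gbit_per_s:.2f} Gbit/s on average")
```

Under these assumptions the LHC figure works out to a little under 500 MB/s and the 1000 genomes figure to a little over 30 MB/s, before any allowance for bursts, replication or retransmission.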


Biology-wide problem 


We are faced with a deluge of data:

• Increasing number of fully sequenced genomes 

• Increasing amount of genome variation data 

• Increasing number of meta-genomic sequencing projects 

• Increasing amount of transcriptomic data 

• Increasing amount of protein identification data 

• Increasing amount of protein interaction data 


Challenge 1: Storing data 


If you ever feel chilly… 

… just spend some time in the EBI’s machine room! 


Challenge 2: Moving data around 


Data through the Hinxton Router 

April Data Push 

June Data Transfer 


(The solution if all else fails…) 

image courtesy http://www.simbaint.com 


Challenge 2: Moving data around (securely) 


Challenge 3: Analysing and interpreting data 


Sources of analysed data sets 

• Scientific journals 

• Bioinformatics databases 


How do you analyse over 15 million proteins? 


Bioinformatics analyses 

• Multiple Sequence Alignments and Assembly 

• Phylogenetic tree building 

• Peptide identification 

• 3-D Protein Structure prediction 

• Protein function classification 

  • e.g. InterPro and its member databases: http://www.ebi.ac.uk/interpro/


Build models describing protein(s) 

• Position-Specific Scoring Matrices

• Profiles

• Hidden Markov Models

• Regular Expressions

• Sequence clusters
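
As a small illustration of the regular-expression style of model, the sketch below translates a pattern written in PROSITE-like syntax into a Python regular expression and searches a sequence with it. The helper name `prosite_to_regex`, the zinc-finger-like example pattern and the test sequence are all chosen for illustration and are not taken from the slides; only the common elements of the syntax (`x`, repeat counts, `[...]` sets, `{...}` exclusions, `<` and `>` anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        if "(" in element:
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."  # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element  # a single residue or a [set] of residues
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative pattern and made-up sequence (assumptions, not real data).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
sequence = "AACAACAAALAAAAAAAAHAAAHAA"
regex = prosite_to_regex(pattern)
match = re.search(regex, sequence)
print(regex, match.span() if match else None)
```

Regular expressions either match or do not, which makes them fast to apply but less tolerant of sequence variation than the scoring-based models in the same list.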


Associate annotation describing the models 

• Name

  • e.g. Phosphofructokinase family, biotin binding-site, PAS domain

• Type of functional feature

  • e.g. Family, Domain, Repeat, Binding Site, Active Site, etc.

• Abstract

  • Longer description precisely describing what the model is representing

• Rules are written based on these models in order to associate annotations on a large scale.

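
The idea of attaching annotation to models and then transferring it by rule can be sketched as a simple lookup. The model identifiers (`MODEL_A`, `MODEL_B`), the `ModelAnnotation` class and the placeholder abstracts below are hypothetical; only the entry names and feature types echo the examples on the slide. Real rule systems are considerably richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelAnnotation:
    name: str          # short name of what the model represents
    feature_type: str  # e.g. Family, Domain, Repeat, Binding Site
    abstract: str      # longer free-text description


# Hypothetical model identifiers mapped to their annotation.
ANNOTATIONS = {
    "MODEL_A": ModelAnnotation("Phosphofructokinase family", "Family", "Placeholder abstract."),
    "MODEL_B": ModelAnnotation("PAS domain", "Domain", "Placeholder abstract."),
}


def annotate(hits):
    """Rule: a protein inherits the annotation of every known model it matches.

    hits maps a protein identifier to the list of model identifiers it matched.
    """
    return {
        protein: [ANNOTATIONS[model] for model in models if model in ANNOTATIONS]
        for protein, models in hits.items()
    }


result = annotate({"protein_1": ["MODEL_A", "MODEL_B"], "protein_2": []})
for protein, annotations in result.items():
    print(protein, [(a.name, a.feature_type) for a in annotations])
```

Because the annotation is written once per model rather than once per protein, the curation effort stays roughly constant while the number of annotated proteins grows.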


Searching unknown sequences 

• Using these models allows annotation to be associated with proteins more quickly than traditional curation methods.

• http://www.ebi.ac.uk/interproscan/

HMMs → Protein classified


However… 

10,000 average proteins (281 aa) searched against 1 average HMM (239 states) = 1.18 CPU sec


However… 

17,159,442 proteins searched against 44,117 HMMs = 1.4 million CPU hrs
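
A rough way to see how the cost scales is to extrapolate the small benchmark linearly in both the number of proteins and the number of HMMs. The sketch below does that, reading the benchmark as 1.18 CPU seconds for all 10,000 proteins against one HMM; both that reading and the linear-scaling assumption are mine, and the slides do not show how the 1.4 million CPU hour figure was derived. On this naive basis the result comes out at roughly 25,000 CPU hours, far below the quoted total, so the quoted figure presumably reflects costs that the benchmark alone does not capture.

```python
# Benchmark figures as quoted on the slide.
BENCHMARK_PROTEINS = 10_000
BENCHMARK_HMMS = 1
BENCHMARK_CPU_SECONDS = 1.18


def linear_estimate_cpu_hours(n_proteins, n_hmms):
    """Extrapolate CPU hours, assuming cost is linear in proteins x HMMs."""
    seconds_per_comparison = BENCHMARK_CPU_SECONDS / (BENCHMARK_PROTEINS * BENCHMARK_HMMS)
    return n_proteins * n_hmms * seconds_per_comparison / 3600


hours = linear_estimate_cpu_hours(17_159_442, 44_117)
print(f"{hours:,.0f} CPU hours under the linear assumption")
```

Whatever the true per-search cost, the product of sequence count and model count is what drives the total, and both are growing.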


Protein analysis problems and solutions 

• Problems:

  • For small-scale users this analysis is computationally very expensive

  • In the long-term, it will not scale

• Short term solutions for this problem at EBI:

  • Purchase of specialised accelerated software (LDHmmer)

  • Investigation of Cell processor and GPU implementations

  • Increasing the size of compute farms

• Longer term solutions for this problem:

  • Re-write of HMM searching software (HMMer3)

  • Using GRID or cloud computing resources to spread analysis load
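
Spreading the analysis load works because each protein can be searched independently of the others. The sketch below splits a sequence set into chunks and hands them to a pool of workers; a local thread pool stands in for GRID or cloud nodes, and `search_chunk` is a hypothetical placeholder that returns sequence lengths rather than running a real HMM search.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def search_chunk(sequences):
    """Hypothetical stand-in for an HMM search over one chunk of sequences."""
    return {seq_id: len(residues) for seq_id, residues in sequences}


def spread_analysis(sequences, chunk_size, workers=4):
    """Run search_chunk over all chunks in parallel and merge the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(search_chunk, chunked(sequences, chunk_size)):
            results.update(partial)
    return results


# Made-up sequences for illustration only.
sequences = [(f"protein_{i}", "A" * i) for i in range(1, 11)]
print(spread_analysis(sequences, chunk_size=3))
```

In a distributed setting the chunk size becomes a trade-off: larger chunks reduce scheduling and transfer overhead per job, while smaller chunks balance load more evenly across nodes.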


Challenge 4: Access to data 


Infrastructure for protein data access 

• Various infrastructures exist for protein data: PRIDE, GPMDB, PeptideAtlas

• It is important that data exchange is facilitated

• ProteomExchange consortium aims to encourage data sharing between these repositories

• Standard formats – HUPO PSI (see http://psidev.info/)

• Data security is becoming more of an issue


Improving proteomic data access 

• Encourage submission

  • Some journals require submission to proteomics databases such as PRIDE before accepting manuscripts; others strongly encourage it.

  • Strong focus on improving data submission tools (spreadsheets; XML formats; web wizards)

• Standardise nomenclature and identifiers

  • Use of structured ontologies and vocabularies eases searching

  • PICR (protein identifier cross-reference) service for ID look-ups

• Provide multiple access points

  • BioMart

  • Web services

  • DAS


Conclusions 


4 Main Bioinformatics Challenges 

• Storing data

  • Amount of data being generated is increasing dramatically

• Moving data

  • Data transfers are not increasing at the same rate as storage capacity or computing power

  • We may need to move compute to data rather than vice versa

• Analysing data

  • Improvements in algorithm performance are necessary as well as utilisation of the latest hardware solutions

• Providing access to data

  • Data security is becoming increasingly important

  • Use of centralised databases, standards and ontologies allows easier querying of and access to data


Questions? 

