
Bioinformatics and the big data era

Sarah Hunter, European Bioinformatics Institute

EBI is an Outstation of the European Molecular Biology Laboratory.


Meteorology
Bioinformatics
Astrophysics
Particle physics

15.12.2008


Wired, June 2008
Nature, September 2008


“The Petabyte Age”

Hypothesis-driven science
Data-driven science

- Chris Anderson


Gene knock-outs
Protein assays
Point mutations
Hypothesis-driven science
Microarray
Data-driven science
Genomics
Metagenomics
HT proteomics


Big data in physics

• The Large Hadron Collider (LHC) will produce 15 petabytes of data per year from a single source
• Multi-million pound computing grid infrastructure specifically built for LHC
• Consists of 3 “tiers” of ~150 computer centres in >30 countries connected by high-speed networks


Big data in biology

• The 1000 Genomes Project will produce 1 petabyte of data per year from multiple sources in multiple countries
• Computing infrastructure is… umm…
• www.1000genomes.org
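
To put the two yearly volumes above on a common footing, they can be converted into average sustained data rates. This is a rough sketch under assumptions the slides do not state: a petabyte is taken as 10^15 bytes and the data is assumed to arrive evenly over a 365-day year. The function name is illustrative.

```python
# Average sustained data rates implied by the yearly volumes quoted above.
# Assumptions (not from the slides): 1 petabyte = 10**15 bytes, and data
# is produced evenly over a 365-day year.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
BYTES_PER_PETABYTE = 10**15


def average_rate_mb_per_s(petabytes_per_year: float) -> float:
    """Return the average rate in megabytes (10**6 bytes) per second."""
    bytes_per_year = petabytes_per_year * BYTES_PER_PETABYTE
    return bytes_per_year / SECONDS_PER_YEAR / 10**6


lhc_rate = average_rate_mb_per_s(15)              # LHC: 15 PB per year
thousand_genomes_rate = average_rate_mb_per_s(1)  # 1000 Genomes: 1 PB per year

print(f"LHC:          {lhc_rate:.0f} MB/s")
print(f"1000 Genomes: {thousand_genomes_rate:.0f} MB/s")
```

Under these assumptions the LHC averages roughly 476 MB/s from one source, while the 1000 Genomes Project averages roughly 32 MB/s, but spread across multiple sources and countries.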


Biology-wide problem


We are faced with a deluge of data:

• Increasing number of fully sequenced genomes
• Increasing amount of genome variation data
• Increasing number of metagenomic sequencing projects
• Increasing amount of transcriptomic data
• Increasing amount of protein identification data
• Increasing amount of protein interaction data


Challenge 1: Storing data


If you ever feel chilly…

… just spend some time in the EBI’s machine room!


Challenge 2: Moving data around


Data through the Hinxton Router

April Data Push
June Data Transfer


(The solution if all else fails…)

Image courtesy of http://www.simbaint.com


Challenge 2: Moving data around (securely)


Challenge 3: Analysing and interpreting data


Sources of analysed data sets

• Scientific journals
• Bioinformatics databases


How do you analyse over 15 million proteins?


Bioinformatics analyses

• Multiple Sequence Alignments and Assembly
• Phylogenetic tree building
• Peptide identification
• 3-D Protein Structure prediction
• Protein function classification
  • e.g. InterPro and its member databases - http://www.ebi.ac.uk/interpro/


Build models describing protein(s)

Position Specific Scoring Matrices
Profiles
Hidden Markov Models
Regular Expressions
Sequence clusters
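
Of the model types listed above, a regular expression is the simplest to show in code. As an illustration that is not taken from the slides, the sketch below encodes the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} as a Python regular expression and scans a made-up sequence. The function name and the toy sequence are hypothetical.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}: Asn, any residue except Pro,
# Ser or Thr, any residue except Pro. The pattern is wrapped in a
# lookahead so that overlapping matches are also reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for each match."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]


# Made-up sequence for illustration only; it is not a real protein.
toy_sequence = "MKNGSAANPTLNLTQ"
print(find_motif(toy_sequence))  # [(3, 'NGSA'), (12, 'NLTQ')]
```

The `NPT` stretch at position 8 is not reported because proline is excluded at the second position. Scoring matrices, profiles and HMMs follow the same "model in, match out" idea but score positions probabilistically rather than matching exactly.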


Associate annotation describing the models

• Name
  • e.g. Phosphofructokinase family, biotin binding-site, PAS domain
• Type of functional feature
  • e.g. Family, Domain, Repeat, Binding Site, Active Site, etc.
• Abstract
  • Longer description of precisely what the model represents
• Rules are written based on these models in order to associate annotations on a large scale.

(Figure: Position Specific Scoring Matrices, Profiles, Hidden Markov Models, Regular Expressions, Sequence clusters)
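
One way to picture the rule-based annotation described above is a lookup from model to annotation record, applied to every protein that matches the model. This is a minimal sketch, not InterPro's actual data model: the `ModelAnnotation` class, the `MODEL_0001` identifier, the protein names and the placeholder abstract are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelAnnotation:
    """Annotation attached to one model: name, feature type and abstract."""
    name: str
    feature_type: str
    abstract: str


# Hypothetical model identifier and placeholder abstract, for illustration.
ANNOTATIONS = {
    "MODEL_0001": ModelAnnotation(
        name="PAS domain",
        feature_type="Domain",
        abstract="(placeholder: longer description of what the model represents)",
    ),
}


def annotate(hits: dict[str, list[str]]) -> dict[str, list[ModelAnnotation]]:
    """Rule: a protein that matches a model inherits that model's annotation."""
    return {
        protein: [ANNOTATIONS[model] for model in models if model in ANNOTATIONS]
        for protein, models in hits.items()
    }


# Made-up search results: protein name -> identifiers of the models it matched.
hits = {"protein_A": ["MODEL_0001"], "protein_B": []}
result = annotate(hits)
for protein, annotations in result.items():
    print(protein, [a.name for a in annotations])
```

Because the annotation is written once per model rather than once per protein, the same curated text can be propagated to every sequence that matches.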


Searching unknown sequences

• Using these models allows annotation to be associated with proteins more quickly than traditional curation methods.
• http://www.ebi.ac.uk/interproscan/

(Figure labels: HMMs; Protein classified)


However…

10,000 average proteins (281 aa) × 1 average HMM (239 states) = 1.18 CPU sec


However…

17,159,442 proteins × 44,117 HMMs = 1.4 million CPU hrs
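
The jump from the single-model benchmark to the full search is multiplicative, since every protein has to be scored against every HMM. The sketch below works through that scaling under one loud assumption that the slides do not state: CPU time grows strictly linearly with both the number of proteins and the number of HMMs. The constant and function names are illustrative.

```python
# Benchmark quoted on the slides: 10,000 average proteins searched against
# 1 average HMM takes 1.18 CPU seconds.
BENCHMARK_PROTEINS = 10_000
BENCHMARK_HMMS = 1
BENCHMARK_CPU_SECONDS = 1.18

# Full problem size quoted on the slides.
TOTAL_PROTEINS = 17_159_442
TOTAL_HMMS = 44_117


def extrapolate_cpu_hours(proteins: int, hmms: int) -> float:
    """Scale the benchmark linearly in both dimensions (an assumption)."""
    scale = (proteins / BENCHMARK_PROTEINS) * (hmms / BENCHMARK_HMMS)
    return scale * BENCHMARK_CPU_SECONDS / 3600


comparisons = TOTAL_PROTEINS * TOTAL_HMMS
cpu_hours = extrapolate_cpu_hours(TOTAL_PROTEINS, TOTAL_HMMS)

print(f"protein-HMM comparisons: {comparisons:,}")
print(f"linear extrapolation:    {cpu_hours:,.0f} CPU hours")
```

Under the linear assumption this gives about 7.6 × 10^11 protein-HMM comparisons and roughly 25,000 CPU hours. That is well below the 1.4 million CPU hours quoted on the slide, so the quoted total presumably reflects costs that this simple extrapolation does not capture.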


Protein analysis problems and solutions

• Problems:
  • For small-scale users this analysis is computationally very expensive
  • In the long term, it will not scale
• Short-term solutions for this problem at EBI:
  • Purchase of specialised accelerated software (LDHmmer)
  • Investigation of Cell processor and GPU implementations
  • Increasing the size of compute farms
• Longer-term solutions for this problem:
  • Re-write of HMM searching software (HMMer3)
  • Using GRID or cloud computing resources to spread analysis load
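
Spreading the analysis load works because each protein can be searched independently, so the input can be cut into batches and handed to separate workers. A minimal sketch of that idea follows, with a local thread pool standing in for grid or cloud nodes and a hypothetical `search_batch` function standing in for the real HMM search.

```python
from collections.abc import Iterable, Iterator
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def search_batch(batch: list[str]) -> int:
    """Hypothetical stand-in for an HMM search on one batch.

    It only counts the sequences it was given; a real worker would run
    the search software and return the matches.
    """
    return len(batch)


# Made-up protein identifiers for illustration.
sequences = [f"protein_{i}" for i in range(10)]

# The local thread pool is a stand-in for remote grid or cloud workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    processed = list(pool.map(search_batch, batches(sequences, 4)))

print(processed)  # [4, 4, 2]
```

The batch size trades scheduling overhead against load balance: larger batches mean fewer jobs to dispatch, smaller ones keep all workers busy towards the end of a run.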


Challenge 4: Access to data


Infrastructure for protein data access

• Various infrastructures exist for protein data
  • PRIDE, GPMDB, PeptideAtlas
• Important that data exchange is facilitated
  • ProteomeXchange consortium aims to encourage data sharing between these repositories
  • Standard formats – HUPO PSI (see http://psidev.info/)
• Data security is becoming more of an issue


Improving proteomic data access

• Encourage submission
  • Some journals require submission to proteomics databases such as PRIDE before accepting manuscripts; others strongly encourage it.
  • Strong focus on improving data submission tools (spreadsheets; XML formats; web wizards)
• Standardise nomenclature and identifiers
  • Use of structured ontologies and vocabularies eases searching
  • PICR (Protein Identifier Cross-Reference) service for ID look-ups
• Provide multiple access points
  • BioMart
  • Web services
  • DAS


Conclusions


4 Main Bioinformatics Challenges

• Storing data
  • Amount of data being generated is increasing dramatically
• Moving data
  • Data transfer rates are not increasing at the same rate as storage capacity or computing power
  • We may need to move compute to data rather than vice versa
• Analysing data
  • Improvements in algorithm performance are necessary as well as utilisation of the latest hardware solutions
• Providing access to data
  • Data security is becoming increasingly important
  • Use of centralised databases, standards and ontologies allows easier querying of and access to data


Questions?
