
Bioinformatics and the big data era

Sarah Hunter, European Bioinformatics Institute

EBI is an Outstation of the European Molecular Biology Laboratory.


Data-intensive fields: Meteorology, Bioinformatics, Astrophysics, Particle physics



(Magazine covers: Wired, June 2008; Nature, September 2008)



“The Petabyte Age”: hypothesis-driven science vs. data-driven science
- Chris Anderson



Hypothesis-driven science: gene knock-outs, protein assays, point mutations

Data-driven science: microarrays, genomics, metagenomics, high-throughput (HT) proteomics



Big data in physics

• The Large Hadron Collider (LHC) will produce 15 petabytes of data per year from a single source

• Multi-million pound computing grid infrastructure specifically built for LHC

• Consists of 3 “tiers” of ~150 computer centres in >30 countries connected by high-speed networks



Big data in biology

• The 1000 genomes project will produce 1 petabyte of data per year from multiple sources in multiple countries

• Computing infrastructure is… umm…

• www.1000genomes.org



Biology-wide problem



We are faced with a deluge of data:

• Increasing number of fully sequenced genomes

• Increasing amount of genome variation data

• Increasing number of meta-genomic sequencing projects

• Increasing amount of transcriptomic data

• Increasing amount of protein identification data

• Increasing amount of protein interaction data



Challenge 1: Storing data



If you ever feel chilly…

… just spend some time in the EBI’s machine room!



Challenge 2: Moving data around



Data through the Hinxton Router

(Network traffic chart, highlighting the April data push and the June data transfer)
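
For a sense of scale, a minimal back-of-the-envelope sketch of raw transfer time, assuming (purely for illustration, not figures from this talk) a 1 PB data set and a dedicated 10 Gbit/s link running at 80% efficiency:

```python
# Back-of-the-envelope: how long does it take to move 1 PB over a network link?
# The data volume, link speed and efficiency below are illustrative assumptions,
# not figures from this talk.

data_bytes = 1e15          # 1 petabyte (decimal definition)
link_bits_per_sec = 10e9   # assumed dedicated 10 Gbit/s link
efficiency = 0.8           # assumed usable fraction after protocol overhead

seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
print(f"~{seconds / 86400:.1f} days of continuous transfer")   # roughly 11-12 days
```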



(The solution if all else fails…)

image courtesy http://www.simbaint.com



Challenge 2: Moving data around (securely)



Challenge 3: Analysing and interpreting data



Sources of analysed data sets

• Scientific journals

• Bioinformatics databases



How do you analyse over 15 million proteins?



Bioinformatics analyses

• Multiple Sequence Alignments and Assembly

• Phylogenetic tree building

• Peptide identification

• 3-D Protein Structure prediction

• Protein function classification

• e.g. InterPro and its member databases: http://www.ebi.ac.uk/interpro/



Build models describing protein(s)

• Position-Specific Scoring Matrices (PSSMs)
• Profiles
• Hidden Markov Models (HMMs)
• Regular Expressions
• Sequence clusters
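
As an illustration of the regular-expression approach above, a minimal sketch matching a PROSITE-style pattern against a protein sequence; the Walker A (P-loop) motif and the query fragment are illustrative choices, not content from this talk:

```python
import re

# Illustrative only: a PROSITE-style pattern for the Walker A (P-loop) motif,
# [AG]-x(4)-G-K-[ST], translated into a Python regular expression. The motif
# and the query fragment below are examples, not content from this talk.
walker_a = re.compile(r"[AG].{4}GK[ST]")

sequence = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE"   # example query fragment

match = walker_a.search(sequence)
if match:
    print(f"Motif found at positions {match.start() + 1}-{match.end()}: {match.group()}")
else:
    print("No motif match")
```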



Associate annotation describing the models

• Name
  • e.g. Phosphofructokinase family, biotin binding site, PAS domain
• Type of functional feature
  • e.g. Family, Domain, Repeat, Binding Site, Active Site, etc.
• Abstract
  • A longer description stating precisely what the model represents
• Rules are written based on these models in order to associate annotations on a large scale (a minimal sketch follows).
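
A minimal sketch of such rule-based annotation transfer, assuming hypothetical model identifiers and protein accessions (only the entry names are taken from the slide above):

```python
# A minimal sketch of rule-based annotation transfer: if a protein matches a
# signature model, the annotation attached to that model is applied to the
# protein. Model IDs and protein accessions below are hypothetical examples.

model_annotation = {
    "MODEL_0001": {"name": "Phosphofructokinase family", "type": "Family"},
    "MODEL_0002": {"name": "PAS domain",                 "type": "Domain"},
}

# Hypothetical match results (protein accession -> models it hits)
matches = {
    "PROT_A": ["MODEL_0001"],
    "PROT_B": ["MODEL_0002", "MODEL_0001"],
}

for protein, models in matches.items():
    for model in models:
        ann = model_annotation[model]
        print(f"{protein}\t{ann['type']}\t{ann['name']}")
```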




Searching unknown sequences

(Diagram: an unknown protein sequence is searched against the library of HMMs, and the protein is classified)

• Using these models allows annotation to be associated with proteins more quickly than traditional manual curation methods.

• http://www.ebi.ac.uk/interproscan/
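
As one concrete way to run such a search, a minimal sketch using HMMER3's hmmscan; the file paths are placeholders, and the pipeline behind InterProScan wraps several member-database methods and may use different tooling:

```python
import subprocess

# A minimal sketch of searching one unknown protein sequence against a library
# of profile HMMs using HMMER3's hmmscan. File paths are placeholders.

hmm_library = "member_db.hmm"   # pressed (hmmpress'd) HMM library - placeholder path
query_fasta = "unknown.fasta"   # query protein sequence - placeholder path

subprocess.run(
    ["hmmscan", "--tblout", "hits.tbl", hmm_library, query_fasta],
    check=True,
)

# Report the model name and full-sequence E-value for each hit
with open("hits.tbl") as tbl:
    for line in tbl:
        if line.startswith("#"):
            continue
        fields = line.split()
        print(f"model={fields[0]}  e-value={fields[4]}")
```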



However…

10,000 average proteins (281 aa) vs. 1 average HMM (239 states) = 1.18 CPU seconds



However…

17,159,442 proteins vs. 44,117 HMMs = 1.4 million CPU hours
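
A rough scaling sketch of where a figure of this size comes from; the per-comparison time is an assumption chosen only to illustrate the order of magnitude, not a measured benchmark from this talk:

```python
# Rough scaling sketch: the cost of an all-against-all scan is
# (proteins x models) x (per-comparison cost). The per-comparison time below
# is an assumption for illustration, not a measured figure.

proteins = 17_159_442
hmms = 44_117
seconds_per_comparison = 0.0067   # assumed ~6.7 ms per protein-vs-HMM comparison

cpu_hours = proteins * hmms * seconds_per_comparison / 3600
print(f"~{cpu_hours / 1e6:.1f} million CPU hours")   # on the order of the slide's 1.4 million
```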



Protein analysis problems and solutions

• Problems:
  • For small-scale users this analysis is computationally very expensive
  • In the long term, it will not scale
• Short-term solutions for this problem at EBI:
  • Purchase of specialised accelerated software (LDHmmer)
  • Investigation of Cell processor and GPU implementations
  • Increasing the size of compute farms
• Longer-term solutions for this problem:
  • Re-write of HMM searching software (HMMer3)
  • Using GRID or cloud computing resources to spread the analysis load (a simple chunking sketch follows this list)
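
A minimal sketch of the kind of chunking that lets a large protein set be spread across farm or grid nodes; file names and the chunk size are placeholder choices:

```python
# A minimal sketch of splitting a large FASTA file into fixed-size chunks so
# that each chunk can be analysed on a separate farm or grid node. File names
# and the chunk size are placeholders.

CHUNK_SIZE = 10_000   # sequences per chunk

def split_fasta(path, chunk_size=CHUNK_SIZE):
    chunk, count, part = [], 0, 0
    with open(path) as fasta:
        for line in fasta:
            # Start a new chunk file once the current one holds chunk_size sequences
            if line.startswith(">") and count == chunk_size:
                _write_chunk(chunk, part)
                chunk, count, part = [], 0, part + 1
            if line.startswith(">"):
                count += 1
            chunk.append(line)
    if chunk:
        _write_chunk(chunk, part)

def _write_chunk(lines, part):
    with open(f"chunk_{part:04d}.fasta", "w") as out:
        out.writelines(lines)

split_fasta("uniprot_proteins.fasta")   # placeholder input file
```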



Challenge 4: Access to data



Infrastructure for protein data access

• Various infrastructures exist for protein data: PRIDE, GPMDB, PeptideAtlas
• It is important that data exchange between them is facilitated
  • The ProteomeXchange consortium aims to encourage data sharing between these repositories
  • Standard formats: HUPO PSI (see http://psidev.info/)
• Data security is becoming more of an issue



Improving proteomic data access

• Encourage submission
  • Some journals require submission to proteomics databases such as PRIDE before accepting manuscripts; others strongly encourage it.
  • Strong focus on improving data submission tools (spreadsheets; XML formats; web wizards)
• Standardise nomenclature and identifiers
  • Use of structured ontologies and vocabularies eases searching
  • PICR (protein identifier cross-reference) service for ID look-ups (a simplified look-up sketch follows this list)
• Provide multiple access points
  • BioMart
  • Web services
  • DAS
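
A much-simplified local illustration of the kind of identifier cross-referencing a service like PICR provides; the mapping table is a hypothetical example, and this is not PICR's actual interface:

```python
# Simplified, local stand-in for identifier cross-referencing of the kind a
# service like PICR provides. The mapping entries below are hypothetical
# examples; this is not PICR's actual interface.

cross_references = {
    "P12345": {"UniProtKB": "P12345", "IPI": "IPI00000001", "RefSeq": "NP_000001"},
}

def lookup(accession, target_db):
    return cross_references.get(accession, {}).get(target_db, "not mapped")

print(lookup("P12345", "RefSeq"))   # -> NP_000001 (hypothetical mapping)
```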



Conclusions



4 Main Bioinformatics Challenges

• Storing data
  • The amount of data being generated is increasing dramatically
• Moving data
  • Data transfer speeds are not increasing at the same rate as storage capacity or computing power
  • We may need to move compute to data rather than vice versa
• Analysing data
  • Improvements in algorithm performance are necessary, as well as utilisation of the latest hardware solutions
• Providing access to data
  • Data security is becoming increasingly important
  • Use of centralised databases, standards and ontologies allows easier querying of and access to data



Questions?

