13.08.2024 Views

Watershed_Presentation


Effectively leveraging public datasets

and models for drug discovery

Stefanie Morgan, PhD

Head of Science & BD, Watershed Bio


Disclosures

Employee and shareholder of

Watershed Bio


Licensed technology with System

Biosciences Inc



Watershed’s Background

Formed to make biological data analysis more approachable

Built by a team of dedicated, rigorous scientists passionate about good data practices in biology

We believe that improving access to the right tools and resources for data analysis can have far-reaching impacts on human health


Drug discovery in 2025 won’t be aided by

public datasets and foundation models.

It will be defined by them.



Data use in research over time

[Figure: Y-axis: kilobases of DNA per day per machine (log scale, 10 to 1,000,000,000). X-axis: year (1975 to 2015). Series labels: first generation, second generation, third generation.]


A Paradigm Shift

Open datasets and models



The explosion of new open datasets and data

types promises to accelerate drug discovery

Open Targets

Combines population genetics with functional

genomics screens in relevant cellular systems to

enable systematic identification and prioritization of

drug targets

JUMP-Cell Painting Consortium

116,750 unique compounds, 20,000 gene perturbation

studies, and 1.6 billion cells allow inference of which

pathways to target with which chemical matter for a given

disease



The explosion of new open models

promises to accelerate drug discovery

AlphaFold2

A deep neural network, trained on 100,000 three-dimensional protein structures, that predicts a protein's structure using only its amino acid sequence. It has predicted the structures of over 200 million proteins, many of them to near-atomic accuracy

Geneformer

Transformer model pretrained on about 30M single-cell transcriptomes to enable tissue-specific prediction of gene perturbation effects



How can open data and

models actually accelerate

drug development?



Historically, understanding a disease well enough to develop a drug has taken decades

Drug Discovery (3-5 Years): 5,000 - 10,000 compounds. Target identification & validation, assay development, lead generation

Pre-Clinical (1-2 Years): 250 compounds. In vitro & in vivo toxicity, ADMET, PK/PD

Clinical Trials (6-7 Years): 5 compounds

Regulatory Approval (1-2 Years): 1 new drug
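As a quick check on the pipeline figures above, the stage durations and the compound funnel can be tallied directly. This is a minimal sketch: the variable names are illustrative, and treating 5,000 - 10,000 as the starting pool that yields the single approved drug is an assumption based on the order of the slide.

```python
# Stage name, minimum years, maximum years, as listed on the slide.
stages = [
    ("Drug Discovery", 3, 5),
    ("Pre-Clinical", 1, 2),
    ("Clinical Trials", 6, 7),
    ("Regulatory Approval", 1, 2),
]

# Assumed starting pool of compounds (low and high end of the slide's range).
compounds_start = (5_000, 10_000)

# End-to-end duration if every stage runs at its shortest or longest.
total_min_years = sum(low for _, low, _ in stages)
total_max_years = sum(high for _, _, high in stages)

# Fraction of starting compounds that become the one approved drug.
best_case_yield = 1 / compounds_start[0]
worst_case_yield = 1 / compounds_start[1]

print(f"Total timeline: {total_min_years}-{total_max_years} years")
print(f"Overall yield: {worst_case_yield:.4%} to {best_case_yield:.4%}")
```

Under these assumptions the stages sum to 11 to 16 years, and roughly one starting compound in 5,000 to 10,000 reaches approval.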


Target identification and prioritization

One of the most challenging steps has been identifying and prioritizing therapeutic targets



Target identification

[Figure: evidence sources feeding target identification: genetic associations, pathway-level analyses, text mining, genetic screens, animal models]

One of the first steps in drug discovery is identifying which mRNAs, proteins, or pathways may be good targets for treating a disease. All data for doing so are incomplete, biased, and expensive to generate.

Integrating signals across different datasets and data types paints a more complete and less biased picture. Open datasets hold the promise of extracting those signals in days instead of months.
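One simple way to picture "integrating signals across different datasets" is rank aggregation: each evidence source scores candidate targets on its own scale, the scores are converted to ranks, and the ranks are averaged. The sketch below is only an illustration of that idea, not the method used by any particular platform. The gene names, source names, and scores are made up, and counting a missing gene as zero evidence is an assumption.

```python
from statistics import mean

def rank_normalize(scores):
    """Map raw scores to (0, 1] by rank so sources on different scales are comparable."""
    ordered = sorted(scores, key=scores.get)  # ascending: weakest evidence first
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

def integrate(sources):
    """Average rank-normalized scores per gene across all sources.

    A gene absent from a source contributes 0.0 for that source (assumption).
    """
    normalized = [rank_normalize(scores) for scores in sources.values()]
    genes = set().union(*(scores.keys() for scores in sources.values()))
    return {gene: mean(n.get(gene, 0.0) for n in normalized) for gene in genes}

# Hypothetical evidence tables; higher means stronger support in every source.
sources = {
    "genetic_association": {"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": 0.5},
    "genetic_screen": {"GENE_A": 0.1, "GENE_B": 3.0, "GENE_C": 2.0},
    "text_mining": {"GENE_A": 40, "GENE_C": 55},
}

combined = integrate(sources)
for gene, score in sorted(combined.items(), key=lambda item: item[1], reverse=True):
    print(f"{gene}: {score:.3f}")
```

In this toy input GENE_A leads on genetic association alone, but GENE_C ranks first overall because it is supported consistently across all three sources, which is the kind of less biased picture the slide describes.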


Example: Multi-omics analysis identifies shared oncogenic driver

pathways and drug responsiveness across ten cancer types

Li et al (Cell, 2023) demonstrate the power of

integrating cohesive multi-omics data to derive

meaningful, impactful insights

Integrated data from transcriptome, proteome, and

phosphoproteome to identify multi-omics clusters of

pan-cancer aberrations and their pathway

perturbations

Deeper analyses reveal important insights about

therapeutic responses observed in the clinic



Patient segmentation

Resources to identify the patients most likely to respond to a given therapy are critical, as this can be particularly challenging



Data available to help with patient

segmentation is only set to grow over time

Genetic, epigenetic, transcriptomic, metabolomic, proteomic, EHR, and wearable data can all identify clinically salient patient subsets, but must be used effectively for meaningful insights to be derived

https://www.ahajournals.org/doi/10.1161/CIRCRESAHA.117.310782



This is so promising - why

doesn’t everyone do it?



Effective analysis of this type of data requires integration of multiple components

Large dataset manipulation requires specialized compute infrastructure

Setup must be designed to prevent data silos and future data loss

Effective setup, maintenance, and data analysis require experts in two or more areas

[Figure labels: Lab; Multi-omics Data; Data Storage Infrastructure; Compute Infrastructure; Bioinformatics Tooling; Software Engineer / System Admin; Bioinformatics Expert; Analysis; Insights]


Infrastructure challenges to successful analyses can be multifaceted

Is infrastructure secure? Can it handle patient or protected data?

Are you following FAIR (findable, accessible, interoperable, and reusable) data practices?

Is your data tracked, reliable, and reproducible?

Is all of this robust enough to withstand regulatory scrutiny?

Doing this incorrectly can turn thousands of hours of effort into irreproducible noise

[Figure labels: Wet Lab; Multi-omics Data; Data Storage Infrastructure; Compute Infrastructure; Bioinformatics Tooling; Software Engineer / System Admin; Bioinformatics Consultant; Analysis]
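One small, concrete piece of keeping data "tracked, reliable, and reproducible" is recording a checksum for every input file so that later changes are detectable. The following is a minimal sketch using only the Python standard library. The function names and the manifest layout are illustrative assumptions, and it does not replace a full data management system.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its checksum."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_files(data_dir, manifest):
    """List files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(data_dir)
    names = set(current) | set(manifest)
    return sorted(name for name in names if current.get(name) != manifest.get(name))
```

A manifest built when a dataset is first received can be stored alongside the analysis. Calling `changed_files` before each rerun then reports any file whose contents no longer match what the original results were based on.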


Expertise-related challenges can pose a particular bottleneck

Even after secure, FAIR infrastructure has been built, you need to work with and interpret the data accurately

PhD-level bioinformatics and data science expertise are absolute requirements for correctly performing these bespoke analyses - entire drug programs can hinge on their results

[Figure labels: Wet Lab; Multi-omics Data; Data Storage Infrastructure; Compute Infrastructure; Bioinformatics Tooling; Software Engineer / System Admin; Bioinformatics Consultant; Analysis]


Effective support and resources can

powerfully advance your therapeutic

program at every stage

Productionisation of

therapeutic product

Full data set for

IND application

1st collection of key

success indicators

1st pilot data set

Growth with you from

day 1 to 1000 and beyond



Summary & Conclusions

Data is the future of science

Leveraging data effectively can transform your scientific program

Finding and utilizing the right resources to effectively investigate data can be challenging

Setting things up correctly can be transformative - it is worth the

effort to do it right!

Watershed is always available for input and guidance on data processing when you need it.



Questions?
