Watershed_Presentation
Effectively leveraging public datasets
and models for drug discovery
Stefanie Morgan, PhD
Head of Science & BD, Watershed Bio
Disclosures
Employee and shareholder of Watershed Bio
Licensed technology with System Biosciences Inc
#2
Watershed’s Background
Formed to make biological data analysis more approachable
Built by a team of dedicated, rigorous scientists passionate about good data practices in biology
We believe that improving access to the right tools and resources for data analysis can have far-reaching impacts on human health
#3
Drug discovery in 2025 won’t be aided by
public datasets and foundation models.
It will be defined by them.
4
Data use in research over time
[Figure: Kilobases of DNA per Day per Machine (log scale, 10 to 1,000,000,000) by Year, 1975 to 2015, across First, Second, and Third Generation]
#5
A Paradigm Shift
Open datasets and models
6
The explosion of new open datasets and data
types promises to accelerate drug discovery
Open Targets
Combines population genetics with functional
genomics screens in relevant cellular systems to
enable systematic identification and prioritization of
drug targets
JUMP-Cell Painting Consortium
116,750 unique compounds, 20,000 gene perturbation
studies, and 1.6 billion cells allow inference of which
pathways to target with which chemical matter for a given
disease
7
The explosion of new open models
promises to accelerate drug discovery
AlphaFold2
A deep neural network trained on 100,000
three-dimensional protein structures that predicts
structure using only a protein's amino acid sequence.
It has predicted the structure of over 200 million
proteins to atomic-scale resolution
GeneFormer
Transformer model pretrained on 30M
single-cell transcriptomes to
enable tissue-specific prediction
of gene perturbation effects
8
How can open data and
models actually accelerate
drug development?
#9
Historically, understanding a disease well
enough to develop a drug has taken decades
Drug Discovery (3-5 Years): 5,000 - 10,000 compounds
Pre-Clinical (1-2 Years): 250 compounds
Clinical Trials (6-7 Years): 5 compounds
Regulatory Approval (1-2 Years): 1 new drug
Activities shown: target identification & validation, assay development, lead generation, in vitro & in vivo toxicity, ADMET, PK/PD
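The stage durations and compound counts on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes each compound count is the number of candidates present at that stage, and the names `STAGES`, `total_years`, and `survival_fractions` are made up for this example.

```python
# Stage figures as quoted on the slide:
# (name, min_years, max_years, (min_compounds, max_compounds))
# Assumption: each count is the number of candidates present at that stage.
STAGES = [
    ("Drug Discovery", 3, 5, (5_000, 10_000)),
    ("Pre-Clinical", 1, 2, (250, 250)),
    ("Clinical Trials", 6, 7, (5, 5)),
    ("Regulatory Approval", 1, 2, (1, 1)),
]


def total_years(stages):
    """Sum the minimum and maximum stage durations."""
    return sum(s[1] for s in stages), sum(s[2] for s in stages)


def survival_fractions(stages):
    """Fraction of candidates carried from each stage into the next, as (name, low, high)."""
    out = []
    for (name, _, _, (lo, hi)), (_, _, _, (next_lo, next_hi)) in zip(stages, stages[1:]):
        # Lowest fraction: fewest survivors from the largest pool; highest: the reverse.
        out.append((name, next_lo / hi, next_hi / lo))
    return out


years = total_years(STAGES)
fractions = survival_fractions(STAGES)

print(f"End-to-end timeline: {years[0]}-{years[1]} years")
for name, low, high in fractions:
    print(f"{name}: {low:.1%} to {high:.1%} of candidates advance")
```

Under these assumptions the stages sum to 11-16 years, and only a few percent of candidates leave the discovery stage.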
10
Target identification
and prioritization
One of the most challenging steps has been identifying and
prioritizing therapeutic targets
11
[Figure: inputs to target identification: genetic associations, pathway-level analyses, text mining, genetic screens, animal models]
One of the first steps in drug discovery is identifying which mRNAs, proteins, or pathways may be good targets for treating a disease. All data for doing so are incomplete, biased, and expensive to generate.
Integrating signals across different datasets and data types paints a more complete and less biased picture. Open datasets hold the promise of extracting those signals in days instead of months.
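One simple way to integrate signals across evidence sources is rank aggregation: convert each source's scores to ranks so that different scales become comparable, then average. The sketch below is a minimal illustration and not the method used in the talk. The gene names and scores are made up, and treating a gene that is absent from a source as having zero support is an assumption.

```python
def rank_scores(scores):
    """Convert {gene: score} into {gene: rank fraction in (0, 1]}; 1.0 is the top score.

    Ties are not handled specially; tied genes keep their input order.
    """
    ordered = sorted(scores, key=scores.get)  # ascending, so the best gene is last
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}


def integrate(evidence):
    """Average rank fractions across sources.

    Assumption: a gene missing from a source contributes 0.0 for that source.
    """
    ranked = [rank_scores(scores) for scores in evidence.values()]
    genes = set().union(*ranked)
    combined = {g: sum(r.get(g, 0.0) for r in ranked) / len(ranked) for g in genes}
    # Highest combined score first; gene name breaks ties for a stable order.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores for hypothetical genes; higher means stronger evidence.
evidence = {
    "genetic_associations": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1},
    "genetic_screens": {"GENE_A": 2.5, "GENE_B": 3.0, "GENE_D": 0.5},
    "text_mining": {"GENE_A": 12, "GENE_C": 30, "GENE_D": 2},
}
prioritized = integrate(evidence)

for gene, score in prioritized:
    print(f"{gene}: {score:.3f}")
```

Counting missing evidence as zero favors well-studied genes, which is itself a bias. Averaging only over the sources that measured a gene is a reasonable alternative, with the opposite trade-off.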
12
Example: Multi-omics analysis identifies shared oncogenic driver
pathways and drug responsiveness across ten cancer types
Li et al. (Cell, 2023) demonstrate the power of
integrating cohesive multi-omics data to derive
meaningful, impactful insights
Integrated data from transcriptome, proteome, and
phosphoproteome to identify multi-omics clusters of
pan-cancer aberrations and their pathway
perturbations
Deeper analyses reveal important insights about
therapeutic responses observed in the clinic
#13
Patient segmentation
Resources to identify the patients most likely to respond to a
given therapy are critical, as this can be particularly challenging
14
Data available to help with patient
segmentation is only set to grow over time
Genetic, epigenetic, transcriptomic,
metabolomic, proteomic, EHR, and
wearable data all can identify
clinically salient patient subsets,
but must be utilized effectively for
meaningful insights to be derived
https://www.ahajournals.org/doi/10.1161/CIRCRESAHA.117.310782
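As a rough illustration of how mixed data types might be combined to find patient subsets, the sketch below standardizes each feature and then clusters patients with a plain k-means. All patient values are made up, the three features are hypothetical stand-ins for transcriptomic, proteomic, and wearable measurements, and k-means is just one possible choice of method.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column so differently scaled data types are comparable."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]  # guard against zero variance
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids so runs are repeatable."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels


# Made-up patients: (marker gene expression, plasma protein level, resting heart rate)
patients = [
    (2.1, 10.0, 62), (8.9, 41.0, 88), (2.4, 12.5, 60),
    (9.3, 39.0, 91), (1.8, 11.0, 65), (8.5, 43.5, 85),
]
labels = kmeans(zscore_columns(patients), k=2)
print(labels)
```

Standardizing first matters because otherwise the feature with the largest numeric range would dominate the distance calculation. Real cohorts would also need handling for missing values and batch effects, which this sketch ignores.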
15
This is so promising - why
doesn’t everyone do it?
16 #
Effective analysis of this type of data requires integration of multiple components
• Large dataset manipulation requires specialized compute infrastructure
• Setup must be designed to prevent data silos and future data loss
• Effective setup, maintenance, and data analysis requires experts in two or more areas
[Figure: Lab, Software Engineer / System Admin, Bioinformatics Expert, Multi-omics Data, Data Storage Infrastructure, Compute Infrastructure, Bioinformatics Tooling, Analysis, Insights]
17
Infrastructure challenges to successful analyses can be multifaceted
Is infrastructure secure? Can it handle patient or protected data?
Are you following FAIR (findable, accessible, interoperable, and reusable) data practices?
Is your data tracked, reliable, and reproducible?
Is all of this robust enough to withstand regulatory scrutiny?
[Figure: Wet Lab, Software Engineer / System Admin, Bioinformatics Consultant, Multi-omics Data, Data Storage Infrastructure, Compute Infrastructure, Bioinformatics Tooling, Analysis]
Doing this incorrectly can turn thousands of hours of effort into irreproducible noise
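One small building block for tracked, reproducible data is a checksum manifest: record a hash of every input file when an analysis runs, and compare against it later to detect silent changes. The sketch below is one possible approach using only the Python standard library. The function names and the `counts.tsv` file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir, manifest):
    """List files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(data_dir)
    return sorted(
        name for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )


# Demonstration on a throwaway directory with a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    counts = Path(tmp) / "counts.tsv"
    counts.write_text("gene\tsample_1\nGENE_A\t42\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)   # nothing has been touched yet
    counts.write_text("gene\tsample_1\nGENE_A\t43\n")
    modified = changed_files(tmp, manifest)    # the edit is detected

print(unchanged, modified)
```

A manifest like this covers only file integrity. Provenance (who produced the data, with which pipeline version) and access control would need separate tooling.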
18
Expertise-related challenges can pose a particular bottleneck
Even after secure, FAIR infrastructure has been built, you need to work with and interpret the data accurately
PhD-level bioinformatics and data science expertise are absolute requirements for correctly performing these bespoke analyses - entire drug programs can hinge on their results
[Figure: Wet Lab, Software Engineer / System Admin, Bioinformatics Consultant, Multi-omics Data, Data Storage Infrastructure, Compute Infrastructure, Bioinformatics Tooling, Analysis]
19
Effective support and resources can
powerfully advance your therapeutic
program at every stage
Productionisation of
therapeutic product
Full data set for
IND application
1st collection of key
success indicators
1st pilot data set
Growth with you from
day 1 to 1000 and beyond
20
Summary & Conclusions
Data is the future of science
Leveraging data effectively can transform your scientific program
Finding and utilizing the right resources to effectively investigate
data can be challenging
Setting things up correctly can be transformative - it is worth the
effort to do it right!
Watershed is always available for input and guidance on data
processing when you need it.
21
Questions?
22