YSM Issue 94.3
Computational Biology
FEATURE
IMAGE COURTESY OF NATIONAL HUMAN GENOME RESEARCH INSTITUTE
The circadian rhythm, also known as the body’s
“biological clock,” is endogenous (originates
from within an organism), but also influenced
by environmental variables, including light,
temperature, and geographical location.
3) decoding the “black box” of ML models
to explain the mechanism of how AI is
used to predict circadian clock function.
In order to effectively analyze the
expression of circadian rhythms, the
researchers chose the small flowering
plant Arabidopsis thaliana as a model
organism. Arabidopsis was the first plant
to have its entire genome sequenced, and
because some of its regulatory elements
were already known, the researchers used
that pre-existing knowledge to validate
their ML predictions. This allowed them
to understand how their ML model was
reaching its predictions, thereby decoding
the mystery of the AI black box.
When there are tens of thousands,
even millions, of data points, how do
we make sense of the data and extract
their patterns and trends? Mohsen
explained that we learn by finding
parameters that capture what patterns
exist—the more sophisticated the data,
the more parameters we need. But using
more parameters necessitates a greater
understanding of what each does.
“There are multiple approaches and even
definitions of what interpretability is,”
he said. Fundamentally, though, “it is
just learning how the prediction process
works or which input features are
corresponding to a specific prediction.”
The Earlham Institute researchers
used MetaCycle—a tool for detecting
circadian signals in transcriptomic
data—to analyze a dataset of Arabidopsis
genomic transcripts. Using this
information, the researchers trained
a series of ML classifiers to predict
if a transcript was circadian or noncircadian.
They found that the AI was
not just using gene expression levels,
but also timepoints for its predictions.
However, these predictions were not
always one hundred percent accurate,
and the researchers thus set out to
ascertain the optimal sampling strategy
and number of timepoints needed.
Circadian gene expression rhythms
follow diverse patterns, but all share a
twenty-four-hour periodicity. Having
fewer timepoints is more efficient, but leads
to concerns over loss of information and
accuracy. The researchers aimed to find the
optimal balance between a low number of
transcriptomic timepoints and improved
accuracy, so they started with a
twelve-timepoint ML model and sequentially
reduced it to three timepoints.
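The trade-off between fewer timepoints and lost information can be sketched with a toy example. The code below is not MetaCycle and not the researchers' classifier; it assumes a simple cosine fit with a fixed twenty-four-hour period, made-up expression values, and evenly spaced sampling every four hours over two days. The function names are hypothetical.

```python
import math

def cosinor_fit(times_h, values, period_h=24.0):
    """Fit mean + a*cos(wt) + b*sin(wt); return (amplitude, r_squared).

    Uses the closed-form least-squares solution, which is only valid when
    timepoints are evenly spaced, cover whole periods, and include at
    least three samples per period.
    """
    n = len(values)
    mean = sum(values) / n
    w = 2.0 * math.pi / period_h
    a = 2.0 / n * sum((y - mean) * math.cos(w * t) for t, y in zip(times_h, values))
    b = 2.0 / n * sum((y - mean) * math.sin(w * t) for t, y in zip(times_h, values))
    fitted = [mean + a * math.cos(w * t) + b * math.sin(w * t) for t in times_h]
    ss_tot = sum((y - mean) ** 2 for y in values)
    ss_res = sum((y - f) ** 2 for y, f in zip(values, fitted))
    r_squared = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return math.hypot(a, b), r_squared

def evenly_spaced_subset(times_h, values, n_keep):
    """Keep n_keep evenly spaced timepoints (assumed sampling strategy)."""
    if len(times_h) % n_keep != 0:
        raise ValueError("n_keep must divide the number of timepoints")
    step = len(times_h) // n_keep
    return times_h[::step], values[::step]

# Made-up data: twelve timepoints, one every four hours.
times = [4.0 * i for i in range(12)]
rhythmic = [5.0 + 2.0 * math.cos(2.0 * math.pi * (t - 6.0) / 24.0) for t in times]
irregular = [5.0, 7.0, 4.0, 6.0, 5.0, 8.0, 4.0, 5.0, 7.0, 5.0, 6.0, 4.0]

# Four timepoints (12-hour spacing) are skipped: the closed form breaks there.
for n_keep in (12, 6, 3):
    for name, series in (("rhythmic", rhythmic), ("irregular", irregular)):
        t_sub, y_sub = evenly_spaced_subset(times, series, n_keep)
        amplitude, r_squared = cosinor_fit(t_sub, y_sub)
        print(n_keep, name, round(amplitude, 3), round(r_squared, 3))
```

With all twelve timepoints, the cosine explains the rhythmic series and almost none of the irregular one. With only three timepoints, a three-parameter curve passes through any three values, so this simple score can no longer tell the two apart. That is the kind of information loss the researchers had to weigh against efficiency.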
The explainability aspect of their
model comes with understanding how
the model was making its predictions.
The researchers needed to see which
k-mers (short sequences of DNA) were
the most influential in shaping the
ML model's predictions, and found that
the most accurate predictions resulted
from a k-mer length of six.
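To make the idea of a k-mer concrete, here is a minimal sketch that counts every overlapping stretch of six letters in a DNA sequence. The article does not describe how the researchers turned sequences into model inputs, so the counting scheme, the function name, and the example sequence are all assumptions for illustration.

```python
from collections import Counter

def count_kmers(sequence, k=6):
    """Count overlapping k-mers in a DNA sequence.

    Windows containing anything other than A, C, G, or T are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

# Made-up ten-letter sequence; it contains five overlapping 6-mers.
example = "ACGTACGTAC"
counts = count_kmers(example)
for kmer, n in counts.most_common():
    print(kmer, n)
```

Counts like these could serve as input features, letting an interpretability method ask which six-letter patterns push a prediction toward circadian or noncircadian.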
Overall, the study showed that it is
possible to reduce the number of
transcriptomic timepoints while still
maintaining accuracy in predicting
circadian rhythmicity. Since creating
datasets takes significant time and
resources, a reduction in sampling could
have important long-term impacts in
increasing efficiency.
The findings of this study have major
implications for the future of biomedical
science and AI: recent studies have
shown that disruption of clock genes
is associated with sleep disorders,
heightened susceptibility to infections,
Alzheimer’s disease, and metabolic
syndrome. “[Machine learning] has
already reshaped a significant part of
how we study the biology of disease,”
Mohsen said. “I very much see AI playing
a larger role in drug development and in
terms of the way we study biology.”
More recently, Mohsen and the Earlham
Institute researchers have shifted to a
new focus: advancing the clarity of how
and why these powerful algorithms
are providing the predictions that they
do. As scientists explore foundational
questions of how human physiology
works, understanding the powerful tools
used in probing those questions is just
as crucial. According to Mohsen, having
unexplainable AI poses “a huge risk
in medicine and elsewhere” due to its
prevalence in everyday life, including face
recognition, surveillance, and biohealth.
In illuminating the “black box” for
ML models that predict circadian
rhythms, research merging transparent
AI and genomics opens possibilities for
understanding the rapidly developing
technology in our hands. Ultimately, this
has implications for precision medicine,
novel drug development, and decoding
the genetic basis of disease in the future. ■
www.yalescientific.org
October 2021 Yale Scientific Magazine 31