YSM Issue 94.3

Computational Biology

FEATURE

IMAGE COURTESY OF NATIONAL HUMAN GENOME RESEARCH INSTITUTE

The circadian rhythm, also known as the body’s

“biological clock,” is endogenous (originates

from within an organism), but also influenced

by environmental variables, including light,

temperature, and geographical location.

3) decoding the “black box” of ML models

to explain the mechanism of how AI is

used to predict circadian clock function.

In order to effectively analyze the

expression of circadian rhythms, the

researchers chose the small flowering

plant Arabidopsis thaliana as a model

organism. Arabidopsis was the first plant

to have its entire genome sequenced, and

because some of its regulatory elements

were already known, the researchers used

that pre-existing knowledge to validate

their ML predictions. This allowed them

to understand how their ML model was

reaching its predictions, thereby decoding

the mystery of the AI black box.

When there are tens of thousands,

even millions, of data points, how do

we understand those data and extract

their patterns and trends? Mohsen

explained that we learn by finding

parameters that capture what patterns

exist—the more sophisticated the data,

the more parameters we need. But using

more parameters necessitates a greater

understanding of what each does.

“There are multiple approaches and even

definitions of what interpretability is,”

he said. Fundamentally, though, “it is

just learning how the prediction process

works or which input features are

corresponding to a specific prediction.”

The Earlham Institute researchers

used MetaCycle—a tool for detecting

circadian signals in transcriptomic

data—to analyze a dataset of Arabidopsis

genomic transcripts. Using this

information, the researchers trained

a series of ML classifiers to predict

if a transcript was circadian or noncircadian.

They found that the AI was

not just using gene expression levels,

but also timepoints for its predictions.

However, these predictions were not

always one-hundred percent accurate,

and the researchers thus set out to

ascertain the optimal sampling strategy

and number of timepoints needed.
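
As a rough illustration of what labelling a transcript as circadian or noncircadian can involve, the sketch below fits a 24-hour cosine to an expression time series and thresholds the goodness of fit. This is not MetaCycle's algorithm and not the researchers' classifier; the function names, the threshold of 0.8, and the example values are all assumptions made for illustration. The closed-form fit is only valid when the timepoints are evenly spaced and span a whole number of 24-hour cycles.

```python
import math

PERIOD_H = 24.0  # circadian period assumed throughout this sketch


def cosinor_fit(times, values, period=PERIOD_H):
    """Fit y = mesor + a*cos(wt) + b*sin(wt); return (amplitude, r_squared).

    Closed-form least squares. Assumes `times` (hours) are evenly spaced
    and cover a whole number of periods with at least 3 points per period.
    """
    n = len(values)
    w = 2.0 * math.pi / period
    mesor = sum(values) / n
    a = 2.0 / n * sum(y * math.cos(w * t) for t, y in zip(times, values))
    b = 2.0 / n * sum(y * math.sin(w * t) for t, y in zip(times, values))
    fitted = [mesor + a * math.cos(w * t) + b * math.sin(w * t) for t in times]
    ss_res = sum((y - f) ** 2 for y, f in zip(values, fitted))
    ss_tot = sum((y - mesor) ** 2 for y in values)
    r_squared = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return math.hypot(a, b), r_squared


def label_transcript(times, values, r2_threshold=0.8):
    """Hypothetical rule: call a transcript circadian if the fit is good."""
    _, r2 = cosinor_fit(times, values)
    return "circadian" if r2 >= r2_threshold else "noncircadian"


# Twelve timepoints, one every 4 hours over two days (made-up values).
times = list(range(0, 48, 4))
rhythmic = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
flat = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9]  # 16-hour wobble

print(label_transcript(times, rhythmic))
print(label_transcript(times, flat))
```

Labels produced by a rule of this kind could then serve as training targets for a classifier, which is the role the MetaCycle output plays in the study as described.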

Circadian gene expression rhythms

follow diverse patterns, but all share a

twenty-four-hour periodicity. Having

fewer timepoints is more efficient, but leads

to concerns over loss of information and

accuracy. The researchers aimed to find the

optimal balance between a low number of

transcriptomic timepoints and improved

accuracy, so they started with a

twelve-timepoint ML model and sequentially

reduced it to three timepoints.
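
The trade-off described above can be explored on synthetic data. The sketch below simulates rhythmic and non-rhythmic profiles, keeps every first, second, third, or fourth timepoint (12, 6, 4, then 3 timepoints), and measures how well a very simple classifier separates the two classes each time. Everything here is an assumption for illustration: the simulated data, the subsampling schedule, and the one-feature threshold classifier (a "decision stump") standing in for the researchers' ML models.

```python
import math
import random


def simulate(times, rhythmic, rng, amplitude=3.0, noise_sd=0.5):
    """Return one synthetic expression profile sampled at `times` (hours)."""
    phase = rng.uniform(0.0, 24.0)          # random peak time
    amp = amplitude if rhythmic else 0.0    # non-rhythmic = noise only
    return [
        10.0 + amp * math.cos(2 * math.pi * (t - phase) / 24.0)
        + rng.gauss(0.0, noise_sd)
        for t in times
    ]


def expression_range(profile):
    """Single feature used by the stump: max minus min expression."""
    return max(profile) - min(profile)


def fit_stump(features, labels):
    """Pick the threshold that maximises training accuracy."""
    best_threshold, best_correct = 0.0, -1
    for threshold in sorted(features):
        correct = sum((f >= threshold) == y for f, y in zip(features, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy_for(times, rng, n_per_class=200):
    """Train on one simulated dataset, report accuracy on a fresh one."""
    def dataset():
        labels = [True] * n_per_class + [False] * n_per_class
        feats = [expression_range(simulate(times, y, rng)) for y in labels]
        return feats, labels

    train_x, train_y = dataset()
    test_x, test_y = dataset()
    threshold = fit_stump(train_x, train_y)
    hits = sum((f >= threshold) == y for f, y in zip(test_x, test_y))
    return hits / len(test_y)


all_times = list(range(0, 48, 4))   # 12 timepoints, every 4 h for 2 days
rng = random.Random(0)              # fixed seed for repeatability
accuracy = {}
for step in (1, 2, 3, 4):           # keep every step-th timepoint
    subset = all_times[::step]      # 12, 6, 4, then 3 timepoints
    accuracy[len(subset)] = accuracy_for(subset, rng)

for n, acc in accuracy.items():
    print(f"{n:2d} timepoints: accuracy {acc:.2f}")
```

One design point worth noticing: the four-timepoint subset samples every 12 hours, so it only ever sees two opposite phases of a 24-hour cycle, and a rhythm that peaks midway between them can look flat. The three-timepoint subset, spaced 16 hours apart, covers three distinct phases. Which timepoints are kept can therefore matter as much as how many.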

The explainability aspect of their

model comes with understanding how

the model was making its predictions.

The researchers needed to see which

k-mers (short sequences of DNA) were

the most influential in shaping the

ML model's predictions, and found that

the most accurate predictions resulted

from a k-mer length of six.
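
For readers unfamiliar with k-mers, the sketch below shows one common way to turn a DNA sequence into k-mer counts with k = 6, the kind of feature table a model could then be trained on. The function name and the example sequence are made up for illustration, and this is not taken from the researchers' code.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def count_kmers(sequence, k=6):
    """Count every overlapping k-mer in `sequence`.

    Windows containing anything other than A, C, G, or T (for example
    the ambiguity code N) are skipped.
    """
    sequence = sequence.upper()
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(w for w in windows if set(w) <= DNA_ALPHABET)


# Hypothetical 10-base sequence: it contains five overlapping 6-mers.
example = "ACGTACGTAC"
counts = count_kmers(example)

for kmer, n in counts.most_common():
    print(kmer, n)
```

With counts like these as input features, asking which k-mers most influence a prediction becomes a question about which columns of the feature table the model relies on.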

Overall, the study showed the

possibility for reducing the number of

transcriptomic timepoints while still

maintaining accuracy in predicting

circadian rhythmicity. Since creating

datasets takes significant time and

resources, a reduction in sampling could

have important long-term impacts in

increasing efficiency.

The findings of this study have major

implications for the future of biomedical

science and AI: recent studies have

shown that disruption of clock genes

is associated with sleep disorders,

heightened susceptibility to infections,

Alzheimer’s disease, and metabolic

syndrome. “[Machine learning] has

already reshaped a significant part of

how we study the biology of disease,”

Mohsen said. “I very much see AI playing

a larger role in drug development and in

terms of the way we study biology.”

More recently, Mohsen and the Earlham

Institute researchers have shifted to a

new focus: advancing the clarity of how

and why these powerful algorithms

are providing the predictions that they

do. As scientists explore foundational

questions of how human physiology

works, understanding the powerful tools

used in probing those questions is just

as crucial. According to Mohsen, having

unexplainable AI poses “a huge risk

in medicine and elsewhere” due to its

prevalence in everyday life, including face

recognition, surveillance, and biohealth.

In illuminating the “black box” for

ML models that predict circadian

rhythms, research merging transparent

AI and genomics opens possibilities for

understanding the rapidly developing

technology in our hands. Ultimately, this

has implications for precision medicine,

novel drug development, and decoding

the genetic basis of disease in the future. ■

www.yalescientific.org

October 2021 Yale Scientific Magazine 31
