YSM Issue 94.1
FEATURE
Virology
LEARNING THE
LANGUAGE OF A VIRUS
BY ANGELICA LORENZO
ART BY ELAINE CHENG
USING MACHINE LEARNING TO PREDICT WHICH VIRAL
MUTATIONS ESCAPE THE HUMAN IMMUNE SYSTEM
Viral escape, the strategy a virus
adopts to evade the human
immune system by mutating
just enough to avoid recognition and
destruction by host antibodies, is one
of the biggest challenges virologists face
while developing effective vaccines. It is
why an HIV vaccine and a universal
influenza vaccine do not yet exist.
Furthermore, it is why current vaccines
approved for emergency use against
SARS-CoV-2 may ultimately prove
ineffective against new strains of the
virus such as the more contagious
B.1.1.7 and P.1 variants.
In an effort to predict which viral
mutations could result in successful escape,
a team of MIT researchers made use of a
machine learning technique originally
intended for natural language processing
to construct computational models of
three different surface
proteins: influenza A hemagglutinin,
HIV-1 envelope glycoprotein, and SARS-
CoV-2 spike glycoprotein.
In a recent article published in Science,
Brian Hie, an electrical engineering and
computer science graduate student at
MIT, along with senior advisors Bryan
Bryson, an MIT assistant professor of
Biological Engineering, and Bonnie
Berger, head of Computation and Biology
at MIT’s Computer Science and AI Lab,
explore how natural language components
such as grammaticality, or syntax, and
semantics, or meaning, can be used to
better understand viral evolution.
So, why a language model? To begin,
techniques for studying viral escape fall
into two main categories: experimental
and computational. One high-throughput
experimental technique
known as a deep mutational scan (DMS)
makes every possible amino
acid change to a protein and then
measures the effect of each mutation by
analyzing some property of that protein,
such as cellular binding or infectivity.
While a DMS is effective in analyzing
mutations at a single amino acid position,
it becomes impractical—and quite
expensive—to analyze the escape
potential of combinatorial mutations.
To put it into perspective, proteins are
polypeptide chains of roughly fifty to
two thousand amino
acid residues, each of which can be one of
twenty unique amino acids. Considering
this complexity, testing every possible
combination of mutations in a laboratory
setting would be unfeasible.
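The combinatorial explosion described above is easy to quantify. The sketch below counts how many distinct variants exist with exactly k substituted sites; the 500-residue protein length is an assumed, illustrative figure, not one from the article.

```python
from math import comb

def mutant_count(length: int, k: int, alphabet: int = 20) -> int:
    """Number of distinct variants with exactly k substituted sites.

    Choose k of the protein's sites, then let each chosen site change
    to any of the other (alphabet - 1) amino acids.
    """
    return comb(length, k) * (alphabet - 1) ** k

# Hypothetical 500-residue protein, chosen only for scale:
print(mutant_count(500, 1))  # 9,500 single mutants: a DMS can cover this
print(mutant_count(500, 2))  # 45,034,750 double mutants: already impractical
```

Even at k = 2 the number of variants outstrips what any laboratory screen can test, which is why computational ranking of candidate mutations becomes attractive.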
Alternatively, machine learning models
can use statistics and algorithms to draw
patterns from large collections of data
without being explicitly told what patterns
to learn. “In natural language, that
corresponds to completing sentences and
modeling grammar and semantic similarity
or semantic change,” Hie said. For viral
escape, semantic change is analogous to
antigenic change, where the virus mutates
its surface proteins, and grammaticality
relates to adhering to biological rules in
order to survive and replicate.
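The two criteria can be combined into a single priority score: a mutation is a plausible escape candidate only if it scores high on both semantic change and grammaticality. The toy sketch below illustrates that intersection with made-up scores and illustrative mutation labels; in the actual method both quantities come from a language model trained on viral protein sequences, and the paper combines them by rank rather than by the simple sum used here.

```python
# Hypothetical (semantic_change, grammaticality) scores per mutation.
# Labels and values are illustrative placeholders, not model output.
candidates = {
    "K417N": (0.90, 0.80),  # large antigenic change AND still viable
    "A222V": (0.10, 0.95),  # viable, but antigenically near-identical
    "G339D": (0.95, 0.05),  # big change, but breaks the protein's "grammar"
}

def escape_priority(scores: dict) -> str:
    """Return the mutation maximizing semantic change + grammaticality."""
    return max(scores, key=lambda m: sum(scores[m]))

print(escape_priority(candidates))  # prints "K417N"
```

Note how neither criterion alone suffices: A222V is perfectly "grammatical" but changes nothing the immune system would notice, while G339D is novel but nonviable. Only the mutation high on both axes is flagged.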
Training the algorithm to model
viral escape rather than human
language involves feeding it sequences
of viral amino acid data instead of
English sentences. While machine
learning language models of proteins
previously existed, none of them looked
at both protein fitness and function
simultaneously and, therefore, could
not predict escape nearly as well as the
MIT model, which captures both fitness
22 Yale Scientific Magazine March 2021 www.yalescientific.org