
FEATURE

Virology

LEARNING THE LANGUAGE OF A VIRUS

BY ANGELICA LORENZO

ART BY ELAINE CHENG

USING MACHINE LEARNING TO PREDICT WHICH VIRAL MUTATIONS ESCAPE THE HUMAN IMMUNE SYSTEM

Viral escape, the strategy a virus adopts to evade the human immune system by mutating just enough to avoid recognition and destruction by host antibodies, is one of the biggest challenges virologists face while developing effective vaccines. It is why a vaccine for HIV and a universal vaccine for influenza do not yet exist. Furthermore, it is why current vaccines approved for emergency use against SARS-CoV-2 may ultimately prove ineffective against new strains of the virus such as the more contagious B.1.1.7 and P.1 variants.

In an effort to predict which viral mutations could result in successful escape, a team of MIT researchers made use of a machine learning technique originally intended for natural language processing to construct computational models of three different viral surface proteins: influenza A hemagglutinin, HIV-1 envelope glycoprotein, and SARS-CoV-2 spike glycoprotein.

In a recent article published in Science, Brian Hie, an electrical engineering and computer science graduate student at MIT, along with senior advisors Bryan Bryson, an MIT assistant professor of Biological Engineering, and Bonnie Berger, head of Computation and Biology at MIT's Computer Science and AI Lab, explore how natural language components such as grammaticality, or syntax, and semantics, or meaning, can be used to better understand viral evolution.

So, why a language model? To begin, techniques for studying viral escape fall into two main categories: experimental and computational. One high-throughput experimental technique known as a deep mutational scan (DMS) makes every possible amino acid change to a protein and then measures the effect of each mutation by analyzing some property of that protein, such as cellular binding or infectivity. While a DMS is effective for analyzing mutations at a single amino acid position, it becomes impractical, and quite expensive, to analyze the escape potential of combinatorial mutations. To put it into perspective, proteins are polypeptide chains of between fifty and two thousand amino acid residues, each of which can be one of twenty unique amino acids. Given this complexity, testing every possible combination of mutations in a laboratory setting would be unfeasible.
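A quick back-of-the-envelope calculation shows why combinatorial scans blow up. The sketch below counts variants with exactly k substituted positions for a protein of a given length; the 500-residue length is a hypothetical mid-sized protein chosen for illustration, not a figure from the article.

```python
# Count variants of a protein with exactly k substituted positions,
# where each of the length positions can mutate to any of the 19
# other amino acids.
from math import comb

def mutant_count(length, k):
    """Number of distinct variants with exactly k substitutions."""
    return comb(length, k) * 19 ** k

length = 500  # hypothetical mid-sized protein, for illustration
singles = mutant_count(length, 1)  # 9,500 -- feasible for a DMS
doubles = mutant_count(length, 2)  # ~45 million -- already impractical
print(singles, doubles)
```

Even at just two simultaneous mutations the count jumps by nearly four orders of magnitude, which is why purely experimental approaches hit a wall.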

Alternatively, machine learning models can use statistics and algorithms to draw patterns from large collections of data without being explicitly told what patterns to learn. "In natural language, that corresponds to completing sentences and modeling grammar and semantic similarity or semantic change," Hie said. For viral escape, semantic change is analogous to antigenic change, where the virus mutates its surface proteins, and grammaticality relates to adhering to biological rules in order to survive and replicate.
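To make the two scores concrete, here is a toy sketch. The actual work uses a deep neural language model trained on large sequence databases; in this stand-in, "grammaticality" is the log-likelihood of a sequence under a simple bigram model, and "semantic change" is the distance between crude composition-vector embeddings. The training corpus and the I-to-W point mutation below are invented for illustration.

```python
# Toy illustration of the two language-model scores for a mutation:
# "grammaticality" as sequence likelihood, "semantic change" as
# embedding distance. A bigram model and residue-composition vectors
# stand in for the deep model and hidden-state embeddings.
from collections import Counter
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the twenty amino acids

def train_bigram(sequences):
    counts, context = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    # Laplace smoothing so unseen bigrams keep nonzero probability
    def prob(a, b):
        return (counts[(a, b)] + 1) / (context[a] + len(ALPHABET))
    return prob

def grammaticality(seq, prob):
    """Log-likelihood of the sequence under the bigram model."""
    return sum(math.log(prob(a, b)) for a, b in zip(seq, seq[1:]))

def embedding(seq):
    """Crude stand-in embedding: normalized residue composition."""
    c = Counter(seq)
    return [c[a] / len(seq) for a in ALPHABET]

def semantic_change(seq1, seq2):
    """Euclidean distance between the two sequence embeddings."""
    e1, e2 = embedding(seq1), embedding(seq2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(e1, e2)))

# Hypothetical corpus of related sequences and one point mutation
corpus = ["MKTAYIAKQR", "MKTAYLAKQR", "MKTGYIAKQR"]
prob = train_bigram(corpus)
wild, mutant = "MKTAYIAKQR", "MKTAYWAKQR"
print(grammaticality(mutant, prob) - grammaticality(wild, prob))
print(semantic_change(wild, mutant))
```

A candidate escape mutation, in this framing, is one that scores high on semantic change (it looks antigenically different) while keeping high grammaticality (it remains a viable protein).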

Training the algorithm to model viral escape rather than human language involves feeding it sequences of viral amino acid data instead of English sentences. While machine learning language models of proteins previously existed, none of them looked at both protein fitness and function simultaneously and, therefore, could not predict escape nearly as well as the MIT model, which captures both fitness
22 Yale Scientific Magazine March 2021 www.yalescientific.org
