YSM Issue 94.1
FEATURE
Virology
LEARNING THE
LANGUAGE OF A VIRUS
BY ANGELICA LORENZO
ART BY ELAINE CHENG
USING MACHINE LEARNING TO PREDICT WHICH VIRAL
MUTATIONS ESCAPE THE HUMAN IMMUNE SYSTEM
Viral escape, the strategy a virus
adopts to evade the human
immune system by mutating
just enough to avoid recognition and
destruction by host antibodies, is one
of the biggest challenges virologists face
while developing effective vaccines. It is
why an HIV vaccine and a universal
influenza vaccine do not yet exist.
Furthermore, it is why current vaccines
approved for emergency use against
SARS-CoV-2 may ultimately prove
ineffective against new strains of the
virus such as the more contagious
B.1.1.7 and P.1 variants.
In an effort to predict which viral
mutations could result in successful escape,
a team of MIT researchers made use of a
machine learning technique originally
intended for natural language processing
to construct computational models of
three different surface
proteins: influenza A hemagglutinin,
HIV-1 envelope glycoprotein, and SARS-
CoV-2 spike glycoprotein.
In a recent article published in Science,
Brian Hie, an electrical engineering and
computer science graduate student at
MIT, along with senior advisors Bryan
Bryson, an MIT assistant professor of
Biological Engineering, and Bonnie
Berger, head of Computation and Biology
at MIT’s Computer Science and AI Lab,
explore how natural language components
such as grammaticality, or syntax, and
semantics, or meaning, can be used to
better understand viral evolution.
So, why a language model? To begin,
techniques for studying viral escape fall
into two main categories: experimental
and computational. One high-throughput
experimental technique
known as a deep mutational scan (DMS)
makes every possible amino
acid change to a protein and then
measures the effect of each mutation by
analyzing some property of that protein,
such as cellular binding or infectivity.
While a DMS is effective in analyzing
mutations at a single amino acid position,
it becomes impractical—and quite
expensive—to analyze the escape
potential of combinatorial mutations.
To put it into perspective, proteins are
polypeptide chains of roughly fifty to
two thousand amino
acid residues, each of which can be one of
twenty unique amino acids. Considering
this complexity, testing every possible
combination of mutations in a laboratory
setting would be unfeasible.
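The combinatorial explosion described above is easy to quantify. The sketch below counts how many distinct variants exist with exactly k substituted sites; the 500-residue protein length is an assumed, illustrative figure, not one from the article.

```python
from math import comb

def mutant_count(length: int, k: int, alphabet: int = 20) -> int:
    """Number of distinct variants with exactly k substituted sites.

    Choose k of the protein's sites, then let each chosen site change
    to any of the other (alphabet - 1) amino acids.
    """
    return comb(length, k) * (alphabet - 1) ** k

# Hypothetical 500-residue protein, chosen only for scale:
print(mutant_count(500, 1))  # 9,500 single mutants: a DMS can cover this
print(mutant_count(500, 2))  # 45,034,750 double mutants: already impractical
```

Even at k = 2 the number of variants outstrips what any laboratory screen can test, which is why computational ranking of candidate mutations becomes attractive.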
Alternatively, machine learning models
can use statistics and algorithms to draw
patterns from large collections of data
without being explicitly told what patterns
to learn. “In natural language, that
corresponds to completing sentences and
modeling grammar and semantic similarity
or semantic change,” Hie said. For viral
escape, semantic change is analogous to
antigenic change, where the virus mutates
its surface proteins, and grammaticality
relates to adhering to biological rules in
order to survive and replicate.
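The two criteria can be combined into a single priority score: a mutation is a plausible escape candidate only if it scores high on both semantic change and grammaticality. The toy sketch below illustrates that intersection with made-up scores and illustrative mutation labels; in the actual method both quantities come from a language model trained on viral protein sequences, and the paper combines them by rank rather than by the simple sum used here.

```python
# Hypothetical (semantic_change, grammaticality) scores per mutation.
# Labels and values are illustrative placeholders, not model output.
candidates = {
    "K417N": (0.90, 0.80),  # large antigenic change AND still viable
    "A222V": (0.10, 0.95),  # viable, but antigenically near-identical
    "G339D": (0.95, 0.05),  # big change, but breaks the protein's "grammar"
}

def escape_priority(scores: dict) -> str:
    """Return the mutation maximizing semantic change + grammaticality."""
    return max(scores, key=lambda m: sum(scores[m]))

print(escape_priority(candidates))  # prints "K417N"
```

Note how neither criterion alone suffices: A222V is perfectly "grammatical" but changes nothing the immune system would notice, while G339D is novel but nonviable. Only the mutation high on both axes is flagged.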
Training the algorithm to model
viral escape rather than human
language involves feeding it sequences
of viral amino acid data instead of
English sentences. While machine
learning language models of proteins
previously existed, none of them looked
at both protein fitness and function
simultaneously and, therefore, could
not predict escape nearly as well as the
MIT model, which captures both fitness
22 Yale Scientific Magazine March 2021 www.yalescientific.org