PhD Transfer Report

Study and use of evolutionary relationships between protein sequences and structures

Benjamin Jefferys - firstname.lastname@example.org
Primary supervisor: Prof. Michael Sternberg - email@example.com
Secondary supervisor: Prof. Marek Sergot - firstname.lastname@example.org

Progress Review Panel
Geoff Baldwin - email@example.com
Michael Stumpf - firstname.lastname@example.org

Abstract

This report comprises two parts. They are linked by the study and use of evolutionary relationships between protein sequences and structures.

The first part is a report on a system for analysing results from a protein structure prediction tool. This tool uses putative evolutionary relationships between protein sequences to predict protein structures. The system captures expert knowledge in a structure called an argumentation framework. This may be used to reason about predicted structures and come to a conclusion about their correctness, based upon simple characteristics such as hydrophobic residue burial, sequence identity and secondary structure. The technique is found to significantly improve the recall of the underlying prediction tool, by up to 10 percentage points. It also aids the human interpretation of predicted structures by reducing their complex characteristics to simple statements of fact.

The second part details an ongoing investigation into protein evolution using simplified models. By simplifying protein structure, folding and evolution to a level which makes computational simulation tractable, a “parallel protein universe” may be directly studied to reveal details of protein development which are currently hidden in extant proteins obfuscated by billions of years of evolutionary exploration. The simplified model may be used to quickly test hypotheses, and results from the model may be compared directly with the real protein universe to ensure general applicability of the conclusions drawn. The protein structure model has been developed beyond that used in published literature, with the aim of expressing secondary structure, which is hypothesised to be a significant influence upon the evolutionary landscape. The evolutionary model has been designed to work ab initio: that is, evolve viable protein structures from random polymers, thus simulating the early protein universe. This enables direct study of the development of nucleation sites and other hierarchical elements of protein structure, which may be closely linked to the evolutionary pathway. It also enables direct study of the effect of in vivo factors such as slow ribosomal synthesis and tolerance to misfolding on the development of protein structures. An unoptimised model has developed a stable 17-mer from random trimers, and all indications are that longer runs and optimised parameters will allow evolution of larger, more biologically interesting polymers. The results from these simulations will feed into homology-based protein structure prediction, protein design, synthetic biology and bioengineering, and theories about early evolution and the origin of life.
Acknowledgements

Thanks to Mike Sternberg, Lawrence Kelley and Alex Herbert for proofreading, discussion and suggestions.
Contents

I Capturing expert knowledge with argumentation: a case study in bioinformatics

1 Introduction
  1.1 Argumentation
  1.2 Biological search engines
  1.3 Applying argumentation to a search
2 Methods
  2.1 Overview
  2.2 Step 1 - Framework design
  2.3 Steps 2 to 4 - Framework application
  2.4 Benchmarking
  2.5 Optimisation
  2.6 Statistical analysis
  2.7 Other methods
3 Results
  3.1 Benchmarking
    3.1.1 Optimisation benchmarking
4 Discussion
  4.1 Comparison to other methods
  4.2 Conclusion

II Effect of in vivo factors on protein evolution

5 Proteins and simplified models
  5.1 Principles of protein folding and evolution
    5.1.1 Protein folding
    5.1.2 In vivo folding
    5.1.3 Evolution
  5.2 Simplified models
  5.3 Simplified structure
    5.3.1 Lattice polymers
    5.3.2 Off-lattice polymers
    5.3.3 Sidechains
  5.4 Simplified folding
    5.4.1 Move set
    5.4.2 Energy function
    5.4.3 Metropolis algorithm
    5.4.4 Ending the simulation - the native state
  5.5 Simplified evolution
    5.5.1 Selection
    5.5.2 Variation
    5.5.3 Evolutionary protocol
  5.6 In vivo effects
  5.7 Research using simplified models
    5.7.1 Lattice polymers
    5.7.2 Off-lattice polymers
    5.7.3 Evolution
    5.7.4 In vivo folding
    5.7.5 Commentary
6 Methods
  6.1 Aims and objectives
    6.1.1 Ab initio polymer evolution
    6.1.2 Off-lattice models
    6.1.3 More accurate polymer models
    6.1.4 Effect of in vivo environment
    6.1.5 Effect of parameters
  6.2 Software architecture
    6.2.1 Polymer model and folding
    6.2.2 Evolution
    6.2.3 Parallelisation
    6.2.4 Visualisation
7 Results
  7.1 Initial model development
  7.2 Extensive off-lattice evolution
  7.3 Effect of mutation
8 Discussion and research plan
  8.1 Discussion
  8.2 Research plan
    8.2.1 (9) Implement more biologically accurate mutation scheme
    8.2.2 (11) Develop a polymer model which exhibits secondary structure elements
    8.2.3 (1) Continue exploration of parameter space
    8.2.4 (10) Increase evolved polymer size with parameters
    8.2.5 (6) Increase evolved polymer size with secondary structure model
    8.2.6 (8) Publish paper about developed model
    8.2.7 (3, 4, 5) Use ideal model for evolutionary experimentation
    8.2.8 (2) Publish paper with results of experimentation
    8.2.9 (7) Thesis

Bibliography
Part I

Capturing expert knowledge with argumentation: a case study in bioinformatics
A paper detailing this work has recently been published in the OUP Bioinformatics journal [Jefferys et al., 2006]. The text in this part is based on that publication. Text largely written by the other authors has been removed, and the distinction between work performed as part of the PhD and work performed outside the PhD has been made clear.

The PhD work significantly extended original work undertaken for a Masters project in Summer 2004 in many ways, including:

Optimisation: Application of the established “random walk” algorithm to optimise the argumentation framework and parameters using a benchmark dataset as training data. This is a novel development in the field of argumentation frameworks, and may be the subject of a further publication in artificial intelligence literature.

Comparison to other techniques: The technique was compared to Bayesian networks and decision trees. A quantitative comparison to decision trees was performed, with the development of an algorithm for deriving a decision tree from an argumentation framework.

Statistical significance: The statistical significance of the results obtained was evaluated using the Sign test. All improvements were found to be highly significant.

Implementation and web deployment: A web server was developed for presenting argumentation of 3D-PSSM search results on the web. The results from the argumentation are presented in a simple, intuitive manner with a “traffic light” colour coding. This is linked to from the 3D-PSSM server. Extensive documentation and a worked example were produced to support the server. Scripts were developed to allow production of high quality argumentation diagrams and sentences summarising an argumentation framework.

Published paper: A paper detailing the work was submitted to and accepted by OUP Bioinformatics, an internationally recognised journal covering developments in genome bioinformatics and computational biology. It has an impact factor of 5.7.
Chapter 1

Introduction

Bioinformatic search tools, such as BLAST [Altschul et al., 1990], PFAM and specialised protein structure prediction servers, often give large amounts of complex output. This is not usually read uncritically by researchers: it must be interpreted by them, based upon their own expert knowledge and prior expectations, in order to come to an informed and valid conclusion.

There are many problems with this process. Often bioinformatic tools output more information than can be easily and efficiently interpreted by a researcher, and the output may be confusingly presented. Different researchers interpret data in different ways, and even the same researcher may make inconsistent interpretations, adding an unreliable and non-uniform element to data processing. Once the data has been interpreted, the reasoning behind an interpretation may be lost or only vaguely recalled. When a researcher leaves a research group, the method used to interpret data goes with them and is lost. Finally, the researcher’s interpretation may be biased towards getting a preconceived result.

This work develops a method for using argumentation theory [Dung, 1995] [Modgil and Fox, 2004] to solve this problem. The expert knowledge used by a researcher to interpret output is captured by an argumentation framework, a structure which encapsulates a simple debate between competing hypotheses. This representation can be used to automate the interpretative process, making it faster, more reliable, repeatable and unbiased. The results of the analysis may be presented in a variety of intuitive formats which set out the reasoning process. This can be sufficient in itself to help the user come to an informed and reliable conclusion. In many cases it is also possible to automate the conclusion itself, picking out good results and rejecting bad ones in a data set.

The technique is applied to 3D-PSSM [Kelley et al., 2000], a tool for predicting the structure of a protein based on its sequence.
The aim is to improve the tool’s recall (or sensitivity) - a measure of its ability to spot positive matches in a database.

1.1 Argumentation

Argumentation theory has its origins in legal discourse [Toulmin, 1958], where careful, structured debate based upon a massive body of legal literature is necessary, usually requiring an expert’s time and experience. It helps to standardise and record legal literature in order to enable automated searching and other tasks. In recent years, formal theories of argumentation have received much attention and have also been applied in many areas outside legal discourse, notably in medical decision making [Upshur and Colak, 2003] [Fox et al., 2001] and risk assessment [Krause et al., 1998]. In medical argumentation, for example, the conclusion may be the diagnosis of a condition or the prescription of a particular drug; the argumentation framework describes in an explicit and compact manner the various factors that have to be considered.

The argumentation model used in this work was devised by Dung [Dung, 1995], and is the starting point for many of the recent developments in this field. In Dung’s model, an argumentation structure consists of claims and attacks. Dung’s model does not include a method for drawing a simple, binary conclusion from an argumentation framework. A method for doing this has been added to allow broad categorisation of results.

1.2 Biological search engines

The method has general applicability to all biological search engines (footnote 1), the most popular being BLAST. The task here is to determine which matches are relevant or good (“positive”), and which are irrelevant or bad (“negative”). Traditionally this has been done by assigning an E-value to each match, with a cut-off applied to limit the number of results displayed. An E-value estimates the number of matches of similar quality expected to occur by chance with a given database: lower E-values indicate more statistically relevant results. Usually an additional arbitrary cut-off will be chosen by the researcher, above which results are dismissed as negative. In BLAST, the E-value cut-offs are used to distinguish homologous database entries from randomly matched, non-homologous entries. In order to assess a search result, however, a researcher will normally look beyond the E-value alone. They will take into account other features of the match (such as the complexity of the matched sequences or the level of sequence identity between the two), and perhaps also the output from other complementary bioinformatic tools, depending on application.

Footnote 1: It may also be applied to non-biological search engines such as Google.

3D-PSSM [Kelley et al., 2000] is used as a specific search engine to test the effectiveness of the technique. 3D-PSSM is an established protein fold recognition server which characterises the structure of an unannotated protein sequence based upon sequence homology and likely secondary structure. It uses the same E-value cut-off technique as BLAST - ordering the matches according to their E-value, displaying the top 20, and highlighting likely candidates based on a cut-off derived experimentally to produce around 95% precision. A variety of other measures are also reported for each match, such as sequence identity, hydrophobic regions, secondary structure features, and so on. It was developed by researchers in the Structural Bioinformatics Group, therefore its source code and details of its output were available for analysis.
1.3 Applying argumentation to a search

Argumentation begins by identifying a subset of claims which apply to each search result. Claims are derived from expert knowledge and are often a matter of opinion. An example of a claim might be that the result is a good indicator of protein structure because the E-value is very low. Each claim either supports or rejects the hypothesis that the match is positive - in the case of 3D-PSSM, that the match is a good predictor of protein structure. So, for example, a long match would usually be good, and a short match would be bad. Additionally, each claim may attack others. If a match is long (claim A), but the identity between the matched sequence and the query sequence is low (claim B), then claim B might attack claim A. If these two claims are the only ones present, then B would “win the argument” and the match would be rejected as negative (a bad predictor of protein structure). If a third claim C attacks B, then B’s attack on A is defeated, and now A together with C “win the argument” and the conclusion is that the match is positive.

With many claims and many attacks, each claim must be checked to see if it is defeated by other claims. In Dung’s model, sets of undefeated claims are called preferred extensions of the argumentation framework. There are several algorithms for finding preferred extensions: this work uses the one presented in [Cayrol et al., 2003]. The claims in this argument may then be grouped into an overall conclusion using a variety of techniques [Krause et al., 1995].

The argument constructed for each search result may serve two purposes. First, it summarises each result to help the user interpret and reason about its significance, based upon expert knowledge. The argument may be presented graphically, as illustrated in Figure 2.3, or textually [Reed and Grasso, 2001] [Modgil and Fox, 2004]. Second, it can be interpreted as a filter on the result, identifying good and bad matches. As shown below, filtering achieves a significant improvement in predictive accuracy of the 3D-PSSM tool over basic E-value cut-off methods.
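The A, B, C example above can be traced with a few lines of code. The following is a minimal sketch only: it does not reproduce the algorithm of [Cayrol et al., 2003] used in this work, but substitutes a simple fixed-point computation (Dung's grounded semantics), which gives the same single set of undefeated claims whenever the attacks contain no cycles. The function name `undefeated_claims` and the use of plain strings for claims are assumptions made for illustration.

```python
def undefeated_claims(claims, attacks):
    """Return the claims that survive all attacks (grounded semantics).

    claims  -- iterable of claim identifiers (hypothetical: plain strings)
    attacks -- iterable of (attacker, victim) pairs
    Claims caught in an attack cycle are neither accepted nor defeated,
    so they are left out of the result.
    """
    claims = set(claims)
    # Ignore attacks that involve claims outside this framework.
    attacks = {(a, v) for a, v in attacks if a in claims and v in claims}
    accepted, defeated = set(), set()
    changed = True
    while changed:
        changed = False
        for claim in claims - accepted - defeated:
            attackers = {a for a, v in attacks if v == claim}
            if attackers <= defeated:
                # Unattacked, or every attacker has itself been defeated.
                accepted.add(claim)
                changed = True
            elif attackers & accepted:
                # Attacked by a claim that is known to be undefeated.
                defeated.add(claim)
                changed = True
    return accepted


# Two claims only: B attacks A, so B wins the argument.
two = undefeated_claims({"A", "B"}, {("B", "A")})
# A third claim C attacks B, which reinstates A.
three = undefeated_claims({"A", "B", "C"}, {("B", "A"), ("C", "B")})
```

With only A and B present the result contains B alone; adding C leaves A and C undefeated, matching the reasoning in the text.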
Chapter 2

Methods

2.1 Overview

The process of applying argumentation to search results is shown in Figure 2.1. The method illustrated in this figure was developed in conjunction with Marek Sergot and Lawrence Kelley during a Masters project in Spring 2004.

1. A model argumentation framework must be designed. This has three components.

   (a) All the factors judged to be relevant when evaluating a result from the search engine are expressed in terms of claims. Each claim either supports or rejects the conclusion that the result is positive.

   (b) For each claim, conditions are specified for determining whether it applies to a given search result. These conditions are called the claim criteria. Claim criteria are boolean: a claim either applies to a search result, or it does not. For each result from the search engine, therefore, the set of claims that are applicable to it may be determined.

   (c) The third component is a set of attacks between claims. A claim may attack another if it is contradictory, or if it undermines the basis of the other claim. The formulation of suitable attacks is a key part of representing the expert knowledge. All attacks have equal strength.

2. The model framework is applied to a set of search results, as follows. Each search result is analysed to determine which claim criteria it satisfies and is thereby reduced to a set of applicable claims. Attacks between the applicable claims are taken from the model framework constructed at step 1. This produces a specific argumentation framework for reasoning about a single search result.

3. For each search result, it is determined which of the applicable claims are undefeated by attacks from other claims [Dung, 1995] in the specific argumentation framework for that search result. There are several methods for doing this. The algorithm in [Cayrol et al., 2003] is used for computing the preferred extensions of an argumentation framework. The process is illustrated informally for a simple example in Figure 2.1.
Figure 2.1: Argumentation flow chart. Steps 2 to 4 are repeated for each match in the search results, until all have been classified as good, bad or undecided. Panels: (1) the model argumentation framework, the same for each match, listing all possible claims (A to F) and all possible attacks; (2) the claims applicable to Match 1 (A, D, E and F) and the attacks between them (A attacks D; E attacks D and F; F attacks A); (3) determination of the set of undefeated claims for Match 1 (A and E); (4) the decision for Match 1: good if all undefeated claims support the match, bad if all oppose it, otherwise the match may be good or bad.
4. The set of undefeated claims can be further reduced to a simple good, bad or undecided verdict. If the undefeated claims all support the result, it is accepted as a good result. If all oppose the search result, then the result is rejected as bad. If some undefeated claims support and some oppose the result, then the verdict is undecided. For some purposes, the undefeated claims, together with an explanation of how they defend themselves against attacks, may already provide a useful analysis of the significance of a search result, even in the case where the verdict is undecided. Good and bad verdicts may be regarded as a solid prediction, in which case we can quantitatively compare the predictive accuracy of this method with others.

There is an optional optimisation stage which is described separately later.

Figure 2.2: An example model argumentation framework for 3D-PSSM. The boxes represent individual claims, colour coded to indicate whether they support (green) or oppose (red) a match. Arrows indicate attacks, for example ‘Low 3D-PSSM E-value’ (good) attacks ‘Low sequence identity’ (bad).

2.2 Step 1 - Framework design

This initial step captures the knowledge (and opinion) an expert applies when interpreting search results.

This results in a model argumentation framework: a general template for constructing argumentations for interpreting search results. The first model argumentation framework for 3D-PSSM was designed by Lawrence Kelley, and had 34 arguments and numerous attacks. A smaller example model argumentation framework is shown in Figure 2.2. The original designed framework cannot be shown in such a manner due to its complexity.

Internally in the implementation, a claim is represented by a simple identifier - a representative sample of these, for the subset of claims shown in Figure 2.2, is given in Table 2.1. Each claim also has an indication as to whether the claim supports or opposes the result, and a short phrase or sentence for presentation to the user. An attack from one claim to another is represented as a pair of claim identifiers, attacker first and victim second.
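The representation just described, together with steps 2 and 4 of the overview, might be sketched as follows. This is an illustrative sketch rather than the code of the actual implementation: the names `Claim`, `MODEL_CLAIMS`, `MODEL_ATTACKS`, `specific_framework` and `verdict` are hypothetical. The identifiers are taken from Table 2.1, and the two attacks are the ones named in the Figure 2.2 caption and in the “Short template” example given in this section. Treating an empty set of undefeated claims as undecided is an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    identifier: str   # symbolic identifier, e.g. "SeqIDL"
    supports: bool    # True if the claim supports the match
    text: str         # short phrase for presentation to the user


# A small subset of the model framework (hypothetical container names).
MODEL_CLAIMS = {
    "3DPL": Claim("3DPL", True, "Low 3D-PSSM E-value"),
    "SeqIDL": Claim("SeqIDL", False, "Low sequence identity"),
    "SeqIDVH": Claim("SeqIDVH", True, "Very high sequence identity"),
    "ST": Claim("ST", False, "Short template"),
}

# Each attack is a pair of identifiers: attacker first, victim second.
MODEL_ATTACKS = {("3DPL", "SeqIDL"), ("SeqIDVH", "ST")}


def specific_framework(applicable):
    """Step 2: keep only applicable claims and the attacks between them."""
    claims = {cid: MODEL_CLAIMS[cid] for cid in applicable}
    attacks = {(a, v) for a, v in MODEL_ATTACKS if a in claims and v in claims}
    return claims, attacks


def verdict(undefeated):
    """Step 4: reduce a collection of undefeated Claim objects to a verdict."""
    stances = {claim.supports for claim in undefeated}
    if stances == {True}:
        return "good"
    if stances == {False}:
        return "bad"
    return "undecided"  # mixed stances, or (by assumption) no claims at all


claims, attacks = specific_framework({"3DPL", "SeqIDL", "ST"})
mixed = verdict([claims["3DPL"], claims["ST"]])
```

In this example the attack on “Short template” is dropped because its attacker does not apply to the match, so “Short template” and “Low 3D-PSSM E-value” both remain undefeated and the verdict is undecided.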
Table 2.1: Explanation of claims in the model argumentation framework of Figure 2.2. Symbolic identifiers for each claim are given in brackets. These are examples of the kinds of claims which are useful in reasoning about 3D-PSSM results, and of course they will be different for other tools.

Many template homologues (THmany) - Supports match
    The initial PSI-BLAST search performed to create the template profile revealed more than 50 homologues, making the template profile more reliable

Very high sequence identity (SeqIDVH) - Supports match
    Sequence identity is more than 80%, therefore the match sequence is very similar to the query sequence

Low occurrence of fold in top 20 (FreqT20L) - Opposes match
    The SCOP fold type of the match occurs less than five times in the top 20 matches for this search, making it an unlikely outlying candidate

Few query homologues (QHfew) - Opposes match
    The initial PSI-BLAST search performed to create a query profile revealed less than 10 homologues, making the query profile less reliable

Bad core residues (CRbad) - Opposes match
    When the query sequence is fitted to the matched protein structure, more than 10% of amino acids in the core do not commonly occur in protein cores

Short template (ST) - Opposes match
    The match template sequence is less than 30 amino acids long, making spurious matches more likely

Low 3D-PSSM E-value (3DPL) - Supports match
    The E-value calculated for the match by 3D-PSSM gives a confidence of more than 95% that it is a good match

Low sequence identity (SeqIDL) - Opposes match
    Sequence identity is less than 10%, therefore the match sequence is very different to the query sequence

High SAWTED/Biotext score (SBH) - Supports match
    Keywords in the annotation of homologues of the query sequence and match template sequence are very similar, making it more likely the match is good

The applicability criteria for each claim specify when it applies to a given search result.
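As an illustration of how such criteria could be written down, the sketch below encodes six of the claims in Table 2.1 as boolean tests, using the thresholds quoted in the table. The dictionary keys used to describe a match (`sequence_identity`, `template_length`, `query_homologues`, `template_homologues`, `fold`) are hypothetical and do not reflect the real 3D-PSSM output format. Claims that need data not described in the text (CRbad, 3DPL, SBH) are left out.

```python
# Each criterion takes one match and the full result list, and returns a bool.
# Field names are assumptions made for this sketch only.
CLAIM_CRITERIA = {
    "THmany": lambda m, results: m["template_homologues"] > 50,
    "SeqIDVH": lambda m, results: m["sequence_identity"] > 80.0,
    "SeqIDL": lambda m, results: m["sequence_identity"] < 10.0,
    "QHfew": lambda m, results: m["query_homologues"] < 10,
    "ST": lambda m, results: m["template_length"] < 30,
    # Depends on the whole result set, not just on the match itself.
    "FreqT20L": lambda m, results: sum(
        r["fold"] == m["fold"] for r in results[:20]
    ) < 5,
}


def applicable_claims(match, results):
    """Return the identifiers of all claims whose criteria the match meets."""
    return {cid for cid, test in CLAIM_CRITERIA.items() if test(match, results)}


example_match = {
    "sequence_identity": 85.0,   # percent
    "template_length": 25,       # amino acids
    "query_homologues": 40,
    "template_homologues": 60,
    "fold": "a.1",
}
example_claims = applicable_claims(example_match, [example_match])
```

Keeping the thresholds in one table of this kind makes it straightforward to adjust a single value, such as the template length limit, without touching the rest of the framework.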
Some of these are simple threshold values (for example, SeqIDVH), while others are based upon properties of the entire search result set (for example, FreqT20L). The nature of the claims, and the specific values and thresholds are initially chosen by the designer to reflect his or her own reasoning and preferences. The values, and the general structure of the argumentation framework, may be altered later to improve the predictions made by the system, to reflect better the preferences of a different user, or to customise the system to a particular problem. For example, a researcher working specifically with small proteins might want to lower the threshold for “Short template” to something more appropriate to their requirements. The normal case, however, is that the same model argumentation framework and claim criteria are used for all searches, and they are designed to be generally applicable.

Determining which claims attack which others is a decision for the designer and users of a particular application, in this case 3D-PSSM. A claim can attack another if it undermines the basis of that claim. For example, “Very high sequence identity indicates good match” attacks “Short template indicates bad match”, because a short template may still be significant if its sequence is very similar to the query sequence. Some attacks are even looser, signifying a general superiority of one claim
over another. The specification of attacks will be adjusted and refined after the implementation of the system, in order to improve results being obtained.

Figure 2.3: An example argumentation framework for a particular result from 3D-PSSM. Six of the claims from the model framework have been found to apply to the result. Claims which do not apply to the result are blurred: for example, this result does not have a low sequence identity. Claims with a cross through them are defeated - they are attacked by claims which are not themselves defeated. Four undefeated claims remain, about fold occurrence, number of query homologues, core residues and template length. They are all claims which oppose the result (hence coloured red). Since all undefeated claims agree the result is bad, the result is concluded to be negative - that is, the found protein structure is not a good model for the query protein sequence.

2.3 Steps 2 to 4 - Framework application

Steps 2 to 4 apply the expert knowledge to the 3D-PSSM results. Step 2 creates a specific argumentation framework, based upon the model from Step 1, for each search result. This framework consists of claims which apply to that result, and any attacks between them. Step 3 determines the set of undefeated claims for each search result, using the algorithm in [Cayrol et al., 2003], as illustrated in Figure 2.1. Figure 2.3 shows an example framework for a specific 3D-PSSM search result, and informally how the undefeated claims are obtained.

The specific argumentation framework and/or undefeated claims may already be useful as an evaluation of each protein search result. They may be presented graphically, textually, or otherwise. Additionally, the undefeated claims may be used to come to a final conclusion, as a means of highlighting results. This is commonly done with other search engines using an E-value or other quality score cut-off. The argumentation system may be undecided - unable to come to a conclusion. In this case
In this casethe user may “fall back” to the E-value cut-off filter, or study the argumentation indetail in order to come to a conclusion.The implementation a single program which takes as input a complete argumentationframework (e.g. Figure 2.2), a set of criteria (e.g. Table 2.1), and a set of searchresults, and outputs an argumentation framework, undefeated arguments, and a finalconclusion for each search result. The outputs may be presented to the user in any
sensible manner.

Figure 2.4: Screenshots from the argumentation web server. Left: all the results are grouped into good (green), undecided (yellow) and bad (red) matches. Clicking on the match name instantly shows details below this display. Right: These are the details that are shown. The precise details displayed depend upon the argumentation. If a match is good, then only the claims which support the match are shown. If a match is bad, only those which oppose the match are shown. This is an undecided match: all claims are shown so the user can make up their own mind.

A web-based application was developed for 3D-PSSM (footnote 1). Here the implementation is a single web page script which performs the argumentation and outputs web pages to the researcher’s web browser. The script and supporting libraries are written in Python, a language similar to Perl and C++. The system was originally prototyped in Prolog, which lends itself to direct and compact application of logical algorithms. However it is easy to re-implement in any other language, in order to make deployment on a web server simpler. The argumentation server has been used hundreds of times since it was made live in January 2005. Example pages from the server are given in Figure 2.4.

Footnote 1: 3D-PSSM Argumentation Server: http://www.sbg.bio.ic.ac.uk/~brj03/argumentation/paper/

2.4 Benchmarking

Benchmarking was performed to assess the improvement in recall of the conclusions drawn from an argumentation framework, over predictions based upon E-value alone. Benchmarking of the basic expert-designed framework was performed in the original work this thesis is based upon. Further benchmarking was performed on the optimisation method described in Section 2.5, and that technique makes use of the benchmark dataset, therefore the benchmarking approach is described here.

The recall that could be achieved when the argumentation framework is used to make a prediction about each search result was determined. This was compared to the current naive method of calculating an E-value and predicting that everything below a cut-off point is positive, and everything above is negative. This method is treated as the benchmark. The standard measures for recall and precision were used, as well as a simpler measure of accuracy:

\[ \text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn} = \frac{\text{true predictions}}{\text{total predictions}} \]

\[ \text{precision} = \frac{tp}{tp + fp} \]

\[ \text{recall} = \frac{tp}{tp + fn} \]

Where:

tp = number of true positives (positive results predicted to be positive)
fp = number of false positives (negative results predicted to be positive)
tn = number of true negatives (negative results predicted to be negative)
fn = number of false negatives (positive results predicted to be negative)

Precision is the proportion of results predicted to be positive which actually are positive. Recall is the proportion of positive results which are predicted to be positive.

First, a model argumentation framework for the target search engine must be produced, either by an expert designer or by an automated system. The database 3D-PSSM searches has many good matches for most possible queries, as it is regularly updated. Due to this, precision and recall become less meaningful measures as all matches are likely to be positive. Therefore, it was necessary to cut down the database so that it gives a mix of positive and negative results, effectively testing the design. The database was reduced so that it contained only sequences from the 30% identity subset of the ASTRAL SCOP [Brenner et al., 2000] database.

In order to measure the predictive accuracy of argumentation, a set of queries for 3D-PSSM was devised for which the SCOP superfamilies are known [Murzin et al., 1995]. If two proteins have the same superfamily, then they are homologous. Therefore if a protein is predicted to be homologous to another, and they share the same superfamily, then the prediction is correct. Query sequences were rejected where there were fewer than five potential matches in the database, to ensure that it is possible to find matching results and that each search is informative. The queries were also picked so that matches were not trivial to find, in order to test the search algorithm. Any query
Any query sequence with more than 30% identity with any of the search engine database templates was rejected. Therefore every query occupied the “twilight zone” of homologues which are hard to find.

The queries were submitted to 3D-PSSM. For each result returned, predictions were made using the E-value cut-off filter and using the designed argumentation framework. Predictive accuracy, precision and recall for each method may be measured directly as described above. The accuracy of each technique may be compared directly. However, this measure understates improvements in recall alone, since there are likely to be fewer positive matches than negative ones. 3D-PSSM always returns 20 matches, of which as few as five may be positive with the search set defined above. Therefore it would be interesting to study the improvement in recall alone.

In order to compare the recall of an E-value cut-off prediction to argumentation, the E-value threshold must be adjusted to make the precision of each technique equal.
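As an illustration only, the measures defined above and the adjustment of the E-value threshold to a target precision might be sketched as follows. The names `confusion`, `match_precision` and the toy `results` list are hypothetical and not taken from the 3D-PSSM software. The sketch assumes that precision falls roughly monotonically as the cut-off is relaxed, which is what makes a binary search applicable; where this does not hold, the search returns a nearby threshold rather than a guaranteed closest one. Treating precision as 1.0 when nothing is predicted positive is also an assumed convention.

```python
from typing import List, Tuple

# One search result: (E-value, True if the match is a genuine homologue).
Result = Tuple[float, bool]


def confusion(results: List[Result], cutoff: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when every E-value <= cutoff is predicted positive."""
    tp = sum(1 for e, pos in results if e <= cutoff and pos)
    fp = sum(1 for e, pos in results if e <= cutoff and not pos)
    tn = sum(1 for e, pos in results if e > cutoff and not pos)
    fn = sum(1 for e, pos in results if e > cutoff and pos)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int, tn: int, fn: int) -> float:
    # Assumed convention: no positive predictions counts as perfect precision.
    return tp / (tp + fp) if tp + fp else 1.0


def recall(tp: int, fp: int, tn: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def match_precision(results: List[Result], target: float) -> float:
    """Binary search over observed E-values for a cut-off whose precision
    is close to `target` (assumes precision roughly decreases with cut-off)."""
    cutoffs = sorted({e for e, _ in results})

    def prec(i: int) -> float:
        return precision(*confusion(results, cutoffs[i]))

    lo, hi = 0, len(cutoffs) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prec(mid) > target:
            lo = mid + 1   # still too strict: relax the cut-off
        else:
            hi = mid       # at or below target: try stricter cut-offs
    best = lo
    # The neighbouring stricter cut-off may lie closer to the target.
    if lo > 0 and abs(prec(lo - 1) - target) < abs(prec(lo) - target):
        best = lo - 1
    return cutoffs[best]


# Toy data (hypothetical): eight results with known status.
results = [(0.001, True), (0.01, True), (0.05, True), (0.1, False),
           (0.5, True), (1.0, False), (2.0, False), (5.0, True)]
cutoff = match_precision(results, target=0.75)
counts = confusion(results, cutoff)
print(cutoff, precision(*counts), recall(*counts), accuracy(*counts))
```

Once the cut-off has been matched in this way, only the recall of the two methods differs in a meaningful way, so it can be compared directly.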
This is necessary due to a characteristic of argumentation frameworks. The traditional measures of precision and recall locate the performance of a predictive technique at a single point within a two dimensional space. Most techniques have some parameter which may be varied, trading recall against precision, producing a curve (a precision-recall curve) which characterises the predictive abilities of the algorithm. When using an E-value cut-off, the threshold itself may be altered to vary precision and recall in this way. Two techniques may be compared by measuring the difference in area under their respective precision-recall curves. However, argumentation has no such single variable parameter. A particular argumentation system has a fixed precision and recall for a given input data set, and a precision-recall curve cannot be produced. The precision of the E-value cut-off technique needs to be fixed, reducing it to a single point. The recall of the two techniques can then be compared in isolation. Different argumentation frameworks must be compared to different parts of the E-value precision-recall curve, since they will have different precision values.

Given the results from a particular argumentation framework, an E-value threshold must be found which gives a precision as close as possible to that of the framework. This is done with the standard binary search algorithm. Then the recall values may be fairly compared directly.

2.5 Optimisation

The initial expert design may be adjusted in order to maximise the predictive accuracy of the technique. In the original work, this was done in a limited way by hand adjustment, but this is a slow process due to the number of variables.
Alternatively, adjustment may be automated using any optimisation algorithm, which explores the possible frameworks and criteria for establishing which claims apply to a search result. This optimisation stage is a novel feature in the application and development of argumentation frameworks.

The framework was optimised using the established iterative random walk technique to improve the accuracy incrementally and automatically, using the benchmark dataset as training data. In each iteration, a random change is made to either the framework or the claim criteria. The change is either retained or discarded, depending upon the nature of the change and the change’s effect on the total number of correctly classified search results. The pattern of retention is designed to minimise the argumentation framework whilst maximising accuracy, to give the most parsimonious framework possible.

The criteria for claim checking are changed by changing the claim’s parameters, which are usually thresholds - for example, a threshold length above which sequences are considered to be long. The standard “simulated annealing” process is used: parameters are changed randomly within a fixed range in jumps scaled according to a temperature parameter, which reduces each time a change results in an improvement to predictive accuracy.

Automatic optimisation has the disadvantage that the resulting frameworks might not reflect an expert’s reasoning process, and so are less useful in intuitively summarising the search results for the user. However, they perhaps better reflect the reality of the dataset, and abandon possibly faulty reasoning by the researcher who designed the original framework. The random nature of the optimisation algorithm results in many different frameworks. The nature of the search space has not been explored in this work.

2.6 Statistical analysis

The exact Sign Test [Bland, 1989] was used to calculate p-values for the significance of the improvement in predictions of argumentation over the cut-off method. This non-parametric test makes no assumption about the distribution of predictions. For “plus”, cases are counted where argumentation predicts correctly and E-value cut-off predicts wrongly. For “minus”, cases are counted where argumentation is wrong and cut-off is correct. For each cross-validation, all of the predictions made for each of the test subsets are pooled and a single p-value is calculated from the pool. Since the cross-validation was repeated five times, five p-values are calculated, and the mean and maximum reported.

2.7 Other methods

Argumentation as applied to search results is an extra layer on top of the E-value method which can be used to filter results. Many other methods exist which take diverse data sets and make decisions based on them. Decision trees are perhaps the simplest and oldest, based upon classifying each result with a set of simple binary questions. Bayesian networks are a more recent development, which take a probabilistic approach to decision making. These are discussed and compared to the argumentation method in Section 4.1.
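The threshold adjustment of Section 2.5 might be sketched as below. This is a minimal illustration under stated assumptions, not the optimiser used in this work: it covers only the claim thresholds (not changes to the framework itself), it stands in a toy classifier (`n_correct`, which predicts positive when every feature meets its threshold) for the argumentation framework, and it keeps a change only when the number of correct classifications strictly improves. The feature names, bounds, cooling factor and data are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Features = Dict[str, float]
Example = Tuple[Features, bool]   # (feature values, True if genuinely positive)


def n_correct(thresholds: Dict[str, float], data: List[Example]) -> int:
    """Toy stand-in for the framework: predict positive when every
    feature is at or above its threshold; count correct predictions."""
    correct = 0
    for features, is_positive in data:
        predicted = all(features[name] >= t for name, t in thresholds.items())
        correct += predicted == is_positive
    return correct


def optimise(thresholds: Dict[str, float],
             bounds: Dict[str, Tuple[float, float]],
             data: List[Example],
             iterations: int = 2000,
             temperature: float = 1.0,
             cooling: float = 0.9,
             seed: int = 0) -> Tuple[Dict[str, float], int]:
    """Random walk over thresholds with jumps scaled by a temperature
    that is reduced each time a change improves the score."""
    rng = random.Random(seed)
    current = dict(thresholds)
    score = n_correct(current, data)
    for _ in range(iterations):
        name = rng.choice(sorted(current))
        lo, hi = bounds[name]
        # Jump size is proportional to the temperature and the allowed range.
        jump = rng.uniform(-1.0, 1.0) * temperature * (hi - lo)
        candidate = dict(current)
        candidate[name] = min(hi, max(lo, current[name] + jump))
        candidate_score = n_correct(candidate, data)
        if candidate_score > score:      # assumed retention rule
            current, score = candidate, candidate_score
            temperature *= cooling       # cool only on improvement
    return current, score


# Hypothetical training data: match length and percentage sequence identity.
data = [({"length": 150.0, "identity": 28.0}, True),
        ({"length": 60.0, "identity": 12.0}, False),
        ({"length": 140.0, "identity": 9.0}, False),
        ({"length": 45.0, "identity": 25.0}, False),
        ({"length": 110.0, "identity": 22.0}, True)]
start = {"length": 200.0, "identity": 50.0}
bounds = {"length": (0.0, 300.0), "identity": (0.0, 100.0)}
best, score = optimise(start, bounds, data)
print(best, score)
```

Because the walk is random, different seeds may end at different thresholds with the same score, which mirrors the observation above that the optimisation produces many different frameworks.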
Chapter 3
Results

3.1 Benchmarking

This benchmarking is part of work done prior to the PhD. It is presented here as a comparison to the optimisation benchmark below.

The results for the initial expert-designed argumentation framework and criteria are given in Table 3.1. Argumentation gave a precision of 92.8%. The E-value threshold was adjusted so that the cut-off technique gave a closest precision of 92.7%. At this precision level, the increase in recall is 5.2 percentage points. Overall accuracy increases slightly by 1.2 percentage points from 89.0% to 90.2%. The difference between these increases reflects the relatively small number of positive matches, as opposed to negative ones - a ratio of around 1:3. The improvement of argumentation over E-value cut-off was found to be statistically significant, with a probability of 0.0042 that an equal or greater improvement might be achieved by chance.

3.1.1 Optimisation benchmarking

The argumentation and optimisation system underwent 5-fold cross-validation, optimising on 80% of the benchmark dataset and measuring performance on the remaining 20%. Five random replicates of the cross-validation were performed. The results are given in Table 3.1. The mean precision of the optimised argumentation frameworks was 86.4%. The E-value threshold was adjusted in each case so that the cut-off technique gave a closest mean precision of 86.5%. At this precision level, the mean improvement in recall over the E-value cut-off algorithm was 10.2 percentage points, with a standard deviation of 2.4. Therefore, optimisation on average boosted the recall of the expert-defined argumentation framework by an extra 5.0 percentage points. Overall accuracy increases slightly by 2.1 percentage points from 89.2% to 91.3%. As with the unoptimised argumentation, the difference between these increases reflects the relatively small number of positive matches, as opposed to negative ones. The improvement of argumentation over E-value cut-off was found to be statistically significant for each of the optimisation runs, with a probability of at most 0.0030 that an equal or greater improvement might be achieved by chance.

| Metric | Expert designed: E-value cut-off | Expert designed: Argumentation | Design refined by optimisation: E-value cut-off | Design refined by optimisation: Argumentation |
|---|---|---|---|---|
| Accuracy | 89.0% | 90.2% | 89.2% ± 0.2 | 91.3% ± 0.8 |
| Precision | 92.7% | 92.8% | 86.5% ± 1.4 | 86.4% ± 1.4 |
| Recall | 64.0% | 69.2% | 67.1% ± 0.4 | 77.2% ± 2.1 |
| p-value | | 0.0042 | | 0.0014 (max: 0.0030) |

Table 3.1: Comparison of recall between E-value cut-off and Argumentation algorithms, for expert designed and optimised frameworks. The figures after ± are the standard deviation of these percentages for the five random replicates of the optimisation. Both recall and precision are inherently fixed for argumentation frameworks, whereas the E-value cut-off may be adjusted to trade recall against precision. So that the improvement in recall in each case can be isolated, E-value cut-offs are calculated which each give a precision as close as possible to the precision of the corresponding argumentation. See text for method. The statistical significance of the accuracy improvement of argumentation over cut-off in each case is calculated using the one-tailed Sign Test [Bland, 1989], with the p-value reported in the bottom row. All p-values show that the Sign Test hypothesis that the improvement is due to random chance can be rejected, at a 99% confidence level.

Bear in mind that each optimisation run has different input data, and produces a slightly different argumentation framework. As a result, the precision is different in each case. A different E-value threshold is calculated for each run in order to compare recall values. In a sense, the automated optimisation is exploring the multidimensional search space of possible argumentation frameworks. This is similar to adjusting the single E-value threshold to produce a precision-recall curve. However, because there are so many variables in argumentation, their adjustment does not produce a curve: there are many possible recall values for a given precision.
Two argumentation precision levels have been examined (92.8% and 86.4%) and it has been found that argumentation (with and without optimisation) can out-perform the standard technique at both, to varying degrees.

A longer optimisation run was performed based upon the entire training set. The complete framework resulting from this optimisation is given in Figure 2.2. This framework is quite minimal compared to the original, which comprised 34 claims and many attacks. It gives the claims and attacks which are important in distinguishing positive and negative 3D-PSSM results, for the training set.
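For reference, the one-tailed exact Sign Test behind the p-values in Table 3.1 can be sketched as follows, using the “plus” and “minus” counts defined in Section 2.6. The function names and the toy prediction lists are hypothetical; the p-value is the probability, under a fair coin, of observing at least as many “plus” cases among the discordant pairs.

```python
from math import comb
from typing import List, Tuple


def count_signs(truth: List[bool],
                argumentation: List[bool],
                cutoff: List[bool]) -> Tuple[int, int]:
    """Count discordant pairs: 'plus' where only argumentation is correct,
    'minus' where only the E-value cut-off is correct."""
    plus = sum(1 for t, a, c in zip(truth, argumentation, cutoff)
               if a == t and c != t)
    minus = sum(1 for t, a, c in zip(truth, argumentation, cutoff)
                if a != t and c == t)
    return plus, minus


def sign_test_one_tailed(plus: int, minus: int) -> float:
    """P(X >= plus) for X ~ Binomial(plus + minus, 0.5)."""
    n = plus + minus
    if n == 0:
        return 1.0   # no discordant pairs: no evidence either way
    return sum(comb(n, k) for k in range(plus, n + 1)) / 2 ** n


# Toy pooled predictions (hypothetical values).
truth         = [True, True, False, True, False]
argumentation = [True, True, False, True, True]
cutoff        = [False, True, False, False, False]
plus, minus = count_signs(truth, argumentation, cutoff)
print(plus, minus, sign_test_one_tailed(plus, minus))
```

Cases where both methods are right, or both are wrong, carry no information about which method is better and are ignored by the test.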
Chapter 4
Discussion

An argumentation framework for reasoning about the output from 3D-PSSM, a protein structure prediction tool, was designed by the tool’s author. Then the framework was applied to the search results from the 3D-PSSM server. In common with other tools, the output from this server is complex and difficult to study consistently and as a whole. The argumentation framework effectively captured the reasoning used by the author of 3D-PSSM when studying results from his own search engine. This has been made available to users as guidance as to how to use the tool effectively.

The method was benchmarked with respect to its improvement in recall when used to identify positive search results from 3D-PSSM. The initial argumentation framework for 3D-PSSM based upon expert knowledge produces a modest increase in recall of around 5 percentage points with the benchmark dataset. An automated optimisation method was developed, and the improvement achieved when this method was applied to the expert-designed framework was measured. It doubled the increase in recall to around 10 percentage points, albeit at a different precision level. It is likely that continued work on the optimisation process will give further improvements in recall for any given precision level.

The framework and the criteria form a complex multi-dimensional search space which would benefit from more advanced algorithms. Longer optimisation runs may also yield further improvements. As a proof of the basic concept, this work shows that argumentation has an immediate application as a tool for categorising positive and negative matches in search results from complex search engines.

Once the framework has been refined, it is less reflective of the expert’s reasoning which went into the original design. This may indicate faults in the original design. For example, the optimised framework may have attacks between claims which agree on the search result. Whilst this appears counter-intuitive, it may indicate that the basis of one claim is incompatible with that of the other claim. Therefore the optimised framework retains some usefulness as an interpretative tool, even if the attack structure is difficult to comprehend. A systematic study of the agreement between the reasoning derived from optimisation and expert reasoning is left for future work. This may be part of a future publication in a computer science or artificial intelligence journal, where the successful and novel application of this technique is likely to be of great interest.
More directed and informed refinements by an expert may achieve the same or even better results than automated optimisation, with the advantage that the framework still reflects an individual’s reasoning. In addition, the argumentation framework need not be fixed for all users of a particular bioinformatic tool. Different frameworks might be appropriate in different circumstances, and it is important to allow researchers to adjust the system to their preferences, in order to fulfil their personal expectations. A suitable user interface must be developed to make this refinement and customisation as easy as possible. The framework should be seen as a valuable starting point of generalised expert knowledge, rather than an end in itself.

4.1 Comparison to other methods

Decision trees are a simple technique for categorisation based on a set of yes/no questions of exactly the same kind as the claim criteria described in earlier sections. They are widely used for prediction in biology and medicine. For example, proteins in an archaeon genome have been categorised into soluble and insoluble categories using decision trees grown from examples and counterexamples [Christendat et al., 2000]. Cellular migration in response to various conditions has been modelled using a simple decision tree [Hautaniemi et al., 2005]. In both of these examples, a small number of discrete and continuous properties are brought together into a simple logical structure which can be comprehended easily. At each level of questioning, the complexity of the problem is reduced, until a conclusion is reached. Because the questions may be arbitrarily complex, no assumptions are made regarding the distribution of input data: it is reduced to a simple binary response. Decision trees may be easily constructed automatically using established algorithms, or by hand.
These features make decision trees a compelling solution for analysis of certain data sets.

Argumentation frameworks provide a much higher-level representation than decision trees. They place much less emphasis on the process of reaching a decision, focusing rather on making explicit the factors that contribute to a decision. Argumentation is a dialectic technique at heart; a mechanism for coming to a conclusion has been added to it. A decision tree can be constructed to produce exactly the same outcomes as the argumentation methods described in this paper. An algorithm to do this was devised, so the corresponding decision trees could be examined. A decision tree equivalent to the example framework in Figure 2.2 is shown in Figure 4.1. It has a total of 34 binary decision points. Assuming a uniform distribution across all possible claims, the mean number of questions asked before coming to a conclusion is just over 4, the maximum is 9 and the minimum is 2.

Whilst the process associated with coming to a conclusion is generally simpler with the decision tree, the tree itself is fairly complex. The same question appears multiple times in different branches, making it a less succinct representation than argumentation. More importantly, decision trees focus on coming to a conclusion efficiently; argumentation frameworks, as used in this paper, have the same predictive power as decision trees, but have the additional advantage of more closely and concisely reflecting the reasoning and the underlying justifications than the reductionist decision tree approach. However, argumentation frameworks are more difficult to construct, partly as they contain more information than decision trees. The approach taken in this work is that of designing one by hand, which is then refined by automated optimisation using a random walk algorithm.

Figure 4.1: Decision tree equivalent to argumentation: This tree is equivalent to the argumentation framework in Figure 2.2. Each question is represented by the symbolic identifier of the equivalent claim, given in Table 2.1. Follow the left branch when answering “yes” (i.e. the claim applies to the result) and the right branch when answering “no”. It can be seen that questions are repeated in different parts of the tree, an inherent problem with decision trees.

Bayesian networks (or Bayes nets) are more complex than argumentation frameworks. They make decisions based on networks of Bayesian inferences. They have been employed successfully in algorithms for predicting various features of proteins and in other biological applications. For example, [Drawid and Gerstein, 2000] developed a network to predict the normal subcellular compartment occupied by proteins in yeast, giving around 75% accuracy on a test dataset. Protein-protein interactions can also be predicted with approximately 65% accuracy using Bayes nets [Jansen et al., 2003]. These applications operate directly on a large amount of biological data, which is inherently a difficult and subtle task, and they model and make predictions about biological systems which are themselves probabilistic. Therefore, a probability-based approach is apt. Moreover, some features may have only a subtle effect on the property being predicted, and this may be modelled with smaller probabilities contributing to the final conclusion. When predicting protein-protein interactions in yeast, for example, there are a small number of positive examples compared to the total number of possibilities. Fewer than 100,000 of the possible 18 million interactions actually occur, a positive rate of around 0.5% [Jansen et al., 2003]. This requires a technique which can discriminate on the basis of many features, each with a variable contribution to the prediction.
The argumentation search space is quite different. Argumentation is evaluating the output of a tool which has already made predictions, and reduced the possible search space to 20 candidates. From this, the task is to pick at least one good match, assuming a good match is present. The positive rate is at least 5%, and the search space is tiny by comparison. The task is not to model the tool, or the biological process behind the predictions that it makes. The aim is to model the reasoning process of the researcher who uses the tool and evaluates its output. Whilst human reasoning of this kind accommodates a degree of uncertainty, it does not use a complex probabilistic model. Bayesian nets and argumentation frameworks are thus not direct competitors. Argumentation theories have been developed as an approach to reasoning about uncertainty where even rough estimates of probabilities are not available or meaningful.

In summary, decision trees potentially have the same predictive power as argumentation frameworks. They are easier to create, but are also lower-level representations that focus on the process of making a decision rather than its justification. Bayesian networks lend themselves to modelling problems with an underlying probabilistic, quantitative basis, where the contribution of each parameter may be very subtle. Argumentation frameworks are appropriate where there is need to model the reasoning process such that it can be used to make automatic predictions, and justify these predictions to a person.

An argumentation framework represents a miniature debate about each search result. The focus of this work has been on summarising this as a simple conclusion derived from the framework. The content of a framework is richer than this, and may be presented in various ways so that the user may come to their own conclusion. Natural language may be generated [Reed and Grasso, 2001] to explain the status of the result. For example, “The length of the match would normally imply a good result, however the sequence identity is low” might represent a framework with a “Low sequence identity” claim attacking a “Long match” claim. The framework may be presented visually as in Figure 2.2 and Figure 2.3. Such a presentation may allow the user to interact with the framework, for example removing arguments to see what effect this has on the undefeated claims and final conclusion.

4.2 Conclusion

Argumentation captures expert knowledge as a simple model, so it can be automatically applied to data to aid interpretation. It has potential applications in all areas of bioinformatics, where large and complex datasets are commonplace. Given the ever-increasing number of tools available to the biologist, a technique which helps to make sense of their output is timely. Argumentation can also help in pooling evidence from multiple tools and coming to a consensus decision, a technique successfully applied to protein structure prediction which should prove useful in other biological problems.
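The interactive exploration suggested in Section 4.1, where arguments are removed to see the effect on the undefeated claims, might be sketched as below. This is an illustration under an assumption: it evaluates the framework with Dung-style grounded semantics (a claim is accepted once all of its attackers are defeated, and defeated once any accepted claim attacks it), which may differ from the evaluation procedure used in this work. The function name and claim identifiers are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def undefeated(claims: Iterable[str],
               attacks: List[Tuple[str, str]]) -> Set[str]:
    """Return the accepted claims under (assumed) grounded semantics.
    Each attack is an (attacker, target) pair; attacks that mention a
    claim not present in `claims` are ignored."""
    claims = list(claims)
    attackers = {c: set() for c in claims}
    for attacker, target in attacks:
        if attacker in attackers and target in attackers:
            attackers[target].add(attacker)

    accepted: Set[str] = set()
    defeated: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for c in claims:
            if c in accepted or c in defeated:
                continue
            if attackers[c] <= defeated:      # every attacker is defeated
                accepted.add(c)
                changed = True
            elif attackers[c] & accepted:     # attacked by an accepted claim
                defeated.add(c)
                changed = True
    return accepted


# The example from the text: low sequence identity attacks a long match.
claims = ["long_match", "low_sequence_identity"]
attacks = [("low_sequence_identity", "long_match")]
print(undefeated(claims, attacks))

# Removing the attacking claim reinstates the claim it attacked.
print(undefeated(["long_match"], attacks))
```

Claims caught in an unresolved cycle of mutual attacks are neither accepted nor defeated under this semantics, so they are left out of the result.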
Part II
Effect of in vivo factors on protein evolution

The general aim of this work is to explore the process of protein evolution, including the structure and function of proteins which result from it. The work will make extensive use of simplified models. Such models reduce the complexity of protein structure to a level that may be tractably modelled computationally, whilst maintaining the ability to make biologically relevant predictions. Part of the aim of this work is to build upon these models in order to make them more indicative of the real protein world. The resulting model will be used to explore three hypotheses:

• Background: A key question in protein evolution is the relative effect of point mutation versus recombination. Work with simplified models of protein structure and evolution has found that point mutation is the main mechanism by which protein sequence space is explored. The work has found that recombination plays a role in searching protein structure space and undoing the effects of genetic drift. However, in the vast majority of cases a recombination event is deleterious [Cui et al., 2002] or neutral [Xia and Levitt, 2002] - and much more so than for point mutation.

Hypothesis: It is hypothesised that these conclusions are skewed by the excessive simplicity of the models used. They do not exhibit the hierarchy of structure present in real proteins. In particular, they do not have secondary structure: only the primary and tertiary levels of real protein structures are modelled. An extra level of structural hierarchy may give recombination a more complete set of reusable units which it can reorganise to create novel structures, hence increasing the importance of recombination in exploring protein structure space.

• Background: The nucleation-condensation model is seen as unifying the two main extreme theories of protein folding, those of hydrophobic collapse and the framework model [Nölting and Andert, 2000, Daggett and Fersht, 2003]. Nucleation sites have been identified in simplified protein models [Li et al., 2000], and evolutionary simulations using these models have shown that nucleation structures are conserved when designed proteins are optimised for folding speed [Mirny et al., 1998]. Simplified evolutionary studies such as this are usually initiated (or seeded) with polymers designed with non-evolutionary techniques as the starting population.

Hypothesis: It is hypothesised that nucleation sites in modern proteins represent the earliest evolved core structures, from which later folds were developed. This hypothesis may be directly studied with simulated ab initio evolution of simplified proteins from random polymers. The steps which led to the repertoire of modern protein folds are a matter of conjecture, based in large part upon comparisons between their structures and sequences. Simplified models allow direct observation of the steps which lead to a protein population which is analogous to the modern real-world protein population. Only ab initio evolution will test this hypothesis: seeding with known structures does not show us an early evolutionary pathway.

• Background: Anfinsen’s early work in protein folding [Anfinsen, 1973] showed that proteins can fold as quickly and reliably in vitro as in vivo. Subsequent work has revealed subtle effects which complicate this basic principle, and suggest that in vivo effects are significant in the folding of many proteins, including the formation of secondary structure [van den Berg et al., 1999, Ellis, 2001, Hartl and Hayer-Hartl, 2002, Kinjo and Takada, 2003, Ping et al., 2003, Rivas et al., 2003, Maier et al., 2005, Cheung et al., 2005, Ziv et al., 2005].

Hypothesis: It is hypothesised that, although proteins may fold well in vitro, they have been optimised by evolution to fold in the living cellular environment, and the imprint of this may be found on the dynamics of protein folding. In particular, the slow synthesis of proteins by the ribosome and the confining effect of the crowded cell may influence the location on the protein backbone of residues involved in nucleation sites. This hypothesis may be explored with direct observation of proteins evolved ab initio with a simplified model. Only ab initio evolution will test this hypothesis: initiating evolution with known structures means there is no opportunity for in vivo conditions to imprint upon early evolved aspects of protein structures.

Underlying the exploration of these hypotheses is a model for simplified protein structure, folding and evolution. Simplified models are often described as a “parallel universe” which can be used to quickly explore hypotheses that would be impossible to test in the real protein universe. By directly modelling evolution, the heritage of a set of simplified proteins may be directly observed, a heritage which is difficult to derive by studying real proteins [Larson and Pande, 2003]. Therefore, this work may be broadly divided into two, heavily interrelated parts:

1. Development of a protein evolution model which can evolve proteins of a biologically relevant size and structure

2. Use of this model to perform experiments testing the hypotheses outlined above

The results of this work may have impact in several areas:

• Ab initio protein structure prediction: knowledge about the relationship between protein evolution and structure can help make structure prediction faster and more reliable. More informed use may be made of the patterns of conservation and evidence of evolutionary heritage between homologous proteins.

• Protein design, synthetic biology and bioengineering: simplified models may yield successful approaches for developing novel protein folds, altering existing ones, and ensuring they operate successfully in the unique in vivo environment.

• Early evolution and the origins of life: the model can evolve proteins from random polymers, taking as the starting point the hypothesised pre-protein RNA world [Gilbert, 1986].
Overview

Chapter 5 covers background theory behind protein structure, folding and evolution, including details of a particular class of simplified model which is widely used, and which is used in this work. It includes a review of research using simplified models, and a critique of this research.

Chapter 6 gives details of the aims and objectives of the work, and documents the models, methods and software developed to fulfil these objectives.

Chapter 7 gives details of the early development of the protein evolutionary model, and presents data drawn from a long evaluation run of the model. It also presents the results of an experiment designed to explore the relationship between mutation rates and the number and size of proteins produced by the simulation. Results are discussed in detail.

Chapter 8 summarises the discussion from Chapter 7 and outlines a plan for future work.
Chapter 5
Proteins and simplified models

5.1 Principles of protein folding and evolution

5.1.1 Protein folding

Anfinsen [Anfinsen, 1973] laid the foundations upon which protein folding theory has been built for over thirty years. He established several principles which underlie most protein folding research, although they have been challenged many times, both theoretically and practically.

Anfinsen’s most important contribution was the thermodynamic hypothesis. This states that the native conformation of a protein, in its normal physiological conditions, minimises the Gibbs free energy of the whole system. The native conformation is determined entirely by the protein’s amino acid sequence and interactions between atoms in the protein and solvent. Amongst these interactions are hydrogen bonding, covalent bonding, steric interactions and the hydrophobic effect.

Anfinsen identified several aspects of protein folding dynamics which have been influential. First, the experiments that contributed to the thermodynamic hypothesis also indicated that it was not necessary for a protein to fold during synthesis, from the NH2- to the COOH-terminus. Second, proteins transfer rapidly from folded to denatured state as pH increases. This gave rise to the two-state theory of protein folding, whereby most proteins in a solution are either stably folded or completely disordered, with few occupying stable transition states. Third, Anfinsen postulated the existence of local nucleation sites which fold to a native-like state initially, followed by the thermodynamic collapse of the rest of the molecule to the native state. This helps to explain how a flexible protein with many potential conformations can find its way from any one of them to its native state. Nucleation sites are landmarks on a limited number of paths leading to the native structure across the conformational landscape.

5.1.2 In vivo folding

Although Anfinsen’s experiments suggest that proteins find their native conformation in vitro, the in vivo folding environment is significantly different, and may influence specific elements of protein folding dynamics and evolution.
The cellular environment is chaotic and crowded [Ellis, 2001] [Rivas et al., 2003]. Between 10% and 40% of the volume of a cell is occupied by macromolecules - in vitro this figure is around 0.1%. This crowding can help protein folding and stability, but can also promote aggregation [van den Berg et al., 1999] [Kinjo and Takada, 2003]. Proteins are synthesized slowly by the ribosome: between 5 and 40 residues are added to the growing chain per second. By comparison, folding takes microseconds. The ribosome has a very large mass compared to most proteins, so the growing peptide chain is effectively tethered to an immovable object. A synthesizing protein passes through a long tube from the centre of the ribosome complex before it emerges into the cell, and there is evidence of helix formation in this tube [Maier et al., 2005]. Chaperones and other proteins protect and guide co-translational folding [Maier et al., 2005] [Hartl and Hayer-Hartl, 2002]. Proteins which fold poorly can be refolded with the help of GroEL or destroyed by ubiquitylation and the proteasome, and aggregates can be disaggregated by proteins such as ClpB in bacteria.

5.1.3 Evolution

Molecular evolution is driven by mutation and recombination, and it is directed by selection. Evolutionary mutation takes place in the form of insertions, deletions and point mutations to the germline DNA, due to copying errors and exposure to radiation, chemicals or viruses. Recombination involves the movement of sections of DNA to other locations, including the homologous “crossover” recombination that takes place in the formation of offspring DNA from two parents, and the transposition of mobile DNA elements.

Selection takes place at the level of an organism, which has resulting impact on the selection of its protein-coding genes. A change may reduce the organism’s ability to survive, causing it to die before it can reproduce.
A change may also alter the organism’s likelihood of reproducing even if it does survive. Often selective pressures do not act on a change, resulting in genetic drift.

Theories about early evolution of proteins are dominated by the idea of a pre-protein RNA world [Gilbert, 1986], where RNA structures were the biological catalysts. This theory stems in part from the observation that the ribosome is primarily composed of RNA structures. The efficient non-random assembly of protein sequences depends upon ribosomal synthesis. The first steps towards proteins as the main functional biomolecules are the subject of speculation and experimentation [Schimmel and Kelley, 2000]: this work assumes that a mechanism exists for reliably synthesizing proteins from an inherited template, thus working in the early protein world of around 3.5 billion years ago.

5.2 Simplified models

Currently, the most physically realistic model of protein structure and folding uses molecular dynamics to model low-level interactions between the atoms of a protein suspended in a simulated water bath. Depending upon the temporal resolution of the simulation, and physical range of inter-atom interactions, molecular dynamics is a very
time-consuming process. Each atom has three degrees of freedom, and may interact to varying degrees with hundreds of other atoms. A simulation of the dynamics of a single small protein for a fraction of a second of simulated time may take many years of CPU time, and the end result is still not likely to agree well with an experimentally determined structure [Yong Duan, 1998].

Simplified models remove detail from this “perfect model” in order to gain acceptable results for a particular application in more realistic timescales. The amount of detail removed depends upon the application. For real protein structure prediction applications, the degrees of freedom may be reduced to rotations around ϕ and ψ angles and side chain rotations. Combined with a suitable energy function, such a model can be used as part of a Metropolis algorithm [Metropolis et al., 1953] to find the native conformation of a protein. Still, however, the numerous degrees of freedom mean it is computationally expensive to explore the conformational search space thoroughly. Steric and other chemical constraints mean the angles adopted are somewhat limited, but the search space remains very large and modelling the interactions between atoms is a lengthy process.

Models may be further simplified if not dealing with real proteins based upon real amino acid sequences. When looking at general principles governing the dynamics or evolution of proteins, we require (as a minimum) a model which behaves like a protein, with similar characteristics, but without the complexity of real protein structures. Both the number of elements in the model, and the degrees of freedom, may be restricted without destroying the characteristic essence of protein behaviour. A dynamic simulation of such a protein over a second of simulated time is reduced to a few minutes or less of CPU time.
The resulting models are often described as a “parallel universe” of protein structure and folding, which can be used to quickly explore hypotheses that would be impossible to test in the real protein universe.

5.3 Simplified structure

Many simplified models of protein structure are based upon the reduction of each amino acid to its Cα atom [Kolinski and Skolnick, 2004]. This instantly reduces the number of elements in the system by an order of magnitude, reducing the complexity of energy and interaction calculations. An extra point may be added to each Cα point, representing a notional side-chain.

To make clear the distinction between real-world proteins and simplified proteins, the latter will be referred to as polymers and the units which represent the amino acids will be referred to as monomers.

5.3.1 Lattice polymers

This model is often described as a self-avoiding walk on a lattice. The lattice model reduces each amino acid to a single point representing the Cα atom, linked to its chain neighbours by two bonds [Kolinski and Skolnick, 2004]. The position of each monomer is restricted to a lattice (or grid), usually square (in two dimensions) or cubic (in three dimensions). The space between each grid point is equal to the length of the bond
between each monomer. Usually, no two monomers are allowed to occupy the same position on the grid; however, some models allow this with a punitive energy penalty. Each monomer has a type selected from a set of between 2 and 20 types. Where two types are used, they are intended to represent the broad categories of hydrophobic and hydrophilic amino acids. This is termed the HP model (Hydrophobic-Polar model). Where 20 monomer types are used, they are intended to represent those amino acids used in real proteins. An example cubic lattice polymer with 20 monomer types is given in Figure 5.1. The grid restricts the bond angles to 90° or 180° and the torsion angles to 0°, ±90° or 180°.

Figure 5.1: Example cubic lattice polymer with monomers chosen from 20 types. Each type is shown in a different colour.

5.3.2 Off-lattice polymers

The off-lattice model releases the monomers of the standard lattice model from their constraining grid [Kolinski and Skolnick, 2004]. In order to prevent monomers from overlapping, a steric radius is usually simulated, often around half the length of the bond between neighbouring monomers (this is equivalent to the steric radius imposed by the lattice model). The angles between neighbouring bonds may be continuous, restricted only by the steric constraints, or they may be additionally restricted to angles which encourage helix formation or occur on a Ramachandran plot derived from real protein conformations. These are enforced either by rejecting or punishing (through the energy function, see Section 5.4.2) conformations which violate them. The result is a polymer model which corresponds much better with real proteins, whilst retaining most of the computational benefits of the lattice model. An example off-lattice polymer is shown in Figure 5.2.

5.3.3 Sidechains

Sidechains may be added to both lattice and off-lattice models.
They are typically represented as points linked by a flexible bond of constant length to the corresponding backbone alpha carbon. In the lattice model, the extra points are restricted to the lattice in the same way as the backbone points. In the off-lattice model, a side chain is restricted by steric constraints, and perhaps constraints on its angle relative to
either of the bonds from its backbone carbon to neighbouring backbone carbons. The elongated shape of a sidechain may be modelled with an ellipsoidal steric boundary.

Figure 5.2: Example off-lattice polymer with monomers chosen from 20 types. Each type is shown in a different colour. Selected residues have their steric radii shown as a translucent sphere.

5.4 Simplified folding

We would like to simulate the folding of a simplified polymer from some state to a native state, as a model for real-world protein folding.

Folding both lattice and off-lattice models can proceed using various algorithms, which correspond to real-life folding to varying degrees. For this work the Metropolis algorithm [Metropolis et al., 1953] with simple local moves and a hybrid energy function (partly based on pair potentials derived from published protein structures) is employed. This method has been shown to model protein folding thermodynamics and stability well [Šali et al., 1994, Shakhnovich, 1997].

5.4.1 Move set

Moves applied to a model in the Metropolis algorithm may be broadly categorised into local and global varieties. Global moves change large sections or even all of the elements of the model as a unit: for example, changing the torsion angle of a single bond might move half of the monomers all at once. Local moves change small sections, perhaps just one or two elements, leaving the rest unchanged. Local moves are more widely used since they are fast to perform. It may also be argued that they are more biologically plausible.

This work uses three local moves for the lattice model, and two for the off-lattice model. The moves are illustrated in Figure 5.3. A requirement for the Metropolis algorithm is that these moves are ergodic, that is, that any conformation is reachable from any other conformation in one or more moves, and the sampling of conformations is representative of the sampling a real physical system undertakes.
The lattice moves are not ergodic: a small number of tightly knotted states are not reachable, for
example that in Figure 5.4 [Madras and Sokal, 1987]. It is probable that the off-lattice move set has a similar but smaller number of unreachable states, depending upon the ratio of steric radii to bond lengths. This is not generally regarded as a problem in modelling protein dynamics since tightly knotted conformations are not a feature of folded proteins, and are unlikely to be an important feature in folding dynamics [Madras and Sokal, 1987].

Figure 5.3: Local moves in the (a) lattice and (b) off-lattice model. There is no need for a “corner flip” in the off-lattice model, since this can be achieved with a simple crankshaft move. If sidechains are modelled, then they are moved the same way as the “End bend” move for each model.

5.4.2 Energy function

The energy function calculates a value notionally equivalent to the Gibbs free energy of the polymer. Minimising this free energy is the aim of the Metropolis algorithm. Many energy functions exist, and they are employed according to the requirements of the work. Different energy functions may be combined into a hybrid function. When using simplified models to study the thermodynamics of protein folding, it is standard to use an energy function based upon pair potentials. Given a set of monomer types T, a function V(a, b) gives the free energy of the interaction between a ∈ T and b ∈ T when they are within some defined distance of each other. Note that V(a, b) = V(b, a). For the HP model, V is quite simple since there are only two monomer types. For this work, 20 monomer types are used, each representing a real amino acid. The interaction potentials used are taken from [Miyazawa and Jernigan, 1996].
These are calculated from the observed distribution of distances between amino acid types in experimentally determined protein structures published in the PDB [Berman et al., 2000]. Amino acid types which regularly appear close together have a negative pair potential, and types which usually appear far apart have a positive pair potential. In this way, the simplified model manages to capture a large amount of complex interaction data in a small lookup table.

If i and j are the indices of two monomers along a polymer chain of length N, the monomer type at index i is given by a_i. The contact between monomers at i and j is
given by c_ij, where c_ij = 1 iff (if and only if) the monomers are in contact and are not neighbours on the polymer chain, and c_ij = 0 otherwise. Given these definitions, the free energy formula is:

\[ E = \frac{1}{2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} V(a_i, a_j) \, c_{ij} \]

The contact function c_ij is a coarse classifier of the interaction between monomers. More elaborate potential functions may replace V which take into account the specific distance d_ij between the two monomers, or their separation in monomers along the polymer chain:

\[ E = \frac{1}{2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} V(a_i, a_j, d_{ij}, |i - j|) \]

Figure 5.4: A knotted lattice polymer from [Madras and Sokal, 1987]. Lattice moves described in Figure 5.3(a), performed on a cubic self-avoiding walk, are not ergodic: this conformation cannot be reached from a simple rod conformation, or escaped to form a rod. There is still, however, some flexibility in the structure, allowing it to explore a very limited search space. Other such conformations exist, but the conformation space they occupy is very small compared to the total conformation space.

Other energy functions

The energy function can include a term for hydrogen bonding. This may be restricted to, for example, bonds between monomers exactly four apart on the polymer chain, in order to allow the effective formation of helices. It can be made into a more general bond between a monomer and up to two other monomers, as enabled by the CO and NH groups of each amino acid.

Where a sidechain model is used, there are two more potential interactions to consider: backbone-sidechain and sidechain-sidechain. One commonly used model has only sidechain-sidechain interactions contributing to the pair potential energy. Only backbone-backbone elements interact with hydrogen bonds, and there are no backbone-sidechain interactions (apart from the basic steric interactions).

The backbone and sidechain bond and torsion angles may also be an element of the energy function. Such an energy term may be calculated by comparing the angles to the observed distribution of such angles in published proteins (e.g. as embodied in the ubiquitous Ramachandran plot [Ramachandran et al., 1963]) and punishing angles which do not regularly occur.

An early energy function used in the Go model [Taketomi et al., 1975] simply rewarded interactions between amino acids which were known (or intended) to be in contact in the final conformation. Such a function was useful in observing basic dynamics of protein folding, but is clearly no use in folding arbitrary sequences where the final conformation is unknown.

Many other terms may be included in an energy function. Some, such as pair potentials, have a basis in statistical physics, whereas others have a less formal grounding and are simply aimed at getting a particular result. Several functions may be combined into a hybrid function, where the result is a “virtual” energy.

5.4.3 Metropolis algorithm

Given a polymer in an arbitrary conformation, the Metropolis algorithm [Metropolis et al., 1953] proceeds as follows:

1. Calculate the free energy E_0 of the polymer conformation.

2. Perform a single move from Figure 5.3. Immediately reject moves which cause monomers to violate steric constraints (e.g. occupy an already occupied position on the lattice, or fall within some distance of other monomers).

3. Calculate the free energy E_1 of the new polymer conformation, and hence the change in energy ∆E = E_1 − E_0.

4. If ∆E ≤ 0, then the new conformation is accepted unconditionally: such a move may occur spontaneously since it decreases Gibbs free energy.

5. If ∆E > 0, then the new conformation is accepted with probability exp(−∆E / kT), where k is a virtual Boltzmann constant and T is a simulated temperature. Since we are simulating a simplified model, we set our own scale for T (unlike real world temperatures in Kelvin); therefore k is unnecessary.
Note that a higher temperature increases the probability that any given change is accepted, and as T approaches infinity, all moves are accepted irrespective of ∆E. This reflects the fact that a change which increases Gibbs free energy depends on some kind of energy input, which is provided by heat in the environment.

If a move is rejected, the previous conformation is retained. Steps 2 to 5 are then repeated, with the energy of the current conformation taken as E_0, until the simulation is ended.

5.4.4 Ending the simulation - the native state

In a real protein, the native state is the conformation (or ensemble of conformations) in which a protein is biologically active. It has been shown [Anfinsen, 1973] that the native state usually corresponds to the lowest free energy conformation. Therefore, we define the native state of a simulated polymer as the lowest energy conformation found during the entirety of the simulation. Note that this may not correspond to the conformation of the polymer when the simulation terminates: a lower energy conformation may have been found earlier in the simulation.
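As an illustration of the Metropolis procedure and of the native-state definition above, the following is a minimal sketch rather than the implementation used in this work. It assumes a two-dimensional square lattice, the two-type HP model with an illustrative contact energy of -1 for H-H contacts, and only two local moves (end bend and corner flip). The names `energy`, `propose` and `fold` are hypothetical, and k is absorbed into the temperature scale.

```python
import math
import random

# Illustrative HP pair potential (an assumption for this sketch; the report
# itself uses 20 monomer types with published pair potentials).
V = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "H"): 0.0, ("P", "P"): 0.0}


def energy(seq, coords):
    """Contact energy: sum of V(a_i, a_j) over non-neighbour pairs in contact.

    Each pair is visited once (j > i + 1), which equals the 1/2-weighted
    double sum over all i and j.
    """
    total = 0.0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):  # skip chain neighbours
            dx = abs(coords[i][0] - coords[j][0])
            dy = abs(coords[i][1] - coords[j][1])
            if dx + dy == 1:  # adjacent lattice sites are in contact
                total += V[(seq[i], seq[j])]
    return total


def propose(coords, rng):
    """Apply one random local move; return None if it breaks self-avoidance."""
    n = len(coords)
    i = rng.randrange(n)
    if i == 0 or i == n - 1:
        # End bend: move the end monomer to a site next to its chain neighbour.
        nx, ny = coords[1] if i == 0 else coords[n - 2]
        new = rng.choice([(nx + 1, ny), (nx - 1, ny), (nx, ny + 1), (nx, ny - 1)])
    else:
        # Corner flip: reflect monomer i across the line joining its neighbours.
        # For a straight segment this returns the current (occupied) site.
        a, b, c = coords[i - 1], coords[i], coords[i + 1]
        new = (a[0] + c[0] - b[0], a[1] + c[1] - b[1])
    if new in set(coords):
        return None  # steric clash: reject immediately
    return coords[:i] + [new] + coords[i + 1:]


def fold(seq, steps, temperature, seed=0):
    """Run the Metropolis loop; return the lowest energy seen and its conformation."""
    rng = random.Random(seed)
    coords = [(i, 0) for i in range(len(seq))]  # start from a straight rod
    e = energy(seq, coords)
    best_e, best = e, list(coords)
    for _ in range(steps):
        trial = propose(coords, rng)
        if trial is None:
            continue
        delta = energy(seq, trial) - e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            coords, e = trial, e + delta
            if e < best_e:  # remember the native-state candidate
                best_e, best = e, list(coords)
    return best_e, best
```

Calling `fold("HPPHPPHH", steps=5000, temperature=1.0)` returns the lowest energy encountered together with the conformation that achieved it, which need not be the conformation at the final step.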
Deciding when to terminate the simulation depends upon application. If trying to find the best, lowest energy state no matter how long it might take, one might end the simulation when some fixed number of random moves has been completed without a decrease in free energy. In evolutionary simulations, we can simply perform a fixed number of moves and say that any polymer that hasn’t found its native state by that point would be selected against anyway. This is an approximation to finding the native state which may fail in some cases. It will certainly present an upper limit on protein length, as longer proteins require more moves to successfully fold. Even shorter proteins may fail to find the lowest energy state due to the stochastic nature of the process.

Cubic lattice polymers of up to 48 monomers in length can be expected to fold within one million random moves, depending upon the nature of the native conformation [Shakhnovich, 1997]. The increased degrees of freedom in off-lattice polymers give them a larger search space, but also greater flexibility in breaking out of local energy minima. Ad hoc trials conducted as part of this work have shown that two million random moves are more than sufficient for 16-monomer off-lattice polymers to find a native (lowest energy) state. Published work may use as few as one million moves, or as many as ten million moves, depending upon the complexity of the model and the metrics which are to be extracted from the simulation (see Section 5.5.1). Models with sidechains will require more moves.

5.5 Simplified evolution

Two key elements are required for a basic simulation of the evolutionary process. Selection governs which individuals in a population might successfully reproduce. Variation introduces new features to the offspring of individuals.

In our case, the individuals in question are polymer sequences.
Their successful reproduction represents the success of an imagined population of organisms which have the polymer sequence stored in their DNA sequence. The organism population must reproduce enough to make it likely that some offspring are produced with variations in the polymer sequence. Therefore, directly modelling polymer evolution represents a fast-forward of real polymer evolution as takes place in populations of organisms. This idea is illustrated in Figure 5.5.

Note that polymer sequences are evolved, that is, sequences of amino acids encoded in some form. In real life, DNA is the medium. In a simulation, a simple array of amino acid identities is sufficient. The polymers themselves are a product of the interpretation of this encoding and dynamic processes, either physico-chemical or simulated. It is this end product upon which selection is based, but only the polymer sequence is transmitted to offspring and it is the sequence which is altered to produce variation. In real evolution, the codon to amino acid mapping means that completely random mutation in DNA results in biased mutation of the protein sequence. This bias may be simulated in the sequence mutation probabilities, with a simple probability associated with changing one monomer type to another (although this is not yet a part of the model used in this work). DNA mutation may also result in reading frame changes: it is assumed that this is normally deleterious and plays little or no role in
protein evolution.

Figure 5.5: Evolution of a single protein represents many generations of organism evolution. Directly simulating protein evolution shortcuts this process.

5.5.1 Selection

In order to select an individual for reproduction, it needs to be assessed in some way to determine its likely success in performing a useful function without undesirable side effects. The fitter a polymer is deemed to be, the more likely it is to reproduce. This drives a process of gradual improvement in the fitness of the polymer population. It is the folded polymer which is assessed, therefore a folding simulation must be performed several times in order to make an accurate judgement of a polymer’s capabilities. Several characteristics may be used to assess a polymer’s fitness:

Folding consistency: If a polymer consistently folds to the same conformation, it is likely to be more useful than a polymer which folds less reliably. Contact maps are usually used to determine the similarity of folds. Contact maps for large, compact lattice protein conformations are often unique to that conformation. However, for more extended conformations, and for off-lattice models, a degree of error must be accommodated in drawing comparisons between contact maps.

Stability: A polymer which is stable in its native state is more often available to perform its function, and therefore arguably more fit for its purpose.

Folding speed: Polymers are selected to fold quickly because unfolded polymers may form harmful aggregates and interfere with other cellular processes.

Compactness: Although compactness is a general consequence of stability and the hydrophobic effect, a direct compactness constraint may help speed up evolution.

Function: The polymer in its native state may be compared to a “target shape” or scanned for a desirable “active site”. These characteristics give the polymer a
function beyond merely existing and not being deleterious, and therefore give a positive selective advantage.

Sequence length or number of contacts: If we merely wish to grow a set of large polymers from small ones, undirected in terms of function, then these measures favour larger or more interconnected polymers, without favouring particular sequences.

Surface hydrophobicity: If, in its native state, a polymer has large patches of hydrophobic residues, it is prone to aggregation with other hydrophobic molecules, including other copies of itself. However, this may also be a means for polymer complex formation.

Temporal sampling: Most of the characteristics above may be sampled over time, to build up a picture of the average fitness of a polymer over its functional life.

The characteristics chosen must be combined into a single metric F which can be used as the basis of a probabilistic function which selects polymers to reproduce.

Whilst selection can be used solely to determine which polymer in a population to reproduce next, it can also be used to “kill” polymers, effectively deleting them from the population. This is primarily a computational optimisation. The effect of death can be achieved by never selecting particular polymers for reproduction. However, for long simulations, hundreds of thousands of polymers may be produced which will never reproduce. Removing these to save time and memory makes sense.

5.5.2 Variation

Features of proteins are created, altered and destroyed by the process of variation. Variation is random with respect to the consequence of the changes made. The direction of evolution is governed solely by selection. Variation takes place solely upon the polymer sequences.

Amino acids may be changed, inserted or deleted. In addition, the whole sequence (or parts thereof) may be duplicated. Finally, whole sequences (or parts thereof) from other polymers may be inserted.
Each type of variation occurs with some probability. The probabilities used may attempt to reflect real mutation and recombination rates, or be set to explore their influence on evolutionary outcomes.

5.5.3 Evolutionary protocol

The evolutionary algorithm proceeds as follows:

1. Generate a set of sequences as a seed population. These may be random, short sequences (between three and five monomers) from which larger sequences are evolved. Or they may be longer sequences designed to fold to a known structure.

2. For each polymer sequence which has not yet been assessed:

   (a) Synthesise and fold a polymer based upon it several times, using one of the simplified polymer models.
   (b) Assess the set of folding simulations for some subset of the selection characteristics described above, producing a single fitness metric F for the sequence.

3. (Optional optimisation) Use the characteristics to kill polymers which are deleteriously bad.

4. Use a probabilistic function to select a set of polymer sequences to reproduce. Two possible applications of this function are:

   (a) Use F to derive a probability of reproducing in a given generation. Each individual is given an opportunity to reproduce, independently of other reproduction events.

   (b) Use F to derive a weight given to each individual. A fixed number of reproduction events are shared randomly amongst the population, according to the assigned weight.

5. Reproduce the selected polymer sequences, altering the offspring according to the variation model described above. The offspring are added to the population.

6. Return to step 2 and repeat indefinitely.

At each step, information must be recorded about the relationships between polymer sequences, both at a whole-sequence level and at the monomer level. [Tiana et al., 2004] is an example of published work using this protocol.

5.6 In vivo effects

Many aspects of the cellular protein folding and evolution environment are modelled directly as described above. These include the temperature and mutation rates. The tolerance of the cell to misfolded and aggregated proteins may be modelled with the selective criteria. Other in vivo characteristics can be explicitly added, in a simplified form, to the models.

Slow synthesis is easily modelled by adding monomers to one end of the polymer chain one by one, at a rate far slower than the folding rate of the polymer (for example, one monomer per 100,000 Monte Carlo moves) [Sikorski and Skolnick, 1990]. This will allow co-translational folding. This may be further refined by simple modelling of several elements of the ribosome.
The ribosome can be modelled as an immovable wall to which the growing end of the polymer is tethered [Ping et al., 2003]. The long tube from which the protein emerges, where alpha helices may form unhindered by other parts of the protein, can easily be modelled as a confining tube: moves resulting in monomers lying outside this tube will be rejected [Ziv et al., 2005]. The effects of crowding and confinement can again be modelled by rejecting moves which place monomers outside a bounding volume or inside multiple surrounding bounding volumes [Ping et al., 2003, Cheung et al., 2005].
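The slow-synthesis schedule and the wall and tube constraints described in this section amount to a growth rule plus a move filter. The following is a minimal sketch under stated assumptions, not the model of any of the cited papers: the function names, the tube dimensions and the geometry (wall at the plane x = 0, a square-section tube running along the negative x axis, the cell interior at x ≥ 0) are all hypothetical choices made for illustration.

```python
def monomers_synthesised(move_count, moves_per_monomer=100_000, full_length=None):
    """Number of monomers present after `move_count` Monte Carlo moves.

    One monomer exists at the start and one more is added every
    `moves_per_monomer` moves, up to `full_length` if it is given.
    """
    n = 1 + move_count // moves_per_monomer
    return n if full_length is None else min(n, full_length)


def inside_allowed_region(coords, tube_length=10, tube_half_width=1):
    """True if every monomer respects the ribosome wall and the exit tube.

    Assumed geometry (hypothetical): monomers with x >= 0 are in the cell
    and unconstrained; monomers with x < 0 are behind the wall and must lie
    inside a tube of the given length and half-width centred on the x axis.
    """
    for x, y, z in coords:
        if x >= 0:
            continue  # emerged into the cell
        if x < -tube_length:
            return False  # beyond the closed end of the tube
        if abs(y) > tube_half_width or abs(z) > tube_half_width:
            return False  # embedded in the ribosome wall
    return True
```

In a Metropolis loop, a trial conformation for which `inside_allowed_region` returns False would be rejected in the same way as a steric clash, and `monomers_synthesised` would decide how much of the sequence takes part in the simulation at a given move count.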
5.7 Research using simplified models

This section reviews other research which uses simplified models. Sections 5.7.1, 5.7.2, 5.7.3 and 5.7.4 summarise research and present results uncritically. Section 5.7.5 gives an overview of all the research, and comments upon it.

Sections 5.7.1 and 5.7.2 review the use of lattice and off-lattice polymers respectively to explore basic properties of proteins, not including evolution. Section 5.7.3 looks at the use of simplified models to investigate evolution. Section 5.7.4 reviews research using simplified models to look at the impact of in vivo effects on protein folding.

Section 5.7.5 comes to the conclusion that there are several key omissions in much of the work in this area. Models are oversimplified: evolutionary work uses the simplest models, which do not exhibit secondary structure, even though secondary structure has given interesting results where off-lattice models are used in non-evolutionary work. Many simulations look at the “sequence-structure space” in general terms, ignoring the effects of population dynamics. Parameters of the simulations have not been thoroughly explored. Finally, many simulations “shortcut” the most interesting part of evolution by seeding experiments with proteins of known structure designed by other means. This work aims to address these omissions.

5.7.1 Lattice polymers

Early work

Lattice polymers, using a square or cubic lattice, have been used in numerous investigations of protein folding dynamics which, without the simplification offered by the model, would have been impossible. One of the earliest was a basic test of what features of a protein allow it to fold successfully in a short amount of time [Šali et al., 1994]. 200 random sequences were generated and folded using the Metropolis algorithm. It was found that some of the features previously assumed to be important in protein folding were actually not required.
For example, neither a large number of local interactions, nor a large number of conformations close to the native conformation with low energies, were found in polymers which nonetheless found the lowest energy state in a short time. In fact, the main requirement was for the polymer to occupy a pronounced energy minimum in structural space. It was concluded that this is necessary for folding to be successful at temperatures high enough to allow effective exploration of structure space.

Amino acid usage

The lattice model was used by [Fan and Wang, 2003] as one of three approaches to determining the minimum number of amino acid types required to fold a protein. There are around fifty thousand conformations for a lattice 27-mer folding into a compact 3 × 3 × 3 cube. They tested what proportion of randomly generated polymer sequences could reliably fold to one of these conformations, given a variable number of monomer types. They found that the proportion of viable polymer sequences did not increase significantly when more than 10 monomer types were available. This,
along with other evidence presented in the paper, points at a minimum of 10 amino acids for producing a large repertoire of polymer conformations. This study was only made feasible by the relatively small number of lattice conformations possible when restricted to a 3 × 3 × 3 cube, hence allowing an exhaustive analysis of structure space.

Figure 5.6: Principle of designability. The structures decrease in designability from (a) to (d). In reality, hundreds or thousands of sequences will fold to a given structure. Sequences folding to a given structure form sets, illustrated with dotted boundaries.

Nucleation sites

A polymer designed by the Metropolis algorithm in sequence space has been used to study nucleation sites [Li et al., 2000]. These are small groups of amino acids which form a core structure before the rest of the polymer folds. The existence of such a site in the studied structure was verified. Some nucleus contacts were non-native and transient (that is, not present in the native conformation), but were still instrumental in making folding faster. For this work, a basic 28-mer 3D lattice model was extended with the addition of side chain points connected to each simulated backbone point. These also lay on the lattice. The free energy of the conformation is based solely upon interactions between side chain points. Once the nucleation site was identified, the interactions between those residues were weakened. The native state stability was not significantly altered by this, but folding took on average around three times longer.
These results are compared to experimental results from several studies where anomalous φ-values have been reported, which indicate that the effect of mutations on the transition state is opposite to or greater than the effect on the native state.

Designability

The designability of a protein structure is defined as the number of protein sequences for which that structure is the native conformation, usually defined as the lowest energy conformation. Early work on designability used the HP cubic and square lattice model to determine the number of sequences that fold to all compact 3 × 3 × 3
conformations in 3D and 6 × 6 conformations in 2D [Li et al., 1996]. It was found that highly designable structures tended to be more stable than less designable structures. Highly designable structures also showed evidence of protein-like secondary structure, including strands (long parallel stretches) and helices (snaking patterns).

Misfolding

2D lattice 16-mers with four monomer types have been used to study protein misfolding, aggregation and their link to many diseases [Harrison et al., 2001]. The normal hydrophobic and polar monomers were used, plus A and B types which are attracted to B and A respectively. They found, amongst a population of random sequences, several where two polymers could misfold into a dimer complex which has a lower overall energy than the two polymers independently in their ground state. Such dimers, given certain geometric properties, could then propagate their misfolding to other nearby polymers undergoing Monte Carlo simulation. They found that most such misfolded polymers were extended, beta-sheet-like conformations which presented two flat faces in order to allow easy tessellation with other polymers, allowing propagation. Propagation was more effective under slightly denaturing conditions (i.e. elevated temperature).

5.7.2 Off-lattice polymers

Off-lattice polymers are a step closer to real protein structures, and as such work using them is often based upon or compared to real protein data. Comparisons in published work are predictably favourable. An off-lattice model with sidechains and bond angles limited to three categories equivalent to hot spots in the Ramachandran plot [Ramachandran et al., 1963] has been used to reveal highly designable 23-mer folds [Miller et al., 2002]. Configurations considered were limited to those having less than a given cut-off surface area.
This means the designability of fairly compact conformations is evaluated, equivalent to the set of maximally compact 3 × 3 × 3 cubic lattice polymers. The HP model was used for monomer types, allowing exhaustive search of sequence space. Every sequence was threaded onto every structure and a conformational energy calculated based upon surface exposure of hydrophobic residues (note that this differs from the pair potential energy described above). It was found that a small number of structures are highly designable, while most have poor designability. Structures identified as designable do not change significantly when model parameters - such as sidechain size, amino acid hydrophobicity types, bond angle categories and surface area cutoff - are changed. This indicates that the simplifications used in the model are acceptable. Some of the highly designable folds agree well with real protein folds such as helix-turn-helix and zinc finger motifs. Other designable folds are suggested as candidates for novel protein design. The work was the first to demonstrate the designability principle in model polymers more complex than cubic lattices, and is a step towards demonstrating it for real proteins.
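
The definition of designability used throughout this section can be made concrete with a very small exhaustive calculation. The sketch below is an illustration only, not the method of any of the cited studies: it assumes a 2D square lattice, the HP alphabet, an energy of -1 per non-bonded H-H contact, and it enumerates all self-avoiding conformations rather than only maximally compact ones. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Enumerate self-avoiding walks of n beads (n >= 2) on the square lattice.

    The first step is fixed along +x and the first turn must go towards +y,
    so each walk appears once rather than once per rotation or reflection.
    """
    found = []

    def extend(path, turned):
        if len(path) == n:
            found.append(tuple(path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and dy == -1:
                continue  # skip mirror images of walks that turn upwards
            nxt = (x + dx, y + dy)
            if nxt in path:
                continue  # self-avoidance
            path.append(nxt)
            extend(path, turned or dy != 0)
            path.pop()

    extend([(0, 0), (1, 0)], False)
    return found


def contacts(conf):
    """Pairs (i, j), j > i + 1, of beads on adjacent sites but not bonded."""
    index = {pos: i for i, pos in enumerate(conf)}
    pairs = []
    for i, (x, y) in enumerate(conf):
        for dx, dy in MOVES:
            j = index.get((x + dx, y + dy))
            if j is not None and j > i + 1:
                pairs.append((i, j))
    return pairs


def designability(n):
    """Count, per conformation, the HP sequences with that unique ground state."""
    confs = conformations(n)
    contact_lists = [contacts(c) for c in confs]
    counts = Counter()
    for seq in product("HP", repeat=n):
        energies = [-sum(seq[i] == "H" and seq[j] == "H" for i, j in pairs)
                    for pairs in contact_lists]
        best = min(energies)
        if energies.count(best) == 1:  # require a non-degenerate ground state
            counts[energies.index(best)] += 1
    return confs, counts


confs, counts = designability(4)
```

For 4-mers only the U-shaped conformation can form a contact, so it is the only designable structure, with four sequences (HPPH, HHPH, HPHH and HHHH) folding to it. Longer chains give a spread of designabilities, at a cost that grows exponentially with chain length.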

Secondary structure stacking

The off-lattice model has been used to study conformations made up from helical and strand elements analogous to those in naturally occurring proteins [Emberly et al., 2002]. A protein was simplified to a “stack” of off-lattice models of helix and strand elements. The arrangement of these elements was optimised with respect to an energy function based upon hydrophobic exposure, steric constraints and tethering between elements where they are linked by turns. All naturally occurring stacks were found by this method. Some designable stacks were found which do not occur naturally, which might be used in protein design.

Alpha helices

An off-lattice model with several enhancements has been used to study the folding, dynamics and stability of alpha-helices [Shental-Bechor et al., 2005]. The basic backbone model has sidechain centroids added. The energy function is decomposed into long-range and short-range interactions. Long-range interactions between sidechain centroids, between backbone carbons, and between the centroids and carbons were modelled in a similar way to the simple alpha carbon pair potentials described above. Interactions between backbone carbons are generic - they do not depend upon the residue type. Short-range conformational energy was based upon bond angles. This model was used to study alpha-helices formed by polyalanine sequences, which normally fold reliably to helices in real proteins. The proportion of the chain forming a helix reduces with increasing temperature and decreasing chain length. The results broadly agree with those found experimentally. The latter finding is explained by the relatively high flexibility of the chain ends, which causes the polymer to unravel.
The chain did not perform an exhaustive search of conformation space to reach a helical state: stable helix-like intermediates formed on the way to a stable complete helix.

5.7.3 Evolution

[Xia and Levitt, 2004] reviews research on protein structure evolution using simplified and all-atom models. Some of the more pertinent work reviewed is expanded upon below, along with other work relating to evolutionary models.

Neutral nets and stability

Work in this area is dominated by exhaustive search of structure/sequence space using highly simplified models. Usually the HP model is employed, and structure is constrained to a 2D square lattice. Sequences up to 25 monomers long are examined - often shorter. Normally only maximally compact conformations are studied. Such simplifications are necessary to perform an exhaustive search.

A neutral net is a set of proteins with the same structure and similar sequence. The network is formed by linking proteins with a single point mutation between them. Most neutral nets have at their core a single representative prototype sequence which can tolerate the highest number of mutations whilst retaining structural stability [Bornberg-Bauer, 1997]. On average, thermodynamic stability increases as the sequence approaches that of the prototype sequence: thus, polymer stability forms a funnel shape in sequence space. This is termed a superfunnel [Bornberg-Bauer and Chan, 1999]. See Figure 5.7 for an illustration of these concepts.

Figure 5.7: Sequence-structure space, neutral networks, thermodynamic stability and mutational robustness combine to form superfunnels which draw proteins towards particular sequences, linked by bridges and recombination. (a) is linked to (c) by recombination. (c) is linked to (d) by a bridge of stable sequences. (b) is isolated: although it may be linked to other neutral nets by point mutation, the sequences en route are unstable and therefore unviable. Clearly, many sequences may be related by some sequence of recombination events. This conceptual illustration shows just two, for clarity.

Proteins have been found to be marginally stable, and several possible reasons for this have been put forward, including that it is necessary for proteins to function (a need for flexibility) and for proteins to be degraded when necessary. 2D lattice evolution simulations of maximally compact 25-mers with 20 monomer types have found that marginal stability can be explained by sequence entropy effects [Taverna and Goldstein, 2001a]. At each generation, each residue was mutated with a probability of 0.2%. Insertions and deletions were not modelled. It was found that more sequences exist in the marginally stable peripheral region of a neutral network (the lighter blues in Figure 5.7), and that they are greatly interconnected by point mutations; population dynamic effects therefore mean more polymer sequences tend to occupy these regions. The simple model does not include any functional constraints, so this tendency is independent of function. The authors postulate that this effect facilitates exploration of sequence space in the search for new, useful structures.

Bridges between neutral nets

Neutral nets for different protein structures are isolated from each other in sequence space by barriers of unstable sequences, although a few rare “bridges” exist between them which might be exploited by evolution [Cui et al., 2002] [Grishin, 2001]. The barriers may also be overcome with large sequence leaps enabled by, for example, recombination or transposon activity, as illustrated in Figure 5.7.
Using a 2D lattice HP model with 18-mers, Cui et al. [Cui et al., 2002] found that homologous crossovers gave rise to many stable structures; however, these are very rarely (0.69% of crossover events) different from the parent structures. Non-homologous crossovers are worse at producing viable offspring, but are an order of magnitude more successful in producing structural innovation. Such large scale alterations are described as “tunnelling under” barriers in the sequence landscape to reach new structures with novel functions - a process which would be tortuous or impossible using point mutation alone. The study also identified 2D lattice equivalents to the foldon [Panchenko et al., 1996]: hypothetical autonomously folding sequences which may be reused in the formation of numerous different protein structures. In real proteins, foldons (determined with statistical foldability criteria) are around 20-60 amino acids long.

Work of this nature is enabled by profound simplifications. More complex models can be used if exhaustive search is abandoned. As a result, more nuanced and biologically interesting results may be obtained. Bastolla et al. [Bastolla et al., 2000] explored neutral networks by simulating the evolution of a designed cubic lattice polymer under the constraint that the contact map should not diverge from that of the original polymer. In essence, they use an evolutionary algorithm to explore a neutral net. They found that a point mutation which results in a change in native conformation (i.e. a move away from the neutral net structure) typically results in a polymer with low native stability. Therefore, traversing structure space by point
mutation alone is unlikely to be a successful approach, as the first steps away from a given structure will result in polymers with poor stability, which will be selected against. They found a close relationship between the thermodynamic stability of a lattice structure and its resilience to mutations. Finally, they found that the number of neutral mutations stemming from a given sequence varies hugely across a neutral net. These two findings, taken together, imply that stability also varies across a neutral net.

Bastolla et al. note some of the shortcomings in their study. The lattice model does not show significant secondary structure and so is perhaps not a good model for protein structural evolution (even if the lattice model simulates protein thermodynamics realistically). Their evolutionary model assumes constant environmental parameters such as temperature and mutagenic effects. They also only include point mutations, ignoring insertions and deletions. These are all potentially interesting areas for further study.

Mutation and recombination

The population dynamics of maximally compact 2D lattice 25-mers have been used to explain the robustness of proteins to mutation [Taverna and Goldstein, 2001b]. Using a similar model to [Taverna and Goldstein, 2001a] described above, it was found that polymers evolved for stability alone tend to be robust to point mutations - such mutations result in unchanged or increased stability more often than expected. Random mutation occurs all the time, and a protein’s genotype is more likely to survive if such mutation results in an equally viable protein. A protein’s fitness is therefore enhanced by having a large number of neutral neighbours in sequence space, even if this property is not explicitly selected for.

The relative effects of mutation versus recombination on simulated neutral evolution were explored for 25-mer 2D lattice polymers [Xia and Levitt, 2002].
Their investigation explored neutral nets for all compact structures in a similar way to [Bornberg-Bauer and Chan, 1999]. The evolutionary model included recombination and point mutation. They found that when point mutation dominates, sequences tend to diffuse throughout the neutral net, with most occupying the marginally stable periphery (as found in [Taverna and Goldstein, 2001a]). However, when recombination events dominate, sequences tend to congregate towards the most stable, optimised prototype sequence at the core of the neutral net. Recombination is described as a “spring”, holding the sequence population together against mutation-induced diffusion. The ratio of mutation to recombination rates determines how optimised sequences are for their structure. The authors suggest that these results may help inform structure prediction and design. The findings suggest that most proteins folding to a given structure are marginally stable; it is therefore likely that analysing homologous sets of sequences would increase the likelihood that more stable, easily predicted examples are included in the predictions.
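
The neutral net idea used in the studies above (sequences sharing a native structure, linked by single point mutations) can be sketched directly for a toy model. The following is a minimal illustration under stated assumptions, not a reproduction of any cited method: it assumes a 2D square lattice HP model with an energy of -1 per non-bonded H-H contact, a native state defined as a unique lowest-energy conformation, and hypothetical function names.

```python
from itertools import product

MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Self-avoiding walks of n beads, one per rotation/reflection class."""
    found = []

    def extend(path, turned):
        if len(path) == n:
            found.append(tuple(path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if (not turned and dy == -1) or nxt in path:
                continue  # skip mirror images and overlapping beads
            path.append(nxt)
            extend(path, turned or dy != 0)
            path.pop()

    extend([(0, 0), (1, 0)], False)
    return found


def contacts(conf):
    """Non-bonded pairs of beads occupying adjacent lattice sites."""
    index = {pos: i for i, pos in enumerate(conf)}
    return [(i, index[(x + dx, y + dy)])
            for i, (x, y) in enumerate(conf)
            for dx, dy in MOVES
            if index.get((x + dx, y + dy), -1) > i + 1]


def native_map(n):
    """Map each HP sequence with a unique ground state to that conformation."""
    contact_lists = [contacts(c) for c in conformations(n)]
    native = {}
    for seq in product("HP", repeat=n):
        energies = [-sum(seq[i] == "H" and seq[j] == "H" for i, j in pairs)
                    for pairs in contact_lists]
        best = min(energies)
        if energies.count(best) == 1:
            native["".join(seq)] = energies.index(best)
    return native


def neutral_neighbours(seq, native):
    """Single point mutants of seq that keep the same native conformation."""
    mutants = (seq[:i] + ("P" if a == "H" else "H") + seq[i + 1:]
               for i, a in enumerate(seq))
    return [m for m in mutants if native.get(m) == native[seq]]


def neutral_net(seed, native):
    """All sequences reachable from seed through neutral point mutations."""
    seen, stack = {seed}, [seed]
    while stack:
        for mutant in neutral_neighbours(stack.pop(), native):
            if mutant not in seen:
                seen.add(mutant)
                stack.append(mutant)
    return seen


native = native_map(4)
net = neutral_net("HPPH", native)
```

In this picture the prototype sequence is the member of the net with the most neutral neighbours, and peripheral sequences are those with few. The 4-mer case is too small to show a core and periphery (all four members have two neutral neighbours); it is used here only because it can be checked by hand.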

Evolved nucleation sites

An evolutionary simulation optimising the folding speed of a 48-mer 3D lattice polymer with a fixed structure has revealed the existence of folding nuclei [Mirny et al., 1998]. For each evolutionary optimisation, the sequences produced were evolutionarily related, and amino acids in the folding nucleus identified through Monte Carlo simulation were highly conserved. Repeated runs of the evolutionary optimisation produced a superfamily of polymer families with the same structure but differing heritage. It was found that the same aligned residue positions were conserved between families, but the precise residue type was not conserved. This pattern of “conserved conservatism” amongst polymer families within a superfamily was tested against two real protein superfamilies, CMBF and type-III repeat fold, and was successfully verified against published data and experimental work by other groups.

Structure and designability

A 25-mer 2D lattice model has been used to examine the distribution of structures in evolving polymer populations [Taverna and Goldstein, 2000]. Four evolutionary models were used, three of which are similar to that described above. Polymers were assessed for “foldability”, a measure of thermodynamic and structural viability which is less precise but faster to compute than measures gained from Monte Carlo simulation. Population dynamic effects due to selection for foldability were found to favour highly designable structures more than is suggested by the distribution of designable structures. Differences in the occurrence of different structures with similar designability were observed. This is explained by two factors: differences in topology of the neutral net for a given structure, and differences in closeness of sequences folding to different structures.
This is explained in a similar way to [Taverna and Goldstein, 2001b]: polymers with a large number of viable neighbours in sequence space are favoured, even if they fold to different structures. The sequences are in “flux”, moving back and forth between viable states, ensuring their survival. This idea is illustrated in Figure 5.8. The over-representation of certain structures is compared to the real protein universe, where a limited subset of possible viable folds is employed.

Structure diversity

An elaborate evolutionary model has been used to simulate the “Big Bang” scenario of protein structure evolution, whereby numerous diverse protein structures evolve from few seed structures [Tiana et al., 2004]. A 36-mer sequence was designed for a given 3D lattice structure, and this was used as the seed for an evolutionary simulation which alternates between creating homologous sequence families which fold to the same structure, and inducing structural innovation (whilst maintaining protein-like features) with point mutations. The creation of sequence families with the same structure uses the Metropolis algorithm in sequence space: this process speeds up the search of sequence space which would otherwise proceed slowly through mutations and repeated simulation with Metropolis in structure space. This represents a shortcut added to the process described in Figure 5.5.3: it expands the search of sequence space without the need for slow, repeated Monte Carlo simulations in structure space.
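
The general idea of a Metropolis search in sequence space, with the structure held fixed, can be sketched as follows. This is a simplified illustration and not the procedure of [Tiana et al., 2004]: it assumes an HP alphabet, an energy of -1 per H-H contact in the target structure, a swap move that preserves composition, and arbitrary values for temperature and step count. It also does not check that the target is the unique ground state of the designed sequence. All names are hypothetical.

```python
import math
import random


def energy(seq, contact_pairs):
    """Energy of a sequence threaded onto a fixed structure: -1 per H-H contact."""
    return -sum(seq[i] == "H" and seq[j] == "H" for i, j in contact_pairs)


def design_sequence(start, contact_pairs, temperature=0.5, steps=2000, seed=0):
    """Metropolis sampling over sequences for a fixed set of structural contacts.

    Each move swaps two positions, so monomer composition is conserved.
    Returns the lowest-energy sequence visited and its energy.
    """
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current, contact_pairs)
    best, e_best = current[:], e_current
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        e_new = energy(current, contact_pairs)
        delta = e_new - e_current
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            e_current = e_new
            if e_current < e_best:
                best, e_best = current[:], e_current
        else:
            current[i], current[j] = current[j], current[i]  # reject: undo swap
    return "".join(best), e_best


# Target: a compact 2 x 3 hairpin whose non-bonded contacts are (0, 5) and (1, 4).
hairpin_contacts = [(0, 5), (1, 4)]
designed, designed_energy = design_sequence("HHHHPP", hairpin_contacts)
```

Because only the sequence changes, each step costs a single energy evaluation on a known structure, which is the source of the speed-up over repeated folding simulations in structure space.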

Figure 5.8: Population dynamics of proteins on a point mutation network. Sequences in a network may fold to the same or different structures - that is, they are not neutral networks. The coloured “counters” represent individuals with a given protein sequence for a given generation. For the sake of illustration, all individual protein sequences randomly undergo a single point mutation in the next generation, and the parent sequence dies after producing a mutated child sequence, thereby “moving” along an edge. In reality a sequence will produce more than one offspring before its death. It can be seen that the less interconnected network loses more protein sequences to unviable sequences (5 in this example) than the more interconnected network (2). Note that the same number of links to unviable sequences exist in both networks.
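
The effect described in the caption of Figure 5.8 can be expressed as a one-generation expected-value calculation. The sketch below is illustrative only: the two four-node networks are hypothetical and do not reproduce the networks or the numbers in the figure. As in the caption, it assumes every individual makes exactly one point mutation per generation, chosen uniformly from all links (viable or not).

```python
def expected_survivors(population, viable_links, unviable_links):
    """Expected population after one generation of forced point mutation.

    population:     {sequence: number of individuals}
    viable_links:   {sequence: list of viable neighbouring sequences}
    unviable_links: {sequence: number of links to unviable sequences}
    """
    nxt = {seq: 0.0 for seq in population}
    for seq, count in population.items():
        degree = len(viable_links[seq]) + unviable_links[seq]
        for neighbour in viable_links[seq]:
            nxt[neighbour] += count / degree  # share moving along this edge
    return nxt


population = {s: 1 for s in "ABCD"}
one_unviable_link_each = {s: 1 for s in "ABCD"}

# Sparse network: a chain A-B-C-D.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
# Dense network: every viable sequence is linked to every other.
clique = {s: [t for t in "ABCD" if t != s] for s in "ABCD"}

sparse_total = sum(expected_survivors(population, chain, one_unviable_link_each).values())
dense_total = sum(expected_survivors(population, clique, one_unviable_link_each).values())
```

With the same number of links to unviable sequences in both networks, the expected number of survivors is 7/3 for the chain and 3 for the fully connected network, since a randomly chosen mutation is less likely to leave the viable set when viable neighbours are plentiful.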

The results confirm the extreme bias towards highly designable structures found in previous work. The study also finds that later evolved structures are less compact than their ancestors, but are more designable. The authors find that this agrees with findings for real proteins: in particular, eukaryote-only domains are statistically more likely to be less compact than prokaryote-only domains (prokaryotic domains are assumed to have evolved earlier than eukaryotic ones).

Work with evolution of 27-mer maximally compact cubic lattice structures has reiterated the findings of [Taverna and Goldstein, 2000] and revealed particularly well populated and easily evolved 3D structures termed “wonderfolds” [Zeldovich et al., 2006]. Sequences forming these folds have a higher mean stability than other evolved structures. Wonderfolds formed after around 2000 evolutionary mutations. Ordinary folds were optimised further, up to 200,000 mutations, in an attempt to stabilise them to the same level as wonderfolds - this proved impossible. The relatively short evolutionary pathway to discovering wonderfolds is put forward as a potential advantage where search in sequence space is slow or otherwise limited, such as in the “prebiotic” early evolution world.

Function - ligand binding

Several studies have looked at the evolution of function in simplified polymer models. Function is typically reduced to the ability to bind simplified ligands in various ways. This ability is added to selective pressures such as stability and compactness.

A 2D lattice model with 20 monomer types analogous to real amino acids has been used to evolve 4 × 4 polymers which bind tetramer ligands along one side of the square [Williams et al., 2001]. The model was found to focus in on a particular conformation very quickly, after which structural innovation is rare.
This is explained for their simulation by the competitive effect of a steadily improving population: after a few generations, changes which move to a new structure are likely to be detrimental to fitness, and therefore will be selected against due to the relative fitness of the rest of the population. Simple changes may be made to the model to avoid this effect; however, the authors argue that by “fixing” upon a structure, their work reflects real protein evolution well. The model also shows that the distribution of structures, which previous work found was closely related to designability, was not greatly affected by the addition of selection for function.

Function - binding pockets

An HP 2D lattice model has been used to exhaustively enumerate polymers with functional “binding pockets”, for polymers up to 20 monomers long [Hirst, 1999]. In addition to the basic native state requirements, polymers are required to have an unoccupied lattice position which is surrounded by at least three occupied positions. A special energy function was required to allow native states which are not maximally compact. The study found that binding pockets are formed by a polar “loop” being pushed away from a hydrophobic region to vacate a single lattice location. Binding pockets with two unoccupied spaces were not observed, but polymers with more than one binding pocket did occur. The smallest functional model polymers were
11-mers. Just 18 out of 2^11 possible 11-mer sequences fold to four distinct functional structures. Only when the polymers grow to 16-mers does the model reflect the structural diversity present in larger model polymers. The median size of families of proteins folding to a given structure remains fairly constant as chain length increases, but the size of the largest families grows significantly - that is, more designable structures occur as proteins get larger. This work was extended to look at the evolutionary landscape, as characterised by point mutations, insertions and deletions [Blackburne and Hirst, 2001]. Similarly to previous studies, it was found that functional polymers cluster into neutral networks linked by rare bridges. Longer chains (up to 23-mers) were studied, and these were found to increase resilience to mutation, forming a less rugged evolutionary landscape with more bridges between neutral networks and structural families. Longer polymers also give greater functional diversity, since it is easier to form more than one binding pocket.

Non-protein evolution

The study of evolution using simplified models can be viewed in two ways. First, it can be seen as a way to model and make direct predictions about real protein structures and evolution, on the assumption that the model is accurate enough to do this. Second, it can be seen as the study of a protein-like parallel universe, where general predictions can be made about evolutionary dynamics and processes, without directly inferring anything specific about real proteins. Highly simplified models such as the 2D HP lattice focus on the latter. As models become more complicated, up to and beyond the off-lattice models with sidechains, correlations to real protein structures and evolution increase and we can be bolder about the predictions made.
If, however, we focus on the idea that we’re simply modelling protein-like entities using an evolutionary algorithm, then we might extend this to include entities which are not protein-like, but may evolve in a similar fashion.

RNA structure evolution

The most obvious example is RNA structure evolution. RNA shares one element of the protein universe - a linear molecule folds into a specific 3D structure in order to perform some function. The fact that the ribosome and other ancient biomolecules are RNA-based has given rise to the idea that an RNA world predates the protein world [Gilbert, 1986]. The key difference is that RNA structures are built around specific interactions arising from base pairing. There is no protein-like secondary structure (secondary structure is essentially the topology produced by correct base pairing) and the uncertainties inherent in protein folding are reduced by the fact that base pairing is strict. The “pair potentials” in the RNA world are very simple. This degree of certainty in RNA topology means fairly precise studies of RNA evolution may be conducted. A simulation of evolution of a specific RNA structure starting from a random sequence [Fontana and Schuster, 1998] has revealed that, after initial rapid improvement in the RNA structure, further improvements are sporadic and infrequent. This finding is used to lend weight, at a molecular level, to the “punctuated equilibrium” theory of evolution. Since the study does not include external influences which may
prompt the infrequent improvements, it is suggested that genetic drift combined with a discontinuous structure space (whereby a single mutation may prompt a sudden leap to a new structure) may account for the effect.

Digital organism evolution

Simulated digital organisms represent a step even further from protein evolution, but valuable insights into evolution of complex entities may still be offered by their study. Digital organisms are self-replicating computer programs which compete for limited resources of CPU time and memory, and replicate and mutate their code based upon their success in doing so. Such a system has been used to look at the effect of mutation rates on evolution [Wilke et al., 2001]. The authors evolved 40 sets of digital organisms, then took these as a starting point for two parallel experiments. In one, the organisms were mutated at each generation at four times the rate of the other. The higher mutation rate tended to favour organisms with a lower reproductive fitness than the lower mutation rate did. However, the organisms had a larger number of viable neighbours in “code sequence space”: they were therefore more robust to mutation. The authors described this as “survival of the flattest” - organisms suffering from a high mutation rate ended up occupying a large, flat, low area of the fitness surface. The analogy with protein evolution is striking - simulated model polymers have been found to be most successful if they have a large number of viable neighbours. It would be revealing to see if polymer structural stability or innovation increases at the expense of mutational robustness when mutation rates are decreased.

A very simple digital organism model has been used to study the evolution of complex features [Lenski et al., 2003]. As above, the genome of the organism represents a code sequence, which consists of data movement, branching and NAND (not-and) instructions.
These simple instructions are enough to construct any logical function, analogous to a protein’s function. The authors evolved such organisms with the aim of constructing an EQU function, which compares two input numbers and gives one output if they are equal and another if they are not. They found that the EQU function could only reliably evolve if simpler functions which are en route to EQU, such as AND and XOR, are rewarded.

Both of these systems share a key difference from the real organism or protein world: they are functionally deterministic. That is, the execution of the code in their “genomes” is entirely reliable. Proteins are not reliable - they misfold, aggregate, and their marginal stability means they may not be functional all the time. Conversely, the fact that organisms can tolerate a certain level of errors means they are less sensitive to changes. With a logical organism, a single change to the “genome” will usually have a very profound impact on its function. It has been argued that, if digital organisms are to model biological organisms well, their components must be made unreliable [Endy, 2005]. The method by which evolution (or a human engineer) makes a reliable system from unreliable computational components may inform work in biological evolution.
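
The statement above that NAND instructions suffice to build any logical function, including the intermediate AND and XOR functions and the target EQU function, can be checked with a few lines of code. This is a plain illustration of NAND universality on single bits, with a hypothetical `numbers_equal` helper that applies EQU bit by bit; it is not a model of the instruction set or genome encoding used in [Lenski et al., 2003].

```python
def nand(a, b):
    """The only primitive gate: 0 when both inputs are 1, otherwise 1."""
    return 1 - (a & b)


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    shared = nand(a, b)
    return nand(nand(a, shared), nand(b, shared))


def equ(a, b):
    """1 when the two bits are equal, 0 otherwise (the complement of XOR)."""
    return not_(xor(a, b))


def numbers_equal(x, y, width=8):
    """Compare two non-negative integers bit by bit using only the gates above.

    Shifts and masks are used solely to read out individual bits.
    """
    result = 1
    for k in range(width):
        result = and_(result, equ((x >> k) & 1, (y >> k) & 1))
    return result


truth_table = {(a, b): equ(a, b) for a in (0, 1) for b in (0, 1)}
```

EQU here takes five NAND gates (four for XOR, one for the final inversion), whereas AND takes two, which gives some intuition for why rewarding the simpler intermediate functions makes the more complex one easier to reach.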

5.7.4 In vivo folding

A physical model similar to the off-lattice polymer has been used to explore some basic physical properties of growing chains tethered to a point at the growing end [Marenduzzo et al., 2005]. A homogeneous string-of-beads model similar to that described above was modelled in a molecular dynamics simulation. Beads not adjacent to each other on the string are attracted as they would be in a poor solvent. The intention was to model the synthesis of a polymer by a comparatively large ribosome, the fixed point representing the end of the ribosomal exit tunnel. It was found that helical structures formed spontaneously, without the need for more complex inter-bead interactions. It is suggested that nature took advantage of this spontaneous helix formation by constructing polymers which stabilised and built upon them.

The effects of confinement on the thermodynamics of protein folding have been explored with a simplified off-lattice model in a molecular dynamics simulation [Lu et al., 2006]. Extreme crowding was found to inhibit folding, but a larger hydrophilic “cage” aids folding by disallowing extended conformations. A hydrophobic cage was found to inhibit folding due to interactions with itself and with the hydrophilic residues of the polymer.

A similar model with sidechains was used to study effects of macromolecular crowding on the folding of a real beta-sheet domain from the PDB (1pin) [Cheung et al., 2005]. Crowding was modelled by a set of hard spheres from which polymers were excluded. Up to a certain level, crowding is found to enhance the stability of the native state, since unfolded states are destabilised.
Even the most unfolded states are highly structured. Beyond a certain crowding level, the native state is altered and folding inhibited. The effect of this is very similar to that of confinement, and the paper states that crowding may be modelled more simply by confinement within a sphere of appropriate size.

An off-lattice model polymer confined to a tunnel designed to simulate the ribosomal exit tunnel has been found to form stable helices more readily than unconfined polymers [Ziv et al., 2005]. The stability of helices is increased by such confinement; however, the process of forming a helix is more difficult. The effect is greater for longer polymers. The effect is greatest when the diameter of the tunnel is close to the diameter of the alpha helix, which is consistent with the observed diameter of the exit tunnel. This suggests the diameter of the exit tunnel has been optimised to encourage secondary structure formation along its length.

5.7.5 Commentary

The studies discussed here share some factors which limit their scope and biological interest.

For the most part, they involve profound conformational simplifications to 2D or 3D lattice models [Bornberg-Bauer, 1997, Mirny et al., 1998, Bornberg-Bauer and Chan, 1999, Hirst, 1999, Bastolla et al., 2000, Taverna and Goldstein, 2000, Taverna and Goldstein, 2001b, Grishin, 2001, Taverna and Goldstein, 2001a, Williams et al., 2001, Blackburne and Hirst, 2001, Harrison et al., 2001, Xia and Levitt, 2002, Cui et al., 2002, Tiana et al., 2004, Zeldovich et al., 2006]. Where models are extended into the off-lattice realm, they are not used in evolutionary models [Miller et al., 2002, Emberly et al., 2002, Shental-Bechor et al., 2005]. [Emberly et al., 2002] does look at designability of off-lattice polymers; however, it is simplified (arguably more so than the 3D lattice model) to packing of rigid secondary structures. Secondary structure is a feature largely missing from lattice polymers [Bastolla et al., 2000], although some non-right-angle lattices can accommodate alpha-helix-like conformations. Therefore, evolutionary studies inherently ignore the influence of this level of structural hierarchy. In addition, they do not model any aspects of the unique in vivo environment, such as slow synthesis or crowding.

Evolutionary studies may be classified into three general types, all of which use 2D or 3D lattice proteins:

1. Exhaustively evaluate the relationship between sequence and structure space (permitted only by the HP model) and draw conclusions about features such as designability, neutral networks and superfunnels from this [Bornberg-Bauer, 1997, Bornberg-Bauer and Chan, 1999, Hirst, 1999]. This allows firm conclusions to be drawn regarding the topology of the search space, but does not allow study of population dynamic effects, which appear to be an important factor influencing the distribution of sequences and structures.

2. Explore a neutral network with a simulated evolution model, including point mutations and/or recombination [Mirny et al., 1998, Bastolla et al., 2000, Taverna and Goldstein, 2001a, Taverna and Goldstein, 2001b, Xia and Levitt, 2004]. This can take into account population dynamic effects, but sequence and structure sampling is necessary unless the very simple models are used. Since a neutral network is being explored, structural diversity is inherently not a part of this kind of study.

3. Explore the whole evolutionary landscape with a simulated evolution model, including mutations and/or recombination [Taverna and Goldstein, 2000, Williams et al., 2001, Blackburne and Hirst, 2001, Cui et al., 2002, Tiana et al., 2004, Zeldovich et al., 2006]. These allow the exploration of structural diversity and population dynamic effects. But the simplicity of the lattice models means the evolutionary landscape is necessarily coarse, due to the coarseness of the energy function.²

All evolutionary models start with a sequence known to reliably fold to a particular structure. This is often designed using Metropolis methods on sequence space (for example, [Tiana et al., 2004]). This gives the methods a “kick start” into a known part of structure space. It is the simplified model equivalent of starting with a real protein sequence and changing it to optimise for stability [Korkegian et al., 2005] or alter the structure [Reina et al., 2002]. As such there is no attempt to look at the early evolution of polymers from short, essentially random sequences into diverse structures. [Tiana et al., 2004] claims to explore the “Big Bang” hypothesis of protein evolution. However, by seeding the simulation with a set of known 36-mer structures, their starting point is equivalent to the modern protein repertoire rather than the primordial soup of random sequences.

Finally, as acknowledged by [Bastolla et al., 2000], evolutionary simulations have very many parameters (such as temperature, mutation rates, mutational bias, recombination rates, death/reproductive selection criteria) which are fixed either by analogy with the real world, or by intuitive reasoning and trial-and-error on the part of researchers. Little work has been done to explore the parameter space of evolution, or to explore the effects of changing parameters over evolutionary time. Outside of protein evolution, interesting results have been obtained with differing mutation rates for evolving digital organisms [Wilke et al., 2001]. Such an exploration may be revealing in the domain of simplified polymers.

² The energy function of a lattice model is more coarse than that of the off-lattice model because, in the lattice model, there is a greater chance that a given move results in two or more contacts being made with other amino acids. This means the energy jumps as a result of moves are greater, and so the distribution of energies is more quantised. It is more difficult to navigate sequence space as a single change has a greater impact on the energy, stability and conformation of a structure.
Chapter 6
Methods

6.1 Aims and objectives

The overall aim of this project is to test the hypotheses given on page 27:
• Findings based upon 2D and 3D lattice models are skewed by the excessive simplicity of the models used. They do not exhibit the hierarchy of structure present in real proteins.
• Nucleation sites in modern proteins represent the earliest evolved core structures, from which later folds were developed.
• Although proteins may fold well in vitro, they have been optimised by evolution to fold in the living cellular environment, and the imprint of this may be found on the dynamics of protein folding and the nature of protein structures.

Several interesting questions arise from the findings in section 5.7.5, which relate to these hypotheses:
• What does the evolutionary landscape look like when polymers evolve ab initio - from effectively random sequences rather than designed seed sequences?
• How does the use of an off-lattice polymer model influence features of the evolutionary landscape, such as designability, neutral networks, mutational robustness and structural diversity?
• How does the addition of features designed to more closely emulate real protein conformations, such as secondary structure and sidechains, influence the evolutionary landscape?
• How does the addition of features designed to model unique characteristics of the in vivo environment influence the evolutionary landscape?
• How does the evolutionary landscape change as the parameters of the simulation change?

The following sections briefly discuss the motivations for and possible answers to these questions.
6.1.1 Ab initio polymer evolution

Evolving polymers from random sequences is a basic model for early evolution of proteins. How did the estimated 2000 folds [Govindarajan et al., 1999] from which modern proteins are constructed arise? Were they independently developed from scratch, or could they form families rooted in more basic folds? Which parts of the folds could have evolved first - is there a relationship between folding dynamics (such as nucleation sites) and evolution? Previous research has suggested that nucleation sites are more highly conserved [Mirny et al., 1998], and this may help determine an earlier evolutionary heritage. Evidence from simplified models may lend weight to this idea.

6.1.2 Off-lattice models

Off-lattice polymers have greater conformational diversity than lattice polymers. This gives rise to a smoother energy landscape. Monte Carlo moves more frequently result in more than one contact being made in lattice models than in off-lattice models, due to the more limited conformation space. A smoother energy landscape may give rise to a smoother evolutionary landscape. In lattice models, neutral networks are separated by regions of sequence space which give unviable polymers. A smoother landscape may mean that “bridges” between such networks are more easily found, and can be traversed with point mutations alone. The need for recombination to “tunnel under” the sequence-structure landscape may be reduced, and additionally it may become easier to evolve polymers ab initio.

6.1.3 More accurate polymer models

Lattice models exhibit little secondary structure. Whilst conformations with protein-like secondary structure are available to off-lattice models, they do not necessarily occur with the basic pair potential energy function usually employed with such models. The energy function or accessible conformational space must be altered to encourage their formation.
This may be done by limiting bond and torsion angles to those which are compatible with forming secondary structure. It can also be done by emulating the hydrogen bonds which stabilise secondary structures as part of the energy function. Sidechains may be added to the model in order to even more closely model real proteins. The steric obstacle of sidechains is important in secondary structure formation and packing. Such additions to polymer structure models may enhance the “structural toolkit” available to evolution for producing novel folds. The extra level of hierarchy may make recombination a more effective method for structural innovation than it is for more basic lattice models, which is one of the basic hypotheses underlying this work.

6.1.4 Effect of in vivo environment

Most simplified models are based upon Anfinsen’s view of proteins folding spontaneously from a denatured state in vitro. Much work has been done to investigate the effect on folding thermodynamics of constraints presented by the ribosome and slow synthesis, and the confinement imposed by a crowded cellular environment and the action of chaperones [van den Berg et al., 1999, Ellis, 2001, Hartl and Hayer-Hartl, 2002, Rivas et al., 2003, Kinjo and Takada, 2003, Maier et al., 2005] [Sikorski and Skolnick, 1990, Ping et al., 2003, Cheung et al., 2005, Ziv et al., 2005]. However, it would be interesting to see the effect on polymer evolution of these factors. Does slow synthesis encourage helix formation? Does the polymer evolve such that folding of the synthesized portion does not interfere with correct folding of the portion which has yet to be synthesized? Does confinement help or hinder evolution? Another in vivo effect is the detection and degradation of poorly folded proteins. This may be modelled in an evolutionary simulation by reducing the proportion of folding attempts required to be successful for a sequence to be viable. Does this error tolerance help evolution, and to what extent?

6.1.5 Effect of parameters

Many parameters underlie structural, folding and evolutionary models. These may affect the outcome of simulation profoundly. Whilst some parameters may be set with reference to real life (for example, mutation rates), others must be adjusted and optimised with reference to the required simulation results. Even if parameters may be drawn from real life, it is interesting to explore the effect of varying them. The parameters form an intricate multidimensional search space, which it is impossible to explore without intelligent sampling. Each evolutionary simulation takes many hours on many processors. But it is important to explore these parameters to ensure that those used are effective and realistic.

6.2 Software architecture

The first stage is to specify and implement a model for evolving simplified proteins, which may be then used to perform experiments to test hypotheses.
This section gives an overview of the work done so far in developing software which implements the model.

With key exceptions, the entire system is coded in Python [1]. This is a relatively new interpreted scripting language, characterised by a clean and complete syntax and extensive built-in library. It is particularly suited to rapid development of efficient object-oriented systems. It suffers from the same performance issues as other scripting languages, but simple interfaces to fast, generic processing libraries and alternative languages allow it to overcome this.

[1] Official Python website: www.python.org

In this section, words with an initial capital which look like This are object-oriented class names.

6.2.1 Polymer model and folding

A polymer model has been implemented with the following features:
• 3D cubic lattice and off-lattice structural models
• Variable number of monomer types
• Steric constraints with a constant steric radius
• Metropolis folding algorithm which accepts a variety of “plug-in” move sets and energy functions
  – Standard move sets as described in Figure 5.3
  – Pair potential energy function for lattice model
  – Pair potential, chain separation pair potential, hydrogen bonding and hybrid energy functions for off-lattice model
• Sidechain model, sidechain move sets and sidechain energy function for off-lattice model
• Polymer folding simulation may start from a native or denatured state, or a “nascent” state whereby the polymer is slowly synthesized on a simulated ribosome

Several functions, including energy calculations and steric constraints, require a search for monomers which are within a certain distance of each other. This is easy for the lattice model, which is inherently indexed such that finding nearest neighbours is fast. For the off-lattice model, an analogous optimisation was used, whereby monomers are indexed in buckets which allow rapid search for a monomer’s nearest neighbours. This means the off-lattice code is almost as efficient as the lattice equivalent.

Move sets for the lattice model are optimised by keeping track of the location of bends suitable for crankshaft and corner flip moves. Every time a move is made, the local structure is scanned for such bends, and their location is stored in a look-up table.

The polymer structure and folding code is entirely in object-oriented C++, used for speed. All C++ classes present an object-oriented interface to Python using the SWIG library [2], which allows Python to use them almost transparently.

[2] Simplified Wrapper and Interface Generator: www.swig.org

6.2.2 Evolution

A polymer evolution experiment is encapsulated in a ProteinEvolver class, which consists of a Universe and a population of Templates. Simulation parameters are stored centrally in the Universe class. A Template is analogous to a gene encoded in DNA. It consists of the monomer type sequence plus (once a polymer has been synthesized from the template several times) a variety of metrics which describe the sequence’s fitness. A Template also includes links to its parent and children, which generation it was born in, and how many attempts have been made to produce child sequences from it. The heritage of every Template in the experiment is therefore stored.

The evolutionary algorithm proceeds as described in Section 5.5.3. All elements described there are present, but optional. Functions for determining a Template’s viability and reproductive fitness are kept in the Universe class, and so are part of the experimental parameters. Fitness may be wholly or partly based upon the synthesized polymer’s similarity to a given structural Target. Several target types have been implemented, including a sphere, ellipsoid, and an arbitrary lattice conformation. Several targets may be specified for the evolution to optimise for, and the set of fits of a structure to the targets may be combined in several ways. The most useful is to take the maximum fit, which will allow the evolution to optimise for one target above the others.

6.2.3 Parallelisation

Even with simplified models, the evolutionary process is slow. Each template must be synthesized into a 3D polymer several times to measure the various metrics needed to determine its fitness, and the sequence search space is very large. It makes sense to take advantage of multiprocessing resources to speed the process up. A simple parallelisation with minimal communication was implemented. For each generation, the task of assessing a given template is farmed out from a single master processor to one of a pool of slave processors. Since assessing a given template may take several minutes, and the data communicated is minimal, this essentially gives an N times speed-up when using N processors. However, one has to ensure the number of templates tested per generation is an integer multiple of the number of processors, otherwise some processors will go unused.

Parallel runs used the Structural Bioinformatics Group’s own 50-processor cluster, as well as the larger shared Mars cluster in the Department of Computing.

6.2.4 Visualisation

Effective visualisation of polymer folding and evolution is useful in spotting coding errors and problems with the models and parameters used. It helps most in the initial development of the model and in the addition of new features.
All visualisation is based upon OpenGL [3] used from Python via the PyOpenGL library [4].

[3] OpenGL developers website: www.opengl.org
[4] PyOpenGL website: pyopengl.sourceforge.net

Polymer structures and folding can be displayed, along with a large number of metrics regarding the structure and the template which produced it. The backbone is shown as a solid tube, with colours representing the monomer types. Other features may be added to the view:
• Interactions between monomers (such as pair potentials) are shown as translucent tubes, with thickness proportional to the strength of interaction. An unfavourable interaction is shown in black, and favourable interactions are shown in white.
• Sidechains, if present, are shown as tubes thinner than the backbone, extending from the corresponding backbone monomer.
• Space-filling and steric radius spheres for some or all monomers
• Targets which are part of the evolutionary fitness function
• A surface representing the ribosome

Most of these features are illustrated in Figure 6.1. The view may be dragged around to change the viewing angle.

Figure 6.1: Example visualisation showing most of the elements which can be displayed. This is for illustrative purposes only; in normal usage only a few of the features will be included, to make a less confusing image.

Polymer evolution is viewed as a traditional tree of interrelated templates, with root templates at the top and children branching off below. Templates are shown as an “example synthesis” of that template into a 3D polymer, similar to that shown in Figure 6.1. The user can zoom in and out of the tree, in order to see more detail of the polymer structures or an overview of the evolutionary relationships respectively.

The evolutionary pathway of a specific template may be viewed as an animation of example native structures, starting from the root template through to the finished template. The polymers for each template are physically aligned to minimise the RMS deviation between related amino acids, giving a smooth visual transition from one generation to the next. A set of example frames from such an animation is given in Table 7.2.
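As an illustration of the Metropolis folding algorithm with “plug-in” move sets and energy functions listed in Section 6.2.1, the following minimal Python sketch shows the acceptance rule and how interchangeable components might be passed in. It is not the C++ implementation described in that section: the function names, the toy integer “conformation” with its quadratic energy, and the choice of units in which the Boltzmann constant is 1 are all assumptions made for this example.

```python
import math
import random


def metropolis_fold(start, energy, propose_move, temperature, steps, rng):
    """Metropolis search; returns (lowest-energy conformation seen, its energy).

    `energy` and `propose_move` are the plug-ins (a hypothetical interface).
    `temperature` must be positive; the Boltzmann constant is taken as 1.
    """
    current, current_e = start, energy(start)
    best, best_e = current, current_e
    for _ in range(steps):
        candidate = propose_move(current, rng)
        if candidate is None:  # a move set may veto a move, e.g. a steric clash
            continue
        candidate_e = energy(candidate)
        delta = candidate_e - current_e
        # Metropolis criterion: downhill moves are always accepted, uphill
        # moves are accepted with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


# Toy plug-ins (assumptions for illustration only): the "conformation" is a
# tuple of integers and the energy is the sum of their squares.
def toy_energy(conformation):
    return sum(x * x for x in conformation)


def toy_move(conformation, rng):
    i = rng.randrange(len(conformation))
    step = rng.choice((-1, 1))
    return conformation[:i] + (conformation[i] + step,) + conformation[i + 1:]


start = (3, -2, 4)
best, best_e = metropolis_fold(start, toy_energy, toy_move,
                               temperature=0.1, steps=2000,
                               rng=random.Random(0))
print(best, best_e)
```

Because the move set and energy function are ordinary callables, alternative move sets or energy functions could be substituted without changing the search loop, which is one plausible reading of the plug-in design.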
Chapter 7
Results

7.1 Initial model development

A basic aim underlying all other investigation is the ability to evolve structurally diverse polymers ab initio, of a size which is biologically interesting - that is, at least 20 monomers (this being the smallest putative foldon size [Panchenko et al., 1996]) and preferably much longer. Proteins this small have been designed [Neidigh et al., 2002, Honda et al., 2004] and evolved [Murzin et al., 1995, small proteins category] but most natural proteins and protein domains are much larger.

The first basic model used the two monomer types of the HP model, which has previously been used mainly with 2D lattice structures. With this model it was impossible to evolve sequences that fold reliably to a unique structure. This reflects the findings of [Yue et al., 1995], where ten designed 3D HP lattice polymers were found to have multiple native conformations, with at least 1000 global minimum ground states for each of the polymers investigated. The HP model has only worked with the cubic lattice in other work where conformation space is limited to maximally compact structures, for example in a 3 × 3 × 3 cube.

The model was extended to 20 monomer types, with a pair potential function using data from [Miyazawa and Jernigan, 1996]. This change alone allowed evolution of reliably folding polymers. However, the simulation was unable to produce polymers more than 15 monomers long. This prompted a switch to the off-lattice model, to see if longer polymers were more easily reachable. The off-lattice model represents a more accurate model of proteins at minimal computational cost, so it was considered to be a generally worthwhile step. Off-lattice models were also limited to around 15 monomers. Subsequent work has focused upon the off-lattice model.

7.2 Extensive off-lattice evolution

A very long evolutionary simulation has been performed, with the aim of seeing if the length limitations may be overcome with simple application of computing power. This long run allows a detailed study of the system.

The off-lattice polymer model with basic pair potential energy function and no sidechains was used. In total, 8,166 generations were simulated, giving rise to 51,163 polymers.
The run took approximately 2,500 processor hours (one week using 16 processors). Selection for reproduction was biased towards those with more contacts - i.e. large, compact polymers. Each new polymer was folded 3 times, and those which did not find the same native structure every time were discarded. These simple criteria allow the simulation to explore diverse structures; however, the bias towards selecting large polymers for reproduction does tend to fix the simulation on a few larger structures.

The distribution of polymer lengths in the population is given in Figure 7.1. The most populated polymer length is 5 monomers, and the longest is 17 monomers (just two polymers are this long).

Clearly, at each generation, the contribution to the population of a given length depends in part upon the population of polymers of other lengths, due to insertions and deletions. The mean leap in length from a parent to child is -0.27 - that is, on average a child is slightly shorter than its parent. However, more of the larger polymers are created by growing from smaller ones, and more of the smaller polymers are created by shrinking from larger ones, as illustrated in Figure 7.2. These findings are unsurprising. Each polymer length population is contributed to by the populations of polymers of other lengths. The larger polymers have fewer contributors bigger than them, and smaller polymers have fewer contributors smaller than them. The overall average leap is dominated by the small polymers, since there are more of them. Further work would reveal if the distribution shown deviates significantly from that which would be obtained if selective pressures were not present.

The mean absolute polymer size leap is 0.90, i.e. adding or subtracting around a single amino acid. For this long experiment, the insertion and deletion probability was 0.1 per amino acid (insertion and deletion events are independent), therefore one might expect this outcome given that most polymers are smaller than 10 monomers - it may simply arise from the fact that a change of more than one monomer in size is less likely. An experiment with a much higher insertion/deletion probability could separate the mutation strategy which arises from selective pressures from that which arises from the simulation parameters.

Figure 7.3 illustrates the change in the population of polymers in the simulation over time, as measured in generations. The overall trend is steady, approximately linear growth. Within this trend, however, growth continually slows and is punctuated by growth spurts at approximately generations 500, 1400 and 6100. The population is subdivided by further curves, the lowest representing the population of 3-mers, followed by 4-mers further up, and so on. It can be seen that the growth spurts are caused by increased growth in the larger polymers present at that generation. At generation 6100, the increase in 3- to 7-mers is steady and they do not contribute to the spurt at that point. It seems likely that the “discovery” of a new, rich seam of larger polymers causes these sudden growths. More work should pinpoint the events which cause these.

After an initial rapid expansion of polymers to 9-mers and then 12-mers, the maximum polymer size increases at regular intervals of around 1,500 generations, as shown by the horizontal spacing between red dots in Figure 7.3. This is further illustrated in Figure 7.4. If this trend continues, then it may indeed be just a matter of processing time in order to achieve longer, more biologically relevant polymers.
[Plot: Protein length distribution. x-axis: Protein length; y-axis: Number of proteins (x1e4).]
Figure 7.1: Distribution of protein lengths after over 8,000 generations of simulated evolution. There are two 17-mer proteins, with none longer than 17 monomers.
[Plot: Distribution of length leaps from parent to child. x-axis: Length leap from parent to child; y-axis: Length of child.]
Figure 7.2: Distribution of leaps in length from parent to child polymer, for each child length. Blue line gives mean leap size. The grey squares indicate the number of polymers of a given size which reached that size with a given leap. The grey level is normalised to the maximum number of polymers in any leap size for each polymer length. White indicates maximum number of polymers, black indicates no polymers. Therefore, each row gives the distribution of leap sizes for that polymer size.
[Plot: Growth of protein populations over generations. x-axis: Generation; y-axis: Population (x1e4). Red dots are labelled with the protein sizes that first arose at each point.]
Figure 7.3: Change in population over generations of the simulation. Based upon 54 sampled snapshots of the simulation. Red dots give the first sampled arrival of one or more sizes of protein. The size(s) that arose at each point are given above the points. 3, 4 and 5-mers occur in the first few generations. The lines subdivide the population into sizes: the lowest slice gives the population of 3-mers, the next slice up gives the population of 4-mers, and so on.
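The coverage percentages in Table 7.1, and the hit rate quoted alongside it, follow from simple arithmetic on figures given in this section. The short Python sketch below reproduces them as a check. The variable and function names are illustrative only, and the alphabet of 20 monomer types is an assumption taken from the size of sequence space used for the table.

```python
# Figures quoted in Section 7.2 and Table 7.1 of this report.
MONOMER_TYPES = 20               # assumed alphabet size (20 to the power n sequences)
GENERATIONS = 8166
ATTEMPTS_PER_GENERATION = 32
VIABLE_POLYMERS = 51163
SIMULATED = {3: 1777, 4: 7158, 5: 10465}   # polymer length -> simulated population


def coverage(length, simulated):
    """Fraction of all possible sequences of this length that were found."""
    return simulated / MONOMER_TYPES ** length


attempts = GENERATIONS * ATTEMPTS_PER_GENERATION
hit_rate = VIABLE_POLYMERS / attempts

for length, count in SIMULATED.items():
    possible = MONOMER_TYPES ** length
    print(f"{length}-mers: {count:,} of {possible:,} = {coverage(length, count):.1%}")
print(f"reproduction attempts: {attempts:,}, hit rate: {hit_rate:.1%}")
```

The comparison of interest is between the coverage of 3-mer space and the overall hit rate, which are of similar size.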
Length of protein   Simulated population (s)   Possible population (p)   s/p %
3                   1,777                      8,000                     22%
4                   7,158                      160,000                   4%
5                   10,465                     3,200,000                 0.3%

Table 7.1: Table comparing protein size populations with possible sequence space population

Optimised parameters are likely to increase the polymer growth rate significantly, reducing the large amount of processing required to run fruitful simulations.

By the end of the simulation, the 3-5-mer populations are steady, and 6-mer populations are heading that way. Either the total population of very small polymers which fold reliably has been found, or a neutral net has been thoroughly explored and others will not be found without higher mutation rates. Table 7.1 shows how the simulated population compares to the possible population of n-mers = 20^n.

This does not take into account the fact that not all of the possible polymer sequences are viable. Each generation of the simulation involves 32 attempts at producing viable offspring via random mutation. Therefore, at the end of the simulation 261,312 reproduction events have occurred, and 51,163 produced viable offspring - a hit rate of around 20%. This is comparable to the 22% of 3-mer sequence space which is populated, so it seems probable that all of the viable 3-mers have been found. However, it looks less likely that 4-mer and larger sequence space has been exhausted - the experiment has just looked at an extended neutral net, where sequences may be related by more than one mutation. Multiple random replicates may reveal regions of 4-mer and 5-mer space which have gone unexplored by this simulation.

Finally, Table 7.2 shows an image sequence of the evolution of one of the two 17-mers from a random 3-mer. The polymers have been geometrically aligned to minimise the RMS deviation between monomers which are known to be evolutionarily related between each polymer. This relationship is known and stored as part of the evolution experiment.
Twenty generations were required for its evolution (note that this is a generation relative to this polymer - in reality over 8,000 generations of successes and failures across a large population of polymers contributed to its development). It can be seen that in some generations it shrank. It is clear that by generation 7, a core structural motif of three blue monomers wrapped around a pair of pink monomers is established. This deteriorates slightly at generation 15, and then is built up at generation 17. Observation of many evolutionary runs during development of the model indicates that this motif is very common for the off-lattice model used. It is not a motif which has an obvious analogue in real protein structures. Changes to the model, including increased steric radius as compared to inter-residue backbone distance, may reveal more protein-like structural motifs.

7.3 Effect of mutation

The experiment outlined in Section 7.2 used several thousand processing hours. Clearly this kind of extensive run is not appropriate to explore the effect of various parameters. Therefore an alternative strategy is used.

Experimental parameters are set and the simulation is run for 400 generations.
[Plot: Median, mean and maximum protein size over generations. x-axis: Generation; y-axis: Protein size; series: Mean, Maximum, Median.]
Figure 7.4: Median, mean and maximum protein size over generations
[Image grid: example structures for generations 1 to 20.]
Table 7.2: Sequence of images illustrating the evolution of a 17-mer. Numbers are generations, relative to this particular polymer.
After this, the number of polymers produced and the distribution of polymer sizes are used as a metric for judging the success of the simulation. These are basic metrics, but sufficient to judge progress towards the 20-mer goal.

With the off-lattice model and no sidechains, 400 generations takes around 160 processor hours. This means it is feasible to conduct many such experiments (with the use of multiple processors) to sample parameter space with a degree of replication.

The first parameters to be explored were the probabilities of point mutations. Three such probabilities exist in the simulation:
• Probability of insertion of a random monomer between each monomer or addition to each end of the polymer
• Probability of deletion of a monomer
• Probability of replacement of a monomer with a random monomer

Each event is independent, therefore there may be many insertions, deletions and mutations for a given reproduction event. For each experiment, the three probabilities are equal: this unified probability is termed the mutation probability. Four random replicates are performed with mutation probabilities equal to 0.025, 0.05, 0.1, 0.2 and 0.4. Therefore 20 experimental runs are performed.

Figure 7.5 shows how the polymer population after 400 generations varies with the mutation rate used. The small number of random replicates means an outlier skews the trend somewhat, but it would be uncontroversial to say that lower mutation rates are generally more successful at producing large numbers of polymers.

The lowest rate, 0.025, will result in a difference of a single monomer between parent and child in around 35% of reproduction events, and no difference between parent and child in around 31% of reproductions. Children which are exactly the same as their parents are immediately discarded to optimise the simulation, therefore around 50% of the remaining children (35% out of 69%) have just one monomer different from their parents.
These calculations ignore the possibility that independent mutation events may negate each other (e.g. deletion of a monomer with an insertion of the same monomer).

At this low mutation rate, we are close to a different mutation model, whereby a single mutation is performed per offspring, rather than having independent mutation events. This model is closer to biological reality, due to the very low rate of mutation between generations in DNA. A rough estimate of this is 2 × 10^-8 mutations per base pair per generation. The high redundancy of the genetic code means many mutations do not affect the protein sequence. The sparsity of protein-coding genes means many mutations fall outside them. Single insertions and deletions are likely to destroy the protein’s structure and function for the most part, as they destroy the downstream reading frame. On the whole, it is extremely unlikely that more than one mutation event ever occurs to a given protein in a single generation.

Figure 7.6 shows that mutation rate does not significantly influence the mean size of polymers. A single outlier may indicate that lower rates give greater opportunities to grow longer polymers; however, with such a small number of replicates this is not a reliable conclusion. What may be said is that lower rates do not give smaller polymers, so the larger population of polymers illustrated in Figure 7.5 is not accompanied by reduced size.
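The three independent point-mutation events described in this section can be sketched in Python as follows. This is a minimal illustration rather than the project's own code: the function name mutate, the integer encoding of monomer types, the alphabet size of 20 and the order in which the events are applied to each monomer are all assumptions.

```python
import random

MONOMER_TYPES = 20  # assumed alphabet size


def mutate(sequence, p, rng):
    """Return a child sequence after independent point-mutation events.

    A random monomer may be inserted at each of the len(sequence) + 1 gaps,
    and every monomer may be deleted or replaced, each with probability p.
    """
    child = []
    for monomer in sequence:
        if rng.random() < p:                  # insertion before this monomer
            child.append(rng.randrange(MONOMER_TYPES))
        if rng.random() < p:                  # deletion of this monomer
            continue
        if rng.random() < p:                  # replacement by a random monomer
            monomer = rng.randrange(MONOMER_TYPES)
        child.append(monomer)
    if rng.random() < p:                      # insertion at the end of the chain
        child.append(rng.randrange(MONOMER_TYPES))
    return child


parent = [0, 5, 12, 7, 19]
child = mutate(parent, 0.025, random.Random(1))
print(parent, child)
```

In a full simulation a child identical to its parent would be discarded and the attempt repeated or counted as a failure; that step is omitted here.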
[Plot: Protein population size varies with mutation rate. x-axis: Mutation probability; y-axis: Protein population.]
Figure 7.5: Protein population against mutation probability. Line gives mean of all replicates, circles give exact values from each replicate.
[Plot: Mean protein size varies with mutation rate. x-axis: Mutation probability; y-axis: Mean protein size.]
Figure 7.6: Mean protein size against mutation probability. Line gives mean of all replicates, circles give exact values from each replicate.
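The fractions of unchanged and singly mutated children quoted in Section 7.3 can be approximated with a simple binomial calculation, sketched below in Python. It assumes that a polymer of length L offers L + 1 insertion sites, L deletions and L replacements, each an independent event with the mutation probability, and it ignores replacements that happen to choose the same monomer type. The report does not state which polymer length the quoted percentages refer to; the length of 15 used here is an assumption, chosen only because it gives values of roughly the quoted size.

```python
def event_sites(length):
    """Independent mutation opportunities: insertions, deletions, replacements."""
    return (length + 1) + length + length


def p_no_event(length, p):
    """Probability that no mutation event occurs (child identical to parent)."""
    return (1.0 - p) ** event_sites(length)


def p_one_event(length, p):
    """Probability that exactly one mutation event occurs."""
    n = event_sites(length)
    return n * p * (1.0 - p) ** (n - 1)


LENGTH = 15  # assumed polymer length (not stated in the report)
for p in (0.025, 0.05, 0.1, 0.2, 0.4):
    none = p_no_event(LENGTH, p)
    one = p_one_event(LENGTH, p)
    # Identical children are discarded, so condition on at least one event.
    one_given_changed = one / (1.0 - none)
    print(f"p={p}: unchanged {none:.2f}, one event {one:.2f}, "
          f"one event among changed children {one_given_changed:.2f}")
```

Under these assumptions the lowest probability, 0.025, leaves about 31% of children unchanged and gives a single event in about 37%, in the region of the figures quoted in Section 7.3. Shorter polymers would give a larger unchanged fraction.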
Chapter 8
Discussion and research plan

8.1 Discussion

The results detailed above indicate that very low mutation rates, giving rise to no more than one mutation per generation, are better at exploring the polymer sequence space than higher mutation rates. Such rates are also more biologically accurate, since the probabilities suggest that more than one mutation per generation is extremely unlikely.

A long experiment showed that structural motifs are established early in generation time, and built upon in later generations. The largest polymer size grows at a steady rate over generation time, and it is likely that longer simulations will give larger polymers. However, it took thousands of processor hours to produce a 17-mer, so ways of improving the performance of the search should be sought. It seems likely that, without recombination, the model is limited to exploring an extended neutral net. Other neutral nets are accessible through mutations from the small polymers, but the bias of the model towards selecting larger polymers for reproduction (due to the underlying goal of growing larger polymers) means other structures and networks could go unexplored.

The search for larger polymers may be made more effective with changes to the polymer model. Adding the ability to fold with secondary structure may be important here, since it gives another level of structural hierarchy to stabilise the overall structure and to reuse through recombination. However, care must be taken to ensure biologically interesting folds arise from this process - it would be easy to end up with thousands of straight alpha helices.

The basic task of evolving polymers of biologically significant size has proven a significant obstacle to exploring the more interesting issues listed in Section 6.1 (Aims and objectives). However, work so far indicates that the model is very close to being appropriate for conducting significant experiments exploring the factors influencing early ab initio protein evolution. Other protein structural and dynamic models may help speed the process to yet larger protein models. Non-local Monte Carlo moves on entire secondary structure units, for example, will help explore the influence of secondary structure, without the need for slowly modelling monomer-level secondary structure formation.
CHAPTER 8. DISCUSSION AND RESEARCH PLAN

8.2 Research plan

The numbers in brackets in the headings below refer to the task numbers in the Gantt chart in Figure 8.1.

8.2.1 (9) Implement more biologically accurate mutation scheme

Either a very low mutation rate should be used, which usually gives a single mutation in a polymer, or the model should allocate a single mutation event to each child polymer.

Target date: April 2006

8.2.2 (11) Develop a polymer model which exhibits secondary structure elements

The current implementation allows various hydrogen bonding schemes, including one which cheats by forcing a bonding pattern which allows alpha helices to form. It is hoped that a freer hydrogen bonding pattern combined with simulated sidechains will give rise to helical and strand elements more naturally, and based upon the nature of the monomer sequence, in order to give structural variety.

Target date: July 2006

8.2.3 (1) Continue exploration of parameter space

Many other parameters influence the basic evolution of the current polymer model. These will be explored in experiments run in parallel with work towards the other milestones. The effects of changing temperature, viability fitness criteria, reproductive selection criteria, relative dimensions of the polymer model, recombination and the simulated ribosomal synthesis should be established.

Target date: September 2006

8.2.4 (10) Increase evolved polymer size with parameters

Although 20-mers are the current target, the longer-term goal is to evolve 60-mers, this being the size of the largest foldon and of a small protein or protein domain.

This work will take results obtained from the exploration of parameter space, and apply them to an “ideal” model which will be used in further experiments.

Target date: October 2006

8.2.5 (6) Increase evolved polymer size with secondary structure model

Apply the secondary structure model to the idealised parameter model. Further exploration of parameters will be necessary given this new model.

Target date: December 2006
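The two mutation schemes proposed in Section 8.2.1 can be illustrated with a minimal Python sketch. The two-letter monomer alphabet, the function names and the example 20-mer are assumptions made purely for illustration; they are not taken from the model implementation described in this report.

```python
import random

# Hypothetical two-letter monomer alphabet (an assumption for this sketch).
ALPHABET = "HP"


def mutate_per_site(seq, rate, rng):
    """Scheme A: every monomer mutates independently with probability `rate`."""
    out = []
    for monomer in seq:
        if rng.random() < rate:
            # Replace with a different monomer type so the site really changes.
            out.append(rng.choice([a for a in ALPHABET if a != monomer]))
        else:
            out.append(monomer)
    return "".join(out)


def mutate_single(seq, rng):
    """Scheme B: exactly one mutation event is allocated to the child."""
    i = rng.randrange(len(seq))
    new = rng.choice([a for a in ALPHABET if a != seq[i]])
    return seq[:i] + new + seq[i + 1:]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
parent = "HPHPPHHPHPPHHPHPPHHP"  # illustrative 20-mer
child_a = mutate_per_site(parent, 1.0 / len(parent), rng)
child_b = mutate_single(parent, rng)
print(hamming(parent, child_a), hamming(parent, child_b))
```

With a per-site rate of 1/L a child carries one mutation on average but may carry none or several, whereas the single-event scheme guarantees exactly one change per child.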
Figure 8.1: Gantt chart for remainder of work
8.2.6 (8) Publish paper about developed model

Evolving biologically interesting simplified molecules ab initio is a significant step in itself, and may be the subject of a publication, which should include comparison against the evidence from the real protein world.

Target date: February 2007

8.2.7 (3, 4, 5) Use ideal model for evolutionary experimentation

Explore the hypotheses raised in Section 6.1 using the model developed thus far. The effect of recombination versus point mutation on the polymer evolutionary landscape when applied to an off-lattice model with secondary structure is a likely candidate for publication, building upon and expanding previous work which has yielded interesting results [Cui et al., 2002, Xia and Levitt, 2002]. The effect of slow synthesis on the evolutionary landscape and nucleation sites is also a possible interesting publication.

Target date: April 2007

8.2.8 (2) Publish paper with results of experimentation

Target date: July 2007

8.2.9 (7) Thesis

Target date: September 2007
Bibliography

[Altschul et al., 1990] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990). Basic local alignment search tool. J. Mol. Biol., 215(3):403–410.
[Anfinsen, 1973] Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181(4096):223–230.
[Bastolla et al., 2000] Bastolla, U., Vendruscolo, M., and Roman, H. (2000). Structurally constrained protein evolution: results from a lattice simulation. Eur Phys J B, 15:385–397.
[Berman et al., 2000] Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., and Bourne, P. (2000). The Protein Data Bank. Nucleic Acids Research, 28:235–242.
[Blackburne and Hirst, 2001] Blackburne, B. P. and Hirst, J. D. (2001). Evolution of functional model proteins. Journal of Chemical Physics, 115(4):1935–1942.
[Bland, 1989] Bland, M. (1989). An introduction to medical statistics. Oxford University Press, Oxford, UK, pages 149–151.
[Bornberg-Bauer, 1997] Bornberg-Bauer, E. (1997). How are model protein structures distributed in sequence space? Biophysical Journal, 73(5):2393–2403.
[Bornberg-Bauer and Chan, 1999] Bornberg-Bauer, E. and Chan, H. S. (1999). Modeling evolutionary landscapes: Mutational stability, topology, and superfunnels in sequence space. PNAS, 96(19):10689–10694.
[Brenner et al., 2000] Brenner, S., Koehl, P., and Levitt, M. (2000). The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res., 28:254–256.
[Cayrol et al., 2003] Cayrol, C., Doutre, S., and Mengin, J. (2003). On decision problems related to the preferred semantics for argumentation frameworks. LOGCOM, 13(3):377–403.
[Cheung et al., 2005] Cheung, M., Klimov, D., and Thirumalai, D. (2005). Molecular crowding enhances native state stability and refolding rates of globular proteins. Proc Natl Acad Sci USA, 102(13):4753–4758.
[Christendat et al., 2000] Christendat, D., Yee, A., Dharamsi, A., Kluger, Y., Savchenko, A., Cort, J., Booth, V., Mackereth, C., Saridakis, V., Ekiel, I., Kozlov, G., Maxwell, K., Wu, N., McIntosh, L., Gehring, K., Kennedy, M., Davidson, A., Pai, E., Gerstein, M., Edwards, A., and Arrowsmith, C. (2000). Structural proteomics of an archaeon. Nat. Struct. Biol., 7:903–909.
[Cui et al., 2002] Cui, Y., Wong, W. H., Bornberg-Bauer, E., and Chan, H. S. (2002). Recombinatoric exploration of novel folded structures: A heteropolymer-based model of protein evolutionary landscapes. PNAS, 99(2):809–814.
[Daggett and Fersht, 2003] Daggett, V. and Fersht, A. R. (2003). Is there a unifying mechanism for protein folding? Trends in Biochemical Sciences, 28(1):18–25.
[Drawid and Gerstein, 2000] Drawid, A. and Gerstein, M. (2000). A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol, 301(4):1059–1075.
[Dung, 1995] Dung, P. (1995). On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artif. Intell., 77(2):321–357.
[Ellis, 2001] Ellis, R. J. (2001). Macromolecular crowding: an important but neglected aspect of the intracellular environment. Current Opinion in Structural Biology, 11(1):114–119.
[Emberly et al., 2002] Emberly, E. G., Wingreen, N. S., and Tang, C. (2002). Designability of alpha-helical proteins. PNAS, 99(17):11163–11168.
[Endy, 2005] Endy, D. (2005). Foundations for engineering biology. Nature, 438:449–453.
[Fan and Wang, 2003] Fan, K. and Wang, W. (2003). What is the minimum number of letters required to fold a protein? J Mol Biol, 328(4):921–926.
[Fontana and Schuster, 1998] Fontana, W. and Schuster, P. (1998). Continuity in evolution: On the nature of transitions. Science, 280(5368):1451–1455.
[Fox et al., 2001] Fox, J., Glasspool, D., and Bury, J. (2001). Quantitative and qualitative approaches to reasoning under uncertainty in medical decision making. AIME 2001, 2101:272–282.
[Gilbert, 1986] Gilbert, W. (1986). Origin of life: The RNA world. Nature, 319:618.
[Govindarajan et al., 1999] Govindarajan, S., Recabarren, R., and Goldstein, R. (1999). Estimating the total number of protein folds. Proteins, 35(4):408–414.
[Grishin, 2001] Grishin, N. V. (2001). Fold change in evolution of protein structures. Journal of Structural Biology, 134(2-3):167–185.
[Harrison et al., 2001] Harrison, P. M., Chan, H. S., Prusiner, S. B., and Cohen, F. E. (2001). Conformational propagation with prion-like characteristics in a simple model of protein folding. Protein Science, 10:819–835.
[Hartl and Hayer-Hartl, 2002] Hartl, F. U. and Hayer-Hartl, M. (2002). Molecular chaperones in the cytosol: From nascent chain to folded protein. Science, 295(5561):1852–1858.
[Hautaniemi et al., 2005] Hautaniemi, S., Kharait, S., Iwabu, A., Wells, A., and Lauffenburger, D. (2005). Modeling of signal response cascades using decision tree analysis. Bioinformatics, 21(9):2027–2035.
[Hirst, 1999] Hirst, J. D. (1999). The evolutionary landscape of functional model proteins. Protein Engineering, Design and Selection, 12(9):721–726.
[Honda et al., 2004] Honda, S., Yamasaki, K., Sawada, Y., and Morii, H. (2004). 10 residue folded peptide designed by segment statistics. Structure, 12(8):1507–1518.
[Jansen et al., 2003] Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N., Chung, S., Emili, A., Snyder, M., Greenblatt, J., and Gerstein, M. (2003). A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449–453.
[Jefferys et al., 2006] Jefferys, B., Kelley, L., Sergot, M., Fox, J., and Sternberg, M. (2006). Capturing expert knowledge with argumentation: a case study in bioinformatics. Bioinformatics, PMID: 16446279, doi:10.1093/bioinformatics/btl018.
[Kelley et al., 2000] Kelley, L., MacCallum, R., and Sternberg, M. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol., 299(2):499–520.
[Kinjo and Takada, 2003] Kinjo, A. R. and Takada, S. (2003). Competition between protein folding and aggregation with molecular chaperones in crowded solutions: Insight from mesoscopic simulations. Biophysical Journal, 85:3521–3531.
[Kolinski and Skolnick, 2004] Kolinski, A. and Skolnick, J. (2004). Reduced models of proteins and their applications. Polymer, 45(2):511–524.
[Korkegian et al., 2005] Korkegian, A., Black, M., Baker, D., and Stoddard, B. (2005). Computational thermostabilization of an enzyme. Science, 308(5723):857–860.
[Krause et al., 1995] Krause, P., Ambler, S., Elvang-Goransson, M., and Fox, J. (1995). A logic of argumentation for reasoning under uncertainty. Comp. Intell., 11:113–131.
[Krause et al., 1998] Krause, P., Fox, J., Judson, P., and Patel, M. (1998). Qualitative risk assessment fulfils a need. Lecture Notes In Computer Science, 1455:138–156.
[Larson and Pande, 2003] Larson, S. M. and Pande, V. S. (2003). Sequence optimization for native state stability determines the evolution and folding kinetics of a small protein. Journal of Molecular Biology, 332(1):275–286.
[Lenski et al., 2003] Lenski, R. E., Ofria, C., Pennock, R. T., and Adami, C. (2003). The evolutionary origin of complex features. Nature, 423:139–144.
[Li et al., 1996] Li, H., Helling, R., Tang, C., and Wingreen, N. (1996). Emergence of preferred structures in a simple model of protein folding. Science, 273(5275):666–669.
[Li et al., 2000] Li, L., Mirny, L. A., and Shakhnovich, E. I. (2000). Kinetics, thermodynamics and evolution of non-native interactions in a protein folding nucleus. Nature Structural Biology, 7:336–342.
[Lu et al., 2006] Lu, D., Liu, Z., and Wu, J. (2006). Structural transitions of confined model proteins: molecular dynamics simulation and experimental validation. Biophys J., PMID: 16461405.
[Madras and Sokal, 1987] Madras, N. and Sokal, A. D. (1987). Nonergodicity of local, length-conserving Monte Carlo algorithms for the self-avoiding walk. J Stat Phys, 47(3/4):573–595.
[Maier et al., 2005] Maier, T., Ferbitz, L., Deuerling, E., and Ban, N. (2005). A cradle for new proteins: trigger factor at the ribosome. Current Opinion in Structural Biology, 15(2):204–212.
[Marenduzzo et al., 2005] Marenduzzo, D., Hoang, T. X., Seno, F., Vendruscolo, M., and Maritan, A. (2005). Form of growing strings. Phys Rev Lett, 95.
[Metropolis et al., 1953] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., and Teller, A. H. (1953). Equation of state calculations by fast computing machines. J Chem Phys, 21(6):1087–1092.
[Miller et al., 2002] Miller, J., Zeng, C., Wingreen, N. S., and Tang, C. (2002). Emergence of highly designable protein-backbone conformations in an off-lattice model. Proteins: Structure, Function, and Genetics, 47(4):506–512.
[Mirny et al., 1998] Mirny, L. A., Abkevich, V. I., and Shakhnovich, E. I. (1998). How evolution makes proteins fold quickly. PNAS, 95(9):4976–4981.
[Miyazawa and Jernigan, 1996] Miyazawa, S. and Jernigan, R. L. (1996). Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. Journal of Molecular Biology, 256(3):623–644.
[Modgil and Fox, 2004] Modgil, S. and Fox, J. (2004). Review of argumentation technology: State of the art technical and user requirements. ASPIC Consortium.
[Murzin et al., 1995] Murzin, A., Brenner, S., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–540.
[Neidigh et al., 2002] Neidigh, J., Fesinmeyer, R., and Andersen, N. (2002). Designing a 20-residue protein. Nat Struct Biol, 9(6):425–430.
[Nölting and Andert, 2000] Nölting, B. and Andert, K. (2000). Mechanism of protein folding. Proteins: Structure, Function, and Genetics, 41(3):288–298.
[Panchenko et al., 1996] Panchenko, A. R., Luthey-Schulten, Z., and Wolynes, P. G. (1996). Foldons, protein structural modules, and exons. PNAS, 93(3):2008–2013.
[Ping et al., 2003] Ping, G., Yuan, J. M., Vallieres, M., Dong, H., Sun, Z., Wei, Y., Li, F. Y., and Lin, S. H. (2003). Effects of confinement on protein folding and protein stability. The Journal of Chemical Physics, 118(17):8042–8048.
[Ramachandran et al., 1963] Ramachandran, G., Ramakrishnan, C., and Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J Mol Biol, 7:95–99.
[Reed and Grasso, 2001] Reed, C. and Grasso, F. (2001). Computational models of natural language argument. Proc. ICCS 2001, Springer-Verlag, 2073:999–1008.
[Reina et al., 2002] Reina, J., Lacroix, E., Hobson, S., Fernandez-Ballester, G., Rybin, V., Schwab, M., Serrano, L., and Gonzalez, C. (2002). Computer-aided design of a PDZ domain to recognize new target sequences. Nat Struct Biol, 9(8):621–627.
[Rivas et al., 2003] Rivas, G., Ferrone, F., and Herzfeld, J. (2003). Life in a crowded world. EMBO reports, 5(1):23–27.
[Schimmel and Kelley, 2000] Schimmel, P. and Kelley, S. O. (2000). Exiting an RNA world. Nature Structural Biology, 7:5–7.
[Shakhnovich, 1997] Shakhnovich, E. I. (1997). Theoretical studies of protein-folding thermodynamics and kinetics. Current Opinion in Structural Biology, 7(1):29–40.
[Shental-Bechor et al., 2005] Shental-Bechor, D., Kirca, S., Ben-Tal, N., and Haliloglu, T. (2005). Monte Carlo studies of folding, dynamics, and stability in alpha-helices. Biophys J, 88:2391–2402.
[Sikorski and Skolnick, 1990] Sikorski, A. and Skolnick, J. (1990). Dynamic Monte Carlo simulations of globular protein folding. Model studies of in vivo assembly of four helix bundles and four member beta-barrels. J Mol Biol, 215(1):183–198.
[Taketomi et al., 1975] Taketomi, H., Ueda, Y., and Go, N. (1975). Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. Int. J. Pept. Protein Res., 7:445–459.
[Taverna and Goldstein, 2000] Taverna, D. M. and Goldstein, R. A. (2000). The distribution of structures in evolving protein populations. Biopolymers, 53(1):1–8.
[Taverna and Goldstein, 2001a] Taverna, D. M. and Goldstein, R. A. (2001a). Why are proteins marginally stable? Proteins: Structure, Function, and Genetics, 46(1):105–109.
[Taverna and Goldstein, 2001b] Taverna, D. M. and Goldstein, R. A. (2001b). Why are proteins so robust to site mutations? J Mol Biol, 315(3):479–484.
[Tiana et al., 2004] Tiana, G., Shakhnovich, B. E., Dokholyan, N. V., and Shakhnovich, E. I. (2004). Imprint of evolution on protein structures. PNAS, 101(9):2846–2851.
[Toulmin, 1958] Toulmin, S. (1958). The uses of argument. Cambridge University Press, Cambridge, UK.
[Upshur and Colak, 2003] Upshur, R. and Colak, E. (2003). Argumentation and evidence. Theor. Med. Bioeth., 24(4):283–299.
[van den Berg et al., 1999] van den Berg, B., Ellis, R. J., and Dobson, C. M. (1999). Effects of macromolecular crowding on protein folding and aggregation. EMBO Journal, 18:6927–6933.
[Šali et al., 1994] Šali, A., Shakhnovich, E., and Karplus, M. (1994). A lattice model study of the requirements for folding to the native state. Journal of Molecular Biology, 235(5):1614–1638.
[Wilke et al., 2001] Wilke, C. O., Wang, J. L., Ofria, C., Lenski, R. E., and Adami, C. (2001). Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature, 412:331–333.
[Williams et al., 2001] Williams, P. D., Pollock, D. D., and Goldstein, R. A. (2001). Evolution of functionality in lattice proteins. J Mol Graph and Modelling, 19(1):150–156.
[Xia and Levitt, 2002] Xia, Y. and Levitt, M. (2002). Roles of mutation and recombination in the evolution of protein thermodynamics. PNAS, 99(16):10382–10387.
[Xia and Levitt, 2004] Xia, Y. and Levitt, M. (2004). Simulating protein evolution in sequence and structure space. Curr Op Struct Biol, 14(2):202–207.
[Yong Duan, 1998] Duan, Y. and Kollman, P. A. (1998). Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science, 282(5389):740–744.
[Yue et al., 1995] Yue, K., Fiebig, K., Thomas, P., Chan, H., Shakhnovich, E., and Dill, K. (1995). A test of lattice protein folding algorithms. Proc Natl Acad Sci USA, 92(1):325–329.
[Zeldovich et al., 2006] Zeldovich, K. B., Berezovsky, I. N., and Shakhnovich, E. I. (2006). Physical origins of protein superfamilies. J Mol Biol, 357(4):1335–1343.
[Ziv et al., 2005] Ziv, G., Haran, G., and Thirumalai, D. (2005). Ribosome exit tunnel can entropically stabilize alpha-helices. Proc Natl Acad Sci USA, 102(52):18956–18961.