bbc 2015

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P
Poster
10th Benelux Bioinformatics Conference bbc 2015

P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY PREDICTION

Fabrizio Pucci 1,*, Katrien Bernaerts 1,2, Fabian Teheux 1, Dimitri Gilis 1 & Marianne Rooman 1.
Department of BioModeling, BioInformatics & BioProcesses 1, Université Libre de Bruxelles, 1050 Brussels, Belgium; BioBased Materials, Faculty of Humanities and Sciences 2, Maastricht University, 6200 Maastricht, The Netherlands.
* fapucci@ulb.ac.be

In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG). First, we measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms described in the literature. Then we show how important it is to exploit the symmetry of the model to avoid bias. Finally, we briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins, with the aim of understanding whether the distribution is universal and how much it is biased towards the training dataset.

INTRODUCTION
The accurate prediction of stability changes on a large scale is still a challenge in protein science. Despite the large amount of work done in recent years, the results frequently suffer from hidden biases towards the training dataset, which makes it difficult to evaluate the real performance. Here we study the “bias problem” in the case of the prediction of protein thermodynamic stability changes upon point mutations, and more precisely of its best descriptor ΔΔG, the change of folding free energy upon mutation from the wild-type protein W to the mutant M.
In principle, the predicted ΔΔG value of the inverse mutation (M to W) has to be exactly equal to minus the ΔΔG of the direct mutation (W to M), since the free energy is a state function. Unfortunately, the asymmetry of the training dataset towards destabilizing mutations (reflecting the evolutionary optimization of protein stability) makes the prediction of inverse mutations less accurate than that of direct ones. This introduces a series of distortions in the prediction model, which we analyze here.

METHODS
We computed the ΔΔG value for a set of almost 200 mutations for which the structures of both the wild-type and the mutant protein are known, using a series of prediction tools: PoPMuSiC [1], I-Mutant, FoldX, Duet, AutoMute, CupSat, Eris and ProSMS. We then computed the Ratio (RID) of the standard deviation between the predicted and the experimental ΔΔG values for the Inverse mutations to that for the Direct mutations (which should be one for a perfectly symmetric prediction) and compared the results of the different programs. If the functional structure of the model is known, as in the case of the artificial neural network of PoPMuSiC, one can further determine which terms contribute most to the deviation of the RID from unity, and thus propose new model structures in which the biases are avoided [2]. In more black-box machine learning approaches (such as methods based on Random Forests or Support Vector Machines), in which the functional form is not explicitly known, the asymmetry correction is less obvious.

In a second part, we investigated how the symmetry of the ΔΔG distribution in the training dataset influences the predicted ΔΔG distribution for all possible mutations in a series of proteins with known structures.

RESULTS & DISCUSSION
The asymmetry estimated for a series of available prediction methods yields RID values between 1 for bias-corrected methods and about 3 for the most biased programs.
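
The RID described in the Methods can be illustrated with a short Python sketch. It rests on assumptions that are not stated in the abstract: the "standard deviation between predicted and experimental values" is read here as a root-mean-square deviation, the experimental ΔΔG of each inverse mutation is taken as minus that of the direct mutation, and the function names (rmsd, rid) and all numbers are hypothetical toy values rather than data from the study.

```python
import math


def rmsd(predicted, experimental):
    """Root-mean-square deviation between predicted and experimental values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def rid(pred_direct, pred_inverse, exp_direct):
    """Ratio of the inverse-mutation deviation to the direct-mutation deviation.

    Assumption: the experimental value of an inverse mutation (M to W) is
    minus the experimental value of the direct mutation (W to M).
    Raises ZeroDivisionError if the direct predictions are exact.
    """
    exp_inverse = [-x for x in exp_direct]
    return rmsd(pred_inverse, exp_inverse) / rmsd(pred_direct, exp_direct)


# Invented toy values (kcal/mol), for illustration only.
exp_direct = [1.0, 2.0, 0.5, 1.5]
pred_direct = [1.2, 1.8, 0.7, 1.3]

# A perfectly antisymmetric predictor returns minus the direct prediction.
symmetric_inverse = [-p for p in pred_direct]

# A biased predictor pushes inverse mutations towards destabilizing values.
biased_inverse = [-0.4, -1.4, 0.1, -0.9]

print("RID, symmetric predictor:", rid(pred_direct, symmetric_inverse, exp_direct))
print("RID, biased predictor:", rid(pred_direct, biased_inverse, exp_direct))
```

In this toy setting the antisymmetric predictor gives a ratio of one, while the biased predictor gives a larger ratio because its inverse-mutation errors are larger than its direct-mutation errors.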

These results show that the correct use of symmetry in setting up the model structure helps to avoid unwanted biases towards destabilizing mutations. Furthermore, the distribution of the ΔΔG values for all point mutations in some proteins was analyzed and showed a dependence on the ΔΔG distribution of the training dataset when the RID deviates significantly from one. Understanding the relation between the two distributions is an important step towards assessing the universality of the distribution [3] and how much proteins are optimized to minimize the impact of single-site amino acid substitutions.

REFERENCES
[1] Y. Dehouck, J. M. Kwasigroch, D. Gilis, M. Rooman (2011), PoPMuSiC 2.1: a web server for the estimation of the protein stability changes upon mutation and sequence optimality. BMC Bioinformatics 12, 151.
[2] F. Pucci, K. Bernaerts, F. Teheux, D. Gilis, M. Rooman (2015), Symmetry Principles in Optimization Problems: an application to Protein Stability Prediction. IFAC-PapersOnLine 48-1, 458-463.
[3] N. Tokuriki, F. Stricher, J. Schymkowitz, L. Serrano, D. S. Tawfik (2007), The stability effects of protein mutations appear to be universally distributed. J Mol Biol 356, 1318-1332.

P47. MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF THEIR DELETERIOUS EFFECTS

Daniele Raimondi 1,2,3,4, Andrea Gazzo 1,2, Marianne Rooman 1,6, Tom Lenaerts 1,2,5 & Wim Vranken 1,2,3,4.
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium 1; Machine Learning group, Université Libre de Bruxelles, Brussels, 1050, Belgium 2; Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, 1050, Belgium 3; Structural Biology Research Centre, VIB, Brussels, 1050, Belgium 4; Artificial Intelligence lab, Vrije Universiteit Brussel, Brussels, 1050, Belgium 5; 3BIO-BioInfo group, Université Libre de Bruxelles, Brussels, 1050, Belgium 6.
* daniele.raimondi@vub.ac.be

The increasing availability of genome sequence data has led to the development of predictors that are capable of identifying the likely phenotypic effects of Single Nucleotide Variants (SNVs) or short inframe Insertions or Deletions (INDELs). Most of these predictors focus on SNVs and use a combination of features related to sequence conservation and biophysical and/or structural properties to link the observed variant to either a neutral or a disease phenotype. Despite notable successes, the mapping between genetic alterations and phenotypic effects is riddled with levels of complexity that are not yet fully understood and that are often not taken into account in the predictions. A better multi-level molecular and functional contextualization of both the variant and the protein may therefore significantly improve the predictive quality of variant-effect predictors.

INTRODUCTION
The phenotypic interpretation at the organism level of protein-level alterations is the ultimate goal of the variant-effect prediction field.
This causal relationship is still far from being completely understood and is confounded by many aspects related to the intrinsic complexity of cell life. A crucial restriction of variant-effect prediction is that an alteration of the protein’s molecular phenotype, even if it is a sine qua non condition for the disease phenotype in the carrier individual, may not in itself constitute a sufficient cause for the disease: this also depends on the particular role that the affected protein plays in the well-being of the organism. Even the most commonly used features, which relate evolutionary constraints to likely functional damage, offer only a partial correlation with the pathogenicity of the variant. Consequently, additional information that bridges the variant-phenotype gap is crucial to improve variant-effect predictions.

METHODS
We address the inherently complex variant-effect prediction problem through the integration of different sources of information. By describing each (protein, variant) pair from different perspectives corresponding to different levels of contextualisation, we assembled the most relevant and accessible pieces of information that are currently available, with the aim of elucidating the fuzzy and complex mapping between molecular-level alterations and the individual-level phenotypic outcome. We use three variant-oriented features with different characteristics: the log-odds ratio (LOR) score and the Conservation index (CI) [1], which are column-wise measures of the conservation of a mutated column within a multiple-sequence alignment (MSA), and the PROVEAN [2] predictions (PROV), which provide a sequence-wide measure of the change in evolutionary distance between the mutated target protein and close functional homologs that correlates with the deleteriousness of variants.
The protein-oriented features use pathway [4] and protein-protein interaction network information [5] (DGR) as well as genetic and clinical information, for instance an evaluation of how tolerant the affected genes are to homozygous loss-of-function mutations (REC) [3].

RESULTS & DISCUSSION
DEOGEN is our novel variant-effect predictor that can natively handle both SNVs and inframe INDELs. By integrating information from different biological scales and mimicking the complex mixture of effects that lead from the variant to the phenotype, we obtain significant improvements in the variant-effect prediction results. Next to the typical variant-oriented features based on the evolutionary conservation of the mutated positions, we added a collection of protein-oriented features that are based on functional aspects of the affected gene. We cross-validated DEOGEN on 36825 polymorphisms, 20821 deleterious SNVs and 1038 INDELs from SwissProt.

Method              Missing SNVs  Sen  Spe  Pre  Bac  MCC
PROVEAN                      0.0   78   79   68   79   56
SIFT                         2.0   85   69   61   77   52
Mutation Assessor            0.6   85   71   63   78   54
PolyPhen2 (HumDiv)           4.0   89   63   57   76   50
CADD                         7.0   82   75   66   78   55
EFIN                         0.0   86   80   87   83   64
MutationTaster              20.7   86   75   69   81   60
GERP++                      20.7   97   24   45   61   28
DEOGEN                       4.4   77   92   85   84   71

FIGURE 1. Comparison of the performance of 8 variant-effect predictors with DEOGEN on the Humsavar 2013 dataset (Sen: sensitivity; Spe: specificity; Pre: precision; Bac: balanced accuracy; MCC: Matthews correlation coefficient).

REFERENCES
[1] Calabrese, R. et al. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum. Mutat. 30, 1237-44 (2009).
[2] Choi, Y. et al. Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688 (2012).
[3] MacArthur, D. G. et al. A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes. Science 335 (6070), 823-828 (2012).
[4] Kamburov, A. et al. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Research 39, D712-717 (2011).