BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference BBC 2015<br />
P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY<br />
PREDICTION<br />
Fabrizio Pucci 1,* , Katrien Bernaerts 1,2 , Fabian Teheux 1 , Dimitri Gilis 1 & Marianne Rooman 1 .<br />
Department of BioModeling, BioInformatics & BioProcesses 1 , Université Libre de Bruxelles, 1050 Brussels, Belgium;<br />
BioBased Materials, Faculty of Humanities and Sciences 2 , Maastricht University, 6200 Maastricht, The Netherlands.<br />
* fapucci@ulb.ac.be<br />
In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we<br />
focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG). First,<br />
we measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms<br />
described in the literature. Then we show how important exploiting the symmetry of the model is for avoiding such bias. Finally,<br />
we briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins, with<br />
the aim of understanding whether the distribution is universal and how much it is biased towards the training dataset.<br />
INTRODUCTION<br />
The accurate prediction of stability changes on a large<br />
scale is still a challenge in protein science. Despite the<br />
large amount of work done in recent years, the results<br />
frequently suffer from hidden biases towards the training<br />
dataset, and this makes the evaluation of the true<br />
performance a difficult task.<br />
Here we study the “bias problem” in the case of the<br />
prediction of protein thermodynamic stability changes<br />
upon point mutations, and more precisely of its best<br />
descriptor ΔΔG, the change in folding free energy<br />
upon mutation from the wild-type protein W to the mutant<br />
M. In principle, the predicted ΔΔG value of the inverse<br />
mutation (M to W) has to be exactly equal to minus the<br />
ΔΔG of the direct mutation (W to M), since the free energy<br />
is a state function.<br />
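Written out, the antisymmetry follows directly from expressing each change as a difference of state values. The symbols ΔG_W and ΔG_M (folding free energies of the wild type and of the mutant) and the sign convention are introduced here for illustration only; the relation holds under either sign convention.

```latex
\begin{align}
\Delta\Delta G(W \to M) &= \Delta G_M - \Delta G_W, \\
\Delta\Delta G(M \to W) &= \Delta G_W - \Delta G_M = -\,\Delta\Delta G(W \to M).
\end{align}
```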
Unfortunately, the asymmetry of the training dataset<br />
towards destabilizing mutations (reflecting the<br />
evolutionary optimization of protein stability) makes the<br />
prediction of inverse mutations less accurate than that<br />
of direct ones. This introduces a series of distortions in<br />
the prediction model, which we analyze here.<br />
METHODS<br />
We computed the ΔΔG value for a set of almost 200<br />
mutations for which the structures of both the wild-type<br />
protein and the mutant are known, using a series of prediction<br />
tools, i.e. PoPMuSiC [1], I-Mutant, FoldX, Duet,<br />
AutoMute, CupSat, Eris and ProSMS. We then computed<br />
the Ratio (RID) of the standard deviation between the<br />
predicted and the experimental values of ΔΔG for the<br />
Inverse mutations to that for the Direct mutations (which<br />
should be one in the case of a perfectly symmetric<br />
prediction) and compared the results of the different<br />
programs.<br />
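The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that the "standard deviation between predicted and experimental values" means the root-mean-square deviation, and that the experimental value of an inverse mutation is minus that of the direct one. All function names and numbers are hypothetical.

```python
import math

def rms_deviation(predicted, experimental):
    """Root-mean-square deviation between predicted and experimental values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))

def rid(pred_direct, pred_inverse, exp_direct):
    """Ratio of the inverse-mutation error to the direct-mutation error."""
    # Assumption: experimental inverse values follow from antisymmetry.
    exp_inverse = [-x for x in exp_direct]
    sigma_direct = rms_deviation(pred_direct, exp_direct)
    sigma_inverse = rms_deviation(pred_inverse, exp_inverse)
    return sigma_inverse / sigma_direct

# Toy numbers invented for illustration only (kcal/mol).
exp_direct = [1.0, 2.0, 0.5, 1.5]
pred_direct = [1.2, 1.8, 0.7, 1.3]         # hypothetical predictions, W -> M
pred_antisym = [-p for p in pred_direct]   # perfectly antisymmetric predictor, M -> W
pred_biased = [0.6, 0.4, 0.9, 0.5]         # predictor that also calls M -> W destabilizing

rid_antisym = rid(pred_direct, pred_antisym, exp_direct)
rid_biased = rid(pred_direct, pred_biased, exp_direct)
```

With these toy inputs the antisymmetric predictor gives a ratio of one, while the predictor that leans towards destabilizing values for both directions gives a ratio well above one.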
If the functional form of the model is known, as in the<br />
case of the artificial neural network of PoPMuSiC, one<br />
can further understand which terms contribute most<br />
to the deviation of the RID from unity and thus propose new<br />
model structures in which the biases are avoided<br />
[2].<br />
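As a toy illustration of how individual terms can break the symmetry, consider a linear model built on feature differences between mutant and wild type. This is not PoPMuSiC's actual functional form; the weights, features and the constant offset below are hypothetical and serve only to show that a term depending on the difference is antisymmetric by construction, whereas a constant term is not.

```python
def ddg_antisymmetric(weights, feat_wild, feat_mutant):
    """Linear model on feature differences: changes sign when W and M are swapped."""
    return sum(w * (m - wt) for w, wt, m in zip(weights, feat_wild, feat_mutant))

def ddg_with_offset(weights, offset, feat_wild, feat_mutant):
    """Same model plus a constant term, which does not change sign under the swap."""
    return offset + ddg_antisymmetric(weights, feat_wild, feat_mutant)

weights = [0.8, -0.3]   # hypothetical fitted weights
feat_w = [1.0, 2.0]     # hypothetical features of the wild-type state
feat_m = [0.5, 2.5]     # hypothetical features of the mutant state

# Direct plus inverse prediction should vanish for an antisymmetric model.
residual_sym = (ddg_antisymmetric(weights, feat_w, feat_m)
                + ddg_antisymmetric(weights, feat_m, feat_w))

# With a constant offset the sum equals twice the offset instead of zero.
offset = 0.4
residual_offset = (ddg_with_offset(weights, offset, feat_w, feat_m)
                   + ddg_with_offset(weights, offset, feat_m, feat_w))
```

Checking the sum of direct and inverse predictions term by term in this way is one simple diagnostic for locating the symmetry-breaking contributions in an explicit model.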
In more black-box machine learning approaches (such as<br />
methods based on Random Forest or Support Vector<br />
Machine), in which the functional form is not explicitly<br />
known, the asymmetry correction is less obvious.<br />
In a second part, we investigated how the symmetry of the<br />
ΔΔG value distribution in the training dataset influences<br />
the prediction of the ΔΔG distribution for all possible<br />
mutations in a series of proteins with known structures.<br />
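Scanning all possible point mutations amounts to enumerating the 19 alternative residues at every position and collecting the predicted values. The sketch below assumes a predictor passed in as a plain callable; `placeholder_predictor` is a stand-in that returns an arbitrary constant and does not represent any of the tools named above. The convention that positive values are destabilizing is also an assumption made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def all_point_mutations(sequence):
    """Yield (position, wild_type, mutant) for every single-site substitution."""
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield position, wild_type, mutant

def scan(sequence, predictor):
    """Collect the predicted value for every point mutation of the sequence."""
    return [predictor(pos, wt, mt) for pos, wt, mt in all_point_mutations(sequence)]

def summarize(values):
    """Return the mean and the fraction of positive values of a non-empty list."""
    n = len(values)
    mean = sum(values) / n
    fraction_positive = sum(1 for v in values if v > 0) / n
    return mean, fraction_positive

def placeholder_predictor(position, wild_type, mutant):
    # Stand-in only: a real scan would call an actual prediction tool here.
    return 1.0

sequence = "ACDK"  # hypothetical four-residue sequence
values = scan(sequence, placeholder_predictor)
mean_ddg, fraction_destabilizing = summarize(values)
```

A sequence of length L yields 19 × L values, so the mean and the fraction of positive values can then be compared between proteins or against the training dataset.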
RESULTS & DISCUSSION<br />
The asymmetry estimated for a<br />
series of available prediction methods gives RID<br />
values between 1 for bias-corrected methods and<br />
about 3 for the most biased programs. These<br />
results show that the correct use of<br />
symmetry in setting up the model structure helps to<br />
avoid unwanted biases towards destabilizing<br />
mutations.<br />
Furthermore, the distribution of the ΔΔG values for all<br />
point mutations in some proteins has been analyzed<br />
and shows a dependence on the ΔΔG distribution<br />
of the training dataset when the RID deviates<br />
significantly from one. Understanding the<br />
relation between the two distributions is an<br />
important step towards assessing the universality of the<br />
distribution [3] and how far proteins are<br />
optimized to minimize the impact of single-site<br />
amino acid substitutions.<br />
REFERENCES<br />
[1] Y. Dehouck, J. M. Kwasigroch, D. Gilis, M. Rooman (2011),<br />
PoPMuSiC 2.1: a web server for the estimation of protein<br />
stability changes upon mutation and sequence optimality. BMC<br />
Bioinformatics 12, 151.<br />
[2] F. Pucci, K. Bernaerts, F. Teheux, D. Gilis, M. Rooman (2015), Symmetry<br />
Principles in Optimization Problems: an application to Protein<br />
Stability Prediction. IFAC-PapersOnLine 48-1, 458-463.<br />
[3] N. Tokuriki, F. Stricher, J. Schymkowitz, L. Serrano, D. S. Tawfik (2007), The<br />
stability effects of protein mutations appear to be universally<br />
distributed. J Mol Biol 369, 1318-1332.<br />