bbc 2015

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
10th Benelux Bioinformatics Conference
Abstract ID: O2
Oral presentation

O2. PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA

Andrea M. Gazzo (1,2,3,*), Dorien Daneels (1,3), Maryse Bonduelle (3), Sonia Van Dooren (1,3), Guillaume Smits (1,4) & Tom Lenaerts (1,2,5).

(1) Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium
(2) MLG, Département d'Informatique, Université Libre de Bruxelles, Brussels, Belgium
(3) Center for Medical Genetics, Reproduction and Genetics, Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussel, Belgium
(4) Genetics, Hôpital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles, Brussels, Belgium
(5) Computerwetenschappen, Vrije Universiteit Brussel, Brussel, Belgium
(*) Andrea.Gazzo@ulb.ac.be

Recent research has shown that some disorders may be better described by more complex inheritance mechanisms, suggesting that some monogenic diseases may in fact be oligogenic. Understanding how the combined interplay and weight of variants lead to disease may provide improved and novel insights into diseases classically considered monogenic. Here we present a classification method that separates two types of digenic diseases: those that require variants in both genes to induce the disease, and those where one variant is causative and the second increases the severity. Our results show that the two classes can be clearly separated using gene-level and variant-level features extracted from DIDA.

INTRODUCTION

DIDA is a novel database that provides, for the first time, detailed information on genes and associated genetic variants involved in digenic diseases, the simplest form of oligogenic inheritance [1]. The database is accessible via http://dida.ibsquare.be and currently includes 213 digenic combinations involved in 44 different digenic diseases [2].
These combinations are composed of 364 distinct variants, distributed over 136 distinct genes. Creating this new repository was essential, as current databases do not allow one to retrieve detailed records on digenic combinations. Genes, variants, diseases and digenic combinations in DIDA are annotated with manually curated information and with information mined from other online resources. Each digenic combination was categorized into one of two effect classes: either "on/off", in which variant combinations in both genes are required to develop the disease, or "severity", in which variants in one gene are enough to develop the disease and carrying variant combinations in two genes increases the severity or affects the age of onset.

In this work we present a predictor capable of distinguishing between the digenic effect classes. We analyse the results of this predictor in relation to specific features collected for the different digenic combinations in DIDA, such as the haploinsufficiency of the genes, their zygosity and the relationship between them, providing insight into the biological meaning of the results.

METHODS

We used a machine learning approach to determine the class, "severity" or "on/off", of a digenic combination. Starting with feature selection, we chose the most informative features for assigning a digenic combination to one of the two classes. For each of the two genes involved in a digenic combination, the predictor uses the following gene-level features: zygosity (heterozygous, homozygous, etc.), recessiveness probability, haploinsufficiency score, known recessive information, and whether the gene is essential (based on mouse knockout experimental data). At the variant level, we used the pathogenicity predictions from the SIFT and PolyPhen-2 tools as features. Finally, we also encode the relationship between the two genes, using three relations: "Similar function", "Directly interacting" and "Pathway membership".
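The feature set listed above could be turned into a numeric vector roughly as follows. This is only a minimal sketch: the abstract does not say how the features were encoded, so the category vocabularies, the `GeneFeatures` class, the field names and the one-variant-per-gene simplification are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical category vocabularies; the abstract names these features
# but does not specify their encoding.
ZYGOSITY = ["heterozygous", "homozygous", "other"]
RELATIONS = ["similar_function", "directly_interacting", "pathway_membership"]


@dataclass
class GeneFeatures:
    """Features for one gene of a digenic combination (assumed layout)."""
    zygosity: str               # one of ZYGOSITY
    recessiveness_prob: float   # recessiveness probability
    haploinsufficiency: float   # haploinsufficiency score
    known_recessive: bool       # known recessive information
    essential: bool             # essential according to mouse knockout data
    sift: float                 # SIFT prediction (one variant per gene assumed)
    polyphen2: float            # PolyPhen-2 prediction (same assumption)


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """Return a 0/1 indicator vector marking `value` within `vocabulary`."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_gene(g: GeneFeatures) -> List[float]:
    """Flatten one gene's features into a list of floats."""
    return one_hot(g.zygosity, ZYGOSITY) + [
        g.recessiveness_prob,
        g.haploinsufficiency,
        float(g.known_recessive),
        float(g.essential),
        g.sift,
        g.polyphen2,
    ]


def encode_combination(a: GeneFeatures, b: GeneFeatures,
                       relations: Iterable[str]) -> List[float]:
    """Concatenate both genes' features with one flag per gene-gene relation."""
    present = set(relations)
    relation_flags = [1.0 if r in present else 0.0 for r in RELATIONS]
    return encode_gene(a) + encode_gene(b) + relation_flags
```

Several relations can hold at once for a gene pair, which is why the relations are encoded as independent flags rather than as a single category.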
After testing different approaches, we decided to use a random forest algorithm, as it gave the best results.

RESULTS & DISCUSSION

With 10-fold cross-validation we obtained promising performance, with an MCC of 0.67 and an AUROC of 0.92. Unfortunately, this performance is an overestimate: because the gene-based features are the most important ones, many examples with mutations mapped to the same gene pair lead to the same oligogenic effect class. A stratification was therefore required to ensure that the same pair of genes never appears in both the training set and the test set. We manually created five subsets such that all instances with the same gene pair belong to the same subset. After this procedure we assessed the performance again, obtaining an MCC of 0.36 and an AUROC of 0.78.

To verify the significance of this performance, we retrained the random forest on a randomization of the data. This randomization was obtained by shuffling all the features for each instance while keeping the class unchanged. The reshuffling resulted in an MCC close to zero and an AUROC close to 0.5, as expected. This additional test confirms the significance of the stratified results.

As a next step, we are analysing the relationship between the oligogenic effect and the features used, particularly in terms of biological and molecular interpretation. As a future perspective, the potential clinical benefit is very promising: one goal of medical genetics is to assign predictive value to the genotype, so that it can assist in diagnosis and disease management. If we can infer, based on the genotype, what the digenic/oligogenic effect will be, we can potentially anticipate the treatment.

REFERENCES

[1] Gazzo, A. et al., DIDA: a curated and annotated digenic diseases database, under review for the NAR Database Issue (2016).
[2] Schäffer, A. A. (2013) Digenic inheritance in medical genetics. J. Med. Genet., 50, 641–652.
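The gene-pair stratification described in the O2 results can be sketched as follows. The authors state that they built their five subsets manually; the greedy balancing below is a swapped-in, automated stand-in, and the function names and the order-insensitive treatment of a gene pair are assumptions. The `mcc` helper implements the standard Matthews correlation coefficient for binary labels.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def gene_pair_folds(gene_pairs: Sequence[Tuple[str, str]],
                    n_folds: int = 5) -> List[List[int]]:
    """Split instance indices into folds so that all instances sharing a
    gene pair end up in the same fold (pair order is ignored)."""
    groups = defaultdict(list)
    for idx, (gene_a, gene_b) in enumerate(gene_pairs):
        groups[frozenset((gene_a, gene_b))].append(idx)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest instances.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for 0/1 labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Each fold then serves once as the test set while the remaining folds form the training set. Libraries such as scikit-learn offer comparable group-aware splitters (for example `GroupKFold`), which may be preferable in practice.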
Abstract ID: O3
Oral presentation

O3. A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS FOR GENE EXPRESSION DATA

Wouter Saelens (1,2,*), Robrecht Cannoodt (1,2,3), Bart N. Lambrecht (1,2) & Yvan Saeys (1,2).

(1) VIB Inflammation Research Center
(2) Department of Respiratory Medicine, Ghent University
(3) Center for Medical Genetics, Ghent University Hospital
(*) wouter.saelens@ugent.be

Module detection is central to every analysis of large-scale gene expression data. While numerous methods have been developed, the relative merits and drawbacks of these different approaches are still unclear. In this work we use known gene regulatory networks to perform an unbiased comparison of 41 module detection methods, spanning clustering, biclustering, decomposition, direct network inference and iterative network inference. This analysis showed that decomposition methods outperform current clustering methods. Our work provides a first comprehensive evaluation to guide biologists in their choice of method, and also serves as a protocol for the evaluation of novel module detection methods.

INTRODUCTION

Module detection methods form a cornerstone in the analysis of genome-wide gene expression compendia. Modules in this context are defined as groups of genes with a similar expression profile, which therefore frequently share certain functions, are co-regulated and cooperate to produce a certain phenotype. Over the last years, dozens of module detection methods have been developed, which can be classified into five different categories. The most popular approach is undoubtedly clustering, which groups genes into modules based on global similarity in expression profiles. Within the transcriptomics community, clustering methods have received a considerable amount of criticism.
This is mainly due to three drawbacks: (i) clustering cannot detect so-called local co-expression effects, (ii) most clustering methods are unable to detect overlapping modules and (iii) clustering methods do not model the underlying gene regulatory network. Alternative approaches have therefore been developed which either handle both overlap and local co-expression (biclustering and decomposition) or model the gene regulatory network (direct network inference and iterative network inference). Given this methodological diversity, it is important that existing and new approaches are evaluated on robust and objective benchmarks. However, past evaluation studies were limited in the number of methods, used synthetic data or did not correctly assess the balance between false positives and false negatives. In this study we therefore provide a novel, unbiased and comprehensive evaluation strategy (Figure 1), and use it to evaluate 41 state-of-the-art module detection methods.

METHODS

The key to our approach is that we use gold standard regulatory networks to define sets of known modules. These can be used to directly assess the sensitivity and specificity of the different module detection methods. We used four different large-scale gene expression compendia, two from E. coli and two from S. cerevisiae. For each of these organisms a substantial part of the regulatory network is already known, based either on the integration of small-scale experiments or on large, genome-wide datasets. We use these networks to define groups of known modules by looking at genes which either share one regulator, share all regulators or are strongly interconnected. We used four different metrics to compare a set of observed modules with known modules: recovery and recall control the type II errors, while relevance and specificity control the type I errors.

Parameter tuning is a necessary but often overlooked challenge of module detection methods.
Because the default parameters of a tool are usually optimized by its authors for some specific test cases, they do not necessarily perform well on other datasets. On the other hand, one should be careful not to overfit parameters to specific characteristics of the data, as such parameter settings will lead to suboptimal results when reused on other datasets. In this study we first optimized parameters using a grid-based approach. Next, to avoid overfitting, we used the optimal parameters from one dataset to score the performance on another dataset, in an approach akin to cross-validation.

RESULTS & DISCUSSION

We evaluated 41 different module detection methods covering all five approaches. Overall, our analysis showed that certain decomposition methods, namely those based on independent component analysis, outperform current state-of-the-art clustering methods. However, despite their theoretical advantages, neither biclustering nor network inference methods are able to outperform clustering methods. Importantly, our results are stable across datasets, module definitions and scoring metrics, demonstrating the robustness of our evaluation methodology.

FIGURE 1. Overview of our evaluation methodology.

The applications of our work are twofold. First, if local co-expression and overlap are of interest, we discourage the use of biclustering methods and suggest the use of decomposition instead. Second, we provide a new comprehensive evaluation methodology which can be used to compare novel methods with the current state-of-the-art.
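For abstract O3, the comparison of observed modules against known modules derived from a regulatory network can be illustrated with a small sketch. The abstract does not define its metrics, so this assumes the commonly used best-match Jaccard formulation of recovery and relevance; it covers only the "share one regulator" module definition, and all function names and the `min_size` cut-off are hypothetical. Recall and specificity are not covered.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence, Set, Tuple


def modules_by_shared_regulator(edges: Iterable[Tuple[str, str]],
                                min_size: int = 2) -> List[Set[str]]:
    """Build one known module per regulator from (regulator, target) edges,
    keeping only modules with at least `min_size` target genes."""
    targets = defaultdict(set)
    for regulator, target in edges:
        targets[regulator].add(target)
    return [genes for genes in targets.values() if len(genes) >= min_size]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two gene sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_score(queries: Sequence[Set[str]],
                     references: Sequence[Set[str]]) -> float:
    """Average, over `queries`, of the best Jaccard overlap with any
    module in `references`."""
    if not queries or not references:
        return 0.0
    total = sum(max(jaccard(q, r) for r in references) for q in queries)
    return total / len(queries)


def recovery(known: Sequence[Set[str]], observed: Sequence[Set[str]]) -> float:
    """How well each known module is found: penalizes missed modules."""
    return best_match_score(known, observed)


def relevance(known: Sequence[Set[str]], observed: Sequence[Set[str]]) -> float:
    """How well each observed module matches a known one: penalizes
    spurious modules."""
    return best_match_score(observed, known)
```

Reporting both scores matters because a method can trivially raise one at the expense of the other, for instance by returning very few, very safe modules.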