bbc 2015

Recommendations

Info

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015 Abstract ID: P Poster 10th Benelux Bioinformatics Conference bbc 2015 P44. ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC APPROACH Rudy Pelicaen, Koen Illeghems, Luc De Vuyst, and Stefan Weckx * . Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium. *Stefan.Weckx@vub.ac.be Acetobacter ghanensis LMG 23848 T and Acetobacter senegalensis 108B are acetic acid bacteria species that originate from a spontaneous cocoa bean heap fermentation process. They have been indicated as strains with interesting functionalities through extensive metabolic and kinetic studies. Whole-genome sequencing of A. ghanensis LMG 23848 T and A. senegalensis 108B allowed to unravel their genetic adaptations to the cocoa bean fermentation ecosystem. INTRODUCTION Fermented dry cocoa beans are the basic raw material for chocolate production. The cocoa pulp-bean mass contents of the cocoa pods undergo, once taken out of the pods, a spontaneous fermentation process that lasts four to six days. This process is characterised by a succession of yeasts, lactic acid bacteria (LAB), and acetic acid bacteria (AAB) coming from the environment (De Vuyst et al., 2015). METHODS Total genomic DNA isolation and purification of A. ghanensis LMG 23848 T and A. senegalensis 108B was followed by the construction of an 8-kb paired-end library, 454 pyrosequencing, and assembly of the sequence reads using the GS De Novo Assembler version 2.5.3 with default parameters. Genome finishing was performed by PCR assays to close gaps in the draft assembly using CONSED 23.0. Automated gene prediction and annotation of the assembled genome sequences were carried out using the bacterial genome sequence annotation platform GenDB v2.2 (Meyer et al., 2003). The predicted genes were functionally characterised using searches in public databases and bioinformatics tools, and annotations were manually curated. Comparative analysis of the genome sequences of the cocoa-derived strains A. ghanensis LMG 23848 T (this study), A. senegalensis 108B (this study), and A. pasteurianus 386B (Illeghems et al., 2013) was accomplished by the EDGAR framework (Blom et al., 2009). RESULTS & DISCUSSION The genomes of the strains investigated consisted of a circular chromosomal DNA sequence with a size of 2.7 Mbp and two plasmids for A. ghanensis LMG 23848 T and a circular chromosomal DNA sequence with a size of 3.9 Mbp and one plasmid for A. senegalensis 108B (Figure 1). Comparative analysis revealed that the order of orthologous genes was highly conserved between the genome sequences of A. pasteurianus 386B and A. ghanensis LMG 23848 T . Evidence was found that both species possessed the genetic ability to be involved in citrate assimilation and they displayed adaptations in their respiratory chain. As is the case for many AAB, the missing gene encoding phosphofructokinase in the genome sequences of both A. ghanensis LMG 23848 T and A. senegalensis 108B resulted in a non-functional upper part of the Embden–Meyerhof–Parnas pathway. However, the presence of genes coding for membrane-bound PQQdependent dehydrogenases enabled the AAB strains examined to rapidly oxidise ethanol into acetic acid. Furthermore, an alternative TCA cycle, characterised by genes coding for a succinyl-CoA:acetate-CoA transferase and a malate:quinone oxidoreductase, was present. Furthermore, evidence was found in both genome sequences that glycerol, mannitol and lactate could be used as energy sources. Thus, although both species displayed genetic adaptations to the cocoa bean fermentation process, their dependence on glycerol, mannitol and lactate may partly explain their low competitiveness during cocoa bean fermentation processes, as these substrates have to be formed through yeast or LAB activities, respectively. FIGURE 1. Graphical representation of the genomes of A. ghanensis LMG 23848 T (A) and A. senegalensis 108B (B). REFERENCES Blom, J., Albaum, S., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., Goesmann, A., 2009. EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics 10, 1-14. De Vuyst, L., Weckx, S., 2015. The functional role of lactic acid bacteria in cocoa bean fermentation. In: Mozzi, F., Raya, R.R., Vignolo, G.M. (Eds.). Biotechnology of Lactic Acid Bacteria: Novel Applications. Wiley-Blackwell, Ames, IA, USA. In press.Illeghems, K., De Vuyst, L., Weckx, S., 2013. Complete genome sequence and comparative analysis of Acetobacter pasteurianus 386B, a strain well-adapted to the cocoa bean fermentation ecosystem. BMC Genomics 14, 526. Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., et al., 2003. GenDB - an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. 31, 2187-2195. 88
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015 Abstract ID: 000 Category: Abstract template 10th Benelux Bioinformatics Conference bbc 2015 P45. REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION Konstantinos Pliakos 1* , Isaac Triguero 2,3 , Dragi Kocev 4 & Celine Vens 1 . Department of Public Health and Primary Care, KU Leuven Kulak 1 ; Department of Respiratory Medicine, Ghent University 2 ; Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center 3 ; Department of Knowledge Technologies, Jožef Stefan Institute 4 . * konstantinos.pliakos@kuleuven-kulak.be We present a short study on gene function prediction datasets, revealing an existing issue of non-unique feature representation, as well as the effect of this issue on hierarchical multi-label classification algorithms. INTRODUCTION This study focuses on hierarchical multi-label classification (HMC). HMC is a variant of classification where one sample can be assigned to several classes simultaneously. It differs though from multi-label classification as these classes are organized in a hierarchy. That means that a sample belonging to a class automatically belongs to all its super-classes. Typical HMC tasks include gene function prediction or text classification. Here, we focus on the former. A typical characteristic of genes is that they can be described in several ways: using information about their sequence, homology to well-characterized genes, expression profiles, secondary structure of their derived proteins, etc. The HMC community has multiple research datasets at its disposal on gene functions (e.g., (Vens et al., 2008) or (Schietgat et al., 2010)), each representing genes by one type of features. Indisputably, researchers should get advantage of this amount of data but the question arises how “good” these datasets are. How discriminant are the features describing a gene? Here, a short study is trying to display existing data-related problems and give answers to the aforementioned questions. DATA STUDY & RESULTS After careful experimentation on various publicly available datasets it was noted that some of them suffer from large amount of duplicate feature vectors. The irrational behind this occurrence is that there are genes, which despite having different functions, have exactly the same feature representation. The table below lists the aforementioned problem in the 20 gene function prediction datasets described in (Vens et al., 2008) and (Schietgat et al., 2010). Organism Dataset Nb of genes Nb of unique gene representations S. cerevisiae church 3755 2352 pheno 1591 514 hom 3854 3646 seq 3919 3913 struc 3838 3785 A. thaliana scop 9843 9415 struc 11763 11689 TABLE 1. Datasets, the number of genes and their unique representations. As it is displayed, the church (micro-array expression) and the pheno (phenotype features) datasets suffer the most. More specifically, in pheno dataset the 67.7% of the gene representations are duplicates. The most frequent feature vector appears 315 times, 197 times in the training set and 118 times in the test set. Due to this, 20% of the 582 test examples will give the same feature vector as input for prediction. In a decision tree model, for example, these genes will end up in the same leaf, receive the same prediction (the average class vector of 197 training examples), but receive a different error term as they are a priori associated with a different class label-set. In the training phase, there may still be a lot of variation in the class vectors of the 197 genes, but no split exists to separate them. In the Church dataset, the 3755 genes correspond to only 2352 unique feature descriptors. In Hom or Struc datasets the number of the duplicates is lower but still impressive, considering the enormous size of the feature vectors in these datasets. For evaluation purposes, ML-KNN (Zhang M. L et al., 2007) was employed to demonstrate the effect of the studied problem on the average precision for the FunCat annotated datasets. Here, “unique” refers to the datasets occurring after removing all the duplicates. Thus, any feature vector can only once be included in a gene’s neighbour set. We report the average of 10 “unique” versions, each one using a different gene’s class label as ground truth for the feature vector. Dataset K= 1 K = 5 K = 17 Train Test (5cv) Train Test (5cv) Train Test (5cv) pheno initial 51.59 23.62 39.55 24.14 32.76 23.59 unique 100 24.21 55.62 24.90 39.70 25.01 hom initial 98.30 39.32 63.64 39.45 48.96 37.28 unique 100 39.14 64.64 39.67 49.28 37.53 TABLE 2. Average Precision rates (%) using ML-KNN. The table shows that the less discriminant feature representation can affect the ML-KNN and decrease the precision of multi-label classification. Indisputably, it could be concluded that the same problem will be more obvious or even completely disastrous for two-class or multi-class classification problems. CONCLUSION The major point of this study was to inform the research community of the relatively low representational power of the features present in some widely used gene function prediction datasets, making them even more difficult and challenging datasets from machine learning perspective. We observed the same issue in datasets of other HMC application domains like text categorization. REFERENCES Zhang M. L. & Zhou Z. H. ML-KNN: A lazy learning approach to multi-label learning, Pattern recognition 40, 2038-2048, (2007). Vens C. et al. Decision trees for hierarchical multi-label classification, Machine Learning 73, 185-214, (2008). Schietgat L. et al. Predicting gene function using hierarchical multi-label decision tree ensembles, BMC Bioinformatics 11, (2010). 89
Page 1 and 2:
10 th Benelux Bioinformatics Confer
Page 3 and 4:
10th Benelux Bioinformatics Confere
Page 5 and 6:
Page 7 and 8:
Page 9 and 10:
Page 11 and 12:
Page 13 and 14:
Page 15 and 16:
Page 17 and 18:
Page 19 and 20:
BeNeLux Bioinformatics Conference -
Page 21 and 22:
Page 23 and 24:
Page 25 and 26:
Page 27 and 28:
Page 29 and 30:
Page 31 and 32:
Page 33 and 34:
Page 35 and 36:
Page 37 and 38: BeNeLux Bioinformatics Conference -
Page 87: BeNeLux Bioinformatics Conference -
Page 115: 10th Benelux Bioinformatics Confere
show all

bbc 2015

Create successful ePaper yourself

Delete template?

Save as template?