
10th Benelux Bioinformatics Conference

bbc 2015

December 7 - 8, 2015

Antwerp, Belgium

www.bbc2015.be


PROCEEDINGS

December 7 and 8, 2015

Antwerp, Belgium

Elzenveld, Lange Gasthuisstraat 45, 2000 Antwerp, Belgium


Welcome to the 10th Benelux Bioinformatics Conference!

Dear attendee,

It is our great pleasure to welcome you to the 10th Benelux Bioinformatics Conference in Antwerp (Belgium)! We are especially proud to host this conference, for the first time ever, in Antwerp, the diamond city.

Ten years of BBC is worth celebrating. The meeting has always struck the right balance between strengthening the regional network and offering a scientifically strong program. Since its inception 10 years ago, the BBC has been a prominent platform for the thriving regional bioinformatics community to present its latest research. Many young bioinformatics scientists gained their first experience presenting a poster or giving a talk at one of the BBC editions, and the conference has always attracted a healthy mix of presenters and attendees from all career stages and diverse backgrounds.

The program of this year's edition again demonstrates the wide range of life science disciplines in which bioinformatics plays a key role nowadays. First, we are delighted to introduce two eminent keynote speakers: Cedric Notredame (Center for Genomic Regulation) and Lars Juhl Jensen (Novo Nordisk Foundation Center for Protein Research). Second, a program committee of 36 scientists has critically reviewed a large number of submissions and selected 24 authors to deliver an oral presentation. In addition, we have two special corporate talks. Furthermore, we again have a large number of poster presentations that promise a very interactive poster session, and our corporate sponsors present their activities at their respective booths. Last but not least, our special guest Pierre Rouzé will give us a perspective on the history of bioinformatics and 10 years of Benelux Bioinformatics Conferences.

For this edition, we would like to congratulate the 10 students (mostly master's students) who were selected from a large pool of submissions to receive a student fellowship. For many of them it is their first chance to actively participate in a scientific conference, and we hope that it inspires them in their future bioinformatics careers.

The program also includes plenty of opportunities for social interaction and networking. The conference dinner, the coffee and lunch breaks and the farewell drink are perfect occasions to strengthen the network even further.

We cannot close this foreword without a heartfelt word of thanks to the many people who made this event possible. Thanks to the sponsors for their crucial support, to the keynote speakers and all other presenters for presenting their work, to the program committee for reviewing many abstracts, and to the many volunteers and people in the administration of the University of Antwerp for their helping hands, in many different ways.

Last but not least, thank you for being here and being part of yet another great BBC edition. We wish you an enjoyable and very illuminating meeting.

On behalf of the organizing committee,

Kris Laukens & Pieter Meysman
BBC2015 chairs
University of Antwerp


Special thanks to the BBC 2015 sponsors!

Gold sponsors:

Silver sponsors:

Bronze sponsors:

Affiliations:


Organizing committee

Kris Laukens, University of Antwerp, Belgium
Pieter Meysman, University of Antwerp, Belgium
Geert Vandeweyer, University of Antwerp, Belgium
Yvan Saeys, Ghent University, Belgium
Thomas Abeel, Delft University of Technology, The Netherlands

Programme committee

Thomas Abeel, Delft University of Technology, The Netherlands
Stein Aerts, University of Leuven, Belgium
Francisco Azuaje, Luxembourg Institute of Health, Luxembourg
Gianluca Bontempi, Université libre de Bruxelles, Belgium
Tomasz Burzykowski, Hasselt University, Belgium
Susan Coort, Maastricht University, The Netherlands
Tim De Meyer, Ghent University, Belgium
Jeroen De Ridder, Delft University of Technology, The Netherlands
Dick De Ridder, Delft University of Technology, The Netherlands
Peter De Rijk, University of Antwerp, Belgium
Pierre Dupont, Université catholique de Louvain, Belgium
Pierre Geurts, University of Liège, Belgium
Peter Horvatovich, University of Groningen, The Netherlands
Jan Ramon, University of Leuven, Belgium
Rob Jelier, University of Leuven, Belgium
Gunnar Klau, Centrum Wiskunde & Informatica, The Netherlands
Andreas Kremer, ITTM S.A., Luxembourg
Kris Laukens, University of Antwerp, Belgium
Tom Lenaerts, Université libre de Bruxelles, Belgium
Steven Maere, Ghent University / VIB, Belgium
Lennart Martens, Ghent University / VIB, Belgium
Pieter Meysman, University of Antwerp, Belgium
Perry Moerland, University of Amsterdam, The Netherlands
Pieter Monsieurs, SCK-CEN, Belgium
Yves Moreau, University of Leuven, Belgium
Yvan Saeys, Ghent University / VIB, Belgium
Thomas Sauter, University of Luxembourg, Luxembourg
Alexander Schoenhuth, Centrum Wiskunde & Informatica, The Netherlands
Berend Snel, Utrecht University, The Netherlands
Dirk Valkenborg, VITO, Belgium
Raf Van de Plas, Delft University of Technology, The Netherlands
Vera van Noort, University of Leuven, Belgium
Natal van Riel, Eindhoven University of Technology, The Netherlands
Klaas Vandepoele, Ghent University / VIB, Belgium
Geert Vandeweyer, University of Antwerp, Belgium
Wim Vrancken, Vrije Universiteit Brussel, Belgium


Local Organizing Committee

Charlie Beirnaert, University of Antwerp
Wout Bittremieux, University of Antwerp
Bart Cuypers, University of Antwerp
Nicolas De Neuter, University of Antwerp
Aida Mrzic, University of Antwerp
Stefan Naulaerts, University of Antwerp

The results published in this book of abstracts are under the full responsibility of the authors. The organizing committee cannot be held responsible for any errors in this publication or potential consequences thereof.


Conference agenda 1/2

December 6, 2015: Satellite events

12.30 - 19.00 Student-run satellite meeting at the Institute of Tropical Medicine, Antwerp.
19.00 - … Guided sightseeing tour of Antwerp for early arrivals.

December 7, 2015: Main Conference

8.30 - 9.30 Registration and welcome coffee.
9.30 - 9.50 Welcome and conference opening, with foreword by UAntwerpen Rector Prof. Alain Verschoren.
9.50 - 10.50 K1 Invited keynote: Lars Juhl Jensen. Medical data and text mining: Linking diseases, drugs, and adverse reactions.
10.50 - 11.10 Coffee break.

Selected talks session 1

11.10 - 11.25 O1 Mafalda Galhardo, Philipp Berninger, Thanh-Phuong Nguyen, Thomas Sauter and Lasse Sinkkonen. Cell type-selective disease association of genes under high regulatory load.
11.25 - 11.40 O2 Andrea M. Gazzo, Dorien Daneels, Maryse Bonduelle, Sonia Van Dooren, Guillaume Smits and Tom Lenaerts. Predicting oligogenic effects using digenic disease data.
11.40 - 11.55 O3 Wouter Saelens, Robrecht Cannoodt, Bart N. Lambrecht and Yvan Saeys. A comprehensive comparison of module detection methods for gene expression data.
11.55 - 12.10 O4 Joana P. Gonçalves and Sara C. Madeira. LateBiclustering: Efficient discovery of temporal local patterns with potential delays.
12.10 - 12.30 C1 Nicolas Goffard. Illumina software platforms to transform the path to knowledge and discovery. (Corporate presentation: Illumina)


12.30 - 15.00 Lunch break & poster session.

Selected talks session 2

15.00 - 15.15 O5 Robrecht Cannoodt, Katleen De Preter and Yvan Saeys. Inferring developmental chronologies from single cell RNA.
15.15 - 15.30 O6 Vân Anh Huynh-Thu and Guido Sanguinetti. Combining tree-based and dynamical systems for the inference of gene regulatory networks.
15.30 - 15.45 O7 Annika Jacobsen, Nika Heijmans, Renée van Amerongen, Martine Smit, Jaap Heringa and K. Anton Feenstra. Modeling the Regulation of β-Catenin Signalling by WNT stimulation and GSK3 inhibition.
15.45 - 16.00 O8 Thanh Le Van, Jimmy Van den Eynden, Dries De Maeyer, Ana Carolina Fierro, Lieven Verbeke, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt and Kathleen Marchal. Ranked tiling based approach to discovering patient subtypes.
16.00 - 16.15 O9 Martin Bizet, Jana Jeschke, Matthieu Defrance, François Fuks and Gianluca Bontempi. Development of a DNA methylation-based score reflecting Tumour Infiltrating Lymphocytes.
16.15 - 16.30 O10 Aliaksei Vasilevich, Shantanu Singh, Aurélie Carlier and Jan de Boer. Prediction of cell responses to surface topographies using machine learning techniques.
16.30 - 17.00 Coffee break.

Selected talks session 3

17.00 - 17.15 O11 Wout Bittremieux, Pieter Meysman, Lennart Martens, Bart Goethals, Dirk Valkenborg and Kris Laukens. Analysis of mass spectrometry quality control metrics.
17.15 - 17.30 O12 Şule Yılmaz, Masa Cernic, Friedel Drepper, Bettina Warscheid, Lennart Martens and Elien Vandermarliere. Xilmass: A cross-linked peptide identification algorithm.
17.30 - 17.45 O13 Nico Verbeeck, Jeffrey Spraggins, Yousef El Aalamat, Junhai Yang, Richard M. Caprioli, Bart De Moor, Etienne Waelkens and Raf Van de Plas. Automated anatomical interpretation of differences between imaging mass spectrometry experiments.
17.45 - 18.00 O14 Yousef El Aalamat, Xian Mao, Nico Verbeeck, Junhai Yang, Bart De Moor, Richard M. Caprioli, Etienne Waelkens and Raf Van de Plas. Enhancement of imaging mass spectrometry data through removal of sparse intensity variations.

18.10 - 18.30 Walk to the gala dinner, leaving from the conference venue.
18.30 - 22.00 Gala dinner at Pelgrom, Pelgrimstraat 15, Antwerpen.


Conference agenda 2/2

December 8, 2015: Main Conference

8.30 - 9.30 Welcome coffee.
9.30 - 9.40 Opening and announcements.

Selected talks session 4

9.40 - 9.55 O15 Gipsi Lima Mendez, Karoline Faust, Nicolas Henry, Johan Decelle, Sébastien Colin, Fabrizio Carcillo, Simon Roux, Gianluca Bontempi, Matthew B. Sullivan, Chris Bowler, Eric Karsenti, Colomban de Vargas and Jeroen Raes. Determinants of community structure in the plankton interactome.
9.55 - 10.10 O16 Mohamed Mysara, Yvan Saeys, Natalie Leys, Jeroen Raes and Pieter Monsieurs. Bioinformatics tools for accurate analysis of amplicon sequencing data for biodiversity analysis.
10.10 - 10.25 O17 Sjoerd M. H. Huisman, Else Eising, Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Arn van den Maagdenberg and Marcel Reinders. Gene co-expression analysis identifies brain regions and cell types involved in migraine pathophysiology: a GWAS-based study using the Allen Human Brain Atlas.
10.25 - 10.40 O18 Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Aldo Grefhorst, Isabel Mol, Hetty Sips, Jose van den Heuvel, Jenny Visser, Marcel Reinders and Onno Meijer. Spatial co-expression analysis of steroid receptors in the mouse brain identifies region-specific regulation mechanisms.
10.40 - 11.10 Coffee break.

Selected talks session 5

11.10 - 11.25 O19 Bart Cuypers, Pieter Meysman, Manu Vanaerschot, Maya Berg, Malgorzata Domagalska, Jean-Claude Dujardin and Kris Laukens. A systems biology compendium for Leishmania donovani.
11.25 - 11.40 O20 Volodimir Olexiouk, Elvis Ndah, Sandra Steyaert, Steven Verbruggen, Eline De Schutter, Alexander Koch, Daria Gawron, Wim Van Criekinge, Petra Van Damme and Gerben Menschaert. Multi-omics integration: Ribosome profiling applications.
11.40 - 11.55 O21 Qingzhen Hou, Kamil Krystian Belau, Marc Lensink, Jaap Heringa and K. Anton Feenstra. CLUB-MARTINI: Selecting favorable interactions amongst available candidates: A coarse-grained simulation approach to scoring docking decoys.
11.55 - 12.10 O22 Elien Vandermarliere, Davy Maddelein, Niels Hulstaert, Elisabeth Stes, Michela Di Michele, Kris Gevaert, Edgar Jacoby, Dirk Brehmer and Lennart Martens. Pepshell: Visualization of conformational proteomics data.


12.10 - 12.30 C2 Carine Poussin. The systems toxicology computational challenge: Identification of exposure response markers. (Corporate presentation: sbv IMPROVER)
12.30 - 13.30 Lunch break.
13.30 - 14.30 K2 Invited keynote: Cedric Notredame. Multiple survival strategies to deal with the multiplication of multiple sequence alignment methods.

Selected talks session 6

14.30 - 14.45 O23 Thomas Moerman, Dries Decap and Toni Verbeiren. Interactive VCF comparison using Spark Notebook.
14.45 - 15.00 O24 Sepideh Babaei, Waseem Akhtar, Johann de Jong, Marcel Reinders and Jeroen de Ridder. 3D hotspots of recurrent retroviral insertions reveal long-range interactions with cancer genes.
15.00 - 15.30 Coffee break.
15.30 - 16.00 K3 Invited keynote: Pierre Rouzé. Thirty years in Bioinformatics.
16.00 - 16.30 Closing and awards.
16.30 - 17.00 Closing reception.


Gala dinner

The gala event will take place at the Pelgrom, a medieval-style restaurant within walking distance of the Elzenveld conference location, on the evening of Monday, December 7th, after the conference programme, from 18h30 until 22h00. Gala dinner participation is optional, although highly recommended!

The Pelgrom is one of Antwerp’s most historic eating and drinking places, situated in authentic 15th-century cellars that were used by merchants for temporary storage during the two big annual Antwerp fairs. Prepare to feast on a medieval buffet in the style of Antwerp’s Golden Century!

For people using public transportation after the gala dinner, the Antwerp-Central train station can easily be reached by tram from the Groenplaats station (10 minutes), or on foot (20 minutes).

Where? Restaurant Pelgrom, Pelgrimstraat 15, 2000 Antwerp
When? Monday, December 7th, 2015; 18h30 - 22h00


List of abstracts

Keynotes

K1 MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS
K2 MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS

Corporate presentations

C1 ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO KNOWLEDGE AND DISCOVERY
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS

Selected oral presentations

O1 CELL TYPE-SELECTIVE DISEASE ASSOCIATION OF GENES UNDER HIGH REGULATORY LOAD
O2 PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA
O3 A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS FOR GENE EXPRESSION DATA
O4 LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL PATTERNS WITH POTENTIAL DELAYS
O5 INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL RNA
O6 COMBINING TREE-BASED AND DYNAMICAL SYSTEMS FOR THE INFERENCE OF GENE REGULATORY NETWORKS
O7 MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION
O8 RANKED TILING BASED APPROACH TO DISCOVERING PATIENT SUBTYPES
O9 DEVELOPMENT OF A DNA METHYLATION-BASED SCORE REFLECTING TUMOUR INFILTRATING LYMPHOCYTES
O10 PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES USING MACHINE LEARNING TECHNIQUES
O11 ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS
O12 XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM
O13 AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS
O14 ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS
O15 DETERMINANTS OF COMMUNITY STRUCTURE IN THE PLANKTON INTERACTOME
O16 BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON SEQUENCING DATA FOR BIODIVERSITY ANALYSIS
O17 GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS



O18 SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION MECHANISMS
O19 A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI
O20 MULTI-OMICS INTEGRATION: RIBOSOME PROFILING APPLICATIONS
O21 CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS
O22 PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS DATA
O23 INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK
O24 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL LONG-RANGE INTERACTIONS WITH CANCER GENES

Poster presentations

P1 KNN-MDR APPROACH FOR DETECTING GENE-GENE INTERACTIONS
P2 CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC PATHWAYS IN FUNGI
P3 VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS USING POLIMERO AND POLIMERO-BIO
P4 DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND
P5 BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE
P6 ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING LONG-TERM PATIENT GUT COLONIZATION
P7 XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC
P8 IDENTIFICATION OF NUMTS THROUGH NGS DATA
P9 MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA
P10 FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN BIOLOGICAL INTERPRETATION OF GWAS RESULTS
P11 IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS IN SETS OF FUNCTIONALLY RELATED GENES
P12 PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS
P13 THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS
P14 NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM WHOLE GENOME NGS DATA
P15 ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR NANOMATERIAL SAFETY EVALUATION
P16 BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: SOMETIMES LESS IS MORE
P17 TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS
P18 RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH ALTERNATIVE FUNCTIONALITY IN MUSHROOMS
P19 MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED QUANTITATIVE PROTEOMICS
P20 A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN CANCER
P21 GEVACT: GENOMIC VARIANT CLASSIFIER TOOL
P22 MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT EXPERIMENTS
P23 HIGHLANDER: VARIANT FILTERING MADE EASIER



P24 DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION DATA WITH MULTIPLE DOSES AND TIME POINTS
P25 IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS USING A “DUMMY” LIGAND APPROACH
P26 PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE
P27 DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA
P28 APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY
P29 INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION
P30 GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA
P31 KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS
P32 ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES
P33 THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE
P34 FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS
P35 HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA
P36 LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS
P37 ANALYSIS OF RELATIONSHIP PATTERNS IN UNASSIGNED MS/MS SPECTRA
P38 MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION
P39 ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS
P40 RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES
P41 RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT
P42 EARLY FOLDING AND LOCAL INTERACTIONS
P43 BINDING SITE SIMILARITY DRUG REPOSITIONING: A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY AND SIDE EFFECTS DETECTION
P44 ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC APPROACH
P45 REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION
P46 ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY PREDICTION
P47 MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF THEIR DELETERIOUS EFFECTS
P48 NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND INTRINSIC DISORDER
P49 OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL MODELS
P50 EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES
P51 INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES FOR PREDICTING CLINICAL CODES
P52 SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS
P53 FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND COMPARE CYTOMETRY DATA IN THE BROWSER
P54 TOWARDS A BELGIAN REFERENCE SET
P55 MANAGING BIG IMAGING DATA FROM MICROSCOPY: A DEPARTMENTAL-WIDE APPROACH



P56 ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN CANCER GENOMES USING ENHANCER PREDICTION MODELS AND MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA
P57 I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN SEQUENCE VISUALIZATION
P58 SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS
P59 MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM
P60 COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF CO-EXPRESSION NETWORKS
P61 THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES
P62 FLOREMI: SURVIVAL TIME PREDICTION BASED ON FLOW CYTOMETRY DATA
P63 STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS
P64 THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO THE SOURDOUGH FERMENTATION PROCESS
P65 ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST
P66 PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING
P67 IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING A NETWORK-BASED APPROACH
P68 DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING
P69 HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES USING MATRIX FACTORIZATION
P70 THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS DISTRIBUTION

Corporate poster presentations

C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS



K1. MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS

Lars Juhl Jensen

Clinical data describing the phenotypes and treatment of patients is an underused data source that has much greater research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for revealing unknown disease correlations and for improving post-approval monitoring of drugs. In my presentation I will introduce the centralized Danish health registries and show how we use them for identification of temporal disease correlations and discovery of common diagnosis trajectories of patients. I will also describe how we perform text mining of the clinical narrative from electronic health records and use this for identification of new adverse reactions of drugs.



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: K2

Keynote


K2. MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS

Cedric Notredame

In this seminar I will introduce some of the latest developments in the field of multiple sequence alignment construction, including some of the work from my group. I will briefly review the main challenges and the latest work in the field, including ClustalO and phylogeny-aware aligners such as SATe, and show how these aligners relate to consistency-based methods like T-Coffee. I will also look at the complex relationship between multiple sequence alignment accuracy, structural modeling and phylogenetic tree reconstruction, and introduce the notion of a reliability index while reviewing some of the latest advances in this field, including the TCS (Transitive Consistency Score). I will show how this index can be used to identify both structurally correct positions in an alignment and evolutionarily informative sites, thus suggesting more unity than initially thought between these two parameters. I will then introduce the structure-based clustering method we recently developed to further test these hypotheses. I will finish with some considerations on the main challenges that need to be confronted for the accurate modeling of relationships between biological sequences, with special attention to genomic and RNA sequences. All methods are available from www.tcoffee.org.

REFERENCES

TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Chang JM, Di Tommaso P, Notredame C. Mol Biol Evol. 2014 Jun;31(6):1625-37. doi: 10.1093/molbev/msu117. Epub 2014 Apr 1.

Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Kemena C, Bussotti G, Capriotti E, Marti-Renom MA, Notredame C. Bioinformatics. 2013 May 1;29(9):1112-9. doi: 10.1093/bioinformatics/btt096. Epub 2013 Feb 28.

Alignathon: a competitive assessment of whole-genome alignment methods. Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, Seledtsov I, Molodtsov V, Raney BJ, Clawson H, Kim J, Kemena C, Chang JM, Erb I, Poliakov A, Hou M, Herrero J, Kent WJ, Solovyev V, Darling AE, Ma J, Notredame C, Brudno M, Dubchak I, Haussler D, Paten B. Genome Res. 2014 Dec;24(12):2077-89. doi: 10.1101/gr.174920.114. Epub 2014 Oct 1.

Epistasis as the primary factor in molecular evolution. Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Nature. 2012 Oct 25;490(7421):535-8. doi: 10.1038/nature11510. Epub 2012 Oct 14.



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />

Abstract ID: C1<br />

Corporate presentation<br />


C1. ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO<br />

KNOWLEDGE AND DISCOVERY<br />

Nicolas Goffard<br />

Illumina, Inc. ngoffard@illumina.com<br />

The next big bottleneck in the biological sample-to-answer workflow has undoubtedly moved beyond the generation of<br />

the raw data towards its initial processing and analysis and, even more so, its biological and medical interpretation. There<br />

are two main reasons why this is particularly challenging for research organisations. Firstly,<br />

there is a need to easily and securely analyse, archive and share sequencing data, as well as to simplify and accelerate the<br />

data analysis with push-button tools using widely validated and scientifically accepted algorithms. Secondly, there is a<br />

requirement to normalize, standardize and curate their proprietary data from multiple studies in a<br />

way that allows them to compare it in real time to data produced from public domain studies. Illumina provides two<br />

integrated software platforms, BaseSpace and NextBio, to overcome these challenges. This presentation provides<br />

an overview of the capabilities of both platforms, which empower biologists and informaticians to interactively explore the<br />

data.<br />


Abstract ID: C2<br />

Corporate presentation<br />


C2. THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE:<br />

IDENTIFICATION OF EXPOSURE RESPONSE MARKERS<br />

Carine Poussin, Vincenzo Belcastro, Stéphanie Boué, Florian Martin,<br />

Alain Sewer, Bjoern Titz, Manuel C. Peitsch & Julia Hoeng.<br />

Philip Morris International Research and Development, Philip Morris Product SA,<br />

Quai Jeanrenaud 5, CH-2000 Neuchâtel, Switzerland<br />

INTRODUCTION<br />

Risk assessment in the context of 21st century<br />

toxicology relies on the identification of specific<br />

exposure response markers and the elucidation of<br />

mechanisms of toxicity, which can lead to adverse<br />

events. As a foundation for this future predictive risk<br />

assessment, diverse sets of chemicals or mixtures are<br />

tested in different biological systems, and datasets are<br />

generated using high-throughput technologies.<br />

However, the development of effective computational<br />

approaches for the analysis and integration of these data<br />

sets remains challenging.<br />

METHODS<br />

The sbv IMPROVER (Industrial Methodology for<br />

Process Verification in Research;<br />

http://sbvimprover.com/) project aims to verify methods<br />

and concepts in systems biology research via challenges<br />

posed to the scientific community. In fall <strong>2015</strong>, the 4th<br />

sbv IMPROVER computational challenge will be<br />

launched, which is aimed at evaluating algorithms for<br />

the identification of specific markers of chemical<br />

mixture exposure response in blood of humans or<br />

rodents. Blood is an easily accessible matrix, but it<br />

remains a complex biofluid to analyze. This<br />

computational challenge will address questions related<br />

to the classification of samples based on transcriptomics<br />

profiles from well-defined sample cohorts. Moreover, it<br />

will address whether gene expression data derived from<br />

human or rodent whole blood are sufficiently<br />

informative to identify human-specific or species-independent<br />

blood gene signatures predictive of the<br />

exposure status of a subject to chemical mixtures<br />

(current/former/non-exposure).<br />

RESULTS & DISCUSSION<br />

Participants will be provided with high quality datasets<br />

to develop predictive models/classifiers and the<br />

predictions will be scored by an independent scoring<br />

panel. The results and post-challenge analyses will be<br />

shared with the scientific community, and will open<br />

new avenues in the field of systems toxicology.<br />

REFERENCES<br />

Meyer et al. Industrial methodology for process verification in<br />

research (IMPROVER): toward systems biology verification.<br />

Bioinformatics, 2012<br />

Meyer et al. Verification of systems biology research in the age of<br />

collaborative competition. Nat Biotechnol, 2011<br />

Tarca et al. Strengths and limitations of microarray-based phenotype<br />

prediction: lessons learned from the IMPROVER Diagnostic<br />

Signature Challenge. Bioinformatics, 2013<br />

Hartung, T. Lessons learned from alternative methods and their<br />

validation for a new toxicology in the 21st century. Journal of<br />

toxicology and environmental health, 2010<br />

Hoeng et al. A network-based approach to quantifying the impact of<br />

biologically active substances. Drug Discov Today, 2012.<br />


Abstract ID: O1<br />

Oral presentation<br />


O1. CELL TYPE-SELECTIVE DISEASE ASSOCIATION<br />

OF GENES UNDER HIGH REGULATORY LOAD<br />

Mafalda Galhardo 1 , Philipp Berninger 2 , Thanh-Phuong Nguyen 1 , Thomas Sauter 1 & Lasse Sinkkonen 1*.<br />

Life Sciences Research Unit, University of Luxembourg, Luxembourg, Luxembourg 1 ; Biozentrum, University of Basel<br />

and Swiss Institute of Bioinformatics, Basel, Switzerland 2 . * lasse.sinkkonen@uni.lu<br />

Identification of biomarkers and drug targets is a key task of biomedical research. We previously showed that disease-linked<br />

metabolic genes are often under combinatorial regulation (Galhardo et al. 2014). Here we extend this analysis to<br />

include almost 100 transcription factors (TFs) and key histone modifications from over 100 samples to show that genes<br />

under high regulatory load (HRL) are enriched for disease-association across cell types. Network and pathway analysis<br />

suggests the central role of HRL genes in biological networks, under heavy regulation both at transcriptional and post-transcriptional<br />

level, as a possible explanation for the observed enrichment. Thus, epigenomic mapping of enhancers<br />

presents an unbiased approach for identification of novel disease-associated genes.<br />

INTRODUCTION<br />

Identification of disease-relevant genes and gene products<br />

as biomarkers and drug targets is one of the key tasks of<br />

biomedical research. Still, a great majority of research is<br />

focused on a small minority of genes while many remain<br />

unstudied (Pandey et al. 2014). Unbiased prioritization<br />

within these ignored genes would be important to harvest<br />

the full potential of genomics in understanding diseases.<br />

Many databases to catalog disease-associated genes have<br />

been created, including DisGeNET that draws from<br />

multiple sources (Bauer-Mehren et al. 2010). In addition,<br />

large amounts of publicly available epigenomic data on<br />

the cell type-selective regulation of these genes have been<br />

produced. The importance of epigenetic regulation for<br />

disease development is increasingly recognized, for<br />

example in the analysis of GWAS results, where causal SNPs<br />

are mostly located within gene regulatory regions<br />

(Maurano et al. 2012).<br />

METHODS<br />

Public ChIP-seq data produced by the ENCODE project<br />

(Dunham et al. 2012), the BLUEPRINT Epigenome<br />

project (Martens et al. 2013) and the NIH Epigenomic<br />

Roadmap project (Kundaje et al. <strong>2015</strong>) were downloaded<br />

in May 2014. The data were used to rank active<br />

protein-coding genes (based on NCBI Entrez and marked by<br />

H3K4me3) by their regulatory load, based on the number<br />

of associated TFs or enhancer (H3K27ac) regions, using<br />

the GREAT tool. The enrichment of disease genes from<br />

DisGeNET among HRL genes was tested using either the<br />

Matlab® hypergeometric cumulative distribution function,<br />

adjusted for multiple testing with the Benjamini and<br />

Hochberg methodology, or a normalized enrichment score.<br />

Enriched diseases were clustered using the R package<br />

“blockcluster”. Peak calling for super-enhancers was done<br />

using HOMER. A liver disease gene network was<br />

constructed from HPRD based on liver disease genes<br />

from MeSH and genes from CTD; it contained 8278<br />

interactions. Statistical analysis of KEGG pathway<br />

enrichments and betweenness centrality was done using<br />

random sampling tests. miRNA target predictions were<br />

obtained from TargetScan 6.2. Further details of the<br />

methods used can be found in Galhardo et al. <strong>2015</strong>.<br />
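The enrichment test described above can be illustrated with a minimal sketch. It is not the authors' Matlab code: it assumes a one-sided hypergeometric test (probability of seeing at least k disease genes among the HRL genes) followed by Benjamini-Hochberg adjustment, and the function names `hypergeom_sf` and `benjamini_hochberg` are hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n are disease genes.

    M: all active genes, n: disease genes among them,
    N: number of HRL genes, k: disease genes found among the HRL genes.
    """
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In such a setup, one p-value would be computed per disease (or per cell type) and the whole list passed to `benjamini_hochberg` before applying a significance cutoff.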

RESULTS & DISCUSSION<br />

Using ENCODE ChIP-seq profiles for 93 transcription<br />

factors (TFs) in nine cell lines, we show that HRL genes<br />

are enriched for disease-association across cell types<br />

(Figure 1). TF load correlates with the enhancer load of<br />

the genes, allowing the identification of HRL genes by<br />

epigenomic mapping of active enhancers marked by<br />

H3K27ac modifications. Identification of the HRL genes<br />

across 139 samples from 96 different cell and tissue types<br />

reveals a consistent enrichment for disease-associated<br />

genes in a cell type-selective manner.<br />

The HRL genes are involved in more pathways than<br />

expected by chance, exhibit increased betweenness<br />

centrality in the interaction network of liver disease genes,<br />

and carry longer 3’UTRs with more microRNA binding<br />

sites than genes on average, suggesting a role as hubs<br />

within regulatory networks.<br />

Thus, epigenomic mapping of enhancers presents an<br />

unbiased approach for identification of novel disease-associated<br />

genes (Galhardo et al. <strong>2015</strong>).<br />

FIGURE 1. Workflow of the disease-gene enrichment analysis. Human ChIP-seq data for transcription factor binding sites (93 TFs; 9 ENCODE cell lines: A549, GM12878, H1hESC, HCT116, HeLaS3, HepG2, HUVEC, K562, MCF7) and for active enhancers (H3K27ac; 139 samples comprising 96 tissue or cell types) are used to rank genes by regulatory load (number of TFs or enhancers per gene). Comparison with disease genes (min score 0.08) shows that high regulatory load genes are enriched for disease association.<br />

REFERENCES<br />

Pandey AK et al. PLoS One, 9:e88889 (2014).<br />

Bauer-Mehren A et al. Nucleic Acids Res., 33:D514-D517 (2010).<br />

Maurano et al. Science, 337:1190-1195 (2012).<br />

Galhardo et al. Nucleic Acids Res. 42:1474-1496 (2014).<br />

Dunham et al. Nature, 489:57-74 (2012).<br />

Martens et al. Haematologica, 98:1487-1489 (2013).<br />

Kundaje et al. Nature, 518:317-330 (<strong>2015</strong>).<br />

Galhardo et al. Nucleic Acids Res. 10.1093/nar/gkv863 (<strong>2015</strong>).<br />


Abstract ID: O2<br />

Oral presentation<br />

O2. PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA<br />

Andrea M. Gazzo 1,2,3* , Dorien Daneels 1,3 , Maryse Bonduelle 3 , Sonia Van Dooren 1,3 , Guillaume Smits 1,4 & Tom<br />

Lenaerts 1,2,5 .<br />

Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium 1 ; MLG, Departement d'Informatique,<br />

Universite Libre de Bruxelles, Brussels, Belgium 2 ; Center for Medical Genetics, Reproduction and Genetics,<br />

Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussel, Belgium 3 ; Genetics,<br />

Hopital Universitaire des Enfants Reine Fabiola, Universite Libre de Bruxelles, Brussels, Belgium 4 ;<br />

Computerwetenschappen, Vrije Universiteit Brussel, Brussel, Belgium 5 . * Andrea.Gazzo@ulb.ac.be<br />

Recent research has shown that disorders may be better described by more complex inheritance mechanisms, suggesting<br />

that some monogenic diseases may in fact be oligogenic. Understanding how the combined interplay and weight of<br />

variants leads to disease may provide improved and novel insights into diseases classically considered monogenic.<br />

Here we present a unique classification method that separates two types of digenic diseases, i.e. those that require<br />

variants in both genes to induce the disease and those where one is causative and the second increases the severity. Our<br />

results show that a clear separation can be made between the two classes using gene- and variant-level features extracted<br />

from DIDA.<br />

INTRODUCTION<br />

DIDA is a novel database that provides for the first time<br />

detailed information on genes and associated genetic<br />

variants involved in digenic diseases, the simplest form of<br />

oligogenic inheritance [1]. The database is accessible via<br />

http://dida.ibsquare.be and currently includes 213 digenic<br />

combinations involved in 44 different digenic diseases [2].<br />

These combinations are composed of 364 distinct variants,<br />

which are distributed over 136 distinct genes. Creating this<br />

new repository was essential, as current databases do not<br />

allow one to retrieve detailed records regarding digenic<br />

combinations. Genes, variants, diseases and digenic<br />

combinations in DIDA are annotated with manually<br />

curated information and information mined from other<br />

online resources. Each digenic combination was<br />

categorized into one of two effect classes: either "on/off",<br />

in which variant combinations in both genes are required<br />

to develop the disease, or "severity", where variants in<br />

one gene are enough to develop the disease and carrying<br />

variant combinations in two genes increases the severity or<br />

affects its age of onset. In this work we present a predictor<br />

capable of distinguishing between the digenic effect<br />

classes. We analyse the result of this predictor in relation<br />

to specific features collected for the different digenic<br />

combinations in DIDA, such as the<br />

haploinsufficiency of the genes, their zygosity and the<br />

relationship between them, providing insight into the<br />

biological meaning of the result.<br />

METHODS<br />

We used a machine learning approach to determine the<br />

class, i.e. "severity" or "on/off", of a digenic<br />

combination. Starting with feature selection, we chose the<br />

most informative features to classify each digenic<br />

combination into one of the two classes. For each of the two<br />

genes involved in a digenic combination, the following are<br />

used as features in the predictor: zygosity (heterozygote,<br />

homozygote, etc.), recessiveness probability,<br />

haploinsufficiency score, known recessive information, and<br />

whether the gene is essential (based on mouse knockout<br />

experimental data). At the variant level, we used as features<br />

the pathogenicity predictions from the SIFT and PolyPhen-2<br />

tools. Finally, we also encoded the relationship between the<br />

two genes, defining the relations "Similar function",<br />

"Directly interacting" and "Pathway membership". After<br />

different tests we decided to use a random forest algorithm,<br />

as this approach gave the best results.<br />

RESULTS & DISCUSSION<br />

After a 10-fold cross validation we obtained promising<br />

performance, with an MCC of 0.67 and an AUROC of 0.92.<br />

Regrettably, this performance is an overestimate: because<br />

the gene-based features are the most important, many<br />

examples with mutations mapped to the same gene pair<br />

lead to the same oligogenic effect class. A stratification<br />

that ensures that the same pair of genes is never in both<br />

the training and the testing set was therefore required. We<br />

manually created 5 subsets, where the instances with the<br />

same gene pair belong to the same subset. After this<br />

procedure we assessed the performance again, obtaining<br />

an MCC of 0.36 and an AUROC of 0.78. In order to verify<br />

the significance of this performance we retrained the<br />

random forest on a randomization of the data. This<br />

randomization was obtained by shuffling all the features<br />

for each instance while keeping the class unchanged. This<br />

reshuffling resulted in an MCC close to zero and an<br />

AUROC close to 0.5, as expected. This additional test<br />

confirms the significance of the stratified results.<br />
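The gene-pair stratification and the MCC used above can be sketched as follows. This is only an illustration under stated assumptions: the abstract says the 5 subsets were created manually, whereas the hypothetical `grouped_folds` helper below uses a simple greedy balancing rule, and the `mcc` helper treats "on/off" as the positive class by choice.

```python
from math import sqrt


def grouped_folds(gene_pairs, n_folds=5):
    """Assign each instance to a fold so that a gene pair never spans two folds."""
    # Count instances per gene pair (the order of the two genes is ignored).
    sizes = {}
    for pair in gene_pairs:
        key = frozenset(pair)
        sizes[key] = sizes.get(key, 0) + 1
    # Greedy balancing (an assumption): largest groups first, emptiest fold first.
    fold_of, load = {}, [0] * n_folds
    for key in sorted(sizes, key=lambda k: (-sizes[k], sorted(k))):
        target = load.index(min(load))
        fold_of[key] = target
        load[target] += sizes[key]
    return [fold_of[frozenset(pair)] for pair in gene_pairs]


def mcc(y_true, y_pred, positive="on/off"):
    """Matthews correlation coefficient for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Each fold index returned by `grouped_folds` would in turn serve as the held-out test set, so that instances sharing a gene pair are always evaluated together.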

As a next step, we are analysing the relationship between<br />

the oligogenic effect and the features used, particularly in<br />

terms of biological and molecular interpretation. As a<br />

future perspective, the benefit at the clinical level is very<br />

promising: one goal of medical genetics is to assign<br />

predictive value to the genotype, so that it can assist in<br />

diagnosis and disease management. If we can infer, based<br />

on the genotype, what the digenic/oligogenic effect will be,<br />

we can potentially anticipate the treatment.<br />

REFERENCES<br />

[1] Gazzo, A. et al., DIDA: a curated and annotated digenic diseases<br />

database, under review on NAR database issue (2016).<br />

[2] Schäffer, A. A. (2013) Digenic inheritance in medical genetics.<br />

J. Med. Genet., 50, 641–652.<br />


Abstract ID: O3<br />

Oral presentation<br />


O3. A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS<br />

FOR GENE EXPRESSION DATA<br />

Wouter Saelens 1,2* , Robrecht Cannoodt 1,2,3 , Bart N. Lambrecht 1,2 & Yvan Saeys 1,2 .<br />

VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Center for Medical<br />

Genetics, Ghent University Hospital 3 . * wouter.saelens@ugent.be<br />

Module detection is central in every analysis of large-scale gene expression data. While numerous methods have been<br />

developed, the relative merits and drawbacks of these different approaches are still unclear. In this work we use known<br />

gene regulatory networks to do an unbiased comparison of 41 module detection methods, spanning clustering,<br />

biclustering, decomposition, direct network inference and iterative network inference. This analysis showed that<br />

decomposition methods outperform current clustering methods. Our work not only provides a first comprehensive evaluation to<br />

guide biologists in their choice but also serves as a protocol for the evaluation of novel module detection methods.<br />

INTRODUCTION<br />

Module detection methods form a cornerstone in the<br />

analysis of genome-wide gene expression compendia.<br />

Modules in this context are defined as groups of genes<br />

with a similar expression profile, which therefore frequently<br />

share certain functions, are co-regulated and cooperate to<br />

produce a certain phenotype.<br />

Over the last years, dozens of module detection methods<br />

have been developed, which can be classified in five<br />

different categories. The most popular method is<br />

undoubtedly clustering, which will group genes into<br />

modules based on global similarity in expression profiles.<br />

Within the transcriptomics community these methods have<br />

received a considerable amount of criticism. This is<br />

mainly due to three drawbacks: (i) clustering cannot detect<br />

so-called local co-expression effects, (ii) most clustering<br />

methods are unable to detect overlapping modules and (iii)<br />

clustering methods do not model the underlying gene<br />

regulatory network. Alternative approaches have therefore<br />

been developed which either handle both overlap and local<br />

co-expression (biclustering and decomposition) or model<br />

the gene regulatory network (direct network inference and<br />

iterative network inference).<br />

Given this methodological diversity, it is important that<br />

existing and new approaches are evaluated on robust and<br />

objective benchmarks. However, evaluation studies in the<br />

past were limited in the number of methods, used synthetic<br />

data or did not correctly assess the balance between false<br />

positives and false negatives. In this study we therefore<br />

provide a novel unbiased and comprehensive evaluation<br />

strategy (Figure 1), and use it to evaluate 41 state-of-the-art<br />

module detection methods.<br />

METHODS<br />

The key to our approach is that we use gold standard<br />

regulatory networks to define sets of known modules.<br />

These can be used to directly assess the sensitivity and<br />

specificity of the different module detection methods. We<br />

used four different large-scale gene expression compendia,<br />

two from E. coli and two from S. cerevisiae. For each of<br />

these organisms a substantial part of the regulatory<br />

network is already known, either based on the integration<br />

of small-scale experiments or based on large, genome-wide<br />

datasets. We use these networks to define groups of<br />

known modules by looking at genes which either<br />

share one regulator, share all regulators or are strongly<br />

interconnected. We used four different metrics to compare<br />

a set of observed modules with known modules: recovery<br />

and recall control the type II errors, while the relevance<br />

and specificity control the type I errors.<br />
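To make the comparison between observed and known modules concrete, here is a minimal sketch of two of the four metrics. The abstract does not give their formulas, so the definitions below are an assumption: recovery is taken as the average best Jaccard overlap of each known module with any observed module, and relevance as the same quantity computed from the observed modules' side. The helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_score(sources, targets):
    """Average, over sources, of the best Jaccard overlap with any target."""
    if not sources or not targets:
        return 0.0
    return sum(max(jaccard(s, t) for t in targets) for s in sources) / len(sources)


def recovery(known, observed):
    """Assumed definition: how well each known module is found (type II errors)."""
    return best_match_score(known, observed)


def relevance(known, observed):
    """Assumed definition: how well each observed module is supported (type I errors)."""
    return best_match_score(observed, known)
```

Under these assumed definitions, a method that reports many spurious modules keeps a high recovery but loses relevance, which is why the two directions are scored separately.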

Parameter tuning is a necessary but often overlooked<br />

challenge of module detection methods. As default<br />

parameters of a tool are usually optimized for some<br />

specific test cases by the authors, they do not necessarily<br />

yield good performance on other datasets. On the<br />

other hand, one should be careful not to overfit parameters<br />

to specific characteristics of the data, as such parameters<br />

will lead to suboptimal results when the same<br />

settings are used on other datasets. In this study we first<br />

optimized parameters using a grid-based approach. Next,<br />

to avoid overfitting we used the optimal parameters on one<br />

dataset to score the performance on another dataset, in an<br />

approach akin to cross-validation.<br />
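The tuning scheme described above can be sketched in a few lines. This is an illustration only: `run_and_score` stands for a hypothetical user-supplied function that runs one module detection method on a dataset with given parameters and returns a score (higher is better), and averaging over all other datasets is an assumption about how the cross-dataset scores are combined.

```python
from itertools import product


def cross_dataset_scores(datasets, grid, run_and_score):
    """Score each dataset using parameters tuned on the other datasets.

    datasets: dict name -> data (at least two entries are required)
    grid: dict parameter name -> list of candidate values
    run_and_score: hypothetical callable (data, params) -> score
    """
    names = list(grid)
    candidates = [dict(zip(names, values)) for values in product(*grid.values())]
    # Grid search: best parameter setting for every dataset on its own.
    best = {
        name: max(candidates, key=lambda params: run_and_score(data, params))
        for name, data in datasets.items()
    }
    # Test score: apply parameters tuned on a different dataset, then average.
    result = {}
    for test_name, test_data in datasets.items():
        scores = [run_and_score(test_data, best[train_name])
                  for train_name in datasets if train_name != test_name]
        result[test_name] = sum(scores) / len(scores)
    return result
```

The gap between a dataset's own grid-search optimum and its cross-dataset score gives a rough indication of how sensitive a method is to parameter overfitting.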

RESULTS & DISCUSSION<br />

We evaluated 41 different module detection methods<br />

covering all five approaches. Overall, our analysis showed<br />

that certain decomposition methods, those based on<br />

independent component analysis, outperform current state-of-the-art<br />

clustering methods. However, despite their<br />

theoretical advantages, neither biclustering nor network<br />

inference methods are able to outperform clustering<br />

methods. Importantly, our results are stable across datasets,<br />

module definitions and scoring metrics, demonstrating the<br />

robustness of our evaluation methodology.<br />

FIGURE 1. Overview of our evaluation methodology.<br />

The applications of our work are twofold. First, if local co-expression<br />

and overlap are of interest, we discourage the<br />

use of biclustering methods and suggest the use of<br />

decomposition instead. Secondly, we provide a new<br />

comprehensive evaluation methodology which can be used<br />

to compare novel methods with the current state-of-the-art.<br />


Abstract ID: O4<br />

Oral presentation<br />


O4. LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL<br />

PATTERNS WITH POTENTIAL DELAYS<br />

Joana P. Gonçalves 1,2* & Sara C. Madeira 3,4 .<br />

Pattern Recognition and Bioinformatics Group, Department of Intelligent Systems, Delft University of Technology 1 ;<br />

Division of Molecular Carcinogenesis, The Netherlands Cancer Institute 2 ; Department of Computer Science and<br />

Engineering, Instituto Superior Técnico, Universidade de Lisboa 3 ; INESC-ID 4 . * research@joanagoncalves.org<br />

Temporal transcriptomes can provide valuable insight into the dynamics of transcriptional response and gene regulation.<br />

In particular, many studies seek to uncover functional biological units by identifying and grouping genes with common<br />

expression patterns. Nevertheless, most analytical tools available for this purpose fall short in their ability to consider<br />

biologically reasonable models and adequately incorporate the temporal dimension. Each biological task is likely to<br />

occur within a time period that does not necessarily span the whole time course of the experiment, and genes involved in<br />

such a task are expected to coordinate only while the task is ongoing. LateBiclustering is an efficient algorithm to<br />

identify this type of coordinated activity, while allowing genes to participate in distinct biological tasks with multiple<br />

partners over time. Additionally, LateBiclustering is able to capture temporal delays suggestive of transcriptional<br />

cascades: one of the hallmarks of gene expression and regulation.<br />

INTRODUCTION<br />

The discovery of patterns in temporal transcriptomes<br />

exposes gene expression dynamics and contributes to<br />

understanding the machinery involved in its modulation.<br />

Various analytical tools are employed in this regard.<br />

Differential expression summarizes an entire time course<br />

into one feature, thus lacking detail. Clustering<br />

respects the chronological order, but focuses on global<br />

similarities and tends to identify rather broad patterns,<br />

associated with unspecific functions. Biclustering offers<br />

increased granularity by additionally searching for local<br />

patterns, but allows for arbitrary jumps in time, potentially<br />

leading to patterns that are incoherent from a temporal<br />

perspective.<br />

METHODS<br />

LateBiclustering is an efficient algorithm for the<br />

identification of transcriptional modules, here termed<br />

LateBiclusters. Each LateBicluster is a group of genes<br />

showing a similar expression pattern with potential delays,<br />

within a particular time frame that does not necessarily<br />

span the whole time course of the transcriptome.<br />

LateBiclustering only reports maximal LateBiclusters, that<br />

is, those that cannot be extended and are not fully<br />

contained in any other LateBicluster.<br />

LateBiclustering takes as input a gene-time expression<br />

matrix of real values. Each gene expression profile is first<br />

normalized to zero mean and unit standard deviation. A<br />

discretization is further applied to discern variations<br />

between consecutive time points into three levels: down-trend,<br />

no-change and up-trend. Upon discretization each<br />

gene profile can be seen as a string.<br />
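The normalization and discretization step can be illustrated with a short sketch. The abstract does not state how a change is judged large enough to count as a trend, so the `threshold` parameter, the use of the population standard deviation and the symbols D/N/U are all assumptions made for this example.

```python
from statistics import mean, pstdev


def discretize_profile(profile, threshold=0.5):
    """Z-score one gene profile, then encode consecutive changes as a string.

    D = down-trend, N = no-change, U = up-trend. The threshold on the
    normalized difference is an illustrative assumption.
    """
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        return "N" * (len(profile) - 1)  # flat profile: no change anywhere
    z = [(x - mu) / sigma for x in profile]
    symbols = []
    for previous, current in zip(z, z[1:]):
        delta = current - previous
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)
```

A profile with T time points yields a string of length T - 1, and it is these strings, one per gene, that would be inserted into the generalized suffix tree.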

A generalized suffix tree is built to find common<br />

patterns in the gene profiles. Internal nodes<br />

satisfying certain properties are marked for their<br />

potential to denote LateBiclusters.<br />

When an internal node does not satisfy the basic<br />

conditions for LateBicluster maximality, a<br />

procedure is applied to remove occurrences<br />

leading to non-maximal LateBiclusters. For this<br />

purpose, LateBiclustering uses a bit array<br />

representing the occurrences underlying each<br />

internal node. During the maximality update<br />

procedure, the bit array of the inspected node is<br />

compared against those of internal children nodes<br />

(right-max) and nodes from which the inspected<br />

node receives suffix links (left-max).<br />

Finally, LateBiclustering comes with different<br />

heuristics to report a single pattern occurrence per<br />

gene in each maximal LateBicluster. A heuristic<br />

is necessary because there may be multiple<br />

occurrences of a pattern in the profile of a given<br />

gene, which is a direct consequence of allowing<br />

the discovery of delayed patterns.<br />

RESULTS & DISCUSSION<br />

LateBiclustering is the first efficient algorithm suitable for<br />

the discovery of biclusters with temporal delays. It runs in<br />

polynomial time, while previous methods had<br />

exponential time complexity. LateBiclustering was able to<br />

find planted biclusters in synthetic data. It also identified<br />

biologically relevant LateBiclusters associated with<br />

Saccharomyces cerevisiae’s response to heat stress, and<br />

interesting time-lagged responses.<br />

FIGURE 1. Schematic of the LateBiclustering algorithm.<br />

REFERENCES<br />

Gonçalves JP & Madeira SC. IEEE/ACM Transactions on<br />

Computational Biology and Bioinformatics, 11(5), 801–813<br />

(2014).<br />


Abstract ID: O5<br />

Oral presentation<br />


O5. INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL<br />

RNA<br />

Robrecht Cannoodt 1,2,3* , Katleen De Preter 3 & Yvan Saeys 1,2 .<br />

Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center, Ghent 1 ; Department of<br />

Respiratory Medicine, Ghent University Hospital, Ghent 2 ; Center of Medical Genetics, Ghent University Hospital,<br />

Ghent 3 . * robrecht.cannoodt@ugent.be<br />

With the advent of single cell RNA sequencing, it is now possible to analyse the transcriptomes of hundreds of individual<br />

cells in an unbiased manner. Reconstructing the developmental chronology of differentiating cells is a challenging task,<br />

and doing so in an unsupervised and robust manner is a hitherto untackled problem. We developed a truly unsupervised<br />

developmental chronology inference technique, and evaluated its performance and robustness using multiple datasets.<br />

INTRODUCTION<br />

Early attempts at inferring the chronologies of single cells<br />

are MONOCLE (Trapnell et al., 2014) and NBOR<br />

(Schlitzer et al., <strong>2015</strong>). However, these techniques are not<br />

unsupervised as they require knowledge of the cell type of<br />

each cell prior to analysis, which biases the results to prior<br />

knowledge and possibly obstructs the discovery of novel<br />

subpopulations.<br />

METHODS<br />

Our approach consists of four steps.<br />

In the first step, the feature space (~30000 genes) is<br />

reduced to three dimensions.<br />

Secondly, outliers are detected and removed, using a K-<br />

nearest neighbour approach. After outlier removal, the<br />

original feature space is again reduced to three dimensions.<br />

Next, a nonparametric nonlinear curve is iteratively fitted<br />

to the data.<br />

Finally, each cell is projected onto the curve, thus<br />

resulting in a cell chronology.<br />
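The outlier-removal and projection steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the cells have already been reduced to three dimensions, replaces the nonparametric curve with a hand-written polyline, and uses a hypothetical outlier rule (mean distance to the k nearest neighbours above twice the median). All names, coordinates and thresholds are placeholders.

```python
import math

def knn_outlier_filter(points, k=3, factor=2.0):
    """Keep points whose mean distance to their k nearest neighbours is at
    most `factor` times the median of that score (hypothetical rule)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    median = sorted(scores)[len(scores) // 2]
    return [p for p, s in zip(points, scores) if s <= factor * median]

def pseudotime(point, curve):
    """Arc-length position of the point on the polyline closest to `point`."""
    best_dist, best_pos, travelled = float("inf"), 0.0, 0.0
    for a, b in zip(curve, curve[1:]):
        seg = [bi - ai for ai, bi in zip(a, b)]
        seg_len = math.sqrt(sum(s * s for s in seg))
        # Orthogonal projection onto the segment, clamped to its end points.
        t = sum((pi - ai) * si for pi, ai, si in zip(point, a, seg)) / seg_len ** 2
        t = max(0.0, min(1.0, t))
        closest = [ai + t * si for ai, si in zip(a, seg)]
        d = math.dist(point, closest)
        if d < best_dist:
            best_dist, best_pos = d, travelled + t * seg_len
        travelled += seg_len
    return best_pos

# Toy 3-D coordinates standing in for dimension-reduced cells (made up).
cells = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (2.1, 0.0, 0.1),
         (3.0, 0.1, 0.0), (50.0, 50.0, 50.0)]
curve = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

kept = knn_outlier_filter(cells, k=2)
chronology = [pseudotime(c, curve) for c in kept]
```

Sorting the retained cells by `chronology` gives their order along the curve, which is the sense in which projection yields a cell chronology.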

RESULTS & DISCUSSION<br />

A single-cell RNAseq dataset (Schlitzer et al., 2015)<br />

contains expression profiles of DC progenitor cells. These cells are<br />

expected to differentiate from MDP to CDP to PreDC. Our<br />

method is able to intuitively visualise known population<br />

groups (Figure 1), as well as infer the developmental<br />

chronology of the individual cells (Figure 2).<br />

We evaluated our method on four datasets (Shalek et al.,<br />

2014; Trapnell et al., 2014; Buettner et al., 2015 and<br />

Schlitzer et al., 2015), and found it to perform better and<br />

more robustly than existing methods MONOCLE and<br />

NBOR.<br />

This approach opens opportunities to further study known<br />

mechanisms or investigate unknown key regulatory<br />

structures in cell differentiation, or detect novel<br />

subpopulations in a truly unsupervised manner.<br />

REFERENCES<br />

Buettner F et al. Nature Biotechnology 33, 155-160 (2015).<br />

Schlitzer A et al. Nature Immunology 16, 718-726 (2015).<br />

Shalek A et al. Nature 509, 363-369 (2014).<br />

Trapnell C et al. Nature Biotechnology 32, 381-386 (2014).<br />

FIGURE 1. After feature space reduction and outlier detection of 244 DC<br />

progenitor cells (Schlitzer et al., 2015), our method can intuitively<br />

visualise known populations.<br />

FIGURE 2. An iterative curve fitting results in a smooth curve reflecting<br />

the developmental chronology. After projecting each cell to the curve,<br />

regulatory patterns in expression which correlate with this timeline can<br />

be investigated.<br />


Abstract ID: O6<br />

Oral presentation<br />


O6. COMBINING TREE-BASED AND DYNAMICAL SYSTEMS<br />

FOR THE INFERENCE OF GENE REGULATORY NETWORKS<br />

Vân Anh Huynh-Thu 1* & Guido Sanguinetti 2,3 .<br />

GIGA-R & Department of Electrical Engineering and Computer Science, University of Liège 1 ; School of Informatics,<br />

University of Edinburgh 2 ; SynthSys – Systems and Synthetic Biology, University of Edinburgh 3 . * vahuynh@ulg.ac.be<br />

INTRODUCTION<br />

Reconstructing the topology of gene regulatory networks<br />

(GRNs) from time series of gene expression data remains<br />

an important open problem in computational systems<br />

biology. Current approaches can be broadly divided into<br />

model-based and model-free approaches, and face one of<br />

two limitations: model-free methods are scalable but<br />

suffer from a lack of interpretability, and cannot in general<br />

be used for out of sample predictions. On the other hand,<br />

model-based methods focus on identifying a dynamical<br />

model of the system; these are clearly interpretable and<br />

can be used for predictions, however they rely on strong<br />

assumptions and are typically very demanding<br />

computationally. Here, we aim to bridge the gap between<br />

model-based and model-free methods by proposing a<br />

hybrid approach to the GRN inference problem, called<br />

Jump3 (Huynh-Thu & Sanguinetti, 2015). Our approach<br />

combines formal dynamical modelling with the efficiency<br />

of a nonparametric, tree-based method, allowing the<br />

reconstruction of GRNs of hundreds of genes.<br />

METHODS<br />

Gene expression model. At the heart of the Jump3<br />

framework, we use the on/off model of gene expression<br />

(Ptashne & Gann, 2002), where the rate of transcription of<br />

a gene can vary between two levels depending on the<br />

activity state μ of the promoter of the gene. The expression<br />

x of a gene is modelled through the following stochastic<br />

differential equation:<br />

$$dx_i = \left(A_i\,\mu_i(t) + b_i - \lambda_i x_i\right)dt + \sigma\,d\omega(t),$$<br />

where subscript $i$ refers to the $i$-th target gene. Here, the<br />

promoter state $\mu_i(t)$ is a binary variable (the promoter is<br />

either active or inactive) that depends on the expression<br />

levels of the transcription factors (TFs) that bind to the<br />

promoter. $A_i$, $b_i$ and $\lambda_i$ are kinetic parameters, and the term<br />

$\sigma\,d\omega(t)$ represents a white noise-driving process with<br />

variance $\sigma^2$.<br />
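To make the on/off model concrete, the sketch below simulates one gene with an Euler-Maruyama discretisation of the stochastic differential equation above. The discretisation, the parameter values and the promoter trajectory are illustrative assumptions; this is a forward simulation only, not the Jump3 inference procedure.

```python
import math
import random

def simulate_on_off(mu, A=2.0, b=0.1, lam=0.5, sigma=0.05, x0=0.0, dt=0.01, seed=0):
    """Euler-Maruyama simulation of dx = (A*mu(t) + b - lam*x) dt + sigma dw.

    `mu` is a sequence of promoter states (0 or 1), one per time step.
    All default parameter values are arbitrary placeholders.
    """
    rng = random.Random(seed)
    x, path = x0, [x0]
    for state in mu:
        drift = A * state + b - lam * x
        # The Wiener increment over a step of length dt has variance dt.
        x += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Promoter inactive for 20 time units, then active for 20 time units.
mu = [0] * 2000 + [1] * 2000
path = simulate_on_off(mu)
```

With these placeholder values the expression level relaxes towards b/λ = 0.2 while the promoter is inactive and towards (A + b)/λ = 4.2 once it becomes active, which is the two-level behaviour the model is meant to capture.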

Network reconstruction with jump trees. Recovering<br />

the regulatory links pointing to gene i amounts to finding<br />

the genes whose expression is predictive of the promoter<br />

state $\mu_i$. To achieve this goal, we propose a procedure that<br />

learns, for each target gene i, an ensemble of decision trees<br />

predicting the promoter state $\mu_i$ at any time t from the<br />

expression levels of the candidate regulators at the same<br />

time t. However, standard tree-based methods cannot be<br />

applied here since the output $\mu_i(t)$ is a latent variable. We<br />

therefore propose a new decision tree algorithm called<br />

“jump tree”, which splits the observations by maximising<br />

the marginal likelihood of the dynamical on/off model.<br />

The learned tree-based model is then used to derive an<br />

importance score for each candidate regulator, computed<br />

as the sum of the likelihood gains that are obtained at all<br />

the tree nodes where this regulator was selected to split the<br />

observations. The importance of a candidate regulator j is<br />

used as weight for the putative regulatory link of the<br />

network that is directed from gene j to gene i.<br />
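The importance score described above can be sketched as a simple aggregation. The data layout is a hypothetical one chosen for illustration (each tree reduced to a list of its split nodes with the regulator used and the likelihood gain obtained), and summing across the trees of the ensemble is an assumption; the abstract only specifies the sum over tree nodes.

```python
from collections import defaultdict

def regulator_importance(trees):
    """Sum the likelihood gains of all split nodes per candidate regulator.

    `trees` is a list of trees; each tree is a list of
    (regulator_name, likelihood_gain) pairs, one per split node.
    """
    scores = defaultdict(float)
    for tree in trees:
        for regulator, gain in tree:
            scores[regulator] += gain
    return dict(scores)

# Made-up ensemble of two jump trees learned for one target gene i.
ensemble = [
    [("geneA", 1.5), ("geneB", 0.5)],
    [("geneA", 1.0)],
]
weights = regulator_importance(ensemble)
# Candidate links j -> i, ranked by decreasing weight.
ranked_links = sorted(weights, key=weights.get, reverse=True)
```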

RESULTS & DISCUSSION<br />

We evaluated Jump3 on the networks of the DREAM4 In<br />

Silico Network challenge (Prill et al., 2010). For each<br />

network topology, two types of simulated expression data<br />

were used: data simulated using the on/off model (toy<br />

data) and the time series data that was provided in the<br />

context of the DREAM4 challenge. We compared Jump3<br />

to other GRN inference methods: two model-free methods,<br />

which are time-lagged variants of GENIE3 (Huynh-Thu et<br />

al., 2010) and CLR (Faith et al., 2007) respectively; two<br />

model-based methods, namely Inferelator (Greenfield et<br />

al., 2010) and TSNI (Bansal et al., 2006), and G1DBN<br />

(Lèbre, 2009), a method based on dynamic Bayesian<br />

networks. Areas Under the Precision-Recall curves<br />

(AUPRs) obtained for size-100 networks are shown in<br />

Table 1. Jump3 yields the highest AUPR in the case of the<br />

toy data. As expected, its performance decreases when the<br />

networks are inferred from the DREAM4 data, due to the<br />

mismatch between the on/off model and the one used to<br />

simulate the data. However, Jump3 still outperforms the<br />

other methods.<br />

Method | Toy | DREAM4<br />
Jump3 | 0.272 ± 0.060 | 0.187 ± 0.058<br />
GENIE3-lag | 0.114 ± 0.010 | 0.176 ± 0.056<br />
CLR-lag | 0.088 ± 0.008 | 0.169 ± 0.047<br />
Inferelator | 0.069 ± 0.006 | 0.144 ± 0.036<br />
TSNI | 0.020 ± 0.003 | 0.042 ± 0.010<br />
G1DBN | 0.104 ± 0.024 | 0.114 ± 0.043<br />

TABLE 1. Comparison of network inference methods (mean AUPR and<br />

standard deviation).<br />

We also applied Jump3 to gene expression data from<br />

murine bone marrow-derived macrophages treated with<br />

interferon gamma (Blanc et al., 2011). Several of the hub<br />

TFs in the predicted network have biologically relevant<br />

annotations. They include interferon genes, one gene<br />

associated with cytomegalovirus infection, and cancer-associated<br />

genes, showing the potential of Jump3 for<br />

biologically meaningful hypothesis generation.<br />

REFERENCES<br />

Bansal M et al. Bioinformatics 22, 815-822 (2006).<br />

Blanc M et al. PLoS Biol 9, e1000598 (2011).<br />

Faith JJ et al. PLoS Biol 5, e8 (2007).<br />

Greenfield A et al. PLoS ONE 5, e13397 (2010).<br />

Huynh-Thu VA & Sanguinetti G. Bioinformatics 31, 1614-1622 (2015).<br />

Huynh-Thu VA et al. PLoS ONE 5, e12776 (2010).<br />

Lèbre S. Stat Appl Genet Mol Biol 8, Article 9 (2009).<br />

Prill RJ et al. PLoS ONE 5, e9202 (2010).<br />

Ptashne M & Gann A. Genes and Signals. Cold Spring Harbor<br />

Laboratory Press (2002).<br />


Abstract ID: O7<br />

Oral presentation<br />


O7. MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT<br />

STIMULATION AND GSK3 INHIBITION<br />

Annika Jacobsen 1 , Nika Heijmans 2 , Renée van Amerongen 2 , Folkert Verkaar 3 ,<br />

Martine J. Smit 3 , Jaap Heringa 1 & K. Anton Feenstra 1 *.<br />

1 Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands; 2 Van Leeuwenhoek Centre<br />

for Advanced Microscopy and Section of Molecular Cytology, Swammerdam Institute for Life Sciences, University of<br />

Amsterdam, The Netherlands; 3 Division of Medicinal Chemistry, VU University Amsterdam, The Netherlands.<br />

*k.a.feenstra@vu.nl<br />

The Wnt/β-catenin signaling pathway is crucial for stem cell self-renewal, proliferation and differentiation. Hyperactive<br />

Wnt/β-catenin signaling caused by genetic alterations plays an important role in oncogenesis. In our newly developed<br />

Petri net model, GSK3 inhibition leads to significantly higher pathway activation (high β-catenin levels) compared to<br />

WNT stimulation, which is confirmed experimentally by TCF/LEF luciferase reporter assays. Using this validated model<br />

we can now simulate changes in Wnt/β-catenin signaling resulting from different mutations found in breast and<br />

colorectal cancer. We propose that this model can be used further to investigate different players affecting Wnt/β-catenin<br />

signaling during oncogenic transformation and the effect of drug treatment.<br />

WNT/β-CATENIN<br />

Wnt/β-catenin signaling is important for stem cell<br />

maintenance and developmental processes and is highly<br />

conserved in all multicellular organisms (1, 2). The<br />

pathway regulates the expression of specific target genes<br />

by changing the levels of the transcriptional co-activator<br />

β-catenin, which activates the TCF/LEF transcription<br />

factors. Wnt/β-catenin signaling is active in stem cells<br />

located in Wnt rich environments.<br />

APC and AXIN are key proteins of the destruction<br />

complex, which targets β-catenin for destruction.<br />

Mutations in APC, AXIN and β-catenin play important<br />

roles in oncogenesis (2, 3). To better understand the role of this pathway in<br />

oncogenesis, we here create a Petri net (PN) model of the<br />

Wnt/β-catenin signaling pathway that uses available<br />

coarse-grained data, such as binary interactions and semiquantitative<br />

protein levels. Using this model and<br />

validating experiments we show how different strengths of<br />

Wnt stimulation and GSK3 inhibition activate signaling<br />

over time.<br />

PETRI-NET MODELLING<br />

We built a PN model of Wnt/β-catenin signaling describing<br />

the logic of known (inter)actions, cf. our previous<br />

work (5). In a PN, a place represents an entity (e.g. gene),<br />

a transition indicates the activity occurring between the<br />

places (e.g. gene expression), and these are connected by<br />

directed edges called arcs that represent their interactions<br />

(e.g., activation of gene expression by a protein).<br />
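The place/transition/arc vocabulary can be illustrated with a very small Petri net. The net below is a made-up toy, not the authors' Wnt/β-catenin model: the place names, the single transition and the token counts are hypothetical, and the firing rule is the standard one (a transition is enabled when every input place holds enough tokens).

```python
def enabled(marking, transition):
    """A transition is enabled if each input place holds enough tokens."""
    return all(marking.get(place, 0) >= n
               for place, n in transition["consume"].items())

def fire(marking, transition):
    """Return the new marking after firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["consume"].items():
        new_marking[place] -= n
    for place, n in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking

# Hypothetical transition: the destruction complex removes one beta-catenin
# token and is itself returned to its place (input arc plus output arc).
degrade = {
    "consume": {"beta_catenin": 1, "destruction_complex": 1},
    "produce": {"destruction_complex": 1},
}

marking = {"beta_catenin": 2, "destruction_complex": 1}
after_one = fire(marking, degrade)
blocked = enabled({"beta_catenin": 2, "destruction_complex": 0}, degrade)
```

In this toy net, removing the destruction-complex token disables the degradation transition, so β-catenin tokens are no longer consumed.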

TRANSCRIPTION AND PROTEIN ASSAYS<br />

TCF/LEF transcription was measured by TOPFLASH<br />

reporter activity at several time points and at different<br />

concentrations of Wnt3a stimulation and GSK3 inhibition<br />

by CHIR99021. Active and total β-catenin (CTNNB1)<br />

levels were measured by Western blot.<br />

VALIDATED ACTIVATION & INHIBITION<br />

We simulate the model with initial Wnt and GSK3 token<br />

levels ranging from 0 to 5 to represent addition of Wnt and<br />

inhibition of GSK3. Figure 1 shows the four different β-<br />

catenin responses for Wnt addition (purple) and GSK3<br />

inhibition (green). At low GSK3 levels, β-catenin increases<br />

linearly, but at high GSK3 levels β-catenin remains low.<br />

At high Wnt levels, β-catenin shows a transient response,<br />

with the peak height increasing with Wnt levels. The<br />

increase of β-catenin is due to sequestration of AXIN to<br />

the cell membrane, which inactivates the destruction<br />

complex. Increase in β-catenin activates transcription of<br />

AXIN2 which triggers the negative feedback.<br />

FIGURE 1. Pathway response for different levels of Wnt and activity of<br />

GSK3. When adding Wnt, the pathway transiently activates but GSK3<br />

inhibition permanently activates.<br />

TCF/LEF reporter assay validation experiments for both<br />

perturbations show that transcriptional activity of<br />

TCF/LEF is both dosage and time dependent,<br />

corresponding well to the model for GSK3 inhibition. Wnt3a stimulation,<br />

on the other hand, does activate expression, but we<br />

do not observe the β-catenin dosage or time effect<br />

predicted by our model. Measuring β-catenin by Western<br />

blot reveals a consistent increase upon pathway activation;<br />

however, protein levels and changes are on the border of<br />

experimental sensitivity.<br />

In conclusion, our Petri net model recapitulates much of<br />

the known behavior of the Wnt/β-catenin pathway upon<br />

Wnt stimulation and GSK3 inhibition, and hints at<br />

subtleties in the mechanism that will help us gain further<br />

understanding of the role of this pathway in development<br />

and oncogenesis.<br />

REFERENCES<br />

1. Clevers & Nusse (2012) Cell. 149:1192-1205<br />

2. Holstein (2012) Cold Spring Harb Perspect Biol. 4:a007922<br />

3. MacDonald, Tamai & He (2009) Dev Cell. 17:9-26<br />

4. Klaus & Birchmeier (2008) Nat. Rev. Cancer. 8:387-398<br />

5. Bonzanni et al., (2009) Bioinformatics. 25:2049-2056<br />


Abstract ID: O8<br />

Oral presentation<br />


O8. RANKED TILING BASED APPROACH TO DISCOVERING PATIENT<br />

SUBTYPES<br />

Thanh Le Van 1,* , Jimmy Van den Eynden 3 , Dries De Maeyer 2 , Ana Carolina Fierro 5 , Lieven Verbeke 5 , Matthijs van<br />

Leeuwen 4 , Siegfried Nijssen 1,4 , Luc De Raedt 1 & Kathleen Marchal 5,6 .<br />

Department of Computer Science 1 , Centre of Microbial and Plant Genetics 2 , KULeuven, Belgium; Department of<br />

Medical Biochemistry, University of Gothenburg 3 , Sweden; Leiden Institute for Advanced Computer Science 4 ,<br />

Universiteit Leiden, The Netherlands; Department of Plant Biotechnology and Bioinformatics 5 , Department of<br />

Information Technology, iMinds 6 , Ghent University, Belgium. * thanh.levan@cs.kuleuven.be<br />

Cancer is a heterogeneous disease consisting of many subtypes that usually have both shared and distinguishing<br />

mechanisms. To derive good subtypes, it is essential to have a computational model that can score their homogeneity<br />

from different angles, for example, mutated pathways and gene expression. In this paper, we introduce our ongoing work<br />

which studies a constraint-based optimisation model to discover patient subtypes as well as their perturbed pathways<br />

from mutation, transcription and interaction data. We propose a way to solve the optimisation problem based on<br />

constraint programming principles. Experiments on a TCGA breast cancer dataset demonstrate the promise of the<br />

approach.<br />

INTRODUCTION<br />

Discovering patient subtypes and understanding their<br />

mechanisms are essential to provide precise treatments to<br />

patients. There have been efforts to understand how<br />

mutations cause subtypes, such as the work by Hofree et<br />

al. (2013). However, to the best of our knowledge,<br />

how to combine mutation<br />

and expression data to derive good subtypes remains an open question. Therefore,<br />

we study a new computational model that can discover<br />

subtypes as well as their specific mutated genes and<br />

expressed genes from mutation, transcription and<br />

interaction data.<br />

METHODS<br />

We conjecture that a subtype consists of a number of<br />

patients who have the same set of differentially expressed<br />

genes and a set of mutated genes that hit the same<br />

pathways.<br />

To find both mutations and expressions of patient subtypes,<br />

we extend our recent ranked tiling method (Le Van et al.,<br />

2014). Ranked tiling is a data mining method proposed to<br />

mine regions with high average rank values in a rank<br />

matrix. In this type of matrix, each row is a complete<br />

ranking of the columns. We find that rank matrices are a<br />

good abstraction for numeric data and are useful to<br />

integrate datasets that are at different scales.<br />

To apply the ranked tiling method, we first transform the<br />

given numeric expression matrix, where rows are<br />

expressed genes and columns are patients, into a ranked<br />

expression matrix. Then, we search for a region in the<br />

transformed matrix that has high average rank scores.<br />

However, different from the ranked tiling method, we<br />

impose a further constraint that the columns (patients) of<br />

the region should also have a number of mutated genes<br />

that have high rank scores in a network with respect to a<br />

network model. We formalise this as a constraint<br />

optimisation problem and use a constraint solver to solve<br />

it.<br />
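The rank transformation and the score of a candidate region can be sketched as below. This is only an illustration under stated assumptions: ranks are taken within each row (following the statement that each row is a complete ranking of the columns), ties are broken by column order, the expression values are made up, and the mutation constraint and the constraint solver are not modelled.

```python
def to_rank_matrix(matrix):
    """Replace each row by the ranks (1 = smallest) of its values."""
    ranked = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranks = [0] * len(row)
        for rank, j in enumerate(order, start=1):
            ranks[j] = rank
        ranked.append(ranks)
    return ranked

def average_rank(ranked, rows, cols):
    """Average rank inside the region spanned by the given rows and columns."""
    total = sum(ranked[i][j] for i in rows for j in cols)
    return total / (len(rows) * len(cols))

# Made-up expression matrix: 3 genes (rows) by 4 patients (columns).
expression = [
    [0.1, 2.3, 1.7, 0.4],
    [5.0, 9.1, 8.2, 4.9],
    [3.3, 0.2, 0.9, 3.1],
]
ranked = to_rank_matrix(expression)
score = average_rank(ranked, rows=[0, 1], cols=[1, 2])
```

Because only ranks are compared, the second gene's much larger raw values do not dominate the score, which is the property that makes rank matrices convenient for data on different scales.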

RESULTS & DISCUSSION<br />

We apply our method to a TCGA breast cancer dataset and<br />

discover eight subtypes. Compared to PAM50 annotations,<br />

our method divides the Basal subtype into three sub-groups<br />

named S2, S3 and S6. The LumA subtype is divided into<br />

four smaller groups, namely S1, S4, S7 and S8. Finally, our<br />

method recovers the Her2 subtype in S5.<br />

To validate the mined subtypes in the patient dimension,<br />

we assume PAM50 annotations are true labels for them.<br />

Then, grouping patients into subtypes can be seen as a<br />

multi-class prediction problem, for which we can calculate<br />

the F1 score to measure the average accuracy. We also<br />

compare our scores with state-of-the-art methods, including<br />

iCluster+ (Mo, Q. et al., 2013), NBS (Hofree et al., 2013)<br />

and SNF (Wang B. et al., 2014). The result (not shown)<br />

illustrates that our subtypes are more homogeneous than<br />

the ones produced by iCluster+ and NBS and are<br />

comparable to those by SNF.<br />

To validate the mined subtypes in the gene dimension, we<br />

perform hypergeometric tests to see how their mutated genes<br />

and expressed genes are related to cancer pathways.<br />

Figure 1 is a heatmap showing the $\log_{10}$ p-values<br />

of the tests. In this figure, we can see that the discovered<br />

subtypes have specific perturbed pathways.<br />
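The abstract does not spell out how the enrichment p-values are computed. A common choice for pathway enrichment is the upper tail of a hypergeometric distribution, which is assumed in the sketch below; the gene counts are invented for illustration.

```python
from math import comb, log10

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K lie in the pathway."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Made-up example: 20 genes in total, 5 in the pathway,
# 4 genes selected for a subtype, 2 of them in the pathway.
p = hypergeom_pvalue(N=20, K=5, n=4, k=2)
log_p = log10(p)  # the kind of value a heatmap of log10 p-values would display
```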

FIGURE 1. Cancer pathway enrichment analysis using mined mutated<br />

genes and expressed genes of subtypes<br />

REFERENCES<br />

Hofree et al., Nat Methods 10(11), 1108–15 (2013).<br />

Le Van et al., ECML/PKDD 2014 (2), 98–113 (2014).<br />

Mo, Q. et al., PNAS 110(11), 4245–50 (2013).<br />

Wang, B. et al., Nat Methods 11(3), 333–7 (2014).<br />


Abstract ID: O9<br />

Oral presentation<br />


O9. DEVELOPMENT OF A DNA METHYLATION-BASED SCORE<br />

REFLECTING TUMOUR INFILTRATING LYMPHOCYTES<br />

Martin Bizet 1,2,3*# , Jana Jeschke 1# , Christine Desmedt 4 , Emilie Calonne 1 , Sarah Dedeurwaerder 1 ,<br />

Gianluca Bontempi 2,3 , Matthieu Defrance 1,2 , Christos Sotiriou 4 and Francois Fuks 1<br />

Laboratory of Cancer Epigenetics, Faculty of Medicine, Université Libre de Bruxelles 1 ; Interuniversity Institute of<br />

Bioinformatics in Brussels, Université Libre de Bruxelles & Vrije Universiteit Brussel 2 ; Machine Learning Group,<br />

Computer Science Department, Université Libre de Bruxelles, Brussels 3 ; Breast Cancer Translational Research<br />

Laboratory, Jules Bordet Institute, Université Libre de Bruxelles 4 ; # These authors contributed equally to this work;<br />

* mbizet@ulb.ac.be<br />

Tumour infiltrating lymphocytes (TIL) are increasingly recognised as one of the key features to predict outcome and<br />

therapy response in malignancies. However, measuring quantities of TIL remains challenging since it relies on subjective<br />

and spatially-restricted measurements from a pathologist. In this study we used genome-scale DNA-methylation profiles<br />

from breast tumours to develop a so-called MeTIL score, which reflects TIL level within whole-tumour samples. We<br />

demonstrate the robustness to noise of the MeTIL score using simulated data as well as the ability of the MeTIL score to<br />

sensitively measure TIL in patient samples and to improve prediction of outcome.<br />

INTRODUCTION<br />

Breast cancer (BC) is one of the most common and<br />

deadliest diseases in women from Western countries.<br />

Tumour infiltrating lymphocytes (TIL) have emerged as one of<br />

the key features to predict outcome and response to<br />

treatment in this disease [1]. However, the measurement of<br />

TIL levels remains challenging because it relies on manual<br />

readings of a tumour slide by a pathologist, which<br />

is subjective by nature and does not necessarily reflect the<br />

whole-tumour TIL content. In this study we took<br />

advantage of the high tissue-specificity of DNA-methylation<br />

patterns [2] to develop a so-called MeTIL<br />

score, which predicts the amount of lymphocytes within<br />

the tumour.<br />

METHODS<br />

The MeTIL score was developed in three key steps:<br />

1. We first used genome-scale DNA-methylation<br />
profile data from 11 cell lines (8 normal or<br />
cancerous epithelial breast and 3 T-lymphocyte lines)<br />
to extract 29 cytosines specifically unmethylated<br />
in T-lymphocytes (delta-beta < -0.8 and standard<br />
deviation between groups < 0.1).<br />

2. We then applied a cross-validated pipeline,<br />
combining mRMR feature selection and a random-forest<br />
algorithm, to 118 BC samples to extract a<br />
minimal set of cytosines whose methylation level<br />
is predictive of TIL quantities.<br />

3. Finally, we used a “normalised PCA” approach to<br />
compute a single MeTIL score from the<br />
individual methylation values.<br />

The robustness of the relation between the MeTIL score<br />

and TIL levels was also assessed using the Spearman<br />

correlation computed from 10 000 simulations with<br />

varying proportions of TIL (Fig. 1B and C). The simulated<br />

data took two sources of noise into account:<br />

- Technical noise, modeled as Gaussian noise.<br />
- Perturbations due to the presence of other cell-types<br />
within the tumour microenvironment that<br />
are not lymphocytic or epithelial, modeled by a<br />
methylation value sampled randomly from the<br />
array.<br />
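The simulation set-up can be sketched for a single marker as follows. This is a simplified illustration, not the authors' code: the methylation levels of lymphocytes and epithelial cells, the noise level, the fixed proportion of other cell-types and the number of simulated samples are placeholder values, the "random array value" is drawn uniformly from [0, 1], and the correlation is computed on one marker rather than on the MeTIL score.

```python
import random

def ordinal_ranks(values):
    """Rank positions 0..n-1; ties are broken by input order (an approximation)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ordinal_ranks(x), ordinal_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def simulate_marker(f_til, f_other, rng, m_til=0.1, m_epi=0.9, noise_sd=0.05):
    """Methylation = f1*M1 + f2*M2 + f3*M3 + e, clipped to the range [0, 1]."""
    f_epi = 1.0 - f_til - f_other
    m_other = rng.random()  # stand-in for a value sampled from the array
    value = f_til * m_til + f_epi * m_epi + f_other * m_other
    value += rng.gauss(0.0, noise_sd)
    return min(1.0, max(0.0, value))

rng = random.Random(0)
til_fraction = [rng.uniform(0.0, 0.5) for _ in range(500)]
methylation = [simulate_marker(f, 0.2, rng) for f in til_fraction]
rho = spearman(til_fraction, methylation)
```

Because the markers are unmethylated in T-lymphocytes, a higher TIL fraction lowers the simulated methylation value, so `rho` is expected to be negative for a single marker.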

Lastly, we measured TIL quantities with the MeTIL score<br />

in three independent BC cohorts and applied Cox<br />

regression models to evaluate the prognostic value of the<br />

MeTIL score.<br />

RESULTS & DISCUSSION<br />

We first applied a hierarchical clustering analysis and<br />

observed that BC samples with high TIL infiltration show<br />

a hypomethylated pattern for all MeTIL markers (Fig.1A).<br />

Furthermore we demonstrated, using simulations, a strong<br />

correlation between the MeTIL score and TIL levels, even<br />

when a high level of noise (0.7 times the standard deviation)<br />

and a high proportion of perturbing unknown cell-types<br />

(70%) were included in the model (Fig.1B).<br />

FIGURE 1. The MeTIL score reflects TIL levels. (A) Heatmap showing the<br />

methylation values of the 5 MeTIL markers. A ‘TIL high’ group with a<br />

hypomethylated pattern (orange) appeared. (B) Color-map of the<br />

Spearman correlation between MeTIL score and TIL level for increasing<br />

noise (y-axis) and abundance of unknown cell-types (x-axis) based on<br />

simulations. (C) The methylation value of each MeTIL marker was simulated<br />

as the sum of the methylation levels in lymphocytes (M1), epithelial cells<br />

(M2) and other cell-types (random value M3), weighted by their<br />

proportions in the tissue (f1, f2, f3), plus Gaussian noise (e).<br />

Finally, we observed consistent patterns of TIL levels<br />

within BC subtypes in independent cohorts suggesting the<br />

robust nature of our score to evaluate TIL levels.<br />

Furthermore, Cox regression analysis revealed a<br />

prognostic value for the MeTIL score in triple-negative<br />

and luminal BC (p-value < 0.05).<br />

REFERENCES<br />

[1] Loi, S., et al. Official journal of the European Society for Medical Oncology /<br />

ESMO 25, 1544-1550 (2014).<br />

[2] Jeschke, J., Collignon, E., Fuks, F. FEBS J. 282(9), 1801-14 (2015).<br />


Abstract ID: O10<br />

Oral presentation<br />


O10. PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES<br />

USING MACHINE LEARNING TECHNIQUES<br />

Aliaksei S Vasilevich 1 *, Shantanu Singh 2 , Aurélie Carlier 1 & Jan de Boer 1 .<br />

Laboratory for Cell Biology-inspired Tissue Engineering, Merln Institute, Maastricht University 1 , Imaging Platform,<br />

Broad Institute of MIT and Harvard 2 . *a.vasilevich@maastrichtuniversity.nl<br />

Topographical cues have been repeatedly shown to influence cell fate dramatically (Bettinger et al., 2009). This<br />

phenomenon opens new opportunities to design the interaction between biomaterials and biological tissues in a<br />

predictable manner. Unfortunately, the exact mechanism of topographical control of cell behavior remains largely<br />

unknown. We have therefore developed a technology in our laboratory to determine an optimal surface topography for<br />

virtually any application in the biomedical field. Previously, we reported that we can control cell shape with our surfaces<br />

in a predictable manner (Hulsman et al., 2015). Here we demonstrate that we can successfully predict not only cell shape,<br />

but also cell response at the protein level based on the properties of our topographies. The results of our study show that we<br />

are able to design materials for biomedical applications that require a particular cell behavior.<br />

INTRODUCTION<br />

The TopoChip, a micro topography screening platform,<br />

enables the assessment of cell response to 2176 unique<br />

topographies in a single high-throughput screen. The<br />

topographical features were randomly selected from an in<br />

silico library of more than 150 million topographies,<br />

which were generated by an algorithm that synthesized<br />

patterns based on simple geometric elements – circles,<br />

triangles and rectangles (Unadkat et al., 2011). In our<br />

previous studies, we have demonstrated that these surface<br />

topographies exert a mitogenic effect on hMSCs (Unadkat<br />

et al., 2011) and influence cell shape (Hulsman et al.,<br />

2015). In this paper, we show that these topographies can<br />

also be used to modulate the ALP expression in human<br />

mesenchymal stromal cells, as well as pluripotency in<br />

human induced pluripotent stem (iPS) cells. We further<br />

show that computational models can be built to predict<br />

these protein levels using surface topography parameters.<br />

METHODS<br />

Cell response to topography was captured by high-content<br />

imaging. Using image analysis and data mining methods<br />

described previously (Hulsman et al., 2015),<br />

multiparametric “profiles” of cellular response were<br />

obtained. Multiple replicates of each topography were<br />

used to estimate the median level of a cellular response of<br />

interest – either ALP in human mesenchymal stromal cells<br />

(hMSCs), or the median number of Oct4-positive cells in<br />

a population of human induced pluripotent stem cells<br />

(hIPSCs). We aimed to predict the cellular response based<br />

on surface topography parameters using machine learning<br />

methods. To learn and validate these methods (specifically,<br />

classifiers), the data were split into training and testing<br />

sets in a 3:1 proportion respectively. In the training step,<br />

we performed a 10-fold cross-validation to obtain optimal<br />

parameters for each classifier. The caret package (Kuhn,<br />

2008) in R (R Core Team, 2015) was used to perform<br />

the analysis.<br />
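The validation scheme (a 3:1 train/test split followed by 10-fold cross-validation on the training part) can be sketched as below. The authors used the caret package in R; this standard-library Python sketch only illustrates how the samples are partitioned, with a hypothetical sample count of 200 and no classifier attached.

```python
import random

def split_3_to_1(samples, seed=0):
    """Shuffle the samples and split them into training and test sets (3:1)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]

def k_fold(n, k=10):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    for fold in range(k):
        validation = list(range(fold, n, k))
        held_out = set(validation)
        training = [i for i in range(n) if i not in held_out]
        yield training, validation

# Placeholder: 200 topographies identified by index.
train, test = split_3_to_1(range(200))
folds = list(k_fold(len(train), k=10))
```

Tuning parameters would be chosen from the average validation performance over the ten folds, and the held-out test set would be used once, for the final accuracy estimate.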

RESULTS & DISCUSSION<br />

In the first project, we conducted a screening on the<br />

TopoChip with hMSCs in order to find topographies that<br />

would be able to increase the ALP level, a protein that is<br />

an early marker of osteogenesis. We were able to<br />

successfully find such surfaces and confirm results<br />

experimentally (publication in preparation). To move<br />

further we decided to check how accurately we can make a<br />

prediction of ALP level in hMSCs based on topographical<br />

features. Focussing only on extreme examples, we<br />

selected 100 high- and low-scoring topographies and<br />

used the model validation scheme described in Methods to<br />

find the most accurate binary classifier for our data set.<br />

We tested several classifiers and found random forest<br />

to be the most accurate, with an accuracy of 96% on<br />

the held-out test set.<br />

In a second project, we aim to find a topography that will<br />

increase proliferation and pluripotency of hIPSCs. We<br />

used Oct4 as a marker of pluripotency. The screening was<br />

performed on one half of the TopoChip (1000+ surfaces),<br />

which were then ranked based on the number of Oct4<br />

positive cells. One hundred high- and low-scoring surfaces<br />

were chosen to train a classifier. Using logistic regression,<br />

we obtained 72% accuracy on a held-out test set. We used<br />

this model to predict surfaces that would increase<br />

pluripotency in hIPSCs among surfaces that were not<br />

included in the initial screening. Topographies were<br />

ranked according to their predicted probability score and<br />

the top 30 surfaces were chosen for experimental validation.<br />

We found that 79% of selected surfaces were predicted<br />

accurately.<br />

In summary, the combination of our screening methods<br />

and machine learning algorithms opens new avenues to<br />

design surfaces with desired properties for various<br />

applications. Our next step will be to find a surface with<br />

maximum ALP level from our virtual library based on our<br />

screening data.<br />

REFERENCES<br />

Bettinger C J, Langer R, & Borenstein J T. “Engineering Substrate<br />

Micro- and Nanotopography to Control Cell Function.” Angewandte<br />

Chemie (International ed. in English) 48.30 (2009).<br />

Hulsman M et al., Analysis of high-throughput screening reveals the<br />

effect of surface topographies on cellular morphology, Acta<br />

Biomaterialia, 15, (2015).<br />

Kuhn M. “Building Predictive Models in R Using the caret Package”<br />

Journal of Statistical Software, Vol. 28, (2008)<br />

R Core Team. R: A language and environment for statistical computing.<br />

R Foundation for Statistical Computing, Vienna, Austria. URL<br />

http://www.R-project.org/. (2015)<br />

Unadkat H V. et al. “An Algorithm-Based Topographical Biomaterials<br />

Library to Instruct Cell Fate.” Proceedings of the National Academy<br />

of Sciences of the United States of America 108.40 (2011).<br />


Abstract ID: O11<br />

Oral presentation<br />


O11. ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS<br />

Wout Bittremieux 1 , Pieter Meysman 1 , Lennart Martens 2 , Bart Goethals 1 , Dirk Valkenborg 3 & Kris Laukens 1 .<br />

Advanced Database Research and Modeling (ADReM) & Biomedical Informatics Research Center Antwerp (biomina),<br />

University of Antwerp / Antwerp University Hospital 1 ; Department of Biochemistry & Department of Medical Protein<br />

Research, Ghent University / VIB 2 ; Flemish Institute for Technological Research (VITO) 3 .<br />

* wout.bittremieux@uantwerpen.be<br />

Mass-spectrometry-based proteomics is a powerful analytical technique to identify complex protein samples; however,<br />

its results are still subject to large variability. Recently, several quality control metrics have been introduced to assess the<br />

performance of a mass spectrometry experiment. Unfortunately, these metrics are generally not well<br />

understood. For this reason, we present a few powerful techniques to analyze multiple experiments based on quality<br />

control metrics, identify low-performance experiments, and provide an interpretation of outlying experiments.<br />

INTRODUCTION<br />

Mass-spectrometry-based proteomics is a powerful<br />

analytical technique that can be used to identify complex<br />

protein samples. Despite many technological and<br />

computational advances, performing a mass spectrometry<br />

experiment is still a highly complicated task and its results<br />

are subject to large variability. To understand and<br />

evaluate how technical variability affects the results of an<br />

experiment, several quality control (QC) and<br />

performance metrics have recently been introduced. Unfortunately,<br />

despite the availability of such QC metrics covering a<br />

wide range of qualitative information, a systematic<br />

approach to quality control is often still lacking.<br />

As most quality control tools are able to generate several<br />

dozens of metrics, any single experiment can be<br />

characterized by multiple QC metrics. Therefore, it is<br />

often not clear which metrics are most interesting in<br />

general, or even which metrics are relevant in a specific<br />

situation. To take into account the multidimensional data<br />

space formed by the numerous metrics, we have applied<br />

advanced techniques to visualize, analyze, and interpret<br />

the QC metrics.<br />

METHODS<br />

Outlier detection can be used to detect deviating<br />

experiments with a low performance or a high level of<br />

(unexplained) variability. These outlying experiments can<br />

subsequently be analyzed to discover the source of the<br />

reduced performance and to enhance the quality of future<br />

experiments.<br />

However, it is insufficient to know that a specific<br />

experiment is an outlier; it is also of vital importance to<br />

know the reason. To understand why an experiment is an<br />

outlier, we have used the subspace of QC metrics in which<br />

the outlying experiment can be differentiated from the<br />

other experiments. This provides crucial information on<br />

how to interpret an outlier, which can be used by domain<br />

experts to increase interpretability and investigate the<br />

performance of the experiment.<br />
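The idea of flagging an outlying experiment and reporting the QC metrics that set it apart can be illustrated with a small sketch. The abstract does not specify which outlier detection or subspace method was used, so the robust z-score approach below is a swapped-in simplification, and the metric names, run names and threshold are hypothetical.<br />

```python
from statistics import median
from typing import Dict, List


def robust_z_scores(values: List[float]) -> List[float]:
    """Median/MAD-based z-scores; returns zeros when the MAD is zero."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]


def explain_outliers(
    experiments: Dict[str, Dict[str, float]], threshold: float = 3.5
) -> Dict[str, List[str]]:
    """Map each outlying experiment to the metrics in which it deviates,
    ordered from most to least deviating (a crude 'explanatory subspace')."""
    names = list(experiments)
    metrics = list(next(iter(experiments.values())))
    z: Dict[str, Dict[str, float]] = {name: {} for name in names}
    for metric in metrics:
        column = [experiments[name][metric] for name in names]
        for name, score in zip(names, robust_z_scores(column)):
            z[name][metric] = score
    explanations: Dict[str, List[str]] = {}
    for name in names:
        deviating = [m for m in metrics if abs(z[name][m]) > threshold]
        if deviating:
            explanations[name] = sorted(deviating, key=lambda m: -abs(z[name][m]))
    return explanations


# Hypothetical QC metrics for five runs; run5 has a collapsed MS1 count.
runs = {
    "run1": {"ms1_count": 100.0, "peak_width": 10.0},
    "run2": {"ms1_count": 102.0, "peak_width": 10.5},
    "run3": {"ms1_count": 98.0, "peak_width": 9.5},
    "run4": {"ms1_count": 101.0, "peak_width": 10.2},
    "run5": {"ms1_count": 40.0, "peak_width": 9.8},
}
print(explain_outliers(runs))
```

A per-metric score of this kind treats metrics independently; a method that searches multidimensional subspaces, as described in the text, can also detect experiments that only deviate in a combination of metrics.<br />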

RESULTS & DISCUSSION<br />

Figure 1 shows an example of interpreting a specific<br />

experiment that has been identified as an outlier. As can<br />

be seen, two QC metrics mainly contribute to this<br />

experiment being an outlier. The explanatory subspace<br />

formed by these QC metrics can be extracted, which can<br />

then be interpreted by domain experts, yielding insights<br />

into relationships between various QC metrics.<br />

FIGURE 1. QC metric importances for interpreting an outlying<br />

experiment.<br />

Next, by combining the explanatory subspaces for all<br />

individual outliers, it is possible to get a general view on<br />

which QC metrics are most relevant when detecting<br />

deviating experiments. When taking the various<br />

explanatory subspaces for all different outliers into<br />

account, a distinction between several of the outliers can<br />

be made in terms of the number of identified spectra<br />

(PSMs). As can be seen in Figure 2, for some specific QC<br />

metrics (highlighted in italics) the outliers result in a<br />

notably lower number of PSMs compared to the non-outlying<br />

experiments.<br />

Because monitoring a large number of QC metrics on a<br />

regular basis is often impractical, it is more convenient to<br />

focus on a small number of user-friendly, well-understood,<br />

and discriminating metrics. As the QC metrics highlighted<br />

in Figure 2 are shown to indicate low-performance<br />

experiments, these metrics are prime candidates to monitor<br />

on a continuous basis to quickly detect faulty experiments.<br />

FIGURE 2. Comparison of the number of PSMs between the non-outlying<br />

and the outlying experiments.<br />


Abstract ID: O12<br />

Oral presentation<br />


O12. XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM<br />

Şule Yılmaz 1,2,3* , Masa Cernic 4 , Friedel Drepper 5 , Bettina Warscheid 5 , Lennart Martens 1,2,3 & Elien Vandermarliere 1,2,3 .<br />

Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ;<br />

Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 ; Department of Biochemistry, Molecular and<br />

Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia 4 ; Functional Proteomics and Biochemistry, Department of<br />

Biochemistry and Functional Proteomics, Institute for Biology II and BIOSS Centre for Biological Signaling Studies,<br />

University of Freiburg, Freiburg, Germany 5 . *sule.yilmaz@ugent.be<br />

Chemical cross-linking coupled with mass spectrometry (XL-MS) facilitates the determination of protein structure and<br />

the understanding of protein interactions. The current computational approaches rely on different strategies with a limited<br />

number of open-source and easy-to-use search algorithms. We therefore built a novel cross-linked peptide identification<br />

algorithm, called Xilmass, which has a novel database construction and a new scoring function adapted from traditional<br />

database search algorithms. We compared the performance of Xilmass against one of the most popular and publicly<br />

available algorithms, pLink, and a recently published algorithm, Kojak. We found that Xilmass identified 140 spectra<br />

whereas Kojak and pLink identified 119 and 35, respectively. We mapped the cross-linking sites on the structure which<br />

resulted in the identification of 20 possible cross-linking sites. These findings show that Xilmass allows the identification<br />

of cross-linking sites.<br />

INTRODUCTION<br />

The structure of a protein is crucial for its functionality.<br />

Protein structure is commonly determined by X-ray<br />

crystallography or nuclear magnetic resonance (NMR).<br />

X-ray crystallography is only feasible for crystallizable<br />

proteins and NMR has a protein size limitation. Due to<br />

these restrictions, protein complexes are much more<br />

difficult to approach with these classical methods.<br />

However, chemical cross-linking of the complex coupled<br />

with mass spectrometry (XL-MS) allows the study of these<br />

protein complexes. The identification of the measured<br />

fragmentation spectra is a challenging task. One approach<br />

to identify cross-linked peptides is to linearize cross-linked<br />

peptide pairs in order to generate a database that can be<br />

searched with traditional search engines (Maiolica et al., 2007).<br />

However, a traditional search engine is not directly<br />

applicable to the identification of cross-linked peptides. Another<br />

approach is to rely on the usage of labeled cross-linkers,<br />

but this has a decreased performance when unlabeled<br />

cross-linkers are used. We therefore built an algorithm,<br />

Xilmass, which is designed for the identification of<br />

XL-MS fragmentation spectra without linearizing peptides<br />

or requiring labeled cross-linkers. We also<br />

introduced a new way of representing a cross-linked<br />

peptide database and directly implemented a new scoring<br />

function.<br />

METHODS<br />

The data sets were derived from human calmodulin (CaM)<br />

and the actin binding domain of plectin (plectin-ABD)<br />

which were cross-linked by DSS. The data sets were<br />

analyzed on a Velos Orbitrap Elite.<br />

Cross-linked peptides were identified by Xilmass, pLink<br />

(Yang et al., 2012) and Kojak (Hoopmann et al., 2015).<br />

The identifications of both Xilmass and Kojak were<br />

validated by Percolator (Käll et al., 2007) at q-value=0.05.<br />

pLink returned a validated list at FDR=0.05.<br />

The findings on cross-linking sites were validated with the<br />

aid of the available structures (Plectin PDB-entry: 4Q57<br />

and calmodulin PDB-entry: 2F3Y). The cross-linking sites<br />

were predicted by X-Walk (Kahraman et al., 2011) and<br />

PyMOL was used for the visualization.<br />

RESULTS & DISCUSSION<br />

We compared the number of identified spectra and cross-linking<br />

sites from Xilmass, pLink and Kojak. Xilmass<br />

identified 140 spectra whereas Kojak and pLink identified<br />

119 and 35 spectra, respectively (at FDR=0.05). Xilmass<br />

identified 53 cross-linking sites from the 140 spectra with<br />

37 obtained from at least 2 peptide-to-spectrum matches<br />

(PSMs). Kojak identified more cross-linking sites (60);<br />

however, only 26 of them had at least 2 PSMs.<br />

The cross-linking sites identified by Xilmass were<br />

manually verified on the structure (Figure 1). We classified<br />

the cross-linking sites as possible (Cα-Cα distances within<br />

30 Å; orange) or not-predicted (Cα-Cα distances<br />

exceeding 30 Å; blue); 20 sites were classified as possible.<br />

These findings show that Xilmass<br />

allows the identification of cross-linking sites.<br />
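The distance criterion used for this verification can be written down as a short check. The sketch below assumes a plain straight-line Cα-Cα distance computed from atomic coordinates and treats a distance of exactly 30 Å as within range; the function names and the coordinates are hypothetical and are not taken from the PDB entries mentioned in the text.<br />

```python
import math
from typing import Sequence

MAX_CA_DISTANCE = 30.0  # Å, the cut-off stated in the text


def ca_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two C-alpha positions given as (x, y, z)."""
    return math.dist(a, b)


def classify_cross_link(
    a: Sequence[float], b: Sequence[float], max_distance: float = MAX_CA_DISTANCE
) -> str:
    """Label a residue pair 'possible' or 'not-predicted' by its C-alpha distance."""
    return "possible" if ca_distance(a, b) <= max_distance else "not-predicted"


# Made-up coordinates for two residue pairs.
print(classify_cross_link((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
print(classify_cross_link((0.0, 0.0, 0.0), (0.0, 0.0, 31.0)))
```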

FIGURE 1. The identified cross-linking sites were mapped on the plectin<br />

protein structure to manually verify them (PDB-entry: 4Q57).<br />

REFERENCES<br />

Hoopmann,M.R. et al. Journal of Proteome Research, 14, 2190–2198<br />

(2015)<br />

Kahraman,A. et al. Bioinformatics, 27, 2163–2164 (2011)<br />

Käll,L. et al. Nature Methods, 4, 923–925 (2007)<br />

Maiolica,A. et al. Molecular & cellular proteomics:MCP, 6, 2200–2211<br />

(2007)<br />

Yang,B. et al. Nature Methods, 9, 904–906 (2012)<br />


Abstract ID: O13<br />

Oral presentation<br />


O13. AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES<br />

BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS<br />

Nico Verbeeck 1* , Jeffrey Spraggins 2 , Yousef El Aalamat 3,4 , Junhai Yang 2 ,<br />

Richard M. Caprioli 2 , Bart De Moor 3,4 , Etienne Waelkens 5,6 & Raf Van de Plas 1,2 .<br />

Delft Center for Systems and Control (DCSC), Delft University of Technology 1 ; Mass Spectrometry Research Center<br />

(MSRC), Vanderbilt University 2 ; STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Dept.<br />

of Electrical Engineering (ESAT), KU Leuven 3 ; iMinds Medical IT, KU Leuven 4 ; Dept. of Cellular and Molecular<br />

Medicine, KU Leuven 5 ; Sybioma, KU Leuven 6 . * n.verbeeck@tudelft.nl<br />

Imaging mass spectrometry (IMS) is a powerful molecular imaging technology that generates large amounts of data,<br />

making manual analysis often practically infeasible. In this work we aid the differential analysis of multiple IMS datasets<br />

by linking these data to an anatomical atlas. Using matrix factorization based multivariate analysis techniques, we are<br />

able to identify differential biomolecular signals between individual tissue samples in an obesity case study on mouse<br />

brain. The resulting differential signals are then automatically interpreted in terms of anatomical structures using a<br />

convex optimization approach and the Allen Mouse Brain Atlas. The automated anatomical interpretation facilitates<br />

much deeper exploration by the biomedical expert for these types of very rich data sets.<br />

INTRODUCTION<br />

Imaging Mass Spectrometry (IMS) is a relatively new<br />

molecular imaging technology that enables a user to<br />

monitor the spatial distributions of hundreds of<br />

biomolecules in a tissue slice simultaneously. This unique<br />

property makes IMS an immensely valuable technology in<br />

biomedical research. However, it also leads to very large<br />

amounts of data in a single analysis (e.g. >1 TB), making<br />

manual analysis of these data increasingly impractical. In<br />

order to aid the exploration of these data, we have recently<br />

developed a framework that integrates IMS data with an<br />

anatomical atlas. The framework uses the anatomical data<br />

in the atlas to automatically interpret the IMS data in terms<br />

of anatomical structures, and guides the user towards<br />

relevant findings within a single tissue section. In this<br />

work, we extend this framework towards the automated<br />

interpretation of biomolecular differences between<br />

multiple IMS datasets.<br />

METHODS<br />

We demonstrate our method on IMS data of multiple<br />

mouse brain sections, and use the Allen Mouse Brain<br />

Atlas as the curated anatomical data source that is linked<br />

to the MALDI-based IMS measurements. We spatially<br />

map the data of each individual IMS dataset to the<br />

anatomical atlas using both rigid and non-rigid registration<br />

techniques. This establishes a common reference space<br />

and allows for direct comparison of spatial locations<br />

between the different IMS datasets. Group Independent<br />

Component Analysis (GICA) is then used to automatically<br />

extract the differentially expressed biomolecular patterns,<br />

after which convex optimization is used to automatically<br />

interpret the differential components in terms of known<br />

anatomical structures (Verbeeck et al., 2014), directly<br />

listing the anatomical areas in which changes occur.<br />
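To make the interpretation step more concrete, the sketch below expresses a spatial pattern as a non-negative combination of atlas region masks. The abstract does not give the exact convex formulation, so plain non-negative least squares solved by projected gradient descent is an assumption made here for illustration, and the masks and pattern are toy data.<br />

```python
from typing import List


def fit_nonnegative_mixture(
    masks: List[List[float]], pattern: List[float], iterations: int = 500
) -> List[float]:
    """Projected gradient descent for
    min_x 0.5 * || sum_k x[k] * masks[k] - pattern ||^2  subject to x >= 0."""
    n_masks = len(masks)
    n_pixels = len(pattern)
    # The trace of the Gram matrix bounds its largest eigenvalue, so 1/trace
    # is a safe (if conservative) step size.
    gram_trace = sum(v * v for mask in masks for v in mask)
    if gram_trace == 0:
        return [0.0] * n_masks
    step = 1.0 / gram_trace
    x = [0.0] * n_masks
    for _ in range(iterations):
        residual = [
            sum(x[k] * masks[k][p] for k in range(n_masks)) - pattern[p]
            for p in range(n_pixels)
        ]
        for k in range(n_masks):
            gradient_k = sum(masks[k][p] * residual[p] for p in range(n_pixels))
            x[k] = max(0.0, x[k] - step * gradient_k)  # project onto x >= 0
    return x


# Three toy "anatomical structures" over eight pixels, and a pattern that is
# strong in the first structure, absent in the second, weak in the third.
region_masks = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
]
differential_pattern = [2, 2, 2, 0, 0, 0.5, 0.5, 0.5]
print(fit_nonnegative_mixture(region_masks, differential_pattern))
```

The fitted coefficients indicate which regions contribute to the pattern and how strongly, which is the kind of listing of anatomical areas the text refers to.<br />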

RESULTS & DISCUSSION<br />

We demonstrate our approach in an obesity case study on<br />

mouse brain. All tissue sections are cryosectioned at 10<br />

μm and thaw-mounted onto ITO-coated glass slides, after<br />

which they are sublimated with CMBT matrix. MALDI<br />

IMS images are collected using the Bruker 15T solariX<br />

FTICR MS with a spatial resolution of 50 μm, collecting<br />

approximately 35,000 pixels per experiment.<br />

The IMS data of the different experiments are registered to<br />

the anatomical reference space provided by the Allen<br />

Mouse Brain Atlas, establishing an inter-experiment<br />

study-wide reference space. Analysis of the IMS<br />

measurements using GICA reveals multiple biomolecular<br />

patterns that differentiate between the various dietary<br />

conditions examined by the study. The retrieved<br />

differentially expressed biomolecular patterns are then<br />

translated to combinations of anatomical structures using<br />

our convex optimization approach, similar to what a<br />

human investigator would do. This automated<br />

interpretation of inter-experiment differences can serve as<br />

a great accelerator in the exploration of IMS data, as it<br />

avoids the time- and resource-intensive step of having a<br />

histological expert manually interpret the differential<br />

patterns.<br />

FIGURE 1. Automated anatomical interpretation of a biomolecular<br />

pattern that is differentially expressed in coronal mouse brain sections<br />

between a high fat and a low fat diet in our obesity case study.<br />

REFERENCES<br />

Verbeeck, N. et al. Automated anatomical interpretation of ion<br />

distributions in tissue: linking imaging mass spectrometry to curated<br />

atlases. Anal. Chem. 86, 8974–8982 (2014).<br />


Abstract ID: O14<br />

Oral presentation<br />


O14. ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA<br />

THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS<br />

Yousef El Aalamat 1,2* , Xian Mao 1,2 , Nico Verbeeck 3 , Junhai Yang 4 , Bart De Moor 1,2 ,<br />

Richard M. Caprioli 4 , Etienne Waelkens 5,6 & Raf Van de Plas 3,4 .<br />

Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data<br />

Analytics, KU Leuven 1 ; iMinds Medical IT, KU Leuven 2 ; Delft Center for Systems and Control, Delft University of<br />

Technology 3 ; Mass Spectrometry Research Center (MSRC), Vanderbilt University 4 ; Department of Cellular and<br />

Molecular Medicine, KU Leuven 5 ; Sybioma, KU Leuven 6 . *yelaalam@esat.kuleuven.be<br />

Imaging mass spectrometry (IMS) is rapidly evolving as a label-free, spatially resolved molecular imaging tool for the<br />

direct analysis of biological samples. However, mass spectrometry (MS) measurements are subject to different types of<br />

noise. In IMS, one of the most abundant noise types in ion images is the presence of localized intensity spikes, also known<br />

as sparse intensity variations, which occur on top of the biological ion distribution pattern. In this study, we develop<br />

a method that addresses the issue of sparse intensity noise. We use low-rank approximations of the IMS data to separate<br />

and filter sparse intensity variations from the MS signals. The efficiency of the developed method is tested using MS<br />

measurements of coronal sections of mouse brain and strong de-noising performance is demonstrated both along the<br />

spatial and the spectral domain.<br />

INTRODUCTION<br />

Imaging mass spectrometry (IMS) provides unique<br />

capabilities for biomedical and biological research.<br />

However, its measurements tend to be subject to different<br />

types of noise. One of the more abundant noise types in<br />

IMS is localized intensity spikes, which can be seen as<br />

sparse intensity variations on top of the true biological ion<br />

patterns. This kind of noise can have a substantial impact,<br />

particularly on low ion intensity measurements where the<br />

signal-to-noise ratio (SNR) can be significantly affected.<br />

We present a method to filter sparse intensity variations<br />

from IMS data, and demonstrate its use to de-noise IMS<br />

measurements both along the spatial and the spectral<br />

domain.<br />

METHODS<br />

We introduce a de-noising algorithm based on low-rank<br />

approximation, a concept from linear algebra. The method<br />

can separate sparse intensity variations from biological<br />

and tissue sample patterns, which hold up across multiple<br />

ions and pixels. This approach decomposes IMS data into<br />

two parts, namely a structured data matrix and a sparse<br />

data matrix. Since the noise tends to be sparse in nature, it<br />

will have a propensity to be collected into the sparse data<br />

part. The structured part tends to capture the de-noised<br />

IMS signals, effectively de-noising the ion images and the<br />

spectral profiles in the process. This de-noising method<br />

allows us to automatically filter sparse intensity variations<br />

from the underlying tissue signal without requiring any<br />

parameter tuning.<br />
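The decomposition into a structured part and a sparse part can be illustrated with a toy alternating scheme. This is not the authors' algorithm: unlike the method described above, the sketch fixes the rank of the structured part at 1 and relies on a hand-set threshold, both of which are assumptions made to keep the example short and dependency-free.<br />

```python
from typing import List, Tuple

Matrix = List[List[float]]


def rank_one_approximation(matrix: Matrix, power_iterations: int = 100) -> Matrix:
    """Best rank-1 approximation via power iteration on matrix^T * matrix.
    The all-ones start vector is assumed not to be orthogonal to the
    dominant right singular vector."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(power_iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:
            return [[0.0] * n_cols for _ in range(n_rows)]
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


def split_low_rank_and_sparse(
    data: Matrix, threshold: float, rounds: int = 20
) -> Tuple[Matrix, Matrix]:
    """Alternate between fitting a rank-1 structured part and moving large
    residuals (above the threshold) into the sparse part."""
    n_rows, n_cols = len(data), len(data[0])
    sparse = [[0.0] * n_cols for _ in range(n_rows)]
    structured = [row[:] for row in data]
    for _ in range(rounds):
        cleaned = [
            [data[i][j] - sparse[i][j] for j in range(n_cols)] for i in range(n_rows)
        ]
        structured = rank_one_approximation(cleaned)
        sparse = [
            [
                data[i][j] - structured[i][j]
                if abs(data[i][j] - structured[i][j]) > threshold
                else 0.0
                for j in range(n_cols)
            ]
            for i in range(n_rows)
        ]
    return structured, sparse


# Toy "pixels x ions" matrix with rank-1 structure and one intensity spike.
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
clean = [[a * b for b in profile] for a in profile]
noisy = [row[:] for row in clean]
noisy[0][0] += 20.0
structured, sparse = split_low_rank_and_sparse(noisy, threshold=5.0)
print(round(sparse[0][0], 3))
```

In this toy case the spike ends up in the sparse matrix while the structured matrix recovers the underlying pattern. Real IMS data would need a higher rank and a principled way of choosing the sparsity level.<br />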

RESULTS & DISCUSSION<br />

The filter method is demonstrated on two IMS<br />

experiments (one lipid-focused and one protein-focused)<br />

acquired from coronal sections of mouse brain. For the<br />

protein experiment, the tissue section was coated with<br />

sinapinic acid, and measurements were acquired using a<br />

Bruker AutoFlex MALDI-TOF/TOF in positive linear<br />

mode at a spatial resolution of 100 μm and with a mass<br />

range extending from m/z 3000 to 22000. For the lipid<br />

experiment, the tissue section was sublimated with<br />

1,5-diaminonaphthalene, and the measurements were acquired<br />

using a Bruker AutoFlex MALDI-TOF/TOF in negative<br />

reflectron mode at a spatial resolution of 80 μm and with a<br />

mass range extending from m/z 400 to 1000. The case<br />

studies demonstrate robust de-noising performance,<br />

retrieving the underlying tissue signal efficiently and<br />

consistently using the structured data matrix. On the<br />

spatial side, we observe a clean-up effect in the spatial<br />

distributions of both high- and low-intensity ions. The<br />

effect is especially impactful for low-intensity ions,<br />

showing a strong increase in the amount of spatial<br />

structure that can be retrieved from low SNR<br />

measurements and revealing patterns that would have<br />

gone unnoticed otherwise. On the spectral side, we<br />

observe an improved SNR after applying the method.<br />

Thus, at the cost of computational analysis, the de-noising<br />

method described here provides a means of increasing the<br />

amount of information that can be extracted from an IMS<br />

experiment, without requiring user interaction or<br />

additional measurement.<br />

FIGURE 1. Impact on both spatial and spectral domain. Top: example of<br />

de-noised ion image. Bottom: plot of a spectrum before (blue) and after<br />

(red) removal of sparse intensity variations.<br />


Abstract ID: O15<br />

Oral presentation<br />


O15. DETERMINANTS OF COMMUNITY STRUCTURE<br />

IN THE PLANKTON INTERACTOME<br />

Gipsi Lima-Mendez 1,2* , Karoline Faust 1,2,3 , Nicolas Henry 4 , Johan Decelle 4 , Sébastien Colin 4 , Fabrizio Carcillo 2,3,5 ,<br />

Simon Roux 6 , Gianluca Bontempi 5 , Matthew B. Sullivan 6 , Chris Bowler 7 , Eric Karsenti 7,8 , Colomban de Vargas 4 &<br />

Jeroen Raes 1,2 .<br />

Department of Microbiology and Immunology, Rega Institute KU Leuven 1 ; VIB Center for the Biology of Disease 2 ;<br />

Laboratory of Microbiology, Vrije Universiteit Brussel, Belgium 3 ; CNRS, UMR 7144, Station Biologique de Roscoff 4 ;<br />

Interuniversity Institute of Bioinformatics in Brussels (IB) 2 , Machine Learning Group, Université Libre de Bruxelles 5 ;<br />

Department of Ecology and Evolutionary Biology, University of Arizona, USA 6 ; Ecole Normale Supérieure, Institut de<br />

Biologie (IBENS), France 7 ; European Molecular Biology Laboratory 8 .*Gipsi.limamendez@vib-kuleuven.be<br />

Identifying the abiotic and biotic factors that shape species interactions is a fundamental yet unsolved goal in ecology.<br />

Here, we integrate organismal abundances and environmental measures from Tara Oceans to reconstruct the first global<br />

photic-zone co-occurrence network. Environmental factors are incomplete predictors of community structure. Putative<br />

biotic interactions are non-randomly distributed across phylogenetic groups, and show both local and global patterns.<br />

Known and novel interactions were identified among grazers, primary producers, viruses and symbionts. The high<br />

prevalence of parasitism suggests that parasites are important regulators in the ocean food web. Together, this effort<br />

provides a foundational resource for ocean food web research and integrating biological components into ocean models.<br />

INTRODUCTION<br />

Determining the relative importance of both biotic and<br />

abiotic processes represents a grand challenge in ecology.<br />

Here we analyze sequence data on plankton organisms and<br />

environmental data from the Tara Oceans project. We<br />

applied network inference methods to construct a global-ocean<br />

cross-kingdom species interaction network and<br />

disentangled the biotic and abiotic signals shaping this<br />

interactome (Lima-Mendez et al., 2015).<br />
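As a rough illustration of co-occurrence network inference, the sketch below computes Spearman rank correlations between taxon abundance profiles and keeps strongly correlated pairs as edges. It is a simplification under stated assumptions: edges are kept by a fixed correlation cut-off rather than by corrected p-values, only one similarity measure is used, and the taxon names and abundances are made up.<br />

```python
from itertools import combinations
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def co_occurrence_edges(
    abundances: Dict[str, List[float]], min_abs_rho: float
) -> List[Tuple[str, str, float]]:
    """Return taxon pairs whose |Spearman rho| reaches the cut-off."""
    edges = []
    for a, b in combinations(sorted(abundances), 2):
        rho = spearman(abundances[a], abundances[b])
        if abs(rho) >= min_abs_rho:
            edges.append((a, b, rho))
    return edges


# Made-up abundances of four taxa across five samples.
samples = {
    "taxonA": [1, 2, 3, 4, 5],
    "taxonB": [2, 4, 5, 9, 20],
    "taxonC": [5, 4, 3, 2, 1],
    "taxonD": [3, 1, 4, 1, 5],
}
for edge in co_occurrence_edges(samples, min_abs_rho=0.9):
    print(edge)
```

Positive correlations correspond to co-presence and negative ones to mutual exclusion; deciding which of these edges reflect true biotic interactions rather than shared environmental drivers is the harder problem the abstract addresses.<br />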

METHODS<br />

Methods are described in detail in Lima-Mendez et al.<br />

(2015). Briefly:<br />

• Network inference. Taxon-taxon networks were<br />

constructed as in Faust et al. (2012), selecting<br />

Spearman correlation and Kullback-Leibler dissimilarity.<br />

Edges with merged multiple-test-corrected<br />

p-values below 0.05 were kept. Taxon-environment<br />

networks were computed with the same<br />

procedure and merged with taxon-taxon networks<br />

for environmental triplet detection.<br />

• Indirect taxon edge detection. For each triplet<br />

consisting of two taxa and one environmental<br />

parameter, we computed the interaction<br />

information (II) and taxon edges were considered<br />

indirect when II



Abstract ID: O16<br />

Oral presentation<br />


O16. BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON<br />

SEQUENCING DATA FOR BIODIVERSITY ANALYSIS<br />

Mohamed Mysara 1-3 , Yvan Saeys 4,5 , Natalie Leys 1 , Jeroen Raes 2,6 & Pieter Monsieurs 1* .<br />

Unit of Microbiology, Belgian Nuclear Research Centre SCK•CEN, Mol, Belgium 1 ; Department of Bioscience<br />

Engineering, Vrije Universiteit Brussel VUB, Brussels, Belgium 2 ; Department of Structural Biology, Vlaams Instituut<br />

voor Biotechnologie VIB, Brussels, Belgium 3 ; Data Mining and Modeling Group, VIB Inflammation Research Center,<br />

Ghent, Belgium 4 ; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium 5 ; Department of<br />

Microbiology and Immunology, REGA institute, KU Leuven, Belgium 6 . * pmonsieu@sckcen.be<br />

High-throughput sequencing technologies have created a wide range of new applications, also in the field of microbial<br />

ecology. Yet when used in 16S rRNA biodiversity studies, they suffer from two important problems: the presence of PCR<br />

artefacts (called chimeras) and sequencing errors introduced by the sequencing technologies. In this work,<br />

three artificial intelligence-based algorithms are proposed, CATCh, NoDe and IPED, to handle these two problems. A<br />

benchmarking study was performed comparing CATCh/NoDe (for 454 pyrosequencing) or CATCh/IPED (for Illumina<br />

MiSeq sequencing) with other state-of-the-art tools, showing a clear improvement in chimera detection and reduction of<br />

sequencing errors, respectively, and in general leading to more accurate clustering of the sequencing reads in Operational<br />

Taxonomic Units (OTUs). All algorithms are available via<br />

http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/.<br />

INTRODUCTION<br />

The revolution in new sequencing technologies has led to<br />

an explosion of possible applications, including new<br />

opportunities for microbial ecological studies via the<br />

usage of 16S rDNA amplicon sequencing. However,<br />

within such studies, all sequencing technologies suffer<br />

from the presence of erroneous sequences, i.e. (i) chimeras,<br />

introduced by wrong target amplification in PCR, and (ii)<br />

sequencing errors originating from different factors during<br />

the sequencing process. As such, there is a need for<br />

effective algorithms to remove those erroneous sequences<br />

to be able to accurately assess the microbial diversity.<br />

METHODS<br />

First, a new algorithm called CATCh (Combining<br />

Algorithms to Track Chimeras) was developed by<br />

integrating the output of existing chimera detection tools<br />

into a new more powerful method. Second, NoDe (Noise<br />

Detector) was introduced, an algorithm that identifies and<br />

corrects erroneous positions in 454-pyrosequencing reads.<br />

Third, the IPED (Illumina Paired End Denoiser) algorithm<br />

was developed to handle error correction in Illumina<br />

MiSeq sequencing data as the first tool in the field. After<br />

identifying those positions likely to contain an error, those<br />

sequencing reads are subsequently clustered with correct<br />

reads resulting in error-free consensus reads. The three<br />

algorithms were benchmarked with state-of-the-art tools.<br />
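The final step, in which error-containing reads are merged with correct reads into a consensus, can be pictured with a simple per-position majority vote. This only illustrates the consensus idea: how NoDe and IPED actually flag suspicious positions and cluster reads is not described in this abstract, and the sketch assumes reads in a cluster are already aligned to equal length.<br />

```python
from collections import Counter
from typing import List


def consensus_read(cluster: List[str]) -> str:
    """Majority-vote consensus over equally long, pre-aligned reads.
    Ties are resolved in favour of the base seen first in the column."""
    if not cluster:
        raise ValueError("cluster must contain at least one read")
    length = len(cluster[0])
    if any(len(read) != length for read in cluster):
        raise ValueError("reads must be aligned to equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*cluster)
    )


# Toy cluster: the third read carries a single substitution error.
print(consensus_read(["ACGT", "ACGT", "ACTT"]))
```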

RESULTS & DISCUSSION<br />

Via a comparative study with other chimera detection<br />

tools, CATCh was shown to outperform all other tools,<br />

thereby increasing the sensitivity by up to 14% (see<br />

Figure 1).<br />

FIGURE 1. Plot indicating the effect of applying 5% indels (shown on the<br />

left) and 5% mismatches (shown on the right), on the performance of<br />

different chimera detection tools. CATCh was found to outperform other<br />

existing tools.<br />

Similarly, NoDe and IPED were benchmarked against<br />

other denoising algorithms, showing a significant<br />

reduction of the error rate of up to 55% and<br />

75%, respectively (see Figure 2). The combined effect of<br />

our algorithms for chimera removal and error correction<br />

also had a positive effect on the clustering of reads in<br />

operational taxonomic units (OTUs), with an almost<br />

perfect correlation between the number of OTUs and the<br />

number of species present in the mock communities.<br />

Indeed, when applying our improved pipeline containing<br />

CATCh and NoDe on a 454 pyrosequencing mock dataset,<br />

our pipeline could reduce the number of OTUs to 28 (i.e.<br />

close to 18, the correct number of species). In contrast,<br />

running the straightforward pipeline without our<br />

algorithms included would inflate the number of OTUs to<br />

98. Similarly, when tested on Illumina MiSeq sequencing<br />

data obtained for a mock community, using a pipeline<br />

integrating CATCh and IPED, the number of OTUs<br />

returned was 33 (i.e. close to the real number of 21<br />

species), while 86 OTUs were obtained using the default<br />

mothur pipeline.<br />

REFERENCES<br />

Mysara M., Leys N., Raes J., Monsieurs P.- NoDe: a fast error-correction<br />

algorithm for pyrosequencing amplicon reads.- In: BMC<br />

Bioinformatics, 16:88(<strong>2015</strong>), p. 1-15.- ISSN 1471-2105<br />

Mysara M., Saeys Y., Leys N., Raes J., Monsieurs P.- CATCh, an<br />

Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing<br />

Studies.- In: Applied and Environmental Microbiology, 81:5(2015),<br />

p. 1573-1584.- ISSN 0099-2240<br />



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: O17<br />

Oral presentation<br />


O17. GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND<br />

CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-<br />

BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS<br />

Sjoerd M.H. Huisman 1,2* , Else Eising 3 , Ahmed Mahfouz 1,2 , Lisanne Vijfhuizen 3 , International Headache Genetics<br />

Consortium, Boudewijn P.F. Lelieveldt 2 , Arn M.J.M. van den Maagdenberg 3,4 & Marcel J.T. Reinders 1 .<br />

DBL, Dept. of Intelligent Systems, Delft University of Technology, The Netherlands 1 ; LKEB, Dept. of Radiology, Leiden<br />

University Medical Center, The Netherlands 2 ; Dept. of Human Genetics, Leiden University Medical Center, The<br />

Netherlands 3 ; Dept. of Neurology, Leiden University Medical Center, The Netherlands 4 . * s.m.h.huisman@tudelft.nl<br />

Migraine is a common brain disorder, with a heritability of around 50%. To understand the genetic component of this<br />

disease, a large genome wide association study has been carried out. Several loci were identified, but their interpretation<br />

remained challenging. We integrated the GWAS results with gene expression data, from healthy human brains, to<br />

identify anatomical regions and biological pathways implicated in migraine pathophysiology.<br />

INTRODUCTION<br />

Genome Wide Association Studies (GWAS) are<br />

frequently used to find common variants with small effect<br />

sizes. However, they often provide researchers with short<br />

lists of single nucleotide polymorphisms (SNPs) with<br />

uncertain connections to biological functions.<br />

We present an analysis of GWAS data for migraine, where<br />

the full list of SNP statistics is used to find groups of<br />

functionally related migraine-associated genes. To this<br />

end we make use of gene co-expression in the healthy<br />

human brain.<br />

We performed genome wide clustering of genes, followed<br />

by enrichment analysis for migraine candidate genes. In<br />

addition, we constructed local co-expression networks<br />

around high-confidence genes. Both approaches converge<br />

on distinct biological functions and brain regions of<br />

interest.<br />

METHODS<br />

Migraine GWAS data was obtained from the International<br />

Headache Genetics Consortium, with 23,285 cases and<br />

95,425 controls (Anttila et al., 2013). Genes were scored<br />

by SNP load and divided into high-confidence genes,<br />

migraine candidate genes, and non-migraine genes.<br />

Spatial gene expression data in the healthy adult human<br />

brain was obtained from the Allen Brain Institute<br />

(Hawrylycz et al., 2012). It contains microarray<br />

expression values of 3702 samples from 6 donors. Robust<br />

gene co-expressions were used to cluster genes into 18<br />

modules, which were then tested for enrichment of<br />

migraine candidate genes, and functionally characterized.<br />

In a second approach, local co-expression networks were<br />

built around the high-confidence migraine genes. These<br />

local networks were then compared to the modules of the<br />

first approach.<br />
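The module enrichment step described above can be illustrated with a minimal sketch. The abstract does not name the statistical test, so a one-sided hypergeometric test is assumed here; the function names and the toy gene identifiers are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(module_size, hits_in_module, n_candidates, n_genes):
    """P(X >= hits_in_module) when module_size genes are drawn without
    replacement from n_genes, of which n_candidates are candidate genes.
    Assumption: one-sided hypergeometric test (not stated in the abstract)."""
    max_hits = min(module_size, n_candidates)
    total = comb(n_genes, module_size)
    tail = sum(
        comb(n_candidates, k) * comb(n_genes - n_candidates, module_size - k)
        for k in range(hits_in_module, max_hits + 1)
    )
    return tail / total


def module_enrichment(modules, candidates, all_genes):
    """Return {module name: enrichment p-value} for a dict of gene modules."""
    all_genes = set(all_genes)
    candidates = set(candidates) & all_genes
    results = {}
    for name, genes in modules.items():
        genes = set(genes) & all_genes
        hits = len(genes & candidates)
        results[name] = hypergeom_enrichment_p(
            len(genes), hits, len(candidates), len(all_genes)
        )
    return results
```

With 18 modules, the resulting p-values would still need a multiple-testing correction before any module is called enriched.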

RESULTS & DISCUSSION<br />

The genome wide analysis revealed several modules of<br />

genes enriched in migraine candidates. Two modules have<br />

preferential expression in the cerebral cortex and are<br />

enriched in synapse-related annotations and neuron-specific<br />

genes. A third module contains oligodendrocyte-specific genes<br />

and genes preferentially expressed in subcortical regions.<br />

The local co-expression networks, of the second approach,<br />

converge on the same pathways and expression patterns,<br />

even though the high confidence genes lie mostly outside<br />

of the modules of interest. This provides a control to the<br />

results of the first approach.<br />

FIGURE 1. The co-expression network around high confidence migraine<br />

genes of the second approach. Genes (and links between them) of the<br />

migraine modules of the first approach are coloured in red, yellow, blue,<br />

and green.<br />

The analyses confirm the previously observed link<br />

between migraine and cortical neurotransmission. They<br />

also point to the involvement of subcortical myelination,<br />

which is in line with recent tentative findings. These<br />

results show that more relevant information can be<br />

extracted from GWAS results, using (publicly available)<br />

tissue specific expression patterns.<br />

REFERENCES<br />

Anttila V. et al. Genome-wide meta-analysis identifies new susceptibility<br />

loci for migraine. Nat. Genet. 45, 912–7, (2013).<br />

Hawrylycz M.J. et al. An anatomically comprehensive atlas of the adult<br />

human brain transcriptome. Nature 489, 391–9, (2012).<br />



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: O18<br />

Oral presentation<br />


O18. SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN<br />

THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION<br />

MECHANISMS<br />

Ahmed Mahfouz 1,2* , Boudewijn P.F. Lelieveldt 1,2 , Aldo Grefhorst 3 , Isabel M. Mol 4 , Hetty C.M. Sips 4 , José K. van den<br />

Heuvel 4 , Jenny A. Visser 3 , Marcel J.T. Reinders 2 , & Onno C. Meijer 4 .<br />

Department of Radiology, Leiden University Medical Center 1 ; Delft Bioinformatics Lab, Delft University of<br />

Technology 2 ; Department of Internal Medicine, Erasmus University Medical Center 3 ; Department of Internal Medicine,<br />

Leiden University Medical Center 4 . * a.mahfouz@lumc.nl<br />

Steroid hormones coordinate the activity of many brain regions by binding to nuclear receptors that act as transcription<br />

factors. This study uses genome wide correlation of gene expression in the mouse brain to discover 1) brain regions that<br />

respond in a similar manner to particular steroids, 2) signaling pathways that are used in a steroid receptor and brain<br />

region-specific manner, and 3) potential target genes and relationships between groups of target genes. The data<br />

constitute a rich repository for the research community to support new insights in neuroendocrine relationships, and to<br />

develop novel ways to manipulate brain activity in research or clinical settings.<br />

INTRODUCTION<br />

Steroid receptors are pleiotropic transcription factors that<br />

coordinate adaptation to different physiological states. An<br />

important target organ is the brain, but its complexity<br />

hampers the understanding of their modulation.<br />

METHODS<br />

We used the Allen Brain Atlas (ABA) (Lein et al., 2007),<br />

the most comprehensive repository of in situ<br />

hybridization-based gene expression in the adult mouse<br />

brain, to identify genes that have three dimensional (3D)<br />

spatial gene expression profiles similar to steroid receptors.<br />

To validate the functional relevance of this approach, we<br />

analyzed the co-expression relationship of the<br />

glucocorticoid receptor (Gr) and estrogen receptor alpha<br />

(Esr1) and their known transcriptional targets in their<br />

brain regions of action. Next, we studied the region-specific<br />

co-expression of nuclear receptors and their co-regulators<br />

to identify potential partners mediating the<br />

hormonal effects on dopaminergic transmission. Finally,<br />

to illustrate the potential of using spatial co-expression to<br />

predict region-specific steroid receptor targets in the brain,<br />

we identified and validated genes that responded to<br />

changes in estrogen in the arcuate nucleus and medial<br />

preoptic area of the mouse hypothalamus.<br />

RESULTS & DISCUSSION<br />

For each steroid receptor, we ranked genes based on their<br />

spatial co-expression across the whole brain as well as in<br />

each of the aforementioned 12 brain structures separately.<br />

For each steroid receptor, strongly co-expressed genes<br />

within a brain region are likely related to the localized<br />

functional role of the receptor. For example, out of the top<br />

10 genes co-expressed with Esr1 across the whole brain, 4<br />

were previously shown to be regulated by Esr1 and/or<br />

estrogens in various tissues (Gpr101, Calcr, Ngb, and<br />

Gpx3).<br />

We assessed the extent of co-expression of glucocorticoid<br />

(GC)-responsive genes (Datson et al., 2012) with Gr in the<br />

whole brain, the hippocampus and its substructures the<br />

dentate gyrus (DG) and the different subregions of the<br />

cornu ammonis (CA). GC-responsive genes were<br />

significantly co-expressed with Gr in the DG, but<br />

interestingly also in the whole brain and in the CA3 region<br />

(FDR-corrected p < 1.8×10⁻³; Mann-Whitney U-test).<br />

Similarly, a Mann-Whitney U-test showed that a set of 15<br />

genes that are sensitive to gonadal steroids (Xu et al.,<br />

2012) is significantly correlated with Esr1 across the whole<br />

brain (FDR-corrected p = 8.69×10⁻¹⁴), as well as in the<br />

hypothalamus (p = 3.85×10⁻¹⁰), the brain region<br />

responsible for sexual behavior in animals.<br />
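The test used in the two paragraphs above can be sketched as follows: correlate every gene's spatial profile with that of the receptor, then ask whether the correlations of a known target set rank higher than those of all other genes. This is a minimal sketch under stated assumptions: Pearson correlation as the co-expression measure, a one-sided test, a normal approximation without tie correction of the variance, and no FDR step. All function names are hypothetical.

```python
from math import erfc, sqrt


def pearson(a, b):
    """Pearson correlation of two equally long expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test that values in x tend to exceed y.
    Returns (U, p); ties get midranks, p uses the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, 0.5 * erfc(z / sqrt(2))


def target_set_coexpression(receptor_profile, gene_profiles, targets):
    """Test whether target genes are more co-expressed with the receptor
    than the remaining genes in gene_profiles ({gene: profile})."""
    corr = {g: pearson(receptor_profile, p) for g, p in gene_profiles.items()}
    in_set = [c for g, c in corr.items() if g in targets]
    background = [c for g, c in corr.items() if g not in targets]
    return mann_whitney_greater(in_set, background)
```

Restricting the profiles to the voxels of one structure (for example the hypothalamus) before calling `target_set_coexpression` would give the region-specific variant of the same test.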

In order to identify putative region-dependent co-regulators<br />

of steroid receptors, we analyzed the co-expression<br />

relationships between each steroid receptor and a<br />

set of 62 nuclear receptor co-regulators as present on a<br />

peptide array (Nwachukwu et al., 2014). We focused our<br />

analysis on well-established target regions of steroid<br />

hormone action, dopaminergic brain regions (ventral<br />

tegmental area; VTA & substantia nigra; SN). We found<br />

three significantly co-expressed co-regulators with<br />

androgen receptor (Ar): Pnrc2, Pak6 and Trerf1,<br />

suggesting that these co-regulators may be involved in<br />

mediating Ar effects on dopaminergic transmission.<br />

In order to validate the predictive value of highly correlated<br />

expression with a steroid receptor, we analyzed the<br />

response of the top 10 genes that are strongly co-expressed<br />

with Esr1 in the hypothalamus to the estrogen<br />

diethylstilbestrol (DES) in castrated male mice using<br />

qPCR. We performed quantitative double in situ<br />

hybridization (dISH) for Esr1 and the six mRNAs (Irs4,<br />

Magel2, Adck4, Unc5, Ngb, and Gdpd2) that showed more<br />

than 1.3-fold enrichment in qPCR. We found that Irs4 and<br />

Magel2 mRNA were both significantly upregulated by<br />

DES treatment (1.9 and 2.4-fold, respectively).<br />

REFERENCES<br />

Lein E. et al. Nature 445, 168–76 (2007).<br />

Datson N. et al. Hippocampus 22, 359–71 (2012).<br />

Xu X. et al., Cell 3, 596–607 (2012).<br />

Nwachukwu J. et al. eLife 3, e02057 (2014).<br />



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: O19<br />

Oral presentation<br />


O19. A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI<br />

Bart Cuypers 1,2,3* , Pieter Meysman 1,2 , Manu Vanaerschot 3 , Maya Berg 3 , Malgorzata Domagalska 3 , Jean-Claude<br />

Dujardin 3,4# & Kris Laukens 1,2# .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />

Antwerpen (biomina) 2 ; Molecular Parasitology Unit, Department of Biomedical Sciences, Institute of Tropical Medicine,<br />

Antwerp 3 ; Department of Biomedical Sciences, University of Antwerp 4 . * bart.cuypers@uantwerpen.be # shared senior<br />

authors<br />

Leishmania donovani is the cause of visceral leishmaniasis in the Indian subcontinent and poses a threat to public health<br />

due to increasing drug resistance. Little is known about its peculiar molecular biology and there has been little<br />

‘omics integration effort so far. Here we present an integrative database, or ‘omics compendium, that contains all<br />

genomics, transcriptomics, proteomics and metabolomics experiments that are currently publicly available for<br />

Leishmania donovani. Additionally, the user interface contains analysis tools for new datasets that use smart data mining<br />

strategies like frequent itemset mining to link results from different ‘omics layers.<br />

INTRODUCTION<br />

The protozoan parasite Leishmania donovani causes<br />

visceral leishmaniasis (VL), a life threatening disease<br />

which affects 500 000 people each year. With only four<br />

drugs available and rapidly emerging drug resistance,<br />

knowledge about the parasite’s resistance mechanisms is<br />

essential to boost the development of new drugs. However,<br />

little is known about gene regulation in<br />

Leishmania, and the few findings indicate major<br />

differences from known gene expression systems. Indeed, no<br />

polymerase II promoters have ever been found in<br />

Leishmania 1 . Genes are constitutively transcribed in large<br />

polycistronic units and subsequently spliced into<br />

individual mRNAs (trans-splicing) 1 . A modified thymine,<br />

Base J, marks the end of transcription units and functions<br />

as a stop signal for the RNA polymerase 2 . Gene<br />

expression is then assumed to be regulated at the post-transcriptional<br />

level (mRNA stability, translation<br />

efficiency, epigenetic factors, etc.) but evidence to<br />

support this is scarce 1 . Integration of different ‘omics<br />

could shed light on these gene regulatory mechanisms, but<br />

there has been little integration effort so far.<br />

METHODS<br />

We developed an easy-to-use tool, able to import and<br />

connect all existing L. donovani ‘omics experiments.<br />

Genomics, epigenomics, transcriptomics, proteomics,<br />

metabolomics and phenotypic data was collected and<br />

added to a MySQL database compendium, further<br />

complemented with publicly available data. Relations<br />

between different ‘omics layers were explicitly defined<br />

and provided with a level of confidence. Python scripts<br />

were developed to preprocess, analyse and import the data.<br />

To allow comparability between different experiments,<br />

platforms and labs the three integration principles of the<br />

COLOMBOS bacterial expression compendium were<br />

adapted 3 . 1) Use the same data-analysis pipeline for all<br />

data. 2) Work with contrasts to a control condition instead<br />

of expression values. 3) Annotate these contrasts in a<br />

unified and structured manner.<br />
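The second and third principles can be illustrated with a short sketch. It assumes that a contrast is a per-gene log2 ratio between a test condition and its control and that annotations are stored as a plain dictionary; the pseudocount, the field names and the gene identifiers are hypothetical and do not describe the compendium's actual schema.

```python
from math import log2


def contrast(condition, control, pseudocount=1.0):
    """Per-gene log2 ratio of a test condition against its control.
    Only genes measured in both samples are kept; the pseudocount
    (an assumption) avoids division by zero for unexpressed genes."""
    shared = condition.keys() & control.keys()
    return {
        gene: log2((condition[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in sorted(shared)
    }


def annotated_contrast(condition, control, annotation):
    """Bundle the log-ratios with a structured description of the contrast."""
    return {"annotation": dict(annotation), "log2_ratio": contrast(condition, control)}


# Hypothetical example: a drug-exposed sample against its untreated control.
example = annotated_contrast(
    {"geneA": 7.0, "geneB": 1.0},
    {"geneA": 1.0, "geneB": 3.0},
    {"layer": "transcriptomics", "test": "drug-exposed", "control": "untreated"},
)
```

Storing ratios rather than raw values is what makes experiments from different platforms comparable, since platform-specific scaling largely cancels within each contrast.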

Next to this vast data source, a set of integrative data-analysis<br />

tools was developed based on data mining<br />

strategies. For example, one tool uses frequent itemset<br />

mining algorithms to detect which proteins and<br />

metabolites frequently exhibit the same behaviour under<br />

different conditions. Another tool converts several ‘omics<br />

layers to a network format that can be opened in<br />

Cytoscape and can thus be the basis for network analysis.<br />
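To make the frequent itemset idea concrete, the sketch below treats each experimental condition as a transaction and each observed behaviour (for example a protein going up) as an item. It is a small level-wise, Apriori-style search written for illustration only; the item labels are hypothetical, and the compendium's own algorithm and thresholds are not described in the abstract.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at least
    min_support transactions (level-wise Apriori-style search)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent sets of the previous size, then prune candidates
        # that contain an infrequent subset.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical conditions: which proteins (P) and metabolites (M) changed.
conditions = [
    {"P1:up", "M1:up", "M2:down"},
    {"P1:up", "M1:up"},
    {"P1:up", "M1:up", "M2:down"},
    {"P2:up", "M2:down"},
]
co_regulated = frequent_itemsets(conditions, min_support=3)
```

In this toy input, `P1:up` and `M1:up` occur together in three conditions, which is the kind of cross-layer link such a tool is meant to surface.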

The Django and Twitter Bootstrap frameworks were used<br />

to create a web portal to make the tools accessible to any<br />

Leishmania researcher.<br />

RESULTS & DISCUSSION<br />

Excellent public gene, protein and metabolite annotation<br />

databases for Leishmania and related species are already<br />

available (e.g. TriTrypDB and GeneDB). However, the<br />

strength of our tool is that it links these annotation data to<br />

‘omics experiments that are either provided by the user, or<br />

that are publicly available. New experiments can quickly<br />

be preprocessed, analysed and integrated in the database<br />

via its Python back end. The compendium is therefore not<br />

only a look-up tool (e.g. under which conditions is this<br />

gene or metabolite upregulated?), but has tools available<br />

to also analyse the user-provided data with intelligent data<br />

mining tools (e.g. which metabolites/genes are typically<br />

upregulated in drug-resistant strains?). These new<br />

experiments provide additional confidence and<br />

information about the biological entities in the database.<br />

Unlike many other databases, the compendium has an<br />

elaborate quality control system. Every result provided by<br />

the tools can be traced back to the experimental data,<br />

which contains the necessary quality control plots to<br />

support the experiment’s validity. Additionally, it contains<br />

all relevant information about the extractions and the<br />

origin of the biological material.<br />

Using the compendium and its tools, we characterized the<br />

development and drug resistance of Leishmania donovani<br />

in a systems biology context. The genomes of more<br />

than 200 strains were examined for associations with<br />

phenotypical features and a subset was linked to<br />

transcriptomics, proteomics and metabolomics results. The<br />

compendium and its scripts were designed to be generic<br />

and can therefore be used for other organisms with only<br />

minor changes.<br />

REFERENCES<br />

1. Donelson, J. (1999) PNAS. 96, 2579–258.<br />

2. Van Luenen, H. G. a M. et al. (2012) Cell. 150, 909–21.<br />

3. Meysman. et al. (2014) Nucleic acids research. 42, D649-<br />

D653.<br />



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: O20<br />

Oral presentation<br />


O20. MULTI-OMICS INTEGRATION: RIBOSOME PROFILING<br />

APPLICATIONS<br />

Volodimir Olexiouk 1 , Elvis Ndah 1 , Sandra Steyaert 1 , Steven Verbruggen 1 , Eline De Schutter 1 , Alexander Koch 1 , Daria<br />

Gawron 2 , Wim Van Criekinge 1 , Petra Van Damme 2 , Gerben Menschaert 1,* .<br />

Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and<br />

Bioinformatics, Faculty of Bioscience Engineering, Ghent University 1 ; Dept. Medical Protein Research, VIB-Ghent<br />

University 2 . * Gerben.menschaert@ugent.be<br />

Ribosome profiling is a relatively new NGS technology that enables the monitoring of the in vivo synthesis of mRNA-encoded<br />

translation products measured at the genome-wide level. The technique, also sometimes referred to as RIBOseq,<br />

uses the property of translating ribosomes to protect mRNA fragments from nuclease digestion and allows the determination of the<br />

genomic positions of translating ribosomes with sub-codon to single-nucleotide precision. Since the advent of the<br />

technology, several bioinformatics solutions have been devised to investigate this type of data. Here we will present<br />

several solutions to detect novel proteoforms by combining RIBOseq and mass spectrometry data, to detect putatively<br />

coding small open reading frames (sORFs), and to evaluate the impact of DNA and RNA methylation on the translation<br />

level.<br />

INTRODUCTION<br />

Integration of different OMICS technologies is routinely<br />

used to investigate biological systems. Our lab focuses<br />

on high-throughput data analysis and the development of<br />

novel data integration methodologies. Currently, our focus<br />

is on ribosome profiling (Ingolia et al., 2011), an NGS-based<br />

technique to measure the so-called translatome (i.e.<br />

the mRNA that shows ribosome occupancy). This<br />

technique is applied in combination with other sequencing<br />

based protocols to measure expression (RNAseq),<br />

translation (mass spectrometry) and to chart maps of<br />

regulatory elements such as DNA methylation (reduced<br />

representation bisulfite sequencing, RRBS) and RNA<br />

methylation (m6Aseq) to address several biological<br />

questions.<br />

METHODS<br />

For the integration of RIBOseq and mass spectrometry<br />

(MS), we devised a tool called PROTEOFORMER<br />

(www.biobix.be/proteoformer). This proteogenomics tool<br />

consists of several steps. It starts with the mapping of<br />

ribosome-protected fragments (RPFs) and quality control<br />

of subsequent alignments. It further includes modules for<br />

identification of transcripts undergoing protein synthesis,<br />

positions of translation initiation with sub-codon<br />

specificity and single nucleotide polymorphisms (SNPs).<br />

We used PROTEOFORMER to create protein sequence<br />

search databases from publicly available mouse and in-house<br />

performed human RIBOseq experiments and<br />

evaluated these with matching proteomics data (Crappé et<br />

al., 2015).<br />

Another pipeline based on RIBOseq data is built around<br />

the discovery of putatively coding small open reading<br />

frames (sORFs). Herein, the first step is to delineate<br />

sORFs based on RPF coverage throughout the coding<br />

sequence and at the translation initiation site. Afterwards,<br />

state-of-the-art tools and metrics assessing the coding<br />

potential of sORFs are implemented and a list of candidate<br />

sORFs for downstream analysis is compiled (e.g. MS-based<br />

identification).<br />
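The first step of the sORF pipeline, delineation followed by a coverage check, can be sketched as below. This is an illustration under stated assumptions rather than the pipeline itself: only ATG start codons are considered, the 100-codon upper limit is an assumed cut-off, coverage is given as a per-nucleotide list of RPF depths on the transcript, and the separate check at the translation initiation site is left out. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, max_codons=100, min_codons=2):
    """Return (start, end) pairs, 0-based and end-exclusive, of ATG-initiated
    ORFs whose length without the stop codon lies in [min_codons, max_codons].
    The end coordinate includes the stop codon. Nested ORFs that share a stop
    codon are reported separately."""
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((start, pos + 3))
                break  # first in-frame stop ends this ORF
    return orfs


def rpf_coverage(orf, depth):
    """Fraction of ORF positions covered by at least one RPF."""
    start, end = orf
    covered = sum(1 for d in depth[start:end] if d > 0)
    return covered / (end - start)
```

Candidates passing a coverage threshold would then go on to the coding-potential metrics mentioned above.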

To assess the impact of DNA-methylation at the<br />

translation level a double knockout DNMT model was<br />

studied (WT and DNMT1 + 3B knockout HCT116 cell<br />

line). Genome-wide DNA methylation profiling was<br />

performed using RRBS, while ribosome profiling,<br />

quantitative shotgun and positional proteomics (N-terminal<br />

COFRADIC) were used to obtain protein<br />

expression data.<br />

An initial experiment to integrate m6Aseq (measuring the<br />

m6A epitranscriptome) and ribosome profiling has also<br />

been executed on HCT116 cells.<br />

RESULTS & DISCUSSION<br />

The RIBOseq-MS integration (through<br />

PROTEOFORMER) increases the overall protein<br />

identification rates by 3% and 11% (improved and new<br />

identifications) for human and mouse respectively and<br />

enables proteome-wide detection of 5’-extended<br />

proteoforms, upstream ORF (uORF) translation and near-cognate<br />

translation start sites. The PROTEOFORMER<br />

tool is available as a stand-alone pipeline and has been<br />

implemented in the Galaxy framework for ease of use.<br />

The sORF pipeline was tested and curated on three<br />

different cell-lines (HCT116: human, E14 mESC: mouse,<br />

and S2: fruitfly). The public repository has been made<br />

available at www.sorfs.org (Olexiouk V. et al., in review),<br />

and so far includes the datasets mentioned above.<br />

In the study of the effect of DNA methylation at the<br />

proteome level in the DNMT double knockout, we found<br />

that the knockout cells show more significantly up-regulated<br />

than down-regulated genes and that these up-regulated<br />

genes were characterized by higher levels of<br />

promoter methylation in the wild type cells. Both the MS<br />

and RIBOseq analyses corroborated these findings.<br />

Preliminary results based on the m6A sequencing confirm<br />

previous findings on known m6A sequence motifs and<br />

enrichment of m6A sites in specific functional regions<br />

(around translation start sites and in 3’UTR regions) and<br />

moreover some examples hint at an effect of m6A on<br />

ribosomal pausing, after integrating m6A- and RIBOseq<br />

data.<br />

REFERENCES<br />

Ingolia N. et al. Cell 11;147(4):789-802 (2011).<br />

Crappé, J., Ndah, E. et al. NAR 11;43(5):e29 (2015).<br />



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: O21<br />

Oral presentation<br />


O21. CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS<br />

AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION<br />

APPROACH TO SCORING DOCKING DECOYS<br />

Qingzhen Hou 1* , Kamil K. Belau 2 , Marc F. Lensink 3 , Jaap Heringa 1 & K. Anton Feenstra 1* .<br />

Center for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, De Boelelaan 1081A, 1081 HV<br />

Amsterdam, The Netherlands 1 ; Intercollegiate Faculty of Biotechnology, University of Gdańsk - Medical University of<br />

Gdańsk, Kładki 24, 80-822 Gdańsk, Poland 2 ; Institute for Structural and Functional Glycobiology (UGSF), CNRS<br />

UMR8576, FRABio FR3688, University Lille, 59000, Lille, France 3 .<br />

Protein-protein Interactions (PPIs) play a central role in all cellular processes. Large-scale identification of native binding<br />

orientations is essential to understand the role of particular protein-protein interactions in their biological context. We<br />

estimate the binding free energy using coarse-grained simulations with the MARTINI forcefield, and use those to rank<br />

decoys for 15 CAPRI benchmark targets. In our top 100 and top 10 ranked structures, for the 'easier' targets that have<br />

many near-native conformations, we obtain a strong enrichment of acceptable or better quality structures; for the 'hard'<br />

targets with very few near-native complexes in the decoys, our method is still able to retain structures which have native<br />

interface contacts. Moreover, CLUB-MARTINI is rather precise for some targets and able to pinpoint near-native<br />

binding modes in top 1, 5, 10 and 20 selections.<br />

INTRODUCTION<br />

Measuring binding free energy is essential to understand the<br />

relevance of particular protein-protein interactions in their<br />

biological context. Moreover, at the atomic scale, molecular<br />

simulations give us insight into the physically realistic details<br />

of these interactions. In our recent study, we successfully<br />

applied coarse-grained molecular dynamics simulations to<br />

estimate binding free energy with accuracy similar to, and<br />

500-fold less compute time than, full atomistic simulation<br />

(May et al., 2014). The approach relied on the availability of<br />

crystal structures of the protein complex of interest. Here, we<br />

investigate the effectiveness of this approach as a scoring<br />

method to identify stable binding conformations among<br />

the decoys produced by protein docking.<br />

We apply our method as an evaluation method to rank more<br />

than 19 000 docked protein conformations, or ‘decoys’, for<br />

15 benchmark targets from the Critical Assessment of<br />

PRedicted Interactions (CAPRI) (Lensink & Wodak, 2014).<br />

METHODS<br />

For each target, the binding free energy of all decoys was<br />

calculated, using the MARTINI forcefield as introduced<br />

before (May et al., 2014). In short, for a set of closely spaced<br />

separation distances, we calculate the constraint force applied<br />

to maintain the set distance. Integrating this force yields a<br />

potential of mean force (PMF), from which the binding free<br />

energy is extracted as the highest minus the lowest value.<br />

Previously, for accuracy, we used up to 20 replicate<br />

simulations for each distance in the PMF, but for efficiency,<br />

here we use only a single replicate initially. We then selected<br />

the lowest-scoring half to run an additional four replicates to<br />

obtain better sampling and more accurate estimates of the<br />

binding free energy. In total, we used approximately 800 000<br />

core-hours of compute time.<br />
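The scoring quantity described above can be written down in a few lines. The sketch assumes that the mean constraint force has already been averaged over time (and over replicates, where available) at each separation distance, and it uses trapezoidal integration, which the abstract does not specify. Function names are hypothetical and units are whatever the simulation output uses.

```python
def potential_of_mean_force(distances, mean_forces):
    """Integrate the mean constraint force over the separation distance
    with the trapezoidal rule; the PMF is set to zero at the first distance.
    The overall sign convention does not affect the highest-minus-lowest
    difference used for scoring."""
    pmf = [0.0]
    for i in range(1, len(distances)):
        width = distances[i] - distances[i - 1]
        avg_force = 0.5 * (mean_forces[i] + mean_forces[i - 1])
        pmf.append(pmf[-1] + width * avg_force)
    return pmf


def binding_free_energy(pmf):
    """Binding free energy estimate: highest minus lowest PMF value."""
    return max(pmf) - min(pmf)


def rank_decoys(decoy_profiles):
    """Sort decoy names by decreasing binding free energy estimate.
    decoy_profiles maps a decoy name to (distances, mean_forces)."""
    scores = {
        name: binding_free_energy(potential_of_mean_force(d, f))
        for name, (d, f) in decoy_profiles.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Because each decoy needs its own set of constrained simulations, the cost grows linearly with the number of decoys, which is why a single replicate is used for the initial pass.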

RESULTS & DISCUSSION<br />

We obtained strong enrichment of acceptable and high<br />

quality structures in the TOP 100 based on our PMF free<br />

energies, as shown in Figure 1. We estimate the error of our<br />

energies to be significant. This can be improved by increasing<br />

sampling, but remains very expensive.<br />

Moreover, for several targets, we can select near-native<br />

structures in top 1, top 5 and top 10 as shown in Table 1,<br />

which means that, overall, our method is rather precise. From<br />

estimates of the error, we expect we can improve accuracy by<br />

extending the amount of sampling done at each distance. In<br />

conclusion, our approach can find favorable interactions from<br />

available candidates produced by docking programs. To the<br />

best of our knowledge, this is the first time interaction free<br />

energy from a coarse-grained force field is used as a scoring<br />

method to rank docking solutions at a large scale.<br />

FIG. 1. Enrichment in percentage of acceptable or better structures. For each of<br />

the 13 targets with acceptable or better decoys, two columns (from left to right)<br />

stand for CAPRI Score_set and top 100 in our rank of binding free energy<br />

calculation. Red, orange and yellow represent the fractions of high, medium and<br />

acceptable quality structures over the number of all or selected docking decoys.<br />

The order (left to right) is based on the fraction of acceptable structures in each<br />

target (easy to difficult).<br />

Table 1. Successful selections of top-ranked structures<br />

Selection | Target | High | Medium | Acceptable | Total (%)<br />
TOP 1 | T47 | 1 | 0 | 0 | 100<br />
TOP 1 | T53 | 0 | 0 | 1 | 100<br />
TOP 5 | T47 | 3 | 2 | 0 | 100<br />
TOP 5 | T41 | 0 | 0 | 4 | 80<br />
TOP 5 | T53 | 0 | 0 | 3 | 60<br />
TOP 5 | T37 | 0 | 2 | 0 | 40<br />
TOP 10 | T47 | 7 | 3 | 0 | 100<br />
TOP 10 | T41 | 0 | 1 | 7 | 80<br />
TOP 10 | T53 | 0 | 1 | 5 | 60<br />
TOP 10 | T37 | 0 | 3 | 0 | 30<br />
TOP 10 | T50 | 0 | 0 | 1 | 10<br />
TOP 20 | T47 | 14 | 6 | 0 | 100<br />
TOP 20 | T41 | 0 | 4 | 13 | 85<br />
TOP 20 | T53 | 0 | 3 | 9 | 60<br />
TOP 20 | T37 | 0 | 4 | 2 | 30<br />
TOP 20 | T50 | 0 | 0 | 3 | 15<br />
TOP 20 | T40 | 1 | 2 | 0 | 15<br />
TOP 20 | T46 | 0 | 0 | 1 | 5<br />

REFERENCES<br />

May, Pool, Van Dijk, Bijlard, Abeln, Heringa & Feenstra. Coarse-grained<br />

versus atomistic simulations: realistic interaction free energies<br />

for real proteins. Bioinformatics (2014) 30: 326-334.<br />

Lensink & Wodak. Score_set: A CAPRI benchmark for scoring protein<br />

complexes. Proteins (2014) 82:3163-3169.<br />



BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: O22<br />

Oral presentation<br />


O22. PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS<br />

DATA<br />

Elien Vandermarliere 1,2* , Davy Maddelein 1,2 , Niels Hulstaert 1,2 , Elisabeth Stes 1,2 , Michela Di Michele 1,2 ,<br />

Kris Gevaert 1,2 , Edgar Jacoby 3 , Dirk Brehmer 3 & Lennart Martens 1,2 .<br />

Department of Medical Protein Research, VIB 1 ; Department of Biochemistry, Ghent University 2 ; Oncology Discovery,<br />

Janssen Research and Development – Janssen Pharmaceutica, Beerse 3 . * elien.vandermarliere@ugent.be<br />

Proteins are dynamic molecules; they undergo crucial conformational changes induced by post-translational<br />

modifications and by binding of cofactors or other molecules. The characterization of these conformational changes and<br />

their relation to protein function is a central goal of structural biology. Unfortunately, most conventional methods to<br />

obtain structural information do not provide information on protein dynamics. Therefore, mass spectrometry-based<br />

approaches, such as limited proteolysis, hydrogen-deuterium exchange, and stable-isotope labelling, are frequently used<br />

to characterize protein conformation and dynamics, yet the interpretation of these data can be cumbersome and<br />

time-consuming. Here, we present PepShell, a tool that allows interactive data analysis of mass spectrometry-based<br />

conformational proteomics studies by visualization of the identified peptides both at the sequence and structure levels.<br />

Moreover, PepShell allows the comparison of experiments under different conditions, such as different proteolysis times or<br />

binding of the protein to different substrates or inhibitors.<br />

INTRODUCTION<br />

The study of protein structure with mass spectrometry,<br />

called conformational proteomics, is frequently used to<br />

characterize protein conformations and dynamics. Most of<br />

these methods exploit the surface accessibility of amino<br />

acids within the native protein conformation or, more<br />

specifically, differences in the surface accessibility of a<br />

protein under different conditions.<br />

The experimental setup and subsequent workflow of a<br />

conformational proteomics experiment do not deviate<br />

drastically from that of a classic mass spectrometry-based<br />

experiment in which peptides present in a complex peptide<br />

mixture are identified. The final outcome of a<br />

conformational proteomics experiment is a list of peptides.<br />

These peptide lists typically span multiple experimental<br />

conditions across which the structural observations are to<br />

be compared; the peptide lists have to be combined and, if<br />

available, mapped onto the structure of the protein.<br />

To fulfill these latter steps, we developed PepShell<br />

(Vandermarliere et al., 2015) to guide the interpretation<br />

of mass spectrometry-based proteomics data in the context<br />

of protein structure and dynamics.<br />

TOOL DESCRIPTION<br />

PepShell aids the user in the interpretation of the outcome<br />

of conformational proteomics experiments and is<br />

composed of three panels: the experiment comparison<br />

panel, the PDB view panel, and the statistics panel.<br />

<br />

The data to analyze<br />

PepShell allows the input from limited proteolysis,<br />

hydrogen-deuterium exchange, MS footprinting and<br />

stable-isotope labelling experiments. The data have to<br />

be present in a comma-separated text file format. The<br />

project selection interface allows the user to select a<br />

reference project and to indicate which setups need to<br />

be compared with each other.<br />

<br />

Experiment comparison<br />

This panel allows the comparison of the selected<br />

experimental setups at the sequence level. For each<br />

experimental condition, the identified and quantified<br />

peptides are mapped onto the sequence of the protein<br />

of interest.<br />
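The sequence-level mapping described above can be illustrated with a minimal Python sketch. It is not PepShell's code (PepShell is a Java application); the function names, the exact string matching and the toy sequence are assumptions made for illustration only.

```python
def map_peptides(protein_seq, peptides):
    """Locate every occurrence of each peptide in the protein sequence.

    Returns (peptide, start, end) tuples with 0-based, end-exclusive positions.
    """
    hits = []
    for peptide in peptides:
        start = protein_seq.find(peptide)
        while start != -1:
            hits.append((peptide, start, start + len(peptide)))
            start = protein_seq.find(peptide, start + 1)
    return hits


def coverage_mask(protein_seq, peptides):
    """Mark each residue that is covered by at least one identified peptide."""
    mask = [False] * len(protein_seq)
    for _, start, end in map_peptides(protein_seq, peptides):
        for i in range(start, end):
            mask[i] = True
    return mask


def differential_residues(protein_seq, peptides_a, peptides_b):
    """Return residue indices covered in exactly one of two conditions."""
    mask_a = coverage_mask(protein_seq, peptides_a)
    mask_b = coverage_mask(protein_seq, peptides_b)
    return [i for i, (a, b) in enumerate(zip(mask_a, mask_b)) if a != b]


# Toy data (hypothetical): one protein, two experimental conditions.
sequence = "MKTAYIAKQRQISFVK"
condition_a = ["TAYIAK"]
condition_b = ["TAYIAK", "QISFVK"]
changed = differential_residues(sequence, condition_a, condition_b)
```

Residues returned by `differential_residues` are those whose peptide coverage differs between the two conditions, which is the kind of difference one would then want to inspect on the structure.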

The PDB view panel<br />

Here, the detected peptides are mapped on the protein<br />

structure. The main requirement is the availability of a<br />

3D structure of the protein of interest.<br />

<br />

Statistics within PepShell<br />

In this panel, the peptides of interest can be analyzed<br />

in more detail. For each tryptic cleavage position, the<br />

cleavage probability predicted by CP-DT (Fannes et<br />

al., 2013) is given. A detailed comparison of the<br />

peptide ratios across the different experimental<br />

setups is also possible.<br />

CONCLUSIONS<br />

The increasing popularity of structural proteomics is in<br />

stark contrast with the scarcity of efficient tools to<br />

visualize this multitude of data. Some tools that aid data<br />

interpretation are available, but these are<br />

approach-specific and are aimed primarily at mass<br />

spectrometrists with a specific focus on the experimental<br />

mass spectrometry data and their processing and<br />

interpretation. PepShell on the other hand is intended to<br />

support downstream users to interpret the results obtained<br />

from a variety of conformational proteomics approaches.<br />

PepShell uses the peptide lists to compare different<br />

experimental conditions and allows the visualization of<br />

these differences onto the structure of the protein. As such,<br />

PepShell bridges the gap between mass spectrometry-based<br />

proteomics data and their interpretation in the<br />

context of protein structure and dynamics.<br />

PepShell is an open source Java application. Its binaries,<br />

source code and documentation can be found at:<br />

compomics.github.io/projects/pepshell.html<br />

REFERENCES<br />

Fannes T et al. J Proteome Res 12, 2253-2259 (2013).<br />

Vandermarliere E et al. J Proteome Res 14, 1987-1990 (2015).<br />


Abstract ID: O23<br />

Oral presentation<br />


O23. INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK<br />

Thomas Moerman 1,2,5* , Dries Decap 3,5 , Toni Verbeiren 2,5 , Jan Fostier 3,5 , Joke Reumers 4,5 , Jan Aerts 2,5 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Visual Data Analysis Lab, ESAT –<br />

STADIUS, Dept. of Electrical Engineering, KU Leuven – iMinds Medical IT 2 ; Department of Information Technology,<br />

Ghent University – iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium 3 ; Janssen Research & Development,<br />

a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium 4 ; ExaScience Life Lab, Kapeldreef 75, 3001 Leuven,<br />

Belgium 5 . * thomas.moerman@esat.kuleuven.be<br />

Researchers benefit greatly from tools that allow hands-on, interactive and visual experimentation with data, unimpeded<br />

by setup complexities or scaling issues resulting from large data sizes. In our contribution we present an implementation<br />

of an interactive VCF comparison tool, making use of a technology stack based on Apache Spark [1], Big Data<br />

Genomics Adam [2] and Spark Notebook [3].<br />

INTRODUCTION<br />

Current genomics data formats and processing pipelines<br />

are not designed to scale well to large datasets [1]. They<br />

were also not conceived to be used in an interactive<br />

environment. The bioinformatics field typically struggles<br />

with these difficulties as high-throughput, next-generation<br />

sequencing jobs produce large data files. Although many<br />

high-quality bioinformatics processing tools exist, it is<br />

often hard to express analyses in a consolidated and<br />

reproducible fashion. These tools typically do not allow users to<br />

iterate interactively on an analysis while visualizing<br />

results.<br />

OBJECTIVE<br />

Analysis tools preferably provide the expressive power to<br />

define ad hoc queries on data. Biologists or clinical<br />

researchers, when dealing with genomic variants encoded<br />

in VCF files, typically perform queries comparing one<br />

protocol to another, tumor to normal, treated to untreated<br />

cell lines and so on. Ideally these comparisons make use<br />

of all quality-related metrics stored in VCF files (e.g.<br />

coverage depth, quality score) as well as the actual region<br />

annotations (e.g. repeat regions, exonic regions) and<br />

generate visual output. We aim to implement a tool that<br />

provides the necessary expressiveness as well as the<br />

computational power needed for making these types of<br />

analyses practical and interactive.<br />

APPROACH<br />

Recent advances in computation platform technology<br />

(Spark) and notebook technologies (Spark Notebook)<br />

enable orchestration of distributed jobs on cluster<br />

infrastructure from a programmable environment running<br />

in a browser. These technologies, combined with Adam<br />

[2], a library specifically designed for processing next-generation<br />

sequencing data, provide the necessary<br />

architectural bedrock for our purposes.<br />

Analyses are expressed in a high-level programming<br />

language (Scala), operating on specialized data structures<br />

(Spark resilient distributed datasets, or RDDs [1]) that<br />

abstract away the complexity of defining distributed<br />

computations on data sets too large for single-node<br />

processing. Adam meets the need for an explicit data<br />

schema for abstraction of the different bioinformatics file<br />

formats.<br />

RESULTS & CONTRIBUTIONS<br />

Our work focuses on the pairwise comparison of annotated<br />

VCF files. Our contributions consist of two open-source<br />

Scala libraries: VCF-comp [4] and Adam-FX [5]. VCF-comp<br />

implements the concordance by variant position<br />

algorithm, which segregates the variants from two VCF<br />

inputs (A, B) into 5 categories: A/B-unique, concordant<br />

(equal variants on position) and A/B-discordant (different<br />

variants on position). This results in a distributed data<br />

structure from which we project visualizations, presented<br />

to the user by means of the Spark Notebook interface.<br />
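The five-way split described above can be sketched in a few lines of plain Python. This is only an illustration of the categorization logic, not the Scala/Spark implementation of VCF-comp: the function names, the `(chrom, pos, ref, alt)` tuple layout and the treatment of multi-allelic positions (comparing the full set of alleles per position) are assumptions.

```python
from collections import defaultdict

CATEGORIES = ("A-unique", "B-unique", "concordant", "A-discordant", "B-discordant")


def by_position(variants):
    """Group (chrom, pos, ref, alt) tuples by their (chrom, pos) key."""
    index = defaultdict(set)
    for chrom, pos, ref, alt in variants:
        index[(chrom, pos)].add((ref, alt))
    return index


def concordance(variants_a, variants_b):
    """Segregate the variants of two call sets into the five categories."""
    a, b = by_position(variants_a), by_position(variants_b)
    result = {name: [] for name in CATEGORIES}
    for key in sorted(set(a) | set(b)):
        in_a, in_b = a.get(key), b.get(key)
        if in_b is None:
            result["A-unique"].append((key, sorted(in_a)))
        elif in_a is None:
            result["B-unique"].append((key, sorted(in_b)))
        elif in_a == in_b:
            result["concordant"].append((key, sorted(in_a)))
        else:
            # Same position, different variants: keep both sides separately.
            result["A-discordant"].append((key, sorted(in_a)))
            result["B-discordant"].append((key, sorted(in_b)))
    return result


# Toy call sets (hypothetical), e.g. tumor (A) versus normal (B).
calls_a = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
calls_b = [("1", 100, "A", "G"), ("1", 200, "C", "G"), ("3", 10, "T", "C")]
categories = concordance(calls_a, calls_b)
```

In a distributed setting the same grouping by position would typically be expressed as a keyed join over the two data sets; the dictionary-based version here only shows the decision rule.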

FIGURE 1 Allele frequency distribution for concordant and unique<br />

variants in a tumor vs. normal VCF comparison.<br />

FIGURE 2 Functional impact (SnpEff annotation) histogram for<br />

concordant, unique and discordant variants in a tumor vs. normal VCF<br />

comparison.<br />

Adam-FX extends the Adam data structures and file<br />

parsing logic in order to support queries on SnpEff [6],<br />

SnpSift [7], dbSNP and Clinvar annotations.<br />

We believe our tool facilitates the comparison of<br />

annotated VCF files in an interactive manner while<br />

reducing runtime by leveraging the Spark platform.<br />

REFERENCES<br />

[1] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant<br />

abstraction for in-memory cluster computing."<br />

[2] Massie, Matt, et al. "Adam: Genomics formats and processing<br />

patterns for cloud scale computing."<br />

[3] https://github.com/andypetrella/spark-notebook<br />

[4] https://github.com/tmoerman/vcf-comp<br />

[5] https://github.com/tmoerman/adam-fx<br />

[6] Cingolani, P, et al. "A program for annotating and predicting the<br />

effects of single nucleotide polymorphisms, SnpEff: SNPs in the<br />

genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Fly<br />

(Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672<br />


Abstract ID: O24<br />

Oral presentation<br />


O24. 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL<br />

LONG-RANGE INTERACTIONS WITH CANCER GENES<br />

Sepideh Babaei 1 , Waseem Akhtar 2 , Johann de Jong 3 , Marcel Reinders 1 & Jeroen de Ridder 1* .<br />

Delft Bioinformatics Lab, Delft University of Technology 1 ; Division of Molecular Genetics 2 ;<br />

Division of Molecular Carcinogenesis, The Netherlands Cancer Institute 3 . * j.deridder@tudelft.nl<br />

Genomically distal mutations can contribute to deregulation of cancer genes by engaging in chromatin interactions. To<br />

study this, we overlay viral cancer-causing insertions obtained in a murine retroviral insertional mutagenesis screen with<br />

genome-wide chromatin conformation capture data. In this talk, we show that insertions tend to cluster in 3D hotspots<br />

within the nucleus. The identified hotspots are significantly enriched for known cancer genes, and bear the expected<br />

characteristics of bona-fide regulatory interactions, such as enrichment for transcription factor binding sites.<br />

Additionally, we observe a striking pattern of mutually exclusive integration. This is an indication that insertions in these<br />

loci target the same gene, either in their linear genomic vicinity or in their 3D spatial vicinity. Our findings shed new<br />

light on the repertoire of targets obtained from insertional mutagenesis screening and underline the importance of<br />

considering the genome as a 3D structure when studying effects of genomic perturbations.<br />

Evidence is mounting that the organization of the genome<br />

in the cell nucleus is extremely important for gene<br />

regulation. This insight has been facilitated by recent<br />

technological advances (e.g. Hi-C) that enabled researchers<br />

to accurately capture the 3D conformation of<br />

chromosomes in the cellular nucleus at a high resolution.<br />

We have exploited a large existing Hi-C dataset to take 3D<br />

chromosome conformation into account while determining<br />

hotspots of viral cancer-causing mutations. These<br />

identified hotspots are significantly enriched for known<br />

cancer genes, and bear the expected characteristics of<br />

bona-fide regulatory interactions, such as enrichment for<br />

transcription factor binding sites. Additionally, we observe<br />

a striking pattern of mutually exclusive integration. This is<br />

an indication that insertions in these loci target the same<br />

gene through long-range interactions (1).<br />

In a second study (2), we performed a similar analysis that<br />

shows a striking relation between genome conformation<br />

and expression correlation in the brain. Although recent<br />

studies have shown that a strong correlation exists between<br />

chromatin interactions and gene co-expression,<br />

predicting gene co-expression from frequent long-range<br />

chromatin interactions remains challenging. We address<br />

this by characterizing the topology of the cortical<br />

chromatin interaction network using scale-aware<br />

topological measures. We demonstrate that based on these<br />

characterizations it is possible to accurately predict spatial<br />

co-expression between genes in the mouse cortex.<br />

Consistent with previous findings, we find that the<br />

chromatin interaction profile of a gene-pair is a good<br />

predictor of their spatial co-expression. However, the<br />

accuracy of the prediction can be substantially improved<br />

when chromatin interactions are described using scale-aware<br />

topological measures of the multi-resolution<br />

chromatin interaction network. We conclude that, for co-expression<br />

prediction, it is necessary to take into account<br />

different levels of chromatin interactions ranging from<br />

direct interaction between genes (i.e. small-scale) to<br />

chromatin compartment interactions (i.e. large-scale).<br />

In this talk, I will focus on the computational and<br />

statistical methods that are required to make an insightful<br />

overlay of high-resolution conformation maps obtained<br />

using Hi-C with ~20,000 cancer-causing retroviral<br />

mutations and expression maps from the Allen Brain<br />

Atlas.<br />

FIGURE 1. Circos visualization of the insertion clusters that co-localize<br />

with the Notch1 locus.<br />

REFERENCES<br />

(1) Babaei, S. et al. Nature Communications (2015).<br />

(2) Babaei and Mahfouz et al. PLoS Computational Biology (2015).<br />


Abstract ID: P<br />

Poster<br />


P1. KNN-MDR APPROACH FOR DETECTING GENE-GENE<br />

INTERACTIONS<br />

Sinan Abo alchamlat 1 & Frédéric Farnir 1,* .<br />

Fundamental and Applied Research for Animals & Health (FARAH), Sustainable Animal Production, University of<br />

Liège 1 . * f.farnir@ulg.ac.be<br />

Recent years have seen the emergence of a wealth of biological information. Facilitated access to the genome<br />

sequence, along with massive data on gene expression and on proteins, has revolutionized research in many fields<br />

of biology. For example, the identification of up to several million SNPs in many species and the development of chips<br />

allowing for an effective genotyping of these SNPs in large cohorts have triggered the need for statistical models able to<br />

identify the effects of individual and of interacting SNPs on phenotypic traits in this new high-dimensional landscape.<br />

Our work is a contribution to this field.<br />

INTRODUCTION<br />

GWAS has allowed the identification of hundreds of<br />

genetic variants associated with complex diseases and traits,<br />

and provided valuable insight into their genetic<br />

architecture (Wu M et al., 2010). Nevertheless, most<br />

variants identified so far have been found to provide<br />

relatively little information about the relationship<br />

between changes at the genomic level and phenotypes<br />

because of the lack of reproducibility of the findings, or<br />

because these variants most of the time explain only a<br />

small proportion of the underlying genetic variation (Fang<br />

G et al., 2012). This observation, quoted as the ‘missing<br />

heritability’ problem (Manolio T et al., 2009) of course<br />

raises the question: where does the unexplained genetic<br />

variation come from? A tentative explanation is that genes<br />

do not work in isolation, leading to the idea that sets of<br />

genes (or gene networks) could have a major effect on the<br />

tested traits while almost no marginal – i.e. individual<br />

gene – effect is detectable. Consequently, an important<br />

question concerns the exact relationship between the<br />

genomic configuration, including the interactions between<br />

the involved genes, and the phenotypic expression.<br />

METHODS<br />

To tackle this subject, different statistical methods such as<br />

MDR (Multifactor Dimensionality Reduction) have been proposed<br />

for detecting gene-gene interaction (Ritchie, D., et al.,<br />

2001); their relative performances remain largely unclear,<br />

and their extension to situations combining many variants<br />

turns out to be challenging. We therefore propose a novel MDR<br />

approach using K-Nearest Neighbors (KNN) methodology<br />

(KNN-MDR) for detecting gene-gene interaction as a<br />

possible alternative, especially when the number of<br />

involved determinants is potentially high. The idea behind<br />

our method is to replace the status allocation used in<br />

classical MDR methods by a KNN approach: the majority<br />

vote occurs in the k (a parameter that must be tuned and<br />

depends on the various possible scenarios) nearest<br />

neighbors instead of within the (potentially empty) cell<br />

determined by the tested attributes of the individual to be<br />

classified. The steps other than classification are identical<br />

in both methods (i.e. cross-validation, attribute selection,<br />

training and test balanced accuracy computations, best<br />

model selection procedure).<br />
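The difference between cell-based and neighbour-based status allocation can be illustrated with a small Python sketch. It is not the authors' implementation: the Hamming distance on 0/1/2 genotype codes, the tie-breaking by sample order, the plain majority vote used for the MDR cell (classical MDR compares a case/control ratio with a threshold) and all function names are assumptions.

```python
def hamming(g1, g2):
    """Number of tested SNPs at which two genotype vectors differ."""
    return sum(x != y for x, y in zip(g1, g2))


def mdr_cell_predict(train_genotypes, train_status, query):
    """Cell-based allocation: vote among training samples with the same
    genotype combination. Returns None when the cell is empty."""
    cell = [s for g, s in zip(train_genotypes, train_status)
            if tuple(g) == tuple(query)]
    if not cell:
        return None
    return 1 if 2 * sum(cell) > len(cell) else 0


def knn_predict(train_genotypes, train_status, query, k):
    """KNN allocation: majority vote among the k nearest training samples
    (1 = case, 0 = control); ties in distance are broken by sample order."""
    order = sorted(range(len(train_genotypes)),
                   key=lambda i: (hamming(train_genotypes[i], query), i))
    votes = [train_status[i] for i in order[:k]]
    return 1 if 2 * sum(votes) > len(votes) else 0


def balanced_accuracy(truth, predicted):
    """Mean of sensitivity and specificity; assumes both classes occur."""
    cases = [p for t, p in zip(truth, predicted) if t == 1]
    controls = [p for t, p in zip(truth, predicted) if t == 0]
    sensitivity = sum(1 for p in cases if p == 1) / len(cases)
    specificity = sum(1 for p in controls if p == 0) / len(controls)
    return (sensitivity + specificity) / 2


# Toy training set (hypothetical): genotypes at two SNPs coded 0/1/2.
genotypes = [(0, 0), (0, 1), (2, 2), (2, 1), (1, 2)]
status = [0, 0, 1, 1, 1]
query = (2, 0)  # genotype combination absent from the training set
```

For the query above, the cell-based rule has nothing to vote on, whereas the KNN rule still returns a status from the closest genotype combinations; this is the situation that becomes frequent as the number of tested attributes grows.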

RESULTS & DISCUSSION<br />

Experimental results on both simulated data and real<br />

genome-wide data from Wellcome Trust Case Control<br />

Consortium (WTCCC) (Wellcome Trust Case Control C.,<br />

2007) show that KNN-MDR has interesting properties in<br />

terms of accuracy and power, and that, in many cases, it<br />

significantly outperforms its recent competitors.<br />

FIGURE 1. Comparison of the inter-chromosomal interactions detected<br />

on the WTCCC dataset by KNN-MDR and other interaction methods<br />

using this same dataset as example (Shchetynsky et al. (2015); Zhang et<br />

al. (2012))<br />

The results of this study allow us to draw some<br />

conclusions about the performance of KNN-MDR: on the<br />

one hand, the performance of the KNN-MDR method in<br />

detecting gene-gene interactions is similar to that<br />

of MDR for small problems. On the other<br />

hand, KNN-MDR has significant advantages in detecting<br />

gene effects when samples and numbers of markers are<br />

large (as in GWAS). KNN-MDR can therefore be<br />

seen as a more comprehensive method than MDR<br />

and other competitors for detecting gene-gene interaction.<br />

REFERENCES<br />

Wu M et al. American journal of human genetics 86, 929-942 (2010).<br />

Fang G et al. PloS one 7, 1932-6203 (2012).<br />

Manolio T et al. Nature 461, 747-753 (2009).<br />

Ritchie, D., et al. Am J Hum Genet,69, 138-147 (2001).<br />

Wellcome Trust Case Control C. Nature, 447(7145):661-678 (2007).<br />

Shchetynsky K et al. Clinical immunology 158(1):19-28 (2015).<br />

Zhang J et al. American Medical Journal 3(1) (2015).<br />


Abstract ID: P<br />

Poster<br />


P2. CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC<br />

PATHWAYS IN FUNGI<br />

Maria Victoria Aguilar Pontes*, Eline Majoor, Claire Khosravi, Ronald P. de Vries, Miaomiao Zhou<br />

Fungal Physiology, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands; Fungal Molecular Physiology,<br />

Utrecht University, The Netherlands.*v.aguilar@cbs.knaw.nl, e.majoor@cbs.knaw.nl, c.khosravi@cbs.knaw.nl,<br />

r.devries@cbs.knaw.nl, m.zhou@cbs.knaw.nl<br />

INTRODUCTION<br />

Plant polysaccharides are among the major substrates for<br />

many fungi. After extracellular degradation, the<br />

monomeric components (mainly monosaccharides) are<br />

taken up by the cells and used as carbon sources to enable<br />

the fungus to grow. This would also imply that the range<br />

of catabolic pathways of a fungus may be correlated to the<br />

decomposition of the polysaccharides it can degrade.<br />

Several carbon catabolic pathways have been studied in<br />

different fungi able to grow on plant biomass such as<br />

Aspergillus niger (De Vries, et al., 2012).<br />

In this study we have tested this hypothesis by identifying<br />

the presence of genes of a number of catabolic pathways<br />

in selected fungi from the Ascomycota and the<br />

Basidiomycota.<br />

METHODS<br />

A total of 104 fungal genomes were identified from the<br />

JGI fungal program (Grigoriev IV, et al., 2011), Broad<br />

Institute of Harvard and MIT, AspGD (Arnaud, et al.,<br />

2012) and NCBI genbank (Benson, et al., 2012) (data<br />

version March 2013).<br />

We identified A. niger genes involved in individual<br />

pathways from literature. Genome-scale protein ortholog<br />

clusters were detected with OrthoMCL (Li, et al., 2003), using<br />

inflation factor 1, an E-value cutoff of 1E-3 and a percentage match<br />

cutoff of 60%, as used for the identification of distant homologs<br />

(Boekhorst, et al., 2007). The all-vs-all BlastP search<br />

required by OrthoMCL was carried out in parallel on a grid of 500<br />

computers. The ortholog clusters were<br />

then curated manually by expert knowledge and literature<br />

search. Manual curation was aided by aligning the amino<br />

acid sequences of the hits for each query together with a<br />

suitable outgroup by MAFFT (Katoh, et al., 2009; Katoh,<br />

et al., 2005), after which neighbor joining trees were<br />

generated using MEGA5 with 1000 bootstraps. Genes that<br />

were clearly separated from the query branch in the trees<br />

were removed from the results.<br />

RESULTS & DISCUSSION<br />

Patterns of pathway gene presence are conserved among<br />

clades. Galacturonic acid and rhamnose pathways are<br />

missing in yeast. The pentose pathway is conserved in<br />

Pezizomycetes and Basidiomycota, which explains their<br />

ability to grow on pentose as carbon source (www.funggrowth.org).<br />

These results may indicate that different evolutionary<br />

tracks have led to different metabolic strategies.<br />

The expression of metabolic genes will be evaluated for<br />

those species for which transcriptome data are available.<br />

The results will be compared to growth profiling data of<br />

the species on a set of plant-related poly- and<br />

monosaccharides to determine to what extent the genome<br />

content fits the physiological ability of the species.<br />

ACKNOWLEDGEMENTS<br />

The comparative genomics analysis was carried out on the<br />

Dutch national e-infrastructure with the support of SURF<br />

Foundation (e-infra1300787).<br />

REFERENCES<br />

Arnaud, M.B., et al., Nucleic Acids Res, 40, 653-659 (2012).<br />

Benson, D.A., et al., Nucleic Acids Res, 40, 48-53 (2012).<br />

Boekhorst, J., et al., BMC Bioinformatics, 8, 356-363 (2007).<br />

De Vries, R.P., et al. Pan Stanford Publishing Pte. Ltd, Singapore (2012).<br />

Grigoriev IV, et al., Mycology, 2, 192-209 (2011).<br />

Katoh, K., et al., Methods Mol Biol, 537, 39-64 (2009).<br />

Katoh, K., et al., Nucleic Acids Res, 33, 511-518 (2005).<br />

Li, L., et al., Genome Res, 13, 2178-2189 (2003).<br />


Abstract ID: P<br />

Poster<br />


P3. VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS<br />

USING POLIMERO AND POLIMERO-BIO<br />

Daniel Alcaide 1,2* , Ryo Sakai 1,2 , Raf Winand 1,2 , Toni Verbeiren 1,2 , Thomas Moerman 1,2 , Jansi Thiyagarajan & Jan Aerts.<br />

KU Leuven Department of Electrical Engineering-ESAT, STADIUS, VDA-lab, Belgium 1 ; iMinds Medical IT, Leuven,<br />

Belgium. * daniel.alcaide@esat.kuleuven.be<br />

Although there are currently several tools for fast prototyping in data visualization, the specifics of the biological domain<br />

often require the development of custom visuals. This leads to the issue that we end up re-implementing the base visuals<br />

over and over if we want to build them into a specific analysis tool. This work presents a proof-of-principle library for<br />

creating composable linked data visualizations, including an initial collection of parsers and visuals with an emphasis on<br />

biology. With Polimero and Polimero-bio, we want to create a library to build scalable domain-specific visual data<br />

exploration tools using a collection of D3-based reusable web components.<br />

INTRODUCTION<br />

As a visual data analysis lab, we often combine<br />

(brush/link) well-known data visualization techniques<br />

(scatterplots, barcharts, etc.). Although it is possible to use<br />

general-purpose tools like Tableau or Excel, the specific<br />

needs of the biological field usually demand the creation<br />

of particular data visualizations which are not included in<br />

these commercial solutions (Figure 1).<br />

These visual implementations then have to be rebuilt<br />

for each new tool. The present solution offers<br />

an alternative for creating composable linked data<br />

visualizations.<br />

FIGURE 1. Klaudia-plot: visualization created with Polimero that shows<br />

the read pairs mapped around a deletion in the NA12878 genome on<br />

chromosome 20.<br />

METHODS<br />

Polimero is a library that uses the Polymer implementation<br />

(www.polymerproject.org) for creating visual web components.<br />

Web components are an emerging W3C standard for<br />

extending the HTML platform to create web-based apps.<br />

This new technology includes custom elements, HTML<br />

templates, shadow DOM, and HTML imports (Figure 2).<br />

The D3-based custom elements that Polimero and<br />

Polimero-bio offer allow us to create a scalable<br />

framework for building domain-specific visual data<br />

exploration tools.<br />

Leveraging the web component concepts, the main<br />

characteristics of the Polimero library are:<br />

• Modular: Each element is an independent module<br />

that has a specific purpose (data, visualization,<br />

computation).<br />

• Composable: The elements can be combined to<br />

set up new functionalities (linking, filtering,<br />

reading different data sources).<br />

• Encapsulated: Web components aim to provide<br />

the user with a simple element interface, so that the<br />

underlying code does not have to be dealt with.<br />

• Reusable: The same element can be used in the<br />

same project for different objectives.<br />

• Linkable: Polimero elements can speak to each<br />

other, allowing the use of events for brushing and<br />

linking.<br />

• Embeddable: The elements can be added to any<br />

existing framework that uses HTML (e.g. IPython<br />

notebook).<br />

FIGURE 2. HTML example – Representing Polimero elements to create<br />

visualization.<br />

RESULTS & DISCUSSION<br />

This library makes it possible to create applications that<br />

are composable, encapsulated, and reusable. This is<br />

valuable both for the developer/designer who can easily<br />

create and plug in custom visual encodings, and for the<br />

end-user who can create linked visualizations by dragging<br />

existing components onto a canvas using the Polimero-designer.<br />

Polimero and Polimero-bio are still in development but<br />

they are available at www.bitbucket.org/vda-lab/polimero.<br />


Abstract ID: P<br />

Poster<br />


P4. DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND<br />

Ganna Androsova 1* , Reinhard Schneider 1 & Roland Krause 1 .<br />

Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belval, Luxembourg 1 .<br />

* ganna.androsova@uni.lu<br />

INTRODUCTION<br />

Molecular interaction networks are dense structures of<br />

protein interactions, from which we would like to extract<br />

relevant sub-networks specific to the disease of interest.<br />

Such a disease-specific network is often constructed by the<br />

seed-and-extend algorithm, which extracts the relevant<br />

genes from an organism-wide, weighted interaction<br />

network, typically as its first-neighbourhood. Seed-and-extend<br />

is suitable when disease biomarkers are poorly<br />

investigated and the knowledge about biomarker<br />

interaction partners is missing or when the interacting<br />

partners are established but the connections are missing<br />

between them.<br />

Our syndrome of interest is the postoperative cognitive<br />

impairment frequently experienced by elderly patients,<br />

characterized by progressive cognitive and sensory decline.<br />

The acute phase of cognitive impairment is postoperative<br />

delirium (POD). The underlying pathophysiological<br />

mechanisms have not been studied in depth due to<br />

the multifactorial pathogenesis of this postoperative cognitive<br />

impairment. The known POD-related genes can be<br />

integrated into the draft network for exploration on a<br />

systems level.<br />

Here, we investigate how stable the results of such<br />

analysis are when the input set of seed genes is varied, and<br />

what role stringency plays in the initial selection of the<br />

networks. Ideally, we would like to find the “sweet spot”<br />

that provides a biologically meaningful trade-off between<br />

false-positives and -negatives to be used for such analyses.<br />

METHODS<br />

The list of disease-related genes/proteins was retrieved<br />

from literature studies in the PubMed database.<br />

We extended the seed list with directly linked interactors<br />

by seed-and-extend from protein-protein interaction<br />

network databases. We extracted all interactions between<br />

seeds and connected neighbours, which resulted in the<br />

first-degree network.<br />
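The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `seed_and_extend`, the `(protein_a, protein_b, weight)` edge format, the `min_weight` stringency cut-off and the toy edges are all assumptions, as is the choice to keep interactions between two neighbours.

```python
def seed_and_extend(edges, seeds, min_weight=0.0):
    """Return (nodes, sub_edges) of the first-degree network around the seeds.

    edges: iterable of (protein_a, protein_b, weight) tuples (assumed format).
    """
    edges = list(edges)
    seeds = set(seeds)
    nodes = set(seeds)
    # Extend: add every partner linked to a seed by a sufficiently strong edge.
    for a, b, w in edges:
        if w >= min_weight:
            if a in seeds:
                nodes.add(b)
            if b in seeds:
                nodes.add(a)
    # Extract: keep all sufficiently strong interactions among retained nodes.
    sub_edges = [(a, b, w) for a, b, w in edges
                 if w >= min_weight and a in nodes and b in nodes]
    return nodes, sub_edges


# Toy weighted interaction network (hypothetical proteins and scores).
toy_edges = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "D", 0.7),
             ("A", "E", 0.2), ("B", "E", 0.95)]

strict_nodes, strict_edges = seed_and_extend(toy_edges, {"A"}, min_weight=0.5)
loose_nodes, loose_edges = seed_and_extend(toy_edges, {"A"})
```

Raising `min_weight` shrinks the extracted network, which is one simple way to explore the stringency trade-off raised in the introduction.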

Next, we evaluated the biological enrichment of the<br />

extracted network, its topological parameters and its overlap with<br />

other diseases, and clustered the network into smaller<br />

sub-networks.<br />

RESULTS & DISCUSSION<br />

The POD network (Figure 1) follows a scale-free<br />

distribution and consists of 541 proteins with 5,242<br />

interactions between them.<br />

FIGURE 1. Postoperative delirium molecular network.<br />

The network was evaluated topologically by degree<br />

assortativity, density, shortest path, eccentricity and other<br />

measures. Pathway enrichment analysis showed<br />

glucocorticoid receptor signalling, immune response, and<br />

dopamine signalling as relevant to POD (Figure 2).<br />

FIGURE 2. Postoperative delirium pathway enrichment analysis.<br />

The top 5 hub proteins were UBC_HUMAN,<br />

GCR_HUMAN, P53_HUMAN, HS90A_HUMAN and<br />

EGFR_HUMAN. The appearance of p53 and other very<br />

frequently reported genes among the top 5 hubs, in our study and in several<br />

others, motivated us to investigate their relevance to<br />

the disease and to question a possible data bias. We<br />

compare how the size, specificity and completeness of the<br />

input seed list affect the resulting network and the<br />

retrieval of other disease-related proteins.<br />
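Two of the topological measures mentioned above, node degree (used to rank hubs) and network density, can be computed from a plain edge list. The sketch below is illustrative only: the helper names and the toy network are assumptions, and the density is computed from the reported network size under the assumption of a simple undirected graph without self-loops.

```python
from collections import Counter


def degrees(edges):
    """Count the number of interactions per protein in an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Hypothetical toy network in which "P1" is the only hub.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
top_hubs = degrees(toy_edges).most_common(1)

# Density implied by the reported size (541 proteins, 5,242 interactions).
pod_density = density(541, 5242)
```

Under these assumptions the reported network would be sparse, with a density of roughly 0.036.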


P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW<br />

COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE<br />

AND HIVE<br />

Amin Ardeshirdavani 1* , Erika Souche 2 , Martijn Oldenhof 3 & Yves Moreau 1 .<br />

KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics 1; KU Leuven<br />

Department of Human Genetics 2; KU Leuven Facilities for Research 3. *amin.ardeshirdavani@esat.kuleuven.be<br />

Next Generation Sequencing (NGS) technologies allow the whole human genome to be sequenced, among other things to study<br />

human genetic disorders efficiently. However, the flood of sequencing data requires high computational power and<br />

optimized programming structures for data analysis. Many researchers use scale-out networks to emulate a<br />

supercomputer. In many use cases Apache Hadoop and HBase have been used to coordinate distributed computation and<br />

act as a storage platform, respectively. However, scale-out networks have rarely been used to handle gene variation data<br />

from NGS, except for sequencing read assembly. In our study, we propose a Big Data solution that integrates Apache<br />

Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files.<br />

INTRODUCTION<br />

The goal of this project is to close the gap between<br />

massive NGS data volumes and limited data processing<br />

capacity. We propose a data processing and<br />

storage model specifically for NGS data. To this end,<br />

we developed an application based on this model to test<br />

whether processing capacity increases substantially. The target<br />

users of this application are researchers with intermediate-level<br />

computer skills. The new model should meet certain<br />

demands: scalability, fault tolerance and availability.<br />

Data import should be fast and occupy as little<br />

storage as possible. Querying the data should be<br />

fast and possible from a remote location. To<br />

meet these demands, three open-source projects,<br />

Apache Hadoop, HBase and Hive, are integrated as the<br />

backbone, and an application with a user-friendly interface<br />

is built on top of them to make this integration<br />

more straightforward.<br />

METHODS<br />

In general, Hadoop provides distributed MapReduce<br />

data processing, HBase is a platform for storing complex<br />

structured data, and Hive retrieves data from<br />

HBase using Structured Query Language (SQL) syntax.<br />

Although Hadoop and HBase have recently become popular, the<br />

combination of Hadoop, HBase and Hive has rarely been<br />

implemented in the bioinformatics field.<br />

Here we mainly discuss the analysis of gene variation data. The<br />

application therefore focuses on parsing and<br />

storing Variant Call Format (VCF) files. It is<br />

designed to adapt dynamically to the VCF file structures<br />

produced by different variant callers. For example,<br />

UnifiedGenotyper calls SNPs and InDels separately,<br />

treating each variant as independent, whereas<br />

HaplotypeCaller calls variants using local<br />

assembly. For gene variation analysis, the VCF files of<br />

different samples need to be queried, and the results should<br />

be exportable for further use. A VCF file<br />

for a single sample or a group of samples is normally very<br />

large, so processing efficiency is<br />

crucial.<br />
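To illustrate why a parser has to adapt to caller-specific structure, the sketch below parses a single VCF data line and collects whatever INFO keys it finds rather than assuming a fixed set. It is a simplified illustration, not the H3 VCF implementation: `parse_vcf_line`, `row_key`, the key layout and the example line are assumptions, and header parsing is omitted.

```python
def parse_vcf_line(line, sample_names=()):
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": {},
        "SAMPLES": {},
    }
    # INFO keys differ between variant callers, so collect them dynamically.
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            record["INFO"][key] = value if value else True
    # Optional genotype columns: FORMAT followed by one column per sample.
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            record["SAMPLES"][name] = dict(zip(keys, column.split(":")))
    return record


def row_key(record):
    """Hypothetical row key for a key-value store: sortable by position."""
    return f'{record["CHROM"]}:{record["POS"]:09d}:{record["REF"]}'


example_line = "chr1\t12345\trs1\tA\tG,T\t50.0\tPASS\tDP=10;DB\tGT:DP\t0/1:10"
example_record = parse_vcf_line(example_line, sample_names=("S1",))
```

Zero-padding the position in the key keeps rows for one chromosome in positional order when keys are sorted as strings, which is a common design choice for column stores.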

The model we chose integrates Hadoop,<br />

HBase and Hive: Hadoop is used for data processing,<br />

HBase for storage and Hive for querying. Since all of<br />

these projects need a distributed cluster to perform<br />

well, choosing a suitable<br />

architecture for our application is crucial. The cluster is the<br />

main processing and storage platform. A single server<br />

outside the cluster acts as a client for users. Researchers<br />

can use our application to connect remotely to the Hive<br />

server.<br />
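The MapReduce pattern that Hadoop distributes across a cluster can be imitated in a few lines of plain Python. The sketch below does not use any Hadoop API; it only shows the map, shuffle and reduce roles on a toy task (listing which samples carry each variant). All function names and input records are assumptions.

```python
from collections import defaultdict


def map_variant(sample, vcf_line):
    """Map step: emit ((chrom, pos, ref, alt), sample) for each ALT allele."""
    chrom, pos, _id, ref, alt = vcf_line.split("\t")[:5]
    for allele in alt.split(","):
        yield (chrom, int(pos), ref, allele), sample


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_variant(key, samples):
    """Reduce step: list the distinct samples carrying one variant."""
    return key, sorted(set(samples))


# Hypothetical input: (sample name, VCF data line) pairs.
records = [
    ("S1", "chr1\t100\t.\tA\tG\t50\tPASS\t."),
    ("S2", "chr1\t100\t.\tA\tG\t45\tPASS\t."),
    ("S2", "chr2\t200\t.\tC\tT,G\t30\tPASS\t."),
]
pairs = [kv for sample, line in records for kv in map_variant(sample, line)]
carriers = dict(reduce_variant(k, v) for k, v in shuffle(pairs).items())
```

Because each map call only sees one line and each reduce call only sees one key, both steps can run in parallel on different machines, which is what makes the pattern suitable for large VCF collections.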

RESULTS & DISCUSSION<br />

The tests clearly show that the Apache integration<br />

performs much better than the SQL model when dealing<br />

with large VCF files. For small VCF files, its<br />

performance is acceptable. We conclude that the<br />

Apache integration could be a good solution for this kind<br />

of file management. Our newly developed application, H3<br />

VCF, offers a user-friendly interface that lets users<br />

without advanced IT knowledge conveniently<br />

use the integration to handle VCF files. Users can either<br />

build their own local computer cluster or use<br />

Amazon EMR to easily create a cluster with the Apache<br />

projects for a few dollars.<br />


P6. ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING<br />

LONG-TERM PATIENT GUT COLONIZATION<br />

Jumamurat R. Bayjanov 1* , Jery Baan 1 , Mark de Been 1 , Mick Watson 2 & Willem van Schaik 1 .<br />

Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands 1 ; Edinburgh<br />

Genomics, The University of Edinburgh, Edinburgh, Scotland 2 . * J.Bayjanov@umcutrecht.nl<br />

Enterococcus faecium, a recently evolved multi-drug resistant nosocomial pathogen, is able to rapidly colonize the human<br />

gut. Previous work on animal, healthy human and clinical E. faecium strains has shown that clinical isolates form a<br />

distinct lineage. However, these studies lack a detailed niche-specific and longitudinal analysis of the evolutionary dynamics of<br />

this organism. Here we present a longitudinal analysis of the within-host evolutionary dynamics of E. faecium gut isolates, which<br />

were sampled from five patients over a period of 8 years. Whole-genome sequencing analysis showed that the rapid<br />

diversification of E. faecium clones in the patient gut is mainly due to recombination and phages. High diversification<br />

allows E. faecium clones to acquire new genes, including antibiotic resistance genes, which allows this bacterium to<br />

rapidly colonize hostile environments.<br />

INTRODUCTION<br />

In recent decades, Enterococcus faecium, normally a<br />

harmless gut commensal, has emerged as an important<br />

multi-drug resistant nosocomial pathogen. Previous work<br />

has shown that clinical isolates of E. faecium form a subpopulation<br />

that is distinct from strains isolated from<br />

animals and healthy humans (Lebreton et al., 2013). We<br />

used whole-genome sequencing to characterize how<br />

clinical E. faecium strains evolve during long-term patient<br />

gut colonization.<br />

METHODS<br />

The genomes of 96 E. faecium gut isolates, obtained over<br />

8 years from 5 different patients, were sequenced using<br />

Illumina HiSeq 2x100bp paired-end sequencing. Quality<br />

filtering of sequence reads was performed using Nesoni<br />

(version 0.117) (Nesoni, 2014) and high-quality reads<br />

were assembled into contiguous sequences using the SPAdes<br />

assembler (version 3.1.0) (Bankevich et al., 2012).<br />

Subsequently, assembled sequences were annotated using<br />

Prokka (version 1.10) (Seemann, 2014). In addition to these 96<br />

genomes, we also included publicly available genome<br />

sequences of 70 E. faecium strains, which were<br />

downloaded from the NCBI GenBank database. In the set of<br />

166 strains, orthology between genes was identified using<br />

orthAgogue (Ekseth et al., 2014) and orthologous genes<br />

were clustered into ortholog groups using the MCL algorithm<br />

(Enright et al., 2002). Core genome alignments were then<br />

constructed by concatenating core gene sequences and<br />

were filtered for recombinations using Gubbins (Croucher<br />

et al., 2015). Subsequently, recombination-filtered core<br />

genome alignments were used to construct a phylogenetic<br />

tree. In addition to the core-genome-based analyses, we<br />

also studied gene gain and loss across time.<br />
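The core-genome concatenation and the gene gain/loss comparison can be illustrated with set operations. This sketch assumes that orthology assignment and per-gene alignment have already been done (by the tools named above) and that the result is available as a dictionary per strain; the data layout, function names and toy sequences are assumptions.

```python
def core_genes(aligned):
    """Ortholog groups present in every strain (aligned must be non-empty).

    aligned: {strain: {ortholog_group: aligned_sequence}}
    """
    gene_sets = [set(genes) for genes in aligned.values()]
    return sorted(set.intersection(*gene_sets))


def concatenate_core(aligned):
    """Join the core gene alignments of each strain in a fixed gene order."""
    core = core_genes(aligned)
    return {strain: "".join(genes[g] for g in core)
            for strain, genes in aligned.items()}


def gain_loss(earlier_genes, later_genes):
    """Genes gained and lost between an earlier and a later isolate."""
    earlier, later = set(earlier_genes), set(later_genes)
    return sorted(later - earlier), sorted(earlier - later)


# Toy input: three isolates, two shared ortholog groups, two accessory genes.
toy = {
    "s1": {"geneA": "ATG-", "geneB": "CC", "geneX": "TTT"},
    "s2": {"geneA": "ATGA", "geneB": "CT"},
    "s3": {"geneB": "CC", "geneA": "AT-A", "geneY": "G"},
}
core_alignment = concatenate_core(toy)
gained, lost = gain_loss(toy["s1"], toy["s3"])
```

Sorting the core gene names fixes the concatenation order, so every strain contributes the same columns to the alignment.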

RESULTS & DISCUSSION<br />

As expected, all 96 isolates were grouped in E. faecium<br />

clade A, with only one strain clustering in clade A-2,<br />

which mainly contains animal isolates. The remaining 95<br />

strains were assigned to clade A-1, which consists almost<br />

exclusively of clinical isolates. The<br />

phylogenetic tree showed 5 clusters of closely related<br />

patient strains, revealing the microevolution of E.<br />

faecium strains during gut colonization. We also presume<br />

that direct transfer of strains occurred between<br />

patients during hospitalization in the same ward.<br />

Additionally, analysis of gene gain and loss across time<br />

showed that loss and gain of prophages is an important<br />

factor in generating genetic diversity during gut<br />

colonization.<br />

This study highlights the ability of E. faecium clones to<br />

rapidly diversify, which may contribute to the ability of<br />

this bacterium to efficiently colonize new environments<br />

and rapidly acquire antibiotic resistance determinants.<br />

REFERENCES<br />

Lebreton F, et al. "Emergence of epidemic multidrug-resistant<br />

Enterococcus faecium from animal and commensal strains". MBio.<br />

4(4):e00534-13, 2013.<br />

Nesoni. https://github.com/Victorian-Bioinformatics-Consortium/nesoni<br />

Bankevich A, et al. "SPAdes: A New Genome Assembly Algorithm and<br />

Its Applications to Single-Cell Sequencing". Journal of<br />

Computational Biology 19(5):455-477, 2012.<br />

Seemann T. "Prokka: rapid prokaryotic genome annotation".<br />

Bioinformatics. 30(14):2068-9, 2014.<br />

Ekseth OK, et al. "orthAgogue: an agile tool for the rapid prediction of<br />

orthology relations". Bioinformatics. 30(5):734-6, 2014.<br />

Enright AJ, et al. "An efficient algorithm for large-scale detection of<br />

protein families". Nucleic Acids Res. 30(7):1575-1584, 2002.<br />

Croucher NJ, et al. "Rapid phylogenetic analysis of large samples of<br />

recombinant bacterial whole genome sequences using Gubbins".<br />

Nucleic Acids Res. 43(3):e15, 2015.<br />


P7. XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC<br />

Charlie Beirnaert 1,2* , Matthias Cuykx 3 , Adrian Covaci 3 & Kris Laukens 1,2 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical Informatics Research Centre<br />

Antwerp (biomina) 2 ; Toxicological Centre, University of Antwerp 3 . * charlie.beirnaert@uantwerpen.be<br />

In high-throughput untargeted metabolomics studies, quality control is still a prominent bottleneck. In analogy to a<br />

recently developed QC tool for proteomics, work in our research group aims to develop a QC environment specific for<br />

metabolomics. One component in this work is the XCMS analysis software for LC-MS data, which is very input-parameter-sensitive.<br />

The presented work deals with the automatic optimisation of the XCMS parameters by building<br />

further upon an existing framework for XCMS optimisation. The additions to this framework will be the inclusion of<br />

quantified resolution data by using the otherwise ignored profile-data and intelligent use of the isotopic profile of<br />

measured compounds.<br />

INTRODUCTION<br />

Metabolomics is the study of small molecules or<br />

metabolites. These metabolites have an enormous<br />

chemical diversity and are only now starting to be<br />

identified in a high-throughput fashion. The reason for this is<br />

the adoption of high-performance liquid chromatography<br />

mass spectrometry and nuclear magnetic resonance<br />

spectroscopy. However, analysing these large<br />

datasets is not trivial: for LC-MS in particular, there are<br />

almost more ways of analysing data than there are<br />

researchers. Arguably, the most commonly used software<br />

platform for the initial analysis is XCMS (Smith et al.,<br />

2006). However, the output of XCMS depends strongly<br />

on the input parameters. Often the default parameters are<br />

chosen, or they are adapted to the intuition of the<br />

researcher, without accounting for the false positives this may<br />

introduce. Optimization algorithms have been<br />

constructed by using a dilution series (Eliasson et al.,<br />

2012) and by using the carbon isotope (Libiseller et al.,<br />

2015). In this work, we build further upon the latter by<br />

including quantified information from the profile m/z<br />

domain (the continuous data in the m/z dimension) where<br />

accurate resolutions can be obtained for the mono-isotopic<br />

peaks and other isotopes. The developed optimisation can<br />

be used for both the data analysis and the quality control<br />

framework that is under development.<br />

METHODS<br />

The proposed work uses XCMS to find the peaks of<br />

interest in the data. To optimise this process, the results<br />

from XCMS are analysed for the occurrence of peaks and<br />

their isotopes. In this step, the raw profile data is inspected<br />

around the peaks identified by XCMS to<br />

quantify the peak resolution and to detect<br />

missed isotopes.<br />

Centroid vs Profile data: Modern-day MS specialists use<br />

centroid data because the file size is considerably smaller.<br />

The mass spectrometer converts the continuous data in the<br />

m/z dimension to a collection of spikes where each<br />

approximately Gaussian peak is converted to a single<br />

spike (delta function with the same height as the original<br />

peak). All other data is discarded. The result is a huge<br />

reduction in the file size but a loss of the peak shape and,<br />

as a result, no quantification of the resolution is possible.<br />
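The difference between the two representations can be made concrete on a synthetic peak. The sketch below is an illustration under stated assumptions: the peak is a noise-free Gaussian, the centroid is taken at the apex (instruments may use other centroiding rules), and resolution is taken as m/z divided by the full width at half maximum (FWHM). None of the function names come from XCMS.

```python
import math


def centroid(mz, intensity):
    """Collapse one profile peak to a single (m/z, height) spike at its apex."""
    apex = max(range(len(mz)), key=lambda i: intensity[i])
    return mz[apex], intensity[apex]


def _crossing(x0, y0, x1, y1, level):
    """m/z at which the straight line through two points reaches `level`."""
    return x0 + (level - y0) * (x1 - x0) / (y1 - y0)


def fwhm(mz, intensity):
    """Full width at half maximum of a single, fully sampled profile peak."""
    apex = max(range(len(mz)), key=lambda i: intensity[i])
    half = intensity[apex] / 2.0
    # First point below half height on each side of the apex.
    left = next(i for i in range(apex - 1, -1, -1) if intensity[i] < half)
    right = next(j for j in range(apex + 1, len(mz)) if intensity[j] < half)
    lo = _crossing(mz[left], intensity[left],
                   mz[left + 1], intensity[left + 1], half)
    hi = _crossing(mz[right - 1], intensity[right - 1],
                   mz[right], intensity[right], half)
    return hi - lo


def resolution(mz, intensity):
    """Resolving power m / FWHM at the peak apex."""
    return centroid(mz, intensity)[0] / fwhm(mz, intensity)


# Synthetic Gaussian peak at m/z 500 with sigma = 0.01 (assumed values).
grid = [499.95 + 0.001 * i for i in range(101)]
peak = [math.exp(-((x - 500.0) ** 2) / (2 * 0.01 ** 2)) for x in grid]
```

Only the profile arrays allow `fwhm` to be computed; the centroid pair returned by `centroid` carries no width information, which is the loss the paragraph above describes.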

Optimization parameter: The peaks and their isotopes<br />

are characterized by a Gaussian in the chromatographic<br />

dimension and spaced apart by 1.0033 Da in the m/z<br />

dimension. When an isotope is missing or the extracted<br />

peak does not appear in enough samples (for example in<br />

50% of the samples in the sample group), the peak is<br />

categorized as “unreliable”. When a peak is present in all<br />

samples or has a clear isotopic distribution, it is considered<br />

“reliable”. With these measures, a so-called peak-picking<br />

score can be calculated, which in turn can be optimised by<br />

a variety of methods. This results in an increase in reliable<br />

peaks, while not increasing false positives.<br />
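The reliable/unreliable labelling can be sketched as below. Several points are assumptions rather than the authors' definitions: a peak counts as reliable if it is found in all samples or has an isotope partner, and every other peak is unreliable; the isotope spacing defaults to the 13C-12C mass difference for singly charged ions; chromatographic co-elution is ignored; and the score formula is purely hypothetical, since the abstract does not state one.

```python
C13_SPACING = 1.003355  # 13C - 12C mass difference in Da, singly charged ions


def has_isotope_partner(mz, all_mz, spacing=C13_SPACING, tol=0.005):
    """True if another peak lies one isotope spacing above or below `mz`."""
    return any(abs(abs(other - mz) - spacing) <= tol for other in all_mz)


def classify(peak, all_mz, n_samples):
    """Label a peak 'reliable' or 'unreliable' (precedence is an assumption)."""
    if peak["n_found"] == n_samples or has_isotope_partner(peak["mz"], all_mz):
        return "reliable"
    return "unreliable"


def illustrative_score(labels):
    """Hypothetical score: rewards reliable peaks, penalises unreliable ones."""
    reliable = labels.count("reliable")
    return reliable ** 2 / len(labels) if labels else 0.0


# Toy peak list: m/z value and number of samples in which the peak was found.
peaks = [{"mz": 200.0, "n_found": 4},
         {"mz": 201.0034, "n_found": 2},
         {"mz": 350.5, "n_found": 1}]
all_mz = [p["mz"] for p in peaks]
labels = [classify(p, all_mz, n_samples=4) for p in peaks]
score = illustrative_score(labels)
```

A parameter search would rerun peak picking with different settings and keep the settings that maximise such a score.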

Analysis & Quality control: The optimisation of the<br />

XCMS parameters is useful both in the analysis of the data<br />

itself and in quality control for<br />

large-scale LC-MS experiments. By quantifying the<br />

resolutions of all relevant peaks in a dataset corresponding<br />

to a control sample, it is possible to monitor the quality of<br />

spectra, and by combining this with other QC<br />

frameworks, like iMonDB (Bittremieux et al., 2015), it is<br />

possible to assure the quality of all experiments in a<br />

long-lasting study.<br />

RESULTS & DISCUSSION<br />

The aim is to use the profile data to improve the available<br />

optimization algorithms. It remains to be seen<br />

whether the extra information in this data (compared to<br />

centroid data) justifies the increased need for computer<br />

resources. Nonetheless, profile data provides a valuable<br />

contribution to LC-MS optimization, because it enables<br />

researchers to evaluate (quantitatively) and improve the<br />

m/z resolution.<br />

REFERENCES<br />

Smith CA et al. Anal. Chem. 78(3), 779-789, (2006).<br />

Eliasson M. et al. Anal. Chem. 84(15), 6869-6876, (2012).<br />

Libiseller G. et al. BMC Bioinformatics 16:118, (2015).<br />

Bittremieux W. et al. J. Proteome Res. 14(5), 2360-2366, (2015).<br />


P8. IDENTIFICATION OF NUMTS THROUGH NGS DATA<br />

Vincent Branders 1,2* , Chedly Kastally 2 & Patrick Mardulyn 2 .<br />

Machine Learning Group, Institute of Information and Communication Technologies, Electronics and Applied<br />

Mathematics (ICTEAM), Université catholique de Louvain 1 ; Evolutionary Biology and Ecology, Université libre de<br />

Bruxelles 2 . * vincent.branders@uclouvain.be<br />

Numts are copies of mitochondrial DNA sequences that have been transferred into the nuclear genome. Due to their<br />

similarity to mitochondrial DNA sequences, numts have led to many misinterpretations, ranging from overestimation of<br />

diversity to a false association between cystic fibrosis and mitochondrial genome variation. To avoid such biases induced<br />

by numts, these sequences have to be identified. Current methodologies are based on comparisons of existing nuclear<br />

and mitochondrial sequences and searches for similarities. The new Pacific Biosciences (PacBio) technology generates<br />

sequencing reads that span thousands of base pairs, which makes it possible to identify numts by looking for reads<br />

with regions similar to mitochondrial sequences that are surrounded by regions highly different from them. This should allow the<br />

systematic identification of numts without a complete known nuclear reference.<br />

INTRODUCTION<br />

The transfer of DNA from mitochondria to the nucleus<br />

generates nuclear copies of mitochondrial DNA (numts).<br />

Numts have been found in many species including yeasts,<br />

rodents and plants. Due to their similarity to mitochondrial<br />

DNA, numts are responsible for many misinterpretations,<br />

both in mitochondrial disease studies and phylogenetic<br />

reconstructions (Hazkani-Covo et al., 2010). Numt<br />

variation has commonly been misreported as<br />

mitochondrial mutations in patients (Yao et al., 2008).<br />

Moreover, DNA barcoding was found to overestimate the<br />

number of species when numts are coamplified (Song et<br />

al., 2008). Current methods identify such sequences by<br />

aligning mitochondrial sequences against the nuclear<br />

genome and identifying similar regions (Figure 1, left).<br />

The PacBio technology allows the sequencing of DNA<br />

fragments spanning thousands of base pairs. This read length<br />

should allow the identification of numts without the need<br />

for a complete nuclear reference (for example, in the insect species<br />

Gonioctena intermedia). Indeed, it should be<br />

possible to use a mitochondrial assembly to identify<br />

PacBio reads with a central region similar to the<br />

mitochondrial sequence enclosed by nuclear regions that<br />

are dissimilar to it (Figure 1, right).<br />

FIGURE 1. Identification of numts – Existing methods (left) and proposed<br />

method (right). Comparison of mitochondrial sequence to nuclear<br />

sequence (left) or long reads (right).<br />

METHODS<br />

The proposed approach aligns PacBio reads to a<br />

mitochondrial genome (here de novo assemblies of PacBio<br />

reads and Illumina HiSeq 2000 reads are used). In these<br />

long reads, numts are identified as a region similar<br />

to the mitochondrial genome but surrounded by regions<br />

that are not similar. We introduce different criteria to<br />

distinguish reads that are presumably numts from reads of<br />

mitochondrial origin (Figure 2). The DNA sequences come<br />

from an insect (Gonioctena intermedia) without a reference<br />

genome.<br />
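One simple way to express such a criterion is to look at how much of each read lies outside the region that aligns to the mitochondrial genome. The sketch below is a hypothetical rule, not the criteria used in this work: the function name, the category labels and the thresholds `min_flank` and `min_hit` are assumptions, and alignment coordinates are taken to be 0-based and end-exclusive on the read.

```python
def classify_read(read_length, hits, min_flank=500, min_hit=200):
    """Classify a long read from its alignments to the mitochondrial genome.

    hits: list of (start, end) intervals on the read that align to the
    mitochondrial assembly (0-based, end-exclusive; assumed convention).
    """
    if not hits:
        return "unaligned"
    start = min(s for s, _ in hits)
    end = max(e for _, e in hits)
    if end - start < min_hit:
        return "ambiguous"
    left_flank = start
    right_flank = read_length - end
    # Mitochondrial-like core with long non-matching flanks on both sides.
    if left_flank >= min_flank and right_flank >= min_flank:
        return "numt_candidate"
    # The read matches the mitochondrial genome almost end to end.
    if left_flank < min_flank and right_flank < min_flank:
        return "mitochondrial"
    return "ambiguous"


examples = {
    "flanked": classify_read(5000, [(2000, 3000)]),
    "full": classify_read(5000, [(10, 4990)]),
    "one_sided": classify_read(5000, [(0, 3000)]),
    "none": classify_read(5000, []),
}
```

Reads with a non-matching flank on only one side are left as ambiguous here, because a single flank does not show that the mitochondrial-like region is embedded in nuclear sequence.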

FIGURE 2. Mitochondrial reads and numts with nuclear borders.<br />

RESULTS & DISCUSSION<br />

A systematic identification of potential numts is proposed:<br />

through alignments, we identify 10 mitochondrial reads<br />

and 34 reads with potential numts for one particular<br />

mitochondrial region (the widely studied cytochrome<br />

oxidase I gene). In this exploratory study, we highlight<br />

the usefulness of Pacific Biosciences data in the<br />

identification of numts when no nuclear reference is<br />

available. The approach only requires PacBio reads and a<br />

mitochondrial assembly. It is more<br />

efficient than identifying numts through short<br />

reads, which would require the complete reconstruction of<br />

both mitochondrial and nuclear genomes. A systematic<br />

identification of numts in non-model organisms should<br />

help avoid misinterpretations in studies where numts could be<br />

sources of bias. Our current distinction between numts and<br />

mitochondrial reads is quite simple. A more detailed analysis of<br />

this distinction is a possible avenue for improvement.<br />

REFERENCES<br />

Hazkani-Covo E. et al. PLOS Genetics 6, 1-11 (2010).<br />

Song H. et al. PNAS 105, 13486-13491 (2008).<br />

Yao Y. G. et al. Journal of Medical Genetics 45, 769-772 (2008).<br />


P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING<br />

SCHEMES FOR BACTERIA<br />

Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.<br />

Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.<br />

INTRODUCTION<br />

As next-generation sequencing in general, and whole<br />

genome sequencing (WGS) in particular, is increasingly<br />

adopted in public health for routine surveillance tasks,<br />

there is a clear need to incorporate this new technology in<br />

the day-to-day operational workflow of a public health<br />

institute. As cluster detection based on WGS data is<br />

evolving into a commodity, thanks to technologies such as<br />

whole genome multi-locus sequence typing (wgMLST),<br />

the question remains as to how WGS-based data analysis<br />

can be used to build up a human-friendly but high-precision<br />

and epidemiologically consistent naming<br />

strategy for communication purposes.<br />

METHODS<br />

For various organisms, the use of so-called ‘SNP<br />

addresses’ (based on single nucleotide polymorphisms or<br />

SNPs) has been proposed to build up a hierarchical<br />

naming scheme (see [1], [2]). This idea relies on single<br />

linkage clustering of isolates at different levels of<br />

similarity or distance, hence leading to a hierarchical name.<br />

However, the main difficulty here is to define the<br />

appropriate levels of similarity to cluster on, and the<br />

dependence of the naming scheme on the samples at hand.<br />

Moreover, the SNP approach might not provide the best<br />

type of data for this due to its relatively large volatility.<br />
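The idea of a hierarchical name built from single linkage clustering at several thresholds can be sketched as follows. The sketch takes the thresholds as given, although choosing them is precisely the difficulty described above; the isolate names, distances and thresholds (50 and 5) are made-up values, and the function names are assumptions.

```python
def single_linkage_clusters(names, dist, threshold):
    """Number isolates so that pairs within `threshold` share a cluster,
    directly or through a chain of intermediate isolates."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[(a, b)] <= threshold:
                parent[find(b)] = find(a)

    labels, out = {}, {}
    for n in names:
        root = find(n)
        labels.setdefault(root, len(labels) + 1)
        out[n] = labels[root]
    return out


def hierarchical_names(names, dist, thresholds):
    """Join cluster numbers from the coarsest to the finest threshold."""
    levels = [single_linkage_clusters(names, dist, t) for t in thresholds]
    return {n: ".".join(str(level[n]) for level in levels) for n in names}


# Toy pairwise distances (e.g. allele or SNP differences), keyed in name order.
isolates = ["iso1", "iso2", "iso3", "iso4"]
distances = {("iso1", "iso2"): 2, ("iso1", "iso3"): 30, ("iso1", "iso4"): 200,
             ("iso2", "iso3"): 28, ("iso2", "iso4"): 205, ("iso3", "iso4"): 210}
addresses = hierarchical_names(isolates, distances, thresholds=[50, 5])
```

Cluster numbers depend on the order in which isolates are processed, which illustrates the dependence of such a naming scheme on the samples at hand.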

In this work, we present a mathematical framework to<br />

define the levels of similarity upon which single linkage<br />

clustering makes sense. For this, we model the observed<br />

multimodal distribution of pairwise similarities between<br />

samples to obtain a theoretical model of the similarity<br />

distribution, and from there infer the most likely breaking<br />

points for stable similarity cutoffs. This is done in a data-independent<br />

manner, and is therefore applicable to SNP<br />

data, but also to wgMLST data and even gene presence-absence<br />

data. We assess the stability of the naming<br />

scheme by using a cross-validation approach.<br />
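As a much cruder stand-in for the distribution model described above, cutoffs can be placed in the widest gaps between observed pairwise distances, where a multimodal distribution has its valleys. This is not the framework presented in this work, only an illustration of what a breaking point between modes looks like; the function name and the toy distances are assumptions.

```python
def gap_cutoffs(pairwise_distances, n_cutoffs):
    """Midpoints of the n widest gaps between sorted distinct distances."""
    values = sorted(set(pairwise_distances))
    gaps = [(b - a, (a + b) / 2) for a, b in zip(values, values[1:])]
    widest = sorted(gaps, reverse=True)[:n_cutoffs]
    return sorted(mid for _, mid in widest)


# Toy distances with three modes: near 3, near 30 and near 205.
toy_distances = [2, 3, 4, 28, 30, 31, 200, 205, 210]
cutoffs = gap_cutoffs(toy_distances, n_cutoffs=2)
```

Repeating this on random subsets of the isolates and comparing the resulting cutoffs gives a rough feel for the kind of stability that a cross-validation approach assesses.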

RESULTS & DISCUSSION<br />

We apply our methods to propose a wgMLST-based<br />

naming scheme for Listeria monocytogenes. Using a<br />

reference dataset of the diversity within Listeria<br />

monocytogenes, and an extensive data set of over 4000<br />

isolates from real-time surveillance, we show the stability<br />

of the naming scheme, and the epidemiological<br />

concordance.<br />

REFERENCES<br />

[1] Dallman T et al., Applying phylogenomics to understand the<br />

emergence of Shiga Toxin producing Escherichia coli<br />

O157:H7 strains causing severe human disease in the<br />

United Kingdom. Microbial Genomics, 10.1099/mgen.0.000029<br />

[2] Coll F et al., PolyTB: A genomic variation map for Mycobacterium<br />

tuberculosis, Tuberculosis (Edinb). 2014 May; 94(3): 346–354. doi:<br />

10.1016/j.tube.2014.02.005<br />


P10. FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN<br />

BIOLOGICAL INTERPRETATION OF GWAS RESULTS<br />

Elisa Cirillo 1,* , Michiel Adriaens 2 & Chris T Evelo 1,2 .<br />

1 Department of Bioinformatics – BiGCaT, Maastricht University, The Netherlands<br />

2 Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, The Netherlands<br />

* elisa.cirillo@maastrichtuniversity.nl<br />

Pathway and network analysis are established and powerful methods for providing a biological context for a variety of<br />

omics data, including transcriptomics, proteomics and metabolomics. These approaches could in theory also be a boon<br />

for the interpretation of genetic variation data, for instance in the context of Genome Wide Association Studies (GWAS),<br />

as it would allow the study of genetic variants in the context of the biological processes in which the implicated genes<br />

and proteins are involved. However, currently genetic variation data cannot easily be integrated into pathways.<br />

Additionally, it is not clear how to visualise and interpret genetic variation data once connected to pathway content. In<br />

this project we take up that challenge and aim to (i) visualise SNPs from a Type 2 Diabetes Mellitus (T2DM) GWAS<br />

dataset on pathways and (ii) generate and analyze a network of all associated genes and pathways. Together, this could<br />

enable a comprehensive pathway and network interpretation of genetic variations in the context of T2DM.<br />

INTRODUCTION<br />

GWAS has become a common approach for discovery of<br />

gene-disease relationships, in particular for complex<br />

diseases like T2DM (Wellcome Trust Case Control Consortium,<br />

2007). However, biological interpretation remains a<br />

challenge, especially when it concerns connecting genetic<br />

findings with known biological processes. We wish to<br />

improve the interpretation of GWAS results, using a<br />

meaningful network representation that links SNPs to<br />

biological processes.<br />

METHODS<br />

We selected a GWAS data set related to T2DM from a<br />

meta-GWAS resource for diseases created by Johnson and<br />

O'Donnell (2009), and we extracted 1971 SNPs associated with<br />

T2DM.<br />

We identified the location of each SNP using the Variant<br />

Effect Predictor (VEP) (http://www.ensembl.org) and<br />

classified them into 5 categories (Figure 1): exonic, 3' UTR,<br />

5' UTR, intronic and intergenic. SNPs located in the first<br />

three categories are easily connected to genes using<br />

Ensembl BioMart (http://www.ensembl.org/). Pathways<br />

related to these genes are identified from the curated<br />

collection of WikiPathways (Kutmon et al., 2015). SNPs,<br />

genes and pathways are visualized in networks using<br />

Cytoscape (Shannon et al., 2003).<br />
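The resulting SNP-gene-pathway network is essentially a list of typed edges. The sketch below assumes that the SNP-to-gene and gene-to-pathway mappings have already been retrieved and are held in dictionaries; the identifiers are placeholders, the function names are hypothetical, and the tab-separated output is just one simple table layout that network tools can import.

```python
def build_network(snp_to_gene, gene_to_pathways):
    """Return typed edges linking SNPs to genes and genes to pathways."""
    edges = [(snp, gene, "snp-gene")
             for snp, gene in sorted(snp_to_gene.items())]
    for gene in sorted(set(snp_to_gene.values())):
        for pathway in sorted(gene_to_pathways.get(gene, ())):
            edges.append((gene, pathway, "gene-pathway"))
    return edges


def genes_without_pathway(snp_to_gene, gene_to_pathways):
    """Genes hit by a SNP that are absent from every pathway."""
    return sorted(g for g in set(snp_to_gene.values())
                  if not gene_to_pathways.get(g))


def to_table(edges):
    """Tab-separated source, interaction type and target, one edge per line."""
    return ["\t".join((source, kind, target)) for source, target, kind in edges]


# Placeholder identifiers, not real SNPs, genes or pathways.
snp_to_gene = {"rs0001": "GENE_A", "rs0002": "GENE_A", "rs0003": "GENE_B"}
gene_to_pathways = {"GENE_A": ["Pathway X", "Pathway Y"]}

network = build_network(snp_to_gene, gene_to_pathways)
orphans = genes_without_pathway(snp_to_gene, gene_to_pathways)
table = to_table(network)
```

Keeping the edge type as a column makes it possible to style SNP-gene and gene-pathway links differently once the table is loaded into a network viewer.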

RESULTS & DISCUSSION<br />

We analysed four gene-related SNP categories: 3' and 5'<br />

UTR, intronic and exonic. The exonic category was<br />

divided into 8 SNP sub-categories based on sequence<br />

interpretation: up- and downstream, splice region,<br />

synonymous, missense, stop/gain, transcription factor<br />

binding, and non-coding transcript. For each of the 11<br />

resulting categories we created a SNP-disease gene-pathway<br />

network. Disease-related genes are not always<br />

included in pathways, and this is also the case for disease<br />

genes in which GWAS-derived SNPs were found. For the<br />

SNPs that are related to genes in pathways we did a<br />

pathway gene set enrichment analysis and evaluated<br />

whether the resulting pathways were already known to be<br />

related to T2DM.<br />

SNPs in intergenic regions need to be analysed and<br />

visualized differently. A possible approach is to use<br />

expression quantitative trait locus (eQTL) data, which<br />

relate SNPs in intergenic regions to distal modulation of gene<br />

expression. Such datasets are available for many<br />

different human tissues and can provide additional<br />

regulatory information for pathways and the genes they<br />

comprise.<br />

FIGURE 1. Pie chart of the 5 SNP categories. The total number of SNPs<br />

is 2767.<br />

REFERENCES<br />

Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000<br />

cases of seven common diseases and 3,000 shared controls. Nature.<br />

2007;447(7145):661-78.<br />

Johnson A, O'Donnell C. An Open Access Database of Genome-wide<br />

Association Results. BMC Medical Genetics. 2009;10(1):6.<br />

Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen E, Bohler A,<br />

Mélius J, Waagmeester A, Sinha S, Miller R, Coort S, Cirillo E,<br />

Smeets B, Evelo C, Pico A. WikiPathways: Capturing the Full<br />

Diversity of Pathway Knowledge. Accepted September 2015, NAR-<br />

02735- E- Database issue 2016.<br />

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al.<br />

Cytoscape: A Software Environment for Integrated Models of<br />

Biomolecular Interaction Networks. Genome Research.<br />

2003;13(11):2498-504.<br />


P11. IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS<br />

IN SETS OF FUNCTIONALLY RELATED GENES<br />

Pieter De Bleser 1,2,4* , Arne Soetens 1,2,4 & Yvan Saeys 1,3,4 .<br />

VIB Inflammation Research Center 1 ; Department of Biomedical Molecular Biology 2 , Department of Respiratory<br />

Medicine 3 , Ghent University 4 . * pieterdb@irc.vib-ugent.be<br />

Co-associations between transcription factors (TFs) have been studied genome-wide and resulted in the identification of<br />

frequently co-associated pairs of TFs. Co-association of TFs at distinct binding sites is contextual: different combinations<br />

of TFs co-associate at different genomic locations, producing a condition-dependent gene expression profile for a cell.<br />

Here, we present a novel method to identify these condition-dependent co-associations of TFs in sets of functionally<br />

related genes.<br />

INTRODUCTION<br />

The functional expression of genes is achieved by<br />

particular interactions of regulatory transcription factors<br />

(TFs) operating at specific DNA binding sites of their<br />

target genes. Dissecting the specific co-associations of TFs<br />

that bind each target gene represents a difficult challenge.<br />

Co-associations of transcription factor pairs have been<br />

studied genome-wide and resulted in the identification of<br />

frequently co-associated pairs of TFs (ENCODE Project<br />

Consortium, 2012). It was found that TFs co-associate in a<br />

context-specific fashion: different combinations of TFs<br />

bind different target sites and the binding of one TF might<br />

influence the preferred binding partners of other TFs. Here,<br />

we present a tool to identify these condition-dependent<br />

co-associations of TFs in sets of functionally related genes<br />

(e.g. metabolic pathways, tissues, sets of TF target genes,<br />

sets of differentially regulated genes).<br />

METHODS<br />

In a first step, we determine the set of regulatory TFs for<br />

each gene (Tang et al., 2011) in the set using the ChIP-Seq<br />

binding data for 237 TFs from the ReMap database<br />

(Griffon et al., <strong>2015</strong>). This results in a number of<br />

regulatory ChIP-Seq binding regions per TF per gene,<br />

represented as a matrix in which each row corresponds to<br />

a gene and each column corresponds to a TF. In a<br />

next step, this matrix is used as input to the distance<br />

difference matrix (DDM) algorithm, modified to<br />

accommodate this data. The DDM algorithm is a method<br />

that simultaneously integrates statistical<br />

over-representation and co-association of TFs (De Bleser et al.,<br />

2007). The result matrix is subsequently reduced, retaining<br />

only the columns of over-represented and co-associated<br />

TFs. Visualization is done by (1) hierarchical clustering of<br />

the reduced result matrix and reordering of the columns<br />

and (2) conversion of the reduced result matrix into a SIF<br />

(simple interaction file format) file, summarizing the<br />

regulator-regulated relationships between transcription<br />

factors and target genes. This SIF file can be imported into<br />

Cytoscape for visualization of the regulatory network.<br />
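
The first and last steps of this workflow can be sketched as follows. This is a minimal illustration and not the authors' implementation: the peak records, the function names `build_count_matrix` and `to_sif`, and the interaction label `regulates` are assumptions made for the example. Only the gene-by-TF matrix layout and the SIF output format come from the text; the DDM step and the clustering are not shown.<br />

```python
from collections import defaultdict

def build_count_matrix(peaks, genes, tfs):
    """Count ChIP-Seq binding regions per TF per gene.

    peaks: iterable of (tf, gene) pairs, one per binding region that has
    already been assigned to a gene (the assignment rule is not shown).
    Returns a list of rows (one per gene) with columns ordered as `tfs`.
    """
    counts = defaultdict(int)
    for tf, gene in peaks:
        counts[(gene, tf)] += 1
    return [[counts[(g, t)] for t in tfs] for g in genes]

def to_sif(matrix, genes, tfs, interaction="regulates"):
    """Emit one 'TF <tab> interaction <tab> gene' line per non-zero cell."""
    lines = []
    for gene, row in zip(genes, matrix):
        for tf, n in zip(tfs, row):
            if n > 0:
                lines.append(f"{tf}\t{interaction}\t{gene}")
    return lines

# Toy input: hypothetical peak assignments, not real ReMap data.
genes = ["FOXF1", "TBX3"]
tfs = ["EZH2", "SUZ12"]
peaks = [("EZH2", "FOXF1"), ("EZH2", "FOXF1"), ("SUZ12", "TBX3")]

matrix = build_count_matrix(peaks, genes, tfs)
sif_lines = to_sif(matrix, genes, tfs)
```

Writing `sif_lines` to a text file, one entry per line, gives a file in the tab-delimited SIF layout that Cytoscape can import.<br />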

RESULTS & DISCUSSION<br />

FOXF1, TBX3, GATA6, IRX3, PITX2, DLL1 and<br />

NKX2-5 are experimentally verified target genes of the<br />

EZH2 transcription factor (Grote et al., 2013).<br />

Running the transcription factor co-association analysis<br />

method on this data set results in the clustering solution<br />

plot shown in Figure 1.<br />

The strongest associations between TFs are found between<br />

EZH2, POU5F1, SUZ12 and CTBP2. A secondary cluster<br />

of transcription factor associations is composed of<br />

EOMES, SMAD2+3 and NANOG.<br />

The finding of SUZ12 as a cofactor can be accounted for:<br />

EZH2 and SUZ12 are subunits of Polycomb repressive<br />

complex 2 (PRC2), which is responsible for the repressive<br />

histone 3 lysine 27 trimethylation (H3K27me3) chromatin<br />

modification (Yoo and Hennighausen, 2012). CTBP2 is a<br />

known transcriptional repressor (Turner and Crossley,<br />

2001).<br />

The method has been applied previously for the<br />

identification of TFs associated with both high<br />

tissue-specificity and high gene expression levels (Rincon et al.,<br />

<strong>2015</strong>). The method will be made available as a web tool.<br />

FIGURE 1. Transcription factor co-associations in the EZH2 data set.<br />

Note the tendency of EZH2 to co-localize with POU5F1, SUZ12 and<br />

CTBP2.<br />

REFERENCES<br />

De Bleser,P. et al. (2007) A distance difference matrix approach to identifying<br />

transcription factors that regulate differential gene expression. Genome Biol., 8,<br />

R83.<br />

ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements<br />

in the human genome. Nature, 489, 57–74.<br />

Griffon,A. et al. (<strong>2015</strong>) Integrative analysis of public ChIP-seq experiments reveals<br />

a complex multi-cell regulatory landscape. Nucleic Acids Res., 43, e27.<br />

Grote,P. et al. (2013) The tissue-specific lncRNA Fendrr is an essential regulator of<br />

heart and body wall development in the mouse. Dev. Cell, 24, 206–214.<br />

Rincon,M.Y. et al. (<strong>2015</strong>) Genome-wide computational analysis reveals<br />

cardiomyocyte-specific transcriptional Cis-regulatory motifs that enable<br />

efficient cardiac gene therapy. Mol. Ther. J. Am. Soc. Gene Ther., 23, 43–52.<br />

Tang,Q. et al. (2011) A comprehensive view of nuclear receptor cancer cistromes.<br />

Cancer Res., 71, 6940–6947.<br />

Turner,J. and Crossley,M. (2001) The CtBP family: enigmatic and enzymatic<br />

transcriptional co-repressors. BioEssays News Rev. Mol. Cell. Dev. Biol., 23,<br />

683–690.<br />

Yoo,K.H. and Hennighausen,L. (2012) EZH2 methyltransferase and H3K27<br />

methylation in breast cancer. Int. J. Biol. Sci., 8, 59–65.<br />


P12. PHENETIC: MULTI-OMICS DATA INTERPRETATION USING<br />

INTERACTION NETWORKS<br />

Dries De Maeyer 1,2,3* , Bram Weytjens 1,2,3 , Luc De Raedt 4 & Kathleen Marchal 2,3 .<br />

Centre for Microbial and Plant Genetics, KULeuven 1 ; Department for Information Sciences (INTEC, IMinds), UGent 2 ;<br />

Department for Plant Biotechnology and Bioinformatics, UGent 3 ; Department of Computer Science, KULeuven 4 .<br />

* dries.demaeyer@biw.kuleuven.be<br />

The omics revolution has introduced new challenges when studying interesting phenotypes. High throughput omics<br />

technologies such as next-generation sequencing and microarray technologies generate large amounts of data.<br />

Interpreting the resulting data from these experiments is not trivial due to the data’s size and the inherent noise of the<br />

underlying technologies. In addition to this, the “omics” technologies have led to an ever expanding biological<br />

knowledge which has to be taken into account when interpreting new experimental results. Interaction networks in<br />

combination with subnetwork inference methods provide a solution to this problem by mining the current public<br />

interactomics knowledge using experimental omics data to better understand the molecular mechanisms driving the<br />

interesting phenotypes under study.<br />

INTRODUCTION<br />

Computational methods are becoming essential for<br />

analyzing large scale omics datasets in the light of current<br />

knowledge. By representing publicly available<br />

interactomics knowledge as interaction networks,<br />

subnetwork inference methods can extract the actual<br />

molecular mechanisms that drive an interesting phenotype.<br />

The PheNetic framework is such a method that allows for<br />

mining interaction networks with multi-omics datasets.<br />

Using this framework different types of biological<br />

applications have been analyzed in the past, such as<br />

KO-transcriptomics interpretation (De Maeyer, 2013),<br />

expression analysis (De Maeyer, <strong>2015</strong>) and distinguishing<br />

driver from passenger mutations in eQTL experiments<br />

(De Maeyer, submitted).<br />

METHODS<br />

Interaction networks provide a flexible representation of<br />

public biological interactomics knowledge. These<br />

networks represent the physical interactions between<br />

genes and their corresponding gene products in the<br />

interactome of the organism under research (Cloots, 2011).<br />

The interaction network integrates different layers of<br />

homogeneous interactomics data, e.g. signalling, protein-protein,<br />

(post)transcriptional and metabolic interactomics<br />

data, into a single heterogeneous network representation.<br />

The PheNetic framework uses interaction networks to find<br />

biologically valid paths which connect (in)activated genes<br />

selected from multi-omics data sets. These paths provide a<br />

biological explanation of how the genes from these data<br />

sets can trigger each other. Finding the best explanations<br />

or paths in the interaction network corresponds to finding<br />

the subnetwork that best explains the observed results and<br />

provides an insight into the molecular mechanisms that<br />

drive the interesting phenotype. Depending on the type of<br />

biological application and the data provided, different types of<br />

paths can be used to infer the subnetwork, for example for<br />

KO-transcriptomics interpretation (De Maeyer, 2013),<br />

expression analysis (De Maeyer, <strong>2015</strong>) and the interpretation of<br />

eQTL experiments (De Maeyer, submitted).<br />
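
The idea of connecting selected genes through paths in an interaction network can be illustrated with a deliberately simplified sketch. It is not the PheNetic algorithm: it uses plain breadth-first search on an unweighted, directed toy network and keeps the union of the shortest paths, whereas the abstract does not specify how paths are scored or selected. The gene names, the edges and the function names are all hypothetical.<br />

```python
from collections import deque

def shortest_path(network, source, target):
    """Breadth-first search; returns the node list of a shortest path or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in network.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

def connecting_subnetwork(network, causes, effects):
    """Union of the edges on shortest paths from each cause to each effect."""
    edges = set()
    for cause in causes:
        for effect in effects:
            path = shortest_path(network, cause, effect)
            if path:
                edges.update(zip(path, path[1:]))
    return edges

# Hypothetical directed interaction network stored as adjacency lists.
network = {
    "regA": ["sigB", "geneX"],
    "sigB": ["geneY"],
    "geneX": [],
    "geneY": ["geneZ"],
}
subnet = connecting_subnetwork(network, causes=["regA"], effects=["geneX", "geneZ"])
```

Here `causes` could stand for knocked-out or mutated genes and `effects` for differentially expressed genes; the returned edge set is the toy counterpart of an inferred subnetwork.<br />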

RESULTS & DISCUSSION<br />

In a first setup PheNetic was used to study the pathways<br />

and processes involved in acid resistance in Escherichia<br />

coli (De Maeyer, 2013). Using our framework we were<br />

able to determine the different molecular pathways that<br />

drive acid resistance and identify the regulators that<br />

underlie this phenotype. It was shown that subnetwork<br />

inference methods outperform naïve gene rankings in<br />

identifying the biological pathways associated with the<br />

phenotype under research.<br />

In a second setup PheNetic was used to interpret<br />

expression data (De Maeyer, <strong>2015</strong>) by extracting<br />

those parts of the interaction network<br />

that show differences in expression. This method is<br />

provided as a web server that can be accessed at<br />

http://bioinformatics.intec.ugent.be/phenetic<br />

and that allows for an intuitive and visual<br />

interpretation of the inferred subnetworks.<br />

In a third setup PheNetic was used to distinguish driver<br />

mutations from passenger mutations in coupled<br />

genetic-transcriptomics data sets from evolution experiments (De<br />

Maeyer). Evolved strains with the same phenotype are<br />

expected to have consistent changes in the same pathways.<br />

Therefore, finding the subnetwork that best connects the<br />

mutations to the differentially expressed genes over all<br />

strains is expected to separate the driver mutations from the<br />

passenger mutations and, at the same time, to identify the<br />

molecular mechanisms that induce the observed change in<br />

phenotype. This approach provides a systemic insight into<br />

both the biological processes and the genetic background that<br />

induce the phenotype.<br />

Based on the different approaches it can be concluded that<br />

PheNetic is a flexible framework for subnetwork selection<br />

that allows for solving a large variety of biological<br />

applications using multi-omics data sets.<br />

REFERENCES<br />

Cloots, L., & Marchal, K. (2011). Curr Opin Microbiol, 14(5), 599-607.<br />

De Maeyer, D., Renkens, J., Cloots, L., De Raedt, L., & Marchal, K.<br />

(2013). Mol Biosyst, 9(7), 1594-1603.<br />

De Maeyer, D., Weytjens, B., Renkens, J., De Raedt, L., & Marchal, K.<br />

(<strong>2015</strong>). Nucleic Acids Res, 43(W1), W244-250.<br />

De Maeyer, D., Weytjens, B., De Raedt, L., & Marchal, K. Molecular<br />

biology and evolution. Submitted<br />


P13. THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS<br />

SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS<br />

Nicolas De Neuter 1,2* , Benson Ogunjimi 3 , Anke Verlinden 4 , Kris Laukens 1,2 & Pieter Meysman 1,2 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />

Antwerpen (biomina) 2 ; Centre for Health Economics Research and Modeling Infectious Diseases (CHERMID), Vaccine<br />

and Infectious Disease Institute, University of Antwerp 3 ; Antwerp University Hospital 4 .<br />

* nicolas.deneuter@uantwerpen.be<br />

In this study, we aim to characterize those HLA alleles that increase or decrease the risk of cytomegalovirus infections<br />

following tissue or organ transplants. This HLA-dependent susceptibility will then be explained using state-of-the-art<br />

HLA peptide affinity methods to identify the underlying molecular reason. This insight can greatly aid prediction of<br />

those transplantation patients that are most at risk from cytomegalovirus infection.<br />

INTRODUCTION<br />

Patients suffering from disorders of the hematopoietic<br />

system or with chemo-, radio-, or immuno-sensitive<br />

malignancies such as leukemia often receive<br />

hematopoietic stem cell transplantation therapy (HSCT).<br />

The transplantation is preceded by a conditioning regimen<br />

that eradicates the recipient’s malignant cell population<br />

through intensive chemotherapy and irradiation,<br />

simultaneously ablating the recipient’s bone marrow. Self<br />

(autologous) or non-self (allogeneic) hematopoietic stem<br />

cells are then reintroduced into the recipient after which<br />

they are allowed to reestablish hematopoietic functions.<br />

HSCT is associated with high morbidity and mortality and<br />

requires careful monitoring of patients during the weeks<br />

following transplantation. Opportunistic cytomegalovirus<br />

(CMV) infections are one of the major causes of this high<br />

morbidity and mortality and can occur in up to 80% of<br />

HSCT patients, depending on the use of prophylactic<br />

treatment or pre-emptive therapy and the serological CMV<br />

status of donor and recipient. CMV disease can manifest<br />

itself as life-threatening pneumonia, gastrointestinal<br />

disease, retinitis, encephalitis or hepatitis.<br />

The relevance of HLA alleles in varicella zoster virus<br />

associated disease has recently been demonstrated by our<br />

group (Meysman et al., <strong>2015</strong>) and similar insights might<br />

be gained in CMV related disease. Several studies have<br />

already shown a correlation between the incidence of<br />

CMV infection and the presence of certain human<br />

leukocyte antigen (HLA) alleles in the transplant<br />

recipient. However, the alleles identified differ considerably<br />

between studies, likely due to small sample<br />

sizes and type I errors arising from multiple testing.<br />

METHODS<br />

Anonymized patient records on the HLA alleles, CMV<br />

infection and serological status of 1284 transplant<br />

recipients were collected from the Antwerp University<br />

Hospital (UZA). This data set was further extended with<br />

publicly available HLA data from transplant patients and<br />

the counts for the HLA alleles present at each locus were<br />

combined. A hypergeometric distribution was used to test<br />

HLA loci (A, B, C, DRB1, DQB1 and DPB1) for<br />

statistical over- or underrepresentation of their respective<br />

alleles. HLA alleles were tested for over- or<br />

underrepresentation in two test populations: recipients<br />

who were seropositive for CMV before transplantation<br />

and recipients who developed a CMV infection<br />

post-transplantation. In the latter case, we also examined whether<br />

donor seropositivity had an influence on the CMV<br />

infection status. A P value cutoff of 0.05 was used,<br />

adjusted with a Bonferroni correction for multiple testing<br />

based on the number of alleles tested per locus.<br />
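
As a rough illustration of the test described above, the following sketch computes hypergeometric tail probabilities for one allele and compares them with a Bonferroni-adjusted threshold. The counts are invented toy numbers, the function names are hypothetical, and the exact way the study defines population and sample counts is an assumption here.<br />

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k): over-representation tail.

    M: total allele copies in the background population
    n: copies of the allele of interest in the background
    N: allele copies in the test population (e.g. CMV-infected recipients)
    k: observed copies of the allele of interest in the test population
    """
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def hypergeom_cdf(k, M, n, N):
    """P(X <= k): under-representation tail."""
    upper = min(k, n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(0, upper + 1)) / total

# Toy counts for a single allele at one locus (not study data).
M, n, N, k = 20, 5, 10, 4
p_over = hypergeom_sf(k, M, n, N)
p_under = hypergeom_cdf(k, M, n, N)

# Bonferroni: divide the cutoff by the number of alleles tested at this locus.
alleles_tested_at_locus = 4
threshold = 0.05 / alleles_tested_at_locus
significant = p_over < threshold
```

Because both tails include the observed count `k`, `p_over` and `p_under` sum to slightly more than one; each allele would be reported as over- or under-represented only if the corresponding tail falls below `threshold`.<br />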

Putative nonameric peptides were generated in silico from<br />

CMV protein sequences available in online protein<br />

sequence repositories such as the UniProt Knowledgebase.<br />

Three complementary methods were employed to predict<br />

the affinity of each putative nonameric peptide to the<br />

significantly enriched or depleted HLA alleles. The<br />

methods used were: NetCTLpan, the stabilized matrix<br />

method (SMM) and an in-house-developed approach<br />

called CRFMHC. Peptide-binding affinity results of each<br />

predictor were normalized against the affinity of a<br />

restricted panel of human proteins and used to compare<br />

results between predictors. Additionally, each CMV<br />

protein was assessed for depletion of high-affinity<br />

peptides using a hypergeometric distribution.<br />
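
Generating the putative nonameric peptides amounts to sliding a window of nine residues along each protein sequence. A minimal sketch, with a hypothetical function name and an arbitrary toy sequence that is not taken from any CMV protein:<br />

```python
def nonamers(protein, k=9):
    """Return all overlapping peptides of length k from a protein sequence."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]

# Arbitrary 16-residue toy sequence, used only to show the windowing.
peptides = nonamers("ACDEFGHIKLMNPQRS")
```

A sequence of length L yields L - 8 nonamers (none if L is below 9); each of them would then be passed to the affinity predictors named above, which are not reproduced here.<br />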

RESULTS<br />

Preliminary results on a small portion of the UZA data<br />

reveal HLA alleles associated with either CMV seropositivity<br />

or CMV infection that show a trend towards significance but do<br />

not reach the Bonferroni-corrected threshold. We expect<br />

the additional data to increase the power of the analysis.<br />

REFERENCES<br />

Meysman,P. et al. (<strong>2015</strong>) Varicella-Zoster Virus-Derived Major<br />

Histocompatibility Complex Class I-Restricted Peptide Affinity Is<br />

a Determining Factor in the HLA Risk Profile for the<br />

Development of Postherpetic Neuralgia. J. Virol., 89, 962–969.<br />


P14. NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM<br />

WHOLE GENOME NGS DATA<br />

Nicolas Dierckxsens 1,2* , Olivier Hardy 2 , Ludwig Triest 3 , Patrick Mardulyn 2 & Guillaume Smits 1,4 .<br />

Interuniversity Institute of Bioinformatics Brussels (IB2), ULB-VUB, Triomflaan CP 263, 1050 Brussels, Belgium 1 ;<br />

Evolutionary Biology and Ecology Unit, CP 160/12, Faculté des Sciences, Université Libre de Bruxelles, Av. F. D.<br />

Roosevelt 50, B-1050 Brussels, Belgium 2 ; Plant Biology and Nature Management, Vrije Universiteit Brussel, Brussels,<br />

Belgium 3 ; Department of Paediatrics, Hôpital Universitaire des Enfants Reine Fabiola (HUDERF), Université Libre de<br />

Bruxelles (ULB), Brussels, Belgium 4 . * nicolasdierckxsens@hotmail.com<br />

Thanks to the evolution in next-generation sequencer (NGS) technology, whole genome data can be readily obtained<br />

from a variety of samples. There are many algorithms available to assemble these reads, but few of them focus on<br />

assembling the plastid genomes. Therefore we developed a new algorithm that solely assembles the plastid genomes<br />

from whole genome data, starting from a single seed. The algorithm takes full advantage of very high<br />

coverage, which even allows it to assemble through problematic (AT-rich) regions. The algorithm has been<br />

tested on several whole genome Illumina datasets and it outperformed other assemblers in runtime and specificity. Every<br />

assembly resulted in a single contig for the chloroplast or mitochondrial genome, always within<br />

30 minutes.<br />

INTRODUCTION<br />

Chloroplasts and mitochondria are both responsible for<br />

generating metabolic energy within eukaryotic cells. Both<br />

plastids are maternally inherited and have a persistent gene<br />

organization, which makes them ideal for phylogenetic<br />

studies or as a barcode in plant and food identification<br />

(Brozynska et al., 2014). But assembling these plastid<br />

genomes is not always straightforward with the<br />

currently available tools. Therefore we developed a new<br />

algorithm, specifically for the assembly of plastid<br />

genomes from whole genome data.<br />

METHODS<br />

The algorithm is written in Perl. All assemblies were<br />

executed on an Intel Xeon machine with 24 cores<br />

at 2.93 GHz and a total of 96.8 GB of RAM. All non-human<br />

samples were sequenced on the Illumina HiSeq<br />

platform (101 bp paired-end reads). The human<br />

mitochondria samples (PCR-free) were sequenced on the<br />

Illumina HiSeqX platform (150 bp paired-end reads). The<br />

Gonioctena intermedia sample was also sequenced on the<br />

PacBio platform.<br />

RESULTS & DISCUSSION<br />

Algorithm. The algorithm is similar to string overlap<br />

algorithms like SSAKE (Warren et al., 2007) and VCAKE<br />

(Jeck et al., 2007). It starts by reading the sequences into<br />

a hash table, which allows quick access. The<br />

assembly has to be initiated by a seed that is<br />

extended bidirectionally in iterations. The seed input is<br />

quite flexible: it can be one sequence read, a conserved<br />

gene or even a complete mitochondrial genome from a<br />

distant species. Every base extension is determined by a<br />

consensus between the overlapping reads. Unlike most<br />

assemblers, NOVOPlasty does not try to assemble every<br />

read, but extends the given seed until the circular<br />

plastid genome is formed.<br />
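
The principle of seed extension by consensus can be sketched in a few lines. This is a toy illustration and not the NOVOPlasty code (which is written in Perl): it indexes k-mers instead of whole reads, extends in one direction only, ignores sequencing errors and read pairing, and all names, the toy genome and the parameter values are assumptions.<br />

```python
from collections import Counter, defaultdict

def index_reads(reads, k):
    """Hash table: each k-mer maps to counts of the base that follows it."""
    table = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            table[read[i:i + k]][read[i + k]] += 1
    return table

def extend_seed(seed, table, k, max_len=100_000):
    """Extend the seed base by base using the majority vote of the reads.

    Stops when the last k bases equal the first k bases (circle closed),
    when no read overlaps the current end, or when max_len is reached.
    Returns (contig, circularised_flag). The seed must be at least k long.
    """
    contig = seed
    while len(contig) < max_len:
        votes = table.get(contig[-k:])
        if not votes:
            break  # no read overlaps the current end of the contig
        contig += votes.most_common(1)[0][0]
        if contig[-k:] == contig[:k]:
            return contig[:-k], True
    return contig, False

# Toy circular genome and error-free overlapping reads (length 8, step 2).
genome = "ATGCCGTAAGTC"
circ = genome + genome
reads = [circ[i:i + 8] for i in range(0, len(genome), 2)]

table = index_reads(reads, k=4)
contig, circular = extend_seed("CCGT", table, k=4)
```

The resulting contig is a rotation of the toy genome, which is what one would expect from a circular molecule assembled from an arbitrary seed position.<br />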

Assemblies. NOVOPlasty has currently been tested for the<br />

assembly of 8 chloroplasts and 6 mitochondria. Since<br />

chloroplasts contain an inverted repeat, two versions of the<br />

assembly are generated. They differ only in the orientation<br />

of the region between the two repeats; the correct one will<br />

have to be resolved manually. Except for the mitochondrion<br />

of the leaf beetle Gonioctena intermedia, all assemblies<br />

resulted in a complete circular genome. A comparative<br />

study of four assemblers for the mitochondrial genome of<br />

G. intermedia clearly shows the speed and specificity of<br />

NOVOPlasty (Table 1).<br />

Metric | NOVOPlasty | MIRA | MITObim | ARC<br />
Duration (min) | 12 | 536 | 4777* | 586<br />
Memory (GB) | 15 | 57.6 | 63.4 | 1.9<br />
Storage (GB) | 0 | 144 | 418 | 12<br />
Total contigs | 1 | 3434 | 2221 | 2502<br />
Mitochondrial contigs | 1 | 1 | 4 | 48<br />
Coverage (%) | 98 | 94 | 94 | 84<br />
Mismatches | 10 | 25 | 26 | 2<br />
Unidentified nucleotides | 43 | 194 | 197 | 0<br />

TABLE 1. Benchmarking results between four assemblies of the<br />

mitochondrial genome of Gonioctena intermedia. The assemblies were<br />

constructed with MITObim (Hahn et al., 2013), MIRA (Chevreux et al.,<br />

1999), ARC (Hunter et al., <strong>2015</strong>) and NOVOPlasty. * Manually terminated.<br />

Discussion. Despite the many available assemblers, many<br />

researchers still struggle to find a good assembler for<br />

plastid genomes. NOVOPlasty offers an assembler<br />

specifically designed for plastids that will deliver the<br />

complete genome within 30 minutes. The algorithm will<br />

be tested on more datasets and a comparative study with<br />

other assemblers is in progress.<br />

REFERENCES<br />

Brozynska et al. PLoS One 9 (2014).<br />

Chevreux et al. Computer Science and Biology: Proceedings of the<br />

German Conference on Bioinformatics (GCB) (1999).<br />

Hahn et al. Nucleic Acids Research, 1-9 (2013).<br />

Hunter et al. http://dx.doi.org/10.1101/014662 (<strong>2015</strong>).<br />

Jeck et al. Bioinformatics 23, 2942-2944 (2007).<br />

Warren et al. Bioinformatics 23, 500-501 (2007).<br />


P15. ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR<br />

NANOMATERIAL SAFETY EVALUATION<br />

Friederike Ehrhart 1 , Linda Rieswijk 1 , Chris T. Evelo 1 , Haralambos Sarimveis 2 , Philip Doganis 2 , Georgios Drakakis 2 ,<br />

Bengt Fadeel 3 , Barry Hardy 4 , Janna Hastings 5 , Christoph Helma 6 , Nina Jeliazkova 7 , Vedrin Jeliazkov 7 , Pekka Kohonen 8,9 ,<br />

Roland Grafström 9 , Pantelis Sopasakis 10 , Georgia Tsiliki 2 & Egon Willighagen 1 .<br />

Department of Bioinformatics - BiGCaT, Maastricht University 1 ; National Technical University of Athens 2 ; Karolinska<br />

Institutet 3 ; Douglas Connect 4 ; European Molecular Biology Laboratory – European Bioinformatics Institute 5 ; In silico<br />

toxicology 6 ; Ideaconsult Ltd. 7 ; VTT Technical Research Centre of Finland 8 ; Misvik Biology 9 ; IMT Institute for Advanced<br />

Studies 10 . *friederike.ehrhart@maastrichtuniversity.nl<br />

eNanoMapper is an open computational infrastructure for engineered nanomaterial data: it comprises a semantic web<br />

supported database, ontology, and user applications for up- and download of experimental data, and tools for modelling.<br />

INTRODUCTION<br />

Nanomaterials are defined by size: between 1 nm and 100<br />

nm in at least one dimension. The properties of these<br />

materials do not always resemble those of the bulk<br />

material, i.e. micro-sized and larger particles, or solutions.<br />

Nanomaterials can differ in reactivity and in toxicity to<br />

biological organisms and ecosystems depending on their<br />

size and surface properties and the possibility of<br />

“leakage” of the material they are made of. That is why it is<br />

so difficult to assess the safety of nanomaterials and why<br />

the NanoSafety Cluster defined a need for a new<br />

computational infrastructure in 2012. eNanoMapper is a<br />

European project with partners from eight European<br />

countries. This project has been developing a<br />

computational infrastructure consisting of a semantic web<br />

assisted database, a modular ontology, and tools to use<br />

them for nanomaterial safety assessment. Data sharing,<br />

data storage, data analysis tools, and web services are<br />

currently being developed, tested,<br />

and put into production use. The project website can be<br />

found at www.enanomapper.net.<br />

PROBLEM<br />

The eNanoMapper platform is designed to support hosting<br />

of data on nanomaterial properties relevant for nanosafety<br />

assessment as found in existing databases like the<br />

NanoMaterial Registry, DaNa Knowledge Base,<br />

Nanoparticle Information Library NIL, Nanomaterial-<br />

Biological Interactions Knowledgebase, caNanoLab,<br />

InterNano, Nano-EHS Database Analysis Tool, nanoHUB,<br />

etc. Each of them has different data formats and<br />

descriptors, like CODATA-VAMAS’ Universal<br />

Description System, ISO-Tab(-Nano), OECD templates,<br />

custom spreadsheets, and images. Interoperability is a<br />

main aim: semi-automatic import or upload of<br />

information and its integration into the eNanoMapper data<br />

structure are being enabled. Conversely, retrieval or<br />

download of experimental data from the database for<br />

(re-)analysis should be provided too, using programmable<br />

interfaces to the data and the ontology. Database and<br />

search functionality should be semantic web compatible:<br />

the project has developed and maintains a nanosafety ontology<br />

to support this. This eNanoMapper ontology was<br />

developed using the Web Ontology Language and the<br />

challenge is to map nanomaterial terms to their multiple<br />

ontology terms, namely physico-chemical properties,<br />

biological and ecological impact, experimental assay<br />

description, and known safety aspects.<br />
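
The semi-automatic import described above can be pictured as a field-mapping step in which known source columns are renamed to shared terms and everything else is set aside for manual curation. The sketch below is purely illustrative: the source names, column names and term labels are hypothetical and are not taken from eNanoMapper, its ontology or any of the listed databases.<br />

```python
# Hypothetical mapping from source-specific column names to shared term labels.
# None of these names come from eNanoMapper or the databases listed above.
FIELD_MAP = {
    "registry_csv": {"Size (nm)": "particle size", "Zeta": "zeta potential"},
    "lab_sheet": {"diameter_nm": "particle size", "zp_mV": "zeta potential"},
}

def harmonise(record, source):
    """Rename known fields to shared terms; keep unknown ones for manual review."""
    mapping = FIELD_MAP[source]
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if key in mapping:
            mapped[mapping[key]] = value
        else:
            unmapped[key] = value
    return mapped, unmapped

mapped, unmapped = harmonise({"diameter_nm": 25, "coating": "PEG"}, "lab_sheet")
```

In a real import, the shared labels would be replaced by ontology term identifiers, and the `unmapped` part is where the manual half of "semi-automatic" would come in.<br />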

RESULTS & DISCUSSION<br />

The current eNanoMapper demo database instance,<br />

available at https://data.enanomapper.net/, contains the<br />

physico-chemical, biological and environmental properties<br />

of 465 different nanomaterials 1 . The database can load<br />

data in various formats, including<br />

the OECD Harmonized Templates and the data structure<br />

used by the NanoWiki 2 . A web interface is designed to<br />

support all interactions with the database you may want to<br />

perform, including uploading of experimental data, as well<br />

as querying data to support analysis and modelling of<br />

nanoparticle properties. The eNanoMapper ontology is<br />

available under<br />

http://purl.enanomapper.net/onto/enanomapper.owl and is<br />

based on a multi-faceted description of nanoparticles<br />

concerning nanoparticle types, physico-chemical<br />

description, life cycle, biological and environmental<br />

characterisation including experimental methods and<br />

protocols, and safety information 3 . The terms are verified<br />

against the definitions of REACH, ISO, or common<br />

practices used in science in general. The often confused<br />

meanings of endpoints and assays are<br />

distinguished in the definitions, e.g. size and size<br />

measurement assay. Existing ontologies, e.g. NPO, ChEBI and GO,<br />

could partly be used as a basis, but many<br />

terms had to be added manually. Currently, 4592<br />

classes are defined. Users can access and download the<br />

ontology from the U.S. National Center for BioMedical<br />

Ontologies BioPortal platform,<br />

http://bioportal.bioontology.org/ontologies/ENM.<br />

REFERENCES<br />

1 Jeliazkova, N. et al. The eNanoMapper database for<br />

nanomaterial safety information. Beilstein Journal of<br />

Nanotechnology 6, 1609-1634, doi:10.3762/bjnano.6.165<br />

(<strong>2015</strong>).<br />

2 Willighagen, E.; doi:10.6084/m9.figshare.1330208<br />

3 Hastings, J. et al. eNanoMapper: harnessing ontologies to<br />

enable data integration for nanomaterial risk assessment. J<br />

Biomed Semantics 6, 10, doi:10.1186/s13326-015-0005-5<br />

(<strong>2015</strong>).<br />


P16. BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY:<br />

SOMETIMES LESS IS MORE<br />

Sarah ElShal 1,2* , Jesse Davis 3 & Yves Moreau 1,2 .<br />

Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data<br />

Analytics Department, KU Leuven 1 ; iMinds Future Health Department, KU Leuven 2 ; Department of Computer Science,<br />

KU Leuven 3 . * sarah.elshal@esat.kuleuven.be<br />

Biomedical text is increasingly being made available online in either abstract or full article formats. This goes in parallel<br />

with the desire to extract information from such text (e.g. finding links between diseases and genes).<br />

Consequently, text mining is very popular in the biomedical domain given that it provides the possibility to automatically<br />

analyze these texts in order to extract knowledge. One of the big challenges in text mining is recognizing named entities<br />

(e.g. disease and gene entities) inside a given text, which is widely known as Named Entity Recognition (NER). We<br />

studied two biomedical taggers that apply different NER methods on MEDLINE abstracts. Here, we compare the<br />

contribution of each of the two taggers in associating genes with diseases. We show that with fewer recognized entities<br />

we gain more knowledge and we better associate genes with diseases.<br />

INTRODUCTION<br />

MEDLINE currently has more than 25 million biomedical<br />

citations from different journals all over the world. With<br />

this vast amount of text available, it is increasingly<br />

important to mine such data and find the best ways to<br />

extract relevant knowledge out of it. One example of such<br />

knowledge is the link between diseases and genes. However,<br />

it is very challenging and time-consuming to recognize<br />

biomedical entities inside a given text, given the growing<br />

number of dictionaries and tagging strategies. Different<br />

taggers exist that map MEDLINE abstracts to biomedical<br />

entities. Such tagged entities can be used to generate<br />

disease and gene profiles and, by applying certain<br />

similarity measures, we can extract knowledge and<br />

generate disease-gene hypotheses.<br />

METHODS<br />

We compare two MEDLINE taggers that map the whole<br />

set of MEDLINE abstracts to biomedical entities (e.g.<br />

genes, diseases, GO and MeSH terms). The first one is<br />

MetaMap (Aronson et al., 2010), and the second one has<br />

been used as a text mining pipeline in many resources,<br />

most recently in DISEASES (Pletscher-Frankild et al., 2015). For the<br />

sake of simplicity, we refer to the second tagger as<br />

m_tagger throughout the rest of the abstract. For each<br />

MEDLINE abstract we obtained two sets of mapped<br />

entities: (1) the metamap set, and (2) the m_tagger set. Over<br />

all abstracts, the metamap set contains<br />

78,298 distinct entities vs. 29,536 for the m_tagger set.<br />

In order to compare the contribution of each tagger to the<br />

disease-gene association process, we proceeded as follows.<br />

First, we generated a validation set from the OMIM<br />

database to acquire a list of experimentally-validated<br />

disease-gene pairs. Second, we generated an entity profile<br />

for every gene in our database and for every disease in our<br />

validation set. Each profile holds the TF-IDF<br />

score of every entity it contains, calculated<br />

from the set of abstracts linked with the<br />

disease or gene. Then, for every disease, we computed the<br />

cosine similarity between its profile and all the gene<br />

profiles. This gave a similarity score for each<br />

disease and gene pair, which we used to rank the genes for<br />

a given disease. We computed the average recall at the top<br />

10, 25, 50, and 100 ranked genes. We ran this analysis<br />

once according to the metamap set and once according to<br />

the m_tagger set. We also tried another association<br />

measure where we filtered the profiles such that they only<br />

contain gene entities. Then we ranked the genes according<br />

to their TF-IDF scores in a given disease profile. This<br />

corresponds to 9,290 gene entities in the metamap set, and<br />

10,003 entities in the m_tagger set. Again we measured<br />

the average recall at the different rank thresholds, and we<br />

repeated the analysis using the metamap and m_tagger<br />

profiles.<br />
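The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy entity counts and the exact TF-IDF variant (relative term frequency times the log of the inverse profile frequency) are all hypothetical choices, since the abstract does not specify them.

```python
import math
from collections import Counter


def tfidf_profiles(entity_counts):
    """Turn raw entity counts per profile into TF-IDF weights.

    entity_counts maps a profile name (a disease or a gene) to a Counter of
    how often each tagged entity occurs in the abstracts linked to it.
    The TF-IDF variant used here is an assumption.
    """
    n_profiles = len(entity_counts)
    doc_freq = Counter()
    for counts in entity_counts.values():
        doc_freq.update(counts.keys())  # number of profiles containing each entity
    profiles = {}
    for name, counts in entity_counts.items():
        total = sum(counts.values())
        profiles[name] = {
            entity: (count / total) * math.log(n_profiles / doc_freq[entity])
            for entity, count in counts.items()
        }
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(weight * q.get(entity, 0.0) for entity, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_genes(disease_profile, gene_profiles):
    """Order gene names by decreasing similarity to the disease profile."""
    return sorted(
        gene_profiles,
        key=lambda gene: cosine(disease_profile, gene_profiles[gene]),
        reverse=True,
    )


def recall_at_k(ranked_genes, known_genes, k):
    """Fraction of the validated genes found among the top-k ranked genes."""
    return len(set(ranked_genes[:k]) & set(known_genes)) / len(known_genes)
```

Averaging `recall_at_k` over all diseases in the validation set, for k = 10, 25, 50 and 100, would give the kind of summary reported in the results.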

RESULTS & DISCUSSION<br />

Figure 1 presents the recall results on the OMIM<br />

validation set. We observe that MetaMap and m_tagger<br />

result in comparable recall when ranking the genes<br />

according to their cosine similarity with the disease<br />

profiles. We also observe that m_tagger results in the best<br />

recall when simply ranking the genes according to their<br />

TF-IDF scores inside the disease profile.<br />

FIGURE 1. Recall results on the OMIM validation set: comparing the<br />

contribution of MetaMap and m_tagger, once with cosine similarity and<br />

once with TF-IDF ranks.<br />

Even though the m_tagger set contains fewer<br />

entities than the metamap set, it yielded comparable<br />

recall when associating genes with diseases. Moreover,<br />

when we further reduced this set of entities to only genes,<br />

we gained even more knowledge and better associated<br />

genes with diseases.<br />

REFERENCES<br />

Aronson A.R. et al. An overview of MetaMap: historical perspective and recent advances.<br />

J. Am. Med. Inform. Assoc. 17, 229-236 (2010).<br />

Pletscher-Frankild S. et al. DISEASES: text mining and data integration of disease-gene<br />

associations. Methods. 74, 83-89 (2015).<br />


P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS<br />

Bertrand Escaliere 1,2 , Nicolas Simonis 1,3 , Gianluca Bontempi 1,2 & Guillaume Smits 1,4 .<br />

Interuniversity Institute of Bioinformatics in Brussels 1 ; Machine Learning Group, Université Libre de Bruxelles 2 ; Institut<br />

de Pathologie et de Génétique 3 ; Hopital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles 4 .<br />

Optimization of NGS analysis software and pipelines is crucial in order to improve the discovery of (new) disease-causing<br />

variants. A better combination of existing tools and the right choice of parameters can lead to more specific and<br />

sensitive calling. Simulated datasets allow the step-by-step development of new alignment or calling software. Creating a<br />

simulator able to insert known human variants at a realistic minor allele frequency and artificial variants in a tunable, controlled<br />

way would make it possible to overcome three optimization limits: complete knowledge of the input dataset, which allows the<br />

exact calling sensitivity and accuracy to be determined; optimization on the appropriate population; and the capacity to dynamically test a<br />

pipeline one variable at a time.<br />

INTRODUCTION<br />

Identification of anomalies causing genetic disorders is<br />

difficult. It can be limited by the rarity of the disorder<br />

concerned, by its genetic heterogeneity, or by the<br />

phenotypic pleiotropy associated with anomalies in a<br />

single gene. Exome and genome sequencing have allowed the<br />

identification of the causes of many genetic diseases whose<br />

origin had remained inaccessible to the usual<br />

techniques of genetics research (Ng et al., 2009;<br />

Gilissen et al., 2012; Yang et al., 2013; Gilissen et al.,<br />

2014). Exome and genome sequencing data analysis<br />

pipelines consist of several steps (roughly:<br />

alignment, quality filters, variant calling), and several<br />

software tools are available for each step. Evaluation and<br />

comparison of those tools are crucial in order to improve<br />

pipeline accuracy. Exome and genome sequencing<br />

simulations should make it possible to determine the veracity of<br />

called variants (false positives and false negatives).<br />

METHODS<br />

We implemented TuneSIM, a wrapper around the NGS read<br />

simulator dwgsim (http://sourceforge.net/projects/dnaa/) that inserts<br />

realistic mutations. Generated reads contain<br />

real mutations from the 1KG project and dbSNP 138; read<br />

generation itself is delegated to dwgsim. In order to<br />

generate data that are as realistic as possible, we decided to keep<br />

the haplotype block structure. We computed blocks with Plink (Purcell et al., 2007) using<br />

VCF files from 1KG project phase 3 for European individuals.<br />

For each block, we<br />

obtained the frequency of each combination of variants, and<br />

we used these frequencies for block selection. We also<br />

insert variants independently, using their<br />

frequencies in dbSNP (Smigielski et al., 2000). Using 33<br />

in-house samples, we computed the distributions of global allele frequencies<br />

of variants in coding and non-coding regions,<br />

and we select variants according to those frequencies.<br />

A similar operation was performed for CNV insertion<br />

using 1KG data. We are developing a web interface<br />

that allows users to download existing generated datasets.<br />

After running their pipelines, they can upload their output<br />

and see the accuracy of their pipelines.<br />
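The two variant-selection modes described above (frequency-weighted choice of a variant combination per haplotype block, and independent insertion by allele frequency) can be illustrated with a short Python sketch. It is not TuneSIM's code: the data layout, function names and variant identifiers are hypothetical, and the selected variants would still have to be handed to dwgsim for read generation.

```python
import random


def sample_block_haplotype(haplotype_freqs, rng):
    """Pick one variant combination for a block, weighted by its frequency.

    haplotype_freqs maps a tuple of variant IDs (possibly empty, meaning the
    reference haplotype) to the frequency observed for that combination.
    """
    haplotypes = list(haplotype_freqs)
    weights = [haplotype_freqs[h] for h in haplotypes]
    return rng.choices(haplotypes, weights=weights, k=1)[0]


def sample_independent_variants(allele_freqs, rng):
    """Insert each variant independently with probability equal to its allele frequency."""
    return [variant for variant, freq in allele_freqs.items() if rng.random() < freq]


def simulate_haplotype(blocks, independent_variants, seed=0):
    """Combine block-wise and independent sampling into one set of variants."""
    rng = random.Random(seed)  # fixed seed so a simulated dataset can be regenerated
    selected = set()
    for haplotype_freqs in blocks:
        selected.update(sample_block_haplotype(haplotype_freqs, rng))
    selected.update(sample_independent_variants(independent_variants, rng))
    return sorted(selected)
```

Because the full list of inserted variants is known, calls from any pipeline can later be scored exactly as true positives, false positives or false negatives.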

RESULTS & DISCUSSION<br />

Simulations with different coverage and indel rates have<br />

been performed and analysed with different pipelines.<br />

Results will be presented.<br />

REFERENCES<br />

Gilissen, et al. (2012). Disease gene identification strategies for exome<br />

sequencing. Eur J Hum Genet, 20, 490–497.<br />

Gilissen, et al. (2014). Genome sequencing identifies major causes of<br />

severe intellectual disability. Nature, 511, 344–347.<br />

Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a<br />

mendelian disorder. Nature Genetics, 42, 30–35.<br />

Purcell, et al. (2007). PLINK: a tool set for whole-genome association<br />

and population-based linkage analyses. American journal of human<br />

genetics, 81, 559–575.<br />

Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP:<br />

a database of single nucleotide polymorphisms. Nucleic Acids<br />

Research, 28, 352–355.<br />

Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis<br />

of Mendelian Disorders. N Engl J Med, 369, 1502–1511.<br />


P18. RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH<br />

ALTERNATIVE FUNCTIONALITY IN MUSHROOMS<br />

Thies Gehrmann 1 , Jordi F. Pelkmans 2 , Han Wösten 2 , Marcel J.T. Reinders 1 & Thomas Abeel 1* .<br />

Delft Bioinformatics Lab, Delft Technical University 1 ; Fungal Microbiology, Science Faculty, Utrecht University 2 ;<br />

* T.Abeel@tudelft.nl<br />

Alternative splicing is well studied in mammalian genomes: alternative transcripts are often associated with disease,<br />

and their role in regulation is gradually being unveiled. In fungi, the study of alternative splicing has only scratched the<br />

surface. Using RNA-Seq data, we predict alternative transcripts based on existing gene predictions in two mushroom-forming<br />

fungi. We study the alternative functionality of genes through functional domains, developmental stages, tissue<br />

and time. This analysis reveals an extent of alternative functionality induced by alternative splicing that was<br />

previously unknown in fungi, and underlines the need for further research.<br />

INTRODUCTION<br />

Transcript reconstruction algorithms rely on the sparsity<br />

(intergenic regions) of the genome in order to distinguish<br />

between genes. In fungi, due to the density of the genome,<br />

transcripts overlap in the up- and downstream untranslated<br />

regions (UTRs), which prevents the use of existing tools for<br />

transcript prediction (Roberts et al. 2011). Previous<br />

studies (Xie et al. 2015, Zhao et al. 2013) were limited<br />

to the study of splice junctions and did not include more advanced functional<br />

analyses. We transform the genomes of S. commune and A.<br />

bisporus in order to enable the prediction of alternative<br />

transcripts by applying existing transcript reconstruction<br />

algorithms to RNA-Seq data from different tissue types<br />

and developmental stages. We present a functional<br />

analysis of the resulting transcripts.<br />

METHODS<br />

We apply a transformation to our fungal genomes in order<br />

to reduce the impact of overlapping UTRs, which prevent<br />

the prediction of alternative transcripts. We split the<br />

genome into chunks, with each chunk being defined by<br />

existing gene annotations. Thus, the transformation<br />

essentially removes intergenic regions (which contain the<br />

UTRs). Each chunk is then analyzed separately by<br />

Cufflinks (Roberts et al. 2011). Predicted transcripts are<br />

filtered based on read information and ORF sanity. Protein<br />

domain annotations are predicted for each transcript using<br />

InterPro (Zdobnov & Apweiler 2001).<br />

For each gene with multiple alternative transcripts, we<br />

construct a consensus sequence which allows us to call<br />

specific splicing events without the influence of erroneous<br />

reference annotations.<br />
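The genome transformation described above can be pictured as follows. This is a simplified sketch under stated assumptions rather than the authors' implementation: coordinates are taken to be 0-based and end-exclusive, each gene is assumed to yield exactly one chunk, and the function names are hypothetical.

```python
def split_into_chunks(genome, genes):
    """Cut one chunk per annotated gene out of a chromosome sequence.

    genes is a list of (gene_id, start, end) tuples. Everything outside
    these intervals (the intergenic regions, including the UTRs) is dropped.
    Each chunk keeps its genomic offset so results can be mapped back.
    """
    return {gene_id: (start, genome[start:end]) for gene_id, start, end in genes}


def assign_read(read_start, read_end, genes):
    """Translate an aligned read into chunk-local coordinates.

    Returns (gene_id, local_start, local_end) when the read lies entirely
    inside one gene, and None when it touches an intergenic region.
    """
    for gene_id, start, end in genes:
        if start <= read_start and read_end <= end:
            return gene_id, read_start - start, read_end - start
    return None
```

Each chunk, together with the reads assigned to it, could then be passed to a transcript reconstruction tool on its own, so that a neighbouring gene's overlapping UTR can no longer merge two genes into one predicted transcript.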

RESULTS & DISCUSSION<br />

For both fungi, we find that alternative splicing is<br />

prevalent and many genes have multiple alternative<br />

transcripts (see Table 1).<br />

Species | # Orig. genes | # Filt. genes | # Transcripts<br />

S. commune | 16,319 | 14,615 | 20,077<br />

A. bisporus | 10,438 | 9,612 | 14,320<br />

TABLE 1. The number of originally annotated genes in S. commune and<br />

A. bisporus is reduced because prediction based on RNA-Seq data filters some of<br />

them out. The number of new transcripts predicted indicates that<br />

alternative splicing is not a rare event in these fungi.<br />

The frequencies of specific events in the two fungi are<br />

similar and match what is seen in humans (Sammeth et al.<br />

2008). However, there are significant differences in<br />

event usage. While most transcripts in S. commune<br />

have only one event associated with them, most transcripts in<br />

A. bisporus have at least two events. We show that this is a<br />

result of co-operative events.<br />

As our dataset consists of multiple developmental time points<br />

and tissue types, we are able to observe the<br />

alternative use of transcripts through time. If a gene swaps<br />

transcript usage at a certain time point, this is indicative of<br />

a functional involvement of that particular transcript (Lees<br />

et al. 2015). We find multiple transcripts in both S.<br />

commune and A. bisporus which are activated in specific<br />

developmental stages of the mushroom. Furthermore, in A.<br />

bisporus, we are able to identify transcripts which are<br />

activated specifically for certain tissue types through<br />

development.<br />

Using protein domain predictions for each transcript in a<br />

gene, we can measure how gene functionality changes<br />

across its transcripts. Figure 1 shows that functional<br />

annotations are not always preserved across all transcripts,<br />

indicating alternative functionality.<br />

FIGURE 1. Many genes in S. commune demonstrate alternative<br />

functionality through alternative splicing<br />

This is the first genome-wide functional analysis of<br />

alternative splicing in fungi from RNA-Seq data. We find<br />

a wealth of alternative splicing events in two fungi,<br />

resulting in many newly discovered transcripts. Although<br />

their functional influence is not yet demonstrated, we<br />

present evidence to suggest that they are relevant to<br />

mushroom development.<br />

REFERENCES<br />

Lees, J. G., et al. BMC Genomics, 16:1 (2015).<br />

Roberts, A., et al. Bioinformatics, 27:17, 2325–2329 (2011).<br />

Sammeth, M., et al. PLoS Computational Biology, 4:8 (2008).<br />

Xie, B.-B., et al. BMC Genomics, 16:54 (2015).<br />

Zdobnov, E. M., & Apweiler, R. Bioinformatics, 17:9 (2001).<br />

Zhao, C., et al. BMC Genomics, 14:21 (2013).<br />


P19. MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE<br />

QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED<br />

QUANTITATIVE PROTEOMICS<br />

Ludger Goeminne 1,2,3* , Kris Gevaert 2,3 & Lieven Clement 1 .<br />

Department of Applied Mathematics, Computer Science and Statistics, Ghent University 1 ; VIB Medical Biotechnology<br />

Center 2 ; Department of Biochemistry, Ghent University 3 . * ludger.goeminne@UGent.be<br />

MSqRob is an R/Bioconductor package that uses robust ridge regression on peptide-level data for robust relative<br />

quantification of proteins in label-free data-dependent acquisition (DDA) mass spectrometry (MS)-based proteomic<br />

experiments. It has been shown that statistical methods inferring at the peptide-level outperform workflows that<br />

summarize peptide intensities prior to inference. MSqRob improves upon existing peptide-level methods by three<br />

modular extensions: (1) ridge regression, (2) empirical Bayes variance estimation and (3) M-estimation with Huber<br />

weights. The extensions make MSqRob less sensitive towards outliers and missing peptides, enabling more proteins to be<br />

processed. Our software provides streamlined data analysis pipelines for experiments with simple layouts as well as for<br />

more complex multi-factorial designs. Using a spike-in dataset, we illustrate that MSqRob yields more stable protein fold<br />

change estimates and improves the differential abundance (DA) ranking.<br />

INTRODUCTION<br />

In a typical label-free DDA LC-MS/MS-based proteomic<br />

workflow, proteins are digested to peptides, separated by<br />

RP-HPLC and analyzed by a mass spectrometer. However,<br />

several issues inherent to the protocol make data analysis<br />

non-trivial. Most of the common data analysis procedures<br />

use summarization-based workflows. We have previously<br />

shown that inference at the peptide level outperforms these<br />

summarization-based approaches (Goeminne et al., <strong>2015</strong>).<br />

However, even these pipelines are sensitive to outliers and<br />

suffer from overfitting. Here, we present MSqRob, an<br />

R/Bioconductor package that starts from peptide-level data<br />

and provides robust inference on DA at the protein level.<br />

METHODS<br />

Dataset. To demonstrate the performance of our package,<br />

we use the CPTAC dataset, in which 48 known human<br />

proteins were spiked-in at different concentrations in a<br />

yeast proteome background. Ideally, when comparing<br />

different spike-in conditions, only the human proteins<br />

should be flagged as differentially abundant.<br />

Competing analytical methods. We compare against MaxLFQ+Perseus,<br />

which summarizes peptide data and then applies pairwise t-tests.<br />

LM model. Generally, peptide-based models are<br />

constructed as follows:<br />

$$y_{ijklmn} = \mathrm{treat}_{ij} + \mathrm{pep}_{ik} + \mathrm{biorep}_{il} + \mathrm{techrep}_{im} + \varepsilon_{ijklmn}$$<br />

with $y_{ijklmn}$ the $n$th $\log_2$-transformed normalized feature<br />

intensity for the $i$th protein under the $j$th treatment $\mathrm{treat}_{ij}$,<br />

the $k$th peptide sequence $\mathrm{pep}_{ik}$, the $l$th biological repeat<br />

$\mathrm{biorep}_{il}$ and the $m$th technical repeat $\mathrm{techrep}_{im}$, and<br />

$\varepsilon_{ijklmn}$ a normally distributed error term with mean zero<br />

and variance $\sigma_i^2$.<br />

MSqRob. MSqRob adds the following improvements to<br />

the LM model:<br />

1. Ridge regression: shrink parameter estimates<br />

towards 0 by adding a ridge penalty term to the<br />

loss function.<br />

2. Stabilize variance estimation by borrowing<br />

information across proteins with empirical<br />

Bayes (EB): shrink individual variances towards<br />

the pooled variance.<br />

3. M-estimation with Huber weights: down-weight<br />

observations with large errors.<br />
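The three extensions listed above can be illustrated on a deliberately tiny problem: estimating a single effect b from observations y_i = b + e_i. The Python sketch below is not MSqRob's R implementation. The function names, the tuning constant k = 1.345, the MAD-based scale estimate and the weighted-average form of the variance shrinkage are all assumptions chosen for illustration.

```python
import statistics


def huber_weights(residuals, scale, k=1.345):
    """Weight 1 for residuals within k*scale, k*scale/|r| for larger ones."""
    cutoff = k * scale
    return [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in residuals]


def robust_ridge_effect(y, ridge=0.0, n_iter=50):
    """Estimate b in y_i = b + e_i by iteratively reweighted least squares.

    Each iteration minimises sum(w_i * (y_i - b)**2) + ridge * b**2, which
    has the closed-form solution used below. Larger ridge values shrink the
    estimate towards 0; Huber weights limit the influence of outliers.
    """
    b = statistics.median(y)
    for _ in range(n_iter):
        residuals = [yi - b for yi in y]
        mad = statistics.median([abs(r) for r in residuals])
        scale = max(1.4826 * mad, 1e-12)  # robust scale estimate, guarded against 0
        weights = huber_weights(residuals, scale)
        b = sum(w * yi for w, yi in zip(weights, y)) / (sum(weights) + ridge)
    return b


def shrink_variance(s2, df, prior_s2, prior_df):
    """Shrink a protein-wise variance towards a pooled (prior) variance,
    weighting each by its degrees of freedom."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)
```

For example, with `y = [1.0, 1.1, 0.9, 1.0, 8.0]` the plain mean is 2.4, whereas `robust_ridge_effect(y)` stays close to 1 because the outlying value 8.0 receives a very small weight.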

RESULTS & DISCUSSION<br />

MSqRob uses MaxQuant or Mascot peptide-level data as<br />

input. It performs preprocessing and robust model fitting, and<br />

returns log2 fold change estimates and FDR-corrected<br />

p-values for all model parameters and/or (user-specified)<br />

contrasts. Advanced users have the flexibility to (a) adopt<br />

their own preprocessing pipeline (e.g. transformation,<br />

normalization, removal of contaminants) and (b) specify the<br />

appropriate model structure. Compared to competing<br />

methods, MSqRob returns more stable log2 fold change<br />

estimates, improves DA ranking (Figure 1) and is able to<br />

discern between consistently strong DA and an accidental<br />

hit caused by outliers or by a small variance arising by<br />

chance in low-abundance proteins.<br />

FIGURE 1. Receiver operating characteristic (ROC) curves showing the<br />

superior performance of MSqRob compared to a simple linear model<br />

(LM) and a summarization-based approach (MaxLFQ+Perseus) when<br />

comparing the lowest spike-in concentration 6A with the second lowest<br />

spike-in concentration 6B. Stars denote the methods' cutoff at an<br />

estimated 5% FDR.<br />

REFERENCES<br />

Goeminne LJE et al. Journal of Proteome Research 14, 2457-2465<br />

(2015).<br />


P20. A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF<br />

MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN<br />

CANCER<br />

Tine Goovaerts 1 , Sandra Steyaert 1 , Jeroen Galle 1 , Wim Van Criekinge 1 & Tim De Meyer 1* .<br />

BIOBIX lab of Bioinformatics and Computational Genomics, Department of Mathematical Modelling,<br />

Statistics and Bioinformatics, Ghent University 1 . * tim.demeyer@ugent.be<br />

Imprinting is a phenomenon characterized by parent-specific monoallelic gene expression. Its deregulation has been<br />

associated with non-Mendelian inherited genetic diseases but is also a common feature of cancer. As imprinting does not<br />

alter the genome yet is mitotically inherited, epigenetics is deemed to be a key regulator. Current knowledge in the field<br />

is particularly hampered by a lack of accurate computational techniques suitable for omics data. Here we introduce a<br />

mixture model for the identification of monoallelically expressed loci based on large scale omics data that can also be<br />

exploited to identify samples and loci characterized by loss of imprinting / monoallelic expression.<br />

INTRODUCTION<br />

The genome-wide identification of mono-allelically<br />

expressed or epigenetically modified loci typically<br />

requires the presence of SNPs to discriminate both alleles.<br />

Current methods predominantly rely on genotyping for the<br />

identification of heterozygous loci in a limited sample set,<br />

followed by testing whether the expression/epigenetic<br />

modification levels for both alleles deviate from a 1:1 ratio<br />

for those loci (Wang & Clark, 2014). This approach is limited<br />

by the genotyping step and the required presence of<br />

heterozygous individuals. As large scale omics data is<br />

becoming increasingly available, an alternative strategy<br />

may be to screen larger numbers (e.g. hundreds) of<br />

samples, ensuring the presence of heterozygous<br />

individuals at predictable rates, thereby also avoiding the<br />

need for and limitations of a prior genotyping step.<br />

Based on this concept, a previous strategy (Steyaert et al.,<br />

2014) enabled us to identify and validate approximately 80<br />

loci characterized by monoallelic DNA methylation, but had<br />

several drawbacks, such as computational inefficiency,<br />

heavy reliance on Hardy-Weinberg equilibrium (HWE), the<br />

need for 100% imprinting, and low power, which limited<br />

its practical use. Here we present a novel mixture model<br />

for the identification of monoallelically modified or<br />

expressed loci from large-scale omics data (without<br />

known genotypes) that largely circumvents previous<br />

drawbacks.<br />

METHODS<br />

The rationale of the methodology is that RNA-seq and<br />

ChIP-seq(-like) derived SNP data for monoallelic loci are<br />

characterized by a general lack of apparent heterozygosity.<br />

More specifically, under the null hypothesis (no<br />

imprinting) the homozygous and heterozygous sample<br />

fractions can be modelled as a mixture of (beta-)binomial<br />

distributions, with weights that follow HWE or are<br />

empirically derived. For imprinted loci, however, the<br />

heterozygous fraction is split and shifted towards the two<br />

homozygous fractions (Figure 1), which can be evaluated<br />

with a likelihood ratio test. The model does not require but<br />

can incorporate prior genotyping data, and allows for<br />

deviations from HWE, sequencing errors, efficiency<br />

differences and partial monoallelic events. Once loci<br />

characterized by monoallelic events have been identified in<br />

control data, a loss-of-imprinting index can be calculated<br />

for each non-normal sample based on the mixture model<br />

likelihoods, and loci generally characterized by loss of<br />

imprinting in the pathology under study can be identified.<br />
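A simplified version of this idea can be written down as a plain binomial mixture with HWE weights. The Python sketch below is an illustration under stated assumptions, not the authors' model: it uses binomial rather than beta-binomial components, a fixed sequencing error rate, a known allele frequency, and a single `imprinting` parameter between 0 (null hypothesis) and 1 (fully monoallelic) that is optimised over a coarse grid. All names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Probability of k reference reads out of n when the reference fraction is p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def log_likelihood(samples, allele_freq, imprinting=0.0, error=0.01):
    """Log-likelihood of per-sample (ref_reads, total_reads) counts for one SNP.

    With imprinting = 0 both heterozygous components sit at a reference
    fraction of 0.5. As imprinting grows, the heterozygous fraction is split
    and shifted towards the two homozygous components.
    """
    p, q = allele_freq, 1.0 - allele_freq
    shift = 0.5 * imprinting * (1.0 - 2.0 * error)
    components = [
        (p * p, 1.0 - error),   # homozygous reference
        (p * q, 0.5 + shift),   # heterozygous, reference allele expressed
        (p * q, 0.5 - shift),   # heterozygous, alternative allele expressed
        (q * q, error),         # homozygous alternative
    ]
    return sum(
        math.log(sum(weight * binom_pmf(k, n, rate) for weight, rate in components))
        for k, n in samples
    )


def lrt_statistic(samples, allele_freq, grid=21):
    """Likelihood ratio statistic and the best imprinting degree on a grid."""
    ll_null = log_likelihood(samples, allele_freq, imprinting=0.0)
    candidates = [i / (grid - 1) for i in range(grid)]
    best = max(candidates, key=lambda d: log_likelihood(samples, allele_freq, d))
    return 2.0 * (log_likelihood(samples, allele_freq, best) - ll_null), best
```

A locus where no sample shows a roughly balanced read ratio produces a large statistic, while a locus with clear heterozygotes produces a statistic of zero because the null configuration already fits best.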

RESULTS & DISCUSSION<br />

We demonstrate the applicability of the novel mixture<br />

model with simulations and a proof of concept study using<br />

breast cancer and control RNA-seq data from The Cancer<br />

Genome Atlas (TCGA Research Network, 2008). Well-known<br />

imprinted loci such as IGF2 (Figure 1) and H19<br />

were indeed identified. Ongoing efforts are directed<br />

towards artefact-free RNA/ChIP-seq data based allele<br />

frequency inference and the efficient implementation of a<br />

beta-binomial based mixture.<br />

FIGURE 1. Observed (red) and modelled (green) allele frequencies for a<br />

100% (right, no observable heterozygotes) and a partially imprinted<br />

(left) SNP of the IGF2 gene<br />

In conclusion, we introduce a novel mixture model for the<br />

identification of loci characterized by monoallelic events, which<br />

can subsequently be exploited to determine their<br />

deregulation in the pathology of interest.<br />

REFERENCES<br />

Steyaert S et al. Nucleic Acids Research 42, e157 (2014).<br />

TCGA Research Network. Nature 455, 1061-1068 (2008).<br />

Wang X & Clark AG. Heredity 113, 156-166 (2014).<br />


P21. GEVACT: GENOMIC VARIANT CLASSIFIER TOOL<br />

Isel Grau 1,4 , Dorien Daneels 2,3 , Sonia Van Dooren 2,3 , Maryse Bonduelle 2 ,<br />

Dewan Md. Farid 1,3 , Didier Croes 2,3 , Ann Nowé 1,3 & Dipankar Sengupta 1,3* .<br />

Como - Artificial Intelligence Lab, Vrije Universiteit Brussel 1 ; Centre for Medical Genetics, Reproduction and Genetics,<br />

Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel,UZ Brussel 2 ; Interuniversity Institute of<br />

Bioinformatics in Brussels, ULB-VUB 3 ; Department of Computer Sciences, Universidad Central de Las Villas 4 .<br />

* Dipankar.Sengupta@vub.ac.be<br />

High throughput screening (HTS) techniques, such as genome or exome screening, are becoming the norm in conventional<br />

clinical analysis. However, classifying the identified variants as pathogenic, potentially pathogenic or non-pathogenic<br />

is still a manual, tedious and time-consuming process for clinicians and geneticists. Thus, to facilitate the<br />

variant classification process, we have developed GeVaCT, a Java-based tool built on an algorithm that is based on the<br />

existing literature and the knowledge of clinical geneticists. GeVaCT can classify variants annotated by Alamut Batch;<br />

support for input from other annotation software is planned.<br />

INTRODUCTION<br />

With the emergence of new screening techniques, targeted<br />

or whole exome and genome screening are becoming<br />

standard diagnostic practice in clinical settings to identify<br />

the variants underlying a genetic disease (Ng et al., 2010;<br />

Saunders et al., 2012). However, the development of<br />

bioinformatics solutions for pathogenicity classification of<br />

variants remains a big challenge, which<br />

makes the process cumbersome for geneticists and<br />

clinicians. In this work, we describe GeVaCT (Genomic<br />

Variant Classifier Tool), a tool for classification of<br />

genomic single nucleotide and short insertion/deletion<br />

variants. The aim of this study was to design and<br />

implement a variant classification algorithm, based on a<br />

literature review of cardiac arrhythmia syndromes<br />

(Hofman et al., 2013; Schulze-Bahr et al., 2000; Wilde &<br />

Tan, 2007) and existing knowledge of clinical geneticists.<br />

METHODS<br />

The algorithm we propose for GeVaCT is based on a<br />

published variant classification scheme for cardiac<br />

arrhythmia syndromes. This scheme is based on the yield<br />

of DNA testing over a time span of 15 years (1996-2011),<br />

between probands with isolated/familial cases, and also<br />

between probands with or without clear disease-specific<br />

clinical characteristics (Hofman et al., 2013). It proposes<br />

two approaches: one to classify missense variants<br />

and another to classify nonsense and frameshift variants.<br />

The algorithm is implemented in two phases: pre-processing<br />

and classification. In the pre-processing phase,<br />

the annotated tab-delimited variant file (vcf.ann) from<br />

Alamut Batch is restricted to the gene list for the<br />

disease of interest, so as to reduce the number of variants<br />

for the analysis. Filters are applied to look for variants that<br />

have already been reported in the Human Gene<br />

Mutation Database (Stenson et al., 2003) and in ClinVar<br />

(Landrum et al., 2014), or that have previously been<br />

detected and classified in an internal patient population.<br />

Lastly, the variants are filtered based on their location<br />

in the genome and their coding effect, followed by a<br />

check of the minor allele frequency of the variant in a control<br />

population (Sherry et al., 2001). Thereafter, in the<br />

classification phase, the filtered variants are split into<br />

missense variants and nonsense or frameshift variants. For<br />

missense variants, the classification is based on the following<br />

parameters: amino acid substitution and its impact on<br />

protein function (Adzhubei et al., 2010; Kumar et al.,<br />

2009), biochemical variation (Mathe et al., 2006),<br />

conservation (Pollard et al., 2010), frequency of variant<br />

alleles in a control population (ExAC, 2015), effects on<br />

splicing (Desmet et al., 2009), family and phenotype<br />

information and functional analysis. For<br />

nonsense and frameshift variants, it is based on effects on<br />

splicing, frequency of variant alleles in a control<br />

population, family and phenotype information and<br />

functional analysis. For each parameter, the variant receives a score, and<br />

the scores are summed.<br />

Finally, based on the cumulative score, each variant<br />

is classified into one of five categories: Class I - Non-Pathogenic;<br />

Class II - VUS1 (unlikely pathogenic); Class<br />

III - VUS2 (unclear); Class IV - VUS3 (likely<br />

pathogenic); Class V - Pathogenic (Sharon et al., 2008).<br />
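The cumulative scoring step can be sketched as follows. The abstract gives neither the per-parameter scores nor the cut-offs between classes, so the parameter keys and thresholds below are purely hypothetical placeholders. The sketch is in Python for brevity, whereas the tool itself is Java-based.

```python
# Hypothetical parameter keys, one per criterion named in the text.
MISSENSE_PARAMS = (
    "protein_impact", "biochemical_variation", "conservation",
    "control_frequency", "splicing", "family_phenotype", "functional_analysis",
)
TRUNCATING_PARAMS = (
    "splicing", "control_frequency", "family_phenotype", "functional_analysis",
)

# Hypothetical cut-offs: a cumulative score below the bound gets that class.
CLASS_THRESHOLDS = [
    (2, "Class I - Non-Pathogenic"),
    (4, "Class II - VUS1 (unlikely pathogenic)"),
    (6, "Class III - VUS2 (unclear)"),
    (8, "Class IV - VUS3 (likely pathogenic)"),
]
TOP_CLASS = "Class V - Pathogenic"


def classify(variant_type, scores):
    """Sum the scores relevant to the variant type and map the total to a class.

    scores maps parameter keys to numeric scores; parameters that do not
    apply to the variant type are ignored, and missing ones count as 0.
    """
    if variant_type == "missense":
        params = MISSENSE_PARAMS
    elif variant_type in ("nonsense", "frameshift"):
        params = TRUNCATING_PARAMS
    else:
        raise ValueError(f"unsupported variant type: {variant_type}")
    total = sum(scores.get(name, 0) for name in params)
    for upper_bound, label in CLASS_THRESHOLDS:
        if total < upper_bound:
            return total, label
    return total, TOP_CLASS
```

Keeping the thresholds in a single table makes it straightforward for a clinical geneticist to review or adjust the cut-offs without touching the scoring logic.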

RESULTS & DISCUSSION<br />

In this study, we report a Java-based tool called GEVACT,<br />

developed for the classification of genomic variants. The input for<br />

the tool is an annotated VCF file, and the output reports<br />

the cumulative classification score along with the class<br />

label for each variant. The tool was tested on a dataset of 130<br />

cardiac arrhythmia syndrome patients, available at UZ<br />

Brussel. The variant classifications made by<br />

the tool were cross-validated by manual curation,<br />

performed by a clinical geneticist. The<br />

study indicates that the tool is promising but needs to be<br />

further validated on datasets from other diseases. In<br />

addition, we are working to make the tool adaptable to<br />

file inputs from other annotation software.<br />

REFERENCES<br />

Adzhubei IA et al. Nat Methods 7(4), 248-249 (2010).<br />

Desmet et al. Nucleic Acids Res 37 (9): e67 (2009).<br />

Exome Aggregation Consortium (ExAC), Cambridge, MA (2015).<br />

Hofman N et al. Circulation 128(14),1513-21 (2013).<br />

Kumar P et al. Nat Protoc 4(7), 1073–1081 (2009).<br />

Landrum MJ et al. Nucleic Acids Res 42(1), D980-5 (2014).<br />

Mathe E et al. Nucleic Acids Res 34(5),1317-25 (2006).<br />

Ng SB et al. Nat Genetics 42, 30–35 (2010).<br />

Pollard K et al. Genome Res 20, 110-121 (2010).<br />

Saunders CJ et al. Sci Transl Med 4, 154ra135 (2012).<br />

Sharon EP et al. Hum Mutat. 29(11), 1282–1291 (2008).<br />

Sherry ST et al. Nucleic Acids Res 29(1),308-11 (2001).<br />

Schulze-Bahr E et al. Z Kardiol 89 Suppl 4:IV12-22 (2000).<br />

Stenson et al. Hum Mutat. 21:577-581 (2003).<br />

Wilde AA & Tan HL Circ J 71, Suppl A:A12-9 (2007).<br />

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />

Abstract ID: P<br />

Poster<br />

P22. MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH<br />

THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT<br />

EXPERIMENTS<br />

Surya Gupta 1,2,3 , Jan Tavernier 1,2 & Lennart Martens 1,2,3 .<br />

Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ;<br />

Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 .<br />

INTRODUCTION<br />

Proteins are highly interesting objects of study, involved<br />

in different cellular and molecular functions. Identification<br />

and quantification of these proteins along with their<br />

interacting proteins, nucleic acids and molecules can<br />

provide insight into development and disease mechanisms<br />

at the systems level. Yet studying these interactions is not<br />

trivial. In vivo methods exist to determine these<br />

interactions, but these suffer from several drawbacks [4].<br />

To overcome existing problems, an innovative approach<br />

called MAPPIT (Mammalian Protein-Protein Interaction<br />

Trap) [2] has been established in the Cytokine Receptor<br />

Lab to determine interacting partners of proteins in<br />

mammalian cells. To allow screening of thousands of<br />

interactors simultaneously, MAPPIT has been parallelized<br />

in the array MAPPIT system [3].<br />

AIM<br />

However, no effective pipeline existed to process the high-throughput<br />

data generated by array MAPPIT. We<br />

therefore established an automated high-throughput data<br />

analysis system called MAPPI-DAT (MAPPIT Array<br />

Protein-Protein Interaction - Database & Analysis Tool).<br />

METHODS<br />

In the array-MAPPIT platform, the interaction of two<br />

proteins (bait and prey) restores a mutated JAK-STAT<br />

signaling pathway, which leads to the expression of<br />

fluorescence-emitting genes. To rank the positive<br />

interactions based on fluorescence intensity, RankProd [1]<br />

was used. This method was originally developed to<br />

determine differentially expressed genes in microarray<br />

experiments and is available as an R package. To minimize<br />

false-positive hits in the RankProd output, quartile-based<br />

filtering was applied. A MySQL database was used to build<br />

the data management system for the array-MAPPIT<br />

system.<br />
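
The idea behind the rank product statistic [1] can be sketched in a few lines. This is a minimal illustration in Python, not the RankProd R package used in the pipeline: the prey names and intensities are invented, ties are broken by input order, and the permutation-based significance estimates provided by the package are omitted.<br />

```python
import math


def rank_products(intensities):
    """Geometric mean of per-replicate ranks for each prey.

    intensities: dict mapping a prey name to a list of fluorescence
    intensities, one per replicate (all lists the same length).
    Rank 1 is the highest intensity within a replicate, so a small
    rank product marks a consistently strong signal.
    """
    names = list(intensities)
    n_replicates = len(intensities[names[0]])
    log_rank_sum = {name: 0.0 for name in names}
    for rep in range(n_replicates):
        ordered = sorted(names, key=lambda n: intensities[n][rep], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            log_rank_sum[name] += math.log(rank)
    return {name: math.exp(log_rank_sum[name] / n_replicates) for name in names}


# Hypothetical intensities for four preys measured in two replicates.
demo = {
    "preyA": [100.0, 120.0],
    "preyB": [50.0, 40.0],
    "preyC": [10.0, 15.0],
    "preyD": [5.0, 200.0],
}
scores = rank_products(demo)
best_first = sorted(scores, key=scores.get)
```

Because the statistic works on ranks rather than raw intensities, a prey that is strong in only one replicate (preyD here) is ranked below one that is consistently strong.<br />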

RESULTS<br />

To extend and ease the use of the analysis pipeline and<br />

database system, an interface called MAPPI-DAT has been<br />

developed. MAPPI-DAT is capable of processing<br />

many thousands of data points for each experiment and<br />

comprises a data storage system that stores the<br />

experimental data in a structured way for meta-analysis.<br />

REFERENCES<br />

[1] Breitling, R., Armengaud, P., Amtmann, A., & Herzyk, P. (2004).<br />

Rank products: A simple, yet powerful, new method to detect<br />

differentially regulated genes in replicated microarray experiments.<br />

FEBS Letters, 573(1-3), 83–92.<br />

[2] Lievens, S., Peelman, F., De Bosscher, K., Lemmens, I., &<br />

Tavernier, J. (2011). MAPPIT: A protein interaction toolbox built on<br />

insights in cytokine receptor signaling. Cytokine and Growth Factor<br />

Reviews, 22(5-6), 321–329.<br />

[3] Lievens, S., Vanderroost, N., Van der Heyden, J., Gesellchen, V.,<br />

Vidal, M., & Tavernier, J. (2009). Array MAPPIT:<br />

High-Throughput Interactome Analysis in Mammalian Cells,<br />

877–886.<br />

[4] S.Gopichandran and S.Ranganathan. (2013). Protein-protein<br />

Interactions and Prediction: A Comprehensive Overview. Protein and<br />

Peptide Letters, 779–789<br />

P23. HIGHLANDER: VARIANT FILTERING MADE EASIER<br />

Raphael Helaers 1* & Miikka Vikkula 1 .<br />

Human Molecular Genetics (GEHU), de Duve Institute, Université catholique de Louvain 1 .<br />

* Raphael.helaers@UCLouvain.be<br />

The field of human genetics is being revolutionized by exome and genome sequencing. A massive amount of data is<br />

being produced at ever-increasing rates. Targeted exome sequencing can be completed in a few days using NGS,<br />

allowing for new variant discovery in a matter of weeks. The technology generates considerable numbers of false<br />

positives, and the differentiation of sequencing errors from true mutations is not a straightforward task. Moreover, the<br />

identification of changes-of-interest from amongst tens of thousands of variants requires annotation drawn from various<br />

sources, as well as advanced filtering capabilities. We have developed Highlander, a Java application coupled to a MySQL<br />

database, in order to centralize all variant data and annotations from the lab, and to provide powerful filtering tools that<br />

are easily accessible to the biologist. Data can be generated by any NGS machine (such as Illumina’s HiSeq, or Life<br />

Technologies’ SOLiD or Ion Torrent) and most variant callers (such as Broad Institute’s GATK or Life Technologies’<br />

LifeScope). Variant calls are annotated using dbNSFP (providing predictions from 6 different programs, and MAF from<br />

1000G and ESP), GoNL and SnpEff, and are subsequently imported into the database. The database is used to compute global<br />

statistics, allowing for the discrimination of variants based on their representation in the database. The Highlander GUI<br />

easily allows for complex queries to this database, using shortcuts for certain standard criteria, such as “sample-specific<br />

variants”, “variants common to specific samples” or “combined-heterozygous genes”. Users can browse through query<br />

results using sorting, masking and highlighting of information. Highlander also gives access to useful additional tools,<br />

including direct access to IGV, and an algorithm that checks all available alignments for allele-calls at specific positions.<br />
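
A shortcut such as “sample-specific variants” boils down to a query over the variant table. The sketch below illustrates that idea only: the `variants` table, its columns and the sample names are hypothetical, and an in-memory SQLite database stands in for the MySQL database so that the example is self-contained. It does not reproduce Highlander's actual schema or queries.<br />

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical variant table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (sample TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    rows = [
        ("S1", "chr1", 100, "A", "G"),
        ("S2", "chr1", 100, "A", "G"),  # shared by S1 and S2
        ("S1", "chr2", 200, "C", "T"),  # only in S1
        ("S2", "chr3", 300, "G", "A"),  # only in S2
    ]
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def sample_specific_variants(conn, sample):
    """Return variants seen in the given sample and in no other sample."""
    query = """
        SELECT chrom, pos, ref, alt
        FROM variants
        GROUP BY chrom, pos, ref, alt
        HAVING COUNT(DISTINCT sample) = 1 AND MIN(sample) = ?
        ORDER BY chrom, pos
    """
    return conn.execute(query, (sample,)).fetchall()


conn = build_demo_db()
only_in_s1 = sample_specific_variants(conn, "S1")
```

Grouping by variant and counting distinct samples is the same kind of database-wide statistic that the text describes for discriminating variants by their representation in the database.<br />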

P24. DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR<br />

GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION<br />

DATA WITH MULTIPLE DOSES AND TIME POINTS<br />

Diana M Hendrickx 1* , Danyel G J Jennen 1 & Jos C S Kleinjans 1 .<br />

Department of Toxicogenomics, Maastricht University, The Netherlands 1 .<br />

*d.hendrickx@maastrichtuniversity.nl<br />

Toxicogenomics, the application of ‘omics’ technologies to toxicology, is a rapidly growing field due to the need for<br />

alternatives to animal experiments for toxicity testing of compounds. Identification of gene regulatory networks affected<br />

by compounds is important to gain more insight into the mode of action of a toxic compound. The response to a toxic<br />

compound is both time and dose dependent. Therefore, toxicogenomics data are often measured across several time<br />

points and doses. However, to our knowledge, no existing method for gene regulatory network inference<br />

takes into account both time and dose dependencies. Here we present Dose-Time Network Identification (DTNI), a novel<br />

gene regulatory network inference algorithm that takes into account both dose and time dependencies in the data. We<br />

show that DTNI can be used to infer gene regulatory networks affected by a group of compounds with the same mode of<br />

action. This is illustrated with gene expression (microarray) data from COX inhibitors, measured in human hepatocytes.<br />

INTRODUCTION<br />

Identifying and understanding gene regulatory networks<br />

(GRN) influenced by chemical compounds is one of the<br />

main challenges of systems toxicology. A GRN affected<br />

by one or more compounds evolves over time and with<br />

dose. The analysis of gene expression data measured at<br />

multiple time points and for multiple doses can provide<br />

more insight into the effects of compounds. Therefore, there<br />

is a need for mathematical approaches for GRN<br />

identification from this type of data.<br />

METHODS<br />

One of the mathematical approaches currently used for<br />

GRN inference is based on ordinary differential equations<br />

(ODE), where changes in gene expression over time are<br />

related to each other and to the external perturbation (i.e.<br />

the dose of the compound). Because gene expression data<br />

usually have fewer data points than variables (genes), ODE<br />

approaches are often combined with interpolation and/or<br />

dimension reduction techniques (e.g. PCA). A current method<br />

that combines ODE with both interpolation and dimension<br />

reduction techniques is Time Series Network<br />

Identification (TSNI) (Bansal et al., 2006).<br />

Here, we present Dose-Time Network Identification<br />

(DTNI), a method that extends TSNI by including ODE<br />

that describe changes in gene expression over dose in<br />

relation to each other and to time. We also adapted the<br />

original method so that it can include data from multiple<br />

perturbations (compounds).<br />
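
As an illustration of the modelling idea, a linear ODE model of the kind used by TSNI can be extended with a second equation over dose. The notation below is assumed for this sketch and is not taken from the abstract: x(t,d) is the vector of gene expression levels at time t and dose d, u(d) is the external perturbation, A and C are gene-by-gene interaction matrices, and B and E are coefficient vectors. The exact form of the dose equation used by DTNI may differ.<br />

```latex
\begin{align}
\frac{\partial x(t,d)}{\partial t} &= A\,x(t,d) + B\,u(d), \\
\frac{\partial x(t,d)}{\partial d} &= C\,x(t,d) + E\,t .
\end{align}
```

In this reading, the first equation relates expression changes over time to the other genes and to the dose, and the second relates expression changes over dose to the other genes and to time.<br />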

RESULTS & DISCUSSION<br />

By exploiting simulated data, we show that including<br />

ODE for expression changes over dose leads to improved<br />

GRN identification compared with including only ODE<br />

that describe changes over time. Furthermore, we show<br />

that DTNI performs better when including data from<br />

multiple perturbations (compounds) than when applying<br />

DTNI to data from a single perturbation. This suggests<br />

that the method is suitable to infer a GRN affected by<br />

compounds with the same mode of action. As an example,<br />

we infer the network affected by COX inhibitors from<br />

public microarray data of 6 COX inhibitors, measured in<br />

human hepatocytes, available from Open TG-GATEs<br />

(http://toxico.nibio.go.jp/english/index.html) (Noriyuki et<br />

al., 2012). The interactions in the inferred network were<br />

compared to interactions from ConsensusPathDB, a<br />

database including interactions from 32 different sources<br />

(Kamburov et al., 2013). The inferred network was<br />

validated by leave-one-out cross-validation (LOOCV). Six<br />

datasets were created from the original data by leaving out<br />

the data of one compound. The network constructed from<br />

the whole data set showed large overlap with the networks<br />

constructed from each of the LOOCV datasets. Edges present in<br />

the network constructed from the whole data set but absent from<br />

the networks constructed from the LOOCV datasets were<br />

removed from the network. The remaining novel<br />

interactions, i.e. those that are not in ConsensusPathDB,<br />

have to be validated experimentally, e.g. by gene-knockdown<br />

experiments.<br />
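
The edge-filtering step of the validation can be sketched as set operations. The sketch makes two assumptions that the abstract does not settle: an edge is kept only if it recurs in every LOOCV network, and edges are represented as (regulator, target) pairs. The gene names are placeholders.<br />

```python
def stable_edges(full_network, loocv_networks):
    """Keep edges of the full-data network that recur in every LOOCV network."""
    kept = set(full_network)
    for network in loocv_networks:
        kept &= set(network)
    return kept


def novel_edges(edges, known_interactions):
    """Edges not found in a reference interaction set (e.g. a database export)."""
    return set(edges) - set(known_interactions)


# Placeholder example: three edges in the full network, two LOOCV networks.
full = {("g1", "g2"), ("g2", "g3"), ("g3", "g4")}
loocv = [
    {("g1", "g2"), ("g2", "g3")},
    {("g1", "g2"), ("g2", "g3"), ("g3", "g4")},
]
known = {("g1", "g2")}

robust = stable_edges(full, loocv)         # drops ("g3", "g4")
to_validate = novel_edges(robust, known)   # leaves ("g2", "g3")
```

A less strict rule, such as keeping edges found in most LOOCV networks, would only change the body of `stable_edges`.<br />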

FIGURE 1. Workflow for identifying a gene regulatory network affected<br />

by a group of compounds with the same mode of action.<br />

REFERENCES<br />

Bansal M et al. Bioinformatics 22, 815-822 (2006).<br />

Noriyuki N et al. J Toxicol Sci 37,791-801 (2012).<br />

Kamburov A et al. Nucl Acids Res 41, D793-D800 (2013).<br />

P25. IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS<br />

USING A “DUMMY” LIGAND APPROACH<br />

Susanne M.A. Hermans, Christopher Pfleger & Holger Gohlke * .<br />

Department of Mathematics and Natural Sciences, Institute for Pharmaceutical and Medicinal Chemistry, Heinrich-<br />

Heine-University, Düsseldorf, Germany. * gohlke@uni-duesseldorf.de<br />

Targeting allosteric sites is a promising strategy in drug discovery due to their regulatory role in almost all cellular<br />

processes. Currently, there is no standard method to identify novel pockets and to detect whether a pocket has a<br />

regulatory effect on the protein. Here, we present a new and efficient approach to probe information transfer through<br />

proteins in the context of dynamically dominated allostery that exploits “dummy” ligands as surrogates for allosteric<br />

modulators.<br />

INTRODUCTION<br />

Allosteric regulation is the coupling between separated<br />

sites in biomacromolecules such that an action at one site<br />

changes the function at a distant site. Allosteric drugs are<br />

popular because they often have fewer side effects than orthosteric<br />

drugs, as allosteric sites are less conserved. The<br />

identification of novel allosteric pockets is complicated by<br />

the large variation in allosteric regulation, ranging from<br />

rigid body motions to disorder/order transitions, with<br />

dynamically dominated allostery in between (Motlagh et<br />

al., 2014). Here we focus on dynamically dominated<br />

allostery with minimal or no conformational changes.<br />

Novel pockets do not have a known ligand; therefore, we<br />

generate “dummy” ligands to function as surrogates for<br />

allosteric ligands. We have developed an efficient<br />

approach to probe information transfer through proteins<br />

using “dummy” ligands and to detect whether allosteric coupling is<br />

present between the novel pocket and the orthosteric site.<br />

METHODS<br />

In a preliminary study to test the general feasibility, the<br />

approach was applied to conformations extracted from a<br />

MD trajectory of the holo and apo structures of LFA1.<br />

The grid-based PocketAnalyzer program (Craig et al.,<br />

2011) was used to detect putative binding sites. “Dummy”<br />

ligands were generated for each detected pocket along the<br />

ensemble. Finally, the Constraint Network Analysis<br />

(CNA) software, which links biomacromolecular structure,<br />

(thermo-)stability, and function, was used to probe the<br />

allosteric response by monitoring altered stability<br />

characteristics of the protein due to the presence of the<br />

“dummy” ligand (Pfleger et al., 2013; Krüger et al., 2013;<br />

Pfleger, 2014). The results were compared to those of the<br />

holo structure with the bound allosteric ligand to validate<br />

the “dummy” ligand approach.<br />

RESULTS & DISCUSSION<br />

Remarkably, the use of “dummy” ligands almost<br />

perfectly reproduced the results obtained with the known<br />

allosteric effector, although the intrinsic<br />

rigidity of the “dummy” ligands over-stabilizes the LFA1<br />

structure. Even for<br />

the LFA1 apo structures, where the allosteric pocket is<br />

partially closed, the results are in agreement with those for known<br />

allosteric effectors. Overall, the results obtained from the<br />

validation are encouraging and suggest that our “dummy” ligand<br />

approach for the characterization of unexplored allosteric<br />

pockets is a promising step towards identifying novel drug<br />

targets.<br />

REFERENCES<br />

Craig, I.R. et al. J. Chem. Inf. Model. 51 2666–2679 (2011).<br />

Krüger, D. M. et al. Nucleic Acids Res. 41 340–348 (2013).<br />

Motlagh, H.N. et al. Nature 508 7496 331–339 (2014).<br />

Pfleger, C. et al. J. Chem. Inf. Model. 53 1007–1015 (2013).<br />

Pfleger, C. Doctoral Thesis, Heinrich Heine University, Düsseldorf,<br />

Germany (2014).<br />

P26. PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL<br />

GENETICALLY MODIFIED CONGENIC MICE<br />

Paco Hulpiau 1,2,3 *, Liesbet Martens 1,2,3 *, Yvan Saeys 1,2,3 , Peter Vandenabeele 1,2,4 & Tom Vanden Berghe 1,2 .<br />

Inflammation Research Center, VIB, Ghent, Belgium 1 ; Department of Biomedical Molecular Biology, Ghent University,<br />

Ghent, Belgium 2 ; Data Mining and Modelling for Biomedicine (DaMBi), Ghent, Belgium 3 ; Methusalem Program, Ghent<br />

University, Belgium 4 . *paco.hulpiau@irc.vib-ugent.be, liesbet.martens@irc.vib-ugent.be<br />

Targeted mutagenesis in mice is a powerful tool for functional analysis of genes. However, genetic variation between<br />

embryonic stem cells (ESCs) used for targeting (previously almost exclusively 129-derived) and recipient strains (often<br />

C57BL/6J) typically results in congenic mice in which the targeted gene is flanked by ESC-derived passenger DNA<br />

potentially containing mutations. Comparative genomic analysis of 129 and C57BL/6J mouse strains revealed indels and<br />

single nucleotide polymorphisms resulting in alternative or aberrant amino acid sequences in 1,084 genes in the 129-<br />

strain genome.<br />

INTRODUCTION<br />

Annotating the reported genetically modified congenic mice<br />

that were generated using 129-strain ESCs with these<br />

passenger mutations revealed that nearly all of these mice<br />

possess multiple passenger mutations potentially<br />

influencing the phenotypic outcome. We illustrated this<br />

phenotypic interference of 129-derived passenger<br />

mutations with several case studies and developed the<br />

Me-PaMuFind-It web tool to estimate the number and possible<br />

effect of passenger mutations in transgenic mice of interest.<br />

METHODS<br />

We analyzed the SNP data release v3 from the Mouse<br />

Genome Project available at Sanger Institute (Keane et al.,<br />

2011). The data in the indel vcf file and SNP vcf file were<br />

filtered to retrieve indels and SNPs present in at least one<br />

of the three 129 strains (129P2/OlaH, 129S1/SvIm and<br />

129S5SvEvB) and affecting the protein coding sequence<br />

of the genes. These so-called protein coding variants are<br />

based on the following sequence ontology (SO) terms:<br />

stop gained, stop lost, inframe insertion, inframe deletion,<br />

frameshift variant, splice donor variant, splice acceptor<br />

variant, and coding sequence variant. In total, 949 indels<br />

and 446 SNPs affecting 1,084 mouse genes were retained.<br />

We gathered chromosome and gene start and end positions<br />

for 1,084 genes covering 1,395 variations. The Ensembl<br />

gene ID was used to find the most upstream and<br />

downstream start and stop in all Ensembl transcripts for<br />

that gene. Next, these genome coordinates were used to<br />

search for flanking genes within 2, 10, and 20 Mbp<br />

upstream and downstream. We then downloaded all mouse<br />

phenotypic allele data from the MGI resource and<br />

extracted the data of genetically modified mouse lines.<br />

Information on 5,322 genes (corresponding to 7,979 129-<br />

derived genetically modified mouse lines) was connected<br />

to genes with passenger mutations and affected genes.<br />

Additionally we filtered the data to identify putative<br />

regulatory variants. All data were stored in a MySQL<br />

database and can be queried using the publicly available<br />

web tool Me-PaMuFind-It:<br />

http://me-pamufind-it.org/<br />

Passenger genome mutations in gene-targeted mice (Nechanitzky and<br />

Mak, 2015)<br />
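
The flanking-gene search described in the methods can be sketched as a simple interval test. The gene names and coordinates below are invented for illustration, and the function is an assumption about how such a window search could be done, not the code behind Me-PaMuFind-It.<br />

```python
MBP = 1_000_000


def flanking_genes(target, genes, window_mbp):
    """Genes on the same chromosome within window_mbp of the target gene.

    genes: dict mapping a gene name to (chromosome, start, end), where start
    and end are the most upstream and most downstream transcript coordinates.
    A gene counts as flanking if it overlaps the target region extended by
    window_mbp megabases on each side.
    """
    chrom, start, end = genes[target]
    lower = start - window_mbp * MBP
    upper = end + window_mbp * MBP
    return sorted(
        name
        for name, (c, s, e) in genes.items()
        if name != target and c == chrom and s <= upper and e >= lower
    )


# Hypothetical coordinates, not real annotations.
demo_genes = {
    "GeneT": ("chr11", 50_000_000, 50_010_000),
    "GeneA": ("chr11", 51_500_000, 51_520_000),
    "GeneB": ("chr11", 58_000_000, 58_030_000),
    "GeneC": ("chr11", 75_000_000, 75_040_000),
    "GeneD": ("chr7", 50_005_000, 50_015_000),
}
by_window = {w: flanking_genes("GeneT", demo_genes, w) for w in (2, 10, 20)}
```

Running the search for the three window sizes gives nested gene sets, which is what makes it possible to report passenger-mutation counts per flanking distance.<br />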

RESULTS & DISCUSSION<br />

The vast majority of existing and well-characterized<br />

genetically engineered congenic mice have been created<br />

using 129 ESCs. 99.5% of these mouse lines are affected<br />

by a median number of 20 passenger mutations within a<br />

10 cM flanking region. This implies that nearly all<br />

genetically modified congenic mice contain multiple<br />

passenger mutations despite intensive backcrossing.<br />

Consequently, the phenotypes observed in these mice<br />

might be due to flanking passenger mutations rather than a<br />

defect in the targeted gene (Vanden Berghe et al., 2015).<br />

REFERENCES<br />

Keane, T.M., Goodstadt, L., Danecek, P., White, M.A., Wong, K., Yalcin,<br />

B., Heger, A., Agam, A., Slater, G., Goodson, M., et al. (2011).<br />

Mouse genomic variation and its effect on phenotypes and gene<br />

regulation. Nature 477, 289–294.<br />

Nechanitzky R, Mak TW (2015). Passenger Mutations Identified in the<br />

Blink of an Eye. Immunity 43(1), 9-11.<br />

Vanden Berghe, T., Hulpiau, P., Martens, L. et al. (2015). Passenger<br />

Mutations Confound Interpretation of All Genetically Modified<br />

Congenic Mice. Immunity 43(1), 200-9.<br />

P27. DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION<br />

AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA<br />

Arlin Keo 1 & Thomas Abeel 1,2,* .<br />

Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands 1 ; Broad Institute of MIT<br />

and Harvard, Cambridge, MA, USA 2 . * t.abeel@tudelft.nl<br />

Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis and infects millions of people. When a<br />

person is infected with more than one distinct strain type of tuberculosis (TB), referred to as a mixed infection, diagnosis<br />

and treatment are complicated. Because such infections are difficult to diagnose, the prevalence of mixed infections among TB patients<br />

remains uncertain. Whole genome sequencing (WGS) yields a great number of single nucleotide polymorphisms (SNPs)<br />

and offers increased resolution to distinguish distinct strains. Here, we present a tool that maps sample reads against 21<br />

bp cluster-specific SNP markers to detect putative mixed infections and estimate the frequencies of the present<br />

subpopulations.<br />

INTRODUCTION<br />

Mycobacterium tuberculosis is a clonal, bacterial pathogen<br />

that causes the pulmonary disease tuberculosis (TB), and it<br />

infects and kills millions of people worldwide [1]. The<br />

study of genetic diversity within the M. tuberculosis<br />

complex (MTBC) is complicated by mixed TB infections,<br />

which occur when a person is infected with more than<br />

one distinct strain type of MTBC. This often results in<br />

poor diagnosis and treatment of patients, as the bacterial<br />

subpopulations may have undetected differences in drug<br />

susceptibility [2]. A strain typing method should be able to<br />

distinguish closely related strains, so that it also allows the<br />

detection of a mixed infection at finer resolutions [3]. This<br />

study aims to detect a possible mixed TB infection at<br />

different levels in MTBC and to determine the frequencies<br />

of the present strains based on established tree paths in the<br />

MTBC phylogenetic tree.<br />

METHODS<br />

A global comprehensive dataset of 5992 MTBC strains<br />

was used for analysis, and 226570 SNPs were extracted<br />

from this set to construct a SNP-based phylogenetic tree<br />

with RAxML. In this bifurcating tree, each branch<br />

represents a cluster of strains and splits into two new<br />

monophyletic subclusters of genetically more closely<br />

related strains. These "splits" were used to define clusters<br />

and subclusters that contain more than 10 strains. Global SNP<br />

association was done for each cluster to obtain cluster-specific<br />

SNPs, i.e. those for which the true positive rate, true<br />

negative rate, positive predictive value, and negative<br />

predictive value were >0.95. Markers were generated from<br />

these SNPs by extending them with 10 bp of sequence on<br />

each side, based on the reference genome H37Rv. Each<br />

hierarchical cluster now has a set of specific SNP markers.<br />

By mapping sample reads against these 21 bp cluster-specific<br />

SNP markers, the tool determines the presence of<br />

paths in the phylogenetic tree that start at the MTBC root<br />

node. Paths that split indicate the presence of multiple<br />

strains and thus a mixed infection.<br />
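
The two steps that turn a SNP into a marker can be sketched as follows. This is an illustrative Python sketch under assumptions of its own: strains are represented by parallel lists of booleans, positions are 0-based, the short reference string is invented, and SNPs closer than 10 bp to the end of the sequence are simply rejected.<br />

```python
def is_cluster_specific(has_snp, in_cluster, threshold=0.95):
    """Check the four association measures for one SNP and one cluster.

    has_snp, in_cluster: parallel lists of booleans, one entry per strain.
    Returns True when the true positive rate, true negative rate, positive
    predictive value and negative predictive value all exceed the threshold.
    """
    pairs = list(zip(has_snp, in_cluster))
    tp = sum(1 for s, c in pairs if s and c)
    fp = sum(1 for s, c in pairs if s and not c)
    fn = sum(1 for s, c in pairs if c and not s)
    tn = sum(1 for s, c in pairs if not s and not c)
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        return False  # a measure is undefined, so the SNP is not used
    measures = (tp / (tp + fn), tn / (tn + fp), tp / (tp + fp), tn / (tn + fn))
    return all(m > threshold for m in measures)


def build_marker(reference, snp_index, alt_base, flank=10):
    """Return flank + SNP allele + flank (21 bp for the default flank of 10)."""
    if snp_index < flank or snp_index + flank >= len(reference):
        raise ValueError("SNP too close to the end of the reference sequence")
    left = reference[snp_index - flank:snp_index]
    right = reference[snp_index + 1:snp_index + flank + 1]
    return left + alt_base + right


# Invented 25 bp reference with a SNP (A to G) at 0-based position 12.
marker = build_marker("ACGTACGTACGTACGTACGTACGTA", 12, "G")
```

Requiring all four measures to pass means a marker is both rarely missing inside its cluster and rarely present outside it.<br />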

The read depth at the root node corresponds to a frequency of 1<br />

for the MTBC present in the sample. If the path splits further down<br />

the tree, the total read depth is divided over the two<br />

subpaths, which determines the frequencies of the<br />

subclusters present (Figure 1).<br />

FIGURE 1. Detection of mixed TB infection with hierarchical clusters.<br />
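
The frequency estimate at a split can be sketched as a proportional division of the parent frequency. The function below is an assumption about how read depths could be turned into frequencies (depths are taken to be, for example, mean marker depths per subcluster); the depth values are invented.<br />

```python
def subcluster_frequencies(parent_frequency, depth_left, depth_right):
    """Divide the parent frequency over two subpaths in proportion to read depth."""
    total = depth_left + depth_right
    if total == 0:
        raise ValueError("no reads support either subpath")
    return (
        parent_frequency * depth_left / total,
        parent_frequency * depth_right / total,
    )


# The root has frequency 1; a split with depths 75 and 25 gives 0.75 and 0.25.
first_split = subcluster_frequencies(1.0, 75, 25)
# A further split of the 0.75 branch with equal depths gives 0.375 each.
second_split = subcluster_frequencies(first_split[0], 30, 30)
```

Applying the function recursively along every path that splits yields one frequency per detected strain, and the frequencies of all leaves sum to 1.<br />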

The detected strains are combined with detected drug<br />

susceptibility profiles. A minimized reference genome<br />

consisting of drug resistance genes and 1000 bp flanking<br />

regions is used to map sample reads with BWA, and call<br />

variants with Pilon. Ambiguous variant calls may<br />

indicate that the strains present in a mixed infection sample<br />

also have differences in drug susceptibility.<br />

RESULTS & DISCUSSION<br />

In the phylogenetic tree 308 clusters (MTBC root<br />

excluded) were defined and there are 14823 SNP markers<br />

in total that are specific to a cluster and unique within the<br />

cluster. The known MTBC lineages 1 to 6 have between<br />

355 and 614 markers.<br />

A total of 7661 TB samples were tested; the strain(s) present and their<br />

frequencies could be predicted for 7495 samples, of which<br />

914 (~12%) are mixed infections (Table 1).<br />

# of subpopulations 1 2 3 >3<br />

# of samples 6581 798 95 21<br />

TABLE 1. 914 out of 7495 samples are mixed infections.<br />

REFERENCES<br />

1. World Health Organization. Global Tuberculosis Report. World<br />

Health Organization, Geneva, Switzerland, 2014.<br />

2. Zetola et al. Mixed Mycobacterium tuberculosis complex infections<br />

and false-negative results for rifampicin resistance by GeneXpert<br />

MTB/RIF are associated with poor clinical outcomes. Journal of<br />

Clin. Microb., 52:2422-2429, 2014.<br />

3. G. Plazzotta, T. Cohen, and C. Colijn. Magnitude and sources of bias<br />

in the detection of mixed strain M. tuberculosis infection. Journal of<br />

theoretical biology, 368:67–73, 2015.<br />

P28. APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO<br />

CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-<br />

INDUCED LIVER INJURY<br />

Julian Krauskopf 1* , Florian Caiment 1 , Sandra Claessen 1 , Kent J. Johnson 2 , Roscoe L. Warner 2 , Shelli J. Schomaker 3 ,<br />

Deborah A. Burt 3 , Jiri Aubrecht 3 , Jos C. Kleinjans 1 .<br />

Department of Toxicogenomics, Maastricht University, Maastricht 6200 MD, The Netherlands 1 ; Pathology Department,<br />

University of Michigan, Ann Arbor, MI 48109, USA 2 ; Drug Safety Research and Development, Pfizer, Inc., Groton, CT<br />

06340, USA 3 . *j.krauskopf@maastrichtuniversity.nl<br />

Drug-induced liver-injury (DILI) is a leading cause of acute liver failure and the major reason for withdrawal of drugs<br />

from the market. Preclinical evaluation of drug candidates has failed to detect about 40% of potentially hepatotoxic<br />

compounds in humans. At the onset of liver injury in humans, currently used biomarkers have difficulty differentiating<br />

severe from mild DILI and/or predicting the outcome of injury for individual subjects. Therefore, new biomarker<br />

approaches for predicting and diagnosing DILI in humans are urgently needed. Recently, circulating microRNAs<br />

(miRNAs) such as miR-122 and miR-192 have emerged as promising biomarkers of liver injury in preclinical species<br />

and in DILI patients. In this study, we focused on examining global circulating miRNA profiles in serum samples from<br />

subjects with liver injury caused by accidental acetaminophen (APAP) overdose. By applying next-generation high-throughput<br />

sequencing to small RNA libraries, we identified 36 miRNAs, including three novel miRNA-like small<br />

nuclear RNAs, which were enriched in the serum of APAP-overdosed subjects. The set comprised miRNAs that are<br />

functionally associated with liver-specific biological processes and relevant to APAP toxic mechanisms. Although more<br />

patients need to be investigated, our study suggests that profiles of circulating miRNAs in human serum might provide<br />

additional biomarker candidates and possibly mechanistic information relevant to liver injury.<br />

P29. INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION<br />

Ajay Anand Kumar 1,2 * , Geert Vandeweyer 1,2 , Lut Van Laer 1,2 & Bart Loeys 1,2 .<br />

Department of Medical Genetics, University of Antwerp 1 ; Biomedical informatics, Antwerp University Hospital 2 .<br />

*ajay.kumar@uantwerpen.be<br />

The identification of top candidate genes involved in human diseases from a list of candidate genes remains<br />

computationally challenging. Many tools exist for this computational prioritization, of which the core typically utilizes<br />

fusion or integration of various genomic annotation data sources. However, due to the rapid generation of novel data<br />

by high-throughput experiments, annotation sources often become outdated, leading to annotation errors. Hence, predictions<br />

based on these computational tools are not reliable. To tackle this, we propose an information-theoretic model that<br />

effectively fuses annotation sources with a regression model under a Bayesian framework to prioritize candidate genes. Our<br />

method is fast and performs better than four existing tools on their own benchmark dataset.<br />

INTRODUCTION<br />

Gene prioritization has become a central research problem<br />

in the bioinformatics domain. With the advent of exome<br />

sequencing in clinical genetics, it became necessary to<br />

automate the identification of the genes most likely<br />

involved in a disease from a given pool of affected<br />

genes. Various annotation sources can be integrated or<br />

fused to learn the multiple functions of genes and then<br />

design a classifier or regressor for prioritization. We<br />

propose here an early data integration method that<br />

uses an information retrieval model to fuse the<br />

data at the functional feature level and then fits a<br />

discriminative regression model in a Bayesian framework to<br />

prioritize candidate genes.<br />

METHODS<br />

The principle behind our approach is guilt-by-association.<br />

Genes that are known to be disease-associated<br />

might also share similar functions. The idea is that a<br />

classifier or regressor can be trained on the linear<br />

mapping between functional proximity profiles of genes<br />

and their phenotypic proximity profiles. We implemented<br />

a Bayesian regressor to infer the degree of association of the<br />

test genes with the query disease. The workflow is<br />

shown in Figure 1. The details are:<br />

1. Functional annotation: text, ontologies (GO, MPO),<br />

sequence similarity, pathways, interactions. Phenotype<br />

annotation: Human Phenotype Ontology (HPO), Disease<br />

Ontology (DO), HuGE/MeSH terms and GAD.<br />

2. The TF-IDF (Term Frequency – Inverse Document<br />

Frequency) methodology is used to assign statistical<br />

weights to the functional attributes of genes from these<br />

annotation sources. TF-IDF is a data-driven model<br />

traditionally used for information retrieval. We apply the same<br />

methodology for weighting features. Together, this gives<br />

gene-by-gene functional and phenotypic proximity profiles.<br />

3. Finally, for a given set of query-disease or training genes,<br />

the Bayesian linear regression model learns the<br />

linear mapping between functional and phenotypic<br />

proximity profiles: Y = βX + η, where η is Gaussian<br />

distributed. We have incorporated traditional non-informative<br />

Normal-Inverse Gamma (NIG) priors for<br />

estimating the unknowns, namely β and σ.<br />
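
As an illustration of the TF-IDF weighting in step 2, the following minimal Python sketch treats each gene as a document and its annotation terms as words, then compares genes by the cosine similarity of their weight vectors. The gene names, the term identifiers, the function names `tfidf_profiles` and `cosine`, and the choice of cosine similarity as the proximity measure are assumptions made for this example; the abstract does not specify them.<br />

```python
import math
from collections import Counter


def tfidf_profiles(annotations):
    """Return {gene: {term: weight}} using TF-IDF over annotation terms.

    `annotations` maps each gene (the "document") to a list of annotation
    terms (e.g. ontology identifiers). All names are illustrative only.
    """
    n_genes = len(annotations)
    # Document frequency: in how many genes does each term occur?
    df = Counter(term for terms in annotations.values() for term in set(terms))
    profiles = {}
    for gene, terms in annotations.items():
        counts = Counter(terms)
        total = sum(counts.values())
        profiles[gene] = {
            term: (count / total) * math.log(n_genes / df[term])
            for term, count in counts.items()
        }
    return profiles


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy annotation sets (hypothetical genes and terms).
toy = {
    "geneA": ["GO:1", "GO:2", "GO:3"],
    "geneB": ["GO:1", "GO:2", "GO:4"],
    "geneC": ["GO:5", "GO:6", "GO:3"],
    "geneD": ["GO:7"],
}
profiles = tfidf_profiles(toy)
sim_ab = cosine(profiles["geneA"], profiles["geneB"])  # two shared terms
sim_ad = cosine(profiles["geneA"], profiles["geneD"])  # no shared terms
```

In this toy input, geneA and geneB share two terms and therefore obtain a positive proximity, while geneA and geneD share none and obtain zero.<br />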

RESULTS & DISCUSSION<br />

We performed a leave-one-out cross-validation experiment<br />

on the benchmark data set that was used to compare four<br />

other tools whose design principles are similar to our<br />

method [1]. Our dataset consisted of 1040 disease genes<br />

categorized under 12 manually curated disease<br />

classes [2]. In our preliminary results for 1154<br />

prioritizations, under the cut-offs of the top 5%, 10% and 30% of<br />

genes ranked in a random control dataset, we achieved an<br />

AUROC of 86.31% against their best achieved score of<br />

83.0%. This indicates that our method outperforms<br />

the other tools included in the comparative<br />

analysis.<br />
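
One common way to summarise such a leave-one-out benchmark is sketched below in Python: each prioritization yields the rank of the left-out disease gene among a fixed number of candidates, and the AUROC is the average fraction of control genes ranked below it. The function names, the toy ranks and the candidate-list size of 100 are assumptions for illustration and are not taken from the benchmark in [1].<br />

```python
def auroc_from_ranks(ranks, n_candidates):
    """Mean fraction of control genes ranked below the left-out gene.

    `ranks` holds the 1-based rank of the left-out gene in each
    prioritization; every run is assumed to rank `n_candidates` genes
    (one true gene plus n_candidates - 1 random controls).
    """
    per_run = [(n_candidates - r) / (n_candidates - 1) for r in ranks]
    return sum(per_run) / len(per_run)


def fraction_in_top(ranks, n_candidates, cutoff_percent):
    """Share of runs in which the left-out gene is within the top cutoff."""
    limit = n_candidates * cutoff_percent / 100
    return sum(1 for r in ranks if r <= limit) / len(ranks)


# Hypothetical ranks from five leave-one-out runs with 100 candidates each.
toy_ranks = [1, 1, 2, 10, 50]
toy_auroc = auroc_from_ranks(toy_ranks, 100)
top5 = fraction_in_top(toy_ranks, 100, 5)
top10 = fraction_in_top(toy_ranks, 100, 10)
```

A gene that is always ranked first gives an AUROC of 1, and a gene that is always ranked last gives 0.<br />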

FIGURE 1. Workflow of Bayesian regression model for gene<br />

prioritization.<br />

Currently, we are conducting large-scale cross-validation<br />

with 6762 manually curated disease-gene associations and<br />

a larger number of tools and benchmark data [3].<br />

Additionally, we plan to develop a<br />

probabilistic generative approach to model co-occurrences and<br />

dependencies of features for effective data<br />

fusion that can help in finding novel disease-causing<br />

genes.<br />

REFERENCES<br />

1. Chen B et al. BMC Med Genomics. <strong>2015</strong>;8 Suppl 3:S2<br />

2. Goh et al. Proc Natl Acad Sci USA 2007, 104(21):8685-8690<br />

3. Börnigen, Daniela, et al. Bioinformatics 28.23 (2012): 3081-3088.<br />


P30. GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS<br />

FROM GENE EXPRESSION DATA<br />

Griet Laenen 1,2,* , Amin Ardeshirdavani 1,2 , Yves Moreau 1,2 & Lieven Thorrez 1,3 .<br />

Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics,<br />

KU Leuven 1 ; iMinds Medical IT Dept., KU Leuven 2 ; Dept. of Development and Regeneration @ Kulak, KU Leuven 3 .<br />

* griet.laenen@esat.kuleuven.be<br />

Galahad (https://galahad.esat.kuleuven.be) is a web-based application for the analysis of gene expression data from drug<br />

treatment versus control experiments, aimed at predicting a drug’s molecular targets and biological effects. Galahad<br />

provides data quality assessment and exploratory analysis, as well as computation of differential expression. Based on<br />

the obtained differential expression values, drug target prioritization and both pathway and disease enrichment can be<br />

calculated and visualized. Drug target prioritization is based on the integration of the gene expression data with a<br />

functional protein association network.<br />

INTRODUCTION<br />

Gene expression analysis is frequently employed to study<br />

the effects of drug compounds on cells. The observed<br />

transcriptional patterns can provide valuable information<br />

for identifying compound–protein interactions as well as<br />

resulting biological effects. To facilitate the analysis of<br />

this particular data type and enable an in-depth exploration<br />

of a drug’s mode of effect, we have developed Galahad 1 .<br />

INPUT<br />

The main input for Galahad is raw Affymetrix human,<br />

mouse or rat DNA microarray data derived from both<br />

untreated control samples and samples treated with a drug<br />

of interest. In addition, Galahad provides the possibility to<br />

start from differential expression data derived with other<br />

platforms to perform drug target prioritization and<br />

enrichment analysis.<br />

METHODS<br />

The different analyses are depicted in Figure 1 and<br />

include:<br />


• preprocessing of the raw data with RMA or<br />

MAS5.0, as indicated by the user;<br />

• quality assessment and exploratory analysis to<br />

ascertain data quality, uncover experimental<br />

issues, and help in deciding whether certain<br />

arrays need to be considered as outlying;<br />

• differential expression analysis to determine the<br />

significance of gene up- and downregulation<br />

following drug treatment;<br />

• genome-wide drug target prioritization by<br />

means of an in-house developed algorithm for<br />

network neighborhood analysis integrating the<br />

expression data with functional protein<br />

association information 2 ;<br />

• prediction of molecular pathways involved in the<br />

drug’s mode of effect;<br />

• identification of associated disease phenotypes<br />

enabling side effect prediction and drug<br />

repositioning.<br />

OUTPUT<br />

The output is displayed in a series of tabs corresponding to<br />

the different analyses selected by the user:<br />


• in the Quality Control and Data Exploration<br />

tabs, several diagnostic plots are displayed along<br />

with a short explanation;<br />

• the Differential Expression tab contains a sorted<br />

table listing all genes together with their log2<br />

ratios and P-values for differential expression, as<br />

well as links to the corresponding GeneCards<br />

sections;<br />

• in the Drug Target Prioritization tab, a ranked<br />

list of genes as potential targets of the drug can be<br />

found, together with the network diffusion-based<br />

scores and P-values for prioritization, and links to<br />

the corresponding GeneCards section; in addition,<br />

a network-based visualization is available for<br />

each gene, showing the 10 interaction partners<br />

contributing most to the gene’s ranking;<br />

• the tabs summarizing the results for Pathway<br />

and Disease Enrichment contain a sorted table<br />

with pathway or disease ontology IDs, names,<br />

and database links, together with the number of<br />

differentially expressed genes in the<br />

corresponding gene sets and the accompanying<br />

P-values; in addition, network graphs are available,<br />

consisting of the top 10 most significant<br />

pathways or disease phenotypes, along with their<br />

associated genes colored according to fold change.<br />

FIGURE 1. Overview of the Galahad analysis steps.<br />

REFERENCES<br />

1. Laenen G. et al. Nucl Acids Res 43, W208-W212 (<strong>2015</strong>).<br />

2. Laenen G. et al. Mol BioSyst 9, 1676-1685 (2013).<br />


P31. KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT<br />

FOR INTRINSICALLY DISORDERED PROTEINS<br />

Joanna Lange 1,2 , Lucjan S Wyrwicz 1 & Gert Vriend 2* .<br />

Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center;<br />

Institute of Oncology 1 , CMBI, Radboud University Nijmegen 2 . * vriend@cmbi.ru.nl<br />

INTRODUCTION<br />

Intrinsically disordered proteins (IDPs) lack tertiary<br />

structure and thus differ from globular proteins in terms of<br />

their sequence – structure – function relations. IDPs have a<br />

lower sequence conservation, different types of active<br />

sites, and a different distribution of functionally important<br />

regions, which altogether makes their multiple sequence<br />

alignment (MSA) difficult.<br />

Algorithms underlying existing MSA programs are<br />

directly or indirectly based on knowledge obtained from<br />

studying three-dimensional protein structures. Here we<br />

introduce a tool for Knowledge based Multiple sequence<br />

Alignment for intrinsically Disordered proteins, KMAD,<br />

that incorporates short linear motif (SLiM), domain, and<br />

post-translational modification (PTM) annotations to<br />

improve the alignments.<br />

The KMAD web server is accessible at<br />

http://www.cmbi.ru.nl/kmad/. A standalone version is<br />

freely available.<br />

METHODS<br />

A dataset of proteins experimentally proven to be disordered<br />

was obtained from DisProt (Sickmeier et al., 2007). For<br />

each IDP, all homologous sequences were extracted from<br />

SwissProt (The Uniprot Consortium, 2014) using BLAST.<br />

The sequence sets were aligned with several MSA tools.<br />

Apart from manual validation, we also performed a<br />

benchmark validation on reference sets from BAliBASE<br />

(Thompson et al., 2005) and PREFAB, which hold structure-based<br />

'gold standard' sequence alignments. For this<br />

purpose we used KMAD and a modified version of<br />

KMAD that performs a 'refinement' of Clustal Omega<br />

(Sievers et al., 2011) alignments.<br />

RESULTS & DISCUSSION<br />

Manual validation showed that KMAD avoids many<br />

mistakes made by Clustal Omega. An example of an<br />

alignment mistake is shown in Figure 1.<br />

a) Clustal Omega<br />

b) KMAD<br />

FIGURE 1. Excerpts from Clustal Omega and KMAD alignments of<br />

human sialoprotein (SIAL_HUMAN) with four homologues. Various PTM<br />

kinds are highlighted with bright colours.<br />

In the field of sequence alignment research it is common<br />

practice to compare the sequence alignments obtained with<br />

MSA software with those that are obtained from structure<br />

superpositions. IDPs do not possess a static 3D structure,<br />

so this method is not applicable to KMAD alignments.<br />

Both of the validation methods that we used have their<br />

disadvantages, but so far there is no alternative. Validation<br />

on benchmark alignments of structured proteins is biased<br />

towards Clustal Omega, because it was optimized to work<br />

with structured proteins. On the other hand, manual<br />

inspection based on the same features that influence the<br />

alignment is not a very elegant method, but given the<br />

nature of IDPs it is probably the best we can do.<br />

REFERENCES<br />

Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high<br />

accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–<br />

1797.<br />

Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W.,<br />

Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J.<br />

D., and Higgins, D. G. (2011). Fast, scalable generation of high-quality<br />

protein multiple sequence alignments using Clustal Omega.<br />

Molecular Systems Biology, 7(539), 539.<br />

Sickmeier, M., Hamilton, J. A., LeGall, T., Vacic, V., Cortese, M. S.,<br />

Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V. N.,<br />

Obradovic, Z., and Dunker, A. K. (2007). DisProt: the Database of<br />

Disordered Proteins. Nucleic Acids Research, 35(Database issue),<br />

D786–93.<br />

The Uniprot Consortium (2014). Activities at the Universal Protein<br />

Resource (UniProt). Nucleic Acids Research, 42(Database issue),<br />

D191–8.<br />

Thompson, J. D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE<br />

3.0: latest developments of the multiple sequence alignment<br />

benchmark. Proteins: Structure, Function, and Bioinformatics,<br />

61(1), 127–136.<br />


P32. ON THE LZ DISTANCE FOR DEREPLICATING<br />

REDUNDANT PROKARYOTIC GENOMES<br />

Raphaël R. Léonard 1,2* , Damien Sirjacobs², Eric Sauvage 1 , Frédéric Kerff 1 & Denis Baurain².<br />

Centre for Protein Engineering, University of Liège 1 ; PhytoSYSTEMS, University of Liège 2 . * rleonard@doct.ulg.ac.be<br />

The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem<br />

when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of<br />

the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to<br />

single species, such as Escherichia coli (almost 2000 genomes as of Sept <strong>2015</strong>), while the continuous flow of newly<br />

discovered phyla prompts for regular updates. This situation makes it difficult to maintain sets of representative genomes<br />

combining lesser known phyla, for which only few species are available, and sound subsets of highly abundant phyla. An<br />

automated, straightforward method is required, but none is publicly available. The LZ distance, in conjunction with the<br />

quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes<br />

without redundancy. We are planning to release this tool on a website that will be made publicly available.<br />

INTRODUCTION<br />

The LZ distance (Lempel and Ziv, 1977; Otu and Sayood,<br />

2003) is inspired by compression algorithms, such as gzip<br />

or WinRAR. This distance, amongst others, has already<br />

been used in attempts to produce alignment-free<br />

phylogenetic trees (Bacha and Baurain, 2005; Höhl and Ragan,<br />

2007), though the results were disappointing in such a<br />

context (due to the heterogeneity of the substitution<br />

process at large evolutionary scales). However, the LZ<br />

distance is likely to provide enough resolving power to<br />

identify groups of redundant genomes and to keep only<br />

one representative for each group.<br />

METHODS<br />

For each pair of genomes A and B, the LZ distance is<br />

computed from the gzip-compressed file lengths of the<br />

corresponding nucleotide assemblies s(A) and s(B) and of<br />

their concatenations s(A+B) and s(B+A). These distances,<br />

along with taxonomic information, are stored in a<br />

database.<br />
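
A minimal Python sketch of such a compression-based distance is given below. The abstract does not state the exact formula, so the normalised form max(c(AB) - c(A), c(BA) - c(B)) / max(c(A), c(B)) used here, with the gzip-compressed length c substituted for LZ complexity, is an assumption modelled on one of the distances of Otu and Sayood (2003). The function names and the random toy sequences are hypothetical.<br />

```python
import gzip
import random


def compressed_size(seq):
    """Length in bytes of the gzip-compressed nucleotide string."""
    return len(gzip.compress(seq.encode("ascii")))


def lz_distance(a, b):
    """Normalised compression-based distance between two sequences.

    Assumed form: how many extra bytes are needed to compress one
    sequence once the other is already known, relative to the larger
    of the two individual compressed sizes.
    """
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    size_ba = compressed_size(b + a)
    return max(size_ab - size_a, size_ba - size_b) / max(size_a, size_b)


# Two unrelated toy "genomes" of 3000 nucleotides (fixed seed for repeatability).
rng = random.Random(0)
genome_a = "".join(rng.choice("ACGT") for _ in range(3000))
genome_b = "".join(rng.choice("ACGT") for _ in range(3000))

d_same = lz_distance(genome_a, genome_a)  # redundant pair: small distance
d_diff = lz_distance(genome_a, genome_b)  # unrelated pair: larger distance
```

The DEFLATE algorithm behind gzip only looks back 32 KiB, which is why this toy uses short sequences; how megabase-sized assemblies are handled is not described in the abstract.<br />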

A clustering method is then applied to group similar<br />

genomes into a user-specified number of groups. For each<br />

of these groups, a representative is chosen based on the<br />

quality of the genomic assemblies (chromosomes rather<br />

than scaffolds) and of the protein annotations (e.g., few<br />

rather than many “unknown proteins”).<br />

RESULTS & DISCUSSION<br />

Our method using the LZ distance is currently under<br />

development using the genomes from release 28 of<br />

Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/).<br />

It contains 20,950 unique<br />

prokaryotic genomes, composed of 286 Archaea and<br />

20,664 Bacteria. The three most represented phyla are the<br />

Proteobacteria (8642, of which 1980 E. coli), the<br />

Firmicutes (7766) and the Actinobacteria (2673). These<br />

genomes are already the result of a pre-processing step<br />

designed to remove extra assemblies for strains present in<br />

multiple copies (due to parallel sequencing or<br />

resequencing in different labs).<br />

We are working on different approaches for validating our<br />

dereplication method, based on (1) current taxonomy, (2)<br />

16S rRNA phylogeny, and (3) clustering using genomic<br />

signatures (Moreno-Hagelsieb et al. 2013).<br />

First, we compute a central measure of the taxonomic<br />

“purity” of all genome clusters, which reflects the amount<br />

of “mixture” at different taxonomic levels (phylum, class,<br />

order, etc.). A good clustering should group different<br />

genera (or species) without amalgamating distinct classes<br />

(or phyla). Second, we cut the branches of a large 16S<br />

rRNA tree based on the same genome collection to<br />

produce an equal number of groups to compare with our<br />

clustering method. We then compute a statistic of the<br />

overlap between the 16S subtrees and the LZ clusters. A<br />

good clustering should have a reasonable overlap with the<br />

gold standard that is the 16S rRNA tree. Third, using the<br />

same overlap metric, we compare the LZ clusters to<br />

clusters obtained using the genomic signature.<br />

Finally, an interactive tool will be made available through<br />

a website. It will allow the users to download precomputed<br />

sets of representative genomes for either the<br />

complete database or for taxonomic subsets. We are also<br />

planning to allow users to upload their own genomes to<br />

cluster them with the LZ method.<br />

REFERENCES<br />

Ziv, J. and A. Lempel. 1977. ‘A Universal Algorithm for Sequential Data<br />

Compression.’ IEEE Transactions on Information Theory 23.3.<br />

doi:10.1109/TIT.1977.1055714.<br />

Otu, H. H. and K. Sayood. 2003. ‘A New Sequence Distance Measure for<br />

Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130.<br />

doi:10.1093/bioinformatics/btg295.<br />

Moreno-Hagelsieb, G., Z. Wang, S. Walsh and A. Elsherbiny. 2013.<br />

‘Phylogenomic Clustering for Selecting Non-Redundant Genomes<br />

for Comparative Genomics.’ Bioinformatics 29.1: 947–949.<br />

doi:10.1093/bioinformatics/btt064.<br />

Höhl, M. and M. A. Ragan. 2007. ‘Is Multiple-Sequence Alignment<br />

Required for Accurate Inference of Phylogeny?’ Systematic Biology<br />

56.2: 206–221. doi:10.1080/10635150701294741.<br />

Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity<br />

to alignment-free sequence comparison of protein families’.<br />

Benelux Bioinformatics Conference 2005.<br />

http://hdl.handle.net/2268/80179<br />


P33. THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE<br />

Ashley Lu 1,2* , Annerieke Sierksma 1,2 , Bart De Strooper 1,2 & Mark Fiers 1,2 .<br />

VIB Center for the Biology of Disease 1 ; KU Leuven Center for Human Genetics 2 . * ashley.lu@cme.vib-kuleuven.be<br />

MicroRNAs (miRNA) play an important role in post-transcriptional regulation and were shown to be dysregulated in<br />

Alzheimer’s disease. By analysing the hippocampal miRNA and mRNA expression of two mouse models of Alzheimer’s<br />

disease, we identify a set of miRNAs that are dysregulated with the onset of cognitive impairments. Using GO<br />

enrichment analysis we aim to identify miRNAs that likely play a role in learning and memory.<br />

INTRODUCTION<br />

MiRNAs are small non-coding RNAs involved in post-transcriptional<br />

regulation through mRNA inhibition or<br />

degradation. Past studies have suggested miRNAs to play<br />

a direct role in Alzheimer’s disease (AD), e.g. by<br />

modulating the expression of genes involved in the<br />

formation of neuropathological protein aggregates (Lau P<br />

& De Strooper B, 2010). In this study, we investigated the<br />

changes in miRNA and mRNA expression in two AD<br />

mouse models: APPswe/PS1 L166P (Radde R, 2006) and<br />

Thy-Tau22 (Schindowski K, 2006), which have similar<br />

patterns of cognitive impairment, but different pathology.<br />

We aim to better understand the functional role of<br />

miRNAs in AD-related cognitive impairments.<br />

METHODS<br />

RNA was extracted from the left hippocampus of 96 mice.<br />

The experiment covers the two models (APPswe/PS1 L166P<br />

& Thy-Tau22), with wild-type controls for each. All<br />

genotypes were tested at two ages (4 and 10 months), before<br />

and after the onset of cognitive impairment. This yields eight<br />

experimental groups with twelve mice each.<br />

Expression profiles of miRNAs and mRNAs were<br />

generated using Illumina single-end sequencing.<br />

Differential Expression (DE) analysis was performed<br />

using the limma package of R/Bioconductor with a linear<br />

model to test the effects of age, genotype and their<br />

interaction.<br />

Functional analyses of the mRNAs and miRNAs were<br />

conducted separately. For mRNAs, gene ontology analysis<br />

was applied to sets of the most up- and down-regulated<br />

genes.<br />

To determine the functional impact of dysregulated<br />

miRNAs we determined which mRNAs are the most likely<br />

direct targets of each miRNA using the following<br />

approach: 1) For each miRNA, we calculated Pearson’s<br />

correlation coefficient with each mRNA based on the<br />

miRNA and mRNA expression data. 2) For each miRNA,<br />

we extracted the predicted set of targets from TargetScan<br />

(Lewis BP & Burge CB & Bartel DP, 2005), with DIANA<br />

(Maragkakis M et al. 2011) as a backup when TargetScan<br />

had no record. 3) We filtered the miRNA target genes by<br />

determining the leading-edge set in a GSEA PreRanked<br />

analysis (Subramanian A. et al, 2005) using the predicted<br />

target mRNAs of each miRNA against the mRNAs ranked<br />

according to the Pearson’s correlation coefficients from step 1. We<br />

additionally investigated target sets based on Pearson’s<br />

correlation coefficient cut-offs of -0.2, -0.3, and -0.4. 4)<br />

Gene-ontology analysis was then applied to these<br />

candidate target sets to infer the likely biological function<br />

of each miRNA.<br />
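
The correlation-based variant of this filtering (steps 1 and 2 combined with a fixed cut-off) can be sketched in Python as follows. The sketch does not reproduce the GSEA leading-edge analysis. The function names, the gene names, the expression values and the default cut-off of -0.3 (one of the three cut-offs mentioned above) are assumptions for illustration, and the list of predicted targets stands in for a TargetScan or DIANA lookup.<br />

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # constant profile: correlation undefined, treated as 0
    return cov / (sd_x * sd_y)


def candidate_targets(mirna_profile, mrna_profiles, predicted, cutoff=-0.3):
    """Keep predicted targets that are anti-correlated with the miRNA.

    Returns {gene: r} for every predicted target with measured expression
    whose correlation with the miRNA is at or below `cutoff`.
    """
    kept = {}
    for gene in predicted:
        if gene not in mrna_profiles:
            continue  # predicted target without expression data
        r = pearson(mirna_profile, mrna_profiles[gene])
        if r <= cutoff:
            kept[gene] = r
    return kept


# Hypothetical expression values across five samples.
mirna = [1.0, 2.0, 3.0, 4.0, 5.0]
mrnas = {
    "Gene1": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as the miRNA rises
    "Gene2": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the miRNA
    "Gene3": [2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the miRNA
}
kept = candidate_targets(mirna, mrnas, ["Gene1", "Gene2", "Gene3", "Gene4"])
```

Only Gene1 survives the filter in this toy example, because it is the only predicted target whose expression decreases as the miRNA increases.<br />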

RESULTS & DISCUSSION<br />

DE analysis showed that the direction of expression level<br />

changes in mRNAs is similar between APPswe/PS1 L166P<br />

and Thy-Tau22 in terms of age*genotype interaction<br />

effects. However, for the miRNAs the expression pattern<br />

is less obvious. Overall, the effect size is more pronounced<br />

in APPswe/PS1 L166P mice than in Thy-Tau22 mice for both<br />

miRNAs and mRNAs.<br />

Functional analyses of the down-regulated mRNAs show a<br />

clear enrichment in cognition and neural development<br />

related categories, whereas up-regulated genes show a<br />

clear inflammatory signature.<br />

Combining miRNA target prediction with miRNA/mRNA<br />

correlation analysis shows a marked increase of GO<br />

enrichment scores. This analysis strongly suggests a<br />

regulatory role for miRNAs in the down-regulation of<br />

genes involved in learning, cognition and related<br />

categories.<br />

This analysis workflow has allowed us to focus on a list of<br />

miRNAs that likely play a direct role in the observed<br />

learning and memory deficits in AD mouse models, and<br />

has been used to select candidate miRNAs for<br />

downstream in vivo experiments, which will hopefully<br />

provide a deeper understanding of the impact of AD on<br />

learning and cognition.<br />

REFERENCES<br />

Lau P & De Strooper B. Seminars in Cell & Developmental Biology,<br />

21(7), 768–773, (2010).<br />

Radde R. EMBO reports, 7(9), 940–946, (2006).<br />

Schindowski K. The American Journal of Pathology, 169(2),599–616,<br />

(2006).<br />

Lewis BP & Burge CB & Bartel DP. Cell, 120,15-20 (2005).<br />

Maragkakis M et al. Nucleic Acids Research (2011)<br />

Subramanian A. et al. Proceedings of the National Academy of Sciences<br />

of the United States of America, 102(43), 15545–15550, (2005)<br />


P34. FUNCTIONAL SUBGRAPH ENRICHMENTS<br />

FOR NODE SETS IN REGULATORY NETWORKS<br />

Pieter Meysman 1,2* , Yvan Saeys 3,4 , Ehsan Sabaghian 5,6 , Wout Bittremieux 1,2 ,<br />

Yves van de Peer 5,6 , Bart Goethals 1 & Kris Laukens 1,2 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />

Antwerpen (biomina) 2 ; VIB Inflammation Research Center 3 ; Department of Respiratory Medicine, Ghent University 4 ;<br />

Department of Plant Biotechnology and Bioinformatics, Ghent University 5 ; Department of Plant Systems Biology,<br />

VIB/Ghent University 6 . * pieter.meysman@uantwerpen.be<br />

We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given<br />

set of nodes. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment<br />

based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused<br />

extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it<br />

on two transcriptional regulatory networks and show that we can find relevant functional subgraphs enriched for the<br />

selected nodes.<br />

INTRODUCTION<br />

Frequent subgraph mining (FSM) is a common but<br />

complex problem within the data mining field that has<br />

gained in importance as more graph data has become<br />

available. However, traditional FSM finds all frequent<br />

subgraphs within the graph dataset, while often a more<br />

interesting query is to find the subgraphs that are most<br />

associated with a specific set of nodes. Nodes of interest<br />

might be those that are associated with a specific disease,<br />

or those that are differentially expressed in an omics<br />

experiment.<br />

METHODS<br />

To address this issue, we developed a novel subgraph<br />

mining algorithm that can efficiently construct, match and<br />

test candidate subgraphs against the given graph for<br />

enrichment within a specific set of nodes (Meysman et al.<br />

<strong>2015</strong>). To allow the enrichment testing, each candidate<br />

subgraph is built around a ‘source’ node. A subgraph<br />

match where the source node corresponds to a node of<br />

interest is counted as a ‘hit’. If the source node is not a<br />

node of interest, it is counted as a background hit. In this<br />

manner, enrichment can be easily tested<br />

using a hypergeometric test. Furthermore, we show that<br />

this definition of enrichment allows us to drastically prune<br />

the search space that the algorithm must traverse to find all<br />

enriched subgraphs.<br />
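
The enrichment test described above can be sketched in Python with the standard library only. The sketch assumes that the population is the set of all nodes in the graph, that the draws are the source nodes matched by the candidate subgraph (hits plus background hits), and that the Bonferroni factor is the number of subgraphs tested. These choices, like the function names and the toy counts, are illustrative assumptions rather than details of the cited implementation.<br />

```python
from math import comb


def hypergeom_enrichment_p(total_nodes, interest_nodes, matches, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    total_nodes    -- number of nodes in the graph (population size)
    interest_nodes -- number of nodes of interest (successes in population)
    matches        -- source nodes matched by the subgraph (hits + background)
    hits           -- matched source nodes that are nodes of interest
    """
    upper = min(matches, interest_nodes)
    tail = sum(
        comb(interest_nodes, i) * comb(total_nodes - interest_nodes, matches - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(total_nodes, matches)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy example: 10 nodes, 5 of interest; the subgraph matches 5 source
# nodes and all 5 of them are nodes of interest.
p_raw = hypergeom_enrichment_p(10, 5, 5, 5)
p_adj = bonferroni(p_raw, 20)  # assuming 20 candidate subgraphs were tested
```

Exact integer binomial coefficients keep the tail sum free of rounding error until the final division, which is convenient for the very small p-values that such tests can produce.<br />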

An implementation of the algorithm is available at<br />

http://adrem.ua.ac.be/sigsubgraph.<br />

RESULTS & DISCUSSION<br />

The first data set concerned the yeast genes that have<br />

remained in duplicate following the most recent whole<br />

genome duplication. Within the yeast transcriptional<br />

network, we found that these duplicate genes were<br />

enriched for self-regulating motifs (e.g. feedback loops,<br />

self edges, etc.), which matches the duplicated nature of<br />

these genes (Figure 1).<br />

FIGURE 1. Enriched subgraphs for yeast duplicated genes<br />

The second data set concerned mining the subgraphs<br />

associated with the homologs of the PhoR transcription<br />

factor across seven different inferred bacterial regulatory<br />

networks from Colombos expression data (Meysman et al.<br />

2014). These PhoR homologs were found to be<br />

significantly associated with several complex regulatory<br />

motifs.<br />

REFERENCES<br />

Meysman P et al. Discovery of Significantly Enriched<br />

Subgraphs Associated with Selected Vertices in a<br />

Single Graph. Proceedings of the 14th International<br />

Workshop on Data Mining in Bioinformatics (<strong>2015</strong>).<br />

Meysman P et al. COLOMBOS v2.0: an ever expanding<br />

collection of bacterial expression compendia. Nucleic<br />

acids research 42 (D1), D649-D653 (2014).<br />


P35. HUMANS DROVE THE INTRODUCTION & SPREAD OF<br />

MYCOBACTERIUM ULCERANS IN AFRICA<br />

Koen Vandelannoote 1,2,* , Conor Meehan 1* , Miriam Eddyani 1 , Dissou Affolabi 3 , Delphin Mavinga Phanzu 4 , Sara<br />

Eyangoh 5 , Kurt Jordaens 6 , Françoise Portaels 1 , Kirstie Mangas 7 , Torsten Seemann 7 , Herwig Leirs 2 , Tim Stinear 7 &<br />

Bouke C. de Jong 1 .<br />

Institute of Tropical Medicine, Antwerp, Belgium 1 ; Evolutionary Ecology Group, University of Antwerp, Antwerp,<br />

Belgium 2 ; Laboratoire de Référence des Mycobactéries, Cotonou, Benin 3 ; Institut Médical Evangélique, Kimpese,<br />

Democratic Republic of Congo 4 ; Centre Pasteur du Cameroun, Yaoundé, Cameroun 5 ; Joint Experimental Molecular<br />

Unit, Royal Museum for Central Africa, Tervuren, Belgium 6 ; Department of Microbiology and Immunology, University<br />

of Melbourne, Melbourne, Australia 7 . *cmeehan@itg.be<br />

Buruli ulcer (BU) is an insidious neglected tropical disease. BU is reported around the world but the rural regions of<br />

West and Central Africa are most affected. How BU is transmitted and spreads has remained a mystery, even though the<br />

causative agent, Mycobacterium ulcerans, has been known for more than 70 years. Here, using the tools of population<br />

genomics, we reconstruct the evolutionary history of M. ulcerans by comparing 167 isolates spanning 48 years and<br />

representing 11 endemic countries across Africa. The genetic diversity of African M. ulcerans proved very limited<br />

because of its slow substitution rate coupled with its recent origin. We show for the first time that M. ulcerans has<br />

existed in Africa for several hundred years but was recently re-introduced during the period of Neo-imperialism. We<br />

also provide evidence of the role that the so-called “Scramble for Africa” played in the spread of the disease.<br />

INTRODUCTION<br />

The clonal population structure of M. ulcerans has meant<br />

that conventional genetic fingerprinting methods have<br />

largely failed to differentiate clinical disease isolates,<br />

complicating molecular analyses aimed at elucidating the<br />

population structure and the evolutionary history of the<br />

pathogen. Whole genome sequencing (WGS) is currently<br />

replacing conventional genotyping methods for M.<br />

ulcerans.<br />

METHODS<br />

We analyzed a panel of 165 M. ulcerans disease isolates<br />

originating from disease foci in 11 different African<br />

countries that had been cultured between 1964 and 2012.<br />

Index-tagged paired-end sequencing-ready libraries were<br />

prepared from gDNA extracts. Genome sequencing was<br />

performed on the Illumina HiSeq 2000 DNA sequencer or<br />

the Illumina MiSeq sequencing platform with 2x150 bp and<br />

2x250 bp paired-end sequencing chemistry, respectively.<br />

Read mapping and SNP detection were performed using<br />

the Snippy v.2.6 pipeline. Bayesian model-based inference<br />

of the genetic population structure was performed using<br />

BAPS v.6.0 1 . Evidence for recombination between<br />

different BAPS clusters was assessed using<br />

BRAT-NextGen 2 . We used BEAST2 v2.2.1 3 to date evolutionary<br />

events, determine the substitution rate and produce a time-tree<br />

of African M. ulcerans. A permutation test was used<br />

to assess the validity of the temporal signal in the data. To<br />

assess the geospatial distribution of African M. ulcerans<br />

through time, an additional BEAST2 analysis was<br />

performed with a discrete BSSVS geospatial model 4 .<br />
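The idea behind a permutation test for temporal signal can be illustrated with a small sketch. This is a simplification under stated assumptions: the abstract does not describe how the test was run, and in a Bayesian setting the dating analysis itself is usually repeated on date-shuffled data. Here a plain correlation between sampling year and root-to-tip distance stands in for that analysis. The function names and the toy values are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def date_permutation_test(dates, distances, n_perm=1000, seed=0):
    """Compare the observed date/distance correlation with date-shuffled data.

    Returns the observed correlation and the fraction of permutations
    (with a +1 correction) whose correlation is at least as high.
    """
    rng = random.Random(seed)
    observed = pearson(dates, distances)
    shuffled = list(dates)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between date and divergence
        if pearson(shuffled, distances) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (invented): isolation years and root-to-tip distances.
dates = [1964, 1975, 1986, 1997, 2004, 2012]
distances = [1.0e-4, 1.2e-4, 1.1e-4, 1.5e-4, 1.6e-4, 1.9e-4]
r, p_value = date_permutation_test(dates, distances)
```

A small permutation p-value indicates that the association between sampling date and divergence is unlikely to arise by chance, which is the property a tip-dated analysis relies on.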

RESULTS & DISCUSSION<br />

Resulting sequence reads were mapped to the Ghanaian M.<br />

ulcerans Agy99 reference genome and, after excluding<br />

mobile repetitive elements and small indels, we detected a<br />

total of 9,193 SNPs randomly distributed across the M.<br />

ulcerans chromosome with approximately 1 SNP per 613<br />

bp (0.15% nucleotide divergence). We explored the<br />

distribution of DNA chromosomal deletions and identified<br />

differential genome reduction that strongly supports the<br />

existence of two specific M. ulcerans lineages within the<br />

African continent, hereafter referred to as Lineage Africa I<br />

(Mu_A1) and Lineage Africa II (Mu_A2). Subsequent<br />

SNP-based exploration of the genetic population structure<br />

agreed with the above deletion analysis and subdivided the<br />

African M. ulcerans population into four major clusters.<br />

BRAT-NextGen did not detect any recombined segments<br />

in any isolate, supporting a strongly clonal population<br />

structure for M. ulcerans that is evolving by vertically<br />

inherited mutations. Within the phylogenetic tree, isolates<br />

formed tight, shallow-rooted phylogenetic clusters which<br />

are suggestive of contemporary dispersal. We estimated a<br />

very slow mean genome wide substitution rate of 6.32E-8<br />

per site per year. The Bayesian analysis demonstrated that<br />

Mu_A1 has existed in Africa for several hundred years<br />

and that Mu_A2 was recently introduced on the continent.<br />

The re-introduction event coincides well with a historical<br />

event of particular interest: the period of Neo-imperialism<br />

(1881-1914). Since tMRCA(Mu_A2) did not predate<br />

colonization it seems very likely that lineage Mu_A2 was<br />

introduced after the instigation of colonial rule through an<br />

influx of BU infected humans. The time-tree of African M.<br />

ulcerans also reveals evidence of the likely role that the<br />

so-called “Scramble for Africa” played in the spread of<br />

endemic Mu_A1 clones in three hydrological basins<br />

(Congo, Oueme & Nyong) that are particularly well<br />

covered by our isolate panel.<br />
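To give a feel for how slow the reported substitution rate is, the following back-of-the-envelope sketch converts it into substitutions per genome per year. The only assumption is the genome length, which is not stated in the abstract and is approximated here from the reported SNP count and SNP density; the variable names are illustrative.

```python
# Figures taken from the abstract.
SNP_COUNT = 9193          # SNPs detected across the panel
BP_PER_SNP = 613          # approximately one SNP per 613 bp
RATE_PER_SITE = 6.32e-8   # substitutions per site per year

# Assumption: chromosome length approximated as SNP count x spacing.
genome_length = SNP_COUNT * BP_PER_SNP

# Expected substitutions accumulating per genome per year.
subs_per_genome_per_year = RATE_PER_SITE * genome_length

# Average waiting time, in years, for one substitution along a lineage.
years_per_substitution = 1 / subs_per_genome_per_year

print(genome_length, round(subs_per_genome_per_year, 3), round(years_per_substitution, 1))
```

Under these assumptions a lineage accumulates roughly one substitution every three years, which is consistent with the limited genetic diversity described above.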

REFERENCES<br />

1. Corander, J., et al. (2008) BMC bioinformatics. 9: p. 539.<br />

2. Marttinen, P., et al. (2012) Nucleic acids research. 40(1): p. e6.<br />

3. Bouckaert, R., et al. (2014) PLoS computational biology. 10(4): p.<br />

e1003537.<br />

4. Lemey, P., et al., (2009) PLoS computational biology. 5(9): p.<br />

e1000520.<br />


P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA<br />

DETECTION AND CLASSIFICATION IN PLANTS<br />

Lionel Morgado 1* & Frank Johannes 2,3 .<br />

Groningen Bioinformatics Centre (GBiC), University of Groningen 1 ; Department of Plant Sciences, Center of Life and<br />

Food Sciences Weihenstephan, Technical University Munich 2 ; Institute of Advanced Studies, Technical University<br />

Munich 3 . * lionelmorgado@gmail.com<br />

Small RNAs (sRNA) have an important role in the regulation of gene expression, either through post-transcriptional<br />

silencing or the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a<br />

given sRNA is closely tied to the Argonaute protein (AGO) to which it binds. High-throughput sequencing in<br />

combination with immunoprecipitation techniques has made it possible to determine the sequences of sRNAs that are<br />

bound to different families of AGO. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing<br />

data of A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our<br />

SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and<br />

classification in plants.<br />

INTRODUCTION<br />

Small RNA molecules are known to have an important<br />

role in gene expression control. It is therefore of extreme<br />

interest to be able to detect them and determine the<br />

regulatory pathways in which they are involved. With the<br />

current laboratory methods it is infeasible to test the large<br />

number of sRNA candidates, but there are computational<br />

methods that can greatly narrow down the list.<br />

Nevertheless, sRNA activity is still far from being fully<br />

understood and that is reflected in the very high false<br />

positive rate of the prediction tools currently available.<br />

High-throughput sequencing in combination with<br />

immunoprecipitation (IP) techniques now makes it<br />

possible to access sRNA sequences associated with<br />

specific AGOs. AGO-sRNA binding is a fundamental step<br />

for the activation of specific silencing pathways. Here,<br />

AGO-sRNA data acquired from A. thaliana is explored<br />

with SVM-based algorithms to learn which sequence<br />

features drive different AGO-sRNA associations. Using<br />

this knowledge, a framework for in silico sRNA detection<br />

and classification in plants is presented.<br />

METHODS<br />

A system with 3 layers of classifiers (see figure 1) was<br />

designed to identify different kinds of sRNA: the 1st layer<br />

includes a binary SVM model that filters out sequences<br />

that do not bind to AGO and are therefore most probably<br />

inactive; the 2nd layer is composed of an ensemble of binary<br />

classifiers, each trained to explore the differences in sRNA<br />

bound to a specific AGO against all others; and finally, the<br />

3rd layer comprises a multiclass linear model to assign the<br />

best-matching AGO to a given sRNA, using scores produced<br />

in the previous layer.<br />

Diverse AGO-sRNA libraries from A. thaliana were<br />

explored, namely from AGO: 1, 2, 4, 5, 6, 7, 9 and 10.<br />

After the typical RNA-seq library preprocessing, quality<br />

check and genome mapping, several features were<br />

extracted from the remaining sequences, namely: position<br />

specific base composition, sequence length, k-mer<br />

composition and entropy scores. The different feature sets<br />

were explored separately and in different combinations.<br />
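The feature families listed above can be sketched for a single sequence as follows. This is an illustrative sketch only: the number of encoded positions, the k-mer size and the exact entropy definition are assumptions, since the abstract does not specify them, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGU"

def positional_composition(seq, n_positions=5):
    """One-hot encode the base found at each of the first n_positions (5' end)."""
    features = {}
    for i in range(n_positions):
        for base in BASES:
            hit = i < len(seq) and seq[i] == base
            features[f"pos{i + 1}_{base}"] = 1 if hit else 0
    return features

def kmer_composition(seq, k=2):
    """Relative frequency of every possible k-mer over the ACGU alphabet."""
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {f"kmer_{''.join(p)}": counts["".join(p)] / windows
            for p in product(BASES, repeat=k)}

def shannon_entropy(seq):
    """Shannon entropy (bits) of the base composition of a non-empty sequence."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

def extract_features(seq):
    """Combine length, entropy, positional and k-mer features in one dict."""
    features = {"length": len(seq), "entropy": shannon_entropy(seq)}
    features.update(positional_composition(seq))
    features.update(kmer_composition(seq))
    return features

# Arbitrary 21-nt example sequence (not taken from the study).
example = extract_features("UGACAGAAGAGAGUGAGCACA")
```

Each sequence becomes a fixed-length numeric vector, which is the form an SVM needs regardless of the length of the original sRNA.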

Initially, highly correlated features (Pearson correlation > 0.75)<br />

were removed, and the remaining ones were further<br />

subjected to selection using SVM-RFE (Guyon et al.,<br />

2002) with a linear kernel to handle the large data set size.<br />

A 10-fold cross-validation procedure was used to<br />

account for the variation in the data; the best features<br />

of each round were those with the highest<br />

average weight across the models with the best ROC-AUC<br />

score in each cross-validation subset. In each round, the<br />

worst-performing 1/3 of the remaining features was<br />

eliminated, and the process was repeated until no<br />

features remained. The best features found were then<br />

used to train the final classifiers using RBF kernels with<br />

optimal parameters. This was repeated for all models in<br />

layers 1 and 2.<br />
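The two selection steps described above (a correlation filter followed by recursive elimination of the worst third of the features) can be sketched in plain Python. This is a schematic under stated assumptions: `weight_fn` is a hypothetical stand-in for the cross-validated linear SVM weights, the greedy order of the correlation filter is a choice made for this sketch, and the toy data are invented.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def drop_correlated(columns, threshold=0.75):
    """Greedy filter: keep a feature only if |r| <= threshold with all kept ones."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def recursive_elimination(features, weight_fn):
    """Repeatedly drop the lowest-weighted third until no feature is left.

    weight_fn(remaining) must return {feature: weight}; in practice the
    weights would be re-estimated from freshly trained models each round.
    Returns the list of feature sets evaluated, one per round.
    """
    remaining = list(features)
    history = []
    while remaining:
        history.append(list(remaining))
        weights = weight_fn(remaining)
        ranked = sorted(remaining, key=lambda f: weights[f], reverse=True)
        n_drop = max(1, len(ranked) // 3)  # always remove at least one
        remaining = ranked[:len(ranked) - n_drop]
    return history

# Toy feature table: "b" is almost a multiple of "a" and gets filtered out.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [1, -1, 1, -1]}
uncorrelated = drop_correlated(columns)

# Toy weights standing in for averaged SVM weights (hypothetical values).
toy_weights = {f"f{i}": 1.0 - i / 10 for i in range(1, 10)}
rounds = recursive_elimination(toy_weights, lambda feats: {f: toy_weights[f] for f in feats})
sizes = [len(r) for r in rounds]
```

With nine starting features the evaluated set sizes shrink as 9, 6, 4, 3, 2, 1, so the number of model-training rounds grows only logarithmically with the number of features.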

[Figure 1 panel labels. Layer 1: AGO vs noAGO. Layer 2: AGO1 vs otherAGO, AGO2 vs otherAGO, …, AGO10 vs otherAGO. Layer 3: Final AGO prediction.]<br />

FIGURE 1. Proposed architecture for the SVM-based framework.<br />

RESULTS & DISCUSSION<br />

Although the classifiers are still being optimized,<br />

preliminary results from the 2nd layer of the framework<br />

(see figure 1) show that the features ranked highest by<br />

SVM-RFE do reflect significant biological patterns for<br />

AGO-sRNA association. Among others, the relevance of<br />

the 5’ terminal nucleotide was observed, in agreement<br />

with findings from previous work (Mi et al., 2008).<br />

Additionally, the accuracy of the trained models<br />

ranges from 71% to 86%, showing their<br />

capacity to recognize specific AGO-binding patterns.<br />

REFERENCES<br />

Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn<br />

46:389-422 (2002).<br />

Mi S et al. Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the<br />

5’ terminal nucleotide. Cell 133(1): 116-27 (2008).<br />

Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5:<br />

413, 209-215 (2014).<br />


P37. ANALYSIS OF RELATIONSHIP PATTERNS<br />

IN UNASSIGNED MS/MS SPECTRA<br />

Aida Mrzic 1,2* , Wout Bittremieux 1,2 , Trung Nghia Vu 4 , Dirk Valkenborg 3,5,6 , Bart Goethals 1 & Kris Laukens 1,2 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />

Antwerpen (biomina) 2 ; Flemish Institute for Technological Research (VITO), Mol 3 ; Karolinska Institutet, Stockholm 4 ;<br />

CFP, University of Antwerp 5 ; I-BioStat, Hasselt University 6 . * aida.mrzic@uantwerpen.be<br />

Tandem mass spectrometry (MS/MS) spectra generated in proteomics experiments often contain a large portion of<br />

unexplained peaks, despite continuous improvements in search engines. Here we use a pattern mining technique to determine<br />

the origin of these unassigned spectra. We discover patterns that indicate the presence of chimeric spectra and missed<br />

post-translational modifications (PTMs).<br />

INTRODUCTION<br />

Despite being a rich source of information, mass<br />

spectra acquired in mass spectrometry proteomics<br />

experiments often contain a significant number of<br />

unexplained peaks, or even remain completely<br />

unidentified. The unexplained fraction of mass spectra<br />

may come from low-quality or chimeric MS/MS spectra,<br />

or unexpected PTMs. To interpret the unexplained data,<br />

we propose a structured analysis of the peaks occurring in<br />

MS/MS spectra. We employ an unsupervised pattern<br />

mining technique (Naulaerts et al., 2013) to discover<br />

which peaks are associated with each other, and therefore<br />

are likely to have a common origin.<br />

METHODS<br />

Frequent itemset mining<br />

The technique we used to discover relationships between<br />

frequently co-occurring peaks in MS/MS data is frequent<br />

itemset mining, a class of data mining techniques that is<br />

specifically designed to discover co-occurring items in<br />

transactional datasets. The typical example of frequent<br />

itemset mining is the discovery of sets of products that are<br />

frequently bought together. Here, every set of products<br />

purchased together represents a single transaction, which<br />

results in a dataset consisting of a large number of<br />

supermarket basket transactions that can be mined for<br />

frequent patterns (Figure 1). In our approach a transaction<br />

consists of the mass differences between relevant peaks in<br />

the MS/MS spectrum.<br />

FIGURE 1. Frequent itemset mining principle.<br />
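The supermarket-basket principle can be made concrete with a compact level-wise (Apriori-style) miner. This is a minimal sketch for illustration, not the implementation used in the study; the basket contents and the function name are invented.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets and prune any candidate
        # with an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return result

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
patterns = frequent_itemsets(baskets, min_support=3)
```

With a minimum support of three baskets, the pair {diapers, beer} is reported as frequent, whereas {bread, milk, diapers} occurs in only two baskets and is discarded.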

Mass difference associations<br />

In order to detect relationships between different types of<br />

mass spectrometry peaks, a distinction is made between<br />

peaks that were relevant for spectrum identification<br />

(assigned peaks) and peaks that were not used for the<br />

identification (unassigned peaks) (Vu et al., 2013). The<br />

mass differences between peaks (either assigned,<br />

unassigned, or both) are then calculated so that for each<br />

MS/MS spectrum in the dataset there is a single<br />

transaction consisting of all its mass differences.<br />
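Turning one spectrum into a transaction of mass differences might look like the sketch below. Two assumptions are made that the abstract does not specify: differences are rounded to one decimal so that equal differences from different spectra map to the same item, and an optional upper bound limits the differences considered. The peak values are invented.

```python
from itertools import combinations

def mass_difference_transaction(peaks_mz, decimals=1, max_diff=None):
    """Return the set of rounded pairwise m/z differences for one spectrum."""
    differences = set()
    for low, high in combinations(sorted(peaks_mz), 2):
        diff = round(high - low, decimals)
        if diff > 0 and (max_diff is None or diff <= max_diff):
            differences.add(diff)
    return differences

# Toy spectrum with three peaks (invented m/z values).
spectrum = [300.2, 380.2, 398.2]
transaction = mass_difference_transaction(spectrum)

# One transaction per spectrum yields the dataset to be mined.
dataset = [mass_difference_transaction(s) for s in ([300.2, 380.2, 398.2], [512.4, 530.4])]
```

Here the difference of 18.0 appears in both toy spectra, so it would be counted twice when the resulting transactions are mined for frequent items.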

After obtaining these transactions for all MS/MS spectra<br />

in the dataset, frequent itemset mining can be employed to<br />

detect relationship patterns (Figure 2). These patterns can<br />

indicate previously unknown characteristics of the spectra,<br />

or even detect novel PTMs.<br />

FIGURE 2. Outline of the approach.<br />

RESULTS & DISCUSSION<br />

In order to evaluate our approach, we used MS/MS<br />

datasets from the PRoteomics IDEntifications (PRIDE)<br />

database (Vizcaino et al., 2013). This database contains a<br />

large number of publicly available datasets from mass<br />

spectrometry-based proteomics experiments. However, the<br />

quality of the submitted datasets varies widely,<br />

which makes the database a suitable candidate for our<br />

pattern mining approach.<br />

Preliminary results show that the detected patterns are able<br />

to capture valid information in a spectrum. The obtained<br />

patterns indicate peaks originating from the same peptide<br />

in the case of chimeric spectra and mass differences<br />

originating from common PTMs.<br />

REFERENCES<br />

Naulaerts et al. Brief Bioinform, 16(2): 216–231 (<strong>2015</strong>).<br />

Vizcaino et al. Nucleic Acids Res, 41(D1):D1063-9 (2013).<br />

Vu et al. Proteome Science, 12:54 (2014).<br />


P38. MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION<br />

Stefan Naulaerts 1,2* , Pieter Meysman 1,2 , Bart Goethals 1 , Wim Vanden Berghe 3 & Kris Laukens 1,2 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />

Antwerpen (biomina) 2 ; Department for Biomedical Sciences, University of Antwerp 3 . * stefan.naulaerts@uantwerpen.be<br />

Drug resistance and response have traditionally been investigated by means of case-by-case studies. Profiling drug<br />

compounds is time- and resource-intensive. Large-scale information on gene expression, protein abundance and protein<br />

interactions, functional and pathway annotations, and freely accessible repositories of drug targets are now<br />

available. Structural evidence for selected drug compounds is also publicly available. These<br />

data offer an enormous opportunity for data integration and pattern mining efforts across each of these levels. Here, we<br />

apply frequent itemset mining to identify structurally similar compounds, and to detect patterns within the biological<br />

effect profiles of these chemical compound families. Next, we explore how we can link both types of patterns to<br />

meta-information (such as drug interactions) in a bid to identify promising compounds and speed up the drug discovery<br />

process by means of candidate prioritization.<br />

INTRODUCTION<br />

In the last decades, several widely used databases have<br />

emerged. These vary from gene expression data and<br />

mass-spectrometric protein identifications to resources covering<br />

interaction graphs or functional annotations of proteins<br />

and chemicals.<br />

The presence of these resources offers interesting<br />

opportunities to gain deeper insight into drug modes of action<br />

and to help reduce important bottlenecks with regard<br />

to the speed of novel drug discovery or drug repurposing,<br />

by intelligently prioritizing potentially interesting<br />

compounds.<br />

METHODS<br />

To integrate the listed kinds of data, we use pattern mining<br />

methods that are collectively known as “frequent itemset<br />

mining”. This set of techniques uses clever heuristics to<br />

efficiently find sets of items that co-occur more often than a<br />

minimum support threshold. In this work, we identified several<br />

pattern types based on their source:<br />

• Expression itemsets<br />

• Metadata itemsets<br />

• Graph patterns (protein-protein, protein-drug and<br />

chemical structures)<br />

For subgraph mining, we used GASTON 1 . All other data<br />

sources were analysed with Apriori 2 .<br />

To deal with the extreme numbers of patterns that result<br />

from mining this kind of data, we used a filter which<br />

incorporates several quality measures based on objective<br />

data mining measures (e.g. lift), as well as more<br />

biologically inspired methods (e.g. functional coherence in<br />

the Gene Ontology 3 tree).<br />
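Lift, the objective measure named above, compares how often two itemsets co-occur with how often they would co-occur if they were independent. The sketch below computes it on toy transactions; the item labels are invented and the helper names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def lift(transactions, antecedent, consequent):
    """lift(A -> B) = support(A and B) / (support(A) * support(B))."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (support(transactions, antecedent) * support(transactions, consequent))

# Toy annotation transactions, one per compound (invented labels).
annotations = [
    {"scaffold_X", "target_Y"},
    {"scaffold_X", "target_Y"},
    {"scaffold_X", "target_Y"},
    {"scaffold_X"},
    {"target_Y"},
    {"other"},
    {"other"},
    {"other"},
]
value = lift(annotations, {"scaffold_X"}, {"target_Y"})
```

A lift above 1 (here 1.5) means the two items appear together more often than expected by chance, whereas a value near 1 suggests the pattern carries little information and can be filtered out.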

Simple classification based on the patterns was performed<br />

with CBA 4 .<br />

RESULTS & DISCUSSION<br />

We were able to identify several backbone patterns within<br />

the chemical structures studied and used these to define<br />

“chemical compound families”. Next, we used this<br />

classification as starting point to group experimental<br />

evidence (bio-assays, interactions and metadata). After<br />

applying cut-offs based on the quality measures, all<br />

patterns remaining were significant and made sense<br />

biologically.<br />

Unsurprisingly, structurally similar compound families<br />

show significant pattern overlaps in drug-drug interactions,<br />

gene expression, term co-occurrence and conserved<br />

protein-protein interactions. We found that specific<br />

patterns in the biological profile often correlate with<br />

specific discriminative structural patterns. Moreover, these<br />

collections of structural frequent subgraphs seemed highly<br />

relevant for the mode in which a compound connects to<br />

the “core” proteome. This central proteome performs<br />

essential functions of the cell (e.g. energy metabolism) and<br />

it is known to be conserved across cell types. Structurally<br />

distinct compound families converge much later (if at all)<br />

to the same “core proteins” than more similar chemicals<br />

do. This observation is consistent with current<br />

knowledge of pathways and tissue biology.<br />

We were further able to associate previously unseen<br />

compounds to chemicals present in the database, based on<br />

the subgraph collection and by extension to the biological<br />

profile patterns. Manual survey of literature indicated that<br />

several compounds not covered by our database have<br />

recently been approved or are in testing as alternative<br />

drugs to the compounds we hypothesized as being<br />

substantially similar.<br />

FIGURE 1. Visualizing the dexamethasone environment. Both predictions<br />

and experimental evidence (drug-target and protein-protein interactions)<br />

are shown.<br />

REFERENCES<br />

1. Nijssen S & Kok J. ENTCS 127, 77-87 (2005).<br />

2. Agrawal R & Srikant R. Proc 20th Int Conf on Very Large Databases<br />

(1994).<br />

3. Ashburner M et al. Nat Genet 25, 25-29 (2000).<br />

4. Liu B et al. KDD (1998).<br />


P39. ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX<br />

HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS<br />

ARABIDOPSIS<br />

Polina Novikova 1 , Nora Hohmann 2 , Marcus Koch 2 & Magnus Nordborg 1 .<br />

Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), A-1030 Vienna, Austria 1 ; Centre for<br />

Organismal Studies Heidelberg, University of Heidelberg, D-69120 Heidelberg, Germany 2 .<br />

*magnus.nordborg@gmi.oeaw.ac.at<br />

The prevailing notion of species rests on the concept of reproductive isolation. Under this model, sister taxa should not<br />

share genetic variation unless they still hybridize, or diverged too recently for genetic drift to have eliminated shared<br />

ancestral polymorphism, and gene trees should generally agree with species trees. Advances in sequencing technology<br />

are finally making it possible to evaluate this model. We sequenced (Illumina 100bp paired reads) multiple individuals<br />

from 26 proposed taxa in the genus Arabidopsis. Cluster analysis identified seven distinct groups, corresponding to four<br />

common species — the model species A. thaliana, plus A. arenosa, A. halleri and A. lyrata — and three species with<br />

very limited geographical distribution. However, at the level of gene trees, only the separation of A. thaliana from the<br />

remaining taxa was universally supported, and even in this case there was abundant sharing of ancestral polymorphism<br />

with the other taxa, demonstrating that reproductive isolation must be fairly recent. By considering the distribution of<br />

derived alleles, we were also able to reject a bifurcating species tree because there is clear evidence for asymmetrical<br />

gene flow between taxa. Finally, we show that the pattern of sharing and divergence between taxa differs between gene<br />

ontologies, suggesting a role for selection.<br />
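One widely used way to test a bifurcating tree from derived-allele sharing is the ABBA-BABA (D) statistic. The abstract does not state which test the authors applied, so the sketch below is only an illustration of the general idea and not their method; the four-taxon layout (P1, P2, P3, outgroup) and the toy sites are assumptions.

```python
def d_statistic(sites):
    """Compute D from biallelic sites given as (P1, P2, P3, outgroup) alleles.

    The outgroup allele is treated as ancestral. ABBA sites have the derived
    allele shared by P2 and P3; BABA sites have it shared by P1 and P3.
    Returns (D, n_abba, n_baba); D is 0.0 if no informative site exists.
    """
    abba = baba = 0
    for p1, p2, p3, outgroup in sites:
        if p3 == outgroup:
            continue  # P3 carries the ancestral allele: uninformative
        if p1 == outgroup and p2 == p3:
            abba += 1
        elif p2 == outgroup and p1 == p3:
            baba += 1
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return d, abba, baba

# Toy sites (invented): six ABBA, two BABA and five uninformative patterns.
toy_sites = (
    [("A", "G", "G", "A")] * 6
    + [("G", "A", "G", "A")] * 2
    + [("A", "A", "G", "A")] * 5
)
d, n_abba, n_baba = d_statistic(toy_sites)
```

Under a strictly bifurcating tree with incomplete lineage sorting only, ABBA and BABA patterns are expected in equal numbers (D close to 0); a consistent excess of one pattern points to gene flow between the corresponding taxa.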


P40. RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN<br />

READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES<br />

Volodimir Olexiouk 1,* , Jeroen Crappé 1 , Steven Verbruggen 1 & Gerben Menschaert 1,* .<br />

Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and<br />

Bioinformatics, Faculty of Bioscience Engineering, Ghent University 1 .<br />

INTRODUCTION<br />

Evidence for micropeptides, defined as translation<br />

products from small open reading frames (sORFs), has<br />

recently emerged. Limitations attributed to<br />

sequencing technologies as well as proteomics had<br />

stalled the discovery of micropeptides. It is the advent of<br />

ribosome profiling (RIBO-SEQ), a next-generation<br />

sequencing technique revealing the translation machinery<br />

at sub-codon resolution, that provided evidence in favor<br />

of translated sORFs. RIBO-SEQ captures and<br />

subsequently sequences the approximately 30 nt mRNA fragments<br />

protected within ribosomes, providing a means to identify<br />

translated sORFs, possibly encoding functional<br />

micropeptides. Since the advent of ribosome profiling,<br />

several micropeptides with important cellular<br />

functions have been described (e.g. Toddler, Pri-peptides,<br />

Sarcolipin and Myoregulin).<br />

METHODS<br />

RIBO-SEQ allows the identification of sORFs with<br />

ribosomal activity; however, to further assess the<br />

coding potential (the potential of sORFs to truly encode<br />

functional micropeptides), downstream analysis is<br />

necessary. Here we propose a pipeline which starts from<br />

RIBO-SEQ, implements state-of-the-art tools and metrics<br />

assessing the coding potential of sORFs and creates a list<br />

of candidate sORFs for downstream analysis (e.g.<br />

proteomic identification). In summary, assessment of the<br />

coding potential includes: PhyloCSF (conservation<br />

analysis), FLOSS-score (ribosome protected fragment<br />

(RPF) length distribution analysis), ORFscore (distribution<br />

analysis of RPFs towards the first frame of a coding<br />

sequence (CDS)), BLASTp (sequence similarity) and VarAn<br />

(genetic variation analysis). In an attempt to set a<br />

community standard and to make sORFs accessible<br />

to a larger audience, a public database (www.sorfs.org) is<br />

provided in which publicly available datasets were processed<br />

by this pipeline, allowing users to browse, query and<br />

export identified ORFs. Furthermore, a PRIDE-respin<br />

pipeline was developed in order to periodically search the<br />

PRIDE database for proteomic evidence.<br />
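Of the metrics listed above, the frame-distribution idea behind ORFscore is the easiest to illustrate. The sketch below follows the chi-square-style definition commonly given for this score; treat the exact formula as an assumption to be checked against the original ORFscore publication. It also assumes that each read has already been reduced to a single P-site position, and the function names are hypothetical rather than taken from the pipeline.

```python
import math

def frame_counts(psite_positions, orf_start):
    """Count RPF P-site positions falling in frames 1, 2 and 3 of an ORF."""
    counts = [0, 0, 0]
    for position in psite_positions:
        counts[(position - orf_start) % 3] += 1
    return counts

def orf_score(counts):
    """Signed log2 chi-square-like deviation from a uniform frame distribution.

    The score is positive when frame 1 holds the most reads and negative
    when frame 2 or frame 3 holds more reads than frame 1.
    """
    f1, f2, f3 = counts
    expected = (f1 + f2 + f3) / 3
    if expected == 0:
        return 0.0  # no reads, no evidence either way
    chi = sum((f - expected) ** 2 / expected for f in counts)
    score = math.log2(chi + 1)
    return -score if (f1 < f2 or f1 < f3) else score

in_frame = orf_score([90, 5, 5])       # reads concentrated in frame 1
out_of_frame = orf_score([5, 90, 5])   # reads concentrated in frame 2
uniform = orf_score([10, 10, 10])      # no frame preference
```

Reads from actively translated ORFs show three-nucleotide periodicity and pile up in the first frame, so a strongly positive score supports translation while a score near zero or below does not.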

RESULTS & DISCUSSION<br />

The pipeline has been tested and curated on three different<br />

cell lines: HCT116 (human), E14<br />

mESC (mouse) and S2 (fruit fly). The results obtained<br />

were similar to those reported in recent<br />

literature, supporting the pipeline's relevance. All metrics, as stated<br />

above, have been carefully inspected for their biological<br />

relevance and contributed significantly to the detection of<br />

sORFs. The pipeline is currently being finalized but<br />

is available upon request. The public repository is<br />

accessible at http://www.sorfs.org, and includes the<br />

datasets mentioned above, resulting in 263,354 sORFs. Two<br />

querying interfaces were implemented: a default query<br />

interface intended for browsing sORFs and a BioMart<br />

query interface for advanced querying and export<br />

functions. Each sORF has its own detail page, visualizing<br />

the metrics discussed above and the ribosome profiling data;<br />

a link to the UCSC browser is also provided, visualizing<br />

the RIBO-SEQ data.<br />

REFERENCES<br />

Pauli,A., Norris,M.L., Valen,E., Chew,G.-L., Gagnon,J. a,<br />

Zimmerman,S., Mitchell,A., Ma,J., Dubrulle,J., Reyon,D., et al.<br />

(2014) Toddler: an embryonic signal that promotes cell movement<br />

via Apelin receptors. Science, 343, 1248636.<br />

Crappé,J., Ndah,E., Koch,A., Steyaert,S., Gawron,D., De Keulenaer,S.,<br />

De Meester,E., De Meyer,T., Van Criekinge,W., Van Damme,P., et<br />

al. (2014) PROTEOFORMER: deep proteome coverage through<br />

ribosome profiling and MS integration. Nucleic Acids Res.,<br />

10.1093/nar/gku1283.<br />

Ingolia,N.T. (2014) Ribosome profiling: new views of translation, from<br />

single codons to genome scale. Nat. Rev. Genet., 15, 205–13.<br />

Crappé,J., Van Criekinge,W., Trooskens,G., Hayakawa,E., Luyten,W.,<br />

Baggerman,G. and Menschaert,G. (2013) Combining in silico<br />

prediction and ribosome profiling in a genome-wide search for novel<br />

putatively coding sORFs. BMC Genomics, 14, 648.<br />

Chanut-Delalande,H., Hashimoto,Y., Pelissier-Monier,A., Spokony,R.,<br />

Dib,A., Kondo,T., Bohère,J., Niimi,K., Latapie,Y., Inagaki,S., et al.<br />

(2014) Pri peptides are mediators of ecdysone for the temporal<br />

control of development. Nat. Cell Biol., 16<br />


P41. RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE<br />

ALIGNMENT<br />

Gabriele Orlando 1,2,3,4 , Wim Vranken 1,2,3 & Tom Lenaerts 1,4,5 .<br />

Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, CP 263 1 ; Structural<br />

Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2 2 ; Structural Biology Research Center, VIB, 1050 Brussels,<br />

Belgium 3 ; Machine Learning group, Université Libre de Bruxelles, Brussels, 1050, Belgium 4 ; Artificial Intelligence<br />

lab, Vrije Universiteit Brussel, Brussels, 1050, Belgium 5 .<br />

INTRODUCTION<br />

Reliable protein alignments are a central problem for<br />

many bioinformatics tools, such as homology modelling.<br />

Over the years many different algorithms have been<br />

developed and different kinds of information have been<br />

used to align very divergent sequences [1]. Here we<br />

present a pairwise alignment tool, called Rigapollo, based<br />

on pairwise HMM-SVM, which includes backbone<br />

dynamics predictions [2] in the alignment process: recent<br />

work suggests that protein backbone dynamics is often<br />

evolutionarily conserved and contains information<br />

orthogonal to the amino acid conservation.<br />

METHODS<br />

Rigapollo uses a pairwise HMM-SVM alignment<br />

approach to infer the optimal alignment between two<br />

proteins, taking into consideration both sequence and<br />

dynamic information. The model (described in Figure 1) is<br />

composed of 3 states: M (match), G1 (gap in the first<br />

sequence) and G2 (gap in the second sequence). The<br />

transition probabilities are defined in the same way as a<br />

standard HMM. This new alignment tool is further<br />

designed in the following manner:<br />

Defining the N-dimensional feature vectors:<br />

Each amino acid in the sequences is described by an N-<br />

dimensional feature vector. That vector can be defined<br />

using any kind of information, ranging from evolutionary<br />

information (i.e. PSSM calculated with HHblits [3]) to<br />

dynamics predictions (using the DynaMine predictor [2]).<br />

While standard pairwise HMMs require the definition of a<br />

finite and discrete alphabet of observable states, our model<br />

works directly on these feature vectors (which may or<br />

may not be orthonormal), evaluating the emission<br />

probability with a support vector machine (SVM).<br />

Definition of the emission probability:<br />

We define the emission probability using an SVM trained<br />

to discriminate matches from mismatches. We define as<br />

matches all the positions in the reference pairwise<br />

alignments that do not contain gaps, and we describe them<br />

by the concatenation of the previously defined feature<br />

vectors. These matches are considered positive examples.<br />

For the mismatches, we perform the same<br />

procedure, but pair positions that, in the reference<br />

alignment, are shifted by a number of amino acids varying<br />

between 5 and 10. After training, the predicted<br />

emission probability for the M state, given the<br />

concatenation of two feature vectors, is a function of<br />

the distance D from the decision hyperplane of the SVM<br />

(called f(D)). The corresponding emission probabilities for<br />

the states G1 and G2 are modelled as 1-f(D).<br />
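The construction of the training examples and of the emission probabilities described above can be sketched as follows. This is only an illustrative sketch and not the Rigapollo implementation: the function names, the random choice of shift direction and the logistic form chosen for f(D) are all assumptions, since the text only states that f is a function of the distance D.<br />

```python
import numpy as np


def training_pairs(feat_a, feat_b, aligned_pairs, rng, min_shift=5, max_shift=10):
    """Build SVM training examples from one reference alignment.

    feat_a, feat_b : arrays of shape (len_a, N) and (len_b, N), holding one
                     N-dimensional feature vector per residue.
    aligned_pairs  : list of (i, j) residue indices taken from the ungapped
                     columns of the reference alignment.
    Returns (X, y) with y = +1 for matches and y = -1 for shifted mismatches.
    """
    X, y = [], []
    for i, j in aligned_pairs:
        # Positive example: concatenation of the two aligned feature vectors.
        X.append(np.concatenate([feat_a[i], feat_b[j]]))
        y.append(1)
        # Negative example: residue i paired with a residue shifted by 5 to 10
        # positions. The random direction is an assumption of this sketch.
        shift = int(rng.integers(min_shift, max_shift + 1)) * int(rng.choice([-1, 1]))
        k = j + shift
        if 0 <= k < len(feat_b):
            X.append(np.concatenate([feat_a[i], feat_b[k]]))
            y.append(-1)
    return np.array(X), np.array(y)


def emission_probabilities(distance):
    """Map a signed SVM decision distance D to (P_M, P_G1, P_G2).

    f(D) is taken to be a logistic function here (an assumption); the gap
    states G1 and G2 both receive 1 - f(D), as stated in the text.
    """
    f = 1.0 / (1.0 + np.exp(-distance))
    return f, 1.0 - f, 1.0 - f
```

In such a setup the signed distance D would come from the trained SVM, and the three returned values would be used as emission terms of the M, G1 and G2 states during decoding.<br />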

RESULTS & DISCUSSION<br />

For the evaluation of the performance of Rigapollo, we<br />

adopted two publicly available subsets of the Balibase and<br />

SABmark alignment datasets, already used to evaluate<br />

other pairwise alignment tools [1]; from the MSAs, all<br />

pairwise alignments were extracted, and all those<br />

that shared a percentage of sequence identity equal to the median<br />

of that of the full database were put in the subset.<br />

The datasets consist of 38 and 123 manually<br />

curated, structure-based pairwise alignments, respectively, and they<br />

share very low sequence identity. For the evaluation of the<br />

performance we performed a 10-fold randomized cross-validation.<br />

Rigapollo increases the quality of low sequence<br />

identity pairwise alignments by 5 to 10% with respect to<br />

state-of-the-art methods, and it appears that the<br />

increase in performance is more marked in very<br />

divergent sequences, such as those in the<br />

SABmark dataset, where the dynamics information seems<br />

to significantly increase the quality of the alignment. This<br />

is probably due to the fact that dynamics are often well<br />

conserved in functional patterns, even when the sequence<br />

is not preserved [2].<br />

Figure 1: Structure of the pairwise HMM-SVM model<br />

REFERENCES<br />

[1] Do, Chuong B., et al. Research in Computational Molecular Biology.<br />

Springer Berlin Heidelberg, 2006.<br />

[2] Cilia, Elisa, et al. Nucleic acids research 42.W1 (2014): W264-W270.<br />

[3] Remmert, Michael, et al. Nature methods 9.2 (2012): 173-175.<br />


P42. EARLY FOLDING AND LOCAL INTERACTIONS<br />

R. Pancsa 1 , M. Varadi 1 , E. Cilia 2,3 , D. Raimondi 1,2,3 & W. F. Vranken 1,3,* .<br />

Structural Biology Research Centre, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium 1 ;<br />

Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium 2 ; Interuniversity Institute of Bioinformatics<br />

in Brussels (IB) 2 , Brussels, Belgium 3 . * wvranken@vub.ac.be<br />

INTRODUCTION<br />

Protein folding is in its early stages largely determined by<br />

the protein sequence and complex local interactions<br />

between amino acids, resulting in the formation of foldons<br />

that provide the context for further folding into the native<br />

state. These early folding processes are therefore<br />

important to understand subsequent folding steps and their<br />

influence on, for example, aggregation, but they are<br />

difficult to study experimentally. Here we address this<br />

issue computationally by assembling a<br />

dataset of early folding residues from hydrogen-deuterium<br />

exchange (HDX) data from NMR and MS, and by analysing<br />

how these residues relate to the sequence-based backbone dynamics<br />

predictions from DynaMine (Cilia et al. 2013, 2014) and<br />

to evolutionary information from multiple sequence<br />

alignments.<br />

METHODS<br />

We assembled a dataset of HDX experimental data from<br />

NMR and MS from literature for 57 proteins totalling<br />

4172 residues. The data were classified into early,<br />

intermediate and late classes depending on the folding<br />

time where protection of the backbone NH was observed,<br />

and into strong, medium and weak classes depending on<br />

how long the amides remain protected upon unfolding the<br />

native state. This resulted in 219 residue sets that are<br />

organised in XML files and loaded into a database that is<br />

made available online via http://start2fold.eu.<br />

The DynaMine predictions were run locally with a new<br />

version of the software that handles C- and N-terminal<br />

effects. These original predictions were then normalised<br />

by shifting them so that the maximum prediction value for<br />

each protein is always 1.0, which leaves the relative<br />

differences between the prediction values within each<br />

protein unaffected but effectively normalises the values between<br />

different proteins. MSAs were generated for each<br />

sequence in the dataset using HHblits and Jackhmmer with<br />

3 iterations and an E-value threshold of 10^-4. All the retrieved<br />

homologs have a minimum of 90% coverage with the query<br />

sequence. By using HHfilter, a post-processing tool<br />

provided in the HHblits package, we built two different<br />

sets of MSAs by varying the maximum pairwise sequence<br />

identity threshold between the collected homologs in each<br />

MSA. DynaMine predictions for the (ungapped) sequences in the MSAs were<br />

made without normalisation in order to preserve the<br />

differences within a protein family, and mapped back to<br />

the full (gapped) MSA.<br />
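The shift normalisation described above amounts to adding one constant per protein. A minimal sketch, in which the function name and the example values are purely illustrative:<br />

```python
import numpy as np


def normalise_by_shift(predictions):
    """Shift the per-residue predictions of one protein so that the maximum is 1.0.

    A single constant is added to every value, so the differences between
    residues of the same protein are left unchanged.
    """
    predictions = np.asarray(predictions, dtype=float)
    return predictions + (1.0 - predictions.max())


# Illustrative values only, not taken from the dataset.
example = normalise_by_shift([0.60, 0.80, 0.75])
```

Because only an offset is applied, two proteins with different absolute prediction levels end up on a common scale anchored at their most rigid residue.<br />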

RESULTS & DISCUSSION<br />

Our analysis shows that the DynaMine-predicted rigidity<br />

of the protein backbone indicates where the protein is<br />

likely to adopt specific lower free energy conformations<br />

based on sequence-encoded local interactions, as<br />

evidenced by the HDX data on early folding (Figure 1).<br />

This effect is also present on a per-residue basis.<br />

FIGURE 1. Distribution of DynaMine predictions for early folding<br />

residues (green) and non-early folding residues (brown) for the original<br />

(left) and normalized (right) values.<br />

When relating the secondary structure elements as<br />

observed in the native fold to the early folding residues,<br />

we observe that the ‘early folding’ secondary structure<br />

elements also tend to be more rigid overall. Finally, we<br />

examined whether early folding is conserved in evolution<br />

on the basis of multiple sequence alignments. Although<br />

there is no conservation of individual amino acids, the<br />

physical characteristic of a rigid backbone seems to be<br />

conserved.<br />

We therefore propose that the backbone dynamics of the<br />

protein is a fundamental physical feature conserved by<br />

proteins that can provide important insights into their<br />

folding mechanisms and stability.<br />

REFERENCES<br />

Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013).<br />

From protein sequence to dynamics and disorder with DynaMine.<br />

Nature Communications, 4, 2741.<br />

http://doi.org/10.1038/ncomms3741<br />

Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2014).<br />

The DynaMine webserver: predicting protein dynamics from<br />

sequence. Nucleic Acids Research, 42(Web Server), W264–W270.<br />

http://doi.org/10.1093/nar/gku270<br />


P43. BINDING SITE SIMILARITY DRUG REPOSITIONING:<br />

A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY<br />

AND SIDE EFFECTS DETECTION<br />

Daniele Parisi & Yves Moreau.<br />

I developed a protocol based on the prediction of druggable cavities, the comparison of these putative binding sites, and the cross-docking<br />

of bound ligands into binding sites detected as similar to that of the original complex, in order to study the<br />

cross-reactivity of known compounds. It is a general method because it applies both to drug repositioning<br />

and to the study of adverse effects, and it is systematic because it consists of several consecutive steps. It indicates<br />

which ligands to screen, reducing the number of candidates and sparing companies and universities the money and time<br />

spent on unnecessary tests.<br />

INTRODUCTION<br />

The ability of small molecules to interact with multiple<br />

proteins is referred to as polypharmacology [1] , and the<br />

strategy that aims to exploit the positive aspects of<br />

polypharmacology is drug repositioning, whereby existing<br />

drugs are investigated for efficacy against targets for other<br />

indications. Existing drugs are privileged structures with<br />

verified bioavailability and compatibility. Furthermore,<br />

virtual screening allows existing drugs to be<br />

repositioned against novel disease targets without the<br />

expense of purchasing thousands of compounds [2] . The<br />

combination of structure-based virtual screening (such as<br />

estimation of similarity of protein-ligand binding sites and<br />

consequent cross-docking) and drug repositioning<br />

represents a highly efficient and fast methodology for<br />

predicting cross-reactivity and putative side effects of drug<br />

candidates [3] .<br />

METHODS<br />

Each step of my work relies on a bioinformatics<br />

technique or tool, so that the protocol as a whole couples<br />

several software packages.<br />

1. Choice of the query (a single protein<br />

as a PDB file) and the templates (a set of PDB<br />

structures). At least one of the two must<br />

contain a ligand bound in a cavity;<br />

2. prediction of druggable cavities in all the protein<br />

structures using a geometry-based or an energy-based<br />

algorithm (Fpocket, a geometry-based tool, in my case);<br />

3. comparison of the query binding sites with the binding<br />

sites of the templates to assess their similarity. This can<br />

be carried out by an alignment-based or alignment-free<br />

algorithm (I used Apoc, an alignment-based tool);<br />

4. cross-docking of the ligand available in the pair of<br />

similar binding sites into the other cavity, in order to<br />

study its binding to a different target for toxicity or<br />

new therapeutic indications (AutoDock Vina);<br />

5. fingerprinting of the new ligand-cavity complex for<br />

scoring the docking poses.<br />

I applied this protocol to two different queries (Thrombin<br />

and Dihydrofolate reductase), using a data set of 1067<br />

druggable proteins as templates (Druggable Cavity<br />

Directory).<br />
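The control flow of the five steps above can be sketched as a small orchestration function. This is a hypothetical skeleton and not the protocol's actual code: the cavity detector, the site comparison and the docking routine are passed in as plain callables (standing in for tools such as Fpocket, Apoc and AutoDock Vina, whose real interfaces are not reproduced here), ligands are simplified to one per structure, and step 5 is reduced to sorting by docking score, assuming that lower scores are better.<br />

```python
from typing import Callable, Dict, Iterable, List, Tuple


def cross_reactivity_candidates(
    query: str,
    templates: Iterable[str],
    bound_ligands: Dict[str, str],
    detect_cavities: Callable[[str], List[str]],
    site_similarity: Callable[[str, str], float],
    dock: Callable[[str, str], float],
    min_similarity: float,
) -> List[Tuple[str, str, str, float]]:
    """Return (ligand, target_cavity, source_structure, docking_score) tuples.

    All callables are hypothetical placeholders supplied by the caller.
    """
    hits = []
    query_cavities = detect_cavities(query)                  # step 2
    for template in templates:                               # step 1: chosen inputs
        for t_cav in detect_cavities(template):              # step 2
            for q_cav in query_cavities:
                # Step 3: keep only pairs of sufficiently similar binding sites.
                if site_similarity(q_cav, t_cav) < min_similarity:
                    continue
                # Step 4: dock each available ligand into the *other* cavity.
                if template in bound_ligands:
                    lig = bound_ligands[template]
                    hits.append((lig, q_cav, template, dock(lig, q_cav)))
                if query in bound_ligands:
                    lig = bound_ligands[query]
                    hits.append((lig, t_cav, query, dock(lig, t_cav)))
    # Step 5 (fingerprint-based pose scoring) is simplified to a sort by score.
    return sorted(hits, key=lambda hit: hit[3])
```

Keeping the tools behind callables makes it possible to swap an alignment-based comparison for an alignment-free one without touching the loop.<br />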

RESULTS & DISCUSSION<br />

The method works well in repositioning ligands among<br />

proteins of the same family (intraprotein), but is not able<br />

to detect interprotein similarities (among unrelated<br />

proteins). This is due to the large size of the<br />

predicted cavities (larger than the space actually occupied by<br />

the ligand) combined with the alignment-based algorithm used,<br />

which together make it difficult to reach a sufficient similarity score<br />

and greatly increase the number of false negatives. In<br />

future work I will divide the cavity space into subpockets,<br />

decouple the similarity from the sequence by using<br />

pharmacophoric maps, and combine the structure-based<br />

similarity with ligand-based and network-based similarity. All the<br />

information will be fused with data integration algorithms.<br />

REFERENCES<br />

[1] On the origins of drug polypharmacology, Xavier Jalencas and Jordi<br />

Mestres, Med. Chem. Commun., 2013, 4, 80.<br />

[2] Drug repositioning by structure-based virtual screening, Dik-Lung Ma,<br />

Daniel Shiu-Hin Chan and Chung-Hang Leung, Chem. Soc. Rev.,<br />

2013, 42, 2130.<br />

[3] Comparison and Druggability Prediction of Protein−Ligand Binding<br />

Sites from Pharmacophore-Annotated Cavity Shapes, Jérémy<br />

Desaphy, Karima Azdimousa, Esther Kellenberger, and Didier<br />

Rognan, J. Chem. Inf. Model. 2012, 52, 2287−2299.<br />


P44. ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS<br />

OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO<br />

THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC<br />

APPROACH<br />

Rudy Pelicaen, Koen Illeghems, Luc De Vuyst, and Stefan Weckx * .<br />

Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering<br />

Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB,<br />

Brussels, Belgium. *Stefan.Weckx@vub.ac.be<br />

Acetobacter ghanensis LMG 23848 T and Acetobacter senegalensis 108B are acetic acid bacteria strains that originate<br />

from a spontaneous cocoa bean heap fermentation process. Extensive metabolic and kinetic studies have identified them as strains with interesting<br />

functionalities. Whole-genome sequencing of A. ghanensis LMG 23848 T<br />

and A. senegalensis 108B allowed us to unravel their genetic adaptations to the cocoa bean fermentation ecosystem.<br />

INTRODUCTION<br />

Fermented dry cocoa beans are the basic raw material for<br />

chocolate production. The cocoa pulp-bean mass contents<br />

of the cocoa pods undergo, once taken out of the pods, a<br />

spontaneous fermentation process that lasts four to six<br />

days. This process is characterised by a succession of<br />

yeasts, lactic acid bacteria (LAB), and acetic acid bacteria<br />

(AAB) coming from the environment (De Vuyst et al.,<br />

2015).<br />

METHODS<br />

Total genomic DNA isolation and purification of A.<br />

ghanensis LMG 23848 T and A. senegalensis 108B was<br />

followed by the construction of an 8-kb paired-end library,<br />

454 pyrosequencing, and assembly of the sequence reads<br />

using the GS De Novo Assembler version 2.5.3 with<br />

default parameters. Genome finishing was performed by<br />

PCR assays to close gaps in the draft assembly using<br />

CONSED 23.0. Automated gene prediction and annotation<br />

of the assembled genome sequences were carried out using<br />

the bacterial genome sequence annotation platform<br />

GenDB v2.2 (Meyer et al., 2003). The predicted genes<br />

were functionally characterised using searches in public<br />

databases and bioinformatics tools, and annotations were<br />

manually curated. Comparative analysis of the genome<br />

sequences of the cocoa-derived strains A. ghanensis LMG<br />

23848 T (this study), A. senegalensis 108B (this study), and<br />

A. pasteurianus 386B (Illeghems et al., 2013) was<br />

accomplished by the EDGAR framework (Blom et al.,<br />

2009).<br />

RESULTS & DISCUSSION<br />

The genomes of the strains investigated consisted of a<br />

circular chromosomal DNA sequence with a size of 2.7<br />

Mbp and two plasmids for A. ghanensis LMG 23848 T and<br />

a circular chromosomal DNA sequence with a size of 3.9<br />

Mbp and one plasmid for A. senegalensis 108B (Figure 1).<br />

Comparative analysis revealed that the order of<br />

orthologous genes was highly conserved between the<br />

genome sequences of A. pasteurianus 386B and A.<br />

ghanensis LMG 23848 T . Evidence was found that both<br />

species possessed the genetic ability to be involved in<br />

citrate assimilation and they displayed adaptations in their<br />

respiratory chain. As is the case for many AAB, the<br />

missing gene encoding phosphofructokinase in the<br />

genome sequences of both A. ghanensis LMG 23848 T and<br />

A. senegalensis 108B resulted in a non-functional upper<br />

part of the Embden–Meyerhof–Parnas pathway. However,<br />

the presence of genes coding for membrane-bound PQQ-dependent<br />

dehydrogenases enabled the AAB strains<br />

examined to rapidly oxidise ethanol into acetic acid.<br />

Furthermore, an alternative TCA cycle, characterised by<br />

genes coding for a succinyl-CoA:acetate-CoA transferase<br />

and a malate:quinone oxidoreductase, was present.<br />

In addition, evidence was found in both genome<br />

sequences that glycerol, mannitol and lactate could be<br />

used as energy sources. Thus, although both species<br />

displayed genetic adaptations to the cocoa bean<br />

fermentation process, their dependence on glycerol,<br />

mannitol and lactate may partly explain their low<br />

competitiveness during cocoa bean fermentation processes,<br />

as these substrates first have to be formed through yeast or<br />

LAB activities.<br />

FIGURE 1. Graphical representation of the genomes of A. ghanensis<br />

LMG 23848 T (A) and A. senegalensis 108B (B).<br />

REFERENCES<br />

Blom, J., Albaum, S., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M.,<br />

Goesmann, A., 2009. EDGAR: a software framework for the comparative<br />

analysis of prokaryotic genomes. BMC Bioinformatics 10, 1-14.<br />

De Vuyst, L., Weckx, S., 2015. The functional role of lactic acid bacteria in cocoa<br />

bean fermentation. In: Mozzi, F., Raya, R.R., Vignolo, G.M. (Eds.).<br />

Biotechnology of Lactic Acid Bacteria: Novel Applications. Wiley-Blackwell,<br />

Ames, IA, USA. In press.<br />

Illeghems, K., De Vuyst, L., Weckx, S., 2013.<br />

Complete genome sequence and comparative analysis of Acetobacter<br />

pasteurianus 386B, a strain well-adapted to the cocoa bean fermentation<br />

ecosystem. BMC Genomics 14, 526.<br />

Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., et al., 2003.<br />

GenDB - an open source genome annotation system for prokaryote genomes.<br />

Nucleic Acids Res. 31, 2187-2195.<br />


P45. REPRESENTATIONAL POWER OF GENE FEATURES<br />

FOR FUNCTION PREDICTION<br />

Konstantinos Pliakos 1* , Isaac Triguero 2,3 , Dragi Kocev 4 & Celine Vens 1 .<br />

Department of Public Health and Primary Care, KU Leuven Kulak 1 ; Department of Respiratory Medicine, Ghent<br />

University 2 ; Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center 3 ; Department of<br />

Knowledge Technologies, Jožef Stefan Institute 4 . * konstantinos.pliakos@kuleuven-kulak.be<br />

We present a short study on gene function prediction datasets, revealing an existing issue of non-unique feature<br />

representation, as well as the effect of this issue on hierarchical multi-label classification algorithms.<br />

INTRODUCTION<br />

This study focuses on hierarchical multi-label<br />

classification (HMC). HMC is a variant of classification<br />

where one sample can be assigned to several classes<br />

simultaneously. It differs though from multi-label<br />

classification as these classes are organized in a hierarchy.<br />

That means that a sample belonging to a class<br />

automatically belongs to all its super-classes. Typical<br />

HMC tasks include gene function prediction or text<br />

classification. Here, we focus on the former.<br />

A typical characteristic of genes is that they can be<br />

described in several ways: using information about their<br />

sequence, homology to well-characterized genes,<br />

expression profiles, secondary structure of their derived<br />

proteins, etc. The HMC community has multiple research<br />

datasets at its disposal on gene functions (e.g., (Vens et al.,<br />

2008) or (Schietgat et al., 2010)), each representing genes<br />

by one type of features. Researchers should certainly<br />

take advantage of this amount of data, but the question<br />

arises how “good” these datasets are. How discriminant<br />

are the features describing a gene? This short study<br />

exposes existing data-related problems and addresses<br />

these questions.<br />

DATA STUDY & RESULTS<br />

After careful experimentation on various publicly<br />

available datasets it was noted that some of them suffer<br />

from a large number of duplicate feature vectors. The<br />

reason behind this occurrence is that there are genes<br />

which, despite having different functions, have exactly the<br />

same feature representation. The table below illustrates<br />

this problem in the 20 gene function<br />

prediction datasets described in (Vens et al., 2008) and<br />

(Schietgat et al., 2010).<br />

Organism | Dataset | Nb of genes | Nb of unique gene representations<br />

S. cerevisiae | church | 3755 | 2352<br />

S. cerevisiae | pheno | 1591 | 514<br />

S. cerevisiae | hom | 3854 | 3646<br />

S. cerevisiae | seq | 3919 | 3913<br />

S. cerevisiae | struc | 3838 | 3785<br />

A. thaliana | scop | 9843 | 9415<br />

A. thaliana | struc | 11763 | 11689<br />

TABLE 1. Datasets, the number of genes and their unique representations.<br />

As shown in Table 1, the church (micro-array expression) and<br />

the pheno (phenotype features) datasets suffer the most.<br />

More specifically, in the pheno dataset 67.7% of the gene<br />

representations are duplicates. The most frequent feature<br />

vector appears 315 times, 197 times in the training set and<br />

118 times in the test set. Due to this, 20% of the 582 test<br />

examples will give the same feature vector as input for<br />

prediction. In a decision tree model, for example, these<br />

genes will end up in the same leaf and receive the same<br />

prediction (the average class vector of 197 training<br />

examples), but receive a different error term as they are a<br />

priori associated with a different class label-set. In the<br />

training phase, there may still be a lot of variation in the<br />

class vectors of the 197 genes, but no split exists to<br />

separate them. In the church dataset, the 3755 genes<br />

correspond to only 2352 unique feature descriptors. In<br />

the hom and struc datasets the number of duplicates is<br />

lower but still striking, considering the enormous size<br />

of the feature vectors in these datasets.<br />
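The duplicate statistics discussed above can be obtained with a simple count of identical feature vectors. A minimal sketch, assuming each gene is given as a sequence of hashable feature values; the function name and the returned keys are illustrative only:<br />

```python
from collections import Counter


def duplicate_summary(feature_vectors):
    """Summarise how many genes share exactly the same feature vector."""
    counts = Counter(tuple(vec) for vec in feature_vectors)
    n_genes = sum(counts.values())
    n_unique = len(counts)
    return {
        "n_genes": n_genes,
        "n_unique": n_unique,
        # Fraction of genes that do not add a new representation.
        "duplicate_fraction": 1.0 - n_unique / n_genes,
        # Size of the largest group of identical vectors.
        "most_frequent_count": max(counts.values()),
    }


# Check against the pheno row of Table 1: 1591 genes, 514 unique representations.
pheno_duplicate_fraction = 1.0 - 514 / 1591   # about 0.677, i.e. 67.7%
```

The same count explains the 20% figure for the test set, since 118 of the 582 test examples share the most frequent vector.<br />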

For evaluation purposes, ML-KNN (Zhang & Zhou,<br />

2007) was employed to demonstrate the effect of the<br />

studied problem on the average precision for the FunCat<br />

annotated datasets. Here, “unique” refers to the datasets<br />

obtained after removing all the duplicates. Thus, any<br />

feature vector can be included only once in a gene’s<br />

neighbour set. We report the average of 10 “unique”<br />

versions, each one using a different gene’s class label as<br />

ground truth for the feature vector.<br />
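One way to build such “unique” versions is sketched below. It is an assumption of this sketch that the gene whose label set represents a duplicated feature vector is drawn at random in each version; the text only states that each version uses a different gene’s class label. The function names and the seed are hypothetical.<br />

```python
import random
from collections import defaultdict


def unique_version(feature_vectors, label_sets, rng):
    """Keep each distinct feature vector once, with one gene's label set."""
    groups = defaultdict(list)
    for vec, labels in zip(feature_vectors, label_sets):
        groups[tuple(vec)].append(labels)
    X, Y = [], []
    for vec, candidates in groups.items():
        X.append(vec)
        # Assumption: the representative label set is chosen at random.
        Y.append(rng.choice(candidates))
    return X, Y


def unique_versions(feature_vectors, label_sets, n_versions=10, seed=0):
    """Generate several de-duplicated versions of the same dataset."""
    rng = random.Random(seed)
    return [unique_version(feature_vectors, label_sets, rng)
            for _ in range(n_versions)]
```

A performance measure such as average precision would then be computed on each version and averaged over the 10 versions.<br />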

Dataset | Version | K = 1 Train | K = 1 Test (5cv) | K = 5 Train | K = 5 Test (5cv) | K = 17 Train | K = 17 Test (5cv)<br />

pheno | initial | 51.59 | 23.62 | 39.55 | 24.14 | 32.76 | 23.59<br />

pheno | unique | 100 | 24.21 | 55.62 | 24.90 | 39.70 | 25.01<br />

hom | initial | 98.30 | 39.32 | 63.64 | 39.45 | 48.96 | 37.28<br />

hom | unique | 100 | 39.14 | 64.64 | 39.67 | 49.28 | 37.53<br />

TABLE 2. Average Precision rates (%) using ML-KNN.<br />

Table 2 shows that a less discriminant feature<br />

representation can affect ML-KNN and decrease the<br />

precision of multi-label classification. The same<br />

problem is likely to be more<br />

pronounced, or even disastrous, for two-class or<br />

multi-class classification problems.<br />

CONCLUSION<br />

The major point of this study was to inform the research<br />

community of the relatively low representational power of<br />

the features present in some widely used gene function<br />

prediction datasets, making them even more difficult and<br />

challenging datasets from a machine learning perspective.<br />

We observed the same issue in datasets of other HMC<br />

application domains like text categorization.<br />

REFERENCES<br />

Zhang M. L. & Zhou Z. H. ML-KNN: A lazy learning approach to multi-label learning, Pattern<br />

recognition 40, 2038-2048, (2007).<br />

Vens C. et al. Decision trees for hierarchical multi-label classification, Machine Learning 73, 185-214,<br />

(2008).<br />

Schietgat L. et al. Predicting gene function using hierarchical multi-label decision tree ensembles, BMC<br />

Bioinformatics 11, (2010).<br />


P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY<br />

PREDICTION<br />

Fabrizio Pucci 1,* , Katrien Bernaerts 1,2 , Fabian Teheux 1 , Dimitri Gilis 1 & Marianne Rooman 1 .<br />

Department of BioModeling, BioInformatics & BioProcesses 1 , Université Libre de Bruxelles, 1050 Brussels, Belgium;<br />

BioBased Materials, Faculty of Humanities and Sciences 2 , Maastricht University, 6200 Maastricht, The Netherlands.<br />

* fapucci@ulb.ac.be<br />

In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we<br />

focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG). In a<br />

first instance we measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms<br />

described in the literature. Then we show how important the use of the symmetry of the model is to avoid biasing. In the<br />

last step we briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins with<br />

the aim of understanding whether the distribution is universal and how much it is biased towards the training dataset.<br />

INTRODUCTION<br />

The accurate prediction of the stability changes on a large<br />

scale is still a challenge in protein science. Despite the<br />

large amount of work done in the last years, the results<br />

frequently suffer from hidden biases towards the training<br />

dataset and this makes the evaluation of the real<br />

performances a difficult task.<br />

Here we study the “bias problem” in the case of the<br />

prediction of protein thermodynamic stability changes<br />

upon point mutations and more precisely of its best<br />

descriptor ΔΔG, that is, the change of folding free energy<br />

upon mutation from the wild-type protein W to the mutant<br />

M. In principle the predicted ΔΔG value of the inverse<br />

mutation (M to W) has to be exactly equal to minus the<br />

ΔΔG of the direct mutation (W to M), since the free energy<br />

is a state function.<br />

Unfortunately the asymmetry of the training dataset<br />

towards the destabilizing mutations (reflecting the<br />

evolutionary optimization of protein stability) makes the<br />

prediction of inverse mutations less accurate with respect<br />

to the direct ones. This introduces a series of distortions in<br />

the prediction model that we will analyze here.<br />

METHODS<br />

We computed the ΔΔG value for a set of almost 200<br />

mutations for which the structures of both the wild-type<br />

protein and the mutant are known, using a series of prediction<br />

tools, namely PoPMuSiC [1], I-Mutant, FoldX, Duet,<br />

AutoMute, CupSat, Eris and ProSMS. We then computed<br />

the Ratio (RID) of the standard deviation between the<br />

predicted and the experimental values of ΔΔG for the<br />

Inverse mutations to that for the Direct mutations (which<br />

should be one in the case of a perfectly symmetric<br />

prediction) and compared the results of the different<br />

programs.<br />
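The RID defined above can be computed as in the following sketch. Two points are assumptions of this sketch rather than statements from the abstract: the "standard deviation between predicted and experimental values" is read as a root-mean-square deviation, and the experimental ΔΔG of an inverse mutation is taken as minus that of the direct mutation, as required by the state-function argument. The antisymmetrising helper is one generic way to enforce the symmetry and is not claimed to be the correction used in [2].<br />

```python
import numpy as np


def rmsd(predicted, experimental):
    """Root-mean-square deviation between predicted and experimental values."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    return float(np.sqrt(np.mean((predicted - experimental) ** 2)))


def rid(pred_direct, pred_inverse, exp_direct):
    """Ratio of the inverse-mutation deviation to the direct-mutation deviation."""
    exp_direct = np.asarray(exp_direct, dtype=float)
    sigma_direct = rmsd(pred_direct, exp_direct)
    # Experimental value of the inverse mutation: minus the direct one.
    sigma_inverse = rmsd(pred_inverse, -exp_direct)
    return sigma_inverse / sigma_direct


def antisymmetrise(pred_direct, pred_inverse):
    """Combine both predictions so that inverse = -direct holds exactly."""
    pred_direct = np.asarray(pred_direct, dtype=float)
    pred_inverse = np.asarray(pred_inverse, dtype=float)
    sym_direct = 0.5 * (pred_direct - pred_inverse)
    return sym_direct, -sym_direct
```

For any predictor made antisymmetric in this way the two deviations coincide, so the RID equals one by construction, whereas a predictor that is less accurate on inverse mutations gives a RID above one.<br />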

If the functional structure of the model is known, as in the<br />

case of the artificial neural network of PoPMuSiC, one<br />

can further understand which terms contribute more than<br />

others to the deviation of the RID from unity and thus propose new<br />

model structures in which the biases are correctly avoided<br />

[2].<br />

In more black-box machine learning approaches (such as<br />

methods based on Random Forest or Support Vector<br />

Machine), in which the functional form is not explicitly<br />

known, the asymmetry correction is less obvious.<br />

In a second part, we investigated how the symmetry of the<br />

ΔΔG value distribution in the training dataset influences<br />

the prediction of the ΔΔG distribution for all possible<br />

mutations in a series of proteins with known structures.<br />

RESULTS & DISCUSSION<br />

The estimation of the asymmetry computed for a<br />

series of available prediction methods gives RID<br />

values between 1 for bias-corrected methods and<br />

about 3 for the most biased programs. These<br />

results show that the correct use of the<br />

symmetry in setting up the model structure helps to<br />

avoid unwanted biases towards the destabilizing<br />

mutations.<br />

Furthermore, the distribution of the ΔΔG values for all<br />

point mutations in some proteins was analyzed<br />

and showed a dependence on the ΔΔG distribution<br />

of the training dataset when the RID deviates<br />

significantly from one. Understanding the<br />

relation between the two distributions is an<br />

important step to comprehend the universality of the<br />

distribution [3] and how much the proteins are<br />

optimized to minimize the impact of single-site<br />

amino acid substitutions.<br />

REFERENCES<br />

[1] Y. Dehouck, Jean Marc Kwasigroch, D. Gilis, M. Rooman (2011),<br />

PopMusic 2.1 : a web server for the estimation of the protein<br />

stability changes upon mutation and sequence optimality. BMC<br />

Bioinformatics. 12, 151<br />

[2] F. Pucci, K. Bernaerts, F. Teheux, D. Gilis, M. Rooman, Symmetry<br />

Principles in Optimization Problems: an application to Protein<br />

Stability Prediction (2015), IFAC-PapersOnLine 48-1, 458-463<br />

[3] Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS, The<br />

stability effects of protein mutations appear to be universally<br />

distributed (2007), J Mol Biol, 356, 1318-1332.<br />


P47. MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC<br />

VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF<br />

THEIR DELETERIOUS EFFECTS<br />

Daniele Raimondi 1,2,3,4 , Andrea Gazzo 1,2 , Marianne Rooman 1,6 , Tom Lenaerts 1,2,5 & Wim Vranken 1,2,3,4 .<br />

Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium 1 ; Machine Learning group,<br />

Université Libre de Bruxelles, Brussels, 1050, Belgium 2 ; Structural Biology Brussels, Vrije Universiteit Brussel,<br />

Brussels, 1050, Belgium 3 ; Structural Biology Research Centre, VIB, Brussels, 1050, Belgium 4 ; Artificial Intelligence lab,<br />

Vrije Universiteit Brussel, Brussels, 1050 Belgium 5 ; 3BIO-BioInfo group, Université Libre de Bruxelles, Brussels, 1050,<br />

Belgium 6 . * daniele.raimondi@vub.ac.be<br />

The increasing availability of genome sequence data led to the development of predictors that are capable of identifying<br />

the likely phenotypic effects of Single Nucleotide Variants (SNVs) or short inframe Insertions or Deletions (INDELs).<br />

Most of these predictors focus on SNVs and use a combination of features related to sequence conservation, biophysical<br />

and/or structural properties to link the observed variant to either a neutral or a disease phenotype. Despite notable<br />

successes, the mapping between genetic alterations and phenotypic effects is riddled with levels of complexity that are<br />

not yet fully understood and that are often not taken into account in the predictions. A better multi-level molecular and<br />

functional contextualization of both the variant and the protein may therefore significantly improve the predictive quality<br />

of variant-effect predictors.<br />

INTRODUCTION<br />

The phenotypical interpretation at the organism level of<br />

protein-level alterations is the ultimate goal of the variant-effect<br />

prediction field. This causal relationship is still far<br />

from being completely understood and is confounded by<br />

many aspects related to the intrinsic complexity of cell life. A<br />

crucial restriction of variant-effect prediction is that an<br />

alteration of the protein’s molecular phenotype, even if it is a<br />

sine qua non condition for the disease phenotype in the<br />

carrier individual, may not constitute in itself a sufficient<br />

cause for the disease: this also depends on the particular role<br />

that the affected protein plays in the well-being of the<br />

organism. Even the most commonly used features, which<br />

relate evolutionary constraints with likely functional damage,<br />

offer only a partial correlation with the pathogenicity of the<br />

variant. Consequently, additional information that bridges the<br />

variant-phenotype gap is crucial to improve variant-effect<br />

predictions.<br />

METHODS<br />

We address the inherently complex variant-effect prediction<br />

problem through the integration of different sources of<br />

information. By describing each (protein, variant) pair from<br />

different perspectives corresponding to different levels of<br />

contextualisation, we assembled the most relevant and<br />

accessible pieces of information that are currently available,<br />

with the aim to elucidate the fuzzy and complex mapping<br />

between molecular-level alterations and the individual-level<br />

phenotypic outcome. We use three variant-oriented features<br />

with different characteristics: the log-odds ratio (LOR) score<br />

and Conservation index (CI) [1], which are column-wise<br />

measures of the conservation of a mutated column within a<br />

multiple-sequence alignment (MSA), and the PROVEAN [2]<br />

predictions (PROV), which provide a sequence-wide measure<br />

of the change in evolutionary distance between the mutated<br />

target protein and close functional homologs that correlates<br />

with the deleteriousness of variants. The protein-oriented<br />

features use pathway [4] and protein-protein interaction<br />

networks information [5] (DGR) as well as genetic and<br />

clinical information, for instance an evaluation of how<br />

tolerant the affected genes are to homozygous loss-of-function<br />

mutations (REC) [3].<br />

RESULTS & DISCUSSION<br />

DEOGEN is our novel variant effect predictor that can<br />

natively handle both SNVs and inframe INDELs. By<br />

integrating information from different biological scales and<br />

mimicking the complex mixture of effects that lead from the<br />

variant to the phenotype, we obtain significant improvements<br />

in the variant-effect prediction results. Next to the typical<br />

variant-oriented features based on the evolutionary<br />

conservation of the mutated positions, we added a collection<br />

of protein-oriented features that are based on functional<br />

aspects of the gene affected. We cross-validated DEOGEN on<br />

36825 polymorphisms, 20821 deleterious SNVs and 1038<br />

INDELs from SwissProt.<br />

Method | Missing SNVs | Sen | Spe | Pre | Bac | MCC
PROVEAN | 0.0 | 78 | 79 | 68 | 79 | 56
SIFT | 2.0 | 85 | 69 | 61 | 77 | 52
Mutation Assessor | 0.6 | 85 | 71 | 63 | 78 | 54
PolyPhen2 (HumDiv) | 4.0 | 89 | 63 | 57 | 76 | 50
CADD | 7.0 | 82 | 75 | 66 | 78 | 55
EFIN | 0.0 | 86 | 80 | 87 | 83 | 64
MutationTaster | 20.7 | 86 | 75 | 69 | 81 | 60
GERP++ | 20.7 | 97 | 24 | 45 | 61 | 28
DEOGEN | 4.4 | 77 | 92 | 85 | 84 | 71

FIGURE 1. Comparison of the performances of 8 variant-effect predictors<br />

with DEOGEN on Humsavar 2013 dataset.<br />
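The column abbreviations are not expanded in the abstract. Assuming they stand for sensitivity (Sen), specificity (Spe), precision (Pre), balanced accuracy (Bac) and Matthews correlation coefficient (MCC), the values could be derived from a confusion matrix roughly as in the sketch below. The function name and the example counts are illustrative assumptions, not taken from the study.<br />

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return Sen, Spe, Pre, Bac and MCC (as fractions) from confusion-matrix counts."""
    sen = tp / (tp + fn)            # sensitivity: deleterious variants recovered
    spe = tn / (tn + fp)            # specificity: neutral variants recovered
    pre = tp / (tp + fp)            # precision: predicted deleterious that are correct
    bac = (sen + spe) / 2           # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Bac": bac, "MCC": mcc}

# Made-up counts, for illustration only.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

Multiplying each value by 100 gives the percentage scale used in the table, if that reading of the columns is correct.<br />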

REFERENCES<br />

[1] Calabrese, R. et al. Functional annotations improve the predictive<br />

score of human disease-related mutations in proteins. Hum. Mutat.<br />

30, 1237-44 (2009).<br />

[2] Choi, Y. et al. Predicting the functional effect of amino acid<br />

substitutions and indels. PLoS One 7, e46688 (2012).<br />

[3] MacArthur, D.G. et al. A Systematic Survey of Loss-of-Function<br />

Variants in Human Protein-Coding Genes. Science 335 (6070), 823-828 (2012).<br />

[4] Kamburov, A. et al. ConsensusPathDB: toward a more<br />

complete picture of cell biology. Nucleic Acids Res. 39, D712-717 (2011).<br />

P48. NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN<br />

DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND<br />

INTRINSIC DISORDER<br />

J. Ramiro Lorenzo 1 , Leonardo G. Alonso 2 & Ignacio E. Sánchez 1* .<br />

Protein Physiology Laboratory, Facultad de Ciencias Exactas y Naturales and IQUIBICEN - CONICET, Universidad de<br />

Buenos Aires, Argentina 1 ; Protein Structure-Function and Engineering Laboratory, Fundación Instituto Leloir and<br />

IIBBA - CONICET, Buenos Aires, Argentina 2 . *isanchez@qb.fcen.uba.ar<br />

Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a<br />

molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein<br />

local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic<br />

deamidation of internal asparagine residues in proteins, in the absence of structural data, from sequence-based<br />

predictions of secondary structure and intrinsic disorder. NGOME may help the user identify deamidation-prone<br />

asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological<br />

processes.<br />

INTRODUCTION<br />

Protein deamidation is a post-translational modification in<br />

which the side chain amide group of a glutamine or<br />

asparagine (Asn) residue is transformed into an acidic<br />

carboxylate group. Deamidation often, but not always,<br />

leads to loss of protein function 1,2 . Deamidation rates in<br />

proteins vary widely, with halftimes for particular Asn<br />

residues ranging from several days to years. In contrast<br />

with the ubiquity and importance of Asn deamidation,<br />

there is currently no publicly available algorithm for the<br />

prediction of Asn deamidation. A structure-based<br />

algorithm was published 3 , but is no longer available online<br />

and is not useful for proteins of unknown structure or<br />

those that are intrinsically disordered.<br />

METHODS<br />

Dataset. We collected from the literature experimental<br />

reports of deamidation of Asn residues in proteins using<br />

mass spectrometry or Edman sequencing. Since<br />

deamidation rates depend strongly on pH and temperature,<br />

we only included experiments at neutral or slightly basic<br />

pH and up to 313K. An Asn residue was considered a<br />

positive if unequivocal change to aspartic or isoaspartic<br />

residue was observed. Asn residues for which direct<br />

experimental evidence was not obtained were not taken<br />

into account.<br />

NGOME training. We trained the algorithm by randomly<br />

splitting the dataset into training and test sets 100 times,<br />

while keeping a similar number of positive and negative<br />

Asn-Xaa dipeptides in the two sets. For each splitting, we<br />

selected the weights for disorder 4 and alpha helix<br />

prediction 5 in the NGOME algorithm to maximize the area<br />

under the ROC curve for the training set. For the test set,<br />

the area under the ROC curve for NGOME was larger than<br />

for sequence-based prediction 97 out of 100 times. Finally,<br />

we selected the average values of weights for NGOME.<br />
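The training loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring rule, the weight grid and the toy data are hypothetical, a higher score is taken to mean a more deamidation-prone residue, and stratifying by label stands in for the balancing of Asn-Xaa dipeptides described in the text. It is not the NGOME implementation.<br />

```python
import random
from statistics import mean

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_split(labels, rng, train_fraction=0.5):
    """Return (train_idx, test_idx), splitting positives and negatives separately."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def fit_weights(features, labels, score_fn, grid, n_splits=100, seed=0):
    """Average, over random splits, the weights that maximize the training AUC."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_splits):
        train, _held_out = stratified_split(labels, rng)  # held-out part unused here
        y = [labels[i] for i in train]
        best.append(max(
            grid,
            key=lambda w: roc_auc([score_fn(features[i], w) for i in train], y),
        ))
    return tuple(mean(ws) for ws in zip(*best))

# Toy data: (disorder, helix) predictions per Asn; all values are invented.
features = [(1.0, 0.0)] * 4 + [(0.0, 0.0)] * 4
labels = [1] * 4 + [0] * 4
grid = [(a, b) for a in (0, 1) for b in (0, 1)]
score = lambda f, w: w[0] * f[0] - w[1] * f[1]   # hypothetical scoring rule
weights = fit_weights(features, labels, score, grid, n_splits=10)
```

The held-out half of each split is where the comparison against sequence-based prediction would be made; the sketch leaves that step out for brevity.<br />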

RESULTS & DISCUSSION<br />

Both protein sequence and structure can influence Asn<br />

deamidation kinetics. In the absence of secondary and<br />

tertiary structure, Asn deamidation rates are governed by<br />

the identity of the N+1 amino acid 3 . In model peptides, the<br />

Asn-Gly dipeptide is by far the fastest to deamidate, with<br />

bulky N+1 side chains generally slowing down the<br />

reaction. Several structural features decreasing Asn<br />

deamidation rates have also been identified, including<br />

alpha helix formation and hydrogen bond formation by the<br />

Asn side chain, the N+1 backbone amide and the<br />

neighbouring residues 3 .<br />

We compiled a database of 281 Asn residues (67 positives<br />

and 214 negatives) in 39 proteins to train NGOME. We<br />

computed t50 for all Asn in the dataset and generated a<br />

ROC curve by considering as positives Asn residues with<br />

different values of t50. The area under the ROC curve is<br />

larger for the NGOME predictions (0.9640) than for the<br />

sequence-based predictions (0.9270) (p-value 6×10⁻³).<br />

NGOME also performs better for threshold values<br />

yielding few false positives. NGOME can also<br />

discriminate between positive and negative Asn-Gly<br />

dipeptides whereas sequence-based prediction cannot.<br />

The area under the ROC curve is 0.7051 for the NGOME<br />

predictions, larger than the random value of 0.5 for<br />

sequence-based prediction (p-value 9×10⁻³). Since<br />

NGOME requires only a protein sequence as an input and<br />

not a three-dimensional structure, we envision that<br />

NGOME will be useful to systematically evaluate whole<br />

proteome data and in the study of intrinsically disordered<br />

proteins for which the structural data is scarce. NGOME is<br />

freely available as a webserver at the National EMBnet<br />

node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/<br />

in the subpage “Protein and nucleic acid structure and<br />

sequence analysis”.<br />

REFERENCES<br />

1. Curnis, F., et al. J Biol Chem 281:36466-36476 (2006).<br />

2. Reissner, K.J. and Aswad, D.W. Cell Mol Life Sci 60:1281-1295<br />

(2003).<br />

3. Robinson, N.E. and Robinson, A.B. Proc Natl Acad Sci U S A<br />

98:4367-4372 (2001).<br />

4. Dosztanyi, Z., et al. Bioinformatics 21:3433-3434 (2005).<br />

5. Cole, C., et al. Nucleic Acids Res 36:W197-201 (2008).<br />

P49. OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL<br />

MODELS<br />

Jérôme Renaux 1,* , Alexandros Sarafianos 1 , Kurt De Grave 1 & Jan Ramon 1 .<br />

Department of Computer Science, KU Leuven. 1 * Jerome.renaux@cs.kuleuven.be<br />

Targeted proteomics techniques such as Selected Reaction Monitoring (SRM) have become very popular for protein<br />

quantification due to their high sensitivity and reproducibility. However, these rely on the selection of optimal transitions,<br />

which are not always known in advance and may require expensive and time-consuming discovery experiments to<br />

identify. We propose a computer program for the automated identification of optimal transitions using machine learning<br />

and show encouraging results when compared to a widely used spectral library.<br />

INTRODUCTION<br />

A major issue with SRM is knowing which transitions<br />

to monitor in order to maximally detect a specific protein,<br />

these being different from one protein to another. Good<br />

candidates are transitions whose chemical properties will<br />

make them likely to occur and easy to detect by the mass<br />

spectrometer, while being sufficiently specific indicators<br />

of their parent protein.<br />

Traditionally, targeted proteomics assays, which consist of<br />

lists of ions or transitions to monitor, are designed through<br />

costly exploratory experiments. Recently, attempts have<br />

been made to produce software to help design optimal<br />

assays. These efforts rely to some extent on collaborative<br />

databases of mass spectra which are mined to identify the<br />

best possible peptides to include in the assays. While<br />

successful, these approaches still depend on past<br />

exploratory analyses and on the coverage of the exploited<br />

databases. Therefore, their performance decreases in cases<br />

where such databases cannot be leveraged, such as when<br />

dealing with little-studied organisms or rare, low-abundance<br />

proteins.<br />

We propose an approach called SIMPOPE (Sequence of<br />

Inductive Models for the Prediction and Optimization of<br />

Proteomics Experiments) that models all the steps of the<br />

typical tandem mass spectrometry (MS/MS) workflow in<br />

order to accurately predict the properties of peptide and<br />

fragment ions within a given proteome, and subsequently<br />

identify optimal assays among them.<br />

METHODS<br />

SIMPOPE consists of a sequential suite of predictive<br />

models for each step of the MS/MS workflow. It exploits<br />

knowledge from public databases and combines it with the<br />

generalizing power of machine learning models to<br />

compensate for noisy or missing data. All models are<br />

probabilistic, which makes it possible to keep track of the inherent<br />

uncertainty of the successive predictions and to weight the<br />

results accordingly for the assay prediction.<br />

Enzymatic cleavage is modelled using CP-DT (Fannes et<br />

al., 2013), which models the behaviour of the trypsin<br />

enzyme using random forests. Retention time prediction is<br />

achieved using the Elude tool from the Percolator suite<br />

(Moruz et al., 2010). The charge distribution of<br />

electrospray precursor ions is also modelled using random<br />

forests trained on experimental data mined from PRIDE<br />

(Vizcaino et al., 2013). Fragmentation patterns and<br />

product ion intensity are predicted with the help of random<br />

forest models trained on MS-LIMS data (Degroeve &<br />

Martens 2013; De Grave et al., 2014). Finally, prior<br />

knowledge about the abundance of proteins within a given<br />

proteome is incorporated as prior probabilities, obtained<br />

when available from PaxDB.<br />

On the human proteome, these steps yield a total of 321<br />

000 000 transitions together with their relevant chemical<br />

properties. We then compute a score for every single<br />

transition, based on these properties and on their aliasing<br />

with other transitions in terms of Q1 and Q3 m/z.<br />

RESULTS & DISCUSSION<br />

We validated our approach by computing scores for 2000<br />

reference transitions from the SRMAtlas database (Picotti<br />

et al., 2014). Based on these scores, we can rank the<br />

reference transitions among all possible transitions.<br />

Intuitively, reference transitions should score high and<br />

therefore have a low rank (ideally, in the top five). Based<br />

on the average number of transitions per protein in our<br />

reference set, a perfect median rank would be 3.2, while a<br />

totally random scoring system should yield a median rank<br />

of 151. The approach we propose achieved a median rank<br />

of 15, signifying that using our scoring method, 50% of<br />

the reference transitions are ranked in the top 15. This<br />

result is encouraging as it shows that the scores predicted<br />

by SIMPOPE do correlate with the quality of the<br />

transitions. We can subsequently use that score as a<br />

feature to train an additional model on top of the ones<br />

described here to refine the assay prediction process<br />

(further results on the poster).<br />
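The median-rank evaluation can be illustrated with a short sketch. It assumes that each reference transition is ranked among the candidate transitions of its own protein, with rank 1 for the highest score; the data structures, identifiers and scores below are hypothetical and serve only to show the computation.<br />

```python
from statistics import median

def reference_ranks(scores_by_protein, reference):
    """Rank (1 = best) of each reference transition among its protein's candidates."""
    ranks = []
    for protein, scores in scores_by_protein.items():
        ordered = sorted(scores, key=scores.get, reverse=True)  # highest score first
        position = {transition: i + 1 for i, transition in enumerate(ordered)}
        ranks.extend(position[t] for t in reference.get(protein, ()) if t in position)
    return ranks

# Invented scores for two proteins and their candidate transitions.
scores_by_protein = {
    "P1": {"t1": 0.9, "t2": 0.4, "t3": 0.7, "t4": 0.1},
    "P2": {"u1": 0.2, "u2": 0.8, "u3": 0.5},
}
reference = {"P1": ["t1", "t3"], "P2": ["u3"]}
ranks = reference_ranks(scores_by_protein, reference)
median_rank = median(ranks)
```

A scoring method that carries no information would spread the reference transitions evenly over all ranks, which is why the median under random scoring is far larger than the ideal value.<br />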

REFERENCES<br />

Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak<br />

intensity prediction. Bioinformatics, 29, pp.3199–203 (2013).<br />

Fannes, T. et al. Journal of Proteome Research, 12(5), pp.2253–2259<br />

(2013).<br />

De Grave, K. et al. Prediction of peptide fragment ion intensity: a<br />

priori partitioning reconsidered. International Mass Spectrometry<br />

Conference 2014, (2014).<br />

Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust<br />

calibration of retention time models for targeted proteomics. Journal<br />

of Proteome Research, 9(10), pp.5209–5216 (2010).<br />

Picotti, P. et al. A complete mass-spectrometric map of the yeast<br />

proteome applied to quantitative trait analysis. Nature, 494(7436),<br />

pp.266–270 (2014).<br />

Vizcaino, J. a. et al. The Proteomics Identifications (PRIDE) database<br />

and associated tools: status in 2013. Nucleic Acids Research, 41(D1),<br />

pp.D1063–D1069 (2013).<br />

P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION<br />

ACROSS MULTIPLE MICROBIAL GENOMES<br />

Alex Salazar 1,2 & Thomas Abeel 1,2* .<br />

Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands 1 ; Genome Sequencing and Analysis<br />

Program, Broad Institute of MIT and Harvard 2 . * T.Abeel@tudelft.nl<br />

Comparing large structural variants—such as large insertions and deletions (indels)—across multiple genomes can reveal<br />

important insights in microbial organisms. Unfortunately, most studies that compare sequence variants only focus on<br />

single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are<br />

robust when identifying the same large indel across multiple genomes, an important criterion for accurately associating<br />

large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that<br />

breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap<br />

normalization across all samples will ensure uniform breakpoint coordinates of identical large variants which can then be<br />

incorporated into existing association pipelines.<br />

INTRODUCTION<br />

Structural sequence variants, such as large insertions and<br />

deletions (indels), along with small sequence variants (e.g.<br />

single nucleotide variants and small indels) can enable more<br />

robust comparisons of microbial populations. Unfortunately,<br />

limitations in variant calling methods restrict investigations to<br />

compare only small variants across multiple microbial<br />

genomes—thereby ignoring larger variants (e.g. indels of size<br />

greater than 50nt). The recent development of structural<br />

variant detecting tools now provides an opportunity to<br />

compare and associate large indels with phenotype and<br />

population structure across a collection of samples. However,<br />

these tools have only been benchmarked against a single<br />

genome and their ability to consistently call large events<br />

across multiple genomes remains uncharacterized.<br />

METHODS<br />

In this study, we systematically benchmarked the robustness<br />

of large indel identification across multiple genomes using<br />

four recently developed structural variant detection tools:<br />

Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014),<br />

BreakSeek (Zhao et al., 2015), and MindTheGap (Rizk et al.,<br />

2014). Using a manually-curated reference genome for<br />

M. tuberculosis (H37Rv), we simulated nearly 10,000<br />

deletions and 8,000 insertions, ranging from 50nt<br />

to 550nt. Overall, the simulation experiment resulted in a<br />

total of 1.6 million expected deletions and 1.3 million expected<br />

insertions when we aligned short-reads from a data set of 161<br />

clinical strains of M. tuberculosis (Zhang et al., 2013).<br />

After identifying the simulated indels using the variant<br />

detecting tools, we used a distance test to investigate each<br />

tool’s robustness in breakpoint and genotype prediction. For<br />

each simulated indel prediction, we computed the distance of<br />

the predicted breakpoint coordinate to the expected<br />

breakpoint coordinate. We also calculated a genotype<br />

similarity score using the Damerau-Levenshtein distance.<br />
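The two measurements can be sketched as below. The abstract does not say which Damerau-Levenshtein variant was used or how the distance was turned into a similarity score, so this sketch makes two assumptions: it implements the restricted variant (optimal string alignment, which counts adjacent transpositions but never edits a substring twice), and it normalizes by the longer sequence. All function names are illustrative.<br />

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def genotype_similarity(predicted, expected):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(predicted), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(predicted, expected) / longest

def breakpoint_offset(predicted_pos, expected_pos):
    """Absolute distance (in nt) between predicted and expected breakpoints."""
    return abs(predicted_pos - expected_pos)
```

For example, `breakpoint_offset(1005, 1000)` gives an offset of 5 nt, which would fall inside the 10 nt window reported in the results.<br />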

RESULTS & DISCUSSION<br />

We found that all tools are able to precisely predict the<br />

breakpoint coordinate of the same large event present across<br />

multiple genomes. For deletions, Breseq and BreakSeek<br />

consistently identified more than 96% of all simulated<br />

deletions regardless of size. This number ranged from 87% to<br />

93% in Pilon and correlated with decreasing deletion size.<br />

Breseq and Pilon correctly predicted the exact breakpoint<br />

coordinate for about two-thirds of all identified simulated<br />

indels. This number ranged from 1% to 7% in BreakSeek calls<br />

and inversely correlated with increasing deletion size.<br />

For insertions, MindTheGap consistently identified<br />

approximately 97% of all simulated insertions, but Pilon’s<br />

performance worsened as the number of insertions that it<br />

identified ranged from 69% to 93%; again, we observed a<br />

direct correlation of missed calls as the insertion size<br />

increased. Both tools correctly predicted the exact breakpoint<br />

coordinate for about two-thirds of all identified simulated<br />

indels. Nevertheless, we found 99% of the predicted<br />

breakpoint coordinates made by the four tools were within<br />

10nt of the expected breakpoint coordinate.<br />

Our results also indicate that Pilon, Breseq, BreakSeek, and<br />

MindTheGap are robust when predicting the genotype of<br />

large indels across multiple samples. The large majority of<br />

identified simulated deletions had a size and genotype<br />

similarity of more than 98%. For insertions, the size similarity<br />

varied widely in both MindTheGap and Pilon<br />

calls indicating that both tools have a difficult time<br />

determining the exact length of an insertion sequence.<br />

Overall, these results show that breakpoint detection is<br />

precise when identifying deletions and insertions of any size.<br />

Therefore, a simple normalization procedure, such as left-most-overlap<br />

normalization across samples, will ensure<br />

consistent breakpoint location for identical large events. This<br />

will enable researchers to incorporate large variants into<br />

existing association pipelines, opening novel opportunities to<br />

associate large variants with phenotype and population<br />

structure.<br />
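The abstract does not spell out the normalization procedure. One plausible reading, sketched here purely as an assumption, is to group calls from different samples whose intervals overlap and to report the left-most start coordinate of each group for all of its members. The tuple layout and sample names are hypothetical.<br />

```python
def normalize_leftmost(calls):
    """Give every group of overlapping calls the left-most start seen in the group.

    `calls` is a list of (sample, start, end) tuples with start <= end.
    Returns a dict mapping each call to its normalized start coordinate.
    """
    normalized = {}
    group, group_start, group_end = [], None, None
    for call in sorted(calls, key=lambda c: (c[1], c[2])):
        _, start, end = call
        if group and start <= group_end:          # overlaps the current group
            group.append(call)
            group_end = max(group_end, end)
        else:                                     # flush and start a new group
            for member in group:
                normalized[member] = group_start
            group, group_start, group_end = [call], start, end
    for member in group:                          # flush the last group
        normalized[member] = group_start
    return normalized

# Invented calls: s1 and s2 report the same deletion 3 nt apart.
calls = [("s1", 1000, 1200), ("s2", 1003, 1203), ("s3", 5000, 5100)]
normalized = normalize_leftmost(calls)
```

Because grouping is by simple overlap, two distinct events that happen to overlap would be merged; a real pipeline would likely also compare event type and size before grouping.<br />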

REFERENCES<br />

Barrick,J.E. et al. (2014) Identifying structural variation in haploid<br />

microbial genomes from short-read resequencing data using breseq.<br />

BMC Genomics, 15, 1039.<br />

Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of<br />

short and long insertions. Bioinformatics, 30, 1–7.<br />

Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive<br />

microbial variant detection and genome assembly improvement.<br />

PLoS One, 9, e112963.<br />

Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium<br />

tuberculosis isolates from China identifies genes and intergenic<br />

regions associated with drug resistance. Nat. Genet., 45, 1255–60.<br />

Zhao,H. and Zhao,F. (2015) BreakSeek: a breakpoint-based algorithm for<br />

full spectral range INDEL detection. Nucleic Acids Res., 1–13.<br />

P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES<br />

FOR PREDICTING CLINICAL CODES<br />

Elyne Scheurwegs 1,3* , Kim Luyckx 2 , Léon Luyten 2 , Walter Daelemans 3 & Tim Van den Bulcke 1 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Antwerp University Hospital 2 ; Center<br />

for Computational Linguistics and Psycholinguistics (CliPS), University of Antwerp 3 ; * elyne.scheurwegs@uantwerpen.be<br />

Automated clinical coding is a task in medical informatics, in which information found in patient files is translated to<br />

various types of coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both<br />

in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple<br />

days during the stay). This work studies the complementarity of information derived from these different sources to<br />

enhance clinical code prediction.<br />

INTRODUCTION<br />

The increased accessibility of healthcare data through the<br />

large-scale adoption of electronic health records stimulates<br />

the development of algorithms that monitor hospital<br />

activities, such as clinical coding applications.<br />

Clinical coding consists of the translation of information<br />

found in a patient file to diagnostic and procedural codes<br />

that originate from a medical ontology.<br />

In our work, we investigate if unstructured (textual) and<br />

structured data sources, present in electronic health<br />

records, can be combined to assign clinical diagnostic and<br />

procedural codes (specifically ICD-9-CM) to patient stays.<br />

Our main objective is to evaluate if integrating these<br />

heterogeneous data types improves prediction strength<br />

compared to using the data types in isolation.<br />

METHODS<br />

Several datasets were collected from the clinical data<br />

warehouse of the Antwerp University Hospital (UZA).<br />

The resulting dataset consists of a randomized subset of<br />

anonymized data of patient stays, in 14 different medical<br />

specialties. Two separate data integration approaches were<br />

evaluated on each dataset from a medical specialty.<br />

With early data integration, multiple sources are combined<br />

prior to training a model. This is achieved by using a<br />

single bag of features that are given to the prediction<br />

pipeline. Feature selection is performed with tf-idf for<br />

unstructured sources and with gain ratio and minimal<br />

redundancy, maximum relevance (mRMR) for structured<br />

source filtering.<br />

The late data integration method trains a separate model<br />

on each data source, and then combines the prediction<br />

output for each code in a meta-learner. This meta-learner<br />

is mainly used to find which sources perform best for a<br />

certain code.<br />

The prediction task in both approaches was cast as a multiclass<br />

classification task, in which an array of binary<br />

predictions was made (one for each clinical code).<br />
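The role of the meta-learner in late data integration can be illustrated with a deliberately simplified stand-in. Instead of a trained meta-learner, the sketch below selects, for each code, the source whose model reaches the best F-measure on validation data, which mirrors only the stated purpose of finding the best-performing sources per code. Source names, codes and predictions are invented for illustration.<br />

```python
def f_measure(predicted, actual):
    """F1 score for two equal-length lists of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_source_per_code(validation_predictions, validation_labels):
    """Pick, for each clinical code, the source with the best validation F-measure.

    validation_predictions: {source: {code: [0/1 per patient stay]}}
    validation_labels:      {code: [0/1 per patient stay]}
    """
    return {
        code: max(
            validation_predictions,
            key=lambda source: f_measure(validation_predictions[source][code], actual),
        )
        for code, actual in validation_labels.items()
    }

# Invented example: two sources, two codes, four patient stays.
validation_labels = {"401.9": [1, 0, 1, 0], "250.00": [0, 1, 1, 0]}
validation_predictions = {
    "text": {"401.9": [1, 0, 1, 0], "250.00": [0, 0, 0, 1]},
    "lab": {"401.9": [0, 0, 1, 1], "250.00": [0, 1, 1, 0]},
}
best = best_source_per_code(validation_predictions, validation_labels)
```

A trained meta-learner would combine the per-source outputs rather than pick a single winner, but the input is the same: one prediction per source and code.<br />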

RESULTS & DISCUSSION<br />

Late data integration improves the prediction of ICD-9-CM<br />

diagnostic codes compared to the best<br />

individual prediction source (overall F-measure<br />

increased from 30.6% to 38.3%). Early data integration<br />

does not show this trend and only performs well with a<br />

limited number of combinations of sources. ICD-9-CM<br />

procedure codes also show this trend, with the exception<br />

of the RIZIV data source, which shows a better prediction<br />

when used individually. The predictive strength of the<br />

models varies strongly between different medical<br />

specialties.<br />

The results show that the data sources, independent of<br />

their structured or unstructured nature, are able to provide<br />

complementary information when predicting ICD-9-CM<br />

codes, particularly when combined within the late data<br />

integration approach. This approach also allows for<br />

including as many sources as possible, because<br />

including a source that does not contain any additional<br />

information barely influences the end result. This is an<br />

advantage when the information content of a data source is<br />

not previously known. A disadvantage is the loss of<br />

information due to the strong generalisation as each data<br />

source is effectively reduced to a single feature for the<br />

meta-learner.<br />

Early data integration seems to suffer when combining<br />

sources that have features with a largely differing<br />

information content and different numbers of features. An<br />

unstructured data source typically renders 30,000<br />

different, weak features, while a structured source often<br />

contains only 500 different features.<br />

CONCLUSIONS<br />

Models using multiple electronic health record data<br />

sources systematically outperform models using data<br />

sources in isolation in the task of predicting ICD-9-CM<br />

codes over a broad range of medical specialties.<br />

ACKNOWLEDGEMENT<br />

This work is supported by a doctoral research grant (nr.<br />

131137) by the Agency for Innovation by Science and<br />

Technology in Flanders (IWT). The datasets used in this<br />

research were made available by the Antwerp University<br />

Hospital (UZA) for restricted use.<br />

REFERENCES<br />

Scheurwegs, E et al. Data integration of structured and unstructured<br />

sources for assigning clinical codes to patient stays. Journal of the<br />

American Medical Informatics Association (2015): ocv115.<br />

P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS<br />

Jaak Simm 1,2,3* , Adam Arany 1,2 , Sarah ElShal 1,2 & Yves Moreau 1,2 .<br />

Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data<br />

Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium 1 ; iMinds Medical IT, Kasteelpark<br />

Arenberg 10, box 2446, 3001 Leuven, Belgium 2 ; Institute of Gene Technology, Tallinn University of Technology,<br />

Akadeemia tee 15A, Estonia 3 . * jaak.simm@esat.kuleuven.be<br />

Scientific publications contain rich information about genetic disorders. Text mining these publications provides an<br />

automatic way to quickly query and summarize the information. We propose a supervised learning approach that takes<br />

advantage of the well-known unsupervised approach TF-IDF (term frequency–inverse document frequency) and<br />

integrates it with a supervised approach using a logistic loss. The preliminary results on the OMIM dataset look<br />

promising.<br />

INTRODUCTION<br />

Scientific publications contain rich information about<br />

genetic disorders. Text mining these publications provides<br />

an automatic way to quickly query and summarize the<br />

information.<br />

Traditional approaches employ unsupervised text<br />

mining methods such as TF-IDF (term frequency–inverse<br />

document frequency) or Latent Dirichlet Allocation<br />

(LDA; Blei et al., 2003) for linking terms to genes and<br />

diseases. Beegle (ElShal et al., 2015), a recent text mining<br />

tool developed for linking diseases and genes, takes<br />

this approach, using TF-IDF as its similarity metric.<br />

PROPOSED METHOD<br />

We propose to learn the importance of the textual<br />

terms in a supervised way, which can automatically<br />

filter out many terms that are unnecessary for the task at<br />

hand. We formulate it as a prediction of supervised values<br />

y given the terms for all genes g and all diseases d where i<br />

is the index of the term:<br />

and w_i is the weight for term i and σ is the sigmoid<br />

function. The main idea is to learn the weight vector w that<br />

minimizes the difference between the known values y and the<br />

predictions. The minimization can be transformed into a<br />

logistic regression.<br />
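The formulation above can be illustrated with a minimal Python sketch. The exact form of the prediction is not recoverable from this text, so the sketch assumes that a gene-disease pair is scored as σ(Σ_i w_i · x_gene[i] · x_disease[i]), where x_gene and x_disease are per-term scores (for example TF-IDF). That form, the names `predict` and `fit`, the learning rate and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x_gene, x_disease):
    """Assumed score: sigmoid(sum_i w_i * x_gene[i] * x_disease[i])."""
    return sigmoid(sum(wi * a * b for wi, a, b in zip(w, x_gene, x_disease)))

def fit(pairs, labels, n_terms, lr=0.5, epochs=200):
    """Stochastic gradient descent on the logistic loss over gene-disease pairs."""
    w = [0.0] * n_terms
    for _ in range(epochs):
        for (xg, xd), y in zip(pairs, labels):
            err = predict(w, xg, xd) - y  # derivative of the logistic loss w.r.t. the logit
            for i in range(n_terms):
                w[i] -= lr * err * xg[i] * xd[i]
    return w

# Toy data (hypothetical): 3 terms; term 0 is shared by linked pairs,
# term 2 is shared by every pair and therefore carries no signal.
pairs = [
    ([1.0, 0.0, 0.5], [1.0, 0.0, 0.5]),  # linked
    ([0.8, 0.2, 0.5], [0.9, 0.0, 0.5]),  # linked
    ([0.0, 1.0, 0.5], [1.0, 0.0, 0.5]),  # not linked
    ([0.0, 0.3, 0.5], [0.0, 0.9, 0.5]),  # not linked
]
labels = [1, 1, 0, 0]
weights = fit(pairs, labels, n_terms=3)
```

In this toy setting the learned weight for term 0 becomes positive, while the uninformative shared term is pushed down, which is the filtering behaviour the text describes.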

For the supervised values we use the OMIM database<br />

(Hamosh et al., 2005). More specifically, y is 1 if there is<br />

a link between the given gene-disease pair and 0 if there<br />

is no link. Intuitively, this setup turns the text<br />

mining task into a classification problem. We<br />

use a dataset of 330 OMIM terms and their linked genes and<br />

randomly sample genes as negatives for each disease.<br />

For the textual terms we use MEDLINE abstracts as the<br />

source of biomedical text. We employ MetaMap (Aronson<br />

& Lang, 2010) to link terms with abstracts. We use GeneRIF<br />

to link genes with abstracts, and PubMed to link diseases<br />

with abstracts. We apply a TF-IDF transformation to score<br />

a term with a given disease or gene based on the abstracts<br />

linked to each entity. We only use the terms linked to<br />

abstracts that belong to genes. Hence our vocabulary<br />

consists of 66,883 terms.<br />
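As a rough illustration of the TF-IDF step, the sketch below scores each term for each entity (gene or disease) from the terms found in its linked abstracts. The weighting used here (relative term frequency times the log of the inverse entity frequency) is one common TF-IDF variant and is an assumption; the abstract does not state which variant was used, and the example entities and terms are made up.

```python
import math
from collections import Counter

def tfidf(entity_terms):
    """entity_terms maps an entity to the list of terms from its linked abstracts."""
    n_entities = len(entity_terms)
    # Number of entities in which each term occurs at least once.
    df = Counter()
    for terms in entity_terms.values():
        df.update(set(terms))
    scores = {}
    for entity, terms in entity_terms.items():
        tf = Counter(terms)
        total = len(terms)
        scores[entity] = {
            term: (count / total) * math.log(n_entities / df[term])
            for term, count in tf.items()
        }
    return scores

# Hypothetical toy input.
scores = tfidf({
    "BRCA1": ["breast", "cancer", "cancer"],
    "CFTR": ["cystic", "fibrosis", "cancer"],
})
```

A term that occurs for every entity ("cancer" here) receives a score of zero, whereas entity-specific terms receive positive scores.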

RESULTS & DISCUSSION<br />

The preliminary results show that supervised learning<br />

makes it possible to automatically pick up the informative<br />

keywords, improving the recall of the genes that are<br />

related to genetic disorders. We will present more detailed<br />

results in the poster.<br />

We are also investigating how to integrate the supervised<br />

approach into answering the online queries served by<br />

Beegle.<br />

REFERENCES<br />

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet<br />

allocation. Journal of Machine Learning Research, 3, 993-1022.<br />

Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., & McKusick,<br />

V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a<br />

knowledgebase of human genes and genetic disorders. Nucleic Acids<br />

Research, 33(suppl 1), D514-D517.<br />

ElShal, S., Tranchevent, L. C., Sifrim, A., Ardeshirdavani, A., Davis, J., &<br />

Moreau, Y. (2015). Beegle: from literature mining to disease-gene<br />

discovery. Nucleic Acids Research, gkv905.<br />

Aronson, A. R., & Lang, F. M. (2010). An overview of MetaMap:<br />

historical perspective and recent advances. Journal of the American<br />

Medical Informatics Association, 17(3), 229-236.<br />


P53. FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND<br />

COMPARE CYTOMETRY DATA IN THE BROWSER<br />

Arne Soete 2 , Sofie Van Gassen 1,2,3 , Tom Dhaene 1 , Bart N. Lambrecht 2,3 & Yvan Saeys 2,3 .<br />

Department of Information Technology, Ghent University-iMinds, Ghent, Belgium 1 ; Inflammation Research Center, VIB,<br />

Ghent, Belgium 2 ; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium 3 .<br />

We developed FlowSOM Web, a web tool that visualizes cytometry data based on Self-Organizing Maps. Similar cells<br />

are clustered and visualized via star charts. This allows us to process and display millions of cells efficiently.<br />

Additionally, different biological samples (e.g. healthy versus diseased mice) can be compared.<br />

INTRODUCTION<br />

Cytometry data describes cell characteristics in<br />

biological samples. Cells are labeled with fluorescent<br />

antibodies and a flow cytometer measures the properties<br />

of millions of cells one by one. Biologists use this<br />

information to gain more insight into diseases and to<br />

diagnose patients. Most of them still analyze this data<br />

manually to differentiate between the cell types<br />

present. This is done by plotting the data in 2D scatter<br />

plots and selecting groups of cells in a hierarchical way.<br />

This process is called 'gating'. Recently, the number of<br />

properties that can be measured simultaneously has<br />

strongly increased. As the number of possible 2D scatter<br />

plots increases exponentially with the number of<br />

properties measured, it becomes infeasible to analyze<br />

them all and relevant information that is present in the<br />

data might be missed.<br />

METHODS<br />

We present FlowSOM, a new algorithm for the<br />

visualization and interpretation of cytometry data (Van<br />

Gassen et al., 2015). Using two-level clustering and<br />

star charts, our algorithm helps to obtain a clear<br />

overview of how all markers are behaving on all cells,<br />

and to detect subsets that might be missed otherwise.<br />

Our algorithm consists of 4 steps: pre-processing the<br />

data, building a self-organizing map, building a minimal<br />

spanning tree and computing a meta-clustering result.<br />
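The middle two steps (building a self-organizing map and connecting its nodes with a minimal spanning tree) can be sketched in a few lines of Python. This is a simplified illustration and not the FlowSOM implementation, which is provided in R: it uses a one-dimensional chain of nodes rather than a two-dimensional grid, a simple "bubble" neighbourhood, Prim's algorithm for the tree, and synthetic two-marker data. Pre-processing and meta-clustering are omitted, and all names and parameter values are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(nodes, cell):
    """Index of the SOM node closest to a cell."""
    return min(range(len(nodes)), key=lambda j: sq_dist(nodes[j], cell))

def train_som(cells, n_nodes, epochs=20, seed=0):
    """Train a 1-D SOM (chain of nodes) with a shrinking bubble neighbourhood."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(cells)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        lr = 0.5 * (1.0 - epoch / epochs)         # decaying learning rate
        radius = 1 if epoch < epochs // 2 else 0  # update neighbours in the first half only
        for cell in cells:
            best = nearest(nodes, cell)
            for j in range(max(0, best - radius), min(n_nodes, best + radius + 1)):
                for k, value in enumerate(cell):
                    nodes[j][k] += lr * (value - nodes[j][k])
    return nodes

def minimal_spanning_tree(nodes):
    """Prim's algorithm on the complete graph of node-to-node distances."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(nodes):
        a, b = min(
            ((i, j) for i in in_tree for j in range(len(nodes)) if j not in in_tree),
            key=lambda e: sq_dist(nodes[e[0]], nodes[e[1]]),
        )
        in_tree.add(b)
        edges.append((a, b))
    return edges

# Synthetic "cells" with two markers, drawn around three population centres.
rng = random.Random(1)
cells = [[rng.gauss(c, 0.1), rng.gauss(c, 0.1)] for c in (0.0, 1.0, 2.0) for _ in range(50)]
nodes = train_som(cells, n_nodes=6)
tree = minimal_spanning_tree(nodes)
assignment = [nearest(nodes, cell) for cell in cells]
```

Each cell is assigned to its nearest node, so later steps only need to work with the handful of nodes instead of all cells.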

RESULTS & DISCUSSION<br />

Although our results are quite similar to those of SPADE, another<br />

state-of-the-art algorithm for the visualization of<br />

cytometry data, our results can be computed much faster<br />

and use less memory. By providing star-charts and an<br />

automatic meta-clustering step, much more information<br />

can be visualised in a single tree than is done by the<br />

SPADE algorithm.<br />

Additionally, multiple states can be compared (e.g.<br />

healthy versus diseased mice) with one another and the<br />

differences between the two states can be visualized via<br />

star-charts.<br />

At this conference, we would like to demonstrate a<br />

recently developed web interface to the underlying R<br />

functionality. This interface allows users to upload cytometry<br />

data, run the aforementioned analysis, compare different<br />

cell states and explore the results via interactive<br />

visualizations, all from the comfort of the browser.<br />

FIGURE 1. Example of a FlowSOM star chart.<br />

REFERENCES<br />

Van Gassen et al. (2015), FlowSOM: Using self-organizing maps for<br />

visualization and interpretation of cytometry data. Cytometry,<br />

87: 636–645<br />


P54. TOWARDS A BELGIAN REFERENCE SET<br />

Erika Souche 1* , Amin Ardeshirdavani 2 , Yves Moreau 2 , Gert Matthijs 1 & Joris Vermeesch 1 .<br />

Department of Human Genetics, KU Leuven 1 ; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and<br />

Data Analytic, KU Leuven 2 . * Erika.souche@uzleuven.be<br />

Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. The simultaneous<br />

sequencing of a large number of genes leads to the detection of a large number of variants, so the bottleneck has moved<br />

from sequencing to variant interpretation and classification. Although publicly available databases of variant<br />

frequencies help distinguish causative mutations from common variants, they often lack population-specific variant<br />

frequencies. To circumvent this shortage of population-specific information, most genetic centers exploit their sequence<br />

data of unrelated and unaffected individuals to filter out common local variants. However, the<br />

files/databases are rarely shared and they are mainly based on whole exome data. In this project we demonstrate the<br />

utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of<br />

information between genetic centers and mine low coverage whole genome data for common variants.<br />

INTRODUCTION<br />

Next-Generation Sequencing (NGS) is increasingly used<br />

to study and diagnose human disorders. The simultaneous<br />

sequencing of a large number of genes leads to the<br />

detection of a large number of variants, so the bottleneck has<br />

moved from sequencing to variant interpretation and<br />

classification. Publicly available databases of variant<br />

frequencies provided by, among others, the Exome<br />

Sequencing Project (ESP), the 1000 Genomes Project<br />

(McVean et al., 2012) or dbSNP (Sherry et al., 2001) help<br />

distinguish causative mutations from common variants,<br />

identifying up to 78% of variants as common for a Belgian<br />

exome. However, these data sets often lack population-specific<br />

variant frequencies and are outperformed by<br />

databases of local variants. For example, using GoNL<br />

(The Genome of the Netherlands Consortium, 2014) alone<br />

allowed the identification of up to 85% of variants as<br />

common for the same Belgian exome. The fact that the<br />

GoNL is based on only 498 individuals further highlights<br />

the importance of building and using population-specific<br />

databases.<br />

Such population-specific data can be retrieved from locally<br />

sequenced individuals that underwent Whole Exome<br />

Sequencing (WES) or Whole Genome Sequencing (WGS).<br />

Storing only the frequencies and genotype counts of the<br />

variants provides a valuable tool for variant classification<br />

while no sensitive information on the individuals is<br />

included.<br />

METHODS<br />

WES data of 350 unrelated and unaffected individuals<br />

were parsed. All samples were analysed in a similar<br />

way, i.e. reads were aligned to the reference genome with<br />

BWA (Li & Durbin, 2009) and genotyping was performed<br />

according to GATK best practices (McKenna et al., 2010;<br />

DePristo et al., 2011). All samples were genotyped at all<br />

polymorphic positions using GATK HaplotypeCaller and<br />

GenotypeGVCFs. For each position, samples with low<br />

quality genotype were considered as not genotyped and<br />

excluded from the genotype counts. The number of<br />

alternate alleles, allele counts and genotype counts were<br />

compiled in a population VCF file, in which individual<br />

genotypes are not accessible.<br />
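The aggregation described above can be sketched as follows: per-sample genotypes at one site are reduced to genotype counts, an alternate allele count (AC), a total allele number (AN) and an allele frequency (AF), so that no individual genotype needs to be stored. This is a minimal illustration for a biallelic site, not the authors' pipeline; the function name, the input layout and the genotype-quality threshold of 20 are hypothetical.

```python
from collections import Counter

MIN_GQ = 20  # hypothetical threshold below which a genotype counts as "not genotyped"

def summarise_site(calls, min_gq=MIN_GQ):
    """calls: list of (genotype, genotype_quality) pairs, e.g. ("0/1", 45).

    Returns only aggregate counts for a biallelic site.
    """
    counts = Counter()
    for gt, gq in calls:
        if gq is None or gq < min_gq or "." in gt:
            continue  # low-quality or missing genotype: excluded from the counts
        alleles = sorted(gt.replace("|", "/").split("/"))
        counts["/".join(alleles)] += 1
    an = 2 * sum(counts.values())  # diploid samples
    ac = sum(gt.split("/").count("1") * n for gt, n in counts.items())
    return {
        "genotype_counts": dict(counts),
        "AC": ac,
        "AN": an,
        "AF": ac / an if an else None,
    }

site = summarise_site([("0/0", 50), ("0/1", 40), ("1/1", 30), ("0/1", 5), ("./.", 0)])
```

Here the fourth and fifth calls are dropped, leaving three genotyped samples with AC = 3 and AN = 6.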

Variant frequencies can also be extracted from low<br />

coverage WGS. As a pilot we processed the data of<br />

chromosome 21 of about 4,000 WGS samples. The mapping was<br />

performed with BWA (Li & Durbin, 2009) and the BAM<br />

files were merged per 200 samples. All positions were<br />

genotyped using freebayes (Garrison & Marth, 2012).<br />

Genotype information of all locations outside low<br />

complexity regions was then compiled for all samples<br />

using the integration of Apache Hadoop, HBase and Hive<br />

(see poster “Big data solutions for variant discovery from<br />

low coverage sequencing data, by integration of Hadoop,<br />

Hbase and Hive”). Several models were then used to<br />

distinguish real variants from sequencing errors: the Minor<br />

Allele Frequency (MAF), the transition/transversion ratio,<br />

the expected number of loci with a MAF of 5%, etc.<br />
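Two of the quantities named above are simple to compute once candidate variants and allele counts are available. The sketch below shows the transition/transversion ratio for a list of candidate SNVs and the minor allele frequency from allele counts; how these quantities were combined into the models used to separate real variants from sequencing errors is not described here, so the code makes no attempt to reproduce that, and the example values are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A<->G and C<->T are transitions; every other SNV is a transversion.

    Assumes ref and alt are different single bases.
    """
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES

def ts_tv_ratio(snvs):
    """snvs: list of (ref, alt) single-base substitutions."""
    ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")

def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    af = alt_count / total_alleles
    return min(af, 1.0 - af)

candidates = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = ts_tv_ratio(candidates)
maf = minor_allele_frequency(alt_count=7, total_alleles=10)
```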

RESULTS & DISCUSSION<br />

We demonstrated the effect of our reference set on several<br />

exomes. The inclusion of only 350 individuals allowed the<br />

identification of about 3% additional common variants,<br />

not listed as common by ESP, dbSNP (Sherry et al., 2001),<br />

1000 Genomes (McVean et al., 2012) and GoNL (The<br />

Genome of the Netherlands Consortium, 2014). Since only<br />

the frequencies of the variants in the screened populations<br />

are reported, this file can easily be shared between<br />

laboratories. Besides, the procedure used to generate the<br />

population VCF file can easily be applied to several<br />

genetic centers in order to generate a common population<br />

VCF file, as planned within the BeMGI project.<br />

Finally, we expect that the data from WGS will further<br />

increase the performance of our reference set. A genome-wide<br />

variant frequency file from the local population will<br />

become worthwhile when WGS is routinely used in<br />

diagnostics.<br />

REFERENCES<br />

DePristo M et al. Nature Genetics 43, 491-498 (2011).<br />

Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle,<br />

WA (URL: http://evs.gs.washington.edu/EVS/).<br />

Garrison E & Marth G http://arxiv.org/abs/1207.3907 (2012).<br />

Li H & Durbin R Bioinformatics 25, 1754-60 (2009).<br />

McKenna A et al. Genome Research 20, 1297-303 (2010).<br />

McVean et al. Nature 491, 56–65 (2012).<br />

Sherry ST, et al. Nucleic Acids Res. 29, 308-11 (2001).<br />

The Genome of the Netherlands Consortium. Nature Genetics 46,<br />

818–825 (2014).<br />


P55. MANAGING BIG IMAGING DATA FROM MICROSCOPY:<br />

A DEPARTMENTAL-WIDE APPROACH<br />

Yves Sucaet 1* , Silke Smeets 1 , Stijn Piessens 1 , Sabrina D’Haese 1 , Chris Groven 1 , Wim Waelput 1 & Peter In’t Veld 1 .<br />

Department of Pathology 1 , Faculty of Medicine, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090 Brussels, Belgium.<br />

* yves.sucaet@usa.net<br />

With recent breakthroughs in whole slide imaging (WSI), almost any microscopic material can be digitized in an<br />

efficient manner. In order to mine these data efficiently, a top-down approach was employed to manage various imaging<br />

platforms. At Brussels Free University (VUB), we built a centralized infrastructure that integrates a variety of imaging<br />

platforms (brightfield, fluorescence, multi-vendor formats). With the help of the Pathomation software platform for<br />

digital microscopy, various datastores and image repositories were integrated. Custom coding was used to interact with<br />

various vendor-software and server applications, where needed. The end-result is an interconnected network of<br />

heterogeneous scalable information silos. We currently have two main use cases for WSI: education and biobanking.<br />

These applications are available to the public via http://www.diabetesbiobank.org.<br />

INTRODUCTION<br />

Too often, image analysis and data/image mining projects<br />

remain stuck in micro-environments because they are<br />

limited by vendor-specific solutions that neither scale nor<br />

interact with material from other departments or<br />

institutions. Successful roll-out of digital histopathology<br />

therefore requires more than a whole slide scanner.<br />

If the goal is for an imaging facility to allow a researcher<br />

to conduct a (microscopic) experiment, then that<br />

researcher should not be hindered by the imaging platform<br />

used. Similarly, an instructor integrating digital content<br />

into a course should be able to make the<br />

materials as accessible as possible to as many students as<br />

possible.<br />

At Brussels Free University (VUB), we currently have two<br />

main use cases for whole slide imaging: education and<br />

biobanking. We have set these up in such a way that they<br />

are both scalable and expandable.<br />

METHODS<br />

Whole slide imaging (WSI) has recently provided a boost<br />

to digital capturing of microscopic content (and an<br />

explosion of data, resulting in a veritable digital treasure<br />

trove waiting to be explored by bioinformatics). But<br />

researchers have been digitizing content for a long time<br />

already through various technologies (mounted cameras,<br />

inverted fluorescent microscopes with low magnification,<br />

…).<br />

We envisioned an environment whereby a researcher can<br />

manage and view all of the material related to an<br />

experiment or observation from a single interface,<br />

irrespective of origin or technology used.<br />

The following steps were taken to accomplish this:<br />

• Set up a central server (50 TB storage)<br />

• Centrally store all imaging data and provide mapped<br />

drives on the individual workstations to facilitate<br />

a smooth transition for end-users<br />

• Install the Pathomation platform for digital<br />

microscopy (PMA.core, PMA.view, PMA.zui)<br />

for universal viewing of digital content and to<br />

provide a uniform end-user experience<br />

• Install Pydio (open source) for easy sharing of<br />

digital imaging content (integrated with<br />

Pathomation’s PMA.core so no duplicate user<br />

directories need to be maintained)<br />

• Build custom portals to highlight specific<br />

collections of microscopic content and/or serve<br />

specific target audiences<br />

RESULTS & DISCUSSION<br />

The centralized digital imaging infrastructure is used by<br />

various researchers and graduate students. Recently over<br />

3,000 images were processed and hosted in the course of<br />

one month.<br />

Two use cases are worth highlighting:<br />

• For undergraduate students (Medicine, BMS) we<br />

built custom portal websites to supplement their<br />

courses in histology and pathology. These sites<br />

are available at http://histology.vub.ac.be and<br />

http://pathology.vub.ac.be and provide students<br />

with (guided) virtual microscopy without the<br />

need to install any additional software.<br />

• We also provide access portals to different<br />

specialized biobanks. The Willy Gepts collection<br />

represents a historic milestone in diabetes<br />

research (http://gepts.vub.ac.be) and is<br />

complementary to the Alan Foulis collection<br />

(http://foulis.vub.ac.be). Furthermore, the clinical<br />

diabetes biobank can now be consulted online,<br />

too, via http://www.diabetesbiobank.org.<br />

CONCLUSION<br />

Digital histopathology has been around for some time now,<br />

but often results in heterogeneous data collections. Only<br />

now are we starting to look at integrated approaches to how<br />

these varied data can best be handled. Digital pathology<br />

involves much more than the acquisition of a slide scanner.<br />

We have integrated five different imaging platforms into a<br />

single architecture. We are storing data from all modalities<br />

in a single storage facility, and manage them through a single<br />

access point. The resulting environment assists in<br />

rendering content to any type of display device, without<br />

the need for extra software or background information<br />

concerning the content’s origin.<br />


P56. ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN<br />

CANCER GENOMES USING ENHANCER PREDICTION MODELS AND<br />

MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA<br />

Dmitry Svetlichnyy 1* , Hana Imrichova 1 , Zeynep Kalender Atak 1 & Stein Aerts 1 .<br />

Laboratory of Computational Biology, University of Leuven 1 . *dmitry.svetlichnyy@med.kuleuven.be<br />

The prioritization of candidate driver mutations in the non-coding part of the genome is a key challenge in cancer<br />

genomics. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on<br />

their recurrence, non-coding mutations are usually not recurrent at the same position. We aim to tackle this problem<br />

using machine-learning methods to predict regulatory regions and cancer genome sequences in combination with sample-specific<br />

chromatin profiles obtained using ChIP-seq against H3K27Ac.<br />

INTRODUCTION<br />

Perturbations of gene regulatory networks in cancer cells<br />

can arise from mutations in transcription factors or cofactors,<br />

but also from mutations in regulatory regions.<br />

Prioritizing candidate driver mutations that have a<br />

significant impact on the activity of a regulatory region is<br />

a key challenge in cancer genomics.<br />

METHODS<br />

We have developed enhancer prediction methods using<br />

Random Forest classifiers to estimate the Predicted<br />

Regulatory Impact of a Mutation in an Enhancer<br />

(PRIME). We find that the recently identified driver<br />

mutation in the TAL1 enhancer has a high PRIME score,<br />

representing a “gain-of-target” for the oncogenic<br />

transcription factor MYB [1]. We trained enhancer models<br />

for 45 cancer-related transcription factors, and used these<br />

to score somatic mutations across more than five hundred<br />

breast cancer genomes. Next, we re-sequenced the genome<br />

of ten cancer cell lines representing six different cancer<br />

types (breast, lung, melanoma, ovarian, and colon) and<br />

profiled their active chromatin by ChIP-seq against<br />

H3K27Ac.<br />

RESULTS & DISCUSSION<br />

Then we integrated these data with matched expression<br />

data and with the Random Forest model predictions for<br />

sets of oncogenic transcription factors per cancer type.<br />

This resulted in surprisingly few high-impact mutations<br />

that generate de novo regulatory (oncogenic) activity at<br />

the chromatin and gene expression level. Our framework<br />

can be applied to identify candidate cis-regulatory<br />

mutations using sequence information alone, and to<br />

samples with combined genome-epigenome-transcriptome<br />

data. Our results suggest the presence of only a few cis-regulatory<br />

driver mutations per genome in cancer genomes<br />

that may alter the expression levels of specific oncogenes<br />

and tumor suppressor genes.<br />

REFERENCES<br />

1. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A,<br />

Durbin AD, et al. An oncogenic super-enhancer formed through somatic<br />

mutation of a noncoding intergenic element. Science. 2014;346: 1373–<br />

1377. doi:10.1126/science.1259037<br />


P57. I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN<br />

SEQUENCE VISUALIZATION<br />

Ibrahim Tanyalcin 1,2* , Carla Al Assaf 3 , Alexander Gheldof 1 , Katrien Stouffs 1,4 , Willy Lissens 1,4 & Anna C. Jansen 5,2 .<br />

Center for Medical Genetics, UZ Brussel, Brussels, Belgium 1 ; Neurogenetics Research Group, Vrije Universiteit Brussel,<br />

Brussels, Belgium 2 ; Center for Human Genetics, KU Leuven and University Hospitals Leuven, 3000 Leuven, Belgium 3 ;<br />

Reproduction, Genetics and Regenerative Medicine, Vrije Universiteit Brussel, Brussels, Belgium 4 ; Pediatric Neurology<br />

Unit, Department of Pediatrics, UZ Brussel, Brussels, Belgium 5 . *ibrahim.tanyalcin@i-pv.org or itanyalc@vub.ac.be<br />

Summary: Today’s genome browsers and protein databanks supply vast amounts of information about proteins. The<br />

challenge is to concisely bring together this information in an interactive and easy to generate format.<br />

Availability and Implementation: We have developed an interactive CIRCOS module called i-PV to visualize user<br />

supplied protein sequence, conservation and SNV data in a live presentable format. I-PV can be downloaded from<br />

http://www.i-pv.org.<br />

INTRODUCTION<br />

Today’s genome browsers and protein databanks supply<br />

vast amounts of information about both the structural<br />

annotation and the single nucleotide variants (SNV) in<br />

genes. The challenge is to concisely bring together this<br />

information in an interactive and easy to generate format.<br />

Thus, we have developed an interactive CIRCOS<br />

(Krzywinski et al., 2009) module combined with D3 (Bostock et<br />

al., 2011) and plain JavaScript, called i-PV, to visualize user<br />

supplied protein sequence, conservation and SNV data<br />

while significantly easing and automating input file<br />

requirements and generation.<br />

METHODS<br />

To use i-PV, only 4 text files (with “.txt” extension) have<br />

to be supplied to the software: conservation scores,<br />

protein and cDNA sequences, and SNVs/Indels files.<br />

Protein and cDNA (or mRNA) sequence files are supplied<br />

in FASTA format, whereas SNV/indel files are provided as<br />

an annotated VCF (Variant Call Format) file. The conservation<br />

scores are simply an array of numbers separated by newline<br />

characters. The input files are supplied to i-PV, data are<br />

automatically checked for errors or duplicates and<br />

matched against the user provided fasta files, and then an<br />

interactive html file containing the graph is automatically<br />

generated as shown in Fig.1.<br />
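To make the input formats concrete, the following Python sketch reads a FASTA sequence and a newline-separated list of conservation scores and checks that they are consistent with each other. It only illustrates the kind of consistency check described above; it is not i-PV's code, and the function names and the specific checks (one score per residue, coding sequence three times the protein length with an optional stop codon) are assumptions.

```python
def read_fasta(text):
    """Return the sequence of the first record in a FASTA-formatted string."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    seq = []
    for ln in lines[1:]:
        if ln.startswith(">"):
            break  # only the first record is used
        seq.append(ln)
    return "".join(seq).upper()

def read_conservation(text):
    """Parse one conservation score per line."""
    return [float(ln) for ln in text.splitlines() if ln.strip()]

def check_inputs(protein_fasta, cdna_fasta, conservation_text):
    """Return a list of human-readable problems; an empty list means consistent."""
    protein = read_fasta(protein_fasta)
    cdna = read_fasta(cdna_fasta)
    scores = read_conservation(conservation_text)
    problems = []
    if len(scores) != len(protein):
        problems.append("conservation scores: %d, residues: %d" % (len(scores), len(protein)))
    # Allow for an optional stop codon at the end of the coding sequence.
    if len(cdna) not in (3 * len(protein), 3 * len(protein) + 3):
        problems.append("coding sequence length does not match protein length")
    return problems

protein = ">demo\nMKT\n"
cdna = ">demo\nATGAAAACTTAA\n"
problems_ok = check_inputs(protein, cdna, "0.9\n0.5\n0.1\n")
problems_bad = check_inputs(protein, cdna, "0.9\n0.5\n")
```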

RESULTS & DISCUSSION<br />

Many sequence visualization tools focus on certain aspects<br />

of proteins such as conservation, variations, sequence<br />

alignments or topology. While all these tools are very<br />

useful in their own right, we pursued a more<br />

interactivity-based design. Therefore, i-PV is not solely designed for<br />

visualization but also for live presentable graphs and<br />

information that can selectively be displayed and<br />

customized. I-PV combines major sources of information<br />

under one html file that is easy to generate and share on<br />

both desktop and mobile environments.<br />

Last but not least, many visualization tools are based on a<br />

rectangular, scroll-based representation of information,<br />

which, unlike circular visualization, does not deliver a<br />

“wide angle” view of the sequence data. However, as<br />

with all other types of visualization, circular graphs also<br />

have limitations when it comes to<br />

conveniently zooming in on a particular region or visually<br />

aligning tracks with different radii. We intend to further<br />

develop this software with several other features based on<br />

end user needs. The current version of i-PV can be<br />

downloaded from http://www.i-pv.org.<br />

FIGURE 1. Overview of i-PV features. (A) SNVs with mouse-over<br />

explanation and automatically generated dbSNP links (red: Nonsynonymous,<br />

green: Synonymous, gray: Not validated). (B) Console can<br />

be hidden for publication quality image. (C) Domains are colored based<br />

on user preference. (D) Conservation data from user generated<br />

alignment with mouse over information. (E) The user can define which<br />

amino acids are shown on the sequence track. (F) Switch the color of<br />

the background to black. (G) Amino acids are plotted and split into 5<br />

main categories (nonpolar: gray circle, polar: magenta circle, negative:<br />

blue triangle, positive: red triangle, aromatic: green hexagon). (H)<br />

Adjustable conservation score threshold to display regions above a<br />

certain percentage of maximum conservation score. (I) Font-size of<br />

chosen amino acids can be adjusted. (J) User selectable amino acids to<br />

be displayed. (K) Up to 17 different amino acid properties can be chosen<br />

to be displayed from the drop-down menu. (L) Tile track showing SNVs and<br />

indels (red: SNVs, magenta: Indels, gray stroke: Not validated, black:<br />

collapsed due to over display). (M) Gene Name. (N) Buttons for mass<br />

selection of amino acids. (O) User defined regions are marked with<br />

custom name tag and mouse over information. (P) Meta-analysis of<br />

amino acid distributions. This information is only displayed in case of<br />

single amino acid comparisons. The log2 ratios are capped between -3<br />

and 3. The minimum and the maximum BLOSUM62 scores are -4 and 11.<br />

Since the BLOSUM62 matrix is symmetric, the absolute values of<br />

the log ratios are mapped to this range and a p-value is indicated based<br />

on how close the two scores are.<br />

REFERENCES<br />

Bostock, M., et al. (2011), 'D3: Data-Driven Documents', IEEE Trans.<br />

Visualization & Comp. Graphics (Proc. InfoVis).<br />

Krzywinski, M., et al. (2009), 'Circos: an information aesthetic for<br />

comparative genomics', Genome Res, 19 (9), 1639-45.<br />


P58. SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY<br />

PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS<br />

Kevin Titeca 1,2 , Pieter Meysman 3,4 , Kris Gevaert 1,2 , Jan Tavernier 1,2 ,<br />

Kris Laukens 3,4 , Lennart Martens 1,2 & Sven Eyckerman 1,2* .<br />

Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, B-9000<br />

Ghent, Belgium 2 ; Advanced Database Research and Modeling (ADReM), University of Antwerp, Belgium 3 ; Biomedical<br />

informatics research center Antwerpen (biomina), Belgium 4 . sven.eyckerman@vib-ugent.be<br />

Affinity purification-mass spectrometry (AP-MS) is one of the most common techniques for the analysis of protein-protein<br />

interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult because<br />

of the many false positives. The ideal filter technique for these data is highly accurate, fast and user friendly without the<br />

need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased.<br />

Because none of the existing filter techniques combines all these features, we developed SFINX, the Straightforward<br />

Filtering INdeX.<br />

We here describe the SFINX algorithm and its performance on two independent AP-MS benchmark datasets. SFINX<br />

shows superior performance over the other approaches with accuracy increases of up to 20%, and is extremely fast. It<br />

does not require parameter optimization, and is absolutely independent of external resources. Both the algorithm and its<br />

website interface are highly intuitive with limited need for user input and the possibility of immediate network<br />

visualization and interpretation at http://sfinx.ugent.be/. SFINX might become essential in the toolbox of any scientist<br />

interested in user-friendly and highly accurate filtering of AP-MS data.<br />


P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION:<br />

AN EXTREMELY IMBALANCED BIG DATA PROBLEM<br />

Isaac Triguero 1,2* , Sara del Río 3 , Victoria López 3 , Jaume Bacardit 4 , José M. Benítez 3 & Francisco Herrera 3 .<br />

VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Department of Computer<br />

Science and Artificial Intelligence 3 ; School of Computing Science, Newcastle University 4 .<br />

* Isaac.Triguero@irc.vib-Ugent.be<br />

The application of data mining and machine learning techniques to biological and biomedical data continues to be a<br />

ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and<br />

store large quantities of data about cells, proteins, genes, etc., that need to be processed. Moreover, in many of these<br />

problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these<br />

circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine<br />

learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which was<br />

concerned with the prediction of contact maps. Our methodology is composed of several MapReduce approaches to deal<br />

with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics<br />

classification problems.<br />

INTRODUCTION<br />

The prediction of a protein’s contact map is a crucial step<br />

for the prediction of the complete 3D structure of a protein.<br />

This is one of the most challenging bioinformatics tasks<br />

within the field of protein structure prediction because of<br />

the sparseness of the contacts (i.e. few positive examples)<br />

and the large amount of data extracted (i.e. millions of<br />

instances, GBs of disk space) from a few thousand<br />

proteins.<br />

This problem is an imbalanced bioinformatics big<br />

data application, in which traditional machine learning<br />

techniques become ineffective and inefficient due to<br />

the size of the problem. However, with the use of<br />

the emerging cloud-based technologies, these techniques<br />

can be redesigned to extract valuable knowledge from<br />

such amounts of data.<br />

The ECBDL’14 competition (http://cruncher.ncl.ac.uk/bdcomp/)<br />

provided a data set that modeled the contact<br />

map prediction problem as a classification task.<br />

Concretely, the training data set comprised<br />

32 million instances with 631 attributes and 2 classes; 98% of<br />

the examples were negative, and the data occupied about 56 GB of disk<br />

space.<br />

In this work we describe the methodology with which we<br />

participated under the name 'Efdamis', and which ranked as the<br />

winning algorithm (Triguero et al, 2015).<br />

METHODS<br />

In the proposed methodology, we focused on the<br />

MapReduce (Dean et al, 2008) paradigm in order to<br />

manage this voluminous data set. We extended the<br />

applicability of some pre-processing and classification<br />

models to deal with large-scale problems. The methodology is<br />

composed of four main parts:<br />

• An oversampling approach: The goal is to balance the<br />

highly skewed class distribution of the problem by<br />

randomly replicating the instances of the minority<br />

class (del Río et al, 2014).<br />

• An evolutionary feature weighting method: Due to the<br />

relatively high number of features of the given problem,<br />

we developed a feature selection scheme for large-scale<br />

problems that improves the classification<br />

performance by detecting the most significant features<br />

(Triguero et al, 2012).<br />

• Building a learning model: As the classifier, we focused<br />

on a scalable RandomForest algorithm.<br />

• Testing the model: Even the test data can be<br />

considered big data (2.9 million instances), so<br />

the testing phase was also deployed within a parallel<br />

approach.<br />
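The oversampling part can be pictured with a small single-machine sketch. It only mimics the MapReduce pattern (per-partition map steps plus a reduce step that sums the class counts). The function names, the toy partitions and the per-partition replication strategy are illustrative assumptions, not the implementation of del Río et al.

```python
import random
from collections import Counter

def count_classes(partition):
    """Map step: count the class labels found in one partition."""
    return Counter(label for _features, label in partition)

def oversample_partition(partition, minority, factor, rng):
    """Map step: add random copies of minority rows so that the partition
    holds roughly `factor` times as many of them as before."""
    minority_rows = [row for row in partition if row[1] == minority]
    if not minority_rows:
        return list(partition)
    n_extra = round(len(minority_rows) * (factor - 1))
    copies = [rng.choice(minority_rows) for _ in range(n_extra)]
    return list(partition) + copies

# Hypothetical toy data: 4 partitions, each with 98 negatives and 2 positives.
partitions = [
    [((i,), 0) for i in range(98)] + [((i,), 1) for i in range(2)]
    for _ in range(4)
]

# Reduce step: merge the per-partition counts into global class totals.
totals = sum((count_classes(p) for p in partitions), Counter())
minority = min(totals, key=totals.get)
majority = max(totals, key=totals.get)
factor = totals[majority] / totals[minority]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
balanced = [oversample_partition(p, minority, factor, rng) for p in partitions]
balanced_totals = sum((count_classes(p) for p in balanced), Counter())
print(dict(totals), "->", dict(balanced_totals))
```

Because each partition is rebalanced independently once the global class ratio is known, the replication step needs no communication between partitions, which is what makes this kind of preprocessing easy to distribute.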

RESULTS & DISCUSSION<br />

Table 1 presents the final results of the top 5 participants<br />

in terms of True Positive Rate (TPR) and True Negative<br />

Rate (TNR). In this particular problem, the necessity of<br />

balancing the TPR and TNR ratios emerged as a difficult<br />

challenge for most of the participants of the competition.<br />

In this sense, the use of scalable preprocessing techniques<br />

played an important role in improving the results of the<br />

RandomForest classifier. First, the designed oversampling<br />

approach allowed us to prevent RandomForest from being<br />

biased towards the negative class. Second, our feature weighting<br />

approach allowed us to reduce the<br />

dimensionality of the problem by selecting the most<br />

relevant features. This resulted in better performance<br />

as well as a notable reduction of the time requirements.<br />

Team | TPR | TNR | TPR × TNR<br />

Efdamis | 0.73043 | 0.73018 | 0.53335<br />

ICOS | 0.70321 | 0.73016 | 0.51345<br />

UNSW | 0.69916 | 0.72763 | 0.50873<br />

HyperEns | 0.64003 | 0.76338 | 0.48858<br />

PUC-Rio_ICA | 0.65709 | 0.71460 | 0.46956<br />

TABLE 1: Comparison with the top 5 of the competition.<br />
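The last column of Table 1 is the product of TPR and TNR. As a quick check, the sketch below recomputes that product from the two rate columns; the helper name `rates` is illustrative, and small last-digit differences are to be expected because the tabulated rates are already rounded.

```python
def rates(tp, fn, tn, fp):
    """Return (TPR, TNR) from the four confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# Values copied from Table 1: team -> (TPR, TNR, reported TPR x TNR).
table1 = {
    "Efdamis": (0.73043, 0.73018, 0.53335),
    "ICOS": (0.70321, 0.73016, 0.51345),
    "UNSW": (0.69916, 0.72763, 0.50873),
    "HyperEns": (0.64003, 0.76338, 0.48858),
    "PUC-Rio_ICA": (0.65709, 0.71460, 0.46956),
}

for team, (tpr, tnr, reported) in table1.items():
    print(f"{team:12s} {tpr * tnr:.5f} (reported {reported:.5f})")

# A classifier that calls everything negative on 98%-negative data scores
# TNR = 1 but TPR = 0, so the product is 0 despite 98% accuracy.
all_negative = rates(tp=0, fn=2, tn=98, fp=0)
print(all_negative, all_negative[0] * all_negative[1])
```

The product rewards classifiers that do well on both classes at once, which is why balancing TPR and TNR mattered more in this competition than raw accuracy.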

REFERENCES<br />

Dean J., Ghemawat S., MapReduce: simplified data processing on large<br />

clusters, Commun. ACM 51 (1) (2008) 107–113.<br />

del Río S., et al., On the use of MapReduce for imbalanced big data using<br />

random forest, Inf. Sci. 285 (2014) 112–137.<br />

Triguero I. et al., Integrating a differential evolution feature weighting<br />

scheme into prototype generation, Neurocomputing 97 (2012) 332–343.<br />


P60. COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF<br />

CO-EXPRESSION NETWORKS<br />

Oren Tzfadia 1,2 , Tim Diels 1,2,4 , Sam De Meyer 1,2 , Klaas Vandepoele 1,2 , Yves Van de Peer 1,2,3,5,* & Asaph Aharoni 6 .<br />

Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium 1 ; Department of Plant Biotechnology and<br />

Bioinformatics, Ghent University, 9052 Ghent, Belgium 2 ; Genomics Research Institute (GRI), University of Pretoria,<br />

0028 Pretoria, South Africa 3 ; Department of Mathematics and Computer Science, University of Antwerp, Antwerp,<br />

Belgium 4 ; Bioinformatics Institute Ghent, Ghent University, 9052 Ghent, Belgium 5 ; Department of Plant Sciences and<br />

the Environment, Weizmann Institute of Science, Rehovot 6 .<br />

INTRODUCTION<br />

Comparative transcriptomics is a common approach in<br />

functional gene discovery efforts. It allows for finding<br />

conserved co-expression patterns between orthologous<br />

genes in closely related plant species, suggesting that these<br />

genes potentially share similar function and regulation.<br />

Several efficient co-expression-based tools have been<br />

commonly used in plant research, but most of these<br />

pipelines are limited to data from model systems, which<br />

greatly limits their utility. Moreover, none of<br />

the existing pipelines allows plant researchers to make use<br />

of their own unpublished gene expression data for<br />

performing a comparative co-expression analysis and<br />

generating multi-species co-expression networks.<br />

RESULTS<br />

We introduce CoExpNetViz, a computational tool that<br />

takes as input a set of bait genes (chosen by the user)<br />

and a minimum of one pre-processed gene expression<br />

dataset. The CoExpNetViz algorithm proceeds in three<br />

main steps: (i) for every bait gene submitted, co-expression<br />

values are calculated using Pearson correlation<br />

coefficients, (ii) non-bait (or target) genes are grouped<br />

based on cross-species orthology, and (iii) output files are<br />

generated and results can be visualized as network graphs<br />

in Cytoscape.<br />
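Step (i) can be illustrated with a minimal sketch that correlates each bait gene with every non-bait gene and keeps the strongly correlated pairs. The gene names, the toy expression values and the cutoff of 0.8 are hypothetical, and this is not the CoExpNetViz code itself; it also assumes that no expression profile is constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpression_edges(expression, baits, cutoff):
    """Return (bait, target, r) for every non-bait gene whose absolute
    correlation with a bait reaches the cutoff."""
    edges = []
    for bait in baits:
        for gene, profile in expression.items():
            if gene in baits:
                continue
            r = pearson(expression[bait], profile)
            if abs(r) >= cutoff:
                edges.append((bait, gene, round(r, 3)))
    return edges

# Hypothetical expression matrix: gene -> values across five samples.
expression = {
    "baitA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene1": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with baitA
    "gene2": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as baitA rises
    "gene3": [3.0, 1.0, 3.0, 1.0, 3.0],   # no linear relation to baitA
}

edges = coexpression_edges(expression, baits={"baitA"}, cutoff=0.8)
print(edges)
```

Each returned triple corresponds to one bait-target edge of the kind that can then be drawn as a network graph.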

AVAILABILITY AND IMPLEMENTATION<br />

The CoExpNetViz tool is freely available both as a PHP<br />

web server (http://bioinformatics.psb.ugent.be/webtools/coexpr/;<br />

implemented in C++) and as a Cytoscape plugin<br />

(implemented in Java). Both versions of the CoExpNetViz<br />

tool support Linux and Windows platforms.<br />


P61. THE DETECTION OF PURIFYING SELECTION DURING TUMOUR<br />

EVOLUTION UNVEILS CANCER VULNERABILITIES<br />

Jimmy Van den Eynden 1* & Erik Larsson 1 .<br />

Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University<br />

of Gothenburg, Sweden. * jimmy.van.den.eynden@gu.se<br />

Identification of somatic mutation patterns indicative of positive selection arguably has become the major goal of cancer<br />

genomics. This is motivated by a search for cancer driver genes and pathways that are recurrently activated in tumours<br />

but not normal cells, thus providing possible therapeutic windows. However, cancer cells additionally depend on a large<br />

number of basic cellular processes, and elevated sensitivity to inhibition of certain essential non-driver genes has been<br />

demonstrated in some cases. While such vulnerability genes should in theory be identifiable based on strong purifying<br />

(negative) selection in tumours, these patterns have been elusive and purifying selection remains underexplored in cancer.<br />

We established a new methodology and, using mutational data from 25 TCGA tumour types, we show for the first time<br />

that negative selection in candidate vulnerability genes can be detected.<br />

INTRODUCTION<br />

Recently it was shown that a hemizygous deletion of the<br />

well-known tumour suppressor gene TP53 creates<br />

therapeutic vulnerability in colorectal cancer due to<br />

concomitant loss of the neighbouring gene POLR2A (Liu<br />

et al., 2015).<br />

As any damaging mutation occurring in the single allele of<br />

a hemizygously deleted essential gene, like POLR2A, is<br />

expected to lead to cell death, we hypothesized that<br />

purifying selection in these genes could be unveiled by<br />

demonstrating a lower number of damaging mutations<br />

than would be expected in the absence of any selection.<br />

Therefore we used the POLR2A case as a proof-of-concept<br />

to develop a methodology to detect purifying<br />

selection in large genome sequencing datasets.<br />

METHODS<br />

Mutation and copy number data from 25 different cancer<br />

types and 7,871 samples were downloaded from the<br />

TCGA data portal and pooled together in a large pan-cancer<br />

dataset. Different mutational functional impact<br />

scores were calculated using ANNOVAR. Copy number data<br />

were analyzed using GISTIC 2.0 to differentiate POLR2A<br />

copy number neutral from hemizygously deleted samples.<br />

RESULTS & DISCUSSION<br />

POLR2A was found to be hemizygously deleted in 29% of<br />

all samples. As expected, in over 99% of cases this deletion was<br />

part of the TP53 (driving) deletion on chromosome 17.<br />

POLR2A was mutated 228 times in 2.3% of all samples.<br />

While 14 nonsense mutations and small out-of-frame<br />

insertions or deletions occurred in the copy number<br />

neutral group, none of these damaging mutations were<br />

found in the deletion group (p=0.03, Fisher's exact test),<br />

suggesting purifying selection against this type of<br />

mutation.<br />
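The comparison above is a 2×2 contingency test (deletion status versus truncating or not). Assuming the test used is Fisher's exact test, the sketch below shows how a two-sided p-value follows from the hypergeometric distribution. The function name and the toy counts are illustrative assumptions and are not the study's data; in practice a library routine such as `scipy.stats.fisher_exact` would normally be used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed
    # one (the small factor absorbs floating-point ties).
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )

# Toy table, NOT the POLR2A counts:
# rows = (deleted, copy number neutral), columns = (truncating, other).
p_toy = fisher_exact_two_sided(0, 5, 4, 3)
print(f"toy p-value: {p_toy:.4f}")
```

The test conditions on the row and column totals, so it remains valid when one cell is zero, as is the case for the truncating mutations in the deletion group.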

Besides these truncating mutations, missense<br />

mutations that have a damaging effect on the gene’s<br />

protein function are also expected to be selected against.<br />

Therefore we predicted the functional impact of all<br />

mutations using different functional impact scores. The<br />

median (PolyPhen-2) functional impact score was found<br />

to be significantly lower in the deletion group than in<br />

the copy number neutral group (p=0.002, Wilcoxon test,<br />

Figure 1), further confirming that purifying selection has<br />

taken place in POLR2A during tumour evolution.<br />

These preliminary findings confirm that purifying<br />

selection is detectable in vulnerability genes like POLR2A<br />

and this approach could be used to detect other, new<br />

candidate vulnerability genes.<br />

FIGURE 1. Negative selection against POLR2A high impact mutations in<br />

hemizygously deleted tumour samples.<br />

REFERENCES<br />

Liu, Y., Zhang, X., Han, C., Wan, G., Huang, X., Ivan, C., … Lu, X.<br />

(2015). TP53 loss creates therapeutic vulnerability in colorectal<br />

cancer. Nature, 520(7549), 697–701.<br />

http://doi.org/10.1038/nature14418<br />


P62. FLOREMI: SURVIVAL TIME PREDICTION<br />

BASED ON FLOW CYTOMETRY DATA<br />

Sofie Van Gassen 1,2,3* , Celine Vens 2,3,4 , Tom Dhaene 1 , Bart N. Lambrecht 2,3 & Yvan Saeys 2,3 .<br />

Department of Information Technology, Ghent University—iMinds 1 ; VIB Inflammation Research Center 2 ; Department of<br />

Respiratory Medicine, Ghent University 3 ; Department of Public Health and Primary Care, KU Leuven Kulak 4 .<br />

* sofie.vangassen@irc.vib-ugent.be<br />

Flow cytometry is a high-throughput technique for single cell analysis. It enables researchers and pathologists to study<br />

blood and tissue samples by measuring several cell properties, such as cell size, granularity and the presence of cellular<br />

markers. While this technique provides a wealth of information, it becomes hard to analyze all data manually. To<br />

investigate alternative automatic analysis methods, the FlowCAP challenges were organized. We will present an<br />

algorithm that obtained the best results on the FlowCAP IV challenge, predicting the time of progression to AIDS for<br />

HIV patients.<br />

INTRODUCTION<br />

The main task of the most recent FlowCAP IV challenge<br />

was a survival modeling challenge: participants had to<br />

predict the time of progression to AIDS for HIV patients,<br />

based on flow cytometry data of an unstimulated and a<br />

stimulated blood sample. Additionally, a secondary task<br />

was the identification of cell populations that could be<br />

indicative of this progression rate. Several challenges<br />

needed to be taken into account: the raw dataset was about<br />

20 GB in size and about eighty percent of the survival times<br />

were censored.<br />

METHODS<br />

We developed a new algorithm, FloReMi, which<br />

combined several preprocessing steps with a density-based<br />

clustering algorithm, a feature selection step and a random<br />

survival forest (Van Gassen et al., 2015).<br />

The input for our algorithm consisted of 2 flow cytometry<br />

samples for each patient: one unstimulated PBMC sample<br />

and one PBMC sample stimulated with HIV antigens. For<br />

each of these samples, 16 parameters were measured for<br />

hundreds of thousands of cells.<br />

First, we included quality control to remove erroneous<br />

measurements from the samples. We also made an<br />

automatic selection of live T cells to focus on the cells of<br />

interest in this specific flow cytometry staining.<br />

Once the dataset was cleaned up, we extracted features for<br />

each patient. This was done by clustering the cells using<br />

the flowDensity (Malek et al., 2015) and flowType<br />

algorithms (Aghaeepour et al., 2012). These algorithms<br />

divide the values for each feature into either “high” or<br />

“low” and use all combinatorial options of “high”, “low”<br />

or “neutral” marker values to group the cells. This resulted<br />

in 3^10 (59,049) different cell subsets.<br />

For each of these subsets, we computed the number of<br />

cells assigned to it and the mean fluorescence intensity for<br />

13 markers. Per patient, we collected these numbers for<br />

both samples and also computed the differences between<br />

the two. This resulted in a total of 2,480,058 features per<br />

patient.<br />
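The feature count can be reproduced with a few lines of arithmetic, reading the number of subsets as 3^10, i.e. three possible states for each of ten markers (the figure of ten markers is inferred from the exponent). The variable names are illustrative.

```python
states_per_marker = 3         # "high", "low" or "neutral"
n_clustering_markers = 10     # inferred from the exponent in 3^10
n_subsets = states_per_marker ** n_clustering_markers

features_per_subset = 1 + 13  # cell count + mean intensity for 13 markers
views = 3                     # unstimulated, stimulated, and their difference

n_features = n_subsets * features_per_subset * views
print(n_subsets, n_features)  # 59049 2480058
```

The result matches the 2,480,058 features per patient reported above, which supports this reading of how the features were assembled.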

Because traditional machine learning algorithms cannot<br />

handle this number of features, we then applied a feature<br />

selection step. To estimate the usefulness of a feature, we<br />

fitted a Cox proportional hazards model to each feature.<br />

The resulting p-value indicates how well the feature<br />

corresponds with the known survival times for the training<br />

set. We ordered the features based on these scores, and<br />

picked only those that were uncorrelated with the others.<br />

This resulted in a final selection of 13 features, on which<br />

we applied several machine learning techniques. We<br />

compared the results of the Cox Proportional Hazards<br />

model, the Additive Hazards model and the Random<br />

Survival Forest.<br />
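The "order by score, keep only uncorrelated features" step can be sketched as a greedy filter. The sketch assumes that a p-value per feature is already available (for instance from the univariate Cox fits). The correlation cutoff of 0.5, the feature names and the toy values are hypothetical, since the text does not state the threshold that was used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def select_uncorrelated(features, p_values, max_abs_r, n_keep):
    """Walk through the features from best to worst p-value and keep a
    feature only if it is weakly correlated with everything kept so far."""
    kept = []
    for name in sorted(p_values, key=p_values.get):
        if all(abs(pearson(features[name], features[k])) < max_abs_r
               for k in kept):
            kept.append(name)
        if len(kept) == n_keep:
            break
    return kept

# Hypothetical feature values across five patients and their p-values.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 11.0],  # nearly a copy of f1
    "f3": [3.0, 1.0, 3.0, 1.0, 3.0],   # unrelated to f1
}
p_values = {"f1": 0.001, "f2": 0.002, "f3": 0.01}

selected = select_uncorrelated(features, p_values, max_abs_r=0.5, n_keep=13)
print(selected)
```

Here `f2` is dropped despite its good p-value because it duplicates the information in `f1`, which is the behaviour the selection step relies on to arrive at a small, non-redundant feature set.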

RESULTS & DISCUSSION<br />

All three methods performed well on the training dataset.<br />

However, on the test dataset, both the Cox Proportional<br />

Hazards model and the Additive Hazards model performed<br />

poorly, probably due to overfitting on the training data.<br />

Only the Random Survival Forest obtained good results on<br />

the test dataset (Figure 1). This method outperformed all<br />

other methods submitted to the challenge.<br />

FIGURE 1. On the training dataset, there was a strong correlation<br />

between the scores and the actual survival times for all models. On the<br />

test dataset, only the Random Survival Forest performed well.<br />

One important challenge remains: the biological<br />

interpretation of our final features. Although they correlate<br />

with the transition times from HIV to AIDS, it is hard to<br />

interpret them as known cell types, due to our<br />

unsupervised feature extraction. Our method delivers a<br />

first step towards new insights into the progression from HIV to<br />

AIDS.<br />

REFERENCES<br />

Malek M et al. Bioinformatics 31(4), 606-607 (2015).<br />

Aghaeepour N et al. Bioinformatics 28, 1009-1016 (2012).<br />

Van Gassen S et al. Cytometry A, DOI 10.1002/cyto.a.22734<br />


P63. STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO<br />

UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS<br />

Sebastiaan Vanuytven 1* , Jonas Demeulemeester 1 , Zeger Debyser 1 & Rik Gijsbers 1,2 .<br />

Laboratory for Molecular Virology and Gene Therapy, KU Leuven 1 ; Leuven Viral Vector Core, KU Leuven 2 .<br />

* Sebastiaan.vanuytven@student.kuleuven.be<br />

Integrating retroviral vectors are used to treat genetic and acquired disorders that, theoretically, can be cured by<br />

introducing specific gene expression cassettes into patient cells. Clinical trials held over the past two decades have<br />

proven that this approach is effective in curing genetic disorders and can produce better results than the standard therapy<br />

(Touzot F et al., 2015). Nevertheless, adverse events in a limited number of patients treated with gamma-retroviral<br />

vectors have deterred their widespread application. Specifically, vector integration occurring in proximity of proto-oncogenes<br />

resulted in insertional mutagenesis and clonal expansion of the cells (Hacein-Bey-Abina S et al., 2003).<br />

INTRODUCTION<br />

Retroviruses and their derived viral vectors do not<br />

integrate at random. Their overall integration pattern is<br />

dictated by cellular cofactors that are co-opted by the<br />

invading viral complex. For gammaretroviral vectors<br />

(prototype MLV), the cellular bromodomain and extraterminal<br />

domain (BET) family of proteins (BRD2, BRD3 and<br />

BRD4) tethers the viral integrase to the host cell<br />

chromatin (De Rijck J et al., 2013). At the moment the<br />

only available ChIP-seq data derives from HEK-293T<br />

cells exogenously overexpressing FLAG-tagged versions<br />

of the BET proteins (LeRoy G et al., 2012). Yet, the<br />

detailed chromatin binding profile of endogenous BET<br />

proteins in human cells is currently unknown. Here we<br />

report on the chromatin occupation of the endogenous<br />

BET proteins in K562 and human primary CD4+ T cells.<br />

METHODS<br />

Following fixation, all three BET proteins were pulled down<br />

with specific antibodies (Bethyl Laboratories, α-BRD2:<br />

A302-583A; α-BRD3: A302-368A; α-BRD4:<br />

A301-985A or Abcam ab84776). Subsequently, 1×10^7<br />

cells per sample were processed for ChIP as previously<br />

described (Pradeepa MM et al., 2012). ChIPed DNA was<br />

amplified with WGA2 using the manufacturer's protocol<br />

(Sigma Aldrich). All ChIP experiments were done with at<br />

least two biological replicates in K562 and CD4+ T cells.<br />

After processing of the ChIP-seq data, we compared the<br />

obtained BET protein-binding sites with MLV integration<br />

sites, histone modifications and other genetic features.<br />

Furthermore, we used motif discovery in the<br />

neighbourhood of BET binding sites and MLV integration<br />

sites to try and discover potential new players in the MLV<br />

integration process.<br />
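Comparing binding sites with integration sites amounts to asking, per chromosome, whether each integration position falls inside a binding interval. The sketch below shows one way to compute such an overlap fraction with a binary search. The coordinates are hypothetical and the function name is illustrative; real analyses typically rely on dedicated genome-interval tools.

```python
from bisect import bisect_right

def fraction_in_peaks(sites, peaks):
    """Fraction of point sites (chrom, pos) that fall inside any peak.

    `peaks` maps chrom -> sorted list of non-overlapping (start, end)
    half-open intervals.
    """
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in peaks.items()}
    hits = 0
    for chrom, pos in sites:
        intervals = peaks.get(chrom, [])
        # Index of the last interval that starts at or before pos.
        i = bisect_right(starts.get(chrom, []), pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits / len(sites)

# Hypothetical binding intervals and integration positions.
peaks = {"chr1": [(100, 200), (500, 650)], "chr2": [(50, 80)]}
sites = [("chr1", 150), ("chr1", 300), ("chr1", 649),
         ("chr2", 80), ("chr3", 10)]

print(fraction_in_peaks(sites, peaks))  # 0.4
```

Sorting the intervals once and searching them with `bisect` keeps each lookup logarithmic, which matters when many integration sites are compared against genome-wide peak sets.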

RESULTS & DISCUSSION<br />

Analysis showed that 24% of the MLV integration sites<br />

overlap with a BET-binding site in K562 cells, the<br />

majority of which are BRD4 sites. In addition, BET<br />

binding sites located in promoter and enhancer regions are<br />

preferred for MLV integration. Further evaluation<br />

demonstrated a strong correlation between MLV integration<br />

in these sites and the occurrence of the<br />

transcription factor recognition motifs for MAX, GATA2,<br />

EGR1, GABPA and YY1, suggesting a role for these<br />

proteins or the underlying chromatin structures in<br />

targeting integration of MLV to these locations in the<br />

genome via interaction with BET proteins and/or the MLV<br />

long terminal repeat sequences. Recently, we generated<br />

MLV-based vectors that no longer recognize BET proteins,<br />

BET-independent MLV-based (BinMLV) vectors (El<br />

Ashkar S et al., 2014). Integration preferences of BinMLV<br />

vectors are shifted away from epigenetic marks associated<br />

with enhancers and promoters, as shown by PCA,<br />

and they also associate less with BET and MAX binding<br />

sites. Even though BinMLV vectors still did not integrate<br />

at random, their distribution can overall be described as<br />

safer, with 3% more integration sites in so-called<br />

genomic "safe-harbor" regions (Sadelain M et al., 2012).<br />

REFERENCES<br />

De Rijck J et al. The BET family of proteins targets Moloney murine<br />

leukemia virus integration near transcription start sites, Cell Rep, 5,<br />

886-894, (2013).<br />

El Ashkar S et al. BET-independent MLV-based Vectors Target Away<br />

From Promoters and Regulatory Elements, Mol Ther Nucleic Acids,<br />

3, e179, (2014).<br />

Hacein-Bey-Abina S et al. LMO2-associated clonal T cell proliferation in<br />

two patients after gene therapy for SCID-X1, Science, 302, 415-419,<br />

(2003).<br />

LeRoy G et al. Proteogenomic characterization and mapping of<br />

nucleosomes decoded by Brd and HP1 proteins, Genome Biol, 13,<br />

R68, (2012).<br />

Pradeepa MM et al. Psip1/Ledgf p52 binds methylated histone H3K36<br />

and splicing factors and contributes to the regulation of alternative<br />

splicing, PLoS Genet, 8, e1002717, (2012).<br />

Sadelain M, Papapetrou EP and Bushman FD. Safe harbours for the<br />

integration of new DNA in the human genome, Nat Rev Cancer, 12,<br />

51-58, (2012).<br />

Touzot F et al. Faster T-cell development following gene therapy<br />

compared with haploidentical HSCT in the treatment of SCID-X1,<br />

Blood, 125, 3563-3569, (2015).<br />


P64. THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS<br />

FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO<br />

THE SOURDOUGH FERMENTATION PROCESS<br />

Marko Verce, Koen Illeghems, Luc De Vuyst & Stefan Weckx * .<br />

Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering<br />

Sciences, Vrije Universiteit Brussel, Brussels, Belgium. * stefan.weckx@vub.ac.be<br />

The genome of the lactic acid bacterium Lactobacillus fermentum IMDO 130101, a strain capable of dominating<br />

sourdough fermentation processes, was sequenced, annotated, and curated. Further, this genome sequence of 2.09 Mbp<br />

was compared to other complete genomes of different strains of L. fermentum to elucidate the potential of L. fermentum<br />

IMDO 130101 as a sourdough starter culture strain. As opposed to the other strains, L. fermentum IMDO 130101<br />

contained unique genes related to carbohydrate import and metabolism as well as a gene coding for a phenolic acid<br />

decarboxylase and a gene encoding a 4,6-α-glucanotransferase. The latter enzyme activity may result in the production<br />

of isomalto/malto-polysaccharides. All these features make L. fermentum IMDO 130101 attractive for further study as a<br />

candidate sourdough starter culture strain.<br />

INTRODUCTION<br />

Lactobacillus fermentum is a heterofermentative lactic<br />

acid bacterium often found in fermented food products,<br />

including sourdough. Strain L. fermentum IMDO 130101,<br />

a dominant sourdough strain originally isolated from a rye<br />

sourdough (Weckx et al., 2010) and extensively described<br />

previously (e.g., Vrancken et al., 2008), was sequenced<br />

and compared to other L. fermentum strains with<br />

completed genomes to elucidate unique adaptations of the<br />

strain studied to the sourdough environment.<br />

METHODS<br />

High-quality genomic DNA was used to construct an 8-kb<br />

paired-end library for 454 pyrosequencing. The<br />

pyrosequencing reads were assembled using the GS De<br />

Novo Assembler version 2.5.3 with default parameters.<br />

Primers for gap closure were designed using CONSED<br />

23.0, the gaps amplified with polymerase chain reaction<br />

(PCR) assays and the amplicons sequenced using Sanger<br />

sequencing. The sequences were imported into CONSED<br />

23.0 and used to close the gaps. The genome was<br />

annotated using the automated genome annotation<br />

platform GenDB v2.2 (Meyer et al., 2003), followed by<br />

extensive manual curation. Publicly available genome<br />

sequences of L. fermentum F-6 (Sun et al., 2015), L.<br />

fermentum IFO 3956 (Morita et al., 2008), and L.<br />

fermentum CECT 5716 (Jiménez et al., 2010) were<br />

acquired from RefSeq. Whole-genome comparisons with<br />

the other three L. fermentum strains and ortholog findings<br />

were performed using the progressiveMauve algorithm<br />

(Darling et al., 2010).<br />

RESULTS & DISCUSSION<br />

The 2.09 Mbp genome was assembled from 403,466 reads,<br />

resulting in 74 contigs. No plasmids were found. The<br />

comparative genome analysis with other strains showed<br />

that 477 coding sequences were unique to L. fermentum<br />

IMDO 130101 (Figure 1).<br />

L. fermentum IMDO 130101 was predicted to be able to<br />

import and utilise glucose, fructose, xylose, mannose, N-<br />

acetylglucosamine, maltose, sucrose, lactose and gluconic<br />

acid via the heterolactic fermentation pathway. Also, the<br />

ability to degrade raffinose and arabinose was predicted.<br />

Consumption of glucose, fructose, maltose and sucrose<br />

was shown in previous research, although growth with<br />

sucrose as the sole energy source was impaired (Vrancken<br />

et al., 2008). The strain possibly imports isomaltose and<br />

maltodextrins, hence elaborating glucose subunits. The<br />

α-glucosidase-encoding gene was not found in the<br />

genomes of the other three strains considered, and neither<br />

were the putative maltodextrin import-related genes, the<br />

trehalose-6-phosphate phosphorylase-encoding gene and a<br />

putative -glucanase-encoding gene, which all may be<br />

adaptations of L. fermentum IMDO 130101 to the<br />

sourdough environment. The presence of the arginine<br />

deiminase gene cluster was confirmed. Also, L. fermentum<br />

IMDO 130101 contained a gene for a phenolic acid<br />

decarboxylase, which may have an impact on sourdough<br />

aroma. Further, a 4,6-α-glucanotransferase-encoding gene<br />

was present in strain IMDO 130101 solely, which could<br />

result in the production of isomalto/malto-polysaccharides,<br />

soluble dietary fibres with prebiotic properties.<br />

Overall, comparative genome analysis revealed metabolic<br />

traits that are of interest for the use of L. fermentum IMDO<br />

130101 as a functional starter culture for sourdough<br />

fermentation processes.<br />

FIGURE 1. Venn diagram of shared coding sequences between four<br />

different strains of Lactobacillus fermentum.<br />

REFERENCES<br />

Darling et al. PLoS ONE 5, e11147 (2010).<br />

Jiménez E. et al. J. Bacteriol. 192, 4800-4800 (2010).<br />

Meyer et al. Nucleic Acids Res. 31, 2187-2195 (2003).<br />

Morita et al. DNA Res. 15, 151-161 (2008).<br />

Sun et al. J. Biotechnol. 194, 110-111 (2015).<br />

Vrancken et al. Int. J. Food Microbiol. 128, 58-66 (2008).<br />

Weckx et al. Food Microbiol. 27, 1000-1008 (2010).<br />


P65. ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN<br />

SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC<br />

PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST<br />

Ben Verhees 1* , Kris Laukens 1,2 , Stefan Naulaerts 1,2 , Pieter Meysman 1,2 & Xaveer Van Ostade 3 .<br />

Biomedical informatics research center Antwerpen (biomina) 1 ; Advanced Database Research and Modeling (ADReM),<br />

University of Antwerp 2 ; Laboratory of Protein Science, Proteomics and Epigenetic Signalling (PPES) and Centre for<br />

Proteomics and Mass spectrometry (CFP-CeProMa), University of Antwerp 3 . * ben.verhees@student.uantwerpen.be<br />

Ebola virus is a zoonotic pathogen, but its reservoir host has not yet been identified. Recent findings suggest, however, that Mops<br />

condylurus, an insect-eating bat, is a likely candidate. Studying the interactions between Ebola virus and its reservoir<br />

host could prove highly informative, as reservoir hosts of zoonotic pathogens often appear to tolerate infections with<br />

these pathogens with little evidence of disease. In this study, a protein-protein interaction network (PPIN) was created<br />

between Ebola virus and human proteins. Orthology data in Myotis lucifugus – a model organism often used for bat<br />

studies – was employed to determine which of the human first neighbors of Ebola virus proteins do not possess an<br />

orthologue in M. lucifugus. Subsequent GO enrichment analysis suggested that these proteins are mostly involved in<br />

epigenetic processes, and thus we hypothesize that Ebola virus displays reduced interference with epigenetic processes in<br />

its reservoir host.<br />

INTRODUCTION<br />

The idea that bats serve as reservoirs for a wide range of<br />

zoonotic pathogens has been the topic of much recent<br />

research. Previous studies on human and bat orthology in<br />

this context have mainly focused on specific genes that are<br />

important in fighting off viral infection.<br />

Our study differs, however, in that it focuses on<br />

proteins the Ebola virus immediately interacts with in<br />

humans, and the existence of orthologues of these proteins<br />

in bats.<br />

METHODS<br />

Construction of an Ebola virus – human PPIN<br />

An Ebola virus – human PPIN was constructed from in<br />

silico data. All network analysis was done using<br />

Cytoscape v. 3.2.1.<br />

Orthology analysis<br />

Identification of orthologues was performed using the<br />

OMA orthology database, release: September <strong>2015</strong>.<br />

Statistics<br />

For the statistical analysis, the hypergeometric test was<br />

performed.<br />
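As an illustration of the test named above, the following minimal sketch computes a one-sided hypergeometric p-value with the Python standard library. The function name `hypergeom_sf` and all counts are hypothetical and are not the data of this study.<br />

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """Return P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only (not the study's data):
# 20,000 background human proteins, 15,000 of them with a bat orthologue;
# 100 first neighbours of viral proteins, 85 of them with an orthologue.
p_value = hypergeom_sf(85, 20000, 15000, 100)
print(f"p = {p_value:.4f}")
```

A small p-value here would indicate that proteins with an orthologue are overrepresented among the first neighbours relative to the background.<br />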

GO enrichment<br />

GO enrichment analysis was performed using ClueGO v.<br />

1.2.7, a Cytoscape plug-in. Default settings were used, and<br />

all ontologies/pathways were examined.<br />

RESULTS & DISCUSSION<br />

Myotis lucifugus as a model for Mops condylurus<br />

In this study, Myotis lucifugus was used as a model to<br />

study interactions between Ebola virus and Mops<br />

condylurus, its suspected reservoir.<br />

Ebola virus – human PPIN and orthology in M.<br />

lucifugus<br />

An Ebola virus – human PPIN was created, and human<br />

first neighbors of Ebola virus proteins were examined for<br />

existence of orthologues in M. lucifugus. Statistical<br />

analysis revealed an overrepresentation of human<br />

proteins with orthologues in M. lucifugus (p=0.019).<br />

GO enrichment suggests reduced interference of Ebola<br />

virus with epigenetic processes in its reservoir host<br />

Gene ontology (GO) enrichment analysis was performed<br />

on the human first neighbors of Ebola virus proteins that<br />

do not possess an orthologue in M. lucifugus. The analysis<br />

revealed that these proteins are mostly involved in<br />

epigenetic processes (Figure 1).<br />

FIGURE 1. GO enrichment analysis of human first neighbors of Ebola<br />

virus proteins which do not possess an orthologue in M. lucifugus.<br />

Discussion<br />

Using this novel approach, we have shown, first, that Ebola<br />

virus is likely able to interfere with epigenetic processes in<br />

humans and, second, that its ability to interfere with<br />

host epigenetics is likely reduced or altered in its reservoir<br />

host.<br />

While the idea that viruses are able to interact with host<br />

epigenetic mechanisms is fairly recent, over the past few<br />

years significant research has been done exploring this<br />

topic. In a comprehensive review, Li et al. (2014) describe<br />

how specific viral proteins are able to modulate the<br />

activity of chromatin modification complexes, e.g. HATs,<br />

HDACs, HMTs, and HDMTs, and even directly bind<br />

histone proteins. These findings lend support to our<br />

results, which suggest that Ebola virus is also able<br />

to interact with HDACs, HMTs and several histone<br />

proteins in humans.<br />

REFERENCES<br />

Li S et al. Rev Med Virol 24, 223-241 (2014).<br />


P66. PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING<br />

Kenneth Verheggen 1,2,3* , Harald Barsnes 4,5 , Lennart Martens 1,2,3 & Marc Vaudel 4 .<br />

Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent,<br />

Belgium 2 ; Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 ; Proteomics Unit, Department of<br />

Biomedicine, University of Bergen, Norway 4 ; KG Jebsen Center for Diabetes Research, Department of Clinical Science,<br />

University of Bergen, Norway 5 . *kenneth.verheggen@vib-ugent.be<br />

The use of proteomics bioinformatics substantially contributes to an improved understanding of proteomes, but this novel<br />

and in-depth knowledge comes at the cost of increased computational complexity. Parallelization across multiple<br />

computers, a strategy termed distributed computing, can be used to handle this increased complexity. However, setting<br />

up and maintaining a distributed computing infrastructure requires resources and skills that are not readily available to<br />

most research groups.<br />

Here, we propose a free and open source framework named Pladipus that greatly facilitates the establishment of<br />

distributed computing networks for proteomics bioinformatics tools.<br />

INTRODUCTION<br />

Various modern day bioinformatics-related fields have a<br />

growing focus on large scale data processing. This<br />

inevitably leads to an increased complexity, as is<br />

illustrated by the recent efforts to elaborate a<br />

comprehensive MS-based human proteome<br />

characterization (Kim et al., 2014; Wilhelm et al., 2014).<br />

Such high-throughput, complex studies are becoming<br />

increasingly popular, but require high performance<br />

computational setups in order to be analyzed swiftly.<br />

METHODS<br />

Here, we present a generic platform for distributed<br />

proteomics software, called Pladipus. It provides an<br />

end-user-oriented solution to distribute<br />

bioinformatics tasks over a network of computers,<br />

managed through an intuitive graphical user interface<br />

(GUI).<br />

Pladipus comes with several modules that work out<br />

of the box. They include SearchGUI (Vaudel et al.,<br />

2011), PeptideShaker (Vaudel et al., <strong>2015</strong>),<br />

DeNovoGUI (Muth et al., 2014), MsConvert (part of<br />

Proteowizard (Kessner et al., 2008)) and three<br />

common forms of the BLAST (Altschul et al., 1990)<br />

algorithm (blastn, blastp and blastx). These modules can be<br />

linked together into pipelines tailored to<br />

specific needs, optionally including custom in-house<br />

algorithms, and the whole can be executed on an inexpensive,<br />

scalable cluster infrastructure without additional cost<br />

or expert maintenance. It can even be set<br />

up to allow existing (idle) hardware to hook into the<br />

network and participate in the processing.<br />
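The idea of distributing independent jobs over a pool of workers can be sketched as follows. This is not the Pladipus API: a local thread pool stands in for networked workers, and `run_search`, `distribute` and the file names are hypothetical.<br />

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(task):
    """Hypothetical stand-in for one job, e.g. one search engine run on one file."""
    spectrum_file, engine = task
    return f"{engine}:{spectrum_file}:done"


def distribute(tasks, n_workers):
    """Run independent tasks on a pool of workers and keep the input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_search, tasks))


# Hypothetical job list: four input files, each searched with three engines.
tasks = [
    (f"run_{i:02d}.mgf", engine)
    for i in range(4)
    for engine in ("X!Tandem", "Tide", "MS-GF+")
]
results = distribute(tasks, n_workers=3)
print(len(results), results[0])
```

Because the jobs are independent, the result does not depend on the number of workers; only the wall time does.<br />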

RESULTS & DISCUSSION<br />

To quantify the benefits of using a distributed<br />

computing framework, 52 CPTAC experiments<br />

(LTQ-Study6 : Orbitrap@86) (Paulovich et al., 2010) were<br />

searched three times against a protein sequence database<br />

(UniProtKB/SwissProt (release-<strong>2015</strong>_05)) on Pladipus<br />

networks of various sizes. A selection of three search engines<br />

was applied: X!Tandem, Tide and MS-GF+. As expected<br />

for a distributed system, the wall time was highly reproducible<br />

and decreased nearly exponentially with the number of<br />

workers.<br />

FIGURE 1. Benchmarking of a Pladipus network (16 GB RAM, 12 cores, 250 GB disk space, Ubuntu Precise).<br />

Pladipus is freely available as open source under the permissive Apache 2 license. Documentation, including<br />

example files, an installer and a video tutorial, can be found at<br />

https://compomics.github.io/projects/pladipus.html.<br />

REFERENCES<br />

Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol.<br />

Biol., 215, 403–10.<br />

Kessner,D. et al. (2008) ProteoWizard: open source software for rapid<br />

proteomics tools development. Bioinformatics, 24, 2534–6.<br />

Kim,M.-S. et al. (2014) A draft map of the human proteome. Nature,<br />

509, 575–81.<br />

Muth,T. et al. (2014) DeNovoGUI: an open source graphical user<br />

interface for de novo sequencing of tandem mass spectra. J.<br />

Proteome Res., 13, 1143–6.<br />

Paulovich,A.G. et al. (2010) Interlaboratory study characterizing a yeast<br />

performance standard for benchmarking LC-MS platform<br />

performance. Mol. Cell. Proteomics, 9, 242–54.<br />

Vaudel,M. et al. (<strong>2015</strong>) PeptideShaker enables reanalysis of MS-derived<br />

proteomics data sets. Nat. Biotechnol., 33, 22–24.<br />

Vaudel,M. et al. (2011) SearchGUI: An open-source graphical user<br />

interface for simultaneous OMSSA and X!Tandem searches.<br />

Proteomics, 11, 996–9.<br />

Wilhelm,M. et al. (2014) Mass-spectrometry-based draft of the human<br />

proteome. Nature, 509, 582–7.<br />


P67. IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING<br />

A NETWORK-BASED APPROACH<br />

Bram Weytjens 1,2,3,4 , Dries De Maeyer 1,2,3,4 & Kathleen Marchal 1,2,4 *.<br />

Dept. of Information Technology (INTEC, iMINDS), UGent, Ghent, 9052, Belgium 1 ; Dept. of Plant Biotechnology and<br />

Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium 2 ; Dept. of Microbial and Molecular<br />

Systems, KU Leuven, Kasteelpark Arenberg 20, B-3001 Leuven, Belgium 3 , Bioinformatics Institute Ghent, Ghent<br />

University, Ghent B-9000, Belgium 4 . * kathleen.marchal@intec.ugent.be<br />

Antibiotic resistance is a growing public health concern as the effectiveness of multiple types of antibiotics is decreasing.<br />

To prevent and combat the further spread of antibiotic resistance in bacteria there is the need to better understand the<br />

relationship between genetic alterations and the (molecular) phenotype of antibiotic resistant strains. As several (-omics)<br />

experiments regarding the attainment of antibiotic resistance by bacteria have already been performed and are publicly<br />

available, we re-analysed a laboratory evolution experiment by Suzuki et al. (2014) in order to demonstrate the<br />

power of a network-based approach in identifying mutations and molecular pathways driving the resistance phenotype.<br />

INTRODUCTION<br />

While network-based approaches are no longer new in<br />

high-throughput (-omics) analysis, they are not yet widely<br />

used in standard analysis pipelines. We analysed a dataset<br />

consisting of multiple E. coli MDS42 strains, each<br />

independently evolved in the presence of a specific<br />

antibiotic (10 in total). By adapting PheNetic (De Maeyer et al.,<br />

2013), an algorithm which connects genetic alterations to<br />

their differentially expressed genes over a genome-wide<br />

interaction network, we were able to automatically<br />

identify mutations in genes which are known to induce<br />

antibiotic resistance.<br />

METHODS<br />

For every strain, whole-genome sequencing data and<br />

microarray data (eQTL data) were available. By finding the<br />

most probable connections between the mutations of every<br />

strain and the strain’s respective expression data over a<br />

biological network, PheNetic was able to not only uncover<br />

potential driver genes and molecular pathways for the<br />

resistance phenotype but also to prioritize the identified<br />

mutations based on the likelihood that they are truly<br />

driving the resistance phenotype. Such a network-based<br />

approach has the following advantages:<br />

• Integration of interactomics (network), genomics<br />

and transcriptomics data<br />

• Multiple related datasets can be analyzed together<br />
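The notion of a "most probable connection" between a mutated gene and a differentially expressed gene can be illustrated with a simple shortest-path search on edge probabilities. This is only a sketch of the general idea and not the PheNetic algorithm itself; the node names and probabilities are hypothetical.<br />

```python
import heapq
from math import exp, log


def most_probable_path(edges, source, target):
    """Return (probability, path) of the most probable directed path.

    edges maps node -> list of (neighbour, edge_probability). Maximising a
    product of probabilities equals minimising the sum of -log(probability),
    so Dijkstra's algorithm applies.
    """
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return exp(-cost), path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, prob in edges.get(node, []):
            if neighbour not in settled and prob > 0:
                heapq.heappush(
                    queue, (cost - log(prob), neighbour, path + [neighbour])
                )
    return 0.0, []


# Hypothetical interaction network with edge probabilities.
network = {
    "mutated_gene": [("regulator", 0.9), ("enzyme", 0.5)],
    "regulator": [("expressed_gene", 0.8)],
    "enzyme": [("expressed_gene", 0.9)],
}
prob, path = most_probable_path(network, "mutated_gene", "expressed_gene")
print(round(prob, 2), path)
```

In this toy network the route via the regulator (0.9 × 0.8) is preferred over the route via the enzyme (0.5 × 0.9).<br />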

FIGURE 1: Part of Amikacin resistance network.<br />

RESULTS & DISCUSSION<br />

In the case of amikacin resistance (Figure 1), we were able<br />

to uncover, in two strains out of four, a gain-of-function mutation in cpxA, a gene of<br />

a two-component signal transduction mechanism that is<br />

known to be involved in amikacin resistance.<br />

For the other two strains, deleterious<br />

cyoB mutations were found, which are known to lead to<br />

intracellular oxidized copper and eventually multidrug<br />

resistance. These genes were furthermore ranked highest<br />

by PheNetic.<br />

REFERENCES<br />

Suzuki S et al. Nat Commun 5, 5792 (2014).<br />

De Maeyer D et al. Mol Biosyst 9: 1594-1603 (2013).<br />


P68. DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT<br />

LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING<br />

Sander Wuyts 1,2* , Eline Oerlemans 1 , Ilke De Boeck 1 , Wenke Smets 1 , Dieter Vandenheuvel, Ingmar Claes 1 & Sarah<br />

Lebeer 1 .<br />

Laboratory of Applied Microbiology and Biotechnology, University of Antwerp 1 ; Research Group of Industrial<br />

Microbiology and Food Biotechnology (IMDO), Vrije Universiteit Brussel 2 * Sander.Wuyts@UAntwerp.be<br />

Next-Generation Sequencing (NGS) has revolutionized the field of microbial community analysis. Thanks to these high-throughput<br />

DNA technologies, microbiologists are now able to perform more in-depth analyses of various microbial<br />

communities than with culture-dependent methods. In our lab, we have successfully deployed 16S rDNA amplicon<br />

sequencing using MiSeq-sequencing (Illumina). A bioinformatic pipeline has been built based on mothur (Schloss et al.<br />

2009), UPARSE (Edgar 2013) and Phyloseq (McMurdie & Holmes 2013) to analyse different microbial community<br />

datasets. The focus is on functional analysis of lactobacilli and other lactic acid bacteria in different ecological niches:<br />

ranging from the human upper respiratory tract to naturally fermented plant-based foods.<br />

INTRODUCTION<br />

16S metagenomics is a technique that makes use of the<br />

highly conserved bacterial 16S rRNA gene. This gene<br />

codes for an RNA-molecule which is a component of the<br />

30S small subunit of bacterial ribosomes. It contains nine<br />

hypervariable regions, flanked by conserved regions for<br />

which primer pairs for PCR/sequencing can be designed.<br />

Due to these characteristics and due to the slow rate of<br />

evolution, this gene has been widely used in bacterial<br />

phylogeny and taxonomy. NGS technologies like Illumina<br />

MiSeq have made it possible to study all the different<br />

16S rRNA gene copies from an environmental sample and<br />

use these to identify the bacteria present in the sample. But<br />

the use of these high-throughput technologies comes at<br />

a cost: the need for a more in-depth bioinformatic analysis.<br />

METHODS<br />

Wetlab:<br />

DNA is extracted using sample-dependent extraction<br />

protocols. A barcoded PCR is performed on the V4 region<br />

of the 16S rRNA gene as described in Kozich et al. 2013.<br />

For each sample a different set of primers is used; each<br />

primer set contains a unique combination of barcodes. The<br />

PCR products are cleaned using AMPure XP (Agencourt)<br />

bead purification and quantified using Qubit (Life<br />

Technologies). All samples are pooled equimolarly into one<br />

single library. A negative control (= “empty” DNA extraction)<br />

and a positive control (= “Mock” communities<br />

HM-276D and HM-782D) are always processed together<br />

with the samples. The library is sequenced using a dual<br />

index sequencing strategy (Kozich et al. 2013) and a<br />

2 x 250 bp kit on the Illumina MiSeq.<br />

Bio-informatic analysis:<br />

Samples are demultiplexed on the MiSeq itself, allowing 1<br />

bp difference in the barcodes. The general quality of the<br />

reads is checked using FastQC (Babraham Bioinformatics).<br />

The paired-end reads are merged using mothur’s<br />

make.contigs command. Quality control in mothur is<br />

performed using screen.seqs, alignment to the SILVA<br />

database and removal of sequences that do not map to the<br />

database, removal of chimeras using chimera.uchime and<br />

removal of sequences that classify to the lineages<br />

“Mitochondria” and “Chloroplast”.<br />
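Demultiplexing with a tolerance of one mismatch is carried out by the instrument software, but its logic can be sketched as follows. The barcodes, sample names and function names are hypothetical, and rejecting ambiguous reads is an assumption of this sketch.<br />

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_barcode, sample_barcodes, max_mismatches=1):
    """Return the sample whose barcode is within max_mismatches, or None.

    Ambiguous reads (more than one sample within the limit) are rejected.
    """
    hits = [
        sample
        for sample, barcode in sample_barcodes.items()
        if len(barcode) == len(read_barcode)
        and hamming(barcode, read_barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes, for illustration only.
barcodes = {"sample_A": "ACGTACGT", "sample_B": "TGCATGCA"}
print(assign_sample("ACGTACGA", barcodes))  # one mismatch to sample_A
print(assign_sample("GGGGGGGG", barcodes))  # no sample within one mismatch
```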

The distances between sequences are calculated using<br />

mothur’s dist.seqs command, and the sequences are clustered at 97 %<br />

sequence similarity using mothur’s cluster command.<br />

Alternatively the UPARSE clustering algorithm can be<br />

used for these last two steps. Sequences are classified<br />

using the RDP database and the complete dataset is<br />

exported as a .biom file.<br />

Visualisation and statistical analysis are performed using<br />

the R-package Phyloseq. This analysis depends on the<br />

experimental design but generally consists of a<br />

normalisation step (either using rarefying, proportions or a<br />

statistical mixture model (McMurdie & Holmes 2014)), a<br />

calculation of alpha diversity measurements and a<br />

calculation and visualisation of beta diversity.<br />
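Two of the steps above, normalisation to proportions and an alpha diversity measure, can be sketched in a few lines. This is a plain Python illustration and not the Phyloseq API; the Shannon index is chosen as one common alpha diversity measure, and the OTU counts are hypothetical.<br />

```python
from math import log


def to_proportions(counts):
    """Convert raw OTU counts of one sample to relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over OTUs with non-zero counts."""
    return -sum(p * log(p) for p in to_proportions(counts) if p > 0)


# Hypothetical OTU counts per sample.
otu_counts = {"sample_1": [50, 50, 0, 0], "sample_2": [25, 25, 25, 25]}
for sample, counts in otu_counts.items():
    print(sample, round(shannon(counts), 3))
```

The second sample spreads its reads evenly over four OTUs and therefore has the higher diversity (ln 4 versus ln 2).<br />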

RESULTS & DISCUSSION<br />

The method described above was optimised and shown to<br />

work. We successfully used this technique to obtain<br />

better insights into the role of lactobacilli in different<br />

ecological niches, e.g. in the murine gastrointestinal tract,<br />

vegetable fermentations and the human upper respiratory<br />

tract.<br />

REFERENCES<br />

Edgar, R.C., 2013. UPARSE: highly accurate OTU sequences from<br />

microbial amplicon reads. Nature methods, 10(10), pp.996–8.<br />

Kozich, J.J. et al., 2013. Development of a dual-index sequencing<br />

strategy and curation pipeline for analyzing amplicon sequence<br />

data on the MiSeq Illumina sequencing platform. Applied and<br />

environmental microbiology, 79(17), pp.5112–20.<br />

McMurdie, P.J. & Holmes, S., 2013. Phyloseq: An R Package for<br />

Reproducible Interactive Analysis and Graphics of Microbiome<br />

Census Data. PLoS ONE, 8(4).<br />

McMurdie, P.J. & Holmes, S., 2014. Waste not, want not: why rarefying<br />

microbiome data is inadmissible. PLoS computational biology,<br />

10(4), p.e1003531.<br />

Schloss, P.D. et al., 2009. Introducing mothur: Open-source, platform-independent,<br />

community-supported software for describing and<br />

comparing microbial communities. Applied and Environmental<br />

Microbiology, 75(23), pp.7537–7541.<br />


P69. HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES<br />

USING MATRIX FACTORIZATION<br />

Pooya Zakeri 1,2,* , Jaak Simm 1,2 , Adam Arany 1,2 , Sarah Elshal 1,2 & Yves Moreau 1,2 .<br />

Department of Electrical Engineering, STADIUS, KU Leuven, Leuven 3001, Belgium 1 ; iMinds Medical IT, Leuven 3001,<br />

Belgium 2 . * pooya.zakeri@esat.kuleuven.be<br />

In the last decade, the identification of phenotype-associated genes has received growing attention, yet it remains one of the most<br />

challenging problems in biology. In particular, determining disease-associated genes is a demanding process that plays a<br />

crucial role in understanding the relationship between disease phenotypes and genes. Typical approaches for gene<br />

prioritization often model each disease individually, which fails to capture the common patterns in the data. This<br />

motivates us to formulate the hunt for phenotype-associated genes as a factorization of an incompletely filled<br />

gene-phenotype matrix, where the objective is to predict unknown values. Experimental results on the updated version of the<br />

Endeavour benchmark demonstrate that our proposed model can effectively improve on the accuracy of the state-of-the-art<br />

gene prioritization model.<br />

INTRODUCTION<br />

In biology, there is often a need to identify the most<br />

promising genes among a large list of candidate genes for<br />

further investigation. While a single data source might not<br />

be effective enough, fusing several complementary<br />

genomic data sources results in more accurate predictions.<br />

Moreover, fusing the phenotypic similarity of diseases and<br />

sharing information about known disease genes across<br />

both diseases and genes through a multi-task approach<br />

enables us to handle gene prioritization for diseases with<br />

very few known genes and for genes with limited available<br />

information. Typical strategies for hunting phenotype-associated<br />

genes often model each phenotype<br />

individually [1, 2, 3, 4], which fails to capture the common<br />

patterns in the data. This motivates us to formulate the<br />

task of hunting phenotype-associated genes as a factorization<br />

of an incompletely filled gene-phenotype matrix, where the<br />

objective is to predict unknown values.<br />

METHODS<br />

We consider the OMIM database, a database of<br />

associations specific to human disease phenotypes. OMIM focuses on<br />

the relationship between human genotypes and associated<br />

diseases. The OMIM database can be seen as an incomplete<br />

matrix in which each row is a gene and each column is a<br />

phenotype (disease).<br />

The idea behind factorizing the N×M OMIM matrix is<br />

to represent each row and each column by a latent vector<br />

of size D. The OMIM matrix can then be modeled by the<br />

product of an N×D gene matrix G and the transpose of an M×D disease<br />

matrix P.<br />
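Once latent vectors are available, an unknown entry of the gene-phenotype matrix is predicted as the dot product of the corresponding gene and phenotype vectors. The sketch below shows only this prediction and ranking step; the latent vectors are hypothetical hand-set values with D = 2, whereas in practice they are inferred from the data.<br />

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_scores(gene_vectors, phenotype_vector):
    """Score every gene for one phenotype as the dot product of latent vectors."""
    return {gene: dot(vec, phenotype_vector) for gene, vec in gene_vectors.items()}


def rank_candidates(gene_vectors, phenotype_vector, known_genes):
    """Rank genes without a known association by decreasing predicted score."""
    scores = predict_scores(gene_vectors, phenotype_vector)
    candidates = [g for g in scores if g not in known_genes]
    return sorted(candidates, key=lambda g: scores[g], reverse=True)


# Hypothetical latent vectors: one row of G per gene, one vector p for a phenotype.
G = {
    "gene_a": [1.0, 0.0],
    "gene_b": [0.8, 0.2],
    "gene_c": [0.0, 1.0],
    "gene_d": [0.5, 0.5],
}
p = [1.0, 0.1]
print(rank_candidates(G, p, known_genes={"gene_a"}))
```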

Bayesian probabilistic matrix factorization (BPMF) [5] is a well-known<br />

method to fill such an incomplete matrix. However, BPMF uses<br />

no side information, which results in an inaccurate gene-phenotype matrix<br />

completion.<br />

We propose an extended version of BPMF that is able<br />

to work with multiple sources of side information for<br />

completing the gene-phenotype matrix [6], which makes it possible to<br />

rank genes and phenotypes that lie outside the gene-phenotype matrix. In our<br />

proposed framework we are also able to integrate both<br />

genomic data sources and phenotypic information,<br />

whereas earlier approaches for hunting phenotype-associated<br />

genes are limited to fusing genomic<br />

information. This modification is achieved by adding genomic<br />

and phenotypic features to the corresponding latent<br />

variables [6]. In this study, we consider several genomic<br />

data sources, including annotation-based data sources such<br />

as UniProt annotation and literature-based data sources on<br />

each gene, as well as literature-based phenotypic<br />

information on each disease, as in [1, 4, 9]. The<br />

framework of our Bayesian data fusion model for gene<br />

prioritization is illustrated in Figure 1.<br />

FIGURE 1. The framework of our Bayesian data fusion model for gene<br />

prioritization.<br />

RESULTS & DISCUSSION<br />

We report the average true positive rate (TPR) when considering the<br />

top 1%, 5%, 10%, and 30% of the ranked genes.<br />

Experimental results on the updated version of the Endeavour<br />

[3] benchmark demonstrate that our proposed model can<br />

effectively improve on the accuracy of the state-of-the-art<br />

gene prioritization model.<br />
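A TPR at the top x% of a ranking can be computed as sketched below. The definition used here (the fraction of held-out true genes ranked within the top x% of candidates) is an assumption about the evaluation protocol, and the ranks are hypothetical.<br />

```python
def tpr_at_top(ranks, n_candidates, percent):
    """Fraction of held-out true genes ranked within the top `percent` of candidates.

    ranks holds the 1-based rank of the held-out gene in each validation run;
    percent is an integer percentage.
    """
    cutoff = -(-percent * n_candidates // 100)  # ceiling division on integers
    return sum(rank <= cutoff for rank in ranks) / len(ranks)


# Hypothetical ranks from five validation runs over 100 candidate genes.
ranks = [1, 4, 9, 27, 60]
for percent in (1, 5, 10, 30):
    print(f"top {percent}%: TPR = {tpr_at_top(ranks, 100, percent):.1f}")
```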

REFERENCES<br />

[1] Aerts S, et al. Nat Biotech, 24(5), 537–544, (2006).<br />

[2] De Bie T, Tranchevent LC, van Oeffelen LMM, Moreau Y.<br />

Bioinformatics, 23(13):i125-i132, (2007).<br />

[3] Tranchevent LC, et al. NAR, (35) W377-W384, (2008).<br />

[4] ElShal S, et al., Davis J, Moreau Y. NAR, (<strong>2015</strong>).<br />

[5] Salakhutdinov R, Mnih A. 25th ICML, 880–887. ACM, (2008).<br />

[6] Simm J, et al. arXiv:1509.04610 [stat.ML], (2106).<br />


P70. THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS<br />

DISTRIBUTION<br />

A. Zouaoui 1 , M. Kahli 2 , E. Besnard 3 , R. Desprat 1 , N. Kirsten 4 , P. Ben-sadoun 1 & J.M. Lemaitre 1 .<br />

Institute for Regenerative Medicine and Biotherapy, France 1 ; Institut de Biologie de l’École Normale Supérieure (ENS),<br />

France 2 ; The Gladstone Institutes, University of California San Francisco (UCSF), United States 3 ; Helmholtz Zentrum<br />

München, Research Unit Gene Vectors, Munich, Germany 4 .<br />

Proliferating cells can undergo an irreversible arrest of the cell<br />

cycle, called cellular senescence, which can induce<br />

the development of cancer and ageing. Senescence is<br />

characterized by the formation of Senescence-Associated<br />

Heterochromatin Foci (SAHF) and a decline in DNA<br />

replication. When overexpressed, High-Mobility Group A (HMGA) proteins promote<br />

SAHF formation, induce proliferative arrest and stabilize<br />

senescence.<br />

In a cell, DNA replication is regulated at several<br />

genomic sites called replication origins (« Oris »). A pre-replication<br />

protein complex is required for DNA<br />

replication to occur. In the pre-replication complex, the<br />

ORC1 protein is involved in recognition of the origin of<br />

replication. DNA autoradiography of eukaryotic cells<br />

showed that human replication origins are<br />

bidirectional and spaced at 20-400 kb intervals (Huberman<br />

and Riggs, 1968). At each origin, replication forks are<br />

formed and new short nascent strands are synthesized. A<br />

popular method to map replication origins is the<br />

purification of Short Nascent Strands (SNS). Several<br />

laboratories have identified up to 50 000 origins using<br />

microarray and sequencing techniques. Our laboratory has<br />

developed an origin mapping method and applied it to four cell<br />

types: IMR90, H9, iPSC and HeLa (Besnard et al., 2012).<br />

The Short Nascent Strands were isolated, sequenced and<br />

analyzed. 250 000 origin peaks were identified with a<br />

peak detection tool named Sole-Search (Blahnik et al.,<br />

2010).<br />

The objective is to find the most sensitive method to<br />

analyze the origin distribution in proliferative and<br />

senescent cells, in order to determine whether senescence has an impact on<br />

the origin distribution. The involvement of HMGA proteins<br />

in DNA replication is also investigated. Two new methods<br />

are in development to analyze replication origins with<br />

two more sensitive tools. In the first method, we search for<br />

origin peaks with the MACS2 tool (Zhang et al., 2008), which<br />

uses a new statistical model and algorithm. In the second method,<br />

origin enrichment is assessed with the HOMER tool (Heinz et<br />

al., 2010).<br />

Two methods are currently in development to identify<br />

replication origin sites by Illumina GAII sequencing of short<br />

nascent strands. Human SNS-seq reads of 36 bp were<br />

mapped to human genome build GRCh38 with the BWA tool<br />

(ref). Origin peaks were called with MACS2 and origin<br />

enrichment was assessed with HOMER. To compare the two methods,<br />

active origins in HeLa cells were detected with each<br />

method. The correlation between ORC1 peaks and the origins<br />

identified is calculated to choose the most sensitive<br />

method. The impact of pre-senescence is assessed by<br />

comparing the origin distributions observed in proliferative<br />

and senescent cells. Origin distributions are compared<br />

before and after induction of HMGA proteins to<br />

investigate the involvement of these proteins in DNA<br />

replication during senescence.<br />
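One simple way to compare two origin-calling methods against ORC1 peaks is the fraction of called origins that overlap an ORC1 peak. The sketch below uses this overlap fraction as a stand-in for the correlation mentioned above, which is an assumption; all coordinates and names are hypothetical.<br />

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(origins, orc1_peaks):
    """Fraction of origin peaks that overlap at least one ORC1 peak."""
    if not origins:
        return 0.0
    hits = sum(any(overlaps(o, p) for p in orc1_peaks) for o in origins)
    return hits / len(origins)


# Hypothetical peak coordinates, for illustration only.
orc1 = [("chr1", 100, 200), ("chr1", 900, 1000), ("chr2", 50, 150)]
method_a = [("chr1", 150, 250), ("chr1", 500, 600), ("chr2", 100, 120)]
method_b = [
    ("chr1", 150, 250),
    ("chr1", 950, 1050),
    ("chr2", 100, 120),
    ("chr2", 400, 500),
]
print(fraction_overlapping(method_a, orc1), fraction_overlapping(method_b, orc1))
```

The pairwise comparison is quadratic in the number of peaks; for genome-scale peak sets a sorted sweep or an interval tree would be preferable.<br />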

REFERENCES<br />

Besnard et al. Best practices for mapping replication origins in<br />

eukaryotic chromosomes. Current Protoc Cell Biol. 2014 Sep 2;<br />

64:22.18.1-22.18.13<br />

Besnard et al. Unraveling cell type-specific and reprogrammable human<br />

replication origin signatures associated with G-quadruplex consensus<br />

motifs. Nat Struct Mol Biol. 2012 Aug; 19, 837-44<br />

Blahnik KR, Dou L, O’Geen H, et al. Sole-Search: an integrated analysis<br />

program for peak detection and functional annotation using ChIP-seq<br />

data. Nucleic Acids Res. 2010; 38:e13<br />

Fu H et al. Mapping replication origin sequences in eukaryotic<br />

chromosomes. Curr Protoc Cell Biol. 2014 Dec 1; 65:22.20.1-<br />

22.20.17<br />

Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of<br />

Lineage-Determining Transcription Factors Prime cis-Regulatory<br />

Elements Required for Macrophage and B Cell Identities. Mol Cell<br />

2010 May 28; 38, 576-589<br />

Huberman JA et al. On the mechanism of DNA replication in<br />

mammalian chromosomes. J Mol Biol 1968 Mar 14; 32, 327-41<br />

Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol<br />

(2008) 9 pp. R13<br />

