BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: P<br />

Poster<br />

10th Benelux Bioinformatics Conference bbc 2015<br />

P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS<br />

Jaak Simm 1,2,3* , Adam Arany 1,2 , Sarah ElShal 1,2 & Yves Moreau 1,2 .<br />

Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data<br />

Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium 1 ; iMinds Medical IT, Kasteelpark<br />

Arenberg 10, box 2446, 3001 Leuven, Belgium 2 ; Institute of Gene Technology, Tallinn University of Technology,<br />

Akadeemia tee 15A, Estonia 3 . * jaak.simm@esat.kuleuven.be<br />

Scientific publications contain rich information about genetic disorders. Text mining these publications provides an<br />

automatic way to quickly query and summarize the information. We propose a supervised learning approach that takes<br />

advantage of the well-known unsupervised approach TF-IDF (term frequency–inverse document frequency) and<br />

integrates it with a supervised approach that uses the logistic loss as its error metric. Preliminary results on the OMIM dataset look<br />

promising.<br />

INTRODUCTION<br />

Scientific publications contain rich information about<br />

genetic disorders. Text mining these publications provides<br />

an automatic way to quickly query and summarize the<br />

information.<br />

Traditional approaches employ unsupervised text<br />

mining methods such as TF-IDF (term frequency–inverse<br />

document frequency) or Latent Dirichlet Allocation<br />

(LDA; Blei et al., 2003) to link terms to genes and<br />

diseases. Beegle (ElShal et<br />

al., 2015), a recent text mining tool for linking diseases and genes,<br />

takes this approach, using TF-IDF as its similarity metric.<br />

PROPOSED METHOD<br />

Our work proposes supervised learning of the<br />

importance of the textual terms, which can automatically<br />

filter out many terms that are unnecessary for the task at<br />

hand. We formulate it as the prediction of a supervised value<br />

y for each pair of gene g and disease d from their terms, where i<br />

is the index of a term,<br />

w_i is the weight of term i, and σ is the sigmoid<br />

function. The main idea is to learn the weight vector w that<br />

minimizes the difference between the known values y and the<br />

predictions. This minimization can be transformed into a<br />

logistic regression.<br />
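The learning step described above can be sketched as a small logistic regression. This is only an illustration under stated assumptions, not the authors' implementation: each gene-disease pair is assumed to be represented by one feature x_i per term (for example, the product of the gene's and the disease's TF-IDF scores for term i, which the text does not specify), the prediction is taken to be σ(Σ_i w_i x_i), and the weights are fitted by plain gradient descent on the mean logistic loss. The function names, the toy terms, and the hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Return sigma(sum_i w_i * x_i) for one pair with sparse features x."""
    return sigmoid(sum(w.get(i, 0.0) * v for i, v in x.items()))


def fit(pairs, n_epochs=200, lr=0.5):
    """Learn term weights by gradient descent on the mean logistic loss.

    pairs: list of (x, y), where x maps term -> feature value (assumed
    representation of a gene-disease pair) and y is 0 or 1.
    """
    w = {}
    for _ in range(n_epochs):
        grad = {}
        for x, y in pairs:
            # Derivative of the logistic loss with respect to the linear score.
            err = predict(w, x) - y
            for i, v in x.items():
                grad[i] = grad.get(i, 0.0) + err * v
        for i, g in grad.items():
            w[i] = w.get(i, 0.0) - lr * g / len(pairs)
    return w


# Toy data with hypothetical terms: "cardiomyopathy" only occurs in linked
# pairs, while "protein" mostly occurs in unlinked ones.
toy_pairs = [
    ({"cardiomyopathy": 1.0, "protein": 1.0}, 1),
    ({"protein": 1.0}, 0),
    ({"cardiomyopathy": 1.0}, 1),
    ({"protein": 1.0, "cell": 1.0}, 0),
]
weights = fit(toy_pairs)
```

On this toy data the informative term receives a positive weight and the uninformative one a negative weight, which is the filtering effect the method aims for.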

For the supervised values we use the OMIM database<br />

(Hamosh et al., 2005). More specifically, y is<br />

1 if the given gene and disease are linked and<br />

0 if there is no link. Intuitively, in this setup the text<br />

mining is transformed into a classification problem. We<br />

use a dataset of 330 OMIM terms and their linked genes, and<br />

randomly sample genes as negatives for each disease.<br />
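A minimal sketch of how such a labeled set could be assembled is shown below. It assumes that negatives are drawn uniformly from the genes not linked to the disease and that one negative is drawn per positive; the text states neither the sampling distribution nor the ratio. The function name and the toy identifiers are hypothetical.

```python
import random


def build_pairs(disease_to_genes, all_genes, n_neg_per_pos=1, seed=0):
    """Return (disease, gene, label) triples.

    Known links get label 1. For each disease, genes that are not linked to
    it are sampled at random and get label 0.
    """
    rng = random.Random(seed)
    pairs = []
    for disease, linked in sorted(disease_to_genes.items()):
        linked = set(linked)
        pairs.extend((disease, gene, 1) for gene in sorted(linked))
        # Candidate negatives: every gene without a known link to this disease.
        candidates = sorted(set(all_genes) - linked)
        k = min(len(candidates), n_neg_per_pos * len(linked))
        pairs.extend((disease, gene, 0) for gene in rng.sample(candidates, k))
    return pairs


# Toy example with made-up identifiers.
links = {"D1": ["G1", "G2"], "D2": ["G3"]}
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
labeled = build_pairs(links, genes)
```

Randomly sampled negatives may include true but not yet recorded links, so the labels 0 are best read as "no known link".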

For the textual terms we use MEDLINE abstracts as the<br />

source of biomedical text. We employ MetaMap (Aronson<br />

& Lang, 2010) to link terms with abstracts. We use GeneRIF<br />

to link genes with abstracts, and PubMed to link diseases<br />

with abstracts. We apply a TF-IDF transformation to score<br />

each term for a given disease or gene based on the abstracts<br />

linked to that entity. We only use the terms linked to<br />

abstracts that belong to genes. Hence our vocabulary<br />

consists of 66,883 terms.<br />
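The scoring step can be illustrated with a small TF-IDF sketch. The exact TF-IDF variant is not given in the text, so this assumes that each entity (gene or disease) is treated as one document made of the terms from its linked abstracts, with relative term frequency and idf = log(N / df) over N entities. The function name and the toy terms are hypothetical.

```python
import math
from collections import Counter


def tfidf_profiles(entity_terms):
    """Compute a TF-IDF score for every term of every entity.

    entity_terms: dict mapping an entity (gene or disease) to the list of
    terms found in the abstracts linked to it.
    """
    n_entities = len(entity_terms)
    # Document frequency: number of entities in which a term occurs.
    df = Counter()
    for terms in entity_terms.values():
        df.update(set(terms))
    profiles = {}
    for entity, terms in entity_terms.items():
        tf = Counter(terms)
        total = sum(tf.values())
        profiles[entity] = {
            term: (count / total) * math.log(n_entities / df[term])
            for term, count in tf.items()
        }
    return profiles


# Toy example with made-up entities and terms.
profiles = tfidf_profiles({
    "G1": ["kinase", "kinase", "protein"],
    "G2": ["protein", "channel"],
    "D1": ["kinase", "protein"],
})
```

With this variant a term that occurs for every entity, such as "protein" in the toy data, scores zero, while terms specific to few entities score higher.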

RESULTS & DISCUSSION<br />

The preliminary results show that supervised learning<br />

makes it possible to automatically pick up the keywords that are<br />

informative, improving the recall of the genes that are<br />

related to genetic disorders. We will present more detailed<br />

results in the poster.<br />

We are also investigating how to integrate the supervised<br />

approach into answering the online queries handled by<br />

Beegle.<br />

REFERENCES<br />

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet<br />

allocation. Journal of Machine Learning Research, 3, 993-1022.<br />

Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., & McKusick,<br />

V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a<br />

knowledgebase of human genes and genetic disorders. Nucleic Acids<br />

Research, 33(suppl 1), D514-D517.<br />

ElShal, S., Tranchevent, L. C., Sifrim, A., Ardeshirdavani, A., Davis, J., &<br />

Moreau, Y. (2015). Beegle: from literature mining to disease-gene<br />

discovery. Nucleic Acids Research, gkv905.<br />

Aronson, A. R., & Lang, F. M. (2010). An overview of MetaMap:<br />

historical perspective and recent advances. Journal of the American<br />

Medical Informatics Association, 17(3), 229-236.<br />
