BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference bbc 2015<br />
P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS<br />
Jaak Simm 1,2,3* , Adam Arany 1,2 , Sarah ElShal 1,2 & Yves Moreau 1,2 .<br />
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data<br />
Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium 1 ; iMinds Medical IT, Kasteelpark<br />
Arenberg 10, box 2446, 3001 Leuven, Belgium 2 ; Institute of Gene Technology, Tallinn University of Technology,<br />
Akadeemia tee 15A, Estonia 3 . * jaak.simm@esat.kuleuven.be<br />
Scientific publications contain rich information about genetic disorders. Text mining these publications provides an<br />
automatic way to quickly query and summarize the information. We propose a supervised learning approach that takes<br />
advantage of the well-known unsupervised approach TF-IDF (term frequency–inverse document frequency) and<br />
integrates it with a supervised approach that uses the logistic loss as its error metric. The preliminary results on the OMIM dataset look<br />
promising.<br />
INTRODUCTION<br />
Scientific publications contain rich information about<br />
genetic disorders. Text mining these publications provides<br />
an automatic way to quickly query and summarize the<br />
information.<br />
Traditional approaches employ unsupervised text<br />
mining methods such as TF-IDF (term frequency–inverse<br />
document frequency) or Latent Dirichlet Allocation<br />
(LDA; Blei et al., 2003) to link terms to genes and<br />
diseases. Beegle (ElShal et al., 2015), a recent text mining<br />
tool developed for linking diseases and genes, takes<br />
this approach, using TF-IDF as its similarity metric.<br />
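As a minimal sketch of the unsupervised baseline described above, the following ranks genes for a disease by the similarity of their TF-IDF term profiles. The use of cosine similarity, the gene and disease names, and all weights are assumptions made for illustration; the source does not state which similarity function Beegle applies to its TF-IDF scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF profiles (term -> weight); the values are made up.
disease = {"cystic": 0.8, "fibrosis": 0.9, "lung": 0.3}
genes = {
    "GENE_A": {"cystic": 0.7, "fibrosis": 0.8, "chloride": 0.5},
    "GENE_B": {"lung": 0.2, "kinase": 0.9},
}

# Genes sorted from most to least similar to the disease profile.
ranking = sorted(genes, key=lambda g: cosine(disease, genes[g]), reverse=True)
print(ranking)
```

In this unsupervised setting every shared term contributes according to its TF-IDF weight alone, whether or not it is informative for gene-disease links.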
PROPOSED METHOD<br />
Our work proposes supervised learning of the<br />
importance of the textual terms, which can automatically<br />
filter out many terms that are unnecessary for the task at<br />
hand. We formulate it as the prediction of supervised values<br />
y from the terms, for all genes g and all diseases d, where i<br />
is the index of the term:<br />
and w_i is the weight for term i and σ is the sigmoid<br />
function. The main idea is to learn the weight vector w that<br />
minimizes the difference between the known values y and the<br />
predictions. The minimization can be transformed into a<br />
logistic regression.<br />
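One form of the model that is consistent with the definitions in this paragraph is sketched below. It is stated as an assumption, not as the authors' exact formula. The symbols x_{g,i} and x_{d,i} are hypothetical and denote the TF-IDF scores of term i for gene g and disease d.

```latex
\hat{y}_{g,d} = \sigma\Big(\sum_i w_i \, x_{g,i} \, x_{d,i}\Big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

L(w) = -\sum_{(g,d)} \Big[ y_{g,d} \log \hat{y}_{g,d}
       + (1 - y_{g,d}) \log\big(1 - \hat{y}_{g,d}\big) \Big]
```

Under this assumption, minimizing the logistic loss L(w) is an ordinary logistic regression whose features are the per-term products x_{g,i} x_{d,i}. A weight w_i near zero then removes term i from the gene-disease score.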
For the supervised values we use the OMIM database<br />
(Hamosh et al., 2005). More specifically, y is<br />
1 if there is a link between the given gene-disease pair and<br />
0 if there is no link. Intuitively, in this setup the text<br />
mining is transformed into a classification problem. We<br />
use a dataset of 330 OMIM terms and their linked genes, and<br />
randomly sample genes as negatives for each disease.<br />
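The classification setup can be sketched as follows: known links are positives, randomly sampled unlinked genes are negatives, and term weights are fitted by logistic regression. This is an illustration under stated assumptions, not the authors' implementation. The per-term product features, the number of negatives (matched to the number of positives), the plain gradient descent optimizer, and all names and scores are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pair_features(gene_vec, disease_vec):
    """Assumed feature form: per-term product of gene and disease scores."""
    return [a * b for a, b in zip(gene_vec, disease_vec)]


def train(examples, n_terms, lr=0.5, epochs=200):
    """Learn term weights w by batch gradient descent on the mean logistic loss."""
    w = [0.0] * n_terms
    for _ in range(epochs):
        grad = [0.0] * n_terms
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        w = [wi - lr * gi / len(examples) for wi, gi in zip(w, grad)]
    return w


# Made-up scores over three terms; not data from the study.
# Term 0 is shared by the linked genes, term 2 occurs everywhere.
genes = {
    "G1": [1.0, 0.0, 1.0],
    "G2": [0.9, 0.1, 1.0],
    "G3": [0.0, 1.0, 1.0],
    "G4": [0.1, 0.9, 1.0],
}
diseases = {"D1": [1.0, 0.0, 1.0]}
known_links = {("G1", "D1"), ("G2", "D1")}  # stands in for OMIM links

rng = random.Random(0)
examples = []
for d, d_vec in diseases.items():
    positives = [g for g in genes if (g, d) in known_links]
    unlinked = [g for g in genes if (g, d) not in known_links]
    # Assumption: sample as many negatives as there are positives.
    negatives = rng.sample(unlinked, k=min(len(positives), len(unlinked)))
    examples += [(pair_features(genes[g], d_vec), 1) for g in positives]
    examples += [(pair_features(genes[g], d_vec), 0) for g in negatives]

w = train(examples, n_terms=3)


def predict(gene, disease):
    """Predicted probability of a link for one gene-disease pair."""
    x = pair_features(genes[gene], diseases[disease])
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
```

In this toy data the term shared by the linked genes receives a positive weight, while the term that occurs in every profile receives a negative one.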
For the textual terms we use MEDLINE abstracts as the<br />
source of biomedical text. We employ MetaMap (Aronson<br />
& Lang, 2010) to link terms with abstracts. We use GeneRIF<br />
to link genes with abstracts, and PubMed to link diseases<br />
with abstracts. We apply a TF-IDF transformation to score<br />
a term for a given disease or gene based on the abstracts<br />
linked to each entity. We only use the terms linked to<br />
abstracts that belong to genes, so our vocabulary<br />
consists of 66,883 terms.<br />
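The scoring step above can be illustrated with a small sketch that builds a TF-IDF profile per entity from the terms annotated in its linked abstracts, then restricts the vocabulary to terms seen for genes. The exact TF-IDF variant is not given in the source. Here term frequency is normalized by the entity's total term count and document frequency is counted over entities, both of which are assumptions. All entity names and terms are made up.

```python
import math
from collections import Counter


def tfidf_profiles(entity_terms):
    """Map each entity to a term -> TF-IDF dictionary.

    entity_terms: entity -> list of term lists, one list per linked abstract.
    """
    # Pool the term annotations of all abstracts linked to each entity.
    counts = {
        entity: Counter(t for abstract in abstracts for t in abstract)
        for entity, abstracts in entity_terms.items()
    }
    n_entities = len(counts)
    # Document frequency: number of entities whose abstracts mention the term.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    profiles = {}
    for entity, c in counts.items():
        total = sum(c.values())
        profiles[entity] = {
            t: (k / total) * math.log(n_entities / df[t]) for t, k in c.items()
        }
    return profiles


# Hypothetical annotations, in the spirit of terms linked to abstracts.
entity_terms = {
    "GENE_A": [["chloride channel", "cystic fibrosis"], ["cystic fibrosis", "protein"]],
    "GENE_B": [["kinase", "protein"]],
    "DISEASE_X": [["cystic fibrosis", "lung", "protein"]],
}
profiles = tfidf_profiles(entity_terms)

# Keep only terms that occur in gene-linked abstracts.
gene_vocab = {t for g in ("GENE_A", "GENE_B") for t in profiles[g]}
disease_profile = {
    t: s for t, s in profiles["DISEASE_X"].items() if t in gene_vocab
}
```

A term that appears for every entity gets an inverse document frequency of zero, so it drops out of the score before any supervised weighting is applied.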
RESULTS & DISCUSSION<br />
The preliminary results show that supervised learning<br />
makes it possible to automatically pick up the keywords that are<br />
informative, improving the recall of the genes that are<br />
related to genetic disorders. We will present more detailed<br />
results in the poster.<br />
We are also investigating how to integrate the supervised<br />
approach into the answers that Beegle provides to online<br />
queries.<br />
REFERENCES<br />
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet<br />
allocation. Journal of Machine Learning Research, 3, 993-1022.<br />
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., & McKusick,<br />
V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a<br />
knowledgebase of human genes and genetic disorders. Nucleic Acids<br />
Research, 33(suppl 1), D514-D517.<br />
ElShal, S., Tranchevent, L. C., Sifrim, A., Ardeshirdavani, A., Davis, J.,<br />
& Moreau, Y. (2015). Beegle: from literature mining to disease-gene<br />
discovery. Nucleic Acids Research, gkv905.<br />
Aronson, A. R., & Lang, F. M. (2010). An overview of MetaMap:<br />
historical perspective and recent advances. Journal of the American<br />
Medical Informatics Association, 17(3), 229-236.<br />