bbc 2015

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
10th Benelux Bioinformatics Conference bbc 2015

P16. BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: SOMETIMES LESS IS MORE

Sarah ElShal 1,2,*, Jesse Davis 3 & Yves Moreau 1,2
1 Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven; 2 iMinds Future Health Department, KU Leuven; 3 Department of Computer Science, KU Leuven.
* sarah.elshal@esat.kuleuven.be

Biomedical text is increasingly being made available online, either as abstracts or as full articles. This goes in parallel with the desire to extract information from such text (e.g. finding links between diseases and genes). Consequently, text mining is very popular in the biomedical domain, because it makes it possible to analyze these texts automatically in order to extract knowledge. One of the big challenges in text mining is recognizing named entities (e.g. disease and gene entities) inside a given text, a task widely known as Named Entity Recognition (NER). We studied two biomedical taggers that apply different NER methods to MEDLINE abstracts. Here, we compare the contribution of each of the two taggers to associating genes with diseases. We show that with fewer recognized entities we gain more knowledge and we better associate genes with diseases.

INTRODUCTION
MEDLINE currently has more than 25 million biomedical citations from different journals all over the world. With this vast amount of text available, it is increasingly important to mine such data and find the best ways to extract relevant knowledge from it. One example of such knowledge is links between diseases and genes. However, recognizing biomedical entities inside a given text is very challenging and time-consuming, given the growing number of dictionaries and tagging strategies.
Different taggers exist that map MEDLINE abstracts to biomedical entities. Such tagged entities can be used to generate disease and gene profiles, and by applying certain similarity measures we can extract knowledge and generate disease-gene hypotheses.

METHODS
We compare two MEDLINE taggers that map the whole set of MEDLINE abstracts to biomedical entities (e.g. genes, diseases, GO and MeSH terms …). The first one is MetaMap (Aronson et al., 2010), and the second one has been used as a text mining pipeline in many resources, most recently in DISEASES (Pletscher-Frankild et al., 2015). For the sake of simplicity, we refer to the second tagger as m_tagger throughout the rest of the abstract. For each MEDLINE abstract we could obtain two sets of mapped entities: (1) the metamap set, and (2) the m_tagger set. Over all the abstracts, the metamap set corresponds to 78,298 distinct entities vs. 29,536 for the m_tagger set.

In order to compare the contribution of each tagger to the disease-gene association process, we proceeded as follows. First, we generated a validation set from the OMIM database to acquire a list of experimentally validated disease-gene pairs. Second, we generated an entity profile for every gene in our database and for every disease in our validation set. Each profile holds the TF-IDF score of every entity it contains, calculated from the set of abstracts found to be linked with the disease or gene. Then, for every disease, we computed the cosine similarity between its profile and all the gene profiles. This gave us a similarity score for each disease-gene pair, which we used to rank the genes for a given disease. We computed the average recall at the top 10, 25, 50, and 100 ranked genes. We ran this analysis once with the metamap set and once with the m_tagger set.

We also tried another association measure in which we filtered the profiles so that they contain only gene entities.
Then we ranked the genes according to their TF-IDF scores in a given disease profile. This corresponds to 9,290 gene entities in the metamap set, and 10,003 entities in the m_tagger set. Again we measured the average recall at the different rank thresholds, and we repeated the analysis using the metamap and m_tagger profiles.

RESULTS & DISCUSSION
Figure 1 presents the recall results on the OMIM validation set. We observe that MetaMap and m_tagger result in comparable recall when ranking the genes according to their cosine similarity with the disease profiles. We also observe that m_tagger results in the best recall when simply ranking the genes according to their TF-IDF scores inside the disease profile.

FIGURE 1. Recall results on the OMIM validation set: comparing the contribution of MetaMap and m_tagger, once with cosine similarity and once with TF-IDF ranks.

Even though using the m_tagger set means using fewer entities than the metamap set, we could gain the same knowledge to associate genes with diseases. Moreover, when we further reduced this set of entities to only genes, we gained even more knowledge and better associated genes with diseases.

REFERENCES
Aronson A.R. et al. An overview of MetaMap: historical perspective and recent advances. J. Am. Med. Inform. Assoc. 17, 229-236 (2010).
Pletscher-Frankild S. et al. DISEASES: text mining and data integration of disease-gene associations. Methods. 74, 83-89 (2015).
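
The two association measures described in the METHODS of P16 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the entity counts, the names geneA/geneB/geneC/diseaseX, and all function names are hypothetical, and the TF-IDF variant (relative term frequency times log inverse profile frequency, with each gene or disease profile treated as one document) is an assumption, since the abstract does not specify the weighting. Recall is computed per disease here; the abstract reports the average over all diseases in the validation set.

```python
import math
from collections import Counter


def tfidf_profiles(entity_counts):
    """Turn raw entity counts per gene/disease into TF-IDF profiles.

    Assumption: every gene or disease profile counts as one 'document'.
    """
    n_profiles = len(entity_counts)
    doc_freq = Counter()
    for counts in entity_counts.values():
        doc_freq.update(counts.keys())  # +1 per profile containing the entity
    profiles = {}
    for name, counts in entity_counts.items():
        total = sum(counts.values())
        profiles[name] = {
            entity: (count / total) * math.log(n_profiles / doc_freq[entity])
            for entity, count in counts.items()
        }
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(weight * q.get(entity, 0.0) for entity, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_genes_by_cosine(disease_profile, gene_profiles):
    """Measure 1: rank genes by cosine similarity with the disease profile."""
    return sorted(
        gene_profiles,
        key=lambda gene: cosine(disease_profile, gene_profiles[gene]),
        reverse=True,
    )


def rank_by_gene_entity_tfidf(disease_profile, gene_entities):
    """Measure 2: keep only gene entities in the disease profile, rank by TF-IDF."""
    genes = [e for e in disease_profile if e in gene_entities]
    return sorted(genes, key=lambda gene: disease_profile[gene], reverse=True)


def recall_at_k(ranked_genes, true_genes, k):
    """Fraction of validated genes found among the top-k ranked genes."""
    return len(set(ranked_genes[:k]) & set(true_genes)) / len(true_genes)


# Toy, made-up entity counts (hypothetical, for illustration only).
counts = {
    "geneA": Counter({"tumor": 3, "repair": 2}),
    "geneB": Counter({"fibrosis": 4, "channel": 1}),
    "geneC": Counter({"apoptosis": 2, "repair": 1}),
    "diseaseX": Counter({"tumor": 5, "repair": 1}),
}
profiles = tfidf_profiles(counts)
gene_profiles = {n: p for n, p in profiles.items() if n.startswith("gene")}
ranked = rank_genes_by_cosine(profiles["diseaseX"], gene_profiles)
print(ranked, recall_at_k(ranked, {"geneA"}, k=1))
```

In this toy example geneA shares both of its entities with diseaseX and ranks first, while geneB shares none and ranks last. Restricting a profile to gene entities, as in the second measure, removes the similarity computation entirely and reads the ranking directly from the disease profile.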

P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS

Bertrand Escaliere 1,2, Nicolas Simonis 1,3, Gianluca Bontempi 1,2 & Guillaume Smits 1,4
1 Interuniversity Institute of Bioinformatics in Brussels; 2 Machine Learning Group, Université Libre de Bruxelles; 3 Institut de Pathologie et de Génétique; 4 Hôpital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles.

Optimizing NGS analysis software and pipelines is crucial for improving the discovery of (new) disease-causing variants. A better combination of existing tools and the right choice of parameters can lead to more specific and sensitive calling. Simulated datasets support the step-by-step development of new alignment or calling software. A simulator able to insert known human variants at a realistic minor allele frequency, and artificial variants in a tunable, controlled way, would make it possible to overcome three optimization limits: complete knowledge of the input dataset, which allows exact calling sensitivity and accuracy to be determined; optimization on the appropriate population; and the capacity to dynamically test a pipeline one variable at a time.

INTRODUCTION
Identifying the anomalies that cause genetic disorders is difficult. It can be limited by the rarity of the disorder concerned, by the genetic heterogeneity of the disorder, or by the phenotypic pleiotropy associated with anomalies in a single gene. Exome and genome sequencing have made it possible to identify the causes of many genetic diseases whose origin had until now remained inaccessible to the usual techniques of genetic research (Ng et al., 2009; Gilissen et al., 2012; Yang et al., 2013; Gilissen et al., 2014). Exome and genome sequencing data analysis pipelines consist of several steps (roughly: alignment, quality filters, variant calling), and several software tools are available for each of those steps.
Evaluating and comparing those tools is crucial in order to improve pipeline accuracy. Exome and genome sequencing simulations should make it possible to determine the veracity of called variants (false positives and false negatives).

METHODS
We implemented TuneSIM, a wrapper around the NGS read simulator dwgsim (http://sourceforge.net/projects/dnaa/) that adds realistic mutations. Generated reads contain real mutations from the 1KG project and dbSNP138, and read generation itself is delegated to the existing tool dwgsim. In order to generate data that are as realistic as possible, we decided to keep the haplotype block structure. We computed the blocks with PLINK (Purcell et al., 2007), using VCF files from 1KG project phase 3 for European individuals. For each block, we obtained the frequency of each combination of variants, and we used these frequencies for block selection. We also insert variants independently, using their frequencies in dbSNP (Smigielski et al., 2000). Using 33 in-house samples, we computed the global distributions of variant allele frequencies in coding and non-coding regions, and we select the variants according to those frequencies. A similar procedure was applied to CNV insertion using 1KG data. We are developing a web interface that allows users to download existing generated datasets. After running their pipelines, they can upload their output and see the accuracy of their pipelines.

RESULTS & DISCUSSION
Simulations with different coverage and indel rates have been performed and analysed with different pipelines. Results will be presented.

REFERENCES
Gilissen, et al. (2012). Disease gene identification strategies for exome sequencing. Eur J Hum Genet, 20, 490–497.
Gilissen, et al. (2014). Genome sequencing identifies major causes of severe intellectual disability. Nature, 511, 344–347.
Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a mendelian disorder. Nature Genetics, 42, 30–35.
Purcell, et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81, 559–575.
Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Research, 28, 352–355.
Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis of Mendelian Disorders. N Engl J Med, 369, 1502–1511.
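
The variant selection described in the METHODS of P17, and the accuracy check it enables, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not TuneSIM's code: the function names and variant identifiers are hypothetical, a single haplotype is simulated rather than a diploid genome, each haplotype block is represented as a mapping from a combination of variants to its observed frequency, and no dwgsim or PLINK call is made.

```python
import random


def sample_block_haplotype(haplotype_freqs, rng):
    """Pick one combination of variants for a block, proportional to its frequency."""
    haplotypes = list(haplotype_freqs)
    weights = [haplotype_freqs[h] for h in haplotypes]
    return rng.choices(haplotypes, weights=weights, k=1)[0]


def simulate_truth_set(blocks, independent_variants, seed=0):
    """Build the set of inserted variants (the known 'truth' of a simulation).

    blocks: list of {tuple_of_variant_ids: frequency} dicts, one per block.
    independent_variants: {variant_id: allele_frequency}, inserted independently.
    """
    rng = random.Random(seed)  # fixed seed so a dataset can be regenerated
    truth = set()
    for block in blocks:
        truth.update(sample_block_haplotype(block, rng))
    for variant, freq in independent_variants.items():
        if rng.random() < freq:  # Bernoulli draw at the variant's frequency
            truth.add(variant)
    return truth


def sensitivity_precision(truth, called):
    """Compare a pipeline's called variants with the known inserted variants."""
    true_pos = len(truth & called)
    sensitivity = true_pos / len(truth) if truth else 0.0
    precision = true_pos / len(called) if called else 0.0
    return sensitivity, precision


# Hypothetical example: one block with two possible haplotypes, two independent SNPs.
blocks = [{("v1", "v2"): 0.7, (): 0.3}]
independent = {"v3": 0.25, "v4": 0.01}
truth = simulate_truth_set(blocks, independent, seed=42)
called = {"v1", "v2", "v9"}  # made-up pipeline output
print(truth, sensitivity_precision(truth, called))
```

Sampling whole combinations per block, rather than each variant separately, is what preserves the linkage between variants inside a block. Because the inserted set is known exactly, false negatives (inserted but not called) and false positives (called but not inserted) can both be counted.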