bbc 2015


BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
10th Benelux Bioinformatics Conference bbc 2015
Abstract ID: P50 (Poster)

P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES

Alex Salazar 1,2 & Thomas Abeel 1,2*
1 Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands; 2 Genome Sequencing and Analysis Program, Broad Institute of MIT and Harvard.
* T.Abeel@tudelft.nl

Comparing large structural variants, such as large insertions and deletions (indels), across multiple genomes can reveal important insights in microbial organisms. Unfortunately, most studies that compare sequence variants focus only on single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are robust when identifying the same large indel across multiple genomes, an important criterion for accurately associating large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that breakpoint detection is precise when identifying both deletions and insertions. We suggest that leftmost-overlap normalization across all samples will ensure uniform breakpoint coordinates of identical large variants, which can then be incorporated into existing association pipelines.

INTRODUCTION
Structural sequence variants, such as large insertions and deletions (indels), along with small sequence variants (e.g. single nucleotide variants and small indels) can enable more robust comparisons of microbial populations. Unfortunately, limitations in variant calling methods restrict investigations to comparing only small variants across multiple microbial genomes, thereby ignoring larger variants (e.g. indels larger than 50nt). The recent development of structural variant detection tools now provides an opportunity to compare and associate large indels with phenotype and population structure across a collection of samples.
However, these tools have only been benchmarked against a single genome, and their ability to consistently call large events across multiple genomes remains uncharacterized.

METHODS
In this study, we systematically benchmarked the robustness of large indel identification across multiple genomes using four recently developed structural variant detection tools: Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014), BreakSeek (Zhao et al., 2015), and MindTheGap (Rizk et al., 2014). Using a manually curated reference genome for M. tuberculosis (H37Rv), we simulated nearly 10,000 deletions and 8,000 insertions, ranging from 50nt to 550nt. Overall, the simulation experiment resulted in a total of 1.6 million expected deletions and 1.3 million expected insertions when we aligned short reads from a data set of 161 clinical strains of M. tuberculosis (Zhang et al., 2013). After identifying the simulated indels using the variant detection tools, we used a distance test to investigate each tool’s robustness in breakpoint and genotype prediction. For each simulated indel prediction, we computed the distance of the predicted breakpoint coordinate to the expected breakpoint coordinate. We also calculated a genotype similarity score using the Damerau-Levenshtein distance.

RESULTS & DISCUSSION
We found that all tools are able to precisely predict the breakpoint coordinate of the same large event present across multiple genomes.

For deletions, Breseq and BreakSeek consistently identified more than 96% of all simulated deletions regardless of size. This number ranged from 87% to 93% in Pilon and correlated with decreasing deletion size. Breseq and Pilon correctly predicted the exact breakpoint coordinate for about two-thirds of all identified simulated indels. This number ranged from 1% to 7% in BreakSeek calls and inversely correlated with increasing deletion size.
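The distance test and the genotype similarity score described in the methods can be illustrated with a minimal Python sketch. The abstract does not state which Damerau-Levenshtein variant was used or how the score was normalized, so the restricted (optimal string alignment) variant, the division by the longer sequence length, and all function names below are assumptions made for illustration only.

```python
def breakpoint_distance(predicted: int, expected: int) -> int:
    """Absolute distance (in nt) between predicted and expected breakpoints."""
    return abs(predicted - expected)


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts substitutions, insertions, deletions and transpositions of two
    adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def genotype_similarity(predicted_seq: str, expected_seq: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption, not the authors'."""
    longest = max(len(predicted_seq), len(expected_seq))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(predicted_seq, expected_seq) / longest
```

Under these assumptions, a predicted sequence identical to the expected one scores 1.0, and each edit lowers the score in proportion to the length of the longer sequence.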
For insertions, MindTheGap consistently identified approximately 97% of all simulated insertions, but Pilon’s performance was less consistent: it identified between 69% and 93% of insertions, and the number of missed calls again increased with insertion size. Both tools correctly predicted the exact breakpoint coordinate for about two-thirds of all identified simulated indels. Nevertheless, we found that 99% of the breakpoint coordinates predicted by the four tools were within 10nt of the expected breakpoint coordinate.

Our results also indicate that Pilon, Breseq, BreakSeek, and MindTheGap are robust when predicting the genotype of large indels across multiple samples. The large majority of identified simulated deletions had a size and genotype similarity of more than 98%. For insertions, the size similarity varied widely in both MindTheGap and Pilon calls, indicating that both tools have difficulty determining the exact length of an insertion sequence.

Overall, these results show that breakpoint detection is precise when identifying deletions and insertions of any size. Therefore, a simple normalization procedure, such as leftmost-overlap normalization across samples, will ensure consistent breakpoint locations for identical large events. This will enable researchers to incorporate large variants into existing association pipelines, opening novel opportunities to associate large variants with phenotype and population structure.

REFERENCES
Barrick,J.E. et al. (2014) Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq. BMC Genomics, 15, 1039.
Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of short and long insertions. Bioinformatics, 30, 1–7.
Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9, e112963.
Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance. Nat. Genet., 45, 1255–60.
Zhao,H. and Zhao,F. (2015) BreakSeek: a breakpoint-based algorithm for full spectral range INDEL detection. Nucleic Acids Res., 1–13.
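The leftmost-overlap normalization suggested in the P50 abstract is not specified in detail there. One possible reading is sketched below in Python: calls of the same type whose intervals overlap (within a small tolerance) across samples are grouped, and every call in a group is moved to the group's leftmost start while keeping its own size. The `IndelCall` record, the `normalize_leftmost` function and the `pad` tolerance of 10nt (chosen because 99% of predicted breakpoints fell within 10nt of the expected coordinate) are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class IndelCall:
    sample: str
    kind: str   # "DEL" or "INS"
    start: int  # predicted left breakpoint on the reference
    end: int    # predicted right breakpoint (equal to start for insertions)


def _shift_to_leftmost(cluster: List[IndelCall]) -> List[IndelCall]:
    """Move every call in a cluster to the leftmost start, preserving its size."""
    leftmost = cluster[0].start  # the cluster is sorted by start
    return [replace(c, start=leftmost, end=leftmost + (c.end - c.start))
            for c in cluster]


def normalize_leftmost(calls: List[IndelCall], pad: int = 10) -> List[IndelCall]:
    """Group overlapping calls of the same kind and align them to the left.

    `pad` is an assumed tolerance: a call joins the current cluster when its
    start is no further than `pad` nt past the cluster's rightmost end.
    """
    normalized: List[IndelCall] = []
    for kind in sorted({c.kind for c in calls}):
        same_kind = sorted((c for c in calls if c.kind == kind),
                           key=lambda c: c.start)
        cluster: List[IndelCall] = []
        reach = 0  # rightmost end of the current cluster, plus pad
        for call in same_kind:
            if cluster and call.start > reach:
                normalized.extend(_shift_to_leftmost(cluster))
                cluster = []
            if not cluster:
                reach = call.end + pad
            else:
                reach = max(reach, call.end + pad)
            cluster.append(call)
        if cluster:
            normalized.extend(_shift_to_leftmost(cluster))
    return normalized
```

Because clusters grow by chaining, two distinct events that lie closer together than `pad` would be merged by this sketch; a real pipeline would need a stricter rule for such cases.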
P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES FOR PREDICTING CLINICAL CODES

Elyne Scheurwegs 1,3*, Kim Luyckx 2, Léon Luyten 2, Walter Daelemans 3 & Tim Van den Bulcke 1
1 Advanced Database Research and Modeling (ADReM), University of Antwerp; 2 Antwerp University Hospital; 3 Center for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp.
* elyne.scheurwegs@uantwerpen.be

Automated clinical coding is a task in medical informatics in which information found in patient files is translated into various types of coding systems (e.g. ICD-9-CM). The information in patient files comes from multiple data sources, both structured (e.g. lab test results) and unstructured (e.g. a text describing the progress of a patient over multiple days during the stay). This work studies the complementarity of information derived from these different sources to enhance clinical code prediction.

INTRODUCTION
The increased accessibility of healthcare data through the large-scale adoption of electronic health records stimulates the development of algorithms that monitor hospital activities, such as clinical coding applications. Clinical coding is the translation of information found in a patient file into diagnostic and procedural codes that originate from a medical ontology. In our work, we investigate whether unstructured (textual) and structured data sources, present in electronic health records, can be combined to assign clinical diagnostic and procedural codes (specifically ICD-9-CM) to patient stays. Our main objective is to evaluate whether integrating these heterogeneous data types improves prediction strength compared to using the data types in isolation.

METHODS
Several datasets were collected from the clinical data warehouse of the Antwerp University Hospital (UZA).
The resulting dataset consists of a randomized subset of anonymized data of patient stays in 14 different medical specialties. Two separate data integration approaches were evaluated on each dataset from a medical specialty. With early data integration, multiple sources are combined prior to training a model. This is achieved by using a single bag of features that is given to the prediction pipeline. Feature selection is performed with tf-idf for unstructured sources, and with gain ratio and minimum redundancy maximum relevance (mRMR) for structured sources. The late data integration method trains a separate model on each data source and then combines the prediction output for each code in a meta-learner. This meta-learner is mainly used to find which sources perform best for a certain code. The prediction task in both approaches was cast as a multi-label classification task, in which an array of binary predictions was made (one for each clinical code).

RESULTS & DISCUSSION
Late data integration improves the prediction of ICD-9-CM diagnostic codes compared with the best individual prediction source (i.e. overall F-measure increased from 30.6% to 38.3%). Early data integration does not show this trend and only performs well with a limited number of combinations of sources. ICD-9-CM procedure codes also show this trend, with the exception of the RIZIV data source, which yields better predictions when used individually. The predictive strength of the models varies strongly between medical specialties.

The results show that the data sources, independent of their structured or unstructured nature, are able to provide complementary information when predicting ICD-9-CM codes, particularly when combined within the late data integration approach. This approach also allows for including as many sources as possible, since including a source that contains no additional information barely influences the end result.
This is an advantage when the information content of a data source is not previously known. A disadvantage is the loss of information due to the strong generalisation, as each data source is effectively reduced to a single feature for the meta-learner. Early data integration seems to suffer when combining sources whose features differ widely in information content and in number. An unstructured data source typically yields 30,000 different, weak features, while a structured source often contains only 500 different features.

CONCLUSIONS
Models using multiple electronic health record data sources systematically outperform models using data sources in isolation in the task of predicting ICD-9-CM codes over a broad range of medical specialties.

ACKNOWLEDGEMENT
This work is supported by a doctoral research grant (nr. 131137) by the Agency for Innovation by Science and Technology in Flanders (IWT). The datasets used in this research were made available by the Antwerp University Hospital (UZA) for restricted use.

REFERENCES
Scheurwegs, E et al. Data integration of structured and unstructured sources for assigning clinical codes to patient stays. Journal of the American Medical Informatics Association (2015): ocv115.
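The late data integration idea in the P51 abstract, one model per data source followed by a per-code combination step, can be illustrated with a deliberately simplified Python sketch. The abstract does not describe the meta-learner itself, so the rule used here (for each code, keep the source with the highest F-measure on held-out stays) is a stand-in assumption, and the names `f_measure`, `best_source_per_code` and `late_integrate` are hypothetical.

```python
from typing import Dict, List

# code -> one binary prediction (or label) per patient stay
Predictions = Dict[str, List[int]]


def f_measure(truth: List[int], pred: List[int]) -> float:
    """F-measure (harmonic mean of precision and recall) for one code."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_source_per_code(truth: Predictions,
                         held_out: Dict[str, Predictions]) -> Dict[str, str]:
    """For each code, pick the source whose held-out predictions score best.

    Ties go to the alphabetically first source name.
    """
    return {
        code: max(sorted(held_out),
                  key=lambda source: f_measure(labels, held_out[source][code]))
        for code, labels in truth.items()
    }


def late_integrate(best: Dict[str, str],
                   new_preds: Dict[str, Predictions]) -> Predictions:
    """Combine per-source predictions on new stays using the chosen sources."""
    return {code: new_preds[source][code] for code, source in best.items()}
```

A trained meta-learner could weight several sources per code rather than selecting a single one; the sketch only shows why an uninformative source has little effect on the combined output, since it is simply never selected.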