BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference BBC 2015<br />
P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION<br />
ACROSS MULTIPLE MICROBIAL GENOMES<br />
Alex Salazar 1,2 & Thomas Abeel 1,2*<br />
Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands 1; Genome Sequencing and Analysis<br />
Program, Broad Institute of MIT and Harvard 2. * T.Abeel@tudelft.nl<br />
Comparing large structural variants, such as large insertions and deletions (indels), across multiple genomes can reveal<br />
important insights into microbial organisms. Unfortunately, most studies that compare sequence variants focus only on<br />
single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are<br />
robust when identifying the same large indel across multiple genomes, an important criterion for accurately associating<br />
large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that<br />
breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap<br />
normalization across all samples will ensure uniform breakpoint coordinates for identical large variants, which can then be<br />
incorporated into existing association pipelines.<br />
INTRODUCTION<br />
Structural sequence variants, such as large insertions and<br />
deletions (indels), along with small sequence variants (e.g.<br />
single nucleotide variants and small indels) can enable more<br />
robust comparisons of microbial populations. Unfortunately,<br />
limitations in variant calling methods restrict investigations to<br />
comparing only small variants across multiple microbial<br />
genomes, thereby ignoring larger variants (e.g. indels larger<br />
than 50nt). The recent development of structural<br />
variant detection tools now provides an opportunity to<br />
compare and associate large indels with phenotype and<br />
population structure across a collection of samples. However,<br />
these tools have only been benchmarked against a single<br />
genome and their ability to consistently call large events<br />
across multiple genomes remains uncharacterized.<br />
METHODS<br />
In this study, we systematically benchmarked the robustness<br />
of large indel identification across multiple genomes using<br />
four recently developed structural variant detection tools:<br />
Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014),<br />
BreakSeek (Zhao and Zhao, 2015), and MindTheGap (Rizk et al.,<br />
2014). Using a manually curated reference genome for<br />
M. tuberculosis (H37Rv), we simulated nearly 10,000<br />
deletions and 8,000 insertions, ranging from 50nt<br />
to 550nt. Overall, the simulation experiment resulted in a<br />
total of 1.6 million expected deletions and 1.3 million expected<br />
insertions when we aligned short reads from a data set of 161<br />
clinical strains of M. tuberculosis (Zhang et al., 2013).<br />
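As a rough illustration of the simulation step, the sketch below edits a reference sequence to introduce a deletion or an insertion, and computes how many calls would be expected if every simulated event is expected once in each strain. The function names, the 0-based coordinates, and the random insertion sequences are assumptions made for illustration. The abstract does not describe how the events were actually generated.

```python
import random


def simulate_deletion(reference, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    if start < 0 or length <= 0 or start + length > len(reference):
        raise ValueError("deletion falls outside the reference")
    return reference[:start] + reference[start + length:]


def simulate_insertion(reference, position, inserted):
    """Insert `inserted` immediately before 0-based `position`."""
    if not 0 <= position <= len(reference):
        raise ValueError("insertion point falls outside the reference")
    return reference[:position] + inserted + reference[position:]


def random_sequence(length, rng):
    """Draw a random nucleotide string (hypothetical insertion payload)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def expected_calls(n_events, n_strains):
    """Assumes each simulated event is expected once per strain."""
    return n_events * n_strains


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    reference = random_sequence(2000, rng)
    shorter = simulate_deletion(reference, start=500, length=50)
    longer = simulate_insertion(reference, 500, random_sequence(550, rng))
    print(len(reference), len(shorter), len(longer))
    print(expected_calls(10_000, 161))
```

Under that per-strain assumption, about 10,000 deletions and 8,000 insertions across 161 strains give roughly 1.6 million and 1.3 million expected events, in line with the totals reported above.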
After identifying the simulated indels using the variant<br />
detecting tools, we used a distance test to investigate each<br />
tool’s robustness in breakpoint and genotype prediction. For<br />
each simulated indel prediction, we computed the distance of<br />
the predicted breakpoint coordinate to the expected<br />
breakpoint coordinate. We also calculated a genotype<br />
similarity score using the Damerau-Levenshtein distance.<br />
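The two measurements described above can be sketched as follows. The abstract does not say which Damerau-Levenshtein variant was used or how the distance was turned into a similarity score, so this sketch assumes the restricted (optimal string alignment) variant and normalizes by the longer sequence length. All function names and the 10nt default tolerance are illustrative assumptions.

```python
def breakpoint_distance(predicted, expected):
    """Absolute offset, in nucleotides, between two breakpoint coordinates."""
    return abs(predicted - expected)


def within_tolerance(predicted, expected, tolerance=10):
    """True when the predicted breakpoint lies within `tolerance` nt."""
    return breakpoint_distance(predicted, expected) <= tolerance


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts substitutions, insertions, deletions and transpositions of
    adjacent characters, with no substring edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def genotype_similarity(predicted, expected):
    """Similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(predicted), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(predicted, expected) / longest


if __name__ == "__main__":
    print(breakpoint_distance(1005, 1000))      # 5
    print(genotype_similarity("ACGT", "ACGA"))  # 0.75
```

Because the distance is quadratic in sequence length, indels of up to 550nt remain cheap to compare one pair at a time.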
RESULTS & DISCUSSION<br />
We found that all tools are able to precisely predict the<br />
breakpoint coordinate of the same large event present across<br />
multiple genomes. For deletions, Breseq and BreakSeek<br />
consistently identified more than 96% of all simulated<br />
deletions regardless of size. For Pilon, this proportion ranged from 87% to<br />
93% and correlated with decreasing deletion size.<br />
Breseq and Pilon correctly predicted the exact breakpoint<br />
coordinate for about two-thirds of all identified simulated<br />
indels. For BreakSeek calls, this proportion ranged from 1% to 7%<br />
and was inversely correlated with deletion size.<br />
For insertions, MindTheGap consistently identified<br />
approximately 97% of all simulated insertions, but Pilon’s<br />
performance was weaker: the proportion of insertions that it<br />
identified ranged from 69% to 93%. Again, we observed that<br />
the number of missed calls correlated directly with insertion<br />
size. Both tools correctly predicted the exact breakpoint<br />
coordinate for about two-thirds of all identified simulated<br />
indels. Nevertheless, we found that 99% of the predicted<br />
breakpoint coordinates made by the four tools were within<br />
10nt of the expected breakpoint coordinate.<br />
Our results also indicate that Pilon, Breseq, BreakSeek, and<br />
MindTheGap are robust when predicting the genotype of<br />
large indels across multiple samples. The large majority of<br />
identified simulated deletions had a size and genotype<br />
similarity of more than 98%. For insertions, size similarity<br />
varied widely in both MindTheGap and Pilon<br />
calls, indicating that both tools have difficulty<br />
determining the exact length of an insertion sequence.<br />
Overall, these results show that breakpoint detection is<br />
precise when identifying deletions and insertions of any size.<br />
Therefore, a simple normalization procedure, such as left-most-overlap
normalization across samples, will ensure<br />
consistent breakpoint location for identical large events. This<br />
will enable researchers to incorporate large variants into<br />
existing association pipelines, opening novel opportunities to<br />
associate large variants with phenotype and population<br />
structure.<br />
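The abstract does not define left-most-overlap normalization, so the following is only one plausible reading: calls from different samples whose reference intervals overlap are grouped, and every call in a group is shifted to the group's left-most start while keeping its own length. The `Call` type, the half-open coordinates, and the `slack` parameter (a tolerance for insertions, whose reference interval has zero length) are assumptions introduced for illustration.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    sample: str
    start: int  # 0-based, inclusive
    end: int    # exclusive; equal to start for an insertion point


def normalize_leftmost_overlap(calls: List[Call], slack: int = 0) -> List[Call]:
    """Shift overlapping calls to the left-most start of their group.

    Calls are swept in order of start coordinate. A call joins the
    current group when it starts before the group's right edge plus
    `slack`; otherwise it opens a new group.
    """
    ordered = sorted(calls, key=lambda c: (c.start, c.end))
    normalized = []
    anchor = None       # left-most start of the current group
    group_end = None    # right-most end seen in the current group
    for call in ordered:
        if anchor is None or call.start >= group_end + slack:
            anchor, group_end = call.start, call.end
        else:
            group_end = max(group_end, call.end)
        shift = call.start - anchor
        normalized.append(Call(call.sample, anchor, call.end - shift))
    return normalized


if __name__ == "__main__":
    deletions = [
        Call("strain_A", 100, 200),
        Call("strain_B", 103, 203),  # same event, breakpoint off by 3 nt
        Call("strain_C", 500, 560),  # unrelated event
    ]
    for call in normalize_leftmost_overlap(deletions):
        print(call)
```

After normalization, the calls from strain_A and strain_B share the coordinates (100, 200) and can be treated as the same variant in a presence/absence matrix for association testing. Since 99% of predicted breakpoints fell within 10nt of the expected coordinate, a `slack` of about 10 would be a natural starting value for insertions.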
REFERENCES<br />
Barrick,J.E. et al. (2014) Identifying structural variation in haploid<br />
microbial genomes from short-read resequencing data using breseq.<br />
BMC Genomics, 15, 1039.<br />
Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of<br />
short and long insertions. Bioinformatics, 30, 1–7.<br />
Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive<br />
microbial variant detection and genome assembly improvement.<br />
PLoS One, 9, e112963.<br />
Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium<br />
tuberculosis isolates from China identifies genes and intergenic<br />
regions associated with drug resistance. Nat. Genet., 45, 1255–60.<br />
Zhao,H. and Zhao,F. (2015) BreakSeek: a breakpoint-based algorithm for<br />
full spectral range INDEL detection. Nucleic Acids Res., 1–13.<br />