BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: P

Poster

10th Benelux Bioinformatics Conference bbc 2015

P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES

Alex Salazar 1,2 & Thomas Abeel 1,2*

1 Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands; 2 Genome Sequencing and Analysis Program, Broad Institute of MIT and Harvard. * T.Abeel@tudelft.nl

Comparing large structural variants, such as large insertions and deletions (indels), across multiple genomes can reveal important insights in microbial organisms. Unfortunately, most studies that compare sequence variants focus only on single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are robust when identifying the same large indel across multiple genomes, an important criterion for accurately associating large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap normalization across all samples will ensure uniform breakpoint coordinates for identical large variants, which can then be incorporated into existing association pipelines.

INTRODUCTION

Structural sequence variants, such as large insertions and deletions (indels), along with small sequence variants (e.g. single nucleotide variants and small indels) can enable more robust comparisons of microbial populations. Unfortunately, limitations in variant calling methods restrict investigations to comparing only small variants across multiple microbial genomes, thereby ignoring larger variants (e.g. indels larger than 50nt). The recent development of structural variant detection tools now provides an opportunity to compare and associate large indels with phenotype and population structure across a collection of samples. However, these tools have only been benchmarked against a single genome, and their ability to consistently call large events across multiple genomes remains uncharacterized.

METHODS

In this study, we systematically benchmarked the robustness of large indel identification across multiple genomes using four recently developed structural variant detection tools: Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014), BreakSeek (Zhao and Zhao, 2015), and MindTheGap (Rizk et al., 2014). Using a manually curated reference genome for M. tuberculosis (H37Rv), we simulated nearly 10,000 deletions and 8,000 insertions ranging from 50nt to 550nt. Overall, the simulation experiment resulted in a total of 1.6 million expected deletions and 1.3 million expected insertions when we aligned short reads from a data set of 161 clinical strains of M. tuberculosis (Zhang et al., 2013).

After identifying the simulated indels using the variant detection tools, we used a distance test to investigate each tool's robustness in breakpoint and genotype prediction. For each simulated indel prediction, we computed the distance between the predicted breakpoint coordinate and the expected breakpoint coordinate. We also calculated a genotype similarity score using the Damerau-Levenshtein distance.
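The two measures described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the abstract: the function names are hypothetical, the edit distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein, and the similarity score is normalized by the length of the longer sequence. The scoring used in the study may differ.

```python
def breakpoint_distance(predicted: int, expected: int) -> int:
    """Absolute offset, in nucleotides, between two breakpoint coordinates."""
    return abs(predicted - expected)


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def genotype_similarity(predicted_seq: str, expected_seq: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption of this sketch."""
    longest = max(len(predicted_seq), len(expected_seq))
    if longest == 0:
        return 1.0
    return 1.0 - damerau_levenshtein(predicted_seq, expected_seq) / longest
```

For example, `genotype_similarity("ACGT", "ACTG")` treats the swapped `G` and `T` as a single transposition rather than two substitutions.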

RESULTS & DISCUSSION

We found that all tools are able to precisely predict the breakpoint coordinate of the same large event present across multiple genomes. For deletions, Breseq and BreakSeek consistently identified more than 96% of all simulated deletions regardless of size. For Pilon, this fraction ranged from 87% to 93% and correlated with decreasing deletion size. Breseq and Pilon predicted the exact breakpoint coordinate for about two-thirds of all identified simulated deletions. For BreakSeek calls, this fraction ranged from 1% to 7% and was inversely correlated with deletion size.

For insertions, MindTheGap consistently identified approximately 97% of all simulated insertions. Pilon performed worse, identifying between 69% and 93% of insertions; again, the number of missed calls increased with insertion size. Both tools predicted the exact breakpoint coordinate for about two-thirds of all identified simulated insertions. Nevertheless, we found that 99% of the breakpoint coordinates predicted by the four tools were within 10nt of the expected breakpoint coordinate.
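The two breakpoint statistics reported here, the fraction of exact predictions and the fraction within 10nt, can be computed from a list of per-call breakpoint distances. The sketch below is illustrative only: the function name is hypothetical and the input values in the usage note are made up, not taken from the study.

```python
def breakpoint_summary(distances, tolerance=10):
    """Return (exact_fraction, within_tolerance_fraction) for a list of
    absolute breakpoint distances measured in nucleotides."""
    if not distances:
        raise ValueError("at least one distance is required")
    total = len(distances)
    exact = sum(1 for d in distances if d == 0) / total
    within = sum(1 for d in distances if d <= tolerance) / total
    return exact, within
```

With the toy input `[0, 0, 3, 12]`, half of the calls are exact and three quarters fall within the default 10nt tolerance.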

Our results also indicate that Pilon, Breseq, BreakSeek, and MindTheGap are robust when predicting the genotype of large indels across multiple samples. The large majority of identified simulated deletions had a size and genotype similarity of more than 98%. For insertions, size similarity varied widely in both MindTheGap and Pilon calls, indicating that both tools have difficulty determining the exact length of an insertion sequence.

Overall, these results show that breakpoint detection is precise when identifying deletions and insertions across the simulated size range (50nt to 550nt). Therefore, a simple normalization procedure, such as left-most-overlap normalization across samples, will ensure a consistent breakpoint location for identical large events. This will enable researchers to incorporate large variants into existing association pipelines, opening novel opportunities to associate large variants with phenotype and population structure.
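The abstract does not spell out the left-most-overlap procedure. One plausible reading, sketched below in Python, is to group calls of the same event type whose reference intervals overlap across samples and to assign every call in a group the left-most start coordinate. The `Call` type, the `normalize_leftmost` function, and the `slack` parameter are all hypothetical names introduced for illustration, and this is not necessarily the procedure the authors have in mind.

```python
from typing import Dict, Iterable, NamedTuple


class Call(NamedTuple):
    """One large-variant call in one sample (hypothetical record layout)."""
    sample: str
    start: int  # first affected reference position
    end: int    # last affected reference position (inclusive)


def normalize_leftmost(calls: Iterable[Call], slack: int = 0) -> Dict[Call, int]:
    """Map each call to the left-most start among the calls it overlaps.

    Calls are swept in order of start coordinate. A call joins the current
    cluster when it starts no later than the cluster's right-most end plus
    `slack`; otherwise it opens a new cluster. Overlap is therefore
    transitive, which is a simplification of this sketch.
    """
    clusters = []
    for call in sorted(calls, key=lambda c: (c.start, c.end)):
        if clusters and call.start <= clusters[-1]["end"] + slack:
            clusters[-1]["calls"].append(call)
            clusters[-1]["end"] = max(clusters[-1]["end"], call.end)
        else:
            clusters.append({"end": call.end, "calls": [call]})

    normalized = {}
    for cluster in clusters:
        # Calls were added in sorted order, so the first has the smallest start.
        leftmost = cluster["calls"][0].start
        for call in cluster["calls"]:
            normalized[call] = leftmost
    return normalized
```

Given a deletion called at 1000 to 1200 in one sample and at 1003 to 1201 in another, both calls map to start coordinate 1000, while a call at 5000 to 5100 keeps its own coordinate. Only the start is normalized here; end coordinates and event types would need the same care in a real pipeline.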

REFERENCES

Barrick, J.E. et al. (2014) Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq. BMC Genomics, 15, 1039.

Rizk, G. et al. (2014) MindTheGap: integrated detection and assembly of short and long insertions. Bioinformatics, 30, 1–7.

Walker, B.J. et al. (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9, e112963.

Zhang, H. et al. (2013) Genome sequencing of 161 Mycobacterium tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance. Nat. Genet., 45, 1255–60.

Zhao, H. and Zhao, F. (2015) BreakSeek: a breakpoint-based algorithm for full spectral range INDEL detection. Nucleic Acids Res., 1–13.
