01.06.2016 Views


11th Annual Sequencing, Finishing, and Analysis in the Future Meeting

AN EXTENDED CORE GENE MLST TARGET IDENTIFICATION AND SUBSET SELECTION PIPELINE FOR CULTURE-INDEPENDENT PATHOGEN SUBTYPING

Wednesday, 1st June 20:00 La Fonda NM Room (1st floor) Poster (PS‐1b.117)

Jo Williams‐Newkirk¹, Eija Trees², John Besser², Heather Carleton²

¹ IHRC, ² Centers for Disease Control and Prevention

While isolate‐based whole genome sequencing is being rapidly integrated into the US public health surveillance system, isolate availability for surveillance continues to decline as a result of the adoption of culture‐independent diagnostic tests by clinical laboratories. As affordable methods are not yet available for obtaining the same genome resolution directly from shotgun metagenomic sequencing of clinical samples, particularly microbially complex samples such as stool, alternative methods are needed to reliably capture genetic information relevant to pathogen subtyping. Targeted amplification and sequencing of informative genomic regions (i.e., multilocus sequence typing, MLST) is a well‐understood and robust typing method whose resolution is limited only by the number of sites used. Unfortunately, identifying large numbers of informative regions with conserved primer sites is labor‐intensive, particularly if hundreds or thousands of reference genomes are used for site selection.

To facilitate the rapid development of extended MLST schemes for targeted pathogen groups, we developed a custom pipeline leveraging widely used open source programs to identify potential MLST targets with conserved primer sites and to find subsets of those targets that recapitulate a reference phylogeny (user provided or generated by the pipeline from concatenated core genes). Our pipeline accepts whole genome annotation files from the targeted pathogen group in GenBank (.gbk) format. Core genes are identified by protein BLAST of all annotated open reading frames (ORFs) from a single genome against the ORFs from all GenBank files submitted to the pipeline. Hits are filtered to retain only single copy ORFs that occur in all submitted genomes and are at least 50% similar across at least 50% of the query length. Hits found in multiple putative single copy ORF groups are also discarded.
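The core gene filter described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name `core_orfs` and the hit fields (`query`, `genome`, `subject`, `pident`, `aln_len`, `query_len`) are hypothetical, "similar" is assumed to mean percent identity as reported by BLAST, both thresholds are assumed to be minimums, and a putative group is dropped entirely when one of its hits also appears in another group.

```python
from collections import defaultdict


def core_orfs(hits, genomes, min_ident=50.0, min_cov=0.5):
    """Return {query ORF: {genome: subject ORF}} for putative single copy core ORFs.

    `hits` is an iterable of dicts with hypothetical keys:
    query, genome, subject, pident, aln_len, query_len.
    """
    # Step 1: keep only hits that pass the identity and query coverage thresholds.
    per_query = defaultdict(lambda: defaultdict(set))
    for h in hits:
        coverage = float(h["aln_len"]) / h["query_len"]
        if h["pident"] >= min_ident and coverage >= min_cov:
            per_query[h["query"]][h["genome"]].add(h["subject"])

    # Step 2: keep queries with exactly one passing hit in every submitted genome.
    wanted = set(genomes)
    candidates = {}
    for query, by_genome in per_query.items():
        in_all = set(by_genome) == wanted
        single_copy = all(len(subjects) == 1 for subjects in by_genome.values())
        if in_all and single_copy:
            candidates[query] = {g: next(iter(s)) for g, s in by_genome.items()}

    # Step 3: drop groups that share a subject ORF with another group
    # (an assumption about how "hits found in multiple groups" are handled).
    usage = defaultdict(set)
    for query, members in candidates.items():
        for genome, subject in members.items():
            usage[(genome, subject)].add(query)
    ambiguous = {q for qs in usage.values() if len(qs) > 1 for q in qs}

    return {q: m for q, m in candidates.items() if q not in ambiguous}
```

In practice the hit records could be parsed from tabular BLAST output that includes the query length; that parsing step is omitted here.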

The nucleotide sequences for these core ORFs are aligned with MUSCLE and trimmed to remove end gaps. Up to ten conserved primer pairs producing amplicons of ~250 bp are designed for each alignment with Primer3. The primer pairs and amplicons are filtered to retain only those that do not overlap and that capture polymorphisms between input genomes. Users may either retain all passing amplicons or use one of two methods to select an optimized subset for typing. The concordance of the subtyping provided by the selected amplicons with the reference phylogeny can be assessed using a variety of metrics, including those that compare the resulting trees (e.g., the Kendall‐Colijn metric) and those that compare cluster membership (e.g., the adjusted Wallace coefficient).
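As an illustration of a cluster membership metric, the adjusted Wallace coefficient can be computed from two lists of cluster labels, one per typing method. The sketch below uses the commonly used definition, in which the Wallace coefficient W(A→B) is corrected by its expected value under independence (one minus Simpson's index of diversity of B). The abstract does not say how the pipeline computes this metric, so the function names and the input format are assumptions made for the example.

```python
from collections import Counter


def _pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def adjusted_wallace(a_labels, b_labels):
    """Adjusted Wallace coefficient A -> B for two partitions of the same isolates.

    a_labels[i] and b_labels[i] are the cluster labels assigned to isolate i
    by typing methods A and B (hypothetical input format).
    """
    if len(a_labels) != len(b_labels):
        raise ValueError("both partitions must label the same isolates")
    n = len(a_labels)

    # Pairs of isolates placed together by A, by B, and by both methods.
    same_a = sum(_pairs(c) for c in Counter(a_labels).values())
    same_b = sum(_pairs(c) for c in Counter(b_labels).values())
    same_both = sum(_pairs(c) for c in Counter(zip(a_labels, b_labels)).values())

    if same_a == 0:
        raise ValueError("Wallace coefficient is undefined: A has no co-clustered pairs")

    # Wallace: probability that a pair grouped by A is also grouped by B.
    wallace = float(same_both) / same_a
    # Expected Wallace under independence: 1 - Simpson's index of diversity of B.
    expected = float(same_b) / _pairs(n)
    if expected == 1.0:
        raise ValueError("adjusted Wallace is undefined: B places all isolates together")

    return (wallace - expected) / (1.0 - expected)
```

A value near 1 means that isolates grouped together by method A (for example, the selected amplicons) are almost always grouped together by method B (for example, clusters from the reference phylogeny); a value near 0 means the agreement is no better than chance.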

Scripts for the pipeline were written in Python 2.7 and R, and workflow management is provided by bpipe, with support for both standard multicore machines and cluster environments. We demonstrate the utility of this pipeline using a collection of 266 Salmonella bongori and Salmonella enterica genomes representing 68 serotypes.

