SFAF2016 Meeting Guide Final 3
11th Annual Sequencing, Finishing, and Analysis in the Future Meeting
AN EXTENDED CORE GENE MLST TARGET IDENTIFICATION AND SUBSET SELECTION PIPELINE FOR CULTURE-INDEPENDENT PATHOGEN SUBTYPING
Wednesday, 1st June 20:00 La Fonda NM Room (1st floor) Poster (PS‐1b.117)
Jo Williams‐Newkirk¹, Eija Trees², John Besser², Heather Carleton²
¹IHRC, ²Centers for Disease Control and Prevention
While isolate‐based whole genome sequencing is being rapidly integrated into the US public health
surveillance system, isolate availability for surveillance continues to decline as a result of the adoption
of culture‐independent diagnostic tests by clinical laboratories. As affordable methods are not
yet available for obtaining the same genome resolution directly from shotgun metagenomic sequencing
of clinical samples, particularly microbially complex samples such as stool, alternative methods
are needed to reliably capture genetic information relevant to pathogen subtyping. Targeted amplification
and sequencing of informative genomic regions (i.e. multilocus sequence typing, MLST)
is a well‐understood and robust typing method whose resolution is limited only by the number of
sites used. Unfortunately, identifying large numbers of informative regions with conserved primer
sites is labor‐intensive, particularly if hundreds or thousands of reference genomes are used for site
selection.
To facilitate the rapid development of extended MLST schemes for targeted pathogen groups, we<br />
developed a custom pipeline leveraging widely used open source programs to identify potential MLST<br />
targets with conserved primer sites and to find subsets of those targets that recapitulate a reference
phylogeny (user provided or generated by the pipeline from concatenated core genes). Our pipeline<br />
accepts whole genome annotation files from the targeted pathogen group in GenBank (.gbk) format.<br />
Core genes are identified by protein BLAST of all annotated open reading frames (ORFs) from a<br />
single genome against the ORFs from all GenBank files submitted to the pipeline. Hits are filtered<br />
to retain only single‐copy ORFs that occur in all submitted genomes and are at least 50% similar across
at least 50% of the query length. Hits found in multiple putative single‐copy ORF groups are also discarded.
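As a rough illustration of the filtering step described above, the following Python 3 sketch applies the stated thresholds to BLAST hits that have already been parsed into tuples. It is not the pipeline's actual code: the function name `core_orf_groups`, the tuple layout, and the reading of the two 50% thresholds as minimum percent identity and minimum aligned fraction of the query are all assumptions made for this example.

```python
from collections import defaultdict


def core_orf_groups(hits, genomes, min_identity=50.0, min_coverage=0.5):
    """Return {query_orf: {genome: subject_orf}} for putative core ORFs.

    `hits` is an iterable of hypothetical tuples:
    (query_orf, query_len, genome, subject_orf, pct_identity, aln_len).
    """
    genome_set = set(genomes)

    # Keep hits that meet both thresholds, grouped by query ORF and genome.
    passing = defaultdict(lambda: defaultdict(set))
    for query, qlen, genome, subject, identity, aln_len in hits:
        if identity >= min_identity and aln_len / qlen >= min_coverage:
            passing[query][genome].add(subject)

    # Retain queries with exactly one passing hit in every submitted genome.
    groups = {}
    for query, by_genome in passing.items():
        present_everywhere = set(by_genome) == genome_set
        single_copy = all(len(s) == 1 for s in by_genome.values())
        if present_everywhere and single_copy:
            groups[query] = {g: next(iter(s)) for g, s in by_genome.items()}

    # Assumed interpretation: a subject ORF claimed by more than one group
    # makes every group that claims it ambiguous, so those groups are dropped.
    owners = defaultdict(set)
    for query, members in groups.items():
        for genome, subject in members.items():
            owners[(genome, subject)].add(query)
    ambiguous = {q for qs in owners.values() if len(qs) > 1 for q in qs}

    return {q: m for q, m in groups.items() if q not in ambiguous}
```

In practice the tuples would be built from tabular BLAST output, with one query genome searched against the ORFs of every submitted genome.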
The nucleotide sequences for these core ORFs are aligned with MUSCLE and trimmed to remove end gaps.
Up to ten conserved primer pairs producing amplicons of ~250 bp are designed for each alignment with
Primer3. The primer pairs and amplicons are filtered to retain only those that do not overlap and that
capture polymorphisms between input genomes. Users may either retain all passing amplicons or
use one of two methods to select an optimized subset for typing. The concordance of the subtyping<br />
provided by the selected amplicons to the reference phylogeny can be assessed using a variety of metrics,
including those that compare the resulting trees (e.g. Kendall‐Colijn metric) and those that compare<br />
cluster membership (e.g. adjusted Wallace coefficient).<br />
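To make the cluster-membership comparison concrete, the Python 3 sketch below computes an adjusted Wallace coefficient between two partitions of the same genomes, for example clusters from the selected amplicons (A) and clusters from the reference phylogeny (B). It follows the usual definition, in which the Wallace coefficient is corrected for chance agreement using the expected co-clustering rate in B. How the pipeline itself computes the metric is not stated in the abstract, so the function name `adjusted_wallace` and the input layout are assumptions.

```python
from collections import Counter
from itertools import combinations


def adjusted_wallace(part_a, part_b):
    """Adjusted Wallace coefficient for partition A predicting partition B.

    `part_a` and `part_b` are equal-length sequences of cluster labels,
    one label per genome, in the same genome order.
    """
    n = len(part_a)
    if n != len(part_b) or n < 2:
        raise ValueError("partitions must have equal length of at least 2")

    # Count pairs clustered together in A, and those also together in B.
    same_a = 0
    same_both = 0
    for i, j in combinations(range(n), 2):
        if part_a[i] == part_a[j]:
            same_a += 1
            if part_b[i] == part_b[j]:
                same_both += 1
    if same_a == 0:
        raise ValueError("partition A has no co-clustered pairs")
    wallace = same_both / same_a

    # Expected Wallace under independence: the probability that two genomes
    # drawn without replacement share a cluster in B (1 minus Simpson's
    # index of diversity of B).
    sizes_b = Counter(part_b).values()
    expected = sum(s * (s - 1) for s in sizes_b) / (n * (n - 1))
    if expected == 1:
        raise ValueError("partition B has a single cluster")

    return (wallace - expected) / (1 - expected)
```

A value of 1 means that genomes sharing a cluster in A always share a cluster in B; values near 0 indicate agreement no better than chance. The pairwise loop is quadratic in the number of genomes, which is adequate for collections of a few hundred.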
Scripts for the pipeline were written in Python 2.7 and R, and workflow management is provided by Bpipe
with support for both standard multicore machines and cluster environments. We demonstrate the
utility of this pipeline using a collection of 266 Salmonella bongori and S. enterica genomes representing
68 serotypes.