
SFAF2016 Meeting Guide Final 3


11th Annual Sequencing, Finishing, and Analysis in the Future Meeting

BACTERIAL PATHOGEN NEXT-GENERATION SEQUENCING DATA TRIMMING, CORRECTION, AND SNPS DISCOVERY

Wednesday, 1st June 20:00 La Fonda Mezzanine (2nd Floor) Poster (PS-2b.03)

Darlene Wagner 1, Lee Katz 2, Eija Trees 2, Heather Carleton 2

1 IHRC, 2 Centers for Disease Control and Prevention

Background: Next-generation sequencing (NGS) allows rapid in-house sequencing of bacterial strains implicated in local or multi-state outbreaks. Single-nucleotide polymorphism (SNP) analyses incorporating NGS reads can aid outbreak cluster identification. This study evaluated the effects of read trimming and correction (healing) on SNP analysis quality.

Methods: NGS reads representing outbreak clusters of Salmonella enterica serovar Bareilly, Shiga toxin-producing Escherichia coli (STEC) serogroup O157, and Campylobacter jejuni were cleaned using nine healing methods: prinseq, fastx_trimmer, BayesHammer, BayesHammer with fastx_trimmer, Musket, Quake, Blue, CG-Pipeline with quality trimming, and CG-Pipeline with quality cutoff masking. Forward-read (R1) and reverse-read (R2) errors were estimated through base-call qualities (Phred) and ambiguous nucleotide (N) counts. SNP discovery through Lyve-SET (https://github.com/lskatz/lyve-SET) was assessed by counting informative aligned positions. Possible false positive/negative SNPs were inferred by counting SNPs shared across the results of the different healing methods.
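As an illustration of the read-error estimates described in the Methods, the following minimal Python sketch computes per-read mean Phred scores and counts reads containing ambiguous nucleotides. It is not the authors' pipeline: the function names, the toy records, and the Phred+33 quality encoding are assumptions made for this example.

```python
from statistics import mean, median

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) quality encoding


def parse_fastq(text):
    """Yield (sequence, quality_string) pairs from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def read_stats(text):
    """Return (mean, median) of per-read mean Phred scores and the number of reads with an N."""
    per_read_quality = []
    reads_with_n = 0
    for seq, qual in parse_fastq(text):
        per_read_quality.append(mean(ord(c) - PHRED_OFFSET for c in qual))
        if "N" in seq.upper():
            reads_with_n += 1
    return mean(per_read_quality), median(per_read_quality), reads_with_n


# Hypothetical toy records for illustration only, not data from the study.
toy_fastq = """@read1
ACGT
+
IIII
@read2
ANGT
+
5555
@read3
ACGA
+
????
"""

avg, med, n_count = read_stats(toy_fastq)
print(avg, med, n_count)
```

Running the same function separately on the R1 and R2 files of a paired-end run would give the kind of per-direction comparison the abstract reports, though the study may have summarized qualities differently (for example per base rather than per read).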

Results: R1 and R2 reads from Salmonella ser. Bareilly averaged 300,000 reads containing ambiguous nucleotides, while R2 reads exhibited average Phred scores as low as 25.8 (median 29.6). Unhealed Bareilly reads yielded 56 SNP positions through Lyve-SET. BayesHammer, an edit-distance-based method, and the k-mer-based methods Quake and Blue raised Phred scores in the Bareilly set above 30.0. Blue, along with Musket, another k-mer method, increased Bareilly SNP positions to 122 and 123, respectively, with 1 potential false positive site each. BayesHammer, fastx_trimmer, and BayesHammer with fastx_trimmer increased Bareilly SNP positions to 94, 99, and 110, respectively, without false positives. The STEC O157 outbreak cluster R2 reads exhibited a median quality of 29.3, with up to 1.07 × 10^6 reads containing ambiguous nucleotides. Unhealed O157 reads yielded 62 SNP positions, which increased to 66 positions with no false positives after healing through Quake. R1 and R2 reads of the Campylobacter cluster had quality scores well above 30.0, but R2 averaged 11,000 reads containing ambiguous nucleotides. Unhealed Campylobacter reads yielded 137 SNP positions, while fastx_trimmer, Blue, and BayesHammer with fastx_trimmer increased SNP positions to 150, 152, and 153, respectively, with no inferred false positives. Across all three organism/outbreak sets, prinseq failed to increase SNP counts, while the CG-Pipeline-based methods reduced SNP counts in the Bareilly and Campylobacter sets. Musket, Quake, and Blue consistently increased SNP counts, but each produced possible false positive/negative SNP sites in at least one organism set.
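The inference of possible false positive/negative sites from SNPs shared across healing methods could be sketched as below. The abstract does not state the exact sharing thresholds, so the rules here are assumptions: a site called by only one method is flagged as a possible false positive, and a site called by every method but one is flagged as a possible false negative for the method that missed it. The method names and positions are hypothetical.

```python
from collections import Counter


def classify_snp_sites(calls):
    """Flag SNP sites by how many methods share them.

    calls: dict mapping a method name to the set of SNP positions it produced.
    Returns (unique, missed):
      unique[m] = positions called only by method m (possible false positives)
      missed[m] = positions called by every method except m (possible false negatives)
    """
    support = Counter(pos for sites in calls.values() for pos in sites)
    n_methods = len(calls)
    unique = {
        m: sorted(p for p in sites if support[p] == 1)
        for m, sites in calls.items()
    }
    missed = {
        m: sorted(p for p in support if support[p] == n_methods - 1 and p not in sites)
        for m, sites in calls.items()
    }
    return unique, missed


# Hypothetical SNP positions for illustration only, not results from the study.
calls = {
    "unhealed": {101, 250},
    "method_A": {101, 250, 730},
    "method_B": {101, 250, 730, 912},
    "method_C": {101, 730},
}

unique, missed = classify_snp_sites(calls)
print(unique)
print(missed)
```

In this toy input, position 912 is reported only by method_B, and position 250 is absent only from method_C, so those are the sites that would be flagged for closer inspection.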

Conclusions: An optimal read trimming or cleaning method should increase the number of SNP positions without adding false positives, thus enhancing outbreak phylogeny. This study has shown that Musket, Blue, and Quake increase the number of SNP positions for NGS reads with many ambiguous base calls and Phred scores < 30.0. Yet the k-mer-based methods occasionally introduced SNPs not shared across methods within the organism set, indicating possible false positive/negative positions. BayesHammer, fastx_trimmer, and the two methods combined all increased counts of SNPs that were reproducible across more than one healing method. In future studies, outbreak sets with validated SNP sites will be used for more accurate assessment of false discovery or false exclusion of SNPs.
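The selection criterion stated in the Conclusions (more SNP positions, no added false positives) can be applied to the Bareilly counts reported in the Results. The short Python sketch below does so. The counts are taken from the abstract, while the ranking function and its name are illustrative assumptions rather than part of the study.

```python
# Bareilly counts as reported in the Results:
# method -> (SNP positions, potential false positive sites)
BASELINE_SNPS = 56  # unhealed Bareilly reads
reported_bareilly = {
    "Blue": (122, 1),
    "Musket": (123, 1),
    "BayesHammer": (94, 0),
    "fastx_trimmer": (99, 0),
    "BayesHammer + fastx_trimmer": (110, 0),
}


def rank_methods(results, baseline):
    """Keep methods that add SNP positions with no flagged false positives,
    ordered by gain over the unhealed baseline (largest first)."""
    gains = {
        name: snps - baseline
        for name, (snps, false_pos) in results.items()
        if false_pos == 0 and snps > baseline
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


ranking = rank_methods(reported_bareilly, BASELINE_SNPS)
for name, gain in ranking:
    print(f"{name}: +{gain} SNP positions")
```

Under this strict reading of the criterion, Blue and Musket are excluded for the Bareilly set despite their larger SNP counts, because each carried one potential false positive site.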
