11th Annual Sequencing, Finishing, and Analysis in the Future Meeting<br />
SCALABLE AND EXTENSIBLE NEXT-GENERATION<br />
SEQUENCE ANALYSIS PIPELINE MANAGEMENT FOR<br />
OVER 50,000 WHOLE-GENOME SAMPLES<br />
Friday, 3rd June 17:00 La Fonda Ballroom Talk (OS‐10.04)<br />
Jesse Farek, Adam English, Daniel Hughes, William Salerno,<br />
Kimberly Walker, Donna Muzny, Richard Gibbs<br />
Human Genome Sequencing Center, Baylor College of Medicine<br />
The Baylor College of Medicine Human Genome Sequencing Center (HGSC) has recently added<br />
Illumina HiSeq X Ten sequencers to its sequencing fleet, which currently processes over 2,000 whole‐genome<br />
samples per month. These samples originate from multiple projects and collaborators, including<br />
the Alzheimer’s Disease Sequencing Project, the Trans‐Omics for Precision Medicine Program,<br />
Baylor Miraca Genetics Laboratory, the Centers for Common Disease Genomics, the CHARGE<br />
Consortium, and the Center for Mendelian Genomics. In order for sequence analysis at the HGSC<br />
to scale to this increased workload, numerous improvements have been made to the efficiency and<br />
reliability of the center’s sequence analysis infrastructure. HgV is the workflow management system<br />
for primary and secondary Illumina sequence analysis at the HGSC and features tiered XML<br />
pipeline protocols, job tracking, LIMS communication, verbose logging and stable reproducibility.<br />
HgV’s protocol definition and LIMS communication infrastructure has been reworked for greater<br />
configurability so that pipeline protocol and LIMS parameters can be easily modified to accommodate<br />
different project requirements. Specifically, the HGSC has configured HgV to use both local and<br />
cloud‐based compute resources and to enforce CAP‐ and CLIA‐compliant data handling for clinical<br />
pipelines. Secondary analysis programs have been rewritten for increased computational efficiency.<br />
The Atlas2 SNP and Indel variant callers (originally written in Ruby) have been rewritten and combined<br />
into a single C++ program that runs on average more than 50 times faster than Atlas2 and<br />
with improved variant calling quality and consistency. Two new custom reporting programs, SeqAnalyzer,<br />
which calculates FASTQ sequence metrics, and AlignStats, which calculates BAM alignment<br />
and coverage metrics, have been written to use significantly fewer computational resources than the<br />
existing programs they replace. Other areas of improvement to analysis workflows have also been<br />
investigated, including measuring the effects of local Indel realignment and base quality recalibration<br />
on variant call quality and researching efficient N+1 joint calling solutions for creating project<br />
level VCFs. These improvements have resulted in fast, extensible, and easily manageable analysis<br />
pipelines for human resequencing and other applications on the HiSeq X platform that have allowed<br />
the HGSC to concurrently support the heterogeneous analysis requirements of multiple large‐scale<br />
sequencing projects. To date, HgV has managed the analysis of over 5,000 whole‐genome samples<br />
and is expected to handle over 50,000 more samples in the near future.<br />
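The abstract describes SeqAnalyzer only as a program that calculates FASTQ sequence metrics; it gives no implementation details. As a rough illustration of what such metrics can look like, the sketch below streams a FASTQ file once and reports read count, base count, GC fraction, mean base quality, and the fraction of bases at Q30 or above. The function names, the choice of metrics, and the Phred+33 quality offset are assumptions made for this example and are not taken from SeqAnalyzer itself.

```python
from io import StringIO
from typing import Dict, Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def fastq_metrics(handle: TextIO, phred_offset: int = 33) -> Dict[str, float]:
    """Compute simple per-file metrics in a single streaming pass.

    phred_offset=33 is an assumption (Phred+33 quality encoding).
    """
    reads = bases = gc = q30 = qual_sum = 0
    for _name, seq, qual in read_fastq(handle):
        reads += 1
        bases += len(seq)
        gc += sum(1 for base in seq if base in "GCgc")
        for ch in qual:
            q = ord(ch) - phred_offset
            qual_sum += q
            if q >= 30:
                q30 += 1
    if bases == 0:
        return {"reads": reads, "bases": 0, "gc_fraction": 0.0,
                "mean_quality": 0.0, "q30_fraction": 0.0}
    return {
        "reads": reads,
        "bases": bases,
        "gc_fraction": gc / bases,
        "mean_quality": qual_sum / bases,
        "q30_fraction": q30 / bases,
    }


if __name__ == "__main__":
    # Two hypothetical reads: one with all-Q40 bases, one with all-Q0 bases.
    example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n")
    print(fastq_metrics(example))
```

Because the file is read record by record and only running totals are kept, memory use stays constant regardless of file size, which is one plausible way a reporting tool could keep its resource footprint low.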