01.06.2016 Views


SFAF2016 Meeting Guide Final 3


11th Annual Sequencing, Finishing, and Analysis in the Future Meeting

SCALABLE AND EXTENSIBLE NEXT-GENERATION SEQUENCE ANALYSIS PIPELINE MANAGEMENT FOR OVER 50,000 WHOLE-GENOME SAMPLES

Friday, 3rd June 17:00 La Fonda Ballroom Talk (OS‐10.04)

Jesse Farek, Adam English, Daniel Hughes, William Salerno, Kimberly Walker, Donna Muzny, Richard Gibbs

Human Genome Sequencing Center, Baylor College of Medicine

The Baylor College of Medicine Human Genome Sequencing Center (HGSC) has recently added Illumina HiSeq X Ten sequencers to its sequencing fleet, which currently processes over 2,000 whole‐genome samples per month. These samples originate from multiple projects and collaborators, including the Alzheimer’s Disease Sequencing Project, the Trans‐Omics for Precision Medicine Program, Baylor Miraca Genetics Laboratory, the Centers for Common Disease Genomics, the CHARGE Consortium, and the Center for Mendelian Genomics. For sequence analysis at the HGSC to scale to this increased workload, numerous improvements have been made to the efficiency and reliability of the center’s sequence analysis infrastructure. HgV is the workflow management system for primary and secondary Illumina sequence analysis at the HGSC and features tiered XML pipeline protocols, job tracking, LIMS communication, verbose logging, and stable reproducibility. HgV’s protocol definition and LIMS communication infrastructure has been reworked for greater configurability so that pipeline protocol and LIMS parameters can be easily modified to accommodate different project requirements. Specifically, the HGSC has configured HgV to use both local and cloud‐based compute resources and to enforce CAP‐ and CLIA‐compliant data handling for clinical pipelines. Secondary analysis programs have been rewritten for increased computational efficiency. The Atlas2 SNP and Indel variant callers (originally written in Ruby) have been rewritten and combined into a single C++ program that runs, on average, more than 50 times faster than Atlas2, with improved variant calling quality and consistency. Two new custom reporting programs, SeqAnalyzer, which calculates FASTQ sequence metrics, and AlignStats, which calculates BAM alignment and coverage metrics, have been written to use significantly fewer computational resources than the existing programs they replace. Other areas of improvement to analysis workflows have also been investigated, including measuring the effects of local Indel realignment and base quality recalibration on variant call quality and researching efficient N+1 joint calling solutions for creating project‐level VCFs. These improvements have resulted in fast, extensible, and easily manageable analysis pipelines for human resequencing and other applications on the HiSeq X platform that have allowed the HGSC to concurrently support the heterogeneous analysis requirements of multiple large‐scale sequencing projects. To date, HgV has managed the analysis of over 5,000 whole‐genome samples and is expected to handle over 50,000 more samples in the near future.
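As an illustration of the kind of FASTQ sequence metrics mentioned above, the following is a minimal Python sketch of a single‐pass metrics calculation. It is not SeqAnalyzer's implementation, which the abstract does not describe. The function name `fastq_metrics`, the chosen metrics (read count, total bases, mean read length, GC fraction, fraction of bases at or above Q30), and the assumptions of four‐line records and Phred+33 quality encoding are all hypothetical choices made for this example.

```python
import io


def fastq_metrics(handle, phred_offset=33, q_threshold=30):
    """Compute simple per-file metrics from a four-line-per-record FASTQ stream.

    Assumptions (not taken from the abstract): records are exactly four lines
    long and qualities are Phred+33 encoded.
    """
    reads = bases = gc = q_pass = 0
    while True:
        header = handle.readline()
        if not header:          # end of stream
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, content ignored
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near read {reads + 1}")
        reads += 1
        bases += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        q_pass += sum(1 for ch in qual if ord(ch) - phred_offset >= q_threshold)
    return {
        "reads": reads,
        "total_bases": bases,
        "mean_read_length": bases / reads if reads else 0.0,
        "gc_fraction": gc / bases if bases else 0.0,
        "q30_fraction": q_pass / bases if bases else 0.0,
    }


if __name__ == "__main__":
    # Tiny in-memory example; a real run would pass an open FASTQ file handle.
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAT\n+\nII##II\n")
    print(fastq_metrics(example))
```

Reading the stream once and keeping only running counters keeps memory use constant regardless of file size, which is one plausible way a metrics tool could limit its resource footprint.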
