
11th Annual Sequencing, Finishing, and Analysis in the Future Meeting

A REFERENCE-AGNOSTIC AND RAPIDLY QUERYABLE NGS READ DATA FORMAT ALLOWS FOR FLEXIBLE ANALYSIS AT SCALE

Friday, 3rd June 14:20 La Fonda Ballroom Talk (OS-9.02)

Niranjan Shekar¹, William Salerno², Adam English², Adina Mangubat¹, Jeremy Bruestle¹, Eric Boerwinkle³, Richard Gibbs²

¹Spiral Genetics Inc, ²Human Genome Sequencing Center, Baylor College of Medicine, ³University of Texas Health Science Center at Houston

In identifying the complement of genetic variants associated with complex disease, larger sample sizes increase power. Studies such as the Alzheimer’s Disease Sequencing Project and the CHARGE Consortium, in which samples are collected from a range of centers, produce heterogeneous data, requiring informatics that can scale additively to thousands of samples and analytics that go beyond identifying small variants in NGS data. At scale, the challenge of evaluating SNPs, indels and SVs becomes the “N+1” problem of incrementally adding samples without having to perpetually reevaluate petabytes of population read data stored in BAM files.

The BioGraph Analysis Format (BAF) is a method of indexing NGS data that extends the Burrows-Wheeler Transform to allow for multiple paths, effectively creating a read overlap graph of the data. A BAF of HiSeq X 30x WGS data is 8.3 GB, 95% smaller than the corresponding BAM. Generated from the BAM in 14 hours, the BAF can be queried up to 200,000 times per second. Multiple BAFs can be combined, which at scale results in a file size of approximately 3 GB per individual. Because the BAF can be batched across individuals, query time grows less than linearly with the number of individuals.
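To illustrate the kind of index the BAF builds on, here is a minimal sketch of the plain Burrows-Wheeler Transform with FM-index backward search over a single string. It is not the BAF itself: the multi-path extension described above is not shown, and the function names `bwt` and `count_occurrences` are hypothetical, chosen only for this example.

```python
def bwt(text: str) -> tuple[str, list[int]]:
    """Return the BWT of text (with a '$' terminator) and its suffix array.

    Naive construction by sorting all suffixes; fine for toy inputs only.
    """
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for a suffix is the character preceding it
    # (text[-1] is '$' for the suffix starting at position 0).
    return "".join(text[i - 1] for i in sa), sa


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text via backward search."""
    # c_table[c]: number of characters in the text strictly smaller than c.
    counts: dict[str, int] = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch: str, i: int) -> int:
        # Occurrences of ch in bwt_str[:i]; a linear scan for clarity.
        # A real index would use precomputed rank tables here.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


index, _ = bwt("banana")
print(index)                            # annb$aa
print(count_occurrences(index, "ana"))  # 2
```

Each query touches the index once per pattern character rather than scanning the reads, which is the property that makes very high query rates plausible for BWT-style indexes.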

For example, if 30,000 putative SV sites are to be queried, SV-typing these sites across 10,000 HiSeq X WGS samples in BioGraph Analysis Format would require less than 30 TB of storage (for all the read data), 16 CPU hours, and 10 minutes of wall-clock time (using 100 machines).
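The figures in this example can be checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract and assumes, for illustration, decimal units (1 TB = 1000 GB), one CPU per machine, and perfectly even division of work across machines.

```python
# Figures quoted in the abstract.
SAMPLES = 10_000
SITES = 30_000
GB_PER_SAMPLE = 3   # approximate combined BAF size per individual
CPU_HOURS = 16
MACHINES = 100      # assumption: one CPU each, ideal parallel speedup

storage_tb = SAMPLES * GB_PER_SAMPLE / 1000
wall_minutes = CPU_HOURS * 60 / MACHINES
cpu_ms_per_site_sample = CPU_HOURS * 3600 * 1000 / (SITES * SAMPLES)

print(f"storage: about {storage_tb:.0f} TB")                        # about 30 TB
print(f"wall-clock: about {wall_minutes:.1f} minutes")              # about 9.6 minutes
print(f"CPU per site per sample: {cpu_ms_per_site_sample:.3f} ms")  # 0.192 ms
```

Under these assumptions the 16 CPU hours correspond to roughly 0.19 ms of compute per site per sample, and the 10-minute figure follows from spreading that work over 100 machines.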

Here, we perform read overlap assembly to genotype 4,276 SVs larger than 80 bp detected by Pindel in at least one individual of the Ashkenazi Jewish Trio. At 1,195 of these locations, there was at least one SV call in at least one individual, and all of these calls except 25 (2.1%) were consistent with Mendelian inheritance. Further, read overlap assembly was used to genotype variants at 3,935 locations where PBHoney called an SV from long-read sequencing data on the same Trio. Of those, 1,327 locations had at least one genotype, with all but 55 (4.1%) being consistent with Mendelian inheritance.
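The abstract does not describe how Mendelian consistency was evaluated. As a minimal sketch of one common check, the hypothetical function `mendelian_consistent` below assumes autosomal, diploid genotypes encoded as pairs of allele indices (0 for reference, 1 for the SV allele) and tests whether the child could have inherited one allele from each parent. The final lines reproduce the percentages quoted above from the reported counts.

```python
from itertools import product


def mendelian_consistent(child, mother, father):
    """True if child's genotype can be formed from one maternal and one paternal allele.

    Genotypes are 2-tuples of allele indices, e.g. (0, 1) for heterozygous.
    Assumes an autosomal diploid site with all three genotypes called.
    """
    return any(
        sorted((m, f)) == sorted(child)
        for m, f in product(mother, father)
    )


# Heterozygous child of two opposite homozygotes: consistent.
print(mendelian_consistent((0, 1), (0, 0), (1, 1)))  # True
# Homozygous-alt child with a homozygous-ref mother: inconsistent.
print(mendelian_consistent((1, 1), (0, 0), (0, 1)))  # False

# Inconsistency rates from the counts reported in the text.
print(round(100 * 25 / 1195, 1))  # 2.1
print(round(100 * 55 / 1327, 1))  # 4.1
```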

Additionally, the data are reference-agnostic, so variants can be called against any reference or against the read graph of any other set of individuals, dramatically reducing the time for data harmonization. Further, information is divided such that the “read overlap graph” created from all the individuals is separate from the information indicating the path through the graph for each individual. This allows a search for a particular variation of interest directly from the read data, remotely and rapidly, with no opportunity to reveal the exact individual(s) from whom the variant originates.

Because the data are essentially a read overlap graph, it is possible to accurately characterize SVs by traversing the graph from a particular location or by searching for a particular sequence associated with the SV. Thus, fast querying of small files with reasonable compute requirements provides an N+1 solution for SVs.
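To make the idea of traversing a read overlap graph concrete, the following toy sketch builds a naive overlap graph from a handful of short reads and enumerates the sequences spelled by walking forward from a chosen read. It assumes exact suffix-prefix overlaps and a quadratic all-pairs comparison; it is not how the BAF represents or traverses the graph, and the names `build_overlap_graph` and `paths_from` are hypothetical.

```python
def build_overlap_graph(reads, min_overlap):
    """Edge a -> (b, k) when the last k >= min_overlap bases of a equal the first k of b."""
    unique = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    graph = {r: [] for r in unique}
    for a in unique:
        for b in unique:
            if a == b:
                continue
            # Try the longest proper overlap first and keep only that one.
            for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a.endswith(b[:k]):
                    graph[a].append((b, k))
                    break
    return graph


def paths_from(graph, start, max_depth=50):
    """Return the sequences spelled by every forward walk from start.

    max_depth bounds the walk so that cycles (repeats) cannot loop forever.
    """
    results = []
    stack = [(start, start, 0)]
    while stack:
        node, seq, depth = stack.pop()
        nexts = graph[node] if depth < max_depth else []
        if not nexts:
            results.append(seq)
        for nxt, k in nexts:
            stack.append((nxt, seq + nxt[k:], depth + 1))
    return sorted(results)


# Toy reads from two haplotypes that share a prefix and then diverge.
reads = ["ATGCCA", "GCCAGT", "CAGTTC", "GTTCAG",   # one allele
         "GCCATA", "CATAGG", "TAGGCT"]             # alternative allele
graph = build_overlap_graph(reads, min_overlap=4)
for seq in paths_from(graph, "ATGCCA"):
    print(seq)
# ATGCCAGTTCAG
# ATGCCATAGGCT
```

Starting from one read at a location of interest, the walk recovers both alleles present in the reads without consulting a reference, which is the behaviour the paragraph above relies on for characterizing SVs.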
