SFAF2016 Meeting Guide Final 3
11th Annual Sequencing, Finishing, and Analysis in the Future Meeting
A REFERENCE-AGNOSTIC AND RAPIDLY
QUERYABLE NGS READ DATA FORMAT ALLOWS FOR
FLEXIBLE ANALYSIS AT SCALE
Friday, 3rd June 14:20 La Fonda Ballroom Talk (OS-9.02)
Niranjan Shekar¹, William Salerno², Adam English², Adina Mangubat¹,
Jeremy Bruestle¹, Eric Boerwinkle³, Richard Gibbs²
¹ Spiral Genetics Inc, ² Human Genome Sequencing Center Baylor College of Medicine,
³ University of Texas Health Science Center at Houston
In identifying the complement of genetic variants associated with complex disease, larger
sample sizes increase power. Studies such as the Alzheimer’s Disease Sequencing Project and the
CHARGE Consortium, in which samples are collected from a range of centers, produce heterogeneous
data. This requires informatics that can scale additively to thousands of samples and analytics
that go beyond identifying small variants in NGS data. At scale, the challenge of evaluating SNPs,
indels, and SVs becomes the “N+1” problem: incrementally adding samples without having to
perpetually reevaluate petabytes of population read data stored in BAM files.
The BioGraph Analysis Format (BAF) is a method of indexing NGS data that extends the
Burrows-Wheeler Transform to allow for multiple paths, effectively creating a read overlap graph
of the data. A BAF of HiSeq X 30x WGS data is 8.3 GB, 95% smaller than the corresponding BAM.
Generated from the BAM in 14 hours, the BAF can be queried up to 200,000 times per second.
Multiple BAFs can be combined, which at scale results in a file size of approximately 3 GB per
individual. Because the BAF can be batched across individuals, query time grows less than linearly
with the number of individuals.
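
To make the indexing idea concrete, the following is a minimal sketch of a plain Burrows-Wheeler Transform with backward search over a few toy reads. It is not the BAF implementation: the abstract does not describe how the transform is extended to multiple paths, so this shows only the standard single-text technique. The reads, the `$` terminator, and the function names `build_fm_index` and `count_occurrences` are illustrative assumptions.

```python
from collections import Counter


def build_fm_index(reads):
    """Build a toy FM-index over reads, each terminated by '$' (assumed separator)."""
    text = "".join(read + "$" for read in reads)
    # Naive suffix sort: fine for a toy example, far too slow for real read sets.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character that precedes each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for ch, n in sorted(Counter(bwt).items()):
        first[ch] = total
        total += n
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] for ch in first}
    for ch in bwt:
        for c, column in occ.items():
            column.append(column[-1] + (c == ch))
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of pattern (A/C/G/T only) using backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix so far
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


reads = ["ACGTAC", "GTACGA", "TTACGT"]
index = build_fm_index(reads)
print(count_occurrences(index, "ACG"))  # "ACG" occurs once in each of the three reads
print(count_occurrences(index, "GGG"))  # absent from every read
```

Each query touches the index once per pattern character, independent of how many reads were indexed, which is the property that makes this family of indexes attractive for high query rates.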
For example, if 30,000 putative SV sites are to be queried, SV-typing these sites across 10,000
HiSeq X WGS samples in BioGraph Analysis Format would require less than 30 TB of storage (for all
the read data), 16 CPU hours, and 10 minutes of wall-clock time (using 100 machines).
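
The figures in this example can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB), one CPU per machine, and perfect parallelism; none of these assumptions is stated in the abstract, and the function name `cohort_estimate` is hypothetical.

```python
def cohort_estimate(samples, gb_per_sample, sites, cpu_hours, machines):
    """Back-of-the-envelope totals for a batched SV-typing run."""
    storage_tb = samples * gb_per_sample / 1000   # assumes 1 TB = 1000 GB
    wall_minutes = cpu_hours * 60 / machines      # assumes perfect parallelism
    site_sample_pairs = sites * samples
    pairs_per_cpu_second = site_sample_pairs / (cpu_hours * 3600)
    return storage_tb, wall_minutes, site_sample_pairs, pairs_per_cpu_second


# Inputs taken from the abstract: 10,000 samples at about 3 GB each,
# 30,000 sites, 16 CPU hours, 100 machines.
storage_tb, wall_minutes, pairs, rate = cohort_estimate(10_000, 3, 30_000, 16, 100)
print(f"{storage_tb:.0f} TB, {wall_minutes:.1f} min, {pairs:,} site-sample pairs")
print(f"{rate:,.0f} site-sample pairs per CPU-second")
```

Under these assumptions the storage comes to 30 TB and the wall-clock time to just under 10 minutes, in line with the stated figures.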
Here, we perform read overlap assembly to genotype 4,276 SVs larger than 80 bp detected by Pindel
in at least one individual of the Ashkenazi Jewish Trio. At 1,195 of these locations, at least one
SV call was made in at least one individual, and all of these calls except 25 (2.1%) were
consistent with Mendelian inheritance. Further, read overlap assembly to genotype variants was
performed at 3,935 locations where PBHoney called an SV from long-read sequencing data on the same
trio. Of those, 1,327 locations had at least one genotype, with all but 55 (4.1%) consistent with
Mendelian inheritance.
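
As an illustration of what a Mendelian consistency check involves, here is a minimal sketch for a biallelic, diploid, autosomal site. It is an assumption about how such a check is typically written, not the authors' pipeline: it ignores missing calls, sex chromosomes, and de novo events, and the function name and genotype encoding (0 = reference allele, 1 = SV allele) are hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child's two alleles can be drawn one from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


# Genotypes are (allele, allele) tuples: 0 = reference, 1 = SV allele.
print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))  # heterozygous child of two homozygotes
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))  # father has no SV allele to transmit

# Inconsistency rates recomputed from the counts reported above.
pindel_rate = 100 * 25 / 1195
pbhoney_rate = 100 * 55 / 1327
print(f"Pindel sites: {pindel_rate:.1f}%  PBHoney sites: {pbhoney_rate:.1f}%")
```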
Additionally, the data are reference-agnostic, so variants can be called against any reference or
against the read graph of any other set of individuals, dramatically reducing the time for data
harmonization. Further, information is divided such that the “read overlap graph” created from all
the individuals is separate from the information indicating the path through the graph for each
individual. This allows a particular variation of interest to be searched for directly in the read
data, remotely and rapidly, without revealing the exact individual(s) from whom the variant
originates.
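
The separation of shared graph from per-individual paths can be pictured with a toy data structure. The sketch below is purely illustrative and makes strong simplifying assumptions: the shared layer is a flat list of sequence segments rather than a real overlap graph, and the class name `CohortGraph` and its methods are hypothetical, not part of any BioGraph API.

```python
from dataclasses import dataclass, field


@dataclass
class CohortGraph:
    # Shared layer: every distinct sequence segment observed in any individual.
    segments: list = field(default_factory=list)
    # Private layer: for each individual, the indices of the segments on their path.
    paths: dict = field(default_factory=dict)

    def add_individual(self, sample_id, sample_segments):
        indices = set()
        for seg in sample_segments:
            if seg not in self.segments:
                self.segments.append(seg)
            indices.add(self.segments.index(seg))
        self.paths[sample_id] = indices

    def contains(self, query):
        """Shared-layer query: is the sequence present in anyone? Uses no sample IDs."""
        return any(query in seg for seg in self.segments)

    def carrier_count(self, query):
        """Private-layer query: needs access to the per-individual paths."""
        hits = {i for i, seg in enumerate(self.segments) if query in seg}
        return sum(1 for indices in self.paths.values() if indices & hits)


graph = CohortGraph()
graph.add_individual("sample_1", ["ACGT", "TTGA"])
graph.add_individual("sample_2", ["ACGT", "GGCC"])
print(graph.contains("GGC"))       # answered from the shared layer alone
print(graph.carrier_count("GGC"))  # answered only with the private layer
```

In this arrangement a remote user could be given `contains` on the shared layer without ever receiving the `paths` mapping.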
Because the data are essentially a read overlap graph, it is possible to accurately characterize
SVs by traversing the graph from a particular location or by searching for a particular sequence
associated with the SV. Thus, fast querying of small files with reasonable compute requirements
provides an N+1 solution for SVs.
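
Traversing a read overlap graph from a chosen starting point can be sketched as follows. This is a naive all-pairs construction with a greedy walk, shown only to illustrate the idea. It is not how a BWT-based index finds overlaps, and the reads, the minimum overlap of 3, and all function names are assumptions.

```python
def overlap(a, b, min_len):
    """Length of the longest proper suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)) - 1, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Map each read to the reads it overlaps, with overlap lengths as edge weights."""
    graph = {read: {} for read in reads}  # duplicate reads collapse into one node
    for a in graph:
        for b in graph:
            if a != b:
                k = overlap(a, b, min_len)
                if k:
                    graph[a][b] = k
    return graph


def extend_from(graph, seed):
    """Greedily follow the longest unused overlap from seed, spelling out a contig."""
    contig, current, seen = seed, seed, {seed}
    while True:
        candidates = {b: k for b, k in graph[current].items() if b not in seen}
        if not candidates:
            return contig
        nxt = max(candidates, key=candidates.get)
        contig += nxt[candidates[nxt]:]  # append only the non-overlapping tail
        seen.add(nxt)
        current = nxt


reads = ["ATGCGT", "CGTACC", "ACCTGA"]
graph = build_overlap_graph(reads, min_len=3)
print(extend_from(graph, "ATGCGT"))
```

Starting the walk at a read anchored near a putative breakpoint and comparing the assembled sequence with the reference is one way such a traversal could expose an SV allele.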