SFAF2016 Meeting Guide Final 3
11th Annual Sequencing, Finishing, and Analysis in the Future Meeting<br />
ESTIMATING THE EFFECTS OF REPEATS ON<br />
ASSEMBLY CONTIGUITY<br />
Thursday, 2nd June 10:00 La Fonda Ballroom Talk (OS‐4.01)<br />
Shoudan Liang, Jason Chin<br />
Pacific Biosciences of California<br />
For a perfect assembler at high coverage, the contiguity of an assembly at a finite read length<br />
is limited by repetitive sequences. We study the limit imposed by repeat structures in plants, and<br />
contrast it with human, as the read length is increased. We start with contigs assembled from long<br />
reads and perform an all‐against‐all alignment. Non‐unique regions of the contigs define repeats.<br />
We require each alignment to be longer than a minimum length, S. Repeats shorter than S will not<br />
align. Therefore, as the minimum overlap S is increased, we observe a decrease in the number of<br />
repeat regions. For example, for the coffee genome, when the minimum allowed overlap increases from<br />
500 to 5,000 bp, the number of distinct repetitive regions is reduced by more than a factor of 10. This<br />
is partly because long repeats are less abundant and partly because short repeats<br />
occur in clusters that appear as unique sequence in the alignment. We developed a method<br />
to separate these two effects. We show the tendency of repeats to cluster in several plant genomes.<br />
Clustered repeats are especially difficult to assemble from short reads because even when all short<br />
reads are identified to be from the same 100 kb region, they are still repetitive in the repeat‐cluster.<br />
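The dependence of the repeat count on the minimum alignment length S can be illustrated with a minimal sketch. It assumes the all‐against‐all alignment has already been reduced to a list of (contig, start, end) intervals; that input format, the function name count_repeat_regions, and the toy coordinates are hypothetical and are not taken from the authors' tool.

```python
from collections import defaultdict


def count_repeat_regions(alignments, min_len):
    """Count distinct repeat regions after dropping alignments shorter than min_len.

    alignments: iterable of (contig_id, start, end) half-open intervals marking
    where a contig aligns to some other location (hypothetical input format).
    """
    by_contig = defaultdict(list)
    for contig, start, end in alignments:
        # Repeats shorter than the minimum length S never produce an alignment.
        if end - start >= min_len:
            by_contig[contig].append((start, end))

    regions = 0
    for intervals in by_contig.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                # A gap before this interval starts a new repeat region.
                regions += 1
                current_end = end
            else:
                # Overlapping or touching intervals belong to the same region.
                current_end = max(current_end, end)
    return regions


# Toy intervals, invented for illustration only.
toy = [
    ("ctg1", 0, 600),
    ("ctg1", 500, 1200),
    ("ctg1", 5000, 11000),
    ("ctg2", 100, 900),
]
for s in (500, 5000):
    print(s, count_repeat_regions(toy, s))
```

In this toy input, raising S from 500 to 5,000 removes the short alignments and leaves only the single long repeat. The sketch does not separate the two effects described above (fewer long repeats versus clustered short repeats).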
A related method estimates repeats by counting the abundance of pairs of k‐mers separated by a<br />
fixed distance. The distance between the k‐mers is a proxy for the repeat length. This method has the<br />
advantage that it can potentially be applied directly to the long‐read data before assembly. We compare<br />
the direct estimate from the reads with the estimate from the contigs for several plant genomes.<br />
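A rough sketch of the paired k‐mer idea follows. It counts exact k‐mer pairs at a fixed separation in error‐free sequence; the function names and the toy sequence are hypothetical, and applying this to real long reads would additionally require handling sequencing errors, both strands, and a count threshold scaled to coverage, none of which is attempted here.

```python
from collections import Counter


def paired_kmer_counts(sequences, k, distance):
    """Count (kmer_a, kmer_b) pairs whose start positions are `distance` apart."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - distance - k + 1):
            a = seq[i:i + k]
            b = seq[i + distance:i + distance + k]
            counts[(a, b)] += 1
    return counts


def repeated_pair_fraction(sequences, k, distance):
    """Fraction of pair occurrences whose pair is seen more than once."""
    counts = paired_kmer_counts(sequences, k, distance)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


# Toy sequence (invented): an 8 bp unit appears twice around a unique spacer.
unit = "ACGTTGCA"
seq = unit + "GGATCCTA" + unit

# A pair spanning 7 bp fits inside the 8 bp repeat; one spanning 13 bp does not.
print(repeated_pair_fraction([seq], k=3, distance=4))
print(repeated_pair_fraction([seq], k=3, distance=10))
```

A pair can only recur when both k‐mers fall inside the same repeat copy, so the repeated fraction drops to zero once the span (distance plus k) exceeds the repeat length.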
A third way of estimating repeat abundance from long reads is to perform an all‐against‐all<br />
alignment using about 1% of the data. This, when compared to the expected alignment of an idealized<br />
genome of the same size that does not have any repetitive regions, reveals excess alignments<br />
related to repeats at different lengths. This can be helpful in choosing assembly parameters. The<br />
method we developed is available at https://github.com/pb-sliang/TAP.<br />
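As a back‐of‐the‐envelope illustration of the comparison against an idealized repeat‐free genome, the sketch below assumes equal‐length reads sampled uniformly from a single strand and ignores edge effects. The abstract does not specify this model; the formula and the function names are assumptions made for illustration and may differ from the TAP implementation.

```python
def expected_overlap_pairs(num_reads, read_len, genome_size, min_overlap):
    """Expected read pairs overlapping by at least min_overlap with no repeats.

    Assumed model: read starts are uniform on the genome, so two reads overlap
    by at least min_overlap when their starts differ by at most
    read_len - min_overlap, which has probability of roughly
    2 * (read_len - min_overlap) / genome_size.
    """
    if min_overlap >= read_len:
        return 0.0
    pairs = num_reads * (num_reads - 1) / 2
    p_overlap = min(1.0, 2 * (read_len - min_overlap) / genome_size)
    return pairs * p_overlap


def excess_ratio(observed_pairs, num_reads, read_len, genome_size, min_overlap):
    """Observed alignments divided by the repeat-free expectation."""
    expected = expected_overlap_pairs(num_reads, read_len, genome_size, min_overlap)
    if expected == 0:
        return float("inf")
    return observed_pairs / expected


# Invented numbers: 1,000 reads of 10 kb from a 1 Mb genome, 5 kb minimum overlap.
print(expected_overlap_pairs(1000, 10_000, 1_000_000, 5_000))
```

Under this model, a ratio well above 1 at a given minimum overlap would point to repeats at least that long.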