11th Annual Sequencing, Finishing, and Analysis in the Future Meeting

ESTIMATING THE EFFECTS OF REPEATS ON ASSEMBLY CONTIGUITY

Thursday, 2nd June 10:00 La Fonda Ballroom Talk (OS‐4.01)

Shoudan Liang, Jason Chin

Pacific Biosciences of California

For a perfect assembler at high coverage, the contiguity of an assembly at a finite read length is limited by repetitive sequences. We study the limit imposed by repeat structures in plants, and contrast it with human, as the read length is increased. We start with contigs assembled from long reads and perform an all‐against‐all alignment. Non‐unique regions of the contigs define repeats. We require each alignment to be longer than a minimum length, S. Repeats shorter than S will not align. Therefore, as the minimum overlap S is increased, we observe a decrease in the number of repeat regions. For example, for the coffee genome, when the minimum allowed overlap increases from 500 to 5,000 bp, the number of distinct repetitive regions is reduced by more than a factor of 10. This is partly because long repeats are less abundant and partly because short repeats occur in clusters that are seen as unique sequences in the alignment. We developed a method to separate these two effects. We show the tendency of repeats to cluster in several plant genomes. Clustered repeats are especially difficult to assemble from short reads because, even when all short reads are identified as coming from the same 100 kb region, they are still repetitive within the repeat cluster.
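
The dependence of the repeat count on the minimum alignment length S can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(contig, start, end)` record layout, the half-open coordinates, and the function names are assumptions made for illustration. Alignments shorter than S are discarded, overlapping aligned spans on each contig are merged, and the merged spans are counted as distinct repeat regions.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (contig_id, start, end) of a non-unique span,
# in half-open coordinates. Real alignment output would need to be parsed
# into this form first.
Alignment = Tuple[str, int, int]


def merge_intervals(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching intervals into disjoint regions."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends the current region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def count_repeat_regions(alignments: Iterable[Alignment], min_len: int) -> int:
    """Count distinct repeat regions supported by alignments of length >= min_len."""
    by_contig: Dict[str, List[Tuple[int, int]]] = {}
    for contig, start, end in alignments:
        if end - start >= min_len:  # shorter repeats "do not align"
            by_contig.setdefault(contig, []).append((start, end))
    return sum(len(merge_intervals(spans)) for spans in by_contig.values())


if __name__ == "__main__":
    toy = [("c1", 0, 600), ("c1", 500, 1200), ("c1", 5000, 11000), ("c2", 100, 900)]
    for s in (500, 5000):
        print(s, count_repeat_regions(toy, s))
```

On the toy input, raising `min_len` from 500 to 5000 leaves only the one long span, mirroring the qualitative trend described above. The sketch does not attempt to separate the two effects (fewer long repeats versus clustered short repeats).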

A related method for estimating repeats is to count the abundance of pairs of k‐mers separated by a fixed distance. The distance between the k‐mers is a proxy for the repeat length. This method has the advantage that it can potentially be applied directly to long‐read data before assembly. We compare the direct estimate from the reads with the estimate from the contigs for several plant genomes.
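
Counting pairs of k‐mers at a fixed separation might look like the following Python sketch. Several simplifications are assumed here and are not taken from the abstract: the distance is measured between the start positions of the two k‐mers, matching is exact, and reverse complements and sequencing errors are ignored. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def paired_kmer_counts(seqs: Iterable[str], k: int, distance: int) -> Counter:
    """Count (left k-mer, right k-mer) pairs whose start positions are
    `distance` bases apart, across all input sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        # Last valid start of the left k-mer so the right k-mer still fits.
        last = len(seq) - k - distance
        for i in range(last + 1):
            left = seq[i:i + k]
            right = seq[i + distance:i + distance + k]
            counts[(left, right)] += 1
    return counts


if __name__ == "__main__":
    counts = paired_kmer_counts(["ACGTACGTACGT"], k=2, distance=4)
    # Pairs seen more than once hint at repeated sequence spanning `distance`.
    repeated = {pair: n for pair, n in counts.items() if n > 1}
    print(repeated)
```

Scanning `distance` over a range of values and tracking how many pairs occur more often than the coverage alone would predict gives a rough picture of repeat abundance as a function of length.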

A third way of estimating repeat abundance from long reads is to perform an all‐against‐all alignment using about 1% of the data. Comparing the result to the alignments expected for an idealized genome of the same size without any repetitive regions reveals excess alignments related to repeats of different lengths. This can be helpful in choosing assembly parameters. The method we developed is available at https://github.com/pb-sliang/TAP.
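
The comparison against an idealized repeat‐free genome can be sketched under a deliberately simple model, which is an assumption of this example and not necessarily the one used in TAP: all reads have the same length, start positions are uniform over the genome, and genome ends are ignored. Two reads then overlap by at least S bases when their start positions differ by at most (read length − S), which happens with probability of roughly 2(read length − S)/genome size. The function names are hypothetical.

```python
def expected_overlap_pairs(n_reads: int, read_len: int,
                           genome_size: int, min_overlap: int) -> float:
    """Expected number of read pairs overlapping by >= min_overlap bases
    in a repeat-free genome (uniform starts, fixed read length assumed)."""
    if min_overlap > read_len:
        return 0.0
    # Probability that two random reads start within (read_len - min_overlap)
    # of each other, in either direction.
    p_overlap = min(1.0, 2.0 * (read_len - min_overlap) / genome_size)
    n_pairs = n_reads * (n_reads - 1) / 2.0
    return n_pairs * p_overlap


def excess_ratio(observed_pairs: int, n_reads: int, read_len: int,
                 genome_size: int, min_overlap: int) -> float:
    """Observed / expected overlapping pairs; values well above 1 suggest
    alignments caused by repeats at this overlap length."""
    expected = expected_overlap_pairs(n_reads, read_len, genome_size, min_overlap)
    return observed_pairs / expected if expected > 0 else float("inf")


if __name__ == "__main__":
    # Toy numbers only, not taken from any real data set.
    print(expected_overlap_pairs(1000, 10_000, 1_000_000, 5_000))
    print(excess_ratio(20_000, 1000, 10_000, 1_000_000, 5_000))
```

Evaluating the ratio at several values of the minimum overlap shows where repeat‐induced alignments fade out, which is the kind of information that could guide the choice of assembly parameters.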
