NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Low Coverage Genome Assembly Using Fuzzy Hash Maps<br />
John Healy<br />
Department of Computing & Mathematics<br />
Galway-Mayo Institute of Technology<br />
Ireland<br />
john.healy@gmit.ie<br />
Abstract<br />
Despite the high throughput of sequence reads that<br />
characterises second generation sequencing<br />
technologies, genome assemblers require a large<br />
degree of over-sampling to produce a complete<br />
genomic sequence. Using a novel approach to<br />
comparative genome assembly, based on the<br />
application of fuzzy hash maps, low coverage sequence<br />
reads can be rapidly ordered and orientated in an<br />
assembly scaffold with a low error rate and a vastly<br />
increased N50 length.<br />
1. Introduction<br />
The advent of second generation sequencing (SGS)<br />
technology, capable of rapidly sequencing a massive<br />
number of short-length reads, has resulted in a reappraisal<br />
of existing approaches to sequence alignment<br />
and genome assembly. The twin characteristics of large<br />
read number and short read length have resulted in a<br />
move away from assembly strategies based on the<br />
traditional overlap graph to more k-mer centric<br />
approaches such as de Bruijn graphs and sequence<br />
graphs. While the newer k-mer centric genome<br />
assemblers are ideal for use with short-length reads, the<br />
requirement for a large degree of oversampling, or<br />
coverage, renders these approaches unsuitable for<br />
assembling genomes of draft coverage or lower.<br />
Although comparative assemblers have been developed<br />
for assembling draft genomes, the underlying assembly<br />
model is invariably based on the traditional overlap<br />
graph. We describe how a fuzzy hash-map can be<br />
applied to rapidly and accurately assemble a<br />
prokaryotic genome, sampled at varying levels of low<br />
coverage, against the reference genome of a closely<br />
related species.<br />
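As a rough illustration of the terms used above, the following Java sketch splits a read into overlapping k-mers, the unit on which de Bruijn graph assemblers operate, and estimates coverage with the standard relation C = N x L / G (number of reads times read length, divided by genome length). The class name, method names and example values are hypothetical and are not taken from the authors' software.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; all names and values here are illustrative assumptions.
final class KmerSketch {

    // Returns every overlapping substring of length k, in read order.
    static List<String> kmers(String read, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i++) {
            out.add(read.substring(i, i + k));
        }
        return out;
    }

    // Expected coverage C = N * L / G for N reads of length L over a genome of length G.
    static double coverage(long readCount, int readLength, long genomeLength) {
        return (double) readCount * readLength / genomeLength;
    }

    public static void main(String[] args) {
        System.out.println(kmers("ACGTAC", 4));        // prints [ACGT, CGTA, GTAC]
        System.out.println(coverage(1000, 100, 50000)); // prints 2.0
    }
}
```

A read of length L yields L - k + 1 k-mers, so at low coverage many genomic k-mers are never sampled, which is the gap that anchoring against a reference is meant to bridge.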
2. Assembly with Fuzzy Hash Maps<br />
Hash-tables or maps are dictionary data structures<br />
that use a key and a hashing function to provide<br />
expected constant time, O(1), insertion, deletion and retrieval<br />
operations. By generating a hash code from a<br />
given key, hash tables provide a rapid mapping from a<br />
domain of unique keys to a range of possible values.<br />
Fuzzy Hash Maps (FHMs) leverage the power of object-oriented<br />
languages to allow a degree of variability in<br />
the composition of a hash key. In the Java programming<br />
language, a degree of fuzziness can be applied to a hash<br />
key by manipulating the contract between the<br />
hashCode() and equals() methods in the object used as<br />
the hash key.<br />
Desmond Chambers<br />
Department of Information Technology<br />
National University of Ireland Galway<br />
Ireland<br />
des.chambers@nuigalway.ie<br />
In contrast to traditional hash maps, which<br />
seek to avoid collisions, FHMs encourage initial<br />
collisions in the map by reducing the size of the key<br />
used to compute a hash code. Dynamic programming<br />
algorithms can then be implemented in the equals()<br />
method to establish whether a full collision is permitted.<br />
Using an FHM as the underlying data structure, a de<br />
Bruijn graph approach can be used to anchor a set of<br />
draft sequence reads against a reference genome and<br />
assemble the draft reads into contiguous sequences [1-2].<br />
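A minimal Java sketch of the idea described above, under stated assumptions: hashCode() is computed from a short prefix of the k-mer only, so keys sharing that prefix collide deliberately, and equals() then applies a dynamic programming edit distance to decide whether the collision counts as a match. The class name FuzzyKmer, the prefix length of 4, the tolerance of one edit and the choice of Levenshtein distance are all illustrative assumptions; the text does not specify them, and the authors' own design is described in [1-2].

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical fuzzy key; names, prefix length and tolerance are illustrative assumptions.
final class FuzzyKmer {
    static final int PREFIX_LENGTH = 4; // number of leading bases that feed hashCode()
    static final int MAX_EDITS = 1;     // tolerance applied in equals()

    private final String bases;

    FuzzyKmer(String bases) {
        if (bases.length() < PREFIX_LENGTH) {
            throw new IllegalArgumentException("k-mer shorter than hash prefix");
        }
        this.bases = bases;
    }

    private String prefix() {
        return bases.substring(0, PREFIX_LENGTH);
    }

    // Reduced key: only the prefix is hashed, so similar k-mers land in the same bucket.
    @Override
    public int hashCode() {
        return prefix().hashCode();
    }

    // A collision is accepted when the prefixes agree (which keeps equals()
    // consistent with hashCode()) and the edit distance is within tolerance.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof FuzzyKmer)) return false;
        FuzzyKmer that = (FuzzyKmer) other;
        return prefix().equals(that.prefix())
                && editDistance(bases, that.bases) <= MAX_EDITS;
    }

    // Levenshtein distance, computed by dynamic programming over two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        Map<FuzzyKmer, Integer> referenceIndex = new HashMap<>();
        referenceIndex.put(new FuzzyKmer("ACGTACGT"), 100); // k-mer -> reference offset

        // One substitution in the last base: resolves to the stored entry.
        System.out.println(referenceIndex.get(new FuzzyKmer("ACGTACGA"))); // prints 100
        // Three substitutions: same bucket, but equals() rejects the collision.
        System.out.println(referenceIndex.get(new FuzzyKmer("ACGTTTTT"))); // prints null
    }
}
```

Note that a fuzzy equals() of this kind is not transitive, so it deliberately relaxes the general contract that Java documents for Object.equals(); a lookup returns whichever stored key in the bucket first satisfies the comparison.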
3. Results<br />
The results of assembling the 0.58 Mb genome of<br />
M. genitalium at varying levels of coverage are shown in<br />
Table 1. The 0.81 Mb genome of M. pneumoniae was<br />
used as a reference sequence and anchored 65.56% of<br />
the M. genitalium reads.<br />
Table 1. Summary of Assembly Results at Varying Levels of Coverage<br />
Coverage | N50 Contig | N50 Scaffold | % Ordering Errors | % Orientation Errors | Time (s)<br />
2.0 | 2141 | 51215 | 1.21 | 0.12 | 19.2<br />
1.8 | 1918 | 14787 | 4.60 | 1.00 | 17.3<br />
1.6 | 1798 | 14853 | 1.12 | 0.17 | 16.2<br />
1.4 | 1739 | 9734 | 0.39 | 0.59 | 14.6<br />
1.2 | 1456 | 10322 | 2.18 | 0.46 | 12.8<br />
1.0 | 1269 | 6001 | 1.24 | 0.97 | 11.1<br />
0.8 | 1228 | 8240 | 1.55 | 0.69 | 7.5<br />
0.6 | 992 | 4743 | 0.92 | 0.46 | 7.8<br />
0.4 | - | 2450 | 2.07 | 0.00 | 6.0<br />
0.2 | - | 2539 | 1.38 | 2.07 | 4.1<br />
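For readers unfamiliar with the N50 statistic reported in Table 1, the sketch below computes it under the usual definition: the length L such that sequences of length L or greater account for at least half of the total assembled bases. The class name and the example lengths are hypothetical; they are not data from the experiments, and the text does not state which N50 variant the authors used.

```java
import java.util.Arrays;

// Hypothetical helper; the input lengths below are made-up illustration values.
final class N50Sketch {

    // Returns the N50 of a set of contig (or scaffold) lengths.
    static long n50(long[] lengths) {
        if (lengths.length == 0) {
            throw new IllegalArgumentException("no sequences supplied");
        }
        long[] sorted = lengths.clone();
        Arrays.sort(sorted); // ascending order; walked from the end below

        long total = 0;
        for (long len : sorted) {
            total += len;
        }

        // Accumulate from the longest sequence down until half the total is reached.
        long running = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            if (2 * running >= total) {
                return sorted[i];
            }
        }
        throw new IllegalStateException("unreachable for non-empty input");
    }

    public static void main(String[] args) {
        // Total is 290; 80 + 70 = 150 is the first running sum to reach half (145).
        System.out.println(n50(new long[] {80, 70, 50, 40, 30, 20})); // prints 70
    }
}
```

Because the statistic is weighted by length, joining contigs into scaffolds raises N50 sharply even when the number of assembled bases is unchanged, which is consistent with the gap between the contig and scaffold columns of Table 1.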
4. Conclusions<br />
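The FHM approach is capable of rapidly and<br />
accurately ordering and orientating low coverage<br />
sequence reads without sacrificing the execution speed<br />
inherent in hash maps. Across the coverage levels in<br />
Table 1, every assembly completed in under 20 seconds,<br />
with ordering errors of at most 4.60% and orientation<br />
errors of at most 2.07%.<br />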
5. References<br />
[1] J. Healy and D. Chambers, "Fast and Accurate Genome<br />
Anchoring Using Fuzzy Hash Maps", Proceedings 5th<br />
International Conference on Practical Applications of<br />
Computational Biology & Bioinformatics, 2011, pp. 149-156.<br />
[2] J. Healy and D. Chambers, "De Novo Draft Genome<br />
Assembly Using Fuzzy K-mers", Proceedings 3rd<br />
International Conference on Bioinformatics,<br />
Biocomputational Systems and Biotechnologies, 2011.