29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


Low Coverage Genome Assembly Using Fuzzy Hash Maps

John Healy
Department of Computing & Mathematics
Galway-Mayo Institute of Technology
Ireland
john.healy@gmit.ie

Desmond Chambers
Department of Information Technology
National University of Ireland Galway
Ireland
des.chambers@nuigalway.ie

Abstract

Despite the high throughput of sequence reads that characterises second generation sequencing technologies, genome assemblers require a large degree of over-sampling to produce a complete genomic sequence. Using a novel approach to comparative genome assembly, based on the application of fuzzy hash maps, low coverage sequence reads can be rapidly ordered and orientated in an assembly scaffold with a low error rate and a vastly increased N50 length.

1. Introduction

The advent of second generation sequencing (SGS) technology, capable of rapidly sequencing a massive number of short-length reads, has resulted in a reappraisal of existing approaches to sequence alignment and genome assembly. The twin characteristics of large read number and short read length have resulted in a move away from assembly strategies based on the traditional overlap graph to more k-mer centric approaches such as de Bruijn graphs and sequence graphs. While the newer k-mer centric genome assemblers are ideal for use with short-length reads, the requirement for a large degree of oversampling, or coverage, renders these approaches unsuitable for assembling genomes of draft coverage or lower. Although comparative assemblers have been developed for assembling draft genomes, the underlying assembly model is invariably based on the traditional overlap graph. We describe how a fuzzy hash map can be applied to rapidly and accurately assemble a prokaryotic genome, sampled at varying levels of low coverage, against the reference genome of a closely related species.

2. Assembly with Fuzzy Hash Maps

Hash tables, or hash maps, are dictionary data structures that use a key and a hashing function to provide average-case constant time, O(1), insertion, deletion and retrieval operations. By generating a hash code from a given key, hash tables provide a rapid mapping from a domain of unique keys to a range of possible values. Fuzzy Hash Maps (FHM) leverage the power of object-oriented languages to allow a degree of variability in the composition of a hash key. In the Java programming language, a degree of fuzziness can be applied to a hash key by manipulating the contract between the hashCode() and equals() methods in the object used as the hash key. In contrast to traditional hash maps, which seek to avoid collisions, FHMs encourage initial collisions in the map by reducing the size of the key used to compute a hash code. Dynamic programming algorithms can then be implemented in the equals() method to establish whether a full collision is permitted. Using an FHM as the underlying data structure, a de Bruijn graph approach can be used to anchor a set of draft sequences against a reference genome and assemble the draft reads into contiguous sequences [1-2].
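
The interplay between hashCode() and equals() described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name FuzzyKey, the 4-base seed used for hashing, the tolerance of one edit, and the choice of Levenshtein edit distance as the dynamic programming algorithm are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fuzzy key: hashes only a short seed, compares by edit distance. */
final class FuzzyKey {
    private static final int SEED_LENGTH = 4; // assumed: bases used to compute the hash code
    private static final int MAX_EDITS = 1;   // assumed: edits tolerated by equals()

    private final String kmer;

    FuzzyKey(String kmer) {
        if (kmer.length() < SEED_LENGTH) {
            throw new IllegalArgumentException("k-mer shorter than seed");
        }
        this.kmer = kmer;
    }

    /** Reduced key: every k-mer sharing the same seed lands in the same bucket. */
    @Override
    public int hashCode() {
        return kmer.substring(0, SEED_LENGTH).hashCode();
    }

    /** A full collision is permitted only if the k-mers are within MAX_EDITS edits. */
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof FuzzyKey)) return false;
        return editDistance(kmer, ((FuzzyKey) other).kmer) <= MAX_EDITS;
    }

    /** Levenshtein distance computed with two rolling rows of the DP matrix. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}

final class FuzzyKeyDemo {
    public static void main(String[] args) {
        // Illustrative data only: a reference k-mer mapped to a made-up position.
        Map<FuzzyKey, Integer> index = new HashMap<>();
        index.put(new FuzzyKey("ACGTACGTAC"), 1200);

        Integer hit = index.get(new FuzzyKey("ACGTACGAAC"));  // one substitution
        Integer miss = index.get(new FuzzyKey("ACGTTTTTTT")); // same seed, too many edits
        System.out.println(hit + " " + miss); // prints "1200 null"
    }
}
```

Two consequences of this design are worth noting. A mismatch inside the seed changes the hash code, so such a k-mer is never compared at all. The relaxed equals() is also no longer transitive, which is the deliberate departure from the usual Java contract that the text refers to.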

3. Results

The results of assembling the 0.58 Mb genome of M. genitalium at varying levels of coverage are shown in Table 1. The 0.81 Mb genome of M. pneumoniae was used as the reference sequence and anchored 65.56% of the M. genitalium reads.
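
For orientation, coverage can be read using the standard definition: the total number of sequenced bases divided by the genome length. The worked figure below is illustrative only. The read length of 100 bp is an assumption and is not stated in the text; only the 0.58 Mb genome size and the coverage level of 0.2 are taken from it.

```latex
% c: coverage, N: number of reads, L: mean read length (bp), G: genome length (bp)
c = \frac{N L}{G}
\qquad\Longrightarrow\qquad
N = \frac{c\,G}{L}

% Assumed L = 100 bp (hypothetical); G = 580\,000 bp; c = 0.2
N = \frac{0.2 \times 580\,000}{100} = 1160 \ \text{reads}
```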

Table 1. Summary of Assembly Results at Varying Levels of Coverage

Coverage  N50 Contig  N50 Scaffold  % Ordering Errors  % Orientation Errors  Time (s)
2.0       2141        51215         1.21               0.12                  19.2
1.8       1918        14787         4.60               1.00                  17.3
1.6       1798        14853         1.12               0.17                  16.2
1.4       1739        9734          0.39               0.59                  14.6
1.2       1456        10322         2.18               0.46                  12.8
1.0       1269        6001          1.24               0.97                  11.1
0.8       1228        8240          1.55               0.69                  7.5
0.6       992         4743          0.92               0.46                  7.8
0.4       -           2450          2.07               0.00                  6.0
0.2       -           2539          1.38               2.07                  4.1
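
The N50 columns in Table 1 use the usual assembly statistic: the length of the shortest contig (or scaffold) in the smallest set of longest sequences that together account for at least half of the total assembled length. The sketch below shows one common way to compute it. The class name N50Calculator and the example lengths are hypothetical and are not drawn from the assemblies reported here.

```java
import java.util.Arrays;

/** Hypothetical helper that computes the N50 of a set of sequence lengths. */
final class N50Calculator {

    static int n50(int[] lengths) {
        if (lengths.length == 0) {
            throw new IllegalArgumentException("no sequences supplied");
        }
        int[] sorted = lengths.clone();
        Arrays.sort(sorted); // ascending; walked from the end to visit longest first

        long total = 0;
        for (int len : sorted) total += len;

        long running = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            // Stop at the first sequence that brings the running sum to half the total.
            if (2 * running >= total) return sorted[i];
        }
        throw new AssertionError("unreachable for non-empty input");
    }

    public static void main(String[] args) {
        // Made-up contig lengths: total 290, and 80 + 70 = 150 reaches half.
        int[] contigs = {80, 70, 50, 40, 30, 20};
        System.out.println(n50(contigs)); // prints 70
    }
}
```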

4. Conclusions

The FHM approach is capable of rapidly and accurately ordering and orientating low coverage sequence reads without sacrificing the execution speed inherent in hash maps.

5. References

[1] J. Healy and D. Chambers, "Fast and Accurate Genome Anchoring Using Fuzzy Hash Maps", Proceedings 5th International Conference on Practical Applications of Computational Biology & Bioinformatics, 2011, pp. 149-156.

[2] J. Healy and D. Chambers, "De Novo Draft Genome Assembly Using Fuzzy K-mers", Proceedings 3rd International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies, 2011.
