An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib


CHAPTER 1. INTRODUCTION

of features that are relative to a reference sequence, simple tab-delimited tables have emerged as the predominant storage formats. For example, SAM [100, 111] is a standard file format used for storing sequencing reads and read alignments. GFF [112] is a generic format for the description of genomic features that is primarily used for, but not restricted to, the representation of genome annotation data, and its subset specification GVF [113] is targeted at the description of genomic variation such as SNPs or structural variants. VCF [114] provides a specification for the storage of multi-sample variation and genotype information. SAM, GFF and VCF are structured similarly, with each data set element being represented on a single line. Common or mandatory element attributes like reference genome coordinates are stored as mandatory columns, whereas optional attributes are added in arbitrary order in the form of key-value pairs (tags). All three formats are generic formats that allow users to specify arbitrary additional attributes, and allow the inclusion of meta-data in the file in the form of lines starting with a certain character sequence serving as meta-data indicator. The SAM format comes with an equivalent binary companion format, BAM, with the aim of reducing file parsing overhead.
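This line-per-element layout makes parsing straightforward. As an illustrative sketch (the example record below is made up, and only the eleven mandatory columns and the TAG:TYPE:VALUE tag syntax follow the SAM specification; this is not a complete parser), a SAM alignment line can be split into its mandatory columns followed by optional tags:

```python
# Illustrative parser for one SAM-style alignment line; the eleven
# mandatory column names follow the SAM specification, the record
# itself is a hypothetical example.
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
             "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    if line.startswith("@"):      # meta-data lines are marked by '@' in SAM
        return None
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, fields[:11]))
    # Optional attributes follow in arbitrary order as TAG:TYPE:VALUE tags.
    record["tags"] = {name: (typ, value)
                      for name, typ, value
                      in (tag.split(":", 2) for tag in fields[11:])}
    return record

rec = parse_sam_line("read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\t"
                     "ACGTACGT\tIIIIIIII\tNM:i:0")
print(rec["RNAME"], rec["POS"], rec["tags"]["NM"])  # chr1 100 ('i', '0')
```

Because optional tags carry their own names, the same parsing logic applies regardless of which tags a producer chose to emit, which is what makes these formats extensible.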

While processed data like variant calls are usually moderate in size, space efficiency is a concern for raw read and read alignment data, suggesting the application of data compression algorithms. The generic compression algorithm DEFLATE [115] is one of the most widely used compression methods due to a tradeoff between compression efficiency and compression and decompression speed that is acceptable in many settings. Generic compression algorithms are also applied to sequencing data; e.g., the BAM format is routinely encoded as BGZF [111], a simple subset specification of the GZIP [116] format for DEFLATE-compressed data.
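The defining property of BGZF can be sketched with Python's standard gzip module: the input is cut into blocks that are compressed as independent gzip members, so a reader can inflate any single block without touching the rest of the stream. Real BGZF additionally records the compressed block size in an extra gzip header field, which this sketch omits:

```python
import gzip

def blockwise_compress(data, block_size=64 * 1024):
    # Each block becomes an independent gzip member, as in BGZF
    # (minus the extra header field carrying the compressed block size).
    return [gzip.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

payload = b"ACGTNACGT" * 20000          # ~176 KiB of toy data
blocks = blockwise_compress(payload)
# Any block can be decompressed on its own -- the basis for random access:
second = gzip.decompress(blocks[1])
assert second == payload[64 * 1024:2 * 64 * 1024]
```

Since concatenated gzip members form a valid GZIP stream, such block-structured files remain readable by ordinary GZIP tools, at the cost of a slightly worse compression ratio than one monolithic stream.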

However, compression efficiency can be improved through the use of sequencing-data-specific encoding methods or preprocessing. To some extent, high-throughput sequencing data compression is related to the issue of DNA sequence compression. Complete genome sequences are known to be hardly compressible beyond the typical 2-bit encoding of the four bases using common generic compression algorithms [117]. Most efforts preceding the emergence of high-throughput sequencing have thus been directed at generating efficient representations of such large biological sequences. However, algorithms that have been made available for this purpose are not readily applicable to short read sequencing data.
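The 2-bit figure follows directly from the four-letter alphabet. A minimal packing scheme looks as follows (ambiguity codes such as N, which real genome encodings must handle separately, are ignored here for brevity):

```python
# Minimal 2-bit base packing: four bases per byte, least significant
# bit pair first. Real formats additionally handle N runs and masking.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack(packed, n):
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(n))

seq = "GATTACAGATTACA"
packed = pack(seq)
print(len(seq), "bases in", len(packed), "bytes")  # 14 bases in 4 bytes
assert unpack(packed, len(seq)) == seq
```

That genomes resist compression much below this bound means generic compressors find little long-range redundancy to exploit in a single genome, which is why the specialized approaches below target the redundancy *between* reads instead.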

Common approaches used by most DNA compression algorithms in the literature rely on suffix arrays or similar index data structures for detecting and collapsing perfectly or imperfectly repetitive regions in a small set of large sequences. As such, these implementations are not suited to the task of compressing the huge sets of very short sequences generated by a high-throughput sequencing run. Analogous approaches are however available for short read sequences. Imposing a specific ordering on the elements of a data set potentially leads to a reduction in entropy, and thereby improved compression efficiency. The utility SCALCE implements a read reordering method using a similarity-based clustering approach that serves as a compression booster in combination with generic compression algorithms and formats [118]. However, as a specific element order is often a requirement of analysis algorithms, SCALCE and similar reordering approaches are most suitable for off-line long-term storage and data distribution. A different approach is taken by the tool Quip, which implements an assembly-based short read compression scheme where reads are represented by their positions on specifically assembled contigs [119]. Assembled contigs must be stored along with the data set elements, but allow efficient compression of data sequenced to multiple-fold depth.
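The effect exploited by reordering boosters can be illustrated with a toy experiment (the clustering in SCALCE is considerably more elaborate than the plain lexicographic sort used here): duplicate reads that lie further apart than DEFLATE's 32 KiB window go unnoticed, whereas grouping identical and similar reads together puts them within the compressor's reach:

```python
import random
import zlib

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(100000))
# Simulated non-overlapping 50 bp fragments, each sequenced three times:
reads = [genome[p:p + 50]
         for p in (50 * random.randrange(2000) for _ in range(2000))] * 3
random.shuffle(reads)        # arbitrary order: copies end up far apart

arbitrary = zlib.compress("".join(reads).encode())
clustered = zlib.compress("".join(sorted(reads)).encode())
print(len(clustered), "<", len(arbitrary))
```

The sorted stream compresses noticeably better, but it has also lost the original read order, which mirrors the point above: such boosters suit archival and distribution rather than pipelines that depend on element order.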

While these approaches focus on compression of short nucleotide sequences, those represent just one of the attributes to be stored in next-generation sequencing data sets, along with read identifiers, per-base PHRED qualities and possibly read alignments. Various methods therefore attempt to improve the compression ratio for these heterogeneous collections of data. Notably, the CRAM format [120] aims at being a drop-in replacement for the SAM/BAM format, to large extent
