28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />

TEXT-BASED FILE FORMATS 23<br />

(a) GZIP<br />

(b) BGZF<br />

(c) dictzip<br />

(d) SHORE-GZIP<br />

GZIP header<br />

DEFLATE compressed data<br />

GZIP footer<br />

Indexing information embedded in header<br />

Figure 2.2: Basic Structure of the GZIP Format <strong>and</strong> its Sub-Formats<br />

Binary search is however not applicable to elements like read mappings, gene models or<br />

variant information representing two-dimensional objects with a start <strong>and</strong> an end coordinate.<br />

Fast access to such data set elements overlapping a specic region of interest is crucial for data<br />

visualization <strong>and</strong> manual data set exploration. A text-le indexing approach capable of answering<br />

various types of range queries is implemented as a utility SHORE 2dex, providing functionality<br />

similar to Tabix [125] or the BAM [100] <strong>and</strong> BigBed [123] indexing schemes. In comparison to<br />

the also-generic Tabix indexer, SHORE provides additional functionality such as exible choice of<br />

compression format <strong>and</strong> r<strong>and</strong>om access resolution, indexing of les not sorted by start coordinate<br />

<strong>and</strong> fast queries on data sets with deep sequence coverage.<br />

Both utilities SHORE sort <strong>and</strong> SHORE 2dex make use of SHORE's generic r<strong>and</strong>om read access<br />

back end <strong>and</strong> can thereby seamlessly be applied to both compressed <strong>and</strong> uncompressed data sets.<br />

2.2.2 Widely Compatible Indexed Block-Wise Compression<br />

GZIP [116] is one of the most widely used generic compression formats with a high level of<br />

support across operating systems, programming languages, comm<strong>and</strong> line utilities <strong>and</strong> many<br />

further software applications. The employed DEFLATE compression algorithm [115] oers a<br />

reasonable compromise between encoding <strong>and</strong> decoding speed <strong>and</strong> compression ratio. Therefore,<br />

the format is an obvious choice for compression of sequencing related data. However, it lacks<br />

r<strong>and</strong>om read access support, forcing decompression of all data preceding a required section of<br />

the compressed le. As this interferes with the ability to perform accelerated queries on the<br />

compressed data sets, subset specications have been developed to add this feature without<br />

breaking compatibility.<br />

The GZIP le format embeds DEFLATE compressed data between a header <strong>and</strong> a footer<br />

sequence (gure 2.2(a) ). Encoding parameters are specied within the header, whereas the<br />

footer consists solely of eight byte specifying a checksum as well as the decompressed le size<br />

modulo 2 32 . Header, compressed data <strong>and</strong> footer constitute a GZIP stream, <strong>and</strong> a GZIP compliant<br />

le consists of one ore more concatenated streams. Upon decompression, data of concatenated<br />

GZIP streams are concatenated as well.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!