An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 23<br />
(a) GZIP<br />
(b) BGZF<br />
(c) dictzip<br />
(d) SHORE-GZIP<br />
GZIP header<br />
DEFLATE compressed data<br />
GZIP footer<br />
Indexing information embedded in header<br />
Figure 2.2: Basic Structure of the GZIP Format <strong>and</strong> its Sub-Formats<br />
Binary search is however not applicable to elements like read mappings, gene models or<br />
variant information representing two-dimensional objects with a start <strong>and</strong> an end coordinate.<br />
Fast access to such data set elements overlapping a specic region of interest is crucial for data<br />
visualization <strong>and</strong> manual data set exploration. A text-le indexing approach capable of answering<br />
various types of range queries is implemented as a utility SHORE 2dex, providing functionality<br />
similar to Tabix [125] or the BAM [100] <strong>and</strong> BigBed [123] indexing schemes. In comparison to<br />
the also-generic Tabix indexer, SHORE provides additional functionality such as exible choice of<br />
compression format <strong>and</strong> r<strong>and</strong>om access resolution, indexing of les not sorted by start coordinate<br />
<strong>and</strong> fast queries on data sets with deep sequence coverage.<br />
Both utilities SHORE sort <strong>and</strong> SHORE 2dex make use of SHORE's generic r<strong>and</strong>om read access<br />
back end <strong>and</strong> can thereby seamlessly be applied to both compressed <strong>and</strong> uncompressed data sets.<br />
2.2.2 Widely Compatible Indexed Block-Wise Compression<br />
GZIP [116] is one of the most widely used generic compression formats with a high level of<br />
support across operating systems, programming languages, comm<strong>and</strong> line utilities <strong>and</strong> many<br />
further software applications. The employed DEFLATE compression algorithm [115] oers a<br />
reasonable compromise between encoding <strong>and</strong> decoding speed <strong>and</strong> compression ratio. Therefore,<br />
the format is an obvious choice for compression of sequencing related data. However, it lacks<br />
r<strong>and</strong>om read access support, forcing decompression of all data preceding a required section of<br />
the compressed le. As this interferes with the ability to perform accelerated queries on the<br />
compressed data sets, subset specications have been developed to add this feature without<br />
breaking compatibility.<br />
The GZIP le format embeds DEFLATE compressed data between a header <strong>and</strong> a footer<br />
sequence (gure 2.2(a) ). Encoding parameters are specied within the header, whereas the<br />
footer consists solely of eight byte specifying a checksum as well as the decompressed le size<br />
modulo 2 32 . Header, compressed data <strong>and</strong> footer constitute a GZIP stream, <strong>and</strong> a GZIP compliant<br />
le consists of one ore more concatenated streams. Upon decompression, data of concatenated<br />
GZIP streams are concatenated as well.