An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE
The BGZF format, developed for the BAM format and the Tabix utility [100, 111, 125], exploits
the concatenation feature to implement independently compressed blocks within the GZIP specification.
Uncompressed data are split into 64 kilobyte (kB) blocks, compressed independently as
GZIP streams and eventually concatenated (figure 2.2(b)). Although the resulting compressed
size of a block is unpredictable and depends on the compressibility of the data, knowledge
of the start offsets of all blocks in a compressed file makes it possible to quickly locate the block
containing a specific uncompressed file offset. Thereby, the excess data that have to be decompressed
to access any specific position of the file are limited to at most the block size of 64 kB. BGZF,
however, does not specify a means of storing the start offset of each of the compressed blocks. To
realize accelerated queries on top of BGZF, block offset information must therefore be incorporated
in external index files. This represents a source of potential error, as the contents of the
compressed file may diverge from the stored index data. Furthermore, various software applications
do not correctly implement the stream concatenation feature of the GZIP specification,
resulting in BGZF-compressed data being truncated after decompression of the first block.
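The concatenation principle underlying BGZF can be illustrated in a few lines of Python. The sketch below is not the actual BGZF encoder (which additionally records each block's compressed size in a GZIP header extra field); it merely demonstrates that independently compressed 64 kB blocks, concatenated, form a valid GZIP file, and that a recorded offset table permits decompressing a single block in isolation:

```python
import gzip

# Split input into fixed-size blocks and compress each as its own GZIP
# stream; the concatenation of the streams is itself valid GZIP data.
# Illustrative sketch only, not the actual BGZF encoder.
BLOCK_SIZE = 64 * 1024  # 64 kB of uncompressed data per block

def compress_blockwise(data: bytes) -> tuple[bytes, list[int]]:
    """Return concatenated GZIP members and the start offset of each member."""
    out, offsets = bytearray(), []
    for pos in range(0, len(data), BLOCK_SIZE):
        offsets.append(len(out))        # compressed start offset of this block
        out += gzip.compress(data[pos:pos + BLOCK_SIZE])
    offsets.append(len(out))            # end-of-data sentinel
    return bytes(out), offsets

data = bytes(range(256)) * 1024         # 256 kB of sample data -> 4 blocks
compressed, offsets = compress_blockwise(data)

# A decoder honoring stream concatenation recovers the full input ...
assert gzip.decompress(compressed) == data
# ... and any single block can be decompressed from its recorded offset,
# touching at most 64 kB of uncompressed data.
block2 = gzip.decompress(compressed[offsets[2]:offsets[3]])
assert block2 == data[2 * BLOCK_SIZE:3 * BLOCK_SIZE]
```

Note that the offset table exists only in memory here; as discussed above, BGZF itself leaves the persistence of these offsets to external index files.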
The open source command line tool dictzip comes with a specification for an indexed, GZIP-compliant
format. The GZIP format specification allows up to 64 kB of arbitrary extra data
to be embedded within the header sequence. This extra data is simply ignored by basic
GZIP decoders. The dictzip format takes advantage of the extra data field for storing block
offset information (figure 2.2(c)). During the encoding process, the DEFLATE algorithm offers
the possibility of manually inserting reset points from which decompression may be started
without processing the entire file. By exploiting this feature to compress blocks of input data
independently, the dictzip tool avoids reliance on the stream concatenation feature as in the
case of BGZF. However, with the extra data field in the GZIP header being limited to 64 kB by
specification, the number of blocks that can be stored is limited as well, effectively prohibiting
storage of data with an uncompressed size of more than 1.8 gigabyte (GB). Furthermore, the
entire index to be stored in the GZIP header only becomes available after the entire file has
been encoded. Streaming output is therefore incompatible with the format, which is consequently
suitable for medium-size, static data sets, but not well matched to the compression of large-scale
sequencing data.
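The reset points mentioned above correspond to DEFLATE "full flush" points: the encoder byte-aligns its output and resets its state, so that decompression may later be restarted at any recorded flush offset. A minimal sketch using Python's zlib with raw DEFLATE framing (the real dictzip tool additionally stores its offset table in the GZIP header extra field, which is omitted here):

```python
import zlib

BLOCK_SIZE = 1024  # small block size for illustration; dictzip uses larger blocks

def compress_with_resets(data: bytes) -> tuple[bytes, list[int]]:
    """Compress as one raw DEFLATE stream with a full flush after every block.

    A full flush byte-aligns the stream and resets the encoder state, so a
    fresh decompressor can be started at any recorded offset.
    """
    comp = zlib.compressobj(wbits=-15)  # raw DEFLATE, no zlib/GZIP framing
    out, offsets = bytearray(), [0]
    for pos in range(0, len(data), BLOCK_SIZE):
        out += comp.compress(data[pos:pos + BLOCK_SIZE])
        out += comp.flush(zlib.Z_FULL_FLUSH)  # insert a reset point
        offsets.append(len(out))              # block boundary in compressed data
    out += comp.flush()                       # finish the stream
    return bytes(out), offsets

data = b'example payload ' * 4096
compressed, offsets = compress_with_resets(data)

# Random access: start a fresh decompressor at the reset point of block 3;
# no earlier compressed data has to be processed.
decomp = zlib.decompressobj(wbits=-15)
tail = decomp.decompress(compressed[offsets[3]:])
assert tail[:BLOCK_SIZE] == data[3 * BLOCK_SIZE:4 * BLOCK_SIZE]
```

Since each full flush resets the encoder's dictionary, the blocks are effectively compressed independently, which is the property both dictzip and, as described below, SHORE-GZIP rely on.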
Due to the shortcomings of the available formats, we introduce a further subset specification.
The SHORE-GZIP format compresses the entire file as a single GZIP stream to provide optimal
compatibility. To enable streaming output, all indexing information is stored at the end of the
file. On decompression using either SHORE or arbitrary third-party tools, block offsets stored
in the index are discarded. Thereby, index information is guaranteed to remain consistent with
the file contents. To allow fine-tuning the tradeoff between compression ratio and random access
overhead, the format enables free choice of the encoding block size.
SHORE-GZIP realizes block-wise encoding similar to the dictzip approach by requesting periodic
resets of the DEFLATE encoder (figure 2.2(d)). During encoding, the method keeps track
of the compressed block offsets. After encoding all input data, the recorded index information
is split into blocks of approximately 64 kB. Each block is embedded as an extra data field into a
GZIP header, which is appended to the compressed file, followed by two bytes signaling empty
compressed data to DEFLATE decoders, as well as eight further bytes forming the appropriate
GZIP footer. The indexing information is hence stored as consecutive empty GZIP streams that
will be ignored by standard decoding of the data set. The SHORE-GZIP implementation is, however,
able to recognize, collect and decode the trailing index blocks on demand to enable random
read access.
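Such a trailing index record can be sketched as an empty GZIP member whose header carries the index data in the extra field: standard decoders skip the extra field and decode the member to zero bytes. The byte layout below follows the GZIP specification (RFC 1952); the content of the index payload itself is a placeholder, as SHORE-GZIP's exact record serialization is not reproduced here:

```python
import gzip
import struct

def empty_member_with_extra(extra: bytes) -> bytes:
    """Build a GZIP member that decodes to zero bytes but carries `extra`
    in its header extra field (FEXTRA flag, RFC 1952)."""
    assert len(extra) <= 0xFFFF            # extra field is limited to 64 kB
    header = (b'\x1f\x8b'                  # GZIP magic
              b'\x08'                      # CM = 8 (DEFLATE)
              b'\x04'                      # FLG: FEXTRA set
              + struct.pack('<I', 0)       # MTIME = 0
              + b'\x00\xff'                # XFL = 0, OS = unknown
              + struct.pack('<H', len(extra)) + extra)
    empty_deflate = b'\x03\x00'            # the two bytes signaling an empty,
                                           # final DEFLATE block
    footer = struct.pack('<II', 0, 0)      # eight bytes: CRC32 and size of the
                                           # (empty) uncompressed payload
    return header + empty_deflate + footer

payload = b'compressed data stream' * 100
index_record = b'\x00\x10\x00\x00'         # placeholder for serialized offsets

file_bytes = gzip.compress(payload) + empty_member_with_extra(index_record)

# Standard decoders see only the payload; the trailing index member
# contributes nothing to the decompressed output.
assert gzip.decompress(file_bytes) == payload
```

An index-aware reader, by contrast, would seek to the trailing members, parse their extra fields and reconstruct the block offset table before serving random-access reads.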
An important feature of plain GZIP is the ability for users to concatenate compressed files
without breaking format compliance. With SHORE-GZIP, concatenation results in index records
being interspersed among the data streams. Through correct recognition of such interspersed