
The BGZF format developed for the BAM format and the Tabix utility [100, 111, 125] exploits the concatenation feature to implement independently compressed blocks within the GZIP specification. Uncompressed data are split into 64 kilobyte (kB) blocks, compressed independently as GZIP streams and eventually concatenated (figure 2.2(b)). Although the resulting compressed size of a block is unpredictable and depends on the compressibility of the data, with knowledge of the start offsets of all blocks in a compressed file it is possible to quickly locate the block containing a specific uncompressed file offset. Thereby, excess data that have to be decompressed to access any specific position of the file are limited to at most the block size of 64 kB. BGZF, however, does not specify a means of storing the start offset of each of the compressed blocks. To realize accelerated queries on top of BGZF, block offset information must therefore be incorporated in external index files. This represents a source of potential error, as the contents of the compressed file may diverge from the stored index data. Furthermore, various software applications do not correctly implement the stream concatenation feature of the GZIP specification, resulting in BGZF-compressed data being truncated after decompression of the first block.
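
Both points can be demonstrated in a few lines. The following sketch is an illustrative approximation in Python rather than the actual BGZF wire format (which additionally records each block's compressed size in the member header); the function names and the index layout are assumptions, and only the 64 kB block size is taken from the description above.

    import bisect
    import gzip
    import zlib

    BLOCK_SIZE = 64 * 1024  # 64 kB of uncompressed data per block

    def compress_blocks(data):
        # Compress each block as an independent GZIP member and record the
        # (uncompressed offset, compressed offset) of every block -- BGZF
        # itself leaves this bookkeeping to external index files.
        blob, index = bytearray(), []
        for off in range(0, len(data), BLOCK_SIZE):
            index.append((off, len(blob)))
            blob += gzip.compress(data[off:off + BLOCK_SIZE])
        return bytes(blob), index

    def locate_block(index, uncompressed_offset):
        # Binary search for the compressed offset of the block that
        # contains the requested uncompressed position.
        starts = [u for u, _ in index]
        return index[bisect.bisect_right(starts, uncompressed_offset) - 1][1]

    data = bytes(range(256)) * 1024  # 256 kB of sample data
    blob, index = compress_blocks(data)

    # A concatenation-aware decoder recovers the full input ...
    assert gzip.decompress(blob) == data

    # ... but a decoder that stops after a single GZIP member silently
    # truncates the output after the first block -- the pitfall noted above.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    assert d.decompress(blob) == data[:BLOCK_SIZE]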

The open source command line tool dictzip comes with a specification for an indexed GZIP-compliant format. The GZIP format specification allows up to 64 kB of arbitrary extra data to be embedded within the header sequence; this extra data is simply ignored by basic GZIP decoders. The dictzip format takes advantage of the extra data field for storing block offset information (figure 2.2(c)). During the encoding process, the DEFLATE algorithm offers the possibility of manually inserting reset points from which decompression may be started without processing the entire file. By exploiting this feature to compress blocks of input data independently, the dictzip tool avoids reliance on the stream concatenation feature as in the case of BGZF. However, with the extra data field in the GZIP header being limited to 64 kB by specification, the number of blocks that can be stored is limited as well, effectively prohibiting storage of data with an uncompressed size of more than 1.8 gigabyte (GB). Furthermore, the entire index to be stored in the GZIP header only becomes available after the entire file has been encoded. The format is therefore incompatible with streaming output, making it suitable for medium-size, static data sets, but poorly matched to the compression of large-scale sequencing data.
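
These reset points correspond to what common DEFLATE implementations expose as a full flush. The following sketch, assuming Python's zlib bindings and an illustrative chunk size, records the offset of every reset point during encoding; it demonstrates the principle only and does not reproduce dictzip's actual chunk table layout.

    import zlib

    CHUNK = 58 * 1024  # illustrative; dictzip's default chunk size is similar

    def compress_with_resets(data):
        # Raw DEFLATE with a full flush after every chunk: Z_FULL_FLUSH
        # byte-aligns the stream and resets the encoder's dictionary, so
        # decompression may later restart at any recorded offset.
        comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
        blob, offsets = bytearray(), []
        for i in range(0, len(data), CHUNK):
            offsets.append(len(blob))
            blob += comp.compress(data[i:i + CHUNK])
            blob += comp.flush(zlib.Z_FULL_FLUSH)
        blob += comp.flush(zlib.Z_FINISH)
        return bytes(blob), offsets

The 1.8 GB bound follows from the table layout: assuming dictzip's two-byte chunk-length entries, a 64 kB extra field holds roughly 32,000 entries, which at a chunk size of about 58 kB caps the uncompressed input near the quoted figure.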

Due to the shortcomings of the available formats, we introduce a further subset specification. The SHORE-GZIP format compresses the entire file as a single GZIP stream to provide optimal compatibility. To enable streaming output, all indexing information is stored at the end of the file. On decompression using either SHORE or arbitrary third party tools, block offsets stored in the index are discarded; thereby, index information is guaranteed to remain consistent with the file contents. To allow fine-tuning the tradeoff between compression ratio and random access overhead, the format enables free choice of the encoding block size.
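
Given the block offsets recorded at encoding time (for instance, the offsets list of the previous sketch), random read access reduces to restarting a raw DEFLATE decoder at the nearest preceding reset point. The following function is a hypothetical illustration of this lookup, not SHORE's implementation, and assumes blocks of a fixed uncompressed size.

    import zlib

    def read_at(blob, offsets, pos, length, block_size):
        # Restart raw DEFLATE decoding at the reset point preceding the
        # requested position; at most one block of excess data is
        # inflated, which is the tunable overhead described above.
        block = pos // block_size
        skip = pos - block * block_size
        d = zlib.decompressobj(-zlib.MAX_WBITS)
        out = d.decompress(blob[offsets[block]:], skip + length)
        return out[skip:skip + length]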

SHORE-GZIP realizes block-wise encoding similar to the dictzip approach by requesting periodic resets of the DEFLATE encoder (figure 2.2(d)). During encoding, the method keeps track of the compressed block offsets. After encoding all input data, the recorded index information is split into blocks of approximately 64 kB. Each block is embedded as an extra data field into a GZIP header, which is appended to the compressed file, followed by two bytes signaling empty compressed data to DEFLATE decoders, as well as eight further bytes forming the appropriate GZIP footer. The indexing information is hence stored as consecutive empty GZIP streams that will be ignored by standard decoding of the data set. The SHORE-GZIP implementation is, however, able to recognize, collect and decode the trailing index blocks on demand to enable random read access.
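
The layout of such a trailing index member follows from the GZIP specification and can be sketched as below; the subfield tag and the index serialization are hypothetical stand-ins for SHORE's actual on-disk format.

    import gzip
    import struct

    def index_member(chunk):
        # Header: magic, CM=8 (DEFLATE), FLG=FEXTRA, MTIME=0, XFL=0,
        # OS=255, then XLEN and one extra subfield carrying the index
        # chunk (subfield tag "SG" is a hypothetical placeholder).
        extra = b"SG" + struct.pack("<H", len(chunk)) + chunk
        header = (b"\x1f\x8b\x08\x04" + b"\x00" * 5 + b"\xff"
                  + struct.pack("<H", len(extra)) + extra)
        # Body: the two bytes signaling an empty DEFLATE stream; footer:
        # CRC32 and size of the empty output, i.e. eight zero bytes.
        return header + b"\x03\x00" + b"\x00" * 8

    payload = gzip.compress(b"actual file contents")
    f = payload + index_member(b"serialized block offsets")

    # Standard decoders skip the extra field and decode the trailing
    # empty stream to nothing, so the index is invisible to them.
    assert gzip.decompress(f) == b"actual file contents"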

An important feature of plain GZIP is the ability for users to concatenate compressed files without breaking format compliance. With SHORE-GZIP, concatenation results in index records being interspersed among the data streams. Through correct recognition of such interspersed
