28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

30 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />

Attribute Plain Size Compressed Size Compression Ratio<br />

read alignments (ref. comp.) 349 Mb (33 Mb) 19 Mb (3.4 Mb) 5.7% (10.3%)<br />

read IDs (relabeled) 227 Mb (117 Mb) 39 Mb (3.2 Mb) 17.5% (2.7%)<br />

qualities (reduced) 341 Mb (341 Mb) 76 Mb (30 Mb) 22.3% (8.8%)<br />

other elds 163 Mb 5.4 Mb 3.3%<br />

Table 2.1: Compressibility of Individual Read Mapping Attributes<br />

qualities accounted for 349 Mb <strong>and</strong> 341 Mb, each corresponding to approximately 32% of the<br />

total storage size, followed by st<strong>and</strong>ard read identiers with 227 Mb (21%) <strong>and</strong> the collective<br />

size of all further attributes (163 Mb, 15%).<br />

Column data were compressed in XZ format at 128 kB block size. Read alignments were<br />

thereby reduced to 5.7% (19 Mb) of their original size, further elds to 3.3% (5.4 Mb), whereas<br />

read identiers <strong>and</strong> quality strings could not be compressed as eectively with compression ratios<br />

of 17.5% (39 Mb) <strong>and</strong> 22.3% (76 Mb), respectively.<br />

By application of reference-based compression, storage size of read alignments was reduced to<br />

less than ten percent (33 Mb) in plain text format. This gain was however somewhat diminished<br />

in compressed format due to the worse compressibility of the data, resulting in a compressed size<br />

of 3.4 Mb at a compression ratio of 10.3%.<br />

Relabeling of the reads resulted in a storage gain of roughly fty percent relative to the<br />

original representation owing to the shorter read identiers. Vastly improved however was the<br />

compressibility of the identier sequence, resulting in a reduction to just 2.7% (3.2 Mb) relative<br />

to plain text format.<br />

Eect of reduced quality value resolution was assessed using the cramtools quality binning<br />

conguration N40-R8 to allow direct comparison to the CRAM format, although the algorithm<br />

implemented in SHORE achieved similar results. The N40-R8 setting preserves quality values of<br />

all reference sequence mismatches, while assigning quality values of sequence matches to one of<br />

eight possible bins. Application of quality resolution reduction improved compression ratio of<br />

the quality data from 22.3% to 8.8%.<br />

We further assessed the eect of various combinations of compression <strong>and</strong> data reduction<br />

congurations on total data set size by application of the SHORE compress <strong>and</strong> SHORE coal utilities<br />

to the data set in MapList format <strong>and</strong> compare the results with storage of SAM/BAM <strong>and</strong> CRAM<br />

formatted data (table 2.2).<br />

Eect of compression block size was evaluated using SHORE's r<strong>and</strong>om access granularity presets<br />

fast, corresponding to 128 kB block size, medium, 2 MB, <strong>and</strong> slow, 64 MB. Compression<br />

of the 1.1 GB data set in SHORE-GZIP format at 128 kB block size reduced space requirements<br />

to 205 MB, or 19% of the uncompressed le size. Increasing block size to 2 MB further improved<br />

space eciency by just 1.5% to 202 MB storage size, <strong>and</strong> use of 64 MB blocks did not have a<br />

signicant advantage over 2 MB.<br />

Selecting the XZ compression format at 128 kB block size resulted in a storage size of 163 MB,<br />

15% of plain data set size <strong>and</strong> a 20% improvement over SHORE-GZIP at the same block size. Eect<br />

of increased compression block size was more pronounced for XZ than for SHORE-GZIP. At 2 MB<br />

<strong>and</strong> 64 MB block size disk usage was reduced to 140 MB <strong>and</strong> 116 MB, respectively, improvements<br />

of 14% <strong>and</strong> 17% over the respective previous block size.<br />

Block-wise transposition was applied using the block size parameter matching the respective<br />

compression block size. At the smallest block size, size of the transformed data in SHORE-GZIP<br />

format was 181 MB, a 12% improvement over untransposed storage. Block-wise transposition<br />

also augmented the eect of increased block sizes, achieving le sizes of 160 MB at 2 MB <strong>and</strong> of

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!