An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
30 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
Attribute Plain Size Compressed Size Compression Ratio<br />
read alignments (ref. comp.) 349 Mb (33 Mb) 19 Mb (3.4 Mb) 5.7% (10.3%)<br />
read IDs (relabeled) 227 Mb (117 Mb) 39 Mb (3.2 Mb) 17.5% (2.7%)<br />
qualities (reduced) 341 Mb (341 Mb) 76 Mb (30 Mb) 22.3% (8.8%)<br />
other elds 163 Mb 5.4 Mb 3.3%<br />
Table 2.1: Compressibility of Individual Read Mapping Attributes<br />
qualities accounted for 349 Mb <strong>and</strong> 341 Mb, each corresponding to approximately 32% of the<br />
total storage size, followed by st<strong>and</strong>ard read identiers with 227 Mb (21%) <strong>and</strong> the collective<br />
size of all further attributes (163 Mb, 15%).<br />
Column data were compressed in XZ format at 128 kB block size. Read alignments were<br />
thereby reduced to 5.7% (19 Mb) of their original size, further elds to 3.3% (5.4 Mb), whereas<br />
read identiers <strong>and</strong> quality strings could not be compressed as eectively with compression ratios<br />
of 17.5% (39 Mb) <strong>and</strong> 22.3% (76 Mb), respectively.<br />
By application of reference-based compression, storage size of read alignments was reduced to<br />
less than ten percent (33 Mb) in plain text format. This gain was however somewhat diminished<br />
in compressed format due to the worse compressibility of the data, resulting in a compressed size<br />
of 3.4 Mb at a compression ratio of 10.3%.<br />
Relabeling of the reads resulted in a storage gain of roughly fty percent relative to the<br />
original representation owing to the shorter read identiers. Vastly improved however was the<br />
compressibility of the identier sequence, resulting in a reduction to just 2.7% (3.2 Mb) relative<br />
to plain text format.<br />
Eect of reduced quality value resolution was assessed using the cramtools quality binning<br />
conguration N40-R8 to allow direct comparison to the CRAM format, although the algorithm<br />
implemented in SHORE achieved similar results. The N40-R8 setting preserves quality values of<br />
all reference sequence mismatches, while assigning quality values of sequence matches to one of<br />
eight possible bins. Application of quality resolution reduction improved compression ratio of<br />
the quality data from 22.3% to 8.8%.<br />
We further assessed the eect of various combinations of compression <strong>and</strong> data reduction<br />
congurations on total data set size by application of the SHORE compress <strong>and</strong> SHORE coal utilities<br />
to the data set in MapList format <strong>and</strong> compare the results with storage of SAM/BAM <strong>and</strong> CRAM<br />
formatted data (table 2.2).<br />
Eect of compression block size was evaluated using SHORE's r<strong>and</strong>om access granularity presets<br />
fast, corresponding to 128 kB block size, medium, 2 MB, <strong>and</strong> slow, 64 MB. Compression<br />
of the 1.1 GB data set in SHORE-GZIP format at 128 kB block size reduced space requirements<br />
to 205 MB, or 19% of the uncompressed le size. Increasing block size to 2 MB further improved<br />
space eciency by just 1.5% to 202 MB storage size, <strong>and</strong> use of 64 MB blocks did not have a<br />
signicant advantage over 2 MB.<br />
Selecting the XZ compression format at 128 kB block size resulted in a storage size of 163 MB,<br />
15% of plain data set size <strong>and</strong> a 20% improvement over SHORE-GZIP at the same block size. Eect<br />
of increased compression block size was more pronounced for XZ than for SHORE-GZIP. At 2 MB<br />
<strong>and</strong> 64 MB block size disk usage was reduced to 140 MB <strong>and</strong> 116 MB, respectively, improvements<br />
of 14% <strong>and</strong> 17% over the respective previous block size.<br />
Block-wise transposition was applied using the block size parameter matching the respective<br />
compression block size. At the smallest block size, size of the transformed data in SHORE-GZIP<br />
format was 181 MB, a 12% improvement over untransposed storage. Block-wise transposition<br />
also augmented the eect of increased block sizes, achieving le sizes of 160 MB at 2 MB <strong>and</strong> of