An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
32 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
157 MB at the 64 MB conguration. This corresponds to advantages of 21% <strong>and</strong> 22% over the<br />
matching untransformed representations.<br />
With XZ/LZMA compression, compression ratios beneted slightly less from block-wise transposition,<br />
at the three dierent block sizes gaining 8%, 12% <strong>and</strong> 17% over untransposed data to<br />
obtain le sizes of 150 MB, 123 MB <strong>and</strong> 96 MB, respectively.<br />
We further list the eect of additionally applying various combinations of reference-based<br />
compression, read relabeling <strong>and</strong> quality resolution for transposed data encoded in GZIP format<br />
with 2 MB block size as well as encoded in XZ format at the three dierent block sizes. In<br />
all compression formats <strong>and</strong> block sizes assessed, reference-based compression could account for<br />
11% to 12% of additional storage space savings. Added relabeling of reads resulted in le sizes<br />
reduced by further 20% in SHORE-GZIP format <strong>and</strong> by over 28% in the three XZ encoded les.<br />
Reduction of quality resolution gave similar improvements, producing slightly higher savings<br />
than read relabeling in GZIP compressed data, but had somewhat lesser impact compared to<br />
relabeling in XZ encoding.<br />
For comparison of formats, table 2.2 also includes disk usage for the same data set converted<br />
to SAM, BAM <strong>and</strong> CRAM le formats. The SAM format representation of the data set entries<br />
was slightly more verbose compared to the MapList formatting, resulting in an uncompressed<br />
le size of approximately 1.3 GB. The BAM encoded data set was about 14% larger compared<br />
to MapList, <strong>and</strong> about 9% larger than the SAM formatted data compressed as SHORE-GZIP<br />
at 128 kB block size. When applying reference-based compression only, size of CRAM encoded<br />
data ranked slightly better than the MapList le compressed in XZ format at 2 MB block size<br />
or compressed as GZIP following block-wise transposition <strong>and</strong> reference-based compression, <strong>and</strong><br />
slightly worse than the 128 kB block size XZ-encoded le with block-wise transposition <strong>and</strong><br />
reference-based compression enabled. With additional reduction of quality value resolution,<br />
CRAM ranked somewhat worse than the XZ compressed MapList le for the same data encoded<br />
in 128 kB chunks with block-wise transposition.<br />
2.2.6 <strong>Data</strong> Storage Considerations<br />
Choice of sequencing data storage formats depends on three areas of application with overlapping,<br />
but distinct design goals. O-line data archiving requires primarily space-ecient representation.<br />
<strong>Data</strong> exchange formats must conform to a highly st<strong>and</strong>ardized <strong>and</strong> stable format specication <strong>and</strong><br />
allow combined distribution of all data <strong>and</strong> meta-data in a single le, while directly supporting<br />
simple analysis <strong>and</strong> processing tasks. Finally, on-line storage solutions should be optimized<br />
regarding the trade-o between space eciency <strong>and</strong> certain access guarantees to provide optimal<br />
support for the respective approaches to analysis.<br />
With the implementation of the SHORE storage formats, we demonstrate indexable, space ef-<br />
cient on-line data storage for high-throughput sequencing on the basis of simple text-based le<br />
format denitions <strong>and</strong> stock compression formats. While binary formats dedicated to sequencing<br />
data storage such as CRAM should permit further optimization for improved performance<br />
<strong>and</strong> space eciency, the outlined simple formats <strong>and</strong> pre-processing methods should provide a<br />
valuable benchmark for future eorts.<br />
The implemented indexing solution for compressed sequencing data les forms the basis for<br />
rapid data retrieval for iterative analysis <strong>and</strong> data set exploration. Furthermore, our indexing<br />
approach should facilitate the implementation of caching strategies e. g. for implementation of<br />
fully interactive visualization of large sequencing data sets.