28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

32 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />

157 MB at the 64 MB conguration. This corresponds to advantages of 21% <strong>and</strong> 22% over the<br />

matching untransformed representations.<br />

With XZ/LZMA compression, compression ratios beneted slightly less from block-wise transposition,<br />

at the three dierent block sizes gaining 8%, 12% <strong>and</strong> 17% over untransposed data to<br />

obtain le sizes of 150 MB, 123 MB <strong>and</strong> 96 MB, respectively.<br />

We further list the eect of additionally applying various combinations of reference-based<br />

compression, read relabeling <strong>and</strong> quality resolution for transposed data encoded in GZIP format<br />

with 2 MB block size as well as encoded in XZ format at the three dierent block sizes. In<br />

all compression formats <strong>and</strong> block sizes assessed, reference-based compression could account for<br />

11% to 12% of additional storage space savings. Added relabeling of reads resulted in le sizes<br />

reduced by further 20% in SHORE-GZIP format <strong>and</strong> by over 28% in the three XZ encoded les.<br />

Reduction of quality resolution gave similar improvements, producing slightly higher savings<br />

than read relabeling in GZIP compressed data, but had somewhat lesser impact compared to<br />

relabeling in XZ encoding.<br />

For comparison of formats, table 2.2 also includes disk usage for the same data set converted<br />

to SAM, BAM <strong>and</strong> CRAM le formats. The SAM format representation of the data set entries<br />

was slightly more verbose compared to the MapList formatting, resulting in an uncompressed<br />

le size of approximately 1.3 GB. The BAM encoded data set was about 14% larger compared<br />

to MapList, <strong>and</strong> about 9% larger than the SAM formatted data compressed as SHORE-GZIP<br />

at 128 kB block size. When applying reference-based compression only, size of CRAM encoded<br />

data ranked slightly better than the MapList le compressed in XZ format at 2 MB block size<br />

or compressed as GZIP following block-wise transposition <strong>and</strong> reference-based compression, <strong>and</strong><br />

slightly worse than the 128 kB block size XZ-encoded le with block-wise transposition <strong>and</strong><br />

reference-based compression enabled. With additional reduction of quality value resolution,<br />

CRAM ranked somewhat worse than the XZ compressed MapList le for the same data encoded<br />

in 128 kB chunks with block-wise transposition.<br />

2.2.6 <strong>Data</strong> Storage Considerations<br />

Choice of sequencing data storage formats depends on three areas of application with overlapping,<br />

but distinct design goals. O-line data archiving requires primarily space-ecient representation.<br />

<strong>Data</strong> exchange formats must conform to a highly st<strong>and</strong>ardized <strong>and</strong> stable format specication <strong>and</strong><br />

allow combined distribution of all data <strong>and</strong> meta-data in a single le, while directly supporting<br />

simple analysis <strong>and</strong> processing tasks. Finally, on-line storage solutions should be optimized<br />

regarding the trade-o between space eciency <strong>and</strong> certain access guarantees to provide optimal<br />

support for the respective approaches to analysis.<br />

With the implementation of the SHORE storage formats, we demonstrate indexable, space ef-<br />

cient on-line data storage for high-throughput sequencing on the basis of simple text-based le<br />

format denitions <strong>and</strong> stock compression formats. While binary formats dedicated to sequencing<br />

data storage such as CRAM should permit further optimization for improved performance<br />

<strong>and</strong> space eciency, the outlined simple formats <strong>and</strong> pre-processing methods should provide a<br />

valuable benchmark for future eorts.<br />

The implemented indexing solution for compressed sequencing data les forms the basis for<br />

rapid data retrieval for iterative analysis <strong>and</strong> data set exploration. Furthermore, our indexing<br />

approach should facilitate the implementation of caching strategies e. g. for implementation of<br />

fully interactive visualization of large sequencing data sets.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!