
supporting the same logical structure and set of features. The main features of CRAM are reduced representations of aligned read sequences and per-base qualities. For aligned sequences, the format implements reference compression: in reference-compressed alignments, only the lengths of read subsequences that perfectly match the reference genome, or an equivalent representation thereof, are stored. While in principle a lossy operation, all discarded information may subsequently be recovered as long as the appropriate reference sequence is available.
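The CRAM specification defines its own record layout; the following is only a minimal sketch of the reference compression idea, encoding a gapless alignment as alternating lengths of perfect reference matches and substituted bases. The function names and the (length, base) pair representation are assumptions for illustration, not CRAM's actual encoding.

    def ref_encode(read, ref, pos):
        """Encode `read`, aligned gaplessly at ref[pos:], as (match_length, substitution)
        pairs. Hypothetical illustration of reference compression, not CRAM's record layout."""
        ops, run = [], 0
        for i, base in enumerate(read):
            if base == ref[pos + i]:
                run += 1
            else:
                ops.append((run, base))  # length of perfect match, then the differing base
                run = 0
        ops.append((run, None))          # trailing perfect match
        return ops

    def ref_decode(ops, ref, pos):
        """Recover the original read; lossless as long as the same reference is available."""
        out, i = [], pos
        for run, sub in ops:
            out.append(ref[i:i + run])
            i += run
            if sub is not None:
                out.append(sub)
                i += 1
        return "".join(out)

    ref  = "ACGTACGTACGT"
    read = "ACGAACGT"               # one mismatch at offset 3
    ops  = ref_encode(read, ref, 0)  # [(3, 'A'), (4, None)]
    assert ref_decode(ops, ref, 0) == read

Decoding walks the reference using the stored match lengths, which is why the operation is lossless only while the same reference remains available.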

For PHRED quality data, CRAM offers a choice of various lossy binning schemes aimed at reducing the entropy introduced by the read qualities.
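As a hedged illustration of such a binning scheme, the sketch below quantizes PHRED scores to eight representative values, loosely in the spirit of Illumina's 8-bin proposal; the exact bin boundaries and representatives here are assumptions for the example, not values mandated by CRAM.

    # Illustrative 8-bin quality quantization; boundaries and representative
    # values are assumptions for this example, not a normative CRAM scheme.
    BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
            (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

    def bin_quality(q):
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                return rep
        raise ValueError(f"PHRED score out of range: {q}")

    print([bin_quality(q) for q in (2, 11, 38, 40)])  # [6, 15, 37, 40]

Collapsing the roughly forty distinct PHRED symbols to eight lowers the entropy of the quality stream, which is what makes the subsequent entropy coding stage more effective.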

Following such preprocessing operations, CRAM encodes data into a custom binary format utilizing well-known encoding schemes such as Golomb [121] and Huffman coding [122].
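As a minimal sketch of this kind of encoding, the following implements Rice coding, the power-of-two special case of Golomb coding: the quotient is written in unary and the remainder in a fixed number of binary bits. Strings of '0'/'1' characters stand in for a packed bit stream purely for readability.

    def rice_encode(n, k):
        """Rice code (Golomb with M = 2**k): quotient in unary, remainder in k bits."""
        q, r = n >> k, n & ((1 << k) - 1)
        return "1" * q + "0" + format(r, f"0{k}b")

    def rice_decode(bits, k):
        q = 0
        while bits[q] == "1":             # unary quotient: count leading ones
            q += 1
        r = int(bits[q + 1:q + 1 + k], 2)  # fixed-width binary remainder
        return (q << k) | r

    assert rice_encode(9, 2) == "11001"   # 9 = 2*4 + 1 -> "110" + "01"
    assert rice_decode("11001", 2) == 9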

1.6.2 Accelerated Queries on Sequencing Data

Support for accelerated queries on a data set is required to enable rapid data visualization [123, 124] and data set exploration. Indexed range queries are supported, e.g., by the BigWig and BigBed formats [123], by the BAM format, and by the utility Tabix [125], using R-tree-based [126] indexing [124].
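The cited formats implement their own R-tree variants; as a minimal sketch of the underlying idea under simplified assumptions, the following one-dimensional index buckets features into fixed-size genomic bins so that a range query only inspects candidate bins rather than the whole data set. The flat bin size is an assumption for the example; the real schemes use hierarchical bins.

    from collections import defaultdict

    BIN_SIZE = 16_384  # assumed flat granularity for the sketch

    class IntervalIndex:
        """Toy 1-D binning index for range queries over half-open (start, end) features."""
        def __init__(self):
            self.bins = defaultdict(list)

        def insert(self, start, end, payload):
            # register the feature in every bin its interval touches
            for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
                self.bins[b].append((start, end, payload))

        def query(self, qstart, qend):
            # inspect only candidate bins; the overlap test filters false positives
            hits = set()
            for b in range(qstart // BIN_SIZE, qend // BIN_SIZE + 1):
                for iv in self.bins[b]:
                    if iv[0] < qend and iv[1] > qstart:
                        hits.add(iv)
            return sorted(hits)

    idx = IntervalIndex()
    idx.insert(100, 250, "read1")
    idx.insert(30_000, 30_800, "read2")
    print(idx.query(0, 1_000))  # [(100, 250, 'read1')]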

While stock generic compression algorithms permit streaming encoding and decoding and are thus compatible with the goal of sequential, ordered traversal of data sets, they do not comply with random access requirements. This disadvantage is commonly mitigated by enabling seeking to defined positions in the stream. These decoding offsets are realized by periodically either saving the complete state of the encoder, or by causing an encoder reset. In both cases, the cost is an effective decrease in compression efficiency, either because of the overhead of saving the series of encoder states along with the compressed data, or because the encoder needs to be retrained following each reset. Additionally, an index data structure must be integrated to allow conversion between compressed and uncompressed stream offsets. As a result, there is a tradeoff between compression ratio and random access granularity, and thus efficiency.
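A minimal sketch of this tradeoff, assuming zlib as the stock compressor: the input is cut into independently compressed blocks, each block boundary amounting to an encoder reset, and an index of (uncompressed offset, compressed offset) pairs permits seeking. Shrinking the block size yields finer random access granularity at the cost of a worse compression ratio; the framing and index layout are assumptions for this example.

    import bisect, zlib

    BLOCK = 64 * 1024  # block size controls the granularity/ratio tradeoff

    def compress_indexed(data):
        """Compress `data` in independent blocks and build a seek index of
        (uncompressed_offset, compressed_offset) pairs. Every block is a full
        encoder reset, so per-block overhead lowers the compression ratio."""
        out, index, coff = bytearray(), [], 0
        for uoff in range(0, len(data), BLOCK):
            block = zlib.compress(data[uoff:uoff + BLOCK])
            index.append((uoff, coff))
            out += len(block).to_bytes(4, "little") + block  # length-prefixed framing
            coff += 4 + len(block)
        return bytes(out), index

    def read_at(stream, index, uoff):
        """Decode only the single block containing uncompressed offset `uoff`."""
        i = bisect.bisect_right([u for u, _ in index], uoff) - 1
        block_uoff, coff = index[i]
        clen = int.from_bytes(stream[coff:coff + 4], "little")
        block = zlib.decompress(stream[coff + 4:coff + 4 + clen])
        return block[uoff - block_uoff:]

    data = b"x" * 200_000
    stream, index = compress_indexed(data)
    assert read_at(stream, index, 150_000)[:1] == b"x"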

1.7 Contributions of this Work

The software SHORE is a processing and analysis pipeline for high-throughput DNA sequencing data [1, 18]. The pipeline is targeted at a reference sequence-based resequencing data analysis approach, combining raw read manipulation and filtering, read mapping and several analysis modules specific to different sequencing applications.

With this work, large parts of our pipeline were reworked to cope with increasingly large amounts of data, to optimize workflows and to meet extended analysis requirements. In the process, SHORE was embedded into a generic programming framework to provide universal support of introduced features across all processing modules (section 3.1). Increasing storage sizes of data sets produced by high-throughput sequencing devices have been addressed by implementing a variety of data compression options. At the same time, increased processing times were countered by extended parallel processing facilities, and specifically distributed memory parallelization of the read mapping process. Furthermore, with greater data set size, efficiency of retrieval of specific data set elements or subsets has gained importance for data set exploration and visualization. To meet these requirements, appropriate data indexing and query algorithms were incorporated. Exploiting the data set indexing facilities, several analysis modules address basic visualization of sequencing read alignments and depth of coverage distributions, mitigating the need for time-consuming data set copying and conversion for use with external visualization solutions.
