An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
1.7. CONTRIBUTIONS OF THIS WORK 15<br />
supporting the same logical structure <strong>and</strong> set of features. The main features of CRAM are reduced<br />
representations of aligned read sequences <strong>and</strong> per-base qualities. For aligned sequences, the<br />
format implements reference compression. In reference compressed alignments only the lengths<br />
of read subsequences that perfectly match the reference genome are stored, or an equivalent<br />
representation. While in principle a lossy operation, all discarded information may subsequently<br />
be recovered as long as the appropriate reference sequence is available. For PHRED quality<br />
data, CRAM oers a choice of various lossy binning schemes aimed at reduction of the entropy<br />
introduced with the read qualities. Following such preprocessing operations, CRAM encodes data<br />
into a custom binary format utilizing well-known encoding schemes such as Golomb [121] <strong>and</strong><br />
Human coding [122].<br />
1.6.2 Accelerated Queries on Sequencing <strong>Data</strong><br />
Support for accelerated queries on a data set is required to enable rapid data visualization<br />
[123, 124] <strong>and</strong> data set exploration. Indexed range queries are e. g. supported by the<br />
BigWig <strong>and</strong> BigBed formats [123], the BAM format or the utility Tabix [125] using R-tree [126]<br />
based indexing [124].<br />
While stock generic compression algorithms permit streaming encoding <strong>and</strong> decoding <strong>and</strong> are<br />
thus compatible with the goal of sequential, ordered traversal of data sets, they do not comply<br />
with r<strong>and</strong>om access requirements. This disadvantage is commonly mitigated by enabling seeking<br />
to dened positions in the stream. These decoding osets are realized by periodically either<br />
saving the complete state of the encoder, or by causing an encoder reset. In both cases, the<br />
cost is an eective decrease in compression eciency, either because of the overhead of saving<br />
the series of encoder states along with the compressed data, or because the encoder needs to be<br />
retrained following each reset. Additionally, an index data structure must be integrated to allow<br />
conversion between compressed <strong>and</strong> uncompressed stream osets. As a result, there is a tradeo<br />
between compression ratio <strong>and</strong> r<strong>and</strong>om access granularity, <strong>and</strong> thus eciency.<br />
1.7 Contributions of this Work<br />
The software SHORE is a processing <strong>and</strong> analysis pipeline for high-throughput DNA sequencing<br />
data [1, 18]. The pipeline is targeted at a reference sequence-based resequencing data analysis<br />
approach, combining raw read manipulation <strong>and</strong> ltering, read mapping <strong>and</strong> several analysis<br />
modules specic to dierent sequencing applications.<br />
With this work, large parts of our pipeline were reworked for coping with increasingly large<br />
amounts of data, optimization of workows <strong>and</strong> extended analysis requirements. In the process,<br />
SHORE was embedded into a generic programming framework to provide universal support of<br />
introduced features across all processing modules (section 3.1). Increasing storage sizes of data<br />
sets produced by high-throughput sequencing devices have been addressed by implementation of a<br />
variety of data compression options. At the same time, increased processing times were countered<br />
by extended parallel processing facilities, <strong>and</strong> specically distributed memory parallelization<br />
of the read mapping process. Furthermore, with greater data set size, eciency of retrieval<br />
of specic data set elements or subsets has gained importance for data set exploration <strong>and</strong><br />
visualization. To meet these requirements, appropriate data indexing <strong>and</strong> query algorithms were<br />
incorporated. Exploiting the data set indexing facilities, within several analysis modules we<br />
address basic visualization of sequencing read alignment <strong>and</strong> depth of coverage distribution to<br />
mitigate the need for time-consuming data set copying <strong>and</strong> conversion for use with external<br />
visualization solutions.