An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
3.1. OVERVIEW
tation and command line parsing through an auxiliary class av_parser and incorporates an
object of class uncaught_handler to provide handling and reporting of fatal error conditions
encountered by the application.
Classes istreams and ostreams of the stream package simplify handling of files or other input
sources or output destinations. Interpretation of special location strings or prefixes facilitates specification
of special files or other sources or destinations of input and output as option arguments or
option default values, e.g. standard input, standard output or Unix pipes. Input or output locations
compressed in GZIP, BAM or XZ format are delegated to appropriate helper stream classes for
automatic decoding, encoding or handling of random access requests (section 2.2.2). Both classes
additionally provide input/output error checking and further stream-related functionality. Class
ostreams furthermore implements simplified handling of temporary file destinations as well as
optional temporary file compression for improved handling of large amounts of temporary data.
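The interpretation of location strings can be pictured with the following minimal sketch. It is not the istreams or ostreams interface: the enumeration location_kind, the function classify_location, and the concrete conventions used here ("-" for the standard streams, a leading "|" for a pipe, file name extensions for compressed formats) are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical result of interpreting a location string.
enum class location_kind {
    standard_stream, pipe, gzip_file, xz_file, bam_file, plain_file
};

// True if s ends with the given suffix.
static bool ends_with(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size()
        && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Decide how a location should be opened. All conventions checked here
// are assumed for this sketch and are not taken from libshore.
location_kind classify_location(const std::string& loc)
{
    if (loc == "-")
        return location_kind::standard_stream;  // assumed stdin/stdout marker
    if (!loc.empty() && loc[0] == '|')
        return location_kind::pipe;             // assumed pipe prefix
    if (ends_with(loc, ".gz"))
        return location_kind::gzip_file;
    if (ends_with(loc, ".xz"))
        return location_kind::xz_file;
    if (ends_with(loc, ".bam"))
        return location_kind::bam_file;
    return location_kind::plain_file;
}
```

A stream class built on such a classification could then hand the location to the matching helper stream, so that application code only ever passes a single string per input or output.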
Package algo comprises dynamic programming alignment (section 2.5.2) as well as indexing,
sorting and query algorithms and associated persistent data structure implementations
(section 2.2.3). The platform for range indexing and queries is provided by a generic 2-d tree algorithm
template d2tree, which is further utilized by the class twodex that supplies a persistent
index data structure configurable for a variety of genomic data formats. Multiple algorithm templates
with the prefix suffix_ implement generic suffix array construction and queries [144, 145].
The suffix array algorithms are employed by objects of class suffix_index, which provides configurable
persistent disk indexes for genomic sequences that are capable of answering various types
of sequence match queries. To support up to 64 bit data at reasonable space efficiency, persistent
suffix_index and twodex index data are encoded and decoded non-byte-aligned at the respective
exact required bit width using class intpack of the container package.
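The idea of storing integers non-byte-aligned at an exact bit width can be sketched as follows. The class packed_uints and its members are hypothetical and do not reproduce the intpack interface; the sketch merely assumes a fixed width between 1 and 64 bits per value and 64 bit storage words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-width bit packer (not the intpack interface).
// Values are stored back to back, so a value may straddle two words.
class packed_uints
{
public:
    explicit packed_uints(unsigned width)
        : width_(width), mask_(make_mask(width)), size_(0) {}

    void push_back(std::uint64_t value)
    {
        value &= mask_;                              // keep only width_ bits
        const std::size_t bit = size_ * width_;      // first bit of the slot
        const std::size_t word = bit / 64;
        const unsigned offset = static_cast<unsigned>(bit % 64);
        const std::size_t needed = (bit + width_ + 63) / 64;
        if (words_.size() < needed)
            words_.resize(needed, 0);
        words_[word] |= value << offset;
        if (offset + width_ > 64)                    // spill into next word
            words_[word + 1] |= value >> (64 - offset);
        ++size_;
    }

    std::uint64_t operator[](std::size_t i) const
    {
        assert(i < size_);
        const std::size_t bit = i * width_;
        const std::size_t word = bit / 64;
        const unsigned offset = static_cast<unsigned>(bit % 64);
        std::uint64_t value = words_[word] >> offset;
        if (offset + width_ > 64)                    // fetch the spilled bits
            value |= words_[word + 1] << (64 - offset);
        return value & mask_;
    }

    std::size_t size() const { return size_; }
    std::size_t word_count() const { return words_.size(); }

private:
    static std::uint64_t make_mask(unsigned width)
    {
        assert(width >= 1 && width <= 64);
        return width == 64 ? ~std::uint64_t(0)
                           : (std::uint64_t(1) << width) - 1;
    }

    unsigned width_;
    std::uint64_t mask_;
    std::size_t size_;
    std::vector<std::uint64_t> words_;
};
```

With a width of 5 bits, for instance, 40 values occupy 200 bits and therefore four 64 bit words, compared to 40 bytes when each value is stored in its own byte.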
The libshore data processing infrastructure is defined and implemented as a set of templates
forming the processing package (section 3.2). Package parallel extends this infrastructure
to build an abstracted parallel processing interface (section 3.3).
Despite growing acceptance of data exchange specifications such as SAM/BAM for short read
mapping data, data format issues can present a major impediment to the application of algorithms
to input data from diverse sources. Therefore, the library includes support for a broad variety
of sequencing read, mapping and further sequencing and DNA-related input and output file formats
provided through the fmtio package. Format support is implemented as Reader and Writer
classes with a consistent interface modeled after the requirements of the data processing infrastructure.
To simplify working with various formats, the library provides multi-format readers read_reader
and alignment_reader for sequencing read and alignment data, which automatically
incorporate the appropriate Reader object for each respective input file.
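How a multi-format reader may pick a format-specific Reader behind a common interface is illustrated by the sketch below. It is not the read_reader implementation: the types simple_read, reader, fasta_reader and fastq_reader and the function make_read_reader are hypothetical, format detection is reduced to inspecting the first character of the input, and FASTA sequences are assumed to occupy a single line.

```cpp
#include <cassert>
#include <istream>
#include <memory>
#include <sstream>   // only needed for trying the sketch on in-memory input
#include <stdexcept>
#include <string>

// Hypothetical minimal read record; quality is empty for FASTA input.
struct simple_read { std::string id, seq, qual; };

// Common interface shared by all format-specific readers of this sketch.
class reader
{
public:
    virtual ~reader() = default;
    virtual bool next(simple_read& r) = 0;   // false at end of input
};

class fastq_reader : public reader
{
public:
    explicit fastq_reader(std::istream& in) : in_(in) {}
    bool next(simple_read& r) override
    {
        std::string header, plus;
        if (!std::getline(in_, header) || header.empty()) return false;
        if (header[0] != '@') throw std::runtime_error("malformed FASTQ header");
        if (!std::getline(in_, r.seq) || !std::getline(in_, plus)
            || !std::getline(in_, r.qual))
            throw std::runtime_error("truncated FASTQ record");
        r.id = header.substr(1);
        return true;
    }
private:
    std::istream& in_;
};

class fasta_reader : public reader   // assumes single-line sequences
{
public:
    explicit fasta_reader(std::istream& in) : in_(in) {}
    bool next(simple_read& r) override
    {
        std::string header;
        if (!std::getline(in_, header) || header.empty()) return false;
        if (header[0] != '>') throw std::runtime_error("malformed FASTA header");
        if (!std::getline(in_, r.seq))
            throw std::runtime_error("truncated FASTA record");
        r.id = header.substr(1);
        r.qual.clear();
        return true;
    }
private:
    std::istream& in_;
};

// Select the reader from the first character without consuming it.
std::unique_ptr<reader> make_read_reader(std::istream& in)
{
    switch (in.peek()) {
        case '>': return std::unique_ptr<reader>(new fasta_reader(in));
        case '@': return std::unique_ptr<reader>(new fastq_reader(in));
        default:  throw std::runtime_error("unrecognized read format");
    }
}
```

Client code written against the common interface only calls next() in a loop and remains unchanged when further formats are added to the selection function.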
File format parsing and decoding requires adequate in-memory representations for the respective
elements of the various types of data set, which are defined in the datatype package. Plain
data structures read and alignment provide the standard representation of short sequencing
read data, whereas large genomic sequences are stored as sequence_record objects with more
sophisticated internal memory management. Additionally, processing modules and utilities specific
to a certain type of data set element or certain properties of data set elements are
grouped under this package. For example, a class alignment_filter_pipe combines application
of a variety of frequently required filtering and editing modules to read mapping data. Parsing
and decoding the SHORE alignment string pair-wise alignment representation (section 2.2.4) into
CIGAR string-equivalent tokens is provided by class alignment_tokenizer. Class nucleotide
provides decoding, encoding and efficient manipulation of IUPAC encoded DNA bases.
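A common way to make manipulation of IUPAC encoded bases efficient is to represent each code as a 4 bit set over A, C, G and T. The sketch below shows this technique. It is not the interface of class nucleotide: the function names and the bit assignment A=1, C=2, G=4, T=8 are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>

// Assumed bit assignment: A=1, C=2, G=4, T=8. An ambiguity code is the
// bitwise OR of the bases it stands for; 0 marks an unknown character.
inline std::uint8_t iupac_encode(char c)
{
    switch (std::toupper(static_cast<unsigned char>(c))) {
        case 'A': return 1;
        case 'C': return 2;
        case 'G': return 4;
        case 'T': case 'U': return 8;
        case 'M': return 1 | 2;      // A or C
        case 'R': return 1 | 4;      // A or G
        case 'W': return 1 | 8;      // A or T
        case 'S': return 2 | 4;      // C or G
        case 'Y': return 2 | 8;      // C or T
        case 'K': return 4 | 8;      // G or T
        case 'V': return 1 | 2 | 4;  // not T
        case 'H': return 1 | 2 | 8;  // not G
        case 'D': return 1 | 4 | 8;  // not C
        case 'B': return 2 | 4 | 8;  // not A
        case 'N': return 15;         // any base
        default:  return 0;
    }
}

// Inverse mapping; index i holds the letter for bit set i.
inline char iupac_decode(std::uint8_t code)
{
    static const char table[] = "-ACMGRSVTWYHKDBN";
    return table[code & 15];
}

// Complement swaps A<->T and C<->G, i.e. reverses the four bits.
inline std::uint8_t iupac_complement(std::uint8_t code)
{
    return static_cast<std::uint8_t>(((code & 1) << 3) | ((code & 2) << 1)
                                   | ((code & 4) >> 1) | ((code & 8) >> 3));
}

// Two codes are compatible if they share at least one base.
inline bool iupac_compatible(std::uint8_t a, std::uint8_t b)
{
    return (a & b) != 0;
}
```

With this encoding, a compatibility test between two possibly ambiguous bases is a single bitwise AND, and complementing requires no table lookup.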
The additional statistics package supplies the basis for statistical hypothesis tests based on<br />
a variety of distributions. The complementary class mtc re-implements various multiple hypothesis<br />
test correction procedures in C++ following the R multtest package implementation [146].
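As an illustration of a multiple hypothesis test correction procedure, the following sketch computes Benjamini-Hochberg adjusted p-values, one widely used procedure of this kind. It does not reproduce class mtc: the function name bh_adjust is hypothetical, and which procedures mtc offers and how it exposes them is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Benjamini-Hochberg adjusted p-values, returned in input order.
// With the m raw p-values sorted ascending, the k-th smallest (1-based)
// is scaled by m / k; a running minimum taken from the largest p-value
// downwards keeps the adjusted values monotone and at most 1.
std::vector<double> bh_adjust(const std::vector<double>& p)
{
    const std::size_t m = p.size();

    // Indices of p sorted by ascending p-value.
    std::vector<std::size_t> order(m);
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::sort(order.begin(), order.end(),
              [&p](std::size_t a, std::size_t b) { return p[a] < p[b]; });

    std::vector<double> adjusted(m);
    double running_min = 1.0;
    for (std::size_t k = m; k-- > 0; ) {
        const double scaled = p[order[k]] * static_cast<double>(m)
                            / static_cast<double>(k + 1);
        running_min = std::min(running_min, scaled);
        adjusted[order[k]] = running_min;
    }
    return adjusted;
}
```

For the raw p-values 0.01, 0.04, 0.03 and 0.005, the scaled values in ascending rank order are 0.02, 0.02, 0.04 and 0.04, so the adjusted values in input order are 0.02, 0.04, 0.04 and 0.02.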