28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

3.1. OVERVIEW

tation and command line parsing through an auxiliary class av_parser and incorporates an object of class uncaught_handler to provide handling and reporting of fatal error conditions encountered by the application.

Classes istreams and ostreams of the stream package simplify handling of files or other input sources or output destinations. Interpretation of special location strings or prefixes facilitates specification of special files or other sources or destinations of input and output, such as standard input, standard output or Unix pipes, as option arguments or option default values. Input or output locations compressed in GZIP, BAM or XZ format are delegated to appropriate helper stream classes for automatic decoding, encoding or handling of random access requests (section 2.2.2). Both classes additionally provide input/output error checking and further stream-related functionality. Class ostreams furthermore implements simplified handling of temporary file destinations as well as optional temporary file compression for improved handling of large amounts of temporary data.
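As a rough illustration of how a location string might be mapped to a stream type, the following sketch classifies a location by prefix and file extension. The conventions used here ("-" for the standard streams, a leading "|" for pipes, selection of compression by extension) and the names Location and classify_location are assumptions made for this example; the text does not specify the prefixes actually recognized by istreams and ostreams.

```cpp
#include <string>

// Hypothetical categories; not the actual libshore types.
enum class Location { StdStream, Pipe, Gzip, Xz, Bam, PlainFile };

// Returns true if `s` ends with `suffix`.
inline bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Classify a location string using assumed conventions:
//   "-"                 standard input or standard output
//   "|command"          a Unix pipe to or from `command`
//   *.gz, *.xz, *.bam   compressed files, selected by extension
inline Location classify_location(const std::string& loc) {
    if (loc == "-") return Location::StdStream;
    if (!loc.empty() && loc[0] == '|') return Location::Pipe;
    if (ends_with(loc, ".gz")) return Location::Gzip;
    if (ends_with(loc, ".xz")) return Location::Xz;
    if (ends_with(loc, ".bam")) return Location::Bam;
    return Location::PlainFile;
}
```

A stream wrapper could then select the matching helper stream class from the returned category, so that option arguments and option defaults share one notation.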

Package algo comprises dynamic programming alignment (section 2.5.2) as well as indexing, sorting and query algorithms and associated persistent data structure implementations (section 2.2.3). The platform for range indexing and queries is provided by a generic 2-d tree algorithm template d2tree, which is further utilized by the class twodex that supplies a persistent index data structure configurable for a variety of genomic data formats. Multiple algorithm templates with the prefix suffix_ implement generic suffix array construction and queries [144, 145]. The suffix array algorithms are employed by objects of class suffix_index, which provides configurable persistent disk indexes for genomic sequences capable of answering various types of sequence match queries. To support up to 64 bit data at reasonable space efficiency, persistent suffix_index and twodex index data are encoded and decoded non-byte aligned, using exactly the bit width required for the respective data, utilizing class intpack of the container package.
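The combination of a suffix array with non-byte-aligned integer storage can be sketched as follows. This is a minimal illustration and not the libshore implementation: packed_ints is a hypothetical stand-in for intpack, construction uses a plain comparison sort instead of the cited algorithms [144, 145], and all function names are chosen for this example only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Number of bits needed to represent max_value (at least 1, at most 64).
inline unsigned bit_width_for(std::uint64_t max_value) {
    unsigned w = 1;
    while (w < 64 && (max_value >> w) != 0) ++w;
    return w;
}

// Hypothetical packed container: stores values back to back using `width`
// bits each (1 <= width <= 64), ignoring byte and word boundaries.
class packed_ints {
public:
    explicit packed_ints(unsigned width) : width_(width), size_(0) {}

    void push_back(std::uint64_t value) {
        const std::size_t bit = size_ * width_;
        const std::size_t word = bit / 64;
        const unsigned offset = static_cast<unsigned>(bit % 64);
        if (words_.size() < word + 2) words_.resize(word + 2, 0);
        value &= mask();
        words_[word] |= value << offset;
        // Spill the high bits into the next word if the value straddles it.
        if (offset + width_ > 64) words_[word + 1] |= value >> (64 - offset);
        ++size_;
    }

    std::uint64_t operator[](std::size_t i) const {
        const std::size_t bit = i * width_;
        const std::size_t word = bit / 64;
        const unsigned offset = static_cast<unsigned>(bit % 64);
        std::uint64_t value = words_[word] >> offset;
        if (offset + width_ > 64) value |= words_[word + 1] << (64 - offset);
        return value & mask();
    }

    std::size_t size() const { return size_; }

private:
    std::uint64_t mask() const {
        return width_ >= 64 ? ~std::uint64_t{0}
                            : (std::uint64_t{1} << width_) - 1;
    }
    unsigned width_;
    std::size_t size_;
    std::vector<std::uint64_t> words_;
};

// Naive construction: sort suffix start positions lexicographically, then
// store them with just enough bits to hold the largest position.
inline packed_ints build_suffix_array(const std::string& text) {
    std::vector<std::size_t> order(text.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&text](std::size_t a, std::size_t b) {
                  return text.compare(a, std::string::npos,
                                      text, b, std::string::npos) < 0;
              });
    packed_ints sa(bit_width_for(text.empty() ? 0 : text.size() - 1));
    for (std::size_t pos : order) sa.push_back(pos);
    return sa;
}

// Half-open range [first, last) of suffix array slots whose suffixes start
// with `pattern`, found by two binary searches.
inline std::pair<std::size_t, std::size_t> match_range(
        const std::string& text, const packed_ints& sa,
        const std::string& pattern) {
    const std::size_t m = pattern.size();
    std::size_t lo = 0, hi = sa.size();
    while (lo < hi) {  // first slot with prefix >= pattern
        const std::size_t mid = lo + (hi - lo) / 2;
        if (text.compare(sa[mid], m, pattern) < 0) lo = mid + 1; else hi = mid;
    }
    const std::size_t first = lo;
    hi = sa.size();
    while (lo < hi) {  // first slot with prefix > pattern
        const std::size_t mid = lo + (hi - lo) / 2;
        if (text.compare(sa[mid], m, pattern) <= 0) lo = mid + 1; else hi = mid;
    }
    return {first, lo};
}
```

For a text of six characters, each position occupies three bits instead of a full 64 bit word. The same packing scheme extends to positions in sequences too long for 32 bit offsets, which is the space trade-off the paragraph above describes.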

The libshore data processing infrastructure is defined and implemented as a set of templates forming the processing package (section 3.2). Package parallel extends this infrastructure to build an abstracted parallel processing interface (section 3.3).

Despite growing acceptance of data exchange specifications such as SAM/BAM for short read mapping data, data format issues can present a major impediment to applying algorithms to input data from diverse sources. Therefore, the library includes support for a broad variety of sequencing read, mapping and further sequencing and DNA-related input and output file formats, provided through the fmtio package. Format support is implemented as Reader and Writer classes with a consistent interface modeled after the requirements of the data processing infrastructure. To simplify working with various formats, the library provides multi-format readers read_reader and alignment_reader for sequencing read and alignment data, which automatically instantiate the appropriate Reader object for each input file.

File format parsing and decoding requires an adequate in-memory representation for the elements of each type of data set; these representations are defined in the datatype package. Plain data structures read and alignment provide the standard representation of short sequencing read data, whereas large genomic sequences are stored as sequence_record objects with more sophisticated internal memory management. Additionally, processing modules and utilities specific to a certain type of data set element, or to certain properties of data set elements, are also grouped under this package. For example, a class alignment_filter_pipe combines a variety of frequently required filtering and editing modules for read mapping data. Parsing and decoding of the SHORE alignment string representation of pairwise alignments (section 2.2.4) into CIGAR string-equivalent tokens is provided by class alignment_tokenizer. Class nucleotide provides decoding, encoding and efficient manipulation of IUPAC encoded DNA bases.
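One common way to make IUPAC codes cheap to manipulate is to represent each code as a 4-bit set with one bit per base. The sketch below assumes that encoding; the text does not state how class nucleotide represents bases internally, and the function names are invented for this example.

```cpp
#include <cstdint>

// Assumed encoding: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
// Ambiguity codes are the bitwise OR of the bases they stand for.
inline std::uint8_t iupac_encode(char c) {
    switch (c) {
        case 'A': return 0x1;
        case 'C': return 0x2;
        case 'G': return 0x4;
        case 'T': return 0x8;
        case 'M': return 0x3;  // A or C
        case 'R': return 0x5;  // A or G
        case 'W': return 0x9;  // A or T
        case 'S': return 0x6;  // C or G
        case 'Y': return 0xA;  // C or T
        case 'K': return 0xC;  // G or T
        case 'V': return 0x7;  // A, C or G
        case 'H': return 0xB;  // A, C or T
        case 'D': return 0xD;  // A, G or T
        case 'B': return 0xE;  // C, G or T
        case 'N': return 0xF;  // any base
        default:  return 0x0;  // gap or unknown character
    }
}

// Inverse mapping: the table is indexed by the 4-bit set.
inline char iupac_decode(std::uint8_t mask) {
    static const char table[] = "-ACMGRSVTWYHKDBN";
    return table[mask & 0xF];
}

// Complement by reversing the four bits: A <-> T and C <-> G.
inline std::uint8_t iupac_complement(std::uint8_t mask) {
    return static_cast<std::uint8_t>(((mask & 0x1) << 3) | ((mask & 0x2) << 1) |
                                     ((mask & 0x4) >> 1) | ((mask & 0x8) >> 3));
}

// Two codes are compatible if they share at least one possible base.
inline bool iupac_compatible(std::uint8_t a, std::uint8_t b) {
    return (a & b) != 0;
}
```

With this layout, merging two observed bases into an ambiguity code is a bitwise OR, and complementing an ambiguity code needs no lookup table.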

The additional statistics package supplies the basis for statistical hypothesis tests based on a variety of distributions. The complementary class mtc re-implements various multiple hypothesis test correction procedures in C++, following the implementation in the R multtest package [146].
