An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
3 A C++ Framework for High-Throughput DNA Sequencing
In a large, multi-purpose data analysis software solution, there are inevitably many
recurring processing tasks and algorithmic challenges. To avoid large amounts of
complex, redundant and error-prone code, the common subproblems must be isolated
as reusable modules. With the SHORE DNA sequencing data analysis suite, we have
striven to implement frequently required segments of code using consistent design patterns
and interfaces. The resulting C++ programming framework libshore constitutes
the subject of this chapter.
3.1 Overview<br />
High-throughput sequencing has been adopted as a versatile tool applicable to a wide range of
different scientific questions (section 1.3). While data analysis for different types of application
must be approached from unique angles (section 1.5), the underlying methods and algorithms
often overlap considerably in their elementary building blocks.
Reading, parsing and formatted output of sequencing data in standard storage formats is a
near-universal requirement. Additionally, subsetting data sets by defined properties, achievable
through different combinations of simple accept-reject filtering operations, is often a powerful
tool for adjusting an analysis' sensitivity-specificity tradeoff. More complex operations rely
on the relationship between multiple data set elements, such as the removal of PCR duplicate
sequences from read alignment data, or alter certain properties of the data set elements themselves.
Such operations are sensitive to the order in which they are applied, and are therefore potentially
useful not only in various combinations, but also in different orders of application. Finally, entire
analysis processes can be modeled as modules or series of modules transforming the type of the
data, for example read alignments into depth of coverage, or read alignments into positional
pileups into variant calls.
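The difference between combinable accept-reject filters and order-sensitive operations can be illustrated with a minimal sketch. All names used here (`Alignment`, `Filter`, `require_all`, `filter_records`, `remove_duplicates`) are hypothetical and chosen for illustration only; they are not taken from libshore, and the duplicate criterion (same chromosome, start position and strand) is a simplifying assumption.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical, simplified alignment record; not a libshore type.
struct Alignment {
    int chromosome;   // reference sequence index
    long start;       // leftmost aligned reference position
    int length;       // aligned length in bases
    int mismatches;   // number of mismatching bases
    bool forward;     // strand of the alignment
};

// An accept-reject filter maps one record to keep (true) or drop (false).
using Filter = std::function<bool(const Alignment&)>;

// Combine filters: a record is kept only if every filter accepts it.
Filter require_all(std::vector<Filter> filters) {
    return [filters = std::move(filters)](const Alignment& a) {
        for (const Filter& f : filters)
            if (!f(a)) return false;
        return true;
    };
}

Filter max_mismatches(int limit) {
    assert(limit >= 0);
    return [limit](const Alignment& a) { return a.mismatches <= limit; };
}

Filter min_length(int limit) {
    assert(limit >= 0);
    return [limit](const Alignment& a) { return a.length >= limit; };
}

// Apply a filter to a whole data set, preserving input order.
std::vector<Alignment> filter_records(const std::vector<Alignment>& in,
                                      const Filter& keep) {
    std::vector<Alignment> out;
    for (const Alignment& a : in)
        if (keep(a)) out.push_back(a);
    return out;
}

// Duplicate removal depends on the relationship between records: only the
// first record seen for each (chromosome, start, strand) key is kept.
// This key is an assumption made for the sketch.
std::vector<Alignment> remove_duplicates(const std::vector<Alignment>& in) {
    std::set<std::tuple<int, long, bool>> seen;
    std::vector<Alignment> out;
    for (const Alignment& a : in)
        if (seen.insert(std::make_tuple(a.chromosome, a.start, a.forward)).second)
            out.push_back(a);
    return out;
}
```

Because each per-record filter looks at one record in isolation, filters combined with `require_all` give the same result in any order. Duplicate removal does not commute with them: if the first of two duplicates has many mismatches, filtering before duplicate removal keeps the second copy, whereas duplicate removal before filtering discards both.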
While isolation and encapsulation of subproblems facilitate correct implementation and comprehensibility
of each individual step, the logic required to tie components into end-to-end processing
pipelines can itself become extensive. For versatile modes of application, modularized
components must therefore be embedded into a more generic framework capable of providing the
glue between input, output and data filtering, manipulation and transformation.
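As a rough illustration of such glue, the following sketch ties a filtering step to a module that transforms the type of the data, here read alignments into depth of coverage. The names `AlignedRead`, `depth_of_coverage` and `run_pipeline` are hypothetical and do not reflect the libshore interfaces; the sketch also assumes a single reference sequence and alignments without gaps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified alignment record; not a libshore type.
struct AlignedRead {
    long start;       // leftmost aligned reference position (0-based)
    int length;       // aligned length in bases
    int mismatches;   // number of mismatching bases
};

// Transformation module: read alignments -> per-position depth of coverage.
// Each read adds +1 at its start and -1 just past its end; a running sum
// over these differences yields the depth at every reference position.
std::vector<int> depth_of_coverage(const std::vector<AlignedRead>& reads,
                                   std::size_t reference_length) {
    const long ref = static_cast<long>(reference_length);
    std::vector<int> delta(reference_length + 1, 0);
    for (const AlignedRead& r : reads) {
        assert(r.start >= 0 && r.length >= 0);
        // Clip reads that extend past the end of the reference.
        const long begin = std::min(r.start, ref);
        const long end = std::min(r.start + r.length, ref);
        ++delta[static_cast<std::size_t>(begin)];
        --delta[static_cast<std::size_t>(end)];
    }
    std::vector<int> depth(reference_length, 0);
    int running = 0;
    for (std::size_t i = 0; i < reference_length; ++i) {
        running += delta[i];
        depth[i] = running;
    }
    return depth;
}

// Generic glue: filter the input, then hand the surviving records to a
// transformation whose result type may differ from the input type.
template <typename In, typename Keep, typename Transform>
auto run_pipeline(const std::vector<In>& input, Keep keep, Transform transform) {
    std::vector<In> kept;
    for (const In& item : input)
        if (keep(item)) kept.push_back(item);
    return transform(kept);
}
```

A caller would pass a predicate such as "at most one mismatch" as `keep` and a lambda wrapping `depth_of_coverage` as `transform`. The glue itself does not need to know anything about alignments or coverage, which is the point of separating it from the individual modules.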
The libshore C++ framework code is loosely categorized into ten different packages including
application framework functionality, generic data processing infrastructure and sequencing-specific
processing modules and data structures. With figure 3.1 we present a package overview in
a simplified UML-like representation, illustrating exemplary classes and associations.
The base package comprises elementary low-level functionality and utilities as well as machine-
or system-dependent blocks of code. While its functionality is required for most other packages,
it is self-contained and does not by itself depend on other parts of the library.
The class program from the package of the same name constitutes the base class of all SHORE
command line utilities. The base class combines command line interface definition, documen-