28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

3 A C++ Framework for High-Throughput DNA Sequencing

In a large, multi-purpose data analysis software solution, there are inevitably many recurring processing tasks and algorithmic challenges. To avoid large amounts of complex, redundant and error-prone code, the common subproblems must be isolated as reusable modules. With the SHORE DNA sequencing data analysis suite, we have striven to implement frequently required segments of code using consistent design patterns and interfaces. The resulting C++ programming framework libshore constitutes the subject of this chapter.

3.1 Overview

High-throughput sequencing has been adopted as a versatile tool applicable to a wide range of different scientific questions (section 1.3). While data analysis for different types of application must be approached from unique angles (section 1.5), the underlying methods and algorithms often overlap considerably in their elementary building blocks.

Reading, parsing and formatted output of sequencing data in standard storage formats is a near-universal requirement. Additionally, subsetting data sets by defined properties, achievable through different combinations of simple accept-reject filtering operations, is often a powerful tool for adjusting the sensitivity-specificity tradeoff of an analysis. More complex operations either rely on the relationship between multiple data set elements, such as the removal of PCR duplicate sequences from read alignment data, or alter certain properties of the data set elements themselves. Their results are therefore sensitive to the order of application, and they are potentially useful not only in various combinations but also in different orders. Finally, entire analysis processes can be modeled as modules or series of modules that transform the type of the data, for example read alignments into depth of coverage, or read alignments into positional pileups and then into variant calls.
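
The interplay of accept-reject filters and relationship-based operations described above can be illustrated with a minimal C++ sketch. It is not taken from libshore: the record type Alignment, the helpers min_mapq, max_mismatches, both, apply_filter and remove_duplicates, and the rule of treating records with identical chromosome and position as duplicates are all hypothetical simplifications chosen for illustration.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified alignment record (not the libshore data structure).
struct Alignment {
    std::string chromosome;
    long position;   // leftmost reference position
    int mapq;        // mapping quality
    int mismatches;  // number of mismatches against the reference
};

// An accept-reject filter is a predicate on a single record.
using Filter = std::function<bool(const Alignment&)>;

Filter min_mapq(int q) {
    assert(q >= 0);  // mapping qualities are non-negative
    return [q](const Alignment& a) { return a.mapq >= q; };
}

Filter max_mismatches(int m) {
    assert(m >= 0);
    return [m](const Alignment& a) { return a.mismatches <= m; };
}

// Combine two filters: a record is accepted only if both accept it.
Filter both(Filter f, Filter g) {
    return [f, g](const Alignment& a) { return f(a) && g(a); };
}

// Keep only the records accepted by the filter, preserving input order.
std::vector<Alignment> apply_filter(const std::vector<Alignment>& in,
                                    const Filter& accept) {
    std::vector<Alignment> out;
    for (const auto& a : in) {
        if (accept(a)) out.push_back(a);
    }
    return out;
}

// Assumed duplicate rule: records sharing chromosome and position are
// duplicates; only the first one encountered is kept. The decision for one
// record therefore depends on the records seen before it.
std::vector<Alignment> remove_duplicates(const std::vector<Alignment>& in) {
    std::set<std::pair<std::string, long>> seen;
    std::vector<Alignment> out;
    for (const auto& a : in) {
        if (seen.emplace(a.chromosome, a.position).second) out.push_back(a);
    }
    return out;
}
```

Under these assumptions the order of application changes the result. Take three records: two at chr1:100 with mapping qualities 10 and 40, and one at chr1:200 with three mismatches. Filtering with both(min_mapq(30), max_mismatches(2)) before duplicate removal leaves one record, whereas removing duplicates first keeps the low-quality record at chr1:100 and the subsequent filter then rejects everything.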

While isolation and encapsulation of subproblems facilitates correct implementation and comprehensibility of each individual step, the logic required to tie components into end-to-end processing pipelines can itself become extensive. For versatile modes of application, modularized components must therefore be embedded into a more generic framework capable of providing the glue between input, output and data filtering, manipulation and transformation.
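
One way to picture such glue is a chain of stages that each expose the same small interface. The following C++ sketch is an assumption-laden illustration and does not reproduce the libshore design: Alignment, CoverageModule, FilterStage, make_filter and run are hypothetical names, and the convention that every stage offers a consume member function is chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified alignment: a half-open interval on one reference.
struct Alignment {
    std::size_t start;
    std::size_t length;
};

// Module transforming the data type: alignments in, depth of coverage out.
class CoverageModule {
public:
    explicit CoverageModule(std::size_t reference_length)
        : depth_(reference_length, 0) {}

    void consume(const Alignment& a) {
        assert(a.start + a.length <= depth_.size());
        for (std::size_t i = a.start; i < a.start + a.length; ++i) ++depth_[i];
    }

    const std::vector<int>& result() const { return depth_; }

private:
    std::vector<int> depth_;
};

// Accept-reject stage that forwards accepted records to the next stage.
template <typename Predicate, typename Next>
class FilterStage {
public:
    FilterStage(Predicate accept, Next& next) : accept_(accept), next_(next) {}

    void consume(const Alignment& a) {
        if (accept_(a)) next_.consume(a);
    }

private:
    Predicate accept_;
    Next& next_;  // the downstream stage must outlive this stage
};

// Helper deducing the template arguments of FilterStage.
template <typename Predicate, typename Next>
FilterStage<Predicate, Next> make_filter(Predicate accept, Next& next) {
    return FilterStage<Predicate, Next>(accept, next);
}

// Generic glue: push every element of an input range into the first stage.
template <typename Range, typename Stage>
void run(const Range& input, Stage& first) {
    for (const auto& element : input) first.consume(element);
}
```

In this sketch, run knows nothing about alignments or coverage; it only relies on the consume convention, so stages can be exchanged or chained without changing the driver. For instance, wrapping a CoverageModule in make_filter with a minimum-length predicate yields coverage computed only from sufficiently long alignments.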

The libshore C++ framework code is loosely categorized into ten different packages including application framework functionality, generic data processing infrastructure and sequencing-specific processing modules and data structures. With figure 3.1 we present a package overview in a simplified UML-like representation, illustrating exemplary classes and associations.

The base package comprises elementary low-level functionality and utilities as well as machine- or system-dependent blocks of code. While its functionality is required for most other packages, it is self-contained and does not by itself depend on other parts of the library.

The class program from the package of the same name constitutes the base class of all SHORE command line utilities. The base class combines command line interface definition, documen-
