28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

CHAPTER 1. INTRODUCTION

Our application programming framework libshore was implemented as a common foundation for all SHORE processing modules. In addition to elementary functionality such as automatic input and output decoding and encoding, it provides algorithms for indexing sequencing-related data and for alignment of biological sequences. Furthermore, we developed a generic processing framework with the aim of facilitating the breakdown and implementation of rather complex sequencing data analysis algorithms as collections of manageable, self-contained processing modules. We further extended this framework to allow straightforward parallelization of basic sequencing data processing.
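The idea of composing an analysis from self-contained processing modules can be illustrated with a minimal sketch. It is written in Python for brevity and does not reproduce the libshore API; the names `Read`, `min_length_filter`, `min_mean_quality_filter` and `chain` are hypothetical and chosen only for this illustration. Each module consumes a stream of reads and yields a stream of reads, so modules can be chained freely.

```python
from typing import Callable, Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record (not a libshore type)."""
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


# A module is any callable that maps a stream of reads to a stream of reads.
Module = Callable[[Iterable[Read]], Iterator[Read]]


def min_length_filter(min_length: int) -> Module:
    """Build a module that drops reads shorter than min_length."""
    def module(reads: Iterable[Read]) -> Iterator[Read]:
        for read in reads:
            if len(read.sequence) >= min_length:
                yield read
    return module


def min_mean_quality_filter(threshold: float) -> Module:
    """Build a module that drops reads whose mean quality is below threshold."""
    def module(reads: Iterable[Read]) -> Iterator[Read]:
        for read in reads:
            if read.qualities and sum(read.qualities) / len(read.qualities) >= threshold:
                yield read
    return module


def chain(*modules: Module) -> Module:
    """Compose modules so that the output of one feeds the next, lazily."""
    def pipeline(reads: Iterable[Read]) -> Iterator[Read]:
        stream: Iterator[Read] = iter(reads)
        for module in modules:
            stream = module(stream)
        return stream
    return pipeline
```

Because every module only sees an iterator, reads are processed one at a time without holding the whole data set in memory, and a new step can be added by writing one more function of the same shape.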

The fundamental data analysis workflow introduced with the initial version of SHORE consists of the successive application of read filtering, read mapping and application-specific analysis modules. While this basic approach has been maintained, it has been extended in many areas to simplify a variety of routine tasks. An overview of the modified workflow is therefore given in section 2.1.

Significant modifications to core pipeline modules concern raw sequencing read import and filtering as well as read mapping. Data import was modified to enable recovery of raw data or iterative re-filtering of data sets (section 2.3), and complemented with extended functionality to address, for example, the increasingly widespread use of a variety of library multiplexing and further adapter ligation protocols. For this purpose, flexible sequence-specific read partitioning and clipping facilities have been added (sections 2.4, 2.5). SHORE's read mapping module initially constituted a basic parallelization wrapper providing a unified interface to different short read alignment tools. To take advantage of an increasing diversity of read mapping algorithms, features have been added to enable integration of the results obtained using different alignment algorithms or parameters (section 2.6).
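To make the notions of sequence-specific read partitioning and clipping concrete, the following is a minimal sketch and not SHORE's implementation. It assumes exact-match barcodes at the 5' end of each read and a 3' adapter that is matched without mismatches; the function names, the barcode table and the `min_overlap` parameter are hypothetical, and real tools typically also tolerate sequencing errors.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_by_barcode(reads: Iterable[str],
                         barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each read to a sample by its 5' barcode and strip the barcode.

    `barcodes` maps barcode sequence -> sample name (exact prefix match).
    Reads matching no barcode are collected under "unassigned".
    """
    partitions: Dict[str, List[str]] = defaultdict(list)
    for seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                partitions[sample].append(seq[len(barcode):])
                break
        else:
            partitions["unassigned"].append(seq)
    return dict(partitions)


def clip_adapter(seq: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read (exact matching only).

    First look for the complete adapter anywhere in the read; otherwise
    look for the longest adapter prefix (at least `min_overlap` bases,
    which must be >= 1) that coincides with the end of the read.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    longest = min(len(adapter) - 1, len(seq))
    for overlap in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[:len(seq) - overlap]
    return seq
```

For example, with the illustrative barcode table `{"ACGT": "sample_a", "TGCA": "sample_b"}`, the read `ACGTGGGGAAAA` is assigned to `sample_a` as `GGGGAAAA`, and clipping the example adapter `AGATCGGAAGAGC` from `GGGGAAAAAGATCGG` removes the trailing partial adapter `AGATCGG`.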

Application-specific data analysis modules initially implemented in SHORE mainly addressed the assessment of genomic variation. This functionality has largely been maintained, with only slight modifications to take advantage of the libshore infrastructure. A focus of this work, however, was on quantitative applications of high-throughput sequencing, primarily ChIP-Seq data analysis. Within the SHORE analysis environment, we have implemented an enrichment detection module targeted at the analysis of transcription factor immunoprecipitation data (section 2.7). In comparison to other approaches, our method emphasizes robustness towards read mapping artifacts and simplifies handling of replicate experiments. Our ChIP-Seq enrichment detection module is complemented by configurable auxiliary utilities allowing composition of custom workflows applicable to various expression profiling or enrichment detection applications.

In conclusion, with this work the SHORE software has been developed from an analysis pipeline consisting of a set of largely independent submodules targeted primarily at genomic variation assessment into a tightly integrated, general-purpose sequencing data analysis suite supporting a set of coordinated, interdependent features. The SHORE sequencing data analysis suite is open source software, freely available¹ under the GNU General Public License (GPL) version 3.

1 http://shore.sf.net
