An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

CHAPTER 3. A C++ FRAMEWORK FOR HIGH-THROUGHPUT DNA SEQUENCING

[Figure 3.1: libshore Overview. Class diagram covering the packages base, stream, algo, fmtio, container, statistics, parallel, datatype and processing.]
3.1. OVERVIEW

[...]tation and command line parsing through an auxiliary class av_parser and incorporates an object of class uncaught_handler to provide handling and reporting of fatal error conditions encountered by the application.

Classes istreams and ostreams of the stream package simplify handling of files or other input or output destinations. Interpretation of special location strings or prefixes facilitates specification of special files or other sources or destinations of input and output, e.g. standard input, standard output or Unix pipes, as option arguments or option default values. Input or output locations compressed in GZIP, BAM or XZ format are delegated to appropriate helper stream classes for automatic decoding, encoding or handling of random access requests (section 2.2.2). Both classes additionally provide input/output error checking and further stream-related functionality. Class ostreams furthermore implements simplified handling of temporary file destinations as well as optional temporary file compression for improved handling of large amounts of temporary data.

Package algo comprises dynamic programming alignment (section 2.5.2) as well as indexing, sorting and query algorithms and associated persistent data structure implementations (section 2.2.3). The platform for range indexing and queries is provided by a generic 2-d tree algorithm template d2tree, which is further utilized by the class twodex that supplies a persistent index data structure configurable for a variety of genomic data formats. Multiple algorithm templates with the prefix suffix_ implement generic suffix array construction and queries [144, 145]. The suffix array algorithms are employed by objects of class suffix_index, which provides persistent disk indexes for genomic sequences that are configurable and capable of answering various types of sequence match queries.
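To make the idea of suffix array construction and sequence match queries concrete, the following is a minimal sketch and not the libshore implementation. It uses a naive comparison sort instead of the construction algorithms cited in [144, 145], and the function names build_suffix_array and find_matches are hypothetical names chosen for this illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper (not a libshore name): naive suffix array construction.
// Sorts all suffix start positions by the lexicographic order of the suffixes.
// Worst-case cost is far above that of dedicated construction algorithms.
std::vector<std::size_t> build_suffix_array(std::string_view text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [text](std::size_t a, std::size_t b) {
        return text.substr(a) < text.substr(b);
    });
    return sa;
}

// Hypothetical helper: returns the half-open range [first, last) of suffix
// array slots whose suffixes start with `pattern`. Each slot in the range
// holds one match position in `text`.
std::pair<std::size_t, std::size_t> find_matches(
        std::string_view text,
        const std::vector<std::size_t>& sa,
        std::string_view pattern) {
    assert(sa.size() == text.size());  // sa must have been built from text
    // First slot whose suffix, truncated to the pattern length, is >= pattern.
    const auto lo = std::lower_bound(
        sa.begin(), sa.end(), pattern,
        [text](std::size_t pos, std::string_view p) {
            return text.substr(pos, p.size()) < p;
        });
    // First slot after lo whose truncated suffix is > pattern.
    const auto hi = std::upper_bound(
        lo, sa.end(), pattern,
        [text](std::string_view p, std::size_t pos) {
            return p < text.substr(pos, p.size());
        });
    return {static_cast<std::size_t>(lo - sa.begin()),
            static_cast<std::size_t>(hi - sa.begin())};
}
```

For the text "GATTACA", querying the pattern "TA" with this sketch yields a range containing a single slot, which refers to text position 3. A persistent disk index such as the one described for suffix_index would additionally need a serialized layout, which this sketch does not address.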
To support up to 64 bit data at reasonable space efficiency, persistent suffix_index and twodex index data are encoded and decoded non-byte aligned, using exactly the required bit width, by means of class intpack of the container package.

The libshore data processing infrastructure is defined and implemented as a set of templates forming the processing package (section 3.2). Package parallel extends this infrastructure to build an abstracted parallel processing interface (section 3.3).

Despite growing acceptance of data exchange specifications such as SAM/BAM for short read mapping data, data format issues can present a major impediment to the application of algorithms to input data from diverse sources. Therefore, the library includes support for a broad variety of input and output file formats for sequencing reads, mappings and further sequencing and DNA-related data, provided through the fmtio package. Format support is implemented as Reader and Writer classes with a consistent interface modeled after the requirements of the data processing infrastructure. To simplify working with various formats, the library provides the multi-format readers read_reader and alignment_reader for sequencing read and alignment data, which automatically incorporate the adequate Reader objects for each respective input file.

File format parsing and decoding requires an adequate in-memory representation for the respective elements of the various types of data set, which are defined in the datatype package. Plain data structures read and alignment provide the standard representation of short sequencing read data, whereas large genomic sequences are stored as sequence_record objects with more sophisticated internal memory management. Additionally, processing modules and utilities specific to a certain type of data set element or certain properties of data set elements are also grouped under this package.
For example, a class alignment_filter_pipe combines application of a variety of frequently required filtering and editing modules to read mapping data. Parsing and decoding of the SHORE alignment string pair-wise alignment representation (section 2.2.4) into CIGAR string-equivalent tokens is provided by a class alignment_tokenizer. Class nucleotide provides decoding, encoding and efficient manipulation of IUPAC encoded DNA bases.

The additional statistics package supplies the basis for statistical hypothesis tests based on a variety of distributions. The complementary class mtc re-implements various multiple hypothesis test correction procedures in C++ following the R multtest package implementation [146].
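As an illustration of what a multiple hypothesis test correction procedure computes, the following minimal sketch implements Benjamini-Hochberg adjusted p-values, one widely used procedure of this kind. The text does not state which procedures mtc offers or what its interface looks like, so the function name bh_adjust and its signature are assumptions made for this example and are not taken from libshore.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical function (not a libshore name): Benjamini-Hochberg adjusted
// p-values. The result is given in the same order as the input p-values.
std::vector<double> bh_adjust(const std::vector<double>& p) {
    const std::size_t m = p.size();

    // Indices of the p-values in ascending order of p.
    std::vector<std::size_t> order(m);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&p](std::size_t a, std::size_t b) { return p[a] < p[b]; });

    // Walk from the largest to the smallest p-value, scaling each by m / rank
    // and enforcing monotonicity with a running minimum capped at 1.
    std::vector<double> adjusted(m);
    double running_min = 1.0;
    for (std::size_t rank = m; rank > 0; --rank) {
        const std::size_t idx = order[rank - 1];
        assert(p[idx] >= 0.0 && p[idx] <= 1.0);  // inputs must be p-values
        const double scaled =
            p[idx] * static_cast<double>(m) / static_cast<double>(rank);
        running_min = std::min(running_min, scaled);
        adjusted[idx] = running_min;
    }
    return adjusted;
}
```

Keeping the output in input order lets a caller attach the adjusted values back to the tested elements without tracking the sort permutation.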