An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

3 A C++ Framework for High-Throughput DNA Sequencing

In a large, multi-purpose data analysis software solution, there are inevitably many recurring processing tasks and algorithmic challenges. To avoid large amounts of complex, redundant and error-prone code, the common subproblems must be isolated as reusable modules. With the SHORE DNA sequencing data analysis suite, we have striven to implement frequently required segments of code using consistent design patterns and interfaces. The resulting C++ programming framework libshore constitutes the subject of this chapter.

3.1 Overview 

High-throughput sequencing has been adopted as a versatile tool applicable to a wide range of scientific questions (section 1.3). While data analysis for different types of application must be approached from unique angles (section 1.5), the methods and algorithms involved often overlap considerably in their elementary building blocks.

Reading, parsing and formatted output of sequencing data in standard storage formats is a near-universal requirement. Additionally, subsetting data sets by defined properties, achievable through different combinations of simple accept-reject filtering operations, is often a powerful tool for adjusting an analysis' sensitivity-specificity tradeoff. More complex operations rely on the relationship between multiple data set elements, such as the removal of PCR duplicate sequences from read alignment data, or alter certain properties of the data set elements themselves. Such operations are therefore sensitive to the order in which they are applied, and can be useful not only in various combinations but also in various orders of application. Finally, entire analysis processes can be modeled as modules or series of modules transforming the type of the data, for example read alignments into depth of coverage, or read alignments into positional pileups into variant calls.
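The combination of simple accept-reject filters described above can be sketched as follows. This is a minimal illustration only and does not reproduce the libshore interfaces: the `Alignment` record, the `Filter` alias and the helpers `min_quality`, `max_mismatches`, `min_length`, `conjunction` and `subset` are all hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alignment record for illustration; not a libshore type.
struct Alignment {
    std::string read_id;
    std::size_t start;       // 0-based, inclusive
    std::size_t end;         // exclusive
    int mapping_quality;
    int mismatches;
};

// An accept-reject filter returns true if the element is kept.
using Filter = std::function<bool(const Alignment&)>;

Filter min_quality(int q) {
    return [q](const Alignment& a) { return a.mapping_quality >= q; };
}

Filter max_mismatches(int m) {
    return [m](const Alignment& a) { return a.mismatches <= m; };
}

Filter min_length(std::size_t n) {
    return [n](const Alignment& a) {
        assert(a.start <= a.end);  // guard against malformed records
        return a.end - a.start >= n;
    };
}

// Combines filters: the result accepts an element only if every filter does.
Filter conjunction(std::vector<Filter> filters) {
    return [filters = std::move(filters)](const Alignment& a) {
        for (const auto& f : filters) {
            if (!f(a)) return false;
        }
        return true;
    };
}

// Returns the subset of the input accepted by the given filter.
std::vector<Alignment> subset(const std::vector<Alignment>& in,
                              const Filter& accept) {
    std::vector<Alignment> out;
    for (const auto& a : in) {
        if (accept(a)) out.push_back(a);
    }
    return out;
}
```

Tightening or relaxing the thresholds passed to `conjunction` is one way such a design could shift an analysis along the sensitivity-specificity tradeoff without touching the individual filters.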

While isolation and encapsulation of subproblems facilitates correct implementation and comprehensibility of each individual step, the logic required to tie components into end-to-end processing pipelines can itself become extensive. For versatile modes of application, modularized components must therefore be embedded into a more generic framework capable of providing the glue between input, output and data filtering, manipulation and transformation.
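To make the idea of such glue concrete, the sketch below chains interchangeable stages and ends with a type-changing transformation from read alignments into depth of coverage. It is an assumption-laden illustration rather than the libshore design: `Alignment`, `Stage`, `quality_at_least`, `remove_duplicates`, `depth_of_coverage` and `run_pipeline` are hypothetical names, and treating alignments with identical start and end coordinates as PCR duplicates is a deliberately naive stand-in for a real duplicate detection method.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical alignment record for illustration; not a libshore type.
struct Alignment {
    std::size_t start;       // 0-based, inclusive
    std::size_t end;         // exclusive
    int mapping_quality;
};

// A stage maps a data set of alignments to a new data set of alignments.
using Stage =
    std::function<std::vector<Alignment>(const std::vector<Alignment>&)>;

// Accept-reject stage: keeps alignments with sufficient mapping quality.
Stage quality_at_least(int min_q) {
    return [min_q](const std::vector<Alignment>& in) {
        std::vector<Alignment> out;
        for (const auto& a : in) {
            if (a.mapping_quality >= min_q) out.push_back(a);
        }
        return out;
    };
}

// Relational stage: keeps only the first alignment for each (start, end)
// pair. Naive assumption: identical coordinates indicate a PCR duplicate.
std::vector<Alignment> remove_duplicates(const std::vector<Alignment>& in) {
    std::set<std::pair<std::size_t, std::size_t>> seen;
    std::vector<Alignment> out;
    for (const auto& a : in) {
        if (seen.insert({a.start, a.end}).second) out.push_back(a);
    }
    return out;
}

// Type-changing transformation: alignments into per-position depth.
std::vector<unsigned> depth_of_coverage(const std::vector<Alignment>& in,
                                        std::size_t reference_length) {
    std::vector<unsigned> depth(reference_length, 0);
    for (const auto& a : in) {
        assert(a.start <= a.end && a.end <= reference_length);
        for (std::size_t pos = a.start; pos < a.end; ++pos) ++depth[pos];
    }
    return depth;
}

// Glue: applies the stages in the given order, then the transformation.
std::vector<unsigned> run_pipeline(std::vector<Alignment> data,
                                   const std::vector<Stage>& stages,
                                   std::size_t reference_length) {
    for (const auto& stage : stages) data = stage(data);
    return depth_of_coverage(data, reference_length);
}
```

Because `remove_duplicates` depends on which elements are still present, passing `{quality_at_least(20), remove_duplicates}` and `{remove_duplicates, quality_at_least(20)}` to `run_pipeline` can yield different coverage for the same input, which is the order sensitivity noted earlier in this section.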

The libshore C++ framework code is loosely categorized into ten different packages including application framework functionality, generic data processing infrastructure and sequencing-specific processing modules and data structures. Figure 3.1 presents a package overview in a simplified UML-like representation, illustrating exemplary classes and associations.

The base package comprises elementary low-level functionality and utilities as well as machine- or system-dependent blocks of code. While its functionality is required for most other packages, it is self-contained and does not itself depend on other parts of the library.

The class program from the package of the same name constitutes the base class of all SHORE command line utilities. The base class combines command line interface definition, documen-
