
The BGZF format developed for the BAM format and the Tabix utility [100, 111, 125] exploits the concatenation feature to implement independently compressed blocks within the GZIP specification. Uncompressed data are split into 64 kilobyte (kB) blocks, compressed independently as GZIP streams and eventually concatenated (figure 2.2(b)). Although the resulting compressed size of a block is unpredictable and depends on the compressibility of the data, with knowledge of the start offsets of all blocks in a compressed file it is possible to quickly locate the block containing a specific uncompressed file offset. Thereby, excess data that have to be decompressed to access any specific position of the file are limited to at most the block size of 64 kB. BGZF, however, does not specify a means of storing the start offset of each of the compressed blocks. To realize accelerated queries on top of BGZF, block offset information must therefore be incorporated in external index files. This represents a source of potential error, as the contents of the compressed file may diverge from the stored index data. Furthermore, various software applications do not correctly implement the stream concatenation feature of the GZIP specification, resulting in BGZF-compressed data being truncated after decompression of the first block.
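
Both points can be demonstrated in a few lines. The following sketch is an illustrative approximation in Python rather than the actual BGZF wire format (which additionally records each block's compressed size in the member header); the function names and the index layout are assumptions, and only the 64 kB block size is taken from the description above.

    import bisect
    import gzip
    import zlib

    BLOCK_SIZE = 64 * 1024  # 64 kB of uncompressed data per block

    def compress_blocks(data):
        # Compress each block as an independent GZIP member and record the
        # (uncompressed offset, compressed offset) of every block -- BGZF
        # itself leaves this bookkeeping to external index files.
        blob, index = bytearray(), []
        for off in range(0, len(data), BLOCK_SIZE):
            index.append((off, len(blob)))
            blob += gzip.compress(data[off:off + BLOCK_SIZE])
        return bytes(blob), index

    def locate_block(index, uncompressed_offset):
        # Binary search for the compressed offset of the block that
        # contains the requested uncompressed position.
        starts = [u for u, _ in index]
        return index[bisect.bisect_right(starts, uncompressed_offset) - 1][1]

    data = bytes(range(256)) * 1024  # 256 kB of sample data
    blob, index = compress_blocks(data)

    # A concatenation-aware decoder recovers the full input ...
    assert gzip.decompress(blob) == data

    # ... but a decoder that stops after a single GZIP member silently
    # truncates the output after the first block -- the pitfall noted above.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    assert d.decompress(blob) == data[:BLOCK_SIZE]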

The open source command line tool dictzip comes with a specification for an indexed GZIP-compliant format. The GZIP format specification allows up to 64 kB of arbitrary extra data to be embedded within the header sequence; this extra data is simply ignored by basic GZIP decoders. The dictzip format takes advantage of the extra data field for storing block offset information (figure 2.2(c)). During the encoding process, the DEFLATE algorithm offers the possibility of manually inserting reset points from which decompression may be started without processing the entire file. By exploiting this feature to compress blocks of input data independently, the dictzip tool avoids reliance on the stream concatenation feature as in the case of BGZF. However, with the extra data field in the GZIP header being limited to 64 kB by specification, the number of blocks that can be stored is limited as well, effectively prohibiting storage of data with an uncompressed size of more than 1.8 gigabyte (GB). Furthermore, the entire index to be stored in the GZIP header only becomes available after the entire file has been encoded. The format is therefore incompatible with streaming output, making it suitable for medium-size, static data sets, but poorly matched to the compression of large-scale sequencing data.
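
These reset points correspond to what common DEFLATE implementations expose as a full flush. The following sketch, assuming Python's zlib bindings and an illustrative chunk size, records the offset of every reset point during encoding; it demonstrates the principle only and does not reproduce dictzip's actual chunk table layout.

    import zlib

    CHUNK = 58 * 1024  # illustrative; dictzip's default chunk size is similar

    def compress_with_resets(data):
        # Raw DEFLATE with a full flush after every chunk: Z_FULL_FLUSH
        # byte-aligns the stream and resets the encoder's dictionary, so
        # decompression may later restart at any recorded offset.
        comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
        blob, offsets = bytearray(), []
        for i in range(0, len(data), CHUNK):
            offsets.append(len(blob))
            blob += comp.compress(data[i:i + CHUNK])
            blob += comp.flush(zlib.Z_FULL_FLUSH)
        blob += comp.flush(zlib.Z_FINISH)
        return bytes(blob), offsets

The 1.8 GB bound follows from the table layout: assuming dictzip's two-byte chunk-length entries, a 64 kB extra field holds roughly 32,000 entries, which at a chunk size of about 58 kB caps the uncompressed input near the quoted figure.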

Due to the shortcomings of the available formats, we introduce a further subset specification. The SHORE-GZIP format compresses the entire file as a single GZIP stream to provide optimal compatibility. To enable streaming output, all indexing information is stored at the end of the file. On decompression using either SHORE or arbitrary third party tools, block offsets stored in the index are discarded; thereby, index information is guaranteed to remain consistent with the file contents. To allow fine-tuning the tradeoff between compression ratio and random access overhead, the format enables free choice of the encoding block size.
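
Given the block offsets recorded at encoding time (for instance, the offsets list of the previous sketch), random read access reduces to restarting a raw DEFLATE decoder at the nearest preceding reset point. The following function is a hypothetical illustration of this lookup, not SHORE's implementation, and assumes blocks of a fixed uncompressed size.

    import zlib

    def read_at(blob, offsets, pos, length, block_size):
        # Restart raw DEFLATE decoding at the reset point preceding the
        # requested position; at most one block of excess data is
        # inflated, which is the tunable overhead described above.
        block = pos // block_size
        skip = pos - block * block_size
        d = zlib.decompressobj(-zlib.MAX_WBITS)
        out = d.decompress(blob[offsets[block]:], skip + length)
        return out[skip:skip + length]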

SHORE-GZIP realizes block-wise encoding similar to the dictzip approach by requesting periodic resets of the DEFLATE encoder (figure 2.2(d)). During encoding, the method keeps track of the compressed block offsets. After encoding all input data, the recorded index information is split into blocks of approximately 64 kB. Each block is embedded as an extra data field into a GZIP header, which is appended to the compressed file, followed by two bytes signaling empty compressed data to DEFLATE decoders, as well as eight further bytes forming the appropriate GZIP footer. The indexing information is hence stored as consecutive empty GZIP streams that will be ignored by standard decoding of the data set. The SHORE-GZIP implementation is, however, able to recognize, collect and decode the trailing index blocks on demand to enable random read access.
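
The layout of such a trailing index member follows from the GZIP specification and can be sketched as below; the subfield tag and the index serialization are hypothetical stand-ins for SHORE's actual on-disk format.

    import gzip
    import struct

    def index_member(chunk):
        # Header: magic, CM=8 (DEFLATE), FLG=FEXTRA, MTIME=0, XFL=0,
        # OS=255, then XLEN and one extra subfield carrying the index
        # chunk (subfield tag "SG" is a hypothetical placeholder).
        extra = b"SG" + struct.pack("<H", len(chunk)) + chunk
        header = (b"\x1f\x8b\x08\x04" + b"\x00" * 5 + b"\xff"
                  + struct.pack("<H", len(extra)) + extra)
        # Body: the two bytes signaling an empty DEFLATE stream; footer:
        # CRC32 and size of the empty output, i.e. eight zero bytes.
        return header + b"\x03\x00" + b"\x00" * 8

    payload = gzip.compress(b"actual file contents")
    f = payload + index_member(b"serialized block offsets")

    # Standard decoders skip the extra field and decode the trailing
    # empty stream to nothing, so the index is invisible to them.
    assert gzip.decompress(f) == b"actual file contents"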

An important feature of plain GZIP is the ability for users to concatenate compressed files without breaking format compliance. With SHORE-GZIP, concatenation results in index records being interspersed among the data streams. Through correct recognition of such interspersed
