28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

22 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />

language of choice, readout of binary formats is therefore often too laborious or beyond basic programming<br />

skills, eectively limiting users of the format to choose from a narrow set of supported<br />

programming languages.<br />

In contrast, basic parsers are straightforward to implement for simple tab-delimited table<br />

formats. Requirements such as string tokenization <strong>and</strong> conversion between character strings <strong>and</strong><br />

numeric types are well supported by high level scripting languages popular in the bioinformatics<br />

eld such as Perl or Python, <strong>and</strong> considered a basic skill in next to any programming language.<br />

The line-based formats are furthermore accessible to shell scripting <strong>and</strong> readily ltered through<br />

widely used Unix text processing utilities such as grep, awk, sed or sort. While conversion<br />

between le <strong>and</strong> in-memory representation is less ecient compared to optimized binary formats,<br />

the introduced overhead is marginal in relation to the overall CPU consumption of data analysis<br />

algorithms.<br />

Furthermore, delimited text tables can serve as a basic framework for le format denitions<br />

such as SAM, GFF or VCF. This common foundation enables implementation of generic algorithms<br />

applicable to a large variety of dierent le formats, as demonstrated for example by the<br />

tool Tabix [125] for range indexing of genomic feature le formats. Adapting similar algorithms to<br />

dierent specialized binary formats is laborious, requiring reimplementation or implementation<br />

of sophisticated adapter mechanisms.<br />

Format issues are easily identiable with text-based le formats without requirement of expert<br />

knowledge or implementation of specialized format verication routines. Software dealing with<br />

data produced by a rapidly evolving technology as is high-throughput sequencing instrumentation<br />

is likely required to undergo constant adaptation. Besides, at high rates of data production<br />

storage systems may operate close to the limits of their capacity. In such potentially unstable<br />

settings, simplicity of le formats constitutes a signicant advantage.<br />

On grounds of these considerations, the SHORE analysis suite relies on tab-delimited table<br />

formats for storing reads, read mapping data <strong>and</strong> analysis results. Users may further choose to<br />

store les either directly as plain text, or between one of two widely used compression formats,<br />

GZIP [116] <strong>and</strong> XZ [127]. While such generic compression le formats themselves represent<br />

binary formats, decoding comm<strong>and</strong>s <strong>and</strong> routines are widely available across platforms <strong>and</strong><br />

programming languages <strong>and</strong> are employed merely as a ltering layer, thus retaining most of the<br />

advantages of a text-based format. With compression, SHORE implements block-wise encoding<br />

for both the GZIP <strong>and</strong> the XZ format to enable near-r<strong>and</strong>om read access at any decompressed<br />

le oset. Our implementation enables free adjustment of r<strong>and</strong>om access <strong>and</strong> space eciency<br />

trade-os (section 2.2.2).<br />

SHORE's text-based read alignment le format was extended to support reference compression<br />

as introduced by the binary CRAM format (section 1.6), as well as further simple preprocessing<br />

lters serving to boost compressibility in combination with generic compression algorithms.<br />

Depending on the choice of preprocessing lters, compression algorithm <strong>and</strong> r<strong>and</strong>om access resolution,<br />

our compressed text le formats enable similar or better read alignment compression<br />

ratios compared to CRAM (section 2.2.5).<br />

<strong>An</strong>other common requirement in sequencing data analysis is the retrieval of data set elements<br />

with a certain attribute value, e. g. extraction of the set of single nucleotide polymorphisms<br />

included in a certain range of the reference genome or retrieving a read with a specic identier<br />

for a short read data set. However, st<strong>and</strong>ard linear search is slow when processing large data sets.<br />

Given the data set is in sorted order with respect to the key attribute, accelerated queries are<br />

possible. SHORE sort oers text le sorting functionality similar to the st<strong>and</strong>ard sort comm<strong>and</strong><br />

line utility, optimized for use with typical high-throughput sequencing le formats. Additionally,<br />

the tool is capable of performing binary accelerated queries on arbitrary sorted tab-delimited<br />

text les.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!