An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
22 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />
language of choice, readout of binary formats is therefore often too laborious or beyond basic programming<br />
skills, eectively limiting users of the format to choose from a narrow set of supported<br />
programming languages.<br />
In contrast, basic parsers are straightforward to implement for simple tab-delimited table<br />
formats. Requirements such as string tokenization <strong>and</strong> conversion between character strings <strong>and</strong><br />
numeric types are well supported by high level scripting languages popular in the bioinformatics<br />
eld such as Perl or Python, <strong>and</strong> considered a basic skill in next to any programming language.<br />
The line-based formats are furthermore accessible to shell scripting <strong>and</strong> readily ltered through<br />
widely used Unix text processing utilities such as grep, awk, sed or sort. While conversion<br />
between le <strong>and</strong> in-memory representation is less ecient compared to optimized binary formats,<br />
the introduced overhead is marginal in relation to the overall CPU consumption of data analysis<br />
algorithms.<br />
Furthermore, delimited text tables can serve as a basic framework for le format denitions<br />
such as SAM, GFF or VCF. This common foundation enables implementation of generic algorithms<br />
applicable to a large variety of dierent le formats, as demonstrated for example by the<br />
tool Tabix [125] for range indexing of genomic feature le formats. Adapting similar algorithms to<br />
dierent specialized binary formats is laborious, requiring reimplementation or implementation<br />
of sophisticated adapter mechanisms.<br />
Format issues are easily identiable with text-based le formats without requirement of expert<br />
knowledge or implementation of specialized format verication routines. Software dealing with<br />
data produced by a rapidly evolving technology as is high-throughput sequencing instrumentation<br />
is likely required to undergo constant adaptation. Besides, at high rates of data production<br />
storage systems may operate close to the limits of their capacity. In such potentially unstable<br />
settings, simplicity of le formats constitutes a signicant advantage.<br />
On grounds of these considerations, the SHORE analysis suite relies on tab-delimited table<br />
formats for storing reads, read mapping data <strong>and</strong> analysis results. Users may further choose to<br />
store les either directly as plain text, or between one of two widely used compression formats,<br />
GZIP [116] <strong>and</strong> XZ [127]. While such generic compression le formats themselves represent<br />
binary formats, decoding comm<strong>and</strong>s <strong>and</strong> routines are widely available across platforms <strong>and</strong><br />
programming languages <strong>and</strong> are employed merely as a ltering layer, thus retaining most of the<br />
advantages of a text-based format. With compression, SHORE implements block-wise encoding<br />
for both the GZIP <strong>and</strong> the XZ format to enable near-r<strong>and</strong>om read access at any decompressed<br />
le oset. Our implementation enables free adjustment of r<strong>and</strong>om access <strong>and</strong> space eciency<br />
trade-os (section 2.2.2).<br />
SHORE's text-based read alignment le format was extended to support reference compression<br />
as introduced by the binary CRAM format (section 1.6), as well as further simple preprocessing<br />
lters serving to boost compressibility in combination with generic compression algorithms.<br />
Depending on the choice of preprocessing lters, compression algorithm <strong>and</strong> r<strong>and</strong>om access resolution,<br />
our compressed text le formats enable similar or better read alignment compression<br />
ratios compared to CRAM (section 2.2.5).<br />
<strong>An</strong>other common requirement in sequencing data analysis is the retrieval of data set elements<br />
with a certain attribute value, e. g. extraction of the set of single nucleotide polymorphisms<br />
included in a certain range of the reference genome or retrieving a read with a specic identier<br />
for a short read data set. However, st<strong>and</strong>ard linear search is slow when processing large data sets.<br />
Given the data set is in sorted order with respect to the key attribute, accelerated queries are<br />
possible. SHORE sort oers text le sorting functionality similar to the st<strong>and</strong>ard sort comm<strong>and</strong><br />
line utility, optimized for use with typical high-throughput sequencing le formats. Additionally,<br />
the tool is capable of performing binary accelerated queries on arbitrary sorted tab-delimited<br />
text les.