An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
2.2. EFFICIENT STORAGE OF HIGH-THROUGHPUT SEQUENCING DATA USING<br />
TEXT-BASED FILE FORMATS 27<br />
(a)<br />
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTA<br />
||||||||||**|||||||||||||****|||||||||||||<br />
CCCTAAACCCGTAACCCTAAACCCT---TCTCTGAATCCTTA<br />
(b)<br />
(c)<br />
CCCTAAACCC[TG][AT]AACCCTAAACCCT[A-][A-][A-][CT]CTCTGAATCCTTA<br />
10[TA,GT]13[AAA,---][CT]13<br />
(d)<br />
Query CIGAR MD<br />
CCCTAAACCCGTAACCCTAAACCCTTCTCTGAATCCTTA 10=2X13=3D1X13= 10T0A13^AAA0C13<br />
Figure 2.4: Example Read Alignment <strong>and</strong> its Representations in the MapList <strong>and</strong> SAM Formats<br />
BAM indexing as well as the Tabix utility are two incarnations of a similar indexing scheme<br />
based on R-trees which was rst introduced for the UCSC Genome Browser [124]. Utilities like<br />
the BAM indexer however are tied to their specic associated le format. The more versatile<br />
Tabix utility supports many text-based le formats, but requires BGZF compressed data. By<br />
separating the r<strong>and</strong>om access <strong>and</strong> range indexing concepts, the 2d-tree index is universally applicable<br />
to all storage formats supported by the r<strong>and</strong>om access back end, currently including plain<br />
text, XZ <strong>and</strong> SHORE-GZIP. Moreover, the ability of indexing data sets not strictly required to<br />
be ordered by sequence coordinate is unique among the stated indexing methods. As described,<br />
our method indexes chunks of the data set having a xed byte size. The UCSC R-tree family of<br />
indexes dier in this respect by grouping data set elements into tiling windows on the genomic<br />
sequence at a maximum resolution of 16 kilobases. This binning approach implies the disadvantage<br />
of potential uncontrollable growth of the linear search phase for deeply covered regions<br />
of the genome. We further regard the 2d-tree index's homogeneous array layout a substantial<br />
implementation advantage.<br />
2.2.4 Improved Compression of Read Mapping <strong>Data</strong><br />
To cope with increasingly large amounts of read mapping data, we implement reference based<br />
compression, quality reduction, read relabeling as well as a simple transposition algorithm capable<br />
of further boosting compressibility of SHORE alignment les <strong>and</strong> other types of tab-delimited table.<br />
Conversion from st<strong>and</strong>ard to condensed representations is implemented as a tool SHORE coal.<br />
A read mapping by our denition is a multi-attribute entity composed of its mapping coordinates<br />
on the reference genome as well as a read alignment that species the exact set of<br />
base pairings, traditionally represented in a multi-line format (gure 2.4(a) ). The line-based,<br />
tab-delimited MapList format [1] dened by SHORE stores mapping coordinates, read alignments<br />
along with the original read information <strong>and</strong> attributes describing the reliability of the mapping<br />
as well as read pairing information. Alignments are stored in an alignment string format resembling<br />
that originally introduced by Vmatch [131], a sequence matching tool based on enhanced<br />
sux arrays [132]. In the Vmatch format, matching positions are represented literally by the<br />
matching nucleotide symbol, whereas edit operations are described by the pair of mismatching<br />
symbols enclosed in square brackets (gure 2.4(b) ).<br />
To adjust to the development of sequencing technology <strong>and</strong> read mapping tools, we have<br />
carefully amended the original alignment representation with a small set of additional features<br />
while maintaining full backwards compatibility. With the constant increase of achievable read<br />
length, it becomes possible for read mapping tools to align reads across longer stretches of