An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
2.6. A PARALLELIZATION FRONT-END FOR SHORT READ ALIGNMENT TOOLS 43<br />
read manipulation. Alternatively, the score threshold may be set up to be calculated using the<br />
total length of the selected sequence as the function parameter.<br />
2.6 A Parallelization Front-End for Short Read Alignment Tools<br />
A great variety of sequencing read mapping <strong>and</strong> alignment programs are nowadays available to<br />
users. While most of these applications are similar in operation, each comes with its own specic<br />
user interface <strong>and</strong> usage peculiarities. Despite the huge selection of dierent implementations<br />
<strong>and</strong> extensive optimization of the underlying mapping <strong>and</strong> alignment algorithms, read mapping<br />
is furthermore still one of the computationally most expensive steps for resequencing applications<br />
(section 1.5.3). While parallel processing can often solve the issue of extensive run times,<br />
parallelization support is not uniform among all available solutions. Therefore, as each read<br />
alignment tool comes with its own set of trade-os, strengths <strong>and</strong> weaknesses, users are often left<br />
to decide between performance <strong>and</strong> required or desired features, <strong>and</strong> frequently need to adapt to<br />
dierent interfaces.<br />
The SHORE mapowcell application is a parallelization, pre- <strong>and</strong> post-processing wrapper for<br />
a variety of widely used <strong>and</strong> freely available short read mappers [1]. In addition to shared memory<br />
parallelization, the application has been reimplemented supporting networked distributed<br />
memory architectures through the Message Passing Interface (MPI) st<strong>and</strong>ard. Further added<br />
capabilities include automatic adjustment of parameters for h<strong>and</strong>ling of data sets with heterogeneous<br />
read lengths, as well as straightforward integration of mapping results obtained using<br />
dierent aligner back-ends or sets of mapping parameters. The program thus constitutes a<br />
meta-alignment tool able to amend mapped sequencing read data sets using dierent aligners,<br />
parameters or additional reference sequence information. Such multi-pass read mapping may<br />
especially benet computationally expensive approaches such as spliced read mapping.<br />
SHORE mapowcell currently provides a common user interface to the tools GenomeMapper,<br />
BWA, Bowtie, Bowtie2, ELAND <strong>and</strong> BLAT. User parameters are automatically transformed into<br />
the appropriate equivalent for the respective mapping back-end <strong>and</strong> results consistently reported<br />
in compressed, seekable SHORE MapList format easily convertible to various data exchange formats<br />
such as SAM.<br />
Ion or pyrosequencing read data sets as well as other types of sequencing data set following<br />
quality based read trimming (section 2.3) or sequence context specic read clipping (section 2.5)<br />
feature read length distributions with broad support range. However, many read mapping tools<br />
only support a set of static alignment parameters provided at program startup, resulting in<br />
below-par mapping specicity for reads on the lower tail of the read length distribution <strong>and</strong><br />
insucient sensitivity for reads at the upper end of the length spectrum.<br />
The SHORE mapowcell application automates read length-dependent selection of all relevant<br />
alignment parameters such as maximum tolerated edit distance, gap extension lengths or seed<br />
sizes. These user parameters may be provided as either absolute values or percentage of read<br />
length; additionally, edit distance may be bounded according to the alignment seed lemma.<br />
Given a certain read length l <strong>and</strong> edit distance d, a read alignment is guaranteed to contain a<br />
perfectly matching seed of size s = ⌊l/(d + 1)⌋, <strong>and</strong> therefore the maximum edit distance that<br />
guarantees nding a seed for each valid mapping is d = ⌊l/s⌋ − 1. We provide two modes of<br />
reducing the initially specied edit distance. A mode strict limits maximum edit distance to<br />
d max = ⌊l/s⌋ − 1, thereby ensuring that either all valid alignments with optimal edit distance<br />
are found for a read, or it is left unmapped (further seed selection heuristics implemented by<br />
the mapping tool excluded). <strong>An</strong> alternative setting on sets a weaker limit d max = ⌊l/s⌋, such<br />
that some of multiple mappings with the same edit distance may be missed, but reads are left<br />
unmapped if valid alignments at lesser edit distance can not be excluded.