28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

2.6. A PARALLELIZATION FRONT-END FOR SHORT READ ALIGNMENT TOOLS 43<br />

read manipulation. Alternatively, the score threshold may be set up to be calculated using the<br />

total length of the selected sequence as the function parameter.<br />

2.6 A Parallelization Front-End for Short Read Alignment Tools<br />

A great variety of sequencing read mapping <strong>and</strong> alignment programs are nowadays available to<br />

users. While most of these applications are similar in operation, each comes with its own specic<br />

user interface <strong>and</strong> usage peculiarities. Despite the huge selection of dierent implementations<br />

<strong>and</strong> extensive optimization of the underlying mapping <strong>and</strong> alignment algorithms, read mapping<br />

is furthermore still one of the computationally most expensive steps for resequencing applications<br />

(section 1.5.3). While parallel processing can often solve the issue of extensive run times,<br />

parallelization support is not uniform among all available solutions. Therefore, as each read<br />

alignment tool comes with its own set of trade-os, strengths <strong>and</strong> weaknesses, users are often left<br />

to decide between performance <strong>and</strong> required or desired features, <strong>and</strong> frequently need to adapt to<br />

dierent interfaces.<br />

The SHORE mapowcell application is a parallelization, pre- <strong>and</strong> post-processing wrapper for<br />

a variety of widely used <strong>and</strong> freely available short read mappers [1]. In addition to shared memory<br />

parallelization, the application has been reimplemented supporting networked distributed<br />

memory architectures through the Message Passing Interface (MPI) st<strong>and</strong>ard. Further added<br />

capabilities include automatic adjustment of parameters for h<strong>and</strong>ling of data sets with heterogeneous<br />

read lengths, as well as straightforward integration of mapping results obtained using<br />

dierent aligner back-ends or sets of mapping parameters. The program thus constitutes a<br />

meta-alignment tool able to amend mapped sequencing read data sets using dierent aligners,<br />

parameters or additional reference sequence information. Such multi-pass read mapping may<br />

especially benet computationally expensive approaches such as spliced read mapping.<br />

SHORE mapowcell currently provides a common user interface to the tools GenomeMapper,<br />

BWA, Bowtie, Bowtie2, ELAND <strong>and</strong> BLAT. User parameters are automatically transformed into<br />

the appropriate equivalent for the respective mapping back-end <strong>and</strong> results consistently reported<br />

in compressed, seekable SHORE MapList format easily convertible to various data exchange formats<br />

such as SAM.<br />

Ion or pyrosequencing read data sets as well as other types of sequencing data set following<br />

quality based read trimming (section 2.3) or sequence context specic read clipping (section 2.5)<br />

feature read length distributions with broad support range. However, many read mapping tools<br />

only support a set of static alignment parameters provided at program startup, resulting in<br />

below-par mapping specicity for reads on the lower tail of the read length distribution <strong>and</strong><br />

insucient sensitivity for reads at the upper end of the length spectrum.<br />

The SHORE mapowcell application automates read length-dependent selection of all relevant<br />

alignment parameters such as maximum tolerated edit distance, gap extension lengths or seed<br />

sizes. These user parameters may be provided as either absolute values or percentage of read<br />

length; additionally, edit distance may be bounded according to the alignment seed lemma.<br />

Given a certain read length l <strong>and</strong> edit distance d, a read alignment is guaranteed to contain a<br />

perfectly matching seed of size s = ⌊l/(d + 1)⌋, <strong>and</strong> therefore the maximum edit distance that<br />

guarantees nding a seed for each valid mapping is d = ⌊l/s⌋ − 1. We provide two modes of<br />

reducing the initially specied edit distance. A mode strict limits maximum edit distance to<br />

d max = ⌊l/s⌋ − 1, thereby ensuring that either all valid alignments with optimal edit distance<br />

are found for a read, or it is left unmapped (further seed selection heuristics implemented by<br />

the mapping tool excluded). <strong>An</strong> alternative setting on sets a weaker limit d max = ⌊l/s⌋, such<br />

that some of multiple mappings with the same edit distance may be missed, but reads are left<br />

unmapped if valid alignments at lesser edit distance can not be excluded.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!