An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib
CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE

While alignment scores become available after the initial stage, filtering is delayed until after
backtracing for reasons of threshold calculation.

The end alignment mode for the left end of the alignment determines the mode of alignment
matrix initialization and alignment score calculation. With left end alignment mode local,
initialization and score calculation proceed according to standard local alignment, with the first
row and column initialized to zero and the alignment score at each field of the matrix truncated
at zero from below. Left end mode dangling_any utilizes the same matrix initialization, but
does not clip alignment scores at zero. With dangling_ref and dangling_qry, only the first row
or column is zero-initialized, respectively, given that the width of the matrix is determined by the
size of the reference sequence. Sequence overhangs for the respective other sequence are valued
with the regular gap penalty. Mode global does not perform pre-initialization of the matrix, as
in standard global alignment.

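The initialization rules above can be illustrated with a minimal sketch. It is not the suite's implementation: the function names, the linear gap penalty, and the scoring values are assumptions made for illustration, and the global case is modeled with cumulative gap penalties on both borders as in textbook global alignment. Rows follow the query and columns follow the reference, so that the matrix width equals the reference length as stated in the text.

```python
# Mode names follow the text; function names and scoring values are
# illustrative assumptions, not taken from the original suite.
LEFT_MODES = ("local", "dangling_any", "dangling_ref", "dangling_qry", "global")


def init_matrix(n_qry, n_ref, left_mode, gap=-2):
    """Create an (n_qry + 1) x (n_ref + 1) matrix; columns follow the reference."""
    if left_mode not in LEFT_MODES:
        raise ValueError(f"unknown left end mode: {left_mode}")
    # First row is zero for local, dangling_any and dangling_ref.
    zero_row = left_mode in ("local", "dangling_any", "dangling_ref")
    # First column is zero for local, dangling_any and dangling_qry.
    zero_col = left_mode in ("local", "dangling_any", "dangling_qry")
    m = [[0] * (n_ref + 1) for _ in range(n_qry + 1)]
    if not zero_row:
        for j in range(1, n_ref + 1):
            m[0][j] = j * gap  # overhang valued with the regular gap penalty
    if not zero_col:
        for i in range(1, n_qry + 1):
            m[i][0] = i * gap
    return m


def fill_matrix(qry, ref, left_mode, match=1, mismatch=-1, gap=-2):
    """Fill the matrix; only left end mode 'local' clips scores at zero."""
    m = init_matrix(len(qry), len(ref), left_mode, gap)
    clip = left_mode == "local"
    for i in range(1, len(qry) + 1):
        for j in range(1, len(ref) + 1):
            sub = match if qry[i - 1] == ref[j - 1] else mismatch
            score = max(m[i - 1][j - 1] + sub,  # diagonal: match or mismatch
                        m[i - 1][j] + gap,      # gap in the reference
                        m[i][j - 1] + gap)      # gap in the query
            m[i][j] = max(score, 0) if clip else score
    return m
```

In this sketch, local and dangling_any share the same zeroed borders and differ only in the clipping step, which mirrors the distinction drawn in the text.
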
For selection of the best out of multiple permitted pairs of end alignment mode for a reference
oligomer, multiple passes of dynamic programming alignment must be performed due to the
distinct requirements of matrix initialization. For each distinct left end alignment mode a corresponding
right end alignment mode may be defined. However, alignment is only performed if the
configuration is not included by a different pair of modes specified, i.e. if the constraint on the
right end of the alignment is weaker than that specified by the next left end mode that includes the
current one. Due to the hierarchical nature of tolerated sequence overhang configurations there
is nonetheless a chance that the same alignment will be produced multiple times by different keyword
pairs. For example, (dangling_ref;dangling_qry) and (dangling_qry;dangling_ref)
both define supersets of the (global;global) configuration. Such redundancies can be eliminated
prior to backtracing by determining the subset of the respective end alignment mode that
has already been covered by previous alignments and excluding it by assigning corresponding
fields of the alignment matrix the maximum penalty.

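The selection of passes can be sketched as a filter over (left;right) mode pairs. The inclusion table below is an assumption: the text only states that the overhang configurations are hierarchical and that both mixed dangling pairs are supersets of (global;global). In particular, treating local as including dangling_any is a guess, and the names `INCLUDES`, `pair_includes`, and `required_passes` are hypothetical.

```python
# Assumed inclusion hierarchy: each mode maps to the set of modes it includes.
INCLUDES = {
    "global": {"global"},
    "dangling_ref": {"global", "dangling_ref"},
    "dangling_qry": {"global", "dangling_qry"},
    "dangling_any": {"global", "dangling_ref", "dangling_qry", "dangling_any"},
    "local": {"global", "dangling_ref", "dangling_qry", "dangling_any", "local"},
}


def pair_includes(outer, inner):
    """True if pair `outer` tolerates every configuration that `inner` tolerates."""
    return inner[0] in INCLUDES[outer[0]] and inner[1] in INCLUDES[outer[1]]


def required_passes(pairs):
    """Keep only the (left, right) pairs not included by another specified pair."""
    unique = list(dict.fromkeys(pairs))  # drop exact duplicates, keep order
    return [p for p in unique
            if not any(q != p and pair_includes(q, p) for q in unique)]
```

With the example from the text, (global;global) is dropped because either mixed dangling pair includes it, while the two mixed pairs are both kept because neither includes the other. Their overlap is what the penalty masking described above would then remove.
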
For backtracing initialization and alignment score calculation, the alignment algorithm
keeps track of values and locations of last row, last column and global score maxima for the
matrix. Mirroring matrix initialization, backtracing starts at the global score maximum for right
end alignment mode local, at the last row and last column maxima for modes dangling_ref
and dangling_qry, respectively, at the combined last row and last column maximum for
dangling_any, and at the lower right corner for global.

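The choice of the backtracing start field per right end mode can be sketched as follows. This is an illustration under assumptions: the function name `backtrace_start` is hypothetical, ties are resolved in favor of the first field in scan order, and the matrix is scanned after filling, whereas the text describes the maxima being tracked during score calculation.

```python
def backtrace_start(m, right_mode):
    """Return ((row, col), score) of the field where backtracing would begin."""
    last_i, last_j = len(m) - 1, len(m[0]) - 1
    last_row = [(last_i, j) for j in range(last_j + 1)]
    last_col = [(i, last_j) for i in range(last_i + 1)]
    if right_mode == "local":
        # Global maximum over the whole matrix.
        cells = [(i, j) for i in range(last_i + 1) for j in range(last_j + 1)]
    elif right_mode == "dangling_ref":
        cells = last_row
    elif right_mode == "dangling_qry":
        cells = last_col
    elif right_mode == "dangling_any":
        cells = last_row + last_col  # combined last row and last column
    elif right_mode == "global":
        cells = [(last_i, last_j)]   # lower right corner
    else:
        raise ValueError(f"unknown right end mode: {right_mode}")
    best = max(cells, key=lambda c: m[c[0]][c[1]])
    return best, m[best[0]][best[1]]
```
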
Whenever the backtracing algorithm encounters multiple possible fields of origin for an alignment
score, alternative paths are stacked for later completion, unless retrieval of only a single
representative alignment was requested. While exhaustive tracing of all possible alignment paths
can be combinatorially unfavorable, generating a representative for all distinct pairs of mapping
end coordinates can be performed efficiently. For this purpose, each field of the matrix that has
been reached by the backtracing algorithm from the same right end coordinates is marked as
visited. If a field marked as visited is reached from one of the stacked alternative paths, the
respective path is aborted and the algorithm proceeds to assessing the next path.

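A minimal sketch of backtracing with stacked alternatives and visited marking is given here. It makes several assumptions that are not from the original: a linear gap penalty, zero-initialized borders without clipping (as for left end mode dangling_any), termination of a trace when it reaches the first row or column, and the hypothetical names `fill_zero_border`, `origins`, and `trace_representatives`.

```python
def fill_zero_border(qry, ref, match, mismatch, gap):
    """Fill a matrix with zeroed first row and column and no clipping."""
    m = [[0] * (len(ref) + 1) for _ in range(len(qry) + 1)]
    for i in range(1, len(qry) + 1):
        for j in range(1, len(ref) + 1):
            sub = match if qry[i - 1] == ref[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + sub, m[i - 1][j] + gap, m[i][j - 1] + gap)
    return m


def origins(m, qry, ref, i, j, match, mismatch, gap):
    """List all fields from which the score at (i, j) can have been derived."""
    sub = match if qry[i - 1] == ref[j - 1] else mismatch
    candidates = [((i - 1, j - 1), m[i - 1][j - 1] + sub),
                  ((i - 1, j), m[i - 1][j] + gap),
                  ((i, j - 1), m[i][j - 1] + gap)]
    return [cell for cell, score in candidates if score == m[i][j]]


def trace_representatives(qry, ref, start, match=1, mismatch=-2, gap=-1):
    """Return one path per distinct left end, plus the number of aborted paths."""
    m = fill_zero_border(qry, ref, match, mismatch, gap)
    visited, results, aborted = set(), {}, 0
    stack = [(start, [start])]
    while stack:
        cell, path = stack.pop()
        while True:
            if cell in visited:       # field already covered by an earlier path
                aborted += 1
                break
            visited.add(cell)
            i, j = cell
            if i == 0 or j == 0:      # left end of the alignment reached
                results[cell] = path
                break
            found = origins(m, qry, ref, i, j, match, mismatch, gap)
            if not found:
                break
            # Stack the alternatives and continue along the first origin.
            stack.extend((alt, path + [alt]) for alt in found[1:])
            cell = found[0]
            path = path + [cell]
    return results, aborted
```

For example, `trace_representatives("ACA", "AGA", (3, 3))` meets a three-way tie at field (2, 2). The two stacked alternatives both lead back into the already visited field (1, 1) and are aborted, so a single representative ending at (0, 0) is reported.
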
For read clipping purposes it is typically only relevant whether an oligomer can be assigned
to unique coordinates on the sequencing read. For assessment of coordinate uniqueness, the
backtracing algorithm proceeds only until alternative end coordinates have been encountered, and
the alternative trace is not emitted. If the backtrace completes without encountering coordinates
different from the primary trace, its end coordinates are marked as unique.

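The uniqueness check can be sketched as an early-exit variant of the same stack-based traversal. To keep the example short, the alignment matrix is abstracted away: `origins` is a hypothetical callable returning the possible fields of origin for a field, and `is_end` is a hypothetical predicate that recognizes a completed trace. Neither name comes from the original suite.

```python
def end_is_unique(start, origins, is_end):
    """Return True if every trace from `start` terminates at the same end field."""
    visited = set()
    stack = [start]
    primary_end = None
    while stack:
        cell = stack.pop()
        while cell not in visited:
            visited.add(cell)
            if is_end(cell):
                if primary_end is not None:
                    # A second, different end was reached: stop immediately
                    # without completing or emitting the alternative trace.
                    return False
                primary_end = cell
                break
            candidates = origins(cell)
            if not candidates:
                break
            stack.extend(candidates[1:])  # alternatives are assessed later
            cell = candidates[0]
    return primary_end is not None
```

Because visited fields are never entered twice, a second end field encountered during the traversal is necessarily different from the primary one, which is why no explicit coordinate comparison is needed in this sketch.
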
Reliability of a sequence match is determined by both the length of the match and its relative
amount and type of edit operations. Alignment score filtering is therefore configured by setting
slope and offset of a linear function. By default, the function's parameter is the length within the
shorter sequence out of reference and query that is spanned by the match. Alignments with a
score below the threshold thus calculated are not propagated to the list of results and sequencing