28.02.2014 Views

An Integrated Data Analysis Suite and Programming ... - TOBIAS-lib


CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE

While alignment scores become available after the initial stage, filtering is delayed until after backtracing for reasons of threshold calculation.

The end alignment mode for the left end of the alignment determines the mode of alignment matrix initialization and alignment score calculation. With left end alignment mode local, initialization and score calculation proceed according to standard local alignment, with the first row and column initialized to zero and the alignment score at each field of the matrix truncated at zero from below. Left end mode dangling_any uses the same matrix initialization, but does not clip alignment scores at zero. With dangling_ref and dangling_qry, only the first row or the first column is zero-initialized, respectively, given that the width of the matrix is determined by the size of the reference sequence. Sequence overhangs for the respective other sequence are scored with the regular gap penalty. Mode global does not perform pre-initialization of the matrix, as in standard global alignment.
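The border handling described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `init_borders`, the linear gap penalty of -2, and the reading of mode global as cumulative gap penalties along both borders (as in textbook global alignment) are not taken from the suite itself. Columns index the reference and rows index the query, following the statement that the matrix width is given by the reference.

```python
LEFT_END_MODES = ("local", "dangling_any", "dangling_ref", "dangling_qry", "global")


def init_borders(mode, qry_len, ref_len, gap=-2):
    """Return (first_row, first_col, clip_at_zero) for a left end mode.

    Hypothetical helper: `gap` is an assumed linear gap penalty, and the
    border values for "global" are an assumption (cumulative gap penalties).
    """
    if mode not in LEFT_END_MODES:
        raise ValueError(f"unknown left end mode: {mode}")
    # The first row runs along the reference; zeros there make a leading
    # reference overhang free of charge.
    zero_row = mode in ("local", "dangling_any", "dangling_ref")
    # The first column runs along the query; zeros there make a leading
    # query overhang free of charge.
    zero_col = mode in ("local", "dangling_any", "dangling_qry")
    first_row = [0 if zero_row else j * gap for j in range(ref_len + 1)]
    first_col = [0 if zero_col else i * gap for i in range(qry_len + 1)]
    # Only local mode truncates scores at zero from below.
    clip_at_zero = mode == "local"
    return first_row, first_col, clip_at_zero
```

With these assumptions, dangling_ref leaves the first row at zero while the first column accumulates gap penalties, and dangling_qry does the reverse.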

To select the best of multiple permitted pairs of end alignment modes for a reference oligomer, multiple passes of dynamic programming alignment must be performed because of the distinct requirements of matrix initialization. For each distinct left end alignment mode, a corresponding right end alignment mode may be defined. However, alignment is only performed if the configuration is not included by a different pair of specified modes, i.e. if the constraint on the right end of the alignment is weaker than that specified by the next left end mode that includes the current one. Due to the hierarchical nature of tolerated sequence overhang configurations, there is nonetheless a chance that the same alignment will be produced multiple times by different keyword pairs. For example, (dangling_ref;dangling_qry) and (dangling_qry;dangling_ref) both define supersets of the (global;global) configuration. Such redundancies can be eliminated prior to backtracing by determining the subset of the respective end alignment mode that has already been covered by previous alignments and excluding it by assigning the corresponding fields of the alignment matrix the maximum penalty.

For backtracing initialization and alignment score calculation, the alignment algorithm keeps track of the values and locations of the last row, last column and global score maxima of the matrix. Mirroring matrix initialization, backtracing starts at the global score maxima for right end alignment mode local, at the last row and last column maxima for modes dangling_ref and dangling_qry, respectively, at the combined last row and last column maxima for dangling_any, and at the lower right corner for global.
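The choice of backtracing start fields per right end mode can be illustrated with a short Python sketch. The function name `backtrace_starts` and the decision to return every field that ties for the maximum are assumptions for illustration; the text does not state how ties are handled. The matrix is a list of rows, with rows indexing the query and columns indexing the reference.

```python
def backtrace_starts(matrix, mode):
    """Return the (row, column) fields where backtracing would start.

    Hypothetical helper; all fields sharing the maximum score are returned.
    """
    last_i = len(matrix) - 1
    last_j = len(matrix[0]) - 1
    if mode == "global":
        # Global mode always starts at the lower right corner.
        return [(last_i, last_j)]
    last_row = [(last_i, j) for j in range(last_j + 1)]
    last_col = [(i, last_j) for i in range(last_i + 1)]
    if mode == "local":
        cells = [(i, j) for i in range(last_i + 1) for j in range(last_j + 1)]
    elif mode == "dangling_ref":
        cells = last_row
    elif mode == "dangling_qry":
        cells = last_col
    elif mode == "dangling_any":
        # Combine both borders; the corner is already part of last_row.
        cells = last_row + last_col[:-1]
    else:
        raise ValueError(f"unknown right end mode: {mode}")
    best = max(matrix[i][j] for i, j in cells)
    return [(i, j) for i, j in cells if matrix[i][j] == best]


EXAMPLE_MATRIX = [
    [0, 0, 0],
    [0, 5, 4],
    [0, 2, 3],
]
```

For `EXAMPLE_MATRIX`, local mode starts at the interior field holding 5, while dangling_ref and dangling_qry start at the best field of the last row and last column, respectively.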

Whenever the backtracing algorithm encounters multiple possible fields of origin for an alignment score, alternative paths are stacked for later completion, unless retrieval of only a single representative alignment was requested. While exhaustive tracing of all possible alignment paths can be combinatorially unfavorable, generating a representative for each distinct pair of mapping end coordinates can be performed efficiently. For this purpose, each field of the matrix that has been reached by the backtracing algorithm from the same right end coordinates is marked as visited. If a field marked as visited is reached from one of the stacked alternative paths, the respective path is aborted and the algorithm proceeds to assess the next path.
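The visited-marking scheme can be sketched as a stack-based traversal. This is a simplified Python illustration, not the suite's implementation: the function `representative_traces`, the `predecessors` callback (which returns the possible fields of origin for a field, or an empty list where the alignment begins), and the example data are all hypothetical.

```python
def representative_traces(start, predecessors):
    """Return one trace per distinct left end reachable from `start`.

    `predecessors(cell)` is an assumed callback listing the fields of
    origin of `cell`; an empty list marks the left end of an alignment.
    """
    visited = set()
    stack = [(start, (start,))]  # (current field, path walked so far)
    traces = []
    while stack:
        cell, path = stack.pop()
        if cell in visited:
            # Reached from an earlier path already: abort this alternative.
            continue
        visited.add(cell)
        origins = predecessors(cell)
        if not origins:
            # Left end reached; report the path in left-to-right order.
            traces.append(list(reversed(path)))
            continue
        # Stack alternatives so that the first origin is followed first.
        for origin in reversed(origins):
            stack.append((origin, path + (origin,)))
    return traces


# Hypothetical fields of origin for a tiny matrix.
EXAMPLE_ORIGINS = {
    (2, 2): [(1, 1), (2, 1)],
    (1, 1): [(0, 0)],
    (2, 1): [(1, 1), (1, 0)],
}
traces = representative_traces((2, 2), lambda cell: EXAMPLE_ORIGINS.get(cell, []))
```

In this example three complete paths exist, but only two traces are returned, one per distinct left end: the alternative path through (2, 1) is aborted when it reaches the already visited field (1, 1) and continues only via (1, 0). Because every field is expanded at most once, the work is bounded by the number of matrix fields rather than the number of paths.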

For read clipping purposes it is typically only relevant whether an oligomer can be assigned to unique coordinates on the sequencing read. For assessment of coordinate uniqueness, the backtracing algorithm proceeds only until alternative end coordinates have been encountered, and the alternative trace is not emitted. If the backtrace completes without encountering coordinates different from the primary trace, its end coordinates are marked as unique.
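A uniqueness check along these lines might look like the following Python sketch, which stops as soon as a second, different end coordinate is found and never materializes the alternative trace. The function `has_unique_end`, the `predecessors` callback, and the `key` projection (which maps a matrix field onto whichever coordinate is to be compared, by default the whole field) are assumptions made for illustration.

```python
def has_unique_end(start, predecessors, key=lambda cell: cell):
    """Return True if all backtraces from `start` share one left end.

    `predecessors(cell)` is an assumed callback listing the fields of
    origin of `cell`; `key` is an assumed projection onto the coordinate
    of interest, for example the position on the sequencing read.
    """
    visited = set()
    stack = [start]
    primary_end = None
    while stack:
        cell = stack.pop()
        if cell in visited:
            continue
        visited.add(cell)
        origins = predecessors(cell)
        if origins:
            stack.extend(origins)
        elif primary_end is None:
            # End coordinate of the primary trace.
            primary_end = key(cell)
        elif key(cell) != primary_end:
            # Alternative end coordinate found: stop early, emit nothing.
            return False
    return True


# Hypothetical fields of origin for a tiny matrix.
SAMPLE_ORIGINS = {
    (2, 2): [(1, 1), (2, 1)],
    (1, 1): [(0, 0)],
    (2, 1): [(1, 1), (1, 0)],
}


def lookup(cell):
    return SAMPLE_ORIGINS.get(cell, [])
```

Here the traces end at two different fields, so the default comparison reports the end as not unique, whereas comparing only the column index (`key=lambda cell: cell[1]`) treats both ends as the same coordinate.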

Reliability of a sequence match is determined by both the length of the match and the relative amount and type of its edit operations. Alignment score filtering is therefore configured by setting the slope and offset of a linear function. By default, the function's argument is the length within the shorter of the reference and query sequences that is spanned by the match. Alignments with a score below the threshold thus calculated are not propagated to the list of results and sequencing
