Lecture 3: Dynamic Programming and Alignment - Main Algorithms ...


Lecture 3: Dynamic Programming and Alignment
Main Algorithms with Applications in Bioinformatics
Grégory Nuel
Laboratoire Statistique et Génome, University of Evry, CNRS (8071), INRA (1152), France
Bioinformatics and Comparative Genome Analysis, Institut Pasteur of Tunis, from March 18 to April 6, 2007


Outline
1 Dynamic Programming
    The Principle
    Application: Longest Common Subsequence
    Application: Viterbi
2 Local Score of One Sequence
    Motivation and Notation
    Computing the Local Score
    Assessing the Significance
3 Alignment of Sequences
    Global Alignment
    Local Alignment
    Heuristics


1 Dynamic Programming: The Principle


Computing n!

Definition (n!)
For all $n \geq 1$ we have $n! = 1 \times 2 \times \dots \times n$, and $0! = 1$. Hence $1! = 1$, $2! = 1 \times 2 = 2$, $3! = 1 \times 2 \times 3 = 6$, ...

function factorial(n)
    r = 1
    for i = 2 ... n do
        r = r × i
    return r

⇒ $O(n)$ to compute $n!$, hence $O(n^2)$ to compute $1!, 2!, 3!, \dots, n!$ one after the other.


Computing n! (storing intermediate results)

function smartfactorial(n)
    static n_top = 3 and static r[33] = {1, 1, 2, 6}
    if n > 32 then
        we need more memory
    else
        while n_top < n do
            n_top = n_top + 1
            r[n_top] = r[n_top − 1] × n_top
        return r[n]

⇒ $O(n)$ to compute $1!, 2!, 3!, \dots, n!$, but we need extra space (here a table of 33 values).
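The idea of keeping already computed factorials can be sketched in Python as follows. This is only an illustration: the class name `SmartFactorial` is hypothetical, and a growable list replaces the fixed 33-entry static table of the pseudocode, so the "we need more memory" branch disappears.

```python
class SmartFactorial:
    """Factorial with a table of already computed values (memoization)."""

    def __init__(self):
        # Same starting values as the pseudocode: 0!, 1!, 2!, 3!
        self._table = [1, 1, 2, 6]

    def __call__(self, n):
        if n < 0:
            raise ValueError("n must be non-negative")
        # Extend the table only as far as needed, reusing the last entry.
        while len(self._table) <= n:
            k = len(self._table)
            self._table.append(self._table[-1] * k)
        return self._table[n]


factorial = SmartFactorial()
print(factorial(5))   # computes 4! and 5!, then returns 5!
print(factorial(3))   # already in the table: no multiplication needed
```

Calling the object for n = 1, 2, ..., N in turn performs one multiplication per new value, which is the linear total cost claimed above.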


What is Dynamic Programming?

Dynamic programming is a method to solve a problem (like computing n!) using the solutions of subproblems (like the values of 1!, 2!, ..., (n−1)!), which takes much less time than naive approaches.

Such an approach usually relies on:
- a recurrence relation (like $n! = (n-1)! \times n$)
- a data structure to memorize the solutions of the subproblems (thus more space requirements than with naive approaches)

Example (Problems solved by dynamic programming)
- n factorial, Fibonacci numbers, ...
- longest common subsequence
- Viterbi algorithm
- local score of one sequence, alignment of sequences, ...


1 Dynamic Programming: Application: Longest Common Subsequence


Longest Common Subsequence

The problem
Given two sequences $X_1^p = X_1 \dots X_p$ and $Y_1^q = Y_1 \dots Y_q$ over the same alphabet $\mathcal{A}$, we want to find their Longest Common Subsequence $\mathrm{LCS}(X_1^p, Y_1^q)$.

Example: LCS(XMJYAUZ, MZJAWXU) = MJAU

The recurrence relation
For all $i \leq p$ and $j \leq q$ we have
\[
\mathrm{LCS}(X_1^i, Y_1^j) =
\begin{cases}
\emptyset & \text{if } i = 0 \text{ or } j = 0 \\
\mathrm{LCS}(X_1^{i-1}, Y_1^{j-1}) + X_i & \text{if } X_i = Y_j \\
\text{the longest of } \mathrm{LCS}(X_1^{i-1}, Y_1^{j}) \text{ and } \mathrm{LCS}(X_1^{i}, Y_1^{j-1}) & \text{otherwise}
\end{cases}
\]
where $+$ is the concatenation symbol.


The algorithm

Length
    initialize $(L_{i,j})_{0 \leq i \leq p,\, 0 \leq j \leq q}$ to 0
    for i = 1 ... p do
        for j = 1 ... q do
            if $X_i = Y_j$ then
                $L_{i,j} = L_{i-1,j-1} + 1$
            else
                $L_{i,j} = \max(L_{i-1,j}, L_{i,j-1})$
    return $L_{p,q}$

Traceback
    find a "path" connecting $L_{0,0}$ to $L_{p,q}$
    return the concatenation of the matching letters of this path
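The length computation and the traceback can be sketched together in Python. The function name `lcs` and the tie-breaking rule of the traceback (go up before going left) are choices made for this illustration; another rule may return a different subsequence of the same length when several exist.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    p, q = len(x), len(y)
    # L[i][j] = length of the LCS of x[:i] and y[:j]
    L = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Traceback: walk from L[p][q] back towards L[0][0],
    # collecting the letters where a match was used.
    letters = []
    i, j = p, q
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("XMJYAUZ", "MZJAWXU"))  # MJAU
```

Both the table filling and the memory cost are proportional to p × q, and the traceback adds at most p + q steps.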


A simple example

Example ($X_1^7$ = XMJYAUZ and $Y_1^7$ = MZJAWXU)

      j    0  1  2  3  4  5  6  7
 i            M  Z  J  A  W  X  U
 0         0  0  0  0  0  0  0  0
 1   X     0  0  0  0  0  0  1  1
 2   M     0  1  1  1  1  1  1  1
 3   J     0  1  1  2  2  2  2  2
 4   Y     0  1  1  2  2  2  2  2
 5   A     0  1  1  2  3  3  3  3
 6   U     0  1  1  2  3  3  3  4
 7   Z     0  1  2  2  3  3  3  4




1 Dynamic Programming: Application: Viterbi


Best hidden path of a HMM

The Problem
How to compute the best path $s^* = \arg\max_s P_\theta(S = s \mid X = x)$ of a HMM?

Proposition
If we define
\[
Z_i(u) = \max_{s_1, \dots, s_{i-1}} P_\theta(S_1 = s_1, \dots, S_{i-1} = s_{i-1}, S_i = u \mid X = x)
\]
then we have the following recurrence relation:
\[
Z_i(u) = \max_{t \in S} Z_{i-1}(t)\, \nu(t, u)\, \mu_u(X_i)
\]


Viterbi Algorithm for a M0M1 model

The log-likelihood of $s^*$
    initialize $L_1(t) = \log \mu_0(t) + \log \mu_t(x_1)$ for all $t \in S$
    for i = 2 ... l do
        for all $u \in S$ do
            $L_i(u) = \max_{t \in S} \left[ L_{i-1}(t) + \log \nu(t, u) \right] + \log \mu_u(x_i)$
            $T_{i-1}(u) = \arg\max_{t \in S} \left[ L_{i-1}(t) + \log \nu(t, u) \right]$
    return $\max_{t \in S} L_l(t)$

Traceback
    $s^*_l = \arg\max_{t \in S} L_l(t)$
    for i = l − 1 ... 1 do
        $s^*_i = T_i(s^*_{i+1})$


A simple example (1)

Example ($S = \{1, 2, 3\}$ and $\mathcal{A} = \{a, c, g, t\}$)
$\nu$ gives the transitions between hidden states:
\[
N = \log \nu =
\begin{pmatrix}
3.0 & -1.0 & -1.0 \\
1.6 & 2.5 & 1.6 \\
-1.0 & -1.0 & 3.0
\end{pmatrix}
\]
and $\mu_t$ gives the distribution of the letter $X_i$ if $S_i = t$:
\[
M = \log
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}
=
\begin{pmatrix}
2.0 & 1.5 & 2.5 & 1.0 \\
1.5 & 2.0 & 1.0 & 2.5 \\
1.9 & 1.9 & 1.9 & 1.9
\end{pmatrix}
\]


A simple example (2)

Example ($S = \{1, 2, 3\}$ and $\mathcal{A} = \{a, c, g, t\}$)
x = ctagacc and (as a simplification) we assume that $L_1(t) = M(t, c)$. We hence get the following table:

x         c       t       a       g       a       c       c
L_i(1)    1.5  →  5.5     11.1 →  16.1 →  21.6 →  26.1 →  30.6
                       ↗                       ↘
L_i(2)    2.0  →  7.0  →  10.5 →  14.5 →  18.0    22.6 →  27.6
L_i(3)    1.9  →  6.8  →  11.7 →  16.6 →  21.5 →  26.6 →  31.3

Each arrow points from the predecessor that achieves the maximum: $L_3(1)$ is reached from state 2 (↗) and $L_6(2)$ from state 1 (↘).
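The log-space recursion and its traceback can be sketched in Python as below. The function name `viterbi` and its argument layout are hypothetical. The sketch assumes that the columns of M are ordered a, c, g, t, and it uses zero initial log-weights to mimic the simplification $L_1(t) = M(t, x_1)$. It is run only on the first two letters (c, t) of the example, and states are numbered from 0 in the code (state 2 of the example is index 1).

```python
def viterbi(log_init, log_trans, log_emit, obs):
    """Best log-score and best hidden path for the observation indices obs."""
    n = len(log_init)
    # scores[i][u]: best log-score of a path ending in state u at position i
    scores = [[log_init[u] + log_emit[u][obs[0]] for u in range(n)]]
    back = []  # back[i][u]: best predecessor of state u at position i + 1
    for x in obs[1:]:
        prev = scores[-1]
        row, ptr = [], []
        for u in range(n):
            t_best = max(range(n), key=lambda t: prev[t] + log_trans[t][u])
            ptr.append(t_best)
            row.append(prev[t_best] + log_trans[t_best][u] + log_emit[u][x])
        scores.append(row)
        back.append(ptr)

    # Traceback from the best final state.
    last = max(range(n), key=lambda u: scores[-1][u])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return scores[-1][last], path


# Matrices N = log(nu) and M = log(mu) copied from the example.
N = [[3.0, -1.0, -1.0], [1.6, 2.5, 1.6], [-1.0, -1.0, 3.0]]
M = [[2.0, 1.5, 2.5, 1.0], [1.5, 2.0, 1.0, 2.5], [1.9, 1.9, 1.9, 1.9]]
COLUMN = {"a": 0, "c": 1, "g": 2, "t": 3}  # assumed column order of M

best, path = viterbi([0.0, 0.0, 0.0], N, M, [COLUMN[ch] for ch in "ct"])
print(best, path)  # 7.0 [1, 1]
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long sequences, which is why the algorithm is stated on $L_i$ rather than $Z_i$.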


2 Local Score of One Sequence: Motivation and Notation


Homogeneous Regions in Biological Sequences

Example (gc-rich regions in DNA)
X = aaagaaagggcacacagccagaaataattttctt
Is there one (or more) gc-rich region in this DNA sequence?

Example (hydrophobic regions in proteins)
X = YVPISMYCLQWLLPVLLIPKPLNWSDGVAS
T, I, L, M, F, W and C are very hydrophobic amino acids. Is there any hydrophobic region in this protein?


Sliding Window

Idea
Given a window size h, for each i = 1 ... l − h + 1, score the feature of interest (gc content, hydrophobic content) in the sliding window [i, i + h − 1].

Example (with h = 5 and score = frequency)

X:      a a a g a a a g g g c a c a c a g c c a g a a a t a a t t t t c t t
score:  1 1 1 2 2 3 4 4 4 3 3 2 3 3 4 3 4 4 3 2 2 1 0 0 0 0 0 1 1 1 - - - -

X:      Y V P I S M Y C L Q W L L P V L L I P K P L N W S D G V A S
score:  1 2 2 3 3 3 3 4 4 3 3 3 3 3 3 4 3 3 2 3 2 2 1 1 - - - -
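A sliding-window count can be sketched in Python as follows. The function name `sliding_window_counts` is hypothetical, and the sketch is run only on the first 21 letters of the DNA example. It updates the count incrementally when the window moves by one position, so the total cost stays linear in the sequence length.

```python
def sliding_window_counts(seq, letters, h):
    """Number of characters of `letters` in each window of size h of seq."""
    hits = [1 if ch in letters else 0 for ch in seq]
    if len(seq) < h:
        return []
    count = sum(hits[:h])          # first window, computed from scratch
    counts = [count]
    for i in range(h, len(seq)):
        # Slide by one: add the entering letter, drop the leaving one.
        count += hits[i] - hits[i - h]
        counts.append(count)
    return counts


PREFIX = "aaagaaagggcacacagccag"   # first 21 letters of the DNA example
print(sliding_window_counts(PREFIX, "gc", 5))
# [1, 1, 1, 2, 2, 3, 4, 4, 4, 3, 3, 2, 3, 3, 4, 3, 4]
```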


Another Approach

Remarks
The sliding window method is very simple to understand and has a low (linear) complexity, but it suffers from several drawbacks:
- how to choose the window size h?
- where exactly are the limits of our regions of interest?
- how to choose the scoring function?
⇒ we need another approach!

Definition (Local Score)
Given a scoring function $S : \mathcal{A} \to \mathbb{R}$, the local score $H$ of the sequence $X = X_1 \dots X_l$ is defined by:
\[
H = \max_{1 \leq i \leq j \leq l} \sum_{k=i}^{j} S(X_k)
\]


Examples

In each example the selected segment is set off by spaces.

Example (S([gc]) = +1 and S([at]) = −1)
X = aaagaaa gggcacacagccag aaataattttctt

Example (S([gc]) = +1 and S([at]) = −2)
X = aaagaaa gggc acacagccagaaataattttctt

Example (S([TILMFWC]) = +1 and S([^TILMFWC]) = −1)
X = YVP ISMYCLQWLLPVLLIPKPLNW SDGVAS

Example (S([TILMFWC]) = +1 and S([^TILMFWC]) = −2)
X = YVPISMY CLQWLL PVLLIPKPLNWSDGVAS


2 Local Score of One Sequence: Computing the Local Score


Brute Force has a Cubic Complexity

Brute force
Score each of the possible segments and pick the best score:
- l segments of length 1
- l − 1 segments of length 2
- ...
- 1 segment of length l
⇒ the resulting complexity is $O(l^3)$

Proposition
If we denote by $H_i$ the best score of a segment of $X_1 \dots X_i$ ending at position i (or 0 if every such segment has a negative score), then we have the following recurrence relation:
\[
H_0 = 0 \quad \text{and} \quad H_i = \max(0, H_{i-1} + S(X_i)) \quad \forall\, 1 \leq i \leq l
\]
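The cubic cost of the brute-force approach can be checked with a short count. The sketch below assumes that scoring a segment of length k from scratch costs k additions; that cost model is an assumption made here for illustration.

```latex
\[
\sum_{k=1}^{l} \underbrace{(l - k + 1)}_{\text{segments of length } k} \times
\underbrace{k}_{\text{additions per segment}}
= \frac{l(l+1)(l+2)}{6} = O(l^3)
\]
```

For instance, l = 3 gives 3 × 1 + 2 × 2 + 1 × 3 = 10 = (3 × 4 × 5) / 6.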


Linear Algorithm

Algorithm
    $H_0 = 0$
    for i = 1 ... l do
        $H_i = \max(0, H_{i-1} + S(X_i))$
    return $H = \max_{1 \leq i \leq l} H_i$

⇒ complexity is $O(l)$ in space and time

Example (S([gc]) = +1 and S([at]) = −1)

X_i:  - a a a g a a a g g g c a c a c a g c c a g a a a t a a t t t t c t t
H_i:  0 0 0 0 1 0 0 0 1 2 3 4 3 4 3 4 3 4 5 6 5 6 7 6 5 4 3 2 1 0 0 0 1 0 0

We get H = 7 and can easily find the corresponding segment.
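The linear recurrence can be sketched in Python as below, together with the bookkeeping needed to recover a best segment. The names `local_score`, `GC_SCORE` and `PREFIX` are hypothetical, the sketch is run only on the first 21 letters of the DNA example, and ties are broken in favour of the latest segment reaching the maximum. Only the current $H_i$ is kept, so the extra memory is constant.

```python
def local_score(seq, score):
    """Return (H, (start, end)) where seq[start:end] is a best segment."""
    best, best_segment = 0, (0, 0)
    h, start = 0, 0
    for i, letter in enumerate(seq):
        h += score[letter]
        if h <= 0:
            # H_i = max(0, H_{i-1} + S(X_i)): restart after this position.
            h, start = 0, i + 1
        elif h >= best:
            best, best_segment = h, (start, i + 1)
    return best, best_segment


GC_SCORE = {"g": 1, "c": 1, "a": -1, "t": -1}
PREFIX = "aaagaaagggcacacagccag"   # first 21 letters of the DNA example

h, (start, end) = local_score(PREFIX, GC_SCORE)
print(h, PREFIX[start:end])  # 6 gggcacacagccag
```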


Interest of the Local Score Approach

Remarks
- simple and efficient linear algorithm
- directly points out the segments of interest
- can be used with complex scoring functions (ex: the Kyte-Doolittle hydrophobic scale)
- all suboptimal segments can be found in $O(l)$ thanks to the algorithm of Ruzzo and Tompa (1999)
⇒ a far more elegant approach than sliding windows for a wide range of problems encountered in sequence analysis.


Connection with HMM (1)

M0M1 HMM model
$S = \{1, 2, 3\}$; the transitions between hidden states are given by the chain

    start → 1 → 2 → 3 → stop

where state 1 stays in place with probability 1 − p and moves to state 2 with probability p, state 2 stays in place with probability 1 − q and moves to state 3 with probability q, and state 3 stays in place with probability 1 − p and moves to stop with probability p.

For all $a \in \mathcal{A}$ we have
\[
\begin{cases}
P(X_i = a \mid S_i = 1) = \mu(a) \\
P(X_i = a \mid S_i = 2) = \nu(a) \\
P(X_i = a \mid S_i = 3) = \mu(a)
\end{cases}
\]


Connection with HMM (2)

Proposition
With $\theta = (p, q, \mu, \nu)$ we have:
\[
L(\theta \mid X, S) = l \log(1 - p) - 2 \log p - \log q
+ \sum_{i=1}^{l} \log \mu(X_i)
+ \sum_{i=1}^{l} \mathbb{I}_{S_i = 2} \log \frac{(1 - q)\, \nu(X_i)}{(1 - p)\, \mu(X_i)}
\]
hence with the scoring function
\[
S(a) = \log \frac{(1 - q)\, \nu(a)}{(1 - p)\, \mu(a)} \quad \forall a \in \mathcal{A}
\]
we get
\[
L(\theta \mid X, S = s^*) = H + \text{constant}
\]
where $s^*$ is the Viterbi path and $H$ is the local score.


Connection with HMM (3)

Remarks
- the local score is totally equivalent to a M0M1 model
- it is hence possible to use HMM parameter estimation to estimate the scoring function
- HMM provides a natural path to extend the local score:
    - best score on two segments (or three, or four, ...)
    - dependent scoring function
    - generalized HMM to avoid the geometric distribution of the segment lengths
    - ...


2 Local Score of One Sequence: Assessing the Significance


Notion of p-value

Problem
Using S([gc]) = +1 and S([at]) = −1, how should we interpret H(X) = H(Y) if we assume one of the following:
- A1: X is 10 times shorter than Y
- A2: $P_X([gc]) = 90\%$ while $P_Y([gc]) = 20\%$
- A3: both A1 and A2 hold
⇒ need for p-values

Definition (p-value)
If we assume that $X = X_1 \dots X_l$ is generated according to a random model (ex: M0, M1, ...) then the p-value of an observed result $H_{obs}$ is given by:
\[
\text{p-value} = P(H \geq H_{obs})
\]
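The definition can be illustrated with a crude Monte Carlo estimate under an M0 model (independent letters). This is only a sketch: the function names, the letter probabilities `M0_PROBS` and the number of simulations are assumptions chosen for illustration, and the estimate cannot resolve p-values much smaller than 1 / n_sim.

```python
import random


def local_score(seq, score):
    """Local score H computed with H_i = max(0, H_{i-1} + S(X_i))."""
    best = h = 0
    for letter in seq:
        h = max(0, h + score[letter])
        best = max(best, h)
    return best


def mc_pvalue(h_obs, length, probs, score, n_sim=2000, seed=0):
    """Estimate P(H >= h_obs) for M0 sequences of the given length."""
    rng = random.Random(seed)
    letters = list(probs)
    weights = [probs[a] for a in letters]
    hits = 0
    for _ in range(n_sim):
        seq = rng.choices(letters, weights=weights, k=length)
        if local_score(seq, score) >= h_obs:
            hits += 1
    return hits / n_sim


GC_SCORE = {"g": 1, "c": 1, "a": -1, "t": -1}
M0_PROBS = {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}  # hypothetical composition

print(mc_pvalue(6, 34, M0_PROBS, GC_SCORE))
```

Changing `length` or `M0_PROBS` shows how the same observed score can be unremarkable for a long or gc-rich sequence and surprising for a short or gc-poor one, which is the point of assumptions A1 to A3.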


Asymptotic Approximations

Proposition (Iglehart, 1972 and Karlin et al., 1990)
\[
H - \frac{\log l}{\lambda} \sim \mathrm{Gumbel}\left( \frac{\log K}{\lambda};\ \frac{1}{\lambda} \right)
\]
which means that
\[
P\left( H \geq \frac{\log l}{\lambda} + u \right) \simeq 1 - \exp\left( -K e^{-\lambda u} \right)
\]

Two ways to compute K and λ:
- analytically in the M0 case (λ is easy, K requires more complex computations)
- by simulations, using
\[
\log\left( -\log P(H < x) \right) \simeq -\lambda x + \log K + \log l
\]
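The relation used for the simulations follows from the Gumbel approximation by a change of variable. The short derivation below is a sketch that substitutes $x = \frac{\log l}{\lambda} + u$ and takes logarithms twice.

```latex
\begin{align*}
P(H \geq x) &\simeq 1 - \exp\left(-K e^{-\lambda \left(x - \frac{\log l}{\lambda}\right)}\right)
             = 1 - \exp\left(-K\, l\, e^{-\lambda x}\right) \\
P(H < x) &\simeq \exp\left(-K\, l\, e^{-\lambda x}\right) \\
-\log P(H < x) &\simeq K\, l\, e^{-\lambda x} \\
\log\left(-\log P(H < x)\right) &\simeq -\lambda x + \log K + \log l
\end{align*}
```

Plotting the left-hand side, estimated from simulated sequences, against x should therefore give a roughly straight line with slope −λ and intercept log K + log l.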


Exact Computations

Remarks
- once λ and K are computed, the asymptotic approximations are very fast to compute
- these approximations are only valid for large l

Proposition (Mercier & Daudin 2001, Nuel 2006)
One can use FMCI (finite Markov chain imbedding) to compute an exact p-value for a Mm model with a complexity
\[
O\left( 10^n \times k^{m+1} \times H_{obs} \times \log l \right)
\]
where n is the number of digits in the scoring function and k is the alphabet size.


Gumbel approximations vs exact p-values (1)

Example (Kyte-Doolittle hydrophobic scale on SwissProt)
[Figure not available in this transcript.]


Gumbel approximations vs exact p-values (2)

- asymptotic approximations can be completely wrong (for 99.5% of the data in our example)
- exact computations take much more time (in our example, 20 exact p-values computed per second)
- a program called pLocalScore performs both the asymptotic approximations and the exact computations
⇒ advice: use exact computations rather than asymptotic approximations as often as possible


3 Alignment of Sequences: Global Alignment


How to compare sequences?

Problem
How to compare two biological sequences (DNA, proteins) $X = X_1 \dots X_p$ and $Y = Y_1 \dots Y_q$? Should we:
- compare their respective lengths?
- compare their compositions (letters, words of size 2, 3, ...)?
- look for repetitions?
- ...
⇒ What about their proximity in the evolution process?


Mutation and Indel

Definition
During the evolution of a biological sequence, a letter that changes is called a mutation, and we call indel either the insertion of a letter or the deletion of a letter.

Example (a DNA sequence)
Each row is the sequence at a later step; • marks the position where an event occurs.

a c c g t t a c a a g a c a
| | | | | | | | | | | | | |
a c c g t t a c a a g a c a
| | | | | | | | | | | | | |
a c c g t t a c a a g a c a
| | | | | | | | | • | | | |      (mutation: a becomes t)
a c c g t t a c a t g a c a
| | • | | | | | | | | | | |      (deletion of the third letter)
a c g t t a c a t g a c a
| | | | | | • | | | | | | |      (insertion of a g)
a c g t t a g c a t g a c a


What is an alignment?

Example (Two DNA sequences)
X = a c g t a g c a t g a c a
Y = a c c g t a c a a g c a

We denote by A a common ancestral sequence:
A   a c c g t t a c a a g a c a
X   a c g t a g c a t g a c a
Y   a c c g t a c a a g c a

Here is the alignment we get:
X̃   a c - g t - a g c a t g a c a
Ỹ   a c c g t - a - c a a g - c a


Scoring Alignments

Definition (Score of an Alignment)
Using the scoring function $S : (\mathcal{A} \cup \{-\}) \times (\mathcal{A} \cup \{-\}) \to \mathbb{R}$, we define the score of an alignment as the sum of the scoring function over all the columns of the alignment.

Example (S(match) = +1, S(mismatch) = −1, S(gap) = −2)
The first alignment (8 matches, 0 mismatches, 9 gaps) scores 8 × 1 − 0 × 1 − 9 × 2 = −10:
X̃   a c - g t a - - - g c a t g a c a
Ỹ   a c c g t a c a a g c a - - - - -

and the second alignment (10 matches, 1 mismatch, 3 gaps) scores 10 × 1 − 1 × 1 − 3 × 2 = 3:
X̃   a c - g t a g c a t g a c a
Ỹ   a c c g t a - c a a g - c a
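The column-by-column definition can be sketched in Python as follows. The function name `alignment_score` and its keyword arguments are hypothetical; the two rows are given as equal-length strings with "-" for a gap, and a column holding a gap in either row is charged one gap penalty.

```python
def alignment_score(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the scoring function over the columns of a gapped alignment."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# First alignment of the example above: 8 matches and 9 gaps.
print(alignment_score("ac-gta---gcatgaca", "accgtacaagca-----"))  # -10
```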


How to find the best alignment?

Brute force
Score all possible alignments individually and pick the best one.
⇒ How many possible alignments?

Proposition
If $N(p, q)$ is the number of alignments between two sequences of lengths p and q, we have the following recurrence relation:
\[
N(1, q) = 2q + 1, \qquad N(p, 1) = 2p + 1
\]
\[
N(p, q) = N(p, q - 1) + N(p - 1, q) + N(p - 1, q - 1)
\]
for all $p, q > 1$.


Number of Alignments

Example
        q = 1   q = 2   q = 3   q = 4   q = 5   q = 6   q = 7
p = 1       3       5       7       9      11      13      15
p = 2       5      13      25      41      61      85     113
p = 3       7      25      63     129     231     377     575
p = 4       9      41     129     321     681    1289    2241
p = 5      11      61     231     681    1683    3653    7183

Approximation
Idea: $N(p, q) \sim \rho^{p+q}$; we hence get $\rho^2 - 2\rho - 1 = 0 \Rightarrow \rho = 1 + \sqrt{2}$.
This gives us: $N(20, 20) \simeq 10^{15}$, $N(100, 100) \simeq 10^{76}$, $N(1000, 1000) \simeq 10^{764}$.
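The equation for ρ follows from substituting the exponential form into the recurrence for N(p, q). A short derivation, assuming the growth is governed by a single rate ρ > 0:

$$
\rho^{p+q} = \rho^{p+q-1} + \rho^{p+q-1} + \rho^{p+q-2}
\;\Longrightarrow\;
\rho^{2} = 2\rho + 1
\;\Longrightarrow\;
\rho = 1 \pm \sqrt{2}
$$

Only the positive root $\rho = 1 + \sqrt{2} \approx 2.414$ is admissible. Since $\log_{10} \rho \approx 0.383$, the number of alignments gains roughly 0.38 decimal digits for every letter added to either sequence, which is why brute force is hopeless even for short sequences.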


Dynamic Programming

Needleman and Wunsch (1970)
We denote by $B(i, j)$ the best score of an alignment of $X_1 \ldots X_i$ and $Y_1 \ldots Y_j$, and we get:
1: $B(0, 0) = 0$
2: $B(i, 0) = \sum_{k=1}^{i} S(X_k, -)$ and $B(0, j) = \sum_{k=1}^{j} S(-, Y_k)$
3: for $i = 1 \ldots p$ do
4:   for $j = 1 \ldots q$ do
5:     $$B(i, j) = \max \begin{cases} B(i-1, j-1) + S(X_i, Y_j) \\ B(i-1, j) + S(X_i, -) \\ B(i, j-1) + S(-, Y_j) \end{cases}$$
6: return $B(p, q)$ and use a traceback to find the alignment
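The recurrence above can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's own code: the function name `needleman_wunsch` is hypothetical, the scoring function is reduced to match / mismatch / gap values (defaults +3, −1, −2, the values used in the lecture's worked example), and ties in the traceback are broken in favour of the diagonal move, so other optimal alignments with the same score may exist.

```python
def needleman_wunsch(x, y, match=3, mismatch=-1, gap=-2):
    """Return (best global score, gapped x, gapped y)."""
    p, q = len(x), len(y)
    # B[i][j] = best score of an alignment of x[:i] and y[:j]
    B = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        B[i][0] = B[i - 1][0] + gap
    for j in range(1, q + 1):
        B[0][j] = B[0][j - 1] + gap
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            B[i][j] = max(B[i - 1][j - 1] + s,   # X_i aligned with Y_j
                          B[i - 1][j] + gap,     # X_i aligned with a gap
                          B[i][j - 1] + gap)     # Y_j aligned with a gap

    # Traceback: walk from (p, q) to (0, 0), re-deriving each choice.
    ax, ay = [], []
    i, j = p, q
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if B[i][j] == B[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and B[i][j] == B[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return B[p][q], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("gcgacgtgcaag", "aggcacgca")
print(score)  # 7
print(ax)
print(ay)
```

Storing the whole table costs O(p × q) memory, which is what makes the traceback this simple.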


Example

Example (X = gcgacgtgcaag, Y = aggcacgca; match +3, mismatch −1, gap −2)

       -    a    g    g    c    a    c    g    c    a
  -    0   -2   -4   -6   -8  -10  -12  -14  -16  -18
  g   -2   -1    1   -1   -3   -5   -7   -9  -11  -13
  c   -4   -3   -1    0    2    0   -2   -4   -6   -8
  g   -6   -5    0    2    0    1   -1    1   -1   -3
  a   -8   -3   -2    0    1    3    1   -1    0    2
  c  -10   -5   -4   -2    3    1    6    4    2    0
  g  -12   -7   -2   -1    1    2    4    9    7    5
  t  -14   -9   -4   -3   -1    0    2    7    8    6
  g  -16  -11   -6   -1   -3   -2    0    5    6    7
  c  -18  -13   -8   -3    2    0    1    3    8    6
  a  -20  -15  -10   -5    0    5    3    1    6   11
  a  -22  -17  -12   -7   -2    3    4    2    4    9
  g  -24  -19  -14   -9   -4    1    2    7    5    7

Y  a g g c - a c g - - c a - -
X  - g - c g a c g t g c a a g


Limits of Global Alignment

Remarks
Global alignment is not suitable to detect:
- overlaps (e.g. the end of X and the beginning of Y are the same)
- insertions (e.g. X is included in Y)
- all kinds of local similarities

Definition (local alignment)
The local alignment of two sequences is the best global alignment of two segments of these sequences:
$$H = \max_{1 \le i \le i' \le p,\; 1 \le j \le j' \le q} B(X_i \ldots X_{i'},\, Y_j \ldots Y_{j'})$$
where $B(U, V)$ denotes the best global alignment score of the segments $U$ and $V$.


Algorithm

Smith and Waterman (1981)
We denote by $H(i, j)$ the best score of a local alignment of $X_1 \ldots X_i$ and $Y_1 \ldots Y_j$ ending at $(X_i, Y_j)$ (the empty alignment scores 0), and we get:
1: $H(i, 0) = H(0, j) = 0$ for all $i, j$
2: for $i = 1 \ldots p$ do
3:   for $j = 1 \ldots q$ do
4:     $$H(i, j) = \max \begin{cases} H(i-1, j-1) + S(X_i, Y_j) \\ H(i-1, j) + S(X_i, -) \\ H(i, j-1) + S(-, Y_j) \\ 0 \end{cases}$$
5: return $\max_{i,j} H(i, j)$ and use a traceback to find the alignment
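A Python sketch of this local-alignment recurrence is given below. It is an illustration, not the lecture's implementation: the name `smith_waterman` is hypothetical, scoring is reduced to match / mismatch / gap values (defaults +3, −1, −2 as in the lecture's example), the first cell reaching the maximum is kept, and the traceback prefers the diagonal move and stops at the first cell equal to 0.

```python
def smith_waterman(x, y, match=3, mismatch=-1, gap=-2):
    """Return (best local score, gapped segment of x, gapped segment of y)."""
    p, q = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at (x[i-1], y[j-1])
    H = [[0] * (q + 1) for _ in range(p + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                      # restart: empty alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Traceback from the best cell until the score drops to 0.
    ax, ay = [], []
    i, j = best_cell
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("gcgacgtgcaag", "aggcacgca"))
# (15, 'gcgacgtgca', 'gc-ac--gca')
```

The only differences from the global case are the extra 0 in the maximum, the zero borders, and the fact that the result is read from the largest cell rather than the bottom-right corner.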


Example

Example (X = gcgacgtgcaag, Y = aggcacgca; match +3, mismatch −1, gap −2)

       -    a    g    g    c    a    c    g    c    a
  -    0    0    0    0    0    0    0    0    0    0
  g    0    0    3    3    1    0    0    3    1    0
  c    0    0    1    2    6    4    3    1    6    4
  g    0    0    3    4    4    5    3    6    4    5
  a    0    3    1    2    3    7    5    4    5    7
  c    0    1    2    0    5    5   10    8    7    5
  g    0    0    4    5    3    4    8   13   11    9
  t    0    0    2    3    4    2    6   11   12   10
  g    0    0    3    5    3    3    4    9   10   11
  c    0    0    1    3    8    6    6    7   12   10
  a    0    3    1    1    6   11    9    7   10   15
  a    0    3    2    0    4    9   10    8    8   13
  g    0    1    6    5    3    7    8   13   11   11

X  g c g a c g t g c a
Y  g c - a c - - g c a


Complexity

Remarks
- local alignment with Smith and Waterman has a complexity $O(p \times q)$ both in time and space
- a slightly more complex algorithm has a complexity $O(p \times q)$ in time but only $O(\min(p, q))$ in space
⇒ too slow for massive sequence comparisons (complete genomes, databases, ...): we need heuristics!
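The space saving is easiest to see when only the score is needed: each row of the table depends only on the previous row, so two rows suffice. The sketch below makes that point and nothing more. The name `local_score_linear_space` is hypothetical, and the slide does not say which algorithm it has in mind; recovering the alignment itself in linear space needs a further divide-and-conquer step (as in Hirschberg's algorithm), which is not shown here.

```python
def local_score_linear_space(x, y, match=3, mismatch=-1, gap=-2):
    """Best local alignment score using O(min(len(x), len(y))) memory."""
    if len(y) > len(x):
        x, y = y, x            # scoring is symmetric: keep the short one as the row
    prev = [0] * (len(y) + 1)  # previous row of the table
    best = 0
    for a in x:
        curr = [0]             # first column is always 0
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            curr.append(max(0,
                            prev[j - 1] + s,
                            prev[j] + gap,
                            curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr            # the older row is no longer needed
    return best


print(local_score_linear_space("gcgacgtgcaag", "aggcacgca"))  # 15
```

The running time is unchanged at O(p × q), which is why the slide still concludes that heuristics are needed for large-scale comparisons.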


BLAST, FASTA, ...

Heuristic to compare a query to a database
1: pre-process the database to speed up hit search (once per database) and compute λ and K (for Gumbel approximations)
2: scan the database for query hits (e.g. the same word of length 7 appears both in the database and in the query)
3: try to combine hits together
4: perform a Smith and Waterman algorithm in the vicinity of the hits
5: assess the significance of the result with the Gumbel approximation
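Steps 1 and 2 can be illustrated with a plain word index. This is a toy sketch of the idea of exact word hits only; it is not how BLAST or FASTA are actually implemented, and the function names `build_word_index` and `find_hits`, as well as the word length k = 3 chosen for the short example, are assumptions made for illustration.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Step 1: map every word of length k to its start positions in the database."""
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Step 2: list (query position, database position) pairs sharing a word."""
    hits = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        for dpos in index.get(word, []):
            hits.append((qpos, dpos))
    return hits


index = build_word_index("gcgacgtgcaag", k=3)
print(find_hits("aggcacgca", index, k=3))  # [(2, 7), (4, 3), (6, 7)]
```

The index is built once per database and then reused for every query, so the expensive dynamic programming of step 4 only runs near the few positions where a hit was found.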


Summary

Dynamic programming
- requires a recurrence relation
- trades memory for speed
- many applications

Local score of one sequence
- an elegant replacement for sliding window methods
- strong connection with HMMs
- Gumbel approximations are not very reliable

Alignment of sequences
- suitable to compare biological sequences
- choice of the scoring function
- only heuristics are used in practice


