Lecture 3: Dynamic Programming and Alignment - Main Algorithms ...

Lecture 3: Dynamic Programming and Alignment
Main Algorithms with Applications in Bioinformatics

Grégory Nuel
Laboratoire Statistique et Génome
University of Evry, CNRS (8071), INRA (1152), France

Bioinformatics and Comparative Genome Analysis, Institut Pasteur of Tunis, from March 18 to April 6, 2007

Outline

1 Dynamic Programming
    The Principle
    Application: Longest Common Subsequence
    Application: Viterbi
2 Local Score of One Sequence
    Motivation and Notation
    Computing the Local Score
    Assessing the Significance
3 Alignment of Sequences
    Global Alignment
    Local Alignment
    Heuristics

1 Dynamic Programming

Computing n!

Definition (n!)
For all n ≥ 1 we have n! = 1 × 2 × ... × n, and 0! = 1. Hence 1! = 1, 2! = 1 × 2 = 2, 3! = 1 × 2 × 3 = 6, ...

function factorial(n)
    r = 1
    for i = 2 ... n do
        r = r × i
    return r

⇒ O(n) to compute n!, hence O(n^2) to compute 1!, 2!, 3!, ..., n!.

Computing n!

function smartfactorial(n)
    static n_top = 3 and static r[33] = {1, 1, 2, 6}
    if n > 32 then
        we need more memory
    else
        while n_top < n do
            n_top = n_top + 1
            r[n_top] = r[n_top − 1] × n_top
        return r[n]

⇒ O(n) to compute 1!, 2!, 3!, ..., n!, but we need O(33) in space.
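The smartfactorial idea can be sketched in Python as follows. This is only an illustration: the class name FactorialTable is a hypothetical choice, and a growable list replaces the fixed table of 33 entries, so the "we need more memory" branch disappears.

```python
class FactorialTable:
    """Memorizes every factorial computed so far (dynamic programming)."""

    def __init__(self):
        # r[n] == n! for n = 0..3, the same starting table as in the slide
        self.r = [1, 1, 2, 6]

    def factorial(self, n):
        if n < 0:
            raise ValueError("n must be non-negative")
        # Extend the table only as far as needed, reusing r[n_top - 1].
        while len(self.r) <= n:
            n_top = len(self.r)
            self.r.append(self.r[n_top - 1] * n_top)
        return self.r[n]


table = FactorialTable()
print([table.factorial(n) for n in range(8)])
```

Each call only pays for the entries not yet in the table, which is why computing 1!, 2!, ..., n! in sequence costs O(n) multiplications overall.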

What is Dynamic Programming?

Dynamic programming is a method to solve a problem (like computing n!) using the solutions of subproblems (like the values of 1!, 2!, ..., (n − 1)!), which takes much less time than naive approaches.

Such an approach usually relies on:
- a recurrence relation (like n! = (n − 1)! × n)
- a data structure to memorize the solutions of the subproblems (thus more space is required than with naive approaches)

Example (Problems solved by dynamic programming)
- n factorial, Fibonacci numbers, ...
- longest common subsequence
- Viterbi algorithm
- local score of one sequence, alignment of sequences, ...


Longest Common Subsequence

The problem
Given two sequences $X_1^p = X_1 \ldots X_p$ and $Y_1^q = Y_1 \ldots Y_q$ over the same alphabet $\mathcal{A}$, we want to find their Longest Common Subsequence $\mathrm{LCS}(X_1^p, Y_1^q)$.

Example: LCS(XMJYAUZ, MZJAWXU) = MJAU

The recurrence relation
For all $i \le p$ and $j \le q$ we have
\[
\mathrm{LCS}(X_1^i, Y_1^j) =
\begin{cases}
\emptyset & \text{if } i \le 0 \text{ or } j \le 0 \\
\mathrm{LCS}(X_1^{i-1}, Y_1^{j-1}) + X_i & \text{if } X_i = Y_j \\
\text{the longest of } \mathrm{LCS}(X_1^{i-1}, Y_1^{j}) \text{ and } \mathrm{LCS}(X_1^{i}, Y_1^{j-1}) & \text{otherwise}
\end{cases}
\]
where + is the concatenation symbol.

The algorithm

Length
    initialize (L_{i,j}), 0 ≤ i ≤ p, 0 ≤ j ≤ q, to 0
    for i = 1 ... p do
        for j = 1 ... q do
            if X_i = Y_j then
                L_{i,j} = L_{i−1,j−1} + 1
            else
                L_{i,j} = max(L_{i−1,j}, L_{i,j−1})
    return L_{p,q}

Traceback
    find a "path" connecting L_{0,0} to L_{p,q}
    return the concatenation of the matching letters of this path
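The length computation and the traceback can be sketched in Python as below. The function name lcs and the tie-breaking rule in the traceback (go up when both neighbours are equal) are illustrative choices, not part of the lecture.

```python
def lcs(x, y):
    """Return one longest common subsequence of the strings x and y."""
    p, q = len(x), len(y)
    # L[i][j] = length of an LCS of x[:i] and y[:j]
    L = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Traceback from L[p][q] towards L[0][0], collecting matching letters.
    letters = []
    i, j = p, q
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("XMJYAUZ", "MZJAWXU"))
```

Both the table filling and the storage are O(p × q); the traceback itself only takes O(p + q) steps.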

A simple example

Example (X_1^7 = XMJYAUZ and Y_1^7 = MZJAWXU)

     j    0  1  2  3  4  5  6  7
i            M  Z  J  A  W  X  U
0         0  0  0  0  0  0  0  0
1    X    0  0  0  0  0  0  1  1
2    M    0  1  1  1  1  1  1  1
3    J    0  1  1  2  2  2  2  2
4    Y    0  1  1  2  2  2  2  2
5    A    0  1  1  2  3  3  3  3
6    U    0  1  1  2  3  3  3  4
7    Z    0  1  2  2  3  3  3  4



Best hidden path of a HMM

The Problem
How to compute the best path $s^* = \arg\max_s P_\theta(S = s \mid X = x)$ of a HMM?

Proposition
If we define
\[ Z_i(u) = \max_{s_1, \ldots, s_{i-1}} P_\theta(S_1 = s_1, \ldots, S_{i-1} = s_{i-1}, S_i = u \mid X = x) \]
then we have the following recurrence relation:
\[ Z_i(u) = \max_{t \in \mathcal{S}} Z_{i-1}(t)\, \nu(t, u)\, \mu_u(X_i) \]

Viterbi Algorithm for a M0M1 model

The log-likelihood of s*
    initialize L_1(t) = log µ_0(t) + log µ_t(x_1) for all t ∈ S
    for i = 2 ... l do
        for all u ∈ S do
            L_i(u) = max_{t∈S} ( L_{i−1}(t) + log ν(t, u) + log µ_u(x_i) )
            T_{i−1}(u) = argmax_{t∈S} ( L_{i−1}(t) + log ν(t, u) + log µ_u(x_i) )
    return L_l

Traceback
    s*_l = argmax_{t∈S} L_l(t)
    for i = l − 1 ... 1 do
        s*_i = T_i(s*_{i+1})

A simple example (1)

Example (S = {1, 2, 3} and A = {a, c, g, t})
ν gives the transitions between hidden states:
\[ N = \log \nu = \begin{pmatrix} 3.0 & -1.0 & -1.0 \\ 1.6 & 2.5 & 1.6 \\ -1.0 & -1.0 & 3.0 \end{pmatrix} \]
and µ_t gives the distribution of the letter X_i if S_i = t:
\[ M = \log \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} = \begin{pmatrix} 2.0 & 1.5 & 2.5 & 1.0 \\ 1.5 & 2.0 & 1.0 & 2.5 \\ 1.9 & 1.9 & 1.9 & 1.9 \end{pmatrix} \]

A simple example (2)

Example (S = {1, 2, 3} and A = {a, c, g, t})
x = ctagacc and (as a simplification) we assume that L_1(t) = M(t, c). We hence get the following table:

             c      t      a      g      a      c      c
L_i(1)      1.5    5.5   11.1   16.1   21.6   26.1   30.6
L_i(2)      2.0    7.0   10.5   14.5   18.0   22.6   27.4
L_i(3)      1.9    6.8   11.7   16.6   21.5   26.4   31.3

Each value is reached from the same state at the previous position, except L_3(1), reached from state 2, L_6(2), reached from state 1, and L_7(2), reached from state 3.
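A minimal Python sketch of the Viterbi recursion and traceback, run on this example. The function name viterbi and its argument layout are illustrative. The slide does not label the columns of M, so the order g, c, a, t used below is an assumption (it reproduces the first columns of the table); with another order the numbers change.

```python
def viterbi(x, log_init, log_trans, log_emit):
    """Best hidden path in log space.

    log_init[t]     : score of state t at the first position (x[0] included)
    log_trans[t][u] : log nu(t, u)
    log_emit[u][a]  : log mu_u(a)
    Returns (best final score, best path with states numbered from 1).
    """
    n_states = len(log_trans)
    L = [list(log_init)]
    T = []  # T[i - 1][u] = best predecessor of state u at position i
    for a in x[1:]:
        prev, row, back = L[-1], [], []
        for u in range(n_states):
            t_best = max(range(n_states), key=lambda t: prev[t] + log_trans[t][u])
            row.append(prev[t_best] + log_trans[t_best][u] + log_emit[u][a])
            back.append(t_best)
        L.append(row)
        T.append(back)

    # Traceback from the best final state.
    s = max(range(n_states), key=lambda t: L[-1][t])
    best = L[-1][s]
    path = [s]
    for back in reversed(T):
        s = back[s]
        path.append(s)
    path.reverse()
    return best, [s + 1 for s in path]


N = [[3.0, -1.0, -1.0], [1.6, 2.5, 1.6], [-1.0, -1.0, 3.0]]
# ASSUMPTION: the columns of M are read in the order g, c, a, t.
M_rows = [[2.0, 1.5, 2.5, 1.0], [1.5, 2.0, 1.0, 2.5], [1.9, 1.9, 1.9, 1.9]]
M = [dict(zip("gcat", row)) for row in M_rows]

x = "ctagacc"
init = [M[t][x[0]] for t in range(3)]  # simplification L_1(t) = M(t, c)
score, path = viterbi(x, init, N, M)
print(round(score, 1), path)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long sequences.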

2 Local Score of One Sequence

Homogeneous Regions in Biological Sequences

Example (gc rich regions in DNA)
X = aaagaaagggcacacagccagaaataattttctt
Is there one (or more) gc rich region in this DNA sequence?

Example (hydrophobic regions in proteins)
X = YVPISMYCLQWLLPVLLIPKPLNWSDGVAS
T, I, L, M, F, W and C are very hydrophobic amino acids. Is there any hydrophobic region in this protein?

Sliding Window

Idea
Given a window size h, for each i = 1 ... l − h + 1, score the feature of interest (gc content, hydrophobic content) in the sliding window [i, i + h − 1].

Example (with h = 5 and score = frequency)
Each number is the count in the window of 5 letters starting at the letter above it.

a a a g a a a g g g c a c a c a g c c a g a a a t a a t t t t c t t
1 1 1 2 2 3 4 4 4 3 3 2 3 3 4 3 4 3 2 1 1 0 0 0 0 0 0 1 1 1 - - - -

Y V P I S M Y C L Q W L L P V L L I P K P L N W S D G V A S
1 2 2 3 3 3 3 4 4 3 3 3 3 3 3 3 2 2 1 2 2 2 1 1 0 0 - - - -
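The sliding window scores can be computed in linear time by updating the previous window instead of recounting it. The sketch below assumes a simple count of the letters of interest; the function name sliding_window_counts is illustrative.

```python
def sliding_window_counts(x, letters, h):
    """Count, for each window x[i:i+h], the letters that belong to `letters`."""
    flags = [1 if c in letters else 0 for c in x]
    if len(x) < h:
        return []
    counts = [sum(flags[:h])]
    # Slide by one position: add the entering letter, drop the leaving one.
    for i in range(h, len(x)):
        counts.append(counts[-1] + flags[i] - flags[i - h])
    return counts


dna = "aaagaaagggcacacagccagaaataattttctt"
print(sliding_window_counts(dna, "gc", 5))

protein = "YVPISMYCLQWLLPVLLIPKPLNWSDGVAS"
print(sliding_window_counts(protein, "TILMFWC", 5))
```

The result has l − h + 1 entries, one per window start.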

Another Approach

Remarks
The sliding window method is very simple to understand and has a low linear complexity, but it suffers from several drawbacks:
- how to choose the window size h?
- where exactly are the limits of our regions of interest?
- how to choose the scoring function?
⇒ we need another approach!

Definition (Local Score)
Given a scoring function $S : \mathcal{A} \to \mathbb{R}$, the local score H of the sequence $X = X_1 \ldots X_l$ is defined by:
\[ H = \max_{1 \le i \le j \le l} \sum_{k=i}^{j} S(X_k) \]

Examples

In each example the highlighted segment is set apart by spaces.

Example (S([gc]) = +1 and S([at]) = −1)
X = aaagaaa gggcacacagccag aaataattttctt

Example (S([gc]) = +1 and S([at]) = −2)
X = aaagaaa gggc acacagccagaaataattttctt

Example (S([TILMFWC]) = +1 and S([^TILMFWC]) = −1)
X = YVP ISMYCLQWLLPVLLIPKPLNW SDGVAS

Example (S([TILMFWC]) = +1 and S([^TILMFWC]) = −2)
X = YVPISMY CLQWLL PVLLIPKPLNWSDGVAS


Brute Force has a Cubic Complexity

Brute force
Score each of the possible segments and pick the best score:
- l segments of length 1
- l − 1 segments of length 2
- ...
- 1 segment of length l
⇒ the resulting complexity is O(l^3)

Proposition
If we denote by H_i the best score of a (possibly empty) segment ending at position i of X_1 ... X_l, then we have the following recurrence relation:
    H_0 = 0 and H_i = max(0, H_{i−1} + S(X_i)) for all 1 ≤ i ≤ l

Linear Algorithm

Algorithm
    H_0 = 0
    for i = 1 ... l do
        H_i = max(0, H_{i−1} + S(X_i))
    return H = max_{1 ≤ i ≤ l} H_i

⇒ complexity is O(l) in space and time

Example (S([gc]) = +1 and S([at]) = −1)

X_i  - a a a g a a a g g g c a c a c a g c c a g a a a t a a t t t t c t t
H_i  0 0 0 0 1 0 0 0 1 2 3 4 3 4 3 4 3 4 5 6 5 6 5 4 3 2 1 0 0 0 0 0 1 0 0

We get H = 6 and can easily find a corresponding segment: going back from a position where H_i = 6 to the last preceding zero gives X_8 ... X_19 = gggcacacagcc, or X_8 ... X_21 = gggcacacagccag.
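A Python sketch of the linear algorithm, extended to also report one best segment. The bookkeeping of the segment start (reset whenever the running score falls back to 0) and the names local_score and gc_score are illustrative additions; when several segments reach the maximum, the first one is returned.

```python
def local_score(x, score):
    """Return (H, start, end) such that x[start:end] is a best-scoring segment."""
    best, best_start, best_end = 0, 0, 0
    h, start = 0, 0
    for i, letter in enumerate(x):
        h = max(0, h + score(letter))
        if h == 0:
            start = i + 1  # a new segment can only start after position i
        elif h > best:
            best, best_start, best_end = h, start, i + 1
    return best, best_start, best_end


def gc_score(letter, penalty=-1):
    """+1 for g or c, `penalty` for any other letter."""
    return 1 if letter in "gc" else penalty


dna = "aaagaaagggcacacagccagaaataattttctt"
H, start, end = local_score(dna, gc_score)
print(H, dna[start:end])
```

Only the current value of the running score is needed, so this variant uses O(1) extra space instead of storing every H_i.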

Interest of the Local Score Approach

Remarks
- simple and efficient linear algorithm
- directly points out the segments of interest
- can be used with complex scoring functions (e.g. the Kyte-Doolittle hydrophobicity scale)
- all suboptimal segments can be found in O(l) thanks to the algorithm of Ruzzo and Tompa (1999)
⇒ a far more elegant approach than sliding windows for a wide range of problems encountered in sequence analysis.

Connection with HMM (1)

M0M1 HMM model
S = {1, 2, 3}; the transitions between hidden states are given by:

    start → 1 → 2 → 3 → stop

    state 1: stays in 1 with probability 1 − p, moves to 2 with probability p
    state 2: stays in 2 with probability 1 − q, moves to 3 with probability q
    state 3: stays in 3 with probability 1 − p, moves to stop with probability p

and for all a ∈ A we have
    P(X_i = a | S_i = 1) = µ(a)
    P(X_i = a | S_i = 2) = ν(a)
    P(X_i = a | S_i = 3) = µ(a)

Connection with HMM (2)

Proposition
With θ = (p, q, µ, ν) we have:
\[ L(\theta \mid X, S) = l \log(1-p) - 2 \log\frac{1-p}{p} - \log\frac{1-q}{q} + \sum_{i=1}^{l} \log \mu(X_i) + \sum_{i=1}^{l} I_{S_i = 2} \log \frac{(1-q)\,\nu(X_i)}{(1-p)\,\mu(X_i)} \]
hence with the scoring function
\[ S(a) = \log \frac{(1-q)\,\nu(a)}{(1-p)\,\mu(a)} \quad \forall a \in \mathcal{A} \]
we get
\[ L(\theta \mid X, S = s^*) = H + \text{constant} \]
where s* is the Viterbi path and H is the local score.

Connection with HMM (3)

Remarks
- The local score is totally equivalent to a M0M1 model.
- It is hence possible to use HMM parameter estimation to estimate the scoring function.
- HMM provides a natural path to extend the local score:
    - best score on two segments (or three, or four, ...)
    - dependent scoring function
    - generalized HMM to avoid a geometric distribution for the segment lengths
    - ...


Notion of p-value

Problem
Using S([gc]) = +1 and S([at]) = −1, how should we interpret H(X) = H(Y) if we assume either of:
    A1: X is 10 times shorter than Y
    A2: P_X([gc]) = 90% while P_Y([gc]) = 20%
    A3: we have both A1 and A2
⇒ need of p-values

Definition (p-value)
If we assume that X = X_1 ... X_l is generated according to a random model (e.g. M0, M1, ...), then the p-value of an observed result H_obs is given by:
    p-value = P(H ≥ H_obs)

Asymptotic Approximations

Proposition (Iglehart, 1972 and Karlin et al., 1990)
\[ H \sim \mathrm{Gumbel}\left( \frac{\log l + \log K}{\lambda} \,;\, \frac{1}{\lambda} \right) \]
which means that
\[ P\left( H \ge \frac{\log l}{\lambda} + u \right) \simeq 1 - \exp\left( -K e^{-\lambda u} \right) \]

Two ways to compute K and λ:
- analytically in the M0 case (λ is easy, K requires more complex computations)
- by simulations, using
\[ \log\left( -\log P(H < x) \right) \simeq -\lambda x + \log K + \log l \]
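Once K and λ are known, the approximation is a one-line computation. The sketch below substitutes u = x − log(l)/λ in the formula above, which gives P(H ≥ x) ≃ 1 − exp(−K l e^{−λx}). The function name gumbel_p_value and the numerical values of K and λ are made up for illustration; they are not estimates for any real model.

```python
import math


def gumbel_p_value(h_obs, length, K, lam):
    """Approximate P(H >= h_obs) for a sequence of the given length."""
    rate = K * length * math.exp(-lam * h_obs)
    # 1 - exp(-rate), computed with expm1 to stay accurate for tiny rates
    return -math.expm1(-rate)


# K and lam below are illustrative values only.
for h_obs in (8, 12, 16):
    print(h_obs, gumbel_p_value(h_obs, length=1000, K=0.1, lam=0.8))
```

For large scores the rate is small and the p-value is close to K l e^{−λ h_obs}.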

Exact Computations

Remark
- once λ and K are computed, asymptotic approximations are very fast to compute
- these approximations are only valid for large l

Proposition (Mercier & Daudin 2001, Nuel 2006)
One can use FMCI (finite Markov chain imbedding) to compute an exact p-value for a Mm model with a complexity
\[ O\left( 10^n \times k^{m+1} \times H_{obs} \times \log l \right) \]
where n is the number of digits in the scoring function and k the alphabet size.

Gumbel approximations vs exact p-values (1)

Example (Kyte-Doolittle hydrophobic scale on SwissProt)
[figure]

Gumbel approximations vs exact p-values (2)

- asymptotic approximations can be completely wrong (for 99.5% of the data in our example)
- exact computations take much more time (in our example, 20 exact p-values are computed per second)
- there is a program called pLocalScore that performs both the asymptotic approximations and the exact computations
⇒ advice: use exact computations rather than asymptotic approximations as often as possible

3 Alignment of Sequences

How to compare sequences?

Problem
How to compare two biological sequences (DNA, proteins) X = X_1 ... X_p and Y = Y_1 ... Y_q? Should we:
- compare their respective lengths?
- compare their compositions (letters, words of size 2, 3, ...)?
- look for repetitions?
- ...
⇒ What about their proximity in the evolution process?

Mutation and Indel

Definition
During the evolution of a biological sequence, a letter that changes is called a mutation, and we call indel either the insertion of a letter or the deletion of a letter.

Example (a DNA sequence)
Each row is the sequence after one more step of evolution; | marks an unchanged position and • marks the position where an event occurs.

a c c g t t a c a a g a c a
| | | | | | | | | | | | | |
a c c g t t a c a a g a c a
| | | | | | | | | | | | | |
a c c g t t a c a a g a c a
| | | | | | | | | • | | | |      mutation (a → t)
a c c g t t a c a t g a c a
| | • | | | | | | | | | | |      deletion (c)
a c g t t a c a t g a c a
| | | | | | • | | | | | | |      insertion (g)
a c g t t a g c a t g a c a

What is an alignment?

Example (Two DNA sequences)
X = a c g t a g c a t g a c a
Y = a c c g t a c a a g c a

We denote by A a common ancestral sequence:
A   a c c g t t a c a a g a c a
X   a c g t a g c a t g a c a
Y   a c c g t a c a a g c a

Here is the alignment we get:
X̃   a c - g t - a g c a t g a c a
Ỹ   a c c g t - a - c a a g - c a

Scoring Alignments

Definition (Score of an Alignment)
Using the scoring function $S : (\mathcal{A} \cup \{-\}) \times (\mathcal{A} \cup \{-\}) \to \mathbb{R}$, we define the score of an alignment as the sum of the scoring function over all the columns of the alignment.

Example (S(match) = +1, S(mismatch) = −1, S(gap) = −2)
The first alignment scores 8 × 1 − 0 × 1 − 9 × 2 = −10:
X̃   a c - g t a - - - g c a t g a c a
Ỹ   a c c g t a c a a g c a - - - - -
and the second alignment scores 10 × 1 − 1 × 1 − 3 × 2 = 3:
X̃   a c - g t a g c a t g a c a
Ỹ   a c c g t a - c a a g - c a
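Scoring a given alignment is a simple sum over its columns. The sketch below assumes the match / mismatch / gap scheme of the example; the function name alignment_score is illustrative.

```python
def alignment_score(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two aligned rows ('-' denotes a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The two alignments of the example, written without spaces.
print(alignment_score("ac-gta---gcatgaca", "accgtacaagca-----"))
print(alignment_score("ac-gtagcatgaca", "accgta-caag-ca"))
```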

How to find the best alignment?

Brute force
Score individually all possible alignments and pick the best one.
⇒ How many possible alignments?

Proposition
If N(p, q) is the number of alignments between two sequences of lengths p and q, we have the following recurrence relation:
    N(1, q) = 2q + 1    N(p, 1) = 2p + 1
    N(p, q) = N(p, q − 1) + N(p − 1, q) + N(p − 1, q − 1)
for all p, q > 1.

Number of Alignments

Example
         q = 1   q = 2   q = 3   q = 4   q = 5   q = 6   q = 7
p = 1        3       5       7       9      11      13      15
p = 2        5      13      25      41      61      85     113
p = 3        7      25      63     129     231     377     575
p = 4        9      41     129     321     681    1289    2241
p = 5       11      61     231     681    1683    3653    7183

Approximation
Idea: $N(p, q) \sim \rho^{p+q}$; we hence get
\[ \rho^2 - 2\rho - 1 = 0 \Rightarrow \rho = 1 + \sqrt{2} \]
This gives us:
    N(20, 20) ≃ 10^15    N(100, 100) ≃ 10^76    N(1000, 1000) ≃ 10^764
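The recurrence for N(p, q) is itself a small dynamic programming problem. The sketch below uses the convention N(i, 0) = N(0, j) = 1 (an alignment with an empty sequence contains only gaps), which is an assumption added here; it reproduces the boundary values N(1, q) = 2q + 1 and N(p, 1) = 2p + 1.

```python
def count_alignments(p, q):
    """Number of alignments N(p, q) of two sequences of lengths p and q."""
    # N[i][0] = N[0][j] = 1: aligning with an empty sequence uses only gaps.
    N = [[1] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            N[i][j] = N[i][j - 1] + N[i - 1][j] + N[i - 1][j - 1]
    return N[p][q]


for p in range(1, 6):
    print([count_alignments(p, q) for q in range(1, 8)])
```

Python integers have arbitrary precision, so the same function can be used to look at the growth for much longer sequences.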

Dynamic Programming

Needleman and Wunsch (1970)
We denote by B(i, j) the best score of an alignment of X_1 ... X_i and Y_1 ... Y_j and we get

    B(0, 0) = 0
    B(i, 0) = Σ_{k=1}^{i} S(X_k, −) and B(0, j) = Σ_{k=1}^{j} S(−, Y_k)
    for i = 1 ... p do
        for j = 1 ... q do
            B(i, j) = max { B(i − 1, j − 1) + S(X_i, Y_j),
                            B(i − 1, j) + S(X_i, −),
                            B(i, j − 1) + S(−, Y_j) }
    return B(p, q) and use a traceback to find the alignment

Example

Example (X = gcgacgtgcaag, Y = aggcacgca; match +3, mismatch −1, gap −2)

        -    a    g    g    c    a    c    g    c    a
   -    0   -2   -4   -6   -8  -10  -12  -14  -16  -18
   g   -2   -1    1   -1   -3   -5   -7   -9  -11  -13
   c   -4   -3   -1    0    2    0   -2   -4   -6   -8
   g   -6   -5    0    2    0    1   -1    1   -1   -3
   a   -8   -3   -2    0    1    3    1   -1    0    2
   c  -10   -5   -4   -2    3    1    6    4    2    0
   g  -12   -7   -2   -1    1    2    4    9    7    5
   t  -14   -9   -4   -3   -1    0    2    7    8    6
   g  -16  -11   -6   -1   -3   -2    0    5    6    7
   c  -18  -13   -8   -3    2    0    1    3    8    6
   a  -20  -15  -10   -5    0    5    3    1    6   11
   a  -22  -17  -12   -7   -2    3    4    2    4    9
   g  -24  -19  -14   -9   -4    1    2    7    5    7

Y   a g g c - a c g - - c a - -
X   - g - c g a c g t g c a a g
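A Python sketch of the Needleman and Wunsch recursion with a traceback, assuming a simple match / mismatch / gap scoring scheme. The function name and the tie-breaking order in the traceback are illustrative choices, so the returned alignment may differ from another alignment with the same score.

```python
def needleman_wunsch(x, y, match=3, mismatch=-1, gap=-2):
    """Return (best global score, aligned x, aligned y)."""
    p, q = len(x), len(y)
    B = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        B[i][0] = i * gap
    for j in range(1, q + 1):
        B[0][j] = j * gap
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            B[i][j] = max(B[i - 1][j - 1] + s, B[i - 1][j] + gap, B[i][j - 1] + gap)

    # Traceback from B[p][q] to B[0][0].
    ax, ay = [], []
    i, j = p, q
    while i > 0 or j > 0:
        diagonal = i > 0 and j > 0
        s = match if diagonal and x[i - 1] == y[j - 1] else mismatch
        if diagonal and B[i][j] == B[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and B[i][j] == B[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return B[p][q], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("gcgacgtgcaag", "aggcacgca")
print(score)
print(ax)
print(ay)
```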


Limits of Global Alignment

Remarks
Global alignment is not suitable to detect:
- overlaps (e.g. the end of X and the beginning of Y are the same)
- insertions (e.g. X is included in Y)
- all kinds of local similarities

Definition (local alignment)
The local alignment of two sequences is a global alignment of two segments of these sequences:
\[ H = \max_{1 \le i \le i' \le p,\; 1 \le j \le j' \le q} B(X_i \ldots X_{i'}, Y_j \ldots Y_{j'}) \]

Algorithm

Smith and Waterman (1981)
We denote by H(i, j) the best score of a local alignment of X_1 ... X_i and Y_1 ... Y_j and we get

    H(i, 0) = H(0, j) = 0 for all i, j
    for i = 1 ... p do
        for j = 1 ... q do
            H(i, j) = max { H(i − 1, j − 1) + S(X_i, Y_j),
                            H(i − 1, j) + S(X_i, −),
                            H(i, j − 1) + S(−, Y_j),
                            0 }
    return max_{i,j} H(i, j) and use a traceback to find the alignment

Example

Example (X = gcgacgtgcaag, Y = aggcacgca; match +3, mismatch −1, gap −2)

       -   a   g   g   c   a   c   g   c   a
   -   0   0   0   0   0   0   0   0   0   0
   g   0   0   3   3   1   0   0   3   1   0
   c   0   0   1   2   6   4   3   1   6   4
   g   0   0   3   4   4   5   3   6   4   5
   a   0   3   1   2   3   7   5   4   5   7
   c   0   1   2   0   5   5  10   8   7   5
   g   0   0   4   5   3   4   8  13  11   9
   t   0   0   2   3   4   2   6  11  12  10
   g   0   0   3   5   3   3   4   9  10  11
   c   0   0   1   3   8   6   6   7  12  10
   a   0   3   1   1   6  11   9   7  10  15
   a   0   3   2   0   4   9  10   8   8  13
   g   0   1   6   5   3   7   8  13  11  11

X   g c g a c g t g c a
Y   g c - a c - - g c a
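A Python sketch of the Smith and Waterman recursion, again assuming a match / mismatch / gap scheme. The traceback starts from the first cell that reaches the maximum and stops at the first cell with score 0; the function name and the tie-breaking order are illustrative choices.

```python
def smith_waterman(x, y, match=3, mismatch=-1, gap=-2):
    """Return (best local score, aligned segment of x, aligned segment of y)."""
    p, q = len(x), len(y)
    H = [[0] * (q + 1) for _ in range(p + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = smith_waterman("gcgacgtgcaag", "aggcacgca")
print(score)
print(ax)
print(ay)
```

The only differences with the global version are the extra 0 in the maximum, the zero boundary conditions, and the start and stop rules of the traceback.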


Complexity

Remarks
- local alignment with Smith & Waterman has a complexity O(p × q) both in time and space
- a slightly more complex algorithm has a complexity O(p × q) in time but only O(min(p, q)) in space
⇒ too slow for massive sequence comparisons (complete genomes, databases, ...): we need heuristics!

BLAST, FASTA, ...

Heuristic to compare a query to a database
    1: pre-process the database to speed up the hit search (once per database) and compute λ and K (for Gumbel approximations)
    2: scan the database for query hits (e.g. the same word of length 7 appears both in the database and in the query)
    3: try to combine hits together
    4: perform a Smith and Waterman algorithm in the vicinity of the hits
    5: assess the significance of the result with the Gumbel approximation
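Steps 1 and 2 (indexing the database, then scanning for shared words) can be illustrated with a toy Python sketch. This is not how BLAST or FASTA are implemented; the function names index_words and find_hits, the dictionary-based index, and the word length 3 used on the short example are all illustrative assumptions.

```python
from collections import defaultdict


def index_words(database, w):
    """Map every word of length w to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(database) - w + 1):
        index[database[pos:pos + w]].append(pos)
    return index


def find_hits(query, index, w):
    """Return (query position, database position) pairs sharing a word."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for dpos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, dpos))
    return hits


database = "gcgacgtgcaag"
index = index_words(database, w=3)  # done once per database
print(find_hits("aggcacgca", index, w=3))
```

The index is built once and then reused for every query, which is what makes the hit search much cheaper than a full dynamic programming comparison.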

Summary

Dynamic programming
- need of a recurrence relation
- trade memory for speed
- many applications

Local score of one sequence
- an elegant replacement for sliding window methods
- strong connection with HMM
- Gumbel approximations are not very reliable

Alignment of sequences
- suitable to compare biological sequences
- choice of the scoring function
- only heuristics are used in practice
