Sequence Comparison.pdf

6.2 Basic Formulas on Hit Probability

$s[i]\,s[k+i]\,s[2k+i] \cdots s[(l-1)k+i]$, $i = r+1, r+2, \ldots, k-1$, where $l = \lfloor n/k \rfloor$ and $r = n - kl - 1$. Because $\pi$ hits the first $r+1$ sequences with probability $\Pi_{l+1}$ and the last $k-1-r$ sequences with probability $\Pi_l$, we have that

\[
\bar{\Pi}'_n = \left(\bar{\Pi}_{l+1}\right)^{r+1} \left(\bar{\Pi}_l\right)^{k-1-r}. \tag{6.4}
\]

For any $i, j \ge |\pi|$, the events $\bigcap_{t=0}^{i-1} \bar{A}_t$ and $\bigcap_{t=i+|\pi|-1}^{i+j-1} \bar{A}_t$ are independent, and $\bigcap_{t=0}^{i+j-1} \bar{A}_t$ is a subevent of $\left(\bigcap_{t=0}^{i-1} \bar{A}_t\right) \cap \left(\bigcap_{t=i+|\pi|-1}^{i+j-1} \bar{A}_t\right)$. Hence,

\[
\bar{\Pi}_i \bar{\Pi}_j
= \Pr\left[\left(\bigcap_{t=0}^{i-1} \bar{A}_t\right) \cap \left(\bigcap_{t=i+|\pi|-1}^{i+j-1} \bar{A}_t\right)\right]
> \Pr\left[\bigcap_{t=0}^{i+j-1} \bar{A}_t\right]
= \bar{\Pi}_{i+j}.
\]

Hence, formula (6.4) implies that $\bar{\Pi}'_n > \bar{\Pi}_n$, or equivalently $\Pi'_n < \Pi_n$, for any $n \ge |\pi'|$.

6.2.1 A Recurrence System for Hit Probability

We have shown that the non-hit probability of a consecutive seed $\theta$ satisfies equation (6.3). Given a consecutive seed $\theta$ and $n > |\theta|$, it takes linear time to compute the hit probability $\Theta_n$. However, calculating the hit probability for an arbitrary seed is rather complicated. In this section, we generalize the recurrence relation (6.3) to a recurrence system in the general case.

For a spaced seed $\pi$, we set $m = 2^{|\pi| - w_\pi}$. Let $W_\pi$ be the set of all $m$ distinct strings obtained from $\pi$ by filling 0 or 1 in the $*$ positions. For example, for $\pi = 1{*}11{*}1$, $W_\pi = \{101101, 101111, 111101, 111111\}$. The seed $\pi$ hits the random sequence $R$ at position $n-1$ if and only if a unique $W_j \in W_\pi$ occurs at the position. For each $j$, let $B_n^{(j)}$ denote the event that $W_j$ occurs at the position $n-1$. Because $A_n$ denotes the event that $\pi$ hits the sequence $R$ at position $n-1$, we have that $A_n = \bigcup_{1 \le j \le m} B_n^{(j)}$ and the $B_n^{(j)}$'s are disjoint. Setting

\[
\pi_n^{(j)} = \Pr\left[\bar{A}_0 \bar{A}_1 \cdots \bar{A}_{n-2} B_{n-1}^{(j)}\right], \quad j = 1, 2, \ldots, m,
\]

we have

\[
\pi_n = \sum_{1 \le j \le m} \pi_n^{(j)},
\]

and hence formula (6.2) becomes

\[
\bar{\Pi}_n = \bar{\Pi}_{n-1} - \pi_n^{(1)} - \pi_n^{(2)} - \cdots - \pi_n^{(m)}. \tag{6.5}
\]

Recall that, for any $W_j \in W_\pi$ and $a, b$ such that $0 \le a < b \le |\pi| - 1$, $W_j[a,b]$ denotes the substring of $W_j$ from position $a$ to position $b$ inclusively.
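The construction of $W_\pi$ and the inequality $\bar{\Pi}_i \bar{\Pi}_j > \bar{\Pi}_{i+j}$ can be checked by brute force on small cases. The following is a minimal sketch, not the book's algorithm: the function names `expand_seed` and `non_hit_prob` are hypothetical, and the random sequence $R$ is assumed to be an independent 0-1 sequence in which each symbol is 1 with probability `p`. The enumeration is exponential in `n`, so it is only suitable for short sequences.

```python
from itertools import product


def expand_seed(seed):
    """Return W_pi: every string obtained by filling each '*' of the seed with 0 or 1."""
    stars = [i for i, c in enumerate(seed) if c == "*"]
    words = []
    for fill in product("01", repeat=len(stars)):
        chars = list(seed)
        for pos, bit in zip(stars, fill):
            chars[pos] = bit
        words.append("".join(chars))
    return words


def non_hit_prob(seed, n, p):
    """Brute-force non-hit probability of the seed on a random 0-1 sequence of length n.

    Assumption: symbols are independent and equal to 1 with probability p.
    """
    words = set(expand_seed(seed))
    size = len(seed)
    total = 0.0
    for bits in product("01", repeat=n):
        r = "".join(bits)
        # The seed hits r if some word of W_pi occurs as a window of r.
        if not any(r[t:t + size] in words for t in range(n - size + 1)):
            total += p ** r.count("1") * (1 - p) ** r.count("0")
    return total


print(expand_seed("1*11*1"))  # ['101101', '101111', '111101', '111111']
print(non_hit_prob("1*1", 3, 0.5) ** 2 > non_hit_prob("1*1", 6, 0.5))  # True
```

The first call reproduces the set $W_\pi$ given in the text for $\pi = 1{*}11{*}1$; the second compares $\bar{\Pi}_3 \bar{\Pi}_3$ with $\bar{\Pi}_6$ for the seed $1{*}1$.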
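For a string $s$, we use $\Pr[s]$ to denote the probability that $s$ occurs at a position $k \ge |s|$. For any $i$, $j$, and $k$ such that $1 \le i, j \le m$ and $1 \le k \le |\pi|$, we define

\[
p_k^{(ij)} =
\begin{cases}
\Pr\left[W_j[k, |\pi|-1]\right] & \text{if } W_i[|\pi|-k, |\pi|-1] = W_j[0, k-1]; \\
1 & \text{if } k = |\pi| \text{ and } i = j; \\
0 & \text{otherwise.}
\end{cases}
\]

It is easy to see that $p_k^{(ij)}$ is the conditional probability that $W_j$ hits at the position $n+k$ given that $W_i$ hits at position $n$, for any $k < |\pi|$ and $n$.

Theorem 6.1. Let $p_j = \Pr[W_j]$ for $W_j \in W_\pi$ ($1 \le j \le m$). Then, for any $n \ge |\pi|$,

\[
p_j \bar{\Pi}_n = \sum_{i=1}^{m} \sum_{k=1}^{|\pi|} \pi_{n+k}^{(i)}\, p_k^{(ij)}, \quad j = 1, 2, \ldots, m. \tag{6.6}
\]

Proof. For each $1 \le j \le m$,

\[
\begin{aligned}
p_j \bar{\Pi}_n
&= \Pr\left[\bar{A}_0 \bar{A}_1 \cdots \bar{A}_{n-1} B_{n+|\pi|-1}^{(j)}\right] \\
&= \sum_{k=1}^{|\pi|-1} \Pr\left[\bar{A}_0 \bar{A}_1 \cdots \bar{A}_{n+k-2} A_{n+k-1} B_{n+|\pi|-1}^{(j)}\right]
 + \Pr\left[\bar{A}_0 \bar{A}_1 \cdots \bar{A}_{n+|\pi|-2} B_{n+|\pi|-1}^{(j)}\right] \\
&= \sum_{k=1}^{|\pi|-1} \sum_{i=1}^{m} \Pr\left[\bar{A}_0 \bar{A}_1 \cdots \bar{A}_{n+k-2} B_{n+k-1}^{(i)} B_{n+|\pi|-1}^{(j)}\right] + \pi_{n+|\pi|}^{(j)} \\
&= \sum_{k=1}^{|\pi|-1} \sum_{i=1}^{m} \Pr\left[\bar{A}_0 \bar{A}_1 \cdots \bar{A}_{n+k-2} B_{n+k-1}^{(i)}\right] \Pr\left[B_{n+|\pi|-1}^{(j)} \,\middle|\, B_{n+k-1}^{(i)}\right] + \pi_{n+|\pi|}^{(j)} \\
&= \sum_{k=1}^{|\pi|-1} \sum_{i=1}^{m} \pi_{n+k}^{(i)}\, p_k^{(ij)} + \pi_{n+|\pi|}^{(j)} \\
&= \sum_{i=1}^{m} \sum_{k=1}^{|\pi|} \pi_{n+k}^{(i)}\, p_k^{(ij)}.
\end{aligned}
\]

This proves formula (6.6). □

Example 6.3. Let $\pi = 1^a {*} 1^b$, $a \ge b \ge 1$. Then $|\pi| = a+b+1$ and $W_\pi = \{W_1, W_2\} = \{1^a 0 1^b, 1^{a+b+1}\}$. Then we have

\[
\begin{aligned}
p_k^{(11)} &= p^{|\pi|-k-1} q, && k = 1, 2, \ldots, b, \\
p_k^{(11)} &= 0, && k = b+1, b+2, \ldots, |\pi|-1, \\
p_{|\pi|}^{(11)} &= 1,
\end{aligned}
\]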
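The coefficients $p_k^{(ij)}$ and the identity (6.6) can be checked numerically on small seeds. The sketch below is an illustration under stated assumptions rather than the book's procedure: all function names are hypothetical, each symbol of $R$ is assumed to be 1 with probability `p` independently (so `q = 1 - p`), and $\pi_n^{(j)}$ is read as the probability that the first hit in a length-$n$ sequence ends at the last position and is an occurrence of $W_j$. Both $\bar{\Pi}_n$ and $\pi_n^{(j)}$ are obtained by exhaustive enumeration, so only short sequences are practical.

```python
from itertools import product
import math


def expand_seed(seed):
    """W_pi: every string obtained by filling each '*' of the seed with 0 or 1."""
    stars = [i for i, c in enumerate(seed) if c == "*"]
    words = []
    for fill in product("01", repeat=len(stars)):
        chars = list(seed)
        for pos, bit in zip(stars, fill):
            chars[pos] = bit
        words.append("".join(chars))
    return words


def pr(s, p):
    """Pr[s] under the assumed model: each symbol is 1 with probability p."""
    return p ** s.count("1") * (1 - p) ** s.count("0")


def overlap_prob(wi, wj, k, p):
    """p_k^{(ij)} as defined in the text, for 1 <= k <= |pi|."""
    size = len(wi)
    if k == size:
        return 1.0 if wi == wj else 0.0
    # The last k symbols of W_i must equal the first k symbols of W_j.
    return pr(wj[k:], p) if wi[size - k:] == wj[:k] else 0.0


def non_hit_prob(words, n, p):
    """Probability that no word of W_pi occurs in a random sequence of length n."""
    size, wset = len(words[0]), set(words)
    return sum(pr(r, p) for r in map("".join, product("01", repeat=n))
               if not any(r[t:t + size] in wset for t in range(n - size + 1)))


def first_hit_probs(words, n, p):
    """pi_n^{(j)} for each j: the first hit ends at the last position and equals W_j."""
    size, wset = len(words[0]), set(words)
    out = [0.0] * len(words)
    if n < size:
        return out
    for r in map("".join, product("01", repeat=n)):
        if any(r[t:t + size] in wset for t in range(n - size)):
            continue  # a hit ends before the last position
        tail = r[n - size:]
        if tail in wset:
            out[words.index(tail)] += pr(r, p)
    return out


def check_theorem(seed, n, p):
    """Return True if both sides of (6.6) agree for every j (up to rounding)."""
    words, size = expand_seed(seed), len(seed)
    base = non_hit_prob(words, n, p)
    first = {k: first_hit_probs(words, n + k, p) for k in range(1, size + 1)}
    for wj in words:
        lhs = pr(wj, p) * base
        rhs = sum(first[k][i] * overlap_prob(wi, wj, k, p)
                  for k in range(1, size + 1) for i, wi in enumerate(words))
        if not math.isclose(lhs, rhs, rel_tol=1e-9):
            return False
    return True


print(check_theorem("11*1", 4, 0.7))
```

For the seed of Example 6.3 with $a = 3$, $b = 2$, calling `overlap_prob("111011", "111011", k, p)` gives $p^{|\pi|-k-1} q$ for $k = 1, 2$, zero for $k = 3, 4, 5$, and one for $k = 6$, in agreement with the values listed above.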