
information [66]. They can also be applied to areas as diverse as music composer recognition [60] and fire detection [54].

4.3 ALTERNATIVES TO VITERBI DECODING

The Viterbi decoding algorithm is widely used because of its simplicity and efficiency. It is not, however, the only appropriate decoding algorithm for all HMM applications. This section presents several alternative decoding contexts and appropriate algorithms for them.

4.3.1 Maximizing the Number of Correctly Explained States: Posterior Decoding

Posterior decoding focuses on individual positions in the sequence and tries to maximize the probability that they are properly explained. This is in contrast to Viterbi decoding, which computes the globally optimal state path. The simplest posterior decoding question is: which state most likely generated the $i$th symbol of the HMM output? The most probable path is not necessarily helpful in answering this question. Many different state paths in the HMM can generate the same sequence $X$, and at position $i$, it is possible that many of them will agree on the same state. To compute the posterior probability $\Pr(h_i = k \mid X)$ of state $k$ at position $i$, conditioned on the entire sequence $X$, we add the probabilities of all paths that use state $k$ at position $i$. The posterior probability can be decomposed as follows:

$$\Pr(h_i = k \mid X) \;=\; \frac{\sum_{\ell} F_i(k, X) \cdot a_{k,\ell} \cdot B_{i+1}(\ell, X)}{\Pr(X)}, \qquad (4.4)$$

where $F_i(k, X) = \Pr(h_i = k, x_1, \ldots, x_i)$, the probability of generating the first $i$ symbols of $X$ and ending in state $k$, is called the forward probability of state $k$ at position $i$, and $B_{i+1}(\ell, X) = \Pr(x_{i+1}, \ldots, x_n \mid h_{i+1} = \ell)$, the probability of starting in state $\ell$ and generating the rest of the sequence, $x_{i+1}, \ldots, x_n$, is called the backward probability of state $\ell$ at position $i + 1$. Formula (4.4) follows by partitioning the paths through state $k$ at position $i$ according to the state $\ell$ at position $i + 1$ and applying the Markov property. The forward probabilities for a given sequence $X$ and a given hidden Markov model can be computed in $O(nm^2)$ time using the standard forward algorithm [8]; the backward probabilities can be computed by the backward algorithm in the same running time.
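
As an illustration, here is a minimal sketch of both dynamic programming recurrences in Python with NumPy. The parametrization (initial probabilities `pi`, transition matrix `a`, emission matrix `e`, and a sequence `x` of integer-coded symbols) is assumed for the example and is not prescribed by the text; the backward table follows the convention above, so the entry stored at index $j$ is $B_j(\ell, X) = \Pr(x_j, \ldots, x_n \mid h_j = \ell)$, which includes the emission of the $j$th symbol.

```python
import numpy as np

def forward(pi, a, e, x):
    """Forward table: F[i, k] = Pr(h_i = k, x_1, ..., x_i) (0-indexed i).

    pi: initial state probabilities, shape (m,)
    a:  transition probabilities a[k, l], shape (m, m)
    e:  emission probabilities e[k, s], shape (m, number of symbols)
    x:  observed sequence of integer-coded symbols, length n
    """
    n, m = len(x), len(pi)
    F = np.zeros((n, m))
    F[0] = pi * e[:, x[0]]                  # enter a state, emit the first symbol
    for i in range(1, n):
        # sum over predecessor states, then emit x_{i+1}: O(m^2) per position
        F[i] = (F[i - 1] @ a) * e[:, x[i]]
    return F

def backward(a, e, x):
    """Backward table: B[j, l] = Pr(x_{j+1}, ..., x_n | h_{j+1} = l),
    stored at index j + 1; the emission at that position is included."""
    n, m = len(x), a.shape[0]
    B = np.zeros((n, m))
    B[n - 1] = e[:, x[n - 1]]               # base case: emit the last symbol
    for j in range(n - 2, -1, -1):
        B[j] = e[:, x[j]] * (a @ B[j + 1])  # emit x_j, then continue the suffix
    return B
```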

Using Formula (4.4) and the results of the forward and backward algorithms, we can compute the posterior probabilities of all states at all positions of the sequence $X$ in $O(nm^2)$ time. Note that the probability of the whole sequence, $\Pr(X)$, which is the denominator in Formula (4.4), is also obtained as a side product of the forward algorithm: it is $\sum_{\ell} F_n(\ell, X)$.
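
Continuing the sketch above, the posterior table can be filled in directly from Formula (4.4); the function and the toy two-state HMM below are illustrative and not taken from the text.

```python
def posterior(pi, a, e, x):
    """Posterior table via Formula (4.4): post[i, k] = Pr(h_i = k | X)."""
    F, B = forward(pi, a, e, x), backward(a, e, x)
    n = len(x)
    px = F[n - 1].sum()                     # Pr(X) = sum over l of F_n(l, X)
    post = np.empty_like(F)
    # (B[1:] @ a.T)[i, k] = sum over l of a[k, l] * B[i+1, l]
    post[:n - 1] = F[:n - 1] * (B[1:] @ a.T) / px
    post[n - 1] = F[n - 1] / px             # last position has no successor state
    return post

# Toy example: state 0 is a fair coin, state 1 a biased coin;
# symbol 0 = heads, symbol 1 = tails.
pi = np.array([0.5, 0.5])
a = np.array([[0.9, 0.1],
              [0.1, 0.9]])
e = np.array([[0.5, 0.5],
              [0.9, 0.1]])
post = posterior(pi, a, e, [0, 0, 0, 1, 0, 0])
print(post.sum(axis=1))                     # sanity check: each row sums to 1
```

Plain probabilities are used here to mirror the formulas; for long sequences the forward and backward tables underflow, so practical implementations rescale each position or work in log space.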

We can use the posterior probabilities in a number of ways. A human user can simply examine them to look for interesting features; for example, the authors of [44] display a plot of the posterior probabilities of individual states along with the most probable annotation. The plot
