Bioinformatics I, WS’12/13, D. Huson, October 30, 2012

• actually occur: Actual Positive
• actually not occur: Actual Negative

The sets of these four types of situations are denoted PP, PN, AP and AN, respectively. Based on these classifications, one can count the number of:

Signal   Detected   Name             Definition
Yes      Yes        True Positive    $TP = |PP \cap AP|$
No       No         True Negative    $TN = |PN \cap AN|$
Yes      No         False Negative   $FN = |PN \cap AP|$
No       Yes        False Positive   $FP = |PP \cap AN|$

Based on these counts, we define:

• Sensitivity is the true positive rate, i.e. the probability of correctly predicting a positive example: $Sn = TP/(TP + FN)$.
• Specificity is the true negative rate, i.e. the probability of correctly predicting a negative example: $Sp = TN/(TN + FP)$.

We will use these terms to investigate the performance of different seeding strategies for BLAT.

3.9.1 Example

A new diagnostic test is to be evaluated on 100 patients, of whom 30 have the targeted disease. If the test predicts that 40 patients are ill, of whom 25 really are ill, then what is the sensitivity and what is the specificity of the test?

TP =        FP =        TN =        FN =

Sensitivity: $Sn = TP/(TP + FN) =$

Specificity: $Sp = TN/(TN + FP) =$

3.10 BLAT preprocessing

In a preprocessing step, BLAT builds a hash table indexing all occurrences of any word of length K in the database sequence D.

• For DNA sequences, K is typically 8, ..., 16.
• For protein sequences, K is typically 3, ..., 5.

3.11 Seed-and-extend

Like many other fast alignment programs, BLAT uses the two-stage seed-and-extend approach:

• in the seed stage, the program detects short regions of the two sequences that are highly similar;
• in the extend stage, these regions are examined in detail and alignments are produced for the regions that are very similar according to some criterion.
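The preprocessing step and the seed stage can be illustrated with a minimal Python sketch. This is not BLAT's actual implementation: the function names `build_kmer_index` and `find_seeds` and the `step` parameter are hypothetical, and a plain dictionary stands in for the hash table. With `step=1` every occurrence of every K-mer in the database is indexed; a larger `step` indexes only a subset of positions.

```python
from collections import defaultdict


def build_kmer_index(database, k, step=1):
    """Map each K-mer to the list of start positions where it occurs.

    `step` is an illustrative assumption: step=1 indexes every position,
    step=k indexes only non-overlapping K-mers.
    """
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, step):
        index[database[pos:pos + k]].append(pos)
    return dict(index)


def find_seeds(query, index, k):
    """Return (query_pos, database_pos) for every perfect K-mer match."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for d_pos in index.get(word, []):
            seeds.append((q_pos, d_pos))
    return seeds


if __name__ == "__main__":
    db = "ACGTACGTGA"
    idx = build_kmer_index(db, k=4)
    # Each seed would then be handed to the extend stage (not shown here).
    print(find_seeds("TTACGTG", idx, k=4))
```

Building the index once over the database and looking up query words in it keeps the seed stage proportional to the query length plus the number of hits, rather than to the database size.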
3.11.1 Three different seed strategies

BLAT supports three different seeding strategies:

1. Use single perfect matches of length K (so-called K-mer matches),
2. Use single near-perfect K-mer matches, and
3. Use multiple perfect K-mer matches.

For each strategy, we want to determine:

• how many homologous regions are missed (FN), and
• how many non-homologous regions are passed to the extension stage (FP), thus increasing the running time of the application.

3.11.2 Some definitions

K: The K-mer size.
M: Match ratio between homologous areas, ≈ 98% for cDNA/genomic alignments within the same species, ≈ 89% for protein alignments between human and mouse.
H: The size of a homologous area. For a human exon this is typically 50–200 bp.
G: Database size, e.g. 3 Gb for human.
Q: Query size.
A: Alphabet size, 20 for amino acids, 4 for nucleotides.

3.11.3 Strategy 1: Single perfect matches

[Figure: matches between the query sequence (e.g. cDNA) and the database sequence (e.g. genome).]

The simplest seed method is to look for words of a given size K that are shared by the query and the database. This is done by comparing every K-mer in the query sequence with all non-overlapping K-mers in the database sequence.

Assuming that each letter is independent of the previous letter, the probability that a specific K-mer in a homologous region of the database perfectly matches the corresponding K-mer in the query is:

$p_1 = M^K$.

Let $T = \lfloor H/K \rfloor$ denote the number of non-overlapping K-mers in a homologous region of length H.

Sensitivity: The probability that at least one non-overlapping K-mer in the homologous region perfectly matches the corresponding K-mer in the query is:

$P = 1 - (1 - p_1)^T = 1 - (1 - M^K)^T$.

Specificity: The number of non-overlapping K-mers that are expected to match by chance, assuming all letters are equally likely, is:

$F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K$.
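To get a feeling for the trade-off between the two quantities, the formulas for P and F can be evaluated directly. The following is a small Python sketch; the function names are hypothetical, and the parameter values in the usage example (H = 100, Q = 500) are illustrative assumptions chosen from the ranges given in the definitions, not values prescribed by the text.

```python
def strategy1_sensitivity(M, K, H):
    """P = 1 - (1 - M^K)^T with T = floor(H / K)."""
    p1 = M ** K          # probability that one specific K-mer matches perfectly
    T = H // K           # number of non-overlapping K-mers in the region
    return 1.0 - (1.0 - p1) ** T


def strategy1_expected_chance_matches(Q, K, G, A):
    """F = (Q - K + 1) * (G / K) * (1 / A)^K."""
    return (Q - K + 1) * (G / K) * (1.0 / A) ** K


if __name__ == "__main__":
    # Assumed example: DNA query of 500 bases against a 3 Gb genome,
    # homologous regions of 100 bp with a 98% match ratio.
    for K in (8, 10, 12, 14, 16):
        P = strategy1_sensitivity(M=0.98, K=K, H=100)
        F = strategy1_expected_chance_matches(Q=500, K=K, G=3e9, A=4)
        print(f"K={K:2d}  P={P:.4f}  F={F:.1f}")
```

Running the loop shows how F shrinks as K grows, while P has to be checked separately for each K because both $p_1$ and T change with K.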