BLAST, BLAT and FASTA - Algorithms in Bioinformatics
Bioinformatics I, WS’09-10, S. Henz (script by D. Huson) November 26, 2009
The number of random HSPs (s, t) with σ(s, t) ≥ S can be described by a Poisson distribution with parameter $v = Kmn\,e^{-\lambda S}$. The number of HSPs with score ≥ S that we expect to see due to chance is then the parameter v, also called the E-value:

$$E(S) = Kmn\,e^{-\lambda S}$$
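As a rough illustration of the formula above, the following sketch evaluates E(S) for two raw scores. The function name `expected_hsps` and the numeric values chosen for K, λ, m and n are assumptions made for this example only; they are not taken from the script.

```python
import math

def expected_hsps(score, m, n, K, lam):
    """E-value E(S) = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)

# Assumed, purely illustrative values for the parameters in E(S).
K, lam = 0.1, 0.3
m, n = 250, 1_000_000

e_40 = expected_hsps(40, m, n, K, lam)
e_50 = expected_hsps(50, m, n, K, lam)
print(f"E(40) = {e_40:.3g}, E(50) = {e_50:.3g}")
```

The E-value falls exponentially with the raw score and grows linearly with the product mn, so doubling n doubles the number of chance hits expected at a fixed score.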
The parameters K and λ depend on the background probabilities of the symbols and on the employed scoring matrix. We define λ as the unique positive value of y that satisfies the equation

$$\sum_{a,b \in \Sigma} p_a\, p_b\, e^{S(a,b)\,y} = 1$$

K and λ are scaling factors for the search space and for the scoring scheme, respectively.
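The defining equation for λ has no closed-form solution in general, but it can be solved numerically. The sketch below uses plain bisection. The helper name `solve_lambda`, the uniform base frequencies and the +1/−3 match/mismatch scores are assumptions chosen for illustration, and bisection is just one simple root-finding choice.

```python
import math

def solve_lambda(probs, score, tol=1e-12):
    """Positive root y of sum_{a,b} p_a p_b exp(score(a, b) * y) = 1, by bisection."""
    pairs = [(pa * pb, score(a, b))
             for a, pa in probs.items()
             for b, pb in probs.items()]

    def f(y):
        return sum(w * math.exp(s * y) for w, s in pairs) - 1.0

    # y = 0 always solves the equation. A positive root exists only if the
    # expected score is negative and at least one score is positive.
    if sum(w * s for w, s in pairs) >= 0:
        raise ValueError("expected score must be negative")
    if max(s for _, s in pairs) <= 0:
        raise ValueError("at least one score must be positive")

    hi = 1.0
    while f(hi) <= 0:      # grow until we are to the right of the root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:      # shrink until we are between 0 and the root
        lo /= 2.0
    while hi - lo > tol:   # standard bisection on the bracket [lo, hi]
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Assumed example: uniform DNA background, match = +1, mismatch = -3.
probs = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(probs, lambda a, b: 1 if a == b else -3)
print(f"{lam:.3f}")  # 1.374
```

The two checks at the top reflect the conditions under which a positive solution exists: the scoring scheme must have a negative expected score and must allow at least one positive score.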
Hence the probability of finding exactly x HSPs with a score ≥ S is given by

$$P(X = x) = e^{-E}\,\frac{E^{x}}{x!}$$
The probability of finding at least one HSP “by chance” is

$$P(S) = 1 - P(X = 0) = 1 - e^{-E},$$

where E is the E-value for S.
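The two probabilities above are easy to evaluate directly. The following minimal sketch does so; the function names `prob_exactly` and `prob_at_least_one` and the sample E-values are assumptions introduced for illustration.

```python
import math

def prob_exactly(x, e_value):
    """Poisson probability P(X = x) = exp(-E) * E**x / x!."""
    return math.exp(-e_value) * e_value ** x / math.factorial(x)

def prob_at_least_one(e_value):
    """P(S) = 1 - exp(-E); expm1 avoids cancellation for very small E."""
    return -math.expm1(-e_value)

# Assumed sample E-values, from large to very small.
for e in (10.0, 1.0, 0.01, 1e-5):
    print(f"E = {e:g}: P = {prob_at_least_one(e):.6g}")
```

For small E the two quantities nearly coincide (P ≈ E), while for large E the P-value saturates close to 1.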
Thus we see that the probability distribution of the scores follows an extreme value distribution.<br />
BLAST reports E-values rather than “P-values” as it is easier to interpret the difference between E-values than to interpret the difference between P-values.
The raw scores S are of little use without detailed knowledge of the scoring system used, that is, of the statistical parameters K and λ.
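To see why a raw score alone says little, the sketch below evaluates the E-value of the same raw score under two scoring systems. Both parameter pairs (K, λ), the labels "A" and "B", and the lengths m and n are hypothetical values picked only to make the contrast visible.

```python
import math

def e_value(score, m, n, K, lam):
    """E-value K * m * n * exp(-lam * S) for a raw score S."""
    return K * m * n * math.exp(-lam * score)

m, n = 250, 1_000_000                 # assumed lengths
systems = {                           # hypothetical (K, lambda) pairs
    "A": (0.13, 0.32),
    "B": (0.71, 1.37),
}

raw = 30
results = {name: e_value(raw, m, n, K, lam)
           for name, (K, lam) in systems.items()}
for name, e in results.items():
    print(f"system {name}: raw score {raw} -> E = {e:.3g}")
```

Under these assumed parameters the same raw score is expected thousands of times by chance in one system and essentially never in the other.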
Therefore we introduce a normalized raw score called the bit score S′, which is defined as

$$S' = \frac{\lambda S - \ln K}{\ln 2}.$$

E-values and bit scores are related by

$$E = mn\,2^{-S'}$$

(exercise!)

4.6 Gapped BLAST
A new version of BLAST called BLAST 2.0² allows gaps in the extension phase.
4.7 The BLAST family
• BLASTN: compares a DNA query sequence to a DNA sequence database
² S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402 (1997).