11.04.2013 Views

BLAST, BLAT and FASTA - Algorithms in Bioinformatics



Bioinformatics I, WS'09-10, S. Henz (script by D. Huson), November 26, 2009

The number of random HSPs (s, t) with σ(s, t) ≥ S can be described by a Poisson distribution with parameter $v = Kmn\,e^{-\lambda S}$. The number of HSPs with score ≥ S that we expect to see due to chance is then the parameter v, also called the E-value:

\[ E(S) = Kmn\,e^{-\lambda S} \]
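To get a feeling for the formula, the following minimal sketch evaluates E(S) for a few raw scores. The function name `expected_hsps` and all numerical values (K, λ, m, n) are assumptions chosen for illustration only; they are not taken from a real scoring system.

```python
import math

def expected_hsps(score, m, n, K, lam):
    """E-value E(S) = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)

# Illustrative (assumed) parameters, not from a real scoring matrix.
K, lam = 0.1, 0.3
m, n = 250, 1_000_000   # assumed query length and database length

for s in (40, 60, 80):
    # E shrinks exponentially as the raw score S grows.
    print(s, expected_hsps(s, m, n, K, lam))
```

Note that E grows linearly with the size mn of the search space but falls exponentially with the score S.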

The parameters K and λ depend on the background probabilities of the symbols and on the employed scoring matrix. We define λ as the unique positive value for y that satisfies the equation

\[ \sum_{a,b \in \Sigma} p_a\, p_b\, e^{S(a,b)\,y} = 1 \]

(y = 0 always satisfies the equation, since the probabilities sum to 1, so only the positive solution is of interest.) K and λ are scaling factors for the search space and for the scoring scheme, respectively.
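The equation for λ usually has no closed-form solution, so it is solved numerically. The sketch below uses plain bisection; this is one possible approach and not necessarily what BLAST itself does. The function name `solve_lambda`, the uniform DNA background and the +1/−1 scoring scheme are assumptions made for this example. A positive solution exists only if the expected score is negative and at least one pair of symbols has a positive score.

```python
import math

def solve_lambda(probs, score, tol=1e-12):
    """Return the positive root y of sum_{a,b} p_a p_b exp(score(a,b) * y) = 1."""
    symbols = list(probs)
    pairs = [(probs[a] * probs[b], score(a, b)) for a in symbols for b in symbols]

    # A positive root exists only if the expected score is negative
    # and at least one pair has a positive score.
    if sum(p * s for p, s in pairs) >= 0 or all(s <= 0 for _, s in pairs):
        raise ValueError("need negative expected score and some positive score")

    def f(y):
        return sum(p * math.exp(s * y) for p, s in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0:       # grow until we are to the right of the root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:       # shrink until we are to the left of the root (but > 0)
        lo /= 2.0
    while hi - lo > tol:    # bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Assumed example: uniform DNA background, match +1, mismatch -1.
dna = {c: 0.25 for c in "ACGT"}
match_mismatch = lambda a, b: 1 if a == b else -1
lam = solve_lambda(dna, match_mismatch)
print(lam)
```

For this particular scheme the equation reduces to e^y/4 + 3e^(−y)/4 = 1, whose positive solution is y = ln 3, so the printed value should be close to 1.0986.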

Hence the probability of finding exactly x HSPs with a score ≥ S is given by

\[ P(X = x) = e^{-E}\,\frac{E^{x}}{x!} \]

The probability of finding at least one HSP "by chance" is

\[ P(S) = 1 - P(X = 0) = 1 - e^{-E}, \]

where E is the E-value for S.

Thus we see that the probability distribution of the scores follows an extreme value distribution.

BLAST reports E-values rather than "P-values" as it is easier to interpret the difference between E-values than to interpret the difference between P-values.
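The Poisson formulas above, and the remark about E-values versus P-values, can be illustrated with a small sketch. The function names `prob_exactly` and `p_value` are hypothetical, and the list of E-values is chosen arbitrarily for this example.

```python
import math

def prob_exactly(x, E):
    """Poisson probability of seeing exactly x chance HSPs, given E-value E."""
    return math.exp(-E) * E**x / math.factorial(x)

def p_value(E):
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-E)   # numerically stable form of 1 - exp(-E)

for E in (0.001, 0.01, 1, 5, 10):
    print(f"E = {E:>6}   P = {p_value(E):.6f}")
```

For small E the two quantities are nearly identical (P ≈ E). For larger E the P-values crowd together just below 1, whereas the E-values 5 and 10 remain easy to tell apart.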

The raw scores S are of little use without detailed knowledge of the scoring system used, that is, of the statistical parameters K and λ.
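To see why, consider the same raw score evaluated under two different scoring systems. In the sketch below, the two (K, λ) pairs and the search space size are purely hypothetical values chosen for illustration.

```python
import math

def e_value(score, m, n, K, lam):
    """E-value K * m * n * exp(-lam * score) for one scoring system."""
    return K * m * n * math.exp(-lam * score)

# Two hypothetical scoring systems, given as (K, lambda).
systems = {"system A": (0.13, 0.32), "system B": (0.04, 0.27)}
m, n = 250, 1_000_000    # assumed query and database lengths
raw_score = 100

results = {name: e_value(raw_score, m, n, K, lam)
           for name, (K, lam) in systems.items()}
for name, e in results.items():
    print(name, e)
```

The same raw score of 100 yields E-values that differ by more than an order of magnitude, so a raw score alone says little about significance.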

Therefore we introduce a normalized raw score, called the bit score S′, that is defined as

\[ S' = \frac{\lambda S - \ln K}{\ln 2}. \]

E-values and bit scores are related by

\[ E = mn\,2^{-S'} \]

(exercise!)

4.6 Gapped BLAST

A new version of BLAST called BLAST 2.0² allows gaps in the extension phase.

4.7 The BLAST family

• BLASTN: compares a DNA query sequence to a DNA sequence database

² S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402 (1997).
