11.04.2013 Views


Alignments<br />

BLAST, BLAT


Genome vs. gene<br />

Property | Genome | Gene<br />
Built of | DNA | DNA<br />
Describes | Organism | Protein<br />
Stored as | Single molecule, or a few of them | Part of genome<br />
Circular/linear | Both (depending on the species) | Linear<br />
"Life cycle" | DNA-DNA-DNA-... | DNA-RNA-protein<br />
Amount per cell | 1 | 500 ... 50000<br />
Information content | 2% ... 100% | ~30%<br />
Size | 0.5 Mb *) ... 3500 Mb | 100 b ... 100000 b


The amount of genetic information in organisms<br />

Name | Genome size (Mb) | # genes<br />
Mycoplasma genitalium | 0.5 | 470<br />
Escherichia coli | 4.5 | 4400<br />
Saccharomyces cerevisiae | 12 | 5500<br />
Drosophila melanogaster | 120 | 18000<br />
Caenorhabditis elegans | 97 | 22000<br />
Homo sapiens | 3000 | 23457<br />
Zea mays | 2500 | 50000


The amount of genetic<br />

information in organisms<br />

http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html<br />

Largest genome: amoeba Chaos chaos (200x human genome)<br />

http://www.lawrence.edu/dept/biology/animal/


Sequence searching -<br />

challenges<br />

Exponential growth of<br />

databases


Sequence searching: definition<br />

Task:<br />

– Query: a short, new sequence (~1000 letters)<br />

– Database (search space): very many sequences<br />

– Goal: find sequences homologous to the query


Sequence searching: definition<br />

We want:<br />

– a fast tool<br />

– primarily a filter: most sequences will be unrelated to the query<br />

– to fine-tune the alignment later


Database Search Algorithms: Sensitivity, Selectivity<br />

• True Positive (TP): a homology detected (positive) correctly (true)<br />

Signal | Detected | Name<br />
Yes | Yes | True Positive<br />
No | No | True Negative<br />
Yes | No | False Negative<br />
No | Yes | False Positive


Database Search Algorithms: Sensitivity, Selectivity<br />

• Sensitivity = TP/(TP+FN)<br />

• Selectivity = TN/(TN+FP)<br />

[Figure: plot with axes Selectivity and Sensitivity. Courtesy of Gary Benson (ISSCB 2003)]
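The two ratios above can be checked with a few lines of Python. This is a minimal sketch; the counts are hypothetical numbers chosen only for illustration, not results from any real search.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real homologies that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def selectivity(tn: int, fp: int) -> float:
    """Fraction of unrelated sequences correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical outcome of one database search (illustrative counts only).
tp, fn, tn, fp = 90, 10, 9800, 100
print("sensitivity:", sensitivity(tp, fn))
print("selectivity:", selectivity(tn, fp))
```

A filter-style tool would typically accept some loss of selectivity (extra false positives) to keep sensitivity high, since false positives can be discarded when the alignment is fine-tuned later.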


What is BLAST<br />

Basic Local Alignment Search Tool<br />

Bad news: it is only a heuristic<br />

– Heuristic: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work<br />

Basic idea:<br />

– High-scoring segments have a well-conserved (almost identical) part<br />

– Once a well-conserved part is identified, extend it to the real alignment


What does well conserved mean for BLAST?<br />

BLAST works with k-words (words of length k)<br />

– k is a parameter<br />

– different for DNA (>10) and proteins (2..4)<br />

A word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)<br />

Similar 3-words<br />

W1: R K P<br />
W2: R R P<br />
Score: 9 -1 7, ∑ = 15
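The T-similarity test can be sketched as follows. The three pair scores are copied from the slide's example; the `PAIR_SCORES` table and the function names are illustrative assumptions, and a real search would look scores up in a complete substitution matrix.

```python
# Pair scores taken from the slide's R K P / R R P example only.
# A real implementation would use a full substitution matrix.
PAIR_SCORES = {("R", "R"): 9, ("K", "R"): -1, ("P", "P"): 7}


def pair_score(a: str, b: str) -> int:
    """Symmetric lookup of the score for aligning letters a and b."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(w1: str, w2: str) -> int:
    """Sum of pair scores over the aligned positions of two k-words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def is_t_similar(w1: str, w2: str, t: int) -> bool:
    """True if the two words score at least the threshold T."""
    return word_score(w1, w2) >= t


print(word_score("RKP", "RRP"), is_t_similar("RKP", "RRP", t=12))
```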


BLAST algorithm<br />

3 basic steps<br />

1) Preprocess the query: extract all the k-words<br />

2) Scan for T-similar matches in the database<br />

3) Extend them to alignments<br />

1)Preprocess<br />

2)Scan<br />

3)Extend


BLAST, Step 1: Preprocess the query<br />

Take the query (e.g. LVNRKPVVP)<br />

Chop it into overlapping k-words (k=3 in this case)<br />

Query: LVNRKPVVP<br />

Word1: LVN<br />

Word2: VNR<br />

Word3: NRK<br />

…<br />

For each word, find all similar words (scoring at least T)<br />

E.g. for RKP the following 3-words are similar:<br />

QKP KKP RQP REP RRP RKP
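Chopping the query into overlapping k-words is a one-line sliding window. A minimal sketch, with `k_words` as an illustrative helper name; generating the T-similar neighbours of each word would additionally need a substitution matrix and is left out here.

```python
def k_words(seq: str, k: int) -> list[str]:
    """Return all overlapping words of length k, in query order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


words = k_words("LVNRKPVVP", 3)
print(words)
```

A query of length n yields n - k + 1 words, so the 9-letter example gives 7 words for k=3.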


Finite state machine<br />

an abstract machine<br />

with a constant amount of memory (states)<br />

used in the theory of computation and formal languages<br />

recognizes regular expressions, e.g. AC*T|GGC<br />

– compare the shell wildcard in: cp dmt*.pdf /home/john
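The pattern AC*T|GGC from the slide can be tried out with Python's standard `re` module. This is only a sketch of what the expression accepts: Python's engine uses backtracking rather than a compiled finite state machine, so it illustrates the language being recognized, not the machine itself.

```python
import re

# "A", then any number of "C", then "T"; or exactly "GGC".
PATTERN = re.compile(r"AC*T|GGC")


def accepts(s: str) -> bool:
    """True if the whole string s belongs to the language AC*T|GGC."""
    return PATTERN.fullmatch(s) is not None


for candidate in ["AT", "ACCCT", "GGC", "AGT", "GG"]:
    print(candidate, accepts(candidate))
```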


BLAST, Step 2: Find "exact" matches with scanning<br />

Use all the T-similar k-words to build the Finite State Machine<br />

Scan the database for exact matches to any of these words<br />

Words: QKP, KKP, RQP, REP, RRP, RKP, ...<br />

The machine moves along the database sequence:<br />

...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
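The scan can be imitated with a set lookup at every database position. This is a simplified stand-in, not the finite state machine the slide describes: the machine reads each database letter once, while this sketch re-reads k letters per position. The function name `scan` is an assumption made for illustration.

```python
def scan(subject: str, words: set[str]) -> list[tuple[int, str]]:
    """Return (position, word) for every exact occurrence of a listed word."""
    k = len(next(iter(words)))  # all words are assumed to share one length
    hits = []
    for i in range(len(subject) - k + 1):
        window = subject[i:i + k]
        if window in words:
            hits.append((i, window))
    return hits


similar_words = {"QKP", "KKP", "RQP", "REP", "RRP", "RKP"}
subject = "VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA"
print(scan(subject, similar_words))
```

Positions are 0-based offsets into the database fragment shown on the slide.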


BLAST, Step 3: Extending "exact" matches<br />

Having the list of exact matches, we extend the alignment in both directions<br />

Query: L V N R K P V V P<br />
T-similar: R R P (aligned with R K P in the query)<br />
Subject: G V C R R P L K C<br />
Score: -3 4 -3 5 2 7 1 -2 -3<br />

…until the sum of scores drops below some level X (e.g. X=-100) relative to the best score seen so far<br />

– what about gaps?
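A minimal sketch of ungapped extension with a drop-off limit, using the per-position scores from the slide. The function `extend` and the choice `x_drop=5` are illustrative assumptions; here the limit is written as a positive allowed drop below the best score, which is one common way to express the stopping rule.

```python
def extend(scores: list[int], start: int, end: int, x_drop: int):
    """Extend the seed scores[start:end] in both directions.

    Each direction stops once the running score has fallen more than
    x_drop below the best score seen in that direction.
    Returns (total_score, first_index, last_index) of the best segment.
    """
    def one_side(indices):
        running = best = 0
        best_index = None
        for i in indices:
            running += scores[i]
            if running > best:
                best, best_index = running, i
            elif best - running > x_drop:
                break
        return best, best_index

    right_gain, right_index = one_side(range(end, len(scores)))
    left_gain, left_index = one_side(range(start - 1, -1, -1))
    total = sum(scores[start:end]) + left_gain + right_gain
    first = start if left_index is None else left_index
    last = end - 1 if right_index is None else right_index
    return total, first, last


# Scores from the slide; the seed R K P / R R P sits at positions 3..5.
SCORES = [-3, 4, -3, 5, 2, 7, 1, -2, -3]
result = extend(SCORES, start=3, end=6, x_drop=5)
print(result)
```

With these numbers the seed (score 14) grows by one position to the right and two to the left, giving the segment at positions 1..6 with score 16.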


Gapped BLAST<br />

(now standard)<br />

gapped local alignments are computed:<br />

much, much, much slower<br />

therefore: modified “Hit criteria”<br />



Hit criteria<br />

Extend the alignment only if there are two close hits on the same diagonal<br />

– sensitivity would drop without lowering T<br />

– reduces extensions (90% of the time is spent on extensions)<br />

Gapped local alignments are computed<br />

– increased sensitivity allows us to raise T<br />

– raising T speeds up the search<br />

[Figure: dot plot with axes query pos and db pos, showing two close hits on the same diagonal]
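The two-hit criterion can be sketched by grouping hits by diagonal (qpos - dbpos) and keeping neighbouring hits that lie within a window. The function name, the `window` parameter, and the hit coordinates are hypothetical; real implementations also handle overlapping hits, which this sketch ignores.

```python
from collections import defaultdict


def two_hit_seeds(hits, window):
    """Return pairs of hits on the same diagonal at most `window` apart."""
    by_diag = defaultdict(list)
    for qpos, dbpos in hits:
        by_diag[qpos - dbpos].append((qpos, dbpos))

    pairs = []
    for diag_hits in by_diag.values():
        diag_hits.sort()
        # Compare each hit with its next neighbour on the same diagonal.
        for first, second in zip(diag_hits, diag_hits[1:]):
            if second[0] - first[0] <= window:
                pairs.append((first, second))
    return pairs


# Hypothetical (qpos, dbpos) hits; three share the diagonal -100.
hits = [(10, 110), (25, 125), (40, 90), (300, 400)]
pairs = two_hit_seeds(hits, window=40)
print(pairs)
```

Only the first two hits trigger an extension: the third is on another diagonal and the fourth is on the same diagonal but too far away.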


Gapped BLAST v BLAST<br />

We end up with<br />

– same speed<br />

– gapped alignments!<br />

– much higher sensitivity


BLAST flavours<br />

blastp: protein query, protein db<br />

blastn: DNA query, DNA db<br />

blastx: DNA query, protein db<br />

– query translated in all reading frames; used to find potential translation products of an unknown nucleotide sequence<br />

tblastn: protein query, DNA db<br />

– database dynamically translated in all reading frames<br />

tblastx: DNA query, DNA db<br />

– all translations of the query against all translations of the db


PSI-BLAST<br />

Position-Specific Iterated BLAST<br />

A profile is derived from the result of the first search<br />

The database is searched against the profile (instead of a sequence)<br />

Up to 3 iterations


Profile<br />

A profile is a generalized form of a sequence:<br />

probabilities instead of a single letter at each position<br />

Letter | pos 1 | pos 2 | pos 3 | ... | last pos<br />
A | 0.5 | 0.3 | 0.2 | ... | 0<br />
C | 0 | 0.1 | 0.0 | ... | 0.5<br />
D | 0 | 0 | 0.1 | ... | 0.2<br />
... | ... | ... | ... | ... | ...<br />
W | 0 | 0.3 | 0.4 | ... | 0.1<br />
Y | 0.5 | 0.3 | 0.3 | ... | 0.2<br />

Score of the profile p at position i against letter A:<br />

$\mathrm{score}(p, i, A) = \sum_{B} p[i, B] \cdot \mathrm{score}_{\mathrm{blosum62}}(A, B)$
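The profile score is a probability-weighted average of substitution scores. The sketch below uses a toy three-letter alphabet with made-up scores standing in for BLOSUM62; `TOY_SCORES`, `subst`, and `profile_score` are illustrative names, not part of any library.

```python
# Hypothetical substitution scores over the letters A, C, D only.
# These values are invented for the example and are NOT BLOSUM62.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "D"): -2,
    ("C", "C"): 9, ("C", "D"): -3,
    ("D", "D"): 6,
}


def subst(a: str, b: str) -> int:
    """Symmetric lookup in the toy substitution table."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def profile_score(profile, i: int, letter: str) -> float:
    """score(p, i, A) = sum over B of p[i, B] * subst(A, B)."""
    return sum(prob * subst(letter, b) for b, prob in profile[i].items())


# One dict of letter probabilities per profile position.
profile = [
    {"A": 0.5, "C": 0.0, "D": 0.5},
    {"A": 0.3, "C": 0.1, "D": 0.6},
]
print(profile_score(profile, 0, "A"))
print(profile_score(profile, 1, "D"))
```

If a position puts probability 1 on a single letter, the profile score reduces to the ordinary sequence-to-sequence substitution score.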


Constructing a profile<br />

Take significant BLAST results<br />

Make an alignment<br />

Assign weights to sequences<br />

Construct the profile<br />

[Figure: the example profile matrix from the previous slide, with rows A, C, D, ..., W, Y and one column of probabilities per position]
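The last two steps (weights, then the profile) can be sketched as weighted column frequencies. This is a simplified illustration under stated assumptions: the aligned sequences and equal weights are hypothetical, gaps are not handled, and no pseudocounts are added.

```python
from collections import defaultdict


def build_profile(aligned: list[str], weights: list[float]) -> list[dict]:
    """Weighted letter frequencies for each column of an alignment."""
    total = sum(weights)
    profile = []
    for column in zip(*aligned):  # iterate over alignment columns
        freqs = defaultdict(float)
        for letter, weight in zip(column, weights):
            freqs[letter] += weight / total
        profile.append(dict(freqs))
    return profile


# Hypothetical gap-free alignment of four sequences, equally weighted.
aligned = ["ACD", "AYD", "YCW", "YCD"]
weights = [1.0, 1.0, 1.0, 1.0]
profile = build_profile(aligned, weights)
print(profile)
```

The first column comes out as A 0.5 and Y 0.5, matching the first column of the example matrix on the slide.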


BLAT<br />

The Blast-Like Alignment Tool<br />

Large-scale genome comparison:<br />

– query can be large<br />

Preprocessing phase:<br />

– BLAST: query<br />

– BLAT: db


BLAT, Step 1: Preprocess the database<br />

Index the database with k-words<br />

– k=8..16 for nucleotides<br />

– k=3..5 for proteins<br />

For each k-word, store the sequences in which it appears<br />

k-word: RKP<br />

Hashed DB:<br />

QKP: HUgn0151194, Gene14, IG0, ...<br />

KKP: haemoglobin, Gene134, IG_30, ...<br />

RQP: HSPHOSR1, GeneA22...<br />

RKP: galactosyltransferase, IG_1...<br />

REP: haemoglobin, Gene134, IG_30, ...<br />

RRP: Z17368, Creatine kinase, ...<br />

...
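The hashed database can be sketched as a dictionary from k-word to the set of sequences containing it. The sequence names and contents below are hypothetical, and the sketch stores only names, as the slide does; a practical index would also record positions.

```python
from collections import defaultdict


def build_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-word to the names of the sequences it occurs in."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return dict(index)


# Hypothetical three-sequence protein database.
database = {"seq1": "MRKPLV", "seq2": "GRKPA", "seq3": "LVQKP"}
index = build_index(database, k=3)
print(sorted(index["RKP"]))
```

Because the index depends only on the database, it can be built once and reused for every query.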


Hashing – associative arrays<br />

Indexing with the object itself<br />

Hash function:<br />

hash: possible objects (large set) -> hash values (small set, fits in memory)<br />

– Objects should be "well spread"


Hashing - examples<br />

T9 Predictive Text in mobile phones<br />

– "hello" in Multitap: 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6<br />

– "hello" in T9: 4, 3, 5, 5, 6<br />

– Collisions: 4, 6: "in", "go"
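The T9 example is a hash function from words to digit strings. A minimal sketch using the standard phone keypad layout; `t9` is an illustrative function name.

```python
# Standard phone keypad: each digit covers several letters.
KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYS.items() for ch in letters}


def t9(word: str) -> str:
    """Hash a word to the digit sequence typed in T9 mode."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


print(t9("hello"))
print(t9("in"), t9("go"))  # two different words, one hash value: a collision
```

Many words map to few digit strings, so collisions are unavoidable; the phone resolves them by offering the candidate words for that key sequence.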


BLAT, Step 1: Index to find exact matches with hashing<br />

The database is preprocessed only once (independently of the query)!<br />



BLAT, Step 2: Hit criteria<br />

In constant time we can get the sequences containing a given k-word<br />

Relaxing the hit definition -> improved sensitivity<br />

– allow imperfect hits<br />

• costly: the already huge hash grows several times over!<br />

➔ shorten k (which would lead to false positives), but require two hits (see BLAST)


BLAT, Step 3: Identifying homologous regions<br />

Exclude common k-words<br />

For all k-words from the query<br />

– look up their positions in the db<br />

For the resulting hits (qpos, dbpos):<br />

– split into buckets (64 kbp)<br />

– sort on the diagonal (diag = qpos - dbpos)


BLAT, Step 3: Identifying homologous regions (continued)<br />

– from diagonally close hits (within the gap limit) create "pre-clusters"<br />

– sort each "pre-cluster" on dbpos<br />

– create clusters from close hits<br />

– run local alignment for each cluster
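The clustering steps can be sketched as two passes of "sort, then cut where neighbours are far apart": first on the diagonal, then on the database position. This is a simplified illustration; the bucket split is omitted, and the limits and hit coordinates are hypothetical.

```python
def split_runs(items, key, limit):
    """Sort items by key and cut wherever neighbours differ by more than limit."""
    runs = []
    for item in sorted(items, key=key):
        if runs and key(item) - key(runs[-1][-1]) <= limit:
            runs[-1].append(item)
        else:
            runs.append([item])
    return runs


def cluster_hits(hits, diag_limit, gap_limit):
    """Group (qpos, dbpos) hits into pre-clusters by diagonal, then clusters by dbpos."""
    clusters = []
    for pre_cluster in split_runs(hits, lambda h: h[0] - h[1], diag_limit):
        clusters.extend(split_runs(pre_cluster, lambda h: h[1], gap_limit))
    return clusters


# Hypothetical hits: three lie close together, two are isolated.
hits = [(0, 1000), (8, 1008), (16, 1018), (5, 5000), (24, 9000)]
clusters = cluster_hits(hits, diag_limit=2, gap_limit=100)
print(clusters)
```

Each resulting cluster marks a candidate homologous region, which would then be passed to a local alignment routine.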


Seeds – improving sensitivity<br />

A seed is a more general form of a k-word ("." matches any letter)<br />

The seed<br />

CT.GT.AT.<br />

gives “hits” with both sequences<br />

...CTCGTTATA...<br />

...CTAGTAATG...
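A spaced seed can be sketched as a mask: only the positions where the seed has a letter are compared, and the "." positions are ignored. The helper names `seed_key` and `seed_matches` are illustrative assumptions.

```python
def seed_key(window: str, seed: str) -> str:
    """Keep only the letters at the seed's fixed (non-'.') positions."""
    return "".join(ch for ch, s in zip(window, seed) if s != ".")


def seed_matches(a: str, b: str, seed: str) -> bool:
    """True if a and b agree at every fixed position of the seed."""
    return seed_key(a, seed) == seed_key(b, seed)


SEED = "CT.GT.AT."
a, b = "CTCGTTATA", "CTAGTAATG"
print(seed_key(a, SEED), seed_key(b, SEED))
print("seed hit:", seed_matches(a, b, SEED), "| exact 9-word hit:", a == b)
```

The two sequences differ at three of nine positions, so an exact 9-word would miss them, while the seed still reports a hit.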


How to detect homology?<br />

Take the score of a maximal local alignment<br />

Can it be obtained by chance?<br />

– any score can be obtained from comparing (long enough) random sequences


What is a “chance”?<br />

Extracting local alignments from random sequences<br />

P-value (e.g. 0.01)<br />

– The probability of obtaining the result by pure chance<br />

– An alignment with a P-value lower than the threshold set by the user is considered a hit.


Best Local Alignments by chance<br />

Create random seqs, each 1000 aa long<br />

Find the max local alignment<br />

Repeat<br />

[Figure: histogram of the number of alignments (0 to 7000) against the best local alignment score (25 to 75)]


The Statistics of local<br />

alignment<br />

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html<br />

The substitution matrix must guarantee<br />

• E(score(a,b)) < 0 for random a, b<br />

Analytical solution<br />

– sum of i.i.d. variables -> normal distribution<br />

– max of i.i.d. variables -> extreme value distribution (EVD)


Expected number of alignments<br />

E-value: the expected number of alignments scoring >= S<br />

$E = K m n e^{-\lambda S}$<br />

2x size of seq -> 2x number of alignments<br />

2x S -> E drops exponentially


E-value depends on n, m<br />

Example<br />

– For comparing seqA with seqB: S=88 -> E = 0.001<br />

– For comparing seqA with 1000 seqs: score 88 -> E = 1<br />

Important for db searching:<br />

– n: size of the query, m: size of the db<br />

$E = K m n e^{-\lambda S}$
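The dependence of E on the search space can be sketched directly from the formula. The values of K and lambda below are hypothetical placeholders (in practice they depend on the substitution table), and the sequence lengths are invented for the example.

```python
import math


def e_value(k_const: float, lam: float, m: int, n: int, score: float) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return k_const * m * n * math.exp(-lam * score)


K, LAM = 0.1, 0.3   # hypothetical constants, for illustration only
QUERY_LEN = 300     # n: size of the query (assumed)
SEQ_LEN = 300       # length of one database sequence (assumed)

one_seq = e_value(K, LAM, m=SEQ_LEN, n=QUERY_LEN, score=88)
thousand_seqs = e_value(K, LAM, m=SEQ_LEN * 1000, n=QUERY_LEN, score=88)
print(thousand_seqs / one_seq)
```

The same score of 88 becomes a thousand times less significant when the database is a thousand times larger, which is the step from E = 0.001 to E = 1 in the slide's example.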


Deriving K, λ<br />

$E = K m n e^{-\lambda S}$<br />

The above equation is a theoretical result for gapless (gap penalty g = inf) alignment<br />

– K, λ can be derived from the substitution table<br />

For the gapped case<br />

– it seems that the equation holds<br />

– we can derive K, λ from "experiments"<br />

[Figure: histogram of the number of alignments (0 to 7000) against score (25 to 75)]


Bit score S'<br />

Score S depends on the substitution table<br />

What if we want a table-independent score?<br />

$E = K m n e^{-\lambda S}$<br />

$E = m n 2^{-S'}$ where $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
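A quick numerical check that the bit score gives the same E-value as the raw score. K and lambda are hypothetical placeholder values, and the function names are illustrative.

```python
import math


def bit_score(raw: float, k_const: float, lam: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k_const)) / math.log(2)


def e_from_raw(raw: float, k_const: float, lam: float, m: int, n: int) -> float:
    """E = K m n exp(-lambda * S)."""
    return k_const * m * n * math.exp(-lam * raw)


def e_from_bits(bits: float, m: int, n: int) -> float:
    """E = m n 2^(-S'), with no table-specific constants left."""
    return m * n * 2.0 ** (-bits)


K, LAM = 0.1, 0.3  # hypothetical constants, for illustration only
bits = bit_score(88, K, LAM)
print(e_from_raw(88, K, LAM, 300, 300000), e_from_bits(bits, 300, 300000))
```

Substituting S' into the second formula gives $m n\, e^{-(\lambda S - \ln K)} = K m n\, e^{-\lambda S}$, so the two expressions agree for any K and lambda.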


Why does BLAST work?<br />

Relevant riddle<br />

Are there at least 2 people in Amsterdam with the same number of hairs?<br />

– At most 500 000 hairs on each head<br />

– 700 000 people living in Amsterdam


Why does BLAST work?<br />

Pigeons<br />

Pigeonhole principle: with 9 boxes and 10 pigeons, at least one box holds more than 1 pigeon<br />

– n=9, k=7 case:<br />

http://en.wikipedia.org/wiki/Pigeonhole_principle


Why does BLAST work?<br />

Average case<br />

The pigeonhole principle describes the worst case!<br />

On average we expect two pigeons in the same box much earlier<br />

Birthday paradox: among 23 people, the probability that at least two of them share a birthday is > 0.5<br />

– note: 365 boxes and only 23 pigeons!
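The birthday figure can be verified by multiplying the probabilities that each new person avoids all earlier birthdays. A minimal sketch assuming 365 equally likely birthdays.

```python
def p_shared_birthday(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


print(p_shared_birthday(22), p_shared_birthday(23))
```

The probability crosses 0.5 between 22 and 23 people, far below the 366 people the pigeonhole principle would require for certainty.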


Birthday paradox


Why does BLA[S]T work?<br />

Forget the T-similar words, now use only identities<br />

2 sequences, 100 nucleotides each:<br />

– What's the minimal sequence identity for which there must be a string of 3 consecutive identities?<br />

0 identities, 100 mismatches: no run of 3<br />

67 identities, 33 mismatches: a run of 3 can still be avoided<br />

68th identity: ? (by the pigeonhole principle, a run of 3 becomes unavoidable)<br />

but if seqs are 50% identical, we'll detect it with prob. 99%<br />

28% identical -> we'll detect it with prob. 50%<br />

– how is it calculated?
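The worst case can be checked with a short pigeonhole argument: m mismatches cut the alignment into m + 1 segments, and each segment may hold at most 2 identities if no run of 3 is allowed. The sketch below builds such a worst-case pattern explicitly; the function names are illustrative.

```python
def max_identities_without_run(length: int, run: int) -> int:
    """Largest number of identities that can still avoid `run` in a row.

    With i identities there are length - i mismatches, hence
    length - i + 1 segments of at most run - 1 identities each.
    """
    return (run - 1) * (length + 1) // run


def has_run(pattern: str, run: int) -> bool:
    """True if the pattern contains `run` consecutive identities ('I')."""
    return "I" * run in pattern


# Worst-case layout: two identities, one mismatch, repeated.
pattern = ("IIM" * 34)[:100]
print(pattern.count("I"), pattern.count("M"), has_run(pattern, 3))
print(max_identities_without_run(100, 3))
```

So 67% identity is the most that can escape a 3-word hit in 100 positions, while the average-case figures on the slide (99% at 50% identity) are far more favourable.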


Expected sensitivity<br />

We assume that letters are independent<br />

I: identity between seqs; for human-mouse<br />

– 86% for DNA,<br />

– 89% for proteins<br />

Probability that a whole k-word is identical:<br />

$p_{\text{word id}} = I^{k}$


Expected sensitivity<br />

Q: query size<br />

number of non-overlapping words<br />

$R = \dfrac{Q}{k}$<br />

prob. of a "hit"<br />

$p_{\text{detect}} = 1 - (1 - p_{\text{word id}})^{R}$
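These formulas reproduce the detection probabilities quoted earlier for two 100-nucleotide sequences and k=3. A minimal sketch; rounding R down to a whole number of non-overlapping words is an assumption of this example.

```python
def p_detect(identity: float, k: int, query_size: int) -> float:
    """Probability that at least one non-overlapping k-word matches exactly."""
    p_word_id = identity ** k     # all k letters identical
    r = query_size // k           # number of non-overlapping words
    return 1.0 - (1.0 - p_word_id) ** r


print(round(p_detect(0.50, k=3, query_size=100), 3))
print(round(p_detect(0.28, k=3, query_size=100), 3))
```

At 50% identity the chance of at least one exact 3-word hit is close to 99%, and at 28% identity it is close to 50%, in line with the numbers on the earlier slide.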


Expected specificity<br />

How many matches by chance (C)?<br />

G: genome size<br />

$C = (Q - k + 1) \cdot \dfrac{G}{k} \cdot \left(\dfrac{1}{4}\right)^{k}$<br />

For human-mouse, to get 99% sensitivity we have to set k=7, and for Q=1000<br />

C ~= 25,000,000<br />

– 7 h assuming 1/1000 s per alignment
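The chance-match estimate can be reproduced numerically. The genome size of 3000 Mb is taken from the Homo sapiens row of the table earlier in the deck; the alphabet size of 4 and the 1 ms per alignment are the slide's assumptions.

```python
def chance_matches(query_size: int, k: int, genome_size: float, alphabet: int = 4) -> float:
    """C = (Q - k + 1) * (G / k) * (1 / alphabet)^k."""
    return (query_size - k + 1) * (genome_size / k) * (1 / alphabet) ** k


c = chance_matches(query_size=1000, k=7, genome_size=3_000_000_000)
hours = c * 0.001 / 3600  # 1/1000 s per alignment
print(round(c), round(hours, 1))
```

With these inputs C comes out at roughly 26 million, the same order as the slide's ~25,000,000, which corresponds to about 7 hours of alignment time.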
