
Alignments 

BLAST, BLAT

Genome vs. gene

                      Genome                               Gene
Built of...           DNA                                  DNA
Describes             Organism                             Protein
Stored as...          Single molecule, or a few of them    Part of genome
Circular/linear       Both (depending on the species)      Linear
"Life cycle"          DNA-DNA-DNA-...                      DNA-RNA-protein
Amount per cell       1                                    500 .. 50000
Information content   2% ... 100%                          ~30%
Size                  0.5Mb *) ... 3500Mb                  100b ... 100000b

The amount of genetic information in organisms

Name                        Genome size (Mb)   # genes
Mycoplasma genitalium       0.5                470
Escherichia coli            4.5                4400
Saccharomyces cerevisiae    12                 5500
Drosophila melanogaster     120                18000
Caenorhabditis elegans      97                 22000
Homo sapiens                3000               23457
Zea mays                    2500               50000

The amount of genetic 

information in organisms 

http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html 

Largest genome: amoeba Chaos chaos 

(200x human genome) http://www.lawrence.edu/dept/biology/animal/

Sequence searching - 

challenges 

Exponential growth of 

databases

Sequence searching – 

definition 

Task: 

– Query: short, new sequence (~1000 

letters) 

– Database (searching space): very 

many sequences 

– Goal: find seqs homologous to the 

query

Sequence searching – 

definition 

We want: 

– fast tool 

– primarily a filter: most sequences will 

be unrelated to the query 

– fine-tune the alignment later

Database Search Algorithms: 

Sensitivity, Selectivity 

• True Positive (TP) – a homology detected (positive) correctly (true)

Signal   Detected   Name
Yes      Yes        True Positive
No       No         True Negative
Yes      No         False Negative
No       Yes        False Positive

Database Search Algorithms: 

Sensitivity, Selectivity 

• Sensitivity =TP/(TP+FN) 

• Selectivity =TN/(TN+FP) 
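
The two ratios can be sketched in a few lines of Python. The function names and the example counts are illustrative assumptions, not figures from the lecture.

```python
def sensitivity(tp, fn):
    """Fraction of true homologies that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def selectivity(tn, fp):
    """Fraction of unrelated sequences that were correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical search result: 100 true homologs, 9900 unrelated sequences.
print(sensitivity(tp=90, fn=10))
print(selectivity(tn=9800, fp=100))
```

A search tool used as a filter usually trades one ratio against the other, which is why both are reported.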

[Figure: selectivity vs. sensitivity. Courtesy of Gary Benson (ISSCB 2003)]

What is BLAST 

Basic Local Alignment Search Tool 

Bad news: it is only a heuristic

– Heuristic: a rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work

Basic idea:

– High-scoring segments have a well-conserved (almost identical) part

– Once a well-conserved part is identified, extend it to the real alignment


What does "well conserved" mean for BLAST?

BLAST works with k-words (words of length k)

– k is a parameter

– different for DNA (>10) and proteins (2..4)

A word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

Similar 3-words

W1:     R    K    P
W2:     R    R    P
Score:  9   –1    7     ∑ = 15
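
A minimal sketch of the T-similarity test. The score table below holds only the three pair scores printed on the slide (9, -1, 7); a real search would look them up in a full substitution matrix, and the names `SLIDE_SCORES`, `word_score` and `is_t_similar` are made up for this example.

```python
# Pair scores copied from the slide's example. The table is symmetric,
# so the mixed pair is stored in both orders.
SLIDE_SCORES = {
    ("R", "R"): 9,
    ("K", "R"): -1,
    ("R", "K"): -1,
    ("P", "P"): 7,
}


def word_score(w1, w2, scores):
    """Sum of pair scores of two equally long words, position by position."""
    if len(w1) != len(w2):
        raise ValueError("words must have the same length")
    return sum(scores[(a, b)] for a, b in zip(w1, w2))


def is_t_similar(w1, w2, scores, threshold):
    """True if the summed pair score reaches the threshold T."""
    return word_score(w1, w2, scores) >= threshold


print(word_score("RKP", "RRP", SLIDE_SCORES))
print(is_t_similar("RKP", "RRP", SLIDE_SCORES, threshold=12))
```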

BLAST algorithm 

3 basic steps 

1)Preprocess the query: extract all 

the k-words 

2)Scan for T-similar matches in 

database 

3)Extend them to alignments 

1)Preprocess 

2)Scan 

3)Extend

BLAST, Step 1: 

Preprocess the query 

Take the query (e.g. LVNRKPVVP)

Chop it into overlapping k-words (k=3 in this case)

Query: LVNRKPVVP
Word1: LVN
Word2: VNR
Word3: NRK
…

For each word find all similar words (scoring at least T)

E.g. for RKP the following 3-words are similar:

QKP KKP RQP REP RRP RKP
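
The preprocessing step can be sketched as follows. `k_words` reproduces the chopping shown above; `neighborhood` enumerates every candidate word over an alphabet and keeps those scoring at least T. The pair-score function passed in the usage lines (2 for a match, -1 for a mismatch) is a toy assumption, not the matrix BLAST uses.

```python
from itertools import product


def k_words(query, k):
    """All overlapping words of length k, in query order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighborhood(word, alphabet, pair_score, threshold):
    """Brute-force list of words whose summed pair score against `word` is >= threshold."""
    similar = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            similar.append("".join(candidate))
    return similar


print(k_words("LVNRKPVVP", 3))

# Toy scoring on a two-letter alphabet, for illustration only.
toy_score = lambda a, b: 2 if a == b else -1
print(neighborhood("RK", "RK", toy_score, threshold=1))
```

The brute-force enumeration grows as (alphabet size)^k, which is one reason k stays small for proteins.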

Finite state machine 

abstract machine

constant amount of memory (states)

used in computation and languages

recognizes regular expressions

– cp dmt*.pdf /home/john

Example regular expression: AC*T|GGC
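
As an illustration, the expression AC*T|GGC can be recognized by a hand-written machine with a fixed set of states. The state names and the table layout are assumptions chosen for readability.

```python
# Transition table for the regular expression AC*T|GGC.
# A missing (state, symbol) entry means the input is rejected.
TRANSITIONS = {
    ("start", "A"): "after_A",
    ("after_A", "C"): "after_A",    # C* loops on the same state
    ("after_A", "T"): "accept",
    ("start", "G"): "after_G",
    ("after_G", "G"): "after_GG",
    ("after_GG", "C"): "accept",
}


def accepts(text):
    """Run the machine over the whole text; accept only if it ends in 'accept'."""
    state = "start"
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state == "accept"


for candidate in ["AT", "ACCCT", "GGC", "AC", "GGCT"]:
    print(candidate, accepts(candidate))
```

The machine remembers nothing except its current state, which is the "constant amount of memory" mentioned above.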


Find "exact" matches with scanning

Use all the T-similar k-words to build the Finite State Machine

Scan for exact matches

T-similar k-words: QKP, KKP, RQP, REP, RRP, RKP, ...

Database (scanned in one pass): ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
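
A simplified sketch of the scan. Instead of a real finite state machine it slides a window over the database and looks each window up in a set, which finds the same exact matches but is only a stand-in for the automaton; the function name `scan` is an assumption.

```python
def scan(database, words):
    """Return (position, word) for every database window that is one of `words`."""
    words = set(words)
    if not words:
        return []
    k = len(next(iter(words)))
    if any(len(w) != k for w in words):
        raise ValueError("all words must have the same length")
    hits = []
    for i in range(len(database) - k + 1):
        window = database[i:i + k]
        if window in words:
            hits.append((i, window))
    return hits


similar_words = ["QKP", "KKP", "RQP", "REP", "RRP", "RKP"]
print(scan("VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA", similar_words))
```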


Extending "exact" matches

Having the list of exact matches, we extend the alignment in both directions

Query:      L   V   N   R   K   P   V   V   P
T-similar:              R   R   P
Subject:    G   V   C   R   R   P   L   K   C
Score:     -3   4  -3   5   2   7   1  -2  -3

…until the sum of scores drops below some level X (e.g. X=-100) relative to the best score seen so far

– what about gaps?
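
A minimal sketch of ungapped extension with a drop-off rule, run on the column scores from the slide. The function name, the half-open seed coordinates and the drop-off value of 3 are assumptions picked so that the small example terminates visibly.

```python
def extend_hit(col_scores, seed_start, seed_end, x_drop):
    """Extend a seed covering columns [seed_start, seed_end) in both directions.

    Extension in one direction stops once the running score has fallen more
    than x_drop below the best score seen in that direction.
    Returns (total score, start, end) with a half-open column range.
    """
    def one_direction(scores):
        best = running = 0
        best_len = 0
        for length, s in enumerate(scores, start=1):
            running += s
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
        return best, best_len

    seed_score = sum(col_scores[seed_start:seed_end])
    right_gain, right_len = one_direction(col_scores[seed_end:])
    left_gain, left_len = one_direction(reversed(col_scores[:seed_start]))
    return (seed_score + left_gain + right_gain,
            seed_start - left_len,
            seed_end + right_len)


scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]   # per-column scores from the slide
print(extend_hit(scores, seed_start=3, seed_end=6, x_drop=3))
```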

Gapped BLAST 

(now standard) 

gapped local alignments are computed: 

much, much, much slower 

therefore: modified “Hit criteria” 


Hit criteria 

Extend the alignment only if there are two close hits on the same diagonal

– sensitivity would drop unless T is lowered

– reduces extensions (90% of the time is spent on extensions)

Gapped local alignments are computed

– increased sensitivity allows us to raise T

– raising T speeds up the search

[Figure: hits plotted by database position (dbpos) against query position (query pos); two close hits lie on the same diagonal]
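
The two-hit rule can be sketched by grouping hits by diagonal and keeping neighbouring hits that are close enough. The window size, the hit coordinates and the function name are hypothetical, and the sketch does not check that the two hits do not overlap.

```python
from collections import defaultdict


def two_hit_pairs(hits, window):
    """Return (diagonal, qpos1, qpos2) for neighbouring hits on one diagonal
    whose query positions differ by at most `window`."""
    by_diagonal = defaultdict(list)
    for qpos, dbpos in hits:
        by_diagonal[qpos - dbpos].append(qpos)

    pairs = []
    for diagonal, qpositions in by_diagonal.items():
        qpositions.sort()
        for first, second in zip(qpositions, qpositions[1:]):
            if 0 < second - first <= window:
                pairs.append((diagonal, first, second))
    return sorted(pairs)


# Hypothetical (qpos, dbpos) hits: the first two share a diagonal and are close.
hits = [(10, 110), (25, 125), (40, 300), (90, 190)]
print(two_hit_pairs(hits, window=40))
```

Only the surviving pairs would be handed to the expensive extension step.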

Gapped BLAST v BLAST 

We end up with 

– same speed 

– gapped alignments! 

– much higher sensitivity

BLAST flavours 

blastp: protein query, protein db 

blastn: DNA query, DNA db 

blastx: DNA query, protein db 

– in all reading frames. Used to find potential 

translation products of an unknown nucleotide 

sequence. 

tblastn: protein query, DNA db 

– database dynamically translated in all reading 

frames. 

tblastx: DNA query, DNA db 

– all translations of query against all translations of 

db

PSI-BLAST 

Position-Specific Iterated BLAST 

A profile is derived from the result 

of the first search 

Database is searched against the 

profile (instead of a sequence) 

Up to 3 iterations

Profile 

A profile is a generalized form of a sequence:

probabilities instead of a single letter at each position

Letter   Probability at each profile position
A        0.5   0.3   0.2   ...   0
C        0     0.1   0.0   ...   0.5
D        0     0     0.1   ...   0.2
.        .     .     .     ...   .
W        0     0.3   0.4   ...   0.1
Y        0.5   0.3   0.3   ...   0.2

Score of the profile

$$\mathrm{score}(p, i, A) = \sum_{B} p[i, B] \cdot \mathrm{score}_{\mathrm{blosum62}}(A, B)$$

where p is the profile, i the position, and A the letter.
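
A small sketch of the profile score as a probability-weighted sum of substitution scores. The profile is stored as one dictionary per position; the column below reuses the first column of the slide (A and Y at 0.5 each), while the tiny score table `TOY_SUBST` is an illustrative stand-in for a full matrix.

```python
def profile_score(profile, position, letter, subst):
    """Expected substitution score of `letter` against one profile column."""
    column = profile[position]
    return sum(prob * subst[(letter, other)] for other, prob in column.items())


# Toy substitution scores for two letters only (assumed values for the demo).
TOY_SUBST = {
    ("A", "A"): 4, ("A", "Y"): -2,
    ("Y", "A"): -2, ("Y", "Y"): 7,
}

profile = [{"A": 0.5, "Y": 0.5}]   # letters with probability 0 are omitted
print(profile_score(profile, 0, "A", TOY_SUBST))
print(profile_score(profile, 0, "Y", TOY_SUBST))
```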

Constructing a profile 

Take significant BLAST results 

Make an alignment 

Assign weights to sequences 

Construct the profile 

[Figure: the resulting profile, a matrix of letter probabilities (rows A, C, D, ..., W, Y) for each alignment position]
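
The last two steps (weights, then the profile) can be sketched as weighted column frequencies over an existing alignment. How the weights are chosen is not covered here; the sequences and weights below are made up, and a gap character would simply be counted as another letter.

```python
from collections import defaultdict


def build_profile(aligned, weights):
    """Weighted letter frequencies per column of equally long aligned sequences."""
    if len(aligned) != len(weights):
        raise ValueError("one weight per sequence is required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    total = sum(weights)
    profile = []
    for i in range(length):
        column = defaultdict(float)
        for seq, weight in zip(aligned, weights):
            column[seq[i]] += weight / total
        profile.append(dict(column))
    return profile


print(build_profile(["AC", "AD"], [1, 1]))
print(build_profile(["AC", "YC"], [3, 1]))
```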

BLAT 

The Blast-Like Alignment Tool 

Large-scale genome comparison: 

– query can be large 

Preprocessing phase: 

– BLAST: query 

– BLAT: db

BLAT, Step 1: 

Preprocess the 

database 

Index the database with k-words 

– k=8..16 for nucleotides 

– k=3..5 for proteins 

For each k-word store in which 

sequences it appears 

k-word: RKP 


Hashed DB: 

QKP: HUgn0151194, Gene14, IG0, ... 

KKP: haemoglobin, Gene134, IG_30, ... 

RQP: HSPHOSR1, GeneA22... 

RKP: galactosyltransferase, IG_1... 

REP: haemoglobin, Gene134, IG_30, ... 

RRP: Z17368, Creatine kinase, ... 

...
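
A minimal sketch of such an index, using a Python dictionary as the hash table. The sequence names and contents are hypothetical, and positions are stored next to the names so that later steps can work with diagonals.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-word to the list of (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((name, pos))
    return index


# Hypothetical two-sequence database.
db = {"seq1": "VLQKPL", "seq2": "CRKPLQKP"}
index = build_index(db, 3)
print(index["QKP"])
print(index["RKP"])
```

The index depends only on the database, so it can be built once and reused for every query.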

Hashing – associative 

arrays 

Indexing with the object

Hash function:

hash: possible objects (large) -> x (small, fits in memory)

– Objects should be "well spread"

Hashing - examples 

T9 Predictive Text in mobile phones 

– “hello” in Multitap: 

4, 4, 3, 3, 5, 5, 5, 

(pause) 5, 5, 5, 6, 6, 6 

– “hello” in T9: 

4, 3, 5, 5, 6 

– Collisions: 4, 6: 

“in”, “go” 

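
The T9 example can be reproduced with the standard phone keypad layout: the key sequence acts as the hash of a word, and words with the same key sequence collide. The function and table names are assumptions.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items() for letter in letters}


def t9_key(word):
    """Hash a word to its key sequence, one key press per letter."""
    return "".join(LETTER_TO_KEY[c] for c in word.lower())


def build_t9_table(words):
    """Associative array from key sequence to all words that share it."""
    table = {}
    for word in words:
        table.setdefault(t9_key(word), []).append(word)
    return table


table = build_t9_table(["hello", "in", "go"])
print(t9_key("hello"))
print(table["46"])
```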

BLAT, Step 1: 

Index to find exact matches 

with hashing 

The database is preprocessed only 

once! (independent from the 

query) 

k-word: RKP 


Hashed DB: 

QKP: HUgn0151194, Gene14, IG0, ... 

KKP: haemoglobin, Gene134, IG_30, ... 

RQP: HSPHOSR1, GeneA22... 

RKP: galactosyltransferase, IG_1... 

REP: haemoglobin, Gene134, IG_30, ... 

RRP: Z17368, Creatine kinase, ... 

...

BLAT, Step 2: 

Hit criteria 


In constant time we can get the sequences containing a given k-word

Relaxing the hit definition -> improved sensitivity

– allow imperfect hits

• costly: the already huge hash grows several-fold!

➔ shorten k (would lead to false positives), but require two hits (see BLAST)

BLAT, Step 3: 

Identifying homologous 

regions 

Exclude common k-words

For all k-words from query

– find out the position in db

For results (qpos, dbpos):

– split into buckets (64kbp)

– sort on the diagonal (diag = qpos − dbpos)

BLAT, Step 3: 

Identifying homologous 

regions 

continued... 


– from diagonally close hits (gap 

limit) create “pre-clusters” 

– sort each “pre-cluster” on dbpos 

– create clusters from close hits 

– run Local Alignment for each 

cluster
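
The clustering steps can be sketched as two rounds of "sort, then split where neighbours are too far apart". This is an illustration under stated assumptions, not BLAT's actual code: the gap limits, the helper `split_when` and the example hits are hypothetical, and the final local alignment of each cluster is left out.

```python
def split_when(items, too_far):
    """Split a sorted list into groups wherever two neighbours are too far apart."""
    groups = []
    for item in items:
        if groups and not too_far(groups[-1][-1], item):
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


def cluster_hits(hits, diag_gap, pos_gap, bucket_size=64_000):
    """Group (qpos, dbpos) hits into clusters of diagonally and positionally close hits."""
    buckets = {}
    for qpos, dbpos in hits:
        buckets.setdefault(dbpos // bucket_size, []).append((qpos, dbpos))

    diag = lambda hit: hit[0] - hit[1]
    clusters = []
    for _, bucket in sorted(buckets.items()):
        bucket.sort(key=diag)                                   # sort on the diagonal
        pre_clusters = split_when(bucket, lambda a, b: diag(b) - diag(a) > diag_gap)
        for pre in pre_clusters:
            pre.sort(key=lambda hit: hit[1])                    # sort on dbpos
            clusters.extend(split_when(pre, lambda a, b: b[1] - a[1] > pos_gap))
    return clusters


hits = [(0, 1000), (10, 1010), (20, 1021), (5, 5000), (500, 1500)]
for cluster in cluster_hits(hits, diag_gap=2, pos_gap=100):
    print(cluster)
```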

Seeds – improving sensitivity 

More general form of k-word is a 

seed 

The seed 

CT.GT.AT. 

gives “hits” with both sequences 

...CTCGTTATA... 

...CTAGTAATG...
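
A short sketch of matching a seed in which "." marks a position that may hold any letter. The function names are assumptions; the seed and the two sequences are the ones shown above.

```python
def matches_seed(seed, window):
    """True if the window agrees with the seed at every position that is not '.'."""
    if len(seed) != len(window):
        return False
    return all(s == "." or s == c for s, c in zip(seed, window))


def find_seed(seed, text):
    """Start positions of all windows of `text` that match the seed."""
    width = len(seed)
    return [i for i in range(len(text) - width + 1)
            if matches_seed(seed, text[i:i + width])]


seed = "CT.GT.AT."
print(matches_seed(seed, "CTCGTTATA"), matches_seed(seed, "CTAGTAATG"))
print(find_seed(seed, "GGCTAGTAATGCC"))
```

The two sequences differ at three positions, so an ordinary 9-word would not produce a hit between them, while the seed does.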

How to detect homology? 

Take the score of a maximal local alignment

can it be obtained by chance? 

– any score can be obtained from 

comparing (long enough) random 

sequences

What is a “chance”? 

Extracting local alignments from 

random sequences 

P-value (e.g. =0.01) 

– The probability of obtaining the result 

by pure chance 

– An alignment giving lower P-value 

than set by user is considered a hit.

Best Local Alignments 

by chance 

Create random seqs, each 1000aa long 

Find the max local align 

Repeat 

[Figure: histogram of best local alignment scores from random sequences; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]
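
The experiment can be sketched with a plain dynamic-programming local alignment. The scoring scheme (match 1, mismatch -1, gap -2), the uniform letter frequencies and the short sequence length in the usage lines are simplifying assumptions; the lecture's 1000 aa sequences would work the same way but run slowly in pure Python.

```python
import random


def best_local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best local alignment (Smith-Waterman with linear gap cost)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def simulate_best_scores(n_pairs, length, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Best local alignment score for each of n_pairs random sequence pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a = "".join(rng.choice(alphabet) for _ in range(length))
        b = "".join(rng.choice(alphabet) for _ in range(length))
        scores.append(best_local_score(a, b))
    return scores


print(simulate_best_scores(n_pairs=10, length=50))
```

Collecting many such scores and counting how often each value occurs gives a histogram of the kind described above.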

The Statistics of local 

alignment 

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html 

Subst matrix must guarantee 

• E(score(a,b)) < 0 for random a, b 

Analytical solution 

– sum of i.i.d. variables -> normal 

distribution 

– max of i.i.d. -> extreme value 

distribution (EVD)

Expected number of aligns 

E-value: the expected number of alignments scoring >= S

$$E = K m n e^{-\lambda S}$$

2x size of seq -> 2x number of alignments

2x S -> E drops exponentially
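
A minimal numeric sketch of the E-value, assuming the usual form in which a scale constant (called `lam` here) multiplies S in the exponent. The values of K and `lam` are hypothetical placeholders; real values depend on the scoring system.

```python
import math


def e_value(m, n, score, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


K, lam = 0.1, 0.3          # hypothetical constants for the demo
base = e_value(m=1000, n=100, score=50, K=K, lam=lam)
print(base)
print(e_value(m=2000, n=100, score=50, K=K, lam=lam) / base)   # doubling a size
print(e_value(m=1000, n=100, score=100, K=K, lam=lam) / base)  # doubling the score
```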

E-value depends on n, m 

Example 

– For comparing seqA with seqB: S=88 

-> E = 0.001 

– For comparing seqA with 1000 seqs: 

score 88 -> E=1 

Important for db searching: 

– n – size of query, m - size of db 

$$E = K m n e^{-\lambda S}$$

Deriving K, λ

The above equation is a theoretical result for gapless (g=inf) alignment

– K, λ can be derived from the substitution table

For the gapped case

$$E = K m n e^{-\lambda S}$$

– it seems that the equation holds

– we can derive K, λ from "experiments"

[Figure: histogram of best local alignment scores from random sequences; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]

Bit score S' 

Score S depends on the substitution 

table 

What if we want table-independent 

score? 

$$E = m n 2^{-S'} \quad \text{where} \quad S' = \frac{\lambda S - \ln K}{\ln 2}$$

(obtained from $E = K m n e^{-\lambda S}$)
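
The conversion can be checked numerically: the E-value computed from the bit score should equal the one computed from the raw score. The constants K and `lam` (the scale factor multiplying S) are hypothetical, as are the function names.

```python
import math


def bit_score(raw_score, K, lam):
    """Table-independent score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_raw(m, n, raw_score, K, lam):
    return K * m * n * math.exp(-lam * raw_score)


def e_from_bits(m, n, bits):
    return m * n * 2 ** (-bits)


K, lam = 0.1, 0.3          # hypothetical constants for the demo
bits = bit_score(88, K, lam)
print(bits)
print(e_from_raw(1000, 100, 88, K, lam), e_from_bits(1000, 100, bits))
```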

Why does the BLAST work? 

Relevant riddle 

Are there at least 2 people in 

Amsterdam with the same 

number of hairs? 

– At most 500 000 hairs on 

each head 

– 700 000 people living in 

Amsterdam


Pigeons 

pigeonhole principle: having 9 boxes and 10 

pigeons, there is at least one box with more than 1 pigeon 

– n=9, k=7 case: 

http://en.wikipedia.org/wiki/Pigeonhole_principle


Average case 

pigeonhole principle describes the worst 

case! 

On average we'll expect two pigeons in 

the same box much earlier 

Birthday paradox: among 23 people, the probability that at least two of them share a birthday is > 0.5

– note: 365 boxes and only 23 pigeons!
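
The figure of 23 people can be checked directly, assuming birthdays are independent and uniformly spread over 365 days. The function name is an assumption.

```python
def p_shared(pigeons, boxes=365):
    """Probability that at least two of `pigeons` land in the same box."""
    p_all_distinct = 1.0
    for i in range(pigeons):
        p_all_distinct *= (boxes - i) / boxes
    return 1 - p_all_distinct


for people in (10, 22, 23, 50):
    print(people, round(p_shared(people), 4))
```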

Birthday paradox

Why does the BLA[S]T work? 

Forget the T-similar words, now use only identities 

2 sequences, 100 nucleotides each: 

– What's the minimal sequence identity for which 

there's a string of 3 consecutive identities? 

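0 identities, 100 mismatches: no run of identities at all

67 identities, 33 mismatches: still possible without 3 in a row (two identities, one mismatch, repeated)

68th identity: a run of 3 consecutive identities is unavoidable (pigeonhole principle)

but if seqs are 50% id, we'll detect it with prob. 99%

28% id -> we'll detect it with prob. 50%

– how is it calculated?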

Expected sensitivity 

We assume that letters are 

independent 

I – identity between seqs, for 

human-mouse 

– 86% for DNA, 

– 89% for proteins 

$$p_{\text{word id}} = I^k$$

Expected sensitivity 

Q - query size 

number of non-overlapping words

$$R = \frac{Q}{k}$$

prob. of a "hit"

$$p_{\text{detect}} = 1 - (1 - p_{\text{word id}})^R$$
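
A small numeric sketch of the detection probability, assuming independent letters and counting only whole non-overlapping words (Q divided by k, rounded down). With 100 letters and k=3 it gives roughly the 99% and 50% figures quoted for 50% and 28% identity.

```python
def p_detect(identity, k, query_len):
    """Probability that at least one of the non-overlapping k-words matches exactly."""
    p_word = identity ** k          # all k letters of one word are identical
    n_words = query_len // k        # R, the number of non-overlapping words
    return 1 - (1 - p_word) ** n_words


for identity in (0.28, 0.50, 0.86):
    print(identity, round(p_detect(identity, k=3, query_len=100), 3))
```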

Expected specificity 

How many matches by chance (C)?

G – genome size

$$C = (Q - k + 1) \cdot \frac{G}{k} \cdot \left(\frac{1}{4}\right)^k$$

For h-m (human-mouse), to get 99% sensitivity we have to set k=7, and for Q=1000

C ~= 25,000,000

– about 7 h, assuming 1/1000 s per alignment
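
The chance-match estimate can be reproduced as below. The genome size of 3000 Mb is taken from the Homo sapiens row of the organism table and is an assumption about which value the example used; with it the count comes out near 26 million, the same order as the figure quoted above.

```python
def chance_matches(query_len, k, genome_size, alphabet_size=4):
    """Expected number of k-word matches between query and genome by chance alone."""
    query_words = query_len - k + 1              # overlapping words in the query
    genome_words = genome_size / k               # non-overlapping words in the genome
    p_match = (1 / alphabet_size) ** k           # two random k-words are identical
    return query_words * genome_words * p_match


C = chance_matches(query_len=1000, k=7, genome_size=3_000_000_000)
hours = C * 0.001 / 3600                         # 1/1000 s per alignment
print(round(C), round(hours, 1))
```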
