ibi.vu.nl

Alignments

BLAST, BLAT


Genome vs. gene

                      Genome                              Gene
Built of...           DNA                                 DNA
Describes             Organism                            Protein
Stored as...          Single molecule, or a few of them   Part of genome
Circular/linear       Both (depending on the species)     Linear
“Life cycle”          DNA-DNA-DNA-...                     DNA-RNA-protein
Amount per cell       1 *)                                500 .. 50000
Information content   2% ... 100%                         ~30%
Size                  0.5Mb ... 3500Mb                    100b ... 100000b


The amount of genetic information in organisms

Name                        Genome size (Mb)   # genes
Mycoplasma genitalium       0.5                470
Escherichia coli            4.5                4400
Saccharomyces cerevisiae    12                 5500
Drosophila melanogaster     120                18000
Caenorhabditis elegans      97                 22000
Homo sapiens                3000               23457
Zea mays                    2500               50000


The amount of genetic information in organisms

http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html

Largest genome: amoeba Chaos chaos (200x human genome)

http://www.lawrence.edu/dept/biology/animal/


Sequence searching - challenges

Exponential growth of databases


Sequence searching – definition

Task:
– Query: short, new sequence (~1000 letters)
– Database (search space): very many sequences
– Goal: find sequences homologous to the query


Sequence searching – definition

We want:
– a fast tool
– primarily a filter: most sequences will be unrelated to the query
– to fine-tune the alignment later


Database Search Algorithms: Sensitivity, Selectivity

• True Positive (TP) – a homology detected (positive) correctly (true)

Signal   Detected   Name
Yes      Yes        True Positive
No       No         True Negative
Yes      No         False Negative
No       Yes        False Positive


Database Search Algorithms: Sensitivity, Selectivity

• Sensitivity = TP/(TP+FN)

• Selectivity = TN/(TN+FP)
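The two ratios above can be computed directly from the four counts. A minimal sketch follows; the function names and the example counts are hypothetical, chosen only for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true homologs that the search reports: TP / (TP + FN)."""
    return tp / (tp + fn)


def selectivity(tn, fp):
    """Fraction of unrelated sequences that the search rejects: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
print(sensitivity(tp=90, fn=10))   # 0.9
print(selectivity(tn=980, fp=20))  # 0.98
```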


[Figure: sensitivity and selectivity. Courtesy of Gary Benson (ISSCB 2003)]


What is BLAST

Basic Local Alignment Search Tool

Bad news: it is only a heuristic

– Heuristic: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, DN (1981) The Mind's Best Work

Basic idea:

– High-scoring segments contain a well-conserved (almost identical) part

– Once the well-conserved parts are identified, extend them to the real alignment


What does "well conserved" mean for BLAST?

BLAST works with k-words (words of length k)

– k is a parameter

– different for DNA (>10) and proteins (2..4)

A word w1 is T-similar to w2 if the sum of pair scores is at least T (e.g. T=12)

Similar 3-words:

W1:      R    K    P
W2:      R    R    P
Score:   9   –1    7    ∑ = 15
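The T-similarity test is just a threshold on a sum. A small sketch, taking the per-position pair scores as given in the example (9, -1, 7) instead of looking them up in a substitution matrix; the function name is hypothetical.

```python
def is_t_similar(pair_scores, threshold):
    """Two k-words are T-similar if their position-wise pair scores sum to at least T."""
    return sum(pair_scores) >= threshold


# Pair scores for RKP vs. RRP as listed in the example: 9, -1, 7.
print(is_t_similar([9, -1, 7], threshold=12))  # True (the sum is 15)
```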


BLAST algorithm

3 basic steps

1) Preprocess the query: extract all the k-words

2) Scan for T-similar matches in the database

3) Extend them to alignments

1)Preprocess

2)Scan

3)Extend


BLAST, Step 1: Preprocess the query

Take the query (e.g. LVNRKPVVP)

Chop it into overlapping k-words (k=3 in this case)

Query: LVNRKPVVP
Word1: LVN
Word2: VNR
Word3: NRK
...

For each word, find all similar words (scoring at least T)

E.g. for RKP the following 3-words are similar:

QKP KKP RQP REP RRP RKP
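The preprocessing step can be sketched as two small functions: one that chops the query into overlapping k-words, and one that enumerates every word scoring at least T against a given word. This is a sketch under stated assumptions: `toy_score` is a hypothetical stand-in (identity +5, otherwise 0), not a real substitution matrix, so the neighbourhood it produces differs from the list in the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kwords(sequence, k):
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_score(a, b):
    """Hypothetical pair score (NOT a real substitution matrix): +5 for identity, else 0."""
    return 5 if a == b else 0


def similar_words(word, threshold, score=toy_score):
    """Every k-word whose summed pair score against `word` is at least `threshold`."""
    return {
        "".join(cand)
        for cand in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }


print(kwords("LVNRKPVVP", 3))
# ['LVN', 'VNR', 'NRK', 'RKP', 'KPV', 'PVV', 'VVP']
neighbours = similar_words("RKP", threshold=10)
print(len(neighbours))  # 58: RKP itself plus 3 * 19 single-substitution words
```

With a real matrix only the `score` argument changes; the exhaustive enumeration over 20^k candidates stays feasible because k is small for proteins.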



Finite state machine

abstract machine
constant amount of memory (states)
used in computation and languages
recognizes regular expressions
– cp dmt*.pdf /home/john
– AC*T|GGC
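To see which strings the example expression AC*T|GGC accepts, it can be tried with Python's standard `re` module. This is only an illustration of the expression itself: `re` is a backtracking matcher, used here as a convenient stand-in for an explicit finite state machine.

```python
import re

# The example expression: "A, any number of C, then T" or "GGC".
pattern = re.compile(r"AC*T|GGC")

for text in ["AT", "ACCCT", "GGC", "ACG"]:
    # fullmatch requires the whole string to be accepted, like an automaton run.
    print(text, bool(pattern.fullmatch(text)))
```

The first three strings are accepted and "ACG" is rejected.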


BLAST, Step 2: Find "exact" matches with scanning

Use all the T-similar k-words to build the Finite State Machine

Scan for exact matches

Words: QKP, KKP, RQP, REP, RRP, RKP, ...

Database: ...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
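The scan can be sketched with a sliding window and a set lookup. This deliberately swaps the finite state machine for a simpler technique that reports the same exact matches; the function name is hypothetical.

```python
def scan(database, words):
    """Report (position, word) for every exact occurrence of a listed word."""
    k = len(next(iter(words)))  # assumes all words share the same length k
    return [
        (pos, database[pos:pos + k])
        for pos in range(len(database) - k + 1)
        if database[pos:pos + k] in words
    ]


words = {"QKP", "KKP", "RQP", "REP", "RRP", "RKP"}
print(scan("VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA", words))
# [(2, 'QKP'), (6, 'KKP'), (13, 'RQP'), (21, 'RKP')]
```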


BLAST, Step 3: Extending "exact" matches

Having the list of exact matches, we extend the alignment in both directions

Query:       L   V   N   R   K   P   V   V   P
T-similar:               R   R   P
Subject:     G   V   C   R   R   P   L   K   C
Score:      -3   4  -3   5   2   7   1  -2  -3

...until the sum of scores drops below some level X (e.g. X=-100) relative to the best sum seen so far

– What about gaps?
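The ungapped extension can be sketched on the per-position scores from the example. The seed RRP covers positions 3 to 5; each side is extended until the running sum falls more than `x_drop` below the best sum seen. The function names and the value `x_drop=5` are assumptions made for this illustration.

```python
def extend_one_side(scores, x_drop):
    """Walk outward from the seed; return (best gain, number of positions kept)."""
    best = running = 0
    best_len = 0
    for length, score in enumerate(scores, start=1):
        running += score
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # dropped too far below the best sum: stop extending
    return best, best_len


def extend_seed(scores, seed_start, seed_end, x_drop):
    """Extend the seed scores[seed_start:seed_end] in both directions."""
    seed_score = sum(scores[seed_start:seed_end])
    right_gain, right_len = extend_one_side(scores[seed_end:], x_drop)
    left_gain, left_len = extend_one_side(scores[:seed_start][::-1], x_drop)
    return seed_start - left_len, seed_end + right_len, seed_score + left_gain + right_gain


scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]  # the Score row of the example
print(extend_seed(scores, seed_start=3, seed_end=6, x_drop=5))
# (1, 7, 16): the segment VNRKPV / VCRRPL, as a half-open interval [1, 7)
```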


Gapped BLAST (now standard)

gapped local alignments are computed: much, much, much slower

therefore: modified “Hit criteria”


Hit criteria

Extend the alignment only if there are two close hits on the same diagonal

– sensitivity would drop without lowering T

– reduces the number of extensions (90% of the time is spent on extensions)

Gapped local alignments are computed

– increased sensitivity allows us to raise T

– raising T speeds up the search

[Figure: hits plotted by query position (query pos) against database position (dbpos); a close hit on the same diagonal is marked]
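The two-hit criterion can be sketched by grouping hits by diagonal and keeping neighbouring pairs that lie close together. The diagonal is taken as qpos - dbpos; the function name, the window size, and the hit coordinates are hypothetical, and the sketch ignores the usual requirement that the two hits do not overlap.

```python
from collections import defaultdict


def two_hit_seeds(hits, window):
    """Pairs of hits on the same diagonal whose query positions differ by at most `window`."""
    by_diagonal = defaultdict(list)
    for qpos, dbpos in hits:
        by_diagonal[qpos - dbpos].append((qpos, dbpos))
    pairs = []
    for diagonal_hits in by_diagonal.values():
        diagonal_hits.sort()
        # Only neighbouring hits on the diagonal need to be compared.
        for first, second in zip(diagonal_hits, diagonal_hits[1:]):
            if second[0] - first[0] <= window:
                pairs.append((first, second))
    return pairs


# Hypothetical (qpos, dbpos) word hits.
hits = [(3, 103), (20, 120), (5, 300), (90, 190)]
print(two_hit_seeds(hits, window=40))  # [((3, 103), (20, 120))]
```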


Gapped BLAST vs. BLAST

We end up with

– same speed

– gapped alignments!

– much higher sensitivity


BLAST flavours

blastp: protein query, protein db

blastn: DNA query, DNA db

blastx: DNA query, protein db
– query translated in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.

tblastn: protein query, DNA db
– database dynamically translated in all reading frames.

tblastx: DNA query, DNA db
– all translations of the query against all translations of the db


PSI-BLAST

Position-Specific Iterated BLAST

A profile is derived from the results of the first search

The database is searched against the profile (instead of a sequence)

Up to 3 iterations


Profile

A profile is a generalized form of a sequence: probabilities instead of a letter

(rows: letters; columns: profile positions)

A     0.5   0.3   0.2   ...   0
C     0     0.1   0.0   ...   0.5
D     0     0     0.1   ...   0.2
.     .     .     .     ...   .
W     0     0.3   0.4   ...   0.1
Y     0.5   0.3   0.3   ...   0.2

Score of the profile p at position i for letter A:

\[ \mathrm{score}(p, i, A) = \sum_{B} p[i, B] \cdot \mathrm{score}_{\mathrm{blosum62}}(A, B) \]
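The profile score is a probability-weighted average of pair scores. A minimal sketch for a single profile column follows; `toy_score` is a hypothetical stand-in for BLOSUM62 (identity +4, otherwise -1), so the numbers are illustrative only.

```python
def profile_score(column, letter, pair_score):
    """Sum over B of p[i, B] * pair_score(A, B) for one profile column."""
    return sum(prob * pair_score(letter, other) for other, prob in column.items())


def toy_score(a, b):
    """Hypothetical stand-in for BLOSUM62: +4 for identity, -1 otherwise."""
    return 4 if a == b else -1


# A column with A and Y each at probability 0.5, like the first column of the example.
column = {"A": 0.5, "Y": 0.5}
print(profile_score(column, "A", toy_score))  # 1.5
print(profile_score(column, "W", toy_score))  # -1.0
```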


Constructing a profile

Take significant BLAST results

Make an alignment

Assign weights to sequences

Construct the profile

[Figure: the resulting profile, the same letter-by-position probability matrix as on the Profile slide]
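The last two construction steps (weights, then the profile) can be sketched as weighted letter frequencies per alignment column. This is a simplified sketch: it assumes the sequences are already aligned to equal length, uses equal weights unless told otherwise, and does not treat gaps or pseudocounts. The function name is hypothetical.

```python
from collections import defaultdict


def build_profile(aligned, weights=None):
    """Per-column weighted letter frequencies for equal-length aligned sequences."""
    if weights is None:
        weights = [1.0] * len(aligned)  # assumption: every sequence counts equally
    total = sum(weights)
    profile = []
    for column in zip(*aligned):
        freqs = defaultdict(float)
        for letter, weight in zip(column, weights):
            freqs[letter] += weight / total
        profile.append(dict(freqs))
    return profile


print(build_profile(["AY", "AA"]))
# [{'A': 1.0}, {'Y': 0.5, 'A': 0.5}]
```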


BLAT

The BLAST-Like Alignment Tool

Large-scale genome comparison:
– the query can be large

Preprocessing phase:
– BLAST: query
– BLAT: db


BLAT, Step 1: Preprocess the database

Index the database with k-words
– k=8..16 for nucleotides
– k=3..5 for proteins

For each k-word, store in which sequences it appears

k-word: RKP

Hashed DB:
QKP: HUgn0151194, Gene14, IG0, ...
KKP: haemoglobin, Gene134, IG_30, ...
RQP: HSPHOSR1, GeneA22...
RKP: galactosyltransferase, IG_1...
REP: haemoglobin, Gene134, IG_30, ...
RRP: Z17368, Creatine kinase, ...
...
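Such an index maps each k-word to the places where it occurs. A minimal sketch using a Python dictionary follows; the sequence names and contents are made up, and indexing every overlapping k-word (and storing positions alongside names) is an assumption of this sketch, not something the slide specifies.

```python
from collections import defaultdict


def build_kword_index(database, k):
    """Map each k-word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


# Hypothetical two-sequence database.
index = build_kword_index({"seq1": "ARKPA", "seq2": "RKP"}, k=3)
print(index["RKP"])  # [('seq1', 1), ('seq2', 0)]
```

The index is built once and reused for every query, which is what makes the lookup per k-word cheap.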


Hashing – associative arrays

Indexing with the object

Hash function:

hash: possible objects (large) -> x (small, fits in memory)

– Objects should be “well spread”


Hashing - examples

T9 Predictive Text in mobile phones

– “hello” in Multitap: 4, 4, 3, 3, 5, 5, 5, (pause) 5, 5, 5, 6, 6, 6

– “hello” in T9: 4, 3, 5, 5, 6

– Collisions: 4, 6: “in”, “go”
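The T9 example can be written as a tiny hash function: each word maps to its key sequence, and words sharing a key sequence collide in the same table entry. The names below are hypothetical; the keypad layout is the standard phone keypad.

```python
from collections import defaultdict

KEYPAD = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
          "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items() for letter in letters}


def t9(word):
    """Hash a word to its key sequence, one key press per letter."""
    return "".join(LETTER_TO_KEY[letter] for letter in word.lower())


table = defaultdict(list)
for word in ["hello", "in", "go"]:
    table[t9(word)].append(word)

print(t9("hello"))  # 43556
print(table["46"])  # ['in', 'go'], a collision
```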


BLAT, Step 1: Index to find exact matches with hashing

The database is preprocessed only once! (independent of the query)



BLAT, Step 2: Hit criteria

In constant time we can get the sequences containing a certain k-word

Relaxing the hit definition -> improved sensitivity

– allow imperfect hits
  • costly: the huge hash grows several-fold!

➔ shorten k (would lead to false positives), but require two hits (see BLAST)


BLAT, Step 3: Identifying homologous regions

Exclude common k-words

For all k-words from the query
– find their positions in the db

For the results (qpos, dbpos):
– split into buckets (64kbp)
– sort on the diagonal (diag = qpos - dbpos)


BLAT, Step 3: Identifying homologous regions, continued...

– from diagonally close hits (gap limit) create “pre-clusters”
– sort each “pre-cluster” on dbpos
– create clusters from close hits
– run Local Alignment for each cluster
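The bucket, sort, and cluster steps can be sketched as follows. Only the 64 kbp bucket size comes from the text; the diagonal is assumed to be qpos - dbpos, and `diag_gap`, `pos_gap`, the function names, and the example hits are hypothetical. The final local alignment per cluster is left out.

```python
def split_when(items, key, max_gap):
    """Split a key-sorted list wherever consecutive keys differ by more than max_gap."""
    groups, current = [], []
    for item in items:
        if current and key(item) - key(current[-1]) > max_gap:
            groups.append(current)
            current = []
        current.append(item)
    if current:
        groups.append(current)
    return groups


def diagonal(hit):
    qpos, dbpos = hit
    return qpos - dbpos


def cluster_hits(hits, bucket_size=64_000, diag_gap=2, pos_gap=100):
    """Group (qpos, dbpos) hits: bucket by dbpos, pre-cluster by diagonal, cluster by dbpos."""
    buckets = {}
    for hit in hits:
        buckets.setdefault(hit[1] // bucket_size, []).append(hit)
    clusters = []
    for _, bucket in sorted(buckets.items()):
        for pre_cluster in split_when(sorted(bucket, key=diagonal), diagonal, diag_gap):
            by_dbpos = sorted(pre_cluster, key=lambda h: h[1])
            clusters.extend(split_when(by_dbpos, lambda h: h[1], pos_gap))
    return clusters


hits = [(0, 1000), (10, 1010), (20, 1021), (5, 5000), (0, 70000)]
print(cluster_hits(hits))
# [[(5, 5000)], [(0, 1000), (10, 1010), (20, 1021)], [(0, 70000)]]
```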


Seeds – improving sensitivity

A more general form of a k-word is a seed

The seed

CT.GT.AT.

gives “hits” with both sequences

...CTCGTTATA...
...CTAGTAATG...
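A seed with "don't care" positions can be checked by comparing only the positions that are not marked with a dot. In this sketch the seed acts purely as a mask over two aligned windows; the function name is hypothetical.

```python
def seed_hit(seed, a, b):
    """True if a and b agree at every position the seed marks as required (not '.')."""
    return all(x == y for s, x, y in zip(seed, a, b) if s != ".")


print(seed_hit("CT.GT.AT.", "CTCGTTATA", "CTAGTAATG"))  # True
# A contiguous 9-word (no '.' positions) would miss the same pair:
print(seed_hit("#########", "CTCGTTATA", "CTAGTAATG"))  # False
```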


How to detect homology?

Take the score of a maximal local alignment

Can it be obtained by chance?

– any score can be obtained by comparing (long enough) random sequences


What is “chance”?

Extracting local alignments from random sequences

P-value (e.g. = 0.01)

– The probability of obtaining a result at least this good by pure chance

– An alignment with a P-value lower than the threshold set by the user is considered a hit.


Best Local Alignments by chance

Create random sequences, each 1000aa long

Find the maximal local alignment

Repeat

[Figure: histogram of the resulting scores; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]
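The experiment can be sketched in pure Python with a small Smith-Waterman scorer. Everything numeric here is an assumption made for the sketch: the match/mismatch/gap scores are a toy scheme with a negative expected score (not a real substitution table), letters are drawn uniformly, and the sequences are shortened to 100 letters to keep the run fast.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def max_local_score(a, b, match=5, mismatch=-4, gap=-8):
    """Best local alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def simulate(n_pairs, length, seed=0):
    """Best local scores for n_pairs of independent uniform random sequences."""
    rng = random.Random(seed)

    def rand_seq():
        return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

    return [max_local_score(rand_seq(), rand_seq()) for _ in range(n_pairs)]


scores = simulate(n_pairs=50, length=100)
print(min(scores), max(scores))
```

Collecting many such scores and plotting a histogram gives the kind of skewed distribution the slide shows.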


The Statistics of local alignment

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

The substitution matrix must guarantee

• E(score(a,b)) < 0 for random a, b

Analytical solution

– sum of i.i.d. variables -> normal distribution

– max of i.i.d. variables -> extreme value distribution (EVD)


Expected number of alignments

E-value: the expected number of alignments scoring >= S

\[ E = K m n e^{-\lambda S} \]

2x size of sequence -> 2x number of alignments

2x S -> E drops exponentially


E-value depends on n, m

Example

– For comparing seqA with seqB: S=88 -> E = 0.001

– For comparing seqA with 1000 seqs: score 88 -> E = 1

Important for db searching:

– n – size of query, m – size of db

\[ E = K m n e^{-\lambda S} \]
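The dependence on the search-space size can be checked numerically. The sketch assumes the standard form of the equation with a scale parameter (written `lam` here) in the exponent; the values of `K`, `lam`, and the sequence lengths are hypothetical, so only the ratio between the two E-values is meaningful.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


# Hypothetical parameter values, for illustration only.
K, lam = 0.1, 0.3
single = e_value(88, m=300, n=300, K=K, lam=lam)            # one sequence of length 300
database = e_value(88, m=300 * 1000, n=300, K=K, lam=lam)   # 1000 such sequences
print(database / single)  # close to 1000: E scales linearly with the database size
```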


Deriving K, λ

The above equation is a theoretical result for gapless (g=inf) alignment

\[ E = K m n e^{-\lambda S} \]

– K, λ can be derived from the substitution table

For the gapped case

– it seems that the equation holds

– we can derive K, λ from “experiments”

[Figure: histogram of best local alignment scores; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]


Bit score S'

Score S depends on the substitution table

What if we want a table-independent score?

\[ E = K m n e^{-\lambda S} = m n 2^{-S'}, \quad \text{where } S' = \frac{\lambda S - \ln K}{\ln 2} \]
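That the bit score carries the same information as the raw score can be verified numerically: both routes give the same E-value. The sketch assumes the standard conversion with a scale parameter (`lam`) multiplying the raw score; the parameter values are hypothetical.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_raw(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def e_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S'); no table-specific parameters needed."""
    return m * n * 2.0 ** (-bits)


# Hypothetical parameter values, for illustration only.
K, lam, m, n = 0.1, 0.3, 1_000_000, 300
bits = bit_score(88, K, lam)
print(e_from_raw(88, m, n, K, lam), e_from_bits(bits, m, n))  # the two values agree
```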


Why does BLAST work? Relevant riddle

Are there at least 2 people in Amsterdam with the same number of hairs?

– At most 500 000 hairs on each head

– 700 000 people living in Amsterdam


Why does BLAST work? Pigeons

Pigeonhole principle: with 9 boxes and 10 pigeons, there is at least one box with more than 1 pigeon

– n=9, k=7 case:

http://en.wikipedia.org/wiki/Pigeonhole_principle


Why does BLAST work? Average case

The pigeonhole principle describes the worst case!

On average we can expect two pigeons in the same box much earlier

Birthday paradox: among 23 people, the probability that at least two of them share a birthday is > 0.5

– note: 365 boxes and only 23 pigeons!
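The birthday figure is easy to reproduce: compute the probability that all birthdays are distinct and take the complement. The sketch assumes 365 equally likely birthdays; the function name is hypothetical.

```python
def p_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday (uniform birthdays)."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


print(round(p_shared_birthday(23), 3))  # 0.507
```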


Birthday paradox


Why does BLA[S]T work?

Forget the T-similar words, now use only identities

2 sequences, 100 nucleotides each:

– What's the minimal sequence identity for which there is guaranteed to be a string of 3 consecutive identities?

– 0 identities, 100 mismatches: no such string

– 67 identities, 33 mismatches: a string of 3 can still be avoided (two identities, one mismatch, repeated)

– the 68th identity? It has to complete a string of 3

– but if the sequences are 50% identical, we'll detect it with probability 99%

– 28% identity -> we'll detect it with probability 50%

– how is it calculated?
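The worst-case number can be checked directly: to avoid k consecutive identities, every block of k positions needs at least one mismatch, so at most n - floor(n/k) identities fit. The sketch below states that formula and confirms it by brute force on short sequences; the function names are hypothetical.

```python
from itertools import product


def max_identities_without_run(n, k):
    """Most identities that fit in n positions with no k consecutive identities."""
    return n - n // k  # one mismatch is forced in every block of k positions


def brute_force(n, k):
    """Same quantity by trying every identity/mismatch pattern (small n only)."""
    best = 0
    for pattern in product("=x", repeat=n):
        text = "".join(pattern)
        if "=" * k not in text:
            best = max(best, text.count("="))
    return best


print(max_identities_without_run(100, 3))  # 67, so the 68th identity forces a run of 3
print(all(brute_force(n, 3) == max_identities_without_run(n, 3) for n in range(1, 11)))  # True
```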


Expected sensitivity

We assume that letters are independent

I – identity between sequences; for human-mouse:

– 86% for DNA,

– 89% for proteins

\[ p_{\mathrm{word\,id}} = I^{k} \]


Expected sensitivity

Q – query size

Number of non-overlapping words:

\[ R = \frac{Q}{k} \]

Probability of a “hit”:

\[ p_{\mathrm{detect}} = 1 - \left(1 - p_{\mathrm{word\,id}}\right)^{R} \]
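These formulas reproduce the 99% and 50% figures quoted for two 100-nucleotide sequences and k=3. A small sketch, assuming R is rounded down to a whole number of words and that a word is identical with probability I to the power k; the function name is hypothetical.

```python
def p_detect(identity, k, query_size):
    """Probability that at least one of the R non-overlapping k-words is fully identical."""
    p_word = identity ** k   # all k letters identical, assuming independence
    r = query_size // k      # assumption: R rounded down to whole words
    return 1.0 - (1.0 - p_word) ** r


print(round(p_detect(0.50, 3, 100), 2))  # 0.99
print(round(p_detect(0.28, 3, 100), 2))  # 0.52
```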


Expected specificity

How many matches by chance (C)?

G – genome size

\[ C = (Q - k + 1) \cdot \frac{G}{k} \cdot \left(\frac{1}{4}\right)^{k} \]

For human-mouse, to get 99% sensitivity we have to set k=7, and for Q=1000

C ~= 25,000,000

– 7h assuming 1/1000 s per alignment
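The order of magnitude can be reproduced from the formula for C. The sketch reads the garbled formula as (Q - k + 1) * (G / k) * (1/4)^k and takes G as 3000 Mb, the human genome size listed earlier in the deck; both are assumptions of this sketch, as is the 1 ms cost per alignment.

```python
def chance_matches(query_size, k, genome_size, alphabet=4):
    """Expected number of k-word matches arising by chance."""
    return (query_size - k + 1) * (genome_size / k) * (1 / alphabet) ** k


c = chance_matches(query_size=1000, k=7, genome_size=3_000_000_000)
print(f"{c:,.0f} chance matches")
print(f"{c / 1000 / 3600:.1f} hours at 1 ms per alignment")
```

With these inputs the result lands in the same range as the slide's figures of roughly 25 million matches and about 7 hours.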
