Alignments<br />
BLAST, BLAT
Genome vs. gene<br />
Property | Genome | Gene<br />
Built of... | DNA | DNA<br />
Describes | Organism | Protein<br />
Stored as... | Single molecule, or a few of them | Part of genome<br />
Circular/linear | Both (depending on the species) | Linear<br />
“Life cycle” | DNA-DNA-DNA-... | DNA-RNA-protein<br />
Amount per cell | 1 | 500 .. 50000<br />
Information content | 2% ... 100% | ~30%<br />
Size | 0.5Mb *) ... 3500Mb | 100b ... 100000b
The amount of genetic<br />
information in organisms<br />
Name | Genome size (Mb) | # genes<br />
Mycoplasma genitalium | 0.5 | 470<br />
Escherichia coli | 4.5 | 4400<br />
Saccharomyces cerevisiae | 12 | 5500<br />
Drosophila melanogaster | 120 | 18000<br />
Caenorhabditis elegans | 97 | 22000<br />
Homo sapiens | 3000 | 23457<br />
Zea mays | 2500 | 50000
The amount of genetic<br />
information in organisms<br />
http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html<br />
Largest genome: amoeba Chaos chaos<br />
(200x human genome) http://www.lawrence.edu/dept/biology/animal/
Sequence searching -<br />
challenges<br />
Exponential growth of<br />
databases
Sequence searching –<br />
definition<br />
Task:<br />
– Query: short, new sequence (~1000<br />
letters)<br />
– Database (searching space): very<br />
many sequences<br />
– Goal: find seqs homologous to the<br />
query
Sequence searching –<br />
definition<br />
We want:<br />
– fast tool<br />
– primarily a filter: most sequences will<br />
be unrelated to the query<br />
– fine-tune the alignment later
Database Search Algorithms:<br />
Sensitivity, Selectivity<br />
• True Positive (TP) – a homology detected (positive)<br />
correctly (true)<br />
Signal | Detected | Name<br />
Yes | Yes | True Positive<br />
No | No | True Negative<br />
Yes | No | False Negative<br />
No | Yes | False Positive
Database Search Algorithms:<br />
Sensitivity, Selectivity<br />
• Sensitivity =TP/(TP+FN)<br />
• Selectivity =TN/(TN+FP)<br />
[Figure: sensitivity plotted against selectivity. Courtesy of Gary Benson (ISSCB 2003)]
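Both ratios follow directly from the four counts in the table of outcomes. A minimal Python sketch; the function names and the example counts are made up for illustration:

```python
def sensitivity(tp, fn):
    """Fraction of true homologs that the search reports: TP / (TP + FN)."""
    return tp / (tp + fn)


def selectivity(tn, fp):
    """Fraction of unrelated sequences that the search rejects: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical search result: 100 true homologs and 10 000 unrelated
# sequences in the database (numbers invented for this example).
tp, fn = 90, 10
tn, fp = 9900, 100
print("sensitivity:", sensitivity(tp, fn))
print("selectivity:", selectivity(tn, fp))
```

A tool can trade one ratio for the other: reporting everything gives sensitivity 1 and selectivity 0.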
What is BLAST<br />
Basic Local Alignment Search Tool<br />
Bad news: it is only a heuristic<br />
– Heuristic: A rule of thumb that often helps in solving<br />
a certain class of problems, but makes no guarantees.<br />
Perkins, DN (1981) The Mind's Best Work<br />
Basic idea:<br />
– High-scoring segments contain a well<br />
conserved (almost identical) part<br />
– Once a well conserved part is identified,<br />
extend it to the real alignment
What does “well conserved” mean<br />
for BLAST?<br />
BLAST works with k-words (words of length<br />
k)<br />
– k is a parameter<br />
– different for DNA (>10) and proteins<br />
(2..4)<br />
A word w1 is T-similar to w2 if the sum of pair<br />
scores is at least T (e.g. T=12)<br />
Similar 3-words<br />
w1: R K P<br />
w2: R R P<br />
Score: 9 –1 7, ∑ = 15
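The T-similarity test is just a sum of pair scores compared with a threshold. A small Python sketch, assuming a score table that contains only the three pair scores shown in the example above (a real search would use a full substitution matrix); the function names are illustrative:

```python
# Pair scores copied from the example above; any other pair is unknown here.
PAIR_SCORES = {("R", "R"): 9, ("K", "R"): -1, ("P", "P"): 7}


def pair_score(a, b):
    """Symmetric lookup: score(a, b) == score(b, a)."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(w1, w2):
    """Sum of pair scores over aligned positions of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def is_t_similar(w1, w2, t):
    """True if the summed pair score reaches the threshold T."""
    return word_score(w1, w2) >= t


print(word_score("RKP", "RRP"))
print(is_t_similar("RKP", "RRP", t=12))
```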
BLAST algorithm<br />
3 basic steps<br />
1)Preprocess the query: extract all<br />
the k-words<br />
2)Scan for T-similar matches in<br />
database<br />
3)Extend them to alignments<br />
1)Preprocess<br />
2)Scan<br />
3)Extend
BLAST, Step 1:<br />
Preprocess the query<br />
Take the query (e.g. LVNRKPVVP)<br />
Chop it into overlapping k-words (k=3 in this<br />
case)<br />
Query: LVNRKPVVP<br />
Word1: LVN<br />
Word2: VNR<br />
Word3: NRK<br />
…<br />
For each word find all similar words (scoring at least<br />
T)<br />
E.g. for RKP the following 3-words are similar:<br />
QKP KKP RQP REP RRP RKP
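The preprocessing step can be sketched in a few lines of Python. The score table below is a toy table over a five-letter alphabet, with values chosen to resemble a BLOSUM-style matrix, and the threshold T=13 is an assumption (the slide does not state which T produces its list); with these choices the neighbourhood of RKP comes out as the six words listed above.

```python
from itertools import product

ALPHABET = "EKPQR"  # restricted alphabet, to keep the toy table small
IDENTITY = {"E": 5, "K": 5, "P": 7, "Q": 5, "R": 5}
SUBSTITUTION = {  # toy scores for a few similar pairs (assumed values)
    frozenset("RK"): 2,
    frozenset("RQ"): 1,
    frozenset("KQ"): 1,
    frozenset("KE"): 1,
    frozenset("EQ"): 2,
}
DEFAULT_MISMATCH = -1  # assumed score for every pair not listed above


def pair_score(a, b):
    if a == b:
        return IDENTITY[a]
    return SUBSTITUTION.get(frozenset((a, b)), DEFAULT_MISMATCH)


def kwords(seq, k):
    """All overlapping words of length k, in query order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbourhood(word, t):
    """All words over ALPHABET whose summed pair score against `word` is >= t."""
    result = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= t:
            result.append("".join(candidate))
    return result


print(kwords("LVNRKPVVP", 3))
print(neighbourhood("RKP", t=13))
```

Enumerating every candidate word is only workable for short protein words; that is one reason k stays small for proteins.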
Finite state machine<br />
An abstract machine<br />
with a constant amount of memory<br />
(states)<br />
Used in computation and languages<br />
Recognizes regular expressions<br />
– e.g. the wildcard in: cp dmt*.pdf /home/john<br />
– e.g. the pattern AC*T|GGC
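As an illustration, the pattern AC*T|GGC can be recognized by a machine with six states. The Python sketch below hard-codes one possible transition table; the state numbering is arbitrary and chosen only for this example.

```python
# Deterministic finite automaton for AC*T|GGC.
TRANSITIONS = {
    (0, "A"): 1,  # start of the AC*T branch
    (1, "C"): 1,  # C* loops on the same state
    (1, "T"): 2,  # end of AC*T
    (0, "G"): 3,  # start of the GGC branch
    (3, "G"): 4,
    (4, "C"): 5,  # end of GGC
}
ACCEPTING = {2, 5}


def accepts(text):
    """Run the automaton over `text`; the memory used is a single state number."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


for s in ["AT", "ACCCT", "GGC", "AGC"]:
    print(s, accepts(s))
```

The input is read once, left to right, which is why the same idea suits scanning a large database.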
BLAST, Step 2:<br />
Find "exact" matches<br />
with scanning<br />
Use all the T-similar k-words to build the Finite<br />
State Machine<br />
Scan for exact matches<br />
T-similar words: QKP, KKP, RQP, REP, RRP, RKP, ...<br />
Database (scanned in one pass, left to right):<br />
...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...
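A simplified Python sketch of the scan: instead of building a finite state machine it slides a window over the database and looks each window up in a set of words, which finds the same exact matches on this small example. The function name is illustrative.

```python
def scan(database, words):
    """Return (position, word) for every exact occurrence of a word in the database.

    All words must have the same length k. The set lookup stands in for the
    finite state machine; it is a simplification, not how BLAST does it.
    """
    k = len(next(iter(words)))
    hits = []
    for i in range(len(database) - k + 1):
        window = database[i:i + k]
        if window in words:
            hits.append((i, window))
    return hits


words = {"QKP", "KKP", "RQP", "REP", "RRP", "RKP"}
database = "VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA"
print(scan(database, words))
```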
BLAST, Step 3:<br />
Extending "exact" matches<br />
Having the list of exact matches, we extend the<br />
alignment in both directions<br />
Query: L V N R K P V V P<br />
T-similar: R R P<br />
Subject: G V C R R P L K C<br />
Score: -3 4 -3 5 2 7 1 -2 -3<br />
…until the sum of scores drops below some<br />
level X (e.g. X=-100) relative to the best score so far<br />
– what about gaps?
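The ungapped extension can be sketched in Python using the per-column scores from the example above. The function and parameter names are illustrative, and `x_drop` is given here as a positive amount by which the running sum may fall below the best sum seen so far (the slide writes the same limit as a negative level X).

```python
def extend(scores, seed_start, seed_end, x_drop):
    """Extend the seed [seed_start, seed_end) in both directions along one diagonal.

    scores: per-column scores of the ungapped alignment.
    Returns (start, end, total_score) of the best extension found.
    """
    def one_side(columns):
        best = total = 0
        best_len = 0
        for n, s in enumerate(columns, start=1):
            total += s
            if total > best:
                best, best_len = total, n
            elif best - total > x_drop:  # dropped too far below the best: stop
                break
        return best, best_len

    seed_score = sum(scores[seed_start:seed_end])
    right_gain, right_len = one_side(scores[seed_end:])
    left_gain, left_len = one_side(reversed(scores[:seed_start]))
    return (seed_start - left_len,
            seed_end + right_len,
            seed_score + left_gain + right_gain)


column_scores = [-3, 4, -3, 5, 2, 7, 1, -2, -3]
# The hit RKP/RRP covers columns 3..5.
print(extend(column_scores, seed_start=3, seed_end=6, x_drop=5))
```

On this example the extension grows two columns to the left and one to the right, where the running sum peaks.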
Gapped BLAST<br />
(now standard)<br />
gapped local alignments are computed:<br />
much, much, much slower<br />
therefore: modified “Hit criteria”<br />
Hit criteria<br />
Extend the alignment only if there are<br />
two close hits on the same diagonal<br />
– sensitivity would drop unless T is lowered<br />
– reduces extensions (90% of the time is spent<br />
on extensions)<br />
Gapped local alignments are computed<br />
– increased sensitivity allows us to raise T<br />
– raising T speeds up the search<br />
[Figure: hits plotted by query position (query pos) against database position (dbpos); two close hits lie on the same diagonal]
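The two-hit rule can be sketched in Python as a grouping of hits by diagonal. The names, the requirement that the two hits do not overlap, and the window size are assumptions made for this illustration.

```python
from collections import defaultdict


def two_hit_diagonals(hits, k, window):
    """Find pairs of hits on the same diagonal that are close to each other.

    hits: list of (qpos, dbpos) word hits, each of length k.
    Returns sorted (diagonal, qpos_first, qpos_second) triples whose distance
    along the diagonal is at least k (no overlap) and at most `window`.
    """
    by_diagonal = defaultdict(list)
    for qpos, dbpos in hits:
        by_diagonal[qpos - dbpos].append(qpos)

    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if k <= second - first <= window:
                triggers.append((diagonal, first, second))
    return sorted(triggers)


hits = [(3, 21), (10, 28), (5, 40), (30, 48), (7, 2)]
print(two_hit_diagonals(hits, k=3, window=10))
```

Only the triggered diagonals would go on to the expensive extension step; isolated hits are dropped.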
Gapped BLAST v BLAST<br />
We end up with<br />
– same speed<br />
– gapped alignments!<br />
– much higher sensitivity
BLAST flavours<br />
blastp: protein query, protein db<br />
blastn: DNA query, DNA db<br />
blastx: DNA query, protein db<br />
– in all reading frames. Used to find potential<br />
translation products of an unknown nucleotide<br />
sequence.<br />
tblastn: protein query, DNA db<br />
– database dynamically translated in all reading<br />
frames.<br />
tblastx: DNA query, DNA db<br />
– all translations of query against all translations of<br />
db
PSI-BLAST<br />
Position-Specific Iterated BLAST<br />
A profile is derived from the result<br />
of the first search<br />
Database is searched against the<br />
profile (instead of a sequence)<br />
Up to 3 iterations
Profile<br />
A profile is a generalized form of a<br />
sequence:<br />
probabilities instead of a letter at each position<br />
Letter | pos 1 | pos 2 | pos 3 | ... | last pos<br />
A | 0.5 | 0.3 | 0.2 | ... | 0<br />
C | 0 | 0.1 | 0.0 | ... | 0.5<br />
D | 0 | 0 | 0.1 | ... | 0.2<br />
... | ... | ... | ... | ... | ...<br />
W | 0 | 0.3 | 0.4 | ... | 0.1<br />
Y | 0.5 | 0.3 | 0.3 | ... | 0.2<br />
Score of the profile<br />
$\mathrm{score}(p, i, A) = \sum_{B} p[i, B] \cdot \mathrm{score}_{\mathrm{blosum62}}(A, B)$<br />
where p is the profile, i the position and A the letter
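The profile score is a probability-weighted average of substitution scores. A Python sketch, assuming a profile stored as one dictionary per position and a tiny illustrative score table (three pairs only, with assumed values) instead of the full BLOSUM62 matrix:

```python
# Toy substitution scores for the letters used below (assumed values).
TOY_SCORES = {("A", "A"): 4, ("A", "Y"): -2, ("Y", "Y"): 7}


def subst(a, b):
    """Symmetric lookup in the toy score table."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def profile_score(profile, i, letter):
    """score(p, i, A) = sum over B of p[i, B] * subst(A, B)."""
    return sum(prob * subst(letter, b) for b, prob in profile[i].items())


# First column of the example profile: A and Y with probability 0.5 each.
profile = [{"A": 0.5, "Y": 0.5}]
print(profile_score(profile, 0, "A"))
print(profile_score(profile, 0, "Y"))
```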
Constructing a profile<br />
Take significant BLAST results<br />
Make an alignment<br />
Assign weights to sequences<br />
Construct the profile<br />
[Figure: the resulting profile matrix, one row per amino acid (A, C, D, ..., W, Y) and one column of probabilities per alignment position]
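The last two steps can be sketched in Python as weighted column frequencies. This is a deliberately bare version: no pseudocounts, gap characters are not treated specially, and the weights default to equal. All of these are simplifying assumptions, and the function name is illustrative.

```python
from collections import defaultdict


def build_profile(aligned, weights=None):
    """Turn equal-length aligned sequences into per-position letter probabilities."""
    if weights is None:
        weights = [1.0] * len(aligned)
    total = sum(weights)
    profile = []
    for i in range(len(aligned[0])):
        column = defaultdict(float)
        for seq, w in zip(aligned, weights):
            column[seq[i]] += w / total  # each sequence contributes its share
        profile.append(dict(column))
    return profile


print(build_profile(["AC", "YC"]))
print(build_profile(["AC", "YC"], weights=[3.0, 1.0]))
```

Sequence weights matter when the alignment contains many near-identical sequences, which would otherwise dominate the frequencies.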
BLAT<br />
The Blast-Like Alignment Tool<br />
Large-scale genome comparison:<br />
– query can be large<br />
Preprocessing phase:<br />
– BLAST: query<br />
– BLAT: db
BLAT, Step 1:<br />
Preprocess the<br />
database<br />
Index the database with k-words<br />
– k=8..16 for nucleotides<br />
– k=3..5 for proteins<br />
For each k-word store in which<br />
sequences it appears<br />
k-word: RKP<br />
Hashed DB:<br />
QKP: HUgn0151194, Gene14, IG0, ...<br />
KKP: haemoglobin, Gene134, IG_30, ...<br />
RQP: HSPHOSR1, GeneA22...<br />
RKP: galactosyltransferase, IG_1...<br />
REP: haemoglobin, Gene134, IG_30, ...<br />
RRP: Z17368, Creatine kinase, ...<br />
...
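A k-word index of this kind maps each word to the places where it occurs. The Python sketch below indexes every overlapping k-word of a small made-up database; the sequence names and contents are hypothetical, and storing positions alongside names is an addition for illustration.

```python
from collections import defaultdict


def build_index(db, k):
    """Map each k-word to a list of (sequence name, position) occurrences."""
    index = defaultdict(list)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


# Hypothetical two-sequence database.
db = {"seq1": "VLQKPLKKP", "seq2": "CCRKPLV"}
index = build_index(db, k=3)
print(index["RKP"])
print(index["KPL"])
```

Once built, the index answers "where does this k-word occur?" with a single dictionary lookup, whatever the database size.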
Hashing – associative<br />
arrays<br />
Indexing with the object itself<br />
Hash function:<br />
hash: possible objects (large set) → indices (small set, fits in memory)<br />
– Objects should be “well spread”
Hashing - examples<br />
T9 Predictive Text in mobile phones<br />
– “hello” in Multitap:<br />
4, 4, 3, 3, 5, 5, 5,<br />
(pause) 5, 5, 5, 6, 6, 6<br />
– “hello” in T9:<br />
4, 3, 5, 5, 6<br />
– Collisions: 4, 6:<br />
“in”, “go”<br />
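T9 behaves like a hash function from words to key sequences. A short Python sketch using the standard phone keypad layout; the function name is illustrative.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
# Invert the keypad: letter -> key.
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def t9(word):
    """Hash a word to the key sequence typed in T9 mode (one key press per letter)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


print(t9("hello"))
print(t9("in"), t9("go"))  # two different words, same key sequence: a collision
```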
BLAT, Step 1:<br />
Index to find exact matches<br />
with hashing<br />
The database is preprocessed only<br />
once! (independent from the<br />
query)<br />
BLAT, Step 2:<br />
Hit criteria<br />
In constant time we can get the<br />
sequences containing a given k-word<br />
Relaxing the hit definition improves<br />
sensitivity<br />
– allow imperfect hits<br />
• costly: the hash grows several-fold!<br />
➔ shorten k (would lead to more false positives), but<br />
require two hits (see BLAST)
BLAT, Step 3:<br />
Identifying homologous<br />
regions<br />
Exclude common k-words<br />
For all k-words from the query<br />
– find their positions in the db<br />
For results (qpos, dbpos):<br />
– split into buckets (64 kbp)<br />
– sort on the diagonal (diag = qpos − dbpos)
BLAT, Step 3:<br />
Identifying homologous<br />
regions<br />
continued...<br />
– from diagonally close hits (gap<br />
limit) create “pre-clusters”<br />
– sort each “pre-cluster” on dbpos<br />
– create clusters from close hits<br />
– run Local Alignment for each<br />
cluster
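The bucketing, sorting and clustering steps can be sketched in Python as follows. The gap limits and the function name are placeholders chosen for this illustration (the slides do not give values), and the final local alignment of each cluster is left out.

```python
def cluster_hits(hits, bucket_size=65536, diag_gap=2, db_gap=100):
    """Group (qpos, dbpos) hits into clusters of nearby hits.

    diag_gap and db_gap are assumed limits, not values from the slides.
    """
    buckets = {}
    for qpos, dbpos in hits:
        buckets.setdefault(dbpos // bucket_size, []).append((qpos, dbpos))

    clusters = []
    for _, bucket in sorted(buckets.items()):
        bucket.sort(key=lambda h: h[0] - h[1])  # sort on the diagonal
        pre_clusters, current = [], [bucket[0]]
        for prev, hit in zip(bucket, bucket[1:]):
            if (hit[0] - hit[1]) - (prev[0] - prev[1]) <= diag_gap:
                current.append(hit)  # diagonally close: same pre-cluster
            else:
                pre_clusters.append(current)
                current = [hit]
        pre_clusters.append(current)

        for pre in pre_clusters:
            pre.sort(key=lambda h: h[1])  # sort each pre-cluster on dbpos
            current = [pre[0]]
            for prev, hit in zip(pre, pre[1:]):
                if hit[1] - prev[1] <= db_gap:
                    current.append(hit)  # close in the database: same cluster
                else:
                    clusters.append(current)
                    current = [hit]
            clusters.append(current)
    return clusters


hits = [(10, 110), (20, 120), (31, 130), (10, 5000), (50, 900)]
for cluster in cluster_hits(hits):
    print(cluster)
```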
Seeds – improving sensitivity<br />
A seed is a more general form of a<br />
k-word<br />
The seed<br />
CT.GT.AT.<br />
gives “hits” with both sequences<br />
...CTCGTTATA...<br />
...CTAGTAATG...
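A seed with "don't care" positions can be checked with a few lines of Python. In this sketch a dot in the seed matches any letter; the function name is illustrative.

```python
def matches_seed(seed, window):
    """True if `window` agrees with `seed` at every position that is not '.'."""
    if len(seed) != len(window):
        raise ValueError("seed and window must have the same length")
    return all(s == "." or s == c for s, c in zip(seed, window))


seed = "CT.GT.AT."
a, b = "CTCGTTATA", "CTAGTAATG"
print(matches_seed(seed, a), matches_seed(seed, b))
print(a == b)  # a contiguous 9-word would require the windows to be identical
```

The two windows differ at three positions, so no contiguous 9-word hits both, yet the seed does.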
How to detect homology?<br />
Take the score of a maximal local<br />
alignment<br />
Can it be obtained by chance?<br />
– any score can be obtained from<br />
comparing (long enough) random<br />
sequences
What is a “chance”?<br />
Extracting local alignments from<br />
random sequences<br />
P-value (e.g. =0.01)<br />
– The probability of obtaining the result<br />
by pure chance<br />
– An alignment giving lower P-value<br />
than set by user is considered a hit.
Best Local Alignments<br />
by chance<br />
Create random seqs, each 1000 aa long<br />
Find the max local alignment<br />
Repeat<br />
[Figure: histogram of the best local alignment scores; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]
The Statistics of local<br />
alignment<br />
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html<br />
Subst matrix must guarantee<br />
• E(score(a,b)) < 0 for random a, b<br />
Analytical solution<br />
– sum of i.i.d. variables -> normal<br />
distribution<br />
– max of i.i.d. -> extreme value<br />
distribution (EVD)
Expected number of aligns<br />
E-value: the expected number of<br />
alignments scoring >= S<br />
$E = K m n e^{-\lambda S}$<br />
2x size of seq -> 2x number of alignments<br />
2x S -> E drops exponentially
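The two scaling rules can be checked numerically. In the Python sketch below the values of K and λ are placeholders picked for illustration; real values depend on the substitution table and gap costs.

```python
import math


def e_value(score, m, n, k_param, lam):
    """Expected number of chance alignments scoring >= score: K * m * n * exp(-lambda * S)."""
    return k_param * m * n * math.exp(-lam * score)


K, LAM = 0.13, 0.32  # assumed example parameters
base = e_value(50, m=1000, n=300, k_param=K, lam=LAM)
print(base)
print(e_value(50, m=2000, n=300, k_param=K, lam=LAM) / base)   # doubling m doubles E
print(e_value(100, m=1000, n=300, k_param=K, lam=LAM) / base)  # doubling S shrinks E sharply
```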
E-value depends on n, m<br />
Example<br />
– For comparing seqA with seqB: S=88<br />
-> E = 0.001<br />
– For comparing seqA with 1000 seqs:<br />
score 88 -> E=1<br />
Important for db searching:<br />
– n – size of query, m - size of db<br />
$E = K m n e^{-\lambda S}$
Deriving K, λ<br />
The above equation is a theoretical result<br />
for gapless (g=inf) alignment<br />
– K, λ can be derived from the subst<br />
table<br />
For the gapped case<br />
$E = K m n e^{-\lambda S}$<br />
– it seems that the equation still holds<br />
– we can derive K, λ from “experiments”<br />
[Figure: histogram of alignment scores; x-axis: Score (25 to 75), y-axis: # Alignments (0 to 7000)]
Bit score S'<br />
Score S depends on the substitution<br />
table<br />
What if we want a table-independent<br />
score?<br />
$E = K m n e^{-\lambda S}$<br />
$E = m n 2^{-S'}$ where $S' = \frac{\lambda S - \ln K}{\ln 2}$
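A quick numerical check that the bit score gives the same E-value as the raw score. The values of K and λ in this Python sketch are placeholders, not parameters of any particular substitution table.

```python
import math


def bit_score(raw_score, k_param, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k_param)) / math.log(2)


def e_from_raw(raw_score, m, n, k_param, lam):
    return k_param * m * n * math.exp(-lam * raw_score)


def e_from_bits(bits, m, n):
    return m * n * 2.0 ** (-bits)


K, LAM = 0.13, 0.32  # assumed example parameters
s_bits = bit_score(88, K, LAM)
print(s_bits)
print(e_from_raw(88, 1000, 300, K, LAM), e_from_bits(s_bits, 1000, 300))
```

The table-dependent parameters K and λ are absorbed into S', so bit scores from searches with different tables can be compared directly.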
Why does BLAST work?<br />
Relevant riddle<br />
Are there at least 2 people in<br />
Amsterdam with the same<br />
number of hairs?<br />
– At most 500 000 hairs on<br />
each head<br />
– 700 000 people living in<br />
Amsterdam
Why does BLAST work?<br />
Pigeons<br />
Pigeonhole principle: with 9 boxes and 10<br />
pigeons, at least one box holds more than 1 pigeon<br />
– n=9, k=7 case:<br />
http://en.wikipedia.org/wiki/Pigeonhole_principle
Why does BLAST work?<br />
Average case<br />
The pigeonhole principle describes the worst<br />
case!<br />
On average we expect two pigeons in<br />
the same box much earlier<br />
Birthday paradox: among 23 people, the<br />
probability that two of them share a<br />
birthday is > 0.5<br />
– note: 365 boxes and only 23 pigeons!
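The birthday figure is easy to verify. A Python sketch, assuming 365 equally likely birthdays and independent people:

```python
def p_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days  # the i-th person avoids i taken days
    return 1.0 - p_all_distinct


for n in (10, 22, 23, 50):
    print(n, round(p_shared_birthday(n), 3))
```

The threshold of 0.5 is crossed between 22 and 23 people, far below the 366 people needed for a guarantee.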
Birthday paradox
Why does BLA[S]T work?<br />
Forget the T-similar words, now use only identities<br />
2 sequences, 100 nucleotides each:<br />
– What's the minimal sequence identity that guarantees<br />
a string of 3 consecutive identities?<br />
– 0 identities, 100 mismatches: no such string<br />
– 67 identities, 33 mismatches: a string of 3 can still be avoided<br />
– the 68th identity forces one<br />
But if the seqs are 50% identical, we'll detect it with prob. 99%<br />
28% id -> we'll detect it with prob. 50%<br />
– how is it calculated?
Expected sensitivity<br />
We assume that letters are<br />
independent<br />
I – identity between seqs, for<br />
human-mouse<br />
– 86% for DNA,<br />
– 89% for proteins<br />
$p_{\text{word id}} = I^{k}$
Expected sensitivity<br />
Q – query size<br />
number of non-overlapping words<br />
$R = \frac{Q}{k}$<br />
prob. of a “hit”<br />
$p_{\text{detect}} = 1 - (1 - p_{\text{word id}})^{R}$
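The formula can be evaluated directly. In this Python sketch R is taken as the integer part of Q/k, which is an assumption; with k=3 and Q=100 it gives values close to the 99% and 50% detection figures quoted earlier in the deck for 50% and 28% identity.

```python
def p_detect(identity, k, query_size):
    """Probability that at least one of the non-overlapping k-words matches exactly."""
    p_word = identity ** k      # all k letters of a word are identical
    r = query_size // k         # number of non-overlapping words (integer part assumed)
    return 1.0 - (1.0 - p_word) ** r


print(round(p_detect(0.50, k=3, query_size=100), 3))
print(round(p_detect(0.28, k=3, query_size=100), 3))
```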
Expected specificity<br />
How many matches by chance (C)?<br />
G – genome size<br />
$C = (Q - k + 1) \cdot \frac{G}{k} \cdot \left(\frac{1}{4}\right)^{k}$<br />
For human-mouse, to get 99% sensitivity we<br />
have to set k=7, and for Q=1000<br />
C ~= 25,000,000<br />
– about 7 h, assuming 1/1000 s per alignment
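The estimate can be reproduced in Python. The genome size G is not stated on this slide; the sketch assumes 3000 Mb, the human genome size given in the table earlier in the deck, which lands close to the quoted figure.

```python
def chance_matches(query_size, k, genome_size, alphabet=4):
    """C = (Q - k + 1) * (G / k) * (1 / alphabet)^k."""
    return (query_size - k + 1) * (genome_size / k) * (1 / alphabet) ** k


G = 3_000_000_000  # assumed: 3000 Mb
c = chance_matches(query_size=1000, k=7, genome_size=G)
hours = c * 0.001 / 3600  # 1/1000 s per alignment
print(round(c))
print(round(hours, 1))
```

Raising k by one cuts the number of chance matches roughly fourfold, which is the trade-off against sensitivity.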