Algorithms on Sequences
Bioinformatics
Algorithms on Sequences
Potsdam, June 2003
Jacques Nicolas
Matching with errors: comparison of sequences
Motivations
• Sequence comparison is a basic operation in molecular biology.
• It occurs, for instance, in:
– Search in data banks
– Genome assembly
– Construction of phylogenetic trees
– …
• Knowing the principles of sequence comparison is necessary to interpret correctly the results given by software.
Sequence comparison and alignments
To compare sequences is to look for alignments:
MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF
: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .:
MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKF
[Figure labels: search in a bank; local alignment (BLAST); assembly of genomes; phylogeny; global alignment]
Today's lecture
• Dynamic programming applied to the comparison of sequences
• Alignments
• Substitution matrices
• Validation of comparison results
• Search heuristics
• Introduction to existing software
Alignment of two sequences
• Matching of 2 subsequences
example:
TACGT-AAT
TATGTGAAT
• 2 elementary operations
– substitution
– insertion / deletion: gap
Score of an alignment
• Necessary to «compare» 2 alignments
• Score evaluated as a function of the costs of elementary operations
– substitution cost
• DNA: match: +M, mismatch: -N
• Protein: substitution matrix
– insertion/deletion (indel) cost
Score computation
• The score is the sum of the costs of the elementary operations of an alignment.
• Example: M = +5, N = -4, G = -3
GATACGT-AATGCATA
 CTATGTGAATTT
Score of the aligned part TACGT-AAT / TATGTGAAT:
5+5-4+5+5-3+5+5+5=28
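The sum above can be checked with a few lines of Python. This is only a minimal sketch: the function name `alignment_score` and the convention that `-` marks a gap are assumptions made for the illustration, and the default costs are the M, N and G of the example.

```python
def alignment_score(top, bottom, match=5, mismatch=-4, gap=-3):
    """Sum the costs of the columns of a gapped alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # insertion / deletion
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # substitution
    return score


print(alignment_score("TACGT-AAT", "TATGTGAAT"))  # 28
```

Seven matches, one mismatch and one gap give 7 x 5 - 4 - 3 = 28, as on the slide.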
Dynamic Programming: reminder
– Optimality principle
– Split a problem into simpler subproblems
[Figure: subproblems A, B, C and D, each producing optimal partial results that are combined into the optimal final solution]
DP applied to alignments
• Given 2 sequences:
– X (length = N) and Y (length = M)
• The score $S_{N,M}$ of alignment $A_{N,M}$:
X_1 . . . X_N
Y_1 . . . Y_M
is calculated from 3 sub-alignments: $A_{N-1,M-1}$, $A_{N,M-1}$ and $A_{N-1,M}$
– $A_{N-1,M-1}$ extended by a substitution:
X_1 . . . X_{N-1}  X_N
Y_1 . . . Y_{M-1}  Y_M
– $A_{N,M-1}$ extended by a gap:
X_1 . . . X_N      -
Y_1 . . . Y_{M-1}  Y_M
– $A_{N-1,M}$ extended by a gap:
X_1 . . . X_{N-1}  X_N
Y_1 . . . Y_M      -
Needleman & Wunsch's algorithm (1)
• global alignment between 2 sequences<br />
acgtgcataaagccaggataccg<br />
acggtcattagaccccgagataccg<br />
ac.gtgcataa.agccag.gataccg<br />
|| || ||| | | || | |||||||<br />
acggt.cattagaccccgagataccg
Needleman & Wunsch's algorithm (2)
X = a c t
Y = a g t
The optimal alignment of (a c t, a g t) is the MAX of:
– optimal alignment of (a c, a g) + substitution (t,t)
– optimal alignment of (a c t, a g) + gap (-,t)
– optimal alignment of (a c, a g t) + gap (t,-)
Needleman & Wunsch's algorithm (3)
Subproblems (prefix of X / prefix of Y):
act/agt   act/ag   act/a
ac/agt    ac/ag    ac/a
a/agt     a/ag     a/a
Needleman & Wunsch's algorithm (4)
• The score $S_{N,M}$ is calculated from the recurrence equation:
$$S_{I,J} = \max \begin{cases} S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$
Needleman & Wunsch's algorithm (5)
[Figure: dynamic programming matrix with one sequence (T A G A C T) along the rows and the other (T A G C T A) along the columns; cell (i,j) holds $S_{I,J}$]
$$S_{I,J} = \max \begin{cases} S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$
Complexity: filling a matrix of size N x M
Result: $S_{N,M}$
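Filling the matrix can be sketched as follows. The function name `nw_score` is hypothetical, a simple match/mismatch cost stands in for Sub, and the first row and column are initialized with multiples of -Gap (an assumption that is consistent with a global alignment). The two nested loops make the N x M complexity visible.

```python
def nw_score(x, y, match=2, mismatch=-2, gap=2):
    """Return the Needleman-Wunsch matrix S, where S[i][j] is the best
    score for aligning the prefixes x[:i] and y[:j]."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix with the empty sequence costs one gap per character.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] - gap
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # substitution
                          S[i][j - 1] - gap,       # gap in x
                          S[i - 1][j] - gap)       # gap in y
    return S


S = nw_score("act", "agt")
print(S[-1][-1])  # 2
```

The result is read in the bottom-right cell; with these costs, aligning act with agt column by column (match, mismatch, match) is optimal.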
Needleman & Wunsch's algorithm (6)
Example: match M = 2, mismatch N = -2, Gap = 2
$$S_{I,J} = \max \begin{cases} S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$
The first row and the first column are initialized with multiples of -Gap; the other cells are filled with the recurrence.
         T    A    A    C    T
     0   -2   -4   -6   -8  -10
T   -2
A   -4
G   -6
A   -8
C  -10
T  -12
result = ?
Needleman & Wunsch's algorithm (7)
         T    A    A    C    T
     0   -2   -4   -6   -8  -10
T   -2    2    0   -2   -4   -6
A   -4    0    4    2    0   -2
G   -6   -2    2    2    0   -2
A   -8   -4    0    4    2    0
C  -10   -6   -2    2    6    4
T  -12   -8   -4    0    4    8
Determination of the optimal alignment
Tracing back from the bottom-right cell (score 8), the scores along the optimal path are 2, 4, 2, 4, 6, 8:
T  A  -  A  C  T
T  A  G  A  C  T
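The determination of the optimal alignment can be sketched as a traceback over the filled matrix. The function name `needleman_wunsch` and the tie-breaking order (diagonal first, then gap in the second sequence) are assumptions of this sketch; other orders may return a different alignment with the same score.

```python
def needleman_wunsch(x, y, match=2, mismatch=-2, gap=2):
    """Global alignment: return (score, aligned x, aligned y)."""
    n, m = len(x), len(y)

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    # Fill the matrix with the recurrence.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap * i
    for j in range(1, m + 1):
        S[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i][j - 1] - gap,
                          S[i - 1][j] - gap)

    # Trace back from S[n][m] to S[0][0], re-deriving which case gave the max.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = needleman_wunsch("TAGACT", "TAACT")
print(score)  # 8
print(a)      # TAGACT
print(b)      # TA-ACT
```

With the costs of the example (match 2, mismatch -2, Gap 2) this reproduces the alignment of the slide, with the gap placed opposite G.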
Smith & Waterman algorithm (1)<br />
• local alignments between 2 sequences<br />
ccggttcttcttacaacgtgcataaagccagcacaagaaagt<br />
cgtagtacggtcattagaccccgcagtgagccttacgt<br />
ac.gtgcataa.agccag<br />
|| || ||| | | || |<br />
acggt.cattagaccccg<br />
Smith & Waterman algorithm (2)
• The score $S_{I,J}$ is calculated by dynamic programming from the recurrence equation:
$$S_{I,J} = \max \begin{cases} 0 \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$
• The score of the best local alignment is the maximum of $S_{I,J}$ over the whole matrix.
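Compared with the global case, only two things change: the extra 0 in the max, and the fact that the result is the largest cell of the matrix rather than the last one. A minimal sketch, assuming a simple match/mismatch cost (the function name and the default costs are illustrative, not taken from the slides):

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=2):
    """Local alignment score: return (best score, (i, j) of the cell where
    the best local alignment ends, using 1-based prefix lengths)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay at 0
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                        # restart a new local alignment
                          S[i - 1][j - 1] + sub,
                          S[i][j - 1] - gap,
                          S[i - 1][j] - gap)
            if S[i][j] > best:
                best, end = S[i][j], (i, j)
    return best, end


print(smith_waterman("ACGT", "TTACGTAA"))  # (8, (4, 6))
```

The alignment itself would be recovered by tracing back from the returned cell until a 0 is reached.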
Smith & Waterman algorithm (3)<br />
GCC-UCG<br />
GCCAUUG<br />
C A G C C U C G C U U A G<br />
A 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0<br />
A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7<br />
U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7<br />
G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0<br />
C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3<br />
C 1.0 0.7 0.0 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 0.0 0.0<br />
A 0.0 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 0.0<br />
U 0.0 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0<br />
U 0.0 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0<br />
G 0.0 0.0 1.3 0.0 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7<br />
A 0.0 1.0 0.0 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0<br />
C 1.0 0.0 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0<br />
G 0.0 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0<br />
G 0.0 0.0 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0<br />
Gotoh's algorithm (1)
• More realistic modelling of «gaps»
• Cost of a gap of N characters:
– S & W: $C_N = N \times G$
– Gotoh: $C_N = G_b + (N-1) \times G_x$
$G_b$ = breaking (gap opening) cost
$G_x$ = extension cost
• What is the recurrence equation?
Gotoh's algorithm (2)
$$E_{I,J} = \max \begin{cases} S_{I,J-1} - G_b \\ E_{I,J-1} - G_x \end{cases}$$
$$F_{I,J} = \max \begin{cases} S_{I-1,J} - G_b \\ F_{I-1,J} - G_x \end{cases}$$
$$S_{I,J} = \max \begin{cases} 0 \\ E_{I,J} \\ F_{I,J} \\ S_{I-1,J-1} + \mathrm{sub}(X_I, Y_J) \end{cases}$$
Cost of a substitution
• DNA sequences
One considers 2 costs:
• cost of a match: positive value
• cost of a mismatch: negative value
example: BLAST: default values = +5 and -4
• Proteins
Use a substitution matrix
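The three recurrences E, F and S of Gotoh's algorithm can be combined with the DNA match/mismatch costs above in a short sketch. The function name `gotoh_local_score` and the gap costs `g_open` = 10 and `g_ext` = 1 are assumptions chosen for the illustration; only +5 / -4 come from the slide.

```python
NEG_INF = float("-inf")


def gotoh_local_score(x, y, match=5, mismatch=-4, g_open=10, g_ext=1):
    """Best local alignment score with an affine gap cost
    C_N = g_open + (N - 1) * g_ext."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in x
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in y
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap (g_open) or extend the current one (g_ext).
            E[i][j] = max(S[i][j - 1] - g_open, E[i][j - 1] - g_ext)
            F[i][j] = max(S[i - 1][j] - g_open, F[i - 1][j] - g_ext)
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0, E[i][j], F[i][j], S[i - 1][j - 1] + sub)
            best = max(best, S[i][j])
    return best


# 8 matches (40) minus one gap of 2 characters (10 + 1) = 29
print(gotoh_local_score("ACGTACGT", "ACGTTTACGT"))  # 29
```

Keeping E and F separate from S is what lets the algorithm know whether a gap is being opened or extended while still filling only N x M cells per matrix.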
Substitution Matrices
• Substitution matrices are 2D arrays used to assign a similarity value to each pair of characters (bases or amino acids): the substitution cost.
• Matrices are generally symmetrical: the value of the substitution of X with Y is considered equal to the value of the substitution of Y with X.
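As an illustration, a DNA substitution matrix can be stored as a dictionary indexed by pairs of characters. This is a sketch only: the helper names `dna_matrix` and `is_symmetric` are hypothetical, and the default values +5 / -4 are the BLAST defaults quoted earlier.

```python
def dna_matrix(match=5, mismatch=-4, alphabet="ATCG"):
    """Build a substitution matrix as a dict {(a, b): value}."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def is_symmetric(matrix):
    """Check that substituting a with b costs the same as b with a."""
    return all(matrix[(a, b)] == matrix[(b, a)] for (a, b) in matrix)


blast_like = dna_matrix()       # +5 on the diagonal, -4 elsewhere
identity = dna_matrix(1, 0)     # 1 on the diagonal, 0 elsewhere

print(blast_like[("A", "A")], blast_like[("A", "G")])  # 5 -4
print(is_symmetric(blast_like))                        # True
```

A protein matrix such as PAM or BLOSUM fits the same structure with a 20-letter alphabet and values read from a file instead of two constants.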
Several types of matrices
• Identity matrices
• Matrix based on the genetic code
• Substitution matrices
– PAM (mutations, Dayhoff)
– BLOSUM (Henikoff)
• Physico-chemical matrices
– Hydrophobicity
– 3D structure
Identity and default Blast Matrices
Identity matrix
   A  T  C  G
A  1  0  0  0
T  0  1  0  0
C  0  0  1  0
G  0  0  0  1
Reference matrix in BLAST
   A  T  C  G
A  5 -4 -4 -4
T -4  5 -4 -4
C -4 -4  5 -4
G -4 -4 -4  5
Matrix based on the genetic code
(number of modifications to pass from an amino acid to another)
  A R N D C Q E G H I L K M F P S T W Y V
A 3
R 1 3
N 1 1 3
D 2 1 2 3
C 1 2 1 1 3
Q 1 2 1 1 0 3
E 2 1 1 2 0 2 3
G 2 2 1 2 2 1 2 3
H 1 2 2 2 1 2 1 1 3
I 1 2 2 1 1 1 1 1 1 3
L 1 2 1 1 1 2 1 1 2 2 3
K 1 2 2 1 0 2 2 1 1 2 1 3
M 1 2 1 0 0 1 1 1 0 2 2 2 3
F 1 1 1 1 2 0 0 1 1 2 2 0 1 3
P 2 2 1 1 1 2 1 1 2 1 2 1 1 1 3
S 2 2 2 1 2 1 1 2 1 2 2 1 1 2 2 3
T 2 2 2 1 1 1 1 1 1 2 1 2 2 1 2 2 3
W 1 2 0 0 2 1 1 2 0 0 2 1 1 1 1 2 1 3
Y 1 1 2 2 2 1 1 1 2 1 1 1 0 2 1 2 1 1 3
V 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 1 1 3
Dayhoff's Matrices (PAM)
• M.O. Dayhoff, 1978, Atlas of Protein Sequence and Structure. Initial matrix deduced from the alignment of 71 families of highly homologous proteins (>1000 sequences).
• Matrix of replacement frequencies of each amino acid X with one of the 19 others (relative frequencies, for homogeneity reasons) = PAM (Percent Accepted Mutation).
• Dayhoff's model of evolution assumes that proteins have diverged through the accumulation of random mutations. PAM N is the N-th power of the matrix, thus simulating an evolution rate. Standard value: N = 250.
[Figure: PAM matrix] Each value reflects the probability of substitution of a given amino acid with another one. The greater the value, the more likely the replacement. 0 is equivalent to a random mutation.
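The idea that PAM N is the N-th power of the initial matrix can be illustrated on a toy example. Everything in this sketch is hypothetical: the 2-letter alphabet and the 1% mutation probability are invented for the illustration and are not Dayhoff's data; only the matrix power is the point.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply the matrix by itself `power` times (power >= 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical mutation matrix for a 2-letter alphabet: each letter is
# kept with probability 0.99 and replaced with probability 0.01 per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print([round(v, 3) for v in row])
```

After 250 steps the rows are close to (0.5, 0.5): the toy sequences have almost forgotten their origin, which is why matrices with a large N suit distant comparisons. Each row still sums to 1, since the power of a probability matrix is a probability matrix.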
Other substitution matrices
• Physico-chemical properties may be used to build score matrices.
• Chemical properties
• Hydrophobicity
• Secondary structures (membrane domains)
• Tertiary structures
• ...
Taylor's Venn Diagram
[Figure: Venn diagram of the usual properties of amino acids]
Validation of an alignment
• A computer program always finds a solution…
• How much confidence can be placed in the resulting alignments?
– Validation from:
• the score?
• the percentage of identities?
• …
• Problem: how to decide that an alignment has something to do with biological reality?
Example 1 : easy case<br />
ALIGN calculates a global alignment of two sequences<br />
version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17
AF248645_1 150 aa vs.<br />
AF156936_1 149 aa<br />
scoring matrix: BLOSUM50, gap penalties: -12/-2<br />
44.1% identity; Global alignment score: 428<br />
10 20 30 40 50 60<br />
/tmp/t MPIVDTGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGLT<br />
:::.: : . :: ..: :: .: .:.:.: .:. .:..:. ..:.::. ::::..<br />
AF1569 MPITDQGPLPTLSEGDKKAIRESWPQIYQNFEQTGLVVLLEFLQKNPGAQQSFPKFSA--<br />
10 20 30 40 50<br />
70 80 90 100 110 120<br />
/tmp/t TADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVL<br />
: .:... .:.:.: ::::::: .. :: :.. :..::.::.. :::::. :: :<br />
AF1569 TKCNLEQDNEVKWQASRIINAVNHTIGLMDKEAAMKQYLKELSAKHSSEFQVDPKLFKEL<br />
60 70 80 90 100 110<br />
130 140 150<br />
/tmp/t AAVIADTVAAGDAGFEKLMSMICILLRSAY--<br />
.:....:. : :..:::.:.:: ::::.:<br />
AF1569 SAIFVSTIR-GKAAYEKLFSIICTLLRSSYDE<br />
120 130 140<br />
Example 2 : opposite but easy…<br />
ALIGN calculates a global alignment of two sequences<br />
version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17
U13831_1 134 aa vs.<br />
AF248645_1 150 aa<br />
scoring matrix: BLOSUM50, gap penalties: -12/-2<br />
15.8% identity; Global alignment score: -47<br />
10 20 30 40 50<br />
/tmp/t MTRDQNGTWEMESNENFEGYMKALDIDFATPKIAVRLTQTKVI---DQDGDNFKTK---T<br />
: ..:. : :. : .: . . : . .: .. .: .<br />
AF2486 MPIVDTGSVAPLS---------------AAEKTKIRSAWAPVYSNYETSGVDILVKFFTS<br />
10 20 30 40<br />
60 70 80 90 100 110<br />
/tmp/t TSTFRNYDVDFTVGVEFDEYTKSLDNR-HVKALVTWEGDVLVCVQKGEKENRGWKQW---<br />
: . ... : . :. :: : : :.. ... .:..: .. :: . ..<br />
AF2486 TPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGK<br />
50 60 70 80 90 100<br />
120 130<br />
/tmp/t ----IEGDKLYLEL---------TCGD--------QVCRQVFKKK<br />
.. : :... . :: ..: . .<br />
AF2486 HAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY<br />
110 120 130 140 150
Example 3 ???<br />
ALIGN calculates a global alignment of two sequences<br />
version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17
U76030_1 166 aa vs.<br />
AF248645_1 150 aa<br />
scoring matrix: BLOSUM50, gap penalties: -12/-2<br />
20.6% identity; Global alignment score: 94<br />
10 20 30 40 50<br />
/tmp/t MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF--<br />
: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .:<br />
AF2486 MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKG<br />
10 20 30 40 50<br />
60 70 80 90 100 110<br />
/tmp/t LRNSDVPLEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLK-YGVGD<br />
: ..: :.:. .. :: .. . .:.... . :.... :. :.. : : . :<br />
AF2486 LTTAD-QLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMK---LRDLSGKHAKSFQVDP<br />
60 70 80 90 100 110<br />
120 130 140 150 160<br />
/tmp/t AHFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE<br />
.:.:. .. ::. .: . ....:.. : .. :<br />
AF2486 QYFKVLAAVIADTV--------------AAGDAGFEKLMSMICILLRSAY<br />
120 130 140 150<br />
Validation of a global alignment
• Alignment between S1 and S2 (with gaps): score S
• Question: is S significant?
Comparison with a score between random sequences
Validation of a global alignment (2)
S1, S2: score S
S'1, S'2: score S'
Random sequences produced by
• Permutations
• Markovian models
comparison?
Standardized score: Z-score
• Sq1 compared with Sq2 gives the score S.
• Sq1 compared with K random versions Sq2_1, Sq2_2, ..., Sq2_K of Sq2 gives the scores S_1, S_2, …, S_K.
$$Z = \frac{S - m}{\sigma}$$
where m is the mean and σ the standard deviation of S_1, …, S_K.
Practically: the score is significant for Z > 6.
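The Z-score procedure can be sketched as follows, using permutations to produce the random sequences. The function names, the parameter `score_fn` (any function returning an alignment score for two sequences) and the choice of the population standard deviation are assumptions of this sketch.

```python
import random
import statistics


def z_score(score, random_scores):
    """Z = (S - m) / sigma, with m and sigma taken from the random scores."""
    m = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)  # population standard deviation
    return (score - m) / sigma


def shuffled(seq, rng):
    """Random permutation of a sequence (same length, same composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def shuffle_z_score(sq1, sq2, score_fn, k=100, seed=0):
    """Compare sq1 with sq2, then with k shuffled versions of sq2."""
    rng = random.Random(seed)
    s = score_fn(sq1, sq2)
    random_scores = [score_fn(sq1, shuffled(sq2, rng)) for _ in range(k)]
    return z_score(s, random_scores)


# Toy score for the illustration: number of identical positions.
def identities(a, b):
    return sum(1 for p, q in zip(a, b) if p == q)


print(shuffle_z_score("ACGTACGTACGTACGT", "ACGTACGTACGTACGT", identities))
```

In practice `score_fn` would be a global or local alignment score, and the call fails with a division by zero if all random scores are identical (σ = 0).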
PRSS compares a test sequence to a shuffled sequence<br />
version 2.0u64, May, 1998
s-w est<br />
22 1 0:=<br />
24 0 1:*<br />
26 12 5:==*===<br />
28 23 16:=======*====<br />
30 38 34:================*==<br />
32 68 53:==========================*=======<br />
34 60 64:============================== *<br />
36 68 66:================================*=<br />
38 51 60:========================== *<br />
40 58 51:=========================*===<br />
42 32 40:================ *<br />
44 28 31:============== *<br />
46 12 23:====== *<br />
48 17 17:========*<br />
50 8 12:==== *<br />
52 5 8:===*<br />
54 7 6:==*=<br />
56 3 4:=*<br />
58 3 3:=*<br />
60 1 2:*<br />
62 2 1:*<br />
64 0 1:*<br />
66 1 1:*<br />
68 0 0:<br />
70 0 0:<br />
72 1 0:=<br />
74 0 0:<br />
76 0 0:<br />
78 0 0:<br />
80 1 0:=<br />
82 0 0:<br />
PRSS compares a test sequence to a shuffled sequence<br />
version 2.0u64, May, 1998
s-w est<br />
< 20 0 0:<br />
22 0 0:<br />
24 3 1:*=<br />
26 11 7:===*==<br />
28 36 23:===========*======<br />
30 57 46:======================*======<br />
32 79 65:================================*=======<br />
34 72 73:====================================*<br />
36 59 69:============================== *<br />
38 49 58:========================= *<br />
40 33 46:================= *<br />
42 28 34:============== *<br />
44 22 24:===========*<br />
46 17 17:========*<br />
48 8 12:==== * O<br />
50 6 8:===*<br />
52 10 6:==*==<br />
54 5 4:=*=<br />
56 4 3:=*<br />
58 1 2:*<br />
60 0 1:*<br />
62 0 1:*<br />
64 0 1:*<br />
66 0 0:<br />
> 68 0 0:<br />
PRSS example 1 (first histogram)
74500 residues in 500 sequences,
BLOSUM50 matrix, gap penalties: -12,-2
unshuffled s-w score: 442; shuffled score range: 24 - 82
For 500 sequences, a score >=442 is expected 5.26e-30 times
PRSS example 2 (second histogram)
67000 residues in 500 sequences,
BLOSUM50 matrix, gap penalties: -12,-2
unshuffled s-w score: 49; shuffled score range: 26 - 60
For 500 sequences, a score >=49 is expected 31 times
PRSS compares a test sequence to a shuffled sequence<br />
version 2.0u64, May, 1998
s-w est<br />
28 5 2:*==<br />
30 11 10:====*=<br />
32 45 25:============*==========<br />
34 59 44:=====================*========<br />
36 62 59:=============================*=<br />
38 57 65:============================= *<br />
40 65 62:==============================*==<br />
42 49 55:========================= *<br />
44 31 45:================ *<br />
46 23 35:============ *<br />
48 21 27:=========== *<br />
50 21 20:=========*=<br />
52 7 14:==== *<br />
54 20 10:====*=====<br />
56 6 7:===*<br />
58 5 5:==*<br />
60 5 4:=*=<br />
62 2 3:=*<br />
64 1 2:*<br />
66 2 1:*<br />
68 1 1:*<br />
70 2 1:*<br />
83000 residues in 500 sequences,<br />
BLOSUM50 matrix, gap penalties: -12,-2<br />
unshuffled s-w score: 114; shuffled score range: 29 - 72<br />
For 500 sequences, a score >=114 is expected 0.000902 times<br />
PRSS<br />
example 3<br />
Validation of a local alignment
• Karlin, S. & Altschul, S.F. (1990) «Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes» Proc. Natl. Acad. Sci. USA 87:2264-2268.
• Altschul et al. (1994) «Issues in searching molecular sequence databases» Nature Genet. 6:119-129.
• BLAST: comparison sequence / bank
• To evaluate the significance of an optimal score without gap, one uses a model of random sequences.
– m: size of the sequence, n: size of the bank
– The expected number of HSPs with a score ≥ s is
$$E = K m n e^{-\lambda s}$$
where K and λ are constants depending on the substitution matrix, …
HSP (High Scoring Pair):
RIINAVNDAVVMD
::::::: .. ::
RIINAVNHTIGMD
Validation of a local alignment (2)
• E indicates the degree of surprise one gets with the observed score: the higher E is, the less significant the score.
• A reasonable value of E is between 0.1 and 0.001. Biologists generally use $10^{-4}$.
• By default, Blast searches up to E = 10.
• One generally gives score results standardized with respect to the parameters K and λ; the values S' are called «bit scores»:
$$S' = \frac{\lambda s - \ln K}{\ln 2}$$
Probability: P-value
• The random number of HSPs with a score ≥ s follows a Poisson law, i.e. the probability of finding exactly k HSPs with a score ≥ s is
$$e^{-E} \frac{E^k}{k!}$$
where E is the expected number of HSPs previously defined.
• The p-value P associated with score s is the probability of finding at least one HSP:
$$P = 1 - e^{-E}$$
For instance, if the expected number of HSPs with a score ≥ s is 3, the probability of finding at least 1 HSP is 0.95.
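The three quantities E, S' and P are simple to compute once K and λ are known. The sketch below uses hypothetical function names, and the values of K and λ in the usage lines are placeholders, not the constants of any real substitution matrix.

```python
import math


def expected_hsps(s, m, n, K, lam):
    """E = K * m * n * exp(-lambda * s): expected number of HSPs with score >= s."""
    return K * m * n * math.exp(-lam * s)


def bit_score(s, K, lam):
    """S' = (lambda * s - ln K) / ln 2."""
    return (lam * s - math.log(K)) / math.log(2)


def poisson(k, E):
    """Probability of finding exactly k HSPs when E are expected."""
    return math.exp(-E) * E ** k / math.factorial(k)


def p_value(E):
    """Probability of finding at least one HSP: 1 - P(k = 0)."""
    return 1.0 - math.exp(-E)


print(round(p_value(3), 2))  # 0.95

# Placeholder constants, for illustration only.
K, lam = 0.1, 0.3
E = expected_hsps(50, m=100, n=1000, K=K, lam=lam)
print(math.isclose(E, 100 * 1000 * 2 ** -bit_score(50, K, lam)))  # True
```

The last line shows why bit scores are convenient: once the score is standardized, E = m n 2^(-S') no longer depends on K and λ.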
BLAST
BLASTN 1.4.7 [16-Oct-94] [Build 17:42:06 Mar 10 1995]
Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol.
215:403-10.
Query= gb|X17217|ADAAVAR Avian adenovirus (CELO) DNA encoding VA
(virus-associated) RNA and six open reading frames.
(4898 letters)
Database: smallgenbank.fasta
100 sequences; 205,192 total letters.
Searching.................................................done
                                                                       Smallest
                                                                         Sum
                                                                 High  Probability
Sequences producing High-scoring Segment Pairs:                 Score  P(N)      N
gb AAUNKDNA 3576 Z17216 Avian adenovirus DNA (CEL06). ...         302  6.4e-17   1
gb AAVSPHERE 8457 M77182 Amsacta entomopoxvirus sphero...         112  0.0052    4
gb AAVSPHER 4657 M75889 Amsacta moorei entomopoxvirus ...          94  0.34      3
gb AAFVMAF 3171 M26769 Avian musculoaponeurotic fibros...          92  0.96      2
gb ACU10885 2773 U10885 AcMNPV HR3 p6.9 gene, partial ...         101  0.97      1
gb ACSJUN 1074 M16266 Avian sarcoma virus 17 proviral ...          98  0.998     1
Software based on dynamic programming
• SSEARCH (Smith-Waterman)
• BESTFIT (Smith-Waterman)
• GAP (Needleman-Wunsch)
• ALIGN (Gotoh)
• ...
Not often used for searching banks: computation time too high.
It is necessary to find some «tricks» to accelerate the search for alignments.
Why is DP so expensive?
• DNA bank: 10^6 sequences (sequence size = 1000)
• Dynamic programming: one matrix of 10^3 x 10^3 cells per sequence of the bank
• Scan: computation of 10^12 matrix cells at 50 ns per cell, i.e. 5 x 10^4 s: > 12 hours
Search Heuristics
• A heuristic is a mechanism that exploits an assumption on the data to accelerate a computation
• Assumption:
– an alignment includes at least one common subword of size W
– example: W=5
ATGGCG.GTAGGCATAGGACTCA.TAC
||| || | ||||| ||| ||||  ||
ATGTCGAGAAGGCACAGGTCTCAGGAC
Principle of the heuristics (1)
1 – The query sequence is split into subwords of size W, which are stored in a dictionary
query sequence (W=3):
ATGGACTGGC
12345678
dictionary (subword, start positions in the query):
ATG 1
TGG 2,7
GGA 3
GAC 4
ACT 5
CTG 6
GGC 8
Principle of the heuristics (2)
2 – For each sequence of the bank, the subwords of size W that belong to the dictionary (hits) are detected
1 sequence of the bank:
AATCGGATTGCATAA
The subword GGA (position 5 in the bank sequence) belongs to the dictionary (position 3 in the query): it is a hit.
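Steps 1 and 2 can be sketched with an ordinary Python dictionary. The function names `build_dictionary` and `find_hits` are hypothetical, and positions are 1-based to match the slides.

```python
def build_dictionary(query, w):
    """Step 1: map each subword of size w to its start positions in the query."""
    dictionary = {}
    for i in range(len(query) - w + 1):
        dictionary.setdefault(query[i:i + w], []).append(i + 1)
    return dictionary


def find_hits(bank_sequence, dictionary, w):
    """Step 2: list (subword, query position, bank position) for every
    subword of the bank sequence that belongs to the dictionary."""
    hits = []
    for j in range(len(bank_sequence) - w + 1):
        word = bank_sequence[j:j + w]
        for i in dictionary.get(word, []):
            hits.append((word, i, j + 1))
    return hits


dictionary = build_dictionary("ATGGACTGGC", 3)
print(dictionary["TGG"])                              # [2, 7]
print(find_hits("AATCGGATTGCATAA", dictionary, 3))    # [('GGA', 3, 5)]
```

Each lookup is a constant-time dictionary access, so scanning a bank sequence is linear in its length; only the few hits found are then passed on to the extension step.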
Principle of the heuristics (3)
3 – Each hit is «extended» to produce an alignment
Sequence of the bank: AATCGGATTGCATAA
Query sequence: ATGGACTGGC
alignment:
ATCGGATTG
|| ||| ||
AT.GGACTG
Speed gain of the heuristics
• Example: 2 sequences of DNA of size 1000
• Dynamic programming:
– 1000 x 1000 = 10^6 matrix cells
• Heuristics with W = 8
– statistically:
nb of hits = 10^6 x (1 / 4^8) = 15 (1 for W=10)
– 1000 searches in the dictionary (10 x 10^3 operations)
– 15 extensions (15 x 10^3 operations)
• Gain: 10^6 / (2.5 x 10^4) = 40 (87 for W=10)
Tradeoff sensitivity / rapidity
[Figure: as W goes from 1 to 12, sensitivity decreases and rapidity increases]
• The higher W, the faster the computation (and the lower the sensitivity)
• W is a parameter of the software (Blast, Fasta)
Software for searching data banks
• Blast
– 2 implementations:
• NCBI BLAST, National Center for Biotechnology Information
• WU-BLAST, W. Gish, Washington University
• Fasta
– W. Pearson, University of Virginia
• …
BLAST
• Many sites
– check default parameters
• Main parameters:
– W: size of subword (DNA: default value 11; Protein: default value 3)
– E: expected value
– -G: gap opening (default 11)
– -E: gap extension (default 1)
Various Blasts
NAME      Query            Bank                                Common usage
BLASTN    DNA              DNA                                 Look for identities + pattern splicing
BLASTP    Protein          Proteins                            Search of homologous proteins
TBLASTN   Protein          DNA translated (6 reading frames)   Search of genes to be annotated in banks
BLASTX    DNA translated   Proteins                            Search of homologous genes and proteins
TBLASTX   DNA translated   DNA translated                      Discovery of the structure of genes
Recent Blasts
• Blast2 (Gapped Blast, 1997): version of Blast with indels, Dust and Seg filtering, and threads for multi-processors.
• PSI-Blast (Position Specific Iterated Blast): iterative version of Blast aimed at phylogeny studies, where the comparison matrix used at one step takes into account the consensus found at the previous step from a multiple alignment of the sequences with a sufficient E-value, until the process has converged.
• PHI-Blast (Pattern-Hit Initiated Blast): fast version of Blast, which may be combined with PSI-Blast, where one gives a pattern (Prosite-like regular expression) that must be used in the alignment of sequences.
Dotplots: the simplest way to see alignments
Dotplot = identity matrix of two sequences
Similarities = diagonals
Filtering, smoothing (window 4, 1 error)
[Figure: dotplot of the DNA / cDNA of the actin gene from muscle of Pisaster ochraceus (horizontal / vertical)]
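A dotplot with window filtering can be sketched in a few lines. The function name `dotplot` and its parameters `window` and `max_errors` are hypothetical; a dot is kept at (i, j) when the two windows starting there differ in at most `max_errors` positions, so `window=4, max_errors=1` corresponds to the filtering mentioned above.

```python
def dotplot(x, y, window=1, max_errors=0):
    """Return the set of 0-based (i, j) where x[i:i+window] and
    y[j:j+window] differ in at most max_errors positions."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            errors = sum(1 for k in range(window) if x[i + k] != y[j + k])
            if errors <= max_errors:
                dots.add((i, j))
    return dots


def show(x, y, dots):
    """Print the dotplot as text, x horizontally and y vertically."""
    print("  " + x)
    for j, letter in enumerate(y):
        print(letter + " " + "".join("*" if (i, j) in dots else "." for i in range(len(x))))


show("ATGGACTGGC", "AATCGGATTG", dotplot("ATGGACTGGC", "AATCGGATTG"))
```

With `window=1` and no error allowed this is exactly the identity matrix of the two sequences; increasing the window removes isolated dots and leaves the diagonals that indicate similarities.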
Dotter: a tool for dot plots
• Karolinska Institutet, Unix & Windows versions
ftp://ftp.cgr.ki.se/pub/esr/dotter/
• Complexity: linear for space, quadratic for time (15 min for 30000x30000)
«A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis» E. Sonnhammer, R. Durbin, Gene 167: GC1-10, 1995