02.08.2013 Views

Algorithms on Sequences


Bioinformatics

Algorithms on Sequences

Potsdam, June 2003

Jacques Nicolas

Matching with errors: comparison of sequences

Motivations

• Sequence comparison is a basic operation in molecular biology.

• It occurs, for instance, in:

– Searching data banks

– Genome assembly

– Construction of phylogenetic trees

– …

• Knowing the principles of sequence comparison is necessary to interpret the results produced by software.


Sequence comparison and alignments

To compare sequences is to look for alignments

MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF
: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .:
MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKF

[Figure labels: search in a bank; local alignment (BLAST); assembly of genomes; phylogeny; global alignment]

Today's lecture

• Dynamic programming applied to the comparison of sequences

• Alignments

• Substitution matrices

• Validation of comparison results

• Search heuristics

• Introduction to existing software


Alignment of two sequences

• Matching of 2 subsequences

example:

TACGT-AAT
TATGTGAAT

• 2 elementary operations

– substitution

– insertion / deletion: gap

Score of an alignment

• Necessary to «compare» 2 alignments

• Score evaluated as a function of the costs of the elementary operations

– substitution cost

• DNA: match: +M, mismatch: -N

• Protein: substitution matrix

– insertion/deletion (indel) cost


Score computation

• The score is the sum of the costs of the elementary operations of an alignment.

• Example: M = +5, N = -4, G = -3

GATACGT-AATGCATA
 CTATGTGAATTT

Only the nine aligned columns TACGT-AAT / TATGTGAAT are scored:

5+5-4+5+5-3+5+5+5 = 28
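The score computation above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and the convention of writing a gap as `-` are assumptions made here, while the default costs are the M, N and G of the example.

```python
def alignment_score(row_x, row_y, match=5, mismatch=-4, gap=-3):
    """Sum the column scores of two equal-length alignment rows ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # insertion / deletion
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


# The nine aligned columns of the slide example.
print(alignment_score("TACGT-AAT", "TATGTGAAT"))  # 28
```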

Dynamic programming: reminder

– Optimality principle

– Split a problem into simpler subproblems

[Figure: nodes A, B, C, D; optimal partial results are combined step by step into the optimal final solution]


DP applied to alignments

• Given 2 sequences:

– X (length = N) and Y (length = M)

• The score S_{N,M} of alignment A_{N,M}:

X_1 . . . X_N
Y_1 . . . Y_M

is calculated from 3 sub-alignments: A_{N-1,M-1}, A_{N,M-1}, A_{N-1,M}

– A_{N-1,M-1} followed by a substitution:

X_1 . . . X_{N-1}  X_N
Y_1 . . . Y_{M-1}  Y_M

– A_{N,M-1} followed by a gap:

X_1 . . . X_N      -
Y_1 . . . Y_{M-1}  Y_M

– A_{N-1,M} followed by a gap:

X_1 . . . X_{N-1}  X_N
Y_1 . . . Y_M      -

Needleman & Wunsch's algorithm (1)

• global alignment between 2 sequences

acgtgcataaagccaggataccg
acggtcattagaccccgagataccg

ac.gtgcataa.agccag.gataccg
|| || ||| | | || | |||||||
acggt.cattagaccccgagataccg


Needleman & Wunsch's algorithm (2)

X = a c t
Y = a g t

The optimal alignment of (a c t, a g t) is the MAX of:

• optimal alignment of (a c, a g) + substitution (t, t)

• optimal alignment of (a c t, a g) + gap pair (-, t)

• optimal alignment of (a c, a g t) + gap pair (t, -)

Needleman & Wunsch's algorithm (3)

Sub-problems (prefix of X / prefix of Y):

act / agt    act / ag    act / a
ac / agt     ac / ag     ac / a
a / agt      a / ag      a / a


Needleman & Wunsch's algorithm (4)

• The score S_{N,M} is calculated from the recurrence equation:

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$

Needleman & Wunsch's algorithm (5)

[Figure: dynamic-programming matrix with one sequence (T A G A C T) on the rows and the other (T A G C T A) on the columns; cell (i, j) holds S_{I,J}]

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$

Complexity: filling a matrix of size N x M

Result: S_{N,M}


Needleman & Wunsch's algorithm (6)

match M = 2, mismatch N = -2, Gap = 2

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$

[Figure: matrix for T A G A C T (rows) against T A A C T (columns), partially filled. The first row is initialised with 0 -2 -4 -6 -8 -10 and the first column with 0 -2 -4 -6 -8 -10 -12.]

result = ?

Needleman & Wunsch's algorithm (7)

         T    A    A    C    T
     0  -2   -4   -6   -8  -10
T   -2   2    0   -2   -4   -6
A   -4   0    4    2    0   -2
G   -6  -2    2    2    0   -2
A   -8  -4    0    4    2    0
C  -10  -6   -2    2    6    4
T  -12  -8   -4    0    4    8

Determination of the optimal alignment

X:      T  A  G  A  C  T
Y:      T  A  -  A  C  T
score:  2  4  2  4  6  8
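As a minimal sketch, the Needleman & Wunsch recurrence and the traceback can be written as follows. The function name `needleman_wunsch` and the order in which ties are broken during the traceback are assumptions of this sketch. The default costs are those of the worked example (match 2, mismatch -2, Gap 2).

```python
def needleman_wunsch(x, y, match=2, mismatch=-2, gap=2):
    """Global alignment score and one optimal alignment of x and y."""
    n, m = len(x), len(y)
    # S[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # substitution
                          S[i - 1][j] - gap,       # gap in y
                          S[i][j - 1] - gap)       # gap in x
    # Traceback from the bottom-right cell to the top-left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, row_x, row_y = needleman_wunsch("TAGACT", "TAACT")
print(score)   # 8
print(row_x)   # TAGACT
print(row_y)   # TA-ACT
```

Both time and memory are proportional to N x M here, which matches the complexity stated for filling the matrix.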


Smith & Waterman algorithm (1)

• local alignments between 2 sequences

ccggttcttcttacaacgtgcataaagccagcacaagaaagt
cgtagtacggtcattagaccccgcagtgagccttacgt

ac.gtgcataa.agccag
|| || ||| | | || |
acggt.cattagaccccg

Smith & Waterman algorithm (2)

• The score S_{N,M} is calculated from the recurrence equation (dynamic programming):

$$S_{I,J} = \max \begin{cases} 0 \\ S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$
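A minimal sketch of the local-alignment recurrence: the only changes with respect to the global case are the extra 0 in the maximum and the fact that the best score is looked for anywhere in the matrix. The function name `smith_waterman_score` and the default costs (2, -2, 2) are assumptions chosen for illustration.

```python
def smith_waterman_score(x, y, match=2, mismatch=-2, gap=2):
    """Best local alignment score between x and y (linear gap cost)."""
    n, m = len(x), len(y)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                       # restart the alignment here
                          S[i - 1][j - 1] + sub,   # substitution
                          S[i - 1][j] - gap,       # gap in y
                          S[i][j - 1] - gap)       # gap in x
            best = max(best, S[i][j])
    return best


print(smith_waterman_score("GGGACTGGG", "TTTACTTTT"))  # 6, the shared word ACT
```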


Smith & Waterman algorithm (3)

GCC-UCG
GCCAUUG

     C    A    G    C    C    U    C    G    C    U    U    A    G
A   0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
A   0.0  1.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.7
U   0.0  0.0  0.8  0.3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.7
G   0.0  0.0  1.0  0.3  0.0  0.0  0.7  1.0  0.0  0.0  0.7  0.7  1.0
C   1.0  0.0  0.0  2.0  1.3  0.3  1.0  0.3  2.0  0.7  0.3  0.3  0.3
C   1.0  0.7  0.0  1.0  3.0  1.7  1.3  1.0  1.3  1.7  0.3  0.0  0.0
A   0.0  2.0  0.7  0.3  1.7  2.7  1.3  1.0  0.7  1.0  1.3  1.3  0.0
U   0.0  0.7  1.7  0.3  1.3  2.7  2.3  1.0  0.7  1.7  2.0  1.0  1.0
U   0.0  0.3  0.3  1.3  1.0  2.3  2.3  2.0  0.7  1.7  2.7  1.7  1.0
G   0.0  0.0  1.3  0.0  1.0  1.0  2.0  3.3  2.0  1.7  1.3  2.3  2.7
A   0.0  1.0  0.0  1.0  0.3  0.7  0.7  2.0  3.0  1.7  1.3  2.3  2.0
C   1.0  0.0  0.7  1.0  2.0  0.7  1.7  1.7  3.0  2.7  1.3  1.0  2.0
G   0.0  0.7  1.0  0.3  0.7  1.7  0.3  2.7  1.7  2.7  2.3  1.0  2.0
G   0.0  0.0  1.7  0.7  0.3  0.3  1.3  1.3  2.3  1.3  2.3  2.0  2.0

Gotoh's algorithm (1)

• More realistic modelling of « gaps »

• Cost of a gap of N characters:

– S & W: C_N = N x G

– Gotoh: C_N = G_b + (N-1) x G_x

G_b = breaking (gap opening) cost

G_x = extension cost

• What is the recurrence equation?


Gotoh's algorithm (2)

$$E_{I,J} = \max \begin{cases} S_{I,J-1} - G_b \\ E_{I,J-1} - G_x \end{cases}$$

$$F_{I,J} = \max \begin{cases} S_{I-1,J} - G_b \\ F_{I-1,J} - G_x \end{cases}$$

$$S_{I,J} = \max \begin{cases} 0 \\ E_{I,J} \\ F_{I,J} \\ S_{I-1,J-1} + \mathrm{sub}(X_I, Y_J) \end{cases}$$

Cost of a substitution

• DNA sequences

One considers 2 costs:

• cost of a match: positive value

• cost of a mismatch: negative value

example: BLAST default values = +5 and -4

• Proteins

Use a substitution matrix
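Returning to Gotoh's gap model, the three recurrences (E, F and S) can be sketched as below, using the BLAST DNA costs +5 / -4 for substitutions. The function names and the gap values `g_open=10`, `g_ext=1` are hypothetical, picked only to make the example concrete. The 0 in the maximum makes this the local variant.

```python
NEG_INF = float("-inf")


def gap_cost(n, g_open, g_ext):
    """Cost of a gap of n characters: C_N = G_b + (N - 1) * G_x."""
    return g_open + (n - 1) * g_ext


def gotoh_local_score(x, y, match=5, mismatch=-4, g_open=10, g_ext=1):
    """Best local alignment score with affine gap costs."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # E: alignments ending with a gap in x; F: ending with a gap in y.
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(S[i][j - 1] - g_open, E[i][j - 1] - g_ext)
            F[i][j] = max(S[i - 1][j] - g_open, F[i - 1][j] - g_ext)
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0, E[i][j], F[i][j], S[i - 1][j - 1] + sub)
            best = max(best, S[i][j])
    return best


print(gap_cost(3, 10, 1))  # 12
# Ten matches (50) minus one gap of two characters (10 + 1).
print(gotoh_local_score("AAAAACCCCC", "AAAAAGGCCCCC"))  # 39
```

Opening a gap is expensive and extending it is cheap, so one long gap is preferred to several short ones. This is the "more realistic modelling" that the gap cost formula is meant to capture.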


Substitution matrices

• Substitution matrices are 2D arrays used to assign a similarity value to pairs of characters (bases or amino acids) = substitution cost

• Matrices are generally symmetrical: the value of the substitution of X with Y is considered equal to the value of the substitution of Y with X.
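As an illustration, a DNA substitution matrix can be stored as a dictionary indexed by pairs of bases. The names `dna_matrix` and `is_symmetric` are assumptions of this sketch. The values 1 / 0 and +5 / -4 are the identity and BLAST DNA costs used in this lecture.

```python
BASES = "ATCG"


def dna_matrix(match, mismatch):
    """Build a 4 x 4 substitution matrix as a {(a, b): value} dictionary."""
    return {(a, b): (match if a == b else mismatch)
            for a in BASES for b in BASES}


def is_symmetric(matrix):
    """True if substituting a with b costs the same as b with a."""
    return all(matrix[(a, b)] == matrix[(b, a)] for (a, b) in matrix)


identity = dna_matrix(1, 0)
blast_dna = dna_matrix(5, -4)

print(blast_dna[("A", "A")], blast_dna[("A", "T")])  # 5 -4
print(is_symmetric(blast_dna))                        # True
```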

Several types of matrices

• Identity matrices

• Matrix based on the genetic code

• Substitution matrices

– PAM (mutations, Dayhoff)

– BLOSUM (Henikoff)

• Physico-chemical matrices

– Hydrophobicity

– 3D structure


Identity and default Blast matrices

Identity matrix:

    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1

Reference matrix in BLAST:

    A   T   C   G
A   5  -4  -4  -4
T  -4   5  -4  -4
C  -4  -4   5  -4
G  -4  -4  -4   5

Matrix based on the genetic code
(number of modifications to pass from an amino acid to another)

  A R N D C Q E G H I L K M F P S T W Y V
A 3
R 1 3
N 1 1 3
D 2 1 2 3
C 1 2 1 1 3
Q 1 2 1 1 0 3
E 2 1 1 2 0 2 3
G 2 2 1 2 2 1 2 3
H 1 2 2 2 1 2 1 1 3
I 1 2 2 1 1 1 1 1 1 3
L 1 2 1 1 1 2 1 1 2 2 3
K 1 2 2 1 0 2 2 1 1 2 1 3
M 1 2 1 0 0 1 1 1 0 2 2 2 3
F 1 1 1 1 2 0 0 1 1 2 2 0 1 3
P 2 2 1 1 1 2 1 1 2 1 2 1 1 1 3
S 2 2 2 1 2 1 1 2 1 2 2 1 1 2 2 3
T 2 2 2 1 1 1 1 1 1 2 1 2 2 1 2 2 3
W 1 2 0 0 2 1 1 2 0 0 2 1 1 1 1 2 1 3
Y 1 1 2 2 2 1 1 1 2 1 1 1 0 2 1 2 1 1 3
V 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 1 1 3


Dayhoff's matrices (PAM)

• M.O. Dayhoff, 1978, Atlas of Protein Sequence and Structure. Initial matrix deduced from the alignment of 71 families of highly homologous proteins (>1000 sequences).

• Matrix of the replacement frequency of each amino acid X with one of the 19 others (relative frequency, for homogeneity reasons) = PAM (Percent Accepted Mutation).

• Dayhoff's model of evolution assumes that proteins have diverged due to the accumulation of random mutations. PAM N is the N-th power of the matrix, thus simulating an evolution rate. Standard value: N = 250.

[Figure: PAM matrix] Each value is the probability of substitution of a given amino acid with another one. The greater the value, the more likely the replacement. 0 is equivalent to a random mutation.
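The statement "PAM N is the N-th power of the matrix" can be illustrated on a toy case. The 3-letter alphabet and the probabilities in `TOY_PAM1` below are invented for this sketch and are not Dayhoff's data. The helper names `mat_mul` and `mat_power` are likewise assumptions.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)]
            for i in range(size)]


def mat_power(p, n):
    """n-th power of p, obtained by n successive multiplications."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Toy one-step mutation probabilities: each row sums to 1 (hypothetical values).
TOY_PAM1 = [[0.98, 0.01, 0.01],
            [0.01, 0.98, 0.01],
            [0.01, 0.01, 0.98]]

toy_pam250 = mat_power(TOY_PAM1, 250)
# After 250 steps the rows are still probability distributions,
# but the chance of finding the original letter has dropped sharply.
print([round(v, 3) for v in toy_pam250[0]])
```

In this toy model each power keeps the rows summing to 1 while the diagonal shrinks, which is the sense in which a higher N simulates more accumulated mutations.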


Other substitution matrices

• Physico-chemical properties may be used to build score matrices.

• Chemical properties

• Hydrophobicity

• Secondary structures (membrane domains)

• Tertiary structures

• ...


Taylor's Venn diagram

[Figure: Venn diagram for usual protein properties]

Validation of an alignment

• A computer program always finds a solution…

• How much credit can be given to the resulting alignments?

– Validation from:

• the score?

• the percentage of identities?

• …

• Problem: how to decide that an alignment has something to do with biological reality


Example 1 : easy case<br />

ALIGN calculates a global alignment of two sequences<br />

version 2.0u  Please cite: Myers and Miller, CABIOS (1989) 4:11-17

AF248645_1 150 aa vs.<br />

AF156936_1 149 aa<br />

scoring matrix: BLOSUM50, gap penalties: -12/-2<br />

44.1% identity; Global alignment score: 428<br />

10 20 30 40 50 60<br />

/tmp/t MPIVDTGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGLT<br />

:::.: : . :: ..: :: .: .:.:.: .:. .:..:. ..:.::. ::::..<br />

AF1569 MPITDQGPLPTLSEGDKKAIRESWPQIYQNFEQTGLVVLLEFLQKNPGAQQSFPKFSA--<br />

10 20 30 40 50<br />

70 80 90 100 110 120<br />

/tmp/t TADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVL<br />

: .:... .:.:.: ::::::: .. :: :.. :..::.::.. :::::. :: :<br />

AF1569 TKCNLEQDNEVKWQASRIINAVNHTIGLMDKEAAMKQYLKELSAKHSSEFQVDPKLFKEL<br />

60 70 80 90 100 110<br />

130 140 150<br />

/tmp/t AAVIADTVAAGDAGFEKLMSMICILLRSAY--<br />

.:....:. : :..:::.:.:: ::::.:<br />

AF1569 SAIFVSTIR-GKAAYEKLFSIICTLLRSSYDE<br />

120 130 140<br />

Example 2 : opposite but easy…<br />

ALIGN calculates a global alignment of two sequences<br />

version 2.0u  Please cite: Myers and Miller, CABIOS (1989) 4:11-17

U13831_1 134 aa vs.<br />

AF248645_1 150 aa<br />

scoring matrix: BLOSUM50, gap penalties: -12/-2<br />

15.8% identity; Global alignment score: -47<br />

10 20 30 40 50<br />

/tmp/t MTRDQNGTWEMESNENFEGYMKALDIDFATPKIAVRLTQTKVI---DQDGDNFKTK---T<br />

: ..:. : :. : .: . . : . .: .. .: .<br />

AF2486 MPIVDTGSVAPLS---------------AAEKTKIRSAWAPVYSNYETSGVDILVKFFTS<br />

10 20 30 40<br />

60 70 80 90 100 110<br />

/tmp/t TSTFRNYDVDFTVGVEFDEYTKSLDNR-HVKALVTWEGDVLVCVQKGEKENRGWKQW---<br />

: . ... : . :. :: : : :.. ... .:..: .. :: . ..<br />

AF2486 TPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGK<br />

50 60 70 80 90 100<br />

120 130<br />

/tmp/t ----IEGDKLYLEL---------TCGD--------QVCRQVFKKK<br />

.. : :... . :: ..: . .<br />

AF2486 HAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY<br />

110 120 130 140 150


Example 3 ???<br />

ALIGN calculates a global alignment of two sequences<br />

version 2.0u  Please cite: Myers and Miller, CABIOS (1989) 4:11-17

U76030_1 166 aa vs.<br />

AF248645_1 150 aa<br />

scoring matrix: BLOSUM50, gap penalties: -12/-2<br />

20.6% identity; Global alignment score: 94<br />

10 20 30 40 50<br />

/tmp/t MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF--<br />

: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .:<br />

AF2486 MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKG<br />

10 20 30 40 50<br />

60 70 80 90 100 110<br />

/tmp/t LRNSDVPLEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLK-YGVGD<br />

: ..: :.:. .. :: .. . .:.... . :.... :. :.. : : . :<br />

AF2486 LTTAD-QLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMK---LRDLSGKHAKSFQVDP<br />

60 70 80 90 100 110<br />

120 130 140 150 160<br />

/tmp/t AHFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE<br />

.:.:. .. ::. .: . ....:.. : .. :<br />

AF2486 QYFKVLAAVIADTV--------------AAGDAGFEKLMSMICILLRSAY<br />

120 130 140 150<br />

Validation of a global alignment

• alignment between S1 and S2 (with gaps): score S

• Question: is S significant?

comparison with a score between random sequences

Validation of a global alignment (2)

S1, S2: score S

S'1, S'2: score S'

comparison?

Random sequences produced by

• Permutations

• Markovian models

Standardized score: Z-score

Sq1 is compared with Sq2 (score S) and with K random versions Sq2_1, Sq2_2, ..., Sq2_K of Sq2 (scores S_1, S_2, …, S_K).

$$Z = \frac{S - m}{\sigma}$$

where m is the mean and σ the standard deviation of the scores S_1, …, S_K.

In practice: the score is significant for Z > 6
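The Z-score procedure can be sketched as follows. The function names are assumptions, random sequences are produced by permutation (one of the two options listed), and the population standard deviation is used, which the slide does not specify. The shuffled scores themselves would come from an alignment program and are passed in here as plain numbers.

```python
import random
import statistics


def shuffled_copies(seq, k, seed=0):
    """Return k random permutations of seq (same composition, random order)."""
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        letters = list(seq)
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


def z_score(score, shuffled_scores):
    """Z = (S - m) / sigma, with m and sigma taken from the shuffled scores."""
    m = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)  # population standard deviation
    return (score - m) / sigma


# Made-up scores: mean 5, standard deviation 2.
print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))  # 2.5
```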


PRSS compares a test sequence to a shuffled sequence<br />

version 2.0u64, May, 1998

s-w est<br />

22 1 0:=<br />

24 0 1:*<br />

26 12 5:==*===<br />

28 23 16:=======*====<br />

30 38 34:================*==<br />

32 68 53:==========================*=======<br />

34 60 64:============================== *<br />

36 68 66:================================*=<br />

38 51 60:========================== *<br />

40 58 51:=========================*===<br />

42 32 40:================ *<br />

44 28 31:============== *<br />

46 12 23:====== *<br />

48 17 17:========*<br />

50 8 12:==== *<br />

52 5 8:===*<br />

54 7 6:==*=<br />

56 3 4:=*<br />

58 3 3:=*<br />

60 1 2:*<br />

62 2 1:*<br />

64 0 1:*<br />

66 1 1:*<br />

68 0 0:<br />

70 0 0:<br />

72 1 0:=<br />

74 0 0:<br />

76 0 0:<br />

78 0 0:<br />

80 1 0:=<br />

82 0 0:<br />

PRSS compares a test sequence to a shuffled sequence<br />

version 2.0u64, May, 1998

s-w est<br />

< 20 0 0:<br />

22 0 0:<br />

24 3 1:*=<br />

26 11 7:===*==<br />

28 36 23:===========*======<br />

30 57 46:======================*======<br />

32 79 65:================================*=======<br />

34 72 73:====================================*<br />

36 59 69:============================== *<br />

38 49 58:========================= *<br />

40 33 46:================= *<br />

42 28 34:============== *<br />

44 22 24:===========*<br />

46 17 17:========*<br />

48 8 12:==== * O<br />

50 6 8:===*<br />

52 10 6:==*==<br />

54 5 4:=*=<br />

56 4 3:=*<br />

58 1 2:*<br />

60 0 1:*<br />

62 0 1:*<br />

64 0 1:*<br />

66 0 0:<br />

> 68 0 0:<br />

PRSS example 1 (first histogram, scores 22 to 82):

74500 residues in 500 sequences,
BLOSUM50 matrix, gap penalties: -12,-2
unshuffled s-w score: 442; shuffled score range: 24 - 82
For 500 sequences, a score >=442 is expected 5.26e-30 times

PRSS example 2 (second histogram, scores < 20 to > 68):

67000 residues in 500 sequences,
BLOSUM50 matrix, gap penalties: -12,-2
unshuffled s-w score: 49; shuffled score range: 26 - 60
For 500 sequences, a score >=49 is expected 31 times


PRSS compares a test sequence to a shuffled sequence<br />

version 2.0u64, May, 1998

s-w est<br />

28 5 2:*==<br />

30 11 10:====*=<br />

32 45 25:============*==========<br />

34 59 44:=====================*========<br />

36 62 59:=============================*=<br />

38 57 65:============================= *<br />

40 65 62:==============================*==<br />

42 49 55:========================= *<br />

44 31 45:================ *<br />

46 23 35:============ *<br />

48 21 27:=========== *<br />

50 21 20:=========*=<br />

52 7 14:==== *<br />

54 20 10:====*=====<br />

56 6 7:===*<br />

58 5 5:==*<br />

60 5 4:=*=<br />

62 2 3:=*<br />

64 1 2:*<br />

66 2 1:*<br />

68 1 1:*<br />

70 2 1:*<br />

83000 residues in 500 sequences,<br />

BLOSUM50 matrix, gap penalties: -12,-2<br />

unshuffled s-w score: 114; shuffled score range: 29 - 72<br />

For 500 sequences, a score >=114 is expected 0.000902 times<br />

PRSS<br />

example 3<br />

Validation of a local alignment

• Karlin, S. & Altschul, S.F. (1990) « Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes » Proc. Natl. Acad. Sci. USA 87:2264-2268.
Altschul et al. (1994) « Issues in searching molecular sequence databases » Nature Genet. 6:119-129.

• BLAST: comparison sequence / bank

• To evaluate the significance of an optimal score without gap, one uses a model of random sequences.

– m: size of the sequence, n: size of the bank

– The expected number of HSPs with a score ≥ s is

$$E = K m n \, e^{-\lambda s}$$

where K and λ are constants depending on the substitution matrix …

HSP = High Scoring Pair, for example:

RIINAVNDAVVMD
::::::: .. ::
RIINAVNHTIGMD
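The expected number of HSPs is a direct transcription of the formula. The function name and the numeric values of K and λ used in the example are hypothetical, since the real constants depend on the scoring scheme.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * s): expected number of HSPs scoring >= s."""
    return K * m * n * math.exp(-lam * score)


# Hypothetical constants, for illustration only.
K, LAM = 0.1, 0.3
for s in (30, 40, 50):
    print(s, expected_hsps(s, m=150, n=1_000_000, K=K, lam=LAM))
```

E grows linearly with the size of the query and of the bank and decreases exponentially with the score.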


Validation of a local alignment (2)

• E indicates the degree of surprise one gets with the observed score: the higher E is, the less significant the score.

• A reasonable value of E is between 0.1 and 0.001. Biologists generally use 10^-4.

• By default, Blast reports matches up to E = 10.

• One generally gives score results standardized with respect to the parameters K and λ:

$$S' = \frac{\lambda s - \ln K}{\ln 2}$$

S' are called « bit scores »

Probability: P-value

• The random number of HSPs with a score ≥ s follows a Poisson law, i.e. the probability of finding exactly k HSPs with a score ≥ s is

$$P(k) = e^{-E} \frac{E^k}{k!}$$

where E is the expected number of HSPs previously defined.

• The p-value P associated with score s is the probability of finding at least one HSP:

$$P = 1 - e^{-E}$$

For instance, if the expected number of HSPs with a score ≥ s is 3, the probability of finding at least 1 HSP is 0.95.
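The bit score, the Poisson probability and the p-value can be checked numerically with a short sketch. The function names are assumptions, and the constants K and λ passed in the example are hypothetical.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def poisson_prob(k, e):
    """Probability of exactly k HSPs when e are expected."""
    return math.exp(-e) * e ** k / math.factorial(k)


def p_value(e):
    """Probability of at least one HSP: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e)


print(round(p_value(3), 2))         # 0.95, the example given above
print(round(p_value(0.001), 6))     # for small E, P is close to E
print(bit_score(50, K=0.1, lam=0.3))
```

With bit scores the expected number of HSPs no longer depends on K and λ: E = m n 2^{-S'}, which is why bit scores can be compared across scoring schemes.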


BLASTN 1.4.7 [16-Oct-94] [Build 17:42:06 Mar 10 1995]

Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Query= gb|X17217|ADAAVAR Avian adenovirus (CELO) DNA encoding VA (virus-associated) RNA and six open reading frames.
(4898 letters)

Database: smallgenbank.fasta
100 sequences; 205,192 total letters.
Searching.................................................done

                                                                        Smallest
                                                                        Sum
                                                                 High   Probability
Sequences producing High-scoring Segment Pairs:                  Score  P(N)      N

gb AAUNKDNA 3576 Z17216 Avian adenovirus DNA (CEL06). ...          302  6.4e-17   1
gb AAVSPHERE 8457 M77182 Amsacta entomopoxvirus sphero...          112  0.0052    4
gb AAVSPHER 4657 M75889 Amsacta moorei entomopoxvirus ...           94  0.34      3
gb AAFVMAF 3171 M26769 Avian musculoaponeurotic fibros...           92  0.96      2
gb ACU10885 2773 U10885 AcMNPV HR3 p6.9 gene, partial ...          101  0.97      1
gb ACSJUN 1074 M16266 Avian sarcoma virus 17 proviral ...           98  0.998     1

Software based on dynamic programming

• SSEARCH (Smith-Waterman)

• BESTFIT (Smith-Waterman)

• GAP (Needleman-Wunsch)

• ALIGN (Gotoh)

• ...

Not often used for searching in banks: computation time too high

BLAST: necessary to find some « tricks » to accelerate the search for alignments


Why is DP so expensive?

• DNA bank: 10^6 sequences (sequence size = 1000)

• Dynamic programming: 10^3 x 10^3 matrix cells per comparison

• Scan: computation of 10^12 matrix cells at 50 ns per cell = 5 x 10^4 s, i.e. > 12 hours

Search heuristics

• A heuristic is a mechanism that takes advantage of an assumption on the data to accelerate a computation

• Assumption:

– an alignment includes at least one common subword of size W

– example: W=5

ATGGCG.GTAGGCATAGGACTCA.TAC
||| || | ||||| ||| |||| ||
ATGTCGAGAAGGCACAGGTCTCAGGAC


Principle of the heuristics (1)

1 – The query sequence is split into subwords of size W, which are stored in a dictionary

Query sequence:

ATGGACTGGC
12345678

Dictionary (W=3):

ATG 1
TGG 2,7
GGA 3
GAC 4
ACT 5
CTG 6
GGC 8

Principle of the heuristics (2)

2 – For each sequence of the bank, subwords of size W (hits) belonging to the dictionary are detected

1 sequence of the bank:

AATCGGATTGCATAA

Dictionary:

ATG 1
TGG 2,7
GGA 3
GAC 4
ACT 5
CTG 6
GGC 8

Principle of the heuristics (3)

3 – Each hit is « extended » to produce an alignment

Sequence of the bank (containing the hit): AATCGGATTGCATAA

Query sequence: ATGGACTGGC

Alignment:

ATCGGATTG
|| ||| ||
AT.GGACTG
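Steps 1 and 2 of the heuristics can be sketched as below, with the query and bank sequence of the slides. The function names are assumptions, positions are 1-based as in the dictionary shown above, and step 3 (the extension of each hit) is left out of this sketch.

```python
from collections import defaultdict


def build_dictionary(query, w):
    """Step 1: map every subword of size w to its 1-based positions in the query."""
    index = defaultdict(list)
    for pos in range(len(query) - w + 1):
        index[query[pos:pos + w]].append(pos + 1)
    return index


def find_hits(bank_seq, index, w):
    """Step 2: list (bank position, query position) for subwords found in the dictionary."""
    hits = []
    for pos in range(len(bank_seq) - w + 1):
        for query_pos in index.get(bank_seq[pos:pos + w], []):
            hits.append((pos + 1, query_pos))
    return hits


index = build_dictionary("ATGGACTGGC", 3)
print(dict(index))
# {'ATG': [1], 'TGG': [2, 7], 'GGA': [3], 'GAC': [4], 'ACT': [5], 'CTG': [6], 'GGC': [8]}
print(find_hits("AATCGGATTGCATAA", index, 3))
# [(5, 3)]: the word GGA, at position 5 in the bank sequence and 3 in the query
```

The dictionary is built once per query, so scanning a bank sequence costs one lookup per position instead of one matrix row.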

Speed gain of the heuristics

• Example: 2 DNA sequences of size 1000

• Dynamic programming:

– 1000 x 1000 = 10^6 matrix cells

• Heuristics with W = 8

– statistically: number of hits = 10^6 x (1 / 4^8) ≈ 15 (1 for W=10)

– 1000 searches in the dictionary (10 x 10^3 operations)

– 15 extensions (15 x 10^3 operations)

• Gain: 10^6 / (2.5 x 10^4) = 40 (87 for W=10)
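The arithmetic of this estimate can be reproduced with a few lines. The parameter names, and the costs of 10 operations per dictionary search and 10^3 operations per extension, are read off the figures above and are otherwise assumptions of this sketch.

```python
def heuristic_gain(n=1000, w=8, alphabet=4, lookup_cost=10, extension_cost=1000):
    """Return (expected number of hits, speed gain over full dynamic programming)."""
    dp_cells = n * n                                # full DP matrix
    expected_hits = dp_cells / alphabet ** w        # random matches of size w
    heuristic_ops = n * lookup_cost + expected_hits * extension_cost
    return expected_hits, dp_cells / heuristic_ops


hits, gain = heuristic_gain(w=8)
print(round(hits), round(gain))  # 15 40
```

With these assumed costs the W = 10 case comes out near 90 rather than exactly the 87 quoted, so the slide presumably used slightly different constants.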


Tradeoff sensitivity / rapidity

[Figure: sensitivity and rapidity plotted against W, for W from 1 to 12]

• The higher W, the faster the computation (and the lower the sensitivity)

• W is a parameter of the software (Blast, Fasta)

Software searching in data banks

• Blast

– 2 implementations:

• NCBI BLAST, National Center for Biotechnology Information

• WU-BLAST, W. Gish, Washington University

• Fasta

– W. Pearson, University of Virginia

• …


BLAST

• Many sites

– check default parameters

• Main parameters:

– W: size of subword (DNA: default value 11; Protein: default value 3)

– E: expected value

– -G: gap opening (default 11)

– -E: gap extension (default 1)

Various Blasts

NAME      Query            Bank                        Common usage
BLASTN    DNA              DNA                         Look for identities + pattern splicing
BLASTP    Protein          Proteins                    Search of homologous proteins
TBLASTN   Protein          DNA translated (6 phases)   Search of genes to be annotated in banks
BLASTX    DNA translated   Proteins                    Search of homologous genes and proteins
TBLASTX   DNA translated   DNA translated              Discovery of the structure of genes


Recent Blasts

• Blast2 (Gapped Blast, 1997): version of Blast with indels, Dust and Seg filtering, and threads for multi-processors.

• PSI-Blast (Position Specific Iterated Blast): iterative version of Blast intended for phylogeny studies. The comparison matrix used at one step takes into account the consensus found at the previous step, from a multiple alignment of the sequences with a sufficient E-value, until the process has converged.

• PHI-Blast (Pattern-Hit Initiated Blast): fast version of Blast, which may be combined with PSI-Blast, where one gives a pattern (Prosite-like regular expression) that must be used in the alignment of sequences.

Dotplots: the simplest way to see alignments

Dotplot = identity matrix of two sequences

Similarities = diagonals

Filtering, smoothing (window 4, 1 error)

[Figure: dotplot of the DNA / cDNA of the actin gene from muscle of Pisaster ochraceus (horizontal / vertical)]
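A dotplot with window filtering can be sketched as follows. The function names are assumptions, and the filtering rule used here (a dot is kept when a window of the given size contains at most `max_errors` mismatches) is one plausible reading of "window 4, 1 error".

```python
def dotplot(x, y, window=1, max_errors=0):
    """Set of (i, j) such that x[i:i+window] and y[j:j+window] differ in at most max_errors positions."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            errors = sum(1 for k in range(window) if x[i + k] != y[j + k])
            if errors <= max_errors:
                dots.add((i, j))
    return dots


def render(x, y, dots):
    """Text picture: x runs horizontally, y vertically, '*' marks a dot."""
    return "\n".join(
        "".join("*" if (i, j) in dots else " " for i in range(len(x)))
        for j in range(len(y)))


print(render("ACGTACGT", "ACGTACGT", dotplot("ACGTACGT", "ACGTACGT")))
```

Similar regions show up as diagonal runs of dots. Raising the window size removes isolated dots at the price of missing very short similarities.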


Dotter: a tool for dot plots

• Karolinska Institutet, Unix & Windows versions

ftp://ftp.cgr.ki.se/pub/esr/dotter/

• Complexity: linear for space, quadratic for time (15 min for 30000x30000)

« A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequences », E. Sonnhammer, R. Durbin, Gene 167: GC1-10, 1995
