Algorithms on Sequences

Bioinformatics 


Potsdam June 2003 

Jacques Nicolas 

Matching with errors : 

Comparison of sequences 

Motivations 

• Sequence comparison is a basic operation in 

molecular biology. 

• It occurs, for instance, in : 

– Search in Data Banks 

– Genome Assembly 

– Phylogenetic tree construction 

– … 

• Knowledge of the principles of sequence comparison is necessary to master the interpretation of the results given by software.

Sequence comparison and alignments 

To compare sequences is to look for alignments 

MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF 

: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .: 

MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKF 

[Figure: uses of alignment: search in a bank (local alignment, BLAST); assembly of genomes]

Today's lecture 

• Dynamic programming applied to 

the comparison of sequences 

• Alignments 

• Substitution matrices 

• Validation of comparison results 

• Search heuristics 

• Introduction to existing software 

[Figure labels, continued: phylogeny (global alignment)]

Alignment of two sequences 

• Matching of 2 subsequences 

example : TACGT-AAT 

TATGTGAAT 

• 2 elementary operations 

– substitution 

– insertion / deletion : gap 

Score of an alignment 

• Necessary to «compare» 2 alignments 

• Score evaluated as a function of costs of 

elementary operations 

– substitution cost 

• DNA : match : +M 

mismatch : -N 

• Protein : substitution matrix 

– insertion/deletion (indel) cost

Score computation 

• Score is the sum of costs of elementary operations 

of an alignment. 

• Example : M = +5 N = -4 G = -3 

GATACGT-AATGCATA 

CTATGTGAATTT 

5+5-4+5+5-3+5+5+5=28 
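The sum above can be checked with a short Python sketch. It assumes that the scored region is the aligned core TACGT-AAT / TATGTGAAT shown on the earlier slide; the function name `alignment_score` is only illustrative.

```python
def alignment_score(row_x, row_y, match=5, mismatch=-4, gap=-3):
    """Sum the elementary costs of an alignment given as two gapped rows."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # insertion / deletion
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


print(alignment_score("TACGT-AAT", "TATGTGAAT"))  # 28
```

Seven matches, one mismatch and one gap give 7 x 5 - 4 - 3 = 28, as on the slide.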

Dynamic Programming : reminder 

– Optimality principle 

– Split a problem in simpler subproblems 

[Figure: a problem split into subproblems A, B, C and D; the optimal partial results of the subproblems are combined into the optimal final solution]

DP applied to alignments 

• Given 2 sequences : 

– X (length = N) and Y (length = M) 

• The score $S_{N,M}$ of the alignment $A_{N,M}$ of 

$X_1 \dots X_N$ 

$Y_1 \dots Y_M$ 

is calculated from 3 sub-alignments : 

– $A_{N-1,M-1}$ followed by a substitution ($X_N$ facing $Y_M$) 

– $A_{N,M-1}$ followed by a gap ($Y_M$ facing a gap in X) 

– $A_{N-1,M}$ followed by a gap ($X_N$ facing a gap in Y) 

Needleman & Wunsch's algorithm (1) 

• global alignment between 2 sequences 

acgtgcataaagccaggataccg 

acggtcattagaccccgagataccg 

ac.gtgcataa.agccag.gataccg 

|| || ||| | | || | ||||||| 

acggt.cattagaccccgagataccg


X = a c t 

Y = a g t 

optimal alignment (act, agt) = MAX of : 

– optimal alignment (ac, ag) + substitution (t,t) 

– optimal alignment (act, ag) + gap facing the t of Y 

– optimal alignment (ac, agt) + gap facing the t of X 

[Figure: recursion tree over the prefix pairs (act, agt), (act, ag), (act, a), (ac, agt), (ac, ag), (ac, a), (a, agt), (a, ag), (a, a)]


• The score $S_{N,M}$ is calculated from the recurrence equation : 

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$


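[Figure: DP matrix with rows labelled T A G A C T and columns labelled T A G C T A; cell (i,j) is filled from its left, diagonal and upper neighbours]

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$

Complexity : filling a matrix of size N x M 

Result : $S_{N,M}$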
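As a minimal sketch of the matrix filling, the Python function below applies the recurrence cell by cell. The function name and the default parameters (match = 2, mismatch = -2, Gap = 2, the values of the worked example in this lecture) are choices made for illustration.

```python
def needleman_wunsch_matrix(x, y, match=2, mismatch=-2, gap=2):
    """Fill the (N+1) x (M+1) score matrix of a global alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Border cells: a prefix aligned against nothing costs one gap per character.
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i][j - 1] - gap,      # gap facing y[j-1]
                S[i - 1][j - 1] + sub,  # substitution
                S[i - 1][j] - gap,      # gap facing x[i-1]
            )
    return S


S = needleman_wunsch_matrix("TAGACT", "TAACT")
print(S[-1][-1])  # 8
```

Both time and memory grow as N x M, which is the complexity stated on the slide.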


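[Figure: partially filled DP matrix, rows T A G A C T, columns T A A C T; first row 0 -2 -4 -6 -8 -10, first column 0 -2 -4 -6 -8 -10 -12; each cell takes the maximum over its left neighbour ($S_{I,J-1} - \mathrm{Gap}$), its diagonal neighbour and its upper neighbour ($S_{I-1,J} - \mathrm{Gap}$)]

match M = 2 

mismatch N = -2 

Gap = 2 

result = ? 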


Determination of the optimal alignment 

          T    A    A    C    T
     0   -2   -4   -6   -8  -10
T   -2    2    0   -2   -4   -6
A   -4    0    4    2    0   -2
G   -6   -2    2    2    0   -2
A   -8   -4    0    4    2    0
C  -10   -6   -2    2    6    4
T  -12   -8   -4    0    4    8

Optimal alignment (with the cumulative score under each column) : 

T  A  -  A  C  T
T  A  G  A  C  T
2  4  2  4  6  8
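The traceback can be sketched as follows: after filling the matrix, walk back from the bottom-right cell and, at each step, take a move that explains the cell's value. The order in which ties are broken (diagonal first, then up, then left) is an arbitrary choice of this sketch, not something stated in the lecture.

```python
def needleman_wunsch(x, y, match=2, mismatch=-2, gap=2):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] - gap, S[i][j - 1] - gap)

    # Traceback from the bottom-right corner to the top-left corner.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                row_x.append(x[i - 1])
                row_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - gap:
            row_x.append(x[i - 1])   # x[i-1] faces a gap in y
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")        # y[j-1] faces a gap in x
            row_y.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("TAGACT", "TAACT"))  # (8, 'TAGACT', 'TA-ACT')
```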

Smith & Waterman algorithm (1) 

• local alignments between 2 sequences 

ccggttcttcttacaacgtgcataaagccagcacaagaaagt 

cgtagtacggtcattagaccccgcagtgagccttacgt 

ac.gtgcataa.agccag 

|| || ||| | | || | 

acggt.cattagaccccg 


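• The score $S_{N,M}$ is calculated from the recurrence equation : 

$$S_{I,J} = \max \begin{cases} 0 \\ S_{I,J-1} - \mathrm{Gap} \\ S_{I-1,J-1} + \mathrm{Sub}(X_I, Y_J) \\ S_{I-1,J} - \mathrm{Gap} \end{cases}$$

Dynamic Programming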


GCC-UCG 

GCCAUUG 

C A G C C U C G C U U A G 

A 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 

A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7 

U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7 

G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0 

C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3 

C 1.0 0.7 0.0 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 0.0 0.0 

A 0.0 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 0.0 

U 0.0 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0 

U 0.0 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0 

G 0.0 0.0 1.3 0.0 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7 

A 0.0 1.0 0.0 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0 

C 1.0 0.0 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0 

G 0.0 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0 

G 0.0 0.0 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0 
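The local alignment recurrence differs from the global one only by the floor at 0 and by the fact that the result is the best cell anywhere in the matrix. A minimal Python sketch follows; the function name and the integer costs are assumptions for illustration and are not the fractional costs used in the matrix above.

```python
def smith_waterman_score(x, y, match=2, mismatch=-2, gap=2):
    """Best local alignment score between x and y (linear gap cost)."""
    n, m = len(x), len(y)
    # First row and first column stay at 0: a local alignment may start anywhere.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                0,                      # restart: drop a negative prefix
                S[i][j - 1] - gap,
                S[i - 1][j - 1] + sub,
                S[i - 1][j] - gap,
            )
            best = max(best, S[i][j])
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))  # 8
```

The optimal local alignment itself would be recovered by a traceback starting at the best cell and stopping at the first cell equal to 0.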

Gotoh’s algorithm (1) 

• More realistic modelling of « gaps »; 

• Cost of a gap of N characters : 

– S & W : $C_N = N \times G$ 

– Gotoh : $C_N = G_b + (N-1) \times G_x$ 

$G_b$ = breaking cost 

$G_x$ = extension cost 

• What is the recurrence equation ?
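One possible answer, sketched in Python: keep two extra matrices, E for alignments ending with a gap in X and F for alignments ending with a gap in Y, so that opening and extending a gap can be charged differently. The sketch assumes a local alignment (floor at 0); the function name, parameter names and cost values are illustrative only.

```python
NEG_INF = float("-inf")


def gotoh_local_score(x, y, match=5, mismatch=-4, gap_open=3, gap_extend=1):
    """Best local score with affine gaps: a gap of N costs gap_open + (N-1)*gap_extend."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: y[j-1] facing a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: x[i-1] facing a gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from S, or extend the gap already open in E / F.
            E[i][j] = max(S[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(S[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0, E[i][j], F[i][j], S[i - 1][j - 1] + sub)
            best = max(best, S[i][j])
    return best


# 8 matches (40) minus one gap of length 2 (3 + 1) = 36
print(gotoh_local_score("AAAAGGGG", "AAAACCGGGG"))  # 36
```

The three matrices keep the computation in N x M time, the same order as with a linear gap cost.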

Gotoh’s algorithm (2) 

$$S_{I,J} = \max \{\, 0,\; E_{I,J},\; F_{I,J},\; S_{I-1,J-1} + \mathrm{sub}(X_I, Y_J) \,\}$$

$$E_{I,J} = \max \{\, S_{I,J-1} - G_B,\; E_{I,J-1} - G_X \,\}$$

$$F_{I,J} = \max \{\, S_{I-1,J} - G_B,\; F_{I-1,J} - G_X \,\}$$

Cost of a substitution 

• DNA Sequences 

Two costs are considered : 

• cost of a match : positive value 

• cost of a mismatch : negative value 

example : BLAST : default values = +5 and -4 

• Proteins 

Use a substitution matrix

Substitution Matrices 

• Substitution matrices are 2D arrays used to 

assign a similarity value between pairs of 

characters (bases or amino acids) = 

substitution cost 

• Matrices are generally symmetrical. The value of the substitution of X with Y is considered to be equal to the value of the substitution of Y with X. 
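As a small illustration, a substitution matrix can be held in Python as a dictionary indexed by pairs of characters. The sketch below builds a DNA matrix from the +5 / -4 BLAST values quoted earlier and checks the symmetry property; the helper names are hypothetical.

```python
BASES = "ATCG"


def make_dna_matrix(match=5, mismatch=-4):
    """Substitution matrix as a dict: (base, base) -> score."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}


def is_symmetric(matrix):
    """True when substituting X with Y scores the same as Y with X."""
    return all(matrix[(a, b)] == matrix[(b, a)] for (a, b) in matrix)


blast_dna = make_dna_matrix()
print(blast_dna[("A", "A")], blast_dna[("A", "G")], is_symmetric(blast_dna))  # 5 -4 True
```

A protein matrix such as PAM or BLOSUM would be stored the same way, with the 20 amino acids in place of the 4 bases.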

Several types of matrices 

• Identity Matrices 

• Matrix based on the genetic code 

• Substitution Matrices 

– PAM (mutations, Dayhoff) 

– BLOSUM (Henikoff) 

• Physico-chemical Matrices 

– Hydrophobicity 

– 3D structure

Identity and default Blast Matrices 

Identity matrix 

    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1

Reference matrix in BLAST 

    A  T  C  G
A   5 -4 -4 -4
T  -4  5 -4 -4
C  -4 -4  5 -4
G  -4 -4 -4  5

Matrix based on the genetic code 

  A R N D C Q E G H I L K M F P S T W Y V
A 3
R 1 3
N 1 1 3
D 2 1 2 3
C 1 2 1 1 3
Q 1 2 1 1 0 3
E 2 1 1 2 0 2 3
G 2 2 1 2 2 1 2 3
H 1 2 2 2 1 2 1 1 3
I 1 2 2 1 1 1 1 1 1 3
L 1 2 1 1 1 2 1 1 2 2 3
K 1 2 2 1 0 2 2 1 1 2 1 3
M 1 2 1 0 0 1 1 1 0 2 2 2 3
F 1 1 1 1 2 0 0 1 1 2 2 0 1 3
P 2 2 1 1 1 2 1 1 2 1 2 1 1 1 3
S 2 2 2 1 2 1 1 2 1 2 2 1 1 2 2 3
T 2 2 2 1 1 1 1 1 1 2 1 2 2 1 2 2 3
W 1 2 0 0 2 1 1 2 0 0 2 1 1 1 1 2 1 3
Y 1 1 2 2 2 1 1 1 2 1 1 1 0 2 1 2 1 1 3
V 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 1 1 3

(number of modifications to pass from one amino acid to another)

Dayhoff’s Matrices (PAM) 

• M.O. Dayhoff, 1978, Atlas of Protein Sequence and 

Structure. Initial matrix deduced from the alignment of 

71 families of highly homologous proteins (>1000 

sequences). 

• Replacement frequency Matrix of each amino acid X with 

one of the 19 others (relative frequency for 

homogeneity reasons) = PAM (Percent Accepted 

Mutation). 

• The model of evolution of Dayhoff assumes that proteins 

have diverged due to the accumulation of random 

mutations. PAM N is the power N of the matrix, thus 

simulating an evolution rate. Standard value of N =250 

Each value is the probability of substitution of a given amino acid with another one. The greater the value, the more likely the replacement. 0 is equivalent to a random mutation.
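The "power N of the matrix" can be illustrated with a toy example. The sketch below raises a made-up 2-state mutation probability matrix to the power 250; the numbers are invented for illustration and are not Dayhoff's data, which cover the 20 amino acids.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """m raised to an integer power >= 0, by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical one-step matrix: each row holds the probabilities that a
# character stays the same or mutates into the other one (rows sum to 1).
TOY_PAM1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, and the probability of keeping the original character drops as N grows, which is how a larger N models more distant sequences.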

Other substitution matrices 

• Physico-chemical properties may be used to 

build score matrices. 

• Chemical properties 

• Hydrophobicity 

• Secondary structures (membrane domains) 

• Tertiary structures 

• ...

Taylor’s Venn Diagram 

For usual protein 

properties 

Validation of an alignment 

• A computer program always finds a solution… 

• How much credit can be given to the resulting alignments ? 

– Validation from : 

• the score ? 

• the percentage of identities ? 

• … 

• Problem : how to decide whether an alignment has something to do with biological reality ?

Example 1 : easy case 

ALIGN calculates a global alignment of two sequences 

version 2.0u 

Please cite: Myers and Miller, CABIOS (1989) 4:11-17 

AF248645_1 150 aa vs. 

AF156936_1 149 aa 

scoring matrix: BLOSUM50, gap penalties: -12/-2 

44.1% identity; Global alignment score: 428 

10 20 30 40 50 60 

/tmp/t MPIVDTGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGLT 

:::.: : . :: ..: :: .: .:.:.: .:. .:..:. ..:.::. ::::.. 

AF1569 MPITDQGPLPTLSEGDKKAIRESWPQIYQNFEQTGLVVLLEFLQKNPGAQQSFPKFSA-- 

10 20 30 40 50 

70 80 90 100 110 120 

/tmp/t TADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVL 

: .:... .:.:.: ::::::: .. :: :.. :..::.::.. :::::. :: : 

AF1569 TKCNLEQDNEVKWQASRIINAVNHTIGLMDKEAAMKQYLKELSAKHSSEFQVDPKLFKEL 

60 70 80 90 100 110 

130 140 150 

/tmp/t AAVIADTVAAGDAGFEKLMSMICILLRSAY-- 

.:....:. : :..:::.:.:: ::::.: 

AF1569 SAIFVSTIR-GKAAYEKLFSIICTLLRSSYDE 

120 130 140 

Example 2 : opposite but easy… 



U13831_1 134 aa vs. 

AF248645_1 150 aa 


15.8% identity; Global alignment score: -47 

10 20 30 40 50 

/tmp/t MTRDQNGTWEMESNENFEGYMKALDIDFATPKIAVRLTQTKVI---DQDGDNFKTK---T 

: ..:. : :. : .: . . : . .: .. .: . 

AF2486 MPIVDTGSVAPLS---------------AAEKTKIRSAWAPVYSNYETSGVDILVKFFTS 

10 20 30 40 

60 70 80 90 100 110 

/tmp/t TSTFRNYDVDFTVGVEFDEYTKSLDNR-HVKALVTWEGDVLVCVQKGEKENRGWKQW--- 

: . ... : . :. :: : : :.. ... .:..: .. :: . .. 

AF2486 TPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGK 

50 60 70 80 90 100 

120 130 

/tmp/t ----IEGDKLYLEL---------TCGD--------QVCRQVFKKK 

.. : :... . :: ..: . . 

AF2486 HAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY 

110 120 130 140 150

Example 3 ??? 



U76030_1 166 aa vs. 

AF248645_1 150 aa 


20.6% identity; Global alignment score: 94 

10 20 30 40 50 

/tmp/t MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF-- 

: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .: 

AF2486 MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKG 

10 20 30 40 50 

60 70 80 90 100 110 

/tmp/t LRNSDVPLEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLK-YGVGD 

: ..: :.:. .. :: .. . .:.... . :.... :. :.. : : . : 

AF2486 LTTAD-QLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMK---LRDLSGKHAKSFQVDP 

60 70 80 90 100 110 

120 130 140 150 160 

/tmp/t AHFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE 

.:.:. .. ::. .: . ....:.. : .. : 

AF2486 QYFKVLAAVIADTV--------------AAGDAGFEKLMSMICILLRSAY 

120 130 140 150 

Validation of a global alignment 

• alignment between S1 and S2 (with gaps) 

score S 

• Question : is S significant ? 

comparison with a score between 

random sequences

Validation of a global alignment (2) 

S1, S2 score S 

S’1, S’2 score S’ 

Random sequences produced by 

• Permutations 

• Markovian models 

comparison ? 

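Standardized score : Z-score 

• Sq1 aligned with Sq2 : score S 

• Sq1 aligned with K random versions of Sq2 (Sq2_1, Sq2_2, ..., Sq2_K) : scores S_1, S_2, …, S_K, with mean m and standard deviation σ 

$$Z = \frac{S - m}{\sigma}$$

Practically : significant score for Z > 6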
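A minimal Python sketch of the Z-score computation follows. It assumes that the random versions of the second sequence are produced by permutation, and it uses a deliberately naive ungapped score (`identity_count`, a hypothetical stand-in) where a real study would plug in an alignment score.

```python
import random
import statistics


def z_score(observed, shuffled_scores):
    """Z = (S - m) / sigma, with m and sigma taken from the shuffled scores."""
    m = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("all shuffled scores are identical, Z is undefined")
    return (observed - m) / sigma


def shuffled_copies(seq, k, seed=0):
    """K random permutations of seq (same composition, random order)."""
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        letters = list(seq)
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


def identity_count(a, b):
    """Toy score: number of identical positions, without gaps."""
    return sum(1 for p, q in zip(a, b) if p == q)


def alignment_z_score(sq1, sq2, score=identity_count, k=100, seed=0):
    """Score sq1 against sq2, then against k shuffled versions of sq2."""
    s = score(sq1, sq2)
    samples = [score(sq1, c) for c in shuffled_copies(sq2, k, seed)]
    return z_score(s, samples)


print(z_score(8, [2, 4, 4, 4, 5, 5, 7, 9]))  # 1.5
```

In the printed example the shuffled scores have mean 5 and standard deviation 2, so an observed score of 8 gives Z = 1.5, far below the practical threshold of 6.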

PRSS compares a test sequence to a shuffled sequence 

version 2.0u64, May, 1998 

s-w est 

22 1 0:= 

24 0 1:* 

26 12 5:==*=== 

28 23 16:=======*==== 

30 38 34:================*== 

32 68 53:==========================*======= 

34 60 64:============================== * 

36 68 66:================================*= 

38 51 60:========================== * 

40 58 51:=========================*=== 

42 32 40:================ * 

44 28 31:============== * 

46 12 23:====== * 

48 17 17:========* 

50 8 12:==== * 

52 5 8:===* 

54 7 6:==*= 

56 3 4:=* 

58 3 3:=* 

60 1 2:* 

62 2 1:* 

64 0 1:* 

66 1 1:* 

68 0 0: 

70 0 0: 

72 1 0:= 

74 0 0: 

76 0 0: 

78 0 0: 

80 1 0:= 

82 0 0: 



s-w est 

< 20 0 0: 

22 0 0: 

24 3 1:*= 

26 11 7:===*== 

28 36 23:===========*====== 

30 57 46:======================*====== 

32 79 65:================================*======= 

34 72 73:====================================* 

36 59 69:============================== * 

38 49 58:========================= * 

40 33 46:================= * 

42 28 34:============== * 

44 22 24:===========* 

46 17 17:========* 

48 8 12:==== * O 

50 6 8:===* 

52 10 6:==*== 

54 5 4:=*= 

56 4 3:=* 

58 1 2:* 

60 0 1:* 

62 0 1:* 

64 0 1:* 

66 0 0: 

> 68 0 0: 

67000 residues in 500 sequences, 

BLOSUM50 matrix, gap penalties: -12,-2 

unshuffled s-w score: 49; shuffled score range: 26 - 60 

For 500 sequences, a score >=49 is expected 31 times 

PRSS 

example 1 




For 500 sequences, a score >=442 is expected 5.26e-30 times 

PRSS 

example 2



s-w est 

28 5 2:*== 

30 11 10:====*= 

32 45 25:============*========== 

34 59 44:=====================*======== 

36 62 59:=============================*= 

38 57 65:============================= * 

40 65 62:==============================*== 

42 49 55:========================= * 

44 31 45:================ * 

46 23 35:============ * 

48 21 27:=========== * 

50 21 20:=========*= 

52 7 14:==== * 

54 20 10:====*===== 

56 6 7:===* 

58 5 5:==* 

60 5 4:=*= 

62 2 3:=* 

64 1 2:* 

66 2 1:* 

68 1 1:* 

70 2 1:* 




For 500 sequences, a score >=114 is expected 0.000902 times 

PRSS 

example 3 

Validation of a local alignment 

• Karlin, S. & Altschul, S.F. (1990) « Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes » Proc. Natl. Acad. Sci. USA 87:2264-2268. 

• Altschul et al. (1994) « Issues in searching molecular sequence databases » Nature Genet. 6:119-129. 

• BLAST : comparison sequence / bank 

• To evaluate the significance of an optimal score without gap, one 

uses a model of random sequences. 

– m : size of sequence, n : size of bank 

– The expected number of HSP with a score ≥ s is 

$$E = K\, m\, n\, e^{-\lambda s}$$

where K and λ are constants depending on the substitution matrix 

RIINAVNDAVVMD 

::::::: .. :: 

RIINAVNHTIGMD 

HSP 

High Scoring Pair

Validation of a local alignment (2) 

• E is an indication of the degree of surprise one gets with the observed score : the higher it is, the less significant the score. 

• A reasonable value of E is between 0.1 and 0.001. Biologists generally use 10^-4 

• By default, Blast reports hits up to E = 10 

• One generally gives score results standardized with respect to parameters K and λ : 

$$S' = \frac{\lambda s - \ln K}{\ln 2}$$

S' are called « bit scores » 

Probability : P-value 

• The random number of HSP with a score ≥ s follows a Poisson law, i.e. the probability of finding exactly k HSP with a score ≥ s is 

$$P(k) = e^{-E}\, \frac{E^k}{k!}$$

where E is the expected number of HSP previously defined. 

• The p-value P associated with score s is the probability of finding at least one HSP : 

$$P = 1 - e^{-E}$$

For instance, if the expected number of HSP with a score ≥ s is 3, the probability of finding at least 1 HSP is 0.95. 
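The three quantities (expected number E, bit score S', p-value P) can be computed with a few lines of Python. This is a sketch under the formulas above; the values of K and λ used in the example are placeholders chosen for illustration, not the constants of any real substitution matrix.

```python
import math


def expected_hsps(s, m, n, K, lam):
    """E = K * m * n * exp(-lambda * s)."""
    return K * m * n * math.exp(-lam * s)


def bit_score(s, K, lam):
    """S' = (lambda * s - ln K) / ln 2."""
    return (lam * s - math.log(K)) / math.log(2)


def poisson(k, e):
    """Probability of exactly k HSPs when e are expected."""
    return math.exp(-e) * e ** k / math.factorial(k)


def p_value(e):
    """Probability of at least one HSP: 1 - P(0)."""
    return 1.0 - math.exp(-e)


print(round(p_value(3), 2))  # 0.95

# Placeholder constants, for illustration only.
K, lam = 0.1, 0.3
e = expected_hsps(s=50, m=100, n=1000, K=K, lam=lam)
# With bit scores the constants disappear: E = m * n * 2**(-S').
print(math.isclose(e, 100 * 1000 * 2 ** (-bit_score(50, K, lam))))  # True
```

The last line shows why bit scores are convenient: once the score is standardized, E depends only on the sizes m and n.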

BLASTN 1.4.7 [16-Oct-94] [Build 17:42:06 Mar 10 1995] 

Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, 

and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 

215:403-10. 

Query= gb|X17217|ADAAVAR Avian adenovirus (CELO) DNA encoding VA 

(virus-associated) RNA and six open reading frames. 

(4898 letters) 

Database: smallgenbank.fasta 

100 sequences; 205,192 total letters. 

Searching.................................................done 

                                                              Smallest
                                                                Sum
                                                       High  Probability
Sequences producing High-scoring Segment Pairs:        Score  P(N)      N

gb AAUNKDNA 3576 Z17216 Avian adenovirus DNA (CEL06). ...    302  6.4e-17   1
gb AAVSPHERE 8457 M77182 Amsacta entomopoxvirus sphero...    112  0.0052    4
gb AAVSPHER 4657 M75889 Amsacta moorei entomopoxvirus ...     94  0.34      3
gb AAFVMAF 3171 M26769 Avian musculoaponeurotic fibros...     92  0.96      2
gb ACU10885 2773 U10885 AcMNPV HR3 p6.9 gene, partial ...    101  0.97      1
gb ACSJUN 1074 M16266 Avian sarcoma virus 17 proviral ...     98  0.998     1

Software based on dynamic programming 

• SSEARCH (Smith-Waterman) 

• BESTFIT (Smith-Waterman) 

• GAP (Needleman-Wunsch) 

• ALIGN (Gotoh) 

• ... 

Not often used for searching in banks : computation time too high 

BLAST 

Necessary to find some « tricks » to accelerate the search for alignments

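Why is DP so expensive ? 

• DNA bank : 10^6 sequences (size of a sequence = 1000) 

• Dynamic Programming : one 10^3 x 10^3 matrix (10^6 matrix-cells) per sequence of the bank 

• Scan : computation of 10^12 matrix-cells 

• At 50 ns per matrix-cell : 10^12 x 50 ns = 5 x 10^4 s, i.e. > 12 hours 

Search Heuristics 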

• A heuristic is a mechanism that takes advantage of an assumption about the data to accelerate a computation 

• Assumption : 

– an alignment includes at least one exactly matching subword of size W 

– example : W=5 

ATGGCG.GTAGGCATAGGACTCA.TAC 

||| || | ||||| ||| |||| || 

ATGTCGAGAAGGCACAGGTCTCAGGAC

Principle of the heuristics (1) 

1 – The query sequence is split into subwords of size W, which are stored in a dictionary 

query sequence : ATGGACTGGC
positions      : 12345678

dictionary (W=3) :

ATG 1
TGG 2,7
GGA 3
GAC 4
ACT 5
CTG 6
GGC 8


2 – For each sequence of the bank, the subwords of size W that belong to the dictionary (hits) are detected 

1 sequence of the bank : AATCGGATTGCATAA

dictionary :

ATG 1
TGG 2,7
GGA 3
GAC 4
ACT 5
CTG 6
GGC 8


3 – Each hit is « extended » to produce an 

alignment 

hit 

Sequence of the bank 

AATCGGATTGCATAA 

ATGGACTGGC 

Query sequence 

alignment 

ATCGGATTG 

|| ||| || 

AT.GGACTG 
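Steps 1 and 2 can be sketched in Python as follows: the query is indexed once, then each bank sequence is scanned with one dictionary lookup per position. The function names are illustrative; positions are 1-based to match the slides, and the extension of step 3 is left out.

```python
def build_dictionary(query, w):
    """Step 1: map every subword of size w to its 1-based positions in the query."""
    dictionary = {}
    for pos in range(len(query) - w + 1):
        dictionary.setdefault(query[pos:pos + w], []).append(pos + 1)
    return dictionary


def find_hits(bank_seq, dictionary, w):
    """Step 2: list (word, position in bank sequence, position in query) for each hit."""
    hits = []
    for pos in range(len(bank_seq) - w + 1):
        word = bank_seq[pos:pos + w]
        for query_pos in dictionary.get(word, []):
            hits.append((word, pos + 1, query_pos))
    return hits


d = build_dictionary("ATGGACTGGC", 3)
print(d["TGG"])                              # [2, 7]
print(find_hits("AATCGGATTGCATAA", d, 3))    # [('GGA', 5, 3)]
```

On the example of the slides the only hit is GGA, at position 5 of the bank sequence and position 3 of the query; this is the seed that step 3 extends into the alignment shown above.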

Speed Gain of the heuristics 

• Example : 2 sequences of DNA of size 1000 

• Dynamic Programming : 

– 1000 x 1000 = 10^6 matrix-cells 

• Heuristics with W = 8 

– statistically : nb of hits = 10^6 x (1 / 4^8) ≈ 15 (1 for W=10) 

– 1000 searches in the dictionary (10 x 10^3 operations) 

– 15 extensions (15 x 10^3 operations) 

• Gain : 10^6 / (2.5 x 10^4) = 40 (87 for W=10)
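The arithmetic for W = 8 can be reproduced with a short Python sketch. The cost model (10 operations per dictionary lookup, 10^3 operations per extension, hit count rounded to an integer) is read off the figures above and is only an assumption of this example.

```python
def expected_hits(len_query, len_bank_seq, w, alphabet=4):
    """Expected number of random word matches between two sequences."""
    return len_query * len_bank_seq / alphabet ** w


def heuristic_gain(len_query, len_bank_seq, w, lookup_cost=10, extension_cost=1000):
    """Ratio between the DP cost and the cost of lookups plus extensions."""
    dp_cells = len_query * len_bank_seq
    hits = round(expected_hits(len_query, len_bank_seq, w))
    heuristic_ops = len_bank_seq * lookup_cost + hits * extension_cost
    return dp_cells / heuristic_ops


print(round(expected_hits(1000, 1000, 8)))  # 15
print(heuristic_gain(1000, 1000, 8))        # 40.0
```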

Tradeoff sensitivity / rapidity 

[Figure: as W goes from 1 to 12, sensitivity decreases and rapidity increases]

• The higher W, the faster the computation 

• W is a parameter of the software (Blast, Fasta) 

Software for searching in data banks 

• Blast 

– 2 implementations : 

• NCBI BLAST, National Center for Biotechnology Information 

• WU-BLAST, W. Gish, Washington University 

• Fasta 

– W. Pearson, University of Virginia 

• …

BLAST 

• Many sites 

– check default parameters 

• Main parameters : 

– W : size of subword 

  DNA : default value 11 

  Protein : default value 3 

– E : expected value 

– -G : gap opening (default 11) 

– -E : gap extension (default 1) 

Various Blasts 

NAME      Query            Bank                        Common usage
BLASTN    DNA              DNA                         Look for identities + pattern splicing
BLASTP    Protein          Proteins                    Search of homologous proteins
TBLASTN   Protein          DNA translated (6 phases)   Search of genes to be annotated in banks
BLASTX    DNA translated   Proteins                    Search of homologous genes and proteins
TBLASTX   DNA translated   DNA translated              Discovery of the structure of genes

Recent Blasts 

• Blast2 (Gapped Blast, 1997) : version of Blast with indels, Dust and Seg filtering, and threads for multi-processors. 

• PSI-Blast (Position Specific Iterated Blast) : iterative version of Blast intended for phylogeny studies. The comparison matrix used at each step takes into account the consensus found at the previous step, from a multiple alignment of the sequences with a sufficient E-value, until the process has converged. 

• PHI-Blast (Pattern-Hit Initiated Blast) : fast version of Blast, which may be combined with PSI-Blast, where one gives a pattern (Prosite-like regular expression) that must be used in the alignment of sequences. 

Dotplots : 

The simplest way to see alignments 

Dotplot = identity matrix of two sequences 

Similarities = diagonals 

[Figure: dotplot of DNA / cDNA of the actin gene from muscle of Pisaster ochraceus (horizontal / vertical), before and after filtering and smoothing (window 4, 1 error)]
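A filtered dotplot can be sketched in a few lines of Python: a dot is kept at (i, j) only when the two windows starting there differ in at most a given number of positions. The function name and the set-of-coordinates representation are choices of this sketch; the defaults mirror the "window 4, 1 error" setting above.

```python
def dotplot(x, y, window=4, max_errors=1):
    """Set of 0-based (i, j) where x[i:i+window] and y[j:j+window] nearly match."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            errors = sum(1 for k in range(window) if x[i + k] != y[j + k])
            if errors <= max_errors:
                dots.add((i, j))
    return dots


def render(x, y, dots, window=4):
    """Text rendering: one row per window start in x, one column per start in y."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(y) - window + 1))
        for i in range(len(x) - window + 1)
    )


seq = "ACGTACGT"
print(render(seq, seq, dotplot(seq, seq)))
```

Comparing a sequence with itself gives the main diagonal, and the repeat ACGT ... ACGT shows up as shorter parallel diagonals. With window=1 and max_errors=0 the function reduces to the plain identity matrix.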

Dotter : a tool for dot plots 

• Karolinska Institutet, unix & windows versions 

ftp://ftp.cgr.ki.se/pub/esr/dotter/ 

• Complexity : linear for space, quadratic for time (15 min for 30000 x 30000) 

« A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequences » E. Sonnhammer, R. Durbin, Gene 167: GC1-10, 1995
