Algorithms on Sequences

cs.uni.potsdam.de

Bioinformatics

Potsdam June 2003

Jacques Nicolas

Matching with errors :

Comparison of sequences

Motivations

• Sequence comparison is a basic operation in molecular biology.

• It occurs, for instance, in :

– Search in Data Banks

– Genome Assembly

– Phylogenetic tree construction

– …

• Knowing the principles of sequence comparison is necessary to master the interpretation of the results given by software.


Sequence comparison and alignments

To compare sequences is to look for alignments

MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF

: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .:

MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKF

[Figure labels : search in a bank, local alignment (blast), assembly of genomes]

Today's lecture

• Dynamic programming applied to the comparison of sequences

• Alignments

• Substitution matrices

• Validation of comparison results

• Search heuristics

• Introduction to existing software

[Figure labels : phylogeny, global alignment]


Alignment of two sequences

• Matching of 2 subsequences

example : TACGT-AAT

TATGTGAAT

• 2 elementary operations

– substitution

– insertion / deletion : gap

Score of an alignment

• Necessary to «compare» 2 alignments

• Score evaluated as a function of costs of

elementary operations

– substitution cost

• DNA : match : +M

mismatch : -N

• Protein : substitution matrix

– insertion/deletion (indel) cost


Score computation

• Score is the sum of costs of elementary operations

of an alignment.

• Example : M = +5 N = -4 G = -3

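GATACGT-AATGCATA

 CTATGTGAATTT

(the second sequence is shifted by one position ; the score is taken over the columns TACGT-AAT / TATGTGAAT)

5+5-4+5+5-3+5+5+5=28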
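The sum above can be checked with a few lines of Python. This is only a sketch: the function name `alignment_score` and its argument names are chosen here for illustration, and the two rows are the scored columns of the example, with `-` marking a gap.

```python
def alignment_score(row_x, row_y, match=5, mismatch=-4, gap=-3):
    """Sum the costs of the elementary operations of an alignment.

    row_x and row_y are the two rows of the alignment (same length);
    '-' marks an insertion/deletion.
    """
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            score += gap        # indel
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # substitution
    return score


print(alignment_score("TACGT-AAT", "TATGTGAAT"))  # 28
```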

Dynamic Programming : reminder

– Optimality principle

– Split a problem into simpler subproblems

[Figure : stages A, B, C, D ; the optimal final solution is assembled from optimal partial results]

DP applied to alignments

• Given 2 sequences :

– X (length = N) and Y (length = M)

• The score $S_{N,M}$ of the alignment $A_{N,M}$ of $X_1 \dots X_N$ with $Y_1 \dots Y_M$ is calculated from 3 sub-alignments :

– $A_{N-1,M-1}$ followed by a substitution : $X_N$ aligned with $Y_M$

– $A_{N,M-1}$ followed by a gap : - aligned with $Y_M$

– $A_{N-1,M}$ followed by a gap : $X_N$ aligned with -

Needleman & Wunsch’s algorithm (1)

• global alignment between 2 sequences

acgtgcataaagccaggataccg

acggtcattagaccccgagataccg

ac.gtgcataa.agccag.gataccg

|| || ||| | | || | |||||||

acggt.cattagaccccgagataccg


Needleman & Wunsch’s algorithm (2)

X = a c t

Y = a g t

The optimal alignment of a c t with a g t is the MAX of :

– optimal alignment of a c with a g + substitution (t,t)

– optimal alignment of a c t with a g + substitution (-,t)

– optimal alignment of a c with a g t + substitution (t,-)

Needleman & Wunsch’s algorithm (3)

Sub-problems : every prefix of X = act is aligned with every prefix of Y = agt

act / agt    act / ag    act / a

ac / agt     ac / ag     ac / a

a / agt      a / ag      a / a


Needleman & Wunsch’s algorithm (4)

• The score $S_{N,M}$ is calculated from the recurrence equation :

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - Gap \\ S_{I-1,J-1} + Sub(X_I, Y_J) \\ S_{I-1,J} - Gap \end{cases}$$

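Needleman & Wunsch’s algorithm (5)

[Figure : dynamic programming matrix with T A G A C T on the rows and T A G C T A on the columns ; the cell i,j holds $S_{I,J}$]

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - Gap \\ S_{I-1,J-1} + Sub(X_I, Y_J) \\ S_{I-1,J} - Gap \end{cases}$$

Complexity : filling a matrix of size N x M

Result : $S_{N,M}$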
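A minimal Python sketch of this matrix filling is given below. It is not tied to any particular software: the function name is hypothetical, the first row and column are initialised with cumulative gap costs, and the default costs (match 2, mismatch -2, Gap 2) as well as the pair of sequences are borrowed from the worked example that appears later in the lecture.

```python
def needleman_wunsch_matrix(x, y, match=2, mismatch=-2, gap=2):
    """Fill the (N+1) x (M+1) score matrix of a global alignment."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix with the empty sequence costs one gap per character.
    for i in range(1, n + 1):
        s[i][0] = -gap * i
    for j in range(1, m + 1):
        s[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(s[i][j - 1] - gap,       # gap facing y[j]
                          s[i - 1][j - 1] + sub,   # substitution
                          s[i - 1][j] - gap)       # gap facing x[i]
    return s


matrix = needleman_wunsch_matrix("TAGACT", "TAACT")
print(matrix[-1][-1])  # score S(N,M) of the best global alignment: 8
```

Each of the N x M cells is computed in constant time from three neighbours, which is where the quadratic cost comes from.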


Needleman & Wunsch’s algorithm (6)

match M = 2

mismatch N = -2

Gap = 2

          T    A    A    C    T
     0   -2   -4   -6   -8  -10
T   -2    2    0   -2   -4   -6
A   -4    0    4    2    0   -2
G   -6   -2    2    2    0   -2
A   -8   -4    0    4    2    0
C  -10   -6   -2    2    6    4
T  -12   -8   -4    0    4    8

$$S_{I,J} = \max \begin{cases} S_{I,J-1} - Gap \\ S_{I-1,J-1} + Sub(X_I, Y_J) \\ S_{I-1,J} - Gap \end{cases}$$

result = ?

Needleman & Wunsch’s algorithm (7)

          T    A    A    C    T
     0   -2   -4   -6   -8  -10
T   -2    2    0   -2   -4   -6
A   -4    0    4    2    0   -2
G   -6   -2    2    2    0   -2
A   -8   -4    0    4    2    0
C  -10   -6   -2    2    6    4
T  -12   -8   -4    0    4    8

Determination of the optimal alignment (scores along the path followed back from the last cell)

T   A   -   A   C   T

2   4   2   4   6   8

T   A   G   A   C   T
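The determination of the optimal alignment can be sketched as a walk back from the last cell, choosing at each step a neighbour that explains the current score. The code below is an illustrative sketch (hypothetical function name, substitution tried before gaps when several choices tie), using the costs of the example.

```python
def needleman_wunsch(x, y, match=2, mismatch=-2, gap=2):
    """Return (score, row_x, row_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -gap * i
    for j in range(1, m + 1):
        s[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(s[i][j - 1] - gap, s[i - 1][j - 1] + sub, s[i - 1][j] - gap)

    # Walk back from (N, M) to (0, 0).
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if s[i][j] == s[i - 1][j - 1] + sub:      # substitution
                row_x.append(x[i - 1])
                row_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and s[i][j] == s[i - 1][j] - gap:    # gap in y
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:                                         # gap in x
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


print(needleman_wunsch("TAGACT", "TAACT"))  # (8, 'TAGACT', 'TA-ACT')
```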


Smith & Waterman algorithm (1)

• local alignments between 2 sequences

ccggttcttcttacaacgtgcataaagccagcacaagaaagt

cgtagtacggtcattagaccccgcagtgagccttacgt

ac.gtgcataa.agccag

|| || ||| | | || |

acggt.cattagaccccg

Smith & Waterman algorithm (2)

• The score $S_{N,M}$ is calculated by dynamic programming from the recurrence equation :

$$S_{I,J} = \max \begin{cases} 0 \\ S_{I,J-1} - Gap \\ S_{I-1,J-1} + Sub(X_I, Y_J) \\ S_{I-1,J} - Gap \end{cases}$$
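The only change with respect to the global case is the extra 0 in the maximum : a local alignment may start anywhere, and the result is read from the best cell rather than from the last one. A minimal sketch follows, with a hypothetical function name and costs assumed for the illustration (match 2, mismatch -2, Gap 2).

```python
def smith_waterman(x, y, match=2, mismatch=-2, gap=2):
    """Return (score, row_x, row_y) for one best local alignment."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(0,
                          s[i][j - 1] - gap,
                          s[i - 1][j - 1] + sub,
                          s[i - 1][j] - gap)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Walk back from the best cell until a cell of score 0 is reached.
    i, j = best_pos
    row_x, row_y = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + sub:
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] - gap:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("GGTTACGTAA", "CCACGTCC"))  # (8, 'ACGT', 'ACGT')
```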


Smith & Waterman algorithm (3)

GCC-UCG

GCCAUUG

C A G C C U C G C U U A G

A 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

A 0.0 1.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.7

U 0.0 0.0 0.8 0.3 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.7

G 0.0 0.0 1.0 0.3 0.0 0.0 0.7 1.0 0.0 0.0 0.7 0.7 1.0

C 1.0 0.0 0.0 2.0 1.3 0.3 1.0 0.3 2.0 0.7 0.3 0.3 0.3

C 1.0 0.7 0.0 1.0 3.0 1.7 1.3 1.0 1.3 1.7 0.3 0.0 0.0

A 0.0 2.0 0.7 0.3 1.7 2.7 1.3 1.0 0.7 1.0 1.3 1.3 0.0

U 0.0 0.7 1.7 0.3 1.3 2.7 2.3 1.0 0.7 1.7 2.0 1.0 1.0

U 0.0 0.3 0.3 1.3 1.0 2.3 2.3 2.0 0.7 1.7 2.7 1.7 1.0

G 0.0 0.0 1.3 0.0 1.0 1.0 2.0 3.3 2.0 1.7 1.3 2.3 2.7

A 0.0 1.0 0.0 1.0 0.3 0.7 0.7 2.0 3.0 1.7 1.3 2.3 2.0

C 1.0 0.0 0.7 1.0 2.0 0.7 1.7 1.7 3.0 2.7 1.3 1.0 2.0

G 0.0 0.7 1.0 0.3 0.7 1.7 0.3 2.7 1.7 2.7 2.3 1.0 2.0

G 0.0 0.0 1.7 0.7 0.3 0.3 1.3 1.3 2.3 1.3 2.3 2.0 2.0

Gotoh’s algorithm (1)

• More realistic modelling of « gaps »;

• Cost of a gap of N characters :

– S & W : $C_N = N \times G$

– Gotoh : $C_N = G_b + (N-1) \times G_x$

$G_b$ = breaking cost

$G_x$ = extension cost

• What is the recurrence equation ?
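One possible answer can be sketched in Python with three matrices : S for the best score, E and F for alignments that end with a gap in one sequence or the other. This is an illustrative sketch of a local alignment with this gap model, not the code of any existing program ; the function name, the parameter names `gap_open` (for the breaking cost) and `gap_extend` (for the extension cost), and the default values are assumptions.

```python
NEG_INF = float("-inf")


def gotoh_local_score(x, y, match=5, mismatch=-4, gap_open=4, gap_extend=1):
    """Best local alignment score when a gap of N characters costs
    gap_open + (N - 1) * gap_extend."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    e = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap facing y[j]
    f = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap facing x[i]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from S or extend the gap already open.
            e[i][j] = max(s[i][j - 1] - gap_open, e[i][j - 1] - gap_extend)
            f[i][j] = max(s[i - 1][j] - gap_open, f[i - 1][j] - gap_extend)
            sub = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(0, e[i][j], f[i][j], s[i - 1][j - 1] + sub)
            best = max(best, s[i][j])
    return best


# 8 matches (40) and one gap of 3 characters (4 + 2 * 1 = 6)
print(gotoh_local_score("AAAACCCC", "AAAATTTCCCC"))  # 34
```

With a linear cost of 4 per gap character the same alignment would score 40 - 12 = 28, which shows how the model favours one long gap over several short ones.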


Gotoh’s algorithm (2)

$$E_{I,J} = \max \begin{cases} S_{I,J-1} - G_b \\ E_{I,J-1} - G_x \end{cases}$$

$$F_{I,J} = \max \begin{cases} S_{I-1,J} - G_b \\ F_{I-1,J} - G_x \end{cases}$$

$$S_{I,J} = \max \begin{cases} 0 \\ E_{I,J} \\ F_{I,J} \\ S_{I-1,J-1} + sub(X_I, Y_J) \end{cases}$$

Cost of a substitution

• DNA Sequences

One considers 2 costs :

• cost of a match : positive value

• cost of a mismatch : negative value

example : BLAST : default values = +5 and -4

• Proteins

Use a substitution matrix


Substitution Matrices

• Substitution matrices are 2D arrays used to assign a similarity value between pairs of characters (bases or amino acids) = substitution cost

• Matrices are generally symmetrical. The value of the substitution of X with Y is considered to be equal to the value of the substitution of Y with X.

Several types of matrices

• Identity Matrices

• Matrix based on the genetic code

• Substitution Matrices

– PAM (mutations, Dayhoff)

– BLOSUM (Henikoff)

• Physico-chemical Matrices

– Hydrophobicity

– 3D structure


Identity and default Blast Matrices

Identity matrix

     A   T   C   G
A    1   0   0   0
T    0   1   0   0
C    0   0   1   0
G    0   0   0   1

Reference matrix in BLAST

     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Matrix based on the genetic code

(number of modifications to pass from an amino acid to another)

  A R N D C Q E G H I L K M F P S T W Y V
A 3
R 1 3
N 1 1 3
D 2 1 2 3
C 1 2 1 1 3
Q 1 2 1 1 0 3
E 2 1 1 2 0 2 3
G 2 2 1 2 2 1 2 3
H 1 2 2 2 1 2 1 1 3
I 1 2 2 1 1 1 1 1 1 3
L 1 2 1 1 1 2 1 1 2 2 3
K 1 2 2 1 0 2 2 1 1 2 1 3
M 1 2 1 0 0 1 1 1 0 2 2 2 3
F 1 1 1 1 2 0 0 1 1 2 2 0 1 3
P 2 2 1 1 1 2 1 1 2 1 2 1 1 1 3
S 2 2 2 1 2 1 1 2 1 2 2 1 1 2 2 3
T 2 2 2 1 1 1 1 1 1 2 1 2 2 1 2 2 3
W 1 2 0 0 2 1 1 2 0 0 2 1 1 1 1 2 1 3
Y 1 1 2 2 2 1 1 1 2 1 1 1 0 2 1 2 1 1 3
V 2 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1 1 1 1 3


Dayhoff’s Matrices (PAM)

• M.O. Dayhoff, 1978, Atlas of Protein Sequence and

Structure. Initial matrix deduced from the alignment of

71 families of highly homologous proteins (>1000

sequences).

• Replacement frequency Matrix of each amino acid X with

one of the 19 others (relative frequency for

homogeneity reasons) = PAM (Percent Accepted

Mutation).

• The model of evolution of Dayhoff assumes that proteins

have diverged due to the accumulation of random

mutations. PAM N is the power N of the matrix, thus

simulating an evolution rate. Standard value of N =250
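The idea that PAM N is the power N of a one-step matrix can be illustrated on a toy case. The sketch below uses a made-up mutation matrix on a 2-letter alphabet (each letter is kept with probability 0.99) ; it is not Dayhoff's data, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """matrix multiplied by itself n times (identity for n = 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy one-step matrix: entry [i][j] is the probability that letter i becomes letter j.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the two letters of this toy model are almost equally likely whatever the starting letter, which is the sense in which a high N simulates a long evolution.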

Each value scores the substitution of a given amino acid (aa) with another one (log-odds form). The greater the value, the more likely the replacement. 0 is equivalent to a random mutation.


Other substitution matrices

• Physico-chemical properties may be used to

build score matrices.

• Chemical properties

• Hydrophobicity

• Secondary structures (membrane domains)

• Tertiary structures

• ...


Taylor’s Venn Diagram

For usual protein

properties

Validation of an alignment

• A computer program always finds a solution

• How much credit can be given to the resulting alignments ?

– Validation from :

• the score ?

• the percentage of identities ?

• …

• Problem : how to decide that an alignment has something to do with biological reality


Example 1 : easy case

ALIGN calculates a global alignment of two sequences

version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17

AF248645_1 150 aa vs.

AF156936_1 149 aa

scoring matrix: BLOSUM50, gap penalties: -12/-2

44.1% identity; Global alignment score: 428

10 20 30 40 50 60

/tmp/t MPIVDTGSVAPLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKGLT

:::.: : . :: ..: :: .: .:.:.: .:. .:..:. ..:.::. ::::..

AF1569 MPITDQGPLPTLSEGDKKAIRESWPQIYQNFEQTGLVVLLEFLQKNPGAQQSFPKFSA--

10 20 30 40 50

70 80 90 100 110 120

/tmp/t TADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVL

: .:... .:.:.: ::::::: .. :: :.. :..::.::.. :::::. :: :

AF1569 TKCNLEQDNEVKWQASRIINAVNHTIGLMDKEAAMKQYLKELSAKHSSEFQVDPKLFKEL

60 70 80 90 100 110

130 140 150

/tmp/t AAVIADTVAAGDAGFEKLMSMICILLRSAY--

.:....:. : :..:::.:.:: ::::.:

AF1569 SAIFVSTIR-GKAAYEKLFSIICTLLRSSYDE

120 130 140

Example 2 : opposite but easy…

ALIGN calculates a global alignment of two sequences

version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17

U13831_1 134 aa vs.

AF248645_1 150 aa

scoring matrix: BLOSUM50, gap penalties: -12/-2

15.8% identity; Global alignment score: -47

10 20 30 40 50

/tmp/t MTRDQNGTWEMESNENFEGYMKALDIDFATPKIAVRLTQTKVI---DQDGDNFKTK---T

: ..:. : :. : .: . . : . .: .. .: .

AF2486 MPIVDTGSVAPLS---------------AAEKTKIRSAWAPVYSNYETSGVDILVKFFTS

10 20 30 40

60 70 80 90 100 110

/tmp/t TSTFRNYDVDFTVGVEFDEYTKSLDNR-HVKALVTWEGDVLVCVQKGEKENRGWKQW---

: . ... : . :. :: : : :.. ... .:..: .. :: . ..

AF2486 TPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMKLRDLSGK

50 60 70 80 90 100

120 130

/tmp/t ----IEGDKLYLEL---------TCGD--------QVCRQVFKKK

.. : :... . :: ..: . .

AF2486 HAKSFQVDPQYFKVLAAVIADTVAAGDAGFEKLMSMICILLRSAY

110 120 130 140 150


Example 3 ???

ALIGN calculates a global alignment of two sequences

version 2.0u Please cite: Myers and Miller, CABIOS (1989) 4:11-17

U76030_1 166 aa vs.

AF248645_1 150 aa

scoring matrix: BLOSUM50, gap penalties: -12/-2

20.6% identity; Global alignment score: 94

10 20 30 40 50

/tmp/t MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMF-SF--

: .: :...:: .: ... . ..:: . .. . .. ...:.: .:.:...: .:

AF2486 MPIV-DTGSVA-PLSAAEKTKIRSAWAPVYSNYETSGVDILVKFFTSTPAAQEFFPKFKG

10 20 30 40 50

60 70 80 90 100 110

/tmp/t LRNSDVPLEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLK-YGVGD

: ..: :.:. .. :: .. . .:.... . :.... :. :.. : : . :

AF2486 LTTAD-QLKKSADVRWHAERIINAVNDAVVSMDDTEKMSMK---LRDLSGKHAKSFQVDP

60 70 80 90 100 110

120 130 140 150 160

/tmp/t AHFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE

.:.:. .. ::. .: . ....:.. : .. :

AF2486 QYFKVLAAVIADTV--------------AAGDAGFEKLMSMICILLRSAY

120 130 140 150

Validation of a global alignment

• alignment between S1 and S2 (with gaps)

score S

• Question : is S significant ?

comparison with a score between

random sequences


Validation of a global alignment (2)

S1, S2 : score S

S’1, S’2 : score S’

Random sequences produced by

• Permutations

• Markovian models

comparison ?

Standardized score : Z-score

Sq1 is compared with Sq2 (score S), then with K randomized versions Sq2_1, Sq2_2, ..., Sq2_K of Sq2 (scores $S_1, S_2, \dots, S_K$, with mean $m$ and standard deviation $\sigma$) :

$$Z = \frac{S - m}{\sigma}$$

Practically : significant score for Z > 6
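The procedure can be sketched as follows, with permutations as the source of random sequences. All names are hypothetical, the population standard deviation is used, and `identity_count` is only a stand-in scoring function for the demonstration ; in practice the score would come from an alignment algorithm.

```python
import random
import statistics


def shuffled_scores(seq1, seq2, score_fn, k=100, seed=0):
    """Scores of seq1 against k random permutations of seq2."""
    rng = random.Random(seed)
    letters = list(seq2)
    scores = []
    for _ in range(k):
        rng.shuffle(letters)
        scores.append(score_fn(seq1, "".join(letters)))
    return scores


def z_from_scores(s, random_scores):
    """Z = (S - m) / sigma, m and sigma taken over the random scores."""
    m = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)
    if sigma == 0:
        raise ValueError("all random scores are equal, Z is undefined")
    return (s - m) / sigma


def identity_count(a, b):
    """Stand-in score: number of positions holding the same character."""
    return sum(1 for ca, cb in zip(a, b) if ca == cb)


sq1 = "ACGTTGCAAGCTTAGGCATCGATCGGATTACA"
sq2 = "ACGTTGCAAGCTTAGGCATCGATCGGATTACA"
z = z_from_scores(identity_count(sq1, sq2),
                  shuffled_scores(sq1, sq2, identity_count, k=200))
print(round(z, 1))
```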


PRSS compares a test sequence to a shuffled sequence

version 2.0u64, May, 1998

s-w est

22 1 0:=

24 0 1:*

26 12 5:==*===

28 23 16:=======*====

30 38 34:================*==

32 68 53:==========================*=======

34 60 64:============================== *

36 68 66:================================*=

38 51 60:========================== *

40 58 51:=========================*===

42 32 40:================ *

44 28 31:============== *

46 12 23:====== *

48 17 17:========*

50 8 12:==== *

52 5 8:===*

54 7 6:==*=

56 3 4:=*

58 3 3:=*

60 1 2:*

62 2 1:*

64 0 1:*

66 1 1:*

68 0 0:

70 0 0:

72 1 0:=

74 0 0:

76 0 0:

78 0 0:

80 1 0:=

82 0 0:

PRSS compares a test sequence to a shuffled sequence

version 2.0u64, May, 1998

s-w est

< 20 0 0:

22 0 0:

24 3 1:*=

26 11 7:===*==

28 36 23:===========*======

30 57 46:======================*======

32 79 65:================================*=======

34 72 73:====================================*

36 59 69:============================== *

38 49 58:========================= *

40 33 46:================= *

42 28 34:============== *

44 22 24:===========*

46 17 17:========*

48 8 12:==== * O

50 6 8:===*

52 10 6:==*==

54 5 4:=*=

56 4 3:=*

58 1 2:*

60 0 1:*

62 0 1:*

64 0 1:*

66 0 0:

> 68 0 0:

PRSS example 1 (first histogram above)

74500 residues in 500 sequences,

BLOSUM50 matrix, gap penalties: -12,-2

unshuffled s-w score: 442; shuffled score range: 24 - 82

For 500 sequences, a score >=442 is expected 5.26e-30 times

PRSS example 2 (second histogram above)

67000 residues in 500 sequences,

BLOSUM50 matrix, gap penalties: -12,-2

unshuffled s-w score: 49; shuffled score range: 26 - 60

For 500 sequences, a score >=49 is expected 31 times


PRSS compares a test sequence to a shuffled sequence

version 2.0u64, May, 1998

s-w est

28 5 2:*==

30 11 10:====*=

32 45 25:============*==========

34 59 44:=====================*========

36 62 59:=============================*=

38 57 65:============================= *

40 65 62:==============================*==

42 49 55:========================= *

44 31 45:================ *

46 23 35:============ *

48 21 27:=========== *

50 21 20:=========*=

52 7 14:==== *

54 20 10:====*=====

56 6 7:===*

58 5 5:==*

60 5 4:=*=

62 2 3:=*

64 1 2:*

66 2 1:*

68 1 1:*

70 2 1:*

83000 residues in 500 sequences,

BLOSUM50 matrix, gap penalties: -12,-2

unshuffled s-w score: 114; shuffled score range: 29 - 72

For 500 sequences, a score >=114 is expected 0.000902 times

PRSS

example 3

Validation of a local alignment

• Karlin, S. & Altschul, S.F. (1990) « Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes » Proc. Natl. Acad. Sci. USA 87:2264-2268.

• Altschul et al. (1994) « Issues in searching molecular sequence databases » Nature Genet. 6:119-129.

• BLAST : comparison sequence / bank

• To evaluate the significance of an optimal score without gap, one

uses a model of random sequences.

– m : size of sequence, n : size of bank

– The expected number of HSP with a score ≥ s is

$$E = K m n e^{-\lambda s}$$

where K and λ are constants, depending on the substitution matrix …

RIINAVNDAVVMD

::::::: .. ::

RIINAVNHTIGMD

HSP

High Scoring Pair


Validation of a local alignment (2)

• E is an indication of the degree of surprise one gets with the observed score : the higher it is, the less significant the score.

• A reasonable value of E is between 0.1 and 0.001. Biologists generally use $10^{-4}$

• By default, Blast reports hits up to E = 10

• One generally gives score results standardized with respect to parameters K and λ. S′ are called « bit scores » :

$$S' = \frac{\lambda s - \ln K}{\ln 2}$$

Probability : P-value

• The random number of HSP with a score ≥ s follows a Poisson law, i.e. the probability to find exactly k HSP with a score ≥ s is

$$e^{-E} \frac{E^k}{k!}$$

where E is the expected number of HSP previously defined.

• The p-value P associated to score s is the probability to find at least one HSP :

$$P = 1 - e^{-E}$$

For instance, if the expected number of HSP with a score ≥ s is 3, the probability to find at least 1 HSP is 0.95.
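These quantities are simple to compute once K and λ are known. The sketch below uses hypothetical function names, and the values of K and λ in the example are placeholders chosen for illustration, not constants of a real substitution matrix.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * s): expected number of HSP with score >= s."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """S' = (lambda * s - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


def poisson_probability(k, e_value):
    """Probability to find exactly k HSP when E are expected."""
    return math.exp(-e_value) * e_value ** k / math.factorial(k)


def p_value(e_value):
    """Probability to find at least one HSP: 1 - P(k = 0)."""
    return 1.0 - math.exp(-e_value)


print(round(p_value(3), 2))  # 0.95, as in the example

# Placeholder parameters (assumed values, for illustration only).
K, lam = 0.1, 0.3
e = expected_hsps(50, m=100, n=1000, K=K, lam=lam)
print(e, bit_score(50, K, lam))
```

A useful property of the bit score is that the expected number of HSP can be recovered without K and λ, as m x n x 2^(-S′).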


BLASTN 1.4.7 [16-Oct-94] [Build 17:42:06 Mar 10 1995]

Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,

and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol.

215:403-10.

Query= gb|X17217|ADAAVAR Avian adenovirus (CELO) DNA encoding VA

(virus-associated) RNA and six open reading frames.

(4898 letters)

Database: smallgenbank.fasta

100 sequences; 205,192 total letters.

Searching.................................................done

                                                              Smallest Sum
                                                       High   Probability
Sequences producing High-scoring Segment Pairs:        Score  P(N)      N

gb AAUNKDNA 3576 Z17216 Avian adenovirus DNA (CEL06). ...   302  6.4e-17   1
gb AAVSPHERE 8457 M77182 Amsacta entomopoxvirus sphero...   112  0.0052    4
gb AAVSPHER 4657 M75889 Amsacta moorei entomopoxvirus ...    94  0.34      3
gb AAFVMAF 3171 M26769 Avian musculoaponeurotic fibros...    92  0.96      2
gb ACU10885 2773 U10885 AcMNPV HR3 p6.9 gene, partial ...   101  0.97      1
gb ACSJUN 1074 M16266 Avian sarcoma virus 17 proviral ...    98  0.998     1

(BLAST output)

Software based on dynamic programming

• SSEARCH (Smith-Waterman)

• BESTFIT (Smith-Waterman)

• GAP (Needleman-Wunsch)

• ALIGN (Gotoh)

• ...

Not often used for searching in banks : computation time too high

Necessary to find some « tricks » to accelerate the search of alignments


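Why is DP so expensive ?

DNA bank : $10^6$ sequences (size of a sequence = 1000)

Dynamic Programming : $10^3 \times 10^3$ matrix-cells for one query of size 1000 against one sequence of the bank

Scan of the bank : computation of $10^{12}$ matrix-cells at 50 ns per cell, i.e. $10^{12} \times 50\ \mathrm{ns} = 5 \times 10^{4}$ s > 12 hours

Search Heuristics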

• A heuristic is a mechanism that takes advantage of an assumption on the data to accelerate a computation

• Assumption :

– an alignment includes at least a subword of size W

– example : W=5

ATGGCG.GTAGGCATAGGACTCA.TAC

||| || | ||||| ||| |||| ||

ATGTCGAGAAGGCACAGGTCTCAGGAC


Principle of the heuristics (1)

1 – The query sequence is split into subwords of size W, which are stored in a dictionary

query sequence (W=3), with the start positions of its subwords :

ATGGACTGGC

12345678

dictionary (subword : start positions) :

ATG 1

TGG 2,7

GGA 3

GAC 4

ACT 5

CTG 6

GGC 8
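This first step can be sketched with a Python dict that maps each subword to the list of its start positions in the query (numbered from 1, as on the slide). The function name is hypothetical.

```python
from collections import defaultdict


def build_dictionary(query, w):
    """Map every subword of size w of the query to its 1-based start positions."""
    dictionary = defaultdict(list)
    for start in range(len(query) - w + 1):
        dictionary[query[start:start + w]].append(start + 1)
    return dict(dictionary)


for word, positions in build_dictionary("ATGGACTGGC", 3).items():
    print(word, positions)
```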

Principle of the heuristics (2)

2 – For each sequence of the bank, the subwords of size W (hits) belonging to the dictionary are detected

1 sequence of the bank :

AATCGGATTGCATAA

dictionary (subword : start positions in the query) :

ATG 1

TGG 2,7

GGA 3

GAC 4

ACT 5

CTG 6

GGC 8
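The detection of hits is then a single scan of the bank sequence, with one dictionary lookup per position. In the sketch below the function name is hypothetical and the dictionary is written out by hand from the example ; each hit is reported as (subword, position in the query, position in the bank sequence), positions starting at 1.

```python
def find_hits(bank_sequence, dictionary, w):
    """List the subwords of the bank sequence that belong to the dictionary."""
    hits = []
    for start in range(len(bank_sequence) - w + 1):
        word = bank_sequence[start:start + w]
        for query_pos in dictionary.get(word, []):
            hits.append((word, query_pos, start + 1))
    return hits


# Dictionary of the query ATGGACTGGC for W = 3.
dictionary = {"ATG": [1], "TGG": [2, 7], "GGA": [3], "GAC": [4],
              "ACT": [5], "CTG": [6], "GGC": [8]}
print(find_hits("AATCGGATTGCATAA", dictionary, 3))  # [('GGA', 3, 5)]
```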


Principle of the heuristics (3)

3 – Each hit is « extended » to produce an

alignment

hit

Sequence of the bank

AATCGGATTGCATAA

ATGGACTGGC

Query sequence

alignment

ATCGGATTG

|| ||| ||

AT.GGACTG

Speed Gain of the heuristics

• Example : 2 sequences of DNA of size 1000

• Dynamic Programming :

– 1000 x 1000 = $10^6$ matrix-cells

• Heuristics with W = 8

– statistically : number of hits = $10^6 \times (1 / 4^8) \approx 15$ (1 for W=10)

– 1000 searches in the dictionary ($10 \times 10^3$ operations)

– 15 extensions ($15 \times 10^3$ operations)

• Gain : $10^6 / (2.5 \times 10^4) = 40$ (87 for W=10)
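The estimate can be reproduced with a rough cost model : 10 operations per dictionary search and 1000 operations per extension, as read from the figures above. These unit costs and the function name are assumptions of this sketch. With them the W = 10 case comes out near 90 rather than exactly the quoted 87, so the model is only an approximation of the one behind the slide.

```python
def heuristic_gain(length=1000, w=8, alphabet=4, lookup_cost=10, extension_cost=1000):
    """Return (gain, expected number of hits) for two sequences of the given length."""
    dp_cells = length * length                   # cost of dynamic programming
    expected_hits = dp_cells / alphabet ** w     # random hits of size w
    heuristic_ops = length * lookup_cost + expected_hits * extension_cost
    return dp_cells / heuristic_ops, expected_hits


for w in (8, 10):
    gain, hits = heuristic_gain(w=w)
    print(w, round(hits, 1), round(gain))
```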


Tradeoff sensitivity / rapidity

[Figure : as W goes from 1 to 12, sensitivity decreases and rapidity increases]

• The higher W, the faster the computation (and the lower the sensitivity)

• W is a parameter of the software (Blast, Fasta)

Software for searching in data banks

• Blast

– 2 implementations :

• NCBI BLAST, National Center for Biotechnology Information

• WU-BLAST, W. Gish, Washington University

• Fasta

– W. Pearson, University of Virginia

• …


BLAST

• Many sites

– check default parameters

• Main parameters :

– W : size of subword

• DNA : default value 11

• Protein : default value 3

– E : expected value

– -G : gap opening (default 11)

– -E : gap extension (default 1)

Various Blasts

NAME      Query            Bank                        Common Usage

BLASTN    DNA              DNA                         Look for identities + pattern splicing

BLASTP    Protein          Proteins                    Search of homologous proteins

TBLASTN   Protein          DNA translated (6 phases)   Search of genes to be annotated in banks

BLASTX    DNA translated   Proteins                    Search of homologous genes and proteins

TBLASTX   DNA translated   DNA translated              Discovery of the structure of genes


Recent Blasts

• Blast2 (Gapped Blast, 1997) : version of Blast with indels, Dust and Seg filtering, and threads for multi-processors.

• PSI-Blast (Position Specific Iterated Blast) : iterative version of Blast, aimed at phylogeny studies, where the comparison matrix used at one step takes into account the consensus found at the previous step from a multiple alignment of the sequences with a sufficient E-value, until the process has converged.

• PHI-Blast (Pattern-Hit Initiated Blast) : fast version of Blast, that may be combined with PSI-Blast, where one gives a pattern (Prosite-like regular expression) that must be used in the alignment of sequences.

Dotplots :

The simplest way to see alignments

Dotplot = identity matrix of two sequences

Similarities =

diagonals





Filtering, Smoothing

(window 4, 1 error)

DNA /cDNA of actin gene from

muscle of Pisaster ochraceus

(horizontal / vertical)
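A filtered dotplot can be sketched as follows : a dot is kept at a position only if the two windows starting there differ in at most a given number of characters (window 4 and 1 error in the figure above). The function names and the text rendering are choices made for this illustration.

```python
def dotplot(seq_h, seq_v, window=4, max_errors=1):
    """Boolean matrix: rows follow seq_v (vertical), columns follow seq_h (horizontal).

    A cell is True when the windows starting at that row and column
    differ in at most max_errors positions.
    """
    rows = []
    for i in range(len(seq_v) - window + 1):
        row = []
        for j in range(len(seq_h) - window + 1):
            errors = sum(1 for k in range(window) if seq_v[i + k] != seq_h[j + k])
            row.append(errors <= max_errors)
        rows.append(row)
    return rows


def render(rows):
    """Text picture of the dotplot, '*' for a dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in rows)


print(render(dotplot("ACGTACGT", "ACGTACGT", window=4, max_errors=0)))
```

With window 1 and no error allowed this reduces to the plain identity matrix of the two sequences ; larger windows remove isolated dots and leave the diagonals.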


Dotter : a tool for dot plots

• Karolinska Institutet, unix & windows versions

ftp://ftp.cgr.ki.se/pub/esr/dotter/

• Complexity : linear for space, quadratic for time (15 min for 30000x30000)

« A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequences » E. Sonnhammer, R. Durbin, Gene 167: GC1-10, 1995
