The Nobel Prize 2005
(medicine)
Barry J. Marshall and J. Robin
Warren
for research on bacterium
Helicobacter pylori
– '79 stomach: difficult environment
– '81 B. Marshall joins
– '83 H. pylori → ulcers
http://nobelprize.org/medicine/laureates/2005
H. pylori
Exclusively human pathogen
50% of population
10% of infected experience ulcers
• before 1984 attributed to stress, diet and
excess gastric acidity
Genome
– 1.6Mb
– 90% coding
– 1600 genes
Alignments
Algorithm for Genomes
Searching for similarities
What is the function of the new
gene?
The “lazy” investigation:
– Find the set of similar proteins
– Identify similarities and differences
– For long proteins: identify domains
Is similarity really
interesting?
Common ancestry is more
interesting
Makes it more likely that genes
share the same function
Homology: sharing a common
ancestor
– a binary property (yes/no)
– It’s a nice tool:
When (a known gene) G is
homologous to (an unknown) X it
means that we gain a lot of
information on X
[Figure: genes X, Y and Z, linked by functional and evolutionary relation]
Guessing homology:
– Based on sequence
• Identity (simplest method)
• Similarity
– Other (e.g., 3D structure)
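As a minimal sketch of the identity measure mentioned above: the function below computes percent identity over the columns of an existing alignment. The function name, the `-` gap symbol and the choice of alignment length as denominator are assumptions made for illustration; other conventions (for example dividing by the shorter sequence length) are also in use.

```python
def percent_identity(row_x: str, row_y: str, gap: str = "-") -> float:
    """Percent of alignment columns that hold the same residue in both rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    if not row_x:
        return 0.0
    # A column counts as identical only if both rows hold the same non-gap letter.
    same = sum(1 for a, b in zip(row_x, row_y) if a == b and a != gap)
    return 100.0 * same / len(row_x)


if __name__ == "__main__":
    # Hypothetical aligned pair, used only to exercise the function.
    print(percent_identity("GTA", "GCA"))
```

Identity treats every mismatch alike; similarity-based scores refine this by weighting substitutions differently.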
Functional relation
Sequence → Structure → Function (each determines the next)
Evolution and 3D structure
Isocitrate
dehydrogenase:
The distance from
the active site
(yellow) determines
the rate of evolution
Dean, A. M. and G. B. Golding. 2000,
Enzyme evolution explained (sort of),
Pacific Symposium on Biocomputing 2000
How to determine similarity
Frequent evolutionary events:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
Evolution at work
[Figure: a common ancestor Z, usually extinct, diverges into X and Y; only X and Y are available, so we'll use only these]
Alignment: the problem
AlgorithmsforGenomes
algorithms4genomes
[Figure: the two strings above aligned step by step, with "?" marking the positions that differ]
Operations:
– substitution,
– Insertion
– deletion
GTACTC
GGATGAC
Many ways to align
[Figure: three candidate alignments, 1) to 3), of GTA with GCA]
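To give a feel for how many ways there are, here is a small sketch that counts all gapped alignments of two sequences of lengths n and m. It assumes every column is either a residue pair, a gap in the first row, or a gap in the second row, and it counts alignments that differ only in the order of adjacent gaps as distinct. The function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of column-by-column alignments of sequences of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only one way left: gap out the remaining letters
    # The last column is a residue pair, a gap in row 2, or a gap in row 1.
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


if __name__ == "__main__":
    print(count_alignments(3, 3))  # lengths of GTA and GCA
    print(count_alignments(6, 7))  # lengths of GTACTC and GGATGAC
```

The count grows quickly with sequence length, which is why enumerating alignments is not practical and a scoring scheme plus an efficient search is needed.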
Which alignment is
better?
– Reflecting evolution ~=
– Most probable ~=
– Simplest
Scoring
GTACTC
GGATGAC
Should give reasonable alignments
And have to assign scores to:
– Substitution (or match/mismatch)
– Gap penalty
• Linear: g(k)=αk
• Affine: g(k)=β+αk
• Concave, e.g.: g(k)=log(k)
[Figure: two alignments of GTA with ATG]
The score for an alignment is the sum of
scores of all columns
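A minimal sketch of column-wise scoring, assuming the linear gap case (every gap column costs the same) and illustrative values of +1 for a match, -1 for a mismatch and -2 per gap column. The function name and the `-` gap symbol are assumptions.

```python
def alignment_score(row_x: str, row_y: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the scores of all columns of an alignment (linear gap penalty)."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold two gaps")
        if a == "-" or b == "-":
            total += gap          # insertion or deletion
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # substitution
    return total


if __name__ == "__main__":
    print(alignment_score("GTA", "GCA"))      # no gaps, one substitution
    print(alignment_score("GTA--", "--ATG"))  # four gap columns, one match
```

With affine or concave penalties the cost of a gap column depends on its neighbours, so the sum has to be taken over gap runs rather than over single columns.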
Substitution matrices
Define a score for
match/mismatch of letters
DNA
–  Simple:
      A     C     G     T
A     1    -1    -1    -1
C    -1     1    -1    -1
G    -1    -1     1    -1
T    -1    -1    -1     1
–  Used in genome alignments
      A     C     G     T
A    91  -114   -31  -123
C  -114   100  -125   -31
G   -31  -125   100  -114
T  -123   -31  -114    91
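The two DNA matrices can be held as nested dictionaries and looked up per column. The sketch below assumes the off-diagonal entries are negative (the minus signs are an interpretation of the slide); the names `UNIT`, `GENOME` and `ungapped_score` are hypothetical.

```python
BASES = "ACGT"

# Simple matrix: +1 on the diagonal, -1 elsewhere (assumed signs).
UNIT = {a: {b: (1 if a == b else -1) for b in BASES} for a in BASES}

# Matrix used in genome alignments (assumed signs). Transitions (A<->G, C<->T)
# are penalised less than transversions.
GENOME = {
    "A": {"A": 91, "C": -114, "G": -31, "T": -123},
    "C": {"A": -114, "C": 100, "G": -125, "T": -31},
    "G": {"A": -31, "C": -125, "G": 100, "T": -114},
    "T": {"A": -123, "C": -31, "G": -114, "T": 91},
}


def ungapped_score(x: str, y: str, matrix: dict) -> int:
    """Score two equal-length sequences column by column, without gaps."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(matrix[a][b] for a, b in zip(x, y))


if __name__ == "__main__":
    print(ungapped_score("GTA", "GCA", UNIT))
    print(ungapped_score("GTA", "GCA", GENOME))
```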
http://www.cimr.cam.ac.uk/links/codon.htm
Substitution matrices for aa
Amino acids are not equal:
1. Some are easily
substituted, similar:
• biochemical properties
• structure
2. Some mutations occur
more often due to similar
codons
The two above give us
substitution matrices
http://www.people.virginia.edu/~rjh9u/aminacid.html
orange: nonpolar and hydrophobic
green: polar and hydrophilic
magenta box: acidic
light blue box: basic
BLOSUM 62 substitution
matrix
http://www.carverlab.org/testing/epp.html
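Constant vs. affine gap scoring
Gap scoring    Constant   Affine
intro              2         3
extension          2         1
…and +1 for match
A gap of length k costs intro + (k - 1) × extension.
Alignment 1:
Seq1      G   T   A   -   -
Seq2      -   -   A   T   G
Const    -2  -2  +1  -2  -2   (SUM = -7)
Affine     -4    +1    -4     (SUM = -7)
Alignment 2:
Seq1      G   -   T   A   -
Seq2      -   A   T   -   G
Const    -2  -2  +1  -2  -2   (SUM = -7)
Affine   -3  -3  +1  -3  -3   (SUM = -11)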
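The sums in this comparison can be checked with a short sketch. It assumes that a gap run of length k costs intro + (k - 1) × extension, that `-` marks a gap, and that the two alignments are the ones written out in the code; the mismatch score of -1 is an assumption that is not used here, since neither alignment contains a substitution.

```python
from itertools import groupby


def gapped_score(row_x, row_y, intro, extension, match=1, mismatch=-1):
    """Score an alignment; a gap run of length k costs intro + (k - 1) * extension."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    # Label every column: gap in row_x, gap in row_y, or a residue pair.
    kinds = ["gap_x" if a == "-" else "gap_y" if b == "-" else "pair"
             for a, b in zip(row_x, row_y)]
    total, pos = 0, 0
    for kind, run in groupby(kinds):
        length = len(list(run))
        if kind == "pair":
            for a, b in zip(row_x[pos:pos + length], row_y[pos:pos + length]):
                total += match if a == b else mismatch
        else:
            total -= intro + (length - 1) * extension
        pos += length
    return total


ALIGNMENTS = [("GTA--", "--ATG"), ("G-TA-", "-AT-G")]

if __name__ == "__main__":
    for row_x, row_y in ALIGNMENTS:
        constant = gapped_score(row_x, row_y, intro=2, extension=2)
        affine = gapped_score(row_x, row_y, intro=3, extension=1)
        print(row_x, row_y, "constant:", constant, "affine:", affine)
```

Under constant scoring both alignments tie, while affine scoring prefers the one with fewer, longer gaps, which is the motivation for the more expensive gap introduction.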
Global alignment.
The Needleman-Wunsch
algorithm
Goal: find the maximal scoring alignment
Scores: m match, s mismatch, g for
insertion/deletion
Dynamic programming
– Solve smaller subproblem(s)
– Iteratively extend the solution
The best alignment for X[1…i] and Y[1…j] is
called M[i, j]
X[1] … X[i-1] X[i]
Y[1] … Y[j-1] Y[j]
The algorithm
Goal: find the maximal scoring alignment
Scores: m match, s mismatch, g for
insertion/deletion
The best alignment for X[1…i] and Y[1…j]
is called M[i, j]
3 ways to extend the alignment:
1) X[1…i-1] X[i]      2) X[1…i]   -        3) X[1…i-1] X[i]
   Y[1…j-1] Y[j]         Y[1…j-1] Y[j]        Y[1…j]   -
1) M[i, j] = M[i-1, j-1] + m (match) or - s (mismatch)
2) M[i, j] = M[i, j-1] - g
3) M[i, j] = M[i-1, j] - g
The algorithm – final
equation
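M[i, j] = max {
    M[i-1, j-1] + score(X[i], Y[j])    (diagonal, ↖)
    M[i, j-1] - g                      (from the left, ←)
    M[i-1, j] - g                      (from above, ↑)
}
Corresponds to:
↖  X[1] … X[i-1] X[i]
   Y[1] … Y[j-1] Y[j]
←  X[1] … X[i]   -
   Y[1] … Y[j-1] Y[j]
↑  X[1] … X[i-1] X[i]
   Y[1] … Y[j]   -
[Figure: cell [i, j] of the matrix with its three neighbours [i-1, j-1], [i, j-1] and [i-1, j]; the horizontal and vertical steps each cost g]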
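The recurrence translates almost line by line into a matrix fill. The sketch below is one possible reading, with illustrative default scores (match +1, mismatch -1, g = 2) that are assumptions at this point; `nw_matrix` is a hypothetical name. X indexes the rows (i) and Y the columns (j).

```python
def nw_matrix(x: str, y: str, match: int = 1, mismatch: int = -1, g: int = 2):
    """Fill the global alignment matrix M, where M[i][j] scores X[1..i] vs Y[1..j]."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - g
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # X[i] paired with Y[j]
                          M[i][j - 1] - g,       # Y[j] against a gap
                          M[i - 1][j] - g)       # X[i] against a gap
    return M


if __name__ == "__main__":
    for row in nw_matrix("GAGGCGA", "GAGTGA"):
        print(row)
```

The score of the best global alignment is the bottom-right cell, `M[len(x)][len(y)]`; filling the matrix takes time and memory proportional to the product of the two lengths.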
Example:
global alignment of two
sequences
Align two DNA sequences:
– GAGTGA
– GAGGCGA (note the length difference)
Parameters of the algorithm:
– Match: score(A,A) = 1
– Mismatch: score(A,T) = -1
– Gap: g = 2
M[i, j] = max { M[i-1, j-1] ± 1,  M[i, j-1] - 2,  M[i-1, j] - 2 }
The algorithm. Step 1: init
Create the matrix
M[i, j] = max { M[i-1, j-1] ± 1,  M[i, j-1] - 2,  M[i-1, j] - 2 }
Initiation
– 0 at [0,0]
– Apply the equation…
[Matrix: rows i↓ = 0…7 labelled -, G, A, G, G, C, G, A; columns j→ = 0…6 labelled -, G, A, G, T, G, A; cell [0,0] holds 0]
The algorithm. Step 1: init
M[i, j] = max { M[i-1, j-1] ± 1,  M[i, j-1] - 2,  M[i-1, j] - 2 }
Initiation of the matrix:
– 0 at pos [0,0]
– Fill in the first row using the “←” rule
– Fill in the first column using “↑”
i↓ j→    -    G    A    G    T    G    A
-        0   -2   -4   -6   -8  -10  -12
G       -2
A       -4
G       -6
G       -8
C      -10
G      -12
A      -14
The algorithm. Step 2: fill in
M[i, j] = max { M[i-1, j-1] ± 1,  M[i, j-1] - 2,  M[i-1, j] - 2 }
Continue filling in the matrix, remembering from which cell
the result comes (arrows)
i↓ j→    -    G    A    G    T    G    A
-        0   -2   -4   -6   -8  -10  -12
G       -2    1   -1
A       -4   -1    2
G       -6   -3
G       -8
C      -10
G      -12
A      -14
The algorithm. Step 2: fill in
M[i, j] = max { M[i-1, j-1] ± 1,  M[i, j-1] - 2,  M[i-1, j] - 2 }
We are done…
Where’s the result?
i↓ j→    -    G    A    G    T    G    A
-        0   -2   -4   -6   -8  -10  -12
G       -2    1   -1   -3   -5   -7   -9
A       -4   -1    2    0   -2   -4   -6
G       -6   -3    0    3    1   -1   -3
G       -8   -5   -2    1    2    2    0
C      -10   -7   -4   -1    0    1    1
G      -12   -9   -6   -3   -2    1    0
A      -14  -11   -8   -5   -4   -1    2
The algorithm. Step 3:
backtrace
Start at the last cell
of the matrix
Go in the direction
of arrows
Sometimes the
value may be
obtained from more
than one cell
(which one?)
M[i, j] = max { M[i-1, j-1] ± 1,  M[i, j-1] - 2,  M[i-1, j] - 2 }
i↓ j→    -    G    A    G    T    G    A
-        0   -2   -4   -6   -8  -10  -12
G       -2    1   -1   -3   -5   -7   -9
A       -4   -1    2    0   -2   -4   -6
G       -6   -3    0    3    1   -1   -3
G       -8   -5   -2    1    2    2    0
C      -10   -7   -4   -1    0    1    1
G      -12   -9   -6   -3   -2    1    0
A      -14  -11   -8   -5   -4   -1    2
The algorithm. Step 3:
backtrace
Extract the
alignments
[Figure: the filled matrix with the backtrace paths marked; the paths give two alignments, a) and b), of GAGTGA with GAGGCGA]
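The backtrace can be sketched as a walk from the last cell back to [0,0] that follows every predecessor able to produce the current value. The code below is an illustration under the example's parameters (match +1, mismatch -1, g = 2); the function name is hypothetical. Because it follows every tie, it may report more co-optimal alignments than a slide has room to draw.

```python
def needleman_wunsch_all(x: str, y: str, match: int = 1, mismatch: int = -1, g: int = 2):
    """Return the optimal global score and every co-optimal alignment."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - g
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] - g, M[i - 1][j] - g)

    alignments = []

    def walk(i, j, row_x, row_y):
        # row_x and row_y are built back to front and reversed at the end.
        if i == 0 and j == 0:
            alignments.append((row_x[::-1], row_y[::-1]))
            return
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:      # diagonal arrow
                walk(i - 1, j - 1, row_x + x[i - 1], row_y + y[j - 1])
        if i > 0 and M[i][j] == M[i - 1][j] - g:    # arrow from above
            walk(i - 1, j, row_x + x[i - 1], row_y + "-")
        if j > 0 and M[i][j] == M[i][j - 1] - g:    # arrow from the left
            walk(i, j - 1, row_x + "-", row_y + y[j - 1])

    walk(n, m, "", "")
    return M[n][m], alignments


if __name__ == "__main__":
    score, found = needleman_wunsch_all("GAGGCGA", "GAGTGA")
    print("score:", score)
    for row_x, row_y in found:
        print(row_x)
        print(row_y)
        print()
```

Storing the arrows during the fill, as the slides suggest, avoids recomputing the three candidates during the walk; recomputing them keeps this sketch shorter.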
Affine gaps
• Penalties:
g_o: gap opening (e.g. 8)
g_e: gap extension (e.g. 2)
Both are added in the equations below, so they enter as negative scores.
M[i, j] = score(X[i], Y[j]) + max { M[i-1, j-1],  I_x[i-1, j-1],  I_y[i-1, j-1] }
    X[1] … X[i]
    Y[1] … Y[j]
I_x[i, j] = max { M[i, j-1] + g_o,  I_x[i, j-1] + g_e }
    open:    X[1] … X[i]    -        extend:  X[1] …   -      -
             Y[1] … Y[j-1] Y[j]               Y[1] … Y[j-1] Y[j]
I_y[i, j] = max { M[i-1, j] + g_o,  I_y[i-1, j] + g_e }
    X[1] … X[i]
    Y[1] …  -
• @ home: think of boundary values M[*,0], I[*,0] etc.
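A sketch of the three-matrix scheme, computing the score only. The boundary values used here are one possible choice rather than the lecture's answer: M is 0 at [0,0] and minus infinity elsewhere on the border, while the gap matrices hold the cost of a single leading gap. The penalties are passed as negative numbers, and the function name is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(x: str, y: str, g_o: int = -8, g_e: int = -2,
                        match: int = 1, mismatch: int = -1):
    """Best global alignment score with affine gaps (g_o, g_e are negative)."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # X[i] paired with Y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # gap in X under Y[j]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # gap in Y above X[i]
    M[0][0] = 0
    # Assumed boundaries: one leading gap, opened once and then extended.
    for j in range(1, m + 1):
        Ix[0][j] = g_o + (j - 1) * g_e
    for i in range(1, n + 1):
        Iy[i][0] = g_o + (i - 1) * g_e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(M[i][j - 1] + g_o, Ix[i][j - 1] + g_e)
            Iy[i][j] = max(M[i - 1][j] + g_o, Iy[i - 1][j] + g_e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    print(affine_global_score("GAAAT", "GT"))
```

As in the equations, a gap in one sequence can only be opened from M, so a gap in X is never followed directly by a gap in Y.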
Variation on global alignment
Global alignment: the previous algorithm is
called global alignment because it uses all
letters from both sequences.
CAGCACTTGGATTCTCGG
CAGCGTGG
Semiglobal alignment: don’t penalize for
start/end gaps (omit the start/end of
sequences).
CAGCACTTGGATTCTCGG
CAGCGTGG
– Applications of semiglobal:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
– Danger! Really bad alignments for divergent sequences
[Figure: seq X aligned against seq Y]
Semiglobal alignment
Ignore start gaps
– First row/column
set to 0
Ignore end gaps
– Read the result
from last
row/column

(match +1, mismatch -1, gap -2, as in the global example)
      -    G    A    G    T    G
-     0    0    0    0    0    0
A     0   -1    1   -1   -1   -1
G     0    1   -1    2    0    0
T     0   -1    0    0    3    1
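The two changes for semiglobal alignment (zero first row and column, result read from the last row or column) can be sketched as follows. The scores (match +1, mismatch -1, g = 2) are assumed to be the same as in the global example, and the function name is hypothetical.

```python
def semiglobal_score(x: str, y: str, match: int = 1, mismatch: int = -1, g: int = 2):
    """Return (best score, matrix) for an alignment with free start and end gaps."""
    n, m = len(x), len(y)
    # Ignore start gaps: the first row and column stay at 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] - g, M[i - 1][j] - g)
    # Ignore end gaps: take the best value in the last row or last column.
    best = max(max(M[n]), max(row[m] for row in M))
    return best, M


if __name__ == "__main__":
    best, M = semiglobal_score("AGT", "GAGTG")
    print("best:", best)
    for row in M:
        print(row)
```

A backtrace would start at the cell holding the best value and stop on reaching the first row or column.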
A heuristic
– Heuristic: a rule of thumb that often helps in solving
a certain class of problems, but makes no guarantees.
Perkins, DN (1981) The Mind's Best Work