
Algorithms in Bioinformatics II, SS 2002, Uni Tübingen, Daniel Huson

Algorithms in Bioinformatics II

Lecture, summer semester 2002,

WSI-Informatik, Universität Tübingen

Prof. Dr. Daniel Huson

huson@informatik.uni-tuebingen.de 

0.1 Organization

Lecture: Mon, Wed, 10-12h c.t., A301, Sand 1

Tutorial groups:

Tue 10-12h C306 Sand 14 Ulrike von Luxburg

Wed 15-17h Kleiner Hörsaal, Sand 6/7 Christian Rausch

Requirement for course credit: regular attendance in a tutorial group, 60% of the achievable points, and at most two people working together on each sheet.

Office hours:

Wed 16-18h C310a, Sand 14 Daniel Huson

Wed 17-18h C324b, Sand 14 Christian Rausch

Web: www-ab.informatik.uni-tuebingen.de/lehre/ss02/bioinformatik2.html 

0.2 Contents

1. Report on the sequencing of the human genome

2. Markov chains and hidden Markov models

3. Suffix trees

4. Assembly

5. Gene finding

6. Phylogeny

7. Structure prediction

1.3 1 Markov chains and hidden Markov models

1.1 Markov chains

1.2 Hidden Markov models

1.3 Profile HMMs

Literature: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge, 1998


1.4 1.1 Markov chains

Example: CpG islands in the human genome.

Double-stranded DNA:

...ApCpCpApTpGpApTpGpCpApGpGpApCpTpTpCpCpApTpCpGpTpTpCpGpCpGp... 

...| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... 

...TpGpGpTpApCpTpApCpGpTpCpCpTpGpApApGpGpTpApGpCpApApGpCpGpCp... 

In a CpG pair the C is frequently methylated (i.e. an H atom is replaced by a CH3 group). A methyl-C mutates to T with increased probability. Consequently, the pair CpG is underrepresented in the genome.

Upstream of a gene this methylation is suppressed for biological reasons. This gives rise to so-called CpG islands of length 100-5000, which are characterized by the fact that CpG pairs are not underrepresented there.

1.5 CpG islands

CpG islands are useful markers for genes in organisms whose genomes contain 5-methylcytosine.

CpG islands in the promoter regions of genes play a role in the inactivation of the X chromosome, in imprinting, and in the silencing of intragenomic parasites.

Classical definition: a DNA sequence of length 200 with a C+G content of 50% and a ratio of observed to expected number of CpGs above 0.6.

(Gardiner-Garden & Frommer, 1987)
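The classical definition can be turned into a small check. The following is only a sketch: the function names are hypothetical, the length and C+G thresholds are treated as minimum values, and the expected number of CpGs is assumed to be (#C · #G) / length, which is not stated in the text above.

```python
def cpg_stats(seq):
    """Return (C+G content, observed/expected CpG ratio) of a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # Observed number of CpG dinucleotides (a C directly followed by a G).
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    gc_content = (c + g) / n if n else 0.0
    # Assumption: expected CpG count = (#C * #G) / sequence length.
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected > 0 else 0.0
    return gc_content, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three criteria, read as lower bounds (an assumption)."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and ratio > min_ratio
```

For example, `looks_like_cpg_island("CG" * 100)` holds, whereas a sequence without any C or G fails the C+G criterion.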

According to a very recent study, the two human chromosomes 21 and 22 together contain about 1100 CpG islands and about 750 genes.

(Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P. A. Jones, PNAS, March 19, 2002)

1.6 Questions

1. Given a short genomic sequence. How can we decide whether this sequence comes from a CpG island?

2. Given a long genomic sequence. How do we find all CpG islands contained in it?

We want to set up a probabilistic model for CpG islands. Since pairs of consecutive nucleotides matter, we need a model in which the probability of a symbol depends on the preceding symbol, that is, a Markov chain.


Example:

[Figure: a Markov chain with the four states A, C, G and T, connected by transition arrows.]

Circles = states, e.g. with names A, C, G and T.

Arrows = possible transitions, each labelled with a transition probability $a_{st} = P(x_i = t \mid x_{i-1} = s)$.


Definition A (time-homogeneous) Markov chain (of order 1) is a system (S, A), given by a finite set of states $S = \{s_1, s_2, \dots, s_n\}$ and a transition matrix $A = \{a_{st}\}$ with $\sum_{t \in S} a_{st} = 1$ for all $s \in S$, which specifies the probability of the transition $s \to t$:

$P(x_{i+1} = t \mid x_i = s) = a_{st}$.

Example Weather in Tübingen, daily at 12h: the possible states are rain (R), sun (S) or clouds (W).

Transition probabilities:

       R    S    W
  R   .5   .1   .4
  S   .2   .6   .2
  W   .3   .3   .4

Weather: ...rrrrrrwwsssssswswswwwrrwrwssss...

Given a sequence of states $x_1, x_2, x_3, \dots, x_L$. What is the probability that a given Markov chain runs through exactly this sequence of states?

$$
\begin{aligned}
P(x) &= P(x_L, x_{L-1}, \dots, x_1) \\
     &= P(x_L \mid x_{L-1}, \dots, x_1)\, P(x_{L-1} \mid x_{L-2}, \dots, x_1) \cdots P(x_1) \\
     &= P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) \\
     &= P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}.
\end{aligned}
$$

The second line follows by repeated application of $P(X, Y) = P(X \mid Y) P(Y)$. The third line uses $P(x_i \mid x_{i-1}, \dots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i}$, which holds because this is a Markov chain.
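The product formula can be evaluated directly for the weather chain. This is a minimal sketch: the function name is hypothetical, and the uniform distribution used for $P(x_1)$ is an assumption, since the notes do not specify initial probabilities for the weather example.

```python
STATES = "RSW"  # rain, sun, clouds

# Transition probabilities of the weather example (rows: from, columns: to).
A = {
    "R": {"R": 0.5, "S": 0.1, "W": 0.4},
    "S": {"R": 0.2, "S": 0.6, "W": 0.2},
    "W": {"R": 0.3, "S": 0.3, "W": 0.4},
}

# Assumption: every state is equally likely on the first day.
UNIFORM_START = {s: 1 / 3 for s in STATES}


def chain_probability(x, a, p_first):
    """P(x) = P(x_1) * product over i >= 2 of a[x_{i-1}][x_i]."""
    p = p_first[x[0]]
    for s, t in zip(x, x[1:]):
        p *= a[s][t]
    return p


# Sun, sun, rain: 1/3 * 0.6 * 0.2
p_ssr = chain_probability("SSR", A, UNIFORM_START)
```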

1.9 Modelling the begin and end states

In the description so far we have not modelled the initial probabilities $P(x_1)$.

We add a begin state with label b to the model and from now on always assume that $x_0 = b$. Then:

$P(x_1 = s) = a_{bs} = P(s)$,

where P(s) is the background probability of symbol s.

We also model the end of the sequence with an end state 'e'. The probability that the sequence ends after state t is then:

$P(e \mid x_L = t) = a_{te}$.

1.10 Extending the model

Example:

[Figure: the Markov chain with states A, C, G and T, extended by a begin state b and an end state e.]

# Markov chain that generates CpG islands 

# (Source: DEMK98, p 50) 

# Number of states: 

6 

# State labels: 

A C G T * + 

# Transition matrix: 

0.1795 0.2735 0.4255 0.1195 0 0.002 

0.1705 0.3675 0.2735 0.1875 0 0.002 

0.1605 0.3385 0.3745 0.1245 0 0.002 

0.0785 0.3545 0.3835 0.1815 0 0.002 

0.2495 0.2945 0.2495 0.2945 0 0.002 

0.0000 0.0000 0.0000 0.0000 0 1.000 

1.11 Computing the transition matrix

The transition matrix $A^+$ for DNA that comes from a CpG island is computed as follows:

$a^+_{st} = \dfrac{c^+_{st}}{\sum_{t'} c^+_{st'}}$,

where $c^+_{st}$ is the number of positions in a training set of CpG islands at which state s is followed by state t.

$A^-$ is determined empirically in the same way.
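The counting rule can be sketched in a few lines. The function name and the toy training sequence are hypothetical; real training data would be a collection of annotated CpG islands (for $A^+$) or other genomic sequence (for $A^-$).

```python
from collections import Counter


def estimate_transitions(training_seqs, alphabet="ACGT"):
    """Estimate a[s][t] = c[s,t] / sum over t' of c[s,t'] from sequences."""
    counts = Counter()
    for seq in training_seqs:
        # Count every pair of adjacent symbols (s followed by t).
        for s, t in zip(seq, seq[1:]):
            counts[s, t] += 1
    a = {}
    for s in alphabet:
        row_total = sum(counts[s, t] for t in alphabet)
        # A state that never occurs as a predecessor gets an all-zero row here.
        a[s] = {t: (counts[s, t] / row_total if row_total else 0.0)
                for t in alphabet}
    return a


# Toy example: the pairs are AC, CG, GC, CG, GT.
a_toy = estimate_transitions(["ACGCGT"])
```

In this toy example every C is followed by G, so `a_toy["C"]["G"]` is 1, while G is followed once by C and once by T.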


1.12 Two examples of Markov chains

# Markov chain for CpG islands
# (Source: DEMK98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
.1795 .2735 .4255 .1195 0 0.002
.1705 .3675 .2735 .1875 0 0.002
.1605 .3385 .3745 .1245 0 0.002
.0785 .3545 .3835 .1815 0 0.002
.2495 .2945 .2495 .2945 0 0.002
.0000 .0000 .0000 .0000 0 1.000

# Markov chain for non-CpG islands
# (Source: DEMK98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
.2995 .2045 .2845 .2095 0 .002
.3215 .2975 .0775 .3015 0 .002
.2475 .2455 .2975 .2075 0 .002
.1765 .2385 .2915 .2915 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.00

1.13 Answering question 1

Given a short sequence $x = (x_1, x_2, \dots, x_L)$. Does it come from a CpG island (Model+)?

$P(x \mid \text{Model}+) = \prod_{i=0}^{L} a^+_{x_i x_{i+1}}$, with $x_0 = b$ and $x_{L+1} = e$.

We use the following score:

$S(x) = \log \dfrac{P(x \mid \text{Model}+)}{P(x \mid \text{Model}-)} = \sum_{i=1}^{L} \log \dfrac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}}$.

(The transition into the end state has the same probability in both models and therefore drops out of the sum.)

The higher this score, the more likely it is that x is a CpG island.
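A sketch of the log-odds score using the two transition matrices listed in the notes. Several choices here are assumptions: the function name, the use of base-2 logarithms, and the omission of the begin and end transitions (only transitions between nucleotides are scored). The last entry of the C row of the non-island matrix is taken as .3015, the value implied by the corresponding row of the CpG HMM transition matrix given later in these notes.

```python
import math

ALPHABET = "ACGT"

# Transition probabilities between nucleotides (rows: from, columns: to).
PLUS = [
    [0.1795, 0.2735, 0.4255, 0.1195],
    [0.1705, 0.3675, 0.2735, 0.1875],
    [0.1605, 0.3385, 0.3745, 0.1245],
    [0.0785, 0.3545, 0.3835, 0.1815],
]
MINUS = [
    [0.2995, 0.2045, 0.2845, 0.2095],
    [0.3215, 0.2975, 0.0775, 0.3015],  # last entry: assumed value, see above
    [0.2475, 0.2455, 0.2975, 0.2075],
    [0.1765, 0.2385, 0.2915, 0.2915],
]


def log_odds(x, base=2):
    """Sum of log(a+ / a-) over all adjacent nucleotide pairs of x."""
    idx = [ALPHABET.index(ch) for ch in x.upper()]
    return sum(math.log(PLUS[s][t] / MINUS[s][t], base)
               for s, t in zip(idx, idx[1:]))
```

With these matrices a CG-rich string such as "CGCG" gets a positive score and "TATA" a negative one.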

1.14 Questions to ask of a Markov chain

Example Weather in Tübingen, daily at 12h: the possible states are rain (R), sun (S) or clouds (W).

Transition probabilities:

       R    S    W
  R   .5   .1   .4
  S   .2   .6   .2
  W   .3   .3   .4

Questions that can be asked of the model:

If the sun is shining today, what is the probability that the sun shines for the next seven days?

What is the probability that it rains for a whole month?
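As a rough worked example for the first question, assume that "the next seven days" means seven consecutive transitions starting from a sunny day. Then only the entry $a_{SS} = 0.6$ of the table is needed:

$$
P(\text{sun on each of the next 7 days} \mid \text{sun today}) = a_{SS}^{7} = 0.6^{7} \approx 0.028.
$$

Under the analogous assumptions that it rains today and that a month means 30 further days, the second question gives $a_{RR}^{30} = 0.5^{30} \approx 9.3 \times 10^{-10}$.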

1.15 Hidden Markov models (HMMs)

Motivation: question 2, how do we detect CpG islands within a long sequence?

One option is a window method: a window of width w is slid along the sequence and the score is plotted. Problems: the boundaries of the CpG islands are not detected sharply, and it is unclear which window size w to choose.

Approach: the two Markov chains Model+ and Model− are combined into a so-called hidden Markov model.

1.16 Hidden Markov models

Definition An HMM is a system M = (S, Q, A, e), with

• an alphabet S,

• a set of states Q,

• a matrix $A = \{a_{kl}\}$ of transition probabilities $a_{kl}$ for k, l ∈ Q, and

• an emission probability $e_k(b)$ for every k ∈ Q and b ∈ S.

1.17 Example

An HMM for CpG islands:

[Figure: the eight states A+, C+, G+, T+ and A−, C−, G−, T−, with transitions between the two sets.]

(In addition there are all transitions within each of the two sets, which are taken over from the two Markov chains Model+ and Model−.)

1.18 HMM for CpG islands

# Number of states:

9

# Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-): 

0 A C G T a c g t 

# Number of symbols: 

4 

# Names of symbols: 

a c g t 

# Transition matrix, probability to change from +island to -island (and vice versa) is 10E-4 

0.0000000000 0.0725193101 0.1637630296 0.1788242720 0.0754545682 0.1322050994 0.1267006624 0.1226380452 0.1278950131 

0.0010000000 0.1762237762 0.2682517483 0.4170629371 0.1174825175 0.0035964036 0.0054745255 0.0085104895 0.0023976024 

0.0010000000 0.1672435130 0.3599201597 0.2679840319 0.1838722555 0.0034131737 0.0073453094 0.0054690619 0.0037524950 

0.0010000000 0.1576223776 0.3318881119 0.3671328671 0.1223776224 0.0032167832 0.0067732268 0.0074915085 0.0024975025 

0.0010000000 0.0773426573 0.3475514486 0.3759440559 0.1781818182 0.0015784216 0.0070929071 0.0076723277 0.0036363636 

0.0010000000 0.0002997003 0.0002047952 0.0002837163 0.0002097902 0.2994005994 0.2045904096 0.2844305694 0.2095804196 

0.0010000000 0.0003216783 0.0002977023 0.0000769231 0.0003016983 0.3213566434 0.2974045954 0.0778441558 0.3013966034 

0.0010000000 0.0002477522 0.0002457542 0.0002977023 0.0002077922 0.2475044955 0.2455084915 0.2974035964 0.2075844156 

0.0010000000 0.0001768232 0.0002387612 0.0002917083 0.0002917083 0.1766463536 0.2385224775 0.2914165834 0.2914155844 

# Emission probabilities: 

0 0 0 0 

1 0 0 0 

0 1 0 0 

0 0 1 0


0 0 0 1 

1 0 0 0 

0 1 0 0 

0 0 1 0 

0 0 0 1 

From now on we use 0 for the begin and end state.

1.19 Example: fair/unfair die

Casino, two dice, fair and unfair:

[Figure: HMM with two states. State Fair emits 1, ..., 6 with probability 1/6 each. State Unfair emits 1, ..., 5 with probability 1/10 each and 6 with probability 1/2. Transitions: Fair → Fair 0.95, Fair → Unfair 0.05, Unfair → Unfair 0.9, Unfair → Fair 0.1.]

A visitor to the casino only observes the number of pips:

6 4 3 2 3 4 6 5 1 2 3 4 5 6 6 6 3 2 1 2 6 3 4 2 1 6 6...

Which die was used remains hidden:

F F F F F F F F F F F F U U U U U F F F F F F F F F F...

1.20 Example: urn model

Given p urns $U_1, U_2, \dots, U_p$. Each urn $U_i$ contains $r_i$ red, $g_i$ green and $b_i$ blue balls. An urn $U_i$ is selected at random and then a ball k is drawn from it at random (with replacement). The colour of ball k is output.

[Figure: urns U_1, U_2, ..., U_p; urn U_i contains r_i red, g_i green and b_i blue balls.]

r r g g b b g b g g g b b b r g r g b b b g g b g g b...

Here, too, the states are hidden; we only see the symbols produced.

1.21 HMM for the urn model

# Four urns



5 

# Names of states: 

# (0 begin/end, and urns A-D) 

0 A B C D 

# Number of symbols: 

3 

# red, green, blue 

r g b 

# Transition matrix: 

0 .25 .25 .25 .25 

0.01 .69 .30 .0 

0.01 .0 .69 .30 

0.01 .30 .0 .69 

# Emission probabilities:

0 0 0 

.8 .1 .1 

.2 .5 .3 

.1 .1 .8 

# EOF 

1.22 Generating synthetic data

HMMs can be used to generate data:

Algorithm

Start in state 0.

While state 0 has not been reached again:

    Choose a new state according to the transition probabilities.

    Choose a symbol according to the emission probabilities and output it.
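The generation procedure can be sketched for the fair/unfair casino HMM. The casino model as given has no transitions back to state 0, so this sketch deviates from the algorithm above in two labelled ways: it produces a fixed number of symbols instead of stopping at state 0, and it starts in the fair state. The function names are hypothetical.

```python
import random

# Casino HMM: F = fair die, U = unfair die.
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}


def draw(dist, rng):
    """Draw one key of dist with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng, start="F"):
    """Return (symbols, states) of the given length.

    Assumptions: fixed length instead of an end state, start state 'F'.
    """
    state, symbols, path = start, [], []
    for _ in range(length):
        path.append(state)
        symbols.append(draw(EMIT[state], rng))  # emit in the current state
        state = draw(TRANS[state], rng)         # then move to the next state
    return "".join(symbols), "".join(path)


symbols, states = generate(60, random.Random(2002))
```

Passing an explicit `random.Random` instance keeps a run reproducible without touching the global random state.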

1.23 A symbol sequence for the casino example

We use the fair/unfair HMM to generate a sequence of symbols:

Symbols: 24335642611341666666526562426612134635535566462666636664253 

States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF 

Symbols: 35246363252521655615445653663666511145445656621261532516435 

States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF 

Symbols: 5146526666 

States : FFUUUUUUUU 

How probable are these data?

If we only see the symbols, can we reconstruct the corresponding states?


1.24 Computing the probability when path and symbols are known

Definition A path $\pi = (\pi_1, \pi_2, \dots, \pi_L)$ is a sequence of states in the model M.

Given a sequence of symbols $x = (x_1, \dots, x_L)$ and a path $\pi = (\pi_1, \dots, \pi_L)$ through M. The joint probability is:

$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$, with $\pi_{L+1} = 0$.

Unfortunately, we usually do not know the path!
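For a known path the joint probability is a simple product, evaluated here in log space for the casino HMM. This is a sketch under assumptions: the casino model has no explicit begin and end transitions, so $a_{0k}$ is assumed to be 1/2 for both dice and the final factor $a_{\pi_L 0}$ is dropped. The function name is hypothetical.

```python
import math

START = {"F": 0.5, "U": 0.5}  # assumption: a_0F = a_0U = 1/2
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}


def joint_log_prob(x, path):
    """log P(x, path) without an end transition (assumption, see above)."""
    logp = math.log(START[path[0]])
    for i, (sym, k) in enumerate(zip(x, path)):
        logp += math.log(EMIT[k][sym])                 # e_k(x_i)
        if i + 1 < len(path):
            logp += math.log(TRANS[k][path[i + 1]])    # a_{k, next state}
    return logp


# Two sixes, both thrown with the unfair die: 0.5 * 0.5 * 0.9 * 0.5
p_uu = math.exp(joint_log_prob("66", "UU"))
```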

1.25 “Decoding” a symbol sequence

Problem: we have observed a sequence x of symbols and now want to “decode” it.

Example: the symbol sequence C G C G has several “explanations” in the CpG model, e.g.:

$(C^+, G^+, C^+, G^+)$, $(C^-, G^-, C^-, G^-)$ and $(C^-, G^+, C^-, G^+)$.

A path through the HMM determines which parts of the sequence x are interpreted as CpG islands.

1.26 The most probable path

To solve the decoding problem we want to compute the path $\pi^*$ for which the joint probability with the symbol sequence x is maximal, i.e.:

$\pi^* = \arg\max_{\pi} P(x, \pi)$.

This most probable path $\pi^*$ can be computed recursively.

Definition: The variable $v_k(i)$ is the probability of the most probable path for the prefix $(x_1, x_2, \dots, x_i)$ that ends in state k (at position i). We have:

$v_l(i+1) = e_l(x_{i+1}) \max_{k \in Q} \big(v_k(i)\, a_{kl}\big)$,

with initialization $v_0(0) = 1$.

(Exercise: show that $\arg\max_{\pi} P(x, \pi) = \arg\max_{\pi} P(\pi \mid x)$.)

1.27 Most probable path

[Figure: trellis for the CpG HMM. The columns correspond to the positions x_0, x_1, x_2, x_3, ..., x_{i-2}, x_{i-1}, x_i, x_{i+1}. Column x_0 contains only the begin state 0; every other column contains the eight states A+, C+, G+, T+, A−, C−, G−, T−.]


1.28 Viterbi algorithm

Input: HMM M = (S, Q, A, e) and symbol sequence x

Output: most probable path $\pi^*$.

Initialization (i = 0): $v_0(0) = 1$, $v_k(0) = 0$ for k ≠ 0.

For all i = 1, ..., L and l ∈ Q:

    $v_l(i) = e_l(x_i) \max_{k \in Q} \big(v_k(i-1)\, a_{kl}\big)$

    $\mathrm{ptr}_i(l) = \arg\max_{k \in Q} \big(v_k(i-1)\, a_{kl}\big)$

Termination:

    $P(x, \pi^*) = \max_{k \in Q} \big(v_k(L)\, a_{k0}\big)$

    $\pi^*_L = \arg\max_{k \in Q} \big(v_k(L)\, a_{k0}\big)$

Traceback: for all i = L, ..., 2: $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$

Implementation note: instead of multiplying many small values, add logarithms!

(Exercise: estimate the running time.)
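The following sketch implements the recursion in log space, as the implementation note suggests, for the casino HMM. As assumptions, the begin probabilities $a_{0k}$ are set to 1/2 and the end transitions $a_{k0}$ are omitted, because the casino model does not specify them; the function name is hypothetical.

```python
import math

STATES = "FU"
START = {"F": 0.5, "U": 0.5}  # assumption: a_0F = a_0U = 1/2
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}


def viterbi(x, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (most probable path, log P(x, path)); no end state is modelled."""
    # v[k] = log-probability of the best path for the current prefix ending in k.
    v = {k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}
    pointers = []
    for sym in x[1:]:
        ptr, new_v = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + math.log(trans[k][l]))
            ptr[l] = best
            new_v[l] = v[best] + math.log(trans[best][l]) + math.log(emit[l][sym])
        pointers.append(ptr)
        v = new_v
    # Traceback from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]


path_sixes, _ = viterbi("666666666666")
path_mixed, _ = viterbi("123123123123")
```

Each position considers every pair of states once, which gives a hint for the running-time exercise.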

1.29 Example for Viterbi

Given the sequence C G C G and an HMM for CpG islands. Here is a possible table of the values of v (rows: states, columns: sequence positions, starting with position 0):

  v      (0)    C      G      C       G
  0      1      0      0      0       0
  A+     0      0      0      0       0
  C+     0      .13    0      .012    0
  G+     0      0      .034   0       .0032
  T+     0      0      0      0       0
  A−     0      0      0      0       0
  C−     0      .13    0      .0026   0
  G−     0      0      .010   0       .00021
  T−     0      0      0      0       0

1.30 Viterbi decoding of the casino example

We use the fair/unfair HMM to generate a sequence of symbols and the Viterbi algorithm to decode the sequence. Result:

Symbols: 24335642611341666666526562426612134635535566462666636664253 

States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF 

Viterbi: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUFFFFFFFFFFFFUUUUUUUUUUUUUFFFF 

Symbols: 35246363252521655615445653663666511145445656621261532516435 

States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF 

Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF


Symbols: 5146526666 

States : FFUUUUUUUU 

Viterbi: FFFFFFUUUU 

1.31 Three basic questions for HMMs

Let M be an HMM and x a sequence of symbols.

(Q1) For x, determine the most probable state path through M: Viterbi algorithm

(Q2) Compute the probability with which x is generated by M, P(x) = P(x | M): forward algorithm

(Q3) Given x and possibly further sequences, how are the parameters of M trained? E.g. Baum-Welch algorithm

1.32 Computing P(x | M)

Given an HMM M and a sequence x. The probability that x was generated by M is:

$P(x \mid M) = \sum_{\pi} P(x, \pi \mid M)$,

where we have to sum over all state paths $\pi$ through M! (Exercise: how fast does the number of paths grow with increasing length?)

1.33 Forward algorithm

This algorithm is obtained from the Viterbi algorithm by replacing max with a sum. We consider the following forward variable:

$f_k(i) = P(x_1 \dots x_i, \pi_i = k)$,

which is the probability of emitting the prefix $(x_1, \dots, x_i)$ and reaching the state $\pi_i = k$.

The recursion is: $f_l(i+1) = e_l(x_{i+1}) \sum_{k \in Q} f_k(i)\, a_{kl}$.

[Figure: the values f_p(i), f_q(i), f_r(i) and f_s(i) at position i each contribute, via a transition a_{kl}, to f_l(i+1).]


1.34 Forward algorithm

Input: HMM M = (S, Q, A, e) and symbol sequence x

Output: probability P(x | M)

Initialization (i = 0): $f_0(0) = 1$, $f_k(0) = 0$ for k ≠ 0.

For all i = 1, ..., L and l ∈ Q: $f_l(i) = e_l(x_i) \sum_{k \in Q} \big(f_k(i-1)\, a_{kl}\big)$

Result: $P(x \mid M) = \sum_{k \in Q} \big(f_k(L)\, a_{k0}\big)$

Implementation note: logarithms cannot be used as elegantly here, but there are also scaling methods.

This solves question Q2!
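One common scaling method rescales the forward values at every position so that they sum to 1 and accumulates the logarithms of the scaling factors. The sketch below does this for the casino HMM; the begin probabilities of 1/2 and the missing end transitions are assumptions, and the function name is hypothetical.

```python
import math

STATES = "FU"
START = {"F": 0.5, "U": 0.5}  # assumption: a_0F = a_0U = 1/2
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}


def forward_log_likelihood(x, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return log P(x) using the forward recursion with per-position scaling."""
    log_px = 0.0
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for sym in x[1:]:
        # Rescale so the values sum to 1 and remember the factor in log space.
        scale = sum(f.values())
        log_px += math.log(scale)
        f = {k: val / scale for k, val in f.items()}
        # Forward recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl.
        f = {l: emit[l][sym] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    # Without an end state, P(x) is the sum of the final forward values.
    return log_px + math.log(sum(f.values()))


log_p = forward_log_likelihood("66")
```

For two sixes the four possible paths contribute 0.1125, 0.475/36, 0.025/6 and 0.0125/6, which add up to 4.75/36.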

1.35 Backward algorithm

The backward variable is the probability of generating the suffix $(x_{i+1}, \dots, x_L)$ when starting from the state $\pi_i = k$: $b_k(i) = P(x_{i+1} \dots x_L \mid \pi_i = k)$.

Input: HMM M = (S, Q, A, e) and symbol sequence x

Output: probability P(x | M)

Initialization (i = L): $b_k(L) = a_{k0}$ for all k.

For all i = L − 1, ..., 1 and k ∈ Q: $b_k(i) = \sum_{l \in Q} a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$

Result: $P(x \mid M) = \sum_{l \in Q} \big(a_{0l}\, e_l(x_1)\, b_l(1)\big)$

[Figure: the value b_k(i) is obtained, via the transitions a_{kl}, from the values b_p(i+1), b_q(i+1), b_r(i+1) and b_s(i+1) at position i+1.]

1.36 Comparison of the three variables

Viterbi $v_k(i)$: probability that the most probable state path generates the symbol sequence $(x_1, x_2, \dots, x_i)$ and the system is in state k at time i.

Forward $f_k(i)$: probability that the symbol sequence $x_1, \dots, x_i$ is generated and the system is in state k at time i.

Backward $b_k(i)$: probability of generating the symbol sequence $x_{i+1}, \dots, x_L$ when starting in state k at time i.


1.37 Posterior probabilities

Given an HMM M and a symbol sequence x. Let $P(\pi_i = k \mid x)$ be the probability that the symbol $x_i$ was emitted in state $\pi_i = k$. It is called the posterior probability, since it is computed after observing the sequence x.

We have:

$P(\pi_i = k \mid x) = \dfrac{P(\pi_i = k, x)}{P(x)} = \dfrac{f_k(i)\, b_k(i)}{P(x)}$,

since $P(g, h) = P(g \mid h) P(h)$ and by the definition of the forward and backward variables.

1.38 Decoding with posterior probabilities

There are alternatives to Viterbi decoding, which are useful, for example, when there are many paths that are about as probable as $\pi^*$.

We define a state sequence $\hat\pi$ by

$\hat\pi_i = \arg\max_{k \in Q} P(\pi_i = k \mid x)$,

i.e., at every position we choose the state that is most probable at that position.

This decoding is useful when we are interested in the state at a given point i, and not in the whole sequence.

Caution: if some state transitions are not allowed by the transition matrix (i.e., $a_{kl} = 0$), then the path $\hat\pi$ may be invalid, i.e. occur with probability 0!
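Posterior decoding combines the forward and backward variables. The sketch below uses the casino HMM without scaling, so it is only suitable for short sequences. The begin probabilities of 1/2 and the initialization $b_k(L) = 1$ (no end state) are assumptions, and the function names are hypothetical.

```python
STATES = "FU"
START = {"F": 0.5, "U": 0.5}  # assumption: a_0F = a_0U = 1/2
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 1 / 10 for s in "12345"}, "6": 1 / 2},
}


def forward(x):
    """f[i][k] for 0-based position i (unscaled)."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][sym] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f


def backward(x):
    """b[i][k] for 0-based position i, with b[L-1][k] = 1 (no end state)."""
    b = [{k: 1.0 for k in STATES}]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][sym] * nxt[l] for l in STATES)
                     for k in STATES})
    return b


def posterior_decode(x):
    """Return (state sequence maximizing each posterior, list of posteriors)."""
    f, b = forward(x), backward(x)
    px = sum(f[-1].values())
    post = [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))]
    path = "".join(max(STATES, key=p.get) for p in post)
    return path, post


decoded, posteriors = posterior_decode("1266666662")
```

At every position the posteriors of all states sum to 1, which is a convenient sanity check for an implementation.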

1.39 Parameter estimation

How is an HMM constructed?

First step: the “topology” is fixed, i.e. the choice of the states and of the connections between them.

Second step: choice of the parameter values, i.e. of the transition probabilities $a_{kl}$ and the emission probabilities $e_k(b)$.

We consider the second step. Given a set of example sequences. The goal is to “train” the parameters of an HMM on the example sequences, i.e. to choose the parameters such that the probability with which the HMM generates the example sequences is maximized.

1.40 Parameter estimation with known state sequence

Let M = (S, Q, A, e) be an HMM.

Given a list of symbol sequences $x^1, x^2, \dots, x^n$ and a list of corresponding paths $\pi^1, \pi^2, \dots, \pi^n$. (E.g., a DNA sequence with annotated CpG islands.)

We want to choose the parameters (A, e) of the HMM M optimally, i.e. such that:

$P(x^1, \dots, x^n, \pi^1, \dots, \pi^n \mid M = (S, Q, A, e)) = \max_{(A', e')} P(x^1, \dots, x^n, \pi^1, \dots, \pi^n \mid M = (S, Q, A', e'))$.

We are thus looking for the so-called maximum likelihood estimator (ML estimator) for (A, e).

1.41 ML estimation for (A, e)

(Note: if we regard P(D | M) as a function of D, we speak of a probability; as a function of M, we speak of a likelihood.)

ML estimation:

$(A, e)_{ML} = \arg\max_{(A', e')} P(x^1, \dots, x^n, \pi^1, \dots, \pi^n \mid M = (S, Q, A', e'))$.

Computation:

$A_{kl}$: number of transitions from state k to l

$E_k(b)$: number of emissions of b in state k

We set the parameters for M:

$\bar a_{kl} = \dfrac{A_{kl}}{\sum_{q \in Q} A_{kq}}$ and $\bar e_k(b) = \dfrac{E_k(b)}{\sum_{s \in S} E_k(s)}$.


1.43 Training the fair/unfair HMM

Given training data x and π:

Symbols x: 1 2 5 3 4 6 1 2 6 6 3 2 1 5

States pi: F F F F F F F U U U U F F F

State transitions:

  A_kl   0   F   U
  0      0   1   0
  F      1   8   1
  U      0   1   3

Emissions:

  E_k(b)   1   2   3   4   5   6
  F        3   2   1   1   2   1
  U        0   1   1   0   0   2
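The two count tables can be reproduced mechanically from the training data. In this sketch the function name is hypothetical; the path is padded with the begin/end state 0 on both sides so that the transitions 0 → F and F → 0 are counted as in the table.

```python
from collections import Counter


def count_events(symbols, path, begin_end="0"):
    """Return (transition counts A[k, l], emission counts E[k, b])."""
    trans, emit = Counter(), Counter()
    # Pad the path with the begin/end state on both sides.
    full_path = [begin_end] + list(path) + [begin_end]
    for k, l in zip(full_path, full_path[1:]):
        trans[k, l] += 1
    for k, b in zip(path, symbols):
        emit[k, b] += 1
    return trans, emit


x = "12534612663215"
pi = "FFFFFFFUUUUFFF"
A_counts, E_counts = count_events(x, pi)
```

Here `A_counts["F", "F"]` is 8 and `E_counts["U", "6"]` is 2, matching the tables above.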

1.44 Pseudocounts

Normalizing the counts from the training data gives the following estimates.

State transitions:

  ā_kl   0      F      U
  0      0      1      0
  F      1/10   8/10   1/10
  U      0      1/4    3/4

Emissions:

  ē_k(b)   1    2     3     4    5    6
  F        .3   .2    .1    .1   .2   .1
  U        0    1/4   1/4   0    0    1/2

One problem is overfitting. E.g., if a transition k → l does not occur in the training set, then $\bar a_{kl} = 0$ is set and this transition is then considered “forbidden”.

If a state k does not occur in the training set, then $\bar a_{kl}$ is undefined for all l!

To solve these problems, one defines pseudocounts $r_{kl}$ and $r_k(b)$ and then sets:

$A_{kl}$ = number of transitions from k to l in the training set + $r_{kl}$

$E_k(b)$ = number of emissions of b from k in the training set + $r_k(b)$.

Small pseudocounts correspond to “little prior knowledge”, large pseudocounts correspond to “a lot of prior knowledge”.
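The effect of pseudocounts can be seen on the emission counts of the fair/unfair training example. This is a sketch: the function name is hypothetical, and a single constant pseudocount is used for every entry, which is only one possible choice of $r_k(b)$.

```python
def normalize(counts, rows, cols, pseudocount=0.0):
    """Turn counts[(k, c)] into row-wise probabilities, adding a pseudocount."""
    probs = {}
    for k in rows:
        total = sum(counts.get((k, c), 0) + pseudocount for c in cols)
        if total == 0:
            probs[k] = None  # state never observed: estimate undefined
        else:
            probs[k] = {c: (counts.get((k, c), 0) + pseudocount) / total
                        for c in cols}
    return probs


# Emission counts E_k(b) from the fair/unfair training example.
E = {("F", "1"): 3, ("F", "2"): 2, ("F", "3"): 1,
     ("F", "4"): 1, ("F", "5"): 2, ("F", "6"): 1,
     ("U", "2"): 1, ("U", "3"): 1, ("U", "6"): 2}

e_ml = normalize(E, "FU", "123456")                     # plain ML estimate
e_pc = normalize(E, "FU", "123456", pseudocount=1.0)    # with pseudocount 1
```

Without pseudocounts the unfair die can never show a 1 (`e_ml["U"]["1"]` is 0); with a pseudocount of 1 the estimate becomes 1/10.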

1.45 Parameter estimation with unknown state sequence

In practice one only has symbol sequences and does not know the corresponding state paths.

Given symbol sequences $x^1, x^2, \dots, x^n$ for which we do NOT know the state paths $\pi^1, \dots, \pi^n$. The problem of choosing the parameters (A, e) of an HMM M (optimally) such that

$P(x^1, \dots, x^n \mid M = (S, Q, A, e)) = \max_{(A', e')} P(x^1, \dots, x^n \mid M = (S, Q, A', e'))$

holds is NP-complete!

1.46 Log-likelihood

Given symbol sequences $x^1, x^2, \dots, x^n$.

Let M = (S, Q, A, e) be an HMM. We define the score of the model M as:

$l(x^1, \dots, x^n) = \log P(x^1, \dots, x^n \mid (A, e)) = \sum_{j=1}^{n} \log P(x^j \mid (A, e))$.

(Here we assume that the symbol sequences are independent and that therefore $P(x^1, \dots, x^n) = P(x^1) \cdot \ldots \cdot P(x^n)$ holds.)

The goal is now to optimize the parameters (A, e) such that this score (the log-likelihood) is maximized.


1.47 Overview: Baum-Welch algorithm

Let M = (S, Q, A, e) be an HMM and let training sequences $x^1, x^2, \dots, x^n$ be given. The parameters (A, e) are to be improved iteratively, as follows:

- Based on the current parameters (A, e) and the sequences $x^1, \dots, x^n$, expected values for $A_{kl}$ and $E_k(b)$ are estimated, averaging over the unknown paths $\pi^1, \dots, \pi^n$.

- We then set (A′, e′) ← (Ā, ē).

- Repeat until a stopping criterion is met.

This is a special case of the EM technique (expectation maximization).

1.48 Baum-Welch algorithm

Input: HMM M = (S, Q, A, e), training sequences $x^1, x^2, \dots, x^n$, optionally pseudocounts $r_{kl}$ and $r_k(b)$

Output: HMM M′ = (S, Q, A′, e′) with improved score.

Recursion:

    Set A and E to 0, or to their pseudocounts if given.

    For each sequence $x^j$:

        Compute $f^j_k(i)$ for $x^j$ with the forward algorithm.

        Compute $b^j_k(i)$ for $x^j$ with the backward algorithm.

        Add the contributions of $x^j$ to the two sums:

        $A_{kl} = \sum_j \dfrac{1}{P(x^j)} \sum_i f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1)$

        $E_k(b) = \sum_j \dfrac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f^j_k(i)\, b^j_k(i)$

    Compute new model parameters (A′, e′) ← (Ā, ē).

    Compute the new log-likelihood $l(x^1, \dots, x^n \mid (A', e'))$.

Termination: stop when the score has not improved, or when the maximum number of iterations has been reached.
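One iteration of the recursion can be sketched as follows for a two-state model over die symbols. This is a simplified illustration, not a full implementation: there is no end state, the begin probabilities are kept fixed at 1/2, no scaling is used (so only short sequences are safe), and the initial parameter guess and all names are hypothetical. The training sequence is the first generated line of the casino example.

```python
import math

STATES, ALPHABET = "FU", "123456"
START = {"F": 0.5, "U": 0.5}  # assumption: fixed, not re-estimated


def forward(x, a, e):
    f = [{k: START[k] * e[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: e[l][sym] * sum(prev[k] * a[k][l] for k in STATES)
                  for l in STATES})
    return f


def backward(x, a, e):
    b = [{k: 1.0 for k in STATES}]  # no end state: b_k(L) = 1
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in STATES)
                     for k in STATES})
    return b


def baum_welch_step(seqs, a, e, pseudo=0.0):
    """Return (new a, new e, log-likelihood under the OLD parameters)."""
    A = {k: {l: pseudo for l in STATES} for k in STATES}
    E = {k: {c: pseudo for c in ALPHABET} for k in STATES}
    loglik = 0.0
    for x in seqs:
        f, b = forward(x, a, e), backward(x, a, e)
        px = sum(f[-1].values())
        loglik += math.log(px)
        for i in range(len(x)):
            for k in STATES:
                # Expected emissions of x[i] in state k.
                E[k][x[i]] += f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in STATES:
                        # Expected uses of the transition k -> l at position i.
                        A[k][l] += (f[i][k] * a[k][l] * e[l][x[i + 1]]
                                    * b[i + 1][l] / px)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in STATES} for k in STATES}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in ALPHABET} for k in STATES}
    return new_a, new_e, loglik


# Hypothetical initial guess, deliberately different from the true casino model.
a0 = {"F": {"F": 0.8, "U": 0.2}, "U": {"F": 0.3, "U": 0.7}}
e0 = {"F": {c: 1 / 6 for c in ALPHABET},
      "U": {**{c: 0.14 for c in "12345"}, "6": 0.30}}
seq = "24335642611341666666526562426612134635535566462666636664253"

a, e, history = a0, e0, []
for _ in range(5):
    a, e, ll = baum_welch_step([seq], a, e)
    history.append(ll)
```

The recorded log-likelihoods in `history` should not decrease from one iteration to the next, which is the behaviour the algorithm is designed to guarantee.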

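1.49 Explanation

Given x. The probability that the transition from $\pi_i = k$ to $\pi_{i+1} = l$ is used at position i is:

$P(\pi_i = k, \pi_{i+1} = l \mid x, (A, e)) = \dfrac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$.

Summing over all positions and all training sequences gives the expected number of transitions from k to l:

$A_{kl} = \sum_{j=1}^{n} \dfrac{1}{P(x^j)} \sum_{i=1}^{L_j} f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1)$.

(Recall: $P(\pi_i = k \mid x) = \dfrac{P(\pi_i = k, x)}{P(x)} = \dfrac{f_k(i)\, b_k(i)}{P(x)}$.)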


1.50 Convergence

Remark One can prove that when the Baum-Welch algorithm is used, the log-likelihood score converges to a local maximum.

However, the parameters themselves need not converge!

The risk of ending up in a poor local maximum can be reduced by trying several different starting points.

Of course, other standard optimization methods can also be used to solve the optimization problem.

1.51 Protein primary sequences

Given the amino acid sequence of a new protein P. We would like to learn the biological function of P.

SWISS-PROT: Curated protein sequence data which strives to provide a high level of annotations, a minimal level of redundancy and a high level of integration with other databases. SIB Switzerland.

TrEMBL: Computer-annotated supplement to SWISS-PROT

PIR: Protein information resource. Comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. A division of National Biomedical Research Foundation, Georgetown Uni.

OWL: Composite protein sequence database, non-redundant composite of (in this order of priority) SWISS-PROT, PIR, GenBank (translation) and NRL-3D. University of Manchester

1.52 Protein secondary structure

PRINTS: A compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family. University of Manchester.

SPRINT: Provides an interface to the PRINTS-S database. PRINTS-S is the relational cousin of the PRINTS data bank of protein family fingerprints. University of Manchester.

PROSITE: Database of protein families and domains, consisting of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. (E.g., uses regular expressions to describe patterns.)

BLOCKS: The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in the Prosite Database. Blocks are calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. Fred Hutchinson Cancer Research Center in Seattle.

Pfam: Large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. Pfam 6.6 contains alignments and models for 3071 protein families. Washington University in St Louis

ProDom: The protein domain database. 390 ProDom families were generated automatically using PSI-BLAST. Toulouse.

InterPro: Built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, and SWISS-PROT+TrEMBL

1.53 Three-dimensional protein structure

SCOP: Structural classification of proteins. Cambridge University

CATH: Protein Structure Classification. University College London

PDB: Protein data bank. Single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.

PDBsum: Contains summary information and derived data on entries in the Protein Data Bank.

PIR-NRL3D: This sequence-structure database is produced from sequence and annotation extracted from three-dimensional structures in PDB.


1.54 Protein identification

#A-helices ...........AAAAAAAAAAAAAAAA...BBBBBBBBBBBBBBBBCCCCCCCCCCC....DDDDDDDEEEEEEEEEEEE 

GLB1_GLYDI .........GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFG.FSG....AS...DPGVAALGAKVL 

HBB_HUMAN ........VHLTPEEKSAVTALWGKV....NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVL 

HBA_HUMAN .........VLSPADKTNVKAAWGKVGA..HAGEYGAEALERMFLSFPTTKTYFPHF.DLS.....HGSAQVKGHGKKVA 

MYG_PHYCA .........VLSEGEWQLVLHVWAKVEA..DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL 

GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYS..TYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERII 

GLB3_CHITP ..........LSADQISTVQASFDKVKG......DPVGILYAVFKADPSIMAKFTQFAG.KDLESIKGTAPFETHANRIV 

LGB2_LUPLU ........GALTESQAALVKSSWEEFNA..NIPKHTHRFFILVLEIAPAAKDLFS.FLK.GTSEVPQNNPELQAHAGKVF 

#A-helices EEEEEEEEE............FFFFFFFFFFFF..FFGGGGGGGGGGGGGGGGGGG.....HHHHHHHHHHHHHHHHHHH 

GLB1_GLYDI AQIGVAVSHL..GDEGKMVAQMKAVGVRHKGYGNKHIKAQYFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGAL 

HBB_HUMAN GAFSDGLAHL...D..NLKGTFATLSELHCDKL..HVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL 

HBA_HUMAN DALTNAVAHV...D..DMPNALSALSDLHAHKL..RVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVL 

MYG_PHYCA TALGAILKK....K.GHHEAELKPLAQSHATKH..KIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDI 

GLB5_PETMA NAVNDAVASM..DDTEKMSMKLRDLSGKHAKSF..QVDPQYFKVLAAVIADTVAAG.........DAGFEKLMSMICILL 

GLB3_CHITP GFFSKIIGEL..P...NIEADVNTFVASHKPRG...VTHDQLNNFRAGFVSYMKAHT..DFA.GAEAAWGATLDTFFGMI 

LGB2_LUPLU KLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG...VADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVI 

#A-helices HHHHHHH....

GLB1_GLYDI ISGLQS.....

HBB_HUMAN AHKYH......

HBA_HUMAN TSKYR......

MYG_PHYCA AAKYKELGYQG

GLB5_PETMA RSAY.......

GLB3_CHITP FSKM.......

LGB2_LUPLU KKEMNDAA...

Alignment of seven globin sequences.

How can this family be characterized?

1.55 Characterization?

Representative sequence?

Consensus sequence?

Regular expression (Prosite):

LGB2_LUPLU ...FNA--NIPKH...

GLB1_GLYDI ...IAGADNGAGV...

...[FI]-[AN]-x(1,2)-N-[IG]-[AP]-[GK]-[HV]...

HMM?

1.56 Simple HMM

HBA_HUMAN ...VGA--HAGEY... 

HBB_HUMAN ...V----NVDEV... 

MYG_PHYCA ...VEA--DVAGH... 

GLB3_CHITP ...VKG------D... 

GLB5_PETMA ...VYS--TYETS... 

LGB2_LUPLU ...FNA--NIPKH... 

GLB1_GLYDI ...IAGADNGAGV... 

"Matches": *** ***** 

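We first consider a simple HMM that corresponds to a PSSM (position specific score matrix):

[Figure: a chain of eight match states, one per match column of the alignment. The amino acids listed in the states are:
  state 1: V F I
  state 2: A E G K Y
  state 3: A G S
  state 4: D H N T
  state 5: A G I V Y
  state 6: A D E G P
  state 7: E G K T
  state 8: D H S V Y]

(The amino acids listed have an increased emission probability.)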

1.57 Insert states

So-called insert states are introduced, which emit symbols according to the background probabilities.

[Figure: the chain of match states of the simple HMM, preceded by a Begin state and extended by insert states.]

This makes it possible to model additional sequence segments outside the important domains.

1.58 Delete states

So-called delete states are introduced, which are silent, i.e. they emit no symbols.

[Figure: the chain of match states of the simple HMM, extended by delete states.]

This makes it possible to model the absence of individual domains.


1.59 Topology of a profile HMM

[Figure: topology of a profile HMM. Legend: match state, insert state, delete state.]

1.60 Designing a profile HMM

Given a multiple alignment of a family of sequences.

First, we must decide which positions are modeled as match states and which as insert states. Rule of thumb: columns with more than 50% gaps should be modeled as insert states.

The transition and emission probabilities can then be determined by counting the observed transitions A_{kl} and emissions E_k(b):

\[ a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \quad\text{and}\quad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}. \]

Here, too, it can happen that certain transitions or emissions are never observed. We use Laplace's rule and add 1 to every count.
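The estimation with add-one pseudocounts can be sketched as follows. This is a small illustration under assumptions: the function name estimate is hypothetical, the emission column is the first match column of the alignment in 1.56, and the transition counts are made-up numbers.

```python
from collections import Counter

def estimate(counts, outcomes, pseudocount=1):
    """Turn observed counts into probabilities, adding a pseudocount to each outcome."""
    total = sum(counts.get(o, 0) + pseudocount for o in outcomes)
    return {o: (counts.get(o, 0) + pseudocount) / total for o in outcomes}

# Emissions of one match state: residues observed in its alignment column.
column = "VVVVVFI"
alphabet = "ACDEFGHIKLMNPQRSTVWY"
emissions = estimate(Counter(column), alphabet)

# Transitions out of one match state (hypothetical counts A_kl).
transitions = estimate({"M": 6, "D": 1}, ["M", "I", "D"])

print(round(emissions["V"], 3), round(emissions["A"], 3))
print(transitions)
```

Without the pseudocount, the unobserved transition to the insert state and all unobserved residues would get probability 0, so any sequence using them would be assigned probability 0 by the model.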

2 Suffix trees 

History 

Weiner 1973: linear-time algorithm 

McCreight 1976: reduced space 

Ukkonen 1995: new algorithm, easier to describe 

References 

- Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge, 1997.

- R. Giegerich, S. Kurtz and J. Stoye, Efficient Implementation of Lazy Suffix Trees, WAE’99, LNCS 1668, pp. 30-42, 1999, see link from webpage.

- S. Kurtz, Foundations of sequence analysis, Bielefeld (2001), see link.


2.1 Importance of sequence comparison 

The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, 

or amino acid sequences), high sequence similarity usually implies significant functional or 

structural similarity. 

“Duplication with modification”: The vast majority of extant proteins are the result of a continuous 

series of genetic duplications and subsequent modifications. As a result, redundancy is a built-in 

characteristic of protein sequences, and we should not be surprised that so many new sequences resemble 

already known sequences. ...all of biology is based on enormous redundancy... 

We didn’t know it at the time, but we found everything in life is so similar, that the same genes that work in flies are the ones that work in humans. (Eric Wieschaus, co-winner of the 1995 Nobel Prize in medicine for work on the genetics of Drosophila development.)

Dan Gusfield, 1997, 212 ff 

2.2 Searching for short queries in a long text 

Problem Given a long text t and many short queries q 1 , . . .,q k . For each query sequence q i , 

find all its occurrences in t. 

We would like to have a data-structure that allows us to solve this problem efficiently. 

Example: The text t is a genomic sequence and the queries are short signals such as transcription 

factor binding sites, splice sites etc. 

Important applications are in the comparison of genomes (in programs such as MUMmer, which computes maximal unique matches) and in the analysis of repeats.

2.3 Basic definitions 

Let Σ denote an alphabet and Σ∗ the set of strings over Σ. Let ε denote the empty string and Σ+ = Σ∗ \ {ε}.

Let t = t_1 t_2 … t_n be the text and let $ ∈ Σ be a symbol that does not occur in t.

For i ∈ {1, 2, …, n + 1}, let s_i = t_i … t_n $ denote the i-th suffix of t$.

2.4 The role of suffixes 

Consider the text abab$ 

It has the following suffixes: 

abab$, bab$, ab$, b$, $ 

To determine whether a given query q is contained in the text, we check whether q is the prefix 

of one of the suffixes. 

E.g., the query ab is the prefix of both abab$ and ab$. 

To speed up the search for all suffixes that have the query as a prefix, we use a tree structure 

to share common prefixes between the suffixes.
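Before any tree is built, the idea can be checked directly on the list of suffixes. The sketch below is deliberately naive (it scans every suffix, so it is not the efficient data structure that follows); the function names are chosen for illustration only.

```python
def suffixes(text):
    """Return (start position, suffix) pairs of text$, with 1-based positions."""
    t = text + "$"
    return [(i + 1, t[i:]) for i in range(len(t))]

def occurrences(text, query):
    """Start positions of query in text: the suffixes that have query as a prefix."""
    return [i for i, s in suffixes(text) if s.startswith(query)]

print(suffixes("abab"))
print(occurrences("abab", "ab"))   # [1, 3]
```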


2.5 Sharing prefixes 

(a) The suffixes abab$ and ab$ both share the prefix ab. 

(b) The suffixes bab$ and b$ both share the prefix b. 

(c) The suffix $ doesn’t share a prefix. 

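[Figure: three partial trees. (a) A path labeled ab that branches into ab$ and $. (b) A path labeled b that branches into ab$ and $. (c) A single edge labeled $.]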

2.6 First example 

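[Figure: the five suffixes abab$, ab$, bab$, b$ and $, drawn as separate paths, are merged into a single tree:]

root
 ├─ ab ─┬─ ab$   (leaf 1)
 │      └─ $     (leaf 3)
 ├─ b ──┬─ ab$   (leaf 2)
 │      └─ $     (leaf 4)
 └─ $            (leaf 5)

The suffix tree for abab$ is obtained by sharing prefixes wherever possible. The leaves are annotated by the positions of the corresponding suffixes in the text.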

2.7 Σ + -tree 

A Σ+-tree T is a finite, directed tree with root root. Its edges are labeled with strings in Σ+, such that: for every letter a ∈ Σ and node u there exists at most one a-edge u --as--> w (for some string s and some node w).

A leaf is a node with no children and an edge leading to a leaf is called a leaf edge. A node 

with at least two children is called a branching node. 

2.8 Naming nodes by strings 

Let u be a node of T. We use the name s̄ for u, if s is the concatenation of all labels of the edges along the path from the root of the tree to u.

For example, if the path from the root to u consists of three edges labeled p, q and r, then u = pqr.

The root is called ε.

Definition: A string g is said to occur in T, if there exists a string h such that gh is a node in T.

2.9 Suffix tree 

Definition: A suffix tree ST(t) for t is a Σ + -tree with the following properties: 

1. Every node is either a leaf or a branching node, and 

2. a string s occurs in ST(t) ⇔ s is a substring of t$.

There exists a one-to-one correspondence between the non-empty suffixes of t$ and the leaves 

of ST(t). 

For every leaf s_j we define l(s_j) = {j}. Recursively, for every branching node u we define:

l(u) = {j | u --v--> uv is an edge of ST(t), j ∈ l(uv)}.

In other words,

\[ l(u) = \bigcup_{v \text{ is a child of } u} l(v). \]

We call l(u) the leaf set of u.

2.10 Example 

Text: xabxac 

[Figure: the suffix tree for xabxac$:]

root
 ├─ xa ─┬─ bxac$
 │      └─ c$
 ├─ a ──┬─ bxac$
 │      └─ c$
 ├─ bxac$
 ├─ c$
 └─ $

2.11 Idea: Compute tree recursively 

Note that the subtree below a branching node u is determined by the set of all suffixes of t$ that start with the prefix u.

[Figure: the suffixes uS_1, uS_2, …, uS_k of t$ that start with u, and the subtree below the node u that is spanned by their remainders S_1, S_2, …, S_k.]

So, if we know the set of remaining suffixes

R(u) := {s | us is a suffix of t$},

then we can evaluate the node u, i.e. construct the subtree below u.

2.12 The main evaluation step 

An unevaluated node is evaluated as follows: We partition the set R(u) into groups by the first 

letter of the strings, i.e. for every letter c ∈ Σ, we define the c-group as: 

R_c(u) := {cw ∈ Σ∗ | cw ∈ R(u)}.

Consider R_c(u) for c ∈ Σ. If R_c(u) ≠ ∅, then there are two possible cases:

1. If R_c(u) contains precisely one string w, then we construct a new leaf edge starting at u and label it with w.

2. Otherwise, the set R_c(u) contains at least two different strings; let p denote their longest common prefix (lcp). We create a new c-edge with label p whose source node is u. The new unevaluated node up, together with the set R(up) = {w | pw ∈ R_c(u)}, will be (recursively) processed later.

2.13 Evaluating the root 

This wotd-algorithm (write-only, top-down) starts by evaluating the root node, with R(root) 

equal to the set of all suffixes of t$. All nodes of ST(t) are then recursively constructed using 

the appropriate sets of remaining suffixes. 
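The recursive evaluation can be sketched in a few lines. This is a minimal illustration that works on explicit copies of the remaining suffixes, not the space-efficient table representation; the nested-dict tree format (edge label mapped to subtree, a leaf being the start position of its suffix) and the function names are assumptions made for this sketch.

```python
def lcp(strings):
    """Longest common prefix of a non-empty list of strings."""
    first, last = min(strings), max(strings)
    n = 0
    while n < len(first) and first[n] == last[n]:
        n += 1
    return first[:n]

def evaluate(remaining):
    """Build the subtree below a node u from R(u).

    remaining: list of (remaining suffix, start position of the full suffix).
    Returns a dict mapping edge labels to subtrees; a leaf is a start position.
    """
    groups = {}
    for s, pos in remaining:                 # partition R(u) by first letter
        groups.setdefault(s[0], []).append((s, pos))
    node = {}
    for group in groups.values():
        if len(group) == 1:                  # case 1: leaf edge labeled w
            s, pos = group[0]
            node[s] = pos
        else:                                # case 2: internal edge labeled lcp
            p = lcp([s for s, _ in group])
            node[p] = evaluate([(s[len(p):], pos) for s, pos in group])
    return node

def wotd(text):
    """Evaluate the root with R(root) = all suffixes of text$."""
    t = text + "$"
    return evaluate([(t[i:], i + 1) for i in range(len(t))])

print(wotd("abab"))
```

Because $ occurs only at the end of t$, no remaining suffix is a prefix of another, so every remainder passed to the recursive call is non-empty.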

2.14 Example 

Text: abab 

The algorithm proceeds as follows: We first evaluate the root node root using R(root) = {abab$, bab$, ab$, b$, $}. There are three groups of suffixes:

R_a(root) = {abab$, ab$}, R_b(root) = {bab$, b$} and R_$(root) = {$}.

The letter $ gives rise to a leaf edge with label $. The letter a gives rise to an internal edge with label ab, because ab = lcp(R_a(root)). Similarly, for b we obtain an internal edge with label b.


For the node ab we have R(ab) = {ab$, $} and thus R a (ab) = {ab$} and R $ (ab) = {$}. Because 

both latter sets have cardinality one, we obtain two new leaf edges with labels ab$ and $, 

respectively. 

Similarly, we obtain two new leaf edges with labels ab$ and $ for the node b. 

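[Figure: the three stages of the example for the text abab. First the set R(root) = {abab$, bab$, ab$, b$, $}; then the edges ab, b and $ leaving the root, with the unevaluated sets R(ab) = {ab$, $} and R(b) = {ab$, $}; finally the complete suffix tree.]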

2.15 Properties of the wotd-algorithm 

Complexity: Space requirement? Worst case time complexity? (Exercises!) 

The expected running time is O(n log_k n), where k = |Σ| is the size of the alphabet, and experimental studies indicate that the algorithm often runs in linear time for moderately sized strings.

Good memory locality. 

Algorithm can be parallelized. 

2.16 Suffix tree data-structure 

An implementation of a suffix tree must represent its nodes, edges and edge labels. To be able 

to describe the implementation, we define a total ordering on the set of children of a branching 

node: 

Let u and v be two different children of the same node in ST(t). We write 

u ≺ v iff min l(u) < min l(v), 

in other words, iff the first occurrence of u in t$ comes before the first occurrence of v in t$. 

[Figure: the text t with the first occurrences of u and v marked, starting at positions min l(u) and min l(v).]

2.17 Representing the edge labels 

Because an edge label s is a substring of the text t$, we could represent it by a pair of pointers (i, j) into t′ = t$ such that s = t′_i t′_{i+1} … t′_j.


However, note that we have j = n + 1 for any leaf edge and so in this case the right pointer is 

redundant. 

Moreover, we can get rid of the right pointers of internal edges as well, by defining a left pointer on the set of nodes (not edges) in such a way that these can be used to reconstruct the original left and right pointers of each edge.

2.18 Coding edge labels with one pointer 

Consider an edge u --v--> uv. Define the left pointer of uv as the position p of the first occurrence of uv in t$ plus the length of u:

lp(uv) = min l(uv) + |u|.

This gives the start position i of a copy of v in t$.

To get the end position of v, consider the ≺-smallest child uvw of uv. We have min l(uv) = min l(uvw), i.e. the corresponding suffix starts at the same position p. By definition, we have lp(uvw) = min l(uvw) + |uv| and the end position of v equals lp(uvw) − 1.

[Figure: the text t with the first occurrence of uvw starting at position min l(uv) = min l(uvw); the copy of v starts at i = lp(uv) and the copy of w starts at lp(uvw).]

2.19 The main data table 

For each node u, we store a reference firstchild(u) to its smallest child. 

We store the values of lp and firstchild together in a single (integer) table T. We store the 

values of all children of a given node u consecutively, ordered w.r.t. ≺. (We will indicate the last 

child of u by setting its lastchild-bit.) 

So, only the edge from a given node u to its first child is represented explicitly. Edges from u to its other children are given implicitly and are found by scanning consecutive positions in T that follow the position of the smallest child.

We reference the node u using the index of the position in T that contains the value lp(u). 

2.20 Example 

The table T for ST(abab). All indices start at 1. The first value in T for a branching node u is lp(u), the second value is firstchild(u):

Node     ab      b       $     abab$   ab$    bab$   b$
Index    1  2    3  4    5     6       7      8      9
T        1  6    2  8    5     3       5      3      5
Bits                     ∗†    ∗       ∗†     ∗      ∗†

To be able to decode this representation of the suffix tree, we need two extra bits: A leaf-bit 

(∗) indicates that the given position in T corresponds to a leaf node and a lastchild-bit (†) 

indicates that the node at this position does not have a larger brother w.r.t. ≺.
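To see that the table together with the two bits really determines the tree, the following sketch decodes the example table back into edge labels. It rests on assumptions: the bits are kept in two Python sets instead of being packed into the integers, the lastchild-bit of a branching node is taken to sit at its first position, and the output format (nested dicts, a leaf mapping to its lp value) is chosen for illustration.

```python
def decode(T, leaf, last, text):
    """Rebuild the edge labels from the table T (1-based indices; T[0] unused).

    leaf, last: sets of indices whose leaf-bit / lastchild-bit is set.
    text: the string t$.
    """
    def children(index):
        node = {}
        while True:
            start = T[index]                      # lp of this child
            if index in leaf:
                node[text[start - 1:]] = start    # leaf label runs to the end of t$
                step = 1
            else:
                first = T[index + 1]              # firstchild
                end = T[first] - 1                # lp(smallest child) - 1
                node[text[start - 1:end]] = children(first)
                step = 2                          # a branching node occupies two entries
            if index in last:
                return node
            index += step
    return children(1)                            # children of the root start at index 1

T = [None, 1, 6, 2, 8, 5, 3, 5, 3, 5]
leaf = {5, 6, 7, 8, 9}
last = {5, 7, 9}
print(decode(T, leaf, last, "abab$"))
```

The start position of the suffix belonging to a leaf is not stored; it can be recovered as the lp value minus the total length of the labels on the path to the leaf's parent.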


2.21 Storing an unevaluated node 

We consider the wotd-algorithm as a process that evaluates the nodes of a suffix tree. It starts 

at the root and then evaluates all nodes recursively. 

First we discuss how to store an unevaluated node u. 

To be able to evaluate u, we (only) need to know the set of remaining suffixes R(u). To make 

these available, we define a global array called suffixes that contains pointers to suffixes in t$ 

and use it as follows: For every unevaluated node u, the suffixes array contains an interval of 

pointers to start positions in t$ that correspond precisely to the suffixes contained in R(u), in 

increasing order. 

We can now represent R(u) in T using two numbers left(u) and right(u), which define an 

interval of entries in the suffixes array. 

As a branching node, u will occupy two positions in T, one for lp(u), followed by one for firstchild(u). Until u is actually evaluated, we use these two positions to store left(u) and right(u). We use a third bit, called the unevaluated-bit, to distinguish between unevaluated and evaluated nodes.

2.22 Evaluating a node u 

We sort and count all entries of suffixes in the interval [left(u),right(u)], using the first letter 

of the suffixes as the sort key. 

Each letter c that has count > 0 will give rise to a new c-edge from u. The suffixes in the 

c-group R c (u) determine the tree below the new edge. As a result of the sort, the pointers 

corresponding to suffixes in R c (u) are stored in a subinterval of [left(u),right(u)], ordered from 

left to right. 

To determine the label of the c-edge, we determine the lcp of the c-group: If the c-group consists of only one suffix s, then s itself is the lcp. Otherwise, we step through a simple loop j = 1, 2, … and check the equality of the letters t_{suffixes[i]+j} for all positions i of the suffixes array that belong to the c-group. As soon as a difference is detected, the loop is aborted and j is the length of the lcp.

For each non-empty c-group of u, we store one child in the table T, as follows:

A c-group containing only one suffix gives rise to a leaf node v and we write the number lp(v) in the first available position of T. This number lp(v) equals the number stored in suffixes at the leftmost position of the interval in suffixes that corresponds to the c-group.

A c-group containing more than one suffix gives rise to a branching node v and we store left(v) and right(v) in the first two available positions of T. The values of left and right were computed during the sort and count step.

Additionally, in preparation of the evaluation of v, we increment all entries of suffixes within 

the interval [left(v),right(v)] by the length of the lcp. 

Finally, for u we replace the values left(u) and right(u) in T by lp(u) := suffixes[left(u)] and 

firstchild(u), and we clear the unevaluated-bit. 

2.23 Lazy vs. complete evaluation 

To build the complete suffix tree, we proceed depth-first, from left to right.


In a lazy approach, we only evaluate those nodes that are necessary to answer a query (and 

have not yet been evaluated). 

2.24 Example 

Input:

Text:      a b a b $
Position:  1 2 3 4 5

Initialization:

suffixes: 1 2 3 4 5
T: (empty)

Evaluate(root):

Sort and count:

R_a(root) = {1, 3}, lcp = ab
R_b(root) = {2, 4}, lcp = b
R_$(root) = {5}

The suffixes are ordered, left and right are entered in the table, and the three bits (u, ∗, †: unevaluated, leaf, lastchild) are set:

suffixes: 1 3 2 4 5

Index:  1    2    3    4    5
T:      1    2    3    4    5
Bits:   u         u         ∗†

Add the length of the lcp to the suffixes entries of every branching node:

suffixes: 3 5 3 5 5

Evaluate(1):

R_a(1) = {3}
R_$(1) = {5}

suffixes: 3 5 3 5 5

Index:  1    2    3    4    5    6    7
T:      1    2    3    4    5    3    5
Bits:   (u)       u         ∗†   ∗    ∗†

Because lp(1) = 1 and firstchild(1) = 6, we set:

Index:  1    2    3    4    5    6    7
T:      1    6    3    4    5    3    5
Bits:             u         ∗†   ∗    ∗†

Evaluate(3):

R_a(3) = {3}
R_$(3) = {5}

suffixes: 3 5 3 5 5

Index:  1    2    3    4    5    6    7    8    9
T:      1    6    3    4    5    3    5    3    5
Bits:             (u)       ∗†   ∗    ∗†   ∗    ∗†

Because lp(3) = 2 and firstchild(3) = 8, we set:

Index:  1    2    3    4    5    6    7    8    9
T:      1    6    2    8    5    3    5    3    5
Bits:                       ∗†   ∗    ∗†   ∗    ∗†

Done!


2.25 Application: Finding MUMs 

Problem: Given two sequences s and t. Find all maximal unique matches (MUMs) between s 

and t. 

A MUM is a sequence m that occurs precisely once in s and once in t, and is both right maximal 

and left maximal with this property (meaning that ma and am both do not have the uniqueness 

property, for any letter a). 

To find all MUMs, generate the suffix tree T for sZt, where Z is a separator with Z ∉ s and Z ∉ t. Any path in T from the root to some node u that has precisely two children, both of them leaves, one for a suffix starting in s and one for a suffix starting in t, corresponds to a right maximal unique match.

To determine whether u is left maximal, too, simply check whether the two letters preceding the occurrences in s and in t differ.

2.28 Ukkonen’s online construction 

This lecture is based on: Stefan Kurtz, Foundations of sequence analysis, Bielefeld (2001) 

We will now discuss an algorithm that constructs ST(t) in linear time. It operates online and 

generates 

ST(ε), ST(t_1), ST(t_1 t_2), …, ST(t_1 t_2 … t_n)

for all prefixes of t, without knowledge of the remaining part of the input string.

Induction: First, note that ST(ε) consists of a root node only. To completely define the algorithm we must describe the induction step:

ST(t_1 … t_i) to ST(t_1 … t_i t_{i+1}), for all i.

For i ∈ {0, …, n − 1} we define

x := t_1 … t_i, a := t_{i+1} and y := t_{i+2} … t_n.

We call xa visible and y hidden.

2.29 Main idea 

In the step ST(x) −→ ST(xa): 

\[ \underbrace{t_1 \dots t_{i-1}}_{x} \; \underbrace{t_i}_{a} \; \underbrace{t_{i+1} \dots t_n}_{y} \]

Consider all suffixes sa of xa:

t_1 … t_{i−1} a, t_2 … t_{i−1} a, …, t_{i−1} a and a.

There are three cases: 

• If sa occurs in ST(x), do nothing. 

• If s is a leaf in ST(x), extend the label of the corresponding leaf edge by a. 

• Otherwise, sa is a relevant suffix and needs to be inserted appropriately.


2.30 Implicit Suffix Tree 

To be precise, each tree T computed by the induction step of Ukkonen’s algorithm is an implicit 

suffix tree in the following sense: 

Definition. An implicit suffix tree for string t is a tree obtained from the suffix tree for t$ by 

removing every copy of the terminal symbol $ from the edge labels of the tree, then removing 

any edge that has no label, and then removing any node that does not have at least two children. 

In the following description of Ukkonen’s algorithm, we will not distinguish between implicit suffix tree and suffix tree. Note that the latter can be obtained from the former in linear time in a straightforward way. (Do you know how?)

2.31 The INSERT set 

Because ST(x) represents all substrings of x and ST(xa) represents all substrings of xa, the 

induction step must add the set INSERT of all substrings of xa that are not substrings of x. 

Observation I: For all w ∈ INSERT there exists a suffix s of x such that w = sa. 

Proof: w ∉ ST(x) implies w ≠ ε and w ends with t_{i+1} = a. □

Observation II: For all sa ∈ INSERT we have: sa is a leaf in ST(xa). 

Proof: Assume that sa is NOT a leaf in ST(xa). Then sa occurs at least twice in xa, and thus 

at least once in x, contradicting the definition of INSERT. □ 

Partition INSERT into: 

• INSERTleaf = {sa ∈ INSERT | s is a leaf in ST(x)}, and 

• INSERTrelevant = {sa ∈ INSERT | s is not a leaf in ST(x)}.

2.32 Processing INSERTleaf 

Consider sa ∈ INSERTleaf . 

Then s is a leaf node in ST(x). Let u --v--> s be the corresponding leaf edge. To insert sa, we modify this edge as follows:

u --v--> s becomes u --va--> sa.

I.e., to insert all elements of INSERTleaf , we have to extend all leaf edges in ST(x) by the new 

character a. 

Implementation: We represent the label of a leaf edge by a pair (r, e), where r is the start of 

the label in t and e points to a variable that contains the length of the visible string. 

2.33 Processing INSERTrelevant 

If sa ∈ INSERTrelevant, then s is not a leaf in ST(x), or equivalently, s is a nested suffix of x 

(i.e. a suffix of x that occurs more than once in x). 

Definition A suffix sa of xa is called relevant if s is a nested suffix of x and sa is not a substring 

of x.


The induction step from ST(x) to ST(xa) has thus been reduced to the following: Insert all 

relevant suffixes sa of xa into ST(x). 

We will see that relevant suffixes form an interval in the list of all suffixes of xa, bounded by 

“active suffixes”. 

Definition The active suffix α(x) of x is the longest nested suffix of x. 

2.34 Example 

Consider the text adcdacdad. Each column contains all suffixes of a prefix of the string: 

i:  0   1    2    3     4      5       6        7         8          9
    ε   ↓a   ad   adc   adcd   adcda   adcdac   adcdacd   adcdacda   adcdacdad
        ε    ↓d   dc    dcd    dcda    dcdac    dcdacd    dcdacda    dcdacdad
             ε    ↓c    cd     cda     cdac     cdacd     cdacda     cdacdad
                  ε     d      ↓da     dac      dacd      dacda      dacdad
                        ε      a       ↓ac      acd       acda       acdad
                               ε       c        cd        cda        ↓cdad
                                       ε        d         da         ↓dad
                                                ε         a          ad
                                                          ε          d
                                                                     ε

Relevant suffixes sa are marked by ↓. The active suffixes α(xa) (printed in bold face in the original table) are, for i = 0, …, 9: ε, ε, ε, ε, d, a, c, cd, cda, ad.

2.35 Four key observations 

(O 1) For all suffixes s of x: s is nested ⇔ |α(x)| ≥ |s|. 

(O 2) For all suffixes s of x: sa is a relevant suffix of xa ⇔ |α(x)a| ≥ |sa| > |α(xa)|. 

(O 3) α(xa) is a suffix of α(x)a. 

(O 4) If sa = α(xa) and α(x)a ≠ sa, then s is a right-branching substring of x. 

(Definition: A substring s of w is called right-branching, if there exist two different letters a,b such 

that sa and sb both occur in w.) 
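The definitions of nested, active and relevant suffixes can be checked by brute force on the example text. The sketch below is only meant to reproduce the table of the example, not to be efficient. Two conventions are assumptions of this sketch: the empty suffix is treated as nested for every x (so that α(ε) = ε), and relevance is tested via (O 1), i.e. s is nested exactly if |s| ≤ |α(x)|.

```python
def count_occurrences(s, x):
    """Number of (possibly overlapping) occurrences of s in x."""
    return sum(1 for i in range(len(x) - len(s) + 1) if x.startswith(s, i))

def active_suffix(x):
    """Longest nested suffix of x (the empty suffix counts as nested by convention)."""
    for start in range(len(x) + 1):               # longest suffix first
        s = x[start:]
        if s == "" or count_occurrences(s, x) > 1:
            return s

def relevant_suffixes(x, a):
    """Suffixes sa of xa such that s is nested in x and sa is not a substring of x."""
    alpha = active_suffix(x)
    result = []
    for start in range(len(x) + 1):
        s = x[start:]
        if len(s) <= len(alpha) and (s + a) not in x:
            result.append(s + a)
    return result

t = "adcdacdad"
for i in range(len(t)):
    x, a = t[:i], t[i]
    print(i + 1, repr(active_suffix(x + a)), relevant_suffixes(x, a))
```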

2.36 Proof 

Ad 1: follows from the definition of an active suffix.


Ad 2:

sa is a relevant suffix of xa
⇔ s is a nested suffix of x and sa is not a substring of x
⇔ |α(x)| ≥ |s| and sa is not a substring of x
⇔ |α(x)| ≥ |s| and sa is not a nested suffix of xa
⇔ |α(x)a| ≥ |sa| and |sa| > |α(xa)|
⇔ |α(x)a| ≥ |sa| > |α(xa)|.

Ad 3: 

Both α(xa) and α(x)a are suffixes of xa, so we need only show |α(x)a| ≥ |α(xa)|.

This is clearly true, if α(xa) = ε.

Let α(xa) = wa. Since wa is a nested suffix of xa, we have uwav = xa for some strings u and v ≠ ε.

Hence, w is a nested suffix of x. 

Since α(x) is the longest nested suffix of x, we have |α(x)| ≥ |w| and hence |α(x)a| ≥ |wa| = 

|α(xa)|. 

Ad 4: 

Suppose sa = α(xa) and α(x)a ≠ sa. Then there is a suffix csa of xa such that |α(x)a| ≥ 

|csa| > |α(xa)|. 

Statement 2 implies that csa is a relevant suffix of xa.

I.e., cs is a nested suffix of x and csa is not a substring of x. 

Hence, there exists a character b ≠ a such that csb is a substring of x. Since sa is a substring 

of x, too, s is a right-branching substring of x. 

This completes the proof. □ 

2.37 Reformulation of the induction step 

Observation 2 states that all relevant suffixes of xa lie “between” α(x)a and α(xa). In particular, α(xa) is the longest suffix of α(x)a that is a substring of x, by Observation 3.

Hence, the induction step can be formulated as follows: 

Take the suffixes of α(x)a one after the other by decreasing length and insert them 

into ST(x), until a suffix is found which occurs in the tree and therefore equals 

α(xa). 

2.38 Total number of relevant suffixes 

Observation 2 implies that for each i ∈ {1, …, n − 1}, the relevant suffixes of t_1 … t_{i+1} lie between α(t_1 … t_i) t_{i+1} and α(t_1 … t_{i+1}). Hence, the total number of relevant suffixes is bounded by:

\[
\begin{aligned}
\sum_{i=1}^{n-1} \bigl( |\alpha(t_1 \dots t_i)\, t_{i+1}| - |\alpha(t_1 \dots t_{i+1})| \bigr)
&= \sum_{i=1}^{n-1} \bigl( |\alpha(t_1 \dots t_i)| + 1 - |\alpha(t_1 \dots t_{i+1})| \bigr) \\
&= n - 1 + |\alpha(t_1)| - |\alpha(t_1 \dots t_n)| \le n.
\end{aligned}
\]

2.39 Pseudo-code formulation 

We can formulate the step ST(x) → ST(xa) as follows: 

v := α(x)a
while v does not occur in ST(x) do
    insert v in ST(x)
    v := drop 1 v
α(xa) := v.

Because the number of relevant suffixes is bounded by n, these operations are performed O(n) 

times and we will obtain a linear time algorithm, if we can perform each of the following steps 

in constant time: 

1. decide if v occurs in ST(x), 

2. insert v in ST(x), and 

3. drop the first character from v. 
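The control flow of the induction step can be traced without any tree. In the sketch below a plain substring test stands in for "v occurs in ST(x)", so none of the three steps is constant time and the sketch is far from linear; it only illustrates which suffixes get inserted and how the active suffix is carried from one step to the next. The function name ukkonen_trace is hypothetical.

```python
def ukkonen_trace(t):
    """For every prefix xa of t, record (xa, inserted relevant suffixes, alpha(xa))."""
    alpha = ""                       # active suffix of the empty prefix
    trace = []
    for i, a in enumerate(t):
        x = t[:i]
        v = alpha + a                # v := alpha(x)a
        inserted = []
        while v not in x:            # v does not occur in ST(x)
            inserted.append(v)       # insert v in ST(x)
            v = v[1:]                # v := drop 1 v
        alpha = v                    # alpha(xa) := v
        trace.append((x + a, inserted, alpha))
    return trace

for prefix, inserted, alpha in ukkonen_trace("adcdacdad"):
    print(prefix, inserted, repr(alpha))
```

The loop always terminates, because v eventually becomes the empty string, which occurs in every x. Summing the lengths of the inserted lists over all steps gives the total number of relevant suffixes, which stays below the text length.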

2.40 Two ideas 

Idea 1: Note that v = ε or v = sa for some string s occurring in ST(x). (Follows from the definition of a relevant suffix!) Hence, we can represent v by the appropriate edges and nodes in ST(x). As we will see, this will enable us to implement steps (1) and (2) in constant time.

Idea 2: The second idea is to construct for each branching node, say bw, an auxiliary edge 

called a suffix link which points to the branching node w, if it exists. This allows us to implement 

step (3) in constant time. 

2.41 Example 

A suffix tree with suffix links:


[Figure: a suffix tree T with its suffix links. The tree structure is:]

root
 ├─ abca
 ├─ bca
 └─ c ─┬─ abca
       └─ cabca

Locations:

loc_T(ε) = root
loc_T(a) = (root, a, bca, abca)
loc_T(abca) = (root, abca, ε, abca)
loc_T(bc) = (root, bc, a, bca)
loc_T(c) = c
loc_T(cab) = (c, ab, ca, cabca)

2.42 The location of an occurring string 

Definition: Let T be a suffix tree and s a string that occurs in T. The location loc_T(s) of s in T is defined as follows:

• If s is a branching node, then loc_T(s) := s.

• If s is a leaf, then there is a leaf edge u --v--> s in T and loc_T(s) := (u, v, ε, s).

• If there is no node s in T, then there is an edge u --vw--> uvw in T such that s = uv, v ≠ ε, w ≠ ε, and loc_T(s) := (u, v, w, uvw).

If a location is a node, we call it a node location, otherwise, an edge location. 

2.43 Operations on locations 

We define the following four operations on locations: 

(1) occurs(loc T (s), a) = true ⇔ sa occurs in T. This operation can be implemented in 

constant time. 

(2) getloc(loc T (s), w) = loc T (sw) for all sw that occur in T. This operation can be implemented 

in O(|w|) time, simply by following characters of w in T. 

(3) Insertion of say, i.e. insertion of ay under loc_T(s), delivers the pair (T′, z), which is specified as follows:

– If loc_T(s) = s, then T′ is obtained from T by adding a leaf edge s --ay--> say. Moreover, z is undefined.

– If loc_T(s) = (u, v, w, uvw), then T′ is obtained from T by splitting the edge u --vw--> uvw into u --v--> s --w--> uvw, and adding a new leaf edge s --ay--> say. Moreover, z := s, i.e., z is set to the new inner node created by the splitting.

Note that this can be done in constant time. We will also use insert(pos, a) to refer to this operation.

(4) Linking locations using suffix links:

linkloc(s) := z,

where s → z is the suffix link for s, and

\[ \mathrm{linkloc}(u, av, w, uavw) := \begin{cases} \mathrm{loc}_T(v) & \text{if } u = \mathit{root}, \\ \mathrm{getloc}(z, av) & \text{otherwise,} \end{cases} \]

where u → z is the suffix link for u.

2.44 Example of the insertion operation 

Insertion of d at loc T (bc): 

Before:

root
 ├─ abca
 ├─ bca
 └─ c ─┬─ abca
       └─ cabca

After:

root
 ├─ abca
 ├─ bc ─┬─ a
 │      └─ d
 └─ c ──┬─ abca
        └─ cabca

2.45 Another observation 

Observation Let T be a suffix tree such that suffix links for all branching nodes in T are 

defined. If cy and y occur in T, then 

linkloc(loc T (cy)) = loc T (y). 

Proof: This follows directly from the definitions. 

2.46 Example of algorithm 

Text: nanuna 

Table: 

    0     1     2     3      4       5        6         7
    ε     n     na    nan    nanu    nanun    nanuna    nanuna$
          ε     a     an     anu     anun     anuna     anuna$
                ε     n      nu      nun      nuna      nuna$
                      ε      u       un       una       una$
                             ε       n        na        na$
                                     ε        a         a$
                                              ε         $
                                                        ε

2.47 Example of algorithm 

Text: nanuna


Table: 

    0     1     2     3      4       5        6         7
    ↓ε↑   ↓n    na    nan    nanu    nanun    nanuna    nanuna$
          ε↑    ↓a    an     anu     anun     anuna     anuna$
                ε↑    ↓n↑    ↓nu     nun      nuna      nuna$
                      ε      u       un       una       una$
                             ε↑      ↓n↑      ↓na↑      ↓na$
                                     ε        a         a$
                                              ε         $
                                                        ε↑

Recall: sa relevant ⇔ |α(xa)| < |sa| ≤ |α(x)a|.

The active suffix α(xa), i.e. the longest nested suffix of xa, is shown here with a trailing ↑.

We show α(x)a here with a leading ↓.

In each column, the algorithm inserts all relevant suffixes, i.e. the entries from the one marked ↓ down to, but not including, the one marked ↑, proceeding from top-left to bottom-right of the table.

2.48 Parameters for the main step 

We are ready to define the main step of Ukkonen’s algorithm. The parameters will satisfy the 

following properties: 

• T is the current suffix tree, 

• L is the current set of suffix links for T, 

• a = t i+1 is the current input character, 

• y = t i+2 . . .t n is the remaining input string, 

• z denotes a node for which a suffix link must be set, 

• loc is the location of s in T, where sa is a suffix of xa with |α(x)a| ≥ |sa| > |α(xa)|. 

2.49 The main step of Ukkonen’s algorithm 

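Initially, set T := ST(ε), L := ∅, i := 1, t = t_1 … t_n, loc := root.

while i ≤ n do
    Set x := t_1 … t_{i−1}, a := t_i and y := t_{i+1} … t_n.
    (T′, L′, loc′) := ukkstep(T, L, ay, undefined, loc).
    Set T := T′, L := L′, loc := loc′ and i := i + 1.

The main step:

\[ \mathrm{ukkstep}(T, L, ay, z, \mathit{loc}) := \begin{cases} (T, L', \mathrm{getloc}(\mathit{loc}, a)) & \text{if } \mathrm{occurs}(\mathit{loc}, a), \\ (T', L', \mathit{loc}) & \text{else if } \mathit{loc} = \mathit{root}, \\ \mathrm{ukkstep}(T', L', ay, r, \mathrm{linkloc}(\mathit{loc})) & \text{otherwise,} \end{cases} \]

where (T′, r) is obtained by inserting ay at loc, and

\[ L' = \begin{cases} L & \text{if } z \text{ is undefined,} \\ L \cup \{z \to \mathit{loc}\} & \text{else if } \mathrm{occurs}(\mathit{loc}, a) \text{ or } r \text{ undefined,} \\ L \cup \{z \to r\} & \text{otherwise.} \end{cases} \]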

2.50 Main result 

Theorem Let t be a text of length n. Ukkonen’s algorithm computes the suffix tree ST(t) 

with suffix links L in O(n) time and space. 

Proof: The algorithm takes care of all non-relevant and relevant suffixes. It takes O(n) steps to process all non-relevant suffixes. There are at most n relevant suffixes. Processing of each relevant suffix is done in constant time.

2.53 Applications of suffix trees 

1. Searching for exact patterns 

2. Minimal unique substrings 

3. Maximal unique matches 

4. Maximal repeats 

5. Approximate repeats 

Additional literature: 

Stefan Kurtz, Enno Ohlebusch, Chris Schleiermacher, Jens Stoye and Robert Giegerich, Computation 

and visualization of degenerate repeats in complete genomes, ISMB 2000, p. 228-238, 2000. 

Stefan Kurtz and Chris Schleiermacher, REPuter: fast computation of maximal repeats in complete 

genomes, Bioinformatics, 15(5):426-427 (1999) 

2.54 Searching for exact patterns 

To determine whether a string q occurs in a string t, follow the path from the root of suffix tree 

ST(t) as directed by the characters of q. If at some point you cannot proceed, then q does not 

occur in t, otherwise it does. 

Text abab$:

root
 ├─ ab ─┬─ ab$   (leaf 1)
 │      └─ $     (leaf 3)
 ├─ b ──┬─ ab$   (leaf 2)
 │      └─ $     (leaf 4)
 └─ $            (leaf 5)

The query abb is not contained in abab: Following ab we arrive at the node ab, however there is no 

b-edge leaving from there. The query baa is not contained in abab: Follow the b edge to b and then 

continue along the leaf edge whose label starts with a. The next letter of the label is b and doesn’t 

match the next letter of the query string. 

Clearly, the algorithm that matches a query q against the text t runs in O(|q|) time.


2.55 Finding all occurrences 

To find all positions where the query q is contained in t, annotate each leaf s_i of the suffix tree with the position i at which the suffix s_i starts in t.

Then, after matching q to a path in the tree, visit all nodes below the path and return the 

annotated values. 

This works because any occurrence of q in t is the prefix of one of these suffixes. 

The number of nodes below the path is at most twice the number of hits and thus finding and 

collecting all hits takes time O(|q| + k), where k is the number of occurrences. 

(Note that in the lazy suffix tree implementation discussed above we do not use this leaf annotation but rather compute the positions from the lp values, to save space.)
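Matching a query and collecting the leaf annotations below the reached location can be sketched as follows. The sketch uses assumptions of its own: the tree is a nested dict built naively from explicit suffix copies (edge label mapped to subtree, a leaf being the annotated start position), not the table T of the lazy implementation, and the child lookup scans all outgoing edges, which costs a factor of |Σ| per node.

```python
def build(suffixes):
    """Naive suffix tree as nested dicts (edge label -> subtree, leaf -> start position)."""
    groups = {}
    for s, pos in suffixes:
        groups.setdefault(s[0], []).append((s, pos))
    node = {}
    for group in groups.values():
        if len(group) == 1:
            node[group[0][0]] = group[0][1]
        else:
            labels = [s for s, _ in group]
            n = 0                                  # length of the common prefix
            while all(len(s) > n and s[n] == labels[0][n] for s in labels):
                n += 1
            node[labels[0][:n]] = build([(s[n:], pos) for s, pos in group])
    return node

def leaves(node):
    """All leaf annotations below a node."""
    if isinstance(node, int):
        return [node]
    return [pos for child in node.values() for pos in leaves(child)]

def find(tree, q):
    """Start positions (1-based) of all occurrences of q in the text."""
    node = tree
    while q:
        if isinstance(node, int):
            return []                              # query continues past a leaf
        for label, child in node.items():
            if label[0] == q[0]:
                k = min(len(label), len(q))
                if label[:k] != q[:k]:
                    return []                      # mismatch inside the edge label
                q, node = q[k:], child
                break
        else:
            return []                              # no edge starts with the next letter
    return sorted(leaves(node))

text = "abab$"
tree = build([(text[i:], i + 1) for i in range(len(text))])
print(find(tree, "ab"), find(tree, "abb"), find(tree, "baa"))
```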

2.56 Maximal Unique Matches 

Standard dynamic programming is too slow for aligning two large genomes. If the genomes are similar, then one can expect to see long identical substrings which occur in both genomes. These maximal unique matches (MUMs) are almost surely part of a good alignment of the two sequences and so the alignment problem can be reduced to aligning the sequences in the gaps between the MUMs.

Given two sequences s and t, and a number l > 0. The maximal unique matches problem 

(MUM-problem) is to find all sequences u with: 

• |u| ≥ l, 

• u occurs exactly once in s and once in t, and 

• for any character a neither ua nor au occurs both in s and t. 

This problem can be solved in O(|s| + |t|) time, by considering the suffix tree for s%t, where % 

is a character that does not occur in s or t, as described earlier.
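For small inputs the definition can be checked by brute force, which is useful as a reference when testing a suffix tree based implementation. The sketch below is not the linear-time method; it enumerates all substrings of s and is cubic or worse. The function names and the two example strings are made up for illustration.

```python
def starts(u, text):
    """All 0-based start positions of u in text (overlaps allowed)."""
    return [i for i in range(len(text) - len(u) + 1) if text.startswith(u, i)]

def mums(s, t, min_len=1):
    """Brute-force maximal unique matches of length >= min_len.

    Returns triples (u, i, j) with 1-based start positions i in s and j in t.
    """
    found = []
    for length in range(min_len, min(len(s), len(t)) + 1):
        for i in range(len(s) - length + 1):
            u = s[i:i + length]
            in_s, in_t = starts(u, s), starts(u, t)
            if len(in_s) != 1 or len(in_t) != 1:
                continue                               # u is not unique in s and t
            j = in_t[0]
            # u can be extended if the neighboring letters exist and agree
            left = i > 0 and j > 0 and s[i - 1] == t[j - 1]
            right = (i + length < len(s) and j + length < len(t)
                     and s[i + length] == t[j + length])
            if not left and not right:
                found.append((u, i + 1, j + 1))
    return found

print(mums("tacgta", "gacgtc"))
```

Because u occurs exactly once in each string, ua (or au) occurs in both s and t exactly if the letters following (or preceding) the two occurrences agree, which is what the two boolean tests check.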


2.57 Example 

2.58 Example


2.59 Repeats in human 

(Nature, vol. 409, p. 880, 15 Feb 2001)

2.60 Definition of a maximal repeat 

Given a sequence t = t 1 t 2 . . .t n . 

A substring t[i, j] := t_i … t_j is represented by the pair (i, j). A pair R = (l, r) of different substrings l = (i, j) and r = (i′, j′) of t is called a repeat, if i < i′ and t_i … t_j = t_{i′} … t_{j′}. We call l and r the left and right instance of the repeat R, respectively.

[Figure: the text t with the two instances (i, j) and (i′, j′) marked; t_i … t_j = t_{i′} … t_{j′}.]

A repeat R = ((i, j), (i′, j′)) is called left maximal, if i = 1 or t_{i−1} ≠ t_{i′−1}, and right maximal, if j′ = n or t_{j+1} ≠ t_{j′+1}, and maximal, if both.

2.61 Example 

[Figure: the text t with the two instances of a repeat and their neighboring letters: a t_i … t_j b and c t_{i′} … t_{j′} d.]

maximal ⇔ a ≠ c and b ≠ d

The string

 1  2  3  4  5  6  7  8  9 10
 g  a  g  c  t  c  g  a  g  c

contains the following repeats of length ≥ 2:

((1, 4), (7, 10)) gagc
((1, 3), (7, 9)) gag
((1, 2), (7, 8)) ga
((2, 4), (8, 10)) agc
((2, 3), (8, 9)) ag
((3, 4), (9, 10)) gc
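The definitions of repeat and maximal repeat can be checked by brute force on this string. The sketch below compares all pairs of substrings and is only intended for small examples; the function names are chosen for illustration, and positions are 1-based and inclusive as in the text.

```python
def repeats(t, min_len=2):
    """All repeats ((i, j), (i2, j2)) with i < i2 and length >= min_len."""
    n = len(t)
    result = []
    for length in range(min_len, n):
        for i in range(1, n - length + 2):
            for i2 in range(i + 1, n - length + 2):
                if t[i - 1:i - 1 + length] == t[i2 - 1:i2 - 1 + length]:
                    result.append(((i, i + length - 1), (i2, i2 + length - 1)))
    return result

def is_maximal(t, repeat):
    """Left maximal and right maximal, following the definition."""
    (i, j), (i2, j2) = repeat
    left = i == 1 or t[i - 2] != t[i2 - 2]        # t_{i-1} != t_{i'-1}
    right = j2 == len(t) or t[j] != t[j2]         # t_{j+1} != t_{j'+1}
    return left and right

t = "gagctcgagc"
print([r for r in repeats(t) if is_maximal(t, r)])
```

Of all the repeats of length ≥ 2 in this string, only gagc passes both tests; every shorter one can be extended to the left or to the right.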


2.62 An algorithm for computing all maximal repeats 

We will discuss how to compute all maximal repeats. Let t be a string of length n and assume that the first and last letter of t both occur exactly once in t, e.g.:

     1  2  3  4  5  6  7  8  9 10 11 12 13
t =  x  g  g  c  g  c  y  g  c  g  c  c  z

Let T be the suffix tree for t. 

We can ignore all leaf edges from the root. 

The algorithm proceeds in two phases: 

In the first phase, every leaf node v of T is annotated by (a, i), where v = t i . . .t n is the suffix 

associated with v and a = t i−1 is the letter that occurs immediately before the suffix. 

2.63 Example 

Partial suffix tree for 

1 2 3 4 5 6 7 8 9 10 11 12 13 

t = x g g c g c y g c g c c z : 

[Figure: partial suffix tree of t; the edge labels include c, g, z, cz, gc, ygcgccz and gcgcygcgccz.] 

With leaf annotations: 

[Figure: the same partial suffix tree with leaf annotations (a, i): (x, 2), (g, 3), (g, 4), (c, 5), (g, 6), (y, 8), (g, 9), (c, 10), (g, 11), (c, 12).] 

2.64 Second phase of the algorithm 

For every leaf node v set: 

$$A(v, c) = \begin{cases} \{i\}, & \text{if } c = t_{i-1}, \\ \emptyset, & \text{else,} \end{cases}$$

where i is the start position of the corresponding suffix v. 

In the second phase of the algorithm, we extend this annotation to all branching nodes bottom-up: 

Let w be a branching node with children v 1 . . .v h and assume we have computed A(v j , c) for 

all j ∈ {1, . . ., h} and all c ∈ Σ.


For each letter c ∈ Σ set: 

$$A(w, c) := \bigcup_{j=1}^{h} A(v_j, c).$$

Note that this is a disjoint union and A(w, c) is the set of all start positions of w in t for which 

t i−1 = c. 

2.65 Example 

1 2 3 4 5 6 7 8 9 10 11 12 13 

t = x g g c g c y g c g c c z 

[Figure: suffix tree of t with the annotation of the branching nodes and the output repeats of length ≥ 2: ((3,4),(5,6)), ((3,4),(10,11)), ((5,6),(8,9)) and ((8,9),(10,11)) for the word gc, and ((3,6),(8,11)) for the word gcgc.] 

2.66 Reporting all maximal repeats 

In a bottom-up traversal, for each branching node w we first determine A(w, c) for all c ∈ Σ 

and then report all maximal repeats of the word w: 

Let q be the current depth, i.e. number of characters from the root node, i.e. the length of w. 

for each pair of children v f and v g of w with v f ≺ v g : 
    for each letter c ∈ Σ with A(v f , c) ≠ ∅: 
        for each i ∈ A(v f , c): 
            for each letter d ∈ Σ with d ≠ c and A(v g , d) ≠ ∅: 
                for each j ∈ A(v g , d): 
                    Print ((i, i + q − 1), (j, j + q − 1)) 
end 
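The two phases can be tried out with the following minimal Python sketch. It is not the linear-time algorithm: instead of building a real suffix tree it simulates the tree by grouping suffixes character by character, which may take quadratic time. The function name `maximal_repeats` and the parameter `min_len` are chosen for this illustration only. The end of the text is treated as a child of its own, so the sketch also works without the unique first and last letters.

```python
from collections import defaultdict
from itertools import combinations


def maximal_repeats(t, min_len=1):
    """Return all maximal repeats ((i, j), (i2, j2)) of t with length >= min_len.

    Positions are 1-based and inclusive, as in the text. The suffix tree is
    only simulated: suffixes are grouped character by character, which costs
    O(n^2) time in the worst case instead of O(n).
    """
    n = len(t)
    min_len = max(min_len, 1)
    repeats = []
    # Each stack entry is a "node": the 0-based start positions of all
    # suffixes that share a common prefix w, together with q = |w|.
    stack = [(list(range(n)), 0)]
    while stack:
        starts, q = stack.pop()
        # Children of the node: split by the character that follows w.
        # A suffix that ends right after w forms a child of its own.
        children = defaultdict(list)
        for i in starts:
            key = t[i + q] if i + q < n else ("end", i)
            children[key].append(i)
        groups = list(children.values())
        if q >= min_len and len(groups) > 1:
            # A[f][c]: start positions below child f that are preceded by c.
            A = []
            for group in groups:
                by_left = defaultdict(list)
                for i in group:
                    by_left[t[i - 1] if i > 0 else ("start",)].append(i)
                A.append(by_left)
            for A_f, A_g in combinations(A, 2):
                for c, pos_f in A_f.items():
                    for d, pos_g in A_g.items():
                        if c == d:
                            continue  # same left neighbour: not left maximal
                        for i in pos_f:
                            for j in pos_g:
                                a, b = sorted((i, j))
                                repeats.append(((a + 1, a + q), (b + 1, b + q)))
        # Only groups with at least two suffixes can contain further repeats.
        stack.extend((g, q + 1) for g in groups if len(g) > 1)
    return sorted(repeats)
```

For t = xggcgcygcgccz and minimum length 2, the sketch returns the four maximal repeats of the word gc and the repeat ((3, 6), (8, 11)) of the word gcgc.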

2.67 Maximality of output 

Lemma The algorithm prints precisely the set of all maximal repeats in t of length ≥ l. 

Proof 

1. Each printed pair R is a repeat, as the word w is the common prefix of two or more 

different suffixes.


2. Each repeat R is left-maximal, as c ≠ d. 

3. Each repeat R is right-maximal, as v f ≠ v g , and by definition of a suffix tree, the labels 

of the edges leading to these two children begin with distinct letters. 

4. No maximal repeat is reported twice, as v f ≺ v g and all unions are disjoint. □ 

2.68 Performance analysis 

Lemma Computation of all maximal repeats of length ≥ l can be done in O(n + z) time and 

O(n) space, where z is the number of maximal repeats. 

Proof The suffix tree can be built in O(n) time and space. We can annotate the tree in O(n) 

time and space, if we use the fact that we only need to keep the annotation of a node until 

its parent has been fully processed. (Also, we maintain the sets as linked lists, so that each 

disjoint-union operation can be done in constant time.) 

In the nested loop we enumerate in total all z maximal repeats in O(z) steps. 

Hence, the algorithm is both time and space optimal. 

□ 

2.69 Significance of repeats 

How significant is a detected maximal repeat? In a long random text we will expect to find 

many short repeats purely by chance. 

The E-value associated with a maximal repeat R in t is the expected number of repeats of 

the same length or longer that are found in a random sequence of length |t|. 

To compute this in the case of DNA, consider a simple Bernoulli model where each base 

α ∈ {A, C, G, T } has the same fixed probability of occurrence: p α = p = 1/4. 

Note that the number of maximal exact repeats of length ≥ l equals the number of (only) 

left-maximal repeats of length exactly l. 

Ignoring boundary effects: 

$$\begin{aligned}
& E[\#\text{ of maximal exact repeats of length} \ge l] \\
&= E[\#\text{ of left-maximal exact repeats of length } l] \\
&= \sum_{1 \le i_1 < i_2 \le n-l+1} \Pr\bigl(t[i_1, i_1+l-1] = t[i_2, i_2+l-1],\ t_{i_1-1} \ne t_{i_2-1}\bigr)
\end{aligned}$$


2.70 Palindromic repeats 

Let t be a DNA sequence. We call ((i, j), (i ′ , j ′ )) a palindromic repeat, if t i . . .t j is equal to the 

reverse complement of t i ′ . . .t j ′ . We write w̄ for the reverse complement of a DNA string w. 

All maximal palindromic repeats can be found using a modification of the described algorithm 

for maximal repeats, based on the suffix tree for x t y t̄ z, where x, y, z are three characters that 

do not appear in t or t̄. 

2.71 Degenerate repeats 

Let u and w be two strings of the same length. The Hamming distance d H (u, w) between u 

and w is the number of positions i such that u i ≠ w i . 

In the following, we assume that we are given a sequence t = t 1 . . .t n , an error threshold k ≥ 0 

and a minimum length l > 0. 

Definition A pair of equal-length substrings R = ((i 1 , j 1 ), (i 2 , j 2 )) is called a k-mismatch repeat 

in t, iff (i 1 , j 1 ) ≠ (i 2 , j 2 ) and d H (t[i 1 , j 1 ], t[i 2 , j 2 ]) = k. The length of R is j 1 −i 1 +1 = j 2 −i 2 +1. 

A k-mismatch repeat is maximal if it is not contained in any other k-mismatch repeat. 

As with exact repeats, a k-mismatch repeat R = ((i 1 , j 1 ), (i 2 , j 2 )) is maximal iff (i 1 = 1 or 

i 2 = 1 or t i1 −1 ≠ t i2 −1) and (j 1 = n or j 2 = n or t j1 +1 ≠ t j2 +1) 

2.72 The Mismatches Repeat Problem 

The Mismatches Repeat Problem (MMR) is to enumerate all maximal k-mismatch repeats of 

length ≥ l contained in t. 

2.73 Example 

Maximal k-mismatch repeats (k = 0,... ,4) for l = 5 in mississippi: 

1 2 3 4 5 6 7 8 9 10 11 

text : m i s s i s s i p p i 

k = 0 : none 

k = 1 : s i s s i 

m i s s i 

k = 2 : s i p p i 

s i s s i 

k = 3 : s i p p i 

m i s s i 

k = 4 : none 

2.74 The Seed Lemma 

The following result is a key observation and is the basis of many seed-and-extend approaches 

to sequence comparison:


Lemma Every maximal k-mismatch repeat R of length l contains a maximal exact repeat of 

length ≥ $\lfloor \frac{l}{k+1} \rfloor$, called a seed. 

Proof Let R = ((i 1 , j 1 ), (i 2 , j 2 )) be a k-mismatch repeat of length ≥ l. The k mismatches 

divide t[i 1 , j 1 ] and t[i 2 , j 2 ] into exact repeats w 0 , w 1 , . . ., w k . Now max i∈[0,k] |w i | is minimal if 

the mismatching character pairs are equally distributed over R. Obviously, for such an equal 

distribution the length of the longest w i is ≥ $\lceil \frac{l-k}{k+1} \rceil = \lfloor \frac{l}{k+1} \rfloor$. □ 

(A “•” indicates a mismatch and we have: l = 11, k = 3, $\lceil \frac{l-k}{k+1} \rceil = \lceil \frac{11-3}{3+1} \rceil = \lceil \frac{8}{4} \rceil = 2 = \lfloor \frac{l}{k+1} \rfloor = \lfloor \frac{11}{3+1} \rfloor$) 
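The bound of the Seed Lemma can be checked exhaustively for small parameters. The following Python sketch places k mismatches in every possible way into an alignment of a given length and verifies that some mismatch-free stretch has length at least ⌊length/(k+1)⌋. The function names are hypothetical and serve only this illustration.

```python
from itertools import combinations


def longest_exact_run(length, mismatch_positions):
    """Length of the longest mismatch-free stretch in an alignment.

    Positions are 0-based; mismatch_positions lists the mismatching columns.
    """
    mismatches = set(mismatch_positions)
    best = run = 0
    for pos in range(length):
        run = 0 if pos in mismatches else run + 1
        best = max(best, run)
    return best


def seed_bound_holds(length, k):
    """Check the bound floor(length / (k + 1)) for all placements of k mismatches."""
    bound = length // (k + 1)
    return all(
        longest_exact_run(length, placement) >= bound
        for placement in combinations(range(length), k)
    )
```

For l = 11 and k = 3, the evenly spread placement at columns 2, 5 and 8 leaves exact stretches of length 2 only, which matches ⌊11/4⌋ = 2.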

2.75 An algorithm for the MMR problem 

Given a text t of length n. To find all maximal k-mismatch repeats in t of length ≥ l, do the 

following: 

1. Build the suffix tree T for t and use it to detect all seeds, i.e., all exact maximal repeats 

of length ≥ $\lfloor \frac{l}{k+1} \rfloor$. 

2. For each seed s = ((i 1 , j 1 ), (i 2 , j 2 )) do the following: 

(a) For q = 0, 1, . . .k, compute: 

left(q) := max{p | d H (t[i 1 − p, i 1 ], t[i 2 − p, i 2 ]) = q}, i.e. the length of the maximal 

extension of the seed to the left with precisely q mismatches. 

(b) For q = 0, 1, . . .k, compute: 

right(q) := max{p | d H (t[j 1 , j 1 + p], t[j 2 , j 2 + p]) = q}, i.e. the length of the maximal 

extension of the seed to the right with precisely q mismatches. 

(c) For q = 0, 1, . . .k: 

If (j 1 − i 1 + 1 + left(q) + right(k − q)) ≥ l, then print 

((i 1 − left(q), j 1 + right(k − q)), (i 2 − left(q), j 2 + right(k − q))). 

2.76 Correctness of the algorithm 

The correctness of the algorithm follows from the Seed Lemma: every maximal k-mismatch 

repeat of length ≥ l contains an exact repeat of length ≥ $\lfloor \frac{l}{k+1} \rfloor$ and can be obtained from the 

seed by extending the seed match with q mismatches to the left and with k − q mismatches to 

the right, for some q ≤ k. 

Note that the same maximal k-mismatch repeat can be obtained via more than one seed. To 

avoid this, when computing left for a given seed, we stop the computation of the table left, 

if we observe a second seed to the left of the original seed. This ensures that we only output 

those maximal k-mismatch repeats for which the given seed is leftmost. 

2.77 Efficient solution of MMR problem 

Lemma The mismatch repeats problem MMR can be solved in O(n+kq) time, where n is the 

length of the text, k is the number of mismatches permitted and q is the number of different 

seeds.


We can find all seeds in O(n+q) time. Thus the result follows, if we can compute the k-mismatch 

extension of any seed in k steps. 

The latter can indeed be achieved, if we can determine the maximal common extension of two 

matches in constant time. This is indeed possible, due to the following amazing result on rooted 

trees: 

2.78 The lowest common ancestor problem 

Let T be a rooted tree. A node u is called an ancestor of a node v iff u lies on the unique path 

from root to v. The lowest common ancestor of two nodes x and y is the last node that is both 

on the path from root to x and on the path from root to y. 

Lemma Let T be a rooted tree. After a linear amount of preprocessing, we can determine the 

lowest common ancestor of any two nodes x and y in constant time. 

(This is due to Harel and Tarjan (1984) and later simplified by Schieber and Vishkin (1988), see 

Chapter 8 of Dan Gusfield’s book for details.) 
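To illustrate the idea of answering lowest common ancestor queries in constant time after preprocessing, here is a minimal Python sketch. It is not the method of Harel and Tarjan or of Schieber and Vishkin: it uses the simpler Euler tour with a sparse table, which needs O(n log n) preprocessing rather than linear preprocessing. The function name `build_lca` and the parent-array input format are assumptions made for this example.

```python
def build_lca(parent):
    """Preprocess a rooted tree and return a function lca(x, y).

    parent[v] is the parent of node v and parent[root] == -1.
    Preprocessing takes O(n log n) time; each query takes constant time.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    root = -1
    for v, p in enumerate(parent):
        if p < 0:
            root = v
        else:
            children[p].append(v)

    # Euler tour: a node is recorded on arrival and after each child returns.
    depth = [0] * n
    first = [0] * n
    tour = []
    stack = [(root, 0)]  # (node, index of the next child to visit)
    while stack:
        v, k = stack.pop()
        if k == 0:
            first[v] = len(tour)
        tour.append(v)
        if k < len(children[v]):
            stack.append((v, k + 1))
            child = children[v][k]
            depth[child] = depth[v] + 1
            stack.append((child, 0))

    # Sparse table: table[j][i] is the shallowest node in tour[i : i + 2**j].
    table = [tour]
    j = 1
    while (1 << j) <= len(tour):
        prev, half = table[-1], 1 << (j - 1)
        table.append([
            min(prev[i], prev[i + half], key=depth.__getitem__)
            for i in range(len(tour) - (1 << j) + 1)
        ])
        j += 1

    def lca(x, y):
        lo, hi = sorted((first[x], first[y]))
        j = (hi - lo + 1).bit_length() - 1
        return min(table[j][lo], table[j][hi - (1 << j) + 1],
                   key=depth.__getitem__)

    return lca
```

In a suffix tree, the string depth of the lowest common ancestor of two leaves is the length of the longest common prefix of the two suffixes, which is what the extension step needs.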

3 DNA arrays 

This exposition is based on the following sources, which are recommended reading: 

1. M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein, Cluster analysis and display of 

genome-wide expression patterns, PNAS, 95:14863-14868, 1998. (see link from webpage) 

2. Pavel Pevzner, Computational Molecular Biology: An Algorithmic Approach, MIT Press, 

2000, chapter 5. (course reserve shelf) 

3. Ron Shamir, Analysis of Gene Expression Data, lectures 1 and 4, 2002. (see link from 

webpage) 

3.1 DNA arrays 

• Also known as: biochips, DNA chips, oligo arrays, DNA microarrays or gene arrays. 

• An array is an orderly arrangement of (spots of) samples. 

• Samples are either DNA or DNA products. 

• Each spot in the array contains many copies of the sample. 

• An array provides a medium for matching known and unknown DNA samples based on base-pairing 

(hybridization) rules and for automating the process of identifying the unknowns. 

• The sample spot size in a microarray is less than 200 microns, and an array contains thousands of 

spots. 

• Microarrays require specialized robotics and imaging equipment. 

• High-throughput biology: a single DNA chip can provide information on thousands of 

genes simultaneously.


3.2 Two possible formats 

We are given an unknown target nucleic acid sample and the goal is to detect the identity and/or 

abundance of its constituents using known probe sequences. Single stranded DNA probes are 

called oligo-nucleotides or oligos. 

There are two different formats of DNA chips: 

• Format I: The target (500-5000 bp) is attached to a solid surface and exposed to a set of 

probes, either separately or in a mixture. The earliest chips were of this kind, used for 

oligo-fingerprinting. 

• Format II: An array of probes is produced either in situ or by attachment. The array is 

then exposed to sample DNA. Examples are oligo-arrays and cDNA microarrays. 

In both cases, the free sequence is fluorescently or radioactively labeled and hybridization is 

used to determine the identity/abundance of complementary sequences. 

3.3 Oligo arrays C(l) 

The simplest oligo array C(l) consists of all possible oligos of length l and is used e.g. in 

sequencing by hybridization (SBH). 

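[Figure: the array C(4) drawn as a 16 × 16 grid. Rows are labeled by the first two bases of the oligo (AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG, CC) and columns by the last two bases in the same order. Example: the spot □ in row TC and column GA holds the oligo TCGA.] 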

3.4 cDNA microarrays 

The aim of this technology is to analyze the expression of thousands of genes in a single 

experiment and to provide measurements of the differential expression of these genes. 

Here, each spot contains, instead of short oligos, identical cDNA clones, which represent a gene. 

(Such complementary DNA is obtained by reverse transcription from some known mRNA.) The


target is the unknown mRNA extracted from a specific cell. As most of the mRNA in a cell is 

translated into a protein, the total mRNA in a cell represents the genes expressed in the cell. 

Since cDNA clones are much longer than the short oligos otherwise used, a successful hybridization 

with a clone is an almost certain match. However, because an unknown amount of cDNA 

is printed at each spot, one cannot directly associate the hybridization level with a transcription 

level and so cDNA chips are limited to comparisons of a reference extract and a target 

extract. 

3.5 Affymetrix chips 

Affymetrix produces oligo arrays with the goal of capturing each coding region as specifically 

as possible. The length of the oligos is usually less than 25 bases. The density of oligos on a 

chip can be very high and a 1cm × 1cm chip can easily contain 100 000 types of oligos. 

The chip contains both “coding” oligos and “control” oligos, the former corresponding to perfect 

matches to known targets and the controls corresponding to matches with one perturbed base. 

When reading the chip, hybridization levels at controls are subtracted from the level of match 

probes to reduce the number of false positives. Actual chip designs use 10 match- and 10 

mismatch probes for each target gene. 

Today, Affymetrix offers chips for the entire (known) human or yeast genomes. 

3.6 Oligo fingerprinting 

Format I chips were the first type used, namely for oligo fingerprinting which is, in a sense, the 

opposite to what Affymetrix chips do. Such a chip consists of a matrix of target DNA and is 

exposed to a solution containing many identical oligos. 

After the positions in the matrix have been recorded at which hybridization of the tagged 

oligos has occurred, the chip can be heated to separate the oligos from the target DNA and the 

experiment can be repeated with a different type of oligo. 

Finally, we obtain a data matrix M, with each row representing a specific target DNA from 

the matrix and each column representing an oligo probe. 

Example: cDNA’s extracted from a tissue. Cluster cDNA’s according to their fingerprints and 

then sequence representatives from each cluster to obtain a sequence that identifies the gene. 

3.7 Manufacturing oligo arrays 

1. Start with a matrix created over a glass substrate. 

2. Each cell contains a growing “chain” of nucleotides that ends with a terminator that 

prevents chain extension. 

3. Cover the substrate with a mask and then illuminate the uncovered cells, breaking the 

bonds between the chains and their terminators. 

4. Expose the substrate to a solution of many copies of a specific nucleotide base so that each 

of the unterminated chains is extended by one copy of the nucleotide base and a new 

terminator.


5. Repeat using different masks. 

Exposure to light replaces the terminators by hydrogen bonds (1–2), bonds form with nucleotide 

bases provided in a solution (3), and then the process is repeated with a different base (4–6). 

3.8 Experiment with a DNA chip 

Labeled RNA molecules are applied to the probes on the chip, creating a fluorescent spot where 

hybridization has occurred. 

3.9 Functional genomics 

With the sequencing of more and more genomes, the question arises of how to make use of 

this data. One area that is now opening up is functional genomics, the understanding of the 

functionality of specific genes, their relations to diseases, their associated proteins and their 

participation in biological processes. 

The functional annotation of genes is still at an early stage: e.g., for the plant Arabidopsis 

(whose sequence was recently completed), the functions of 40% of the genes are currently 

unknown. 

Functional genomics is being addressed using high-throughput methods: global gene expression 

profiling (“transcriptome analysis”) and wide-scale protein profiling (“proteome analysis”).


3.10 Gene expression 

The existing methods for measuring gene expression are based on two biological assumptions: 

1. The transcription level of genes indicates their regulation: Since a protein is generated 

from a gene in a number of stages (transcription, splicing, synthesis of protein from 

mRNA), regulation of gene expression can occur at many points. However, we assume 

that most regulation is done only during the transcription phase. 

2. Only genes which contribute to organism fitness are expressed, in other words, genes that 

are irrelevant to the given cell under the given circumstances etc. are not expressed. 

Genes affect the cell by being expressed, i.e. transcribed into mRNA and translated into proteins 

that react with other molecules. 

From the pattern of expression we may be able to deduce the function of an unknown gene. 

This is especially true, if the pattern of expression of the unknown gene is very similar to the 

pattern of expression of a gene with known function. 

Also, the level of expression of a gene in different tissues and at different stages is of significant 

interest. 

Hence, it is highly interesting to analyze the expression profile of genes, i.e. in which tissues 

and at what stages of development they are expressed. 

3.11 cDNA Clustering 

It is not easy to determine which genes are expressed in each tissue, and at what level: 

An average tissue contains more than 10 000 expressed genes, and their expression levels can 

vary by a factor of 10 000. Hence, we need to extract more than 10^5 transcripts per tissue. 

There are about 100 different types of tissue in the body and we are interested in comparing 

different growth stages, disease stages etc., and so we should analyze more than 10^10 transcripts. 

⇒ Sequencing all cDNA’s is infeasible and we need cheap, efficient and large scale methods. 

3.12 Representation of gene expression data 

Gene expression data is represented by a raw data matrix R, where each row corresponds to 

one gene and each column represents one tissue or condition. Thus, R ij is the expression level 

for gene i in condition j. The values can be ratios, absolute values or distributions. 

Before it is analyzed, the raw data matrix is preprocessed to compute a similarity or distance 

matrix. 

[Figure: the raw data matrix (rows: genes, columns: conditions, entry R ij : expression level) is converted into a distance matrix (rows and columns: genes, entry D ij ).] 


3.13 Clustering 

The first step in analyzing gene expression data is clustering. 

Clustering methods are used in many fields. The goal in a clustering problem is to group 

elements (in our case genes) into clusters satisfying: 

1. Homogeneity: Elements inside a cluster are highly similar to each other. 

2. Separation: Elements from different clusters have low similarity to each other. 

There are two types of clustering methods: 

• Agglomerative methods build clusters by looking at small groups of elements and performing 

calculations on them in order to construct larger groups. 

• Divisive methods analyze large groups of elements in order to divide the data into smaller 

groups and eventually reach the desired clusters. 

Why would we want to cluster gene expression data? Research shows that: 

• Distinct measurements of same genes cluster together. 

• Genes of similar function cluster together. 

• Many cluster-function specific insights are gained. 

3.14 Hierarchical clustering 

This approach attempts to place the input elements in a tree hierarchy structure in which 

distance within the tree reflects element similarity. 

To be precise, the hierarchy is represented by a tree and the actual data is represented by the 

leaves of the tree. The tree can be rooted or not, depending on the method used. 

Distance matrix → Dendrogram 

gene  1  2  3  4 
1     0  3  5  7 
2     3  0  5  7 
3     5  5  0  7 
4     7  7  7  0 

[Figure: dendrogram with leaves gene 1, gene 2, gene 3 and gene 4 and a distance scale marked 0, 3, 5 and 7.] 

3.15 The Neighbor Joining algorithm 

A popular algorithm for “tree building” is Neighbor Joining (NJ), due to Saitou and Nei (1987). 

The algorithm proceeds as follows: 

1. Input: Distance matrix D 

2. For all s, define $b_s := \frac{1}{|L|-2} \sum_{k \in L} D_{sk}$, the average distance from s to any other node 

(here L denotes the current set of elements). 

3. Find elements r, s such that D rs − (b s + b r ) is minimal. 

4. Merge clusters represented by r, s. 

5. Delete elements r, s and add a new element t with: 

$$D_{it} := D_{ti} := \frac{D_{ir} + D_{is} - D_{rs}}{2}$$

6. Repeat steps 2–5, until one element is left. 
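The Neighbor Joining steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `neighbor_joining_order`, the integer labels for new elements and the tie-breaking rule (first minimal pair) are assumptions. Because the factor 1/(|L| − 2) is undefined for two elements, the sketch stops the loop when two elements remain and joins them directly. Branch lengths are not computed.

```python
def neighbor_joining_order(D):
    """Return the list of merges (r, s, t) performed by Neighbor Joining.

    D is a symmetric distance matrix given as a list of lists. Input elements
    are labeled 0..n-1 and every merged pair receives the next free label t.
    """
    n = len(D)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(n) if i != j}
    nodes = list(range(n))
    next_label = n
    merges = []
    while len(nodes) > 2:
        # Step 2: b_s = sum of distances from s divided by |L| - 2.
        b = {s: sum(dist[s, k] for k in nodes if k != s) / (len(nodes) - 2)
             for s in nodes}
        # Step 3: pair minimizing D_rs - (b_r + b_s); ties go to the first pair.
        pairs = ((r, s) for a, r in enumerate(nodes) for s in nodes[a + 1:])
        r, s = min(pairs, key=lambda p: dist[p] - b[p[0]] - b[p[1]])
        # Steps 4 and 5: replace r and s by a new element t.
        t = next_label
        next_label += 1
        for i in nodes:
            if i not in (r, s):
                dist[i, t] = dist[t, i] = (dist[i, r] + dist[i, s] - dist[r, s]) / 2
        nodes = [i for i in nodes if i not in (r, s)] + [t]
        merges.append((r, s, t))
    # Two elements remain: join them with a final merge.
    merges.append((nodes[0], nodes[1], next_label))
    return merges
```

On the 4 × 4 distance matrix of the hierarchical clustering example, the pairs (gene 1, gene 2) and (gene 3, gene 4) obtain the same minimal score, so either may be merged first; the sketch picks the first of them.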

3.16 Average linkage 

Average linkage is similar to NJ, except that when computing the new distances of created 

clusters, the sizes of clusters that are merged are taken into consideration. This algorithm was 

developed by Lance and Williams (1967) and Sokal and Michener (1958). 

1. Input: The distance matrix D ij , initial cluster sizes n r . 

2. Iteration k: The same as in NJ, except that the distance from a new element t is defined 

by: 

$$D_{it} := D_{ti} := \frac{n_r}{n_r + n_s} D_{ir} + \frac{n_s}{n_r + n_s} D_{is}$$
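The size-weighted update can be illustrated with a short Python sketch. Note one assumption that deviates from the description above: the pair to merge is chosen as the pair with the smallest current distance (the usual average linkage rule), not by the Neighbor Joining criterion. The function name `average_linkage` is hypothetical.

```python
def average_linkage(D):
    """Return the merges (r, s, height) of average linkage clustering.

    D is a symmetric distance matrix given as a list of lists. Input elements
    are labeled 0..n-1, merged clusters receive the next free label.
    """
    n = len(D)
    dist = {(i, j): float(D[i][j]) for i in range(n) for j in range(n) if i != j}
    size = {i: 1 for i in range(n)}
    nodes = list(range(n))
    next_label = n
    merges = []
    while len(nodes) > 1:
        # Assumption of this sketch: merge the closest pair of clusters.
        pairs = ((r, s) for a, r in enumerate(nodes) for s in nodes[a + 1:])
        r, s = min(pairs, key=dist.__getitem__)
        t = next_label
        next_label += 1
        # Size-weighted average of the distances to the two merged clusters.
        for i in nodes:
            if i not in (r, s):
                d = (size[r] * dist[i, r] + size[s] * dist[i, s]) / (size[r] + size[s])
                dist[i, t] = dist[t, i] = d
        size[t] = size[r] + size[s]
        merges.append((r, s, dist[r, s]))
        nodes = [i for i in nodes if i not in (r, s)] + [t]
    return merges
```

For the 4 × 4 example matrix of the hierarchical clustering section, the merges happen at distances 3, 5 and 7, in agreement with the distance scale of the dendrogram.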

3.17 Non-Hierarchical clustering 

Given a set of input vectors. For a given clustering P of them into k clusters, let 

$E_P := \sum_{c} \sum_{v \in c} D(v, z_c)$ denote the solution cost function, where z c is the centroid (average vector) of 

the cluster c and D(v, z c ) is the distance from v to z c . 

The k-means clustering due to Macqueen (1965) operates as follows: 

1. Initialize an arbitrary partition P into k clusters. 

2. For each cluster c and element e: 

Let E P (c, e) be the cost of the solution if e is moved to c. 

3. Pick c, e so that E P (c, e) is minimum. 

4. Move e to c, if it improves E P . 

5. Repeat until no further improvement is achieved. 
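The move-based procedure above can be sketched in Python as follows. The sketch makes several assumptions that are not fixed by the text: D is the Euclidean distance (via `math.dist`, Python 3.8 or later), the initial partition assigns the points round-robin, and the names `k_means_moves`, `centroid` and `cost` are hypothetical. It recomputes the full cost for every candidate move, so it is only suitable for small inputs.

```python
import math


def centroid(points):
    """Average vector of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def cost(clusters):
    """E_P: sum over all clusters of the distances to the cluster centroid."""
    total = 0.0
    for cluster in clusters:
        if cluster:
            z = centroid(cluster)
            total += sum(math.dist(v, z) for v in cluster)
    return total


def k_means_moves(points, k):
    """Local search: repeatedly apply the single move that lowers E_P most."""
    clusters = [list(points[i::k]) for i in range(k)]  # arbitrary start
    while True:
        best_cost, best_clusters = cost(clusters), None
        for a, source in enumerate(clusters):
            for e in source:
                for b in range(k):
                    if b == a:
                        continue
                    trial = [list(c) for c in clusters]
                    trial[a].remove(e)
                    trial[b].append(e)
                    trial_cost = cost(trial)
                    if trial_cost < best_cost - 1e-12:
                        best_cost, best_clusters = trial_cost, trial
        if best_clusters is None:
            return clusters  # no move improves the cost any further
        clusters = best_clusters
```

Like any local search, this may stop in a local minimum that depends on the initial partition.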

3.18 Application to fibroblast cells 

Eisen et al. (1998) performed a series of experiments on real gene expression data. One goal 

was to check the growth response of starved human fibroblast cells, which were given serum. 

About 8600 gene levels were monitored over 13 time points.


The original data of test to reference ratios was first log transformed, and then normalized to 

have mean 0 and variance 1. Let N ij denote these normalized levels. A similarity matrix was 

constructed from N ij as follows: 

$$S_{kl} := \frac{\sum_j N_{kj} N_{lj}}{N_{cond}},$$

where N cond is the number of conditions checked. 
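The construction of the similarity matrix can be sketched in Python. Two details are assumptions of this sketch, since the text does not fix them: the logarithm is taken to base 2, and each gene (row) is normalized separately using the population variance. The function names are hypothetical, and rows with constant values are not handled.

```python
import math


def normalize(row):
    """Shift and scale a row to mean 0 and (population) variance 1."""
    mean = sum(row) / len(row)
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
    return [(x - mean) / sd for x in row]


def similarity_matrix(ratios):
    """S_kl = (sum over j of N_kj * N_lj) / N_cond for log-transformed ratios.

    ratios[k][j] is the test-to-reference ratio of gene k in condition j.
    """
    N = [normalize([math.log2(x) for x in row]) for row in ratios]
    n_cond = len(N[0])
    return [
        [sum(a * b for a, b in zip(N[k], N[l])) / n_cond for l in range(len(N))]
        for k in range(len(N))
    ]
```

With this normalization, S kl is the correlation coefficient of the log-transformed profiles of genes k and l, so it lies between −1 and 1.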

The average linkage method was then used to generate the following tree: 

The dendrogram resulting from the starved human fibroblast cells experiment. Five major clusters 

can be seen, and many non-clustered genes. The genes in the five groups serve similar functions: 

(A) cholesterol bio-synthesis, (B) the cell cycle, (C) the immediate-early response, (D) signaling and 

angiogenesis, and (E) wound healing and tissue remodeling. 

(Color scale red-to-green corresponds to higher-to-lower expression level than in the control state.) 

3.19 Testing the significance of the clusters 

A standard method for testing the significance of clusters is to randomly permute the input 

data in different ways. 

Original expression data is shown in column (1), clustered data in (2) and the results of clustering 

after random permutations of the rows in (3), columns in (4) and both in (5). 

3.20 Sequencing by Hybridization (SBH) 

Originally, the hope was that one can use DNA chips to sequence large unknown DNA fragments 

using a large array of short probes: 

1. Produce a chip C(l) spotted with all possible probes of length l (l = 8 in the first SBH 

papers),


2. Apply a solution containing many copies of a fluorescently labeled DNA target fragment 

to the array. 

3. The DNA fragments hybridize to those probes that are complementary to substrings of 

length l of the fragment 

4. Detect probes that hybridize with the DNA fragment and obtain the l-tuple composition 

of the DNA fragment 

5. Apply a combinatorial algorithm to reconstruct the sequence of the DNA target from the 

l-tuple composition 
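The l-tuple composition obtained in step 4 can be simulated for a known fragment with a one-line Python function. The name `spectrum` is chosen for this illustration; the sketch assumes an error-free experiment and ignores how often an l-tuple occurs.

```python
def spectrum(fragment, l):
    """Return the set of all substrings of length l of the fragment."""
    return {fragment[i:i + l] for i in range(len(fragment) - l + 1)}
```

Two different fragments can have the same composition, for example ATGGCGTGCA and ATGCGTGGCA for l = 3, which is why the reconstruction in step 5 is not always unique.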

3.21 The Shortest Superstring Problem 

SBH provides information of the l-tuples present in a target DNA sequence, but not their 

positions. Suppose we are given the spectrum S of all l-tuples of a target DNA sequence, how 

do we construct the sequence? 

This is a special case of the Shortest Common Superstring Problem (SCS): A superstring for a 

given set of strings s 1 , s 2 , . . .,s m is a string that contains each s i as a substring. Given a set of 

strings, finding the shortest superstring is NP-complete. 

Define overlap(s i , s j ) as the length of a maximal prefix of s j that matches a suffix of s i . The 

SCS problem can be cast as a Traveling Salesman Problem in a complete directed graph G 

with m vertices s 1 , s 2 , . . .,s m and edges (s i , s j ) of length −overlap(s i , s j ). 

3.22 The SBH graph 

SBH corresponds to the special case when all substrings have the same length l. We say that 

two SBH probes p and q overlap, if the last l − 1 letters of p coincide with the first l − 1 of q. 

Given the spectrum S of a DNA fragment, construct the directed graph H with vertex set S 

and edge set 

E = {(p, q) | p and q overlap}. 

There exists a one-to-one correspondence between paths that visit each vertex of H at least 

once and the DNA fragments with the spectrum S. 

3.23 Example of the SBH graph 

Vertices: l-tuples of the spectrum S, edges: overlapping l-tuples: 

S = { ATG AGG TGC TCC GTC GGT GCA CAG } 

[Figure: the graph H for this spectrum.] 

The path visiting all vertices corresponds to the sequence reconstruction ATGCAGGTCC. 

A path that visits all nodes of a graph exactly once is called a Hamiltonian path. Unfortunately, 

the Hamiltonian Path Problem is NP-complete, so for larger graphs we cannot hope to find 

such paths.
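For small spectra the Hamiltonian paths of H can still be enumerated by backtracking. The following Python sketch does this in exponential worst-case time; the function name `hamiltonian_reconstructions` is hypothetical and the spectrum is assumed to be a set of distinct l-tuples.

```python
def hamiltonian_reconstructions(spectrum):
    """Return all sequences spelled by Hamiltonian paths of the SBH graph H.

    Vertices are the l-tuples; there is an edge (p, q) if the last l - 1
    letters of p coincide with the first l - 1 letters of q.
    """
    probes = sorted(spectrum)
    successors = {
        p: [q for q in probes if q != p and p[1:] == q[:-1]] for p in probes
    }
    results = set()

    def extend(path, sequence):
        if len(path) == len(probes):
            results.add(sequence)
            return
        for q in successors[path[-1]]:
            if q not in path:
                # Each further probe contributes exactly one new letter.
                extend(path + [q], sequence + q[-1])

    for start in probes:
        extend([start], start)
    return results
```

For S = { ATG AGG TGC TCC GTC GGT GCA CAG } the only reconstruction found is ATGCAGGTCC.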


3.24 Second example of the SBH graph 

S = { ATG TGG TGC GTG GGC GCA GCG CGT } 

[Figure: the graph H for this spectrum.] 

This example has two different Hamiltonian paths and thus two different reconstructed sequences: 

ATGCGTGGCA 

ATGGCGTGCA 

3.25 Euler Path 

Leonhard Euler wanted to know whether there exists a path that uses all seven bridges in 

Königsberg exactly once: 

[Figure: the seven bridges of Königsberg, with Kneiphof island in the Pregel river.] 

Birth of graph theory... 

3.26 SBH and the Eulerian Path Problem 

Let S be the spectrum of a DNA fragment. We define a graph G whose set of nodes consists 

of all possible (l − 1)-tuples. 

We connect one (l − 1)-tuple v = v 1 . . .v l−1 to another w = w 1 . . .w l−1 by a directed edge (v, w), if 

the spectrum S contains an l-tuple u with prefix v and suffix w, i.e. such that 

$u = v_1 \dots v_{l-1} w_{l-1} = v_1 w_1 \dots w_{l-1}$. 

Hence, in this graph the probes correspond to edges and the problem is to find a path that 

visits all edges exactly once, i.e., an Eulerian path. 

Finding an Eulerian path is easy: it can be done in time linear in the number of edges. 

(To be precise, the Chinese Postman Problem (visit all edges at least once in a minimal tour) can be 

efficiently solved for directed or undirected graphs, but not in a graph that contains both directed and 

undirected edges.)


S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } 

[Figure: the graph G on the vertices AT, TG, GG, GC, CG, GT and CA, with one edge for each l-tuple of S.] 

Vertices represent (l − 1)-tuples, edges correspond to l-tuples of the spectrum. There are two 

different solutions: 

[Figure: the two Eulerian paths in G, which spell the sequences ATGGCGTGCA and ATGCGTGGCA.] 
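The correspondence between Eulerian paths and reconstructions can be tried out with a small Python sketch. It enumerates all Eulerian paths by backtracking, which is fine for toy examples but may take exponential time; finding a single Eulerian path would need only linear time. The function name `eulerian_reconstructions` is hypothetical, and the spectrum is assumed to be a set, so every l-tuple is used as exactly one edge.

```python
from collections import defaultdict


def eulerian_reconstructions(spectrum):
    """Return all sequences spelled by Eulerian paths of the (l-1)-tuple graph.

    Every l-tuple u of the spectrum is an edge from its prefix u[:-1]
    to its suffix u[1:].
    """
    edges = defaultdict(list)  # prefix (l-1)-tuple -> l-tuples leaving it
    for u in spectrum:
        edges[u[:-1]].append(u)
    total = len(spectrum)
    results = set()

    def extend(node, used, sequence):
        if len(used) == total:
            results.add(sequence)
            return
        for u in edges.get(node, []):
            if u not in used:
                used.add(u)
                extend(u[1:], used, sequence + u[-1])
                used.remove(u)

    for start in list(edges):
        extend(start, set(), start)
    return results
```

For S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT } the sketch returns exactly the two sequences ATGGCGTGCA and ATGCGTGGCA.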

3.27 Eulerian graphs 

A directed graph G is called Eulerian, if it contains a cycle that traverses every edge of G 

exactly once. 

A vertex v is called balanced, if the number of edges entering v equals the number of edges leaving 

v, i.e. indegree(v) = outdegree(v). We call v semi-balanced, if |indegree(v)−outdegree(v)| = 1. 

Theorem A directed graph is Eulerian, iff it is connected and each of its vertices is balanced. 

Lemma A connected directed graph contains an Eulerian path, iff it contains at most two semi-balanced 

nodes and all other nodes are balanced. 

3.28 When does a unique solution exist? 

Problem: Given a spectrum S. Does it possess a unique sequence reconstruction? 

Consider the corresponding graph G. If the graph G is Eulerian, then we can decompose it 

into simple cycles C1, . . .,C t , that is, cycles without self-intersections. Each edge of G is used 

in exactly one cycle, although nodes can be used in many cycles. Define the intersection graph 

G I on t vertices C 1 , . . .C t , where C i and C j are connected by k edges, iff they have precisely k 

original vertices in common. 

Lemma Assume G is Eulerian. Then G has only one Eulerian cycle iff the intersection graph 

G I is a tree. 

3.29 Probability of unique sequence reconstruction 

What is the probability that a randomly generated DNA fragment of length n can be uniquely reconstructed 

using a DNA array C(l)? In other words, how large must l be so that a random 

sequence of length n can be uniquely reconstructed from its l-tuples? 

We assume that the bases at each position are chosen independently, each with probability p = 1/4. 

Note that a repeat of length ≥ l will always lead to a non-unique reconstruction. We expect 

about $\binom{n}{2} p^l$ repeats of length ≥ l. Note that $\binom{n}{2} p^l = 1$ implies $l = \log_{1/p} \binom{n}{2}$. 


=⇒ For a given l one should choose n ≈ $\sqrt{2 \cdot 4^l}$, but not larger. (However, this is a very loose 

bound and a much tighter bound is known.) 
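The estimate follows from approximating the binomial coefficient by n²/2, so that n² p^l / 2 = 1 gives n = √(2 / p^l). A short Python sketch evaluates both sides for concrete values; the function names are chosen for this illustration and `math.comb` requires Python 3.8 or later.

```python
import math


def expected_repeats(n, l, p=0.25):
    """Approximate expected number of repeats of length >= l: C(n, 2) * p**l."""
    return math.comb(n, 2) * p ** l


def max_fragment_length(l, p=0.25):
    """Fragment length n at which about one repeat is expected: sqrt(2 / p**l)."""
    return math.sqrt(2.0 / p ** l)
```

For the classical array C(8) this gives a fragment length of about 362 bases, for which the expected number of repeats is just below 1.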

3.30 SBH currently infeasible 

The Eulerian path approach to SBH is currently infeasible due to two problems: 

• Errors in the data 

– False positives arise, when the target DNA hybridizes to a probe even though 

an exact match is not present 

– False negatives arise, when an exact match goes undetected 

• Repeats make the reconstruction impossible, as soon as the length of the repeated sequence 

is longer than the word length l 

Nevertheless, ideas developed here are employed in a new approach to sequence assembly 

that uses sequenced reads and an Eulerian path representation of the data (Pavel Pevzner, 

Recomb’2001). 

3.31 Masks for VLSIPS 

DNA arrays can be manufactured using VLSIPS, very large scale immobilized polymer synthesis. 

In VLSIPS, probes are grown one layer of nucleotides at a time through a photolithographic 

process. In each step, a different mask is used and only the unmasked probes are extended by 

the current nucleotide. All probes are grown to length l in 4l steps. 

[Figure: probes on the array being extended layer by layer, with masks selecting the spots that receive an A or a T.] 

Problem: Due to diffraction, internal reflection and scattering, masked spots near an edge of 

the mask can be unintentionally illuminated. 

Idea: To minimize the problem, design masks that have minimal border length! 

For example, consider the 8 × 8 array for l = 3. Both of the following two masks add a T to 1/4 

of the spots, with borders of very different length: 

[Figure: two masks for the 8 × 8 array, one with border length 58 and one with border length 16.] 


3.32 The l-bit Gray code 

In the above example, we can mask 1/4 of all spots using a mask that has a boundary of length 

4 · l. Can we arrange the spots so that this minimal value is attained for every mask? 

An l-bit Gray code is a permutation of the binary numbers 0, . . ., 2^l − 1 such that any two 

neighboring numbers differ in exactly one bit. 

The 4-bit “reflected” Gray code is: 

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 

0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 

0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 

0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 

This is generated recursively from the 1-bit code G 1 = {0, 1} using: 

$$G_l = \{g_1, g_2, \dots, g_{2^l-1}, g_{2^l}\} \longrightarrow G_{l+1} = \{0g_1, 0g_2, \dots, 0g_{2^l-1}, 0g_{2^l}, 1g_{2^l}, 1g_{2^l-1}, \dots, 1g_2, 1g_1\}.$$
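The recursive construction of the reflected Gray code translates directly into a few lines of Python: the old code is prefixed with 0, and a reversed copy of it is prefixed with 1. The function name `gray_code` is chosen for this illustration.

```python
def gray_code(l):
    """Return the l-bit reflected Gray code as a list of bit strings (l >= 1)."""
    code = ["0", "1"]
    for _ in range(l - 1):
        # Prefix the old code with 0, then its mirror image with 1.
        code = ["0" + g for g in code] + ["1" + g for g in reversed(code)]
    return code
```

Reading the columns of the 4-bit table above from left to right gives the same order as `gray_code(4)`.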

3.33 The two-dimensional Gray code 

We want to construct a two-dimensional Gray code for strings of length l over the alphabet 

{A, C, G, T }, in which every l-tuple is present and differs from each of its four neighbors in 

precisely one position. 

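Start:

\[ G_1 := \begin{bmatrix} A & T \\ C & G \end{bmatrix}. \]

The induction step G_l → G_{l+1}: Let

\[ G_l = \begin{bmatrix} g_{1,1} & \ldots & g_{1,2^l} \\ \vdots & & \vdots \\ g_{2^l,1} & \ldots & g_{2^l,2^l} \end{bmatrix} \]

and set

\[ G_{l+1} := \begin{bmatrix}
Ag_{1,1} & \ldots & Ag_{1,2^l} & Tg_{1,2^l} & \ldots & Tg_{1,1} \\
\vdots & & \vdots & \vdots & & \vdots \\
Ag_{2^l,1} & \ldots & Ag_{2^l,2^l} & Tg_{2^l,2^l} & \ldots & Tg_{2^l,1} \\
Gg_{2^l,1} & \ldots & Gg_{2^l,2^l} & Cg_{2^l,2^l} & \ldots & Cg_{2^l,1} \\
\vdots & & \vdots & \vdots & & \vdots \\
Gg_{1,1} & \ldots & Gg_{1,2^l} & Cg_{1,2^l} & \ldots & Cg_{1,1}
\end{bmatrix}. \]

As in the reflected one-dimensional code, the right-hand blocks reverse the column order and the lower blocks reverse the row order. Neighbors across a block boundary then share the same suffix and differ only in the new first letter.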
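The induction step can be checked mechanically. The following sketch builds the array with the column order of the right-hand blocks and the row order of the lower blocks reversed (reflected), and then verifies the neighbor property. The function names are illustrative assumptions, not part of the lecture.

```python
def gray_code_2d(l):
    """Return a 2^l x 2^l array of l-tuples over {A, C, G, T} (list of rows)."""
    if l < 1:
        raise ValueError("l must be at least 1")
    grid = [["A", "T"], ["C", "G"]]  # G_1
    for _ in range(l - 1):
        # Upper half: A-block, then T-block with the column order reversed.
        top = [["A" + g for g in row] + ["T" + g for g in reversed(row)]
               for row in grid]
        # Lower half: same construction with G and C, rows in reverse order.
        bottom = [["G" + g for g in row] + ["C" + g for g in reversed(row)]
                  for row in reversed(grid)]
        grid = top + bottom
    return grid


def is_2d_gray(grid):
    """Check that horizontally and vertically adjacent entries differ in one position."""
    n = len(grid)
    for i in range(n):
        for j in range(n):
            for ii, jj in ((i, j + 1), (i + 1, j)):
                if ii < n and jj < n:
                    diff = sum(x != y for x, y in zip(grid[i][j], grid[ii][jj]))
                    if diff != 1:
                        return False
    return True


g3 = gray_code_2d(3)
print(is_2d_gray(g3))
```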

3.34 Additional ideas 

SBH with universal bases: Use universal bases such as inosine that stack correctly, but don’t bind, and thus play the role of “don’t care” symbols in the probes. Arrays based on this idea can achieve the information-theoretic lower bound on the number of probes required for unambiguous reconstruction of an arbitrary DNA string of length n. (The full C(l) array has redundancies that can be eliminated using such universal bases.) (Preparata et al. 1999)


Adaptive SBH: If a sequencing by hybridization experiment is not successful, analyze the critical problems and then build a new array to overcome them. Skiena and Sundaram, and Margaritis and Skiena, give theoretical bounds for the number of rounds needed for sequence reconstruction (in the error-free case).

SBH-style shotgun sequencing: The idea is to collect sequence reads from the target DNA sequence using traditional sequencing methods and then to treat each such read of length k as a set of k − l + 1 individual l-tuples, with l = 30, say. Then, the Eulerian path method is used.

Idury and Waterman suggested this in 1995 and it leads to an efficient assembly algorithm in 

the error-free case. More recent work by Pevzner and others has led to promising software. 
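As a small illustration of the first step, a read of length k can be decomposed into its k − l + 1 overlapping l-tuples as follows. This is a sketch only; `l_tuples` and `spectrum` are names chosen here, and the Eulerian path step itself is not shown.

```python
def l_tuples(read, l):
    """Return the len(read) - l + 1 overlapping l-tuples of a read, in order."""
    return [read[i:i + l] for i in range(len(read) - l + 1)]


def spectrum(reads, l):
    """Return the set of all l-tuples that occur in any of the reads."""
    return {t for read in reads for t in l_tuples(read, l)}


print(l_tuples("ACGTACGTAC", 4))
```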

Fidelity probes for DNA arrays: As a quality control measure when manufacturing a DNA chip, one can produce fidelity probes that have the same sequence as probes on the chip, but are produced in a different order of steps. A known target is hybridized to these probes and the result reflects the quality of the array. Hubbell and Pevzner (1999) describe a combinatorial method for designing a small set of fidelity probes that can detect variations and can also indicate which manufacturing steps caused the errors.

4 Sequence Assembly 

This exposition is based on the following sources, which are all recommended reading: 

1. Michael S. Waterman, Introduction to computational biology, Chapman and Hall, 1995. 

(Chapter 7) 

2. Gene Myers, Whole-Genome DNA Sequencing, Computing in Science and Engineering, 

33-43, May–June, 1999. 

3. Eugene W. (Gene) Myers et al., A Whole-Genome Assembly of Drosophila, Science, 

287:2196-2204, 24 March 2000. 

4. W. James Kent and David Haussler, Assembly of a Working Draft of the Human Genome 

with GigAssembler, Genome Research, 11:1541-1548, 2001.

5. Daniel Huson, Knut Reinert and Eugene Myers, The Greedy-Path Merging Algorithm for 

Sequence Assembly, RECOMB 2001, 157-163, 2001. 

4.1 Genome Sequencing 

Using a method that was basically invented in 1980 by Sanger, current sequencing technology 

can only determine 500 − 1000 consecutive base pairs of DNA in any one read. To sequence a 

larger piece of DNA, shotgun sequencing is used. 

Originally, shotgun sequencing was applied to small viral genomes and to 30 − 40kb segments 

of larger genomes. 

In 1994, the 1.8Mb genome of the bacterium H. influenzae was assembled from shotgun data.

At the beginning of 2000, an assembly of the 130Mb Drosophila genome was published.

At the beginning of 2001, two initial assemblies of the human genome were published.


4.2 Shotgun Sequencing 

The source sequence ...

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA

... is copied many times and randomly broken into fragments, e.g. using sonication or nebulization, ...

[Figure: randomly broken fragments of the copies of the source sequence.]

... that are then size selected, e.g. to 2kb, 10kb, 50kb or 150kb, ...

[Figure: size-selected fragments.]

... and inserted into cloning vectors.

[Figure: a cloning vector carrying an inserted fragment.]

In double barrel shotgun sequencing, each clone is sequenced from both ends, to obtain a mate-pair of reads, each read of average length 550.

4.3 Shotgun sequencing data 

Given an unknown DNA sequence a = a_1 a_2 ... a_L.

Shotgun sequencing of a produces a set of reads F = {f_1, f_2, ..., f_R} of average length 550 (at present).

Essential characteristics of the data:

• Incomplete coverage of the source sequence.

• Sequencing introduces errors at a rate of about 1% for the first 500 bases, if carefully performed.

• The reads are sampled from both strands of the source sequence and thus the orientation of any given read is unknown.
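To get a feeling for such data, one can simulate it. The sketch below samples error-free reads of a fixed length from random positions and random strands of a source string. Sequencing errors and variable read lengths are deliberately left out, and all names and parameter values are assumptions made for illustration.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sample_reads(source, num_reads, read_length, rng):
    """Sample error-free reads from random positions and random strands."""
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(source) - read_length + 1)
        read = source[start:start + read_length]
        if rng.random() < 0.5:
            # The read comes from the opposite strand.
            read = reverse_complement(read)
        reads.append(read)
    return reads


source = "ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA"
reads = sample_reads(source, num_reads=10, read_length=12, rng=random.Random(0))
```

With few reads, some positions of the source are typically not covered by any read, which illustrates the first point in the list above.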


4.4 The fragment assembly problem 

The input is a collection of reads (or fragments) F = {f 1 , f 2 , . . .,f R }, that are sequences over 

the alphabet Σ = {A, C, G, T }. 

An ε-layout of F is a string S over Σ and a collection of R pairs of integers (s_j, e_j), j ∈ {1, 2, ..., R}, such that

• if s_j < e_j then f_j can be aligned to the substring S[s_j, e_j] with less than ε · |f_j| differences,

• if s_j > e_j then the reverse complement of f_j can be aligned to the substring S[e_j, s_j] with less than ε · |f_j| differences, and

• ∪_{j=1}^{R} [min(s_j, e_j), max(s_j, e_j)] = [1, |S|].

The string S is the reconstructed source string. The integer pairs indicate where the reads are placed, and the order of s_i and e_i indicates the orientation of the read f_i, i.e. whether f_i was sampled from S or from its reverse complement.

The set of all ε-layouts models the set of all possible solutions. There are many such solutions and so we want a solution that is in some sense best. Traditionally, this has been phrased as the Shortest Common Superstring Problem (SCS) of the reads within error rate ε. Unfortunately, the SCS Problem often produces overcompressed results.

Consider the following source sequence that contains two instances R, R′ of a high fidelity repeat and three stretches of unique sequence A, B and C:

[Figure: source A R B R′ C with R = R_l R_c R_r and R′ = R′_l R′_c R′_r, together with the reads sampled from it.]

The shortest answer isn’t always the best: the interior part R_c ≈ R′_c of the repeat region is overcompressed:

[Figure: reconstruction A R_l R_c R_r B R′_l R′_r C, in which the reads from R_c and R′_c are placed in a single copy.]

4.5 Sequence assembly in three stages 

Traditional approaches to sequence assembly divide the problem into three phases:

1. In the overlap phase, every read is compared with every other read, and the overlap graph 

is computed. 

2. In the layout phase, the pairs (s j , e j ) are determined that position every read in the 

assembly. 

3. In the consensus phase, a multialignment of all the placed reads is produced to obtain 

the final sequence.


4.6 The overlap phase 

For a read f_i, we must calculate how it overlaps any other read f_j (or its reverse complement). Holding f_i fixed in orientation, f_i and f_j can overlap in the following ways:

[Figure: the possible relative placements of two overlapping reads f_i and f_j.]

The number of possible relationships doubles when we also consider the reverse complement of f_j.

The overlap phase is the computational bottleneck in large assembly projects. For example, assembling all 27 million human reads produced at Celera requires

\[ \binom{2 \cdot 27000000}{2} \approx 1458000000000000 \]

comparisons.
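The order of magnitude is easy to check. The sketch below assumes that the factor 2 accounts for considering every read in both orientations, which is an interpretation of the formula and not stated explicitly in the text.

```python
import math

num_reads = 27_000_000
# Assumption: each read is considered in both orientations, giving 2 * num_reads sequences.
num_sequences = 2 * num_reads
comparisons = math.comb(num_sequences, 2)
print(f"{comparisons:.3e}")
```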

For any two reads a and b (and either orientation of the latter), one searches for the overlap 

alignment with the highest alignment score, based on a similarity score s(a, b) on Σ and an 

indel penalty g(k) = kδ. 

Let S(a, b) be the maximum score over all alignments of two reads a = a_1 a_2 ... a_m and b = b_1 b_2 ... b_n. We want to compute:

\[ A(a,b) = \max \Bigl\{ S(a_k a_{k+1} \ldots a_i,\; b_l b_{l+1} \ldots b_j) \Bigm| 1 \le k \le i \le m,\; 1 \le l \le j \le n,\; \text{and } i = m \text{ or } j = n \text{ holds} \Bigr\}. \]

4.7 Overlap alignment 

This is a standard pairwise alignment problem (similar to local alignment, except we don’t have 

a 0 in the recursion) and we can use dynamic programming to compute: 

\[ A(i,j) = \max \{ S(a_k a_{k+1} \ldots a_i,\; b_l b_{l+1} \ldots b_j) \mid 1 \le k \le i \text{ and } 1 \le l \le j \}. \]

Algorithm (Overlap alignment)

Input: a = a_1 a_2 ... a_n and b = b_1 b_2 ... b_m, s(·, ·) and δ

Output: A(i, j)

begin
    A(0, j) = A(i, 0) ← 0 for i = 0, ..., n, j = 0, ..., m
    for i = 1, ..., n:
        for j = 1, ..., m:
            A(i, j) ← max { A(i − 1, j) − δ,  A(i, j − 1) − δ,  A(i − 1, j − 1) + s(a_i, b_j) }
end


Runtime is O(nm). 

Given two reads a = a 1 a 2 . . .a m and b = b 1 b 2 . . .b n . For the matrix A(i, j) computed as above, 

set 

(p, q) := arg max{A(i, j) | i = m or j = n}. 

There are two cases: p = m or q = n.

[Figure: the trace-back paths in the matrix A(i, j) for the two cases, ending at (m, q) or at (p, n).]

[Figure: the corresponding overlap alignments of a and b.]
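The recursion and the final arg max can be sketched in Python as follows. The scoring values (match +1, mismatch −1, δ = 2) are arbitrary assumptions for illustration, and the trace-back is omitted; the function only returns the best score and its end point (p, q).

```python
def overlap_alignment(a, b, match=1, mismatch=-1, delta=2):
    """Return (score, p, q) of the best overlap alignment ending at i = m or j = n."""
    m, n = len(a), len(b)
    # A[i][0] = A[0][j] = 0: an unaligned prefix of either read costs nothing.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] - delta,
                          A[i][j - 1] - delta,
                          A[i - 1][j - 1] + s)
    # The overlap must end in the last row (i = m) or the last column (j = n).
    candidates = [(A[m][j], m, j) for j in range(n + 1)]
    candidates += [(A[i][n], i, n) for i in range(m + 1)]
    return max(candidates)


score, p, q = overlap_alignment("TTTTACGTACGT", "ACGTACGTGGGG")
print(score, p, q)
```

In this example the last 8 letters of the first string match the first 8 letters of the second, so the best overlap ends in the last row, at the end of the first string.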

4.8 Faster overlap detection 

Dynamic programming is too slow for large sequencing projects. Indeed, it is wasteful, as in 

assembly, only high scoring overlaps with more than 96% identity, say, play a role. 

One can use a seed and extend approach (as used in BLAST): 

1. Produce the concatenation of all input reads H = f_1 f_2 ... f_R.

2. For each read f i ∈ F: Find all seeds, i.e. exact matches between k-mers of f i and the 

concatenated sequence H. (Merge neighboring seeds.) 

3. Compute extensions: Attempt to extend each (merged) seed to a high scoring overlap 

alignment between f i and the corresponding read f j . 

(A k-mer is a string of length k. In this context, k = 18... 22) 
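The seed computation in step 2 can be sketched with a hash index of k-mers. Instead of scanning the concatenation H directly, the sketch indexes each read separately and remembers the read and position of every k-mer. This avoids seeds that span a read boundary. The function names and the tiny k used in the example are assumptions for illustration only.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to the list of (read index, position) pairs where it occurs."""
    index = defaultdict(list)
    for r, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((r, pos))
    return index


def find_seeds(reads, i, index, k):
    """Return seeds (pos_i, j, pos_j): read i at pos_i shares a k-mer with read j at pos_j."""
    seeds = []
    read = reads[i]
    for pos in range(len(read) - k + 1):
        for j, pos_j in index.get(read[pos:pos + k], ()):
            if j != i:
                seeds.append((pos, j, pos_j))
    return seeds


reads = ["AAACGTTG", "CGTTGCCC"]
index = build_kmer_index(reads, k=4)
print(find_seeds(reads, 0, index, k=4))
```

Neighboring seeds on the same diagonal (equal difference pos_i − pos_j) are the ones that would be merged before extension.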

Computation of seeds:

[Figure: k-mers of a read f_i matched against the concatenation H of all reads.]

Extension of seeds using banded dynamic programming (running time is linear in the read length):

[Figure: a seed between f_i and f_j, extended in both directions by a banded alignment.]


4.9 True and repeat-induced overlaps 

Assume that we have found a high quality overlap o between f_i and f_j. Then there are three possible cases:

• The overlap o corresponds to an overlap of f_i and f_j in the source sequence. In this case we call o a true overlap.

• The reads f_i and f_j come from different parts of the source sequence and their overlapping portions are contained in different instances of the same repeat. This is called a repeat-induced overlap.

• The overlap exists by chance. To avoid short random overlaps, one requires that an 

overlap is at least 40bp long, say. 

[Figure: source sequence with two repeat instances R1 and R2 and the reads f_i, f_j, f_k and f_l.]

True overlap between f_i and f_j, repeat-induced overlap between f_k and f_l.

4.10 Avoiding repeat-induced overlaps 

To avoid the computation of repeat-induced overlaps, one strategy is to only consider seeds in 

the seed-and-extend computation whose k-mers are not contained inside a repeat. In this way 

we can ensure that any computed overlap has a significant unique part. 

There are two strategies for this: 

• Screening known repeats: Each read is aligned against a database of known repeats, e.g. using RepeatMasker. Portions of reads that match a known repeat are labeled repetitive.

• De novo screening: For each k-mer contained in H, the concatenation of reads, we determine 

how many times it occurs in H and then label those k-mers as repetitive, whose 

number of occurrences is unexpectedly high. 
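De novo screening amounts to counting k-mers. A minimal sketch follows, in which the threshold `max_count` is a free parameter chosen by the user; the text does not say how "unexpectedly high" is determined. The k-mers are counted read by read rather than in the concatenation H, so that no k-mer spans a read boundary.

```python
from collections import Counter


def repetitive_kmers(reads, k, max_count):
    """Return the set of k-mers that occur more than max_count times in the reads."""
    counts = Counter(read[p:p + k]
                     for read in reads
                     for p in range(len(read) - k + 1))
    return {kmer for kmer, c in counts.items() if c > max_count}


reads = ["ACGTACGT", "TTACGTAA", "GGGACGTC"]
print(repetitive_kmers(reads, k=4, max_count=2))
```

Seeds whose k-mer is in the returned set would then be ignored in the seed-and-extend computation.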

4.11 Celera’s overlapper 

The assembler developed at Celera Genomics employs an overlapper that compares up to 32

million pairs of reads per second. 

Overlapping all pairs of 27 million reads of human DNA using this program takes about 10 

days, running on about 10-20 four processor machines (Compaq ES40), each with 4GB of main 

memory. 

The input data file is about 50GB. To parallelize the overlap computation, each job grabs as many

reads as will fit into 4GB of memory (minus the memory necessary for doing the computation) 

and then streams all 27 million reads against the ones in memory.


4.12 The overlap graph 

The overlap phase produces an overlap graph OG, defined as follows: Each read f p ∈ F is 

represented by a directed edge (s p , e p ) from node s p to e p , representing the start and end of f p , 

respectively. The length of such a read edge is simply the length of the corresponding read. 

An overlap between f_p = f_p1 f_p2 ... f_pm and f_q = f_q1 f_q2 ... f_qn gives rise to an undirected overlap edge e between s_p, or e_p, and s_q, or e_q, depending on the orientation of the overlap, e.g.:

[Figure: read edges (s_p, e_p) and (s_q, e_q), where positions i, ..., m of f_p overlap positions 1, ..., j of f_q.]

The label (or “length”) of the overlap edge e is defined to be −1 times the overlap length, e.g.

\[ -\left( \frac{m - i + j - 1}{2} + 1 \right) \]

in the figure.

4.13 Example 

Assume we are given 6 reads F = {f 1 , f 2 , . . ., f 6 }, each of length 500, together with the following 

overlaps: 

[Figure: pairwise overlaps f1/f2 (320), f1/f4 (40), f4/f2 (95), f4/f3 (80), f5/f1 (50) and f6/f1 (330), and two further overlaps of lengths 60 and 250.]

Here, for example, the last 320 bases of read f_1 align to the first 320 bases of the reverse complement of f_2, whereas f_1 and f_5 overlap in the first 50 bases of each.

We obtain the following overlap graph OG:

[Figure: overlap graph OG with read edges f1, ..., f6 and overlap edges labeled −320, −40, −95, −80, −50, −330, −60 and −250.]

Each read f p is represented by a read edge (s p , e p ) of length |f p |. Overlaps off the start s p , or 

end e p , of f p are represented by overlap edges starting at the node s p , or e p , respectively. Each 

overlap edge is labeled by −1 times the overlap length. 

4.14 The layout phase 

The goal of the layout phase is to arrange all reads into an approximate multi-alignment. This 

involves assigning coordinates to all nodes of the overlap graph OG, and thus, determining the 

value of s i and e i for each read f i . 

A simple heuristic is to select a spanning forest of the overlap graph OG that contains all 

read edges. (A spanning forest is a set F of edges such that any two nodes in the same connected 

component of OG are connected by a unique simple, unoriented path of edges in F.)


[Figure: a spanning forest of the overlap graph OG that contains all read edges.]

Such a subset of edges positions every read with respect to every other, within a given connected component of the graph:

[Figure: layout of the reads f1, ..., f6 along a coordinate axis with marks at 1, 280, 450, 500, 730, 950, 1410 and 1830.]

Such a putative alignment of reads is called a contig.

The spanning forest is usually constructed using a greedy heuristic in which the overlap edges are chosen in order of decreasing overlap length (i.e., increasing edge “length”).

[Figure: the overlap graph OG with the greedily selected overlap edges.]
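The greedy heuristic can be sketched as a Kruskal-style procedure with a union-find structure. To keep the sketch short it makes a strong simplifying assumption that the text does not make: all reads have the same orientation, and an overlap (p, q, length) always means that the last `length` bases of p overlap the first `length` bases of q. The function name and the example data are hypothetical.

```python
def greedy_layout(read_lengths, overlaps):
    """Assign a start coordinate to every read using a greedy spanning forest.

    read_lengths: dict read -> length
    overlaps: list of (p, q, length); suffix of p overlaps prefix of q
    Returns dict read -> start coordinate (leftmost read of each contig at 1).
    """
    parent = {r: r for r in read_lengths}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    tree = {r: [] for r in read_lengths}
    # Consider overlap edges in order of decreasing overlap length.
    for p, q, length in sorted(overlaps, key=lambda o: -o[2]):
        root_p, root_q = find(p), find(q)
        if root_p != root_q:  # the edge joins two components: keep it
            parent[root_p] = root_q
            shift = read_lengths[p] - length  # start(q) - start(p)
            tree[p].append((q, shift))
            tree[q].append((p, -shift))

    start = {}
    for root in read_lengths:
        if root in start:
            continue
        start[root] = 0
        component, stack = [root], [root]
        while stack:  # propagate coordinates along the tree edges
            u = stack.pop()
            for v, shift in tree[u]:
                if v not in start:
                    start[v] = start[u] + shift
                    component.append(v)
                    stack.append(v)
        leftmost = min(start[r] for r in component)
        for r in component:
            start[r] += 1 - leftmost
    return start


lengths = {"a": 500, "b": 500, "c": 500, "d": 500}
layout = greedy_layout(lengths, [("a", "b", 100), ("b", "c", 200), ("a", "c", 50)])
print(layout)
```

In the example the shortest overlap (a, c, 50) is not used because a and c are already connected. It is also inconsistent with the coordinates implied by the two longer overlaps, which is the kind of conflict that repeats cause in real data.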

4.15 Repeats and the layout phase 

Consider the following situation: 

[Figure: source with a two-copy repeat R, R′ and the reads f1, ..., f7 sampled from it.]

This gives rise to the following overlap graph:

[Figure: overlap graph on the reads f1, ..., f7.]

Consider this spanning tree:

[Figure: the overlap graph with a spanning tree; two overlap edges are labeled e and f.]

A layout produced using the edge e or f does not reflect the true ordering of the reads and the obtained contig is called misassembled:

[Figure: misassembled layout of the reads.]

However, avoiding the repeat-induced edges e and f, one obtains a correct layout:

[Figure: correct layout of the reads f1, ..., f7.]

Note that neither of the two layouts is “consistent” with all overlap edges in the graph.

4.16 Unitigging 

The main difficulty in the layout phase is that we can’t distinguish between true overlaps and 

repeat-induced overlaps. The latter produce “inconsistent” layouts in which the coordinate 

assignment induces overlaps that are not reflected in the overlap graph (e.g., reads f 4 and f 7 

in the example above). 

Thus, the layout phase proceeds in two stages: 

1. Unitigging: First, all uniquely assemblable contigs are produced, as just described. These 

are called unitigs. 

2. Repeat resolution: Then, at a later stage, one attempts to reconstruct the repetitive 

sequence that lies between such unitigs. 

Reads are sampled from a source sequence that contains repeats: 

[Figure: source sequence with repeats, and the reads sampled from it.]

Reads that form consistent chains in the overlap graph are assembled into unitigs and the remaining “repetitive” reads are processed later:

[Figure: layouts of the unitigs, and the remaining reads in repeats.]

4.17 Unique unitigs 

As defined above, a “unitig” is obtained as a chain of consistently overlapping reads. However, 

a unitig does not necessarily represent a segment of unique source sequence. For example, its 

reads may come from the interior of different instances of a long (many copy) repeat:


[Figure: source with repeat instances R, R′ and R″, and reads forming a unique unitig and a non-unique unitig.]

Non-unique unitigs can be detected by virtue of the fact that they contain significantly more 

reads than expected. 

4.18 Identifying unique unitigs 

Let R be the number of reads and G the estimated length of the source sequence. For a unitig with k reads and approximate length ρ, the probability of seeing the k − 1 start positions in the interval of length ρ is

\[ \frac{e^{-c} c^k}{k!}, \qquad \text{with } c := \frac{\rho R}{G}, \]

if the unitig is not oversampled, and

\[ \frac{e^{-2c} (2c)^k}{k!}, \]

if the unitig is the result of collapsing two repeats.

(see Mike Waterman’s book, page 148, for details)

The arrival statistic used to identify unique unitigs is the (natural) log of the ratio of these two probabilities,

\[ \log \frac{e^{-c} c^k / k!}{e^{-2c} (2c)^k / k!} = c - (\log 2)\, k. \]

A unitig is called unique if its arrival statistic has a positive value of 10 or more, say.
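The statistic is cheap to compute. In the following sketch the function names and the example numbers (genome length, number of reads, unitig length) are hypothetical; only the formula c − (log 2)k and the threshold 10 are taken from the text.

```python
import math


def arrival_statistic(k, rho, R, G):
    """Log-odds c - (log 2) * k that a unitig with k reads and length rho is unique."""
    c = rho * R / G
    return c - math.log(2) * k


def is_unique(k, rho, R, G, threshold=10.0):
    return arrival_statistic(k, rho, R, G) >= threshold


# Hypothetical numbers: 1 million reads from a 100 Mb source, unitig of length 10 kb.
print(arrival_statistic(100, 10_000, 1_000_000, 100_000_000))
print(arrival_statistic(200, 10_000, 1_000_000, 100_000_000))
```

With these numbers the expected number of reads is c = 100. A unitig with 100 reads scores about 30.7 and would be called unique, whereas one with 200 reads scores about −38.6 and would not.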

4.19 Mate pairs 

Fragment assembly of reads produces contigs, whose relative placement and orientation with 

respect to each other is unknown. 

Recall that modern shotgun sequencing protocols employ a so-called double barreled shotgun. That is, longer clones of a given fixed length are sequenced from both ends and one obtains a pair of reads, a mate pair, whose relative orientation is known and whose distance has known mean µ and standard deviation σ:

[Figure: a mate pair of reads from the two ends of a clone of length (µ, σ).]

Typical clone lengths are µ = 2kb, 10kb, 50kb or 150kb. For clean data, σ ≈ 10% of µ. Mate pair mismatching is a problem and can affect 10–30% of all pairs.


4.20 Scaffolding 

Consider two reconstructed contigs. If they correspond to neighboring regions in the source sequence, then we can expect to see mate pairs that span the gap between them:

[Figure: contigs c1 and c2 with mate pairs spanning the gap between them.]

Such mate pairs determine the relative orientation of both contigs, and we can compute a mean and standard deviation for the gap between them. In this case, the contigs are said to be scaffolded.

4.21 Determining the distance between two contigs 

Given two contigs c_1 and c_2 connected by mate pairs m_1, m_2, ..., m_k. Each mate pair gives an estimate of the distance between the two contigs.

These estimates can be viewed as independent measurements (l_1, σ_1), (l_2, σ_2), ..., (l_k, σ_k) of the distance D between the two contigs c_1 and c_2. Following standard statistical practice, they can be combined as follows: Define

\[ p := \sum_i \frac{l_i}{\sigma_i^2} \quad\text{and}\quad q := \sum_i \frac{1}{\sigma_i^2}. \]

We set the distance between c_1 and c_2 to

\[ D := \frac{p}{q}, \quad\text{with standard deviation}\quad \sigma := \frac{1}{\sqrt{q}}. \]

Here is an example:

[Figure: two contigs joined by two 2k mate pairs and two 10k mate pairs, with estimates (l_1, σ_1), ..., (l_4, σ_4) combined into (D, σ).]

It is possible that the mate pairs between two contigs c_1 and c_2 lead to significantly different estimates of the distance between c_1 and c_2. In practice, only mate pairs that confirm each other, i.e. whose estimates are within 3σ of each other, say, are considered together in a gap estimation.
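The combination rule is an inverse-variance weighted mean. A minimal sketch, with a function name and example values that are assumptions for illustration:

```python
import math


def combine_gap_estimates(estimates):
    """Combine independent estimates (l_i, sigma_i) into (D, sigma)."""
    p = sum(l / s ** 2 for l, s in estimates)
    q = sum(1 / s ** 2 for _, s in estimates)
    return p / q, 1 / math.sqrt(q)


# Two hypothetical estimates of the same gap, both with standard deviation 200.
D, sigma = combine_gap_estimates([(1000, 200), (1200, 200)])
print(D, sigma)
```

Two equally precise estimates are simply averaged, and the standard deviation shrinks by a factor of √2. Estimates with a small σ_i dominate the result.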

4.22 The significance of mate pairs 

Given two contigs c 1 and c 2 . If there is only one mate pair between the two contigs, then due 

to the high error rates associated with mate pairs, this is not significant. 

If, however, c_1 and c_2 are unique unitigs, and if there exist two different mate pairs between the two that give rise to the same relative orientation and similar estimates of the gap size between c_1 and c_2, then the scaffolding of c_1 and c_2 is highly reliable.

This is because the probability that two false mate pairs occur that confirm each other is extremely small.

4.23 Example 

Let the sequence length be G = 120Mb, for example (Drosophila). For simplicity, assume we have 5-fold coverage of mate pairs, with a mean length of µ = 10kb and standard deviation of σ = 1kb.

Consider a false mate pair m_1 = (f_1, f_2) with reads f_1 and f_2. Let N_1 and N_2 denote the two intervals (in the source sequence) of radius 3σ centered at the starts of f_1 and f_2, respectively. Both have length 6kb.

Consider a second false mate pair m_2 = (g_1, g_2) with g_1 inside N_1. The probability that g_2 lies in N_2 is roughly

\[ \frac{6\,\mathrm{kb}}{120\,\mathrm{Mb}} = \frac{1}{20000}. \]

[Figure: source sequence with the intervals N_1 and N_2, the false mate pair m_1 = (f_1, f_2), and a second mate pair m_2 whose read g_1 lies in N_1.]

Assume that the reads have length 600. Assume that 10% of all mate pairs are false. At 5-fold coverage, the interval N_1 is covered by about 5 · 6000/600 = 50 reads, of which ≈ 5 have false mates.

Hence, the probability that m_1 is confirmed by some second false mate pair m_2 is

\[ \approx 5 \cdot \frac{1}{20000} = \frac{1}{4000} = 0.00025. \]
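The arithmetic of this example can be reproduced exactly with rational numbers. The variable names below are chosen for illustration; the values are the ones assumed in the example.

```python
from fractions import Fraction

G = 120_000_000          # length of the source sequence (120 Mb)
sigma = 1_000            # standard deviation of the clone length (1 kb)
read_length = 600
coverage = 5
false_rate = Fraction(1, 10)

interval = 6 * sigma                                        # length of N_1 and N_2
p_hit = Fraction(interval, G)                               # g_2 falls into N_2
reads_in_interval = Fraction(coverage * interval, read_length)
false_in_interval = reads_in_interval * false_rate
p_confirm = false_in_interval * p_hit                       # some false mate confirms m_1

print(p_hit, reads_in_interval, false_in_interval, p_confirm)
```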

4.24 The overlap-mate graph 

Given a set of reads F = {f_1, f_2, ..., f_R}, let G denote the overlap graph associated with F.

Given one set (or more) M_{µ,σ} = {m_1, ..., m_k} of mate pairs m = (f_i, f_j), with mean µ and standard deviation σ.

Let f_i and f_j be two mated reads represented by the edges (s_i, e_i) and (s_j, e_j) in G. We add an undirected mate edge between e_i and e_j, labeled (µ, σ), to indicate that f_i and f_j are mates and thus obtain the overlap-mate graph:

[Figure: the overlap graph of the example above, extended by the reads f7 and f8 and by mate edges labeled (2000, 200) and (10000, 1000).]


4.25 The contig-mate graph 

Given a set F of fragments and a set of assembled contigs C = {c_1, c_2, ..., c_t}. A more useful graph is obtained as follows:

Represent each assembled contig c_i by a contig edge with nodes s_i and e_i. Then, add mate edges between such nodes to indicate that the corresponding contigs contain fragments that are mates:

[Figure: two contigs whose reads are joined by two 2k mate pairs and two 10k mate pairs, with estimates (l_1, σ_1), ..., (l_4, σ_4).]

Leads to:

[Figure: contig edges c1 and c2 joined by four mate edges labeled (l_1, σ_1), ..., (l_4, σ_4).]

4.26 Edge bundling 

Consider two contigs c 1 and c 2 , joined by mate pair edges m 1 , . . .,m k between node e 1 and s 2 , 

say. Every maximal subset of mutually confirming mate edges is replaced by a single bundled 

mate edge e, whose mean length µ and standard deviation σ are computed as discussed above. 

Any such bundled edge is labeled (µ, σ). 

(A heuristic used to compute these subsets is to repeatedly bundle the median-length simple 

mate edge with all mate edges within three standard deviations of it, until all simple mate 

edges have been bundled.) 

Additionally, we set the weight w(e) of any mate edge e to 1, if it is a simple mate edge, and to

\[ \sum_{i=1}^{k} w(e_i), \]

if it was obtained by bundling edges e_1, ..., e_k.

Consider the following graph:

[Figure: contig-mate graph with several simple mate edges between pairs of contigs.]

Assuming that mate edges drawn together have similar lengths and large enough standard deviation, edge bundling will produce the following graph:

[Figure: the bundled graph, with bundled mate edges of weights w = 2, w = 2, w = 3 and w = 4.]
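The median-based bundling heuristic can be sketched as follows for the simple mate edges between one pair of nodes. Two points are assumptions of this sketch rather than statements of the text: "within three standard deviations" is measured with the standard deviation of the median edge, and the mean and standard deviation of a bundle are computed with the inverse-variance rule used for gap estimates.

```python
import math


def bundle_mate_edges(edges):
    """Bundle simple mate edges (mu, sigma) joining the same two nodes.

    Returns a list of bundled edges (mu, sigma, weight).
    """
    remaining = sorted(edges)
    bundles = []
    while remaining:
        # Pick the median-length edge among those not yet bundled.
        med_mu, med_sigma = remaining[len(remaining) // 2]
        group = [e for e in remaining if abs(e[0] - med_mu) <= 3 * med_sigma]
        remaining = [e for e in remaining if abs(e[0] - med_mu) > 3 * med_sigma]
        p = sum(mu / s ** 2 for mu, s in group)
        q = sum(1 / s ** 2 for _, s in group)
        # Each simple mate edge has weight 1, so the bundle weight is the group size.
        bundles.append((p / q, 1 / math.sqrt(q), len(group)))
    return bundles


bundles = bundle_mate_edges([(2000, 200), (2100, 200), (1900, 200), (10000, 1000)])
print(bundles)
```

The median edge always belongs to its own group, so every round removes at least one edge and the loop terminates.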


4.27 Transitive edge reduction 

Consider the previous graph with some specific edge lengths:

[Figure: contigs c1, c2 (l = 2000) and c3 with mate edges e (µ = 4200), f (µ = 40), g (µ = 1000) and h (µ = 1000).]

The mate edge e gives rise to an estimate of the distance from the right node of contig c_1 to the left node of c_3 that is similar to the one obtained by following the path P = (g, c_2, h). Based on this transitivity property we can reduce the edge e onto the path P to obtain:

[Figure: the reduced graph with mate-edge weights w = 2, w = 3 + 2 and w = 4 + 2.]

Consider two nodes v and w that are connected by an alternating path P = (m_1, c_1, m_2, ..., m_k) of mate-edges (m_1, m_2, ...) and contig edges (c_1, c_2, ...) from v to w, beginning and ending with a mate-edge. We obtain a mean length and standard deviation for P by setting

\[ l(P) := \sum_{m_i} \mu(m_i) + \sum_{c_i} l(c_i) \quad\text{and}\quad \sigma(P) := \sqrt{\sum_{m_i} \sigma(m_i)^2}. \]

We say that a mate-edge e from v to w can be transitively reduced onto the path P, if e and P approximately have the same length, i.e. if |µ(e) − l(P)| ≤ C · max{σ(e), σ(P)} for some constant C, typically 3. If this is the case, then we can reduce e by removing e from the graph and incrementing the weight of every mate-edge m_i in P by w(e).
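The reduction test is a direct translation of the two formulas. In the sketch below the means 4200, 1000 and 1000 and the contig length 2000 follow the example at the start of this section, whereas the standard deviations are hypothetical values added for illustration; the function names are assumptions as well.

```python
import math


def path_length(mate_edges, contig_lengths):
    """Return (l(P), sigma(P)) for a path with mate edges (mu, sigma) and contig lengths."""
    length = sum(mu for mu, _ in mate_edges) + sum(contig_lengths)
    sigma = math.sqrt(sum(s ** 2 for _, s in mate_edges))
    return length, sigma


def can_reduce(e, mate_edges, contig_lengths, C=3):
    """True if the mate edge e = (mu, sigma) can be reduced onto the path."""
    mu_e, sigma_e = e
    length, sigma_p = path_length(mate_edges, contig_lengths)
    return abs(mu_e - length) <= C * max(sigma_e, sigma_p)


# Path P = (g, c2, h) with mu(g) = mu(h) = 1000 and l(c2) = 2000; sigmas are hypothetical.
path_mates = [(1000, 100), (1000, 100)]
print(path_length(path_mates, [2000]))
print(can_reduce((4200, 400), path_mates, [2000]))
```

Here l(P) = 4000 and the edge e with µ(e) = 4200 lies within 3 · 400 of it, so e can be reduced onto P.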

In the following, we will assume that any contig-mate graph considered has been edge-bundled and 

perhaps also transitively reduced to some degree. 

4.28 Happy mate pairs 

Consider a mate pair m of two reads f i and f j , obtained from a clone of mean length µ and 

standard deviation σ: 

[Figure: a mate pair of reads f_i and f_j from a clone of length (µ, σ).]

Assume that f i and f j are contained in the same contig or scaffold of an assembly. We call m 

happy, if f i and f j have the correct relative orientation (i.e., are facing each other) and are at 

approximately the right distance, i.e., |µ − |s i − s j || ≤ 3σ, say. Otherwise, m is unhappy. Two 

unhappy mates are highlighted here: 

[Figure: contigs c1 and c2 with their mate pairs; two unhappy mate pairs are highlighted.]


4.29 Ordering and orientation of the contig-mate graph 

Given a collection of contigs C = {c 1 , c 2 , . . .,c k } constructed from a set of reads F = 

{f 1 , f 2 , . . .,f R }, together with the corresponding mate pair information M. Let G = (V, E) 

denote the associated contig-mate graph. 

An ordering (and orientation) of G (or C) is a map φ : V → N such that |φ(s_i) − φ(e_i)| = l(c_i) for all contigs c_i ∈ C, in other words, an assignment of coordinates to all nodes that preserves contig lengths.

Additionally, we require {φ(s_i), φ(e_i)} ≠ {φ(s_j), φ(e_j)} for any two distinct contigs c_i and c_j.

4.30 Example 

Given the following contig-mate graph: 

[Figure: contig-mate graph on the contigs c1, ..., c5, labeled with contig lengths and mate-edge lengths.]

An ordering φ assigns coordinates φ(v) to all nodes v and thus determines a layout of the contigs:

[Figure: layout of the contigs c2, c4, c1, c3 and c5 on a coordinate axis, with node coordinates φ(s_i) and φ(e_i).]

4.31 Happiness of mate edges 

Let G = (V, E) be a contig-mate graph and φ an ordering of G. 

Consider a mate-edge e with nodes v and w. Let c i denote the contig edge incident to v and 

let c j denote the contig edge incident to w. Let v ′ and w ′ denote the other two nodes of c i 

and c j , respectively. We call e happy (with respect to φ), if c i and c j have the correct relative 

orientation, and if the distance between v and w is approximately correct, in other words, we 

require that either 

1. φ(v ′ ) ≤ φ(v) & |φ(w) − φ(v) − µ(e)| ≤ 3σ(e) & φ(w) ≤ φ(w ′ ), or 

2. φ(w ′ ) ≤ φ(w) & |φ(v) − φ(w) − µ(e)| ≤ 3σ(e) & φ(v) ≤ φ(v ′ ). 

Otherwise, e is unhappy. 
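The two conditions translate directly into code. In this sketch an ordering φ is represented as a dictionary from node names to coordinates. The function signature and the example contigs are assumptions made for illustration.

```python
def is_happy(phi, v, v_other, w, w_other, mu, sigma):
    """Check whether the mate edge {v, w} with mean mu and deviation sigma is happy.

    v_other and w_other are the other nodes of the contig edges incident to v and w.
    """
    # Case 1: contig of v lies to the left, contig of w to the right.
    case1 = (phi[v_other] <= phi[v]
             and abs(phi[w] - phi[v] - mu) <= 3 * sigma
             and phi[w] <= phi[w_other])
    # Case 2: the mirror image, with the roles of v and w exchanged.
    case2 = (phi[w_other] <= phi[w]
             and abs(phi[v] - phi[w] - mu) <= 3 * sigma
             and phi[v] <= phi[v_other])
    return case1 or case2


# Two hypothetical contigs of length 1000 separated by a gap of 400.
phi = {"s1": 0, "e1": 1000, "s2": 1400, "e2": 2400}
print(is_happy(phi, "e1", "s1", "s2", "e2", mu=400, sigma=50))

# The same contigs with the second one flipped: the mate edge becomes unhappy.
phi_flipped = {"s1": 0, "e1": 1000, "s2": 2400, "e2": 1400}
print(is_happy(phi_flipped, "e1", "s1", "s2", "e2", mu=400, sigma=50))
```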

4.32 The Contig Ordering Problem 

Given a collection of contigs C = {c 1 , c 2 , . . .,c k } constructed from a set of reads F = 

{f 1 , f 2 , . . .,f R }, together with the corresponding mate pair information M. Let G = (V, E)


denote the associated contig-mate graph. 

Problem The Contig Ordering Problem is to find an ordering of G that maximizes the sum of 

weights of happy mate edges. 

Theorem The corresponding decision problem is NP-complete. 

(The decision problem is: Given a contig-mate graph G and a number K, does there exist an ordering of G such that the total weight of all happy edges is at least K?)

4.33 Proof of NP-completeness 

Recall: to prove that a problem X is NP-complete one must reduce a known NP-complete 

problem N to X. In other words, one must show that any instance I of N can be translated 

into an instance J of X in polynomial time such that I has the answer true iff J does. 

We will use the following NP-complete problem: 

BANDWIDTH: For a given graph G = (V, E) with node set V = {v 1 , v 2 , . . ., v n } and number 

K, does there exist a permutation φ of {1, 2, . . ., n} such that for all edges {v i , v j } ∈ E we have 

|φ(i) − φ(j)| ≤ K? (See Garey and Johnson 1979 for details.) 

A graph with bandwidth 4: 

Problem is in NP: For a given ordering φ, we can determine whether the total weight of the happy mate-edges reaches the given threshold K in polynomial time by simple inspection of all mate edges.

Reduction of BANDWIDTH: Given an instance G = (V, E) of this problem, we construct a contig-mate graph G′ = (V′, E′) in polynomial time as follows:

First, set V′ := V and E′ := E, and let these edges be the mate-edges, setting

\[ \mu(e) := 1 + \frac{K-1}{2} \quad\text{and}\quad \sigma(e) := \frac{K-1}{6}, \]

so as to obtain a happy range of [1, K], and w(e) := 1, for every mate-edge e.

Then, for each initial node v ∈ V , add a new auxiliary node v ′ to V ′ and join v and v ′ by a 

contig edge of length 0. 

The answer to the BANDWIDTH question is true, iff the graph G ′ has an ordering φ such that 

all mate edges in G ′ are happy: 

A graph G has BANDWIDTH ≤ K 

⇐⇒ 

∃ permutation φ such that (v i , v j ) ∈ E implies |φ(i) − φ(j)| ≤ K 

⇐⇒ 

∃ ordering φ such that (v i , v j ) ∈ E implies 1 ≤ |φ(i) − φ(j)| ≤ K 

⇐⇒ 

∃ ordering φ such that e = (v i , v j ) ∈ E implies µ(e) − 3σ(e) ≤ |φ(i) − φ(j)| ≤ µ(e) + 3σ(e) 

⇐⇒ 

all mate-edges of G ′ are happy. 

□
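For very small graphs the BANDWIDTH question can be decided by brute force, and the happy range used in the reduction can be checked with exact arithmetic. The following sketch is only meant to make the definitions concrete; it takes time exponential in the number of nodes, and its function names are assumptions.

```python
from fractions import Fraction
from itertools import permutations


def has_bandwidth_at_most(n, edges, K):
    """Brute-force test: is there a numbering of nodes 0..n-1 with all |phi(i) - phi(j)| <= K?"""
    for perm in permutations(range(1, n + 1)):
        # perm[i] is the position phi(i) assigned to node i.
        if all(abs(perm[i] - perm[j]) <= K for i, j in edges):
            return True
    return False


def happy_range(K):
    """Return [mu - 3 sigma, mu + 3 sigma] for the mate edges of the reduction."""
    mu = 1 + Fraction(K - 1, 2)
    sigma = Fraction(K - 1, 6)
    return mu - 3 * sigma, mu + 3 * sigma


cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(has_bandwidth_at_most(4, cycle4, 1), has_bandwidth_at_most(4, cycle4, 2))
print(happy_range(4))
```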


4.34 Spanning tree heuristic for the Contig Ordering Problem

An ordering φ that maximizes the number of happy mate edges is a useful scaffolding of the 

given contigs. 

The simplest heuristic for obtaining an ordering is to compute a maximum weight spanning tree 

for the contig-mate graph and use it to order all contigs, similar to the read layout problem. 

[Figure: source with the contigs c1, ..., c7 and a false mate edge.]

Unfortunately, this method does not work well in practice, as false mate edges lead to incorrect interleaving of contigs from completely different regions of the source sequence:

[Figure: the contigs c5, c6, c7 incorrectly interleaved with the contigs c1, c2, c3, c4.]

4.35 Representing an ordering in the contig-mate graph

By the definition given above, an ordering is an assignment of coordinates to all nodes of the 

contig-mate graph that corresponds to a scaffolding of the contigs. When we are not interested 

in the exact coordinates, then the relative order and orientation of the contigs can be represented 

as follows: 

Given a contig-mate graph G = (V, E). A set S ⊆ E of selected edges is called a scaffolding of 

G, if it has the following two properties: 

• every contig edge is selected, and 

• every node is incident to at most two selected edges. 

Thus, a scaffolding of G is a set of non-intersecting selected paths, each representing a scaffolding 

of its contained contigs. 
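The two defining properties are easy to test for a given set of selected edges. The sketch below represents edges as pairs of node names and checks exactly the two stated properties and nothing more; in particular, it does not test whether the selected edges form a cycle. All names are assumptions.

```python
from collections import Counter


def is_scaffolding(contig_edges, selected_edges):
    """Check that all contig edges are selected and no node has more than two selected edges."""
    selected = {frozenset(e) for e in selected_edges}
    if any(frozenset(c) not in selected for c in contig_edges):
        return False
    degree = Counter(node for e in selected for node in e)
    return all(d <= 2 for d in degree.values())


contigs = [("s1", "e1"), ("s2", "e2"), ("s3", "e3")]
chain = contigs + [("e1", "s2"), ("e2", "s3")]
print(is_scaffolding(contigs, chain))
```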

The following example contains two chains of selected edges representing scaffolds s 1 = 

(c 1 , c 2 , c 3 , c 4 ) and s 2 = (c 5 , c 6 , c 7 ): 

[Figure: two chains of selected edges, one through c1, c2, c3, c4 and one through c5, c6, c7.]

However, to be able to represent the interleaved scaffolding discussed earlier, we need to add 

some inferred edges (shown here as dotted lines) to the graph: 

[Figure: the contigs c1, ..., c7 with inferred edges (dotted) that represent the interleaved scaffolding.]

4.36 Greedy path-merging 

Given a contig-mate graph G = (V, E). The greedy path merging algorithm is a heuristic for 

solving the Contig Ordering Problem. It proceeds “bottom up” as follows, maintaining a valid 

scaffolding S ⊆ E:


Initially, all contig edges c_1, c_2, ..., c_k are selected, and no others. At this stage, the graph consists of k selected paths P_1 = (c_1), ..., P_k = (c_k).

Then, in order of decreasing weight, we consider each mate edge e = {v, w}: If v and w lie in the same selected path P_i, then e is a chord of P_i and no action is necessary.

If v and w are contained in two different paths P_i and P_j, then we attempt to merge the two paths to obtain a new path P, and accept such a merge only if the increase of H(G), the number of happy mate edges, is at least as large as the increase of U(G), the number of unhappy ones.

4.37 The greedy path-merging algorithm 

Algorithm Given a contig-mate graph G. The output of this algorithm is a node-disjoint 

collection of selected paths in G, each one defining an ordering of the contigs whose edges it 

covers. 


Select all contig edges.

for each mate-edge e in descending order of weight:

    if e is not selected:

        Let v, w denote the two nodes connected by e

        Let P_1 be the selected path incident to v

        Let P_2 be the selected path incident to w

        if P_1 ≠ P_2 and we can merge P_1 and P_2 (guided by e) to obtain P:

            if H(P) − (H(P_1) + H(P_2)) ≥ U(P) − (U(P_1) + U(P_2)):

                Replace P_1 and P_2 by P

end

4.38 Merging two paths 

Given two selected paths P 1 and P 2 and a guiding unselected mate-edge e 0 with nodes v 0 

(incident to P 1 ) and w 0 (incident to P 2 ). Merging of P 1 and P 2 is attempted as follows: 

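[Figure: three stages (a), (b), (c) of merging the path P1 with contigs c11, ..., c15 and the path P2 with contigs c21, ..., c27. In (a), the guiding mate edge e0 connects v0 and w0, and a further mate edge is labeled h. In (b) and (c), successive mate edges e1, e2 with end nodes v1, w1, v2, w2 and inferred edges f0, g0, f1, g1 are added as the two paths are zipped together.]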

This algorithm returns true if it successfully produces a new selected path P containing all contig edges in P_1 and P_2, and false if it fails.

Merging proceeds by “zipping” the two paths P_1 and P_2 together, first starting with e_0 and “zipping” to the right. Then, with the edge labeled h now playing the role of e_0, we zip to the left. Merging is said to fail if the positioning of the “active” contig c_1i implies that it must overlap with some contig in P_2 by a significant amount, but no such alignment (of sufficiently high quality) exists.

4.39 Example 

Here we are given 5 contigs c_1, ..., c_5, each of length l(c_i) = 10000:

[Figure: six snapshots of the contig-mate graph on c1, ..., c5 during greedy path-merging. The mate edges carry the labels (w=5, µ=12000), (w=4, µ=12000), (w=3, µ=1000), (w=2, µ=1000), (w=1, µ=12000) and (w=1, µ=34000), the latter between c1 and c4. The last two snapshots additionally show an inferred edge with µ ≈ 1000.]

The final scaffolding is (c 1 ,c 2 ,c 3 ,c 5 ,c 4 ). 

4.40 Repeat resolution 

Consider two unique unitigs u 1 and u 2 that are placed next to each other in a scaffolding, due 

to a heavy mate edge between them: 

[Figure: unitigs u1 and u2 joined by a heavy mate edge.]

We consider all non-unique unitigs and singleton reads that potentially can be placed between 

u 1 and u 2 by mate edges: 

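[Figure: u1 and u2 together with the non-unique unitigs and singleton reads that mate edges place between them.]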

Different heuristics are used to explore the corresponding local region of the overlap graph in 

an attempt to find a chain of overlapping fragments that spans the gap and is compatible with 

the given mate pair information: 

[Figure: a chain of overlapping fragments spanning the gap between u1 and u2.]

4.41 Summary 

Given a collection F = {f_1, f_2, ..., f_R} of reads and mate pair information, sampled from an unknown source DNA sequence. Assembly proceeds in the following steps:

1. compute the overlap graph, e.g. using a seed-and-extend approach,


2. construct all unitigs, e.g. using the minimal spanning tree approach, 

3. scaffold the unitigs, e.g. using the greedy-path merging algorithm, 

4. attempt to resolve repeats between unitigs, and 

5. compute a multi alignment of all reads in a given contig to obtain a consensus sequence 

for it. 

Note that the algorithms for steps (2) and (3) that are used in actual assembly projects are much more sophisticated than the ones described in these notes.

4.42 A WGS assembly of human (Celera) 

Input: 27 million fragments of av. length 550bp, 70% paired: 

5m pairs of length 2kb 

4m pairs of length 10kb 

0.9m pairs of length 50kb 

0.35m pairs of length 150kb 

Celera’s assembler uses approximately the following resources: 

Program       CPU hours   Elapsed time                      Max. memory

Screener           4800   2-3 days on 10-20 computers       2GB

Overlapper        12000   10 days on 10-20 computers        4GB

Unitigger           120   4-5 days on a single computer     32GB

Scaffolder          120   4-5 days on a single computer     32GB

RepeatRez            50   Two days on a single computer     32GB

Consensus           160   One day on 10-20 computers        2GB

Total: ≈ 18000 CPU hours. 

The size of the human genome is ≈ 3Gb. An unpublished 2001 assembly of the 27m fragments 

has the following statistics: 

• The assembly consists of 6500 scaffolds that span 2.776Gb of sequence.

• The spanned sequence contains 150,000 gaps, making up 148Mb in total.

• Of the spanned sequence, 99.0% is contained in scaffolds (or contigs?) of size 30kb or 

more. 

• Of the spanned sequence, 98.7% is contained in scaffolds (or contigs?) of size 100kb or 

more. 

5 Eulerian Superpath Method of Sequence Assembly 

This exposition is based on the following sources, which are all recommended reading: 

1. Michael S. Waterman, Introduction to computational biology, Chapman and Hall, 1995. 

(Chapter 7, section 3.)


2. Pavel A. Pevzner, Haixu Tang and Michael S. Waterman, A new approach to fragment 

assembly in DNA sequencing, RECOMB 2001, Montreal, Canada, Proceedings, pages 

256–265. 

3. Pavel A. Pevzner and Haixu Tang, Fragment assembly with double-barreled data, Bioinformatics, 

vol. 17, suppl. 1, 2001, pages S225–S233. 

5.1 Eulerian path method revisited 

Given an unknown DNA sequence A = a_1 ... a_n. Let S = {s_1, s_2, ...} be the spectrum of l-tuples observed using the C(l) chip.

Recall that in sequencing by hybridization (SBH), the assembly problem can be formulated as 

the problem of finding an Euler path (to be precise, a Chinese Postman tour that visits each 

edge at least once and minimizes the number of edges that are used more than once) in the de 

Bruijn graph G = (V, E), defined as follows: 

The set of nodes consists of all (l − 1)-tuples, and the edge set is obtained by connecting any two nodes v = v_1 ... v_{l−1} and w = w_1 ... w_{l−1} by a directed edge (v, w) iff there exists an l-tuple s ∈ S such that s = v_1 ... v_{l−1} w_{l−1} = v_1 w_1 ... w_{l−1}.
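The construction can be sketched as follows. This is an illustration under simplifying assumptions: only nodes that occur in some edge are stored, an Euler path is assumed to exist, and the Chinese Postman refinement mentioned above is not attempted. Hierholzer's algorithm is used as one standard way to find an Euler path; all function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(spectrum):
    """Map each (l-1)-tuple to the list of (l-1)-tuples it points to."""
    graph = defaultdict(list)
    for s in sorted(spectrum):
        graph[s[:-1]].append(s[1:])  # edge from the prefix to the suffix of s
    return dict(graph)

def euler_path(graph):
    """Hierholzer's algorithm; assumes that an Euler path exists."""
    balance = defaultdict(int)  # outdegree minus indegree
    for v, successors in graph.items():
        balance[v] += len(successors)
        for w in successors:
            balance[w] -= 1
    # start at the node with one surplus out-edge, if there is one
    start = next((v for v in sorted(balance) if balance[v] == 1), min(balance))
    unused = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if unused.get(v):
            stack.append(unused[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit the node
    return path[::-1]

def spell(path):
    """Sequence spelled by a path of overlapping (l-1)-tuples."""
    return path[0] + "".join(v[-1] for v in path[1:])
```

For the spectrum {ATG, TGC, GCA} the only Euler path spells ATGCA. With repeated (l − 1)-tuples several Euler paths may exist, which is why a large l is needed to resolve repeats.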

Two main problems are that this approach is highly sensitive to sequencing errors and requires 

a large l to resolve repeats. 

Given a set of reads F = {f_1, f_2, ..., f_R} obtained from a source sequence A by shotgun sequencing. “Can the Eulerian path method be applied to fragment assembly?” (Idury and Waterman, 1995)

Let us represent every read f i of length n by the n − l + 1 (not necessarily distinct) l-tuples 

obtained from f i . We define F l to be the spectrum of all such l-tuples. 

Idea: Apply the Eulerian path method to F l to obtain an assembly. 

Unfortunately, this naive approach does not work well in practice, because sequence errors and 

repeats lead to very complicated graphs with many false edges. 

To obtain a feasible approach, we will look at three questions: how to fix sequencing errors, how to make use of the continuity of reads, and how to make use of the mate pair data.

5.2 A typical small scale sequencing project 

Consider the N. meningitidis (NM) sequencing project completed at the Sanger Center in 2000. The genome is 2,184,406 bp long. Sequencing resulted in 53,263 reads of average length 400, corresponding to a coverage of 9.7.

The total number of sequencing errors is 255,631, corresponding to an error rate of 1.2% and a mean of 4.8 errors per read.
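The derived quantities follow from the raw counts by simple arithmetic. The short check below uses only the numbers quoted above; the variable names are chosen for illustration.

```python
genome_length = 2_184_406    # bp
num_reads = 53_263
mean_read_length = 400       # bp
num_errors = 255_631

total_bases = num_reads * mean_read_length
coverage = total_bases / genome_length       # average number of reads per position
error_rate = num_errors / total_bases        # errors per sequenced base
errors_per_read = num_errors / num_reads

print(f"coverage {coverage:.2f}, error rate {error_rate:.2%}, "
      f"errors per read {errors_per_read:.1f}")
```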

NM is difficult to assemble because it contains 126 long exact repeats of up to 3832 bp in length 

and many more approximate ones.


5.3 Error Correction 

Reads are collected with an error rate of about 1%. In a sequencing project with sufficient oversampling, 

by comparison of overlapping reads, one should be able to use the fact that sequencing 

errors are randomly distributed to distinguish between sequencing errors and differences due 

to repeats. 

If we knew the precise sequence of the source A, then we could use it to correct the reads. 

However, A is not known until the assembly is complete, a catch-22 (see footnote 1).

Assume that A is unknown, but that we know A l , the set of all l-tuples in A. Then we should 

still be able to correct most reads. Unfortunately, A l is not known either, but we will see how 

to approximate A l . 

5.4 Solid and weak l-tuples 

Given an unknown source sequence A, a collection of reads F and the spectrum F l of all l-tuples 

from reads in F. 

An l-tuple s ∈ F l is solid, if it belongs to more than M reads (where M is a given threshold), 

and weak, otherwise. A natural approximation of A l is the set of all solid l-tuples in F l . 

Motivation: If the read-coverage of A is x, then on average every unique l-tuple s in A will be contained in x reads. However, if one of these reads contains a sequencing error in its copy of s, then this erroneous l-tuple will not be contained in the other x − 1 reads.

[Figure: an l-tuple of the source sequence A covered by several reads, one of which contains a sequencing error.]
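A minimal sketch of the solid/weak classification, assuming the reads are plain strings and ignoring reverse complements; the function names are illustrative. Each read is counted at most once per l-tuple, matching the definition "belongs to more than M reads".

```python
from collections import Counter

def l_tuples(read, l):
    """All len(read) - l + 1 consecutive l-tuples of a read."""
    return [read[i:i + l] for i in range(len(read) - l + 1)]

def solid_tuples(reads, l, M):
    """l-tuples that belong to more than M reads."""
    multiplicity = Counter()
    for read in reads:
        multiplicity.update(set(l_tuples(read, l)))  # count each read once
    return {s for s, m in multiplicity.items() if m > M}
```

With the reads ACGTAC, CGTACG and ACGAAC, l = 3 and M = 1, the tuples CGA, GAA and AAC created by the T → A error in the third read occur in one read only and are therefore weak.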

5.5 The spectral alignment problem 

Let S be a collection of l-tuples called an l-spectrum. A string A is called an S-string, if all its 

l-tuples belong to S. 

Spectral Alignment Problem (SPA) Given a string f and an l-spectrum S, find the minimum number of mutations in f that transform f into an S-string.

A solution to the SPA only makes sense if the number of mutations is small, in which case SPA 

can be solved by dynamic programming, even for large l. 

(See I. Pe’er and R. Shamir, Spectrum Alignment: Efficient Resequencing by Hybridization, 

Proceedings of ISMB, 2000) 
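To illustrate the dynamic programming idea, here is a simplified sketch that allows substitutions only (no insertions or deletions). It is not the algorithm of Pe'er and Shamir; it is a plain Viterbi-style recursion over the l-tuples of S, and the function name is hypothetical. For every position it stores, for each l-tuple t in S, the cheapest way to rewrite the prefix read so far so that it is an S-string ending in t.

```python
def spectral_alignment(f, spectrum):
    """Minimum number of substitutions turning f into an S-string.

    Returns None if no S-string of length len(f) exists.
    """
    l = len(next(iter(spectrum)))
    if len(f) < l:
        return 0  # f contains no l-tuple, so it is trivially an S-string
    # cost[t]: cheapest rewrite of the current prefix that ends in the l-tuple t
    cost = {t: sum(a != b for a, b in zip(t, f[:l])) for t in spectrum}
    for i in range(l, len(f)):
        # best cost among tuples sharing the same (l-1)-suffix
        best_by_suffix = {}
        for t, c in cost.items():
            if c < best_by_suffix.get(t[1:], float("inf")):
                best_by_suffix[t[1:]] = c
        new_cost = {}
        for t in spectrum:
            previous = best_by_suffix.get(t[:-1])
            if previous is not None:
                new_cost[t] = previous + (t[-1] != f[i])
        cost = new_cost
        if not cost:
            return None
    return min(cost.values())
```

The running time of this sketch is proportional to |f| · |S| · l, so it is only meant to convey the recursion, not to scale to real spectra.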

5.6 Error correction based on SPA 

Let F_l^solid be the spectrum of all solid l-tuples in F_l. Any read f that is not an F_l^solid-string can be corrected by using a minimum number of mutations to transform f into an F_l^solid-string (SPA).

Footnote 1: In Joseph Heller’s novel, catch-22 was the paradox that trapped members of the US military: Anyone who applied to get out of the military on the grounds of insanity was behaving rationally and thus couldn’t be insane.

Correcting all reads in this way may change the sets of solid and weak l-tuples. Iterative 

application of this correction gradually increases the number of solid l-tuples and decreases the 

number of weak l-tuples: 

Algorithm Error correction based on SPA

Input: Set of reads F. Output: corrected reads F′

do

    Determine F_l^solid

    Correct all reads that are not F_l^solid-strings

until no further increase of solid l-tuples.

Experiments indicate that this correction eliminates many errors in bacterial sequencing projects. However, the following formulation of the problem is even more successful:

5.7 The Error Correction Problem 

Given a collection of reads F = {f_1, f_2, ..., f_R}. In the following, let F_l denote the spectrum of F consisting of the set of all l-tuples from the reads f_1, f_2, ..., f_R and their reverse complements f̄_1, ..., f̄_R, where f̄ denotes the reverse complement of f.

Let ∆ denote an upper bound on the number of errors in each read.

Error Correction Problem (ECR) Given F, ∆ and l, introduce up to ∆ corrections in each read in F in such a way that |F_l| is minimized.

This looks like an NP-hard problem to me... 

5.8 A simple greedy heuristic for ECR 

Observation An error in a read f affects at most l of the l-tuples in f and l of the l-tuples in its reverse complement f̄, and usually creates 2l erroneous l-tuples (2d for a position within a distance of d < l from an endpoint of f).

[Figure: an error in a read f, covered by l false tuples in f and l false tuples in f̄.]

This inspires the following simple 

Greedy heuristic for ECR: Detect and perform any error correction in a read f that reduces 

the number of l-tuples by 2l (or 2d, for positions close to an end point). 

Experiments suggest that this simple procedure eliminates about 86.5% of all sequencing errors. 

5.9 Orphan elimination heuristic for ECR 

Two l-tuples are called neighbors, if their Hamming distance is 1, i.e., if they differ at precisely 

one position.


The multiplicity of an l-tuple s ∈ F l is the number m(s) of reads in F that contain s. 

We call s an orphan, if 

1. s has small multiplicity, i.e. m(s) ≤ M, where M is a given threshold, 

2. s has precisely one neighbor, t, and 

3. m(s) ≤ m(t). 

The position where an orphan differs from its neighbor is called an orphan position. A read is 

orphan free, if it contains no orphan positions. 
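The orphan definition translates directly into code. The sketch below is a simplified illustration: it works on the DNA alphabet ACGT, computes multiplicities from the reads as given (reverse complements are not added), and uses a hypothetical function name.

```python
from collections import Counter

def find_orphans(reads, l, M):
    """Map each orphan s to (its only neighbor t, the orphan position)."""
    m = Counter()
    for read in reads:
        m.update({read[i:i + l] for i in range(len(read) - l + 1)})
    orphans = {}
    for s, multiplicity in m.items():
        if multiplicity > M:  # condition 1: small multiplicity
            continue
        # all l-tuples of the spectrum at Hamming distance 1 from s
        neighbors = [s[:i] + c + s[i + 1:]
                     for i in range(l) for c in "ACGT" if c != s[i]]
        neighbors = [t for t in neighbors if t in m]
        # condition 2: precisely one neighbor t; condition 3: m(s) <= m(t)
        if len(neighbors) == 1 and multiplicity <= m[neighbors[0]]:
            t = neighbors[0]
            position = next(i for i in range(l) if s[i] != t[i])
            orphans[s] = (t, position)
    return orphans
```

For three copies of the read ACGTTGCA and one erroneous copy ACGTAGCA, with l = 4 and M = 1, the four l-tuples covering the error (CGTA, GTAG, TAGC, AGCA) are reported as orphans, each with its correct neighbor.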

These definitions are motivated by the following 

Observation: If we choose l appropriately (i.e., not too small, not too large), then each erroneous l-tuple s induced by a sequencing error in a read f usually:

• does not appear in any other read, and 

• differs at precisely one position from a correct l-tuple t (obtained from a different read f ′ 

that comes from the same area of the source sequence as f, but doesn’t have a sequencing 

error at the same position). 

Hence, a sequencing error in a read usually creates 2l orphans. 

Orphans are created by random sequencing errors in reads: any l-tuple s containing the error 

will usually be unique and will differ from one correct l-tuple t only at the position of the 

error. This correct l-tuple, in turn, will be contained in a number of reads that do not have a 

sequencing error at the same position: 

[Figure: the unknown source sequence with a correct l-tuple t that is contained in several reads, and one read with an error whose l-tuple s is an orphan with t as its only neighbor.]

The main idea is to correct all errors at orphan positions in the sequencing reads, ensuring that 

the number of corrections made to any one read does not exceed ∆, the specified maximum 

number of errors per read. 

Greedy orphan elimination approach Perform error correction at any orphan position that 

reduces the number of l-tuples by 2l (or 2d, for positions close to an end point). After correcting 

all such errors, repeatedly rerun the method using a 2l − δ condition with increasing δ. 

Experiments suggest that this method can eliminate up to ≈ 97% of all sequencing errors, for 

bacterial size sequencing projects. 

5.10 Error correction or data corruption? 

Any heuristic used for correcting reads will make mistakes. For example, if for a given position in an l-tuple, some reads indicate that the nucleotide is a C, others that it is a G, then correction may make the wrong choice.


However, this is not a problem, as the goal of error correction is merely to remove inconsistencies 

from the input to help the down-stream assembly algorithm. 

After running the assembly algorithm to obtain a layout based on the corrected reads, a consensus 

sequence is computed from this layout using the uncorrected reads. Thus, the bases in 

the final output are determined by a multi-alignment of the original reads. 

There are a number of additional issues that we will not discuss further. 

5.11 Eulerian path problem revisited (again) 

Given a set of reads F, define the de Bruijn graph G = (V, E) with node set V = F_{l−1} and connect any two nodes s = s_1 s_2 ... s_{l−1} and t = t_1 t_2 ... t_{l−1} by an edge (s, t), iff s_1 s_2 ... s_{l−1} t_{l−1} = s_1 t_1 t_2 ... t_{l−1} ∈ F_l.

Any reconstruction of the unknown source sequence corresponds to an Euler path through the graph, or two such paths, to be precise, as F_l was defined to contain all l-tuples of every read f and its reverse complement f̄.

With real data, the errors hide the correct path among many erroneous edges. For example, the graph corresponding to error-free data for the NM project has 4,039,248 edges, whereas the graph corresponding to the real data has 9,474,411 edges (for l = 20). After error correction, the number is reduced to 4,081,857.

5.12 Sources, sinks and branching nodes 

A node v is called a source, if indegree(v) = 0, a sink, if outdegree(v) = 0 and a branching node, if indegree(v) · outdegree(v) > 1. For the NM project, the de Bruijn graph has 502,843 branching nodes, based on the original reads (l = 20).

Error correction leads to a much simpler graph with 382 sources and sinks, and 12,175 branching nodes.

Error-free reads lead to a graph with 11,173 branching nodes.

Clearly, error correction greatly simplifies the graph G. However, G is still very complicated, 

even in the error-free case. We need to take additional information into account, namely which 

l-tuples belong to the same reads, and in which order. 
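The three node types can be read off the in- and outdegrees. A minimal sketch, assuming the graph is given as a list of directed edges `(v, w)`; the function name and the returned dictionary layout are illustrative choices.

```python
from collections import Counter

def classify_nodes(edges):
    """Return the sources, sinks and branching nodes of a directed graph."""
    indegree = Counter(w for _, w in edges)
    outdegree = Counter(v for v, _ in edges)
    nodes = set(indegree) | set(outdegree)
    return {
        "sources": {v for v in nodes if indegree[v] == 0},
        "sinks": {v for v in nodes if outdegree[v] == 0},
        "branching": {v for v in nodes if indegree[v] * outdegree[v] > 1},
    }
```

For the edges A → B, B → C, C → B and B → D, node A is a source, D is a sink, and B (indegree 2, outdegree 2) is the only branching node.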

5.13 Repeats and tangles 

A path v 1 , v 2 , . . .,v n of nodes in the graph G is called a repeat, if indegree(v 1 ) > 1, 

outdegree(v n ) > 1 and outdegree(v i ) = 1 for 1 ≤ i < n. 

Edges entering v 1 are entrances, while edges leaving v n are exits, of the repeat. 

[Figure: a repeat v1, v2, v3, ..., vn−1, vn with its entrances (edges entering v1) and exits (edges leaving vn).]

An Eulerian path visits a repeat a few times and each such visit defines a pairing between an 

entrance and an exit.


Such repeats can cause problems in assembly because it is not clear which entrance should be paired with which exit.

5.14 Example 

[Figure: a source sequence A R B S C R’ D S’ E covered by reads 1 to 10, shown in four panels: the sequence and reads, the full graph, the graph reduced to source, sink and branching nodes only, and the read paths.]

5.15 Using read-paths to resolve repeats 

Each read f = f_1 f_2 ... f_k ∈ F defines a read-path in the graph G, that consists of the path of edges f_1 ... f_l, f_2 ... f_{l+1}, f_3 ... f_{l+2}, ... that represent the l-tuples of f, in their order of occurrence.

With this additional information, many short repeats that are spanned by read paths can be 

resolved, i.e. they have a unique pairing of entrances and exits that is compatible with the read 

paths: 

[Figure: a read path spanning the repeat v1, v2, v3, ..., vn−1, vn.]
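A read-path is easy to write down explicitly. In the sketch below each edge is represented as a pair of (l − 1)-tuples (tail node, head node); this representation and the function name are assumptions made for illustration.

```python
def read_path(read, l):
    """Edges of the de Bruijn graph visited by a read, in order of occurrence."""
    return [(read[i:i + l - 1], read[i + 1:i + l])
            for i in range(len(read) - l + 1)]
```

For example, the read ACGTA with l = 3 visits the edges AC → CG, CG → GT and GT → TA, which correspond to the l-tuples ACG, CGT and GTA.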

5.16 Tangles 

A tangle is a repeat that cannot be resolved using read paths:

[Figure: a repeat v1, v2, v3, ..., vn−1, vn with entrances in1 and in2, exits out1 and out2, and a read path.]

Here it is unclear whether an Euler path through the repeat v_1, ..., v_n should match the two pairs in_1 → out_1 and in_2 → out_2, or in_1 → out_2 and in_2 → out_1.

5.17 The Eulerian Superpath Problem 

This leads to a generalization of the Eulerian Path problem: 

Eulerian Superpath Problem (ESP) Given an Eulerian graph and a collection of paths in 

this graph, find an Eulerian path in this graph that contains all these paths as subpaths. 

Note that the Eulerian Path problem is a special case of ESP with every path being a single 

edge. 

Note that a practical assembler must always return a result, regardless of whether the graph possesses an Eulerian superpath or not.

Hence, the real aim is to find an optimal path that is compatible with all (or as many as 

possible) superpaths. 

Optimal could mean, for example, that the path uses as many edges as possible, but minimizes 

the number of edges that are used more than once. 

5.18 Solving the Eulerian Superpath Problem 

One strategy for solving ESP is to reduce the problem to an Eulerian Path problem, by applying a sequence of equivalent transformations to the initial graph G and set of paths P:

(G, P) → (G_1, P_1) → ... → (G_k, P_k),

until we obtain a new graph G_k and set of paths P_k, in which each path consists of a single edge.

Such a transformation (G i , P i ) → (G j , P j ) is called equivalent, if there exists a one-to-one 

correspondence between the Eulerian superpaths in (G i , P i ) and (G j , P j ). 

5.19 Equivalent transformation: x, y-detachment

We will discuss a simple equivalent transformation. 

For a graph G and collection of paths P, let: 

• x = (v_in, v_mid) and y = (v_mid, v_out) be two consecutive edges in G,

• P_x,y be the set of all paths in P that contain x, y as a subpath,

• P_→x be the set of all paths in P that end with x, and

• P_y→ be the set of all paths in P that start with y.

Additionally, we require that any path passing through x must either end at v_mid, or exit v_mid via y.

The x, y-detachment is a transformation that adds a new edge z = (v_in, v_out) and deletes the edges x and y from G:

[Figure: the edges x = (v_in, v_mid) and y = (v_mid, v_out) with the path sets P_x,y, P_→x and P_y→, before and after the transformation. Edges x and y are replaced by edge z; paths in P_x,y, P_→x and P_y→ are modified to contain z.]

The x, y-detachment transformation alters the system of paths P as follows:

1. in all paths in P_x,y, replace x, y by z,

2. in all paths in P_→x, replace x by z, and

3. in all paths in P_y→, replace y by z.

Above, we required that all paths through v_mid are contained in one of these three sets. In each of the three cases it is clear that any Eulerian superpath contained in the original graph will also be one in the derived graph, and vice versa. Hence, x, y-detachment under these conditions is an equivalent transformation.
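The three replacement rules can be sketched in code. The sketch assumes a hypothetical representation in which the graph is a dictionary from edge names to (tail, head) pairs and every path is a list of edge names. It covers only the simple case described above and raises an error for any path that uses x or y in another way.

```python
def xy_detachment(edges, paths, x, y, z):
    """Apply an x, y-detachment; returns the new edge dict and the new paths."""
    v_in, v_mid = edges[x]
    if edges[y][0] != v_mid:
        raise ValueError("x and y are not consecutive edges")
    v_out = edges[y][1]
    # delete x and y from G and add the new edge z = (v_in, v_out)
    new_edges = {e: ends for e, ends in edges.items() if e not in (x, y)}
    new_edges[z] = (v_in, v_out)
    new_paths = []
    for path in paths:
        new_path, i = [], 0
        while i < len(path):
            if path[i] == x and i + 1 < len(path) and path[i + 1] == y:
                new_path.append(z)  # rule 1: replace x, y by z
                i += 2
            elif path[i] == x and i == len(path) - 1:
                new_path.append(z)  # rule 2: the path ends with x
                i += 1
            elif path[i] == y and i == 0:
                new_path.append(z)  # rule 3: the path starts with y
                i += 1
            elif path[i] in (x, y):
                raise ValueError("path is outside the simple case")
            else:
                new_path.append(path[i])
                i += 1
        new_paths.append(new_path)
    return new_edges, new_paths
```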

5.20 More general x, y-detachment 

What can happen if we drop the additional requirement stated above, i.e. if we have a path in P_x,y2 that enters v_mid via x and then exits via an edge y_2 ≠ y?

[Figure: the edges x = (v_in, v_mid), y = (v_mid, v_out) and y2 = (v_mid, v_out2), with the path sets P_→x, P_x,y, P_x,y2 and P_y→.]

If (the paths in) P_→x and P_x,y enter v_in along the same edge, and if P_x,y2 enters via a different edge, then we replace x by z in P_→x. Similarly, if P_→x and P_x,y2 enter via the same edge, different from the one used by P_x,y, then we keep x. In the two other possible cases, it is unclear how to update P_→x and we call the edge x unresolvable:

[Figure: after the detachment, P_x,y and P_y→ use the new edge z, P_x,y2 still uses x and y2, and it is unclear (???) whether P_→x should use x or z.]

5.21 Example of x, y-detachment 

Multiple application of x, y-detachment to resolve a repeat: 

[Figure: four panels showing a repeat with edges x1, x2, y1, y2, y3, y4 and its step-by-step resolution. Panel captions: "Edge x_2 unresolvable."; "Obtained by y_4, x_1-detachment."; "Obtained by x_2, y_1-detachment."; "Obtained by z, x_2-detachment."]

5.22 Equivalent transformation: x-cut 

We call an edge x = (v, w) removable, if 

1. it is the only outedge for v and the only inedge for w, and 

2. x is either the first or last edge in every path p ∈ P that contains x. 

An x-cut (equivalently) transforms P into a new system P ′ by simply removing x from all paths 

in P →x and P x→ : 

[Figure: a removable edge x with incoming paths via y3 and y4 and outgoing paths via y1 and y2, before and after the x-cut.]


5.23 Summary 

Given a set of reads F from a sequencing project. In the Eulerian Path approach:

1. Every read f ∈ F is shredded into |f| − l + 1 consecutive l-tuples. 

2. The de Bruijn graph is constructed with vertices representing (l − 1)-tuples and edges representing l-tuples.

3. The graph G is simplified to consist only of source, sink and branching nodes. 

4. The set P of all read paths in G is generated. 

5. Equivalent transformations are applied to reduce the Eulerian Superpath problem to an 

Eulerian Path problem. 

6. An optimal path is used to generate a final fragment assembly. 

5.24 Comparison with other methods


5.25 Mate pairs 

Given a collection of reads F and corresponding mate pair information M. Consider the de Bruijn graph G constructed from F. Let m = (f_1, f_2) be a mate pair with mean length µ and standard deviation σ. We represent m by a mate-pair path from f_1 to f_2, if the distance d(f_1, f_2) from f_1 to f_2 in G is µ ± 3σ, say, along a unique path from f_1 to f_2:

[Figure: a mate pair m represented as a path from f1 to f2 in G.]

5.26 Using mate pairs to resolve repeats 

Consider the following situation, with repeats (R, R′) and (S, S′) and mate pairs m_1 and m_2:

[Figure: the source sequence A R B S C D R’ E S’ F with the two mate pairs m1 and m2.]

The corresponding graph is this: 

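[Figure: the graph with nodes A, B, C, D, E, F and the collapsed repeats R=R’ and S=S’, together with the mate pairs m1 and m2.]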

We can use the two mate pairs to resolve the repeats, first using m 2 to separate S and S ′ : 

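[Figure: the same graph after m2 has been used; the repeats are still labeled R=R’ and S=S’, and both mate pairs m1 and m2 are shown.]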

and then using m 1 to separate R and R ′ : 

[Figure: the fully resolved linear path A R B S C D R’ E S’ F.]

5.27 Mate pairs for ordering and orienting contigs 

Finally, if a number of different mate pairs consistently link two different components of the de 

Bruijn graph G, then they define a relative ordering and orientation of the two corresponding 

contigs:


6 Assembly Validation and Comparison 

Given a set of reads F and mate pairs M obtained from an unknown sequence A using shotgun 

sequencing. 

An assembly of A from F and M is a reconstruction of A that is given by a set of scaffolds 

S = {s 1 , s 2 , . . .,s p } and a set of contigs C = {c 1 , c 2 , . . .,c q }. 

Any such scaffold s_i represents a relative ordering and orientation of contigs and is given by an ordered list of contigs c_i1, c_i2, ..., c_ik, together with an orientation o_ij ∈ {−1, +1} for each contig, and an estimation of the gap between any pair of consecutive contigs c_ij and c_i(j+1).

Any such contig c_i represents a contiguous piece of sequence of length l(c_i) and is given by a consensus sequence C_i, a list of reads f_i1, ..., f_ih, and two mappings b(f_ij) and e(f_ij) that map the start and end of each read f_ij to their position in c_i (i.e. in [1, l(c_i)]), respectively.

6.1 The read coverage of a contig (or scaffold) 

Consider a contig c of length l(c). For each position p ∈ [1, l(c)], let m(p) denote the coverage 

of p, defined as the number of reads in c that contain p. A coverage plot can be used to identify 

areas of unusually high coverage, which may correspond to over-collapsed repeats: 

[Figure: read coverage (axis from 0 to 20) plotted along a scaffold, for positions 3.4 to 4.3 Mb.]

To compute the coverage plot, first, for every read f in c, place the begin position b(f) and 

end position e(f) into a sorted sequence L. Then, in order of L, for each begin position, report 

the number of reads that span the position, given by the number of begins minus the number 

of ends seen so far. 

Note that we can similarly define a read coverage for scaffolds. 
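The sweep described above can be sketched as follows, assuming each read is given as a hypothetical `(begin, end)` pair of positions in the contig. At equal positions, begin events are processed before end events, so a read that ends at a position still counts as covering it.

```python
def coverage_plot(reads):
    """Return (position, coverage) for every read begin position, in order."""
    events = []
    for b, e in reads:
        low, high = min(b, e), max(b, e)  # reads may be reverse-oriented
        events.append((low, 0))   # 0 = begin, sorts before an end at the same position
        events.append((high, 1))  # 1 = end
    events.sort()
    begins = ends = 0
    plot = []
    for position, kind in events:
        if kind == 0:
            begins += 1
            plot.append((position, begins - ends))
        else:
            ends += 1
    return plot
```

Sorting dominates the cost, so the plot for k reads is obtained in O(k log k) time.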


6.2 Clone coverage 

Consider a contig c consisting of reads f 1 , f 2 , . . .f k . Let f i and f j be two mated reads, with 

approximate clone length µ and standard deviation σ. 

Recall that we call a mate pair m = (f i , f j ) happy (w.r.t. c), if 

1. the pairs are oriented toward each other, i.e. either b(f_i) < e(f_i) and e(f_j) < b(f_j), or b(f_j) < e(f_j) and e(f_i) < b(f_i), and

2. the distance between f_i and f_j is approximately correct, i.e. |µ − |b(f_i) − b(f_j)|| ≤ 3σ.

Otherwise, m is unhappy, and is called mis-oriented, if (1) is violated, and mis-separated, if (1) 

holds, but (2) does not. 
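A minimal sketch of this classification, assuming that "oriented toward each other" is tested only by requiring the two reads to have opposite orientations (a stricter test would also require the forward read to lie to the left of the reverse read). The function name and argument order are illustrative.

```python
def classify_mate_pair(b_i, e_i, b_j, e_j, mu, sigma):
    """Classify a mate pair as 'happy', 'mis-oriented' or 'mis-separated'."""
    forward_i = b_i < e_i
    forward_j = b_j < e_j
    # condition (1): one read runs forward, the other one backward
    if forward_i == forward_j:
        return "mis-oriented"
    # condition (2): the distance between the begin positions is within 3 sigma of mu
    if abs(mu - abs(b_i - b_j)) <= 3 * sigma:
        return "happy"
    return "mis-separated"
```

For a clone library with µ = 2000 and σ = 100, a forward read starting at position 100 and a reverse read starting at position 2100 form a happy pair, whereas the same pair with the reverse read starting at 5100 is mis-separated.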

To compute the clone coverage plot, for every mate pair of reads f_i and f_j in c (either contig or scaffold), with b(f_i) < b(f_j), place the begin position b(f_i) and end position b(f_j) into a sorted sequence L. Then, in order of L, for each begin position, report the number of happy, mis-oriented and mis-separated clones that span the position, in each case given by the number of begins minus the number of ends seen so far.

However, this definition is not useful for large contigs or scaffolds, because unhappy mates that 

are far apart from each other in an assembly will increase the clone coverage over the whole 

distance between them in a way that does not reflect local mis-assembly problems. 

Hence, in practice, to obtain a localized clone coverage plot, one uses b(f i ) + (µ + 3σ) as end 

position, if b(f i ) < e(f i ), and b(f i ) − (µ + 3σ), if e(f i ) < b(f i ). 

The happy, mis-separated and mis-oriented coverage is shown in green, yellow and red, respectively. 

This is a good assembly: 

Example of the clone coverage plot for a poor assembly: 

6.3 Clone middle plot 

A useful tool for visualizing the quality of a contig, based on clone data, is to simply draw each mate pair m = (f_i, f_j) as a line whose x-coordinates are the start and end positions of the mate pair and whose y-coordinate is chosen at random. Different colors are used to distinguish between happy, mis-oriented and mis-separated mates. Additionally, it makes sense to separate the clones by approximate length:

The previous plot shows a good assembly, this a poor one:


The same data, but using “localized” coordinates: 

6.4 Breakpoint Detection 

Based on the clone coverage, we would like to locate breakpoints in a given contig (or scaffold) c. Loosely speaking, a breakpoint is a position p in the contig c at which the sequences of the contig immediately to the left and to the right of p come from different regions of the unknown source sequence.

(In consequence, to obtain a more correct assembly, one must cut all contigs (and scaffolds) at 

their breakpoints and then rearrange the pieces.) 

At a breakpoint, we expect the happy clone coverage to be very low and the mis-oriented clone coverage to be high.

Breakpoint heuristic Consider all clone start and end positions in order: if the number of currently open happy clones drops below the number of currently open mis-separated ones, then the beginning of a region containing a breakpoint has been detected, whereas in the opposite case, we have detected the end of such a region.
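The heuristic is again a sweep over sorted events. The sketch below assumes each clone is given as a hypothetical `(start, end, kind)` triple with kind "happy" or "mis-separated"; clones of any other kind are ignored, since the heuristic as stated compares only these two counts.

```python
def breakpoint_regions(clones):
    """Return (begin, end) regions in which open happy < open mis-separated clones."""
    events = []
    for start, end, kind in clones:
        events.append((start, 0, kind))  # 0 = clone opens
        events.append((end, 1, kind))    # 1 = clone closes
    events.sort()
    open_count = {"happy": 0, "mis-separated": 0}
    regions, region_start = [], None
    for position, is_end, kind in events:
        if kind not in open_count:
            continue
        open_count[kind] += -1 if is_end else 1
        below = open_count["happy"] < open_count["mis-separated"]
        if below and region_start is None:
            region_start = position              # a breakpoint region begins
        elif not below and region_start is not None:
            regions.append((region_start, position))  # the region ends
            region_start = None
    if region_start is not None:  # region still open at the last event
        regions.append((region_start, events[-1][0]))
    return regions
```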

Two different assemblies of human chromosome 19 produced by the Human Genome Project, H 1 

produced in September 2000, and H 2 dating January 2001, containing 723 and 488, respectively, 

detected breakpoints (shown as blue ticks):


7 Gene Prediction 

This exposition is based on the following sources, which are all recommended reading (in this 

order): 

1. Pavel A. Pevzner. Computational Molecular Biology, an algorithmic approach. MIT, 

2000, chapter 9. 

2. Chris Burge and Samuel Karlin. Prediction of complete gene structures in human genomic 

DNA. Journal of Molecular Biology, 268:78-94 (1997). 

3. Ian Korf, Paul Flicek, Daniel Duan and Michael R. Brent, Integrating Genomic Homology into Gene Structure Prediction, Bioinformatics, Vol. 17, Suppl. 1, pages S1-S9 (2001).

4. Vineet Bafna and Daniel Huson. The conserved exon method for gene finding. ISMB 

2000, 3-12 (2000). 

5. M. S. Gelfand, A. Mironov and P. A. Pevzner, Gene recognition via spliced alignment, 

PNAS, 93:9061–9066 (1996). 

7.1 Introduction 

In the 1960s, it was discovered that a gene and its protein product are colinear structures with a 

direct correlation between the triplets of nucleotides in the gene and amino acids in the protein. 

It soon became clear that genes can be difficult to determine, due to the existence of overlapping 

genes, and genes within genes etc. 

Moreover, the paradox arose that the genome size of many eukaryotes does not correspond to 

genetic complexity, for example, the salamander genome is 10 times the size of that of human. 

In 1977, the amazing discovery of “split” genes was made: genes that consist of multiple pieces 

called exons, separated by stretches of “junk DNA” called introns. 

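[Figure: in a prokaryote, DNA is transcribed to mRNA, which is translated to protein. In a eukaryote, DNA is transcribed to RNA in the nucleus, splicing produces the mRNA, and the mRNA is translated to protein.]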

The existence of split genes and junk-DNA raises a computational gene prediction problem that 

is still unsolved: 

Given a string of DNA. The gene prediction problem is to reliably predict all genes 

contained in the sequence. 

7.2 Three types of approaches 

One can distinguish between three types of approaches:


• Statistical or ab initio methods. These methods attempt to predict genes based on statistical 

properties of the given DNA sequence. Programs are e.g. Genscan, GeneID, 

GENIE and FGENEH. 

• Homology methods. The given DNA sequence is compared with known protein sequences, e.g. using “spliced alignments”. Programs are e.g. Procrustes and GeneWise.

• Comparative methods. The given DNA string is compared with a similar DNA string 

from a different species at the appropriate evolutionary distance and genes are predicted 

in both sequences based on the assumption that exons will be well conserved, whereas 

introns will not. Programs are e.g. CEM (conserved exon method) and Twinscan. 

7.3 Simplest approach to gene prediction 

The simplest way to detect potential coding regions is to look at Open Reading Frames (ORFs). 

An ORF is a sequence of codons in DNA that starts with a Start codon (ATG), ends with a 

Stop codon (TAA, TAG or TGA) and has no other (in-frame) stop codons inside. 

The average distance between stop codons in “random” DNA is $64/3 \approx 21$ codons, much smaller than the number of codons in an average protein ($\approx 300$).

Therefore, long ORFs indicate genes, although they fail to detect short genes or genes with 

short exons. 
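The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `find_orfs`, the `min_codons` threshold and the restriction to the forward strand are assumptions made for this sketch, not part of any of the programs named above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(dna, min_codons=1):
    """Return (start, end) pairs (0-based, end exclusive) of forward-strand ORFs.

    An ORF starts at the first ATG seen after the previous in-frame stop codon
    and ends with the next in-frame stop codon (which is included in the span).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG since the last in-frame stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

In practice one would call it with a fairly large `min_codons`, since short ORFs occur frequently by chance, and run it on the reverse complement as well.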

Additionally, features such as codon usage or hexamer counts can be taken into account. The 

codon usage of a string of DNA is given by a 64-component vector that counts how many times 

each codon is present in the string. These values can differ significantly between coding and 

non-coding DNA. 
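As a small sketch of the codon usage vector, the following Python function counts codons in one reading frame. The fixed alphabetical codon order and the `frame` parameter are assumptions of this example; the text does not prescribe either.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order; this order is an arbitrary choice.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_usage(dna, frame=0):
    """Return the 64-component vector of codon counts of dna, read in one frame."""
    counts = dict.fromkeys(CODONS, 0)
    dna = dna.upper()
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in counts:  # skip codons containing N or other symbols
            counts[codon] += 1
    return [counts[c] for c in CODONS]
```

Two such vectors, one from a candidate region and one from known coding DNA, can then be compared with any vector similarity measure.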

7.4 Eukaryotic gene structure

For our purposes, a eukaryotic gene has the following structure:

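[Figure: Structure of a eukaryotic gene, from 5' to 3': promoter (TATA), 5' UTR, start site (ATG), initial exon, donor site (GT), intron, acceptor site (AG), internal exon(s), donor site (GT), intron, acceptor site (AG), terminal exon, stop site (TAA, TAG or TGA), 3' UTR and poly-A signal (AAATAAAA).]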

Ab initio gene prediction methods use statistical properties of the different components of such a gene model to predict genes in unannotated DNA. For example, for the bases around the start site we may have the following observed frequencies (given by this position weight matrix):

Pos. -8 -7 -6 -5 -4 -3 -2 -1 +1 +2 +3 +4 +5 +6 +7 

A .16 .29 .20 .25 .22 .66 .27 .15 1 0 0 .28 .24 .11 .26 

C .48 .31 .21 .33 .56 .05 .50 .58 0 0 0 .16 .29 .24 .40 

G .18 .16 .46 .21 .17 .27 .12 .22 0 0 1 .48 .20 .45 .21 

T .19 .24 .14 .21 .06 .02 .11 .05 0 1 0 .09 .26 .21 .21


7.5 GENSCAN’s model 

We are going to discuss the popular program Genscan in detail, which is based on a semi- 

Markov model: 

[Figure: State diagram of the Genscan model. The forward (+) strand states are the internal exon states E0+, E1+, E2+, the intron states I0+, I1+, I2+, Einit+ (initial exon), Eterm+ (terminal exon), Esngl+ (single-exon gene), F+ (5' UTR), T+ (3' UTR), P+ (promoter) and A+ (poly-A signal). The reverse (−) strand has the corresponding states E0−, E1−, E2−, I0−, I1−, I2−, Einit−, Eterm−, Esngl−, F−, T−, P− and A−. The state N (intergenic region) connects the two strands.]

Genscan’s model can be formulated as an explicit state duration HMM. (This is an HMM in which, additionally, a duration period is explicitly modeled for each state, using a probability distribution.) The model is thought of as generating a parse $\phi$, consisting of:

• an ordered set of states $q = \{q_1, q_2, \ldots, q_n\}$, and

• an associated set of durations $d = \{d_1, d_2, \ldots, d_n\}$,

which, using probabilistic models for each of the state types, generates a DNA sequence $S$ of length $L = \sum_{i=1}^{n} d_i$.

The generation of a parse corresponding to a (pre-defined) sequence length $L$ is as follows:

1. An initial state $q_1$ is chosen according to an initial distribution $\pi$ on the states, i.e. $\pi_i = P(q_1 = Q^{(i)})$, where $Q^{(j)}$ ($j = 1, \ldots, 27$) is an indexing of the states of the model.

2. A length (state duration) $d_1$ corresponding to the state $q_1$ is generated, conditional on the value of $q_1 = Q^{(i)}$, from the length distribution $f_{Q^{(i)}}$.

3. A sequence segment $s_1$ of length $d_1$ is generated, conditional on $d_1$ and $q_1$, according to an appropriate sequence generating model for state type $q_1$.

4. The subsequent state $q_2$ is generated, conditional on the value of $q_1$, from the (first-order Markov) state transition matrix $T$, i.e. $T_{i,j} = P(q_{k+1} = Q^{(j)} \mid q_k = Q^{(i)})$.

This process is repeated until the sum $\sum_{i=1}^{n} d_i$ of the state durations first equals or exceeds $L$, at which point the last state duration is appropriately truncated, the final stretch of sequence is generated and the process stops.

The resulting sequence is simply the concatenation of the sequence segments, $S = s_1 s_2 \ldots s_n$.

Note that the generated sequence is not restricted to correspond to a single gene, but could represent 

multiple genes, in both strands, or none. 
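The generation process described above can be sketched as a small sampler. Everything concrete here is hypothetical: the two-state toy model (`"N"` and `"E"`), the uniform length distribution and the uniform base emission stand in for Genscan's 27 states and trained distributions, which are not reproduced.

```python
import random

def generate_parse(L, init, trans, sample_length, emit, rng):
    """Sample (states, durations, sequence) of total length L from an
    explicit state duration HMM.

    init:  dict state -> initial probability
    trans: dict state -> dict next_state -> transition probability
    sample_length(state, rng) -> duration >= 1
    emit(state, d, rng)       -> sequence segment of length d
    """
    states, durations, segments = [], [], []
    total = 0
    state = rng.choices(list(init), weights=list(init.values()))[0]
    while total < L:
        # Truncate the last duration so that the durations sum to exactly L.
        d = min(sample_length(state, rng), L - total)
        states.append(state)
        durations.append(d)
        segments.append(emit(state, d, rng))
        total += d
        if total < L:
            nxt = trans[state]
            state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return states, durations, "".join(segments)

# Hypothetical toy model, for illustration only.
toy_init = {"N": 0.9, "E": 0.1}
toy_trans = {"N": {"E": 1.0}, "E": {"N": 1.0}}

def toy_length(state, rng):
    return rng.randint(1, 10)

def toy_emit(state, d, rng):
    return "".join(rng.choice("ACGT") for _ in range(d))

states, durations, seq = generate_parse(
    50, toy_init, toy_trans, toy_length, toy_emit, random.Random(0))
```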

In addition to its topology involving the 27 states and 46 transitions depicted above, the model 

has four main components:


• a vector of initial probabilities π, 

• a matrix of state transition probabilities T, 

• a set of length distributions f, and 

• a set of sequence generating models P. 

(Recall that an HMM has initial, transition and emission probabilities.)

7.6 Likelihood prediction 

Let $M$ be such a model. For a fixed sequence length $L$, consider

$$\Omega = \Phi_L \times \mathcal{S}_L,$$

where $\Phi_L$ is the set of all possible parses of $M$ of length $L$, and $\mathcal{S}_L$ is the set of all possible sequences of length $L$.

The model $M$ assigns a probability density to each point (parse/sequence pair) in $\Omega$. Thus, for a given sequence $S \in \mathcal{S}_L$, the conditional probability of a particular parse $\phi \in \Phi_L$ is given by

$$P(\phi \mid S) = \frac{P(\phi, S)}{P(S)} = \frac{P(\phi, S)}{\sum_{\phi' \in \Phi_L} P(\phi', S)},$$

using Bayes' rule.

The essential idea is to specify a precise probabilistic model of what a gene looks like in advance 

and then to select the parse φ through the model M that has highest likelihood, given the 

sequence S. 

7.7 Computational issues 

Given a sequence $S$ of length $L$, the joint probability, $P(\phi, S)$, of generating the parse $\phi$ and the sequence $S$ is given by:

$$P(\phi, S) = \pi_{q_1} f_{q_1}(d_1) P(s_1 \mid q_1, d_1) \times \prod_{k=2}^{n} T_{q_{k-1}, q_k} f_{q_k}(d_k) P(s_k \mid q_k, d_k),$$

where the states of $\phi$ are $q_1, q_2, \ldots, q_n$ with associated state lengths $d_1, d_2, \ldots, d_n$, which break the sequence into segments $s_1, s_2, \ldots, s_n$.

Here, P(s k | q k , d k ) is the probability of generating the segment s k under the appropriate 

sequence generating model for a type-q k state of length d k . 
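As an illustration of the joint probability of a parse and a sequence, the following sketch evaluates it in log space, which avoids numerical underflow for long sequences. The toy parameters (two states, a uniform length distribution on 1..10, uniform base emissions) are assumptions for the example and are not Genscan's.

```python
import math

def log_joint(states, durations, segments, log_init, log_trans,
              log_length, log_emit):
    """log P(parse, S): initial term for the first state, then one transition,
    length and emission term for every following state."""
    total = (log_init[states[0]]
             + log_length(states[0], durations[0])
             + log_emit(states[0], segments[0]))
    for k in range(1, len(states)):
        total += (log_trans[states[k - 1]][states[k]]
                  + log_length(states[k], durations[k])
                  + log_emit(states[k], segments[k]))
    return total

# Hypothetical toy parameters, for illustration only.
toy_log_init = {"N": math.log(0.5), "E": math.log(0.5)}
toy_log_trans = {"N": {"E": math.log(1.0)}, "E": {"N": math.log(1.0)}}

def toy_log_length(state, d):
    return math.log(0.1)               # uniform on 1..10

def toy_log_emit(state, segment):
    return len(segment) * math.log(0.25)  # each base has probability 1/4

lp = log_joint(["N", "E"], [2, 3], ["AC", "GTA"],
               toy_log_init, toy_log_trans, toy_log_length, toy_log_emit)
```

A Viterbi-style search would maximize this quantity over all parses instead of evaluating it for a single given parse.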

A modification of the Viterbi algorithm may be used to calculate φ opt , the parse with maximal 

joint probability (under M), that gives the predicted gene or set of genes in the sequence. 

We can compute P(S) using the “forward algorithm” discussed under HMMs. With the help 

of the “backward algorithm”, certain additional quantities of interest can also be computed.


For example, consider the event $E^{(k)}_{[x,y]}$ that a particular sequence segment $[x, y]$ is an internal exon of phase $k \in \{0, 1, 2\}$. Under $M$, this event has probability

$$P(E^{(k)}_{[x,y]} \mid S) = \frac{\sum_{\phi : E^{(k)}_{[x,y]} \in \phi} P(\phi, S)}{P(S)},$$

where the sum is taken over all parses that contain the given exon $E^{(k)}_{[x,y]}$. This sum can be computed using the forward-backward algorithm.

7.8 Details of the model 

So far, we have discussed the topology and the other main components of the Genscan model 

in general terms. The following details need to be discussed: 

• the initial and transition probabilities, 

• the state length distributions, 

• transcriptional and translational signals, 

• splice signals, and 

• reverse-strand states. 

7.9 Initial and transition probabilities 

For gene prediction in randomly chosen blocks of contiguous human DNA, the initial probability 

of each state should be chosen proportionally to its estimated frequency in bulk human genomic 

DNA. 

This is a non-trivial problem, because gene density and certain aspects of gene structure vary 

significantly in regions of differing C+G% content (so-called “isochores”) of the human genome, 

with a much higher gene density in C+G-rich regions. 

Hence, in practice, initial and transition probabilities are estimated for four different categories: (I) < 43% C+G, (II) 43-51% C+G, (III) 51-57% C+G, and (IV) > 57% C+G.

The following initial probabilities were obtained from a learning set of 380 genes, by comparing 

the number of bases corresponding to each of the different states: 

    Group                      I        II       III      IV
    C+G range                  < 43%    43-51%   51-57%   > 57%

    Initial probabilities:
    Intergenic (N)             0.892    0.867    0.540    0.418
    Intron (I_i^+, I_i^-)      0.095    0.103    0.338    0.388
    5' UTR (F^+, F^-)          0.008    0.018    0.077    0.122
    3' UTR (T^+, T^-)          0.005    0.011    0.045    0.072

For simplicity, the initial probabilities for the exon, promoter and poly-A states were set to 0. 

Transition probabilities are obtained in a similar way.


7.10 State length distributions 

In general, the states of the model correspond to sequence segments of highly variable length. 

For certain states, most notably for the internal exon states $E_k$, length is probably important for proper biological function, i.e. proper splicing and inclusion in the final processed mRNA.

For example, it has been shown in vivo that internal deletions of exons to sizes below about 

50 bp may often lead to exon skipping, and there is evidence that steric interference between 

factors recognizing splice sites may make splicing of small exons more difficult. There is also 

evidence that spliceosomal assembly is inhibited if internal exons are expanded beyond 300 bp. 

In summary, these arguments support the observation that internal exons are usually about 120-150 bp long, with only a few of length less than 50 bp or more than 300 bp.

Constraints for initial and terminal exons are slightly different. 

The duration in initial, internal and terminal exon states is modeled by a different empirical 

distribution for each of the types of states. 

In contrast to exons, the length of introns does not seem critical, although a minimum length of 70-80 bp may be preferred.

The length distribution for introns appears to be approximately geometric (exponential). However, 

the average length of introns differs substantially between the different C+G groups: In 

group I, the average length is 2069 bp, whereas for group IV , the average length is only 518 bp. 

Hence, the duration in intron states is modeled by a geometric distribution with parameter q 

estimated for each C+G group separately. 
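A brief sketch of how the geometric parameter relates to the average intron length. It assumes the parameterization $P(d) = q(1-q)^{d-1}$ for $d \geq 1$, whose mean is $1/q$; the text does not state which parameterization is used, so this is an assumption of the example.

```python
import math

def geometric_q(mean_length):
    """Parameter q of a geometric distribution on 1, 2, ... with the given mean."""
    return 1.0 / mean_length

def geometric_log_pmf(d, q):
    """log P(d) = log q + (d - 1) log(1 - q), for d >= 1."""
    return math.log(q) + (d - 1) * math.log1p(-q)

# Average intron lengths quoted above for C+G groups I and IV.
q_group_I = geometric_q(2069)
q_group_IV = geometric_q(518)
```

Under this assumption, the shorter introns of group IV correspond to a larger value of q, i.e. a higher probability of leaving the intron state at each position.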

[Figure: Empirical length distributions for introns and exons. Four histograms show the number of introns, initial exons, internal exons and terminal exons as a function of length (bp).]

Note that the exon lengths generated must be consistent with the phases of adjacent introns. 

To account for this, first the number of complete codons is generated from the appropriate 

length distribution, then the appropriate number (0, 1 or 2) of bp is added to each end to 

account for the phases of the preceding and subsequent states. 

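For example, if the number of complete codons generated for an internal exon is $C = 6$, and the phases of the previous and next intron are 1 and 2, respectively, then the total length of the exon is $l = 3C + 2 + 2 = 22$ (a phase 1 intron splits a codon after its first base, leaving two bases at the start of the exon, and a phase 2 intron splits a codon after its second base, leaving two bases at the end of the exon):

    (phase 1 intron) TA TGT GTT ACT CGC GCT CGC TT (phase 2 intron)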
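The length computation can be written down directly. The sketch assumes the convention that an intron of phase k interrupts a codon after its k-th base (phase 0 falls between codons); the function name is made up for this example.

```python
def internal_exon_length(num_codons, prev_intron_phase, next_intron_phase):
    """Total length in bp of an internal exon with the given number of complete
    codons, flanked by introns of the given phases (each 0, 1 or 2)."""
    leading = (3 - prev_intron_phase) % 3  # bases completing the split codon
    trailing = next_intron_phase           # bases of the codon split at the end
    return leading + 3 * num_codons + trailing
```

For the example above, `internal_exon_length(6, 1, 2)` gives 2 + 18 + 2 = 22.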


For the 5' UTR and 3' UTR states, geometric distributions are used with mean values of 769 and 457 bp, respectively.

7.11 Simple signal models 

There are a number of different models of biological signal sequences, such as donor and acceptor 

sites, promoters, etc. 

One of the earliest and most influential approaches is the weight matrix method (WMM), in which the frequency $p^{(i)}_a$ of each nucleotide $a$ at position $i$ of a signal of length $n$ is derived from a collection of aligned signal sequences.

The product $P(A) = \prod_{i=1}^{n} p^{(i)}_{a_i}$ is used to estimate the probability of generating a particular sequence $A = a_1 a_2 \ldots a_n$.

The weight array model (WAM) is a generalization that takes dependencies between adjacent positions into account. In this model, the probability of generating a particular sequence is $P(A) = p^{(1)}_{a_1} \prod_{i=2}^{n} p^{i-1,i}_{a_{i-1},a_i}$, where $p^{i-1,i}_{j,k}$ is the conditional probability of generating nucleotide $x_k$ at position $i$, given nucleotide $x_j$ at position $i-1$.

Here is a WMM for recognition of a start site: 

Pos. -8 -7 -6 -5 -4 -3 -2 -1 +1 +2 +3 +4 +5 +6 +7 

A .16 .29 .20 .25 .22 .66 .27 .15 1 0 0 .28 .24 .11 .26 

C .48 .31 .21 .33 .56 .05 .50 .58 0 0 0 .16 .29 .24 .40 

G .18 .16 .46 .21 .17 .27 .12 .22 0 0 1 .48 .20 .45 .21 

T .19 .24 .14 .21 .06 .02 .11 .05 0 1 0 .09 .26 .21 .21 

Under this model, the sequence ...CCGCCACC ATG GCGC... has the highest probability of containing a start site, namely: P = 0.48 · 0.31 · 0.46 · 0.33 · 0.56 · 0.66 · 0.5 · 0.58 · 1 · 1 · 1 · 0.48 · 0.29 · 0.45 · 0.4 $\approx 6.1 \cdot 10^{-5}$.

The sequence ...AGTTTTTT ATG TAAT... has a very low probability of containing a start site at the indicated position (only a G at position +5, with frequency 0.20, would lower it further), namely: P = 0.16 · 0.16 · 0.14 · 0.21 · 0.06 · 0.02 · 0.11 · 0.05 · 1 · 1 · 1 · 0.09 · 0.24 · 0.11 · 0.21 $\approx 2.5 \cdot 10^{-12}$.
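The WMM probability is just a product of table look-ups, as the following sketch shows. It uses the start-site matrix given above; the names `WMM` and `wmm_probability` are chosen for this example.

```python
# Columns correspond to positions -8..-1 and +1..+7 of the start-site WMM above.
WMM = {
    "A": [.16, .29, .20, .25, .22, .66, .27, .15, 1, 0, 0, .28, .24, .11, .26],
    "C": [.48, .31, .21, .33, .56, .05, .50, .58, 0, 0, 0, .16, .29, .24, .40],
    "G": [.18, .16, .46, .21, .17, .27, .12, .22, 0, 0, 1, .48, .20, .45, .21],
    "T": [.19, .24, .14, .21, .06, .02, .11, .05, 0, 1, 0, .09, .26, .21, .21],
}

def wmm_probability(site, wmm=WMM):
    """Probability of a 15 bp site under the WMM: product of per-position frequencies."""
    width = len(wmm["A"])
    if len(site) != width:
        raise ValueError(f"site must have length {width}")
    p = 1.0
    for i, base in enumerate(site.upper()):
        p *= wmm[base][i]
    return p

# The two example sequences discussed above.
p_high = wmm_probability("CCGCCACC" + "ATG" + "GCGC")
p_low = wmm_probability("AGTTTTTT" + "ATG" + "TAAT")
```

For scanning long sequences one would normally sum log-frequencies instead of multiplying, and handle zero entries explicitly.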

7.12 Transcriptional and translational signals 

Poly-A signals are modeled as a 6 bp WMM, with consensus sequence AATAAA.

A 12 bp WMM, beginning 6 bp prior to the start codon, is used for the translation initiation 

signal. 

In both cases, one can estimate the probabilities using the GenBank annotated “polyA signal” 

and “CDS” features from sequences. 

Approximately 30% of eukaryotic promoters lack a TATA signal. Hence, a TATA-containing 

promoter is generated with 0.7 probability, and a TATA-less one with probability 0.3. 

TATA-containing promoters are modeled as a 15 bp TATA WMM and an 8 bp cap site WMM. 

The length between the two WMMs is generated uniformly from the range 14 − 20 bp. 

TATA-less ones are modeled as intergenic regions of 40 bp.


7.13 Splice signals 

The donor and acceptor splice signals are probably the most important signals, as the majority 

of exons are internal ones. Previous approaches use WMMs or WAMs to model them, thus 

assuming independence of sites, or that dependencies only occur between adjacent sites. 

The consensus region of the donor splice sites covers the last 3 bp of the exon (positions -3 to 

-1) and the first 6 bp of the succeeding intron (positions 1 to 6): 

...exon intron... 

Position -3 -2 -1 +1 +2 +3 +4 +5 +6 

Consensus c/a A G G T a/g A G t 

WMM: 

A .33 .60 .08 0 0 .49 .71 .06 .15 

C .37 .13 .04 0 0 .03 .07 .05 .19 

G .18 .14 .81 1 0 .45 .12 .84 .20 

T .12 .13 .07 0 1 .03 .09 .05 .46 

7.14 Donor site model 

However, donor sites show significant dependencies between non-adjacent positions, which probably 

reflect details of donor splice site recognition by U1 snRNA and other factors. 

Given a sequence $S$, let $C_i$ denote the consensus indicator variable that is 1 if the nucleotide at position $i$ matches the consensus at position $i$, and 0 otherwise. Let $X_j$ denote the nucleotide at position $j$.

For example, consider:

                  ...exon     intron...
    Position      -3  -2  -1  +1  +2  +3  +4  +5  +6
    Consensus     c/a A   G   G   T   a/g A   G   t
    S      ...T   A   A   C   G   T   A   A   G   C   C...

Here, $C_{-1} = 0$ and $C_{+6} = 0$, and $C_i = 1$ for all other positions. Similarly, $X_{-3} = A$, $X_{-2} = A$, $X_{-1} = C$ etc.

We use $\chi^2$ statistics for the variable $C_i$ versus $X_j$, for all pairs $i, j$ with $i \neq j$, in the set of donor sites from the genes of the given learning set, based on the $C_i$ versus $X_j$ contingency table:

                X_j
    C_i    A         C         G         T
    0      f_0(A)    f_0(C)    f_0(G)    f_0(T)
    1      f_1(A)    f_1(C)    f_1(G)    f_1(T)

where $f_c(x)$ is the number of donor sites in the training set that have $C_i = c$ (i.e. that do, for $c = 1$, or do not, for $c = 0$, match the consensus at position $i$) and the base $x$ at position $j$.

A significant $\chi^2$ score indicates that there is a dependency between sites $i$ and $j$.
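As a sketch of the test statistic, the following function computes the standard Pearson $\chi^2$ value for a contingency table such as the 2 × 4 table of $C_i$ versus $X_j$ counts (which has 3 degrees of freedom). Whether Genscan's training used exactly this form of the statistic is not stated here, so treat it as an illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of
    rows of observed counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n  # count under independence
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat
```

A table whose rows are proportional to each other (no dependency) gives a statistic of 0; the larger the value, the stronger the evidence for a dependency between the two positions.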

The idea is then to identify an ordering of the sites by decreasing discriminatory power and then 

to derive separate WMMs for each of the different cases, thus obtaining a so-called maximal 

dependence decomposition:


Here, H = A|C|U, B = C|G|U and V = A|C|G. For example, $G_5$, or $H_5$, is the set of donor sites with, or without, a G at position +5, respectively.

7.15 Acceptor site model 

Intron/exon junctions are modeled by a (first-order) WAM for bases −20 to +3, capturing the 

pyrimidine (C,T) rich region and the acceptor splice site itself. 

It is difficult to model the branch point in the preceding intron, and only 30% of the test data had a YYRAY sequence in the appropriate region [−40, −21].

A modified variant of a second order WAM is employed in which nucleotides are generated 

conditional on the previous two ones, in an attempt to model the weak but detectable tendency 

toward YYY triplets as well as certain branch point-related triplets such as TGA, TAA, GAC, 

and AAC in this region, without requiring the occurrence of any specific branch point consensus. 

(A windowing and averaging process is used to obtain estimates from the limited training 

data.) 

7.16 Exon models 

Coding portions of exons are modeled using an inhomogeneous 3-periodic fifth-order Markov model. Here, separate Markov transition matrices, $c_1$, $c_2$ and $c_3$, are determined for hexamers ending at each of the three codon positions, respectively.

[Figure: A coding sequence with three consecutive codons x1 x2 x3, y1 y2 y3 and z1 z2 z3; the matrices C1, C2 and C3 apply to hexamers ending at codon positions 1, 2 and 3, respectively.]

This is based on the observation that frame-shifted hexamer counts are generally the most 

accurate compositional discriminator of coding versus non-coding regions. 
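A rough sketch of a 3-periodic fifth-order Markov model follows. It keeps one table of hexamer counts per codon position and scores a sequence that is assumed to start at codon position 0. The pseudocount smoothing and the function names are assumptions of this example and do not reproduce Genscan's actual parameter estimation.

```python
import math
from collections import defaultdict

def train_periodic_markov(coding_seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) separately for each of the
    three codon positions of the base. Sequences must be in frame (start at
    codon position 0). Returns a function log_prob(codon_pos, context, base)."""
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1

    def log_prob(codon_pos, context, base):
        ctx = counts[codon_pos][context]
        total = sum(ctx.values()) + 4 * pseudocount
        return math.log((ctx[base] + pseudocount) / total)

    return log_prob

def coding_log_prob(seq, log_prob, order=5):
    """Log-probability of seq (in frame) given its first `order` bases."""
    return sum(log_prob(i % 3, seq[i - order:i], seq[i])
               for i in range(order, len(seq)))
```

Comparing this score with the score under a non-coding model gives a log-odds measure of coding potential for a candidate region.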

However, A+T rich genes are often not well predicted using hexamer counts based on bulk 

DNA and so Genscan uses two different sets of transition matrices, one trained for sequences 

with < 43% C+G content and one for all others.


7.17 Performance studies 

The performance of a gene prediction program is evaluated by applying it to DNA sequences 

for which all contained genes are known and annotated with high confidence. 

To calculate accuracy statistics, each nucleotide of a test sequence is classified as: 

• a predicted positive (PP) if it is predicted to be contained in a coding region,

• a predicted negative (PN) if it is predicted to be contained in a non-coding region,

• an actual positive (AP) if it is annotated to be contained in a coding region, and

• an actual negative (AN) if it is annotated to be contained in a non-coding region.

The performance is measured both on the level of nucleotides and on whole predicted exons, 

using a similar classification. 

Based on this classification, we compute the number of: 

• true positives, TP = PP ∩ AP, 

• false positives, FP = PP ∩ AN, 

• true negatives, TN = PN ∩ AN, and 

• false negatives, FN = PN ∩ AP. 

The sensitivity $Sn$ and specificity $Sp$ of a method are then defined as

$$Sn = \frac{TP}{AP} \quad \text{and} \quad Sp = \frac{TP}{PP},$$

respectively, measuring both the ability to predict true genes and to avoid predicting false ones.
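At the nucleotide level, the two measures can be computed from the sets of predicted and annotated coding positions, as in this small sketch (the function name and the set representation are choices made for the example).

```python
def sensitivity_specificity(predicted, annotated):
    """Return (Sn, Sp) = (TP / AP, TP / PP), where `predicted` and `annotated`
    are the sets of positions predicted, respectively annotated, as coding."""
    tp = len(predicted & annotated)  # true positives
    sn = tp / len(annotated) if annotated else float("nan")
    sp = tp / len(predicted) if predicted else float("nan")
    return sn, sp
```

Note that Sp as defined here is the fraction of predictions that are correct, so it does not depend on the number of true negatives.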

7.18 Performance of GENSCAN 

Genscan was run on a test set of 570 vertebrate sequences and the forward strand exons in the 

optimal Genscan parse of the sequence were compared to the annotated exons. The following 

table shows the results and compares them with results obtained using other programs: 

Genscan performs very well here and is currently the most popular gene finding method.


7.19 Comparative gene finding 

Genscan’s model makes use of statistical features of the genome under consideration, obtained 

from an annotated training set. 

More recently, a number of methods have been suggested that attempt to also make use of 

comparative data. They are based on the observation that 

the level of sequence conservation between two species depends on the function of 

the DNA, e.g. coding sequence is more conserved than intergenic sequence. 

One such program is Rosetta, which first computes a global alignment of two homologous sequences and then attempts to predict genes in both sequences simultaneously. A second is the conserved exon method, which uses local conservation.

The Twinscan program is an extension of Genscan that additionally models a conservation sequence.

7.20 TWINSCAN 

The input to Twinscan consists of a target sequence, i.e. a genomic sequence in which genes are 

to be predicted, and an informant sequence, i.e. a genomic sequence from a related organism. 

For example, the target sequence may come from the mouse genome and the informant sequence from the human genome.

Given a target and an informant, in a preprocessing step, one determines a set of top homologs 

(e.g. using BLAST) from the informant sequence, i.e. one or more sequences from the informant 

sequence that match the target sequence best. 

[Figure: A mouse target sequence and the conserved human regions (top homologs) that match it.]

The top homologs represent the regions of conserved informant sequence, which we will simply 

call “the informant sequence” in the following. 

7.21 Conservation sequence 

Similarity is represented by a conservation sequence, which pairs one of three symbols with 

each nucleotide of the target: 

. unaligned | matched : mismatched 

Gaps in the informant sequence become mismatch symbols, gaps in the target sequence are 

ignored. Consider: 

    123456789    position
    GAATTCCGT    target sequence

and suppose that BLAST yields the following HSP:

    345 6789     target position
    ATT-CCGT     target alignment
    ||  || |     BLAST alignment
    ATCACC-T     informant alignment

The conservation sequence derived from this HSP is:

    123456789    position
    GAATTCCGT    target sequence
    ..||:||:|    conservation sequence

The following algorithm takes a list of HSPs and computes the conservation sequence C: 

Algorithm

Input: target sequence of length n, list of HSPs
Output: conservation sequence C

Init.: C[1..n] := unaligned
Sort HSPs by alignment score
for each position i in the target sequence:
    for each HSP H from best to worst:
        if H covers position i:
            if C[i] = unaligned:
                C[i] := symbol (matched or mismatched) that H assigns to position i
end

Note that the conservation symbol assigned to the target nucleotide at position i is determined 

by the best HSP that covers i, regardless of which homologous sequence it comes from. Position 

i is classified as unaligned only if none of the HSPs overlap it. 
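The algorithm can be sketched in Python as follows. The representation of an HSP as a tuple (score, target start position, target alignment string, informant alignment string) is an assumption of this example and is not the BLAST output format.

```python
def hsp_symbols(target_start, target_aln, informant_aln):
    """Map each target position (1-based) covered by one HSP to '|' or ':'."""
    symbols = {}
    pos = target_start
    for t, s in zip(target_aln, informant_aln):
        if t == "-":
            continue  # gap in the target: ignored
        # A gap in the informant ('-') never equals a base, so it is a mismatch.
        symbols[pos] = "|" if t == s else ":"
        pos += 1
    return symbols

def conservation_sequence(n, hsps):
    """Conservation sequence of a target of length n.

    hsps: list of (score, target_start, target_aln, informant_aln).
    Each position takes its symbol from the best-scoring HSP that covers it.
    """
    cons = ["."] * n  # '.' = unaligned
    for score, start, t_aln, i_aln in sorted(hsps, key=lambda h: h[0], reverse=True):
        for pos, sym in hsp_symbols(start, t_aln, i_aln).items():
            if cons[pos - 1] == ".":
                cons[pos - 1] = sym
    return "".join(cons)
```

Processing the HSPs from best to worst and filling only unaligned positions is equivalent to the position-by-position loop of the pseudocode.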

7.22 Probability of sequence and conservation sequence 

Recall that Genscan assigns each nucleotide of an input sequence to one of seven categories: 

promoter, 5’ UTR, exon, intron, 3’ UTR, poly-A signal and intergenic. 

Genscan chooses the most likely assignment of categories to nucleotides according to the 

Genscan model, using an optimization algorithm (i.e., a modification of the Viterbi algorithm). 

Given a sequence, the Genscan model assigns a probability to each parse of the sequence (i.e., 

path through the model that generates the sequence.) 

The Twinscan model assigns a probability to any parsed DNA sequence together with a 

parallel conservation sequence. Under this model, the probability of a DNA sequence and the 

probability of the parallel conservation sequence are independent, given the parse. 

Consider the following example: 

            10        20        30
    123456789|123456789|123456789|123456789
    ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC   target sequence T
    ||:|||.........|:|:|||||||||:||:|||::||   conservation sequence C

Consider the probability of observing the target sequence $T_{7,33}$ extending from position 7 to 33, given the hypothesis $E_{7,33}$ that an internal exon extends from position 7 to 33.

This is simply the probability of the target sequence $T_{7,33}$ under the Genscan model times the probability of the conservation sequence $C_{7,33}$ under the conservation model, assuming the parse $E_{7,33}$:

$$P(T_{7,33}, C_{7,33} \mid E_{7,33}) = P(T_{7,33} \mid E_{7,33}) \, P(C_{7,33} \mid E_{7,33}).$$

7.23 TWINSCAN’s model 

Twinscan consists of a new, joint probability model on DNA sequences and conservation 

sequences, together with the same optimization algorithm used by Genscan. 

Twinscan augments the state-specific sequence models of Genscan with models of the probability of generating any given conservation sequence from any given state.

Coding, UTR, and intron/intergenic states all assign probabilities to stretches of conservation 

sequence using homogeneous 5th-order Markov chains: 

ccccccccccc c1 c2 c3 c4 c5 c6 ccccccccccc 

One set of parameters is estimated for each of these types of regions. 

Again, consider: 

            10        20        30
    123456789|123456789|123456789|123456789
    ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC   target sequence T
    ||:|||.........|:|:|||||||||:||:|||::||   conservation sequence C

The probability of observing $C_{7,33}$, given $E_{7,33}$, is:

$$P_C(C_{7,33} \mid E_{7,33}) = P_E(C_{7,7} \mid C_{2,6}) \cdot \ldots \cdot P_E(C_{33,33} \mid C_{28,32}),$$

where $P_E(C_{33,33} \mid C_{28,32})$, for example, is the estimated probability of a ‘|’ (match) following the given context symbols “|:||:” in the conservation sequence of an exon.
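The product over positions can be sketched as below. The conditional model `log_cond` is a placeholder supplied by the caller; the uniform model used in the usage lines is purely hypothetical and only shows how the function is called.

```python
import math

def conservation_log_prob(cons, start, end, log_cond, order=5):
    """Sum of log P(C_i | C_{i-order} .. C_{i-1}) over 1-based positions
    start..end of the conservation sequence `cons`.

    log_cond(context, symbol) returns the log conditional probability of
    `symbol` given the `order` preceding symbols under one state type.
    """
    if start <= order:
        raise ValueError("need `order` context symbols before `start`")
    total = 0.0
    for i in range(start, end + 1):
        context = cons[i - 1 - order:i - 1]  # the `order` symbols before position i
        total += log_cond(context, cons[i - 1])
    return total

# Usage with a hypothetical model that ignores the context.
C = "||:|||.........|:|:|||||||||:||:|||::||"
uniform = lambda context, symbol: math.log(1 / 3)
lp = conservation_log_prob(C, 7, 33, uniform)
```

In a trained model, `log_cond` would look up estimated fifth-order Markov probabilities for the state type in question (coding, UTR, or intron/intergenic).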

Conservation at splice donor and acceptor sites is modeled using 2nd-order WAMs of length 9 bp and 43 bp, respectively (lengths as in Genscan).

7.24 TWINSCAN’s performance 

Twinscan was tested on two data sets. The first set consists of 86 mouse sequences totaling 

7.6 Mb and used top homologs from human: 

    Program      Exons   Exon Sn   Exon Sp   Genes   Genes Sn   Genes Sp
    Annotation   2758                        275
    Genscan      2997    0.631     0.581     395     0.153      0.106
    Twinscan     2854    0.683     0.660     464     0.244      0.144

The second set is a subset containing 8 pairs of finished orthologs: 

    Program      Exons   Exon Sn   Exon Sp   Genes   Genes Sn   Genes Sp
    Annotation   610                         48
    Genscan      731     0.798     0.666     51      0.167      0.157
    Twinscan     684     0.854     0.752     50      0.271      0.260


7.25 The conserved exon method (CEM) 

Based on a model of sequence conservation, Twinscan uses an informant sequence to obtain 

better gene predictions for a given target sequence. 

Input to the conserved exon method (CEM) are two related sequences and the method predicts 

gene structures in both sequences simultaneously. The underlying assumption is that exons are 

well preserved, whereas introns and intergenic DNA have very little similarity. 

For this assumption to hold, the two input sequences must be at an appropriate evolutionary 

distance. Coding regions are generally well conserved in species as far back as 450 Myrs. At 

evolutionary distances of 50–100 Myrs (human and mouse), the conservation also extends to 

other functional regions important for gene expression and maintaining genome structure. 

The main idea of CEM is to look for conserved protein sequences by comparing pairs of DNA 

sequences, to identify putative exons based on sequence and splice site conservation, and then 

to chain such pairs of conserved exons together to obtain gene structure predictions in both 

sequences. 

Identifying conserved coding sequence. The first part of the CEM is not new. For example, the TBLASTX program performs precisely this task. Additionally, a number of tools exist for comparing two genomic sequences, finding conserved exons and regulatory regions etc.

Building gene models. The second part of the CEM is more interesting: gene structures are generated from the identified matches and complete gene structures are predicted in both input sequences.

7.26 Application of TBLASTX 

Throughout the following, we are given two similar DNA sequences S and T. 

The program TBLASTX produces a list of high-scoring pairs (HSPs) of locally aligned substrings of S and T, where the two substrings are interpreted as amino-acid coding strings and the score of the alignment is computed using a BLOSUM or PAM protein scoring matrix.

This is how an HSP is reported by TBLASTX: 

Score = 214 (98.4 bits), Expect = 0.0, Sum P(24) = 0.0 

Identities = 44/46 (95%), Positives = 46/46 (100%), Frame = +1 / +1 

Query: 5284 RLVLRIATDDSKAVCRLSVKFGATLRTSRLLLERAKELNIDVVGVR 5421 

RLVLRIATDDSKAVCRLSVKFGATL+TSRLLLERAKELNIDV+GVR 

Sbjct: 3871 RLVLRIATDDSKAVCRLSVKFGATLKTSRLLLERAKELNIDVIGVR 4008 

In this example, the positions 5284–5421 of sequence S and positions 3871–4008 of sequence 

T are aligned together and interpreted as amino-acids as shown. The “frame” indicates the 

directions and the offsets of the two substrings. 

TBLASTX matches between two similar pieces of human and mouse DNA:


[Figure: CEMexplorer dot plot of the TBLASTX matches for the test case J03733_X16277 (ornithine_1, frames +,+), with positions in mus.mask (1000-7000) on the horizontal axis and positions in hum.mask (1000-9000) on the vertical axis.]

7.27 Key assumption for conserved exons 

Note that programs such as TBLASTX predict putative coding regions, but not actual splice 

boundaries. Also, many HSPs are due to other conserved features, not exons. 

In the CEM, the local alignments produced by TBLASTX are used as seeds for dynamic 

programming alignments that are computed to detect complete exons. 

Key assumption. Any pair of conserved exons $E_1$ (in $S$) and $E_2$ (in $T$) gives rise to a witness, i.e. an HSP $h$ whose middle codon is a portion of the correct local alignment of $E_1$ and $E_2$, in the correct frame.

[Figure: Sketch of the exons E1 and E2, the witness HSP h, and two points labeled A and B.]

7.28 Conserved exon pairs 

A putative conserved exon pair (CEP) consists of a pair of substrings $E_1$ (in $S$) and $E_2$ (in $T$) that are both flanked by appropriate splice junctions and have a high-scoring local amino-acid alignment. We now discuss how to obtain putative CEPs.

Given an HSP $h$, let $m_S(h)$ and $m_T(h)$ denote the position of the middle codon of $h$ in $S$ and in $T$, respectively.

Let $b_S(h)$ and $e_S(h)$ denote the position of the left-most possible intron-exon site and right-most possible exon-intron site for any putative exon in $S$ that is witnessed by $h$. Define $b_T(h)$ and $e_T(h)$ in the same way.

In a simple approach, we use empirical bounds on the lengths of exons to find the values of $b_S$, $e_S$, $b_T$ and $e_T$. A more sophisticated approach takes the amount of coverage by HSPs etc. into account.

Start, stop and splice sites are detected by WMMs or more advanced techniques.

If the values of $b_S$, $e_S$, $b_T$ and $e_T$ were chosen large enough, then the key assumption implies that the two exons $E_1$ (in $S$) and $E_2$ (in $T$) of the true CEP (witnessed by $h$) will start in $[b_S(h), m_S(h)]$ and $[b_T(h), m_T(h)]$, and will end in $[m_S(h), e_S(h)]$ and $[m_T(h), e_T(h)]$, respectively.

We evaluate all possible pairs of exons in this region by running two dynamic programs: one starts at $(m_S(h), m_T(h))$ and ends at $(e_S(h), e_T(h))$, the other runs in reverse direction from $(m_S(h), m_T(h))$ to $(b_S(h), b_T(h))$:

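[Figure: The dynamic programming region for an HSP h, spanning $b_S(h)$ to $e_S(h)$ on the S axis and $b_T(h)$ to $e_T(h)$ on the T axis, with h located at $(m_S(h), m_T(h))$.]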

7.29 Exon alignment 

The actual algorithms used for the local alignment computations are variants of the standard 

algorithm. 

Note that the alignments are forced to start in the frame defined by the HSP. Frame-shifts are 

allowed subsequently (with an appropriate indel penalty). 

Each splice-junction pair is a cell in the dynamic programming matrix, and its score is maintained 

in a separate list. 

Let $(i, j)$ be the coordinates of a cell corresponding to a splice-pair $(z_S(h), z_T(h))$. The score assigned to $(z_S(h), z_T(h))$ is not $\mathrm{Score}[i][j]$, but

$$\mathrm{Score}(z_S(h), z_T(h)) = \max_{0 \le k_S(h),\, k_T(h) \le 2} \mathrm{Score}[i - k_S(h)][j - k_T(h)].$$

This is to allow for the possibility of an intron splitting a codon. In this way, the alignment (which only scores codons) allows terminal nucleotide gaps without incurring a frame-shift penalty.

The amount of overhang

$$(o_S(h), o_T(h)) = \arg\max_{0 \le k_S(h),\, k_T(h) \le 2} \mathrm{Score}[i - k_S(h)][j - k_T(h)]$$

is also stored along with the score.
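The maximization over the nine possible overhang combinations is small enough to write out directly. In this sketch the dynamic programming matrix is assumed to be a plain list of lists indexed as `score[i][j]`; the function name and the tie-breaking rule (smallest overhang wins) are choices made for the example.

```python
def splice_pair_score(score, i, j):
    """Return (best score, (o_S, o_T)) for the splice pair at cell (i, j):
    the maximum of score[i - k_S][j - k_T] over overhangs 0 <= k_S, k_T <= 2,
    together with the overhangs that attain it."""
    best, overhang = None, None
    for k_s in range(3):
        for k_t in range(3):
            if i - k_s < 0 or j - k_t < 0:
                continue  # overhang would leave the matrix
            value = score[i - k_s][j - k_t]
            if best is None or value > best:
                best, overhang = value, (k_s, k_t)
    return best, overhang
```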

As the alignment is done at the protein level, there is a direction associated with it. The 

dynamic programming computation from the mid-point to the acceptor splice junctions is done 

by reversing each codon before scoring. 

7.30 The CEP graph 

For each HSP h we construct a CEP graph. Each node u in the CEP-graph corresponds to a 

coordinate pair (i, j), which is the starting point, mid-point or terminating point of a candidate 

exon pair (E 1 , E 2 ). More precisely, u is one of the following: 

• a center node, if $(i, j) = (m_S(h), m_T(h))$ is the position of the middle codon of h,

• a donor node, if $i \in [m_S(h), e_S(h)]$ and $j \in [m_T(h), e_T(h)]$ are sites of donor splice signals in S and T,

• an acceptor node, if $i \in [b_S(h), m_S(h)]$ and $j \in [b_T(h), m_T(h)]$ are sites of acceptor splice signals,

• a start node, if $i \in [b_S(h), m_S(h)]$ and $j \in [b_T(h), m_T(h)]$ are sites of translation initiation signals, or

• a terminal node, if $i \in [m_S(h), e_S(h)]$ and $j \in [m_T(h), e_T(h)]$ are sites of a stop codon.

[Figure: the CEP graph of an HSP h, with the center node m(h) inside the region bounded by b_S(h), e_S(h), b_T(h) and e_T(h).]

Each node u has some additional information associated with it. The coordinates of the cell 

are maintained as (u S , u T ). For each acceptor or donor node u, we maintain information on 

the nucleotide overhang at the boundary as overhang(u) = (o S (u), o T (u)). 

A directed edge is constructed from each acceptor or start node to the center, and from the 

center to each donor or terminal node. The weight of the edge is the score of the corresponding 

local alignment. 

7.31 The CEM graph 

As discussed above, each HSP gives rise to a CEP graph. (In practice, however, different HSPs 

often lead to the same CEP graph and such redundancies should be removed.) 

Each CEP-graph is a concise representation of alignments of pairs of exons. At most one pair 

can actually be a conserved-exon-pair in the true gene structures. The Conserved-Exon-Method 

takes the CEP-graphs of HSPs and chains them together, thus obtaining the full “CEM-graph”. 

It builds gene models from this graph based on the assumption that the transcripts derived 

from correct orthologous gene structures will have the highest alignment score. 

Let S and T be the two genomic sequences. 

For each HSP h, compute the CEP-graph. We build a candidate exon graph G = (V, E) (which 

we call the CEM-graph), as follows: V is the union of all the nodes in the CEP-graphs, and E 

contains all the edges in each CEP-graph. Further, add an edge from donor or terminal node 

u to an acceptor or start node v if both: 

• $v_S \ge u_S + M$ and $v_T \ge u_T + M$, where M is a suitably chosen minimum intron length, and

• with $(o_S(u), o_T(u)) = \mathrm{overhang}(u)$ and $(o_S(v), o_T(v)) = \mathrm{overhang}(v)$, we have $o_S(u) + o_S(v) \equiv 0 \pmod 3$ and $o_T(u) + o_T(v) \equiv 0 \pmod 3$.

The weight of the edge (u, v) is the score of aligning the amino-acids obtained by concatenating 

the overhangs on either side added to the penalty for an intron gap. 
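As an illustration, the two linking conditions can be checked as follows. The `Node` record and the function name `can_link` are hypothetical; only the two tests themselves are taken from the text.

```python
from typing import NamedTuple, Tuple


class Node(NamedTuple):
    pos: Tuple[int, int]       # cell coordinates (u_S, u_T)
    overhang: Tuple[int, int]  # nucleotide overhang (o_S, o_T)


def can_link(u: Node, v: Node, min_intron: int) -> bool:
    """True if an edge from donor/terminal node u to acceptor/start node v is allowed."""
    # Condition 1: v lies at least a minimum intron length downstream in S and T.
    far_enough = (v.pos[0] >= u.pos[0] + min_intron
                  and v.pos[1] >= u.pos[1] + min_intron)
    # Condition 2: the overhangs on both sides add up to whole codons.
    in_frame = ((u.overhang[0] + v.overhang[0]) % 3 == 0
                and (u.overhang[1] + v.overhang[1]) % 3 == 0)
    return far_enough and in_frame
```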

Example of linking two CEPs; nodes are labeled by their offsets $(o_S, o_T)$:

[Figure: the CEP graphs of two HSPs h and h′, joined by additional edges linking the CEPs.]

Example of a complete graph: 

[Figure: a complete CEM graph. Nodes are labeled by coordinate pairs (position in S, position in T) and edges by their weights.]

7.32 Obtaining a gene prediction 

By construction, a path in the CEM-graph corresponds to a prediction of orthologous gene 

structures in the two genomes. Based on the assumption that the correct gene models will have 

the highest alignment score, we can extract the correct gene structures simply by choosing 

the highest scoring path. As this is a directed acyclic graph, the highest scoring path can be 

computed via a topological sort: 

getGeneModelScores(CEMGraph G(V,E))
    OrderedNodeList L = TopologicalSort(G)
    for each v in L
        Initialize(Score(v))
        for each incoming edge e = (x,v)
            if (Score(v) < Score(x) + w(e))
                Score(v) = Score(x) + w(e)
                predecessor(v) = x
end


(An ordering φ of the nodes of an acyclic graph is a topological sorting if for any edge (v, w) we 

have φ(v) < φ(w).) 
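The pseudocode above might be realized as in the following sketch. It assumes that the graph is given as a list of node identifiers and a list of weighted edges (x, v, w) whose endpoints all occur in the node list, that every node score is initialized to 0 (so that a path may begin at any node, as needed for partial gene structures), and it uses Kahn's algorithm for the topological sort. None of these details are fixed by the text.

```python
from collections import defaultdict, deque


def gene_model_scores(nodes, edges):
    """Return (score, predecessor) dictionaries for a weighted DAG."""
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for x, v, w in edges:
        incoming[v].append((x, w))
        outgoing[x].append(v)
        indegree[v] += 1

    # Topological sort (Kahn's algorithm).
    queue = deque(v for v in indegree if indegree[v] == 0)
    order = []
    while queue:
        x = queue.popleft()
        order.append(x)
        for v in outgoing[x]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")

    score, predecessor = {}, {}
    for v in order:
        score[v] = 0.0          # assumption: a path may start at any node
        predecessor[v] = None
        for x, w in incoming[v]:
            if score[v] < score[x] + w:
                score[v] = score[x] + w
                predecessor[v] = x
    return score, predecessor


def best_path(score, predecessor):
    """Follow predecessors back from the highest scoring node."""
    v = max(score, key=score.get)
    path = []
    while v is not None:
        path.append(v)
        v = predecessor[v]
    return path[::-1]
```

Because every node is processed after all of its predecessors, each edge is examined exactly once.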

For an arbitrary node u, score(u) is the best score of an alignment of two sequence prefixes 

S[1..u i ], and T[1..u j ], allowing for frame-shifts, amino-acid indels and intron penalties. 

Once the scores on the nodes are computed the gene models are built by starting at the node 

with the highest score, and following the predecessors. The coordinates of start, terminal, 

donor and acceptor nodes reveal the gene structure in the two genomic sequences. As the 

boundaries of the path are not limited to start and terminal nodes, partial gene structures can 

be predicted. 

7.33 Multiple genes 

Additionally, we add an edge from every stop node v with coordinates (v S , v T ) to every downstream 

start node w (with coordinates w S > v S and w T > v T ). Such edges are given weight 0. 

The role of such edges is to allow prediction of multiple genes. 

HSPs with negative frames in one or both sequences are possible witnesses for exons in the 

reverse strand of one or both sequences. The CEP graphs derived from such HSPs are simply 

added to the CEM graph. 

To enable a prediction of genes in both strands simultaneously, appropriate additional edges 

must be inserted between the start and stop nodes of the CEPs. For example, a stop node 

obtained from an HSP with +/+ frame is connected to all downstream stop nodes that have 

frame −/−. 

7.34 Summary of CEM algorithm 

1. Determine a list of candidate exons for S, and T. 

2. For every HSP h, determine the range of possible exons and their possible splice sites. 

3. Sort HSPs lexicographically according to their ranges. 

4. For each HSP h, build the corresponding CEP graph. 

5. Compute the CEM graph by joining all CEP graphs. 

6. Compute the gene model scores. 

7. Determine the highest scoring path through the CEM graph. 

8. Extract the corresponding gene model. 

7.35 Performance of CEM 

Here is a comparison of the performance of CEM and Genscan on a test data set of 60 pairs of genes from human and mouse:

          Number of   Exon    Exon    Nucl.   Nucl.
          sequences   sens.   spec.   sens.   spec.
CEM       120         0.76    0.80    0.94    0.95
Genscan   120         0.74    0.78    0.92    0.94

The gain in performance obtained is not spectacular. However, it provides a proof of concept 

and additional work may well lead to a useful tool for comparative gene finding, especially for 

genomes for which little is known of the statistical properties of the contained genes. 

7.36 Homology method: Procrustes 

Any newly sequenced gene has a good chance of having an already known relative, and progress in large-scale sequencing projects is rapidly increasing the number of known genes and protein sequences.

Hence, homology-based gene prediction methods are becoming more and more useful. In particular, 

such a method may be able to detect exons that are missed by statistical methods 

because they are small, or statistically unusual. 

Procrustes is a popular program that uses homology to predict genes and is based on the 

following 

Idea: Given a genomic sequence G and a target protein T, determine a chain Γ of blocks in G that has the highest spliced-alignment score with the target T. These blocks are interpreted as exons and the chain Γ is the predicted gene structure.

7.37 Example 

Given the genome G = baabaablacksheephaveyouanywool and the target protein T = 

barbarasleepsonwool, find the best spliced alignment of T to G and thus obtain a gene 

prediction in G: 

Genome sequence: 

baa baa black sheep have you any wool 

Assume that these are the possible blocks: 




Best spliced alignment: 

barbara sleeps on wool 

Resulting gene structure prediction: 

baa baa sheep any wool 

There are many possible chainings of blocks in the given example:


However, we choose the one that yields the best alignment to the given target sequence. In general, a number of possible target sequences will be given and then we choose the one that gives rise to the best alignment.

7.38 Preprocessing: determining the blocks 

Given a genomic sequence G, the first computational step is to determine the set B of all candidate blocks for G, which should contain all true exons. Naively, this is done by selecting all blocks between potential acceptor and donor sites, which are detected using e.g. a WMM:

acacacAG aggtaAG taggagctcagttacactgcatcagcatg GTatcacttacgacacGTcacacgt 

block 1 

block 2 

block 3 

block 4 

Clearly, this set of blocks will contain many false exons. Statistical methods may be used in an 

attempt to remove blocks that are obviously not true exons. 
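A minimal sketch of the naive block enumeration follows. Instead of a WMM it simply treats every AG dinucleotide as a potential acceptor site and every GT dinucleotide as a potential donor site, which is a deliberate simplification; the function name and the 0-based half-open coordinates are choices made here.

```python
def candidate_blocks(genome, min_len=1):
    """Return all (start, end) pairs such that genome[start:end] is a candidate block.

    A block starts right after an AG (acceptor) and ends right before a GT (donor).
    """
    g = genome.upper()
    # Position just after each AG dinucleotide.
    acceptors = [i + 2 for i in range(len(g) - 1) if g[i:i + 2] == "AG"]
    # Position of each GT dinucleotide.
    donors = [i for i in range(len(g) - 1) if g[i:i + 2] == "GT"]
    return [(s, e) for s in acceptors for e in donors if e - s >= min_len]
```

With a acceptor sites and d donor sites this produces up to a·d blocks, which is why a statistical filter is usually applied afterwards.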

Any chain of blocks corresponds to a gene prediction and the number of such chains can be 

huge. Dynamic programming is used to obtain an algorithm that runs in polynomial time. 

7.39 The spliced alignment problem 

Let $G = g_1 \ldots g_n$ be a string of letters, and let $B = g_i \ldots g_j$ and $B' = g_{i'} \ldots g_{j'}$ be substrings of G. We write $B \prec B'$ if B ends before B′ starts, i.e. $j < i'$. A sequence $\Gamma = (B_1, \ldots, B_b)$ of substrings of G is a chain if $B_1 \prec B_2 \prec \ldots \prec B_b$. We denote the concatenation of the strings in Γ by $\Gamma^* = B_1 * B_2 * \ldots * B_b$.

For two strings G and T, we set s(G, T) to the score of an optimal alignment between G and T.

Spliced Alignment Problem (SAP) Let $G = g_1 \ldots g_n$ be a genomic sequence, $T = t_1 \ldots t_m$ a target sequence and $\mathcal{B} = \{B_1, \ldots, B_b\}$ a set of blocks in G. Given G, T and $\mathcal{B}$, the Spliced Alignment Problem is to find a chain Γ of strings from $\mathcal{B}$ such that the score $s(\Gamma^*, T)$ is maximum among all chains of blocks from $\mathcal{B}$.

7.40 Solving the spliced alignment problem 

The SAP can be reduced to the search of a path in some (unweighted) graph. Vertices of this 

graph correspond to the blocks, arcs correspond to potential transitions between blocks, and 

the path weight is defined as the weight of the optimal alignment between the concatenated 

blocks of this path and the target sequence.


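For simplicity, we will consider sequence alignment with linear gap penalties and define the $\Delta_{match}$, $\Delta_{mismatch}$ and $\Delta_{indel}$ scores as usual. We set

\[ \Delta(x, y) = \begin{cases} \Delta_{match} & \text{if } x = y, \text{ and} \\ \Delta_{mismatch} & \text{else.} \end{cases} \]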

7.41 The score of a prefix alignment 

For a block $B_k = g_m \ldots g_l$ in G, define first(k) = m, last(k) = l and size(k) = l − m + 1. Let $B_k(i)$ denote the i-prefix $g_m \ldots g_i$ of $B_k$, if m ≤ i ≤ l.

Given a position i, let $\Gamma = (B_1, \ldots, B_k, \ldots, B_t)$ be a chain such that some block $B_k$ contains i. We define

\[ \Gamma^*(i) = B_1 * B_2 * \ldots * B_k(i) \]

as the concatenation of $B_1, \ldots, B_{k-1}$ and the i-prefix of $B_k$.

Then

\[ S(i, j, k) = \max_{\text{all chains } \Gamma \text{ containing block } B_k} s(\Gamma^*(i), T(j)) \]

is the optimal score for aligning a chain of blocks up to position i in G to the j-prefix T(j) of T. As we will see, the values of this matrix are computed using dynamic programming.

7.42 The dynamic program 

Let B(i) = {k | last(k) < i} be the set of all blocks that end (strictly) before position i in G. 

The following recurrence computes S(i, j, k) for 1 ≤ i ≤ n, 1 ≤ j ≤ m and 1 ≤ k ≤ b: 

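\[ S(i, j, k) = \max \begin{cases}
S(i-1, j-1, k) + \Delta(g_i, t_j), & \text{if } i \ne \mathrm{first}(k) \\
S(i-1, j, k) + \Delta_{indel}, & \text{if } i \ne \mathrm{first}(k) \\
\max_{l \in B(\mathrm{first}(k))} S(\mathrm{last}(l), j-1, l) + \Delta(g_i, t_j), & \text{if } i = \mathrm{first}(k) \\
\max_{l \in B(\mathrm{first}(k))} S(\mathrm{last}(l), j, l) + \Delta_{indel}, & \text{if } i = \mathrm{first}(k) \\
S(i, j-1, k) + \Delta_{indel}.
\end{cases} \]

The score of the optimal spliced alignment can be found as:

\[ \max_k S(\mathrm{last}(k), m, k). \]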

Note that S(i, j, k) is only defined if i ∈ B k and therefore only a portion of entries in the 

three-dimensional n × m × b matrix S needs to be computed. 

The total number of such entries is:

\[ m \sum_{k=1}^{b} \mathrm{size}(k) = nmc, \]

where $c = \frac{1}{n} \sum_{k=1}^{b} \mathrm{size}(k)$ is the coverage of the genomic sequence by blocks.

Hence, a naive implementation of the recurrence runs in $O(mnc + mb^2)$ time.

(Recall that n = |G|, m = |T | and b is the number of blocks.) 
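The recurrence can be turned into a short program. The following sketch is one possible reading, not the Procrustes implementation: blocks are given as 1-based inclusive pairs (first, last), the scores Δ_match = 1, Δ_mismatch = −1 and Δ_indel = −1 are arbitrary defaults, at least one block is assumed, and the boundary case of a chain's first block is handled by an "empty chain" whose alignment to T(j) scores j·Δ_indel (an assumption, since the text does not state the initialization).

```python
def spliced_alignment_score(G, T, blocks, match=1, mismatch=-1, indel=-1):
    """Score of an optimal spliced alignment of T to a chain of blocks of G.

    blocks: list of (first, last), 1-based inclusive positions in G.
    """
    m = len(T)

    def delta(x, y):
        return match if x == y else mismatch

    # Aligning the empty chain to the j-prefix of T costs j indels.
    empty = [j * indel for j in range(m + 1)]
    # final_row[k][j] = S(last(k), j, k)
    final_row = {}
    # A block only depends on blocks that end earlier, so process by last(k).
    for k in sorted(range(len(blocks)), key=lambda k: blocks[k][1]):
        first, last = blocks[k]
        # pred[j]: best score of a chain ending strictly before first(k).
        pred = empty[:]
        for l, row in final_row.items():
            if blocks[l][1] < first:
                pred = [max(a, b) for a, b in zip(pred, row)]
        prev = pred  # plays the role of "row i - 1" when i = first(k)
        for i in range(first, last + 1):
            g = G[i - 1]
            cur = [prev[0] + indel] + [0] * m
            for j in range(1, m + 1):
                cur[j] = max(prev[j - 1] + delta(g, T[j - 1]),
                             prev[j] + indel,
                             cur[j - 1] + indel)
            prev = cur
        final_row[k] = prev
    return max(row[m] for row in final_row.values())
```

Only the last row S(last(k), ·, k) of each block is kept, since that is all that later blocks need. The loop over all earlier blocks for every block is the source of the mb² term in the running time.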

7.43 Example 

Consider the following string G with all possible blocks indicated by boxes: 

The recurrence corresponds to the following graph: 

The target sequence is: 

’T WAS BRILLIG, AND THE SLITHE TOVES DID GYRE AND GIMBLE IN THE WABE 

The four highlighted chains in the above graph correspond to the following spliced alignments 

of G and T: 

7.44 Speed up 

The time and space requirements of the algorithm can be reduced significantly. Here we only 

discuss one such improvement. 

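Define $P(i, j) = \max_{l \in B(i)} S(\mathrm{last}(l), j, l)$. The recurrence can be rewritten as follows:

\[ S(i, j, k) = \max \begin{cases}
S(i-1, j-1, k) + \Delta(g_i, t_j), & \text{if } i \ne \mathrm{first}(k) \\
S(i-1, j, k) + \Delta_{indel}, & \text{if } i \ne \mathrm{first}(k) \\
P(\mathrm{first}(k), j-1) + \Delta(g_i, t_j), & \text{if } i = \mathrm{first}(k) \\
P(\mathrm{first}(k), j) + \Delta_{indel}, & \text{if } i = \mathrm{first}(k) \\
S(i, j-1, k) + \Delta_{indel},
\end{cases} \]

where

\[ P(i, j) = \max \begin{cases}
P(i-1, j) \\
\max_{k:\, \mathrm{last}(k) = i-1} S(i-1, j, k).
\end{cases} \]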

With this modification, we maintain and update the maximal score for all preceding blocks 

explicitly and thus do not reconsider all preceding blocks in each evaluation of the recurrence.


This reduces the run time of the algorithm to O(mnc + mb). 
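A computation with the running maximum P might look as follows. This is a sketch, not the Procrustes implementation: blocks are assumed to be 1-based inclusive pairs (first, last), the default scores are arbitrary, an "empty chain" scoring j·Δ_indel against T(j) serves as the boundary case, and the sweep over the positions of G with a dictionary of active rows is an implementation choice made here.

```python
def spliced_alignment_score_fast(G, T, blocks, match=1, mismatch=-1, indel=-1):
    """Score of an optimal spliced alignment, using the running maximum P(i, j).

    blocks: list of (first, last), 1-based inclusive positions in G.
    """
    n, m = len(G), len(T)
    starts = {}
    for k, (first, _last) in enumerate(blocks):
        starts.setdefault(first, []).append(k)

    # P[j] = P(i, j); initially only the empty chain precedes position 1.
    P = [j * indel for j in range(m + 1)]
    active = {}            # block k -> row S(i - 1, ., k)
    best = float("-inf")

    for i in range(1, n + 1):
        # Fold the blocks that ended at i - 1 into P(i, .).
        for k in [k for k in active if blocks[k][1] == i - 1]:
            row = active.pop(k)
            P = [max(a, b) for a, b in zip(P, row)]
            best = max(best, row[m])
        # A block starting at i uses P(i, .) as its "previous row".
        for k in starts.get(i, []):
            active[k] = P
        # Advance every active block by one genomic position.
        g = G[i - 1]
        for k in list(active):
            prev = active[k]
            cur = [prev[0] + indel] + [0] * m
            for j in range(1, m + 1):
                d = match if g == T[j - 1] else mismatch
                cur[j] = max(prev[j - 1] + d, prev[j] + indel, cur[j - 1] + indel)
            active[k] = cur

    # Blocks that end at the last position n are still active.
    for row in active.values():
        best = max(best, row[m])
    return best
```

Each block is folded into P exactly once, which corresponds to the mb term, while the row updates account for the mnc term.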

The corresponding network that indicates which computations are performed looks like this: 

7.45 Evaluation of the method 

The authors of Procrustes evaluated the performance of the program on a test sample of 

human genes with known mammalian relatives. In their study, the average correlation between 

the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of the 

genes. 

They also reported that the algorithm predicts human genes reasonably well when the homologous 

protein is non-vertebrate or even prokaryotic. 

Additionally, predictions were made using simulated targets that gradually diverged from the 

analyzed gene. For targets up to 100 PAM distance, the predictions were almost 100% correct. 

(This distance roughly corresponds to 40% similarity). 

This indicates that for an average protein family the method is likely to correctly predict a 

human gene given a mammalian relative. 

8.2 BLAST and BLAT 

The popular BLAST program (or family of programs) was first introduced in: S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman. Basic local alignment search tool, J. Molecular Biology, 215:403-410 (1990).

Recall (from ABI I) that BLAST is an alignment heuristic that computes local alignments 

between a query and a database sequence, using a seed-and-extend approach: 

Given three parameters, i.e. a word size K, a word similarity threshold T and a minimum 

match score S, BLAST operates in three steps: 

1. The list of all words of length K that have similarity ≥ T to some word in the query 

sequence is generated. 

2. The database sequence is scanned for all hits of words in the list. 

3. Each hit is extended until its score falls a certain distance below the best score found for 

shorter extensions and then all best extensions are reported that have score ≥ S. 
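Step 3 can be illustrated by a simple ungapped extension to the right. The sketch below is not BLAST's actual code: the match and mismatch scores, the drop-off parameter `x_drop` and the function name are assumptions chosen for illustration.

```python
def extend_right(query, database, q_pos, d_pos, match=1, mismatch=-1, x_drop=3):
    """Extend a hit to the right without gaps.

    Returns (best_score, best_length): the best score seen and the length of
    the extension that attains it. The extension stops as soon as the running
    score falls more than x_drop below the best score seen so far.
    """
    score = best = 0
    length = best_length = 0
    while q_pos + length < len(query) and d_pos + length < len(database):
        same = query[q_pos + length] == database[d_pos + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break
    return best, best_length
```

An extension to the left works symmetrically, and the extension would be reported only if its total score reaches the threshold S.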

8.3 BLAT- BLAST-Like Alignment Tool 

The following is based on: W. James Kent. BLAT- the BLAST-like alignment tool, Genome 

Research 12, 2002.


Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. 

As the amount of data increases, faster tools are required for comparing sequences. 

A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for 

mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically 

used when comparing vertebrate sequences. 

BLAT’s speed stems from an index of all non-overlapping K-mers in the genome. The program 

has several stages: It uses the index to find regions in the genome that are possibly homologous 

to the query sequence. It performs an alignment between such regions. It stitches together the 

aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits 

small internal exons and adjusts large gap boundaries that have canonical splice sites where 

feasible. 

8.4 Mapping ESTs and Mouse reads 

In the public assembly of the human genome, the problem arises to map 3 million ESTs to the 

human genome. Additionally, 13 million (and continually more) whole genome shotgun reads 

need to be aligned to the human genome. 

The human EST alignments compared 1.75 Gb in 3.72 million ESTs against 2.88 Gb bases of 

Human DNA and took 200 hours on a farm of 90 Linux boxes. 

BLAT was used to align 2.5× coverage unassembled mouse reads to the masked human genome. This involved 7.51 Gb in 13.3 million reads and took 16,300 CPU hours.

As work continues to finish the human genome, these computations need to be repeated on a 

monthly or bi-monthly basis. Hence, comparison tools are needed that do the job in a couple 

of weeks. 

This was the motivation for the development of BLAT. 

8.5 BLAT vs BLAST 

BLAT is similar to BLAST: The program rapidly scans for relatively short matches (hits) and 

extends these into HSPs. However BLAT differs from BLAST in some important ways: 

• BLAST builds an index of the query string and then scans linearly through the database 

– BLAT builds an index of the database and then scans linearly through the query, 

• BLAST triggers an extension when one or two hits occur – BLAT can trigger extensions 

on any given number of perfect or near perfect matches, 

• BLAST returns each area of homology as separate alignments – BLAT stitches them 

together into larger alignments, 

• BLAST delivers a list of exons sorted by size, with alignments extending slightly beyond 

the edge of each exon – BLAT “unsplices” mRNA onto the genome, giving a single 

alignment that uses each base of the mRNA only once, with correctly positioned splice 

sites.


8.6 Seed-and-extend 

Like all fast alignment programs, BLAT uses the two stage seed-and-extend approach: 

• in the seed stage, the program detects regions of the two sequences that are likely to be 

homologous, and 

• in the extend stage, these regions are examined in detail and alignments are produced for the regions that are indeed homologous according to some criterion.

BLAT provides three different methods for the seed stage: 

• Single perfect K-mer matches, 

• Multiple perfect K-mer matches, and 

• Single near-perfect K-mer matches. 

Given a long database sequence and a short query sequence, we will discuss the different seed 

strategies. 

The simplest seed method is to look for subsequences of a given size K that are shared by the 

query and the database. In many applications, every K-mer in the query sequence is compared 

with all non-overlapping K-mers in the database sequence. 
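The simplest seed method can be sketched as follows: index the non-overlapping K-mers of the database, then look up every (overlapping) K-mer of the query. The function names and the dictionary-based index are assumptions made for illustration and say nothing about BLAT's internal data structures.

```python
from collections import defaultdict


def build_index(database, K):
    """Map each non-overlapping K-mer of the database to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - K + 1, K):
        index[database[pos:pos + K]].append(pos)
    return index


def find_hits(query, index, K):
    """Look up every overlapping K-mer of the query; return (db_pos, q_pos) hits."""
    hits = []
    for q_pos in range(len(query) - K + 1):
        for db_pos in index.get(query[q_pos:q_pos + K], []):
            hits.append((db_pos, q_pos))
    return hits
```

Indexing only every K-th position keeps the index roughly K times smaller than an index of all overlapping K-mers, while the overlapping lookup on the query side ensures that a shared K-mer is found regardless of its phase.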

We want to analyze: 

1. how many homologous regions are missed, and 

2. how many non-homologous regions are passed to the extension stage, using this criterion.

Errors of type (1) will cause the application to miss true homologs, whereas errors of type (2) 

will increase the running time of the application. 

8.7 Some definitions 

K: The K-mer size, 8−16 for nucleotides and 3−7 for amino acids.

M: Match ratio between homologous areas, ≈ 98% for cDNA/genomic alignments within the same species, ≈ 89% for protein alignments between human and mouse.

H: The size of a homologous area. For a human exon this is typically 50−200 bp.

G: Database size, e.g. 3 Gb for human.

Q: Query size.

A: Alphabet size, 20 for amino acids, 4 for nucleotides.

[Figure: matches between a query sequence (e.g. cDNA) and a database sequence (e.g. genome).]


8.8 Single perfect matches 

Assuming that each letter is independent of the previous letter, the probability that a specific 

K-mer in a homologous region of the database matches perfectly the corresponding K-mer in 

the query is: 

\[ p_1 = M^K. \]

Let $T = \lfloor H/K \rfloor$ denote the number of non-overlapping K-mers in a homologous region of length H.

The probability that at least one non-overlapping K-mer in the homologous region matches perfectly with the corresponding K-mer in the query is:

\[ P = 1 - (1 - p_1)^T = 1 - (1 - M^K)^T. \]

The number of non-overlapping K-mers that are expected to match by chance, assuming all letters are equally likely, is:

\[ F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K. \]

These formulas can be used to predict the sensitivity and specificity of single perfect nucleotide 

K-mer matches as a seed-search criterion: 

These formulas can be used to predict the sensitivity and specificity of single perfect amino 

acid K-mer matches as a seed-search criterion: 

Examples 

1. For EST alignments, we would like to find seeds for 99% of all homologous regions that 

have 5% or less sequencing noise. Table 3 indicates that K = 14 or less will work. For


K = 14, we can expect that 399 random hits per query will be produced. A smaller value 

of K will produce significantly more random hits. 

2. The mouse and human genomes average 89% identity at the amino acid level. To find 

true seeds for 99% of all translated mouse reads requires K = 5 or less. For K = 5, each 

read will generate ≈ 62625 random hits, see Table 4. 

3. Comparing mouse and human at a nucleotide level, where there is only 86% identity is 

not feasible: Table 3 implies that K = 7 must be used to find 99% of all true hits, but 

this value generates ≈ 13 million random hits per query. 
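The formulas of this section are easy to evaluate numerically. In the following sketch the function name and the parameter values in the usage note are illustrative choices; since the values of H and Q behind the quoted tables are not stated here, the results need not coincide exactly with the numbers in the examples.

```python
def single_perfect(M, K, H, Q, G, A):
    """Return (P, F) for the single perfect K-mer match criterion.

    P: probability that at least one non-overlapping K-mer of a homologous
       region of length H matches the query perfectly.
    F: expected number of K-mer matches that occur by chance.
    """
    T = H // K                      # non-overlapping K-mers in the region
    p1 = M ** K                     # one specific K-mer matches perfectly
    P = 1 - (1 - p1) ** T
    F = (Q - K + 1) * (G / K) * (1 / A) ** K
    return P, F
```

For instance, `single_perfect(M=0.95, K=14, H=100, Q=500, G=3e9, A=4)` evaluates the two formulas for a 500-letter nucleotide query against a 3 Gb database.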

8.9 Single near-perfect matches 

Now consider the case of near-perfect matches, that is, hits with one letter mismatch. The 

probability that a non-overlapping K-mer in a homologous region of the database matches 

near-perfectly the corresponding K-mer in the query is: 

\[ p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K. \]

Again, the probability that any non-overlapping K-mer in the homologous region matches near-perfectly with the corresponding K-mer in the query is:

\[ P = 1 - (1 - p_1)^T. \]

The number of K-mers which match near-perfectly by chance is:

\[ F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left( K \cdot \left(\frac{1}{A}\right)^{K-1} \cdot \left(1 - \frac{1}{A}\right) + \left(\frac{1}{A}\right)^K \right). \]

These formulas can be used to predict the sensitivity and specificity of single near-perfect 

nucleotide K-mer matches as a seed-search criterion: 

These formulas can be used to predict the sensitivity and specificity of single near-perfect amino 

acid K-mer matches as a seed-search criterion:


Examples 

1. For the purposes of EST alignments, a K of 22 or less produces true seeds for 99% of all queries, while on average producing only one random hit, see Table 5.

2. For comparison of translated mouse reads and the human genome, Table 6 indicates that 

K = 8 would detect true seeds for 99% of all mouse reads, while only generating 374 

random hits per read. 

3. A comparison of mouse reads and the human genome (86% identity) on the nucleotide level 

would require K = 13 or K = 12 to detect true seeds for 99% of the reads, while generating 

275671 random hits per read. Using a fast extension program, this computation is feasible. 
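The near-perfect formulas of this section can be evaluated directly. The function name and the argument order in this sketch are assumptions made here; the body only restates the formulas for p_1, P and F.

```python
def single_near_perfect(M, K, H, Q, G, A):
    """Return (P, F) for the single near-perfect (at most one mismatch) criterion."""
    T = H // K
    # Exactly one mismatch in any of the K positions, or no mismatch at all.
    p1 = K * M ** (K - 1) * (1 - M) + M ** K
    P = 1 - (1 - p1) ** T
    a = 1 / A
    F = (Q - K + 1) * (G / K) * (K * a ** (K - 1) * (1 - a) + a ** K)
    return P, F
```

Allowing one mismatch raises p_1, so a larger K can be used at the same sensitivity, at the price of more lookups per K-mer.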

BLAT implements near-perfect matches allowing one mismatch in a hit, as follows: 

A non-overlapping index of all K-mers in the database is generated. 

For every K-mer in the query sequence, every possible K-mer that matches it in all but one, or in all, positions is looked up. This means K · (A − 1) + 1 lookups per K-mer. For an amino-acid search with K = 8, for example, 153 lookups are required per occurring K-mer.
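The lookup count can be checked by enumerating the words explicitly. The helper below is a hypothetical illustration, not BLAT code:

```python
def near_perfect_neighbors(kmer, alphabet):
    """All words that agree with kmer in all, or in all but one, positions."""
    neighbors = [kmer]
    for pos, original in enumerate(kmer):
        for letter in alphabet:
            if letter != original:
                # Substitute a different letter at exactly one position.
                neighbors.append(kmer[:pos] + letter + kmer[pos + 1:])
    return neighbors
```

Each of the K positions contributes A − 1 substitutions, plus one entry for the unchanged word, which gives the K · (A − 1) + 1 lookups.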

For a given level of sensitivity, the near-perfect match criterion runs 15× more slowly than the 

multiple-perfect match criterion and thus is not so useful in practice. 

8.10 Multiple perfect matches 

An alternative seeding strategy is to require multiple perfect matches that are constrained to 

be near each other. 

For example, consider a situation where there are two hits between the query and the database 

sequences that “lie on the same diagonal” and are close to each other (within some given 

distance W), such as a and b here: 

[Figure: hits a, b, c and d between a query sequence (e.g. cDNA) and a database sequence (e.g. genome); a and b lie on the same diagonal within distance W.]


For N = 1, the probability that a non-overlapping K-mer in a homologous region of the 

database matches perfectly the corresponding K-mer in the query is (as discussed above): 

\[ p_1 = M^K. \]

The probability that there are exactly n matches within the homologous region is

\[ P_n = p_1^n \cdot (1 - p_1)^{T-n} \cdot \frac{T!}{n! \cdot (T-n)!}, \]

and the probability that there are N or more matches is the sum:

\[ P = P_N + P_{N+1} + \ldots + P_T. \]

Again, we are interested in the number of matches generated by chance. The expected number of such chains generated for N = 1 is simply:

\[ F_1 = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K. \]

The probability of a second match occurring within W letters after the first is

\[ S = 1 - \left(1 - \left(\frac{1}{A}\right)^K\right)^{W/K}, \]

because the second match can occur within any of the W/K non-overlapping K-mers in the database within W letters after the first match.

The number of size N chains of K-mers in which any two consecutive hits are not more than W apart is

\[ F_N = F_1 \cdot S^{N-1}. \]
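As a sketch, the quantities P and F_N for the multiple perfect match criterion can be computed as follows. The function name and argument order are assumptions made here; `math.comb` supplies the binomial coefficient T!/(n!(T−n)!).

```python
from math import comb


def multiple_perfect(M, K, H, Q, G, A, N, W):
    """Return (P, F_N) for requiring N perfect K-mer matches within distance W."""
    T = H // K
    p1 = M ** K
    # Probability of N or more perfect matches among the T non-overlapping K-mers.
    P = sum(comb(T, n) * p1 ** n * (1 - p1) ** (T - n) for n in range(N, T + 1))
    # Expected number of single chance matches.
    F1 = (Q - K + 1) * (G / K) * (1 / A) ** K
    # Probability that a further chance match follows within W letters.
    S = 1 - (1 - (1 / A) ** K) ** (W / K)
    return P, F1 * S ** (N - 1)
```

For N = 1 the result reduces to the single perfect match case, since the binomial sum equals 1 − (1 − p_1)^T and S^0 = 1.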

These formulas can be used to predict the sensitivity and specificity of multiple nucleotide (2 

and 3) perfect K-mer matches as a seed-search criterion: 

These formulas can be used to predict the sensitivity and specificity of multiple amino acid (2 

and 3) perfect K-mer matches as a seed-search criterion:


8.11 Clumping hits 

BLAT builds a non-overlapping index of all K-mers in the database, ignoring those K-mers 

that occur too often in the database, those containing ambiguity codes and optionally, those in 

lower case (“soft screened regions”). 

BLAT then looks up each overlapping K-mer of the query sequence in the index, obtaining a 

list L of hits. Each hit consists of a database position and a query position. 

The next step is to form clumps of hits that represent regions in the database sequence that are 

homologous to the query sequence. Each such clump consists of a number of hits (that exceeds 

a given minimum number of hits) that form a chain in which two consecutive hits are not too 

far apart from each other and also in which the gap size in either sequence does not exceed a 

given threshold. 

Multiple hits are clumped together as follows: 

• The hit list L is sorted by database coordinate. 

• The list L is split into buckets of size 64 kb each, based on the database coordinate. 

• Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database 

position minus query position. 

• Hits that are within the gap limit are grouped together into proto-clumps. 

• Hits within proto-clumps are then sorted by their database coordinate and put into real 

clumps, if they are within the window limit on the database coordinate. 

• Clumps within 300 bp or 100 amino acids of each other in the database are merged and 

then 500 bp are added to each end of a clump. 
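A simplified sketch of the clumping steps is given below. It omits the 64 kb bucketing, the final merging of nearby clumps and the padding of clump ends, and the parameter names `gap_limit`, `window_limit` and `min_hits` are assumptions standing in for the thresholds mentioned in the text.

```python
def clump_hits(hits, gap_limit, window_limit, min_hits):
    """Group (db_pos, q_pos) hits into clumps; returns a list of hit lists."""
    def diagonal(hit):
        return hit[0] - hit[1]

    # Sort along the diagonal and cut wherever the diagonal jumps too far.
    proto_clumps = []
    for hit in sorted(hits, key=lambda h: (diagonal(h), h[0])):
        if proto_clumps and diagonal(hit) - diagonal(proto_clumps[-1][-1]) <= gap_limit:
            proto_clumps[-1].append(hit)
        else:
            proto_clumps.append([hit])

    # Within each proto-clump, sort by database coordinate and cut wherever
    # two consecutive hits are further apart than the window limit.
    clumps = []
    for group in proto_clumps:
        group.sort()
        current = [group[0]]
        for hit in group[1:]:
            if hit[0] - current[-1][0] <= window_limit:
                current.append(hit)
            else:
                if len(current) >= min_hits:
                    clumps.append(current)
                current = [hit]
        if len(current) >= min_hits:
            clumps.append(current)
    return clumps
```

Isolated chance hits end up in groups with fewer than `min_hits` members and are dropped, which is how the multiple-match criterion suppresses random seeds.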

[Figures: a list of six hits between the query sequence and the database sequence, shown as found, sorted by database coordinate, and sorted along the diagonal.]

8.12 Nucleotide alignments 

Clumping is the first part of the extension stage. In the case of nucleotide alignments, each 

clump is then processed as follows. 

• A hit list is generated between the query sequence q and the homologous region h in the database, 

looking for smaller, perfect K-mers. 

• If a K-mer w in q matches multiple K-mers in h, then w is repeatedly extended by one until 

the match is unique or exceeds a certain size. 

• The hits are extended as far as possible, without mismatches. 

• Overlapping hits are merged. 

• If there are gaps in the alignment in both the query and the database, then the algorithm 

recurses to fill in the gaps, using a smaller K. 

• Then extensions using indels followed by matches are considered. 

• Large gaps in the query sequence often correspond to introns and they are slid around to find 

the best GT/AG consensus sequence for the intron ends. 

8.13 Protein alignments 

In the case of amino acid sequences, each clump is processed as follows: 

• All hits obtained in the seed stage are extended into maximally scoring ungapped alignments (HSPs) using a score function where a match is worth 2 and a mismatch costs 1.

• A graph is built with HSPs as nodes.

• If HSP A starts before HSP B in both sequences, then an edge is put from A to B that is 

weighted by the score of B minus a gap penalty based on the distances between A and B. 

• If A and B overlap, then an optimal crossover position x is determined that maximizes the sum 

of score of A up to x and B starting from x and the edge weight is set accordingly. 

• A dynamic program then extracts the maximal scoring alignment by traversing the graph. 

• The HSPs contained in the path are removed and if any HSPs are left then the dynamic program 

is run again.


8.14 Mouse/Human alignment choices 

The similarity between the human and mouse genomes is 86% on the nucleotide level and 89% on the amino-acid level (for coding regions). The following table compares DNA vs amino acid alignments, and different seeding strategies:

9 Phylogenetic Networks 

Real evolutionary data often contain a number of different and sometimes conflicting phylogenetic signals, and thus do not always clearly support a unique tree. To address this problem, Hans-Jürgen Bandelt and Andreas Dress developed the method of split decomposition.

For ideal data, this method gives rise to a tree, whereas less ideal data are represented by a 

tree-like network that may indicate evidence for different and conflicting phylogenies. 

The following lectures are based on: 

Hans-Jürgen Bandelt and Andreas W. M. Dress. A canonical decomposition theory for metrics 

on a finite set, Advances in Mathematics, 92(1):47-105 (1992) 

Daniel H. Huson, SplitsTree: analyzing and visualizing evolutionary data, Bioinformatics, 14(1):68-73 (1998).

9.1 Trees vs networks 

Here is (a) the unrooted neighbor-joining tree for 16S rRNA sequences (1355 bp) from ten 

species of Neisseria and (b) a splits graph computed from the same distance matrix: 

[Figure: (a) the neighbor-joining tree and (b) the splits graph.]


(See: Eddie C. Holmes. Genomics, phylogenetics and epidemiology, Microbiology Today, 26:162-163 

(1999).) 

9.2 Phylogenetic trees 

Evolutionary relationships are usually represented by a phylogenetic tree T, i.e. a tree whose 

leaves (and perhaps some internal nodes, too) are all labeled by elements of a set X of taxa 

and whose internal nodes all have degree at least three. 

(If the tree is rooted, then the root node may have degree two, but we will only consider unrooted 

trees.) Often, the edges have weights corresponding to some notion of evolutionary distance 

between the taxa. 

Example: 

[Figure: unrooted phylogenetic tree with leaves t1, ..., t8.]

9.3 Trees and splits 

Any edge e of T defines a split S = {A, Ā} of X, that is, a partitioning of X into two non-empty 

sets A and Ā, consisting of all taxa on the one side and other side of e, respectively. 

For example: 

[Figure: the tree with leaves t1, ..., t8, in which the edge e separates t3, t4, t5 from the remaining leaves.]

Here, $A = \{t_3, t_4, t_5\}$ and $\bar{A} = \{t_1, t_2, t_6, t_7, t_8\}$. 

Let $\Sigma(T)$ denote the set of all splits obtained from T. 

Ideally, each edge of the tree separates a monophyletic group from the rest and this is reflected by the corresponding split. 

9.4 Compatible splits 

Given a set of taxa X. Let Σ be a set of splits of X. Two splits $S_1 = \{A_1, \bar{A}_1\}$ and $S_2 = \{A_2, \bar{A}_2\}$ are called compatible, if one of the four following intersections 

\[ A_1 \cap A_2, \quad A_1 \cap \bar{A}_2, \quad \bar{A}_1 \cap A_2, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2 \]

is empty. 

A set Σ of splits of X is called compatible, if every pair of splits in Σ is compatible. 

Example 

Given the taxa set X = {a, b, c, d, e}. The splits S 1 = {{a, b}, {c, d, e}}, S 2 = {{a, b, c}, {d, e}}


and S 3 = {{e}, {a, b, c, d}} are all compatible with each other. However, S 4 = {{a, c}, {b, d, e}} 

is not compatible with the first one. Hence, the set Σ = {S 1 , S 2 , S 3 } is compatible, but Σ ′ = 

{S 1 , S 2 , S 3 , S 4 } is not. 
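The example above can be checked mechanically. The following is a minimal sketch, assuming that a split is stored as a pair of frozensets; the helper names `make_split`, `compatible` and `is_compatible_system` are hypothetical and not part of the lecture.

```python
from itertools import combinations

def make_split(side, taxa):
    """Return the split {A, complement of A} of `taxa` as a pair of frozensets."""
    a = frozenset(side)
    return (a, frozenset(taxa) - a)

def compatible(s1, s2):
    """Two splits are compatible if one of the four intersections is empty."""
    return any(not (p & q) for p in s1 for q in s2)

def is_compatible_system(splits):
    """A set of splits is compatible if every pair of its splits is compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))

X = "abcde"
S1 = make_split("ab", X)
S2 = make_split("abc", X)
S3 = make_split("e", X)
S4 = make_split("ac", X)

print(is_compatible_system([S1, S2, S3]))      # True
print(is_compatible_system([S1, S2, S3, S4]))  # False, because S1 and S4 cross
```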

The compatibility condition states that any split S subdivides either the one side, or the other 

side, of any other split S ′ , but not both sides. Hence, any set of compatible splits can be drawn 

as follows, without crossing lines: 

[Figure: a set of compatible splits on the taxa t1, ..., t8, drawn without crossing lines.]

This figure also shows the relationship between compatible splits and a hierarchical clustering. 

9.5 Compatible splits and trees 

Any compatible set of splits Σ gives rise to a phylogenetic tree T, for example: 

[Figure: phylogenetic tree on the taxa t1, ..., t8.]

Note: For this always to work, we need to allow taxa to be positioned at internal nodes of the 

tree, not just at leaves! 

Vice versa, any tree T gives rise to a compatible set of splits. (Proof: Consider two edges e and e′. Because T is a tree, e and e′ are connected by a unique path P: 

\[ v \overset{e}{\longleftrightarrow} w \longleftrightarrow P \longleftrightarrow w' \overset{e'}{\longleftrightarrow} v' \]

Because T is a tree, the set A, consisting of all taxa that label nodes reachable from node v without using edge e, is disjoint from A′, the set of all taxa that label nodes reachable from node v′ without using edge e′. Hence $A \cap A' = \emptyset$, and the corresponding splits $S = \{A, \bar{A} = X \setminus A\}$ and $S' = \{A', \bar{A}' = X \setminus A'\}$ are compatible.) 

In summary: 

Theorem A set of splits Σ is compatible, iff there exists a phylogenetic tree T such that 

Σ = Σ(T).


9.6 Representing distances using trees 

Given a set X of taxa. For our purposes, a phylogenetic tree T is a tree such that: 

• all leaves, and perhaps some of the internal nodes, too, are (multi-)labeled by elements 

of X, such that each taxon appears exactly once, and 

• every edge e has a weight d e associated with it. 

Suppose we are given a set of taxa X and a distance matrix $\{d_{ab}\}$ (i.e., a dissimilarity function or pseudo-metric) describing “evolutionary distances” between the different taxa, obtained in some way (as described in ABI I). 

Any distance-based tree building method attempts to represent the given distance matrix d as well as possible using a phylogenetic tree T, i.e. for any two taxa a, b ∈ X we approximate $d_{ab} \approx \sum_{e \in P} d_e$, where P is the unique path of edges in T that connects the nodes with labels a and b. 

9.7 Main goal 

Given a distance matrix d, a tree building method such as neighbor-joining will compute a 

phylogenetic tree T for d, no matter how “untree-like” the distance matrix d may be. (Recall 

from ABI I that the four-point condition determines whether a given distance matrix d is 

additive or not, i.e. whether it has an exact representation by a phylogenetic tree, or not.) 

Our goal is to use more general graphs to represent distances, so-called splits graphs. As we 

will see, the graph will be a tree, whenever the given distances are tree-like (i.e., additive, or 

close to additive). 
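The four-point condition recalled above can be tested by brute force. The sketch below assumes that d is a metric given as a nested dictionary and uses the standard formulation that, for every four taxa, the two largest of the three pairwise sums must coincide; the function names and the two small matrices are assumptions made for illustration.

```python
from itertools import combinations

def symmetric(pairs, taxa):
    """Build a symmetric nested-dict distance matrix from a dict of pairs."""
    d = {x: {y: 0.0 for y in taxa} for x in taxa}
    for (x, y), value in pairs.items():
        d[x][y] = d[y][x] = value
    return d

def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: in every quartet, the two largest of the three
    sums d_ij + d_kl, d_ik + d_jl, d_il + d_jk are equal (up to `tol`)."""
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if sums[2] - sums[1] > tol:
            return False
    return True

taxa = ["a", "b", "c", "d"]

# Distances read off a hypothetical quartet tree ab|cd with pendant edge
# weights 1, 2, 3, 4 and an inner edge of weight 2.
tree_like = symmetric({("a", "b"): 3, ("a", "c"): 6, ("a", "d"): 7,
                       ("b", "c"): 7, ("b", "d"): 8, ("c", "d"): 7}, taxa)

# Distances along a 4-cycle a-b-c-d-a with unit edges: not tree-like.
cycle_like = symmetric({("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
                        ("a", "d"): 1, ("a", "c"): 2, ("b", "d"): 2}, taxa)

print(is_additive(tree_like, taxa))   # True
print(is_additive(cycle_like, taxa))  # False
```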

To achieve this goal, we proceed indirectly by discussing sets of splits and introducing the notion of weak compatibility. 

Just as a set of compatible splits can be represented by a phylogenetic tree, we will see that a 

weakly-compatible set of splits can be represented by a splits graph. 

9.8 Tree and splits 

Here is an example of a phylogenetic tree T: 

[Figure: phylogenetic tree on the taxa Gorilla, Pan_panisc, Homo_sap, Pongo_pygB, rabbit, guinea_pig, Bos_ta(cow), fin_whale, blue_whale, Mus_mouse, Rattus_norv, platypus, wallaroo and opossum, with two edges labeled e and f; scale bar 0.1.]

Each edge in T defines a split of the set of taxa X. For example, the edge labeled e separates 

rat and mouse from all other taxa, and the edge f separates cow, fin whale and blue whale 

from all others.


9.9 Tree distance and Σ-distance 

Given a phylogenetic tree T with edge weights. We define the tree distance between two taxa 

a and b as 

\[ d_T(a, b) := \sum_{e \in P} d_e, \]

where P denotes the set of edges along the unique simple path from the node labeled a to the 

node labeled b. 

We set d S := d e , if S is the split corresponding to e. We define the Σ-distance between two 

taxa a and b as 

\[ d_\Sigma(a, b) := \sum_{S \in \Sigma(a,b)} d_S, \]

where Σ(a, b) is the set of all splits in Σ that separate a and b. 
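As a small illustration, the Σ-distance can be computed directly from a list of weighted splits. This is a sketch under assumptions: the weighted splits below belong to a hypothetical quartet tree ab|cd with pendant edge weights 1, 2, 3, 4 and an inner edge of weight 2, and the helper names are not from the lecture.

```python
X = frozenset("abcd")

def split(side):
    """Return the split {A, X minus A} as a pair of frozensets."""
    a = frozenset(side)
    return (a, X - a)

def separates(s, a, b):
    """A split separates a and b if exactly one of them lies in its first side."""
    return (a in s[0]) != (b in s[0])

def sigma_distance(weighted_splits, a, b):
    """Sum of the weights d_S over all splits S that separate a and b."""
    return sum(w for s, w in weighted_splits if separates(s, a, b))

# One split per edge of the hypothetical tree, weighted by the edge weight.
weighted_splits = [
    (split("a"), 1.0), (split("b"), 2.0), (split("c"), 3.0), (split("d"), 4.0),
    (split("ab"), 2.0),
]

print(sigma_distance(weighted_splits, "a", "b"))  # 3.0 (pendant a + pendant b)
print(sigma_distance(weighted_splits, "a", "c"))  # 6.0 (pendant a + inner + pendant c)
print(sigma_distance(weighted_splits, "b", "d"))  # 8.0 (pendant b + inner + pendant d)
```

In each case the value agrees with the sum of edge weights along the path between the two leaves in the assumed tree.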

These definitions imply 

\[ d_T(a, b) = d_\Sigma(a, b) \]

for all taxa a, b ∈ X. 

9.10 Weak compatibility 

Compatibility is a requirement defined on any two splits. A relaxed concept is that of weak 

compatibility, which is a condition placed on any three splits. 

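Let $S_1$, $S_2$ and $S_3$ be three splits of X. This triplet is called weakly compatible, if for every choice of $A_i \in S_i$ (i = 1, 2, 3), at least one of the four intersections 

\[ A_1 \cap A_2 \cap A_3, \quad A_1 \cap \bar{A}_2 \cap \bar{A}_3, \quad \bar{A}_1 \cap A_2 \cap \bar{A}_3, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2 \cap A_3 \]

is empty. This means that at least one shaded and one unshaded region of the following diagram must be empty: 

[Figure: diagram of the three splits S1, S2 and S3 of X, with shaded and unshaded regions.]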

Note that if any pair of the three splits is compatible, then all three are weakly-compatible: 

E.g., if for S 1 = {A 1 , Ā1} and S 2 = {A 2 , Ā2} we have A 1 ∩ A 2 = ∅, then A 1 ∩ A 2 ∩ A 3 = ∅ and 

A 1 ∩ A 2 ∩ Ā3 = ∅. 

On the other hand, it is possible that every pair of the three weakly-compatible splits is incompatible: 

[Figure: splits graph on the taxa A, B, C, D, E, F, with two parallel edges labeled e and e′.]

Here, a split of X = {A, B, C, D, E, F} is given by each pair of parallel edges, e.g. edges e and e′ define the split {{A, B, F}, {C, D, E}}. 
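The definition of weak compatibility can be checked by brute force over the eight choices of sides. The sketch below is an illustration under assumptions: splits are pairs of frozensets, the function names are hypothetical, and the first triplet of splits is chosen on X = {A, ..., F} so that all three pairs are incompatible.

```python
from itertools import product

def make_split(side, taxa):
    """Return the split {A, complement of A} of `taxa` as a pair of frozensets."""
    a = frozenset(side)
    return (a, frozenset(taxa) - a)

def compatible(s1, s2):
    """Pairwise compatibility: one of the four intersections is empty."""
    return any(not (p & q) for p in s1 for q in s2)

def weakly_compatible(s1, s2, s3):
    """For every choice of sides A_i, one of the four triple intersections
    named in the definition must be empty."""
    for i1, i2, i3 in product((0, 1), repeat=3):
        a1, b1 = s1[i1], s1[1 - i1]
        a2, b2 = s2[i2], s2[1 - i2]
        a3, b3 = s3[i3], s3[1 - i3]
        regions = (a1 & a2 & a3, a1 & b2 & b3, b1 & a2 & b3, b1 & b2 & a3)
        if all(regions):  # all four intersections are non-empty
            return False
    return True

X = "ABCDEF"
P = make_split("ABC", X)
Q = make_split("AEF", X)
R = make_split("ABF", X)
print(compatible(P, Q), compatible(P, R), compatible(Q, R))  # False False False
print(weakly_compatible(P, Q, R))                            # True

# The three possible splits of four taxa into two pairs are not weakly compatible.
Y = "xyzt"
print(weakly_compatible(make_split("xt", Y), make_split("yt", Y),
                        make_split("zt", Y)))                # False
```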


9.11 Weak compatibility and splits graphs 

As discussed above, any given set of splits Σ can be represented by a tree T(Σ), if and only if 

Σ is compatible. 

A weakly compatible split system Σ can be represented by a splits graph G(Σ) that has the 

following properties: 

• all leaves (and, additionally, some internal nodes, perhaps) are multi-labeled by taxa so 

that each taxon appears exactly once, 

• edges are labeled by splits such that each split appears at least once, 

• deleting all edges labeled by any given split S = {A, Ā} produces precisely two components, 

one containing all nodes with labels in A and the other containing all nodes with 

labels in Ā, and 

• the graph is minimal with these properties. 

9.12 Weak compatibility and splits graphs 

Given the following splits: 

S 1 = {{A, B}, {C, D, E, F }}, S 2 = {{A, B, C}, {D, E, F }}, 

S 3 = {{A, F, E}, {B, C, D}}, S 4 = {{A, B, F }, {C, D, E}}, 

and all singleton splits ({A} vs {B, ..., F}, etc.). 

They can be represented as follows: 

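[Figure: splits graph on the taxa A, ..., F, with edges labeled by the splits S1, S2, S3 and S4.]

Here is another example of a splits graph: 

[Figure: splits graph on the taxa A.cerana, A.andrenof, A.florea, A.mellifer, A.koschev and A.dorsata.]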

This graph is based on DNA obtained from bees. It indicates that there is some evidence that groups A.cerana and A.mellifer together, and conflicting evidence that groups A.mellifer with A.dorsata, for example.


9.13 Splits graphs and distances 

Given a set of taxa X, a set of weakly compatible splits Σ of X and a value d S ≥ 0 for each 

split S. 

As above, we define the Σ-distance between taxa a and b simply as $d_\Sigma(a, b) := \sum_{S \in \Sigma(a,b)} d_S$, where Σ(a, b) is the set of all splits that separate a and b. 

Assume we are given a corresponding splits graph G. In G, each split S is represented by a 

band of parallel edges and each such edge e has weight d e = d S . 

Consider any two taxa a, b ∈ X. We define 

\[ d_G(a, b) := \min \Big\{ \sum_{e \in P} d_e \;\Big|\; P \text{ is a simple path from } a \text{ to } b \Big\}. \]

Lemma We have d Σ (a, b) = d G (a, b) for all a, b ∈ X. 

(Proof: need to show that a minimum path from a to b uses precisely one edge for every split 

that separates a and b.) 
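The lemma can be illustrated on a small case. The sketch below assumes a hypothetical four-node splits graph for the two incompatible splits {{1, 5, 6}, {2, 3, 4}} and {{1, 2, 3}, {4, 5, 6}} with assumed weights 2 and 3; the node names and weights are chosen for illustration only. Shortest paths are computed with Dijkstra's algorithm from the standard library `heapq` module.

```python
import heapq
from itertools import combinations

def shortest_path_length(adj, source, target):
    """Dijkstra's algorithm on an adjacency dict {node: [(neighbor, weight), ...]}."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if v == target:
            return d
        if d > dist.get(v, float("inf")):
            continue  # outdated queue entry
        for w, weight in adj[v]:
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(queue, (nd, w))
    return float("inf")

# Assumed example: two splits with weights d_S.
splits = [(frozenset({1, 5, 6}), 2.0), (frozenset({1, 2, 3}), 3.0)]

# Their splits graph is a 4-cycle; each split is a band of two parallel edges.
edges = [("n1", "n56", 3.0), ("n23", "n4", 3.0),   # edges of the second split
         ("n1", "n23", 2.0), ("n56", "n4", 2.0)]   # edges of the first split
adj = {v: [] for v in ("n1", "n23", "n4", "n56")}
for v, w, weight in edges:
    adj[v].append((w, weight))
    adj[w].append((v, weight))

node_of = {1: "n1", 2: "n23", 3: "n23", 4: "n4", 5: "n56", 6: "n56"}

def sigma_distance(a, b):
    """Sum of the weights of all splits that separate a and b."""
    return sum(w for side, w in splits if (a in side) != (b in side))

for a, b in combinations(range(1, 7), 2):
    d_graph = shortest_path_length(adj, node_of[a], node_of[b])
    assert d_graph == sigma_distance(a, b)
print(shortest_path_length(adj, "n1", "n4"))  # 5.0
```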

9.14 Two main questions 

• First, given a set of taxa X and a distance matrix d, how do we compute a weakly compatible system of splits Σ and values $d_S$, such that $\sum_{S \in \Sigma(a,b)} d_S$ is a useful approximation of $d_{ab}$? 

• Second, given a weakly compatible set of splits, how do we compute the corresponding 

splits graph? 

9.15 Distance matrices and d-splits 

Given a distance matrix d on X. We call a split S = {A, Ā} a d-split, if for all i, j ∈ A and 

k, l ∈ Ā we have d ij + d kl < max(d ik + d jl , d il + d jk ). 

In other words, the metric induced by d on any four taxa i, j ∈ A, k, l ∈ Ā, places i, j and k, l together as indicated here: 

[Figure: three admissible quartet configurations on the taxa i, j (side A) and k, l (side B), and one configuration that is NOT admissible.]

9.16 d-splits are weakly compatible 

Lemma (Bandelt & Dress) Let d be a distance matrix on X. Then the set of all d-splits is 

weakly compatible.


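Proof Consider three d-splits $S_1 = \{A_1, \bar{A}_1\}$, $S_2 = \{A_2, \bar{A}_2\}$ and $S_3 = \{A_3, \bar{A}_3\}$ and assume that they are not weakly compatible. Then there exist four taxa x, y, z, t contained in $A_1 \cap \bar{A}_2 \cap \bar{A}_3$, $\bar{A}_1 \cap A_2 \cap \bar{A}_3$, $\bar{A}_1 \cap \bar{A}_2 \cap A_3$ and $A_1 \cap A_2 \cap A_3$, respectively: 

[Figure: diagram of the three splits S1, S2 and S3 of X, showing the positions of x, y, z and t.]

The definition of a d-split implies the following three inequalities: 

For $S_1$: $d_{xt} + d_{yz} < \max(d_{xy} + d_{tz}, d_{xz} + d_{ty})$, 

for $S_2$: $d_{yt} + d_{xz} < \max(d_{yx} + d_{tz}, d_{yz} + d_{tx})$, and 

for $S_3$: $d_{zt} + d_{xy} < \max(d_{zx} + d_{ty}, d_{zy} + d_{tx})$. 

Note that these three inequalities cannot be fulfilled simultaneously: each of the three sums $d_{xt} + d_{yz}$, $d_{yt} + d_{xz}$ and $d_{zt} + d_{xy}$ would have to be strictly smaller than the maximum of the other two, which is impossible for the largest of them. This contradicts our assumption, and thus the three splits must be weakly compatible. □ 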

9.17 The isolation index of a split 

We give any d-split S = {A, Ā} a positive weight, namely the quantity 

\[ \alpha_{A,\bar{A}} := \alpha^d_{A,\bar{A}} := \frac{1}{2} \min_{i,j \in A,\; k,l \in \bar{A}} \{ \max(d_{ik} + d_{jl},\, d_{il} + d_{jk}) - (d_{ij} + d_{kl}) \}, \]

called the isolation index of S. 

We can easily modify this definition to apply to any split S = {A, Ā}, whether d-split or not: 

\[ \alpha_{A,\bar{A}} := \alpha^d_{A,\bar{A}} := \frac{1}{2} \min_{i,j \in A,\; k,l \in \bar{A}} \{ \max(d_{ik} + d_{jl},\, d_{il} + d_{jk},\, d_{ij} + d_{kl}) - (d_{ij} + d_{kl}) \}, \]

thus obtaining a value ≥ 0 that equals the previously defined isolation index, if S is a d-split, 

and 0, if not. 
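As a small worked example (the numbers are an assumption chosen for illustration, not taken from the lecture), consider four taxa a, b, c, d with the additive distances $d_{ab} = 3$, $d_{ac} = 6$, $d_{ad} = 7$, $d_{bc} = 7$, $d_{bd} = 8$, $d_{cd} = 7$. They come from a quartet tree with pendant edge weights 1, 2, 3, 4 and an inner edge of weight 2 that separates {a, b} from {c, d}. For the split {{a, b}, {c, d}} and the quartet i = a, j = b, k = c, l = d we obtain

\[ \max(d_{ac} + d_{bd},\, d_{ad} + d_{bc},\, d_{ab} + d_{cd}) - (d_{ab} + d_{cd}) = \max(14, 14, 10) - 10 = 4. \]

The remaining choices, in which i = j or k = l, give the values 6, 8, 10, 12, 12, 14, 14 and 16, so the minimum is 4 and

\[ \alpha_{\{a,b\},\{c,d\}} = \tfrac{1}{2} \cdot 4 = 2, \]

which is the weight of the inner edge. For the split {{a, c}, {b, d}} the quartet i = a, j = c, k = b, l = d gives

\[ \max(d_{ab} + d_{cd},\, d_{ad} + d_{cb},\, d_{ac} + d_{bd}) - (d_{ac} + d_{bd}) = \max(10, 14, 14) - 14 = 0, \]

so its isolation index is 0 and it is not a d-split.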

9.18 The split decomposition 

For any split S = {A, Ā} of X, the split metric $\delta_S$ is given by 

\[ \delta_S(i, j) := \begin{cases} 0, & \text{if } i, j \in A \text{ or } i, j \in \bar{A}, \\ 1, & \text{else.} \end{cases} \]

Theorem (Bandelt & Dress) Any given distance matrix d on X possesses the following unique decomposition: 

\[ d_{ij} = \Big( \sum_{S} \alpha_S \, \delta_S(i, j) \Big) + d^0_{ij}, \]

for all i, j ∈ X. Here, the sum runs over all possible splits S and the map $d^0 : X \times X \to \mathbb{R}_{\ge 0}$ is a (pseudo-)metric that does not admit any further splits with positive isolation index, i.e. there exist no $d^0$-splits. 

Hence, we have $\sum_S \alpha_S \delta_S(i, j) \le d_{ij}$ for any pair of taxa i, j ∈ X and the Σ-distance with weights $\alpha_S$ approximates $d_{ij}$ from below. 

One can prove that the number of d-splits is $\le \binom{|X|}{2}$.


9.19 Computing the set of d-splits 

Given a distance matrix d on X. The set of all d-splits can be computed iteratively in $O(n^6)$ steps: 

Algorithm 

Input: Distance matrix d, taxon set X = {x_1, x_2, ..., x_n}. 
Output: Set Σ = Σ_n of all d-splits 
Initialization: Σ_0 := ∅, X_0 := ∅ 
for each k = 1, 2, ..., n do: 
    Set Σ_k := ∅, X_k := X_{k−1} ∪ {x_k} 
    for each split S = {A, Ā} ∈ Σ_{k−1}: 
        if {A ∪ {x_k}, Ā} has positive isolation index then 
            Add {A ∪ {x_k}, Ā} to Σ_k 
        if {A, Ā ∪ {x_k}} has positive isolation index then 
            Add {A, Ā ∪ {x_k}} to Σ_k 
    if {{x_1, x_2, ..., x_{k−1}}, {x_k}} has positive isolation index then 
        Add {{x_1, x_2, ..., x_{k−1}}, {x_k}} to Σ_k 
end 

Lemma This algorithm computes all d-splits. 

Proof: First note that in the k-th iteration of the algorithm the new partial singleton split $\{\{x_1, \dots, x_{k-1}\}, \{x_k\}\}$ is evaluated and then added to the current set of splits, if $\alpha_{\{x_1, \dots, x_{k-1}\},\{x_k\}} > 0$. Additionally, the algorithm attempts to extend all existing partial splits by adding $x_k$ to the one side, or the other side, of them. By definition of the isolation index as the minimum of certain sums involving quartets of taxa, adding a taxon to either side of a partial split can only decrease the isolation index or leave it unchanged. Consequently, every restriction of a d-split of X to $\{x_1, \dots, x_k\}$ that still has two non-empty sides has positive isolation index. Hence, any d-split of X is obtainable as a partial singleton split for some k, followed by successive addition of the remaining taxa to the split. 

□ 
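The iterative algorithm above can be sketched in a few lines. This is a minimal illustration, not an optimized implementation: it assumes d is a nested dictionary with exact (for example integer) entries, evaluates the isolation index of a partial split only on the taxa seen so far, and skips the singleton test for the first taxon, where one side would be empty. All function names and the example matrix are assumptions.

```python
from itertools import combinations_with_replacement

def isolation_index(d, A, B):
    """Isolation index of the (partial) split {A, B}, using the modified
    definition that yields 0 for splits that are not d-splits."""
    best = float("inf")
    for i, j in combinations_with_replacement(sorted(A), 2):
        for k, l in combinations_with_replacement(sorted(B), 2):
            inner = d[i][j] + d[k][l]
            outer = max(d[i][k] + d[j][l], d[i][l] + d[j][k], inner)
            best = min(best, outer - inner)
    return best / 2

def d_splits(d, taxa):
    """Return a dict mapping each d-split (a frozenset of two frozensets)
    to its isolation index."""
    current = []  # partial splits (A, B) on the taxa processed so far
    for k, x in enumerate(taxa):
        extended = []
        for A, B in current:
            # Try to add x to the one side, or the other side, of the split.
            for candidate in ((A | {x}, B), (A, B | {x})):
                if isolation_index(d, *candidate) > 0:
                    extended.append(candidate)
        if k >= 1:
            # New partial singleton split {{x_1, ..., x_(k-1)}, {x_k}}.
            singleton = (frozenset(taxa[:k]), frozenset([x]))
            if isolation_index(d, *singleton) > 0:
                extended.append(singleton)
        current = extended
    return {frozenset(s): isolation_index(d, *s) for s in current}

# Assumed example: additive distances of a quartet tree ab|cd with pendant
# edge weights 1, 2, 3, 4 and an inner edge of weight 2.
taxa = ["a", "b", "c", "d"]
pairs = {("a", "b"): 3, ("a", "c"): 6, ("a", "d"): 7,
         ("b", "c"): 7, ("b", "d"): 8, ("c", "d"): 7}
d = {x: {y: 0 for y in taxa} for x in taxa}
for (x, y), value in pairs.items():
    d[x][y] = d[y][x] = value

result = d_splits(d, taxa)
for s, weight in result.items():
    print(sorted(sorted(side) for side in s), weight)
```

For this tree-like input the sketch finds the four singleton splits and the split {{a, b}, {c, d}}, each weighted by the corresponding edge weight of the assumed tree. For floating-point input a small tolerance in place of the test `> 0` would be advisable.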

9.20 Computing the splits graph 

Given a compatible system of splits, it is easy to construct the corresponding tree and to 

compute coordinates for the tree. 

The problem of computing a splits graph for a given set of weakly compatible splits is more 

difficult. In practice, one distinguishes between circular split systems, which correspond to 

planar split graphs, and non-circular ones. A nice algorithm exists for circular split systems 

that produces a planar graph. 

Here we discuss the convex hull approach that applies to any set of weakly compatible splits, 

whether circular or not. This method is easy to describe. Its main draw-back is that it usually 

produces redundant nodes and edges and so the resulting graph is not always minimal in the 

sense postulated above.


9.21 Convex-hull construction method 

Given a splits graph G. For a given set of taxa A ⊂ X, let $G_A$ denote the set of all nodes labeled by taxa in A. The convex hull $\overline{G_A}$ of $G_A$ is obtained by first setting $\overline{G_A} = G_A$ and then repeatedly adding any node v to $\overline{G_A}$, if there exist two nodes a, b already in $\overline{G_A}$ such that $d_G(a, v) + d_G(v, b) \le d_G(a, b)$. 

Given a weakly compatible set of splits Σ = {S 1 , S 2 , . . ., S k }. The convex-hull construction 

method constructs a graph by adding one split at a time. For each split, the convex hull for both 

sides of the split is computed. The intersection of the two is duplicated, one copy is connected 

to one side of the graph corresponding to one side of the split, the other to the other and then 

the two duplicated subgraphs are connected by a set of new edges that represent the new split. 

Assume that G is the graph constructed for splits $S_1, \dots, S_i$. To add the next split $S_{i+1} = \{A, \bar{A}\}$: 

• Determine the two convex hulls $\overline{G_A}$ and $\overline{G_{\bar{A}}}$. 

• Let $H := \overline{G_A} \cap \overline{G_{\bar{A}}}$ denote their intersection. 

• For each node v ∈ H, produce two new nodes $v^+$ and $v^-$ and connect them by an edge labeled $S_{i+1}$. 

• If v ∈ H is labeled by a taxon x ∈ A, or x ∈ Ā, then attach this label to node $v^+$, or $v^-$, respectively. 

• Connect any two nodes $v^+$ and $w^+$, and $v^-$ and $w^-$, respectively, by an edge, if v and w are connected by an edge in G. 

• If v ∈ H is connected to some node $w \in \overline{G_A} \setminus \overline{G_{\bar{A}}}$, then connect $v^+$ and w by an edge. 

• If v ∈ H is connected to some node $w \in \overline{G_{\bar{A}}} \setminus \overline{G_A}$, then connect $v^-$ and w by an edge. 

• Delete H. 
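The first two steps above, computing the two convex hulls and their intersection H, can be sketched as follows. This is an illustration under assumptions: the graph is connected, all edges are given unit weight so that breadth-first search yields $d_G$, and the four-node example graph (a cycle with nodes labeled 1; 2,3; 4; and 5,6) and all names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted, connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def convex_hull(adj, start_nodes):
    """Repeatedly add any node v with d(a, v) + d(v, b) <= d(a, b) for some
    nodes a, b that are already in the hull."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    hull = set(start_nodes)
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v in hull:
                continue
            if any(dist[a][v] + dist[v][b] <= dist[a][b]
                   for a in hull for b in hull):
                hull.add(v)
                changed = True
    return hull

def nodes_labeled(labels, side):
    """All nodes carrying at least one taxon from `side`."""
    return {v for v, taxa in labels.items() if taxa & side}

# Assumed example graph: a 4-cycle n1 - n23 - n4 - n56 - n1.
adj = {"n1": ["n56", "n23"], "n23": ["n1", "n4"],
       "n4": ["n23", "n56"], "n56": ["n4", "n1"]}
labels = {"n1": {1}, "n23": {2, 3}, "n4": {4}, "n56": {5, 6}}

A, A_bar = {1, 2, 5, 6}, {3, 4}          # the split to be added next
hull_A = convex_hull(adj, nodes_labeled(labels, A))
hull_A_bar = convex_hull(adj, nodes_labeled(labels, A_bar))
H = hull_A & hull_A_bar
print(sorted(H))  # ['n23', 'n4']
```

Here the hull of the side {1, 2, 5, 6} is the whole graph, because the node labeled 4 lies on a shortest path between the nodes labeled 5,6 and 2,3, so H consists of the nodes labeled 2,3 and 4.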

9.22 Computing the splits graph 

Given the following set Σ of splits: 

S_1 = {{1, 5, 6}, {2, 3, 4}} 

S_2 = {{1, 2, 3}, {4, 5, 6}} 

S_3 = {{1, 2, 5, 6}, {3, 4}} 

S_4 = {{1, 2}, {3, 4, 5, 6}} 

S_5 = {{1, 6}, {2, 3, 4, 5}} 

We will demonstrate how to generate $G_\Sigma$. 

Initially, start with a single node labeled by all of X = {1, 2, 3, 4, 5, 6}: 

[Figure: graph G_1, a single node labeled 1,2,3,4,5,6.]

Then add the first split S 1 . Note that H consists of the single node present in G 1 : 

[Figure: graph G_2, with two nodes labeled 1,5,6 and 2,3,4.]

Add the second split S 2 = {{1, 2, 3}, {4, 5, 6}}. Note that H consists of both nodes in G 2 :


[Figure: graph G_3, with nodes labeled 1; 2,3; 4; and 5,6.]

Add the third split S_3 = {{1, 2, 5, 6}, {3, 4}}. Note that H consists of the two nodes labeled 2, 3 and 4 in G_3: 

[Figure: graph G_4.]

Add the fourth split S 4 = {{1, 2}, {3, 4, 5, 6}}. Note that H consists of the two nodes labeled 

1 and 2 in G 4 : 

[Figure: graph G_5.]

Add the fifth split S 5 = {{1, 6}, {2, 3, 4, 5}}. Note that H consists of the two nodes labeled 1 

and 5, 6, plus the node lying between these two in G 5 : 

[Figure: graph G_6.]

Finally, add all singleton splits to G 6 to obtain the final graph G: 

[Figure: final splits graph G on the taxa 1, ..., 6.]

9.23 Example of splits graph 

The distance matrix for the following example was produced in a psychology experiment in 

which people were asked to estimate the distance between different colors:


[Figure: splits graph for colors.nex (Mon Jul 15 09:16:02 2002; Fit=97.0, ntax=10), with taxa red, yellow, red-purple, gr.-y.(yellowish), purple-reddish, gr.-y.(greenish), purple, green, purple-blue and blue.]
