16.11.2014 Views

Algorithmen der Bioinformatik II - Algorithms in Bioinformatics ...


Algorithms in Bioinformatics II, SS 2002, Uni Tübingen, Daniel Huson

Algorithms in Bioinformatics II
(Algorithmen der Bioinformatik II)

Lecture, summer semester 2002,
WSI-Informatik, Universität Tübingen

Prof. Dr. Daniel Huson
huson@informatik.uni-tuebingen.de

0.1 Organizational matters

Lecture: Mon, Wed, 10-12h c.t., A301, Sand 1

Tutorial groups:
Tue 10-12h C306, Sand 14, Ulrike von Luxburg
Wed 15-17h Kleiner Hörsaal, Sand 6/7, Christian Rausch

Criterion for the course certificate (Schein): regular participation in a tutorial group and 60% of the possible points; each exercise sheet may be worked on by at most two people.

Office hours:
Wed 16-18h C310a, Sand 14, Daniel Huson
Wed 17-18h C324b, Sand 14, Christian Rausch

Web: www-ab.informatik.uni-tuebingen.de/lehre/ss02/bioinformatik2.html

0.2 Contents

1. Report on the sequencing of the human genome
2. Markov chains and Hidden Markov Models
3. Suffix trees
4. Assembly
5. Gene finding
6. Phylogeny
7. Structure prediction

1 Markov chains and Hidden Markov Models

1.1 Markov chains
1.2 Hidden Markov Models
1.3 Profile HMMs

Literature: R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge, 1998



1.4 Markov chains

Example: CpG islands in the human genome.

DNA double strand:

...ApCpCpApTpGpApTpGpCpApGpGpApCpTpTpCpCpApTpCpGpTpTpCpGpCpGp...
...| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ...
...TpGpGpTpApCpTpApCpGpTpCpCpTpGpApApGpGpTpApGpCpApApGpCpGpCp...

In a CpG pair, the C is frequently methylated (i.e. an H atom is replaced by a CH3 group). A methyl-C mutates to T with increased probability. Consequently, the pair CpG is underrepresented in the genome.

Upstream of a gene, this methylation is suppressed for biological reasons. So-called CpG islands of length 100-5000 form there; they are characterized by the fact that CpG pairs are not underrepresented in them.

1.5 CpG islands

CpG islands are useful markers for genes in organisms whose genomes contain 5-methylcytosine.

CpG islands in the promoter regions of genes play a role in the inactivation of the X chromosome, in imprinting, and in the silencing of intragenomic parasites.

Classical definition: a DNA sequence of length at least 200 with a C+G content of at least 50% and a ratio of observed to expected number of CpGs above 0.6.
(Gardiner-Garden & Frommer, 1987)

According to a very recent study, the two human chromosomes 21 and 22 together contain about 1100 CpG islands and about 750 genes.
(Comprehensive analysis of CpG islands in human chromosomes 21 and 22, D. Takai & P. A. Jones, PNAS, March 19, 2002)
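The classical definition can be checked mechanically. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and the expected CpG count is taken as (number of C × number of G) / length, which is the usual estimate but is an assumption here, since the notes do not spell it out.

```python
def cpg_stats(seq):
    """Return (C+G content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")          # "CG" cannot overlap itself
    expected = c * g / n                # assumed estimate of the expected count
    gc_content = (c + g) / n
    ratio = observed / expected if expected else 0.0
    return gc_content, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three thresholds of the classical definition."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and ratio > min_ratio


print(looks_like_cpg_island("CG" * 100))   # CpG-rich toy sequence
print(looks_like_cpg_island("AT" * 100))   # no C or G at all
```

The thresholds are keyword arguments so that variants of the definition can be tried without touching the counting code.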

1.6 Questions

1. Given a short genomic sequence. How can we decide whether this sequence comes from a CpG island?

2. Given a long genomic sequence. How do we find all the CpG islands contained in it?

1.7 Markov chains

We want to set up a probabilistic model for CpG islands. Since pairs of consecutive nucleotides are important, we need a model in which the probability of a symbol depends on the previous symbol, that is, a Markov chain.



Example:

[Figure: Markov chain with the four states A, C, G and T, connected by transition arrows]

Circles = states, e.g. with names A, C, G and T.

Arrows = possible transitions, each labeled with a transition probability $a_{st} = P(x_i = t \mid x_{i-1} = s)$.

1.8 Markov chains

Definition: A (time-homogeneous) Markov chain (of order 1) is a system $(S, A)$, given by a finite set of states $S = \{s_1, s_2, \dots, s_n\}$ and a transition matrix $A = \{a_{st}\}$ with $\sum_{t \in S} a_{st} = 1$ for all $s \in S$, which specifies the probability of the transition $s \to t$:

$$P(x_{i+1} = t \mid x_i = s) = a_{st}.$$

Example: Weather in Tübingen, daily at 12h. The possible states are rain (R, Regen), sun (S, Sonne) or clouds (W, Wolken).

Transition probabilities:

      R    S    W
R    .5   .1   .4
S    .2   .6   .2
W    .3   .3   .4

Weather: ...rrrrrrwwsssssswswswwwrrwrwssss...

Given a sequence of states $x_1, x_2, x_3, \dots, x_L$. What is the probability that exactly this sequence of states is traversed by a given Markov chain?

$$\begin{aligned}
P(x) &= P(x_L, x_{L-1}, \dots, x_1) \\
     &= P(x_L \mid x_{L-1}, \dots, x_1)\, P(x_{L-1} \mid x_{L-2}, \dots, x_1) \cdots P(x_1) \\
     &= P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) \\
     &= P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}.
\end{aligned}$$

The second line follows by repeated application of $P(X, Y) = P(X \mid Y) P(Y)$, the third because $P(x_i \mid x_{i-1}, \dots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i}$ holds for a Markov chain.
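The product formula can be evaluated directly for the weather chain. This is a small sketch using the transition table of the Tübingen example; the uniform initial distribution `UNIFORM` is an assumption made only for illustration, since the notes do not give $P(x_1)$ for the weather model.

```python
# Transition probabilities of the weather example (R = rain, S = sun, W = clouds).
TRANS = {
    "R": {"R": 0.5, "S": 0.1, "W": 0.4},
    "S": {"R": 0.2, "S": 0.6, "W": 0.2},
    "W": {"R": 0.3, "S": 0.3, "W": 0.4},
}

# Assumed initial distribution P(x_1); not part of the lecture example.
UNIFORM = {state: 1 / 3 for state in TRANS}


def chain_probability(states, initial):
    """P(x) = P(x_1) * product over i = 2..L of a_{x_{i-1} x_i}."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= TRANS[prev][cur]
    return prob


# Rain, rain, clouds, sun: 1/3 * 0.5 * 0.4 * 0.3
print(chain_probability("RRWS", UNIFORM))
```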

1.9 Modeling the begin and end state

In the description so far we have neglected the initial probabilities $P(x_1)$.



We add a begin state with label $b$ to the model. From now on we always assume that $x_0 = b$. Then:

$$P(x_1 = s) = a_{bs} = P(s),$$

where $P(s)$ is the background probability of symbol $s$.

We also model the end of the sequence with an end state $e$. The probability of ending after state $t$ is then:

$$P(e \mid x_L = t) = a_{te}.$$

1.10 Extending the model

Example:

[Figure: the Markov chain with states A, C, G and T, extended by a begin state b and an end state e]

# Markov chain that generates CpG islands
# (Source: DEMK98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
0.1795 0.2735 0.4255 0.1195 0 0.002
0.1705 0.3675 0.2735 0.1875 0 0.002
0.1605 0.3385 0.3745 0.1245 0 0.002
0.0785 0.3545 0.3835 0.1815 0 0.002
0.2495 0.2945 0.2495 0.2945 0 0.002
0.0000 0.0000 0.0000 0.0000 0 1.000

1.11 Computing the transition matrix

The transition matrix $A^+$ for DNA that comes from a CpG island is computed as follows:

$$a^+_{st} = \frac{c^+_{st}}{\sum_{t'} c^+_{st'}},$$

where $c^+_{st}$ is the number of positions in a training set of CpG islands at which state $s$ is followed by state $t$.

$A^-$ is determined empirically in the same way.
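The counting formula can be sketched in a few lines. The function name and the two toy training sequences are hypothetical and only serve to illustrate the normalization; rows for symbols that never occur as a predecessor are set to zero here, which is a choice of this sketch.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_transitions(sequences):
    """a_st = c_st / sum over t' of c_st', with c_st counted over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))      # consecutive pairs (s, t)
    matrix = {}
    for s in ALPHABET:
        total = sum(counts[(s, t)] for t in ALPHABET)
        matrix[s] = {
            t: (counts[(s, t)] / total if total else 0.0) for t in ALPHABET
        }
    return matrix


toy_islands = ["CGCGGC", "GCGCGA"]            # hypothetical training set
a_plus = estimate_transitions(toy_islands)
print(a_plus["G"])
```

Running the same function on a training set of non-island DNA would give the corresponding estimate of $A^-$.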



1.12 Two examples of Markov chains

# Markov chain for CpG islands
# (Source: DEMK98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
.1795 .2735 .4255 .1195 0 0.002
.1705 .3675 .2735 .1875 0 0.002
.1605 .3385 .3745 .1245 0 0.002
.0785 .3545 .3835 .1815 0 0.002
.2495 .2945 .2495 .2945 0 0.002
.0000 .0000 .0000 .0000 0 1.000

# Markov chain for non-CpG islands
# (Source: DEMK98, p 50)
# Number of states:
6
# State labels:
A C G T * +
# Transition matrix:
.2995 .2045 .2845 .2095 0 .002
.3215 .2975 .0775 .3015 0 .002
.2475 .2455 .2975 .2075 0 .002
.1765 .2385 .2915 .2915 0 .002
.2495 .2495 .2495 .2495 0 .002
.0000 .0000 .0000 .0000 0 1.00

1.13 Answering question 1

Given a short sequence $x = (x_1, x_2, \dots, x_L)$. Does it come from a CpG island (Model+)? We have

$$P(x \mid \text{Model}+) = \prod_{i=0}^{L} a^+_{x_i x_{i+1}},$$

with $x_0 = b$ and $x_{L+1} = e$.

We use the following score:

$$S(x) = \log \frac{P(x \mid \text{Model}+)}{P(x \mid \text{Model}-)} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}}.$$

The higher this score, the more likely it is that $x$ is a CpG island.
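The score can be computed with the two transition tables given above. This is a sketch under two stated assumptions: the begin and end terms are left out (only transitions between A, C, G and T are scored), and the last entry of the C row of the non-island matrix is taken as .3015, the value for which that row sums to 1 together with the .002 end probability.

```python
import math

# Transitions between A, C, G, T in CpG islands (Model+).
PLUS = {
    "A": {"A": .1795, "C": .2735, "G": .4255, "T": .1195},
    "C": {"A": .1705, "C": .3675, "G": .2735, "T": .1875},
    "G": {"A": .1605, "C": .3385, "G": .3745, "T": .1245},
    "T": {"A": .0785, "C": .3545, "G": .3835, "T": .1815},
}
# Transitions outside CpG islands (Model-).
MINUS = {
    "A": {"A": .2995, "C": .2045, "G": .2845, "T": .2095},
    "C": {"A": .3215, "C": .2975, "G": .0775, "T": .3015},  # .3015 is assumed
    "G": {"A": .2475, "C": .2455, "G": .2975, "T": .2075},
    "T": {"A": .1765, "C": .2385, "G": .2915, "T": .2915},
}


def log_odds(seq):
    """Sum of log(a+ / a-) over all consecutive symbol pairs of seq."""
    return sum(
        math.log(PLUS[s][t] / MINUS[s][t]) for s, t in zip(seq, seq[1:])
    )


print(log_odds("CGCG"))   # positive: looks like an island
print(log_odds("ATAT"))   # negative: looks like background
```

Dividing the score by the sequence length makes scores of sequences of different length comparable; whether to normalize is a design choice outside the formula above.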

1.14 Questions to ask of a Markov chain

Example: Weather in Tübingen, daily at 12h. The possible states are rain (R), sun (S) or clouds (W).

Transition probabilities:

      R    S    W
R    .5   .1   .4
S    .2   .6   .2
W    .3   .3   .4

Questions one can ask of the model:

If the sun is shining today, what is the probability that the sun will shine for the next seven days?

What is the probability that it rains for a whole month?
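Both questions reduce to a power of a single self-transition probability, because every further day in the same state contributes one factor $a_{ss}$. A minimal sketch (the function name is hypothetical, and the month is assumed to mean 29 further rainy days after a rainy first day):

```python
def run_probability(stay_prob, days):
    """Probability of staying in the same state for `days` further days."""
    return stay_prob ** days


A_SS = 0.6   # sun -> sun, from the weather table
A_RR = 0.5   # rain -> rain, from the weather table

print(run_probability(A_SS, 7))    # 0.6 ** 7, roughly 0.028
print(run_probability(A_RR, 29))   # assumed: 29 further rainy days
```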

1.15 Hidden Markov Models (HMM)

Motivation: question 2, how does one detect CpG islands within a long sequence?



One option is a window method: a window of width $w$ is slid along the sequence and the score is plotted. Problems: the boundaries of the CpG islands are not detected sharply, and which window size $w$ should be chosen?

Approach: the two Markov chains Model+ and Model− are combined into a so-called Hidden Markov Model.

1.16 Hidden Markov Models

Definition: An HMM is a system $M = (S, Q, A, e)$, with

• an alphabet $S$,
• a set of states $Q$,
• a matrix $A = \{a_{kl}\}$ of transition probabilities $a_{kl}$ for $k, l \in Q$, and
• an emission probability $e_k(b)$ for every $k \in Q$ and $b \in S$.

1.17 Example

An HMM for CpG islands:

[Figure: the eight states A+, C+, G+, T+ and A−, C−, G−, T−]

(In addition, there are all the transitions within each of the two sets of states, which are carried over from the two Markov chains Model+ and Model−.)

1.18 HMM for CpG islands

# Number of states:
9
# Names of states (begin/end, A+, C+, G+, T+, A-, C-, G- and T-):
0 A C G T a c g t
# Number of symbols:
4
# Names of symbols:
a c g t
# Transition matrix, probability to change from +island to -island (and vice versa) is 10E-4

0.0000000000 0.0725193101 0.1637630296 0.1788242720 0.0754545682 0.1322050994 0.1267006624 0.1226380452 0.1278950131<br />

0.0010000000 0.1762237762 0.2682517483 0.4170629371 0.1174825175 0.0035964036 0.0054745255 0.0085104895 0.0023976024<br />

0.0010000000 0.1672435130 0.3599201597 0.2679840319 0.1838722555 0.0034131737 0.0073453094 0.0054690619 0.0037524950<br />

0.0010000000 0.1576223776 0.3318881119 0.3671328671 0.1223776224 0.0032167832 0.0067732268 0.0074915085 0.0024975025<br />

0.0010000000 0.0773426573 0.3475514486 0.3759440559 0.1781818182 0.0015784216 0.0070929071 0.0076723277 0.0036363636<br />

0.0010000000 0.0002997003 0.0002047952 0.0002837163 0.0002097902 0.2994005994 0.2045904096 0.2844305694 0.2095804196<br />

0.0010000000 0.0003216783 0.0002977023 0.0000769231 0.0003016983 0.3213566434 0.2974045954 0.0778441558 0.3013966034<br />

0.0010000000 0.0002477522 0.0002457542 0.0002977023 0.0002077922 0.2475044955 0.2455084915 0.2974035964 0.2075844156<br />

0.0010000000 0.0001768232 0.0002387612 0.0002917083 0.0002917083 0.1766463536 0.2385224775 0.2914165834 0.2914155844<br />

# Emission probabilities:
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

From now on we use 0 for the begin and end state.

1.19 Example: fair/unfair die

A casino uses two dice, a fair one and an unfair one:

Fair die:    1: 1/6,  2: 1/6,  3: 1/6,  4: 1/6,  5: 1/6,  6: 1/6
Unfair die:  1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2

Transitions: Fair → Fair 0.95, Fair → Unfair 0.05, Unfair → Unfair 0.9, Unfair → Fair 0.1

A visitor to the casino only observes the number of pips:

6 4 3 2 3 4 6 5 1 2 3 4 5 6 6 6 3 2 1 2 6 3 4 2 1 6 6...

Which die was used remains hidden:

F F F F F F F F F F F F U U U U U F F F F F F F F F F...

1.20 Example: urn model

Given $p$ urns $U_1, U_2, \dots, U_p$. Each urn $U_i$ contains $r_i$ red, $g_i$ green and $b_i$ blue balls. An urn $U_i$ is selected at random and then a ball $k$ is drawn from it at random (with replacement). The color of the ball $k$ is output.

[Figure: the urns $U_1, U_2, \dots, U_p$; urn $U_i$ contains $r_i$ red, $g_i$ green and $b_i$ blue balls]

r r g g b b g b g g g b b b r g r g b b b g g b g g b...

Here too the states are hidden; we only see the symbols that are produced.

1.21 HMM for the urn model

# Four urns
# Number of states:
5
# Names of states:
# (0 begin/end, and urns A-D)
0 A B C D
# Number of symbols:
3
# red, green, blue
r g b
# Transition matrix:
0 .25 .25 .25 .25
0.01 .69 .30 .0
0.01 .0 .69 .30
0.01 .30 .0 .69
# Emission probabilities:
0 0 0
.8 .1 .1
.2 .5 .3
.1 .1 .8
# EOF

1.22 Generating synthetic data

HMMs can be used to generate data:

Algorithm
Start in state 0.
While state 0 has not been reached again:
  Choose a new state according to the transition probabilities.
  Choose a symbol according to the emission probabilities and output it.
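The generation loop can be sketched for the fair/unfair casino model. The die probabilities and the 0.95/0.05 and 0.9/0.1 switching probabilities are those of the casino example; everything concerning state 0 is an assumption of this sketch, because the example does not specify it: the start is taken as 50/50 between the two dice, and a hypothetical stopping probability `P_END` is mixed into each row so that the loop terminates.

```python
import random

P_END = 0.01   # hypothetical probability of returning to state 0
BASE = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}

TRANS = {"0": {"F": 0.5, "U": 0.5}}            # assumed start distribution
for k, row in BASE.items():
    TRANS[k] = {l: p * (1 - P_END) for l, p in row.items()}
    TRANS[k]["0"] = P_END

EMIT = {
    "F": {str(s): 1 / 6 for s in range(1, 7)},
    "U": {**{str(s): 0.1 for s in range(1, 6)}, "6": 0.5},
}


def draw(dist, rng):
    """Draw one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(rng):
    """Follow the chain from state 0 until state 0 is reached again."""
    symbols, states = [], []
    state = draw(TRANS["0"], rng)
    while state != "0":
        states.append(state)
        symbols.append(draw(EMIT[state], rng))
        state = draw(TRANS[state], rng)
    return "".join(symbols), "".join(states)


symbols, states = generate(random.Random(0))
print("Symbols:", symbols)
print("States :", states)
```

Passing an explicit `random.Random` instance keeps runs reproducible; the expected sequence length is about `1 / P_END` symbols.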

1.23 A symbol sequence for the casino example

We use the fair/unfair HMM to generate a sequence of symbols:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
Symbols: 5146526666
States : FFUUUUUUUU

How probable is this data?

If we see only the symbols, can we reconstruct the corresponding states?



1.24 Computing the probability when path and symbols are known

Definition: A path $\pi = (\pi_1, \pi_2, \dots, \pi_L)$ is a sequence of states in the model $M$.

Given a sequence of symbols $x = (x_1, \dots, x_L)$ and a path $\pi = (\pi_1, \dots, \pi_L)$ through $M$. The joint probability is:

$$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}},$$

with $\pi_{L+1} = 0$.

Unfortunately, we usually do not know the path!
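The joint probability is a single pass over the sequence. The sketch below uses the casino model; the start probabilities `START` (standing in for $a_{0k}$) and the termination factors `END` (standing in for $a_{k0}$, set to 1, i.e. the end of the sequence is not modelled) are assumptions, as the casino example does not specify them.

```python
START = {"F": 0.5, "U": 0.5}     # a_0k, assumed
END = {"F": 1.0, "U": 1.0}       # a_k0, assumed: end not modelled
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 0.1 for s in "12345"}, "6": 0.5},
}


def joint_probability(x, path):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}}."""
    prob = START[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        prob *= EMIT[state][symbol]
        if i + 1 < len(path):
            prob *= TRANS[state][path[i + 1]]
        else:
            prob *= END[state]       # last factor a_{pi_L 0}
    return prob


print(joint_probability("66", "UU"))   # 0.5 * 0.5 * 0.9 * 0.5
print(joint_probability("66", "FF"))
```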

1.25 “Decoding” a symbol sequence

Problem: we have observed a sequence $x$ of symbols and now want to “decode” it.

Example: the symbol sequence C G C G has several “explanations” in the CpG model, e.g.:
$(C^+, G^+, C^+, G^+)$, $(C^-, G^-, C^-, G^-)$ and $(C^-, G^+, C^-, G^+)$.

A path through the HMM determines which parts of the sequence $x$ are interpreted as CpG islands.

1.26 The most probable path

To solve the decoding problem, we want to compute the path $\pi^*$ for which the probability of having generated the symbol sequence $x$ is maximal, i.e.:

$$\pi^* = \arg\max_{\pi} P(x, \pi).$$

This most probable path $\pi^*$ can be computed recursively.

Definition: For the prefix $(x_1, x_2, \dots, x_i)$, the variable $v_k(i)$ gives the probability of the most probable path that ends in state $k$ (at position $i$). We have:

$$v_l(i+1) = e_l(x_{i+1}) \max_{k \in Q} \bigl(v_k(i)\, a_{kl}\bigr),$$

with initialization $v_0(0) = 1$.

(Exercise: show that $\arg\max_{\pi} P(x, \pi) = \arg\max_{\pi} P(\pi \mid x)$.)

1.27 Most probable path

[Figure: trellis with one column per position $x_0, x_1, x_2, x_3, \dots, x_{i-2}, x_{i-1}, x_i, x_{i+1}$; the column for $x_0$ contains the state 0, every other column contains the eight states A+, C+, G+, T+, A−, C−, G−, T−]



1.28 Viterbi algorithm

Input: HMM $M = (S, Q, A, e)$ and symbol sequence $x$
Output: most probable path $\pi^*$.

Initialization ($i = 0$): $v_0(0) = 1$, $v_k(0) = 0$ for $k \neq 0$.

For all $i = 1, \dots, L$, $l \in Q$:
  $v_l(i) = e_l(x_i) \max_{k \in Q} \bigl(v_k(i-1)\, a_{kl}\bigr)$
  $\mathrm{ptr}_i(l) = \arg\max_{k \in Q} \bigl(v_k(i-1)\, a_{kl}\bigr)$

Termination:
  $P(x, \pi^*) = \max_{k \in Q} \bigl(v_k(L)\, a_{k0}\bigr)$
  $\pi^*_L = \arg\max_{k \in Q} \bigl(v_k(L)\, a_{k0}\bigr)$

Traceback:
  For all $i = L, \dots, 2$: $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$

Implementation note: instead of multiplying many small values, add logarithms!

(Exercise: estimate the running time.)
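A compact sketch of the algorithm in log space, following the implementation note, for the casino model. The start probabilities `START` are an assumption, and the termination factor $a_{k0}$ is dropped (the end of the sequence is not modelled), so the final maximum is taken over $v_k(L)$ alone; the function assumes a non-empty input.

```python
import math

STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}     # a_0k, assumed
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 0.1 for s in "12345"}, "6": 0.5},
}


def viterbi(x):
    """Return (most probable state path, log P(x, path)) for non-empty x."""
    # v holds log v_k(i) for the current position i.
    v = {k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}
    pointers = []                                  # ptr_i for i = 2..L
    for symbol in x[1:]:
        new_v, back = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][l]))
            back[l] = best
            new_v[l] = (
                math.log(EMIT[l][symbol]) + v[best] + math.log(TRANS[best][l])
            )
        v = new_v
        pointers.append(back)
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for back in reversed(pointers):                # traceback
        path.append(back[path[-1]])
    return "".join(reversed(path)), v[last]


print(viterbi("666666666")[0])
print(viterbi("12345123")[0])
```

Each position costs $|Q|^2$ comparisons, which also hints at the answer to the running-time exercise.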

1.29 Example for Viterbi

Given the sequence C G C G and an HMM for CpG islands. Here is a possible table of the values of $v$ (rows: states, columns: sequence positions):

v            C      G      C      G
0      1     0      0      0      0
A+     0     0      0      0      0
C+     0     .13    0      .012   0
G+     0     0      .034   0      .0032
T+     0     0      0      0      0
A−     0     0      0      0      0
C−     0     .13    0      .0026  0
G−     0     0      .010   0      .00021
T−     0     0      0      0      0

1.30 Viterbi decoding of the casino example

We use the fair/unfair HMM to generate a sequence of symbols and the Viterbi algorithm to decode the sequence. Result:

Symbols: 24335642611341666666526562426612134635535566462666636664253
States : FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUUUFFFFFFFFFFUUUUUUUUUUUUUFFFF
Viterbi: FFFFFFFFFFFFFFUUUUUUUUUUUUUUUUFFFFFFFFFFFFUUUUUUUUUUUUUFFFF
Symbols: 35246363252521655615445653663666511145445656621261532516435
States : FFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUFFUUUUUUUUUUUUUUFFFFFFFFF
Viterbi: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Symbols: 5146526666
States : FFUUUUUUUU
Viterbi: FFFFFFUUUU

1.31 Three basic questions for HMMs

Let $M$ be an HMM and $x$ a sequence of symbols.

(Q1) For $x$, determine the most probable state path through $M$: Viterbi algorithm

(Q2) Compute the probability with which $x$ is generated by $M$, $P(x) = P(x \mid M)$: forward algorithm

(Q3) Given $x$ and possibly further sequences, how are the parameters of $M$ trained? E.g., Baum-Welch algorithm

1.32 Computing P(x | M)

Given an HMM $M$ and a sequence $x$. The probability that $x$ was generated by $M$ is:

$$P(x \mid M) = \sum_{\pi} P(x, \pi \mid M),$$

where we have to sum over all state paths $\pi$ through $M$! (Exercise: how fast does the number of paths grow with increasing length?)

1.33 Forward algorithm

This algorithm is obtained from the Viterbi algorithm by replacing the max with a sum. We consider the following forward variable:

$$f_k(i) = P(x_1 \dots x_i, \pi_i = k),$$

which gives the probability of emitting the prefix $(x_1, \dots, x_i)$ and reaching the state $\pi_i = k$.

The recursion is:

$$f_l(i+1) = e_l(x_{i+1}) \sum_{k \in Q} f_k(i)\, a_{kl}.$$

[Figure: the values $f_p(i)$, $f_q(i)$, $f_r(i)$, $f_s(i)$ of the predecessor states enter $f_l(i+1)$ via the transitions $a_{kl}$]



1.34 Forward algorithm

Input: HMM $M = (S, Q, A, e)$ and symbol sequence $x$
Output: probability $P(x \mid M)$

Initialization ($i = 0$): $f_0(0) = 1$, $f_k(0) = 0$ for $k \neq 0$.

For all $i = 1, \dots, L$, $l \in Q$: $f_l(i) = e_l(x_i) \sum_{k \in Q} \bigl(f_k(i-1)\, a_{kl}\bigr)$

Result: $P(x \mid M) = \sum_{k \in Q} \bigl(f_k(L)\, a_{k0}\bigr)$

Implementation note: logarithms cannot be used as elegantly here, but there are also scaling methods.

This answers question Q2!
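The forward recursion can be sketched for the casino model as follows. As in any such sketch for this example, the start probabilities `START` (for $a_{0k}$) and the termination factors `END` (for $a_{k0}$, set to 1 so that the end is not modelled) are assumptions. Plain probabilities are used, which is fine for short sequences but underflows for long ones.

```python
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}     # a_0k, assumed
END = {"F": 1.0, "U": 1.0}       # a_k0, assumed: end not modelled
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 0.1 for s in "12345"}, "6": 0.5},
}


def forward(x):
    """P(x | M) by the forward recursion, for non-empty x."""
    # f[k] holds f_k(i) for the current position i, starting with i = 1.
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}
    for symbol in x[1:]:
        f = {
            l: EMIT[l][symbol] * sum(f[k] * TRANS[k][l] for k in STATES)
            for l in STATES
        }
    return sum(f[k] * END[k] for k in STATES)


print(forward("6"))    # 0.5 * 1/6 + 0.5 * 1/2
print(forward("66"))   # sum over the four paths FF, FU, UF, UU
```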

1.35 Backward algorithm

The backward variable contains the probability of generating the suffix $(x_{i+1}, \dots, x_L)$ when starting from the state $\pi_i = k$:

$$b_k(i) = P(x_{i+1} \dots x_L \mid \pi_i = k).$$

Input: HMM $M = (S, Q, A, e)$ and symbol sequence $x$
Output: probability $P(x \mid M)$

Initialization ($i = L$): $b_k(L) = a_{k0}$ for all $k$.

For all $i = L-1, \dots, 1$, $k \in Q$: $b_k(i) = \sum_{l \in Q} a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$

Result: $P(x \mid M) = \sum_{l \in Q} \bigl(a_{0l}\, e_l(x_1)\, b_l(1)\bigr)$

[Figure: $b_k(i)$ is obtained via the transitions $a_{kl}$ from the values $b_p(i+1)$, $b_q(i+1)$, $b_r(i+1)$, $b_s(i+1)$ of the successor states]
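The backward recursion runs over the sequence from right to left and must give the same $P(x \mid M)$ as the forward algorithm. A sketch for the casino model, again with assumed start probabilities `START` and assumed termination factors `END` equal to 1:

```python
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}     # a_0k, assumed
END = {"F": 1.0, "U": 1.0}       # a_k0, assumed: end not modelled
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 0.1 for s in "12345"}, "6": 0.5},
}


def backward(x):
    """P(x | M) by the backward recursion, for non-empty x."""
    b = dict(END)                              # b_k(L) = a_k0
    for symbol in reversed(x[1:]):             # symbols x_L, ..., x_2
        b = {
            k: sum(TRANS[k][l] * EMIT[l][symbol] * b[l] for l in STATES)
            for k in STATES
        }
    # b now holds b_k(1); close the recursion with the start transitions.
    return sum(START[l] * EMIT[l][x[0]] * b[l] for l in STATES)


print(backward("6"))
print(backward("66"))
```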

1.36 Comparison of the three variables

Viterbi $v_k(i)$: probability that the most probable state path generates the symbol sequence $(x_1, x_2, \dots, x_i)$ and the system is in state $k$ at time $i$.

Forward $f_k(i)$: probability that the symbol sequence $x_1, \dots, x_i$ is generated and the system is in state $k$ at time $i$.

Backward $b_k(i)$: probability of starting in state $k$ at time $i$ and then generating the symbol sequence $x_{i+1}, \dots, x_L$.



1.37 Posterior probabilities

Given an HMM $M$ and a symbol sequence $x$. Let $P(\pi_i = k \mid x)$ be the probability that the symbol $x_i$ was emitted in state $\pi_i = k$. It is called the posterior probability, since it is computed after observing the sequence $x$.

We have:

$$P(\pi_i = k \mid x) = \frac{P(\pi_i = k, x)}{P(x)} = \frac{f_k(i)\, b_k(i)}{P(x)},$$

since $P(g, h) = P(g \mid h) P(h)$ and by the definition of the forward and backward variables.

1.38 Decoding with posterior probabilities

There are alternatives to Viterbi decoding, which are useful e.g. when there are many paths that are about as probable as $\pi^*$.

We define a state sequence $\hat{\pi}$ by

$$\hat{\pi}_i = \arg\max_{k \in Q} P(\pi_i = k \mid x),$$

i.e., at every position we choose the state that is most probable at that position.

This decoding is useful if we are interested in the state at a given position $i$ and not in the whole sequence.

Caution: if some state transitions are not allowed by the transition matrix (i.e., $a_{kl} = 0$), it can happen that the path $\hat{\pi}$ is invalid, i.e. occurs with probability 0!
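Posterior decoding combines the full forward and backward tables. The following sketch does this for the casino model; the start probabilities `START` and the termination factors `END` (set to 1) are assumptions, and no scaling is applied, so it is only suitable for short sequences.

```python
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}     # a_0k, assumed
END = {"F": 1.0, "U": 1.0}       # a_k0, assumed: end not modelled
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {s: 1 / 6 for s in "123456"},
    "U": {**{s: 0.1 for s in "12345"}, "6": 0.5},
}


def posterior_decode(x):
    """Return (list of posteriors P(pi_i = k | x), decoded path) for non-empty x."""
    # Forward table: f[i][k] corresponds to f_k(i + 1).
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for symbol in x[1:]:
        f.append({
            l: EMIT[l][symbol] * sum(f[-1][k] * TRANS[k][l] for k in STATES)
            for l in STATES
        })
    # Backward table, filled from the right: b[i][k] corresponds to b_k(i + 1).
    b = [dict(END)]
    for symbol in reversed(x[1:]):
        b.insert(0, {
            k: sum(TRANS[k][l] * EMIT[l][symbol] * b[0][l] for l in STATES)
            for k in STATES
        })
    px = sum(f[-1][k] * END[k] for k in STATES)
    posteriors = [
        {k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))
    ]
    path = "".join(max(STATES, key=p.get) for p in posteriors)
    return posteriors, path


posteriors, path = posterior_decode("66")
print(path, posteriors)
```

Because every position is decided independently, the returned path need not be a valid path when some transitions have probability 0, as cautioned above.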

1.39 Parameter estimation

How is an HMM constructed?

First step: the “topology” is fixed, i.e. the choice of the states and of the connections between them.

Second step: choice of the parameter values, i.e. of the transition probabilities $a_{kl}$ and the emission probabilities $e_k(b)$.

We consider the second step. Given a set of example sequences, the goal is to “train” the parameters of an HMM on the example sequences, i.e. to choose the parameters such that the probability with which the HMM generates the example sequences is maximized.

1.40 Parameter estimation with known state sequence

Let $M = (S, Q, A, e)$ be an HMM.

Given a list of symbol sequences $x^1, x^2, \dots, x^n$ and a list of corresponding paths $\pi^1, \pi^2, \dots, \pi^n$. (E.g., DNA sequences with annotated CpG islands.)

We want to choose the parameters $(A, e)$ of the HMM $M$ optimally, i.e. such that:

$$P(x^1, \dots, x^n, \pi^1, \dots, \pi^n \mid M = (S, Q, A, e)) = \max_{(A', e')} P(x^1, \dots, x^n, \pi^1, \dots, \pi^n \mid M = (S, Q, A', e')).$$

We are thus looking for the so-called maximum likelihood estimator (ML estimator) for $(A, e)$.

1.41 ML estimation for (A, e)

(Note: if we regard $P(D \mid M)$ as a function of $D$, we speak of a probability; as a function of $M$, we speak of a likelihood.)

ML estimation:

$$(A, e)^{ML} = \arg\max_{(A', e')} P(x^1, \dots, x^n, \pi^1, \dots, \pi^n \mid M = (S, Q, A', e')).$$

Computation:

$A_{kl}$: number of transitions from state $k$ to $l$
$E_k(b)$: number of emissions of $b$ in state $k$

We set the parameters for $M$:

$$\bar{a}_{kl} = \frac{A_{kl}}{\sum_{q \in Q} A_{kq}} \quad \text{and} \quad \bar{e}_k(b) = \frac{E_k(b)}{\sum_{s \in S} E_k(s)}.$$
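Counting and normalizing can be sketched as follows, using the short annotated casino sequence from the training example of these notes as input. The function name is hypothetical; rows without any observation are returned as `None`, a choice of this sketch that makes the undefined case explicit.

```python
from collections import Counter


def ml_estimate(x, path, states="0FU", alphabet="123456"):
    """Count transitions A_kl and emissions E_k(b), then normalize each row."""
    full = "0" + path + "0"                    # begin and end in state 0
    A = Counter(zip(full, full[1:]))           # A[(k, l)]
    E = Counter(zip(path, x))                  # E[(k, b)]
    a, e = {}, {}
    for k in states:
        total = sum(A[(k, l)] for l in states)
        a[k] = {l: A[(k, l)] / total for l in states} if total else None
        total = sum(E[(k, b)] for b in alphabet)
        e[k] = {b: E[(k, b)] / total for b in alphabet} if total else None
    return a, e


x = "12534612663215"
pi = "FFFFFFFUUUUFFF"
a_bar, e_bar = ml_estimate(x, pi)
print(a_bar["F"])   # transitions out of the fair state
print(e_bar["U"])   # emissions of the unfair state
```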

1.42 Training the fair/unfair HMM

Given training data $x$ and $\pi$:

Symbols x: 1 2 5 3 4 6 1 2 6 6 3 2 1 5
States pi: F F F F F F F U U U U F F F

To be filled in: the table of state transition counts $A_{kl}$ (rows and columns 0, F, U) and the table of emission counts $E_k(b)$ (rows 0, F, U; columns 1 to 6), and from them the normalized parameters $\bar{a}_{kl}$ and $\bar{e}_k(b)$.

1.43 Training the fair/unfair HMM

Given training data $x$ and $\pi$:

Symbols x: 1 2 5 3 4 6 1 2 6 6 3 2 1 5
States pi: F F F F F F F U U U U F F F


State transitions:

A_kl   0   F   U
0      0   1   0
F      1   8   1
U      0   1   3

→

ā_kl   0      F      U
0      0      1      0
F      1/10   8/10   1/10
U      0      1/4    3/4

Emissions:

E_k(b)  1   2   3   4   5   6
F       3   2   1   1   2   1
U       0   1   1   0   0   2

→

ē_k(b)  1     2     3     4     5     6
F       .3    .2    .1    .1    .2    .1
U       0     1/4   1/4   0     0     1/2

1.44 Pseudocounts

One problem is overfitting. For example, if a transition k → l does not occur in the training set, then \(\bar{a}_{kl} = 0\) is set and this transition is then considered "forbidden".

If a state k does not occur in the training set at all, then \(\bar{a}_{kl}\) is undefined for all l!

To solve these problems, one defines pseudocounts \(r_{kl}\) and \(r_k(b)\) and then sets:

\(A_{kl}\) = number of transitions from k to l in the training set + \(r_{kl}\)

\(E_k(b)\) = number of emissions of b from k in the training set + \(r_k(b)\).

Small pseudocounts correspond to "little prior knowledge", large pseudocounts correspond to "much prior knowledge".
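As a small illustration of the effect, the following sketch adds a constant pseudocount r to every emission count before normalizing. The helper name `estimate_with_pseudocounts` and the choice of a single constant r for all cells are assumptions for this example; the counts are those of state U from the fair/unfair training data.

```python
def estimate_with_pseudocounts(counts, states, alphabet, r=1.0):
    """Normalize emission counts after adding the pseudocount r to every cell."""
    alphabet = list(alphabet)
    e = {}
    for k in states:
        total = sum(counts.get((k, b), 0) + r for b in alphabet)
        for b in alphabet:
            e[k, b] = (counts.get((k, b), 0) + r) / total
    return e

# Emission counts of state U in the training example (symbols 1, 4, 5 were never seen).
E_U = {("U", 2): 1, ("U", 3): 1, ("U", 6): 2}
e_hat = estimate_with_pseudocounts(E_U, ["U"], range(1, 7), r=1.0)
print(e_hat["U", 1], e_hat["U", 6])
```

With r = 1 the unseen symbols 1, 4 and 5 receive probability 1/10 instead of 0, and symbol 6 drops from 1/2 to 3/10.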

1.45 Parameter estimation when the state sequence is unknown

In practice one only has symbol sequences and does not know the corresponding state paths.

Given symbol sequences \(x^1, x^2, \dots, x^n\) for which we do NOT know the state paths \(\pi^1, \dots, \pi^n\).

The problem of choosing the parameters (A, e) of an HMM M optimally, i.e. such that

\[ P(x^1, \dots, x^n \mid M = (S, Q, A, e)) = \max_{(A', e')} P(x^1, \dots, x^n \mid M = (S, Q, A', e')) \]

holds, is NP-hard!

1.46 Log-likelihood

Given symbol sequences \(x^1, x^2, \dots, x^n\).

Let \(M = (S, Q, A, e)\) be an HMM. We define the score of the model M as:

\[ l(x^1, \dots, x^n) = \log P(x^1, \dots, x^n \mid (A, e)) = \sum_{j=1}^{n} \log P(x^j \mid (A, e)). \]

(Here we assume that the symbol sequences are independent, so that \(P(x^1, \dots, x^n) = P(x^1) \cdot \ldots \cdot P(x^n)\) holds.)

The goal is now to optimize the parameters (A, e) such that this score, the log-likelihood, is maximized.

1.47 Overview: Baum-Welch algorithm

Let \(M = (S, Q, A, e)\) be an HMM and let training sequences \(x^1, x^2, \dots, x^n\) be given. The parameters (A, e) are to be improved iteratively, as follows:

- Based on \(x^1, \dots, x^n\) and the current parameters (A, e), expected values for \(A_{kl}\) and \(E_k(b)\) are estimated, taken over all possible state paths \(\pi^1, \dots, \pi^n\).

- We then set \((A', e') \leftarrow (\bar{A}, \bar{e})\).

- This is repeated until a stopping criterion is met.

This is a special case of the EM technique (expectation maximization).

1.48 Baum-Welch algorithm

Input: HMM \(M = (S, Q, A, e)\), training sequences \(x^1, x^2, \dots, x^n\), optionally pseudocounts \(r_{kl}\) and \(r_k(b)\)

Output: HMM \(M' = (S, Q, A', e')\) with improved score.

Recursion:

Set A and E to 0, or to their pseudocounts if given.

For every sequence \(x^j\):

Compute \(f_k^j(i)\) for \(x^j\) with the forward algorithm.

Compute \(b_k^j(i)\) for \(x^j\) with the backward algorithm.

Add both contributions to the sums:

\[ A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1) \]

\[ E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f_k^j(i)\, b_k^j(i) \]

Compute the new model parameters \((A', e') \leftarrow (\bar{A}, \bar{e})\).

Compute the new log-likelihood \(l(x^1, \dots, x^n \mid (A', e'))\).

Termination:

Stop when the score has not improved, or when the maximum number of iterations has been reached.
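The steps above can be sketched in plain Python. This is a simplified illustration rather than the lecture's exact formulation: it assumes a fixed initial state distribution `start` in place of the silent begin state, has no end state, and uses no scaling or log-space arithmetic, so it is only suitable for short sequences. All function and parameter names are chosen for this example. States and symbols are 0-based indices, `a[k][l]` and `e[k][b]` are nested lists.

```python
import math

def forward(x, start, a, e):
    """Return f[i][k] = P(x_0..x_i, pi_i = k) and P(x)."""
    K = len(start)
    f = [[start[k] * e[k][x[0]] for k in range(K)]]
    for i in range(1, len(x)):
        f.append([e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in range(K))
                  for l in range(K)])
    return f, sum(f[-1])

def backward(x, a, e):
    """Return b[i][k] = P(x_{i+1}..x_{L-1} | pi_i = k)."""
    K = len(a)
    b = [[1.0] * K for _ in x]
    for i in range(len(x) - 2, -1, -1):
        b[i] = [sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in range(K))
                for k in range(K)]
    return b

def baum_welch_step(seqs, start, a, e, r=0.0):
    """One re-estimation step; returns new (a, e) and the log-likelihood of the old ones."""
    K, M = len(a), len(e[0])
    A = [[r] * K for _ in range(K)]          # expected transition counts (+ pseudocount r)
    E = [[r] * M for _ in range(K)]          # expected emission counts (+ pseudocount r)
    loglik = 0.0
    for x in seqs:
        f, px = forward(x, start, a, e)
        b = backward(x, a, e)
        loglik += math.log(px)
        for i in range(len(x)):
            for k in range(K):
                E[k][x[i]] += f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in range(K):
                        A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
    new_a = [[v / sum(row) for v in row] for row in A]
    new_e = [[v / sum(row) for v in row] for row in E]
    return new_a, new_e, loglik

def baum_welch(seqs, start, a, e, max_iter=50, tol=1e-6):
    """Iterate until the log-likelihood improves by less than tol or max_iter is reached."""
    history = []                             # log-likelihood before each update
    for _ in range(max_iter):
        new_a, new_e, loglik = baum_welch_step(seqs, start, a, e)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        a, e = new_a, new_e
    return a, e, history
```

A call such as `baum_welch([x], [0.5, 0.5], a0, e0)` with two states (fair, unfair) and six symbols returns the re-estimated matrices together with the sequence of log-likelihood values, which should never decrease from one iteration to the next.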

1.49 Explanation

Given x. The probability that the transition from k to l is used at position i, i.e. that \(\pi_i = k\) and \(\pi_{i+1} = l\), is:

\[ P(\pi_i = k, \pi_{i+1} = l \mid x, (A, e)) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}. \]

Summing over all positions and all sequences, the expected number of transitions from k to l is therefore:

\[ A_{kl} = \sum_{j=1}^{n} \frac{1}{P(x^j)} \sum_{i=1}^{L_j} f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1) \]

(Reminder:

\[ P(\pi_i = k \mid x) = \frac{P(\pi_i = k, x)}{P(x)} = \frac{f_k(i)\, b_k(i)}{P(x)}. \])

1.50 Convergence

Remark: One can prove that, when using the Baum-Welch algorithm, the log-likelihood score converges to a local maximum.

However, the parameters themselves do not necessarily converge!

The risk of ending up in a poor local maximum can be reduced by trying several different starting points.

One can, of course, also use other standard optimization methods to solve the optimization problem.

1.51 Protein primary sequences

Given the amino acid sequence of a new protein P. We would like to learn the biological function of P.

SWISS-PROT: Curated protein sequence data which strives to provide a high level of annotations, a minimal level of redundancy and a high level of integration with other databases. SIB Switzerland.

TrEMBL: Computer-annotated supplement to SWISS-PROT.

PIR: Protein Information Resource. Comprehensive, non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. A division of the National Biomedical Research Foundation, Georgetown University.

OWL: Composite protein sequence database, a non-redundant composite of (in this order of priority) SWISS-PROT, PIR, GenBank (translation) and NRL-3D. University of Manchester.

1.52 Protein secondary structure

PRINTS: A compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family. University of Manchester.

SPRINT: Provides an interface to the PRINTS-S database. PRINTS-S is the relational cousin of the PRINTS data bank of protein family fingerprints. University of Manchester.

PROSITE: Database of protein families and domains, consisting of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. (E.g., it uses regular expressions to describe patterns.)

BLOCKS: The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in the Prosite Database. Blocks are calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. Fred Hutchinson Cancer Research Center in Seattle.

Pfam: Large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. Pfam 6.6 contains alignments and models for 3071 protein families. Washington University in St. Louis.

ProDom: The protein domain database. 390 ProDom families were generated automatically using PSI-BLAST. Toulouse.

InterPro: Built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, and SWISS-PROT+TrEMBL.

1.53 Three-dimensional protein structure

SCOP: Structural classification of proteins. Cambridge University.

CATH: Protein structure classification. University College London.

PDB: Protein Data Bank. Single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.

PDBsum: Contains summary information and derived data on entries in the Protein Data Bank.

PIR-NRL3D: This sequence-structure database is produced from sequence and annotation extracted from three-dimensional structures in PDB.

1.54 Protein identification

#A-helices ...........AAAAAAAAAAAAAAAA...BBBBBBBBBBBBBBBBCCCCCCCCCCC....DDDDDDDEEEEEEEEEEEE
GLB1_GLYDI .........GLSAAQRQVIAATWKDIAGADNGAGVGKDCLIKFLSAHPQMAAVFG.FSG....AS...DPGVAALGAKVL
HBB_HUMAN  ........VHLTPEEKSAVTALWGKV....NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVL
HBA_HUMAN  .........VLSPADKTNVKAAWGKVGA..HAGEYGAEALERMFLSFPTTKTYFPHF.DLS.....HGSAQVKGHGKKVA
MYG_PHYCA  .........VLSEGEWQLVLHVWAKVEA..DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYS..TYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERII
GLB3_CHITP ..........LSADQISTVQASFDKVKG......DPVGILYAVFKADPSIMAKFTQFAG.KDLESIKGTAPFETHANRIV
LGB2_LUPLU ........GALTESQAALVKSSWEEFNA..NIPKHTHRFFILVLEIAPAAKDLFS.FLK.GTSEVPQNNPELQAHAGKVF

#A-helices EEEEEEEEE............FFFFFFFFFFFF..FFGGGGGGGGGGGGGGGGGGG.....HHHHHHHHHHHHHHHHHHH
GLB1_GLYDI AQIGVAVSHL..GDEGKMVAQMKAVGVRHKGYGNKHIKAQYFEPLGASLLSAMEHRIGGKMNAAAKDAWAAAYADISGAL
HBB_HUMAN  GAFSDGLAHL...D..NLKGTFATLSELHCDKL..HVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL
HBA_HUMAN  DALTNAVAHV...D..DMPNALSALSDLHAHKL..RVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVL
MYG_PHYCA  TALGAILKK....K.GHHEAELKPLAQSHATKH..KIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDI
GLB5_PETMA NAVNDAVASM..DDTEKMSMKLRDLSGKHAKSF..QVDPQYFKVLAAVIADTVAAG.........DAGFEKLMSMICILL
GLB3_CHITP GFFSKIIGEL..P...NIEADVNTFVASHKPRG...VTHDQLNNFRAGFVSYMKAHT..DFA.GAEAAWGATLDTFFGMI
LGB2_LUPLU KLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG...VADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVI

#A-helices HHHHHHH....
GLB1_GLYDI ISGLQS.....
HBB_HUMAN  AHKYH......
HBA_HUMAN  TSKYR......
MYG_PHYCA  AAKYKELGYQG
GLB5_PETMA RSAY.......
GLB3_CHITP FSKM.......
LGB2_LUPLU KKEMNDAA...

Alignment of seven globin sequences. How can this family be characterized?

1.55 Characterization?

Representative sequence?

Consensus sequence?

Regular expression (Prosite):

LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...

...[FI]-[AN]-x(1,3)-N-[IG]-[AP]-[GK]-[HV]...

HMM?
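To see how such a Prosite-style pattern behaves, it can be translated into an ordinary regular expression. The sketch below handles only the pattern elements used here (residue classes in brackets, single residues, and the wildcard x with an optional repeat range); the function name `prosite_to_regex` is hypothetical. The gap is written as x(1,3) because GLB1_GLYDI has three residues (GAD) between the A and the N, while LGB2_LUPLU has one.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern ([..], x, x(m), x(m,n), letters) to a regex."""
    parts = []
    for elem in pattern.split("-"):
        m = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", elem)
        if m:                                   # wildcard, optionally with a repeat range
            lo, hi = m.group(1), m.group(2)
            if lo is None:
                parts.append(".")
            elif hi is None:
                parts.append(".{%s}" % lo)
            else:
                parts.append(".{%s,%s}" % (lo, hi))
        else:                                   # single residue or residue class like [FI]
            parts.append(elem)
    return re.compile("".join(parts))

regex = prosite_to_regex("[FI]-[AN]-x(1,3)-N-[IG]-[AP]-[GK]-[HV]")
for name, fragment in [("LGB2_LUPLU", "FNANIPKH"), ("GLB1_GLYDI", "IAGADNGAGV")]:
    print(name, bool(regex.search(fragment)))   # gap characters removed from the fragments
```

The pattern accepts both sequences, but it says nothing about how likely each residue is at each position, which is what an HMM adds.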

1.56 Simple HMM

HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
"Matches":     ***  *****

We first consider a simple HMM that corresponds to a PSSM (position-specific score matrix):

[Figure: a chain of eight match states, one per match column, labeled with the amino acids V F I | A E G K Y | A G S | D H N T | A G I V Y | A D E G P | E G K T | D H S V Y.]

(The listed amino acids have an increased emission probability.)

1.57 Insert states

So-called insert states are introduced, which emit symbols according to the background probabilities.

[Figure: a Begin state followed by the chain of eight match states (labeled as before), extended by insert states.]

This makes it possible to model additional sequence segments outside the important domains.

1.58 Delete states

So-called delete states are introduced, which are silent, i.e. they emit no symbols.

[Figure: a Begin state followed by the chain of eight match states (labeled as before), extended by delete states.]

This makes it possible to model the absence of individual domains.

1.59 Topology of a profile HMM

[Figure: profile HMM starting at the Begin state. Legend: match state, insert state, delete state.]

1.60 Designing a profile HMM

Given a multiple alignment of a family of sequences.

First we must decide which positions are modeled as match states and which as insert states. Rule of thumb: columns with more than 50% gaps should be modeled as insert states.

The transition probabilities and emission probabilities can be determined by counting the observed transitions \(A_{kl}\) and emissions \(E_k(b)\):

\[ a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \quad\text{and}\quad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}. \]

Here, too, it can happen that certain transitions or emissions are not observed. We use Laplace's rule and add 1 to every count.
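A minimal sketch of the first two design steps, using the globin fragment from section 1.56: columns with at most 50% gaps become match columns, and the match emissions are estimated from the column counts with Laplace's rule. The function names and the 20-letter alphabet string are assumptions for this example, and transition probabilities are left out.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"   # assumed alphabet: the 20 standard amino acids

def match_columns(alignment, max_gap_fraction=0.5):
    """Columns with more than max_gap_fraction gaps are treated as insert columns."""
    n = len(alignment)
    return [c for c in range(len(alignment[0]))
            if sum(row[c] == "-" for row in alignment) / n <= max_gap_fraction]

def match_emissions(alignment, columns, pseudocount=1):
    """Emission probabilities per match column, adding `pseudocount` to every count."""
    probs = []
    for c in columns:
        counts = Counter(row[c] for row in alignment if row[c] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO)
        probs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO})
    return probs

alignment = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH", "VKG------D",
             "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]
cols = match_columns(alignment)
emissions = match_emissions(alignment, cols)
print(cols)
print(round(emissions[0]["V"], 3), round(emissions[0]["W"], 3))
```

The two gap-rich columns are excluded, leaving eight match columns. In the first of them V was seen five times, so its estimate is (5 + 1) / (7 + 20), while an unseen residue such as W still gets 1 / 27.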

2 Suffix trees

History

Weiner 1973: linear-time algorithm

McCreight 1976: reduced space

Ukkonen 1995: new algorithm, easier to describe

References

- Dan Gusfield, Algorithms on strings, trees and sequences, Cambridge, 1997.

- R. Giegerich, S. Kurtz and J. Stoye, Efficient Implementation of Lazy Suffix Trees, WAE'99, LNCS 1668, pp. 30-42, 1999, see link from webpage.

- S. Kurtz, Foundations of sequence analysis, Bielefeld (2001), see link.

2.1 Importance of sequence comparison

The first fact of biological sequence analysis: In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.

"Duplication with modification": The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications. As a result, redundancy is a built-in characteristic of protein sequences, and we should not be surprised that so many new sequences resemble already known sequences. ...all of biology is based on enormous redundancy...

We didn't know it at the time, but we found everything in life is so similar, that the same genes that work in flies are the ones that work in humans. (Eric Wieschaus, co-winner of the 1995 Nobel prize in medicine for work on the genetics of Drosophila development.)

Dan Gusfield, 1997, 212 ff

2.2 Searching for short queries in a long text

Problem: Given a long text t and many short queries \(q_1, \dots, q_k\). For each query sequence \(q_i\), find all its occurrences in t.

We would like to have a data structure that allows us to solve this problem efficiently.

Example: The text t is a genomic sequence and the queries are short signals such as transcription factor binding sites, splice sites etc.

Important applications are in the comparison of genomes (in programs such as MUMmer, which computes maximal unique matches) and in the analysis of repeats.

2.3 Basic definitions

Let \(\Sigma\) denote an alphabet and \(\Sigma^*\) the set of strings over \(\Sigma\). Let \(\varepsilon\) denote the empty string and \(\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}\).

Let \(t = t_1 t_2 \dots t_n\) be the text and let \(\$ \in \Sigma\) be a sentinel character that does not occur in t.

For \(i \in \{1, 2, \dots, n+1\}\), let \(s_i = t_i \dots t_n \$\) denote the i-th suffix of t$.

2.4 The role of suffixes

Consider the text abab$

It has the following suffixes:

abab$, bab$, ab$, b$, $

To determine whether a given query q is contained in the text, we check whether q is the prefix of one of the suffixes.

E.g., the query ab is the prefix of both abab$ and ab$.
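The check can be written down directly, without any index structure. The sketch below (the function name `occurrences` is chosen for this example) tests every suffix of t$ and reports 1-based start positions; it takes time proportional to the text length times the query length per query, which is exactly what a suffix tree is meant to avoid.

```python
def occurrences(text, query):
    """1-based start positions of all suffixes of text + '$' that have query as a prefix."""
    t = text + "$"
    return [i + 1 for i in range(len(t)) if t.startswith(query, i)]

print(occurrences("abab", "ab"))   # suffixes abab$ and ab$
print(occurrences("abab", "ba"))   # suffix bab$
```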

To speed up the search for all suffixes that have the query as a prefix, we use a tree structure to share common prefixes between the suffixes.

2.5 Sharing prefixes

(a) The suffixes abab$ and ab$ both share the prefix ab.

(b) The suffixes bab$ and b$ both share the prefix b.

(c) The suffix $ doesn't share a prefix.

[Figure: three partial trees. (a) An edge ab followed by leaf edges ab$ and $. (b) An edge b followed by leaf edges ab$ and $. (c) A single leaf edge $.]

2.6 First example

[Figure: the suffixes abab$, ab$, bab$, b$ and $ (leaves 1, 3, 2, 4, 5) drawn as separate paths, and the suffix tree obtained by merging them, with leaves 1, 3, 2, 4, 5.]

The suffix tree for abab$ is obtained by sharing prefixes wherever possible. The leaves are annotated with the positions of the corresponding suffixes in the text.

2.7 \(\Sigma^+\)-tree

A \(\Sigma^+\)-tree T is a finite, directed tree with root root. Its edges are labeled with strings in \(\Sigma^+\), such that: For every letter \(a \in \Sigma\) and node u there exists at most one a-edge \(u \xrightarrow{as} w\) (for some string s and some node w).

[Figure: an edge labeled av leading from node u to node w.]

A leaf is a node with no children and an edge leading to a leaf is called a leaf edge. A node with at least two children is called a branching node.

2.8 Naming nodes by strings

Let u be a node of T. We use the name \(\overline{s}\) for u, if s is the concatenation of all labels of the edges along the path from the root of the tree to u.

For example \(u = \overline{pqr}\):

[Figure: a path from root along edges labeled p, q and r to the node u.]

The root is called \(\overline{\varepsilon}\).

Definition: A string g is said to occur in T, if there exists a string h such that \(\overline{gh}\) is a node in T.

2.9 Suffix tree

Definition: A suffix tree ST(t) for t is a \(\Sigma^+\)-tree with the following properties:

1. Every node is either a leaf or a branching node, and

2. a string s occurs in ST(t) ⇔ s is a substring of t$.

There exists a one-to-one correspondence between the non-empty suffixes of t$ and the leaves of ST(t).

For every leaf \(\overline{s_j}\) we define \(l(\overline{s_j}) = \{j\}\). Recursively, for every branching node \(\overline{u}\) we define:

\[ l(\overline{u}) = \{ j \mid \overline{u} \xrightarrow{v} \overline{uv} \text{ is an edge of } ST(t),\ j \in l(\overline{uv}) \}. \]

In other words,

\[ l(u) = \bigcup_{v \text{ is child of } u} l(v). \]

We call l(u) the leaf set of u.
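The recursive definition of the leaf set translates directly into code. The sketch below assumes a simple, hypothetical tree representation that is not the one used later in these notes: a branching node is a dictionary mapping edge labels to children, and a leaf is the integer position j of its suffix. The tree shown is the suffix tree for abab$.

```python
def leaf_set(node):
    """l(u): the suffix positions of all leaves below the node u."""
    if isinstance(node, int):          # a leaf s_j has l(s_j) = {j}
        return {node}
    result = set()
    for child in node.values():        # union over all children of u
        result |= leaf_set(child)
    return result

# Suffix tree for abab$ in the assumed nested-dictionary form.
tree = {"ab": {"ab$": 1, "$": 3}, "b": {"ab$": 2, "$": 4}, "$": 5}
print(leaf_set(tree["ab"]), leaf_set(tree))
```

The leaf set of the node ab is {1, 3}, the start positions of the two occurrences of ab in the text.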

2.10 Example

Text: xabxac

[Figure: suffix tree for xabxac$.]

2.11 Idea: Compute tree recursively

Note that the subtree below a branching node \(\overline{u}\) is determined by the set of all suffixes of t$ that start with the prefix u:

[Figure: a branching node u with child edges labeled \(s_1, s_2, \dots, s_k\) leading to the nodes \(us_1, us_2, \dots, us_k\).]

So, if we know the set of remaining suffixes

\[ R(u) := \{ s \mid us \text{ is a suffix of } t\$ \}, \]

then we can evaluate the node u, i.e. construct the subtree below u.

2.12 The main evaluation step

An unevaluated node is evaluated as follows: We partition the set R(u) into groups by the first letter of the strings, i.e. for every letter \(c \in \Sigma\), we define the c-group as:

\[ R_c(u) := \{ cw \in \Sigma^* \mid cw \in R(u) \}. \]

Consider \(R_c(u)\) for \(c \in \Sigma\). If \(R_c(u) \neq \emptyset\), then there are two possible cases:

1. If \(R_c(u)\) contains precisely one string w, then we construct a new leaf edge starting at u and label it with w.

2. Otherwise, the set \(R_c(u)\) contains at least two different strings; let p denote their longest common prefix (lcp). We create a new c-edge with label p whose source node is u. The new node \(\overline{up}\) is left unevaluated, and its set \(R(up) = \{ w \mid pw \in R_c(u) \}\) will be processed (recursively) later.

2.13 Evaluating the root

This wotd-algorithm (write-only, top-down) starts by evaluating the root node, with R(root) equal to the set of all suffixes of t$. All nodes of ST(t) are then recursively constructed using the appropriate sets of remaining suffixes.
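The recursion can be sketched as follows. This is a deliberately naive illustration and not the space-efficient implementation described later: the sets R(u) are stored as explicit strings, a branching node is a dictionary from edge labels to children, and a leaf is the 1-based position of its suffix. The names `lcp`, `evaluate` and `wotd` are chosen for this example.

```python
def lcp(strings):
    """Longest common prefix of a non-empty collection of strings."""
    first, last = min(strings), max(strings)   # enough to compare the two extremes
    n = 0
    while n < len(first) and first[n] == last[n]:
        n += 1
    return first[:n]

def evaluate(remaining):
    """Build the subtree below a node u from R(u), given as {suffix string: position}."""
    groups = {}
    for s, pos in remaining.items():           # partition R(u) by first letter
        groups.setdefault(s[0], {})[s] = pos
    children = {}
    for group in groups.values():
        if len(group) == 1:                    # case 1: a single string gives a leaf edge
            (s, pos), = group.items()
            children[s] = pos
        else:                                  # case 2: edge labeled with the group's lcp
            p = lcp(list(group))
            children[p] = evaluate({s[len(p):]: pos for s, pos in group.items()})
    return children

def wotd(text):
    """Suffix tree of text + '$', built top-down starting from the root."""
    t = text + "$"
    return evaluate({t[i:]: i + 1 for i in range(len(t))})

print(wotd("abab"))
```

Because the sentinel $ occurs only at the end, no suffix is a prefix of another, so the strings left after stripping the lcp are never empty.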

2.14 Example

Text: abab

The algorithm proceeds as follows: We first evaluate the root node root using R(root) = {abab$, bab$, ab$, b$, $}. There are three groups of suffixes:

\(R_a(\mathit{root}) = \{abab\$, ab\$\}\), \(R_b(\mathit{root}) = \{bab\$, b\$\}\) and \(R_\$(\mathit{root}) = \{\$\}\).

The letter $ gives rise to a leaf edge with label $. The letter a gives rise to an internal edge with label ab, because \(ab = lcp(R_a(\mathit{root}))\). Similarly, for b we obtain an internal edge with label b.

For the node ab we have R(ab) = {ab$, $} and thus \(R_a(ab) = \{ab\$\}\) and \(R_\$(ab) = \{\$\}\). Because both of these sets have cardinality one, we obtain two new leaf edges with labels ab$ and $, respectively.

Similarly, we obtain two new leaf edges with labels ab$ and $ for the node b.

[Figure: evaluating the root with R(root) = {abab$, bab$, ab$, b$, $} yields the edges ab, b and $ together with R(ab) = {ab$, $} and R(b) = {ab$, $}; evaluating the nodes ab and b then yields the complete suffix tree.]

2.15 Properties of the wotd-algorithm

Complexity: Space requirement? Worst case time complexity? (Exercises!)

The expected running time is \(O(n \log_k n)\), with \(k = |\Sigma|\), and experimental studies indicate that the algorithm often performs in linear time for moderate sized strings.

Good memory locality.

The algorithm can be parallelized.

2.16 Suffix tree data structure

An implementation of a suffix tree must represent its nodes, edges and edge labels. To be able to describe the implementation, we define a total ordering on the set of children of a branching node:

Let \(\overline{u}\) and \(\overline{v}\) be two different children of the same node in ST(t). We write

\[ \overline{u} \prec \overline{v} \text{ iff } \min l(\overline{u}) < \min l(\overline{v}), \]

in other words, iff the first occurrence of u in t$ comes before the first occurrence of v in t$.

[Figure: the text t with the first occurrences of u and v marked at the positions min l(u) and min l(v).]

2.17 Representing the edge labels

Because an edge label s is a substring of the text t$, we could represent it by a pair of pointers (i, j) into \(t' = t\$\) such that \(s = t'_i t'_{i+1} \dots t'_j\).

However, note that we have j = n + 1 for any leaf edge and so in this case the right pointer is redundant.

Moreover, we can also get rid of the right pointers of internal edges, by defining a left pointer on the set of nodes (not edges) in such a way that these can be used to reconstruct the original left and right pointers of each edge.

2.18 Coding edge labels with one pointer

Consider an edge \(\overline{u} \xrightarrow{v} \overline{uv}\). Define the left pointer of \(\overline{uv}\) as the position p of the first occurrence of uv in t$ plus the length of u:

\[ lp(\overline{uv}) = \min l(\overline{uv}) + |u|. \]

This gives the start position i of a copy of v in t$.

To get the end position of v, consider the ≺-smallest child \(\overline{uvw}\) of \(\overline{uv}\). We have \(\min l(\overline{uv}) = \min l(\overline{uvw})\), i.e. the corresponding suffix starts at the same position p. By definition, we have \(lp(\overline{uvw}) = \min l(\overline{uvw}) + |uv|\) and the end position of v equals \(lp(\overline{uvw}) - 1\).

[Figure: the text t with the positions min l(uv) = min l(uvw), lp(uv) and lp(uvw) marked, delimiting the substrings u, v and w.]

2.19 The main data table

For each node u, we store a reference firstchild(u) to its smallest child.

We store the values of lp and firstchild together in a single (integer) table T. We store the values of all children of a given node u consecutively, ordered w.r.t. ≺. (We will indicate the last child of u by setting its lastchild-bit.)

So, only the edge from a given node u to its first child is represented explicitly. Edges from u to its other children are given implicitly and are found by scanning consecutive positions in T that follow the position of the smallest child.

We reference the node u using the index of the position in T that contains the value lp(u).

2.20 Example

The table T for ST(abab). All indices start at 1. The first value in T for a branching node u is lp(u), the second value is firstchild(u):

Node    ab      b       $    abab$  ab$   bab$  b$
T       1  6    2  8    5    3      5     3     5
Index   1  2    3  4    5    6      7     8     9
Bits                    ∗†   ∗      ∗†    ∗     ∗†

To be able to decode this representation of the suffix tree, we need two extra bits: A leaf-bit (∗) indicates that the given position in T corresponds to a leaf node and a lastchild-bit (†) indicates that the node at this position does not have a larger brother w.r.t. ≺.
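The layout of T can be reproduced with a short sketch. It starts from an already built tree in a hypothetical nested-dictionary form (branching node: dictionary from edge labels to children; leaf: 1-based suffix position) and writes the children of each node consecutively in ≺-order. Processing the branching nodes in breadth-first order is an assumption made here; the leaf and lastchild bits are omitted.

```python
from collections import deque

def min_leaf(node):
    """min l(u): the smallest suffix position below the node."""
    return node if isinstance(node, int) else min(min_leaf(c) for c in node.values())

def build_table(tree):
    """Return T as a list; the values stored in T are 1-based indices as in the text."""
    T = []
    queue = deque([(tree, 0, None)])   # (branching node, |u|, slot reserved for firstchild)
    while queue:
        node, depth, slot = queue.popleft()
        if slot is not None:
            T[slot] = len(T) + 1       # firstchild(u): index where the children start
        for label, child in sorted(node.items(), key=lambda kv: min_leaf(kv[1])):
            T.append(min_leaf(child) + depth)      # lp = min l(uv) + |u|
            if not isinstance(child, int):
                T.append(None)                     # placeholder for firstchild
                queue.append((child, depth + len(label), len(T) - 1))
    return T

tree = {"ab": {"ab$": 1, "$": 3}, "b": {"ab$": 2, "$": 4}, "$": 5}
print(build_table(tree))
```

For the tree of abab$ this yields the nine entries 1 6 2 8 5 3 5 3 5 of the table above.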



2.21 Storing an unevaluated node

We consider the wotd-algorithm as a process that evaluates the nodes of a suffix tree. It starts at the root and then evaluates all nodes recursively.

First we discuss how to store an unevaluated node u.

To be able to evaluate u, we (only) need to know the set of remaining suffixes R(u). To make these available, we define a global array called suffixes that contains pointers to suffixes in t$ and use it as follows: For every unevaluated node u, the suffixes array contains an interval of pointers to start positions in t$ that correspond precisely to the suffixes contained in R(u), in increasing order.

We can now represent R(u) in T using two numbers left(u) and right(u), which define an interval of entries in the suffixes array.

As a branching node, u will occupy two positions in T, one for lp(u), followed by one for firstchild(u). Until u is actually evaluated, we will use these two positions to store left(u) and right(u). We use a third bit called the unevaluated-bit to distinguish between unevaluated and evaluated nodes.

2.22 Evaluating a node u

We sort and count all entries of suffixes in the interval [left(u), right(u)], using the first letter of the suffixes as the sort key.

Each letter c that has count > 0 will give rise to a new c-edge from u. The suffixes in the c-group \(R_c(u)\) determine the tree below the new edge. As a result of the sort, the pointers corresponding to suffixes in \(R_c(u)\) are stored in a subinterval of [left(u), right(u)], ordered from left to right.

To determine the label of the c-edge, we determine the lcp of the c-group: If the c-group consists of only one suffix s, then s itself is the lcp. Otherwise, we step through a simple loop j = 1, 2, ... and check the equality of all letters \(t_{\mathit{suffixes}[i]+j}\) for all positions i of suffixes that belong to the c-group. As soon as a difference is detected, the loop is aborted and j is the length of the lcp.

For each non-empty c-group of u, we store one child <strong>in</strong> the table T, as follows:<br />

A c-group conta<strong>in</strong><strong>in</strong>g only one str<strong>in</strong>g gives rise to a leaf node v and we write the number lp(v)<br />

<strong>in</strong> the first available position of T. This number lp(s) equals the number stored <strong>in</strong> suffixes at<br />

the left most position of the <strong>in</strong>terval <strong>in</strong> suffixes that corresponds to the c-group.<br />

A c-group conta<strong>in</strong><strong>in</strong>g more than one node gives rise to branch<strong>in</strong>g node v and we store left(v)<br />

and right(v) <strong>in</strong> the first two available positions of T. The values of left and right were computed<br />

dur<strong>in</strong>g the sort and count step.<br />

Additionally, <strong>in</strong> preparation of the evaluation of v, we <strong>in</strong>crement all entries of suffixes with<strong>in</strong><br />

the <strong>in</strong>terval [left(v),right(v)] by the length of the lcp.<br />

F<strong>in</strong>ally, for u we replace the values left(u) and right(u) <strong>in</strong> T by lp(u) := suffixes[left(u)] and<br />

firstchild(u), and we clear the unevaluated-bit.<br />
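The sort-and-count step and the lcp loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `evaluate` and its return format are hypothetical, positions are 0-based (the notes use 1-based positions), the table T and its bits are not modelled, and the groups are returned in plain character order, so `$` sorts first here rather than last.

```python
from collections import defaultdict

def evaluate(text, pointers):
    """Split the remaining suffixes of a node into c-groups.

    text     -- the string t$ (0-based; the lecture notes count from 1)
    pointers -- start positions of the remaining suffixes R(u)
    Returns a list of (letter, group_pointers, lcp_length), sorted by letter.
    """
    groups = defaultdict(list)
    for p in pointers:                      # "sort and count" by first letter
        groups[text[p]].append(p)
    result = []
    for c in sorted(groups):
        group = groups[c]
        if len(group) == 1:
            lcp = len(text) - group[0]      # a single suffix is its own lcp
        else:
            lcp = 1
            # extend while all suffixes of the group agree on the next letter;
            # the unique terminator $ guarantees that a difference is found
            while len({text[p + lcp] for p in group}) == 1:
                lcp += 1
        result.append((c, group, lcp))
    return result
```

For the text abab$ the root yields an a-group with lcp length 2 (ab), a b-group with lcp length 1 (b) and a singleton $-group, which agrees with the sort-and-count step described above.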

2.23 Lazy vs. complete evaluation<br />

To build the complete suffix tree, we proceed depth-first, from left to right.



In a lazy approach, we only evaluate those nodes that are necessary to answer a query (and<br />

have not yet been evaluated).<br />

2.24 Example<br />

Input:

Text:      a b a b $
Position:  1 2 3 4 5

Initialization:

suffixes:  1 2 3 4 5
T:         (empty)

Evaluate(root):

Sort and count:
$R_a(root)$ = {1, 3}, lcp = ab
$R_b(root)$ = {2, 4}, lcp = b
$R_\$(root)$ = {5}

The suffixes are ordered, left and right are entered in the table and the three bits (u, ∗, †: unevaluated, leaf, lastchild) are set:

suffixes:  1 3 2 4 5

T:     1   2   3   4   5
bits:  u       u       ∗†

Add the length of the lcp to the suffixes entries of every branching node:

suffixes:  3 5 3 5 5

Evaluate(1):

$R_a(1)$ = {3}
$R_\$(1)$ = {5}

suffixes:  3 5 3 5 5

T:     1   2   3   4   5   3   5
bits:  (u)     u       ∗†  ∗   ∗†

Because lp(1) = 1 and firstchild(1) = 6, set:

T:     1   6   3   4   5   3   5
bits:          u       ∗†  ∗   ∗†

Evaluate(3):

$R_a(3)$ = {3}
$R_\$(3)$ = {5}

suffixes:  3 5 3 5 5

T:     1   6   3   4   5   3   5   3   5
bits:          (u)     ∗†  ∗   ∗†  ∗   ∗†

Because lp(3) = 2 and firstchild(3) = 8, set:

T:     1   6   2   8   5   3   5   3   5
bits:                  ∗†  ∗   ∗†  ∗   ∗†

Done!



2.25 Application: Finding MUMs

Problem: Given two sequences s and t, find all maximal unique matches (MUMs) between s and t.

A MUM is a sequence m that occurs precisely once in s and once in t, and is both right maximal and left maximal with this property (meaning that neither ma nor am has the uniqueness property, for any letter a).

To find all MUMs, generate the suffix tree T for sZt, where Z is a separator with Z ∉ s and Z ∉ t. Any path in T from the root to some node u that has precisely two children, one for a suffix starting in s and one for a suffix starting in t, corresponds to a right maximal unique match.

To determine whether u is left maximal, too, simply check whether the two preceding letters in s and t differ.
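To make the definition concrete, here is a brute-force sketch in Python. It does not use a suffix tree and is far slower than the approach described above; it only tests the definition directly on small inputs. The function names `count` and `mums` are hypothetical.

```python
def count(hay, needle):
    """Number of (possibly overlapping) occurrences of needle in hay."""
    return sum(hay.startswith(needle, i)
               for i in range(len(hay) - len(needle) + 1))

def mums(s, t):
    """All maximal unique matches between s and t, by brute force."""
    found = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            m = s[i:j]
            if count(s, m) != 1 or count(t, m) != 1:
                continue                      # m is not unique in both strings
            k = t.find(m)                     # the single occurrence in t
            # left maximal: preceding letters differ (or there is none)
            left = i == 0 or k == 0 or s[i - 1] != t[k - 1]
            # right maximal: following letters differ (or there is none)
            right = (j == len(s) or k + len(m) == len(t)
                     or s[j] != t[k + len(m)])
            if left and right:
                found.add(m)
    return found
```

Because m occurs only once in each string, checking the single pair of neighbouring letters is enough: if they agree, the extended string is again unique in both strings and m is not maximal.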

2.28 Ukkonen’s online construction

This lecture is based on: Stefan Kurtz, Foundations of sequence analysis, Bielefeld (2001)

We will now discuss an algorithm that constructs ST(t) in linear time. It operates online and generates

\[ ST(\varepsilon),\ ST(t_1),\ ST(t_1 t_2),\ \dots,\ ST(t_1 t_2 \dots t_n) \]

for all prefixes of t, without knowledge of the remaining part of the input string.

Induction: First, note that ST(ε) consists of a root node only. To completely define the algorithm we must describe the induction step

\[ ST(t_1 \dots t_i) \ \text{to}\ ST(t_1 \dots t_i t_{i+1}), \quad \text{for all } i. \]

For i ∈ {0, …, n − 1} we define

\[ x := t_1 \dots t_i, \quad a := t_{i+1} \quad \text{and} \quad y := t_{i+2} \dots t_n. \]

We call xa visible and y hidden.

2.29 Main idea

In the step ST(x) → ST(xa):

\[ \underbrace{t_1 \dots t_{i-1}}_{x}\ \underbrace{t_i}_{a}\ \underbrace{t_{i+1} \dots t_n}_{y} \]

Consider all suffixes sa of xa:

\[ t_1 \dots t_{i-1} a,\quad t_2 \dots t_{i-1} a,\quad \dots,\quad t_{i-1} a \quad \text{and} \quad a. \]

There are three cases:

• If sa occurs in ST(x), do nothing.

• If s is a leaf in ST(x), extend the label of the corresponding leaf edge by a.

• Otherwise, sa is a relevant suffix and needs to be inserted appropriately.



2.30 Implicit Suffix Tree<br />

To be precise, each tree T computed by the induction step of Ukkonen’s algorithm is an implicit suffix tree in the following sense:

Definition. An implicit suffix tree for string t is a tree obtained from the suffix tree for t$ by removing every copy of the terminal symbol $ from the edge labels of the tree, then removing any edge that has no label, and then removing any node that does not have at least two children.

In the following description of Ukkonen’s algorithm, we will not distinguish between implicit suffix tree and suffix tree. Note that we can obtain the latter from the former straightforwardly in linear time. (Do you know how?)

2.31 The INSERT set<br />

Because ST(x) represents all substrings of x and ST(xa) represents all substrings of xa, the induction step must add the set INSERT of all substrings of xa that are not substrings of x.

Observation I: For all w ∈ INSERT there exists a suffix s of x such that w = sa.

Proof: w ∉ ST(x) implies w ≠ ε and w ends with $t_{i+1}$ = a. □

Observation II: For all sa ∈ INSERT we have: sa is a leaf in ST(xa).

Proof: Assume that sa is NOT a leaf in ST(xa). Then sa occurs at least twice in xa, and thus at least once in x, contradicting the definition of INSERT. □

Partition INSERT into:

• INSERTleaf = {sa ∈ INSERT | s is a leaf in ST(x)}, and

• INSERTrelevant = {sa ∈ INSERT | s is not a leaf in ST(x)}.

2.32 Processing INSERTleaf

Consider sa ∈ INSERTleaf.

Then s is a leaf node in ST(x). Let $u \xrightarrow{v} s$ be the corresponding leaf edge. To insert sa, we modify this edge as follows:

\[ u \xrightarrow{v} s \quad \text{to} \quad u \xrightarrow{va} sa. \]

I.e., to insert all elements of INSERTleaf, we have to extend all leaf edges in ST(x) by the new character a.

Implementation: We represent the label of a leaf edge by a pair (r, e), where r is the start of the label in t and e points to a variable that contains the length of the visible string.
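The effect of the shared variable can be sketched in Python as follows. The class names `End` and `LeafEdge` are hypothetical, and positions are 0-based here: every leaf edge keeps a reference to the same `End` object, so one assignment extends all leaf labels at once.

```python
class End:
    """Shared, mutable length of the visible string."""
    def __init__(self):
        self.value = 0

class LeafEdge:
    """Leaf edge label stored as the pair (r, e)."""
    def __init__(self, start, end):
        self.start = start   # r: start of the label in t (0-based)
        self.end = end       # e: reference to the shared End object

    def label(self, t):
        """Current label: t from r up to the end of the visible string."""
        return t[self.start:self.end.value]
```

With three leaf edges starting at positions 0, 1 and 2 of t = abca, setting the shared value to 3 gives the labels abc, bc and c; setting it to 4 turns them into abca, bca and ca without touching any edge.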

2.33 Processing INSERTrelevant

If sa ∈ INSERTrelevant, then s is not a leaf in ST(x), or equivalently, s is a nested suffix of x (i.e. a suffix of x that occurs more than once in x).

Definition A suffix sa of xa is called relevant if s is a nested suffix of x and sa is not a substring of x.

The induction step from ST(x) to ST(xa) has thus been reduced to the following: Insert all relevant suffixes sa of xa into ST(x).

We will see that relevant suffixes form an interval in the list of all suffixes of xa, bounded by “active suffixes”.

Definition The active suffix α(x) of x is the longest nested suffix of x.

2.34 Example<br />

Consider the text adcdacdad. Each column contains all suffixes of a prefix of the string:

i:  0   1   2   3   4    5     6      7       8        9
    [ε] ↓a  ad  adc adcd adcda adcdac adcdacd adcdacda adcdacdad
        [ε] ↓d  dc  dcd  dcda  dcdac  dcdacd  dcdacda  dcdacdad
            [ε] ↓c  cd   cda   cdac   cdacd   cdacda   cdacdad
                [ε] [d]  ↓da   dac    dacd    dacda    dacdad
                    ε    [a]   ↓ac    acd     acda     acdad
                         ε     [c]    [cd]    [cda]    ↓cdad
                               ε      d       da       ↓dad
                                      ε       a        [ad]
                                              ε        d
                                                       ε

Relevant suffixes sa are marked by ↓ and active suffixes α(xa) are enclosed in square brackets.

2.35 Four key observations<br />

(O 1) For all suffixes s of x: s is nested ⇔ |α(x)| ≥ |s|.<br />

(O 2) For all suffixes s of x: sa is a relevant suffix of xa ⇔ |α(x)a| ≥ |sa| > |α(xa)|.<br />

(O 3) α(xa) is a suffix of α(x)a.<br />

(O 4) If sa = α(xa) and α(x)a ≠ sa, then s is a right-branching substring of x.

(Definition: A substring s of w is called right-branching, if there exist two different letters a, b such that sa and sb both occur in w.)
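The definitions of nested, active and relevant suffixes can be checked by brute force on small strings. The following Python sketch does exactly that; the function names are hypothetical and the empty string is treated as nested, so that α(x) always exists.

```python
def is_nested(x, s):
    """True if the suffix s of x occurs at least twice in x."""
    if s == "":
        return True                      # epsilon counts as nested
    # s is a suffix, so it is nested iff its first occurrence is earlier
    return x.find(s) != len(x) - len(s)

def active_suffix(x):
    """alpha(x): the longest nested suffix of x."""
    for k in range(len(x) + 1):          # longest suffix first
        if is_nested(x, x[k:]):
            return x[k:]

def relevant_suffixes(x, a):
    """Suffixes sa of xa with s nested in x and sa not a substring of x."""
    return [x[k:] + a for k in range(len(x) + 1)
            if is_nested(x, x[k:]) and (x[k:] + a) not in x]
```

For x = adcdacda and a = d this gives α(x) = cda and the relevant suffixes cdad and dad. Their number equals |α(x)a| − |α(xa)| = 4 − 2, as Observation (O 2) predicts.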

2.36 Proof<br />

Ad 1: follows from the definition of an active suffix.

Ad 2:

sa is a relevant suffix of xa
⇔ s is a nested suffix of x and sa is not a substring of x
⇔ |α(x)| ≥ |s| and sa is not a substring of x
⇔ |α(x)| ≥ |s| and sa is not a nested suffix of xa
⇔ |α(x)a| ≥ |sa| and |sa| > |α(xa)|
⇔ |α(x)a| ≥ |sa| > |α(xa)|.

Ad 3:

Both α(xa) and α(x)a are suffixes of xa, so we need only show |α(x)a| ≥ |α(xa)|.

This is clearly true if α(xa) = ε.

Otherwise, let α(xa) = wa. Since wa is a nested suffix of xa, we have uwav = xa for some strings u and v ≠ ε.

Hence, w is a nested suffix of x.

Since α(x) is the longest nested suffix of x, we have |α(x)| ≥ |w| and hence |α(x)a| ≥ |wa| = |α(xa)|.

Ad 4:

Suppose sa = α(xa) and α(x)a ≠ sa. Then there is a suffix csa of xa such that |α(x)a| ≥ |csa| > |α(xa)|.

Statement 2 implies that csa is a relevant suffix of xa.

I.e., cs is a nested suffix of x and csa is not a substring of x.

Hence, there exists a character b ≠ a such that csb is a substring of x. Since sa is a substring of x, too, s is a right-branching substring of x.

This completes the proof. □

2.37 Reformulation of the <strong>in</strong>duction step<br />

Observation 2 states that all relevant suffixes of xa lie “between” α(x)a and α(xa). In particular, α(xa) is the longest suffix of α(x)a that is a substring of x, by Observation 3.

Hence, the induction step can be formulated as follows:

Take the suffixes of α(x)a one after the other by decreasing length and insert them into ST(x), until a suffix is found which occurs in the tree and therefore equals α(xa).

2.38 Total number of relevant suffixes<br />

Observation 2 implies that for each i ∈ {1, …, n − 1}, the relevant suffixes of $t_1 \dots t_{i+1}$ lie between $\alpha(t_1 \dots t_i)\,t_{i+1}$ and $\alpha(t_1 \dots t_{i+1})$. Hence, the total number of relevant suffixes is bounded by:

\[
\sum_{i=1}^{n-1} \bigl( |\alpha(t_1 \dots t_i)\,t_{i+1}| - |\alpha(t_1 \dots t_{i+1})| \bigr)
= \sum_{i=1}^{n-1} \bigl( |\alpha(t_1 \dots t_i)| + 1 - |\alpha(t_1 \dots t_{i+1})| \bigr)
= n - 1 + |\alpha(t_1)| - |\alpha(t_1 \dots t_n)| \le n.
\]

2.39 Pseudo-code formulation<br />

We can formulate the step ST(x) → ST(xa) as follows:<br />

v := α(x)a
while v does not occur in ST(x) do
    insert v in ST(x)
    v := drop 1 v
α(xa) := v

Because the number of relevant suffixes is bounded by n, these operations are performed O(n) times and we will obtain a linear time algorithm, if we can perform each of the following steps in constant time:

1. decide if v occurs in ST(x),

2. insert v in ST(x), and

3. drop the first character from v.
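The loop can be traced on small inputs with the following Python sketch. It is not a linear-time implementation: as a stand-in for the test "v occurs in ST(x)" it uses a substring test on x, and "insert" merely records v. The function name `induction_steps` is hypothetical.

```python
def induction_steps(t):
    """Run the induction step for every prefix of t.

    Returns a list of (xa, inserted_suffixes, alpha(xa)) triples.
    """
    active = ""                      # alpha(epsilon) = epsilon
    steps = []
    for i, a in enumerate(t):
        x = t[:i]
        v = active + a               # v := alpha(x)a
        inserted = []
        while v not in x:            # stand-in for "v does not occur in ST(x)"
            inserted.append(v)       # stand-in for "insert v in ST(x)"
            v = v[1:]                # v := drop 1 v
        active = v                   # alpha(xa) := v
        steps.append((x + a, inserted, active))
    return steps
```

The loop always terminates, because the empty string occurs in every x. For t = nanuna$ the inserted suffixes are n, a, then nu and u, and finally na$, a$ and $: seven in total, within the bound n = 7.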

2.40 Two ideas<br />

Idea 1: Note that v = ε or v = sa for some string s occurring in ST(x). (This follows from the definition of a relevant suffix!) Hence, we can represent v by the appropriate edges and nodes in ST(x). As we will see, this will enable us to implement steps (1) and (2) in constant time.

Idea 2: The second idea is to construct for each branching node, say bw, an auxiliary edge called a suffix link which points to the branching node w, if it exists. This allows us to implement step (3) in constant time.

2.41 Example<br />

A suffix tree with suffix links:

[Figure: suffix tree T with suffix links; the edge labels shown are abca, bca, c, abca and cabca.]

Locations:

$loc_T(ε)$ = root
$loc_T(a)$ = (root, a, bca, abca)
$loc_T(abca)$ = (root, abca, ε, abca)
$loc_T(bc)$ = (root, bc, a, bca)
$loc_T(c)$ = c
$loc_T(cab)$ = (c, ab, ca, cabca)

2.42 The location of an occurring string

Definition Let T be a suffix tree and s a string that occurs in T. The location $loc_T(s)$ of s in T is defined as follows:

• If s is a branching node, then $loc_T(s)$ := s.

• If s is a leaf, then there is a leaf edge $u \xrightarrow{v} s$ in T and $loc_T(s)$ := (u, v, ε, s).

• If there is no node s in T, then there is an edge $u \xrightarrow{vw} uvw$ in T such that s = uv, v ≠ ε, w ≠ ε and $loc_T(s)$ := (u, v, w, uvw).

If a location is a node, we call it a node location, otherwise, an edge location.

2.43 Operations on locations<br />

We define the following four operations on locations:

(1) occurs($loc_T(s)$, a) = true ⇔ sa occurs in T. This operation can be implemented in constant time.

(2) getloc($loc_T(s)$, w) = $loc_T(sw)$ for all sw that occur in T. This operation can be implemented in O(|w|) time, simply by following the characters of w in T.

(3) Insertion of say, i.e. insertion of ay under $loc_T(s)$, delivers the pair (T′, z) which is specified as follows:

– If $loc_T(s)$ = s, then T′ is obtained from T by adding a leaf edge $s \xrightarrow{ay} say$. Moreover, z is undefined.

– If $loc_T(s)$ = (u, v, w, uvw), then T′ is obtained from T by splitting the edge $u \xrightarrow{vw} uvw$ into $u \xrightarrow{v} s \xrightarrow{w} uvw$, and adding a new leaf edge $s \xrightarrow{ay} say$. Moreover, z := s, i.e., z is set to the new inner node created by the splitting.

Note that this can be done in constant time. We will also use insert(pos, a) to refer to this operation.

(4) Linking locations using suffix links:

linkloc(s) := z,

where s → z is the suffix link for s, and

\[
linkloc(u, av, w, uavw) :=
\begin{cases}
loc_T(v) & \text{if } u = root, \text{ and} \\
getloc(z, av) & \text{otherwise,}
\end{cases}
\]

where u → z is the suffix link for u.

2.44 Example of the insertion operation

Insertion of d at $loc_T(bc)$:

[Figure: before the insertion, the tree has the edge labels abca, bca, c, abca and cabca. After the insertion, the edge bca is split into bc followed by a, and a new leaf edge labeled d leaves the new node bc; the labels are then abca, bc, a, d, c, abca and cabca.]

2.45 Another observation<br />

Observation Let T be a suffix tree such that suffix links for all branching nodes in T are defined. If cy and y occur in T, then

\[ linkloc(loc_T(cy)) = loc_T(y). \]

Proof: This follows directly from the definitions.

2.46 Example of algorithm<br />

Text: nanuna<br />

Table:<br />

0 1 2 3 4 5 6 7<br />

ǫ n na nan nanu nanun nanuna nanuna$<br />

ǫ a an anu anun anuna anuna$<br />

ǫ n nu nun nuna nuna$<br />

ǫ u un una una$<br />

ǫ n na na$<br />

ǫ a a$<br />

ǫ $<br />

ǫ<br />

2.47 Example of algorithm<br />

Text: nanuna


Table:

i:  0   1   2   3    4     5      6       7
    ε↑  n↓  na  nan  nanu  nanun  nanuna  nanuna$
        ε↑  a↓  an   anu   anun   anuna   anuna$
            ε↑  n↓↑  nu↓   nun    nuna    nuna$
                ε    u     un     una     una$
                     ε↑    n↓↑    na↓↑    na$↓
                           ε      a       a$
                                  ε       $
                                          ε↑

Recall: sa is relevant ⇔ |α(xa)| < |sa| ≤ |α(x)a|.

In each column, the active suffix α(xa), i.e. the longest nested suffix of xa, is marked with ↑, and α(x)a is marked with ↓.

The algorithm inserts all relevant suffixes, i.e. those listed from the entry marked ↓ down to, but not including, the entry marked ↑, proceeding from top-left to bottom-right of the table.

2.48 Parameters for the main step

We are ready to define the main step of Ukkonen’s algorithm. The parameters will satisfy the following properties:

• T is the current suffix tree,

• L is the current set of suffix links for T,

• a = $t_{i+1}$ is the current input character,

• y = $t_{i+2} \dots t_n$ is the remaining input string,

• z denotes a node for which a suffix link must be set,

• loc is the location of s in T, where sa is a suffix of xa with |α(x)a| ≥ |sa| > |α(xa)|.

2.49 The main step of Ukkonen’s algorithm

Initially, set T := ST(ε), L := ∅, i := 1, t = $t_1 \dots t_n$, loc := root.

while i ≤ n do
    Set x := $t_1 \dots t_{i-1}$, a := $t_i$ and y := $t_{i+1} \dots t_n$.
    (T′, L′, loc′) := ukkstep(T, L, ay, undefined, loc).
    Set T := T′, L := L′, loc := loc′ and i := i + 1.

The main step:

\[
ukkstep(T, L, ay, z, loc) :=
\begin{cases}
(T, L', getloc(loc, a)) & \text{if } occurs(loc, a) \\
(T', L', loc) & \text{else if } loc = root \\
ukkstep(T', L', ay, r, linkloc(loc)) & \text{otherwise,}
\end{cases}
\]

where (T′, r) is obtained by inserting ay at loc, and

\[
L' =
\begin{cases}
L & \text{if } z \text{ is undefined} \\
L \cup \{z \to loc\} & \text{else if } occurs(loc, a) \text{ or } r \text{ undefined} \\
L \cup \{z \to r\} & \text{otherwise.}
\end{cases}
\]

2.50 Main result

Theorem Let t be a text of length n. Ukkonen’s algorithm computes the suffix tree ST(t) with suffix links L in O(n) time and space.

Proof The algorithm takes care of all non-relevant and relevant suffixes. It takes O(n) steps to process all non-relevant suffixes. There are at most n relevant suffixes. Processing of each relevant suffix is done in constant time.

2.53 Applications of suffix trees<br />

1. Searching for exact patterns

2. Minimal unique substrings

3. Maximal unique matches

4. Maximal repeats

5. Approximate repeats

Additional literature:

Stefan Kurtz, Enno Ohlebusch, Chris Schleiermacher, Jens Stoye and Robert Giegerich, Computation and visualization of degenerate repeats in complete genomes, ISMB 2000, p. 228-238, 2000.

Stefan Kurtz and Chris Schleiermacher, REPuter: fast computation of maximal repeats in complete genomes, Bioinformatics, 15(5):426-427 (1999)

2.54 Searching for exact patterns

To determine whether a string q occurs in a string t, follow the path from the root of the suffix tree ST(t) as directed by the characters of q. If at some point you cannot proceed, then q does not occur in t, otherwise it does.

[Figure: suffix tree for the text abab$, with the leaves labeled by the start positions 1 to 5 of their suffixes.]

The query abb is not contained in abab: Following ab we arrive at the node ab, however there is no b-edge leaving from there. The query baa is not contained in abab: Follow the b-edge to b and then continue along the leaf edge whose label starts with a. The next letter of the label is b and doesn’t match the next letter of the query string.

Clearly, the algorithm that matches a query q against the text t runs in O(|q|) time.
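The matching procedure can be sketched in Python on a naive suffix trie, i.e. an uncompressed tree with one character per edge. This is a simplifying assumption: such a trie takes quadratic space and is not the compact suffix tree discussed here, but following the characters of q from the root works in the same way. The function names are hypothetical.

```python
def build_suffix_trie(t):
    """Naive suffix trie of t as nested dicts (for illustration only)."""
    root = {}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})   # follow or create the ch-edge
    return root

def occurs(trie, q):
    """Follow q from the root; fail as soon as no matching edge exists."""
    node = trie
    for ch in q:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

For the text abab$ the queries ab and bab are found, while abb and baa fail at their third letter, as in the discussion above.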



2.55 Finding all occurrences

To find all positions where the query q is contained in t, annotate each leaf $s_i$ of the suffix tree with the position i at which the suffix $s_i$ starts in t.

Then, after matching q to a path in the tree, visit all nodes below the path and return the annotated values.

This works because any occurrence of q in t is the prefix of one of these suffixes.

The number of nodes below the path is at most twice the number of hits and thus finding and collecting all hits takes time O(|q| + k), where k is the number of occurrences.

(Note that in the discussed lazy suffix tree implementation we do not use this leaf annotation but rather compute the positions from the lp values, to save space...)
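The leaf annotation can be sketched in Python, again on a naive suffix trie rather than a compact suffix tree. This is an assumption made for brevity: the O(|q| + k) bound does not hold for the uncompressed trie, because the paths below the match are not compacted. The function names and the use of the key `None` for the annotation are hypothetical, and the text is assumed to end with a unique terminator.

```python
def build_annotated_trie(t):
    """Naive suffix trie; each leaf stores its 1-based start position."""
    root = {}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node[None] = i + 1           # leaf annotation under the key None
    return root

def find_all(trie, q):
    """Match q, then collect the annotations of all leaves below it."""
    node = trie
    for ch in q:
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                hits.append(child)   # child is the stored start position
            else:
                stack.append(child)
    return sorted(hits)
```

For the text abab$ the query ab is reported at positions 1 and 3, and b at positions 2 and 4.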

2.56 Maximal Unique Matches<br />

Standard dynamic programming is too slow for aligning two large genomes. If the genomes are similar, then one can expect to see long identical substrings which occur in both genomes. These maximal unique matches (MUMs) are almost surely part of a good alignment of the two sequences and so the alignment problem can be reduced to aligning the sequence in the gaps between the MUMs.

Given two sequences s and t, and a number l > 0, the maximal unique matches problem (MUM-problem) is to find all sequences u with:

• |u| ≥ l,

• u occurs exactly once in s and once in t, and

• for any character a, neither ua nor au occurs in both s and t.

This problem can be solved in O(|s| + |t|) time, by considering the suffix tree for s%t, where % is a character that does not occur in s or t, as described earlier.



2.57 Example<br />

2.58 Example



2.59 Repeats in human

(Nature, vol. 409, pg. 880, 15. Feb 2001)

2.60 Definition of a maximal repeat

Given a sequence t = $t_1 t_2 \dots t_n$.

A substring t[i, j] := $t_i \dots t_j$ is represented by the pair (i, j). A pair R = (l, r) of different substrings l = (i, j) and r = (i′, j′) of t is called a repeat, if i < i′ and $t_i \dots t_j = t_{i'} \dots t_{j'}$. We call l and r the left and right instance of the repeat R, respectively.

[Figure: the string t with the two instances (i, j) and (i′, j′) marked, where $t_i \dots t_j = t_{i'} \dots t_{j'}$.]

A repeat R = ((i, j), (i′, j′)) is called left maximal, if i = 1 or $t_{i-1} \ne t_{i'-1}$, and right maximal, if j′ = n or $t_{j+1} \ne t_{j'+1}$, and maximal, if both.

2.61 Example<br />

[Figure: in the string t, the left instance $t_i \dots t_j$ is preceded by the letter a and followed by b; the right instance $t_{i'} \dots t_{j'}$ is preceded by c and followed by d.]

maximal ⇔ a ≠ c and b ≠ d

The string

1 2 3 4 5 6 7 8 9 10
g a g c t c g a g c

contains the following repeats of length ≥ 2:

((1, 4), (7, 10)) gagc
((1, 3), (7, 9)) gag
((1, 2), (7, 8)) ga
((2, 4), (8, 10)) agc
((2, 3), (8, 9)) ag
((3, 4), (9, 10)) gc
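Repeats and their maximality can be checked directly from the definition with a brute-force Python sketch. The function names `repeats` and `is_maximal` are hypothetical; positions are 1-based as in the definition, and the running time is far from what the suffix tree algorithm achieves.

```python
def repeats(t, min_len=2):
    """All repeats ((i, j), (i2, j2)) of t with length >= min_len, 1-based."""
    n = len(t)
    out = []
    for length in range(min_len, n):
        for i in range(1, n - length + 2):
            for i2 in range(i + 1, n - length + 2):
                if t[i - 1:i - 1 + length] == t[i2 - 1:i2 - 1 + length]:
                    out.append(((i, i + length - 1), (i2, i2 + length - 1)))
    return out

def is_maximal(t, rep):
    """Left and right maximality as in the definition of a maximal repeat."""
    (i, j), (i2, j2) = rep
    left = i == 1 or t[i - 2] != t[i2 - 2]        # t_{i-1} != t_{i'-1}
    right = j2 == len(t) or t[j] != t[j2]         # t_{j+1} != t_{j'+1}
    return left and right
```

For the string gagctcgagc, the only maximal repeat of length ≥ 2 is ((1, 4), (7, 10)), i.e. gagc; every shorter repeat can be extended to the left or to the right.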



2.62 An algorithm for computing all maximal repeats

We will discuss how to compute all maximal repeats. Let t be a string of length n and assume that the first and last letter of t both occur exactly once in t, e.g.:

1 2 3 4 5 6 7 8 9 10 11 12 13
t = x g g c g c y g c g c c z

Let T be the suffix tree for t.

We can ignore all leaf edges from the root.

The algorithm proceeds in two phases:

In the first phase, every leaf node v of T is annotated by (a, i), where v = $t_i \dots t_n$ is the suffix associated with v and a = $t_{i-1}$ is the letter that occurs immediately before the suffix.

2.63 Example<br />

Partial suffix tree for

1 2 3 4 5 6 7 8 9 10 11 12 13
t = x g g c g c y g c g c c z :

[Figure: partial suffix tree for t.]

With leaf annotations:

[Figure: the same tree, in which the leaves carry the annotations (x, 2), (g, 3), (g, 4), (c, 5), (g, 6), (y, 8), (g, 9), (c, 10), (g, 11) and (c, 12).]

2.64 Second phase of the algorithm<br />

For every leaf node v set:

\[
A(v, c) =
\begin{cases}
\{i\}, & \text{if } c = t_{i-1}, \text{ and} \\
\emptyset, & \text{else,}
\end{cases}
\]

where i is the start position of the corresponding suffix v.

In the second phase of the algorithm, we extend this annotation to all branching nodes bottom-up:

Let w be a branching node with children $v_1 \dots v_h$ and assume we have computed $A(v_j, c)$ for all j ∈ {1, …, h} and all c ∈ Σ.

For each letter c ∈ Σ set:

\[ A(w, c) := \bigcup_{j=1}^{h} A(v_j, c). \]

Note that this is a disjoint union and A(w, c) is the set of all start positions i of w in t for which $t_{i-1}$ = c.

2.65 Example<br />

1 2 3 4 5 6 7 8 9 10 11 12 13<br />

t = x g g c g c y g c g c c z<br />

[Figure: the suffix tree for t with annotated branching nodes. For example, the branching node for the word gc carries the annotation g: {3}, c: {5, 10}, y: {8}. The output repeats are ((3,4),(5,6)), ((3,4),(10,11)), ((5,6),(8,9)), ((8,9),(10,11)) and ((3,6),(8,11)).]

Annotation of branching nodes and output repeats of length ≥ 2

2.66 Reporting all maximal repeats

In a bottom-up traversal, for each branching node w we first determine A(w, c) for all c ∈ Σ and then report all maximal repeats of the word w:

Let q be the current depth, i.e. the number of characters on the path from the root node to w, which is the length of w.

for each pair of children v_f and v_g of w with v_f ≺ v_g:
    for each letter c ∈ Σ with A(v_f, c) ≠ ∅:
        for each i ∈ A(v_f, c):
            for each letter d ∈ Σ with d ≠ c and A(v_g, d) ≠ ∅:
                for each j ∈ A(v_g, d):
                    print ((i, i + q − 1), (j, j + q − 1))
end
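The two phases can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: instead of a linear-size suffix tree it builds a naive suffix trie (quadratic in the worst case), the names `maximal_repeats`, `annotate` and `LEAF` are made up for this sketch, and the missing letter before position 1 is represented by an empty string.

```python
from itertools import combinations

LEAF = object()  # hypothetical marker key: "a suffix ends at this node"

def maximal_repeats(t, min_len=1):
    """Return all maximal repeats of t of length >= min_len as sorted
    pairs ((i, i+q-1), (j, j+q-1)), using 1-based positions."""
    # Naive suffix trie; a real implementation would use a suffix tree.
    root = {}
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    found = []

    def annotate(node, depth):
        # Collect A(v, .) for every child v as a dict: left letter -> starts.
        child_sets = []
        for key, child in node.items():
            if key is LEAF:
                i = child
                left = t[i - 2] if i > 1 else ""  # "" = no left letter
                child_sets.append({left: [i]})
            else:
                child_sets.append(annotate(child, depth + 1))
        # Report pairs from different children with different left letters.
        if depth >= min_len:
            for a_f, a_g in combinations(child_sets, 2):
                for c, starts_f in a_f.items():
                    for d, starts_g in a_g.items():
                        if c == d:
                            continue
                        for i in starts_f:
                            for j in starts_g:
                                lo, hi = sorted((i, j))
                                found.append(((lo, lo + depth - 1),
                                              (hi, hi + depth - 1)))
        # A(w, c) is the disjoint union of the children's sets.
        merged = {}
        for annotation in child_sets:
            for c, starts in annotation.items():
                merged.setdefault(c, []).extend(starts)
        return merged

    annotate(root, 0)
    return sorted(found)

print(maximal_repeats("xggcgcygcgccz", min_len=2))
```

For the example text t = xggcgcygcgccz and minimum length 2 the sketch reports five repeats, four for the word gc and one for gcgc.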

2.67 Maximality of output<br />

Lemma The algorithm prints precisely the set of all maximal repeats in t of length ≥ l.

Proof

1. Each printed pair R is a repeat, as the word w is the common prefix of two or more different suffixes.



2. Each repeat R is left-maximal, as c ≠ d.

3. Each repeat R is right-maximal, as v_f ≠ v_g, and by definition of a suffix tree, the labels of the edges leading to these two children begin with distinct letters.

4. No maximal repeat is reported twice, as v_f ≺ v_g and all unions are disjoint.

5. Conversely, every maximal repeat ((i, i + q − 1), (j, j + q − 1)) is printed: by right-maximality the suffixes starting at i and j branch at the node w of depth q and lie below different children of w, and by left-maximality t_{i−1} ≠ t_{j−1}. □

2.68 Performance analysis<br />

Lemma Computation of all maximal repeats of length ≥ l can be done in O(n + z) time and O(n) space, where z is the number of maximal repeats.

Proof The suffix tree can be built in O(n) time and space. We can annotate the tree in O(n) time and space if we use the fact that we only need to keep the annotation of a node until its parent has been fully processed. (Also, we maintain the sets as linked lists, so that each disjoint-union operation can be done in constant time.)

In the nested loop we enumerate in total all z maximal repeats in O(z) steps.

Hence, the algorithm is both time and space optimal. □

2.69 Significance of repeats<br />

How significant is a detected maximal repeat? In a long random text we expect to find many short repeats purely by chance.

The E-value associated with a maximal repeat R in t is the expected number of repeats of the same length or longer that are found in a random sequence of length |t|.

To compute this in the case of DNA, consider a simple Bernoulli model where each base α ∈ {A, C, G, T} has the same fixed probability of occurrence: p_α = p = 1/4.

Note that the number of maximal exact repeats of length ≥ l equals the number of (only) left-maximal repeats of length exactly l.

Ignoring boundary effects:

\[
\begin{aligned}
& E[\#\text{ of maximal exact repeats of length} \ge l] \\
&= E[\#\text{ of left-maximal exact repeats of length } l] \\
&= \sum_{1 \le i_1 < i_2 \le n} \Pr\bigl(t[i_1, i_1 + l - 1] = t[i_2, i_2 + l - 1],\ t_{i_1 - 1} \ne t_{i_2 - 1}\bigr)
\end{aligned}
\]
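As a sketch of how this sum can be evaluated under the Bernoulli model with p = 1/4, assume in addition that the l + 1 letter comparisons of a pair (i_1, i_2) behave independently (this is an assumption of the sketch and only approximately true for overlapping occurrences). Each pair then agrees on l positions with probability p^l and differs in the preceding letter with probability 1 − p, and there are about n(n − 1)/2 pairs:

```latex
\[
E \;\approx\; \binom{n}{2}\, p^{l}\,(1-p)
  \;=\; \frac{n(n-1)}{2}\cdot 4^{-l}\cdot\frac{3}{4}
  \;=\; \frac{3\,n(n-1)}{8\cdot 4^{l}} .
\]
```

For instance, for n = 10^6 and l = 20 this estimate gives E ≈ 0.34, so under these assumptions a maximal repeat of length 20 in a random sequence of one million bases is not very surprising, while much longer ones are.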



2.70 Palindromic repeats

Let t be a DNA sequence. We call ((i, j), (i′, j′)) a palindromic repeat if t_i … t_j = $\overline{t_{i'} \ldots t_{j'}}$, where $\overline{w}$ denotes the reverse complement of a DNA string w.

All maximal palindromic repeats can be found using a modification of the described algorithm for maximal repeats, based on the suffix tree for the string x t y $\overline{t}$ z, where x, y, z are three characters that do not appear in t or $\overline{t}$.
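A minimal sketch of the two ingredients, the reverse complement and the concatenated string on which the suffix tree is built. The function names and the default separator characters are assumptions made for this illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(w):
    """Reverse the DNA string w and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(w))

def palindromic_search_text(t, x="x", y="y", z="z"):
    """Build x t y rc(t) z; x, y, z must not occur in t."""
    return x + t + y + reverse_complement(t) + z

print(reverse_complement("AACG"))        # CGTT
print(palindromic_search_text("AACG"))   # xAACGyCGTTz
```

A repeat between the first half and the second half of this string corresponds to a palindromic repeat in t.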

2.71 Degenerate repeats<br />

Let u and w be two strings of the same length. The Hamming distance d_H(u, w) between u and w is the number of positions i such that u_i ≠ w_i.

In the following, we assume that we are given a sequence t = t_1 … t_n, an error threshold k ≥ 0 and a minimum length l > 0.

Definition A pair of equal-length substrings R = ((i_1, j_1), (i_2, j_2)) is called a k-mismatch repeat in t iff (i_1, j_1) ≠ (i_2, j_2) and d_H(t[i_1, j_1], t[i_2, j_2]) = k. The length of R is j_1 − i_1 + 1 = j_2 − i_2 + 1.

A k-mismatch repeat is maximal if it is not contained in any other k-mismatch repeat.

As with exact repeats, a k-mismatch repeat R = ((i_1, j_1), (i_2, j_2)) is maximal iff (i_1 = 1 or i_2 = 1 or t_{i_1−1} ≠ t_{i_2−1}) and (j_1 = n or j_2 = n or t_{j_1+1} ≠ t_{j_2+1}).
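The definitions translate directly into code. The following sketch uses 1-based inclusive positions as in the text; the function names are chosen for this illustration only.

```python
def hamming(u, w):
    """Number of positions at which the equal-length strings differ."""
    if len(u) != len(w):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(u, w))

def is_maximal_k_mismatch_repeat(t, repeat, k):
    """Check the maximality condition for ((i1, j1), (i2, j2))."""
    (i1, j1), (i2, j2) = repeat
    n = len(t)
    if (i1, j1) == (i2, j2) or j1 - i1 != j2 - i2:
        return False
    if hamming(t[i1 - 1:j1], t[i2 - 1:j2]) != k:
        return False
    # t[p - 1] is the letter at 1-based position p.
    left_ok = i1 == 1 or i2 == 1 or t[i1 - 2] != t[i2 - 2]
    right_ok = j1 == n or j2 == n or t[j1] != t[j2]
    return left_ok and right_ok

print(hamming("sissi", "missi"))                                          # 1
print(is_maximal_k_mismatch_repeat("mississippi", ((4, 8), (1, 5)), 1))   # True
```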

2.72 The Mismatches Repeat Problem<br />

The Mismatches Repeat Problem (MMR) is to enumerate all maximal k-mismatch repeats of length ≥ l contained in t.

2.73 Example<br />

Maximal k-mismatch repeats (k = 0, …, 4) for l = 5 in mississippi:

        1 2 3 4 5 6 7 8 9 10 11
text  : m i s s i s s i p p  i
k = 0 : none
k = 1 : s i s s i
        m i s s i
k = 2 : s i p p i
        s i s s i
k = 3 : s i p p i
        m i s s i
k = 4 : none

2.74 The Seed Lemma<br />

The following result is a key observation and is the basis of many seed-and-extend approaches to sequence comparison:



Lemma Every maximal k-mismatch repeat R of length l contains a maximal exact repeat of length ≥ ⌊l/(k+1)⌋, called a seed.

Proof Let R = ((i_1, j_1), (i_2, j_2)) be a k-mismatch repeat of length ≥ l. The k mismatches divide t[i_1, j_1] and t[i_2, j_2] into exact repeats w_0, w_1, …, w_k. Now max_{i∈[0,k]} |w_i| is minimal if the mismatching character pairs are equally distributed over R. Obviously, for such an equal distribution the length of the longest w_i is ≥ ⌈(l−k)/(k+1)⌉ = ⌊l/(k+1)⌋. □

(In the accompanying figure a “•” indicates a mismatch, and we have l = 11, k = 3 and ⌈(l−k)/(k+1)⌉ = ⌈(11−3)/(3+1)⌉ = ⌈8/4⌉ = 2 = ⌊l/(k+1)⌋ = ⌊11/(3+1)⌋.)
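The bound of the Seed Lemma is easy to check numerically. The sketch below uses two made-up strings of length 11 that differ in exactly 3 equally spaced positions; `seed_bound` and `longest_exact_block` are names invented for this illustration.

```python
def seed_bound(l, k):
    """Guaranteed seed length floor(l / (k + 1))."""
    return l // (k + 1)

def longest_exact_block(u, w):
    """Length of the longest run of matching positions of u and w."""
    best = run = 0
    for a, b in zip(u, w):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best

u = "abXcdXefXgh"   # hypothetical pair with l = 11 and k = 3
w = "abYcdYefYgh"
print(seed_bound(11, 3), longest_exact_block(u, w))   # 2 2
```

With the mismatches spread evenly, the longest exact block has exactly the guaranteed length; any other placement of the 3 mismatches leaves at least one longer block.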

2.75 An algorithm for the MMR problem<br />

Let t be a given text of length n. To find all maximal k-mismatch repeats in t of length ≥ l, do the following:

1. Build the suffix tree T for t and use it to detect all seeds, i.e., all exact maximal repeats of length ≥ ⌊l/(k+1)⌋.

2. For each seed s = ((i_1, j_1), (i_2, j_2)) do the following:

(a) For q = 0, 1, …, k, compute:
left(q) := max{p | d_H(t[i_1 − p, i_1], t[i_2 − p, i_2]) = q}, i.e. the length of the maximal extension of the seed to the left with precisely q mismatches.

(b) For q = 0, 1, …, k, compute:
right(q) := max{p | d_H(t[j_1, j_1 + p], t[j_2, j_2 + p]) = q}, i.e. the length of the maximal extension of the seed to the right with precisely q mismatches.

(c) For q = 0, 1, …, k:
If (j_1 − i_1 + 1 + left(q) + right(k − q)) ≥ l, then print ((i_1 − left(q), j_1 + right(k − q)), (i_2 − left(q), j_2 + right(k − q))).
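Step 2 can be sketched as follows. The sketch computes the tables left and right by a plain character-by-character scan (so it does not reach the running time discussed later), marks unreachable entries with None, and uses a made-up example text; all function names are assumptions of this illustration.

```python
def extension_table(t, a, b, step, k):
    """table[q] = maximal number of characters by which the match at
    0-based positions a and b can be extended in direction step
    (-1 = left, +1 = right) with exactly q mismatches, or None."""
    table = [None] * (k + 1)
    table[0] = 0
    mismatches, p = 0, 0
    x, y = a + step, b + step
    while 0 <= x < len(t) and 0 <= y < len(t):
        if t[x] != t[y]:
            mismatches += 1
            if mismatches > k:
                break
        p += 1
        table[mismatches] = p
        x += step
        y += step
    return table

def extend_seed(t, seed, k, min_len):
    """Combine left(q) and right(k - q) for one seed (1-based positions)."""
    (i1, j1), (i2, j2) = seed
    left = extension_table(t, i1 - 1, i2 - 1, -1, k)
    right = extension_table(t, j1 - 1, j2 - 1, +1, k)
    repeats = []
    for q in range(k + 1):
        if left[q] is None or right[k - q] is None:
            continue
        if j1 - i1 + 1 + left[q] + right[k - q] >= min_len:
            repeats.append(((i1 - left[q], j1 + right[k - q]),
                            (i2 - left[q], j2 + right[k - q])))
    return repeats

# Hypothetical text: "abcde" at 2..6 and "abzde" at 8..12, seed "ab".
print(extend_seed("xabcdeyabzdew", ((2, 3), (8, 9)), k=1, min_len=5))
# [((2, 6), (8, 12))]
```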

2.76 Correctness of the algorithm<br />

The correctness of the algorithm follows from the Seed Lemma: every maximal k-mismatch repeat of length ≥ l contains an exact repeat of length ≥ ⌊l/(k+1)⌋ and can be obtained from the seed by extending the seed match with q mismatches to the left and with k − q mismatches to the right, for some q ≤ k.

Note that the same maximal k-mismatch repeat can be obtained via more than one seed. To avoid this, when computing left for a given seed, we stop the computation of the table left if we observe a second seed to the left of the original seed. This ensures that we only output those maximal k-mismatch repeats for which the given seed is leftmost.

2.77 Efficient solution of MMR problem<br />

Lemma The mismatch repeats problem MMR can be solved in O(n + kq) time, where n is the length of the text, k is the number of mismatches permitted and q is the number of different seeds.



We can find all seeds in O(n + q) time. Thus the result follows if we can compute the k-mismatch extension of any seed in k steps.

The latter can indeed be achieved if we can determine the maximal common extension of two matches in constant time. This is indeed possible, due to the following amazing result on rooted trees:

2.78 The lowest common ancestor problem<br />

Let T be a rooted tree. A node u is called an ancestor of a node v iff u lies on the unique path from the root to v. The lowest common ancestor of two nodes x and y is the last node that is both on the path from the root to x and on the path from the root to y.

Lemma Let T be a rooted tree. After a linear amount of preprocessing, we can determine the lowest common ancestor of any two nodes x and y in constant time.

(This is due to Harel and Tarjan (1984) and was later simplified by Schieber and Vishkin (1988); see Chapter 8 of Dan Gusfield’s book for details.)
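To see why a constant-time common-extension query gives the k-step bound used for the MMR problem, consider the following sketch. The function `lce` is a naive linear-time stand-in (an assumption of this sketch); in the efficient solution it would be answered in constant time as the string depth of the lowest common ancestor of two leaves of the suffix tree. The extension loop itself issues at most k + 1 such queries.

```python
def lce(t, i, j):
    """Naive longest common extension: length of the longest common
    prefix of t[i:] and t[j:] (0-based positions)."""
    p = 0
    while i + p < len(t) and j + p < len(t) and t[i + p] == t[j + p]:
        p += 1
    return p

def extension_by_lce(t, i, j, k):
    """table[q] = maximal extension to the right of positions i, j with
    exactly q mismatches, using one lce query per entry."""
    table = []
    p = 0
    for _ in range(k + 1):
        p += lce(t, i + p, j + p)       # jump over a block of matches
        table.append(p)
        if max(i, j) + p >= len(t):     # reached the end of the text
            break
        p += 1                          # step over the next mismatch
    return table

print(extension_by_lce("xabcdeyabzdew", 3, 9, 1))   # [0, 3]
```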

3 DNA arrays<br />

This exposition is based on the following sources, which are recommended reading:

1. M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns, PNAS, 95:14863-14868, 1998. (see link from webpage)

2. Pavel Pevzner, Computational Molecular Biology - an algorithmic approach, MIT Press, 2000, chapter 5. (course reserve shelf)

3. Ron Shamir, Analysis of Gene Expression Data, lectures 1 and 4, 2002. (see link from webpage)

3.1 DNA arrays<br />

• Also known as: biochips, DNA chips, oligo arrays, DNA microarrays or gene arrays.

• An array is an orderly arrangement of (spots of) samples.

• Samples are either DNA or DNA products.

• Each spot in the array contains many copies of the sample.

• An array provides a medium for matching known and unknown DNA samples based on base-pairing (hybridization) rules and for automating the process of identifying the unknowns.

• The sample spot size in a microarray is less than 200 microns, and an array contains thousands of spots.

• Microarrays require specialized robotics and imaging equipment.

• High-throughput biology: a single DNA chip can provide information on thousands of genes simultaneously.



3.2 Two possible formats<br />

We are given an unknown target nucleic acid sample and the goal is to detect the identity and/or abundance of its constituents using known probe sequences. Single-stranded DNA probes are called oligo-nucleotides or oligos.

There are two different formats of DNA chips:

• Format I: The target (500-5000 bp) is attached to a solid surface and exposed to a set of probes, either separately or in a mixture. The earliest chips were of this kind, used for oligo-fingerprinting.

• Format II: An array of probes is produced either in situ or by attachment. The array is then exposed to sample DNA. Examples are oligo-arrays and cDNA microarrays.

In both cases, the free sequence is fluorescently or radioactively labeled and hybridization is used to determine the identity/abundance of complementary sequences.

3.3 Oligo arrays C(l)<br />

The simplest oligo array C(l) consists of all possible oligos of length l and is used e.g. in sequencing by hybridization (SBH).

[Figure: the array C(4) drawn as a 16 × 16 grid. The rows are labeled with the first two bases of each oligo and the columns with the last two bases, both in the order AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG, CC. A box □ marks the cell in row TC and column GA.]

Example: the oligo at □ is TCGA.
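The content of C(l) and the grid layout of the example can be generated mechanically. In this sketch the base order A, T, G, C is taken from the figure labels, while the function names and the row/column convention for even l are assumptions of the illustration.

```python
from itertools import product

BASES = "ATGC"   # order used for the row and column labels

def oligo_array(l):
    """All 4**l oligos of length l, in label order."""
    return ["".join(p) for p in product(BASES, repeat=l)]

def grid_cell(oligo):
    """(row, column) index of an even-length oligo: the row is given by
    its first half, the column by its second half."""
    half = len(oligo) // 2
    labels = oligo_array(half)
    return labels.index(oligo[:half]), labels.index(oligo[half:])

print(len(oligo_array(4)))   # 256
print(grid_cell("TCGA"))     # (7, 8), i.e. row TC, column GA
```

The number of spots grows as 4^l, so C(8) already requires 65 536 different probes.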

3.4 cDNA microarrays<br />

The aim of this technology is to analyze the expression of thousands of genes in a single experiment and to provide measurements of the differential expression of these genes.

Here, each spot contains, instead of short oligos, identical cDNA clones, which represent a gene. (Such complementary DNA is obtained by reverse transcription from some known mRNA.) The target is the unknown mRNA extracted from a specific cell. As most of the mRNA in a cell is translated into a protein, the total mRNA in a cell represents the genes expressed in the cell.

Since cDNA clones are much longer than the short oligos otherwise used, a successful hybridization with a clone is an almost certain match. However, because an unknown amount of cDNA is printed at each spot, one cannot directly associate the hybridization level with a transcription level, and so cDNA chips are limited to comparisons of a reference extract and a target extract.

3.5 Affymetrix chips<br />

Affymetrix produces oligo arrays with the goal of capturing each coding region as specifically as possible. The length of the oligos is usually less than 25 bases. The density of oligos on a chip can be very high and a 1cm × 1cm chip can easily contain 100 000 types of oligos.

The chip contains both “coding” oligos and “control” oligos, the former corresponding to perfect matches to known targets and the controls corresponding to matches with one perturbed base. When reading the chip, hybridization levels at controls are subtracted from the level of match probes to reduce the number of false positives. Actual chip designs use 10 match and 10 mismatch probes for each target gene.

Today, Affymetrix offers chips for the entire (known) human or yeast genomes.
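The subtraction of control levels from match levels can be illustrated by a very simple summary value per gene. This is only a sketch of the idea: averaging the differences over the probe pairs is an assumption made here and is not claimed to be what the vendor's software computes, and the intensities are made up.

```python
def average_difference(match_levels, mismatch_levels):
    """Mean of (match - mismatch) over the probe pairs of one gene."""
    if len(match_levels) != len(mismatch_levels) or not match_levels:
        raise ValueError("need equally many match and mismatch levels")
    pairs = zip(match_levels, mismatch_levels)
    return sum(pm - mm for pm, mm in pairs) / len(match_levels)

# Hypothetical intensities for two probe pairs of one gene.
print(average_difference([10, 12], [4, 6]))   # 6.0
```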

3.6 Oligo fingerprinting

Format I chips were the first type used, namely for oligo fingerprinting which is, in a sense, the opposite of what Affymetrix chips do. Such a chip consists of a matrix of target DNA and is exposed to a solution containing many identical oligos.

After the positions in the matrix have been recorded at which hybridization of the tagged oligos has occurred, the chip can be heated to separate the oligos from the target DNA and the experiment can be repeated with a different type of oligo.

Finally, we obtain a data matrix M, with each row representing a specific target DNA from the matrix and each column representing an oligo probe.

Example: cDNAs extracted from a tissue. Cluster the cDNAs according to their fingerprints and then sequence representatives from each cluster to obtain a sequence that identifies the gene.

3.7 Manufacturing oligo arrays

1. Start with a matrix created over a glass substrate.

2. Each cell contains a growing “chain” of nucleotides that ends with a terminator that prevents chain extension.

3. Cover the substrate with a mask and then illuminate the uncovered cells, breaking the bonds between the chains and their terminators.

4. Expose the substrate to a solution of many copies of a specific nucleotide base so that each of the unterminated chains is extended by one copy of the nucleotide base and a new terminator.



5. Repeat using different masks.

[Figure caption:] Exposure to light replaces the terminators by hydrogen bonds (1–2), bonds form with nucleotide bases provided in a solution (3), and then the process is repeated with a different base (4–6).

3.8 Experiment with a DNA chip<br />

Labeled RNA molecules are applied to the probes on the chip, creating a fluorescent spot where hybridization has occurred.

3.9 Functional genomics<br />

With the sequencing of more and more genomes, the question arises of how to make use of this data. One area that is now opening up is functional genomics, the understanding of the functionality of specific genes, their relations to diseases, their associated proteins and their participation in biological processes.

The functional annotation of genes is still at an early stage: e.g., for the plant Arabidopsis (whose sequence was recently completed), the functions of 40% of the genes are currently unknown.

Functional genomics is being addressed using high-throughput methods: global gene expression profiling (“transcriptome analysis”) and wide-scale protein profiling (“proteome analysis”).



3.10 Gene expression<br />

The existing methods for measuring gene expression are based on two biological assumptions:

1. The transcription level of genes indicates their regulation: Since a protein is generated from a gene in a number of stages (transcription, splicing, synthesis of protein from mRNA), regulation of gene expression can occur at many points. However, we assume that most regulation is done only during the transcription phase.

2. Only genes which contribute to organism fitness are expressed; in other words, genes that are irrelevant to the given cell under the given circumstances etc. are not expressed.

Genes affect the cell by being expressed, i.e. transcribed into mRNA and translated into proteins that react with other molecules.

From the pattern of expression we may be able to deduce the function of an unknown gene. This is especially true if the pattern of expression of the unknown gene is very similar to the pattern of expression of a gene with known function.

Also, the level of expression of a gene in different tissues and at different stages is of significant interest.

Hence, it is highly interesting to analyze the expression profile of genes, i.e. in which tissues and at what stages of development they are expressed.

3.11 cDNA Clustering

It is not easy to determine which genes are expressed in each tissue, and at what level:

An average tissue contains more than 10 000 expressed genes, and their expression levels can vary by a factor of 10 000. Hence, we need to extract more than 10^5 transcripts per tissue.

There are about 100 different types of tissue in the body and we are interested in comparing different growth stages, disease stages etc., and so we should analyze more than 10^10 transcripts.

⇒ Sequencing all cDNAs is infeasible and we need cheap, efficient and large-scale methods.

3.12 Representation of gene expression data<br />

Gene expression data is represented by a raw data matrix R, where each row corresponds to one gene and each column represents one tissue or condition. Thus, R_ij is the expression level for gene i in condition j. The values can be ratios, absolute values or distributions.

Before it is analyzed, the raw data matrix is preprocessed to compute a similarity or distance matrix.

[Figure: the raw data matrix R (rows: genes 1, 2, 3, 4, …; columns: conditions 1, 2, 3, …; entry R_ij: expression level) is transformed into a distance matrix D (rows and columns: genes 1, 2, 3, 4, …; entry D_ij).]
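The preprocessing step from raw data to a distance matrix can be sketched as follows. The text does not fix a particular distance, so the Euclidean distance between expression profiles is used here purely as an assumption, and the function name is made up.

```python
import math

def distance_matrix(R):
    """D[i][j] = Euclidean distance between the expression profiles
    (rows) of gene i and gene j in the raw data matrix R."""
    return [[math.dist(row_i, row_j) for row_j in R] for row_i in R]

# Hypothetical raw data: 2 genes measured under 2 conditions.
print(distance_matrix([[0, 0], [3, 4]]))   # [[0.0, 5.0], [5.0, 0.0]]
```

The result is a symmetric genes × genes matrix with zeros on the diagonal, which is the input expected by the clustering methods that follow.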



3.13 Clustering

The first step in analyzing gene expression data is clustering.

Clustering methods are used in many fields. The goal in a clustering problem is to group elements (in our case genes) into clusters satisfying:

1. Homogeneity: Elements inside a cluster are highly similar to each other.

2. Separation: Elements from different clusters have low similarity to each other.

There are two types of clustering methods:

• Agglomerative methods build clusters by looking at small groups of elements and performing calculations on them in order to construct larger groups.

• Divisive methods analyze large groups of elements in order to divide the data into smaller groups and eventually reach the desired clusters.

Why would we want to cluster gene expression data? Research shows that:

• Distinct measurements of the same genes cluster together.

• Genes of similar function cluster together.

• Many cluster-function specific insights are gained.

3.14 Hierarchical clustering

This approach attempts to place the input elements in a tree hierarchy structure in which distance within the tree reflects element similarity.

To be precise, the hierarchy is represented by a tree and the actual data is represented by the leaves of the tree. The tree can be rooted or not, depending on the method used.

Distance matrix → Dendrogram

gene  1  2  3  4
1     0  3  5  7
2     3  0  5  7
3     5  5  0  7
4     7  7  7  0

[Figure: dendrogram for this distance matrix with leaves gene 1, gene 2, gene 3 and gene 4 and internal nodes labeled 3, 5 and 7: genes 1 and 2 are joined at 3, gene 3 is joined to them at 5, and gene 4 at 7.]

3.15 The Neighbor Joining algorithm

A popular algorithm for “tree building” is Neighbor Joining (NJ), due to Saitou and Nei (1987).

The algorithm proceeds as follows:

1. Input: Distance matrix D

2. For all s, define

\[
b_s := \frac{1}{|L| - 2} \sum_{k \in L} D_{sk},
\]

the average distance from s to any other node, where L denotes the current set of elements.



3. Find elements r, s such that D_rs − (b_s + b_r) is minimal.

4. Merge clusters represented by r, s.

5. Delete elements r, s and add a new element t with:

\[
D_{it} := D_{ti} := \frac{D_{ir} + D_{is} - D_{rs}}{2}
\]

6. Repeat steps 2–5, until one element is left.
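The steps above can be sketched as follows. The sketch only records the order of the merges (no branch lengths), represents D as a dictionary of dictionaries, and joins the last two elements directly because the factor 1/(|L| − 2) is undefined for |L| = 2; these choices and the function name are assumptions of this illustration.

```python
def neighbor_joining(D):
    """Return the list of merged pairs, in the order they are joined.
    D maps each element to a dict of distances to all other elements."""
    D = {a: dict(row) for a, row in D.items()}   # work on a copy
    merges = []
    while len(D) > 1:
        L = list(D)
        if len(L) == 2:
            r, s = L                              # only one pair is left
        else:
            # Step 2: b_s = sum of distances from s divided by |L| - 2.
            b = {s: sum(D[s][k] for k in L if k != s) / (len(L) - 2)
                 for s in L}
            # Step 3: pair minimizing D_rs - (b_r + b_s).
            pairs = ((x, y) for i, x in enumerate(L) for y in L[i + 1:])
            r, s = min(pairs,
                       key=lambda p: D[p[0]][p[1]] - b[p[0]] - b[p[1]])
        # Steps 4-5: replace r and s by the new element t = (r, s).
        t = (r, s)
        new_row = {i: (D[i][r] + D[i][s] - D[r][s]) / 2
                   for i in L if i not in (r, s)}
        del D[r], D[s]
        for i in D:
            del D[i][r], D[i][s]
            D[i][t] = new_row[i]
        D[t] = dict(new_row)
        merges.append(t)
    return merges

# The 4-gene distance matrix from the hierarchical clustering example.
D = {1: {2: 3, 3: 5, 4: 7}, 2: {1: 3, 3: 5, 4: 7},
     3: {1: 5, 2: 5, 4: 7}, 4: {1: 7, 2: 7, 3: 7}}
print(neighbor_joining(D))
```

On four elements the two candidate pairs (1, 2) and (3, 4) have the same criterion value, since joining either pair describes the same unrooted tree; the sketch simply takes the first minimum it encounters.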

3.16 Average linkage

Average linkage is similar to NJ, except that when computing the new distances of created clusters, the sizes of the clusters that are merged are taken into consideration. This algorithm was developed by Lance and Williams (1967) and Sokal and Michener (1958).

1. Input: The distance matrix D_ij, initial cluster sizes n_r.

2. Iteration k: The same as in NJ, except that the distance from a new element t is defined by:

\[
D_{it} := D_{ti} := \frac{n_r}{n_r + n_s}\, D_{ir} + \frac{n_s}{n_r + n_s}\, D_{is}
\]
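The update rule is a size-weighted mean of the two old distances, as the following small sketch shows (the function name and the numbers are made up for illustration).

```python
def average_linkage_distance(d_ir, d_is, n_r, n_s):
    """Distance from element i to the cluster t obtained by merging
    clusters r and s of sizes n_r and n_s."""
    return (n_r * d_ir + n_s * d_is) / (n_r + n_s)

# A singleton r at distance 4 merged with a 3-element cluster s at distance 8.
print(average_linkage_distance(4.0, 8.0, 1, 3))   # 7.0
```

The larger cluster dominates the new distance; with n_r = n_s the rule reduces to the plain mean of D_ir and D_is.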

3.17 Non-Hierarchical clustering

Suppose we are given a set of input vectors. For a given clustering P of them into k clusters, let

\[
E_P := \sum_{c} \sum_{v \in c} D(v, z_c)
\]

denote the solution cost function, where z_c is the centroid (average vector) of the cluster c and D(v, z_c) is the distance from v to z_c.

The k-means clustering due to MacQueen (1965) operates as follows:

1. Initialize an arbitrary partition P into k clusters.

2. For each cluster c and element e:
Let E_P(c, e) be the cost of the solution if e is moved to c.

3. Pick c, e so that E_P(c, e) is minimum.

4. Move e to c, if it improves E_P.

5. Repeat until no further improvement is achieved.
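The procedure above can be sketched directly, one best move per round. The Euclidean distance for D, the representation of vectors as tuples and all function names are assumptions of this sketch; recomputing the full cost for every candidate move keeps the code short but is not efficient.

```python
import math

def centroid(cluster):
    """Average vector of a non-empty cluster of equal-length tuples."""
    dim = len(cluster[0])
    return [sum(v[d] for v in cluster) / len(cluster) for d in range(dim)]

def cost(partition):
    """E_P: sum over clusters of the distances to the cluster centroid."""
    total = 0.0
    for cluster in partition:
        if cluster:
            z = centroid(cluster)
            total += sum(math.dist(v, z) for v in cluster)
    return total

def k_means_by_moves(partition):
    """Repeatedly apply the single move that lowers E_P the most."""
    partition = [list(c) for c in partition]
    while True:
        best_cost, best_partition = cost(partition), None
        for src, cluster in enumerate(partition):
            for e in cluster:
                for dst in range(len(partition)):
                    if dst == src:
                        continue
                    trial = [list(c) for c in partition]
                    trial[src].remove(e)
                    trial[dst].append(e)
                    trial_cost = cost(trial)
                    if trial_cost < best_cost - 1e-12:
                        best_cost, best_partition = trial_cost, trial
        if best_partition is None:      # no move improves E_P
            return partition
        partition = best_partition

start = [[(0,), (10,)], [(1,), (11,)]]   # deliberately bad initial partition
print(k_means_by_moves(start))
```

Starting from the bad partition above, two moves are enough to separate the points near 0 from the points near 10.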

3.18 Application to fibroblast cells<br />

Eisen et al. (1998) performed a series of experiments on real gene expression data. One goal was to check the growth response of starved human fibroblast cells, which were given serum. About 8600 gene levels were monitored over 13 time points.



The original data of test-to-reference ratios was first log-transformed and then normalized to have mean 0 and variance 1. Let N_ij denote these normalized levels. A similarity matrix was constructed from N_ij as follows:

\[
S_{kl} := \frac{\sum_{j} N_{kj} N_{lj}}{N_{cond}},
\]

where N_cond is the number of conditions checked.
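A small sketch of this normalization and of the similarity S_kl, which for normalized rows is the correlation coefficient of the two profiles. The base-2 logarithm, the use of the population variance and the function names are assumptions of this illustration; a row with constant values has variance 0 and cannot be normalized this way.

```python
import math

def normalize(ratios):
    """Log-transform the ratios of one gene, then shift and scale them
    to mean 0 and variance 1."""
    logs = [math.log2(x) for x in ratios]
    n = len(logs)
    mean = sum(logs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in logs) / n)
    return [(x - mean) / sd for x in logs]

def similarity(n_k, n_l):
    """S_kl = sum_j N_kj * N_lj / N_cond for two normalized rows."""
    return sum(a * b for a, b in zip(n_k, n_l)) / len(n_k)

up = normalize([1, 2, 4])        # hypothetical gene rising over time
down = normalize([4, 2, 1])      # hypothetical gene falling over time
print(round(similarity(up, up), 6), round(similarity(up, down), 6))
```

Identical profiles have similarity 1 and exactly opposite profiles have similarity −1.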

The average linkage method was then used to generate the following tree:

The dendrogram resulting from the starved human fibroblast cells experiment. Five major clusters can be seen, and many non-clustered genes. The genes in the five groups serve similar functions: (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling.

(Color scale red-to-green corresponds to higher-to-lower expression level than in the control state.)

3.19 Testing the significance of the clusters

A standard method for testing the significance of clusters is to randomly permute the input data in different ways.

Original expression data is shown in column (1), clustered data in (2) and the results of clustering after random permutations of the rows in (3), columns in (4) and both in (5).

3.20 Sequencing by Hybridization (SBH)

Originally, the hope was that one can use DNA chips to sequence large unknown DNA fragments using a large array of short probes:

1. Produce a chip C(l) spotted with all possible probes of length l (l = 8 in the first SBH papers).



2. Apply a solution conta<strong>in</strong><strong>in</strong>g many copies of a fluorescently labeled DNA target fragment<br />

to the array.<br />

3. The DNA fragments hybridize to those probes that are complementary to substr<strong>in</strong>gs of<br />

length l of the fragment<br />

4. Detect probes that hybridize with the DNA fragment and obta<strong>in</strong> the l-tuple composition<br />

of the DNA fragment<br />

5. Apply a comb<strong>in</strong>atorial algorithm to reconstruct the sequence of the DNA target from the<br />

l-tuple composition<br />
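As a rough illustration of step 4, the l-tuple composition that an ideal, error-free chip would report can be computed directly from a known sequence. This is a minimal Python sketch; the function name `spectrum` is chosen here for illustration, and a set ignores how often an l-tuple occurs.

```python
def spectrum(sequence, l):
    """Return the set of all l-tuples (substrings of length l) of sequence."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


# A sequence of length 10 has 8 (not necessarily distinct) 3-tuples.
print(sorted(spectrum("ATGCAGGTCC", 3)))
```

Real hybridization data would additionally contain false positives and false negatives, which this sketch does not model.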

3.21 The Shortest Superstr<strong>in</strong>g Problem<br />

SBH provides information about which l-tuples are present in a target DNA sequence, but not about their<br />

positions. Suppose we are given the spectrum S of all l-tuples of a target DNA sequence. How<br />

do we reconstruct the sequence?<br />

This is a special case of the Shortest Common Superstring Problem (SCS): A superstring for a<br />

given set of strings $s_1, s_2, \ldots, s_m$ is a string that contains each $s_i$ as a substring. Given a set of<br />

strings, finding the shortest superstring is NP-complete.<br />

Define overlap($s_i$, $s_j$) as the length of a maximal prefix of $s_j$ that matches a suffix of $s_i$. The<br />

SCS problem can be cast as a Traveling Salesman Problem in a complete directed graph G<br />

with m vertices $s_1, s_2, \ldots, s_m$ and edges $(s_i, s_j)$ of length −overlap($s_i$, $s_j$).<br />
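The overlap function and the edge lengths of the Traveling Salesman graph can be sketched in a few lines of Python. The function names are illustrative assumptions, and the quadratic-time overlap computation is only meant for small examples.

```python
def overlap(s_i, s_j):
    """Length of the longest prefix of s_j that equals a suffix of s_i."""
    for k in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i.endswith(s_j[:k]):
            return k
    return 0


def tsp_lengths(strings):
    """Edge lengths -overlap(s_i, s_j) of the complete directed graph G."""
    return {(a, b): -overlap(a, b) for a in strings for b in strings if a != b}


print(overlap("ATGCA", "GCATT"))
print(tsp_lengths(["ATG", "TGC", "GCA"]))
```

A tour of small total length then corresponds to a superstring with large total overlap.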

3.22 The SBH graph<br />

SBH corresponds to the special case when all substr<strong>in</strong>gs have the same length l. We say that<br />

two SBH probes p and q overlap, if the last l − 1 letters of p co<strong>in</strong>cide with the first l − 1 of q.<br />

Given the spectrum S of a DNA fragment, construct the directed graph H with vertex set S<br />

and edge set<br />

E = {(p, q) | p and q overlap}.<br />

There exists a one-to-one correspondence between paths that visit each vertex of H at least<br />

once and the DNA fragments with the spectrum S.<br />

3.23 Example of the SBH graph<br />

Vertices: l-tuples of the spectrum S, edges: overlapping l-tuples:<br />

S = { ATG AGG TGC TCC GTC GGT GCA CAG }<br />

H<br />

The path visit<strong>in</strong>g all vertices corresponds to the sequence reconstruction ATGCAGGTCC.<br />

A path that visits all nodes of a graph exactly once is called a Hamiltonian path. Unfortunately,<br />

the Hamiltonian Path Problem is NP-complete, so no efficient algorithm is known for finding<br />

such paths in larger graphs.<br />



3.24 Second example of the SBH graph<br />

S = { ATG TGG TGC GTG GGC GCA GCG CGT }<br />

H<br />

This example has two different Hamiltonian paths and thus two different reconstructed sequences:<br />

ATGCGTGGCA<br />

ATGGCGTGCA<br />
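For spectra as small as the two examples, the SBH graph H can be built and searched exhaustively. The following Python sketch enumerates all Hamiltonian paths by backtracking; the function names are assumptions made for illustration, and the exponential running time makes this approach unusable for realistic spectra.

```python
def sbh_graph(spectrum):
    """Edges (p, q) whenever the last l-1 letters of p equal the first l-1 of q."""
    return {p: [q for q in spectrum if q != p and p[1:] == q[:-1]] for p in spectrum}


def hamiltonian_paths(graph):
    """Enumerate all paths that visit every vertex exactly once (backtracking)."""
    paths = []

    def extend(path, seen):
        if len(path) == len(graph):
            paths.append(list(path))
            return
        for q in graph[path[-1]]:
            if q not in seen:
                seen.add(q)
                path.append(q)
                extend(path, seen)
                path.pop()
                seen.remove(q)

    for start in graph:
        extend([start], {start})
    return paths


def spell(path):
    """Sequence spelled by a path of overlapping l-tuples."""
    return path[0] + "".join(p[-1] for p in path[1:])


second = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(spell(p) for p in hamiltonian_paths(sbh_graph(second))))
```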

3.25 Euler Path<br />

Leonhard Euler wanted to know whether there exists a path that uses all seven bridges in<br />

Königsberg exactly once:<br />

[Figure: the seven bridges of Königsberg across the Pregel river, around Kneiphof island]<br />

Birth of graph theory...<br />

3.26 SBH and the Eulerian Path Problem<br />

Let S be the spectrum of a DNA fragment. We define a graph G whose set of nodes consists<br />

of all possible (l − 1)-tuples.<br />

We connect one (l − 1)-tuple $v = v_1 \ldots v_{l-1}$ to another $w = w_1 \ldots w_{l-1}$ by a directed edge (v, w), if<br />

the spectrum S contains an l-tuple u with prefix v and suffix w, i.e. such that<br />

$u = v_1 \ldots v_{l-1} w_{l-1} = v_1 w_1 \ldots w_{l-1}$.<br />

Hence, in this graph the probes correspond to edges and the problem is to find a path that<br />

visits all edges exactly once, i.e., an Eulerian path.<br />

Finding an Eulerian path is easy: it takes time linear in the number of edges.<br />

(To be precise, the Ch<strong>in</strong>ese Postman Problem (visit all edges at least once <strong>in</strong> a m<strong>in</strong>imal tour) can be<br />

efficiently solved for directed or undirected graphs, but not <strong>in</strong> a graph that conta<strong>in</strong>s both directed and<br />

undirected edges.)



S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }<br />

[Figure: graph G with vertices AT, TG, GG, GC, CG, GT, CA and edges ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT]<br />

Vertices represent (l − 1)-tuples, edges correspond to l-tuples of the spectrum. There are two<br />

different solutions:<br />

[Figure: the two Eulerian paths in G, spelling ATGGCGTGCA and ATGCGTGGCA]<br />
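The Eulerian formulation can be tried out on this spectrum with a short Python sketch. It uses Hierholzer's algorithm, a standard linear-time method that the notes do not name explicitly, and it assumes that an Eulerian path exists and that the spectrum is error-free. All function names are illustrative.

```python
from collections import defaultdict


def de_bruijn(spectrum):
    """One edge prefix -> suffix per l-tuple; nodes are (l-1)-tuples."""
    graph = defaultdict(list)
    for u in spectrum:
        graph[u[:-1]].append(u[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out = {v: list(ws) for v, ws in graph.items()}
    indeg = defaultdict(int)
    for ws in graph.values():
        for w in ws:
            indeg[w] += 1
    nodes = sorted(set(out) | set(indeg))
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((v for v in nodes if len(out.get(v, [])) - indeg[v] == 1), None)
    if start is None:
        start = sorted(out)[0]
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out.get(v):
            stack.append(out[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(spell(eulerian_path(de_bruijn(S))))
```

The sketch returns one of the two possible reconstructions; which one depends on the order in which edges are stored.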

3.27 Eulerian graphs<br />

A directed graph G is called Eulerian, if it conta<strong>in</strong>s a cycle that traverses every edge of G<br />

exactly once.<br />

A vertex v is called balanced, if the number of edges enter<strong>in</strong>g v equals the number of edges leav<strong>in</strong>g<br />

v, i.e. <strong>in</strong>degree(v) = outdegree(v). We call v semi-balanced, if |<strong>in</strong>degree(v)−outdegree(v)| = 1.<br />

Theorem A directed graph is Eulerian, iff it is connected and each of its vertices is balanced.<br />

Lemma A connected directed graph contains an Eulerian path, iff it contains at most two semi-balanced<br />

nodes and all other nodes are balanced.<br />
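The degree conditions are easy to check mechanically. The sketch below classifies the vertices of a directed graph given as an edge list; it assumes the graph is connected and does not test connectivity itself. The function names are hypothetical.

```python
from collections import Counter


def classify_vertices(edges):
    """Split vertices into balanced, semi-balanced and all others."""
    indeg, outdeg = Counter(), Counter()
    for v, w in edges:
        outdeg[v] += 1
        indeg[w] += 1
    balanced, semi, other = set(), set(), set()
    for v in set(indeg) | set(outdeg):
        diff = abs(indeg[v] - outdeg[v])
        if diff == 0:
            balanced.add(v)
        elif diff == 1:
            semi.add(v)
        else:
            other.add(v)
    return balanced, semi, other


def degrees_allow_eulerian_path(edges):
    """Degree condition only: at most two semi-balanced, all others balanced."""
    _, semi, other = classify_vertices(edges)
    return not other and len(semi) <= 2


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
edges = [(u[:-1], u[1:]) for u in S]
print(classify_vertices(edges))
```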

3.28 When does a unique solution exist?<br />

Problem: Given a spectrum S. Does it possess a unique sequence reconstruction?<br />

Consider the corresponding graph G. If the graph G is Eulerian, then we can decompose it<br />

into simple cycles $C_1, \ldots, C_t$, that is, cycles without self-intersections. Each edge of G is used<br />

in exactly one cycle, although nodes can be used in many cycles. Define the intersection graph<br />

$G_I$ on t vertices $C_1, \ldots, C_t$, where $C_i$ and $C_j$ are connected by k edges, iff they have precisely k<br />

original vertices in common.<br />

Lemma Assume G is Eulerian. Then G has only one Eulerian cycle iff the intersection graph<br />

$G_I$ is a tree.<br />

3.29 Probability of unique sequence reconstruction<br />

What is the probability that a randomly generated DNA fragment of length n can be uniquely reconstructed<br />

using a DNA array C(l)? In other words, how large must l be so that a random<br />

sequence of length n can be uniquely reconstructed from its l-tuples?<br />

We assume that the bases at each position are chosen independently, each with probability $\frac{1}{4}$.<br />

Note that a repeat of length ≥ l will always lead to a non-unique reconstruction. We expect<br />

about $\binom{n}{2} p^l$ repeats of length ≥ l, where $p = \frac{1}{4}$ is the probability that two random bases agree.<br />

Note that $\binom{n}{2} p^l = 1$ implies $l = \log_{1/p} \binom{n}{2}$.<br />

=⇒ For a given l one should choose $n \approx \sqrt{2 \cdot 4^l}$, but not larger. (However, this is a very loose<br />

bound and a much tighter bound is known.)<br />
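To get a feeling for the numbers, the estimate can be evaluated for a few probe lengths. The sketch uses the approximation that the binomial coefficient of n over 2 is about n²/2, so that n ≈ sqrt(2 / p^l); the function name is an assumption.

```python
import math


def max_fragment_length(l, p=0.25):
    """n with n^2 / 2 * p^l = 1, i.e. n = sqrt(2 / p^l)."""
    return math.sqrt(2.0 / p ** l)


for l in (8, 10, 12):
    print(l, round(max_fragment_length(l)))
```

For l = 8, the probe length of the first SBH papers, this loose bound suggests fragments of only a few hundred bases.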

3.30 SBH currently <strong>in</strong>feasible<br />

The Eulerian path approach to SBH is currently <strong>in</strong>feasible due to two problems:<br />

• Errors <strong>in</strong> the data<br />

– False positives arise when the target DNA hybridizes to a probe even though<br />

an exact match is not present<br />

– False negatives arise, when an exact match goes undetected<br />

• Repeats make the reconstruction impossible, as soon as the length of the repeated sequence<br />

is longer than the word length l<br />

Nevertheless, ideas developed here are employed in a new approach to sequence assembly<br />

that uses sequenced reads and an Eulerian path representation of the data (Pavel Pevzner,<br />

RECOMB 2001).<br />

3.31 Masks for VLSIPS<br />

DNA arrays can be manufactured us<strong>in</strong>g VLSIPS, very large scale immobilized polymer synthesis.<br />

In VLSIPS, probes are grown one layer of nucleotides at a time through a photolithographic<br />

process. In each step, a different mask is used and only the unmasked probes are extended by<br />

the current nucleotide. All probes are grown to length l <strong>in</strong> 4l steps.<br />

[Figure: a sequence of masks on the array; in each step the unmasked spots are extended by one nucleotide (A or T in the panels shown)]<br />
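The layer-by-layer synthesis can be mimicked in software to see why 4l steps suffice. In this sketch, which is only a model and not a description of the actual manufacturing process, each step has a mask that exposes exactly those spots whose target probe needs the current nucleotide at the current position. The function name is hypothetical.

```python
from itertools import product


def synthesize(l, alphabet="ACGT"):
    """Grow all 4^l probes in 4*l masked steps; return probes and step count."""
    targets = ["".join(t) for t in product(alphabet, repeat=l)]
    grown = {t: "" for t in targets}
    steps = 0
    for position in range(l):
        for nucleotide in alphabet:
            # Mask for this step: expose spots needing `nucleotide` here.
            for t in targets:
                if t[position] == nucleotide:
                    grown[t] += nucleotide
            steps += 1
    return grown, steps


grown, steps = synthesize(3)
print(steps, len(grown))
```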

Problem: Due to diffraction, <strong>in</strong>ternal reflection and scatter<strong>in</strong>g, masked spots near an edge of<br />

the mask can be un<strong>in</strong>tentionally illum<strong>in</strong>ated.<br />

Idea: To m<strong>in</strong>imize the problem, design masks that have m<strong>in</strong>imal bor<strong>der</strong> length!<br />

For example, consider the 8 × 8 array for l = 3. Both of the following two masks add a T to $\frac{1}{4}$<br />

of the spots, with borders of very different length:<br />

[Figure: two masks on the 8 × 8 array, one with border length 58 and one with border length 16]<br />
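Border length can be computed by counting the unit edges that separate exposed spots from masked spots. The sketch below also counts edges on the outer boundary of the array, an assumption under which a compact 4 × 4 block has border length 16. The two masks are hypothetical examples and not the ones shown in the figure.

```python
def border_length(mask):
    """Count unit edges between exposed (True) spots and anything else."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                if not inside or not mask[rr][cc]:
                    border += 1
    return border


# Both masks expose 16 of the 64 spots.
compact = [[r < 4 and c < 4 for c in range(8)] for r in range(8)]
scattered = [[r % 2 == 0 and c % 2 == 0 for c in range(8)] for r in range(8)]
print(border_length(compact), border_length(scattered))
```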



3.32 The l-bit Gray code<br />

In the above example, we can mask $\frac{1}{4}$ of all spots using a mask that has a boundary of length<br />

4 · l. Can we arrange the spots so that this minimal value is attained for every mask?<br />

An l-bit Gray code is a permutation of the binary numbers $0, \ldots, 2^l - 1$ such that any two<br />

neighboring numbers differ in exactly one bit.<br />

The 4-bit “reflected” Gray code is:<br />

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1<br />

0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0<br />

0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0<br />

0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0<br />

This is generated recursively from the 1-bit code $G_1 = \{0, 1\}$ using:<br />

$G_l = \{g_1, g_2, \ldots, g_{2^l-1}, g_{2^l}\} \longrightarrow$<br />

$G_{l+1} = \{0g_1, 0g_2, \ldots, 0g_{2^l-1}, 0g_{2^l}, 1g_{2^l}, 1g_{2^l-1}, \ldots, 1g_2, 1g_1\}$.<br />
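The reflected construction translates directly into code: prefix the current code with 0, then append the reversed code prefixed with 1. A minimal Python sketch, with an illustrative function name:

```python
def reflected_gray_code(l):
    """Return the l-bit reflected Gray code as a list of bit strings."""
    code = ["0", "1"]
    for _ in range(l - 1):
        code = ["0" + g for g in code] + ["1" + g for g in reversed(code)]
    return code


print(reflected_gray_code(4))
```

Read column by column, the 4-bit table above lists the same sixteen code words in the same order.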

3.33 The two-dimensional Gray code<br />

We want to construct a two-dimensional Gray code for str<strong>in</strong>gs of length l over the alphabet<br />

{A, C, G, T }, <strong>in</strong> which every l-tuple is present and differs from each of its four neighbors <strong>in</strong><br />

precisely one position.<br />

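Start: $G_1 := \begin{bmatrix} A & T \\ C & G \end{bmatrix}$.<br />

The induction step $G_l \to G_{l+1}$: Let<br />

$G_l = \begin{bmatrix} g_{1,1} & \ldots & g_{1,2^l} \\ \vdots & & \vdots \\ g_{2^l,1} & \ldots & g_{2^l,2^l} \end{bmatrix}$<br />

and set<br />

$G_{l+1} := \begin{bmatrix} Ag_{1,1} & \ldots & Ag_{1,2^l} & Tg_{1,2^l} & \ldots & Tg_{1,1} \\ \vdots & & \vdots & \vdots & & \vdots \\ Ag_{2^l,1} & \ldots & Ag_{2^l,2^l} & Tg_{2^l,2^l} & \ldots & Tg_{2^l,1} \\ Gg_{2^l,1} & \ldots & Gg_{2^l,2^l} & Cg_{2^l,2^l} & \ldots & Cg_{2^l,1} \\ \vdots & & \vdots & \vdots & & \vdots \\ Gg_{1,1} & \ldots & Gg_{1,2^l} & Cg_{1,2^l} & \ldots & Cg_{1,1} \end{bmatrix}$<br />

The T and C blocks are mirrored left to right and the G and C blocks top to bottom, so that neighbors across a block boundary differ only in their first letter.<br />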
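The two-dimensional construction can be checked with a short Python sketch. It assumes that the four blocks are mirrored (left to right for the T and C blocks, top to bottom for the G and C blocks), which is what makes neighbors across block boundaries differ in exactly one position. The function names are illustrative.

```python
def gray_2d(l):
    """Build the 2^l x 2^l two-dimensional Gray code over {A, C, G, T}."""
    grid = [["A", "T"], ["C", "G"]]
    for _ in range(l - 1):
        top = [["A" + g for g in row] + ["T" + g for g in reversed(row)]
               for row in grid]
        bottom = [["G" + g for g in row] + ["C" + g for g in reversed(row)]
                  for row in reversed(grid)]
        grid = top + bottom
    return grid


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def is_gray_2d(grid):
    """True if all horizontal and vertical neighbors differ in one position."""
    n = len(grid)
    for r in range(n):
        for c in range(n):
            if c + 1 < n and hamming(grid[r][c], grid[r][c + 1]) != 1:
                return False
            if r + 1 < n and hamming(grid[r][c], grid[r + 1][c]) != 1:
                return False
    return True


print(is_gray_2d(gray_2d(3)))
```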

3.34 Additional ideas<br />

SBH with universal bases Use universal bases such as inosine that stack correctly, but don’t<br />

bind, and thus play the role of “don’t care” symbols in the probes. Arrays based on this idea<br />

can achieve the information-theoretic lower bound on the number of probes required for<br />

unambiguous reconstruction of an arbitrary DNA string of length n. (The full C(l) array has<br />

redundancies that can be eliminated using such universal bases.) (Preparata et al. 1999)



Adaptive SBH If a sequenc<strong>in</strong>g by hybridization is not successful, analyze the critical problems<br />

and then build a new array to overcome them. Skiena and Sundaram, and Margaritis and<br />

Skiena, give theoretical bounds for the number of rounds needed for sequence reconstruction<br />

(<strong>in</strong> the error free case).<br />

SBH-style shotgun sequenc<strong>in</strong>g The idea is to collect sequence reads from the target DNA<br />

sequence us<strong>in</strong>g traditional sequenc<strong>in</strong>g methods and then to treat each such read of length k as<br />

a set of k −l + 1 <strong>in</strong>dividual l-tuples, with l = 30, say. Then, the Eulerian path method is used.<br />

Idury and Waterman suggested this <strong>in</strong> 1995 and it leads to an efficient assembly algorithm <strong>in</strong><br />

the error-free case. More recent work by Pevzner and others has led to promis<strong>in</strong>g software.<br />

Fidelity probes for DNA arrays As a quality control measure when manufactur<strong>in</strong>g a DNA<br />

chip, one can produce fidelity probes that have the same sequence as probes on the chip, but<br />

are produced <strong>in</strong> a different or<strong>der</strong> of steps. A known target is hybridized to these probes and<br />

the result reflects the quality of the array. Hubbell and Pevzner (1999) describe a combinatorial<br />

method for design<strong>in</strong>g a small set of fidelity probes that can detect variations and can also<br />

<strong>in</strong>dicate which manufactur<strong>in</strong>g steps caused the errors.<br />

4 Sequence Assembly<br />

This exposition is based on the follow<strong>in</strong>g sources, which are all recommended read<strong>in</strong>g:<br />

1. Michael S. Waterman, Introduction to computational biology, Chapman and Hall, 1995.<br />

(Chapter 7)<br />

2. Gene Myers, Whole-Genome DNA Sequenc<strong>in</strong>g, Comput<strong>in</strong>g <strong>in</strong> Science and Eng<strong>in</strong>eer<strong>in</strong>g,<br />

33-43, May–June, 1999.<br />

3. Eugene W. (Gene) Myers et al., A Whole-Genome Assembly of Drosophila, Science,<br />

287:2196-2204, 24 March 2000.<br />

4. W. James Kent and David Haussler, Assembly of a Work<strong>in</strong>g Draft of the Human Genome<br />

with GigAssembler, Genome Research, 11:1541-1548 2001.<br />

5. Daniel Huson, Knut Re<strong>in</strong>ert and Eugene Myers, The Greedy-Path Merg<strong>in</strong>g Algorithm for<br />

Sequence Assembly, RECOMB 2001, 157-163, 2001.<br />

4.1 Genome Sequenc<strong>in</strong>g<br />

Us<strong>in</strong>g a method that was basically <strong>in</strong>vented <strong>in</strong> 1980 by Sanger, current sequenc<strong>in</strong>g technology<br />

can only determ<strong>in</strong>e 500 − 1000 consecutive base pairs of DNA <strong>in</strong> any one read. To sequence a<br />

larger piece of DNA, shotgun sequenc<strong>in</strong>g is used.<br />

Orig<strong>in</strong>ally, shotgun sequenc<strong>in</strong>g was applied to small viral genomes and to 30 − 40kb segments<br />

of larger genomes.<br />

In 1994, the 1.8Mb genome of the bacterium H. influenzae was assembled from shotgun data.<br />

At the beginning of 2000, an assembly of the 130Mb Drosophila genome was published.<br />

At the beg<strong>in</strong>n<strong>in</strong>g of 2001, two <strong>in</strong>itial assemblies of the human genome were published.



4.2 Shotgun Sequenc<strong>in</strong>g<br />

Source sequence. ..<br />

is copied many times. . .<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

ACGTTGCACTAGCACAGCGCGCTATATCGACTACGACTACGACTCAGCA<br />

and randomly broken into fragments, e.g.<br />

using sonication or nebulization, ...<br />

AGCGCGCTATATCGACTACG ACGACTCAGC ACTAGCACAGCGCGA<br />

CGCTATATCGACTACGA CGCTATATCGACTACGA TTTTTTTT<br />

ACGTTGCACTAGCACAGCGCGCT CGCTATATCGACTACGA TGGTG<br />

TACGACTACGACTCAGCA<br />

ACTAGCACAGCGCGA AA<br />

ACTAGCACAGCGCGA ACGACTCAGC<br />

TGCACTAGCACAGCGCGCTATATCGACT<br />

CGCTATATCGACTACGA<br />

AGCACAGCGCGCTATAT TGCACTAGCACAGCGCGCTATATCGACT<br />

ACGACTCAGC<br />

ACGTTGCACTAGCACAGCGCGCT<br />

TACGACTACGACTCAGCA AGCG TACGACTACGACTCAGCA<br />

that are then size selected, size e.g. 2kb,<br />

10kb, 50kb or 150kb, ...<br />

ACCGCTGCACACACGGTAGCAGCAGCAGCACAGACGAC<br />

TGTTGTGCTCGTGCTATATACACTGGCTACACT<br />

ACCGCTGCACACACGGTAGCAGCAGCACAACGAC<br />

TGTTGTGCTCGTGCTATATACACACTGGCTT<br />

GCTGCACACACGGTAGCAGCAGCAGCACAGACGAC<br />

ACCGCTGCACACAGCAGCACAGACGAC<br />

ATTGTTTATATACACACTGGCTACACT<br />

ACCGGCAGCAGCAGCACAGACGAC<br />

ATTGCTATATACACACTGGCTACACT<br />

ATATATACACACTGGCTACACT<br />

AGCAGCAGCGCACAGACGAC<br />

TATACACACTGGCTACACT<br />

ATTGTTGTGCTCGTGC<br />

ACTGGCTACACT<br />

TATACACACTACT<br />

ATTGCTATATACACACTGGCTACACT<br />

and <strong>in</strong>serted <strong>in</strong>to clon<strong>in</strong>g vectors.<br />

XXTAACG......ATGTGA XX<br />

ATTGCTATATACACACTGGCTACACT<br />

In double barrel shotgun sequenc<strong>in</strong>g, each<br />

clone is sequenced from both ends, to obta<strong>in</strong><br />

a mate-pair of reads, each read of average<br />

length 550.<br />

4.3 Shotgun sequenc<strong>in</strong>g data<br />

Given an unknown DNA sequence $a = a_1 a_2 \ldots a_L$.<br />

Shotgun sequencing of a produces a set of reads<br />

$F = \{f_1, f_2, \ldots, f_R\}$<br />

of average length 550 (at present).<br />

Essential characteristics of the data:<br />

• Incomplete coverage of the source sequences.<br />

• Sequencing introduces errors at a rate of about 1% for the first 500 bases, if carefully<br />

performed.<br />

• The reads are sampled from both strands of the source sequence and thus the orientation<br />

of any given read is unknown.



4.4 The fragment assembly problem<br />

The input is a collection of reads (or fragments) $F = \{f_1, f_2, \ldots, f_R\}$, that are sequences over<br />

the alphabet Σ = {A, C, G, T}.<br />

An ε-layout of F is a string S over Σ and a collection of R pairs of integers $(s_j, e_j)_{j \in \{1,2,\ldots,R\}}$,<br />

such that<br />

• if $s_j < e_j$ then $f_j$ can be aligned to the substring $S[s_j, e_j]$ with less than $\epsilon \cdot |f_j|$ differences,<br />

• if $s_j > e_j$ then the reverse complement $\bar{f}_j$ can be aligned to the substring $S[e_j, s_j]$ with less than $\epsilon \cdot |f_j|$ differences,<br />

and<br />

• $\bigcup_{j=1}^{R} [\min(s_j, e_j), \max(s_j, e_j)] = [1, |S|]$.<br />

The string S is the reconstructed source string. The integer pairs indicate where the reads are<br />

placed and the order of $s_i$ and $e_i$ indicates the orientation of the read $f_i$, i.e. whether $f_i$ was<br />

sampled from S or its reverse complement $\bar{S}$.<br />
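The layout conditions can be tested directly for small inputs. The following Python sketch is one possible reading of the definition: it interprets "differences" as edit distance, uses 1-based inclusive coordinates, and treats a pair with s > e as a read from the reverse strand. All function names are assumptions.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def edit_distance(x, y):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def is_layout(S, reads, placements, eps):
    """Check the alignment and coverage conditions of an eps-layout."""
    covered = [False] * len(S)
    for f, (s, e) in zip(reads, placements):
        lo, hi = min(s, e), max(s, e)
        read = f if s < e else reverse_complement(f)
        if edit_distance(read, S[lo - 1:hi]) >= eps * len(f):
            return False
        for pos in range(lo - 1, hi):
            covered[pos] = True
    return all(covered)


S = "ACGTTGCA"
print(is_layout(S, ["ACGTT", "TGCAA"], [(1, 5), (8, 4)], 0.2))
```

In the usage line, the second read is the reverse complement of positions 4 to 8 of S, so it is placed with s > e.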

The set of all ε-layouts models the set of all possible solutions. There are many such solutions<br />

and so we want a solution that is in some sense best. Traditionally, this has been phrased as the<br />

Shortest Common Superstring Problem (SCS) of the reads within error rate ε. Unfortunately,<br />

the SCS Problem often produces overcompressed results.<br />

Consider the following source sequence that contains two instances R, R′ of a high fidelity<br />

repeat and three stretches of unique sequence A, B and C:<br />

[Figure: source sequence A, R = (Rl, Rc, Rr), B, R′ = (R′l, R′c, R′r), C, with the reads drawn below it]<br />

The shortest answer isn’t always the best and the interior part $R_c \approx R'_c$ of the repeat region is<br />

overcompressed:<br />

[Figure: reconstruction A, Rl, Rc, Rr, B, R′l, R′r, C, with the reads drawn below it]<br />

4.5 Sequence assembly <strong>in</strong> three stages<br />

Traditional approaches to sequence assembly divide the problem into three phases:<br />

1. In the overlap phase, every read is compared with every other read, and the overlap graph<br />

is computed.<br />

2. In the layout phase, the pairs $(s_j, e_j)$ are determined that position every read in the<br />

assembly.<br />

3. In the consensus phase, a multialignment of all the placed reads is produced to obta<strong>in</strong><br />

the f<strong>in</strong>al sequence.



4.6 The overlap phase<br />

For a read $f_i$, we must calculate how it overlaps any other read $f_j$ (or its reverse complement,<br />

$\bar{f}_j$). Holding $f_i$ fixed in orientation, $f_i$ and $f_j$ can overlap in the following ways:<br />

[Figure: the possible overlap configurations of $f_i$ and $f_j$]<br />

The number of possible relationships doubles, when we also consider $\bar{f}_j$.<br />

The overlap phase is the computational bottleneck in large assembly projects. For example,<br />

assembling all 27 million human reads produced at Celera requires<br />

$2 \cdot \binom{27000000}{2} \approx 7.29 \cdot 10^{14}$<br />

comparisons.<br />

For any two reads a and b (and either orientation of the latter), one searches for the overlap<br />

alignment with the highest alignment score, based on a similarity score s(a, b) on Σ and an<br />

<strong>in</strong>del penalty g(k) = kδ.<br />

Let S(a, b) be the maximum score over all alignments of two reads $a = a_1 a_2 \ldots a_m$ and $b = b_1 b_2 \ldots b_n$. We want to compute:<br />

$A(a, b) = \max \{ S(a_k a_{k+1} \ldots a_i, b_l b_{l+1} \ldots b_j) \mid 1 \le k \le i \le m,\ 1 \le l \le j \le n, \text{ and } i = m \text{ or } j = n \text{ holds} \}$.<br />

4.7 Overlap alignment<br />

This is a standard pairwise alignment problem (similar to local alignment, except we don’t have<br />

a 0 in the recursion) and we can use dynamic programming to compute:<br />

$A(i, j) = \max \{ S(a_k a_{k+1} \ldots a_i, b_l b_{l+1} \ldots b_j) \mid 1 \le k \le i \text{ and } 1 \le l \le j \}$.<br />

Algorithm (Overlap alignment)<br />

Input: $a = a_1 a_2 \ldots a_m$ and $b = b_1 b_2 \ldots b_n$, s(·, ·) and δ<br />

Output: A(i, j)<br />

begin<br />

A(i, 0) ← 0 for i = 0, . . ., m and A(0, j) ← 0 for j = 0, . . ., n<br />

for i = 1, . . ., m:<br />

for j = 1, . . ., n:<br />

A(i, j) ← max { A(i − 1, j) − δ, A(i, j − 1) − δ, A(i − 1, j − 1) + s($a_i$, $b_j$) }<br />

end<br />



Runtime is O(nm).<br />

Given two reads $a = a_1 a_2 \ldots a_m$ and $b = b_1 b_2 \ldots b_n$ and the matrix A(i, j) computed as above,<br />

set<br />

(p, q) := arg max{A(i, j) | i = m or j = n}.<br />

There are two cases:<br />

p = m or q = n<br />

The trace-back paths look like this:<br />

[Figure: two trace-back paths through the matrix A(i, j), one ending in the last row (p = m) and one ending in the last column (q = n)]<br />

The alignments look like this:<br />

[Figure: the two corresponding overlap alignments of a and b]<br />
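The dynamic program and the choice of (p, q) can be put together in a short Python sketch. The scoring values (match +1, mismatch −1, δ = 2) are assumptions chosen for illustration; a is indexed by rows and b by columns, and the trace-back is omitted.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, delta=2):
    """Return (score, p, q) of the best overlap alignment of a and b."""
    m, n = len(a), len(b)
    # First row and column stay 0: an overlap may start anywhere on the border.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] - delta,
                          A[i][j - 1] - delta,
                          A[i - 1][j - 1] + s)
    # The alignment must end in the last row (i = m) or last column (j = n).
    border = [(A[m][j], m, j) for j in range(n + 1)]
    border += [(A[i][n], i, n) for i in range(m + 1)]
    return max(border)


print(overlap_alignment("ACGTTGCA", "TGCAGGT"))
```

In the usage line, the last four letters of the first read match the first four letters of the second, so the best cell lies in the last row.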

4.8 Faster overlap detection<br />

Dynamic programm<strong>in</strong>g is too slow for large sequenc<strong>in</strong>g projects. Indeed, it is wasteful, as <strong>in</strong><br />

assembly, only high scor<strong>in</strong>g overlaps with more than 96% identity, say, play a role.<br />

One can use a seed and extend approach (as used <strong>in</strong> BLAST):<br />

1. Produce the concatenation of all input reads $H = f_1 f_2 \ldots f_R$.<br />

2. For each read $f_i \in F$: Find all seeds, i.e. exact matches between k-mers of $f_i$ and the<br />

concatenated sequence H. (Merge neighboring seeds.)<br />

3. Compute extensions: Attempt to extend each (merged) seed to a high scoring overlap<br />

alignment between $f_i$ and the corresponding read $f_j$.<br />

(A k-mer is a string of length k. In this context, k = 18 ... 22.)<br />
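The seed computation can be sketched with a k-mer index. Instead of searching the concatenation H, this sketch stores for every k-mer the reads and positions where it occurs, which is a common alternative and not necessarily what any particular assembler does. It considers the forward strand only and does not merge neighboring seeds; the function name is hypothetical.

```python
from collections import defaultdict


def find_seeds(reads, k):
    """Map each read pair (i, j), i < j, to the positions of shared k-mers."""
    index = defaultdict(list)
    for r, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((r, pos))
    seeds = defaultdict(list)
    for hits in index.values():
        for i, pos_i in hits:
            for j, pos_j in hits:
                if i < j:
                    seeds[(i, j)].append((pos_i, pos_j))
    return dict(seeds)


reads = ["ACGTTGCA", "TTGCAGGT", "CCCCCCCC"]
print(find_seeds(reads, 4))
```

Each reported seed would then be handed to the extension step.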

Computation of seeds:<br />

[Figure: read fi is compared against the concatenation H = f1 f2 f3 f4 ...; the exact k-mer matches are the seeds]<br />

Extension of seeds using banded dynamic programming (running time is linear in the read length):<br />

[Figure: a seed between fi and fj is extended in both directions by banded alignment]<br />



4.9 True and repeat-<strong>in</strong>duced overlaps<br />

Assume that we have found a high quality overlap o between $f_i$ and $f_j$. There are three<br />

possible cases:<br />

• The overlap o corresponds to an overlap of $f_i$ and $f_j$ in the source sequence. In this case<br />

we call o a true overlap.<br />

• The reads $f_i$ and $f_j$ come from different parts of the source sequence and their overlapping<br />

portions are contained in different instances of the same repeat. This is called a repeat-induced<br />

overlap.<br />

• The overlap exists by chance. To avoid short random overlaps, one requires that an<br />

overlap is at least 40bp long, say.<br />

[Figure: source sequence with two repeat instances R1 and R2 and the reads fi, fj, fk, fl placed on it]<br />

True overlap between $f_i$ and $f_j$, repeat-induced overlap between $f_k$ and $f_l$.<br />

4.10 Avoid<strong>in</strong>g repeat-<strong>in</strong>duced overlaps<br />

To avoid the computation of repeat-<strong>in</strong>duced overlaps, one strategy is to only consi<strong>der</strong> seeds <strong>in</strong><br />

the seed-and-extend computation whose k-mers are not conta<strong>in</strong>ed <strong>in</strong>side a repeat. In this way<br />

we can ensure that any computed overlap has a significant unique part.<br />

There are two strategies for this:<br />

• Screening known repeats: Each read is aligned against a database of known repeats, e.g.<br />

using RepeatMasker. Portions of reads that match a known repeat are labeled repetitive.<br />

• De novo screen<strong>in</strong>g: For each k-mer conta<strong>in</strong>ed <strong>in</strong> H, the concatenation of reads, we determ<strong>in</strong>e<br />

how many times it occurs <strong>in</strong> H and then label those k-mers as repetitive, whose<br />

number of occurrences is unexpectedly high.<br />
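De novo screening amounts to counting k-mers. In the sketch below, the threshold `max_count` stands in for "unexpectedly high"; how to choose it (for example from the sequencing coverage) is not specified in the notes and is left as a parameter. The function name is an assumption.

```python
from collections import Counter


def repetitive_kmers(reads, k, max_count):
    """Return all k-mers that occur more than max_count times in the reads."""
    counts = Counter(read[p:p + k]
                     for read in reads
                     for p in range(len(read) - k + 1))
    return {kmer for kmer, c in counts.items() if c > max_count}


reads = ["ACGTACGT", "TTACGTAA", "GGGGACGT"]
print(repetitive_kmers(reads, 4, 2))
```

Seeds whose k-mers are in the returned set would be skipped in the seed-and-extend computation.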

4.11 Celera’s overlapper<br />

The assembler developed at Celera Genomics employs an overlapper that compares up to 32<br />

million pairs of reads per second.<br />

Overlapp<strong>in</strong>g all pairs of 27 million reads of human DNA us<strong>in</strong>g this program takes about 10<br />

days, runn<strong>in</strong>g on about 10-20 four processor mach<strong>in</strong>es (Compaq ES40), each with 4GB of ma<strong>in</strong><br />

memory.<br />

The input data file is about 50GB. To parallelize the overlap computation, each job grabs as many<br />

reads as will fit <strong>in</strong>to 4GB of memory (m<strong>in</strong>us the memory necessary for do<strong>in</strong>g the computation)<br />

and then streams all 27 million reads aga<strong>in</strong>st the ones <strong>in</strong> memory.



4.12 The overlap graph<br />

The overlap phase produces an overlap graph OG, defined as follows: Each read $f_p \in F$ is<br />

represented by a directed edge $(s_p, e_p)$ from node $s_p$ to $e_p$, representing the start and end of $f_p$,<br />

respectively. The length of such a read edge is simply the length of the corresponding read.<br />

An overlap between $f_p = f_{p1} f_{p2} \ldots f_{pm}$ and $f_q = f_{q1} f_{q2} \ldots f_{qn}$ gives rise to an undirected overlap<br />

edge e between $s_p$, or $e_p$, and $s_q$, or $e_q$, depending on the orientation of the overlap, e.g.:<br />

[Figure: read edge ($s_p$, $e_p$) for $f_p$ with positions 1, i, m and read edge ($s_q$, $e_q$) for $f_q$ with positions 1, j, n; positions i to m of $f_p$ overlap positions 1 to j of $f_q$]<br />

The label (or “length”) of the overlap edge e is defined to be −1 times the overlap length, e.g.<br />

$-\left(\frac{m-i+j-1}{2} + 1\right)$ in the figure.<br />

4.13 Example<br />

Assume we are given 6 reads F = {f 1 , f 2 , . . ., f 6 }, each of length 500, together with the follow<strong>in</strong>g<br />

overlaps:<br />

[Figure: the pairwise overlaps between the reads, with overlap lengths 320, 40, 95, 80, 50, 330, 60 and 250]<br />

Here, for example, the last 320 bases of read $f_1$ align to the first 320 bases of the reverse<br />

complement $\bar{f}_2$ of $f_2$, whereas $f_1$ and $f_5$ overlap in the first 50 bases of each.<br />

We obtain the following overlap graph OG:<br />

[Figure: overlap graph OG with read edges f1, ..., f6 and overlap edges labeled −250, −330, −50, −60, −40, −320, −95 and −80]<br />

Each read $f_p$ is represented by a read edge $(s_p, e_p)$ of length $|f_p|$. Overlaps off the start $s_p$, or<br />

end $e_p$, of $f_p$ are represented by overlap edges starting at the node $s_p$, or $e_p$, respectively. Each<br />

overlap edge is labeled by −1 times the overlap length.<br />

4.14 The layout phase

The goal of the layout phase is to arrange all reads into an approximate multi-alignment. This involves assigning coordinates to all nodes of the overlap graph OG, and thus determining the values of s_i and e_i for each read f_i.

A simple heuristic is to select a spanning forest of the overlap graph OG that contains all read edges. (A spanning forest is a set F of edges such that any two nodes in the same connected component of OG are connected by a unique simple, unoriented path of edges in F.)

[Figure: the overlap graph OG of the example, with the same edge labels as above, illustrating a spanning forest.]

Such a subset of edges positions every read with respect to every other, within a given connected component of the graph:

[Figure: resulting layout of the reads f5, f6, f1, f4, f2 and f3 along a coordinate axis with marks at 1, 280, 450, 500, 730, 950, 1410 and 1830.]

Such a putative alignment of reads is called a contig.

The spanning tree is usually constructed using a greedy heuristic in which the overlap edges are chosen in order of decreasing overlap length (i.e., increasing edge “length”).
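The greedy heuristic amounts to Kruskal's algorithm on the overlap edges, with all read edges forced into the forest. The following is a minimal sketch under that reading; the function name, the input format and the small test instance are assumptions made for illustration and do not come from the notes.

```python
def greedy_spanning_forest(reads, overlaps):
    """reads: iterable of read names.
    overlaps: list of (node_a, node_b, overlap_length), where a node is
    (read, 's') or (read, 'e').
    Returns the overlap edges selected for the spanning forest."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every read edge belongs to the forest, so both ends of a read
    # start out in the same component.
    for r in reads:
        parent[(r, "s")] = (r, "s")
        parent[(r, "e")] = (r, "s")

    chosen = []
    # Decreasing overlap length is the same as increasing edge "length".
    for a, b, length in sorted(overlaps, key=lambda o: -o[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, keep it
            parent[root_a] = root_b
            chosen.append((a, b, length))
    return chosen


example = greedy_spanning_forest(
    ["a", "b", "c"],
    [(("a", "e"), ("b", "s"), 300),
     (("b", "e"), ("c", "s"), 200),
     (("a", "e"), ("c", "s"), 100)],
)
```

In the small instance, the overlap of length 100 is skipped because reads a and c are already connected through b.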

[Figure: the overlap graph OG of the example, shown again with the same edge labels.]

4.15 Repeats and the layout phase

Consider the following situation:

[Figure: a source sequence containing a two-copy repeat R, R′, with reads f1, ..., f7 sampled from it.]

This gives rise to the following overlap graph:

[Figure: overlap graph on the reads f1, ..., f7.]

Consider this spanning tree:

[Figure: the same overlap graph with a spanning tree that uses the overlap edges labeled e and f.]

A layout produced using the edge e or f does not reflect the true ordering of the reads, and the resulting contig is called misassembled:

[Figure: misassembled layout of the reads f1, ..., f7.]

However, avoiding the repeat-induced edges e and f, one obtains a correct layout:

[Figure: correct layout of the reads f1, ..., f7.]

Note that neither of the two layouts is “consistent” with all overlap edges in the graph.

4.16 Unitigging

The main difficulty in the layout phase is that we cannot distinguish between true overlaps and repeat-induced overlaps. The latter produce “inconsistent” layouts in which the coordinate assignment induces overlaps that are not reflected in the overlap graph (e.g., reads f_4 and f_7 in the example above).

Thus, the layout phase proceeds in two stages:

1. Unitigging: First, all uniquely assemblable contigs are produced, as just described. These are called unitigs.

2. Repeat resolution: Then, at a later stage, one attempts to reconstruct the repetitive sequence that lies between such unitigs.

Reads are sampled from a source sequence that contains repeats:

[Figure: source sequence with marked repeats and the reads sampled from it.]

Reads that form consistent chains in the overlap graph are assembled into unitigs and the remaining “repetitive” reads are processed later:

[Figure: the resulting unitigs, their layouts, and the left-over reads in repeats.]

4.17 Unique unitigs

As defined above, a “unitig” is obtained as a chain of consistently overlapping reads. However, a unitig does not necessarily represent a segment of unique source sequence. For example, its reads may come from the interior of different instances of a long (many copy) repeat:

[Figure: source sequence with three copies R, R′, R″ of a repeat and the reads sampled from it; one chain of reads forms a unique unitig, another forms a non-unique unitig.]

Non-unique unitigs can be detected by virtue of the fact that they contain significantly more reads than expected.

4.18 Identifying unique unitigs

Let R be the number of reads and G the estimated length of the source sequence. For a unitig with k reads and approximate length ρ, the probability of seeing the k − 1 start positions in the interval of length ρ is

\[ \frac{e^{-c} c^{k}}{k!}, \quad \text{with } c := \frac{\rho R}{G}, \]

if the unitig is not oversampled, and

\[ \frac{e^{-2c} (2c)^{k}}{k!}, \]

if the unitig is the result of collapsing two repeats. (See Mike Waterman’s book, page 148, for details.)

The arrival statistic used to identify unique unitigs is the (natural) log of the ratio of these two probabilities,

\[ \log \frac{e^{-c} c^{k} / k!}{e^{-2c} (2c)^{k} / k!} = c - (\log 2)\,k. \]

A unitig is called unique if its arrival statistic has a positive value of 10 or more, say.
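As a small illustration, the arrival statistic can be computed directly from the formula c − (log 2)k. The function names and the numbers in the example (one million reads on a 120Mb source, a unitig of length 5000) are hypothetical and chosen only to show the effect of oversampling.

```python
import math


def arrival_statistic(k, rho, num_reads, genome_length):
    """Log-odds that a unitig with k reads and length rho is unique
    rather than the collapse of two repeat copies."""
    c = rho * num_reads / genome_length  # expected number of reads
    return c - math.log(2) * k


def is_unique(k, rho, num_reads, genome_length, threshold=10.0):
    # The notes suggest a threshold of about 10.
    return arrival_statistic(k, rho, num_reads, genome_length) >= threshold


# Hypothetical numbers: about 41.7 reads are expected in this unitig.
normal = is_unique(40, 5000, 1_000_000, 120_000_000)
oversampled = is_unique(80, 5000, 1_000_000, 120_000_000)
```

With roughly the expected number of reads the unitig passes the threshold, whereas a unitig holding about twice as many reads as expected gets a negative statistic.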

4.19 Mate pairs

Fragment assembly of reads produces contigs, whose relative placement and orientation with respect to each other is unknown.

Recall that modern shotgun sequencing protocols employ a so-called double-barreled shotgun. That is, longer clones of a given fixed length are sequenced from both ends and one obtains a pair of reads, a mate pair, whose relative orientation is known and whose length is known in terms of its mean µ and standard deviation σ:

[Figure: a clone with a read at each end, labeled (µ, σ).]

Typical clone lengths are µ = 2kb, 10kb, 50kb or 150kb. For clean data, σ ≈ 10% of µ. Mate pair mismatching is a problem and can affect 10-30% of all pairs.


4.20 Scaffolding

Consider two reconstructed contigs. If they correspond to neighboring regions in the source sequence, then we can expect mate pairs to span the gap between them:

[Figure: contigs c1 and c2 with mate pairs spanning the gap between them.]

Such mate pairs determine the relative orientation of both contigs, and we can compute a mean and standard deviation for the gap between them. In this case, the contigs are said to be scaffolded.

4.21 Determining the distance between two contigs

Suppose we are given two contigs c_1 and c_2 connected by mate pairs m_1, m_2, ..., m_k. Each mate pair gives an estimate of the distance between the two contigs.

These estimates can be viewed as independent measurements (l_1, σ_1), (l_2, σ_2), ..., (l_k, σ_k) of the distance D between the two contigs c_1 and c_2. Following standard statistical practice, they can be combined as follows. Define

\[ p := \sum_i \frac{l_i}{\sigma_i^2} \quad \text{and} \quad q := \sum_i \frac{1}{\sigma_i^2}. \]

We set the distance between c_1 and c_2 to

\[ D := \frac{p}{q}, \quad \text{with standard deviation } \sigma := \frac{1}{\sqrt{q}}. \]

Here is an example:

[Figure: two contigs joined by four mate pairs (two 2k and two 10k mate pairs) with estimates (l_1, σ_1), ..., (l_4, σ_4), combined into a single estimate (D, σ).]

It is possible that the mate pairs between two contigs c_1 and c_2 lead to significantly different estimates of the distance between c_1 and c_2. In practice, only mate pairs that confirm each other, i.e. whose estimates are within 3σ of each other, say, are considered together in a gap estimation.
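The combination rule is ordinary inverse-variance weighting. The following sketch implements p, q, D and σ exactly as defined above; the function name and the sample measurements are made up for illustration.

```python
import math


def combine_estimates(measurements):
    """measurements: list of (l_i, sigma_i) gap estimates with sigma_i > 0.
    Returns the combined estimate (D, sigma)."""
    p = sum(l / s ** 2 for l, s in measurements)
    q = sum(1 / s ** 2 for _, s in measurements)
    return p / q, 1 / math.sqrt(q)


# Two hypothetical mate pairs with the same standard deviation.
D, sigma = combine_estimates([(1000, 100), (1200, 100)])
```

For equal standard deviations the result is the plain average, and the combined standard deviation shrinks by a factor of √k.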

4.22 The significance of mate pairs

Suppose we are given two contigs c_1 and c_2. If there is only one mate pair between the two contigs, then due to the high error rates associated with mate pairs, this is not significant.

If, however, c_1 and c_2 are unique unitigs, and if there exist two different mate pairs between the two that give rise to the same relative orientation and similar estimates of the gap size between c_1 and c_2, then the scaffolding of c_1 and c_2 is highly reliable.

This is because the probability that two false mate pairs occur that confirm each other is extremely small.

4.23 Example

Let the sequence length be G = 120Mb, for example (Drosophila). For simplicity, assume we have 5-fold coverage of mate pairs, with a mean length of µ = 10kb and standard deviation of σ = 1kb.

Consider a false mate pair m_1 = (f_1, f_2) with reads f_1 and f_2. Let N_1 and N_2 denote the two intervals (in the source sequence) that extend 3σ to either side of the starts of f_1 and f_2, respectively. Both have length 6kb.

Consider a second false mate pair m_2 = (g_1, g_2) with g_1 inside N_1. The probability that g_2 lies in N_2 is roughly

\[ \frac{6\,\text{kb}}{120\,\text{Mb}} = \frac{1}{20000}. \]

[Figure: source sequence with the intervals N_1 and N_2, the false mate pair m_1 = (f_1, f_2), and a second mate pair m_2 starting at g_1 inside N_1.]

Assume that the reads have length 600. Assume that 10% of all mate pairs are false. At 5-fold coverage, the interval N_1 is covered by about 5 · 6000/600 = 50 reads, of which ≈ 5 have false mates.

Hence, the probability that m_1 is confirmed by some second false mate pair m_2 is

\[ \approx 5 \cdot \frac{1}{20000} = \frac{1}{4000} = 0.00025. \]
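The arithmetic of this example can be replayed in a few lines, which makes it easy to try other genome sizes or error rates. The function and parameter names are assumptions made for this sketch; the input values are the ones used in the text.

```python
def false_confirmation_probability(genome_length, sigma, read_length,
                                   coverage, false_rate):
    """Rough probability that a false mate pair is confirmed by a second one."""
    window = 2 * 3 * sigma                       # interval length, 3 sigma each side
    p_hit = window / genome_length               # second mate lands in N_2
    reads_in_window = coverage * window / read_length
    false_mates = false_rate * reads_in_window   # reads in N_1 with a false mate
    return false_mates * p_hit


p = false_confirmation_probability(
    genome_length=120_000_000, sigma=1_000, read_length=600,
    coverage=5, false_rate=0.10,
)
```

With the values from the text, `p` evaluates to about 0.00025.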

4.24 The overlap-mate graph

Let F = {f_1, f_2, ..., f_R} be a set of reads and let G denote the overlap graph associated with F.

Suppose we are given one set (or more) M_{µ,σ} = {m_1, ..., m_k} of mate pairs, each of the form m = (f_i, f_j), with mean µ and standard deviation σ.

Let f_i and f_j be two mated reads represented by the edges (s_i, e_i) and (s_j, e_j) in G. We add an undirected mate edge between e_i and e_j, labeled (µ, σ), to indicate that f_i and f_j are mates and thus obtain the overlap-mate graph:

[Figure: the overlap graph of the earlier example, extended by reads f7 and f8 and by two mate edges labeled (2000, 200) and (10000, 1000).]


4.25 The contig-mate graph

Suppose we are given a set F of fragments and a set of assembled contigs C = {c_1, c_2, ..., c_t}. A more useful graph is obtained as follows:

Represent each assembled contig c_i by a contig edge with nodes s_i and e_i. Then, add mate edges between such nodes to indicate that the corresponding contigs contain fragments that are mates:

[Figure: two contigs joined by four mate pairs (two 2k and two 10k mate pairs) with estimates (l_1, σ_1), ..., (l_4, σ_4).]

Leads to:

[Figure: contig edges c1 and c2 joined by four mate edges labeled (l_1, σ_1), ..., (l_4, σ_4).]

4.26 Edge bundling

Consider two contigs c_1 and c_2, joined by mate edges m_1, ..., m_k between nodes e_1 and s_2, say. Every maximal subset of mutually confirming mate edges is replaced by a single bundled mate edge e, whose mean length µ and standard deviation σ are computed as discussed above. Any such bundled edge is labeled (µ, σ).

(A heuristic used to compute these subsets is to repeatedly bundle the median-length simple mate edge with all mate edges within three standard deviations of it, until all simple mate edges have been bundled.)

Additionally, we set the weight w(e) of any mate edge to 1, if it is a simple mate edge, and to

\[ \sum_{i=1}^{k} w(e_i), \]

if it was obtained by bundling edges e_1, ..., e_k.
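The median-based bundling heuristic can be sketched as follows. This is one possible reading, not the implementation used in practice: “within three standard deviations of it” is taken to mean within 3σ of the median edge, using that edge's own σ, and the bundled (µ, σ) is computed by the inverse-variance rule of section 4.21. The function name and the sample edges are hypothetical.

```python
import math


def bundle_mate_edges(edges):
    """edges: list of (mu, sigma) simple mate edges between the same two
    nodes, with sigma > 0. Returns bundled edges as (mu, sigma, weight)."""
    remaining = sorted(edges)  # sorted by length mu
    bundles = []
    while remaining:
        mu_med, sigma_med = remaining[len(remaining) // 2]  # median-length edge
        group = [(m, s) for m, s in remaining
                 if abs(m - mu_med) <= 3 * sigma_med]
        remaining = [(m, s) for m, s in remaining
                     if abs(m - mu_med) > 3 * sigma_med]
        p = sum(m / s ** 2 for m, s in group)
        q = sum(1 / s ** 2 for _, s in group)
        # Each simple edge has weight 1, so the bundle weight is the group size.
        bundles.append((p / q, 1 / math.sqrt(q), len(group)))
    return bundles


bundles = bundle_mate_edges([(1000, 100), (1100, 100), (10000, 1000)])
```

In the sample call, the two short edges confirm each other and form a bundle of weight 2, while the long edge stays on its own with weight 1.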

Consider the following graph:

[Figure: contig-mate graph with simple mate edges.]

Assuming that mate edges drawn together have similar lengths and large enough standard deviation, edge bundling will produce the following graph:

[Figure: the same graph after bundling, with bundled mate edges of weights w=2, w=2, w=3 and w=4.]

4.27 Transitive edge reduction

Consider the previous graph with some specific edge lengths:

[Figure: contigs c1, c2 and c3 with mate edges e, f, g and h; the surviving labels are µ = 4200, l = 2000, µ = 40, µ = 1000 and µ = 1000.]

The mate edge e gives rise to an estimate of the distance from the right node of contig c_1 to the left node of c_3 that is similar to the one obtained by following the path P = (g, c_2, h). Based on this transitivity property we can reduce the edge e on to the path P to obtain:

[Figure: the graph after the reduction, with mate edge weights w=2, w=3+2 and w=4+2.]

Consider two nodes v and w that are connected by an alternating path P = (m_1, c_1, m_2, ..., m_k) of mate-edges (m_1, m_2, ...) and contig edges (c_1, c_2, ...) from v to w, beginning and ending with a mate-edge. We obtain a mean length and standard deviation for P by setting

\[ l(P) := \sum_{m_i} \mu(m_i) + \sum_{c_i} l(c_i) \quad \text{and} \quad \sigma(P) := \sqrt{\sum_{m_i} \sigma(m_i)^2}. \]

We say that a mate-edge e from v to w can be transitively reduced on to the path P, if e and P approximately have the same length, i.e. if |µ(e) − l(P)| ≤ C · max{σ(e), σ(P)} for some constant C, typically 3. If this is the case, then we can reduce e by removing e from the graph and incrementing the weight of every mate-edge m_i in P by w(e).
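The reduction test translates directly into code. In the sketch below the function names are assumptions, and so are the numbers: the lengths follow one reading of the figure (µ(e) = 4200, two mate edges of mean 1000, one contig of length 2000), while the standard deviations are invented because the figure gives none.

```python
import math


def path_length_and_sigma(mate_edges, contig_lengths):
    """mate_edges: list of (mu, sigma); contig_lengths: list of l(c_i)."""
    length = sum(mu for mu, _ in mate_edges) + sum(contig_lengths)
    sigma = math.sqrt(sum(s ** 2 for _, s in mate_edges))
    return length, sigma


def can_reduce(e, mate_edges, contig_lengths, C=3.0):
    """e: (mu, sigma) of the mate edge to be reduced on to the path."""
    mu_e, sigma_e = e
    l_p, sigma_p = path_length_and_sigma(mate_edges, contig_lengths)
    return abs(mu_e - l_p) <= C * max(sigma_e, sigma_p)


path = [(1000, 100), (1000, 100)]            # mate edges g and h
reducible = can_reduce((4200, 400), path, [2000])
not_reducible = can_reduce((6000, 100), path, [2000])
```

Here the path has length 4000, so an edge of mean 4200 agrees with it, whereas an edge of mean 6000 does not.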

In the following, we will assume that any contig-mate graph considered has been edge-bundled and perhaps also transitively reduced to some degree.

4.28 Happy mate pairs

Consider a mate pair m of two reads f_i and f_j, obtained from a clone of mean length µ and standard deviation σ:

[Figure: reads f_i and f_j at the two ends of a clone labeled (µ, σ).]

Assume that f_i and f_j are contained in the same contig or scaffold of an assembly. We call m happy, if f_i and f_j have the correct relative orientation (i.e., are facing each other) and are at approximately the right distance, i.e., |µ − |s_i − s_j|| ≤ 3σ, say. Otherwise, m is unhappy. Two unhappy mates are highlighted here:

[Figure: contigs c1 and c2 with their mate pairs; two unhappy mate pairs are highlighted.]


4.29 Ordering and orientation of the contig-mate graph

Suppose we are given a collection of contigs C = {c_1, c_2, ..., c_k} constructed from a set of reads F = {f_1, f_2, ..., f_R}, together with the corresponding mate pair information M. Let G = (V, E) denote the associated contig-mate graph.

An ordering (and orientation) of G (or C) is a map φ : V → ℕ such that |φ(s_i) − φ(e_i)| = l(c_i) for all contigs c_i ∈ C, in other words, an assignment of coordinates to all nodes that preserves contig lengths.

Additionally, we require {φ(s_i), φ(e_i)} ≠ {φ(s_j), φ(e_j)} for any two distinct contigs c_i and c_j.

4.30 Example

Given the following contig-mate graph:

[Figure: contig-mate graph on the contigs c1, ..., c5; the surviving edge labels are 1500, 900, 900, 400, 1000, 5000, 1000, 1500, 1500 and 2500.]

An ordering φ assigns coordinates φ(v) to all nodes v and thus determines a layout of the contigs:

[Figure: layout of the contigs in the order c2, c4, c1, c3, c5, with node coordinates φ(s_2), φ(e_2), φ(e_4), φ(s_4), φ(e_1), φ(s_1), φ(s_3), φ(e_3), φ(s_5), φ(e_5); the surviving labels are 2700, 5000, 400, 900, 1500, 1000, 1500, 1000, 2500, 2700, 900 and 1500.]

4.31 Happiness of mate edges

Let G = (V, E) be a contig-mate graph and φ an ordering of G.

Consider a mate-edge e with nodes v and w. Let c_i denote the contig edge incident to v and let c_j denote the contig edge incident to w. Let v′ and w′ denote the other two nodes of c_i and c_j, respectively. We call e happy (with respect to φ), if c_i and c_j have the correct relative orientation, and if the distance between v and w is approximately correct, in other words, we require that either

1. φ(v′) ≤ φ(v) and |φ(w) − φ(v) − µ(e)| ≤ 3σ(e) and φ(w) ≤ φ(w′), or

2. φ(w′) ≤ φ(w) and |φ(v) − φ(w) − µ(e)| ≤ 3σ(e) and φ(v) ≤ φ(v′).

Otherwise, e is unhappy.
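The two conditions can be checked mechanically for a given ordering. The sketch below assumes that φ is stored as a dictionary from node names to coordinates; the function name, the node names and the coordinates in the example are hypothetical.

```python
def is_happy(phi, v, v_other, w, w_other, mu, sigma):
    """phi: dict node -> coordinate. v_other and w_other are the nodes
    v' and w' at the other ends of the contig edges incident to v and w."""
    case1 = (phi[v_other] <= phi[v]
             and abs(phi[w] - phi[v] - mu) <= 3 * sigma
             and phi[w] <= phi[w_other])
    case2 = (phi[w_other] <= phi[w]
             and abs(phi[v] - phi[w] - mu) <= 3 * sigma
             and phi[v] <= phi[v_other])
    return case1 or case2


# Two contigs of length 1000 with a gap of 1000 between them.
phi = {"s1": 0, "e1": 1000, "s2": 2000, "e2": 3000}
happy = is_happy(phi, "e1", "s1", "s2", "e2", mu=1000, sigma=100)

# The same mate edge after flipping the second contig.
flipped = {"s1": 0, "e1": 1000, "s2": 3000, "e2": 2000}
unhappy = is_happy(flipped, "e1", "s1", "s2", "e2", mu=1000, sigma=100)
```

Flipping the second contig moves the node s2 to the far end, so the mate edge spans 2000 instead of 1000 and is no longer happy.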

4.32 The Contig Ordering Problem

Suppose we are given a collection of contigs C = {c_1, c_2, ..., c_k} constructed from a set of reads F = {f_1, f_2, ..., f_R}, together with the corresponding mate pair information M. Let G = (V, E) denote the associated contig-mate graph.

Problem: The Contig Ordering Problem is to find an ordering of G that maximizes the sum of weights of happy mate edges.

Theorem: The corresponding decision problem is NP-complete.

(The decision problem is: Given a contig-mate graph G and a number K, does there exist an ordering of G such that the total weight of all happy edges is ≥ K?)

4.33 Proof of NP-completeness

Recall: to prove that a problem X is NP-complete one must show that X is in NP and reduce a known NP-complete problem N to X. For the reduction, one must show that any instance I of N can be translated into an instance J of X in polynomial time such that I has the answer true iff J does.

We will use the following NP-complete problem:

BANDWIDTH: For a given graph G = (V, E) with node set V = {v_1, v_2, ..., v_n} and number K, does there exist a permutation φ of {1, 2, ..., n} such that for all edges {v_i, v_j} ∈ E we have |φ(i) − φ(j)| ≤ K? (See Garey and Johnson 1979 for details.)

A graph with bandwidth 4:

[Figure: example of a graph with bandwidth 4.]

Problem is in NP: For a given ordering φ, we can determine whether the total weight of happy mate-edges reaches the given threshold K in polynomial time by simple inspection of all mate edges.

Reduction of BANDWIDTH: Given an instance G = (V, E) of this problem, we construct a contig-mate graph G′ = (V′, E′) in polynomial time as follows:

First, set V′ := V and E′ := E, and let these edges be the mate-edges, setting

\[ \mu(e) := 1 + \frac{K-1}{2} \quad \text{and} \quad \sigma(e) := \frac{K-1}{6}, \]

so as to obtain a happy range of [1, K], and w(e) := 1, for every mate-edge e.

Then, for each initial node v ∈ V, add a new auxiliary node v′ to V′ and join v and v′ by a contig edge of length 0.

The answer to the BANDWIDTH question is true, iff the graph G′ has an ordering φ such that all mate edges in G′ are happy:

A graph G has BANDWIDTH ≤ K

⇐⇒

∃ permutation φ such that (v_i, v_j) ∈ E implies |φ(i) − φ(j)| ≤ K

⇐⇒

∃ ordering φ such that (v_i, v_j) ∈ E implies 1 ≤ |φ(i) − φ(j)| ≤ K

⇐⇒

∃ ordering φ such that e = (v_i, v_j) ∈ E implies µ(e) − 3σ(e) ≤ |φ(i) − φ(j)| ≤ µ(e) + 3σ(e)

⇐⇒

all mate-edges of G′ are happy.
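The construction used in the reduction is simple enough to write down as code. The sketch below builds the contig edges and mate edges of G′ from a BANDWIDTH instance; the function name, the tuple layout and the representation of the auxiliary node v′ as `(v, "aux")` are assumptions. The returned threshold |E| is also an assumption: it expresses “all mate edges are happy” as a threshold on the total weight, since every mate edge has weight 1.

```python
def bandwidth_to_contig_ordering(nodes, edges, K):
    """Translate a BANDWIDTH instance (nodes, edges, K) into a
    contig-mate graph plus a threshold on the happy weight."""
    mu = 1 + (K - 1) / 2
    sigma = (K - 1) / 6
    # Each original node v is joined to an auxiliary node by a contig edge
    # of length 0.
    contig_edges = [(v, (v, "aux"), 0) for v in nodes]
    # Each original edge becomes a mate edge (u, v, mu, sigma, weight).
    mate_edges = [(u, v, mu, sigma, 1) for u, v in edges]
    threshold = len(mate_edges)
    return contig_edges, mate_edges, threshold


contigs, mates, threshold = bandwidth_to_contig_ordering(
    ["a", "b", "c"], [("a", "b"), ("b", "c")], K=4)
_, _, mu, sigma, _ = mates[0]
happy_range = (mu - 3 * sigma, mu + 3 * sigma)
```

For K = 4 this gives µ = 2.5 and σ = 0.5, so the happy range µ ± 3σ is [1, 4], as required.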


4.34 Spanning tree heuristic for the Contig Ordering Problem

An ordering φ that maximizes the number of happy mate edges is a useful scaffolding of the given contigs.

The simplest heuristic for obtaining an ordering is to compute a maximum weight spanning tree for the contig-mate graph and use it to order all contigs, similar to the read layout problem.

[Figure: contigs c1, ..., c7 placed along the source sequence, with one false mate edge.]

Unfortunately, this method does not work well in practice, as false mate edges lead to incorrect interleaving of contigs from completely different regions of the source sequence:

[Figure: contigs c1, c2, c3, c4 interleaved with contigs c5, c6, c7.]

4.35 Representing an ordering in the contig-mate graph

By the definition given above, an ordering is an assignment of coordinates to all nodes of the contig-mate graph that corresponds to a scaffolding of the contigs. When we are not interested in the exact coordinates, the relative order and orientation of the contigs can be represented as follows:

Given a contig-mate graph G = (V, E), a set S ⊆ E of selected edges is called a scaffolding of G, if it has the following two properties:

• every contig edge is selected, and

• every node is incident to at most two selected edges.

Thus, a scaffolding of G is a set of non-intersecting selected paths, each representing a scaffolding of its contained contigs.
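The two defining properties of a scaffolding are easy to verify for a candidate edge set. The following sketch checks exactly these two properties and nothing more. The function name and the edge representation as node pairs are assumptions, and parallel edges between the same two nodes are not distinguished.

```python
from collections import Counter


def is_scaffolding(contig_edges, selected):
    """contig_edges, selected: collections of edges given as (node_a, node_b)."""
    chosen = {frozenset(e) for e in selected}
    # Property 1: every contig edge is selected.
    if any(frozenset(c) not in chosen for c in contig_edges):
        return False
    # Property 2: every node is incident to at most two selected edges.
    degree = Counter(node for e in chosen for node in e)
    return all(d <= 2 for d in degree.values())


contigs = [("s1", "e1"), ("s2", "e2")]
ok = is_scaffolding(contigs, contigs + [("e1", "s2")])
too_many = is_scaffolding(contigs, contigs + [("e1", "s2"), ("e1", "e2")])
```

In the second call the node e1 is incident to three selected edges, so the edge set is rejected.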

The following example contains two chains of selected edges representing scaffolds s_1 = (c_1, c_2, c_3, c_4) and s_2 = (c_5, c_6, c_7):

[Figure: two selected paths, one through c1, c2, c3, c4 and one through c5, c6, c7.]

However, to be able to represent the interleaved scaffolding discussed earlier, we need to add some inferred edges (shown here as dotted lines) to the graph:

[Figure: the contigs c1, ..., c7 in interleaved order, connected by additional inferred edges.]

4.36 Greedy path-merging

Given a contig-mate graph G = (V, E), the greedy path-merging algorithm is a heuristic for solving the Contig Ordering Problem. It proceeds “bottom up” as follows, maintaining a valid scaffolding S ⊆ E:

Initially, all contig edges c_1, c_2, ..., c_k are selected, and none others. At this stage, the graph consists of k selected paths P_1 = (c_1), ..., P_k = (c_k).

Then, in order of decreasing weight, we consider each mate edge e = {v, w}: If v and w lie in the same selected path P_i, then e is a chord of P_i and no action is necessary.

If v and w are contained in two different paths P_i and P_j, then we attempt to merge the two paths to obtain a new path P, and accept such a merge only if the increase of H(G), the number of happy mate edges, is at least as large as the increase of U(G), the number of unhappy ones.

4.37 The greedy path-merging algorithm

Algorithm: Given a contig-mate graph G. The output of this algorithm is a node-disjoint collection of selected paths in G, each one defining an ordering of the contigs whose edges it covers.

begin
  Select all contig edges.
  for each mate-edge e in descending order of weight:
    if e is not selected:
      Let v, w denote the two nodes connected by e
      Let P_1 be the selected path incident to v
      Let P_2 be the selected path incident to w
      if P_1 ≠ P_2 and we can merge P_1 and P_2 (guided by e) to obtain P:
        if H(P) − (H(P_1) + H(P_2)) ≥ U(P) − (U(P_1) + U(P_2)):
          Replace P_1 and P_2 by P
end

4.38 Merging two paths

Given two selected paths P_1 and P_2 and a guiding unselected mate-edge e_0 with nodes v_0 (incident to P_1) and w_0 (incident to P_2), merging of P_1 and P_2 is attempted as follows:

[Figure: three stages (a), (b) and (c) of merging P_1 = (c11, ..., c15) and P_2 = (c21, ..., c27). In (a), the guiding mate edge e0 joins v0 and w0, and a second mate edge is labeled h; in (b) and (c), further nodes v1, w1, v2, w2 and edges e1, e2, f0, g0, f1, g1 appear as the two paths are zipped together.]

This algorithm returns true, if it successfully produced a new selected path P containing all contig edges in P_1 and P_2, and false, if it fails.

Merging proceeds by “zipping” the two paths P_1 and P_2 together, first starting with e_0 and “zipping” to the right. Then, with the edge labeled h now playing the role of e_0, we zip to the left. Merging is said to fail, if the positioning of the “active” contig c_{1i} implies that it must overlap with some contig in P_2 by a significant amount, but no such alignment (of sufficiently high quality) exists.

4.39 Example

Here we are given 5 contigs c_1, ..., c_5, each of length l(c_i) = 10000:

[Figure: six snapshots of greedy path-merging on the contig-mate graph of c1, ..., c5. The mate edges carry the labels (w=1, µ=34000), (w=4, µ=12000), (w=1, µ=12000), (w=3, µ=1000), (w=5, µ=12000) and (w=2, µ=1000); the last two snapshots additionally show an edge labeled µ~1000.]

The final scaffolding is (c_1, c_2, c_3, c_5, c_4).

4.40 Repeat resolution

Consider two unique unitigs u_1 and u_2 that are placed next to each other in a scaffolding, due to a heavy mate edge between them:

[Figure: unitigs u1 and u2 joined by a heavy mate edge.]

We consider all non-unique unitigs and singleton reads that potentially can be placed between u_1 and u_2 by mate edges:

[Figure: candidate unitigs and reads placed in the gap between u1 and u2.]

Different heuristics are used to explore the corresponding local region of the overlap graph in an attempt to find a chain of overlapping fragments that spans the gap and is compatible with the given mate pair information:

[Figure: a chain of overlapping fragments spanning the gap between u1 and u2.]

4.41 Summary

Given a collection F = {f_1, f_2, ..., f_R} of reads and mate pair information, sampled from an unknown source DNA sequence, assembly proceeds in the following steps:

1. compute the overlap graph, e.g. using a seed-and-extend approach,

2. construct all unitigs, e.g. using the minimal spanning tree approach,

3. scaffold the unitigs, e.g. using the greedy path-merging algorithm,

4. attempt to resolve repeats between unitigs, and

5. compute a multi-alignment of all reads in a given contig to obtain a consensus sequence for it.

Note that the algorithms for steps (2) and (3) that are used in actual assembly projects are much more sophisticated than the ones described in these notes.

4.42 A WGS assembly of human (Celera)<br />

Input: 27 million fragments of av. length 550bp, 70% paired:<br />

5m pairs of length 2kb<br />

4m pairs of length 10kb<br />

0.9m pairs of length 50kb<br />

0.35m pairs of length 150kb<br />

Celera's assembler uses approximately the following resources:<br />

Program | CPU hours | Duration | Max. memory<br />
Screener | 4800 | 2-3 days on 10-20 computers | 2GB<br />
Overlapper | 12000 | 10 days on 10-20 computers | 4GB<br />
Unitigger | 120 | 4-5 days on a single computer | 32GB<br />
Scaffolder | 120 | 4-5 days on a single computer | 32GB<br />
RepeatRez | 50 | Two days on a single computer | 32GB<br />
Consensus | 160 | One day on 10-20 computers | 2GB<br />

Total: ≈ 18000 CPU hours.<br />

The size of the human genome is ≈ 3Gb. An unpublished 2001 assembly of the 27m fragments has the following statistics:<br />

• The assembly consists of 6500 scaffolds that span 2.776Gb of sequence.<br />

• The spanned sequence contains 150,000 gaps, making up 148Mb in total.<br />

• Of the spanned sequence, 99.0% is contained in scaffolds (or contigs?) of size 30kb or more.<br />

• Of the spanned sequence, 98.7% is contained in scaffolds (or contigs?) of size 100kb or more.<br />

5 Eulerian Superpath Method of Sequence Assembly<br />

This exposition is based on the following sources, which are all recommended reading:<br />

1. Michael S. Waterman, Introduction to computational biology, Chapman and Hall, 1995. (Chapter 7, section 3.)<br />

2. Pavel A. Pevzner, Haixu Tang and Michael S. Waterman, A new approach to fragment assembly in DNA sequencing, RECOMB 2001, Montreal, Canada, Proceedings, pages 256–265.<br />

3. Pavel A. Pevzner and Haixu Tang, Fragment assembly with double-barreled data, Bioinformatics, vol. 17, suppl. 1, 2001, pages S225–S233.<br />

5.1 Eulerian path method revisited<br />

Let $A = a_1 \ldots a_n$ be an unknown DNA sequence, and let $S = \{s_1, s_2, \ldots\}$ be the spectrum of l-tuples observed using the C(l) chip.<br />

Recall that in sequencing by hybridization (SBH), the assembly problem can be formulated as the problem of finding an Euler path (to be precise, a Chinese Postman tour that visits each edge at least once and minimizes the number of edges that are used more than once) in the de Bruijn graph $G = (V, E)$, defined as follows:<br />

The set of nodes consists of all (l − 1)-tuples, and the edge set is obtained by connecting any two nodes $v = v_1 \ldots v_{l-1}$ and $w = w_1 \ldots w_{l-1}$ by a directed edge $(v, w)$ iff there exists an l-tuple $s \in S$ such that $s = v_1 \ldots v_{l-1} w_{l-1} = v_1 w_1 \ldots w_{l-1}$.<br />
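The construction above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `de_bruijn_graph` and the toy spectrum are hypothetical, and only nodes with at least one outgoing edge appear as keys.

```python
from collections import defaultdict

def de_bruijn_graph(spectrum):
    """Build the de Bruijn graph of a spectrum of l-tuples.

    Nodes are (l-1)-tuples. Each l-tuple s contributes one directed
    edge from its prefix s[:-1] to its suffix s[1:].
    """
    graph = defaultdict(list)
    for s in sorted(set(spectrum)):   # the spectrum is treated as a set
        graph[s[:-1]].append(s[1:])
    return dict(graph)

# Hypothetical toy spectrum for l = 3, taken from the string ATGGCGT.
toy_spectrum = {"ATG", "TGG", "GGC", "GCG", "CGT"}
toy_graph = de_bruijn_graph(toy_spectrum)
```

In this toy case every node has at most one outgoing edge, so the Euler path AT, TG, GG, GC, CG, GT spells the original string.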

Two main problems are that this approach is highly sensitive to sequencing errors and requires a large l to resolve repeats.<br />

Now consider a set of reads $F = \{f_1, f_2, \ldots, f_R\}$ obtained from a source sequence $A$ by shotgun sequencing. “Can the Eulerian path method be applied to fragment assembly?” (Idury and Waterman, 1995)<br />

Let us represent every read $f_i$ of length $n$ by the $n - l + 1$ (not necessarily distinct) l-tuples obtained from $f_i$. We define $F_l$ to be the spectrum of all such l-tuples.<br />

Idea: Apply the Eulerian path method to $F_l$ to obtain an assembly.<br />

Unfortunately, this naive approach does not work well in practice, because sequencing errors and repeats lead to very complicated graphs with many false edges.<br />

To obtain a feasible approach, we will look at three questions: how to fix sequencing errors, how to make use of the continuity of reads, and how to make use of the mate pair data.<br />

5.2 A typical small scale sequencing project<br />

Consider the N. meningitidis (NM) sequencing project completed at the Sanger Center in 2000. The genome is 2,184,406 bp long. Sequencing resulted in 53,263 reads of average length 400, corresponding to a coverage of 9.7.<br />

The total number of sequencing errors is 255,631, corresponding to an error rate of 1.2% and a mean of 4.8 errors per read.<br />
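As a quick check, the quoted figures follow from the read count, the average read length and the genome size. The total of 21,305,200 sequenced bases is derived here and is not stated in the notes.

```latex
\begin{align*}
\text{coverage} &= \frac{53{,}263 \times 400}{2{,}184{,}406}
                 = \frac{21{,}305{,}200}{2{,}184{,}406} \approx 9.75,\\
\text{error rate} &= \frac{255{,}631}{21{,}305{,}200} \approx 0.012 = 1.2\%,\\
\text{errors per read} &= \frac{255{,}631}{53{,}263} \approx 4.8.
\end{align*}
```

The coverage of about 9.75 is quoted in the notes as 9.7.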

NM is difficult to assemble because it contains 126 long exact repeats of up to 3832 bp in length and many more approximate ones.<br />

5.3 Error Correction<br />

Reads are collected with an error rate of about 1%. In a sequencing project with sufficient oversampling, by comparison of overlapping reads, one should be able to use the fact that sequencing errors are randomly distributed to distinguish between sequencing errors and differences due to repeats.<br />

If we knew the precise sequence of the source $A$, then we could use it to correct the reads. However, $A$ is not known until the assembly is complete, a catch-22 [1].<br />

Assume that $A$ is unknown, but that we know $A_l$, the set of all l-tuples in $A$. Then we should still be able to correct most reads. Unfortunately, $A_l$ is not known either, but we will see how to approximate $A_l$.<br />

5.4 Solid and weak l-tuples<br />

Given an unknown source sequence $A$, a collection of reads $F$ and the spectrum $F_l$ of all l-tuples from reads in $F$.<br />

An l-tuple $s \in F_l$ is solid, if it belongs to more than $M$ reads (where $M$ is a given threshold), and weak, otherwise. A natural approximation of $A_l$ is the set of all solid l-tuples in $F_l$.<br />

Motivation: If the read-coverage of $A$ is $x$, then on average every unique l-tuple $s$ in $A$ will be contained in $x$ reads. However, if one of these reads contains a sequencing error in its copy of $s$, then this erroneous l-tuple will not be contained in the other $x - 1$ reads.<br />

[Figure: source sequence A with an l-tuple that is covered by several reads, one of which contains a sequencing error.]<br />
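A minimal sketch of the solid/weak classification, assuming reads are plain strings. The function names and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import Counter

def tuple_multiplicities(reads, l):
    """Count, for every l-tuple, the number of reads that contain it."""
    counts = Counter()
    for read in reads:
        # A set, so that each read is counted at most once per l-tuple.
        counts.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return counts

def solid_and_weak(reads, l, M):
    """Split the spectrum into solid (more than M reads) and weak l-tuples."""
    counts = tuple_multiplicities(reads, l)
    solid = {s for s, m in counts.items() if m > M}
    weak = set(counts) - solid
    return solid, weak

# Hypothetical toy data: three reads of the same region, the last with one error.
toy_reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
solid, weak = solid_and_weak(toy_reads, l=3, M=1)
```

With these toy reads, the three l-tuples that cover the erroneous base end up weak, while the l-tuples shared by the error-free reads are solid.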

5.5 The spectral alignment problem<br />

Let $S$ be a collection of l-tuples called an l-spectrum. A string $A$ is called an S-string, if all its l-tuples belong to $S$.<br />

Spectral Alignment Problem (SPA): Given a string $f$ and an l-spectrum $S$, find the minimum number of mutations in $f$ that transform $f$ into an S-string.<br />

A solution to the SPA only makes sense if the number of mutations is small, in which case SPA can be solved by dynamic programming, even for large l.<br />

(See I. Pe'er and R. Shamir, Spectrum Alignment: Efficient Resequencing by Hybridization, Proceedings of ISMB, 2000)<br />
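To make the problem statement concrete, here is a brute-force sketch for very small inputs. It is not the dynamic programming approach of Pe'er and Shamir. It assumes that "mutations" means single-base substitutions, and the names `is_s_string`, `spectral_alignment` and `max_mutations` are hypothetical.

```python
from itertools import combinations, product

def is_s_string(f, spectrum, l):
    """True if every l-tuple of f belongs to the spectrum."""
    return all(f[i:i + l] in spectrum for i in range(len(f) - l + 1))

def spectral_alignment(f, spectrum, l, max_mutations=2, alphabet="ACGT"):
    """Return (d, g): the least number d of substitutions that turn f into
    an S-string g, or None if more than max_mutations are needed.

    Exhaustive search, exponential in max_mutations; for illustration only.
    """
    for d in range(max_mutations + 1):
        for positions in combinations(range(len(f)), d):
            for letters in product(alphabet, repeat=d):
                g = list(f)
                for p, c in zip(positions, letters):
                    g[p] = c
                g = "".join(g)
                if is_s_string(g, spectrum, l):
                    return d, g
    return None

# Hypothetical toy spectrum and a read with one substitution error.
toy_S = {"ACG", "CGT", "GTA", "TAC"}
result = spectral_alignment("ACGAAC", toy_S, l=3)
```

Because the number of substitutions d is tried in increasing order, the first S-string found uses a minimum number of substitutions.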

5.6 Error correction based on SPA<br />

Let $F_l^{\text{solid}}$ be the spectrum of all solid l-tuples in $F_l$. Any read $f$ that is not an $F_l^{\text{solid}}$-string can be corrected by using a minimum number of mutations to transform $f$ into an $F_l^{\text{solid}}$-string (SPA).<br />

[1] In Joseph Heller's novel, catch-22 was the paradox that trapped members of the US military: Anyone who applied to get out of the military on the grounds of insanity was behaving rationally and thus couldn't be insane.<br />

Correcting all reads in this way may change the sets of solid and weak l-tuples. Iterative application of this correction gradually increases the number of solid l-tuples and decreases the number of weak l-tuples:<br />

Algorithm: Error correction based on SPA<br />
Input: set of reads $F$. Output: corrected reads $F'$.<br />
do<br />
  Determine $F_l^{\text{solid}}$<br />
  Correct all reads that are not $F_l^{\text{solid}}$-strings<br />
until no further increase of solid l-tuples.<br />

Experiments indicate that this correction eliminates many errors in bacterial sequencing projects. However, the following formulation of the problem is even more successful:<br />

5.7 The Error Correction Problem<br />

Given a collection of reads $F = \{f_1, f_2, \ldots, f_R\}$. In the following, let $F_l$ denote the spectrum of $F$ consisting of the set of all l-tuples from the reads $f_1, f_2, \ldots, f_R$ and $\bar{f}_1, \ldots, \bar{f}_R$, where $\bar{f}$ denotes the reverse complement of $f$.<br />

Let $\Delta$ denote an upper bound on the number of errors in each read.<br />

Error Correction Problem (ECR): Given $F$, $\Delta$ and $l$, introduce up to $\Delta$ corrections in each read in $F$ in such a way that $|F_l|$ is minimized.<br />

This looks like an NP-hard problem to me...<br />
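The objective $|F_l|$ can be computed directly from the reads. The following is a small sketch, assuming reads over the alphabet ACGT. The helper names and the toy reads are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return f.translate(COMPLEMENT)[::-1]

def spectrum(reads, l):
    """Set F_l of all l-tuples of the reads and of their reverse complements."""
    tuples = set()
    for f in reads:
        for g in (f, reverse_complement(f)):
            tuples.update(g[i:i + l] for i in range(len(g) - l + 1))
    return tuples

# Hypothetical toy reads: the second copy carries one error (T read as A).
before = spectrum(["ACGTTC", "ACGATC"], l=3)
after = spectrum(["ACGTTC", "ACGTTC"], l=3)   # the error has been corrected
```

In this toy case, correcting the single error shrinks the spectrum from 10 to 6 l-tuples, which is the kind of reduction that the problem asks to maximize.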

5.8 A simple greedy heuristic for ECR<br />

Observation: An error in a read $f$ affects at most $l$ of the l-tuples in $f$ and $l$ of the l-tuples in $\bar{f}$, and usually creates $2l$ erroneous l-tuples ($2d$ for a position within a distance of $d < l$ from an endpoint of $f$).<br />

[Figure: an error in read f, with l false tuples in f and l false tuples in the reverse complement of f.]<br />
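The counts $2l$ and $2d$ in the observation follow from counting the windows of length l that cover the erroneous position. A short sketch, with 0-based positions and hypothetical function names:

```python
def tuples_covering(n, l, p):
    """Number of l-tuples of a read of length n that contain position p
    (0-based). The windows start at max(0, p-l+1), ..., min(p, n-l)."""
    first = max(0, p - l + 1)
    last = min(p, n - l)
    return max(0, last - first + 1)

def erroneous_tuples(n, l, p):
    """Upper bound on the number of l-tuples of f and of its reverse
    complement that a single error at position p can corrupt."""
    return 2 * tuples_covering(n, l, p)
```

For a read of length 400 and l = 20, an error in the interior touches 20 windows of the read, whereas an error at the fifth base (d = 5) touches only 5.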

This inspires the following simple heuristic.<br />

Greedy heuristic for ECR: Detect and perform any error correction in a read $f$ that reduces the number of l-tuples by $2l$ (or $2d$, for positions close to an end point).<br />

Experiments suggest that this simple procedure eliminates about 86.5% of all sequencing errors.<br />

5.9 Orphan elimination heuristic for ECR<br />

Two l-tuples are called neighbors, if their Hamming distance is 1, i.e., if they differ at precisely one position.<br />

The multiplicity of an l-tuple $s \in F_l$ is the number $m(s)$ of reads in $F$ that contain $s$.<br />

We call $s$ an orphan, if<br />

1. $s$ has small multiplicity, i.e. $m(s) \le M$, where $M$ is a given threshold,<br />

2. $s$ has precisely one neighbor, $t$, and<br />

3. $m(s) \le m(t)$.<br />

The position where an orphan differs from its neighbor is called an orphan position. A read is orphan free, if it contains no orphan positions.<br />
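The three orphan conditions translate directly into code. The sketch below is a simplification under stated assumptions: it ignores reverse complements, restricts neighbors to l-tuples that occur in the reads, and uses hypothetical function names.

```python
from collections import Counter

def multiplicities(reads, l):
    """m(s): number of reads that contain the l-tuple s."""
    m = Counter()
    for read in reads:
        m.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return m

def neighbors(s, m, alphabet="ACGT"):
    """All l-tuples of the spectrum at Hamming distance 1 from s."""
    found = []
    for i, c in enumerate(s):
        for a in alphabet:
            t = s[:i] + a + s[i + 1:]
            if a != c and t in m:
                found.append(t)
    return found

def orphans(reads, l, M):
    """Map each orphan s to (its only neighbor t, orphan position within s)."""
    m = multiplicities(reads, l)
    result = {}
    for s in m:
        nbrs = neighbors(s, m)
        if m[s] <= M and len(nbrs) == 1 and m[s] <= m[nbrs[0]]:
            t = nbrs[0]
            pos = next(i for i in range(l) if s[i] != t[i])
            result[s] = (t, pos)
    return result

# Hypothetical toy data: the third read has one error at its fourth base.
toy = orphans(["ACGTAC", "ACGTAC", "ACGAAC"], l=3, M=1)
```

All three orphans found in the toy data point to the same base of the third read, which is the position a correction step would change.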

These definitions are motivated by the following observation.<br />

Observation: If we choose l appropriately (i.e., not too small, not too large), then each erroneous l-tuple $s$ induced by a sequencing error in a read $f$ usually:<br />

• does not appear in any other read, and<br />

• differs at precisely one position from a correct l-tuple $t$ (obtained from a different read $f'$ that comes from the same area of the source sequence as $f$, but doesn't have a sequencing error at the same position).<br />

Hence, a sequencing error in a read usually creates $2l$ orphans.<br />

Orphans are created by random sequencing errors in reads: any l-tuple $s$ containing the error will usually be unique and will differ from one correct l-tuple $t$ only at the position of the error. This correct l-tuple, in turn, will be contained in a number of reads that do not have a sequencing error at the same position:<br />

[Figure: an orphan s in a read with a sequencing error, and its only neighbor t, which is contained in the unknown source sequence and in several other reads.]<br />

The main idea is to correct all errors at orphan positions in the sequencing reads, ensuring that the number of corrections made to any one read does not exceed $\Delta$, the specified maximum number of errors per read.<br />

Greedy orphan elimination approach: Perform error correction at any orphan position that reduces the number of l-tuples by $2l$ (or $2d$, for positions close to an end point). After correcting all such errors, repeatedly rerun the method using a $2l - \delta$ condition with increasing $\delta$.<br />

Experiments suggest that this method can eliminate up to ≈ 97% of all sequencing errors, for bacterial size sequencing projects.<br />

5.10 Error correction or data corruption?<br />

Any heuristic used for correcting reads will make mistakes. For example, if for a given position in an l-tuple, some reads indicate that the nucleotide is a C, others that it is a G, then correction may make the wrong choice.<br />

However, this is not a problem, as the goal of error correction is merely to remove inconsistencies from the input to help the downstream assembly algorithm.<br />

After running the assembly algorithm to obtain a layout based on the corrected reads, a consensus sequence is computed from this layout using the uncorrected reads. Thus, the bases in the final output are determined by a multi-alignment of the original reads.<br />

There are a number of additional issues that we will not discuss further.<br />

5.11 Eulerian path problem revisited (again)<br />

Given a set of reads $F$, define the de Bruijn graph $G = (V, E)$ with node set $V = F_{l-1}$ and connect any two nodes $s = s_1 s_2 \ldots s_{l-1}$ and $t = t_1 t_2 \ldots t_{l-1}$ by an edge $(s, t)$, iff $s_1 s_2 \ldots s_{l-1} t_{l-1} = s_1 t_1 t_2 \ldots t_{l-1} \in F_l$.<br />

Any reconstruction of the unknown source sequence corresponds to an Euler path through the graph, or two such paths, to be precise, as $F_l$ was defined to contain all l-tuples of every read $f$ and its reverse complement $\bar{f}$.<br />

With real data, the errors hide the correct path among many erroneous edges. For example, the graph corresponding to error free data for the NM project has 4,039,248 edges, whereas the graph corresponding to the real data has 9,474,411 edges (for l = 20). After error correction, the number is reduced to 4,081,857.<br />

5.12 Sources, sinks and branching nodes<br />

A node $v$ is called a source, if indegree(v) = 0, a sink, if outdegree(v) = 0 and a branching node, if indegree(v) · outdegree(v) > 1. For the NM project, the de Bruijn graph has 502,843 branching nodes, based on the original reads (l = 20).<br />
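The three node types can be read off the in- and out-degrees. A minimal sketch, assuming the graph is given as a list of directed `(v, w)` edge pairs. The function name and the example edges are hypothetical.

```python
from collections import Counter

def classify_nodes(edges):
    """Classify the nodes of a directed graph given as (v, w) edge pairs."""
    indeg, outdeg = Counter(), Counter()
    for v, w in edges:
        outdeg[v] += 1
        indeg[w] += 1
    nodes = set(indeg) | set(outdeg)
    sources = {v for v in nodes if indeg[v] == 0}
    sinks = {v for v in nodes if outdeg[v] == 0}
    branching = {v for v in nodes if indeg[v] * outdeg[v] > 1}
    return sources, sinks, branching

# Hypothetical toy graph: two paths that share the edge r -> s.
toy_edges = [("a", "r"), ("c", "r"), ("r", "s"), ("s", "b"), ("s", "d")]
sources, sinks, branching = classify_nodes(toy_edges)
```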

Error correction leads to a much simpler graph with 382 sources and sinks, and 12,175 branching nodes.<br />

Error-free reads lead to a graph with 11,173 branching nodes.<br />

Clearly, error correction greatly simplifies the graph $G$. However, $G$ is still very complicated, even in the error-free case. We need to take additional information into account, namely which l-tuples belong to the same reads, and in which order.<br />

5.13 Repeats and tangles<br />

A path $v_1, v_2, \ldots, v_n$ of nodes in the graph $G$ is called a repeat, if indegree($v_1$) > 1, outdegree($v_n$) > 1 and outdegree($v_i$) = 1 for $1 \le i < n$.<br />

Edges entering $v_1$ are entrances, while edges leaving $v_n$ are exits, of the repeat.<br />

[Figure: a repeat v1, v2, v3, ..., vn−1, vn with its entrances at v1 and its exits at vn.]<br />

An Eulerian path visits a repeat a few times and each such visit defines a pairing between an entrance and an exit.<br />

Such repeats can cause problems in assembly because it is not clear which entrance should be paired with which exit.<br />

5.14 Example<br />

[Figure with four panels: "Sequence and reads" (sequence A R B S C R' D S' E covered by reads 1 to 10), "Full graph", "Source, sink & branching nodes only", and "Read paths". Source, sink and branching nodes are marked.]<br />

5.15 Using read-paths to resolve repeats<br />

Each read $f = f_1 f_2 \ldots f_k \in F$ defines a read-path in the graph $G$, that consists of the path of edges $f_1 \ldots f_l$, $f_2 \ldots f_{l+1}$, $f_3 \ldots f_{l+2}$, ..., $f_{k-l+1} \ldots f_k$ that represent the l-tuples of $f$, in their order of occurrence.<br />
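A read-path can be listed by sliding a window of length l over the read. In this sketch each edge is written as a pair of (l−1)-tuples. The function name `read_path` is hypothetical.

```python
def read_path(f, l):
    """Edges of the read-path of f: one (prefix, suffix) pair of
    (l-1)-tuples per l-tuple of f, in order of occurrence."""
    return [(f[i:i + l - 1], f[i + 1:i + l]) for i in range(len(f) - l + 1)]
```

Consecutive edges of the returned list share a node, so the list is indeed a path in the de Bruijn graph.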

With this additional information, many short repeats that are spanned by read paths can be resolved, i.e. they have a unique pairing of entrances and exits that is compatible with the read paths:<br />

[Figure: a read path spanning the repeat v1, v2, v3, ..., vn−1, vn.]<br />

5.16 Tangles<br />

A tangle is a repeat that cannot be resolved using read paths:<br />

[Figure: repeat v1, v2, v3, ..., vn−1, vn with entrances in1, in2 and exits out1, out2. No read path spans the whole repeat.]<br />

Here it is unclear whether an Euler path through the repeat $v_1, \ldots, v_n$ should match the two pairs $\mathit{in}_1 \to \mathit{out}_1$ and $\mathit{in}_2 \to \mathit{out}_2$, or $\mathit{in}_1 \to \mathit{out}_2$ and $\mathit{in}_2 \to \mathit{out}_1$.<br />

5.17 The Eulerian Superpath Problem<br />

This leads to a generalization of the Eulerian Path problem:<br />

Eulerian Superpath Problem (ESP): Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.<br />

Note that the Eulerian Path problem is a special case of ESP with every path being a single edge.<br />
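Checking whether a candidate path solves the ESP is straightforward, even though finding one is the hard part. The sketch below assumes edges are `(tail, head)` pairs and paths are lists of such pairs. The function names are hypothetical.

```python
def contains_subpath(path, sub):
    """True if the edge sequence sub occurs contiguously in path."""
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))

def is_eulerian_superpath(path, edges, paths):
    """Check that path uses every edge of the graph exactly once, that
    consecutive edges share a node, and that it contains every given path."""
    uses_each_edge_once = sorted(path) == sorted(edges)
    connected = all(a[1] == b[0] for a, b in zip(path, path[1:]))
    return (uses_each_edge_once and connected
            and all(contains_subpath(path, p) for p in paths))

# Hypothetical toy graph in which node r is visited twice.
toy_edges = [("a", "r"), ("r", "b"), ("b", "r"), ("r", "c")]
toy_path = [("a", "r"), ("r", "b"), ("b", "r"), ("r", "c")]
```

Comparing sorted edge lists treats the graph as a multigraph, so parallel edges are handled as long as they appear equally often in both lists.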

Note that a practical assembler must always return a result, regardless of whether the graph possesses an Eulerian superpath or not.<br />

Hence, the real aim is to find an optimal path that is compatible with all (or as many as possible) of the given paths.<br />

Optimal could mean, for example, that the path uses as many edges as possible, but minimizes the number of edges that are used more than once.<br />

5.18 Solving the Eulerian Superpath Problem<br />

One strategy for solving ESP is to reduce the problem to an Eulerian Path problem, by applying a sequence of equivalent transformations to the initial graph $G$ and set of paths $P$:<br />

$(G, P) \to (G_1, P_1) \to \ldots \to (G_k, P_k)$,<br />

until we obtain a new graph $G_k$ and set of paths $P_k$, in which each path consists of a single edge.<br />

Such a transformation $(G_i, P_i) \to (G_j, P_j)$ is called equivalent, if there exists a one-to-one correspondence between the Eulerian superpaths in $(G_i, P_i)$ and $(G_j, P_j)$.<br />

5.19 Equivalent transformation: x, y-detachment<br />

We will discuss a simple equivalent transformation.<br />

For a graph $G$ and collection of paths $P$, let:<br />

• $x = (v_{in}, v_{mid})$ and $y = (v_{mid}, v_{out})$ be two consecutive edges in $G$,<br />

• $P_{x,y}$ be the set of all paths in $P$ that contain $x, y$ as a subpath,<br />

• $P_{\to x}$ be the set of all paths in $P$ that end with $x$, and<br />

• $P_{y \to}$ be the set of all paths in $P$ that start with $y$.<br />

Additionally, we require that any path passing through $x$ must either end at $v_{mid}$, or exit $v_{mid}$ via $y$.<br />

The x, y-detachment is a transformation that adds a new edge $z = (v_{in}, v_{out})$ and deletes the edges $x$ and $y$ from $G$:<br />

[Figure: before and after the x, y-detachment at the nodes Vin, Vmid, Vout. Edges x and y are replaced by edge z. Paths in $P_{x,y}$, $P_{\to x}$ and $P_{y \to}$ are modified to contain z.]<br />

The x, y-detachment transformation alters the system of paths $P$ as follows:<br />

1. in all paths in $P_{x,y}$, replace $x, y$ by $z$,<br />

2. in all paths in $P_{\to x}$, replace $x$ by $z$, and<br />

3. in all paths in $P_{y \to}$, replace $y$ by $z$.<br />

Above, we required that all paths through $v_{mid}$ are contained in one of these three sets. In each of the three cases it is clear that any Eulerian superpath contained in the original graph will also be one in the derived graph, and vice versa. Hence, x, y-detachment under these conditions is an equivalent transformation.<br />
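The three rewriting rules can be sketched as follows. This is an illustration only: it assumes named edges stored in a dictionary, paths given as lists of edge names, and that every path touching x or y falls into one of the three sets, as required above. The function name `xy_detachment` is hypothetical.

```python
def xy_detachment(edges, paths, x, y, z):
    """Apply an x,y-detachment: replace the consecutive edges x and y by a
    new edge z and rewrite the paths accordingly.

    edges: dict mapping edge name -> (tail, head); paths: lists of edge names.
    Assumes every path through x either ends with x or continues with y.
    """
    (v_in, v_mid), (v_mid2, v_out) = edges[x], edges[y]
    assert v_mid == v_mid2, "x and y must be consecutive"
    new_edges = {e: vw for e, vw in edges.items() if e not in (x, y)}
    new_edges[z] = (v_in, v_out)

    new_paths = []
    for p in paths:
        q, i = [], 0
        while i < len(p):
            if p[i] == x and i + 1 < len(p) and p[i + 1] == y:
                q.append(z)      # rule 1: x, y is replaced by z
                i += 2
            elif (p[i] == x and i == len(p) - 1) or (p[i] == y and i == 0):
                q.append(z)      # rules 2 and 3: path ends with x or starts with y
                i += 1
            else:
                q.append(p[i])
                i += 1
        new_paths.append(q)
    return new_edges, new_paths

# Hypothetical toy instance with one path from each of the three sets.
toy_edges = {"a": ("u", "v_in"), "x": ("v_in", "v_mid"),
             "y": ("v_mid", "v_out"), "b": ("v_out", "w")}
toy_paths = [["a", "x", "y", "b"], ["a", "x"], ["y", "b"]]
new_edges, new_paths = xy_detachment(toy_edges, toy_paths, "x", "y", "z")
```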

5.20 More general x, y-detachment<br />

What can happen if we drop the additional requirement stated above, i.e. if we have a path in $P_{x,y_2}$ that enters $v_{mid}$ via $x$ and then exits via an edge $y_2 \ne y$?<br />

[Figure: nodes Vin, Vmid, Vout and Vout2 with edges x, y and y2, and the path sets $P_{\to x}$, $P_{x,y}$, $P_{x,y_2}$ and $P_{y \to}$.]<br />

If (the paths in) $P_{\to x}$ and $P_{x,y}$ enter $v_{in}$ along the same edge, and if $P_{x,y_2}$ enters via a different edge, then we replace $x$ by $z$ in $P_{\to x}$. Similarly, if $P_{\to x}$ and $P_{x,y_2}$ enter via the same edge, different from the one used by $P_{x,y}$, then we keep $x$. In the two other possible cases, it is unclear how to update $P_{\to x}$ and we call the edge $x$ unresolvable:<br />

[Figure: after detachment, with edges z, x and y2 at the nodes Vin, Vmid, Vout and Vout2. It is unclear (???) whether the paths in $P_{\to x}$ should follow x or z.]<br />

5.21 Example of x, y-detachment<br />

Multiple application of x, y-detachment to resolve a repeat:<br />

[Figure with four panels showing a repeat with edges x1, x2, y1, y2, y3, y4 and the new edge z. Panel captions: "Edge x2 unresolvable.", "Obtained by y4, x1-detachment.", "Obtained by x2, y1-detachment.", "Obtained by z, x2-detachment."]<br />

5.22 Equivalent transformation: x-cut<br />

We call an edge x = (v, w) removable, if<br />

1. it is the only outedge for $v$ and the only inedge for $w$, and<br />

2. $x$ is either the first or last edge in every path $p \in P$ that contains $x$.<br />

An x-cut (equivalently) transforms $P$ into a new system $P'$ by simply removing $x$ from all paths in $P_{\to x}$ and $P_{x \to}$:<br />

[Figure: edge x between the edges y3, y4 and y1, y2, before and after the x-cut.]<br />

5.23 Summary<br />

Given a set of reads $F$ from a sequencing project. In the Eulerian Path approach:<br />

1. Every read $f \in F$ is shredded into $|f| - l + 1$ consecutive l-tuples.<br />

2. The de Bruijn graph is constructed with vertices representing (l − 1)-tuples and edges representing l-tuples.<br />

3. The graph $G$ is simplified to consist only of source, sink and branching nodes.<br />

4. The set $P$ of all read paths in $G$ is generated.<br />

5. Equivalent transformations are applied to reduce the Eulerian Superpath problem to an Eulerian Path problem.<br />

6. An optimal path is used to generate a final fragment assembly.<br />

5.24 Comparison with other methods



5.25 Mate pairs<br />

Given a collection of reads $F$ and corresponding mate pair information $M$. Consider the de Bruijn graph $G$ constructed from $F$. Let $m = (f_1, f_2)$ be a mate pair with mean length $\mu$ and standard deviation $\sigma$. We represent $m$ by a mate-pair path from $f_1$ to $f_2$, if the distance $d(f_1, f_2)$ from $f_1$ to $f_2$ in $G$ lies within $\mu \pm 3\sigma$, say, along a unique path from $f_1$ to $f_2$:<br />

[Figure: mate-pair path m from f1 to f2.]<br />

5.26 Using mate pairs to resolve repeats<br />

Consider the following situation, with repeats $(R, R')$ and $(S, S')$, and mate pairs $m_1$ and $m_2$:<br />

[Figure: sequence A R B S C D R' E S' F with the mate pairs m1 and m2.]<br />

The corresponding graph is this:<br />

[Figure: graph with edges A, B, C, D, E, F, the collapsed repeats R=R' and S=S', and the mate-pair paths m1 and m2.]<br />

We can use the two mate pairs to resolve the repeats, first using $m_2$ to separate $S$ and $S'$:<br />

[Figure: the graph after separating S and S' using m2. R=R' is still collapsed.]<br />

and then using $m_1$ to separate $R$ and $R'$:<br />

[Figure: the fully resolved path A R B S C D R' E S' F.]<br />

5.27 Mate pairs for ordering and orienting contigs<br />

Finally, if a number of different mate pairs consistently link two different components of the de Bruijn graph $G$, then they define a relative ordering and orientation of the two corresponding contigs.<br />

6 Assembly Validation and Comparison<br />

Given a set of reads $F$ and mate pairs $M$ obtained from an unknown sequence $A$ using shotgun sequencing.<br />

An assembly of $A$ from $F$ and $M$ is a reconstruction of $A$ that is given by a set of scaffolds $S = \{s_1, s_2, \ldots, s_p\}$ and a set of contigs $C = \{c_1, c_2, \ldots, c_q\}$.<br />

Any such scaffold $s_i$ represents a relative ordering and orientation of contigs and is given by an ordered list of contigs $c_{i_1}, c_{i_2}, \ldots, c_{i_k}$, together with an orientation $o_{i_j} \in \{-1, +1\}$ for each contig, and an estimation of the gap between any pair of consecutive contigs $c_{i_j}$ and $c_{i_{j+1}}$.<br />

Any such contig $c_i$ represents a contiguous piece of sequence of length $l(c_i)$ and is given by a consensus sequence $C_i$, a list of reads $f_{i_1}, \ldots, f_{i_h}$, and two mappings $b(f_{i_j})$ and $e(f_{i_j})$ that map the start and end of each read $f_{i_j}$ to their position in $c_i$ (i.e. in $[1, l(c_i)]$), respectively.<br />

6.1 The read coverage of a contig (or scaffold)<br />

Consider a contig $c$ of length $l(c)$. For each position $p \in [1, l(c)]$, let $m(p)$ denote the coverage of $p$, defined as the number of reads in $c$ that contain $p$. A coverage plot can be used to identify areas of unusually high coverage, which may correspond to over-collapsed repeats:<br />

[Figure: coverage plot. Vertical axis: read coverage (0 to 20). Horizontal axis: position in the scaffold, 3.4 to 4.3 Mb.]<br />

To compute the coverage plot, first, for every read $f$ in $c$, place the begin position $b(f)$ and end position $e(f)$ into a sorted sequence $L$. Then, in order of $L$, for each begin position, report the number of reads that span the position, given by the number of begins minus the number of ends seen so far.<br />

Note that we can similarly def<strong>in</strong>e a read coverage for scaffolds.<br />
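The sweep over sorted begin and end positions can be sketched as follows. This is only an illustration: the function name `coverage_profile` and the representation of a read as a pair `(b, e)` are assumptions, and the sketch reports the coverage at every position where it changes rather than only at begin positions.<br />

```python
def coverage_profile(reads):
    """Return [(position, coverage)] at every position where the coverage changes.

    `reads` is a list of (b, e) pairs with 1-based inclusive positions
    (hypothetical representation; reverse reads may have e < b).
    """
    events = []
    for b, e in reads:
        lo, hi = min(b, e), max(b, e)
        events.append((lo, +1))      # a read begins to cover position lo
        events.append((hi + 1, -1))  # the read no longer covers position hi + 1
    events.sort()

    profile = []
    open_reads = 0                   # begins minus ends seen so far
    for pos, delta in events:
        open_reads += delta
        if profile and profile[-1][0] == pos:
            profile[-1] = (pos, open_reads)  # merge events at the same position
        else:
            profile.append((pos, open_reads))
    return profile


if __name__ == "__main__":
    print(coverage_profile([(1, 10), (5, 15), (12, 8)]))
```

Sorting dominates the running time, so the profile is obtained in O(n log n) for n reads.<br />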


6.2 Clone coverage<br />

Consider a contig $c$ consisting of reads $f_1, f_2, \dots, f_k$. Let $f_i$ and $f_j$ be two mated reads, with approximate clone length $\mu$ and standard deviation $\sigma$.<br />

Recall that we call a mate pair $m = (f_i, f_j)$ happy (w.r.t. $c$), if<br />

1. the reads are oriented toward each other, i.e. either $b(f_i) < e(f_i)$ and $e(f_j) < b(f_j)$, or $b(f_j) < e(f_j)$ and $e(f_i) < b(f_i)$, and<br />

2. the distance between $f_i$ and $f_j$ is approximately correct, i.e. $|\mu - |b(f_i) - b(f_j)|| \le 3\sigma$.<br />

Otherwise, $m$ is unhappy, and is called mis-oriented, if (1) is violated, and mis-separated, if (1) holds, but (2) does not.<br />
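A minimal sketch of this classification is given below. The function name and argument layout are hypothetical. As an additional assumption not spelled out above, the forward read is required to start to the left of the reverse read, so that the two reads really point toward each other.<br />

```python
def classify_mate_pair(bi, ei, bj, ej, mu, sigma):
    """Classify a mate pair as 'happy', 'mis-oriented' or 'mis-separated'.

    bi, ei: begin and end of read f_i in the contig; bj, ej: same for f_j.
    mu, sigma: mean and standard deviation of the clone length.
    """
    forward_i = bi < ei
    forward_j = bj < ej
    # Condition (1): one read is forward, the other reverse ...
    if forward_i == forward_j:
        return "mis-oriented"
    # ... and (assumption) the forward read starts left of the reverse read.
    forward_begin, reverse_begin = (bi, bj) if forward_i else (bj, bi)
    if forward_begin > reverse_begin:
        return "mis-oriented"
    # Condition (2): the separation is within 3 sigma of the clone length.
    if abs(mu - abs(bi - bj)) > 3 * sigma:
        return "mis-separated"
    return "happy"


if __name__ == "__main__":
    print(classify_mate_pair(100, 600, 2100, 1600, mu=2000, sigma=100))
```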

To compute the clone coverage plot, for every mate pair of reads $f_i$ and $f_j$ in $c$ (either contig or scaffold), with $b(f_i) < b(f_j)$, place the begin position $b(f_i)$ and end position $b(f_j)$ into a sorted sequence $L$. Then, in order of $L$, for each begin position, report the number of happy, mis-oriented and mis-separated clones that span the position, in each case given by the number of begins minus the number of ends seen so far.<br />

However, this definition is not useful for large contigs or scaffolds, because unhappy mates that are far apart from each other in an assembly will increase the clone coverage over the whole distance between them in a way that does not reflect local mis-assembly problems.<br />

Hence, in practice, to obtain a localized clone coverage plot, one uses $b(f_i) + (\mu + 3\sigma)$ as end position, if $b(f_i) < e(f_i)$, and $b(f_i) - (\mu + 3\sigma)$, if $e(f_i) < b(f_i)$.<br />

The happy, mis-separated and mis-oriented coverage is shown in green, yellow and red, respectively.<br />

[Figure: clone coverage plot for a good assembly.]<br />

[Figure: clone coverage plot for a poor assembly.]<br />

6.3 Clone middle plot<br />

A useful tool for visualizing the quality of a contig, based on clone data, is to simply draw each mate pair $m = (f_i, f_j)$ as a line whose x-coordinates are the start and end positions of the mate pair and whose y-coordinate is chosen at random. Different colors are used to distinguish between happy, mis-oriented and mis-separated mates. Additionally, it makes sense to separate the clones by approximate length:<br />

The previous plot shows a good assembly, this a poor one:<br />

The same data, but using “localized” coordinates:<br />

6.4 Breakpoint Detection<br />

Based on the clone coverage, we would like to locate breakpoints in a given contig (or scaffold) $c$. Loosely speaking, a breakpoint is a position $p$ in the contig $c$ at which the sequences of the contig immediately to the left and to the right of $p$ come from different regions of the unknown source sequence.<br />

(In consequence, to obtain a more correct assembly, one must cut all contigs (and scaffolds) at their breakpoints and then rearrange the pieces.)<br />

At a breakpoint, we expect the happy clone coverage to be very low and the mis-oriented clone coverage to be high.<br />

Breakpoint heuristic: Consider all clone start and end positions in order. If the number of currently open happy clones drops below the number of currently open mis-separated ones, then the beginning of a region containing a breakpoint has been detected, whereas in the opposite case, we have detected the end of such a region.<br />
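The heuristic can be sketched as a single sweep over clone events. The input format (a list of `(start, end, kind)` triples) and the function name are assumptions made for this illustration, and clones of any other kind are simply ignored here.<br />

```python
def breakpoint_regions(clones):
    """Return (begin, end) regions in which open happy clones are outnumbered.

    `clones` is a list of (start, end, kind) triples with start <= end and
    kind in {"happy", "mis-separated", ...} (hypothetical format).
    """
    events = []
    for start, end, kind in clones:
        events.append((start, +1, kind))
        events.append((end, -1, kind))
    # Process positions in order; at equal positions, close before opening.
    events.sort(key=lambda ev: (ev[0], ev[1]))

    open_count = {"happy": 0, "mis-separated": 0}
    regions = []
    region_start = None
    for pos, delta, kind in events:
        if kind in open_count:
            open_count[kind] += delta
        suspicious = open_count["happy"] < open_count["mis-separated"]
        if suspicious and region_start is None:
            region_start = pos                   # begin of a breakpoint region
        elif not suspicious and region_start is not None:
            regions.append((region_start, pos))  # end of the region
            region_start = None
    if region_start is not None:                 # region still open at the end
        regions.append((region_start, events[-1][0]))
    return regions


if __name__ == "__main__":
    clones = [(0, 100, "happy"), (50, 150, "happy"),
              (120, 220, "mis-separated"), (140, 240, "mis-separated"),
              (200, 300, "happy"), (230, 330, "happy")]
    print(breakpoint_regions(clones))
```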

Two different assemblies of human chromosome 19 produced by the Human Genome Project, $H_1$ produced in September 2000, and $H_2$ dating from January 2001, contain 723 and 488 detected breakpoints, respectively (shown as blue ticks):<br />



7 Gene Prediction<br />

This exposition is based on the following sources, which are all recommended reading (in this order):<br />

1. Pavel A. Pevzner. Computational Molecular Biology, an algorithmic approach. MIT, 2000, chapter 9.<br />

2. Chris Burge and Samuel Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78-94 (1997).<br />

3. Ian Korf, Paul Flicek, Danial Duan and Michael R. Brent. Integrating Genomic Homology into Gene Structure Prediction. Bioinformatics, Vol. 1, Suppl. 1, pages S1-S9 (2001).<br />

4. Vineet Bafna and Daniel Huson. The conserved exon method for gene finding. ISMB 2000, 3-12 (2000).<br />

5. M. S. Gelfand, A. Mironov and P. A. Pevzner. Gene recognition via spliced alignment. PNAS, 93:9061–9066 (1996).<br />

7.1 Introduction<br />

In the 1960s, it was discovered that a gene and its protein product are colinear structures with a direct correlation between the triplets of nucleotides in the gene and amino acids in the protein.<br />

It soon became clear that genes can be difficult to determine, due to the existence of overlapping genes, genes within genes, etc.<br />

Moreover, the paradox arose that the genome size of many eukaryotes does not correspond to genetic complexity. For example, the salamander genome is 10 times the size of that of human.<br />

In 1977, the amazing discovery of “split” genes was made: genes that consist of multiple pieces called exons, separated by stretches of “junk DNA” called introns.<br />

[Figure: in a prokaryote, DNA is transcribed into mRNA, which is translated into protein. In a eukaryote, DNA is transcribed into RNA in the nucleus, splicing turns the RNA into mRNA, and the mRNA is translated into protein.]<br />

The existence of split genes and junk DNA raises a computational gene prediction problem that is still unsolved:<br />

Given a string of DNA, the gene prediction problem is to reliably predict all genes contained in the sequence.<br />

7.2 Three types of approaches<br />

One can distinguish between three types of approaches:<br />

• Statistical or ab initio methods. These methods attempt to predict genes based on statistical properties of the given DNA sequence. Programs are e.g. Genscan, GeneID, GENIE and FGENEH.<br />

• Homology methods. The given DNA sequence is compared with known protein structures, e.g. using “spliced alignments”. Programs are e.g. Procrustes and GeneWise.<br />

• Comparative methods. The given DNA string is compared with a similar DNA string from a different species at the appropriate evolutionary distance and genes are predicted in both sequences based on the assumption that exons will be well conserved, whereas introns will not. Programs are e.g. CEM (conserved exon method) and Twinscan.<br />

7.3 Simplest approach to gene prediction<br />

The simplest way to detect potential coding regions is to look at Open Reading Frames (ORFs). An ORF is a sequence of codons in DNA that starts with a Start codon (ATG), ends with a Stop codon (TAA, TAG or TGA) and has no other (in-frame) stop codons inside.<br />

Since 3 of the 64 possible codons are stop codons, the average distance between stop codons in “random” DNA is $64/3 \approx 21$ codons, much smaller than the number of codons in an average protein ($\approx 300$).<br />

Therefore, long ORFs indicate genes, although this criterion fails to detect short genes or genes with short exons.<br />
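A minimal sketch of an ORF scan is shown below. It is an illustration only: the function name `find_orfs` and the `min_codons` parameter are assumptions, and only the three forward-strand frames are examined (the reverse strand would be handled by scanning the reverse complement).<br />

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return sorted half-open intervals (start, end) of ORFs on the forward strand.

    An ORF starts at the first ATG after the previous in-frame stop and ends
    with the next in-frame stop codon (included in the interval).
    `min_codons` is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAATGGTAAGG"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```

In practice one would choose `min_codons` well above the expected 21-codon spacing of stop codons in random DNA.<br />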

Additionally, features such as codon usage or hexamer counts can be taken into account. The codon usage of a string of DNA is given by a 64-component vector that counts how many times each codon is present in the string. These values can differ significantly between coding and non-coding DNA.<br />
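The codon usage vector can be computed in a few lines. In this sketch the codons are ordered lexicographically over the alphabet A, C, G, T and counted in a single reading frame. Both choices are assumptions made for the example.<br />

```python
from collections import Counter
from itertools import product

# All 64 codons in lexicographic order (AAA, AAC, ..., TTT).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_usage(dna, frame=0):
    """Return the 64-component codon count vector of `dna` in the given frame."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return [counts[codon] for codon in CODONS]


if __name__ == "__main__":
    usage = codon_usage("ATGATGTAA")
    print({c: n for c, n in zip(CODONS, usage) if n > 0})
```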

7.4 Eukaryotic gene structure<br />

For our purposes, a eukaryotic gene has the following structure:<br />

[Figure: eukaryotic gene structure, from left to right: promoter (TATA), 5’ UTR, start site (ATG), initial exon, donor site (GT), intron, acceptor site, internal exon(s), intron, terminal exon, stop site (TAA, TAG or TGA), 3’ UTR, poly-A signal.]<br />

Ab initio gene prediction methods use statistical properties of the different components of such a gene model to attempt to predict genes in unannotated DNA. For example, for the bases around the start site we may have the following observed frequencies (given by this position weight matrix):<br />

Pos. -8 -7 -6 -5 -4 -3 -2 -1 +1 +2 +3 +4 +5 +6 +7<br />

A .16 .29 .20 .25 .22 .66 .27 .15 1 0 0 .28 .24 .11 .26<br />

C .48 .31 .21 .33 .56 .05 .50 .58 0 0 0 .16 .29 .24 .40<br />

G .18 .16 .46 .21 .17 .27 .12 .22 0 0 1 .48 .20 .45 .21<br />

T .19 .24 .14 .21 .06 .02 .11 .05 0 1 0 .09 .26 .21 .21



7.5 GENSCAN’s model<br />

We are going to discuss the popular program Genscan in detail, which is based on a semi-Markov model:<br />

[Figure: state diagram of the Genscan model. Forward (+) strand states: internal exons E0+, E1+, E2+; introns I0+, I1+, I2+; Einit+ (initial exon); Eterm+ (terminal exon); Esngl+ (single-exon gene); F+ (5’ UTR); T+ (3’ UTR); P+ (promoter); A+ (poly-A signal). The reverse (−) strand has the corresponding states E0−, E1−, E2−, I0−, I1−, I2−, Einit−, Eterm−, Esngl−, F−, T−, P− and A−. The state N (intergenic region) connects the two strands.]<br />

Genscan’s model can be formulated as an explicit state duration HMM. (This is an HMM in which, additionally, a duration period is explicitly modeled for each state, using a probability distribution.) The model is thought of as generating a parse $\phi$, consisting of:<br />

• an ordered set of states $q = \{q_1, q_2, \dots, q_n\}$, and<br />

• an associated set of durations $d = \{d_1, d_2, \dots, d_n\}$,<br />

which, using probabilistic models for each of the state types, generates a DNA sequence $S$ of length $L = \sum_{i=1}^{n} d_i$.<br />

The generation of a parse corresponding to a (pre-defined) sequence length $L$ is as follows:<br />

1. An initial state $q_1$ is chosen according to an initial distribution $\pi$ on the states, i.e. $\pi_i = P(q_1 = Q^{(i)})$, where $Q^{(j)}$ ($j = 1, \dots, 27$) is an indexing of the states of the model.<br />

2. A length (state duration), $d_1$, corresponding to the state $q_1$ is generated conditional on the value of $q_1 = Q^{(i)}$ from the length distribution $f_{Q^{(i)}}$.<br />

3. A sequence segment $s_1$ of length $d_1$ is generated, conditional on $d_1$ and $q_1$, according to an appropriate sequence generating model for state type $q_1$.<br />

4. The subsequent state $q_2$ is generated, conditional on the value of $q_1$, from the (first-order Markov) state transition matrix $T$, i.e. $T_{i,j} = P(q_{k+1} = Q^{(j)} \mid q_k = Q^{(i)})$.<br />

This process is repeated until the sum $\sum_{i=1}^{n} d_i$ of the state durations first equals or exceeds $L$, at which point the last state duration is appropriately truncated, the final stretch of sequence is generated and the process stops.<br />

The resulting sequence is simply the concatenation of the sequence segments, $S = s_1 s_2 \dots s_n$.<br />

Note that the generated sequence is not restricted to correspond to a single gene, but could represent multiple genes, in both strands, or none.<br />
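The four generation steps can be sketched with a toy model. Everything below is hypothetical and chosen only to make the example runnable: a two-state model (N and E) instead of Genscan's 27 states, uniform duration distributions, and uniformly random base emission.<br />

```python
import random


def generate_parse(L, pi, T, sample_duration, emit, rng):
    """Generate (states, durations, sequence) of total length L (L >= 1).

    pi: dict state -> initial probability; T: dict state -> dict of
    successor probabilities; sample_duration(state, rng) must return an
    integer >= 1; emit(state, d, rng) returns a string of length d.
    """
    states, durations, segments = [], [], []
    total = 0
    names = list(pi)
    state = rng.choices(names, weights=[pi[s] for s in names])[0]  # step 1
    while True:
        d = sample_duration(state, rng)                            # step 2
        d = min(d, L - total)            # truncate the last state duration
        states.append(state)
        durations.append(d)
        segments.append(emit(state, d, rng))                       # step 3
        total += d
        if total >= L:
            break
        successors = list(T[state])                                # step 4
        weights = [T[state][s] for s in successors]
        state = rng.choices(successors, weights=weights)[0]
    return states, durations, "".join(segments)


# Toy model (hypothetical values, not Genscan's parameters).
PI = {"N": 0.9, "E": 0.1}
TRANSITIONS = {"N": {"E": 1.0}, "E": {"N": 1.0}}


def toy_duration(state, rng):
    return rng.randint(1, 20) if state == "N" else 3 * rng.randint(1, 5)


def toy_emit(state, d, rng):
    return "".join(rng.choice("ACGT") for _ in range(d))


if __name__ == "__main__":
    parse = generate_parse(50, PI, TRANSITIONS, toy_duration, toy_emit,
                           random.Random(0))
    print(parse)
```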

In addition to its topology involving the 27 states and 46 transitions depicted above, the model has four main components:<br />

• a vector of initial probabilities $\pi$,<br />

• a matrix of state transition probabilities $T$,<br />

• a set of length distributions $f$, and<br />

• a set of sequence generating models $P$.<br />

(Recall that an HMM has initial, transition and emission probabilities.)<br />

7.6 Likelihood prediction<br />

Suppose we are given such a model $M$. For a fixed sequence length $L$, consider<br />

$$\Omega = \Phi_L \times \mathcal{S}_L,$$<br />

where $\Phi_L$ is the set of all possible parses of $M$ of length $L$, and $\mathcal{S}_L$ is the set of all possible sequences of length $L$.<br />

The model $M$ assigns a probability density to each point (parse/sequence pair) in $\Omega$. Thus, for a given sequence $S \in \mathcal{S}_L$, the conditional probability of a particular parse $\phi \in \Phi_L$ is given by:<br />

$$P(\phi \mid S) = \frac{P(\phi, S)}{P(S)} = \frac{P(\phi, S)}{\sum_{\phi' \in \Phi_L} P(\phi', S)},$$<br />

using Bayes’ rule.<br />

The essential idea is to specify a precise probabilistic model of what a gene looks like in advance and then to select the parse $\phi$ through the model $M$ that has highest likelihood, given the sequence $S$.<br />

7.7 Computational issues<br />

Given a sequence $S$ of length $L$, the joint probability, $P(\phi, S)$, of generating the parse $\phi$ and the sequence $S$ is given by:<br />

$$P(\phi, S) = \pi_{q_1} f_{q_1}(d_1) P(s_1 \mid q_1, d_1) \times \prod_{k=2}^{n} T_{q_{k-1}, q_k} f_{q_k}(d_k) P(s_k \mid q_k, d_k),$$<br />

where the states of $\phi$ are $q_1, q_2, \dots, q_n$ with associated state lengths $d_1, d_2, \dots, d_n$, which break the sequence into segments $s_1, s_2, \dots, s_n$.<br />

Here, $P(s_k \mid q_k, d_k)$ is the probability of generating the segment $s_k$ under the appropriate sequence generating model for a type-$q_k$ state of length $d_k$.<br />
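The product above is usually evaluated in log space to avoid numerical underflow. The following sketch does this for a given parse. The function signature and the toy parameters at the bottom are assumptions for illustration and are not taken from Genscan.<br />

```python
import math


def log_joint(states, durations, seq, log_pi, log_T, log_f, log_emit):
    """Return log P(phi, S) for the parse phi = (states, durations).

    log_pi[q]: log initial probability; log_T[p][q]: log transition
    probability; log_f(q, d): log length probability; log_emit(q, s):
    log probability of segment s under the model of state q.
    """
    assert sum(durations) == len(seq), "durations must partition the sequence"
    total, pos, prev = 0.0, 0, None
    for q, d in zip(states, durations):
        segment = seq[pos:pos + d]
        # First factor: pi for k = 1, transition probability for k >= 2.
        total += log_pi[q] if prev is None else log_T[prev][q]
        total += log_f(q, d) + log_emit(q, segment)
        prev, pos = q, pos + d
    return total


# Toy parameters (hypothetical): two states, uniform lengths on 1..10,
# and independent uniformly distributed bases.
LOG_PI = {"N": math.log(0.5), "E": math.log(0.5)}
LOG_T = {"N": {"E": 0.0}, "E": {"N": 0.0}}


def toy_log_f(q, d):
    return math.log(0.1)


def toy_log_emit(q, segment):
    return len(segment) * math.log(0.25)


if __name__ == "__main__":
    print(log_joint(["N", "E"], [2, 3], "ACGTA",
                    LOG_PI, LOG_T, toy_log_f, toy_log_emit))
```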

A modification of the Viterbi algorithm may be used to calculate $\phi_{opt}$, the parse with maximal joint probability (under $M$), that gives the predicted gene or set of genes in the sequence.<br />

We can compute $P(S)$ using the “forward algorithm” discussed under HMMs. With the help of the “backward algorithm”, certain additional quantities of interest can also be computed.<br />

For example, consider the event $E^{(k)}_{[x,y]}$ that a particular sequence segment $[x, y]$ is an internal exon of phase $k \in \{0, 1, 2\}$. Under $M$, this event has probability<br />

$$P(E^{(k)}_{[x,y]} \mid S) = \frac{\sum_{\phi : E^{(k)}_{[x,y]} \in \phi} P(\phi, S)}{P(S)},$$<br />

where the sum is taken over all parses that contain the given exon $E^{(k)}_{[x,y]}$. This sum can be computed using the forward-backward algorithm.<br />

7.8 Details of the model<br />

So far, we have discussed the topology and the other main components of the Genscan model in general terms. The following details need to be discussed:<br />

• the initial and transition probabilities,<br />

• the state length distributions,<br />

• transcriptional and translational signals,<br />

• splice signals, and<br />

• reverse-strand states.<br />

7.9 Initial and transition probabilities<br />

For gene prediction in randomly chosen blocks of contiguous human DNA, the initial probability of each state should be chosen proportionally to its estimated frequency in bulk human genomic DNA.<br />

This is a non-trivial problem, because gene density and certain aspects of gene structure vary significantly in regions of differing C+G% content (so-called “isochores”) of the human genome, with a much higher gene density in C+G-rich regions.<br />

Hence, in practice, initial and transition probabilities are estimated for four different categories: (I) < 43% C+G, (II) 43 − 51% C+G, (III) 51 − 57% C+G, and (IV) > 57% C+G.<br />
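Assigning a sequence to one of the four categories is a simple computation, sketched below. How the exact boundary values (43%, 51%, 57%) are assigned is not specified above, so the half-open intervals used here are an assumption, as is the function name.<br />

```python
def cg_group(dna):
    """Return the C+G category ("I", "II", "III" or "IV") of a non-empty sequence."""
    dna = dna.upper()
    cg_percent = 100.0 * (dna.count("C") + dna.count("G")) / len(dna)
    if cg_percent < 43:
        return "I"
    if cg_percent < 51:   # assumption: 43% itself falls into group II
        return "II"
    if cg_percent < 57:   # assumption: 51% itself falls into group III
        return "III"
    return "IV"


if __name__ == "__main__":
    print(cg_group("ACGT"))
```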

The following initial probabilities were obtained from a learning set of 380 genes, by comparing the number of bases corresponding to each of the different states:<br />

Group I II III IV<br />

C+G range < 43% 43 − 51% 51 − 57% > 57%<br />

Initial probabilities:<br />

Intergenic ($N$) 0.892 0.867 0.540 0.418<br />

Intron ($I_i^+$, $I_i^-$) 0.095 0.103 0.338 0.388<br />

5’ UTR ($F^+$, $F^-$) 0.008 0.018 0.077 0.122<br />

3’ UTR ($T^+$, $T^-$) 0.005 0.011 0.045 0.072<br />

For simplicity, the initial probabilities for the exon, promoter and poly-A states were set to 0.<br />

Transition probabilities are obtained in a similar way.



7.10 State length distributions<br />

In general, the states of the model correspond to sequence segments of highly variable length.<br />

For certain states, most notably for internal exon states $E_k$, length is probably important for proper biological function, i.e. proper splicing and inclusion in the final processed mRNA.<br />

For example, it has been shown in vivo that internal deletions of exons to sizes below about 50 bp may often lead to exon skipping, and there is evidence that steric interference between factors recognizing splice sites may make splicing of small exons more difficult. There is also evidence that spliceosomal assembly is inhibited if internal exons are expanded beyond 300 bp.<br />

In summary, these arguments support the observation that internal exons are usually ≈ 120 − 150 bp long, with only a few of length less than 50 bp or more than 300 bp.<br />

Constraints for initial and terminal exons are slightly different.<br />

The duration in initial, internal and terminal exon states is modeled by a different empirical distribution for each of the types of states.<br />

In contrast to exons, the length of introns does not seem critical, although a minimum length of 70 − 80 bp may be preferred.<br />

The length distribution for introns appears to be approximately geometric (exponential). However, the average length of introns differs substantially between the different C+G groups: in group I, the average length is 2069 bp, whereas for group IV, the average length is only 518 bp.<br />

Hence, the duration in intron states is modeled by a geometric distribution with parameter $q$ estimated for each C+G group separately.<br />
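As a small illustration, the parameter $q$ can be estimated from the mean intron length. The sketch assumes the parametrization $P(d) = q(1-q)^{d-1}$ for $d \ge 1$, whose mean is $1/q$. The notes do not state which parametrization Genscan uses, so this is an assumption, as are the function names.<br />

```python
def geometric_q(mean_length):
    """Moment estimate of q, assuming P(d) = q * (1 - q)**(d - 1) with mean 1/q."""
    return 1.0 / mean_length


def geometric_pmf(d, q):
    """Probability of a state duration of exactly d >= 1 bases."""
    return q * (1.0 - q) ** (d - 1)


if __name__ == "__main__":
    for group, mean in (("I", 2069), ("IV", 518)):
        q = geometric_q(mean)
        print(group, q, geometric_pmf(100, q))
```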

Empirical length distributions for introns and exons:<br />

[Figure: four histograms of counts against length (bp): number of introns (lengths from 0 to 8k bp), and number of initial exons, internal exons and terminal exons (length axes marked at 0, 200 and 400 bp).]<br />

Note that the exon lengths generated must be consistent with the phases of adjacent introns. To account for this, first the number of complete codons is generated from the appropriate length distribution, then the appropriate number (0, 1 or 2) of bp is added to each end to account for the phases of the preceding and subsequent states.<br />

For example, if the number of complete codons generated for an internal exon is $C = 6$, and the phases of the previous and next intron are 1 and 2, respectively, then the total length of the exon is $l = 3C + 2 + 2 = 22$:<br />

[phase 1 intron] TA TGT GTT ACT CGC GCT CGC TT [phase 2 intron]<br />
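The length computation in this example can be written out as follows. The sketch assumes that an intron of phase $k$ interrupts a codon after its $k$-th base, which is consistent with the example above. The function name is hypothetical.<br />

```python
def exon_length(complete_codons, prev_intron_phase, next_intron_phase):
    """Total length (bp) of an internal exon between two introns.

    Assumption: an intron of phase k splits a codon after its k-th base,
    so the exon starts with (3 - k) % 3 leftover bases of the split codon
    and ends with the first k bases of the next split codon.
    """
    leading = (3 - prev_intron_phase) % 3
    trailing = next_intron_phase
    return 3 * complete_codons + leading + trailing


if __name__ == "__main__":
    print(exon_length(6, 1, 2))
```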



For the 5′ UTR and 3′ UTR states, geometric distributions are used with mean values of 769 and 457 bp, respectively.<br />

7.11 Simple signal models<br />

There are a number of different models of biological signal sequences, such as donor and acceptor sites, promoters, etc.<br />

One of the earliest and most influential approaches is the weight matrix method (WMM), in which the frequency $p^{(i)}_a$ of each nucleotide $a$ at position $i$ of a signal of length $n$ is derived from a collection of aligned signal sequences.<br />

The product $P(A) = \prod_{i=1}^{n} p^{(i)}_{a_i}$ is used to estimate the probability of generating a particular sequence $A = a_1 a_2 \dots a_n$.<br />

The weight array model (WAM) is a generalization that takes dependencies between adjacent positions into account. In this model, the probability of generating a particular sequence is $P(A) = p^{(1)}_{a_1} \prod_{i=2}^{n} p^{i-1,i}_{a_{i-1},a_i}$, where $p^{i-1,i}_{j,k}$ is the conditional probability of generating a particular nucleotide $x_k$ at position $i$, given nucleotide $x_j$ at position $i-1$.<br />

Here is a WMM for recognition of a start site:<br />

Pos. -8 -7 -6 -5 -4 -3 -2 -1 +1 +2 +3 +4 +5 +6 +7<br />

A .16 .29 .20 .25 .22 .66 .27 .15 1 0 0 .28 .24 .11 .26<br />

C .48 .31 .21 .33 .56 .05 .50 .58 0 0 0 .16 .29 .24 .40<br />

G .18 .16 .46 .21 .17 .27 .12 .22 0 0 1 .48 .20 .45 .21<br />

T .19 .24 .14 .21 .06 .02 .11 .05 0 1 0 .09 .26 .21 .21<br />

Under this model, the sequence ...CCGCCACC ATG GCGC... has the highest probability of containing a start site, namely: $P = 0.48 \cdot 0.31 \cdot 0.46 \cdot 0.33 \cdot 0.56 \cdot 0.66 \cdot 0.5 \cdot 0.58 \cdot 1 \cdot 1 \cdot 1 \cdot 0.48 \cdot 0.29 \cdot 0.45 \cdot 0.4 \approx 6.1 \cdot 10^{-5}$.<br />

The sequence ...AGTTTTTT ATG TAAT... has one of the lowest probabilities of containing a start site at the indicated position, namely: $P = 0.16 \cdot 0.16 \cdot 0.14 \cdot 0.21 \cdot 0.06 \cdot 0.02 \cdot 0.11 \cdot 0.05 \cdot 1 \cdot 1 \cdot 1 \cdot 0.09 \cdot 0.24 \cdot 0.11 \cdot 0.21 \approx 2.5 \cdot 10^{-12}$.<br />
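The two products can be checked with a short script that encodes the start site WMM from the table above. The variable and function names are chosen for this sketch only. It requires Python 3.8 or later for `math.prod`.<br />

```python
import math

# Start site WMM, columns for positions -8..-1 and +1..+7 (from the table).
WMM = {
    "A": [.16, .29, .20, .25, .22, .66, .27, .15, 1, 0, 0, .28, .24, .11, .26],
    "C": [.48, .31, .21, .33, .56, .05, .50, .58, 0, 0, 0, .16, .29, .24, .40],
    "G": [.18, .16, .46, .21, .17, .27, .12, .22, 0, 0, 1, .48, .20, .45, .21],
    "T": [.19, .24, .14, .21, .06, .02, .11, .05, 0, 1, 0, .09, .26, .21, .21],
}


def wmm_probability(site):
    """Probability of a 15 bp site under the WMM (positions are independent)."""
    assert len(site) == 15
    return math.prod(WMM[base][i] for i, base in enumerate(site))


def most_probable_site():
    """Pick the most frequent nucleotide in every column."""
    return "".join(max("ACGT", key=lambda b: WMM[b][i]) for i in range(15))


if __name__ == "__main__":
    best = most_probable_site()
    print(best, wmm_probability(best))
    print("AGTTTTTTATGTAAT", wmm_probability("AGTTTTTTATGTAAT"))
```

Because the columns are treated as independent, the most probable site is obtained by maximizing each column separately.<br />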

7.12 Transcriptional and translational signals<br />

Poly-A signals are modeled as a 6 bp WMM, with consensus sequence AATAAA.<br />

A 12 bp WMM, beginning 6 bp prior to the start codon, is used for the translation initiation signal.<br />

In both cases, one can estimate the probabilities using the GenBank annotated “polyA signal” and “CDS” features from sequences.<br />

Approximately 30% of eukaryotic promoters lack a TATA signal. Hence, a TATA-containing promoter is generated with probability 0.7, and a TATA-less one with probability 0.3.<br />

TATA-containing promoters are modeled as a 15 bp TATA WMM and an 8 bp cap site WMM. The length between the two WMMs is generated uniformly from the range 14 − 20 bp.<br />

TATA-less ones are modeled as intergenic regions of 40 bp.



7.13 Splice signals<br />

The donor and acceptor splice signals are probably the most important signals, as the majority of exons are internal ones. Previous approaches use WMMs or WAMs to model them, thus assuming independence of sites, or that dependencies only occur between adjacent sites.<br />

The consensus region of the donor splice sites covers the last 3 bp of the exon (positions -3 to -1) and the first 6 bp of the succeeding intron (positions +1 to +6):<br />

...exon intron...<br />

Position -3 -2 -1 +1 +2 +3 +4 +5 +6<br />

Consensus c/a A G G T a/g A G t<br />

WMM:<br />

A .33 .60 .08 0 0 .49 .71 .06 .15<br />

C .37 .13 .04 0 0 .03 .07 .05 .19<br />

G .18 .14 .81 1 0 .45 .12 .84 .20<br />

T .12 .13 .07 0 1 .03 .09 .05 .46<br />

7.14 Donor site model<br />

However, donor sites show significant dependencies between non-adjacent positions, which probably reflect details of donor splice site recognition by U1 snRNA and other factors.<br />

Suppose we are given a sequence $S$. Let $C_i$ denote the consensus indicator variable that is 1, if the nucleotide at position $i$ matches the consensus at position $i$, and 0 otherwise. Let $X_j$ denote the nucleotide at position $j$.<br />

For example, consider:<br />

...exon intron...<br />

Position -3 -2 -1 +1 +2 +3 +4 +5 +6<br />

Consensus c/a A G G T a/g A G t<br />

S ...T A A C G T A A G C C ...<br />

Here, $C_{-1} = 0$ and $C_{+6} = 0$, and $C_i = 1$ for all other positions. Similarly, $X_{-3} = A$, $X_{-2} = A$, $X_{-1} = C$ etc.<br />
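The consensus indicator variables of this example can be reproduced with a few lines of code. The encoding of the consensus as a list of allowed nucleotides per position is an assumption of this sketch.<br />

```python
# Donor site consensus for positions -3..-1 and +1..+6 (c/a A G G T a/g A G t).
POSITIONS = [-3, -2, -1, 1, 2, 3, 4, 5, 6]
CONSENSUS = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def consensus_indicators(site):
    """Map each position i to C_i (1 if the base matches the consensus, else 0)."""
    assert len(site) == len(POSITIONS)
    return {pos: int(base in allowed)
            for pos, allowed, base in zip(POSITIONS, CONSENSUS, site.upper())}


if __name__ == "__main__":
    # The 9 bases of S at positions -3..+6 in the example above.
    print(consensus_indicators("AACGTAAGC"))
```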

We use $\chi^2$ statistics for the variable $C_i$ versus $X_j$, for all pairs $i, j$ with $i \ne j$ in the set of donor sites from the genes of the given learning set, based on the $C_i$ versus $X_j$ contingency table:<br />

$X_j$<br />

$C_i$ A C G T<br />

0 $f_0(A)$ $f_0(C)$ $f_0(G)$ $f_0(T)$<br />

1 $f_1(A)$ $f_1(C)$ $f_1(G)$ $f_1(T)$,<br />

where $f_1(x)$ (respectively $f_0(x)$) is the frequency with which the donor sites of the training set have (respectively do not have) the consensus base at position $i$ together with the base $x$ at position $j$.<br />

A significant $\chi^2$ score indicates that there is a dependency between sites $i$ and $j$.<br />
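The $\chi^2$ statistic of such a contingency table can be computed directly from the observed counts, as in the sketch below. This is the standard Pearson statistic. The function name and the example tables are made up for illustration. For a 2 × 4 table the statistic has 3 degrees of freedom.<br />

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table of observed counts.

    `table` is a list of rows, e.g. [[f0(A), f0(C), f0(G), f0(T)],
                                     [f1(A), f1(C), f1(G), f1(T)]].
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n  # count under independence
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    # Hypothetical counts: rows are proportional, so C_i and X_j look independent.
    print(chi_square([[10, 20, 30, 40], [20, 40, 60, 80]]))
    # Hypothetical counts with a strong dependency.
    print(chi_square([[10, 0], [0, 10]]))
```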

The idea is then to identify an ordering of the sites by decreasing discriminatory power and then to derive separate WMMs for each of the different cases, thus obtaining a so-called maximal dependence decomposition:<br />

[Figure: maximal dependence decomposition of the donor sites.]<br />

Here, H = A|C|U, B = C|G|U and V = A|C|G. For example, $G_5$, or $H_5$, is the set of donor sites with, or without, a G at position +5, respectively.<br />

7.15 Acceptor site model<br />

Intron/exon junctions are modeled by a (first-order) WAM for bases −20 to +3, capturing the pyrimidine (C,T) rich region and the acceptor splice site itself.<br />

It is difficult to model the branch point in the preceding intron, and only 30% of the test data had a YYRAY sequence in the appropriate region [−40, −21].<br />

A modified variant of a second-order WAM is employed in which nucleotides are generated conditional on the previous two, in an attempt to model the weak but detectable tendency toward YYY triplets as well as certain branch point-related triplets such as TGA, TAA, GAC, and AAC in this region, without requiring the occurrence of any specific branch point consensus.<br />

(A windowing and averaging process is used to obtain estimates from the limited training data.)<br />

7.16 Exon models<br />

Coding portions of exons are modeled using an inhomogeneous 3-periodic fifth-order Markov<br />

model. Here, separate Markov transition matrices, $c_1$, $c_2$ and $c_3$, are determined for hexamers<br />

ending at each of the three codon positions, respectively:<br />

[Figure: a coding sequence ... x1 x2 x3 y1 y2 y3 z1 z2 z3 ..., with the matrices $c_1$, $c_2$ and $c_3$ marked on the hexamers ending at codon positions 1, 2 and 3.]<br />

This is based on the observation that frame-shifted hexamer counts are generally the most<br />

accurate compositional discrim<strong>in</strong>ator of cod<strong>in</strong>g versus non-cod<strong>in</strong>g regions.<br />

However, A+T rich genes are often not well predicted us<strong>in</strong>g hexamer counts based on bulk<br />

DNA and so Genscan uses two different sets of transition matrices, one tra<strong>in</strong>ed for sequences<br />

with < 43% C+G content and one for all others.
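The 3-periodic fifth-order model can be sketched in a few lines. This is only an illustration under stated assumptions, not Genscan's actual implementation: the names `train_periodic_model` and `log_prob`, the pseudocount smoothing and the toy training sequence are all hypothetical, and the training sequences are assumed to start at the first codon position.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
ORDER = 5  # fifth order: each base is conditioned on the previous five bases


def train_periodic_model(coding_seqs, pseudocount=1.0):
    """Estimate P_f(base | previous 5 bases) for the codon positions f = 0, 1, 2.

    Assumption: every training sequence starts at the first codon position.
    """
    # counts[f][context][base]: hexamers whose last base is at codon position f
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(ORDER, len(seq)):
            counts[i % 3][seq[i - ORDER:i]][seq[i]] += 1.0

    def prob(frame, context, base):
        seen = counts[frame][context]
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen[base] + pseudocount) / total

    return prob


def log_prob(seq, prob, first_frame=0):
    """Log-probability of seq[5:] when seq[0] sits at codon position first_frame."""
    return sum(
        math.log(prob((first_frame + i) % 3, seq[i - ORDER:i], seq[i]))
        for i in range(ORDER, len(seq))
    )


# Toy data (hypothetical): a perfectly periodic "coding" sequence.
toy = "ATGGCC" * 10
model = train_periodic_model([toy])
in_frame = log_prob(toy, model, first_frame=0)
shifted = log_prob(toy, model, first_frame=1)
print(in_frame > shifted)
```

Scoring the same sequence in the wrong frame consults transition tables that never saw these hexamers, which is why the frame-specific matrices discriminate between reading frames.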



7.17 Performance studies<br />

The performance of a gene prediction program is evaluated by apply<strong>in</strong>g it to DNA sequences<br />

for which all conta<strong>in</strong>ed genes are known and annotated with high confidence.<br />

To calculate accuracy statistics, each nucleotide of a test sequence is classified as:<br />

• a predicted positive (PP) if it is predicted to be conta<strong>in</strong>ed <strong>in</strong> a cod<strong>in</strong>g region,<br />

• a predicted negative (PN) if it is predicted to be conta<strong>in</strong>ed <strong>in</strong> non-cod<strong>in</strong>g region,<br />

• an actual positive (AP) if it is annotated to be conta<strong>in</strong>ed <strong>in</strong> cod<strong>in</strong>g region, and<br />

• an actual negative (AN) if it is annotated to be conta<strong>in</strong>ed <strong>in</strong> non-cod<strong>in</strong>g region.<br />

The performance is measured both on the level of nucleotides and on whole predicted exons,<br />

us<strong>in</strong>g a similar classification.<br />

Based on this classification, we compute the number of:<br />

• true positives, TP = PP ∩ AP,<br />

• false positives, FP = PP ∩ AN,<br />

• true negatives, TN = PN ∩ AN, and<br />

• false negatives, FN = PN ∩ AP.<br />

The sensitivity Sn and specificity Sp of a method are then defined as<br />

$$\mathrm{Sn} = \frac{TP}{AP} \quad \text{and} \quad \mathrm{Sp} = \frac{TP}{PP},$$

respectively, measuring both the ability to predict true genes and to avoid predicting false<br />

ones.<br />
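As a small worked example, the nucleotide-level statistics can be computed as follows. The function name `sensitivity_specificity` and the toy labelings are hypothetical; the sketch assumes one boolean flag per nucleotide (coding or non-coding) and treats TP, AP and PP as counts.

```python
def sensitivity_specificity(predicted, annotated):
    """Nucleotide-level Sn = TP/AP and Sp = TP/PP.

    predicted[i] is True if nucleotide i is predicted to be coding (PP),
    annotated[i] is True if nucleotide i is annotated as coding (AP).
    """
    if len(predicted) != len(annotated):
        raise ValueError("both labelings must cover the same nucleotides")
    tp = sum(1 for p, a in zip(predicted, annotated) if p and a)
    pp = sum(1 for p in predicted if p)
    ap = sum(1 for a in annotated if a)
    sn = tp / ap if ap else float("nan")
    sp = tp / pp if pp else float("nan")
    return sn, sp


# Toy labelings (hypothetical): C = coding, N = non-coding.
annotated = [c == "C" for c in "NNCCCCNNNN"]
predicted = [c == "C" for c in "NCCCNNNNNN"]
sn, sp = sensitivity_specificity(predicted, annotated)
print(sn, sp)
```

Here two of the four annotated coding nucleotides are found (Sn = 0.5), and two of the three predicted coding nucleotides are correct (Sp = 2/3).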

7.18 Performance of GENSCAN<br />

Genscan was run on a test set of 570 vertebrate sequences and the forward strand exons <strong>in</strong> the<br />

optimal Genscan parse of the sequence were compared to the annotated exons. The follow<strong>in</strong>g<br />

table shows the results and compares them with results obta<strong>in</strong>ed us<strong>in</strong>g other programs:<br />

Genscan performs very well here and is currently the most popular gene f<strong>in</strong>d<strong>in</strong>g method.



7.19 Comparative gene f<strong>in</strong>d<strong>in</strong>g<br />

Genscan’s model makes use of statistical features of the genome un<strong>der</strong> consi<strong>der</strong>ation, obta<strong>in</strong>ed<br />

from an annotated tra<strong>in</strong><strong>in</strong>g set.<br />

More recently, a number of methods have been suggested that attempt to also make use of<br />

comparative data. They are based on the observation that<br />

the level of sequence conservation between two species depends on the function of<br />

the DNA, e.g. cod<strong>in</strong>g sequence is more conserved than <strong>in</strong>tergenic sequence.<br />

One such program is Rosetta, which first computes a global alignment of two homologous<br />

sequences and then attempts to predict genes in both sequences simultaneously. A second is<br />

the conserved exon method, which uses local conservation.<br />

The Twinscan program is an extension of Genscan that additionally models a conserved<br />

sequence.<br />

7.20 TWINSCAN<br />

The <strong>in</strong>put to Tw<strong>in</strong>scan consists of a target sequence, i.e. a genomic sequence <strong>in</strong> which genes are<br />

to be predicted, and an <strong>in</strong>formant sequence, i.e. a genomic sequence from a related organism.<br />

For example, the target sequence may come from the mouse genome and the informant sequence<br />

from the human genome.<br />

Given a target and an <strong>in</strong>formant, <strong>in</strong> a preprocess<strong>in</strong>g step, one determ<strong>in</strong>es a set of top homologs<br />

(e.g. us<strong>in</strong>g BLAST) from the <strong>in</strong>formant sequence, i.e. one or more sequences from the <strong>in</strong>formant<br />

sequence that match the target sequence best.<br />

[Figure: the mouse target sequence with the conserved human regions (top homologs) drawn beneath it.]<br />

The top homologs represent the regions of conserved <strong>in</strong>formant sequence, which we will simply<br />

call “the <strong>in</strong>formant sequence” <strong>in</strong> the follow<strong>in</strong>g.<br />

7.21 Conservation sequence<br />

Similarity is represented by a conservation sequence, which pairs one of three symbols with<br />

each nucleotide of the target:<br />

.   unaligned<br />
|   matched<br />
:   mismatched<br />

Gaps <strong>in</strong> the <strong>in</strong>formant sequence become mismatch symbols, gaps <strong>in</strong> the target sequence are<br />

ignored. Consi<strong>der</strong>:<br />

123456789   position<br />
GAATTCCGT   target sequence<br />

and suppose that BLAST yields the following HSP:<br />

345 6789    target position<br />
ATT-CCGT    target alignment<br />
||  || |    BLAST alignment<br />
ATCACC-T    informant alignment<br />

The conservation sequence derived from this HSP is:<br />

123456789   position<br />
GAATTCCGT   target sequence<br />
..||:||:|   conservation sequence<br />

The follow<strong>in</strong>g algorithm takes a list of HSPs and computes the conservation sequence C:<br />

Algorithm<br />
Input: target sequence, list of HSPs<br />
Output: conservation sequence C<br />
Init.: C[1..n] := unaligned<br />
Sort HSPs by alignment score<br />
for each position i in the target sequence:<br />
    for each HSP H from best to worst:<br />
        if H covers position i and C[i] = unaligned:<br />
            C[i] := symbol (matched or mismatched) that H assigns to position i<br />
end<br />

Note that the conservation symbol assigned to the target nucleotide at position i is determ<strong>in</strong>ed<br />

by the best HSP that covers i, regardless of which homologous sequence it comes from. Position<br />

i is classified as unaligned only if none of the HSPs overlap it.<br />
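The algorithm above can be sketched in Python as follows. The HSP representation (score, 1-based target start, aligned target string, aligned informant string) and the function name `conservation_sequence` are assumptions made for this illustration. The loops are swapped relative to the pseudocode (HSPs outermost), which gives the same result because a position is only ever filled by the best HSP that covers it.

```python
def conservation_sequence(target, hsps):
    """Compute the conservation sequence of `target`.

    hsps: list of (score, start, target_aln, informant_aln), where `start`
    is the 1-based target position of the first aligned target nucleotide
    and gaps are written as '-'.
    """
    cons = ["."] * len(target)  # '.' = unaligned
    # Visit HSPs from best to worst; earlier assignments are never overwritten.
    for _score, start, t_aln, i_aln in sorted(hsps, key=lambda h: -h[0]):
        pos = start - 1  # 0-based index into the target
        for t, q in zip(t_aln, i_aln):
            if t == "-":
                continue  # gap in the target: ignored
            if cons[pos] == ".":
                # a gap in the informant (q == '-') counts as a mismatch
                cons[pos] = "|" if t == q else ":"
            pos += 1
    return "".join(cons)


# The HSP from the example above (the score 20 is a made-up value).
result = conservation_sequence("GAATTCCGT", [(20, 3, "ATT-CCGT", "ATCACC-T")])
print(result)  # ..||:||:|
```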

7.22 Probability of sequence and conservation sequence<br />

Recall that Genscan assigns each nucleotide of an <strong>in</strong>put sequence to one of seven categories:<br />

promoter, 5’ UTR, exon, <strong>in</strong>tron, 3’ UTR, poly-A signal and <strong>in</strong>tergenic.<br />

Genscan chooses the most likely assignment of categories to nucleotides accord<strong>in</strong>g to the<br />

Genscan model, us<strong>in</strong>g an optimization algorithm (i.e., a modification of the Viterbi algorithm).<br />

Given a sequence, the Genscan model assigns a probability to each parse of the sequence (i.e.,<br />

each path through the model that generates the sequence).<br />

The Tw<strong>in</strong>scan model assigns a probability to any parsed DNA sequence together with a<br />

parallel conservation sequence. Un<strong>der</strong> this model, the probability of a DNA sequence and the<br />

probability of the parallel conservation sequence are <strong>in</strong>dependent, given the parse.<br />

Consi<strong>der</strong> the follow<strong>in</strong>g example:<br />

        10        20        30<br />
123456789|123456789|123456789|123456789<br />
ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC target sequence T<br />
||:|||.........|:|:|||||||||:||:|||::|| conservation sequence C<br />

Consider the probability of observing the target sequence $T_{7,33}$, extending from position 7 to 33,<br />

together with the conservation sequence $C_{7,33}$, given the hypothesis $E_{7,33}$ that an internal exon<br />

extends from position 7 to 33.<br />

This is simply the probability of the target sequence $T_{7,33}$ under the Genscan model times<br />

the probability of the conservation sequence $C_{7,33}$ under the conservation model, assuming the<br />

parse $E_{7,33}$:<br />

$$P(T_{7,33}, C_{7,33} \mid E_{7,33}) = P(T_{7,33} \mid E_{7,33}) \, P(C_{7,33} \mid E_{7,33}).$$

7.23 TWINSCAN’s model<br />

Tw<strong>in</strong>scan consists of a new, jo<strong>in</strong>t probability model on DNA sequences and conservation<br />

sequences, together with the same optimization algorithm used by Genscan.<br />

Twinscan augments the state-specific sequence models of Genscan with models of the probability<br />

of generating any given conservation sequence from any given state.<br />

Cod<strong>in</strong>g, UTR, and <strong>in</strong>tron/<strong>in</strong>tergenic states all assign probabilities to stretches of conservation<br />

sequence us<strong>in</strong>g homogeneous 5th-or<strong>der</strong> Markov cha<strong>in</strong>s:<br />

ccccccccccc c1 c2 c3 c4 c5 c6 ccccccccccc<br />

One set of parameters is estimated for each of these types of regions.<br />

Aga<strong>in</strong>, consi<strong>der</strong>:<br />

        10        20        30<br />
123456789|123456789|123456789|123456789<br />
ATTTAGCCTACTGAAATGGACCGCTTCAGCATGGTATCC target sequence T<br />
||:|||.........|:|:|||||||||:||:|||::|| conservation sequence C<br />

The probability of observing $C_{7,33}$, given $E_{7,33}$, is:<br />

$$P_C(C_{7,33} \mid E_{7,33}) = P_E(C_{7,7} \mid C_{2,6}) \cdot \ldots \cdot P_E(C_{33,33} \mid C_{28,32}),$$

where $P_E(C_{33,33} \mid C_{28,32})$, for example, is the estimated probability of a ‘|’ (match) following<br />

the given context symbols “|:||:” in the conservation sequence of an exon.<br />
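To make the product concrete, the following sketch lists its factors for the example and evaluates it with a placeholder model. The function names and the uniform conditional probability of 1/3 are assumptions for illustration only; Twinscan estimates these probabilities from training data.

```python
import math

ORDER = 5  # 5th-order chain: each symbol is conditioned on the previous five


def conservation_factors(cons, start, end):
    """(context, symbol) pairs for the positions start..end (1-based, inclusive)."""
    if start <= ORDER:
        raise ValueError("need at least five context symbols before `start`")
    factors = []
    for pos in range(start, end + 1):
        context = cons[pos - 1 - ORDER:pos - 1]  # symbols at pos-5 .. pos-1
        factors.append((context, cons[pos - 1]))
    return factors


def conservation_log_prob(cons, start, end, cond_prob):
    """log P_C(C[start..end] | state) as a sum of log conditional probabilities."""
    return sum(
        math.log(cond_prob(context, symbol))
        for context, symbol in conservation_factors(cons, start, end)
    )


C = "||:|||.........|:|:|||||||||:||:|||::||"
factors = conservation_factors(C, 7, 33)


def uniform(context, symbol):
    # Placeholder model (assumption): every symbol is equally likely.
    return 1.0 / 3.0


logp = conservation_log_prob(C, 7, 33, uniform)
print(factors[0], factors[-1], len(factors))
```

The first factor conditions on the symbols at positions 2 to 6 and the last one is the pair (“|:||:”, ‘|’) discussed above, giving 27 factors in total.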

Conservation at splice donor and acceptor sites is modeled using 2nd-order WAMs<br />

of length 9 bp and 43 bp, respectively (lengths as in Genscan).<br />

7.24 TWINSCAN’s performance<br />

Tw<strong>in</strong>scan was tested on two data sets. The first set consists of 86 mouse sequences total<strong>in</strong>g<br />

7.6 Mb and used top homologs from human:<br />

Program      Exons   Exon Sn   Exon Sp   Genes   Genes Sn   Genes Sp<br />
Annotation   2758                        275<br />
Genscan      2997    0.631     0.581     395     0.153      0.106<br />
Twinscan     2854    0.683     0.660     464     0.244      0.144<br />

The second set is a subset conta<strong>in</strong><strong>in</strong>g 8 pairs of f<strong>in</strong>ished orthologs:<br />

Program      Exons   Exon Sn   Exon Sp   Genes   Genes Sn   Genes Sp<br />
Annotation   610                         48<br />
Genscan      731     0.798     0.666     51      0.167      0.157<br />
Twinscan     684     0.854     0.752     50      0.271      0.260<br />



7.25 The conserved exon method (CEM)<br />

Based on a model of sequence conservation, Tw<strong>in</strong>scan uses an <strong>in</strong>formant sequence to obta<strong>in</strong><br />

better gene predictions for a given target sequence.<br />

The input to the conserved exon method (CEM) consists of two related sequences, and the method predicts<br />

gene structures <strong>in</strong> both sequences simultaneously. The un<strong>der</strong>ly<strong>in</strong>g assumption is that exons are<br />

well preserved, whereas <strong>in</strong>trons and <strong>in</strong>tergenic DNA have very little similarity.<br />

For this assumption to hold, the two <strong>in</strong>put sequences must be at an appropriate evolutionary<br />

distance. Cod<strong>in</strong>g regions are generally well conserved <strong>in</strong> species as far back as 450 Myrs. At<br />

evolutionary distances of 50–100 Myrs (human and mouse), the conservation also extends to<br />

other functional regions important for gene expression and ma<strong>in</strong>ta<strong>in</strong><strong>in</strong>g genome structure.<br />

The ma<strong>in</strong> idea of CEM is to look for conserved prote<strong>in</strong> sequences by compar<strong>in</strong>g pairs of DNA<br />

sequences, to identify putative exons based on sequence and splice site conservation, and then<br />

to cha<strong>in</strong> such pairs of conserved exons together to obta<strong>in</strong> gene structure predictions <strong>in</strong> both<br />

sequences.<br />

Identifying conserved coding sequence: The first part of the CEM is not new. For example,<br />

the TBLASTX program performs precisely this task. Additionally, a number of tools exist<br />

for comparing two genomic sequences, finding conserved exons and regulatory regions etc.<br />

Building gene models: The second part of the CEM, in which gene structures are generated<br />

from the identified matches and complete gene structures are predicted in both input<br />

sequences, is more interesting.<br />

7.26 Application of TBLASTX<br />

Throughout the follow<strong>in</strong>g, we are given two similar DNA sequences S and T.<br />

The program TBLASTX produces a list of high-scor<strong>in</strong>g pairs (HSPs) of locally aligned substr<strong>in</strong>gs<br />

of S and T, where the two substr<strong>in</strong>gs are <strong>in</strong>terpreted as am<strong>in</strong>o-acid cod<strong>in</strong>g str<strong>in</strong>gs and<br />

the score of the alignment is computed using a BLOSUM or PAM protein scoring matrix.<br />

This is how an HSP is reported by TBLASTX:<br />

Score = 214 (98.4 bits), Expect = 0.0, Sum P(24) = 0.0<br />
Identities = 44/46 (95%), Positives = 46/46 (100%), Frame = +1 / +1<br />
Query: 5284 RLVLRIATDDSKAVCRLSVKFGATLRTSRLLLERAKELNIDVVGVR 5421<br />
            RLVLRIATDDSKAVCRLSVKFGATL+TSRLLLERAKELNIDV+GVR<br />
Sbjct: 3871 RLVLRIATDDSKAVCRLSVKFGATLKTSRLLLERAKELNIDVIGVR 4008<br />

In this example, the positions 5284–5421 of sequence S and positions 3871–4008 of sequence<br />

T are aligned together and <strong>in</strong>terpreted as am<strong>in</strong>o-acids as shown. The “frame” <strong>in</strong>dicates the<br />

directions and the offsets of the two substr<strong>in</strong>gs.<br />

TBLASTX matches between two similar pieces of human and mouse DNA:


[Figure: CEMexplorer dot plot of the TBLASTX matches for /home/huson/genomics/CG/testcases/J03733_X16277 (ornithine_1, frames +,+), with mus.mask on the horizontal axis and hum.mask on the vertical axis.]<br />

7.27 Key assumption for conserved exons<br />

Note that programs such as TBLASTX predict putative cod<strong>in</strong>g regions, but not actual splice<br />

boundaries. Also, many HSPs are due to other conserved features, not exons.<br />

In the CEM, the local alignments produced by TBLASTX are used as seeds for dynamic<br />

programm<strong>in</strong>g alignments that are computed to detect complete exons.<br />

Key assumption: Any pair of conserved exons $E_1$ (in S) and $E_2$ (in T) gives rise to a witness,<br />

i.e. an HSP h whose middle codon is a portion of the correct local alignment of $E_1$ and $E_2$, in<br />

the correct frame.<br />

[Figure: a witness HSP h inside the alignment of $E_1$ and $E_2$; labels in the original figure: A, $E_1$, h, $E_2$, B.]<br />

7.28 Conserved exon pairs<br />

A putative conserved exon pair (CEP) consists of a pair of substrings $E_1$ (in S) and $E_2$ (in T)<br />

that are both flanked by appropriate splice junctions and have a high-scoring local amino-acid<br />

alignment. We now discuss how to obtain putative CEPs.<br />

Given an HSP h, let $m_S(h)$ and $m_T(h)$ denote the position of the middle codon of h in S and<br />

in T, respectively.<br />

Let $b_S(h)$ and $e_S(h)$ denote the position of the left-most possible intron-exon site and right-most<br />

possible exon-intron site for any putative exon in S that is witnessed by h. Define $b_T(h)$ and<br />

$e_T(h)$ in the same way.<br />

In a simple approach, we use empirical bounds on the lengths of exons to find the values of $b_S$,<br />

$e_S$, $b_T$ and $e_T$. A more sophisticated approach takes the amount of coverage by HSPs etc. into<br />

account.<br />
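A minimal sketch of the simple approach might look as follows. Everything here is an assumption made for illustration: the function names, the convention that the middle codon of an HSP with an even number of codons is the one starting the second half, the forward-strand coordinates, and the single bound `max_exon_len` used in both directions.

```python
def middle_codon(start, end):
    """First base of the middle codon of an HSP covering start..end (1-based, inclusive)."""
    n_codons = (end - start + 1) // 3
    return start + 3 * (n_codons // 2)


def exon_window(m, max_exon_len, seq_len):
    """Search window [b, e] around the middle codon m, clipped to the sequence."""
    b = max(1, m - max_exon_len)
    e = min(seq_len, m + max_exon_len)
    return b, e


# Coordinates of the TBLASTX example HSP shown earlier (46 codons in each sequence).
m_first = middle_codon(5284, 5421)
m_second = middle_codon(3871, 4008)
# The bound 1000 and the sequence length 6000 are made-up values.
window = exon_window(m_first, max_exon_len=1000, seq_len=6000)
print(m_first, m_second, window)
```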

Start, stop and splice sites are detected by WMMs or more advanced techniques.<br />

If the values of $b_S$, $e_S$, $b_T$ and $e_T$ were chosen large enough, then the key assumption implies<br />

that the two exons $E_1$ (in S) and $E_2$ (in T) of the true CEP (witnessed by h) will start in<br />

$[b_S(h), m_S(h)]$ and $[b_T(h), m_T(h)]$, and will end in $[m_S(h), e_S(h)]$ and $[m_T(h), e_T(h)]$, respectively.<br />

We evaluate all possible pairs of exons in this region by running two dynamic programs: one<br />

starts at $(m_S(h), m_T(h))$ and ends at $(e_S(h), e_T(h))$, the other runs in reverse direction from<br />

$(m_S(h), m_T(h))$ to $(b_S(h), b_T(h))$:<br />

[Figure: the dynamic programming region around h, with $b_S(h)$, $m_S(h)$, $e_S(h)$ marked on the S axis and $b_T(h)$, $m_T(h)$, $e_T(h)$ on the T axis.]<br />

7.29 Exon alignment<br />

The actual algorithms used for the local alignment computations are variants of the standard<br />

algorithm.<br />

Note that the alignments are forced to start <strong>in</strong> the frame def<strong>in</strong>ed by the HSP. Frame-shifts are<br />

allowed subsequently (with an appropriate <strong>in</strong>del penalty).<br />

Each splice-junction pair is a cell <strong>in</strong> the dynamic programm<strong>in</strong>g matrix, and its score is ma<strong>in</strong>ta<strong>in</strong>ed<br />

<strong>in</strong> a separate list.<br />

Let (i, j) be the coordinates of a cell corresponding to a splice-pair $(z_S(h), z_T(h))$. The score<br />

assigned to $(z_S(h), z_T(h))$ is not Score[i][j], but<br />

$$\mathrm{Score}(z_S(h), z_T(h)) = \max_{0 \le k_S(h),\, k_T(h) \le 2} \mathrm{Score}[i - k_S(h)][j - k_T(h)].$$

This is to allow for the possibility of an intron splitting a codon. In this way, the alignment<br />

(which only scores codons) allows terminal nucleotide gaps without incurring a frame-shift<br />

penalty.<br />

The amount of overhang<br />

$$(o_S(h), o_T(h)) = \arg\max_{0 \le k_S(h),\, k_T(h) \le 2} \mathrm{Score}[i - k_S(h)][j - k_T(h)]$$

is also stored along with the score.<br />
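The maximum and its arg max can be read off the dynamic programming matrix together. The sketch below assumes the matrix is a list of lists indexed from 0 and breaks ties in favor of the smallest overhang; the name `splice_pair_score` and the toy matrix are hypothetical.

```python
def splice_pair_score(score, i, j):
    """Return (best score, (o_S, o_T)) over all overhangs 0 <= k_S, k_T <= 2."""
    best = None
    for k_s in range(3):
        for k_t in range(3):
            if i - k_s < 0 or j - k_t < 0:
                continue  # outside the matrix
            candidate = score[i - k_s][j - k_t]
            if best is None or candidate > best[0]:
                best = (candidate, (k_s, k_t))
    return best


# Toy matrix (made-up values).
toy_scores = [
    [0, 1, 2, 3],
    [1, 5, 2, 0],
    [2, 2, 9, 1],
    [3, 0, 1, 4],
]
best_score, overhang = splice_pair_score(toy_scores, 3, 3)
print(best_score, overhang)  # 9 (1, 1)
```

In the toy matrix the cell (3, 3) itself scores 4, but stepping back one nucleotide in each sequence reaches the score 9, so the pair is stored with overhang (1, 1).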

As the alignment is done at the prote<strong>in</strong> level, there is a direction associated with it. The<br />

dynamic programm<strong>in</strong>g computation from the mid-po<strong>in</strong>t to the acceptor splice junctions is done<br />

by revers<strong>in</strong>g each codon before scor<strong>in</strong>g.<br />

7.30 The CEP graph<br />

For each HSP h we construct a CEP graph. Each node u in the CEP-graph corresponds to a<br />

coordinate pair (i, j), which is the starting point, mid-point or terminating point of a candidate<br />

exon pair $(E_1, E_2)$. More precisely, u is one of the following:<br />

• a center node, if $(i, j) = (m_S(h), m_T(h))$ is the position of the middle codon of h,<br />

• a donor node, if $i \in [m_S(h), e_S(h)]$ and $j \in [m_T(h), e_T(h)]$ are sites of donor splice signals in S and T,<br />

• an acceptor node, if $i \in [b_S(h), m_S(h)]$ and $j \in [b_T(h), m_T(h)]$ are sites of acceptor signals,<br />

• a start node, if $i \in [b_S(h), m_S(h)]$ and $j \in [b_T(h), m_T(h)]$ are sites of translation initiation signals, or<br />

• a terminal node, if $i \in [m_S(h), e_S(h)]$ and $j \in [m_T(h), e_T(h)]$ are sites of a stop codon.<br />

[Figure: the dynamic programming region around h (left) and the CEP graph derived from it (right), with center m(h) and the bounds $b_S(h)$, $m_S(h)$, $e_S(h)$ and $b_T(h)$, $m_T(h)$, $e_T(h)$ marked on the axes.]<br />

Each node u has some additional information associated with it. The coordinates of the cell<br />

are maintained as $(u_S, u_T)$. For each acceptor or donor node u, we maintain information on<br />

the nucleotide overhang at the boundary as overhang(u) = $(o_S(u), o_T(u))$.<br />

A directed edge is constructed from each acceptor or start node to the center, and from the<br />

center to each donor or term<strong>in</strong>al node. The weight of the edge is the score of the correspond<strong>in</strong>g<br />

local alignment.<br />

7.31 The CEM graph<br />

As discussed above, each HSP gives rise to a CEP graph. (In practice, however, different HSPs<br />

often lead to the same CEP graph and such redundancies should be removed.)<br />

Each CEP-graph is a concise representation of alignments of pairs of exons. At most one pair<br />

can actually be a conserved-exon-pair <strong>in</strong> the true gene structures. The Conserved-Exon-Method<br />

takes the CEP-graphs of HSPs and cha<strong>in</strong>s them together, thus obta<strong>in</strong><strong>in</strong>g the full “CEM-graph”.<br />

It builds gene models from this graph based on the assumption that the transcripts <strong>der</strong>ived<br />

from correct orthologous gene structures will have the highest alignment score.<br />

Let S and T be the two genomic sequences.<br />

For each HSP h, compute the CEP-graph. We build a candidate exon graph G = (V, E) (which<br />

we call the CEM-graph), as follows: V is the union of all the nodes <strong>in</strong> the CEP-graphs, and E<br />

conta<strong>in</strong>s all the edges <strong>in</strong> each CEP-graph. Further, add an edge from donor or term<strong>in</strong>al node<br />

u to an acceptor or start node v if both:<br />

• $v_S \ge u_S + M$ and $v_T \ge u_T + M$, where M is a suitably chosen minimum intron length,<br />

and:<br />

• with $(o_S(u), o_T(u)) = \mathrm{overhang}(u)$ and $(o_S(v), o_T(v)) = \mathrm{overhang}(v)$, we have<br />

$o_S(u) + o_S(v) \equiv 0 \pmod 3$ and $o_T(u) + o_T(v) \equiv 0 \pmod 3$.<br />

The weight of the edge (u, v) is the score of aligning the amino-acids obtained by concatenating<br />

the overhangs on either side, plus the penalty for an intron gap.<br />
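The two linking conditions translate into a simple predicate. The `Node` class, its field names and the function `can_link` are hypothetical; nodes without a stored overhang are assumed to carry (0, 0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str                 # "center", "donor", "acceptor", "start" or "terminal"
    s: int                    # coordinate u_S in S
    t: int                    # coordinate u_T in T
    overhang: tuple = (0, 0)  # (o_S, o_T); assumed (0, 0) when not stored


def can_link(u, v, min_intron):
    """True if an edge u -> v may be added to the CEM graph."""
    if u.kind not in ("donor", "terminal") or v.kind not in ("acceptor", "start"):
        return False
    # Condition 1: v lies at least one minimum intron length downstream of u.
    far_enough = v.s >= u.s + min_intron and v.t >= u.t + min_intron
    # Condition 2: the overhangs on both sides add up to whole codons.
    in_frame = ((u.overhang[0] + v.overhang[0]) % 3 == 0
                and (u.overhang[1] + v.overhang[1]) % 3 == 0)
    return far_enough and in_frame


# Made-up nodes and a made-up minimum intron length of 50.
donor = Node("donor", 100, 200, (1, 2))
acceptor = Node("acceptor", 200, 300, (2, 1))
print(can_link(donor, acceptor, 50))
```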

Example of linking two CEPs, where nodes are labeled by their offsets $(o_S, o_T)$:<br />

[Figure: the CEP graphs of two HSPs h and h', with nodes labeled (0,0), (0,2), (0,0), (0,0), (0,1), (1,2) and (0,2), and additional edges linking the CEPs.]<br />

Example of a complete graph:<br />

[Figure: a complete CEM graph, with nodes labeled by coordinate pairs and edges labeled by their weights.]<br />

7.32 Obta<strong>in</strong><strong>in</strong>g a gene prediction<br />

By construction, a path <strong>in</strong> the CEM-graph corresponds to a prediction of orthologous gene<br />

structures <strong>in</strong> the two genomes. Based on the assumption that the correct gene models will have<br />

the highest alignment score, we can extract the correct gene structures simply by choos<strong>in</strong>g<br />

the highest scor<strong>in</strong>g path. As this is a directed acyclic graph, the highest scor<strong>in</strong>g path can be<br />

computed via a topological sort:<br />

getGeneModelScores(CEMGraph G(V,E))<br />
begin<br />
    OrderedNodeList L = TopologicalSort(G)<br />
    for each v in L<br />
        Initialize(Score(v))<br />
        for each incoming edge e = (x,v)<br />
            if (Score(v) < Score(x) + w(e))<br />
                Score(v) = Score(x) + w(e)<br />
                predecessor(v) = x<br />
end<br />



(An or<strong>der</strong><strong>in</strong>g φ of the nodes of an acyclic graph is a topological sort<strong>in</strong>g if for any edge (v, w) we<br />

have φ(v) < φ(w).)<br />

For an arbitrary node u, Score(u) is the best score of an alignment of the two sequence prefixes<br />

$S[1..u_S]$ and $T[1..u_T]$, allowing for frame-shifts, amino-acid indels and intron penalties.<br />

Once the scores on the nodes are computed the gene models are built by start<strong>in</strong>g at the node<br />

with the highest score, and follow<strong>in</strong>g the predecessors. The coord<strong>in</strong>ates of start, term<strong>in</strong>al,<br />

donor and acceptor nodes reveal the gene structure <strong>in</strong> the two genomic sequences. As the<br />

boundaries of the path are not limited to start and term<strong>in</strong>al nodes, partial gene structures can<br />

be predicted.<br />
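The scoring pass and the traceback can be sketched as follows, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). Two assumptions are made here: Initialize(Score(v)) sets the score to 0, so that a path may start at any node, and the graph is given as a node list plus a dictionary of edge weights. The function name `best_path` and the toy graph are hypothetical.

```python
from graphlib import TopologicalSorter


def best_path(nodes, edges):
    """Highest scoring path in a DAG.

    nodes: iterable of hashable node ids
    edges: dict mapping (x, v) to the weight w of the edge x -> v
    Returns (score, path as a list of nodes).
    """
    incoming = {v: [] for v in nodes}
    for (x, v), w in edges.items():
        incoming[v].append((x, w))

    # static_order() yields every node after all of its predecessors.
    order = TopologicalSorter(
        {v: [x for x, _ in incoming[v]] for v in incoming}
    ).static_order()

    score, predecessor = {}, {}
    for v in order:
        score[v] = 0.0  # assumption: a path may start at any node
        predecessor[v] = None
        for x, w in incoming[v]:
            if score[v] < score[x] + w:
                score[v] = score[x] + w
                predecessor[v] = x

    # Traceback: start at the best node and follow the predecessors.
    v = max(score, key=score.get)
    best = score[v]
    path = []
    while v is not None:
        path.append(v)
        v = predecessor[v]
    return best, path[::-1]


# Toy graph (made-up weights).
toy_edges = {("a", "b"): 5, ("b", "c"): -9, ("c", "d"): 4, ("a", "c"): 2, ("b", "d"): 3}
total, path = best_path(["a", "b", "c", "d"], toy_edges)
print(total, path)  # 8.0 ['a', 'b', 'd']
```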

7.33 Multiple genes<br />

Additionally, we add an edge from every stop node v with coordinates $(v_S, v_T)$ to every downstream<br />

start node w (with coordinates $w_S > v_S$ and $w_T > v_T$). Such edges are given weight 0.<br />

The role of such edges is to allow prediction of multiple genes.<br />
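A minimal sketch of adding these zero-weight edges, assuming nodes are represented simply by their coordinate pairs (the function name `multi_gene_edges` and the toy coordinates are hypothetical):

```python
def multi_gene_edges(stop_nodes, start_nodes):
    """Zero-weight edges from each stop node to every downstream start node.

    Nodes are (s, t) coordinate pairs; w is downstream of v if both of its
    coordinates are strictly larger.
    """
    return {
        (v, w): 0.0
        for v in stop_nodes
        for w in start_nodes
        if w[0] > v[0] and w[1] > v[1]
    }


# Made-up coordinates: only the second start node lies downstream of the stop node.
extra = multi_gene_edges([(1000, 1500)], [(400, 900), (2000, 2600), (2000, 1200)])
print(extra)
```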

HSPs with negative frames <strong>in</strong> one or both sequences are possible witnesses for exons <strong>in</strong> the<br />

reverse strand of one or both sequences. The CEP graphs <strong>der</strong>ived from such HSPs are simply<br />

added to the CEM graph.<br />

To enable a prediction of genes <strong>in</strong> both strands simultaneously, appropriate additional edges<br />

must be <strong>in</strong>serted between the start and stop nodes of the CEPs. For example, a stop node<br />

obta<strong>in</strong>ed from an HSP with +/+ frame is connected to all downstream stop nodes that have<br />

frame −/−.<br />

7.34 Summary of CEM algorithm<br />

1. Determ<strong>in</strong>e a list of candidate exons for S, and T.<br />

2. For every HSP h, determ<strong>in</strong>e the range of possible exons and their possible splice sites.<br />

3. Sort HSPs lexicographically accord<strong>in</strong>g to their ranges.<br />

4. For each HSP h, build the correspond<strong>in</strong>g CEP graph.<br />

5. Compute the CEM graph by jo<strong>in</strong><strong>in</strong>g all CEP graphs.<br />

6. Compute the gene model scores.<br />

7. Determ<strong>in</strong>e the highest scor<strong>in</strong>g path through the CEM graph.<br />

8. Extract the correspond<strong>in</strong>g gene model.<br />

7.35 Performance of CEM<br />

Here is a comparison of the performance of CEM and Genscan on a test data set of 60 pairs<br />

of genes from human and mouse:<br />

Program   Number of sequences   Exon Sens.   Exon Spec.   Nucl. Sens.   Nucl. Spec.<br />
CEM       120                   0.76         0.80         0.94          0.95<br />
Genscan   120                   0.74         0.78         0.92          0.94<br />

The ga<strong>in</strong> <strong>in</strong> performance obta<strong>in</strong>ed is not spectacular. However, it provides a proof of concept<br />

and additional work may well lead to a useful tool for comparative gene f<strong>in</strong>d<strong>in</strong>g, especially for<br />

genomes for which little is known of the statistical properties of the conta<strong>in</strong>ed genes.<br />

7.36 Homology method: Procrustes<br />

Any newly sequenced gene has a good chance of having an already known relative, and progress<br />

<strong>in</strong> large-scale sequenc<strong>in</strong>g projects is rapidly <strong>in</strong>creas<strong>in</strong>g the number of known genes and prote<strong>in</strong><br />

sequences.<br />

Hence, homology-based gene prediction methods are becom<strong>in</strong>g more and more useful. In particular,<br />

such a method may be able to detect exons that are missed by statistical methods<br />

because they are small, or statistically unusual.<br />

Procrustes is a popular program that uses homology to predict genes and is based on the<br />

follow<strong>in</strong>g<br />

Idea: Given a genomic sequence G and a target protein T, determine a chain Γ<br />

of blocks in G that has the highest spliced-alignment score with the target T. These<br />

blocks are interpreted as exons and the chain Γ is the predicted gene structure.<br />

7.37 Example<br />

Given the genome G = baabaablacksheephaveyouanywool and the target prote<strong>in</strong> T =<br />

barbarasleepsonwool, f<strong>in</strong>d the best spliced alignment of T to G and thus obta<strong>in</strong> a gene<br />

prediction <strong>in</strong> G:<br />

Genome sequence:<br />

baa baa black sheep have you any wool<br />

Assume that these are the possible blocks:<br />

baa baa black sheep have you any wool<br />

baa baa black sheep have you any wool<br />

baa baa black sheep have you any wool<br />

Best spliced alignment:<br />

barbara sleeps on wool<br />

Result<strong>in</strong>g gene structure prediction:<br />

baa baa sheep any wool<br />

There are many possible cha<strong>in</strong><strong>in</strong>gs of blocks <strong>in</strong> the given example:



However, we choose the one that yields the best alignment to the given target sequence. In<br />

general, a number of possible target sequences will be given and then we choose the one that<br />

gives rise to the best alignment.<br />

7.38 Preprocess<strong>in</strong>g: determ<strong>in</strong><strong>in</strong>g the blocks<br />

Given a genomic sequence G. The first computational step is to determ<strong>in</strong>e the set B of all<br />

candidate blocks for G, which should contain all true exons. Naively, this is done by selecting<br />

all blocks between potential acceptor and donor sites, which are detected us<strong>in</strong>g e.g. a WMM:<br />

acacacAG aggtaAG taggagctcagttacactgcatcagcatg GTatcacttacgacacGTcacacgt<br />

block 1<br />

block 2<br />

block 3<br />

block 4<br />

Clearly, this set of blocks will conta<strong>in</strong> many false exons. Statistical methods may be used <strong>in</strong> an<br />

attempt to remove blocks that are obviously not true exons.<br />

Any cha<strong>in</strong> of blocks corresponds to a gene prediction and the number of such cha<strong>in</strong>s can be<br />

huge. Dynamic programm<strong>in</strong>g is used to obta<strong>in</strong> an algorithm that runs <strong>in</strong> polynomial time.<br />
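The naive block enumeration described above can be sketched as follows. This is only an illustration under simplifying assumptions: potential acceptor and donor sites are recognised by the plain dinucleotides AG and GT rather than by a WMM, and the function name `candidate_blocks` and the 0-based half-open coordinates are our own choices.<br />

```python
def candidate_blocks(genome, min_len=1):
    """Return all half-open intervals (start, end) that lie between a
    potential acceptor site (AG ending just before start) and a potential
    donor site (GT beginning at end).

    Assumption: sites are detected by exact dinucleotide matching only;
    a real implementation would score sites, e.g. with a WMM.
    """
    g = genome.upper()
    # A block may start right after an acceptor dinucleotide AG ...
    starts = [i + 2 for i in range(len(g) - 1) if g[i:i + 2] == "AG"]
    # ... and end right before a donor dinucleotide GT.
    ends = [i for i in range(len(g) - 1) if g[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]


if __name__ == "__main__":
    for start, end in candidate_blocks("ccAGaaaGTcc"):
        print(start, end)
```

Every acceptor is paired with every downstream donor, so the number of candidate blocks can grow quadratically in the number of sites, which is why a filtering step is usually applied afterwards.<br />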

7.39 The spliced alignment problem<br />

Let $G = g_1 \dots g_n$ be a string of letters, and let $B = g_i \dots g_j$ and $B' = g_{i'} \dots g_{j'}$ be substrings of<br />

$G$. We write $B \prec B'$ if $B$ ends before $B'$ starts, i.e. $j < i'$. A sequence $\Gamma = (B_1, \dots, B_b)$ of<br />

substrings of $G$ is a chain if $B_1 \prec B_2 \prec \dots \prec B_b$. We denote the concatenation of the strings<br />

in $\Gamma$ by $\Gamma^* = B_1 * B_2 * \dots * B_b$.<br />

For two str<strong>in</strong>gs G and T, we set s(G, T) to the score of an optimal alignment between G and<br />

T.<br />

Spliced Alignment Problem (SAP) Let $G = g_1 \dots g_n$ be a genomic sequence,<br />

$T = t_1 \dots t_m$ a target sequence and $B = \{B_1, \dots, B_b\}$ a set of blocks in $G$. Given $G$,<br />

$T$ and $B$, the Spliced Alignment Problem is to find a chain $\Gamma$ of strings from $B$ such<br />

that the score $s(\Gamma^*, T)$ is maximum among all chains of blocks from $B$.<br />
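To make the problem statement concrete, here is a brute-force reference sketch that enumerates every chain of blocks and scores its concatenation against the target. It is exponential in the number of blocks and only meant for tiny examples. The scoring values (match +1, mismatch −1, indel −1), the use of global alignment for s, the 0-based inclusive block coordinates and all function names are assumptions made for illustration.<br />

```python
from itertools import combinations


def align_score(a, b, match=1, mismatch=-1, indel=-1):
    """Score of an optimal global alignment of a and b (linear gap penalty)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + d, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def is_chain(blocks):
    """Blocks are (first, last), inclusive; B precedes B' iff last < first'."""
    return all(b[1] < c[0] for b, c in zip(blocks, blocks[1:]))


def brute_force_sap(genome, target, blocks):
    """Return (best score, best chain) over all non-empty chains of blocks."""
    best = None
    ordered = sorted(blocks)  # a chain must be increasing in its start position
    for r in range(1, len(ordered) + 1):
        for subset in combinations(ordered, r):
            if not is_chain(subset):
                continue
            spliced = "".join(genome[f:l + 1] for f, l in subset)
            score = align_score(spliced, target)
            if best is None or score > best[0]:
                best = (score, list(subset))
    return best


if __name__ == "__main__":
    print(brute_force_sap("xxABCxxDEFxx", "ABCDEF", [(2, 4), (7, 9), (0, 1)]))
```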

7.40 Solv<strong>in</strong>g the spliced alignment problem<br />

The SAP can be reduced to the search of a path <strong>in</strong> some (unweighted) graph. Vertices of this<br />

graph correspond to the blocks, arcs correspond to potential transitions between blocks, and<br />

the path weight is def<strong>in</strong>ed as the weight of the optimal alignment between the concatenated<br />

blocks of this path and the target sequence.



For simplicity, we will consider sequence alignment with linear gap penalties and define the<br />

$\Delta_{\mathrm{match}}$, $\Delta_{\mathrm{mismatch}}$ and $\Delta_{\mathrm{indel}}$ scores as usual.<br />

We set<br />

$$\Delta(x, y) = \begin{cases} \Delta_{\mathrm{match}} & \text{if } x = y, \text{ and} \\ \Delta_{\mathrm{mismatch}} & \text{else.} \end{cases}$$<br />

7.41 The score of a prefix alignment<br />

For a block $B_k = g_m \dots g_l$ in $G$, define $\mathrm{first}(k) = m$, $\mathrm{last}(k) = l$ and $\mathrm{size}(k) = l - m + 1$. Let<br />

$B_k(i)$ denote the $i$-prefix $g_m \dots g_i$ of $B_k$, if $m \le i \le l$.<br />

Given a position $i$, let $\Gamma = (B_1, \dots, B_k, \dots, B_t)$ be a chain such that some block $B_k$ contains<br />

$i$. We define<br />

$$\Gamma^*(i) = B_1 * B_2 * \dots * B_k(i)$$<br />

as the concatenation of $B_1 \dots B_{k-1}$ and the $i$-prefix of $B_k$.<br />

Then<br />

$$S(i, j, k) = \max_{\text{all chains } \Gamma \text{ containing block } B_k} s(\Gamma^*(i), T(j))$$<br />

is the optimal score for aligning a chain of blocks up to position $i$ in $G$ to the $j$-prefix $T(j)$ of $T$. As<br />

we will see, the values of this matrix are computed using dynamic programming.<br />

7.42 The dynamic program<br />

Let $B(i) = \{k \mid \mathrm{last}(k) < i\}$ be the set of all blocks that end (strictly) before position $i$ in $G$.<br />

The following recurrence computes $S(i, j, k)$ for $1 \le i \le n$, $1 \le j \le m$ and $1 \le k \le b$:<br />

$$S(i, j, k) = \max \begin{cases} S(i-1, j-1, k) + \Delta(g_i, t_j), & \text{if } i \ne \mathrm{first}(k) \\ S(i-1, j, k) + \Delta_{\mathrm{indel}}, & \text{if } i \ne \mathrm{first}(k) \\ \max_{l \in B(\mathrm{first}(k))} S(\mathrm{last}(l), j-1, l) + \Delta(g_i, t_j), & \text{if } i = \mathrm{first}(k) \\ \max_{l \in B(\mathrm{first}(k))} S(\mathrm{last}(l), j, l) + \Delta_{\mathrm{indel}}, & \text{if } i = \mathrm{first}(k) \\ S(i, j-1, k) + \Delta_{\mathrm{indel}}. \end{cases}$$<br />

The score of the optimal spliced alignment can be found as:<br />

$$\max_k S(\mathrm{last}(k), m, k).$$<br />

Note that S(i, j, k) is only def<strong>in</strong>ed if i ∈ B k and therefore only a portion of entries <strong>in</strong> the<br />

three-dimensional n × m × b matrix S needs to be computed.<br />

The total number of such entries is:<br />

$$m \sum_{k=1}^{b} \mathrm{size}(k) = nmc,$$<br />

where $c = \frac{1}{n} \sum_{k=1}^{b} \mathrm{size}(k)$ is the coverage of the genomic sequence by blocks.<br />

Hence, a naive implementation of the recurrence runs in $O(mnc + mb^2)$ time.<br />

(Recall that n = |G|, m = |T | and b is the number of blocks.)<br />
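A minimal sketch of the naive dynamic program follows. It only keeps the last row S(last(k), ·, k) of each block, since this is all that later blocks need. Several details are assumptions on our part: blocks are given as 0-based inclusive pairs (first, last), the default scores are +1/−1/−1, and the boundary case of a block with no predecessor is initialised with the row j · Δ_indel (aligning an empty chain prefix to T(j)), which the recurrence above leaves implicit.<br />

```python
def spliced_alignment_score(genome, target, blocks,
                            match=1, mismatch=-1, indel=-1):
    """Optimal spliced alignment score, naive version.

    For every block all earlier-ending blocks are rescanned, which gives
    the O(m * b^2) term of the running time. Requires at least one block.
    """
    m = len(target)
    # Process blocks by increasing end position so that every possible
    # predecessor of a block has already been finished.
    order = sorted(range(len(blocks)), key=lambda k: blocks[k][1])
    end_row = {}  # k -> [S(last(k), j, k) for j = 0..m]
    for k in order:
        first, last = blocks[k]
        # Row "just before" the block: empty chain prefix or best predecessor.
        prev = [j * indel for j in range(m + 1)]
        for l, row in end_row.items():
            if blocks[l][1] < first:  # l is in B(first(k))
                prev = [max(p, r) for p, r in zip(prev, row)]
        for i in range(first, last + 1):
            cur = [prev[0] + indel]
            for j in range(1, m + 1):
                d = match if genome[i] == target[j - 1] else mismatch
                cur.append(max(prev[j - 1] + d,      # (mis)match
                               prev[j] + indel,      # gap in target
                               cur[j - 1] + indel))  # gap in genome
            prev = cur
        end_row[k] = prev
    return max(row[m] for row in end_row.values())


if __name__ == "__main__":
    print(spliced_alignment_score("xxABCxxDEFxx", "ABCDEF",
                                  [(2, 4), (7, 9), (0, 1)]))
```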

7.43 Example<br />

Consi<strong>der</strong> the follow<strong>in</strong>g str<strong>in</strong>g G with all possible blocks <strong>in</strong>dicated by boxes:<br />

The recurrence corresponds to the follow<strong>in</strong>g graph:<br />

The target sequence is:<br />

’T WAS BRILLIG, AND THE SLITHE TOVES DID GYRE AND GIMBLE IN THE WABE<br />

The four highlighted cha<strong>in</strong>s <strong>in</strong> the above graph correspond to the follow<strong>in</strong>g spliced alignments<br />

of G and T:<br />

7.44 Speed up<br />

The time and space requirements of the algorithm can be reduced significantly. Here we only<br />

discuss one such improvement.<br />

Define $P(i, j) = \max_{l \in B(i)} S(\mathrm{last}(l), j, l)$. The recurrence can be rewritten as follows:<br />

$$S(i, j, k) = \max \begin{cases} S(i-1, j-1, k) + \Delta(g_i, t_j), & \text{if } i \ne \mathrm{first}(k) \\ S(i-1, j, k) + \Delta_{\mathrm{indel}}, & \text{if } i \ne \mathrm{first}(k) \\ P(\mathrm{first}(k), j-1) + \Delta(g_i, t_j), & \text{if } i = \mathrm{first}(k) \\ P(\mathrm{first}(k), j) + \Delta_{\mathrm{indel}}, & \text{if } i = \mathrm{first}(k) \\ S(i, j-1, k) + \Delta_{\mathrm{indel}}, \end{cases}$$<br />

where<br />

$$P(i, j) = \max \begin{cases} P(i-1, j) \\ \max_{k:\, \mathrm{last}(k) = i-1} S(i-1, j, k). \end{cases}$$<br />

With this modification, we ma<strong>in</strong>ta<strong>in</strong> and update the maximal score for all preced<strong>in</strong>g blocks<br />

explicitly and thus do not reconsi<strong>der</strong> all preced<strong>in</strong>g blocks <strong>in</strong> each evaluation of the recurrence.



This reduces the run time of the algorithm to O(mnc + mb).<br />
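The improved recurrence can be sketched as a single sweep over the genome positions that maintains the row P(i, ·) explicitly. As before, this is an illustration under assumptions: 0-based inclusive block coordinates, default scores +1/−1/−1, and the empty chain prefix (row j · Δ_indel) as the initial value of P. The function name is hypothetical.<br />

```python
def spliced_alignment_score_fast(genome, target, blocks,
                                 match=1, mismatch=-1, indel=-1):
    """Optimal spliced alignment score using the running maximum P(i, .).

    Each finished block updates P once (O(m) per block) instead of being
    rescanned by every later block.
    """
    n, m = len(genome), len(target)
    starts = {}  # position -> indices of the blocks that start there
    for k, (first, _) in enumerate(blocks):
        starts.setdefault(first, []).append(k)

    # P[j]: best score over the empty chain and all blocks ending before i.
    P = [j * indel for j in range(m + 1)]
    active = {}  # k -> row S(i - 1, ., k) for blocks that cover position i
    best = None
    for i in range(n):
        for k in starts.get(i, []):
            active[k] = P  # new blocks continue from P(i, .); P is never mutated
        ended = []
        for k in list(active):
            prev = active[k]
            cur = [prev[0] + indel]
            for j in range(1, m + 1):
                d = match if genome[i] == target[j - 1] else mismatch
                cur.append(max(prev[j - 1] + d, prev[j] + indel,
                               cur[j - 1] + indel))
            active[k] = cur
            if blocks[k][1] == i:
                ended.append(k)
        # Blocks ending at i become possible predecessors from i + 1 on.
        for k in ended:
            row = active.pop(k)
            P = [max(p, r) for p, r in zip(P, row)]
            best = row[m] if best is None else max(best, row[m])
    return best


if __name__ == "__main__":
    print(spliced_alignment_score_fast("ACGT", "AT", [(0, 0), (3, 3), (0, 3)]))
```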

The correspond<strong>in</strong>g network that <strong>in</strong>dicates which computations are performed looks like this:<br />

7.45 Evaluation of the method<br />

The authors of Procrustes evaluated the performance of the program on a test sample of<br />

human genes with known mammalian relatives. In their study, the average correlation between<br />

the predicted and actual prote<strong>in</strong>s was 99%. The algorithm correctly reconstructed 87% of the<br />

genes.<br />

They also reported that the algorithm predicts human genes reasonably well when the homologous<br />

prote<strong>in</strong> is non-vertebrate or even prokaryotic.<br />

Additionally, predictions were made us<strong>in</strong>g simulated targets that gradually diverged from the<br />

analyzed gene. For targets up to 100 PAM distance, the predictions were almost 100% correct.<br />

(This distance roughly corresponds to 40% similarity).<br />

This <strong>in</strong>dicates that for an average prote<strong>in</strong> family the method is likely to correctly predict a<br />

human gene given a mammalian relative.<br />

8.2 BLAST and BLAT<br />

The popular BLAST program (or family of programs) was first introduced in: S.F. Altschul, W.<br />

Gish, W. Miller, E.W. Myers and D.J. Lipman. Basic local alignment search tool, J. Molecular<br />

Biology, 215:403-410 (1990).<br />

Recall (from ABI I) that BLAST is an alignment heuristic that computes local alignments<br />

between a query and a database sequence, us<strong>in</strong>g a seed-and-extend approach:<br />

Given three parameters, i.e. a word size K, a word similarity threshold T and a m<strong>in</strong>imum<br />

match score S, BLAST operates <strong>in</strong> three steps:<br />

1. The list of all words of length K that have similarity ≥ T to some word <strong>in</strong> the query<br />

sequence is generated.<br />

2. The database sequence is scanned for all hits of words <strong>in</strong> the list.<br />

3. Each hit is extended until its score falls a certa<strong>in</strong> distance below the best score found for<br />

shorter extensions and then all best extensions are reported that have score ≥ S.<br />
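Step 3 can be illustrated by a small ungapped extension routine. This is a sketch of the general idea only, not BLAST's actual code: the function name, the parameter `x_drop` (the "certain distance" below the best score at which extension stops) and the match/mismatch scores of +1/−1 are assumptions.<br />

```python
def extend_hit(query, db, q_pos, d_pos, k, x_drop=3, match=1, mismatch=-1):
    """Extend an exact k-mer hit at (q_pos, d_pos) in both directions
    without gaps. Extension in one direction stops once the running score
    has dropped more than x_drop below the best score seen so far.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def one_direction(q, d, step):
        best = score = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += match if query[q] == db[d] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            q += step
            d += step
        return best, best_len

    right, r_len = one_direction(q_pos + k, d_pos + k, 1)
    left, l_len = one_direction(q_pos - 1, d_pos - 1, -1)
    return k * match + left + right, q_pos - l_len, q_pos + k + r_len


if __name__ == "__main__":
    print(extend_hit("ACGTACGT", "TTACGTACGTTT", 2, 4, 4))
```

A caller would keep only those extensions whose score is at least the minimum match score S.<br />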

8.3 BLAT- BLAST-Like Alignment Tool<br />

The follow<strong>in</strong>g is based on: W. James Kent. BLAT- the BLAST-like alignment tool, Genome<br />

Research 12, 2002.



Analyz<strong>in</strong>g vertebrate genomes requires rapid mRNA/DNA and cross-species prote<strong>in</strong> alignments.<br />

As the amount of data <strong>in</strong>creases, faster tools are required for compar<strong>in</strong>g sequences.<br />

A new tool, BLAT, is more accurate and 500 times faster than popular exist<strong>in</strong>g tools for<br />

mRNA/DNA alignments and 50 times faster for prote<strong>in</strong> alignments at sensitivity sett<strong>in</strong>gs typically<br />

used when compar<strong>in</strong>g vertebrate sequences.<br />

BLAT’s speed stems from an <strong>in</strong>dex of all non-overlapp<strong>in</strong>g K-mers <strong>in</strong> the genome. The program<br />

has several stages: It uses the <strong>in</strong>dex to f<strong>in</strong>d regions <strong>in</strong> the genome that are possibly homologous<br />

to the query sequence. It performs an alignment between such regions. It stitches together the<br />

aligned regions (often exons) <strong>in</strong>to larger alignments (typically genes). F<strong>in</strong>ally, BLAT revisits<br />

small <strong>in</strong>ternal exons and adjusts large gap boundaries that have canonical splice sites where<br />

feasible.<br />

8.4 Mapp<strong>in</strong>g ESTs and Mouse reads<br />

In the public assembly of the human genome, the problem arises of mapping 3 million ESTs to the<br />

human genome. Additionally, 13 million (and cont<strong>in</strong>ually more) whole genome shotgun reads<br />

need to be aligned to the human genome.<br />

The human EST alignments compared 1.75 Gb <strong>in</strong> 3.72 million ESTs aga<strong>in</strong>st 2.88 Gb bases of<br />

Human DNA and took 200 hours on a farm of 90 L<strong>in</strong>ux boxes.<br />

BLAT was used to align 2.5× coverage unassembled mouse reads to the masked human genome.<br />

This involved 7.51 Gb in 13.3 million reads and took 16,300 CPU hours.<br />

As work cont<strong>in</strong>ues to f<strong>in</strong>ish the human genome, these computations need to be repeated on a<br />

monthly or bi-monthly basis. Hence, comparison tools are needed that do the job <strong>in</strong> a couple<br />

of weeks.<br />

This was the motivation for the development of BLAT.<br />

8.5 BLAT vs BLAST<br />

BLAT is similar to BLAST: The program rapidly scans for relatively short matches (hits) and<br />

extends these <strong>in</strong>to HSPs. However BLAT differs from BLAST <strong>in</strong> some important ways:<br />

• BLAST builds an <strong>in</strong>dex of the query str<strong>in</strong>g and then scans l<strong>in</strong>early through the database<br />

– BLAT builds an <strong>in</strong>dex of the database and then scans l<strong>in</strong>early through the query,<br />

• BLAST triggers an extension when one or two hits occur – BLAT can trigger extensions<br />

on any given number of perfect or near perfect matches,<br />

• BLAST returns each area of homology as separate alignments – BLAT stitches them<br />

together <strong>in</strong>to larger alignments,<br />

• BLAST delivers a list of exons sorted by size, with alignments extend<strong>in</strong>g slightly beyond<br />

the edge of each exon – BLAT “unsplices” mRNA onto the genome, giv<strong>in</strong>g a s<strong>in</strong>gle<br />

alignment that uses each base of the mRNA only once, with correctly positioned splice<br />

sites.



8.6 Seed-and-extend<br />

Like all fast alignment programs, BLAT uses the two stage seed-and-extend approach:<br />

• <strong>in</strong> the seed stage, the program detects regions of the two sequences that are likely to be<br />

homologous, and<br />

• in the extend stage, these regions are examined in detail and alignments are produced for<br />

the regions that are <strong>in</strong>deed homologous accord<strong>in</strong>g to some criterion.<br />

BLAT provides three different methods for the seed stage:<br />

• S<strong>in</strong>gle perfect K-mer matches,<br />

• Multiple perfect K-mer matches, and<br />

• S<strong>in</strong>gle near-perfect K-mer matches.<br />

Given a long database sequence and a short query sequence, we will discuss the different seed<br />

strategies.<br />

The simplest seed method is to look for subsequences of a given size K that are shared by the<br />

query and the database. In many applications, every K-mer <strong>in</strong> the query sequence is compared<br />

with all non-overlapp<strong>in</strong>g K-mers <strong>in</strong> the database sequence.<br />
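The single perfect match seed can be sketched as follows: the database is indexed once by its non-overlapping K-mers, and every (overlapping) K-mer of the query is then looked up. The function names and the representation of a hit as a pair (query position, database position) are illustrative assumptions; filtering of over-represented K-mers is omitted.<br />

```python
from collections import defaultdict


def build_index(db, k):
    """Map every non-overlapping k-mer of db to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, k):  # step k: non-overlapping
        index[db[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Look up every overlapping k-mer of the query in the index."""
    hits = []
    for q_pos in range(len(query) - k + 1):   # step 1: overlapping
        for d_pos in index.get(query[q_pos:q_pos + k], ()):
            hits.append((q_pos, d_pos))
    return hits


if __name__ == "__main__":
    idx = build_index("ACGTTGCAACGT", 4)
    print(find_hits("GACGTT", idx, 4))
```

Indexing only every K-th position keeps the index roughly K times smaller than an index of all overlapping K-mers, at the price of missing some seeds, which is what the analysis below quantifies.<br />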

We want to analyze:<br />

1. how many homologous regions are missed, and<br />

2. how many non-homologous regions are passed to the extension stage, using this criterion.<br />

Errors of type (1) will cause the application to miss true homologs, whereas errors of type (2)<br />

will <strong>in</strong>crease the runn<strong>in</strong>g time of the application.<br />

8.7 Some def<strong>in</strong>itions<br />

K: The K-mer size, 8–16 for nucleotides and 3–7 for amino acids.<br />

M: Match ratio between homologous areas, ≈ 98% for cDNA/genomic alignments within the same species, ≈ 89% for protein alignments between human and mouse.<br />

H: The size of a homologous area. For a human exon this is typically 50–200 bp.<br />

G: Database size, e.g. 3 Gb for human.<br />

Q: Query size.<br />

A: Alphabet size, 20 for amino acids, 4 for nucleotides.<br />

[Figure: dot plot of the query sequence (e.g. cDNA) against the database sequence (e.g. genome), with matches marked.]<br />



8.8 S<strong>in</strong>gle perfect matches<br />

Assum<strong>in</strong>g that each letter is <strong>in</strong>dependent of the previous letter, the probability that a specific<br />

K-mer <strong>in</strong> a homologous region of the database matches perfectly the correspond<strong>in</strong>g K-mer <strong>in</strong><br />

the query is:<br />

$$p_1 = M^K.$$<br />

Let $T = \lfloor H/K \rfloor$ denote the number of non-overlapping K-mers in a homologous region of length $H$.<br />

The probability that at least one non-overlapping K-mer in the homologous region matches<br />

perfectly with the corresponding K-mer in the query is:<br />

$$P = 1 - (1 - p_1)^T = 1 - (1 - M^K)^T.$$<br />

The number of non-overlapping K-mers that are expected to match by chance, assuming all<br />

letters are equally likely, is:<br />

$$F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K.$$<br />
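The two quantities are easy to evaluate numerically. The following sketch simply transcribes the formulas for P and F; the function name and the parameter values in the usage example (H = 100, Q = 500) are illustrative assumptions.<br />

```python
def single_perfect(K, M, H, Q, G, A):
    """Sensitivity P and expected number of chance hits F for the
    single perfect K-mer match criterion."""
    p1 = M ** K                      # one specific K-mer matches perfectly
    T = H // K                       # non-overlapping K-mers in the region
    P = 1 - (1 - p1) ** T            # at least one of them matches
    F = (Q - K + 1) * (G / K) * (1 / A) ** K
    return P, F


if __name__ == "__main__":
    # Nucleotide search against a 3 Gb database (illustrative values).
    for K in (10, 12, 14, 16):
        P, F = single_perfect(K, M=0.95, H=100, Q=500, G=3e9, A=4)
        print(K, round(P, 3), round(F, 1))
```

Increasing K lowers both P and F, which is the sensitivity versus running time trade-off discussed in the examples.<br />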

These formulas can be used to predict the sensitivity and specificity of s<strong>in</strong>gle perfect nucleotide<br />

K-mer matches as a seed-search criterion:<br />

These formulas can be used to predict the sensitivity and specificity of s<strong>in</strong>gle perfect am<strong>in</strong>o<br />

acid K-mer matches as a seed-search criterion:<br />

Examples<br />

1. For EST alignments, we would like to f<strong>in</strong>d seeds for 99% of all homologous regions that<br />

have 5% or less sequenc<strong>in</strong>g noise. Table 3 <strong>in</strong>dicates that K = 14 or less will work. For



K = 14, we can expect that 399 random hits per query will be produced. A smaller value<br />

of K will produce significantly more random hits.<br />

2. The mouse and human genomes average 89% identity at the am<strong>in</strong>o acid level. To f<strong>in</strong>d<br />

true seeds for 99% of all translated mouse reads requires K = 5 or less. For K = 5, each<br />

read will generate ≈ 62625 random hits, see Table 4.<br />

3. Compar<strong>in</strong>g mouse and human at a nucleotide level, where there is only 86% identity is<br />

not feasible: Table 3 implies that K = 7 must be used to f<strong>in</strong>d 99% of all true hits, but<br />

this value generates ≈ 13 million random hits per query.<br />

8.9 S<strong>in</strong>gle near-perfect matches<br />

Now consi<strong>der</strong> the case of near-perfect matches, that is, hits with one letter mismatch. The<br />

probability that a non-overlapp<strong>in</strong>g K-mer <strong>in</strong> a homologous region of the database matches<br />

near-perfectly the correspond<strong>in</strong>g K-mer <strong>in</strong> the query is:<br />

$$p_1 = K \cdot M^{K-1} \cdot (1 - M) + M^K.$$<br />

Again, the probability that any non-overlapping K-mer in the homologous region matches<br />

near-perfectly with the corresponding K-mer in the query is:<br />

$$P = 1 - (1 - p_1)^T.$$<br />

The number of K-mers which match near-perfectly by chance is:<br />

$$F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left( K \cdot \left(\frac{1}{A}\right)^{K-1} \cdot \left(1 - \frac{1}{A}\right) + \left(\frac{1}{A}\right)^K \right).$$<br />
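As for perfect matches, the near-perfect formulas can be transcribed directly. The function name and the values in the usage example are illustrative assumptions.<br />

```python
def single_near_perfect(K, M, H, Q, G, A):
    """Sensitivity P and expected number of chance hits F when a K-mer
    may contain at most one mismatch."""
    # Exactly one mismatch at any of K positions, or no mismatch at all.
    p1 = K * M ** (K - 1) * (1 - M) + M ** K
    T = H // K
    P = 1 - (1 - p1) ** T
    a = 1 / A
    F = (Q - K + 1) * (G / K) * (K * a ** (K - 1) * (1 - a) + a ** K)
    return P, F


if __name__ == "__main__":
    P, F = single_near_perfect(K=14, M=0.86, H=100, Q=500, G=3e9, A=4)
    print(round(P, 3), round(F, 1))
```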

These formulas can be used to predict the sensitivity and specificity of s<strong>in</strong>gle near-perfect<br />

nucleotide K-mer matches as a seed-search criterion:<br />

These formulas can be used to predict the sensitivity and specificity of s<strong>in</strong>gle near-perfect am<strong>in</strong>o<br />

acid K-mer matches as a seed-search criterion:



Examples<br />

1. For the purposes of EST alignments, a K of 22 or less produces true seeds for 99% of all<br />

queries, while on average produc<strong>in</strong>g only one random hit, see Table 5.<br />

2. For comparison of translated mouse reads and the human genome, Table 6 <strong>in</strong>dicates that<br />

K = 8 would detect true seeds for 99% of all mouse reads, while only generat<strong>in</strong>g 374<br />

random hits per read.<br />

3. A comparison of mouse reads and the human genome (86% identity) on the nucleotide level<br />

would require K = 13 or K = 12 to detect true seeds for 99% of the reads, while generat<strong>in</strong>g<br />

275671 random hits per read. Us<strong>in</strong>g a fast extension program, this computation is feasible.<br />

BLAT implements near-perfect matches allow<strong>in</strong>g one mismatch <strong>in</strong> a hit, as follows:<br />

A non-overlapp<strong>in</strong>g <strong>in</strong>dex of all K-mers <strong>in</strong> the database is generated.<br />

For every K-mer in the query sequence, every possible K-mer that matches it in all positions, or in all but one,<br />

is looked up. This requires K · (A − 1) + 1 lookups per K-mer. For an amino-acid search with K = 8,<br />

for example, 153 lookups are required per occurring K-mer.<br />
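The lookup count can be checked by enumerating the neighbourhood of a K-mer explicitly. The generator below is a sketch with a hypothetical name; it yields the K-mer itself and every string that differs from it in exactly one position.<br />

```python
def near_perfect_neighbours(kmer, alphabet):
    """Yield kmer and all strings over alphabet at Hamming distance 1."""
    yield kmer
    for i, c in enumerate(kmer):
        for a in alphabet:
            if a != c:
                yield kmer[:i] + a + kmer[i + 1:]


if __name__ == "__main__":
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    neighbours = list(near_perfect_neighbours("ACDEFGHI", amino_acids))
    print(len(neighbours))  # K * (A - 1) + 1 with K = 8, A = 20
```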

For a given level of sensitivity, the near-perfect match criterion runs 15× more slowly than the<br />

multiple-perfect match criterion and thus is not so useful <strong>in</strong> practice.<br />

8.10 Multiple perfect matches<br />

An alternative seed<strong>in</strong>g strategy is to require multiple perfect matches that are constra<strong>in</strong>ed to<br />

be near each other.<br />

For example, consi<strong>der</strong> a situation where there are two hits between the query and the database<br />

sequences that “lie on the same diagonal” and are close to each other (with<strong>in</strong> some given<br />

distance W), such as a and b here:<br />

[Figure: dot plot of the query sequence (e.g. cDNA) against the database sequence (e.g. genome) showing four hits a, b, c and d, with the K-mer size K and the distance W indicated.]<br />

For N = 1, the probability that a non-overlapp<strong>in</strong>g K-mer <strong>in</strong> a homologous region of the<br />

database matches perfectly the correspond<strong>in</strong>g K-mer <strong>in</strong> the query is (as discussed above):<br />

$$p_1 = M^K.$$<br />

The probability that there are exactly $n$ matches within the homologous region is<br />

$$P_n = p_1^n \cdot (1 - p_1)^{T-n} \cdot \frac{T!}{n! \cdot (T-n)!},$$<br />

and the probability that there are $N$ or more matches is the sum:<br />

$$P = P_N + P_{N+1} + \dots + P_T.$$<br />

Again, we are interested in the number of matches generated by chance. The number of<br />

such chains expected for N = 1 is simply:<br />

$$F_1 = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K.$$<br />

The probability of a second match occurring within W letters after the first is<br />

$$S = 1 - \left(1 - \left(\frac{1}{A}\right)^K\right)^{W/K},$$<br />

because the second match can occur within any of the $W/K$ non-overlapping K-mers in the<br />

database within W letters after the first match.<br />

The number of size N chains of K-mers in which any two consecutive hits are not more than<br />

W apart is<br />

$$F_N = F_1 \cdot S^{N-1}.$$<br />
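A short sketch that evaluates these formulas is given below. The function name and the parameter values in the usage example are illustrative assumptions; `math.comb` requires Python 3.8 or later.<br />

```python
from math import comb


def multiple_perfect(K, M, H, Q, G, A, N, W):
    """Sensitivity P and expected number of chance chains F_N when N
    perfect K-mer matches within distance W of each other are required."""
    p1 = M ** K
    T = H // K
    # Binomial tail: N or more of the T non-overlapping K-mers match.
    P = sum(comb(T, n) * p1 ** n * (1 - p1) ** (T - n)
            for n in range(N, T + 1))
    F1 = (Q - K + 1) * (G / K) * (1 / A) ** K
    S = 1 - (1 - (1 / A) ** K) ** (W / K)
    return P, F1 * S ** (N - 1)


if __name__ == "__main__":
    P, FN = multiple_perfect(K=11, M=0.95, H=100, Q=500, G=3e9, A=4,
                             N=2, W=100)
    print(round(P, 3), FN)
```

For N = 1 the function reduces to the single perfect match case, since the binomial tail then equals 1 − (1 − p_1)^T and F_N = F_1.<br />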

These formulas can be used to predict the sensitivity and specificity of multiple nucleotide (2<br />

and 3) perfect K-mer matches as a seed-search criterion:<br />

These formulas can be used to predict the sensitivity and specificity of multiple am<strong>in</strong>o acid (2<br />

and 3) perfect K-mer matches as a seed-search criterion:



8.11 Clump<strong>in</strong>g hits<br />

BLAT builds a non-overlapp<strong>in</strong>g <strong>in</strong>dex of all K-mers <strong>in</strong> the database, ignor<strong>in</strong>g those K-mers<br />

that occur too often <strong>in</strong> the database, those conta<strong>in</strong><strong>in</strong>g ambiguity codes and optionally, those <strong>in</strong><br />

lower case (“soft screened regions”).<br />

BLAT then looks up each overlapp<strong>in</strong>g K-mer of the query sequence <strong>in</strong> the <strong>in</strong>dex, obta<strong>in</strong><strong>in</strong>g a<br />

list L of hits. Each hit consists of a database position and a query position.<br />

The next step is to form clumps of hits that represent regions <strong>in</strong> the database sequence that are<br />

homologous to the query sequence. Each such clump consists of a number of hits (that exceeds<br />

a given m<strong>in</strong>imum number of hits) that form a cha<strong>in</strong> <strong>in</strong> which two consecutive hits are not too<br />

far apart from each other and also <strong>in</strong> which the gap size <strong>in</strong> either sequence does not exceed a<br />

given threshold.<br />

Multiple hits are clumped together as follows:<br />

• The hit list L is sorted by database coord<strong>in</strong>ate.<br />

• The list L is split <strong>in</strong>to buckets of size 64 kb each, based on the database coord<strong>in</strong>ate.<br />

• Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database<br />

position m<strong>in</strong>us query position.<br />

• Hits that are with<strong>in</strong> the gap limit are grouped together <strong>in</strong>to proto-clumps.<br />

• Hits with<strong>in</strong> proto-clumps are then sorted by their database coord<strong>in</strong>ate and put <strong>in</strong>to real<br />

clumps, if they are with<strong>in</strong> the w<strong>in</strong>dow limit on the database coord<strong>in</strong>ate.<br />

• Clumps with<strong>in</strong> 300 bp or 100 am<strong>in</strong>o acids of each other <strong>in</strong> the database are merged and<br />

then 500 bp are added to each end of a clump.<br />
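The core of the clumping procedure (sorting along the diagonal, forming proto-clumps, then splitting by database coordinate) can be sketched as follows. This is a simplified illustration and not BLAT's actual implementation: the 64 kb bucketing, the merging of nearby clumps and the padding of clump ends are omitted, and the names `clump_hits`, `gap_limit`, `window_limit` and `min_hits` are our own.<br />

```python
def clump_hits(hits, gap_limit, window_limit, min_hits):
    """Group hits (q_pos, d_pos) into clumps.

    Hits whose diagonals (d_pos - q_pos) differ by at most gap_limit from
    their neighbour form a proto-clump; inside a proto-clump, consecutive
    hits at most window_limit apart in the database form a clump.
    """
    def split(items, key, limit):
        groups, cur = [], []
        for h in sorted(items, key=key):
            if cur and key(h) - key(cur[-1]) > limit:
                groups.append(cur)
                cur = []
            cur.append(h)
        if cur:
            groups.append(cur)
        return groups

    clumps = []
    for proto in split(hits, lambda h: h[1] - h[0], gap_limit):
        for clump in split(proto, lambda h: h[1], window_limit):
            if len(clump) >= min_hits:
                clumps.append(clump)
    return clumps


if __name__ == "__main__":
    hits = [(0, 100), (10, 110), (20, 121), (5, 500), (30, 900)]
    print(clump_hits(hits, gap_limit=2, window_limit=50, min_hits=2))
```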

[Figure: three dot plots of the query sequence against the database sequence showing the same six hits, numbered 1 to 6: as an unordered list of hits, sorted by database coordinate, and sorted along the diagonal.]<br />

8.12 Nucleotide alignments<br />

Clump<strong>in</strong>g is the first part of the extension stage. In the case of nucleotide alignments, each<br />

clump is then processed as follows.<br />

• A hit list is generated between the query sequence q and the homologous region h <strong>in</strong> the database,<br />

look<strong>in</strong>g for smaller, perfect K-mers.<br />

• If a K-mer w <strong>in</strong> q matches multiple K-mers <strong>in</strong> h, then w is repeatedly extended by one until<br />

the match is unique or exceeds a certa<strong>in</strong> size.<br />

• The hits are extended as far as possible, without mismatches.<br />

• Overlapp<strong>in</strong>g hits are merged.<br />

• If there are gaps <strong>in</strong> the alignment <strong>in</strong> both the query and the database, then the algorithm<br />

recurses to fill <strong>in</strong> the gaps, us<strong>in</strong>g a smaller K.<br />

• Then extensions us<strong>in</strong>g <strong>in</strong>dels followed by matches are consi<strong>der</strong>ed.<br />

• Large gaps <strong>in</strong> the query sequence often correspond to <strong>in</strong>trons and they are slid around to f<strong>in</strong>d<br />

the best GT/AG consensus sequence for the <strong>in</strong>tron ends.<br />

8.13 Prote<strong>in</strong> alignments<br />

In the case of am<strong>in</strong>o acid sequences, each clump is processed as follows:<br />

• All hits obtained in the seed stage are extended into maximally scoring ungapped alignments<br />

(HSPs) using a score function where a match is worth 2 and a mismatch costs 1.<br />

• A graph is built with HSPs as nodes.<br />

• If HSP A starts before HSP B <strong>in</strong> both sequences, then an edge is put from A to B that is<br />

weighted by the score of B m<strong>in</strong>us a gap penalty based on the distances between A and B.<br />

• If A and B overlap, then an optimal crossover position x is determ<strong>in</strong>ed that maximizes the sum<br />

of score of A up to x and B start<strong>in</strong>g from x and the edge weight is set accord<strong>in</strong>gly.<br />

• A dynamic program then extracts the maximal scor<strong>in</strong>g alignment by travers<strong>in</strong>g the graph.<br />

• The HSPs conta<strong>in</strong>ed <strong>in</strong> the path are removed and if any HSPs are left then the dynamic program<br />

is run aga<strong>in</strong>.
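The graph construction and the dynamic program over it can be sketched as below. This is a simplified illustration rather than BLAT's code: an HSP is assumed to be a tuple (query start, database start, length, score), the gap penalty is assumed to be linear in the sum of the two gap lengths, overlapping HSPs are simply not connected (no crossover position is computed), and only one pass is made. The list of HSPs must be non-empty.<br />

```python
def best_hsp_chain(hsps, gap_cost=1):
    """Return (score, chain) of the highest scoring chain of HSPs.

    An edge A -> B exists if B starts after A ends in both sequences; its
    weight is score(B) minus gap_cost times the two gap lengths.
    """
    order = sorted(range(len(hsps)), key=lambda i: hsps[i][0])
    best, back = {}, {}
    for i in order:
        qs, ds, _, sc = hsps[i]
        best[i], back[i] = sc, None          # chain consisting of i alone
        for j in order:
            if j == i:
                break                        # only earlier HSPs can precede i
            pq, pd, pl, _ = hsps[j]
            gap_q, gap_d = qs - (pq + pl), ds - (pd + pl)
            if gap_q < 0 or gap_d < 0:
                continue                     # overlap or wrong order: skipped
            cand = best[j] + sc - gap_cost * (gap_q + gap_d)
            if cand > best[i]:
                best[i], back[i] = cand, j
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:                  # trace the path back
        chain.append(hsps[node])
        node = back[node]
    return best[end], chain[::-1]


if __name__ == "__main__":
    print(best_hsp_chain([(0, 0, 10, 20), (12, 13, 10, 18), (5, 50, 5, 6)]))
```

To mimic the last step of the list above, a caller would remove the HSPs of the returned chain and call the function again while HSPs remain.<br />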



8.14 Mouse/Human alignment choices

The similarity between the human and mouse genomes is 86% on the nucleotide level and 89% on the amino-acid level (for coding regions). The following table compares DNA vs amino acid alignments, and different seeding strategies:

9 Phylogenetic Networks

Real evolutionary data often contain a number of different and sometimes conflicting phylogenetic signals, and thus do not always clearly support a unique tree. To address this problem, Hans-Jürgen Bandelt and Andreas Dress developed the method of split decomposition.

For ideal data, this method gives rise to a tree, whereas less ideal data are represented by a tree-like network that may indicate evidence for different and conflicting phylogenies.

The following lectures are based on:

Hans-Jürgen Bandelt and Andreas W. M. Dress. A canonical decomposition theory for metrics on a finite set, Advances in Mathematics, 92(1):47-105 (1992).

Daniel H. Huson. SplitsTree: analyzing and visualizing evolutionary data, Bioinformatics, 14(1):68-73 (1998).

9.1 Trees vs networks

Here is (a) the unrooted neighbor-joining tree for 16S rRNA sequences (1355 bp) from ten species of Neisseria and (b) a splits graph computed from the same distance matrix:

[Figure: (a) neighbor-joining tree, (b) splits graph]

(See: Eddie C. Holmes. Genomics, phylogenetics and epidemiology, Microbiology Today, 26:162-163 (1999).)

9.2 Phylogenetic trees

Evolutionary relationships are usually represented by a phylogenetic tree $T$, i.e. a tree whose leaves (and perhaps some internal nodes, too) are all labeled by elements of a set $X$ of taxa and whose internal nodes all have degree at least three.

(If the tree is rooted, then the root node may have degree two, but we will only consider unrooted trees.) Often, the edges have weights corresponding to some notion of evolutionary distance between the taxa.

Example:

[Figure: unrooted phylogenetic tree with leaves t1, ..., t8]

9.3 Trees and splits

Any edge $e$ of $T$ defines a split $S = \{A, \bar{A}\}$ of $X$, that is, a partitioning of $X$ into two non-empty sets $A$ and $\bar{A}$, consisting of all taxa on the one side and the other side of $e$, respectively.

For example:

[Figure: the tree on the taxa t1, ..., t8 with one edge marked e]

Here, $A = \{t_3, t_4, t_5\}$ and $\bar{A} = \{t_1, t_2, t_6, t_7, t_8\}$.

Let $\Sigma(T)$ denote the set of all splits obtained from $T$.

Ideally, each edge of the tree separates a monophyletic group from the rest and this is reflected by the corresponding split.

9.4 Compatible splits

Given a set of taxa $X$, let $\Sigma$ be a set of splits of $X$. Two splits $S_1 = \{A_1, \bar{A}_1\}$ and $S_2 = \{A_2, \bar{A}_2\}$ are called compatible, if one of the four following intersections

$$A_1 \cap A_2, \quad A_1 \cap \bar{A}_2, \quad \bar{A}_1 \cap A_2, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2$$

is empty.

A set $\Sigma$ of splits of $X$ is called compatible, if every pair of splits in $\Sigma$ is compatible.

Example

Given the taxa set $X = \{a, b, c, d, e\}$. The splits $S_1 = \{\{a, b\}, \{c, d, e\}\}$, $S_2 = \{\{a, b, c\}, \{d, e\}\}$ and $S_3 = \{\{e\}, \{a, b, c, d\}\}$ are all compatible with each other. However, $S_4 = \{\{a, c\}, \{b, d, e\}\}$ is not compatible with $S_1$, because all four intersections of their sides are non-empty. Hence, the set $\Sigma = \{S_1, S_2, S_3\}$ is compatible, but $\Sigma' = \{S_1, S_2, S_3, S_4\}$ is not.
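The compatibility test can be sketched in a few lines of Python. This is only an illustration, assuming that a split is stored as a pair of frozensets; the function names `make_split`, `are_compatible` and `is_compatible_set` are hypothetical and not part of the lecture.

```python
from itertools import combinations

def make_split(side_a, side_b):
    """Represent a split as a pair of frozensets (its two sides)."""
    return (frozenset(side_a), frozenset(side_b))

def are_compatible(s1, s2):
    """Two splits are compatible if at least one of the four
    intersections of a side of s1 with a side of s2 is empty."""
    return any(not (p & q) for p in s1 for q in s2)

def is_compatible_set(splits):
    """A set of splits is compatible if every pair of its splits is."""
    return all(are_compatible(s, t) for s, t in combinations(splits, 2))

# The example from the text, on X = {a, b, c, d, e}.
S1 = make_split("ab", "cde")
S2 = make_split("abc", "de")
S3 = make_split("e", "abcd")
S4 = make_split("ac", "bde")

print(is_compatible_set([S1, S2, S3]))      # True
print(are_compatible(S1, S4))               # False
print(is_compatible_set([S1, S2, S3, S4]))  # False
```

Checking a whole set this way takes one pairwise test per pair of splits, which is sufficient for small examples such as this one.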

The compatibility condition states that any split $S$ subdivides either the one side, or the other side, of any other split $S'$, but not both sides. Hence, any set of compatible splits can be drawn as follows, without crossing lines:

[Figure: compatible splits of the taxa t1, ..., t8 drawn without crossing lines]

This figure also shows the relationship between compatible splits and a hierarchical clustering.

9.5 Compatible splits and trees

Any compatible set of splits $\Sigma$ gives rise to a phylogenetic tree $T$, for example:

[Figure: phylogenetic tree on the taxa t1, ..., t8 obtained from a compatible set of splits]

Note: For this always to work, we need to allow taxa to be positioned at internal nodes of the tree, not just at leaves!

Vice versa, any tree $T$ gives rise to a compatible set of splits. (Proof: Consider two edges $e$ and $e'$. Because $T$ is a tree, $e$ and $e'$ are connected by a unique path $P$:

$$v \overset{e}{\longleftrightarrow} w \longleftrightarrow P \longleftrightarrow w' \overset{e'}{\longleftrightarrow} v'$$

Because $T$ is a tree, the set $A$, consisting of all nodes reachable from node $v$ not using edge $e$, is disjoint from $A'$, the set of all nodes reachable from node $v'$ not using edge $e'$. Hence, the corresponding splits $S = \{A, \bar{A} = X \setminus A\}$ and $S' = \{A', \bar{A}' = X \setminus A'\}$ are compatible.)

In summary:

Theorem A set of splits $\Sigma$ is compatible, iff there exists a phylogenetic tree $T$ such that $\Sigma = \Sigma(T)$.

9.6 Representing distances using trees

Given a set $X$ of taxa. For our purposes, a phylogenetic tree $T$ is a tree such that:

• all leaves, and perhaps some of the internal nodes, too, are (multi-)labeled by elements of $X$, such that each taxon appears exactly once, and

• every edge $e$ has a weight $d_e$ associated with it.

Assume that we are given a set of taxa $X$ and a distance matrix $\{d_{ab}\}$ (i.e., a dissimilarity function or pseudo-metric) describing “evolutionary distances” between the different taxa, obtained in some way (as described in ABI I).

Any distance-based tree building method attempts to represent the given distance matrix $d$ as well as possible using a phylogenetic tree $T$, i.e. for any two taxa $a, b \in X$ we approximate $d_{ab} \approx \sum_{e \in P} d_e$, where $P$ is the unique path of edges in $T$ that connects the nodes with labels $a$ and $b$.

9.7 Main goal

Given a distance matrix $d$, a tree building method such as neighbor-joining will compute a phylogenetic tree $T$ for $d$, no matter how “untree-like” the distance matrix $d$ may be. (Recall from ABI I that the four-point condition determines whether a given distance matrix $d$ is additive or not, i.e. whether it has an exact representation by a phylogenetic tree, or not.)

Our goal is to use more general graphs to represent distances, so-called splits graphs. As we will see, the graph will be a tree, whenever the given distances are tree-like (i.e., additive, or close to additive).

To obtain this goal, we proceed indirectly by discussing sets of splits and introducing the notion of weak compatibility.

Just as a set of compatible splits can be represented by a phylogenetic tree, we will see that a weakly compatible set of splits can be represented by a splits graph.

9.8 Tree and splits

Here is an example of a phylogenetic tree $T$:

[Figure: phylogenetic tree on the taxa Gorilla, Pan_panisc, Homo_sap, rabbit, guinea_pig, Pongo_pygB, Bos_ta(cow), fin_whale, blue_whale, Mus_mouse, platypus, wallaroo, opossum and Rattus_norv, with two edges labeled e and f and a scale bar of 0.1]

Each edge in $T$ defines a split of the set of taxa $X$. For example, the edge labeled $e$ separates rat and mouse from all other taxa, and the edge $f$ separates cow, fin whale and blue whale from all others.



9.9 Tree distance and Σ-distance

Given a phylogenetic tree $T$ with edge weights. We define the tree distance between two taxa $a$ and $b$ as

$$d_T(a, b) := \sum_{e \in P} d_e,$$

where $P$ denotes the set of edges along the unique simple path from the node labeled $a$ to the node labeled $b$.

We set $d_S := d_e$, if $S$ is the split corresponding to $e$. We define the Σ-distance between two taxa $a$ and $b$ as

$$d_\Sigma(a, b) := \sum_{S \in \Sigma(a,b)} d_S,$$

where $\Sigma(a, b)$ is the set of all splits in $\Sigma$ that separate $a$ and $b$.

These definitions imply

$$d_T(a, b) = d_\Sigma(a, b)$$

for all taxa $a, b \in X$.
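The equality of the two distances can be checked numerically on a small tree. The following Python sketch uses a hypothetical tree with leaves a, b, c, d and internal nodes u, v; the tree, its edge weights and all function names are assumptions chosen for illustration only.

```python
from itertools import combinations

# Hypothetical weighted tree: leaves a, b, c, d and internal nodes u, v.
EDGES = {("a", "u"): 1.0, ("b", "u"): 1.0, ("u", "v"): 2.0,
         ("c", "v"): 1.0, ("d", "v"): 1.0}
TAXA = ["a", "b", "c", "d"]

def adjacency(edges):
    """Build an undirected adjacency map {node: {neighbor: weight}}."""
    adj = {}
    for (p, q), w in edges.items():
        adj.setdefault(p, {})[q] = w
        adj.setdefault(q, {})[p] = w
    return adj

def tree_distance(edges, a, b):
    """d_T(a, b): sum of the edge weights on the unique path from a to b."""
    adj = adjacency(edges)
    stack = [(a, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == b:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError("a and b are not connected")

def weighted_splits(edges, taxa):
    """For each edge e = (p, q), return (A, d_e), where A is the set of
    taxa still reachable from p once e has been removed."""
    adj = adjacency(edges)
    result = []
    for (p, q), w in edges.items():
        seen, stack = {p}, [p]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen and {node, nxt} != {p, q}:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append((frozenset(seen) & frozenset(taxa), w))
    return result

def sigma_distance(splits, a, b):
    """d_Sigma(a, b): total weight of all splits that separate a and b."""
    return sum(w for side, w in splits if (a in side) != (b in side))

SPLITS = weighted_splits(EDGES, TAXA)
for a, b in combinations(TAXA, 2):
    print(a, b, tree_distance(EDGES, a, b), sigma_distance(SPLITS, a, b))
```

Each printed line shows the same value twice, because a split separates a and b exactly when its edge lies on the path between them.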

9.10 Weak compatibility

Compatibility is a requirement defined on any two splits. A relaxed concept is that of weak compatibility, which is a condition placed on any three splits.

Let $S_1$, $S_2$ and $S_3$ be three splits of $X$. This triplet is called weakly compatible, if for every choice of $A_i \in S_i$ ($i = 1, 2, 3$), at least one of the four intersections

$$A_1 \cap A_2 \cap A_3, \quad A_1 \cap \bar{A}_2 \cap \bar{A}_3, \quad \bar{A}_1 \cap A_2 \cap \bar{A}_3, \quad \text{or} \quad \bar{A}_1 \cap \bar{A}_2 \cap A_3$$

is empty. This means that at least one shaded and one unshaded region of the following diagram must be empty:

[Figure: diagram of the three splits S1, S2 and S3 of X, with shaded and unshaded regions]

Note that if any pair of the three splits is compatible, then all three are weakly compatible: E.g., if for $S_1 = \{A_1, \bar{A}_1\}$ and $S_2 = \{A_2, \bar{A}_2\}$ we have $A_1 \cap A_2 = \emptyset$, then $A_1 \cap A_2 \cap A_3 = \emptyset$ and $A_1 \cap A_2 \cap \bar{A}_3 = \emptyset$.

On the other hand, it is possible that every pair of three weakly compatible splits is incompatible:

[Figure: splits graph on the taxa A, B, C, D, E, F, containing the parallel edges e and e']

Here, a split of $X = \{A, B, C, D, E, F\}$ is given by each pair of parallel edges, e.g. edges $e$ and $e'$ define the split $\{\{A, B, F\}, \{C, D, E\}\}$.
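The definition of weak compatibility translates directly into a small test. The Python sketch below assumes that a split is a pair of frozensets; the function name and the two example triplets are hypothetical and were chosen for illustration (the first triplet uses three splits on six taxa that are pairwise incompatible, the second uses the three two-versus-two splits of four taxa).

```python
from itertools import product

def is_weakly_compatible(s1, s2, s3):
    """For every choice of one side A_i from each split, at least one of
    A1&A2&A3, A1&B2&B3, B1&A2&B3, B1&B2&A3 must be empty, where B_i
    denotes the other side of split i."""
    for c1, c2, c3 in product((0, 1), repeat=3):
        a1, b1 = s1[c1], s1[1 - c1]
        a2, b2 = s2[c2], s2[1 - c2]
        a3, b3 = s3[c3], s3[1 - c3]
        regions = (a1 & a2 & a3, a1 & b2 & b3, b1 & a2 & b3, b1 & b2 & a3)
        if all(regions):  # all four regions are non-empty
            return False
    return True

def split(side_a, side_b):
    return (frozenset(side_a), frozenset(side_b))

# Three pairwise incompatible splits of {A, ..., F} that are weakly compatible.
T1 = split("ABC", "DEF")
T2 = split("AFE", "BCD")
T3 = split("ABF", "CDE")
print(is_weakly_compatible(T1, T2, T3))  # True

# The three two-versus-two splits of {x, y, z, t} are not weakly compatible.
U1 = split("xt", "yz")
U2 = split("yt", "xz")
U3 = split("zt", "xy")
print(is_weakly_compatible(U1, U2, U3))  # False
```

Looping over all eight choices of sides is redundant (only two distinct families of four regions occur), but it mirrors the wording of the definition.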



9.11 Weak compatibility and splits graphs

As discussed above, any given set of splits $\Sigma$ can be represented by a tree $T(\Sigma)$, if and only if $\Sigma$ is compatible.

A weakly compatible split system $\Sigma$ can be represented by a splits graph $G(\Sigma)$ that has the following properties:

• all leaves (and, additionally, some internal nodes, perhaps) are multi-labeled by taxa so that each taxon appears exactly once,

• edges are labeled by splits such that each split appears at least once,

• deleting all edges labeled by any given split $S = \{A, \bar{A}\}$ produces precisely two components, one containing all nodes with labels in $A$ and the other containing all nodes with labels in $\bar{A}$, and

• the graph is minimal with these properties.

9.12 Weak compatibility and splits graphs

Given the following splits:

$S_1 = \{\{A, B\}, \{C, D, E, F\}\}$, $S_2 = \{\{A, B, C\}, \{D, E, F\}\}$,

$S_3 = \{\{A, F, E\}, \{B, C, D\}\}$, $S_4 = \{\{A, B, F\}, \{C, D, E\}\}$,

and all singleton splits, such as $\{\{A\}, \{B, C, D, E, F\}\}$.

They can be represented as follows:

[Figure: splits graph on the taxa A, B, C, D, E, F with edges labeled S1, S2, S3 and S4]

Here is another example of a splits graph:

[Figure: splits graph on the taxa A.cerana, A.andrenof, A.florea, A.mellifer, A.koschev and A.dorsata]

This graph is based on DNA obtained from bees. It indicates that there is some evidence that groups A.cerana and A.mellifer together, and conflicting evidence that groups A.mellifer with A.dorsata, for example.



9.13 Splits graphs and distances

Given a set of taxa $X$, a set of weakly compatible splits $\Sigma$ of $X$ and a value $d_S \geq 0$ for each split $S$.

As above, we define the Σ-distance between taxa $a$ and $b$ simply as $d_\Sigma(a, b) := \sum_{S \in \Sigma(a,b)} d_S$, where $\Sigma(a, b)$ is the set of all splits that separate $a$ and $b$.

Assume we are given a corresponding splits graph $G$. In $G$, each split $S$ is represented by a band of parallel edges and each such edge $e$ has weight $d_e = d_S$.

Consider any two taxa $a, b \in X$. We define

$$d_G(a, b) := \min \Big\{ \sum_{e \in P} d_e \;\Big|\; P \text{ is a simple path from } a \text{ to } b \Big\}.$$

Lemma We have $d_\Sigma(a, b) = d_G(a, b)$ for all $a, b \in X$.

(Proof: need to show that a minimum path from $a$ to $b$ uses precisely one edge for every split that separates $a$ and $b$.)

9.14 Two main questions

• First, given a set of taxa $X$ and a distance matrix $d$, how do we compute a weakly compatible system of splits $\Sigma$ and values $d_S$, such that $\sum_{S \in \Sigma(a,b)} d_S$ is a useful approximation of $d_{ab}$?

• Second, given a weakly compatible set of splits, how do we compute the corresponding splits graph?

9.15 Distance matrices and d-splits

Given a distance matrix $d$ on $X$. We call a split $S = \{A, \bar{A}\}$ a d-split, if for all $i, j \in A$ and $k, l \in \bar{A}$ we have $d_{ij} + d_{kl} < \max(d_{ik} + d_{jl},\, d_{il} + d_{jk})$.

In other words, the metric induced by $d$ on any four taxa $i, j \in A$, $k, l \in \bar{A}$, places $i, j$ and $k, l$ together as indicated here:

[Figure: three admissible configurations of the four taxa i, j (side A) and k, l (other side, labeled B in the figure), and one configuration that is NOT admissible]

9.16 d-splits are weakly compatible

Lemma (Bandelt & Dress) Let $d$ be a distance matrix on $X$. Then the set of all d-splits is weakly compatible.

Proof Consider three d-splits $S_1 = \{A_1, \bar{A}_1\}$, $S_2 = \{A_2, \bar{A}_2\}$ and $S_3 = \{A_3, \bar{A}_3\}$ and assume that they are not weakly compatible. Then there exist four taxa $x, y, z, t$ contained in $A_1 \cap \bar{A}_2 \cap \bar{A}_3$, $\bar{A}_1 \cap A_2 \cap \bar{A}_3$, $\bar{A}_1 \cap \bar{A}_2 \cap A_3$ and $A_1 \cap A_2 \cap A_3$, respectively:

[Figure: diagram of the three splits S1, S2 and S3 of X showing the positions of x, y, z and t]

The definition of a d-split implies the following three inequalities:

For $S_1$: $d_{xt} + d_{yz} < \max(d_{xy} + d_{tz},\, d_{xz} + d_{ty})$,

for $S_2$: $d_{yt} + d_{xz} < \max(d_{yx} + d_{tz},\, d_{yz} + d_{tx})$, and

for $S_3$: $d_{zt} + d_{xy} < \max(d_{zx} + d_{ty},\, d_{zy} + d_{tx})$.

Note that these three inequalities cannot be fulfilled simultaneously: each of the three sums $d_{xt} + d_{yz}$, $d_{yt} + d_{xz}$ and $d_{zt} + d_{xy}$ would have to be strictly smaller than the maximum of the other two, which is impossible for the largest of them. This contradicts our assumption, and thus the three splits must be weakly compatible. □

9.17 The isolation index of a split

We give any d-split $S = \{A, \bar{A}\}$ a positive weight, namely the quantity

$$\alpha_{A,\bar{A}} := \alpha^d_{A,\bar{A}} := \frac{1}{2} \min_{i,j \in A,\; k,l \in \bar{A}} \{\max(d_{ik} + d_{jl},\, d_{il} + d_{jk}) - (d_{ij} + d_{kl})\},$$

called the isolation index of $S$.

We can easily modify this definition to apply to any split $S = \{A, \bar{A}\}$, whether d-split or not:

$$\alpha_{A,\bar{A}} := \alpha^d_{A,\bar{A}} := \frac{1}{2} \min_{i,j \in A,\; k,l \in \bar{A}} \{\max(d_{ik} + d_{jl},\, d_{il} + d_{jk},\, d_{ij} + d_{kl}) - (d_{ij} + d_{kl})\},$$

thus obtaining a value $\geq 0$ that equals the previously defined isolation index, if $S$ is a d-split, and 0, if not.
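The modified definition can be evaluated directly by enumerating all quartets. The following Python sketch assumes that the distance matrix is stored as a nested dictionary with zero diagonal and that $i = j$ and $k = l$ are allowed in the minimum; the helper `symmetric`, the function name `isolation_index` and the example distances (taken from a hypothetical tree with four pendant edges of weight 1 and one internal edge of weight 2) are assumptions for illustration.

```python
from itertools import product

def symmetric(pairs, taxa):
    """Build a symmetric distance matrix with zero diagonal as a dict of dicts."""
    d = {x: {y: 0.0 for y in taxa} for x in taxa}
    for (x, y), value in pairs.items():
        d[x][y] = d[y][x] = value
    return d

def isolation_index(d, side_a, side_b):
    """Half the minimum, over i, j in A and k, l in the other side, of
    max(d_ik + d_jl, d_il + d_jk, d_ij + d_kl) - (d_ij + d_kl)."""
    best = min(
        max(d[i][k] + d[j][l], d[i][l] + d[j][k], d[i][j] + d[k][l])
        - (d[i][j] + d[k][l])
        for i, j in product(side_a, repeat=2)
        for k, l in product(side_b, repeat=2)
    )
    return 0.5 * best

# Additive distances of the hypothetical tree ((a,b),(c,d)).
D = symmetric({("a", "b"): 2.0, ("c", "d"): 2.0,
               ("a", "c"): 4.0, ("a", "d"): 4.0,
               ("b", "c"): 4.0, ("b", "d"): 4.0}, "abcd")

print(isolation_index(D, "ab", "cd"))   # 2.0, the weight of the internal edge
print(isolation_index(D, "ac", "bd"))   # 0.0, not a d-split
print(isolation_index(D, "a", "bcd"))   # 1.0, the weight of the pendant edge of a
```

For this additive example the isolation index of each tree split equals the weight of the corresponding edge, and the split that is not in the tree gets index 0.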

9.18 The split decomposition

For any split $S = \{A, \bar{A}\}$ of $X$, the split metric $\delta_S$ is given by

$$\delta_S(i, j) := \begin{cases} 0, & \text{if } i, j \in A \text{ or } i, j \in \bar{A}, \\ 1, & \text{else.} \end{cases}$$

Theorem (Bandelt & Dress) Any given distance matrix $d$ on $X$ possesses the following unique decomposition:

$$d_{ij} = \Big( \sum_{S} \alpha_S \, \delta_S(i, j) \Big) + d^0_{ij},$$

for all $i, j \in X$. Here, the sum runs over all possible splits $S$ and the map $d^0 : X \times X \to \mathbb{R}_{\geq 0}$ is a (pseudo-)metric that does not admit any further splits with positive isolation index, i.e. there exist no $d^0$-splits.

Hence, we have $\sum_S \alpha_S \delta_S(i, j) \leq d_{ij}$ for any pair of taxa $i, j \in X$, and the Σ-distance with split weights $\alpha_S$ approximates $d_{ij}$ from below.

One can prove that the number of d-splits is $\leq \binom{|X|}{2}$.



9.19 Computing the set of d-splits

Given a distance matrix $d$ on $X$. The set of all d-splits can be computed iteratively in $O(n^6)$ steps:

Algorithm

Input: Distance matrix $d$, taxon set $X = \{x_1, x_2, \dots, x_n\}$.

Output: Set $\Sigma = \Sigma_n$ of all d-splits

Initialization: $\Sigma_0 := \emptyset$, $X_0 := \emptyset$

for each $k = 1, 2, \dots, n$ do:

    Set $\Sigma_k := \emptyset$, $X_k := \{x_1, \dots, x_k\}$

    for each split $S = \{A, \bar{A}\} \in \Sigma_{k-1}$:

        if $\{A \cup \{x_k\}, \bar{A}\}$ has positive isolation index then

            Add $\{A \cup \{x_k\}, \bar{A}\}$ to $\Sigma_k$

        if $\{A, \bar{A} \cup \{x_k\}\}$ has positive isolation index then

            Add $\{A, \bar{A} \cup \{x_k\}\}$ to $\Sigma_k$

    if $k \geq 2$ and $\{\{x_1, x_2, \dots, x_{k-1}\}, \{x_k\}\}$ has positive isolation index then

        Add $\{\{x_1, x_2, \dots, x_{k-1}\}, \{x_k\}\}$ to $\Sigma_k$.

end

Lemma This algorithm computes all d-splits.

Proof: First note that in the $k$-th iteration of the algorithm the new partial singleton split $\{\{x_1, \dots, x_{k-1}\}, \{x_k\}\}$ is evaluated and then added to the current set of splits, if $\alpha_{\{x_1, \dots, x_{k-1}\},\{x_k\}} > 0$. Additionally, the algorithm attempts to extend all existing partial splits by adding $x_k$ to the one side, or the other side, of them. By definition of the isolation index as the minimum of certain sums involving quartets of taxa, adding a taxon to either side of a partial split can only decrease the isolation index. Hence, any d-split of $X$ is obtainable as a partial singleton split for some $k$, followed by successive addition of the remaining taxa to the split. □
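The iterative algorithm above can be sketched in Python as follows. This is a minimal, unoptimized illustration and not the lecture's implementation: it assumes a nested-dictionary distance matrix, allows $i = j$ and $k = l$ in the isolation index, and uses hypothetical names (`isolation_index`, `d_splits`) and a hypothetical additive example matrix.

```python
from itertools import product

def isolation_index(d, side_a, side_b):
    """Isolation index of the (partial) split {side_a, side_b}, using the
    modified definition that yields 0 for splits that are not d-splits."""
    best = min(
        max(d[i][k] + d[j][l], d[i][l] + d[j][k], d[i][j] + d[k][l])
        - (d[i][j] + d[k][l])
        for i, j in product(side_a, repeat=2)
        for k, l in product(side_b, repeat=2)
    )
    return 0.5 * best

def d_splits(d, taxa):
    """Compute all d-splits by adding one taxon at a time."""
    current = []  # partial splits of taxa[:k], as (A, B) pairs of frozensets
    for k, x in enumerate(taxa):
        extended = []
        for a, b in current:
            # Try to add x to the one side, or to the other side.
            for candidate in ((a | {x}, b), (a, b | {x})):
                if isolation_index(d, *candidate) > 0:
                    extended.append(candidate)
        if k >= 1:
            # New partial singleton split {x_1, ..., x_{k-1}} versus {x_k}.
            single = (frozenset(taxa[:k]), frozenset([x]))
            if isolation_index(d, *single) > 0:
                extended.append(single)
        current = extended
    # Return unordered splits so that the two sides can be compared freely.
    return {frozenset(s) for s in current}

# Hypothetical additive distances of the tree ((a,b),(c,d)).
TAXA = ["a", "b", "c", "d"]
D = {x: {y: 0 for y in TAXA} for x in TAXA}
for (x, y), v in {("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 4,
                  ("a", "d"): 4, ("b", "c"): 4, ("b", "d"): 4}.items():
    D[x][y] = D[y][x] = v

for s in sorted(sorted("".join(sorted(side)) for side in s) for s in d_splits(D, TAXA)):
    print(s)
```

On this example the sketch returns the four singleton splits and the split separating a, b from c, d, which are exactly the splits of the underlying tree.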

9.20 Computing the splits graph

Given a compatible system of splits, it is easy to construct the corresponding tree and to compute coordinates for the tree.

The problem of computing a splits graph for a given set of weakly compatible splits is more difficult. In practice, one distinguishes between circular split systems, which correspond to planar splits graphs, and non-circular ones. A nice algorithm exists for circular split systems that produces a planar graph.

Here we discuss the convex hull approach that applies to any set of weakly compatible splits, whether circular or not. This method is easy to describe. Its main drawback is that it usually produces redundant nodes and edges and so the resulting graph is not always minimal in the sense postulated above.

9.21 Convex-hull construction method

Given a splits graph $G$. For a given set of taxa $A \subset X$, let $G_A$ denote the set of all nodes labeled by taxa in $A$. The convex hull $\overline{G_A}$ of $G_A$ is obtained by first setting $\overline{G_A} = G_A$ and then repeatedly adding any node $v$ to $\overline{G_A}$, if there exist two nodes $a, b$ already in $\overline{G_A}$ such that

$$d_G(a, v) + d_G(v, b) \leq d_G(a, b),$$

i.e. if $v$ lies on a shortest path from $a$ to $b$.
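The convex hull of a set of nodes can be computed by a simple closure loop once all shortest-path lengths are known. The following Python sketch assumes an undirected graph given as a dictionary of weighted edges and uses integer weights to avoid rounding issues in the comparison; the function names and the small example graph (a square 1-2-3-4 with a pendant node 5 attached to node 3) are hypothetical.

```python
def shortest_path_lengths(nodes, edges):
    """All-pairs shortest path lengths d_G, computed with Floyd-Warshall."""
    inf = float("inf")
    d = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def convex_hull(nodes, edges, start):
    """Repeatedly add every node v with d(a, v) + d(v, b) <= d(a, b)
    for some a, b already in the hull, until nothing changes."""
    d = shortest_path_lengths(nodes, edges)
    hull = set(start)
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v not in hull and any(
                d[a][v] + d[v][b] <= d[a][b] for a in hull for b in hull
            ):
                hull.add(v)
                changed = True
    return hull

# Hypothetical example: a square 1-2-3-4 with a pendant node 5 attached to 3.
NODES = [1, 2, 3, 4, 5]
EDGES = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (4, 1): 1, (3, 5): 1}

print(sorted(convex_hull(NODES, EDGES, {1, 3})))  # [1, 2, 3, 4]
print(sorted(convex_hull(NODES, EDGES, {1, 2})))  # [1, 2]
```

In the first call both shortest paths between the opposite corners 1 and 3 are pulled into the hull, while the pendant node 5 stays outside.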

Given a weakly compatible set of splits $\Sigma = \{S_1, S_2, \dots, S_k\}$. The convex-hull construction method constructs a graph by adding one split at a time. For each split, the convex hull for both sides of the split is computed. The intersection of the two is duplicated, one copy is connected to the part of the graph corresponding to one side of the split, the other copy to the part corresponding to the other side, and then the two duplicated subgraphs are connected by a set of new edges that represent the new split.

Assume that $G$ is the graph constructed for splits $S_1, \dots, S_i$. To add the next split $S_{i+1} = \{A, \bar{A}\}$:

• Determine the two convex hulls $\overline{G_A}$ and $\overline{G_{\bar{A}}}$.

• Let $H := \overline{G_A} \cap \overline{G_{\bar{A}}}$ denote their intersection.

• For each node $v \in H$, produce two new nodes $v^+$ and $v^-$ and connect them by an edge labeled $S_{i+1}$.

• If $v \in H$ is labeled by a taxon $x \in A$, or $x \in \bar{A}$, then attach this label to node $v^+$, or $v^-$, respectively.

• Connect any two nodes $v^+$ and $w^+$, and $v^-$ and $w^-$, respectively, by an edge, if $v$ and $w$ are connected by an edge in $G$.

• If $v \in H$ is connected to some node $w \in \overline{G_A} \setminus \overline{G_{\bar{A}}}$, then connect $v^+$ and $w$ by an edge.

• If $v \in H$ is connected to some node $w \in \overline{G_{\bar{A}}} \setminus \overline{G_A}$, then connect $v^-$ and $w$ by an edge.

• Delete $H$.

9.22 Computing the splits graph

Given the following set $\Sigma$ of splits:

$S_1 = \{\{1, 5, 6\}, \{2, 3, 4\}\}$

$S_2 = \{\{1, 2, 3\}, \{4, 5, 6\}\}$

$S_3 = \{\{1, 2, 5, 6\}, \{3, 4\}\}$

$S_4 = \{\{1, 2\}, \{3, 4, 5, 6\}\}$

$S_5 = \{\{1, 6\}, \{2, 3, 4, 5\}\}$

We will demonstrate how to generate $G_\Sigma$.

Initially, start with a single node labeled by all of $X = \{1, 2, 3, 4, 5, 6\}$:

[Figure: graph $G_1$, a single node labeled 1,2,3,4,5,6]

Then add the first split $S_1$. Note that $H$ consists of the single node present in $G_1$:

[Figure: graph $G_2$, with two nodes labeled 1,5,6 and 2,3,4]

Add the second split $S_2 = \{\{1, 2, 3\}, \{4, 5, 6\}\}$. Note that $H$ consists of both nodes in $G_2$:

[Figure: graph $G_3$]

Add the third split $S_3 = \{\{1, 2, 5, 6\}, \{3, 4\}\}$. Note that $H$ consists of the two nodes labeled 2,3 and 4 in $G_3$:

[Figure: graph $G_4$]

Add the fourth split $S_4 = \{\{1, 2\}, \{3, 4, 5, 6\}\}$. Note that $H$ consists of the two nodes labeled 1 and 2 in $G_4$:

[Figure: graph $G_5$]

Add the fifth split $S_5 = \{\{1, 6\}, \{2, 3, 4, 5\}\}$. Note that $H$ consists of the two nodes labeled 1 and 5,6, plus the node lying between these two in $G_5$:

[Figure: graph $G_6$]

Finally, add all singleton splits to $G_6$ to obtain the final graph $G$:

[Figure: the final splits graph $G$ on the taxa 1, ..., 6]

9.23 Example of splits graph

The distance matrix for the following example was produced in a psychology experiment in which people were asked to estimate the distance between different colors:

[Figure: splits graph for colors.nex (Mon Jul 15 09:16:02 2002; Fit=97.0, ntax=10) on the taxa red, yellow, red-purple, gr.-y.(yellowish), purple-reddish, gr.-y.(greenish), purple, green, purple-blue and blue]
