Global Sequence Alignment:
the Needleman-Wunsch Algorithm

Tony Capra
BMIF 310
Sept. 6, 2013


Which sequence is more "similar" to s1?

s1 = ASLVNDK

s2 = ALVNKDK
OR
s3 = AFPSTW


What is the intuition behind s2 seeming more similar?

We can write s1 and s2 in a way that juxtaposes more identical characters than s1 and s3:

s1      = ASLVN-DK
s2      = A-LVNKDK
matches = A LVN DK

s1      = ASLVNDK
s3      = AFPSTW-
matches = A

These juxtapositions are called alignments.


Overview

• Why align sequences of characters?
• Formalize our intuitive notion of an optimal alignment.
• Describe an efficient algorithm for finding optimal global alignments of two sequences.


Why align sequences?

• Alignments provide a way to consistently evaluate the similarity of pairs of sequences.
• We focus on aligning protein and DNA sequences.
• The algorithms we will discuss today work on any sequences of characters.


Why align sequences?

Biological sequence similarity suggests functional similarity and common evolutionary origins (homology).

Alignments quantify the similarity between different sequences within and between species.

[Figure: beta globin evolutionary trajectory, starting from an ancestral globin]

http://www.nature.com/scitable/topicpage/dna-deletion-and-duplication-and-the-associated-331


Why align sequences?

Alignments can highlight functionally relevant regions of a DNA/protein sequence.

Functionally important regions often have better alignments due to evolutionary conservation.

adapted from: http://www.pnas.org/content/104/25/10388/suppl/DC1#F5


Global vs. Local Alignment

[Figure: s1 and s2 aligned globally (Needleman-Wunsch) and locally (Smith-Waterman)]

Today we will learn an efficient algorithm for finding optimal global alignments.


There are many possible alignments.

s1 = ASLVNDK    s2 = ALVNKDK

ASLVNDK--
A-LVN-KDK

ASLVN-DK
A-LVNKDK

A----SLVNDK
ALVNK----DK

ASLVNDK
ALVNKDK


Which is the best alignment?

s1 = ASLVNDK    s2 = ALVNKDK

Each alignment is shown with its matching columns on the third row.

ASLVNDK--
A-LVN-KDK
A LVN K

ASLVN-DK
A-LVNKDK
A LVN DK

A----SLVNDK
ALVNK----DK
A        DK

ASLVNDK
ALVNKDK
A    DK


Implicit Column Scoring Function

• 1, match
• 0, mismatch (mutation)
• 0, gap (insertion, deletion)

ASLVN-DK
A-LVNKDK
10111011  (column scores, sum = 6)

• Sum the score over all columns of the alignment.

Deriving more complex scoring schemes will be a main topic of my next lecture.
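
The column scoring function above can be sketched in a few lines of Python. This is only an illustration: the function names `column_score` and `alignment_score` are made up for this example, and the gap character is assumed to be "-", as in the alignments shown on the slides.

```python
def column_score(a: str, b: str) -> int:
    """Score one alignment column: 1 for a match, 0 for a mismatch or a gap."""
    if a == "-" or b == "-":
        return 0  # gap (insertion or deletion)
    return 1 if a == b else 0  # match or mismatch


def alignment_score(row1: str, row2: str) -> int:
    """Sum the column scores over two equal-length alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(row1, row2))


print(alignment_score("ASLVN-DK", "A-LVNKDK"))  # 6
```

Applied to the other three candidate alignments of s1 and s2, the same function gives 5, 3, and 3, which agrees with the matching columns marked earlier.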


How can we find an optimal alignment?

Idea:
– Enumerate all possible alignments of s1 and s2.
– Calculate the score of each using the column scoring function.
– Pick the alignment(s) with the highest score.

Why is this a terrible idea?


The # of alignments of two sequences is an exponential function of their lengths.

If n is the length of s1 and m is the length of s2, then the number of possible alignments is roughly exponential in n + m.

For sequences of plausible lengths (~150) this quickly outgrows the number of atoms in the observable universe.
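
To get a feel for this growth, here is a minimal sketch that counts alignments exactly under one particular assumption: every column is either character/character, character/gap, or gap/character, and alignments that differ only in the order of adjacent gaps are counted separately. The function name `count_alignments` and the figure of 10^80 atoms (a commonly quoted rough estimate) are assumptions of this example, not taken from the slides.

```python
def count_alignments(n: int, m: int) -> int:
    """Count gapped alignments of two sequences with lengths n and m.

    Assumes the last column is one of three kinds (both characters,
    or one character against a gap), which gives the recurrence
    C(i, j) = C(i-1, j) + C(i, j-1) + C(i-1, j-1), with C(0, j) = C(i, 0) = 1.
    """
    prev = [1] * (m + 1)  # C(0, j) = 1: everything is aligned against gaps
    for _ in range(n):
        curr = [1]  # C(i, 0) = 1
        for j in range(1, m + 1):
            curr.append(prev[j] + curr[j - 1] + prev[j - 1])
        prev = curr
    return prev[m]


print(count_alignments(2, 2))               # 13
print(count_alignments(150, 150) > 10**80)  # True (10**80 is a rough atom-count estimate)
```

Other conventions for what counts as a distinct alignment give different exact numbers, but the growth remains exponential.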


So what can we do?


Dynamic Programming

• Is not a lively way of writing computer code.
• Is a way to exploit substructure within a problem by keeping good records of partial answers.
• Is a very common and powerful algorithmic technique.


Some Notation

• s1[1..i] = the first i characters of s1 (i in 0 to n)
• s2[1..j] = the first j characters of s2 (j in 0 to m)
• M(i,j) = score of an optimal alignment of s1[1..i] with s2[1..j]
  – in other words, M(i,j) = the maximum # of matches in an alignment of s1[1..i] with s2[1..j]


M(i,j) Examples

s1 = ASLVNDK    s2 = ALVNKDK

M(2,1) = 1
s1: AS
s2: A-
1+0 = 1

M(3,3) = ?
s1: ASL-
s2: A-LV
1+0+1+0 = 2


The Key Insight

• Think about the last column of an optimal alignment of s1[1..n] and s2[1..m].
• Four possibilities (where X, Y are characters; top row is s1, bottom row is s2):

X
-   ⇒ the remaining columns are an optimal alignment of s1[1..n-1] and s2[1..m], with score M(n-1, m)

-
X   ⇒ the remaining columns are an optimal alignment of s1[1..n] and s2[1..m-1], with score M(n, m-1)

X
Y   ⇒ the remaining columns are an optimal alignment of s1[1..n-1] and s2[1..m-1], with score M(n-1, m-1)

X
X   ⇒ the remaining columns are an optimal alignment of s1[1..n-1] and s2[1..m-1], with score M(n-1, m-1)


The Key Insight II

M(i,j) = max(
    M(i, j-1) + 0,          (last column: gap over s2[j])
    M(i-1, j) + 0,          (last column: s1[i] over gap)
    M(i-1, j-1) + t(i,j))   (last column: s1[i] over s2[j])

where t(i,j) = 1 if s1[i] == s2[j] and 0 if not.

So, if you were given M(i, j-1), M(i-1, j), and M(i-1, j-1), you could compute M(i,j) in a constant number of steps.
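
As a rough illustration, the recurrence can be evaluated directly, top-down, with each M(i,j) cached so it is computed only once. This is a sketch under stated assumptions: the name `optimal_score` is hypothetical, and the base cases M(0,j) = M(i,0) = 0 are assumed because an empty prefix can only be aligned against gaps, which score 0.

```python
from functools import lru_cache


def optimal_score(s1: str, s2: str) -> int:
    """Return M(n, m) by evaluating the recurrence top-down with memoization."""

    @lru_cache(maxsize=None)
    def M(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 0  # an empty prefix is aligned entirely against gaps
        # The slides index characters from 1; Python strings index from 0.
        t = 1 if s1[i - 1] == s2[j - 1] else 0
        return max(M(i, j - 1) + 0, M(i - 1, j) + 0, M(i - 1, j - 1) + t)

    return M(len(s1), len(s2))


print(optimal_score("AS", "A"))            # 1, i.e. M(2,1)
print(optimal_score("ASLVNDK", "ALVNKDK")) # 6
```

Recursion is convenient for short sequences like these; for long sequences Python's recursion limit becomes a practical obstacle, which is one reason to fill a table iteratively instead.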


Dynamic Programming Approach

• Use this recursive relationship to build up the score of an optimal alignment.
• Start with M(0, j) and M(i, 0).
• Keep track of intermediate scores in a table.
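
The three steps above can be sketched as a bottom-up table fill. This is a minimal sketch, not a reference implementation: `fill_table` is a hypothetical name, the table is stored as a list of lists indexed as `M[i][j]`, and the scoring is the simple 1/0/0 scheme from the earlier slide.

```python
def fill_table(s1: str, s2: str) -> list:
    """Return a table M where M[i][j] is the optimal score of s1[1..i] vs s2[1..j]."""
    n, m = len(s1), len(s2)
    # Base conditions: M(0, j) = M(i, 0) = 0, so start from an all-zero table.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):        # one pass per character of s2
        for i in range(1, n + 1):    # one cell per character of s1
            t = 1 if s1[i - 1] == s2[j - 1] else 0
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + t)
    return M


M = fill_table("ASLVNDK", "ALVNKDK")
print(M[7][7])  # 6
```

Each cell depends only on its left, upper, and upper-left neighbours, so any fill order that visits those three first would work equally well.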


Needleman-Wunsch Example

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0
A        1
L        2
V        3
N        4
K        5
D        6
K        7


Start with base conditions

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   ?   ?   ?   ?   ?   ?   ?
A        1    ?
L        2    ?
V        3    ?
N        4    ?
K        5    ?
D        6    ?
K        7    ?

What is M(1, 0), the score of the optimal alignment of s1[1..1] with a gap?


Start with base conditions

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   0   ?   ?   ?   ?   ?   ?
A        1    ?
L        2    ?
V        3    ?
N        4    ?
K        5    ?
D        6    ?
K        7    ?

How about the rest of this row?


Now compute M(i,j) row by row.

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   0   0   0   0   0   0   0
A        1    0   ?
L        2    0
V        3    0
N        4    0
K        5    0
D        6    0
K        7    0

M(1,1) = max(M(0,1) + 0, M(1,0) + 0, M(0,0) + t(1,1))
M(1,1) = max(0 + 0, 0 + 0, 0 + 1) = 1


Now compute M(i,j) row by row.

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   0   0   0   0   0   0   0
A        1    0   1
L        2    0
V        3    0
N        4    0
K        5    0
D        6    0
K        7    0


Now compute M(i,j) row by row.

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   0   0   0   0   0   0   0
A        1    0   1   1   1   1   1   1   1
L        2    0   ?
V        3    0
N        4    0
K        5    0
D        6    0
K        7    0

M(i,j) = max(M(i-1, j) + 0, M(i, j-1) + 0, M(i-1, j-1) + t(i, j))


Now compute M(i,j) row by row.

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   0   0   0   0   0   0   0
A        1    0   1   1   1   1   1   1   1
L        2    0   1   1   2   2   2   2   2
V        3    0   1   1   2   3   3   3   3
N        4    0   1   1   2   3   4   4   4
K        5    0   1   1   2   3   4   4   5
D        6    0   1   1   2   3   4   5   5
K        7    0   1   1   2   3   4   5   6

M(7,7) = 6, so the optimal match count is six!


How do we construct an optimal alignment from this table?


The Traceback

M(i,j)   s1       A   S   L   V   N   D   K
s2            0   1   2   3   4   5   6   7
         0    0   0   0   0   0   0   0   0
A        1    0   1   1   1   1   1   1   1
L        2    0   1   1   2   2   2   2   2
V        3    0   1   1   2   3   3   3   3
N        4    0   1   1   2   3   4   4   4
K        5    0   1   1   2   3   4   4   5
D        6    0   1   1   2   3   4   5   5
K        7    0   1   1   2   3   4   5   6

Trace the path followed to get M(n,m), starting from the bottom-right cell and working back to M(0,0):
• Diagonal step: align s1[i] and s2[j].
• Vertical step (up one row): align gap and s2[j].
• Horizontal step (left one column): align gap and s1[i].

There may be more than one path/optimal alignment.
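
The traceback can be sketched as follows. This is one possible implementation under stated assumptions, not the lecture's own code: the name `align_global` is hypothetical, and when several steps are consistent with the table it prefers the diagonal step, then the step that consumes a character of s1, so it returns only one of the possibly many optimal alignments.

```python
def align_global(s1: str, s2: str):
    """Fill the table, then trace back one optimal global alignment.

    Returns (score, aligned_s1, aligned_s2) under the 1/0/0 scoring scheme.
    """
    n, m = len(s1), len(s2)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 1 if s1[i - 1] == s2[j - 1] else 0
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + t)

    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        t = 1 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else 0
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + t:
            # Diagonal step: align s1[i] and s2[j].
            row1.append(s1[i - 1]); row2.append(s2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j]:
            # Step that consumes s1[i]: align s1[i] with a gap.
            row1.append(s1[i - 1]); row2.append("-")
            i -= 1
        else:
            # Step that consumes s2[j]: align a gap with s2[j].
            row1.append("-"); row2.append(s2[j - 1])
            j -= 1
    # Columns were collected from last to first, so reverse them.
    return M[n][m], "".join(reversed(row1)), "".join(reversed(row2))


print(align_global("ASLVNDK", "ALVNKDK"))  # (6, 'ASLVN-DK', 'A-LVNKDK')
```

Changing the order in which the three cases are tested would select a different path through the table whenever ties occur, without changing the score.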


Is this algorithm feasible?

• It takes roughly nm + (n + m) operations to fill in the table and do the traceback.
• Quadratic run time is much better than exponential!


Global vs. Local Alignment

[Figure: s1 and s2 aligned globally (Needleman-Wunsch) and locally (Smith-Waterman)]

In my next lecture, we will learn an efficient algorithm for finding optimal local alignments.


When would you want a local alignment?

• Proteins are composed of modular functional "domains."
• These domains are often shuffled into different combinations.
• Thus, parts of proteins may align well, while others do not, due to their different evolutionary histories.


Summary

• Sequence alignment helps us discover and compare relationships between sequences.
• Dynamic programming enables the efficient identification of optimal alignments.
