Global Sequence Alignment: the Needleman-Wunsch Algorithm
Global Sequence Alignment: the Needleman-Wunsch Algorithm
Tony Capra
BMIF 310
Sept. 6, 2013
Which sequence is more "similar" to s1?
s1 = ASLVNDK
s2 = ALVNKDK
OR
s3 = AFPSTW
What is the intuition behind s2 seeming more similar?
We can write s1 and s2 in a way that juxtaposes more identical characters than s1 and s3:
s1 = ASLVN-DK
s2 = A-LVNKDK
     A LVN DK   (identical characters)
s1 = ASLVNDK
s3 = AFPSTW-
     A          (identical characters)
These juxtapositions are called alignments.
Overview
• Why align sequences of characters?
• Formalize our intuitive notion of an optimal alignment.
• Describe an efficient algorithm for finding optimal global alignments of two sequences.
Why align sequences?
• Alignments provide a way to consistently evaluate the similarity of pairs of sequences.
• We focus on aligning protein and DNA sequences.
• The algorithms we will discuss today work on any sequences of characters.
Why align sequences?
Biological sequence similarity suggests functional similarity and common evolutionary origins (homology).
Alignments quantify the similarity between different sequences within and between species.
[Figure: beta globin evolutionary trajectory, starting from an ancestral globin]
http://www.nature.com/scitable/topicpage/dna-deletion-and-duplication-and-the-associated-331
Why align sequences?
Alignments can highlight functionally relevant regions of a DNA/protein sequence.
Functionally important regions often have better alignments due to evolutionary conservation.
[Figure: alignment adapted from http://www.pnas.org/content/104/25/10388/suppl/DC1#F5]
Global vs. Local Alignment
[Figure: s1 and s2 under global (Needleman-Wunsch) and local (Smith-Waterman) alignment]
Today we will learn an efficient algorithm for finding optimal global alignments.
There are many possible alignments.
s1 = ASLVNDK    s2 = ALVNKDK

ASLVNDK--
A-LVN-KDK

ASLVN-DK
A-LVNKDK

A----SLVNDK
ALVNK----DK

ASLVNDK
ALVNKDK
Which is the best alignment?
s1 = ASLVNDK    s2 = ALVNKDK

ASLVNDK--
A-LVN-KDK
A LVN K      (identical characters)

ASLVN-DK
A-LVNKDK
A LVN DK     (identical characters)

A----SLVNDK
ALVNK----DK
A        DK  (identical characters)

ASLVNDK
ALVNKDK
A    DK      (identical characters)
Implicit Column Scoring Function
• 1, match
• 0, mismatch (mutation)
• 0, gap (insertion, deletion)
ASLVN-DK
A-LVNKDK
10111011 = 6
• Sum the score over all columns of the alignment.
Deriving more complex scoring schemes will be a main topic of my next lecture.
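The column scoring function can be sketched in a few lines of Python. This is only an illustration under the scoring rules listed on this slide (1 for a match, 0 for a mismatch or a gap); the function names `column_score` and `alignment_score` are hypothetical and do not come from the lecture.

```python
def column_score(a: str, b: str) -> int:
    """Score one alignment column: 1 for a match, 0 for a mismatch or a gap."""
    # A gap character never counts as a match, even against another gap.
    return 1 if a == b and a != "-" else 0


def alignment_score(row1: str, row2: str) -> int:
    """Sum the column scores over all columns of an alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(a, b) for a, b in zip(row1, row2))


# The alignment from the slide: columns score 1,0,1,1,1,0,1,1.
print(alignment_score("ASLVN-DK", "A-LVNKDK"))  # 6
```

Applied to the four candidate alignments of s1 and s2 shown earlier, this scoring picks out the second one (ASLVN-DK over A-LVNKDK) as the best of the four.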
How can we find an optimal alignment?
Idea:
– Enumerate all possible alignments of s1 and s2.
– Calculate the score of each using the column scoring function.
– Pick the alignment(s) with the highest score.
Why is this a terrible idea?
The number of alignments of two sequences is an exponential function of their lengths.
If n is the length of s1 and m is the length of s2, then the number of possible alignments is roughly:
[formula not recovered from the slide]
For sequences of plausible lengths (~150) this quickly outgrows the number of atoms in the observable universe.
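To get a feel for the growth, here is a rough Python sketch that counts alignments under one particular convention: every alignment is a sequence of columns, and each column is a character over a gap, a gap over a character, or a character over a character. This convention is an assumption and may not be the count behind the slide's formula. The function name `count_alignments` is hypothetical, and 10^80 is used only as a commonly quoted rough estimate for the number of atoms in the observable universe.

```python
def count_alignments(n: int, m: int) -> int:
    """Count column sequences that align a length-n and a length-m sequence.

    Assumption: two alignments are different whenever their columns differ,
    so "X over gap, then gap over Y" and "gap over Y, then X over gap" are
    counted separately.
    """
    # counts[i][j] = number of alignments of a length-i prefix with a length-j prefix.
    # Aligning anything with an empty prefix can be done in exactly one way.
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The last column is one of the three column types.
            counts[i][j] = counts[i - 1][j] + counts[i][j - 1] + counts[i - 1][j - 1]
    return counts[n][m]


print(count_alignments(7, 7))                 # the two length-7 example sequences
print(count_alignments(150, 150) > 10 ** 80)  # compare with the rough atom estimate
```

Counting takes only about nm additions, even though enumerating the alignments themselves is hopeless.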
So what can we do?
Dynamic Programming
• Is not a lively way of writing computer code.
• Is a way to exploit substructure within a problem by keeping good records of partial answers.
• Is a very common and powerful algorithmic technique.
Some Notation
• s1[1..i] = the first i characters of s1 (i from 0 to n)
• s2[1..j] = the first j characters of s2 (j from 0 to m)
• M(i,j) = score of an optimal alignment of s1[1..i] with s2[1..j]
– in other words, M(i,j) = the maximum number of matches in any alignment of s1[1..i] with s2[1..j]
M(i,j) Examples
s1 = ASLVNDK    s2 = ALVNKDK
M(2,1) = 1
s1: AS
s2: A-
1+0 = 1
M(3,3) = ?
s1: ASL-
s2: A-LV
1+0+1+0 = 2
The Key Insight
• Think about the last column of an optimal alignment.
• Four possibilities (where X, Y are characters; s1 on top, s2 below):
– X over - ⇒ the rest is an optimal alignment of s1[1..n-1] and s2[1..m], with score M(n-1, m)
– - over X ⇒ the rest is an optimal alignment of s1[1..n] and s2[1..m-1], with score M(n, m-1)
– X over Y ⇒ the rest is an optimal alignment of s1[1..n-1] and s2[1..m-1], with score M(n-1, m-1)
– X over X ⇒ the rest is an optimal alignment of s1[1..n-1] and s2[1..m-1], with score M(n-1, m-1)
The Key Insight II
M(i,j) = max(
  M(i, j-1) + 0,          (last column: - over X)
  M(i-1, j) + 0,          (last column: X over -)
  M(i-1, j-1) + t(i,j))   (last column: X over Y, or X over X)
where t(i,j) = 1 if s1[i] == s2[j] and 0 if not.
So, if you were given M(i, j-1), M(i-1, j), and M(i-1, j-1), you could compute M(i,j) in a constant number of steps.
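The recurrence translates almost literally into a recursive function. The sketch below is one possible rendering, not the lecture's own code. It assumes base cases M(i,0) = M(0,j) = 0 (aligning a prefix against nothing yields no matches) and uses `functools.lru_cache` to record partial answers so that each M(i,j) is computed only once. The name `optimal_score` is hypothetical.

```python
from functools import lru_cache


def optimal_score(s1: str, s2: str) -> int:
    """Return M(n, m): the maximum number of matches in any alignment of s1 and s2."""

    @lru_cache(maxsize=None)
    def M(i: int, j: int) -> int:
        # Assumed base case: an empty prefix can only be aligned with gaps.
        if i == 0 or j == 0:
            return 0
        # t(i,j) from the slide; Python strings are 0-indexed, hence the -1.
        t = 1 if s1[i - 1] == s2[j - 1] else 0
        return max(M(i, j - 1) + 0, M(i - 1, j) + 0, M(i - 1, j - 1) + t)

    return M(len(s1), len(s2))


print(optimal_score("ASLVNDK", "ALVNKDK"))
```

Without the cache the same subproblems would be recomputed exponentially often. For long sequences, Python's recursion limit makes an explicit table the more practical choice.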
Dynamic Programming Approach
• Use this recursive relationship to build up the score of an optimal alignment.
• Start with M(0, j) and M(i, 0).
• Keep track of intermediate scores in a table.
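The three steps above can be sketched as a bottom-up table fill. This is a minimal illustration that assumes the simple match-counting score used in this lecture; the function name `fill_table` and the list-of-lists layout are choices made for the example, not part of the lecture.

```python
def fill_table(s1: str, s2: str) -> list[list[int]]:
    """Return the table M, where M[i][j] scores s1[:i] against s2[:j]."""
    n, m = len(s1), len(s2)
    # Base conditions: M(i, 0) = M(0, j) = 0, so start from an all-zero table.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):          # one row of the slide's table per character of s2
        for i in range(1, n + 1):      # one column per character of s1
            t = 1 if s1[i - 1] == s2[j - 1] else 0
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + t)
    return M


s1, s2 = "ASLVNDK", "ALVNKDK"
M = fill_table(s1, s2)
# Print with s1 across the top and s2 down the side.
for j in range(len(s2) + 1):
    print(" ".join(str(M[i][j]) for i in range(len(s1) + 1)))
```

Every cell depends only on cells above it, to its left, or diagonally above-left, so any fill order that visits those first works.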
Needleman-Wunsch Example
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0
A j=1
L j=2
V j=3
N j=4
K j=5
D j=6
K j=7
Start with base conditions
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  ?  ?  ?  ?  ?  ?  ?
A j=1     ?
L j=2     ?
V j=3     ?
N j=4     ?
K j=5     ?
D j=6     ?
K j=7     ?
What is M(1, 0), the score of the optimal alignment of s1[1..1] with a gap?
Start with base conditions
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  0  ?  ?  ?  ?  ?  ?
A j=1     ?
L j=2     ?
V j=3     ?
N j=4     ?
K j=5     ?
D j=6     ?
K j=7     ?
How about the rest of this row?
Now compute M(i,j) row by row.
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  0  0  0  0  0  0  0
A j=1     0  ?
L j=2     0
V j=3     0
N j=4     0
K j=5     0
D j=6     0
K j=7     0
M(1,1) = max(M(0,1) + 0, M(1,0) + 0, M(0,0) + t(1,1))
M(1,1) = max(0 + 0, 0 + 0, 0 + 1) = 1
Now compute M(i,j) row by row.
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  0  0  0  0  0  0  0
A j=1     0  1
L j=2     0
V j=3     0
N j=4     0
K j=5     0
D j=6     0
K j=7     0
Now compute M(i,j) row by row.
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  0  0  0  0  0  0  0
A j=1     0  1  1  1  1  1  1  1
L j=2     0  ?
V j=3     0
N j=4     0
K j=5     0
D j=6     0
K j=7     0
M(i,j) = max(M(i-1, j) + 0, M(i, j-1) + 0, M(i-1, j-1) + t(i, j))
Now compute M(i,j) row by row.
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  0  0  0  0  0  0  0
A j=1     0  1  1  1  1  1  1  1
L j=2     0  1  1  2  2  2  2  2
V j=3     0  1  1  2  3  3  3  3
N j=4     0  1  1  2  3  4  4  4
K j=5     0  1  1  2  3  4  4  5
D j=6     0  1  1  2  3  4  5  5
K j=7     0  1  1  2  3  4  5  6
M(7,7) = 6, so the optimal match count is six!
How do we construct an optimal alignment from this table?
The Traceback
M(i,j)   (columns: s1, index i; rows: s2, index j)
s1:          A  S  L  V  N  D  K
i:        0  1  2  3  4  5  6  7
  j=0     0  0  0  0  0  0  0  0
A j=1     0  1  1  1  1  1  1  1
L j=2     0  1  1  2  2  2  2  2
V j=3     0  1  1  2  3  3  3  3
N j=4     0  1  1  2  3  4  4  4
K j=5     0  1  1  2  3  4  4  5
D j=6     0  1  1  2  3  4  5  5
K j=7     0  1  1  2  3  4  5  6
Trace the path followed to get M(n,m), starting from the bottom-right cell.
• Diagonal step: align s1[i] and s2[j].
• Vertical step: align a gap and s2[j].
• Horizontal step: align a gap and s1[i].
There may be more than one path, and so more than one optimal alignment.
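The traceback can be sketched as follows. This is one possible implementation under the lecture's match-counting score, not a definitive one. It refills the table and then walks back from M(n,m), preferring the diagonal step when several steps are consistent with the table; that tie-breaking rule is an assumption, and a different rule may return a different but equally optimal alignment. The function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(s1: str, s2: str) -> tuple[int, str, str]:
    """Return (score, aligned s1, aligned s2) for one optimal global alignment."""
    n, m = len(s1), len(s2)
    # Fill the table exactly as in the recurrence; M[i][j] scores s1[:i] vs s2[:j].
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 1 if s1[i - 1] == s2[j - 1] else 0
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + t)

    # Walk back from M(n, m), collecting columns in reverse order.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        t = 1 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else 0
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + t:
            # Diagonal step: align s1[i] and s2[j].
            top.append(s1[i - 1]); bottom.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j]:
            # Step along s1 only: align s1[i] with a gap.
            top.append(s1[i - 1]); bottom.append("-")
            i -= 1
        else:
            # Step along s2 only: align a gap with s2[j].
            top.append("-"); bottom.append(s2[j - 1])
            j -= 1
    return M[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, a1, a2 = needleman_wunsch("ASLVNDK", "ALVNKDK")
print(score)
print(a1)
print(a2)
```

With this tie-breaking rule, the example sequences come back as ASLVN-DK over A-LVNKDK, the six-match alignment shown earlier in the lecture.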
Is this algorithm feasible?
• It takes roughly nm + (n + m) operations to fill in the table and do the traceback.
• Quadratic run time is much better than exponential!
Global vs. Local Alignment
[Figure: s1 and s2 under global (Needleman-Wunsch) and local (Smith-Waterman) alignment]
In my next lecture, we will learn an efficient algorithm for finding optimal local alignments.
When would you want a local alignment?
• Proteins are composed of modular functional "domains."
• These domains are often shuffled into different combinations.
• Thus, parts of proteins may align well, while others do not, due to their different evolutionary histories.
Summary
• Sequence alignment helps us discover and compare relationships between sequences.
• Dynamic programming enables the efficient identification of optimal alignments.