Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)
Chapter 4
Unsupervised Learning: Clustering
Needleman–Wunsch Algorithm
The Needleman–Wunsch algorithm is used in bioinformatics to align
protein or nucleotide sequences. It was one of the first applications of
dynamic programming to the comparison of biological sequences. First it
creates a matrix whose rows correspond to the characters of one sequence
and whose columns correspond to the characters of the other. Each cell
holds the best alignment score for the prefixes of the two sequences that
end at that row and column. A cell's score comes from one of three moves:
a match, a mismatch, or a gap (an insertion or deletion). Once the matrix
is filled, the algorithm traces back from the bottom-right cell to the
top-left cell, at each step moving to the neighboring cell from which the
current score was derived. The sum of the scores along the traceback path,
which equals the value in the bottom-right cell, is the Needleman–Wunsch
score, used here as the distance measure for two strings.
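The matrix fill and traceback described above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any particular library: the function name needleman_wunsch and the scoring values (match=1, mismatch=-1, gap=-1) are assumptions chosen for the example, and it uses a single linear gap penalty instead of separate gap-open and gap-extension costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not library defaults.
    """
    n, m = len(a), len(b)

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def pair(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell takes the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # match/mismatch
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Trace back from the bottom-right cell to the top-left cell,
    # following the neighbor from which each score was derived.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append('-')
            i -= 1
        else:
            out_a.append('-')
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))


if __name__ == '__main__':
    print(needleman_wunsch('AAA', 'TTT'))
    print(needleman_wunsch('AAA', 'A'))
```

With these assumed values a higher score means more similar strings, so a distance-like quantity would need the sign flipped or the costs redefined as penalties to be minimized.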
Pyopa is a Python module that provides a ready-made way to compute an
alignment score between two strings.
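import pyopa

# Toy scoring environment: a 1x1 matrix that only scores 'A' against 'A'.
data = {'gap_open': -20.56,
        'gap_ext': -3.37,
        'pam_distance': 150.87,
        'scores': [[10.0]],
        'column_order': 'A',
        'threshold': 50.0}
env = pyopa.create_environment(**data)
s1 = pyopa.Sequence('AAA')
s2 = pyopa.Sequence('TTT')
# Score of aligning s1 with itself under the environment above.
print(pyopa.align_double(s1, s1, env))
# Assumption: pyopa.align_strings returns the two aligned strings;
# check this call against the installed PyOPA version.
aligned_strings = pyopa.align_strings(s1, s2, env)
print(env.estimate_pam(aligned_strings[0], aligned_strings[1]))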