Needleman–Wunsch algorithm

Introduction

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences.

In this article, we try to use Needleman-Wunsch algorithm to find the minimum mismatch between two different DNA sequences.

Definition of Mismatches

Here we have two different DNA sequences, i.e., S1 and S2, where

S1 = "GCATGCU",

S2 = "GCTTAGCU".

One solution to match these two DNA sequences is listed as below:

S1* = "GCAT-GCU",

S2* = "GCTTAGCU",

in which - is called as “gap”, i.e., in this position the DNA is supposed to be unknown.

Match: The two letters are the same, except both two letters are gapped.

Mismatch: The two letters are differential or both two letters are gapped.

By above definition, the solution we find for matching S1 and S2 has two mismatches — T & A and - & A.

Codes for finding minimum mismatches

Here I list the python code for finding minimum number of mismatches between two DNA sequences. The insights of codes below can be viewed by supplemental documents here.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
'''
input:
S1 = ['A','B','C','D']
S2 = ['A','B','C']
output:
1
print info:
New S1 is: ABCD
New S2 is: ABC-
'''
def MinMismatch(S1, S2):
'''
First of all, we construct matrix for counting maximum length of common subsequence.
'''
l1, l2 = len(S1), len(S2)
matrix = [[0 for j in range(l2+1)] for i in range(l1+1)]

for i in range(1, l1+1):
for j in range(1, l2+1):
if S1[i-1] == S2[j-1]:
matrix[i][j] = matrix[i-1][j-1] + 1
else:
matrix[i][j] = max(matrix[i-1][j-1], matrix[i][j-1], matrix[i-1][j])
'''
Dynamic programming. Tracing back the matrix and construct two DNA sequences with gaps.
We saved the restructured sequences into A (originally S1) and B (originally S2).
'''
left_up, up, left = 0, 1, 2
i, j = l1, l2
A, B = [], []
while (i,j) != (0,0):
if i == 0:
i, j = i, j-1
flag = left
elif j == 0:
i, j = i-1, j
flag = up
elif S1[i-1] == S2[j-1] or matrix[i-1][j-1] == max(matrix[i-1][j-1], matrix[i][j-1], matrix[i-1][j]):
i, j, flag = i-1, j-1, left_up
elif matrix[i-1][j] == max(matrix[i-1][j-1], matrix[i][j-1], matrix[i-1][j]):
i, j, flag = i-1, j, up
else:
i, j, flag = i, j-1, left

if flag == left_up:
A.append(S1[i])
B.append(S2[j])
elif flag == up:
A.append(S1[i])
B.append('-')
else:
A.append('-')
B.append(S2[j])
'''
Print info about new DNA sequences with minimum number of mismatches.
'''
A.reverse()
B.reverse()
printA, printB = '',''
for i in range(len(A)):
printA += A[i]
for i in range(len(B)):
printB += B[i]
print('New S1 is: '+ printA)
print('New S2 is: '+ printB)
'''
Calculate the number of mismatches.
'''
mismatch = 0
for i in range(len(A)):
if A[i] != B[i] or (A[i] == '-' and B[i] == '-'):
mismatch += 1
return mismatch

if __name__=="__main__":
S1 = list("GGATCGA")
S2 = list("GAATTCAGTTA")
print(MinMismatch(S1,S2))