Min Edit Distance
Min Edit Distance
————两字符串之间的最小距离
PPT原稿参见Stanford;http://www.stanford.edu/class/cs124/lec/med.pdf
Tips:由于本人水平有限,对MED的个人理解可能有纰漏之处,请勿尽信。
Edit:个人理解指编辑之意,也即对于两个字符串,对其中的一个进行各种编辑操作(插入、删除、替换)使其变为另一个字符串。要解决的问题是求出最小的编辑操作次数是多少。
基因系列比对
定义距离:
X,Y是大小分别为n,m的字符串。
定义D(i,j)表示X[1..i],Y[1..j]两子字符串的距离。则:
X与Y的距离为D(n,m)
应用动态规划的方法:
对于D(i,j)的计算结果做成表格(矩阵),D(i,j)的运算结果可以建立在之前的结果之上。
对本算法的一点个人理解:
设 i:输出子字符串长度
j:输入子字符串长度
D(0,0)=0
D(i,0):输入0个字符转换到i个字符的输出,也即要插入i个字符,代价为insertCost*i
D(0,j):目标长度为0,输入长度为j,所以代价为:deletionCost*j
D(i,j)的上一步可以分为三种情况:
1、上一步输入长度为j,输出长度为i-1,那么现在这一步肯定至少还要插入一个字符才能达到i长度输出:
D(i-1,j)+inseartCost*1
2、上一步输入长度为j-1,输出长度为i-1,那么现在这一步第j个输入只需要做替换处理(如果第j个输入与第i个输出不相等)或者保持不变(如果第j个输入与第i个输出相等):
D(i-1,j-1)+substituteCost*(source[j]==target[i] ? 0 : 1)
3、上一步输入长度为j-1,输出长度为i,那么现在这一步由于又多了一个字符,所以要把多的这个字符删除:
D(i,j-1)+deletionCost*1
由于我们要求的是最小Edit Distance,自然,就是上述三种情况中最小值做为D(i,j)的值。具体算法如下:
最小编辑距离动态算法(Levenshtein):

D(n,m)即为最小距离。
字符串对齐:
|
|
|
|
上图中从(0,0)到(n,m)的每一个非递减路径即为两字符串的对齐关系。
附加回溯过程的MED:

Weighted Edit Distance:(加权编辑距离)
为什么要加权?
因为有些字符被写错的概率要大些(如搜索引擎中经常能自动搜索到相近的词句)

算法:

Levenshtein算法python实现:
#===============================================================================
# Using dynamic programming to realize the MED algorithm(Levenshtein)
# MED: short for Minimum Edit Distance
#=============================================================================== import types
import numpy as np class MED:
def __init__(self):
self.insCost = 1 #insertion cost
self.delCost = 1 #deletion cost
self.subsCost = 1 #substitution cost
self.D = 0 def mDistance(self, word1, word2):
assert type(word1) is types.StringType
assert type(word2) is types.StringType
size1 = len(word1)
size2 = len(word2)
if size1==0 or size2 ==0:
return size1*self.delCost+size2*self.insCost
D_mat = np.zeros((size1+1,size2+1))
D_mat[:,0] = range(size1+1)
D_mat[0,:] = range(size2+1)
for i in range(1,size1+1):
for j in range(1,size2+1):
insert_cost = D_mat[i-1, j] + self.insCost*1
delete_cost = D_mat[i, j-1] + self.delCost*1
if word1[i-1]==word2[j-1]:
substitue_cost = D_mat[i-1, j-1]
else:
substitue_cost = D_mat[i-1, j-1] + self.subsCost*1
D_mat[i,j] = min(insert_cost, delete_cost, substitue_cost)
self.D = D_mat
return D_mat[size1, size2] if __name__ == "__main__":
word1 = "Function"
word2 = "fanctional"
med = MED()
print "MED distance is :" ,med.mDistance(word1, word2)
输出结果:
MED distance is : 4.0
扩展应用1:利用回溯做字符对齐
对原ED计算函数做点更改(每次得到MED时记录该值的来源,然后从D(n,m)开始利用记录的来源回溯至起始位置,得到该MED的完整路径):
def computingAlignment(self, word1, word2):
assert type(word1) is types.StringType
assert type(word2) is types.StringType
size1 = len(word1)
size2 = len(word2)
if size1==0 or size2 ==0:
return size1*self.delCost+size2*self.insCost
D_mat = np.zeros((size1+1,size2+1))
D_rec = np.zeros((size1+1, size2+1))
D_mat[:,0] = range(size1+1)
D_mat[0,:] = range(size2+1)
for i in range(1,size1+1):
for j in range(1,size2+1):
insert_cost = D_mat[i-1, j] + self.insCost*1
delete_cost = D_mat[i, j-1] + self.delCost*1
if word1[i-1]==word2[j-1]:
substitue_cost = D_mat[i-1, j-1]
else:
substitue_cost = D_mat[i-1, j-1] + self.subsCost*1
D_mat[i,j] = min(insert_cost, delete_cost, substitue_cost)
if D_mat[i,j] == insert_cost:#Record Where the min val comes from
D_rec[i,j] = 1
elif D_mat[i,j]== delete_cost:
D_rec[i,j] = 2
else:
D_rec[i,j] = 3
self.D = D_mat
self.D_rec = D_rec
# BackTrace
alignRevPath=[]#Get the reverse path
j = size2
i = size1
while i!=0 or j!=0:#Be carefull of this row
alignRevPath.append([i,j,D_rec[i,j]])
if D_rec[i,j]==1:
i -=1
elif D_rec[i,j]==2:
j -=1
elif D_rec[i,j]==3:
i -=1
j-=1
elif D_rec[i,j]==0:
if i>0:
i -= 1
if j>0:
j -= 1
alignStr1 =[]
alignStr2 =[]
if alignRevPath[-1][0]!=0:#process the first postion of the path
alignStr1.append(word1[alignRevPath[-1][0]-1])
else:
alignStr1.append('*')
if alignRevPath[-1][1]!=0:
alignStr2.append(word2[alignRevPath[-1][1]-1])
else:
alignStr2.append('*') for i in range(len(alignRevPath)-1, 0, -1): #process the rest of the path
k = np.subtract(alignRevPath[i-1], alignRevPath[i])
bK = k>0
if bK[0]!=0:
alignStr1.append(word1[alignRevPath[i-1][0]-1])
else:
alignStr1.append('*') if bK[1]!=0:
alignStr2.append(word2[alignRevPath[i-1][1]-1])
else:
alignStr2.append('*')
return (alignStr1, alignStr2)
上面的代码中alignRevPath用来保存路径上的每一个位置,其每个元素都为3元列表,前两维为路上的坐标,第3维取0、1、2、3四种值,0表示路径到达边界了,1表示当前的ED结果由前一个insert操作得到,2表示当前ED结果由前一个Delete得到,3表示当前ED结果由前一个substitue得到。
在主程序中加入如下测试代码:
word1 = "efnction"
word2 = "faunctional"
med = MED()
w1, w2 = med.computingAlignment(word1, word2)
print ' '.join(w1)
print '| '*len(w1)
print ' '.join(w2)
得到的输出:

扩展应用2:匹配最长子字符串(不一定连续)
原理:


图示:

从上图可以看出,匹配的并不一定是连续的子串.这是因为我们的惩罚项设置为:

也即迭代过程
中的s(xi,yj)取-1(不匹配)或1(匹配)。当子串中有不匹配的字符出现时,将会对之前的匹配计数减1。
如果想匹配最长连续子串,可以令惩罚项F(i,j)为0(不匹配)或F(i-1,j-1)+1(匹配)
Python 实现(对computingAlignment做少许更改):
def longestSubstr(self, word1, word2):
assert type(word1) is types.StringType
assert type(word2) is types.StringType
size1 = len(word1)
size2 = len(word2)
if size1==0 or size2 ==0:
return size1*self.delCost+size2*self.insCost
D_mat = np.zeros((size1+1,size2+1))
D_rec = np.zeros((size1+1, size2+1))
D_mat[:,0] = 0
D_mat[0,:] = 0
for i in range(1,size1+1):
for j in range(1,size2+1):
insert_cost = D_mat[i-1, j] - self.insCost*1
delete_cost = D_mat[i, j-1] - self.delCost*1
if word1[i-1]==word2[j-1]:
substitue_cost = D_mat[i-1, j-1] +1
else:
substitue_cost = D_mat[i-1, j-1] - self.subsCost*1
# substitue_cost = 0
D_mat[i,j] = max(0,insert_cost, delete_cost, substitue_cost)
if D_mat[i,j] == insert_cost:#Record Where the min val comes from
D_rec[i,j] = 1
elif D_mat[i,j]== delete_cost:
D_rec[i,j] = 2
elif D_mat[i,j]==substitue_cost:
D_rec[i,j] = 3 self.D = D_mat
self.D_rec = D_rec
maxVal = np.max(D_mat)
maxBool = D_mat == maxVal
numMax = np.sum(maxBool)
alignRevPaths=[]
for i in range(numMax):
alignRevPaths.append([])
pathStarts=[]
for i in range(size1+1):
for j in range(size2+1):
if maxBool[i,j]:
pathStarts.append([i,j]) for m in range(numMax):
i = pathStarts[m][0]
j = pathStarts[m][1]
while i!=0 and j!=0:
alignRevPaths[m].append([i,j,D_rec[i,j]])
if D_rec[i,j]==1:
i -=1
elif D_rec[i,j]==2:
j -=1
elif D_rec[i,j]==3:
i -=1
j-=1
elif D_rec[i,j]==0:
break
if D_mat[i,j]==0:
break
str1=[]
str2=[]
for m in range(numMax):
str1.append([])
str2.append([])
p = alignRevPaths[m]
for i in range(len(p)-1, -1,-1):
str1[m].append(word1[p[i][0]-1])
str2[m].append(word2[p[i][1]-1])
return ([''.join(i) for i in str1],[''.join(i) for i in str2])
代码测试:
word1 = "ATCAT"
word2 = "ATTATC"
med = MED()
str1,str2= med.longestSubstr(word1, word2)
print "Longest match substr:"
print "match substr in word1:", str1
print "match substr in word2:", str2
输出结果:
Longest match substr:
match substr in word1: ['ATC', 'ATCAT']
match substr in word2: ['ATC', 'ATTAT']
如果把longestSubstr中的
substitue_cost = D_mat[i-1, j-1] - self.subsCost*1
改为:
substitue_cost = 0
即可用来求最大连续子字符串,同样运行上述测试代码,可得:
Longest match substr:
match substr in word1: ['ATC']
match substr in word2: ['ATC']
改用另外一组字符串进行实验进行进一步验证:
word1 = "fefnction"
word2 = "faunctional"
med = MED()
str1,str2= med.longestSubstr(word1, word2)
print "Longest match substr:"
print "match substr in word1:", str1
print "match substr in word2:", str2
当求解的是最长子字符串(非连续)时,输出:
Longest match substr:
match substr in word1: ['nction']
match substr in word2: ['nction']
当求解的是最长连续子字符串时,输出:
Longest match substr:
match substr in word1: ['nction']
match substr in word2: ['nction']
在以上的算法中,MED的一个最大的特点就是利用了矩阵保存之前处理的结果数据,以做为下一次的输入。对于最长子字符串的查找,同样套用了MED的框架,但仔细一想我们会发现,其实最长子串的查找并不一定需要记录路径alignRevPaths,有了alignRevPaths这个路径,我们编程时是根据这个路径处理字符的。路径的设置主要是为了在computingAlignment中对字符进行对齐,因为字符对齐的情况下,字符不一定是连续,比如会有如下对齐的形式中的“*”号:
所以我们才想到要用路径做记录。
但在最长子串中,子串肯定是连续的,自然路径也就不需要了。
下面对最长子串程序做简化:
def longestSubstr2(self,word1, word2):
assert type(word1) is types.StringType
assert type(word2) is types.StringType
size1 = len(word1)
size2 = len(word2)
if size1==0 or size2 ==0:
return size1*self.delCost+size2*self.insCost
D_mat = np.zeros((size1+1,size2+1))
D_mat[:,0] = 0
D_mat[0,:] = 0
for i in range(1,size1+1):
for j in range(1,size2+1):
insert_cost = D_mat[i-1, j] - self.insCost*1
delete_cost = D_mat[i, j-1] - self.delCost*1
if word1[i-1]==word2[j-1]:
substitue_cost = D_mat[i-1, j-1] +1
else:
substitue_cost = D_mat[i-1, j-1] - self.subsCost*1
# substitue_cost = 0 D_mat[i,j] = max(0,insert_cost, delete_cost, substitue_cost)
self.D = D_mat
maxVal = np.max(D_mat)
maxBool = D_mat == maxVal
numMax = np.sum(maxBool)
alignRevPaths=[]
for i in range(numMax):
alignRevPaths.append([])
pathStarts=[]
for i in range(size1+1):
for j in range(size2+1):
if maxBool[i,j]:
pathStarts.append([i,j]) str1=[]
str2=[]
for m in range(numMax):
str1.append([])
str2.append([])
i = pathStarts[m][0]
j = pathStarts[m][1]
s1Tmp = []
s2Tmp = []
while D_mat[i,j]!=0:
s1Tmp.append(word1[i-1])
s2Tmp.append(word2[j-1])
i -= 1
j -= 1
str1[m]=[s1Tmp[len(s1Tmp)-i-1] for i in range(len(s1Tmp))]
str2[m]=[s2Tmp[len(s2Tmp)-i-1] for i in range(len(s2Tmp))]
return ([''.join(i) for i in str1],[''.join(i) for i in str2])
使用以下字符做实验:
word1 = "ATCAT"
word2 = "ATTATC"
输出结果同样为:
Longest match substr:
match substr in word1: ['ATC', 'ATCAT']
match substr in word2: ['ATC', 'ATTAT']
一点个人感悟:
虽然这里的动态规划算法看上去有点不可思议,但联想起线性代数中的矩阵运算也就不难理解了。
就拿最长子串的程序来说,其实际过程仍可看做:对每word1中的每一个字符在word2中进行查找,当匹配第一个字符后继续匹配第二个字符,然后第三、第四……个,直到有字符不匹配时,记录该匹配成功的串长度及字串始末位置。
而这里的Smith-waterman算法将这个匹配记录在一个2维矩阵(数组)中。我们都知道,线性代数中的矩阵乘法同样是可以展开为各元素的乘与累加操作的。而这里,我认为,发明者正是利用了这一思想。
Min Edit Distance的更多相关文章
- 动态规划 求解 Minimum Edit Distance
http://blog.csdn.net/abcjennifer/article/details/7735272 自然语言处理(NLP)中,有一个基本问题就是求两个字符串的minimal Edit D ...
- 72. Edit Distance
题目: Given two words word1 and word2, find the minimum number of steps required to convert word1 to w ...
- [LeetCode] One Edit Distance 一个编辑距离
Given two strings S and T, determine if they are both one edit distance apart. 这道题是之前那道Edit Distance ...
- [LeetCode] Edit Distance 编辑距离
Given two words word1 and word2, find the minimum number of steps required to convert word1 to word2 ...
- Edit Distance
Edit Distance Given two words word1 and word2, find the minimum number of steps required to convert ...
- 编辑距离——Edit Distance
编辑距离 在计算机科学中,编辑距离是一种量化两个字符串差异程度的方法,也就是计算从一个字符串转换成另外一个字符串所需要的最少操作步骤.不同的编辑距离中定义了不同操作的集合.比较常用的莱温斯坦距离(Le ...
- LintCode Edit Distance
LintCode Edit Distance Given two words word1 and word2, find the minimum number of steps required to ...
- 【leetcode】Edit Distance
Edit Distance Given two words word1 and word2, find the minimum number of steps required to convert ...
- ✡ leetcode 161. One Edit Distance 判断两个字符串是否是一步变换 --------- java
Given two strings S and T, determine if they are both one edit distance apart. 给定两个字符串,判断他们是否是一步变换得到 ...
随机推荐
- React gulp、Browserify、Webpack实例
一.gulp var gulp = require('gulp'); var react = require('gulp-react'); gulp.task('jsx', function() { ...
- highcharts 结合phantomjs纯后台生成图片
highcharts 结合phantomjs纯后台生成图片 highcharts 这个图表展示插件我想大家应该都知道,纯javascript编写,相比那些flash图表插件有很大的优势,至少浏览器不用 ...
- 三个Timer
System.Windows.Forms.Timer System.Timers.Timer System.Threading.Timer 关于这三者的区别,可以参见msdn https:// ...
- Android开发之单例模式
参考:http://blog.csdn.net/guolin_blog/article/details/8860649 http://www.cnblogs.com/liyiran/p/5283690 ...
- [HDOJ3635]Dragon Balls(并查集,路径压缩)
题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=3635 题意:有n个龙珠,n个城市.初始状态第i个龙珠在第i个城市里.接下来有两个操作: T A B:把 ...
- Mac下安装HBase及详解
Mac下安装HBase及详解 1. 千篇一律的HBase简介 HBase是Hadoop的数据库, 而Hive数据库的管理工具, HBase具有分布式, 可扩展及面向列存储的特点(基于谷歌BigTabl ...
- Alpha、Beta、RC、GA版本的区别
Alpha:是内部测试版,一般不向外部发布,会有很多Bug.一般只有测试人员使用. Beta:也是测试版,这个阶段的版本会一直加入新的功能.在Alpha版之后推出. RC:(Release Candi ...
- bzoj1054: [HAOI2008]移动玩具
hash+bfs:要注意特殊情况.(似乎连sort.lower_bound都不用数据小直接判重了... #include<cstdio> #include<cstring> # ...
- 漫谈 polling 和 Websocket
Http被设计成了一个单向的通信的协议,即客户端发起一个request,然后服务器回应一个response.这让服务器很为恼火:我特么才是老大,我居然不能给小弟发消息... 轮询 老大发火了,小弟们自 ...
- unicode string和ansi string的转换函数及获取程序运行路径的代码
#pragma once#include <string> namespace stds { class tool { public: std::string ws2s(const std ...




