Levenshtein distance 编辑距离

编辑距离，又称Levenshtein距离，是指两个字串之间，由一个转成另一个所需的最少编辑操作次数。许可的编辑操作包括将一个字符替换成另一个字符，插入一个字符，删除一个字符

实现方案：

1. 找出最长公共子串长度

参考代码：

apache commons-lang

public static int getLevenshteinDistance(CharSequence s, CharSequence t) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }

    /*
       The difference between this impl. and the previous is that, rather
       than creating and retaining a matrix of size s.length() + 1 by t.length() + 1,
       we maintain two single-dimensional arrays of length s.length() + 1.  The first, d,
       is the 'current working' distance array that maintains the newest distance cost
       counts as we iterate through the characters of String s.  Each time we increment
       the index of String t we are comparing, d is copied to p, the second int[].  Doing so
       allows us to retain the previous cost counts as required by the algorithm (taking
       the minimum of the cost count to the left, up one, and diagonally up and to the left
       of the current cost count being calculated).  (Note that the arrays aren't really
       copied anymore, just switched...this is clearly much better than cloning an array
       or doing a System.arraycopy() each time  through the outer loop.)

       Effectively, the difference between the two implementations is this one does not
       cause an out of memory condition when calculating the LD over two very large strings.
     */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    if (n == 0) {
        return m;
    } else if (m == 0) {
        return n;
    }

    if (n > m) {
        // swap the input strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; //'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; //placeholder to assist in swapping p and d

    // indexes into strings s and t
    int i; // iterates through s
    int j; // iterates through t

    char t_j; // jth character of t

    int cost; // cost

    for (i = 0; i <= n; i++) {
        p[i] = i;
    }

    for (j = 1; j <= m; j++) {
        t_j = t.charAt(j - 1);
        d[0] = j;

        for (i = 1; i <= n; i++) {
            cost = s.charAt(i - 1) == t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts
    return p[n];
}

Levenshtein distance 编辑距离的更多相关文章

利用Levenshtein Distance (编辑距离)实现文档相似度计算
1.首先将word文档解压缩为zip /** * 修改后缀名 */ public static String reName(String path){ File file=new File(path) ...
Levenshtein distance 编辑距离算法
这几天再看 virtrual-dom,关于两个列表的对比,讲到了 Levenshtein distance 距离,周末抽空做一下总结. Levenshtein Distance 介绍在信息理论和计算 ...
Levenshtein Distance (编辑距离) 算法详解
编辑距离即从一个字符串变换到另一个字符串所需要的最少变化操作步骤(以字符为单位,如son到sun,s不用变,将o->s,n不用变,故操作步骤为1). 为了得到编辑距离,我们画一张二维表来理解,以 ...
C#实现Levenshtein distance最小编辑距离算法
Levenshtein distance,中文名为最小编辑距离,其目的是找出两个字符串之间需要改动多少个字符后变成一致.该算法使用了动态规划的算法策略,该问题具备最优子结构,最小编辑距离包含子最小编辑 ...
Levenshtein Distance算法（编辑距离算法）
编辑距离编辑距离(Edit Distance),又称Levenshtein距离,是指两个字串之间,由一个转成另一个所需的最少编辑操作次数.许可的编辑操作包括将一个字符替换成另一个字符,插入一个字符, ...
字符串相似度算法（编辑距离算法 Levenshtein Distance）（转）
在搞验证码识别的时候需要比较字符代码的相似度用到“编辑距离算法”,关于原理和C#实现做个记录. 据百度百科介绍: 编辑距离,又称Levenshtein距离(也叫做Edit Distance),是指两个 ...
扒一扒编辑距离（Levenshtein Distance）算法
最近由于工作需要,接触了编辑距离(Levenshtein Distance)算法.赶脚很有意思.最初百度了一些文章,但讲的都不是很好,读起来感觉似懂非懂.最后还是用google找到了一些资料才慢慢理解 ...
字符串相似度算法（编辑距离算法 Levenshtein Distance）
在搞验证码识别的时候需要比较字符代码的相似度用到“编辑距离算法”,关于原理和C#实现做个记录.据百度百科介绍:编辑距离,又称Levenshtein距离(也叫做Edit Distance),是指两个字串 ...
用C#实现字符串相似度算法（编辑距离算法 Levenshtein Distance）
在搞验证码识别的时候需要比较字符代码的相似度用到"编辑距离算法",关于原理和C#实现做个记录. 据百度百科介绍: 编辑距离,又称Levenshtein距离(也叫做Edit Dist ...

随机推荐

Pomelo的component组件
pomelo的核心是由一系列松耦合的component组成,同时我们也可以实现我们自己的component来完成一些自己定制的功能.对于我们的聊天应用,我们尝试给其增加一个component,目的是展 ...
一篇完整的FlexBox布局指南
一篇完整的FlexBox布局指南转载请标注本文链接并附带以下信息: 译:Cydiacen 作者:CHRIS COYIER 原文:A Complete Guide to Flexbox 原文更新于 2 ...
内功心法 -- java.util.ArrayList<E> (3)
写在前面的话:读书破万卷,编码如有神--------------------------------------------------------------------下文主要对java.util ...
写给Java开发者的Node.JS简介
前言今天上推特看见这篇文章,点进去发现是新货. 正好最近想入Node的坑,又有一些Java基础,所以希望翻译出来给大家,同时也让自己加深理解. 才疏学浅,如有不妥之处请指正. 原文链接:Node f ...
构建自动化前端样式回归测试——BackstopJS篇
在使用scss和less开发的时候,遇到过一件很有趣的事,因为网站需要支持响应式,就开了一个响应式样式框架,简单的几百行scss代码,居然生成了近100KB的css代码,因此决定重构这个样式库.而重构 ...
Swift 2.0 自定义cell和不同风格的cell
昨天我们写了使用系统的cell怎样创建tableView,今天我们再细分一下,就是不同风格的cell,我们怎写代码.先自己创建一个cell,继承于UItableviewcell 我们看看 cell 里 ...
开发mis系统的技术
一.b/s架构 b/s架构:就broser/server,浏览器/服务器的说法.服务器端要运行tomcat,提供链接数据库服务供java代码读写数据,这个可以在eclipse中配置运行.浏览器则解释j ...
C# 类型和变量
C# 中的类型有两种:值类型 (value type) 和引用类型 (reference type).值类型的变量直接包含它们的数据,而引用类型的变量存储对它们的数据的引用,后者称为对象.对于引用类型 ...
C# 6 与 .NET Core 1.0 高级编程 - 40 ASP.NET Core（上）
译文,个人原创,转载请注明出处(C# 6 与 .NET Core 1.0 高级编程 - 40 章 ASP.NET Core(上)),不对的地方欢迎指出与交流. 章节出自<Professiona ...
weex官方demo weex-hackernews代码解读(1)
一.介绍 weex 是阿里出品的一个类似RN的框架,可以使用前端技术来开发移动应用,实现一份代码支持H5,IOS和Android.最新版本的weex已默认将vue.js作为前端框架,而weex-hac ...

Levenshtein distance 编辑距离

Levenshtein distance 编辑距离的更多相关文章

随机推荐

热门专题