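Finding Similar Items: Text Similarity Algorithms (Machine Learning, Term Vector Space Cosine, NLTK, diff, Levenshtein Distance)

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf summarizes the topic. There is also this book, http://www-nlp.stanford.edu/IR-book/, which introduces the term vector space model, SVM, and more.

http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf is devoted to the vector space model.

https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf also mentions other approaches, which appear similar to the statistical models used in speech recognition.

Using deep learning to compute document similarity: https://cs224d.stanford.edu/reports/PoulosJackson.pdf and also http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf

A web page that compares text similarity directly: http://www.scurtu.it/documentSimilarity.html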

A collection of answers is here: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents They include using the NLP library NLTK, using diff, or using a scikit-learn term vector space with cosine similarity.

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene also has a method for computing cosine similarity.

Cosine similarity computation in Lucene 3: https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 Note: the computation method differs between Lucene 4 and Lucene 3.

Vector space model (http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)
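
To make one of the options named above concrete, here is a minimal Java sketch of Levenshtein (edit) distance using the standard two-row dynamic programming table. The class name LevenshteinSketch and the normalized similarity helper are illustrative assumptions, not something taken from the linked answers.

```java
final class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions turning s into t. */
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        // Turning the empty prefix of s into the first j characters of t takes j insertions.
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // deleting the first i characters of s
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 when every character differs. */
    static double similarity(String s, String t) {
        int max = Math.max(s.length(), t.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(s, t) / max;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(similarity("kitten", "sitting"));
    }
}
```

Edit distance works on characters, so it suits short, standardized fields such as names or titles. For whole documents, the term-frequency vectors discussed next are usually the cheaper comparison.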

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain <word, frequency> pairs, not words only. Therefore, when you delete a sentence, only the frequencies of the words in that sentence decrease, each by the number of times it occurs there (the words that follow are not shifted). Consider the following example:

Document a:

A B C A A B C. D D E A B. D A B C B A.

Document b:

A B C A A B C. D A B C B A.

Vector a:

A:6, B:5, C:3, D:3, E:1

Vector b:

A:5, B:4, C:3, D:1, E:0

Which results in the following similarity measure:

(6*5+5*4+3*3+3*1+1*0)/(Sqrt(6^2+5^2+3^2+3^2+1^2) Sqrt(5^2+4^2+3^2+1^2+0^2))=

62/(8.94427*7.14143)=

0.970648
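
The arithmetic above can be checked with a small Java sketch that treats each document as a sparse map of <word, frequency> pairs. The class name TermVectorCosine is an illustrative assumption, and so is returning 0 when one vector is empty.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TermVectorCosine {

    /** Cosine similarity of two sparse term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Iterate over the union of terms; a term missing from one vector counts as frequency 0.
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (String term : terms) {
            int x = a.getOrDefault(term, 0);
            int y = b.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // assumption: an empty document is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Vectors a and b from the example above (E:0 is simply left out of b).
        Map<String, Integer> a = Map.of("A", 6, "B", 5, "C", 3, "D", 3, "E", 1);
        Map<String, Integer> b = Map.of("A", 5, "B", 4, "C", 3, "D", 1);
        System.out.printf("%.6f%n", cosine(a, b));
    }
}
```

Because the vectors are keyed by word rather than by position, deleting a sentence only lowers some counts, which is why the two documents stay close to 1.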

MoreLikeThis in Lucene:

You may want to check the MoreLikeThis feature of Lucene.

MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.

http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Sample code (Java):

MoreLikeThis mlt = new MoreLikeThis(reader); // pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // fields used for similarity
Query query = mlt.like(docID); // pass the doc id
TopDocs similarDocs = searcher.search(query, 10); // use the searcher
if (similarDocs.totalHits == 0) {
    // Do handling
}

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene asks:

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance) between two documents in the index.

For example, from a previously opened IndexReader ir I get the documents with ids 2 and 4: Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thank you

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.
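
The first approach in that answer (term frequencies plus document frequencies) can be sketched in plain Java without any Lucene calls. Everything below is an illustrative assumption: the class name TfIdfCosineSketch, the input maps (which in practice would be filled from the term vectors and docFreq lookups), and the weighting tf * (1 + ln(N / (df + 1))), which is one common TF-IDF variant rather than necessarily what Lucene computes.

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfCosineSketch {

    /** Converts raw term frequencies into TF-IDF weights (assumed weighting, see lead-in). */
    static Map<String, Double> weigh(Map<String, Integer> tf, Map<String, Integer> df, int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            // Rare terms (low document frequency) get a larger weight.
            double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    /** Cosine similarity of two sparse weighted vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute to the dot product
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // assumption: an empty document is similar to nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Hypothetical index of 10 documents with made-up frequencies.
        Map<String, Integer> df = Map.of("lucene", 2, "index", 5, "the", 10, "cosine", 1);
        Map<String, Integer> doc1 = Map.of("lucene", 3, "index", 1, "the", 7);
        Map<String, Integer> doc2 = Map.of("lucene", 1, "cosine", 2, "the", 9);
        System.out.println(cosine(weigh(doc1, df, 10), weigh(doc2, df, 10)));
    }
}
```

Weighting by inverse document frequency keeps very common words from dominating the score, which raw term counts alone would allow.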

As Julia points out, Sujit Pal's example is very useful, but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    // Union of the terms seen in both documents; defines the vector dimensions.
    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        // Documents get ids 0 and 1 in the order they were added.
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    // Builds a two-document in-memory index.
    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored, with term vectors. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2)
            throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    // Reads the stored term vector of one document into a term -> frequency map.
    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

    // Lays the frequencies out over the shared term set and scales by the L1 norm.
    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return (RealVector) vector.mapDivide(vector.getL1Norm());
    }
}

