Sentence similarity with TF and TF-IDF

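import math
from math import isnan

import jieba


def jieba_function(sent):
    """Segment a Chinese sentence with jieba and return the tokens joined by spaces."""
    # jieba.cut yields str tokens; the space-joined string is the "document"
    # format that the scikit-learn vectorizers below expect.
    return ' '.join(jieba.cut(sent))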
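

def count_cos_similarity(vec_1, vec_2):
    """Cosine similarity of two equal-length vectors; 0.0 when it is undefined."""
    if len(vec_1) != len(vec_2):
        return 0.0

    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    den1 = math.sqrt(sum(a * a for a in vec_1))
    den2 = math.sqrt(sum(b * b for b in vec_2))
    # An all-zero vector has no direction, so avoid dividing by zero.
    if den1 == 0 or den2 == 0:
        return 0.0
    return dot / (den1 * den2)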


# Count-based (term frequency) similarity; both arguments are raw strings.
def tf(sent1, sent2):
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [jieba_function(sent1), jieba_function(sent2)]

    # Note: the default token_pattern keeps only tokens of two or more
    # characters, so single-character words are dropped.
    count_vec = CountVectorizer()
    # Fit once; both rows share one vocabulary, so the vectors have equal length.
    vectors = count_vec.fit_transform(sentences).toarray()

    print('sentences', sentences)
    print('vector', vectors)  # count vector of each sentence
    # Vocabulary term behind each vector dimension (scikit-learn >= 1.0;
    # older releases call this get_feature_names).
    print('cut_word', count_vec.get_feature_names_out())

    similarity = count_cos_similarity(vectors[0], vectors[1])
    if isnan(similarity):
        similarity = 0.0

    print('count_cos_similarity', similarity)
    return similarity


# TF-IDF-weighted similarity; both arguments are raw strings.
def tfidf(sent1, sent2):
    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = [jieba_function(sent1), jieba_function(sent2)]

    tfidf_vec = TfidfVectorizer()
    # Fit once; both rows share one vocabulary, so the vectors have equal length.
    vectors = tfidf_vec.fit_transform(sentences).toarray()

    similarity = count_cos_similarity(vectors[0], vectors[1])
    if isnan(similarity):
        similarity = 0.0
    return similarity

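
if __name__ == '__main__':
    # "I like watching TV and I also like watching movies,"
    sent1 = '我喜欢看电视也喜欢看电影，'
    # "I don't like watching TV and I don't like watching movies"
    sent2 = '我不喜欢看电视也不喜欢看电影'
    print('<<<<tf<<<<<<<')
    tf(sent1, sent2)
    print('<<<<tfidf<<<<<<<')
    # tfidf() only returns its result, so print it here.
    print('tfidf_cos_similarity', tfidf(sent1, sent2))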
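
One caveat about the script above: scikit-learn's `CountVectorizer` and `TfidfVectorizer` use the default `token_pattern` `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, so single-character Chinese words such as 不 ("not") are dropped before counting. The sketch below reproduces that effect in plain Python. The token lists are an assumed segmentation written by hand (jieba's real output may differ), and `bow_cosine` is a hypothetical helper that is not part of the original script.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b, min_len=1):
    """Cosine similarity of raw term counts, ignoring tokens shorter than min_len."""
    count_a = Counter(t for t in tokens_a if len(t) >= min_len)
    count_b = Counter(t for t in tokens_b if len(t) >= min_len)
    # Shared vocabulary so both count vectors have the same dimensions.
    vocab = sorted(set(count_a) | set(count_b))
    vec_a = [count_a[w] for w in vocab]
    vec_b = [count_b[w] for w in vocab]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0


# Assumed segmentation of the two example sentences (not actual jieba output).
tokens_1 = ['我', '喜欢', '看电视', '也', '喜欢', '看', '电影']
tokens_2 = ['我', '不', '喜欢', '看电视', '也', '不', '喜欢', '看', '电影']

sim_two_plus = bow_cosine(tokens_1, tokens_2, min_len=2)  # mimics the default token_pattern
sim_all = bow_cosine(tokens_1, tokens_2, min_len=1)       # keeps single-character words
print(sim_two_plus, sim_all)
```

Under this assumed segmentation, the two-character filter leaves both sentences with identical counts, so the negation is invisible to the similarity score. If that matters, passing `token_pattern=r"(?u)\b\w+\b"` to the vectorizer keeps single-character tokens.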

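To make the TF-IDF weighting less of a black box, here is a minimal plain-Python sketch of the computation. It assumes the scheme that `TfidfVectorizer` applies with its default settings as I understand them: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. `tfidf_vectors` and the two sample documents are hypothetical and only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one L2-normalized TF-IDF vector per tokenized doc."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each term occurs.
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    # Smoothed inverse document frequency (assumed formula, see the note above).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors


# Hypothetical pre-segmented documents that differ in one term.
doc_a = ['喜欢', '电视', '电影']
doc_b = ['喜欢', '电视', '音乐']

vocab, (vec_a, vec_b) = tfidf_vectors([doc_a, doc_b])
# The vectors are already unit length, so their dot product is the cosine.
tfidf_sim = sum(a * b for a, b in zip(vec_a, vec_b))
print(vocab)
print(tfidf_sim)
```

For this pair, plain counts give a cosine of 2/3. TF-IDF upweights the terms that occur in only one document, which lowers the similarity to about 0.50.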