唐诗掠影：基于词移距离（Word Mover's Distance）的唐诗诗句匹配实践

词移距离（Word Mover's Distance）是在词向量的基础上发展而来的用来衡量文档相似性的度量。

词移距离的具体介绍参考http://blog.csdn.net/qrlhl/article/details/78512598 或网上的其他资料

词移距离的gensim官方例子在https://github.com/RaRe-Technologies/gensim/blob/c971411c09773488dbdd899754537c0d1a9fce50/docs/notebooks/WMD_tutorial.ipynb

此处，用词移距离来衡量唐诗诗句的相关性。为什么用唐诗？因为全唐诗的txt很容易获取，随便一搜就可以下载了。全唐诗txt链接：https://files.cnblogs.com/files/combfish/%E5%85%A8%E5%94%90%E8%AF%97.zip。

步骤：

1. 预处理语料集：唐诗的断句分词，断句基于标点符号，分词依靠结巴分词

2. gensim训练词向量模型与wmd相似性模型

3. 查询

代码：

import jieba

from nltk import word_tokenize

from nltk.corpus import stopwords

from time import time

start_nb = time()

import logging

print(20*'*','loading data',40*'*')

f=open('全唐诗.txt',encoding='utf-8')

lines=f.readlines()

corpus=[]

documents=[]

useless=[',','.','(',')','!','?','\'','\"',':','<','>',

         '，', '。', '（', '）', '！', '？', '’', '“','：','《','》','[',']','【','】']

for each in lines:

    each=each.replace('\n','')

    each.replace('-','')

    each=each.strip()

    each=each.replace(' ','')

    if(len(each)>3):

        if(each[0]!='卷'):

            documents.append(each)

            each=list(jieba.cut(each))

            text=[w for w in each if not w in useless]

            corpus.append(text)

print(len(corpus))

print(20*'*','trainning models',40*'*')

from gensim.models import Word2Vec

model = Word2Vec(corpus, workers=3, size=100)

# Initialize WmdSimilarity.

from gensim.similarities import WmdSimilarity

num_best = 10

instance = WmdSimilarity(corpus, model, num_best=10)

print(20*'*','testing',40*'*')

while True:

    sent = input('输入查询语句： ')

    sent_w = list(jieba.cut(sent))

    query = [w for w in sent_w if not w in useless]

    sims = instance[query]  # A query is simply a "look-up" in the similarity class.

    # Print the query and the retrieved documents, together with their similarities.

    print('Query:')

    print(sent)

    for i in range(num_best):

        print

        print('sim = %.4f' % sims[i][1])

        print(documents[sims[i][0]])

结果：从结果kan

<wiz_tmp_tag id="wiz-table-range-border" contenteditable="false" style="display: none;">

来自为知笔记(Wiz)

唐诗掠影：基于词移距离（Word Mover's Distance）的唐诗诗句匹配实践的更多相关文章

Distributed Sentence Similarity Base on Word Mover's Distance
Algorithm: Refrence from one ICML15 paper: Word Mover's Distance. 1. First use Google's word2vec too ...
文本情感分析(一)：基于词袋模型(VSM、LSA、n-gram)的文本表示
现在自然语言处理用深度学习做的比较多,我还没试过用传统的监督学习方法做分类器,比如SVM.Xgboost.随机森林,来训练模型.因此,用Kaggle上经典的电影评论情感分析题,来学习如何用传统机器学习 ...
【CV知识学习】【转】beyond Bags of features for rec scenen categories。基于词袋模型改进的自然场景识别方法
原博文地址:http://www.cnblogs.com/nobadfish/articles/5244637.html 原论文名叫Byeond bags of features:Spatial Py ...
Earth Mover's Distance (EMD)
原文: http://d.hatena.ne.jp/aidiary/20120804/1344058475作者: sylvan5翻译: Myautsai和他的朋友们(Google Translate. ...
[转]Earth Mover's Distance (EMD)
转自:http://www.sigvc.org/bbs/forum.php?mod=viewthread&tid=981 Earth Mover's Distance (EMD)原文: htt ...
基于ABP落地领域驱动设计-05.实体创建和更新最佳实践
目录系列文章数据传输对象输入DTO最佳实践不要在输入DTO中定义不使用的属性不要重用输入DTO 输入DTO中验证逻辑输出DTO最佳实践对象映射学习帮助系列文章基于ABP落地领域驱动 ...
The Earth Mover's Distance
The EMD is based on the minimal cost that must be paid to transform one distribution into the other. ...
[Swift]LeetCode748. 最短完整词 | Shortest Completing Word
Find the minimum length word from a given dictionary words, which has all the letters from the strin ...
[python] 基于词云的关键词提取：wordcloud的使用、源码分析、中文词云生成和代码重写
1. 词云简介词云,又称文字云.标签云,是对文本数据中出现频率较高的“关键词”在视觉上的突出呈现,形成关键词的渲染形成类似云一样的彩色图片,从而一眼就可以领略文本数据的主要表达意思.常见于博客.微博 ...

随机推荐

图像处理之基础---2个YUV视频拼接技术
/************************************************* * 主要功能:两路 YUV4:2:0拼接一路左右半宽格式YUV视频参考资料:http://www ...
Linq系列(7)——表达式树之ExpressionVisitor
大家好,由于今天项目升级,大家都在获最新代码,所以我又有时间在这里写点东西,跟大家分享. 在上一篇的文章中我介绍了一个dll,使大家在debug的时候可以可视化的看到ExpressionTree的Bo ...
由浅到深理解ROS（3）-命名空间
全局命名空间: /rosout前面的反斜杠“/”表明该节点名称属于全局命名空间.之所以叫做全局名称因为它们在任何地方(包括代码.命令行工具.图形界面工具等的任何地方)都可以使用.无论这些名称用作众多命 ...
基于jquery的bootstrap在线文本编辑器插件Summernote （转）
Summernote是一个基于jquery的bootstrap超级简单WYSIWYG在线编辑器.Summernote非常的轻量级,大小只有30KB,支持Safari,Chrome,Firefox.Op ...
why factory pattern and when to use factory pattern
1 factory pattern本质上就是对对象创建进行抽象抽象的好处是显然的,可以方便用户去获取对象. 2 使用factory pattern的时机第一,当一个对象的创建依赖于其它很多对象的时 ...
iOS基础动画的KeyPath取值
一 .基础动画 1.基础动画的属性详解注:Core Animation的动画执行过程都是在后台操作的,不会阻塞主线程. 属性解读 Autoreverses 设定这个属性为 YES 时,在它到达目的 ...
php在不同平台下路径分隔符不同的解决办法
在看phpamf时看到一个常量“DIRECTORY_SEPARATOR”,最后发现是一个全局的常量,用来定义路径分隔符的主要解决在windows和linux下路径分隔符不同的造成代码不通用的问题,在 ...
ElasticSearch（十七）初识倒排索引
现在有两条document: doc1:I really liked my small dogs, and I think my mom also liked them. doc2:He never ...
Django继承HTML模板
Django在渲染模板的过程中可以实现模板样式的继承,以减少重复的代码 1.extend继承模板.html: 模板内容 {{% block name1 %}} {{% enfblock %}} #n ...
CentOS7安装MySQL8.0小计
之前讲配置文件和权限的时候有很多MySQL8的知识,有同志说安装不太一样,希望发个文,我这边简单演示一下 1.环境安装下载MySQL提供的CentOS7的yum源官方文档:<https:// ...

唐诗掠影：基于词移距离（Word Mover's Distance）的唐诗诗句匹配实践

唐诗掠影：基于词移距离（Word Mover's Distance）的唐诗诗句匹配实践的更多相关文章

随机推荐

热门专题