models.doc2vec – Deep learning with paragraph2vec
References:
Using Doc2Vec to obtain vector representations of documents, paragraphs, and sentences
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
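A brief analysis of Doc2Vec in gensim
Advanced Gensim tutorial: training word2vec and doc2vec models
Computing text similarity with gensim doc2vec
Reposted from:
gensim doc2vec + sklearn kmeans for text clustering
The original page was rendered too messily, so it is excerpted here for easier reading.
When doc2vec is used for text similarity, the model can find the sentences most similar to an input sentence. When analyzing a large corpus, however, it is impractical to feed in sentences one by one, and there is no way of knowing in advance roughly how the corpus should be categorized. So I decided to cluster the texts instead, with k-means as the clustering method. Doc2vec yields a vector for each piece of text, and k-means can then operate directly on those vectors. I use the KMeans class from the sklearn library. The program is as follows: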
# coding:utf-8
import gensim
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans

TaggededDocument = gensim.models.doc2vec.TaggedDocument


def get_datasest():
    # Each line of the input file is one pre-segmented document,
    # with words separated by single spaces.
    with open("out/text_dict_cut.txt", 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
        print(len(docs))

    x_train = []
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        # Drop the trailing newline from the last word.
        word_list[-1] = word_list[-1].strip()
        # Tag each document with its line index.
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train


def train(x_train, size=200, epoch_num=1):
    # Note: epoch_num is not used; the number of epochs is fixed at 100 below.
    # gensim 4.x names the dimensionality parameter vector_size
    # (older releases called it size).
    model_dm = Doc2Vec(x_train, min_count=1, window=3, vector_size=size,
                       sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
    model_dm.save('model/model_dm')

    return model_dm


def cluster(x_train):
    infered_vectors_list = []
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")

    print("load train vectors...")
    for text, label in x_train:
        # Infer a vector for each document from its word list.
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)

    print("train kmean model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    # Only the first 100 documents (or fewer, for a small corpus) are written out.
    n = min(100, len(x_train))
    labels = kmean_model.predict(infered_vectors_list[0:n])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w', encoding='utf-8') as wf:
        for i in range(n):
            # Rejoin the words into the original text, then append the cluster label.
            text = x_train[i][0]
            wf.write(''.join(text) + '\t' + str(labels[i]) + '\n')

    return cluster_centers


if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
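The clustering step can also be tried on its own, without a trained doc2vec model. The following is a minimal sketch under stated assumptions, not the original author's code: the document vectors are synthetic stand-ins (three well-separated groups of random 8-dimensional points), and the helper names cluster_vectors and group_by_cluster, the cluster count of 3, and the fixed random seed are all hypothetical choices made for illustration. It shows how KMeans assigns a label to each vector and how those labels can be turned into lists of document indices per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_vectors(vectors, n_clusters, seed=0):
    """Fit KMeans on document vectors and return (labels, cluster centers)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(np.asarray(vectors))
    return labels, model.cluster_centers_


def group_by_cluster(labels):
    """Map each cluster label to the list of document indices assigned to it."""
    groups = {}
    for doc_id, label in enumerate(labels):
        groups.setdefault(int(label), []).append(doc_id)
    return groups


# Synthetic stand-ins for inferred document vectors (an assumption for this
# sketch): three groups of 20 points each, centered far apart from one another.
rng = np.random.default_rng(0)
vectors = np.vstack([
    rng.normal(loc=offset, scale=0.5, size=(20, 8))
    for offset in (0.0, 10.0, 20.0)
])

labels, centers = cluster_vectors(vectors, n_clusters=3)
groups = group_by_cluster(labels)

for label, doc_ids in sorted(groups.items()):
    print(label, len(doc_ids))
```

With real data, the rows of vectors would be the outputs of infer_vector, and the number of clusters would have to be chosen for the corpus at hand; the value 15 in the program above is the original author's choice.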