models.doc2vec – Deep learning with paragraph2vec
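References:
Using Doc2Vec to obtain vector representations of documents, paragraphs, and sentences
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
A brief analysis of Doc2Vec in gensim
Advanced Gensim tutorial: training word2vec and doc2vec models
Computing text similarity with gensim doc2vec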
Reposted from:
Text clustering with gensim doc2vec + sklearn kmeans
The original page renders poorly, so the content is copied here for easier reading.
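Doc2vec can be used for text similarity: given an input sentence, the model finds the sentences most similar to it. When analyzing a large corpus, however, querying one sentence at a time is impractical, and there is no way to know in advance how the corpus roughly divides into categories. So I decided to cluster the texts instead, with kmeans as the clustering method. Doc2vec produces a vector for each piece of text, and once those vectors exist kmeans is easy to apply. I used the KMeans class from the sklearn library. The program is as follows: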
# coding: utf-8
import os

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans


def get_dataset():
    # Each line of the input file is one document, already segmented into
    # space-separated words.
    with open("out/text_dict_cut.txt", "r", encoding="utf-8") as cf:
        docs = cf.readlines()
    print(len(docs))

    x_train = []
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        # Strip the trailing newline from the last word.
        word_list[-1] = word_list[-1].strip()
        # Tag each document with its line index.
        x_train.append(TaggedDocument(word_list, tags=[i]))
    return x_train


def train(x_train, size=200, epoch_num=100):
    # Newer gensim releases use vector_size in place of the older size argument.
    model_dm = Doc2Vec(min_count=1, window=3, vector_size=size, sample=1e-3,
                       negative=5, workers=4)
    # Build the vocabulary, then train once for epoch_num epochs. Passing the
    # corpus to the constructor and then calling train() would train twice.
    model_dm.build_vocab(x_train)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=epoch_num)
    os.makedirs("model", exist_ok=True)
    model_dm.save("model/model_dm")
    return model_dm


def cluster(x_train):
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")

    print("load train vectors...")
    # TaggedDocument unpacks to (words, tags); infer one vector per document.
    inferred_vectors_list = [model_dm.infer_vector(text) for text, label in x_train]

    print("train kmean model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(inferred_vectors_list)
    # Label only the first 100 documents (fewer if the corpus is smaller).
    labels = kmean_model.predict(inferred_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", "w", encoding="utf-8") as wf:
        for i, label in enumerate(labels):
            # Write the text (words rejoined without spaces), a tab, then the label.
            text = x_train[i][0]
            wf.write("".join(text) + "\t" + str(label) + "\n")
    return cluster_centers


if __name__ == '__main__':
    x_train = get_dataset()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
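The script writes one line per document in the form text, tab, cluster label. To check whether the clusters make sense, it helps to read those lines back and group the texts by label. The following is a minimal sketch of that step, not part of the original program: the helper name group_by_cluster and the toy lines are assumptions made for illustration, and only the standard library is used.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group texts by the cluster label at the end of each tab-separated line.

    Hypothetical helper: assumes each line looks like "text<TAB>label".
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # The label is the last tab-separated field; the text is everything before it.
        text, _, label = line.rpartition("\t")
        clusters[int(label)].append(text)
    return dict(clusters)


# Toy lines in the same "text<TAB>label" layout (made-up data).
sample_lines = ["niceweathertoday\t3\n", "rainexpectedtomorrow\t3\n", "stockpricesrise\t7\n"]
groups = group_by_cluster(sample_lines)
for label in sorted(groups):
    # Print the label, the cluster size, and up to two example texts.
    print(label, len(groups[label]), groups[label][:2])
```

On real output, the same helper could be fed the file object directly, for example by opening out/own_claasify.txt with encoding="utf-8" and passing it as lines. Reading a few texts from each group is a quick way to judge whether 15 clusters is a reasonable choice for the corpus.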