models.doc2vec – Deep learning with paragraph2vec
参考:
用 Doc2Vec 得到文档/段落/句子的向量表达
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
基于gensim的Doc2Vec简析
Gensim进阶教程:训练word2vec与doc2vec模型
用gensim doc2vec计算文本相似度
转自:
gensim doc2vec + sklearn kmeans 做文本聚类
原文显示太乱 为方便看摘录过来。。
用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。 选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。 选择sklearn库中的KMeans类。 程序如下:
# coding:utf-8
import sys
import gensim
import numpy as np
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans
TaggededDocument = gensim.models.doc2vec.TaggedDocument
def get_datasest():
with open("out/text_dict_cut.txt", 'r') as cf:
docs = cf.readlines()
print len(docs)
x_train = []
#y = np.concatenate(np.ones(len(docs)))
for i, text in enumerate(docs):
word_list = text.split(' ')
l = len(word_list)
word_list[l-1] = word_list[l-1].strip()
document = TaggededDocument(word_list, tags=[i])
x_train.append(document)
return x_train
def train(x_train, size=200, epoch_num=1):
model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
model_dm.save('model/model_dm')
return model_dm
def cluster(x_train):
infered_vectors_list = []
print "load doc2vec model..."
model_dm = Doc2Vec.load("model/model_dm")
print "load train vectors..."
i = 0
for text, label in x_train:
vector = model_dm.infer_vector(text)
infered_vectors_list.append(vector)
i += 1
print "train kmean model..."
kmean_model = KMeans(n_clusters=15)
kmean_model.fit(infered_vectors_list)
labels= kmean_model.predict(infered_vectors_list[0:100])
cluster_centers = kmean_model.cluster_centers_
with open("out/own_claasify.txt", 'w') as wf:
for i in range(100):
string = ""
text = x_train[i][0]
for word in text:
string = string + word
string = string + '\t'
string = string + str(labels[i])
string = string + '\n'
wf.write(string)
return cluster_centers
if __name__ == '__main__':
x_train = get_datasest()
model_dm = train(x_train)
cluster_centers = cluster(x_train)
models.doc2vec – Deep learning with paragraph2vec的更多相关文章
- DEEP LEARNING WITH STRUCTURE
DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...
- deep learning新征程
deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan 2015-11-26 声明: 1 ...
- A Statistical View of Deep Learning (I): Recursive GLMs
A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...
- What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
- 《Deep Learning》(深度学习)中文版 开发下载
<Deep Learning>(深度学习)中文版开放下载 <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...
- How To Improve Deep Learning Performance
如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...
- 深度学习Deep learning
In the last chapter we learned that deep neural networks are often much harder to train than shallow ...
- 《Deep Learning》全书已完稿_附全书电子版
Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...
随机推荐
- 七、持久层框架(MyBatis)
一.MyBatis学习 平时我们都用JDBC访问数据库,除了自己需要写SQL,还要操作Connection,Statement,ResultSet这些. 使用MyBatis,只需要自己提供SQL语句, ...
- python匿名函数以及return语句
- Lock、synchronized和ReadWriteLock,StampedLock戳锁的区别和联系以及Condition
https://www.cnblogs.com/RunForLove/p/5543545.html 先来看一段代码,实现如下打印效果: 1 2 A 3 4 B 5 6 C 7 8 D 9 10 E 1 ...
- 模块之 logging, shelve, sys 模块
一. logging模块 用来记录日志,日志:记录某个时间点发生了什么事 日志作用:程序调试 了解软件程序的运行情况,是否正常 软件程序运行故障分析与问题定位 还可用来做用户行为分析 日志等级:在不改 ...
- ueeditor 百度编译器使用onchange效果
<script id="editor" type="text/plain" style="width:100%;height:200px;&qu ...
- python---多线程与多进程
一. 单进程多线程 1. 使用的模块是Threading.使用join()函数进行阻塞. from pdf2txt import pdfTotxt1, pdfTotxt2 import xlrd im ...
- mysql 下载资源地址
http://mirror.neu.edu.cn/mysql/Downloads/MySQL-5.6/
- learning at commad AT+CPSI
[Purpose] Learning how to get mobile network info [Eevironment] Shell terminal, base on gcom command ...
- css3 前端开发
一.前缀: -moz(例如 -moz-border-radius)用于Firefox -webkit(例如:-webkit-border-radius)用于Safari和Chrome. 二.CSS3圆 ...
- 回溯算法 LEETCODE别人的小结 一八皇后问题
回溯算法实际上是一个类似枚举的搜索尝试过程,主要是在搜索尝试中寻找问题的解,当发现已不满足求解条件时,就回溯返回,尝试别的路径. 回溯法是一种选优搜索法,按选优条件向前搜索,以达到目的.但是当探索到某 ...