models.doc2vec – Deep learning with paragraph2vec
参考:
用 Doc2Vec 得到文档/段落/句子的向量表达
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
基于gensim的Doc2Vec简析
Gensim进阶教程:训练word2vec与doc2vec模型
用gensim doc2vec计算文本相似度
转自:
gensim doc2vec + sklearn kmeans 做文本聚类
原文显示太乱 为方便看摘录过来。。
用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。 选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。 选择sklearn库中的KMeans类。 程序如下:
# coding:utf-8
import sys
import gensim
import numpy as np
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans
TaggededDocument = gensim.models.doc2vec.TaggedDocument
def get_datasest():
with open("out/text_dict_cut.txt", 'r') as cf:
docs = cf.readlines()
print len(docs)
x_train = []
#y = np.concatenate(np.ones(len(docs)))
for i, text in enumerate(docs):
word_list = text.split(' ')
l = len(word_list)
word_list[l-1] = word_list[l-1].strip()
document = TaggededDocument(word_list, tags=[i])
x_train.append(document)
return x_train
def train(x_train, size=200, epoch_num=1):
model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
model_dm.save('model/model_dm')
return model_dm
def cluster(x_train):
infered_vectors_list = []
print "load doc2vec model..."
model_dm = Doc2Vec.load("model/model_dm")
print "load train vectors..."
i = 0
for text, label in x_train:
vector = model_dm.infer_vector(text)
infered_vectors_list.append(vector)
i += 1
print "train kmean model..."
kmean_model = KMeans(n_clusters=15)
kmean_model.fit(infered_vectors_list)
labels= kmean_model.predict(infered_vectors_list[0:100])
cluster_centers = kmean_model.cluster_centers_
with open("out/own_claasify.txt", 'w') as wf:
for i in range(100):
string = ""
text = x_train[i][0]
for word in text:
string = string + word
string = string + '\t'
string = string + str(labels[i])
string = string + '\n'
wf.write(string)
return cluster_centers
if __name__ == '__main__':
x_train = get_datasest()
model_dm = train(x_train)
cluster_centers = cluster(x_train)
models.doc2vec – Deep learning with paragraph2vec的更多相关文章
- DEEP LEARNING WITH STRUCTURE
DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...
- deep learning新征程
deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan 2015-11-26 声明: 1 ...
- A Statistical View of Deep Learning (I): Recursive GLMs
A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...
- What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
- 《Deep Learning》(深度学习)中文版 开发下载
<Deep Learning>(深度学习)中文版开放下载 <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...
- How To Improve Deep Learning Performance
如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...
- 深度学习Deep learning
In the last chapter we learned that deep neural networks are often much harder to train than shallow ...
- 《Deep Learning》全书已完稿_附全书电子版
Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...
随机推荐
- Tomcat指定JDK路径(Linux+Windows)
当系统有多套JDK,不方便在系统配统一的JAVA_HOME时,我们可能想给tomcat指定JDK路径. 1.Linux下Tomcat指定JDK路径 找到$CATALINE_HOME/bin/catal ...
- MySQL设置白名单教程
1 登录mysql mysql -h host -u username -p password 2 切换至mysql库 use mysql; 3 查看当前允许登录IP及用户 select Host,U ...
- Oracle中如何停止正在执行SQL语句
oracle的用P/SQL客户端中,如何停止正在执行的SQL语句? 我们使用oracle语句查询某个表时,如果查询的表数据太多,如何停止正在执行操作 如查询的表数据超过上万条时,如何停止查询操作
- js 日期格式化函数(可自定义)
js 日期格式化函数 DateFormat var DateFormat = function (datetime, formatStr) { var dat = datetime; var str ...
- js-数组方法的使用和详谈
写博客的同时也是对自己知识的一次全面总结,方便自己日后复习.今天总结一下JS中Array的所有方法和技巧,对算法题算是一个基础了,有不足的地方,还望童鞋们指出来,一起进步. 在总结方法之前,提到一点, ...
- python if elif else判断语句
username = 'jack' password = ' _username = input('username') _password = input('password') if userna ...
- tomcat 服务器线程问题
http://blog.csdn.net/wtopps/article/details/71339295 http://blog.csdn.net/wtopps/article/details/713 ...
- jstree使用新的
1.首先准备jstree树的dom元素 <p id="flowList_ul" class="flowList_ul"></p> 2.初 ...
- DevExpress WinForms v18.2新版亮点(八)
买 DevExpress Universal Subscription 免费赠 万元汉化资源包1套! 限量15套!先到先得,送完即止!立即抢购>> 行业领先的.NET界面控件2018年第 ...
- BBbacktrace installation
1: get the installation package https://backtrackbb.readthedocs.io/en/latest/Method.html#overview ht ...