models.doc2vec – Deep learning with paragraph2vec
参考:
用 Doc2Vec 得到文档/段落/句子的向量表达
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
基于gensim的Doc2Vec简析
Gensim进阶教程:训练word2vec与doc2vec模型
用gensim doc2vec计算文本相似度
转自:
gensim doc2vec + sklearn kmeans 做文本聚类
原文显示太乱 为方便看摘录过来。。
用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。 选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。 选择sklearn库中的KMeans类。 程序如下:
# coding:utf-8 import sys import gensim import numpy as np from gensim.models.doc2vec import Doc2Vec, LabeledSentence from sklearn.cluster import KMeans TaggededDocument = gensim.models.doc2vec.TaggedDocument def get_datasest(): with open("out/text_dict_cut.txt", 'r') as cf: docs = cf.readlines() print len(docs) x_train = [] #y = np.concatenate(np.ones(len(docs))) for i, text in enumerate(docs): word_list = text.split(' ') l = len(word_list) word_list[l-1] = word_list[l-1].strip() document = TaggededDocument(word_list, tags=[i]) x_train.append(document) return x_train def train(x_train, size=200, epoch_num=1): model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4) model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100) model_dm.save('model/model_dm') return model_dm def cluster(x_train): infered_vectors_list = [] print "load doc2vec model..." model_dm = Doc2Vec.load("model/model_dm") print "load train vectors..." i = 0 for text, label in x_train: vector = model_dm.infer_vector(text) infered_vectors_list.append(vector) i += 1 print "train kmean model..." kmean_model = KMeans(n_clusters=15) kmean_model.fit(infered_vectors_list) labels= kmean_model.predict(infered_vectors_list[0:100]) cluster_centers = kmean_model.cluster_centers_ with open("out/own_claasify.txt", 'w') as wf: for i in range(100): string = "" text = x_train[i][0] for word in text: string = string + word string = string + '\t' string = string + str(labels[i]) string = string + '\n' wf.write(string) return cluster_centers if __name__ == '__main__': x_train = get_datasest() model_dm = train(x_train) cluster_centers = cluster(x_train)
models.doc2vec – Deep learning with paragraph2vec的更多相关文章
- DEEP LEARNING WITH STRUCTURE
DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...
- deep learning新征程
deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan 2015-11-26 声明: 1 ...
- A Statistical View of Deep Learning (I): Recursive GLMs
A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...
- What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
- 《Deep Learning》(深度学习)中文版 开发下载
<Deep Learning>(深度学习)中文版开放下载 <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...
- How To Improve Deep Learning Performance
如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...
- 深度学习Deep learning
In the last chapter we learned that deep neural networks are often much harder to train than shallow ...
- 《Deep Learning》全书已完稿_附全书电子版
Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...
随机推荐
- oracle in 函数
IN操作符 select * from scott.emp where empno=7369 or empno=7566 or empno=7788 or empno=9999: ...
- 在js中if条件为null/undefined/0/NaN/""表达式时,统统被解释为false,此外均为true
Boolean 表达式 一个值为 true 或者 false 的表达式.如果需要,非 Boolean 表达式也可以被转换为 Boolean 值,但是要遵循下列规则: 所有的对象都被当作 true. 当 ...
- 利用ML&AI判定未知恶意程序——里面提到ssl恶意加密流检测使用N个payload CNN + 字节分布包长等特征综合判定
利用ML&AI判定未知恶意程序 导语:0x01.前言 在上一篇ML&AI如何在云态势感知产品中落地中介绍了,为什么我们要预测未知恶意程序,传统的安全产品已经无法满足现有的安全态势.那么 ...
- SIMD指令集——一条指令操作多个数,SSE,AVX都是,例如:乘累加,Shuffle等
SIMD指令集 from:https://zhuanlan.zhihu.com/p/31271788 SIMD,即Single Instruction, Multiple Data,一条指令操作多个数 ...
- from…import 语句
- linux shell 编程参考
#!/bin/bash my_fun() { echo "$#" } echo 'the number of parameter in "$@" is '$(m ...
- 转 Deep Learning for NLP 文章列举
原文链接:http://www.xperseverance.net/blogs/2013/07/2124/ 大部分文章来自: http://www.socher.org/ http://deepl ...
- git Please move or remove them before you can merge.
git clean -d -fx "" 其中 x -----删除忽略文件已经对git来说不识别的文件 d -----删除未被添加到git的路径中的文件 f -----强制运行
- QuickStart系列:docker部署之MongoDB
MongoDB[1] 是一个基于分布式文件存储的数据库.由C++语言编写.旨在为WEB应用提供可扩展的高性能数据存储解决方案. MongoDB[2] 是一个介于关系数据库和非关系数据库之间的产品, ...
- caffe,Inception v2 Check failed: top_shape[j] == bottom[i]->shape(j)
使用Caffe 跑 Google 的Inception V2 对输入图片的shape有要求,某些shape输进去可能会报错. Inception model中有从conv和pooling层concat ...