models.doc2vec – Deep learning with paragraph2vec
参考:
用 Doc2Vec 得到文档/段落/句子的向量表达
https://radimrehurek.com/gensim/models/doc2vec.html
Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset
基于gensim的Doc2Vec简析
Gensim进阶教程:训练word2vec与doc2vec模型
用gensim doc2vec计算文本相似度
转自:
gensim doc2vec + sklearn kmeans 做文本聚类
原文显示太乱 为方便看摘录过来。。
用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。 选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。 选择sklearn库中的KMeans类。 程序如下:
# coding:utf-8
import sys
import gensim
import numpy as np
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans
TaggededDocument = gensim.models.doc2vec.TaggedDocument
def get_datasest():
with open("out/text_dict_cut.txt", 'r') as cf:
docs = cf.readlines()
print len(docs)
x_train = []
#y = np.concatenate(np.ones(len(docs)))
for i, text in enumerate(docs):
word_list = text.split(' ')
l = len(word_list)
word_list[l-1] = word_list[l-1].strip()
document = TaggededDocument(word_list, tags=[i])
x_train.append(document)
return x_train
def train(x_train, size=200, epoch_num=1):
model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
model_dm.save('model/model_dm')
return model_dm
def cluster(x_train):
infered_vectors_list = []
print "load doc2vec model..."
model_dm = Doc2Vec.load("model/model_dm")
print "load train vectors..."
i = 0
for text, label in x_train:
vector = model_dm.infer_vector(text)
infered_vectors_list.append(vector)
i += 1
print "train kmean model..."
kmean_model = KMeans(n_clusters=15)
kmean_model.fit(infered_vectors_list)
labels= kmean_model.predict(infered_vectors_list[0:100])
cluster_centers = kmean_model.cluster_centers_
with open("out/own_claasify.txt", 'w') as wf:
for i in range(100):
string = ""
text = x_train[i][0]
for word in text:
string = string + word
string = string + '\t'
string = string + str(labels[i])
string = string + '\n'
wf.write(string)
return cluster_centers
if __name__ == '__main__':
x_train = get_datasest()
model_dm = train(x_train)
cluster_centers = cluster(x_train)
models.doc2vec – Deep learning with paragraph2vec的更多相关文章
- DEEP LEARNING WITH STRUCTURE
DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...
- deep learning新征程
deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan 2015-11-26 声明: 1 ...
- A Statistical View of Deep Learning (I): Recursive GLMs
A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...
- What are some good books/papers for learning deep learning?
What's the most effective way to get started with deep learning? 29 Answers Yoshua Bengio, ...
- 《Deep Learning》(深度学习)中文版 开发下载
<Deep Learning>(深度学习)中文版开放下载 <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...
- How To Improve Deep Learning Performance
如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...
- 深度学习Deep learning
In the last chapter we learned that deep neural networks are often much harder to train than shallow ...
- 《Deep Learning》全书已完稿_附全书电子版
Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...
随机推荐
- spring xml的配置
Spring xml文档头得配置 spring文档头一般是可以复制过来得,刚学习得时候一直看网上有没有配置,然后也没有找到,希望以下过程得学习可以给大家带来帮助!! 1 ...
- IIS隐藏版本号教程(Windows Server 2003)
1.下载Urlscan https://www.microsoft.com/en-us/search/DownloadResults.aspx?q=URLScan(总下载页面) https://dow ...
- nginx补丁格式说明(CVE-2016-4450为例)
nginx安全公告地址:http://nginx.org/en/security_advisories.html CVE-2016-4450:一个特定构造的数据包,可引发nginx引用空指针,导致ng ...
- Xmind settings lower
Xmind settings lower 1● setting 2● options 3● fast short keys 快捷键(Windows) 快捷键(Mac) 描述 Ctrl+N ...
- 每天进步一点点out1
1● attend ətend 2● infant əfənd
- 学习笔记-AngularJs(四)
之前学习的事视图与模版,我们在控制器文件中直接定义一个数组,让其在模版文件中用ng-repeat指令构造一个迭代器,定义的数组http://t.cn/RUbL4rP如同以下: $scope.phone ...
- Oracle.PL/SQL高级
一.匿名块 .使用returning ... INTO 保存增删改表数据时的一些列的值 ()增加数据时保存数据 DECLARE v_ename emp.ename%TYPE; v_sal emp.sa ...
- Python自然语言处理---TF-IDF模型
一. 信息检索技术简述 信息检索技术是当前比较热门的一项技术,我们通常意义上的论文检索,搜索引擎都属于信息检索的范畴.信息检索的问题可以抽象为:在文档集合D上,对于关键词w[1]…w[k]组成的查询串 ...
- python使用变量
#不建议用加号,建议用.format name = input('name:') age = input('age:') print( name ,age) print('姓名:',name,'年龄: ...
- C/C++知识补充(2) C/C++操作符/运算符的优先级 & 结合性
, 逗号操作符 for( i = 0, j = 0; i < 10; i++, j++ ) ... 从左到右 Precedence Operator Description Example ...