Reposted from:

gensim doc2vec + sklearn kmeans for text clustering

The original page was hard to read, so I have excerpted it here for convenience.

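When doc2vec is used for text similarity, the model can find the sentences most similar to a given input sentence. When analyzing a large corpus, however, feeding in sentences one at a time is impractical, and it gives no picture of how the corpus roughly divides into groups. So I decided to do text clustering instead.
I chose k-means as the clustering method. doc2vec already computes a vector for each piece of text, so running k-means on those vectors is straightforward.
The implementation uses the KMeans class from the sklearn library.

The program is as follows: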
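
As a rough illustration of the clustering step on its own, here is a minimal sketch that runs sklearn's KMeans on a handful of made-up 2-D vectors. The toy vectors, `n_clusters=2`, `n_init=10`, and `random_state=0` are assumptions chosen for the example; in practice the vectors would come from doc2vec.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for document vectors (assumed data, not from a real model).
vectors = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
])

# n_init and random_state are set explicitly so the run is repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(vectors)

labels = kmeans.labels_             # cluster index assigned to each vector
centers = kmeans.cluster_centers_   # one centroid per cluster
print(labels)
print(centers)
```

The first two vectors should land in one cluster and the last two in the other; which cluster gets index 0 is arbitrary.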

# coding:utf-8

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans


def get_dataset():
    # Each line of the input file is one document, split into words on spaces.
    # The file is assumed to be UTF-8 encoded.
    with open("out/text_dict_cut.txt", 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
        print(len(docs))

    x_train = []
    for i, text in enumerate(docs):
        word_list = text.strip().split(' ')
        # Tag each document with its line index.
        document = TaggedDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train


def train(x_train, size=200, epoch_num=100):
    # Passing the corpus to the constructor builds the vocabulary and runs an
    # initial training pass with the default epoch count.
    # vector_size is the current name of the older `size` argument.
    model_dm = Doc2Vec(x_train, min_count=1, window=3, vector_size=size,
                       sample=1e-3, negative=5, workers=4)
    # Continue training for epoch_num more epochs.
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=epoch_num)
    model_dm.save('model/model_dm')

    return model_dm


def cluster(x_train):
    infered_vectors_list = []
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")
    print("load train vectors...")
    for text, label in x_train:
        # Infer a vector for each document from its word list.
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)

    print("train kmean model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    # Only the first 100 documents are labeled and written out.
    labels = kmean_model.predict(infered_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w', encoding='utf-8') as wf:
        # Use len(labels) so a corpus with fewer than 100 documents
        # does not raise an IndexError.
        for i in range(len(labels)):
            text = x_train[i][0]
            # One line per document: the joined words, a tab, the cluster label.
            wf.write(''.join(text) + '\t' + str(labels[i]) + '\n')

    return cluster_centers


if __name__ == '__main__':
    x_train = get_dataset()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
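
The output file holds one text and its cluster label per line, separated by a tab. To read the result cluster by cluster, one option is to group the texts by label. The sketch below assumes that layout; `group_by_cluster` and the sample lines are hypothetical, used only for illustration.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group 'text<TAB>label' lines into {label: [text, ...]}."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        # Split on the last tab so a tab inside the text does not break parsing.
        text, label = line.rsplit('\t', 1)
        clusters[int(label)].append(text)
    return dict(clusters)


# Made-up sample in the same format the script writes.
sample = ["first text\t3\n", "second text\t3\n", "third text\t7\n"]
groups = group_by_cluster(sample)
for label in sorted(groups):
    print(label, groups[label])
```

For the real output, the lines could be read with `open("out/own_claasify.txt", encoding="utf-8")` and passed to the same function.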

