References:

Using Doc2Vec to obtain vector representations of documents, paragraphs, and sentences

https://radimrehurek.com/gensim/models/doc2vec.html

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

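A Brief Analysis of Doc2Vec in gensim

Advanced Gensim Tutorial: Training word2vec and doc2vec Models

Computing Text Similarity with gensim doc2vec

Reposted from:

Text Clustering with gensim doc2vec + sklearn kmeans

The original post was poorly formatted, so I have excerpted it here for easier reading.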

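Doc2vec can be used for text similarity: given an input sentence, the model finds the sentences most similar to it. When analyzing a large corpus, however, it is impractical to feed in sentences one at a time, and there is no way to know in advance roughly which categories the corpus falls into. So I decided to cluster the texts instead.
I chose k-means as the clustering method. Since doc2vec can compute a vector for each piece of text, applying k-means to those vectors is straightforward.
The implementation uses the KMeans class from the sklearn library.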
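
To illustrate the clustering step on its own, here is a minimal sketch that runs KMeans on a handful of made-up 2-D vectors. The toy vectors, the choice of two clusters, and the variable names are assumptions for illustration only; real Doc2Vec vectors have many more dimensions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D vectors standing in for Doc2Vec output (hypothetical data).
vectors = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],  # group near the origin
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],    # group near (5, 5)
])

# random_state is fixed only to make the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Cluster ids are arbitrary; only the grouping is meaningful.
same_first_group = len(set(labels[:3])) == 1
same_second_group = len(set(labels[3:])) == 1

# A new vector is assigned to the cluster with the nearest center.
new_label = kmeans.predict(np.array([[5.0, 4.9]]))[0]
print(labels, new_label)
```

Because the numeric cluster ids carry no meaning by themselves, results are usually inspected by looking at which texts end up grouped together.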

The program is as follows:
# coding:utf-8

import gensim

from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans

# Alias kept from the original post; TaggedDocument pairs a word list with its tags.
TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_datasest():
    # Each line of the input file is one document, already segmented into
    # space-separated words. The file is assumed to be UTF-8 encoded.
    with open("out/text_dict_cut.txt", 'r', encoding='utf-8') as cf:
        docs = cf.readlines()
        print(len(docs))

    x_train = []
    for i, text in enumerate(docs):
        # Strip the trailing newline before splitting so the last word is clean.
        word_list = text.strip().split(' ')
        # Tag each document with its line index.
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train

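def train(x_train, size=200, epoch_num=100):
    # gensim >= 4.0 names the dimensionality parameter vector_size (formerly size).
    model_dm = Doc2Vec(min_count=1, window=3, vector_size=size, sample=1e-3, negative=5, workers=4)
    # Build the vocabulary once, then train for epoch_num passes. Passing the
    # corpus to the constructor as well would train the model twice.
    model_dm.build_vocab(x_train)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=epoch_num)
    # The model/ directory must already exist.
    model_dm.save('model/model_dm')

    return model_dm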

def cluster(x_train):
    infered_vectors_list = []
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model/model_dm")
    print("load train vectors...")
    for text, label in x_train:
        # Re-infer a vector for each document from its word list.
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)

    print("train kmean model...")
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    # Only the first 100 documents (fewer if the corpus is smaller) are
    # labeled and written out.
    sample_count = min(100, len(x_train))
    labels = kmean_model.predict(infered_vectors_list[0:sample_count])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w', encoding='utf-8') as wf:
        for i in range(sample_count):
            # Join the words back into unsegmented text, then append the cluster id.
            text = x_train[i][0]
            wf.write("".join(text) + '\t' + str(labels[i]) + '\n')

    return cluster_centers

if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)
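
The program returns cluster_centers but never uses them. One possible use, sketched here as an assumption rather than something the original post does, is to find the document vector closest to each center and treat that document as a representative of its cluster. The helper name closest_to_centers and the toy data are hypothetical.

```python
import numpy as np

def closest_to_centers(vectors, centers):
    """Return, for each center, the index of the nearest vector (Euclidean distance)."""
    vectors = np.asarray(vectors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # distances[c, d] is the distance from center c to vector d.
    distances = np.linalg.norm(centers[:, None, :] - vectors[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Toy data standing in for inferred document vectors and KMeans centers.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]]
toy_centers = [[0.4, 0.4], [9.6, 9.6]]
representatives = closest_to_centers(toy_vectors, toy_centers)
print(representatives)
```

With the real program, the returned indices could be used to look up x_train[i][0] and read the representative text of each cluster.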
