参考:

用 Doc2Vec 得到文档/段落/句子的向量表达

https://radimrehurek.com/gensim/models/doc2vec.html

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

基于gensim的Doc2Vec简析

Gensim进阶教程:训练word2vec与doc2vec模型

用gensim doc2vec计算文本相似度

转自:

gensim doc2vec + sklearn kmeans 做文本聚类

原文显示太乱 为方便看摘录过来。。

用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。
选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。
选择sklearn库中的KMeans类。

程序如下:
# coding:utf-8

import sys
import gensim
import numpy as np

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans

TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_datasest():
    with open("out/text_dict_cut.txt", 'r') as cf:
        docs = cf.readlines()
        print len(docs)

    x_train = []
    #y = np.concatenate(np.ones(len(docs)))
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        l = len(word_list)
        word_list[l-1] = word_list[l-1].strip()
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train

def train(x_train, size=200, epoch_num=1):
    model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
    model_dm.save('model/model_dm')

    return model_dm

def cluster(x_train):
    infered_vectors_list = []
    print "load doc2vec model..."
    model_dm = Doc2Vec.load("model/model_dm")
    print "load train vectors..."
    i = 0
    for text, label in x_train:
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)
        i += 1

    print "train kmean model..."
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    labels= kmean_model.predict(infered_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w') as wf:
        for i in range(100):
            string = ""
            text = x_train[i][0]
            for word in text:
                string = string + word
            string = string + '\t'
            string = string + str(labels[i])
            string = string + '\n'
            wf.write(string)

    return cluster_centers

if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)

models.doc2vec – Deep learning with paragraph2vec的更多相关文章

  1. DEEP LEARNING WITH STRUCTURE

    DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...

  2. deep learning新征程

    deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan  2015-11-26   声明: 1 ...

  3. A Statistical View of Deep Learning (I): Recursive GLMs

    A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...

  4. What are some good books/papers for learning deep learning?

    What's the most effective way to get started with deep learning?       29 Answers     Yoshua Bengio, ...

  5. 《Deep Learning》(深度学习)中文版 开发下载

    <Deep Learning>(深度学习)中文版开放下载   <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...

  6. How To Improve Deep Learning Performance

    如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...

  7. 深度学习Deep learning

    In the last chapter we learned that deep neural networks are often much harder to train than shallow ...

  8. 《Deep Learning》全书已完稿_附全书电子版

    Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...

  9. How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras

    Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...

随机推荐

  1. 七、持久层框架(MyBatis)

    一.MyBatis学习 平时我们都用JDBC访问数据库,除了自己需要写SQL,还要操作Connection,Statement,ResultSet这些. 使用MyBatis,只需要自己提供SQL语句, ...

  2. python匿名函数以及return语句

  3. Lock、synchronized和ReadWriteLock,StampedLock戳锁的区别和联系以及Condition

    https://www.cnblogs.com/RunForLove/p/5543545.html 先来看一段代码,实现如下打印效果: 1 2 A 3 4 B 5 6 C 7 8 D 9 10 E 1 ...

  4. 模块之 logging, shelve, sys 模块

    一. logging模块 用来记录日志,日志:记录某个时间点发生了什么事 日志作用:程序调试 了解软件程序的运行情况,是否正常 软件程序运行故障分析与问题定位 还可用来做用户行为分析 日志等级:在不改 ...

  5. ueeditor 百度编译器使用onchange效果

    <script id="editor" type="text/plain" style="width:100%;height:200px;&qu ...

  6. python---多线程与多进程

    一. 单进程多线程 1. 使用的模块是Threading.使用join()函数进行阻塞. from pdf2txt import pdfTotxt1, pdfTotxt2 import xlrd im ...

  7. mysql 下载资源地址

    http://mirror.neu.edu.cn/mysql/Downloads/MySQL-5.6/

  8. learning at commad AT+CPSI

    [Purpose] Learning how to get mobile network info [Eevironment] Shell terminal, base on gcom command ...

  9. css3 前端开发

    一.前缀: -moz(例如 -moz-border-radius)用于Firefox -webkit(例如:-webkit-border-radius)用于Safari和Chrome. 二.CSS3圆 ...

  10. 回溯算法 LEETCODE别人的小结 一八皇后问题

    回溯算法实际上是一个类似枚举的搜索尝试过程,主要是在搜索尝试中寻找问题的解,当发现已不满足求解条件时,就回溯返回,尝试别的路径. 回溯法是一种选优搜索法,按选优条件向前搜索,以达到目的.但是当探索到某 ...