参考:

用 Doc2Vec 得到文档/段落/句子的向量表达

https://radimrehurek.com/gensim/models/doc2vec.html

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

基于gensim的Doc2Vec简析

Gensim进阶教程:训练word2vec与doc2vec模型

用gensim doc2vec计算文本相似度

转自:

gensim doc2vec + sklearn kmeans 做文本聚类

原文显示太乱 为方便看摘录过来。。

用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。
选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。
选择sklearn库中的KMeans类。

程序如下:
# coding:utf-8

import sys
import gensim
import numpy as np

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans

TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_datasest():
    with open("out/text_dict_cut.txt", 'r') as cf:
        docs = cf.readlines()
        print len(docs)

    x_train = []
    #y = np.concatenate(np.ones(len(docs)))
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        l = len(word_list)
        word_list[l-1] = word_list[l-1].strip()
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train

def train(x_train, size=200, epoch_num=1):
    model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
    model_dm.save('model/model_dm')

    return model_dm

def cluster(x_train):
    infered_vectors_list = []
    print "load doc2vec model..."
    model_dm = Doc2Vec.load("model/model_dm")
    print "load train vectors..."
    i = 0
    for text, label in x_train:
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)
        i += 1

    print "train kmean model..."
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    labels= kmean_model.predict(infered_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w') as wf:
        for i in range(100):
            string = ""
            text = x_train[i][0]
            for word in text:
                string = string + word
            string = string + '\t'
            string = string + str(labels[i])
            string = string + '\n'
            wf.write(string)

    return cluster_centers

if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)

models.doc2vec – Deep learning with paragraph2vec的更多相关文章

  1. DEEP LEARNING WITH STRUCTURE

    DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...

  2. deep learning新征程

    deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan  2015-11-26   声明: 1 ...

  3. A Statistical View of Deep Learning (I): Recursive GLMs

    A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...

  4. What are some good books/papers for learning deep learning?

    What's the most effective way to get started with deep learning?       29 Answers     Yoshua Bengio, ...

  5. 《Deep Learning》(深度学习)中文版 开发下载

    <Deep Learning>(深度学习)中文版开放下载   <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...

  6. How To Improve Deep Learning Performance

    如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...

  7. 深度学习Deep learning

    In the last chapter we learned that deep neural networks are often much harder to train than shallow ...

  8. 《Deep Learning》全书已完稿_附全书电子版

    Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...

  9. How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras

    Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...

随机推荐

  1. 循环中点击单个事件(巧用this,指向当前对象)

    <em id='show' value="<?php echo $member['phone']; ?>" class="sui">&l ...

  2. openssl安装/更新教程(CentOS)

    1.下载openssl 下载链接:https://www.openssl.org/source/snapshot/ 里边是当前仍支持版本的快照:同版本不同日期内容可能不同的,所以下载一般下对应版本的最 ...

  3. 使用MongoDB数据库(1)(三十五)

    MongoDB简介 MongoDB是一个基于分布式文件存储的数据库,它是一个介于关系数据库和非关系数据库之间的产品,其主要目标是在键/值存储方式(提供了高性能和高度伸缩性)和传统的RDBMS系统(具有 ...

  4. gpu内存查看命令nvidia-smi

    nvidia-smi nvidia-settings nvidia-xconfig

  5. ubuntu中更新.netcore到2.1版本

    如果需要安装新版本到dotnetcore,需要先卸载旧版本(https://github.com/dotnet/core/blob/master/release-notes/download-arch ...

  6. JavaScript中如何对一个对象进行深度clone

    <!doctype html><html><head><meta charset="utf-8"><title>深克隆& ...

  7. 【原创】paintEvent()函数显示文本

    [代码] void MainWindow::paintEvent(QPaintEvent*) { QPainter p(this); QRect r; p.setPen(Qt::red); p.dra ...

  8. MarkDown编辑器中缩进

    首先,Markdown是不支持缩进的. 在Markdown里按下四个空格,就自动转入Code模式. 在Markdown里一个回车,不是分段而是换行,要两个回车,才是分段. 分段和换行的区别是:换行后, ...

  9. opencv3.0+vs2013安装记录

    为了能够更好的学习图像,我觉得opencv是一个必不可少的库,因此在以后的研究上使用opencv作为研究工具,与大家共同进步. 话归正题:先搭建opencv的环境. 1.下载安装包3.0 a,官网打开 ...

  10. block,inline-block,行内元素区别及浮动

    1.block: 默认占据一行空间,盒子外的元素被迫另起一行 2.inline-block: 行内块盒子,可以设置宽高 3.行内元素: 宽度即使内容宽度,不能设置宽高,如果要设置宽高,需要转换成行内块 ...