参考:

用 Doc2Vec 得到文档/段落/句子的向量表达

https://radimrehurek.com/gensim/models/doc2vec.html

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

基于gensim的Doc2Vec简析

Gensim进阶教程:训练word2vec与doc2vec模型

用gensim doc2vec计算文本相似度

转自:

gensim doc2vec + sklearn kmeans 做文本聚类

原文显示太乱 为方便看摘录过来。。

用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。
选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。
选择sklearn库中的KMeans类。

程序如下:
# coding:utf-8

import sys
import gensim
import numpy as np

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans

TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_datasest():
    with open("out/text_dict_cut.txt", 'r') as cf:
        docs = cf.readlines()
        print len(docs)

    x_train = []
    #y = np.concatenate(np.ones(len(docs)))
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        l = len(word_list)
        word_list[l-1] = word_list[l-1].strip()
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train

def train(x_train, size=200, epoch_num=1):
    model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
    model_dm.save('model/model_dm')

    return model_dm

def cluster(x_train):
    infered_vectors_list = []
    print "load doc2vec model..."
    model_dm = Doc2Vec.load("model/model_dm")
    print "load train vectors..."
    i = 0
    for text, label in x_train:
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)
        i += 1

    print "train kmean model..."
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    labels= kmean_model.predict(infered_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w') as wf:
        for i in range(100):
            string = ""
            text = x_train[i][0]
            for word in text:
                string = string + word
            string = string + '\t'
            string = string + str(labels[i])
            string = string + '\n'
            wf.write(string)

    return cluster_centers

if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)

models.doc2vec – Deep learning with paragraph2vec的更多相关文章

  1. DEEP LEARNING WITH STRUCTURE

    DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...

  2. deep learning新征程

    deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan  2015-11-26   声明: 1 ...

  3. A Statistical View of Deep Learning (I): Recursive GLMs

    A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...

  4. What are some good books/papers for learning deep learning?

    What's the most effective way to get started with deep learning?       29 Answers     Yoshua Bengio, ...

  5. 《Deep Learning》(深度学习)中文版 开发下载

    <Deep Learning>(深度学习)中文版开放下载   <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...

  6. How To Improve Deep Learning Performance

    如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...

  7. 深度学习Deep learning

    In the last chapter we learned that deep neural networks are often much harder to train than shallow ...

  8. 《Deep Learning》全书已完稿_附全书电子版

    Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...

  9. How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras

    Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...

随机推荐

  1. oracle in 函数

    IN操作符      select * from scott.emp where empno=7369 or empno=7566 or empno=7788 or empno=9999:      ...

  2. 在js中if条件为null/undefined/0/NaN/""表达式时,统统被解释为false,此外均为true

    Boolean 表达式 一个值为 true 或者 false 的表达式.如果需要,非 Boolean 表达式也可以被转换为 Boolean 值,但是要遵循下列规则: 所有的对象都被当作 true. 当 ...

  3. 利用ML&AI判定未知恶意程序——里面提到ssl恶意加密流检测使用N个payload CNN + 字节分布包长等特征综合判定

    利用ML&AI判定未知恶意程序 导语:0x01.前言 在上一篇ML&AI如何在云态势感知产品中落地中介绍了,为什么我们要预测未知恶意程序,传统的安全产品已经无法满足现有的安全态势.那么 ...

  4. SIMD指令集——一条指令操作多个数,SSE,AVX都是,例如:乘累加,Shuffle等

    SIMD指令集 from:https://zhuanlan.zhihu.com/p/31271788 SIMD,即Single Instruction, Multiple Data,一条指令操作多个数 ...

  5. from…import 语句

  6. linux shell 编程参考

    #!/bin/bash my_fun() { echo "$#" } echo 'the number of parameter in "$@" is '$(m ...

  7. 转 Deep Learning for NLP 文章列举

    原文链接:http://www.xperseverance.net/blogs/2013/07/2124/   大部分文章来自: http://www.socher.org/ http://deepl ...

  8. git Please move or remove them before you can merge.

    git clean -d -fx "" 其中  x -----删除忽略文件已经对git来说不识别的文件 d -----删除未被添加到git的路径中的文件 f -----强制运行

  9. QuickStart系列:docker部署之MongoDB

    MongoDB[1]  是一个基于分布式文件存储的数据库.由C++语言编写.旨在为WEB应用提供可扩展的高性能数据存储解决方案. MongoDB[2]  是一个介于关系数据库和非关系数据库之间的产品, ...

  10. caffe,Inception v2 Check failed: top_shape[j] == bottom[i]->shape(j)

    使用Caffe 跑 Google 的Inception V2 对输入图片的shape有要求,某些shape输进去可能会报错. Inception model中有从conv和pooling层concat ...