参考:

用 Doc2Vec 得到文档/段落/句子的向量表达

https://radimrehurek.com/gensim/models/doc2vec.html

Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset

基于gensim的Doc2Vec简析

Gensim进阶教程:训练word2vec与doc2vec模型

用gensim doc2vec计算文本相似度

转自:

gensim doc2vec + sklearn kmeans 做文本聚类

原文显示太乱 为方便看摘录过来。。

用doc2vec做文本相似度,模型可以找到输入句子最相似的句子,然而分析大量的语料时,不可能一句一句的输入,语料数据大致怎么分类也不能知晓。于是决定做文本聚类。
选择kmeans作为聚类方法。前面doc2vec可以将每个段文本的向量计算出来,然后用kmeans就很好操作了。
选择sklearn库中的KMeans类。

程序如下:
# coding:utf-8

import sys
import gensim
import numpy as np

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from sklearn.cluster import KMeans

TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_datasest():
    with open("out/text_dict_cut.txt", 'r') as cf:
        docs = cf.readlines()
        print len(docs)

    x_train = []
    #y = np.concatenate(np.ones(len(docs)))
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        l = len(word_list)
        word_list[l-1] = word_list[l-1].strip()
        document = TaggededDocument(word_list, tags=[i])
        x_train.append(document)

    return x_train

def train(x_train, size=200, epoch_num=1):
    model_dm = Doc2Vec(x_train,min_count=1, window = 3, size = size, sample=1e-3, negative=5, workers=4)
    model_dm.train(x_train, total_examples=model_dm.corpus_count, epochs=100)
    model_dm.save('model/model_dm')

    return model_dm

def cluster(x_train):
    infered_vectors_list = []
    print "load doc2vec model..."
    model_dm = Doc2Vec.load("model/model_dm")
    print "load train vectors..."
    i = 0
    for text, label in x_train:
        vector = model_dm.infer_vector(text)
        infered_vectors_list.append(vector)
        i += 1

    print "train kmean model..."
    kmean_model = KMeans(n_clusters=15)
    kmean_model.fit(infered_vectors_list)
    labels= kmean_model.predict(infered_vectors_list[0:100])
    cluster_centers = kmean_model.cluster_centers_

    with open("out/own_claasify.txt", 'w') as wf:
        for i in range(100):
            string = ""
            text = x_train[i][0]
            for word in text:
                string = string + word
            string = string + '\t'
            string = string + str(labels[i])
            string = string + '\n'
            wf.write(string)

    return cluster_centers

if __name__ == '__main__':
    x_train = get_datasest()
    model_dm = train(x_train)
    cluster_centers = cluster(x_train)

models.doc2vec – Deep learning with paragraph2vec的更多相关文章

  1. DEEP LEARNING WITH STRUCTURE

    DEEP LEARNING WITH STRUCTURE Charlie Tang is a PhD student in the Machine Learning group at the Univ ...

  2. deep learning新征程

    deep learning新征程(一) zoerywzhou@gmail.com http://www.cnblogs.com/swje/ 作者:Zhouwan  2015-11-26   声明: 1 ...

  3. A Statistical View of Deep Learning (I): Recursive GLMs

    A Statistical View of Deep Learning (I): Recursive GLMs Deep learningand the use of deep neural netw ...

  4. What are some good books/papers for learning deep learning?

    What's the most effective way to get started with deep learning?       29 Answers     Yoshua Bengio, ...

  5. 《Deep Learning》(深度学习)中文版 开发下载

    <Deep Learning>(深度学习)中文版开放下载   <Deep Learning>(深度学习)是一本皆在帮助学生和从业人员进入机器学习领域的教科书,以开源的形式免费在 ...

  6. How To Improve Deep Learning Performance

    如何提高深度学习性能 20 Tips, Tricks and Techniques That You Can Use ToFight Overfitting and Get Better Genera ...

  7. 深度学习Deep learning

    In the last chapter we learned that deep neural networks are often much harder to train than shallow ...

  8. 《Deep Learning》全书已完稿_附全书电子版

    Deep Learning第一篇书籍最终问世了.站点链接: http://www.deeplearningbook.org/ Bengio大神的<Deep Learning>全书电子版在百 ...

  9. How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras

    Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are n ...

随机推荐

  1. Tomcat指定JDK路径(Linux+Windows)

    当系统有多套JDK,不方便在系统配统一的JAVA_HOME时,我们可能想给tomcat指定JDK路径. 1.Linux下Tomcat指定JDK路径 找到$CATALINE_HOME/bin/catal ...

  2. MySQL设置白名单教程

    1 登录mysql mysql -h host -u username -p password 2 切换至mysql库 use mysql; 3 查看当前允许登录IP及用户 select Host,U ...

  3. Oracle中如何停止正在执行SQL语句

    oracle的用P/SQL客户端中,如何停止正在执行的SQL语句? 我们使用oracle语句查询某个表时,如果查询的表数据太多,如何停止正在执行操作 如查询的表数据超过上万条时,如何停止查询操作

  4. js 日期格式化函数(可自定义)

    js 日期格式化函数 DateFormat var DateFormat = function (datetime, formatStr) { var dat = datetime; var str ...

  5. js-数组方法的使用和详谈

    写博客的同时也是对自己知识的一次全面总结,方便自己日后复习.今天总结一下JS中Array的所有方法和技巧,对算法题算是一个基础了,有不足的地方,还望童鞋们指出来,一起进步. 在总结方法之前,提到一点, ...

  6. python if elif else判断语句

    username = 'jack' password = ' _username = input('username') _password = input('password') if userna ...

  7. tomcat 服务器线程问题

    http://blog.csdn.net/wtopps/article/details/71339295 http://blog.csdn.net/wtopps/article/details/713 ...

  8. jstree使用新的

    1.首先准备jstree树的dom元素 <p id="flowList_ul" class="flowList_ul"></p> 2.初 ...

  9. DevExpress WinForms v18.2新版亮点(八)

    买 DevExpress Universal Subscription  免费赠 万元汉化资源包1套! 限量15套!先到先得,送完即止!立即抢购>> 行业领先的.NET界面控件2018年第 ...

  10. BBbacktrace installation

    1: get the installation package https://backtrackbb.readthedocs.io/en/latest/Method.html#overview ht ...