How to use TaggedDocument in gensim

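I have two directories of text files that I want to read and tag, but I don't know how to do this with TaggedDocument. I assumed it would work as TaggedDocument([strings], [labels]), but that apparently doesn't work.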

from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize

CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath ="./Sample/"
docs = []
tags = []
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]

for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    print docs
    tags.append(doc)
    print "###########"
    print tags
    print "!!!!!!!!!!!"

for doc in NotCogList:
    str = open(NotCogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)

for doc in SampleList:
    str = open(SamplePath + doc, 'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)

T = TaggedDocument(docs,tags)
model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

Error:

Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'

So I ran some tests and found this on GitHub:

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from Word2Vec.
    """
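To see why the original call fails, here is a minimal sketch. It uses a stand-in namedtuple that mirrors the definition quoted above (it is not gensim's actual class, and the sample strings and file names are made up), so it runs without gensim installed. Wrapping the whole corpus in a single TaggedDocument produces one tuple whose two fields are plain lists; when that tuple is iterated, each item is a list, and a list has no `words` attribute.

```python
from collections import namedtuple

# Stand-in mirroring the quoted definition; NOT gensim's actual class.
TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# Hypothetical sample data standing in for the file contents and file names.
docs = ['first document text', 'second document text']
tags = ['a.txt', 'b.txt']

# What the question did: one TaggedDocument wrapping the whole corpus.
T = TaggedDocument(docs, tags)

# Iterating a namedtuple yields its fields, here the two plain lists.
items = list(T)
has_words = [hasattr(item, 'words') for item in items]
print(has_words)  # neither list has a .words attribute

# What is needed instead: one TaggedDocument per document,
# with a list of word tokens and a list of tags.
corpus = [TaggedDocument(text.split(), [tag]) for text, tag in zip(docs, tags)]
print(corpus[0])
```

Each element of `corpus` has both a `words` and a `tags` attribute, which is the shape the traceback shows `scan_vocab` looking for.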

So I decided to change how I use TaggedDocument and create one TaggedDocument object per document. The important point is that the tags must be passed as a list.

for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    str_list = str.split()
    T = TaggedDocument(str_list,[doc])
    docs.append(T)
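The same idea can be sketched for Python 3 as a small helper that reads every .txt file in a directory. The function name `read_tagged_docs`, the `label` argument, and the `label/filename` tag scheme are assumptions made for illustration, not part of gensim; the fallback namedtuple only exists so the sketch also runs where gensim is not installed.

```python
import os
import tempfile

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback mirroring the definition quoted above, so the sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple('TaggedDocument', 'words tags')


def read_tagged_docs(directory, label):
    """Return one TaggedDocument per .txt file in `directory` (hypothetical helper)."""
    tagged = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(directory, name), encoding='utf-8') as f:
            words = f.read().split()  # naive whitespace tokenization
        # Prefix the file name with the label so that tags stay unique
        # even if two directories contain files with the same name.
        tagged.append(TaggedDocument(words, [label + '/' + name]))
    return tagged


# Demo on a throwaway directory instead of the real ./FixedCog/ folder.
with tempfile.TemporaryDirectory() as tmp:
    cog_dir = os.path.join(tmp, 'FixedCog')
    os.mkdir(cog_dir)
    with open(os.path.join(cog_dir, 'a.txt'), 'w', encoding='utf-8') as f:
        f.write('hello tagged world')
    demo_docs = read_tagged_docs(cog_dir, 'Cog')

print(demo_docs)
```

Calling the helper once per directory and concatenating the resulting lists gives a single list of TaggedDocument objects, and the directory label can be recovered from each tag's prefix.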

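The input to a Doc2Vec model should be a list of TaggedDocument objects, each of the form TaggedDocument(['list', 'of', 'word'], [tag]). A good practice is to use the index of each sentence as its tag. For example, to train a Doc2Vec model on two sentences (i.e. documents, paragraphs):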

import gensim
from gensim.models.doc2vec import TaggedDocument

s1 = 'the quick fox brown fox jumps over the lazy dog'
s1_tag = 0  # sentence index used as the tag
s2 = 'i want to burn a zero-day'
s2_tag = 1

docs = []
docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))

# min_count=1 so the words of this tiny corpus are not all discarded
model = gensim.models.Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
model.build_vocab(docs)

print('Start training process...')
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# save model
model_path = 'doc2vec.model'  # placeholder path
model.save(model_path)

You can also use gensim's common_texts as an example:

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
