Fixing "UnicodeDecodeError: 'utf-8' codec can't decode byte ..." when loading a corpus with gensim.models.word2vec.LineSentence

On Windows, loading the (already tokenized) Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence fails with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
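To see where a message like this comes from, here is a minimal sketch that reproduces the same kind of failure in plain Python. It assumes, which the traceback does not confirm, that the corpus was saved in GBK, a common default on Chinese-language Windows. The sample string is made up for illustration.

```python
# Assumption: the corpus file was written in GBK rather than UTF-8.
sample = "中文 维基 百科"       # made-up line of tokenized text
raw = sample.encode("gbk")     # the bytes as they would sit on disk

try:
    raw.decode("utf-8")        # a strict UTF-8 decode of non-UTF-8 bytes
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
# Decoding with the encoding the bytes were written in succeeds
print(raw.decode("gbk"))
```

If the real file is in a different encoding, only the encoding name passed to decode changes; the failure mode is the same.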

Encoding problems like this are a real headache. They typically arise at a call such as xxx.decode("utf-8"), so let's look at the gensim source:

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

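The source shows that the __iter__ method makes LineSentence an iterable object, and that all file reading happens inside __iter__. The source argument we pass is usually a file path (that is, a string), so inside the try block self.source.seek(0) raises an AttributeError because a string has no seek method. The code that actually runs is therefore the except branch.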
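The fallback described above can be illustrated with a small standalone sketch. The helper name source_kind is hypothetical and not part of gensim. It only mirrors the try/except shape of __iter__ to show which branch a given argument takes.

```python
import io


def source_kind(source):
    """Hypothetical helper mirroring the try/except in LineSentence.__iter__."""
    try:
        source.seek(0)               # works for open files and io.BytesIO
        return "file-like object"
    except AttributeError:           # a str has no seek attribute
        return "filename string"


print(source_kind("./zhwiki_token.txt"))
print(source_kind(io.BytesIO(b"hello world\n")))
```

A path string never reaches the file-like branch, which is why the fix has to target the code under except AttributeError.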

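There are two ways to solve the problem:

1) from gensim import utils

   utils.smart_open(url, mode="rb", **kw)

   The source opens the file with utils.smart_open(), which defaults to binary mode. Change mode="rb" to mode="r". In text mode the lines arrive as already-decoded strings, which utils.to_unicode() returns unchanged. The file is then presumably decoded with the platform's default encoding (typically GBK on Chinese-language Windows) rather than UTF-8.

2) from gensim import utils

   utils.to_unicode(text, encoding='utf8', errors='strict')

   When the source calls decode("utf8"), the default is errors="strict". Change it to errors="ignore", that is, utils.to_unicode(line, errors="ignore"). Note that this silently drops every byte that is not valid UTF-8, so if the file is actually in another encoding, text is lost rather than recovered.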
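The difference between the error-handling modes can be seen without gensim. The sketch below uses plain bytes.decode as a stand-in for utils.to_unicode and assumes a GBK-encoded sample string, invented here for illustration.

```python
# Assumption: the bytes on disk are GBK, as on many Chinese-language Windows setups.
original = "中文 维基 百科"
gbk_bytes = original.encode("gbk")

# errors="ignore" no longer raises, but undecodable bytes are silently dropped,
# so the Chinese text is not recovered.
lossy = gbk_bytes.decode("utf-8", errors="ignore")

# Decoding with the file's real encoding recovers the text exactly.
exact = gbk_bytes.decode("gbk")

print(repr(lossy))
print(exact)
```

So errors="ignore" is mainly suited to a file that really is UTF-8 with a few stray bad bytes. For a file in a different encoding, decoding with that encoding is the lossless route.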

I recommend not editing the library source in place. Copy the class into your own script instead, for example:

import logging
import itertools

import gensim
from gensim.models import word2vec
from gensim import utils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=10000, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename.
            # Changed from the gensim source: open in text mode ("r")
            # instead of the default binary mode ("rb").
            with utils.smart_open(self.source, mode="r") as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length


our_sentences = LineSentence("./zhwiki_token.txt")

# Large corpus: use CBOW (the default) and raise the iteration count somewhat.
# size and iter are the gensim 3.x parameter names; gensim 4 renamed them
# to vector_size and epochs.
model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)

# model.save(save_model_file)
# Save in this format so the model can be trained incrementally later
model.save("./mathWord2Vec" + ".model")
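As an alternative to copying the gensim class, the same idea can be sketched with the standard library alone by decoding the file with an explicit encoding. The class name EncodedLineSentence and its encoding parameter are hypothetical and not part of gensim. The sketch assumes you know (or can find out) the file's real encoding, GBK in this example.

```python
import itertools
import os
import tempfile


class EncodedLineSentence:
    """Hypothetical variant of LineSentence that decodes with an explicit encoding."""

    def __init__(self, path, encoding="gbk", max_sentence_length=10000, limit=None):
        self.path = path
        self.encoding = encoding  # assumption: the caller knows the real encoding
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Text mode with an explicit encoding, so no platform default is involved
        with open(self.path, mode="r", encoding=self.encoding) as fin:
            for line in itertools.islice(fin, self.limit):
                words = line.split()
                # Split over-long lines into chunks, as LineSentence does
                for i in range(0, len(words), self.max_sentence_length):
                    yield words[i:i + self.max_sentence_length]


# Small self-check using a throwaway GBK-encoded file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_gbk.txt")
    with open(demo_path, "w", encoding="gbk") as fout:
        fout.write("中文 维基 百科\n自然 语言 处理\n")
    sentences = list(EncodedLineSentence(demo_path, encoding="gbk"))

print(sentences)
```

Because the file is reopened on every call to __iter__, an instance can be iterated repeatedly, which is what Word2Vec needs from its sentence source when it makes several passes over the corpus.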
