Python: Converting Text Encoding

Recently, while preparing a weekly report, I needed to extract data from CSV files, build tables from it, and then generate charts.

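When reading the CSV content I almost always open the file with with open(filename, encoding='UTF-8') as f:. In practice, however, some of the CSV files are not UTF-8 encoded, so the program fails with an error at run time. Each time I had to change the file's encoding to UTF-8 by hand and run the program again, so I decided to detect and convert the encoding inside the program itself.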

Basic idea: first check whether the file is UTF-8 encoded; if it is not, convert it to UTF-8, then process it.
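As a cheap first check that needs no third-party library, one can simply try a strict UTF-8 decode of the raw bytes. The helper below, is_valid_utf8, is a hypothetical name introduced for illustration and is not part of the original script. It only answers "is this already valid UTF-8?" and cannot tell which other encoding a file uses.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid byte sequence
    except UnicodeDecodeError:
        return False
    return True


# "中文" encoded as UTF-8 passes the check; the same text encoded as GBK does not.
utf8_sample = "中文".encode("utf-8")
gbk_sample = "中文".encode("gbk")
print(is_valid_utf8(utf8_sample))
print(is_valid_utf8(gbk_sample))
```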

Python's third-party chardet library can report a file's encoding:

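The detect function takes a single argument, a non-Unicode string (bytes in Python 3), and returns a dictionary such as {'encoding': 'utf-8', 'confidence': 0.99}. The dictionary contains the detected encoding and the confidence of the detection.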

import chardet

def get_encode_info(file):
    # Read the whole file as bytes and let chardet guess its encoding.
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']
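Because the result carries a confidence value, a caller may want to reject low-confidence guesses. The sketch below works on the dictionary shape described above. pick_encoding, the 0.8 threshold, and the UTF-8 fallback are illustrative assumptions, not something chardet prescribes.

```python
def pick_encoding(result: dict, min_confidence: float = 0.8,
                  fallback: str = "utf-8") -> str:
    """Choose an encoding from a chardet-style result dict (hypothetical helper).

    Falls back when no encoding was detected or the confidence is too low.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written sample results in the same shape as a detection result.
print(pick_encoding({"encoding": "GB2312", "confidence": 0.99}))
print(pick_encoding({"encoding": "Windows-1252", "confidence": 0.35}))
print(pick_encoding({"encoding": None, "confidence": 0.0}))
```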

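This performs acceptably on small files, but becomes slow once the file grows. My local CSV file is nearly 200 KB, and the slowdown is already clearly noticeable. For this case chardet provides the UniversalDetector class: create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches its minimum confidence threshold, it sets detector.done to True. Once the source text is exhausted, call detector.close(), which performs some final calculations in case the detector did not reach its minimum confidence threshold earlier. The result is then available in detector.result, a dictionary containing the detected character encoding and the confidence level (the same as what chardet.detect returns).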

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        # Iterate over the file lazily instead of loading every line with readlines().
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']
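Feeding line by line is not required, since the detector accepts arbitrary byte chunks. The following sketch reads fixed-size blocks, so a file without line breaks is never loaded in one piece. detect_in_chunks and the 4096-byte block size are assumptions for illustration. The function takes any object exposing the feed, done, close, and result members described above, so a UniversalDetector() instance can be passed in.

```python
from functools import partial


def detect_in_chunks(path, detector, chunk_size=4096):
    """Feed a file to a detector block by block and return its result dict.

    `detector` is expected to behave like chardet's UniversalDetector:
    feed(bytes), a boolean `done` attribute, close(), and a `result` dict.
    """
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(partial(f.read, chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # confident enough, stop reading early
                break
    detector.close()  # finalize in case the threshold was never reached
    return detector.result


# Usage with chardet (assumes chardet is installed):
#   from chardet.universaldetector import UniversalDetector
#   encoding = detect_in_chunks("test.csv", UniversalDetector())["encoding"]
```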

During the encoding conversion I ran into a problem: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode)   # <-- the error is raised here
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)

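This happens because some bytes cannot be decoded with the given codec. The fix is to pass an additional parameter, errors. The official documentation says: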

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

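In other words, the bytes are decoded into a string, and the errors argument selects how decoding errors are handled. The default is 'strict', which may raise a UnicodeError; switching to 'ignore' or 'replace' avoids the exception. Note that 'ignore' silently drops the bytes that cannot be decoded, while 'replace' substitutes a replacement character for them.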

So changing the line file_decode = file_content.decode(original_encode) to file_decode = file_content.decode(original_encode, 'ignore') solves the problem.
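The difference between the error handlers is easy to see on a small byte string. The sketch below reproduces the same kind of failure with the cp1252 codec, where byte 0x90 has no mapping. The sample bytes are made up for illustration. Since 'ignore' silently discards the offending bytes, the converted file may lose characters, whereas 'replace' leaves a visible U+FFFD marker.

```python
raw = b"abc\x90def"  # 0x90 is undefined in cp1252

try:
    raw.decode("cp1252")  # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict failed:", exc.reason)

ignored = raw.decode("cp1252", errors="ignore")    # drops the bad byte
replaced = raw.decode("cp1252", errors="replace")  # inserts U+FFFD
print(repr(ignored))
print(repr(replaced))
```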

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    # 'ignore' drops bytes that cannot be decoded instead of raising.
    file_decode = file_content.decode(original_encode, 'ignore')
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)

if __name__ == "__main__":
    filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
    encode_info = get_encode_info(filename)
    # The detector reports None when it cannot determine an encoding (e.g. an empty file).
    if encode_info is not None and encode_info.lower() != 'utf-8':
        convert_encode2utf8(filename, encode_info, 'utf-8')
    # Detect again to confirm the file is now UTF-8.
    encode_info = get_encode_info(filename)
    print(encode_info)
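One risk of the script above is that it overwrites the source file in place, so a wrong guess or dropped bytes cannot be undone. A more cautious variant, sketched here with standard-library calls only, keeps a backup copy and swaps in the converted file in a single step. convert_file_to_utf8, the .bak suffix, and the .tmp suffix are hypothetical choices for illustration.

```python
import os
import shutil


def convert_file_to_utf8(path, source_encoding, errors="strict", keep_backup=True):
    """Re-encode a text file as UTF-8, optionally keeping the original as path + '.bak'."""
    with open(path, "rb") as f:
        text = f.read().decode(source_encoding, errors)

    if keep_backup:
        shutil.copy2(path, path + ".bak")  # preserve the original bytes

    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(text.encode("utf-8"))
    os.replace(tmp_path, path)  # swap in the converted file in one step
```

With errors="strict", a bad encoding guess raises before anything is written, which leaves the original file untouched.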

Reference: https://chardet.readthedocs.io/en/latest/usage.html
