Cleaning up strings with unicodedata.normalize()

# The first argument to normalize() selects the normalization form, for example NFC or NFD

>>> s1 = 'Spicy Jalape\u00f1o'

>>> s2 = 'Spicy Jalapen\u0303o'

>>> import unicodedata

# NFC: characters are fully composed (a single code point is used where one exists)

>>> t1 = unicodedata.normalize('NFC', s1)

>>> t2 = unicodedata.normalize('NFC', s2)

>>> t1 == t2

True

# NFD: characters are fully decomposed into a base character followed by combining characters

>>> t1 = unicodedata.normalize('NFD', s1)

>>> t2 = unicodedata.normalize('NFD', s2)

>>> t1 == t2

True
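Because two strings can look identical on screen and still differ at the code-point level, it can help to normalize both sides before comparing them. The following is a minimal sketch; the helper name same_text is hypothetical and not part of unicodedata.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Return True if a and b are equal after normalizing both to `form`.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s1 = 'Spicy Jalape\u00f1o'    # precomposed n-with-tilde (U+00F1)
s2 = 'Spicy Jalapen\u0303o'   # 'n' followed by a combining tilde (U+0303)

print(s1 == s2)            # False: the code points differ
print(len(s1), len(s2))    # 14 15
print(same_text(s1, s2))   # True
```

Either form works for the comparison as long as both strings are normalized to the same one.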

Note: Python also supports NFKC and NFKD. They are used the same way, but additionally apply compatibility decompositions.
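To illustrate how the compatibility forms differ from NFC and NFD, here is a small sketch using the ligature U+FB01 as sample input. The choice of character is an assumption made for illustration; it does not appear in the original notes.

```python
import unicodedata

s = '\ufb01'  # LATIN SMALL LIGATURE FI, a single compatibility character

print(unicodedata.normalize('NFC', s) == s)    # True: NFC keeps the ligature
print(unicodedata.normalize('NFKC', s))        # fi
print(len(unicodedata.normalize('NFKD', s)))   # 2
```

The compatibility forms are lossy in this sense: once the ligature is replaced by two plain letters, the original character cannot be recovered.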

Matching combining characters in text with combining()

>>> s1

'Spicy Jalapeño'

>>> t1 = unicodedata.normalize('NFD', s1)

>>> ''.join(c for c in t1 if not unicodedata.combining(c)) # strip the combining characters

'Spicy Jalapeno'
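The two steps above (decompose, then drop combining characters) can be wrapped in one function. This is a sketch; the name strip_accents is hypothetical. It relies on unicodedata.combining() returning the canonical combining class as an integer, which is 0 for characters that are not combining characters.

```python
import unicodedata

def strip_accents(text):
    """Decompose text with NFD, then drop every combining character.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(unicodedata.combining('n'))            # 0: not a combining character
print(unicodedata.combining('\u0303') != 0)  # True: the combining tilde
print(strip_accents('Spicy Jalape\u00f1o'))  # Spicy Jalapeno
```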

Using strip(), rstrip() and lstrip()

>>> s = ' hello world \n'

# Strip whitespace from both ends

>>> s.strip()

'hello world'

# Strip whitespace from the right end

>>> s.rstrip()

' hello world'

# Strip whitespace from the left end

>>> s.lstrip()

'hello world \n'

>>> t = '-----hello====='

# Strip the given character ('-') from the left end

>>> t.lstrip('-')

'hello====='

# Strip the given character ('=') from the right end

>>> t.rstrip('=')

'-----hello'
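One detail worth keeping in mind: the argument to strip(), lstrip() and rstrip() is treated as a set of characters, not as a prefix or suffix string. A short sketch, where the second sample string is made up for illustration:

```python
t = '-----hello====='

# Both '-' and '=' are stripped, whichever end they appear on.
print(t.strip('-='))               # hello

# Any mix of 'x' and 'y' is removed from the left, in any order.
print('xyxhelloyx'.lstrip('xy'))   # helloyx
```

Stripping stops at the first character that is not in the set, so characters in the middle of the string are never affected.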

# Note that strip() and its variants never touch whitespace in the middle of a string; to handle that, use one of the approaches below

>>> s = ' hello world \n'

# replace() removes every space and leaves none behind

>>> s.replace(' ', '')

'helloworld\n'

# Use a regular expression to collapse each run of whitespace into a single space

>>> import re

>>> re.sub(r'\s+', ' ', s)

' hello world '
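The regular expression above still leaves one space at each end. Combining it with strip() gives a fully cleaned string. The sketch below uses a hypothetical helper named collapse_whitespace, and also shows the split()/join() idiom, which gives the same result for this input.

```python
import re

def collapse_whitespace(text):
    """Collapse runs of whitespace to one space and trim both ends.

    Hypothetical helper for illustration only.
    """
    return re.sub(r'\s+', ' ', text).strip()

s = ' hello     world \n'
print(repr(collapse_whitespace(s)))   # 'hello world'
print(repr(' '.join(s.split())))      # 'hello world'
```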

About translate()

# Handling combining characters

>>> s = 'pýtĥöñ\fis\tawesome\r\n'

>>> remap = {ord('\r'): None, ord('\t'): ' ', ord('\f'): ' '} # translation table: delete '\r', map '\t' and '\f' to a space

>>> a = s.translate(remap) # apply the translation table

>>> a

'pýtĥöñ is awesome\n'
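As an alternative to writing ord() keys by hand, the built-in str.maketrans() can build an equivalent table. The sketch below shows two ways of calling it; the variable names are illustrative.

```python
s = 'pýtĥöñ\fis\tawesome\r\n'

# Dict form: single-character keys are converted to ordinals for us.
remap = str.maketrans({'\r': None, '\t': ' ', '\f': ' '})

# Three-argument form: map each char in the first string to the char at the
# same position in the second, and delete every char in the third.
remap3 = str.maketrans('\t\f', '  ', '\r')

print(repr(s.translate(remap)))
print(s.translate(remap) == s.translate(remap3))   # True
```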

>>> import unicodedata

>>> import sys

>>> cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c))) # collect every combining character as a dict key, with None as its value

>>> b = unicodedata.normalize('NFD', a) # normalize the input into decomposed form

>>> b

'pýtĥöñ is awesome\n'

>>> b.translate(cmb_chrs)

'python is awesome\n'

# Map every Unicode decimal digit character to its ASCII equivalent

# unicodedata.digit(chr(c)) returns the digit's integer value; adding ord('0') to it gives the code point of the matching ASCII digit '0' to '9'

>>> digitmap = {c: ord('0')+unicodedata.digit(chr(c)) for c in range(sys.maxunicode) if unicodedata.category(chr(c)) == 'Nd'} # category 'Nd' selects the Unicode decimal digit characters

# The exact count depends on the version of the Unicode database bundled with Python

>>> len(digitmap)

610

>>> x = '\u0661\u0662\u0663'

>>> x.translate(digitmap)

'123'
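When only a few strings need converting, building a table over every code point may be more than is needed. The following sketch converts digits character by character instead; the helper name to_ascii_digits is hypothetical.

```python
import unicodedata

def to_ascii_digits(text):
    """Replace each Unicode decimal digit (category 'Nd') with its ASCII digit.

    Hypothetical helper for illustration only; other characters pass through.
    """
    return ''.join(
        str(unicodedata.digit(c)) if unicodedata.category(c) == 'Nd' else c
        for c in text
    )

x = '\u0661\u0662\u0663'          # ARABIC-INDIC DIGITS ONE, TWO, THREE
print(to_ascii_digits(x))         # 123
print(to_ascii_digits('id-' + x)) # id-123
print(int(x))                     # 123: int() also accepts Unicode decimal digits
```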

About I/O encoding and decoding functions

>>> a

'pýtĥöñ is awesome\n'

>>> b = unicodedata.normalize('NFD', a)

>>> b.encode('ascii', 'ignore').decode('ascii')

'python is awesome\n'
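The encode/decode trick can be packaged as a small function. This is a sketch with a hypothetical name, ascii_fold. It assumes that silently dropping every character with no ASCII form is acceptable, which is not always the case: a letter such as 'ø' has no canonical decomposition, so it disappears entirely.

```python
import unicodedata

def ascii_fold(text):
    """Decompose with NFD, then discard everything that is not ASCII.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold('pýtĥöñ is awesome'))   # python is awesome
print(ascii_fold('S\u00f8ren'))          # Sren: 'ø' has no decomposition and is dropped
```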

Notes on unicodedata.normalize(), strip()/rstrip()/lstrip(), and encode/decode (for details, see Python Cookbook, 3rd Edition, recipes 2.9 to 2.11)
