Handling Chinese in Python Regular Expressions (Repost)

When matching Chinese text, the regular expression pattern and the target string must use the same encoding.
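As a rough illustration of that rule, here is a minimal sketch. It assumes Python 3 (the article's own snippets are Python 2), where text is str and encoded data is bytes. The variable names are made up for the example, and GBK is used only as a second encoding to show a mismatch.

```python
import re

text = "#who#helloworld#a中文x#"   # str: Unicode text
data = text.encode("utf-8")        # bytes: the same content encoded as UTF-8

# Pattern and target are both str: the search succeeds.
found_text = re.search("中文", text) is not None

# Pattern and target are both UTF-8 bytes: the search succeeds.
found_utf8 = re.search("中文".encode("utf-8"), data) is not None

# Pattern encoded as GBK, target encoded as UTF-8: the bytes differ, no match.
found_gbk = re.search("中文".encode("gbk"), data) is not None

# Mixing a str pattern with a bytes target is rejected with TypeError.
try:
    re.search("中文", data)
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(found_text, found_utf8, found_gbk, mixed_allowed)
```

Python 3 turns the str/bytes mismatch into an immediate error, while a mismatch between two byte encodings still fails silently by simply not matching.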

    import sys

    print sys.getdefaultencoding()
    text = u"#who#helloworld#a中文x#"
    print isinstance(text, unicode)
    print text

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 18: ordinal not in range(128)

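The print text statement raises the error above.
Explanation: the console output window writes text using the ASCII encoding (the default encoding on English-language systems is ASCII), while the string in the code above is a Unicode string, so printing it fails.
Changing the statement to print text.encode('utf8') fixes it.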
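The failure and the fix can be reproduced without a console. The following is a minimal sketch assuming Python 3, where the explicit encode call stands in for what the output stream does implicitly. The variable names are illustrative only.

```python
text = "#who#helloworld#a中文x#"

# Encoding to ASCII fails: the Chinese characters have no ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and round-trips back to the original text.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok, round_trip == text)
```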

# Determine the system default encoding
import sys
print sys.getdefaultencoding()
# 'ascii'

# Check whether the string is a unicode object
print isinstance(text,unicode)
# True

Converting between unicode and Python str

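# -*- coding: utf-8 -*-
__author__ = 'medcl'

unistr = u'a'
pystr = unistr.encode('utf8')       # unicode -> str (UTF-8 bytes)
unistr2 = unicode(pystr, 'utf8')    # str (UTF-8 bytes) -> unicode

# value stands for any incoming string whose type is not known in advance
value = pystr

# When unicode is required
if not isinstance(value, unicode):
    temp = unicode(value, 'utf8')
else:
    temp = value

# When a Python str is required
if isinstance(value, unicode):
    temp2 = value.encode('utf8')
else:
    temp2 = value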
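The two branches above can be wrapped into small helpers. This is a sketch assuming Python 3, where the Python 2 pair unicode/str corresponds to str/bytes. The names to_text and to_bytes are hypothetical and do not come from the original article.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


raw = "a中文x".encode("utf-8")
print(to_text(raw), to_bytes("a中文x") == raw)
```

Calling either helper on a value that already has the target type returns it unchanged, so both are safe to apply at the boundary of code that expects one specific type.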

Extracting non-ASCII characters with a regex

Input:
"#who#helloworld#a中文x#"

Pattern:
r"[\x80-\xff]+"

Output:
中文
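This byte-range pattern works because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, while ASCII characters stay below 0x80. The sketch below assumes Python 3 and UTF-8 input. It runs the byte pattern on bytes and, for comparison, an equivalent "anything outside ASCII" pattern on str. The variable names are illustrative.

```python
import re

text = "#who#helloworld#a中文x#"
data = text.encode("utf-8")

# Bytes pattern on bytes input: collect runs of bytes >= 0x80, then decode them.
byte_runs = [run.decode("utf-8") for run in re.findall(rb"[\x80-\xff]+", data)]

# Str pattern on str input: collect runs of characters outside the ASCII range.
text_runs = re.findall(r"[^\x00-\x7f]+", text)

print(byte_runs, text_runs)
```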

# -*- coding: utf-8 -*-
__author__ = 'medcl'

import re

def findPart(regex, text, name):
    # Print every match of regex found in the unicode string text
    res = re.findall(regex, text)
    if res:
        print "There are %d %s parts:\n" % (len(res), name)
        for r in res:
            print "\t", r.encode("utf8")
        print

text = "#who#helloworld#a中文x#"
usample = unicode(text, 'utf8')    # decode the UTF-8 byte string to unicode
findPart(u"#[\w\u2E80-\u9FFF]+#", usample, "unicode chinese")

Output:

	#who#

	#a中文x#
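For readers on Python 3, a rough equivalent might look like the sketch below. It assumes Python 3.3 or later, where the re module itself understands \uXXXX escapes inside raw patterns. The name find_parts is hypothetical.

```python
import re


def find_parts(regex, text, flags=0):
    """Return all non-overlapping matches of regex in text."""
    return re.findall(regex, text, flags)


text = "#who#helloworld#a中文x#"

# Same pattern as above, applied directly to a str.
parts = find_parts(r"#[\w\u2E80-\u9FFF]+#", text)

# With re.ASCII, \w covers only [a-zA-Z0-9_], so the Chinese segment is lost
# unless the explicit \u2E80-\u9FFF range is part of the character class.
ascii_only = find_parts(r"#\w+#", text, re.ASCII)

for part in parts:
    print("\t" + part)
```

On str patterns in Python 3, \w already matches Chinese characters by default, so the explicit range mainly matters when re.ASCII is set or when the match should be limited to specific blocks.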

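Character ranges for several major non-English scripts

2E80-33FFh: CJK symbols area. Contains Kangxi dictionary radicals, CJK supplementary radicals, Bopomofo (Zhuyin) symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized letters, numbers, and months, as well as Japanese kana combinations, units, era names, months, days, and times.

3400-4DFFh: CJK Unified Ideographs Extension A, containing 6,582 CJK ideographs in total.

4E00-9FFFh: CJK Unified Ideographs, containing 20,902 CJK ideographs in total.

A000-A4FFh: Yi script area, containing the script and radicals of the Yi people of southern China.

AC00-D7FFh: Hangul syllables area, containing the syllables composed from Korean jamo.

F900-FAFFh: CJK Compatibility Ideographs, containing 302 CJK ideographs in total.

FB00-FFFDh: Presentation forms area, containing Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, and halfwidth and fullwidth forms.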
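The ranges listed above can be turned into a small lookup table and a regex character class. This is a sketch assuming Python 3. The table copies the boundaries from the list as given, and the names RANGES, range_name, and HAN_ONLY are made up for the example.

```python
import re

# (low, high, label) taken from the ranges listed above.
RANGES = [
    (0x2E80, 0x33FF, "CJK symbols"),
    (0x3400, 0x4DFF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xA000, 0xA4FF, "Yi"),
    (0xAC00, 0xD7FF, "Hangul syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0xFB00, 0xFFFD, "Presentation forms"),
]


def range_name(ch):
    """Return the label of the listed range containing ch, or None."""
    cp = ord(ch)
    for low, high, label in RANGES:
        if low <= cp <= high:
            return label
    return None


# A character class built from the three ideograph ranges only.
HAN_ONLY = re.compile(r"[\u3400-\u4DFF\u4E00-\u9FFF\uF900-\uFAFF]+")

print(range_name("中"), range_name("a"), HAN_ONLY.findall("a中文x，한"))
```

Restricting the class to the ideograph ranges keeps kana, Hangul, and fullwidth punctuation out of the match, which the broader \u2E80-\u9FFF range would partly include.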

REF:http://www.blogjava.net/Skynet/archive/2009/05/02/268628.html

http://iregex.org/blog/python-chinese-unicode-regular-expressions.html

Source: python正则的中文处理 (Handling Chinese in Python Regular Expressions)
