Using Python's re Regular Expression Library (Part 1)

1. Finding Patterns in Text

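The search() function takes a pattern and the text to scan as input, and returns a match object when the pattern is found. If the pattern is not found, search() returns None.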

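Each match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.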

import re

pattern = 'this'

text = 'Does this text match the pattern?'

match = re.search(pattern, text)

s = match.start()

e = match.end()

print('Found "{}"\nin "{}"\nfrom {} to {} ("{}")'.format(match.re.pattern, match.string, s, e, text[s:e]))

_____________________Output_________________________________________

Found "this"

in "Does this text match the pattern?"

from 5 to 9 ("this")

The start() and end() methods give the indexes into the string that show where the text matched by the pattern occurs.
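Because search() returns None when nothing matches, calling start() on the result without a check raises AttributeError. The following is a minimal sketch of a guard; the helper name describe_match is hypothetical and not part of re.

```python
import re


def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # search() found nothing, so there is no match object to query.
        return 'no match for {!r}'.format(pattern)
    return '{!r} found at {}:{}'.format(match.group(0), match.start(), match.end())


text = 'Does this text match the pattern?'
print(describe_match('this', text))
print(describe_match('that', text))
```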

2. Compiling Expressions

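Although re includes module-level functions for working with regular expressions as text strings, it is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a RegexObject (the class is exposed as re.Pattern in current Python 3 releases).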

import re

regexes = [

    re.compile(p)

    for p in ['this', 'that']

]

text = 'Does this text match the pattern?'

print('Text: {!r}\n'.format(text))

for regex in regexes:

    print('Seeking "{}" ->'.format(regex.pattern),end=' ')

    if regex.search(text):

        print('match')

    else:

        print('no match')

_________________________Output_______________________________

Text: 'Does this text match the pattern?'

Seeking "this" -> match

Seeking "that" -> no match

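The module-level functions maintain a cache of compiled expressions, but the size of the cache is limited, and using compiled expressions directly avoids the overhead of the cache lookup. Another advantage of compiled expressions is that precompiling all of them when the module is loaded shifts the compilation work to application start time, instead of a point where the program is responding to a user action.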
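One way to apply this is to compile patterns once at module level and reuse them inside functions. This is only a sketch: the names WORD_RE, NUMBER_RE, count_words, and first_number are made up for illustration.

```python
import re

# Compiled once, when the module is imported (illustrative names).
WORD_RE = re.compile(r'\w+')
NUMBER_RE = re.compile(r'\d+')


def count_words(text):
    """Count word-like tokens using the precompiled pattern."""
    return len(WORD_RE.findall(text))


def first_number(text):
    """Return the first run of digits as an int, or None if there is none."""
    match = NUMBER_RE.search(text)
    return int(match.group(0)) if match else None


print(count_words('Does this text match the pattern?'))
print(first_number('order 66 shipped'))
```

Each call reuses the already compiled object, so no compilation or cache lookup happens while the functions run.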

3. Multiple Matches

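The examples so far have used search() to look for a single instance of a literal text string. The findall() function returns all substrings of the input that match the pattern without overlapping.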

import re

text = 'abbaaabbbbaaaaaa'

pattern = 'ab'

for match in re.findall(pattern, text):

    print('Found {!r}'.format(match))

_______________________Output___________________

Found 'ab'

Found 'ab'
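To make "non-overlapping" concrete, here is a small sketch with an input chosen for illustration: once a match is consumed, scanning resumes after it, so five a's yield only two matches of 'aa'.

```python
import re

text = 'aaaaa'

# Matches are taken at 0:2 and 2:4; the final 'a' is left over.
non_overlapping = re.findall('aa', text)
print(non_overlapping)
```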

finditer() returns an iterator that produces Match instances, instead of the strings returned by findall().

import re

text = 'abbaaabbbbaaaaaa'

pattern = 'ab'

for match in re.finditer(pattern, text):

    s = match.start()

    e = match.end()

    print('Found {!r} at {:d}:{:d}'.format(text[s:e],s,e))

_______________________Output_____________________________

Found 'ab' at 0:2

Found 'ab' at 5:7
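When only the positions are needed, the start and end indexes can be read in one step with the match object's span() method. A brief sketch using the same input:

```python
import re

text = 'abbaaabbbbaaaaaa'

# span() returns the (start, end) pair of each match.
spans = [match.span() for match in re.finditer('ab', text)]
print(spans)
```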

4. Pattern Syntax

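Regular expressions also support more powerful patterns. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be spelled out in the pattern.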

import re

def test_pattern(text,patterns):

    for pattern, desc in patterns:

        print("'{}' ({})\n".format(pattern,desc))

        print(" '{}'".format(text))

        for match in re.finditer(pattern, text):

            s = match.start()

            e = match.end()

            substr = text[s:e]

            n_backslashes = text[:s].count('\\')

            prefix = '.' * (s+n_backslashes)

            print(" {}'{}'".format(prefix,substr))

        print()

    return

if __name__ == "__main__":

    test_pattern('abbaaabbbbaaaaaa',[('ab',"'a' followed by 'b'"),])

________________________Output_____________________________________

'ab' ('a' followed by 'b')

 'abbaaabbbbaaaaaa'

 'ab'

 .....'ab'

The output shows the input text and the substring range of each portion of the input that matches the pattern.

Repetition

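There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times (allowing a pattern to repeat zero times means it does not need to appear at all to match). If * is replaced with +, the pattern must appear at least once. Using ? means the pattern appears zero or one time. For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat. Finally, to allow a variable but limited number of repetitions, use {m,n}, where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value must appear at least m times, with no maximum.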

test_pattern('abbaabbba',[('ab*','a followed by zero or more b'),

                          ('ab+','a followed by one or more b'),

                          ('ab?','a followed by zero or one b'),

                          ('ab{3}','a followed by three b'),

                          ('ab{2,3}','a followed by two or three b')],

)

__________________________Output________________________________

'ab*' (a followed by zero or more b)

 'abbaabbba'

 'abb'

 ...'a'

 ....'abbb'

 ........'a'

'ab+' (a followed by one or more b)

 'abbaabbba'

 'abb'

 ....'abbb'

'ab?' (a followed by zero or one b)

 'abbaabbba'

 'ab'

 ...'a'

 ....'ab'

 ........'a'

'ab{3}' (a followed by three b)

 'abbaabbba'

 ....'abbb'

'ab{2,3}' (a followed by two or three b)

 'abbaabbba'

 'abb'

 ....'abbb'
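The open-ended form {m,} is described above but not exercised by the examples. The sketch below uses the same input to show that {2,} means "two or more", and that {1,} and {0,} behave like + and *.

```python
import re

text = 'abbaabbba'

open_ended = re.findall('ab{2,}', text)    # a followed by two or more b
like_plus = re.findall('ab{1,}', text)     # same matches as 'ab+'
like_star = re.findall('ab{0,}', text)     # same matches as 'ab*'

print(open_ended)
print(like_plus == re.findall('ab+', text))
print(like_star == re.findall('ab*', text))
```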

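When processing a repetition instruction, re usually consumes as much of the input as possible while matching the pattern. This so-called greedy behavior may result in fewer individual matches, or in matches that include more of the input text than intended. Greediness can be turned off by following the repetition instruction with ?.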

test_pattern('abbaabbba',[('ab*?','a followed by zero or more b'),

                          ('ab+?','a followed by one or more b'),

                          ('ab??','a followed by zero or one b'),

                          ('ab{3}?','a followed by three b'),

                          ('ab{2,3}?','a followed by two or three b')],

)

______________________________Output______________________________

'ab*?' (a followed by zero or more b)

 'abbaabbba'

 'a'

 ...'a'

 ....'a'

 ........'a'

'ab+?' (a followed by one or more b)

 'abbaabbba'

 'ab'

 ....'ab'

'ab??' (a followed by zero or one b)

 'abbaabbba'

 'a'

 ...'a'

 ....'a'

 ........'a'

'ab{3}?' (a followed by three b)

 'abbaabbba'

 ....'abbb'

'ab{2,3}?' (a followed by two or three b)

 'abbaabbba'

 'abb'

 ....'abb'
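The difference is easiest to see when a greedy match swallows more than intended. In this sketch (the tag-like input is an assumption chosen for illustration), the greedy pattern returns one long match while the non-greedy version stops at the first closing bracket.

```python
import re

text = '<a><b>'

greedy = re.findall('<.*>', text)    # consumes up to the last '>'
lazy = re.findall('<.*?>', text)     # stops at the first '>' each time

print(greedy)
print(lazy)
```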

Character Sets

A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] matches either a or b.

test_pattern('abbaabbba',[('[ab]','either a or b'),

                          ('a[ab]+','a followed by one or more a or b'),

                          ('a[ab]+?','a followed by one or more a or b, not greedy')],

)

________________________________Output________________________________________

'[ab]' (either a or b)

 'abbaabbba'

 'a'

 .'b'

 ..'b'

 ...'a'

 ....'a'

 .....'b'

 ......'b'

 .......'b'

 ........'a'

'a[ab]+' (a followed by one or more a or b)

 'abbaabbba'

 'abbaabbba'

'a[ab]+?' (a followed by one or more a or b, not greedy)

 'abbaabbba'

 'ab'

 ...'aa'
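Characters inside a set are listed without separators. A short sketch of a common slip: writing a comma between the members adds the comma itself to the set.

```python
import re

text = 'a,b'

with_comma = re.findall('[a,b]', text)   # set contains 'a', ',' and 'b'
without_comma = re.findall('[ab]', text) # set contains only 'a' and 'b'

print(with_comma)
print(without_comma)
```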

A caret (^) at the start of a set inverts it: the pattern looks for characters that are not in the set following the caret.

test_pattern('This is some text -- with punctuation',[('[^-. ]+','sequences without -, ., or space')],)

__________________________________Output________________________________

'[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation'

 'This'

 .....'is'

 ........'some'

 .............'text'

 .....................'with'

 ..........................'punctuation'
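The caret only negates a set when it is the first character inside the brackets. The sketch below, with inputs chosen for illustration, contrasts a negated set with one where the caret appears later and is treated as an ordinary member.

```python
import re

# Caret first: match runs of characters that are NOT digits.
negated = re.findall('[^0-9]+', 'abc123def')

# Caret not first: it is just another character in the set.
literal_caret = re.findall('[0-9^]+', '2^10')

print(negated)
print(literal_caret)
```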

Character ranges can also be used to define a character set.

test_pattern('This is some text -- with punctuation',

            [('[a-z]+','sequences of lowercase letters'),

            ('[A-Z]+','sequences of uppercase letters'),

            ('[a-zA-Z]+','sequences of lower- or uppercase letters'),

            ('[A-Z][a-z]+','one uppercase followed by lowercase')],

)

____________________________________Output____________________________________

'[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation'

 .'his'

 .....'is'

 ........'some'

 .............'text'

 .....................'with'

 ..........................'punctuation'

'[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation'

 'T'

'[a-zA-Z]+' (sequences of lower- or uppercase letters)

 'This is some text -- with punctuation'

 'This'

 .....'is'

 ........'some'

 .............'text'

 .....................'with'

 ..........................'punctuation'

'[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation'

 'This'
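Several ranges can be combined in one set. As a sketch, the hypothetical helper is_hex below checks whether a whole string consists of hexadecimal digits by combining three ranges and using fullmatch(), which requires the pattern to cover the entire string.

```python
import re

# Digits plus the letters a-f in either case (illustrative name).
HEX_RE = re.compile('[0-9a-fA-F]+')


def is_hex(value):
    """Return True if value is a non-empty string of hexadecimal digits."""
    return HEX_RE.fullmatch(value) is not None


print(is_hex('ff00AA'))
print(is_hex('xyz'))
```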

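The dot metacharacter (.) indicates that the pattern should match any single character at that position (by default, any character except a newline).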

test_pattern('This is some text -- with punctuation',

            [('a.','a followed by any one character'),

            ('b.','b followed by any one character'),

            ('a.*b','a followed by anything ending in b'),

            ('a.*?b','a followed by anything, ending in b')],

)

____________________________Output___________________________________

'a.' (a followed by any one character)

 'This is some text -- with punctuation'

 ................................'at'

'b.' (b followed by any one character)

 'This is some text -- with punctuation'

'a.*b' (a followed by anything ending in b)

 'This is some text -- with punctuation'

'a.*?b' (a followed by anything, ending in b)

 'This is some text -- with punctuation'
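One detail worth knowing about the dot: by default it does not match a newline, and the re.DOTALL flag changes that. A small sketch with an input chosen for illustration:

```python
import re

text = 'ab a\nc'

default = re.findall('a.', text)             # the 'a' before '\n' does not match
dotall = re.findall('a.', text, re.DOTALL)   # '.' now matches '\n' as well

print(default)
print(dotall)
```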

Escape Codes

Code	Meaning
\d	A digit
\D	A non-digit
\s	Whitespace (tab, space, newline, etc.)
\S	Non-whitespace
\w	Alphanumeric (plus underscore)
\W	Non-alphanumeric

test_pattern('A prime #1 example!',

            [(r'\d+','sequence of digits'),

            (r'\D+','sequence of non-digits'),

            (r'\s+','sequence of whitespace'),

            (r'\S+','sequence of non-whitespace'),

            (r'\w+','alphanumeric characters'),

            (r'\W+','non-alphanumeric')],

)

___________________________Output_____________________________

'\d+' (sequence of digits)

 'A prime #1 example!'

 .........'1'

'\D+' (sequence of non-digits)

 'A prime #1 example!'

 'A prime #'

 ..........' example!'

'\s+' (sequence of whitespace)

 'A prime #1 example!'

 .' '

 .......' '

 ..........' '

'\S+' (sequence of non-whitespace)

 'A prime #1 example!'

 'A'

 ..'prime'

 ........'#1'

 ...........'example!'

'\w+' (alphanumeric characters)

 'A prime #1 example!'

 'A'

 ..'prime'

 .........'1'

 ...........'example'

'\W+' (non-alphanumeric)

 'A prime #1 example!'

 .' '

 .......' #'

 ..........' '

 ..................'!'
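Two everyday uses of these escape codes, sketched with made-up inputs: pulling the numbers out of a string with \d+, and splitting on runs of whitespace with \s+ regardless of whether they are spaces or tabs.

```python
import re

numbers = re.findall(r'\d+', 'A prime #1 example, then #22')
fields = re.split(r'\s+', 'A  prime\t#1 example!')

print(numbers)
print(fields)
```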

To match characters that are part of the regular expression syntax, escape them in the search pattern.

test_pattern(r'\d+ \D+ \s+',[(r'\\.\+','escape code')],)

________________________Output___________________________

'\\.\+' (escape code)

 '\d+ \D+ \s+'

 '\d+'

 .....'\D+'

 ..........'\s+'
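When the text to search for comes from a variable, escaping by hand is error-prone. The standard re.escape() function does it automatically; the sketch below (inputs are illustrative) shows the same literal failing unescaped and matching once escaped.

```python
import re

literal = '1+1=2'
text = 'so 1+1=2 holds'

# Unescaped, '+' is a repetition operator, so the literal text is not found.
unescaped = re.search(literal, text)

# re.escape() backslash-escapes the special characters for a literal match.
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.group(0))
```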

Anchoring

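Anchoring instructions specify the relative location in the input text where the pattern should appear.

Regular Expression Anchoring Codes

Code	Meaning
^	Start of string, or line
$	End of string, or line
\A	Start of string
\Z	End of string
\b	Empty string at the beginning or end of a word
\B	Empty string not at the beginning or end of a word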

test_pattern('This is some text -- with punctuation',

            [(r'^\w+','word at start of string'),

            (r'\A\w+','word at start of string'),

            (r'\w+\S*$','word near end of string'),

            (r'\w+\S*\Z','word near end of string'),

            (r'\bt\w+','t at start of word'),

            (r'\Bt\B','not start or end of word')],

)

_____________________________Output________________________________

'^\w+' (word at start of string)

 'This is some text -- with punctuation'

 'This'

'\A\w+' (word at start of string)

 'This is some text -- with punctuation'

 'This'

'\w+\S*$' (word near end of string)

 'This is some text -- with punctuation'

 ..........................'punctuation'

'\w+\S*\Z' (word near end of string)

 'This is some text -- with punctuation'

 ..........................'punctuation'

'\bt\w+' (t at start of word)

 'This is some text -- with punctuation'

 .............'text'

'\Bt\B' (not start or end of word)

 'This is some text -- with punctuation'

 .......................'t'

 ..............................'t'

 .................................'t'
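The "or line" part of ^ and $ only applies when the re.MULTILINE flag is set, while \A stays tied to the start of the whole string. The following sketch uses a two-line input chosen for illustration.

```python
import re

text = 'one two\nthree four'

start_default = re.findall(r'^\w+', text)                  # start of string only
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)  # start of every line
start_string = re.findall(r'\A\w+', text, re.MULTILINE)    # \A ignores the flag
end_default = re.findall(r'\w+$', text)                    # end of string only
end_multiline = re.findall(r'\w+$', text, re.MULTILINE)    # end of every line

print(start_default, start_multiline, start_string)
print(end_default, end_multiline)
```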
