Python中的正则表达式regular expression
1 match = re.search(pat,str)
match = re.search(pat, str) stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.str ='an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
print'found', match.group()## 'found word:cat'
else:
print'did not find'
2 Basic Patterns
- a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
 - . (a period) -- matches any single character except newline '\n'
 - \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
 - \b -- boundary between word and non-word
 - \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
 - \t, \n, \r -- tab, newline, return
 - \d -- decimal digit [0-9] (some older regex utilities do not support but \d, but they all support \w and \s)
 - ^ = start, $ = end -- match the start or end of the string
 - \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a slash. If you are unsure if a character has special meaning, such as '@', you can put a slash in front of it, \@, to make sure it is treated just as a character.
 
3 Basic Examples
The basic rules of regular expression search for a pattern within a string are:
- The search proceeds through the string from start to end, stopping at the first match found
 - All of the pattern must be matched, but not all of the string
 - If 
match = re.search(pat, str)is successful, match is not None and in particular match.group() is the matching text 
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii','piiig')=> found, match.group()=="iii"
match = re.search(r'igs','piiig')=> not found, match ==None ## . = any char but \n
match = re.search(r'..g','piiig')=> found, match.group()=="iig" ## \d = digit char, \w = word char
match = re.search(r'\d\d\d','p123g')=> found, match.group()=="123"
match = re.search(r'\w\w\w','@@abcd!!')=> found, match.group()=="abc"
4 Repetition
Things get more interesting when you use + and * to specify repetition in the pattern
- + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
 - * -- 0 or more occurrences of the pattern to its left
 - ? -- match 0 or 1 occurrences of the pattern to its left
 
5 Repetition Examples
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+','piiig')=> found, match.group()=="piii" ## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+','piigiiii')=> found, match.group()=="ii" ## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d','xx1 2 3xx')=> found, match.group()=="1 2 3"
match = re.search(r'\d\s*\d\s*\d','xx12 3xx')=> found, match.group()=="12 3"
match = re.search(r'\d\s*\d\s*\d','xx123xx')=> found, match.group()=="123" ## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+','foobar')=> not found, match ==None
## but without the ^ it succeeds:
match = re.search(r'b\w+','foobar')=> found, match.group()=="bar"
6 Emails Example
Suppose you want to find the email address inside the string 'xyz alice-b@google.com purple monkey'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':
str ='purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
print match.group() ## 'b@google'
The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.
Square Brackets
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
print match.group() ## 'alice-b@google.com'
(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
7 Group Extraction
The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parenthesis ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parenthesis do not change what the pattern will match, instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.
str ='purple alice-b@google.com monkey dishwasher'
match = re.search('([\w.-]+)@([\w.-]+)', str)
if match:
print match.group() ## 'alice-b@google.com' (the whole match)
print match.group(1) ## 'alice-b' (the username, group 1)
print match.group(2) ## 'google.com' (the host, group 2)
A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.
8 findall
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
## Suppose we have a text with many email addresses
str ='purple alice@google.com, blah monkey bob@abc.com blah dishwasher' ## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)## ['alice@google.com', 'bob@abc.com']
for email in emails:
# do something with each found email string
print email
9 findall With Files
For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):
# Open file
f = open('test.txt','r')
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'some pattern', f.read())
10 findall and Groups
The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. ('alice', 'google.com').
str ='purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print tuples ## [('alice', 'google.com'), ('bob', 'abc.com')]
for tuple in tuples:
print tuple[0] ## username
print tuple[1] ## host
Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parenthesis, then findall() returns a list of found strings as in earlier examples. If the pattern includes a single set of parenthesis, then findall() returns a list of strings corresponding to that single group. (Obscure optional feature: Sometimes you have paren ( ) groupings in the pattern, but which you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ) and that left paren will not count as a group result.)
11 Options
The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).
- IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
 - DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
 - MULTILINE -- Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
 
12 Greedy vs. Non-Greedy
This is optional section which shows a more advanced regular expression technique not needed for the exercises.
Suppose you have text with tags in it: <b>foo</b> and <i>so on</i>
Suppose you are trying to match each tag with the pattern '(<.*>)' -- what does it match first?
The result is a little surprising, but the greedy aspect of the .* causes it to match the whole '<b>foo</b> and <i>so on</i>' as one big match. The problem is that the .* goes as far as is it can, instead of stopping at the first > (aka it is "greedy").
There is an extension to regular expression where you add a ? at the end, such as .*? or .+?, changing them to be non-greedy. Now they stop as soon as they can. So the pattern '(<.*?>)' will get just '<b>' as the first match, and '</b>' as the second match, and so on getting each <..> pair in turn. The style is typically that you use a .*?, and then immediately its right look for some concrete marker (> in this case) that forces the end of the .*? run.
The *? extension originated in Perl, and regular expressions that include Perl's extensions are known as Perl Compatible Regular Expressions -- pcre. Python includes pcre support. Many command line utils etc. have a flag where they accept pcre patterns.
An older but widely used technique to code this idea of "all of these chars except stopping at X" uses the square-bracket style. For the above you could write the pattern, but instead of .* to get all the chars, use [^>]* which skips over all characters which are not > (the leading ^ "inverts" the square bracket set, so it matches any char not in the brackets).
13 Substitution
The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.
Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.
str ='purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str)
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
Python中的正则表达式regular expression的更多相关文章
- C#中【正则表达式regular expression】相关的知识
		
Regex System.Text.RegularExpressions.Regex regex应该是regular expression的缩写 https://msdn.microsoft ...
 - Python  -- 正则表达式 regular expression
		
正则表达式(regular expression) 根据其英文翻译,re模块 作用:用来匹配字符串. 在Python中,正则表达式是特殊的字符序列,检查一个字符串是否与某种模式匹配. 设计思想:用一 ...
 - Python::re 模块 -- 在Python中使用正则表达式
		
前言 这篇文章,并不是对正则表达式的介绍,而是对Python中如何结合re模块使用正则表达式的介绍.文章的侧重点是如何使用re模块在Python语言中使用正则表达式,对于Python表达式的语法和详细 ...
 - Python 模块 re (Regular Expression)
		
使用 Python 模块 re 实现解析小工具 概要 在开发过程中发现,Python 模块 re(Regular Expression)是一个很有价值并且非常强大的文本解析工具,因而想要分享一下此 ...
 - 在python中使用正则表达式(转载)
		
https://www.cnblogs.com/hanmk/p/9143514.html 在python中使用正则表达式(一) 在python中通过内置的re库来使用正则表达式,它提供了所有正则表 ...
 - 在Python中使用正则表达式同时匹配邮箱和电话并进行简单的分类
		
在Python使用正则表达式需要使用re(regular exprssion)模块,使用正则表达式的难点就在于如何写好p=re.compile(r' 正则表达式')的内容. 下面是在Python中使用 ...
 - Java基础-正则表达式(Regular Expression)语法规则简介
		
Java基础-正则表达式(Regular Expression)语法规则简介 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.正则表达式的概念 正则表达式(Regular Exp ...
 - 正则表达式-Regular expression学习笔记
		
正则表达式 正则表达式(Regular expression)是一种符号表示法,被用来识别文本模式. 最近在学习正则表达式,今天整理一下其中的一些知识点 grep - 打印匹配行 grep 是个很强大 ...
 - 正则表达式(Regular Expression, RegEx)学习入门
		
1. 概述 正则表达式(Regular Expression, RegEx)是一种匹配模式,描述的是一串文本的特征. 正如自然语言中高大.坚固等词语抽象出来描述事物特征一样,正则表达式就是字符的高度抽 ...
 
随机推荐
- ViewPager中GridView问题
			
GridView 嵌套在ViewPager中问题. 1. GridView属性设置无法显示. 正常显示方式 <GridView android:padding="8dip" ...
 - 靶形数独 (codevs 1174)题解
			
[问题描述] 小城和小华都是热爱数学的好学生,最近,他们不约而同地迷上了数独游戏,好胜的他们想用数独来一比高低.但普通的数独对他们来说都过于简单了,于是他们向Z 博士请教,Z 博士拿出了他最近发明的“ ...
 - C# 平时碰见的问题【4】
			
1. 模糊查询 like的参数化写法 string keyword="value"; // 要模糊匹配的值 错误示范: sql: string strSql=" ...
 - Python学习教程(learning Python)--2 Python简单函数设计
			
本节讨论Python程序设计时为何引入函数? 为何大家都反对用一堆堆的单个函数语句完成一项程序的设计任务呢? 用一条条的语句去完成某项程序设计时,冗长.不宜理解,不宜复用,而采用按功能模块划分成函数, ...
 - SQL基础篇---基本概念解析
			
1.数据库database:保存表和其他相关SQL结构容器(一般是一个文件或者一组文件) 2.SQL (Structared Query Language):是一种专门用来与数据库沟通的语言,是一种结 ...
 - 细说Debug和Release区别
			
VC下Debug和Release区别 最近写代码过程中,发现 Debug 下运行正常,Release 下就会出现问题,百思不得其解,而Release 下又无法进行调试,于是只能采用printf方式逐步 ...
 - 关于UIView需要看的一些官方文档
			
View Controller PG(Programming Guide) 看过一遍 View PG 正在看 Drawing and Printing PG Quartz 2D PG 更高级的cus ...
 - 如何查看hadoop与hbase的版本匹配关系
			
官网:http://hbase.apache.org/book.html 搜索:Hadoop version support matrix 下面有一个二维的支持关系表.
 - Pintos修改优先级捐赠、嵌套捐赠、锁的获得与释放、信号量及PV操作
			
Pintos修改优先级捐赠.嵌套捐赠.锁的获得与释放.信号量及PV操作 原有的优先级更改的情况下面没有考虑到捐赠的情况,仅仅只是改变更改了当前线程的优先级,更别说恢复原本优先级了,所以不能通过任何有关 ...
 - sublime mac快捷键
			
^是control ⌥是option 打开/前往 ⌘T 前往文件 ⌘⌃P 前往项目 ⌘R 前往 method ⌘⇧P 命令提示 ⌃G 前往行 ⌘KB 开关侧栏 ⌃ ` python 控制台 ⌘⇧N 新 ...