[Python Study Notes] Coursera Course "Using Python to Access Web Data", University of Michigan, Charles Severance

**Week 2: Regular Expressions**

11.1 Regular Expressions

11.1.1 Python Regular Expression Quick Guide

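^	Matches the beginning of a line
$	Matches the end of a line
.	Matches any character
\s	Matches a whitespace character
\S	Matches any non-whitespace character
*	Repeats a character zero or more times
*?	Repeats a character zero or more times (non-greedy)
+	Repeats a character one or more times
+?	Repeats a character one or more times (non-greedy)
[aeiou]	Matches a single character in the listed set
[^XYZ]	Matches a single character not in the listed set
[a-z0-9]	The set of characters can include a range
(	Indicates where string extraction is to start
)	Indicates where string extraction is to end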

Note: non-greedy mode matches as few characters as possible.
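
A few of the entries above can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and do not come from the course data.

```python
import re

# Made-up sample line, used only for illustration.
text = 'From: alice42@example.com on Sat'

# ^ and $ anchor the pattern to the start and end of the line.
starts_with_from = re.search(r'^From', text) is not None
ends_with_sat = re.search(r'Sat$', text) is not None

# [0-9]+ matches a run of one or more digits.
digits = re.findall(r'[0-9]+', text)

# [aeiou] matches one listed character; [^XYZ] matches one unlisted character.
vowels = re.findall(r'[aeiou]', 'regex')
not_xyz = re.findall(r'[^XYZ]', 'XaYbZ')

# Greedy + takes as many characters as it can; non-greedy +? takes as few as it can.
greedy = re.findall(r'a+', 'aaa')
non_greedy = re.findall(r'a+?', 'aaa')

print(starts_with_from, ends_with_sat)
print(digits, vowels, not_xyz)
print(greedy, non_greedy)
```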

11.1.2 The Regular Expression Module

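Before using regular expressions in a program, you must import the module with 'import re'.

You can then use re.search() to check whether a string matches a regular expression, much like using find().

You can also use re.findall() to extract the parts of a string that match a regular expression, much like combining find() with slicing, var[5:10].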
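
To make the comparison concrete, here is a small sketch that puts the string methods next to their re counterparts. The sample line is only an illustration.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# find() returns an index (or -1); re.search() returns a match object (or None).
index = line.find('From:')
match = re.search('From:', line)

# Extracting with find() and slicing: take everything after the first space.
space = line.find(' ')
sliced = line[space + 1:]

# Extracting with re.findall(): returns a list of all matching substrings.
found = re.findall(r'\S+@\S+', line)

print(index, match.start())
print(sliced)
print(found)
```

Note that re.findall() always returns a list, which is empty when nothing matches.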

11.1.3 Using re.search() Like find()

Code using find():

hand = open('mbox-short.txt')

for line in hand:

    line = line.rstrip()

    if line.find('From:') >= 0:

        print(line)

Code using re.search():

import re

hand = open('mbox-short.txt')

for line in hand:

    line = line.rstrip()

    if re.search('From:', line):

        print(line)

11.1.4 Using re.search() Like startswith()

Code using startswith():

hand = open('mbox-short.txt')

for line in hand:

    line = line.rstrip()

    if line.startswith('From:'):

        print(line)

Code using re.search():

import re

hand = open('mbox-short.txt')

for line in hand:

    line = line.rstrip()

    if re.search('^From:', line):

        print(line)
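
The effect of the ^ anchor is easiest to see on a line where From: appears in the middle. This is a rough sketch with in-memory lines instead of the file; the second sample line is hypothetical.

```python
import re

lines = [
    'From: stephen.marquard@uct.ac.za',
    'Reply-From: someone@example.com',  # hypothetical line with 'From:' mid-line
]

# Without ^, re.search() behaves like find(): a match anywhere in the line counts.
unanchored = [line for line in lines if re.search('From:', line)]

# With ^, the pattern must match at the start, like startswith().
anchored = [line for line in lines if re.search('^From:', line)]
with_startswith = [line for line in lines if line.startswith('From:')]

print(unanchored)
print(anchored)
print(anchored == with_startswith)
```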

11.1.5 Wild-Card Characters

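The dot matches any character. Adding an asterisk lets that character repeat any number of times.

So the regular expression ^X.*: matches lines that start with X, followed by any characters, any number of times, up to a colon.

For example, it could match lines like these: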

X-Sieve: CMU Sieve 2.3

X-DSPAM-Result: Innocent

X-Plane is behind schedule: two weeks

11.1.6 Fine-Tuning Your Match

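To match what we want more precisely, we can refine the pattern.

For example, ^X-\S+: matches lines that start with X-, followed by one or more non-whitespace characters and then a colon.

This matches the first two lines above but not the third.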
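
The two patterns can be compared on the three sample lines from the previous section. A minimal sketch:

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
]

# Loose pattern: X, then any characters, then a colon somewhere later.
loose = [line for line in lines if re.search(r'^X.*:', line)]

# Tighter pattern: X-, then non-whitespace only, then a colon.
strict = [line for line in lines if re.search(r'^X-\S+:', line)]

print(len(loose))
print(strict)
```

The third line fails the tighter pattern because a space appears before its colon, and \S cannot match a space.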

11.2 Extracting Data

[0-9]+ matches one or more digits.

>>> import re

>>> x = 'My 2 favorite numbers are 19 and 42'

>>> y = re.findall('[0-9]+', x)

>>> print(y)

['2', '19', '42']

11.2.1 Warning: Greedy Matching

The greedy mode mentioned earlier matches the longest string that satisfies the pattern.

For example:

>>> import re

>>> x = 'From: Using the : character'

>>> y = re.findall('^F.+:', x)

>>> print(y)

['From: Using the :']

Because matching is greedy, the result is the longest match rather than 'From:'.

11.2.2 Non-Greedy Matching

Adding a ? after + or * switches to non-greedy mode.

>>> import re

>>> x = 'From: Using the : character'

>>> y = re.findall('^F.+?:', x)

>>> print(y)

['From:']
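
The same switch works for *. Here is a small sketch on a made-up string, chosen so that the non-greedy pattern finds more than one match:

```python
import re

s = 'a:b:c'  # made-up string for illustration

# Greedy: .* runs to the end, then backs up to the last colon.
greedy = re.findall(r'.*:', s)

# Non-greedy: .*? stops at the first colon it can, so findall() finds two matches.
non_greedy = re.findall(r'.*?:', s)

print(greedy)
print(non_greedy)
```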

11.2.3 Fine-Tuning String Extraction

Suppose we want to locate the email address in the following line.

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

We can do this:

>>> x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

>>> y = re.findall(r'\S+@\S+', x)

>>> print(y)

['stephen.marquard@uct.ac.za']

Parentheses let us mark where the text we want to extract starts and ends. For example:

>>> y = re.findall(r'From (\S+@\S+)', x)

>>> print(y)

['stephen.marquard@uct.ac.za']
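
Parentheses change what re.findall() returns, not what the pattern matches. The sketch below compares the pattern with no group, one group, and two groups. The two-group variant is an extra illustration, not something from the course notes.

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: the whole match is returned, including 'From '.
whole = re.findall(r'From \S+@\S+', x)

# One group: only the text inside the parentheses is returned.
address = re.findall(r'From (\S+@\S+)', x)

# Two groups: each result is a tuple with one item per group.
parts = re.findall(r'From (\S+)@(\S+)', x)

print(whole)
print(address)
print(parts)
```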

11.2.4 Spam Confidence

An example that collects every X-DSPAM-Confidence value and prints the largest:

import re

hand = open('mbox-short.txt')

numlist = list()

for line in hand:

    line = line.rstrip()

    stuff = re.findall('X-DSPAM-Confidence: ([0-9.]+)', line)

    if len(stuff) != 1: continue

    num = float(stuff[0])

    numlist.append(num)

print('Maximum:', max(numlist))
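
If mbox-short.txt is not at hand, the same logic can be tried on a few in-memory lines. This is a sketch only; the sample lines and their values are made up for illustration.

```python
import re

# Made-up header lines standing in for the contents of the mail file.
sample_lines = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.9907',
]

numlist = []
for line in sample_lines:
    # The parentheses keep only the number, not the header name.
    stuff = re.findall(r'X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

maximum = max(numlist)
print('Maximum:', maximum)
```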

Assignment

import re

hand = open('actual.txt')

numlist = list()

for line in hand:

    line = line.rstrip()

    stuff = re.findall('[0-9]+', line)

    if len(stuff) == 0: continue

    for s in stuff:

        numlist.append(int(s))

print(len(numlist))

print(sum(numlist))
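
Because re.findall() scans an entire string, the counting and summing can also be sketched as a single pass over the whole text. The sample text here is made up and stands in for the assignment file.

```python
import re

# Made-up text standing in for the contents of the assignment file.
sample_text = 'There are 3 apples\nand 14 pears in 2 baskets\nno numbers here\n'

# Find every run of digits in the text and convert each one to an int.
numbers = [int(s) for s in re.findall(r'[0-9]+', sample_text)]

count = len(numbers)
total = sum(numbers)
print(count)
print(total)
```

With a real file, the text could be obtained with open(...).read(), assuming the file is small enough to read into memory at once.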