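A First Look at Python Web Crawlers

Preparation

0x01 Introduction to crawlers and their value

a. Introduction

A crawler is a program that automatically fetches data from the internet. It is one of the basic techniques for collecting data from the web.

b. Value

Quickly extracts valuable information from the web

0x02 Development environment for crawlers

a. Environment checklist

Python 3.7
Operating system: Mac, Windows, or Linux
Editor: PyCharm
Page download: requests (2.21.0)
Page parsing: BeautifulSoup/bs4 (4.11.2)
Dynamic page download: Selenium (3.141.0)

b. Testing the environment

Create a new Python package named test
In that package, create a new Python file named test_env

The test code is as follows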

import requests
from bs4 import BeautifulSoup
import selenium

print("OK!")

If OK! is printed, the modules were installed successfully
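The import test only shows that the modules load. As a minimal sketch, the installed versions can also be compared against the checklist. This assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library (on Python 3.7 it is not available), and `installed_version` is a helper name made up for this example.

```python
from importlib import metadata  # standard library from Python 3.8 on


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as pip knows them for the packages in the checklist
    for name in ("requests", "beautifulsoup4", "selenium"):
        print(name, installed_version(name) or "not installed")
```

A package that prints "not installed" can be added with pip before continuing.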

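Basics

0x03 A simple crawler architecture and its execution flow

Crawler scheduler (start, stop)
Crawler architecture (three modules)

graph LR
A(URL manager)--URL-->B(Page downloader)
B--HTML-->C(Page parser)
C-.URL.->A
1. URL manager
  
  Manages URLs and prevents duplicate crawling
2. Page downloader
  
  Downloads page content
3. Page parser
  
  Extracts valuable data and new URLs to crawl
Valuable data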
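The flow between the three modules can be sketched as a single loop. This is only an illustration under stated assumptions: `crawl`, `fake_download`, `fake_parse`, and `FAKE_SITE` are names invented here, and the downloader and parser are stand-ins that work on an in-memory dictionary instead of real web pages.

```python
def crawl(start_url, download, parse, max_pages=10):
    """Run the manager -> downloader -> parser loop and return the collected data."""
    new_urls, old_urls = {start_url}, set()   # URL manager state
    results = []
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        html = download(url)                  # page downloader
        data, found_urls = parse(url, html)   # page parser
        results.append(data)
        # Feed back only the URLs that have not been crawled yet
        new_urls.update(u for u in found_urls if u not in old_urls)
    return results


# Stand-in site: each "page" is simply the list of links it contains
FAKE_SITE = {"a": ["b", "c"], "b": ["a", "c"], "c": []}


def fake_download(url):
    return FAKE_SITE[url]


def fake_parse(url, html):
    # The "data" is the URL itself; the new URLs are the links on the page
    return url, html


pages = crawl("a", fake_download, fake_parse)
print(sorted(pages))  # ['a', 'b', 'c']
```

Even though pages "a" and "b" link to each other, each page is visited once, which is the job of the URL manager.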

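0x04 URL manager

a. Introduction

Purpose: manage the URLs to be crawled, preventing duplicate and circular crawling
Public interface
- Take out one URL to crawl
- Add a new URL to crawl
Implementation logic
- When a URL is taken out, its state changes to crawled
- When a URL is added, check whether it already exists
Data storage
- Python memory
  - URLs to crawl: set
  - Crawled URLs: set
- Redis
  - URLs to crawl: set
  - Crawled URLs: set
- MySQL
  - urls(url, is_crawled)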
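The database option can be sketched with a single `urls(url, is_crawled)` table. To keep the example self-contained, it swaps MySQL for the standard-library `sqlite3` module, so the SQL is SQLite syntax (MySQL would use `INSERT IGNORE` instead of `INSERT OR IGNORE`). The class name `SqliteUrlStore` is made up for this example.

```python
import sqlite3


class SqliteUrlStore:
    """URL manager backed by a urls(url, is_crawled) table (SQLite stand-in)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        if not url:
            return
        # The primary key turns the insert into a no-op for known URLs
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def get_url(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Taking a URL out marks it as crawled
        self.conn.execute("UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        return row is not None


store = SqliteUrlStore()
store.add_new_url("url1")
store.add_new_url("url1")   # duplicate, ignored
store.add_new_url("url2")
print(store.get_url(), store.get_url(), store.get_url())
```

Unlike the in-memory sets, a file-backed database would let a crawl resume after a restart, at the cost of a query per operation.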

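b. Implementation

Create a new Python package named utils
In that package, create a new Python file named url_manager

Because the manager has to expose an interface to other code, it is wrapped in a class. The code is as follows: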

class UrlManager:
    """URL manager."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Add a single URL."""
        # Skip empty input
        if url is None or len(url) == 0:
            return
        # Skip URLs that are already known
        if url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Add URLs in bulk."""
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def get_url(self):
        """Take out one URL to crawl and mark it as crawled."""
        if self.has_new_url():
            url = self.new_urls.pop()
            self.old_urls.add(url)
            return url
        else:
            return None

    def has_new_url(self):
        """Return whether any URL is still waiting to be crawled."""
        return len(self.new_urls) > 0


# Test code
if __name__ == "__main__":
    url_manager = UrlManager()

    # Test adding URLs
    url_manager.add_new_url("url1")
    url_manager.add_new_urls(["url1", "url2"])
    print(url_manager.new_urls, url_manager.old_urls)

    # Test taking URLs out
    print("=" * 20)   # separator
    new_url = url_manager.get_url()
    print(url_manager.new_urls, url_manager.old_urls)

    print("=" * 20)
    new_url = url_manager.get_url()
    print(url_manager.new_urls, url_manager.old_urls)

    print("=" * 20)
    print(url_manager.has_new_url())

0x05 Page downloader (requests)

a. Introduction

Website: python-requests
Installation: pip install requests
About:

Requests is an elegant and simple HTTP library for Python, built for human beings.

In crawlers, it is commonly used to download page content
Execution flow

graph LR
A(Python program<br/>requests library)--request-->B(Web server)
B--response-->A

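b. Sending a request

requests.get/post(url, params, data, headers, timeout, verify, allow_redirects, cookies)

url: URL of the target page to download
params: dictionary of query parameters appended to the URL, e.g. ?id=123&name=xxx
data: dictionary or string, typically the data submitted with a POST request
headers: request headers such as User-Agent and Referer
timeout: timeout in seconds
verify: boolean, whether to verify the HTTPS certificate; defaults to True. It can also be set to the path of a CA bundle when a custom certificate is needed
allow_redirects: boolean, whether requests follows redirects; defaults to True
cookies: local cookie data to send along with the request

url, data, headers, and timeout are the most commonly used parameters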
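To see what `params` and `headers` do without contacting a server, a request can be built and prepared but not sent. This is a small sketch: the URL and the User-Agent value are placeholders chosen for this example.

```python
import requests

request = requests.Request(
    "GET",
    "http://example.com/search",                # placeholder URL
    params={"id": 123, "name": "xxx"},          # becomes the query string
    headers={"User-Agent": "my-crawler/0.1"},   # placeholder user agent
)
prepared = request.prepare()

print(prepared.url)                    # http://example.com/search?id=123&name=xxx
print(prepared.headers["User-Agent"])  # my-crawler/0.1
```

In everyday use the same arguments are passed straight to `requests.get(url, params=..., headers=..., timeout=...)`, which prepares and sends the request in one step.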

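c. Receiving the response

res = requests.get/post(url)

res.status_code: the status code
res.encoding: the current encoding; assign to it to change the encoding

(requests infers the encoding from the response headers; if a text response declares no charset, it falls back to ISO-8859-1)
res.text: the returned page content as text
res.headers: the HTTP headers of the response
res.url: the URL that was actually accessed
res.content: the content as bytes, e.g. when downloading an image
res.cookies: the cookies the server wants to store locally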
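The ISO-8859-1 fallback matters because decoding UTF-8 bytes with the wrong encoding produces garbled text. The sketch below reproduces the effect with plain strings and no network access; the sample word is arbitrary.

```python
raw = "café".encode("utf-8")          # bytes as a server might send them

garbled = raw.decode("iso-8859-1")    # result of a wrong ISO-8859-1 guess
fixed = raw.decode("utf-8")           # result with the correct encoding

print(garbled)  # cafÃ©
print(fixed)    # café

# The garbled text can be repaired by reversing the wrong decode
print(garbled.encode("iso-8859-1").decode("utf-8"))  # café
```

With requests, the usual remedy is to set `res.encoding` (for example to "utf-8", or to `res.apparent_encoding`) before reading `res.text`.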

d. Demonstration

Install ipython from cmd with: python -m pip install ipython

Start ipython from cmd with: ipython

In [1]: import requests

In [2]: url = "https://www.cnblogs.com/SRIGT"

In [3]: res = requests.get(url)

In [4]: res.status_code
Out[4]: 200

In [5]: res.encoding
Out[5]: 'utf-8'

In [6]: res.url
Out[6]: 'https://www.cnblogs.com/SRIGT'

0x06 Page parser (BeautifulSoup)

a. Introduction

Website: Beautiful Soup: We called him Tortoise because he taught us.
Installation: pip install beautifulsoup4
About: a third-party Python library for extracting data from HTML
Usage: import bs4 or from bs4 import BeautifulSoup

b. Syntax

graph LR
HTML page-->A(Create a BeautifulSoup object)
A-->B(Search nodes<br/>find_all, find)
B-.->B1(By node name)
B-.->B2(By attribute value)
B-.->B3(By node text)
B-->C(Access nodes<br/>name, attributes, text)

Creating a BeautifulSoup object

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML document
soup = BeautifulSoup(
    html_doc,               # the HTML document (str or bytes)
    'html.parser',          # the HTML parser
    from_encoding='utf-8'   # document encoding; only used when html_doc is bytes
)

Searching nodes

# find_all(name, attrs, string)

# Find all nodes with the tag a
soup.find_all('a')

# Find all nodes with the tag a whose link is exactly /xxx/index.html
soup.find_all('a', href='/xxx/index.html')

# Find all nodes with the tag div, the class abc, and the text python
soup.find_all('div', class_='abc', string='python')

Accessing node information

# Given the node: <a href='1.html'>Python</a>

# Get the tag name of the node
node.name

# Get the href attribute of the a node
node['href']

# Get the link text of the a node
node.get_text()
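When links only need to follow a form such as /xxx/index.html rather than equal one fixed string, a compiled regular expression can be passed as the attribute value. The sketch below runs the search and access steps on a made-up HTML snippet; the tags and paths are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML for illustration
html_doc = """
<div class="abc">python</div>
<a href="/news/index.html">News</a>
<a href="/blog/index.html">Blog</a>
<a href="/about.html">About</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A compiled pattern as the attribute value matches by regular expression
index_links = soup.find_all("a", href=re.compile(r"^/\w+/index\.html$"))
hrefs = [node["href"] for node in index_links]
print(hrefs)  # ['/news/index.html', '/blog/index.html']

# find() returns the first match, or None when nothing matches
node = soup.find("div", class_="abc", string="python")
print(node.name, node["class"], node.get_text())  # div ['abc'] python
```

Note that `class` is a multi-valued attribute, so `node["class"]` is a list rather than a string.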

c. Demonstration

Target page

<html>
    <head>
        <meta charset="utf-8">
        <title>Page title</title>
    </head>
    <body>
        <h1>Heading 1</h1>
        <h2>Heading 2</h2>
        <h3>Heading 3</h3>
        <h4>Heading 4</h4>
        <div id="content" class="default">
            <p>Paragraph</p>
            <a href="http://www.baidu.com">Baidu</a>
            <a href="http://www.cnblogs.com/SRIGT">My blog</a>
        </div>
    </body>
</html>

Test code

from bs4 import BeautifulSoup

with open("./test.html", "r", encoding="utf-8") as fin:
    html_doc = fin.read()

soup = BeautifulSoup(html_doc, "html.parser")

div_node = soup.find("div", id="content")
print(div_node)
print()

links = div_node.find_all("a")
for link in links:
    print(link.name, link["href"], link.get_text())

# find() returns None when nothing matches (the target page has no img),
# so check the result before reading an attribute
img = div_node.find("img")
if img is not None:
    print(img["src"])

Hands-on

0x07 A simple example

import requests
from bs4 import BeautifulSoup

url = "http://www.crazyant.net/"

r = requests.get(url)
if r.status_code != 200:
    raise Exception(f"unexpected status code: {r.status_code}")
html_doc = r.text

soup = BeautifulSoup(html_doc, "html.parser")

# Each post title is an h2 with the class entry-title that wraps a link
h2_nodes = soup.find_all("h2", class_="entry-title")
for h2_node in h2_nodes:
    link = h2_node.find("a")
    print(link["href"], link.get_text())

0x08 Crawling all blog pages

Root domain: 蚂蚁学Python

Article page URL form: PyCharm开发PySpark程序的配置和实例 – 蚂蚁学Python

Attaching a cookie dictionary to a requests call

import requests

cookies = {
    "captchaKey": "14a54079a1",
    "captchaExpire": "1548852352"
}

r = requests.get(
    "http://url",
    cookies=cookies
)

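Pattern matching with regular expressions

import re

url1 = "http://www.crazyant.net/123.html"
url2 = "http://www.crazyant.net/123.html#comments"
url3 = "http://www.baidu.com"

# Dots are escaped so that they match a literal "." rather than any character
pattern = r'^http://www\.crazyant\.net/\d+\.html$'

print(re.match(pattern, url1))  # a match object
print(re.match(pattern, url2))  # None, because of the #comments suffix
print(re.match(pattern, url3))  # None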
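The pattern rejects url2 even though it points to the same article as url1, differing only in the #comments fragment. One possible refinement, sketched here with a helper name (`normalize_article_url`) invented for this example, is to strip the fragment with the standard-library `urldefrag` before matching.

```python
import re
from urllib.parse import urldefrag

ARTICLE_PATTERN = re.compile(r"^http://www\.crazyant\.net/\d+\.html$")


def normalize_article_url(url):
    """Strip the #fragment and return the URL if it is an article page, else None."""
    clean_url, _fragment = urldefrag(url)
    return clean_url if ARTICLE_PATTERN.match(clean_url) else None


print(normalize_article_url("http://www.crazyant.net/123.html"))
# http://www.crazyant.net/123.html
print(normalize_article_url("http://www.crazyant.net/123.html#comments"))
# http://www.crazyant.net/123.html
print(normalize_article_url("http://www.baidu.com"))
# None
```

Because both variants normalize to the same string, a set-based URL manager would treat them as one page.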

Full-site crawl

from utils import url_manager
from bs4 import BeautifulSoup
import requests
import re

root_url = "http://www.crazyant.net"
# Article pages look like http://www.crazyant.net/123.html
pattern = re.compile(r'^http://www\.crazyant\.net/\d+\.html$')

urls = url_manager.UrlManager()
urls.add_new_url(root_url)

# The with statement closes the file even if the crawl raises an error
with open("craw_all_pages.txt", "w", encoding="utf-8") as file:
    while urls.has_new_url():
        curr_url = urls.get_url()
        r = requests.get(curr_url, timeout=3)
        if r.status_code != 200:
            print("error, return status_code is not 200", curr_url)
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        # A page may have no <title>
        title = soup.title.string if soup.title else ""
        file.write("%s\t%s\n" % (curr_url, title))
        file.flush()
        print("success: %s, %s, %d" % (curr_url, title, len(urls.new_urls)))

        # Queue every link that points to an article page
        for link in soup.find_all("a"):
            href = link.get("href")
            if href is None:
                continue
            if pattern.match(href):
                urls.add_new_url(href)
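The crawl above only follows links whose href is already an absolute URL. If a site uses relative links, which is an assumption and not something stated about this blog, they can be resolved against the current page with the standard-library `urljoin` before matching. The helper name `absolute_links` and the sample hrefs are made up for this sketch.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "http://www.crazyant.net/123.html"
hrefs = ["/456.html", "789.html", "http://www.crazyant.net/1.html", "#comments"]

for link in absolute_links(page_url, hrefs):
    print(link)
# http://www.crazyant.net/456.html
# http://www.crazyant.net/789.html
# http://www.crazyant.net/1.html
# http://www.crazyant.net/123.html#comments
```

The resolved URLs can then go through the same regular expression check as before.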

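0x09 Crawling the Douban Movie Top 250

This list currently has anti-scraping measures in place

Steps:

Fetch the pages with requests

Parse the data with BeautifulSoup

Write the data to Excel with pandas

Imports

import requests

from bs4 import BeautifulSoup

import pandas as pd

Downloading the HTML of all 10 pages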

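# Build the list of page offsets: 0, 25, ..., 225
page_indexs = range(0, 250, 25)
list(page_indexs)

# Because of the anti-scraping measures, a request without a browser-like
# User-Agent is likely to be rejected. The value below is only an example.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def download_all_htmls():
    """
    Download the HTML of every list page for later analysis
    """
    htmls = []
    for idx in page_indexs:
        url = f"https://movie.douban.com/top250?start={idx}&filter="
        print("craw html: ", url)
        r = requests.get(url, headers=headers)
        if r.status_code != 200:
            raise Exception(f"error: status code {r.status_code}")
        htmls.append(r.text)
    return htmls


# Run the download
htmls = download_all_htmls()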

Parsing the HTML to get the data

def parse_single_html(html):
    """
    Parse a single HTML page and return its data
    @return a list of dicts with the keys rank, title, rating_star, rating_num, comments
    """
    soup = BeautifulSoup(html, 'html.parser')
    article_items = (
        soup.find("div", class_="article")
            .find("ol", class_="grid_view")
            .find_all("div", class_="item")
    )
    datas = []
    for article_item in article_items:
        rank = article_item.find("div", class_="pic").find("em").get_text()
        info = article_item.find("div", class_="info")
        title = info.find("div", class_="hd").find("span", class_="title").get_text()
        stars = (
            info.find("div", class_="bd")
                .find("div", class_="star")
                .find_all("span")
        )
        rating_star = stars[0]["class"][0]
        rating_num = stars[1].get_text()
        comments = stars[3].get_text()
        datas.append({
            "rank": rank,
            "title": title,
            # Keep only the digits of a class name such as rating5-t
            "rating_star": rating_star.replace("rating", "").replace("-t", ""),
            "rating_num": rating_num,
            # Drop the "人评价" (people rated) suffix and keep the count
            "comments": comments.replace("人评价", "")
        })
    return datas


import pprint

pprint.pprint(parse_single_html(htmls[0]))

all_datas = []
for html in htmls:
    all_datas.extend(parse_single_html(html))

print(all_datas)
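Every field produced by the parser is a string. If the values are to be sorted or averaged later, they can be converted to numbers first. The sketch below assumes the record layout built by `parse_single_html`; `clean_record` is a helper name invented here, and the sample record is hypothetical, not real Douban data.

```python
def clean_record(record):
    """Convert the string fields of one parsed movie into numbers."""
    return {
        "rank": int(record["rank"]),
        "title": record["title"],
        "rating_star": int(record["rating_star"]),
        "rating_num": float(record["rating_num"]),
        "comments": int(record["comments"]),
    }


# Hypothetical record in the shape returned by parse_single_html
sample = {
    "rank": "1",
    "title": "Example Movie",
    "rating_star": "5",
    "rating_num": "9.7",
    "comments": "123456",
}
print(clean_record(sample))
# {'rank': 1, 'title': 'Example Movie', 'rating_star': 5, 'rating_num': 9.7, 'comments': 123456}
```

Applied to the full list, this would be `[clean_record(d) for d in all_datas]`.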

Saving the results to Excel

df = pd.DataFrame(all_datas)
df.to_excel("TOP250.xlsx")
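Writing .xlsx files with pandas relies on an Excel engine such as openpyxl being installed. Where that is not available, the same rows can be saved as CSV with the standard library instead. This is a swapped-in alternative, not the pandas approach used above; `write_csv` is a helper name made up for this sketch and the sample row is hypothetical.

```python
import csv
import io

FIELDNAMES = ["rank", "title", "rating_star", "rating_num", "comments"]


def write_csv(rows, fileobj):
    """Write a list of movie dicts to an open text file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical row in the shape produced by the parsing step
rows = [{"rank": "1", "title": "Example Movie", "rating_star": "5",
         "rating_num": "9.7", "comments": "123456"}]

buffer = io.StringIO()   # in-memory file, so the example leaves nothing on disk
write_csv(rows, buffer)
print(buffer.getvalue())
```

For a real file, `open("TOP250.csv", "w", newline="", encoding="utf-8")` can be passed in place of the buffer.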

End