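Python basics

My earlier "Python 3 极简教程.pdf" (A Minimal Python 3 Tutorial) is aimed at readers who already have some programming experience and want a quick start. After working through that series you should be able to write API endpoints on your own and build small tools.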

requests

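requests is an HTTP library for Python, roughly what Retrofit is on Android. Its features include Keep-Alive and connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, and sessions. At the time of writing it supports both Python 2 and Python 3. GitHub: https://github.com/requests/requests .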

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Sending requests

Common HTTP request methods include GET, POST, PUT, and DELETE.

import requests

# GET request
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')

# POST request
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert')

# PUT request
response = requests.put('http://127.0.0.1:1024/developer/api/v1.0/update')

# DELETE request
response = requests.delete('http://127.0.0.1:1024/developer/api/v1.0/delete')

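Each request returns a Response object, which wraps the data the server sends back over HTTP. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# Status code
print(response.status_code)

# Response URL
print(response.url)

# Reason phrase
print(response.reason)

# Response body parsed as JSON
print(response.json())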

Custom request headers

To add HTTP headers to a request, pass a dict to the headers keyword argument.

header = {'Application-Id': '19869a66c6',
          'Content-Type': 'application/json'
          }
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all/', headers=header)
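When the same headers go on every call, a requests.Session can hold them once. This is a minimal sketch reusing the header values from the example above; it only prepares the request, so nothing is sent over the network.

```python
import requests

# A Session keeps default headers and merges them into every request it sends.
session = requests.Session()
session.headers.update({
    'Application-Id': '19869a66c6',
    'Content-Type': 'application/json',
})

# Build the request without sending it, to inspect the merged headers.
request = requests.Request('GET', 'http://127.0.0.1:1024/developer/api/v1.0/all/')
prepared = session.prepare_request(request)

print(prepared.headers['Application-Id'])  # 19869a66c6
print('User-Agent' in prepared.headers)    # True: the session defaults are kept too
```

Calling session.get(url) instead of requests.get(url) would then send these headers without repeating the headers argument.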

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , Requests lets you supply the parameters as a dict of strings through the params keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

A list can also be passed as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# Response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
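To see how the query string is built without a running server, the request can be prepared instead of sent. A small sketch under that assumption; the key3 entry is a hypothetical addition to show that keys whose value is None are left out of the URL.

```python
import requests

url = 'http://127.0.0.1:1024/developer/api/v1.0/all'
payload = {'key1': 'value1', 'key2': ['value2', 'value3'], 'key3': None}

# prepare() builds the final URL without opening a connection.
prepared = requests.Request('GET', url, params=payload).prepare()

# The list becomes a repeated key, and key3 is dropped because its value is None.
print(prepared.url)
```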

POST request data

If the server expects form data, pass it through the data keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON body, use the json keyword argument instead. The value can be passed as a dict, and Requests serializes it for you.

obj = {
    "article_title": "The Death of a Government Clerk 2"
}
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert', json=obj)
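The difference between data and json shows up in the request body and the Content-Type header. The sketch below prepares both requests without sending them; the 'demo' title is a placeholder value, not data from the article.

```python
import json
import requests

url = 'http://127.0.0.1:1024/developer/api/v1.0/insert'

# data= produces a form-encoded body.
form = requests.Request('POST', url, data={'key1': 'value1', 'key2': 'value2'}).prepare()
print(form.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form.body)                     # key1=value1&key2=value2

# json= serializes the dict and sets a JSON content type.
as_json = requests.Request('POST', url, json={'article_title': 'demo'}).prepare()
print(as_json.headers['Content-Type'])  # application/json
print(json.loads(as_json.body))         # {'article_title': 'demo'}
```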

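Response content

Requests automatically decodes content from the server. Most Unicode character sets are decoded seamlessly. After a request is made, Requests makes an educated guess about the response encoding based on the HTTP headers.

# Response body as str (text is a property, not a method)
# print(response.text)

# Response body parsed as JSON
print(response.json())

# Response body as bytes (content is also a property)
# print(response.content)

# Raw response stream; requires stream=True on the original request
# response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', stream=True)
# print(response.raw.read(10))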

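Timeouts

Unless you explicitly pass a timeout value, requests does not time out on its own. If the server never responds, the whole application stays blocked and cannot handle anything else.

response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)  # in seconds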
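A timeout only helps if the resulting exception is handled. The helper below is a hedged sketch: fetch and its default timeout values are hypothetical names and numbers chosen for illustration, not part of requests.

```python
import requests

def fetch(url, connect_timeout=3.05, read_timeout=5):
    """Return the response, or None if the request fails or times out."""
    try:
        # A (connect, read) tuple sets the two timeouts separately.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
        response.raise_for_status()  # turn 4xx/5xx status codes into an exception
        return response
    except requests.exceptions.Timeout:
        print('timed out: %s' % url)
    except requests.exceptions.RequestException as e:
        print('request failed: %s' % e)
    return None
```

A caller can then write response = fetch(url) and skip the current round when the result is None, rather than letting one stalled request stop the program.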

Proxy settings

If you hit a site too frequently, the server may well block you. requests supports proxies through the proxies argument.

# Proxies
proxies = {
    'http': 'http://127.0.0.1:1024',
    'https': 'http://127.0.0.1:4000',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', proxies=proxies)

BeautifulSoup

BeautifulSoup is an HTML parsing library for Python, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer being developed, so use BeautifulSoup 4 directly.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

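Basic usage

BeautifulSoup turns a complex HTML document into a tree structure in which every node is a Python object.

Parsing

from bs4 import BeautifulSoup

def get_html_data():
    html_doc = """
    <html>
    <head>
    <title>WuXiaolong</title>
    </head>
    <body>
    <p>Sharing Android techniques, with an eye on Python and other popular technologies.</p>
    <p>Why I blog: to sum up experience and record my own growth.</p>
    <p>You have to work hard enough to make it look effortless! Focus! Polish!
    </p>
    <p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
    <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat official account: 吴小龙同学</a> </p>
    <p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html_doc, "html5lib")
    return soup

# The snippets below work on the parsed document
soup = get_html_data()

tag

tag = soup.head
print(tag)  # <head><title>WuXiaolong</title></head>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>Sharing Android techniques, with an eye on Python and other popular technologies.</p>
print(soup.a['href'])  # the href attribute of the a tag: http://wuxiaolong.me/

Note: when several tags match, attribute access returns the first one, as with the p tag here.

Searching

print(soup.find('p'))  # <p>Sharing Android techniques, with an eye on Python and other popular technologies.</p>

find also returns only the first matching tag, or None if no node matches. To target a specific element, such as the WeChat account paragraph here, filter on an attribute such as the tag's class:

# class is a reserved word in Python, so the keyword argument is spelled class_.
print(soup.find('p', class_="WeChat"))
# <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat official account: 吴小龙同学</a> </p>

Find all p tags:

for p in soup.find_all('p'):
    print(p.string)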
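One detail of the loop above: .string is None when a tag has more than one child, for example a link followed by whitespace, whereas get_text() always returns the combined text. The sketch below illustrates this on a shortened copy of the sample document. It assumes Python's built-in "html.parser" so that it runs without html5lib, and the trailing space in the GitHub paragraph is added here for the demonstration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
<p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a> </p>
"""

# "html.parser" ships with Python; the article itself uses html5lib.
soup = BeautifulSoup(html_doc, "html.parser")

github = soup.find('p', class_='GitHub')
print(github.string)                # None: the <p> holds an <a> plus trailing whitespace
print(github.get_text(strip=True))  # GitHub

# Collect (text, href) pairs for every link in the document.
links = [(a.get_text(strip=True), a['href']) for a in soup.find_all('a')]
print(links)
```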

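Hands-on

A while ago a user reported that my personal app had stopped working. I no longer maintain the app, but I should at least keep it running. Most people know that its data is scraped (see "手把手教你做个人app", a step-by-step guide to building a personal app). One advantage of scraped data is that I don't have to manage the data myself; the drawback is that when the source site goes down or its HTML structure changes, my parsing fails and the app has no data. This report made me consider crawling the site's data into my own store, and since I was teaching myself Python it seemed like good practice, so I got started. My first plan was to crawl with Python, insert into a local MySQL database, write my own API with Flask, call it from Android with Retrofit, and then insert into bmob with the bmob SDK. That was too cumbersome to be practical. I then learned that bmob offers a RESTful API, which solves the main problem: the Python crawler can insert the data directly. The demo here inserts into a local database; with bmob you would call its RESTful API to insert the data instead.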

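Choosing a site

The demo site I chose is https://meiriyiwen.com/random . Each request returns a different article, which is exactly what I need: I only have to request the page on a schedule, parse the data I want, and insert it into the database.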

Creating the database

I created it with Navicat Premium; the command line works just as well.

Creating the table

Create the article table with pymysql. The table needs the fields id, article_title, article_author, and article_content. The code is below and only needs to run once.

import pymysql

def create_table():
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         database='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )'''
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print('create table success')
    except pymysql.MySQLError as e:  # roll back on a database error
        db.rollback()
        print(e)
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()

if __name__ == '__main__':
    create_table()

Parsing the site

First request the page with requests, then use BeautifulSoup to pull out the nodes you need.

import requests
from bs4 import BeautifulSoup

def get_html_data():
    # GET request, with a timeout so a stalled server cannot block forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)
    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    # Print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)
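Because the site's HTML can change, it helps to check each node before using it. The function below is a sketch of that idea, not the author's code: extract_article and the sample fragment are hypothetical, and "html.parser" is assumed so the example runs without html5lib. The node names (article_show, article_author, article_text) are the ones used above.

```python
from bs4 import BeautifulSoup

def extract_article(html):
    """Return (title, author, content), or None if the expected nodes are missing."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", id="article_show")
    if article is None or article.h1 is None:
        return None
    author = article.find("p", class_="article_author")
    text = article.find("div", class_="article_text")
    if author is None or text is None:
        return None
    # Keep the paragraph markup, as the original loop does with str(content).
    content = "".join(str(p) for p in text.find_all("p"))
    return article.h1.get_text(strip=True), author.get_text(strip=True), content

# Hypothetical page fragment shaped like the nodes used above.
sample = """
<div id="article_show">
  <h1>Demo title</h1>
  <p class="article_author"><span>Demo author</span></p>
  <div class="article_text"><p>First.</p><p>Second.</p></div>
</div>
"""
print(extract_article(sample))
print(extract_article("<html><body>layout changed</body></html>"))  # None
```

Returning None lets the caller skip a round when the page layout has changed, instead of failing with an AttributeError.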

Inserting into the database

A filter is applied here: assuming article titles on this site are unique, a row is skipped if one with the same title already exists.

import pymysql

def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         database='python3learn',
                         charset="utf8")
    # Query used for the duplicate check, and the insert statement
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the duplicate check
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print('--------------"%s" insert table success-------------' % article_title)
            return True
        else:
            print('--------------"%s" already exists-------------' % article_title)
            return False
    except pymysql.MySQLError as e:  # roll back on a database error
        db.rollback()
        print(e)
        return False
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()
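The check-then-insert logic can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 with an in-memory database, so it is a stand-in for the pymysql version rather than a copy of it: placeholders are ? instead of %s, and insert_if_new is a hypothetical name.

```python
import sqlite3

def insert_if_new(db, title, author, content):
    """Insert the article unless a row with the same title already exists."""
    cursor = db.cursor()
    try:
        cursor.execute('select 1 from article where article_title = ?', (title,))
        if cursor.fetchone() is not None:
            return False  # duplicate title, nothing inserted
        cursor.execute(
            'insert into article (article_title, article_author, article_content) '
            'values (?, ?, ?)', (title, author, content))
        db.commit()
        return True
    finally:
        cursor.close()

# In-memory database with the same columns as the article table.
db = sqlite3.connect(':memory:')
db.execute('create table article (id integer primary key autoincrement, '
           'article_title text, article_author text, article_content text)')

first = insert_if_new(db, 'Demo title', 'Demo author', '<p>Body</p>')
second = insert_if_new(db, 'Demo title', 'Demo author', '<p>Body</p>')
print(first, second)  # True False
```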

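Scheduling

I set up a timer so that the crawl runs again at a fixed interval.

import sched
import time

# Create a scheduler from the sched module.
# The first argument is a function that returns the current time; the second is a delay function that blocks until the next event is due.
schedule = sched.scheduler(time.time, time.sleep)

# Function invoked periodically by the scheduler
def print_time(inc):
    # to do something
    print('to do something')
    schedule.enter(inc, 0, print_time, (inc,))

# Default interval: 60 s
def start(inc=60):
    # enter() takes four arguments: the delay, the priority (used to order events due at the same time),
    # the function to call, and the arguments for that function (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()

if __name__ == '__main__':
    # Print every 5 s
    start(5)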
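The loop above re-arms itself forever, so schedule.run() never returns. For testing it can be handy to stop after a fixed number of runs. A minimal sketch of that variation; tick, runs, and the 0.01 s interval are hypothetical choices for the demonstration.

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
runs = []

def tick(interval, remaining):
    runs.append(time.time())
    if remaining > 1:
        # Re-arm only while runs remain, so scheduler.run() eventually returns.
        scheduler.enter(interval, 0, tick, (interval, remaining - 1))

scheduler.enter(0, 0, tick, (0.01, 3))
scheduler.run()
print(len(runs))  # 3
```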

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time

def create_table():
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         database='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )'''
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print('create table success')
    except pymysql.MySQLError as e:  # roll back on a database error
        db.rollback()
        print(e)
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()

def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         database='python3learn',
                         charset="utf8")
    # Query used for the duplicate check, and the insert statement
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the duplicate check
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print('--------------"%s" insert table success-------------' % article_title)
            return True
        else:
            print('--------------"%s" already exists-------------' % article_title)
            return False
    except pymysql.MySQLError as e:  # roll back on a database error
        db.rollback()
        print(e)
        return False
    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()

def get_html_data():
    # GET request, with a timeout so a stalled server cannot block forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)
    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    # Print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)
    # Insert into the database
    insert_table(article_title, article_author, article_content)

# Create a scheduler from the sched module.
# The first argument is a function that returns the current time; the second is a delay function that blocks until the next event is due.
schedule = sched.scheduler(time.time, time.sleep)

# Function invoked periodically by the scheduler
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))

# Default interval: 60 s
def start(inc=60):
    # enter() takes four arguments: the delay, the priority (used to order events due at the same time),
    # the function to call, and the arguments for that function (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()

if __name__ == '__main__':
    # Crawl once every 5 minutes
    start(60*5)

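Open question: this crawls only a single article. How should one crawl a site that has an article list where each entry links to a detail page? Presumably you fetch the list first, then loop over the entries, parsing each detail page and inserting it into the database. I haven't worked out the best approach yet, so I'll leave it for a later topic.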
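One possible shape for the list-then-detail approach is sketched below. It covers only the first step, collecting the detail-page URLs from a list page. Everything specific in it is an assumption for illustration: the article_list class, the example.com URLs, the extract_detail_urls name, and the use of "html.parser".

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_detail_urls(list_html, base_url):
    """Return absolute, de-duplicated detail-page URLs found in a list page."""
    soup = BeautifulSoup(list_html, "html.parser")
    container = soup.find("ul", class_="article_list")
    if container is None:
        return []
    urls = []
    for a in container.find_all("a", href=True):
        # Resolve relative links against the URL of the list page.
        url = urljoin(base_url, a["href"])
        if url not in urls:
            urls.append(url)
    return urls

# Hypothetical list page with one repeated link.
list_html = """
<ul class="article_list">
  <li><a href="/article/1">One</a></li>
  <li><a href="/article/2">Two</a></li>
  <li><a href="/article/1">One again</a></li>
</ul>
"""
urls = extract_detail_urls(list_html, "http://example.com/list")
print(urls)  # ['http://example.com/article/1', 'http://example.com/article/2']
```

Each URL could then be fetched with requests.get and parsed the same way as the single article, with a pause between requests to avoid hammering the site.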

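Closing

Although I learn Python purely as a hobby, I still want to put it to use; otherwise this knowledge is quickly forgotten. Look out for my next Python article.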

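References

Quickstart - Requests 2.18.1 documentation

爬虫入门系列（二）：优雅的HTTP库requests (Crawler basics series, part 2: requests, an elegant HTTP library)

Beautiful Soup 4.2.0 documentation

爬虫入门系列（四）：HTML文本解析库BeautifulSoup (Crawler basics series, part 4: BeautifulSoup, an HTML parsing library)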

