requests和BeautifulSoup

一：Requests库

Requests is an elegant and simple HTTP library for Python, built for human beings.

1.安装

pip install requests

安装小测

>>> import requests

>>> r=requests.get("http://www.baidu.com")

>>> print(r.status_code)

200

>>> r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

2.requests库的八个方法

requests.request() 构造一个请求，支撑以下各方法的基础方法

requests.get() 获取HTML网页的主要方法，对应于HTTP的GET

requests.head() 获取HTML网页头信息的方法，对应于HTTP的HEAD

requests.post() 向HTML网页提交POST请求的方法，对应于HTTP的POS

requests.put() 向HTML网页提交PUT请求的方法，对应于HTTP的PUT

requests.patch() 向HTML网页提交局部修改请求，对应于HTTP的PATCH

requests.delete() 向HTML页面提交删除请求，对应于HTTP的DELETE
requests.options(url, **kwargs)

3 requests库的两个重要对象：request和response（包含爬虫返回的内容）

response = requests.get(url)

构造一个向服务器请求资源的Request对象

返回一个包含服务器资源的Response对象

∙ url : 拟获取页面的url链接
∙ params : url中的额外参数，字典或字节流格式，可选
∙ **kwargs: 12个控制访问的参数

Response对象包含服务器返回的所有信息，也包含请求的Request信息

>>> import requests

>>> r=requests.get("http://www.baidu.com")

>>> print(r.status_code)

200

>>> type(r)

<class 'requests.models.Response'>

>>> r.headers

{'Server': 'bfe/1.0.8.18', 'Date': 'Fri, 17 Nov 2017 02:24:03 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:28 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'}

>>>

4 Response对象的属性

r.status_code HTTP 请求的返回状态，200表示连接成功，404表示失败
r.text HTTP 响应内容的字符串形式，即，url对应的页面内容
r.encoding 从HTTP header中猜测的响应内容编码方式
apparent_encoding 从内容中分析出的响应内容编码方式（备选编码方式）
r.content HTTP 响应内容的二进制形式

与安装小测比较

>>> r.encoding

'ISO-8859-1'

>>> r.apparent_encoding

'utf-8'

>>> r.encoding="utf-8"

>>> r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

r.encoding 从HTTP header中猜测的响应内容编码方式
r.apparent_encoding 从内容中分析出的响应内容编码方式（备选编码方式）

r.encoding：如果header中不存在charset，则认为编码为ISO‐8859‐1 r.text根据r.encoding显示网页内容

r.apparent_encoding：根据网页内容分析出的编码方式，可以看作是r.encoding的备选

5. requests异常

 异常 说明

 requests.ConnectionError 网络连接错误异常，如DNS查询失败、拒绝连接等

 requests.HTTPError          HTTP错误异常

 requests.URLRequired           URL缺失异常

 requests.TooManyRedirects     超过最大重定向次数，产生重定向异常

 requests.ConnectTimeout         连接远程服务器超时异常

 requests.Timeout             请求URL超时，产生超时异常

6.response异常

r.raise_for_status() 如果不是200，产生异常 requests.HTTPError

r.raise_for_status()在方法内部判断r.status_code是否等于200，不需要增加额外的if语句，该语句便于利用try‐except进行异常处理
7.HTTP，Hypertext Transfer Protocol，超文本传输协议
HTTP是一个基于“请求与响应”模式的、无状态的应用层协议
HTTP协议采用URL作为定位网络资源的标识，URL格式如下：
http://host[:port][path]
host: 合法的Internet主机域名或IP地址
port: 端口号，缺省端口为80
path: 请求资源的路径

HTTP URL实例：
http://www.baidu.com
http://192.168.179.130/duty
HTTP URL的理解：
URL是通过HTTP协议存取资源的Internet路径，一个URL对应一个数据资源

HTTP协议对资源的操作方法：

 GET 请求获取URL位置的资源

 HEAD 请求获取URL位置资源的响应消息报告，即获得该资源的头部信息

 POST 请求向URL位置的资源后附加新的数据

 PUT 请求向URL位置存储一个资源，覆盖原URL位置的资源

 PATCH 请求局部更新URL位置的资源，即改变该处资源的部分内容

 DELETE 请求删除URL位置存储的资源

patch与put

假设URL位置有一组数据UserInfo，包括UserID、 UserName等20个字段
需求：用户修改了UserName，其他不变
• 采用PATCH，仅向URL提交UserName的局部更新请求
• 采用PUT，必须将所有20个字段一并提交到URL，未提交字段被删除
PATCH的最主要好处：节省网络带宽

head

>>> r=requests.head("http://www.baidu.com")

>>> r.headers

{'Server': 'bfe/1.0.8.18', 'Date': 'Fri, 17 Nov 2017 02:51:22 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:50 GMT', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Content-Encoding': 'gzip'}

>>> r.text

''

参数

 **kwargs: 控制访问的参数，均为可选项

 params : 字典或字节序列，作为参数增加到url中

 data : 字典、字节序列或文件对象，作为Request的内容

 json : JSON格式的数据，作为Request的内容

 headers : 字典，HTTP定制头

 cookies : 字典或CookieJar，Request中的cookie

 auth : 元组，支持HTTP认证功能

  files : 字典类型，传输文件

 timeout : 设定超时时间，秒为单位

 proxies : 字典类型，设定访问代理服务器，可以增加登录认证

 allow_redirects : True/False，默认为True，重定向开关

 stream : True/False，默认为True，获取内容立即下载开关

 verify : True/False，默认为True，认证SSL证书开关

 cert : 本地SSL证书路径

Robots Exclusion Standard，网络爬虫排除标准
作用：
网站告知网络爬虫哪些页面可以抓取，哪些不行
形式：
在网站根目录下的robots.txt文件

网络爬虫：自动或人工识别robots.txt，再进行内容爬取

约束性：Robots协议是建议但非约束性，网络爬虫可以不遵守，但存在法律风险

安装：pip install beautifulsoup4

Beautiful Soup库，也叫beautifulsoup4 或 bs4
约定引用方式如下，即主要是用BeautifulSoup类

from bs4 import BeautifulSoup
import bs4

Beautiful Soup库解析器

解析器　　　　　　　　使用方法　　　　　　　　　　　　条件
bs4的HTML解析器　　BeautifulSoup(mk,'html.parser') 　　安装bs4库
lxml的HTML解析器　　BeautifulSoup(mk,'lxml')　　 pip install lxml
lxml的XML解析器　　 BeautifulSoup(mk,'xml') 　　pip install lxml
html5lib的解析器　　BeautifulSoup(mk,'html5lib') 　　pip install html5lib

BeautifulSoup类的基本元素

Tag 标签，最基本的信息组织单元，分别用<>和</>标明开头和结尾，任何存在于HTML语法中的标签都可以用soup.<tag>访问获得，当HTML文档中存在多个相同<tag>对应内容时，soup.<tag>返回第一个

Name 标签的名字，<p>…</p>的名字是'p'，格式：<tag>.name,每个<tag>都有自己的名字，通过<tag>.name获取，字符串类型

Attributes 标签的属性，字典形式组织，格式：<tag>.attrs,一个<tag>可以有0或多个属性，字典类型

NavigableString 标签内非属性字符串，<>…</>中字符串，格式：<tag>.string,NavigableString可以跨越多个层次

Comment 标签内字符串的注释部分，一种特殊的Comment类型,Comment是一种特殊类型

标签树的下行遍历

.contents 子节点的列表，将<tag>所有儿子节点存入列表

.children 子节点的迭代类型，与.contents类似，用于循环遍历儿子节点

.descendants 子孙节点的迭代类型，包含所有子孙节点，用于循环遍历

上行遍历

.parent 节点的父亲标签

.parents 节点先辈标签的迭代类型，用于循环遍历先辈节点

平行遍历

.next_sibling 返回按照HTML文本顺序的下一个平行节点标签

.previous_sibling 返回按照HTML文本顺序的上一个平行节点标签

.next_siblings 迭代类型，返回按照HTML文本顺序的后续所有平行节点标签

.previous_siblings 迭代类型，返回按照HTML文本顺序的前续所有平行节点标签

bs4库的prettify()方法
.prettify()为HTML文本<>及其内容增加更加'\n'
.prettify()可用于标签，方法：<tag>.prettify()

fiand_all()方法

<>.find_all(name, attrs, recursive, string, **kwargs)

∙ name : 对标签名称的检索字符串

∙ attrs: 对标签属性值的检索字符串，可标注属性检索

∙ recursive: 是否对子孙全部检索，默认True

∙ string: <>…</>中字符串区域的检索字符串
<tag>(..) 等价于 <tag>.find_all(..)
soup(..) 等价于 soup.find_all(..)

<>.find() 搜索且只返回一个结果，同.find_all()参数

<>.find_parents() 在先辈节点中搜索，返回列表类型，同.find_all()参数

<>.find_parent() 在先辈节点中返回一个结果，同.find()参数

<>.find_next_siblings() 在后续平行节点中搜索，返回列表类型，同.find_all()参数

<>.find_next_sibling() 在后续平行节点中返回一个结果，同.find()参数

<>.find_previous_siblings() 在前序平行节点中搜索，返回列表类型，同.find_all()参数

<>.find_previous_sibling() 在前序平行节点中返回一个结果，同.find()参数

requests和BeautifulSoup的更多相关文章

【安全】requests和BeautifulSoup小试牛刀
web安全的题,为了找key随手写的程序,无处安放,姑且贴上来. # -*- coding: UTF-8 -*- __author__ = 'weimw' import requests from B ...
$python爬虫系列（2）—— requests和BeautifulSoup库的基本用法
本文主要介绍python爬虫的两大利器:requests和BeautifulSoup库的基本用法. 1. 安装requests和BeautifulSoup库可以通过3种方式安装: easy_inst ...
python爬虫系列（2）—— requests和BeautifulSoup
本文主要介绍python爬虫的两大利器:requests和BeautifulSoup库的基本用法. 1. 安装requests和BeautifulSoup库可以通过3种方式安装: easy_inst ...
【网络爬虫入门01】应用Requests和BeautifulSoup联手打造的第一条网络爬虫
[网络爬虫入门01]应用Requests和BeautifulSoup联手打造的第一条网络爬虫广东职业技术学院欧浩源 2017-10-14 1.引言在数据量爆发式增长的大数据时代,网络与用户的沟 ...
基于Requests和BeautifulSoup实现“自动登录”
基于Requests和BeautifulSoup实现“自动登录”实例自动登录抽屉新热榜 #!/usr/bin/env python # -*- coding:utf-8 -*- import req ...
Python使用urllib,urllib3,requests库+beautifulsoup爬取网页
Python使用urllib/urllib3/requests库+beautifulsoup爬取网页 urllib urllib3 requests 笔者在爬取时遇到的问题 1.结果不全 2.'抓取失 ...
Python 爬虫实战（一）：使用 requests 和 BeautifulSoup
Python 基础我之前写的<Python 3 极简教程.pdf>,适合有点编程基础的快速入门,通过该系列文章学习,能够独立完成接口的编写,写写小东西没问题. requests requ ...
#1 爬虫：豆瓣图书TOP250 「requests、BeautifulSoup」
一.项目背景随着时代的发展,国人对于阅读的需求也是日益增长,既然要阅读,就要读好书,什么是好书呢?本项目选择以豆瓣图书网站为对象,统计其排行榜的前250本书籍. 二.项目介绍本项目使用Python ...
爬虫不过如此（python的Re 、Requests、BeautifulSoup 详细篇）
网络爬虫(又被称为网页蜘蛛,网络机器人,在FOAF社区中间,更经常的称为网页追逐者),是一种按照一定的规则,自动地抓取万维网信息的程序或者脚本. 爬虫的本质就是一段自动抓取互联网信息的程序,从网络获取 ...

随机推荐

ZOJ1315
代码先寄放这里 #include<cstdio> #include<cstdlib> #include<iostream> #include<cstring& ...
基于RTKLIB构建高并发通信测试工具
1. RTKLIB基础动态库生成 RTKLIB是全球导航卫星系统GNSS(global navigation satellite system)的标准&精密定位开源程序包,由日本东京海洋大学的 ...
Arrays.asList () 不可添加或删除元素的原因
Java中奖数组转换为List<T>容器有一个很方便的方法 Arrays.asList(T ... a),我通过此方法给容器进行了赋值操作,接着对其进行添加元素,却发现会抛出一个(jav ...
【学习】文本框输入监听事件oninput
真实项目中遇到的,需求是:一个文本框,一个按钮,当文本框输入内容时,按钮可用,当删除内容时,按钮不可用. 刚开始用的focus和blur, $(".pay-text").focus ...
【转载】基于VoiceOver的移动web站无障碍访问实战
文章转载自张鑫旭-鑫空间-鑫生活 http://www.zhangxinxu.com/wordpress/ 原文链接:http://www.zhangxinxu.com/wordpress/?p=5 ...
win10 UWP 九幽数据分析
九幽数据统计是统计和分析数据来源,用户使用,先申请账号 http://www.windows.sc 创建应用图片要72*72 记密钥在项目Nuget 在App.xaml.cs public App ...
eclipse环境下，java操作MySQL的简单演示
首先先通过power shell 进入MySQL 查看现在数据库的状态(博主是win10系统) 右键开始,选择Windows powershell ,输入MySQL -u用户名 -p密码选择数据库( ...
Linux入门(10)——Ubuntu16.04使用pip3和pip安装numpy,scipy,matplotlib等第三方库
安装Python3第三方库numpy,scipy,matplotlib: sudo apt install python3-pip pip3 install numpy pip3 install sc ...
（转）uml各类图
原文:http://www.cnblogs.com/way-peng/archive/2012/06/11/2544932.html 一.UML是什么?UML有什么用? 二.UML的历史三.UML的 ...
Bootstrap--下拉菜单.dropdown
下拉菜单.dropdown .dropdown <下拉菜单触发器button+下拉菜单ul> .dropdown 包裹层 .dropdown-toggle 下拉菜单触发器 data-to ...

requests和BeautifulSoup

requests和BeautifulSoup的更多相关文章

随机推荐

热门专题