requests and BeautifulSoup

I. The Requests library

Requests is an elegant and simple HTTP library for Python, built for human beings.

1. Installation

pip install requests

Quick installation test

>>> import requests

>>> r=requests.get("http://www.baidu.com")

>>> print(r.status_code)

200

>>> r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

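2. The eight methods of the requests library

requests.request() builds and sends a request; the base method that the methods below are built on

requests.get() the main method for fetching an HTML page; corresponds to HTTP GET

requests.head() fetches only the header information of an HTML page; corresponds to HTTP HEAD

requests.post() submits a POST request to an HTML page; corresponds to HTTP POST

requests.put() submits a PUT request to an HTML page; corresponds to HTTP PUT

requests.patch() submits a partial-modification request to an HTML page; corresponds to HTTP PATCH

requests.delete() submits a delete request to an HTML page; corresponds to HTTP DELETE

requests.options(url, **kwargs) sends an OPTIONS request to the URL; corresponds to HTTP OPTIONS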

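3. Two important objects in the requests library: Request and Response (the Response holds what the crawler gets back)

response = requests.get(url)

The call constructs a Request object that asks the server for a resource

and returns a Response object that contains the server's resource.

∙ url: the URL of the page to fetch
∙ params: extra parameters appended to the URL, as a dict or byte stream; optional
∙ **kwargs: 12 parameters that control the request

The Response object contains everything the server returned, and also the information about the Request that produced it.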

>>> import requests

>>> r=requests.get("http://www.baidu.com")

>>> print(r.status_code)

200

>>> type(r)

<class 'requests.models.Response'>

>>> r.headers

{'Server': 'bfe/1.0.8.18', 'Date': 'Fri, 17 Nov 2017 02:24:03 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:28 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'}

>>>

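4. Attributes of the Response object

r.status_code the HTTP status of the response; 200 means the request succeeded, 404 means the resource was not found
r.text the response body as a string, that is, the content of the page at the URL
r.encoding the encoding of the response body, guessed from the HTTP headers
r.apparent_encoding the encoding of the response body, detected from the content itself (a fallback encoding)
r.content the response body in binary (bytes) form

Compare with the quick installation test above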

>>> r.encoding

'ISO-8859-1'

>>> r.apparent_encoding

'utf-8'

>>> r.encoding="utf-8"

>>> r.text

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登录</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1. r.text decodes the page content according to r.encoding.

r.apparent_encoding: the encoding detected by analyzing the page content; it can be treated as a fallback for r.encoding.
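
The garbled title in the first r.text output can be reproduced without any network access. The sketch below only illustrates the decoding step, not how requests works internally: it encodes the page title as UTF-8 bytes (an assumed stand-in for r.content) and decodes those bytes once with ISO-8859-1 and once with UTF-8.

```python
# Stand-in for r.content: the page title as UTF-8 bytes (an assumption for illustration).
raw = "百度一下，你就知道".encode("utf-8")

# What r.text shows while r.encoding is 'ISO-8859-1': every byte becomes one character.
garbled = raw.decode("iso-8859-1")

# What r.text shows after setting r.encoding = 'utf-8'.
fixed = raw.decode("utf-8")

# ascii() escapes the non-ASCII characters so the byte values are visible.
print(ascii(garbled[:3]))
print(len(garbled), len(fixed))
```

The bytes are identical in both cases; only the codec used to interpret them changes, which is why assigning r.encoding is enough to repair r.text.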

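5. requests exceptions

 Exception                   Description

 requests.ConnectionError    network connection error, e.g. DNS lookup failure or connection refused

 requests.HTTPError          HTTP error

 requests.URLRequired        a URL is missing

 requests.TooManyRedirects   the maximum number of redirects was exceeded

 requests.ConnectTimeout     timed out while connecting to the remote server

 requests.Timeout            the request to the URL timed out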
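
Because requests.ConnectTimeout is a subclass of both requests.ConnectionError and requests.Timeout, the order in which the exceptions are checked matters. A minimal sketch, assuming a hypothetical helper describe_failure() that is not part of requests:

```python
import requests

def describe_failure(exc):
    """Map a requests exception to a short label, most specific class first.

    describe_failure is a hypothetical helper for illustration only.
    """
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, requests.HTTPError):
        return "http error"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    return "other request error"

print(describe_failure(requests.ConnectTimeout()))
print(describe_failure(requests.HTTPError()))
```

If the ConnectionError or Timeout check came first, a ConnectTimeout would be reported under the more general label.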

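6. Response exceptions

r.raise_for_status() raises requests.HTTPError if the status code indicates an error (4xx or 5xx)

r.raise_for_status() checks r.status_code internally, so no extra if statement is needed, which makes it convenient to handle errors with try-except.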
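
The try-except pattern described above can be sketched as a small fetch function. The name get_html_text and the 10-second default timeout are assumptions made for this example, not something requests provides.

```python
import requests

def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails.

    get_html_text is a hypothetical helper; the timeout value is arbitrary.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()               # raises requests.HTTPError for 4xx/5xx
        r.encoding = r.apparent_encoding   # decode with the encoding detected from the body
        return r.text
    except requests.RequestException as exc:
        # RequestException is the base class of the exceptions listed above.
        print(f"request failed: {exc}")
        return None

# A URL without a scheme fails before any network traffic and yields None.
print(get_html_text("not-a-url"))
```

Returning None keeps the caller simple; a real crawler might prefer to log the error or re-raise it.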
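
7. HTTP (Hypertext Transfer Protocol)
HTTP is a stateless application-layer protocol based on a request-response model.
HTTP uses URLs to identify and locate network resources. The URL format is:
http://host[:port][path]
host: a valid Internet host name or IP address
port: the port number; the default is 80
path: the path of the requested resource

Example HTTP URLs:
http://www.baidu.com
http://192.168.179.130/duty
How to read an HTTP URL:
A URL is the Internet path used to access a resource over HTTP; each URL corresponds to one data resource.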
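
The host, port, and path parts of the two example URLs can be pulled apart with the standard library. This is a small sketch; describe_url is a hypothetical helper, and the fallback to port 80 and path "/" is this example's way of filling in the defaults.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split an http URL into host, port, and path (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 80,    # no explicit port: use the HTTP default
        "path": parts.path or "/",   # no explicit path: the root resource
    }

print(describe_url("http://www.baidu.com"))
print(describe_url("http://192.168.179.130/duty"))
```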

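HTTP methods for operating on resources:

 GET request the resource at the URL

 HEAD request the response headers for the resource at the URL, i.e. only its header information

 POST append new data to the resource at the URL

 PUT store a resource at the URL, overwriting whatever is already there

 PATCH partially update the resource at the URL, i.e. change part of its content

 DELETE delete the resource stored at the URL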

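PATCH vs. PUT

Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.
Requirement: the user changes UserName and nothing else.
• With PATCH, only the partial update for UserName is submitted to the URL.
• With PUT, all 20 fields must be submitted to the URL; any field left out is deleted.
The main benefit of PATCH: it saves network bandwidth.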
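
The difference can be simulated with plain dictionaries. The sketch below models how a server might apply each method; apply_put, apply_patch, and the Email field are invented for illustration, and the record is cut down to three fields.

```python
def apply_put(stored, submitted):
    """PUT: the submitted representation replaces the stored resource entirely."""
    return dict(submitted)

def apply_patch(stored, submitted):
    """PATCH: only the submitted fields change; the others are kept."""
    updated = dict(stored)
    updated.update(submitted)
    return updated

# Hypothetical UserInfo record, shortened to three fields.
user_info = {"UserID": 1, "UserName": "old_name", "Email": "user@example.com"}
change = {"UserName": "new_name"}

after_patch = apply_patch(user_info, change)
after_put = apply_put(user_info, change)

print(after_patch)   # all three fields, with the new UserName
print(after_put)     # only UserName survives
```

With PUT, the client would have to resend UserID and Email as well to avoid losing them, which is where the extra bandwidth goes.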

head

>>> r=requests.head("http://www.baidu.com")

>>> r.headers

{'Server': 'bfe/1.0.8.18', 'Date': 'Fri, 17 Nov 2017 02:51:22 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:50 GMT', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Content-Encoding': 'gzip'}

>>> r.text

''

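Parameters

 **kwargs: parameters that control the request; all are optional

 params : dict or byte sequence, added to the URL as query parameters

 data : dict, byte sequence, or file object, sent as the body of the Request

 json : JSON-serializable data, sent as the body of the Request

 headers : dict of custom HTTP headers

 cookies : dict or CookieJar, the cookies sent with the Request

 auth : tuple, enables HTTP authentication

 files : dict, for uploading files

 timeout : timeout in seconds

 proxies : dict, sets the proxy servers to use; login credentials can be included

 allow_redirects : True/False, default True; redirect switch

 stream : True/False, default False; when False the response body is downloaded immediately

 verify : True/False, default True; SSL certificate verification switch

 cert : path to a local SSL certificate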
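
To see what params and headers do without contacting a server, a request can be built and prepared but not sent. This is a sketch under assumptions: the search path /s, the wd query key, and the User-Agent value are example inputs chosen here.

```python
import requests

# Build the request without sending it, so the effect of each argument can be inspected.
req = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "Python"},                 # appended to the URL as a query string
    headers={"User-Agent": "Mozilla/5.0"},   # custom HTTP header
)
prepared = req.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])

# Sending would look like this (needs network access, so it is left commented out):
# with requests.Session() as s:
#     r = s.send(prepared, timeout=10, allow_redirects=True)
```

Arguments such as timeout, proxies, stream, verify, and cert apply when the request is sent, which is why they appear on send() rather than on the Request object.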

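Robots Exclusion Standard

Purpose:
A website tells web crawlers which pages may be crawled and which may not.
Form:
A robots.txt file in the root directory of the website.

Web crawlers: read robots.txt, automatically or by manual inspection, before crawling any content.

Binding force: the Robots protocol is advisory rather than binding. A crawler can ignore it, but doing so carries legal risk.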
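
Automatic checking of robots.txt can be sketched with the standard library's urllib.robotparser. The rules, the crawler name MyCrawler, and the example.com URLs below are made up for the example; a real crawler would load the site's own file with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of fetched.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "http://example.com/private/data.html")

print(allowed, blocked)
```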

Installation: pip install beautifulsoup4

The Beautiful Soup library is also called beautifulsoup4 or bs4.
By convention it is imported as follows; the BeautifulSoup class is what is mainly used.

from bs4 import BeautifulSoup
import bs4

Beautiful Soup parsers

Parser                   Usage                               Requirement
bs4's HTML parser        BeautifulSoup(mk, 'html.parser')    install the bs4 library
lxml's HTML parser       BeautifulSoup(mk, 'lxml')           pip install lxml
lxml's XML parser        BeautifulSoup(mk, 'xml')            pip install lxml
html5lib's parser        BeautifulSoup(mk, 'html5lib')       pip install html5lib
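
A minimal sketch of the first row of the table, assuming a small hand-written document in mk. It uses 'html.parser' because that parser needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup

# mk stands for "markup"; this document is invented for the example.
mk = "<html><head><title>demo</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(mk, "html.parser")

print(soup.title.string)
print(soup.p.string)
```

Switching to another parser only changes the second argument, provided the corresponding package is installed.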

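Basic elements of the BeautifulSoup class

Tag: a tag, the most basic unit of information, opened with <> and closed with </>. Any tag that exists in HTML syntax can be accessed as soup.<tag>; when the document contains several <tag> elements, soup.<tag> returns the first one.

Name: the name of the tag; the name of <p>…</p> is 'p'. Format: <tag>.name. Every <tag> has a name, obtained with <tag>.name, of string type.

Attributes: the attributes of the tag, organized as a dictionary. Format: <tag>.attrs. A <tag> can have zero or more attributes; the type is dict.

NavigableString: the non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string. A NavigableString can span multiple levels of nested tags.

Comment: the comment portion of a string inside a tag; a special type.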
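
The five elements can be seen on a small document. The HTML below is invented for the example; note that .string reaches through the nested <b> tag, and that a comment comes back as a Comment object.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = '<p class="title"><b>The demo page</b></p><p><!--a comment--></p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                          # Tag: the first <p> in the document
print(first_p.name)                       # Name
print(first_p.attrs)                      # Attributes, as a dict
print(first_p.string)                     # NavigableString, found through <b>

comment = soup.find_all("p")[1].string    # the second <p> holds only a comment
print(type(comment).__name__)
```

Because a Comment prints like ordinary text, checking its type is the reliable way to tell it apart from a regular string.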

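Downward traversal of the tag tree

.contents list of child nodes; stores all children of <tag> in a list

.children iterator over child nodes; like .contents, used to loop over the children

.descendants iterator over all descendant nodes, used for looping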
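
A short sketch of the three downward attributes on an invented document. The HTML is written without whitespace between tags so that no newline strings appear among the children.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

children_list = div.contents                     # a list of the direct children
child_names = [c.name for c in div.children]     # the same nodes, via an iterator
descendant_count = len(list(div.descendants))    # tags and the strings inside them

print(len(children_list), child_names, descendant_count)
```

.descendants yields more items than .children because it also visits the string inside each <p>.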

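Upward traversal

.parent the parent tag of the node

.parents iterator over the node's ancestor tags, used to loop over ancestors

Sibling traversal

.next_sibling the next sibling node in HTML document order

.previous_sibling the previous sibling node in HTML document order

.next_siblings iterator over all following sibling nodes in HTML document order

.previous_siblings iterator over all preceding sibling nodes in HTML document order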
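
Upward and sibling traversal can be sketched on an invented three-paragraph document. Siblings are nodes that share the same parent, and in real pages they are often newline strings rather than tags.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.div.contents[0]
second = soup.div.contents[1]

parent_name = second.parent.name                      # the enclosing tag
ancestor_names = [t.name for t in second.parents]     # parent first, then further up

next_text = second.next_sibling.string
previous_text = second.previous_sibling.string
following = [s.string for s in first.next_siblings]   # everything after the first <p>

print(parent_name, ancestor_names)
print(previous_text, next_text, following)
```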

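The prettify() method of the bs4 library
.prettify() adds '\n' line breaks after the tags and their content in the HTML text
.prettify() can also be called on a single tag: <tag>.prettify()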
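
A minimal sketch of both uses, on an invented one-line document. The exact indentation may vary between bs4 versions, so the example prints the result instead of relying on a fixed layout.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

pretty_doc = soup.prettify()      # the whole document, one tag or string per line
pretty_tag = soup.p.prettify()    # only the first <p>

print(pretty_doc)
print(pretty_tag)
```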

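The find_all() method

<>.find_all(name, attrs, recursive, string, **kwargs)

∙ name: search string for the tag name

∙ attrs: search string for tag attribute values; a specific attribute can be named

∙ recursive: whether to search all descendants; default True

∙ string: search string for the string region between <> and </>
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)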
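
Each find_all() parameter can be tried on a small document. The HTML below is a cut-down fragment loosely modeled on the Baidu page shown earlier; the link texts are invented for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a href="http://news.baidu.com" class="mnav">news</a>'
    '<a href="http://map.baidu.com" class="mnav">map</a>'
    '<p id="lh"><a href="http://home.baidu.com">about</a></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                               # name
nav_links = soup.find_all("a", attrs={"class": "mnav"})      # attrs
by_id = soup.find_all(id="lh")                               # attribute given as a keyword
direct_links = soup.div.find_all("a", recursive=False)       # children only
texts = soup.find_all(string="map")                          # string
shorthand = soup("a")                                        # same as soup.find_all("a")

print(len(all_links), len(nav_links), len(direct_links))
print([a["href"] for a in nav_links])
```

With recursive=False the link nested inside <p> is skipped, because only the direct children of <div> are examined.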

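<>.find() searches and returns only one result; same parameters as .find_all()

<>.find_parents() searches the ancestor nodes and returns a list; same parameters as .find_all()

<>.find_parent() returns one result from the ancestor nodes; same parameters as .find()

<>.find_next_siblings() searches the following sibling nodes and returns a list; same parameters as .find_all()

<>.find_next_sibling() returns one result from the following sibling nodes; same parameters as .find()

<>.find_previous_siblings() searches the preceding sibling nodes and returns a list; same parameters as .find_all()

<>.find_previous_sibling() returns one result from the preceding sibling nodes; same parameters as .find()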
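
The related search methods can be sketched on an invented fragment with two links followed by a paragraph that contains a third link.

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a href="http://news.baidu.com" class="mnav">news</a>'
    '<a href="http://map.baidu.com" class="mnav">map</a>'
    '<p id="lh"><a href="http://home.baidu.com">about</a></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                          # first match only
p = soup.find("p")
about = p.find("a")

next_link = first.find_next_sibling("a")        # one following sibling
later = first.find_next_siblings()              # all following siblings
earlier = p.find_previous_siblings("a")         # preceding siblings, nearest first
nearest = p.find_previous_sibling()             # one preceding sibling
enclosing = about.find_parent("div")            # one ancestor
enclosing_all = about.find_parents("div")       # all matching ancestors

print(first.string, next_link.string, nearest.string)
print([t.name for t in later], [a.string for a in earlier])
```

The single-result methods return None when nothing matches, while the list-returning ones return an empty list.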
