Web scraping: notes on the requests and BeautifulSoup modules

The requests module

pip3 install requests

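import requests

res = requests.get('')

res.text  # response body decoded to a str

res.cookies.get_dict()  # cookies as a dict

res.content  # raw response body as bytes

res.encoding  # encoding used to decode res.text

res.apparent_encoding  # encoding guessed from the body content

res.status_code  # HTTP status code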

res = requests.post(

    headers={

        'User-Agent': ''

    },

    url="",

    cookies={},

    data={}

)

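Parameters of requests.request in detail

1. method: the HTTP method

2. url: the address to submit to

3. params: parameters passed in the URL query string, as in a GET submission

4. data: data sent in the body; a dict ({"k1": 'v1'}), a string or bytes ("k1=v1&k2=v2"), or a file object

5. json: data sent in the body; the data is serialized to a string with json.dumps() before it is sent. Use it when a dict contains nested dicts

6. headers: request headers

        Referer: the address of the page you visited previously

        User-Agent: the client

7. cookies: cookies, sent in the request header

8. files: file objects to upload

9. auth: adds the user name and password to the header for authentication (HTTP Basic auth Base64-encodes them; it does not encrypt them)

10. timeout: the timeout, either a single value or a (connect timeout, read timeout) tuple

11. allow_redirects: whether to follow redirects

12. proxies: proxies

        proxies={

            'http':'http://192.168.112:8080'

        }

13. verify: Boolean; whether to verify the HTTPS certificate (False skips verification)

14. stream: Boolean; stream the response and read it by iterating over response.iter_content()

15. cert: the client-side HTTPS certificate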
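The difference between params, data and json can be seen without sending anything, by preparing a request and looking at what would go on the wire. This is a minimal sketch; the URL is a placeholder and the variable names are made up for illustration.

```python
import json
import requests

URL = "https://example.com/search"  # placeholder address, not from the notes

# params end up in the URL query string.
get_req = requests.Request("GET", URL, params={"k1": "v1"}).prepare()
print(get_req.url)

# data (a dict) is form-encoded into the request body.
form_req = requests.Request("POST", URL, data={"k1": "v1", "k2": "v2"}).prepare()
print(form_req.body, form_req.headers["Content-Type"])

# json is serialized as JSON, which copes with nested dicts.
json_req = requests.Request("POST", URL, json={"outer": {"inner": 1}}).prepare()
print(json.loads(json_req.body), json_req.headers["Content-Type"])
```

Nothing is sent here; prepare() only builds the request that requests.request would send.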

requests.session

Tested in s8.py; a session keeps the client's earlier access information (such as cookies) across requests
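The s8.py script itself is not included in these notes, so the following is only a sketch of the idea under assumed names: headers and cookies stored on a Session are attached to every request prepared through it. The client name, cookie and URL are hypothetical.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "notes-crawler/0.1"})  # hypothetical client name
session.cookies.set("sessionid", "abc123")  # stands in for a cookie set by an earlier response

request = requests.Request("GET", "https://example.com/profile")  # placeholder URL
prepared = session.prepare_request(request)  # merges the session state into the request

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

In real use, session.get(...) and session.post(...) do this merging automatically and also store cookies returned by the server.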

The BeautifulSoup module

1. name: find a tag by its name, then read or change the name

tag = soup.find('a')  # find the first <a> tag

name = tag.name  # get the name, 'a'

tag.name = 'span'  # set the name

2. attrs: the tag's attributes

tag = soup.find('a')

attrs = tag.attrs  # get the attributes as a dict

print(attrs)

tag.attrs = {'ik': 123}  # assignment overwrites all existing attributes

tag.attrs['id'] = 'iiiii'  # add a new attribute

del tag.attrs['id']  # delete an attribute

print(soup)

3. children: all direct children, first level only (text nodes are included)

tag.children

4. descendants: all nodes below the tag, at every level

tag.descendants
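A small sketch of the difference between the two, using made-up sample markup. Both iterators also yield text nodes, so the example keeps only Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div id="box"><p>one <b>bold</b></p><a href="/x">link</a></div>'  # sample markup
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div")

# Direct children only.
child_names = [node.name for node in box.children if isinstance(node, Tag)]
# Every level below the tag, in document order.
descendant_names = [node.name for node in box.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```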

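5. Deleting

clear: removes all of the tag's children (the tag itself is kept)
decompose: recursively destroys the tag and everything inside it
extract: removes the tag and everything inside it from the tree and returns what was removed, like pop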
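The three methods side by side, as a sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">A<b>x</b></p><p id="b">B</p><p id="c">C</p></div>'  # sample markup
soup = BeautifulSoup(html, "html.parser")

soup.find(id="a").clear()              # empties the tag but keeps it
soup.find(id="b").decompose()          # destroys the tag and its contents
removed = soup.find(id="c").extract()  # detaches the tag and hands it back

print(soup)
print(removed)
```

Use extract when the removed tag is still needed, for example to insert it somewhere else.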

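6. Converting to strings and bytes

decode: converts the object to a string, including the current tag
decode_contents: converts to a string, excluding the current tag
encode: converts the object to bytes, including the current tag
encode_contents: converts to bytes, excluding the current tag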
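A quick sketch of all four calls on one made-up tag:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup("<p>hi <b>there</b></p>", "html.parser").p  # sample markup

as_text = tag.decode()                 # str, with the <p> wrapper
inner_text = tag.decode_contents()     # str, without the <p> wrapper
as_bytes = tag.encode()                # bytes (UTF-8 by default), with the wrapper
inner_bytes = tag.encode_contents()    # bytes, without the wrapper

print(as_text, inner_text, as_bytes, inner_bytes)
```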

7. Matching

find

tag = soup.find(

    name='a',  # tag name

    attrs={'class': 'sister'},  # attributes

    recursive=True,  # Boolean; True (the default) searches all descendants, False searches direct children only

    text='Lacie'  # match on the tag's text

)

tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
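The effect of recursive is easiest to see on a small tree. This sketch uses made-up markup and the keyword string, which is the newer name for the text argument.

```python
from bs4 import BeautifulSoup

html = ('<div id="root"><p><a class="sister" href="/1">Lacie</a></p>'
        '<a class="sister" href="/2">Tillie</a></div>')  # sample markup
soup = BeautifulSoup(html, "html.parser")
root = soup.find("div")

deep = root.find("a", class_="sister")                      # searches every level
shallow = root.find("a", class_="sister", recursive=False)  # direct children only
by_text = root.find("a", string="Tillie")                   # match on the tag's text

print(deep["href"], shallow["href"], by_text["href"])
```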

find_all

tags = soup.find_all(

    'a', limit=1  # limit=1 returns at most one match

)

--------------------

v = soup.find_all(name=['a','div'])  # match either tag name (OR)

v = soup.find_all(class_=['sister0', 'sister'])  # match either class value (OR)

v = soup.find_all(id=['link1','link2'])

v = soup.find_all(href=['link1','link2'])
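A runnable sketch of the list forms and of limit, on made-up markup:

```python
from bs4 import BeautifulSoup

html = ('<div class="x">d</div><a id="link1" class="sister">1</a>'
        '<a id="link2" class="sister0">2</a><p>p</p>')  # sample markup
soup = BeautifulSoup(html, "html.parser")

# A list means "any of these"; results come back in document order.
names = [t.name for t in soup.find_all(["a", "div"])]
ids = [t["id"] for t in soup.find_all(class_=["sister0", "sister"])]
first_only = soup.find_all("a", limit=1)

print(names, ids, first_only)
```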

8. Custom regular expressions

import re

rep = re.compile('^h')

v = soup.find_all(name=rep)

rep = re.compile('sister.*')

v = soup.find_all(class_=rep)

rep = re.compile('http://www.oldboy.com/static/.*')

v = soup.find_all(href=rep)
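The same three patterns applied to a small made-up document, as a sketch of what each one matches:

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head></head><body>'
        '<a class="sister1" href="http://www.oldboy.com/static/a.js">x</a>'
        '<a class="brother" href="/other">y</a></body></html>')  # sample markup
soup = BeautifulSoup(html, "html.parser")

by_name = [t.name for t in soup.find_all(name=re.compile("^h"))]
by_class = [t.get_text() for t in soup.find_all(class_=re.compile("sister.*"))]
by_href = [t.get_text() for t in soup.find_all(href=re.compile("http://www.oldboy.com/static/.*"))]

print(by_name, by_class, by_href)
```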

9. Filtering with a custom function

def func(tag):

    return tag.has_attr('class') and tag.has_attr('id')

v = soup.find_all(name=func)
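A sketch of the filter above on made-up markup. The function is called once per tag and keeps the tags for which it returns True; the function name here is chosen for illustration.

```python
from bs4 import BeautifulSoup

def has_class_and_id(tag):
    # Keep only tags that carry both attributes.
    return tag.has_attr("class") and tag.has_attr("id")

html = '<a class="c" id="i">both</a><a class="c">class only</a><a id="j">id only</a>'
soup = BeautifulSoup(html, "html.parser")

matches = [t.get_text() for t in soup.find_all(has_class_and_id)]
print(matches)
```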

10. has_attr: checks whether the tag has an attribute

11. text, get_text: get the text
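A short sketch of items 10 and 11 together, on made-up markup. get_text accepts a separator and strip=True, which is often handier than the raw text attribute.

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<p class="intro"> Hello <b>world</b> </p>', "html.parser").p

has_class = tag.has_attr("class")
has_id = tag.has_attr("id")
raw_text = tag.text                         # all text, whitespace untouched
clean_text = tag.get_text(" ", strip=True)  # pieces stripped, joined with a space

print(has_class, has_id, repr(raw_text), repr(clean_text))
```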

12. index: get a tag's index position within its parent tag

tag = soup.find('body')

v = tag.index(tag.find('div'))

tag = soup.find('body')

for i, v in enumerate(tag):

    print(i, v)
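A self-contained version of the two snippets above, on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<body><p>a</p><div>b</div><span>c</span></body>"  # sample markup
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

position = body.index(body.find("div"))  # position among body's children
listing = [(i, child.name) for i, child in enumerate(body)]

print(position, listing)
```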

13. is_empty_element: checks whether the tag is an empty or self-closing tag

That is, whether it is one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

tag = soup.find('br')

v = tag.is_empty_element
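A sketch on made-up markup. An ordinary tag with no content is not reported as an empty element; only void tags such as br are.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p></p><br/>", "html.parser")  # sample markup

br_is_empty = soup.find("br").is_empty_element
p_is_empty = soup.find("p").is_empty_element

print(br_is_empty, p_is_empty)
```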

14. Related nodes of the current tag (attributes)

soup.next  # the next node

soup.next_element

soup.next_elements

soup.next_sibling

soup.next_siblings

tag.previous  # the previous node

tag.previous_element

tag.previous_elements

tag.previous_sibling

tag.previous_siblings

tag.parent  # the parent tag

tag.parents  # all ancestor tags, level by level
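A sketch of the most common navigation attributes on made-up markup. Note that next_element is the next node in parse order, which for a tag with text is its own text node, while next_sibling stays on the same level.

```python
from bs4 import BeautifulSoup

html = "<div><a>first</a><b>second</b><i>third</i></div>"  # sample markup
soup = BeautifulSoup(html, "html.parser")
b = soup.find("b")

next_sibling = b.next_sibling.name          # same level, after <b>
previous_sibling = b.previous_sibling.name  # same level, before <b>
next_element = b.next_element               # the text node inside <b>
parent = b.parent.name
ancestors = [p.name for p in b.parents]     # up to the document root
later_siblings = [s.name for s in soup.find("a").next_siblings]

print(next_sibling, previous_sibling, next_element, parent, ancestors, later_siblings)
```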

15. Related nodes of the current tag (methods)

tag.find_next(...)

tag.find_all_next(...)

tag.find_next_sibling(...)

tag.find_next_siblings(...)

tag.find_previous(...)

tag.find_all_previous(...)

tag.find_previous_sibling(...)

tag.find_previous_siblings(...)

tag.find_parent(...)

tag.find_parents(...)

The parameters are the same as for find_all
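A sketch of a few of these methods on made-up markup; each takes the same filters as find_all but searches in one direction from the current tag.

```python
from bs4 import BeautifulSoup

html = "<div><a>first</a><b>second</b><i>third</i></div>"  # sample markup
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")
i = soup.find("i")

next_i = a.find_next_sibling("i").get_text()
following = [t.name for t in a.find_next_siblings(["b", "i"])]
after_a = [t.name for t in a.find_all_next(["b", "i"])]
before_i = i.find_previous("a").get_text()
container = i.find_parent("div").name

print(next_i, following, after_a, before_i, container)
```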

16. CSS selectors, used the same way as selectors in JS

soup.select("title")

soup.select("p:nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')
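To check what a selector returns, it helps to run it against a small document. This sketch uses made-up markup with three links and a helper, ids, that is only for illustration.

```python
from bs4 import BeautifulSoup

html = ('<p class="story">'
        '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, '
        '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and '
        '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

def ids(selector):
    # Hypothetical helper: the id of every tag the selector matches.
    return [tag["id"] for tag in soup.select(selector)]

second_link = ids("p > a:nth-of-type(2)")
later_siblings = ids("#link1 ~ .sister")   # all following siblings
adjacent = ids("#link1 + .sister")         # only the immediately following sibling
ends_with = ids('a[href$="tillie"]')
contains = ids('a[href*=".com/el"]')

print(second_link, later_siblings, adjacent, ends_with, contains)
```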

from bs4.element import Tag

# Note: _candidate_generator is a private argument of select() in older bs4 releases;

# in releases that delegate select() to soupsieve it may be ignored or rejected.

def default_candidate_generator(tag):

    for child in tag.descendants:

        if not isinstance(child, Tag):

            continue

        if not child.has_attr('href'):

            continue

        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)

print(type(tags), tags)

# The same call, stopping after the first match

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)

print(type(tags), tags)

17. Tag content

tag = soup.find('span')

print(tag.string)          # get

tag.string = 'new content' # set

print(soup)
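One thing worth knowing about string, shown here as a sketch on made-up markup: it only returns a value when the tag has a single child, otherwise it is None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<span>old</span><div>a<b>b</b></div>", "html.parser")  # sample markup
span = soup.find("span")

before = span.string
span.string = "new content"        # replaces whatever the tag contained
after = str(span)
mixed = soup.find("div").string    # the div has two children, so there is no single string

print(before, after, mixed)
```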

18. append: appends a tag object at the end of the current tag's contents

tag = soup.find('body')

tag.append(soup.find('a'))

print(soup)

from bs4.element import Tag

obj = Tag(name='i',attrs={'id': 'it'})

obj.string = 'I am new here'

tag = soup.find('body')

tag.append(obj)

print(soup)
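A self-contained sketch of both cases on made-up markup. It creates the new tag with soup.new_tag, an alternative to constructing Tag directly, and shows that appending a tag that is already in the tree moves it instead of copying it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><a>link</a><p>text</p></body>", "html.parser")  # sample markup
body = soup.find("body")

new_tag = soup.new_tag("i", id="it")
new_tag.string = "I am new here"
body.append(new_tag)          # a brand-new tag goes to the end

body.append(soup.find("a"))   # an existing tag is moved to the end

result = str(soup)
print(result)
```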

19. insert: inserts a tag object at a given position inside the current tag

from bs4.element import Tag

obj = Tag(name='i', attrs={'id': 'it'})

obj.string = 'I am new here'

tag = soup.find('body')

tag.insert(2, obj)

print(soup)

20. insert_after, insert_before: insert a tag object after or before the given tag
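The notes give no example for this item, so here is a minimal sketch on made-up markup; the tag names are arbitrary.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><b>middle</b></div>", "html.parser")  # sample markup
b = soup.find("b")

first = soup.new_tag("i")
first.string = "before"
b.insert_before(first)   # becomes the previous sibling of <b>

last = soup.new_tag("u")
last.string = "after"
b.insert_after(last)     # becomes the next sibling of <b>

result = str(soup)
print(result)
```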

21. replace_with: replaces the current tag with the given tag object
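A minimal sketch for this item on made-up markup. replace_with returns the tag that was taken out, so it can be reused.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>old</b> text</p>", "html.parser")  # sample markup

replacement = soup.new_tag("em")
replacement.string = "new"
old = soup.find("b").replace_with(replacement)

result = str(soup)
print(result, old)
```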

22. setup: sets the relationships between tags; it does not change positions in the HTML, only the logical relationship attributes

tag = soup.find('div')

a = soup.find('a')

tag.setup(previous_sibling=a)

print(tag.previous_sibling)

23. wrap: wraps the current tag in the given tag

div = soup.find('div')

a = soup.find('a')

div.wrap(a)  # wrap the div inside the a tag
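A more common pattern is to wrap a tag in a freshly created one. This sketch uses made-up markup; wrap returns the wrapper.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/x">link</a></p>', "html.parser")  # sample markup

wrapper = soup.find("a").wrap(soup.new_tag("div"))  # the <a> now sits inside a new <div>

result = str(soup)
print(result, wrapper.name)
```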

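24. unwrap: removes the current tag but keeps the content it wraps

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>text<a>test</a></div>', 'html.parser')

tag = soup.find('div')

v = tag.unwrap()  # v is the removed, now empty <div>

print(soup)  # text<a>test</a>: the outer tag is gone, its contents are kept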
