Web Scraping with BeautifulSoup

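BeautifulSoup is a module that takes an HTML or XML string and parses it into a tree of objects. Its methods can then be used to locate specific elements quickly, which makes searching HTML or XML documents simple.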

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")

# Find the first <a> tag
tag1 = soup.find(name='a')

# Find all <a> tags
tag2 = soup.find_all(name='a')

# Find the tag with id="link2" (select returns a list)
tag3 = soup.select('#link2')

The listing above is a simple example. The sections below cover individual attributes and methods.

1. name: the tag name

tag = soup.find('a')
name = tag.name      # get
print(name)
tag.name = 'span'    # set
print(soup)

2. attrs: the tag's attributes

tag = soup.find('a')
attrs = tag.attrs            # get
print(attrs)
tag.attrs = {'ik': 123}      # set: replace the whole attribute dict
tag.attrs['id'] = 'iiiii'    # set: add or change a single attribute
print(soup)
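As a small illustration of reading and writing attributes, the sketch below uses a hypothetical one-tag document and the built-in "html.parser" (chosen here only to avoid the lxml dependency; both are assumptions, not part of the original example). It shows that `class` comes back as a list, while `id` is a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document; html.parser avoids the lxml dependency.
soup = BeautifulSoup('<a class="sister big" id="link1" href="/x">Elsie</a>', "html.parser")
tag = soup.find("a")

classes = tag["class"]        # multi-valued attribute: a list of strings
tag_id = tag["id"]            # single-valued attribute: a plain string
missing = tag.get("title")    # get() returns None instead of raising KeyError

tag["id"] = "renamed"         # change one attribute
del tag["href"]               # remove one attribute
result = str(tag)
print(classes, tag_id, missing, result)
```

Indexing with `tag["name"]` raises `KeyError` for a missing attribute, so `tag.get("name")` is the safer choice when the attribute may be absent.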

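3. children: all direct children

body = soup.find('body')
v = body.children       # iterator over the direct children (tags and text nodes)

4. descendants: all descendants, at every depth

body = soup.find('body')
v = body.descendants    # generator over children, grandchildren, and so on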
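The difference between `children` and `descendants` is easier to see on a tiny tree. The following is a minimal sketch with made-up markup, again parsed with "html.parser" as an assumption; it filters out text nodes so that only tag names are compared.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <b> is a grandchild of <div>, not a child.
soup = BeautifulSoup("<div><p>one <b>two</b></p><span>three</span></div>", "html.parser")
div = soup.find("div")

# children: only the nodes directly under <div>
child_names = [c.name for c in div.children if isinstance(c, Tag)]

# descendants: every nested node, in document order
descendant_names = [d.name for d in div.descendants if isinstance(d, Tag)]

# Both iterators also yield text nodes (NavigableString objects).
descendant_count = len(list(div.descendants))
print(child_names, descendant_names, descendant_count)
```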

5. clear: remove all of a tag's children (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

6. decompose: remove the tag and everything inside it from the tree, and destroy it

body = soup.find('body')
body.decompose()
print(soup)

7. extract: remove the tag and everything inside it from the tree, and return the removed tag

body = soup.find('body')
v = body.extract()
print(soup)
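The three removal methods above are easy to confuse, so here is a side-by-side sketch. The helper `fresh()` and the markup are hypothetical, and "html.parser" is an assumed parser choice; each method runs on its own copy of the document.

```python
from bs4 import BeautifulSoup

def fresh():
    """Return a new copy of a small, made-up document."""
    return BeautifulSoup('<div id="box"><p>hi</p></div>', "html.parser")

# clear(): the <div> stays, its contents are removed
s1 = fresh()
s1.find("div").clear()
after_clear = str(s1)

# decompose(): the <div> and its contents are removed and destroyed
s2 = fresh()
s2.find("div").decompose()
after_decompose = str(s2)

# extract(): the <div> is removed from the tree but returned for reuse
s3 = fresh()
removed = s3.find("div").extract()
after_extract = str(s3)
print(repr(after_clear), repr(after_decompose), repr(after_extract), removed)
```

A reasonable rule of thumb: use `extract()` when the removed subtree is still needed, and `decompose()` when it is not.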

8. decode: convert to a string (including the current tag); decode_contents: the same, without the current tag

body = soup.find('body')
v = body.decode()
v = body.decode_contents()
print(v)

9. encode: convert to bytes (including the current tag); encode_contents: the same, without the current tag

body = soup.find('body')
v = body.encode()
v = body.encode_contents()
print(v)
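To make the string/bytes distinction concrete, the sketch below serializes one hypothetical tag all four ways. It assumes "html.parser" and relies on the default UTF-8 output encoding of `encode()` and `encode_contents()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one non-ASCII character to make the encoding visible.
soup = BeautifulSoup('<p class="x">café <b>bold</b></p>', "html.parser")
p = soup.find("p")

as_text = p.decode()                 # str, includes the <p> tag itself
inner_text = p.decode_contents()     # str, only what is inside <p>
as_bytes = p.encode()                # bytes (UTF-8 by default), includes <p>
inner_bytes = p.encode_contents()    # bytes, only what is inside <p>
print(as_text, inner_text, as_bytes, inner_bytes, sep="\n")
```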

10. find: get the first matching tag

tag = soup.find('a')
print(tag)
# text= matches on the tag's text; newer bs4 releases also accept it under the name string=
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tag)

11. find_all: get all matching tags

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)
print(tags)

tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)

# ####### Lists: match any value in the list #######
v = soup.find_all(name=['a', 'div'])
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(text=['Tillie'])
print(v, type(v[0]))    # the matches are text nodes (NavigableString), not tags

v = soup.find_all(id=['link1', 'link2'])
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)    # empty for html_doc: no href has either of these values

# ####### Regular expressions #######
import re
# rep = re.compile('p')     # tag names that contain 'p'
rep = re.compile('^p')      # tag names that start with 'p'
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)    # empty for html_doc: its links point to example.com

# ####### Filter functions #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')
v = soup.find_all(name=func)
print(v)

# ## get: read an attribute of a tag
tag = soup.find('a')
v = tag.get('id')
print(v)
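The filter types listed above (list, regular expression, function, `limit`) can be compared on one short document. This is a sketch under stated assumptions: the markup is a trimmed, hypothetical variant of html_doc, the parser is "html.parser", and `string=` is used for the text filter, which requires a bs4 release that supports that argument name.

```python
import re
from bs4 import BeautifulSoup

# Trimmed, hypothetical variant of the three-sisters markup.
html = (
    '<div><a class="sister0" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '<p class="story">...</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Collect the id attribute of each tag, for compact comparison."""
    return [t["id"] for t in tags]

by_list = ids(soup.find_all(id=["link1", "link2"]))             # any value in the list
by_regex = ids(soup.find_all(class_=re.compile(r"^sister$")))   # regex tested per class
by_func = ids(soup.find_all(lambda t: t.has_attr("href") and t.has_attr("id")))
limited = ids(soup.find_all("a", limit=1))                      # stop after one match
texts = soup.find_all(string="Tillie")                          # text nodes, not tags
nothing = soup.find("table")                                    # find() gives None on no match
print(by_list, by_regex, by_func, limited, texts, nothing)
```

Because `find()` returns `None` when nothing matches, chaining calls such as `soup.find('table').find('tr')` needs a check in between.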

12. has_attr: check whether the tag has a given attribute

tag = soup.find('a')
v = tag.has_attr('id')
print(v)

13. get_text: get the text inside a tag

tag = soup.find('a')
v = tag.get_text()    # the optional first argument is a separator placed between text fragments, not an attribute name
print(v)
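The first positional argument of `get_text` is a separator, which is easy to mistake for an attribute name. A minimal sketch, using the first link from html_doc and one extra made-up paragraph, parsed with "html.parser" as an assumption:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1">Els<span>f</span>ie</a><p> a <b> b </b></p>', "html.parser")
tag = soup.find("a")

plain = tag.get_text()           # fragments joined with no separator
piped = tag.get_text("|")        # "|" is inserted between text fragments
odd = tag.get_text("id")         # "id" is treated as a separator too

# strip=True trims each fragment and drops the empty ones before joining
clean = soup.find("p").get_text(" ", strip=True)
print(plain, piped, odd, clean)
```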

14. index: the position of a child within its parent tag

tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

tag = soup.find('body')
for i, v in enumerate(tag):
    print(i, v)
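Text nodes count as children too, which affects the value `index` returns. The sketch below uses hypothetical markup and "html.parser" (an assumption) to show this.

```python
from bs4 import BeautifulSoup

# Hypothetical body: a text node comes first, then two tags.
soup = BeautifulSoup("<body>text<div>d</div><p>p</p></body>", "html.parser")
body = soup.find("body")

div_pos = body.index(body.find("div"))   # the text node "text" occupies position 0
p_pos = body.index(body.find("p"))
kinds = [type(child).__name__ for child in body]
print(div_pos, p_pos, kinds)
```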

15. is_empty_element: whether the tag is an empty element (one that is allowed to be empty) or a self-closing tag,

that is, whether it is one of the following tags: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

tag = soup.find('br')
v = tag.is_empty_element
print(v)
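A quick check of `is_empty_element` on made-up markup, assuming "html.parser": a `<div>` with no content is still not reported as an empty element, because in HTML only void tags such as `<br>` may be written self-closed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one void tag, one tag with content, one tag without.
soup = BeautifulSoup("<p>a<br/>b</p><div></div>", "html.parser")

br_empty = soup.find("br").is_empty_element     # void tag
p_empty = soup.find("p").is_empty_element       # has content
div_empty = soup.find("div").is_empty_element   # no content, but not a void tag
print(br_empty, p_empty, div_empty)
```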

16. Elements related to the current one

soup.next
soup.next_element
soup.next_elements
soup.next_sibling
soup.next_siblings

tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings

tag.parent
tag.parents

17. Searching among a tag's related elements

tag.find_next(...)
tag.find_all_next(...)
tag.find_next_sibling(...)
tag.find_next_siblings(...)

tag.find_previous(...)
tag.find_all_previous(...)
tag.find_previous_sibling(...)
tag.find_previous_siblings(...)

tag.find_parent(...)
tag.find_parents(...)

# These methods take the same arguments as find_all
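The navigation attributes and the `find_*` methods differ in one practical way: the attributes return whatever node is adjacent, including whitespace text, while the methods skip to the next match. A sketch on a hypothetical list, parsed with "html.parser" as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical list with newlines between the items.
html = '<ul>\n<li id="a">A</li>\n<li id="b">B</li>\n<li id="c">C</li>\n</ul>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li")

raw_sibling = first.next_sibling                        # the newline text node
next_item = first.find_next_sibling("li")["id"]         # skips the text node
later_items = [li["id"] for li in first.find_next_siblings("li")]
prev_of_c = soup.find(id="c").find_previous_sibling("li")["id"]
ancestors = [p.name for p in first.parents]             # up to the document object
print(repr(raw_sibling), next_item, later_items, prev_of_c, ancestors)
```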

18. select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

from bs4.element import Tag

# Note: _candidate_generator is a private, undocumented parameter of older bs4
# releases; releases that delegate select() to the soupsieve package may not
# honor it. The selector "a[href]" expresses the same filter.
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

# The same call, stopping after the first match
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
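A few of the selectors above, run against a trimmed, hypothetical variant of html_doc so the results can be checked by eye. The sketch assumes "html.parser" and a bs4 installation whose `select()` supports these CSS features (recent releases rely on the soupsieve package for this).

```python
from bs4 import BeautifulSoup

html = (
    '<div class="story"><a class="sister0" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Collect the id attribute of each tag, for compact comparison."""
    return [t["id"] for t in tags]

children = ids(soup.select("div > a"))            # direct children of <div>
following = ids(soup.select("#link1 ~ .sister"))  # all later siblings with class sister
adjacent = ids(soup.select("#link1 + .sister"))   # only the immediately following sibling
with_href = ids(soup.select("a[href]"))           # link1 has no href
ends_with = ids(soup.select('a[href$="tillie"]'))
second = soup.select_one("a:nth-of-type(2)")["id"]  # select_one: first match only
no_match = soup.select_one("p")                     # None when nothing matches
print(children, following, adjacent, with_href, ends_with, second, no_match)
```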

19. string: the tag's content

tag = soup.find('span')
print(tag.string)             # get
tag.string = 'new content'    # set
print(soup)

tag = soup.find('body')
print(tag.string)             # None, because body has more than one child
tag.string = 'xxx'            # replaces everything inside body
print(soup)

tag = soup.find('body')
v = tag.stripped_strings      # generator over the text of all nested tags, with whitespace stripped
print(list(v))
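When `.string` returns a value and when it returns `None` depends on how many children the tag has. The sketch below illustrates this on hypothetical markup, assuming "html.parser".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>only</p></div><section> a <b>b</b> </section>", "html.parser")

leaf = soup.find("p").string           # one text child: that text
chained = soup.find("div").string      # one tag child: its .string is passed through
mixed = soup.find("section").string    # several children: None
pieces = list(soup.find("section").stripped_strings)

# Assigning to .string replaces everything inside the tag.
section = soup.find("section")
section.string = "x"
replaced = str(section)
print(leaf, chained, mixed, pieces, replaced)
```

For tags with mixed content, `get_text()` or `stripped_strings` is usually the better fit than `.string`.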

20. append: add a tag at the end of the current tag's contents

tag = soup.find('body')
tag.append(soup.find('a'))    # a tag already in the tree is moved, not copied
print(soup)

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.append(obj)
print(soup)

21. insert: insert a tag at a given position inside the current tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.insert(2, obj)
print(soup)

22. insert_after, insert_before: insert immediately after or before the current tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
# tag.insert_before(obj)
tag.insert_after(obj)
print(soup)

23. replace_with: replace the current tag with the given tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('div')
tag.replace_with(obj)
print(soup)
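The insertion and replacement methods from sections 20 to 23 can be traced step by step on a two-tag document. This sketch creates tags with `soup.new_tag()`, the factory method described in the bs4 documentation, rather than calling `Tag(...)` directly; the helper `make()`, the markup, and the "html.parser" choice are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><b>one</b><i>two</i></div>", "html.parser")
div = soup.find("div")

def make(name, text):
    """Hypothetical helper: build a new tag that holds a single string."""
    tag = soup.new_tag(name)
    tag.string = text
    return tag

div.append(make("u", "last"))                         # b, i, u
div.insert(0, make("em", "first"))                    # em, b, i, u
soup.find("b").insert_before(make("s", "before-b"))   # em, s, b, i, u
soup.find("b").insert_after(make("q", "after-b"))     # em, s, b, q, i, u
soup.find("i").replace_with(make("strong", "two"))    # em, s, b, q, strong, u

order = [child.name for child in div.children]
print(order)
print(soup)
```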

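24. setup: set the relationships between tags (without changing where the tags are in the document)

# setup() is a low-level method that the bs4 documentation does not cover,
# so its behavior may differ between versions.
tag = soup.find('div')
a = soup.find('a')
tag.setup(previous_sibling=a)
print(tag.previous_sibling)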

25. wrap: wrap the current tag inside the given tag

from bs4.element import Tag
obj1 = Tag(name='div', attrs={'id': 'it'})
obj1.string = 'I am new here'

tag = soup.find('a')
v = tag.wrap(obj1)
print(soup)

tag = soup.find('a')
v = tag.wrap(soup.find('p'))
print(soup)

26. unwrap: remove the current tag but keep the content it wraps

tag = soup.find('a')
v = tag.unwrap()
print(soup)
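`wrap` and `unwrap` are inverse operations, which a short round trip makes visible. The markup is hypothetical, the parser is assumed to be "html.parser", and the wrapper is created with `soup.new_tag()`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/x">link</a></p>', "html.parser")

# wrap(): put the <a> inside a new <div>; the wrapper is returned
wrapper = soup.find("a").wrap(soup.new_tag("div"))
wrapped = str(soup)

# unwrap(): drop the <a> tag itself but keep its text in place
soup.find("a").unwrap()
unwrapped = str(soup)
print(wrapper.name, wrapped, unwrapped)
```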
