Python Web Scraping 1: Using BeautifulSoup

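Source: Ryan Mitchell, Web Scraping with Python (Chinese edition: 《Python网络数据采集》). The examples are copied from the book; I find that typing them out myself still helps, so I am recording them here.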

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.h1)

<h1>An Interesting Title</h1>

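Fetching a page with urllib works like this: read() returns bytes, which need to be decoded into UTF-8 text, as in a.read().decode('utf-8'). When parsing with bs4, however, you can pass the response object returned by urllib directly.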

import urllib.request

a = urllib.request.urlopen('https://www.pythonscraping.com/pages/page1.html')
soup = BeautifulSoup(a, 'lxml')
print(soup.h1)

<h1>An Interesting Title</h1>
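
The two routes described above (decoding the bytes yourself, or handing bs4 an object it can read from) can be compared without touching the network. This is only a sketch: the markup is an inline stand-in for the real page, BytesIO plays the role of the urllib response, and 'html.parser' is used instead of 'lxml' so nothing extra needs to be installed.

```python
from io import BytesIO

from bs4 import BeautifulSoup

# Stand-in for the bytes that a urllib response's read() would return.
# The markup is made up for this example.
raw = '<html><body><h1>An Interesting Title</h1></body></html>'.encode('utf-8')

# Route 1: decode the bytes yourself, then parse the resulting str.
soup_from_text = BeautifulSoup(raw.decode('utf-8'), 'html.parser')

# Route 2: pass a file-like object (anything with a read() method) directly,
# which is what happens when the urlopen() response is handed to bs4.
soup_from_file = BeautifulSoup(BytesIO(raw), 'html.parser')

print(soup_from_text.h1)
print(soup_from_file.h1)
```

Both routes end up with the same parsed h1 tag.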

Grab every span tag whose CSS class attribute is green; these are the names of people.

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(res.text, 'lxml')
green_names = soup.find_all('span', class_='green')
for name in green_names:
    print(name.string)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
...
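
One detail worth knowing when printing name.string: .string is only set when a tag has a single child, and it is None otherwise, whereas get_text() joins all the text inside the tag. The sketch below shows the difference on made-up inline markup (not the real warandpeace.html), parsed with 'html.parser' as an assumption to keep it dependency-free.

```python
from bs4 import BeautifulSoup

# Made-up markup: two green spans, one of which contains a nested tag.
html = '''
<p>
<span class="green">Anna Pavlovna</span> spoke to
<span class="red">"Well, Prince"</span>
<span class="green">Prince <b>Vasili</b></span>
</p>
'''

soup = BeautifulSoup(html, 'html.parser')

# class_ (with the trailing underscore) filters on the CSS class attribute.
green_spans = soup.find_all('span', class_='green')

# .string is None for the second span because it has two children.
strings = [span.string for span in green_spans]

# get_text() concatenates every piece of text below the tag.
texts = [span.get_text() for span in green_spans]

print(strings)
print(texts)
```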

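Children and descendants are not the same thing. A child tag is a direct, next-generation member of its parent, while descendants include every tag nested below the parent at any depth. In other words, every child is a descendant, but not every descendant is a child.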

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').children
for name in gifts:
    print(name)

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
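
The child/descendant distinction is easy to see side by side on a tiny table. The markup below is a made-up reduction of the gift list, and 'html.parser' is an assumption chosen to avoid extra dependencies; only Tag nodes are kept so that text nodes do not clutter the comparison.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up, minimal version of the gift table.
html = '<table id="giftList"><tr><td>Basket <span>New!</span></td></tr></table>'

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='giftList')

# .children yields only the nodes directly below the table.
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# .descendants walks the whole subtree, depth first.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```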

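After finding the table, take its first tr as the current node and iterate over the sibling nodes that follow it. Because the first tr is the table's header row, this extracts all of the body data and skips the header.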

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
gifts = soup.find('table', id='giftList').tr.next_siblings
for name in gifts:
    print(name)

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
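
Note the blank lines in the output above: next_siblings yields every following node, including the whitespace text between rows. A minimal sketch of two ways to keep only the rows, using made-up inline markup and 'html.parser' (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up, minimal version of the gift table, one row per line.
html = '''<table id="giftList">
<tr><th>Item Title</th></tr>
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
header_row = soup.find('table', id='giftList').tr

# Every node after the header row, whitespace strings included.
all_siblings = list(header_row.next_siblings)

# Way 1: keep only Tag nodes.
data_rows = [node for node in header_row.next_siblings if isinstance(node, Tag)]

# Way 2: let bs4 do the filtering by tag name.
data_rows_by_name = header_row.find_next_siblings('tr')

row_ids = [row['id'] for row in data_rows]
print(len(all_siblings), row_ids)
```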

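To look up a product's price, start from the product's image, go up to its parent <td> tag, and take that cell's previous sibling, which holds the price.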

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
price = soup.find('img', src='../img/gifts/img1.jpg').parent.previous_sibling.string
print(price)

$15.00
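
This navigation relies on there being no text between the two cells in the page source. If the cells were separated by line breaks, previous_sibling would return that whitespace instead of the price cell. A sketch of a slightly more defensive variant, on made-up inline markup parsed with 'html.parser' (both assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up row in which each cell sits on its own line.
html = '''<table><tr id="gift1">
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"/></td>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
img_cell = soup.find('img', src='../img/gifts/img1.jpg').parent

# Here the immediate previous sibling is just the newline between cells.
direct = img_cell.previous_sibling

# find_previous_sibling('td') skips text nodes and returns the nearest td.
price_cell = img_cell.find_previous_sibling('td')
price = price_cell.get_text(strip=True)

print(repr(direct), price)
```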

Collect all of the product images. To keep other images from slipping in, search with a regular expression.

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
imgs = soup.find_all('img', src=re.compile(r'\.\./img/gifts/img.*\.jpg'))
for img in imgs:
    print(img['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
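
In a regular expression an unescaped . matches any character, so a pattern for these paths is only exact when the literal dots are escaped. The sketch below compares an unescaped and an escaped pattern with re.search; the third candidate string is made up purely to show the difference.

```python
import re

# Unescaped dots: each '.' matches any single character.
loose = re.compile(r'../img/gifts/img.*.jpg')

# Escaped dots: '\.' matches only a literal '.'.
strict = re.compile(r'\.\./img/gifts/img.*\.jpg')

candidates = [
    '../img/gifts/img1.jpg',   # a real product image path from the page
    '../img/gifts/logo.jpg',   # the logo, which should be excluded
    'xx/img/gifts/img1-jpg',   # made-up string that only the loose pattern accepts
]

loose_hits = [s for s in candidates if loose.search(s)]
strict_hits = [s for s in candidates if strict.search(s)]

print(loose_hits)
print(strict_hits)
```

On the real page both patterns happen to select the same images; the escaped form simply rules out accidental matches.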

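find_all() also accepts a function. The one requirement is that the function returns a boolean: a tag is kept when it returns True and discarded when it returns False.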

import re
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(res.text, 'lxml')
# Alternative that gives the same result on this page: lambda tag: tag.name == 'img'
tags = soup.find_all(lambda tag: tag.has_attr('src'))
for tag in tags:
    print(tag)

<img src="../img/gifts/logo.jpg" style="float:left;"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/gifts/img3.jpg"/>
<img src="../img/gifts/img4.jpg"/>
<img src="../img/gifts/img6.jpg"/>

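tag is a bs4 Tag object. has_attr checks whether the tag has the given attribute, and tag.name returns the tag's name. On the page above, the following two filters return the same result:

lambda tag: tag.has_attr('src')
lambda tag: tag.name == 'img'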
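
The two filters agree on that page only because every tag with a src attribute there is an img. On other markup they can diverge, as this sketch shows; the inline HTML is made up, and 'html.parser' is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup: a script with src, and an img without src.
html = '''<div>
<img src="../img/gifts/logo.jpg"/>
<script src="app.js"></script>
<img alt="no source"/>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Any tag that carries a src attribute (img and script here).
by_attr = soup.find_all(lambda tag: tag.has_attr('src'))

# Any img tag, with or without src.
by_name = soup.find_all(lambda tag: tag.name == 'img')

# Combining both conditions keeps only img tags that have a src.
both = soup.find_all(lambda tag: tag.name == 'img' and tag.has_attr('src'))

print([tag.name for tag in by_attr])
print(len(by_name), len(both))
```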

by @sunhaiyu

2017.7.14
