scrapy--BeautifulSoup

BeautifulSoup官方文档:https://beautifulsoup.readthedocs.io/zh_CN/latest/#id8

太繁琐的,精简了一些自己用的到的。

1.index.html

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

2..prettify()--标准的缩进格式输出

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# <html>

#  <head>

#   <title>

#    The Dormouse's story

#   </title>

#  </head>

#  <body>

#   <p class="title">

#    <b>

#     The Dormouse's story

#    </b>

#   </p>

#   <p class="story">

#    Once upon a time there were three little sisters; and their names were

#    <a class="sister" href="http://example.com/elsie" id="link1">

#     Elsie

#    </a>

#    ,

#    <a class="sister" href="http://example.com/lacie" id="link2">

#     Lacie

#    </a>

#    and

#    <a class="sister" href="http://example.com/tillie" id="link2">

#     Tillie

#    </a>

#    ; and they lived at the bottom of a well.

#   </p>

#   <p class="story">

#    ...

#   </p>

#  </body>

# </html>

3.选择标签,属性

soup.title

# <title>The Dormouse's story</title>

soup.title.name

# u'title'

soup.title.string

# u'The Dormouse's story'

soup.title.parent.name

# u'head'

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# u'title'

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):

    print(link.get('href'))

    # http://example.com/elsie

    # http://example.com/lacie

    # http://example.com/tillie

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

#

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

#

# ...

#Tag

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')

    tag = soup.b

    type(tag)

    # <class 'bs4.element.Tag'>

#Name

    tag.name

    # u'b'

    tag.name = "blockquote"

    tag

    # <blockquote class="boldest">Extremely bold</blockquote>

#Attributes

    tag['class']

    # u'boldest'

    tag.attrs

    # {u'class': u'boldest'}

    tag['class'] = 'verybold'

    tag['id'] = 1

    tag

    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']

    del tag['id']

    tag

    # <blockquote>Extremely bold</blockquote>

    tag['class']

    # KeyError: 'class'

    print(tag.get('class'))

    # None

2.find_all

1）name 参数#A.传字符串soup.find_all('b')
# [<b>The Dormouse's story</b>]

print soup.find_all('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.find_all('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<
a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#B.传正则表达式

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b

#C.传列表

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,

#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

#D.传 True

for tag in soup.find_all(True):

    print(tag.name)

# html

# head

# title

# body

# p

# b

# p

# a

# a

#E.传方法

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

# [<p class="title"><b>The Dormouse's story</b></p>,

#  <p class="story">Once upon a time there were...</p>,

#  <p class="story">...</p>]

2）keyword 参数

soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">three</a>]

soup.find_all("a", class_="sister")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

data_soup.find_all(data-foo="value")

# SyntaxError: keyword can't be an expression

data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]

3）text 参数

soup.find_all(text="Elsie")

# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))

[u"The Dormouse's story", u"The Dormouse's story"]

4）limit 参数

soup.find_all("a", limit=2)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

5）recursive 参数

#调用tag的 find_all() 方法时,Beautiful Soup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False

soup.html.find_all("title")

# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)

# []

3.CSS选择器

（1）通过标签名查找

print soup.select('title')

#[<title>The Dormouse's story</title>]

print soup.select('a')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('b')

#[<b>The Dormouse's story</b>]

（2）通过类名查找

print soup.select('.sister')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

（3）通过 id 名查找

print soup.select('#link1')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

（4）组合查找

print soup.select('p #link1')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

print soup.select("head > title")

#[<title>The Dormouse's story</title>]

（5）属性查找

print soup.select('a[class="sister"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print soup.select('a[href="http://example.com/elsie"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

print soup.select('p a[href="http://example.com/elsie"]')

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

soup = BeautifulSoup(html, 'lxml')

print type(soup.select('title'))

print soup.select('title')[0].get_text()

for title in soup.select('title'):

    print title.get_text()

scrapy--BeautifulSoup的更多相关文章

python scrapy,beautifulsoup,regex,sgmparser,request,connection
In [2]: import requests In [3]: s = requests.Session() In [4]: s.headers 如果你是爬虫相关的业务?抓取的网站还各种各样, ...
python大规模爬取京东
python大规模爬取京东主要工具 scrapy BeautifulSoup requests 分析步骤打开京东首页,输入裤子将会看到页面跳转到了这里,这就是我们要分析的起点我们可以看到这个页面 ...
python web开发小结
书籍 <python基础教程> <流畅的python> web框架 flask django tornado ORM sqlalchemy orator 消息队列 celery ...
爬虫四大金刚：requests，selenium，BeautifulSoup，Scrapy
一.简介爬虫 1.什么是爬虫 #1.什么是互联网? 互联网是由网络设备(网线,路由器,交换机,防火墙等等)和一台台计算机连接而成,像一张网一样. #2.互联网建立的目的? 互联网的核心价值在于数据的共 ...
scrapy vs requests+beautifulsoup
两种爬虫模式比较: 1.requests和beautifulsoup都是库,scrapy是框架. 2.scrapy框架中可以加入requests和beautifulsoup. 3.scrapy基于tw ...
Scrapy框架爬虫初探——中关村在线手机参数数据爬取
关于Scrapy如何安装部署的文章已经相当多了,但是网上实战的例子还不是很多,近来正好在学习该爬虫框架,就简单写了个Spider Demo来实践.作为硬件数码控,我选择了经常光顾的中关村在线的手机页面 ...
网络爬虫：使用Scrapy框架编写一个抓取书籍信息的爬虫服务
上周学习了BeautifulSoup的基础知识并用它完成了一个网络爬虫( 使用Beautiful Soup编写一个爬虫系列随笔汇总 ), BeautifulSoup是一个非常流行的Python网 ...
Scrapy爬取自己的博客内容
python中常用的写爬虫的库有urllib2.requests,对于大多数比较简单的场景或者以学习为目的,可以用这两个库实现.这里有一篇我之前写过的用urllib2+BeautifulSoup做的一 ...
scrapy爬虫笔记(三)------写入源文件的爬取
开始爬取网页:(2)写入源文件的爬取为了使代码易于修改,更清晰高效的爬取网页,我们将代码写入源文件进行爬取. 主要分为以下几个步骤: 一.使用scrapy创建爬虫框架: 二.修改并编写源代码,确定我 ...
scrapy爬虫笔记(二)------交互式爬取
开始网页爬取:(1)交互式爬取首先,我们使用scrapy建立起爬虫的框架.在命令行中输入 scrapy shell “url” 如:scrapy shell “http://www.baidu.co ...

随机推荐

一个position为fixed的div，宽高自适应，怎样让它水平垂直都在窗口居中？
.div{ position: fixed; left: %; top: %; -webkit-transform: translate(-%, -%); transform: translate(- ...
WCF、WebAPI、WCFREST、WebService 、RPC、HTTP 概念解释
在.net平台下,有大量的技术让你创建一个HTTP服务,像Web Service,WCF,现在又出了Web API.在.net平台下,你有很多的选择来构建一个HTTP Services.我分享一下我对 ...
vue-样式问题
问题: 今天在用vue开发单页面应用的时候,遇到一个问题,在A页面,直接刷新,页面的布局样式之类的是没有问题的,不过在B页面跳转到A页面,那么A页面有一些样式就不是预期的效果. 发现解决问题: 用调试 ...
System.Data.SqlClient.SqlException: 从 datetime2 数据类型到 datetime 数据类型的转换产生一个超出范围的值
System.Data.SqlClient.SqlException: 从 datetime2 数据类型到 datetime 数据类型的转换产生一个超出范围的值.解决办法是: 而这位大哥提出的解决办法 ...
IIS7 http自动跳转到https（通过编辑Web.config实现）
本文摘自:https://www.cnblogs.com/wxbug/p/7054972.html 1.下载安装URL重写模块:Microsoft URL Rewrite Module 32位:htt ...
April 11 2017 Week 15 Tuesday
Love is hard to get into, but harder to get out of. 相爱不易,相忘更难. The past are hurt, but I think we can ...
POJ-3009 Curling 2.0---DFS求最短路
题目链接: https://vjudge.net/problem/POJ-3009 题目大意: 问题:打冰球.冰球可以往上下左右4个方向走,只有当冰球撞到墙时才会停下来,而墙会消失.当冰球紧贴墙时,不 ...
2017.10.12 Java的计数器的开发
//我们用一个合成的applet/application来简单显示出一个计数器的结果/** * Created by qichunlin on 2017/10/12. */ /*简单的计数器*/ im ...
VedioCapture
国内的技术的浮躁可见一般,在一个用了七八年的项目里面使用的类,居然拼写都是错的,在网上一搜,转载的也大有人在,最低级的错误,你可以不懂编程,但是只要上过高中,Video这个单词总该学过吧,居然转载的时 ...
redis 对cmd的操作
这个是原子递增的知识点: 关于list部分: 利用lpush命令, rpush命令, lrange命令,对列表操作此前我已经在列表(list)中插入了部分元素了关于集合set 部分首先 ...

scrapy--BeautifulSoup

scrapy--BeautifulSoup的更多相关文章

随机推荐

热门专题