1. References

Web Scraping with Python (《用Python写网络爬虫》), section 2.2: three approaches to scraping a web page, re / lxml / BeautifulSoup

Note that lxml internally converts each CSS selector into an equivalent XPath selector.

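The results show that, on our example page, Beautiful Soup is more than six times slower than the other two approaches. This is expected: lxml and the regular-expression module are both written in C, whereas BeautifulSoup is pure Python. Interestingly, lxml performs about as well as regular expressions. Because lxml must parse the input into its internal format before it can search for elements, it carries extra overhead; when several features are scraped from the same page, that one-time parsing cost is amortized and lxml becomes more competitive. It really is an impressive module!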
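
The amortization argument can be sketched as follows: with regular expressions every feature rescans the raw string, while lxml parses once and then answers any number of queries against the same tree. The HTML snippet and variable names are made up for illustration, and no timing is attempted here.

```python
import re

import lxml.html

# Made-up sample page, loosely modelled on quotes.toscrape.com.
html = """
<div class="quote">
  <span class="text">I have not failed.</span>
  <small class="author">Thomas A. Edison</small>
  <a href="/author/Thomas-A-Edison">(about)</a>
</div>
"""

# Regular expressions: each feature is a separate scan of the raw string.
author_re = re.search(r'<small class="author">(.*?)</small>', html).group(1)
link_re = re.search(r'<a href="(.*?)">', html).group(1)

# lxml: pay the parsing cost once, then query the tree as often as needed.
tree = lxml.html.fromstring(html)
author_lxml = tree.xpath('//small[@class="author"]/text()')[0]
link_lxml = tree.xpath('//a/@href')[0]

print(author_re, link_re)
print(author_lxml, link_lxml)
```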

2. Scrapy Selectors

https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

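BeautifulSoup: its drawback is that it is slow
lxml: based on ElementTree
Scrapy selectors: built on the parsel library, which in turn is built on lxml, so they are very similar to lxml in speed and parsing accuracy.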

.css() and .xpath() return a SelectorList, i.e. a list of new selectors
.extract() and .re() extract the tag data (.re() also filters it with a regular expression)

import scrapy

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\__init__.py

from scrapy.selector import Selector

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\__init__.py

from scrapy.selector.unified import *

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\unified.py

from parsel import Selector as _ParselSelector

class Selector(_ParselSelector, object_ref):

>>> from scrapy.selector import Selector

>>> from scrapy.http import HtmlResponse
When Selector is imported this way (from scrapy.selector), the first argument at instantiation is an HtmlResponse instance. To build a Selector from a str, pass it by keyword: sel = Selector(text=doc)

In [926]: from parsel import Selector

In [927]: Selector?

Init signature: Selector(self, text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)

Docstring:

:class:`Selector` allows you to select parts of an XML or HTML text using CSS

or XPath expressions and extract data from it.

``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3

``type`` defines the selector type, it can be ``"html"``, ``"xml"`` or ``None`` (default).

If ``type`` is ``None``, the selector defaults to ``"html"``.

File:           c:\program files\anaconda2\lib\site-packages\parsel\selector.py

Type:           type

doc=u"""

<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">

        <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>

        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>

        <a href="/author/Thomas-A-Edison">(about)</a>

"""

sel = Selector(doc)

sel.css('div.quote')

[<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>]

3. Debugging with scrapy shell

https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data

G:\pydata\pycode\scrapy\splash_cnblogs>scrapy shell "http://quotes.toscrape.com/page/1/"

3.1 XPath vs. CSS

Comparison

	CSS	XPath	Notes
Has attribute	response.css('div[class]')	response.xpath('//div[@class]')	To match a class value, CSS offers the shorthand div.quote or just .quote; div#abc or #abc matches id="abc"
Match attribute value	response.css('div[class="quote"]')	response.xpath('//div[@class="quote"]')	response.xpath('//small[text()="Albert Einstein"]')
Match part of attribute value	response.css('div[class*="quo"]')	response.xpath('//div[contains(@class,"quo")]')	response.xpath('//small[contains(text(),"Einstein")]')
Extract attribute value	response.css('small::attr(class)')	response.xpath('//small/@class')	In CSS, text is not an attribute, so standard CSS selectors seem to have no counterpart to the two text-filtering XPath examples above
Extract text	response.css('small::text')	response.xpath('//small/text()')

Usage

In [135]: response.xpath('//small[@class="author"]').extract_first()

In [122]: response.css('small.author').extract_first()

Out[122]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [136]: response.xpath('//small[@class="author"]/text()').extract_first()

In [123]: response.css('small.author::text').extract_first()

Out[123]: u'Albert Einstein'

In [137]: response.xpath('//small[@class="author"]/@class').extract_first()  # class is an attribute too

In [124]: response.css('small.author::attr(class)').extract_first()

Out[124]: u'author'

In [138]: response.xpath('//small[@class="author"]/@itemprop').extract_first()

In [125]: response.css('small.author::attr(itemprop)').extract_first()

Out[125]: u'author'

class is a special attribute that can hold several space-separated values, e.g. class="row header-box"

# Match one value among several
In [228]: response.css('div.row')

Out[228]:

[<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row header-box">\n           '>,

 <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row">\n    <div class="col-md'>]

In [232]: response.css('div.ro')

Out[232]: []

# Treat the whole class attribute value as one string and match it exactly

In [226]: response.css('div[class="row"]')

Out[226]: [<Selector xpath=u"descendant-or-self::div[@class = 'row']" data=u'<div class="row">\n    <div class="col-md'>]

In [240]: response.xpath('//div[@class="row header-box"]')

Out[240]: [<Selector xpath='//div[@class="row header-box"]' data=u'<div class="row header-box">\n '>]

# Treat the whole class attribute value as one string and match a substring of it

In [229]: response.css('div[class*="row"]')

Out[229]:

[<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row header-box">\n           '>,

 <Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row">\n    <div class="col-md'>]

In [230]: response.xpath('//div[contains(@class,"row")]')

Out[230]:

[<Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row header-box">\n           '>,

 <Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row">\n    <div class="col-md'>]

In [234]: response.css('div[class*="w h"]')

Out[234]: [<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'w h')]" data=u'<div class="row header-box">\n           '>]

In [235]: response.xpath('//div[contains(@class,"w h")]')

Out[235]: [<Selector xpath='//div[contains(@class,"w h")]' data=u'<div class="row header-box">\n           '>]

3.2 Extracting data

Extracting data

  selectorList / selector.extract(), extract_first()

    selector.extract() returns a single str; selector.extract_first() raises an error

    selectorList.extract() calls extract() on each selector and returns a list of str; selectorList.extract_first() returns the first item of that list.

Extracting and filtering data at the same time

  selectorList / selector.re(r'xxx'), re_first(r'xxx')

    selector.re() returns a list; selector.re_first() returns its first str

    selectorList.re() calls re() on each selector and merges the resulting lists into one list (note that not every selector will match); selectorList.re_first() returns the first str of that merged list.

Using extract

In [21]: response.css('.author')  # converted to XPath internally; returns a SelectorList instance

Out[21]:

[<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,

 <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>]

In [22]: response.css('.author').extract()  # extracts the data part shown above by calling extract() on every Selector in the SelectorList

Out[22]:

[u'<small class="author" itemprop="author">Albert Einstein</small>',

 u'<small class="author" itemprop="author">J.K. Rowling</small>',

 u'<small class="author" itemprop="author">Albert Einstein</small>',

 u'<small class="author" itemprop="author">Jane Austen</small>',

 u'<small class="author" itemprop="author">Marilyn Monroe</small>',

 u'<small class="author" itemprop="author">Albert Einstein</small>',

 u'<small class="author" itemprop="author">Andr\xe9 Gide</small>',

 u'<small class="author" itemprop="author">Thomas A. Edison</small>',

 u'<small class="author" itemprop="author">Eleanor Roosevelt</small>',

 u'<small class="author" itemprop="author">Steve Martin</small>']

In [23]: response.css('.author').extract_first()  # first match only; may return None, whereas response.css('.author')[0].extract() may raise an error

Out[23]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [24]: response.css('.author::text').extract_first()  # select the text inside the tag

Out[24]: u'Albert Einstein'

Using re

In [46]: response.css('.author::text')[0]

Out[46]: <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]/text()" data=u'Albert Einstein'>

In [47]: response.css('.author::text')[0].re(r'\w+')

Out[47]: [u'Albert', u'Einstein']

In [48]: response.css('.author::text')[0].re_first(r'\w+')

Out[48]: u'Albert'

In [49]: response.css('.author::text')[0].re(r'((\w+)\s(\w+))')  # groups are output in the order of their opening parentheses

Out[49]: [u'Albert Einstein', u'Albert', u'Einstein']
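
The group ordering in Out[49] follows from how Python's re module numbers groups. The sketch below uses only the standard library and assumes that .re() behaves like re.findall followed by flattening; that is my reading of the output above, not something these notes confirm.

```python
import re

pattern = re.compile(r"((\w+)\s(\w+))")
match = pattern.search("Albert Einstein")

# Groups are numbered by the position of their opening parenthesis:
# group 1 is the outer pair, groups 2 and 3 are the inner pairs.
groups = match.groups()

# With several groups, findall returns one tuple per match;
# flattening the tuples gives a plain list of strings.
found = pattern.findall("Albert Einstein")
flat = [group for groups_of_match in found for group in groups_of_match]

print(groups)
print(flat)
```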
