
1. References

《用Python写网络爬虫》 (Web Scraping with Python), section 2.2 "Three approaches to scraping a web page": re / lxml / BeautifulSoup

Note that internally, lxml actually converts CSS selectors into equivalent XPath selectors.

As the results show, Beautiful Soup was more than six times slower than the other two approaches when scraping our example page. This is to be expected, because lxml and the regular-expression module are written in C, while BeautifulSoup is pure Python. An interesting observation is that lxml performed about as well as regular expressions. Because lxml must parse the input into its internal format before searching for elements, it incurs extra overhead; but when scraping several features from the same page, this one-off parsing cost is amortized and lxml becomes more competitive. It really is an impressive module!
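As a rough illustration of how such a "parse once, then query" trade-off can be measured (this is a stdlib-only sketch, not the book's actual benchmark), the snippet below times a regex scan against a tree-based lookup with xml.etree.ElementTree; the lxml and BeautifulSoup variants would be structured the same way:

```python
import re
import timeit
import xml.etree.ElementTree as ET

HTML = '<div class="quote"><small class="author">Albert Einstein</small></div>'

def with_re():
    # Regex: no parse step, just scan the raw string.
    return re.search(r'<small class="author">(.*?)</small>', HTML).group(1)

def with_etree():
    # Tree-based: pay a parsing cost up front, then query.
    root = ET.fromstring(HTML)
    return root.find('.//small[@class="author"]').text

# Both approaches extract the same value.
assert with_re() == with_etree() == 'Albert Einstein'

for fn in (with_re, with_etree):
    print(fn.__name__, timeit.timeit(fn, number=10000))
```

On a page this small the regex usually wins; the tree approach pulls ahead once the same parsed document is queried many times.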

2. Scrapy Selectors

https://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

  • BeautifulSoup drawback: slow
  • lxml: based on ElementTree
  • Scrapy selectors: built on the parsel library, which is in turn built on top of lxml, so they are very similar in speed and parsing accuracy.

.css() and .xpath() return a SelectorList, i.e. a list of new selectors.
.extract() and .re() extract (and, in the case of .re(), filter) the tag data.

import scrapy

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\__init__.py

from scrapy.selector import Selector

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\__init__.py

from scrapy.selector.unified import *

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\selector\unified.py

from parsel import Selector as _ParselSelector

class Selector(_ParselSelector, object_ref):

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
When Selector is imported this way (from scrapy.selector), the first positional argument to its constructor is an HtmlResponse instance; to build a Selector from a plain str, pass it as a keyword argument: sel = Selector(text=doc)


In [926]: from parsel import Selector

In [927]: Selector?
Init signature: Selector(self, text=None, type=None, namespaces=None, root=None, base_url=None, _expr=None)
Docstring:
:class:`Selector` allows you to select parts of an XML or HTML text using CSS
or XPath expressions and extract data from it. ``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3. ``type`` defines the selector type; it can be ``"html"``, ``"xml"`` or ``None`` (default).
If ``type`` is ``None``, the selector defaults to ``"html"``.
File: c:\program files\anaconda2\lib\site-packages\parsel\selector.py
Type: type 


doc = u"""
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
<a href="/author/Thomas-A-Edison">(about)</a>
"""
sel = Selector(text=doc)
sel.css('div.quote')
[<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>]

3. Debugging with the scrapy shell

https://doc.scrapy.org/en/latest/intro/tutorial.html#extracting-data

G:\pydata\pycode\scrapy\splash_cnblogs>scrapy shell "http://quotes.toscrape.com/page/1/"

3.1 XPath vs CSS

Comparison

  • Has an attribute: response.css('div[class]') / response.xpath('//div[@class]'). For a specific class value, CSS offers the shorthand div.quote or even .quote; likewise div#abc or #abc matches id="abc".
  • Exact attribute value: response.css('div[class="quote"]') / response.xpath('//div[@class="quote"]'); XPath can apply the same test to text: response.xpath('//small[text()="Albert Einstein"]').
  • Partial attribute value: response.css('div[class*="quo"]') / response.xpath('//div[contains(@class,"quo")]'); for text: response.xpath('//small[contains(text(),"Einstein")]').
  • Extract an attribute value: response.css('small::attr(class)') / response.xpath('//small/@class'). In CSS, text is not an attribute, so CSS has no equivalent of the two text() filters above.
  • Extract text: response.css('small::text') / response.xpath('//small/text()').
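A few of the XPath patterns in this comparison can be tried without Scrapy at all: the stdlib xml.etree.ElementTree module supports a small XPath subset (attribute-presence and exact-value predicates, but not contains() or text() predicates). The snippet below is a stdlib-only illustration on a made-up fragment:

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for a scraped page (made up for this sketch).
doc = ET.fromstring(
    '<root>'
    '<div class="quote"><small class="author">Albert Einstein</small></div>'
    '<div class="tags"/>'
    '<div/>'
    '</root>'
)

# //div[@class] -- divs that have a class attribute at all
print(len(doc.findall('.//div[@class]')))          # 2

# //div[@class="quote"] -- exact attribute value
print(len(doc.findall('.//div[@class="quote"]')))  # 1

# //small/@class has no direct ElementTree form; read the attribute
# off the matched element instead.
print(doc.find('.//small').get('class'))           # author

# //small/text()
print(doc.find('.//small').text)                   # Albert Einstein
```

For the contains() and text()-predicate rows you need lxml or parsel; ElementTree's subset stops short of those.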

Usage

In [135]: response.xpath('//small[@class="author"]').extract_first()
In [122]: response.css('small.author').extract_first()
Out[122]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [136]: response.xpath('//small[@class="author"]/text()').extract_first()
In [123]: response.css('small.author::text').extract_first()
Out[123]: u'Albert Einstein'

In [137]: response.xpath('//small[@class="author"]/@class').extract_first()  # class is itself just an attribute
In [124]: response.css('small.author::attr(class)').extract_first()
Out[124]: u'author'

In [138]: response.xpath('//small[@class="author"]/@itemprop').extract_first()
In [125]: response.css('small.author::attr(itemprop)').extract_first()
Out[125]: u'author'

class is a special attribute that may hold multiple space-separated values, e.g. class="row header-box".
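The CSS class selector .row must match one whole token inside such a multi-valued attribute. parsel's generated XPath does this with contains(concat(' ', normalize-space(@class), ' '), ' row '); the function below is a plain-Python toy model of that test (not parsel's actual code):

```python
def css_class_matches(class_attr, token):
    # Mirrors contains(concat(' ', normalize-space(@class), ' '), ' token '):
    # normalize whitespace, pad with spaces, so only whole tokens match.
    normalized = ' ' + ' '.join(class_attr.split()) + ' '
    return (' ' + token + ' ') in normalized

print(css_class_matches('row header-box', 'row'))  # True  (.row matches)
print(css_class_matches('row header-box', 'ro'))   # False (.ro does not, unlike contains())
print(css_class_matches('rows', 'row'))            # False (a bare substring test would wrongly match)
```

This is why div.ro returns nothing below while div[class*="ro"]-style substring matching would still hit class="row".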

# Match one token of a multi-valued class attribute
In [228]: response.css('div.row')
Out[228]:
[<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row header-box">\n '>,
<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' row ')]" data=u'<div class="row">\n <div class="col-md'>]

In [232]: response.css('div.ro')
Out[232]: []

# Match the whole class attribute value as an exact string
In [226]: response.css('div[class="row"]')
Out[226]: [<Selector xpath=u"descendant-or-self::div[@class = 'row']" data=u'<div class="row">\n <div class="col-md'>]

In [240]: response.xpath('//div[@class="row header-box"]')
Out[240]: [<Selector xpath='//div[@class="row header-box"]' data=u'<div class="row header-box">\n '>]

# Match a substring of the whole class attribute value
In [229]: response.css('div[class*="row"]')
Out[229]:
[<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row header-box">\n '>,
<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'row')]" data=u'<div class="row">\n <div class="col-md'>]

In [230]: response.xpath('//div[contains(@class,"row")]')
Out[230]:
[<Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row header-box">\n '>,
<Selector xpath='//div[contains(@class,"row")]' data=u'<div class="row">\n <div class="col-md'>]

In [234]: response.css('div[class*="w h"]')
Out[234]: [<Selector xpath=u"descendant-or-self::div[@class and contains(@class, 'w h')]" data=u'<div class="row header-box">\n '>]

In [235]: response.xpath('//div[contains(@class,"w h")]')
Out[235]: [<Selector xpath='//div[contains(@class,"w h")]' data=u'<div class="row header-box">\n '>]

3.2 Extracting data

  • Extracting data

  selectorList / selector: .extract(), .extract_first()

    selector.extract() returns a str; a plain Selector has no extract_first(), so calling selector.extract_first() raises an error.

    selectorList.extract() runs selector.extract() on every selector and returns a list of str; selectorList.extract_first() takes the first element of that list.

  • Extracting and filtering at the same time

  selectorList / selector: .re(r'xxx'), .re_first(r'xxx')

    selector.re() returns a list; selector.re_first() takes the first matched str.

    selectorList.re() runs selector.re() on every selector and merges the per-selector result lists into a single list (note that not every selector necessarily matches); selectorList.re_first() takes the first str of that merged list.
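The semantics above can be made concrete with a toy model (a deliberately simplified stand-in for parsel's Selector/SelectorList, not their real implementation):

```python
import re

class ToySelector:
    # Toy stand-in for parsel's Selector: just holds a text fragment.
    def __init__(self, data):
        self.data = data

    def extract(self):
        return self.data

    def re(self, pattern):
        return re.findall(pattern, self.data)

class ToySelectorList(list):
    # extract() maps over every selector; re() flattens all matches.
    def extract(self):
        return [sel.extract() for sel in self]

    def extract_first(self):
        results = self.extract()
        return results[0] if results else None

    def re(self, pattern):
        return [m for sel in self for m in sel.re(pattern)]

    def re_first(self, pattern):
        results = self.re(pattern)
        return results[0] if results else None

sels = ToySelectorList([ToySelector('Albert Einstein'), ToySelector('42')])
print(sels.extract())        # ['Albert Einstein', '42']
print(sels.extract_first())  # 'Albert Einstein'
print(sels.re(r'\w+'))       # ['Albert', 'Einstein', '42']
print(sels.re_first(r'\d+')) # '42' -- first element of the merged list
```

Note how re_first('\d+') skips the first selector entirely: the merge happens before the "first" is taken, which matches the behaviour described above.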

Using extract

In [21]: response.css('.author')  # converted to XPath internally; returns a SelectorList instance
Out[21]:
[<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>,
<Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data=u'<small class="author" itemprop="author">'>]

In [22]: response.css('.author').extract()  # extract the data part above: runs extract() on each Selector in the SelectorList
Out[22]:
[u'<small class="author" itemprop="author">Albert Einstein</small>',
u'<small class="author" itemprop="author">J.K. Rowling</small>',
u'<small class="author" itemprop="author">Albert Einstein</small>',
u'<small class="author" itemprop="author">Jane Austen</small>',
u'<small class="author" itemprop="author">Marilyn Monroe</small>',
u'<small class="author" itemprop="author">Albert Einstein</small>',
u'<small class="author" itemprop="author">Andr\xe9 Gide</small>',
u'<small class="author" itemprop="author">Thomas A. Edison</small>',
u'<small class="author" itemprop="author">Eleanor Roosevelt</small>',
u'<small class="author" itemprop="author">Steve Martin</small>']

In [23]: response.css('.author').extract_first()  # take only the first; may return None, whereas response.css('.author')[0].extract() may raise an error
Out[23]: u'<small class="author" itemprop="author">Albert Einstein</small>'

In [24]: response.css('.author::text').extract_first()  # target the text inside the tag
Out[24]: u'Albert Einstein'

Using re

In [46]: response.css('.author::text')[0]
Out[46]: <Selector xpath=u"descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]/text()" data=u'Albert Einstein'>

In [47]: response.css('.author::text')[0].re(r'\w+')
Out[47]: [u'Albert', u'Einstein']

In [48]: response.css('.author::text')[0].re_first(r'\w+')
Out[48]: u'Albert'

In [49]: response.css('.author::text')[0].re(r'((\w+)\s(\w+))')  # groups are output in the order of their opening parentheses
Out[49]: [u'Albert Einstein', u'Albert', u'Einstein']
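The group ordering seen in In [49] comes straight from Python's re module and can be checked without Scrapy:

```python
import re

text = 'Albert Einstein'

# findall with one group returns just that group's matches.
print(re.findall(r'\w+', text))             # ['Albert', 'Einstein']

# With several groups, each match yields a tuple ordered by
# opening parenthesis: the outer group first, then the inner ones.
print(re.findall(r'((\w+)\s(\w+))', text))  # [('Albert Einstein', 'Albert', 'Einstein')]
```

Selector.re additionally flattens those tuples, which is why Out[49] above shows a flat list rather than a list of tuples.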
