3. Using Selectors in Scrapy

Source of the official sample page (HTML):
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
 <a href='image1.html'>Name: My image 1 <img src='image1_thumb.jpg' /></a>
 <a href='image2.html'>Name: My image 2 <img src='image2_thumb.jpg' /></a>
 <a href='image3.html'>Name: My image 3 <img src='image3_thumb.jpg' /></a>
 <a href='image4.html'>Name: My image 4 <img src='image4_thumb.jpg' /></a>
 <a href='image5.html'>Name: My image 5 <img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>

# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

>>> response.css('title::text').extract()
[u'Example website']

>>> response.xpath('//title/text()').extract()
[u'Example website']

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']

>>> response.xpath('//a/@href').extract()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.css('a::attr(href)').extract()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

>>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()
'image1_thumb.jpg'

# Default value: when the query matches nothing, extract_first() returns the given default
>>> response.xpath('//div[@id="images"]').css('img::attr(data-src)').extract_first(default='')
''
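
The default-value behaviour can be pictured in plain Python. This is a minimal sketch, not Scrapy's implementation: `first_or_default` is a hypothetical helper that only mimics what `extract_first(default=...)` hands back for a list of extracted strings.

```python
def first_or_default(values, default=None):
    """Return the first item of `values`, or `default` when it is empty.

    Hypothetical helper for illustration only; it mirrors the observable
    behaviour of extract_first(default=...), not Scrapy's internals.
    """
    for value in values:
        return value
    return default


found = first_or_default(['image1_thumb.jpg', 'image2_thumb.jpg'], default='')
missing = first_or_default([], default='')
no_default = first_or_default([])

print(repr(found))       # the first match
print(repr(missing))     # the default, because nothing matched
print(repr(no_default))  # None when no default is given
```

Passing an explicit default such as `''` avoids having to check for `None` before calling string methods on the result.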

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"></a>',
u'<a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"></a>',
u'<a href="image3.html">Name: My image 3 <img src="image3_thumb.jpg"></a>',
u'<a href="image4.html">Name: My image 4 <img src="image4_thumb.jpg"></a>',
u'<a href="image5.html">Name: My image 5 <img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)
...

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
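
The loop above relies on relative queries: each `link` is itself a selector, so `@href` and `img/@src` are resolved against that one `<a>` element. As a rough illustration that runs without Scrapy, the sketch below does the same walk with the standard library's `xml.etree.ElementTree`. That module is a stand-in, not what Scrapy uses, and it works here only because the sample markup happens to be well-formed XML; the `SAMPLE` string is a trimmed copy of the page.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page: just the part the loop uses.
SAMPLE = """
<div id='images'>
 <a href='image1.html'>Name: My image 1 <img src='image1_thumb.jpg' /></a>
 <a href='image2.html'>Name: My image 2 <img src='image2_thumb.jpg' /></a>
</div>
""".strip()

root = ET.fromstring(SAMPLE)
lines = []
for index, link in enumerate(root.findall('a')):
    # Both lookups are relative to the current <a> element, in the same
    # spirit as link.xpath('@href') and link.xpath('img/@src').
    href = link.get('href')
    src = link.find('img').get('src')
    lines.append('Link number %d points to url %s and image %s' % (index, href, src))

print('\n'.join(lines))
```

Unlike `extract()`, which returns a list for every query, this sketch yields plain strings, so the printed lines carry no list brackets.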

>>> response.xpath('//a/text()').extract()
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']

>>> response.css('a::text').extract()
['Name: My image 1 ',
'Name: My image 2 ',
'Name: My image 3 ',
'Name: My image 4 ',
'Name: My image 5 ']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

# Using regular expressions
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1 ',
u'My image 2 ',
u'My image 3 ',
u'My image 4 ',
u'My image 5 ']

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1 '

>>> response.xpath('//a/text()').re(r'Name:\s*(.*)')
['My image 1 ',
'My image 2 ',
'My image 3 ',
'My image 4 ',
'My image 5 ']

>>> response.xpath('//a/text()').re_first(r'Name:\s*(.*)')
'My image 1 '

>>> response.css('a::text').re(r'Name:\s*(.*)')
['My image 1 ',
'My image 2 ',
'My image 3 ',
'My image 4 ',
'My image 5 ']

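# Use strip() to clean up the whitespace around the match. Compare the two forms:
#   re_first('Name:(.*)').strip()   captures the surrounding spaces, then strips both sides
#   re(r'Name:\s*(.*)')             \s* only skips the whitespace right after "Name:"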
>>> response.css('a::text').re_first('Name:(.*)').strip()
'My image 1'
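
The difference between the two patterns can be checked with the standard `re` module alone. This is a small sketch, assuming the link text is exactly `'Name: My image 1 '`, as in the sample HTML, where a space precedes the `<img>` tag; it does not go through Scrapy's `.re()` at all.

```python
import re

# Text of the first link as it appears in the sample page, including the
# space after "Name:" and the space before the <img> tag.
text = 'Name: My image 1 '

with_ws_skip = re.search(r'Name:\s*(.*)', text).group(1)
without_skip = re.search(r'Name:(.*)', text).group(1)
stripped = without_skip.strip()

print(repr(with_ws_skip))  # leading space skipped by \s*, trailing space kept
print(repr(without_skip))  # both spaces kept
print(repr(stripped))      # both spaces removed
```

In short, `\s*` only deals with whitespace before the captured group, while `strip()` cleans both ends of the result.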

# Get all <a> elements (hyperlinks)
>>> response.css('a').extract()
['<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"></a>',
'<a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"></a>',
'<a href="image3.html">Name: My image 3 <img src="image3_thumb.jpg"></a>',
'<a href="image4.html">Name: My image 4 <img src="image4_thumb.jpg"></a>',
'<a href="image5.html">Name: My image 5 <img src="image5_thumb.jpg"></a>']

>>> response.css('a').extract_first()
'<a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"></a>'
