scrapy的selectors

from scrapy import Selector

>>> doc = """

... <div>

... <ul>

... <li class="item-0"><a href="link1.html">first item</a></li>

... <li class="item-1"><a href="link2.html">second item</a></li>

... <li class="item-inactive"><a href="link3.html">third item</a></li>

... <li class="item-1"><a href="link4.html">fourth item</a></li>

... <li class="item-0"><a href="link5.html">fifth item</a></li>

... </ul>

... </div>

... """

>>> sel = Selector(text=doc, type="html")

>>> sel.xpath('//li//@href').extract()

[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']

在xpath中使用正则表达式

>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()

[u'link1.html', u'link2.html', u'link4.html', u'link5.html']

在xpath中使用变量，用$标识，下面路径表示提取包含5个<a>标签的div标签的属性id的值

response.xpath('//div[count(a)=$cnt]/@id',cnt=5).extract_first()

response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()

u'Name: My image 1 '

response.xpath('//base/@href').extract()

[u'http://example.com/']

response.css('base::attr(href)').extract()

[u'http://example.com/']

response.xpath('//a[contains(@href,"img")]/@href').extract()

response.css(

scrapy的selectors的更多相关文章

python爬虫scrapy的Selectors参考文档
http://doc.scrapy.org/en/1.0/topics/selectors.html#topics-selectors-htmlcode
Scrapy里Selectors 四种基础的方法
在Scrapy里面,Selectors 有四种基础的方法xpath():返回一系列的selectors,每一个select表示一个xpath参数表达式选择的节点css():返回一系列的selector ...
scrapy之Selectors
练习url:https://doc.scrapy.org/en/latest/_static/selectors-sample1.html 一获取文本值 xpath In []: response. ...
【Scrapy】Selectors
Constructing selectors For convenience,response objects exposes a selector on .selector attribute,it ...
Scrapy Selectors 选择器
0. 1.参考 <用Python写网络爬虫>——2.2 三种网页抓取方法 re / lxml / BeautifulSoup 需要注意的是,lxml在内部实现中,实际上是将CSS选择器转 ...
Scrapy进阶知识点总结（二）——选择器Selectors
1. Selectors选择器在抓取网页时,您需要执行的最常见任务是从HTML源提取数据.有几个库可用于实现此目的,例如: BeautifulSoup是Python程序员中非常流行的Web抓取库,它 ...
Scrapy框架——介绍、安装、命令行创建，启动、项目目录结构介绍、Spiders文件夹详解（包括去重规则）、Selectors解析页面、Items、pipelines（自定义pipeline）、下载中间件（Downloader Middleware）、爬虫中间件、信号
一介绍 Scrapy一个开源和协作的框架,其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的,使用它可以以快速.简单.可扩展的方式从网站中提取所需的数据.但目前Scrapy的用途十分广泛,可 ...
scrapy框架之Selectors选择器
Selectors(选择器) 当您抓取网页时,您需要执行的最常见任务是从HTML源中提取数据.有几个库可以实现这一点: BeautifulSoup是Python程序员中非常流行的网络抓取库,它基于HT ...
Scrapy 爬虫使用指南完全教程
scrapy note command 全局命令: startproject :在 project_name 文件夹下创建一个名为 project_name 的Scrapy项目. scrapy sta ...

随机推荐

【附5】springboot之配置文件
本文转载自http://www.jianshu.com/p/80621291373b,作者:龙白一梦我的boss 代码从开发到测试要经过各种环境,开发环境,测试环境,demo环境,线上环境,各种环境 ...
dp暑假专题训练记录
A 回文串的最小划分题意:给出长度不超过1000的字符串,把它分割成若干个回文字串,求能分成的最少字串数. #include <iostream> #include <cstdio ...
Visual Studio 项目模板制作(三)
前面,我们已经制作好了模板,然后放到相应的Template目录就可以在Visual Studio中使用本篇,我们采用安装VSIX扩展的方式来安装模板,这种方式需要安装Visual Studio SD ...
junit中test注解测试使用案列解析一
本文原创,转载请注明出处在写代码的过程中,只想测试部分代码,调试一小段功能有没有通的情况下,可以用该方法: 以下为在项目中测试一个小功能的案例,在此记录一下, /** * <解析查询磁 ...
ACMG遗传变异分类标准与指南
2015年,美国权威机构——美国医学遗传学与基因组学学会(ACMG)编写和发布了<ACMG遗传变异分类标准与指南>.为帮助我国医疗工作者和遗传咨询从业者更好地理解ACMG遗传变异分类标准. ...
AngularJS监听路由变化
使用AngularJS时,当路由发生改变时,我们需要做某些处理,此时可以监听路由事件,常用的是$routeStartChange, $routeChangeSuccess.完整例子如下: <!D ...
PCH Warning: header stop cannot be in a macro or #if block.
在编写头文件时,遇到这么一个warning:PCH Warning: header stop cannot be in a macro or #if block. An intellisense PC ...
03_ExeZZ.cpp
1.静态加载 DLL : #pragma comment(lib, "DllZZ.lib") __declspec(dllimport) void __stdcall AA(); ...
在线客服分享 qq 一键加好友一键入群
1.在线客服 <a href="tencent://message/?uin=qqnum&Site=&menu=yes">qq客服</a> ...
mount: unknown filesystem type 'LVM2_member'解决方案【转】
一台服务器,普通/dev/sda1/2(硬盘一) 同步数据到 lvm_member(硬盘二) rsync两硬盘数据同步: From: http://hi.baidu.com/williwill/ite ...

scrapy的selectors

scrapy的selectors的更多相关文章

随机推荐

热门专题