[scrapy] Item and Spider
Items
Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
Extending Items
You can extend Items (to add more fields, or to change some metadata for some fields) by declaring a subclass of your original Item.
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values.
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
Item Objects
1. class scrapy.item.Item([arg])
Return a new Item, optionally initialized from the given argument.
The only additional attribute provided by Items is: fields
2. Field objects
class scrapy.item.Field([arg])
The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes.
_______________________________________________________________________________________________________________________________
Built-in spiders reference
Scrapy comes with some useful generic spiders that you can subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases.
class scrapy.spider.Spider
This is the simplest spider, and the one from which every other spider must inherit.
Important attributes:
name
A string which defines the name for this spider. It must be unique. This is the most important spider attribute and it is required.
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.
start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.
start_requests()
This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests.
make_requests_from_url(url)
A method that receives a URL and returns a Request object to scrape. Unless overridden, this method returns Requests with the parse() method as their callback function.
parse(response)
The parse method is in charge of processing the response and returning scraped data.
log(message[,level,component])
Log a message.
closed(reason)
Called when the spider closes.
class scrapy.contrib.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
In addition to the attributes inherited from Spider, CrawlSpider provides the following attribute:
rules
A list of one or more Rule objects. Each Rule defines a certain behaviour for crawling the site.
About the Rule object:
class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.
callback is a callable or a string to be called for each link extracted with the specified link_extractor.
Note: when writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic.
cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True (i.e. the extracted links will themselves be crawled); otherwise it defaults to False.
process_request is a callable or a string which will be called with every request extracted by this rule, and must return a Request or None.
------------------------------------------------------------------------------------------------------------------------------------
LinkExtractors are objects whose only purpose is to extract links from web pages(scrapy.http.Response objects).
Scrapy ships with two built-in Link Extractors; you can also write your own as needed.
All available link extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.
SgmlLinkExtractor
class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow, ...)
The SgmlLinkExtractor extends the base BaseSgmlLinkExtractor by providing additional filters that you can specify to extract links.
allow (a regular expression, or a list of regular expressions): a single regular expression (or list of regular expressions) that the URLs must match in order to be extracted. If not given, it will match all links.