Scrapy's parallel()
Limiting Parallelism
jcalderone, May 22nd, 2006
Concurrency can be a great way to speed things up, but what happens when you have too much concurrency? Overloading a system or a network can be detrimental to performance. Often there is a peak in performance at a particular level of concurrency. Executing a particular number of tasks in parallel will be easier than ever with Twisted 2.5 and Python 2.5:

from twisted.internet import defer, task

def parallel(iterable, count, callable, *args, **named):
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])

Here's an example of using this to save the contents of a bunch of URLs which are listed one per line in a text file, downloading at most fifty at a time:
from twisted.python import log
from twisted.internet import reactor
from twisted.web import client

def download((url, fileName)):
    return client.downloadPage(url, file(fileName, 'wb'))

urls = [(url, str(n)) for (n, url) in enumerate(file('urls.txt'))]
finished = parallel(urls, 50, download)
finished.addErrback(log.err)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()

[Edit: The original generator expression in this post was of the form ((yield foo()) for x in y). The yield here is completely superfluous, of course, so I have removed it.]
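Scrapy builds on exactly this pattern: its scrapy.utils.defer.parallel() helper follows the function above, and the scraper's handle_spider_output() pushes scraped items through it so that no more than CONCURRENT_ITEMS items are processed at once. Below is a simplified, stand-alone sketch of that idea, not Scrapy's actual source; process_item() and the stripped-down handle_spider_output() signature are invented for illustration.

# Sketch (not Scrapy's real code) of capping how many scraped items are
# processed at the same time, in the spirit of Scrapy's CONCURRENT_ITEMS.
from twisted.internet import defer, task

def parallel(iterable, count, callable, *args, **named):
    # Same idea as above: `count` cooperative tasks pull from one shared
    # generator, so at most `count` items are in flight at any moment.
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for _ in range(count)])

def process_item(item):
    # Placeholder for whatever the item pipeline would do; it only needs
    # to return a value or a Deferred per item.
    return defer.succeed(item)

def handle_spider_output(result, concurrent_items=100):
    # `result` is the iterable of items a spider callback returned; the
    # returned DeferredList fires once every item has been processed.
    return parallel(result, concurrent_items, process_item)

Calling handle_spider_output(items) therefore lets an arbitrarily long iterable of items flow through process_item() with a fixed upper bound on concurrency.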
The following is a quick hands-on experiment with the same Cooperator pattern: consume part of a generator by hand, then hand the rest of it to coiterate().

from twisted.internet import defer, reactor, task

l = [3, 4, 5, 6]

def f(a):
    print a

work = (f(elem) for elem in l)

# Pull the first three items synchronously; this prints 3, 4 and 5.
for i in range(3):
    work.next()

coop = task.Cooperator()
# work = (callable(elem, *args, **named) for elem in iterable)
d = [coop.coiterate(work) for _ in range(5)]

# Each coiterate() call returns a Deferred. The reactor is not running
# here, so the remaining work never executes and they all stay pending.
print d

Output:
[<Deferred at 0x1aa0c88 waiting on Deferred at 0x1aa0d50>, <Deferred at 0x1aa0dc8 waiting on Deferred at 0x1aa0e90>, <Deferred at 0x1aa0f30 waiting on Deferred at 0x1aa4030>, <Deferred at 0x1aa40d0 waiting on Deferred at 0x1aa4198>, <Deferred at 0x1aa4238 waiting on Deferred at 0x1aa4300>]
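To see the remaining work actually complete, the reactor has to run: a Cooperator's default scheduler advances the shared generator through reactor callbacks. Here is a sketch of the same experiment driven to completion; the longer input list and the DeferredList/reactor.stop() plumbing are added for illustration, and the code stays in Python 2 syntax like the rest of the post.

from twisted.internet import defer, reactor, task

def f(a):
    print a

work = (f(elem) for elem in [3, 4, 5, 6, 7, 8])

# Consume the first three items synchronously: prints 3, 4 and 5.
for i in range(3):
    work.next()

# Hand the rest of the generator to a Cooperator. Five coiterate() calls
# share the single generator, so up to five items could be in flight at
# once; the work here is synchronous, so the tasks simply take turns.
coop = task.Cooperator()
done = defer.DeferredList([coop.coiterate(work) for _ in range(5)])

# Stop the reactor once all five cooperative tasks have exhausted the
# generator (printing 6, 7 and 8 along the way).
done.addCallback(lambda ign: reactor.stop())
reactor.run()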