Scrapy's parallel()
- Limiting Parallelism
- jcalderone
- May 22nd, 2006
Concurrency can be a great way to speed things up, but what happens when you have too much concurrency? Overloading a system or a network can be detrimental to performance. Often there is a peak in performance at a particular level of concurrency. Executing a particular number of tasks in parallel will be easier than ever with Twisted 2.5 and Python 2.5:

from twisted.internet import defer, task

def parallel(iterable, count, callable, *args, **named):
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])
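The trick is that all count calls to coiterate() share a single generator: each worker pulls one element, and when the value produced is a Deferred the Cooperator waits for it to fire before pulling the next, so at most count tasks are ever outstanding. A minimal self-contained sketch of that behaviour on Python 3 with a current Twisted (the slow() task and its timings are made up for illustration):

from twisted.internet import defer, reactor, task

def parallel(iterable, count, callable, *args, **named):
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for _ in range(count)])

def slow(n):
    # Simulate a one-second "network call"; coiterate waits for this Deferred
    # to fire before pulling the next element off the shared generator.
    return task.deferLater(reactor, 1.0, lambda: print('finished', n))

finished = parallel(range(9), 3, slow)            # nine tasks, three at a time
finished.addCallback(lambda ign: reactor.stop())
reactor.run()                                     # takes about 3 seconds, not 9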
Here's an example of using this to save the contents of a bunch of URLs which are listed one per line in a text file, downloading at most fifty at a time:

from twisted.python import log
from twisted.internet import reactor
from twisted.web import client

def download((url, fileName)):
    # downloadPage fetches the URL and streams the body into the open file.
    return client.downloadPage(url, file(fileName, 'wb'))

# strip() drops the trailing newline that file iteration leaves on each URL.
urls = [(url.strip(), str(n)) for (n, url) in enumerate(file('urls.txt'))]
finished = parallel(urls, 50, download)
finished.addErrback(log.err)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()

[Edit: The original generator expression in this post was of the form ((yield foo()) for x in y). The yield here is completely superfluous, of course, so I have removed it.]
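This is why the post carries a Scrapy title: Scrapy ships essentially the same helper as scrapy.utils.defer.parallel() and calls it from handle_spider_output() so that item processing for one response is capped at the CONCURRENT_ITEMS setting. A minimal standalone sketch of using that helper directly (assumes Scrapy is installed and that parallel() keeps this signature in your release; process_item here is a made-up stand-in for a pipeline step):

from twisted.internet import reactor, task
from scrapy.utils.defer import parallel   # same shape as the function above

def process_item(item):
    # Hypothetical stand-in for an item-pipeline step returning a Deferred.
    return task.deferLater(reactor, 0.1, print, 'processed', item)

# At most four items are in flight at once, mirroring CONCURRENT_ITEMS.
finished = parallel(iter(range(10)), 4, process_item)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()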
A quick experiment (Python 2) with a shared generator and coiterate():

from twisted.internet import defer, reactor, task

l = [3, 4, 5, 6]

def f(a):
    print a

work = (f(elem) for elem in l)

# Consume the first three elements by hand; this prints 3, 4 and 5.
for i in range(3):
    work.next()

coop = task.Cooperator()
# work = (callable(elem, *args, **named) for elem in iterable)
d = [coop.coiterate(work) for _ in range(5)]
print d

Output:

[<Deferred at 0x1aa0c88 waiting on Deferred at 0x1aa0d50>, <Deferred at 0x1aa0dc8 waiting on Deferred at 0x1aa0e90>, <Deferred at 0x1aa0f30 waiting on Deferred at 0x1aa4030>, <Deferred at 0x1aa40d0 waiting on Deferred at 0x1aa4198>, <Deferred at 0x1aa4238 waiting on Deferred at 0x1aa4300>]
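Note that the Cooperator has not consumed anything yet: the three values printed so far came from the manual next() calls, and the five coiterate workers stay parked until the reactor runs. A runnable Python 3 variant that lets them finish (the surplus workers simply find the generator exhausted and their Deferreds fire at once):

from twisted.internet import defer, reactor, task

items = [3, 4, 5, 6]

def f(a):
    print(a)

work = (f(elem) for elem in items)

coop = task.Cooperator()
# Five workers share one four-element generator.
deferreds = [coop.coiterate(work) for _ in range(5)]

finished = defer.DeferredList(deferreds)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()   # prints 3, 4, 5, 6 and then exits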