Limiting Parallelism
jcalderone
May 22nd, 2006
This blog has moved! Read this post and its comments at its new home. Concurrency can be a great way to speed things up, but what happens when you have too much concurrency? Overloading a system or a network can be detrimental to performance. Often there is a peak in performance at a particular level of concurrency. Executing a particular number of tasks in parallel will be easier than ever with Twisted 2.5 and Python 2.5:

from twisted.internet import defer, task

def parallel(iterable, count, callable, *args, **named):
coop = task.Cooperator()
work = (callable(elem, *args, **named) for elem in iterable)
return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])

Here's an example of using this to save the contents of a bunch of URLs which are listed one per line in a text file, downloading at most fifty at a time:

from twisted.python import log
from twisted.internet import reactor
from twisted.web import client def download((url, fileName)):
return client.downloadPage(url, file(fileName, 'wb')) urls = [(url, str(n)) for (n, url) in enumerate(file('urls.txt'))]
finished = parallel(urls, 50, download)
finished.addErrback(log.err)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()

[Edit: The original generator expression in this post was of the form ((yield foo()) for x in y). The yield here is completely superfluous, of course, so I have removed it.]

from twisted.internet import defer, reactor, task
l=[3,4,5,6]
def f(a):
print a
work = (f(elem) for elem in l)
for i in range(3):
work.next() coop = task.Cooperator()
#work = (callable(elem, *args, **named) for elem in iterable)
d=[coop.coiterate(work) for _ in range(5)]
print d

[<Deferred at 0x1aa0c88 waiting on Deferred at 0x1aa0d50>, <Deferred at 0x1aa0dc8 waiting on Deferred at 0x1aa0e90>, <Deferred at 0x1aa0f30 waiting on Deferred at 0x1aa4030>, <Deferred at 0x1aa40d0 waiting on Deferred at 0x1aa4198>, <Deferred at 0x1aa4238 waiting on Deferred at 0x1aa4300>]

scrapy之parallel的更多相关文章

  1. twisted的task之cooperator和scrapy的parallel()函数

    def handle_spider_output(self, result, request, response, spider): if not result: return defer_succe ...

  2. scrapy item处理----cooperator和parallel()函数

    twisted的task之cooperator和scrapy的parallel()函数 本文是关于下载结果返回后调用item处理的过程实现研究. 从scrapy的结果处理说起 def handle_s ...

  3. [转]使用scrapy进行大规模抓取

    原文:http://www.yakergong.net/blog/archives/500 使用scrapy有大概半年了,算是有些经验吧,在这里跟大家讨论一下使用scrapy作为爬虫进行大规模抓取可能 ...

  4. scrapy爬虫框架setting模块解析

    平时写爬虫的时候并不需要设置setting里所有的参数,今天心血来潮,花了点时间查了一下setting模块创建后自动写入的所有参数的含义,记录一下. 模块相关说明信息 # -*- coding: ut ...

  5. 97、爬虫框架scrapy

    本篇导航: 介绍与安装 命令行工具 项目结构以及爬虫应用简介 Spiders 其它介绍 爬取亚马逊商品信息   一.介绍与安装 Scrapy一个开源和协作的框架,其最初是为了页面抓取 (更确切来说, ...

  6. Python scrapy框架

    Scrapy Scrapy是一个为了爬取网站数据,提取结构性数据而编写的应用框架. 其可以应用在数据挖掘,信息处理或存储历史数据等一系列的程序中.其最初是为了页面抓取 (更确切来说, 网络抓取 )所设 ...

  7. scrapy爬取全部知乎用户信息

    # -*- coding: utf-8 -*- # scrapy爬取全部知乎用户信息 # 1:是否遵守robbots_txt协议改为False # 2: 加入爬取所需的headers: user-ag ...

  8. Scrapy抓取Quotes to Scrape

    # 爬虫主程序quotes.py # -*- coding: utf-8 -*- import scrapy from quotetutorial.items import QuoteItem # 启 ...

  9. scrapy分布式爬虫scrapy_redis一篇

    分布式爬虫原理 首先我们来看一下scrapy的单机架构:     可以看到,scrapy单机模式,通过一个scrapy引擎通过一个调度器,将Requests队列中的request请求发给下载器,进行页 ...

随机推荐

  1. 黄聪:php7配置php.ini使其支持<? ?>

    <? ?>这种写在php配置文件里php.ini法叫short_tags,默认是不打开的,也就是,在默认配置的php里,这样写法不被认为是php脚本的,除非设置 short_open_ta ...

  2. PAT 乙级 1008 数组元素循环右移问题 (20) C++版

    1008. 数组元素循环右移问题 (20) 时间限制 400 ms 内存限制 65536 kB 代码长度限制 8000 B 判题程序 Standard 一个数组A中存有N(N>0)个整数,在不允 ...

  3. Java的大内存分页支持

    原文:http://kilik.iteye.com/blog/677253 最近在研究java的性能调优,顺手写了一个小程序来测试性能问题.这个程序用来进行矩阵乘法运算,如下: for (int i ...

  4. 非ECS阿里云安装插件,给阿里云云监控平台

    linux的init学习: https://blog.csdn.net/kunkliu/article/details/80942279 阿里云官方文档: https://help.aliyun.co ...

  5. Centos7修改系统时区timezone

    第一步:查询服务器时间 [root@localhost ~]# timedatectl Local time: Sat 2018-03-31 01:11:46 UTC Universal time: ...

  6. 主机、Docker时间与时区设置总结

    最近在使用Docker容器时,部署java程序发现时间输出不对,在修改问题时总结如下. #date [-R] #查看主机时间 #timedatectl       #查看主机时区 #tzselect ...

  7. CRM 2016 IFrame_A嵌入 EXT.net 页面 a.aspx,刷新另一IFrame_B嵌入 b.aspx gird.

    说白了就是一个IFrame页面,执行另一IFrame页面的函数.  a.aspx JS: parent.Xrm.Page.getControl("IFRAME_B").getObj ...

  8. Vue百度搜索

    <!DOCTYPE html><html lang="en"><head> <meta charset="GBK"&g ...

  9. 查看hbase中的中文

    python: print '\xE4\xB8\xAD\xE5\x9B\xBD\xE7\x9A\x84\xE4\xB8\x8A\xE5\x8D\x88'.decode('utf-8')

  10. POI实现导出Excel和模板导出Excel

    一.导出过程 1.用户请求导出 2.先访问数据库,查询需要导出的结果集 3.创建导出的Excel工作簿 4.遍历结果集,写入工作簿 5.将Excel已文件下载的形式回复给请求客户端 二.具体实现(截取 ...