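Essential Knowledge for Web Crawlers: The concurrent.futures Library
As far as libraries go, I consider urllib, requests, re, BeautifulSoup, and concurrent.futures essential for writing web crawlers. This post summarizes how to use the concurrent.futures library.
Readers are advised to first read the previous post, "python究竟要不要使用多线程" (Should Python use multithreading at all?), which gives helpful background for using concurrent.futures.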
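1. Introduction to the concurrent.futures library
The Python standard library provides the threading and multiprocessing modules for running work concurrently in multiple threads or processes. Since Python 3.2, the standard library has also provided the concurrent.futures module, which implements thread pools and process pools as a higher-level abstraction over threading and multiprocessing and makes life considerably easier for Python programmers.
The concurrent.futures module provides two executor classes, ThreadPoolExecutor and ProcessPoolExecutor.
(1) First, look at the inheritance hierarchy and key attributes of the two classes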
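To make the "higher-level abstraction" concrete, here is a minimal sketch that runs the same work once with hand-managed threads and once with a thread pool. The square function and both helper names are hypothetical stand-ins for real work such as downloading a page; they are not part of the library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """Hypothetical stand-in for real work such as downloading a page."""
    return n * n


def with_threading(values):
    # Manual approach: create, start, and join every thread ourselves, and
    # collect results through a shared list because Thread has no return value.
    results = [None] * len(values)

    def worker(index, value):
        results[index] = square(value)

    threads = [threading.Thread(target=worker, args=(i, v))
               for i, v in enumerate(values)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def with_executor(values):
    # Pool approach: the executor owns the threads and hands back the results.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, values))


if __name__ == '__main__':
    print(with_threading([1, 2, 3]))
    print(with_executor([1, 2, 3]))
```

Both functions return the same list; the pool version simply leaves thread creation, joining, and result passing to the executor.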
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    # Print the method resolution order and the public attributes of each executor class
    print('ThreadPoolExecutor MRO:', ThreadPoolExecutor.__mro__)
    print('ThreadPoolExecutor attributes:', [attr for attr in dir(ThreadPoolExecutor) if not attr.startswith('_')])
    print('ProcessPoolExecutor MRO:', ProcessPoolExecutor.__mro__)
    print('ProcessPoolExecutor attributes:', [attr for attr in dir(ProcessPoolExecutor) if not attr.startswith('_')])

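Both classes inherit from futures._base.Executor and share three important methods: map, submit, and shutdown. Seen this way, the API is quite simple.
(2) Next, look at how the futures._base.Executor base class is implemented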
    class Executor(object):
        """This is an abstract base class for concrete asynchronous executors."""

        def submit(self, fn, *args, **kwargs):
            """Submits a callable to be executed with the given arguments.

            Schedules the callable to be executed as fn(*args, **kwargs) and returns
            a Future instance representing the execution of the callable.

            Returns:
                A Future representing the given call.
            """
            raise NotImplementedError()

        def map(self, fn, *iterables, timeout=None, chunksize=1):
            """Returns an iterator equivalent to map(fn, iter).

            Args:
                fn: A callable that will take as many arguments as there are
                    passed iterables.
                timeout: The maximum number of seconds to wait. If None, then there
                    is no limit on the wait time.
                chunksize: The size of the chunks the iterable will be broken into
                    before being passed to a child process. This argument is only
                    used by ProcessPoolExecutor; it is ignored by
                    ThreadPoolExecutor.

            Returns:
                An iterator equivalent to: map(func, *iterables) but the calls may
                be evaluated out-of-order.

            Raises:
                TimeoutError: If the entire result iterator could not be generated
                    before the given timeout.
                Exception: If fn(*args) raises for any values.
            """
            if timeout is not None:
                end_time = timeout + time.time()

            fs = [self.submit(fn, *args) for args in zip(*iterables)]

            # Yield must be hidden in closure so that the futures are submitted
            # before the first iterator value is required.
            def result_iterator():
                try:
                    # reverse to keep finishing order
                    fs.reverse()
                    while fs:
                        # Careful not to keep a reference to the popped future
                        if timeout is None:
                            yield fs.pop().result()
                        else:
                            yield fs.pop().result(end_time - time.time())
                finally:
                    for future in fs:
                        future.cancel()
            return result_iterator()

        def shutdown(self, wait=True):
            """Clean-up the resources associated with the Executor.

            It is safe to call this method several times. Otherwise, no other
            methods can be called after this one.

            Args:
                wait: If True then shutdown will not return until all running
                    futures have finished executing and the resources used by the
                    executor have been reclaimed.
            """
            pass

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.shutdown(wait=True)
            return False
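The base class provides map, submit, and shutdown, plus context-manager support for the with statement. The following sections explain how to use each of them.
2. The map function
Prototype: def map(self, fn, *iterables, timeout=None, chunksize=1)
This map is used the same way as Python's built-in map, except that it takes its arguments from the iterables and runs the calls asynchronously; results are still yielded in input order. timeout sets the maximum number of seconds to wait for the results.
The chunksize parameter is described in the docstring as follows:
The size of the chunks the iterable will be broken into
before being passed to a child process. This argument is only
used by ProcessPoolExecutor; it is ignored by ThreadPoolExecutor.
Example: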
    from concurrent.futures import ThreadPoolExecutor
    import time
    import requests

    def download(url):
        # Fetch the page and return only the HTTP status code
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                   'Connection': 'keep-alive',
                   'Host': 'example.webscraping.com'}
        response = requests.get(url, headers=headers)
        return response.status_code

    if __name__ == '__main__':
        urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                   'http://example.webscraping.com/places/default/view/Aland-Islands-2']

        # A thread pool matches the import above and suits I/O-bound downloads
        pool = ThreadPoolExecutor(max_workers=2)
        start = time.time()
        result = list(pool.map(download, urllist))
        end = time.time()
        print('status_code:', result)
        print('Using multiple threads -- elapsed: {:.3f}'.format(end - start))
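The ordering behaviour of map can be checked without any network access. The sketch below is only an illustration: slow_echo is a hypothetical task that sleeps and returns its argument, and both helper names are made up for this example. It shows that results come back in input order even when later inputs finish first, and that several iterables are zipped together just as with the built-in map.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_echo(delay):
    """Hypothetical task: sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def run_map(delays):
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        # Results follow the order of `delays`, not the order of completion.
        return list(pool.map(slow_echo, delays))


def run_map_two_iterables(bases, exponents):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each call receives one item from every iterable: pow(base, exponent).
        return list(pool.map(pow, bases, exponents))


if __name__ == '__main__':
    print(run_map([0.3, 0.1, 0.2]))
    print(run_map_two_iterables([2, 3], [3, 2]))
```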
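3. The submit function
Prototype: def submit(self, fn, *args, **kwargs)
fn: the function to run asynchronously
args, kwargs: the arguments passed to fn
Example (the Future class and the as_completed function used below are covered in a later section):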
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
    import time
    import requests

    def download(url):
        # Fetch the page and return only the HTTP status code
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                   'Connection': 'keep-alive',
                   'Host': 'example.webscraping.com'}
        response = requests.get(url, headers=headers)
        return response.status_code

    if __name__ == '__main__':
        urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                   'http://example.webscraping.com/places/default/view/Aland-Islands-2']

        start = time.time()
        # The worker count was lost in the source; 2 is assumed, as in the map example
        pool = ProcessPoolExecutor(max_workers=2)
        futures = [pool.submit(download, url) for url in urllist]
        # Right after submission the tasks are usually still pending or running
        for future in futures:
            print('running: %s, done: %s' % (future.running(), future.done()))
        print('#### separator ####')
        # The timeout value was lost in the source; None (no limit) is assumed
        for future in as_completed(futures, timeout=None):
            print('running: %s, done: %s' % (future.running(), future.done()))
            print(future.result())
        end = time.time()
        print('Using a process pool -- elapsed: {:.3f}'.format(end - start))
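As a smaller, network-free illustration of how submit forwards positional and keyword arguments, consider the sketch below. build_url, fail, and demo_submit are hypothetical helpers written only for this example. It also shows that an exception raised inside the task is stored on the Future rather than raised at submit time.

```python
from concurrent.futures import ThreadPoolExecutor


def build_url(host, path, scheme='http'):
    """Hypothetical helper used only for this illustration."""
    return '{}://{}/{}'.format(scheme, host, path)


def fail():
    """Hypothetical task that always raises."""
    raise ValueError('boom')


def demo_submit():
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Positional and keyword arguments are passed straight through to fn.
        ok = pool.submit(build_url, 'example.com', 'index.html', scheme='https')
        bad = pool.submit(fail)
        url = ok.result()        # blocks until the call has finished
        error = bad.exception()  # the exception object, or None on success
    return url, error


if __name__ == '__main__':
    print(demo_submit())
```

Calling bad.result() instead of bad.exception() would re-raise the ValueError in the calling thread.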
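4. The shutdown function
Prototype: def shutdown(self, wait=True)
This function releases the system resources held by the executor once the asynchronous work is finished.
Because _base.Executor implements the context-manager methods and calls shutdown inside __exit__, using a with statement means you do not have to release the resources yourself:
    with ProcessPoolExecutor(max_workers=2) as pool:  # worker count lost in the source; 2 assumed
        futures = [pool.submit(download, url) for url in urllist]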
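For comparison, here is a minimal sketch of calling shutdown explicitly instead of relying on the with statement. The function name demo_shutdown is hypothetical. It shows that shutdown(wait=True) returns only after the submitted work has finished, and that the executor rejects new work afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def demo_shutdown():
    pool = ThreadPoolExecutor(max_workers=2)
    future = pool.submit(sum, [1, 2, 3])
    pool.shutdown(wait=True)  # blocks until the submitted task is done
    finished = future.done()
    try:
        pool.submit(sum, [4, 5])  # not allowed once the pool is shut down
        rejected = False
    except RuntimeError:
        rejected = True
    return future.result(), finished, rejected


if __name__ == '__main__':
    print(demo_shutdown())
```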
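5. The Future class
submit returns a Future object. The Future class provides methods for tracking the state of a task:
future.running(): whether the task is currently running
future.done(): whether the task has finished (or been cancelled)
future.result(): returns the function's return value, blocking until it is available
    futures = [pool.submit(download, url) for url in urllist]
    for future in futures:
        print('running: %s, done: %s' % (future.running(), future.done()))
    print('#### separator ####')
    # The timeout value was lost in the source; None (no limit) is assumed
    for future in as_completed(futures, timeout=None):
        print('running: %s, done: %s' % (future.running(), future.done()))
        print(future.result())
as_completed takes two arguments: an iterable of futures and a timeout.
With the default timeout=None, it blocks until tasks finish and yields each future as it completes; it is implemented as a generator using yield.
With timeout > 0, it waits at most timeout seconds. If some tasks are still unfinished when the time is up, the iterator stops waiting and raises TimeoutError; the unfinished tasks themselves are not cancelled.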
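The two behaviours of as_completed can be sketched with a sleep-based task instead of real downloads. slow_echo, completion_order, and timed_out are hypothetical names for this illustration, and the delays are arbitrary values chosen far enough apart to make the ordering visible.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeoutError


def slow_echo(delay):
    """Hypothetical task: sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def completion_order(delays):
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(slow_echo, d) for d in delays]
        # Futures are yielded as they finish, so the fastest task comes first.
        return [f.result() for f in as_completed(futures)]


def timed_out(delays, timeout):
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(slow_echo, d) for d in delays]
        try:
            for _ in as_completed(futures, timeout=timeout):
                pass
            return False
        except FuturesTimeoutError:
            # Only the waiting stops; leaving the with block still waits
            # for the pending tasks to finish.
            return True


if __name__ == '__main__':
    print(completion_order([0.4, 0.05, 0.2]))
    print(timed_out([0.5], timeout=0.05))
```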
6. Callback functions
The Future class provides add_done_callback, which lets you attach a custom callback:
    def add_done_callback(self, fn):
        """Attaches a callable that will be called when the future finishes.

        Args:
            fn: A callable that will be called with this future as its only
                argument when the future completes or is cancelled. The callable
                will always be called by a thread in the same process in which
                it was added. If the future has already completed or been
                cancelled then the callable will be called immediately. These
                callables are called in the order that they were added.
        """
        with self._condition:
            if self._state not in [CANCELLED, CANCELLED_AND_NOTIFIED, FINISHED]:
                self._done_callbacks.append(fn)
                return
        fn(self)
Example:
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
    import time
    import requests

    def download(url):
        # Fetch the page and return only the HTTP status code
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                   'Connection': 'keep-alive',
                   'Host': 'example.webscraping.com'}
        response = requests.get(url, headers=headers)
        return response.status_code

    def callback(future):
        # Called with the finished future as its only argument
        print(future.result())

    if __name__ == '__main__':
        urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                   'http://example.webscraping.com/places/default/view/Aland-Islands-2',
                   'http://example.webscraping.com/places/default/view/Albania-3',
                   'http://example.webscraping.com/places/default/view/Algeria-4',
                   'http://example.webscraping.com/places/default/view/American-Samoa-5']

        start = time.time()
        # The worker count was lost in the source; 2 is assumed
        with ProcessPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(download, url) for url in urllist]
            for future in futures:
                print('running: %s, done: %s' % (future.running(), future.done()))
            print('#### separator ####')
            # The timeout value was lost in the source; None (no limit) is assumed
            for future in as_completed(futures, timeout=None):
                # The future is already done here, so the callback runs immediately
                future.add_done_callback(callback)
                print('running: %s, done: %s' % (future.running(), future.done()))
        end = time.time()
        print('Using a process pool -- elapsed: {:.3f}'.format(end - start))
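A callback is more commonly attached right after submit, so that it fires when the task finishes rather than after the future is already done. The following is a minimal sketch of that pattern with a thread pool; demo_callbacks and on_done are hypothetical names, and pow stands in for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def demo_callbacks(values):
    collected = []

    def on_done(future):
        # The callback receives the finished future as its only argument.
        if future.exception() is None:
            collected.append(future.result())

    with ThreadPoolExecutor(max_workers=2) as pool:
        for v in values:
            pool.submit(pow, v, 2).add_done_callback(on_done)
    # Leaving the with block waits for all tasks, so every callback has run.
    # Completion order is not guaranteed, hence the sort.
    return sorted(collected)


if __name__ == '__main__':
    print(demo_callbacks([1, 2, 3]))
```

With a thread pool the callback usually runs in the worker thread that completed the task, so it should be short and should not block.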
7. The wait function
Prototype: def wait(fs, timeout=None, return_when=ALL_COMPLETED)
    def wait(fs, timeout=None, return_when=ALL_COMPLETED):
        """Wait for the futures in the given sequence to complete.

        Args:
            fs: The sequence of Futures (possibly created by different Executors) to
                wait upon.
            timeout: The maximum number of seconds to wait. If None, then there
                is no limit on the wait time.
            return_when: Indicates when this function should return. The options
                are:

                FIRST_COMPLETED - Return when any future finishes or is
                                  cancelled.
                FIRST_EXCEPTION - Return when any future finishes by raising an
                                  exception. If no future raises an exception
                                  then it is equivalent to ALL_COMPLETED.
                ALL_COMPLETED -   Return when all futures finish or are cancelled.

        Returns:
            A named 2-tuple of sets. The first set, named 'done', contains the
            futures that completed (is finished or cancelled) before the wait
            completed. The second set, named 'not_done', contains uncompleted
            futures.
        """
        with _AcquireFutures(fs):
            done = set(f for f in fs
                       if f._state in [CANCELLED_AND_NOTIFIED, FINISHED])
            not_done = set(fs) - done

            if (return_when == FIRST_COMPLETED) and done:
                return DoneAndNotDoneFutures(done, not_done)
            elif (return_when == FIRST_EXCEPTION) and done:
                if any(f for f in done
                       if not f.cancelled() and f.exception() is not None):
                    return DoneAndNotDoneFutures(done, not_done)

            if len(done) == len(fs):
                return DoneAndNotDoneFutures(done, not_done)

            waiter = _create_and_install_waiters(fs, return_when)

        waiter.event.wait(timeout)
        for f in fs:
            with f._condition:
                f._waiters.remove(waiter)

        done.update(waiter.finished_futures)
        return DoneAndNotDoneFutures(done, set(fs) - done)
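wait returns a named 2-tuple of two sets: the futures that have completed (done) and the futures that have not (not_done).
It takes three arguments; the third, return_when, deserves the most attention:
FIRST_COMPLETED: Return when any future finishes or is cancelled.
FIRST_EXCEPTION: Return when any future finishes by raising an exception; if none does, it is equivalent to ALL_COMPLETED.
ALL_COMPLETED: Return when all futures finish or are cancelled.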
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, \
        as_completed, wait, ALL_COMPLETED, FIRST_COMPLETED, FIRST_EXCEPTION
    import time
    import requests

    def download(url):
        # Fetch the page and return only the HTTP status code
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
                   'Connection': 'keep-alive',
                   'Host': 'example.webscraping.com'}
        response = requests.get(url, headers=headers)
        return response.status_code

    if __name__ == '__main__':
        urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                   'http://example.webscraping.com/places/default/view/Aland-Islands-2',
                   'http://example.webscraping.com/places/default/view/Albania-3',
                   'http://example.webscraping.com/places/default/view/Algeria-4',
                   'http://example.webscraping.com/places/default/view/American-Samoa-5']

        start = time.time()
        # The worker count was lost in the source; 2 is assumed
        with ProcessPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(download, url) for url in urllist]
            for future in futures:
                print('running: %s, done: %s' % (future.running(), future.done()))
            print('#### separator ####')
            # The timeout value was lost in the source; None (no limit) is assumed
            completed, uncompleted = wait(futures, timeout=None, return_when=FIRST_COMPLETED)
            for cp in completed:
                print('running: %s, done: %s' % (cp.running(), cp.done()))
                print(cp.result())
        end = time.time()
        print('Using a process pool -- elapsed: {:.3f}'.format(end - start))
Output: only one completed future was returned.
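The effect of return_when can also be observed without the network. The sketch below is an illustration under assumptions: slow_echo and demo_wait are hypothetical names, and the delays are arbitrary. It counts how many futures land in the done and not_done sets for a given return_when value.

```python
import time
from concurrent.futures import (ThreadPoolExecutor, wait,
                                ALL_COMPLETED, FIRST_COMPLETED)


def slow_echo(delay):
    """Hypothetical task: sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def demo_wait(delays, return_when):
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(slow_echo, d) for d in delays]
        # wait returns a named 2-tuple of sets: (done, not_done).
        done, not_done = wait(futures, return_when=return_when)
        return len(done), len(not_done)


if __name__ == '__main__':
    print(demo_wait([0.05, 0.5, 0.5], ALL_COMPLETED))
    print(demo_wait([0.05, 0.5, 0.5], FIRST_COMPLETED))
```

With FIRST_COMPLETED the done set holds at least one future; it can hold more if several tasks finish at almost the same moment.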