As far as libraries go, I consider the essential libraries for web crawling to be urllib, requests, re, BeautifulSoup, and concurrent.futures. This post summarizes how to use the concurrent.futures library.
I suggest readers first read the previous post, "python究竟要不要使用多线程" (does Python really need multithreading?), which will help in understanding the concurrent.futures library.

1. Introduction to the concurrent.futures library

  The Python standard library provides the threading and multiprocessing modules for writing multi-threaded and multi-process code. Starting with Python 3.2, the standard library also offers the concurrent.futures module, which implements thread pools and process pools as a high-level abstraction over threading and multiprocessing, making life considerably easier for Python programmers.

  The concurrent.futures module provides two classes: ThreadPoolExecutor and ProcessPoolExecutor.

(1) First, look at the inheritance hierarchy and key attributes of the two classes:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

print('ThreadPoolExecutor MRO:', ThreadPoolExecutor.__mro__)
print('ThreadPoolExecutor attributes:', [attr for attr in dir(ThreadPoolExecutor) if not attr.startswith('_')])
print('ProcessPoolExecutor MRO:', ProcessPoolExecutor.__mro__)
print('ProcessPoolExecutor attributes:', [attr for attr in dir(ProcessPoolExecutor) if not attr.startswith('_')])

  Both inherit from the futures._base.Executor class and share three important methods: map, submit, and shutdown. Seen this way, the API is quite simple.

(2) Next, look at the implementation of the futures._base.Executor base class:

class Executor(object):
    """This is an abstract base class for concrete asynchronous executors."""

    def submit(self, fn, *args, **kwargs):
        """Submits a callable to be executed with the given arguments.

        Schedules the callable to be executed as fn(*args, **kwargs) and returns
        a Future instance representing the execution of the callable.

        Returns:
            A Future representing the given call.
        """
        raise NotImplementedError()

    def map(self, fn, *iterables, timeout=None, chunksize=1):
        """Returns an iterator equivalent to map(fn, iter).

        Args:
            fn: A callable that will take as many arguments as there are
                passed iterables.
            timeout: The maximum number of seconds to wait. If None, then there
                is no limit on the wait time.
            chunksize: The size of the chunks the iterable will be broken into
                before being passed to a child process. This argument is only
                used by ProcessPoolExecutor; it is ignored by
                ThreadPoolExecutor.

        Returns:
            An iterator equivalent to: map(func, *iterables) but the calls may
            be evaluated out-of-order.

        Raises:
            TimeoutError: If the entire result iterator could not be generated
                before the given timeout.
            Exception: If fn(*args) raises for any values.
        """
        if timeout is not None:
            end_time = timeout + time.time()

        fs = [self.submit(fn, *args) for args in zip(*iterables)]

        # Yield must be hidden in closure so that the futures are submitted
        # before the first iterator value is required.
        def result_iterator():
            try:
                # reverse to keep finishing order
                fs.reverse()
                while fs:
                    # Careful not to keep a reference to the popped future
                    if timeout is None:
                        yield fs.pop().result()
                    else:
                        yield fs.pop().result(end_time - time.time())
            finally:
                for future in fs:
                    future.cancel()
        return result_iterator()

    def shutdown(self, wait=True):
        """Clean-up the resources associated with the Executor.

        It is safe to call this method several times. Otherwise, no other
        methods can be called after this one.

        Args:
            wait: If True then shutdown will not return until all running
                futures have finished executing and the resources used by the
                executor have been reclaimed.
        """
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.shutdown(wait=True)
        return False

  So the executor provides map, submit, shutdown, and context-manager (with) support. The sections below explain how to use each of them.

2. The map function

  Function signature: def map(self, fn, *iterables, timeout=None, chunksize=1)

  This map behaves like Python's built-in map, except that fn is executed asynchronously as arguments are taken from the iterables; timeout sets the maximum number of seconds to wait for results.

  Understanding the chunksize parameter (from the docstring; see the sketch after the quote):

The size of the chunks the iterable will be broken into
before being passed to a child process. This argument is only
used by ProcessPoolExecutor; it is ignored by ThreadPoolExecutor.
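
  To make that concrete, here is a minimal sketch of passing chunksize to ProcessPoolExecutor.map. The square function and the chunk size of 100 are hypothetical, not from the original post:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as pool:
        # With chunksize=100, each worker process receives batches of 100
        # items instead of one item per task, reducing the serialization
        # overhead for large iterables. ThreadPoolExecutor ignores chunksize.
        results = list(pool.map(square, range(1000), chunksize=100))
    print(results[:5])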

  Example:

from concurrent.futures import ThreadPoolExecutor
import time
import requests

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
               'Connection': 'keep-alive',
               'Host': 'example.webscraping.com'}
    response = requests.get(url, headers=headers)
    return response.status_code

if __name__ == '__main__':
    urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
               'http://example.webscraping.com/places/default/view/Aland-Islands-2']
    pool = ThreadPoolExecutor(max_workers=2)
    start = time.time()
    result = list(pool.map(download, urllist))
    end = time.time()
    print('status_code:', result)
    print('with thread pool -- elapsed: {:.3f}s'.format(end - start))

3. The submit function

  Function signature: def submit(self, fn, *args, **kwargs)

  fn: the callable to be executed asynchronously

  args, kwargs: the arguments passed to fn

  Note: the as_completed function used on the Future objects in the example below is covered later.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import time
import requests

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
               'Connection': 'keep-alive',
               'Host': 'example.webscraping.com'}
    response = requests.get(url, headers=headers)
    return response.status_code

if __name__ == '__main__':
    urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
               'http://example.webscraping.com/places/default/view/Aland-Islands-2']
    start = time.time()
    pool = ProcessPoolExecutor(max_workers=2)
    futures = [pool.submit(download, url) for url in urllist]
    for future in futures:
        print('running: %s, done: %s' % (future.running(), future.done()))
    print('#### divider ####')
    for future in as_completed(futures):
        print('running: %s, done: %s' % (future.running(), future.done()))
        print(future.result())
    end = time.time()
    print('elapsed: {:.3f}s'.format(end - start))

  Output: (screenshot of the run omitted)

4. The shutdown function

  Function signature: def shutdown(self, wait=True)

  This function releases the system resources held by the executor once asynchronous execution is finished.

  Since _base.Executor implements the context-manager protocol (shutdown is wrapped inside __exit__), using a with statement means you never need to release the resources yourself:

with ProcessPoolExecutor(max_workers=2) as pool:
    ...  # shutdown(wait=True) is called automatically when the block exits
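
  For comparison, a minimal sketch of managing the pool by hand with an explicit shutdown call; the task function and worker count here are hypothetical:

from concurrent.futures import ThreadPoolExecutor

def task(n):
    return n * 2

pool = ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(task, n) for n in range(4)]
# wait=True blocks until all pending futures finish, then frees the workers.
pool.shutdown(wait=True)
print([f.result() for f in futures])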

5. The Future class

  The submit function returns a Future object; the Future class provides methods for tracking the state of a task:

  future.running(): whether the task is currently executing

  future.done(): whether the task has finished

  future.result(): the return value of the completed task

futures = [pool.submit(download, url) for url in urllist]
for future in futures:
    print('running: %s, done: %s' % (future.running(), future.done()))
print('#### divider ####')
for future in as_completed(futures):
    print('running: %s, done: %s' % (future.running(), future.done()))
    print(future.result())

  The as_completed function takes an iterable of futures and a timeout parameter.

  With the default timeout=None, it blocks until tasks finish and yields each future as it completes (the iterator is implemented with yield).

  With timeout > 0, it waits up to timeout seconds; if some tasks are still unfinished when the deadline passes, iteration stops and a TimeoutError is raised, as sketched below.
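
  A minimal sketch of handling that TimeoutError; the slow_task function and the 2-second deadline are hypothetical:

from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError
import time

def slow_task(seconds):
    time.sleep(seconds)
    return seconds

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(slow_task, s) for s in (1, 5)]
    try:
        # The 5-second task cannot finish within the 2-second deadline, so
        # iteration raises TimeoutError after yielding the fast future.
        for future in as_completed(futures, timeout=2):
            print('finished:', future.result())
    except TimeoutError as exc:
        print('unfinished tasks remain:', exc)
    # Note: leaving the with block still waits for the slow task, because
    # shutdown(wait=True) runs in __exit__.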

6. Callback functions

  The Future class provides the add_done_callback function for attaching a custom callback:

def add_done_callback(self, fn):
    """Attaches a callable that will be called when the future finishes.

    Args:
        fn: A callable that will be called with this future as its only
            argument when the future completes or is cancelled. The callable
            will always be called by a thread in the same process in which
            it was added. If the future has already completed or been
            cancelled then the callable will be called immediately. These
            callables are called in the order that they were added.
    """
    with self._condition:
        if self._state not in [CANCELLED, CANCELLED_AND_NOTIFIED, FINISHED]:
            self._done_callbacks.append(fn)
            return
    fn(self)

  Example:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import time
import requests

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
               'Connection': 'keep-alive',
               'Host': 'example.webscraping.com'}
    response = requests.get(url, headers=headers)
    return response.status_code

def callback(future):
    print(future.result())

if __name__ == '__main__':
    urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
               'http://example.webscraping.com/places/default/view/Aland-Islands-2',
               'http://example.webscraping.com/places/default/view/Albania-3',
               'http://example.webscraping.com/places/default/view/Algeria-4',
               'http://example.webscraping.com/places/default/view/American-Samoa-5']
    start = time.time()
    with ProcessPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(download, url) for url in urllist]
        for future in futures:
            print('running: %s, done: %s' % (future.running(), future.done()))
        print('#### divider ####')
        for future in as_completed(futures):
            # The future is already finished here, so the callback fires immediately.
            future.add_done_callback(callback)
            print('running: %s, done: %s' % (future.running(), future.done()))
    end = time.time()
    print('elapsed: {:.3f}s'.format(end - start))

7. The wait function

  Function signature: def wait(fs, timeout=None, return_when=ALL_COMPLETED)

def wait(fs, timeout=None, return_when=ALL_COMPLETED):
    """Wait for the futures in the given sequence to complete.

    Args:
        fs: The sequence of Futures (possibly created by different Executors) to
            wait upon.
        timeout: The maximum number of seconds to wait. If None, then there
            is no limit on the wait time.
        return_when: Indicates when this function should return. The options
            are:

            FIRST_COMPLETED - Return when any future finishes or is
                              cancelled.
            FIRST_EXCEPTION - Return when any future finishes by raising an
                              exception. If no future raises an exception
                              then it is equivalent to ALL_COMPLETED.
            ALL_COMPLETED - Return when all futures finish or are cancelled.

    Returns:
        A named 2-tuple of sets. The first set, named 'done', contains the
        futures that completed (is finished or cancelled) before the wait
        completed. The second set, named 'not_done', contains uncompleted
        futures.
    """
    with _AcquireFutures(fs):
        done = set(f for f in fs
                   if f._state in [CANCELLED_AND_NOTIFIED, FINISHED])
        not_done = set(fs) - done

        if (return_when == FIRST_COMPLETED) and done:
            return DoneAndNotDoneFutures(done, not_done)
        elif (return_when == FIRST_EXCEPTION) and done:
            if any(f for f in done
                   if not f.cancelled() and f.exception() is not None):
                return DoneAndNotDoneFutures(done, not_done)

        if len(done) == len(fs):
            return DoneAndNotDoneFutures(done, not_done)

        waiter = _create_and_install_waiters(fs, return_when)

    waiter.event.wait(timeout)
    for f in fs:
        with f._condition:
            f._waiters.remove(waiter)

    done.update(waiter.finished_futures)
    return DoneAndNotDoneFutures(done, set(fs) - done)

  The wait function returns a named 2-tuple of sets: one set, 'done', holds the futures that completed (finished or cancelled), and the other, 'not_done', holds the uncompleted futures.

  It takes three arguments; the third deserves a closer look:

  FIRST_COMPLETED: return when any future finishes or is cancelled.

  FIRST_EXCEPTION: return when any future finishes by raising an exception; if no future raises an exception, this is equivalent to ALL_COMPLETED.

  ALL_COMPLETED: return when all futures finish or are cancelled.

  Example:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, \
    as_completed, wait, ALL_COMPLETED, FIRST_COMPLETED, FIRST_EXCEPTION
import time
import requests

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
               'Connection': 'keep-alive',
               'Host': 'example.webscraping.com'}
    response = requests.get(url, headers=headers)
    return response.status_code

if __name__ == '__main__':
    urllist = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
               'http://example.webscraping.com/places/default/view/Aland-Islands-2',
               'http://example.webscraping.com/places/default/view/Albania-3',
               'http://example.webscraping.com/places/default/view/Algeria-4',
               'http://example.webscraping.com/places/default/view/American-Samoa-5']
    start = time.time()
    with ProcessPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(download, url) for url in urllist]
        for future in futures:
            print('running: %s, done: %s' % (future.running(), future.done()))
        print('#### divider ####')
        completed, uncompleted = wait(futures, return_when=FIRST_COMPLETED)
        for cp in completed:
            print('running: %s, done: %s' % (cp.running(), cp.done()))
            print(cp.result())
    end = time.time()
    print('elapsed: {:.3f}s'.format(end - start))

  Output: only the single future that had completed was returned (screenshot omitted).
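
  For contrast, a minimal sketch showing that return_when=ALL_COMPLETED only returns once every future is done; the nap function and sleep durations are hypothetical:

from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import time

def nap(seconds):
    time.sleep(seconds)
    return seconds

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(nap, s) for s in (0.1, 0.2, 0.3)]
    done, not_done = wait(futures, return_when=ALL_COMPLETED)
    # All three futures end up in the 'done' set; 'not_done' is empty.
    print(len(done), len(not_done))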
