Python 3 Thread Pools: the threadpool Module and the concurrent.futures Module

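I. Since multithreading can shorten a program's running time, is it true that the more threads, the better?

Clearly not. Creating and destroying each thread costs time and resources. Too many threads consume excessive system resources (memory and CPU overhead), and the time spent creating them is itself significant, so the cost can easily outweigh the benefit. Here are some ways to estimate the optimal number of threads:

Finding the optimal thread count:

1. Load-test with a gradually increasing number of simulated users while watching response time and QPS (queries per second, i.e. the number of requests answered per second, a measure of throughput).

2. Compute it from the formula: optimal server-side thread count = ((thread wait time + thread CPU time) / thread CPU time) * number of CPUs

3. Load-test with a single user, check its CPU consumption, extrapolate from that percentage to a candidate thread count, and load-test again. The optimal thread count is usually close to this value.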
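The formula in item 2 can be turned into a small helper. This is only a sketch: the function name `optimal_thread_count` and the sample timings are made up for illustration, and the result is a starting point for load testing rather than a final answer.

```python
def optimal_thread_count(wait_time, cpu_time, cpu_count):
    """Estimate: ((wait time + CPU time) / CPU time) * number of CPUs."""
    if cpu_time <= 0:
        raise ValueError("cpu_time must be positive")
    return max(1, round((wait_time + cpu_time) / cpu_time * cpu_count))


# Hypothetical measurement: each request waits 90 ms on I/O and uses 10 ms
# of CPU, on a machine with 4 CPUs.
estimate = optimal_thread_count(wait_time=90, cpu_time=10, cpu_count=4)
print(estimate)  # 40
```

The more time a thread spends waiting relative to computing, the more threads the formula suggests; a task that never waits gets one thread per CPU.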

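II. Why use a thread pool?

If a program's number of tasks keeps growing and it starts one thread per task, the thread count eventually gets out of control. Take a whole-site crawler: suppose there is initially a single link a, so only one thread is started. After it runs, it finds new links b, c, d, and so on in the page behind a, and these become new tasks. Creating a new thread for each new link makes the thread count explode, and it keeps growing as the crawl proceeds, completely out of control. So for programs whose task count keeps increasing, a thread pool with a fixed number of threads is necessary.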
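A minimal sketch of this point, using the standard-library `ThreadPoolExecutor`: twenty tasks are handed to a pool of four workers, and the number of distinct threads that ever run a task stays at four or fewer. The `fetch` function and the link names are placeholders, not real downloads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

seen_threads = set()
lock = threading.Lock()


def fetch(link):
    # Stand-in for downloading a page; records which worker thread ran it.
    with lock:
        seen_threads.add(threading.current_thread().name)
    time.sleep(0.01)
    return link


links = ["link-%d" % i for i in range(20)]  # hypothetical task list
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, links))

print(len(results), "tasks ran on", len(seen_threads), "threads")
```

However many tasks are queued, the thread count is capped by `max_workers`; the extra tasks simply wait their turn.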

III. How to implement a thread pool?

Here I describe three approaches:

1. The past:

Use the threadpool module, a third-party Python module that supports both Python 2 and Python 3. Usage:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import threadpool
import time


def sayhello(a):
    print("hello: " + a)
    time.sleep(2)


def main():
    seed = ["a", "b", "c"]

    # Run the tasks through a pool of 5 worker threads
    start = time.time()
    task_pool = threadpool.ThreadPool(5)
    requests = threadpool.makeRequests(sayhello, seed)
    for req in requests:
        task_pool.putRequest(req)
    task_pool.wait()  # block until every request has finished
    end = time.time()
    time_m = end - start
    print("time: " + str(time_m))

    # Run the same tasks one after another for comparison
    start1 = time.time()
    for each in seed:
        sayhello(each)
    end1 = time.time()
    print("time1: " + str(end1 - start1))


if __name__ == '__main__':
    main()

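Expected result: since each call sleeps for 2 seconds, the pooled run (time) should take about 2 seconds and the sequential loop (time1) about 6 seconds.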

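threadpool is a fairly old module. Some people still use it, but it is no longer mainstream; Python multithreading has moved on to the future (the concurrent.futures module).

2. The future:

Use the concurrent.futures module. It is part of the Python 3 standard library (since Python 3.2), and on Python 2.7 it can be installed as the third-party futures backport. Usage: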

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor
import time


def sayhello(a):
    print("hello: " + a)
    time.sleep(2)


def main():
    seed = ["a", "b", "c"]

    # Sequential baseline
    start1 = time.time()
    for each in seed:
        sayhello(each)
    end1 = time.time()
    print("time1: " + str(end1 - start1))

    # submit(): one call per task; leaving the with block waits for all tasks
    start2 = time.time()
    with ThreadPoolExecutor(3) as executor:
        for each in seed:
            executor.submit(sayhello, each)
    end2 = time.time()
    print("time2: " + str(end2 - start2))

    # map(): the function once, plus an iterable of arguments
    start3 = time.time()
    with ThreadPoolExecutor(3) as executor1:
        executor1.map(sayhello, seed)
    end3 = time.time()
    print("time3: " + str(end3 - start3))


if __name__ == '__main__':
    main()

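Expected result: the sequential loop (time1) takes about 6 seconds, while both pooled runs (time2 and time3) take about 2 seconds, because the three 2-second calls run concurrently.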

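One thing to note:

concurrent.futures.ThreadPoolExecutor offers two ways to submit tasks, the submit() function and the map() function. The main differences are:

2.1 map() returns results in the same order as the inputs. submit() returns one Future per task, and the tasks may finish, and have their results collected, in any order.

2.2 If every task calls the same function, map() is the simpler form. If the tasks call different functions, or a task may raise an exception that you want to handle individually, use submit(). With map(), an exception raised by a task is re-raised when its result is retrieved from the returned iterator, which cuts the iteration short.

2.3 The two take different arguments. submit() takes a target function together with its arguments on every call, whereas map() takes the target function once and the arguments for all calls in an iterable such as a list (a dict works too, but iterates over its keys).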
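The differences can be seen side by side in a short sketch. The `work` function is a made-up task that fails for the input 0; `as_completed` is the standard-library helper that yields futures as they finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n):
    # Hypothetical task: larger n finishes sooner; n == 0 fails.
    if n == 0:
        raise ValueError("bad input")
    time.sleep(0.05 / n)
    return n * 10


with ThreadPoolExecutor(max_workers=3) as executor:
    # map(): results come back in input order, whatever the finishing order.
    mapped = list(executor.map(work, [1, 2, 3]))

    # submit(): one Future per call, so each failure can be handled on its own.
    futures = {executor.submit(work, n): n for n in [0, 1, 2]}
    ok, failed = {}, {}
    for fut in as_completed(futures):
        n = futures[fut]
        try:
            ok[n] = fut.result()
        except ValueError as exc:
            failed[n] = str(exc)

print(mapped)
print(ok, failed)
```

Had the failing input 0 been passed to `map()`, the `ValueError` would surface while iterating over the results, and the results after it would not be retrieved.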

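3. The present?

One issue to consider: both thread pools above are prepackaged, and in the examples every task is handed over in one batch right after the pool is created. Suppose I need to add new tasks while the pool is already running (new tasks, not new threads). How can that be done?

There are two approaches:

3.1 Override functions in threadpool or concurrent.futures:

This requires reading the module's source code. You must understand how its thread pool is implemented before you can correctly override its methods to suit your needs.

3.2 Build your own thread pool:

This requires a clear understanding of how a thread pool works. Here is one I built myself: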

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import hashlib
import logging
import queue
import threading

# Helper threads from the author's own project (not shown in this article)
from utils.progress import PrintProgress
from utils.save import SaveToSqlite


class ThreadPool(object):
    def __init__(self, thread_num, args):
        self.args = args
        self.work_queue = queue.Queue()
        self.save_queue = queue.Queue()
        self.threads = []
        self.running = 0
        self.failure = 0
        self.success = 0
        self.tasks = {}
        # Protects the counters and the task table, which several threads update
        self.lock = threading.Lock()
        self.thread_name = threading.current_thread().name
        self.__init_thread_pool(thread_num)

    # Initialize the thread pool
    def __init_thread_pool(self, thread_num):
        # Download threads
        for i in range(thread_num):
            self.threads.append(WorkThread(self))
        # Progress-printing thread
        self.threads.append(PrintProgress(self))
        # Saving thread
        self.threads.append(SaveToSqlite(self, self.args.dbfile))

    # Add a download task
    def add_task(self, func, url, deep):
        # Record the task so that the same URL is never downloaded twice
        url_hash = hashlib.new('md5', url.encode("utf8")).hexdigest()
        with self.lock:
            if url_hash in self.tasks:
                return
            self.tasks[url_hash] = url
        self.work_queue.put((func, url, deep))
        logging.info("{0} add task {1}".format(self.thread_name, url))

    # Get a download task
    def get_task(self):
        # block=False raises queue.Empty at once when the queue is empty;
        # block=True would wait until an item becomes available.
        task = self.work_queue.get(block=False)
        return task

    def task_done(self):
        # Signal that one item taken from the queue has been fully processed.
        self.work_queue.task_done()

    # Start the tasks
    def start_task(self):
        for item in self.threads:
            item.start()
        logging.debug("Work start")

    def increase_success(self):
        with self.lock:
            self.success += 1

    def increase_failure(self):
        with self.lock:
            self.failure += 1

    def increase_running(self):
        with self.lock:
            self.running += 1

    def decrease_running(self):
        with self.lock:
            self.running -= 1

    def get_running(self):
        return self.running

    # Collect progress information for printing
    def get_progress_info(self):
        progress_info = {}
        progress_info['work_queue_number'] = self.work_queue.qsize()
        progress_info['tasks_number'] = len(self.tasks)
        progress_info['save_queue_number'] = self.save_queue.qsize()
        progress_info['success'] = self.success
        progress_info['failure'] = self.failure
        return progress_info

    def add_save_task(self, url, html):
        self.save_queue.put((url, html))

    def get_save_task(self):
        save_task = self.save_queue.get(block=False)
        return save_task

    def wait_all_complete(self):
        for item in self.threads:
            if item.is_alive():
                # join() blocks the calling thread until the thread item has finished
                item.join()


# WorkThread inherits from threading.Thread
class WorkThread(threading.Thread):
    # thread_pool is an instance of the ThreadPool class above
    def __init__(self, thread_pool):
        threading.Thread.__init__(self)
        self.thread_pool = thread_pool

    # The thread body: what each of thread_1, ..., thread_n does after start() is called.
    def run(self):
        print(threading.current_thread().name)
        while True:
            try:
                # get_task() takes the next task off the work queue, as (func, url, deep)
                do, url, deep = self.thread_pool.get_task()
                self.thread_pool.increase_running()
                # Decide from deep whether to collect new links
                flag_get_new_link = True
                if deep >= self.thread_pool.args.deep:
                    flag_get_new_link = False
                # do is the func passed through the work queue; it returns the
                # page content and all the new links found on that page
                html, new_link = do(url, self.thread_pool.args, flag_get_new_link)
                if html == '':
                    self.thread_pool.increase_failure()
                else:
                    self.thread_pool.increase_success()
                    # Put the html on the queue of pages waiting to be saved
                    self.thread_pool.add_save_task(url, html)
                # Add new tasks: put the links from the new page on the work
                # queue (add_task skips the ones already seen)
                if new_link:
                    for link in new_link:
                        self.thread_pool.add_task(do, link, deep + 1)
                self.thread_pool.decrease_running()
                # self.thread_pool.task_done()
            except queue.Empty:
                # Nothing queued: exit only when no other worker is still
                # running, since a running worker may yet add new tasks
                if self.thread_pool.get_running() <= 0:
                    break
            except Exception:
                self.thread_pool.decrease_running()
                logging.exception("worker thread stopped by an unexpected error")
                break
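The core idea of the pool above, workers that put newly discovered tasks back on the queue while the pool is running, can be shown in a self-contained sketch. Everything here is an illustration rather than the author's code: `MiniPool`, `expand` and the integer "links" are hypothetical names, and shutdown relies on `Queue.join()` plus `None` sentinels instead of the running counter used above.

```python
import queue
import threading


class MiniPool:
    """Fixed-size pool whose tasks may enqueue further tasks while it runs."""

    def __init__(self, thread_num):
        self.work_queue = queue.Queue()
        self.seen = set()  # de-duplicates tasks, like self.tasks above
        self.results = []
        self.lock = threading.Lock()
        self.threads = [threading.Thread(target=self._worker)
                        for _ in range(thread_num)]

    def add_task(self, func, item, deep):
        with self.lock:
            if item in self.seen:
                return
            self.seen.add(item)
        self.work_queue.put((func, item, deep))

    def _worker(self):
        while True:
            task = self.work_queue.get()  # blocks until a task or sentinel arrives
            if task is None:              # sentinel: stop this worker
                self.work_queue.task_done()
                break
            func, item, deep = task
            try:
                for new_item in func(item, deep):
                    # New tasks are added while the pool is running.
                    self.add_task(func, new_item, deep + 1)
                with self.lock:
                    self.results.append(item)
            finally:
                self.work_queue.task_done()

    def run(self):
        for t in self.threads:
            t.start()
        self.work_queue.join()  # returns once every queued task is done
        for _ in self.threads:
            self.work_queue.put(None)
        for t in self.threads:
            t.join()
        return sorted(self.results)


def expand(n, deep):
    # Hypothetical "page": item n links to 2n and 2n+1, down to depth 2.
    return [2 * n, 2 * n + 1] if deep < 2 else []


pool = MiniPool(thread_num=3)
pool.add_task(expand, 1, 0)
result = pool.run()
print(result)  # [1, 2, 3, 4, 5, 6, 7]
```

Because a task enqueues its children before calling `task_done()`, the queue's unfinished-task count cannot reach zero while work is still being discovered, so `join()` returns only when the whole crawl is finished.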
