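Concurrency

When writing a crawler, most of the running time goes into IO requests. In single-process, single-thread mode every URL request blocks until its response arrives, so the crawl as a whole is slow.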

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
for url in url_list:
    fetch_async(url)

1. Synchronous execution (serial)

from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

2. Multithreaded execution (thread pool)
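To see why the thread pool helps, here is a minimal sketch that times both approaches. It assumes a hypothetical `fake_fetch` that uses `time.sleep` as a stand-in for the network wait of `requests.get`, so it runs without network access; the URLs and the 0.2 second delay are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for requests.get: waits, then returns a label."""
    time.sleep(delay)
    return 'response from ' + url


def run_serial(urls):
    start = time.perf_counter()
    results = [fake_fetch(url) for url in urls]
    return results, time.perf_counter() - start


def run_threaded(urls, workers=5):
    start = time.perf_counter()
    with ThreadPoolExecutor(workers) as pool:
        # map() returns the results in the same order as the input URLs
        results = list(pool.map(fake_fetch, urls))
    return results, time.perf_counter() - start


if __name__ == '__main__':
    urls = ['http://example.com/%d' % i for i in range(4)]
    serial_results, serial_time = run_serial(urls)
    threaded_results, threaded_time = run_threaded(urls)
    print('serial:   %.2fs' % serial_time)
    print('threaded: %.2fs' % threaded_time)
```

With four simulated requests the serial run has to wait for each delay in turn, while the threaded run overlaps the waits.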

from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)

2. Multithreading with a callback (thread pool + add_done_callback)
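Inside a done-callback, `future.result()` re-raises whatever exception the worker raised, so a failed request would surface in the callback. A minimal sketch of one way to handle that, assuming a hypothetical `fake_fetch` that stands in for `requests.get` and fails for URLs containing "bad" (the helper names and URLs are illustrative, not from the original code):

```python
from concurrent.futures import ThreadPoolExecutor

results = {}
errors = {}


def fake_fetch(url):
    """Hypothetical stand-in for requests.get; fails for URLs containing 'bad'."""
    if 'bad' in url:
        raise ValueError('cannot fetch ' + url)
    return 'body of ' + url


def make_callback(url):
    def callback(future):
        exc = future.exception()  # None when the task finished normally
        if exc is not None:
            errors[url] = exc
        else:
            results[url] = future.result()
    return callback


def crawl(urls):
    with ThreadPoolExecutor(5) as pool:
        for url in urls:
            future = pool.submit(fake_fetch, url)
            future.add_done_callback(make_callback(url))
    # Leaving the with block waits for all tasks, like shutdown(wait=True)


if __name__ == '__main__':
    crawl(['http://example.com/ok', 'http://example.com/bad'])
    print(results)
    print(errors)
```

Binding the URL through `make_callback` is a design choice here: the callback only receives the future, so the closure is what tells it which request finished.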

from concurrent.futures import ProcessPoolExecutor

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


if __name__ == '__main__':
    # The guard is needed wherever worker processes are spawned (e.g. on Windows)
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    pool.shutdown(wait=True)

3. Multiprocess execution (process pool)

from concurrent.futures import ProcessPoolExecutor

import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


if __name__ == '__main__':
    # The guard is needed wherever worker processes are spawned (e.g. on Windows)
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)

3. Multiprocessing with a callback (process pool + add_done_callback)

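Async IO

All of the code above improves request throughput. The drawback of multithreading and multiprocessing is that a thread or process sits idle whenever it is blocked on IO, so async IO is the preferred choice:

Order of preference: Twisted > gevent + requests > asyncio + aiohttp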

import asyncio


# @asyncio.coroutine / yield from were removed in Python 3.11;
# async def / await is the equivalent syntax
async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


async def main():
    tasks = [func1(), func1()]
    await asyncio.gather(*tasks)


asyncio.run(main())

1. asyncio example 1
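Because both coroutines wait on the same event loop, the example above takes about 5 seconds in total rather than 10. A minimal sketch that measures this with shorter delays; the `worker` and `main` names and the 0.3 second delay are assumptions chosen for illustration:

```python
import asyncio
import time


async def worker(name, delay):
    await asyncio.sleep(delay)  # hands control back to the event loop while waiting
    return name


async def main():
    start = time.perf_counter()
    # gather() runs both coroutines concurrently and keeps the argument order
    names = await asyncio.gather(worker('a', 0.3), worker('b', 0.3))
    elapsed = time.perf_counter() - start
    return names, elapsed


if __name__ == '__main__':
    names, elapsed = asyncio.run(main())
    print(names, '%.2fs' % elapsed)
```

The elapsed time should be close to one delay, not the sum of the two.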

import asyncio


async def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = await asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    await writer.drain()
    text = await reader.read()
    print(host, url, text)
    writer.close()


async def main():
    tasks = [
        fetch_async('www.cnblogs.com', '/wupeiqi/'),
        fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
    ]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())

1. asyncio example 2
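`reader.read()` in the example above returns the raw bytes of the response: a status line, the headers, a blank line, then the body. A minimal sketch of building the request and splitting such a response, assuming two hypothetical helpers (`build_request`, `parse_http_response`) and a made-up sample response; it ignores details such as repeated headers and chunked bodies:

```python
def build_request(host, path='/'):
    """Build the same kind of HTTP/1.0 GET request as the example above."""
    return ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)).encode('utf-8')


def parse_http_response(raw):
    """Split raw response bytes into (status_code, headers dict, body bytes)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    # The status line looks like: HTTP/1.0 200 OK
    status_code = int(lines[0].split(' ', 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


if __name__ == '__main__':
    sample = b'HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>'
    print(build_request('www.example.com'))
    print(parse_http_response(sample))
```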

import asyncio

import aiohttp


async def fetch_async(session, url):
    print(url)
    async with session.get(url) as response:
        # data = await response.read()
        # print(url, data)
        print(url, response)


async def main():
    # Recent aiohttp versions expect requests to go through a ClientSession
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_async(session, 'http://www.google.com/'),
                 fetch_async(session, 'http://www.chouti.com/')]
        return await asyncio.gather(*tasks)


results = asyncio.run(main())

2. asyncio + aiohttp

import asyncio

import requests


async def fetch_async(func, *args):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the loop's default thread pool
    future = loop.run_in_executor(None, func, *args)
    response = await future
    print(response.url, response.content)


async def main():
    tasks = [
        fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
        fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
    ]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())

3. asyncio + requests
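`run_in_executor` is what lets a blocking library such as requests cooperate with the event loop: the call runs in a worker thread and the coroutine awaits its result. A minimal sketch without network access, assuming a hypothetical `blocking_fetch` that sleeps in place of `requests.get`:

```python
import asyncio
import time


def blocking_fetch(url):
    """Hypothetical blocking call standing in for requests.get."""
    time.sleep(0.2)
    return 'body of ' + url


async def fetch_in_thread(url):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, blocking_fetch, url)


async def main(urls):
    return await asyncio.gather(*(fetch_in_thread(url) for url in urls))


if __name__ == '__main__':
    start = time.perf_counter()
    bodies = asyncio.run(main(['http://example.com/1', 'http://example.com/2']))
    print(bodies, '%.2fs' % (time.perf_counter() - start))
```

On Python 3.9 and later, `asyncio.to_thread(blocking_fetch, url)` is a shorter way to write the same hand-off.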

from gevent import monkey

monkey.patch_all()  # patch first, so the modules imported below use cooperative sockets

import gevent
import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### Send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### Send the requests (a pool caps the number of greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

4. gevent + requests

import grequests

request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]

# ##### Execute and get the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)

# ##### Execute and get the list of responses (handling exceptions) #####
# def exception_handler(request, exception):
#     print(request, exception)
#     print("Request failed")

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

5. grequests

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(response):
    """
    Handle the response (a counter has to be maintained in order to stop the
    IO loop by calling ioloop.IOLoop.current().stop())
    :param response:
    :return:
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)


def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        # The callback argument is accepted by Tornado releases before 6.0;
        # from 6.0 on, fetch() only returns a Future
        http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

6. Tornado

from twisted.internet import defer, reactor
from twisted.web.client import getPage  # deprecated in newer Twisted releases


def all_done(arg):
    reactor.stop()


def callback(contents):
    print(contents)


deferred_list = []

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)

reactor.run()

7. Twisted example

from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    print(arg)
    reactor.stop()


post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()

More on Twisted

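All of the above are async IO request modules provided by Python's standard library or by third-party packages. They are easy to use and greatly improve efficiency. Underneath, an async IO request comes down to [non-blocking socket] + [IO multiplexing]: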

import socket
import select


class Request(object):
    def __init__(self, sock, info):
        self.sock = sock
        self.info = info

    def fileno(self):
        return self.sock.fileno()


class AsyncRequest(object):
    def __init__(self):
        self.sock_list = []
        self.conn_list = []

    def add_request(self, req_info):
        """
        Create a request
        :req_info: {"host": "www.baidu.com", "port": 80, "path": "/"}
        :return:
        """
        sock = socket.socket()
        sock.setblocking(False)
        try:
            sock.connect((req_info["host"], req_info["port"]))
        except BlockingIOError as e:
            pass
        obj = Request(sock, req_info)
        self.sock_list.append(obj)
        self.conn_list.append(obj)

    def run(self):
        """
        Start the event loop and check: has the connection succeeded? Has data come back?
        :return:
        """
        while True:
            # select.select([socket objects])
            # Any object can be passed in, as long as it has a fileno() method
            # select calls obj.fileno()
            # select.select([Request objects])
            r, w, e = select.select(self.sock_list, self.conn_list, [], 0.05)
            # w: sockets whose connection has succeeded
            for obj in w:
                # obj.info identifies which request dict this is
                data = bytes("GET {url} HTTP/1.1\r\nhost:{host}\r\n\r\n".format(url=obj.info["path"], host=obj.info["host"]), encoding="utf-8")
                obj.sock.send(data)
                self.conn_list.remove(obj)

            # Data has come back
            for obj in r:
                response = obj.sock.recv(8096)
                print(obj.info["host"], response)
                obj.info["callback"](response)
                # for func in obj.info["callback"]:
                #     func(response)
                self.sock_list.remove(obj)

            # All requests have returned
            if not self.sock_list:
                break


def done1(response):
    print(response)


def done2(response):
    print(123, response)


url_list = [
    {"host": "www.baidu.com", "port": 80, "path": "/", "callback": done1, },
    {"host": "www.cnblogs.com", "port": 80, "path": "/index.html", "callback": done2, },
    {"host": "www.bing.com", "port": 80, "path": "/", "callback": done2, },
]

qinbing = AsyncRequest()
for item in url_list:
    qinbing.add_request(item)

qinbing.run()

A simple home-grown IO module
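The two steps of the loop above (writable means connected, readable means data has arrived) can be tried without touching the network. A minimal sketch, assuming a hypothetical one-shot local server (`start_local_server`) in place of a real website and a hypothetical `fetch_nonblocking` helper; it handles a single request and does no error handling:

```python
import select
import socket
import threading


def start_local_server(reply=b'HTTP/1.0 200 OK\r\n\r\nhello'):
    """Hypothetical one-shot server on 127.0.0.1 so the demo needs no network."""
    server = socket.socket()
    server.bind(('127.0.0.1', 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        conn.recv(4096)  # read the request
        conn.sendall(reply)
        conn.close()
        server.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


def fetch_nonblocking(host, port, path='/'):
    sock = socket.socket()
    sock.setblocking(False)
    # connect_ex returns an error code instead of raising while the connect is in progress
    sock.connect_ex((host, port))

    # Step 1: the socket becomes writable once the connection is established
    select.select([], [sock], [], 5)
    sock.send(('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)).encode('utf-8'))

    # Step 2: the socket becomes readable when data (or EOF) arrives
    chunks = []
    while True:
        readable, _, _ = select.select([sock], [], [], 5)
        if not readable:
            break  # timed out
        data = sock.recv(4096)
        if not data:
            break  # the server closed the connection
        chunks.append(data)
    sock.close()
    return b''.join(chunks)


if __name__ == '__main__':
    port = start_local_server()
    print(fetch_nonblocking('127.0.0.1', port))
```

Unlike the module above, this sketch keeps reading until the server closes the connection, so it collects the whole response rather than the first chunk.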

import select
import socket
import time


class AsyncTimeoutException(TimeoutError):
    """
    Exception raised when a request times out
    """

    def __init__(self, msg):
        self.msg = msg
        super(AsyncTimeoutException, self).__init__(msg)


class HttpContext(object):
    """Wraps the basic data of a request and its response"""

    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        """
        sock: client socket object used for the request
        host: host name to request
        port: port to request
        method: request method
        url: URL to request
        data: data sent in the request body
        callback: callback invoked when the request completes
        timeout: timeout of the request
        """
        self.sock = sock
        self.callback = callback
        self.host = host
        self.port = port
        self.method = method
        self.url = url
        self.data = data

        self.timeout = timeout

        self.__start_time = time.time()
        self.__buffer = []

    def is_timeout(self):
        """Whether the current request has timed out"""
        current_time = time.time()
        return (self.__start_time + self.timeout) < current_time

    def fileno(self):
        """File descriptor of the request socket, used by select"""
        return self.sock.fileno()

    def write(self, data):
        """Append response content to the buffer"""
        self.__buffer.append(data)

    def finish(self, exc=None):
        """The response has been fully buffered; run the request's callback"""
        if not exc:
            response = b''.join(self.__buffer)
            self.callback(self, response, exc)
        else:
            self.callback(self, None, exc)

    def send_request_data(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)

        return content.encode(encoding='utf8')


class AsyncRequest(object):
    def __init__(self):
        self.fds = []
        self.connections = []

    def add_request(self, host, port, method, url, data, callback, timeout):
        """Create a request"""
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect((host, port))
        except BlockingIOError as e:
            pass
            # print('connection request has been sent to the remote host')
        req = HttpContext(client, host, port, method, url, data, callback, timeout)
        self.connections.append(req)
        self.fds.append(req)

    def check_conn_timeout(self):
        """Check all requests for connection timeouts and terminate those that timed out"""
        timeout_list = []
        for context in self.connections:
            if context.is_timeout():
                timeout_list.append(context)
        for context in timeout_list:
            context.finish(AsyncTimeoutException('request timed out'))
            self.fds.remove(context)
            self.connections.remove(context)

    def running(self):
        """Event loop: detects when request sockets are ready and acts on them"""
        while True:
            r, w, e = select.select(self.fds, self.connections, self.fds, 0.05)

            if not self.fds:
                return

            for context in r:
                sock = context.sock
                while True:
                    try:
                        data = sock.recv(8096)
                        if not data:
                            self.fds.remove(context)
                            context.finish()
                            break
                        else:
                            context.write(data)
                    except BlockingIOError as e:
                        break
                    except TimeoutError as e:
                        self.fds.remove(context)
                        # The context may already have left the connecting list
                        if context in self.connections:
                            self.connections.remove(context)
                        context.finish(e)
                        break

            for context in w:
                # Connected to the remote server; start sending the request data
                if context in self.fds:
                    data = context.send_request_data()
                    context.sock.sendall(data)
                if context in self.connections:
                    self.connections.remove(context)

            self.check_conn_timeout()


if __name__ == '__main__':
    def callback_func(context, response, ex):
        """
        :param context: HttpContext object wrapping the request information
        :param response: content of the response
        :param ex: the exception object if an error occurred, otherwise None
        :return:
        """
        print(context, response, ex)

    obj = AsyncRequest()
    url_list = [
        {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        print(item)
        obj.add_request(**item)

    obj.running()

The most awesome async IO module ever
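The timeout check in `HttpContext.is_timeout` compares against `time.time()`, which follows the wall clock and can jump when the system time is adjusted. A minimal sketch of the same idea on top of `time.monotonic()`, which cannot go backwards; the `Deadline` class is a hypothetical helper, not part of the module above:

```python
import time


class Deadline:
    """Tracks when a request that starts now should be considered timed out."""

    def __init__(self, timeout, clock=time.monotonic):
        # clock is injectable so the class can be exercised without sleeping
        self._clock = clock
        self._expires_at = clock() + timeout

    def expired(self):
        return self._clock() >= self._expires_at

    def remaining(self):
        """Seconds left, never negative; usable as a select() timeout."""
        return max(0.0, self._expires_at - self._clock())


if __name__ == '__main__':
    deadline = Deadline(timeout=0.1)
    print(deadline.expired())
    time.sleep(0.15)
    print(deadline.expired())
```

A context object could hold one `Deadline` and return `deadline.expired()` from its timeout check.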

An attempt at rewriting the module above with the selectors module

# -*- coding:utf-8 -*-

import selectors
import socket


class MySock():
    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        self.sock = sock
        self.host = host
        self.port = port
        self.url = url
        self.data = data
        self.method = method
        self.callback = callback
        self.timeout = timeout

        self.__buffer = []

    def fileno(self):
        return self.sock.fileno()

    def send_request(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)
        return content

    def write(self, data):
        self.__buffer.append(data)

    def finish(self):
        self.sock.close()
        if self.callback:
            response = b''.join(self.__buffer)
            self.callback(self, response)

    def __str__(self):
        return self.host


class Test():
    def __init__(self):
        self.sel = selectors.DefaultSelector()

    def add_request(self, host, port, method, url, data="", callback=None, timeout=5):
        sock = socket.socket()
        sock.setblocking(False)
        try:
            sock.connect((host, port))
        except BlockingIOError as e:
            pass

        obj = MySock(sock, host, port, method, url, data, callback=callback, timeout=timeout)
        self.sel.register(obj, selectors.EVENT_WRITE, self.crawl)

    def crawl(self, fileobj, mask):
        data = fileobj.send_request()
        fileobj.sock.sendall(bytes(data, encoding="utf-8"))
        self.sel.unregister(fileobj)
        self.sel.register(fileobj, selectors.EVENT_READ, self.read)

    def read(self, fileobj, mask):
        sock = fileobj.sock
        while True:
            try:
                data = sock.recv(8096)
                if not data:
                    self.sel.unregister(fileobj)
                    fileobj.finish()
                    break
                else:
                    fileobj.write(data)
            except BlockingIOError as e:
                break

    def run(self):
        # Stop once every fileobj has been unregistered; select() on an empty
        # selector either blocks forever or raises OSError, depending on the platform
        while self.sel.get_map():
            events = self.sel.select()  # check whether any registered fileobj is ready
            for sel_obj, mask in events:
                callback = sel_obj.data  # callback is self.crawl or self.read
                callback(sel_obj.fileobj, mask)


if __name__ == '__main__':
    def callback_func(fileobj, response):
        print(response)
        print(fileobj)
        # print(response.decode("utf-8"))

    t = Test()
    url_list = [
        {'host': 'www.taiyingshi.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.msj1.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        t.add_request(**item)
    t.run()

Written with the selectors module; tested and working
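The register / select / unregister cycle can also be tried without network access. A minimal sketch, assuming `socket.socketpair()` as a stand-in for remote servers and a hypothetical `read_all` helper; it ends the loop by checking `get_map()`, which is empty once no sockets remain registered:

```python
import selectors
import socket


def read_all(sockets):
    """Read from several sockets until each reaches EOF, using one selector loop."""
    sel = selectors.DefaultSelector()
    buffers = {}
    for name, sock in sockets.items():
        sock.setblocking(False)
        buffers[name] = []
        # data carries whatever should come back with the event; here, the name
        sel.register(sock, selectors.EVENT_READ, data=name)

    # get_map() is empty once every socket has been unregistered
    while sel.get_map():
        for key, mask in sel.select(timeout=1):
            chunk = key.fileobj.recv(4096)
            if chunk:
                buffers[key.data].append(chunk)
            else:
                # An empty read means the peer closed its end
                sel.unregister(key.fileobj)
                key.fileobj.close()
    sel.close()
    return {name: b''.join(parts) for name, parts in buffers.items()}


if __name__ == '__main__':
    a_read, a_write = socket.socketpair()
    b_read, b_write = socket.socketpair()
    a_write.sendall(b'first')
    b_write.sendall(b'second')
    a_write.close()  # closing the writer produces EOF on the reader
    b_write.close()
    print(read_all({'a': a_read, 'b': b_read}))
```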
