并发

在编写爬虫时,性能的消耗主要在IO请求中,当单进程单线程模式下请求URL时必然会引起等待,从而使得请求整体变慢

import requests

def fetch_async(url):
response = requests.get(url)
return response url_list = ['http://www.github.com', 'http://www.bing.com'] for url in url_list:
fetch_async(url)

1.同步执行(串行)

from concurrent.futures import ThreadPoolExecutor
import requests def fetch_async(url):
response = requests.get(url)
return response url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
pool.submit(fetch_async, url)
pool.shutdown(wait=True)

2.多线程执行(线程池)

from concurrent.futures import ThreadPoolExecutor
import requests def fetch_async(url):
response = requests.get(url)
return response def callback(future):
print(future.result()) url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
v = pool.submit(fetch_async, url)
v.add_done_callback(callback)
pool.shutdown(wait=True)

2.多线程+回调函数执行(线程池+add_done_callback)

from concurrent.futures import ProcessPoolExecutor
import requests def fetch_async(url):
response = requests.get(url)
return response url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ProcessPoolExecutor(5)
for url in url_list:
pool.submit(fetch_async, url)
pool.shutdown(wait=True)

3.多进程执行(进程池)

from concurrent.futures import ProcessPoolExecutor
import requests def fetch_async(url):
response = requests.get(url)
return response def callback(future):
print(future.result()) url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ProcessPoolExecutor(5)
for url in url_list:
v = pool.submit(fetch_async, url)
v.add_done_callback(callback)
pool.shutdown(wait=True)

3.多进程+回调函数执行(进程池+add_done_callback)

异步IO

通过上述代码均可以完成对请求性能的提高,对于多线程和多进行的缺点是在IO阻塞时会造成了线程和进程的浪费,所以异步IO回事首选:

以下优先级:Twisted > gevent+requests > asyncio+aiohttp

import asyncio

@asyncio.coroutine
def func1():
print('before...func1......')
yield from asyncio.sleep(5)
print('end...func1......') tasks = [func1(), func1()] loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

1.asyncio示例1

import asyncio

@asyncio.coroutine
def fetch_async(host, url='/'):
print(host, url)
reader, writer = yield from asyncio.open_connection(host, 80) request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
request_header_content = bytes(request_header_content, encoding='utf-8') writer.write(request_header_content)
yield from writer.drain()
text = yield from reader.read()
print(host, url, text)
writer.close() tasks = [
fetch_async('www.cnblogs.com', '/wupeiqi/'),
fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
] loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

1.asyncio示例2

import aiohttp
import asyncio @asyncio.coroutine
def fetch_async(url):
print(url)
response = yield from aiohttp.request('GET', url)
# data = yield from response.read()
# print(url, data)
print(url, response)
response.close() tasks = [fetch_async('http://www.google.com/'), fetch_async('http://www.chouti.com/')] event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()

2.asyncio + aiohttp

import asyncio
import requests @asyncio.coroutine
def fetch_async(func, *args):
loop = asyncio.get_event_loop()
future = loop.run_in_executor(None, func, *args)
response = yield from future
print(response.url, response.content) tasks = [
fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
] loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

3.asyncio + requests

import gevent

import requests
from gevent import monkey monkey.patch_all() def fetch_async(method, url, req_kwargs):
print(method, url, req_kwargs)
response = requests.request(method=method, url=url, **req_kwargs)
print(response.url, response.content) # ##### 发送请求 #####
gevent.joinall([
gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
]) # ##### 发送请求(协程池控制最大协程数量) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
# pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
# pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
# pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

4.gevent + requests

import grequests

request_list = [
grequests.get('http://httpbin.org/delay/1', timeout=0.001),
grequests.get('http://fakedomain/'),
grequests.get('http://httpbin.org/status/500')
] # ##### 执行并获取响应列表 #####
# response_list = grequests.map(request_list)
# print(response_list) # ##### 执行并获取响应列表(处理异常) #####
# def exception_handler(request, exception):
# print(request,exception)
# print("Request failed") # response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)

5.grequests

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop def handle_response(response):
"""
处理返回值内容(需要维护计数器,来停止IO循环),调用 ioloop.IOLoop.current().stop()
:param response:
:return:
"""
if response.error:
print("Error:", response.error)
else:
print(response.body) def func():
url_list = [
'http://www.baidu.com',
'http://www.bing.com',
]
for url in url_list:
print(url)
http_client = AsyncHTTPClient()
http_client.fetch(HTTPRequest(url), handle_response) ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()

6.Tornado

from twisted.web.client import getPage, defer
from twisted.internet import reactor def all_done(arg):
reactor.stop() def callback(contents):
print(contents) deferred_list = [] url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
deferred = getPage(bytes(url, encoding='utf8'))
deferred.addCallback(callback)
deferred_list.append(deferred) dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done) reactor.run()

7.Twisted示例

from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse def one_done(arg):
print(arg)
reactor.stop() post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
method=bytes('POST', encoding='utf8'),
postdata=post_data,
cookies={},
headers=headers)
response.addBoth(one_done) reactor.run()

Twisted更多

以上均是Python内置以及第三方模块提供异步IO请求模块,使用简便大大提高效率,而对于异步IO请求的本质则是【非阻塞Socket】+【IO多路复用】:

import socket
import select class Request(object):
def __init__(self,sock,info):
self.sock = sock
self.info = info def fileno(self):
return self.sock.fileno() class AsyncRequest(object):
def __init__(self):
self.sock_list = []
self.conn_list = [] def add_request(self,req_info):
"""
创建请求
:req_info:{"host":"www.baidu.com","port":80,"path":"/"}
:return:
"""
sock = socket.socket()
sock.setblocking(False)
try:
sock.connect((req_info["host"],req_info["port"]))
except BlockingIOError as e:
pass
obj = Request(sock,req_info)
self.sock_list.append(obj)
self.conn_list.append(obj) def run(self):
"""
开始事件循环,检测:连接成功?数据是否返回?
:return:
"""
while True:
# select.select([socket对象])
# 可以是任何对象,对象一定要fileno方法
# 执行 对象.fileno()
# select.select([Request对象])
r,w,e = select.select(self.sock_list,self.conn_list,[],0.05)
# w,是否连接成功
for obj in w:
# 检查obj是哪个字典
data = bytes("GET {url} HTTP/1.1\r\nhost:{host}\r\n\r\n".format(url=obj.info["path"],host=obj.info["host"]),encoding="utf-8")
obj.sock.send(data)
self.conn_list.remove(obj) # 数据返回,接收到数据
for obj in r:
response = obj.sock.recv(8096)
print(obj.info["host"],response)
obj.info["callback"](response)
# for func in obj.info["callback"]:
# func(response) self.sock_list.remove(obj) # 所有的请求已返回
if not self.sock_list:
break def done1(response):
print(response) def done2(response):
print(123,response) url_list = [
{"host":"www.baidu.com","port":80,"path":"/","callback":done1,},
{"host":"www.cnblogs.com","port":80,"path":"/index.html","callback":done2,},
{"host":"www.bing.com","port":80,"path":"/","callback":done2,},
] qinbing = AsyncRequest()
for item in url_list:
qinbing.add_request(item) qinbing.run()

简单版自开发IO模块

import select
import socket
import time class AsyncTimeoutException(TimeoutError):
"""
请求超时异常类
""" def __init__(self, msg):
self.msg = msg
super(AsyncTimeoutException, self).__init__(msg) class HttpContext(object):
"""封装请求和相应的基本数据""" def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
"""
sock: 请求的客户端socket对象
host: 请求的主机名
port: 请求的端口
port: 请求的端口
method: 请求方式
url: 请求的URL
data: 请求时请求体中的数据
callback: 请求完成后的回调函数
timeout: 请求的超时时间
"""
self.sock = sock
self.callback = callback
self.host = host
self.port = port
self.method = method
self.url = url
self.data = data self.timeout = timeout self.__start_time = time.time()
self.__buffer = [] def is_timeout(self):
"""当前请求是否已经超时"""
current_time = time.time()
if (self.__start_time + self.timeout) < current_time:
return True def fileno(self):
"""请求sockect对象的文件描述符,用于select监听"""
return self.sock.fileno() def write(self, data):
"""在buffer中写入响应内容"""
self.__buffer.append(data) def finish(self, exc=None):
"""在buffer中写入响应内容完成,执行请求的回调函数"""
if not exc:
response = b''.join(self.__buffer)
self.callback(self, response, exc)
else:
self.callback(self, None, exc) def send_request_data(self):
content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
self.method.upper(), self.url, self.host, self.data,) return content.encode(encoding='utf8') class AsyncRequest(object):
def __init__(self):
self.fds = []
self.connections = [] def add_request(self, host, port, method, url, data, callback, timeout):
"""创建一个要请求"""
client = socket.socket()
client.setblocking(False)
try:
client.connect((host, port))
except BlockingIOError as e:
pass
# print('已经向远程发送连接的请求')
req = HttpContext(client, host, port, method, url, data, callback, timeout)
self.connections.append(req)
self.fds.append(req) def check_conn_timeout(self):
"""检查所有的请求,是否有已经连接超时,如果有则终止"""
timeout_list = []
for context in self.connections:
if context.is_timeout():
timeout_list.append(context)
for context in timeout_list:
context.finish(AsyncTimeoutException('请求超时'))
self.fds.remove(context)
self.connections.remove(context) def running(self):
"""事件循环,用于检测请求的socket是否已经就绪,从而执行相关操作"""
while True:
r, w, e = select.select(self.fds, self.connections, self.fds, 0.05) if not self.fds:
return for context in r:
sock = context.sock
while True:
try:
data = sock.recv(8096)
if not data:
self.fds.remove(context)
context.finish()
break
else:
context.write(data)
except BlockingIOError as e:
break
except TimeoutError as e:
self.fds.remove(context)
self.connections.remove(context)
context.finish(e)
break for context in w:
# 已经连接成功远程服务器,开始向远程发送请求数据
if context in self.fds:
data = context.send_request_data()
context.sock.sendall(data)
self.connections.remove(context) self.check_conn_timeout() if __name__ == '__main__':
def callback_func(context, response, ex):
"""
:param context: HttpContext对象,内部封装了请求相关信息
:param response: 请求响应内容
:param ex: 是否出现异常(如果有异常则值为异常对象;否则值为None)
:return:
"""
print(context, response, ex) obj = AsyncRequest()
url_list = [
{'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
'callback': callback_func},
{'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
'callback': callback_func},
{'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
'callback': callback_func},
]
for item in url_list:
print(item)
obj.add_request(**item) obj.running()

史上最牛逼的异步IO模块

尝试使用selector模块写上面的模块

# -*- coding:utf-8 -*-

import selectors
import socket class MySock():
def __init__(self,sock,host,port,method,url,data,callback,timeout=5):
self.sock = sock
self.host = host
self.port = port
self.url = url
self.data = data
self.method = method
self.callback = callback
self.timeout = timeout self.__buffer = [] def fileno(self):
return self.sock.fileno() def send_request(self):
content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
self.method.upper(), self.url, self.host, self.data,)
return content def write(self,data):
self.__buffer.append(data) def finish(self):
self.sock.close()
if self.callback:
response = b''.join(self.__buffer)
self.callback(self,response) def __str__(self):
return self.host class Test():
def __init__(self):
self.sel = selectors.DefaultSelector() def add_request(self,host,port,method,url,data="",callback=None,timeout=5):
sock = socket.socket()
sock.setblocking(False)
try:
sock.connect((host,port))
except BlockingIOError as e:
pass obj = MySock(sock,host,port,method,url,data,callback=callback,timeout=timeout)
self.sel.register(obj,selectors.EVENT_WRITE,self.crawl) def crawl(self,fileobj,mask):
data = fileobj.send_request()
fileobj.sock.sendall(bytes(data,encoding="utf-8"))
self.sel.unregister(fileobj)
self.sel.register(fileobj,selectors.EVENT_READ,self.read) def read(self,fileobj,mask):
sock = fileobj.sock
while True:
try:
data = sock.recv(8096)
if not data:
self.sel.unregister(fileobj)
fileobj.finish()
break
else:
fileobj.write(data)
except BlockingIOError as e:
break def run(self):
while True:
try:
events = self.sel.select() # 检测所有的fileobj,是否有完成wait data的
for sel_obj, mask in events:
callback = sel_obj.data # callback=accpet
callback(sel_obj.fileobj, mask) # accpet(server_fileobj,1)
except OSError as e:
# print("完成")
break if __name__ == '__main__':
def callback_func(fileobj,response):
print(response)
print(fileobj)
# print(response.decode("utf-8")) t = Test()
url_list = [
{'host': 'www.taiyingshi.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
'callback': callback_func},
{'host': 'www.msj1.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
'callback': callback_func},
]
for item in url_list:
t.add_request(**item)
t.run()

使用selector模块写,实测有效

python之爬虫_并发(串行、多线程、多进程、异步IO)的更多相关文章

  1. python网页爬虫开发之四-串行爬虫代码示例

    实现功能:代理.限速.深度.反爬 import re import queue import urllib.parse import urllib.robotparser import time fr ...

  2. 多线程/多进程/异步IO

    SOCK_STREAM :TCPSOCK_Dgram :UDP family=AF_INET: 服务器之间的通信AF_INET6: 服务器之间的通信AF_UNIX: Unix不同进程间的通信 永远遵循 ...

  3. GCD,用同步/异步函数,创建并发/串行队列

    队列  第一个参数:C语言字符串,标签 第二个参数: DISPATCH_QUEUE_CONCURRENT:并发队列 DISPATCH_QUEUE_SERIAL:串行队列 dispatch_queue_ ...

  4. python之旅:并发编程之多线程

    一 threading模块介绍 multiprocess模块的完全模仿了threading模块的接口,二者在使用层面,有很大的相似性,因而不再详细介绍 官网链接:https://docs.python ...

  5. IOS多线程知识总结/队列概念/GCD/串行/并行/同步/异步

    进程:正在进行中的程序被称为进程,负责程序运行的内存分配;每一个进程都有自己独立的虚拟内存空间: 线程:线程是进程中一个独立的执行路径(控制单元);一个进程中至少包含一条线程,即主线程. 队列:dis ...

  6. 进程与程序 并行 并发 串行 阻塞 join函数

    进程是正在运行的程序,程序是程序员编写的一对代码,也就是一堆字符,当这堆代码被系统加载到内存并执行,就有了进程. (需要注意的是:一个程序是可以产生多个程序,就像我们可以同时运行多个QQ程序一样,会形 ...

  7. python中同步、多线程、异步IO、多线程对IO密集型的影响

    目录 1.常见并发类型 2.同步版本 3.多线程 4.异步IO 5.多进程 6.总结 1.常见并发类型 I/ O密集型: 蓝色框表示程序执行工作的时间,红色框表示等待I/O操作完成的时间.此图没有按比 ...

  8. Python学习——多线程,异步IO,生成器,协程

    Python的语法是简洁的,也是难理解的. 比如yield关键字: def fun(): for i in range(5): print('test') x = yield i print('goo ...

  9. python 自动化之路 day 10 协程、异步IO、队列、缓存

    本节内容 Gevent协程 Select\Poll\Epoll异步IO与事件驱动 RabbitMQ队列 Redis\Memcached缓存 Paramiko SSH Twsited网络框架 引子 到目 ...

随机推荐

  1. 一条SQL语句执行得很慢原因有哪些

    一条SQL语句执行得很慢,要分两种情况: 1.大多数情况是正常,偶尔很慢 数据库在处理数据忙时候,更新或新增数据都会暂时记录到redo log日志,等空闲时把数据同步到磁盘.假设数据库一直很忙,更新又 ...

  2. Ural 1183 Brackets Sequence(区间DP+记忆化搜索)

    题目地址:Ural 1183 最终把这题给A了.. .拖拉了好长时间,.. 自己想还是想不出来,正好紫书上有这题. d[i][j]为输入序列从下标i到下标j最少须要加多少括号才干成为合法序列.0< ...

  3. Linux服务-mysql基础篇

    目录 1. 关系型数据库介绍 1.1 数据结构模型 1.2 RDBMS专业名词 1.3 关系型数据库的常见组件 1.4 SQL语句 2. mysql安装与配置 2.1 mysql安装 2.2 mysq ...

  4. TI DSP 6657 SRIO 简介

    目录 TI DSP 6657 SRIO 简介 SRIO 协议介绍 RapidIO 基础 TI DSP 6657 SRIO 简介 SRIO 协议介绍 TI 的 KeyStone 系列设备中实现了 Rap ...

  5. 可用的ntp服务器

    操作系统中带的:time.windows.com 和 time.nist.gov  网上查到一个公共的:cn.ntp.org.cn 以上三个连接多次才成功一次,速度不好. 在移动电视盒子上有一个配置: ...

  6. 20154327 Exp9 Web安全基础

    基础问题回答 (1)SQL注入攻击原理,如何防御 原理: 程序员在编写代码的时候,没有对用户输入数据的合法性进行判断,攻击者利用SQL命令欺骗服务器执行恶意的SQL命令,获得某些他想得知的数据. 防御 ...

  7. [arc072F]Dam-[单调队列]

    Description 传送门 Solution 首先我们肯定不能那么耿直地直接把水混合起来吧..不然分分钟完球. 那么怎么找到最优解呢?假如我们把水的体积和温度按顺序插入队列,这时我们插入第i天的水 ...

  8. 成都Uber优步司机奖励政策(4月23日)

    滴快车单单2.5倍,注册地址:http://www.udache.com/ 如何注册Uber司机(全国版最新最详细注册流程)/月入2万/不用抢单:http://www.cnblogs.com/mfry ...

  9. DB异常状态:Recovery Pending,Suspect,估计Recovery的剩余时间

    一,RECOVERY PENDING状态 今天修改了SQL Server的Service Account的密码,然后重启SQL Server的Service,发现有db处于Recovery Pendi ...

  10. TensorFlow API 汉化

    TensorFlow API 汉化 模块:tf   定义于tensorflow/__init__.py. 将所有公共TensorFlow接口引入此模块. 模块 app module:通用入口点脚本. ...