An example of using async for in a web crawler

import asyncio
import re
import typing
from concurrent.futures import Executor, ThreadPoolExecutor
from urllib.request import urlopen

# Shared thread pool used to run blocking calls off the event loop
DEFAULT_EXECUTOR = ThreadPoolExecutor(4)

# Matches <a ... href="..."> in raw bytes and captures the href value
ANCHOR_TAG_PATTERN = re.compile(b"<a.+?href=[\"'](.*?)[\"'].*?>", re.MULTILINE | re.IGNORECASE)

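async def wrap_async(generator: typing.Generator,
                     executor: Executor = DEFAULT_EXECUTOR,
                     sentinel=None,
                     *,
                     loop: typing.Optional[asyncio.AbstractEventLoop] = None):
    """
    Wrap a blocking generator and return an asynchronous generator instead.

    :param generator: the blocking generator to consume
    :param executor: executor in which each next() call runs
    :param sentinel: value that signals exhaustion; the generator must never yield it
    :param loop: event loop to use; defaults to the running loop
    :return: an asynchronous generator yielding the same items
    """
    if not loop:
        loop = asyncio.get_running_loop()
    while True:
        # Equivalent to next(generator, sentinel), run in a worker thread
        result = await loop.run_in_executor(executor, next, generator, sentinel)
        if result is sentinel:
            # The generator is exhausted, so stop iterating
            break
        yield result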

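def follow(*links):
    """
    Lazily fetch each link with a blocking urlopen() call.

    :param links: URLs to fetch
    :return: a generator of (link, body) pairs, where body is the raw response bytes
    """
    return ((link, urlopen(link).read()) for link in links)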

def get_links(text: bytes):
    """
    Return a generator that lazily yields every link found in the given markup.

    :param text: raw page content as bytes, since ANCHOR_TAG_PATTERN is a bytes pattern
    :return: a generator of href values (bytes)
    """
    # Always take the last captured group, because that is how a lenient HTML parser
    # would interpret a malformed anchor tag
    return (match.groups()[-1]
            for match in ANCHOR_TAG_PATTERN.finditer(text)
            # Safeguard against None matches and matches with no captured href
            if hasattr(match, "groups") and len(match.groups()))

async def main(*links):
    async for current, body in wrap_async(follow(*links)):
        print("Current url:", current)
        print("Content:", body)
        async for link in wrap_async(get_links(body)):
            print(link)


asyncio.run(main("https://www.cnblogs.com/c-x-a"))
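
The listing above needs network access to run. To try the same wrapping idea offline, here is a minimal sketch that applies it to an in-memory generator. The names `aiter_in_thread`, `slow_numbers`, and `collect` are hypothetical and do not appear in the original listing. The sketch also assumes a private `object()` sentinel in place of the `None` default, so that a generator could legitimately yield `None`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Unique end-of-iteration marker (assumption: replaces the None default above)
_DONE = object()


async def aiter_in_thread(generator, executor):
    """Yield items from a blocking generator, calling next() in a worker thread."""
    loop = asyncio.get_running_loop()
    while True:
        # next(generator, _DONE) returns _DONE instead of raising StopIteration
        item = await loop.run_in_executor(executor, next, generator, _DONE)
        if item is _DONE:
            break
        yield item


def slow_numbers(count):
    """Hypothetical stand-in for follow(): blocks briefly before each item."""
    for number in range(count):
        time.sleep(0.01)  # simulates blocking I/O such as urlopen().read()
        yield number


async def collect(count):
    """Consume the wrapped generator with an async comprehension."""
    with ThreadPoolExecutor(2) as executor:
        return [item async for item in aiter_in_thread(slow_numbers(count), executor)]


result = asyncio.run(collect(3))
print(result)
```

Each `next()` call still blocks a worker thread, but the event loop stays free to run other coroutines while it waits. That is the main benefit of wrapping a blocking generator this way. Items from a single generator are still fetched one after another, not in parallel.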
