20.multi_ Scraping a blog's total read count with coroutines

# Use asyncio and aiohttp to scrape a blog's total read count (hint: first use the API to find the link to each article)

# https://www.jianshu.com/u/130f76596b02

import re

import asyncio

import aiohttp

import requests

import ssl

from lxml import etree

from asyncio import Queue

from aiosocksy import Socks5Auth

from aiosocksy.connector import ProxyConnector, ProxyClientRequest

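class Common():

    # Shared state for all coroutines: URLs waiting to be fetched.
    task_queue = Queue()

    # Currently unused; see the commented-out put() in down_and_parse_task.
    result_queue = Queue()

    # Read counts collected so far (a plain list, despite the name).
    result_queue_1 = []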

async def session_get(session, url, socks):

    auth = Socks5Auth(login='...', password='...')

    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

    timeout = aiohttp.ClientTimeout(total=20)

    response = await session.get(

        url,

        proxy=socks,

        proxy_auth=auth,

        timeout=timeout,

        headers=headers,

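        # Skip certificate verification explicitly. A bare ssl.SSLContext()
        # also disables verification, but is deprecated since Python 3.10.
        ssl=False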

    )

    return await response.text(), response.status

async def download(url):

    connector = ProxyConnector()

    socks = None

    async with aiohttp.ClientSession(

            connector=connector,

            request_class=ProxyClientRequest

    ) as session:

        ret, status = await session_get(session, url, socks)

        if 'window.location.href' in ret and len(ret) < 1000:

            url = ret.split("window.location.href='")[1].split("'")[0]

            ret, status = await session_get(session, url, socks)

        return ret, status

async def parse_html(content):

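    # Capture the digits after "views_count": directly instead of splitting on ':'.
    match = re.search(r'"views_count":(\d+)', content)

    if match is None:

        # Raised inside the caller's retry loop, so the page is fetched again.
        raise ValueError('views_count not found in page')

    return int(match.group(1))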

def get_all_article_links():

    links_list = []

    for i in range(1, 21):

        url = 'https://www.jianshu.com/u/130f76596b02?order_by=shared_at&page={}'.format(

            i)

        header = {

            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',

            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '

            '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

        response = requests.get(url,

                                headers=header,

                                timeout=5

                                )

        tree = etree.HTML(response.text)

        article_links = tree.xpath(

            '//div[@class="content"]/a[@class="title"]/@href')

        for item in article_links:

            article_link = 'https://www.jianshu.com' + item

            links_list.append(article_link)

            print(article_link)

    return links_list

async def down_and_parse_task(queue):

    while True:

        try:

            url = queue.get_nowait()

        # Only an empty queue ends the worker; other errors should surface.
        except asyncio.QueueEmpty:

            return

        error = None

        for retry_cnt in range(3):

            try:

                html, status = await download(url)

                if status != 200:

                    html, status = await download(url)

                read_num = await parse_html(html)

                print(read_num)

                # await Common.result_queue.put(read_num)

                Common.result_queue_1.append(read_num)

                break

            except Exception as e:

                error = e

                await asyncio.sleep(0.2)

                continue

        else:

            raise error

async def count_sum():

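    # Report the running total every 3 seconds. There is no blanket except
    # here: catching BaseException would also swallow task cancellation.
    while True:

        print(Common.result_queue_1)

        print('Total read count = ', sum(Common.result_queue_1))

        await asyncio.sleep(3)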

async def main():

    all_links = get_all_article_links()

    for item in set(all_links):

        await Common.task_queue.put(item)

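    # Start 10 workers that share the same task queue.
    for _ in range(10):

        loop.create_task(down_and_parse_task(Common.task_queue))

    # Start a single reporter task; the original started one per worker.
    loop.create_task(count_sum())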

if __name__ == '__main__':

    loop = asyncio.get_event_loop()

    loop.create_task(main())

    loop.run_forever()
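
The script above never exits on its own, because loop.run_forever() keeps the reporter running after the queue is empty. As a minimal sketch of an alternative, assuming Python 3.7 or later, the same queue-and-workers pattern can be driven with asyncio.gather so that it returns the total once every URL has been processed. The names parse_views, fake_fetch, worker, and total_views are hypothetical and not part of the original script; fake_fetch stands in for download() and returns canned HTML so the example runs without network access.

```python
import asyncio
import re

# Same field the original script looks for in the article page.
VIEWS_PATTERN = re.compile(r'"views_count":(\d+)')


def parse_views(html):
    """Return the first views_count found in html, or None if absent."""
    match = VIEWS_PATTERN.search(html)
    return int(match.group(1)) if match else None


async def fake_fetch(url):
    """Hypothetical stand-in for download(): canned HTML, no network."""
    await asyncio.sleep(0)  # yield to the event loop like a real request would
    return '{"views_count":%d}' % len(url)


async def worker(queue, results, fetch):
    """Drain the queue, appending one read count per URL."""
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return  # nothing left to do, so this worker finishes
        views = parse_views(await fetch(url))
        if views is not None:
            results.append(views)


async def total_views(urls, fetch=fake_fetch, concurrency=10):
    """Sum read counts over the unique URLs using a fixed number of workers."""
    queue = asyncio.Queue()
    for url in set(urls):
        queue.put_nowait(url)
    results = []
    # gather() returns once every worker has seen the queue run empty.
    await asyncio.gather(
        *(worker(queue, results, fetch) for _ in range(concurrency))
    )
    return sum(results)


if __name__ == '__main__':
    print(asyncio.run(total_views(['/p/a', '/p/bb', '/p/ccc'])))
```

To use this with real pages, pass a coroutine that returns the page HTML as the fetch argument. Retries and the status check from down_and_parse_task are left out here to keep the sketch short.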
