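Maintaining a dynamic proxy pool with Redis + Flask

Web crawlers frequently get their IP banned. A proxy pool is one way to manage proxy IPs and work around this.

Requirements for the proxy pool: crawl multiple sites and test asynchronously; filter on a schedule and update continuously; expose an interface so proxies are easy to retrieve.

Architecture of the proxy pool: getter, filter, proxy queue, scheduled checker.

The analysis below uses the code at https://github.com/Germey/ProxyPool/tree/master/proxypool.

Code in run.py: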

from proxypool.api import app

from proxypool.schedule import Schedule

def main():

    s = Schedule()

    s.run()

    app.run()

if __name__ == '__main__':

    main()

It first starts a scheduler and then starts the API.

Scheduler code, schedule.py:

import time

from multiprocessing import Process

import asyncio

import aiohttp

try:

    from aiohttp.errors import ProxyConnectionError,ServerDisconnectedError,ClientResponseError,ClientConnectorError

except ImportError:

    from aiohttp import ClientProxyConnectionError as ProxyConnectionError,ServerDisconnectedError,ClientResponseError,ClientConnectorError

from proxypool.db import RedisClient

from proxypool.error import ResourceDepletionError

from proxypool.getter import FreeProxyGetter

from proxypool.setting import *

from asyncio import TimeoutError

class ValidityTester(object):

    test_api = TEST_API

    def __init__(self):

        self._raw_proxies = None

        self._usable_proxies = []

    def set_raw_proxies(self, proxies):

        self._raw_proxies = proxies

        self._conn = RedisClient()

    async def test_single_proxy(self, proxy):

        """

        Test one proxy; if it is valid, put it into the Redis queue.

        """

        try:

            async with aiohttp.ClientSession() as session:

                try:

                    if isinstance(proxy, bytes):

                        proxy = proxy.decode('utf-8')

                    real_proxy = 'http://' + proxy

                    print('Testing', proxy)

                    async with session.get(self.test_api, proxy=real_proxy, timeout=get_proxy_timeout) as response:

                        if response.status == 200:

                            self._conn.put(proxy)

                            print('Valid proxy', proxy)

                except (ProxyConnectionError, TimeoutError, ValueError):

                    print('Invalid proxy', proxy)

        except (ServerDisconnectedError, ClientResponseError,ClientConnectorError) as s:

            print(s)

            pass

    def test(self):

        """

        aio test all proxies.

        """

        print('ValidityTester is working')

        try:

            loop = asyncio.get_event_loop()

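            # wrap each coroutine in a task: asyncio.wait() rejects bare coroutines on Python 3.11+

            tasks = [loop.create_task(self.test_single_proxy(proxy)) for proxy in self._raw_proxies]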

            loop.run_until_complete(asyncio.wait(tasks))

        except ValueError:

            print('Async Error')

class PoolAdder(object):

    """

    add proxy to pool

    """

    def __init__(self, threshold):

        self._threshold = threshold

        self._conn = RedisClient()

        self._tester = ValidityTester()

        self._crawler = FreeProxyGetter()

    def is_over_threshold(self):

        """

        judge if count is overflow.

        """

        if self._conn.queue_len >= self._threshold:

            return True

        else:

            return False

    def add_to_queue(self):

        print('PoolAdder is working')

        proxy_count = 0

        while not self.is_over_threshold():

            for callback_label in range(self._crawler.__CrawlFuncCount__):

                callback = self._crawler.__CrawlFunc__[callback_label]

                raw_proxies = self._crawler.get_raw_proxies(callback)

                # test crawled proxies

                self._tester.set_raw_proxies(raw_proxies)

                self._tester.test()

                proxy_count += len(raw_proxies)

                if self.is_over_threshold():

                    print('IP is enough, waiting to be used')

                    break

            if proxy_count == 0:

                raise ResourceDepletionError

class Schedule(object):

    @staticmethod

    def valid_proxy(cycle=VALID_CHECK_CYCLE):

        """

        Periodically take half of the proxies stored in redis and re-test them

        """

        conn = RedisClient()

        tester = ValidityTester()

        while True:

            print('Refreshing ip')

            count = int(0.5 * conn.queue_len)

            if count == 0:

                print('Waiting for adding')

                time.sleep(cycle)

                continue

            raw_proxies = conn.get(count)

            tester.set_raw_proxies(raw_proxies)

            tester.test()

            time.sleep(cycle)

    @staticmethod

    def check_pool(lower_threshold=POOL_LOWER_THRESHOLD,

                   upper_threshold=POOL_UPPER_THRESHOLD,

                   cycle=POOL_LEN_CHECK_CYCLE):

        """

        If the number of proxies is less than lower_threshold, add proxies

        """

        conn = RedisClient()

        adder = PoolAdder(upper_threshold)

        while True:

            if conn.queue_len < lower_threshold:

                adder.add_to_queue()

            time.sleep(cycle)

    def run(self):

        print('Ip processing running')

        valid_process = Process(target=Schedule.valid_proxy)

        check_process = Process(target=Schedule.check_pool)

        valid_process.start()

        check_process.start()

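The scheduler's run method starts two processes. valid_process runs valid_proxy, which periodically takes proxies out of the database and re-tests them; check_process runs check_pool, which crawls new proxies from the web and stores them in the database whenever the pool runs low.

valid_proxy is the periodic checker. Its cycle=VALID_CHECK_CYCLE parameter sets the interval between checks. It first creates a RedisClient() to connect to the database; that class is defined in db.py: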

import redis

from proxypool.error import PoolEmptyError

from proxypool.setting import HOST, PORT, PASSWORD

class RedisClient(object):

    def __init__(self, host=HOST, port=PORT):

        if PASSWORD:

            self._db = redis.Redis(host=host, port=port, password=PASSWORD)

        else:

            self._db = redis.Redis(host=host, port=port)

    def get(self, count=1):

        """

        get proxies from redis

        """
        # fetch a batch of proxies from the left end of the list

        proxies = self._db.lrange("proxies", 0, count - 1)

        self._db.ltrim("proxies", count, -1)

        return proxies

    def put(self, proxy):

        """

        append proxy to the right end of the list

        """

        self._db.rpush("proxies", proxy)

    def pop(self):

        """

        get proxy from right.

        """

        try:

            return self._db.rpop("proxies").decode('utf-8')

        except:

            raise PoolEmptyError

    @property

    def queue_len(self):

        """

        get length from queue.

        """

        return self._db.llen("proxies")

    def flush(self):

        """

        flush db

        """

        self._db.flushall()

if __name__ == '__main__':

    conn = RedisClient()

    print(conn.pop())
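
The queue discipline that RedisClient relies on can be modeled without a Redis server. The sketch below is an illustration only: InMemoryProxyQueue is a hypothetical stand-in, not part of the project, and it does not reproduce Redis edge cases such as count=0.

```python
from collections import deque


class InMemoryProxyQueue:
    """Hypothetical stand-in for the Redis list "proxies" (illustration only)."""

    def __init__(self):
        self._items = deque()

    def put(self, proxy):
        # Mirrors RPUSH: freshly validated proxies go to the right end.
        self._items.append(proxy)

    def get(self, count=1):
        # Mirrors LRANGE 0..count-1 followed by LTRIM count..-1:
        # take the oldest entries from the left end and remove them.
        n = min(count, len(self._items))
        return [self._items.popleft() for _ in range(n)]

    def pop(self):
        # Mirrors RPOP: return the most recently added proxy.
        if not self._items:
            raise LookupError("proxy pool is empty")
        return self._items.pop()

    @property
    def queue_len(self):
        return len(self._items)


queue = InMemoryProxyQueue()
for p in ["1.1.1.1:80", "2.2.2.2:8080", "3.3.3.3:3128", "4.4.4.4:8888"]:
    queue.put(p)

oldest_half = queue.get(queue.queue_len // 2)  # what valid_proxy would re-test
newest = queue.pop()                           # what pop() hands out
```

Because re-tested proxies are pushed back on the right, the left end always holds the entries that have gone longest without a check, while pop() serves the ones validated most recently.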

It then creates a ValidityTester(), which checks whether a proxy is usable; its test_single_proxy() method is the key to asynchronous checking.
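
To show why the asynchronous approach pays off, here is a minimal sketch that uses only the standard library. The network request is replaced by asyncio.sleep, and the rule "port 80 is valid" is a made-up placeholder; the real tester sends a request to TEST_API through aiohttp.

```python
import asyncio


async def check_proxy(proxy, usable):
    """Pretend to test one proxy; the real code sends an HTTP request via aiohttp."""
    await asyncio.sleep(0.01)  # stands in for network latency
    # Hypothetical rule for this demo: proxies on port 80 count as valid.
    if proxy.endswith(":80"):
        usable.append(proxy)


async def check_all(proxies):
    usable = []
    # gather() runs every check concurrently, so the total time is roughly
    # one round trip rather than one round trip per proxy.
    await asyncio.gather(*(check_proxy(p, usable) for p in proxies))
    return usable


candidates = ["1.1.1.1:80", "2.2.2.2:8080", "3.3.3.3:80"]
usable = asyncio.run(check_all(candidates))
```

The sketch uses asyncio.gather and asyncio.run rather than the loop.run_until_complete(asyncio.wait(...)) pattern in schedule.py; both run the checks concurrently.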

check_pool() takes three parameters: the lower and upper thresholds of the pool size, and a check interval. It calls the add_to_queue() method of PoolAdder.
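
The two thresholds work as a simple hysteresis: refilling starts only when the pool drops below the lower bound and stops once it reaches the upper bound. A minimal sketch of that rule, assuming a plain list as the pool and an iterator as the proxy source (both hypothetical; the real code crawls, validates, and stores proxies in Redis):

```python
def refill(pool, source, lower, upper):
    """Top the pool up to `upper`, but only once it has dropped below `lower`."""
    if len(pool) >= lower:
        return 0  # pool is still healthy, nothing to do this cycle
    added = 0
    while len(pool) < upper:
        proxy = next(source, None)
        if proxy is None:
            break  # the source is exhausted
        pool.append(proxy)
        added += 1
    return added


pool = ["p1", "p2", "p3"]
fresh = iter("n{}".format(i) for i in range(100))  # hypothetical proxy source

first = refill(pool, fresh, lower=3, upper=6)   # 3 is not below 3: skipped
pool.pop()                                      # one proxy gets consumed
second = refill(pool, fresh, lower=3, upper=6)  # 2 < 3: refill up to 6
```

Keeping the two bounds apart avoids triggering a crawl every time a single proxy is consumed.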

add_to_queue() uses FreeProxyGetter(), a class that scrapes IPs from websites, defined in getter.py:

from .utils import get_page

from pyquery import PyQuery as pq

import re

class ProxyMetaclass(type):

    """

        Metaclass that adds two attributes to the FreeProxyGetter class:

        __CrawlFunc__, the list of crawl function names, and

        __CrawlFuncCount__, the number of crawl functions.

    """

    def __new__(cls, name, bases, attrs):

        count = 0

        attrs['__CrawlFunc__'] = []

        for k, v in attrs.items():

            if 'crawl_' in k:

                attrs['__CrawlFunc__'].append(k)

                count += 1

        attrs['__CrawlFuncCount__'] = count

        return type.__new__(cls, name, bases, attrs)

class FreeProxyGetter(object, metaclass=ProxyMetaclass):

    def get_raw_proxies(self, callback):

        proxies = []

        print('Callback', callback)

        for proxy in getattr(self, callback)():

            print('Getting', proxy, 'from', callback)

            proxies.append(proxy)

        return proxies

    def crawl_ip181(self):

        start_url = 'http://www.ip181.com/'

        html = get_page(start_url)

        ip_adress = re.compile(r'<tr.*?>\s*<td>(.*?)</td>\s*<td>(.*?)</td>')

        # \s* matches any whitespace, including the newlines between tags

        re_ip_adress = ip_adress.findall(html)

        for adress, port in re_ip_adress:

            result = adress + ':' + port

            yield result.replace(' ', '')

    def crawl_kuaidaili(self):

        for page in range(1, 4):

            # domestic (China) high-anonymity proxies

            start_url = 'https://www.kuaidaili.com/free/inha/{}/'.format(page)

            html = get_page(start_url)

            ip_adress = re.compile(

                r'<td data-title="IP">(.*)</td>\s*<td data-title="PORT">(\w+)</td>'

            )

            re_ip_adress = ip_adress.findall(html)

            for adress, port in re_ip_adress:

                result = adress + ':' + port

                yield result.replace(' ', '')

    def crawl_xicidaili(self):

        for page in range(1, 4):

            start_url = 'http://www.xicidaili.com/wt/{}'.format(page)

            html = get_page(start_url)

            ip_adress = re.compile(

                r'<td class="country"><img src="http://fs.xicidaili.com/images/flag/cn.png" alt="Cn" /></td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>'

            )

            # \s* matches any whitespace, including the newlines between tags

            re_ip_adress = ip_adress.findall(html)

            for adress, port in re_ip_adress:

                result = adress + ':' + port

                yield result.replace(' ', '')

    def crawl_daili66(self, page_count=4):

        start_url = 'http://www.66ip.cn/{}.html'

        urls = [start_url.format(page) for page in range(1, page_count + 1)]

        for url in urls:

            print('Crawling', url)

            html = get_page(url)

            if html:

                doc = pq(html)

                trs = doc('.containerbox table tr:gt(0)').items()

                for tr in trs:

                    ip = tr.find('td:nth-child(1)').text()

                    port = tr.find('td:nth-child(2)').text()

                    yield ':'.join([ip, port])

    def crawl_data5u(self):

        for i in ['gngn', 'gnpt']:

            start_url = 'http://www.data5u.com/free/{}/index.shtml'.format(i)

            html = get_page(start_url)

            ip_adress = re.compile(

                r' <ul class="l2">\s*<span><li>(.*?)</li></span>\s*<span style="width: 100px;"><li class=".*">(.*?)</li></span>'

            )

            # \s* matches any whitespace, including the newlines between tags

            re_ip_adress = ip_adress.findall(html)

            for adress, port in re_ip_adress:

                result = adress + ':' + port

                yield result.replace(' ', '')

    def crawl_kxdaili(self):

        for i in range(1, 4):

            start_url = 'http://www.kxdaili.com/ipList/{}.html#ip'.format(i)

            html = get_page(start_url)

            ip_adress = re.compile(r'<tr.*?>\s*<td>(.*?)</td>\s*<td>(.*?)</td>')

            # \s* matches any whitespace, including the newlines between tags

            re_ip_adress = ip_adress.findall(html)

            for adress, port in re_ip_adress:

                result = adress + ':' + port

                yield result.replace(' ', '')

    def crawl_premproxy(self):

        for i in ['China-01', 'China-02', 'China-03', 'China-04', 'Taiwan-01']:

            start_url = 'https://premproxy.com/proxy-by-country/{}.htm'.format(

                i)

            html = get_page(start_url)

            if html:

                ip_adress = re.compile('<td data-label="IP:port ">(.*?)</td>')

                re_ip_adress = ip_adress.findall(html)

                for adress_port in re_ip_adress:

                    yield adress_port.replace(' ', '')

    def crawl_xroxy(self):

        for i in ['CN', 'TW']:

            start_url = 'http://www.xroxy.com/proxylist.php?country={}'.format(

                i)

            html = get_page(start_url)

            if html:

                ip_adress1 = re.compile(

                    r"title='View this Proxy details'>\s*(.*).*")

                re_ip_adress1 = ip_adress1.findall(html)

                ip_adress2 = re.compile(

                    "title='Select proxies with port number .*'>(.*)</a>")

                re_ip_adress2 = ip_adress2.findall(html)

                for adress, port in zip(re_ip_adress1, re_ip_adress2):

                    adress_port = adress + ':' + port

                    yield adress_port.replace(' ', '')
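
The metaclass trick above is easier to see on a small example. The sketch below uses hypothetical names (CrawlerMeta, DemoGetter, crawl_funcs) and hard-coded proxies instead of real sites. It also matches method names with startswith rather than the substring test in getter.py.

```python
class CrawlerMeta(type):
    """Collect the names of all crawl_* methods when the class is created."""

    def __new__(mcls, name, bases, attrs):
        # The list is built before it is stored, so attrs is not
        # modified while it is being iterated.
        attrs["crawl_funcs"] = [k for k in attrs if k.startswith("crawl_")]
        return super().__new__(mcls, name, bases, attrs)


class DemoGetter(metaclass=CrawlerMeta):
    def crawl_site_a(self):
        yield "1.1.1.1:80"

    def crawl_site_b(self):
        yield "2.2.2.2:8080"
        yield "3.3.3.3:3128"

    def helper(self):
        return "not a crawler"

    def get_raw_proxies(self, callback):
        # getattr() looks the method up by name and calls it.
        return list(getattr(self, callback)())


getter = DemoGetter()
collected = {name: getter.get_raw_proxies(name) for name in getter.crawl_funcs}
```

With this design, supporting a new proxy site only requires adding another crawl_ method; the scheduler picks it up through the collected list without any other change.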

See the GitHub repository for detailed usage.
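
As a rough idea of how a crawler might consume the pool, here is a standard-library sketch. The endpoint http://127.0.0.1:5000/get and its plain-text host:port response are assumptions, not taken from the code shown here; check the repository for the actual routes. fetch_proxy and build_opener_with_proxy are hypothetical helper names.

```python
import urllib.request

# Assumed endpoint: adjust to wherever the Flask app is actually served.
PROXY_POOL_URL = "http://127.0.0.1:5000/get"


def fetch_proxy(url=PROXY_POOL_URL, timeout=5):
    """Ask the pool API for one proxy; return None if the API is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip() or None
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return None


def build_opener_with_proxy(proxy):
    """Return a urllib opener that routes HTTP(S) traffic through `proxy` (host:port)."""
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    return urllib.request.build_opener(handler)
```

A crawler would call fetch_proxy(), and if it returns a value, pass it to build_opener_with_proxy() and use the returned opener's open() method for its requests.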
