Scrapy Study Notes 13: Using DownloaderMiddleware to Set Up an IP Proxy Pool and Rotate IPs

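Options for setting up an IP proxy pool and rotating IPs

Option 1:

Use free proxy IPs from a public listing site in China

 http://www.xicidaili.com

# Create a tools folder and add a new .py file in it that collects proxy IPs and ports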

import MySQLdb
import requests
from scrapy.selector import Selector

# Module-level connection and cursor shared by crawl_ips() and the GetIP class
conn = MySQLdb.connect(host="192.168.1.1", user="root", passwd="", db="databasename", charset="utf8")
cursor = conn.cursor()

def crawl_ips():
    # Crawl the free proxy listing pages and store each entry in the proxy_ip table
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"}
    for i in range(1568):
        response = requests.get("http://www.xicidaili.com/nn/{0}".format(i), headers=headers)
        selector = Selector(text=response.text)
        all_trs = selector.css("#ip_list tr")

        ip_list = []
        for tr in all_trs[1:]:  # skip the table header row
            # The title attribute holds the speed followed by "秒" (seconds)
            speed_str = tr.css(".bar::attr(title)").extract_first()
            if not speed_str:
                # Skip rows without a speed value instead of reusing a stale one
                continue
            speed = float(speed_str.split("秒")[0])

            all_texts = tr.css("td::text").extract()
            ip = all_texts[0]
            port = all_texts[1]
            proxy_type = all_texts[5]
            ip_list.append((ip, port, proxy_type, speed))

        for ip_info in ip_list:
            # Parameterized query; the stored proxy_type is hard-coded to 'HTTP'
            cursor.execute(
                "INSERT INTO proxy_ip(ip, port, speed, proxy_type) VALUES(%s, %s, %s, 'HTTP')",
                (ip_info[0], ip_info[1], ip_info[3]),
            )
            conn.commit()

class GetIP(object):
    def delete_ip(self, ip):
        # Remove a proxy that failed validation from the pool
        delete_sql = "DELETE FROM proxy_ip WHERE ip=%s"
        cursor.execute(delete_sql, (ip,))
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # Check whether the proxy works by fetching a known page through it
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)
        try:
            proxy_dict = {
                "http": proxy_url,
            }
            # The timeout keeps a dead proxy from blocking the check indefinitely
            response = requests.get(http_url, proxies=proxy_dict, timeout=10)
        except Exception:
            print("invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                print("effective ip")
                return True
            else:
                print("invalid ip and port")
                self.delete_ip(ip)
                return False

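    def get_random_ip(self):
        # Pick a random proxy from the table and validate it before returning it
        random_sql = """
            SELECT ip, port FROM proxy_ip
            ORDER BY RAND()
            LIMIT 1
            """
        while True:
            cursor.execute(random_sql)
            ip_info = cursor.fetchone()
            if ip_info is None:
                # The table is empty, so there is nothing left to try
                return None
            ip, port = ip_info[0], ip_info[1]
            # judge_ip deletes invalid proxies, so the loop ends once one
            # passes the check or the table runs out of rows
            if self.judge_ip(ip, port):
                return "http://{0}:{1}".format(ip, port)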

if __name__ == "__main__":
    get_ip = GetIP()
    print(get_ip.get_random_ip())
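The script above needs a MySQL server and live proxies, which makes the pool logic hard to try in isolation. The following is a minimal sketch of the same select, validate, and delete loop. It swaps MySQL for an in-memory SQLite table and takes the validity check as a caller-supplied function in place of judge_ip. The table layout, the make_pool and get_random_proxy names, and the placeholder addresses are all illustrative assumptions, not part of the original script.

```python
import sqlite3


def make_pool(rows):
    """Create an in-memory proxy_ip table (hypothetical schema) and fill it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proxy_ip (ip TEXT PRIMARY KEY, port TEXT)")
    conn.executemany("INSERT INTO proxy_ip(ip, port) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def get_random_proxy(conn, is_alive):
    """Return a validated proxy URL, deleting dead entries on the way.

    Returns None once the table is empty.
    """
    while True:
        # SQLite spells random ordering RANDOM(); MySQL uses RAND()
        row = conn.execute(
            "SELECT ip, port FROM proxy_ip ORDER BY RANDOM() LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        ip, port = row
        if is_alive(ip, port):
            return "http://{0}:{1}".format(ip, port)
        # Drop the dead proxy so it is never picked again
        conn.execute("DELETE FROM proxy_ip WHERE ip = ?", (ip,))
        conn.commit()


# Usage with a fake check: only 10.0.0.2 counts as alive (placeholder addresses)
pool = make_pool([("10.0.0.1", "8080"), ("10.0.0.2", "3128")])
proxy = get_random_proxy(pool, lambda ip, port: ip == "10.0.0.2")
print(proxy)
```

Passing the check in as a function keeps the pool logic testable without network access; in the real script that role is played by judge_ip.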

# Update the settings configuration

DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.JSPageMiddleware': 1,
    'ArticleSpider.middlewares.RandomProxyMiddleware': 1
}
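The settings register RandomProxyMiddleware, but the class itself is not shown. Below is a minimal sketch of what such a downloader middleware might look like, assuming it takes its proxies from the GetIP helper. The injected get_proxy callable, the tools.crawl_xici_ip module path, and the FakeRequest stand-in are assumptions made so the example runs without Scrapy or a database.

```python
class RandomProxyMiddleware(object):
    """Downloader middleware that assigns a proxy to every outgoing request."""

    def __init__(self, get_proxy):
        # get_proxy: callable returning "http://ip:port" or None
        self.get_proxy = get_proxy

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares through from_crawler when it is defined.
        # The module path below is hypothetical; adjust it to the real file.
        from tools.crawl_xici_ip import GetIP
        return cls(GetIP().get_random_ip)

    def process_request(self, request, spider):
        proxy = self.get_proxy()
        if proxy:
            # Scrapy sends the request through the proxy in request.meta["proxy"]
            request.meta["proxy"] = proxy
        # Returning None lets Scrapy continue handling the request normally
        return None


class FakeRequest(object):
    """Stand-in for scrapy.Request, used only for this demonstration."""

    def __init__(self):
        self.meta = {}


middleware = RandomProxyMiddleware(lambda: "http://10.0.0.2:3128")
request = FakeRequest()
middleware.process_request(request, spider=None)
print(request.meta["proxy"])
```

Because get_random_ip validates each proxy with a blocking HTTP request, calling it for every request slows the crawl noticeably; caching a working proxy for a while is one possible refinement.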

Option 2:

Adapt an open-source GitHub project into a proxy tool that fits your own needs

 https://github.com/aivarsk/scrapy-proxies
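As a starting point before adapting the project, the fragment below sketches the settings.py entries that scrapy-proxies expects, based on my reading of its README. The setting names and priority values should be checked against the repository, and the proxy list path is a placeholder.

```python
# Retry often, since free proxies fail frequently
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Text file with one proxy per line, e.g. http://host:port (placeholder path)
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 = a different proxy for every request
# 1 = pick one proxy from the list and use it for every request
# 2 = use a single custom proxy defined in the settings
PROXY_MODE = 0
```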

Option 3:

Use the official open-source plugin on GitHub; the service behind it is paid, but it is comparatively stable

 https://github.com/scrapy-plugins/scrapy-crawlera
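For orientation, enabling this plugin typically takes only a few settings.py entries. The fragment below is a sketch following the plugin's documented settings as I understand them; the API key is a placeholder, and the names should be verified against the installed version.

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}

CRAWLERA_ENABLED = True
# Placeholder: replace with the API key from your own account
CRAWLERA_APIKEY = '<your API key>'
```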

Option 4:

Use the Tor onion network to anonymize your IP address
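A rough sketch of this option with the requests library, assuming a local Tor daemon listening on its default SOCKS port 9050 and requests installed with SOCKS support (pip install requests[socks]). The tor_proxies helper is a hypothetical name introduced for illustration.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a requests-style proxies dict that points at a local Tor SOCKS port."""
    # socks5h makes the proxy resolve DNS names, so lookups also go through Tor
    url = "socks5h://{0}:{1}".format(host, port)
    return {"http": url, "https": url}


print(tor_proxies())

# Example call (needs a running Tor daemon and network access):
# import requests
# requests.get("http://example.com", proxies=tor_proxies(), timeout=30)
```

Scrapy's request.meta["proxy"] expects an HTTP proxy rather than a SOCKS one, so with Scrapy a local HTTP-to-SOCKS bridge such as Privoxy is commonly placed in front of Tor.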
