[python3] How to build a proxy IP pool for web crawlers

I. Why build a proxy IP pool for crawlers

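Among the many anti-crawling measures that websites use, one limits access by the request frequency of each IP: when the number of requests from a single IP reaches a certain threshold within a given period, that IP is blacklisted and denied access for a while.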
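To make the mechanism concrete, here is a minimal sketch of how a site might count requests per IP and block an IP that crosses the threshold. The class name `IpRateLimiter`, its parameters, and the sliding-window approach are all assumptions made for illustration; real sites may use very different rules.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical server-side limiter: block an IP that exceeds a request threshold."""

    def __init__(self, max_requests, window_seconds, ban_seconds):
        self.max_requests = max_requests      # allowed requests per window
        self.window_seconds = window_seconds  # length of the sliding window
        self.ban_seconds = ban_seconds        # how long a blocked IP stays blocked
        self.hits = defaultdict(deque)        # ip -> timestamps of recent requests
        self.banned_until = {}                # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Return True if the request is served, False if the IP is blocked."""
        now = time.monotonic() if now is None else now
        if self.banned_until.get(ip, float("-inf")) > now:
            return False
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            # Threshold exceeded: start a ban and forget the old hits.
            self.banned_until[ip] = now + self.ban_seconds
            recent.clear()
            return False
        return True
```

With `IpRateLimiter(3, 10, 60)`, the fourth request from one IP inside ten seconds is refused, and so is every later request from that IP for the next minute, while other IPs are unaffected.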

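When this happens, you can respond either by lowering the crawler's request frequency or by changing its IP. The second option requires a pool of working proxy IPs that the crawler can switch between while it runs.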
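The first remedy, lowering the request frequency, can be sketched as a small helper that enforces a minimum pause between requests. The `Throttle` class and its `min_interval` and `jitter` parameters are hypothetical names chosen for this example, not part of the original script.

```python
import random
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, jitter=0.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds that must pass between requests
        self.jitter = jitter              # upper bound of an extra random delay
        self._clock = clock               # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until the interval (plus random jitter) has passed; return the delay used."""
        delay = 0.0
        if self._last is not None:
            target = self.min_interval + random.uniform(0, self.jitter)
            elapsed = self._clock() - self._last
            delay = max(0.0, target - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

A crawler would call `throttle.wait()` right before each `requests.get(...)`. The random jitter makes the request pattern a little less regular, which may or may not matter for a given site.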

II. How to build a proxy IP pool for crawlers

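Approach:

1. Find a website that lists free proxy IPs (for example, Xici Daili)

2. Scrape the IPs (ordinary scraping with requests + BeautifulSoup)

3. Validate each IP (send a request to a given URL through the scraped IP and check whether the returned status code is 200)

4. Record the IPs (write them to a file)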
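Step 2 can be tried offline before touching a real site. The sketch below extracts `ip:port` strings from a small sample table. Two things are assumptions: the sample HTML (rows marked `class="odd"`, with the IP in the second cell and the port in the third) is made up to mirror the layout the script expects, and the standard library's `HTMLParser` is swapped in for BeautifulSoup so the example runs without third-party packages.

```python
from html.parser import HTMLParser


class ProxyTableParser(HTMLParser):
    """Collect 'ip:port' strings from <tr class="odd"> rows (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.proxies = []
        self._in_row = False
        self._in_cell = False
        self._cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and "odd" in (dict(attrs).get("class") or "").split():
            self._in_row = True
            self._cells = []
        elif tag == "td" and self._in_row:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._in_row:
            self._in_row = False
            # Cell 1 holds the IP and cell 2 the port in the assumed layout.
            if len(self._cells) > 2:
                self.proxies.append(self._cells[1].strip() + ":" + self._cells[2].strip())


# Made-up sample page; the addresses are placeholders, not real proxies.
SAMPLE_HTML = """
<table>
  <tr><th>Flag</th><th>IP</th><th>Port</th></tr>
  <tr class="odd"><td></td><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
  <tr class=""><td></td><td>10.0.0.2</td><td>3128</td><td>HTTP</td></tr>
  <tr class="odd"><td></td><td>10.0.0.3</td><td>80</td><td>HTTPS</td></tr>
</table>
"""

parser = ProxyTableParser()
parser.feed(SAMPLE_HTML)
print(parser.proxies)  # ['10.0.0.1:8080', '10.0.0.3:80']
```

Only rows whose class is `odd` are collected, so the second data row is skipped here, just as it would be by a `find_all('tr', class_='odd')` query.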

The code is as follows:

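#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import datetime
import random
import threading

import requests
from bs4 import BeautifulSoup

"""
1. Scrape proxy IPs from the Xici Daili (xicidaili) website
2. Validate each scraped IP against a given target URL
3. Save the valid IPs to the given path
"""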

# ------------------------------------------------------ File handling --------------------------------------------------------

# Append one line of text to the file
def write(path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')

# Empty the file
def truncatefile(path):
    with open(path, 'w', encoding='utf-8'):
        pass  # opening in 'w' mode already truncates the file

# Read the file and return its lines with surrounding whitespace stripped
def read(path):
    with open(path, 'r', encoding='utf-8') as f:
        return [s.strip() for s in f]

# ----------------------------------------------------------------------------------------------------------------------

# Compute the elapsed time, formatted as HH:MM:SS
def gettimediff(start, end):
    seconds = int((end - start).total_seconds())  # .seconds alone would drop whole days
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    return "%02d:%02d:%02d" % (h, m, s)

# ----------------------------------------------------------------------------------------------------------------------

# Return request headers with a randomly chosen User-Agent
def getheaders():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    user_agent = random.choice(user_agent_list)
    return {'User-Agent': user_agent}

# ----------------------------------------------------- Check whether an IP is usable ----------------------------------------------------

def checkip(targeturl, ip):
    headers = getheaders()  # custom request headers
    proxies = {"http": "http://" + ip, "https": "http://" + ip}  # route both schemes through the proxy
    try:
        response = requests.get(url=targeturl, proxies=proxies, headers=headers, timeout=5)
        return response.status_code == 200
    except requests.RequestException:  # timeout, refused connection, proxy error, ...
        return False

# ------------------------------------------------------- Fetching proxies ----------------------------------------------------

# Free proxy source: XiciDaili
def findip(proxy_type, pagenum, targeturl, path):  # proxy type, page number, target URL, path of the output file
    urls = {'1': 'http://www.xicidaili.com/nt/',  # xicidaili domestic regular proxies
            '2': 'http://www.xicidaili.com/nn/',  # xicidaili domestic high-anonymity proxies
            '3': 'http://www.xicidaili.com/wn/',  # xicidaili domestic HTTPS proxies
            '4': 'http://www.xicidaili.com/wt/'}  # xicidaili foreign HTTP proxies
    url = urls[str(proxy_type)] + str(pagenum)  # build the page URL
    headers = getheaders()  # custom request headers
    html = requests.get(url=url, headers=headers, timeout=5).text
    soup = BeautifulSoup(html, 'lxml')
    rows = soup.find_all('tr', class_='odd')
    for row in rows:
        cells = row.find_all('td')
        ip = cells[1].text + ':' + cells[2].text  # cell 1 is the IP, cell 2 the port
        if checkip(targeturl, ip):
            write(path=path, text=ip)
            print(ip)

# ----------------------------------------------------- Multithreaded entry point ---------------------------------------------------

def getip(targeturl, path):
    truncatefile(path)  # empty the file before crawling
    start = datetime.datetime.now()  # start time
    threads = []
    for proxy_type in range(4):  # four proxy types, first three pages of each: 12 threads in total
        for pagenum in range(3):
            t = threading.Thread(target=findip, args=(proxy_type + 1, pagenum + 1, targeturl, path))
            threads.append(t)
    print('Start crawling proxy IPs')
    for s in threads:  # start all threads
        s.start()
    for e in threads:  # wait for all threads to finish
        e.join()
    print('Crawling finished')
    end = datetime.datetime.now()  # end time
    diff = gettimediff(start, end)  # elapsed time
    ips = read(path)  # read back the saved IPs to count them
    print('Crawled %s proxy IPs in total, time taken: %s \n' % (len(ips), diff))

# ------------------------------------------------------- Entry point -----------------------------------------------------------

if __name__ == '__main__':
    path = 'ip.txt'  # file that stores the crawled IPs
    targeturl = 'http://www.cnblogs.com/rianley/'  # URL used to validate the IPs
    getip(targeturl, path)
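One design point worth noting: the script starts 12 threads, one per listing page, and each thread validates the proxies on its page one after another with a 5-second timeout, so a page full of dead proxies can take a long time. A possible variation, shown here only as a sketch, validates every candidate concurrently with a thread pool and guards the file write with a lock. The names `save_if_valid`, `validate_all`, and `workers` are hypothetical; `check` stands for any function that takes an `ip:port` string and returns True or False, such as a wrapper around `checkip`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()  # serializes appends to the output file


def save_if_valid(ip, check, path):
    """Validate one proxy and, if it works, append it to the file under the lock."""
    if check(ip):
        with write_lock:
            with open(path, "a", encoding="utf-8") as f:
                f.write(ip + "\n")
        return True
    return False


def validate_all(ips, check, path, workers=20):
    """Check every candidate concurrently; return the number of proxies saved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda ip: save_if_valid(ip, check, path), ips))
    return sum(results)
```

With the script's own helpers this could be called as `validate_all(candidates, lambda ip: checkip(targeturl, ip), path)`, where `candidates` is the list of scraped `ip:port` strings.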

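Result: each proxy that passes validation is printed as soon as it is found and appended to ip.txt, and the run ends with a summary line giving the total number of proxies saved and the elapsed time.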
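Once the file has been written, a crawler can draw from it. The following is a minimal sketch of that last step, assuming the file holds one `ip:port` entry per line as produced by the script; `load_pool` and `pick_proxies` are hypothetical helper names.

```python
import random


def load_pool(path):
    """Read 'ip:port' entries, one per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def pick_proxies(pool, rng=random):
    """Pick one entry at random and return it in the dict format requests expects."""
    ip = rng.choice(pool)
    return {"http": "http://" + ip, "https": "http://" + ip}
```

A request through a random proxy would then look like `requests.get(url, proxies=pick_proxies(pool), timeout=5)`. Free proxies tend to stop working quickly, so it is probably worth catching request errors and retrying with a different entry.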
