Python: Getting Proxy IPs

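A Python crawler typically goes through three stages: it crawls, it gets blocked, and it works around the block. After that the site tightens its anti-crawler measures and the crawler works around them again, an arms race in which each side keeps trying to outdo the other.

In the early stages of a crawler, adding request headers and an IP proxy solves many problems.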
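As a minimal sketch of what adding headers and an IP proxy means when using requests: the helper below builds the keyword arguments for requests.get. The names build_request_kwargs and fetch, and the proxy address in the usage note, are illustrative assumptions and not part of the original code.

```python
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
}


def build_request_kwargs(proxy_url, timeout=3):
    """Return keyword arguments that send a request through proxy_url."""
    return {
        "headers": HEADERS,
        # requests picks the proxy entry by the scheme of the target URL
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": timeout,
    }


def fetch(url, proxy_url):
    """Fetch url through the given proxy (needs the third-party requests package)."""
    import requests  # imported here so build_request_kwargs works without it
    return requests.get(url, **build_request_kwargs(proxy_url))
```

For example, fetch('http://example.com/', 'http://1.2.3.4:8080') would route the request through a hypothetical proxy at 1.2.3.4:8080.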

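Before the code, here is the approach:

1. Scrape proxy IP addresses from http://www.xicidaili.com/nn/. The site lists many addresses, but there is no guarantee that any of them work, so first save them to a list.

2. Verify the proxies with multiple threads, then write the working ones to the corresponding txt file.

3. When a proxy IP is needed, import the module and run main() to get working proxy IPs for the rest of the job.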
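Step 3 could look roughly like this once a txt file has been written. The sketch assumes the file holds one ip:port entry per line, which is the format the validation functions write. load_proxies and pick_proxy are hypothetical helper names, and picking an entry at random is one possible policy, not something the original code does.

```python
import random


def load_proxies(path):
    """Read one 'ip:port' entry per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def pick_proxy(proxies, scheme="http"):
    """Pick a random entry and return a proxies dict for requests, or None if empty."""
    if not proxies:
        return None
    return {scheme: random.choice(proxies)}
```

The returned dict can be passed as the proxies argument of requests.get.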

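The code checks proxies in two ways, with telnetlib and with requests. It is better to validate with requests directly against whichever page you intend to crawl.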
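The telnetlib check only opens a TCP connection to the proxy's port, which the standard socket module can do as well (telnetlib itself was removed from the standard library in Python 3.13). A minimal sketch, with port_is_open as a hypothetical helper name:

```python
import socket


def port_is_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and connects, as telnetlib.Telnet did
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name-resolution failures
        return False
```

An open port shows only that something is listening, not that the proxy will relay requests to the site you care about, which is why validating with requests against the target page is the stronger test.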

# coding: utf-8
import socket
import threading

import requests
from bs4 import BeautifulSoup
# import telnetlib  # needed only by the disabled telnetlib variant below (module removed in Python 3.13)

# Set a global socket timeout of 3 s: if a request gets no response within 3 s, it is aborted with a timeout error.
socket.setdefaulttimeout(3)

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
}

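def get_ip():
    # Scrape proxy IPs; returns two lists of proxy URLs (HTTP proxies, HTTPS proxies).
    httpResult = []
    httpsResult = []
    try:
        for page in range(1, 2):  # first page only; widen the range to scrape more pages
            IPurl = 'http://www.xicidaili.com/nn/%s' % page
            rIP = requests.get(IPurl, headers=headers, timeout=3)
            soupIP = BeautifulSoup(rIP.text, 'lxml')
            trs = soupIP.find_all('tr')
            for tr in trs[1:]:  # skip the table header row
                tds = tr.find_all('td')
                if len(tds) < 6:
                    continue  # not a proxy row
                ip = tds[1].text.strip()
                port = tds[2].text.strip()
                protocol = tds[5].text.strip()
                if protocol == 'HTTP':
                    httpResult.append('http://' + ip + ':' + port)
                elif protocol == 'HTTPS':
                    httpsResult.append('https://' + ip + ':' + port)
    except Exception as e:
        # Report the failure instead of hiding it, and return whatever was collected so far.
        print('get_ip failed: %s' % e)
    return httpResult, httpsResult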

'''
# Check whether a proxy is usable, using the telnetlib module (HTTP list)
def cip(x, y):
    try:
        telnetlib.Telnet(x, port=int(y), timeout=5).close()
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        with open("E:/ip_http.txt", "a") as f:
            f.write(x + ':' + y + '\n')

# Check whether a proxy is usable, using the telnetlib module (HTTPS list)
def csip(x, y):
    try:
        telnetlib.Telnet(x, port=int(y), timeout=5).close()
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        with open("E:/ip_https.txt", "a") as f:
            f.write(x + ':' + y + '\n')
'''

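# Check whether a proxy is usable with requests; use the page you intend to crawl as the test URL. HTTP proxies.
def cip(x, y):
    try:
        print(x + ':' + y)
        requests.get('http://ip.chinaz.com/getip.aspx', proxies={'http': x + ":" + y}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append the working proxy and close the file right away.
        with open("E:/ip_http.txt", "a") as f:
            f.write(x + ':' + y + '\n')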

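# Check whether a proxy is usable with requests; use the page you intend to crawl as the test URL. HTTPS proxies.
def csip(x, y):
    try:
        print(x + ':' + y)
        requests.get('https://www.lagou.com/', proxies={'https': x + ":" + y}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append the working proxy and close the file right away.
        with open("E:/ip_https.txt", "a") as f:
            f.write(x + ':' + y + '\n')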

def main():
    # Working proxies end up in E:/ip_http.txt and E:/ip_https.txt, one ip:port per line.
    httpResult, httpsResult = get_ip()

    # Start from an empty file; the worker threads append to it.
    open("E:/ip_http.txt", "w").close()
    threads = []
    for i in httpResult:
        # i looks like 'http://1.2.3.4:8080'; split it into host and port
        a = i.split(":")[-2][2:].strip()
        b = i.split(":")[-1].strip()
        threads.append(threading.Thread(target=cip, args=(a, b)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    open("E:/ip_https.txt", "w").close()
    threads1 = []
    for i in httpsResult:
        a = i.split(":")[-2][2:].strip()
        b = i.split(":")[-1].strip()
        threads1.append(threading.Thread(target=csip, args=(a, b)))
    for t in threads1:
        t.start()
    for t in threads1:
        t.join()

if __name__ == '__main__':
    main()
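main() starts one thread per proxy and lets every thread append to the same file. An alternative sketch, not the author's method, uses a bounded thread pool from concurrent.futures and writes the file once from the calling thread. filter_working and save_proxies are hypothetical names, and check stands for any function that takes an ip:port string and returns True or False without raising.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=20):
    """Run check(proxy) concurrently and return the proxies for which it is truthy."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with proxies
        results = list(pool.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]


def save_proxies(path, proxies):
    """Overwrite path with one proxy per line, from a single thread."""
    with open(path, "w") as f:
        for p in proxies:
            f.write(p + "\n")
```

Because filter_working returns the list of working proxies, a caller can use them directly as well as save them to a file.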
