Python: fetching proxy IPs

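A Python crawler goes through a cycle: you crawl, the site restricts the crawler, and you work around the restriction. The site then tightens its anti-crawling measures and you work around those too, an arms race in which each side keeps outdoing the other.

In the early stages of a crawler, adding headers and an IP proxy solves a lot of problems.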

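Here is the code. First, the approach:

1. Scrape proxy IP addresses from http://www.xicidaili.com/nn/. There are plenty of addresses, but there is no guarantee that any given one works. Save them to a list first.

2. Verify the proxy IPs with multiple threads, then write the ones that work to the corresponding txt file.

3. When you need a proxy IP, import the module and run main(); the usable proxy IPs are then available for whatever comes next.

Two ways of verifying an IP appear in the code: telnetlib and requests. It is better to verify with requests directly against whichever page you intend to crawl.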
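As a rough illustration of step 2 combined with the advice to test against the page you actually want to crawl, the sketch below checks a list of proxies concurrently. It is not the script from this post: it assumes Python 3, uses `concurrent.futures` instead of hand-managed threads, and the names `check_proxy`, `filter_working`, and `requests_fetch` are hypothetical. The `fetch` callable is passed in so that the checking logic can be tried without any network access.

```python
from concurrent.futures import ThreadPoolExecutor


def check_proxy(proxy, target_url, fetch, timeout=3):
    """Return True if fetch() can load target_url through proxy without raising."""
    try:
        fetch(target_url, proxy, timeout)
    except Exception:
        return False
    return True


def filter_working(proxies, target_url, fetch, workers=20):
    """Check all proxies concurrently and keep the ones that passed, in input order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as its input
        results = list(pool.map(lambda p: check_proxy(p, target_url, fetch), proxies))
    return [p for p, ok in zip(proxies, results) if ok]


def requests_fetch(url, proxy, timeout):
    """One possible fetch(): load url through proxy with requests (third-party)."""
    import requests
    scheme = url.split(":", 1)[0]  # 'http' or 'https', used as the proxies key
    response = requests.get(url, proxies={scheme: proxy}, timeout=timeout)
    response.raise_for_status()  # treat 4xx/5xx answers as failures too


# Hypothetical usage, with candidates being a list like ['http://1.2.3.4:80', ...]:
# working = filter_working(candidates, 'https://www.lagou.com/', requests_fetch)
```

Passing the target URL in means a proxy is only kept if it can reach the site you care about, which is a stricter test than merely opening a TCP connection to the proxy.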

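# coding: utf-8

from bs4 import BeautifulSoup
import socket
import threading

import requests
# import telnetlib  # only needed by the commented-out Telnet check below; removed from the standard library in Python 3.13

# Set a global timeout of 3 s: a request that gets no response within 3 s is aborted with a timeout error
socket.setdefaulttimeout(3)

# Send a browser-like User-Agent when requesting the proxy list
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
}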

def get_ip():
    # Fetch proxy IPs; returns two lists (HTTP proxies, HTTPS proxies)
    httpResult = []
    httpsResult = []
    try:
        for page in range(1, 2):  # page 1 only; widen the range to scrape more pages
            IPurl = 'http://www.xicidaili.com/nn/%s' % page
            rIP = requests.get(IPurl, headers=headers)
            IPContent = rIP.text
            soupIP = BeautifulSoup(IPContent, 'lxml')
            trs = soupIP.find_all('tr')
            for tr in trs[1:]:  # skip the table header row
                tds = tr.find_all('td')
                ip = tds[1].text.strip()
                port = tds[2].text.strip()
                protocol = tds[5].text.strip()
                if protocol == 'HTTP':
                    httpResult.append('http://' + ip + ':' + port)
                elif protocol == 'HTTPS':
                    httpsResult.append('https://' + ip + ':' + port)
    except Exception as e:
        # Report the failure instead of hiding it; whatever was collected so far is still returned
        print('get_ip failed: %s' % e)
    return httpResult, httpsResult

r'''
# Check whether a proxy is usable with the telnetlib module (HTTP)
def cip(x, y):
    try:
        telnetlib.Telnet(x, port=y, timeout=5)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        with open(r"E:\ip_http.txt", "a") as f:
            f.write(x + ':' + y + '\n')

# Check whether a proxy is usable with the telnetlib module (HTTPS)
def csip(x, y):
    try:
        telnetlib.Telnet(x, port=y, timeout=5)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        with open(r"E:\ip_https.txt", "a") as f:
            f.write(x + ':' + y + '\n')
'''

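# Check whether a proxy is usable with the requests module; use the page you actually want to crawl as the test URL (HTTP)
def cip(x, y):
    proxy = 'http://' + x + ':' + y
    try:
        print(proxy)
        requests.get('http://ip.chinaz.com/getip.aspx', proxies={'http': proxy}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append the working proxy as ip:port
        with open(r"E:\ip_http.txt", "a") as f:
            f.write(x + ':' + y + '\n')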

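# Check whether a proxy is usable with the requests module; use the page you actually want to crawl as the test URL (HTTPS)
def csip(x, y):
    proxy = 'http://' + x + ':' + y
    try:
        print(proxy)
        requests.get('https://www.lagou.com/', proxies={'https': proxy}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append the working proxy as ip:port
        with open(r"E:\ip_https.txt", "a") as f:
            f.write(x + ':' + y + '\n')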

def main():
    httpResult, httpsResult = get_ip()

    # Start from an empty file ("w" mode truncates; "a" mode would keep old entries)
    open(r"E:\ip_http.txt", "w").close()
    threads = []
    for i in httpResult:
        # i looks like 'http://ip:port'; split it back into ip and port
        a = str(i.split(":")[-2][2:].strip())
        b = str(i.split(":")[-1].strip())
        t = threading.Thread(target=cip, args=(a, b))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Same procedure for the HTTPS proxies
    open(r"E:\ip_https.txt", "w").close()
    threads1 = []
    for i in httpsResult:
        a = str(i.split(":")[-2][2:].strip())
        b = str(i.split(":")[-1].strip())
        t = threading.Thread(target=csip, args=(a, b))
        threads1.append(t)
    for t in threads1:
        t.start()
    for t in threads1:
        t.join()

if __name__ == '__main__':
    main()
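Note that main() does not return anything; the working proxies end up in the two txt files, one ip:port per line. A minimal sketch of how a crawler might read one of those files back and pick a proxy at random is shown below. `load_proxies` and `pick_proxy` are hypothetical helper names, not part of the script above, and the sketch assumes the proxy itself is reached over plain HTTP, which is why the `http://` prefix is added for both the `http` and the `https` key.

```python
import random


def load_proxies(path):
    """Read 'ip:port' entries, one per line, skipping blank lines."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(line)
    return entries


def pick_proxy(entries, key="http"):
    """Return a requests-style proxies dict for one random entry, or None if there is none."""
    if not entries:
        return None
    # Assumption: the proxy is contacted over plain HTTP, hence the http:// prefix
    return {key: "http://" + random.choice(entries)}


# Hypothetical usage with the files written by the script above:
# proxies = pick_proxy(load_proxies(r"E:\ip_https.txt"), key="https")
# requests.get("https://www.lagou.com/", headers=headers, proxies=proxies, timeout=3)
```

Choosing a different proxy for each request spreads the traffic across addresses; free proxies tend to go stale quickly, so it is worth re-running the check before each crawl.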
