python - scraping free proxy IPs and verifying that they work

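Preface

I have been revisiting Python basics lately, regular expressions in particular. Regex is powerful but a little dry to study on its own, so I went looking for something fun to apply it to.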

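00xx01---Proxy IPs

There are plenty of free proxy IPs out there, but saving them one by one is tedious and not realistic, so let's scrape them with Python instead.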

00xx02---Extracting IPs with regex

 import re

 import requests

 # Send a browser-like User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
 url = "https://www.xicidaili.com/nn/1"
 response = requests.get(url, headers=headers)
 html = response.text
 # print(html)

 # Raw strings keep the \d and \. escapes intact for the regex engine.
 # re.S is not needed here: it only changes how an unescaped "." matches,
 # and neither pattern contains one.
 ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
 ports = re.findall(r"<td>(\d+)</td>", html)
 print(ips)
 print(ports)
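One thing to watch with two separate `findall` calls: the port pattern matches any cell that contains only digits, so if the table has another numeric column the two lists stop lining up. Here is a minimal sketch of matching the IP cell and the port cell together instead. The sample HTML is made up for illustration (the real page layout is an assumption), and `extract_pairs` is a hypothetical helper name.

```python
import re

# Hypothetical sample: one table row per proxy, with an extra numeric column.
# The real page layout is an assumption.
SAMPLE_HTML = """
<tr>
  <td>1</td>
  <td>192.0.2.10</td>
  <td>9999</td>
  <td>HTTP</td>
</tr>
<tr>
  <td>2</td>
  <td>198.51.100.7</td>
  <td>8080</td>
  <td>HTTPS</td>
</tr>
"""

# An IP cell followed (whitespace allowed) by a digits-only cell.
PAIR_RE = re.compile(r"<td>(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)</td>")


def extract_pairs(html):
    """Return a list of (ip, port) string tuples found in html."""
    return PAIR_RE.findall(html)


# The standalone port pattern also picks up the row-number column.
naive_ports = re.findall(r"<td>(\d+)</td>", SAMPLE_HTML)
pairs = extract_pairs(SAMPLE_HTML)

print(naive_ports)
print(pairs)
```

With the sample above, the standalone port pattern returns four values for two proxies, while the combined pattern returns one tuple per proxy.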

00xx03---Pairing IPs with ports

 import re

 import requests

 # Send a browser-like User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
 url = "https://www.xicidaili.com/nn/1"
 response = requests.get(url, headers=headers)
 html = response.text
 # print(html)

 # Raw strings keep the regex escapes intact; re.S is not needed for these patterns
 ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
 ports = re.findall(r"<td>(\d+)</td>", html)
 # print(ips)
 # print(ports)

 for ip in zip(ips, ports):  # each item is an (ip, port) tuple
     print(ip)
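To see what `zip` does here, and how a pair turns into the `proxies` dictionary that `requests` expects, here is a small offline sketch. The addresses are placeholder values and `build_proxies` is a hypothetical helper name, not something from the original script.

```python
def build_proxies(ip, port):
    """Build a requests-style proxies mapping from an ip string and a port string."""
    address = "http://{}:{}".format(ip, port)
    # The same HTTP proxy address is used for both http and https traffic.
    return {"http": address, "https": address}


ips = ["192.0.2.10", "198.51.100.7"]
ports = ["9999", "8080", "3128"]  # one more port than there are IPs

# zip stops at the shorter list, so the extra port is silently dropped.
pairs = list(zip(ips, ports))
proxies_list = [build_proxies(ip, port) for ip, port in pairs]

for proxies in proxies_list:
    print(proxies)
```

Because `zip` truncates silently, a length mismatch between `ips` and `ports` never raises an error; comparing `len(ips)` with `len(ports)` before pairing is a cheap sanity check.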

00xx03---Verifying that the IPs work

The idea: send a request to some website through each IP and port. Baidu is good enough as the target.

 import re

 import requests

 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

 for i in range(1, 1000):
     # Listing page to scrape
     url = "https://www.xicidaili.com/nn/{}".format(i)
     response = requests.get(url, headers=headers)
     html = response.text
     # print(html)

     # Raw strings keep the regex escapes intact; re.S is not needed for these patterns
     ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
     ports = re.findall(r"<td>(\d+)</td>", html)
     # print(ips)
     # print(ports)

     for ip in zip(ips, ports):  # each item is an (ip, port) tuple
         proxies = {
             "http": "http://" + ip[0] + ":" + ip[1],
             "https": "http://" + ip[0] + ":" + ip[1],
         }
         try:
             # Give up if the proxy does not answer within 3 seconds
             res = requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
             res.raise_for_status()  # treat HTTP error statuses as failures too
             print(ip, "usable")
             with open("ip.text", mode="a+") as f:
                 f.write(":".join(ip))  # append ip:port to ip.text
                 f.write("\n")
         except requests.RequestException:  # connection errors, timeouts, bad statuses
             print(ip, "not usable")
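The check itself can be pulled out into a function so it is easy to try without touching the network. This is only a sketch: `check_proxy`, `FakeResponse` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a parameter (in real use you would pass `requests.get`). It assumes the fetcher raises an `OSError` subclass on network failure, which holds for `requests` because `requests.RequestException` derives from `IOError`.

```python
def check_proxy(ip, port, fetch, test_url="http://www.baidu.com", timeout=3):
    """Return True if test_url answers with HTTP 200 through the proxy ip:port."""
    address = "http://{}:{}".format(ip, port)
    proxies = {"http": address, "https": address}
    try:
        response = fetch(test_url, proxies=proxies, timeout=timeout)
    except OSError:
        # Network-level failure (refused connection, timeout, ...): proxy is unusable.
        return False
    return response.status_code == 200


class FakeResponse:
    """Stand-in for a response object; only status_code is modelled."""

    def __init__(self, status_code):
        self.status_code = status_code


def fake_fetch(url, proxies, timeout):
    """Hypothetical fetcher for the demo: only proxies on port 8080 'work'."""
    if proxies["http"].endswith(":8080"):
        return FakeResponse(200)
    raise ConnectionError("simulated dead proxy")


alive = check_proxy("192.0.2.10", "8080", fake_fetch)
dead = check_proxy("192.0.2.11", "3128", fake_fetch)
print(alive, dead)
```

With the real library the call would look like `check_proxy(ip, port, requests.get)`.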

00xx04---Writing to a text file

 import re

 import requests

 # Send a browser-like User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
 url = "https://www.xicidaili.com/nn/1"
 response = requests.get(url, headers=headers)
 html = response.text
 # print(html)

 # Raw strings keep the regex escapes intact; re.S is not needed for these patterns
 ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
 ports = re.findall(r"<td>(\d+)</td>", html)
 # print(ips)
 # print(ports)

 for ip in zip(ips, ports):  # each item is an (ip, port) tuple
     print(ip)
     proxies = {
         "http": "http://" + ip[0] + ":" + ip[1],
         "https": "http://" + ip[0] + ":" + ip[1],
     }
     try:
         # Give up if the proxy does not answer within 3 seconds
         res = requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
         res.raise_for_status()  # treat HTTP error statuses as failures too
         print(ip, "usable")
         with open("ip.text", mode="a+") as f:
             f.write(":".join(ip))  # append ip:port to ip.text
             f.write("\n")
     except requests.RequestException:  # connection errors, timeouts, bad statuses
         print(ip, "not usable")
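Since the file is opened in append mode, running the script twice writes the same proxy twice. One possible way around that, sketched below, is to read the existing entries first and skip anything already saved. `save_proxy` is a hypothetical helper, and the demo writes into a temporary directory rather than the working directory.

```python
import os
import tempfile


def save_proxy(path, ip, port):
    """Append 'ip:port' to path unless it is already there. Return True if written."""
    entry = "{}:{}".format(ip, port)
    existing = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            existing = {line.strip() for line in f}
    if entry in existing:
        return False
    with open(path, mode="a", encoding="utf-8") as f:
        f.write(entry + "\n")
    return True


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ip.text")
    first = save_proxy(path, "192.0.2.10", "9999")
    second = save_proxy(path, "192.0.2.10", "9999")  # duplicate, skipped
    third = save_proxy(path, "198.51.100.7", "8080")
    with open(path, encoding="utf-8") as f:
        saved_lines = f.read().splitlines()

print(first, second, third)
print(saved_lines)
```

Re-reading the file on every call is fine for a few hundred entries; for a long crawl it would be cheaper to load the set once and keep it in memory.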

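Scraping one page turns up only a handful of working proxies, and the site has more than 3,000 pages, far too many to go through by hand.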

00xx05---Batch scraping

 #!/usr/bin/env python3
 # coding:utf-8
 # 2019/11/18 22:38
 # lanxing
 import re

 import requests

 # Send a browser-like User-Agent so the site is less likely to block the request
 headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}

 for i in range(1, 3001):  # crawl 3000 listing pages (the end of range is exclusive)
     # Listing page to scrape
     url = "https://www.xicidaili.com/nn/{}".format(i)
     response = requests.get(url, headers=headers)
     html = response.text
     # print(html)

     # Raw strings keep the regex escapes intact; re.S is not needed for these patterns
     ips = re.findall(r"<td>(\d+\.\d+\.\d+\.\d+)</td>", html)
     ports = re.findall(r"<td>(\d+)</td>", html)
     # print(ips)
     # print(ports)

     for ip in zip(ips, ports):  # each item is an (ip, port) tuple
         print(ip)
         proxies = {
             "http": "http://" + ip[0] + ":" + ip[1],
             "https": "http://" + ip[0] + ":" + ip[1],
         }
         try:
             # Give up if the proxy does not answer within 3 seconds
             res = requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
             res.raise_for_status()  # treat HTTP error statuses as failures too
             print(ip, "usable")
             with open("ip.text", mode="a+") as f:
                 f.write(":".join(ip))  # append ip:port to ip.text
                 f.write("\n")
         except requests.RequestException:  # connection errors, timeouts, bad statuses
             print(ip, "not usable")
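A loop over thousands of pages is easier to reason about when the paging logic is separated from the network and parsing code. The sketch below pauses between pages and stops at the first page that yields nothing. `page_urls` and `crawl` are hypothetical names; the fetcher and the extractor are passed in, and stopping on an empty page is an assumption about how the listing ends. The demo uses stand-ins so it runs offline.

```python
import time


def page_urls(first, last, template="https://www.xicidaili.com/nn/{}"):
    """Yield listing-page URLs for pages first..last inclusive."""
    for page in range(first, last + 1):
        yield template.format(page)


def crawl(fetch_html, extract, first=1, last=3000, delay=1.0, sleep=time.sleep):
    """Collect (ip, port) pairs page by page, stopping at the first empty page."""
    found = []
    for url in page_urls(first, last):
        pairs = extract(fetch_html(url))
        if not pairs:
            break  # an empty page is treated as the end of the listing
        found.extend(pairs)
        sleep(delay)  # pause between pages so the site is not hammered
    return found


# Offline demonstration with stand-ins for the network and the parser.
fake_pages = {
    "https://www.xicidaili.com/nn/1": [("192.0.2.10", "9999")],
    "https://www.xicidaili.com/nn/2": [("198.51.100.7", "8080")],
}
result = crawl(
    fetch_html=lambda url: url,                     # pretend the URL is the page body
    extract=lambda html: fake_pages.get(html, []),  # look the "page" up in the dict
    delay=0,
    sleep=lambda seconds: None,                     # no real waiting in the demo
)
print(result)
```

In real use `fetch_html` would wrap `requests.get(url, headers=headers).text` and `extract` would be the regex step from earlier.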

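00xx06---Wrapping up

Haha, the crawl feels really slow, which is no surprise for a single thread. If you want it faster, try multithreading. I'll come back and polish the code later.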
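For the multithreading idea, here is a rough sketch of one way it could go, using `ThreadPoolExecutor` from the standard library to check many proxies at once. It is not code from this post: `filter_working` is a hypothetical helper, the checker is passed in, and the demo uses a fake checker so it runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(pairs, is_working, max_workers=20):
    """Run is_working(ip, port) for every pair in parallel and keep the ones that pass."""
    pairs = list(pairs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        flags = list(pool.map(lambda pair: is_working(*pair), pairs))
    return [pair for pair, ok in zip(pairs, flags) if ok]


candidates = [
    ("192.0.2.10", "8080"),
    ("192.0.2.11", "3128"),
    ("192.0.2.12", "8080"),
]

# Fake checker for the demo: pretend only port 8080 proxies respond.
working = filter_working(candidates, lambda ip, port: port == "8080")
print(working)
```

Threads suit this job because each check spends almost all of its time waiting on the network, so a 3-second timeout per proxy no longer adds up one after another.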
