Python: scraping usable proxy IPs

# coding: utf-8
import socket
import threading

import requests
from bs4 import BeautifulSoup

# Set a global socket timeout of 3 s: if a request gets no response within 3 s,
# the connection is abandoned and a timeout error is raised.
socket.setdefaulttimeout(3)

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
}

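def get_ip():
    # Scrape proxy IPs; returns two lists (HTTP proxies, HTTPS proxies).
    httpResult = []
    httpsResult = []
    try:
        for page in range(1, 10):
            IPurl = 'http://www.xicidaili.com/nn/%s' % page
            # An explicit timeout keeps a stalled page from hanging the script.
            rIP = requests.get(IPurl, headers=headers, timeout=3)
            IPContent = rIP.text
            soupIP = BeautifulSoup(IPContent, 'html.parser')  # or 'lxml' if installed
            trs = soupIP.find_all('tr')
            # Skip the first row, which holds the table header.
            for tr in trs[1:]:
                tds = tr.find_all('td')
                if len(tds) < 6:
                    # Not a proxy row; skip it instead of raising IndexError.
                    continue
                ip = tds[1].text.strip()
                port = tds[2].text.strip()
                protocol = tds[5].text.strip()
                if protocol == 'HTTP':
                    httpResult.append('http://' + ip + ':' + port)
                elif protocol == 'HTTPS':
                    httpsResult.append('https://' + ip + ':' + port)
    except Exception as inst:
        # Report the failure and return whatever was collected so far.
        print(inst)
    return httpResult, httpsResult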

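# Check whether a proxy is usable by sending a request through it with requests.
# Use the site you intend to crawl as the test URL. This version checks HTTP.
def cip(x, y):
    proxy = x + ':' + y
    print(proxy)
    try:
        requests.get('http://ip.chinaz.com/getip.aspx', proxies={'http': proxy}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append the working proxy; the with block closes the file afterwards.
        with open(r"E:\ip_http.txt", "a") as f:
            f.write(proxy + '\n')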

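# Check whether a proxy is usable by sending a request through it with requests.
# Use the site you intend to crawl as the test URL. This version checks HTTPS.
def csip(x, y):
    proxy = x + ':' + y
    print(proxy)
    try:
        requests.get('https://www.lagou.com/', proxies={'https': proxy}, timeout=3)
    except Exception:
        print('f')
    else:
        print('---------------------------success')
        # Append the working proxy; the with block closes the file afterwards.
        with open(r"E:\ip_https.txt", "a") as f:
            f.write(proxy + '\n')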

def main():
    httpResult, httpsResult = get_ip()
    print(len(httpResult), len(httpsResult))

    # Opening with "w" clears any results left over from a previous run.
    open(r"E:\ip_http.txt", "w").close()
    threads = []
    for i in httpResult:
        # i looks like 'http://ip:port'; split it back into ip and port.
        a = i.split(":")[-2][2:].strip()
        b = i.split(":")[-1].strip()
        t = threading.Thread(target=cip, args=(a, b))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Same procedure for the HTTPS proxies.
    open(r"E:\ip_https.txt", "w").close()
    threads1 = []
    for i in httpsResult:
        a = i.split(":")[-2][2:].strip()
        b = i.split(":")[-1].strip()
        t = threading.Thread(target=csip, args=(a, b))
        threads1.append(t)
    for t in threads1:
        t.start()
    for t in threads1:
        t.join()

if __name__ == '__main__':
    main()
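
The script above starts one thread per scraped proxy, which can mean several hundred threads running at once. As a rough sketch of one alternative, the standard library's `concurrent.futures.ThreadPoolExecutor` can cap the number of concurrent checks. The names `filter_working`, `check`, and the default `max_workers=20` are illustrative assumptions and not part of the original script; `check` stands in for any caller-supplied function (for example, a wrapper around `requests.get`) that returns `True` when a proxy responds.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=20):
    """Return the proxies for which check(proxy) is truthy, in input order.

    `check` is a hypothetical callable supplied by the caller. Any exception
    it raises is treated as "proxy not usable".
    """
    def safe_check(proxy):
        try:
            return bool(check(proxy))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input list.
        results = list(pool.map(safe_check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


if __name__ == "__main__":
    # Stand-in checker for demonstration only: accepts proxies on port 8080.
    demo = ["http://10.0.0.1:8080", "http://10.0.0.2:3128"]
    print(filter_working(demo, lambda p: p.endswith(":8080")))
```

Because the function returns the working proxies instead of writing from inside each thread, the caller can write the result file once, from a single thread.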
