Python crawler: looping over paginated pages

import os
from time import sleep

import faker
import requests
from lxml import etree

fake = faker.Faker()

base_url = "http://angelimg.spbeen.com"

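def get_next_link(url):
    # Return the absolute URL of the next page, or None when there is none.
    content = downloadHtml(url)
    if content is None:
        # The page could not be downloaded, so there is nothing to parse.
        return None
    html = etree.HTML(content)
    next_url = html.xpath("//a[@class='ch next']/@href")
    if next_url:
        return base_url + next_url[0]
    return None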

def downloadHtml(url):
    # Fetch a page with a random User-Agent; return its text, or None on a non-200 response.
    user_agent = fake.user_agent()
    headers = {'User-Agent': user_agent, "Referer": "http://angelimg.spbeen.com/"}
    response = requests.get(url, headers=headers, timeout=20)
    if response.status_code != 200:
        return None
    return response.text

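def getImgUrl(content):
    # Return (image URL, title) for one page; raises IndexError if either is missing.
    html = etree.HTML(content)
    img_url = html.xpath('//*[@id="content"]/a/img/@src')
    # The original predicate ['@class=article'] was a string literal and matched every div.
    title = html.xpath(".//div[contains(@class, 'article')]/h2/text()")

    return img_url[0], title[0]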

def saveImg(title, img_url):
    if img_url is not None and title is not None:
        # Keep only the part of the title before the first '【'.
        title = title.split('【')[0]
        file_path = 'isssss/{}/'.format(title)
        os.makedirs(file_path, exist_ok=True)
        file_name = img_url.split('/')[-1]

        # Download first, so a failed request does not leave an empty file behind.
        user_agent = fake.user_agent()
        headers = {'User-Agent': user_agent, "Referer": "http://angelimg.spbeen.com/"}
        response = requests.get(img_url, headers=headers, timeout=20)
        #request_view(response)
        # The with block closes the file, so no explicit close() is needed.
        with open(file_path + file_name + ".jpg", 'wb') as f:
            f.write(response.content)
        print("save img " + img_url)

def request_view(response):
    # Debug helper: save the response as tmp.html with a <base> tag and open it in a browser.
    import webbrowser
    request_url = response.url
    base_tag = '<head><base href="%s">' % (request_url)
    base_tag = base_tag.encode()
    content = response.content.replace(b"<head>", base_tag)
    with open('tmp.html', 'wb') as tem_html:
        tem_html.write(content)
    webbrowser.open_new_tab('tmp.html')

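def optimizeContent(res):
    # Strip bytes-repr debris (b', literal "\n", quotes) from the title text.
    res = res.replace('b\'', '')
    res = res.replace('\\n', '')
    res = res.replace('\'', '')
    res = res.replace('style', 'nouse')
    # '\\.' is the same two-character string the original '\.' produced, without the invalid escape.
    res = res.replace('\\.', '')
    return res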

def crawl_img(url):
    content = downloadHtml(url)
    if content is not None:
        res = getImgUrl(content)
        title = res[1]
        img_url = res[0]
        title = optimizeContent(title)
        title = title.replace('.', '')
        print(title)
        saveImg(title,img_url)
        return True
    else:
        return None
if __name__ == "__main__":
    root_url = "http://angelimg.spbeen.com/ang/{}"

    for i in range(37, 10000):
        url = root_url.format(i)
        try:
            # Follow the "next" links of gallery i until a page is missing
            # or the last page has no next link.
            while url:
                print("Crawling page: " + url)
                res = crawl_img(url)
                if res is None:
                    print(url + ' has no data')
                    break
                # get_next_link returns a falsy value on the last page, which ends the loop.
                url = get_next_link(url)
        except Exception as e:
            # Log the error and move on to the next gallery.
            print(str(e))
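
The pagination loop above can also be written as a small generator, separate from the downloading and parsing code. The following is only a sketch under stated assumptions: `follow_pages`, `fetch`, `find_next`, `max_pages` and the `fake_site` data are hypothetical names introduced for illustration, and `example.com` stands in for the real site. It resolves relative links with `urllib.parse.urljoin` instead of string concatenation, and it remembers visited URLs so a page that links back to itself cannot loop forever.

```python
from urllib.parse import urljoin


def follow_pages(start_url, fetch, find_next, max_pages=1000):
    """Yield (url, content) for start_url and each page reached via "next" links."""
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        content = fetch(url)
        if content is None:  # page missing: stop following this chain
            return
        seen.add(url)
        yield url, content
        href = find_next(content)  # relative or absolute href, or None on the last page
        url = urljoin(url, href) if href else None


# Stand-in data instead of real HTTP requests (assumption: illustration only).
fake_site = {
    "http://example.com/ang/37": {"next": "/ang/37/2"},
    "http://example.com/ang/37/2": {"next": "/ang/37/3"},
    "http://example.com/ang/37/3": {"next": None},
}

visited = [
    url
    for url, _ in follow_pages(
        "http://example.com/ang/37",
        fetch=fake_site.get,                   # returns None for unknown URLs
        find_next=lambda page: page["next"],
    )
]
print(visited)
```

With the script's own helpers, `fetch` could be `downloadHtml`, and `find_next` a function that runs the `//a[@class='ch next']/@href` XPath on the page text and returns the first match or `None`.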

