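Scraping Meizitu with requests + regular expressions

I wrote a crawler that scrapes one index page of Meizitu, mainly using requests and regular expressions.

Thanks to Cui Qingcai for his crawler tutorial videos and GitBook:

Bilibili: https://www.bilibili.com/video/av18202461/index_1.html

GitBook: https://legacy.gitbook.com/book/germey/python3webspider/details

Source code: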

#!/usr/bin/python
# coding=utf-8

import os
import re
import requests
from requests.exceptions import RequestException
from hashlib import md5

def download_from_detail(url):
    item = get_dict(url)
    save_images(item)

def get_dict(url):
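    """
    Fetch one gallery detail page and extract its title and image URLs.
    :param url: URL of the gallery detail page
    :return: {"title": str, "images_url": list of str}, or None on failure
    """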
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers)
    except RequestException:
        print("request error")
        return None
    if response.status_code == 200:
        # the page is encoded as gb2312; set it so response.text decodes correctly
        response.encoding = "gb2312"
        html = response.text
        title = re.search('<title>(.*?)</title>', html, re.S).group(1).split()[0]
        images_url = re.findall('<img alt=.*?src="(.*?)" /><br />', html)
        return {
            "title": title,
            "images_url": images_url
        }
    else:
        return None

def save_images(item):
    """
    Save every image into a directory named after the gallery title.
    :param item: dict returned by get_dict, or None
    :return: None
    """
    if not item:
        return

    # 1. make sure the target directory exists
    if not os.path.exists(item["title"]):
        os.mkdir(item["title"])
    # 2. save all the images into that directory
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36"
    }
    for url in item["images_url"]:
        try:
            image_response = requests.get(url, headers=headers)
        except RequestException:
            print("request image error")
            continue
        # skip anything that is not a successful download
        if image_response.status_code != 200:
            continue
        # name the file by the MD5 of its bytes, so identical images are stored once
        file_name = "{0}/{1}.{2}".format(item["title"], md5(image_response.content).hexdigest(), "jpeg")
        with open(file_name, "wb") as image_file:
            image_file.write(image_response.content)
        print("{0} written successfully".format(file_name))

def get_page_index(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers)
    except RequestException:
        print("request index page error")
        # without a response there is nothing to parse
        return
    if response.status_code == 200:
        response.encoding = "gb2312"
        # treat every link that opens in a new tab as a gallery detail page
        page_index_urls = re.findall('<a href="(.*?)".*?target=\'_blank\'>', response.text, re.S)
        for detail_url in page_index_urls:
            download_from_detail(detail_url)

if __name__ == "__main__":
    url = "http://www.meizitu.com/a/pure.html"
    get_page_index(url)
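
To see what the two regular expressions in get_dict actually capture, here is a minimal offline sketch. The HTML snippet, the title and the example.com URLs are made up for illustration; the real page layout may differ, so treat this only as a check of the patterns themselves.

```python
import re

# Hypothetical HTML shaped like a gallery detail page (not real site data).
sample_html = """
<title>SampleGallery | Example Site</title>
<div id="picture">
<img alt="SampleGallery, photo 1" src="http://example.com/a/01.jpg" /><br />
<img alt="SampleGallery, photo 2" src="http://example.com/a/02.jpg" /><br />
</div>
"""

# Same patterns as in get_dict: non-greedy groups keep each match short.
title = re.search('<title>(.*?)</title>', sample_html, re.S).group(1).split()[0]
images_url = re.findall('<img alt=.*?src="(.*?)" /><br />', sample_html)

print(title)
print(images_url)
```

Note that split()[0] keeps only the text before the first whitespace in the title, and that findall returns just the captured group (the src value), not the whole tag.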

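Main issue:

① Converting gb2312 to utf-8

The pages are encoded as gb2312, so set the encoding before reading response.text:

    response.encoding="gb2312"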
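
Why the encoding has to be set by hand can be shown without any network access. The sketch below assumes the server sends a text/html Content-Type without a charset, in which case requests falls back to ISO-8859-1 when building response.text. The sample title is hypothetical; raw stands in for response.content.

```python
# Bytes as a gb2312 server would send them (hypothetical sample title).
raw = "<title>测试标题</title>".encode("gb2312")

# What you get if the bytes are decoded with the ISO-8859-1 fallback.
wrong = raw.decode("iso-8859-1")

# What you get once the encoding is declared as gb2312.
right = raw.decode("gb2312")

print(repr(wrong))  # mojibake: Latin-1 characters instead of Chinese
print(right)
```

Setting response.encoding = "gb2312" makes requests perform the second decode, and the resulting str can then be written out as utf-8 or any other encoding.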
