Scraping images from a Baidu Tieba thread with Python

# Someone on Tieba was posting pictures, so I decided to grab them

# This only scrapes the images from a single thread

1. First, create a new Scrapy project

　　scrapy startproject TuBaEx

2. Create a spider

　　scrapy genspider tubaex https://tieba.baidu.com/p/4092816277

3. Define the item first

　　# Stores the image URL
　　img_url=scrapy.Field()

4. Write the spider

# -*- coding: utf-8 -*-
import scrapy

from TuBaEx.items import TubaexItem


class TubaexSpider(scrapy.Spider):
    name = "tubaex"
    # allowed_domains expects bare domain names, not full URLs
    #allowed_domains = ["tieba.baidu.com"]
    baseURL = "https://tieba.baidu.com/p/4092816277?pn="
    # Page number appended to baseURL for pagination, starting at page 1
    offset = 1
    # First page to crawl
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Number of the last page; extract() returns a list of strings
        end_page = response.xpath("//div[@id='thread_theme_5']/div/ul/li[2]/span[2]/text()").extract()
        # The image class name was found with the browser's element inspector
        img_list = response.xpath("//img[@class='BDE_Image']/@src").extract()
        for img in img_list:
            item = TubaexItem()
            item['img_url'] = img
            yield item
        # Follow the next page until the last page has been crawled
        if end_page and self.offset < int(end_page[0]):
            self.offset += 1
            yield scrapy.Request(self.baseURL + str(self.offset), callback=self.parse)
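
The two XPath queries and the page counter in parse() can be tried outside Scrapy. What follows is a minimal sketch that uses only the standard library. The HTML snippet, the example.com image URLs, and the helper names extract_images, last_page, and next_page_url are all made up for illustration. xml.etree.ElementTree stands in for Scrapy's selector here; it supports only a subset of XPath and needs well-formed markup, so it is not a replacement for response.xpath() on real pages.

```python
import xml.etree.ElementTree as ET

BASE_URL = "https://tieba.baidu.com/p/4092816277?pn="

# Hypothetical, well-formed stand-in for a thread page (not real Tieba markup)
SAMPLE_PAGE = """
<html><body>
  <div id="thread_theme_5"><div><ul>
    <li><span>120</span></li>
    <li><span>replies</span><span>3</span></li>
  </ul></div></div>
  <img class="BDE_Image" src="https://example.com/a.jpg" />
  <img class="avatar" src="https://example.com/avatar.jpg" />
  <img class="BDE_Image" src="https://example.com/b.jpg" />
</body></html>
"""


def extract_images(root):
    """Return the src of every <img> whose class is exactly BDE_Image."""
    return [img.get("src") for img in root.findall(".//img[@class='BDE_Image']")]


def last_page(root):
    """Return the last page number, or None if the element is missing."""
    span = root.find(".//div[@id='thread_theme_5']/div/ul/li[2]/span[2]")
    return int(span.text) if span is not None and span.text else None


def next_page_url(current, end):
    """Return the URL of the following page, or None after the last page."""
    if end is None or current >= end:
        return None
    return BASE_URL + str(current + 1)


root = ET.fromstring(SAMPLE_PAGE)
print(extract_images(root))
print(last_page(root))
print(next_page_url(1, last_page(root)))
print(next_page_url(3, last_page(root)))
```

Returning None when the page count is missing mirrors the guard a spider needs: extract() gives back an empty list when the XPath matches nothing, and indexing it with [0] would raise an IndexError.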

5. Use ImagesPipeline. There is not much to say about it, and I do not fully understand it myself

# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class TubaexPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image URL stored on the item
        img_link = item['img_url']
        yield scrapy.Request(img_link)

    def item_completed(self, results, item, info):
        # The images are saved under IMAGES_STORE from settings.py,
        # so the item is passed along unchanged
        return item
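
Since item_completed() is where the pipeline reports what happened to each download, a short illustration of its results argument may help. According to the Scrapy documentation, results is a list of (success, info) tuples, and on success info is a dict with keys such as url, path, and checksum. The sketch below runs without Scrapy; the sample data and the helper name downloaded_paths are hypothetical.

```python
def downloaded_paths(results):
    """Collect the storage paths of the images that downloaded successfully."""
    return [info["path"] for ok, info in results if ok]


# Made-up sample in the shape Scrapy passes to item_completed()
sample_results = [
    (True, {"url": "https://example.com/a.jpg",
            "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg",
            "checksum": "2b00042f7481c7b056c4b410d28f33cf"}),
    # On failure the second element is an error object (a Twisted Failure in Scrapy)
    (False, Exception("download failed")),
]

paths = downloaded_paths(sample_results)
print(paths)
```

Inside item_completed() the same list comprehension could be used to log the saved files, or to drop items whose image failed to download.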

6. Configure settings

IMAGES_STORE = 'C:/Users/ll/Desktop/py/TuBaEx/Images/'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TuBaEx (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
    'TuBaEx.pipelines.TubaexPipeline': 300,
}

7. Run it

　　scrapy crawl tubaex

8. Enjoy the results
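
To find the downloaded files, it helps to know how the pipeline names them. As far as I know, the default ImagesPipeline saves each image as a JPEG under a full/ subdirectory of IMAGES_STORE, using the SHA-1 hash of the image URL as the file name. The sketch below mimics that naming with the standard library only; the helper name expected_image_path and the example.com URL are hypothetical, and a pipeline that overrides file_path() would produce different names.

```python
import hashlib
from pathlib import PurePosixPath

IMAGES_STORE = "C:/Users/ll/Desktop/py/TuBaEx/Images/"


def expected_image_path(url, store=IMAGES_STORE):
    """Mimic the default naming: <store>/full/<sha1 of the URL>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return str(PurePosixPath(store) / "full" / (image_guid + ".jpg"))


# Hypothetical image URL, used only to show the shape of the result
print(expected_image_path("https://example.com/a.jpg"))
```

Because the name depends only on the URL, crawling the same image twice maps to the same file.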
