<Scrapy spider> Scraping 360 beauty-channel images and storing them in MySQL (I haven't learned MongoDB yet; I'll add it once I have)

1. Create the Scrapy project

In a DOS (Command Prompt) window, enter:

scrapy startproject images360

cd images360

2. Edit items.py (this acts as a template: the fields to scrape are defined here)

import scrapy


class Images360Item(scrapy.Item):
    # Image ID
    image_id = scrapy.Field()
    # Link
    url = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Thumbnail
    thumb = scrapy.Field()

3. Create the spider file

In a DOS window, enter:

scrapy genspider myspider image.so.com

4. Edit myspider.py (receive the responses and process the data)

# -*- coding: utf-8 -*-
import json
from urllib.parse import urlencode

import scrapy

from images360.items import Images360Item


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    # Must match the host of base_url
    allowed_domains = ['image.so.com']

    # Query parameters seen in the page's Ajax request:
    #   ch: beauty
    #   sn: 120
    #   listtype: new
    #   temp: 1
    data = {'ch': 'beauty', 'listtype': 'new'}
    # Ends with '?' so the encoded parameters can be appended directly
    base_url = 'https://image.so.com/zj?'

    # Build the 50 start URLs; sn advances by 30 per page
    start_urls = []
    for page in range(1, 51):
        data['sn'] = page * 30
        start_urls.append(base_url + urlencode(data))

    def parse(self, response):
        # The endpoint returns JSON; each entry of 'list' describes one image
        result = json.loads(response.text)
        for each in result.get('list') or []:
            item = Images360Item()
            item['image_id'] = each.get('imageid')
            item['url'] = each.get('qhimg_url')
            item['title'] = each.get('group_title')
            item['thumb'] = each.get('qhimg_thumb_url')
            yield item
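
To check what the spider requests and extracts without running a crawl, the URL construction and the JSON handling can be sketched with the standard library alone. The helper names `build_urls` and `extract_items` and the sample payload are made up for illustration; only the query parameter names and the JSON keys come from the spider.

```python
import json
from urllib.parse import urlencode

BASE_URL = 'https://image.so.com/zj?'  # list endpoint; parameters follow the '?'


def build_urls(pages=50, page_size=30):
    """Return one list-API URL per page, with sn = page * page_size."""
    urls = []
    for page in range(1, pages + 1):
        params = {'ch': 'beauty', 'listtype': 'new', 'sn': page * page_size}
        urls.append(BASE_URL + urlencode(params))
    return urls


def extract_items(text):
    """Map each entry of the JSON 'list' array to the four item fields."""
    result = json.loads(text)
    items = []
    for each in result.get('list') or []:
        items.append({
            'image_id': each.get('imageid'),
            'url': each.get('qhimg_url'),
            'title': each.get('group_title'),
            'thumb': each.get('qhimg_thumb_url'),
        })
    return items


# Hypothetical payload: the keys match the spider, the values are invented.
SAMPLE = json.dumps({'list': [{
    'imageid': 'abc123',
    'qhimg_url': 'https://example.com/full.jpg',
    'group_title': 'sample title',
    'qhimg_thumb_url': 'https://example.com/thumb.jpg',
}]})

if __name__ == '__main__':
    print(build_urls(pages=2))
    print(extract_items(SAMPLE))
```

Printing the first couple of URLs this way makes it easy to spot a malformed query string before any request is sent.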

5. Edit pipelines.py (store the data)

import pymysql.cursors


class Images360Pipeline(object):
    def __init__(self):
        # Connect to the local MySQL server; the 'quotes' database and
        # its images360 table must already exist
        self.connect = pymysql.connect(
            host='localhost',
            user='root',
            password='',
            database='quotes',
            charset='utf8',
        )
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        item = dict(item)
        # Parameterized insert: pymysql escapes the values
        sql = 'insert into images360(image_id,url,title,thumb) values(%s,%s,%s,%s)'
        self.cursor.execute(sql, (item['image_id'], item['url'], item['title'], item['thumb']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        # Release the cursor and the connection when the spider finishes
        self.cursor.close()
        self.connect.close()
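
The pipeline expects an `images360` table to exist already, but the post does not show its schema. As a rough sketch, the insert logic can be dry-run against Python's built-in `sqlite3` module instead of MySQL. This is a stand-in for testing, not the setup used in this post: the column types are assumptions, the function names are hypothetical, and `sqlite3` uses `?` placeholders where `pymysql` uses `%s`.

```python
import sqlite3

# Assumed schema: four text columns matching the item fields.
CREATE_SQL = (
    'CREATE TABLE IF NOT EXISTS images360 ('
    'image_id TEXT, url TEXT, title TEXT, thumb TEXT)'
)
INSERT_SQL = (
    'INSERT INTO images360(image_id, url, title, thumb) '
    'VALUES (?, ?, ?, ?)'
)


def open_db(path=':memory:'):
    """Open a SQLite database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(CREATE_SQL)
    return conn


def save_item(conn, item):
    """Insert one item with a parameterized query, then commit."""
    conn.execute(
        INSERT_SQL,
        (item['image_id'], item['url'], item['title'], item['thumb']),
    )
    conn.commit()
    return item


if __name__ == '__main__':
    conn = open_db()
    # Invented values, for illustration only.
    save_item(conn, {
        'image_id': 'abc123',
        'url': 'https://example.com/full.jpg',
        'title': 'sample title',
        'thumb': 'https://example.com/thumb.jpg',
    })
    print(conn.execute('SELECT COUNT(*) FROM images360').fetchone()[0])
    conn.close()
```

For MySQL itself, an equivalent table with string columns would need to be created in the `quotes` database before the crawl starts.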

6. Edit settings.py (set the headers, pipelines, and so on)

robots.txt protocol

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

headers

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

pipelines

ITEM_PIPELINES = {
    # The module path must use this project's package name, images360
    'images360.pipelines.Images360Pipeline': 300,
}

7. Run the spider

In a DOS window, enter:

scrapy crawl myspider

