Scrapy in Practice: Scraping Weather Forecasts + Scheduled Tasks

Target site:

http://www.weather.com.cn/

Weather page for a specific region:

http://www.weather.com.cn/weather1d/101280601.shtm (Shenzhen)

Getting started:

scrapy startproject weather

Write items.py:

import scrapy


class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()         # forecast date
    temperature = scrapy.Field()  # temperature for that day
    weather = scrapy.Field()      # weather description
    wind = scrapy.Field()         # wind for that day

Write the spider:

# -*- coding: utf-8 -*-
# @Time    : 2019/8/1 15:40
# @Author  : wujf
# @Email   : 1028540310@qq.com
# @File    : weather.py
# @Software: PyCharm
import scrapy

from weather.items import WeatherItem


class WeatherSpider(scrapy.Spider):
    name = 'weather'
    # allowed_domains takes bare domain names, not URLs with a path
    allowed_domains = ['www.weather.com.cn']
    start_urls = [
        'http://www.weather.com.cn/weather/101280601.shtml'
    ]

    def parse(self, response):
        '''
        Extract the forecast fields from the page.
        date        = the date
        temperature = that day's temperature
        weather     = that day's weather
        wind        = that day's wind
        :param response: the downloaded forecast page
        :return: a list of WeatherItem objects
        '''
        items = []
        day = response.xpath('//ul[@class="t clearfix"]')
        # XPath positions are 1-based, so iterate over li[1] .. li[7]
        for i in range(1, 8):
            item = WeatherItem()
            item['date'] = day.xpath('./li[%d]/h1//text()' % i).extract_first()
            item['temperature'] = day.xpath('./li[%d]/p[@class="tem"]/i//text()' % i).extract_first()
            item['weather'] = day.xpath('./li[%d]/p[@class="wea"]//text()' % i).extract_first()
            item['wind'] = day.xpath('./li[%d]/p[@class="win"]/i//text()' % i).extract_first()
            # print(item)
            items.append(item)
        return items
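
The positional XPath logic in parse() can be tried out without requesting the site. The following is only a sketch under stated assumptions: SAMPLE is a made-up, simplified fragment that mimics the ul/li/p structure the spider's selectors expect (the real page is larger and may differ), parse_forecast is a hypothetical helper, and the standard library's xml.etree.ElementTree, which supports only a subset of XPath, stands in for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<div>
  <ul class="t clearfix">
    <li><h1>1日（今天）</h1><p class="wea">多云</p>
        <p class="tem"><span>32</span>/<i>26℃</i></p>
        <p class="win"><i>&lt;3级</i></p></li>
    <li><h1>2日（明天）</h1><p class="wea">雷阵雨</p>
        <p class="tem"><span>31</span>/<i>25℃</i></p>
        <p class="win"><i>3-4级</i></p></li>
  </ul>
</div>
"""


def parse_forecast(markup):
    """Return one dict per <li>, mirroring the fields of WeatherItem."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall(".//ul[@class='t clearfix']/li"):
        rows.append({
            'date': li.findtext('h1'),
            'temperature': li.findtext("p[@class='tem']/i"),
            'weather': li.findtext("p[@class='wea']"),
            'wind': li.findtext("p[@class='win']/i"),
        })
    return rows


forecast = parse_forecast(SAMPLE)
for row in forecast:
    print(row)
```

In this assumed layout the i element under the "tem" paragraph holds a single temperature value, which is why the spider's temperature field contains one number rather than a range.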

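Write the pipelines:

pipelines.py does the final processing of the data the spider has scraped. Usually we store the data locally, in one of these forms:

1. Plain text: the most basic storage format

2. JSON: easy for other programs to consume

3. Database: the usual choice when the data volume is large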

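import os
import json
import codecs
import pymysql


# Plain-text storage
class WeatherPipeline(object):
    def process_item(self, item, spider):
        # print(item)
        # Alternative: build the path from the current working directory
        # base_dir = os.getcwd()
        # filename = os.path.join(base_dir, 'data', 'test.txt')
        filename = r'E:\Python\weather\weather\data\test.txt'
        # Explicit encoding so the Chinese text is written the same way on every platform
        with open(filename, 'a', encoding='utf-8') as f:
            # extract_first() returns None when nothing matches, so fall back to ''
            f.write((item['date'] or '') + '\n')
            f.write((item['temperature'] or '') + '\n')
            f.write((item['weather'] or '') + '\n')
            f.write((item['wind'] or '') + '\n\n')
        return item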
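
The hard-coded E:\ path only works on the author's machine. As a hedged sketch of the commented-out alternative, the output path can be derived from a base directory instead. The helper name data_file is hypothetical, and the temporary directory is used here only so the example runs anywhere; in pipelines.py one could pass something like Path(__file__).resolve().parent as the base.

```python
from pathlib import Path
import tempfile


def data_file(base_dir, name):
    """Return base_dir/data/name, creating the data directory if needed."""
    data_dir = Path(base_dir) / 'data'
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir / name


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = data_file(tmp, 'test.txt')
    with open(target, 'a', encoding='utf-8') as f:
        f.write('多云\n')
    written = target.read_text(encoding='utf-8')

print(target.name, repr(written))
```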

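# JSON storage
class W2json(object):
    def process_item(self, item, spider):
        '''
        Save the scraped data as JSON
        so that other programs can consume it easily
        '''
        # Alternative: build the path from the current working directory
        # base_dir = os.getcwd()
        # filename = os.path.join(base_dir, 'data', 'weather.json')
        filename = r'E:\Python\weather\weather\data\weather.json'
        # Open the file in append mode and write one json.dumps() line per item.
        # ensure_ascii=False is required; otherwise non-ASCII characters are
        # stored as escape sequences such as "\u6df1" instead of readable text.
        with codecs.open(filename, 'a', encoding='utf-8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)
        return item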
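
Because W2json appends one object per line, the resulting file is a sequence of JSON lines rather than a single JSON document, so it has to be read back line by line. The sketch below illustrates that, along with the effect of ensure_ascii. The record is made-up sample data, and an in-memory buffer stands in for weather.json.

```python
import io
import json

record = {'date': '3日（今天）', 'weather': '多云'}  # made-up sample item

escaped = json.dumps(record)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters are kept as-is

# Emulate the output file with an in-memory buffer: one JSON object per line.
buffer = io.StringIO()
for _ in range(2):
    buffer.write(readable + '\n')

# Reading back: parse each non-empty line separately.
loaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line]

print(escaped)
print(readable)
print(loaded)
```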

# MySQL storage
class W2mysql(object):
    def process_item(self, item, spider):
        '''
        Save the scraped data to MySQL
        '''
        date = item['date']
        temperature = item['temperature']
        weather = item['weather']
        wind = item['wind']
        connection = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password='root',
            database='scrapy',
            charset='utf8mb4',  # MySQL's charset name; 'utf-8' is not accepted
            cursorclass=pymysql.cursors.DictCursor
        )
        try:
            with connection.cursor() as cursor:
                # Parameterized INSERT statement
                sql = """INSERT INTO `weather` (date, temperature, weather, wind) VALUES (%s, %s, %s, %s)"""
                cursor.execute(sql, (date, temperature, weather, wind))
            connection.commit()
        finally:
            connection.close()
        return item
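
W2mysql opens and closes a connection for every item. Scrapy pipelines may also define open_spider and close_spider hooks, which lets one connection be shared for the whole crawl. The sketch below shows that pattern under loud assumptions: sqlite3 with an in-memory database is swapped in for pymysql so the example runs without a MySQL server, the class name W2sqlite and the table definition are hypothetical, and the hooks are driven by hand rather than by Scrapy.

```python
import sqlite3


class W2sqlite(object):
    """Hypothetical pipeline: one connection per crawl instead of per item."""

    def open_spider(self, spider):
        # Called once by Scrapy when the spider starts.
        self.connection = sqlite3.connect(':memory:')
        self.connection.execute(
            'CREATE TABLE IF NOT EXISTS weather '
            '(date TEXT, temperature TEXT, weather TEXT, wind TEXT)'
        )

    def process_item(self, item, spider):
        # Parameterized query; sqlite3 uses ? where pymysql uses %s.
        self.connection.execute(
            'INSERT INTO weather (date, temperature, weather, wind) '
            'VALUES (?, ?, ?, ?)',
            (item['date'], item['temperature'], item['weather'], item['wind']),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once by Scrapy when the spider finishes.
        self.connection.close()


# Drive the hooks by hand, in the order Scrapy would call them.
pipeline = W2sqlite()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {'date': 'day 1', 'temperature': '26', 'weather': 'cloudy', 'wind': '3-4'},
    spider=None,
)
stored = pipeline.connection.execute('SELECT * FROM weather').fetchall()
pipeline.close_spider(spider=None)
print(stored)
```

With pymysql the same structure would apply: connect in open_spider, insert in process_item, close in close_spider.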

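Then enable the pipelines in settings.py:

'''
Log levels:
    ERROR   : errors
    WARNING : warnings
    INFO    : general information
    DEBUG   : debugging information
The default level is DEBUG.
'''

LOG_LEVEL = 'INFO'

# Lower values run first; distinct values make the order explicit
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
    'weather.pipelines.W2json': 400,
    'weather.pipelines.W2mysql': 500,
}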
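
The numbers in ITEM_PIPELINES determine the order in which items pass through the pipelines: lower values run first, and values are conventionally chosen in the 0-1000 range. The sketch below only mimics that ordering with a plain sort; it is not Scrapy's own code, and the distinct value 500 for W2mysql is chosen here for illustration.

```python
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,
    'weather.pipelines.W2json': 400,
    'weather.pipelines.W2mysql': 500,  # illustrative value
}

# Sort the class paths by their priority number, lowest first.
run_order = [
    path for path, priority in
    sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
]
print(run_order)
```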

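The three classes above demonstrate three ways of storing the data.

Finally, run scrapy crawl weather to produce all three outputs.

Lastly, write a scheduled crawl task: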

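# -*- coding: utf-8 -*-
# @Time    : 2019/8/3 15:38
# @Author  : wujf
# @Email   : 1028540310@qq.com
# @File    : 定时爬虫.py
# @Software: PyCharm
'''
Method 1: loop with sleep
'''
# import time
# import os
# while True:
#     os.system('scrapy crawl weather')
#     time.sleep(3)  # wait 3 seconds between runs

# Method 2: invoke Scrapy's command line from Python.
# Note: cmdline.execute() runs the crawl once and then exits the process,
# so by itself it does not repeat; the script has to be started on a schedule.
from scrapy import cmdline
import os

# retal = os.getcwd()  # get the current directory
# print(retal)
# Change into the project directory, because scrapy crawl weather
# only works from inside a Scrapy project
os.chdir(r'E:\Python\weather\weather')
cmdline.execute(['scrapy', 'crawl', 'weather'])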
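
Since cmdline.execute() ends the process after a single crawl, a repeating schedule needs a loop that starts each crawl as a separate process, in the spirit of the first method. The following is a minimal sketch, not the author's method: run_periodically and its parameters are hypothetical names, and the demonstration replaces the real runner and sleep with stand-ins so that nothing is actually launched.

```python
import subprocess
import time


def run_periodically(command, interval_seconds, max_runs=None,
                     runner=subprocess.run, sleep=time.sleep, cwd=None):
    """Run command every interval_seconds; stop after max_runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        runner(command, cwd=cwd)  # with subprocess.run this blocks until the crawl ends
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


# Dry run with stand-ins: record the commands instead of executing them.
calls = []
completed = run_periodically(
    ['scrapy', 'crawl', 'weather'], interval_seconds=3600, max_runs=3,
    runner=lambda command, cwd=None: calls.append(command),
    sleep=lambda seconds: None,
)
print(completed, calls)
```

For a real schedule one would call run_periodically(['scrapy', 'crawl', 'weather'], 3600, cwd=...) with the project directory as cwd and leave max_runs unset, so it loops until interrupted.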

There is also the downloader middleware, but I have no proxy IPs on hand, so I can't try it out for now.
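
For reference, a proxy middleware usually amounts to setting request.meta['proxy'] in process_request, which Scrapy's built-in HttpProxyMiddleware then honours. The sketch below is untested against real proxies and rests on assumptions: RandomProxyMiddleware is a hypothetical class, the addresses in PROXIES are placeholders, and FakeRequest stands in for scrapy.Request so the example runs without Scrapy installed.

```python
import random


class RandomProxyMiddleware(object):
    """Hypothetical downloader middleware that picks a proxy per request."""

    # Placeholder addresses; replace them with proxies you actually control.
    PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']

    def process_request(self, request, spider):
        # The chosen proxy is stored in the request's meta dictionary.
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None  # None tells Scrapy to keep processing the request


class FakeRequest(object):
    """Stand-in for scrapy.Request, only for this demonstration."""

    def __init__(self):
        self.meta = {}


request = FakeRequest()
result = RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta['proxy'])
```

In a real project such a class would be registered through the DOWNLOADER_MIDDLEWARES setting in settings.py.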

OK, that's all!
