Scrapy Pipeline Practice

class WeatherPipeline(object):
    """Debug pipeline: print every item and pass it on unchanged."""

    def process_item(self, item, spider):
        print(item)
        return item
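
Each `process_item` returns the item so that the next enabled pipeline receives it. The hand-off can be sketched in plain Python without Scrapy. `PrintPipeline`, `UppercaseCityPipeline`, and `run_pipelines` are hypothetical names used only for illustration, and Scrapy's real item processing is more involved than this loop.

```python
class PrintPipeline:
    """Toy stand-in for WeatherPipeline: print the item and pass it on."""

    def process_item(self, item, spider):
        print(item)
        return item


class UppercaseCityPipeline:
    """Hypothetical second stage that modifies the item it receives."""

    def process_item(self, item, spider):
        item = dict(item)  # work on a copy
        item["city"] = item["city"].upper()
        return item


def run_pipelines(item, pipelines, spider=None):
    # Lower priority numbers run first, mirroring how ITEM_PIPELINES is ordered.
    for _, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        item = pipeline.process_item(item, spider)
    return item


result = run_pipelines(
    {"city": "beijing"},
    [(400, UppercaseCityPipeline()), (300, PrintPipeline())],
)
```

If a stage forgot to `return item`, the next stage would receive `None`, which is why every `process_item` in this post ends with a return.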

# Insert into Redis
import json  # only used by the commented-out lpush/sadd variants below

import redis


class RedisPipeline(object):

    def __init__(self, host, port, password):
        self.host = host
        self.port = port
        self.password = password

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('RE_HOST'),
            # Fall back to the standard Redis port when RE_PORT is not set.
            port=crawler.settings.getint('RE_PORT', 6379),
            password=crawler.settings.get('RE_PASS', 'xxxxx')
        )

    def open_spider(self, spider):
        pool = redis.ConnectionPool(host=self.host, password=self.password, port=self.port, db=3)
        self.client = redis.Redis(connection_pool=pool)

    def process_item(self, item, spider):
        # Store each item as a Redis hash keyed by city name.
        # hset(..., mapping=...) needs redis-py 3.5+; hmset() is deprecated.
        self.client.hset(item['city'], mapping=dict(item))
        # Alternatives: push the item as JSON onto a list, or add it to a set.
        # self.client.lpush('weather', json.dumps(dict(item)))
        # self.client.sadd('weathers', json.dumps(dict(item)))
        return item
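
redis-py only accepts bytes, strings, and numbers as hash values, so writing `dict(item)` directly works only while every field is a plain scalar. A minimal sketch of a guard follows. `to_redis_mapping` is a hypothetical helper, and the `tags` and `note` fields are invented for the example; only `city` and `hqiwen` come from the item used in this post.

```python
import json


def to_redis_mapping(item):
    """Flatten an item into values that a Redis hash can store."""
    mapping = {}
    for key, value in dict(item).items():
        if isinstance(value, (str, bytes)):
            mapping[key] = value
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            mapping[key] = str(value)
        else:
            # Lists, dicts, None, and booleans are serialized as JSON text.
            mapping[key] = json.dumps(value, ensure_ascii=False)
    return mapping


example = to_redis_mapping(
    {"city": "北京", "hqiwen": 30, "tags": ["晴"], "note": None}
)
```

Inside the pipeline, the result could be passed as the mapping argument in place of `dict(item)`.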

# Insert into MongoDB
import pymongo


class MongoPipeline(object):
    collection_name = 'tianqi'

    def __init__(self, mongo_host, mongo_db):
        self.mongo_host = mongo_host
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_host=crawler.settings.get('MO_HOST'),
            mongo_db=crawler.settings.get('MO_DB', 'weather')
        )

    def open_spider(self, spider):
        # One client per spider run; closed again in close_spider.
        self.client = pymongo.MongoClient(host=self.mongo_host)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert a copy so the _id added by MongoDB does not leak into the item.
        self.db[self.collection_name].insert_one(dict(item))
        return item

# Insert into a MySQL database
import pymysql


class MysqlPipeline(object):

    def __init__(self, host, username, password, database, port, charset):
        self.host = host
        self.username = username
        self.password = password
        self.database = database
        self.port = port
        self.charset = charset

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MY_HOST'),
            username=crawler.settings.get('MY_USER'),
            password=crawler.settings.get('MY_PASS'),
            database=crawler.settings.get('MY_DATA'),
            # pymysql expects an integer port; 3306 is the MySQL default.
            port=crawler.settings.getint('MY_PORT', 3306),
            charset=crawler.settings.get('MY_CHARSET'),
        )

    def open_spider(self, spider):
        self.client = pymysql.connect(
            host=self.host,
            user=self.username,
            password=self.password,
            database=self.database,
            port=self.port,
            charset=self.charset,
        )
        self.cursor = self.client.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.client.close()

    def process_item(self, item, spider):
        # Parameterized query: pymysql escapes the values itself.
        self.cursor.execute(
            "INSERT INTO weather (`sheng`,`city`,`hqiwen`,`lqiwen`) VALUES (%s,%s,%s,%s)",
            (item['sheng'], item['city'], item['hqiwen'], item['lqiwen']),
        )
        self.client.commit()
        return item
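
The pipelines above read their connection details from the project settings and only run once they are enabled in `ITEM_PIPELINES`. A minimal `settings.py` sketch might look like the following. The setting names are the ones the pipelines look up; the module path `weather.pipelines`, the priorities, and every value are placeholder assumptions to adapt to your own project.

```python
# settings.py (sketch): all values below are placeholders.

# Lower numbers run first; the path assumes a project package named "weather".
ITEM_PIPELINES = {
    "weather.pipelines.WeatherPipeline": 300,
    "weather.pipelines.RedisPipeline": 301,
    "weather.pipelines.MongoPipeline": 302,
    "weather.pipelines.MysqlPipeline": 303,
}

# Redis
RE_HOST = "127.0.0.1"
RE_PORT = 6379
RE_PASS = None  # None means the server requires no password

# MongoDB
MO_HOST = "127.0.0.1"
MO_DB = "weather"

# MySQL
MY_HOST = "127.0.0.1"
MY_USER = "root"
MY_PASS = "change-me"
MY_DATA = "weather_db"
MY_PORT = 3306  # keep this an integer
MY_CHARSET = "utf8mb4"
```

Disabling a backend is then a matter of removing its entry from `ITEM_PIPELINES`; the pipeline classes themselves do not need to change.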
