Scrapy MongoDB Storage
To store scraped items in MongoDB, define an item pipeline in pipelines.py and enable it in settings.py.
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class JobsPipeline(object):  # (replace Jobs with your own project name)
    def process_item(self, item, spider):
        return item


class MongoPipeline(object):
    collection_name = 'collection_name'  # name of the MongoDB collection

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert
        self.db[self.collection_name].insert_one(dict(item))
        return item
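The pipeline converts each item with dict(item), so any regular Scrapy Item passes through unchanged. As a minimal sketch of the items side (the JobItem name and its fields are hypothetical, not taken from the original project), items.py could look like this:

# items.py -- hypothetical item definition for illustration
import scrapy


class JobItem(scrapy.Item):
    # every Field below becomes a key in the stored MongoDB document
    title = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()

Since MongoDB is schemaless, adding another scrapy.Field() later requires no migration on the storage side.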
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for Jobs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Jobs'

SPIDER_MODULES = ['Jobs.spiders']
NEWSPIDER_MODULE = 'Jobs.spiders'
DOWNLOAD_DELAY = 0
ITEM_PIPELINES = {
# 'zhaopin.pipelines.ZhaopinPipeline': 300,
'Jobs.pipelines.MongoPipeline': 400
}
# These names must match what from_crawler reads; a mismatch (e.g. MONGO_URL
# vs MONGO_URI) makes settings.get() return None, and pymongo then fails with
# "TypeError: name must be an instance of str".
MONGO_URI = 'localhost'
MONGO_DATABASE = 'shuju'  # database name

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Jobs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Jobs.middlewares.JobsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Jobs.middlewares.JobsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'Jobs.pipelines.JobsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Set the log level to ERROR; messages below this level are suppressed
LOG_LEVEL = 'ERROR'

# Disable logging entirely; nothing is printed at all
# LOG_ENABLED = False

# Keep Chinese characters as-is instead of escaping them as Unicode
FEED_EXPORT_ENCODING = 'utf-8'
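Once a crawl has run, a quick pymongo check confirms the documents arrived. This is a sketch under the settings above: 'shuju' comes from MONGO_DATABASE, and the collection name must match whatever you set as MongoPipeline.collection_name.

# check_db.py -- sanity-check the stored items
import pymongo

client = pymongo.MongoClient('localhost')  # same value as MONGO_URI
db = client['shuju']                       # same value as MONGO_DATABASE
coll = db['collection_name']               # must match MongoPipeline.collection_name

print(coll.count_documents({}))            # number of stored items
for doc in coll.find().limit(3):           # peek at the first few documents
    print(doc)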