1. Using MongoDB with Scrapy

https://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb

Write items to MongoDB

In this example we’ll write items to MongoDB using pymongo. MongoDB address and database name are specified in Scrapy settings; MongoDB collection is named after item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly:

    import pymongo

    class MongoPipeline(object):

        collection_name = 'scrapy_items'

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item
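
For the pipeline to run, it must be enabled and the two settings defined. A minimal settings.py sketch — myproject.pipelines is an assumed module path and 300 an arbitrary priority:

    # settings.py -- sketch; adjust the module path to your own project
    ITEM_PIPELINES = {
        'myproject.pipelines.MongoPipeline': 300,   # assumed location of the class above
    }
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'items'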

2. MongoDB Tutorial

https://api.mongodb.com/python/current/tutorial.html

Create a data directory and start a MongoDB instance:

    C:\Users\win7>mongod --dbpath e:\mongodb\db

Connect to the database:

    from pymongo import MongoClient
    client = MongoClient()
    # client = MongoClient('localhost', 27017)
    # client = MongoClient('mongodb://localhost:27017/')

    db = client.test_database
    # db = client['test-database']

Insert documents one by one into a collection (the MongoDB equivalent of a table):

    posts = db.posts
    # posts = db['posts']

    import datetime
    post = {"author": "Mike",
            "text": "My first blog post!",
            "tags": ["mongodb", "python", "pymongo"],
            "date": datetime.datetime.utcnow()}

    post2 = {"author": "Martin",
             "text": "My second blog post!",
             "tags": ["mongodb", "python", "pymongo"],
             "date": datetime.datetime.utcnow()}

    # Equivalent to: result = posts.insert_one(post); post_id = result.inserted_id.
    # insert_many instead exposes inserted_ids, which returns a list.
    post_id = posts.insert_one(post).inserted_id
    posts.insert_one(post2)
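
The returned post_id can be used to fetch the document back by _id; note (per the tutorial) that a string is not an ObjectId, so querying with the string form finds nothing:

    import pprint
    pprint.pprint(posts.find_one({"_id": post_id}))  # matches the saved ObjectId

    posts.find_one({"_id": str(post_id)})  # None: the string form does not match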

Inserting duplicate documents is allowed

insert_one mutates post3 in place, adding an _id field; running posts.insert_one(post3) a second time therefore fails with a duplicate ObjectId error.

If you run post4 = post3.copy() before inserting post3, the copy has no _id yet, so a document with identical content can be inserted (a sketch of both outcomes follows the session below).

    In [689]: post3 = {"author": "Mike",
         ...:          "text": "My first blog post!",
         ...:          "tags": ["mongodb", "python", "pymongo"],
         ...:          "date": datetime.datetime.utcnow()}

    In [690]: posts.insert_one(post3)
    Out[690]: <pymongo.results.InsertOneResult at 0xb803788>

    In [691]: post3
    Out[691]:
    {'_id': ObjectId('59e57919fca565500c8e3692'),
     'author': 'Mike',
     'date': datetime.datetime(2017, 10, 17, 3, 29, 14, 966000),
     'tags': ['mongodb', 'python', 'pymongo'],
     'text': 'My first blog post!'}
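
A minimal sketch of both claims above; DuplicateKeyError is the exception PyMongo raises for the repeated _id:

    from pymongo.errors import DuplicateKeyError

    post4 = post3.copy()
    del post4['_id']          # copied after the insert here, so drop the _id it picked up
    posts.insert_one(post4)   # succeeds: identical content, fresh ObjectId

    try:
        posts.insert_one(post3)   # post3 still carries its old _id
    except DuplicateKeyError as e:
        print e                   # E11000 duplicate key error ...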

Check and verify:

    db.collection_names(include_system_collections=False)

    posts.count()

    import pprint
    # find_one() returns at most one document matching the filter;
    # with no filter it simply gets the first document in the posts collection
    pprint.pprint(posts.find_one())

    posts.find_one({"author": "Mike"})

    # find() returns a Cursor instance, which allows us to iterate over all
    # matching documents; it also accepts a filter, e.g. posts.find({"author": "Mike"})
    for i in posts.find():
        print i

c:\program files\anaconda2\lib\site-packages\pymongo\cursor.py

A cursor / iterator over Mongo query results.

    In [707]: posts.find()
    Out[707]: <pymongo.cursor.Cursor at 0x118a62b0>

    In [708]: a = posts.find()

    In [709]: a?
    Type:           Cursor
    String form:    <pymongo.cursor.Cursor object at 0x00000000116C6208>
    File:           c:\program files\anaconda2\lib\site-packages\pymongo\cursor.py
    Docstring:
    A cursor / iterator over Mongo query results.

    Init docstring:
    Create a new cursor.

    Should not be called directly by application developers - see
    :meth:`~pymongo.collection.Collection.find` instead.

    .. mongodoc:: cursors

On encoding:

MongoDB stores data in BSON format. BSON strings are UTF-8 encoded, and PyMongo decodes each BSON string to a Python unicode string, not a regular str. On write, a str is stored unchanged while a unicode string is automatically encoded to UTF-8; on read, everything comes back decoded as unicode.

  1. post = {"author": "Mike",
  2.  
  3. {u'_id': ObjectId('...'),
  4. u'author': u'Mike',
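
A round trip makes the rule concrete (a sketch; enc_demo is a scratch collection used only for this demo):

    db.enc_demo.insert_one({"tag": "plain str"})  # str goes in unchanged
    db.enc_demo.find_one()
    # {u'_id': ObjectId('...'), u'tag': u'plain str'}  <- decoded to unicode on the way out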

Bulk Inserts: insert many documents at once; each document can carry different fields, which is why MongoDB is described as schema-free.

    >>> new_posts = [{"author": "Mike",
    ...               "text": "Another post!",
    ...               "tags": ["bulk", "insert"],
    ...               "date": datetime.datetime(2009, 11, 12, 11, 14)},
    ...              {"author": "Eliot",
    ...               "title": "MongoDB is fun",
    ...               "text": "and pretty easy too!",
    ...               "date": datetime.datetime(2009, 11, 10, 10, 45)}]
    >>> result = posts.insert_many(new_posts)
    >>> result.inserted_ids
    [ObjectId('...'), ObjectId('...')]

Counting documents:

    posts.count()
    posts.find({"author": "Mike"}).count()

##Range Queries (advanced queries)
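
For reference, the tutorial's range query on the posts collection combines a comparison operator with a sort (sketch):

    d = datetime.datetime(2009, 11, 12, 12)
    for post in posts.find({"date": {"$lt": d}}).sort("author"):
        pprint.pprint(post)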

##Indexing
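
And its indexing example creates a unique index on a separate profiles collection (sketch):

    import pymongo
    result = db.profiles.create_index([('user_id', pymongo.ASCENDING)],
                                      unique=True)
    sorted(list(db.profiles.index_information()))
    # [u'_id_', u'user_id_1']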

#Aggregation Examples

https://api.mongodb.com/python/current/examples/aggregation.html

    from pymongo import MongoClient
    db = MongoClient().aggregation_example
    result = db.things.insert_many([{"x": 1, "tags": ["dog", "cat"]},
                                    {"x": 2, "tags": ["cat"]},
                                    {"x": 2, "tags": ["mouse", "cat"]},
                                    {"x": 3, "tags": []}])
    result.inserted_ids

Getting the $sort document wrong raises: OperationFailure: $sort key ordering must be 1 (for ascending) or -1 (for descending). Since a plain Python dict does not preserve key order, the example builds the multi-key $sort with a SON ordered document:

    from bson.son import SON
    pipeline = [
        {"$unwind": "$tags"},   # tags is an array; unwind it into one doc per tag
        {"$group": {"_id": "$tags", "count": {"$sum": 1}}},  # group by tag, i.e. unique values
        {"$sort": SON([("count", -1), ("_id", 1)])}  # count descending, then _id ascending
    ]

SON is an ordered dict:

    In [773]: SON?
    Init signature: SON(cls, *args, **kwargs)
    Docstring:
    SON data.

    A subclass of dict that maintains ordering of keys and provides a
    few extra niceties for dealing with SON. SON objects can be
    converted to and from BSON.

    In [779]: db.things.aggregate(pipeline)
    Out[779]: <pymongo.command_cursor.CommandCursor at 0x118a6cc0>

    In [780]: list(db.things.aggregate(pipeline))  # list() materializes the cursor
    Out[780]:
    [{u'_id': u'cat', u'count': 3},
     {u'_id': u'dog', u'count': 1},
     {u'_id': u'mouse', u'count': 1}]

Map/Reduce

Copying a Database (backup)

https://api.mongodb.com/python/current/examples/copydb.html#copying-a-database

    from pymongo import MongoClient
    client = MongoClient()

    client.admin.command('copydb',
                         fromdb='test_database',
                         todb='test_database_bak')
    # {u'ok': 1.0}
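
A quick sanity check on the copy (a sketch; posts is the collection populated earlier in test_database):

    client.test_database.posts.count()
    client.test_database_bak.posts.count()  # should equal the count above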

For copying between servers and with password authentication, see the original docs.

#Bulk Write Operations: InsertOne, DeleteMany, ReplaceOne, UpdateOne

Bulk Insert

https://api.mongodb.com/python/current/examples/bulk.html

    import pymongo
    db = pymongo.MongoClient().bulk_example
    db.test.insert_many([{'i': i} for i in range(10000)]).inserted_ids

    db.test.count()

Mixed Bulk Write Operations

  • 1/2 Ordered Bulk Write Operations

Ordered bulk write operations are batched and sent to the server in the order provided for serial execution.

    from pprint import pprint
    from pymongo import InsertOne, DeleteMany, ReplaceOne, UpdateOne  # request classes
    # per the help text, the requests can also be built up first:
    # requests = [InsertOne({'y': 1}), ...]
    result = db.test.bulk_write([
        DeleteMany({}),              # each entry is a request instance
        InsertOne({'_id': 1}),
        InsertOne({'_id': 2}),
        InsertOne({'_id': 3}),
        UpdateOne({'_id': 1}, {'$set': {'foo': 'bar'}}),
        UpdateOne({'_id': 4}, {'$inc': {'j': 1}}, upsert=True),  # insert if no match
        ReplaceOne({'j': 1}, {'j': 2})])  # could equally match {'j': 2} and replace with {'i': 5}
    pprint(result.bulk_api_result)

    # {'nInserted': 3,
    #  'nMatched': 2,
    #  'nModified': 2,
    #  'nRemoved': 4,
    #  'nUpserted': 1,
    #  'upserted': [{u'_id': 4, u'index': 5}],
    #  'writeConcernErrors': [],
    #  'writeErrors': []}

    for i in db.test.find():
        print i

    # {u'_id': 1, u'foo': u'bar'}
    # {u'_id': 2}
    # {u'_id': 3}
    # {u'_id': 4, u'j': 2}

Empty a collection:

    In [844]: r = db.test.delete_many({})
    In [845]: r.deleted_count
    Out[845]: 4

Drop a collection:

    In [853]: db.name
    Out[853]: u'bulk_example'

    In [855]: db.collection_names()
    Out[855]: [u'test']

    In [860]: db.test.drop()  # returns nothing and raises no error; the form below is preferable

    In [861]: db.drop_collection('test')
    Out[861]:                 # 'ns not found': the collection was already dropped above
    {u'code': 26,
     u'codeName': u'NamespaceNotFound',
     u'errmsg': u'ns not found',
     u'ok': 0.0}

The first write failure that occurs (e.g. duplicate key error) aborts the remaining operations, and PyMongo raises BulkWriteError.

    >>> from pymongo import InsertOne, DeleteOne, ReplaceOne
    >>> from pymongo.errors import BulkWriteError
    >>> requests = [
    ...     ReplaceOne({'j': 2}, {'i': 5}),
    ...     InsertOne({'_id': 4}),  # Violates the unique key constraint on _id.
    ...     DeleteOne({'i': 5})]
    >>> try:
    ...     db.test.bulk_write(requests)
    ... except BulkWriteError as bwe:
    ...     pprint(bwe.details)
    ...
    {'nInserted': 0,
     'nMatched': 1,
     'nModified': 1,
     'nRemoved': 0,
     'nUpserted': 0,
     'upserted': [],
     'writeConcernErrors': [],
     'writeErrors': [{u'code': 11000,
                      u'errmsg': u'...E11000...duplicate key error...',
                      u'index': 1,
                      u'op': {'_id': 4}}]}
  • 2/2 Unordered Bulk Write Operations: operations are attempted in parallel and in no particular order, and any failures are reported only at the end.

    db.test.bulk_write(requests, ordered=False)
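
With the same requests list as above, a failure no longer aborts the remaining operations; the accumulated errors can still be inspected afterwards (sketch):

    try:
        db.test.bulk_write(requests, ordered=False)
    except BulkWriteError as bwe:
        pprint(bwe.details['writeErrors'])  # every failure, reported together at the end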

#Datetimes and Timezones

https://api.mongodb.com/python/current/examples/datetimes.html

Avoid storing the local time from datetime.datetime.now(); use UTC instead:

    import datetime

    result = db.objects.insert_one({"last_modified": datetime.datetime.utcnow()})
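
By default PyMongo hands back naive datetimes representing UTC; constructing the client with tz_aware=True makes reads return timezone-aware values instead (a minimal sketch reusing the objects collection above):

    from pymongo import MongoClient
    aware_db = MongoClient(tz_aware=True)[db.name]
    aware_db.objects.find_one()['last_modified'].tzinfo  # no longer None; a fixed UTC offset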

For reading and writing datetimes across time zones, see the original docs.

#GridFS Example: storing large binary objects such as files

This example shows how to use gridfs to store large binary objects (e.g. files) in MongoDB.

    from pymongo import MongoClient
    import gridfs

    db = MongoClient().gridfs_example
    fs = gridfs.GridFS(db)  # backed by the fs.files and fs.chunks collections

Reading and writing docs: str, unicode, and file-like objects

    In [883]: fs.get(fs.put('hello world')).read()
    Out[883]: 'hello world'

    In [885]: fs.get(fs.put(u'hello world')).read()
    TypeError: must specify an encoding for file in order to write unicode

    # writing unicode requires an explicit encoding; there is no default
    In [886]: fs.get(fs.put(u'hello world', encoding='utf-8')).read()
    Out[886]: 'hello world'

    # file-like object (anything with a read() method); extra attributes
    # such as filename and filetype are optional and user-defined
    In [888]: fs.get(fs.put(open('abc.txt'), filename='abc', filetype='txt')).read()
    Out[888]: 'def'

Compared with the first doc, the second has an extra encoding field, and the third has extra filename and filetype fields.

It is easiest to think of each doc here as a file.

    In [896]: for doc in fs.find():
         ...:     print doc.upload_date
         ...:
    2017-10-18 03:28:04
    2017-10-18 03:28:42.036000
    2017-10-18 03:29:01.740000

print dir(doc) shows the attributes available on each doc, among them:

    'aliases', 'chunk_size', 'close', 'content_type', 'filename', 'length', 'md5', 'metadata', 'name', 'read', 'readchunk', 'readline', 'seek', 'tell', 'upload_date'

    In [899]: doc?
    Type:           GridOut
    String form:    <gridfs.grid_file.GridOut object at 0x000000000AB2B8D0>
    File:           c:\program files\anaconda2\lib\site-packages\gridfs\grid_file.py
    Docstring:
    Class to read data out of GridFS.

    Init docstring:
    Read a file from GridFS
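
A stored doc can also be retrieved by its filename; get_last_version returns the newest file stored under that name (sketch, reusing the 'abc' name from above):

    fs.get_last_version('abc').read()
    # 'def'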
