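I. Working with MongoDB in Python 3

　　1. Prerequisites


The pymongo library is installed.

The MongoDB server is running (if you started it in the foreground, keep that window open: closing the window also shuts the server down).

　　3. Usage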

import pymongo

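# To connect to MongoDB, use pymongo's MongoClient; passing MongoDB's IP address and port is usually enough.
# The first argument is host and the second is port, which defaults to 27017.
client=pymongo.MongoClient(host='127.0.0.1',port=27017)
# This gives you a client object.
# The first argument, host, can also be a MongoDB connection string beginning with mongodb://,
# for example: client = pymongo.MongoClient('mongodb://localhost:27017/') connects the same way.
# print(client)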
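As a minimal sketch of the connection step above, the helper below builds the connection string and then checks that the server is actually reachable. The names `build_mongo_uri` and `connect` and the 2-second timeout are assumptions for illustration, not part of the original post. MongoClient connects lazily, so the `ping` command is what surfaces an unreachable server.

```python
def build_mongo_uri(host="127.0.0.1", port=27017):
    """Build a mongodb:// connection string from a host and a port."""
    return f"mongodb://{host}:{port}/"


def connect(host="127.0.0.1", port=27017, timeout_ms=2000):
    """Return a MongoClient after confirming the server answers a ping."""
    # Imported here so the URI helper stays usable without pymongo installed.
    import pymongo

    client = pymongo.MongoClient(
        build_mongo_uri(host, port),
        serverSelectionTimeoutMS=timeout_ms,  # fail fast instead of waiting 30 s
    )
    client.admin.command("ping")  # raises if no server can be selected in time
    return client
```

Calling `connect()` with the server stopped should raise a pymongo error after roughly the timeout, which makes a forgotten server start easier to notice.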

################### Select a database
db=client.test
# Equivalent form:
# db=client['test']

################## Select a collection
collections=db.student
# Equivalent form:
# collections=db['student']

################### Insert data
# student={
#     'id':'1111',
#     'name':'xiaowang',
#     'age':20,
#     'sex':'boy',
# }
#
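# res=collections.insert_one(student)
# print(res.inserted_id)
# Every document in MongoDB has an _id field that uniquely identifies it.
# If you do not set _id explicitly, an ObjectId-typed _id is generated automatically.
# insert_one() returns an InsertOneResult; its inserted_id attribute holds the _id, e.g. 5c7fb5ae35573f14b85101c0.
# (The legacy insert() returned the _id directly; it was deprecated in PyMongo 3.0 and removed in 4.0.)

# You can also insert several documents at once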
# student1={
#     'name':'xx',
#     'age':20,
#     'sex':'boy'
# }
#
# student2={
#     'name':'ww',
#     'age':21,
#     'sex':'girl'
# }
# student3={
#     'name':'xxx',
#     'age':22,
#     'sex':'boy'
# }
#
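# result=collections.insert_many([student1,student2,student3])
# print(result)
# The return value here is an InsertManyResult object, not the _id values themselves;
# read its inserted_ids attribute to get the list of _id values.

# There are two recommended insert methods:
# insert_one() inserts a single document and insert_many() inserts several, passed as a list.
# The legacy insert() accepted either a single document or a list of documents, but it no longer exists in PyMongo 4.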
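The two insert methods described above can be wrapped in one small helper. This is only a sketch: `insert_students` is a hypothetical name, and `collection` is assumed to be a PyMongo collection such as `db.student`.

```python
def insert_students(collection, students):
    """Insert one dict or a list of dicts and return the new _id values as a list."""
    if isinstance(students, dict):
        # Single document: insert_one() returns an InsertOneResult.
        result = collection.insert_one(students)
        return [result.inserted_id]
    # Several documents: insert_many() returns an InsertManyResult.
    result = collection.insert_many(list(students))
    return list(result.inserted_ids)
```

Returning a list in both cases means the caller does not need to care which method was used.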

################### Query: find a single document
# re=collections.find_one({'name':'xx'})
# print(re)
# print(type(re))
#{'_id': ObjectId('5c7fb8d535573f13f85a6933'), 'name': 'xx', 'age': 20, 'sex': 'boy'}
# <class 'dict'>

##################### Find multiple documents
# re=collections.find({'name':'xx'})
# print(re)
# print(type(re))
# for r in re:
#     print(r)
# The result is a Cursor object (an iterable, not a list); iterate over it to get each document.
# <pymongo.cursor.Cursor object at 0x000000000A98E630>
# <class 'pymongo.cursor.Cursor'>

# re=collections.find({'age':{'$gt':20}})
# print(re)
# print(type(re))
# for r in re:
#     print(r)
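# Here the query value is no longer a plain number but a dict: its key is the comparison operator $gt (greater than)
# and its value is 20, so the query returns every document whose age is greater than 20.

# The comparison operators are summarized below:
"""
Operator  Meaning                 Example
$lt       less than               {'age': {'$lt': 20}}
$gt       greater than            {'age': {'$gt': 20}}
$lte      less than or equal      {'age': {'$lte': 20}}
$gte      greater than or equal   {'age': {'$gte': 20}}
$ne       not equal               {'age': {'$ne': 20}}
$in       in the given list       {'age': {'$in': [20, 23]}}
$nin      not in the given list   {'age': {'$nin': [20, 23]}}
"""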

# Query with a regular expression
# re = collections.find({'name': {'$regex': '^x.*'}})
# print(re)
# print(type(re))
# for r in re:
#     print(r)

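# Some other operators are summarized below:
"""
Operator  Meaning              Example                                             Example meaning
$regex    regex match          {'name': {'$regex': '^M.*'}}                        name starts with M
$exists   field exists         {'name': {'$exists': True}}                         the name field exists
$type     type check           {'age': {'$type': 'int'}}                           age is of type int
$mod      modulo               {'age': {'$mod': [5, 0]}}                           age modulo 5 equals 0
$text     text search          {'$text': {'$search': 'Mike'}}                      a text-indexed field contains the string Mike
$where    advanced condition   {'$where': 'obj.fans_count == obj.follows_count'}   fans count equals follows count
"""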

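################ Count
# Cursor.count() was removed in PyMongo 4.0; use count_documents() on the collection instead.
# count=collections.count_documents({'age':{'$gt':20}})
# print(count)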
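A hedged sketch of a counting helper follows. `count_students` is a hypothetical name; `count_documents()` and `estimated_document_count()` are PyMongo collection methods (available since PyMongo 3.7). The estimated count reads collection metadata, so it is fast but accepts no filter.

```python
def count_students(collection, min_age=None):
    """Count documents, optionally only those with age greater than min_age."""
    if min_age is None:
        # No filter: a metadata-based estimate of the whole collection.
        return collection.estimated_document_count()
    # With a filter: an exact count of the matching documents.
    return collection.count_documents({"age": {"$gt": min_age}})
```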

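################# Sort
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING)
# print([re['name'] for re in result])

########### Offset: to take only some of the results, use skip() to skip a number of documents; for example, skip(2) ignores the first two and returns the third onward.
# result=collections.find({'age':{'$gt':20}}).sort('age',pymongo.ASCENDING).skip(1)
# print([re['name'] for re in result])

################## You can also use limit() to cap the number of results returned, for example:
# results = collections.find().sort('age', pymongo.ASCENDING).skip(1).limit(2)
# print([result['name'] for result in results])

# Note that when the collection is very large (tens or hundreds of millions of documents), you should avoid large offsets:
# the server still has to walk past every skipped document, which is slow and can exhaust memory.
# Instead, remember the _id of the last document from the previous query and use something like
# find({'_id': {'$gt': ObjectId('593278c815c2602678bb2b8d')}}).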
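The _id-based approach suggested above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that _id values increase with insertion order closely enough for paging, which is the usual behaviour of auto-generated ObjectIds.

```python
def next_page_filter(last_id=None, base_filter=None):
    """Build a filter that selects documents after the last _id already seen."""
    query = dict(base_filter or {})  # copy so the caller's filter is not modified
    if last_id is not None:
        query["_id"] = {"$gt": last_id}
    return query


def fetch_page(collection, page_size, last_id=None, base_filter=None):
    """Return the next page of documents, ordered by _id (1 means ascending)."""
    cursor = collection.find(next_page_filter(last_id, base_filter))
    return list(cursor.sort("_id", 1).limit(page_size))
```

A caller would pass the `_id` of the last document of each page as `last_id` for the next call, and stop when an empty list comes back.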

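################################ Updating data
# The legacy way to update data is the update() method
# condition={'name':'xx'}
# student=collections.find_one(condition)
# student['age']=100
# result=collections.update(condition,student)
# print(result)

# Here we update the age of the document whose name is xx: specify the query condition, fetch the document, change the age,
# then call update() with the original condition and the modified document to complete the update.
# {'ok': 1, 'nModified': 1, 'n': 1, 'updatedExisting': True}
# The result is a dict: ok means the operation succeeded, and nModified is the number of documents modified.

# update() is not recommended by the official documentation (deprecated in PyMongo 3.0, removed in 4.0). Use update_one() and update_many() instead; they are stricter:
# the second argument must use a $-prefixed update operator as the dict key. The examples below show how.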

# condition={'name':'xx'}
# student=collections.find_one(condition)
# print(student)
# student['age']=112
# result=collections.update_one(condition,{'$set':student})
# print(result)
# print(result.matched_count,result.modified_count)

# Another example
# condition={'age':{'$gt':20}}
# result=collections.update_one(condition,{'$inc':{'age':1}})
# print(result)
# print(result.matched_count,result.modified_count)
# Here the filter is age greater than 20
# and the update document is {'$inc': {'age': 1}}; after it runs, the first matching document has its age increased by 1.
# <pymongo.results.UpdateResult object at 0x000000000A99AB48>
# 1 1

# update_many() updates every matching document instead, for example:

condition = {'age': {'$gt': 20}}
result = collections.update_many(condition, {'$inc': {'age': 1}})
print(result)
print(result.matched_count, result.modified_count)
# This time more than one document matches; the output is:

# <pymongo.results.UpdateResult object at 0x10c6384c8>
# 3 3
# All matched documents are updated.
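The update_one() example earlier reads the whole document and writes it back with $set. When only one field changes, it is usually enough to send just that field, which also avoids the extra find_one() round trip. A minimal sketch, with `set_age` as a hypothetical helper name:

```python
def set_age(collection, name, new_age):
    """Set the age of the first document with the given name.

    Returns (matched_count, modified_count) from the UpdateResult.
    """
    result = collection.update_one(
        {"name": name},              # filter: which document to update
        {"$set": {"age": new_age}},  # update: touch only the age field
    )
    return result.matched_count, result.modified_count
```

If the stored age already equals `new_age`, the document is matched but not modified, so the two counts can differ.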

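# ############### Delete
# With the legacy API, deleting is simple: call remove() with a filter and every matching document is deleted, for example:

# result = collections.remove({'name': 'Kevin'})
# print(result)
# Output:

# {'ok': 1, 'n': 1}
# remove() is likewise deprecated (and removed in PyMongo 4.0); the recommended methods are delete_one() and delete_many(), for example: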

# result = collections.delete_one({'name': 'Kevin'})
# print(result)
# print(result.deleted_count)
# result = collections.delete_many({'age': {'$lt': 25}})
# print(result.deleted_count)
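# Output:

# <pymongo.results.DeleteResult object at 0x10e6ba4c8>
# 1
# 4
# delete_one() deletes the first matching document and delete_many() deletes all matching documents. Both return a DeleteResult,
# whose deleted_count attribute gives the number of documents deleted.

# More
# PyMongo also provides combined methods such as find_one_and_delete(), find_one_and_replace(), and find_one_and_update(),
# which find a document and then delete, replace, or update it; they are used in much the same way as the methods above.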

II. Scraping Tencent job postings

　　Spider file

# -*- coding: utf-8 -*-
import scrapy

from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['www.xxx.com']

    # Base URL used to build page URLs by concatenation
    base_url = 'http://hr.tencent.com/position.php?&start='
    page_num = 0
    start_urls = [base_url + str(page_num)]

    def parse(self, response):
        # Select the table rows that hold the job postings, then loop over them
        tr_list = response.xpath("//tr[@class='even' ] | //tr[@class='odd']")
        for tr in tr_list:
            name = tr.xpath('./td[1]/a/text()').extract_first()
            url = tr.xpath('./td[1]/a/@href').extract_first()

            # The job category cell is sometimes empty; if that raises an error, fall back to a placeholder like this:
            # if len(tr.xpath("./td[2]/text()")):
            #    worktype = tr.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            # else:
            #     worktype = "NULL"
            # If no error occurs, use this form (extract_first() returns None when nothing matches)
            worktype = tr.xpath('./td[2]/text()').extract_first()
            num = tr.xpath('./td[3]/text()').extract_first()
            location = tr.xpath('./td[4]/text()').extract_first()
            publish_time = tr.xpath('./td[5]/text()').extract_first()

            item = TencentItem()
            item['name'] = name
            item['worktype'] = worktype
            item['url'] = url
            item['num'] = num
            item['location'] = location
            item['publish_time'] = publish_time

            print('----', name)
            print('----', url)
            print('----', worktype)
            print('----', location)
            print('----', num)
            print('----', publish_time)

            yield item

        # Pagination, method 1:
        # Use this when the page offsets are known in advance.
        # It suits sites with no "next page" link, where pages can only be reached by building the URL.
        # if self.page_num<3060:
        #     self.page_num+=10
        #     url=self.base_url+str(self.page_num)
        #     # yield  scrapy.Request(url=url,callback=self.parse)
        #     yield  scrapy.Request(url, callback=self.parse)

        # Method 2:
        # Extract the "next page" link directly.
        # If the count below is 0, the next link is not disabled, so this is not the last page and crawling continues; otherwise stop.
        # Join the extracted href onto the site root to get the next page's URL.
        if len(response.xpath("//a[@id='next' and @class='noactive']")) == 0:
            next_url = response.xpath('//a[@id="next"]/@href').extract_first()
            # Guard against a missing href, which would make the concatenation fail
            if next_url:
                url = 'https://hr.tencent.com/' + next_url
                yield scrapy.Request(url=url, callback=self.parse)
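Method 2 builds the next URL by concatenating a fixed prefix with the extracted href. A slightly more robust variant is to resolve the href against the URL of the current page. The sketch below uses the standard library's `urljoin` for illustration; `next_page_url` is a hypothetical helper, and inside a spider the equivalent call would be Scrapy's `response.urljoin(href)`.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a (possibly relative) next-page href against the current page URL.

    Returns None when there is no href, so the caller can stop paginating.
    """
    if not href:
        return None
    return urljoin(current_url, href)


print(next_page_url('https://hr.tencent.com/position.php?&start=0',
                    'position.php?&start=10'))
# https://hr.tencent.com/position.php?&start=10
```

Because the href is resolved rather than concatenated, the same code keeps working if the site switches between relative and absolute links.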


　　pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import json
from redis import Redis
import pymongo


# Store to a local file
class TencentPipeline(object):
    f=None

    def open_spider(self,spider):
        self.f=open('./tencent2.txt','w',encoding='utf-8')

    def process_item(self, item, spider):
        # str() guards against fields that extract_first() left as None
        fields=('name','url','num','worktype','location','publish_time')
        self.f.write(':'.join(str(item[k]) for k in fields)+'\n')
        return item

    def close_spider(self,spider):
        self.f.close()


# Store to MySQL
class TencentPipelineMysql(object):
    conn=None
    cursor=None

    def open_spider(self,spider):
        self.conn=pymysql.connect(host='127.0.0.1',port=3306,user='root',password='',db='tencent')
        # Create the cursor once so close_spider() always has one to close
        self.cursor=self.conn.cursor()

    def process_item(self,item,spider):
        print('MySQL pipeline: processing item')
        try:
            # Pass the values as parameters so the driver escapes them (no string formatting into SQL)
            self.cursor.execute('insert into tencent values(%s,%s,%s,%s,%s,%s)',(item['name'],item['worktype'],item['url'],item['num'],item['publish_time'],item['location']))
            self.conn.commit()
        except Exception as e:
            print('Error:',e)
            self.conn.rollback()
        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()


# Store to Redis
class TencentPipelineRedis(object):
    conn=None

    def open_spider(self,spider):
        self.conn=Redis(host='127.0.0.1',port=6379)

    def process_item(self,item,spider):
        item_dic=dict(item)
        item_json=json.dumps(item_dic)
        self.conn.lpush('tencent',item_json)
        return item


# Store to MongoDB
class TencentPipelineMongo(object):
    client=None

    def open_spider(self,spider):
        self.client=pymongo.MongoClient(host='127.0.0.1',port=27017)
        self.db=self.client['test']

    def process_item(self,item,spider):
        collection = self.db['tencent']
        item_dic=dict(item)
        # insert_one() replaces the legacy insert(), which was removed in PyMongo 4.0
        collection.insert_one(item_dic)
        return item

    def close_spider(self,spider):
        self.client.close()
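For the MySQL pipeline, it is safer to pass scraped values to the driver as query parameters than to format them into the SQL string, because scraped text may contain quotes. The sketch below demonstrates the idea with the standard library's sqlite3 instead of pymysql, so that it runs without a database server; note that sqlite3 uses `?` placeholders where pymysql uses `%s`. The table layout and the `save_item` helper are assumptions for illustration.

```python
import sqlite3


def save_item(conn, item):
    """Insert one scraped item using bound parameters instead of string formatting."""
    conn.execute(
        "INSERT INTO tencent (name, worktype, url) VALUES (?, ?, ?)",
        (item["name"], item["worktype"], item["url"]),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE tencent (name TEXT, worktype TEXT, url TEXT)")

# A name containing double quotes and an empty category are stored safely.
save_item(conn, {"name": 'Engineer "A"', "worktype": None,
                 "url": "position_detail.php?id=1"})
rows = conn.execute("SELECT name, worktype, url FROM tencent").fetchall()
print(rows)  # [('Engineer "A"', None, 'position_detail.php?id=1')]
```

A value of None is stored as SQL NULL rather than as the string "None", which is usually what a later query expects.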


　　settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentPipelineMysql': 301,
    'Tencent.pipelines.TencentPipelineRedis': 302,
    'Tencent.pipelines.TencentPipelineMongo': 303,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
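The numbers in ITEM_PIPELINES determine the order in which each item passes through the pipelines: Scrapy runs them from the lowest value to the highest. The short sketch below only illustrates that ordering with plain Python; `pipeline_order` is a hypothetical helper, not a Scrapy function.

```python
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentPipelineMysql': 301,
    'Tencent.pipelines.TencentPipelineRedis': 302,
    'Tencent.pipelines.TencentPipelineMongo': 303,
}


def pipeline_order(pipelines):
    """Return pipeline paths in the order items flow through them (low value first)."""
    return [path for path, _priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_order(ITEM_PIPELINES):
    print(path)
```

With these values each item is written to the file first, then to MySQL, Redis, and MongoDB, which is why every process_item() must return the item for the next pipeline.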

　　item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name=scrapy.Field()
    url=scrapy.Field()
    worktype=scrapy.Field()
    location=scrapy.Field()
    num=scrapy.Field()
    publish_time=scrapy.Field()
