Scraping Douban movies into a MongoDB database, and countering anti-crawler measures

1. The code is as follows:

doubanmoive.py

# -*- coding: utf-8 -*-
import scrapy

from douban.items import DoubanItem


class DoubamovieSpider(scrapy.Spider):
    name = "doubanmovie"
    allowed_domains = ["movie.douban.com"]
    offset = 0
    url = "https://movie.douban.com/top250?start="
    start_urls = (
        url + str(offset),
    )

    def parse(self, response):
        movies = response.xpath("//div[@class='info']")
        for each in movies:
            # Create a fresh item per movie so that fields from the
            # previous movie (e.g. 'quote') do not leak into this one
            item = DoubanItem()
            # Title
            item['title'] = each.xpath(".//span[@class='title'][1]/text()").extract()[0]
            # Details
            item['bd'] = each.xpath(".//div[@class='bd']/p/text()").extract()[0]
            # Rating
            item['star'] = each.xpath(".//div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            # Short summary (not every movie has one)
            quote = each.xpath(".//p[@class='quote']/span/text()").extract()
            if len(quote) != 0:
                item['quote'] = quote[0]
            yield item

        # Request the next page until the last offset (225) has been reached
        if self.offset < 225:
            self.offset += 25
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
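The spider pages through the Top 250 list by advancing the `start` parameter from 0 to 225 in steps of 25, which gives ten pages of 25 movies each. A minimal sketch of the URLs this produces, using only the standard library; the helper name `top250_page_urls` is hypothetical and not part of the project:

```python
BASE_URL = "https://movie.douban.com/top250?start="


def top250_page_urls(page_size=25, last_offset=225):
    """Return the page URLs the spider visits, in order (hypothetical helper)."""
    return [BASE_URL + str(offset)
            for offset in range(0, last_offset + 1, page_size)]


urls = top250_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

The spider itself builds these requests one at a time inside `parse`, so a failed page stops the chain; listing them up front, as here, is only meant to make the offset arithmetic visible.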

items.py

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # Title
    title = scrapy.Field()
    # Details
    bd = scrapy.Field()
    # Rating
    star = scrapy.Field()
    # Short summary
    quote = scrapy.Field()

2. Change the storage location in the pipeline file

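import pymongo


class DoubanPipeline(object):

    def __init__(self, host, port, dbname, sheetname):
        # Create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection that stores the scraped data
        self.sheet = mydb[sheetname]

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf is deprecated and missing from newer Scrapy releases,
        # so the settings are read from the crawler instead
        settings = crawler.settings
        return cls(
            settings["MONGODB_HOST"],
            settings.getint("MONGODB_PORT"),
            settings["MONGODB_DBNAME"],
            settings["MONGODB_SHEETNAME"],
        )

    def process_item(self, item, spider):
        data = dict(item)
        # Collection.insert() was removed in PyMongo 4; use insert_one()
        self.sheet.insert_one(data)
        return item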
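The storage step in `process_item` can be tried without a running MongoDB server. The sketch below is an assumption-laden illustration: `FakeCollection` and `store_item` are hypothetical names, the collection is a plain in-memory list, and the movie values are placeholders rather than scraped data.

```python
class FakeCollection(object):
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


def store_item(collection, item):
    """Mirror process_item: copy the item into a plain dict and insert it."""
    data = dict(item)
    collection.insert_one(data)
    return item


sheet = FakeCollection()
store_item(sheet, {"title": "Example title", "star": "9.0"})
store_item(sheet, {"title": "Another title", "star": "8.5",
                   "quote": "Example quote"})

# 'quote' is optional, so read it with .get() rather than indexing
quotes = [doc.get("quote") for doc in sheet.documents]
print(quotes)
```

Because the spider only sets `quote` when the page provides one, stored documents do not all share the same keys, and code that reads them back should allow for the missing field.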

3. Create a new middleware file, middlewares.py, to counter anti-crawler measures

# -*- coding:utf-8 -*-
import random
import base64

from douban.settings import USER_AGENTS
from douban.settings import PROXIES


# Pick a random User-Agent for each request
class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        # print(useragent)
        request.headers.setdefault("User-Agent", useragent)


# Pick a random proxy for each request
class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        # Treat both None and an empty string as "no credentials"
        if not proxy['user_passwd']:
            # Proxy that does not require account authentication
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # Base64-encode the "user:password" string
            # (b64encode expects bytes on Python 3)
            base64_userpasswd = base64.b64encode(
                proxy['user_passwd'].encode('utf-8')).decode('ascii')
            # Put it into the header format the proxy server expects
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']
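The `Proxy-Authorization` value built in `RandomProxy` is the word `Basic`, a space, and the Base64 encoding of the `user:password` string. A minimal standalone sketch of that encoding step; the function name `basic_proxy_auth` is hypothetical and the credentials are placeholders:

```python
import base64


def basic_proxy_auth(user_passwd):
    """Build a Proxy-Authorization value from a 'user:password' string."""
    token = base64.b64encode(user_passwd.encode("utf-8")).decode("ascii")
    return "Basic " + token


header = basic_proxy_auth("user:pass")  # placeholder credentials
print(header)  # Basic dXNlcjpwYXNz
```

Base64 is an encoding, not encryption, so anyone who can read the header can recover the password.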

4. Configuring settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2.5

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgent': 100,
    'douban.middlewares.RandomProxy': 200,
}

USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
    'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
    'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; zh-cn; M032 Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13'
]

PROXIES = [
    {"ip_port": "121.42.140.113:16816", "user_passwd": "mr_mao_hacker:sffqry9r"},
    #{"ip_port": "121.42.140.113:16816", "user_passwd": ""},
    #{"ip_port": "121.42.140.113:16816", "user_passwd": ""},
    #{"ip_port": "121.42.140.113:16816", "user_passwd": ""},
]

#LOG_FILE = "douban.log"
#LOG_LEVEL = "DEBUG"

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

# MongoDB host name
MONGODB_HOST = "127.0.0.1"

# MongoDB port number
MONGODB_PORT = 27017

# Database name
MONGODB_DBNAME = "Douban"

# Name of the collection that stores the data
MONGODB_SHEETNAME = "doubanmovies"

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
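The proxy middleware reads the keys `ip_port` and `user_passwd` from each `PROXIES` entry, so a misspelled key would only surface as a `KeyError` once a request happens to pick that entry. A small check like the following could catch this before the crawl starts. It is a sketch, not part of the project: `malformed_proxies` is a hypothetical helper and the addresses and credentials are placeholders.

```python
REQUIRED_KEYS = {"ip_port", "user_passwd"}


def malformed_proxies(proxies):
    """Return the entries that lack one of the keys the middleware reads."""
    return [entry for entry in proxies
            if not REQUIRED_KEYS.issubset(entry)]


# Placeholder entries for illustration only
sample = [
    {"ip_port": "127.0.0.1:8080", "user_passwd": "user:pass"},
    {"ip_port": "127.0.0.1:8081", "user_passwd": ""},
    {"ip_prot": "127.0.0.1:8082", "user_passwd": ""},  # misspelled key
]

bad = malformed_proxies(sample)
print(bad)
```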
