Web Scraping with Scrapy: A Simple Example Using Maoyan

In the spider's .py file:

import scrapy

from ..items import MaoyanItem  # assumes the default Scrapy project layout


class TopSpider(scrapy.Spider):
    name = 'top'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    def parse(self, response):
        # Each <dd> under the <dl> is one film entry on the board
        dds = response.xpath('//dl/dd')
        for dd in dds:
            dic = MaoyanItem()
            # dic = {}
            dic['name'] = dd.xpath('.//p[@class="name"]//text()').extract_first()
            # default='' keeps .replace() from failing when the node is missing
            dic['star'] = dd.xpath('.//p[@class="star"]/text()').extract_first(default='').replace('\n', '').replace(' ', '')
            dic['releasetime'] = dd.xpath('.//p[@class="releasetime"]/text()').extract_first()
            # The score is split across two <i> elements; join them into one string
            score1 = dd.xpath('.//p[@class="score"]/i[1]/text()').extract_first(default='')
            score2 = dd.xpath('.//p[@class="score"]/i[2]/text()').extract_first(default='')
            dic['score'] = score1 + score2
            # Detail page
            href = dd.xpath('.//p[@class="name"]/a/@href').extract_first()
            if href is None:
                # No detail link found for this entry, so skip it
                continue
            xqy_url = response.urljoin(href)
            # Pass the partly filled item to the detail-page callback via meta
            yield scrapy.Request(xqy_url, callback=self.xqy_parse, meta={'dic': dic})
        # Pagination: follow the "下一页" (next page) link if there is one
        next_url = response.xpath('//a[text()="下一页"]/@href').extract_first()
        if next_url:
            url = response.urljoin(next_url)
            yield scrapy.Request(url, callback=self.parse)

    def xqy_parse(self, response):
        # Retrieve the item started in parse() and add the detail-page fields
        dic = response.meta['dic']
        dic['type'] = response.xpath('//ul/li[@class="ellipsis"][1]/text()').extract_first()
        dic['area_time'] = response.xpath('//ul/li[@class="ellipsis"][2]/text()').extract_first(default='').replace('\n', '').replace(' ', '')
        yield dic
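
extract_first() returns None when an XPath matches nothing, and the links on the page are relative, so both text clean-up and URL building benefit from small helpers that tolerate missing values. The following is a minimal sketch using only the standard library. The helper names clean_text and absolute_url are hypothetical, and the sample href values are assumptions about what the page contains, not values taken from the site.

```python
from urllib.parse import urljoin


def clean_text(value):
    """Return value with all whitespace removed; treat None as ''."""
    if value is None:
        return ''
    # split() with no arguments splits on any run of whitespace
    # (spaces, newlines, tabs), so joining the pieces removes it all.
    return ''.join(value.split())


def absolute_url(page_url, href):
    """Resolve a relative href against the page it was found on."""
    if not href:
        return None
    return urljoin(page_url, href)


if __name__ == '__main__':
    print(clean_text('\n    9.5  \n'))
    # A query-only href keeps the path of the current page.
    print(absolute_url('https://maoyan.com/board/4', '?offset=10'))
    # A rooted href replaces the path entirely.
    print(absolute_url('https://maoyan.com/board/4?offset=10', '/films/1203'))
```

Inside a spider, response.urljoin(href) resolves href against the response's own URL in the same way, so a separate helper is only needed outside Scrapy callbacks. Note that clean_text removes every kind of whitespace, which is slightly broader than removing only '\n' and ' '.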

In items.py, declare the fields to be collected:

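import scrapy


# The class name and fields must match what the spider uses:
# assigning to a key that is not declared here raises KeyError.
class MaoyanItem(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    releasetime = scrapy.Field()
    score = scrapy.Field()
    type = scrapy.Field()
    area_time = scrapy.Field()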

In pipelines.py, write each item out as text:

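class DoubanPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts (append mode)
        self.file = open('douban.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(str(item) + '\n')
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        self.file.close()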
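
Writing str(item) stores the item's Python representation, which is awkward to read back later. One alternative is to write one JSON object per line. This is a minimal sketch, not the original author's pipeline: the class name JsonLinesPipeline, the path parameter, and the default filename maoyan.jsonl are all hypothetical.

```python
import json


class JsonLinesPipeline:
    """Write each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path='maoyan.jsonl'):
        # Hypothetical default filename; Scrapy creates the pipeline
        # without arguments, so the default is what a crawl would use.
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for both plain dicts and scrapy.Item objects;
        # ensure_ascii=False keeps Chinese text readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Each line of the resulting file can be loaded again with json.loads, which makes later analysis simpler than parsing the str(item) output.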

The pipeline in pipelines.py can also be written to store items in MongoDB:

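from pymongo import MongoClient


class DoubanPipeline(object):
    def open_spider(self, spider):
        # self.file = open('douban.txt','a',encoding='utf8')
        # With no arguments, MongoClient connects to localhost:27017
        self.client = MongoClient()
        self.collection = self.client['database_name']['collection_name']
        self.count = 0

    def process_item(self, item, spider):
        # self.file.write(str(item)+'\n')
        # Copy into a plain dict: '_id' is not a declared field on the Item
        doc = dict(item)
        # Note: the counter restarts at 0 on every run, so a second run
        # against the same collection raises DuplicateKeyError
        doc['_id'] = self.count
        self.count += 1
        self.collection.insert_one(doc)
        return item

    def close_spider(self, spider):
        # self.file.close()
        self.client.close()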
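
An _id taken from a counter that starts at 0 on every run collides with existing documents when the spider is run again against the same collection. One possible alternative, sketched here with the standard library only, derives a deterministic _id from the item's own data. The helper name make_document and the choice of name plus releasetime as the identifying fields are assumptions for illustration.

```python
import hashlib


def make_document(item, key_fields=('name', 'releasetime')):
    """Return a plain-dict copy of item with a deterministic '_id'.

    The '_id' is the SHA-1 hex digest of the chosen key fields, so the
    same film always maps to the same document id across runs.
    """
    doc = dict(item)
    raw = '|'.join(str(doc.get(field, '')) for field in key_fields)
    doc['_id'] = hashlib.sha1(raw.encode('utf-8')).hexdigest()
    return doc
```

In process_item, the resulting document could then be stored with pymongo's collection.replace_one({'_id': doc['_id']}, doc, upsert=True), which inserts a new document or overwrites the existing one instead of failing on a duplicate key.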

Also, remember to configure a few things in settings.py, such as the ROBOTS protocol (robots.txt) setting, among others.
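
A minimal sketch of the kind of settings.py entries involved. The package name maoyan in the pipeline path, the user-agent string, and the delay value are assumptions for illustration, not values from the original post; the setting names themselves are standard Scrapy settings.

```python
# settings.py (fragment)

# Whether the crawler honours robots.txt.
# Scrapy's generated project template sets this to True.
ROBOTSTXT_OBEY = True

# Enable the item pipeline; lower numbers run earlier (range 0-1000).
# 'maoyan' is an assumed project package name.
ITEM_PIPELINES = {
    'maoyan.pipelines.DoubanPipeline': 300,
}

# Illustrative values only: identify the client and pace the requests.
USER_AGENT = 'Mozilla/5.0 (compatible; example-crawler)'
DOWNLOAD_DELAY = 1  # seconds between requests to the same site
```

A pipeline class only runs if it is listed in ITEM_PIPELINES, so an item file or MongoDB collection that stays empty is often a sign that this entry is missing or points at the wrong class path.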
