Web Scraping: A Simple Scrapy Example with Maoyan

In the spider's .py file:

import scrapy

from ..items import MaoyanItem  # assumes the default Scrapy project layout


class TopSpider(scrapy.Spider):
    name = 'top'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    def parse(self, response):
        # Each <dd> under the <dl> holds one film on the board.
        for dd in response.xpath('//dl/dd'):
            dic = MaoyanItem()
            # dic = {}
            dic['name'] = dd.xpath('.//p[@class="name"]//text()').extract_first()
            # default='' avoids calling .replace() on None when a node is missing.
            dic['star'] = dd.xpath('.//p[@class="star"]/text()').extract_first(default='').replace('\n', '').replace(' ', '')
            dic['releasetime'] = dd.xpath('.//p[@class="releasetime"]/text()').extract_first()
            # The score is split across two <i> tags, so the two parts are concatenated.
            score1 = dd.xpath('.//p[@class="score"]/i[1]/text()').extract_first(default='')
            score2 = dd.xpath('.//p[@class="score"]/i[2]/text()').extract_first(default='')
            dic['score'] = score1 + score2
            # Detail page
            href = dd.xpath('.//p[@class="name"]/a/@href').extract_first()
            if href:
                # response.urljoin resolves the relative href against the current page URL.
                yield scrapy.Request(response.urljoin(href), callback=self.xqy_parse, meta={'dic': dic})

        # Pagination: follow the "下一页" (next page) link
        next_url = response.xpath('//a[text()="下一页"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

    def xqy_parse(self, response):
        # Keep filling the item that was passed along from the list page via meta.
        dic = response.meta['dic']
        dic['type'] = response.xpath('//ul/li[@class="ellipsis"][1]/text()').extract_first()
        dic['area_time'] = response.xpath('//ul/li[@class="ellipsis"][2]/text()').extract_first(default='').replace('\n', '').replace(' ', '')
        yield dic
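The spider does two small jobs that are easy to get wrong: it turns relative hrefs into absolute URLs, and it strips whitespace from extracted text that may be missing. The following is a minimal standalone sketch of both steps using only the standard library. The helper name `clean_text` and the sample href values are hypothetical and only stand in for what the XPath queries might return.

```python
from urllib.parse import urljoin


def clean_text(text):
    """Strip newlines and spaces; treat a missing value (None) as ''."""
    if text is None:
        return ''
    return text.replace('\n', '').replace(' ', '')


# Hypothetical sample values, standing in for what the XPath queries return.
list_page = 'https://maoyan.com/board/4'
detail_href = '/films/1203'   # assumed shape of a detail-page href
next_href = '?offset=10'      # assumed shape of the next-page href

# urljoin resolves each relative reference against the page it was found on.
detail_url = urljoin(list_page, detail_href)
next_page = urljoin(list_page, next_href)

print(detail_url)
print(next_page)
print(repr(clean_text(None)))
```

Resolving against the current page URL, rather than concatenating onto a fixed prefix, keeps working if the site changes an href from a bare query string to a full path.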

In items.py, declare the fields to be collected:

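import scrapy


class MaoyanItem(scrapy.Item):
    # Field names must match the keys the spider assigns;
    # a scrapy.Item raises KeyError for any undeclared field.
    name = scrapy.Field()
    star = scrapy.Field()
    releasetime = scrapy.Field()
    score = scrapy.Field()
    type = scrapy.Field()
    area_time = scrapy.Field()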

In pipelines.py, write each item out as text:

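class DoubanPipeline(object):
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open('douban.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(str(item) + '\n')
        # Return the item so that any later pipeline still receives it.
        return item

    def close_spider(self, spider):
        self.file.close()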
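Writing `str(item)` gives a readable dump, but the lines are awkward to parse back later. As an alternative, here is a minimal sketch of a pipeline that writes one JSON object per line. The class name `JsonLinesPipeline` and the default file name `maoyan.jsonl` are assumptions made for illustration and are not part of the original project.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: one JSON object per line instead of str(item)."""

    def __init__(self, path='maoyan.jsonl'):
        # The output file name is an assumption; change it as needed.
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts.
        # ensure_ascii=False keeps Chinese titles readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

Each line can then be loaded back with `json.loads`, which is harder to do reliably with the `str(item)` format.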

The pipeline in pipelines.py can also store items in MongoDB:

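from pymongo import MongoClient


class DoubanPipeline(object):
    def open_spider(self, spider):
        # self.file = open('douban.txt','a',encoding='utf8')
        # MongoClient() with no arguments connects to localhost:27017.
        self.client = MongoClient()
        # Placeholders: substitute your own database and collection names.
        self.collection = self.client['database_name']['collection_name']
        self.count = 0

    def process_item(self, item, spider):
        # self.file.write(str(item)+'\n')
        # Copy into a plain dict first: '_id' is not a declared Item field,
        # so assigning item['_id'] directly would raise KeyError.
        doc = dict(item)
        doc['_id'] = self.count
        self.count += 1
        self.collection.insert_one(doc)
        return item

    def close_spider(self, spider):
        # self.file.close()
        self.client.close()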

Also remember to configure the project in settings.py, for example the robots.txt (ROBOTS protocol) setting, among others.
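The original post does not list the exact settings, so the fragment below is only a sketch of the kind of entries usually involved. The setting names are standard Scrapy settings, but every value is an illustrative assumption, and `maoyan` is a hypothetical project package name.

```python
# settings.py (fragment): all values below are illustrative assumptions.

# Whether Scrapy fetches and honours robots.txt before crawling.
ROBOTSTXT_OBEY = True

# Enable the item pipeline; 'maoyan' is a hypothetical project package name.
# The number sets the order in which pipelines run (lower runs first).
ITEM_PIPELINES = {
    'maoyan.pipelines.DoubanPipeline': 300,
}

# Placeholder User-Agent string; replace it with one that identifies your crawler.
USER_AGENT = 'Mozilla/5.0 (compatible; example-crawler)'

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 1
```

Without an `ITEM_PIPELINES` entry, Scrapy never calls the pipeline class, so the spider runs but nothing is written to the file or to MongoDB.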
