<Scrapy crawler> Scraping detailed info for the Maoyan Movies Top 100

1. Create the Scrapy project

In a command prompt window, run:

scrapy startproject maoyan

cd maoyan

2. Write items.py (this is the template: the fields to scrape are defined here)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # Chinese title / English title
    ztitle = scrapy.Field()
    etitle = scrapy.Field()
    # Genre
    type = scrapy.Field()
    # Director
    dname = scrapy.Field()
    # Lead actors
    star = scrapy.Field()
    # Release date
    releasetime = scrapy.Field()
    # Running time
    time = scrapy.Field()
    # Score
    score = scrapy.Field()
    # Poster image URL
    image = scrapy.Field()
    # Synopsis
    info = scrapy.Field()

3. Create the spider file

In a command prompt window, run:

scrapy genspider -t crawl myspider maoyan.com

4. Write myspider.py (receive responses, process the data)

# -*- coding: utf-8 -*-

import scrapy
# Link extraction rules
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# The item template
from maoyan.items import MaoyanItem


class MaoyanSpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4?offset=0']

    rules = (
        # Follow the pagination links of the board
        Rule(LinkExtractor(allow=r'offset=\d+'), follow=True),
        # Parse each film detail page
        Rule(LinkExtractor(allow=r'/films/\d+'), callback='parse_maoyan', follow=False),
    )

    def parse_maoyan(self, response):
        item = MaoyanItem()
        # Chinese title / English title
        item['ztitle'] = response.xpath('//h3/text()').extract()[0]
        item['etitle'] = response.xpath('//div[@class="ename ellipsis"]/text()').extract()[0]
        # Genre
        item['type'] = response.xpath('//li[@class="ellipsis"][1]/text()').extract()[0]
        # Director
        item['dname'] = response.xpath('//a[@class="name"]/text()').extract()[0].strip()
        # Lead actors (first three, joined with a backslash)
        star_1 = response.xpath('//li[@class="celebrity actor"][1]//a[@class="name"]/text()').extract()[0].strip()
        star_2 = response.xpath('//li[@class="celebrity actor"][2]//a[@class="name"]/text()').extract()[0].strip()
        star_3 = response.xpath('//li[@class="celebrity actor"][3]//a[@class="name"]/text()').extract()[0].strip()
        item['star'] = star_1 + "\\" + star_2 + '\\' + star_3
        # Release date
        item['releasetime'] = response.xpath('//li[@class="ellipsis"][3]/text()').extract()[0]
        # Running time (last five characters of the line)
        item['time'] = response.xpath('//li[@class="ellipsis"][2]/text()').extract()[0].strip()[-5:]
        # Score: could not be scraped
        # item['score'] = response.xpath('//span[@class="stonefont"]/text()').extract()[0]
        item['score'] = "None"
        # Poster image URL
        item['image'] = response.xpath('//img[@class="avatar"]/@src').extract()[0]
        # Synopsis
        item['info'] = response.xpath('//span[@class="dra"]/text()').extract()[0].strip()
        yield item
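To see which links the two rules pick up, the `allow` patterns can be tried against a few URLs with the standard `re` module. This is only a sketch: `classify` is a hypothetical helper, the sample URLs are made up for illustration, and the real `LinkExtractor` does more than a bare regex search (it also resolves relative links and filters by domain, for example).

```python
import re

# The same patterns passed to LinkExtractor(allow=...) in the spider.
PAGE_PATTERN = re.compile(r'offset=\d+')   # board pages: followed, not parsed
FILM_PATTERN = re.compile(r'/films/\d+')   # detail pages: sent to parse_maoyan


def classify(url):
    """Return which rule a URL would fall under: 'page', 'film' or None."""
    if PAGE_PATTERN.search(url):
        return 'page'
    if FILM_PATTERN.search(url):
        return 'film'
    return None


# Hypothetical URLs, for illustration only.
samples = [
    'https://maoyan.com/board/4?offset=10',
    'https://maoyan.com/films/1203',
    'https://maoyan.com/news',
]
for url in samples:
    print(url, '->', classify(url))
```

Links that match neither pattern are simply ignored by the spider.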

5. Write pipelines.py (store the data)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class MaoyanPipeline(object):
    def __init__(self):
        # Output file, opened in binary mode; text is encoded on write
        self.filename = open('maoyan.txt', 'wb')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese readable
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.filename.close()
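Since the pipeline writes one JSON object per line, the output file can be read back line by line. A minimal sketch, assuming the file is the `maoyan.txt` produced above; `load_items` is a hypothetical helper name, not part of the project.

```python
import json


def load_items(path='maoyan.txt'):
    """Read the file written by MaoyanPipeline: one JSON object per line."""
    items = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

After a crawl, `len(load_items())` gives the number of films that were actually stored.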

6. Edit settings.py (set headers, pipelines, etc.)

robots.txt protocol:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Headers:

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

Pipelines:

ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}

7. Run the spider

In a command prompt window, run:

scrapy crawl myspider

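Result:

Hmm, only 99 of the Top 100 were scraped.

Issues:

In the page source the score appears as □.□, which is an anti-scraping measure. The same score can be found elsewhere on the site, but I did not bother to chase it.

Scraping only the Chinese title (ztitle) returns all 100 films, so one of the other fields' XPath expressions probably matches nothing on one detail page. The crawler does its job, so I left it at that.

Scrape succeeded.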
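A likely reason for the missing film is that `extract()[0]` raises `IndexError` when an XPath matches nothing, so that item is never yielded. A minimal sketch of a guard is shown here. `first_or_default` is a hypothetical helper, not part of Scrapy, and it differs slightly from `[0]` in that it also skips whitespace-only strings.

```python
def first_or_default(values, default=''):
    """Return the first non-empty stripped string from `values`, else `default`.

    `values` is a list of strings, such as the one returned by
    `response.xpath(...).extract()`.
    """
    for value in values:
        value = value.strip()
        if value:
            return value
    return default


# An empty list no longer raises IndexError the way `extract()[0]` does.
print(first_or_default([]))                 # prints an empty line
print(first_or_default(['  text  ']))       # prints: text
print(first_or_default([], default='N/A'))  # prints: N/A
```

In the spider this would be used as, for example, `item['etitle'] = first_or_default(response.xpath(...).extract())`. Scrapy selectors also offer `extract_first(default=...)`, which covers the empty-match case without a helper.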

8. Store the data in a MySQL database

Create the corresponding database and table in MySQL:
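The original table definition is not reproduced here, so the following is only a sketch of what it could look like. The database name `maoyan`, the table name `maoyantop100` and the column names come from the pipeline code; the `id` column and every column type and length are assumptions. `build_create_table_sql` is a hypothetical helper that just assembles the statement as a string.

```python
# Column names follow the fields in MaoyanItem; the types and lengths are
# assumptions, since the original table definition is not shown.
COLUMNS = [
    ('ztitle', 'VARCHAR(100)'),
    ('etitle', 'VARCHAR(200)'),
    ('type', 'VARCHAR(100)'),
    ('dname', 'VARCHAR(100)'),
    ('star', 'VARCHAR(200)'),
    ('releasetime', 'VARCHAR(50)'),
    ('time', 'VARCHAR(20)'),
    ('score', 'VARCHAR(10)'),
    ('image', 'VARCHAR(500)'),
    ('info', 'TEXT'),
]

CREATE_DATABASE_SQL = 'CREATE DATABASE IF NOT EXISTS maoyan DEFAULT CHARACTER SET utf8'


def build_create_table_sql(table='maoyantop100'):
    """Assemble a CREATE TABLE statement from COLUMNS (assumed schema)."""
    cols = ',\n    '.join(f'{name} {sqltype}' for name, sqltype in COLUMNS)
    return (
        f'CREATE TABLE IF NOT EXISTS {table} (\n'
        f'    id INT AUTO_INCREMENT PRIMARY KEY,\n'
        f'    {cols}\n'
        f') DEFAULT CHARSET=utf8'
    )


print(CREATE_DATABASE_SQL)
print(build_create_table_sql())
```

The printed statements can be pasted into a MySQL client; adjust the types to fit the data actually scraped.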

Then just rewrite pipelines.py:

import pymysql.cursors


class MaoyanPipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
            host='localhost',
            user='root',
            password='',
            database='maoyan',
            charset='utf8'  # not 'utf-8'
        )
        self.cursor = self.connect.cursor()  # create a cursor

    def process_item(self, item, spider):
        item = dict(item)
        sql = "insert into maoyantop100(ztitle,etitle,type,dname,star,releasetime,time,score,image,info) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, (item['ztitle'], item['etitle'], item['type'], item['dname'], item['star'], item['releasetime'], item['time'], item['score'], item['image'], item['info'],))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
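The hand-written INSERT statement and its parameter tuple must list the ten fields in the same order, which is easy to get wrong. As a sketch of one way to keep them in sync, both can be generated from a single field list. `FIELDS`, `build_insert` and `row_params` are hypothetical names introduced for illustration; the generated SQL is the same string as in the pipeline above.

```python
FIELDS = ['ztitle', 'etitle', 'type', 'dname', 'star',
          'releasetime', 'time', 'score', 'image', 'info']


def build_insert(table, fields):
    """Return an INSERT statement with one %s placeholder per field."""
    columns = ','.join(fields)
    placeholders = ','.join(['%s'] * len(fields))
    return f'insert into {table}({columns}) values({placeholders})'


def row_params(item, fields):
    """Return the item's values as a tuple, in the same order as `fields`."""
    return tuple(item[name] for name in fields)


sql = build_insert('maoyantop100', FIELDS)
print(sql)
```

Inside `process_item` this would become `self.cursor.execute(sql, row_params(item, FIELDS))`. The values are still passed separately as parameters, so the driver handles the escaping.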

Run:

Stored successfully:
