python网络爬虫（11）近期电影票房或热度信息爬取

目标意义

为了理解动态网站中一些数据如何获取，做一个简单的分析。

说明

思路，原始代码来源于：https://book.douban.com/subject/27061630/。

构造-下载器

构造分下载器，下载原始网页，用于原始网页的获取，动态网页中，js部分的响应获取。

通过浏览器模仿，合理制作请求头，获取网页信息即可。

代码如下：

import requests

import chardet

class HtmlDownloader(object):

    def download(self,url):

        if url is None:

            return None

        user_agent='Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'

        headers={'User-Agent':user_agent}

        r=requests.get(url,headers=headers)

        if r.status_code is 200:

            r.encoding=chardet.detect(r.content)['encoding']

            return r.text

        return None

构造-解析器

解析器解析数据使用。

获取的票房信息，电影名称等，使用解析器完成。

被解析的动态数据来源于js部分的代码。

js地址的获取则通过F12控制台-->网络-->JS，然后观察，得到。

地址如正上映的电影：

http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http://movie.mtime.com/257982/&t=201907121611461266&Ajax_CallBackArgument0=257982

返回信息中，解析出json格式的部分，通过json的一些方法，获取其中的票房等信息。

其中，json解析工具地址如：https://www.json.cn/

未上映的电影是同理的。

这些数据的解析有差异，所以定制了函数分支，处理解析过程中可能遇到的不同情景。

代码如下：

import re

import json

class HtmlParser(object):

    def parser_url(self,page_url,response):

        pattern=re.compile(r'(http://movie.mtime.com/(\d+)/)')

        urls=pattern.findall(response)

        if urls != None:

            return list(set(urls))#Duplicate removal

        else:

            return None

    def parser_json(self,url,response):

        #parsing json. input page_url as js url and response for parsing

        pattern=re.compile(r'=(.*?);')

        result=pattern.findall(response)[0]

        if result != None:

            value=json.loads(result)

            isRelease=value.get('value').get('isRelease')

            if isRelease:

                isRelease=1

                return self.parser_json_release(value,url)

            else:

                isRelease=0

                return self.parser_json_notRelease(value,url)

        return None

    def parser_json_release(self,value,url):

        isRelease=1

        movieTitle=value.get('value').get('movieTitle')

        RatingFinal=value.get('value').get('movieRating').get('RatingFinal')

        try:

            TotalBoxOffice=value.get('value').get('boxOffice').get('TotalBoxOffice')

            TotalBoxOfficeUnit=value.get('value').get('boxOffice').get('TotalBoxOfficeUnit')

        except:

            TotalBoxOffice="None"

            TotalBoxOfficeUnit="None"

        return isRelease,movieTitle,RatingFinal,TotalBoxOffice,TotalBoxOfficeUnit,url

    def parser_json_notRelease(self,value,url):

        isRelease=0

        movieTitle=value.get('value').get('movieTitle')

        try:

            RatingFinal=Ranking=value.get('value').get('hotValue').get('Ranking')

        except:

            RatingFinal=-1

        TotalBoxOffice='None'

        TotalBoxOfficeUnit='None'

        return isRelease,movieTitle,RatingFinal,TotalBoxOffice,TotalBoxOfficeUnit,url

构造-存储器

存储方案为Sqlite，所以在解析器中isRelease部分，使用了0和1进行的存储。

存储需要连接sqlite3，创建数据库，获取执行数据库语句的方法，插入数据等。

按照原作者思路，存储时，先暂时存储到内存中，条数大于10以后，将内存中的数据插入到sqlite数据库中。

代码如下：

import sqlite3

class DataOutput(object):

    def __init__(self):

        self.cx=sqlite3.connect("MTime.db")

        self.create_table('MTime')

        self.datas=[]

    def create_table(self,table_name):

        values='''

        id integer primary key autoincrement,

        isRelease boolean not null,

        movieTitle varchar(50) not null,

        RatingFinal_HotValue real not null default 0.0,

        TotalBoxOffice varchar(20),

        TotalBoxOfficeUnit varchar(10),

        sourceUrl varchar(300)

        '''

        self.cx.execute('create table if not exists %s(%s)' %(table_name,values))

    def store_data(self,data):

        if data is None:

            return

        self.datas.append(data)

        if len(self.datas)>10:

            self.output_db('MTime')

    def output_db(self,table_name):

        for data in self.datas:

            cmd="insert into %s (isRelease,movieTitle,RatingFinal_HotValue,TotalBoxOffice,TotalBoxOfficeUnit,sourceUrl) values %s" %(table_name,data)

            self.cx.execute(cmd)

            self.datas.remove(data)

        self.cx.commit()

    def output_end(self):

        if len(self.datas)>0:

            self.output_db('MTime')

        self.cx.close()

主函数部分

创建以上对象作为初始化

然后获取根路径。从根路径下找到百余条电影网址信息。

对每个电影网址信息一一解析，然后存储。

import HtmlDownloader

import HtmlParser

import DataOutput

import time

class Spider(object):

    def __init__(self):

        self.downloader=HtmlDownloader.HtmlDownloader()

        self.parser=HtmlParser.HtmlParser()

        self.output=DataOutput.DataOutput()

    def crawl(self,root_url):

        content=self.downloader.download(root_url)

        urls=self.parser.parser_url(root_url, content)

        for url in urls:

            print('.')

            t=time.strftime("%Y%m%d%H%M%S1266",time.localtime())

            rank_url='http://service.library.mtime.com/Movie.api'\

            '?Ajax_CallBack=true'\

            '&Ajax_CallBackType=Mtime.Library.Services'\

            '&Ajax_CallBackMethod=GetMovieOverviewRating'\

            '&Ajax_CrossDomain=1'\

            '&Ajax_RequestUrl=%s'\

            '&t=%s'\

            '&Ajax_CallBackArgument0=%s' %(url[0],t,url[1])

            rank_content=self.downloader.download(rank_url)

            try:

                data=self.parser.parser_json(rank_url, rank_content)

            except:

                print(rank_url)

            self.output.store_data(data)

        self.output.output_end()

        print('ed')

if __name__=='__main__':

    spider=Spider()

    spider.crawl('http://theater.mtime.com/China_Beijing/')

当前效果

如下：