Scraping the Douban Top 250 with Python

Web page API: https://movie.douban.com/top250?start=0&filter=
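
The `start` parameter is the zero-based offset of the first movie on a page, and the script advances it by 25 per request, so the full list spans ten URLs. A minimal sketch of generating them (the names `BASE_URL`, `PAGE_SIZE` and `page_urls` are illustrative and do not appear in the script):

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25  # assumed page size, matching the step the script uses


def page_urls(total=250):
    """Return the URL of every page needed to cover the first `total` entries."""
    return ['{}?start={}&filter='.format(BASE_URL, start)
            for start in range(0, total, PAGE_SIZE)]


if __name__ == '__main__':
    for url in page_urls(200):  # the top 200 need eight pages
        print(url)
```

Calling `page_urls(200)` yields the offsets 0 through 175, the same pages the script's loop visits.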

Modules used: urllib, re, csv

After tinkering all morning I finally got it working, though a few small problems remain.

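(The entry ranked 218 triggers a bug.) Specifically: that entry's listing has no 主演 (cast) field, so the regular expression runs on past the entry and captures far too much text. Entries that do have the field are matched normally.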
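
One possible way around this, not used in the script, is to make the cast part of the pattern optional. The sketch below is only an illustration under assumptions: the two HTML fragments are made-up, simplified stand-ins for the real page, and `STRICT` and `LENIENT` are hypothetical names. `STRICT` mirrors the director/cast portion of the script's pattern, while `LENIENT` wraps the cast portion in an optional group.

```python
import re

# Made-up, simplified credit lines shaped like those on the list page.
NO_CAST = '导演: Director A /...<br>\n2010&nbsp;/&nbsp;Thailand&nbsp;/&nbsp;Comedy'
WITH_CAST = ('导演: Director B&nbsp;&nbsp;&nbsp;主演: Actor C / Actor D<br>\n'
             '1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Drama')
page = NO_CAST + '\n' + WITH_CAST

# Same shape as the director/cast part of the script's pattern: 主演 is mandatory.
STRICT = re.compile(r'导演:(.*?)&nbsp;.*?主演: (.*?)<br>', re.S)

# Hypothetical variant: the whole "&nbsp;...主演: ..." part may be absent.
LENIENT = re.compile(r'导演:\s*(.*?)(?:(?:&nbsp;)+主演:\s*(.*?))?<br>', re.S)

if __name__ == '__main__':
    print(len(STRICT.findall(page)))   # how many entries the strict pattern sees
    for director, cast in LENIENT.findall(page):
        print(repr(director), repr(cast))
```

On this sample the strict pattern returns a single match that swallows both entries, because it keeps scanning until it finds the next 主演. The lenient pattern returns two matches, with an empty string for the missing cast. The rest of the script's pattern would still need the same care, so this is a starting point rather than a complete fix.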

So the script only collects the top 200. The Python code is below; suggestions from more experienced readers are welcome.

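#!/usr/bin/python3
# -*- coding: UTF-8 -*-

from urllib import request
import csv
import re


class MovieTopForDouBan(object):

    def __init__(self):
        self.start = 0  # offset of the first movie on the next page to fetch
        self.param = '&filter='
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                      '(KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
        self.file_path = 'D:\\'  # output directory (root of drive D:)
        # Column headers for the CSV output, in the order the regex captures them
        self.head = ['Rank', 'Title', 'Alternate title', 'Other titles', 'Director', 'Cast',
                     'Year', 'Region', 'Genre', 'Average rating', 'Votes', 'One-line review']
        self.movie_list = []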

    def get_page(self):
        """Fetch the next page of the list; return its HTML, or None on failure."""
        try:
            url = 'https://movie.douban.com/top250?start=' + str(self.start) + self.param
            req = request.Request(url, headers=self.headers)
            response = request.urlopen(req)
            page = response.read().decode('utf-8')
            page_num = (self.start + 25) // 25
            print('Fetched page ' + str(page_num) + '...')
            self.start += 25  # move on to the next 25 entries
            return page
        except request.URLError as e:
            if hasattr(e, 'reason'):
                print('Fetch failed, reason:', e.reason)
            return None

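    def get_movie_info(self):
        # Capture groups, in order: rank, title, alternate title, other titles, director,
        # cast, year, region, genre, average rating, number of votes, one-line review.
        pattern = re.compile('<div.*?class="item">.*?<em class="">(.*?)</em>'
                             '.*?<span.*?class="title">(.*?)</span>'
                             '.*?<span.*?class="title">(.*?)</span>'
                             '.*?<span.*?class="other">(.*?)</span>'
                             '.*?<div.*?class="bd">.*?<p.*?class="">'
                             '.*?导演:(.*?)&nbsp;.*?主演: (.*?)<br>'
                             '(.*?)&nbsp;/&nbsp;(.*?)&nbsp;/&nbsp;(.*?)</p>.*?<div.*?class="star">'
                             '.*?<span.*?class="rating_num".*?property="v:average">(.*?)</span>'
                             '.*?<span>(.*?)人评价</span>.*?</div>'
                             '.*?<span.*?class="inq">(.*?)</span>.*?</p>', re.S)
        # Only the top 200 are collected: the entry ranked 218 (初恋这件小事) breaks the pattern.
        while self.start <= 176:
            page = self.get_page()
            if page is None:  # request failed; stop instead of passing None to re.findall
                break
            movies = re.findall(pattern, page)
            for movie in movies:
                data = list(movie)
                # Remove the leading '&nbsp;/&nbsp;' separator as a prefix. str.lstrip would
                # treat it as a set of characters and could eat the start of the title.
                data[2] = re.sub('^&nbsp;/&nbsp;', '', data[2])
                data[3] = re.sub('^&nbsp;/&nbsp;', '', data[3])
                data[6] = data[6].lstrip()
                data[8] = data[8].rstrip()
                self.movie_list.append(data)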

    def write_text(self):
        print('Writing data to file....')
        labels = ['Rank', 'Title', 'Foreign title', 'Alternate titles', 'Director', 'Cast',
                  'Year', 'Country/region', 'Genre', 'Rating', 'Votes', 'One-line review']
        with open(self.file_path + 'movie_info.txt', 'w', encoding='utf-8') as file_TopText:
            try:
                for movie in self.movie_list:
                    # Plain '\n' is used: text mode already translates it to the platform
                    # line ending, so writing '\r\n' would produce '\r\r\n' on Windows.
                    for label, value in zip(labels, movie):
                        file_TopText.write(label + ': ' + value + '\n')
                    file_TopText.write('\n')  # blank line between movies
                print('Results written to file successfully...')
            except Exception as e:
                print(e)
        print('Finished writing data....')

    def write_csv_file(self):
        path = self.file_path + 'movie_info.csv'
        common = 0  # number of rows written so far
        try:
            with open(path, 'w', newline='', encoding='utf-8') as csv_file:
                writer = csv.writer(csv_file, dialect='excel')
                if self.head is not None:
                    writer.writerow(self.head)
                for row in self.movie_list:
                    writer.writerow(row)
                    common += 1
                print("CSV file written to %s successfully." % path)
        except Exception as e:
            print("Failed to write CSV file to %s: %s" % (path, e))
            print(common)

    def main(self):
        print('Starting to fetch data from Douban Movies........')
        self.get_movie_info()
        self.write_text()
        # self.write_csv_file()  # uncomment to also write the CSV file
        print('Finished fetching data...')


if __name__ == '__main__':
    movie = MovieTopForDouBan()
    movie.main()

Running the script creates a movie_info.txt file in the root directory of drive D:.
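
A side note on cleaning the captured fields: when removing the `&nbsp;/&nbsp;` separator from the start of a title, keep in mind that `str.lstrip` treats its argument as a set of characters, not as a prefix, so it can also remove the first letters of the title itself. A minimal sketch of the difference, using a made-up value and a hypothetical helper named `strip_prefix`:

```python
PREFIX = '&nbsp;/&nbsp;'


def strip_prefix(text, prefix=PREFIX):
    """Remove `prefix` once from the start of `text`, if it is there."""
    return text[len(prefix):] if text.startswith(prefix) else text


if __name__ == '__main__':
    raw = '&nbsp;/&nbsp;spring, summer'  # made-up captured value
    print(raw.lstrip(PREFIX))   # also strips the leading 's' and 'p' of the title
    print(strip_prefix(raw))    # removes only the separator
```

Titles that start with an uppercase letter or a Chinese character are unaffected by `lstrip` here, which is why the problem is easy to miss.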
