Scraping the Douban Movie Top 250 with python3 + requests + BeautifulSoup + MySQL

Base page: https://movie.douban.com/top250

Code:

import re
from time import sleep

import pymysql
from bs4 import BeautifulSoup
from requests import get

# Connect to the local MySQL server; the `douban` database must already exist.
db = pymysql.connect(host='localhost',
                     user='root',
                     password='123456',
                     db='douban',
                     charset='utf8mb4',
                     cursorclass=pymysql.cursors.DictCursor)

# Create the target table on the first run.
with db.cursor() as cursor:
    sql = ("CREATE TABLE IF NOT EXISTS `top250` ("
           "`id` int(6) NOT NULL AUTO_INCREMENT,"
           "`top` int(6) NOT NULL,"
           "`page-code` int(6) NOT NULL,"
           "`title` varchar(255) NOT NULL,"
           "`origin-title` varchar(255),"
           "`score` float NOT NULL,"
           "`theme` varchar(255) NOT NULL,"
           "PRIMARY KEY(`id`)"
           ") ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;")
    cursor.execute(sql)
db.commit()

base_url = 'https://movie.douban.com/top250'

# Browser-like request headers. 'Cookie' and 'User-Agent' are placeholders:
# replace 'xxx' with the values from your own browser session.
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'xxx',
    'Host': 'movie.douban.com',
    'Referer': 'https://movie.douban.com/chart',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'xxx'
}

def crawler(url=None, headers=None, delay=1):
    """Scrape one list page into `top250`, then follow the rel="next" link."""
    r = get(url=url, headers=headers, timeout=3)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Current page number, taken from the pagination bar.
    page_tag = soup.find('span', attrs={'class': 'thispage'})
    page_code = re.compile(r'<span class="thispage">(.*)</').findall(str(page_tag))[0]
    movie_ranks = soup.find_all('em', attrs={'class': ''})
    movie_titles = soup.find_all('div', attrs={'class': 'hd'})
    movie_scores = soup.find_all('span', attrs={'class': 'rating_num'})
    movie_themes = soup.find_all('span', attrs={'class': 'inq'})
    next_page = soup.find('link', attrs={'rel': 'next'})
    # Caution: zip() stops at the shortest list, so one movie without an
    # `inq` quote misaligns the themes and drops the last movie on the page.
    for ranks, titles, scores, themes in zip(movie_ranks, movie_titles, movie_scores, movie_themes):
        # findall() returns a list; take the first match as the rank.
        rank = re.compile(r'<em class="">(.*)</').findall(str(ranks))[0]
        regex_ts = re.compile(r'<span class="title">(.*)</').findall(str(titles))
        title = regex_ts[0]
        score = re.compile(r'<span class="rating_num" property="v:average">(.*)</').findall(str(scores))[0]
        theme = re.compile(r'<span class="inq">(.*)</').findall(str(themes))[0]
        try:
            # The second title span holds " / <original title>"; strip the separator.
            origin_title = regex_ts[1]
            origin_title = re.compile(r'./.(.+)').findall(origin_title)[0]
            with db.cursor() as cursor:
                sql = "INSERT INTO `top250` (`top`, `page-code`, `title`, `origin-title`, `score`, `theme`)" \
                      " VALUES (%s, %s, %s, %s, %s, %s)"
                cursor.execute(sql, (rank, page_code, title, origin_title, score, theme,))
        except IndexError:
            # No original title on this entry: leave `origin-title` NULL.
            with db.cursor() as cursor:
                sql = "INSERT INTO `top250` (`top`, `page-code`, `title`, `score`, `theme`)" \
                      " VALUES (%s, %s, %s, %s, %s)"
                cursor.execute(sql, (rank, page_code, title, score, theme,))
        finally:
            db.commit()
    if next_page is not None:
        # Send the page just visited as the Referer of the next request.
        headers['Referer'] = url
        next_url = base_url + re.compile(r'<link href="(.*)" rel="next">').findall(str(next_page))[0]
        sleep(delay)
        crawler(url=next_url, headers=headers, delay=3)


crawler(base_url, header, 0)
db.close()

Results:

mysql> select top,title,score from top250 where id = 175;
+-----+--------+-------+
| top | title  | score |
+-----+--------+-------+
| 176 | 罗生门 |   8.7 |
+-----+--------+-------+
1 row in set (0.00 sec)

mysql> select top,title,page-code,score from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'

mysql> select top,page-code,title,score from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'

mysql> select page-code from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'

mysql> describe top250
    -> ;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(6)       | NO   | PRI | NULL    | auto_increment |
| top          | int(6)       | NO   |     | NULL    |                |
| page-code    | int(6)       | NO   |     | NULL    |                |
| title        | varchar(255) | NO   |     | NULL    |                |
| origin-title | varchar(255) | YES  |     | NULL    |                |
| score        | float        | NO   |     | NULL    |                |
| theme        | varchar(255) | NO   |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
7 rows in set (0.32 sec)

mysql> select page-code from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'page' in 'field list'

mysql> select origin-title from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'origin' in 'field list'

mysql> select origin_title from top250 where id = 175;
ERROR 1054 (42S22): Unknown column 'origin_title' in 'field list'

mysql> select * from top250 where id = 175;
+-----+-----+-----------+--------+--------------+-------+-------------------+
| id  | top | page-code | title  | origin-title | score | theme             |
+-----+-----+-----------+--------+--------------+-------+-------------------+
| 175 | 176 |         8 | 罗生门 | 羅生門       |   8.7 | 人生的N种可能性。 |
+-----+-----+-----------+--------+--------------+-------+-------------------+
1 row in set (0.00 sec)

mysql> select * from top250 where title = 未麻的部屋;
ERROR 1054 (42S22): Unknown column '未麻的部屋' in 'where clause'

mysql> select * from top250 where top=175;
Empty set (0.00 sec)

mysql>
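The ERROR 1054 responses in the session above come from unquoted names: MySQL reads page-code as "page minus code", and an unquoted 未麻的部屋 as a column name rather than a string. A minimal sketch of how the same queries could be built from Python, assuming the table layout shown by describe. The helpers quote_identifier and build_select are hypothetical names introduced only for this illustration; the %s placeholder is the same style the crawler already passes to cursor.execute.

```python
def quote_identifier(name):
    """Wrap a MySQL identifier in backticks, doubling any embedded backtick."""
    return '`' + name.replace('`', '``') + '`'


def build_select(table, columns, where_column):
    """Build a SELECT with quoted identifiers and a %s placeholder for the value."""
    cols = ', '.join(quote_identifier(c) for c in columns)
    return 'SELECT {} FROM {} WHERE {} = %s'.format(
        cols, quote_identifier(table), quote_identifier(where_column))


sql = build_select('top250', ['top', 'page-code', 'origin-title'], 'title')
print(sql)
# With an open pymysql cursor this would be run as:
#     cursor.execute(sql, ('未麻的部屋',))
# so the driver quotes the string value instead of MySQL treating it as a column.
```

In the mysql console the equivalent is to type the backticks and quotes by hand, for example select `page-code` from top250 where title = '未麻的部屋';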

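Two small problems:

1. I had not realized that a '-' in an unquoted column name is parsed as a minus sign, so page-code and origin-title cannot be selected by their bare names. They have to be wrapped in backticks (select `page-code` from top250 ...), and a string value in a where clause needs quotes (where title = '未麻的部屋'). Underscores would have been the simpler choice for the column names.

2. The movie ranked 175, 未麻的部屋 (Perfect Blue), was not scraped. A likely cause: the crawler zips four separately collected lists, and zip() stops at the shortest one. If any movie on a page lacks the one-line quote (span.inq), the last movie on that page is dropped, and rank 175 is the last entry on page 7.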
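One way to keep every movie even when a field is absent is to parse each entry's container on its own instead of zipping page-wide lists. The sketch below is an illustration under stated assumptions, not the original script: it assumes each movie sits inside a div with class "item" (not confirmed by the code above), SAMPLE_HTML is made-up markup, and parse_items is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup: two entries, the first has no quote.
SAMPLE_HTML = """
<ol>
  <li><div class="item">
    <em class="">1</em>
    <div class="hd"><span class="title">Movie A</span></div>
    <span class="rating_num" property="v:average">9.0</span>
  </div></li>
  <li><div class="item">
    <em class="">2</em>
    <div class="hd"><span class="title">Movie B</span></div>
    <span class="rating_num" property="v:average">8.5</span>
    <span class="inq">A one-line quote.</span>
  </div></li>
</ol>
"""


def parse_items(html):
    """Return one dict per movie; a missing quote becomes None."""
    soup = BeautifulSoup(html, 'html.parser')
    movies = []
    # Assumption: every movie is wrapped in <div class="item">.
    for item in soup.find_all('div', attrs={'class': 'item'}):
        quote = item.find('span', attrs={'class': 'inq'})
        movies.append({
            'rank': int(item.find('em').get_text(strip=True)),
            'title': item.find('span', attrs={'class': 'title'}).get_text(strip=True),
            'score': float(item.find('span', attrs={'class': 'rating_num'}).get_text(strip=True)),
            'theme': quote.get_text(strip=True) if quote is not None else None,
        })
    return movies


movies = parse_items(SAMPLE_HTML)
print(len(movies))  # both entries are kept, with or without a quote
```

Because the theme column is declared NOT NULL in the table above, a None here would have to be stored as an empty string, or the column made nullable, before inserting.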

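I recommend using scrapy instead.

Some of its advantages: a crawler is easy to configure, it ships with its own HTML parser, and it conveniently resolves incomplete (relative) URLs into full ones.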
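For readers who stay with requests, relative links can also be resolved with the standard library rather than by string concatenation. A minimal sketch using urllib.parse.urljoin; the next_href value is an assumed example of what the rel="next" link on the list page might contain, not something taken from a real response.

```python
from urllib.parse import urljoin

base_url = 'https://movie.douban.com/top250'

# Assumed example of a relative href from the rel="next" link.
next_href = '?start=25&filter='

# urljoin keeps the base path when the reference is only a query string.
next_url = urljoin(base_url, next_href)
print(next_url)

# An absolute href is returned unchanged, where plain concatenation
# would produce a broken URL.
absolute = urljoin(base_url, 'https://movie.douban.com/chart')
print(absolute)
```

Scrapy offers the same behaviour through response.urljoin, which resolves a link against the URL of the page it was found on.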

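Finally, a small rant: my old computer was badly underpowered. A deep learning job used up all its memory, some inexplicable bug followed, and after a blue screen it never booted again. So I will not be able to update this blog for a while.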
