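[Python Crawler]: Scraping the Douban Movie Top 250 with a High-Performance Asynchronous Multi-Process Crawler
This post shows how to use a high-performance crawler to fetch and parse page data quickly. When a job needs many page requests and each request is sent only after the previous one has finished, most of the run time is spent waiting on the network, which is a poor trade. A high-performance crawler avoids this by fetching and parsing pages concurrently with a pool of workers, so the results arrive much sooner. The post walks through one example, scraping Douban Movie, to show how this is done.
1. Page analysis
First, let's look at the HTML of the Douban Movie page. In this example we want the title and the star rating (star) of every movie in the Douban Top 250.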

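Inspecting the page shows that all the movie information sits inside the <ol class="grid_view"> tag, so the XPath expressions can be anchored on it and match every entry beneath it. The title of a movie appears in two places: in the alt text of the poster image, and in span tags with class="title". A single movie can have several title spans, so to avoid mixing them up we take the title from the image instead, using the XPath expression //ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt
The star rating is even simpler. Every rating is wrapped in a <div class="star"> tag, and the number itself is in a span with class="rating_num" directly inside it, so the XPath expression is: //ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()
This returns every rating on the page. It only covers the first page, though. How do we get the data from the second page?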
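To try the two path expressions without touching the network, the sketch below runs them against a tiny hand-written fragment that mimics the structure described above. It is an illustration under assumptions, not the article's code: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and because ElementTree supports only a subset of XPath, the alt attribute and the text are read from the matched elements rather than through @alt and text(). The fragment itself is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed fragment that mimics the page structure.
SAMPLE = """
<html><body>
<ol class="grid_view">
  <li><div class="item">
    <div class="pic"><a href="#"><img alt="肖申克的救赎" src="#"/></a></div>
    <div class="star"><span class="rating_num">9.7</span></div>
  </div></li>
  <li><div class="item">
    <div class="pic"><a href="#"><img alt="霸王别姬" src="#"/></a></div>
    <div class="star"><span class="rating_num">9.6</span></div>
  </div></li>
</ol>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Same paths as in the text, minus the trailing @alt and text() steps.
images = root.findall('.//ol[@class="grid_view"]//div[@class="pic"]//a/img')
ratings = root.findall(
    './/ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]'
)

titles = [img.get("alt") for img in images]   # attribute value of each <img>
stars = [span.text for span in ratings]       # text content of each <span>

for title, star in zip(titles, stars):
    print(title, star)
```

With lxml, as used later in the post, the full expressions ending in @alt and text() return these strings directly.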
2. Pagination
Pagination on Douban Movie is fairly simple. Looking at the URLs of the first, second, and third pages, we find:
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
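The question mark introduces a query string, which tells us the page is fetched with a plain GET request whose parameters travel in the URL. There is no POST request with encrypted parameters to deal with, which keeps things simple.
We can also see that the start parameter changes from page to page while filter stays the same (empty). So once we know the pattern behind start, we know how to write the crawler.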
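The comparison of the three URLs can also be done in code. The following sketch uses only urllib.parse from the standard library to split each URL and list its query parameters; the variable names are chosen for this example.

```python
from urllib.parse import urlsplit, parse_qs

urls = [
    "https://movie.douban.com/top250?start=0&filter=",
    "https://movie.douban.com/top250?start=25&filter=",
    "https://movie.douban.com/top250?start=50&filter=",
]

# keep_blank_values=True keeps the empty "filter" parameter in the result.
parsed = [parse_qs(urlsplit(u).query, keep_blank_values=True) for u in urls]

for u, params in zip(urls, parsed):
    print(urlsplit(u).path, params)
```

Only the value of start differs between the three dictionaries; filter is an empty string every time.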
The value of start grows by 25 with every page, because each page lists exactly 25 movies. With that pattern in hand, we can start writing the crawler.
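As a quick check of that pattern, here is a minimal sketch that generates the offsets and the matching URLs. BASE_URL, PAGE_SIZE, TOTAL and page_url are names invented for this example; 250 movies at 25 per page gives ten pages, with offsets 0 through 225.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25   # movies per page
TOTAL = 250      # size of the Top 250 list

def page_url(start):
    """Build the URL of the page that begins at offset `start`."""
    query = urlencode({"start": start, "filter": ""})
    return BASE_URL + "?" + query

offsets = list(range(0, TOTAL, PAGE_SIZE))
page_urls = [page_url(s) for s in offsets]

print(offsets)
print(page_urls[1])
```

An offset of 250 would ask for an eleventh page, one past the end of the list.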
3. Writing the crawler
The code imports a pool class from Python's multiprocessing package. Using it takes only two extra lines on top of the ordinary serial code:
pool=Pool(4)
pool.map(get_information,number_ls)
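In the first line, the argument 4 creates a pool of four workers for fetching the data. Note that the Pool used in this post is imported from multiprocessing.dummy, which offers the same API as the process pool but runs its workers as threads. A larger pool generally keeps more requests in flight at once, up to a point. Because the workers spend most of their time waiting on the network, the pool size is not capped by the number of CPU cores; the practical limit is that too many simultaneous requests can get the crawler throttled or blocked by the site. For a real process pool doing CPU-bound work, going beyond the core count brings little benefit, since each core runs only one process at a time.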
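Since multiprocessing.dummy.Pool is backed by threads and the work here is mostly waiting, the effect of the pool is easy to observe with a stand-in for the network call. The sketch below is only a rough illustration: fake_request is a hypothetical placeholder that sleeps instead of downloading anything, and the exact timings will vary from machine to machine.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool

def fake_request(page):
    # Stand-in for a network call: the worker just waits, as it would on I/O.
    time.sleep(0.1)
    return page

pages = list(range(8))

# One request after another.
started = time.perf_counter()
serial = [fake_request(p) for p in pages]
serial_seconds = time.perf_counter() - started

# The same eight requests spread over eight worker threads.
started = time.perf_counter()
with Pool(8) as pool:
    pooled = pool.map(fake_request, pages)
pooled_seconds = time.perf_counter() - started

print(f"serial: {serial_seconds:.2f}s, pool of 8: {pooled_seconds:.2f}s")
```

The pooled run should finish well ahead of the serial one even on a machine with fewer than eight cores, because sleeping threads do not compete for the CPU.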
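The second line calls map. Its first argument is the crawler function and its second is the collection of arguments for that function. map calls the function once per item and spreads those calls across the pool, and that is all it takes to run the crawler concurrently.
Remark:
The argument handed to the pool, number_ls, must be an iterable such as a list. Inside get_information, the parameter is not the whole list: each call receives exactly one element of it.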
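The remark can be checked with a few lines. In this sketch, describe is a hypothetical worker that only reports what it was given; it is not part of the crawler.

```python
from multiprocessing.dummy import Pool

def describe(start):
    # `start` is a single element of the list, never the list itself.
    return f"start={start} ({type(start).__name__})"

number_ls = [0, 25, 50, 75]

with Pool(4) as pool:
    results = pool.map(describe, number_ls)

print(results)
# ['start=0 (int)', 'start=25 (int)', 'start=50 (int)', 'start=75 (int)']
```

Each call sees one integer, and map returns the results in the same order as the input list, regardless of which worker finished first.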
With that in mind, the complete code is as follows:
import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-backed pool with the multiprocessing.Pool API

# Browser cookie copied by the author; it is not sent with the requests below.
cookie='bid=N3Zqe_FFUKc; douban-fav-remind=1; viewed="27093751"; _vwo_uuid_v2=D401F17C96234AE149C4E04B78C3C8066|6fcc3cefe576bff2b89cdf28c4c5f597; __gads=ID=21cdec44606b00df-2250ba4d7ac4009b:T=1604034713:RT=1604034713:S=ALNI_Mb6iYJKYfbUjLxlisTQX5HCODTGKg; gr_user_id=fb6ac40c-94c3-400e-b170-47e126a9b78a; _gid=GA1.2.1520341169.1612004212; _ga=GA1.2.645228582.1602221486; ll="108288"; UM_distinctid=17752f076e4530-0b6eef25ebabba-f7b1332-1fa400-17752f076e57f0; Hm_lvt_19fc7b106453f97b6a84d64302f21a04=1612004228; Hm_lpvt_19fc7b106453f97b6a84d64302f21a04=1612004253; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1612004299%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.645228582.1602221486.1611225800.1612004300.9; __utmb=30149280.0.10.1612004300; __utmc=30149280; __utmz=30149280.1612004300.9.9.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.645228582.1602221486.1612004300.1612004300.1; __utmb=223695111.0.10.1612004300; __utmc=223695111; __utmz=223695111.1612004300.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); _pk_id.100001.4cf6=9a1bb1df4597b334.1612004299.1.1612005471.1612004299.'
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
}
url='https://movie.douban.com/top250'

# Page offsets. start=250 is one page past the end of the list, so
# range(0,250,25) would already cover all 250 movies.
number_ls=[]
for i in range(0,251,25):
    number_ls.append(i)
print(number_ls)

def get_information(start):
    # Called once per element of number_ls; start is a single page offset.
    param={
        'start':start,
        'filter':''
    }
    page_content=requests.get(url=url,headers=headers,params=param).text
    # Save each page under its own name so concurrent workers do not overwrite one file.
    with open('douban_%d.html'%start,'w',encoding='utf-8') as fp:
        fp.write(page_content)
    tree=etree.HTML(page_content)
    video_title=tree.xpath('//ol[@class="grid_view"]//div[@class="pic"]//a/img/@alt')
    star=tree.xpath('//ol[@class="grid_view"]//div[@class="star"]/span[@class="rating_num"]/text()')
    # Titles and ratings come back in page order, so pair them up one to one.
    for title,rating in zip(video_title,star):
        print("the movie is ",title)
        print("the star is ",rating)
        print()

pool=Pool(4)
pool.map(get_information,number_ls)
pool.close()
pool.join()
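4. Output
The output is just what we wanted, 250 movies in total. The pages are fetched concurrently, so the entries appear in the order the workers print them rather than in ranking order: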
[0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250]
the movie is 搏击俱乐部
the star is 9.0

the movie is 教父2
the star is 9.2

the movie is 狮子王
the star is 9.0

the movie is 指环王2:双塔奇兵
the star is 9.1

the movie is 死亡诗社
the star is 9.1

the movie is 钢琴家
the star is 9.2

the movie is 黑客帝国
the star is 9.0

the movie is 指环王1:魔戒再现
the star is 9.0

the movie is 饮食男女
the star is 9.1

the movie is 窃听风暴
the star is 9.1

the movie is 美丽心灵
the star is 9.0

the movie is 让子弹飞
the star is 8.8

the movie is 绿皮书
the star is 8.9

the movie is 两杆大烟枪
the star is 9.1

the movie is 本杰明·巴顿奇事
the star is 8.9

the movie is 海蒂和爷爷
the star is 9.2

the movie is 飞越疯人院
the star is 9.1

the movie is 看不见的客人
the star is 8.8

the movie is 西西里的美丽传说
the star is 8.9

the movie is 拯救大兵瑞恩
the star is 9.0

the movie is 穿条纹睡衣的男孩
the star is 9.1

the movie is 小鞋子
the star is 9.2

the movie is 音乐之声
the star is 9.0

the movie is 情书
the star is 8.9

the movie is 海豚湾
the star is 9.3

the movie is 美国往事
the star is 9.2

the movie is 致命魔术
the star is 8.9

the movie is 沉默的羔羊
the star is 8.9

the movie is 低俗小说
the star is 8.9

the movie is 禁闭岛
the star is 8.8

the movie is 蝴蝶效应
the star is 8.8

the movie is 七宗罪
the star is 8.8

the movie is 心灵捕手
the star is 8.9

the movie is 布达佩斯大饭店
the star is 8.9

the movie is 春光乍泄
the star is 8.9

the movie is 摩登时代
the star is 9.3

the movie is 被嫌弃的松子的一生
the star is 8.9

the movie is 哈利·波特与死亡圣器(下)
the star is 8.9

the movie is 阿凡达
the star is 8.7

the movie is 喜剧之王
the star is 8.8

the movie is 致命ID
the star is 8.8

the movie is 剪刀手爱德华
the star is 8.7

the movie is 勇敢的心
the star is 8.9

the movie is 加勒比海盗
the star is 8.8

the movie is 杀人回忆
the star is 8.9

the movie is 狩猎
the star is 9.1

the movie is 请以你的名字呼唤我
the star is 8.9

the movie is 天使爱美丽
the star is 8.7

the movie is 断背山
the star is 8.8

the movie is 红辣椒
the star is 9.0

the movie is 触不可及
the star is 9.2

the movie is 蝙蝠侠:黑暗骑士
the star is 9.2

the movie is 末代皇帝
the star is 9.3

the movie is 活着
the star is 9.3

the movie is 寻梦环游记
the star is 9.1

the movie is 乱世佳人
the star is 9.3

the movie is 何以为家
the star is 9.1

the movie is 指环王3:王者无敌
the star is 9.2

the movie is 飞屋环游记
the star is 9.0

the movie is 摔跤吧!爸爸
the star is 9.0

the movie is 哈利·波特与魔法石
the star is 9.1

the movie is 素媛
the star is 9.3

the movie is 少年派的奇幻漂流
the star is 9.1

the movie is 十二怒汉
the star is 9.4

the movie is 哈尔的移动城堡
the star is 9.1

the movie is 鬼子来了
the star is 9.3

the movie is 天空之城
the star is 9.1

the movie is 大话西游之月光宝盒
the star is 9.0

the movie is 我不是药神
the star is 9.0

the movie is 闻香识女人
the star is 9.1

the movie is 罗马假日
the star is 9.0

the movie is 天堂电影院
the star is 9.2

the movie is 辩护人
the star is 9.2

the movie is 猫鼠游戏
the star is 9.0

the movie is 大闹天宫
the star is 9.4

the movie is 肖申克的救赎
the star is 9.7

the movie is 霸王别姬
the star is 9.6

the movie is 阿甘正传
the star is 9.5

the movie is 这个杀手不太冷
the star is 9.4

the movie is 泰坦尼克号
the star is 9.4

the movie is 美丽人生
the star is 9.5

the movie is 千与千寻
the star is 9.4

the movie is 辛德勒的名单
the star is 9.5

the movie is 盗梦空间
the star is 9.3

the movie is 忠犬八公的故事
the star is 9.4

the movie is 星际穿越
the star is 9.3

the movie is 海上钢琴师
the star is 9.3

the movie is 楚门的世界
the star is 9.3

the movie is 三傻大闹宝莱坞
the star is 9.2

the movie is 机器人总动员
the star is 9.3

the movie is 放牛班的春天
the star is 9.3

the movie is 大话西游之大圣娶亲
the star is 9.2

the movie is 疯狂动物城
the star is 9.2

the movie is 无间道
the star is 9.2

the movie is 熔炉
the star is 9.3

the movie is 教父
the star is 9.3

the movie is 当幸福来敲门
the star is 9.1

the movie is 龙猫
the star is 9.2

the movie is 怦然心动
the star is 9.1

the movie is 控方证人
the star is 9.6

the movie is 7号房的礼物
the star is 8.9

the movie is 幽灵公主
the star is 8.9

the movie is 小森林 夏秋篇
the star is 9.0

the movie is 阳光灿烂的日子
the star is 8.8

the movie is 第六感
the star is 8.9

the movie is 重庆森林
the star is 8.8

the movie is 入殓师
the star is 8.9

the movie is 唐伯虎点秋香
the star is 8.7

the movie is 小森林 冬春篇
the star is 9.0

the movie is 爱在黎明破晓前
the star is 8.8

the movie is 超脱
the star is 8.9

the movie is 消失的爱人
the star is 8.7

the movie is 一一
the star is 9.0

the movie is 菊次郎的夏天
the star is 8.8

the movie is 蝙蝠侠:黑暗骑士崛起
the star is 8.8

the movie is 侧耳倾听
the star is 8.9

the movie is 倩女幽魂
the star is 8.7

the movie is 功夫
the star is 8.6

the movie is 超能陆战队
the star is 8.7

the movie is 无人知晓
the star is 9.1

the movie is 人生果实
the star is 9.5

the movie is 萤火之森
the star is 8.9

the movie is 甜蜜蜜
the star is 8.8

the movie is 借东西的小人阿莉埃蒂
the star is 8.8

the movie is 玛丽和马克思
the star is 8.9

the movie is 爱在日落黄昏时
the star is 8.8

the movie is 驯龙高手
the star is 8.7

the movie is 完美的世界
the star is 9.1

the movie is 幸福终点站
the star is 8.8

the movie is 告白
the star is 8.7

the movie is 大鱼
the star is 8.8

the movie is 阳光姐妹淘
the star is 8.8

the movie is 射雕英雄传之东成西就
the star is 8.7

the movie is 哈利·波特与阿兹卡班的囚徒
the star is 8.8

the movie is 恐怖直播
the star is 8.8

the movie is 天书奇谭
the star is 9.2

the movie is 怪兽电力公司
the star is 8.7

the movie is 神偷奶爸
the star is 8.6

the movie is 玩具总动员3
the star is 8.8

the movie is 傲慢与偏见
the star is 8.6

the movie is 时空恋旅人
the star is 8.8

the movie is 哈利·波特与密室
the star is 8.7

the movie is 教父3
the star is 8.9

the movie is 釜山行
the star is 8.6

the movie is 血战钢锯岭
the star is 8.7

the movie is 哪吒闹海
the star is 9.1

the movie is 被解救的姜戈
the star is 8.7

the movie is 七武士
the star is 9.3

the movie is 喜宴
the star is 8.9

the movie is 电锯惊魂
the star is 8.7

the movie is 爆裂鼓手
the star is 8.7

the movie is 贫民窟的百万富翁
the star is 8.6

the movie is 萤火虫之墓
the star is 8.7

the movie is 东邪西毒
the star is 8.6

the movie is 海街日记
the star is 8.8

the movie is 黑天鹅
the star is 8.6

the movie is 惊魂记
the star is 9.0

the movie is 无敌破坏王
the star is 8.7

the movie is 你看起来好像很好吃
the star is 8.9

the movie is 冰川时代
the star is 8.6

the movie is 雨人
the star is 8.7

the movie is 小偷家族
the star is 8.7

the movie is 绿里奇迹
the star is 8.9

the movie is 恋恋笔记本
the star is 8.5

the movie is 爱在午夜降临前
the star is 8.8

the movie is 疯狂的石头
the star is 8.5

the movie is 哈利·波特与火焰杯
the star is 8.6

the movie is 寄生虫
the star is 8.7

the movie is 恐怖游轮
the star is 8.5

the movie is 奇迹男孩
the star is 8.6

the movie is 雨中曲
the star is 9.0

the movie is 魔女宅急便
the star is 8.7

the movie is 二十二
the star is 8.7

the movie is 海边的曼彻斯特
the star is 8.6

the movie is 房间
the star is 8.8

the movie is 风之谷
the star is 8.9

the movie is 一个叫欧维的男人决定去死
the star is 8.9

the movie is 我是山姆
the star is 8.9

the movie is 头号玩家
the star is 8.7

the movie is 英雄本色
the star is 8.7

the movie is 上帝之城
the star is 9.0

the movie is 谍影重重3
the star is 8.8

the movie is 疯狂原始人
the star is 8.7

the movie is 未麻的部屋
the star is 9.0

the movie is 岁月神偷
the star is 8.7

the movie is 卢旺达饭店
the star is 8.9

the movie is 纵横四海
the star is 8.8

the movie is 三块广告牌
the star is 8.7

the movie is 达拉斯买家俱乐部
the star is 8.8

the movie is 花样年华
the star is 8.7

the movie is 心迷宫
the star is 8.7

the movie is 记忆碎片
the star is 8.6

the movie is 模仿游戏
the star is 8.7

the movie is 黑客帝国3:矩阵革命
the star is 8.8

the movie is 新世界
the star is 8.8

the movie is 头脑特工队
the star is 8.7

the movie is 荒蛮故事
the star is 8.8

the movie is 你的名字。
the star is 8.4

the movie is 真爱至上
the star is 8.6

the movie is 忠犬八公物语
the star is 9.2

the movie is 谍影重重2
the star is 8.7

the movie is 阿飞正传
the star is 8.5

the movie is 地球上的星星
the star is 8.9

the movie is 彗星来的那一夜
the star is 8.5

the movie is 完美陌生人
the star is 8.5

the movie is 战争之王
the star is 8.7

the movie is 谍影重重
the star is 8.6

the movie is 香水
the star is 8.5

the movie is 东京教父
the star is 9.0

the movie is 东京物语
the star is 9.2

the movie is 朗读者
the star is 8.6

the movie is 千钧一发
the star is 8.8

the movie is 再次出发之纽约遇见你
the star is 8.6

the movie is 驴得水
the star is 8.3

the movie is 猜火车
the star is 8.5

the movie is 黑客帝国2:重装上阵
the star is 8.6

the movie is 无间道2
the star is 8.6

the movie is 我爱你
the star is 9.1

the movie is 浪潮
the star is 8.7

the movie is 崖上的波妞
the star is 8.5

the movie is 聚焦
the star is 8.8

the movie is 小萝莉的猴神大叔
the star is 8.4

the movie is 追随
the star is 8.9

the movie is 黑鹰坠落
the star is 8.7

the movie is 网络谜踪
the star is 8.6

the movie is 虎口脱险
the star is 8.9

the movie is 人工智能
the star is 8.7

the movie is 九品芝麻官
the star is 8.6

the movie is 2001太空漫游
the star is 8.8

the movie is 可可西里
the star is 8.8

the movie is 罗生门
the star is 8.8

the movie is 色,戒
the star is 8.5

the movie is 终结者2:审判日
the star is 8.7

the movie is 城市之光
the star is 9.3

the movie is 初恋这件小事
the star is 8.4

the movie is 魂断蓝桥
the star is 8.8

the movie is 牯岭街少年杀人事件
the star is 8.9

the movie is 遗愿清单
the star is 8.7

the movie is 大佛普拉斯
the star is 8.7

the movie is 新龙门客栈
the star is 8.6

the movie is 波西米亚狂想曲
the star is 8.7

the movie is 源代码
the star is 8.5

the movie is 青蛇
the star is 8.6

the movie is 海洋
the star is 9.1

the movie is 燃情岁月
the star is 8.8

the movie is 无耻混蛋
the star is 8.6

the movie is 疯狂的麦克斯4:狂暴之路
the star is 8.6

the movie is 血钻
the star is 8.7

the movie is 穿越时空的少女
the star is 8.6

the movie is 步履不停
the star is 8.8
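If ranking order matters, one option is to have each worker return its rows instead of printing them, since pool.map hands results back in the order of its inputs. The sketch below shows only that collection step, under assumptions: fetch_page is a hypothetical stand-in for the download-and-parse function, and it fabricates placeholder rows rather than touching the network.

```python
from multiprocessing.dummy import Pool

PAGE_SIZE = 25

def fetch_page(start):
    # Hypothetical stand-in for "download and parse one page": it returns
    # placeholder (rank, title) pairs instead of real movie data.
    return [(start + i + 1, f"movie #{start + i + 1}") for i in range(PAGE_SIZE)]

offsets = list(range(0, 250, PAGE_SIZE))

with Pool(4) as pool:
    pages = pool.map(fetch_page, offsets)  # one list per page, in offset order

# Flatten the per-page lists; the ranking order is preserved.
movies = [movie for page in pages for movie in page]

for rank, title in movies[:3]:
    print(rank, title)
```

Printing or saving then happens once, after all pages are in, so the output of different workers no longer interleaves.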
0x01: 检查源代码,发现JS前端验证,关闭JS即可连接,或者手动添加.php,或者上传1.jpg,再抓包修改为php 0X02: if (($_FILES['upload_file']['type ...