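Writing Web Crawlers in Python 3, Part 04: Scraping the Maoyan Movie Ranking

This example uses the requests library and regular expressions to scrape the Maoyan movie TOP100 board. (requests is more convenient than urllib; since we have not yet covered a proper HTML parsing library, we use re.)

1. Goal: extract each movie's title, release time, score, image, and so on.

URL: http://maoyan.com/board/4. The results are saved to a file.

2. Analysis

offset is the offset into the list: with offset n, the page shows movies n+1 through n+10, ten per page.

To get all 100 movies, send 10 separate requests with offset = 0, 10, 20, ..., 90, and extract the fields with a regular expression.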
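
As a minimal sketch of the offset scheme described above, the ten page URLs can be generated up front. The helper name build_page_urls and its parameters are hypothetical and only used for illustration; the base URL and the offset query parameter come from the text.

```python
BASE_URL = 'http://maoyan.com/board/4'  # board URL given in the text


def build_page_urls(page_size=10, total=100):
    """Return one URL per page, with offset = 0, 10, ..., 90 (hypothetical helper)."""
    return [f'{BASE_URL}?offset={offset}' for offset in range(0, total, page_size)]


urls = build_page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page
print(urls[-1])    # last page
```

Building the list first keeps the paging arithmetic in one place, so the request loop only has to iterate over URLs.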

3. Fetching the page

import requests

# Fetch the HTML of the first page
def get_one_page(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    }
    response = requests.get(url, headers=header)
    if response.status_code == 200:  # check whether the request succeeded
        return response.text
    return None

# main calls get_one_page to send the request and prints the result
def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)  # call the request function
    print(html)

main()

Analyzing the page
Each movie corresponds to one <dd> node.
Rank: the text of the <i> node whose class is board-index. Regex: <dd>.*?board-index.*?>(.*?)</i>
Image: the link in the second img tag (its data-src attribute): <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"
Title: the <a> inside the <p> node whose class is name: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>
Starring: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>
Release time: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>
Score: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>
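
To see how the full pattern behaves, here is a small sketch that runs it against a hand-written fragment. SAMPLE_HTML is a hypothetical, simplified <dd> node modeled on the structure described above (including the assumption that the score is split into an integer part such as "9." and a fraction part); the live page markup may differ.

```python
import re

# Hypothetical fragment for illustration only; not copied from the real page.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" class="image-link">
    <img src="placeholder.png" class="poster-default" />
    <img data-src="http://example.com/poster.jpg" class="board-img" />
  </a>
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Same pattern as in the analysis, split across lines for readability.
# re.S lets "." match newlines, so one pattern can span a whole <dd> node.
PATTERN = re.compile(
    '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    '.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S,
)

matches = PATTERN.findall(SAMPLE_HTML)
for rank, image, title, star, release, integer, fraction in matches:
    print(rank, image, title, star.strip(), release.strip(), integer + fraction)
```

Each match is a 7-tuple, one entry per capture group, in the order rank, image, title, starring, release time, integer part, fraction part.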

Define the page-parsing function parse_one_page()

import requests
import re

def get_one_page(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    }
    response = requests.get(url, headers=header)
    if response.status_code == 200:
        return response.text
    return None

def parse_one_page(html):
    # re.S lets "." match newlines, so the pattern can span a whole <dd> node
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S
    )
    items = re.findall(pattern, html)
    print(items)

# main calls get_one_page to send the request, then parses and prints the result
def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    # print(html)
    parse_one_page(html)

main()

Iterate over the matches and build a dictionary for each one

import requests
import re

def get_one_page(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    }
    response = requests.get(url, headers=header)
    if response.status_code == 200:
        return response.text
    return None

def parse_one_page(html):  # html is the page source
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S
    )  # define the matching rule
    items = re.findall(pattern, html)  # search the whole page
    # print(items)
    # Iterate over the results and build dictionaries
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # [3:] drops the 3-character label "主演：" (Starring:)
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # [5:] drops the 5-character label "上映时间：" (Release time:)
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }
    # Because of yield, calling this function returns a generator

# main calls get_one_page to send the request and prints the result
def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    # print(html)
    for item in parse_one_page(html):  # iterate over the generator
        print(item)

main()
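
The slices [3:] and [5:] in the listing above work because the labels "主演：" and "上映时间：" are 3 and 5 characters long. As a sketch of an alternative (not the approach used in this post), a small helper can strip a label only when it is actually present. The name strip_label and the sample strings are hypothetical.

```python
def strip_label(text, label):
    """Trim whitespace, then remove a leading label if the text starts with it."""
    text = text.strip()
    return text[len(label):] if text.startswith(label) else text


# Sample values for illustration only
actor = strip_label('\n    主演：Actor A,Actor B\n  ', '主演：')
release = strip_label('上映时间：1993-01-01', '上映时间：')

print(len('主演：'), len('上映时间：'))  # label lengths used by the fixed slices
print(actor)
print(release)
```

Checking for the label first avoids silently cutting off real data if a page ever omits it, at the cost of slightly more code than a fixed slice.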

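Writing to a file

Write the extracted results to a file. json.dumps() serializes each dictionary to a string; passing ensure_ascii=False keeps Chinese characters as-is instead of escaping them as \uXXXX.

import json

# Write one record to the file
def write_to_file(content):
    # 'a' appends, so each call adds one line to result.txt
    with open('result.txt', 'a', encoding='utf-8') as f:
        print(type(json.dumps(content)))  # debug output: json.dumps returns a str
        f.write(json.dumps(content, ensure_ascii=False) + '\n')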
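
The effect of ensure_ascii, and the one-JSON-object-per-line layout that write_to_file produces, can be checked with a short sketch. The record below is made-up sample data, and the file is written to a temporary directory (an assumption made here so the example does not touch result.txt).

```python
import json
import os
import tempfile

# Made-up sample record, for illustration only
record = {'index': '1', 'title': '示例电影', 'score': '9.5'}

print(json.dumps(record))                      # non-ASCII characters escaped as \uXXXX
print(json.dumps(record, ensure_ascii=False))  # non-ASCII characters written as-is

# Append the record as one line, in the same way write_to_file does
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with open(path, 'a', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the file back: each non-empty line is one JSON object
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

Writing one object per line means the file can be appended to page by page and read back line by line without loading a single large JSON document.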

Putting it together: extracting the movies from a single page

import requests
import re
import json

# Request the page
def get_one_page(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    }
    response = requests.get(url, headers=header)
    if response.status_code == 200:
        return response.text
    return None

# Parse the page
def parse_one_page(html):  # html is the page source
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S
    )  # define the matching rule
    items = re.findall(pattern, html)  # search the whole page
    # print(items)
    # Iterate over the results and build dictionaries
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }
    # Because of yield, calling this function returns a generator

# Write one record to the file
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        print(type(json.dumps(content)))  # debug output: json.dumps returns a str
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

# main calls get_one_page to send the request, then parses and saves the results
def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    # print(html)
    for item in parse_one_page(html):  # iterate over the generator
        write_to_file(item)

main()

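Crawling page by page

# get_one_page, parse_one_page and write_to_file are the same as in the previous listing.
# main now takes the page offset, requests that page, and saves its results.
def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    # print(html)
    for item in parse_one_page(html):  # iterate over the generator
        write_to_file(item)

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)  # offsets 0, 10, ..., 90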

Cleaned-up code

# -*- coding: utf-8 -*-

import requests  # HTTP request library
import re  # regular expressions
import json  # JSON serialization
import time  # used for the delay between requests
from requests.exceptions import RequestException  # base exception raised by requests

# Request the page
def get_one_page(url):
    # Handle request errors
    try:
        header = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        }
        response = requests.get(url, headers=header)
        # Check whether the status code is 200
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

# Parse the page
def parse_one_page(html):  # html is the page source
    # Define the matching rule
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S
    )
    # Search the whole page
    items = re.findall(pattern, html)
    # print(items)
    # Iterate over the results and build dictionaries
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip()
        }
    # Because of yield, calling this function returns a generator

# Write one record to the file
def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        # print(type(json.dumps(content)))
        # ensure_ascii=False keeps Chinese characters readable in the output
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

# main calls get_one_page to send the request; offset is the page offset
def main(offset):
    # Build the URL
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    # Request the page
    html = get_one_page(url)
    # print(html)
    # get_one_page returns None on failure; skip the page in that case
    if html is None:
        return
    # Parse the page and save each result
    for item in parse_one_page(html):  # iterate over the generator
        write_to_file(item)

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        # Pause between requests
        time.sleep(3)

# This is the most basic example; review it and summarize what you learned.
