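Analyzing Ajax requests and scraping "Toutiao street-snap images"

The Toutiao page to scrape:

Analyzing the Ajax requests of the street-snap page:

Inspect the requests under the XHR tab of the browser's developer tools to find the request URL and its params. Joining the two gives the complete URL. In the JSON response, each item under data has an article_url field, which is the link to that item's detail page.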
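As a minimal sketch of that joining step, the snippet below builds the request URL with urlencode from the standard library. The endpoint and parameter names are the ones observed in this post; build_search_url is a hypothetical helper name, and only a subset of the parameters is shown, so the live API may require more or may have changed since.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as observed in the XHR tab for this post; it may have changed since.
BASE_URL = "https://www.toutiao.com/api/search/content/?"


def build_search_url(offset, keyword):
    """Join the base URL and the query parameters into one request URL.

    build_search_url is a hypothetical helper used only for illustration.
    """
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'count': 20,
    }
    # urlencode percent-encodes non-ASCII values such as the Chinese keyword.
    return BASE_URL + urlencode(params)


url = build_search_url(20, '街拍')
print(url)

# Decoding the query string gives back the original parameters.
query = parse_qs(urlsplit(url).query)
print(query['keyword'], query['offset'])
```

Only offset changes from one page of results to the next, which is why the scraper below passes it in as an argument.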

Code:

# Fetch one page of street-snap search results (the index page).
import requests
from urllib.parse import urlencode
from requests.exceptions import RequestException

def one_page_index(offset, keyword, headers):
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url = "https://www.toutiao.com/api/search/content/?" + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the index page!')
        return None

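# Parse the index-page JSON and yield the URL of each detail page.
import json

def parse_one_page(html):
    data = json.loads(html)
    # 'data' can be missing or null, so only iterate when it holds items.
    if data and data.get('data'):
        for item in data.get('data'):
            yield item.get('article_url')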
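To see what this generator yields, here is a small self-contained run against a hand-written response. The sample JSON is made up for illustration and keeps only the field the parser reads; the real response carries many more fields, and the example.com URLs are placeholders. This variant also treats a null data field as empty.

```python
import json


def parse_one_page(html):
    """Yield the article_url of every item in the index-page JSON."""
    data = json.loads(html)
    if data and data.get('data'):
        for item in data['data']:
            yield item.get('article_url')


# Made-up response: the second item has no article_url at all.
sample_response = json.dumps({
    "count": 3,
    "data": [
        {"title": "a", "article_url": "https://example.com/group/1/"},
        {"title": "item without a link"},
        {"title": "b", "article_url": "https://example.com/group/2/"},
    ],
})

urls = list(parse_one_page(sample_response))
print(urls)

# Items without a link come back as None, so callers should filter them out.
valid_urls = [u for u in urls if u]
print(valid_urls)

# A response whose data field is null yields nothing instead of raising.
empty = list(parse_one_page('{"data": null}'))
print(empty)
```

The None entry is why the main loop checks each URL before requesting its detail page.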

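Parsing the detail page:

The gallery data is embedded in the page as an escaped string, so its format has to be adjusted before it can be parsed as JSON.

Code: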

import json
import re
from bs4 import BeautifulSoup

def parse_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string, so the regex escapes \( and \) reach the re module unchanged.
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*)"\)', re.S)
    result = re.search(image_pattern, html)
    if result:
        # group(1) is the content of the first capture group.
        # Unescape it: turn each escaped \\u002F into '/', then drop the remaining backslashes.
        new_result = result.group(1).replace("\\\\u002F", '/')
        new_result = new_result.replace("\\", '')
        data = json.loads(new_result)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            # download_image is defined in the complete code further down.
            for image in images:
                download_image(image)
            return {
                "title": title,
                "url": url,
                "images": images
            }
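The unescaping step is easier to follow on a concrete input. The sketch below runs the same regex on a hypothetical page fragment (SAMPLE_HTML is invented to match the shape this post describes, with placeholder example.com URLs) and compares two ways of cleaning it: the backslash stripping used in this post, and an alternative that decodes the captured text twice with json.loads. The alternative assumes the captured JavaScript string literal is also a valid JSON string, which holds for this sample but is not guaranteed for every page.

```python
import json
import re

# Hypothetical page fragment, shaped like the gallery pages described here.
SAMPLE_HTML = r'''
<script>
var BASE_DATA = {
    gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\u002F\\u002Fimg.example.com\\u002Fa.jpg\"},{\"url\":\"http:\\u002F\\u002Fimg.example.com\\u002Fb.jpg\"}]}"),
};
</script>
'''

GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*)"\)', re.S)


def extract_by_stripping(html):
    """Approach from this post: rewrite the escaped slashes, then drop backslashes."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return None
    text = match.group(1).replace("\\\\u002F", "/").replace("\\", "")
    return json.loads(text)


def extract_by_decoding_twice(html):
    """Alternative: decode the string literal first, then the JSON inside it."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return None
    inner = json.loads('"' + match.group(1) + '"')  # string literal -> JSON text
    return json.loads(inner)                        # JSON text -> dict


stripped = extract_by_stripping(SAMPLE_HTML)
decoded = extract_by_decoding_twice(SAMPLE_HTML)
image_urls = [item['url'] for item in decoded['sub_images']]
print(image_urls)

# A page without the gallery block gives None, like the non-gallery pages.
missing = extract_by_stripping('<html><title>no gallery</title></html>')
print(missing)
```

Both functions agree on this sample. Stripping every backslash is simpler, but it would also remove backslashes that protect quotes inside a value, so decoding twice is likely the safer choice when the data contains free text.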

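Exception:

Some of the pages linked from the street-snap index are not galleries; they show all of their images on a single page. Their markup differs from that of gallery pages, so the gallery regex finds nothing and parse_detail returns None. The result must therefore be checked before it is written to the database.

Finally, the scraped data is saved to a MongoDB database and the images are saved to local files.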
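The local-file half of that step can be sketched without a network or a database. The version below names each file after the MD5 digest of its bytes, so identical images map to the same path and are written only once. The directory argument and the returned tuple are additions for this illustration, and fake_image stands in for the bytes a real download would return.

```python
import os
import tempfile
from hashlib import md5


def save_image(content, directory):
    """Write image bytes to <directory>/<md5>.jpg unless that file already exists.

    Returns the file path and whether a new file was written.
    """
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


# Stand-in bytes; a real run would pass response.content from requests.
fake_image = b'not really a jpeg'

with tempfile.TemporaryDirectory() as tmp_dir:
    first_path, first_written = save_image(fake_image, tmp_dir)
    second_path, second_written = save_image(fake_image, tmp_dir)
    saved_files = os.listdir(tmp_dir)

print(first_written, second_written, len(saved_files))
```

The second call finds the file already present and skips the write, which is the deduplication the MD5 file name is meant to provide.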

Complete code:

# config.py

MONGO_URL = "localhost"
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

GROUP_START = 1
GROUP_END = 20
KEYWORLD = '街拍'  # search keyword: "street snap"

import json
import os
import re
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode

import pymongo
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

from toutiao.config import *


client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


# Fetch one page of street-snap search results (the index page).
def one_page_index(offset, keyword, headers):
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url = "https://www.toutiao.com/api/search/content/?" + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the index page!')
        return None


# Fetch the HTML of one street-snap detail page.
def get_detail_page(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the detail page!')
        return None


# Parse the index-page JSON and yield the URL of each detail page.
def parse_one_page(html):
    data = json.loads(html)
    # 'data' can be missing or null, so only iterate when it holds items.
    if data and data.get('data'):
        for item in data.get('data'):
            yield item.get('article_url')


# Parse one detail page.
# The gallery data is extracted with a regular expression.
def parse_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string, so the regex escapes \( and \) reach the re module unchanged.
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*)"\)', re.S)
    result = re.search(image_pattern, html)
    if result:
        # group(1) is the content of the first capture group.
        # Unescape it: turn each escaped \\u002F into '/', then drop the remaining backslashes.
        new_result = result.group(1).replace("\\\\u002F", '/')
        new_result = new_result.replace("\\", '')
        data = json.loads(new_result)
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                "title": title,
                "url": url,
                "images": images
            }


# Download one image.
def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Image data is binary, so use response.content rather than response.text.
            save_image(response.content)
        return None
    except RequestException:
        print('Error requesting the image!')
        return None


# Save an image to the local disk.
# The file name is the MD5 of the content; identical content gives an identical
# MD5, which prevents saving the same image twice.
def save_image(content):
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        # The with block closes the file, so no explicit close() is needed.
        with open(file_path, 'wb') as f:
            f.write(content)


# Store the data in MongoDB.
# Some pages show all their images on a single page; their markup differs from
# gallery pages, so parsing matches nothing and there is no result to insert.
def save_to_mongodb(result):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4.
    if result and db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False


def main(offset):
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
        "cookie": 'tt_webid=6741657574736889357; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6741657574736889357; csrftoken=9e7ac598957d2ec36c80c6f1e05b9622; s_v_web_id=2ab2b8ff35fc91cacdec489ca9a5570f; __tasessionId=d6osr6e6r1569719745562; UM_distinctid=16d7a9f7f4724e-0d4cc2a69d799-3c375d0d-100200-16d7a9f7f4a4f1'
    }
    html = one_page_index(offset, KEYWORLD, headers)
    # Skip this offset if the index request failed.
    if not html:
        return
    for url in parse_one_page(html):
        if url:
            detail_html = get_detail_page(url, headers)
            # Skip detail pages that could not be fetched.
            if not detail_html:
                continue
            result = parse_detail(detail_html, url)
            save_to_mongodb(result)


# Entry point.
if __name__ == '__main__':
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    # Lowercase name, so the Pool class is not shadowed by its own instance.
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()
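One detail of the entry point worth checking is which offsets the pool actually requests. The sketch below recomputes them from the config values; build_offsets and PAGE_SIZE are hypothetical names, and PAGE_SIZE = 20 is assumed to match the 'count': 20 request parameter.

```python
# Values copied from config.py in this post.
GROUP_START = 1
GROUP_END = 20
# Assumed to equal the 'count' request parameter (20 results per request).
PAGE_SIZE = 20


def build_offsets(start, end, page_size=PAGE_SIZE):
    """Return the offset of every page from group `start` to group `end`, inclusive."""
    return [x * page_size for x in range(start, end + 1)]


offsets = build_offsets(GROUP_START, GROUP_END)
print(offsets[:3], offsets[-1], len(offsets))  # [20, 40, 60] 400 20
```

With GROUP_START = 1 the first requested offset is 20, so results at offset 0 are never fetched. If offset 0 is the first page of results, which this sketch assumes but does not verify, setting GROUP_START = 0 would include it.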
