[Python Crawler Case Study] Analyzing Ajax Requests to Scrape Toutiao Street-Snap Images

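1. Fetching the index page

Use requests to query the target site and return the body of the index response. The search endpoint is called with format=json, so the text that comes back is JSON rather than rendered HTML.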

from urllib.parse import urlencode
from requests.exceptions import RequestException
import requests

def get_page_index(offset, keyword):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
    # Query parameters of the Ajax search endpoint
    data = {
        'format': 'json',
        'offset': offset,
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        # The request must sit inside the try block, otherwise RequestException is never caught
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

def main():
    # '街拍' ("street snap") is the search keyword
    html = get_page_index(0, '街拍')
    print(html)

if __name__ == '__main__':
    main()
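
To see exactly which URL the function above requests, the query string can be built and decoded with the standard library alone. This is a minimal sketch with no network access; `build_index_url` is a hypothetical helper name, while the endpoint and parameters are the ones used in the code above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://www.toutiao.com/search_content/?'  # endpoint used in the article

def build_index_url(offset, keyword):
    """Hypothetical helper: build the Ajax search URL without sending a request."""
    params = {
        'format': 'json',
        'offset': offset,
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    # urlencode percent-encodes non-ASCII values such as the keyword (UTF-8 by default)
    return BASE_URL + urlencode(params)

url = build_index_url(20, '街拍')
# Decode the query string again to check the round trip
query = parse_qs(urlsplit(url).query)
print(url)
print(query['keyword'], query['offset'])
```

requests can also do this encoding itself when the dictionary is passed as `params=` to `requests.get`, which avoids the manual string concatenation.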

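2. Fetching the detail pages

Parse the returned result to get the detail-page links, then fetch the content of each detail page.

Extracting the page URLs: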

def parse_page_index(html):
  if not html:  # get_page_index returns None when the request fails
    return
  data = json.loads(html)  # requires `import json`
  if data and 'data' in data:
    for item in data.get('data'):
      yield item.get('article_url')
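
The generator can be tried without touching the network by feeding it a hand-written payload. The sample below is an assumption about the response shape, reduced to the two keys the code above reads (`data` and `article_url`); the real response carries many more fields. Skipping entries that have no `article_url` is also an addition for this sketch.

```python
import json

def parse_page_index(html):
    """Yield the article_url of every item in the index response."""
    if not html:
        return
    data = json.loads(html)
    if data and 'data' in data:
        for item in data['data']:
            url = item.get('article_url')
            if url:  # assumption: some entries may carry no article_url; skip them
                yield url

# Hypothetical payload, for illustration only
sample = json.dumps({
    'data': [
        {'article_url': 'http://example.com/group/1/'},
        {'title': 'an entry without a link'},
        {'article_url': 'http://example.com/group/2/'},
    ]
})

urls = list(parse_page_index(sample))
print(urls)
```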

Fetching a single detail page:

def get_page_detail(url):
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
  try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
      return response.text
    return None
  except RequestException:
    print('Failed to request the detail page')
    return None

Extracting the image URLs:

def parse_page_detail(html, url):
  # Requires: import re, json and from bs4 import BeautifulSoup
  soup = BeautifulSoup(html, 'lxml')
  title = soup.select('title')[0].get_text()
  # Raw string, so that \( and \) reach the regex engine unchanged
  images_pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
  result = re.search(images_pattern, html)
  if result:
    data = json.loads(result.group(1))
    # The first loads returns a str, not a dict (using it directly raised an error), so decode once more
    data = json.loads(data)
    if data and 'sub_images' in data.keys():
      sub_images = data.get('sub_images')
      images = [item.get('url') for item in sub_images]
      for image in images:
        download_image(image)
      return {
        'title': title,
        'images': images,
        'url': url
      }
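
The two consecutive `json.loads` calls are easier to follow on a small synthetic page. The sketch below assumes the detail page embeds the gallery as `gallery: JSON.parse("...")`, that is, a JSON document wrapped in a string literal, which is what the regular expression above targets. The sample HTML and the name `extract_image_urls` are made up for illustration.

```python
import json
import re

IMAGES_PATTERN = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)

def extract_image_urls(html):
    """Hypothetical helper: return the image URLs in the embedded gallery, or []."""
    match = IMAGES_PATTERN.search(html)
    if not match:
        return []
    inner = json.loads(match.group(1))   # first pass: string literal -> str
    data = json.loads(inner)             # second pass: str -> dict
    return [item.get('url') for item in data.get('sub_images', [])]

# Build a fake page: the gallery dict is serialized twice, as on the assumed page
gallery = {'sub_images': [{'url': 'http://example.com/a.jpg'},
                          {'url': 'http://example.com/b.jpg'}]}
html = ('<script>var BASE_DATA = { gallery: JSON.parse(%s), };</script>'
        % json.dumps(json.dumps(gallery)))

first_pass = json.loads(IMAGES_PATTERN.search(html).group(1))
print(type(first_pass).__name__)
print(extract_image_urls(html))
```

Note that the non-greedy `(.*?)\)` stops at the first closing parenthesis, so this approach relies on the embedded JSON containing no `)` character.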

3. Downloading images and saving to the database

Download the images to local disk, and save the page information and image URLs to MongoDB.

# Save to the database
def save_to_mongo(result):
  # insert_one replaces Collection.insert, which was removed in PyMongo 4
  if db[MONGO_TABLE].insert_one(result):
    print('Saved to MongoDB', result)
    return True
  return False

# Download an image
def download_image(url):
  print('Downloading', url)
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
  try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
      save_image(response.content)
    return None
  except RequestException:
    print('Failed to request image', url)
    return None

def save_image(content):
  # Name the file after the MD5 of its bytes, so an identical image is written only once
  file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
  if not os.path.exists(file_path):
    with open(file_path, 'wb') as f:
      f.write(content)
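
Naming each file after the MD5 digest of its bytes means identical images map to the same path, so the `os.path.exists` check skips duplicates. A minimal sketch of that behaviour follows. It writes to a temporary directory instead of the working directory, and both the `directory` parameter and the returned tuple are additions for the demo, not part of the original function.

```python
import os
import tempfile
from hashlib import md5

def save_image(content, directory):
    """Write content to <directory>/<md5>.jpg unless that file already exists."""
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(file_path):
        return file_path, False   # the same bytes were saved before
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True

with tempfile.TemporaryDirectory() as tmp:
    first = save_image(b'fake image bytes', tmp)
    second = save_image(b'fake image bytes', tmp)    # duplicate content
    third = save_image(b'other image bytes', tmp)
    saved_files = sorted(os.listdir(tmp))

print(first[1], second[1], third[1], len(saved_files))
```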

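4. Looping over pages in parallel

Iterate over multiple result pages and use a process pool (multiprocessing.Pool, which starts worker processes rather than threads) to speed up the crawl.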

groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
pool = Pool()
pool.map(main, groups)
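
It helps to look at the offsets this list comprehension produces. With the `GROUP_START = 1` and `GROUP_END = 20` values from config.py, the list runs from 20 to 400, so the first results page (offset 0) is not requested; starting at 0 would include it. The sketch below uses a stub in place of `main` and swaps in `multiprocessing.pool.ThreadPool` (a thread pool with the same `map` interface) so that it runs without a `__main__` guard. `page_offsets` and `fake_crawl` are hypothetical names.

```python
from multiprocessing.pool import ThreadPool

def page_offsets(group_start, group_end, page_size=20):
    """Offsets for the result pages group_start..group_end (inclusive)."""
    return [x * page_size for x in range(group_start, group_end + 1)]

def fake_crawl(offset):
    """Stand-in for main(offset): no network, just reports what it would fetch."""
    return 'offset=%d' % offset

offsets = page_offsets(1, 20)
with ThreadPool(4) as pool:
    # map blocks until every task is done and keeps the input order
    results = pool.map(fake_crawl, offsets)

print(offsets[0], offsets[-1], len(offsets))
print(results[:3])
```

With the real `Pool`, the worker function has to be defined at module level and the pool created under `if __name__ == '__main__':`, as the full listing does.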

Full code:

from urllib.parse import urlencode
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from hashlib import md5
from multiprocessing import Pool
from config import *
import pymongo
import requests
import json
import re
import os

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]

def get_page_index(offset, keyword):
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
  # Query parameters of the Ajax search endpoint
  data = {'format': 'json', 'offset': offset, 'keyword': keyword, 'autoload': 'true', 'count': 20, 'cur_tab': 1, 'from': 'search_tab', 'pd': 'synthesis'}
  url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
  try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
      return response.text
    return None
  except RequestException:
    print('Failed to request the index page')
    return None

def parse_page_index(html):
  if not html:  # get_page_index returns None when the request fails
    return
  data = json.loads(html)
  if data and 'data' in data:
    for item in data.get('data'):
      yield item.get('article_url')

def get_page_detail(url):
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
  try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
      return response.text
    return None
  except RequestException:
    print('Failed to request the detail page')
    return None

def parse_page_detail(html, url):
  soup = BeautifulSoup(html, 'lxml')
  title = soup.select('title')[0].get_text()
  # Raw string, so that \( and \) reach the regex engine unchanged
  images_pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
  result = re.search(images_pattern, html)
  if result:
    data = json.loads(result.group(1))
    # The first loads returns a str, not a dict (using it directly raised an error), so decode once more
    data = json.loads(data)
    if data and 'sub_images' in data.keys():
      sub_images = data.get('sub_images')
      images = [item.get('url') for item in sub_images]
      for image in images:
        download_image(image)
      return {
        'title': title,
        'images': images,
        'url': url
      }

def save_to_mongo(result):
  # insert_one replaces Collection.insert, which was removed in PyMongo 4
  if db[MONGO_TABLE].insert_one(result):
    print('Saved to MongoDB', result)
    return True
  return False

def download_image(url):
  print('Downloading', url)
  headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
  try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
      save_image(response.content)
    return None
  except RequestException:
    print('Failed to request image', url)
    return None

def save_image(content):
  # Name the file after the MD5 of its bytes, so an identical image is written only once
  file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
  if not os.path.exists(file_path):
    with open(file_path, 'wb') as f:
      f.write(content)

def main(offset):
  html = get_page_index(offset, KEYWORD)
  for url in parse_page_index(html):
    html = get_page_detail(url)
    if html:
      result = parse_page_detail(html, url)
      if isinstance(result, dict):
        save_to_mongo(result)

if __name__ == '__main__':
  groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
  pool = Pool()
  pool.map(main, groups)

config.py

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'jiepai'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'
