Python Crawler Series: Analyzing Ajax Requests to Scrape Toutiao (今日头条) Street-Snap Images

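1. Fetch the index page

Use requests to query the target site and return the index page response. The search endpoint is an Ajax interface called with format=json, so the body is JSON rather than rendered HTML.

2. Fetch the detail pages

Parse the index response to get the detail-page links, then fetch each detail page.

3. Download images and save to the database

Download the images to local disk and save the page information and image URLs to MongoDB.

4. Loop over pages in parallel

Iterate over multiple result pages and run them concurrently to speed up the crawl. The code uses multiprocessing.Pool, which starts worker processes rather than threads.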

1. Fetch the index page

from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException

def get_page_index(offset, keyword):
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    # Query parameters of the Ajax search endpoint; offset advances by 20 per page.
    data = {
        'format': 'json',
        'offset': offset,
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        # The request itself must sit inside the try block,
        # otherwise connection errors are never caught.
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

def main():
    html = get_page_index(0, '街拍')
    print(html)

if __name__ == '__main__':
    main()
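To see what URL the function above actually requests, the query-string step can be tried on its own without touching the network. This is a minimal sketch; the helper name build_index_url is introduced here for illustration and is not part of the original script.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_index_url(offset, keyword):
    """Build the search URL for one result page (same parameters as get_page_index)."""
    params = {
        'format': 'json',
        'offset': offset,
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    # urlencode percent-encodes non-ASCII values (as UTF-8) and joins pairs with '&'.
    return 'https://www.toutiao.com/search_content/?' + urlencode(params)

url = build_index_url(20, '街拍')
print(url)

# Decoding the query string again recovers the original values.
query = parse_qs(urlsplit(url).query)
print(query['keyword'], query['offset'])
```

Because urlencode handles the escaping, the Chinese keyword can be passed as an ordinary string.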

2. Fetch the detail pages

Extract the detail-page URLs:

import json

def parse_page_index(html):
    # The index response is JSON; each item under 'data' may carry an article_url.
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            yield item.get('article_url')
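The generator above can be exercised offline with a small hand-written payload. The sketch below uses a hypothetical, heavily trimmed response body (the real response has many more fields) and a variant named iter_article_urls, introduced here, that also skips items without an article_url.

```python
import json

def iter_article_urls(text):
    """Yield every non-empty article_url found in an index-page JSON response."""
    payload = json.loads(text)
    # 'data' may be missing or null, so fall back to an empty list.
    for item in payload.get('data') or []:
        url = item.get('article_url')
        if url:
            yield url

# Hypothetical sample body, for illustration only.
sample = json.dumps({
    'data': [
        {'title': 'first', 'article_url': 'https://example.com/a1'},
        {'title': 'entry without a link'},
        {'title': 'second', 'article_url': 'https://example.com/a2'},
    ]
})

urls = list(iter_article_urls(sample))
print(urls)
```

Skipping empty values here keeps None from being passed on to the detail-page request later.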

Fetch a single detail page:

def get_page_detail(url):
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page')
        return None

Extract the image URLs:

import json
import re

from bs4 import BeautifulSoup

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string so the escaped parentheses reach the regex engine unchanged.
    images_pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
    result = re.search(images_pattern, html)
    if result:
        # The captured text is a quoted JSON string, so the first loads()
        # returns a str; the second loads() turns that str into a dict.
        data = json.loads(result.group(1))
        data = json.loads(data)
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'images': images,
                'url': url
            }
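The double json.loads call is the least obvious step in the function above, so here is a minimal offline sketch of it. The sample_html string is a hypothetical stand-in for a detail page, built so that the gallery argument is a quoted JSON string as the regex expects; extract_image_urls is a name introduced here.

```python
import json
import re

GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)

def extract_image_urls(html):
    """Return the image URLs found in the embedded gallery data, or [] if absent."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return []
    as_text = json.loads(match.group(1))  # quoted JSON string -> Python str
    gallery = json.loads(as_text)         # str containing JSON -> dict
    return [item.get('url') for item in gallery.get('sub_images', [])]

# Hypothetical page fragment: the inner JSON is itself wrapped in a JSON string.
inner = json.dumps({'sub_images': [{'url': 'https://example.com/1.jpg'},
                                   {'url': 'https://example.com/2.jpg'}]})
sample_html = '<script>var DATA = { gallery: JSON.parse(' + json.dumps(inner) + '), id: 1 };</script>'

first_pass = json.loads(GALLERY_PATTERN.search(sample_html).group(1))
images = extract_image_urls(sample_html)
print(type(first_pass).__name__, images)
```

One caveat with this pattern: the non-greedy match stops at the first closing parenthesis, so a payload that itself contains ")" would be cut short.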

3. Download images and save to the database

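# Save to the database
# (db and MONGO_TABLE are defined in the full listing: spider.py and config.py)

def save_to_mongo(result):
    # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False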

# Download an image

def download_image(url):
    print('Downloading', url)
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # response.content holds the raw bytes of the image.
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to request the image', url)
        return None

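import os
from hashlib import md5

def save_image(content):
    # Name the file after the MD5 of its bytes so identical images are stored once.
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)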
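Naming files by content hash is what keeps duplicate images from piling up. The sketch below shows that effect in a temporary directory; save_bytes is a hypothetical variant of save_image that takes the target directory as a parameter and reports whether it wrote anything.

```python
import os
import tempfile
from hashlib import md5

def save_bytes(content, directory):
    """Write content to <directory>/<md5>.jpg; return (path, written)."""
    path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(path):
        # Same bytes were saved before, so there is nothing to do.
        return path, False
    with open(path, 'wb') as f:
        f.write(content)
    return path, True

with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(b'fake image bytes', tmp)
    second = save_bytes(b'fake image bytes', tmp)   # duplicate content
    third = save_bytes(b'different bytes', tmp)
    saved = sorted(os.listdir(tmp))

print(first[1], second[1], third[1], len(saved))
```

The same image reached through two different URLs therefore ends up as a single file on disk.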

4. Loop over pages in parallel

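# Each group is one result page; the offset advances by 20 per page.
groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
# Pool() starts one worker process per CPU by default.
pool = Pool()
pool.map(main, groups)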
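The heading speaks of threads while the snippet above uses processes. Since the crawl is mostly waiting on network I/O, a thread pool is a plausible alternative; the following is a minimal sketch of that variant, not the original author's code. build_offsets and fake_crawl are hypothetical names, and fake_crawl stands in for main(offset) so the example runs without network access.

```python
from multiprocessing.pool import ThreadPool

def build_offsets(group_start, group_end, page_size=20):
    """Turn a range of page groups into request offsets (group * page_size)."""
    return [x * page_size for x in range(group_start, group_end + 1)]

def fake_crawl(offset):
    """Stand-in for main(offset); returns a label instead of doing network I/O."""
    return 'page at offset {}'.format(offset)

offsets = build_offsets(1, 3)

# ThreadPool has the same map() interface as Pool but uses threads.
with ThreadPool(4) as pool:
    results = pool.map(fake_crawl, offsets)

print(offsets)
print(results)
```

Note that with GROUP_START = 1 the first offset is 20, so the result page at offset 0 is never requested; setting GROUP_START = 0 would include it.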

Full code: spider.py

from urllib.parse import urlencode
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
from hashlib import md5
from multiprocessing import Pool
from config import *
import pymongo
import requests
import json
import re
import os

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]

def get_page_index(offset, keyword):
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    data = { 'format': 'json','offset': offset,'keyword': keyword,'autoload': 'true','count': 20,'cur_tab': 1,'from': 'search_tab','pd': 'synthesis' }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            yield item.get('article_url')

def get_page_detail(url):
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page')
        return None

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string so the escaped parentheses reach the regex engine unchanged.
    images_pattern = re.compile(r'gallery: JSON\.parse\((.*?)\)', re.S)
    result = re.search(images_pattern, html)
    if result:
        # The captured text is a quoted JSON string: the first loads()
        # returns a str, the second loads() turns that str into a dict.
        data = json.loads(result.group(1))
        data = json.loads(data)
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'images': images,
                'url': url
            }

def save_to_mongo(result):
    # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False

def download_image(url):
    print('Downloading', url)
    headers = { 'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to request the image', url)
        return None

def save_image(content):
    # Name the file after the MD5 of its bytes so identical images are stored once.
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)

def main(offset):
    html = get_page_index(offset, KEYWORD)
    if not html:
        # The index request failed, so there is nothing to parse for this offset.
        return
    for url in parse_page_index(html):
        if not url:
            # Skip items that have no article_url.
            continue
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html, url)
            if isinstance(result, dict):
                save_to_mongo(result)

if __name__ == '__main__':
    groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
    pool = Pool()
    pool.map(main, groups)

config.py

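# MongoDB connection settings
MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'jiepai'

# Page groups to crawl; main() receives offset = group * 20.
GROUP_START = 1
GROUP_END = 20

# Search keyword sent to the site
KEYWORD = '街拍'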

由于公司不同意我们使用前后端分离进行开发,硬是要我们和PHP混合在一起,所以用gulp搭建了一个简单的手脚架来用目录结构: 主要是gulpfile.js里的内容 var gulp = require ...