Analyzing Ajax requests to scrape Toutiao (今日头条) street-snap photos and save them to MongoDB

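Prerequisite: MongoDB must be installed and running. The script also imports the third-party packages pymongo, requests, bs4 (BeautifulSoup), and lxml.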

Note: the Toutiao pages have changed since this was written, so the code below is not guaranteed to work.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Standard library
import json
import os
import re
from hashlib import md5
from json.decoder import JSONDecodeError
from multiprocessing import Pool
from urllib.parse import urlencode

# Third-party packages
import pymongo
import requests
from bs4 import BeautifulSoup
from requests.exceptions import ConnectionError

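# MongoDB connection settings
MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

# Range of result pages to fetch; each page holds 20 items
GROUP_START = 1
GROUP_END = 20

# Search keyword ("street snap")
KEYWORD = '街拍'

# connect=False defers the connection until the first operation,
# so each worker process opens its own connection.
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]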

def get_page_index(offset, keyword):
    """Fetch one page of search results (JSON text) from the Ajax endpoint."""
    data = {
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
        'format': 'json',
        'keyword': keyword,
        'offset': offset,  # the only parameter that changes between pages
    }
    params = urlencode(data)
    base = 'http://www.toutiao.com/search_content/'
    url = base + '?' + params
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        print('Error requesting index page', url)
        return None

def download_image(url):
    """Download a single image and write it to disk."""
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
    except ConnectionError:
        print('Error downloading image', url)

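def save_image(content):
    """Save image bytes in the current directory, named by their MD5 hash."""
    # Naming by content hash means identical images are stored only once.
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    print(file_path)
    if not os.path.exists(file_path):
        # The with block closes the file; no explicit close() is needed.
        with open(file_path, 'wb') as f:
            f.write(content)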

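def parse_page_index(text):
    """Yield the article URLs found in the index page JSON."""
    if not text:
        # The request failed; json.loads(None) would raise TypeError.
        return
    try:
        data = json.loads(text)
    except JSONDecodeError:
        return
    if data and 'data' in data:
        for item in data.get('data'):
            url = item.get('article_url')
            if url:  # some items carry no article URL
                yield url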

def get_page_detail(url):
    """Fetch the HTML of an article detail page, or None on failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        print('Error requesting detail page', url)
        return None

def parse_page_detail(html, url):
    """Extract the title and gallery image URLs from a detail page."""
    soup = BeautifulSoup(html, 'lxml')
    result = soup.select('title')
    title = result[0].get_text() if result else ''
    # Raw string, so the regex escapes are not treated as string escapes.
    images_pattern = re.compile(r'gallery: JSON\.parse\("(.*)"\)', re.S)
    result = re.search(images_pattern, html)
    if result:
        # The embedded JSON is backslash-escaped; strip the backslashes first.
        data = json.loads(result.group(1).replace('\\', ''))
        if data and 'sub_images' in data:
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
    return None

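def save_to_mongo(result):
    """Insert one record into MongoDB; return True on success."""
    try:
        # insert() is deprecated and was removed in PyMongo 4; use insert_one().
        db[MONGO_TABLE].insert_one(result)
    except pymongo.errors.PyMongoError:
        return False
    print('Successfully Saved to Mongo', result)
    return True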

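def main(offset):
    """Process one index page: fetch it, then scrape every article it lists."""
    text = get_page_index(offset, KEYWORD)
    urls = parse_page_index(text)
    for url in urls:
        if not url:
            continue
        html = get_page_detail(url)
        if not html:
            continue  # request failed; skip this article
        result = parse_page_detail(html, url)
        print(result)
        if result:
            save_to_mongo(result)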

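if __name__ == '__main__':
    # One offset per index page: 20, 40, ..., GROUP_END * 20
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    # Scrape the index pages in parallel, one page per worker task
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()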
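
The paging logic in the script can be tried out on its own, without touching the network. The following is a minimal sketch, assuming the same endpoint and query parameters as the script above (which may no longer be valid now that the site has changed). The helper names build_index_url and page_offsets are hypothetical and do not appear in the original code.

```python
from urllib.parse import urlencode

# Endpoint taken from the script above; the site may no longer serve it.
BASE_URL = 'http://www.toutiao.com/search_content/'


def build_index_url(offset, keyword, count=20):
    """Return the Ajax index URL for the result page starting at `offset`."""
    params = {
        'autoload': 'true',
        'count': count,
        'cur_tab': 3,
        'format': 'json',
        'keyword': keyword,
        'offset': offset,
    }
    # urlencode percent-encodes non-ASCII keywords such as '街拍' as UTF-8.
    return BASE_URL + '?' + urlencode(params)


def page_offsets(start, end, count=20):
    """Return one offset per page, mirroring the `groups` list in the script."""
    return [page * count for page in range(start, end + 1)]


if __name__ == '__main__':
    for offset in page_offsets(1, 3):
        print(build_index_url(offset, '街拍'))
```

Because GROUP_START is 1 in the script, the first offset requested is 20, so the results at offset 0 are skipped. Setting GROUP_START to 0 would include them.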
