Python crawler basics: scraping Toutiao (今日头条) with the requests library and Ajax

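'''
Approach
1. The page loads its content through Ajax, so scroll down a few times first and watch how the XHR requests change.
2. Analyze the content returned in the JS/JSON responses.
3. Fetch the content of one page.
4. Extract the image URLs.
5. Save the images locally.

Libraries used
1. requests: fetches web pages
2. from urllib.parse import urlencode: converts a dict into a query string that is appended to the URL
3. os: file and directory operations
4. from hashlib import md5: MD5 hashing
5. from multiprocessing.pool import Pool: process pool (it runs multiple processes, not threads)
'''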

import os
from hashlib import md5
from multiprocessing.pool import Pool
from urllib.parse import urlencode

import requests

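# Step 1: fetch one page of results. The XHR requests show that loading more images only
# changes the offset value, and that the target URL is the base URL plus the URL-encoded
# params dict. If the server answers with status 200, return the parsed JSON; if it answers
# with anything else, or the request fails, return None.
def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # search keyword ("street snaps")
        'autoload': 'true',
        'count': '',
        'cur_tab': '',
        'from': 'search_tab',
    }
    base_url = 'https://www.toutiao.com/search_content/?'  # base URL of the search endpoint
    url = base_url + urlencode(params)  # append the query string
    try:
        resp = requests.get(url)
        if resp.status_code == 200:
            return resp.json()
    except (requests.RequestException, ValueError):  # network error or a body that is not JSON
        pass
    return None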
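To see what `urlencode` contributes in this step, here is a minimal sketch that builds the request URL for one offset without sending anything. The helper name `build_search_url` is hypothetical and not part of the original script; the parameters simply mirror the `params` dict in `get_page`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://www.toutiao.com/search_content/?'


def build_search_url(offset, keyword='街拍'):
    """Hypothetical helper: return the URL that get_page would request."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,  # non-ASCII text is percent-encoded as UTF-8
        'autoload': 'true',
        'count': '',         # empty values are kept as "count="
        'cur_tab': '',
        'from': 'search_tab',
    }
    return BASE_URL + urlencode(params)


if __name__ == '__main__':
    url = build_search_url(20)
    print(url)
    # Parsing the query string back recovers the original values.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    print(query['offset'], query['keyword'])
```

Printing the URL before requesting it makes it easy to compare against the request shown in the browser's XHR panel.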

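# Step 2: with the page JSON in hand, extract what we want, namely the images.
# Pass in the parsed JSON and read each field with dict.get(key).
def get_img(json):
    if json and json.get('data'):  # 'data' is the list of result items; json is None if the request failed
        data = json.get('data')
        for item in data:  # walk through every item in data
            if item.get('cell_type') is not None:  # skip items that carry a cell_type field
                continue
            title = item.get('title')
            images = item.get('image_list') or []  # some items have no image_list
            for image in images:
                yield {
                    'image': 'https:' + image.get('url'),
                    'title': title
                }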
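The extraction logic can be tried offline against a small hand-written payload. This is only a sketch: the function name `extract_images` is hypothetical, and the sample dict is made up to contain just the fields the code above reads (`data`, `cell_type`, `title`, `image_list`, `url`); the real response carries many more fields and its shape may have changed.

```python
def extract_images(payload):
    """Yield {'image', 'title'} dicts from a response shaped like the one get_img reads."""
    for item in (payload or {}).get('data') or []:
        if item.get('cell_type') is not None:  # same filter as get_img
            continue
        for image in item.get('image_list') or []:
            url = image.get('url')
            if url:  # the URLs are protocol-relative, so prefix the scheme
                yield {'image': 'https:' + url, 'title': item.get('title')}


# Made-up sample data, for illustration only.
sample = {
    'data': [
        {'title': 'demo album',
         'image_list': [{'url': '//example.com/a.jpg'}, {'url': '//example.com/b.jpg'}]},
        {'cell_type': 1, 'title': 'skipped'},
        {'title': 'no images'},
    ]
}

results = list(extract_images(sample))
print(results)
```

Only the first item yields results: the second is filtered out by `cell_type`, and the third has no `image_list`.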

# Step 3: save the images locally. The argument is one item yielded by get_img;
# the os module handles the folders.
def save_files(item):
    img_path = 'img' + os.path.sep + (item.get('title') or 'untitled')  # fall back when an item has no title
    os.makedirs(img_path, exist_ok=True)  # create the folder if it does not exist yet
    try:
        resp = requests.get(item.get('image'))
        if resp.status_code == 200:
            file_path = img_path + os.path.sep + '{file_name}.{file_suf}'.format(
                file_name=md5(resp.content).hexdigest(),  # name the file after the MD5 of its content
                file_suf='jpg'
            )
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is ' + file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image, item %s' % item)
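Naming each file after the MD5 of its bytes means the same image downloaded twice maps to the same path, so the second copy is skipped. The sketch below shows that effect in a temporary directory with no network access; `save_bytes` is a hypothetical helper written for this illustration, not a function from the script.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(content, directory, suffix='jpg'):
    """Hypothetical helper: write content to <directory>/<md5>.<suffix>.

    Returns (path, written), where written is False if the file already existed.
    """
    os.makedirs(directory, exist_ok=True)
    file_name = '{}.{}'.format(md5(content).hexdigest(), suffix)
    file_path = os.path.join(directory, file_name)
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(b'same bytes', tmp)
    second = save_bytes(b'same bytes', tmp)   # identical content, same file name
    third = save_bytes(b'other bytes', tmp)   # different content, new file
    print(first[1], second[1], third[1])
    print(len(os.listdir(tmp)))
```

The hash only detects byte-identical files; the same picture served at a different size or compression level would still be saved as a separate file.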

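# Step 4: the main function. Changing offset selects which page of results is fetched.
def main(offset):
    json = get_page(offset)
    if json is None:  # the request failed or returned a non-200 status
        return
    for item in get_img(json):
        save_files(item)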

GROUP_START = 0
GROUP_END = 7

# Finally, download with a process pool, one offset per task.
if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]  # offsets 0, 20, ..., 140
    pool.map(main, groups)
    pool.close()
    pool.join()
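`Pool` above starts worker processes. Because downloading is mostly waiting on the network, a thread pool is a possible alternative; the sketch below uses `ThreadPool` from the same `multiprocessing.pool` module, which exposes the same `map` interface. This is a swapped-in variation, not what the original script does. The names `make_offsets` and `fake_main` are hypothetical, and `fake_main` stands in for `main` so the example runs without network access.

```python
from multiprocessing.pool import ThreadPool

GROUP_START = 0
GROUP_END = 7


def make_offsets(start, end, step=20):
    """Return one offset per page, using the same step of 20 as the script above."""
    return [x * step for x in range(start, end + 1)]


def fake_main(offset):
    """Stand-in for main(offset); it only reports which page it would fetch."""
    return 'page at offset %d' % offset


if __name__ == '__main__':
    offsets = make_offsets(GROUP_START, GROUP_END)
    with ThreadPool(4) as pool:  # 4 worker threads, an arbitrary choice
        for line in pool.map(fake_main, offsets):
            print(line)
```

`pool.map` returns results in the order of the input offsets, regardless of which worker finishes first.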

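This Ajax approach suits sites such as Weibo and Toutiao, where you scroll down to load more and the content is delivered through JS rather than in the initial HTML.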
