Logging in to Zhihu with Scrapy

Reference: https://github.com/zkqiang/Zhihu-Login

# -*- coding: utf-8 -*-
import base64
import hashlib
import hmac
import json
import re
import time

import matplotlib.pyplot as plt
import scrapy
from PIL import Image

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    login_url = 'https://www.zhihu.com/signup'
    login_api = 'https://www.zhihu.com/api/v3/oauth/sign_in'
    login_data = {
        'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
        'grant_type': 'password',
        'source': 'com.zhihu.web',
        'username': "+86xxxxxx",
        'password': "xxxxxx",
        # Pass 'cn' to get the upside-down Chinese character captcha
        'lang': 'en',
        'ref_source': 'homepage'
    }
    headers = {
        'Connection': 'keep-alive',
        'Host': 'www.zhihu.com',
        'Referer': 'https://www.zhihu.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/69.0.3497.100 Safari/537.36'
    }

    def start_requests(self):
        # Ask the captcha endpoint first whether a captcha is required
        if self.login_data["lang"] == 'cn':
            api = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=cn'
        else:
            api = 'https://www.zhihu.com/api/v3/oauth/captcha?lang=en'
        yield scrapy.Request(url=api, headers=self.headers, callback=self._is_need_captcha)

    def _is_need_captcha(self, response):
        # The response body contains "true" when a captcha is required
        show_captcha = re.search(r'true', response.text)
        if show_captcha:
            # A PUT request to the same URL returns the captcha image
            yield scrapy.Request(url=response.url,
                                 headers=self.headers,
                                 method="PUT",
                                 callback=self._get_captcha)
        else:
            # No captcha needed: submit the login form directly
            timestamp = str(int(time.time() * 1000))
            self.login_data.update({
                'captcha': "",
                'timestamp': timestamp,
                'signature': self._get_signature(timestamp)
            })
            yield scrapy.FormRequest(
                url=self.login_api,
                formdata=self.login_data,
                headers=self.headers,
                callback=self.check_login
            )

    def _get_captcha(self, response):
        json_data = json.loads(response.text)
        # Strip line breaks from the base64 payload before decoding
        img_base64 = json_data['img_base64'].replace('\n', '')
        with open('./captcha.jpg', 'wb') as f:
            f.write(base64.b64decode(img_base64))
        img = Image.open('./captcha.jpg')
        if self.login_data["lang"] == 'cn':
            plt.imshow(img)
            print('Click all the upside-down characters, then press Enter to submit')
            points = plt.ginput(7)
            capt = json.dumps({'img_size': [200, 44],
                               'input_points': [[i[0] / 2, i[1] / 2] for i in points]})
        else:
            img.show()
            capt = input('Enter the captcha shown in the image: ')
        # The captcha answer must be POSTed to the captcha endpoint first
        yield scrapy.FormRequest(url=response.url,
                                 formdata={'input_text': capt},
                                 headers=self.headers,
                                 callback=self.captcha_login,
                                 meta={"captcha": capt})

    def captcha_login(self, response):
        timestamp = str(int(time.time() * 1000))
        self.login_data.update({
            'captcha': response.meta['captcha'],
            'timestamp': timestamp,
            'signature': self._get_signature(timestamp)
        })
        yield scrapy.FormRequest(
            url=self.login_api,
            formdata=self.login_data,
            headers=self.headers,
            callback=self.check_login
        )

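    def check_login(self, response):
        # Note: the login response itself is not inspected here; the spider
        # simply requests login_url and lets parse() print the page
        yield scrapy.Request(
            url=self.login_url,
            headers=self.headers,
            callback=self.parse
        )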

    def _get_signature(self, timestamp):
        """
        Compute and return the signature using HMAC.
        It is really just a few fixed strings plus the timestamp.
        :param timestamp: timestamp
        :return: signature
        """
        ha = hmac.new(b'd1b964811afb40118a12068ff74a12f4', digestmod=hashlib.sha1)
        grant_type = self.login_data['grant_type']
        client_id = self.login_data['client_id']
        source = self.login_data['source']
        ha.update(bytes((grant_type + client_id + source + timestamp), 'utf-8'))
        return ha.hexdigest()

    def parse(self, response):
        print(response.text)
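
The signature step can be tried on its own, without running the spider. The following is a minimal sketch of the same HMAC-SHA1 computation as `_get_signature`, using the key and fixed strings from the code above. The function name `build_signature` and the module-level constants are hypothetical names introduced only for this illustration.

```python
import hashlib
import hmac
import time

# Values copied from the spider above; the constant names are illustrative only.
HMAC_KEY = b'd1b964811afb40118a12068ff74a12f4'
GRANT_TYPE = 'password'
CLIENT_ID = 'c3cef7c66a1843f8b3a9e6a1e3160e20'
SOURCE = 'com.zhihu.web'


def build_signature(timestamp):
    """Return the hex HMAC-SHA1 of grant_type + client_id + source + timestamp."""
    message = (GRANT_TYPE + CLIENT_ID + SOURCE + timestamp).encode('utf-8')
    return hmac.new(HMAC_KEY, message, digestmod=hashlib.sha1).hexdigest()


if __name__ == '__main__':
    # Millisecond timestamp as a string, the same form the spider sends
    ts = str(int(time.time() * 1000))
    print(ts, build_signature(ts))
```

Because the timestamp is part of the signed message, the same timestamp string has to be sent in the `timestamp` form field alongside the `signature` value, which is why the spider computes both together before each login request.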
