python之scrapy携带Cookies模拟登陆

知识点

    """

    scrapy两种模拟登陆：

        1、直接携带cookie

        2、找到发送post请求的url地址，带上信息，发送请求

    应用场景：

        1、cookie过期时间很长，常见于一些不规范的网站

        2、能在cookie过期之前把搜有的数据拿到

        3、配合其他程序使用，比如其使用selenium把登陆之后的cookie获取到保存到本地，scrapy发送请求之前先读取本地cookie

    """

1、创建工程

scrapy startproject renren

2、创建工程

scrapy genspider login renren.com

3、setting.py文件设置COOKIES和COOKIES_DEBUG

# -*- coding: utf-8 -*-

# Scrapy settings for qq project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qq'

SPIDER_MODULES = ['qq.spiders']

NEWSPIDER_MODULE = 'qq.spiders'

#查看cookies信息

COOKIES_DEBUG=True

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'qq (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#   'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'qq.middlewares.QqSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'qq.middlewares.QqDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

#ITEM_PIPELINES = {

#    'qq.pipelines.QqPipeline': 300,

#}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

4、login.py文件实现模拟登陆

# -*- coding: utf-8 -*-

import scrapy

import re

class LoginSpider(scrapy.Spider):

    name = 'login'

    allowed_domains = ['renren.com']

    start_urls = ['http://www.renren.com/971298880/profile']

    def start_requests(self):

        """

        根据cookies模拟登陆人人网，注意settings.py文件的cookies必须是开启的

        :return:

        """

        cookies="anonymid=jxcn09d5-vd52v0; depovince=GUZ; _r01_=1; ick_login=d41bb8a9-056b-41a7-b187-7c706f0f8702; ick=39b091b8-f882-499b-992b-34a682d3469a; JSESSIONID=abcHJrhG1CAIo64PJRrUw; jebe_key=9ca5b44f-aaec-4180-962e-bf7581ad6e5e%7Cc1d85b293dafa0e44367ceed107b877e%7C1561517245262%7C1%7C1561517244004; jebe_key=9ca5b44f-aaec-4180-962e-bf7581ad6e5e%7Cc1d85b293dafa0e44367ceed107b877e%7C1561517245262%7C1%7C1561517244011; wp_fold=0; td_cookie=18446744069457827825; jebecookies=e9de8580-fb15-4891-b7c3-7c08ebb41f5c|||||; _de=AE9934B6C85831351B86F7DDD5B20F8A; p=b25ab0edb69343d7f80c5e481864b8c30; first_login_flag=1; ln_uact=18620028487; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; t=b8a0d848e228dd51d2c84609f814495b0; societyguester=b8a0d848e228dd51d2c84609f814495b0; id=971298880; xnsid=6e5116da; ver=7.0; loginfrom=null"

        cookies = {i.split("=")[0]:i.split("=")[1] for i in cookies.split("; ")}

        yield scrapy.Request(

            self.start_urls[0],

            callback=self.parse,

            cookies=cookies

        )

    def parse(self, response):

        print(re.findall("新用户28487",response.body.decode()))

        yield  scrapy.Request(

            #这里不需要再添加cookies，因为settings.py文件中开启了，只需要添加一次，下次请求自动携带

            'http://www.renren.com/971298880/profile?v=info_timeline',

            callback=self.parse_detail

        )

    def parse_detail(self,response): #详情页

        print(re.findall("新用户28487", response.body.decode()))

python之scrapy携带Cookies模拟登陆的更多相关文章

Scrapy 中的模拟登陆
目前,大部分网站都具有用户登陆功能,其中某些网站只有在用户登陆后才能获得有价值的信息,在爬取这类网站时,Scrapy 爬虫程序先模拟登陆,再爬取内容 1.登陆实质其核心是想服务器发送含有登陆表单数据 ...
Python爬虫学习笔记之模拟登陆并爬去GitHub
(1)环境准备: 请确保已经安装了requests和lxml库 (2)分析登陆过程: 首先要分析登陆的过程,需要探究后台的登陆请求是怎样发送的,登陆之后又有怎样的处理过程. 如果已经 ...
python之scrapy的FormRequest模拟POST表单自动登陆
1.FormRequest表单实现自动登陆 # -*- coding: utf-8 -*- import scrapy import re class GithubSpider(scrapy.Spid ...
python爬虫学习(3)_模拟登陆
1.登陆超星慕课,chrome抓包,模拟header,提取表单隐藏元素构成params. 主要是验证码图片地址,在js中发现由js->new Date().getTime()时间戳动态生成url ...
Python爬虫教程：requests模拟登陆github
1. Cookie 介绍 HTTP 协议是无状态的.因此,若不借助其他手段,远程的服务器就无法知道以前和客户端做了哪些通信.Cookie 就是「其他手段」之一. Cookie 一个典型的应用场景,就是 ...
使用cookies模拟登陆
http://blog.csdn.net/a1099439833/article/details/51918955 使用cookies会话跟踪,保持cookies访问,对于cookies会失效的问题可 ...
Python爬虫(二十二)_selenium案例：模拟登陆豆瓣
本篇博客主要用于介绍如何使用selenium+phantomJS模拟登陆豆瓣,没有考虑验证码的问题,更多内容,请参考:Python学习指南 #-*- coding:utf-8 -*- from sel ...
【转】详解抓取网站，模拟登陆，抓取动态网页的原理和实现（Python，C#等）
转自:http://www.crifan.com/files/doc/docbook/web_scrape_emulate_login/release/html/web_scrape_emulate_ ...
使用OkHttp模拟登陆LeetCode
前言网上有很多模拟登陆 LeetCode 的教程,但是基本都是使用 Python 来实现的.作为一个 Java 语言爱好者,因此想用 Java 来实现下.在实现的过程中,也遇到了一些坑点,故在此作为 ...

随机推荐

python——flask常见接口开发（简单案例）
python——flask常见接口开发(简单案例)原创大蛇王发布于2019-01-24 11:34:06 阅读数 5208 收藏展开版本:python3.5+ 模块:flask 目标:开发一个只 ...
【概率dp】vijos 3747 随机图
没有养成按状态逐步分析问题的思维题目描述在一张图内,两点$i,j$之间有$p$的概率的概率生成一条边.求该图不出现大小$\ge 4$连通块的概率. $n \le 100,答案在实数意义下$ 题目分 ...
fiddler save files
使用fiddler 保存访问到的文件使用jscript Fiddler Script 是用JScript.NET语言写的 JScript.NET 此语言可以调用C# api 参考地址:http:// ...
C#新增按钮
代码亲测可用,似乎不需要“ADD”,如下:form_load段:for (int i = 0; i < 10; i++){btn = new Button();btn.Parent = this ...
HDU 5634 （线段树）
题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=5634 题意:给出 n 个数,有三种操作,把区间的 ai 变为 φ(ai):把区间的 ai 变为 x:查 ...
springboot2.0入门（一）----springboot 简介
一.springboot解决了什么? 避免了繁杂的xml配置,框架自动帮我们完成了相关的配置,当我们需要进行相关插件集成的时候,只需要将相关的starter通过相关的maven依赖引进,并可以进行相关 ...
ubuntu16.04卡死的解决办法
1.输入命令:top 找到chrome所占用的线程的pid. 2.kill pid
Chef and Problems（from Code-Chef FNCS） ( 回滚 )
题目: 题意:给定序列,求[l,r]区间内数字相同的数的最远距离. 链接:https://www.codechef.com/problems/QCHEF #include<bits/stdc++ ...
【csp模拟赛2】序列操作
线性推,开数组太麻烦,可以用指针代码: #include <iostream> #include <cstdio> #include <queue> using ...
前端vue的get和post请求
vue的get和post需要两个文件vue.js和vue-resource.js 以下是实现的代码,可以参考一下,需要注意的接口的请求需要考虑跨域的问题,其次就是访问页面需要在tomcat下访问,否则 ...

python之scrapy携带Cookies模拟登陆

python之scrapy携带Cookies模拟登陆的更多相关文章

随机推荐

热门专题