1.案例一

a.创建项目

scrapy startproject renren_login

进入项目路径

scrapy genspider renren "renren.com"

renren.py

# -*- coding: utf-8 -*-
import scrapy class RenrenSpider(scrapy.Spider):
name = 'renren'
allowed_domains = ['renren.com']
start_urls = ['http://renren.com/'] def start_requests(self):
url="http://www.renren.com/PLogin.do"
data={"email":"xxxxxxxx@126.com","password":"xxxxxxx"}
request=scrapy.FormRequest(url,formdata=data,callback=self.parse_page)
yield request def parse_page(self, response):
request=scrapy.Request(url='http://www.renren.com/326282648/profile',callback=self.parse_profile)
yield request def parse_profile(self,response):
with open("wenliang.html","w",encoding="utf-8") as fp:
fp.write(response.text)

在项目路径下创建start.py

from scrapy import cmdline
cmdline.execute(["scrapy","crawl","renren"])

2.案例2

a.手动输入验证码

创建项目

scrapy startproject douban_login

进去项目路径

scrapy genspider douban "douban.com"

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban_login project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'douban_login' SPIDER_MODULES = ['douban_login.spiders']
NEWSPIDER_MODULE = 'douban_login.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_login (+http://www.yourdomain.com)' # Obey robots.txt rules
ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban_login.middlewares.DoubanLoginSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'douban_login.middlewares.DoubanLoginDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'douban_login.pipelines.DoubanLoginPipeline': 300,
#} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

douban.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image
class DoubanSpider(scrapy.Spider):
name = 'douban'
allowed_domains = ['douban.com']
start_urls = ['https://www.douban.com/login']
login_url="https://www.douban.com/login"
profile_url="https://www.douban.com/people/184480369/"
editsignature_url="https://www.douban.com/j/people/184480369/edit_signature" def parse(self, response):
formdata={
"source":"None",
"redir":"https://www.douban.com/",
"form_email":"xxxxxx@qq.com",
"form_password":"xxxxxx!",
"remember":"on",
"login":"登录"
} captcha_url=response.css("img#captcha_image::attr(src)").get() if captcha_url:
captcha=self.regonize_captcha(captcha_url)
formdata["captcha-solution"]=captcha
captcha_id=response.xpath("//input[@name='captcha-id']/@value").get()
formdata["captcha-id"]=captcha_id
yield scrapy.FormRequest(url=self.login_url,formdata=formdata,callback=self.parse_after_login) def parse_after_login(self,response):
if response.url=="https://www.douban.com/":
yield scrapy.Request(self.profile_url,callback=self.parse_profile)
print("登录成功")
else:
print("登录失败") def parse_profile(self,response):
print(response.url)
if response.url==self.profile_url:
print("进入到了个人中心")
ck=response.xpath("//input[@name='ck']/@value").get()
formdata={
"ck":ck,
"signature":"丈夫处世兮立功名"
}
yield scrapy.FormRequest(self.editsignature_url,formdata=formdata)
else:
print("没有进入个人中心") def regonize_captcha(self,image_url):
request.urlretrieve(image_url,"captcha.png")
image=Image.open("captcha.png")
image.show()
captcha=input("请输入验证码:")
return captcha

在douban_login目录下创建start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl douban".split())

执行start.py即可

b.自动识别验证码

from urllib import request
from base64 import b64decode
import requests captcha_url="https://www.douban.com/misc/captcha?id=TCEAV2F8SbBgKbXZ5JAI2G6L:en&size=s"
request.urlretrieve(captcha_url,"captcha.png") recognize_url="http://xxxxxx"
formdata={}
with open("captcha.png","rb") as fp:
data=fp.read()
pic=b64decode(data)
formdata['pic']=pic appcode='xxxxxxxxxxxxxxx'
headers={
"Content-Type":"application/x-www-form-urlencode; charset=UTF-8",
'Authorization':'APPCODE'+appcode
}
response=requests.post(recognize_url,data=formdata,headers=headers)
print(response)

c.其他自动识别案例

from selenium import webdriver
import time
import requests
from lxml import etree
import base64 # 操作浏览器
driver = webdriver.Chrome()
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001' driver.get(url)
time.sleep(1)
driver.find_element_by_id('email').send_keys('')
time.sleep(1)
driver.find_element_by_id('password').send_keys('yaoqinglin2011')
time.sleep(1) # 获取验证码相关信息
html_str = driver.page_source
html_ele = etree.HTML(html_str)
# 得到验证码的url
image_url = html_ele.xpath('//img[@id="captcha_image"]/@src')[0]
# 获取这个图片的内容
response = requests.get(image_url) # 获取base64的str
# https://market.aliyun.com/products/57124001/cmapi028447.html?spm=5176.2020520132.101.5.2HEXEG#sku=yuncode2244700000
b64_str = base64.b64encode(response.content)
v_type = 'cn'
# post 提交打码平台的数据
form = {
'v_pic': b64_str,
'v_type': v_type,
} # authtication的header
headers = {
'Authorization': 'APPCODE eab23fa1d03f40d48b43c826c57bd284',
}
# 从打码平台获取验证码信息
dmpt_url = 'http://yzmplus.market.alicloudapi.com/fzyzm'
response = requests.post(dmpt_url, form, headers=headers)
print(response.text)
# captcha_value 就是我们的验证码信息
captcha_value = response.json()['v_code'] print(image_url)
print(captcha_value)
# captcha_value = input('请输入验证码') driver.find_element_by_id('captcha_field').send_keys(captcha_value)
time.sleep(1)
driver.find_element_by_class_name('btn-submit').click()
time.sleep(1)
# 获取所有的cookie的信息
cookies = driver.get_cookies()
cookie_list =[] # 对于每一个cookie_dict, 就是将name 和 value取出, 拼接成name=value;
for cookie_dict in cookies:
cookie_str = cookie_dict['name'] + '=' + cookie_dict['value']
cookie_list.append(cookie_str) # 拼接所有的cookie到header_cookie中
header_cookie = '; '.join(cookie_list) headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
'Cookie': header_cookie,
}
another_url = 'https://www.douban.com/accounts/'
response = requests.get(another_url, headers=headers) with open('cc.html', 'wb') as f:
f.write(response.content) # with open('douban.html', 'wb') as f:
# f.write(driver.page_source.encode('utf-8'))

15.scrapy模拟登陆案例的更多相关文章

  1. 第三百四十三节,Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别

    第三百四十三节,Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别 第一步.首先下载,大神者也的倒立文字验证码识别程序 下载地址:https://gith ...

  2. Scrapy模拟登陆

    1. 为什么需要模拟登陆? #获取cookie,能够爬取登陆后的页面 2. 回顾: requests是如何模拟登陆的? #1.直接携带cookies请求页面 #2.找接口发送post请求存储cooki ...

  3. Scrapy 模拟登陆知乎--抓取热点话题

    工具准备 在开始之前,请确保 scrpay 正确安装,手头有一款简洁而强大的浏览器, 若是你有使用 postman 那就更好了.           Python   1 scrapy genspid ...

  4. 爬虫入门之scrapy模拟登陆(十四)

    注意:模拟登陆时,必须保证settings.py里的COOKIES_ENABLED(Cookies中间件) 处于开启状态 COOKIES_ENABLED = True或# COOKIES_ENABLE ...

  5. python之scrapy模拟登陆人人网

    1.settings.py主要配置信息,包括USER_AGENT等 # -*- coding: utf-8 -*- # Scrapy settings for renren project # # F ...

  6. Scrapy模拟登陆豆瓣抓取数据

    scrapy  startproject douban 其中douban是我们的项目名称 2创建爬虫文件 进入到douban 然后创建爬虫文件 scrapy genspider dou douban. ...

  7. scrapy 模拟登陆

    import scrapy import urllib.request from scrapy.http import Request,FormRequest class LoginspdSpider ...

  8. 二十二 Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别

    第一步.首先下载,大神者也的倒立文字验证码识别程序 下载地址:https://github.com/muchrooms/zheye 注意:此程序依赖以下模块包 Keras==2.0.1 Pillow= ...

  9. 识别图片验证码的三种方式(scrapy模拟登陆豆瓣网)

    1.通过肉眼识别,然后输入到input里面 from PIL import image Image request.urlretrieve(url,'image')  #下载验证码图片 image = ...

随机推荐

  1. Swarm部署集群

    swarm-manager 是 manager node,swarm-worker是 worker node.所有节点的 Docker 版本均不低于 v1.12.在master上 docker swa ...

  2. Freescale 车身控制模块(BCM) 解决方案

    中国汽车业已成为全球第一市场,标志着中国汽车产业进入了白热化竞争时代,因此,人们对汽车的操控性,安全性,易用性,舒适性,以及智能化要求也越来越高,更大的空间需求和更多的零部件因而产生了冲突,这就要求汽 ...

  3. 数组拆分I

    题目描述 给定长度为 2n 的数组, 你的任务是将这些数分成 n 对, 例如 (a1, b1), (a2, b2), ..., (an, bn) ,使得从1 到 n 的 min(ai, bi) 总和最 ...

  4. This license xxx has been cancelled 解决

    上节回顾:JetBrains全家桶破解思路 hosts屏蔽一下即可,Linux是:/etc/hosts 0.0.0.0 account.jetbrains.com 重新输入Code即可,最后补一个地址 ...

  5. Python基础之文件和目录操作

    1 .文件操作 1.1 文件打开和关闭 在python, 使用 open 函数, 可以打开一个已经存在的文件, 或者创建一个新文件. # 打开文件 f = open('test.txt', 'w') ...

  6. Qt5应用改变窗口大小时出现黑影

    解决方法 在启动程序时,添加-platform wayland参数 添加QT_QPA_PLATFORM=wayland-egl到系统环境变量 注意:改完后虽然没有黑影,但软件图标显示不正常,也不能正常 ...

  7. 洛谷P3185 分裂游戏

    解:这个毒瘤...... 我们首先发现每一堆的个数对操作不产生影响,每个操作都是针对单个石子的. 所以等价于每个石子都是一个独立的游戏.把它们异或起来作为sg函数值即可. 单个石子的sg值,直接暴力计 ...

  8. WebAPI集成SignalR

    WebAPI提供通用数据接口,SignalR提供实时消息传输,两者可以根据实际业务需求进行组合. 环境 版本 操作系统 Windows 10 prefessional 编译器 Visual Studi ...

  9. Remote debugger is in a background tab which may cause apps to perform slowly. Fix this by foregrounding the tab (or opening it in a separate window).

    先上代码: /** * Sample React Native App * https://github.com/facebook/react-native * * @format * @flow * ...

  10. react-native中的style

    在 React Native 中,你并不需要学习什么特殊的语法来定义样式.我们仍然是使用 JavaScript 来写样式. 所有的核心组件都接受名为style的属性.这些样式名基本上是遵循了 web ...