15.scrapy模拟登陆案例
1.案例一
a.创建项目
scrapy startproject renren_login
进入项目路径
scrapy genspider renren "renren.com"
renren.py
# -*- coding: utf-8 -*-
import scrapy class RenrenSpider(scrapy.Spider):
name = 'renren'
allowed_domains = ['renren.com']
start_urls = ['http://renren.com/'] def start_requests(self):
url="http://www.renren.com/PLogin.do"
data={"email":"xxxxxxxx@126.com","password":"xxxxxxx"}
request=scrapy.FormRequest(url,formdata=data,callback=self.parse_page)
yield request def parse_page(self, response):
request=scrapy.Request(url='http://www.renren.com/326282648/profile',callback=self.parse_profile)
yield request def parse_profile(self,response):
with open("wenliang.html","w",encoding="utf-8") as fp:
fp.write(response.text)
在项目路径下创建start.py
from scrapy import cmdline
cmdline.execute(["scrapy","crawl","renren"])
2.案例2
a.手动输入验证码
创建项目
scrapy startproject douban_login
进去项目路径
scrapy genspider douban "douban.com"
settings.py
# -*- coding: utf-8 -*- # Scrapy settings for douban_login project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'douban_login' SPIDER_MODULES = ['douban_login.spiders']
NEWSPIDER_MODULE = 'douban_login.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_login (+http://www.yourdomain.com)' # Obey robots.txt rules
ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban_login.middlewares.DoubanLoginSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'douban_login.middlewares.DoubanLoginDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'douban_login.pipelines.DoubanLoginPipeline': 300,
#} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
douban.py
# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image
class DoubanSpider(scrapy.Spider):
name = 'douban'
allowed_domains = ['douban.com']
start_urls = ['https://www.douban.com/login']
login_url="https://www.douban.com/login"
profile_url="https://www.douban.com/people/184480369/"
editsignature_url="https://www.douban.com/j/people/184480369/edit_signature" def parse(self, response):
formdata={
"source":"None",
"redir":"https://www.douban.com/",
"form_email":"xxxxxx@qq.com",
"form_password":"xxxxxx!",
"remember":"on",
"login":"登录"
} captcha_url=response.css("img#captcha_image::attr(src)").get() if captcha_url:
captcha=self.regonize_captcha(captcha_url)
formdata["captcha-solution"]=captcha
captcha_id=response.xpath("//input[@name='captcha-id']/@value").get()
formdata["captcha-id"]=captcha_id
yield scrapy.FormRequest(url=self.login_url,formdata=formdata,callback=self.parse_after_login) def parse_after_login(self,response):
if response.url=="https://www.douban.com/":
yield scrapy.Request(self.profile_url,callback=self.parse_profile)
print("登录成功")
else:
print("登录失败") def parse_profile(self,response):
print(response.url)
if response.url==self.profile_url:
print("进入到了个人中心")
ck=response.xpath("//input[@name='ck']/@value").get()
formdata={
"ck":ck,
"signature":"丈夫处世兮立功名"
}
yield scrapy.FormRequest(self.editsignature_url,formdata=formdata)
else:
print("没有进入个人中心") def regonize_captcha(self,image_url):
request.urlretrieve(image_url,"captcha.png")
image=Image.open("captcha.png")
image.show()
captcha=input("请输入验证码:")
return captcha
在douban_login目录下创建start.py
from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())
执行start.py即可
b.自动识别验证码
from urllib import request
from base64 import b64decode
import requests captcha_url="https://www.douban.com/misc/captcha?id=TCEAV2F8SbBgKbXZ5JAI2G6L:en&size=s"
request.urlretrieve(captcha_url,"captcha.png") recognize_url="http://xxxxxx"
formdata={}
with open("captcha.png","rb") as fp:
data=fp.read()
pic=b64decode(data)
formdata['pic']=pic appcode='xxxxxxxxxxxxxxx'
headers={
"Content-Type":"application/x-www-form-urlencode; charset=UTF-8",
'Authorization':'APPCODE'+appcode
}
response=requests.post(recognize_url,data=formdata,headers=headers)
print(response)
c.其他自动识别案例
from selenium import webdriver
import time
import requests
from lxml import etree
import base64 # 操作浏览器
driver = webdriver.Chrome()
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001' driver.get(url)
time.sleep(1)
driver.find_element_by_id('email').send_keys('')
time.sleep(1)
driver.find_element_by_id('password').send_keys('yaoqinglin2011')
time.sleep(1) # 获取验证码相关信息
html_str = driver.page_source
html_ele = etree.HTML(html_str)
# 得到验证码的url
image_url = html_ele.xpath('//img[@id="captcha_image"]/@src')[0]
# 获取这个图片的内容
response = requests.get(image_url) # 获取base64的str
# https://market.aliyun.com/products/57124001/cmapi028447.html?spm=5176.2020520132.101.5.2HEXEG#sku=yuncode2244700000
b64_str = base64.b64encode(response.content)
v_type = 'cn'
# post 提交打码平台的数据
form = {
'v_pic': b64_str,
'v_type': v_type,
} # authtication的header
headers = {
'Authorization': 'APPCODE eab23fa1d03f40d48b43c826c57bd284',
}
# 从打码平台获取验证码信息
dmpt_url = 'http://yzmplus.market.alicloudapi.com/fzyzm'
response = requests.post(dmpt_url, form, headers=headers)
print(response.text)
# captcha_value 就是我们的验证码信息
captcha_value = response.json()['v_code'] print(image_url)
print(captcha_value)
# captcha_value = input('请输入验证码') driver.find_element_by_id('captcha_field').send_keys(captcha_value)
time.sleep(1)
driver.find_element_by_class_name('btn-submit').click()
time.sleep(1)
# 获取所有的cookie的信息
cookies = driver.get_cookies()
cookie_list =[] # 对于每一个cookie_dict, 就是将name 和 value取出, 拼接成name=value;
for cookie_dict in cookies:
cookie_str = cookie_dict['name'] + '=' + cookie_dict['value']
cookie_list.append(cookie_str) # 拼接所有的cookie到header_cookie中
header_cookie = '; '.join(cookie_list) headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
'Cookie': header_cookie,
}
another_url = 'https://www.douban.com/accounts/'
response = requests.get(another_url, headers=headers) with open('cc.html', 'wb') as f:
f.write(response.content) # with open('douban.html', 'wb') as f:
# f.write(driver.page_source.encode('utf-8'))
15.scrapy模拟登陆案例的更多相关文章
- 第三百四十三节,Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别
第三百四十三节,Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别 第一步.首先下载,大神者也的倒立文字验证码识别程序 下载地址:https://gith ...
- Scrapy模拟登陆
1. 为什么需要模拟登陆? #获取cookie,能够爬取登陆后的页面 2. 回顾: requests是如何模拟登陆的? #1.直接携带cookies请求页面 #2.找接口发送post请求存储cooki ...
- Scrapy 模拟登陆知乎--抓取热点话题
工具准备 在开始之前,请确保 scrpay 正确安装,手头有一款简洁而强大的浏览器, 若是你有使用 postman 那就更好了. Python 1 scrapy genspid ...
- 爬虫入门之scrapy模拟登陆(十四)
注意:模拟登陆时,必须保证settings.py里的COOKIES_ENABLED(Cookies中间件) 处于开启状态 COOKIES_ENABLED = True或# COOKIES_ENABLE ...
- python之scrapy模拟登陆人人网
1.settings.py主要配置信息,包括USER_AGENT等 # -*- coding: utf-8 -*- # Scrapy settings for renren project # # F ...
- Scrapy模拟登陆豆瓣抓取数据
scrapy startproject douban 其中douban是我们的项目名称 2创建爬虫文件 进入到douban 然后创建爬虫文件 scrapy genspider dou douban. ...
- scrapy 模拟登陆
import scrapy import urllib.request from scrapy.http import Request,FormRequest class LoginspdSpider ...
- 二十二 Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别
第一步.首先下载,大神者也的倒立文字验证码识别程序 下载地址:https://github.com/muchrooms/zheye 注意:此程序依赖以下模块包 Keras==2.0.1 Pillow= ...
- 识别图片验证码的三种方式(scrapy模拟登陆豆瓣网)
1.通过肉眼识别,然后输入到input里面 from PIL import image Image request.urlretrieve(url,'image') #下载验证码图片 image = ...
随机推荐
- js jquery 数组的合并 对象的合并
转载自:http://www.cnblogs.com/xingxiangyi/p/6416468.html 1 数组合并 1.1 concat 方法 1 2 3 4 var a=[1,2,3],b=[ ...
- bzoj2554: Color
Description 有n个球排成一列,每个球都有一个颜色,用A-Z的大写字母来表示,我们每次随机选出两个球ball1,ball2,使得后者染上前者的颜色,求期望操作多少次,才能使得所有球的颜色都一 ...
- [luogu3391][文艺平衡树]
题目链接 思路 splay区间操作的裸题. 假如要对l-r这段区间操作,那么就先把l-1伸展到根节点,然后把r +1伸展为根的儿子.这样r + 1的左儿子就是要操作的区间了.只要在上面打上标记,以后每 ...
- 使用ZXing.Net生成与识别二维码(QR Code)
Google ZXing是目前一个常用的基于Java实现的多种格式的1D/2D条码图像处理库,出于其开源的特性其现在已有多平台版本.比如今天要用到的ZXing.Net就是针对微软.Net平台的版本.使 ...
- semantic ui框架学习笔记一
面包屑导航 面包屑导航经常用于多个栏目下的内容管理,是web页面里比较常用的组合.例如: <div class="ui breadcrumb"> <a class ...
- C++ Exception机制
C++异常机制的执行顺序. 在构造函数内抛出异常 /* * ExceptClass.h * * Created on: 2018年1月2日 * Author: jacket */ #ifndef EX ...
- linux下静态库和动态库的制作
一.静态库 1.编写.c文件,在其中实现函数源代码,同时制作头文件 2.将.c文件转为.o文件 gcc -c xxx.c -o xxx.o 3.将*.o转换成库文件 a ...
- java equals和hashcode方法
equals()方法比较两个对象的引用是否相同 hashcode()方法比较两个对象的哈希码是否相同
- (LIS DP) codeVs 1044 拦截导弹
题目描述 Description 某国为了防御敌国的导弹袭击,发展出一种导弹拦截系统.但是这种导弹拦截系统有一个缺陷:虽然它的第一发炮弹能够到达任意的高度,但是以后每一发炮弹都不能高于前一发的高度.某 ...
- 关键字(3):order by/group by/having/where/sum/count(*)...查询结果筛选关键字
ORDER BY <属性表> 只要在WHERE子句的选择条件后面加上如下子句:ORDER BY <属性表> 就可以实现输出的排序,默认的顺序为升序(ASC).可以在属性的后面加 ...