1.案例一

a.创建项目

scrapy startproject renren_login

进入项目路径

scrapy genspider renren "renren.com"

renren.py

# -*- coding: utf-8 -*-
import scrapy class RenrenSpider(scrapy.Spider):
name = 'renren'
allowed_domains = ['renren.com']
start_urls = ['http://renren.com/'] def start_requests(self):
url="http://www.renren.com/PLogin.do"
data={"email":"xxxxxxxx@126.com","password":"xxxxxxx"}
request=scrapy.FormRequest(url,formdata=data,callback=self.parse_page)
yield request def parse_page(self, response):
request=scrapy.Request(url='http://www.renren.com/326282648/profile',callback=self.parse_profile)
yield request def parse_profile(self,response):
with open("wenliang.html","w",encoding="utf-8") as fp:
fp.write(response.text)

在项目路径下创建start.py

from scrapy import cmdline
cmdline.execute(["scrapy","crawl","renren"])

2.案例2

a.手动输入验证码

创建项目

scrapy startproject douban_login

进去项目路径

scrapy genspider douban "douban.com"

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban_login project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'douban_login' SPIDER_MODULES = ['douban_login.spiders']
NEWSPIDER_MODULE = 'douban_login.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_login (+http://www.yourdomain.com)' # Obey robots.txt rules
ROBOTSTXT_OBEY = False # Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
} # Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban_login.middlewares.DoubanLoginSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'douban_login.middlewares.DoubanLoginDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'douban_login.pipelines.DoubanLoginPipeline': 300,
#} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

douban.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image
class DoubanSpider(scrapy.Spider):
name = 'douban'
allowed_domains = ['douban.com']
start_urls = ['https://www.douban.com/login']
login_url="https://www.douban.com/login"
profile_url="https://www.douban.com/people/184480369/"
editsignature_url="https://www.douban.com/j/people/184480369/edit_signature" def parse(self, response):
formdata={
"source":"None",
"redir":"https://www.douban.com/",
"form_email":"xxxxxx@qq.com",
"form_password":"xxxxxx!",
"remember":"on",
"login":"登录"
} captcha_url=response.css("img#captcha_image::attr(src)").get() if captcha_url:
captcha=self.regonize_captcha(captcha_url)
formdata["captcha-solution"]=captcha
captcha_id=response.xpath("//input[@name='captcha-id']/@value").get()
formdata["captcha-id"]=captcha_id
yield scrapy.FormRequest(url=self.login_url,formdata=formdata,callback=self.parse_after_login) def parse_after_login(self,response):
if response.url=="https://www.douban.com/":
yield scrapy.Request(self.profile_url,callback=self.parse_profile)
print("登录成功")
else:
print("登录失败") def parse_profile(self,response):
print(response.url)
if response.url==self.profile_url:
print("进入到了个人中心")
ck=response.xpath("//input[@name='ck']/@value").get()
formdata={
"ck":ck,
"signature":"丈夫处世兮立功名"
}
yield scrapy.FormRequest(self.editsignature_url,formdata=formdata)
else:
print("没有进入个人中心") def regonize_captcha(self,image_url):
request.urlretrieve(image_url,"captcha.png")
image=Image.open("captcha.png")
image.show()
captcha=input("请输入验证码:")
return captcha

在douban_login目录下创建start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl douban".split())

执行start.py即可

b.自动识别验证码

from urllib import request
from base64 import b64decode
import requests captcha_url="https://www.douban.com/misc/captcha?id=TCEAV2F8SbBgKbXZ5JAI2G6L:en&size=s"
request.urlretrieve(captcha_url,"captcha.png") recognize_url="http://xxxxxx"
formdata={}
with open("captcha.png","rb") as fp:
data=fp.read()
pic=b64decode(data)
formdata['pic']=pic appcode='xxxxxxxxxxxxxxx'
headers={
"Content-Type":"application/x-www-form-urlencode; charset=UTF-8",
'Authorization':'APPCODE'+appcode
}
response=requests.post(recognize_url,data=formdata,headers=headers)
print(response)

c.其他自动识别案例

from selenium import webdriver
import time
import requests
from lxml import etree
import base64 # 操作浏览器
driver = webdriver.Chrome()
url = 'https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001' driver.get(url)
time.sleep(1)
driver.find_element_by_id('email').send_keys('')
time.sleep(1)
driver.find_element_by_id('password').send_keys('yaoqinglin2011')
time.sleep(1) # 获取验证码相关信息
html_str = driver.page_source
html_ele = etree.HTML(html_str)
# 得到验证码的url
image_url = html_ele.xpath('//img[@id="captcha_image"]/@src')[0]
# 获取这个图片的内容
response = requests.get(image_url) # 获取base64的str
# https://market.aliyun.com/products/57124001/cmapi028447.html?spm=5176.2020520132.101.5.2HEXEG#sku=yuncode2244700000
b64_str = base64.b64encode(response.content)
v_type = 'cn'
# post 提交打码平台的数据
form = {
'v_pic': b64_str,
'v_type': v_type,
} # authtication的header
headers = {
'Authorization': 'APPCODE eab23fa1d03f40d48b43c826c57bd284',
}
# 从打码平台获取验证码信息
dmpt_url = 'http://yzmplus.market.alicloudapi.com/fzyzm'
response = requests.post(dmpt_url, form, headers=headers)
print(response.text)
# captcha_value 就是我们的验证码信息
captcha_value = response.json()['v_code'] print(image_url)
print(captcha_value)
# captcha_value = input('请输入验证码') driver.find_element_by_id('captcha_field').send_keys(captcha_value)
time.sleep(1)
driver.find_element_by_class_name('btn-submit').click()
time.sleep(1)
# 获取所有的cookie的信息
cookies = driver.get_cookies()
cookie_list =[] # 对于每一个cookie_dict, 就是将name 和 value取出, 拼接成name=value;
for cookie_dict in cookies:
cookie_str = cookie_dict['name'] + '=' + cookie_dict['value']
cookie_list.append(cookie_str) # 拼接所有的cookie到header_cookie中
header_cookie = '; '.join(cookie_list) headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
'Cookie': header_cookie,
}
another_url = 'https://www.douban.com/accounts/'
response = requests.get(another_url, headers=headers) with open('cc.html', 'wb') as f:
f.write(response.content) # with open('douban.html', 'wb') as f:
# f.write(driver.page_source.encode('utf-8'))

15.scrapy模拟登陆案例的更多相关文章

  1. 第三百四十三节,Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别

    第三百四十三节,Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别 第一步.首先下载,大神者也的倒立文字验证码识别程序 下载地址:https://gith ...

  2. Scrapy模拟登陆

    1. 为什么需要模拟登陆? #获取cookie,能够爬取登陆后的页面 2. 回顾: requests是如何模拟登陆的? #1.直接携带cookies请求页面 #2.找接口发送post请求存储cooki ...

  3. Scrapy 模拟登陆知乎--抓取热点话题

    工具准备 在开始之前,请确保 scrpay 正确安装,手头有一款简洁而强大的浏览器, 若是你有使用 postman 那就更好了.           Python   1 scrapy genspid ...

  4. 爬虫入门之scrapy模拟登陆(十四)

    注意:模拟登陆时,必须保证settings.py里的COOKIES_ENABLED(Cookies中间件) 处于开启状态 COOKIES_ENABLED = True或# COOKIES_ENABLE ...

  5. python之scrapy模拟登陆人人网

    1.settings.py主要配置信息,包括USER_AGENT等 # -*- coding: utf-8 -*- # Scrapy settings for renren project # # F ...

  6. Scrapy模拟登陆豆瓣抓取数据

    scrapy  startproject douban 其中douban是我们的项目名称 2创建爬虫文件 进入到douban 然后创建爬虫文件 scrapy genspider dou douban. ...

  7. scrapy 模拟登陆

    import scrapy import urllib.request from scrapy.http import Request,FormRequest class LoginspdSpider ...

  8. 二十二 Python分布式爬虫打造搜索引擎Scrapy精讲—scrapy模拟登陆和知乎倒立文字验证码识别

    第一步.首先下载,大神者也的倒立文字验证码识别程序 下载地址:https://github.com/muchrooms/zheye 注意:此程序依赖以下模块包 Keras==2.0.1 Pillow= ...

  9. 识别图片验证码的三种方式(scrapy模拟登陆豆瓣网)

    1.通过肉眼识别,然后输入到input里面 from PIL import image Image request.urlretrieve(url,'image')  #下载验证码图片 image = ...

随机推荐

  1. css基本语法及页面引用

    css基本语法 css的定义方法是: 选择器 { 属性:值; 属性:值; 属性:值;} 选择器是将样式和页面元素关联起来的名称,属性是希望设置的样式属性每个属性有一个或多个值.代码示例: div{ w ...

  2. Spring Boot + Mybatis + Redis二级缓存开发指南

    Spring Boot + Mybatis + Redis二级缓存开发指南 背景 Spring-Boot因其提供了各种开箱即用的插件,使得它成为了当今最为主流的Java Web开发框架之一.Mybat ...

  3. poj1958 strange towers of hanoi

    说是递推,其实也算是个DP吧. 就是4塔的汉诺塔问题. 考虑三塔:先从a挪n-1个到b,把最大的挪到c,然后再把n-1个从b挪到c,所以是 f[i] = 2 * f[i-1] + 1; 那么4塔类似: ...

  4. malloc() 和 calloc()有啥区别

    calloc()在动态分配完内存后,自动初始化该内存空间为零(会将所分配的内存空间中的每一位都初始化为零). 而malloc()不初始化,里边数据是随机的垃圾数据. calloc(size_t n, ...

  5. 扩展方法、委托和Lambda

    举例演化Lambda string[] names ={"Burke", "Connor", "Frank", "Everett& ...

  6. Mysql DML DCL DDL

    在介绍这些SQL语言之前,先罗列一下mysql的常用数据类型和数据类型修饰,供查询参考 后面的带数字表示此类型的字段长度 数值型: TINYINT 1 ,SMALLINT 2,MEDIUMINT 3 ...

  7. react-native中的TextInput

    TextInput是一个允许用户输入文本的基础组件.它有一个名为onChangeText的属性,此属性接受一个函数, 而此函数会在文本变化时被调用.另外还有一个名为onSubmitEditing的属性 ...

  8. 四种不同的SNP calling算法call低碱基覆盖度测序数据时,SNVs数量的比较(Comparing a few SNP calling algorithms using low-coverage sequencing data)

    摘要:如果不设置任何过滤标准的话,SOAPsnp会call出更多的SNVs:AtlasSNP2算法比较严格,因此call出来的SNVs数量是最少的,GATK 和 SAMtools call出来的数量位 ...

  9. WebService 及 CXF 的进阶讲解

    4.2. WebService请求深入分析 1). 分析WebService的WSDL文档结构 1.1). 实例截图 <definitions> <types> <sch ...

  10. eclipse+pyDev

    感觉python脚本语言在linux下挺有用的,想入门学习一下 新手入门个人习惯找个好点的IDE帮助完成工作,试了好多,如pycharm,sublime text自己打造 后来发现全扯淡,一点不符合自 ...