Simulating Zhihu login with Scrapy (no CAPTCHA mechanism)


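Create zhihu.py under the spiders folder: from a DOS (command prompt) window, activate the virtual environment, change into the project directory, and run the command scrapy genspider zhihu www.zhihu.com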

# zhihu.py
import datetime
import json
import re

import scrapy
from scrapy.loader import ItemLoader

# Item classes defined in this project; adjust the module path to your own layout
from Item import ZhihuQuestionItem, ZhihuAnswerItem

try:
    import urlparse as parse  # Python 2
except ImportError:
    from urllib import parse  # Python 3


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ["www.zhihu.com"]
    start_urls = ["http://www.zhihu.com/"]

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36"
    }

    # URL template for a page of a question's answers:
    # {0} = question id, {1} = page size (limit), {2} = offset
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccollapsed_counts%2Creviewing_comments_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.is_blocking%2Cis_blocked%2Cis_followed%2Cvoteup_count%2Cmessage_thread_token%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit={1}&offset={2}"

    def parse(self, response):
        """
        Extract every URL from the HTML page and follow it for further crawling.
        A URL of the form /question/xxx is downloaded and handed straight to the
        question parsing function.
        """
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        all_urls = filter(lambda x: x.startswith("https"), all_urls)
        for url in all_urls:
            # group(1) is the question URL, group(2) the numeric question id
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                # A question page: hand it to parse_question
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
            else:
                # Not a question page: keep following links
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse_question(self, response):
        # Process a question page and extract the question item from it
        match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
        if not match_obj:
            return  # not a question URL, so there is no question id to work with
        question_id = int(match_obj.group(2))

        item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
        if "QuestionHeader-title" in response.text:  # new page layout
            item_loader.add_css("title", "h1.QuestionHeader-title::text")
            item_loader.add_css("content", ".QuestionHeader-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", ".List-headerText span::text")
            item_loader.add_css("comment_num", ".QuestionHeader-actions button::text")
            item_loader.add_css("watch_user_num", ".NumberBoard-value::text")
            item_loader.add_css("topic", ".QuestionHeader-topics .Popover div::text")
        else:  # old page layout
            item_loader.add_xpath("title", "//*[@id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
            item_loader.add_css("content", "#zh-question-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", "#zh-question-answer-num::text")
            item_loader.add_css("comment_num", "#zh-question-meta-wrap a[name='addcomment']::text")
            item_loader.add_xpath("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|//*[@class='zh-question-followers-sidebar']/div/a/strong/text()")
            item_loader.add_css("topic", ".zm-tag-editor-labels a::text")
        question_item = item_loader.load_item()

        # Request the first page of answers (limit 20, offset 0) for this question
        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
        yield question_item

    def parse_answer(self, response):
        # Process the answers of a question
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]

        # Extract the individual fields of each answer
        for answer in ans_json["data"]:
            answer_item = ZhihuAnswerItem()
            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["praise_num"] = answer["voteup_count"]
            answer_item["comment_num"] = answer["comment_count"]
            answer_item["creat_time"] = answer["created_time"]
            # The API field requested in start_answer_url is updated_time
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()
            yield answer_item

        # Request the next page once per response, until the API reports the last page
        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

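    # Override start_requests so that the crawl begins at the sign-in page
    def start_requests(self):
        # Always pass a callback to scrapy.Request explicitly; otherwise Scrapy
        # calls parse(self, response) by default
        return [scrapy.Request("https://www.zhihu.com/#signin", headers=self.headers, callback=self.login)]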

    def login(self, response):
        response_text = response.text
        # The pattern contains double quotes, so it is wrapped in single quotes
        match_obj = re.match(r'.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
        xsrf = ""
        if match_obj:
            xsrf = match_obj.group(1)

        if xsrf:
            post_url = "https://www.zhihu.com/login/phone_num"
            post_data = {
                "_xsrf": xsrf,
                "phone_num": "your_phone_number",  # replace with your own account
                "password": "your_password"  # replace with your own password
            }
            return [scrapy.FormRequest(
                url=post_url,
                formdata=post_data,
                headers=self.headers,
                callback=self.check_login
            )]

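    def check_login(self, response):
        # Check whether the server reports a successful login
        text_json = json.loads(response.text)
        # "登陆成功" means "login successful"; it is compared against the server's message text
        if "msg" in text_json and text_json["msg"] == "登陆成功":
            for url in self.start_urls:
                # dont_filter=True keeps the duplicate filter from dropping these requests
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)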
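
The question-URL pattern that parse and parse_question rely on can be checked on its own, without running the spider. What follows is a minimal sketch: the helper name split_question_url and the sample URLs are made up for illustration, and the pattern uses \d+ so that multi-digit question ids match (a single \d would only accept one-digit ids).

```python
import re

# group(1) is the canonical question URL, group(2) the numeric question id
QUESTION_URL = re.compile(r"(.*zhihu.com/question/(\d+))(/|$).*")


def split_question_url(url):
    """Return (question_url, question_id), or None if url is not a question page."""
    match_obj = QUESTION_URL.match(url)
    if not match_obj:
        return None
    return match_obj.group(1), int(match_obj.group(2))


# Hypothetical URLs, for illustration only
print(split_question_url("https://www.zhihu.com/question/12345678/answer/999"))
print(split_question_url("https://www.zhihu.com/people/someone"))
```

The first call yields the trimmed question URL together with its id, which is what parse passes on to parse_question; the second returns None, which corresponds to the branch that simply keeps following links.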

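
parse_answer walks the answers API page by page through the paging object in the JSON response. The sketch below runs the same logic on a hand-written payload. The payload is an assumption shaped after the fields the spider reads (paging.is_end, paging.next, data[].id and so on), not a captured Zhihu response, and parse_answer_page is a hypothetical helper.

```python
import json


def parse_answer_page(text):
    """Return (answers, next_url); next_url is None on the last page."""
    ans_json = json.loads(text)
    paging = ans_json["paging"]
    answers = []
    for answer in ans_json["data"]:
        answers.append({
            "zhihu_id": answer["id"],
            # Some answers carry no author id or no content, so fall back to None
            "author_id": answer["author"].get("id"),
            "content": answer.get("content"),
            "praise_num": answer["voteup_count"],
        })
    next_url = None if paging["is_end"] else paging["next"]
    return answers, next_url


# Hand-written sample payload (assumed shape, illustration only)
sample_page = json.dumps({
    "paging": {
        "is_end": False,
        "next": "https://www.zhihu.com/api/v4/questions/1/answers?limit=20&offset=20",
    },
    "data": [
        {"id": 101, "author": {"id": "u1"}, "content": "<p>hi</p>", "voteup_count": 3},
        {"id": 102, "author": {}, "voteup_count": 0},
    ],
})

answers, next_url = parse_answer_page(sample_page)
print(len(answers), next_url)
```

In the spider, a non-empty next_url would be turned into one further scrapy.Request per response, with parse_answer as its callback.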

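
The login step depends on pulling the hidden _xsrf token out of the sign-in page before posting the form. A minimal sketch of that extraction on its own follows; the HTML fragment and the helper name extract_xsrf are made up for illustration, and the real page markup may differ.

```python
import re


def extract_xsrf(html):
    """Return the value of the hidden _xsrf input, or "" if it is absent."""
    # re.DOTALL lets ".*" run across newlines, since re.match anchors at the start of the page
    match_obj = re.match(r'.*name="_xsrf" value="(.*?)"', html, re.DOTALL)
    return match_obj.group(1) if match_obj else ""


# Hypothetical page fragment, for illustration only
sample_html = (
    "<html>\n"
    "<form>\n"
    '<input type="hidden" name="_xsrf" value="abc123"/>\n'
    "</form>\n"
    "</html>"
)

print(extract_xsrf(sample_html))
print(repr(extract_xsrf("<html></html>")))
```

An empty result is the case in which login sends no form request at all, so a crawl that stops right after the sign-in page is worth checking against this step first.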