Scrapy Crawler Notes - Crawling Zhihu


A cookie is a local storage mechanism: the cookie data is kept on the client side (in the browser).
A session keeps the user state on the server: the user information (username, password, etc.) is encrypted into a string, the resulting session id is returned to the browser, and from then on the browser sends that session id with every request.
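As a quick illustration of the mechanism (not from the original notes; it uses the public httpbin.org test service instead of a real login), a requests session shows the cookie being stored client-side and sent back automatically:

import requests

s = requests.session()
# the server sets a cookie; the client (browser / requests) stores it locally
s.get("https://httpbin.org/cookies/set?sessionid=abc123")
print(s.cookies.get_dict())          # {'sessionid': 'abc123'}

# every later request in the same session sends the cookie back automatically,
# which is how the server recognises the "logged-in" client
r = s.get("https://httpbin.org/cookies")
print(r.json())                      # {'cookies': {'sessionid': 'abc123'}}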

Status codes are usually set by the server-side application; the web framework can also set them, or you can define your own.
Under F12 > Network you can see the status code of every request.
301 is a permanent redirect, e.g. the site changed its domain name but still wants requests to the old domain to work.
302 is a temporary redirect, e.g. clicking "my profile" while not logged in redirects you to the login page.
404 usually means the URL is invalid; the server could also return an empty page with 200, but that is worse, because a 404 can simply be filtered out.
500 usually means some function on the server raised an uncaught exception; the web framework normally handles this.
On 503 the crawler should generally be paused; in general only 200 responses are crawled, and 404 pages are skipped as well.
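A minimal sketch of how this status-code policy could be expressed in a Scrapy project's settings.py, assuming the default HttpErrorMiddleware and RetryMiddleware (the concrete values are illustrative, not taken from the original project):

# settings.py (sketch)
# By default only 2xx responses reach spider callbacks; error pages such as
# 404 are dropped by HttpErrorMiddleware unless listed here.
HTTPERROR_ALLOWED_CODES = []

# Retry transient server errors such as 503 instead of treating them as results.
RETRY_ENABLED = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
RETRY_TIMES = 2

# Slow the crawl down so a struggling server is less likely to answer with 503.
DOWNLOAD_DELAY = 3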
To crawl Zhihu you have to log in first.
You can submit a wrong username/password and then find the post URL and post parameters in the Network panel.
requests.get() sends a default User-Agent that identifies the client as a Python program, and a direct request to Zhihu with it comes back as a 500.
So the request needs an explicit header:
agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
"HOST":"www.zhihu.com",
"Referer": "https://www.zhizhu.com",
'User-Agent': agent
} response = session.get("https://www.zhihu.com", headers=header)
When simulating a login, don't call requests directly; use a requests session instead. A session represents one connection, so you don't have to establish a new connection for every requests.get, which is more efficient.
session = requests.session()
response = session.get("https://www.zhihu.com", headers=header)
All subsequent requests calls can then go through the session.
session.cookies has no save() method; assign session.cookies = cookielib.LWPCookieJar() so that the instantiated cookie jar does have save().
Zhihu simulated login code (without Scrapy):
# -*- coding: utf-8 -*-
__author__ = 'bobby'

import requests
try:
    import cookielib
except:
    import http.cookiejar as cookielib
import re

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except:
    print("Failed to load cookies")

agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhizhu.com",
    'User-Agent': agent
}

def is_login():
    # Judge login state from the status code returned by a page that requires login
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        return False
    else:
        return True

def get_xsrf():
    # Get the _xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return (match_obj.group(1))
    else:
        return ""

def get_index():
    response = session.get("https://www.zhihu.com", headers=header)
    with open("index_page.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print("ok")

def get_captcha():
    import time
    t = str(int(time.time() * 1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    t = session.get(captcha_url, headers=header)
    with open("captcha.jpg", "wb") as f:
        f.write(t.content)
        f.close()

    from PIL import Image
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        pass

    captcha = input("Enter captcha\n>")
    return captcha

def zhihu_login(account, password):
    # Zhihu login
    if re.match("^1\d{10}", account):
        print("Logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha": get_captcha()
        }
    else:
        if "@" in account:
            # Treat an account containing "@" as an email login
            print("Logging in with an email address")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "email": account,
                "password": password
            }

    response_text = session.post(post_url, data=post_data, headers=header)
    session.cookies.save()

zhihu_login("", "admin123")
# get_index()
is_login()
# get_captcha()
zhihu_login_requests.py
The entry point of a Scrapy spider is start_requests, so to log in before crawling we need to override start_requests.
Scrapy is based on asynchronous I/O: the request returned here is only scheduled, and the callback runs later with the downloaded response.
def start_requests(self):
    return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, callback=self.login)]
callback is passed without parentheses because we are handing over the function object itself, which Scrapy will call later; adding parentheses would call it right now and pass its return value instead, so just pass the function name.
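A tiny illustration of the difference (hypothetical function, not part of the spider):

def greet():
    return "hi"

f = greet      # passes the function object; the caller can invoke it later as f()
v = greet()    # calls it immediately; v is now the string "hi", not the function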
Scrapy uses robots.txt to decide which pages should be filtered out.
Set ROBOTSTXT_OBEY = False in settings.py to ignore the robots protocol.
By default '.' in a regular expression does not match newlines, so a pattern effectively matches within a single line only.
Passing re.DOTALL lets '.' match across the whole text.
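A small illustration (the html string below is made up for demonstration):

import re

html = '<form>\n<input type="hidden" name="_xsrf" value="abc123"/>\n</form>'

# Without re.DOTALL, '.' does not match '\n', so the match fails across lines
print(re.match('.*name="_xsrf" value="(.*?)"', html))                          # None
# With re.DOTALL, '.' also matches newlines, so the token is captured
print(re.match('.*name="_xsrf" value="(.*?)"', html, re.DOTALL).group(1))      # abc123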
How to set a User-Agent for the Scrapy shell:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/60.0" https://blog.csdn.net/weixin_42471384/article/details/81556531
# -*- coding: utf-8 -*-
import re
import json
import datetime

try:
    import urlparse as parse
except:
    from urllib import parse

import scrapy
from scrapy.loader import ItemLoader
from items import ZhihuQuestionItem, ZhihuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/']

    # Request url for the first page of answers of a question
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{0}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccollapsed_counts%2Creviewing_comments_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.is_blocking%2Cis_blocked%2Cis_followed%2Cvoteup_count%2Cmessage_thread_token%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit={1}&offset={2}"

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhizhu.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    custom_settings = {
        "COOKIES_ENABLED": True
    }

    def parse(self, response):
        """
        Extract all urls from the html page and follow them for further crawling.
        If an extracted url has the form /question/xxx, download it and hand it
        straight to the question parsing function.
        """
        all_urls = response.css("a::attr(href)").extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        all_urls = filter(lambda x: True if x.startswith("https") else False, all_urls)
        for url in all_urls:
            match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                # A question page: download it and hand it to the extraction function
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
            else:
                # Not a question page: keep following it
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse_question(self, response):
        # Process the question page and extract the concrete question item from it
        if "QuestionHeader-title" in response.text:
            # New page layout
            match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", response.url)
            if match_obj:
                question_id = int(match_obj.group(2))

            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            item_loader.add_css("title", "h1.QuestionHeader-title::text")
            item_loader.add_css("content", ".QuestionHeader-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", ".List-headerText span::text")
            item_loader.add_css("comments_num", ".QuestionHeader-actions button::text")
            item_loader.add_css("watch_user_num", ".NumberBoard-value::text")
            item_loader.add_css("topics", ".QuestionHeader-topics .Popover div::text")

            question_item = item_loader.load_item()
        else:
            # Item extraction for the old page layout
            match_obj = re.match("(.*zhihu.com/question/(\d+))(/|$).*", response.url)
            if match_obj:
                question_id = int(match_obj.group(2))

            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            # item_loader.add_css("title", ".zh-question-title h2 a::text")
            item_loader.add_xpath("title", "//*[@id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
            item_loader.add_css("content", "#zh-question-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", "#zh-question-answer-num::text")
            item_loader.add_css("comments_num", "#zh-question-meta-wrap a[name='addcomment']::text")
            # item_loader.add_css("watch_user_num", "#zh-question-side-header-wrap::text")
            item_loader.add_xpath("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|//*[@class='zh-question-followers-sidebar']/div/a/strong/text()")
            item_loader.add_css("topics", ".zm-tag-editor-labels a::text")

            question_item = item_loader.load_item()

        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
        yield question_item

    def parse_answer(self, reponse):
        # Process the answers of a question
        ans_json = json.loads(reponse.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]

        # Extract the concrete fields of each answer
        for answer in ans_json["data"]:
            answer_item = ZhihuAnswerItem()
            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["parise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()

            yield answer_item

        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

    def start_requests(self):
        return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, callback=self.login)]

    def login(self, response):
        response_text = response.text
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = (match_obj.group(1))

        if xsrf:
            post_url = "https://www.zhihu.com/login/phone_num"
            post_data = {
                "_xsrf": xsrf,
                "phone_num": "",
                "password": "",
                "captcha": ""
            }

            import time
            t = str(int(time.time() * 1000))
            captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
            yield scrapy.Request(captcha_url, headers=self.headers, meta={"post_data": post_data}, callback=self.login_after_captcha)

    def login_after_captcha(self, response):
        with open("captcha.jpg", "wb") as f:
            f.write(response.body)
            f.close()

        from PIL import Image
        try:
            im = Image.open('captcha.jpg')
            im.show()
            im.close()
        except:
            pass

        captcha = input("Enter captcha\n>")

        post_data = response.meta.get("post_data", {})
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data["captcha"] = captcha
        return [scrapy.FormRequest(
            url=post_url,
            formdata=post_data,
            headers=self.headers,
            callback=self.check_login
        )]

    def check_login(self, response):
        # Check the data returned by the server to see whether login succeeded
        text_json = json.loads(response.text)
        if "msg" in text_json and text_json["msg"] == "登录成功":  # "登录成功" = login succeeded
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)
zhihu.py
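The spider imports ZhihuQuestionItem and ZhihuAnswerItem from items.py, which is not shown in these notes. A minimal sketch of what those item classes could look like, using only the field names that actually appear in the spider above (the real items.py, with any input processors or SQL helpers, is not part of these notes):

import scrapy

class ZhihuQuestionItem(scrapy.Item):
    # Fields populated by the ItemLoader in parse_question
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field()
    comments_num = scrapy.Field()
    watch_user_num = scrapy.Field()

class ZhihuAnswerItem(scrapy.Item):
    # Fields assigned directly in parse_answer
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    parise_num = scrapy.Field()
    comments_num = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()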