Crawling the cnblogs article list with Scrapy

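Goal

Crawl the articles listed on the https://www.cnblogs.com home page. The scraped fields are: title, recommendation count, link, content preview, author, link to the author's blog, comment count, and view count.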

Installing Scrapy

pip install scrapy

Python version 3.7, Scrapy version 1.6.0

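Creating the spider

# Create the project

scrapy startproject CnblogsSpider

# Create the spider

cd CnblogsSpider

scrapy genspider -t crawl cnblogs cnblogs.com

The spider is named cnblogs, its allowed domain is cnblogs.com, and it is generated from the crawl template.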

Writing `items.py`

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CnblogsspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Article title
    title = scrapy.Field()
    # Recommendation count
    diggnum = scrapy.Field()
    # Article link
    link = scrapy.Field()
    # Article content preview
    post_item_summary = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Link to the author's blog
    author_link = scrapy.Field()
    # Article comment count
    article_comment = scrapy.Field()
    # Article view count
    article_view = scrapy.Field()

These are the item fields to scrape.

Writing `spiders/cnblogs.py`

# -*- coding: utf-8 -*-
import scrapy
# CrawlSpider and Rule drive the rule-based crawl
from scrapy.spiders import CrawlSpider, Rule
# LinkExtractor extracts the links that match a given pattern
from scrapy.linkextractors import LinkExtractor
from CnblogsSpider.items import CnblogsspiderItem


class CnblogsSpider(CrawlSpider):
    # Spider name
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/sitehome/p/1']

    # Link extraction rule applied to each Response; it returns the links that match the pattern.
    # The raw string keeps \d from being read as a string escape.
    pagelink = LinkExtractor(allow=r"/sitehome/p/\d+")

    rules = [
        # Request each extracted link in turn, keep following links from those pages,
        # and hand every response to the given callback
        Rule(pagelink, callback="parse_item", follow=True)
    ]

    # CrawlSpider's rules extract URLs directly from the response text and create new requests automatically.
    # Unlike Spider, CrawlSpider already overrides the parse function.
    # `scrapy crawl spidername` starts the run: Requests are built from start_urls and sent,
    # then parse handles each response, using the rules to extract matching links from the HTML (or XML) text.
    # Each link produces a new Request, and this repeats until no response contains another matching link
    # and the scheduler has no Request objects left; only then does the program stop.
    # If the start URL needs different handling, override parse_start_url(self, response)
    # to process the Response of the first URL; this is optional.

    def parse_item(self, response):
        for each in response.xpath("//div[@class='post_item']"):
            item = CnblogsspiderItem()
            item['diggnum'] = each.xpath("./div[1]/div[1]/span[1]/text()").extract()[0].strip()
            item['title'] = each.xpath("./div[2]/h3/a/text()").extract()[0].strip()
            item['link'] = each.xpath("./div[2]/h3/a/@href").extract()[0].strip()
            item['post_item_summary'] = ''.join(each.xpath("./div[2]/p[@class='post_item_summary']/text()").extract()).strip()
            item['author'] = each.xpath("./div[2]/div/a/text()").extract()[0].strip()
            item['author_link'] = each.xpath("./div[2]/div/a/@href").extract()[0].strip()
            item['article_comment'] = each.xpath("./div[2]/div/span[1]/a/text()").extract()[0].strip()
            item['article_view'] = each.xpath("./div[2]/div/span[2]/a/text()").extract()[0].strip()
            yield item

This is the spider's main logic.
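To see which URLs the `allow` pattern of `pagelink` selects, the pattern can be tried on its own with Python's `re` module. This is only a sketch: the helper name `is_list_page` and the sample URLs are chosen for illustration, and it assumes the link extractor applies the pattern as an unanchored regular-expression search against each absolute URL.

```python
import re

# Same pattern as the `allow` argument of the LinkExtractor in the spider
LIST_PAGE = re.compile(r"/sitehome/p/\d+")


def is_list_page(url):
    """Return True if the URL looks like a paginated home-page listing."""
    return LIST_PAGE.search(url) is not None


if __name__ == "__main__":
    samples = [
        "https://www.cnblogs.com/sitehome/p/2",
        "https://www.cnblogs.com/sitehome/p/200",
        "https://www.cnblogs.com/conmajia/p/javascript-in-embedded-devices.html",
    ]
    for url in samples:
        # Listing pages are followed by the rule; article pages are not
        print(url, is_list_page(url))
```

Only the pagination links are followed, which is why `parse_item` reads everything it needs from the listing page itself rather than from the individual article pages.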

Writing `pipelines.py`

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class CnblogsspiderPipeline(object):
    def __init__(self):
        # Output file; one JSON object is written per line
        self.filename = open("cnblogs.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        try:
            # ensure_ascii=False keeps Chinese text readable in the output
            text = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(text)
        except Exception as e:
            # Report the failure and keep crawling; KeyboardInterrupt and SystemExit still propagate
            print(e)
        return item

    def close_spider(self, spider):
        self.filename.close()

The pipeline processes the items scraped from each page.
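As a variant, the output file can be opened in the `open_spider` hook instead of `__init__`, so that opening and closing are paired within the spider's run. The following is a sketch under assumptions, not the pipeline used in this post: the class name `JsonLinesPipeline` and the `path` parameter are hypothetical, and the class is plain Python so it can be tried without starting a crawl.

```python
import json


class JsonLinesPipeline:
    """Writes each item as one JSON object per line (hypothetical variant)."""

    def __init__(self, path="cnblogs.json"):
        # `path` is an illustrative parameter; Scrapy would use the default
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()
```

To try it in the project, the class would have to be registered under `ITEM_PIPELINES` in place of `CnblogsspiderPipeline`.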

Writing `settings.py`

# -*- coding: utf-8 -*-

# Scrapy settings for CnblogsSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'CnblogsSpider'

SPIDER_MODULES = ['CnblogsSpider.spiders']
NEWSPIDER_MODULE = 'CnblogsSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'CnblogsSpider (+http://www.yourdomain.com)'
# Custom user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/73.0.3683.86 Safari/537.36'

# Obey robots.txt rules
# When enabled, Scrapy follows the site's robots.txt policy
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
# Set the default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip,deflate,br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'upgrade-insecure-requests': '1',
    'host': 'www.cnblogs.com'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'CnblogsSpider.middlewares.CnblogsspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'CnblogsSpider.middlewares.CnblogsspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'CnblogsSpider.pipelines.CnblogsspiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# When enabled, responses are read from the local cache first, which speeds up crawling; enable it as needed
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Set the log level
# LOG_LEVEL = 'DEBUG'
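The settings above turn off robots.txt handling and raise the concurrency to 32. If a gentler crawl is preferred, a more conservative set of values might look like the sketch below. The numbers are illustrative assumptions rather than values from this project; the setting names are the ones that already appear, mostly commented out, in the generated `settings.py`.

```python
# Hypothetical, more conservative overrides for settings.py (values are assumptions)
ROBOTSTXT_OBEY = True                # follow the site's robots.txt policy
CONCURRENT_REQUESTS = 8              # fewer requests in flight overall
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap the concurrency per domain
DOWNLOAD_DELAY = 1                   # seconds between requests to the same website
AUTOTHROTTLE_ENABLED = True          # let AutoThrottle adjust the delay
AUTOTHROTTLE_START_DELAY = 5         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60          # maximum download delay in seconds
```

With `ROBOTSTXT_OBEY` enabled, the crawl only reaches pages that the site's robots.txt permits, so the number of scraped items may differ from a run with the settings shown earlier.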

Running the spider

cd CnblogsSpider

scrapy crawl cnblogs

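CnblogsSpider is the project folder and cnblogs is the spider name.

The result file cnblogs.json is written to the root of the project folder. Its contents look like this: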

{"diggnum": "2", "title": "在嵌入式设备中使用 JavaScript 的前景", "link": "https://www.cnblogs.com/conmajia/p/javascript-in-embedded-devices.html", "post_item_summary": "这几年嵌入式/可穿戴设备处理器从 51、AVR 单片机瞬间起飞到 ARM 4 核、8 核，主频 3GHz 起步。硬件能力的提升让硬件集成 JavaScript 引擎成为现实。我大概了解了一下这方面的进展，觉得前景非常广阔。 ...", "author": "Conmajia", "author_link": "https://www.cnblogs.com/conmajia/", "article_comment": "评论(0)", "article_view": "阅读(283)"}

{"diggnum": "1", "title": "Python学习笔记", "link": "https://www.cnblogs.com/xjtu-blacksmith/p/10347247.html", "post_item_summary": "[TOC] 学了N年 的我，现在得为着学业需要而学一下 ，这简明轻快的语言风格真让我有些不习惯。最主要的是，各种各样的命令我总是忘记。以此笔记摘录一下主要内容，以免遗忘。 入门的读本是：《 \"Python语言及其应用\" 》，Bill Lubanovic著，丁嘉瑞、梁杰、禹常隆译。 笔记中若有提到“其 ...", "author": "黑山雁", "author_link": "https://www.cnblogs.com/xjtu-blacksmith/", "article_comment": "评论(0)", "article_view": "阅读(176)"}

{"diggnum": "0", "title": "补习系列(17)-springboot mongodb 内嵌数据库", "link": "https://www.cnblogs.com/littleatp/p/10462395.html", "post_item_summary": "[TOC] 简介 前面的文章中，我们介绍了如何在SpringBoot 中使用MongoDB的一些常用技巧。 那么，与使用其他数据库如 MySQL 一样，我们应该怎么来做MongoDB的单元测试呢？ 使用内嵌数据库的好处是不需要依赖于一个外部环境，如果每一次跑单元测试都需要依赖一个稳定的外部环境，那么 ...", "author": "美码师", "author_link": "https://www.cnblogs.com/littleatp/", "article_comment": "评论(0)", "article_view": "阅读(84)"}

......

......

......

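The cnblogs home page usually exposes 200 pages of 20 entries each, so an error-free crawl should produce 4,000 entries.
The spider is not guaranteed to stay current: any change to the source site's pages may break it.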
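The comment and view counts are stored as display strings such as 评论(0) and 阅读(283). If numeric values are needed, they can be extracted when the file is read back. A minimal sketch, assuming every line of cnblogs.json holds one JSON object as written by the pipeline; the function names `count_of` and `load_items` are made up for this example.

```python
import json
import re

_NUMBER = re.compile(r"\d+")


def count_of(text):
    """Return the first integer found in a string like '阅读(283)', or 0 if there is none."""
    match = _NUMBER.search(text)
    return int(match.group()) if match else 0


def load_items(path="cnblogs.json"):
    """Read the JSON-lines file and convert the count fields to integers."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            item = json.loads(line)
            for key in ("diggnum", "article_comment", "article_view"):
                item[key] = count_of(item.get(key, ""))
            items.append(item)
    return items


if __name__ == "__main__":
    items = load_items()
    # 200 pages of 20 entries each would give 4000 items for a complete crawl
    print(len(items))
```

Comparing `len(items)` with the expected 4,000 is a quick way to check whether a crawl completed without errors.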
