Web Crawler Basics and Using the Scrapy Toolkit
Scrapy web crawler



Figure 2-1: Scrapy architecture and data-flow diagram (image not reproduced here).
1. Create a ScrapyTest folder under D:\Workspace, i.e. D:\Workspace\ScrapyTest
2. cd D:\Workspace\ScrapyTest
3. Check the Scrapy version: scrapy version
4. Create the project with scrapy startproject <project_name>:
scrapy startproject weibo
5. This creates a project directory named weibo; next, enter the newly created project: cd weibo
================ weibo project directory structure ===============
Run from the command line: D:\Workspace\ScrapyTest\weibo>tree /F
卷 DATA1 的文件夹 PATH 列表
卷序列号为 3A2E-EB05
D:.
│  scrapy.cfg
│
└─weibo
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__
==================== Project directory structure =======================
scrapy.cfg (project configuration file):
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html
[settings]
default = weibo.settings
[deploy]
#url = http://localhost:6800/
project = weibo
items.py (item definitions for the project):
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class WeiboItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
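The generated WeiboItem is empty; fields are declared with scrapy.Field(). As a minimal sketch (the field names below are hypothetical illustrations, not part of the generated template), a Weibo item might look like this:

import scrapy

class WeiboItem(scrapy.Item):
    # Hypothetical fields for illustration only; name them after the
    # data your spider actually extracts.
    user_id = scrapy.Field()     # id of the poster
    content = scrapy.Field()     # post text
    created_at = scrapy.Field()  # publish time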
middlewares.py (spider and downloader middleware for the project):
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
class WeiboSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class WeiboDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
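The template methods above are all no-ops. As an illustrative sketch (the class and constant below are hypothetical, not part of the generated file), a downloader middleware that stamps a fixed, browser-like User-Agent onto every outgoing request could look like this; it only takes effect once listed in DOWNLOADER_MIDDLEWARES in settings.py:

class FixedUserAgentMiddleware(object):
    # Hypothetical example: set a browser-like User-Agent on every request.
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.USER_AGENT
        return None  # returning None lets Scrapy continue handling the request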
pipelines.py (item pipelines for the project):
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class WeiboPipeline(object):
    def process_item(self, item, spider):
        return item
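The generated process_item() simply passes items through. A minimal sketch of a pipeline that appends each item to a JSON Lines file (the class and filename are illustrative assumptions; it must also be enabled via ITEM_PIPELINES to run):

import json

class JsonLinesPipeline(object):
    # Hypothetical example: write one JSON object per line to items.jl.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item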
settings.py (project settings):
# -*- coding: utf-8 -*-
# Scrapy settings for weibo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'weibo'
SPIDER_MODULES = ['weibo.spiders']
NEWSPIDER_MODULE = 'weibo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weibo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'weibo.middlewares.WeiboSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'weibo.middlewares.WeiboDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'weibo.pipelines.WeiboPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
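Most of the generated settings are commented out. The sketch below shows values that are commonly uncommented or adjusted for a small, polite crawl; the exact numbers are illustrative assumptions, not recommendations from the template:

# Illustrative overrides; tune them to the target site's policies.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
ROBOTSTXT_OBEY = True                # keep obeying robots.txt
DOWNLOAD_DELAY = 1                   # wait roughly 1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # limit parallel requests per domain
ITEM_PIPELINES = {
    'weibo.pipelines.WeiboPipeline': 300,  # enable the project pipeline
}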
6. cd weibo
=================================================================
Example
================================================================
1. (base) D:\Workspace\ScrapyTest\example>scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
This drops you into an interactive Python shell:
………………
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000002367A9A7B38>
[s] item {}
[s] request <GET http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html>
[s] response <200 http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html>
[s] settings <scrapy.settings.Settings object at 0x000002367A9A7A58>
[s] spider <DefaultSpider 'default' at 0x2367af35198>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: view(response)
Out[1]: True
In [2]: sel = response.css('div.product_main')
In [3]: sel.xpath('./h1/text()').extract_first()
Out[3]: 'A Light in the Attic'
In [4]: quit()
In many cases the page opened by the view() function looks the same as the page you get by typing the URL directly into a browser, but keep in mind that the former is the page as downloaded by the Scrapy crawler, while the latter is downloaded by the browser; sometimes the two differ. When analyzing a page, the view() function is therefore the more reliable reference.
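In the same shell session, the selector obtained above can be reused to pull other fields from the product page. A small sketch follows; the CSS class name price_color is an assumption based on the markup that books.toscrape.com uses:

# Inside the scrapy shell, before quit():
sel = response.css('div.product_main')
sel.xpath('./h1/text()').extract_first()        # 'A Light in the Attic'
sel.css('p.price_color::text').extract_first()  # e.g. the listed price string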
2. (base) D:\Workspace\ScrapyTest>scrapy startproject scrape_book
New Scrapy project 'scrape_book', using template directory 's:\\users\\jiangshan\\anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
D:\Workspace\ScrapyTest\scrape_book
You can start your first spider with:
cd scrape_book
scrapy genspider example example.com
3. (base) D:\Workspace\ScrapyTest>cd scrape_book
4. (base) D:\Workspace\ScrapyTest\scrape_book>tree /F
卷 DATA1 的文件夹 PATH 列表
卷序列号为 3A2E-EB05
D:.
│  scrapy.cfg
│
└─scrape_book
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__
5. (base) D:\Workspace\ScrapyTest\scrape_book>scrapy genspider books books.toscrape.com
Created spider 'books' using template 'basic' in module:
scrape_book.spiders.books
6. There is no need to create the Spider file and Spider class by hand; the scrapy genspider <SPIDER_NAME> <DOMAIN> command generates them from a template. Its two arguments are the Spider's name and the domain (site) to be crawled.
(base) D:\Workspace\ScrapyTest\scrape_book>tree /F
卷 DATA1 的文件夹 PATH 列表
卷序列号为 3A2E-EB05
D:.
│  scrapy.cfg
│
└─scrape_book
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  books.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-37.pyc
    │
    └─__pycache__
            settings.cpython-37.pyc
            __init__.cpython-37.pyc
After it runs, the scrapy genspider command has created the file scrape_book/spiders/books.py, containing a BooksSpider class:
# -*- coding: utf-8 -*-
import scrapy
class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
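The generated parse() is empty. As a sketch of how it could be filled in for books.toscrape.com (the CSS selectors below reflect that site's markup and are assumptions, not part of the generated code), the spider might yield one record per book on a listing page and then follow the pagination link:

# -*- coding: utf-8 -*-
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # One <article class="product_pod"> per book on the listing page.
        for book in response.css('article.product_pod'):
            yield {
                'name': book.xpath('./h3/a/@title').extract_first(),
                'price': book.css('p.price_color::text').extract_first(),
            }
        # Follow the "next" pagination link, if present.
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

A run such as scrapy crawl books -o books.csv would then write the scraped records to a CSV file.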