Installing and Using ScrapydWeb
1. Install: pip install scrapydweb
2. Start: scrapydweb
On the first run, a configuration file is generated in the current directory: scrapydweb_settings_v8.py
Configure the username and password:
# The default is False, set it to True to enable basic auth for web UI.
ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = 'admin'
PASSWORD = 'admin'
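Once ENABLE_AUTH is set to True and ScrapydWeb is restarted, every page of the web UI requires HTTP basic auth. Below is a minimal sketch for checking the login from Python; it assumes ScrapydWeb is running on the default 127.0.0.1:5000 with the admin/admin account above.

import requests

# With credentials the UI should respond normally; without them a 401 is expected.
with_auth = requests.get('http://127.0.0.1:5000/', auth=('admin', 'admin'), timeout=10)
print(with_auth.status_code)      # expect 200

without_auth = requests.get('http://127.0.0.1:5000/', timeout=10)
print(without_auth.status_code)   # expect 401 once ENABLE_AUTH = True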
The full configuration file:
# coding: utf8
"""
How ScrapydWeb works:
BROWSER_HOST <<<>>> SCRAPYDWEB_BIND:SCRAPYDWEB_PORT <<<>>> your SCRAPYD_SERVERS

GitHub: https://github.com/my8100/scrapydweb
"""


###############################################################################
###############################################################################
## QUICK SETUP: Simply search and update the SCRAPYD_SERVERS option, leave the rest as default.
## Recommended Reading: [How to efficiently manage your distributed web scraping projects]
## (https://medium.com/@my8100)
## ---------------------------- (originally in Chinese) -----------------------
## Quick setup: just search for and update the SCRAPYD_SERVERS option; leave the other options at their defaults.
## Recommended reading: [How to simply and efficiently deploy and monitor distributed web scraping projects]
## (https://juejin.im/post/5bebc5fd6fb9a04a053f3a0e)
###############################################################################
###############################################################################


############################## ScrapydWeb #####################################
# Setting SCRAPYDWEB_BIND to '0.0.0.0' or IP-OF-THE-CURRENT-HOST would make
# ScrapydWeb server visible externally; Otherwise, set it to '127.0.0.1'.
# The default is '0.0.0.0'.
SCRAPYDWEB_BIND = '127.0.0.1'
# Accept connections on the specified port, the default is 5000.
SCRAPYDWEB_PORT = 5000

# The default is False, set it to True to enable basic auth for web UI.
ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = 'admin'
PASSWORD = 'admin'

# The default is False, set it to True and add both CERTIFICATE_FILEPATH and PRIVATEKEY_FILEPATH
# to run ScrapydWeb in HTTPS mode.
# Note that this feature is not fully tested, please leave your comment here if ScrapydWeb
# raises any exception at startup: https://github.com/my8100/scrapydweb/issues/18
ENABLE_HTTPS = False
# e.g. '/home/username/cert.pem'
CERTIFICATE_FILEPATH = ''
# e.g. '/home/username/cert.key'
PRIVATEKEY_FILEPATH = ''


############################## Scrapy #########################################
# ScrapydWeb is able to locate projects in the SCRAPY_PROJECTS_DIR,
# so that you can simply select a project to deploy, instead of packaging it in advance.
# e.g. 'C:/Users/username/myprojects/' or '/home/username/myprojects/'
SCRAPY_PROJECTS_DIR = ''


############################## Scrapyd ########################################
# Make sure that [Scrapyd](https://github.com/scrapy/scrapyd) has been installed
# and started on all of your hosts.
# Note that for remote access, you have to manually set 'bind_address = 0.0.0.0'
# in the configuration file of Scrapyd and restart Scrapyd to make it visible externally.
# Check out 'https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file' for more info.
# ---------------------------- (originally in Chinese) ------------------------
# Make sure that [Scrapyd](https://github.com/scrapy/scrapyd) has been installed and started on all hosts.
# For remote access to Scrapyd, set 'bind_address = 0.0.0.0' in the Scrapyd config file, then restart Scrapyd.
# See https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file

# - the string format: username:password@ip:port#group
# - The default port would be 6800 if not provided,
# - Both basic auth and group are optional.
# - e.g. '127.0.0.1:6800' or 'username:password@localhost:6801#group'
# - the tuple format: (username, password, ip, port, group)
# - When the username, password, or group is too complicated (e.g. contains ':@#'),
# - or if ScrapydWeb fails to parse the string format passed in,
# - it's recommended to pass in a tuple of 5 elements.
# - e.g. ('', '', '127.0.0.1', '6800', '') or ('username', 'password', 'localhost', '6801', 'group')
# The Scrapyd servers that have been started; multiple crawler projects can be deployed to them.
SCRAPYD_SERVERS = [
'127.0.0.1:6800',
# 'username:password@localhost:6801#group',
# ('username', 'password', 'localhost', '6801', 'group'),
]

# If both ScrapydWeb and one of your Scrapyd servers run on the same machine,
# ScrapydWeb would try to directly read Scrapy logfiles from disk, instead of making a request
# to the Scrapyd server.
# e.g. '127.0.0.1:6800' or 'localhost:6801', do not forget the port number.
LOCAL_SCRAPYD_SERVER = '127.0.0.1:6800'
# Check out this link to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
# e.g. 'C:/Users/username/logs/' or '/home/username/logs/'
# Directory where the Scrapy log files are stored, used for log analysis.
SCRAPYD_LOGS_DIR = '/Users/admin/www/reports/env/logs'

# ScrapydWeb would try every extension in sequence to locate the Scrapy logfile.
# The default is ['.log', '.log.gz', '.txt'].
SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.txt']


############################## LogParser ######################################
# By default ScrapydWeb would automatically run LogParser as a subprocess at startup,
# so that the stats of crawled_pages and scraped_items can be shown in the Jobs page.
# The default is True, set it to False to disable this behaviour.
# Note that you can run the LogParser service separately via command 'logparser' as you like.
# Run 'logparser -h' to find out the config file of LogParser for more advanced settings.
# Visit https://github.com/my8100/logparser for more info.
# Enable log parsing.
ENABLE_LOGPARSER = True

# Whether to backup the stats json files locally after you visit the Stats page of a job
# so that it is still accessible even if the original logfile has been deleted.
# The default is True, set it to False to disable this behaviour.
BACKUP_STATS_JSON_FILE = True


############################## Timer Tasks ####################################
# Run ScrapydWeb with argument '-sw' or '--switch_scheduler_state', or click the ENABLED|DISABLED button
# on the Timer Tasks page to turn on/off the scheduler for the timer tasks and the snapshot mechanism below.

# The default is 300, which means ScrapydWeb would automatically create a snapshot of the Jobs page
# and save the jobs info in the database in the background every 300 seconds.
# Note that this behaviour would be paused if the scheduler for timer tasks is disabled.
# Set it to 0 to disable this behaviour.
JOBS_SNAPSHOT_INTERVAL = 300


############################## Page Display ###################################
# The default is True, set it to False to hide the Items page, as well as
# the Items column in the Jobs page.
SHOW_SCRAPYD_ITEMS = True

# The default is True, set it to False to hide the Job column in the Jobs page with non-database view.
SHOW_JOBS_JOB_COLUMN = True

# The default is 0, which means unlimited, set it to a positive integer so that
# only the latest N finished jobs would be shown in the Jobs page with non-database view.
JOBS_FINISHED_JOBS_LIMIT = 0

# If your browser stays on the Jobs page, it would be reloaded automatically every N seconds.
# The default is 300, set it to 0 to disable auto-reloading.
JOBS_RELOAD_INTERVAL = 300

# The load status of the current Scrapyd server is checked every N seconds,
# which is displayed in the top right corner of the page.
# The default is 10, set it to 0 to disable auto-refreshing.
DAEMONSTATUS_REFRESH_INTERVAL = 10


############################## Email Notice ###################################
# In order to be notified (and stop or forcestop a job when triggered) in time,
# you can reduce the value of POLL_ROUND_INTERVAL and POLL_REQUEST_INTERVAL,
# at the cost of burdening both CPU and bandwidth of your servers.

# Tip: set SCRAPYDWEB_BIND to the actual IP of your host, then you can visit ScrapydWeb
# via the links attached in the email. (check out the "ScrapydWeb" section above)

# Check out this link if you are using ECS of Alibaba Cloud and your SMTP server provides TCP port 25 only:
# https://www.alibabacloud.com/help/doc-detail/56130.htm

# The default is False, set it to True to enable email notification.
ENABLE_EMAIL = False


########## smtp settings ##########
SMTP_SERVER = ''
SMTP_PORT = 0
SMTP_OVER_SSL = False

# Config for https://mail.google.com using SSL
# SMTP_SERVER = 'smtp.gmail.com'
# SMTP_PORT = 465
# SMTP_OVER_SSL = True

# Config for https://mail.google.com
# SMTP_SERVER = 'smtp.gmail.com'
# SMTP_PORT = 587
# SMTP_OVER_SSL = False

# Config for https://mail.qq.com/ using SSL
# SMTP_SERVER = 'smtp.qq.com'
# SMTP_PORT = 465
# SMTP_OVER_SSL = True

# Config for http://mail.10086.cn/
# SMTP_SERVER = 'smtp.139.com'
# SMTP_PORT = 25
# SMTP_OVER_SSL = False

# The timeout in seconds for the connection attempt, the default is 10.
SMTP_CONNECTION_TIMEOUT = 10


########## sender & recipients ##########
# Leave this option as '' to default to the FROM_ADDR option below; Otherwise, set it up
# if your email service provider requires a username which is different from the FROM_ADDR option below to login.
# e.g. 'username'
EMAIL_USERNAME = ''
# As for different email service provider, you might have to get an APP password (like Gmail)
# or an authorization code (like QQ mail) and set it as the EMAIL_PASSWORD.
# Check out links below to get more help:
# https://stackoverflow.com/a/27515833/10517783 How to send an email with Gmail as the provider using Python?
# https://stackoverflow.com/a/26053352/10517783 Python smtplib proxy support
# e.g. 'password4gmail'
EMAIL_PASSWORD = ''

# e.g. 'username@gmail.com'
FROM_ADDR = ''
# e.g. ['username@gmail.com', ]
TO_ADDRS = [FROM_ADDR]


########## email working time ##########
# Monday is 1 and Sunday is 7.
# e.g. [1, 2, 3, 4, 5, 6, 7]
EMAIL_WORKING_DAYS = []

# From 0 to 23.
# e.g. [9] + list(range(15, 18)) >>> [9, 15, 16, 17], or range(24) for 24 hours
EMAIL_WORKING_HOURS = []


########## poll interval ##########
# Sleep N seconds before starting next round of poll, the default is 300.
POLL_ROUND_INTERVAL = 300

# Sleep N seconds between each request to the Scrapyd server while polling, the default is 10.
POLL_REQUEST_INTERVAL = 10


########## basic triggers ##########
# Trigger email notice every N seconds for each running job.
# The default is 0, set it to a positive integer to enable this trigger.
ON_JOB_RUNNING_INTERVAL = 0

# Trigger email notice when a job is finished.
# The default is False, set it to True to enable this trigger.
ON_JOB_FINISHED = False


########## advanced triggers ##########
# - LOG_XXX_THRESHOLD:
# - Trigger email notice the first time reaching the threshold for a specific kind of log.
# - The default is 0, set it to a positive integer to enable this trigger.
# - LOG_XXX_TRIGGER_STOP (optional):
# - The default is False, set it to True to stop current job automatically when reaching the LOG_XXX_THRESHOLD.
# - The SIGTERM signal would be sent only one time to shut down the crawler gracefully.
# - In order to avoid an UNCLEAN shutdown, the 'STOP' action would be executed one time at most
# - if none of the 'FORCESTOP' triggers is enabled, no matter how many 'STOP' triggers are enabled.
# - LOG_XXX_TRIGGER_FORCESTOP (optional):
# - The default is False, set it to True to FORCESTOP current job automatically when reaching the LOG_XXX_THRESHOLD.
# - The SIGTERM signal would be sent twice resulting in an UNCLEAN shutdown, without the Scrapy stats dumped!
# - The 'FORCESTOP' action would be executed if both of the 'STOP' and 'FORCESTOP' triggers are enabled.

# Note that the 'STOP' action and the 'FORCESTOP' action would STILL be executed even when the current time
# is NOT within the EMAIL_WORKING_DAYS and the EMAIL_WORKING_HOURS, though NO email would be sent.

LOG_CRITICAL_THRESHOLD = 0
LOG_CRITICAL_TRIGGER_STOP = False
LOG_CRITICAL_TRIGGER_FORCESTOP = False

LOG_ERROR_THRESHOLD = 0
LOG_ERROR_TRIGGER_STOP = False
LOG_ERROR_TRIGGER_FORCESTOP = False

LOG_WARNING_THRESHOLD = 0
LOG_WARNING_TRIGGER_STOP = False
LOG_WARNING_TRIGGER_FORCESTOP = False

LOG_REDIRECT_THRESHOLD = 0
LOG_REDIRECT_TRIGGER_STOP = False
LOG_REDIRECT_TRIGGER_FORCESTOP = False

LOG_RETRY_THRESHOLD = 0
LOG_RETRY_TRIGGER_STOP = False
LOG_RETRY_TRIGGER_FORCESTOP = False

LOG_IGNORE_THRESHOLD = 0
LOG_IGNORE_TRIGGER_STOP = False
LOG_IGNORE_TRIGGER_FORCESTOP = False


############################## System #########################################
# The default is False, set it to True to enable debug mode and the interactive debugger
# would be shown in the browser instead of the "500 Internal Server Error" page.
# Note that use_reloader is set to False in run.py
DEBUG = False

# The default is False, set it to True to change the logging level from WARNING to DEBUG
# for getting more information about how ScrapydWeb works, especially while debugging.
VERBOSE = False
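As the Scrapyd section of the settings above notes, every host listed in SCRAPYD_SERVERS must already be running Scrapyd, with 'bind_address = 0.0.0.0' in its scrapyd.conf if it is accessed remotely. Below is a small sketch, not part of ScrapydWeb itself, that polls each server's daemonstatus.json endpoint from Scrapyd's HTTP JSON API before starting scrapydweb; it assumes no basic auth is configured on the Scrapyd side, and the server list is only an illustration.

import requests

# Mirror the plain 'ip:port' entries from scrapydweb_settings_v8.py (hypothetical list for illustration).
SCRAPYD_SERVERS = ['127.0.0.1:6800']

for server in SCRAPYD_SERVERS:
    try:
        # daemonstatus.json reports the number of pending/running/finished jobs on that Scrapyd host.
        resp = requests.get('http://%s/daemonstatus.json' % server, timeout=5)
        print(server, resp.json())   # e.g. {'status': 'ok', 'pending': 0, 'running': 0, 'finished': 0, ...}
    except requests.RequestException as err:
        print(server, 'unreachable:', err)

If a server shows up as unreachable here, fix Scrapyd (or its bind_address) first; otherwise ScrapydWeb will not be able to schedule jobs on it.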
3. Web UI: feature-rich, and it also works on mobile devices.

Reference: https://github.com/my8100/files/blob/master/scrapydweb/README_CN.md