HTTrack Website Backup Tool
HTTrack clones a specified website — it downloads the entire site to your local machine, which makes it great for offline browsing. And it's free!
HTTrack works much like a search engine's crawler, so it can also be used for information gathering. I previously wrote a related post: http://www.cnblogs.com/dcb3688/p/4607985.html (crawling a website's resource files with Python).
Now the same job can be done easily with this tool.
Installation:
Latest version: httrack-3.48.22 (2016-05-16)
Linux
wget 'http://download.httrack.com/cserv.php3?File=httrack.tar.gz' -O httrack-3.48.22.tar.gz
tar -zxvf httrack-3.48.22.tar.gz
cd httrack-3.48.22
./configure
make && make install
You may run into the following error:
httrack: error while loading shared libraries: libhttrack.so.2: cannot open shared object file: No such file or directory
If so, refresh the dynamic linker cache:
ldconfig
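If plain `ldconfig` doesn't fix it, the install prefix's library directory may not be on the linker's search path. A minimal sketch, assuming the default `./configure` prefix put `libhttrack` under /usr/local/lib (the conf-file name is illustrative):

```shell
# Check whether the dynamic linker can see libhttrack:
ldconfig -p | grep libhttrack

# If nothing shows up, register /usr/local/lib and rebuild the cache:
echo '/usr/local/lib' | sudo tee /etc/ld.so.conf.d/httrack.conf
sudo ldconfig
```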
Windows
Download: http://files.cnblogs.com/files/dcb3688/httrack-3.48.22.zip
Fetching data
On Linux, view the built-in help with:
httrack --help
HTTrack version 3.48-22
usage: httrack <URLs> [-option] [+<URL_FILTER>] [-<URL_FILTER>] [+<mime:MIME_FILTER>] [-<mime:MIME_FILTER>]
with options listed below: (* is the default value)
General options:
O path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path <param>)
Action options:
w *mirror web sites (--mirror)
W mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
g just get files (saved in the current directory) (--get-files)
i continue an interrupted mirror using the cache (--continue)
Y mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
Proxy options:
P proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
%f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
%b use this local hostname to make/send requests (-%b hostname) (--bind <param>)
Limits options:
rN set the mirror depth to N (* r9999) (--depth[=N])
%eN set the external links depth to N (* %e0) (--ext-depth[=N])
mN maximum file length for a non-html file (--max-files[=N])
mN,N2 maximum file length for non html (N) and html (N2)
MN maximum overall size that can be uploaded/scanned (--max-size[=N])
EN maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
AN maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])
%cN maximum number of connections/seconds (*%c10) (--connection-per-second[=N])
GN pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])
Flow control:
cN number of multiple connections (*c8) (--sockets[=N])
TN timeout, number of seconds after a non-responding link is shutdown (--timeout[=N])
RN number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N])
JN traffic jam control, minimum transfert rate (bytes/seconds) tolerated for a link (--min-rate[=N])
HN host is abandonned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N])
Links options:
%P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
n get non-html files 'near' an html file (ex: an image located outside) (--near)
t test all URLs (even forbidden ones) (--test)
%L <file> add all URL located in this text file (one URL per line) (--list <param>)
%S <file> add all scan rules located in this text file (one scan rule per line) (--urllist <param>)
Build options:
NN structure type (N0 *original structure, +: see below) (--structure[=N])
or user defined structure (-N "%h%p/%n%q.%t")
%N delayed type check, don't make any link test but wait for files download to start instead (experimental) (%N0 don't use, %N1 use for unknown extensions, * %N2 always use)
%D cached delayed type check, don't wait for remote type during updates, to speedup them (%D0 wait, * %D1 don't wait) (--cached-delayed-type-check)
%M generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)
LN long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N])
KN keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K4 original links, K3 absolute URI links, K5 transparent proxy link) (--keep-links[=N])
x replace external html links by error pages (--replace-external)
%x do not include any password for external password protected websites (%x0 include) (--disable-passwords)
%q *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string)
o *generate output html file in case of error (404..) (o0 don't generate) (--generate-errors)
X *purge old files after update (X0 keep delete) (--purge-old[=N])
%p preserve html files 'as is' (identical to '-K4 -%F ""') (--preserve)
%T links conversion to UTF-8 (--utf8-conversion)
Spider options:
bN accept cookies in cookies.txt (0=do not accept, *1=accept) (--cookies[=N])
u check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
j *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class |4 don't parse .js |8 don't be aggressive) (--parse-java[=N])
sN follow robots.txt and meta robots tags (0=never, 1=sometimes, *2=always, 3=always (even strict rules)) (--robots[=N])
%h force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
%k use keep-alive if possible, greately reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
%B tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
%u url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)
%A assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>)
shortcut: '--assume standard' is equivalent to -%A php2 php3 php4 php cgi asp jsp pl cfm nsf=text/html
can also be used to force a specific file type: --assume foo.cgi=text/html
@iN internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])
%w disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module <param>)
Browser ID:
F user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>)
%R default referer field sent in HTTP headers (--referer <param>)
%E from email address sent in HTTP headers (--from <param>)
%F footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
%l preffered language (-%l "fr, en, jp, *" (--language <param>)
%a accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1" (--accept <param>)
%X additional HTTP header line (-%X "X-Magic: 42" (--headers <param>)
Log, index, cache
C create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
k store all files in cache (not useful if files on disk) (--store-all-in-cache)
%n do not re-download locally erased files (--do-not-recatch)
%v display on screen filenames downloaded (in realtime) - * %v1 short version - %v2 full animation (--display)
Q no log - quiet mode (--do-not-log)
q no questions - quiet mode (--quiet)
z log - extra infos (--extra-log)
Z log - debug (--debug-log)
v log on screen (--verbose)
f *log in files (--file-log)
f2 one single log file (--single-log)
I *make an index (I0 don't make) (--index)
%i make a top index for a project folder (* %i0 don't make) (--build-top-index)
%I make an searchable index for this mirror (* %I0 don't make) (--search-index)
Expert options:
pN priority mode: (* p3) (--priority[=N])
p0 just scan, don't save anything (for checking links)
p1 save only html files
p2 save only non html files
*p3 save all files
p7 get html files before, then treat other files
S stay on the same directory (--stay-on-same-dir)
D *can only go down into subdirs (--can-go-down)
U can only go to upper directories (--can-go-up)
B can both go up&down into the directory structure (--can-go-up-and-down)
a *stay on the same address (--stay-on-same-address)
d stay on the same principal domain (--stay-on-same-domain)
l stay on the same TLD (eg: .com) (--stay-on-same-tld)
e go everywhere on the web (--go-everywhere)
%H debug HTTP headers in logfile (--debug-headers)
Guru options: (do NOT use if possible)
#X *use optimized engine (limited memory boundary checks) (--fast-engine)
# filter test (-# '*.gif' 'www.bar.com/foo.gif') (--debug-testfilters <param>)
# simplify test (-# ./foo/bar/../foobar)
# type test (-# /foo/bar.php)
#C cache list (-#C '*.com/spider*.gif' (--debug-cache <param>)
#R cache repair (damaged cache) (--repair-cache)
#d debug parser (--debug-parsing)
#E extract new.zip cache meta-data in meta.zip
#f always flush log files (--advanced-flushlogs)
#FN maximum number of filters (--advanced-maxfilters[=N])
#h version info (--version)
#K scan stdin (debug) (--debug-scanstdin)
#L maximum number of links (-#L1000000) (--advanced-maxlinks[=N])
#p display ugly progress information (--advanced-progressinfo)
#P catch URL (--catch-url)
#R old FTP routines (debug) (--repair-cache)
#T generate transfer ops. log every minutes (--debug-xfrstats)
#u wait time (--advanced-wait)
#Z generate transfer rate statictics every minutes (--debug-ratestats)
Dangerous options: (do NOT use unless you exactly know what you are doing)
%! bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections) (--disable-security-limits)
IMPORTANT NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS
USE IT WITH EXTREME CARE
Command-line specific options:
V execute system command after each files ($ is the filename: -V "rm \$0") (--userdef-cmd <param>)
%W use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback <param>)
Details: Option N
N0 Site-structure (default)
N1 HTML in web/, images/other files in web/images/
N2 HTML in web/HTML, images/other in web/images
N3 HTML in web/, images/other in web/
N4 HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif will be placed onto web/gif, for example)
N5 Images/other in web/xxx and HTML in web/HTML
N99 All files in web/, with random names (gadget !)
N100 Site-structure, without www.domain.xxx/
N101 Identical to N1 exept that "web" is replaced by the site's name
N102 Identical to N2 exept that "web" is replaced by the site's name
N103 Identical to N3 exept that "web" is replaced by the site's name
N104 Identical to N4 exept that "web" is replaced by the site's name
N105 Identical to N5 exept that "web" is replaced by the site's name
N199 Identical to N99 exept that "web" is replaced by the site's name
N1001 Identical to N1 exept that there is no "web" directory
N1002 Identical to N2 exept that there is no "web" directory
N1003 Identical to N3 exept that there is no "web" directory (option set for g option)
N1004 Identical to N4 exept that there is no "web" directory
N1005 Identical to N5 exept that there is no "web" directory
N1099 Identical to N99 exept that there is no "web" directory
Details: User-defined option N
'%n' Name of file without file type (ex: image)
'%N' Name of file, including file type (ex: image.gif)
'%t' File type (ex: gif)
'%p' Path [without ending /] (ex: /someimages)
'%h' Host name (ex: www.someweb.com)
'%M' URL MD5 (128 bits, 32 ascii bytes)
'%Q' query string MD5 (128 bits, 32 ascii bytes)
'%k' full query string
'%r' protocol name (ex: http)
'%q' small query string MD5 (16 bits, 4 ascii bytes)
'%s?' Short name version (ex: %sN)
'%[param]' param variable in query string
'%[param:before:after:empty:notfound]' advanced variable extraction
Details: User-defined option N and advanced variable extraction
%[param:before:after:empty:notfound]
param : parameter name
before : string to prepend if the parameter was found
after : string to append if the parameter was found
notfound : string replacement if the parameter could not be found
empty : string replacement if the parameter was empty
all fields, except the first one (the parameter name), can be empty
Details: Option K
K0 foo.cgi?q=45 -> foo4B54.html?q=45 (relative URI, default)
K -> http://www.foobar.com/folder/foo.cgi?q=45 (absolute URL) (--keep-links[=N])
K3 -> /folder/foo.cgi?q=45 (absolute URI)
K4 -> foo.cgi?q=45 (original URL)
K5 -> http://www.foobar.com/folder/foo4B54.html?q=45 (transparent proxy URL)
Shortcuts:
--mirror <URLs> *make a mirror of site(s) (default)
--get <URLs> get the files indicated, do not seek other URLs (-qg)
--list <text file> add all URL located in this text file (-%L)
--mirrorlinks <URLs> mirror all links in 1st level pages (-Y)
--testlinks <URLs> test links in pages (-r1p0C0I0t)
--spider <URLs> spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite <URLs> identical to --spider
--skeleton <URLs> make a mirror, but gets only html files (-p1)
--update update a mirror, without confirmation (-iC2)
--continue continue a mirror, without confirmation (-iC1)
--catchurl create a temporary proxy to capture an URL or a form post URL
--clean erase cache & log files
--http10 force http/1.0 requests (-%h)
Details: Option %W: External callbacks prototypes
see htsdefines.h

example: httrack www.someweb.com/bob/
means: mirror site www.someweb.com/bob/ and only this site

example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*
means: mirror the two sites together (with shared links) and accept any .jpg files on .com sites

example: httrack www.someweb.com/bob/bobby.html +* -r6
means: get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web

example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.someweb.com/bob/bobby.html using a proxy

example: httrack --update
updates a mirror in the current folder

example: httrack
will bring you to the interactive mode

example: httrack --continue
continues a mirror in the current folder

HTTrack version 3.48-22
Copyright (C) 1998-2016 Xavier Roche and other contributors
Fetching data
httrack http://hao123.com -O /home/wwwroot/hao123.com
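The one-liner above uses HTTrack's defaults. Drawing on the options listed in the help output, here is a hedged sketch of a more controlled run — the depth, connection, and rate values are illustrative choices, not recommendations from the original post:

```shell
# A more polite mirror run (all limits are illustrative; httrack must be installed):
#   -O        output directory for the mirror and its logs/cache
#   -r3       limit the link depth to 3
#   -c4       open at most 4 simultaneous connections
#   -A100000  cap the transfer rate at roughly 100 KB/s
#   -s2       always follow robots.txt (the default)
httrack http://hao123.com \
    -O /home/wwwroot/hao123.com \
    -r3 -c4 -A100000 -s2
```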
On Windows there is a GUI, so we can simply follow the wizard.
Set the options to exclude resources under the baidu.com domain.
What the run looks like:
Once the download finishes, check the resources in the t4 directory.
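The same exclusion can be expressed on the command line with a scan-rule filter (the `-<URL_FILTER>` syntax from the help output). A minimal sketch — the output directory `t4` mirrors the Windows walkthrough, and the filter pattern is an assumed example:

```shell
# Mirror the site, but reject anything hosted under baidu.com
# ("-" prefixed scan rules exclude matching URLs; httrack must be installed):
httrack http://hao123.com -O ./t4 "-*baidu.com/*"
```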
HTTrack really is an impressive crawler!