Downloading an entire website with wget on Linux

On Linux, wget can download an entire website, and it correctly handles links that contain UTF-8 encoded Chinese characters.

The method, in brief:

wget --restrict-file-names=ascii -m -c -nv -np -k -E -p -R exe,zip http://www.xxx.com

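The options mean the following:

--restrict-file-names=ascii: save file names as ASCII, escaping every byte above 127 as %XX. This avoids the trouble that UTF-8 file names can cause. (Note: the ascii value is only supported from version 1.12 onward.)

-m: download the whole site. Short for --mirror, it is a shortcut for -N -r -l inf --no-remove-listing; see the description of each of those options for details.

-c: resume partially downloaded files.

-nv: non-verbose. Detailed progress output is turned off, but errors and basic information are still printed.

-np: don't ascend to the parent directory. Downloaded pages stay within the scope of the URL given at the end, here http://www.xxx.com. If you specify http://www.xxx.com/aaa instead, every page must be under http://www.xxx.com/aaa.

-k: after the download finishes, convert the links in the page files to local links, which makes offline browsing and building CHM files easier.

-E: when saving HTML/CSS files, use the appropriate file suffix. On some sites, files are generated dynamically on the server, so a CSS file may not have a .css suffix; -E appends the right one.

-p: -np restricts which page files are followed. Without -p, the media files an HTML page needs are subject to the same -np restriction. With -p, wget downloads the files the HTML/CSS files need for display (images, audio, video, and so on).

-R: comma-separated list of file suffixes to reject. Pass the list as a separate argument (-R exe,zip) or use the long form --reject=exe,zip. Written as -R=exe,zip, the = becomes part of the first suffix.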
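The option list above can also be assembled programmatically, which makes it easier to vary the reject list or the start URL. The following is only a minimal sketch: `build_wget_command` is a hypothetical helper name, not part of wget, and the script prints the command rather than running it.

```python
import shlex


def build_wget_command(url, reject=("exe", "zip")):
    """Return the wget argument list for mirroring `url` (hypothetical helper)."""
    cmd = [
        "wget",
        "--restrict-file-names=ascii",  # escape non-ASCII bytes in file names
        "-m",   # mirror: -N -r -l inf --no-remove-listing
        "-c",   # resume partial downloads
        "-nv",  # non-verbose output
        "-np",  # never ascend to the parent directory
        "-k",   # convert links for local viewing
        "-E",   # adjust HTML/CSS file suffixes
        "-p",   # download page requisites
    ]
    if reject:
        # -R takes its list as a separate argument, with no "=" after the option.
        cmd += ["-R", ",".join(reject)]
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(shlex.join(build_wget_command("http://www.xxx.com")))
```

Keeping the arguments in a list avoids shell quoting problems if the URL contains special characters.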

The downloaded files end up with names containing sequences such as %A7, a percent sign followed by hexadecimal digits. A Python program can rename them:

————————————————————————————————————

#!/usr/bin/env python3
"""Re-encode file and directory names, optionally undoing %XX URL escapes."""
import getopt
import os
import sys
from urllib.parse import unquote_to_bytes


class Renamer:
    def __init__(self, input_encoding, output_encoding, path, is_url):
        self.input_encoding = input_encoding
        self.output_encoding = output_encoding
        self.path = path
        self.is_url = is_url

    def start(self):
        self.rename_dir(self.path)

    def rename(self, root, name):
        # root and name are bytes, so the new name can be written in any encoding.
        try:
            raw = unquote_to_bytes(name) if self.is_url else name
            new = raw.decode(self.input_encoding).encode(self.output_encoding)
            # Skip unchanged names and names that would contain a path separator.
            if new == name or b"/" in new:
                return
            os.rename(os.path.join(root, name), os.path.join(root, new))
        except (UnicodeError, LookupError, OSError) as err:
            # Report the failure instead of silently ignoring it.
            sys.stderr.write("skipped %r: %s\n" % (name, err))

    def rename_dir(self, path):
        # Walk bottom-up so each directory is renamed only after its contents.
        for root, dirs, files in os.walk(os.fsencode(path), topdown=False):
            for name in files + dirs:
                self.rename(root, name)


def usage():
    print('''This program changes the encoding of file and directory names.
Usage: rename.py [OPTION]...
Options:
  -h, --help                 show this help.
  -i, --input-encoding=ENC   set the original encoding, default is UTF-8.
  -o, --output-encoding=ENC  set the output encoding, default is GBK.
  -p, --path=PATH            choose the path to process.
  -u, --is-url               treat names as URL (percent) encoded.
''')


def main(argv):
    input_encoding = "utf-8"
    output_encoding = "gbk"
    path = ""
    is_url = False  # off unless -u is given
    try:
        opts, args = getopt.getopt(
            argv, "hi:o:p:u",
            ["help", "input-encoding=", "output-encoding=", "path=", "is-url"])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt in ("-i", "--input-encoding"):
            input_encoding = arg
        elif opt in ("-o", "--output-encoding"):
            output_encoding = arg
        elif opt in ("-p", "--path"):
            path = arg
        elif opt in ("-u", "--is-url"):
            is_url = True
    if not path:
        # Without a path there is nothing to process.
        usage()
        sys.exit(2)
    rn = Renamer(input_encoding, output_encoding, path, is_url)
    rn.start()


if __name__ == '__main__':
    main(sys.argv[1:])

————————————————————————————————————

rename.py -i utf-8 -o gbk -p <download directory> -u
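To see what the renaming does to a single name, the conversion can be tried in isolation first. This is a small sketch under stated assumptions: `decode_name` is a hypothetical helper and the sample file name is made up for illustration. It only computes the readable name and touches no files.

```python
from urllib.parse import unquote_to_bytes


def decode_name(name, input_encoding="utf-8"):
    """Convert a percent-encoded file name back to readable text."""
    raw = unquote_to_bytes(name)       # "%E4%B8%AD" becomes b"\xe4\xb8\xad"
    return raw.decode(input_encoding)  # interpret the bytes in the given encoding


if __name__ == "__main__":
    # Illustrative name: the UTF-8 bytes of two Chinese characters, percent-encoded.
    print(decode_name("%E4%B8%AD%E6%96%87.html"))
```

If the decoded names look wrong, the site probably used a different encoding, and the -i value passed to rename.py should be changed to match.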

The renaming method comes from http://blog.csdn.net/kowity/article/details/6899256
