Implementing a Web Page Processing Script in Python

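An embedded web server differs from a conventional one: the web pages have to be converted to C arrays and stored in flash so that the lwIP network interface can serve them. Recent project work required frequent changes to the pages, and compressing and converting them by hand every time was tedious. So I put what I know to use and wrote a Python script that batch-processes the web files, compresses them, and converts them to arrays.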

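Environment the script was run in (later versions should also work):

Python 3.5.1 (for download, installation, and configuration, see tutorials online)

node.js v4.4.7, with the uglifyjs package installed, for JS compression that goes beyond simple text stripping

uglifyjs: the engine used to compress JS files; for installation see http://www.zhangxinxu.com/wordpress/2013/01/uglifyjs-compress-js/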
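
Before running the script, it can help to confirm that the external tools are reachable. The following is a minimal sketch, assuming that `node` and `uglifyjs` are the command names that end up on PATH after installation; the helper name `check_tools` is hypothetical and not part of the original script.

```python
import shutil


def check_tools(names=("node", "uglifyjs")):
    """Return {tool name: True/False} depending on whether it is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    for tool, found in check_tools().items():
        print(tool, "found" if found else "NOT found on PATH")
```

If uglifyjs is reported as missing, the js branch of the script would produce no output file, and the later gzip step for that file would fail.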

The full implementation is as follows:

#!/usr/bin/env python3

import os
import shutil
from functools import partial
import re
import gzip

# Create a new folder
def mkdir(path):
    path = path.strip()
    # Create the folder only if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)
        print(path + ' created')
    return path

# Delete a folder (including every file inside it)
def deldir(path):
    path = path.strip()
    # Delete the folder only if it exists
    if os.path.exists(path):
        shutil.rmtree(path)
        print(path + " deleted")

# First-pass compression of a web page file
def FileReduce(inpath, outpath):
    with open(inpath, "r", encoding="utf-8") as infp, \
         open(outpath, "w", encoding="utf-8") as outfp:
        for li in infp:
            # Skip blank lines
            if li.split():
                # Remove line breaks and tabs
                li = li.replace('\n', '').replace('\t', '')
                # Keep only one space between words
                li = ' '.join(li.split())
                # Note: lines are joined without a separator,
                # so a tag must not be split across two lines
                outfp.write(li)
    print(outpath + " compressed")

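# Shell command call (uses uglifyjs to compress a JS file)
def ShellReduce(inpath, outpath):
    # Quote the paths so that folders containing spaces still work
    Command = 'uglifyjs "' + inpath + '" -m -o "' + outpath + '"'
    print(Command)
    os.system(Command)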

# gzip compression
def FileGzip(inpath, outpath):
    with open(inpath, 'rb') as plain_file:
        with gzip.open(outpath, 'wb') as zip_file:
            zip_file.writelines(plain_file)
    print(outpath + " gzip compressed")

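# Read a file in binary mode and save its bytes as a C array
def FileHex(inpath, outpath):
    i = 0
    count = 0
    a = ''
    # Array name: output file name without folder and ".c" (e.g. index_html)
    name = os.path.splitext(os.path.basename(outpath))[0]
    with open(inpath, 'rb') as inf, open(outpath, 'w') as outf:
        # Read one byte at a time until end of file
        for r in iter(partial(inf.read, 1), b''):
            r_int = int.from_bytes(r, byteorder='big')
            a += strzfill(hex(r_int), 2, 2) + ', '
            i += 1
            count += 1
            # Start a new line after every 16 bytes
            if i == 16:
                a += '\n'
                i = 0
        # The closing "};" terminates the C array definition
        a = "const static char " + name + "[" + str(count) + "]={\n" + a + "\n};\n\n"
        outf.write(a)
    print(outpath + " converted to array")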

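# Zero-pad the part of the string after position index to n characters,
# e.g. strzfill('0x5', 2, 2) gives '0x05'
def strzfill(istr, index, n):
    return istr[:index] + istr[index:].zfill(n)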

# Strip CSS comments /*.....*/
def unCommentReduce(inpath, outpath):
    with open(inpath, "r", encoding="utf-8") as infp:
        fileByte = infp.read()
    # Raw string, so the backslashes reach the regex engine unchanged
    replace_reg = re.compile(r'/\*[\s\S]*?\*/')
    fileByte = replace_reg.sub('', fileByte)
    # Remove line breaks and tabs, then keep only one space between words
    fileByte = fileByte.replace('\n', '').replace('\t', '')
    fileByte = ' '.join(fileByte.split())
    with open(outpath, "w", encoding="utf-8") as outfp:
        outfp.write(fileByte)
    print(outpath + " comments removed, compressed")

# Main processing function
def WebProcess(path):
    # Original pages:       ..\basic
    # Compressed pages:     ..\reduce
    # gzip second pass:     ..\gzip
    # Generated .c arrays:  ..\program
    BasicPath = os.path.join(path, "basic")
    ReducePath = os.path.join(path, "reduce")
    GzipPath = os.path.join(path, "gzip")
    ProgramPath = os.path.join(path, "program")
    # Delete the old output folders, then create them again
    deldir(ProgramPath)
    deldir(ReducePath)
    deldir(GzipPath)
    mkdir(ProgramPath)
    for root, dirs, files in os.walk(BasicPath):
        for item in files:
            ext = item.split('.')
            InFilePath = root + "/" + item
            # Swap only the leading basic folder, not every "basic" in the path
            OutReducePath = mkdir(root.replace(BasicPath, ReducePath, 1)) + "/" + item
            OutGzipPath = mkdir(root.replace(BasicPath, GzipPath, 1)) + "/" + item + '.gz'
            OutProgramPath = ProgramPath + "/" + item.replace('.', '_') + '.c'
            # Processing depends on the file extension:
            # html: remove '\n' and '\t', keep a single space
            # css: remove /*......*/ comments, '\n' and '\t', keep a single space
            # js: compress with uglifyjs
            # gif, jpg, ico: copy as is
            # others: copy as is
            # Every type except "others" is then gzip-compressed into a .gz file
            # and converted to a hexadecimal array saved as a .c file
            if ext[-1] == 'html':
                FileReduce(InFilePath, OutReducePath)
                FileGzip(OutReducePath, OutGzipPath)
                FileHex(OutGzipPath, OutProgramPath)
            elif ext[-1] == 'css':
                unCommentReduce(InFilePath, OutReducePath)
                FileGzip(OutReducePath, OutGzipPath)
                FileHex(OutGzipPath, OutProgramPath)
            elif ext[-1] == 'js':
                ShellReduce(InFilePath, OutReducePath)
                FileGzip(OutReducePath, OutGzipPath)
                FileHex(OutGzipPath, OutProgramPath)
            elif ext[-1] in ["gif", "jpg", "ico"]:
                shutil.copy(InFilePath, OutReducePath)
                FileGzip(OutReducePath, OutGzipPath)
                FileHex(OutGzipPath, OutProgramPath)
            else:
                shutil.copy(InFilePath, OutReducePath)

# Get the folder that contains this script
path = os.path.split(os.path.realpath(__file__))[0]

WebProcess(path)

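The implementation above works as follows:

1. Walk the folder to be processed (..\basic; the user must create it, copy the files to be processed into it, and place the script one level above it). This is done by WebProcess.

2. Create the compressed-page folder (..\reduce, which stores the compressed files). The script does this, handling each file type as follows:

  html: remove extra spaces and line breaks

  css: remove extra spaces, line breaks, and /*......*/ comments

  js: compress with uglifyjs

  gif, jpg, ico, and others: copy as is

3. Create the gzip folder (..\gzip, which stores the files after the second compression pass). The script does this with the gzip module.

4. Create the output folder (..\program, which stores the generated .c array files). The script does this as follows:

  Read each file in binary mode, convert it to a string of hexadecimal values, and write that string to a file.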
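
The steps above can be tried on an in-memory string before touching any files. This is a simplified sketch rather than the script itself: the helper names (`collapse_whitespace`, `strip_css_comments`, `to_c_array`) and the sample CSS are made up for illustration, line breaks are replaced by a single space instead of being deleted, and the array is declared as `unsigned char`, which is an assumption about what the C side accepts.

```python
import gzip
import re


def collapse_whitespace(text):
    """Join all whitespace runs (spaces, tabs, newlines) into single spaces."""
    return " ".join(text.split())


def strip_css_comments(text):
    """Remove /* ... */ comments, then collapse whitespace."""
    return collapse_whitespace(re.sub(r"/\*[\s\S]*?\*/", "", text))


def to_c_array(data, name, per_line=16):
    """Render bytes as the text of a C array definition, 16 values per line."""
    rows = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        rows.append(", ".join("0x{:02x}".format(b) for b in chunk))
    body = ",\n".join(rows)
    return "static const unsigned char {}[{}] = {{\n{}\n}};\n".format(
        name, len(data), body)


css = "/* page style */\nbody {\n\tmargin: 0;\n}\n"
reduced = strip_css_comments(css)                  # step 2: text-level compression
packed = gzip.compress(reduced.encode("utf-8"))    # step 3: gzip second pass
print(reduced)
print(to_c_array(packed, "style_css"))             # step 4: bytes to C array text
```

Note that for very small inputs like this one the gzip header makes the packed data larger than the text; the second pass pays off on real pages.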

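Open a Windows command prompt in the folder (Shift + right-click) and run python web.py; the script loops over the files and repeats these three steps until every file has been processed.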
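
After a run, a generated array can be sanity-checked by turning it back into bytes and decompressing it. A minimal sketch, assuming the generated file keeps its bytes as 0xNN literals between one pair of braces, as FileHex writes them; the helper name `array_to_bytes` and the sample data are hypothetical.

```python
import gzip
import re


def array_to_bytes(c_source):
    """Collect the 0xNN literals between the braces of a C array into bytes."""
    body = c_source[c_source.index("{") + 1:c_source.rindex("}")]
    return bytes(int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]{1,2}", body))


# Build a sample in the same general layout as the generated .c files.
original = b"<html><body>hello</body></html>"
packed = gzip.compress(original)
sample = "const static char index_html[{}]={{\n{}\n}}\n".format(
    len(packed), ", ".join("0x{:02x}".format(b) for b in packed))

# Parse the array text back into bytes and undo the gzip pass.
restored = gzip.decompress(array_to_bytes(sample))
print(restored == original)
```

For a real check, the text of a file from the program folder would be read with `open(...).read()` and passed to `array_to_bytes` in place of `sample`.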

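Important: every file to be processed must be saved as UTF-8; otherwise reading it fails with a decoding error (for example a "gbk" codec error when Python falls back to the default encoding on Chinese Windows).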
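
Files saved in another encoding can be converted once before running the script. A minimal sketch, assuming the non-UTF-8 files are GBK encoded, which is only a guess based on the error message; the helper name `convert_to_utf8` and the demo file are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(path, source_encoding="gbk"):
    """Rewrite a text file as UTF-8 unless it already decodes as UTF-8.

    Returns True if the file was rewritten, False if it was left alone.
    """
    with open(path, "rb") as fp:
        raw = fp.read()
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)
    # newline="" keeps the original line endings untouched
    with open(path, "w", encoding="utf-8", newline="") as fp:
        fp.write(text)
    return True


# Small demonstration on a temporary GBK-encoded file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "index.html")
with open(demo_path, "wb") as fp:
    fp.write("<p>网页</p>".encode("gbk"))
print(convert_to_utf8(demo_path))  # first call rewrites the file
print(convert_to_utf8(demo_path))  # second call finds it is already UTF-8
```

Binary files such as gif, jpg, and ico should not be passed through this helper; the script copies them byte for byte.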

Result (figures from the original post): the source html file, and the array generated from it.

A complete example is available at:

http://files.cnblogs.com/files/zc110747/webreduce.7z

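As a bonus, here is a small script that counts the total lines and blank lines of selected source files in the current directory and its subfolders (a by-product of testing the script above):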

#!/usr/bin/env python3

import os

total_count = 0
empty_count = 0

# Count the total and blank lines of one file
def CountLine(path):
    global total_count
    global empty_count
    # errors="ignore" keeps files in other encodings from aborting the count
    with open(path, encoding="utf-8", errors="ignore") as tempfile:
        for lines in tempfile:
            total_count += 1
            if len(lines.strip()) == 0:
                empty_count += 1

# Walk path and count every file with a selected extension
def TotalLine(path):
    for root, dirs, files in os.walk(path):
        for item in files:
            ext = item.split('.')[-1]
            if ext in ["cpp", "c", "h", "java", "php"]:
                subpath = root + "/" + item
                CountLine(subpath)

path = os.path.split(os.path.realpath(__file__))[0]
TotalLine(path)

print("Input Path:", path)
print("total lines: ", total_count)
print("empty lines: ", empty_count)
print("code lines: ", (total_count - empty_count))
