Python Project 1: Adding Markup Automatically

This project is adapted from 《Python基础教程（第三版）》 (Beginning Python, 3rd edition), published by 人民邮电出版社 (Posts & Telecom Press).

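Goal:

This project adds formatting to a plain-text file so that it can be converted into another type of document (HTML is used as the example).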

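Approach:

Extract the useful information from the source file:
- Document structure: the basis for deciding which HTML tags the target document gets
- Document content: becomes the content of the target document
Define rules that map the source structure to HTML:
- Some rules add tags directly
- Others replace existing markers with new tags
Write handlers that actually perform the adding and replacing
Write the main program, which creates the concrete rule objects, applies them to the source document, and controls input and output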
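
The two kinds of rule can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the project's code: the helper names wrap_block and replace_emphasis are made up for this example, and only the standard re module is used.

```python
import re

def wrap_block(block, tag):
    """Kind 1: add tags directly around a whole text block."""
    return '<{0}>{1}</{0}>'.format(tag, block)

def replace_emphasis(block):
    """Kind 2: replace an existing marker (*text*) with a new tag."""
    return re.sub(r'\*(.+?)\*', r'<em>\1</em>', block)

# Replace the inline markers first, then wrap the block as a paragraph.
print(wrap_block(replace_emphasis('Python is *fun*'), 'p'))
# prints: <p>Python is <em>fun</em></p>
```

The implementation below follows the same split, but routes both operations through a handler object so that the output format can be swapped.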

Implementation:

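#util.py

#This module splits the source document into blocks, which serve as input to the rule-matching code

def lines(file):
    """Yield every line of the file, then one extra blank line as an end marker"""
    for line in file: yield line  #each line is one line of text; blocks() groups consecutive lines into paragraphs
    yield '\n'

def blocks(file):
    """Yield one text block per paragraph (a run of consecutive non-blank lines)"""
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []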
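
To see what blocks() produces, it can be fed an in-memory file. The sketch below repeats the two generators so that it runs on its own; io.StringIO stands in for a real file, and the sample text is invented for the demonstration.

```python
from io import StringIO

def lines(file):
    for line in file:
        yield line
    yield '\n'  # sentinel blank line so the last block is flushed

def blocks(file):
    block = []
    for line in lines(file):
        if line.strip():
            block.append(line)
        elif block:
            yield ''.join(block).strip()
            block = []

# Three paragraphs; the middle one spans two lines and is followed by two blank lines.
sample = StringIO('Title\n\nfirst line\nsecond line\n\n\n- item\n')
result = list(blocks(sample))
print(result)
# prints: ['Title', 'first line\nsecond line', '- item']
```

Note that repeated blank lines do not produce empty blocks, and that the last paragraph is emitted even though the input does not end with a blank line.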

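#handlers.py

#This module marks up text blocks that have already been matched by a rule: it emits start and end tags, or replaces certain markers with HTML tags (emphasis, URLs, e-mail addresses)

class Handler:
    """
    start() and end() call the concrete tag method selected by the name passed in; calls to tag methods that are not defined are silently ignored
    sub() returns a replacement function for re.sub(); that function receives the match object and calls the corresponding substitution method
    """
    def callback(self, prefix, name, *args):
        method = getattr(self, prefix + name, None)
        if callable(method): return method(*args)

    def start(self, name):
        self.callback('start_', name)

    def end(self, name):
        self.callback('end_', name)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            if result is None: return match.group(0)  #no sub_ method defined: leave the matched text unchanged
            return result
        return substitution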

class HTMLRenderer(Handler):
    """
    Concrete handler for rendering HTML. It defines the individual tag methods, which are reached through the superclass methods start(), end() and sub()
    feed() is called between start() and end() to emit the text content
    """
    def start_document(self):
        print('<html><head><title>...</title></head><body>')

    def end_document(self):
        print('</body></html>')

    def start_paragraph(self):
        print('<p>')

    def end_paragraph(self):
        print('</p>')

    def start_heading(self):
        print('<h2>')

    def end_heading(self):
        print('</h2>')

    def start_list(self):
        print('<ul>')

    def end_list(self):
        print('</ul>')

    def start_listitem(self):
        print('<li>')

    def end_listitem(self):
        print('</li>')

    def start_title(self):
        print('<h1>')

    def end_title(self):
        print('</h1>')

    #The sub_ methods below are ultimately driven by re.sub(), e.g. re.sub(pattern, handler.sub('emphasis'), block).
    #re.sub() passes each match object to the function returned by handler.sub(), which forwards it to sub_emphasis; re.sub() then returns the string with all replacements made
    def sub_emphasis(self, match):
        return '<em>{}</em>'.format(match.group(1)) #equivalent to the replacement string r'<em>\1</em>'

    def sub_url(self, match):
        return '<a href="{}">{}</a>'.format(match.group(1), match.group(1))

    def sub_mail(self, match):
        return '<a href="mailto:{}">{}</a>'.format(match.group(1), match.group(1))

    def feed(self, data):
        print(data)
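
The name-based dispatch in Handler is easiest to follow with a handler that defines only one substitution. The sketch below is a cut-down copy for illustration: EmphasisOnly and the 'strong' name are hypothetical, chosen to show what happens when the requested sub_ method does not exist.

```python
import re

class Handler:
    def callback(self, prefix, name, *args):
        # Look up e.g. 'sub_' + 'emphasis'; a missing method yields None.
        method = getattr(self, prefix + name, None)
        if callable(method):
            return method(*args)

    def sub(self, name):
        def substitution(match):
            result = self.callback('sub_', name, match)
            if result is None:
                return match.group(0)  # undefined method: keep the matched text
            return result
        return substitution

class EmphasisOnly(Handler):
    """Hypothetical handler that knows a single substitution."""
    def sub_emphasis(self, match):
        return '<em>{}</em>'.format(match.group(1))

handler = EmphasisOnly()
pattern = r'\*(.+?)\*'
known = re.sub(pattern, handler.sub('emphasis'), 'a *b* c')
unknown = re.sub(pattern, handler.sub('strong'), 'a *b* c')
print(known)    # prints: a <em>b</em> c
print(unknown)  # prints: a *b* c
```

Because unknown names fall back to the original text, a handler for a different output format only has to implement the methods it cares about.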

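#rules.py

#This module defines a set of rules; each rule matches one kind of text block and calls the corresponding handler methods

class Rule:
    """
    Base class for all rules; defines the action method that suits most cases
    """
    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block)
        handler.end(self.type)
        return True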

class HeadingRule(Rule):
    """
    A heading is a single line of at most 70 characters that does not end with a colon
    """
    type = 'heading'

    def condition(self, block):
        return '\n' not in block and len(block) <= 70 and not block.endswith(':')

class TitleRule(HeadingRule):
    """
    The title is the first text block in the document, provided that it qualifies as a heading
    """
    type = 'title'
    first = True

    def condition(self, block):
        if not self.first: return False
        self.first = False
        return HeadingRule.condition(self, block)

class ListItemRule(Rule):
    """
    A list item is a paragraph that begins with a hyphen. The hyphen is removed during formatting
    """
    type = 'listitem'

    def condition(self, block):
        return block[0] == '-'

    def action(self, block, handler):
        handler.start(self.type)
        handler.feed(block[1:].strip())
        handler.end(self.type)
        return True

class ListRule(ListItemRule):
    """
    A list begins with a list item that directly follows a non-list-item block and ends with the last consecutive list item
    """
    type = 'list'
    inside = False

    def condition(self, block):
        return True

    def action(self, block, handler):
        if not self.inside and ListItemRule.condition(self, block):
            handler.start(self.type)
            self.inside = True
        elif self.inside and not ListItemRule.condition(self, block):
            handler.end(self.type)
            self.inside = False
        return False  #never stop here, so the remaining rules still process the block

class ParagraphRule(Rule):
    """
    A paragraph is any text block that is not covered by the other rules
    """
    type = 'paragraph'

    def condition(self, block):
        return True
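
The conditions are simple enough to try as plain functions. In the sketch below, is_heading and is_listitem are hypothetical stand-ins for HeadingRule.condition and ListItemRule.condition, applied to blocks taken from the test document at the end of this article.

```python
def is_heading(block):
    # one line, at most 70 characters, not ending with a colon
    return '\n' not in block and len(block) <= 70 and not block.endswith(':')

def is_listitem(block):
    # blocks are already stripped, so the hyphen is the first character
    return block.startswith('-')

samples = [
    'Destinations',
    'From this page you may visit several of our interesting web pages:',
    '- What is SPAM? (http://wwspam.fu/whatisspam)',
]
for block in samples:
    print(is_listitem(block), is_heading(block), block[:20])
```

The third block satisfies both conditions: a short list item also looks like a heading. This is why BasicTextParser registers ListItemRule before HeadingRule.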

#markup.py

#Ties the modules together

import sys, re
from handlers import *
from util import *
from rules import *

class Parser:
    """
    Parser reads a text file, applies the rules, and controls the handler
    """
    def __init__(self, handler):
        self.handler = handler
        self.rules = []
        self.filters = []

    def addRule(self, rule):
        self.rules.append(rule)

    def addFilter(self, pattern, name):
        def filter(block, handler):
            return re.sub(pattern, handler.sub(name), block)
        self.filters.append(filter)

    def parse(self, file):
        self.handler.start('document')
        for block in blocks(file):
            #apply every filter first, then try the rules in order on the filtered block
            for filter in self.filters:
                block = filter(block, self.handler)
            for rule in self.rules:
                if rule.condition(block):
                    last = rule.action(block, self.handler)
                    if last: break
        self.handler.end('document')

class BasicTextParser(Parser):
    """
    A Parser subclass that adds rules and filters in its constructor
    Note: the order in which the rules are added matters. Rules are tried in that order, and processing of a block stops at the first rule whose action() returns True
    """
    def __init__(self, handler):
        Parser.__init__(self, handler)
        self.addRule(ListRule())
        self.addRule(ListItemRule())
        self.addRule(TitleRule())
        self.addRule(HeadingRule())
        self.addRule(ParagraphRule())

        self.addFilter(r'\*(.+?)\*', 'emphasis')
        self.addFilter(r'(http://[\.a-zA-Z/]+)', 'url')
        self.addFilter(r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', 'mail')

handler = HTMLRenderer()
parser = BasicTextParser(handler)
parser.parse(sys.stdin)
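
The three filters can be tried in isolation before running the whole program. The sketch below reuses the patterns from BasicTextParser but, as a simplification, passes replacement strings with backreferences to re.sub() instead of going through a handler; FILTERS and apply_filters are names invented for this example.

```python
import re

# (pattern, replacement) pairs, in the order BasicTextParser adds its filters
FILTERS = [
    (r'\*(.+?)\*', r'<em>\1</em>'),
    (r'(http://[\.a-zA-Z/]+)', r'<a href="\1">\1</a>'),
    (r'([\.a-zA-Z]+@[\.a-zA-Z]+[a-zA-Z]+)', r'<a href="mailto:\1">\1</a>'),
]

def apply_filters(block):
    """Run every filter over the block, feeding each result to the next."""
    for pattern, replacement in FILTERS:
        block = re.sub(pattern, replacement, block)
    return block

print(apply_filters('in *many* ways'))
# prints: in <em>many</em> ways
print(apply_filters('email (wwspam@wwspam.fu)'))
# prints: email (<a href="mailto:wwspam@wwspam.fu">wwspam@wwspam.fu</a>)
```

As written, the URL pattern only recognises http:// addresses made of letters, dots and slashes, so links containing digits or using https:// are left untouched.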

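That completes the program. Because markup.py reads from sys.stdin, you can save the following text to a file, redirect it into the script, and see what comes out.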



Welcome to World Wide Spam, Inc.

These are the corporate web pages of *World Wide Spam*, Inc. We hope
you find your stay enjoyable, and that you will sample many of our
products.

A short history of the company

World Wide Spam was started in the summer of 2000. The business
concept was to ride the dot-com wave and to make money both through
bulk email and by selling canned meat online.

After receiving several complaints from customers who weren't
satisfied by their bulk email, World Wide Spam altered their profile,
and focused 100% on canned goods. Today, they rank as the world's
13,892nd online supplier of SPAM.

Destinations

From this page you may visit several of our interesting web pages:

  - What is SPAM? (http://wwspam.fu/whatisspam)

  - How do they make it? (http://wwspam.fu/howtomakeit)

  - Why should I eat it? (http://wwspam.fu/whyeatit)

How to get in touch with us

You can get in touch with us in *many* ways: By phone (555-1234), by
email (wwspam@wwspam.fu) or by visiting our customer feedback page
(http://wwspam.fu/feedback).
