Parsing HTML and XHTML




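 The HTMLParser module

     The usual way to parse an HTML file with the HTMLParser module is to create a subclass of

     HTMLParser and, in that subclass, override the parent's handle_starttag(), handle_data(),

     handle_endtag() and related methods to process the tags (<...>) you are interested in.

     Example,

         Extract the text inside the <head> tag of htmlsample.html,

             <-- htmlsample.html -->  -> file content,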

                 '

                 <html>

                 <head><title>404 Not Found</title></head>

                 <body bgcolor="white">

                 <center><h1>404 Not Found</h1></center>

                 <hr><center>nginx/1.12.2</center>

                 </body>

                 </html>

                 '

         from html.parser import HTMLParser

         class ParsingHeadT(HTMLParser):

             def __init__(self):

                 self.headtag =''

                 self.parsesemaphore = False

                 HTMLParser.__init__(self)

             def handle_starttag(self, tag, attrs): # start collecting when <head> opens

                 if tag == 'head':

                     self.parsesemaphore = True

             def handle_data(self, data):          # keep the text seen while inside <head>

                 if self.parsesemaphore:

                     self.headtag = data

             def handle_endtag(self, tag):

                 if tag == 'head':

                     self.parsesemaphore = False

             def getheadtag(self):

                 return self.headtag

         if __name__ == "__main__":

             with open('htmlsample.html') as FH:

                 pht = ParsingHeadT()

                 pht.feed(FH.read())    # HTMLParser will invoke the overridden methods

                                        # handle_starttag, handle_data and handle_endtag

                 print("Head Tag : %s" % pht.getheadtag())

         output,

            Head Tag : 404 Not Found

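     The example above uses a simple, well-formed HTML document. In real-world use, however, there are other cases to consider and handle,

     such as special characters in HTML like &copy; (the copyright sign) and &amp; (the ampersand).

         Previously, handling these required overriding the parent's handle_entityref(),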

             HTMLParser.handle_entityref(name)

                 This method is called to process a named character reference of the form

                 &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').

                 This method is never called if convert_charrefs is True.

     Numeric character references, written in decimal or hexadecimal form, are another case to watch for.

         HTMLParser.handle_charref(name)

             This method is called to process decimal and hexadecimal numeric character

             references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent

             for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method

             will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
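The two handlers described above can be observed directly by creating the parser with convert_charrefs=False. The following is a minimal sketch, not part of the original example: the class name RefCollector and its events list are hypothetical names chosen for illustration, and unknown entity names are simply echoed back as text.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefCollector(HTMLParser):
    """Record every character reference the parser reports (hypothetical helper)."""

    def __init__(self):
        # The handlers below are only called when convert_charrefs is False.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_entityref(self, name):
        # name is the text between '&' and ';', e.g. 'gt'
        codepoint = name2codepoint.get(name)
        char = chr(codepoint) if codepoint is not None else "&" + name + ";"
        self.events.append(("entityref", name, char))

    def handle_charref(self, name):
        # name is e.g. '62' (decimal) or 'x3E' (hexadecimal)
        if name.startswith(("x", "X")):
            char = chr(int(name[1:], 16))
        else:
            char = chr(int(name))
        self.events.append(("charref", name, char))


parser = RefCollector()
parser.feed("&gt;&#62;&#x3E;")
parser.close()
for event in parser.events:
    print(event)
```

All three references decode to the same character, but the parser reports them through two different callbacks, which is why code written this way has to override both.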

     Note,

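         Fortunately, Python 3 already handles the cases above for us: since Python 3.5, convert_charrefs defaults to True, so

         character references in text are converted automatically. Using the same example, let's add some special characters to the

         <head> tag of htmlsample.html and see what happens.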

             <-- htmlsample.html -->

             <html>

             <head><title>&#62 &#x3E 404 &copy Not &gt Found & </title></head>

             <body bgcolor="white">

             <center><h1>404 Not Found</h1></center>

             <hr><center>nginx/1.12.2</center>

             </body>

             </html>

         Output of the example above,

                 Head Tag : > > 404 © Not > Found &

                 As the output shows, on Python 3 the example handles the special characters correctly.
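The same behaviour can be checked without a file. This is a small sketch under a few assumptions: the class name TextCollector is hypothetical, the references are written with their terminating semicolons, and the text is wrapped in an <h1> element rather than <title> purely to keep the check simple. It compares what the parser passes to handle_data() with the result of html.unescape().

```python
from html import unescape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect all text passed to handle_data() (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs is left at its default
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


raw = "&#62; &#x3E; 404 &copy; Not &gt; Found &amp;"
parser = TextCollector()
parser.feed("<h1>" + raw + "</h1>")
parser.close()

text = "".join(parser.chunks)
print(text)
print(text == unescape(raw))
```

With the default settings, handle_entityref() and handle_charref() are never called; the decoded characters arrive as ordinary text.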

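     However, HTML also has a class of 'unbalanced' tags, such as <p> and <li>, whose end tag may be omitted. When we try to use the

     example above on such tags, we find that it cannot parse them correctly. We need to extend the example's code so

     that it handles these 'unbalanced' tags properly.

         First, extend htmlsample.html, taking the <li> tag as an example,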

         <-- htmlsample.html -->

         <html>

         <head><title>&#62 &#x3E 404 &copy Not &gt Found &</title>

         <body bgcolor="white">

         <center><h1>404 Not Found</h1></center>

         <hr><center>nginx/1.12.2</center>

         <ul>

             <li> First Reason

             <li> Second Reason

         </body>

         </html>

         A browser can still render this htmlsample.html, but its <head> and <ul> tags have no

         matching end tag, and <li> is an unbalanced tag. Now let's add some logic to the earlier example to handle this.

         Example,

             from html.parser import HTMLParser

             class Parser(HTMLParser):

                 def __init__(self):

                     self.taglevels = []     # start tags seen so far, most recent last

                     self.tags =['head','ul','li']

                     self.parsesemaphore = False

                     self.data = ''

                     HTMLParser.__init__(self)

                 def handle_starttag(self, tag, attrs): # enable semaphore

                     if len(self.taglevels) and self.taglevels[-1] == tag:

                         self.handle_endtag(tag)

                     self.taglevels.append(tag)

                     if tag in self.tags:

                         self.parsesemaphore = True

                 def handle_data(self, data):          # accumulate text while a tracked tag is open

                     if self.parsesemaphore:

                         self.data += data

                 def handle_endtag(self, tag):

                     self.parsesemaphore = False

                 def gettag(self):

                     return self.data

             if __name__ == "__main__":

                 with open('htmlsample.html') as FH:

                     pht = Parser()

                     pht.feed(FH.read())    # HTMLParser will invoke the overridden methods

                                            # handle_starttag, handle_data and handle_endtag

                     print("Head Tag : %s" % pht.gettag())

             Output,

                  Head Tag : > > 404 © Not > Found &

                  First Reason

                  Second Reason
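The example above appends every start tag to taglevels but never removes one, and it gathers all text into a single string. An alternative way to handle the omitted </li> is to keep only the currently open item and close it whenever a new <li> starts, the list ends, or the input ends. The following is a minimal sketch of that idea, not the original author's code: ListItemParser, items and _close_item are hypothetical names, and nested lists are deliberately not handled.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the text of each <li>, even when </li> is omitted (sketch)."""

    def __init__(self):
        super().__init__()
        self.items = []        # text of every finished <li>
        self._current = None   # chunks of the open <li>, or None if none is open

    def _close_item(self):
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._close_item()  # a new <li> implicitly ends the previous one
            self._current = []

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._close_item()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._close_item()      # flush an item still open at end of input


sample = "<ul>\n  <li> First Reason\n  <li> Second Reason\n</body></html>"
parser = ListItemParser()
parser.feed(sample)
parser.close()
print(parser.items)
```

Keeping one list entry per item, instead of one concatenated string, makes it easier to see where each implicit end tag was inferred.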

 Reference,

     https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

 Appendix,

     The example given in the Python documentation,

         from html.parser import HTMLParser

         from html.entities import name2codepoint

         class MyHTMLParser(HTMLParser):

             def handle_starttag(self, tag, attrs):

                 print("Start tag:", tag)

                 for attr in attrs:

                     print("     attr:", attr)

             def handle_endtag(self, tag):

                 print("End tag  :", tag)

             def handle_data(self, data):

                 print("Data     :", data)

             def handle_comment(self, data):

                 print("Comment  :", data)

             def handle_entityref(self, name):

                 c = chr(name2codepoint[name])

                 print("Named ent:", c)

             def handle_charref(self, name):

                 if name.startswith('x'):

                     c = chr(int(name[1:], 16))

                 else:

                     c = chr(int(name))

                 print("Num ent  :", c)

             def handle_decl(self, data):

                 print("Decl     :", data)

         parser = MyHTMLParser()

     Output,

         Parsing a doctype:

     # >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '

     ...             '"http://www.w3.org/TR/html4/strict.dtd">')

         Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

         Parsing an element with a few attributes and a title:

     # >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')

         Start tag: img

              attr: ('src', 'python-logo.png')

              attr: ('alt', 'The Python logo')

     # >>> parser.feed('<h1>Python</h1>')

         Start tag: h1

         Data     : Python

         End tag  : h1

         The content of script and style elements is returned as is, without further parsing:

     # >>> parser.feed('<style type="text/css">#python { color: green }</style>')

         Start tag: style

              attr: ('type', 'text/css')

         Data     : #python { color: green }

         End tag  : style

     # >>> parser.feed('<script type="text/javascript">'

     ...             'alert("<strong>hello!</strong>");</script>')

         Start tag: script

              attr: ('type', 'text/javascript')

         Data     : alert("<strong>hello!</strong>");

         End tag  : script

         Parsing comments:

     # >>> parser.feed('<!-- a comment -->'

     ...             '<!--[if IE 9]>IE-specific content<![endif]-->')

         Comment  :  a comment

         Comment  : [if IE 9]>IE-specific content<![endif]

         Parsing named and numeric character references and converting them to the correct

         char (note: these 3 references are all equivalent to '>'):

     # >>> parser.feed('&gt;&#62;&#x3E;')

         Named ent: >

         Num ent  : >

         Num ent  : >

         Feeding incomplete chunks to feed() works, but handle_data() might be called more

         than once (unless convert_charrefs is set to True):

     # >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:

     ...     parser.feed(chunk)

         Start tag: span

         Data     : buff

         Data     : ered

         Data     : text

         End tag  : span
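Because handle_data() may be called several times for one run of text, code that needs the whole text of an element can buffer the pieces and join them when the end tag arrives. A minimal sketch of that pattern follows; the class name SpanTextParser and its attributes are hypothetical and not part of the documentation example.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Join all data chunks seen inside each <span> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []       # complete text of every finished <span>
        self._chunks = None   # pieces of the open <span>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._chunks is not None:
            self.texts.append("".join(self._chunks))
            self._chunks = None


parser = SpanTextParser()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    parser.feed(chunk)
parser.close()
print(parser.texts)
```

The result does not depend on how the input was split into chunks, since the pieces are only joined once the end tag has been seen.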

         Parsing invalid HTML (e.g. unquoted attributes) also works:

     # >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')

         Start tag: p

         Start tag: a

              attr: ('class', 'link')

              attr: ('href', '#main')

         Data     : tag soup

         End tag  : p

         End tag  : a
