HTMLParser in python

You can know form the name that the HTMLParser is something used to parse HTML files. In python, there are two HTMLParsers. One is the HTMLParser class defined in htmllib module—— htmllib.HTMLParser, the other one is HTMLParser class defined in HTMLParser module. Let`s see them separately.

htmllib.HTMLParser

This is deprecated since python2.6. The htmllib is removed in python3. But still, there is something you could know about it. This parser is not directly concerned with I/O — it must be provided with input in string form via a method, and makes calls to methods of a “formatter” object in order to produce output. So you need to do it in below way for instantiation purpose.

>>> from cStringIO import StringIO

>>> from formatter import DumbWriter, AbstractFormatter

>>> from htmllib import HTMLParser

>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))

>>>

It is very annoying. All you want to do is parsing a html file, but now you have to know a lot other things like format, I/O stream etc.

HTMLParser.HTMLParser

In python3 this module is renamed to html.parser. This module does the samething as htmllib.HTMLParser. The good thing is you do not to import modules like formatter and cStringIO. For more information you can go to this URL :

https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser

Here is some briefly introduction for this module.

See below for a sample code while using this module. You will notice that you do not need to use formater class or I/O string class.

>>> from HTMLParser import HTMLParser

>>> class MyHTMLParser(HTMLParser):

...     def handle_starttag(self, tag, attrs):

...             print "Encountered a start tag:", tag

...     def handle_endtag(self, tag):

...             print "Encountered an end tag :", tag

...     def handle_data(self, data):

...              print "Encountered some data  :", data

...

>>> parser = MyHTMLParser()

>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')

Encountered a start tag: html

Encountered a start tag: head

Encountered a start tag: title

Encountered some data  : Test

Encountered an end tag : title

Encountered an end tag : head

Encountered a start tag: body

Encountered a start tag: h1

Encountered some data  : Parse me!

Encountered an end tag : h1

Encountered an end tag : body

Encountered an end tag : html

Another case here, in the htmllib.HTMLParser, there was two functions as below,

HTMLParser.anchor_bgn(href, name, type)

This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.

HTMLParser.anchor_end()

This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().

With these two funcitons, htmllib.HTMLParser can easily retrive url links from a html file. For example:

>>> from urlparse import urlparse

>>> from formatter import DumbWriter, AbstractFormatter

>>> from cStringIO import StringIO

>>> from htmllib import HTMLParser

>>>

>>> def parseAndGetLinks():

...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))

...     parser.feed(open(file).read())

...     parser.close()

...     return parser.anchorlist

...

>>> file='/tmp/a.ttt'

>>> parseAndGetLinks()

['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

But in HTMLParser.HTMLParser, we do not have these two functions. Does not matter, we can define our own.

 >>> from HTMLParser import HTMLParser

 >>> class myHtmlParser(HTMLParser):

 ...     def __init__(self):

 ...             HTMLParser.__init__(self)

 ...             self.anchorlist=[]

 ...     def handle_starttag(self, tag, attrs):

 ...                     if tag=='a' or tag=='A':

 ...                             for t in attrs :

 ...                                     if t[0] == 'href' or t[0]=='HREF':

 ...                                             self.anchorlist.append(t[1])

 ...

 >>> file='/tmp/a.ttt'

 >>> parser=myHtmlParser()

 >>> parser.feed(open(file).read())

 >>> parser.anchorlist

 ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

 >>>

We look into the second code.

line 3 to line 5 overwrite the __init__ method. The key for this overwriten is that add an new attribute - anchorlist to our instance.

line 6 to line 10 overwrite the handle_starttag method. First it use if to check what the tag is. If it is 'a' or 'A', then use for loop to check its attribute. Retrieve the href attribute and put the value into the anchorlist.

Then done.

HTMLParser in python的更多相关文章

python模块学习---HTMLParser(解析HTML文档元素)
HTMLParser是Python自带的模块,使用简单,能够很容易的实现HTML文件的分析. 本文主要简单讲一下HTMLParser的用法. 使用时需要定义一个从类HTMLParser继承的类,重定义 ...
python网络爬虫之LXML与HTMLParser
Python lxml包用于解析html和XML文件,个人觉得比beautifulsoup要更灵活些 Lxml中的路径表达式如下: 在下面的表格中,我们已列出了一些路径表达式以及表达式的结果: 路径表 ...
python之HTMLParser解析HTML文档
HTMLParser是Python自带的模块,使用简单,能够很容易的实现HTML文件的分析.本文主要简单讲一下HTMLParser的用法. 使用时需要定义一个从类HTMLParser继承的类,重定义函 ...
Python HTML解析模块HTMLParser(爬虫工具)
简介先简略介绍一下.实际上,HTMLParser是python用来解析HTML的内置模块.它可以分析出HTML里面的标签.数据等等,是一种处理HTML的简便途径.HTMLParser采用的是一种事件 ...
python模块之HTMLParser
HTMLParser是python用来解析html的模块.它可以分析出html里面的标签.数据等等,是一种处理html的简便途径. HTMLParser采用的是一种事件驱动的模式,当HTMLParse ...
python学习（解析python官网会议安排）
在学习python的过程中,做练习,解析https://www.python.org/events/python-events/ HTML文件,输出Python官网发布的会议时间.名称和地点. 对ht ...
Python学习笔记5
1.关于global声明变量的错误例子 I ran across this warning: #!/usr/bin/env python2.3 VAR = 'xxx' if __name__ == ' ...
python 爬虫部分解释
example:self.file = www.baidu.com存有baidu站的index.html def parseAndGetLinks(self): # parse HTML, save ...
Python之HTML的解析（网页抓取一）
http://blog.csdn.net/my2010sam/article/details/14526223 --------------------- 对html的解析是网页抓取的基础,分析抓取的 ...

随机推荐

setjmp和longjmp函数
关于setjmp函数和longjmp函数有话要说,是UNIX高级环境变成看到了10.10信号那章用到了,研究一下,这里作为补充. setjmp(jmp_buf env_buf) 函数可以将当前的运行环 ...
aop 切面demo
/** * 必须要@Aspect 和 @Component一起使用否则没法拦截通知 * 搞了好久才明白刚刚开始以为时execution里面的配置的问题 * AOP使用很简单的 */@Aspect@Co ...
Linux(centOS7.2)+node+express初体验
赶着阿里云服务器老用户服务器半折的好时机,手痒买了一个低配. 想着对于低配用Linux应该比较好(无可视化界面) 于是选择安装了centOs7.2: 我是通过SecureCRT进行远程连接的(如何操作 ...
LinearLayout中间布局填充出现的问题
线性布局如何中间填充,会挤掉他下面的布局,所以中间填充使用layout_weight属性.
C#的一些知識點
不能將屬性以ref或out的方式傳遞看上去屬性和字段差不多,可是屬性本質上是個方法,并不是真正指向一個内存位置,所以不能像字段那樣能以ref或out方式傳遞. Lookup運行一個鍵對應多個值,但無 ...
jsTree使用记录
1. ajax请求生成jsTree <span style="font-size:14px;"><script> var r = []; // 权限树中被选 ...
Python 之mysql类封装
import pymysql class MysqlHelper(object): conn = None def __init__(self, host, username, password, d ...
* 获取页面参数 * @return 参数打印
/** * 获取页面参数 * @return 参数打印 */ GetUrlParam: function(paraName) { var url = document.location.toStrin ...
MIUI 的参与感
最近这段时间在看小米联合创始人黎万强写的<参与感>这本书,看完我还挺有感触的.小米相信大家都一定有所耳闻. 2010 年 4 月 6 日小米公司正式创立. 8 月 ...
Yin and Yang Stones（思路题）
Problem Description: A mysterious circular arrangement of black stones and white stones has appeared ...

HTMLParser in python

HTMLParser in python的更多相关文章

随机推荐

热门专题