As the name suggests, HTMLParser is used to parse HTML files. In Python 2 there are actually two HTMLParser classes: one is defined in the htmllib module (htmllib.HTMLParser), the other is defined in the HTMLParser module (HTMLParser.HTMLParser). Let's look at them separately.

htmllib.HTMLParser

This class has been deprecated since Python 2.6, and the htmllib module was removed entirely in Python 3, but it is still worth knowing about. The parser is not directly concerned with I/O: input must be supplied as a string via a method, and output is produced by calling methods on a "formatter" object. Instantiation therefore looks like this:

>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>

This is rather annoying: all you want to do is parse an HTML file, yet you also have to deal with formatters, I/O streams, and so on.
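
To see why, here is a minimal sketch (not from the original session) of the complete workflow: the DumbWriter renders the parsed document as plain text into the StringIO buffer, so all three helper objects are needed just to get any output at all.

from cStringIO import StringIO
from formatter import DumbWriter, AbstractFormatter
from htmllib import HTMLParser

buf = StringIO()
parser = HTMLParser(AbstractFormatter(DumbWriter(buf)))

# feed() takes the document as a string; the formatter chain writes the
# plain-text rendering into buf as a side effect
parser.feed('<html><body><h1>Hello</h1><p>world</p></body></html>')
parser.close()

print buf.getvalue()   # the text rendering produced by DumbWriter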

HTMLParser.HTMLParser

In Python 3 this module is renamed to html.parser. It does the same job as htmllib.HTMLParser, but the good thing is that you no longer need to import helper modules like formatter and cStringIO. For more information, see:

https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser

Here is a brief introduction to this module.

Below is some sample code using this module. Notice that you do not need a formatter class or a string I/O class.

>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print "Encountered a start tag:", tag
...     def handle_endtag(self, tag):
...         print "Encountered an end tag :", tag
...     def handle_data(self, data):
...         print "Encountered some data :", data
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
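
For reference, the same example in Python 3 only changes the import path and the print calls; a sketch (using Python 3's html.parser, not part of the original session):

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')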

  

Another point: htmllib.HTMLParser defines the following two methods,

HTMLParser.anchor_bgn(href, name, type)
    This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.

HTMLParser.anchor_end()
    This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
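
If you need more than the href values, these methods can also be overridden in a subclass. A minimal sketch (a hypothetical subclass, not from the Python docs) that keeps the name and type attributes as well:

from cStringIO import StringIO
from formatter import DumbWriter, AbstractFormatter
from htmllib import HTMLParser

class AnchorCollector(HTMLParser):
    def __init__(self, fmt):
        HTMLParser.__init__(self, fmt)
        self.anchors = []                      # (href, name, type) tuples

    def anchor_bgn(self, href, name, type):
        # keep the default behaviour, which fills self.anchorlist ...
        HTMLParser.anchor_bgn(self, href, name, type)
        # ... and also record the other <A> attributes for ourselves
        self.anchors.append((href, name, type))

parser = AnchorCollector(AbstractFormatter(DumbWriter(StringIO())))
parser.feed('<a href="/index.html" name="home">Home</a>')
parser.close()
print parser.anchors   # e.g. [('/index.html', 'home', '')]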

With these two methods, htmllib.HTMLParser can easily retrieve the URL links from an HTML file. For example:

>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
...     parser.feed(open(file).read())
...     parser.close()
...     return parser.anchorlist
...
>>> file='/tmp/a.ttt'
>>> parseAndGetLinks()
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
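
Note that urlparse is imported above but never used. One common follow-up is to resolve relative entries such as '/' and '/duty/' against the page's base URL; a small sketch, where base_url is an assumed value (normally the URL the page was fetched from):

from urlparse import urljoin

base_url = 'http://www.baidu.com/'            # assumption for illustration
links = parseAndGetLinks()
# turn relative links into absolute ones and drop javascript: pseudo-links
absolute = [urljoin(base_url, link) for link in links
            if not link.startswith('javascript:')]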

HTMLParser.HTMLParser does not provide these two methods, but that does not matter: we can define our own.

>>> from HTMLParser import HTMLParser
>>> class myHtmlParser(HTMLParser):
...     def __init__(self):
...         HTMLParser.__init__(self)
...         self.anchorlist = []
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a' or tag == 'A':
...             for t in attrs:
...                 if t[0] == 'href' or t[0] == 'HREF':
...                     self.anchorlist.append(t[1])
...
>>> file='/tmp/a.ttt'
>>> parser=myHtmlParser()
>>> parser.feed(open(file).read())
>>> parser.anchorlist
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
>>>

Let's look at the second snippet in more detail.

Lines 3 to 5 override the __init__ method. The key point of this override is adding a new attribute, anchorlist, to the instance.

Lines 6 to 10 override the handle_starttag method. It first checks what the tag is; if it is 'a' or 'A', it loops over the tag's attributes, retrieves the href value, and appends it to anchorlist.
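
As a side note, HTMLParser already lowercases tag and attribute names before calling the handlers, so checking for 'A' and 'HREF' is not strictly necessary. A slightly tidier variant (a sketch, not from the original session) could look like this:

from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names are already lowercase here
        if tag == 'a':
            href = dict(attrs).get('href')
            if href and not href.startswith('javascript:'):
                self.anchorlist.append(href)

parser = LinkParser()
parser.feed(open('/tmp/a.ttt').read())
print parser.anchorlist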

And that's it.
