HTMLParser in python
You can know form the name that the HTMLParser is something used to parse HTML files. In python, there are two HTMLParsers. One is the HTMLParser class defined in htmllib module—— htmllib.HTMLParser, the other one is HTMLParser class defined in HTMLParser module. Let`s see them separately.
htmllib.HTMLParser
This is deprecated since python2.6. The htmllib is removed in python3. But still, there is something you could know about it. This parser is not directly concerned with I/O — it must be provided with input in string form via a method, and makes calls to methods of a “formatter” object in order to produce output. So you need to do it in below way for instantiation purpose.
>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>
It is very annoying. All you want to do is parsing a html file, but now you have to know a lot other things like format, I/O stream etc.
HTMLParser.HTMLParser
In python3 this module is renamed to html.parser. This module does the samething as htmllib.HTMLParser. The good thing is you do not to import modules like formatter and cStringIO. For more information you can go to this URL :
https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser
Here is some briefly introduction for this module.
See below for a sample code while using this module. You will notice that you do not need to use formater class or I/O string class.
>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
... def handle_starttag(self, tag, attrs):
... print "Encountered a start tag:", tag
... def handle_endtag(self, tag):
... print "Encountered an end tag :", tag
... def handle_data(self, data):
... print "Encountered some data :", data
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Another case here, in the htmllib.HTMLParser, there was two functions as below,
HTMLParser.anchor_bgn(href, name, type)
This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist. HTMLParser.anchor_end()
This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
With these two funcitons, htmllib.HTMLParser can easily retrive url links from a html file. For example:
>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
... parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
... parser.feed(open(file).read())
... parser.close()
... return parser.anchorlist
...
>>> file='/tmp/a.ttt'
>>> parseAndGetLinks()
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
But in HTMLParser.HTMLParser, we do not have these two functions. Does not matter, we can define our own.
>>> from HTMLParser import HTMLParser
>>> class myHtmlParser(HTMLParser):
... def __init__(self):
... HTMLParser.__init__(self)
... self.anchorlist=[]
... def handle_starttag(self, tag, attrs):
... if tag=='a' or tag=='A':
... for t in attrs :
... if t[0] == 'href' or t[0]=='HREF':
... self.anchorlist.append(t[1])
...
>>> file='/tmp/a.ttt'
>>> parser=myHtmlParser()
>>> parser.feed(open(file).read())
>>> parser.anchorlist
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
>>>
We look into the second code.
line 3 to line 5 overwrite the __init__ method. The key for this overwriten is that add an new attribute - anchorlist to our instance.
line 6 to line 10 overwrite the handle_starttag method. First it use if to check what the tag is. If it is 'a' or 'A', then use for loop to check its attribute. Retrieve the href attribute and put the value into the anchorlist.
Then done.
HTMLParser in python的更多相关文章
- python模块学习---HTMLParser(解析HTML文档元素)
HTMLParser是Python自带的模块,使用简单,能够很容易的实现HTML文件的分析. 本文主要简单讲一下HTMLParser的用法. 使用时需要定义一个从类HTMLParser继承的类,重定义 ...
- python网络爬虫之LXML与HTMLParser
Python lxml包用于解析html和XML文件,个人觉得比beautifulsoup要更灵活些 Lxml中的路径表达式如下: 在下面的表格中,我们已列出了一些路径表达式以及表达式的结果: 路径表 ...
- python之HTMLParser解析HTML文档
HTMLParser是Python自带的模块,使用简单,能够很容易的实现HTML文件的分析.本文主要简单讲一下HTMLParser的用法. 使用时需要定义一个从类HTMLParser继承的类,重定义函 ...
- Python HTML解析模块HTMLParser(爬虫工具)
简介 先简略介绍一下.实际上,HTMLParser是python用来解析HTML的内置模块.它可以分析出HTML里面的标签.数据等等,是一种处理HTML的简便途径.HTMLParser采用的是一种事件 ...
- python模块之HTMLParser
HTMLParser是python用来解析html的模块.它可以分析出html里面的标签.数据等等,是一种处理html的简便途径. HTMLParser采用的是一种事件驱动的模式,当HTMLParse ...
- python学习(解析python官网会议安排)
在学习python的过程中,做练习,解析https://www.python.org/events/python-events/ HTML文件,输出Python官网发布的会议时间.名称和地点. 对ht ...
- Python学习笔记5
1.关于global声明变量的错误例子 I ran across this warning: #!/usr/bin/env python2.3 VAR = 'xxx' if __name__ == ' ...
- python 爬虫部分解释
example:self.file = www.baidu.com存有baidu站的index.html def parseAndGetLinks(self): # parse HTML, save ...
- Python之HTML的解析(网页抓取一)
http://blog.csdn.net/my2010sam/article/details/14526223 --------------------- 对html的解析是网页抓取的基础,分析抓取的 ...
随机推荐
- [Swift通天遁地]五、高级扩展-(2)扩展集合类型
★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★➤微信公众号:山青咏芝(shanqingyongzhi)➤博客园地址:山青咏芝(https://www.cnblogs. ...
- curl 采集的时候遇到301怎么办
采集的时候遇到301,采集数据有错误 $ch = curl_init($url);curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);curl_setopt( ...
- hdu2030
http://acm.hdu.edu.cn/showproblem.php?pid=2030 #include<stdio.h> #include<math.h> #inclu ...
- flask 中的模板语法 jinja2及render_template的深度用法
是时候开始写个前端了,Flask中默认的模板语言是Jinja2 现在我们来一步一步的学习一下 Jinja2 捎带手把 render_template 中留下的疑问解决一下 首先我们要在后端定义几个字符 ...
- NPOI复制模板导出Excel
本人菜鸟实习生一枚,公司给我安排了一个excel导出功能.要求如下:1.导出excel文件有样式要求:2.导出excel包含一个或多个工作表:3.功能做活(我的理解就是excel样式以后可能会变方便维 ...
- JQuery:常用知识点总结
jQuery本质上就是一个外部的js文件(jQuery.js),该文件中封装了很多js代码,实现了很多功能.并且jQuery有非常丰富的插件,大多数功能都有相应的插件解决方案.jQuery的宗旨是wr ...
- CF832B Petya and Exam
思路: 模拟. 实现: #include <iostream> using namespace std; string a, b; ]; bool solve() { ) return f ...
- Android 解决ScrollView嵌套RecyclerView导致滑动不流畅的问题
最近做的项目中遇到了ScrollView嵌套RecyclerView,刚写完功能测试,直接卡出翔了,后来通过网上查找资料和 自己的实践,找出了两种方法解决这个问题. 首先来个最简单的方法: recyc ...
- Android O Bitmap 内存分配
我们知道,一般认为在Android进程的内存模型中,heap分为两部分,一部分是native heap,一部分是Dalvik heap(实际上也是native heap的一部分). Andro ...
- IP访问频率限制不能用数组循环插入多个限制条件原因分析及解决方案
14.IP频率限制不能用数组循环插入多个限制条件原因分析及解决方案: define("RATE_LIMITING_ARR", array('3' => 3, '6' => ...