HTMLParser in python
You can know form the name that the HTMLParser is something used to parse HTML files. In python, there are two HTMLParsers. One is the HTMLParser class defined in htmllib module—— htmllib.HTMLParser, the other one is HTMLParser class defined in HTMLParser module. Let`s see them separately.
htmllib.HTMLParser
This is deprecated since python2.6. The htmllib is removed in python3. But still, there is something you could know about it. This parser is not directly concerned with I/O — it must be provided with input in string form via a method, and makes calls to methods of a “formatter” object in order to produce output. So you need to do it in below way for instantiation purpose.
>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>
It is very annoying. All you want to do is parsing a html file, but now you have to know a lot other things like format, I/O stream etc.
HTMLParser.HTMLParser
In python3 this module is renamed to html.parser. This module does the samething as htmllib.HTMLParser. The good thing is you do not to import modules like formatter and cStringIO. For more information you can go to this URL :
https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser
Here is some briefly introduction for this module.
See below for a sample code while using this module. You will notice that you do not need to use formater class or I/O string class.
>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
... def handle_starttag(self, tag, attrs):
... print "Encountered a start tag:", tag
... def handle_endtag(self, tag):
... print "Encountered an end tag :", tag
... def handle_data(self, data):
... print "Encountered some data :", data
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Another case here, in the htmllib.HTMLParser, there was two functions as below,
HTMLParser.anchor_bgn(href, name, type)
This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist. HTMLParser.anchor_end()
This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
With these two funcitons, htmllib.HTMLParser can easily retrive url links from a html file. For example:
>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
... parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
... parser.feed(open(file).read())
... parser.close()
... return parser.anchorlist
...
>>> file='/tmp/a.ttt'
>>> parseAndGetLinks()
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
But in HTMLParser.HTMLParser, we do not have these two functions. Does not matter, we can define our own.
>>> from HTMLParser import HTMLParser
>>> class myHtmlParser(HTMLParser):
... def __init__(self):
... HTMLParser.__init__(self)
... self.anchorlist=[]
... def handle_starttag(self, tag, attrs):
... if tag=='a' or tag=='A':
... for t in attrs :
... if t[0] == 'href' or t[0]=='HREF':
... self.anchorlist.append(t[1])
...
>>> file='/tmp/a.ttt'
>>> parser=myHtmlParser()
>>> parser.feed(open(file).read())
>>> parser.anchorlist
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg®Type=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
>>>
We look into the second code.
line 3 to line 5 overwrite the __init__ method. The key for this overwriten is that add an new attribute - anchorlist to our instance.
line 6 to line 10 overwrite the handle_starttag method. First it use if to check what the tag is. If it is 'a' or 'A', then use for loop to check its attribute. Retrieve the href attribute and put the value into the anchorlist.
Then done.
HTMLParser in python的更多相关文章
- python模块学习---HTMLParser(解析HTML文档元素)
HTMLParser是Python自带的模块,使用简单,能够很容易的实现HTML文件的分析. 本文主要简单讲一下HTMLParser的用法. 使用时需要定义一个从类HTMLParser继承的类,重定义 ...
- python网络爬虫之LXML与HTMLParser
Python lxml包用于解析html和XML文件,个人觉得比beautifulsoup要更灵活些 Lxml中的路径表达式如下: 在下面的表格中,我们已列出了一些路径表达式以及表达式的结果: 路径表 ...
- python之HTMLParser解析HTML文档
HTMLParser是Python自带的模块,使用简单,能够很容易的实现HTML文件的分析.本文主要简单讲一下HTMLParser的用法. 使用时需要定义一个从类HTMLParser继承的类,重定义函 ...
- Python HTML解析模块HTMLParser(爬虫工具)
简介 先简略介绍一下.实际上,HTMLParser是python用来解析HTML的内置模块.它可以分析出HTML里面的标签.数据等等,是一种处理HTML的简便途径.HTMLParser采用的是一种事件 ...
- python模块之HTMLParser
HTMLParser是python用来解析html的模块.它可以分析出html里面的标签.数据等等,是一种处理html的简便途径. HTMLParser采用的是一种事件驱动的模式,当HTMLParse ...
- python学习(解析python官网会议安排)
在学习python的过程中,做练习,解析https://www.python.org/events/python-events/ HTML文件,输出Python官网发布的会议时间.名称和地点. 对ht ...
- Python学习笔记5
1.关于global声明变量的错误例子 I ran across this warning: #!/usr/bin/env python2.3 VAR = 'xxx' if __name__ == ' ...
- python 爬虫部分解释
example:self.file = www.baidu.com存有baidu站的index.html def parseAndGetLinks(self): # parse HTML, save ...
- Python之HTML的解析(网页抓取一)
http://blog.csdn.net/my2010sam/article/details/14526223 --------------------- 对html的解析是网页抓取的基础,分析抓取的 ...
随机推荐
- Tempter of the Bone------剪枝
看了好多别人的 代码,最终还是 感觉 这种代码的风格适合我 下面附上代码 /* 首先 应该充满信心! 先写出来 自己的程序 然后慢慢改 , 如果是 答题思路错误的话 借鉴别人的 代码 再写 */ ...
- python常见的加密方式
1.前言 我们所说的加密方式都是对二进制编码的格式进行加密,对应到python中,则是我妈们的bytes. 所以当我们在Python中进行加密操作的时候,要确保我们的操作是bytes,否则就会报错. ...
- Centos7下安装python环境
前言 centos7默认是装有pyhton的. #检查python版本 [root@oldboy_python ~ ::]#python -V Python 但是众所周知,python2版本到2020 ...
- JSP执行原理图
- conda python虚拟环境
#查看已安装的python包 conda list #查看当前有哪些虚拟环境 conda env list 或者 conda info -e #更新conda conda update conda # ...
- Java多线程——线程八锁案例分析
Java多线程——线程八锁案例分析 摘要:本文主要学习了多线程并发中的一些案例. 部分内容来自以下博客: https://blog.csdn.net/dyt443733328/article/deta ...
- [ NOIP 2008 ] TG
\(\\\) \(\#A\) \(Word\) 给出一个长为\(N\)的小写字母串,判断出现所有字母中最多出现次数减最少出现次数得到的答案是否是质数. \(N\in [1,100]\) 直接按题意开桶 ...
- Android中ViewPager动态创建的ImageView铺满屏幕
ImageView imageView=new ImageView(context); imageView.setScaleType(ScaleType.FIT_XY);//铺满屏幕
- 【译】x86程序员手册12-4.2系统指令
4.2 Systems Instructions 系统指令 Systems instructions deal with such functions as: 系统指令具有以下功能: Verifica ...
- Html test
<!DOCTYPE html> <html lang="en" xmlns="http://www.w3.org/1999/xhtml"> ...