HTMLParser in Python
As the name suggests, HTMLParser is used to parse HTML files. Python 2 has two classes called HTMLParser: one is defined in the htmllib module (htmllib.HTMLParser), and the other is defined in the HTMLParser module (HTMLParser.HTMLParser). Let's look at them separately.
htmllib.HTMLParser
htmllib has been deprecated since Python 2.6 and was removed in Python 3, but it is still worth knowing a little about it. This parser is not directly concerned with I/O: it must be provided with input in string form via a method, and it makes calls to methods of a "formatter" object in order to produce output. As a result, you have to instantiate it like this:
>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>
This is annoying. All you want to do is parse an HTML file, but first you have to learn about a lot of other things, such as formatters and I/O streams.
HTMLParser.HTMLParser
In Python 3 this module is renamed to html.parser. It serves the same purpose as htmllib.HTMLParser, but you do not need to import modules such as formatter and cStringIO. For more information, see:
https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser
Here is a brief introduction to this module.
The sample code below uses it. Note that it needs neither a formatter class nor a string I/O class.
>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print "Encountered a start tag:", tag
...     def handle_endtag(self, tag):
...         print "Encountered an end tag :", tag
...     def handle_data(self, data):
...         print "Encountered some data :", data
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
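Since the module is renamed to html.parser in Python 3, the same idea can be sketched for Python 3 roughly as follows. The class name EventLogger and its events list are made up for this illustration; instead of printing inside each handler, the sketch records every callback so the order of events is easy to inspect afterwards.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback instead of printing it."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, value) tuples in the order they were seen

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<html><head><title>Test</title></head>"
            "<body><h1>Parse me!</h1></body></html>")
logger.close()

# Print the events in the same order as the Python 2 session above.
for kind, value in logger.events:
    print(kind, value)
```

As in the Python 2 session, no formatter or string I/O object is needed; the parser is simply subclassed and fed a string.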
Another point: htmllib.HTMLParser had the following two methods:
HTMLParser.anchor_bgn(href, name, type)
This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.
HTMLParser.anchor_end()
This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
With these two methods, htmllib.HTMLParser can easily retrieve the URL links from an HTML file. For example:
>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
...     parser.feed(open(file).read())
...     parser.close()
...     return parser.anchorlist
...
>>> file='/tmp/a.ttt'
>>> parseAndGetLinks()
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
HTMLParser.HTMLParser does not have these two methods, but that does not matter: we can define our own.
>>> from HTMLParser import HTMLParser
>>> class myHtmlParser(HTMLParser):
...     def __init__(self):
...         HTMLParser.__init__(self)
...         self.anchorlist=[]
...     def handle_starttag(self, tag, attrs):
...         if tag=='a' or tag=='A':
...             for t in attrs :
...                 if t[0] == 'href' or t[0]=='HREF':
...                     self.anchorlist.append(t[1])
...
>>> file='/tmp/a.ttt'
>>> parser=myHtmlParser()
>>> parser.feed(open(file).read())
>>> parser.anchorlist
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
>>>
Let's look at the second piece of code.
Lines 3 to 5 override the __init__ method. The point of this override is to add a new attribute, anchorlist, to our instance.
Lines 6 to 10 override the handle_starttag method. It first uses if to check the tag. If the tag is 'a' or 'A', a for loop goes through its attributes, retrieves the value of the href attribute and appends it to anchorlist. (HTMLParser converts tag and attribute names to lower case before calling handle_starttag, so the checks for 'A' and 'HREF' are redundant but harmless.)
That is all.
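For readers on Python 3, where the module is called html.parser, the same link collector might look like the sketch below. The class name LinkCollector and the sample HTML string are assumptions made for this illustration and do not come from the sessions above; the sketch only checks the lower-case names 'a' and 'href', and it skips an href attribute that has no value.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class LinkCollector(HTMLParser):
    """Hypothetical Python 3 port of the myHtmlParser class shown above."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []  # same attribute name as htmllib used

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lower-cased.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value is not None:
                self.anchorlist.append(value)


# Made-up sample input; the original post read the HTML from a file instead.
sample = ('<p><A HREF="http://example.com/">one</A> '
          '<a href="/duty/">two</a> <a name="top">no link</a></p>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.anchorlist)
```

The upper-case <A HREF=...> in the sample is still collected, which is why the extra 'A' and 'HREF' comparisons are not needed.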