Python Web Programming[0] -> Web Client[1] -> Web Page Parsing
Web Page Parsing
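1 Parsing with HTMLParser
This section introduces a basic way to parse the HTML of a web page, using the html.parser module that ships with Python. The main steps are:
 - Create a new parser class that inherits from HTMLParser;
 - Override handle_starttag and the other handler methods to implement the desired behavior;
 - Instantiate the new parser and feed the HTML text to the instance.

Complete code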
from html.parser import HTMLParser

# An HTMLParser instance is fed HTML data and calls handler methods when
# start tags, end tags, text, comments, and other markup elements are encountered.
# Subclass HTMLParser and override its methods to implement the desired behavior.
class MyHTMLParser(HTMLParser):
    # attrs is the attributes set in HTML start tag
    def handle_starttag(self, tag, attrs):
        print('Encountered a start tag:', tag)
        for attr in attrs:
            print(' attr:', attr)

    def handle_endtag(self, tag):
        print('Encountered an end tag :', tag)

    def handle_data(self, data):
        print('Encountered some data :', data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>'
            '<img src="python-logo.png" alt="The Python logo">')
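The code first imports the module and derives a new parser class, then overrides the handler methods. When a start tag is encountered, the tag name is printed, followed by each of its attributes if any are defined. End tags and text data are printed in the same way.
Note: the attrs argument of handle_starttag() is a list of tuples built from the attributes of the start tag. In each tuple, the first item is the attribute name and the second is the attribute value.
Output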
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Encountered a start tag: img
attr: ('src', 'python-logo.png')
attr: ('alt', 'The Python logo')
The output shows that the parser has walked through the HTML text and printed the attributes contained in each tag.
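Because attrs arrives as a list of (name, value) tuples, it can be handed straight to dict() when attributes need to be looked up by name. The following is only a minimal sketch: the class name ImageCollector and its images attribute are made up for this illustration and do not appear in the original code.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: collect the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # attrs is a list of (name, value) tuples, so dict() converts it directly
            self.images.append(dict(attrs))


collector = ImageCollector()
collector.feed('<img src="python-logo.png" alt="The Python logo">')
print(collector.images)
```

Each entry in collector.images is then an ordinary dictionary, so a lookup such as collector.images[0]['src'] replaces the manual loop over tuples.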
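2 Parsing with BeautifulSoup
This section introduces BeautifulSoup, a third-party package for parsing HTML pages, and compares it with HTMLParser.
BeautifulSoup must be installed first. The examples below also use the html5lib parser, which is a separate package:
pip install beautifulsoup4 html5lib
Complete code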
from html.parser import HTMLParser
from io import StringIO
from urllib import request

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders

URLs = ('http://python.org',
        'http://www.baidu.com')

def output(x):
    print('\n'.join(sorted(set(x))))

def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance
    # find_all function returns a bs4.element.ResultSet instance,
    # which contains bs4.element.Tag instances,
    # use tag['attr'] to get attribute of tag
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))

def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # Add find_all('a') function
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))

def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    output(request.urljoin(url, x) for x in parser.data)
    print('DONE')

def html5libparse(url, f):
    'html5libparse() - use html5lib to parse anchor tags'
    #output(request.urljoin(url, x.attributes['href']) for x in parse(f) if isinstance(x, treebuilders.etree.Element) and x.name == 'a')

def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)

if __name__=='__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)
Step-by-step explanation
First, import the required modules and define the URLs to fetch. StringIO (from the io module) provides an in-memory string buffer.
from html.parser import HTMLParser
from io import StringIO
from urllib import request

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders

URLs = ('http://python.org',
        'http://www.baidu.com')
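Next, define an output function. It uses a set to remove duplicate entries, sorts what remains, and joins the results with newlines.
def output(x):
    print('\n'.join(sorted(set(x))))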
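As a small illustration of what output() does to its argument, the sketch below applies the same sorted(set(...)) expression to a short list. The sample links are invented for the example.

```python
# Sample data made up for this illustration; note the duplicate entry.
links = ['http://python.org/', 'http://blog.python.org', 'http://python.org/']

# Same expression that output() joins and prints: duplicates collapse, order is sorted.
unique = sorted(set(links))
print('\n'.join(unique))
```

Since set() accepts any iterable, the generator expressions passed to output() in the parsing functions are consumed completely at this point.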
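Here we define a simple BeautifulSoup parsing function. It passes the HTML text and the features argument to the BeautifulSoup class (newer versions warn when no parser is specified; 'html5lib' is used here), which produces a BeautifulSoup instance. find_all('a') then returns a ResultSet containing every anchor tag as a bs4.element.Tag object. The href attribute is read from each Tag, and urljoin combines it with the base URL to form the full link, which is then printed.
def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance
    # find_all function returns a bs4.element.ResultSet instance,
    # which contains bs4.element.Tag instances,
    # use tag['attr'] to get attribute of tag
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))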
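The role of urljoin is easiest to see in isolation. The sketch below calls it through urllib.parse, which is where the function is documented; the request.urljoin name used in the code above refers to the same function. The sample href values are chosen for illustration.

```python
from urllib.parse import urljoin

base = 'http://python.org'
# Sample href values chosen for illustration: a path, a fragment,
# an absolute URL, and a non-HTTP scheme.
hrefs = ('/about/', '#content', 'https://docs.python.org', 'javascript:;')

joined = [urljoin(base, href) for href in hrefs]
for href, link in zip(hrefs, joined):
    print(href, '->', link)
```

Relative paths and fragments are resolved against the base URL, while values that carry their own scheme are returned unchanged. This is why an entry such as javascript:; shows up verbatim in the program output.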
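Next, define a second parsing function. It passes a SoupStrainer through the parse_only argument so that only anchor tags are parsed, which is meant to speed up parsing.
Note: there is a problem here. The html5lib tree builder does not support parse_only, so the entire document is still parsed, as the warning in the output confirms. The parse_only argument is honoured by the html.parser and lxml builders. The code here keeps 'html5lib', and the issue is left unresolved.
def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # Add find_all('a') function
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))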
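One possible way around the parse_only limitation is to switch the tree builder. The sketch below assumes the built-in 'html.parser' builder is an acceptable substitute for 'html5lib' (the two can build different trees for malformed pages), and it runs on an inline HTML string invented for the example rather than on a fetched page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer

# Inline sample page made up for this illustration.
html = ('<html><body><p>intro</p>'
        '<a href="/about/">About</a>'
        '<a name="top">no href</a>'
        '<a href="#content">Skip</a></body></html>')

# Unlike html5lib, the html.parser builder honours parse_only,
# so only the <a> tags end up in the tree.
soup = BeautifulSoup(html, features='html.parser', parse_only=SoupStrainer('a'))

# href=True skips anchors that have no href attribute, avoiding a KeyError.
links = sorted({urljoin('http://python.org', a['href'])
                for a in soup.find_all('a', href=True)})
print(links)
```

With this builder no UserWarning is raised, and elements outside the strainer, such as the paragraph above, are never added to the tree.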
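Then define a function that parses with HTMLParser, used as described in the previous section. It first defines an anchor parser class. On each start tag, the class checks whether the tag is 'a'. If it is, it checks whether the instance already has a data attribute and initializes it to an empty list if not, then iterates over attrs to collect the href value. Finally, an instance is created and fed the data.
def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    output(request.urljoin(url, x) for x in parser.data)
    print('DONE')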
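The hasattr check above means that data only exists once an anchor has been seen, so reading parser.data on a page without any a tags would raise AttributeError. A variant that initializes the list in __init__ avoids this. The sketch below is one way to write it, not the original author's version, and the sample HTML strings are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Variant of the parser above whose data list always exists."""

    def __init__(self):
        super().__init__()
        self.data = []  # present even if the page contains no <a> tags

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == 'href' and value is not None:
                self.data.append(value)


empty = AnchorParser()
empty.feed('<p>no anchors here</p>')
print(empty.data)

parser = AnchorParser()
parser.feed('<a href="/downloads/">Downloads</a><a name="top">Top</a>')
links = [urljoin('http://python.org', x) for x in parser.data]
print(links)
```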
Finally, define a process function. Each parser reads the data buffer through to the end, so seek(0) must be called after every use to move the position back to the start.
def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)
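The need for seek(0) can be checked with a few lines on a StringIO buffer. The sample string below is invented for the illustration.

```python
from io import StringIO

data = StringIO('<a href="/about/">About</a>')

first = data.read()    # reads the whole buffer, leaving the position at the end
second = data.read()   # nothing left to read, so this is an empty string
data.seek(0)           # rewind to the start of the buffer
third = data.read()    # the full content is available again

print(repr(second), third == first)
```

Without the rewind, every parser after the first would receive an empty document and report no links.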
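The main block fetches each URL, decodes the response, wraps it in a StringIO, and passes it to process(). The final result is the full set of links found in each page.
if __name__=='__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)
Output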
*** simple BeauSoupParser
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;

*** faster BeauSoupParser

Warning (from warnings module):
  File "C:\Python35\lib\site-packages\bs4\builder\_html5lib.py", line 63
    warnings.warn("You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.")
UserWarning: You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;

*** HTMLParser
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;
DONE

*** HTML5lib

*** simple BeauSoupParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;

*** faster BeauSoupParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;

*** HTMLParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;
DONE

*** HTML5lib
References
Core Python Applications Programming, 3rd Edition (《Python 核心编程 第3版》)