http://html5lib.readthedocs.org/en/latest/

Overview

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

Usage

Simple usage follows this pattern:

import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
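To make the default result concrete, here is a minimal sketch of inspecting the returned element. The names XHTML_NS, root_tag and paragraph_text are only illustrative; the sketch assumes html5lib is installed and that the default tree builder qualifies element names with the XHTML namespace.

```python
import html5lib

# Namespace that the default tree builder attaches to HTML element names.
XHTML_NS = "http://www.w3.org/1999/xhtml"

document = html5lib.parse("<p>Hello World!")

# The parser supplies the missing <html>, <head> and <body> elements,
# so the object returned is the root <html> element.
root_tag = document.tag

# Because names are namespace-qualified, lookups must include the namespace.
paragraph = document.find(".//{%s}p" % XHTML_NS)
paragraph_text = paragraph.text

print(root_tag)
print(paragraph_text)
```

If plain tag names are preferred, html5lib.parse also accepts namespaceHTMLElements=False, in which case the lookup would be document.find(".//p").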

Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:

import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with urllib2 (Python 2), the charset from the HTTP headers should be passed into html5lib as follows:

from contextlib import closing
from urllib2 import urlopen
import html5lib

with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When using html5lib with urllib.request (Python 3), the charset from the HTTP headers should be passed into html5lib as follows:

from urllib.request import urlopen
import html5lib

with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
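The charset lookup in the Python 3 snippet can be tried without a network connection. The sketch below is standard library only and builds a header object by hand as a stand-in for what f.info() returns (an email.message.Message subclass). The helper name charset_from_headers and the header values are made up for illustration. Note also that html5lib releases from the 1.0 series onward name the keyword transport_encoding rather than encoding.

```python
from email.message import Message


def charset_from_headers(headers):
    """Return the charset parameter of Content-Type, or None if absent."""
    return headers.get_content_charset()


# Hand-built stand-ins for f.info(); the values are hypothetical.
with_charset = Message()
with_charset["Content-Type"] = "text/html; charset=ISO-8859-1"

without_charset = Message()
without_charset["Content-Type"] = "text/html"

declared = charset_from_headers(with_charset)    # lower-cased charset name
missing = charset_from_headers(without_charset)  # None: nothing declared

print(declared, missing)
```

When the server declares no charset the value is None, so no encoding is imposed and html5lib falls back to its own encoding detection.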

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
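As a rough illustration of the difference strict mode makes, the sketch below parses the same fragment with a lenient and a strict parser. It assumes that the exception class is html5lib.html5parser.ParseError and that a lenient parser collects its errors in an errors list; the variable names are illustrative only.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, which the spec treats as a parse error

# Lenient (default) parser: errors are recorded and parsing continues.
lenient = html5lib.HTMLParser()
lenient.parse(markup)
recorded_errors = len(lenient.errors)

# Strict parser: the first parse error is raised as an exception.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(markup)
    raised = False
except ParseError:
    raised = True

print(recorded_errors, raised)
```

Strict mode suits validation-style checks; for scraping real-world pages the lenient default is usually the more practical choice, since most pages contain at least one parse error.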

When you’re instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
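For comparison with the ElementTree result, a short sketch of reading the minidom document with the standard DOM API follows. It assumes the "dom" tree builder returns an xml.dom.minidom Document; the names paragraphs and first_text are illustrative.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# Standard DOM lookups work on the resulting document.
paragraphs = minidom_document.getElementsByTagName("p")

# The paragraph's only child is a text node holding the character data.
first_text = paragraphs[0].firstChild.data

print(len(paragraphs), first_text)
```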

More documentation is available at http://html5lib.readthedocs.org/.

Installation

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:

$ pip install html5lib

Optional Dependencies

The following third-party libraries may be used for additional functionality:

  • datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
  • lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
  • genshi has a treewalker (but not builder);
  • charade can be used as a fallback when character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2; and
  • ordereddict can be used under Python 2.6 (collections.OrderedDict is used instead on later versions) to serialize attributes in alphabetical order.

Bugs

Please report any bugs on the issue tracker.

Tests

Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.

Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:

$ git submodule init
$ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.

Questions?

There’s a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.
