Notes: Python library lxml

1. Introduction to lxml

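lxml is a library for parsing HTML and XML documents such as web pages. Python ships with parsers in its standard library, but they are less convenient to use than lxml.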

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.

It combines the speed of these C libraries with the ease of use of a Python API.

1.1. Installation

pip install lxml

2. Parsing web pages

2.1. Creating a parsed tree

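lxml offers the following entry points for turning input into a parsed tree:

fromstring(): parses a string and returns the root element

HTML(): parses an HTML document given as a string

XML(): parses an XML document given as a string

parse(): parses a file or file-like object and returns an ElementTree

HTML() or XML() is the usual choice;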

Example code:

from lxml import etree

etree_page = etree.HTML(page_data)

page_data is the page source as a string;

type(etree_page)  # <class 'lxml.etree._Element'>
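
As a rough illustration of the four entry points listed above, here is a minimal sketch. The sample strings html_text and xml_text are made up for this example, and io.BytesIO stands in for a real file.

```python
import io
from lxml import etree

# Hypothetical sample inputs, not taken from a real page.
html_text = "<html><body><p class='intro'>Hello</p></body></html>"
xml_text = "<root><item id='1'>first</item></root>"

# HTML() uses the lenient HTML parser and returns the root element.
html_root = etree.HTML(html_text)

# XML() and fromstring() parse an XML string and return the root element.
xml_root = etree.XML(xml_text)
same_root = etree.fromstring(xml_text)

# parse() reads from a file name or file-like object and returns an
# ElementTree; call getroot() to reach the root element.
tree = etree.parse(io.BytesIO(xml_text.encode("utf-8")))

print(html_root.tag)             # html
print(xml_root.tag)              # root
print(same_root.tag)             # root
print(tree.getroot().tag)        # root
print(type(tree).__name__)       # _ElementTree
```

Note that parse() is the only one of the four that returns a tree object rather than an element.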

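2.2. Searching

The commonly used search methods are find(), findall(), and xpath(). find() and findall() accept only ElementPath, a limited subset of XPath; xpath() accepts full XPath 1.0 expressions.

find(): returns the first matching element, or None if nothing matches. When called on an element the path must be relative (for example 'img', './img', or './/img'); absolute paths starting with '/' are not accepted.

findall(): returns a list of all matching elements, with the same relative-path restriction as find().

xpath(): evaluates a full XPath expression, relative or absolute, and returns a list when the result is a node set.

Elements in the results are of type lxml.etree._Element; xpath() can also return strings (for text() or attribute selections), numbers, or booleans, depending on the expression.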

Example code:

page_etree = etree.HTML(page_data)

response_imgs = page_etree.findall('.//img[@src]')
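
To make the difference between the three methods concrete, here is a small sketch. The page_data string is an invented fragment, used only so the example runs on its own.

```python
from lxml import etree

# Hypothetical page fragment for illustration only.
page_data = """
<html><body>
  <div id="gallery">
    <img src="a.png" alt="first">
    <img alt="no source">
    <img src="b.png">
  </div>
</body></html>
"""
page_etree = etree.HTML(page_data)

# find(): first matching element, or None when nothing matches.
first_img = page_etree.find(".//img[@src]")

# findall(): list of all matching elements.
all_imgs = page_etree.findall(".//img[@src]")

# xpath(): full XPath, so it can return attribute values and numbers too.
srcs = page_etree.xpath("//div[@id='gallery']/img/@src")
count = page_etree.xpath("count(//img)")

print(first_img.get("src"))   # a.png
print(len(all_imgs))          # 2
print(srcs)                   # ['a.png', 'b.png']
print(count)                  # 3.0
```

When only elements are needed, find() and findall() are enough; xpath() is the one to reach for when the query needs functions, attribute values, or absolute paths.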

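2.3. Reading attributes and serializing

In the code below, et is an lxml.etree._Element object returned by one of the search methods;

Get an attribute:

et.get('attribute name')

et.get('src')

Get all attributes:

et.attrib

et.attrib['src'] is equivalent to et.get('src') when the attribute exists; if it is missing, et.attrib['src'] raises KeyError while et.get('src') returns None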

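Get the text:

et.text

Tag name:

et.tag

Other:

et.getparent()  # parent element, or None for the root

list(et)  # child elements; there is no et.getchild() method, and et.getchildren() is deprecated

etree.tostring(et)  # serialize the element to bytes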

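3. XPath

lxml.etree supports the find(), findall(), and findtext() methods on ElementTree and Element objects. As an lxml-specific extension it also provides the xpath() method, which supports the full XPath 1.0 syntax.

xpath() is very powerful. Below are some common examples of selecting elements and attributes; for the full expression syntax, see the XPath documentation: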

# Get the root node

>>> root = etree.fromstring(xml)

>>> print(root)

# Select all book child elements

>>> root.xpath('book')

[<Element book at 0x2d88878>, <Element book at 0x2d888c8>]

# Select the root element bookstore

>>> root.xpath('/bookstore')

[<Element bookstore at 0x2c9cc88>]

# Select the title children of all book child elements

>>> root.xpath('book/title')

[<Element title at 0x2d88878>, <Element title at 0x2d888c8>]

# Select all title elements anywhere in the document

>>> root.xpath('//title')

[<Element title at 0x2d88878>, <Element title at 0x2d888c8>]

# Starting from the book child elements, select the price elements among their descendants

>>> root.xpath('book//price')

[<Element price at 0x2ca20a8>, <Element price at 0x2d88738>]

# Select the values of all lang attributes anywhere in the document

>>> root.xpath('//@lang')

['eng', 'eng']
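
The session above uses a variable named xml that these notes never define. The document below is an assumed reconstruction, chosen to be consistent with the outputs shown in this section (two book elements, each with a title carrying lang="eng" and a price); the original input may have differed in detail.

```python
from lxml import etree

# Assumed sample document, reconstructed from the outputs in these notes.
xml = """<bookstore>
 <book>
 <title lang="eng">Harry Potter</title>
 <price>29.99</price>
 </book>
 <book>
 <title lang="eng">Learning XML</title>
 <price>39.95</price>
 </book>
</bookstore>"""

root = etree.fromstring(xml)

print(root.tag)                      # bookstore
print(len(root.xpath("book")))       # 2
print(root.xpath("//title/text()"))  # ['Harry Potter', 'Learning XML']
print(root.xpath("//@lang"))         # ['eng', 'eng']
```

The hexadecimal addresses in the element reprs will differ from run to run, so selecting text() or attribute values is usually the easier way to check a query.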

3.1. Predicates

A predicate selects one specific node, or the nodes that satisfy some condition. The predicate expression is written inside square brackets.

# Select the first book child of bookstore (XPath positions start at 1)

>>> root.xpath('/bookstore/book[1]')

[<Element book at 0x2ca20a8>]

# Select the last book child of bookstore

>>> root.xpath('/bookstore/book[last()]')

[<Element book at 0x2d88878>]

# Select the second-to-last book child of bookstore

>>> root.xpath('/bookstore/book[last()-1]')

[<Element book at 0x2ca20a8>]

# Select the first two book children of bookstore

>>> root.xpath('/bookstore/book[position()<3]')

[<Element book at 0x2ca20a8>, <Element book at 0x2d88878>]

# Select all title elements in the document that have a lang attribute

>>> root.xpath('//title[@lang]')

[<Element title at 0x2d888c8>, <Element title at 0x2d88738>]

# Select all title elements in the document whose lang attribute equals eng

>>> root.xpath("//title[@lang='eng']")

[<Element title at 0x2d888c8>, <Element title at 0x2d88738>]

# Select the book children of bookstore whose price child is greater than 35

>>> root.xpath("/bookstore/book[price>35.00]")

[<Element book at 0x2ca20a8>]

# Select the title children of those book children of bookstore whose price child is greater than 35

>>> root.xpath("/bookstore/book[price>35.00]/title")

[<Element title at 0x2d888c8>]
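
Element reprs make it hard to see which node a predicate matched. The sketch below repeats a few of the predicates above but selects text() so the results are readable. The document is an assumed stand-in for the bookstore sample, and the variable name limit is introduced here only to show that xpath() accepts XPath variables as keyword arguments.

```python
from lxml import etree

# Assumed stand-in for the bookstore sample used in this section.
root = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

first_title = root.xpath("/bookstore/book[1]/title/text()")
last_title = root.xpath("/bookstore/book[last()]/title/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

# $limit is an XPath variable bound through a keyword argument.
cheap = root.xpath("/bookstore/book[price<$limit]/title/text()", limit=35)

print(first_title)  # ['Harry Potter']
print(last_title)   # ['Learning XML']
print(expensive)    # ['Learning XML']
print(cheap)        # ['Harry Potter']
```

Passing values as variables avoids building the expression by string concatenation, which matters when the value comes from user input.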

3.2. Wildcards

Wildcard    Description

*    Matches any element.

@*    Matches any attribute.

node()    Matches a node of any type.

# Select all child elements of bookstore

>>> root.xpath('/bookstore/*')

[<Element book at 0x2d888c8>, <Element book at 0x2ca20a8>]

# Select every element in the document, including the root element itself

>>> root.xpath('//*')

[<Element bookstore at 0x2c9cc88>, <Element book at 0x2d888c8>, <Element title at 0x2d88738>, <Element price at 0x2d88878>, <Element book at 0x2ca20a8>, <Element title at 0x2d88940>, <Element price at 0x2d88a08>]

# Select all title elements in the document that have at least one attribute

>>> root.xpath('//title[@*]')

[<Element title at 0x2d88738>, <Element title at 0x2d88940>]

# Select all child nodes of the current node. '\n ' is a text node.

>>> root.xpath('node()')

['\n ', <Element book at 0x2d888c8>, '\n ', <Element book at 0x2d88878>, '\n']

# Select all nodes in the document: elements and text nodes (attributes are not returned by node() here)

>>> root.xpath('//node()')

[<Element bookstore at 0x2c9cc88>, '\n ', <Element book at 0x2d888c8>, '\n ', <Element title at 0x2d88738>, 'Harry Potter', '\n ', <Element price at 0x2d88940>, '29.99', '\n ', '\n ', <Element book at 0x2d88878>, '\n ', <Element title at 0x2ca20a8>, 'Learning XML', '\n ', <Element price at 0x2d88a08>, '39.95', '\n ', '\n']

Selecting several paths (union)

With the "|" operator you can select the nodes that match any of several paths.

# Select the title and price elements of all book elements

>>> root.xpath('//book/title|//book/price')

[<Element title at 0x2d88738>, <Element price at 0x2d88940>, <Element title at 0x2ca20a8>, <Element price at 0x2d88a08>]

# Select all title and price elements

>>> root.xpath('//title|//price')

[<Element title at 0x2d88738>, <Element price at 0x2d88940>, <Element title at 0x2ca20a8>, <Element price at 0x2d88a08>]

# Select the title children of book, plus all price elements

>>> root.xpath('/bookstore/book/title|//price')

[<Element title at 0x2d88738>, <Element price at 0x2d88940>, <Element title at 0x2ca20a8>, <Element price at 0x2d88a08>]
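
A compact way to check the wildcard and union queries is to print tag names instead of element reprs. This is a sketch against an assumed stand-in for the bookstore sample, written without whitespace between tags so that no text nodes of pure whitespace appear.

```python
from lxml import etree

# Assumed stand-in for the bookstore sample used in this section.
root = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

# '*' matches any element.
all_tags = [el.tag for el in root.xpath("//*")]

# '@*' matches any attribute; the result is a list of attribute values.
title_attrs = root.xpath("//title/@*")

# '|' merges the node sets of both paths.
union_tags = [el.tag for el in root.xpath("//title|//price")]

print(all_tags)
print(title_attrs)   # ['eng', 'eng']
print(union_tags)
```

In the session output above, the union results come back interleaved (title, price, title, price) rather than grouped by path, which suggests they are returned in document order.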

3.3. XPath axes

An axis defines a set of nodes relative to the current node.

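Axis name    Meaning

ancestor    Selects all ancestors of the current node (parent, grandparent, and so on up to the root).

ancestor-or-self    Selects all ancestors of the current node and the current node itself.

attribute    Selects all attributes of the current node.

child    Selects all children of the current node.

descendant    Selects all descendants of the current node (children, grandchildren, and so on).

descendant-or-self    Selects all descendants of the current node and the current node itself.

following    Selects everything in the document after the closing tag of the current node.

following-sibling    Selects all siblings after the current node.

namespace    Selects all namespace nodes of the current node.

parent    Selects the parent of the current node.

preceding    Selects all nodes that come before the start tag of the current node, excluding its ancestors.

preceding-sibling    Selects all siblings before the current node.

self    Selects the current node.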

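Location path expressions

A location path can be absolute or relative. An absolute path starts with "/". Every path consists of one or more steps, separated by "/".

Absolute path: /step/step/…

Relative path: step/step/…

Each step is evaluated against the nodes in the current node set.

A step has three parts:

Axis: defines the relationship between the selected nodes and the current node.

Node test: identifies the nodes to pick within that axis.

Predicate: zero or more conditions that further filter the selected node set.

Step syntax: axis::node-test[predicate]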

# child::book selects all book children of the current node; equivalent to './book'

>>> root.xpath('child::book')

[<Element book at 0x2d888c8>, <Element book at 0x2d88878>]

>>> root.xpath('./book')

[<Element book at 0x2d888c8>, <Element book at 0x2d88878>]

# attribute::lang selects the lang attribute of the current node; equivalent to './@lang'

>>> root.xpath('//*[@lang]')[0].xpath('attribute::lang')

['eng']

>>> root.xpath('//*[@lang]')[0].xpath('@lang')

['eng']

# child::* selects all child elements of the current node; equivalent to './*'

>>> root.xpath('child::*')

[<Element book at 0x2d88878>, <Element book at 0x2d88738>]

>>> root.xpath('./*')

[<Element book at 0x2d88878>, <Element book at 0x2d88738>]

# attribute::* selects all attributes of the current node; equivalent to './@*'

>>> root.xpath('//*[@*]')[0].xpath('attribute::*')

['eng']

>>> root.xpath('//*[@*]')[0].xpath('@*')

['eng']

# child::text() selects all text children of the current node; equivalent to './text()'

>>> root.xpath('child::text()')

['\n ', '\n ', '\n']

>>> root.xpath('./text()')

['\n ', '\n ', '\n']

# child::node() selects all child nodes of the current node; equivalent to './node()'

>>> root.xpath('child::node()')

['\n ', <Element book at 0x2d88878>, '\n ', <Element book at 0x2d88738>, '\n']

>>> root.xpath('./node()')

['\n ', <Element book at 0x2d88878>, '\n ', <Element book at 0x2d88738>, '\n']

# descendant::book selects all book descendants of the current node; equivalent to './/book'

>>> root.xpath('descendant::book')

[<Element book at 0x2d88878>, <Element book at 0x2d88738>]

>>> root.xpath('.//book')

[<Element book at 0x2d88878>, <Element book at 0x2d88738>]

# ancestor::book selects all book ancestors of the current node

>>> root.xpath('.//title')[0].xpath('ancestor::book')

[<Element book at 0x2d88878>]

# ancestor-or-self::book selects all book ancestors of the current node, plus the current node itself if it is a book

>>> root.xpath('.//title')[0].xpath('ancestor-or-self::book')

[<Element book at 0x2d88878>]

>>> root.xpath('.//book')[0].xpath('ancestor-or-self::book')

[<Element book at 0x2d88878>]

>>> root.xpath('.//book')[0].xpath('ancestor::book')

[]

# child::*/child::price selects all price grandchildren of the current node; equivalent to './*/price'

>>> root.xpath('child::*/child::price')

[<Element price at 0x2d88878>, <Element price at 0x2d88738>]

>>> root.xpath('./*/price')

[<Element price at 0x2d88878>, <Element price at 0x2d88738>]
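
The session above covers child, attribute, descendant, and ancestor. As a sketch of some of the remaining axes from the table (parent, following-sibling, preceding-sibling, following), the example below again uses an assumed stand-in for the bookstore sample.

```python
from lxml import etree

# Assumed stand-in for the bookstore sample used in this section.
root = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

title = root.xpath("//title")[0]   # first title element
price = root.xpath("//price")[0]   # first price element

# parent::* is the long form of '..'
parent_tag = title.xpath("parent::*")[0].tag

# Siblings after / before the context node.
next_siblings = [el.tag for el in title.xpath("following-sibling::*")]
prev_title = price.xpath("preceding-sibling::title/text()")

# following:: reaches past the parent into the rest of the document.
later_titles = title.xpath("following::title/text()")

# ancestor::* collects every enclosing element.
ancestor_tags = sorted(el.tag for el in title.xpath("ancestor::*"))

print(parent_tag)      # book
print(next_siblings)   # ['price']
print(prev_title)      # ['Harry Potter']
print(later_titles)    # ['Learning XML']
print(ancestor_tags)   # ['book', 'bookstore']
```

Calling xpath() on an element rather than on the root makes that element the context node, which is what the axis names are relative to.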

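4. Appendix

4.1. Regular expressions, lxml, and beautifulsoup4

Three tools are commonly used to parse web pages: regular expressions, lxml, and beautifulsoup4. Each has trade-offs:

Regular expressions: the most primitive approach, plain string matching; more tedious to write.
lxml: the underlying implementation is in C, so it is fast.
beautifulsoup4: a pure-Python library that is simple to use but slower, and it can also mis-parse some pages.

In terms of speed, lxml is considerably faster than BeautifulSoup. When stepping through code in a debugger, creating the soup object shows a clearly noticeable delay, whereas lxml.etree.HTML(html) returns an object ready for XPath queries the instant you step over it.

The main reason is that bs4 is written in Python while lxml is implemented in C. Both build a complete in-memory tree of the document, but BeautifulSoup builds its tree out of Python objects, which costs noticeably more time and memory.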