Using the Python module BeautifulSoup

BeautifulSoup is a library built specifically for parsing HTML/XML. Official site: http://www.crummy.com/software/BeautifulSoup/

Note that BeautifulSoup now has a 4.x release. The official site says:

Beautiful Soup 3 has been replaced by Beautiful Soup 4. You may be looking for the Beautiful Soup 4 documentation

Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.

On my machine, printing BeautifulSoup.__version__ (after import BeautifulSoup) shows the version number:

3.2.1

Beautiful Soup 4 works on both Python 2 (2.6+) and Python 3.

Installation is simple: BeautifulSoup 3 is a single file, so copying that file into your working directory is enough.

from BeautifulSoup import BeautifulSoup          # For processing HTML

from BeautifulSoup import BeautifulStoneSoup     # For processing XML

import BeautifulSoup # To get everything

Creating a BeautifulSoup object

A BeautifulSoup object can be created from nothing more than a piece of HTML text.

The code below creates one:

from BeautifulSoup import BeautifulSoup

doc = ['<html><head><title>PythonClub.org</title></head>',

       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',

       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',

       '</html>']

soup = BeautifulSoup(''.join(doc))

Calling

print soup.prettify()

produces:

# <html>

#  <head>

#   <title>

#    PythonClub.org

#   </title>

#  </head>

#  <body>

#   <p id="firstpara" align="center">

#    This is paragraph

#    <b>

#     one

#    </b>

#    of ptyhonclub.org.

#   </p>

#   <p id="secondpara" align="blah">

#    This is paragraph

#    <b>

#     two

#    </b>

#    of pythonclub.org.

#   </p>

#  </body>

# </html>
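
For readers on Beautiful Soup 4, the same construction might be sketched as follows. This assumes Python 3 with the bs4 package installed, and it picks the built-in html.parser; neither choice comes from the original post, which uses Beautiful Soup 3.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']

# Naming the parser explicitly keeps the result the same across machines.
soup = BeautifulSoup(''.join(doc), 'html.parser')

title_text = soup.title.string
paragraph_ids = [p['id'] for p in soup.find_all('p')]

print(title_text)
print(paragraph_ids)
```

The rest of the Beautiful Soup 3 calls in this section map onto Beautiful Soup 4 mostly by renaming, for example findAll becomes find_all.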

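Finding specific elements in HTML

BeautifulSoup lets you reach a given HTML element directly with ".".

Searching by HTML tag: finding the HTML title

soup.html.head.title returns the title tag, from which you can read its name and its string value.

>>> soup.html.head.title  # note: the result includes the title tag itself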

<title>PythonClub.org</title>

>>> soup.html.head.title.name

u'title'

>>> soup.html.head.title.string

u'PythonClub.org'

>>>

You can also go straight to the element with soup.title:

>>> soup.title

<title>PythonClub.org</title>

>>>

Searching by HTML content: finding the whole tag that contains a given string

The example below finds the HTML tags whose text contains "para":

>>> import re

>>> soup.findAll(text=re.compile("para"))

[u'This is paragraph ', u'This is paragraph ']

>>> soup.findAll(text=re.compile("para"))[0].parent

<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>

>>> soup.findAll(text=re.compile("para"))[0].parent.contents

[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
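
A rough Beautiful Soup 4 counterpart of the text search above is sketched below. It assumes Python 3, the bs4 package, and the string= keyword (the newer name for text=); the HTML is a well-formed variant of the sample document.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

html = ('<html><head><title>PythonClub.org</title></head><body>'
        '<p id="firstpara" align="center">This is paragraph <b>one</b> of pythonclub.org.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Each match is a string node, not a tag.
matches = soup.find_all(string=re.compile('para'))

# .parent climbs from the string to the tag that contains it.
parent_ids = [m.parent['id'] for m in matches]

print([str(m) for m in matches])
print(parent_ids)
```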

The basic method: findAll

findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments.

The simplest usage is to just pass in a tag name. This code finds all the <B> Tags in the document:
```
soup.findAll('b')

#[<b>one</b>, <b>two</b>]
```

You can also pass in a regular expression. This code finds all the tags whose names start with B:

import re

tagsStartingWithB = soup.findAll(re.compile('^b'))

[tag.name for tag in tagsStartingWithB]

#[u'body', u'b', u'b']

You can pass in a list or a dictionary. These two calls find all the <TITLE> and all the <P> tags. They work the same way, but the second call runs faster:

soup.findAll(['title', 'p'])

#[<title>PythonClub.org</title>,

# <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,

# <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]

soup.findAll({'title' : True, 'p' : True})

#[<title>PythonClub.org</title>,

# <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,

# <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]

The keyword arguments impose restrictions on the attributes of a tag. This simple example finds all the tags which have a value of "center" for their "align" attribute:

soup.findAll(align="center")

#[<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>]

Searching by CSS class

The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.

You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in

                        <b class="hickory">Hickory</b> and <b class="lime">Lime</b>""")

soup.find("b", { "class" : "lime" })

#<b class="lime">Lime</b>

soup.find("b", "hickory")

#<b class="hickory">Hickory</b>

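Searching HTML content by attribute

With soup again built from the doc list at the start of this article (not the Barbeque Sauce example above):

soup.findAll(id=re.compile("para$"))

# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,

#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]

soup.findAll(attrs={'id' : re.compile("para$")})

# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,

#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]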


Reposted from: http://www.pythonclub.org/modules/beautifulsoup/start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Another article

------------------------------------

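Soup ingredients: the objects in a soup

Tag

A tag corresponds to an HTML element, that is, a pair of HTML tags together with everything they enclose (nested tags and text). For example: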

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')

tag = soup.b

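soup.b is a tag, and soup itself can be treated as a tag too; the whole HTML document is just tags nested inside tags.

Name

The name corresponds to the name in the HTML tag (the first item inside the angle brackets). Every tag has a name, accessed with .name. In the example above,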

tag.name == u'b'

soup.name == u'[document]'

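Attributes

Attributes correspond to the attribute part of an HTML tag (the items with an equals sign inside the angle brackets). A tag may have many attributes or none. Attributes are accessed like dictionary entries, with square brackets around the attribute name. In the example above,

tag['class'] == [u'boldest']  # Beautiful Soup 4 treats class as multi-valued, so the value is a list

The dictionary itself is available as .attrs, for example,

tag.attrs == {u'class': [u'boldest']}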

Text

Text corresponds to the text in the HTML (the part outside the angle brackets). It is accessed with .text. In the example above,

tag.text ==  u'Extremely bold'

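The difference between .string and .text: .string is None when a tag has more than one child, while .text joins every string found inside the tag.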
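
As a small illustration of how .string and .text differ, here is a sketch assuming Python 3 with the bs4 package and the built-in html.parser. The markup is made up for the example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

soup = BeautifulSoup('<p>This is paragraph <b>one</b>.</p>', 'html.parser')
p = soup.p

single = soup.b.string  # <b> holds exactly one string, so .string returns it
mixed = p.string        # <p> has three children (text, <b>, text), so .string is None
joined = p.text         # .text concatenates every string inside <p>

print(repr(single), repr(mixed), repr(joined))
```

In practice this means .text is the safer choice when a tag may contain inline markup, and .string is handy as a check that a tag holds nothing but one piece of text.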

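Finding ingredients: searching in a soup

The usual reason to parse HTML is to find the parts you care about and extract them. BeautifulSoup provides the find and find_all methods for searching: find returns only the first matching tag, while find_all returns a list. Because searching is so common, BeautifulSoup offers some convenient shorthand:

tag.find_all("a")  # same as tag("a"); find_all is the Beautiful Soup 4 name

tag.find("a") # same as tag.a

When nothing matches, find_all returns an empty list and find returns None; neither raises an exception, so tag("a") and tag.a will not fail on a miss. Because of Python's rules for identifiers, the tag.a form can only search by name, since only an identifier may follow the dot. The call forms tag() and tag.find() support all of the search styles below.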

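Searches can be expressed in several ways: strings, lists, key-value pairs (dictionaries), regular expressions, and functions.

String: a string matches the tag name, for example tag.a or tag("a")
List: a list of strings matches any tag whose name equals one of them, for example tag(["h2", "p"])
Key-value: tag(key=value) searches by tag attribute. There are a few tricks here:
1. class
  class is a Python reserved word and cannot be used as a keyword argument name, yet class=XXX is everywhere in HTML. BeautifulSoup's answer is a trailing underscore: use class_ instead, as in tag(class_=XXX).
2. True
  A value of True matches every tag that has the attribute at all, as in tag(href=True)
3. text
  The text key searches by the text inside tags, as in tag(text=something)
Regular expression: for example tag(href=re.compile("elsie"))
Function: when none of the above is enough, a function is the last resort. Write a function that takes a single tag as its argument and pass it to find or find_all. For example: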
```
def fun(tag):

    return tag.has_attr("class") and not tag.has_attr("id")

tag(fun) # returns every tag that has a class attribute but no id attribute
```
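
The search styles listed above can be tried side by side. The following is only a sketch: it assumes Python 3 with the bs4 package and html.parser, and the HTML snippet and the function name has_class_but_no_id are invented for illustration.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

html = ('<h2 class="head">Links</h2>'
        '<p class="story">Read <a href="http://example.com/elsie" id="link1">Elsie</a>'
        ' or <a name="anchor">nothing</a>.</p>')
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup (or any tag) is shorthand for find_all.
by_list = [t.name for t in soup(['h2', 'p'])]            # list of tag names
by_class = [t.name for t in soup(class_='story')]        # class_ keyword
with_href = [t.get_text() for t in soup(href=True)]      # attribute merely present
by_regex = [t['href'] for t in soup(href=re.compile('elsie'))]

def has_class_but_no_id(tag):
    """Hypothetical filter: keep tags that carry class but not id."""
    return tag.has_attr('class') and not tag.has_attr('id')

by_function = [t.name for t in soup(has_class_but_no_id)]

print(by_list, by_class, with_href, by_regex, by_function)
```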

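Another bowl: searching by document structure

HTML parses into a tree of tags, so you can also search by how tags relate to each other in that tree.

Parent nodes: find_parents() and find_parent()
Following siblings: find_next_siblings() and find_next_sibling()
Preceding siblings: find_previous_siblings() and find_previous_sibling()

The four sibling methods above only look at siblings under the same parent.

Descendant nodes: this is what find and find_all, described above, already do
Following nodes (regardless of parent, child, or sibling relationships): find_all_next() and find_next()
Preceding nodes (regardless of parent, child, or sibling relationships): find_all_previous() and find_previous()

All of these take the same arguments as find, so they can be mixed and matched.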
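
A short sketch of these structural searches, assuming Python 3 with the bs4 package and html.parser; the three-paragraph snippet is made up for the example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

html = ('<div id="box">'
        '<p id="first">one</p><p id="second">two</p><p id="third">three</p>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')
second = soup.find('p', id='second')

parent_id = second.find_parent('div')['id']          # walk up the tree
next_id = second.find_next_sibling('p')['id']        # same parent, later
prev_id = second.find_previous_sibling('p')['id']    # same parent, earlier

# find_all_next ignores nesting and scans everything after the start tag.
following = [t['id'] for t in soup.find(id='first').find_all_next('p')]

print(parent_id, next_id, prev_id, following)
```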

Choosing soup by color: searching by CSS

Use the .select() method; see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
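
A minimal sketch of .select(), assuming Python 3 with the bs4 package (recent releases rely on the soupsieve package for selectors) and html.parser. The markup and selectors are invented for the example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

html = ('<div id="menu">'
        '<a class="item" href="/a">A</a>'
        '<a class="item hot" href="/b">B</a>'
        '</div>'
        '<a href="/c">C</a>')
soup = BeautifulSoup(html, 'html.parser')

# select returns a list of every tag matching the CSS selector.
items = [a['href'] for a in soup.select('div#menu a.item')]

# select_one returns only the first match (or None).
hot = soup.select_one('a.hot').get_text()

print(items, hot)
```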

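A few tricks

BeautifulSoup supports several parsers, such as lxml, html5lib, and html.parser. For example: BeautifulSoup("<a>", "html.parser")

For how their results differ, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

BeautifulSoup converts the markup to Unicode before parsing; use from_encoding to name the source encoding, for example: BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.prettify() returns nicely laid-out HTML text; for Chinese text you can pass an encoding so it displays correctly, for example soup.prettify("gbk")
If encoding problems remain, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit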
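
The encoding tips above might be exercised like this. It is a sketch under assumptions: Python 3, the bs4 package, html.parser, and a two-character GBK byte string built just for the demonstration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

# "\u4e2d\u6587" is the word for "Chinese"; encode it as GBK bytes to mimic a GBK page.
raw = '<p>\u4e2d\u6587</p>'.encode('gbk')

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
text = soup.p.string

# Passing an encoding to prettify yields bytes in that encoding instead of str.
encoded = soup.prettify('gbk')

print(text)
print(type(encoded))
```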

Reposted from: http://cndenis.iteye.com/blog/1746706

Two important attributes of a soup:

`.contents` and `.children`

A tag’s children are available in a list called .contents:

head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

[<title>The Dormouse's story</title>]

type(head_tag.contents[0])
# <class 'BeautifulSoup.Tag'>, which shows that the items in .contents are Beautiful Soup's own types rather than plain strings

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# [u"The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)

# 1

soup.contents[0].name

# u'html'

A string does not have .contents, because it can’t contain anything:

text = title_tag.contents[0]

text.contents

# AttributeError: 'NavigableString' object has no attribute 'contents'
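In Beautiful Soup 3, if a tag contains another HTML tag, its .string is None, whether or not a string precedes the inner tag. (Beautiful Soup 4 differs in one case: when the only child is a single tag, .string follows it down to that tag's string.)

soup=BeautifulSoup("<head><title>The Dormouse's story</title></head>")
head=soup.head

print head.string

The output is None, which demonstrates this.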

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

for child in title_tag.children:

    print(child)

# The Dormouse's story
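
To see .contents and .children next to each other, here is a small sketch assuming Python 3 with the bs4 package and html.parser.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    'html.parser')
head_tag = soup.head

contents = head_tag.contents          # a real list, so it supports len() and indexing
children = list(head_tag.children)    # an iterator over the same direct children

child_names = [child.name for child in children]
title_strings = [str(s) for s in soup.title.children]

print(len(contents), child_names, title_strings)
```

Both attributes cover direct children only; neither descends into grandchildren.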

A recursive method for extracting text (written to live inside a class, hence the self parameter):

def gettextonly(self,soup):

        v=soup.string

        if v==None:

            c=soup.contents

            resulttext=''

            for t in c:

                subtext=self.gettextonly(t)

                resulttext+=subtext+'\n'

            return resulttext

        else:

            return v.strip()
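
A standalone variant of the same idea for Beautiful Soup 4 could look like the sketch below. It is not the method above: it assumes Python 3 and the bs4 package, branches on the node type instead of on .string, skips empty pieces, and the name get_text_only is made up.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed as "bs4"
from bs4.element import NavigableString

def get_text_only(node):
    """Recursively collect stripped text, one piece per line (hypothetical helper)."""
    if isinstance(node, NavigableString):
        return node.strip()
    # A tag: gather the text of each child and drop empty pieces.
    parts = [get_text_only(child) for child in node.contents]
    return '\n'.join(part for part in parts if part)

soup = BeautifulSoup('<div><h1>Title</h1><p>Hello <b>world</b></p></div>',
                     'html.parser')
result = get_text_only(soup)
print(result)
```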

A method that splits a string into words:

def separatewords(self,text):

        splitter=re.compile('\\W')

        return [s.lower() for s in splitter.split(text) if s!='']
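
For a quick check of the word splitter outside a class, a plain-function sketch follows. It uses only the standard library; the name separate_words and the sample sentence are illustrative, and the pattern is \W+ rather than \W so runs of punctuation count as one separator.

```python
import re

def separate_words(text):
    """Split on runs of non-word characters and lowercase each word."""
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']

words = separate_words("The Dormouse's story, retold!")
print(words)
```

Note that the apostrophe counts as a separator, so "Dormouse's" becomes two tokens.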