Finding an Element with the select() Method

You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: they specify a pattern to look for, in this case in HTML pages instead of general text strings.

A full discussion of CSS selector syntax is beyond the scope of this book (there’s a good selector tutorial in the resources at http://nostarch.com/automatestuff/), but here’s a short introduction to selectors. Table 11-2 shows examples of the most common CSS selector patterns.

Table 11-2. Examples of CSS Selectors

Selector passed to the select() method    Will match...
--------------------------------------    -------------
soup.select('div')                        All elements named <div>
soup.select('#author')                    The element with an id attribute of author
soup.select('.notice')                    All elements that use a CSS class attribute named notice
soup.select('div span')                   All elements named <span> that are within an element named <div>
soup.select('div > span')                 All elements named <span> that are directly within an element named <div>, with no other element in between
soup.select('input[name]')                All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')       All elements named <input> that have an attribute named type with value button

The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
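
To see several of these patterns side by side, here is a minimal sketch that runs them against a small inline HTML string. The markup in HTML_DOC is made up for illustration (it is not the book's example.html), and the 'html.parser' argument simply selects Python's built-in parser.

```python
import bs4

# Hypothetical markup, invented for this illustration only.
HTML_DOC = """
<div>
  <span class="notice">direct child</span>
  <p>By <span id="author">Al Sweigart</span></p>
</div>
<form>
  <input name="q" type="text">
  <input type="button" value="Go">
</form>
"""

soup = bs4.BeautifulSoup(HTML_DOC, 'html.parser')

# Descendant selector: every <span> anywhere inside a <div>.
descendants = soup.select('div span')

# Child selector: only <span> elements directly inside a <div>.
children = soup.select('div > span')

# Attribute selectors: presence of an attribute, or an exact value.
named_inputs = soup.select('input[name]')
button_inputs = soup.select('input[type="button"]')

# Combined pattern: the id="author" element, but only if it is inside a <p>.
author_in_p = soup.select('p #author')

print(len(descendants), len(children))
print([tag.get('name') for tag in named_inputs])
print([tag.get('value') for tag in button_inputs])
print(author_in_p[0].getText())
```

The difference between 'div span' and 'div > span' shows up in the two counts: the author <span> is nested inside a <p>, so it counts as a descendant of the <div> but not as a direct child.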

The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary. Using the example.html file from earlier, enter the following into the interactive shell:

>>> import bs4
>>> exampleFile = open('example.html')
>>> # read() returns the whole file as a single string
>>> exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
>>> elems = exampleSoup.select('#author')
>>> type(elems)
<class 'list'>
>>> len(elems)
1
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Al Sweigart'
>>> str(elems[0])
'<span id="author">Al Sweigart</span>'
>>> elems[0].attrs
{'id': 'author'}

This code will pull the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element’s text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Al Sweigart'.

Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.
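
Because select() returns a list, indexing elems[0] raises an IndexError when nothing matches. The sketch below shows one way to guard against that. The helper name first_text and the inline HTML are hypothetical, chosen only for illustration. It also uses select_one(), Beautiful Soup's companion method that returns the first match or None.

```python
import bs4

# Hypothetical snippet standing in for a page; not the book's example.html.
HTML_DOC = '<p>By <span id="author">Al Sweigart</span></p>'


def first_text(soup, selector):
    """Return the text of the first element matching selector, or None."""
    elems = soup.select(selector)
    if not elems:  # an empty list means there was no match
        return None
    return elems[0].getText()


soup = bs4.BeautifulSoup(HTML_DOC, 'html.parser')

print(first_text(soup, '#author'))   # an element with this id exists
print(first_text(soup, '#editor'))   # no element has this id

# select_one() returns a single Tag (or None) instead of a list.
author = soup.select_one('#author')
print(author.attrs)
```

Checking for an empty list before indexing keeps a scraping script from crashing when a page's layout changes and a selector stops matching.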

You can also pull all the <p> elements from the BeautifulSoup object. Enter this into the interactive shell:

>>> pElems = exampleSoup.select('p')
>>> str(pElems[0])
'<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>'
>>> pElems[0].getText()
'Download my Python book from my website.'
>>> str(pElems[1])
'<p class="slogan">Learn Python the easy way!</p>'
>>> pElems[1].getText()
'Learn Python the easy way!'
>>> str(pElems[2])
'<p>By <span id="author">Al Sweigart</span></p>'
>>> pElems[2].getText()
'By Al Sweigart'

This time, select() gives us a list of three matches, which we store in pElems. Using str() on pElems[0], pElems[1], and pElems[2] shows you each element as a string, and using getText() on each element shows you its text.
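
When the number of matches is not known in advance, looping over the list is more practical than indexing each element. A minimal sketch follows. The HTML string is reconstructed from the three <p> elements shown in the shell output above and is only assumed to match the relevant part of example.html.

```python
import bs4

# The three <p> elements as they appear in the shell output above.
HTML_DOC = (
    '<p>Download my <strong>Python</strong> book from '
    '<a href="http://inventwithpython.com">my website</a>.</p>\n'
    '<p class="slogan">Learn Python the easy way!</p>\n'
    '<p>By <span id="author">Al Sweigart</span></p>\n'
)

soup = bs4.BeautifulSoup(HTML_DOC, 'html.parser')

# Collect the text of every <p>, however many there are.
paragraph_texts = [p.getText() for p in soup.select('p')]

for number, text in enumerate(paragraph_texts):
    print(number, text)
```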

Getting Data from an Element’s Attributes

The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value. Using example.html, enter the following into the interactive shell:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
>>> spanElem = soup.select('span')[0]
>>> str(spanElem)
'<span id="author">Al Sweigart</span>'
>>> spanElem.get('id')
'author'
>>> spanElem.get('some_nonexistent_addr') is None
True
>>> spanElem.attrs
{'id': 'author'}

Here we use select() to find any <span> elements and then store the first matched element in spanElem. Passing the attribute name 'id' to get() returns the attribute’s value, 'author'.
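
A common use of get() is collecting link targets. The sketch below is one way to do that, assuming a small made-up HTML string (the second link and its missing href are hypothetical). It also contrasts get(), which returns None or a supplied default for a missing attribute, with square-bracket indexing, which raises KeyError.

```python
import bs4

# Hypothetical markup: one <a> with an href and one without.
HTML_DOC = (
    '<a href="http://inventwithpython.com">my website</a>'
    '<a id="placeholder">no link yet</a>'
)

soup = bs4.BeautifulSoup(HTML_DOC, 'html.parser')
links = soup.select('a')

# get() returns None for a missing attribute, or the default you pass.
hrefs = [link.get('href') for link in links]
hrefs_with_default = [link.get('href', '') for link in links]
print(hrefs)
print(hrefs_with_default)

# Indexing a Tag like a dictionary raises KeyError if the attribute is absent.
try:
    links[1]['href']
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)

# An attribute selector keeps only the <a> elements that have an href.
hrefs_present = [link['href'] for link in soup.select('a[href]')]
print(hrefs_present)
```

Filtering with the 'a[href]' selector avoids the missing-attribute case entirely, while get() is the safer choice when the elements have already been selected by some other pattern.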
