Finding an Element with the select() Method


You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: they specify a pattern to look for, in this case in HTML pages instead of general text strings.

A full discussion of CSS selector syntax is beyond the scope of this book (there’s a good selector tutorial in the resources at http://nostarch.com/automatestuff/), but here’s a short introduction to selectors. Table 11-2 shows examples of the most common CSS selector patterns.


Table 11-2. Examples of CSS Selectors

Selector passed to the select() method    Will match...
--------------------------------------    -------------
soup.select('div')                        All elements named <div>
soup.select('#author')                    The element with an id attribute of author
soup.select('.notice')                    All elements that use a CSS class attribute named notice
soup.select('div span')                   All elements named <span> that are within an element named <div>
soup.select('div > span')                 All elements named <span> that are directly within an element named <div>, with no other element in between
soup.select('input[name]')                All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')       All elements named <input> that have an attribute named type with value button
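The patterns in Table 11-2 can be tried against a small HTML string. The following is a minimal sketch rather than part of the original example: the HTML snippet and the names html, selectors, and counts are made up for illustration, and the built-in 'html.parser' is assumed as the parser.

```python
import bs4

# Hypothetical HTML used only to exercise the selector patterns;
# it is not the example.html file from the text.
html = """
<div>
  <span id="author">Al Sweigart</span>
  <p class="notice"><span>Nested span</span></p>
</div>
<input name="q" type="text">
<input type="button" value="Go">
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

selectors = [
    'div',                   # elements named <div>
    '#author',               # element with id="author"
    '.notice',               # elements with CSS class "notice"
    'div span',              # <span> anywhere inside a <div>
    'div > span',            # <span> directly inside a <div>
    'input[name]',           # <input> with a name attribute
    'input[type="button"]',  # <input> whose type is "button"
]

# Count how many elements each selector matches.
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, n in counts.items():
    print(sel, '->', n)
```

In this snippet 'div span' matches both <span> elements, while 'div > span' matches only the one whose parent is the <div> itself.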

The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
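To see the combined selector at work, here is a small sketch that compares two hypothetical HTML fragments (html_inside and html_outside are illustrative names, not from example.html). It assumes the built-in 'html.parser'.

```python
import bs4

# Hypothetical fragments: the same id="author" element, once inside a <p>
# and once inside a <div>.
html_inside = '<p>By <span id="author">Al Sweigart</span></p>'
html_outside = '<div><span id="author">Al Sweigart</span></div>'

inside = bs4.BeautifulSoup(html_inside, 'html.parser').select('p #author')
outside = bs4.BeautifulSoup(html_outside, 'html.parser').select('p #author')

print(len(inside))   # number of matches when the element is inside a <p>
print(len(outside))  # number of matches when it is not
```

Only the first fragment produces a match, because the second has no enclosing <p> element.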

The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary. Using the example.html file from earlier, enter the following into the interactive shell:

>>> import bs4
>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')  # read() returns the file's contents as one string
>>> elems = exampleSoup.select('#author')
>>> type(elems)
<class 'list'>
>>> len(elems)
1
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Al Sweigart'
>>> str(elems[0])
'<span id="author">Al Sweigart</span>'
>>> elems[0].attrs
{'id': 'author'}


This code will pull the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element’s text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Al Sweigart'.

Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.

You can also pull all the <p> elements from the BeautifulSoup object. Enter this into the interactive shell:

>>> pElems = exampleSoup.select('p')
>>> str(pElems[0])
'<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>'
>>> pElems[0].getText()
'Download my Python book from my website.'
>>> str(pElems[1])
'<p class="slogan">Learn Python the easy way!</p>'
>>> pElems[1].getText()
'Learn Python the easy way!'
>>> str(pElems[2])
'<p>By <span id="author">Al Sweigart</span></p>'
>>> pElems[2].getText()
'By Al Sweigart'

This time, select() gives us a list of three matches, which we store in pElems. Using str() on pElems[0], pElems[1], and pElems[2] shows you each element as a string, and using getText() on each element shows you its text.
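Because select() returns a list, the matches can also be processed in a loop instead of by index. The sketch below is illustrative only: it uses an inline string built from the three <p> elements shown in the shell session as a stand-in for example.html, and the names html and texts are assumptions.

```python
import bs4

# Stand-in for example.html: just the three <p> elements from the session above.
html = (
    '<p>Download my <strong>Python</strong> book from '
    '<a href="http://inventwithpython.com">my website</a>.</p>'
    '<p class="slogan">Learn Python the easy way!</p>'
    '<p>By <span id="author">Al Sweigart</span></p>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

# Collect the text of every matched <p> element.
texts = [p.getText() for p in soup.select('p')]
for text in texts:
    print(text)
```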

Getting Data from an Element’s Attributes

The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value. Using example.html, enter the following into the interactive shell:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
>>> spanElem = soup.select('span')[0]
>>> str(spanElem)
'<span id="author">Al Sweigart</span>'
>>> spanElem.get('id')
'author'
>>> spanElem.get('some_nonexistent_addr') is None
True
>>> spanElem.attrs
{'id': 'author'}


Here we use select() to find any <span> elements and then store the first matched element in spanElem. Passing the attribute name 'id' to get() returns the attribute’s value, 'author'.
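The same approach works for other attributes, such as the href of a link. The following sketch assumes a one-line HTML string modeled on the first paragraph of example.html; the variable names and the 'no title' fallback are illustrative. Like dict.get(), the Tag object's get() accepts an optional default that is returned when the attribute is missing.

```python
import bs4

# Illustrative fragment modeled on the first <p> of example.html.
html = ('<p>Download my <strong>Python</strong> book from '
        '<a href="http://inventwithpython.com">my website</a>.</p>')
soup = bs4.BeautifulSoup(html, 'html.parser')

link = soup.select('a')[0]
url = link.get('href')                 # value of an attribute that exists
title = link.get('title', 'no title')  # fallback for a missing attribute

print(url)
print(title)
```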
