Finding an Element with the select() Method

You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: they specify a pattern to look for, in this case in HTML pages instead of general text strings.

A full discussion of CSS selector syntax is beyond the scope of this book (there’s a good selector tutorial in the resources at http://nostarch.com/automatestuff/), but here’s a short introduction to selectors. Table 11-2 shows examples of the most common CSS selector patterns.

Table 11-2. Examples of CSS Selectors

Selector passed to the select() method   Will match...
--------------------------------------   -------------
soup.select('div')                       All elements named <div>
soup.select('#author')                   The element with an id attribute of author
soup.select('.notice')                   All elements that use a CSS class attribute named notice
soup.select('div span')                  All elements named <span> that are within an element named <div>
soup.select('div > span')                All elements named <span> that are directly within an element named <div>, with no other element in between
soup.select('input[name]')               All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')      All elements named <input> that have an attribute named type with value button
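
To see how the patterns in Table 11-2 behave, here is a minimal sketch that runs each of them against a small HTML string. The HTML is made up for this illustration (it is not the book's example.html), and passing 'html.parser' as the parser is a choice made here, not something the text above requires.

```python
import bs4

# Hypothetical HTML invented for this sketch; it is not the book's example.html.
html = """
<div id="main">
  <span class="notice">direct child of the div</span>
  <p><span id="author">nested one level deeper</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# The selector strings from Table 11-2.
selectors = [
    'div',
    '#author',
    '.notice',
    'div span',
    'div > span',
    'input[name]',
    'input[type="button"]',
]

# Map each selector to the number of elements it matches.
counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print(selector, '->', count)
```

Note the difference between 'div span' and 'div > span': the first matches both <span> elements because both sit somewhere inside the <div>, while the second matches only the one that is an immediate child of the <div>.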

The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
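
As a rough illustration of combining patterns, the sketch below compares 'p #author' with 'div #author' on a two-line HTML string. The HTML and the variable names in_p and in_div are assumptions made up for this example.

```python
import bs4

# Hypothetical HTML: the author <span> lives inside a <p>, not inside the <div>.
html = """
<p>By <span id="author">Al Sweigart</span></p>
<div><span id="editor">Someone Else</span></div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Matches: the element with id="author" is inside a <p> element.
in_p = soup.select('p #author')

# No match: there is no element with id="author" inside a <div>.
in_div = soup.select('div #author')

print(len(in_p), [elem.getText() for elem in in_p])
print(len(in_div))
```

Both halves of the combined selector have to hold, so the same id that matches under <p> yields an empty list when the selector asks for it under <div>.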

The select() method returns a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list contains one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that holds all the HTML attributes of the tag as a dictionary. Using the example.html file from earlier, enter the following into the interactive shell:

>>> import bs4
>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')  # read() returns the file's contents as one string
>>> elems = exampleSoup.select('#author')
>>> type(elems)
<class 'list'>
>>> len(elems)
1
>>> type(elems[0])
<class 'bs4.element.Tag'>
>>> elems[0].getText()
'Al Sweigart'
>>> str(elems[0])
'<span id="author">Al Sweigart</span>'
>>> elems[0].attrs
{'id': 'author'}

This code pulls the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element’s text, which is the content between the opening and closing tags with any nested markup stripped: in this case, 'Al Sweigart'.

Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.

You can also pull all the <p> elements from the BeautifulSoup object. Enter this into the interactive shell:

>>> pElems = exampleSoup.select('p')
>>> str(pElems[0])
'<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>'
>>> pElems[0].getText()
'Download my Python book from my website.'
>>> str(pElems[1])
'<p class="slogan">Learn Python the easy way!</p>'
>>> pElems[1].getText()
'Learn Python the easy way!'
>>> str(pElems[2])
'<p>By <span id="author">Al Sweigart</span></p>'
>>> pElems[2].getText()
'By Al Sweigart'

This time, select() gives us a list of three matches, which we store in pElems. Using str() on pElems[0], pElems[1], and pElems[2] shows you each element as a string, and using getText() on each element shows you its text.
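
Because select() returns a list, the same result can be gathered with a loop instead of indexing each element by hand. The following is a minimal sketch; the HTML string is an assumed stand-in for the body of example.html, reconstructed from the shell output above, and texts is a name chosen for this example.

```python
import bs4

# Assumed stand-in for the <p> elements of example.html, taken from the output above.
html = """
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Collect the text of every <p> element, in document order.
texts = [p.getText() for p in soup.select('p')]

for index, text in enumerate(texts):
    print(index, text)
```

Looping this way keeps working if the page gains or loses paragraphs, whereas hard-coded indexes such as pElems[2] raise an IndexError when fewer elements match.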

Getting Data from an Element’s Attributes

The get() method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value. Using example.html, enter the following into the interactive shell:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
>>> spanElem = soup.select('span')[0]
>>> str(spanElem)
'<span id="author">Al Sweigart</span>'
>>> spanElem.get('id')
'author'
>>> spanElem.get('some_nonexistent_addr') == None
True
>>> spanElem.attrs
{'id': 'author'}

Here we use select() to find any <span> elements and then store the first matched element in spanElem. Passing the attribute name 'id' to get() returns the attribute’s value, 'author'. Asking for an attribute the element does not have returns None.
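
A common use of get() is reading the href of a link. The sketch below assumes a one-line HTML string modeled on the first paragraph of example.html; the variable names and the 'no title' fallback are made up for illustration.

```python
import bs4

# Assumed HTML, modeled on the first paragraph of example.html.
html = '<p>Download my book from <a href="http://inventwithpython.com">my website</a>.</p>'

soup = bs4.BeautifulSoup(html, 'html.parser')
link = soup.select('a')[0]

# An attribute that exists: get() returns its value.
href = link.get('href')

# An attribute that does not exist: get() returns None...
missing = link.get('title')

# ...unless a second argument supplies a fallback value.
fallback = link.get('title', 'no title')

print(href)
print(missing)
print(fallback)
```

Using get() rather than indexing the attrs dictionary directly avoids a KeyError when an element lacks the attribute.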
