Selenium

There are various strategies to locate elements in a page. You can use the most appropriate one for your case. Selenium provides the following methods to locate elements in a page:

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

To find multiple elements (these methods will return a list):

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

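Apart from the methods given above, there are two more general methods, find_element and find_elements, which take the locator strategy as an argument and are useful with locators in page objects. The Selenium documentation this section is based on calls them private, but they are part of the public WebDriver API. In Selenium 4.3 and later the find_element_by_* and find_elements_by_* helpers listed above were removed, so these two methods are the ones to use with current releases.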

Example usage:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

These are the attributes available for By class:

ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
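Because the By attributes are plain strings, a locator can be stored as a (strategy, value) tuple and unpacked into find_element or find_elements, which is convenient in page objects and when porting code away from the find_element_by_* helpers. The sketch below is only an illustration: to_locator and LEGACY_SUFFIX_TO_STRATEGY are hypothetical names invented here, not part of Selenium, and the strategy strings are copied from the list above. It runs without a browser.

```python
# Strategy strings copied from the By attribute list above.
# The dictionary and the helper are hypothetical, not Selenium APIs.
LEGACY_SUFFIX_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_locator(legacy_method, value):
    """Translate e.g. 'find_element_by_id' into a (strategy, value) tuple."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_method.startswith(prefix):
            suffix = legacy_method[len(prefix):]
            return (LEGACY_SUFFIX_TO_STRATEGY[suffix], value)
    raise ValueError(f"not a find_element_by_* name: {legacy_method}")


locator = to_locator("find_element_by_id", "loginForm")
print(locator)  # ('id', 'loginForm')

# With a live driver (assumed, not created here) the tuple would be unpacked:
#     driver.find_element(*locator)
```

Keeping locators as tuples means a page object can declare them once as data and pass them to either find_element or find_elements.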

Locating by Id

Use this when you know the id attribute of an element. With this strategy, the first element with an id attribute value matching the location will be returned. If no element has a matching id attribute, a NoSuchElementException will be raised.

For instance, consider this page source:

<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
</form>
</body>
</html>

The form element can be located like this:

login_form = driver.find_element_by_id('loginForm')

Locating by Name

Use this when you know the name attribute of an element. With this strategy, the first element with a name attribute value matching the location will be returned. If no element has a matching name attribute, a NoSuchElementException will be raised.

For instance, consider this page source:

<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>

The username & password elements can be located like this:

username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')

This will give the “Login” button, as it occurs before the “Clear” button (the variable is named continue_button because continue is a reserved word in Python):

continue_button = driver.find_element_by_name('continue')

Locating by XPath

XPath is the language used for locating nodes in an XML document. As HTML can be an implementation of XML (XHTML), Selenium users can leverage this powerful language to target elements in their web applications. XPath extends beyond (as well as supporting) the simple methods of locating by id or name attributes, and opens up all sorts of new possibilities such as locating the third checkbox on the page.

One of the main reasons for using XPath is when you don’t have a suitable id or name attribute for the element you wish to locate. You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute. XPath locators can also be used to specify elements via attributes other than id and name.

Absolute XPaths contain the location of all elements from the root (html) and as a result are likely to fail with only the slightest adjustment to the application. By finding a nearby element with an id or name attribute (ideally a parent element) you can locate your target element based on the relationship. This is much less likely to change and can make your tests more robust.

For instance, consider this page source:

<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>

The form element can be located in any of these ways:

login_form = driver.find_element_by_xpath("/html/body/form[1]")
login_form = driver.find_element_by_xpath("//form[1]")
login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
  1. Absolute path (would break if the HTML was changed only slightly)
  2. First form element in the HTML
  3. The form element with attribute named id and the value loginForm

The username element can be located like this:

username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
  1. First form element with an input child element with attribute named name and the value username (note that this expression selects the form itself, not the input)
  2. First input child element of the form element with attribute named id and the value loginForm
  3. First input element with attribute named name and the value username

The “Clear” button element can be located like this:

clear_button = driver.find_element_by_xpath("//input[@name='continue'][@type='button']")
clear_button = driver.find_element_by_xpath("//form[@id='loginForm']/input[4]")
  1. Input with attribute named name and the value continue and attribute named type and the value button
  2. Fourth input child element of the form element with attribute named id and the value loginForm
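The relative expressions above can be tried without a browser. The following is a minimal sketch that swaps Selenium for Python's standard-library xml.etree.ElementTree. That module implements only a limited XPath subset and needs well-formed markup, so the page source is closed with </html> here and each path starts with a dot to make it relative to the root element. It is an offline approximation, not how Selenium evaluates XPath.

```python
import xml.etree.ElementTree as ET

# The login page from the example above, closed properly so it parses as XML.
PAGE = """<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>"""

root = ET.fromstring(PAGE)

# Element selected by an attribute value.
form = root.find(".//form[@id='loginForm']")
username = root.find(".//input[@name='username']")

# Two predicates chained: name is 'continue' AND type is 'button'.
clear_button = root.find(".//input[@name='continue'][@type='button']")

# Positional predicate: fourth input child of the form.
fourth_input = root.find(".//form[@id='loginForm']/input[4]")

print(form.get("id"))             # loginForm
print(username.get("type"))       # text
print(clear_button.get("value"))  # Clear
print(fourth_input.get("value"))  # Clear
```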

These examples cover some basics, but in order to learn more, the following references are recommended:

There are also a couple of very useful Add-ons that can assist in discovering the XPath of an element:

  • XPath Checker - suggests XPath and can be used to test XPath results.
  • Firebug - XPath suggestions are just one of the many powerful features of this very useful add-on.
  • XPath Helper - for Google Chrome

Locating Hyperlinks by Link Text

Use this when you know the link text used within an anchor tag. With this strategy, the first element whose link text matches the location will be returned. If no element has matching link text, a NoSuchElementException will be raised.

For instance, consider this page source:

<html>
<body>
<p>Are you sure you want to do this?</p>
<a href="continue.html">Continue</a>
<a href="cancel.html">Cancel</a>
</body>
</html>

The continue.html link can be located like this:

continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')

Locating Elements by Tag Name

Use this when you want to locate an element by tag name. With this strategy, the first element with the given tag name will be returned. If no element has a matching tag name, a NoSuchElementException will be raised.

For instance, consider this page source:

<html>
<body>
<h1>Welcome</h1>
<p>Site content goes here.</p>
</body>
</html>

The heading (h1) element can be located like this:

heading1 = driver.find_element_by_tag_name('h1')

Locating Elements by Class Name

Use this when you want to locate an element by the value of its class attribute. With this strategy, the first element with a matching class name will be returned. If no element has a matching class name, a NoSuchElementException will be raised.

For instance, consider this page source:

<html>
<body>
<p class="content">Site content goes here.</p>
</body>
</html>

The “p” element can be located like this:

content = driver.find_element_by_class_name('content')

Locating Elements by CSS Selectors

Use this when you want to locate an element using CSS selector syntax. With this strategy, the first element matching the CSS selector will be returned. If no element matches the CSS selector, a NoSuchElementException will be raised.

For instance, consider this page source:

<html>
<body>
<p class="content">Site content goes here.</p>
</body>
</html>

The “p” element can be located like this:

content = driver.find_element_by_css_selector('p.content')

Beautiful Soup

The name argument

Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.

This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
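The snippets in this section use a soup object that the text never defines. A plausible setup is sketched below. It assumes the “three sisters” sample document used in the Beautiful Soup documentation (the markup is reproduced from that documentation, so treat its exact form as an assumption), the bs4 package, and Python's built-in html.parser.

```python
import re

from bs4 import BeautifulSoup

# Sample document assumed to match the one used by the Beautiful Soup docs.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html_doc, "html.parser")

# Filter by tag name.
print([tag.string for tag in soup.find_all("title")])
# ["The Dormouse's story"]

# Filter by tag name, then read an attribute from each match.
print([a["id"] for a in soup.find_all("a")])
# ['link1', 'link2', 'link3']

# Filter by a regular expression on an attribute.
print([a["id"] for a in soup.find_all(href=re.compile("elsie"))])
# ['link1']
```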

Recall from Kinds of filters that the value to name can be a string, a regular expression, a list, a function, or the value True.

The keyword arguments

Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on a string, a regular expression, a list, a function, or the value True.

This code finds all tags whose id attribute has a value, regardless of what the value is:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter multiple attributes at once by passing in more than one keyword argument:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

Searching by CSS class

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

But searching for variants of the string value won’t work:

css_soup.find_all("p", class_="strikeout body")
# []

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The text argument

With text you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

Although text is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for text. This code finds the <a> tags whose .string is “Elsie”:

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The limit argument

find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.

There are three links in the “three sisters” document, but this code only finds the first two:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False. See the difference here:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

Here’s that part of the document:

<html>
<head>
<title>
The Dormouse's story
</title>
</head>
...

The <title> tag is beneath the <html> tag, but it’s not directly beneath the <html> tag: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it’s allowed to look at all descendants of the <html> tag, but when recursive=False restricts it to the <html> tag’s immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as find_all(): name, attrs, text, limit, and the keyword arguments. But the recursive argument is different: find_all() and find() are the only methods that support it. Passing recursive=False into a method like find_parents() wouldn’t be very useful.


Calling a tag is like calling find_all()

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(text=True)
soup.title(text=True)

find()

Signature: find(name, attrs, recursive, text, **kwargs)

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method. These two lines of code are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, and find() just returns the result.

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None.
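The difference matters when the result is used right away: calling a method on None raises an AttributeError. A small sketch of both return values and a defensive check follows. It assumes the bs4 package is installed, and the one-line document and the tag name nosuchtag are made up for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, just enough to show both outcomes.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

print(soup.find_all("nosuchtag"))  # []
print(soup.find("nosuchtag"))      # None

# Check for None before using the result of find().
missing = soup.find("nosuchtag")
missing_text = missing.get_text() if missing is not None else ""

title = soup.find("title")
title_text = title.get_text() if title is not None else ""
print(title_text)  # The Dormouse's story
```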

 
