Introduction to Selenium

Selenium Overview

https://github.com/SeleniumHQ/selenium

https://www.selenium.dev/documentation/en/

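Selenium is a widely used solution for automated web testing, and it can also be used to scrape page data.

It has three core components: WebDriver, IDE, and Grid.

Browser Support

1. Real browsers

Chrome, Chromium, Firefox, Internet Explorer, Opera, Safari

2. Simulated (headless) browsers

HtmlUnit: available through the Java language bindings.

https://htmlunit.sourceforge.io/

PhantomJS: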

https://phantomjs.org/

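Utility Libraries

1. Beautiful Soup

Extracts data from HTML or XML files.

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/ Beautiful Soup 4.4.0 documentation (Chinese)

2. requests / urllib2 (urllib.request in Python 3)

Downloads the data at a given URL.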
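As a rough sketch of the download-then-extract workflow these libraries support, the following uses only the Python standard library (urllib.request and html.parser) as stand-ins for requests and Beautiful Soup. The names download, extract_links, and LinkCollector are illustrative and do not come from either library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def download(url, timeout=10):
    """Return the body found at `url`, decoded as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_links(html):
    """Return the href values of all links in an HTML string, in order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

A typical call would be extract_links(download(some_url)). With Beautiful Soup installed, the extraction step is usually shorter, but the overall shape stays the same.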

Development Practice

Step 1: Install Python

https://www.runoob.com/python/python-tutorial.html Python basics tutorial

https://www.python.org/downloads/ Official Python download page

Add the ${PYTHON_HOME} and ${PYTHON_HOME}/Scripts directories to the PATH variable.

python --version

Python 3.9.6

pip --version

pip 21.1.3 from d:\python39\lib\site-packages\pip (python 3.9)

Step 2: Install selenium

pip install selenium

Collecting selenium

Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)

|████████████████████████████████| 904 kB 64 kB/s

Collecting urllib3

Downloading urllib3-1.26.6-py2.py3-none-any.whl (138 kB)

|████████████████████████████████| 138 kB 94 kB/s

Installing collected packages: urllib3, selenium

Successfully installed selenium-3.141.0 urllib3-1.26.6

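Step 3: Install the browser driver

Download the Chrome browser driver: https://sites.google.com/a/chromium.org/chromedriver/downloads

Mirror in China: http://npm.taobao.org/mirrors/chromedriver/

Add the directory containing the driver to the system PATH variable, then verify: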

chromedriver --version

ChromeDriver 91.0.4472.101 (af52a90bf87030dd1523486a1cd3ae25c5d76c9b-refs/branch-heads/4472@{#1462})
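The PATH checks from Steps 1 to 3 can also be scripted. This is a minimal sketch using shutil.which from the standard library; the helper name missing_tools and the list of tool names are assumptions chosen for illustration.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Report anything from Steps 1-3 that is not reachable from the shell.
for tool in missing_tools(["python", "pip", "chromedriver"]):
    print(f"{tool} is not on PATH")
```

If nothing is printed, all three executables were found. On some systems the interpreter is installed as python3 rather than python, so adjust the names to match your setup.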

Step 4: Call the browser driver API from a project to open pages and operate on elements

Create a Python project and use Selenium to drive the browser.

https://www.selenium.dev/documentation/en/webdriver/

https://www.selenium.dev/documentation/en/driver_idiosyncrasies/ Driver idiosyncrasies
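A minimal first script might look like the sketch below. It assumes chromedriver is on PATH as set up in Step 3. The function names page_title and open_chrome_and_get_title are hypothetical, and the Selenium import is placed inside the second function only so that the first helper can be tried with any driver-like object.

```python
def page_title(driver, url):
    """Load `url` in the given WebDriver and return the page title."""
    driver.get(url)
    return driver.title


def open_chrome_and_get_title(url):
    """Start Chrome, read the title of `url`, and always close the browser."""
    # Imported here so page_title() stays usable without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH (Step 3)
    try:
        return page_title(driver, url)
    finally:
        driver.quit()  # release the browser even if loading the page fails
```

Calling open_chrome_and_get_title("https://www.selenium.dev/") should open a Chrome window briefly and return the page title. The try/finally block matters because a browser process left running after an exception will accumulate across test runs.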

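Waits

A browser needs time to load a page, so locating elements in Selenium requires waiting long enough to ensure that the page has finished loading and the target element can be found.

There are four ways to wait:

1. Fixed sleep (blocks the whole process):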

import time

time.sleep(10)

2. Explicit wait: wait until a specific condition is met

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.common.by import By

WebDriverWait(driver, timeout=10).until(EC.element_to_be_clickable((By.ID, 'content_left')))

3. Implicit wait

driver.implicitly_wait(3)

4. Fluent wait (in Python, a WebDriverWait with a custom poll_frequency)

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.common.by import By

from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, timeout=10, poll_frequency=1)

wait.until(EC.element_to_be_clickable((By.ID, 'content_left')))

Important:

Do not mix explicit and implicit waits in Selenium. Doing so can produce unexpected behavior, such as unpredictable wait times.
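To make the idea behind explicit and fluent waits concrete, here is a plain-Python sketch of a polling loop: call a condition repeatedly at a fixed interval until it returns something truthy or the timeout expires. This is an illustration of the concept, not Selenium's actual implementation, and the name wait_until is hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_frequency)
```

Compared with time.sleep(10), a loop like this returns as soon as the condition holds, so the full timeout is only spent in the failure case.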

Locating Elements

Selenium provides eight built-in element location strategies, shown below.

Assume the DOM looks like this:

<ol id=cheese class="clazz1 clazz2">

 <li id=cheddar name="cheddar">…

 <li id=brie>…

 <li id=rochefort>…

 <li id=camembert>…

 <li>

   <a href="">test</a>

 </li>

</ol>

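Locating a single element

1. By element ID

# Locate the target element in a single call

driver.find_element(By.ID, "cheese")

# Locate the parent element first, then narrow the search and locate the child by ID

cheese = driver.find_element(By.ID, "cheese")

cheddar = cheese.find_element(By.ID, "cheddar")

2. By CSS selector

cheddar = driver.find_element(By.CSS_SELECTOR, "#cheese #cheddar")

3. By class name

# Finds an element whose class attribute contains the given class name. Note: the argument cannot be a compound class such as 'clazz1 clazz2'

driver.find_element(By.CLASS_NAME, 'clazz1')

4. By the element's name attribute

# Locate the element whose name attribute matches the given value

driver.find_element(By.NAME, 'cheddar')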

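5. By link text

# Match the complete visible text of a link (<a> element)

driver.find_element(By.LINK_TEXT, 'test')

6. By partial link text

# Match a substring of a link's visible text

driver.find_element(By.PARTIAL_LINK_TEXT, 'te')

7. By tag name

# Locate the first <a> element (use find_elements to get all of them)

driver.find_element(By.TAG_NAME, 'a')

8. By XPath expression

# Locate by XPath expression, for example "//li[@id='cheddar']"

driver.find_element(By.XPATH, xpath_expression)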

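In addition to the built-in strategies above, Selenium 4 also supports relative locators, which find an element by its position relative to other elements.

When locating by CSS selector or class name, SelectorGadget can help you derive the selector.

When locating by XPath, you can use Chrome's built-in developer tools: select the element, then copy its XPath.

Locating multiple elements

Locating multiple elements uses the same strategies as locating a single element. The difference is that the return value is a list of elements rather than a single element.

# find_elements always returns a list; if only one element matches, the list has exactly one item

# If no element matches, an empty list is returned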

mucho_cheese = driver.find_elements(By.CSS_SELECTOR, "#cheese li")
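Because find_elements returns a list (possibly empty) instead of raising when nothing matches, the result can be iterated directly without an existence check. The helper below is a small sketch of that pattern. element_texts is a hypothetical name, and the locator strategy is passed in as a parameter so the function itself does not need to import Selenium.

```python
def element_texts(driver, by, value):
    """Return the visible text of every element matching the locator.

    `by` is a locator strategy such as By.CSS_SELECTOR and `value` is the
    matching selector string. An empty list means nothing matched.
    """
    return [element.text for element in driver.find_elements(by, value)]
```

With the DOM above, element_texts(driver, By.CSS_SELECTOR, "#cheese li") would return one string per list item.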

Ways to Get the Content of HTML Elements

    span: get_attribute('textContent')

textarea: get_attribute('textContent')

       a: get_attribute('textContent')

       p: get_attribute('textContent')

       b: get_attribute('textContent')

      h2: get_attribute('innerHTML'); for plain text use .text, e.g. driver.find_element(By.XPATH, 'xxx').text

     div: get_attribute('innerHTML'); for plain text use .text, e.g. driver.find_element(By.XPATH, 'xxx').text

table.td: get_attribute('innerHTML'); for plain text use .text, e.g. driver.find_element(By.XPATH, 'xxx').text

      em: get_attribute('innerHTML'); for plain text use .text, e.g. driver.find_element(By.XPATH, 'xxx').text

Note: get_attribute('innerHTML') returns the markup of everything inside the element, including all child nodes, so the result is HTML rather than plain text.
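One way to combine these options is to prefer the rendered text and fall back to textContent when it is empty, which can happen when an element is not currently visible. The sketch below assumes an object with Selenium's WebElement interface (a text property and a get_attribute method). The helper name element_content is hypothetical.

```python
def element_content(element):
    """Return the element's visible text, or its textContent as a fallback."""
    text = element.text
    if text:
        return text
    # get_attribute may return None when the attribute is absent.
    return (element.get_attribute("textContent") or "").strip()
```

Whether this fallback is appropriate depends on the page. If hidden text should be ignored, use .text alone.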

You can also locate elements by multiple class names:

driver.find_elements(By.CSS_SELECTOR, "div[class='value test']")

See: Find div element by multiple class names?
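Note that the attribute selector div[class='value test'] only matches when the class attribute is exactly that string. Chaining class selectors, as in div.value.test, matches any div that carries both classes regardless of order or additional classes. A small sketch for building such a selector follows. The helper name css_for_classes is hypothetical.

```python
def css_for_classes(tag, class_names):
    """Build a selector such as 'div.value.test' from a tag and class names."""
    return tag + "".join(f".{name}" for name in class_names)
```

It can then be passed to the CSS strategy, for example driver.find_elements(By.CSS_SELECTOR, css_for_classes("div", ["value", "test"])).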

[References]

https://www.cnblogs.com/deliaries/p/14121204.html Error when loading cookies in Selenium