Using python3 + selenium + (chrome and firefox)

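A few words first
Overview
Templates

A few words first

I have recently been working on a selenium project. When it comes to choosing a browser, there are generally three options: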

chrome
PhantomJS
firefox (recommended)

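There are many PhantomJS tutorials online. However, on 2018-03-04 ariya announced on the project's open-source git repository that development was suspended until further notice, and as of 2019-03-08 there has been no news.

There are also plenty of chrome tutorials, but after using it for the past few days the experience was not great: its support for selenium timeouts is not good enough, and that cost me a lot of time!

So here I strongly recommend firefox.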

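Overview

The strength of selenium is that it drives a browser the way a person would, which sidesteps JS-based encryption, dynamic loading and the like: whatever you can see, you can scrape. However, a selenium-controlled chrome exposes many parameters that can be used to detect selenium, and quite a few sites now have anti-scraping measures aimed at selenium. I have heard that pyppeteer (the Python port of puppeteer) can be used instead; that is something to learn later.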
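One commonly cited example of such a parameter is the navigator.webdriver property, which the W3C WebDriver specification asks automated browsers to report as true. The helper below is only a minimal sketch of how a page-side check could see it. The name is_flagged_as_automated is hypothetical, and driver is assumed to be any selenium WebDriver instance; only its execute_script() method is used.

```python
def is_flagged_as_automated(driver):
    """Return True if the page can see navigator.webdriver === true.

    `driver` is assumed to be a selenium WebDriver instance (assumption,
    not from the original post); only execute_script() is called.
    """
    # execute_script returns the value of the JavaScript `return` statement.
    flag = driver.execute_script('return navigator.webdriver')
    # The property may be true, false, or undefined (None in Python).
    return flag is True
```

Sites may check other signals as well, so a False result here does not mean the browser is undetectable.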

Today I found a useful method, driver.set_page_load_timeout(self.timeout), which pushed me to study the different timeout settings carefully:

driver.implicitly_wait(1)
driver.set_page_load_timeout(15)
driver.set_script_timeout(15)

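1.implicitly_wait

Official docstring:

Sets a sticky timeout to implicitly wait for an element to be found, or a command to complete. This method only needs to be called one time per session. To set the timeout for calls to execute_async_script, see set_script_timeout.

This timeout applies whenever you call a find_element_by... method (find_element(By..., ...) in newer selenium versions). It is global to the driver, so it only needs to be set once. If you want a lookup to give up quickly when the element is not there, the value has to be fairly small.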
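Because the implicit wait is session-wide, one way to give up quickly on a single lookup is to lower it temporarily and then restore it. The following is a minimal sketch, not part of selenium: short_implicit_wait is a hypothetical helper, and the value to restore has to be passed in by the caller (1 second is assumed here, matching the setting shown earlier).

```python
from contextlib import contextmanager


@contextmanager
def short_implicit_wait(driver, seconds=0, restore_to=1):
    """Temporarily lower the driver-wide implicit wait.

    `restore_to` is an assumption supplied by the caller: the value the
    session normally runs with (1 second in this sketch).
    """
    driver.implicitly_wait(seconds)  # applies to every lookup from now on
    try:
        yield driver
    finally:
        # Put the session default back, even if the lookup raised.
        driver.implicitly_wait(restore_to)
```

A possible use is `with short_implicit_wait(driver, 0, restore_to=1): items = driver.find_elements(By.ID, 'maybe-missing')`, where 'maybe-missing' is a placeholder id; find_elements returns an empty list instead of raising when nothing matches.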

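2.set_page_load_timeout

Official docstring:

Set the amount of time to wait for a page load to complete before throwing an error.

When you call driver.get() to open a site, the page may already be almost fully loaded, but if some image or asynchronous request never finishes and the spinner keeps spinning, get() does not return. set_page_load_timeout sets the limit for this wait. Example: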

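from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.set_page_load_timeout(15)  # limit on page load time, in seconds
try:
    driver.get('https://www.toutiao.com')
except TimeoutException:
    driver.execute_script('window.stop()')  # stop loading
print(driver.page_source)
driver.quit()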

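However!!! In my experience chrome does not support this very important setting: once a TimeoutException has been raised, every later operation on the driver also raises TimeoutException and nothing can proceed. That is why I recommend firefox.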

3.set_script_timeout

Official docstring:

Set the amount of time that the script should wait during an execute_async_script call before throwing an error.

This limits how long an asynchronous script may run; when the limit is exceeded, a TimeoutException is raised.
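To make the setting more concrete, here is a minimal sketch of an asynchronous script call. The helper name run_async_script and the constant DELAYED_DONE are hypothetical; driver is assumed to be a selenium WebDriver, and only set_script_timeout() and execute_async_script() are used.

```python
def run_async_script(driver, js_body, timeout=15):
    """Run `js_body` with execute_async_script under a script timeout.

    The script must call the callback that selenium passes as the last
    argument; whatever it passes to the callback is returned here.
    """
    driver.set_script_timeout(timeout)  # seconds the script may take
    return driver.execute_async_script(js_body)


# Example script (assumption for illustration): report back after one
# second, which is well inside the 15 second limit.
DELAYED_DONE = """
var done = arguments[arguments.length - 1];
setTimeout(function () { done('finished'); }, 1000);
"""
```

With a real browser, run_async_script(driver, DELAYED_DONE) should return once the callback fires; if the script never calls the callback, the call is expected to fail with a TimeoutException after the configured limit.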

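4.WebDriverWait Class

Example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# driver is an existing webdriver instance; 'express' stands for your XPath expression
wait = WebDriverWait(driver, 15, 0.5)
wait.until(EC.presence_of_element_located((By.XPATH, 'express')))

This is the wait class I like to use for finding elements. With a 15-second timeout, it checks for the element every 0.5 seconds; until() returns as soon as the element is found, and a TimeoutException is raised if it is still missing after 15 seconds.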
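The polling behaviour described here can be sketched in plain Python, without a browser. This is not selenium's implementation, only an illustration of the same idea; wait_until is a hypothetical name, and it raises the built-in TimeoutError instead of selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=15, poll=0.5):
    """Call `condition()` every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # found: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)  # wait before the next attempt
```

One difference worth knowing: the real WebDriverWait also ignores NoSuchElementException by default while polling, which this sketch does not model.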

Templates

1.set_page_load_timeout template

import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

t0 = time.time()
print("cur time is: %s" % (time.time() - t0))
driver = webdriver.Chrome()
driver.set_page_load_timeout(5)  # limit on page load time, in seconds
driver.maximize_window()
try:
    print("cur time is: %s" % (time.time() - t0))
    driver.get('http://www.autohome.com.cn/')
except TimeoutException:
    print("cur time is: %s" % (time.time() - t0))
    try:
        # once loading exceeds the limit, stop it via JavaScript so later steps can run
        driver.execute_script('window.stop()')
    except Exception:
        pass
print("cur time is: %s" % (time.time() - t0))

2.selenium + chrome template

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('disable-infobars')  # remove the info bar
# Note: there must be no spaces around "=", so NOT "--proxy-server = http://202.20.16.82:10152"
# chrome_options.add_argument("--proxy-server=http://192.168.60.15:808")  # set a proxy
# chrome_options.add_argument('start-fullscreen')  # start in full screen (the F11 kind)
# chrome_options.add_argument('-lang=zh-CN')  # Chinese; seems to have no effect
# language: set to Chinese
# prefs = {'intl.accept_languages': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3'}
# chrome_options.add_experimental_option('prefs', prefs)
# chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
# chrome_options.add_argument('--headless')  # no visible browser window; on linux without a display, startup fails without this
# chrome_options.add_argument('window-size=1920,3000')  # set the browser window size (width,height)
# chrome_options.add_argument('--disable-gpu')  # Google's docs mention adding this to work around a bug
# chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
# chrome_options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"  # manually specify the browser binary to use

TIMEOUT = 15

class Display(object):
    def __init__(self):
        # executable_path can be omitted if the driver is on the PATH
        self.driver = webdriver.Chrome(options=chrome_options)
        # self.driver.set_page_load_timeout(TIMEOUT)
        self.wait = WebDriverWait(self.driver, TIMEOUT, 0.5)

    def __del__(self):
        if self.driver:
            self.driver.quit()  # quit() also ends the driver process; close() only closes the window

    def fetch(self, url):
        self.driver.maximize_window()  # maximize the window
        self.driver.get(url)  # send the request
        # self.driver.execute_script('window.location.reload();')  # refresh
        self.wait.until(EC.presence_of_element_located((By.ID, 'kw')))
        # find_element(By.ID, ...) works in selenium 3 and 4; find_element_by_id was removed in newer releases
        self.driver.find_element(By.ID, 'kw').send_keys('selenium')
        self.driver.find_element(By.ID, 'su').click()
        self.wait.until(EC.presence_of_element_located((By.ID, '1')))
        return self.driver.page_source

d = Display()
print(d.fetch('https://www.baidu.com'))

3.selenium + firefox template

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

TIMEOUT = 15

class Display(object):
    def __init__(self):
        options = webdriver.FirefoxOptions()
        # headless mode
        options.add_argument('-headless')
        # disable images (set on the options object; the original passed an
        # undefined DESIRED_CAP and a separate profile to the constructor)
        options.set_preference('permissions.default.image', 2)
        self.driver = webdriver.Firefox(options=options)
        self.driver.set_page_load_timeout(TIMEOUT)
        self.wait = WebDriverWait(self.driver, TIMEOUT, 0.5)

    def __del__(self):
        if self.driver:
            self.driver.quit()  # quit() also ends the driver process; close() only closes the window

    def fetch(self, url):
        self.driver.maximize_window()  # maximize the window
        self.driver.get(url)  # send the request
        # self.driver.execute_script('window.location.reload();')  # refresh
        self.wait.until(EC.presence_of_element_located((By.ID, 'kw')))
        # find_element(By.ID, ...) works in selenium 3 and 4; find_element_by_id was removed in newer releases
        self.driver.find_element(By.ID, 'kw').send_keys('selenium')
        self.driver.find_element(By.ID, 'su').click()
        self.wait.until(EC.presence_of_element_located((By.ID, '1')))
        return self.driver.page_source