Crawler 04 / asyncio, evading selenium detection, action chains, headless browsers

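1. Coroutines with asyncio

Coroutine basics
- Special function
  - A function whose definition is prefixed with the async keyword
  - What makes it special:
    
    Calling it returns a coroutine object
    
    The statements inside it are not executed at the moment of the call
- Coroutine
  - An object: the value returned by calling a special function. It represents a specific set of operations.
- Task object
  - A higher-level coroutine (a further wrapper around a coroutine object); it also represents a specific set of operations
    
    Conceptually: task object == coroutine == special function
  - Binding a callback (typically used for parsing):
    
    task.add_done_callback(func)
    
    func is an ordinary function with one parameter, task: the task object the callback is bound to
    
    task.result() returns the return value of the special function wrapped by the task
- Event loop object
  - Create an event loop object
  - Register the task object with the loop and start the loop
  - Purpose: the loop runs all task objects registered with it asynchronously
- Code example: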
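```
import asyncio
from time import sleep

# The special function
async def get_request(url):
    print('Downloading:', url)
    sleep(2)  # blocking sleep; harmless here because only one task runs
    print('Download finished:', url)
    return 'page_text'

# Callback definition (an ordinary function)
def parse(task):
    # The parameter is the task object the callback is bound to
    print('i am callback!!!', task.result())

# Call the special function: this only creates a coroutine object
c = get_request('www.lbzhk.com')
# Create an event loop object
loop = asyncio.new_event_loop()
# Create a task object on that loop
task = asyncio.ensure_future(c, loop=loop)
# Bind a callback to the task object
task.add_done_callback(parse)
# Register the task with the loop and start the loop
loop.run_until_complete(task)   # the loop runs a single task
loop.close()
```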
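
The two properties of a special function (the call returns a coroutine object, and the body does not run at call time) can be checked directly. The sketch below is a minimal illustration, not part of the original example: it uses asyncio.run (Python 3.7+) in place of an explicit event loop, and the URL and the calls list are hypothetical names introduced only to record what ran and when.

```python
import asyncio

calls = []  # hypothetical log: records when each piece of code actually runs


async def get_request(url):
    calls.append(url)                 # runs only once the task is scheduled
    return 'page_text of ' + url


def parse(task):
    # The callback receives the finished task; result() is the return value
    # of the special function wrapped by that task.
    calls.append('callback: ' + task.result())


async def main():
    c = get_request('www.example.com')   # hypothetical URL; nothing runs yet
    assert asyncio.iscoroutine(c)        # the call returned a coroutine object
    assert calls == []                   # the function body has not executed
    task = asyncio.ensure_future(c)      # wrap the coroutine in a task
    task.add_done_callback(parse)
    return await task                    # the body and the callback run here


result = asyncio.run(main())
print(result)
print(calls)
```

The callback is bound before the task is awaited, so it fires as soon as the task finishes and before main() resumes.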

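Multi-task coroutines

Suspend: give up control of the CPU so that the event loop can run another task.

wait(tasks): wraps a list of task objects so that each one can be suspended and resumed by the event loop
await: used inside a special function, in front of a blocking operation, to suspend the task until that operation finishes
Code example:

```
import asyncio
import time

# The special function
async def get_request(url):
    print('Downloading:', url)
    await asyncio.sleep(2)  # non-blocking sleep; time.sleep(2) here would block every task
    print('Download finished:', url)
    return 'i am page_text!!!'

def parse(task):
    page_text = task.result()
    print(page_text)

start = time.time()
urls = ['www.1.com', 'www.2.com', 'www.3.com']
loop = asyncio.new_event_loop()
tasks = []  # holds all task objects: multiple tasks!
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c, loop=loop)
    task.add_done_callback(parse)
    tasks.append(task)

# asyncio.wait(tasks) lets every task in the list be suspended and resumed
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
print('Total time:', time.time() - start)
```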
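
As an alternative to collecting results through callbacks, the same multi-task pattern can be sketched with asyncio.gather, which returns the results in the order the coroutines were passed in. This is an illustration under assumptions, not the method used above: the 0.2-second sleep stands in for a real download, and asyncio.run replaces the explicit event loop.

```python
import asyncio
import time


async def get_request(url):
    await asyncio.sleep(0.2)          # stand-in for a non-blocking download
    return 'page_text from ' + url


async def main(urls):
    # gather schedules all coroutines concurrently and preserves input order.
    return await asyncio.gather(*(get_request(u) for u in urls))


start = time.time()
pages = asyncio.run(main(['www.1.com', 'www.2.com', 'www.3.com']))
elapsed = time.time() - start
print(pages)
print('Total time:', round(elapsed, 2))
```

Because the three waits overlap, the total time stays close to one sleep rather than the sum of all three.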

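2. Multi-task asynchronous crawling with aiohttp

Conditions for asynchronous crawling
- Code from modules that do not support async must not appear inside a special function; a single blocking call stalls the event loop and cancels the asynchronous effect
- The requests module does not support async
- aiohttp is a network request module that does support async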
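
The first condition can be made visible with a small timing experiment. In this sketch, time.sleep stands in for any blocking call (such as a requests download) and asyncio.sleep for an async-aware one; the function names and the 0.2-second delay are assumptions chosen for illustration.

```python
import asyncio
import time


async def blocking_job():
    time.sleep(0.2)           # blocking: the event loop cannot switch tasks


async def async_job():
    await asyncio.sleep(0.2)  # non-blocking: control returns to the loop


async def timed(job, n=3):
    # Run n copies of the job "concurrently" and measure the wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(job() for _ in range(n)))
    return time.perf_counter() - start


blocking_time = asyncio.run(timed(blocking_job))
async_time = asyncio.run(timed(async_job))
print(f'blocking: {blocking_time:.2f}s, async: {async_time:.2f}s')
```

The blocking version takes roughly the sum of the three delays, while the awaitable version takes roughly one delay.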

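Workflow for a multi-task asynchronous crawler with aiohttp

Environment setup
```
pip install aiohttp
```

Coding workflow:

Rough skeleton (not yet runnable):

```
with aiohttp.ClientSession() as s:
    # s.get(url, headers=..., params=..., proxy="http://ip:port")
    with s.get(url) as response:
        # response.read() returns bytes, like .content in requests
        page_text = response.text()
        return page_text
```

Details to add:

Put async in front of every with, because the session and the response are asynchronous context managers
Put await in front of every blocking operation

```
# Body of a special (async def) function
async with aiohttp.ClientSession() as s:
    # s.get(url, headers=..., params=..., proxy="http://ip:port")
    async with s.get(url) as response:
        # response.read() returns bytes (.content in requests)
        page_text = await response.text()
        return page_text
```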

Code example:

```
import asyncio
import time

import aiohttp
from bs4 import BeautifulSoup

# Collect all URLs to request in one list (a local test server)
urls = ['http://127.0.0.1:5000/bobo','http://127.0.0.1:5000/jay','http://127.0.0.1:5000/tom']
start = time.time()

async def get_request(url):
    async with aiohttp.ClientSession() as s:
        # s.get(url, headers=..., params=..., proxy="http://ip:port")
        async with s.get(url) as response:
            # response.read() returns bytes (.content in requests)
            page_text = await response.text()
            return page_text

def parse(task):
    page_text = task.result()
    soup = BeautifulSoup(page_text, 'lxml')
    data = soup.find('div', class_="tang").text
    print(data)

loop = asyncio.new_event_loop()
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c, loop=loop)
    task.add_done_callback(parse)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))
loop.close()
print('Total time:', time.time() - start)
```

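3. Using selenium

How selenium relates to crawling:
- Simulated login
- Convenient capture of dynamically loaded data
  
  Advantage: what you see is what you get
  
  Drawback: slow
Concept and installation
- Concept: a module for browser automation.
- Installation:
```
pip install selenium
```
- The examples below use the Selenium 3 API (find_element_by_*, executable_path). Later Selenium 4 releases removed these in favor of find_element(By.ID, ...) and a Service object.
Using selenium in practice

Prepare the browser driver: http://chromedriver.storage.googleapis.com/index.html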

selenium demo program

```
from selenium import webdriver
from time import sleep

# Path to your browser driver; the r'' prefix keeps backslashes in Windows paths from being treated as escapes
driver = webdriver.Chrome(r'chromedriver')
# Open the Baidu home page
driver.get("http://www.baidu.com")
# Find the "设置" (Settings) link on the page and click it
driver.find_elements_by_link_text('设置')[0].click()
sleep(2)
# In the settings menu, open "搜索设置" (Search settings)
driver.find_elements_by_link_text('搜索设置')[0].click()
sleep(2)
# Select 50 results per page (the third option of the "nr" drop-down)
m = driver.find_element_by_id('nr')
sleep(2)
m.find_element_by_xpath('.//option[3]').click()
sleep(2)
# Click "save settings"
driver.find_elements_by_class_name("prefpanelgo")[0].click()
sleep(2)
# Handle the alert that pops up: accept() confirms, dismiss() cancels
driver.switch_to.alert.accept()
sleep(2)
# Find Baidu's search box and type the keyword
driver.find_element_by_id('kw').send_keys('美女')
sleep(2)
# Click the search button
driver.find_element_by_id('su').click()
sleep(2)
# On the results page, find the "美女_百度图片" link and open it
driver.find_elements_by_link_text('美女_百度图片')[0].click()
sleep(3)
# Close the browser
driver.quit()
```

Basic selenium commands

```
from selenium import webdriver

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
# Send a request (url is the address of the page to open)
bro.get(url)
# Locating elements
# By xpath
search = bro.find_element_by_xpath('//input[@id="key"]')
# By id
search = bro.find_element_by_id('key')
# By class name (find_elements_* returns a list of matches)
buttons = bro.find_elements_by_class_name('prefpanelgo')
# Type text into the located element
search.send_keys('mac pro')
# Simulate a click
search.click()
# JS injection
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# Handle an alert: accept() confirms, dismiss() cancels
bro.switch_to.alert.accept()
# switch_to.frame switches into the given sub-page (iframe)
bro.switch_to.frame('iframeResult')
# Capture the source of the current page
page_text = bro.page_source
# Save a screenshot of the current page
bro.save_screenshot('123.png')
# Close the browser
bro.quit()
```

Simple selenium usage example:

```
from selenium import webdriver
from time import sleep

# Instantiate a browser object using the browser driver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
# Send the request
url = 'https://www.jd.com/'
bro.get(url)
sleep(1)
# Locate the element
# bro.find_element_by_xpath('//input[@id="key"]')
search = bro.find_element_by_id('key')
search.send_keys('mac pro')   # type text into the located element
sleep(2)
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# JS injection: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# Capture the source of the current page
page_text = bro.page_source
print(page_text)
sleep(3)
bro.quit()
```

Example: capturing dynamically loaded data

http://125.35.6.84:81/xk/: crawl the company names on the first 3 pages of the drug administration (药监总局) site

```
from selenium import webdriver
from lxml import etree
from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
page_text = bro.page_source
all_page_text = [page_text]

# Click "next page" twice to reach pages 2 and 3
for i in range(2):
    # Locate the element
    nextPage = bro.find_element_by_xpath('//*[@id="pageIto_next"]')
    # Click it
    nextPage.click()
    sleep(1)
    all_page_text.append(bro.page_source)

# Parse the captured pages
for page_text in all_page_text:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)

sleep(2)
bro.quit()
```

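4. Action chains

Concept and workflow
- ActionChains: a series of chained actions
  
  The action chain object (action) and the browser object (bro) are separate objects; the chain is bound to a browser when it is created
- Workflow:
  1. Instantiate an action chain object, binding it to the target browser
  2. Queue the consecutive actions to perform
  3. perform() immediately executes the queued actions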

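Sample code:

```
from selenium import webdriver
from selenium.webdriver import ActionChains # action chains
from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# NoSuchElementException is raised when the target element lives inside an iframe
# Fix: use switch_to.frame to switch into that sub-page first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_xpath('//*[@id="draggable"]')

# Instantiate an action chain object
action = ActionChains(bro)
action.click_and_hold(div_tag).perform()   # press and hold the element

# perform() executes the queued actions immediately
for i in range(5):
    # A fresh chain per step, so actions queued earlier are not replayed
    ActionChains(bro).move_by_offset(xoffset=15, yoffset=15).perform()
    sleep(2)

# release() is only queued; it takes effect once perform() is called
ActionChains(bro).release().perform()
sleep(5)
bro.quit()
```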
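
The loop above moves by a fixed 15 pixels per step. When a drag has to end at an exact distance, the per-step offsets need to add up to that distance. The helper below is a hypothetical sketch (drag_steps is not part of selenium) that splits a total displacement into whole-pixel steps whose sum matches the target exactly.

```python
def drag_steps(total_x, total_y, n):
    """Split a drag of (total_x, total_y) pixels into n whole-pixel offsets."""
    steps = []
    done_x = done_y = 0
    for i in range(1, n + 1):
        # Position that should be reached after step i, rounded to a pixel.
        target_x = round(total_x * i / n)
        target_y = round(total_y * i / n)
        steps.append((target_x - done_x, target_y - done_y))
        done_x, done_y = target_x, target_y
    return steps


steps = drag_steps(75, 75, 5)
print(steps)
```

Each (dx, dy) pair can then be passed to move_by_offset between click_and_hold and release.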

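5. Analysis of simulated login on 12306

Simulated login workflow:
1. Save a screenshot of the current browser page
2. Crop the captcha region out of the screenshot
  - Get the position of the captcha element on the page
  - Compute the rectangle that covers it
  - Crop that rectangle with the Image tool
3. Send the captcha to a captcha-solving platform, which returns the coordinates to click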
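
Steps 2 and 3 contain two small pure computations that can be sketched without a browser: turning an element's position and size into a crop rectangle, and turning the platform's answer into a list of points. Both helper names are hypothetical. The scale parameter is an assumption added here for high-DPI displays, where screenshot pixels and page coordinates may differ; with the default of 1.0 the rectangle is simply position plus size.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower), the box format used by PIL's Image.crop.

    location and size are dicts shaped like a selenium element's .location
    and .size; scale is an assumed device-pixel ratio (1.0 on a standard display).
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return (left, upper, right, lower)


def parse_points(result):
    """Turn 'x1,y1|x2,y2' into [[x1, y1], [x2, y2]]; a single point also works."""
    return [[int(v) for v in point.split(',')] for point in result.split('|')]


box = crop_box({'x': 100, 'y': 200}, {'width': 300, 'height': 190})
points = parse_points('77,83|253,153')
print(box)
print(points)
```

Splitting on '|' handles the single-point case as well, because a string without '|' yields a one-item list.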

Code example:

```
from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep
from PIL import Image  # install Pillow, the maintained fork of PIL
from CJY import Chaojiying_Client  # local module CJY.py holding the Chaojiying client class

# Wrap captcha recognition in a function
def transformCode(imgPath, imgType):
    # Replace the placeholders with your own Chaojiying username, password and software ID
    chaojiying = Chaojiying_Client('username', 'password', 'soft_id')
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, imgType)['pic_str']

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(2)

# Save a screenshot of the current browser page
bro.save_screenshot('./main.png')

# Crop the captcha region out of the screenshot
# Get the position of the captcha element on the page
img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = img_tag.location   # coordinates of the element's top-left corner
size = img_tag.size   # width and height of the element
# Rectangle to crop: (left, upper, right, lower)
rectangle = (int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
# Crop that rectangle with the Image tool
img = Image.open('./main.png')
frame = img.crop(rectangle)   # crop() cuts the given rectangle out of the image
frame.save('code.png')

# Send the captcha to the captcha-solving platform
result = transformCode('./code.png', 9004)
print(result)  # x1,y1|x2,y2|x3,y3

# 'x1,y1|x2,y2|x3,y3' ==> [[x1,y1],[x2,y2],[x3,y3]]
# A single point contains no '|', so split('|') yields a one-item list
all_list = [[int(v) for v in point.split(',')] for point in result.split('|')]

for x, y in all_list:
    # Click each point; in the Selenium 3 API used here the offsets are measured from the image's top-left corner
    ActionChains(bro).move_to_element_with_offset(img_tag, x, y).click().perform()
    sleep(1)

bro.find_element_by_id('username').send_keys('xxxxxx')
sleep(1)
bro.find_element_by_id('password').send_keys('xxxx')
sleep(1)
bro.find_element_by_id('loginSub').click()
sleep(10)
print(bro.page_source)
bro.quit()
```

6. Evading selenium detection

Checking whether a site can detect selenium
1. In a normally opened browser, evaluating window.navigator.webdriver in the console returns undefined (or false, depending on the browser version)
2. In a page opened by selenium, the same expression returns true

Example code for evading detection:

```
# Evade detection
from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

bro = webdriver.Chrome(executable_path='./chromedriver.exe', options=option)
url = 'https://www.taobao.com/'
bro.get(url)
```

7. Headless browsers

Available headless browsers
- PhantomJS
- Headless Chrome

Headless browser example:

```
# Headless browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(executable_path='./chromedriver.exe', options=chrome_options)
url = 'https://www.taobao.com/'
bro.get(url)
sleep(2)
bro.save_screenshot('123.png')
print(bro.page_source)
bro.quit()   # close the browser so no headless process is left running
```

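Summary:

Network request modules: requests / urllib / aiohttp
Differences between aiohttp and requests:
- Proxies: requests takes a proxies dict, aiohttp takes a single proxy string
- Binary content: requests uses response.content, aiohttp uses await response.read()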
