Writing Web Crawlers in Python 3, Part 17: CAPTCHA Recognition

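I. CAPTCHA Recognition

1. Recognizing image CAPTCHAs

Recognizing an image CAPTCHA relies on OCR (optical character recognition: scanning characters and translating their shapes into electronic text). Here we use the tesserocr library.
An example is the CNKI registration page: http://my.cnki.net/elibregister/commonRegister.aspx
tesserocr is a Python OCR library, but it is really just a Python API wrapper around tesseract, so tesseract is the actual engine.
tesseract therefore has to be installed before tesserocr.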

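Download:

https://digi.bib.uni-mannheim.de/tesseract/

Builds with "dev" in the name are development versions; those without it are stable versions. Double-click the downloaded installer to install.
During installation, tick "Additional language data" to include the language packs.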

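Installing tesserocr

pip install tesserocr pillow

If this fails, check which wheel tags your pip supports. In a Python shell (this internal module exists only in older pip releases; on newer ones, run pip debug --verbose from the command line instead):

import pip
import pip._internal
print(pip._internal.pep425tags.get_supported())

Then download the build matching your version from the page below and install it:

https://github.com/simonflueckiger/tesserocr-windows_build/releases

To verify, run import tesserocr; if no error is raised, the installation succeeded.

For pillow, run pip install pillow and verify with import PIL.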

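1.1 Recognition test

Download a CAPTCHA image, rename it to code.jpg, and put it in the same directory as the Python script.

The code:

# Recognize the CAPTCHA in code.jpg
import tesserocr
from PIL import Image

image = Image.open('code.jpg')  # create an Image object
result = tesserocr.image_to_text(image)  # call image_to_text() with the Image object
print(result)

tesserocr also offers a simpler method that converts an image file directly to a string.
Example:

import tesserocr

print(tesserocr.file_to_text('code.jpg'))  # recognition is less accurate than with the previous method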

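1.2 Preprocessing the CAPTCHA

Download another image, name it code1.jpg, and run the code above on it again.
You will see that extra lines drawn across the image reduce recognition accuracy.

In that case the image needs further processing, such as grayscale conversion and binarization.
Calling the Image object's convert() method with the argument 'L' converts the image to grayscale.
Example:

image = image.convert('L')
image.show()

Passing the argument '1' binarizes the image:

image = image.convert('1')
image.show()

However, this method uses a fixed threshold of 127 (and, unless dithering is disabled, Pillow also dithers the result).
To choose the threshold yourself, do not convert the original image directly: first convert it to grayscale, then apply the binarization threshold you want.
Example: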

image = image.convert('L')
threshold = 80  # binarization threshold
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

image = image.point(table, '1')
image.show()

The interfering lines are gone and the CAPTCHA is now cleanly black and white. Run recognition again.

Example:

import tesserocr
from PIL import Image

image = Image.open('code1.jpg')
image = image.convert('L')
threshold = 127
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

image = image.point(table, '1')
image.show()
result = tesserocr.image_to_text(image)
print(result)
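The thresholding step used above can be wrapped in a small helper so that different thresholds are easy to try. This is only a sketch: the names build_threshold_table and binarize are made up for illustration, and binarize assumes it receives a Pillow Image object.

```python
def build_threshold_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 1 otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image, threshold=127):
    """Convert a Pillow Image to grayscale, then to a 1-bit image.

    `image` is assumed to be a PIL.Image.Image; nothing is imported here
    because only methods of the passed-in object are called.
    """
    return image.convert('L').point(build_threshold_table(threshold), '1')
```

For example, binarize(Image.open('code1.jpg'), threshold=80) would give the same result as the loop-based version with threshold = 80.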

For images with interference, grayscale conversion and binarization can improve recognition accuracy.
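The string returned by OCR often carries spaces or a trailing newline in addition to the characters themselves. A minimal cleanup sketch follows, assuming the CAPTCHA contains only ASCII letters and digits (an assumption, not something stated above); the name clean_ocr_text is hypothetical.

```python
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r'[^0-9A-Za-z]', '', raw)
```

It would be applied to the OCR result before submitting it, for example clean_ocr_text(result).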

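2. Recognizing Geetest slider CAPTCHAs

As shown above, tesserocr can handle simple image CAPTCHAs, but new kinds of CAPTCHA have appeared in recent years.
A representative one is the Geetest CAPTCHA, which requires dragging a slider to fit a puzzle piece into place. Compared with image CAPTCHAs,
it is several levels harder to recognize. Sites such as Meizu and Douyu use it.

Make sure selenium is installed, with Chrome as the browser and ChromeDriver configured.

Geetest's official site is http://www.geetest.com/. It is a system dedicated to verification security; its main verification method is dragging a slider to complete an image.
If the image is completed exactly, verification succeeds.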

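2.1 Characteristics of the Geetest CAPTCHA

The Geetest CAPTCHA is harder to recognize than an image CAPTCHA. In version 3.0 you first click a button for smart verification; if that fails, a slider verification window pops up.
You drag the slider to complete the image. Three encrypted parameters are then generated and submitted to the backend through a form, and the backend verifies them once more.

Geetest also uses machine learning to analyze the drag trajectory. The official site describes its protection as follows:

1. Triangle protection: anti-simulation. Malicious programs imitate human movement trajectories to pass the CAPTCHA. Against simulation, Geetest holds a massive data set of 40 million human and machine behavior samples,
and uses machine learning and neural networks to build multiple online and offline, static and dynamic defense models that recognize simulated trajectories and draw the line between human and machine.

2. Triangle protection: anti-forgery. Malicious programs forge the device and browser environment to pass the CAPTCHA. Against forgery, Geetest uses "device gene" technology to analyze in depth
the browser's actual capabilities and identify forged information, and it continually updates its blacklist based on forgery events, greatly improving its anti-forgery capability.

3. Triangle protection: anti-brute-force. Malicious programs launch dense attacks in a short time to brute-force the CAPTCHA. Against brute force, Geetest offers multiple verification forms,
each backed by a massive image library generated with neural networks; every image is unique and the library is continually updated, which greatly raises the cost of brute-force recognition.

Compared with ordinary verification methods, Geetest is also more convenient and user-friendly:

1. Click verification takes only 0.4 seconds

2. Compatible with all platforms

3. Future-oriented

Compared with ordinary CAPTCHAs, the Geetest CAPTCHA is much improved in both security and usability.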

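2.2 Recognition approach

For a site that uses the Geetest CAPTCHA, simulating the form submission directly runs into the problem of constructing the encrypted parameters: you would have to analyze its encryption and validation logic, which is tedious.
So we simulate browser actions directly to complete the verification, which costs far less than working out the encryption algorithm.

Example: 中国保温网 (cnbaowen.net)

http://www.cnbaowen.net/api/geetest/

Recognition takes three steps:
1. Simulate clicking the verification button
2. Locate the gap for the slider
3. Simulate dragging the slider

Step 1 is simple: selenium can simulate the click directly.
Step 2, locating the gap, is the key part and needs image processing. Look at the gap: its border shows a clear break, and the border differs visibly from its surroundings.
An edge-detection algorithm can therefore find the gap's position.
Step 3 looks simple but has many pitfalls. Geetest checks for machine-like trajectories: moving at constant speed or at random speeds does not pass; only a trajectory that closely imitates a human's does.
A human typically accelerates first and then decelerates, and this has to be simulated to pass.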

2.3 Implementation

Initialization. Test URL:

http://www.cnbaowen.net/api/geetest/

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait  # waits for elements to load
from selenium.webdriver.common.action_chains import ActionChains  # drag actions
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from PIL import Image
import requests
import time
import re
import random
from io import BytesIO

def merge_image(image_file, location_list):
    """
    Reassemble the scrambled image.
    :param image_file: file-like object holding the scrambled image
    :param location_list: list of {'x': ..., 'y': ...} background offsets, one per tile
    :return: the reassembled Image
    """
    im = Image.open(image_file)
    im.save('Code.jpg')
    new_im = Image.new('RGB', (260, 116))
    # Cut the scrambled image into 52 small tiles
    im_list_upper = []
    im_list_down = []
    # print(location_list)
    for location in location_list:
        # print(location['y'])
        if location['y'] == -58:  # upper half
            im_list_upper.append(im.crop((abs(location['x']), 58, abs(location['x']) + 10, 116)))
        if location['y'] == 0:  # lower half
            im_list_down.append(im.crop((abs(location['x']), 0, abs(location['x']) + 10, 58)))
    x_offset = 0
    for im in im_list_upper:
        new_im.paste(im, (x_offset, 0))  # paste the tile onto the new blank image
        x_offset += im.size[0]
    x_offset = 0
    for im in im_list_down:
        new_im.paste(im, (x_offset, 58))
        x_offset += im.size[0]
    new_im.show()  # show the result once, after all tiles are pasted
    return new_im
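The two crop() calls in merge_image follow one rule: each tile is 10 pixels wide and 58 pixels high, and a background offset of y = -58 selects the strip that starts 58 pixels down in the scrambled image. The sketch below isolates that rule as a pure function so it can be checked without a browser. The name tile_crop_box and the sample offsets are made up for illustration.

```python
def tile_crop_box(location, tile_width=10, tile_height=58):
    """Return the (left, top, right, bottom) crop box for one tile.

    `location` is a dict like {'x': -157, 'y': -58}, i.e. the CSS
    background-position offsets read from the tile's style attribute.
    """
    left = abs(location['x'])
    # y == -58 means the tile shows the strip starting 58 px down;
    # y == 0 means it shows the strip starting at the top.
    top = tile_height if location['y'] == -tile_height else 0
    return (left, top, left + tile_width, top + tile_height)


print(tile_crop_box({'x': -157, 'y': -58}))
print(tile_crop_box({'x': -145, 'y': 0}))
```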

def get_image(driver, div_path):
    '''
    Download the scrambled image, then reassemble it into the complete image.
    :param driver: the selenium WebDriver
    :param div_path: XPath of the div tiles that make up the image
    :return: the reassembled Image
    '''
    time.sleep(2)
    background_images = driver.find_elements_by_xpath(div_path)
    location_list = []
    for background_image in background_images:
        location = {}
        # raw string, so the backslashes reach the regex engine unchanged
        result = re.findall(r'background-image: url\("(.*?)"\); background-position: (.*?)px (.*?)px;', background_image.get_attribute('style'))
        # print(result)
        location['x'] = int(result[0][1])
        location['y'] = int(result[0][2])
        image_url = result[0][0]
        location_list.append(location)
    print('==================================')
    # the URL looks like http://static.geetest.com/pictures/gt/579066de6/579066de6.webp; request the jpg version instead
    image_url = image_url.replace('webp', 'jpg')
    image_result = requests.get(image_url).content
    # with open('1.jpg', 'wb') as f:
    #     f.write(image_result)
    image_file = BytesIO(image_result)  # a scrambled image
    image = merge_image(image_file, location_list)
    return image
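To see what the regular expression in get_image extracts, it can be run on a single style string outside the browser. This is a sketch: parse_tile_style is a hypothetical helper, and the sample style string is constructed for illustration from the URL pattern mentioned in the code comment, with invented offsets.

```python
import re

# Same pattern as in get_image, compiled once.
STYLE_PATTERN = re.compile(
    r'background-image: url\("(.*?)"\); background-position: (.*?)px (.*?)px;'
)


def parse_tile_style(style):
    """Extract the image URL and x/y offsets from one tile's style attribute."""
    match = STYLE_PATTERN.search(style)
    if match is None:
        return None  # the style did not have the expected shape
    url, x, y = match.groups()
    return {'url': url, 'x': int(x), 'y': int(y)}


sample = ('background-image: url("http://static.geetest.com/pictures/gt/'
          '579066de6/579066de6.webp"); background-position: -157px -58px;')
print(parse_tile_style(sample))
```

Returning None for an unexpected style makes a layout change on the page easier to notice than an IndexError from result[0].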

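def get_track(distance):
    '''
    Build the movement track, imitating a human drag: uniform acceleration first, then uniform deceleration.
    Basic formulas of uniformly accelerated motion:
    ① v = v0 + at
    ② s = v0t + (1/2)at²
    ③ v² - v0² = 2as
    :param distance: the distance to move
    :return: a list of the distance moved in each 0.2-second step
    '''
    # initial velocity
    v = 0
    # the track is sampled in time steps of 0.2 s; each entry is the displacement within one step
    t = 0.2
    # displacement/track list; each element is the displacement over 0.2 s
    tracks = []
    # current displacement
    current = 0
    # start decelerating once mid is reached
    mid = distance * 7 / 8
    distance += 10  # overshoot a little first, then slide back at the end
    # a = random.randint(1, 3)
    while current < distance:
        if current < mid:
            # the smaller the acceleration, the smaller each step's displacement and the finer the simulated track
            a = random.randint(2, 4)  # accelerate
        else:
            a = -random.randint(3, 5)  # decelerate
        # initial velocity for this step
        v0 = v
        # displacement within 0.2 s
        s = v0 * t + 0.5 * a * (t ** 2)
        # current position
        current += s
        # append to the track list
        tracks.append(round(s))
        # velocity reached at the end of this step; it becomes the next step's initial velocity
        v = v0 + a * t
    # slide back to roughly the right position
    for i in range(4):
        tracks.append(-random.randint(2, 3))
    for i in range(4):
        tracks.append(-random.randint(1, 3))
    return tracks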
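Because every step is rounded and the backward steps are random, the offsets returned by get_track generally do not add up to exactly the requested distance; the code only aims for "roughly" the right position. If an exact landing point is wanted, one option is to append a single corrective step. This is a sketch of that idea only: close_track_gap is a hypothetical helper, and whether an exact landing helps pass the verification is not something the text above establishes.

```python
def close_track_gap(tracks, distance):
    """Return a copy of `tracks` whose offsets sum to exactly `distance`.

    If the offsets already add up, the copy is returned unchanged;
    otherwise one final step covering the remainder is appended.
    """
    remainder = distance - sum(tracks)
    if remainder == 0:
        return list(tracks)
    return list(tracks) + [remainder]


print(close_track_gap([3, 4, 5], 10))
```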

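def get_distance(image1, image2):
    '''
    Get the distance the slider needs to move.
    :param image1: Image object without the gap
    :param image2: Image object with the gap
    :return: the distance to move
    '''
    # The comparison is symmetric, so the order of the two images does not matter.
    # print('size', image1.size)
    threshold = 60
    for i in range(0, image1.size[0]):  # scan columns from left to right
        for j in range(0, image1.size[1]):  # scan each column from top to bottom
            pixel1 = image1.getpixel((i, j))
            pixel2 = image2.getpixel((i, j))
            res_R = abs(pixel1[0] - pixel2[0])  # difference in the R channel
            res_G = abs(pixel1[1] - pixel2[1])  # difference in the G channel
            res_B = abs(pixel1[2] - pixel2[2])  # difference in the B channel
            if res_R > threshold and res_G > threshold and res_B > threshold:
                return i  # the distance to move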
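The comparison rule in get_distance can be tried on plain nested lists, without Pillow or a browser. The sketch below uses the same rule (all three channels must differ by more than the threshold) and makes the "nothing found" case explicit by returning None, which get_distance also does implicitly when its loops finish without a match. The names pixels_differ and find_gap_x are hypothetical.

```python
def pixels_differ(pixel1, pixel2, threshold=60):
    """True if R, G and B all differ by more than `threshold`."""
    return all(abs(a - b) > threshold for a, b in zip(pixel1[:3], pixel2[:3]))


def find_gap_x(columns1, columns2, threshold=60):
    """Return the x index of the first column containing a differing pixel.

    `columns1[x][y]` and `columns2[x][y]` are (R, G, B) tuples.
    Returns None when the two images never differ enough.
    """
    for x, (col1, col2) in enumerate(zip(columns1, columns2)):
        for p1, p2 in zip(col1, col2):
            if pixels_differ(p1, p2, threshold):
                return x
    return None
```

A caller can then check for None and reload the CAPTCHA instead of passing None on to the track-building step.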

def main_check_code(driver, element):
    """
    Drag the slider to solve the CAPTCHA.
    :param driver: the selenium WebDriver
    :param element: the slider knob element
    :return:
    """
    image1 = get_image(driver, '//div[@class="gt_cut_bg gt_show"]/div')
    image2 = get_image(driver, '//div[@class="gt_cut_fullbg gt_show"]/div')
    # x coordinate of the gap in the image
    # 2. Compare the RGB values of the pixels in the two images; the x value of the first differing pixel is the distance to move
    l = get_distance(image1, image2)
    print('l=', l)
    # 3. Get the movement track
    track_list = get_track(l)
    print('Step 1: click and hold the slider')
    ActionChains(driver).click_and_hold(on_element=element).perform()  # press the left mouse button and hold it
    time.sleep(2)
    print('Step 2: drag the element')
    for track in track_list:
        ActionChains(driver).move_by_offset(xoffset=track, yoffset=0).perform()  # move the mouse by (x, y) relative to its current position
        time.sleep(0.002)
    ActionChains(driver).move_by_offset(xoffset=-random.randint(2, 5), yoffset=0).perform()
    time.sleep(2)
    print('Step 3: release the mouse')
    ActionChains(driver).release(on_element=element).perform()
    time.sleep(5)

def main_check_slider(driver):
    """
    Wait until the slider knob has loaded.
    :param driver: the selenium WebDriver
    :return: the slider knob element
    """
    while True:
        try:
            driver.get('http://www.cnbaowen.net/api/geetest/')
            element = WebDriverWait(driver, 30, 0.5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'gt_slider_knob')))
            if element:
                return element
        except TimeoutException as e:
            print('Timed out, retrying')
            time.sleep(5)

if __name__ == '__main__':
    try:
        count = 6  # try at most 6 times
        driver = webdriver.Chrome()
        # wait for the slider knob to finish loading
        element = main_check_slider(driver)
        while count > 0:
            main_check_code(driver, element)
            time.sleep(2)
            try:
                success_element = (By.CSS_SELECTOR, '.gt_holder .gt_ajax_tip.gt_success')
                # look for the success marker
                print('suc=', driver.find_element_by_css_selector('.gt_holder .gt_ajax_tip.gt_success'))
                success_images = WebDriverWait(driver, 20).until(EC.presence_of_element_located(success_element))
                if success_images:
                    print('Recognition succeeded')
                    count = 0
                    break
            except NoSuchElementException as e:
                print('Recognition failed, trying again')
                count -= 1
                time.sleep(2)
        else:
            print('too many attempt check code ')
            exit('Exiting')
    finally:
        driver.close()

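3. Recognizing click CAPTCHAs

Besides the Geetest CAPTCHA, another common and widely used type is the click CAPTCHA, used for example by 12306.
You click the items in the image that match the prompt; verification succeeds only if every answer is correct, and a single mistake fails it.

Example:

https://www.jianshu.com/sign_in

Recognition approach:
Recognizing this CAPTCHA by image recognition alone is very hard. It requires both text recognition and image recognition, and the image background interferes so much that OCR recognizes almost nothing.
If you target the white text directly, the colors change as soon as a new CAPTCHA is loaded.

Using a CAPTCHA-solving platform:

http://www.chaojiying.com/user/reg/

Register an account.
Go to the user center and apply for a software ID.
Follow the WeChat account or buy credits.

http://www.chaojiying.com/api-14.html

Download the platform's API client.

Example: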

#!/usr/bin/env python
# coding:utf-8

import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()  # the API expects the MD5 hex digest of the password
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the wrongly recognized CAPTCHA
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


if __name__ == '__main__':
    # User Center >> Software ID: generate one and pass it as the third argument (the official sample uses 96001)
    chaojiying = Chaojiying_Client('your Chaojiying username', 'your Chaojiying password', '')
    # replace a.jpg with the path to a local image file; on Windows you may sometimes need //
    im = open('a.jpg', 'rb').read()
    # 1902 is the CAPTCHA type, see Official site >> Pricing; Python 3 requires parentheses after print
    print(chaojiying.PostPic(im, 1902))

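This defines a Chaojiying_Client class whose constructor takes three arguments: the Chaojiying username, that account's password, and the software ID.
The most important method is PostPic, which takes the image bytes and the CAPTCHA type code, sends the image data to the Chaojiying backend for recognition,
and returns the recognition result as JSON.
ReportError is the method to call when a recognition turns out to be wrong; calling it refunds the credits spent on that request.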

Initialization

import time
from PIL import Image
from selenium import webdriver
from selenium.webdriver import ActionChains
from chaojiying import Chaojiying_Client  # assumes the client class above was saved as chaojiying.py


def crack():
    # save a screenshot of the page
    browser.save_screenshot('222.jpg')
    # get the CAPTCHA confirm button
    button = browser.find_element_by_xpath(xpath='//div[@class="geetest_panel"]/a/div')
    # get the position of the CAPTCHA image
    img1 = browser.find_element_by_xpath(xpath='//div[@class="geetest_widget"]')
    location = img1.location
    size = img1.size
    top, bottom = location['y'], location['y'] + size['height']
    left, right = location['x'], location['x'] + size['width']
    print('Image width:', img1.size['width'])
    print(top, bottom, left, right)
    # crop the CAPTCHA out of the page screenshot using its position, and save it
    img_1 = Image.open('222.jpg')
    capcha1 = img_1.crop((left, top, right, bottom - 54))
    capcha1.save('tu1-1.png')
    # call the Chaojiying API to recognize the image (the result is a dict)
    cjy = Chaojiying_Client('liuxiaosong', '', '')
    im = open('tu1-1.png', 'rb').read()
    content = cjy.PostPic(im, 9004)
    print(content)
    # extract the coordinates of the characters to click
    positions = content.get('pic_str').split('|')
    locations = [[int(number) for number in group.split(",")] for group in positions]
    print(positions)
    print(locations)
    # simulate mouse clicks on the CAPTCHA image at those coordinates
    for location1 in locations:
        print(location1)
        ActionChains(browser).move_to_element_with_offset(img1, location1[0], location1[1]).click().perform()
        time.sleep(1)
    button.click()
    time.sleep(1)
    # retry on failure
    lower = browser.find_element_by_xpath('//div[@class="geetest_table_box"]/div[2]').text
    print('Result:', lower)
    # the Chinese string below is the failure message shown on the page, so it must stay as-is
    if lower != '验证失败 请按提示重新操作' and lower is not None:
        print('Login succeeded')
        time.sleep(3)
    else:
        time.sleep(3)
        print('Login failed')
        # after a failed login, report the error so the platform does not charge credits for this recognition
        pic_id = content.get('pic_id')
        print('Image id:', pic_id)
        cjy = Chaojiying_Client('liuxiaosong', '', '')
        cjy.ReportError(pic_id)
        crack()


if __name__ == '__main__':
    browser = webdriver.Chrome()
    browser.get('https://www.jianshu.com/sign_in')
    browser.save_screenshot('login.png')
    # fill in the form and click log in so that the CAPTCHA appears on the page
    login = browser.find_element_by_id('sign-in-form-submit-btn')
    username = browser.find_element_by_id('session_email_or_mobile_number')
    password = browser.find_element_by_id('session_password')
    username.send_keys('')
    time.sleep(1)
    password.send_keys('')
    time.sleep(2)
    login.click()
    time.sleep(10)
    crack()
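The coordinate-parsing step in crack() can be checked on its own. The sketch below assumes, as the split on '|' and ',' in the code above implies, that pic_str has the form x1,y1|x2,y2; the sample value is invented, and parse_click_points is a hypothetical helper name.

```python
def parse_click_points(pic_str):
    """Turn 'x1,y1|x2,y2' into [(x1, y1), (x2, y2)]."""
    points = []
    for group in pic_str.split('|'):
        x, y = group.split(',')
        points.append((int(x), int(y)))
    return points


print(parse_click_points('132,127|56,77'))
```

Each tuple is an offset relative to the cropped CAPTCHA image, which is why the code above clicks with move_to_element_with_offset on the image element.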

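II. Using Proxies

Earlier parts introduced several request libraries: requests, urllib, selenium, and so on. The sections below show how to set a proxy for each of them.

1. Getting a proxy

There are many free proxies online, such as Xici (http://www.xicidaili.com/), but most free proxies work poorly; the most reliable option is to buy a paid proxy.

If proxy software is installed on your machine, it usually starts local HTTP and SOCKS proxy services, and you can use those directly.

The examples below use a local proxy (substitute your own working proxy). After setting the proxy, the test URL is http://httpbin.org/get; requesting it returns information about the request, and the origin field is the client's IP.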
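To confirm that a proxy is actually in effect, the origin field can be read from the response body and compared with your real IP. A minimal sketch follows; extract_origin is a hypothetical helper, and the sample body is a shortened, made-up response using a documentation-range IP address.

```python
import json


def extract_origin(body):
    """Return the 'origin' field of an httpbin /get response body, or None."""
    return json.loads(body).get('origin')


sample_body = (
    '{"args": {}, "headers": {"Host": "httpbin.org"}, '
    '"origin": "203.0.113.5", "url": "http://httpbin.org/get"}'
)
print(extract_origin(sample_body))
```

With any of the libraries below, passing the response text to extract_origin should show the proxy's IP rather than your own when the proxy is working.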

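2. urllib

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = '127.0.0.1:14155'
proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'https://' + proxy
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Here ProxyHandler sets the proxy. Its argument is a dict whose keys are protocol names and whose values are proxy URLs.
After creating the ProxyHandler object, pass it to build_opener() to create an opener, then send the request with the opener's open() method.

If the proxy requires authentication, only the proxy variable changes: put the username and password in front of the proxy address, for example username:password@127.0.0.1:14155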
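When the username or password contains characters such as @ or :, putting them into the proxy string verbatim makes the URL ambiguous. Percent-encoding the credentials is the standard way around this. The sketch below builds such a URL; build_proxy_url is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme='http'):
    """Build a proxy URL, percent-encoding the credentials if present."""
    if username is None:
        return f'{scheme}://{host}:{port}'
    # safe='' makes quote() encode every reserved character, including '@' and ':'
    credentials = f"{quote(username, safe='')}:{quote(password or '', safe='')}"
    return f'{scheme}://{credentials}@{host}:{port}'


print(build_proxy_url('127.0.0.1', 14155, 'username', 'p@ss:word'))
```

The resulting string can be used as a value in the dict passed to ProxyHandler.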

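3. requests

Setting a proxy is simpler than with urllib: just pass the proxies argument.

import requests

proxy = '127.0.0.1:14155'
proxies = {
    'http': 'http://' + proxy,
    # if the proxy itself only speaks plain HTTP, 'http://' + proxy may be needed here as well
    'https': 'https://' + proxy,
}
try:
    response = requests.get('http://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

For a proxy that requires authentication, do the same as before: proxy = 'username:password@127.0.0.1:9743'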

4. selenium

from selenium import webdriver

proxy = '127.0.0.1:14155'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + proxy)
chrome = webdriver.Chrome(chrome_options=chrome_options)
chrome.get('http://httpbin.org/get')

A proxy that requires authentication takes more work:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import zipfile

ip = '127.0.0.1'
port = 14155
username = 'liuxiaosong'
password = ''

manifest_json = """
{
    "version": "1.0.0",
    "manifest_version": 2,
    "name": "Chrome Proxy",
    "permissions": [
        "proxy",
        "tabs",
        "unlimitedStorage",
        "storage",
        "<all_urls>",
        "webRequest",
        "webRequestBlocking"
    ],
    "background": {
        "scripts": ["background.js"]
    }
}
"""

background_js = """
var config = {
    mode: "fixed_servers",
    rules: {
        singleProxy: {
            scheme: "http",
            host: "%(ip)s",
            port: %(port)s
        }
    }
}

chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

function callbackFn(details) {
    return {
        authCredentials: {
            username: "%(username)s",
            password: "%(password)s"
        }
    }
}

chrome.webRequest.onAuthRequired.addListener(
    callbackFn,
    {urls: ["<all_urls>"]},
    ['blocking']
)
""" % {'ip': ip, 'port': port, 'username': username, 'password': password}

# pack the two files into a Chrome extension archive
plugin_file = 'proxy_auth_plugin.zip'
with zipfile.ZipFile(plugin_file, 'w') as zp:
    zp.writestr("manifest.json", manifest_json)
    zp.writestr("background.js", background_js)

chrome_options = Options()
chrome_options.add_argument("--start-maximized")
chrome_options.add_extension(plugin_file)
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get('http://httpbin.org/get')

This approach builds a small Chrome extension from a manifest.json configuration file and a background.js script that sets the proxy and answers the authentication prompt. Running the code creates a proxy_auth_plugin.zip file locally that stores these settings.

5. PhantomJS

PhantomJS has to be installed first. Download it from http://phantomjs.org/download, choosing the build for your platform.

After downloading, unzip the archive and copy phantomjs.exe from its bin directory into the Scripts directory of your Python installation, or add it to the PATH environment variable separately.

Run phantomjs in cmd; if you enter the PhantomJS prompt, the setup works.

To use it in selenium, simply replace Chrome with PhantomJS:

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://www.baidu.com')
print(browser.current_url)

This prints a warning: selenium 3.x has deprecated PhantomJS. There are two options: use Chrome in headless mode, or downgrade selenium. The first is recommended.

Example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')  # the three lines above keep Chrome from opening a window, so crawling runs headless
browser = webdriver.Chrome(chrome_options=chrome_options)

PhantomJS example:

from selenium import webdriver

service_args = [
    '--proxy=127.0.0.1:9743',
    '--proxy-type=http'
]
browser = webdriver.PhantomJS(service_args=service_args)
browser.get('http://httpbin.org/get')
print(browser.page_source)

With authentication:

from selenium import webdriver

service_args = [
    '--proxy=127.0.0.1:9743',
    '--proxy-type=http',
    '--proxy-auth=username:password'
]
browser = webdriver.PhantomJS(service_args=service_args)
browser.get('http://httpbin.org/get')
print(browser.page_source)
