I. The problem

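A while ago, when I used selenium to simulate logging in to Taobao, the slider captcha could not be passed no matter what I tried after entering the account name and password. After a long search, it turned out that selenium exposes some characteristic variables while it runs in the browser, so the site recognizes the session as a crawler and blocks the login. For example, "window.navigator.webdriver" is undefined when the browser is not driven by selenium, but it is true when selenium is running.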
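The flag described above can be read from a running session. The following is a minimal sketch: the helper name `webdriver_flag_exposed` is hypothetical, and `driver` is assumed to be a Selenium WebDriver (any object with an `execute_script` method works).

```python
def webdriver_flag_exposed(driver):
    """Return True if the page sees navigator.webdriver === true.

    `driver` is assumed to be a Selenium WebDriver; the helper name is
    illustrative and not part of Selenium itself.
    """
    value = driver.execute_script("return navigator.webdriver")
    # A browser that is not automated reports undefined (None in Python) or false.
    return value is True
```

Calling it right after `web.get(...)` gives a quick check of whether a workaround actually hid the flag on that page.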

II. The solution

1. The approach most articles online suggest

Add some options when starting the browser and manually set the webdriver property to undefined.

    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # option.add_argument('--headless')
    web = webdriver.Chrome(options=option)
    web.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})""",
    })

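However, this approach did not solve the problem effectively, so I looked online and found another one.

2. Using Python's mitmproxy library

mitmproxy is a proxy for MITM, that is, a man-in-the-middle attack. A proxy used for a man-in-the-middle attack first forwards requests like a normal proxy, so that the client and the server can still communicate. Second, it inspects and records the data it intercepts when appropriate, or tampers with that data to trigger specific behavior on the server or the client.

Install it with pip install mitmproxy

Create a new .py file with any name; here it is called modify_response.py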

# coding: utf-8
# modify_response.py
from mitmproxy import ctx


def response(flow):
    """Modify the response data."""
    if '/js/yoda.' in flow.request.url:
        # Hide the markers that the script uses to detect selenium
        for webdriver_key in ['webdriver', '__driver_evaluate', '__webdriver_evaluate', '__selenium_evaluate',
                              '__fxdriver_evaluate', '__driver_unwrapped', '__webdriver_unwrapped',
                              '__selenium_unwrapped', '__fxdriver_unwrapped', '_Selenium_IDE_Recorder', '_selenium',
                              'calledSelenium', '_WEBDRIVER_ELEM_CACHE', 'ChromeDriverw', 'driver-evaluate',
                              'webdriver-evaluate', 'selenium-evaluate', 'webdriverCommand',
                              'webdriver-evaluate-response', '__webdriverFunc', '__webdriver_script_fn',
                              '__$webdriverAsyncExecutor', '__lastWatirAlert', '__lastWatirConfirm',
                              '__lastWatirPrompt', '$chrome_asyncScriptInfo', '$cdc_asdjflasutopfhvcZLmcfl_']:
            ctx.log.info('Remove "{}" from {}.'.format(webdriver_key, flow.request.url))
            # Replace the quoted property name so the lookup in the page script finds nothing
            flow.response.text = flow.response.text.replace('"{}"'.format(webdriver_key), '"NO-SUCH-ATTR"')
            print(webdriver_key)
        flow.response.text = flow.response.text.replace('t.webdriver', 'false')
        flow.response.text = flow.response.text.replace('ChromeDriver', '')
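The core of the script is a plain string substitution on the JavaScript body, which can be tried without mitmproxy. The sketch below assumes a made-up detection snippet as input; `scrub_detection_keys` and the sample string are hypothetical names used only for illustration.

```python
def scrub_detection_keys(js_text, keys, placeholder="NO-SUCH-ATTR"):
    """Replace each double-quoted key in js_text with a placeholder name."""
    for key in keys:
        js_text = js_text.replace('"{}"'.format(key), '"{}"'.format(placeholder))
    return js_text


# Hypothetical detection code, not taken from any real site.
sample = 'if (window["__webdriver_evaluate"] || document["_selenium"]) { block(); }'
cleaned = scrub_detection_keys(sample, ["__webdriver_evaluate", "_selenium"])
print(cleaned)
# prints: if (window["NO-SUCH-ATTR"] || document["NO-SUCH-ATTR"]) { block(); }
```

Because the match includes the surrounding double quotes, a key written with single quotes or built up by string concatenation is left untouched, which is one reason this technique may not work on every script.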

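Then run the script from cmd with the command: mitmdump.exe -p PORT -s modify_response.py (replace PORT with the port number to listen on)

After that, run the selenium script, and it can log in to Taobao normally through selenium. The ChromeOptions settings from before must still be applied, and of course the browser must be configured to use the proxy.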
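The proxy setting mentioned above is not shown in the original code. One way to do it, as a sketch, is Chrome's `--proxy-server` switch. The helper name `route_through_mitmproxy` is hypothetical, the host and port are assumptions that must match the `-p` value given to mitmdump, and `--ignore-certificate-errors` is an assumed shortcut for accepting mitmproxy's certificate (installing the mitmproxy CA certificate is the alternative).

```python
def route_through_mitmproxy(options, host="127.0.0.1", port=8080):
    """Add Chrome switches that send browser traffic through a local proxy.

    `options` is assumed to be a selenium ChromeOptions instance; any object
    with an add_argument method works. Host and port are placeholders.
    """
    options.add_argument("--proxy-server=http://{}:{}".format(host, port))
    # Assumption: skip certificate validation so intercepted HTTPS pages load.
    options.add_argument("--ignore-certificate-errors")
    return options


# Intended usage (requires selenium and a running mitmdump):
#   option = ChromeOptions()
#   route_through_mitmproxy(option, port=8080)
#   web = webdriver.Chrome(options=option)
```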

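III. Code implementation

This code logs in automatically, enters a keyword, scrapes each Taobao item's name, the shop's province, the price, and the popularity figure, and saves this information to a CSV file for later data analysis. The code is as follows: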

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import csv
import re  # needed for the re.findall calls when parsing the sales text

def main():
    # Login setup
    # Open a session
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # option.add_argument('--headless')
    web = webdriver.Chrome(options=option)
    web.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})""",
    })
    web.get('https://login.taobao.com/member/login.jhtml')
    # Enter the account name and password
    web.find_element(By.XPATH, '//*[@id="fm-login-id"]').send_keys('xxxxxx')
    web.find_element(By.XPATH, '//*[@id="fm-login-password"]').send_keys('xxxxxxxx')
    web.find_element(By.XPATH, '//*[@id="login-form"]/div[4]/button').click()

    # Go to the home page
    try:
        WebDriverWait(web, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="J_SiteNavHome"]/div/a/span'))).click()
    except Exception:
        pass
    # Search for the item
    goods = input('Enter the item to search for: ')
    WebDriverWait(web, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="q"]'))).send_keys(goods)
    WebDriverWait(web, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="J_TSearchForm"]/div[1]/button'))).click()

    try:
        # Find the total number of pages
        sum_page = web.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[1]').text
    except Exception:
        # Handle the slider captcha in case it appears
        WebDriverWait(web, 5).until(EC.presence_of_element_located((By.XPATH, '//*[@id="nc_1_n1z"]')))
        action = ActionChains(web)
        drag = web.find_element(By.XPATH, '//*[@id="nc_1_n1z"]')
        action.drag_and_drop_by_offset(drag, 300, 0).perform()
        WebDriverWait(web, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[1]')))
        sum_page = web.find_element(By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[1]').text
    print(goods, sum_page)

    # Enter the page range to scrape
    min_page = int(input('Enter the first page to scrape: '))
    max_page = int(input('Enter the last page to scrape: '))
    # Parse and save the data
    f = open('C:/Users/sunshine/Desktop/课件/图片/爬取的数据/' + '淘宝' + goods + '.csv', 'w+', encoding='utf-8', newline='')
    csvwrite = csv.writer(f)
    csvwrite.writerow(('shop_name', 'goods_name', 'loc', 'prices', 'sum_body'))

    for page in range(min_page, max_page + 1):
        try:
            if page != 1:
                # Type the page number into the pager box and press Enter
                key_ = WebDriverWait(web, 20).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input')))
                key_.clear()
                key_.send_keys(str(page))
                key_.send_keys(Keys.ENTER)
                time.sleep(2)
            else:
                web.execute_script("window.scrollTo(0,document.body.scrollHeight);")
                time.sleep(2)
        except Exception as e:
            # A slider captcha appeared: drag it inside its iframe, then retry the page jump
            xpath_ = web.find_element(By.XPATH, '//*[@id="J_sufei"]/iframe')
            web.switch_to.frame(xpath_)
            drag = web.find_element(By.XPATH, '//*[@id="nc_1__scale_text"]/span')
            ActionChains(web).drag_and_drop_by_offset(drag, 300, 0).perform()
            web.switch_to.frame(web.find_element(By.XPATH, '//*[@id="CrossStorageClient-ba26ffda-7fa9-44ef-a87f-63c058cd9d01"]'))
            print(e, 'slider captcha appeared')
            key_ = WebDriverWait(web, 20).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input')))
            key_.clear()
            key_.send_keys(str(page))
            key_.send_keys(Keys.ENTER)
            time.sleep(2)
        items = web.find_elements(By.XPATH, '//*[@id="mainsrp-itemlist"]/div/div/div[1]/div')
        for item in items:
            prices = float(item.find_element(By.XPATH, './div[2]/div[1]/div[1]/strong').text)
            goods_name = item.find_element(By.XPATH, './div[2]/div[2]/a').text
            body = item.find_element(By.XPATH, './div[2]/div[1]/div[2]').text
            if '万' in body:
                # '万' means ten thousand, so scale the number accordingly
                body = re.findall(r'\d+\.\d+|\d+', body)[0]
                sum_body = float(body) * 10000
            elif len(body) != 0:
                body = re.findall(r'\d+\.\d+|\d+', body)[0]
                sum_body = float(body)
            else:
                sum_body = None
            shop_name = item.find_element(By.XPATH, './div[2]/div[3]/div[1]/a/span[2]').text
            # Keep only the first three characters of the location
            loc = item.find_element(By.XPATH, './div[2]/div[3]/div[2]').text[:3]
            print((shop_name, goods_name, loc, prices, sum_body))
            csvwrite.writerow((shop_name, goods_name, loc, prices, sum_body))
        print('=============== Page ' + str(page) + ' scraped! =======================')
    f.close()
    web.close()
    return goods


if __name__ == '__main__':
    main()
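The sales-count conversion in the loop can be isolated and checked without a browser. This is a sketch of the same rule (take the first number in the text and multiply by 10000 when the text contains '万'). The function name `parse_sales_count` and the sample strings are assumptions about what the page text looks like, not values taken from the original run.

```python
import re

# First integer or decimal number in the text, e.g. "1.5" or "300".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_sales_count(text):
    """Convert a sales label to a float, or None if it contains no number."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group())
    if "万" in text:  # '万' means ten thousand
        value *= 10000
    return value


print(parse_sales_count("1.5万+人付款"))  # 15000.0
print(parse_sales_count("300人付款"))     # 300.0
print(parse_sales_count(""))              # None
```

Returning None when no digits are found avoids the IndexError that indexing `[0]` into an empty `re.findall` result would raise.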
