Python Crawler Series 09: selenium + Lagou

Using selenium to scrape job postings from Lagou (lagou.com)

 from selenium import webdriver
 from lxml import etree
 import re
 import time
 from selenium.webdriver.support.ui import WebDriverWait
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.common.by import By

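 class LagouSpider(object):
     # Path to the local chromedriver binary; adjust it for your machine.
     driver_path = r"D:\driver\chromedriver.exe"

     def __init__(self):
         # executable_path is the Selenium 3 style argument; newer Selenium 4
         # releases expect a Service object (service=...) instead.
         self.driver = webdriver.Chrome(executable_path=LagouSpider.driver_path)
         # Search results for the URL-encoded keyword "云计算" (cloud computing).
         self.url = 'https://www.lagou.com/jobs/list_%E4%BA%91%E8%AE%A1%E7%AE%97?labelWords=&fromSearch=true&suginput='
         # One dict per scraped position.
         self.positions = []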

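     def run(self):
         self.driver.get(self.url)
         while True:
             # Wait for the pager to appear before reading page_source, so the
             # HTML is captured after the page has rendered it.
             WebDriverWait(driver=self.driver, timeout=10).until(
                 EC.presence_of_element_located((By.XPATH, "//div[@class='pager_container']/span[last()]"))
             )
             source = self.driver.page_source
             self.parse_list_page(source)
             try:
                 # find_element(By.XPATH, ...) replaces find_element_by_xpath,
                 # which was removed in later Selenium 4 releases.
                 next_btn = self.driver.find_element(By.XPATH, "//div[@class='pager_container']/span[last()]")
                 if "pager_next_disabled" in next_btn.get_attribute("class"):
                     # The "next" button is disabled on the last page.
                     break
                 else:
                     next_btn.click()
             except Exception:
                 # Dump the page for debugging if the button cannot be found or clicked.
                 print(source)
             time.sleep(1)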

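     def parse_list_page(self, source):
         # Collect the detail-page link of every position on the list page.
         html = etree.HTML(source)
         links = html.xpath("//a[@class='position_link']/@href")
         for link in links:
             self.request_detail_page(link)
             # Pause between detail pages to keep the request rate low.
             time.sleep(1)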

     def request_detail_page(self, url):
         # self.driver.get(url)
         print()
         print(url)
         print()
         # Open the detail page in a new window so the list page keeps its state.
         self.driver.execute_script("window.open('%s')" % url)
         self.driver.switch_to.window(self.driver.window_handles[1])
         WebDriverWait(self.driver, timeout=10).until(
             EC.presence_of_element_located((By.XPATH, "//div[@class='job-name']/span[@class='name']"))
         )
         source = self.driver.page_source
         self.parse_detail_page(source)
         # Close the detail window and switch back to the list page.
         self.driver.close()
         self.driver.switch_to.window(self.driver.window_handles[0])

     def parse_detail_page(self, source):
         html = etree.HTML(source)
         position_name = html.xpath("//span[@class='name']/text()")[0]
         # The job_request spans hold salary, city, work experience and education, in that order.
         job_request_spans = html.xpath("//dd[@class='job_request']//span")
         salary = job_request_spans[0].xpath('.//text()')[0].strip()
         city = job_request_spans[1].xpath(".//text()")[0].strip()
         # Strip whitespace and the "/" separators around each field.
         city = re.sub(r"[\s/]", "", city)
         work_years = job_request_spans[2].xpath(".//text()")[0].strip()
         work_years = re.sub(r"[\s/]", "", work_years)
         education = job_request_spans[3].xpath(".//text()")[0].strip()
         education = re.sub(r"[\s/]", "", education)
         desc = "".join(html.xpath("//dd[@class='job_bt']//text()")).strip()
         # xpath() returns a list of text nodes; join them so company_name is a string.
         company_name = "".join(html.xpath("//h2[@class='f1']/text()")).strip()
         position = {
             'name': position_name,
             'company_name': company_name,
             'salary': salary,
             'city': city,
             'work_years': work_years,
             'education': education,
             'desc': desc
         }
         self.positions.append(position)
         print(position)

 if __name__ == '__main__':
     spider = LagouSpider()
     spider.run()
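
In parse_detail_page, the city, work_years and education fields all go through the same regex cleanup. As a minimal sketch, that step could be factored into one helper. The name clean_field and the sample strings below are assumptions for illustration; they are not part of the original spider and are not real Lagou data.

```python
import re


def clean_field(text):
    """Remove all whitespace and '/' separators from a scraped field.

    Hypothetical helper: it applies the same regex that parse_detail_page
    uses for city, work_years and education.
    """
    return re.sub(r"[\s/]", "", text)


if __name__ == "__main__":
    # Made-up sample strings shaped like the span text on a detail page.
    samples = ["/上海 /", "经验3-5年 /", " 本科及以上 /\n"]
    for raw in samples:
        print(repr(raw), "->", repr(clean_field(raw)))
```

Because the character class already covers whitespace, the separate strip() call is not needed when this helper is used. The hyphen in a value such as "3-5" is left untouched.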
