python3爬虫-通过requests获取拉钩职位信息
import requests, json, time, tablib def send_ajax_request(data: dict):
try:
ajax_response = session.post(url=ajax_url,
params={"needAddtionalResult": "false", "city": city},
data=data,
headers=ajax_headers,
timeout=timeout)
if ajax_response.status_code == 200:
return ajax_response.json()
return {}
except Exception:
return {} def get_job_info(info_dic: dict):
jobInfoMap = info_dic.get("content").get("positionResult").get("result") for jobInfoDict in jobInfoMap:
dic = {}
dic["companyId"] = jobInfoDict.get("companyId")
dic["companyFullName"] = jobInfoDict.get("companyFullName")
dic["positionName"] = jobInfoDict.get("positionName")
dic["workYear"] = jobInfoDict.get("workYear")
dic["education"] = jobInfoDict.get("education")
dic["salary"] = jobInfoDict.get("salary")
dic["jobNature"] = jobInfoDict.get("jobNature")
dic["companySize"] = jobInfoDict.get("companySize")
dic["city"] = jobInfoDict.get("city")
dic["district"] = jobInfoDict.get("district")
dic["createTime"] = jobInfoDict.get("createTime")
if is_save_txtfile:
yield json.dumps(dic, ensure_ascii=False)
else:
yield dic.values() def save_to_file(json_data):
for data in json_data:
f.write(data + "\n") def save_to_excel(list_data):
for line in list_data:
dataset.append(line) def run():
for i in range(1, 31):
data = {
"first": "false",
"pn": i,
"kd": "python"
}
info_dic = send_ajax_request(data)
data = get_job_info(info_dic)
if is_save_txtfile:
save_to_file(data)
else:
save_to_excel(data)
print("正在保存数据")
time.sleep(sleeptime) if __name__ == '__main__':
session = requests.Session()
job_name = "python"
city = "成都"
timeout = 5
sleeptime = 10
doc_url = "https://www.lagou.com/jobs/list_{job_name}".format(job_name=job_name)
session.headers[
"User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
session.headers["Host"] = "www.lagou.com" doc_response = session.get(url=doc_url, params={"city": city}) ajax_headers = {
"Origin": "https://www.lagou.com",
"Referer": doc_response.url
} ajax_url = "https://www.lagou.com/jobs/positionAjax.json?=false" is_save_txtfile = False if not is_save_txtfile:
dataset = tablib.Dataset()
dataset.headers = ["companyId", "companyFullName", "positionName", "workYear",
"education", "salary", "jobNature", "companySize", "city",
"district", "createTime"] f = open("jobinfo.txt", "a", encoding="utf-8")
try:
run()
except Exception:
print('出错了')
finally:
if is_save_txtfile:
f.close()
else:
with open("jobInfo.xls", "wb") as f:
f.write(dataset.xls)
f.flush()
python3爬虫-通过requests获取拉钩职位信息的更多相关文章
- python3爬虫-通过requests获取安居客房屋信息
import requests from fake_useragent import UserAgent from lxml import etree from http import cookiej ...
- 21天打造分布式爬虫-Selenium爬取拉钩职位信息(六)
6.1.爬取第一页的职位信息 第一页职位信息 from selenium import webdriver from lxml import etree import re import time c ...
- python3爬虫-通过selenium登陆拉钩,爬取职位信息
from selenium import webdriver from selenium.common.exceptions import NoSuchElementException from se ...
- python3爬虫抓取智联招聘职位信息代码
上代码,有问题欢迎留言指出. # -*- coding: utf-8 -*- """ Created on Tue Aug 7 20:41:09 2018 @author ...
- ruby 爬虫爬取拉钩网职位信息,产生词云报告
思路:1.获取拉勾网搜索到职位的页数 2.调用接口获取职位id 3.根据职位id访问页面,匹配出关键字 url访问采用unirest,由于拉钩反爬虫,短时间内频繁访问会被限制访问,所以没有采用多线程, ...
- python3爬虫-通过requests爬取图虫网
import requests from fake_useragent import UserAgent from requests.exceptions import Timeout from ur ...
- 通俗易懂的分析如何用Python实现一只小爬虫,爬取拉勾网的职位信息
源代码:https://github.com/nnngu/LagouSpider 效果预览 思路 1.首先我们打开拉勾网,并搜索"java",显示出来的职位信息就是我们的目标. 2 ...
- python3 requests 获取 拉勾工作数据
#-*- coding:utf-8 -*- __author__ = "carry" import requests,json for x in range(1, 15): url ...
- python3爬虫-使用requests爬取起点小说
import requests from lxml import etree from urllib import parse import os, time def get_page_html(ur ...
随机推荐
- sql-pivot
PIVOT PIVOT运算符用于在列和行之间进行数据旋转或透视转换,同时执行聚合运算 ,,) Order By empid asc Select * From ( Select empid,YEAR( ...
- 服务器端的tomcat,servlet框架
tomcat是一个服务器程序 可以对webapp目录下的Servlet代码进行执行和操作 编写的Servlet代码的步骤一般是在本地的ide中编写和测试,然后打包工程为war格式的文件,部署在服务器t ...
- 安装 GraphicsMagick
yum -y install GraphicsMagick GraphicsMagick-devel 实际试了试,上面yum的方式不好使,下面是我实际安装过程: 1.下载最新版 wget ftp:// ...
- 【Leetcode】【Easy】Contains Duplicate
Given an array of integers, find if the array contains any duplicates. Your function should return t ...
- 尝试Office 2003 VSTO的开发、部署
转载:http://www.cnblogs.com/oneivan/p/4243574.html 背景:一年前,某项目需要使用到Excel进行数据录入,考虑到很多用户还是使用XP+Office 200 ...
- 最优化 KKT条件
对于约束优化问题: 拉格朗日公式: 其KKT条件为: 求解 x.α.β 其中β*g(x)为互补松弛条件 KKT条件是使一组解成为最优解的必要条件,当原问题是凸问题的时候,KKT条件也是充分条件.
- [BZOJ 3441]乌鸦喝水
3441: 乌鸦喝水 Time Limit: 20 Sec Memory Limit: 128 MBSubmit: 374 Solved: 148[Submit][Status][Discuss] ...
- Xposed模块开发基本方法记录
由于某些课程实验的要求,需要通过xposed框架对某应用进行hook操作,笔者选用了开源且免费的xposed框架进行实现.虽然网上存在一些利用xposed实现特定功能的文章资源,但大多均将xposed ...
- FZU-1608 Huge Mission 线段树(更新懒惰标记)
题目链接: https://cn.vjudge.net/problem/FZU-1608 题目大意: 长度n,m次操作:每次操作都有三个数:a,b,c:意味着(a,b]区间单位长度的价值为c,若某段长 ...
- 排序算法Java版,以及各自的复杂度,以及由堆排序产生的top K问题
常用的排序算法包括: 冒泡排序:每次在无序队列里将相邻两个数依次进行比较,将小数调换到前面, 逐次比较,直至将最大的数移到最后.最将剩下的N-1个数继续比较,将次大数移至倒数第二.依此规律,直至比较结 ...