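What kind of data do we want?

The data we are after is public data that can be obtained easily. A complete dataset would be ideal, but one is usually hard to get through reasonably legitimate means. For the real-time job postings that this series studies, postings from the most recent month are enough.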

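How do we get the data?

A crawler would work and remains a fallback option. However, I noticed that Lagou's own data is refreshed through ajax requests, which makes bulk retrieval simpler. There are many ways to fetch data through ajax requests; here I demonstrate the one I consider fairly simple and general: using curl to simulate the ajax request.

Note: all the steps below are demonstrated in Google Chrome on a Mac; a few operations may differ slightly in other browsers. The last step gives the extracted, general-purpose curl script. If you do not care about the individual steps, you can use that script directly.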

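1. Find the target city and target position, then sort by "newest" (最新排序). Reference link: http://www.lagou.com/jobs/list_iOS?px=new&city=北京#order

2. Two-finger click / right-click the page to open the context menu and choose "Inspect" (检查) to open the browser's developer tools, then switch to the Network -> XHR tab.

3. Press cmd + R to refresh the page; the XHR requests sent by the page are now captured. Find the request whose URL starts with http://www.lagou.com/jobs/positionAjax.json, two-finger click / right-click it, and choose "Copy as cURL".

The copied curl command is very long. For this analysis, the key part is the trailing pn=1&kd=iOS, where pn is the page number and kd is the position keyword. Setting these two values dynamically lets you fetch more data for more positions; the remaining sections of this article cover each case in turn.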

curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data 'first=true&pn=1&kd=iOS' --compressed

4. Paste the curl command from the previous step into a terminal and press Enter to see the output.

{"success":true,"requestId":null,"msg":null,"resubmitToken":null,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974,"resultSize":15,"locationInfo":{"city":"北京","district":null,"queryByGisCode":false,"businessZone":null,"locationCode":null},"queryAnalysisInfo":{"positionName":"ios","companyName":null,"usefulCompany":false,"industryName":null},"strategyProperty":{"name":"dm-csearch-newSimScorer","id":1},"result":[{"companyId":129801,"companyShortName":"言之有物科技","createTime":"2016-08-30 19:28:12","positionId":1857486,"positionAdvantage":"一线公司,技术驱动,免费三餐,超期望回报","salary":"25k-50k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"iOS高级研发工程师/Lead","companyLogo":"i/image/M00/43/4E/CgqKkVeDGsuAXz0gAAA4XeGAAHQ390.png","financeStage":"成长型(A轮)","industryField":"移动互联网,电子商务","jobNature":"全职","approve":1,"companySize":"15-50人","district":null,"companyLabelList":["股票期权","扁平管理","美女多","领导好"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:28发布","gradeDescription":null,"companyFullName":"北京言之有物科技有限公司","businessZones":null,"imState":"today","lastLogin":1472556472000,"publisherId":5092848,"explain":null,"plus":null,"pcShow":0},{"companyId":133,"companyShortName":"猎豹移动","createTime":"2016-08-30 19:09:34","positionId":2151896,"positionAdvantage":"明星产品 超赞年终奖 靠谱领导","salary":"15k-30k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/39/70/CgYXBlWo3nqABJTsAADJ3hn5gmE062.jpg","financeStage":"上市公司","industryField":"移动互联网,信息安全","jobNature":"全职","approve":1,"companySize":"500-2000人","district":"朝阳区","companyLabelList":["带薪年假","美女前台","超赞年终奖","一公里工作圈"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:09发布","gradeDescription":null,"companyFullName":"北京金山网络科技有限公司","businessZones":["姚家园","十里堡","高碑店"],"imState":"today","lastLogin":1472555392000,"publisherId":129969,"explain":null,"plus":null,"pcShow":0},{"companyId":107608,"companyShortName":"MUM计算机","createTime":"2016-08-30 
19:03:24","positionId":1963945,"positionAdvantage":"帮助程序员赴美做IT,享受高薪高品质生活","salary":"10k-20k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"IOS程序员赴美项目推广员","companyLogo":"i/image/M00/00/C2/CgqKkVZVHmSAWPtRAASUg0iUVuI932.jpg","financeStage":"初创型(不需要融资)","industryField":"教育","jobNature":"全职","approve":0,"companySize":"少于15人","district":"昌平区","companyLabelList":["赴美工作","美元薪水","告别996","技术前沿"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:03发布","gradeDescription":null,"companyFullName":"北京玛赫西计算机教育咨询有限公司","businessZones":null,"imState":"disabled","lastLogin":1472558059000,"publisherId":5179699,"explain":null,"plus":null,"pcShow":0},{"companyId":67576,"companyShortName":"车满满","createTime":"2016-08-30 18:47:30","positionId":2307877,"positionAdvantage":"期权","salary":"20k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS高级开发工程师","companyLogo":"i/image/M00/01/47/Cgp3O1ZmYACABBpPAAGzVR5S-Ps906.png","financeStage":"成长型(A轮)","industryField":"移动互联网","jobNature":"全职","approve":1,"companySize":"50-150人","district":"朝阳区","companyLabelList":["股票期权","技能培训","弹性工作","定期体检"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:47发布","gradeDescription":null,"companyFullName":"车满满(北京)信息技术有限公司","businessZones":["建外大街","CBD","国贸"],"imState":"today","lastLogin":1472566873000,"publisherId":2116322,"explain":null,"plus":null,"pcShow":0},{"companyId":1575,"companyShortName":"百度","createTime":"2016-08-30 18:30:05","positionId":2307765,"positionAdvantage":"BAT 
薪酬福利好","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS移动开发","companyLogo":"image1/M00/00/06/CgYXBlTUWAWAOBXrAABGHHFb0q8748.jpg","financeStage":"上市公司","industryField":"移动互联网,数据服务","jobNature":"全职","approve":1,"companySize":"2000人以上","district":null,"companyLabelList":["股票期权","弹性工作","五险一金","免费班车"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:30发布","gradeDescription":null,"companyFullName":"百度在线网络技术(北京)有限公司","businessZones":null,"imState":"disabled","lastLogin":1472553001000,"publisherId":5705515,"explain":null,"plus":null,"pcShow":0},{"companyId":13321,"companyShortName":"FunPlus 趣加游戏","createTime":"2016-08-30 18:26:28","positionId":2240276,"positionAdvantage":"国际一线团队,无限的成长空间,任你发挥","salary":"18k-36k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS 视频处理工程师/高级工程师","companyLogo":"image1/M00/00/1A/Cgo8PFTUWFWAKE5aAABwJ1mgAYw423.png","financeStage":"成长型(B轮)","industryField":"游戏","jobNature":"全职","approve":0,"companySize":"150-500人","district":"海淀区","companyLabelList":["绩效奖金","股票期权","专项奖金","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:26发布","gradeDescription":null,"companyFullName":"北京趣加科技有限公司","businessZones":["中关村","知春路","双榆树"],"imState":"today","lastLogin":1472552889000,"publisherId":285309,"explain":null,"plus":null,"pcShow":0},{"companyId":15111,"companyShortName":"联拓天际","createTime":"2016-08-30 
18:22:12","positionId":2307696,"positionAdvantage":"与其在别处仰望,不如在这里并肩","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/00/1D/Cgo8PFTUWGGAZQdjAADRNZVO9fc470.jpg","financeStage":"成熟型(不需要融资)","industryField":"电子商务","jobNature":"全职","approve":1,"companySize":"500-2000人","district":null,"companyLabelList":["五险一金","午餐补助","定期体检","技能培训"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:22发布","gradeDescription":null,"companyFullName":"北京联拓天际电子商务有限公司","businessZones":null,"imState":"today","lastLogin":1472552392000,"publisherId":1595082,"explain":null,"plus":null,"pcShow":0},{"companyId":119049,"companyShortName":"优久科技","createTime":"2016-08-30 18:15:29","positionId":1853231,"positionAdvantage":"良好的工作环境、成长平台和工作伙伴","salary":"10k-18k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"i/image/M00/16/74/CgqKkVbvnVuAeC-YAAA_YSPyb5A166.jpg","financeStage":"初创型(天使轮)","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"少于15人","district":"海淀区","companyLabelList":["交通补助","通讯津贴","午餐补助"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:15发布","gradeDescription":null,"companyFullName":"北京优久科技有限责任公司","businessZones":["中关村","知春路","人民大学"],"imState":"today","lastLogin":1472552013000,"publisherId":4427723,"explain":null,"plus":null,"pcShow":0},{"companyId":41878,"companyShortName":"商询科技","createTime":"2016-08-30 
18:14:06","positionId":2278393,"positionAdvantage":"微软创业团队,工程师文化!","salary":"10k-15k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS开发","companyLogo":"i/image/M00/24/22/Cgp3O1cZmpWAGslpAAA9MdgVNWU645.jpg","financeStage":"成长型(A轮)","industryField":"企业服务,数据服务","jobNature":"全职","approve":1,"companySize":"15-50人","district":"朝阳区","companyLabelList":["股票期权","人脉资源","办公环境好","国际化团队"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:14发布","gradeDescription":null,"companyFullName":"北京商询科技有限公司","businessZones":["姚家园"],"imState":"today","lastLogin":1472554153000,"publisherId":803257,"explain":null,"plus":null,"pcShow":0},{"companyId":5832,"companyShortName":"新浪微博","createTime":"2016-08-30 18:02:30","positionId":254885,"positionAdvantage":"亿级别DAU,微博重点项目组","salary":"20k-40k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"新浪微博iOS客户端研发工程师","companyLogo":"image1/M00/00/0D/CgYXBlTUWCCAdkhOAABNgyvZQag818.jpg","financeStage":"上市公司","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"2000人以上","district":"海淀区","companyLabelList":["年底双薪","专项奖金","股票期权","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:02发布","gradeDescription":null,"companyFullName":"微梦创科网络科技(中国)有限公司","businessZones":["西北旺","马连洼","上地"],"imState":"disabled","lastLogin":1472556144000,"publisherId":561302,"explain":null,"plus":null,"pcShow":0},{"companyId":48321,"companyShortName":"合广众","createTime":"2016-08-30 
18:00:40","positionId":2263615,"positionAdvantage":"老板nice","salary":"10k-20k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS开发工程师","companyLogo":"i/image/M00/01/D6/CgqKkVZ496GAYypzAAAKATKLXuY379.png","financeStage":"初创型(天使轮)","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"50-150人","district":"海淀区","companyLabelList":["节日礼物","带薪年假","绩效奖金","岗位晋升"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:00发布","gradeDescription":null,"companyFullName":"北京合广众文化发展有限公司","businessZones":["八里庄","定慧寺","四季青"],"imState":"today","lastLogin":1472550077000,"publisherId":3608518,"explain":null,"plus":null,"pcShow":0},{"companyId":38239,"companyShortName":"Keep","createTime":"2016-08-30 17:52:25","positionId":2076872,"positionAdvantage":"福利健全、北京工作居住证、C轮","salary":"25k-35k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS开发工程师","companyLogo":"image1/M00/0A/40/CgYXBlTun9KASqKdAAAs36QVurU409.png","financeStage":"成熟型(C轮)","industryField":"社交网络,文化娱乐","jobNature":"全职","approve":1,"companySize":"150-500人","district":null,"companyLabelList":["节日礼物","年度旅游","定期体检","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52发布","gradeDescription":null,"companyFullName":"北京卡路里科技有限公司","businessZones":null,"imState":"today","lastLogin":1472550738000,"publisherId":3425178,"explain":null,"plus":null,"pcShow":0},{"companyId":179,"companyShortName":"她理财","createTime":"2016-08-30 17:52:02","positionId":982402,"positionAdvantage":"五险一金 绩效奖金 年底15薪 
带薪年假","salary":"15k-25k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"高级iOS开发工程师","companyLogo":"image1/M00/0C/F2/CgYXBlT2mG2AOPevAAB_09mD2Ko247.png","financeStage":"成长型(A轮)","industryField":"电子商务,金融","jobNature":"全职","approve":1,"companySize":"50-150人","district":"朝阳区","companyLabelList":["年底双薪","节日礼物","技能培训","绩效奖金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52发布","gradeDescription":null,"companyFullName":"北京新工场投资顾问有限公司","businessZones":["大望路","华贸","百子湾"],"imState":"today","lastLogin":1472557005000,"publisherId":97147,"explain":null,"plus":null,"pcShow":0},{"companyId":11053,"companyShortName":"中科三方","createTime":"2016-08-30 17:33:13","positionId":2307276,"positionAdvantage":"留用机会,户口指标","salary":"2k-4k","score":0,"workYear":"应届毕业生","education":"本科","city":"北京","positionName":"iOS实习生","companyLogo":"image1/M00/00/16/CgYXBlTUWEWAXnWbAACvz96W4qA927.jpg","financeStage":"成长型(不需要融资)","industryField":"移动互联网","jobNature":"实习","approve":0,"companySize":"150-500人","district":"海淀区","companyLabelList":null,"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:33发布","gradeDescription":null,"companyFullName":"北京中科三方网络技术有限公司","businessZones":["中关村","知春路","双榆树"],"imState":"today","lastLogin":1472549621000,"publisherId":141237,"explain":null,"plus":null,"pcShow":0},{"companyId":116183,"companyShortName":"情非得已","createTime":"2016-08-30 
17:28:11","positionId":1786957,"positionAdvantage":"五险一金、无限小吃、Mac办公、定期体检","salary":"8k-15k","score":0,"workYear":"1-3年","education":"不限","city":"北京","positionName":"android&iOS测试工程师","companyLogo":"i/image/M00/1C/58/CgqKkVcB1QyAJM2-AAA4t6tVzs8439.jpg","financeStage":"初创型(天使轮)","industryField":"移动互联网,企业服务","jobNature":"全职","approve":0,"companySize":"15-50人","district":"朝阳区","companyLabelList":["定期体检","年度旅游","领导好","扁平管理"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:28发布","gradeDescription":null,"companyFullName":"情非得已(北京)科技有限公司","businessZones":["建外大街","国贸","CBD"],"imState":"today","lastLogin":1472553855000,"publisherId":4170237,"explain":null,"plus":null,"pcShow":0}]}},"code":0}

As you can see, the output corresponds exactly to the data shown on the first page of the site.
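As a quick sanity check, the counters at the top of the response can be pulled out with standard text tools. The sketch below is an illustration, not part of the original workflow. It assumes the response keeps the compact layout shown above (no spaces after the colons), and the sample string is a shortened stand-in for the real response; json_int is a hypothetical helper. A real JSON parser would be more robust.

```shell
# Shortened stand-in for the response (field names copied from the output above).
sample='{"success":true,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974,"resultSize":15}}}'

# Hypothetical helper: print the integer that follows "key": in a JSON string.
# Relies on the compact "key":number layout; only the first match is used.
json_int() {
  printf '%s\n' "$2" | grep -o "\"$1\":[0-9]*" | head -n 1 | cut -d: -f2
}

total=$(json_int totalCount "$sample")
page_size=$(json_int pageSize "$sample")
echo "totalCount=$total pageSize=$page_size"
```

The same helper can be pointed at a saved response by passing the file's contents as the second argument.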

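How do we save the data to a file?

Saving the curl result directly to a file makes further processing easier. The way to do this is the redirection operator >. The following command writes the curl result to the file 1.json instead of printing it to the console: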

curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data 'first=true&pn=1&kd=iOS' --compressed > 1.json
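curl's own -o option also writes the response to a file and can be used instead of the redirect. Either way, it is worth confirming that the saved file holds a real response and not an empty file or an error page. The following is a minimal sketch, assuming a successful response contains "success":true as in the output shown earlier. looks_valid is a hypothetical helper, and the files created here stand in for real downloads.

```shell
# Hypothetical helper: succeed only if the file is non-empty and
# contains the "success":true marker seen in the sample response.
looks_valid() {
  [ -s "$1" ] && grep -q '"success":true' "$1"
}

tmp_dir=$(mktemp -d)
printf '%s\n' '{"success":true,"content":{"pageNo":1}}' > "$tmp_dir/1.json"
: > "$tmp_dir/empty.json"   # simulate a failed download (zero-byte file)

for f in "$tmp_dir/1.json" "$tmp_dir/empty.json"; do
  if looks_valid "$f"; then
    echo "ok: ${f##*/}"
  else
    echo "suspect: ${f##*/}"
  fi
done
```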

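How do we get data for other positions?

This needs slightly deeper shell syntax. In short, a for ... in loop iterates over a given list of positions, changes the value of the kd field at the end of the curl command above, and writes each result to a file named after the position. Note that the single quotes around the argument after --data must be changed to double quotes; otherwise the variable is not expanded. The complete code follows; add positions to the list as needed: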

for kd in "Java" "PHP" "C" "C++" "Android" "iOS"
do
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=1&kd=$kd" --compressed > $kd.json
done
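Two details of this loop deserve a closer look. First, the shell expands $kd only inside double quotes. Second, the request body is sent as application/x-www-form-urlencoded, where a literal + conventionally decodes to a space, so a keyword such as C++ may not reach the server as typed. Whether Lagou's server behaved this way is not verified in this article. curl's --data-urlencode option takes care of the encoding; the sketch below shows the same idea in plain shell. urlencode is a hypothetical helper and is only meant for ASCII keywords.

```shell
kd='C++'
echo 'kd=$kd'    # single quotes: prints the text kd=$kd unchanged
echo "kd=$kd"    # double quotes: prints kd=C++

# Hypothetical helper: percent-encode an ASCII string one character at a time.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;                   # unreserved: keep as is
      *) out=$out$(printf '%%%02X' "'$c") ;;           # anything else: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

payload="first=true&pn=1&kd=$(urlencode "$kd")"
echo "$payload"
```

With curl itself, the equivalent would be to pass the fixed fields with --data and the keyword with --data-urlencode "kd=$kd".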

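How do we fetch in bulk?

The curl script currently fetches a single page per request. To fetch multiple pages, add another for loop. From observation, Lagou serves at most roughly 100 pages of valid data, so write a loop from 1 to 100 and save each response as ${kd}_${pn}.json: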

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json
done
done
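Rather than always requesting 100 pages, the page count can be estimated from the first response. The sample output earlier reports a totalCount of 974 and a pageSize of 15. Assuming totalCount reflects what the endpoint will actually page through (this is not confirmed in the article), the last page is the ceiling of their quotient. A sketch using shell integer arithmetic:

```shell
total=974      # totalCount from the sample response
page_size=15   # pageSize from the sample response

# Integer ceiling division: (a + b - 1) / b
pages=$(( (total + page_size - 1) / page_size ))
echo "pages=$pages"   # prints pages=65
```

The upper bound of the page loop could then be $pages for that keyword instead of a fixed 100.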

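How do we speed up fetching?

If you ran the script above, you saw that it is rather slow: the curl requests run synchronously, so each download must finish before the next one starts. Appending & runs each request in the background so that several are fetched at once, which speeds things up. One more point: a single computer can only open a limited number of curl connections at the same time, so to avoid unnecessary interruptions I added a very short sleep. The improved code follows:

Note: this code may get your IP blocked by Lagou. Unless you are in a hurry, use it with caution; of course, you can also switch between several IPs.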

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json &
sleep 0.02
done
done

Note: if the script hangs, press ctrl + c to exit; if it keeps getting interrupted abnormally, try increasing the value after sleep.
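An alternative to tuning sleep is to cap how many requests run at once and wait for each batch to finish before starting the next. The sketch below shows the pattern with the shell's wait builtin. fetch_page is a hypothetical stand-in that only writes a placeholder file where the real script would run the curl command, and max_jobs is an arbitrary value chosen for illustration.

```shell
out_dir=$(mktemp -d)   # placeholder output directory for this demo
max_jobs=4             # arbitrary cap on simultaneous background jobs

# Hypothetical stand-in for the long curl command: writes a placeholder file.
fetch_page() {
  printf 'kd=%s pn=%s\n' "$1" "$2" > "$out_dir/${1}_$2.json"
}

count=0
for pn in 1 2 3; do
  for kd in Java iOS; do
    fetch_page "$kd" "$pn" &          # start in the background
    count=$((count + 1))
    if [ $((count % max_jobs)) -eq 0 ]; then
      wait                            # block until the current batch is done
    fi
  done
done
wait                                  # collect any jobs left in the last batch

ls "$out_dir"
```

Batching bounds the number of open connections, but it does nothing about the request rate the server sees, so the warning about blocked IPs still applies.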

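A more complete script

Here the data is placed in a separate jobs directory to keep the directory structure organized. The complete data can be downloaded from the series' GitHub project (https://github.com/ios122/spark_lagou):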

mkdir -p jobs
for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > jobs/$kd\_$pn.json &
sleep 0.02
done
done

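You may also find that some positions do not have 100 pages of valid data. Do these files need extra handling? No. A basic feature of big-data tools such as Spark is a moderate degree of fault tolerance toward the dataset: a few abnormal records generally do not prevent the data from being imported. After importing, you can analyze it directly. That is a topic for later; subsequent articles in this series cover it separately.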
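If you would still like to know how many downloaded pages actually contain postings, a rough count is easy to get. The sketch below assumes every posting carries a "positionId" field, as in the sample response, and that pages past the end of the results lack it; the second point is an assumption. The two files created here are placeholders for the real ones under jobs/.

```shell
data_dir=$(mktemp -d)   # placeholder for the jobs directory

# One page with a posting, one page with an empty result list (assumed shape).
printf '%s\n' '{"content":{"positionResult":{"result":[{"positionId":1857486}]}}}' > "$data_dir/iOS_1.json"
printf '%s\n' '{"content":{"positionResult":{"result":[]}}}' > "$data_dir/iOS_2.json"

# grep -l lists only the files that contain at least one match.
with_data=$(grep -l '"positionId"' "$data_dir"/*.json | wc -l | tr -d ' ')
all_files=$(ls "$data_dir" | wc -l | tr -d ' ')
echo "$with_data of $all_files files contain postings"
```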


GitHub repository for this series: https://github.com/ios122/spark_lagou
