Crawler project: scrape and filter Lagou job listings, then submit a resume automatically

 

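1 Target site analysis

#1: Preparation:
Use Chrome as the browser.
Press Ctrl+Shift+Delete to clear the cookies cached by the browser.
Open the Network panel to capture traffic, and tick Preserve log so that all entries are kept.

#2: Lagou's login verification flow: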
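1. Request the login page:
Request URL: https://passport.lagou.com/login/login.html
The request headers need very little: a simple Host and User-Agent are enough to pass as a browser.
The response headers carry the cookies we need:
Set-Cookie:JSESSIONID=ABAAABAAADGAACFC0077EDC55EEC248392A667B221CE7AB; Path=/; HttpOnly
Set-Cookie:user_trace_token=20171104165207-d69fee97-d5d1-4a06-a406-e41989257b25;
The page body contains two useful values, which are later sent as these headers (the header names are spelled "Anit" in the working code of this post, so they are kept verbatim):
X-Anit-Forge-Code
X-Anit-Forge-Token
ps: the head tag of login.html contains a comment left by Lagou's developers saying the values are there "to prevent duplicate submission of requests and forms". That comment is what gave me the idea for getting past the check, so a fondness for comments is not always a good thing.

2. Submit the username and password: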
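Request URL: https://passport.lagou.com/login/login.json
The request must carry:
JSESSIONID (as a cookie)
'X-Anit-Forge-Code': X_Anti_Forge_Code, # found in the body of login.html
'X-Anit-Forge-Token': X_Anti_Forge_Token, # found in the body of login.html
'X-Requested-With': 'XMLHttpRequest',
Request body (data):
username and password
ps: the username is sent in plaintext and the password as ciphertext. To obtain the ciphertext of your own password, log in with a wrong username and the correct password, then copy the password value from Form Data in the Network panel.
Cookies:
JSESSIONID
user_trace_token

3. Request authorization (the login in the previous step succeeded, but the session has not been authorized yet) and get the redirect URL: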
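Request URL: https://passport.lagou.com/grantServiceTicket/grant.html
Request headers:
Host
User-Agent
Note: a successful authorization answers with a redirect; once that redirect has been followed successfully, the login is complete.

4. Request the redirect URL to obtain the final logged-in session.

I implemented two versions. The first imitates the browser with plain requests calls, but handling some of the request headers and cookies by hand turned out to be too tedious,
so I went with the second version, which simply uses requests.session().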
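Step 1 above relies on two values embedded in login.html. As a minimal offline sketch of that extraction, the snippet below runs the same kind of regular expression against a made-up page fragment. The fragment, the sample values and the helper name extract_anti_forge are assumptions for illustration; only the window.X_Anti_Forge_Token / window.X_Anti_Forge_Code assignment format comes from this post.

```python
import re

# Hypothetical excerpt of login.html; only the two assignments matter.
SAMPLE_LOGIN_HTML = """
<head>
<script>
    window.X_Anti_Forge_Token = 'demo-token-123';
    window.X_Anti_Forge_Code = '45678901';
</script>
</head>
"""


def extract_anti_forge(html):
    """Return (token, code) parsed from the page, or raise ValueError."""
    token = re.search(r"window\.X_Anti_Forge_Token = '(.*?)';", html)
    code = re.search(r"window\.X_Anti_Forge_Code = '(.*?)';", html)
    if token is None or code is None:
        raise ValueError('anti-forge values not found in page')
    return token.group(1), code.group(1)


token, code = extract_anti_forge(SAMPLE_LOGIN_HTML)
print(token, code)
```

Raising an error when a value is missing makes a changed page layout fail loudly instead of with an IndexError further down.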

2 Analyzing the verification strategy and completing the login

import re
import requests

session = requests.Session()

# Step 1: request login.html first to get the cookies and the two anti-forge values
r1 = session.get('https://passport.lagou.com/login/login.html',
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*)';", r1.text)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*)';", r1.text)[0]

# Step 2: log in, carrying the cookies from step 1; the backend authorizes the JSESSIONID in the cookie
r3 = session.post(
    url='https://passport.lagou.com/login/login.json',
    data={
        'isValidate': True,
        # 'username': '424662508@qq.com',
        # 'password': '4c4c83b3adf174b9c22af4a179dddb63',
        'username': '',  # your own username, in plaintext
        'password': 'bff642652c0c9e766b40e1a6f3305274',  # the password ciphertext copied from Form Data
        'request_form_verifyCode': '',
        'submit': '',
    },
    headers={
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest',
        "Referer": "https://passport.lagou.com/login/login.html",
        "Host": "passport.lagou.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    },
)
print(r3.text)
# print(r3.headers)

# Step 3: request authorization; do not follow the redirect yet
r4 = session.get('https://passport.lagou.com/grantServiceTicket/grant.html',
                 allow_redirects=False,
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r4.headers)
location = r4.headers['Location']
# print(location)

# Step 4: request the redirect URL to get the final logged-in session
r5 = session.get(location,
                 allow_redirects=True,
                 headers={
                     'Host': "www.lagou.com",
                     'Referer': 'https://passport.lagou.com/login/login.html?',
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r5.headers)

# Step 5: verify the login ('林海峰' is presumably the account's display name; use your own)
print('林海峰' in r5.text)  # r5.text is the page reached after the redirect
r5 = session.get('https://www.lagou.com')  # with the session in hand, no username or password is needed again
print('林海峰' in r5.text)
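The same four login steps are needed before every later script, so they could be wrapped in one function. The following is only a sketch under assumptions: the function name login, the constant names and the parameter names are mine, and the session argument is assumed to behave like requests.Session (it only needs get and post). The URLs, header names and form fields are the ones used in the script above.

```python
import re

LOGIN_PAGE = 'https://passport.lagou.com/login/login.html'
LOGIN_API = 'https://passport.lagou.com/login/login.json'
GRANT_URL = 'https://passport.lagou.com/grantServiceTicket/grant.html'
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/61.0.3163.100 Safari/537.36')


def login(session, username, password_cipher):
    """Run the four login steps on `session` and return the last response.

    `session` is assumed to offer get/post like requests.Session.
    """
    base = {'Host': 'passport.lagou.com', 'User-Agent': USER_AGENT}

    # Step 1: load the login page and read the two anti-forge values.
    page = session.get(LOGIN_PAGE, headers=base)
    token = re.findall(r"window\.X_Anti_Forge_Token = '(.*?)';", page.text)[0]
    code = re.findall(r"window\.X_Anti_Forge_Code = '(.*?)';", page.text)[0]

    # Step 2: submit the credentials together with the anti-forge headers.
    session.post(
        LOGIN_API,
        data={
            'isValidate': True,
            'username': username,
            'password': password_cipher,
            'request_form_verifyCode': '',
            'submit': '',
        },
        headers={
            **base,
            'X-Anit-Forge-Code': code,
            'X-Anit-Forge-Token': token,
            'X-Requested-With': 'XMLHttpRequest',
            'Referer': LOGIN_PAGE,
        },
    )

    # Step 3: request authorization without following the redirect.
    grant = session.get(GRANT_URL, allow_redirects=False, headers=base)

    # Step 4: follow the redirect target to finish the login.
    return session.get(
        grant.headers['Location'],
        allow_redirects=True,
        headers={
            'Host': 'www.lagou.com',
            'Referer': LOGIN_PAGE + '?',
            'User-Agent': USER_AGENT,
        },
    )
```

With requests installed, calling login(requests.Session(), username, password_cipher) should leave the session object logged in, assuming the site still behaves as described in this post. Passing the session in as a parameter also makes it possible to try the flow against a stub object without touching the network.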
# Version using requests.get() with the cookies handled by hand. The flow is right and the login
# itself goes through, but the cookies I wanted never came back. While crawling I found that Lagou
# checks request headers strictly, so the most likely cause is that my hand-built headers had a
# field too many or too few. I therefore switched to requests.session().
import re
import time
import requests

# 1. Visit the login page to get the cookies, X_Anti_Forge_Token and X_Anti_Forge_Code
r1 = requests.get('https://passport.lagou.com/login/login.html',
                  headers={'Host': "passport.lagou.com",
                           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
r1_cookie = r1.cookies.get_dict()
X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*)';", r1.text)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*)';", r1.text)[0]

r2 = requests.get('https://a.lagou.com/collect',
                  cookies=r1_cookie,
                  headers={'Host': "a.lagou.com",
                           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
r2_cookie = r2.cookies.get_dict()

# Merge the cookies from both responses by hand
cookies = {}
cookies.update(r1_cookie)
cookies.update(r2_cookie)
print(cookies)

# 2. Submit the username and password to log in
r3 = requests.post(
    url='https://passport.lagou.com/login/login.json',
    data={
        'isValidate': True,
        # 'username': '424662508@qq.com',
        # 'password': '4c4c83b3adf174b9c22af4a179dddb63',
        'username': '',
        'password': 'bff642652c0c9e766b40e1a6f3305274',
        'request_form_verifyCode': '',
        'submit': '',
    },
    headers={
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest',
        "Referer": "https://passport.lagou.com/login/login.html",
        "Host": "passport.lagou.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    },
    cookies=cookies
)
# print(r3.text)
# print(r3.cookies.get_dict())

# 3. Request authorization and read the redirect target
r4 = requests.get('https://passport.lagou.com/grantServiceTicket/grant.html',
                  cookies=cookies,
                  allow_redirects=False,
                  headers={'Host': "passport.lagou.com",
                           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r4.headers)
location = r4.headers['Location']

# 4. Request the redirect target
r5 = requests.get(location,
                  cookies=cookies,
                  allow_redirects=False,
                  headers={
                      'X_HTTP_TOKEN': 'e6efc0e95eb87147209fbb4f22558fd1',
                      'Host': "www.lagou.com",
                      'Referer': 'https://passport.lagou.com/login/login.html?',
                      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
print(r5.headers)

In memory of a failed attempt
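To make the bookkeeping of the failed version concrete, here is a small standard-library sketch of what handling cookies by hand involves: parsing Set-Cookie values, merging them, and formatting a Cookie header. The helper names are assumptions of mine; the two sample values are the Set-Cookie lines quoted in section one. This is the work that a requests session takes over automatically.

```python
from http.cookies import SimpleCookie


def cookies_from_headers(set_cookie_values):
    """Merge several Set-Cookie header values into one name -> value dict."""
    merged = {}
    for value in set_cookie_values:
        jar = SimpleCookie()
        jar.load(value)
        for name, morsel in jar.items():
            merged[name] = morsel.value  # later values override earlier ones
    return merged


def cookie_header(cookies):
    """Format a dict as the value of a Cookie request header."""
    return '; '.join('%s=%s' % (name, value) for name, value in cookies.items())


# The two Set-Cookie values quoted in section one of this post.
set_cookie_values = [
    'JSESSIONID=ABAAABAAADGAACFC0077EDC55EEC248392A667B221CE7AB; Path=/; HttpOnly',
    'user_trace_token=20171104165207-d69fee97-d5d1-4a06-a406-e41989257b25;',
]
cookies = cookies_from_headers(set_cookie_values)
print(cookie_header(cookies))
```

Every request in the manual version has to repeat this merge-and-attach step, which is one more place where a missed cookie or header can creep in.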

3 Crawling the personal page after login

import re
import requests

session = requests.Session()

# Step 1: request login.html first to get the cookies and the two anti-forge values
r1 = session.get('https://passport.lagou.com/login/login.html',
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*)';", r1.text)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*)';", r1.text)[0]

# Step 2: log in, carrying the cookies from step 1; the backend authorizes the JSESSIONID in the cookie
r3 = session.post(
    url='https://passport.lagou.com/login/login.json',
    data={
        'isValidate': True,
        # 'username': '424662508@qq.com',
        # 'password': '4c4c83b3adf174b9c22af4a179dddb63',
        'username': '',
        'password': 'bff642652c0c9e766b40e1a6f3305274',
        'request_form_verifyCode': '',
        'submit': '',
    },
    headers={
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest',
        "Referer": "https://passport.lagou.com/login/login.html",
        "Host": "passport.lagou.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    },
)
print(r3.text)
# print(r3.headers)

# Step 3: request authorization; do not follow the redirect yet
r4 = session.get('https://passport.lagou.com/grantServiceTicket/grant.html',
                 allow_redirects=False,
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r4.headers)
location = r4.headers['Location']
# print(location)

# Step 4: request the redirect URL to get the final logged-in session
r5 = session.get(location,
                 allow_redirects=True,
                 headers={
                     'Host': "www.lagou.com",
                     'Referer': 'https://passport.lagou.com/login/login.html?',
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r5.headers)

# =============== the login part ends here
r6 = session.get('https://www.lagou.com/resume/myresume.html')
print('林海峰' in r6.text)
print(r6.text)
# r6.text is the content of the personal page. From here, use the re module to extract whatever you need; that part is routine, so it is not covered.
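For readers who want to see the "extract it with re" step, here is a hedged sketch. The real layout of the resume page is not shown in this post, so the HTML fragment, the tag and class names, the sample e-mail address and the helper extract_field are all made up for illustration; adapt the pattern to whatever r6.text actually contains.

```python
import re

# Hypothetical fragment standing in for r6.text; the tag and class names
# are invented, since the real page layout is not shown in this post.
SAMPLE_RESUME_HTML = '''
<div class="basic-info">
    <span class="name">林海峰</span>
    <span class="email">someone@example.com</span>
</div>
'''


def extract_field(html, css_class):
    """Return the text of the first <span class="css_class">, or None."""
    pattern = r'<span class="%s">(.*?)</span>' % re.escape(css_class)
    match = re.search(pattern, html, re.S)
    return match.group(1).strip() if match else None


print(extract_field(SAMPLE_RESUME_HTML, 'name'))
print(extract_field(SAMPLE_RESUME_HTML, 'email'))
```

Returning None for a missing field keeps the caller in control of what to do when the page does not contain it.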

4 Crawling and filtering job listings

import re
import requests

session = requests.Session()

# Step 1: request login.html first to get the cookies and the two anti-forge values
r1 = session.get('https://passport.lagou.com/login/login.html',
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*)';", r1.text)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*)';", r1.text)[0]

# Step 2: log in, carrying the cookies from step 1; the backend authorizes the JSESSIONID in the cookie
r3 = session.post(
    url='https://passport.lagou.com/login/login.json',
    data={
        'isValidate': True,
        # 'username': '424662508@qq.com',
        # 'password': '4c4c83b3adf174b9c22af4a179dddb63',
        'username': '',
        'password': 'bff642652c0c9e766b40e1a6f3305274',
        'request_form_verifyCode': '',
        'submit': '',
    },
    headers={
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest',
        "Referer": "https://passport.lagou.com/login/login.html",
        "Host": "passport.lagou.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    },
)
print(r3.text)
# print(r3.headers)

# Step 3: request authorization; do not follow the redirect yet
r4 = session.get('https://passport.lagou.com/grantServiceTicket/grant.html',
                 allow_redirects=False,
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r4.headers)
location = r4.headers['Location']
# print(location)

# Step 4: request the redirect URL to get the final logged-in session
r5 = session.get(location,
                 allow_redirects=True,
                 headers={
                     'Host': "www.lagou.com",
                     'Referer': 'https://passport.lagou.com/login/login.html?',
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r5.headers)

# =============== the login part ends here

# Crawl the job listings
# Step 1: analysis
# Sample URL of a job search: https://www.lagou.com/jobs/list_python%E5%BC%80%E5%8F%91?labelWords=&fromSearch=true&suginput=
from urllib.parse import quote
keyword = 'python开发'
url = 'https://www.lagou.com/jobs/list_%s?labelWords=&fromSearch=true&suginput=' % quote(keyword)  # build the search URL from the user's keyword; quote(keyword) gives python%E5%BC%80%E5%8F%91
print(url)

# Fetch the main search page
r7 = session.get(url,
                 headers={
                     'Host': "www.lagou.com",
                     'Referer': 'https://passport.lagou.com/login/login.html?',
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
                 })
# The main page turns out not to contain the job listings we searched for, so they must be rendered later by JS; a look at the traffic confirms it
print(r7.text)
# Job search: requesting the search URL returns only static content; the listings themselves come back as JSON from a request to https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0

# Step 2: verify the analysis
# Crawl the listings: send a POST request and get JSON back from 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'
r8 = session.post('https://www.lagou.com/jobs/positionAjax.json',
                  params={
                      'needAddtionalResult': False,
                      'isSchoolJob': '',
                  },
                  headers={
                      'Host': "www.lagou.com",
                      'Origin': 'https://www.lagou.com',
                      'Referer': url,
                      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
                      'X-Anit-Forge-Code': '',
                      'X-Anit-Forge-Token': '',
                      'X-Requested-With': 'XMLHttpRequest',
                      'Accept': 'application/json, text/javascript, */*; q=0.01'
                  },
                  data={
                      'first': True,
                      'pn': '',
                      'kd': 'python开发'
                  })
print(r8.json())
# pageNo: 1 means the first page and pageSize: 15 means this page holds 15 job records; all that remains is to find out how many pages there are in total

# Step 3 (final implementation): filter the job listings by the parameters passed in
from urllib.parse import quote
keyword = 'python开发'
url = 'https://www.lagou.com/jobs/list_%s?labelWords=&fromSearch=true&suginput=' % quote(keyword)  # build the search URL from the user's keyword; quote(keyword) gives python%E5%BC%80%E5%8F%91


def search_position(
        keyword,
        pn=1,
        city='北京',
        district=None,
        bizArea=None,
        isSchoolJob=None,
        xl=None,
        jd=None,
        hy=None,
        yx=None,
        needAddtionalResult=False,
        px='default'):
    # Query-string filters; the values are the Chinese labels the site itself uses
    params = {
        'city': city,  # work location, e.g. 北京
        'district': district,  # administrative district, e.g. 朝阳区
        'bizArea': bizArea,  # business area, e.g. 望京
        'isSchoolJob': isSchoolJob,  # job type, e.g. 应届 (new graduate)
        'xl': xl,  # education requirement, e.g. 大专
        'jd': jd,  # funding stage, e.g. 天使轮, A轮
        'hy': hy,  # industry, e.g. 移动互联网
        'yx': yx,  # salary range, e.g. 10k-15k
        'needAddtionalResult': needAddtionalResult,
        'px': px,  # sort order
    }
    r8 = session.post('https://www.lagou.com/jobs/positionAjax.json',
                      params=params,
                      headers={
                          'Host': "www.lagou.com",
                          'Origin': 'https://www.lagou.com',
                          'Referer': url,  # the search page URL built above
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
                          'X-Anit-Forge-Code': '',
                          'X-Anit-Forge-Token': '',
                          'X-Requested-With': 'XMLHttpRequest',
                          'Accept': 'application/json, text/javascript, */*; q=0.01'
                      },
                      data={
                          'first': True,
                          'pn': pn,  # page number
                          'kd': keyword,  # search keyword
                      })
    print(r8.status_code)
    print(r8.json())
    return r8.json()


# Example: look for a python开发 job in 朝阳区, 北京, paying 10k-15k
keyword = 'python开发'
yx = '10k-15k'
city = '北京'
district = '朝阳区'
isSchoolJob = ''  # 应届 (new graduate) or 实习 (internship)

response = search_position(keyword=keyword, yx=yx, city=city, district=district, isSchoolJob=isSchoolJob)
results = response['content']['positionResult']['result']


# Print the details of each listing
def get_company_info(results):
    for res in results:
        info = '''
Company full name   : %s
Address             : %s,%s
Posted at           : %s
Position            : %s
Position type       : %s,%s
Work mode           : %s
Salary              : %s
Benefits            : %s
Experience required : %s
Company size        : %s
Detail link         : https://www.lagou.com/jobs/%s.html
''' % (
            res['companyFullName'],
            res['city'],
            res['district'],
            res['createTime'],
            res['positionName'],
            res['firstType'],
            res['secondType'],
            res['jobNature'],
            res['salary'],
            res['positionAdvantage'],
            res['workYear'],
            res['companySize'],
            res['positionId']
        )
        print(info)
        # Analysis shows that every detail link looks like https://www.lagou.com/jobs/2653020.html, where the number is the job ID
        # print('full name [%s], short name [%s]' % (res['companyFullName'], res['companyShortName']))


get_company_info(results)
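The note in step 2 says the remaining task is to work out how many pages there are. A minimal sketch of that arithmetic follows. Where the total record count lives in the JSON is not shown in this post, so it is passed in as a plain argument; the helper names and the sample total of 100 are assumptions, while the page size of 15 and the first/pn/kd form fields come from the code above.

```python
import math


def total_pages(total_count, page_size=15):
    """Number of result pages needed to hold total_count records."""
    return math.ceil(total_count / page_size)


def page_payloads(keyword, total_count, page_size=15):
    """Yield one POST form body per result page (pn starts at 1)."""
    for pn in range(1, total_pages(total_count, page_size) + 1):
        yield {'first': True, 'pn': pn, 'kd': keyword}


# 100 is a made-up total, used only for illustration.
for payload in page_payloads('python开发', total_count=100):
    print(payload)
```

Each yielded dict could be passed as the data argument of the POST in search_position, one request per page; adding a pause between requests is probably wise given how strictly the site checks its traffic.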

5 Submitting the resume automatically

import re
import requests

session = requests.Session()

# Step 1: request login.html first to get the cookies and the two anti-forge values
r1 = session.get('https://passport.lagou.com/login/login.html',
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*)';", r1.text)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*)';", r1.text)[0]

# Step 2: log in, carrying the cookies from step 1; the backend authorizes the JSESSIONID in the cookie
r3 = session.post(
    url='https://passport.lagou.com/login/login.json',
    data={
        'isValidate': True,
        # 'username': '424662508@qq.com',
        # 'password': '4c4c83b3adf174b9c22af4a179dddb63',
        'username': '',
        'password': 'bff642652c0c9e766b40e1a6f3305274',
        'request_form_verifyCode': '',
        'submit': '',
    },
    headers={
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest',
        "Referer": "https://passport.lagou.com/login/login.html",
        "Host": "passport.lagou.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    },
)
print(r3.text)
# print(r3.headers)

# Step 3: request authorization; do not follow the redirect yet
r4 = session.get('https://passport.lagou.com/grantServiceTicket/grant.html',
                 allow_redirects=False,
                 headers={'Host': "passport.lagou.com",
                          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r4.headers)
location = r4.headers['Location']
# print(location)

# Step 4: request the redirect URL to get the final logged-in session
r5 = session.get(location,
                 allow_redirects=True,
                 headers={
                     'Host': "www.lagou.com",
                     'Referer': 'https://passport.lagou.com/login/login.html?',
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})
# print(r5.headers)

# =============== the login part ends here

# Submit the resume automatically (the positionId in data is the number in 3476321.html)
# First visit the job page to get X_Anti_Forge_Token, X_Anti_Forge_Code and userid
r9 = session.get('https://www.lagou.com/jobs/3476321.html',
                 headers={
                     'Host': "www.lagou.com",
                     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
                 })
X_Anti_Forge_Token = re.findall(r"window.X_Anti_Forge_Token = '(.*)';", r9.text)[0]
X_Anti_Forge_Code = re.findall(r"window.X_Anti_Forge_Code = '(.*)';", r9.text)[0]
userid = re.findall(r'value="(\d+)" name="userid"', r9.text)[0]
print(userid, type(userid))
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(userid)

# Then send the user ID and the job ID in a POST request
r10 = session.post('https://www.lagou.com/mycenterDelay/deliverResumeBeforce.json',
                   headers={
                       'Host': "www.lagou.com",
                       'Origin': 'https://www.lagou.com',
                       'Referer': 'https://www.lagou.com/jobs/3737624.html',
                       'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
                       'X-Anit-Forge-Code': X_Anti_Forge_Code,
                       'X-Anit-Forge-Token': X_Anti_Forge_Token,
                       'X-Requested-With': 'XMLHttpRequest',
                   },
                   data={
                       'userId': userid,
                       'positionId': '',  # the job ID, i.e. the number in the job page URL
                       'force': False,
                       'type': '',
                       'resubmitToken': ''
                   })
print(r10.status_code)
print(r10.text)  # the result can be checked in the delivery box: https://www.lagou.com/mycenter/delivery.html
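The submission needs two IDs: the user ID read from the job page and the job ID taken from the job URL. The sketch below assembles the form body from those two inputs without touching the network. The helper names and the sample fragment with user ID 1234567 are made up for illustration; the userid pattern, the job URL shape and the form fields are the ones used in the script above.

```python
import re


def position_id_from_url(job_url):
    """Pull the numeric job ID out of a URL like .../jobs/3476321.html."""
    match = re.search(r'/jobs/(\d+)\.html', job_url)
    if match is None:
        raise ValueError('not a job detail URL: %r' % job_url)
    return match.group(1)


def user_id_from_page(html):
    """Read the hidden userid input from a job detail page."""
    found = re.findall(r'value="(\d+)" name="userid"', html)
    if not found:
        raise ValueError('userid input not found; is the session logged in?')
    return found[0]


def delivery_form(job_url, job_page_html):
    """Assemble the POST body used for the resume submission."""
    return {
        'userId': user_id_from_page(job_page_html),
        'positionId': position_id_from_url(job_url),
        'force': False,
        'type': '',
        'resubmitToken': '',
    }


# Hypothetical page fragment; 1234567 is a made-up user ID.
sample_html = '<input type="hidden" value="1234567" name="userid"/>'
print(delivery_form('https://www.lagou.com/jobs/3476321.html', sample_html))
```

Deriving positionId from the same URL that was fetched keeps the job page, the Referer and the form body pointing at one job.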
 
 
 
