Web scraping: other uses of the requests module, a Chouti thread-pool callback crawl-and-save example, and a GitHub login example

Other uses of the requests module

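# Requests usually need headers; they are the key to making the client look like a browser. The most useful ones are:

Host

Referer # large sites often use this to check where a request came from

User-Agent # identifies the client

Cookie # cookies travel in the request headers, but requests has a dedicated cookies= parameter for them, so keep them out of headers={}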

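import requests

response=requests.get('http://www.jianshu.com')

# Response attributes
print(response.text)                # body decoded to str
print(response.content)             # body as raw bytes
print(response.status_code)         # HTTP status code
print(response.headers)             # response headers
print(response.cookies)             # cookies set by the server
print(response.cookies.get_dict())  # the same cookies as a plain dict
print(response.cookies.items())     # the same cookies as (name, value) pairs
print(response.url)                 # final URL after any redirects
print(response.history)             # responses of the redirects that were followed
print(response.encoding)            # encoding used to decode .text

# Closing the connection: response.close()
from contextlib import closing

with closing(requests.get('xxx',stream=True)) as response:
    for line in response.iter_content():
        pass

Response attributes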

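# stream parameter: fetch the body piece by piece. When downloading a video of, say, 100 GB, reading response.content and writing it to a file in one go is unreasonable.
import requests

response=requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',stream=True)
with open('b.mp4','wb') as f:
    # chunk_size=8192 is an arbitrary choice; the default is 1 byte per iteration
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)

Downloading binary data with Response.iter_content()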
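The chunked write loop above can be tried without a network connection. The following is a minimal sketch, not the post's own code: `save_chunks` and `fake_chunks` are hypothetical helpers, and `fake_chunks` stands in for `response.iter_content(...)` by yielding byte chunks of a fixed size.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


def fake_chunks(total, size):
    """Hypothetical stand-in for response.iter_content(chunk_size=size)."""
    for start in range(0, total, size):
        yield b"x" * min(size, total - start)


path = os.path.join(tempfile.mkdtemp(), "b.bin")
written = save_chunks(fake_chunks(20000, 8192), path)
print(written, os.path.getsize(path))
```

Only one chunk is held in memory at a time, which is the point of stream=True for large downloads.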

# Parsing JSON
import json
import requests

response=requests.get('http://httpbin.org/get')

res1=json.loads(response.text) # works, but verbose
res2=response.json() # parses the JSON body directly

print(res1 == res2) # True

Parsing JSON with Response.json()

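Different body encodings

requests.post(url='xxxxxxxx',
              data={'xxx':'yyy'}) # no headers given; default Content-Type: application/x-www-form-urlencoded

# If we set Content-Type to application/json ourselves but still pass the values via data=,
# the body is still form-encoded, so a server that parses it as JSON gets no values
requests.post(url='',
              data={'':1,},
              headers={
                  'content-type':'application/json'
              })

requests.post(url='',
              json={'':1,},
              ) # default Content-Type: application/json

The data and json parameters of post()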
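To see why a form-encoded body sent under a JSON Content-Type is unreadable for the server, it helps to look at the two encodings side by side. This is a sketch using only the standard library; `form_body` and `json_body` are hypothetical helpers that approximate what requests produces for a simple flat dict, not the library's internal code.

```python
import json
from urllib.parse import urlencode


def form_body(payload):
    """Approximate body for data=payload (form encoding)."""
    return urlencode(payload)


def json_body(payload):
    """Approximate body for json=payload (JSON encoding)."""
    return json.dumps(payload)


payload = {"xxx": "yyy", "n": 1}
form = form_body(payload)
js = json_body(payload)
print(form)
print(js)

# A server that trusts "Content-Type: application/json" will try to parse
# the body as JSON; a form-encoded body fails at that step.
try:
    json.loads(form)
    form_is_json = True
except ValueError:
    form_is_json = False
print(form_is_json)
```

The JSON body round-trips through json.loads, while the form body does not, so the header and the encoding have to agree.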

Advanced usage

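# Certificate verification (most sites use https)
import requests

response=requests.get('https://www.12306.cn') # for an SSL request the certificate is checked first; if it is invalid an error is raised and the program stops

# Improvement 1: no error, but a warning is printed
import requests

response=requests.get('https://www.12306.cn',verify=False) # certificate not verified; warns, returns 200
print(response.status_code)

# Improvement 2: no error and no warning
import requests
from requests.packages import urllib3

urllib3.disable_warnings() # silence the warning
response=requests.get('https://www.12306.cn',verify=False)
print(response.status_code)

# Improvement 3: send a certificate
# Many sites use https but can be reached without a client certificate; in most cases sending one is optional
# Zhihu, Baidu and the like work either way
# Where it is a hard requirement you must send it, e.g. a site that only admits designated users who hold the certificate
import requests

response=requests.get('https://www.12306.cn',
                     cert=('/path/server.crt',
                           '/path/key'))
print(response.status_code)

Certificate verification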

# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies

# Proxies: the request goes to the proxy, which forwards it on your behalf (having your IP banned is common)
import requests

proxies={
    # Proxy with credentials: user name and password go before the @ sign.
    # A dict keeps only one value per key, so use either this 'http' entry or the one below.
    # 'http':'http://egon:123@localhost:9743',
    'http':'http://localhost:9743',
    'https':'https://localhost:9743',
}
response=requests.get('https://www.12306.cn',
                     proxies=proxies)
print(response.status_code)

# SOCKS proxies are also supported; install with: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response=requests.get('https://www.12306.cn',
                     proxies=proxies)
print(response.status_code)

Two kinds of proxy
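Two details of the proxies dict are easy to get wrong: a repeated key silently keeps only the last value, and credentials containing characters such as @ or : need percent-encoding before they go into the proxy URL. The sketch below illustrates both with the standard library only; `proxy_url` is a hypothetical helper, and the host, port and credentials are the placeholder values from the example above.

```python
from urllib.parse import quote


def proxy_url(host, port, user=None, password=None, scheme="http"):
    """Build a proxy URL, percent-encoding the credentials if given."""
    if user is None:
        return "%s://%s:%s" % (scheme, host, port)
    cred = "%s:%s" % (quote(user, safe=""), quote(password or "", safe=""))
    return "%s://%s@%s:%s" % (scheme, cred, host, port)


proxies = {
    "http": proxy_url("localhost", 9743, "egon", "123"),
    "https": proxy_url("localhost", 9743, "egon", "123"),
}
print(proxies)

# Repeating a key does not register two proxies: the last value wins.
duplicated = dict([("http", "first"), ("http", "second")])
print(duplicated)
```

The resulting dict can be passed as proxies=proxies in the same way as the hand-written one.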

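# Timeouts
# Two forms: float or tuple
# timeout=0.1 # a single value applies to both the connect timeout and the read timeout
# timeout=(0.1,0.2) # 0.1 is the connect timeout, 0.2 is the read timeout
import requests

response=requests.get('https://www.baidu.com',
                     timeout=0.0001)

Timeouts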

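# Official docs: http://docs.python-requests.org/en/master/user/authentication/

# Authentication: some sites pop up a dialog asking for a user name and password when you log in (much like alert); until you answer it you cannot get the HTML
# Under the hood the credentials are simply assembled into a request header:
#         r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
# Most sites do not use this default encoding scheme and write their own
# In that case we write our own function, similar to _basic_auth_str, that follows the site's scheme
# and add the resulting string to the request headers:
#         r.headers['Authorization'] =func('.....')

# Here is the default scheme anyway; sites usually do not use it
import requests
from requests.auth import HTTPBasicAuth

r=requests.get('xxx',auth=HTTPBasicAuth('user','password'))
print(r.status_code)

# HTTPBasicAuth can be shortened to a tuple
import requests

r=requests.get('xxx',auth=('user','password'))
print(r.status_code)

Authentication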
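For reference, the header that HTTP Basic authentication produces is just "Basic " followed by the Base64 encoding of user:password. The sketch below rebuilds it with the standard library; `basic_auth_header` is a hypothetical name for illustration, not the library's private `_basic_auth_str`, and it assumes credentials that fit in Latin-1.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic auth."""
    raw = ("%s:%s" % (username, password)).encode("latin1")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("user", "password")
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it.
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("latin1")
print(decoded)
```

A site-specific scheme would replace the body of this function and set the result on headers['Authorization'] in the same way.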

# Exception handling
import requests
from requests.exceptions import * # see requests.exceptions for the available exception types

try:
    r=requests.get('http://www.baidu.com',timeout=0.00001)
except ReadTimeout:
    print('===:')
# except ConnectionError: # network unreachable
#     print('-----')
# except Timeout:
#     print('aaaaa')
except RequestException:
    print('Error')

Exception handling

# API (server side)
from django.shortcuts import render,HttpResponse

# Create your views here.
def test(request):
    if request.method=='POST':
        f=request.FILES.get('file')
        print(f)
        with open('a.text','wb') as f1:
            for i in f:
                f1.write(i)
    return HttpResponse('ok')

# Crawler (client side)
import requests

files={'file':open('data.text','rb')}
response=requests.post('http://127.0.0.1:8000/test/',files=files)
print(response.status_code)

Uploading a file

Chouti thread-pool example

import os
import requests
import json
import re
from threading import Lock

URL='https://dig.chouti.com/all/hot/recent/%s'
headers={'user-agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36"}

# Proxy IP (this entry only applies to http:// URLs; an https:// URL needs an 'https' entry to go through a proxy)
proxies = {
    'http':'59.32.37.5:3128',
}
download_num=0
lock=Lock() # callbacks run in pool threads, so guard the file and the counter

# Fetch the page text
def get_text(URL):
    res=requests.get(URL, proxies=proxies,headers=headers)
    return res.text

# Parse the text (used as a callback, so the argument is the finished Future)
def parser_text(text):
    s=text.result()
    title=re.findall(r'<div class="part2" share-pic=.*? share-title=(.*?) share-summary=',s)
    zan=re.findall(r'<a href="javascript:;" class="digg-a" title="推荐"><span class="hand-icon icon-digg"></span><b>(\w+?)</b><i style="display:none">\w+</i></a>',s)
    tim=re.findall(r'<span class="left time-into"><a class="time-a" href="/link/\w+" target="_blank"><b>(.*?)</b></a><i>入热榜</i></span>',s)
    user=re.findall(r'<a href=.*? class="user-a">.*?<b>(\w+?)</b></a>',s)
    data=list(zip(title, user, tim, zan))
    download(data)

# Save
def download(data):
    print(data)
    global download_num
    lis=json.dumps(data,ensure_ascii=False)
    print(lis)
    with lock:
        os.makedirs('file',exist_ok=True)
        with open('file/data.text','at',encoding='utf-8') as f:
            f.write(lis)
        download_num+=1
        print('\nDownloads completed: %s+++++++++++++++++++++++++++++++++++++++++'%download_num)

# Crawl with a thread pool
if __name__ == '__main__':
    from concurrent.futures import ThreadPoolExecutor
    p=ThreadPoolExecutor(20)
    for num in range(1,50):
        res=p.submit(get_text,URL%num)
        res.add_done_callback(parser_text)
    p.shutdown(wait=True)

Proxy + thread pool (tasks submitted with callbacks)
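The submit-then-callback pattern used above can be exercised without touching the network. In this minimal sketch `fake_fetch` is a hypothetical stand-in for the real request, and the callback receives a Future, which is why it has to call .result() before it can use the text. The lock is there because callbacks may run in different pool threads.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def fake_fetch(page):
    """Hypothetical stand-in for a network request."""
    return "page-%s" % page


def crawl(pages, workers=4):
    results = []
    lock = Lock()

    def on_done(future):
        # The callback is handed the Future, not the return value.
        text = future.result()
        with lock:  # protect shared state across pool threads
            results.append(text.upper())

    # Leaving the with-block waits for every submitted task to finish.
    with ThreadPoolExecutor(workers) as pool:
        for page in pages:
            pool.submit(fake_fetch, page).add_done_callback(on_done)
    return sorted(results)


done = crawl(range(1, 6))
print(done)
```

Using the executor as a context manager has the same effect as calling shutdown(wait=True) at the end.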

GitHub login example

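'''
1. Analysing the target site
    Open https://github.com/login in a browser
    Enter a wrong account and password, and capture the traffic
    The login turns out to be a POST to: https://github.com/session
    The request headers include a cookie
    The request body contains:
        commit:Sign in
        utf8:✓
        authenticity_token:lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==
        login:egonlin
        password:123

2. Flow
    First GET https://github.com/login to obtain the initial cookie and the authenticity_token
    Then POST to https://github.com/session with the initial cookie and the request body (authenticity_token, user name, password, etc.)
    Finally you get the login cookie

    PS: if the password is sent encrypted, enter a wrong account with the correct password and read the encrypted password from the browser; GitHub sends the password in plain text
'''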

import requests
import re

# First request
r1=requests.get('https://github.com/login')
r1_cookie=r1.cookies.get_dict() # initial cookie (not yet authorised)
authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] # pull the CSRF token out of the page

# Second request: POST to the login endpoint with the initial cookie, the token, and the account and password
data={
    'commit':'Sign in',
    'utf8':'✓',
    'authenticity_token':authenticity_token,
    'login':'317828332@qq.com',
    'password':'alex3714'
}
r2=requests.post('https://github.com/session',
             data=data,
             cookies=r1_cookie
             )
login_cookie=r2.cookies.get_dict()

# Third request: from now on the login_cookie is enough, e.g. to open personal settings pages
r3=requests.get('https://github.com/settings/emails',
                cookies=login_cookie)
print('317828332@qq.com' in r3.text) #True

Simulating a GitHub login with requests
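The token-extraction step can be checked offline against a sample HTML fragment. The sketch below reuses the same regular expression; `extract_token` and the sample markup are hypothetical, and the real login page may be laid out differently. Returning None instead of indexing with [0] avoids an IndexError when the pattern no longer matches.

```python
import re

# Same pattern as in the login example.
TOKEN_RE = re.compile(r'name="authenticity_token".*?value="(.*?)"')


def extract_token(html):
    """Return the first authenticity_token value, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


# Hypothetical fragment shaped like a hidden form field.
sample_html = (
    '<form action="/session" method="post">'
    '<input type="hidden" name="authenticity_token" value="abc123==" />'
    "</form>"
)
token = extract_token(sample_html)
print(token)
print(extract_token("<html></html>"))
```

Note that `.` does not match newlines by default, so this pattern only works when the name and value attributes sit on the same line.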

import requests
import re

session=requests.session()

# First request
r1=session.get('https://github.com/login')
authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] # pull the CSRF token out of the page

# Second request
data={
    'commit':'Sign in',
    'utf8':'✓',
    'authenticity_token':authenticity_token,
    'login':'317828332@qq.com',
    'password':'alex3714'
}
r2=session.post('https://github.com/session',
             data=data,
             )

# Third request
r3=session.get('https://github.com/settings/emails')
print('317828332@qq.com' in r3.text) #True

Cookies carried automatically by a session
