Web scraping: how HTTP works, a Pear Video example, a GitHub login example, and a short summary of requests parameters

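Review: HTTP is a request-response protocol. A request consists of a request line, request headers ({'key': 'value'} pairs), and a request body. A response consists of a status line, response headers ({'key': 'value'} pairs), and a response body.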

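import socket

sock = socket.socket()
sock.bind(("127.0.0.1", 8808))
sock.listen(5)

while True:
    print("server waiting.....")
    conn, addr = sock.accept()
    # Read up to 1024 bytes of the browser's raw request and print it
    data = conn.recv(1024)
    print("data", data)
    # Read the HTML file
    with open("login.html", "rb") as f:
        data = f.read()
    # Status line, one header, a blank line, then the body.
    # sendall keeps writing until every byte has been sent.
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n%s" % data)
    conn.close()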

Browser interaction over a raw socket
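To watch the same exchange from the client's side without opening a browser, here is a minimal sketch that starts a one-shot server like the one above on a free local port and talks to it over a plain socket. The helper names (serve_once, fetch, round_trip) and the inline HTML body are assumptions made for illustration; they are not part of the original script.

```python
import socket
import threading


def serve_once(server, body):
    """Accept one connection, read the request, answer with a fixed page."""
    conn, _addr = server.accept()
    with conn:
        conn.recv(1024)  # the raw request bytes; ignored in this sketch
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + body)


def fetch(host, port, path="/"):
    """Send a minimal GET request and return the raw response bytes."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(request.encode("ascii"))
        chunks = []
        while True:
            chunk = client.recv(4096)
            if not chunk:  # the server closed the connection: response complete
                break
            chunks.append(chunk)
    return b"".join(chunks)


def round_trip(body=b"<h1>hello</h1>"):
    """Start a one-shot server on a free port and fetch one page from it."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server, body), daemon=True)
    worker.start()
    try:
        return fetch("127.0.0.1", port)
    finally:
        worker.join(timeout=5)
        server.close()


if __name__ == "__main__":
    head, _, page = round_trip().partition(b"\r\n\r\n")
    print(head.decode("ascii"))  # status line and headers
    print(page)                  # response body
```

The client reads until the server closes the connection, because this response carries no Content-Length header to say where the body ends.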

'''
GET request

# Request line
GET / HTTP/1.1\r\n

# Request line with query parameters after the path
b'GET /?name=wd&age=11 HTTP/1.1\r\n

# Request headers
Host: 127.0.0.1:8008\r\n
Connection: keep-alive\r\n
Cache-Control: max-age=0\r\n
Upgrade-Insecure-Requests: 1\r\n
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\n
Accept-Encoding: gzip, deflate, br\r\n
Accept-Language: zh-CN,zh;q=0.9\r\n
Cookie: csrftoken=7xx6BxQDJ6KB0PM7qS8uTA892ACtooNbnnF4LDwlYk1Y7S7nTS81FBqwruizHsxF\r\n\r\n'

# Request body (empty for a GET request)
b''

POST request

# Request line
b'POST /?name=wd&age=11 HTTP/1.1\r\n

# Request headers
Host: 127.0.0.1:8008\r\n
Connection: keep-alive\r\n
Content-Length: 19\r\n
Cache-Control: max-age=0\r\n
Origin: http://127.0.0.1:8008\r\n
Upgrade-Insecure-Requests: 1\r\n
Content-Type: application/x-www-form-urlencoded\r\n
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\n
Referer: http://127.0.0.1:8008/?name=lqz&age=18\r\n
Accept-Encoding: gzip, deflate, br\r\n
Accept-Language: zh-CN,zh;q=0.9\r\n
Cookie: csrftoken=7xx6BxQDJ6KB0PM7qS8uTA892ACtooNbnnF4LDwlYk1Y7S7nTS81FBqwruizHsxF\r\n\r\n'

# Request body
b'name=wd&password=11'
'''

Request
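The layout of a request (request line, headers, blank line, body) can be checked by splitting the raw bytes by hand. The sketch below is a simplified illustration, not a full HTTP parser: parse_request is a name made up here, the sample bytes are a shortened version of the POST request above, and repeated header names simply overwrite each other.

```python
from urllib.parse import parse_qs, urlsplit


def parse_request(raw):
    """Split raw HTTP request bytes into request line parts, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")  # the blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body


if __name__ == "__main__":
    raw = (
        b"POST /?name=wd&age=11 HTTP/1.1\r\n"
        b"Host: 127.0.0.1:8008\r\n"
        b"Content-Type: application/x-www-form-urlencoded\r\n"
        b"Content-Length: 19\r\n"
        b"\r\n"
        b"name=wd&password=11"
    )
    method, target, version, headers, body = parse_request(raw)
    print(method, target, version)
    print(parse_qs(urlsplit(target).query))  # query parameters from the URL
    print(parse_qs(body.decode("ascii")))    # form fields from the body
```

Query parameters come out of the request line and form fields come out of the body, which is the difference between the GET and POST captures.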

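# Status line
b"HTTP/1.1 200 OK\r\n
# Response header, followed by the blank line that ends the headers
Content-Type: text/html\r\n\r\n
# Response body
%s" % data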

Response
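The response above carries a single header and relies on closing the connection to mark the end of the body. A slightly fuller response also states the body length. The helper below is a sketch under that assumption; build_response and its default arguments are invented for this example.

```python
def build_response(body, status="200 OK", content_type="text/html"):
    """Assemble status line, headers, blank line and body into response bytes."""
    head = (
        f"HTTP/1.1 {status}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"  # number of body bytes that follow
        "Connection: close\r\n"
        "\r\n"                              # blank line: end of headers
    )
    return head.encode("ascii") + body


if __name__ == "__main__":
    print(build_response(b"<h1>hi</h1>"))
```

In the socket server this could stand in for the hand-written bytes, for example conn.sendall(build_response(data)).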

How HTTP works


Pear Video example

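# A response can be read in three forms
# 1. text                 search it for what you need (for example with a regex)
# 2. content (bytes)      save it as an image, a video, etc.
# 3. json                 deserialize it into a dict or a list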

import os
import requests
from concurrent.futures import ThreadPoolExecutor

# Download helper
def download(videos, title):
    # exist_ok avoids a race when several threads create the directory at once
    os.makedirs('video', exist_ok=True)
    path = os.path.join('video', title) + '.mp4'
    res = requests.get(videos)
    with open(path, 'wb') as f:
        f.write(res.content)

# Run the downloads in a thread pool
# (get_index, parser_index, get_video and video_info are not shown in this post)
if __name__ == '__main__':
    p = ThreadPoolExecutor(10)
    for i in parser_index(get_index()):
        dic = video_info(get_video(i))
        print(dic)
        p.submit(download, dic['video'], dic['title'])
    p.shutdown(wait=True)
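Reading res.content holds the whole video in memory before it is written. For large files, requests can stream the body in chunks with stream=True and iter_content. The sketch below is one possible variant of the download helper, not the author's code; save_chunks, download_stream, the chunk size and the timeout are assumptions.

```python
import os


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download_stream(url, title, folder="video", chunk_size=64 * 1024):
    """Stream url into folder/title.mp4 without loading it all into memory."""
    import requests  # third-party; imported here so save_chunks works without it

    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, title + ".mp4")
    with requests.get(url, stream=True, timeout=30) as res:
        res.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
        return save_chunks(res.iter_content(chunk_size=chunk_size), path)
```

It takes the same two pieces of information as download, so it could be submitted to the thread pool in the same way.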

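# Note: Pear Video loads more videos as you scroll down. The extra videos are fetched through URL parameters (for example, how many videos to show in a category).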
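If scroll loading is driven by URL parameters, the later pages can be requested directly by generating those URLs. The sketch below only shows the idea: the base URL and the parameter names categoryId and start are hypothetical placeholders, so the real names have to be read from the browser's network panel.

```python
from urllib.parse import urlencode


def page_urls(base_url, category_id, page_size, pages):
    """Yield one listing URL per page by advancing a start offset."""
    for page in range(pages):
        query = urlencode({
            "categoryId": category_id,  # hypothetical parameter name
            "start": page * page_size,  # hypothetical offset parameter
        })
        yield f"{base_url}?{query}"


if __name__ == "__main__":
    for url in page_urls("https://example.com/list", 1, 12, 3):
        print(url)
```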

GitHub login example

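# A GET request to the login page obtains the random CSRF string and the cookies

# The POST login request must carry the CSRF token and the entered username and password (request body data), together with the cookies, user-agent, referer, etc. (request header data)

Is a given piece of data request body data or request header data? (My understanding: form fields, such as the data option of an AJAX call, travel in the request body, while cookies, user-agent and referer travel in the request headers. On the server side, the content a Django view returns is the response body, and response.set_cookie('islogin', 'true') sets a Set-Cookie response header.)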
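To make the body/header split concrete, the sketch below assembles the two parts of a form login by hand with the standard library. It is a rough picture of what an HTTP client does with the data and the cookies, not how requests is implemented; build_login_request and every value passed to it are placeholders.

```python
from urllib.parse import urlencode


def build_login_request(form, cookies, user_agent, referer):
    """Return (headers, body) showing where each piece of login data travels."""
    body = urlencode(form).encode("utf-8")  # form fields go in the request body
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        # cookies are sent back in a single Cookie request header
        "Cookie": "; ".join(f"{k}={v}" for k, v in cookies.items()),
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": str(len(body)),
    }
    return headers, body


if __name__ == "__main__":
    headers, body = build_login_request(
        {"login": "user", "password": "pw", "authenticity_token": "abc"},
        {"session_id": "s1"},  # placeholder cookie
        "demo-agent",
        "https://github.com/login",
    )
    print(headers)
    print(body)
```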

"""

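1. Request the login page to get the token and the cookies

2. Send the login POST request, with the username, password and token in the request body and the cookies in the request headers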

"""

import requests
import re

login_url = "https://github.com/login"

# Browser identification (User-Agent)
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

# Request the login page
res1 = requests.get(login_url, headers=headers)
print(res1.status_code)

# Extract the token from the response body
token = re.search('name="authenticity_token" value="(.*?)"', res1.text).group(1)

# Save the cookies
login_cookie = res1.cookies.get_dict()
print(login_cookie)

# Send the login request
res2 = requests.post("https://github.com/session",
                     headers=headers,
                     cookies=login_cookie,
                     data={
                         "commit": "Sign in",
                         "utf8": "✓",
                         "authenticity_token": token,
                         "login": "xxxxxxxxxxx",
                         "password": "xxxxxxxxxxx"},
                     # Do not follow redirects automatically, so the cookies
                     # set by the login response can be read from res2
                     allow_redirects=False)
print(res2.status_code)

# Cookies after a successful login
user_cookie = res2.cookies.get_dict()

# Visit the profile settings page with the user's cookies
res3 = requests.get("https://github.com/settings/profile", cookies=user_cookie, headers=headers)
print(res3.status_code)
print(res3.text)
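Passing cookies by hand between the three requests works, but requests.Session stores them automatically and reuses them on every later call. The sketch below is an alternative arrangement of the same two steps, using the same form fields as the code above; extract_token and login are names invented here, and whether the site still accepts this form is not verified.

```python
import re

TOKEN_RE = re.compile(r'name="authenticity_token" value="(.*?)"')


def extract_token(html):
    """Return the CSRF token from the login page HTML, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


def login(username, password, user_agent):
    """Log in through one Session so cookies carry over between requests."""
    import requests  # third-party; imported here so extract_token works without it

    session = requests.Session()
    session.headers.update({"user-agent": user_agent})
    page = session.get("https://github.com/login")
    token = extract_token(page.text)
    if token is None:
        raise RuntimeError("authenticity_token not found on the login page")
    session.post(
        "https://github.com/session",
        data={
            "commit": "Sign in",
            "utf8": "✓",
            "authenticity_token": token,
            "login": username,
            "password": password,
        },
    )
    return session  # session.cookies now holds whatever the site set
```

A later session.get("https://github.com/settings/profile") would then send the stored cookies without any cookies= argument.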

Short summary of requests parameters

# GET request parameters
kwd = "吴秀波出轨门"  # search keyword
url = "https://www.baidu.com/s"
requests.get(url, headers=headers, params={"wd": kwd})
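The params dict ends up in the URL as a percent-encoded query string. The standard-library sketch below shows roughly what that URL looks like; with_params is a helper made up for this illustration, the keyword is a placeholder, and the exact encoding produced by requests may differ in small details.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def with_params(url, params):
    """Append params to url as a percent-encoded query string."""
    return f"{url}?{urlencode(params)}"


if __name__ == "__main__":
    full = with_params("https://www.baidu.com/s", {"wd": "python 爬虫"})
    print(full)                             # non-ASCII text is percent-encoded
    print(parse_qs(urlsplit(full).query))   # decoding gives the keyword back
```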

# POST request parameters
requests.post("https://github.com/session",
              headers={
                  "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"},
              cookies=login_cookie,
              data={
                  "commit": "Sign in",
                  "utf8": "✓",
                  "authenticity_token": token,
                  "login": "ssssss",
                  "password": "ssssss"},
              # Whether to follow redirects automatically
              allow_redirects=False)

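# Handling the return value
# response.cookies.get_dict()  # get the cookies as a dict
# response.status_code         # status code
# response.text                # the result decoded as text
# response.content             # the result as raw bytes
# response.json()              # deserialize the body straight into a dict or list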

Main code
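How content, text and json() relate to each other can be shown offline with a small stand-in object. FakeResponse below is a hypothetical class written for this example; it is not part of requests and only imitates those three members.

```python
import json


class FakeResponse:
    """Hypothetical stand-in that mimics three members of a requests response."""

    def __init__(self, content, encoding="utf-8"):
        self.content = content    # raw bytes, as received
        self.encoding = encoding

    @property
    def text(self):
        return self.content.decode(self.encoding)  # bytes decoded to str

    def json(self):
        return json.loads(self.text)               # text parsed into dict/list


def summarize(res):
    """Return the size in bytes, the length in characters and the parsed data."""
    return len(res.content), len(res.text), res.json()


if __name__ == "__main__":
    res = FakeResponse('{"title": "梨"}'.encode("utf-8"))
    print(summarize(res))
```

The byte count and the character count differ as soon as the body contains non-ASCII text, which is why binary data such as videos should be saved from content and not from text.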
