What is requests in web scraping?

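import requests

response = requests.get('http://www.baidu.com')

print(response.text)         # response body decoded as text
print(response.headers)      # response headers
print(response.status_code)  # HTTP status code
print(response.encoding)     # encoding used to decode response.text
print(response.cookies)      # cookies set by the server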

GET request with parameters

import requests

data = {
    'name': 'abc',
    # ... more parameters
}
response = requests.get(url='http://www.baidu.com', params=data)
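To see how `params` ends up in the URL, the request can be prepared locally without sending it. This is a minimal sketch that assumes only the `data` dictionary from the example; `requests.Request(...).prepare()` builds the request but does not touch the network.

```python
import requests

data = {'name': 'abc'}

# Build the request without sending it, then inspect the final URL
prepared = requests.Request('GET', 'http://www.baidu.com', params=data).prepare()
print(prepared.url)
```

The dictionary is URL-encoded and appended as a query string, so non-ASCII values and spaces do not need to be escaped by hand.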

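Parsing JSON

import requests

# response.json() raises an error if the body is not valid JSON,
# so request an endpoint that actually returns JSON
response = requests.get(url='http://httpbin.org/get')
print(response.json())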
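Because `response.json()` fails on non-JSON bodies such as an HTML page, a small guard can be useful when scraping. The helper name `json_or_none` below is hypothetical; the sketch relies on the fact that the JSON decode errors raised by `response.json()` are subclasses of `ValueError`.

```python
def json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # JSON decode errors are ValueError subclasses
        return None
```

It could be used as `data = json_or_none(requests.get('http://httpbin.org/get'))`, followed by a check for `None` before reading any keys.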

Getting binary data

import requests

response = requests.get(url='http://www.baidu.com')
print(response.content)  # raw response body as bytes
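For images or other large files, the bytes are usually written to disk rather than printed. The sketch below streams the body in chunks with `stream=True` and `iter_content`; the helper names `write_chunks` and `download`, the chunk size, and the 10-second timeout are assumptions made for illustration.

```python
import requests

def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    total = 0
    with open(path, 'wb') as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                total += len(chunk)
    return total

def download(url, path):
    """Stream url to path so the whole body is never held in memory."""
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()  # fail early on 4xx / 5xx responses
        return write_chunks(response.iter_content(chunk_size=8192), path)
```

For small responses, `open(path, 'wb').write(response.content)` does the same job in one step.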

Advanced usage

File upload

import requests

# 'XXX' is a placeholder for the path of the file to upload
with open('XXX', 'rb') as fh:
    files = {'files': fh}
    response = requests.post(url='http://www.baidu.com', files=files)
print(response.content)
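What the `files` argument produces can be inspected without a server by preparing the request locally. In this sketch the file is a hypothetical in-memory `(filename, content)` tuple, and `http://httpbin.org/post` is only a stand-in URL; nothing is sent.

```python
import requests

# Hypothetical in-memory file: (filename, content) instead of an open file handle
files = {'file': ('hello.txt', b'hello world')}

# Prepare the upload without sending it
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])  # multipart/form-data plus a generated boundary
print(len(prepared.body))                # size of the encoded multipart body in bytes
```

requests encodes the file as `multipart/form-data` and sets the `Content-Type` header itself, so that header should not be set manually when uploading files.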

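Session persistence (simulated login)

import requests

s = requests.Session()
# The first request sets a cookie; the session keeps it for later requests
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)

Output:

{
  "cookies": {
    "number": "123456789"
  }
}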
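A session also carries default headers and manually set cookies into every request it sends. The sketch below shows this offline with `Session.prepare_request`; the User-Agent string `my-crawler/0.1` is a made-up value, and the cookie mirrors the one from the httpbin example.

```python
import requests

s = requests.Session()
s.headers.update({'User-Agent': 'my-crawler/0.1'})  # hypothetical UA string
s.cookies.set('number', '123456789')                # cookie stored on the session

# prepare_request merges session-level headers and cookies into the request
prepared = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

This is why a login performed through `s.post(...)` stays valid for later `s.get(...)` calls: the cookies returned by the login response are stored on the session and attached automatically.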

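Certificate verification

import requests
import urllib3

url = 'https://www.biqudu.com/43_43821/2520338.html'
urllib3.disable_warnings()  # suppress the warning raised once verification is turned off
response = requests.get(url=url, verify=False)  # skip SSL certificate verification
print(response.text)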

Proxy authentication

import requests

url = 'https://www.biqudu.com/43_43821/2520338.html'
proxies = {
    'http': 'http://127.0.0.2',
    'https': 'http://user:pwd@127.0.0.2',  # proxy that requires a username and password
}
response = requests.get(url=url, proxies=proxies)
print(response.text)

Handling request timeouts

import requests
from requests.exceptions import Timeout  # base class of ConnectTimeout and ReadTimeout

url = 'https://www.taobao.com'
try:
    response = requests.get(url=url, timeout=0.1)  # give up after 0.1 seconds
    print(response.status_code)
except Timeout:
    print('Request timed out')
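In a crawler, a timed-out request is often retried a few times before giving up. The following is a sketch under stated assumptions: the helper name `get_with_retry`, the retry count, the delay, and the `(connect, read)` timeout tuple are illustrative choices, and the `fetch` parameter exists only so the function can be exercised without a network.

```python
import time
import requests
from requests.exceptions import Timeout

def get_with_retry(url, retries=3, timeout=(3.05, 10), fetch=requests.get, delay=1.0):
    """Call fetch(url, timeout=timeout), retrying up to `retries` times on Timeout."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url, timeout=timeout)
        except Timeout:
            if attempt == retries:
                raise  # out of attempts: let the caller see the timeout
            time.sleep(delay)  # brief pause before the next attempt
```

Catching `Timeout` covers both connection and read timeouts, since `ConnectTimeout` and `ReadTimeout` both derive from it.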

Authentication

# Some sites ask for a username and password as soon as the page is opened
import requests
from requests.auth import HTTPBasicAuth

url = 'https://www.taobao.com'
response = requests.get(url=url, auth=HTTPBasicAuth('user', 'pwd'))
print(response.status_code)
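Passing a `(user, password)` tuple as `auth` is shorthand for `HTTPBasicAuth`. The sketch below prepares both variants locally, without sending anything, and compares the resulting `Authorization` header; the credentials are the placeholder values from the example.

```python
import requests
from requests.auth import HTTPBasicAuth

url = 'https://www.taobao.com'

# Prepare both variants locally; nothing is sent over the network
with_tuple = requests.Request('GET', url, auth=('user', 'pwd')).prepare()
with_class = requests.Request('GET', url, auth=HTTPBasicAuth('user', 'pwd')).prepare()

print(with_tuple.headers['Authorization'])
print(with_tuple.headers['Authorization'] == with_class.headers['Authorization'])
```

Basic authentication only base64-encodes the credentials, so it should be used over HTTPS.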

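1. Biquge novels (beginner-level text scraping)

Scrape the Biquge novel site: the overall ranking on the leaderboard page.

1. Request the start URL and fetch the page source
2. Parse the page source to extract the text content
3. Save every chapter title of each novel to a txt file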

from lxml import etree
import requests

base_url = 'http://www.biqiuge.com'
response = requests.get(base_url + '/paihangbang')
response.encoding = response.apparent_encoding  # let requests detect the page encoding
html = etree.HTML(response.text)
# Links to every novel in the first ranking block
links = html.xpath("//div[@class='block bd'][1]/ul[@class='tli']/li/a")
for link in links:
    title = link.xpath("./text()")[0]
    novel_url = base_url + link.xpath("./@href")[0]
    response1 = requests.get(url=novel_url)
    response1.encoding = response1.apparent_encoding
    novel_html = etree.HTML(response1.text)
    # Chapter titles; the slice [6:] skips the first six entries of the list
    chapters = novel_html.xpath("//div[@class='listmain']/dl/dd/a/text()")[6:]
    with open(title + '.txt', 'w', encoding='utf-8') as f:
        for chapter in chapters:
            f.write(chapter.strip() + '\n')
    print(title + "------written successfully")
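The XPath step can be tried offline before pointing it at the live site. The HTML fragment below is a hypothetical stand-in shaped like the ranking list that the XPath expression targets; the real page structure may differ. `urljoin` is used here instead of string concatenation, as an assumption about what is more robust when an `href` is already absolute.

```python
from urllib.parse import urljoin
from lxml import etree

# Hypothetical HTML fragment shaped like the ranking list
sample = """
<div class="block bd">
  <ul class="tli">
    <li><a href="/book/1/">Novel A</a></li>
    <li><a href="/book/2/">Novel B</a></li>
  </ul>
</div>
"""

html = etree.HTML(sample)
books = []
for a in html.xpath("//div[@class='block bd'][1]/ul[@class='tli']/li/a"):
    title = a.xpath("./text()")[0].strip()
    # urljoin handles both relative and absolute hrefs
    link = urljoin('http://www.biqiuge.com', a.xpath("./@href")[0])
    books.append((title, link))

print(books)
```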

------------------------------------------------------

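Check whether a path exists and create it automatically

import os

# title, title1, title2 and contents come from the scraping loop (not shown here)
path = os.path.join(title, title1)
os.makedirs(path, exist_ok=True)  # creates title/ and title/title1/ if they are missing
with open(os.path.join(path, title2 + '.txt'), 'w', encoding='utf-8') as f:
    for con in contents:
        f.write(con.strip() + '\n')
print(title + '---' + title1 + '---' + title2 + '---written successfully')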
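The same idea can be wrapped in a reusable helper. This is a self-contained sketch: the function name `save_lines` and the demo directory and file names are made up for illustration, and `os.makedirs(..., exist_ok=True)` is used so that no explicit existence check is needed.

```python
import os
import tempfile

def save_lines(root, folder, name, lines):
    """Write lines to root/folder/name.txt, creating directories as needed."""
    directory = os.path.join(root, folder)
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    path = os.path.join(directory, name + '.txt')
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line.strip() + '\n')
    return path

# Usage with a throwaway directory and made-up names
demo_root = tempfile.mkdtemp()
saved = save_lines(demo_root, 'novel', 'chapter-1', ['  first line ', 'second line'])
print(saved)
```

Building the path with `os.path.join` rather than a hard-coded backslash keeps the code working on Linux and macOS as well as Windows.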

2. Cui Qingcai's blog (scraping with forged request headers)

from lxml import etree
import requests

# Some sites use anti-scraping checks; sending browser-like request headers helps get past them
headers = {
    'Referer': 'https://cuiqingcai.com/category/technique/python',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
}
n = 0
with open('cuijincai.txt', 'w', encoding='utf-8') as f:
    for i in range(1, 10):
        # The site loads its article list dynamically; this paginated URL can be
        # found under XHR in the browser's F12 / Network panel
        url = 'https://cuiqingcai.com/category/technique/python/page/' + str(i)
        response = requests.get(url=url, headers=headers)
        html = etree.HTML(response.text)
        all_div = html.xpath("//article[@class='excerpt']")
        for div in all_div:
            title = div.xpath("./header/h2/a/text()")[0]  # title under the current node
            author = div.xpath("./p[@class='auth-span']/span[@class='muted'][1]/a/text()")[0]
            time = div.xpath("./p[@class='auth-span']/span[@class='muted'][2]/text()")[0]  # publication time
            liulanshu = div.xpath("./p[@class='auth-span']/span[@class='muted'][3]/text()")[0]  # view count
            pinlun = div.xpath("./p[@class='auth-span']/span[@class='muted'][4]/a/text()")[0]  # comment count
            like = div.xpath("./p[@class='auth-span']/span[@class='muted'][5]/a[@id='Addlike']/span[@class='count']/text()")[0] + ' likes'
            n += 1
            f.write("No. {}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(n, title, author, time, liulanshu, pinlun, like))

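The User Agent (UA for short) is a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plugins, and so on.

The HTTP Referer is part of the request header. When a browser sends a request to a web server, it usually includes the Referer to tell the server which page the request was linked from, and the server can use this information when handling the request.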
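One reason a plain `requests.get` is easy for a site to spot is that requests identifies itself in the User-Agent header by default. The sketch below prints that default value and then shows, without sending anything, that headers set on a session replace it. The shortened browser UA string is a hypothetical value used only for illustration.

```python
import requests

# The User-Agent that requests sends when none is set
print(requests.utils.default_user_agent())

s = requests.Session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # shortened, hypothetical UA
    'Referer': 'https://cuiqingcai.com/category/technique/python',
})

# Prepare a request through the session to see the headers it would carry
prepared = s.prepare_request(requests.Request('GET', 'https://cuiqingcai.com/'))
print(prepared.headers['User-Agent'])
print(prepared.headers['Referer'])
```

Setting the headers once on a session avoids repeating the `headers=` argument on every call.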

https://www.liaoxuefeng.com also has anti-scraping measures; it can be scraped by setting the request headers as shown above.
