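What is requests in web scraping

import requests

response = requests.get('http://www.baidu.com')
print(response.text)         # response body decoded as text
print(response.headers)      # response headers
print(response.status_code)  # HTTP status code
print(response.encoding)     # encoding used to decode the body
print(response.cookies)      # cookies set by the server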
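To make the relationship between the body bytes and `response.text` more concrete: the text is essentially the raw bytes decoded with `response.encoding`. The sketch below imitates that on plain bytes so it runs without a network connection. `decode_body` is a hypothetical helper written for this illustration, not part of requests.

```python
from typing import Optional


def decode_body(content: bytes, encoding: Optional[str] = None) -> str:
    """Decode raw body bytes, falling back to UTF-8 with replacement characters."""
    try:
        return content.decode(encoding or "utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: degrade instead of crashing.
        return content.decode("utf-8", errors="replace")


raw = "百度一下".encode("gbk")          # bytes as a GBK-encoded page would send them
print(ascii(decode_body(raw, "gbk")))   # decoded with the right encoding
print(ascii(decode_body(raw, "utf-8"))) # wrong encoding: replacement characters appear
```

Decoding with the wrong encoding is the usual cause of garbled Chinese text, which is why the later examples set `response.encoding` before reading `response.text`.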

GET request with parameters

import requests

data = {
    'name': 'abc',
    # ...
}
response = requests.get(url='http://www.baidu.com', params=data)
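The `params` dictionary ends up URL-encoded in the query string of the request. As a rough standard-library sketch of that encoding step (`build_url` is a hypothetical helper, and the `page` parameter is made up for the example; requests does this internally and may normalise the URL slightly differently):

```python
from urllib.parse import urlencode


def build_url(base: str, params: dict) -> str:
    """Append URL-encoded query parameters to a base URL."""
    query = urlencode(params)
    if not query:
        return base
    # Reuse an existing query string if the base URL already has one.
    separator = "&" if "?" in base else "?"
    return base + separator + query


print(build_url("http://httpbin.org/get", {"name": "abc", "page": 2}))
```

With requests itself, `response.url` shows the final URL that was actually requested.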

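Parsing JSON

import requests

# response.json() only works when the body is JSON, so this requests an httpbin endpoint that returns JSON
response = requests.get(url='http://httpbin.org/get')
print(response.json())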
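`response.json()` raises an exception when the body is not valid JSON (for example an HTML error page), and the exact exception class depends on the requests version. A minimal sketch of defensive parsing on the raw text, using only the standard library; `parse_json_body` is a hypothetical helper name:

```python
import json
from typing import Any, Optional


def parse_json_body(text: str) -> Optional[Any]:
    """Return the parsed JSON value, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_json_body('{"args": {"name": "abc"}}'))
print(parse_json_body("<html>not json</html>"))
```

In a scraper this would be called as `parse_json_body(response.text)`.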

Getting binary data

import requests

response = requests.get(url='http://www.baidu.com')
print(response.content)  # raw response body as bytes
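Binary content is mostly useful for saving images, audio or other files. A small sketch of writing the bytes to disk; `save_bytes` is a hypothetical helper and any file name passed to it is just an example:

```python
from pathlib import Path


def save_bytes(content: bytes, path: str) -> int:
    """Write raw bytes to `path` and return the number of bytes written."""
    # Binary mode matters: text mode would try to encode or decode the data.
    return Path(path).write_bytes(content)
```

Typical use would be `save_bytes(response.content, 'logo.png')` after downloading an image URL.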

Advanced operations

File upload

import requests

# 'XXX' is a placeholder for the path of the file to upload
with open('XXX', 'rb') as fp:
    files = {'files': fp}
    response = requests.post(url='http://www.baidu.com', files=files)
print(response.content)

Session persistence (simulated login)

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')  # the server sets a cookie
response = s.get('http://httpbin.org/cookies')            # the same session sends it back
print(response.text)

Output:

{
  "cookies": {
    "number": "123456789"
  }
}
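What the session does behind the scenes is remember the cookies the server sends in `Set-Cookie` headers and attach them to later requests. A simplified standard-library sketch of that bookkeeping (the header value is an assumed example, and `cookies_from_header` is a hypothetical helper; a real session also tracks domain, path and expiry):

```python
from http.cookies import SimpleCookie


def cookies_from_header(set_cookie: str) -> dict:
    """Parse one Set-Cookie header value into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(set_cookie)
    return {name: morsel.value for name, morsel in jar.items()}


stored = cookies_from_header("number=123456789; Path=/")
# The Cookie header a client would send back on the next request.
cookie_header = "; ".join(f"{name}={value}" for name, value in stored.items())
print(stored)
print(cookie_header)
```

Without a `Session`, each `requests.get` call starts with an empty cookie store, which is why a login done in one call is not visible to the next.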

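Certificate verification

import requests
import urllib3

url = 'https://www.biqudu.com/43_43821/2520338.html'
urllib3.disable_warnings()  # with verification off, requests emits a warning; silence it
response = requests.get(url=url, verify=False)  # skip certificate verification
print(response.text)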

Proxy authentication

import requests

url = 'https://www.biqudu.com/43_43821/2520338.html'
proxies = {
    'http': 'http://127.0.0.2',
    'https': 'http://user:pwd@127.0.0.2',  # proxy that requires a username and password
}
response = requests.get(url=url, proxies=proxies)
print(response.text)
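Because the proxy credentials are embedded in a URL, a password containing characters such as `@` or `:` has to be percent-encoded or the URL will be split in the wrong place. A small sketch of building the proxy URL safely; `proxy_url` and the sample password are hypothetical:

```python
from urllib.parse import quote


def proxy_url(host: str, user: str = "", password: str = "", scheme: str = "http") -> str:
    """Build a proxy URL, percent-encoding the credentials if present."""
    if user:
        credentials = quote(user, safe="") + ":" + quote(password, safe="") + "@"
    else:
        credentials = ""
    return f"{scheme}://{credentials}{host}"


proxies = {
    "http": proxy_url("127.0.0.2"),
    "https": proxy_url("127.0.0.2", "user", "p@ss:word"),
}
print(proxies)
```

The resulting dictionary can be passed as `proxies=proxies` exactly as in the example above.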


Handling request timeouts

import requests
from requests.exceptions import Timeout  # base class of both connect and read timeouts

url = 'https://www.taobao.com'
try:
    response = requests.get(url=url, timeout=0.1)  # give up after 0.1 seconds
    print(response.status_code)
except Timeout:
    print('Request timed out')
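In a real scraper a timeout is usually followed by a retry rather than just a message. The sketch below is a generic retry helper, demonstrated with a fake function so it runs offline; `retry` and `flaky` are hypothetical names, and the built-in `TimeoutError` stands in for the requests exception.

```python
import time


def retry(func, attempts=3, delay=0.0, exceptions=(TimeoutError,)):
    """Call func() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer after each failure


calls = []


def flaky():
    """Fail twice, then succeed, to simulate a slow server."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


print(retry(flaky, attempts=3), "after", len(calls), "calls")
```

With requests this would look like `retry(lambda: requests.get(url, timeout=3), exceptions=(requests.exceptions.Timeout,))`.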

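Authentication settings

# Some sites ask for a username and password as soon as the page is opened (HTTP Basic auth)
import requests
from requests.auth import HTTPBasicAuth

url = 'https://www.taobao.com'
# auth=('user', 'pwd') is shorthand for the same thing
response = requests.get(url=url, auth=HTTPBasicAuth('user', 'pwd'))
print(response.status_code)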
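HTTP Basic authentication itself is simple: the client sends an `Authorization` header containing `user:password` encoded as Base64. A sketch of how that header value is formed, assuming UTF-8 for the credentials (`basic_auth_header` is a hypothetical helper; requests builds this header for you):

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


print(basic_auth_header("user", "pwd"))
```

Base64 is an encoding, not encryption, so Basic auth should only be used over HTTPS.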

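1. Biquge novels (beginner-level text scraping)

Scraping Biquge novels: the overall ranking on the leaderboard page

1. Request the initial URL and get the page source
2. Parse the page source to extract the text content
3. Write all chapter titles of each novel to a txt file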

from lxml import etree
import requests

base_url = 'http://www.biqiuge.com'
response = requests.get(base_url + '/paihangbang')
response.encoding = response.apparent_encoding  # guess the encoding from the body
html = etree.HTML(response.text)
# Links to every novel in the first ranking block
novels = html.xpath("//div[@class='block bd'][1]/ul[@class='tli']/li/a")
for novel in novels:
    title = novel.xpath("./text()")[0]
    novel_url = base_url + novel.xpath("./@href")[0]
    novel_response = requests.get(url=novel_url)
    novel_response.encoding = novel_response.apparent_encoding
    novel_html = etree.HTML(novel_response.text)
    # Skip the first 6 links (likely a "latest chapters" preview that repeats later entries)
    chapters = novel_html.xpath("//div[@class='listmain']/dl/dd/a/text()")[6:]
    with open(title + '.txt', 'w', encoding='utf-8') as f:
        for chapter in chapters:
            f.write(chapter.strip() + '\n')
    print(title + '------written successfully')
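One practical caveat with using the scraped title as a file name: titles can contain characters such as `/`, `:` or `?` that are not allowed in file names on some systems. A small sketch of cleaning the title first; `safe_filename` is a hypothetical helper and the character set is an assumption based on common Windows and Unix restrictions:

```python
import re


def safe_filename(title: str, replacement: str = "_") -> str:
    """Replace characters that are commonly invalid in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip(" .")
    # Never return an empty name, which open() could not use.
    return cleaned or "untitled"


print(safe_filename("a/b:c?"))
```

In the loop above this would be used as `open(safe_filename(title) + '.txt', ...)`.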

------------------------------------------------------

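Check whether the path exists and create it automatically

import os

# title, title1, title2 and contents are assumed to come from a scraping loop like the one above
if not os.path.exists(title):
    os.mkdir(title)
path = os.path.join(title, title1)
if not os.path.exists(path):
    os.mkdir(path)
# os.path.join picks the right separator on every OS, unlike a hard-coded '\\'
with open(os.path.join(path, title2 + '.txt'), 'w', encoding='utf-8') as f:
    for con in contents:
        f.write(con.strip() + '\n')
print(title + '---' + title1 + '---' + title2 + '---written successfully')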
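The same check-then-create logic can be written more compactly with `pathlib`, which creates all missing parent folders in one call. This is an alternative sketch, not the original author's code; `write_chapter` and its parameter names are hypothetical:

```python
from pathlib import Path


def write_chapter(root, novel, volume, chapter, lines):
    """Create root/novel/volume if needed and write the chapter lines to a txt file."""
    folder = Path(root) / novel / volume
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = folder / (chapter + ".txt")
    target.write_text("".join(line.strip() + "\n" for line in lines), encoding="utf-8")
    return target
```

`exist_ok=True` also avoids the small race between checking for a folder and creating it.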

2. Cui Qingcai's blog (scraping with spoofed headers)

from lxml import etree
import requests

n = 0
# Some sites have anti-scraping checks; browser-like request headers get past the simple ones
headers = {
    'Referer': 'https://cuiqingcai.com/category/technique/python',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
}
with open('cuijincai.txt', 'w', encoding='utf-8') as f:
    for i in range(1, 10):
        # The site loads pages dynamically; this URL can be found under XHR in the F12 Network panel
        url = 'https://cuiqingcai.com/category/technique/python/page/' + str(i)
        response = requests.get(url=url, headers=headers)
        html = etree.HTML(response.text)
        all_div = html.xpath("//article[@class='excerpt']")
        for div in all_div:
            title = div.xpath("./header/h2/a/text()")[0]  # title, relative to the current article node
            author = div.xpath("./p[@class='auth-span']/span[@class='muted'][1]/a/text()")[0]
            time = div.xpath("./p[@class='auth-span']/span[@class='muted'][2]/text()")[0]
            liulanshu = div.xpath("./p[@class='auth-span']/span[@class='muted'][3]/text()")[0]  # view count
            pinlun = div.xpath("./p[@class='auth-span']/span[@class='muted'][4]/a/text()")[0]  # comment count
            like = div.xpath("./p[@class='auth-span']/span[@class='muted'][5]/a[@id='Addlike']/span[@class='count']/text()")[0] + ' likes'
            n += 1
            f.write("Item {}\t{}\t{}\t{}\t{}\t{}\t{}\n".format(n, title, author, time, liulanshu, pinlun, like))

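User-Agent (UA for short) is a special request header string that lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, browser plugins, and so on.

The HTTP Referer is also part of the request headers. When a browser sends a request to a web server it usually includes Referer, which tells the server which page the request was linked from; the server can use this information when handling the request.

https://www.liaoxuefeng.com also has anti-scraping checks and can be scraped by setting the request headers as shown above.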
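Since both headers are just strings in a dictionary, they are easy to generate per request. The sketch below builds the headers from a pool of User-Agent strings; `build_headers` and `USER_AGENTS` are hypothetical names, and the pool here only contains the UA string used earlier in this post (more would need to be copied from real browsers).

```python
import random

# Pool of User-Agent strings to pick from; extend it with values from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
]


def build_headers(referer, user_agents=USER_AGENTS):
    """Return request headers with a randomly chosen User-Agent and the given Referer."""
    return {
        "User-Agent": random.choice(user_agents),
        "Referer": referer,
    }


print(build_headers("https://cuiqingcai.com/category/technique/python"))
```

The result is passed to requests as `requests.get(url, headers=build_headers(referer))`.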
