25-1 Introduction to the requests module

The requests module

- This tutorial covers the requests module through the following points

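What is the requests module
- requests is a third-party Python library for sending HTTP requests (it is not part of the standard library and has to be installed). Its main job is to simulate a browser sending requests. It is powerful, its API is concise and efficient, and it is one of the most widely used libraries in web scraping.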
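Why use the requests module
- Because the urllib module is inconvenient in several ways:
  - URL encoding has to be handled manually
  - POST request parameters have to be handled manually
  - Handling cookies and proxies is tedious
  - ......
- With the requests module:
  - URL encoding is handled automatically
  - POST request parameters are handled automatically
  - Cookie and proxy handling is simplified
  - ......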
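
As a rough illustration of the "automatic" points above, the sketch below builds the same query string twice: once by hand with `urllib.parse.urlencode`, and once by passing a dict to requests. It relies on `requests.Request(...).prepare()`, which builds a request without sending it, so nothing goes over the network. The search word and the form fields are made-up values chosen only for illustration.

```python
from urllib.parse import urlencode

import requests

base_url = 'https://www.sogou.com/web'
params = {'query': '爬虫', 'ie': 'utf-8'}  # illustrative values only

# urllib style: encode the query string yourself and append it to the URL
manual_url = base_url + '?' + urlencode(params)

# requests style: pass the dict and let the library do the encoding
prepared_get = requests.Request('GET', base_url, params=params).prepare()

# POST parameters passed via data= are form-encoded into the request body
form = {'keyword': '北京', 'pageIndex': '1'}  # hypothetical form fields
prepared_post = requests.Request('POST', base_url, data=form).prepare()

print(manual_url)
print(prepared_get.url)
print(prepared_post.body)
print(prepared_post.headers['Content-Type'])
```

Both URLs should come out identical, which is the point: with requests the encoding step simply disappears from your own code.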
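How to use the requests module
- Installation:
  - pip install requests
- Workflow
  - Specify the URL
  - Send the request with the requests module
  - Read the data from the response object
  - Persist the data to storage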
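
The four workflow steps can be sketched as one small function. This is a minimal sketch rather than the tutorial's own code: the function name `fetch_and_save` is hypothetical, and the `timeout` argument and the `raise_for_status()` check are additions that the original examples do not use.

```python
import requests


def fetch_and_save(url, file_path, headers=None):
    """Run the four steps: URL, request, response data, persistent storage."""
    # Steps 1 and 2: send a GET request to the given URL
    response = requests.get(url, headers=headers, timeout=10)
    # Stop on 4xx/5xx responses instead of saving an error page (an addition)
    response.raise_for_status()
    # Step 3: read the decoded body text from the response object
    page_text = response.text
    # Step 4: write the text to disk
    with open(file_path, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    return len(page_text)
```

A call such as `fetch_and_save('https://www.sogou.com/', './sogou_home.html')` would then save the page and return the number of characters written.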
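Learn and consolidate the module through 5 scraping projects based on requests
- GET request with requests
  - Task: scrape the Sogou results page for a given search term
- POST request with requests
  - Task: log in to Douban Movies and scrape the page returned after a successful login
- Ajax GET request with requests
  - Task: scrape movie detail data from the Douban Movies category ranking at https://movie.douban.com/
- Ajax POST request with requests
  - Task: scrape restaurant data for a given location from the KFC restaurant finder at http://www.kfc.com.cn/kfccda/index.aspx
- Comprehensive exercise
  - Task: scrape the Sogou Zhihu search results for a given term and page range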

- Code examples

Task: scrape the Sogou results page for a given search term

import requests

# Search keyword
word = input('enter a word you want to search:')

# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

# Target URL
url = 'https://www.sogou.com/web'

# GET request parameters
params = {
    'query': word,
    'ie': 'utf-8',
}

# Send the request, passing both the parameters and the custom headers
response = requests.get(url=url, params=params, headers=headers)

# Read the response data
page_text = response.text

# Persist the page to disk
with open('./sougou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

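Task: log in to Douban Movies and scrape the page returned after a successful login

import requests

url = 'https://accounts.douban.com/login'

# POST request parameters (replace the placeholders with your own credentials)
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "your_account",
    "form_password": "your_password",
    "login": "登录",
}

# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

# Send the POST request; the form data goes in the request body
response = requests.post(url=url, data=data, headers=headers)

# Read the page returned after login
page_text = response.text

# Persist the page to disk
with open('./douban111.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)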
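
The login example sends a single POST, so any cookies set by the login response are not reused afterwards. If further pages need the logged-in state, one option is a `requests.Session`, which stores cookies and sends them on later requests. This is a swapped-in technique, not what the tutorial code does; the helper name `login_and_fetch` and its arguments are hypothetical.

```python
import requests


def login_and_fetch(session, login_url, form_data, target_url, headers=None):
    """Log in with a POST, then fetch a page through the same session."""
    # Cookies set by the login response are stored on the session object
    login_response = session.post(login_url, data=form_data,
                                  headers=headers, timeout=10)
    login_response.raise_for_status()
    # The follow-up GET sends those cookies back automatically
    page_response = session.get(target_url, headers=headers, timeout=10)
    page_response.raise_for_status()
    return page_response.text
```

It would typically be called inside `with requests.Session() as session:` so that the session's connections are closed when the block ends.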

Task: scrape movie detail data from the Douban Movies category ranking at https://movie.douban.com/

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests

if __name__ == "__main__":
    # URL of the ajax GET request (found with a packet-capture tool)
    url = 'https://movie.douban.com/j/chart/top_list'

    # Custom request headers; they must be wrapped in a dict
    headers = {
        # Set the User-Agent; other header fields can be customized the same way
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

    # Parameters carried by the GET request; fill in the empty values
    # from the request shown in the packet-capture tool
    param = {
        'type': '',
        'interval_id': '100:90',
        'action': '',
        'start': '',
        'limit': '',
    }

    # Send the GET request and get the response object
    response = requests.get(url=url, headers=headers, params=param)

    # Read the response content, which is a JSON string
    print(response.text)
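
Since the ajax response is a JSON string, it can be parsed into Python objects with `response.json()` instead of being printed as raw text. The sketch below assumes the endpoint returns valid JSON; the helper names `fetch_json` and `save_json` are hypothetical, and the `timeout` and `raise_for_status()` calls are additions not found in the original.

```python
import json

import requests


def fetch_json(url, params=None, headers=None):
    """Send a GET request and return the parsed JSON body."""
    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    # response.json() parses the JSON body into Python lists and dicts
    return response.json()


def save_json(data, file_path):
    """Persist parsed JSON data to a UTF-8 file."""
    # ensure_ascii=False keeps Chinese text readable in the output file
    with open(file_path, 'w', encoding='utf-8') as fp:
        json.dump(data, fp, ensure_ascii=False, indent=2)
```

With the `url`, `param` and `headers` from the example above, `save_json(fetch_json(url, param, headers), './douban.json')` would store the ranking data as a file.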

Task: scrape restaurant data for a given location from the KFC restaurant finder at http://www.kfc.com.cn/kfccda/index.aspx

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests

if __name__ == "__main__":
    # URL of the ajax POST request (found with a packet-capture tool)
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

    # Custom request headers; they must be wrapped in a dict
    headers = {
        # Set the User-Agent; other header fields can be customized the same way
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

    # Parameters carried by the POST request; fill in the empty values
    # from the request shown in the packet-capture tool
    data = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': '',
        'pageSize': '',
    }

    # Send the POST request and get the response object
    response = requests.post(url=url, headers=headers, data=data)

    # Read the response content, which is a JSON string
    print(response.text)

Task: scrape the Sogou Zhihu search results for a given term and page range

import requests
import os

# Search keyword
word = input('enter a word you want to search:')

# First and last page numbers
start_page = int(input('enter start page num:'))
end_page = int(input('enter end page num:'))

# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

# Target URL
url = 'https://zhihu.sogou.com/zhihu'

# Create the output folder if it does not exist yet
if not os.path.exists('./sougou'):
    os.mkdir('./sougou')

for page in range(start_page, end_page + 1):
    # GET request parameters for this page
    params = {
        'query': word,
        'ie': 'utf-8',
        'page': str(page),
    }

    # Send the GET request and get the response object
    response = requests.get(url=url, params=params, headers=headers)

    # Read the page data
    page_text = response.text

    # Save each page to its own file, named after the keyword and page number
    fileName = word + '_' + str(page) + '.html'
    filePath = './sougou/' + fileName
    with open(filePath, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Finished scraping page ' + str(page))
