The requests module:

1. Installation: pip install requests

2. Sending a GET request with requests:

import requests

paras = {

    'k1':'c1',

    'k2':'c2'

}

ret = requests.get('https://www.cnblogs.com/qiangayz/p/9563377.html')

print(ret.url)

ret = requests.get('https://www.cnblogs.com/qiangayz/p/9563377.html', params=paras)

print(ret.url)
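The effect of params can be checked without touching the network. The sketch below assumes only that requests is installed; it uses requests.Request(...).prepare(), which builds the request but does not send it, and reuses the URL and values from the example above.

```python
import requests

# Same values as in the example above.
paras = {'k1': 'c1', 'k2': 'c2'}
url = 'https://www.cnblogs.com/qiangayz/p/9563377.html'

# prepare() builds the request without sending it,
# so the final URL can be inspected offline.
prepared = requests.Request('GET', url, params=paras).prepare()
print(prepared.url)
```

The printed URL carries the dictionary as a query string, which is what requests.get(url, params=paras) would request.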

3. Sending a POST request with requests:

import requests

import json

paras = {

    'k1':'v1',

    'k2':'v2',

}

requests.post('https://www.cnblogs.com/qiangayz/p/9563377.html')

headers_data = {

    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

}

requests.post('https://www.cnblogs.com/qiangayz/p/9563377.html',

              headers=headers_data,

              data=json.dumps(paras),

              )
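How the body is passed decides which Content-Type header requests sets on its own. The following is a minimal offline sketch (nothing is sent); the helper name content_type is made up for illustration.

```python
import json
import requests

url = 'https://www.cnblogs.com/qiangayz/p/9563377.html'
paras = {'k1': 'v1', 'k2': 'v2'}

def content_type(**kwargs):
    """Prepare a POST without sending it and return its Content-Type (or None)."""
    prepared = requests.Request('POST', url, **kwargs).prepare()
    return prepared.headers.get('Content-Type')

form_type = content_type(data=paras)             # dict: form-encoded body
raw_type = content_type(data=json.dumps(paras))  # str: sent as-is, no Content-Type added
json_type = content_type(json=paras)             # json=: serialized by requests

print(form_type, raw_type, json_type)
```

When a pre-serialized string is passed through data, as in the example above, the Content-Type has to be supplied in headers if the server needs one; passing the dictionary through json= lets requests set it.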

4. Other optional parameters of requests.request (docstring and body from the requests source):

"""Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.

    :param url: URL for the new :class:`Request` object.

    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.

    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.

    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.

    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.

    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.

    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.

        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``

        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string

        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers

        to add for the file.

    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.

    :param timeout: (optional) How many seconds to wait for the server to send data

        before giving up, as a float, or a :ref:`(connect timeout, read

        timeout) <timeouts>` tuple.

    :type timeout: float or tuple

    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.

    :type allow_redirects: bool

    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.

    :param verify: (optional) Either a boolean, in which case it controls whether we verify

            the server's TLS certificate, or a string, in which case it must be a path

            to a CA bundle to use. Defaults to ``True``.

    :param stream: (optional) if ``False``, the response content will be immediately downloaded.

    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    Usage::

      >>> import requests

      >>> req = requests.request('GET', 'http://httpbin.org/get')

      >>> req

      <Response [200]>

    """

    # By using the 'with' statement we are sure the session is closed, thus we

    # avoid leaving sockets open which can trigger a ResourceWarning in some

    # cases, and look like a memory leak in others.

    with sessions.Session() as session:

        return session.request(method=method, url=url, **kwargs)

Parameter notes:

-- method: the HTTP method, e.g. post or get

-- url: the URL to request

-- params: parameters passed in the URL query string

    Example:

paras = {

    'k1':'c1',

    'k2':'c2'

}

requests.get('https://www.cnblogs.com/qiangayz/p/9563377.html', params=paras)

The resulting URL is:

https://www.cnblogs.com/qiangayz/p/9563377.html?k1=c1&k2=c2

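-- data: data sent in the request body. When a dict is passed it is form-encoded and the request header is:

    headers={'Content-Type': 'application/x-www-form-urlencoded'}

-- json: also sends data in the request body. Unlike data, the object is serialized as JSON, so nested dicts are preserved, and the request header is different:

    headers={'Content-Type': 'application/json'}

-- headers: request headers

-- cookies: cookies to send with the request

-- files: files to upload: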

requests.post(

    url='http://127.0.0.1',  # requests needs a scheme in the URL

    files={

        'f1': open('xxx.txt', 'rb'),  # uses the default file name

        'f2': ('XXX.txt', open('xxx.txt', 'rb'))  # uses a custom file name

    }

)

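-- auth: simple HTTP authentication. For Basic auth the user name and password are Base64-encoded (not encrypted) and sent in the Authorization request header

-- timeout: how long to wait, in seconds, before giving up

-- allow_redirects: whether to follow redirects

-- proxies: proxies to use

-- stream: stream the response body instead of downloading it immediately

-- verify: whether to verify the server's TLS certificate on HTTPS requests. The value can be True or False (or a path to a CA bundle); False skips certificate verification

-- cert: client-side certificate file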
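Two of the parameters above, auth and files, can be inspected offline by preparing a request instead of sending it. This is a sketch under assumptions: the credentials are placeholders, and an in-memory io.BytesIO object stands in for open('xxx.txt', 'rb').

```python
import base64
import io
import requests

# Basic auth: the (user, password) tuple becomes an Authorization header.
auth_req = requests.Request(
    'GET', 'http://127.0.0.1/', auth=('user', 'secret')
).prepare()
auth_header = auth_req.headers['Authorization']
# The credentials are only Base64-encoded, so they decode straight back.
decoded = base64.b64decode(auth_header.split(' ', 1)[1]).decode()

# File upload: the body is multipart/form-data with the custom file name.
upload_req = requests.Request(
    'POST', 'http://127.0.0.1/',
    files={'f2': ('XXX.txt', io.BytesIO(b'hello'))},
).prepare()
upload_type = upload_req.headers['Content-Type']

print(auth_header)
print(decoded)
print(upload_type.split(';')[0])
```

Because Basic auth is reversible encoding, it should only be relied on over HTTPS.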

5. Using requests.Session:

A Session keeps client-side state, such as cookies, across requests.

import requests

session = requests.Session()

# First request the login page to obtain cookies

l1 = session.get(url='xxxxx')

# Log in; the session automatically sends the cookies from the previous response

l2 = session.post(

    url='xxx',

    data='',

)
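The cookie carry-over can be observed without a real login. In this sketch the cookie name and value, and the example.com URL, are invented; the cookie is set by hand where a login response would normally set it.

```python
import requests

session = requests.Session()

# Stand-in for a cookie that a login response would normally set.
session.cookies.set('sessionid', 'abc123')

# prepare_request merges the session's cookies into the request
# without sending anything.
prepared = session.prepare_request(
    requests.Request('GET', 'http://example.com/profile')
)
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)

session.close()
```

Every request made through the same session object goes through this merge, which is why the second call in the example above does not need to pass cookies explicitly.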

The BeautifulSoup module:

1. Installation:

pip install beautifulsoup4

2. Basic usage

from bs4 import BeautifulSoup

import htmldoc  # local module that provides html_doc, the HTML string to parse

html_doc = htmldoc.html_doc

soup = BeautifulSoup(html_doc,features='html.parser')

# Find the first <a> tag

tag1 = soup.find(name='a')

print(tag1.name,tag1.attrs)

# Find all <a> tags

tag2 = soup.find_all(name='a')

print(tag2)

# Find the tag whose id is inputuser

tag3 = soup.select('#inputuser')[0]

print(tag3.name,tag3.attrs)
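Since htmldoc is a local module that is not shown, here is a self-contained variant of the same three lookups. The HTML string is a made-up stand-in for htmldoc.html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for htmldoc.html_doc.
html_doc = """
<html><body>
<form>
  <input id="inputuser" type="text" name="user">
  <a href="/login" class="btn">Log in</a>
  <a href="/register" class="btn">Register</a>
</form>
</body></html>
"""

soup = BeautifulSoup(html_doc, features='html.parser')

first_link = soup.find(name='a')            # first <a> only
all_links = soup.find_all(name='a')         # every <a>, as a list-like result
user_input = soup.select_one('#inputuser')  # CSS selector, single result

print(first_link.name, first_link.attrs)
print(len(all_links))
print(user_input.attrs)
```

Note that class is a multi-valued attribute, so it comes back as a list in attrs.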

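3. Tag attributes and methods:

1. tag1.name gets the tag name

    tag1.name = 'span' renames the tag

2. tag1.attrs gets the tag's attributes as a dict

    tag1.attrs = dict1 replaces all attributes

    tag1.attrs['id'] = 'a123' sets one attribute

    del tag1.attrs['id'] deletes an attribute

3. tag.children iterates over the direct children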

from bs4 import BeautifulSoup

import htmldoc

from bs4.element import Tag

html_doc = htmldoc.html_doc

soup = BeautifulSoup(html_doc,features='html.parser')

tags = soup.find(name='body').children

tags_list = []

for item in tags:

    if isinstance(item, Tag):  # skip text nodes such as whitespace

        tags_list.append(item)

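4. tag.descendants iterates over all descendants

tags = soup.find(name='body').descendants  # all descendants, depth-first: the first child's subtree is finished before the second child starts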

print(len(list(tags)))
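The difference between children and descendants is easiest to see on a tiny document. The HTML below is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = '<body><div><p>one</p><p>two</p></div><span>three</span></body>'
soup = BeautifulSoup(html_doc, features='html.parser')
body = soup.find(name='body')

# Direct children only; text nodes are filtered out with isinstance.
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# Every nested tag, visited depth-first.
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```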

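5. tag.clear() removes everything inside the tag (the tag itself is kept)

6. tag.decompose() removes the tag and everything inside it from the tree and destroys it (the tag itself is not kept)

7. tag.extract() removes the tag and everything inside it from the tree and returns it, similar to list.pop()

8. tag.decode() converts to a string (including the tag itself); tag.decode_contents() excludes the tag itself

9. tag.encode() converts to bytes (including the tag itself); tag.encode_contents() excludes the tag itself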
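A short sketch of items 5 to 8, using three made-up div elements so that each method acts on its own target:

```python
from bs4 import BeautifulSoup

html_doc = (
    '<div id="a"><b>x</b></div>'
    '<div id="b"><b>y</b></div>'
    '<div id="c"><b>z</b></div>'
)
soup = BeautifulSoup(html_doc, features='html.parser')

soup.find(id='a').clear()              # the div stays, its contents are gone
soup.find(id='b').decompose()          # the whole div is removed and destroyed
removed = soup.find(id='c').extract()  # removed from the tree but still usable

remaining = str(soup)
with_tag = removed.decode()               # includes the <div> itself
without_tag = removed.decode_contents()   # only what is inside the <div>

print(remaining)
print(with_tag)
print(without_tag)
```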

10. find: returns the first matching tag

# tag = soup.find('a')

# print(tag)

# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')

# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')

# print(tag)

11. find_all: returns all matching tags

# tags = soup.find_all('a')

# print(tags)

# tags = soup.find_all('a',limit=1)

# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')

# # tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')

# print(tags)

# ####### lists #######

# v = soup.find_all(name=['a','div'])

# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])

# print(v)

# v = soup.find_all(text=['Tillie'])

# print(v, type(v[0]))

# v = soup.find_all(id=['link1','link2'])

# print(v)

# v = soup.find_all(href=['link1','link2'])

# print(v)

# ####### regular expressions #######

import re

# rep = re.compile('p')

# rep = re.compile('^p')

# v = soup.find_all(name=rep)

# print(v)

# rep = re.compile('sister.*')

# v = soup.find_all(class_=rep)

# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')

# v = soup.find_all(href=rep)

# print(v)

# ####### filter function #######

# def func(tag):

#     return tag.has_attr('class') and tag.has_attr('id')

# v = soup.find_all(name=func)

# print(v)

# ## get: gets an attribute of a tag

# tag = soup.find('a')

# v = tag.get('id')

# print(v)
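The commented-out variants above can be run together on a small document. The HTML here is an assumption, modelled on the "three sisters" sample that the selectors in this post appear to target.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, features='html.parser')

def has_class_and_id(tag):
    """Filter function: keep tags that carry both attributes."""
    return tag.has_attr('class') and tag.has_attr('id')

by_list = soup.find_all(name=['a', 'b'])                # any of several tag names
by_regex = soup.find_all(href=re.compile(r'/(el|la)'))  # attribute matched by regex
by_func = soup.find_all(has_class_and_id)               # arbitrary predicate
limited = soup.find_all('a', limit=1)                   # stop after one match
first_id = soup.find('a').get('id')                     # read one attribute

print([t.name for t in by_list])
print([t['id'] for t in by_regex])
print(len(by_func), len(limited), first_id)
```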

12. has_attr: checks whether the tag has the given attribute

# tag = soup.find('a')

# v = tag.has_attr('id')

# print(v)

13. get_text: gets the text inside the tag

# tag = soup.find('a')

# v = tag.get_text()

# print(v)

14. index: returns the position of a child within its parent tag

# tag = soup.find('body')

# v = tag.index(tag.find('div'))

# print(v)

# tag = soup.find('body')

# for i,v in enumerate(tag):

#     print(i,v)

15. is_empty_element: whether the tag is an empty (void) or self-closing element,

i.e. whether it is one of tags such as: 'br' , 'hr', 'input', 'img', 'meta','spacer', 'link', 'frame', 'base'

# tag = soup.find('br')

# v = tag.is_empty_element

# print(v)
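Items 12 to 15 in one runnable sketch; the HTML string is made up for the purpose.

```python
from bs4 import BeautifulSoup

html_doc = '<body><div id="d"><a id="x">hi <b>there</b></a></div><br></body>'
soup = BeautifulSoup(html_doc, features='html.parser')

body = soup.find('body')
div = soup.find('div')
link = soup.find('a')

has_id = link.has_attr('id')           # attribute present?
has_href = link.has_attr('href')
text = link.get_text()                 # text of the tag and its descendants
position = body.index(div)             # position of div among body's children
br_empty = soup.find('br').is_empty_element
div_empty = div.is_empty_element

print(has_id, has_href, text, position, br_empty, div_empty)
```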

16. Elements related to the current tag

# soup.next

# soup.next_element

# soup.next_elements

# soup.next_sibling

# soup.next_siblings

#

# tag.previous

# tag.previous_element

# tag.previous_elements

# tag.previous_sibling

# tag.previous_siblings

#

# tag.parent

# tag.parents

17. Searching among the elements related to a tag

# tag.find_next(...)

# tag.find_all_next(...)

# tag.find_next_sibling(...)

# tag.find_next_siblings(...)

# tag.find_previous(...)

# tag.find_all_previous(...)

# tag.find_previous_sibling(...)

# tag.find_previous_siblings(...)

# tag.find_parent(...)

# tag.find_parents(...)

# Arguments are the same as for find_all
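A small sketch of the navigation attributes and search methods from items 16 and 17, run on a made-up list with no whitespace between the tags (otherwise the siblings would be text nodes):

```python
from bs4 import BeautifulSoup

html_doc = '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html_doc, features='html.parser')
middle = soup.find(id='b')

next_id = middle.next_sibling['id']          # sibling after the tag
prev_id = middle.previous_sibling['id']      # sibling before the tag
parent_name = middle.parent.name             # enclosing tag
inner = middle.next_element                  # next node in parse order: the text 'B'

found_next = middle.find_next_sibling('li')['id']
found_prev = [t['id'] for t in middle.find_previous_siblings('li')]
after_first = [t['id'] for t in soup.find(id='a').find_all_next('li')]

print(next_id, prev_id, parent_name, inner)
print(found_next, found_prev, after_first)
```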

18. select, select_one: CSS selectors

soup.select("title")

soup.select("p:nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')

# Note: _candidate_generator is a private argument found in older bs4 releases;
# newer versions may not accept it.
from bs4.element import Tag

def default_candidate_generator(tag):
    # Yield only descendant tags that have an href attribute
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

# Same search, stopping after the first match
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
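Several of the selectors listed above, run against a concrete document. The HTML is an assumption modelled on the "three sisters" sample; the a[href] selector is one public way to restrict matches to tags that have an href attribute.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, features='html.parser')

def ids(tags):
    """Collect the id attribute of each matched tag."""
    return [t['id'] for t in tags]

title = soup.select('head > title')[0].get_text()    # direct child
second = ids(soup.select('p > a:nth-of-type(2)'))    # second <a> inside <p>
later = ids(soup.select('#link1 ~ .sister'))         # all following siblings
adjacent = ids(soup.select('#link1 + .sister'))      # immediately following sibling
suffix = ids(soup.select('a[href$="tillie"]'))       # href ends with "tillie"
with_href = ids(soup.select('a[href]', limit=1))     # has href, first match only
one = soup.select_one('a#link2').get_text()          # single result or None

print(title, second, later, adjacent, suffix, with_href, one)
```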

19. Tag contents

# tag = soup.find('span')

# print(tag.string) # get

# tag.string = 'new content' # set

# print(soup)

# tag = soup.find('body')

# print(tag.string)

# tag.string = 'xxx'

# print(soup)

# tag = soup.find('body')

# v = tag.stripped_strings # generator over all text inside the tag, recursively, with whitespace stripped

# print(v)
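The behaviour of .string depends on how many children a tag has, which the following sketch makes visible. The HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

html_doc = '<body><span>hello</span><div> a <b>b</b> </div></body>'
soup = BeautifulSoup(html_doc, features='html.parser')
span = soup.find('span')
body = soup.find('body')

single = span.string                 # one text child: that text
multiple = body.string               # several children: None
texts = list(body.stripped_strings)  # all nested text, whitespace stripped

span.string = 'new content'          # replaces the tag's contents
updated = str(span)

print(single, multiple, texts, updated)
```

stripped_strings is a generator, so printing it directly only shows the generator object; wrap it in list() to see the values.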

20. append: appends an element at the end of the current tag's contents

# tag = soup.find('body')

# tag.append(soup.find('a'))

# print(soup)

#

# from bs4.element import Tag

# obj = Tag(name='i',attrs={'id': 'it'})

# obj.string = 'I am new here'

# tag = soup.find('body')

# tag.append(obj)

# print(soup)

21. insert: inserts an element at a given position inside the current tag

# from bs4.element import Tag

# obj = Tag(name='i', attrs={'id': 'it'})

# obj.string = 'I am new here'

# tag = soup.find('body')

# tag.insert(2, obj)

# print(soup)

22. insert_after, insert_before: insert an element after or before the current tag

# from bs4.element import Tag

# obj = Tag(name='i', attrs={'id': 'it'})

# obj.string = 'I am new here'

# tag = soup.find('body')

# # tag.insert_before(obj)

# tag.insert_after(obj)

# print(soup)

23. replace_with: replaces the current tag with the given tag

# from bs4.element import Tag

# obj = Tag(name='i', attrs={'id': 'it'})

# obj.string = 'I am new here'

# tag = soup.find('div')

# tag.replace_with(obj)

# print(soup)

24. Setting up relationships between tags

# tag = soup.find('div')

# a = soup.find('a')

# tag.setup(previous_sibling=a)  # setup() is a low-level method that only sets the link attributes

# print(tag.previous_sibling)

25. wrap: wraps the current tag in the given tag

# from bs4.element import Tag

# obj1 = Tag(name='div', attrs={'id': 'it'})

# obj1.string = 'I am new here'

#

# tag = soup.find('a')

# v = tag.wrap(obj1)

# print(soup)

# tag = soup.find('a')

# v = tag.wrap(soup.find('p'))

# print(soup)

26. unwrap: removes the current tag but keeps what it wraps

# tag = soup.find('a')

# v = tag.unwrap()

# print(soup)
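The editing methods from items 20 to 26 can be combined in one pass. This sketch swaps in soup.new_tag, the documented factory method, in place of calling the Tag constructor directly as the snippets above do; the HTML and tag names are made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><div>old</div><a href="#">link</a></body>',
    features='html.parser',
)
body = soup.find('body')

note = soup.new_tag('i', id='it')
note.string = 'I am new here'
body.append(note)                        # becomes the last child of <body>

heading = soup.new_tag('h1')
heading.string = 'title'
body.insert(0, heading)                  # becomes the first child of <body>

replacement = soup.new_tag('section')
replacement.string = 'new'
soup.find('div').replace_with(replacement)

soup.find('a').wrap(soup.new_tag('p'))   # <p><a href="#">link</a></p>
soup.find('i').unwrap()                  # drops <i>, keeps its text

result = str(soup)
print(result)
```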

For more options, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
