A web crawler (also called a web spider or web robot, and in the FOAF community more often a "web chaser") is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

Contents

I. Requests

II. BeautifulSoup

III. Automatically logging in to Chouti and upvoting

IV. "Cracking" WeChat Official Accounts

V. Auto-login examples

I. Requests

The Python standard library provides urllib, urllib2, httplib, and other modules for making HTTP requests, but their APIs are awful. They were built for another era and another internet, and even the simplest tasks require an enormous amount of work, including overriding various methods.

  • Wrapping a urllib request

import urllib2
import json
import cookielib

def urllib2_request(url, method="GET", cookie="", headers={}, data=None):
    """
    :param url: the URL to request
    :param method: the request method: GET, POST, DELETE, PUT...
    :param cookie: the cookies to send, e.g. cookie = 'k1=v1;k1=v2'
    :param headers: the request headers to send, e.g. headers = {'ContentType':'application/json; charset=UTF-8'}
    :param data: the data to send; for GET, pass parameters as data={'d1': 'v1'}
    :return: a tuple of the response body string and the cookiejar object

    The cookiejar object can be iterated with a for loop:
        for item in cookiejar:
            print item.name, item.value
    """
    if data:
        data = json.dumps(data)

    cookie_jar = cookielib.CookieJar()
    handler = urllib2.HTTPCookieProcessor(cookie_jar)
    opener = urllib2.build_opener(handler)
    opener.addheaders.append(['Cookie', 'k1=v1;k1=v2'])
    request = urllib2.Request(url=url, data=data, headers=headers)
    request.get_method = lambda: method

    response = opener.open(request)
    origin = response.read()

    return origin, cookie_jar

# GET
result = urllib2_request('http://127.0.0.1:8001/index/', method="GET")

# POST
result = urllib2_request('http://127.0.0.1:8001/index/', method="POST", data={'k1': 'v1'})

# PUT
result = urllib2_request('http://127.0.0.1:8001/index/', method="PUT", data={'k1': 'v1'})

Requests is an Apache2-licensed HTTP library written in Python. It is a high-level wrapper around Python's built-in modules that makes sending network requests from Python far more pleasant; with Requests you can easily do anything a browser can do.

1. GET requests

# 1. Without parameters

import requests

ret = requests.get('https://github.com/timeline.json')

print(ret.url)
print(ret.text)

# 2. With parameters

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)

print(ret.url)
print(ret.text)

Sending a GET request to https://github.com/timeline.json wraps everything related to the request and the response in the ret object.
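For reference, the object returned by requests.get is a requests.Response; a minimal sketch of the attributes you will usually inspect, using the httpbin endpoint already used in this article:

import requests

ret = requests.get("http://httpbin.org/get", params={'key1': 'value1'})

print(ret.url)          # the final URL, with the query string appended
print(ret.status_code)  # the HTTP status code, e.g. 200
print(ret.encoding)     # the encoding used to decode ret.text
print(ret.headers)      # response headers (a case-insensitive dict)
print(ret.text)         # the body decoded as a string
print(ret.content)      # the body as raw bytes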

Example: scraping news titles, links, and images from Autohome

import requests
import uuid
from bs4 import BeautifulSoup

response = requests.get(url='https://www.autohome.com.cn/news/')
# Parse with the encoding detected from the downloaded page itself
# (you could also set response.encoding = 'utf-8' explicitly)
response.encoding = response.apparent_encoding
# response.status_code holds the returned status code
# lxml performs better but must be installed; html.parser ships with Python
soup = BeautifulSoup(response.text, features='html.parser')
tag1 = soup.find(id='auto-channel-lazyload-article')
# tag2 = tag1.find('li')  # only finds the first one
tag_list = tag1.find_all('li')  # finds all of them, returned as a list
for tag in tag_list:
    tag_a = tag.find('a')
    if tag_a:  # only read attributes from items that contain an <a> tag
        # print(tag_a.attrs.get('href'))  # news link
        h3_text = tag_a.find('h3')  # actually a Tag object; it only looks like text when printed
        # both .text and .string return the tag's text
        # print(h3_text.string, '*****')  # news title
        # print(h3_text.text, '----')     # news title
        img_url = tag_a.find('img').attrs.get('src')
        # print(img_url)  # news image
        # Download the image: .text returns text, .content returns bytes
        # Using url=img_url directly would produce http:////p9.pstatp.com/list/pgc-image/153838020
        img_response = requests.get("http:" + img_url).content
        file_name = 'imgs/' + str(uuid.uuid4()) + '.jpg'
        # with open(file_name, 'wb') as f:
        #     f.write(img_response)

2. POST requests

# 1. Basic POST example

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)

print(ret.text)

# 2. Sending headers together with data

import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)

print(ret.text)
print(ret.cookies)

Sending a POST request to https://api.github.com/some/endpoint wraps everything related to the request and the response in the ret object.
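As a side note, instead of calling json.dumps yourself you can pass the json parameter and let requests serialize the body and set the Content-Type: application/json header for you; a minimal sketch against the same placeholder endpoint used above:

import requests

url = 'https://api.github.com/some/endpoint'

# equivalent to data=json.dumps(payload) plus a Content-Type: application/json header
ret = requests.post(url, json={'some': 'data'})

print(ret.status_code)
print(ret.text)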

3. Other request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the above are built on top of this method
requests.request(method, url, **kwargs)

# params={'k1': 'v1'} -- parameters appended to the URL and sent to the backend, GET-style

The requests module already wraps the common HTTP request methods for you, so you can just call the corresponding method. The full set of parameters is:

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

Parameter examples:

def param_method_url():
    requests.request(method='get', url='http://127.0.0.1:8000/test/')
    requests.request(method='post', url='http://127.0.0.1:8000/test/')


def param_param():
    # can be a dict
    # can be a string
    # can be bytes (ASCII characters only)

    requests.request(method='get',
                     url='http://127.0.0.1:8000/test/',
                     params={'k1': 'v1', 'k2': '水电费'})

    requests.request(method='get',
                     url='http://127.0.0.1:8000/test/',
                     params="k1=v1&k2=水电费&k3=v3&k3=vv3")

    requests.request(method='get',
                     url='http://127.0.0.1:8000/test/',
                     params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # error: non-ASCII characters cannot be sent as bytes
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))


def param_data():
    # can be a dict
    # can be a string
    # can be bytes
    # can be a file object

    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': '水电费'})

    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data="k1=v1; k2=v2; k3=v3; k3=v4"
                     )

    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data="k1=v1;k2=v2;k3=v3;k3=v4",
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )

    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data=open('data_file.py', mode='r', encoding='utf-8'),  # the file contains: k1=v1;k2=v2;k3=v3;k3=v4
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_json():
    # the json data is serialized into a string with json.dumps(...) and sent
    # in the request body, with Content-Type set to 'application/json'
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})


def param_headers():
    # send request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )
    # a CookieJar can also be used (the dict form is a wrapper around it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # upload a file
    file_dict = {
        'f1': open('readme', 'rb')
    }
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     files=file_dict)

    # upload a file with a custom file name
    file_dict = {
        'f1': ('test.txt', open('readme', 'rb'))
    }
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     files=file_dict)

    # upload a file with a custom file name and inline content
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)


def param_auth():
    # basic auth (the encoded user name and password are added to the request headers)
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    ret = requests.get('http://192.168.1.1',
                       auth=HTTPBasicAuth('admin', 'admin'))
    ret.encoding = 'gbk'
    print(ret.text)

    ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    print(ret)


def param_timeout():
    # timeouts
    ret = requests.get('http://google.com/', timeout=1)
    print(ret)

    ret = requests.get('http://google.com/', timeout=(5, 1))
    print(ret)


def param_allow_redirects():
    # whether redirects are followed
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies
    proxies = {
        "http": "61.172.249.96:80",
        "https": "http://61.185.219.126:3128",
    }

    proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

    ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    print(ret.headers)

    from requests.auth import HTTPProxyAuth

    proxyDict = {
        'http': '77.75.105.165',
        'https': '77.75.105.165'
    }
    auth = HTTPProxyAuth('username', 'mypassword')

    r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    print(r.text)


def param_stream():
    # download a file as a stream
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    from contextlib import closing
    with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
        # process the response here
        for i in r.iter_content():
            print(i)


def requests_session():
    # a Session keeps the client's history (its cookies) across requests
    import requests

    session = requests.Session()

    ### 1. Visit any page first to obtain a cookie
    i1 = session.get(url="http://dig.chouti.com/help/service")

    ### 2. Log in, carrying the previous cookie; the backend authorizes the gpsd value inside it
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)


# Additional notes:
# param verify: whether to skip certificate checks and access the site anyway; verify=False ignores the certificate
# param cert: the certificate file; cert='xx.pem' for a PEM certificate, or cert=('xx.crt', 'oo.key') for a cert/key pair (same effect)
# Referer: https://www.baidu.com/ -- request header that records the previously visited address
# User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
#   -- request header that identifies the type of client you are using
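To make the verify and cert notes at the end of the block above concrete, here is a minimal sketch; the certificate file names are the placeholders from those comments (xx.pem, xx.crt, oo.key) and the host is the article's local test server:

import requests

# skip certificate verification entirely (only sensible for testing)
ret = requests.get('https://127.0.0.1:8000/test/', verify=False)

# verify against a specific CA bundle instead of the system default
ret = requests.get('https://127.0.0.1:8000/test/', verify='ca_bundle.pem')

# present a client certificate: either a single PEM file...
ret = requests.get('https://127.0.0.1:8000/test/', cert='xx.pem')

# ...or a separate certificate/key pair
ret = requests.get('https://127.0.0.1:8000/test/', cert=('xx.crt', 'oo.key'))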

For more documentation on the requests module, see: http://cn.python-requests.org/zh_CN/latest/

II. BeautifulSoup

BeautifulSoup is a module that takes an HTML or XML string, parses it into a structured document, and then lets you locate specific elements quickly with the methods it provides, which makes searching HTML or XML much simpler.

  • Install: pip install beautifulsoup4
  • Import: from bs4 import BeautifulSoup

A simple example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
<div class="title">
<b>The Dormouse's story总共</b>
<h1>f</h1>
</div>
<div class="story">Once upon a time there were three little sisters; and their names were
<a class="sister0" id="link1">Els<span>f</span>ie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
""" soup = BeautifulSoup(html_doc, features="lxml")
# 找到第一个a标签
tag1 = soup.find(name='a')
# 找到所有的a标签
tag2 = soup.find_all(name='a')
# 找到id=link2的标签
tag3 = soup.select('#link2')

1. name: the tag name

# tag = soup.find('a')
# name = tag.name  # get
# print(name)
# tag.name = 'span'  # set
# print(soup)

2. attrs: the tag's attributes

# tag = soup.find('a')
# attrs = tag.attrs  # get
# print(attrs)
# tag.attrs = {'ik': 123}  # set
# tag.attrs['id'] = 'iiiii'  # set
# print(soup)

3. children: all direct child nodes

# body = soup.find('body')
# v = body.children

4. descendants: all descendant nodes. This returns a generator (convert it with list() if needed) that iterates for you; it visits each tag and then everything inside it, including text, going all the way down one branch before moving on.

body = soup.find('body').descendants
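A short sketch contrasting children and descendants on the html_doc defined in the example above; children yields only the direct children of <body>, while descendants walks every nested tag and text node depth first:

body = soup.find('body')

# direct children only: tags and text nodes immediately under <body>
for child in body.children:
    print(type(child), getattr(child, 'name', None))

# every nested tag and text node, depth first
for node in body.descendants:
    print(type(node), getattr(node, 'name', None))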

5. clear: removes everything inside the tag (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

6. decompose: recursively removes the tag and everything inside it

# body = soup.find('body')
# body.decompose()
# print(soup)

7. extract: recursively removes the tag and everything inside it, and returns what was removed

# body = soup.find('body')
# v = body.extract()
# print(soup)

8. decode: converts to a string (including the current tag); decode_contents does the same without the current tag

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

9. encode: converts to bytes (including the current tag); encode_contents does the same without the current tag

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

10. find: returns the first matching tag

# tag = soup.find('a')
# print(tag)
# combined usage
tag = soup.find(id='c1')
tag = soup.find('div',id='c1')
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

11. find_all: returns all matching tags as a list

# tags = soup.find_all('a')
# print(tags)

# tags = soup.find_all('a', limit=1)
# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)

# ####### lists #######
# v = soup.find_all(name=['a', 'div'])
# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)

# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))

# v = soup.find_all(id=['link1', 'link2'])
# print(v)

# v = soup.find_all(href=['link1', 'link2'])
# print(v)

# ####### regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)

# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### filtering with a function #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)

# ## get: read a tag attribute
# tag = soup.find('a')
# v = tag.get('id')
# print(v)

12. has_attr: checks whether the tag has a given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

13. get_text: returns the text inside the tag

# tag = soup.find('a')
# v = tag.get_text()
# print(v)

14. index: returns the position of a tag inside another tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)

# tag = soup.find('body')
# for i, v in enumerate(tag):
#     print(i, v)

15. is_empty_element: whether the tag is an empty (void) or self-closing tag,

i.e. one of the following tags: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

16. Tags related to the current tag

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings

# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

# tag.parent
# tag.parents
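A short sketch of a few of these navigation attributes, again on html_doc from the example above (note that the whitespace between tags shows up as text nodes):

tag = soup.find(id='link1')

print(tag.parent.name)       # the enclosing tag: div
print(tag.next_sibling)      # the node immediately after the tag: a text node
print(tag.previous_element)  # the node parsed just before this tag
for p in tag.parents:        # walk upwards: div, body, html, [document]
    print(p.name)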

17. Searching among a tag's related tags

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)

# these take the same parameters as find_all
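A short sketch showing how these combine navigation with find_all-style filtering, using html_doc from above:

tag = soup.find(id='link1')

print(tag.find_next_sibling('a'))          # the next <a> sibling (link2)
print(tag.find_parent('div'))              # the enclosing <div class="story">
print(tag.find_all_next(class_='sister'))  # every later tag with class "sister"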

18. select and select_one: CSS selectors

soup.select("title")

soup.select("p nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')

from bs4.element import Tag

def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

from bs4.element import Tag

def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
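The heading above also mentions select_one; the difference is simply that select returns a list of all matches while select_one returns only the first match (or None). A minimal sketch on html_doc:

tags = soup.select('a.sister')     # every <a> tag with class "sister"
tag = soup.select_one('a.sister')  # just the first match, or None

print(len(tags))
print(tag.get('href'))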

19. Tag content: for a tag whose only child is a single string, .text and .string return the same result

# tag = soup.find('span')
# print(tag.string)           # get
# tag.string = 'new content'  # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings  # recursively yields the text of every tag inside
# print(v)
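A short sketch of the difference on html_doc from above: .string is None as soon as a tag has more than one child, while .text always concatenates all nested text:

span = soup.find('span')
print(span.string)  # 'f' -- a single text child, so .string and .text agree
print(span.text)    # 'f'

div = soup.find('div', class_='title')
print(div.string)   # None -- several children, so there is no single string
print(div.text)     # all nested text, concatenated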

20. append: appends a tag inside the current tag

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i',attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

21. insert: inserts a tag at the given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

22. insert_after / insert_before: insert a tag after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

23. replace_with: replaces the current tag with the given tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

24. Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

25. wrap: wraps the current tag with the given tag, i.e. the tag passed in ends up outside the current tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

26. unwrap: removes the current tag, keeping the content it wrapped

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)

For more parameters, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

III. Automatically logging in to Chouti and upvoting

import requests

### Send the headers with every request, otherwise you will hit the anti-crawler firewall; always use https://
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'referer': 'https://dig.chouti.com/',  # the previously visited address; some sites require a prior visit so the request does not look like a bare script
}

# Visit the home page first
response1 = requests.get(
    url="https://dig.chouti.com/",
    headers=headers,
)
cookie1 = response1.cookies.get_dict()
print(cookie1)
# {'gpsd': '160d28d416d222dcd6eeb1c1d5ebd268', 'JSESSIONID': 'aaahlyoDbL5GQfrqjq2Lw'}

# Log in, carrying the cookies returned by the first visit
response2 = requests.post(
    url="https://dig.chouti.com/login",
    data={
        'phone': "",
        'password': "",
        'oneMonth': ''
    },
    cookies=cookie1,
    headers=headers,
)
print(response2.status_code)
cookie2 = response2.cookies.get_dict()
print(cookie2)  # login returns another gpsd, but it is useless; the first one is the one used for authentication

# Visit your own settings page
response3 = requests.get(
    url='https://dig.chouti.com/profile',
    cookies={'gpsd': cookie1.get('gpsd')},
    headers=headers,
)
print(response3.text)

# Upvote: only the authorized gpsd is needed, i.e. the one from the first visit
response4 = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=25263971',
    cookies={'gpsd': cookie1.get('gpsd')},
    headers=headers,
)
print(response4.text)
# {"result":{"code":"9999", "message":"推荐成功",
#  "data":{"jid":"cdu_55306581825","likedTime":"1553068537532000",
#  "lvCount":"8","nick":"查理大夫","uvCount":"1","voteTime":"小于1分钟前"}}}

IV. "Cracking" WeChat Official Accounts

"Cracking" a WeChat Official Account simply means using Python code to automate [logging in to the Official Account platform] -> [fetching the followers] -> [sending messages to the followers].

Note: you can only push messages proactively to followers who have interacted with the account within the last 48 hours.

1. Automatic login

Analysis of the web login page shows that login verification involves only the following:

  • Login URL: https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN
  • POST data:

    {
        'username': the account name,
        'pwd': the MD5 of the password,
        'imgcode': "",
        'f': 'json'
    }
    Note: imgcode is a captcha. By default none is required; only after several failed login attempts does the user have to provide one.

  • The Referer header of the POST request, which the WeChat backend uses to check where the request came from
  • After a successful login, the cookies in the response; later requests to other pages must carry them
  • After a successful login, the token contained in the response body

Login code:

import re
import requests
import time
import hashlib

def _password(pwd):
    ha = hashlib.md5()
    ha.update(pwd)
    return ha.hexdigest()

def login():
    login_dict = {
        'username': "your-username",
        'pwd': _password("your-password"),
        'imgcode': "",
        'f': 'json'
    }

    login_res = requests.post(
        url="https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN",
        data=login_dict,
        headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'})

    # after a successful login, grab the cookies from the response
    resp_cookies_dict = login_res.cookies.get_dict()
    # after a successful login, grab the response body
    resp_text = login_res.text
    # after a successful login, extract the token
    token = re.findall(".*token=(\d+)", resp_text)[0]

    print resp_text
    print token
    print resp_cookies_dict

login()

The response received after a successful login looks like this:

Response body:
{"base_resp":{"ret":0,"err_msg":"ok"},"redirect_url":"\/cgi-bin\/home?t=home\/index&lang=zh_CN&token=537908795"}

Response cookies:
{'data_bizuin': '3016804678', 'bizuin': '3016804678', 'data_ticket': 'CaoX+QA0ZA9LRZ4YM3zZkvedyCY8mZi0XlLonPwvBGkX0/jY/FZgmGTq6xGuQk4H', 'slave_user': 'gh_5abeaed48d10', 'slave_sid': 'elNLbU1TZHRPWDNXSWdNc2FjckUxalM0Y000amtTamlJOUliSnRnWGRCdjFseV9uQkl5cUpHYkxqaGJNcERtYnM2WjdFT1pQckNwMFNfUW5fUzVZZnFlWGpSRFlVRF9obThtZlBwYnRIVGt6cnNGbUJsNTNIdTlIc2JJU29QM2FPaHZjcTcya0F6UWRhQkhO'}

2. Visiting other pages to get user information

Analysis of the user management page: access it from Python with a GET request, then parse the returned HTML to extract the user information:

  • URL for fetching users: https://mp.weixin.qq.com/cgi-bin/user_tag?action=get_all_data&lang=zh_CN&token=the token obtained at login
  • The GET request must carry the cookies obtained after a successful login
    {'data_bizuin': '3016804678', 'bizuin': '3016804678', 'data_ticket': 'C4YM3zZ...
  • Grab the HTML of the response
  • Extract the relevant parts of the HTML with regular expressions (or Python's Beautiful Soup module)
  • Read each user's data-fakeid attribute from the HTML; it is the user's unique identifier and is used to push messages to that user

Implementation:

import requests
import time
import hashlib
import json
import re

LOGIN_COOKIES_DICT = {}

def _password(pwd):
    ha = hashlib.md5()
    ha.update(pwd)
    return ha.hexdigest()

def login():
    login_dict = {
        'username': "your-username",
        'pwd': _password("your-password"),
        'imgcode': "",
        'f': 'json'
    }

    login_res = requests.post(
        url="https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN",
        data=login_dict,
        headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'})

    # after a successful login, grab the cookies from the response
    resp_cookies_dict = login_res.cookies.get_dict()
    # after a successful login, grab the response body
    resp_text = login_res.text
    # after a successful login, extract the token
    token = re.findall(".*token=(\d+)", resp_text)[0]

    return {'token': token, 'cookies': resp_cookies_dict}

def standard_user_list(content):
    # strip whitespace and newlines, then pull the cgiData JS object out of the page
    content = re.sub('\s*', '', content)
    content = re.sub('\n*', '', content)
    data = re.findall("""cgiData=(.*);seajs""", content)[0]
    data = data.strip()
    # quote the unquoted JS object keys so the blob becomes valid JSON
    while True:
        temp = re.split('({)(\w+)(:)', data, 1)
        if len(temp) == 5:
            temp[2] = '"' + temp[2] + '"'
            data = ''.join(temp)
        else:
            break
    while True:
        temp = re.split('(,)(\w+)(:)', data, 1)
        if len(temp) == 5:
            temp[2] = '"' + temp[2] + '"'
            data = ''.join(temp)
        else:
            break
    data = re.sub('\*\d+', "", data)
    ret = json.loads(data)
    return ret

def get_user_list():
    login_dict = login()
    LOGIN_COOKIES_DICT.update(login_dict)

    login_cookie_dict = login_dict['cookies']
    res_user_list = requests.get(
        url="https://mp.weixin.qq.com/cgi-bin/user_tag",
        params={"action": "get_all_data", "lang": "zh_CN", "token": login_dict['token']},
        cookies=login_cookie_dict,
        headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'}
    )
    user_info = standard_user_list(res_user_list.text)
    for item in user_info['user_list']:
        print "%s %s " % (item['nick_name'], item['id'],)

get_user_list()

3. Sending messages

Analysis of the page used to send a message to a user: capture the URL used for sending messages from the network requests, then reproduce it from Python:

  • URL for sending a message: https://mp.weixin.qq.com/cgi-bin/singlesend?t=ajax-response&f=json&token=the token obtained at login&lang=zh_CN
  • Take the token and cookies from the login response
  • Take the target user's unique identifier, fake_id, from the user list
  • Build the message payload and send it in a POST request

send_dict = {
    'token': the token obtained at login,
    'lang': "zh_CN",
    'f': 'json',
    'ajax': 1,
    'random': "0.5322618900912392",
    'type': 1,
    'content': the content to send,
    'tofakeid': the ID of the user taken from the user list,
    'imgcode': ''
}

  

import requests
import time
import hashlib
import json
import re

LOGIN_COOKIES_DICT = {}

def _password(pwd):
    ha = hashlib.md5()
    ha.update(pwd)
    return ha.hexdigest()

def login():
    login_dict = {
        'username': "your-username",
        'pwd': _password("your-password"),
        'imgcode': "",
        'f': 'json'
    }

    login_res = requests.post(
        url="https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN",
        data=login_dict,
        headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'})

    # after a successful login, grab the cookies from the response
    resp_cookies_dict = login_res.cookies.get_dict()
    # after a successful login, grab the response body
    resp_text = login_res.text
    # after a successful login, extract the token
    token = re.findall(".*token=(\d+)", resp_text)[0]

    return {'token': token, 'cookies': resp_cookies_dict}

def standard_user_list(content):
    content = re.sub('\s*', '', content)
    content = re.sub('\n*', '', content)
    data = re.findall("""cgiData=(.*);seajs""", content)[0]
    data = data.strip()
    while True:
        temp = re.split('({)(\w+)(:)', data, 1)
        if len(temp) == 5:
            temp[2] = '"' + temp[2] + '"'
            data = ''.join(temp)
        else:
            break
    while True:
        temp = re.split('(,)(\w+)(:)', data, 1)
        if len(temp) == 5:
            temp[2] = '"' + temp[2] + '"'
            data = ''.join(temp)
        else:
            break
    data = re.sub('\*\d+', "", data)
    ret = json.loads(data)
    return ret

def get_user_list():
    login_dict = login()
    LOGIN_COOKIES_DICT.update(login_dict)

    login_cookie_dict = login_dict['cookies']
    res_user_list = requests.get(
        url="https://mp.weixin.qq.com/cgi-bin/user_tag",
        params={"action": "get_all_data", "lang": "zh_CN", "token": login_dict['token']},
        cookies=login_cookie_dict,
        headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'}
    )
    user_info = standard_user_list(res_user_list.text)
    for item in user_info['user_list']:
        print "%s %s " % (item['nick_name'], item['id'],)

def send_msg(user_fake_id, content='nothing to send'):
    login_dict = LOGIN_COOKIES_DICT

    token = login_dict['token']
    login_cookie_dict = login_dict['cookies']

    send_dict = {
        'token': token,
        'lang': "zh_CN",
        'f': 'json',
        'ajax': 1,
        'random': "0.5322618900912392",
        'type': 1,
        'content': content,
        'tofakeid': user_fake_id,
        'imgcode': ''
    }

    send_url = "https://mp.weixin.qq.com/cgi-bin/singlesend?t=ajax-response&f=json&token=%s&lang=zh_CN" % (token,)
    message_list = requests.post(
        url=send_url,
        data=send_dict,
        cookies=login_cookie_dict,
        headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'}
    )

get_user_list()
fake_id = raw_input('Enter the user ID: ')
content = raw_input('Enter the message content: ')
send_msg(fake_id, content)

Message-sending code

That is the whole process of "cracking" a WeChat Official Account: Python code that automatically [logs in to the Official Account platform], [fetches the user list], and [sends a message to a specified user].

V. Auto-login examples

import requests
from bs4 import BeautifulSoup

############## Method 1 ##############

# 1. Visit the login page and grab the authenticity_token
i1 = requests.get('https://github.com/login')
soup1 = BeautifulSoup(i1.text, features='lxml')
tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
authenticity_token = tag.get('value')
c1 = i1.cookies.get_dict()
i1.close()

# 2. Send the credentials together with authenticity_token for verification
form_data = {
    "authenticity_token": authenticity_token,
    "utf8": "",
    "commit": "Sign in",
    "login": "charliedaifu",
    'password': 'xxxx'
}

i2 = requests.post('https://github.com/session', data=form_data, cookies=c1)
c2 = i2.cookies.get_dict()
print(c2)
print(i2.status_code)

c1.update(c2)
i3 = requests.get('https://github.com/settings/repositories', cookies=c1)

soup3 = BeautifulSoup(i3.text, features='lxml')
list_group = soup3.find(name='div', class_='listgroup')

from bs4.element import Tag

for child in list_group.children:
    if isinstance(child, Tag):
        project_tag = child.find(name='a', class_='mr-1')
        size_tag = child.find(name='small')
        temp = "Project: %s(%s); path: %s" % (project_tag.get('href'), size_tag.string if size_tag else '', project_tag.string, )
        print(temp)

"""
############## Method 2 ##############
session = requests.Session()

# 1. Visit the login page and grab the authenticity_token
i1 = session.get('https://github.com/login')
soup1 = BeautifulSoup(i1.text, features='lxml')
tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
authenticity_token = tag.get('value')
c1 = i1.cookies.get_dict()
i1.close()

# 2. Send the credentials together with authenticity_token for verification
form_data = {
    "authenticity_token": authenticity_token,
    "utf8": "",
    "commit": "Sign in",
    "login": "wupeiqi@live.com",
    'password': 'xxoo'
}

i2 = session.post('https://github.com/session', data=form_data)
c2 = i2.cookies.get_dict()
c1.update(c2)
i3 = session.get('https://github.com/settings/repositories')

soup3 = BeautifulSoup(i3.text, features='lxml')
list_group = soup3.find(name='div', class_='listgroup')

from bs4.element import Tag

for child in list_group.children:
    if isinstance(child, Tag):
        project_tag = child.find(name='a', class_='mr-1')
        size_tag = child.find(name='small')
        temp = "Project: %s(%s); path: %s" % (project_tag.get('href'), size_tag.string if size_tag else '', project_tag.string, )
        print(temp)
"""

GitHub

import time

import requests
from bs4 import BeautifulSoup

session = requests.Session()

i1 = session.get(
    url='https://www.zhihu.com/#signin',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

soup1 = BeautifulSoup(i1.text, 'lxml')
xsrf_tag = soup1.find(name='input', attrs={'name': '_xsrf'})
xsrf = xsrf_tag.get('value')

current_time = time.time()
i2 = session.get(
    url='https://www.zhihu.com/captcha.gif',
    params={'r': current_time, 'type': 'login'},
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    })

with open('zhihu.gif', 'wb') as f:
    f.write(i2.content)

captcha = input('Open zhihu.gif, then enter the captcha shown in it: ')
form_data = {
    "_xsrf": xsrf,
    'password': 'xxooxxoo',
    "captcha": captcha,
    'email': '424662508@qq.com'
}
i3 = session.post(
    url='https://www.zhihu.com/login/email',
    data=form_data,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

i4 = session.get(
    url='https://www.zhihu.com/settings/profile',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

soup4 = BeautifulSoup(i4.text, 'lxml')
tag = soup4.find(id='rename-section')
nick_name = tag.find('span', class_='name').string
print(nick_name)

Zhihu

import re
import json
import base64

import rsa
import requests

def js_encrypt(text):
    b64der = 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB'
    der = base64.standard_b64decode(b64der)

    pk = rsa.PublicKey.load_pkcs1_openssl_der(der)
    v1 = rsa.encrypt(bytes(text, 'utf8'), pk)
    value = base64.encodebytes(v1).replace(b'\n', b'')
    value = value.decode('utf8')

    return value

session = requests.Session()

i1 = session.get('https://passport.cnblogs.com/user/signin')
rep = re.compile("'VerificationToken': '(.*)'")
v = re.search(rep, i1.text)
verification_token = v.group(1)

form_data = {
    'input1': js_encrypt('wptawy'),
    'input2': js_encrypt('asdfasdf'),
    'remember': False
}

i2 = session.post(url='https://passport.cnblogs.com/user/signin',
                  data=json.dumps(form_data),
                  headers={
                      'Content-Type': 'application/json; charset=UTF-8',
                      'X-Requested-With': 'XMLHttpRequest',
                      'VerificationToken': verification_token}
                  )

i3 = session.get(url='https://i.cnblogs.com/EditDiary.aspx')
print(i3.text)

Cnblogs

import re
import requests

# Step 1: visit the login page and grab X_Anti_Forge_Token and X_Anti_Forge_Code
# 1. URL: https://passport.lagou.com/login/login.html
# 2. Method: GET
# 3. Request headers:
#    User-Agent
r1 = requests.get('https://passport.lagou.com/login/login.html',
                  headers={
                      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
                  },
                  )

X_Anti_Forge_Token = re.findall("X_Anti_Forge_Token = '(.*?)'", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall("X_Anti_Forge_Code = '(.*?)'", r1.text, re.S)[0]
print(X_Anti_Forge_Token, X_Anti_Forge_Code)
# print(r1.cookies.get_dict())

# Step 2: log in
# 1. URL: https://passport.lagou.com/login/login.json
# 2. Method: POST
# 3. Request headers:
#    cookie
#    User-Agent
#    Referer: https://passport.lagou.com/login/login.html
#    X-Anit-Forge-Code: 53165984
#    X-Anit-Forge-Token: 3b6a2f62-80f0-428b-8efb-ef72fc100d78
#    X-Requested-With: XMLHttpRequest
# 4. Request body:
#    isValidate: true
#    username: 15131252215
#    password: ab18d270d7126ea65915c50288c22c0d
#    request_form_verifyCode: ''
#    submit: ''
r2 = requests.post(
    'https://passport.lagou.com/login/login.json',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Referer': 'https://passport.lagou.com/login/login.html',
        'X-Anit-Forge-Code': X_Anti_Forge_Code,
        'X-Anit-Forge-Token': X_Anti_Forge_Token,
        'X-Requested-With': 'XMLHttpRequest'
    },
    data={
        "isValidate": True,
        'username': '',
        'password': 'ab18d270d7126ea65915c50288c22c0d',
        'request_form_verifyCode': '',
        'submit': ''
    },
    cookies=r1.cookies.get_dict()
)
print(r2.text)

Lagou

This article was reposted from http://www.cnblogs.com/wupeiqi/articles/5354900.html
