Preface

　　　　Like rowing upstream: stop pushing forward and you drift back.

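 Web crawlers

     - Basic operations

         Overview:
             - Send HTTP requests: Python HTTP clients, requests
             - Extract the information you want: Python regular expressions, BeautifulSoup
             - Persist the data

         Two Python modules
             - requests
             - beautifulsoup4

         HTTP request basics
             - Request:
                 Request headers
                     - cookie
                 Request body
                     - the content being sent
             - Response:
                 Response headers
                     - read by the browser
                 Response body
                     - the content the user sees
             Special items:
                 - cookie
                 - csrftoken
                 - content-type:
                     content-type: application/x-www-form-urlencoded
                     name=alex&age=18

                     content-type: application/json
                     {"name": "alex", "age": 18}

     - Performance
         - Serial: 1 person working through the tasks one at a time and idling whenever a task is waiting.
         - Threads: 10 people, each working through tasks one at a time and idling during the waits.
         - Processes: 10 households, each working through tasks one at a time and idling during the waits.
         - [Coroutines] asynchronous and non-blocking: 1 person who fills the waiting time with other tasks.

     - The Scrapy framework
         - Rules

     - scrapy-redis component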
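The two content-type cases in the outline can be reproduced with the standard library alone. This is only a minimal sketch of how the same data ends up as two different request bodies; the helper names `encode_form` and `encode_json` are made up for illustration and are not part of requests.

```python
import json
from urllib.parse import urlencode


def encode_form(fields):
    """Return (content_type, body) for a form-encoded request body."""
    return "application/x-www-form-urlencoded", urlencode(fields)


def encode_json(fields):
    """Return (content_type, body) for a JSON request body."""
    return "application/json", json.dumps(fields)


payload = {"name": "alex", "age": 18}
form_type, form_body = encode_form(payload)
json_type, json_body = encode_json(payload)

print(form_type, "->", form_body)
print(json_type, "->", json_body)
```

The server has to look at the content-type header to know which of the two formats it is about to parse.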

 Details:

     - Basic operations: use Python to impersonate a browser, send a request, and get the content you want

         pip3 install requests
         pip3 install beautifulsoup4

         import requests
         from bs4 import BeautifulSoup

         response = requests.get('http://www.baidu.com')
         response.text

         soup = BeautifulSoup(response.text,'html.parser')
         soup.find(name='h3',attrs={'class':'t'})
         soup.find_all(name='h3')

         Example: crawl the news page of Autohome (汽车之家)

     - Modules

         requests

             GET:
                 requests.get(url="http://www.oldboyedu.com")
                 # raw request: "GET / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n....\r\n\r\n"

                 requests.get(url="http://www.oldboyedu.com/index.html?p=1")
                 # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n....\r\n\r\n"

                 requests.get(url="http://www.oldboyedu.com/index.html",params={'p':1})
                 # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n....\r\n\r\n"

             POST:
                 requests.post(url="http://www.oldboyedu.com",data={'name':'alex','age':18}) # default Content-Type: application/x-www-form-urlencoded
                 # raw request: "POST / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n....\r\n\r\nname=alex&age=18"

                 requests.post(url="http://www.oldboyedu.com",json={'name':'alex','age':18}) # default Content-Type: application/json
                 # raw request: 'POST / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n....\r\n\r\n{"name": "alex", "age": 18}'

                 requests.post(
                     url="http://www.oldboyedu.com",
                     params={'p':1},
                     json={'name':'alex','age':18}
                 ) # default Content-Type: application/json
                 # raw request: 'POST /?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n....\r\n\r\n{"name": "alex", "age": 18}'

                 Note (Django, on the receiving side):
                     request.body always holds the raw body
                     request.POST may be empty
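To check what requests would actually put on the wire without contacting any server, a request can be prepared but never sent. The sketch below assumes requests is installed and uses its `Request(...).prepare()` API; the `describe` helper is hypothetical and exists only for this illustration.

```python
import requests


def describe(method, url, **kwargs):
    """Prepare (but do not send) a request and report what would be sent."""
    prepared = requests.Request(method, url, **kwargs).prepare()
    return {
        "request_line": f"{prepared.method} {prepared.path_url} HTTP/1.1",
        "content_type": prepared.headers.get("Content-Type"),
        "body": prepared.body,
    }


url = "http://www.oldboyedu.com/index.html"
get_info = describe("GET", url, params={"p": 1})
form_info = describe("POST", url, data={"name": "alex", "age": 18})
json_info = describe("POST", url, params={"p": 1}, json={"name": "alex", "age": 18})

for info in (get_info, form_info, json_info):
    print(info)
```

Nothing leaves the machine here, so the same trick is handy for debugging a login form before pointing the code at a real site.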

         beautifulsoup

             soup = BeautifulSoup('<HTML string>','html.parser')
             tag = soup.find(name='div',attrs={})
             tags = soup.find_all(name='div',attrs={})

             tag.find('h3').text
             tag.find('h3').get('attribute name')
             tag.find('h3').attrs

         HTTP requests:

             GET request:
                 data = "GET /index?page=1 HTTP/1.1\r\nHost: baidu.com\r\n....\r\n\r\n"

             POST request:
                 data = "POST /index?page=1 HTTP/1.1\r\nHost: baidu.com\r\n....\r\n\r\nname=alex&age=18"

             sock.sendall(data.encode('utf-8'))  # sock is a connected socket.socket; sendall() takes bytes

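         Example [GitHub and Chouti]: log in automatically, from code, to any site that does not require a CAPTCHA

             1. The expected way: the login POST returns the session cookie

                 r1 = requests.get(url='https://github.com/login')

                 s1 = BeautifulSoup(r1.text,'html.parser')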

                 val = s1.find(attrs={'name':'authenticity_token'}).get('value')

                 r2 = requests.post(

                         url= 'https://github.com/session',

                         data={

                             'commit': 'Sign in',

                             'utf8': '✓',

                             'authenticity_token': val,

                             'login':'xxxxx',

                             'password': 'xxxx',

                         }

                     )

                 r2_cookie_dict = r2.cookies.get_dict() # {'session_id':'asdfasdfksdfoiuljksdf'}

                 # The cookie carries the login state; send it along to view any URL

                 r3 = requests.get(

                     url='xxxxxxxx',

                     cookies=r2_cookie_dict

                 )

                 print(r3.text) # a page that is only visible after a successful login

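             2. The less obvious way: the first GET already returns the cookie, and the login POST authorizes it

                 r1 = requests.get(url='https://github.com/login')

                 s1 = BeautifulSoup(r1.text,'html.parser')

                 val = s1.find(attrs={'name':'authenticity_token'}).get('value')

                 # The server hands out the cookie on this first request

                 r1_cookie_dict = r1.cookies.get_dict()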

                 r2 = requests.post(

                         url= 'https://github.com/session',

                         data={

                             'commit': 'Sign in',

                             'utf8': '✓',

                             'authenticity_token': val,

                             'login':'xxxxx',

                             'password': 'xxxx',

                         },

                         cookies=r1_cookie_dict

                     )

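                 # The POST above authorized the cookie from r1; r2 itself may return none

                 r2_cookie_dict = r2.cookies.get_dict() # {}

                 # The authorized cookie carries the login state; send it along to view any URL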

                 r3 = requests.get(

                     url='xxxxxxxx',

                     cookies=r1_cookie_dict

                 )

                 print(r3.text) # a page that is only visible after a successful login

     - requests

         """

         1. method

         2. url

         3. params

         4. data

         5. json

         6. headers

         7. cookies

         8. files

         9. auth

         10. timeout

         11. allow_redirects

         12. proxies

         13. stream

         14. cert

         ================ session: keeps request state such as cookies between calls (not recommended) ===================

         import requests

         session = requests.Session()

         i1 = session.get(url="http://dig.chouti.com/help/service")

         i2 = session.post(

             url="http://dig.chouti.com/login",

             data={

                 'phone': "8615131255089",

                 'password': "xxooxxoo",

                 'oneMonth': ""

             }

         )

         i3 = session.post(

             url="http://dig.chouti.com/link/vote?linksId=8589523"

         )

         print(i3.text)

         """

     - beautifulsoup

         - find()

         - find_all()

         - get()

         - attrs

         - text

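 Contents:

     1. Example: Autohome
     2. Example: GitHub and Chouti
     3. requests and beautifulsoup
     4. Polling and long polling
     5. Django
         request.POST
         request.body
         # content-type:xxxx

 Homework: Web WeChat

       Features:
         1. Display the QR code
         2. Long polling: check_login
         3.
             - Check whether the QR code has been scanned
             - After the scan: 201, avatar: base64:.....
             - After the user taps confirm: 200, response.text     redirect_uri=....
         4. Optional: fetch the recent contacts

 Install:

     Twisted
     Scrapy framework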

Wu Sir - notes

Reference: http://www.cnblogs.com/wupeiqi/articles/6283017.html

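Crawler topics

	- Basic operations
		- Overview
			- Send HTTP requests: the requests module
			- Extract the information you want: regular expressions, the BeautifulSoup module
			- Persist the data
		- Two Python modules
			- requests
			- BeautifulSoup
		- HTTP request basics
			- Request
				- Request headers
					- cookie
				- Request body
					- the content being sent
			- Response
				- Response headers
					- read by the browser
				- Response body
					- the content the user sees
			- Special items
				- cookie
				- csrf_token
				- content-type: tells the receiving side which format to parse the body as

	- Performance
		- Processes
		- Threads
		- Coroutines
		- [Coroutines] asynchronous and non-blocking: make full use of system resources

	- The Scrapy framework
		- Learn Scrapy's rules

	- Redis and Scrapy components: build a simple distributed crawler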
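The "asynchronous, non-blocking" item can be illustrated with asyncio from the standard library. This is a sketch under stated assumptions: `fake_fetch` is a stand-in that only sleeps instead of doing real network I/O, and the page names are invented.

```python
import asyncio
import time


async def fake_fetch(page, delay=0.05):
    """Stand-in for a network request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return f"page-{page}"


async def crawl(pages):
    # All waits overlap, so the total is roughly one delay, not len(pages) delays.
    return await asyncio.gather(*(fake_fetch(p) for p in pages))


start = time.perf_counter()
results = asyncio.run(crawl(range(5)))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

A serial loop over the same five pages would wait for each delay in turn; here a single thread uses the waiting time of one request to start the others.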

Details

	- Basic operations: use Python to impersonate a browser and send requests

		pip3 install requests
		pip3 install beautifulsoup4

		import requests
		from bs4 import BeautifulSoup

		response = requests.get("http://www.baidu.com")
		response.text	# the page content

		soup = BeautifulSoup(response.text,'html.parser')
		# The first <h3 class='t'> tag, searching from the top
		soup.find(name='h3',attrs={'class':'t'})
		# All <h3> tags
		soup.find_all(name='h3')
		...

	Modules

		requests

			response = requests.get(url='<url>')
			# Fix garbled text by switching to the detected encoding
			response.encoding = response.apparent_encoding

			GET requests:
				requests.get(url='http://www.baidu.com')
				# raw request: "GET / HTTP/1.1 ...."
				requests.get(url='http://www.baidu.com?page=1')
				# raw request: "GET /?page=1 HTTP/1.1 ...."
				requests.get(url='http://www.baidu.com',params={'page':1})

			POST requests:
				requests.post(url='http://www.baidu.com',data={'name':'alex','age':18}) # default Content-Type: application/x-www-form-urlencoded
				requests.post(url='http://www.baidu.com',json={'name':'alex','age':18}) # default Content-Type: application/json
				# A POST can pass parameters in the request body and in the URL at the same time
				requests.post(url='http://www.baidu.com',params={'page':1},json={'name':'alex','age':18})

				Note:
					Django builds request.POST by converting the data in the request body,
						so if the body is not in a format it can convert, request.POST ends up empty.
					request.body in Django always holds the raw body
					request.POST in Django may be empty
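The note about request.POST and request.body can be imitated without Django. The function below is a deliberately simplified stand-in, not Django's implementation (Django also handles multipart form data, for instance); `parse_post` is a hypothetical name used only here.

```python
import json
from urllib.parse import parse_qs


def parse_post(content_type, body):
    """Mimic, very roughly, how a framework fills its form dict from the raw body."""
    if content_type == "application/x-www-form-urlencoded":
        return {key: values[-1] for key, values in parse_qs(body.decode()).items()}
    # Any other content type: the form dict stays empty, the raw body is untouched.
    return {}


form_body = b"name=alex&age=18"
json_body = json.dumps({"name": "alex", "age": 18}).encode()

form_post = parse_post("application/x-www-form-urlencoded", form_body)
json_post = parse_post("application/json", json_body)

print(form_post)              # filled from the form-encoded body
print(json_post)              # empty: a JSON body is not form data
print(json.loads(json_body))  # the raw body still has to be decoded by hand
```

This is why a view that receives JSON reads the raw body and decodes it itself instead of looking in the form dict.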

		BeautifulSoup

			soup = BeautifulSoup('<HTML string>','html.parser')
			tag = soup.find(name='div',attrs={...})
			tags = soup.find_all(name='div',attrs={...})

			tag.find('h3').text
			tag.find('h3').contents
			tag.find('h3').get('attribute name')
			tag.find('h3').attrs['attribute name']
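The calls listed above can be tried on a small inline document, with no network access. The HTML snippet and its ids, classes, and links are invented for this sketch; it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

html = """
<div id="news">
  <h3 class="t"><a href="/a/1.html">First headline</a></h3>
  <h3 class="t"><a href="/a/2.html">Second headline</a></h3>
  <p>Footer text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(name="div", attrs={"id": "news"})

first = container.find("h3")           # first match only
print(first.text)                      # text inside the tag
print(first.get("class"))              # class is multi-valued, so a list comes back
print(first.find("a").attrs)           # all attributes as a dict

links = [h3.find("a").get("href") for h3 in container.find_all("h3")]
print(links)
```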

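Over plain HTTP the server cannot send a message to the client on its own initiative,
but WebSocket can.

- [Polling]      	HTTP. The client asks the server repeatedly (say once per second); the server answers each request immediately, whether or not there is anything new.
- [Long polling] 	HTTP. The client sends a request and the server holds it open (hangs it) until a new message arrives, sends the message to every waiting client, and only then closes the connection.
				As soon as the client receives the message it sends the next request, to be held open again.
				The hang has a timeout; for Web WeChat it is 25s.
				Used by: Web WeChat
- [WebSocket]	Not HTTP; a separate protocol built on top of TCP.
				One connection stays open as a full-duplex channel, so both sides can send messages.
				Browser support was still uneven when these notes were written; wider adoption was expected.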
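The "hang until a message arrives or the timeout expires" behaviour of long polling can be simulated in-process with a queue. This is a sketch, not a web server: `long_poll` and `inbox` are names invented here, and the short timeouts stand in for the 25s mentioned above.

```python
import queue
import threading
import time


def long_poll(inbox, timeout):
    """Block until a message arrives or the timeout expires (the "hang")."""
    try:
        return inbox.get(timeout=timeout)
    except queue.Empty:
        return None  # timed out: the client would simply ask again


inbox = queue.Queue()

# No message yet: the call hangs for the full timeout and returns nothing.
empty_result = long_poll(inbox, timeout=0.1)

# A message that shows up while the client is waiting is delivered as soon as it arrives.
threading.Timer(0.05, inbox.put, args=("new message",)).start()
start = time.perf_counter()
message = long_poll(inbox, timeout=2)
waited = time.perf_counter() - start

print(empty_result, message, f"{waited:.2f}s")
```

Plain polling would correspond to calling `inbox.get_nowait()` on a timer: every call returns at once, usually with nothing.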

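Browsers enforce the same-origin policy

A cross-origin AJAX request cannot read the response unless the server explicitly allows it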

http://www.cnblogs.com/wupeiqi/articles/6283017.html

#!/usr/bin/python
# -*- coding:utf-8 -*-

import json

import requests

# requests.request() is the function that every helper below delegates to
requests.get(url='xxx')
# is essentially:
requests.request(method='get',url='xxx')

requests.post(url='xxx',data={'name':'alex','age':18}) # Content-Type: application/x-www-form-urlencoded
requests.post(url='xxx',data="name=alex&age=18")   # string body: requests does not set a Content-Type

# Neither one thing nor the other: a JSON string passed through data= goes out without a JSON Content-Type
requests.post(url='xxx',data=json.dumps({'name':'alex','age':18}))  # no Content-Type set

# Set Content-Type explicitly through the headers parameter
requests.post(url='xxx',data=json.dumps({'name':'alex','age':18}),headers={'Content-Type':'application/json'})  # Content-Type: application/json

requests.post(url='xxx',json={'name':'alex','age':18})  # Content-Type: application/json

"""

1.method

2.url

3.params

4.data

5.json

6.headers

7.cookies

8.files

9.auth

10.timeout

11.allow_redirects

12.proxies

13.stream

14.cert

=================== session: keeps request state such as cookies between calls ==================

session = requests.Session()

session.get(url='xxx')

session.post(...)

"""

"""

8. files: used for file uploads

"""

file_dict = {

    'f1': open('readme', 'rb')

}

requests.post(url='xxx',files=file_dict)

# Send a file under a custom file name

# file_dict = {

#   'f1': ('test.txt', open('readme', 'rb'))

# }

# requests.request(method='POST',

# url='http://127.0.0.1:8000/test/',

# files=file_dict)

# Send a string as the file content, under a custom file name

# file_dict = {

#   'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")

# }

# requests.request(method='POST',

# url='http://127.0.0.1:8000/test/',

# files=file_dict)
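What the files parameter produces can be inspected without sending anything, again by preparing the request only. The sketch assumes requests is installed; the field name, file name, and URL are the ones from the commented snippets above, while the file content is an arbitrary placeholder.

```python
import requests

prepared = requests.Request(
    "POST",
    "http://127.0.0.1:8000/test/",
    files={"f1": ("test.txt", b"hello from the notes")},
).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type)            # multipart/form-data plus a generated boundary
print(prepared.body.decode())  # the multipart body, including name and filename
```

The server reads the field name (`f1`) and the file name from the multipart body, not from the URL or the headers.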

"""

9. auth: HTTP Basic authentication (for example, router login pages)

"""

from requests.auth import HTTPBasicAuth,HTTPDigestAuth

requests.get('https://api.github.com/user',auth=HTTPBasicAuth('gypsying','password'))

"""

10. timeout: a single number, or a tuple (connect timeout, read timeout)

"""

requests.get('http://google.com',timeout=3)

requests.get('http://google.com',timeout=(5,1))

"""

allow_redirects

"""

"""

12. proxies: useful when your IP has been blocked

"""

proxyDict = {

    "http": "61.172.249.96:80",

    "https": "http://61.185.219.126:3128",

}

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

"""

stream

"""

from contextlib import closing

with closing(requests.get('xxx',stream=True)) as f:

    for i in f.iter_content():

        print(i)

requests.put(url='xxx')

requests.delete(url='xxx')

BeautifulSoup
	- find()
	- find_all()
	- get()
	- attrs
	- text

soup = BeautifulSoup('<HTML string>','html.parser')
soup = BeautifulSoup('<HTML string>',features='lxml')	# third-party parser, installed separately, but faster than 'html.parser'

tag = soup.find(attrs={'class':'c1'})
tag.name	# the tag's name

tag = soup.find(attrs={'class':'c1'})
# is equivalent to:
tag = soup.find(class_='c1')

print(tag.attrs)
tag.attrs['id'] = 1
del tag.attrs['class']
# attrs supports create, read, update, and delete

tag.children	# direct children
tag.descendants	# all descendants
tag.find_all()	# every tag inside, searched recursively
tag.find_all(recursive=False)	# direct child tags only, no recursion

tag.clear()	# empty the tag's contents but keep the tag itself
tag.decompose()	# destroy the tag together with everything inside it
res = tag.extract()	# like dict.pop(): removes the tag from the tree and returns it

tag = soup.find(class_='c1')	# a Tag object
tag.decode()	# Tag -> str
tag.encode()	# Tag -> bytes

tag.find('a')

# tag = soup.find('a')

# print(tag)

# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')

# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')

# print(tag)

find_all()

# tags = soup.find_all('a')

# print(tags)

# tags = soup.find_all('a',limit=1)

# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')

# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')

# print(tags)

# ####### lists #######

# v = soup.find_all(name=['a','div'])

# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])

# print(v)

# v = soup.find_all(text=['Tillie'])

# print(v, type(v[0]))

# v = soup.find_all(id=['link1','link2'])

# print(v)

# v = soup.find_all(href=['link1','link2'])

# print(v)

# ####### regular expressions #######

import re

# rep = re.compile('p')

# rep = re.compile('^p')

# v = soup.find_all(name=rep)

# print(v)

# rep = re.compile('sister.*')

# v = soup.find_all(class_=rep)

# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')

# v = soup.find_all(href=rep)

# print(v)

# ####### filtering with a function #######

# def func(tag):

#     return tag.has_attr('class') and tag.has_attr('id')

# v = soup.find_all(name=func)

# print(v)

# ## get: read a tag attribute

# tag = soup.find('a')

# v = tag.get('id')

# print(v)

from bs4.element import Tag

tag.has_attr('id')	# True if the tag has the given attribute
tag.text	# same as tag.get_text()
v = tag.index(tag.find('div'))	# position of a direct child within tag.contents

tag.string	# also returns the content, and can be assigned to replace it
tag.string = "xxxx"
tag.stripped_strings	# generator over the text pieces, each with surrounding whitespace stripped

tag.children
for item in tag.children:
	print(item,type(item))

tag = Tag(name='i',attrs={'id':'it'})
tag.string = "asasasasasasazxzxzx"
soup.find(id='xxx').append(tag)

""" See also: the copy module """
import copy
copy.deepcopy(tag)
...

tag.wrap(tag1)
tag.unwrap()

++++++++++++++++++++++++++++++++++++

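Recap:

	- Autohome news crawling example
	- Automatic login to GitHub and Chouti, plus actions after login
	- Basic use of requests and BeautifulSoup
	- Polling and long polling
	- The content-type issue in Django
		request.POST
		request.body

Exercise: Web WeChat

	1. Display the QR code
	2. Long polling with check_login(): recursive AJAX (the callback re-issues the request, so the call stack does not grow)
	3. Check whether the QR code has been scanned
		- After the scan, 201 is returned: swap in the user's avatar (base64:...)
		src="img_path"
		or
		src="base64:xxxxxxxx...."
		- Keep polling after the scan and wait for the user to tap confirm
		- After the user confirms, 200 is returned
			response.text redirect_uri=....
		- Fetch the recent contacts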
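For step 1, the uuid needed to build the QR code address can be pulled out of the jslogin response with a regular expression. The response text and the QR code URL pattern below are the ones recorded in the Web WeChat section of these notes; the regular expression itself is an assumption about how strictly that response is formatted.

```python
import re

# Response text as recorded in the Web WeChat section of these notes.
response_text = 'window.QRLogin.code = 200; window.QRLogin.uuid = "IapQqsoqcA==";'

pattern = r'window\.QRLogin\.code = (\d+); window\.QRLogin\.uuid = "([^"]+)";'
match = re.search(pattern, response_text)
code, uuid = int(match.group(1)), match.group(2)

# The notes give the QR code src as this base URL followed by the uuid.
qrcode_url = f"https://login.weixin.qq.com/qrcode/{uuid}"

print(code, uuid)
print(qrcode_url)
```

The same uuid is then passed to the long-polling login URL, which is what check_login keeps requesting.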

Install before the next class

	Twisted

	Scrapy framework


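I. Crawler basics

- Basic operations
	- Overview
		- Send HTTP requests: the requests module
		- Extract the information you want: regular expressions, the BeautifulSoup module
		- Persist the data
	- Two Python modules
		- requests
		- BeautifulSoup
	- HTTP request basics
		- Request
			- Request headers
				- cookie
			- Request body
				- the content being sent
		- Response
			- Response headers
				- read by the browser
			- Response body
				- the content the user sees
		- Special items
			- cookie
			- csrf_token
			- content-type: tells the receiving side which format to parse the body as

- Performance
	- Processes
	- Threads
	- Coroutines
	- [Coroutines] asynchronous and non-blocking: make full use of system resources

- The Scrapy framework
	- Learn Scrapy's rules

- Redis and Scrapy components: build a simple distributed crawler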

II. Example: crawling Autohome news

#!/usr/bin/python
# -*- coding:utf-8 -*-
"""
Crawl the Autohome news page
"""
import os

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.autohome.com.cn/news/')

# Set the encoding explicitly, otherwise the text comes out garbled
# print(response.apparent_encoding)
# print(response.encoding)
response.encoding = response.apparent_encoding
# print(response.encoding)
# print(type(response.text))      # <class 'str'>
# print(type(response.content))   # <class 'bytes'>

# BeautifulSoup turns each HTML tag into an object, so obj.attr style access works
soup = BeautifulSoup(response.text,'html.parser')
tag = soup.find(name='div',attrs={'id':'auto-channel-lazyload-article'})
li_list = tag.find_all('li') # [Tag, Tag, Tag, ...]

# Make sure the download directory exists before writing images into it
os.makedirs('autohome',exist_ok=True)

for li in li_list:
    h3 = li.find(name='h3')
    if not h3:
        continue
    print(h3.text)
    # Read an attribute
    print(li.find(name='a').get('href'))
    # Or: print(li.find(name='a').attrs['href'])
    print(li.find('p').text)
    # Download the image (src is protocol-relative, so prepend the scheme)
    img_url = li.find('img').get('src')
    print(img_url)
    res = requests.get('http:'+img_url)
    img_path = os.path.join('autohome',img_url.split('/')[-1])
    with open(img_path,'wb') as fw:
        fw.write(res.content)

一抹红的专属感 Macan Turbo特别版官图

//www.autohome.com.cn/news/201710/908351.html#pvareaid=102624

[汽车之家 新车官图]  日前，保时捷发布了Macan Turbo Exclusive Performance Edition的官图，作为一款特别版车...

//www3.autoimg.cn/newsdfs/g10/M0F/B2/EA/120x90_0_autohomecar__wKgH0VnqsC6AYGDFAAGFLm8dSfc007.jpg

还要怎么轻？ 路特斯Elise Cup 260官图

//www.autohome.com.cn/news/201710/908350.html#pvareaid=102624

[汽车之家 新车官图]  日前，路特斯官方宣布推出Elise Cup 260，这款车相比于已经进行进一步轻量化改造的新款Cup 250要更轻更快，全球...

//www3.autoimg.cn/newsdfs/g18/M0C/B9/7A/120x90_0_autohomecar__wKgH6FnqrhyAH3UDAAFOwoge9w4751.jpg

...
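The links and image addresses in this output start with `//`, which means they are protocol-relative. Instead of prepending `'http:'` by hand, `urllib.parse.urljoin` from the standard library can resolve them against the page address. A small sketch using the first image URL from the output above:

```python
from urllib.parse import urljoin

page_url = "http://www.autohome.com.cn/news/"
img_src = "//www3.autoimg.cn/newsdfs/g10/M0F/B2/EA/120x90_0_autohomecar__wKgH0VnqsC6AYGDFAAGFLm8dSfc007.jpg"

img_url = urljoin(page_url, img_src)    # inherits the scheme of the page
file_name = img_url.rsplit("/", 1)[-1]  # last path segment, used as the local file name

print(img_url)
print(file_name)
```

urljoin also copes with ordinary relative paths and with absolute URLs, so the same line works if the site changes how it writes its links.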

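III. Example: logging in to a website automatically

Reference: http://www.cnblogs.com/wupeiqi/articles/6283017.html

　　- Two ways a site authorizes a login

requests.get()  +  requests.post()

    - Approach 1
　　　　1. A first GET request fetches the token
　　　　2. A second POST request submits the credentials and receives the cookie
　　　　3. A third GET/POST request carries that cookie to act as the logged-in user

    - Approach 2
　　　　1. A first GET request fetches the token and a not-yet-authorized cookie
　　　　2. A second POST request carries that cookie while submitting the credentials, which authorizes it
　　　　3. A third GET/POST request carries the now-authorized cookie to act as the logged-in user

requests.Session() makes this simpler, because the session stores and resends cookies for you:

session = requests.Session()
session.get()  + session.post()

　　- Logging in to GitHub automatically and viewing the profile page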

#!/usr/bin/python

# -*- coding:utf-8 -*-

import requests

from bs4 import BeautifulSoup

"""

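The second way of carrying cookies through a login, using a GitHub account as the example:

    - The first request, to https://github.com/login, already gets a cookie back from the server

    - The second request, to https://github.com/session, submits the username and password and must carry that cookie so it can be authorized

    - The third request, to a page only visible after login (such as the profile page), must carry the authorized cookie

"""

""" 1. Get the token and the cookie """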

rsp1 = requests.get(url='https://github.com/login')

soup1 = BeautifulSoup(rsp1.text,'html.parser')

# Find the tag by its attribute value, then read its value attribute

token = soup1.find(attrs={'name':'authenticity_token'}).get('value')

# Cookies returned by the first request

rsp1_cookie_dict = rsp1.cookies.get_dict()

print(token)

print(rsp1_cookie_dict)

""" 2. Send the login POST request """

rsp2 = requests.post(

    url='https://github.com/session',

    data={

        'commit':'Sign in',

        'utf8':'✓',

        'authenticity_token':token,

        'login':'gypsying',

        'password':'xxxxxxxxx',

    },

    cookies=rsp1_cookie_dict

)

# Cookies returned by the second request

rsp2_cookie_dict = rsp2.cookies.get_dict()

print(rsp2_cookie_dict)

all_cookie_dict = {}

all_cookie_dict.update(rsp1_cookie_dict)

all_cookie_dict.update(rsp2_cookie_dict)

print(all_cookie_dict)

""" 3. Send the GET request for the profile page """

rsp3 = requests.get(

    url='https://github.com/Gypsying',

    cookies=all_cookie_dict

)

soup3 = BeautifulSoup(rsp3.text,'html.parser')

email = soup3.find(name='a',attrs={'class':'u-email'}).text

print(email)  # prints the account's email address: hitwh_Gypsy@163.com

　　- Logging in to Chouti automatically and upvoting

import requests

from bs4 import BeautifulSoup

index_url = "http://dig.chouti.com/"

rsp1 = requests.get(index_url)

soup = BeautifulSoup(rsp1.text,'html.parser')

a_list = soup.find_all(attrs={'class':'digg-a'})

id_list = []

# Collect the id of every news item on the front page

for item in a_list:

    news_id = item.find(name='i').text

    id_list.append(news_id)

# Cookie returned by the GET of the front page; at this point it is not yet authorized

index_cookie = rsp1.cookies.get_dict()

login_url = "http://dig.chouti.com/login"

data = {

    'phone':8600000000000,

    'password':'xxxxxx',

    'oneMonth':1

}

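# Submit the phone number and password, carrying the unauthorized cookie so the server authorizes it

login_ret = requests.post(url=login_url,data=data,cookies=index_cookie)

login_cookie = login_ret.cookies.get_dict()

# Parse the JSON reply; never eval() text that came from a server
login_ret = login_ret.json()

code = login_ret.get('result').get('code')

if "9999"  == code:

    print("Login succeeded")

else:

    print("Login failed")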

"""

{"result":{"code":"8887", "message":"手机号格式不对", "data":""}}

{"result":{"code":"21100", "message":"该手机号未注册", "data":""}}

{"result":{"code":"29998", "message":"手机号或密码错误", "data":{}}}

{"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_50613120077"}}}

"""

# Upvoting must carry the cookie that the login above authorized

for news_id in id_list:

    like_url = "http://dig.chouti.com/link/vote?linksId={}".format(news_id)

    like_ret = requests.post(url=like_url,cookies=index_cookie)

    print(like_ret.text)

"""

{"result":{"code":"30010", "message":"您已经推荐过了", "data":""}}

{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_50613120077","likedTime":"1509378903908000","lvCount":"8","nick":"gypsy","uvCount":"1","voteTime":"小于1分钟前"}}}

"""

IV. Simulating Web WeChat operations

"""
Web WeChat login example

GET          https://login.wx.qq.com/jslogin?appid=wx782c26e4c19acffb&redirect_uri=https%3A%2F%2Fwx.qq.com%2Fcgi-bin%2Fmmwebwx-bin%2Fwebwxnewloginpage&fun=new&lang=zh_CN&_=1508052025433

Response:    window.QRLogin.code = 200; window.QRLogin.uuid = "IapQqsoqcA==";

QR code src: https://login.weixin.qq.com/qrcode/IapQqsoqcA==

Long poll:   https://login.wx.qq.com/cgi-bin/mmwebwx-bin/login?loginicon=true&uuid=IapQqsoqcA==&tip=0&r=-518626217&_=1508052025438
"""
