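I. Common reasons a crawl fails

1. The crawler sent too many requests to the site in a short period

- Fix: use a proxy

2. The HTTP connection pool has been exhausted

- Fix: close each connection as soon as the request completes by sending the request header Connection: close

3. High-resolution images:

- Image lazy loading: the img tag keeps the real image URL in a pseudo-attribute (a non-standard attribute used in place of src), so read that attribute instead of src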
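The lazy-loading case can be sketched with the standard library alone. The snippet below is a minimal illustration, not the method used in these notes: the attribute names `src2`, `data-src`, and `data-original` are assumptions (each site picks its own pseudo-attribute, so check the page source), and `LazyImageParser` and `extract_image_urls` are hypothetical helper names.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect image URLs, preferring a lazy-loading pseudo-attribute over src."""

    def __init__(self, lazy_attrs=("src2", "data-src", "data-original")):
        super().__init__()
        # Assumed attribute names; replace them with the one the target site uses.
        self.lazy_attrs = tuple(lazy_attrs)
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Try the pseudo-attributes first and fall back to the regular src.
        for name in self.lazy_attrs + ("src",):
            if attributes.get(name):
                self.image_urls.append(attributes[name])
                break


def extract_image_urls(html):
    parser = LazyImageParser()
    parser.feed(html)
    return parser.image_urls


sample = '<img src2="/real/1.jpg" src="/placeholder.gif"><img src="/plain/2.jpg">'
print(extract_image_urls(sample))
```

The same idea carries over to an XPath query: select the pseudo-attribute (for example `@src2`) rather than `@src`.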

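II. Proxies

Proxy server: forwards requests on your behalf, which changes the IP address the request appears to come from.
In requests, the request IP is changed through the proxies argument.
Proxy anonymity levels:
- Transparent: the server knows you are using a proxy and knows your real IP
- Anonymous: the server knows you are using a proxy but does not know your real IP
- Elite (high anonymity): the server does not know you are using a proxy, let alone your real IP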
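One way to make the three levels concrete is to look at the headers the target server receives. The sketch below rests on a common heuristic that is an assumption here, not something these notes state: transparent proxies tend to leak the real IP in `X-Forwarded-For`, anonymous proxies reveal themselves (for example through `Via`) without leaking the IP, and elite proxies add neither. `classify_anonymity` is a hypothetical helper, and real proxies do not always follow this pattern.

```python
def classify_anonymity(server_seen_headers, real_ip):
    """Guess the anonymity level from the headers the target server received."""
    # Header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in server_seen_headers.items()}
    forwarded_for = headers.get("x-forwarded-for", "")
    forwarded_ips = [part.strip() for part in forwarded_for.split(",")]

    if real_ip in forwarded_ips:
        return "transparent"  # proxy use is visible and the real IP is leaked
    if "via" in headers or forwarded_for:
        return "anonymous"    # proxy use is visible, the real IP is not
    return "elite"            # nothing in the headers points to a proxy


print(classify_anonymity({"Via": "1.1 proxy", "X-Forwarded-For": "203.0.113.7"}, "203.0.113.7"))
print(classify_anonymity({"Via": "1.1 proxy"}, "203.0.113.7"))
print(classify_anonymity({"User-Agent": "Mozilla/5.0"}, "203.0.113.7"))
```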
Proxy types:
- http: forwards only HTTP requests
- https: forwards only HTTPS requests
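Because the proxy type has to match the scheme of the URL being requested, it helps to build the `proxies` mapping from the type explicitly. This is a minimal sketch: `make_proxies` is a hypothetical helper, the addresses come from the documentation range 192.0.2.0/24, and the `ip:port` value format simply mirrors the format used in these notes.

```python
def make_proxies(ip, port, proxy_type):
    """Build a requests-style proxies mapping keyed by the target URL scheme."""
    scheme = proxy_type.strip().lower()
    if scheme not in ("http", "https"):
        raise ValueError("unsupported proxy type: {!r}".format(proxy_type))
    # requests selects the proxy whose key matches the scheme of the requested URL.
    return {scheme: "{}:{}".format(ip, port)}


print(make_proxies("192.0.2.10", 8080, "HTTPS"))
```

A mapping built this way can be passed as `proxies=` to `requests.get`; a request to an `https://` URL only goes through the proxy if the mapping has an `https` key.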

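Sites that list free proxy IPs
- 快代理 (Kuaidaili)
- 西祠代理 (Xici Daili)
- goubanjia
- 代理精灵 (recommended): http://http.zhiliandaili.cn/
What can you do when the crawler's IP gets banned?
- Use a proxy
- Build a proxy pool
- Use a dial-up server (redialing obtains a new IP)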
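The proxy-pool option can be sketched as a small rotation helper: pick a proxy at random, and drop it from the pool when a request through it fails. Everything here is illustrative: `ProxyPool` and `fetch_with_rotation` are hypothetical names, and the `fetch` callable is injected so the sketch runs without network access.

```python
import random


class ProxyPool:
    """A list-backed pool that hands out random proxies and forgets bad ones."""

    def __init__(self, proxies, rng=None):
        self._proxies = list(proxies)
        self._rng = rng or random.Random()

    def __len__(self):
        return len(self._proxies)

    def pick(self):
        if not self._proxies:
            raise LookupError("proxy pool is empty")
        return self._rng.choice(self._proxies)

    def discard(self, proxy):
        if proxy in self._proxies:
            self._proxies.remove(proxy)


def fetch_with_rotation(url, pool, fetch, max_attempts=3):
    """Call fetch(url, proxy), switching to another proxy after each failure."""
    last_error = None
    for _ in range(max_attempts):
        proxy = pool.pick()
        try:
            return fetch(url, proxy)
        except Exception as exc:  # a real crawler would catch requests exceptions
            pool.discard(proxy)   # treat the proxy as banned or dead
            last_error = exc
    raise RuntimeError("all proxy attempts failed") from last_error


def fake_fetch(url, proxy):
    # Stand-in for a real request: the first proxy is pretended to be banned.
    if proxy["https"].startswith("192.0.2.1:"):
        raise ConnectionError("banned")
    return "ok via " + proxy["https"]


pool = ProxyPool([{"https": "192.0.2.1:8080"}, {"https": "192.0.2.2:8080"}])
print(fetch_with_rotation("https://example.com", pool, fake_fetch))
```

In a real crawler, `fetch` could wrap `requests.get(url, headers=headers, proxies=proxy, timeout=5).text`; the timeout value is an arbitrary choice.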

import requests

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
url = 'https://www.baidu.com/s?wd=ip'

# proxies format: {'http' or 'https': 'ip:port'}; the key must match the scheme of the target URL
page_text = requests.get(url=url, headers=headers, proxies={'https': '1.197.203.187:9999'}).text

# Save the result page so the IP reported by the site can be checked
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

# Build an IP pool from the 代理精灵 API
from lxml import etree

all_ips = []  # the proxy pool, stored as a list of proxies dicts
proxy_url = 'http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=52&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    ip = ip.strip()
    if not ip:
        continue  # skip whitespace-only text nodes
    all_ips.append({'https': ip})

import random

# Crawl the free proxy IPs listed on 西祠代理
url = 'https://www.xicidaili.com/nn/%d'
free_proxies = []
for page in range(1, 30):
    new_url = url % page
    # Send each page request through a randomly chosen proxy from the pool
    page_text = requests.get(new_url, headers=headers, proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the XPath expression; [1:] skips the header row
    tr_list = tree.xpath('//*[@id="ip_list"]//tr')[1:]
    for tr in tr_list:
        ip = tr.xpath('./td[2]/text()')[0]
        port = tr.xpath('./td[3]/text()')[0]
        t_type = tr.xpath('./td[7]/text()')[0]
        dic = {
            'ip': ip,
            'port': port,
            'type': t_type
        }
        free_proxies.append(dic)
    print('Page {} crawled'.format(page))
print(len(free_proxies))

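III. Cookie

Purpose: stores client-side state.
Example: crawling news data from Xueqiu (雪球网): https://xueqiu.com/

How do you send cookies with a request, and how do you handle cookie-based anti-crawling measures?
- Manual handling
  - Capture the cookie in a packet-capture tool and put it in headers
  - When to use: the cookie has no expiry and does not change dynamically
- Automatic handling
  - Use the session mechanism
  - When to use: cookies that change dynamically
  - Session object: used almost exactly like the requests module. If a request sent through the session produces cookies, they are stored in the session automatically and sent with later requests made through it.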
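For the manual route, the value copied from a packet-capture tool is one long `Cookie` header string. It can go into `headers` unchanged, or be split into a dict first. The sketch below shows the split; `cookie_header_to_dict` is a hypothetical helper and the cookie names are made up for illustration.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a captured Cookie header value into a name -> value dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        # partition keeps any '=' inside the value intact
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


captured = "token=abc123; uid=42; theme=dark"  # hypothetical captured value
print(cookie_header_to_dict(captured))
```

The resulting dict can be passed as the `cookies=` argument of `requests.get`. This only fits the manual case described above; a cookie that changes dynamically still calls for a session.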

# Create a session object
session = requests.Session()
main_url = 'https://xueqiu.com'  # assumption: requesting this URL makes the server set cookies
session.get(main_url, headers=headers)

url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json'
params = {
    'since_id': '-1',
    'max_id': '20346152',
    'count': '15',
    'category': '-1',
}
# The session sends the stored cookies with this request automatically
page_text = session.get(url, headers=headers, params=params).json()
print(page_text)

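IV. Captcha recognition

Online captcha-solving platforms
- 打码兔 (Damatu)
- 云打码 (Yundama)
- 超级鹰 (Chaojiying): http://www.chaojiying.com/about.html
  - 1. Register and log in (complete identity verification in the user center)
  - 2. After logging in:
    - Create a software entry: Software ID -> generate a software ID
    - Download the sample code: Developer Docs -> python -> Download

Walkthrough of the platform's sample code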

#!/usr/bin/env python
# coding:utf-8
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        # The platform expects the MD5 hex digest of the password
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the captcha that was recognized incorrectly
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

# Replace the account, password, and software ID (User Center >> Software ID) with your own
chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
# Path of the local captcha image; replace a.jpg with your own file
with open('a.jpg', 'rb') as fp:
    im = fp.read()
# 1004 is the captcha type code; see the price list on the official site
print(chaojiying.PostPic(im, 1004)['pic_str'])

Application: recognize the captcha image on the Gushiwen (古诗文网) login page: https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx

def getCodeImgText(imgPath, img_type):
    # Replace the account, password, and software ID with your own
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    with open(imgPath, 'rb') as fp:
        im = fp.read()
    return chaojiying.PostPic(im, img_type)['pic_str']

url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# The src attribute is a relative path, so prepend the site root
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_code_data = requests.get(img_src, headers=headers).content
# Save the captcha image locally, then send it to the platform for recognition
with open('./gushiwen.jpg', 'wb') as fp:
    fp.write(img_code_data)
img_text = getCodeImgText('./gushiwen.jpg', 1004)
print(img_text)

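V. Why does a crawler need to simulate login?

Some data is displayed only after logging in.

Anti-crawling measures involved:
- Captcha
- Dynamic request parameters: the parameter values change on every request
  - Dynamic capture: dynamic request parameters are usually hidden in the source of the front-end page
- Cookie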
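Dynamic request parameters of this kind usually sit in hidden input fields of the login form, so they can be collected in one pass before posting. The following is a minimal standard-library sketch; `HiddenInputParser` and `extract_hidden_fields` are hypothetical names, and the sample HTML and its values are made up.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


sample = (
    '<form>'
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />'
    '<input type="text" name="email">'
    '<input type="hidden" name="__VIEWSTATEGENERATOR" value="C93BE1AE">'
    '</form>'
)
print(extract_hidden_fields(sample))
```

The returned dict can be merged into the login form data, so any hidden parameter the site adds later is picked up without changing the crawler.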

def getCodeImgText(imgPath, img_type):
    # Replace the account, password, and software ID with your own
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    with open(imgPath, 'rb') as fp:
        im = fp.read()
    return chaojiying.PostPic(im, img_type)['pic_str']

# Use a session so cookies are captured and reused by every request below
s = requests.Session()
first_url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
# Fetch the login page through the session, so the captcha and the hidden
# form fields belong to the same cookie as the login request
page_text = s.get(first_url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_code_data = s.get(img_src, headers=headers).content
with open('./gushiwen.jpg', 'wb') as fp:
    fp.write(img_code_data)
img_text = getCodeImgText('./gushiwen.jpg', 1004)
print(img_text)

# Capture the dynamic request parameters from the page source
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# URL requested when the login button is clicked, captured with a packet-capture tool
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': img_text,
    'denglu': '登录',
}
main_page_text = s.post(login_url, headers=headers, data=data).text
# Save the page returned after login to check whether the login succeeded
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)

VI. Asynchronous crawling with a thread pool

url = 'https://www.qiushibaike.com/text/page/%d/'
urls = []
for page in range(1, 11):
    new_url = url % page
    urls.append(new_url)

def get_request(url):  # must take exactly one parameter: pool.map passes one list element per call
    return requests.get(url, headers=headers).text

from multiprocessing.dummy import Pool  # thread-based pool
pool = Pool(10)
# Apply get_request to every element of urls concurrently; results come back in input order
response_text_list = pool.map(get_request, urls)
print(response_text_list)
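The same pattern can be tried without touching the network by swapping the download for a stand-in that just sleeps. In this sketch `fake_download` and `crawl_all` are hypothetical names and the URLs are placeholders; the pool is opened in a `with` block so its threads are cleaned up when the work is done.

```python
import time
from multiprocessing.dummy import Pool  # thread pool behind the multiprocessing API


def fake_download(url):
    # Stand-in for requests.get(url).text: blocks briefly like a slow response.
    time.sleep(0.1)
    return "body of " + url


def crawl_all(urls, worker, pool_size=4):
    """Run worker on every URL in a thread pool and return results in input order."""
    with Pool(pool_size) as pool:
        return pool.map(worker, urls)


urls = ["https://example.com/page/%d/" % n for n in range(1, 5)]
results = crawl_all(urls, fake_download)
for body in results:
    print(body)
```

Since the time is spent waiting on I/O rather than computing, threads are enough here; `pool.map` blocks until every URL has been processed.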

