A crawler runs into HTTP Error 403

# coding=gbk
import urllib.request

import requests
from bs4 import BeautifulSoup

x = 1  # running index for saved images
y = 1  # running index for saved page dumps

def crawl(url):
    global x, y
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Dump the parsed page to disk so it can be inspected later.
    with open(f'F:/pachong/xnt/{y}.txt', 'w', encoding="utf-8") as f:
        f.write(str(soup))
        y += 1
    # Select every <img> tag on the page.
    yinhuns = soup.select('img')
    print(yinhuns)
    for yh in yinhuns:
        print(yh)
        link = yh.get('src')
        print(link)
        urllib.request.urlretrieve(link, f'F:/pachong/xnt/{x}.jpg')
        print(f'Downloading image {x}')
        x += 1

for i in range(1, 5):
    url = "https://acg.fi/hentai/23643.htm/" + str(i)
    try:
        crawl(url)
    except ValueError:
        continue
    except Exception as e:
        print(e)

Running the program produces the following output:

<img alt="A区(ACG.Fi)" class="logo" src="https://acg.fi/logo.png"/>
https://acg.fi/logo.png
HTTP Error 403: Forbidden

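There are three problems:
- The search over src values did not find all of the image URLs I was looking for.
- The first URL returned triggered a 403 error (access forbidden).
- soup.select did not return the list I expected.
Thoughts:
- The target URLs may contain Chinese characters and therefore fail to be encoded.
- What if I filter the text of the response with a regular expression instead?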
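
To make the first thought concrete, here is a minimal sketch of percent-encoding the non-ASCII part of a URL with the standard library. The helper name `encode_url` and the example.com URL are hypothetical, chosen only for illustration; the sketch handles the path and query but not a non-ASCII host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL."""
    parts = urlsplit(url)
    # "%" is kept safe so an already-encoded URL is not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

# Hypothetical URL, for illustration only.
encoded = encode_url("https://example.com/apic/图片/01.jpg")
print(encoded)  # https://example.com/apic/%E5%9B%BE%E7%89%87/01.jpg
```

Splitting the URL first keeps the scheme and host untouched, so only the parts that can legally carry percent-escapes are rewritten.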

# coding=gbk
import urllib.request

import requests
from bs4 import BeautifulSoup

x = 1  # running index for saved images

def crawl(url, header):
    global x
    res = requests.get(url, headers=header)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Only look at images inside the main content container (at most 4 per page).
    yinhuns = soup.find('div', attrs={'id': "content-innerText"}).find_all('img', limit=4)
    print(yinhuns)
    for yh in yinhuns:
        link = yh.get('src')
        print(x)
        urllib.request.urlretrieve(link, 'F:/pachong/xnt/{}.jpg'.format(x))
        print('Downloading image {0}'.format(x))
        x += 1

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}

for i in range(1, 5):
    url = "https://acg.fi/hentai/23643.htm/" + str(i)
    try:
        crawl(url, header)
    except ValueError:
        continue
    except Exception as e:
        print(e)

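This version uses find() and find_all(), but it still did not solve the list problem.
I then re-encoded the Chinese part of the URL with urllib.parse.quote, but urllib.request.urlretrieve still raised an error.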
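
One possible reason urlretrieve keeps failing: headers passed to requests.get are not shared with urllib, so urllib.request.urlretrieve still sends urllib's default Python-urllib User-Agent. Whether this particular site rejects that header is an assumption. The sketch below installs a global opener so that urlopen and urlretrieve send a browser-like User-Agent; the helper name `install_user_agent` and the User-Agent value are hypothetical.

```python
import urllib.request

# Hypothetical User-Agent value, for illustration only.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def install_user_agent(user_agent):
    """Install a global opener so urlopen/urlretrieve send this User-Agent."""
    opener = urllib.request.build_opener()
    # addheaders replaces urllib's default "Python-urllib/x.y" header.
    opener.addheaders = [("User-Agent", user_agent)]
    urllib.request.install_opener(opener)
    return opener

opener = install_user_agent(USER_AGENT)
# From here on, urllib.request.urlretrieve(link, path) sends the header above.
```

The opener is process-wide, so it only needs to be installed once before the download loop starts.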
After another revision:

# coding=gbk
import re
import urllib.parse
from io import BytesIO

import requests
from PIL import Image

x = 1  # running index for saved images

# Image URLs on the page: start with "http", contain "apic", end in "jpg".
IMG_PATTERN = re.compile('http.*?apic.*?jpg')
# Runs of Chinese characters.
CHINESE_PATTERN = re.compile('[\u4e00-\u9fa5]+')

# Collect the image source URLs from a page.
def crawl(url, header):
    res = requests.get(url, headers=header)
    text = res.text
    # Close the connection after use, to reduce the chance of being blocked.
    res.close()
    return IMG_PATTERN.findall(text)

# Download the images from the re-encoded URLs.
def down(outs, folder_path):
    global x
    for out in outs:
        # Fetch the image bytes from the re-encoded URL.
        res = requests.get(out)
        data = res.content
        # Close the connection after use, to reduce the chance of being blocked.
        res.close()
        img = Image.open(BytesIO(data))
        print(f'Downloading image {x}')
        img.save(folder_path + f"{x}.jpg")
        x += 1

# Re-encode the collected image URLs.
def bianma(results):
    outs = []
    for s in results:
        # Percent-encode every run of Chinese characters;
        # URLs without any Chinese characters pass through unchanged.
        out = CHINESE_PATTERN.sub(lambda m: urllib.parse.quote(m.group(0)), s)
        outs.append(out)
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(outs))

def main():
    try:
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
        }
        folder_path = 'F:/pachong/xnt/'
        for i in range(1, 5):
            url = "https://acg.fi/hentai/23643.htm/" + str(i)
            results = crawl(url, header)
            outs = bianma(results)
            down(outs, folder_path)
    except Exception as e:
        print(e)

if __name__ == '__main__':
    main()
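
The pattern `http.*?apic.*?jpg` used in crawl() lets `.` match quotes and angle brackets, so a match can start in one tag and end in another. As a hedged alternative, the sketch below restricts the match to characters that can appear inside a single attribute value. The sample HTML, the example.com host, and the helper name `extract_image_urls` are hypothetical.

```python
import re

# Hypothetical page fragment; host and paths are made up for illustration.
SAMPLE_HTML = '''
<img class="logo" src="https://example.com/logo.png"/>
<img src="https://example.com/apic/2019/01.jpg"/>
<img src="https://example.com/apic/2019/02.jpg"/>
<img src="https://example.com/apic/2019/01.jpg"/>
'''

# URLs containing "apic" and ending in ".jpg", never crossing quotes,
# whitespace, or angle brackets.
IMG_URL = re.compile(r'https?://[^\s"\'<>]*apic[^\s"\'<>]*?\.jpg')

def extract_image_urls(html):
    """Return matching image URLs in page order, without duplicates."""
    return list(dict.fromkeys(IMG_URL.findall(html)))

urls = extract_image_urls(SAMPLE_HTML)
print(urls)
```

On the sample fragment the logo is skipped and the repeated image is returned only once.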

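For image URLs whose path contains Chinese characters, fetching the bytes with requests and saving them through BytesIO and PIL works; this was confirmed in testing.
Several runs hit "[Errno 10054] An existing connection was forcibly closed by the remote host"; calling close() on the response after requests.get() can help with this.
The program now runs without errors but is somewhat slow; multithreading is worth trying later.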
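
As a rough sketch of the multithreading idea, the downloads can be handed to a thread pool from the standard library. The names `download_all` and `fake_fetch` are hypothetical; `fake_fetch` stands in for a real function that would fetch one URL and save it to disk. Passing the index explicitly avoids sharing a global counter between threads.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=4):
    """Run fetch(index, url) for every URL on a thread pool.

    Results are returned in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, i, url) for i, url in enumerate(urls, start=1)]
        return [future.result() for future in futures]

def fake_fetch(index, url):
    # Stand-in for the real download; it only reports what it would save.
    return f"{index}.jpg <- {url}"

results = download_all(["u1", "u2", "u3"], fake_fetch)
print(results)
```

Any exception raised inside a worker is re-raised by `future.result()`, so failures are not silently lost.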
