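Scraping Pexels Images with Python

While learning Python web scraping, I found many examples online for scraping Pexels images, but none of the ones I downloaded ran successfully; each had one problem or another.

As a beginner, I still learned a lot from those examples, and I gradually fixed up the code with the help of a search engine and ChatGPT.

It finally runs successfully, so I am recording it here.

Environment: Windows 10, Python 3.10, Google Chrome 111.0.5563.148 (official build).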

import html
import os
import re
import urllib.error
import urllib.parse
import urllib.request

import requests

path = r"C:\Users\xiaochao\pexels"
# range(1, 2) yields only page 1; raise the upper bound to fetch more result pages.
url_lists = ['https://www.pexels.com/search/book/?page={}'.format(i) for i in range(1, 2)]
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36",
    "Referer": "https://www.pexels.com/",
    "Accept-Language": "en-US,en;q=0.9",
}

# Capture the image URL in each "Download" link, up to (not including) the "?cs=" query string.
pattern = re.compile(r'"Download" href="(.*?)\?cs=', re.S)

# Create the output directory once, before downloading anything.
os.makedirs(path, exist_ok=True)

for url in url_lists:
    print(url)
    req = urllib.request.Request(url, headers=headers)
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print("HTTPError occurred: {}".format(e))
        continue

    html_content = resp.read().decode()
    matches = pattern.findall(html_content)
    print(matches)

    for match in matches:
        match_cleaned = match.split('?')[0]  # Drop any query string left on the image URL.
        print(match_cleaned)  # Print the image URL without its query string.
        match_cleaned = html.unescape(match_cleaned)  # Decode HTML entities to restore a normal URL.
        match_cleaned = urllib.parse.unquote(match_cleaned)  # Decode percent-encoded characters in the URL.
        match_cleaned = urllib.parse.urljoin(url, match_cleaned)  # Convert a relative URL to an absolute one.

        # Name the file after the last path segment of the URL.
        filename = match_cleaned.split("/")[-1]
        with open(os.path.join(path, filename), "wb") as f:
            f.write(requests.get(match_cleaned, headers=headers, timeout=30).content)
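
The URL-cleaning steps in the loop above can be tried offline, without touching the network. The following is only a minimal sketch: the function name extract_image_urls and the sample HTML string are made up for illustration, and the sample assumes that a download link looks like "Download" href="<image-url>?cs=...", which may not match what Pexels actually serves today.

```python
import html
import re
import urllib.parse

# Assumed link shape: "Download" href="<image-url>?cs=..."
DOWNLOAD_LINK = re.compile(r'"Download" href="(.*?)\?cs=', re.S)


def extract_image_urls(page_html, base_url):
    """Return cleaned, absolute image URLs found in page_html (hypothetical helper)."""
    urls = []
    for raw in DOWNLOAD_LINK.findall(page_html):
        cleaned = raw.split("?")[0]              # drop any leftover query string
        cleaned = html.unescape(cleaned)         # decode HTML entities such as &amp;
        cleaned = urllib.parse.unquote(cleaned)  # decode percent-encoded characters
        urls.append(urllib.parse.urljoin(base_url, cleaned))  # make the URL absolute
    return urls


# Made-up sample markup, not real Pexels output.
sample = ('<a title="Download" '
          'href="https://images.pexels.com/photos/1/pexels-photo-1.jpeg?cs=srgb&amp;dl=a.jpg">')
print(extract_image_urls(sample, "https://www.pexels.com/search/book/"))
```

Keeping the extraction in its own function makes it easy to check the regular expression against a saved copy of the page when the site's markup changes.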
