Python Download Image (python + requests + BeautifulSoup)

Environment setup

1 Python + requests + BeautifulSoup

Page preparation

Main page:

http://www.netbian.com/dongman/

Pseudo image address (the detail page, not the image file itself):

http://www.netbian.com/desk/22371.htm

Actual image address:

http://img.netbian.com/file/2019/1221/36eb674ba0633d185da078804a3638e6.jpg

Steps

1 Import the libraries

import requests

from bs4 import BeautifulSoup

import re

2 Change the request header (User-Agent)

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"

# "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",

# "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",

# "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",

# "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",

# "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",

# "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
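The commented-out strings above can serve as a pool to rotate through, so that not every request carries the same User-Agent. A minimal sketch, assuming a hypothetical helper `random_headers` and a list name `USER_AGENTS` that are not part of the original script:

```python
import random

# Pool of User-Agent strings taken from the list above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]


def random_headers():
    """Return a headers dict with a randomly chosen User-Agent (hypothetical helper)."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

It could then be passed to each request as `requests.get(url, headers=random_headers())`.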

3 Fetch the content of the main page

url = 'http://www.netbian.com/dongman/'    # the main page URL given above

response = requests.get(url, headers={'User-Agent': ua})

html = response.text

soup = BeautifulSoup(html, 'html.parser')

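4 What we want is the href of the a tag inside each li of the list inside main, not the src of the img tag inside that a tag. The address in the img tag only gives an 800*450 image.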

# Named list_div rather than list, to avoid shadowing the built-in list type

list_div = soup.find(name='div', class_='list')

for li in list_div.find_all('li'):

    # print(img.attrs['src'])

    for a in li.children:

        if a.name == 'a':

            src = 'http://www.netbian.com' + a.attrs['href']
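String concatenation works as long as every href starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` also copes with hrefs that are already absolute. `BASE_URL` and `absolute_url` are names introduced here for illustration, and the assumption is that hrefs look like `/desk/22371.htm`, as in the page-preparation section.

```python
from urllib.parse import urljoin

# Listing page that the hrefs are resolved against.
BASE_URL = "http://www.netbian.com/dongman/"


def absolute_url(href):
    """Resolve an href found on the listing page into a full URL."""
    return urljoin(BASE_URL, href)


print(absolute_url("/desk/22371.htm"))
# prints http://www.netbian.com/desk/22371.htm
```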

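5 Extract the digits from the link to use as the image name (name the files however you like)

n = re.search(r'\d+', a.attrs['href'])[0]    # \d+ rather than \d{5}: if an ID had only 4 digits, \d{5} would not match and indexing the result would raise an error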
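Note that `re.search` returns `None` when the href contains no digits at all, and indexing `None` raises a `TypeError`. A small sketch that guards against this case, using a hypothetical helper `image_id` with an assumed fallback name of `"unknown"`:

```python
import re


def image_id(href, default="unknown"):
    """Return the first run of digits in href, or default when there is none."""
    match = re.search(r"\d+", href)
    return match[0] if match else default


print(image_id("/desk/22371.htm"))
# prints 22371
```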

6 Get to the actual image address

res = requests.get(src, headers={'User-Agent': ua})

s = BeautifulSoup(res.text, 'html.parser')

p = s.find(name='p')

# print(p)

img = p.img.attrs['src']

# print(img)

# Skip this item if the image address is empty

if not img:

    continue
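The lookup above indexes `p.img` directly, which raises an `AttributeError` if the detail page has no `p` tag or the `p` contains no `img`. A defensive sketch of the same lookup, run against made-up sample markup (the real page layout may differ); `SAMPLE_HTML` and `find_image_url` are hypothetical names:

```python
from bs4 import BeautifulSoup

# Made-up detail-page markup for illustration only.
SAMPLE_HTML = """
<div>
  <p><img src="http://img.netbian.com/file/2019/1221/36eb674ba0633d185da078804a3638e6.jpg"></p>
</div>
"""


def find_image_url(html):
    """Return the src of the first <img> inside the first <p>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find("p")
    if p is None or p.img is None:
        return None
    # .get returns None for a missing attribute; an empty src is also treated as missing
    return p.img.get("src") or None
```

A caller inside the loop could then write `if img is None: continue` instead of relying on the attribute access succeeding.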

7 Download

with requests.get(img, headers={'User-Agent': ua}) as resp:

    # print(resp.status_code)

    resp.raise_for_status()

    # Write the raw image bytes to disk (the E:/paper folder must already exist).
    # No text encoding is needed because resp.content is binary.

    with open('E:/paper/{}.jpg'.format(n), 'wb') as f:

        f.write(resp.content)
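`open` fails with `FileNotFoundError` when the target folder does not exist. A sketch that creates the folder first, using `pathlib` from the standard library; `save_image` is a hypothetical helper, and the folder and file name are whatever the caller passes in:

```python
from pathlib import Path


def save_image(content, directory, name):
    """Write image bytes to <directory>/<name>.jpg, creating the directory if needed."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "{}.jpg".format(name)
    target.write_bytes(content)
    return target
```

With the variables from the steps above, the call would look like `save_image(resp.content, 'E:/paper', n)`.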

8 To download all the images

# Loop over the listing pages

for i in range(1, 139):

    if i == 1:

        url = 'http://www.netbian.com/dongman/index.htm'

    else:

        url = 'http://www.netbian.com/dongman/index_{}.htm'.format(i)

    # print(url)
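The loop above only builds `url`; the code from steps 3 to 7 would run inside it, once per page. Note that `range(1, 139)` covers pages 1 through 138, which was presumably the page count when the post was written and may have changed since. A sketch that wraps the URL construction in a generator, where `page_urls` is a hypothetical name:

```python
def page_urls(last_page):
    """Yield listing-page URLs from page 1 up to and including last_page."""
    base = "http://www.netbian.com/dongman/"
    for i in range(1, last_page + 1):
        if i == 1:
            # The first page has no numeric suffix.
            yield base + "index.htm"
        else:
            yield base + "index_{}.htm".format(i)


for url in page_urls(3):
    print(url)
```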

9 Result

Note:

If you know XPath, using XPath is a bit simpler than BeautifulSoup. I just couldn't be bothered to convert the code.
