Learning Python: Implementing a Simple Crawler

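To speed up learning Python 3.x, I wrote this script after reading through a lot of material. It crawls the images returned by a Baidu Images search for '东方幻想乡' (Touhou Gensokyo), though it still has quite a few problems.

Here is the code: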

# Updated the code

import re
from urllib import request

class CrawlImg:     # a class that crawls images

    def __init__(self):     # constructor
        print('Link start!')

    def __GetHtml(self, html):
        post = request.urlopen(html)
        page = post.read()
        # Decode the bytes to str; matching a str pattern against bytes raises
        # TypeError: cannot use a string pattern on a bytes-like object
        return page.decode('utf-8')

    def __GetImg(self, html):
        page = self.__GetHtml(html)     # fetch the HTML page
        # New, shorter regex: an http(s) URL ending in .jpg
        # (the dot is escaped so that it matches a literal '.')
        recomp = re.compile(r'https?://[^\s"\'<>]+?\.jpg')
        imgUrlList = recomp.findall(page)   # match against the HTML page
        return imgUrlList   # list of matched jpg URLs

    def run(self, html):
        imgUrlList = self.__GetImg(html)
        # "with" closes the file even if a download fails
        with open('C:\\Users\\adimin\\Desktop\\CrawlImg\\imgUrl.txt', 'w') as fp:
            # start=1 names the files 1.jpg, 2.jpg, ...
            for ImgName, imgUrl in enumerate(imgUrlList, start=1):
                request.urlretrieve(imgUrl, 'C:\\Users\\adimin\\Desktop\\CrawlImg\\{}.jpg'.format(ImgName))
                print('Download:', imgUrl)
                fp.write(imgUrl + '\n')     # text mode already translates '\n' on Windows

    def __del__(self):      # destructor
        print("Download finished!")

def main():
    url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%B6%AB%B7%BD%BB%C3%CF%EB%CF%E7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
    GetImg = CrawlImg()
    GetImg.run(url)

if __name__ == '__main__':
    main()
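
The TypeError quoted in the comment on the decode step is easy to reproduce. The following is a minimal sketch, assuming a made-up byte string in place of the real response (the `raw` value and the example.com URL are hypothetical):

```python
import re

# Hypothetical stand-in for the bytes returned by urlopen(...).read()
raw = b'<img src="http://example.com/a.jpg">'
pattern = re.compile(r'https?://[^"]+?\.jpg')   # a str pattern

error_message = None
try:
    pattern.findall(raw)            # str pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)        # the error named in the code comment

text = raw.decode('utf-8')          # bytes -> str, as __GetHtml does
urls = pattern.findall(text)

print(error_message)
print(urls)
```

Decoding once, right after reading the response, keeps every later regex call working on str.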

I consulted many blog posts and other resources, mainly:

http://blog.csdn.net/clj198606061111/article/details/50816115

https://www.cnblogs.com/speeding/p/5097790.html

http://urllib3.readthedocs.io/en/latest/

https://pyopenssl.org/en/stable/

https://docs.python.org/3.6/library/urllib.html

https://segmentfault.com/q/1010000004442233/a-1020000004448440

http://urllib3.readthedocs.io/en/latest/user-guide.html

Runoob (菜鸟教程) Python 3 tutorial

and a few others that I can no longer remember...

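I learned a lot from this exercise: I became reasonably familiar with the basic syntax of Python 3 and learned how to write regular expressions, so I wrote the program in an object-oriented style.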
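
As a small illustration of the regular-expression matching mentioned here, the sketch below pulls .jpg URLs out of an HTML fragment. The fragment and the pattern are assumptions made for the example; the markup of the real Baidu page will differ.

```python
import re

# Hypothetical page fragment, not real Baidu markup
sample_page = '''
<img src="http://example.com/pics/1.jpg">
<img src="https://example.com/pics/2.jpg" alt="second">
<a href="http://example.com/about.html">about</a>
'''

# https?://  the match must start at a URL scheme
# [^\s"'<>]  keeps the match inside a single attribute value
# \.jpg      the escaped dot matches a literal '.'
jpg_pattern = re.compile(r'https?://[^\s"\'<>]+?\.jpg')

img_urls = jpg_pattern.findall(sample_page)
print(img_urls)
```

Only the two image links are returned; the .html link is skipped because it never reaches a literal ".jpg".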

As the code shows, it has a class for crawling images, with a constructor, a destructor, and so on.
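
To see when the constructor and destructor actually run, here is a minimal sketch. The `Tracker` class and the `events` list are hypothetical names used only for this illustration, and the ordering it shows relies on CPython's reference counting; other interpreters may run `__del__` later.

```python
events = []     # records the order in which the methods are called

class Tracker:
    def __init__(self):             # runs when the object is created
        events.append('init')

    def run(self):
        events.append('run')

    def __del__(self):              # runs when the object is destroyed
        events.append('del')

t = Tracker()
t.run()
events.append('before del')
del t       # in CPython the last reference is gone here, so __del__ runs now
events.append('after del')

print(events)
```

Because `__del__` is tied to object destruction rather than to the end of `run`, a message such as "Download finished!" printed from it appears whenever the object is collected, which is normally at the end of `main()` in the script above.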

There is also another version of the URL request, which uses urllib3.PoolManager():

# Revised the code; it now runs correctly

import re
from urllib import request

import certifi
import urllib3

class CrawlImg:     # a class that crawls images

    def __init__(self):     # constructor
        print('Link start!')

    def __GetHtml(self, html):
        # Initialize the pool. To get around a certificate problem I also installed
        # pyOpenSSL, but that did not help... In the end, writing it this way got
        # rid of the InsecureRequestWarning.
        http = urllib3.PoolManager(
            cert_reqs='CERT_REQUIRED',
            ca_certs=certifi.where()
        )
        resp = http.request('GET', html)    # request the page
        page = resp.data                    # read the page data (bytes)
        # Decode the bytes to str; matching a str pattern against bytes raises
        # TypeError: cannot use a string pattern on a bytes-like object
        return page.decode('utf-8')

    def __GetImg(self, html):
        page = self.__GetHtml(html)     # fetch the HTML page data
        # Updated regex: an http(s) URL ending in .jpg
        # (the dot is escaped so that it matches a literal '.')
        recomp = re.compile(r'https?://[^\s"\'<>]+?\.jpg')
        imgUrlList = recomp.findall(page)   # match against the HTML page
        return imgUrlList   # list of matched jpg URLs

    def run(self, html):
        imgUrlList = self.__GetImg(html)
        # "with" closes the file even if a download fails
        with open('C:\\Users\\adimin\\Desktop\\CrawlImg\\imgUrl.txt', 'w') as fp:
            # start=1 names the files 1.jpg, 2.jpg, ...
            for ImgName, imgUrl in enumerate(imgUrlList, start=1):
                request.urlretrieve(imgUrl, 'C:\\Users\\adimin\\Desktop\\CrawlImg\\{}.jpg'.format(ImgName))
                print('Download:', imgUrl)
                fp.write(imgUrl + '\n')     # text mode already translates '\n' on Windows

    def __del__(self):      # destructor
        print("Download finished!")

def main():
    url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%B6%AB%B7%BD%BB%C3%CF%EB%CF%E7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
    GetImg = CrawlImg()
    GetImg.run(url)

if __name__ == '__main__':
    main()

Finally, I don't think there is much else that needs explaining, so that wraps up this post.

-----------------update 2018-01-19 22:56:43----------------

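Recently I noticed that other people's regular expressions are fairly simple... while mine are all so long... But I don't quite understand how this expression matches: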

import re

url = 'https://i.pximg.net/user-profile/img/2017/07/03/10/55/30/12797398_1982f9bf699bd2ff2b67855b276bbb8c_50.png'
recmp = re.compile(r'.+?.jpg|.+?.png')
print(recmp.findall(url))

PS: I figured it out later.
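
For readers with the same question, here is a sketch of how that expression behaves. The `strict` pattern and the 'abjpg' test string are additions made for comparison and are not part of the original snippet.

```python
import re

url = 'https://i.pximg.net/user-profile/img/2017/07/03/10/55/30/12797398_1982f9bf699bd2ff2b67855b276bbb8c_50.png'

# .+? is lazy: it takes one character, then keeps growing one character at a
# time until the rest of the alternative can match. The first alternative
# (...jpg) never succeeds on this URL, so the second one (...png) is tried.
loose = re.compile(r'.+?.jpg|.+?.png')
loose_on_url = loose.findall(url)       # the whole URL, as a single match

# The unescaped '.' before jpg/png matches ANY character, not only a dot,
# so a string without a file extension still matches.
loose_on_junk = loose.findall('abjpg')

# Escaping the dot requires a literal '.' before the extension.
strict = re.compile(r'.+?\.(?:jpg|png)')
strict_on_url = strict.findall(url)
strict_on_junk = strict.findall('abjpg')

print(loose_on_url == [url], loose_on_junk)
print(strict_on_url == [url], strict_on_junk)
```

Both patterns return the full URL here, but only the escaped version rejects text that merely ends in the letters "jpg" or "png".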
