Learning Python: Implementing a Simple Crawler

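To speed up my learning of Python 3.x, I read through a lot of material and then wrote this script. It crawls Baidu Images for pictures matching '东方幻想乡' (Touhou Gensokyo), but it still has many problems.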

Here is the code:

# Updated the code
from urllib import request
import re

class CrawlImg:     # A class that crawls images

    def __init__(self):     # Constructor
        print('Link start!')

    def __GetHtml(self, html):
        post = request.urlopen(html)
        page = post.read()
        # Decode the bytes as UTF-8. Matching a str pattern against the raw bytes raises
        # "TypeError: cannot use a string pattern on a bytes-like object".
        return page.decode('utf-8')

    def __GetImg(self, html):
        page = self.__GetHtml(html)     # Fetch the HTML page
        # New, concise regular expression: absolute URLs ending in a literal ".jpg"
        recomp = re.compile(r'https?://[^\s"]+?\.jpg')
        imgUrlList = recomp.findall(page)   # Match against the HTML page
        return imgUrlList   # Return the list of matched jpg URLs

    def run(self, html):
        imgUrlList = self.__GetImg(html)
        ImgName = 1
        # The with block closes the file even if a download fails
        with open('C:\\Users\\adimin\\Desktop\\CrawlImg\\imgUrl.txt', 'w') as fp:
            for imgUrl in imgUrlList:
                request.urlretrieve(imgUrl, 'C:\\Users\\adimin\\Desktop\\CrawlImg\\{}.jpg'.format(ImgName))
                print('Download:', imgUrl)
                fp.write(imgUrl + '\n')     # Text mode already translates '\n' on Windows
                ImgName += 1

    def __del__(self):      # Destructor
        print("Download finished!")

def main():
    url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%B6%AB%B7%BD%BB%C3%CF%EB%CF%E7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
    GetImg = CrawlImg()
    GetImg.run(url)

if __name__ == '__main__':
    main()
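The TypeError mentioned in the comment of `__GetHtml` comes from applying a string pattern to the raw bytes returned by `read()`. Below is a minimal sketch that reproduces it offline. The byte string `raw` and the example.com URL are made up for illustration and stand in for a real response body; the pattern is only an assumption about what a .jpg URL looks like.

```python
import re

# Hypothetical response body; a real one would come from urlopen(...).read()
raw = b'<img src="https://example.com/a.jpg">'
pattern = re.compile(r'https?://[^\s"]+?\.jpg')

error_message = None
try:
    pattern.findall(raw)            # str pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', exc)

# Decoding first turns the bytes into str, so the same pattern works
urls = pattern.findall(raw.decode('utf-8'))
print(urls)
```

An alternative would be to compile a bytes pattern (`rb'...'`) and match the raw bytes directly, but decoding once keeps the rest of the script working with ordinary strings.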

I consulted many blogs and other resources, mainly:

http://blog.csdn.net/clj198606061111/article/details/50816115

https://www.cnblogs.com/speeding/p/5097790.html

http://urllib3.readthedocs.io/en/latest/

https://pyopenssl.org/en/stable/

https://docs.python.org/3.6/library/urllib.html

https://segmentfault.com/q/1010000004442233/a-1020000004448440

http://urllib3.readthedocs.io/en/latest/user-guide.html

Runoob tutorial (菜鸟教程), Python 3

There were a few others that I can no longer remember...

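I learned a lot from this exercise: I became broadly familiar with basic Python 3 syntax and also learned how to write regular expressions, so I wrote the program in an object-oriented style.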

As the code shows, it defines a class for crawling images, with a constructor, a destructor, and so on.
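Two details of that class design are easy to miss. Methods whose names start with two underscores, such as `__GetHtml`, are name-mangled by Python, and `__del__` runs when the object is reclaimed rather than at a fixed point in the program. The sketch below illustrates both with a hypothetical class `Demo` that is not part of the crawler. The timing of `__del__` shown here is CPython behaviour (reference counting); the language itself does not guarantee it.

```python
events = []

class Demo:                         # hypothetical class, for illustration only
    def __init__(self):             # runs when the object is created
        events.append('init')

    def __helper(self):             # stored on the class as _Demo__helper
        return 'helper called'

    def run(self):
        return self.__helper()      # mangled automatically inside the class body

    def __del__(self):              # runs when the object is reclaimed
        events.append('del')

d = Demo()
result = d.run()
has_plain_name = hasattr(d, '__helper')         # the unmangled name is not visible outside
has_mangled_name = hasattr(d, '_Demo__helper')  # the mangled name is
del d   # in CPython the last reference is gone here, so __del__ runs now
print(result, has_plain_name, has_mangled_name, events)
```

Because the moment `__del__` runs depends on the interpreter, a message such as "Download finished!" is more predictable when printed at the end of `run` instead.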

Here is another version of the URL request, which uses urllib3.PoolManager():

# Revised the code; it now runs correctly
from urllib import request
import urllib3
import certifi
import re

class CrawlImg:     # A class that crawls images

    def __init__(self):     # Constructor
        print('Link start!')

    def __GetHtml(self, html):
        # Initialize the pool. To fix a certificate problem I also installed pyOpenSSL,
        # but that did not help; passing these two arguments is what finally got rid
        # of the InsecureRequestWarning.
        post = urllib3.PoolManager(
            cert_reqs='CERT_REQUIRED',
            ca_certs=certifi.where()
        )
        post = post.request('GET', html)    # Request the page
        page = post.data    # Read the page data
        # Decode the bytes as UTF-8. Matching a str pattern against the raw bytes raises
        # "TypeError: cannot use a string pattern on a bytes-like object".
        return page.decode('utf-8')

    def __GetImg(self, html):
        page = self.__GetHtml(html)      # Fetch the HTML page data
        # Updated: absolute URLs ending in a literal ".jpg"
        recomp = re.compile(r'https?://[^\s"]+?\.jpg')
        imgUrlList = recomp.findall(page)   # Match against the HTML page
        return imgUrlList   # Return the list of matched jpg URLs

    def run(self, html):
        imgUrlList = self.__GetImg(html)
        ImgName = 1
        # The with block closes the file even if a download fails
        with open('C:\\Users\\adimin\\Desktop\\CrawlImg\\imgUrl.txt', 'w') as fp:
            for imgUrl in imgUrlList:
                request.urlretrieve(imgUrl, 'C:\\Users\\adimin\\Desktop\\CrawlImg\\{}.jpg'.format(ImgName))
                print('Download:', imgUrl)
                fp.write(imgUrl + '\n')     # Text mode already translates '\n' on Windows
                ImgName += 1

    def __del__(self):      # Destructor
        print("Download finished!")

def main():
    url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%B6%AB%B7%BD%BB%C3%CF%EB%CF%E7&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
    GetImg = CrawlImg()
    GetImg.run(url)

if __name__ == '__main__':
    main()
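The extraction step in `__GetImg` can be tried without any network access. The sketch below runs a URL pattern over a made-up page fragment and drops duplicates while keeping page order, so the same picture is not downloaded twice. `sample_html`, `IMG_URL`, `extract_img_urls`, and the example.com addresses are all hypothetical names introduced for illustration; the real Baidu page is laid out differently.

```python
import re

# Hypothetical page fragment; the real Baidu page is not reproduced here
sample_html = '''
<img src="https://example.com/img/1.jpg">
<a href="https://example.com/about.html">about</a>
<img src="https://example.com/img/2.jpg">
<img src="https://example.com/img/1.jpg">
'''

# Assumed pattern: an absolute URL with no whitespace or quotes, ending in ".jpg"
IMG_URL = re.compile(r'https?://[^\s"]+?\.jpg')

def extract_img_urls(page):
    """Return the .jpg URLs found in page, without duplicates, in page order."""
    return list(dict.fromkeys(IMG_URL.findall(page)))

img_urls = extract_img_urls(sample_html)
for number, url in enumerate(img_urls, start=1):
    # Mirrors the numbered file names used in run()
    print('{}.jpg <- {}'.format(number, url))
```

`dict.fromkeys` is used for deduplication because dictionaries preserve insertion order in Python 3.7 and later, which keeps the numbering stable between runs.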

Finally, I don't think anything else needs explaining, so that wraps up this post.

-----------------update 2018-01-19 22:56:43----------------

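Recently I noticed that other people's regular expressions are fairly simple, while mine are all very long. But I don't quite understand how this expression matches: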

import re

url = 'https://i.pximg.net/user-profile/img/2017/07/03/10/55/30/12797398_1982f9bf699bd2ff2b67855b276bbb8c_50.png'
recmp = re.compile(r'.+?.jpg|.+?.png')
print(recmp.findall(url))

P.S. I figured it out later.
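For readers with the same question, here is one way to read the pattern. `.+?` lazily matches one or more characters, the unescaped `.` before `jpg` or `png` matches any single character, and `|` tries the left branch before the right one at each starting position. For the URL above, the `jpg` branch fails and the `png` branch expands until the text ends in `png`, so the whole URL is returned. The sketch below compares the original pattern with a stricter variant. The names `loose`, `strict`, and `fake` are assumptions made for illustration, and the stricter pattern is a suggestion, not what the post used.

```python
import re

url = 'https://i.pximg.net/user-profile/img/2017/07/03/10/55/30/12797398_1982f9bf699bd2ff2b67855b276bbb8c_50.png'

loose = re.compile(r'.+?.jpg|.+?.png')      # pattern from the post
strict = re.compile(r'.+?\.(?:jpg|png)')    # escaped dot, shared prefix

loose_result = loose.findall(url)
strict_result = strict.findall(url)
print(loose_result == [url], strict_result == [url])

# Hypothetical string that has no real file extension
fake = 'not-a-file-xpng'
loose_fake = loose.findall(fake)     # the unescaped dot accepts the 'x'
strict_fake = strict.findall(fake)   # a literal '.' is required, so no match
print(loose_fake, strict_fake)
```

Both patterns return the full URL for the real address; they differ only on inputs where the character before the extension is not a dot.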
