吴裕雄 -- Python Study Notes: Web Crawlers

import chardet
import urllib.request

page = urllib.request.urlopen('http://photo.sina.com.cn/')  # open the web page
htmlCode = page.read()  # read the page source as bytes
print(chardet.detect(htmlCode))  # print the detected encoding of the page
# Output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

data = htmlCode.decode('utf-8')
print(data)  # print the page source

# Open pageCode.txt in binary write mode; the with block closes the file automatically
with open('E:\\pageCode.txt', 'wb') as pageFile:
    pageFile.write(htmlCode)
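The snippet above hard-codes utf-8 after checking the chardet result by hand. As a minimal sketch of coping with pages that use a different encoding (for instance the gbk site used later in these notes), the helper below tries a list of candidate encodings in order. The name decode_html and the candidate list are assumptions made for illustration, not part of the original notes, and the sketch uses only the standard library rather than chardet.

```python
def decode_html(raw, candidates=("utf-8", "gbk")):
    """Decode raw bytes with the first candidate encoding that succeeds.

    decode_html and its default candidate list are hypothetical helpers
    for this sketch, not something defined in the notes.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    # Last resort: decode with the first candidate and replace bad bytes
    return raw.decode(candidates[0], errors="replace")
```

It could be used as data = decode_html(htmlCode) in place of the fixed htmlCode.decode('utf-8') call. The order of the candidates matters, because a permissive encoding listed first may accept bytes that were really written in another one.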

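Extracting other information

Open pageCode.txt (or inspect the original page directly with the browser's F12 developer tools) and look at the tags that hold the data you want.

For example, suppose I want the images.

Write a regular expression for the images: reg = r'src="(.+?\.jpg)"'

To explain: it matches a string that starts with src=", continues with one or more arbitrary characters (non-greedy), and ends with .jpg". For example, the link inside the double quotes after src in the red box of the screenshot is one such match.

Next, we need to pull the strings that satisfy the regular expression out of the very long page-source string fetched above.

For this we use re.findall(pattern, string) from Python's re library. It returns a list of the matches; when the pattern contains a single capture group, as it does here, the list holds the text captured by that group.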
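To make the behaviour of findall with a capture group concrete, here is a small sketch run on made-up markup (sample_html and the example.com URLs are invented for illustration). It also compares the pattern above with a slightly stricter variant, which is a suggestion and not part of the original notes: because . also matches a double quote, the original pattern can run past a src value that is not a .jpg and swallow the following tag, whereas [^"] stops at the closing quote.

```python
import re

# Hypothetical sample markup standing in for the downloaded page source
sample_html = (
    '<img src="http://example.com/a.jpg" alt="a">'
    '<img src="http://example.com/b.png">'
    '<img src="/c.jpg">'
)

loose = re.compile(r'src="(.+?\.jpg)"')      # pattern from the notes
strict = re.compile(r'src="([^"]+?\.jpg)"')  # variant that cannot cross a closing quote

# findall returns only the text captured by the parentheses
loose_matches = loose.findall(sample_html)
strict_matches = strict.findall(sample_html)

print(loose_matches)   # the second entry spills over from b.png into the next tag
print(strict_matches)  # only the two .jpg links
```

On pages where every src value ends in .jpg the two patterns return the same list, so the stricter form only matters for mixed content.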

import re
import chardet
import urllib.request

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the web page
htmlCode = page.read()  # read the page source as bytes
#print(chardet.detect(htmlCode))  # check the encoding
data = htmlCode.decode('utf-8')
#print(data)  # print the page source
#pageFile = open('pageCode.txt','wb')  # open pageCode.txt for writing
#pageFile.write(htmlCode)  # write
#pageFile.close()  # remember to close what you open

reg = r'src="(.+?\.jpg)"'  # regular expression
reg_img = re.compile(reg)  # compile once so the pattern can be reused efficiently
imglist = reg_img.findall(data)  # run the match
for img in imglist:
    print(img)

Output:

http://ppic.meituba.com:83/uploads3/181201/3-1Q20111553V11.jpg

http://ppic.meituba.com:83/uploads2/180622/3-1P62215532D61.jpg

http://ppic.meituba.com:83/uploads2/180605/3-1P6051000144I.jpg

http://ppic.meituba.com:83/uploads2/170511/8-1F5110URc35.jpg

http://ppic.meituba.com:83/uploads/160322/8-1603220U50O23.jpg

http://ppic.meituba.com:83/uploads2/180317/3-1P31F91U1X9.jpg

http://ppic.meituba.com:83/uploads/160718/7-160GQ51G0b4.jpg

http://ppic.meituba.com:83/uploads2/170517/8-1F51G50301Q3.jpg

http://ppic.meituba.com:83/uploads/161010/7-1610101A202B0.jpg

http://ppic.meituba.com:83/uploads2/171102/7-1G102093511F7.jpg

http://ppic.meituba.com:83/uploads2/170901/7-1FZ1100545438.jpg

http://ppic.meituba.com:83/uploads/160625/8-160625093044631.jpg

http://ppic.meituba.com:83/uploads/160419/7-160419161553153.jpg

http://ppic.meituba.com:83/uploads2/170323/7-1F323103404A2.jpg

http://ppic.meituba.com:83/uploads2/170322/7-1F322105R1255.jpg

http://ppic.meituba.com:83/uploads2/170211/7-1F21110040Y63.jpg

http://ppic.meituba.com:83/uploads2/170110/7-1F110102005930.jpg

http://ppic.meituba.com:83/uploads/160618/8-16061Q04450391.jpg

http://ppic.meituba.com:83/uploads2/170330/3-1F3301HI6138.jpg

http://ppic.meituba.com:83/uploads2/161230/4-161230100U5V8.jpg

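Next, download the images to the local disk

The urllib library provides urllib.request.urlretrieve(url, filename), which downloads the content at the URL and saves it under the name given as the second argument. Let's try it out.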

x = 0
for img in imglist:
    print(img)
    # download each matched image and save it as 0.jpg, 1.jpg, ...
    urllib.request.urlretrieve(img, '%s.jpg' % x)
    x += 1
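The loop above names files by a counter. As a hedged alternative sketch, the helpers below keep each image's original file name and skip downloads that fail instead of aborting the whole run. The names filename_from_url and download_all, the images folder, and the image.jpg fallback are all assumptions made for illustration.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="image.jpg"):
    """Return the last path component of url, or default if the path has none."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def download_all(urls, folder="images"):
    """Download every URL into folder and return the paths that were saved."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(folder, filename_from_url(url))
        try:
            urllib.request.urlretrieve(url, target)
        except (OSError, ValueError) as err:
            # URLError and HTTPError derive from OSError; ValueError covers
            # malformed or relative URLs that urlopen cannot handle
            print("skipped", url, err)
            continue
        saved.append(target)
    return saved
```

With the list built earlier, download_all(imglist) would save the files under images/ using names such as 3-1Q20111553V11.jpg. Two URLs that end in the same file name would overwrite each other, which the counter-based naming avoids.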

import re
import urllib.request

def getGtmlCode():
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/44/44683").read()  # fetch the page source
    html = html.decode("gbk")  # decode using the site's encoding
    # Regex written to fit the site's markup: (.*?) matches anything, and the parentheses capture the parts we need
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    urls = reg.findall(html)
    for url in urls:
        #print(url)
        chapter_url = url[0]  # chapter URL
        chapter_title = url[1]  # chapter title
        chapter_html = urllib.request.urlopen(chapter_url).read()  # fetch the full source of the chapter page
        chapter_html = chapter_html.decode("gbk")
        chapter_reg = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;.*?<br />(.*?)<script type="text/javascript">'  # match the chapter text
        chapter_reg = re.compile(chapter_reg, re.S)  # re.S lets . match newlines as well
        chapter_content = chapter_reg.findall(chapter_html)
        for content in chapter_content:
            content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "")  # remove the indentation entities
            content = content.replace("<br />", "")  # remove the line-break tags
            print(content)
            # save locally; the with block closes the file after writing
            with open('E:\\aa\\{}.txt'.format(chapter_title), 'w') as f:
                f.write(content)

getGtmlCode()
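The two replace calls above delete the markup outright, which also removes the line breaks between paragraphs, and the chapter title is used as a file name without any checking. A minimal sketch of a gentler clean-up follows. The function names clean_chapter_text and safe_filename are hypothetical, and the set of characters treated as unsafe is an assumption based on what Windows rejects in file names.

```python
import html
import re


def clean_chapter_text(fragment):
    """Turn an HTML fragment of chapter text into plain lines of text."""
    text = re.sub(r'<br\s*/?>', '\n', fragment)  # keep line breaks as newlines
    text = html.unescape(text)                   # &nbsp; becomes U+00A0, etc.
    text = text.replace('\xa0', ' ')             # non-breaking space -> plain space
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)  # drop empty lines


def safe_filename(title, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', replacement, title).strip()
```

Inside the loop, content = clean_chapter_text(content) could stand in for the two replace calls, and safe_filename(chapter_title) could be used when building the output path.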
