Python Web Crawler Hands-On Notes ①: How to Download Han Han's Blog Posts

1. Open Han Han's blog article list page

http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

Goal: get the hyperlinks of all the articles.

2. Features of Han Han's blog article list

Pick any article link, right-click it and choose Inspect Element. You will find

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>

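Open the hyperlink address found there,

and it turns out to be exactly the link we are looking for.

So the pattern that identifies an entry in the article list is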

<a title=…… href="…….html"

Create a Python file (file name: GetHanhan.py) and paste in the markup that contains the hyperlink just found

#coding:utf-8
#<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>

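Because the file contains Chinese characters, the utf-8 encoding has to be declared (the #coding:utf-8 line); otherwise Python 2 rejects the source file.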

3. String functions: the find function

[KANO@kelvin ~]$ python
Python 2.7.10 (default, Sep 24 2015, 17:50:09) 
[GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> help(str.find)

Help on method_descriptor:

find(...)
    S.find(sub [,start [,end]]) -> int

    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.

    Return -1 on failure.

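find searches S for sub. If sub is found, it returns the index of the first occurrence; otherwise it returns -1. You can also specify where in the string the search starts and where it ends.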

Note: counting starts at 0, so positions are numbered 0, 1, 2, 3, 4 and so on.
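As a small illustration of the behaviour described in the help text, the sketch below calls find on a made-up string ('abcabc' is not from the blog page). It uses print() with a single argument so that it runs on Python 2.7 as well as Python 3.

```python
# Illustrative string only; it does not come from the blog page.
s = 'abcabc'

first = s.find('c')       # lowest index at which 'c' occurs
second = s.find('c', 3)   # same search, but starting at index 3
missing = s.find('x')     # 'x' does not occur in s

print(first)    # 2
print(second)   # 5
print(missing)  # -1
```

The -1 result is worth checking for before slicing, because a negative index is still a valid slice position in Python.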

First, look for a distinctive string such as <a title to narrow down the range

#coding:utf-8
#<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>
str0='<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>'
title=str0.find(r'<a title')
print title

Run it:

[KANO@kelvin 桌面]$ python GetHanhan.py 
0

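It is found right away: the output is 0, so <a title sits at the very first character of the string.

Now try putting some characters in front of <a title

and run it again.

<a title is now at character 15, which is correct!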
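The characters that were added in front of the tag are not shown in these notes, so the following sketch assumes a hypothetical prefix of 15 'x' characters, chosen so that find reports 15. The anchor text is left out of the tag to keep the example ASCII-only.

```python
link = '<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">'

# Hypothetical prefix: 15 filler characters placed in front of the tag.
prefix = 'x' * 15
shifted = prefix + link

print(link.find('<a title'))     # 0
print(shifted.find('<a title'))  # 15
```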

Next, add two more lines to the end of the code (the href lines below)

#coding:utf-8
#<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>

str0='<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>'
title=str0.find(r'<a title')
print title
href=str0.find(r'href=')
print href

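This searches for href=. Save the file and run it; href is reported at index 28.

The search string href= is 5 characters long and is followed by a quotation mark, so http starts 6 characters after the position of the h in href.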
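The offset of 6 can also be computed rather than counted by hand. The sketch below is one way to do that, assuming the search string is extended to include the opening quotation mark (the notes search for href= without it); the name marker is introduced here only for illustration.

```python
str0 = '<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">'

marker = 'href="'            # attribute name plus the opening quote
href = str0.find(marker)     # index of the h in href
start = href + len(marker)   # first character of the URL

print(href)                   # 28
print(start)                  # 34
print(str0[start:start + 7])  # http://
```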

Next, find the position of .html by adding the following to the code

html=str0.find(r'.html')
print html

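Run it; html is reported at index 81.

So the link address we are looking for is the run of characters from 28+6 up to 81+5 (5 being the length of .html).

Try printing the address that was found,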

url=str0[href:html]
print url

That is not quite right: the slice begins at href= and stops just before .html. Add the offsets,

url=str0[href+6:html+5]
print url

Now the output is the correct link address.
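The slicing steps above can be gathered into one helper. This is only a sketch: the function name extract_link is made up for illustration, and it adds a check for the -1 that find returns on failure, which the notes do not handle.

```python
def extract_link(tag):
    """Return the URL inside href="..." of one <a> tag, or None if absent."""
    marker = 'href="'
    href = tag.find(marker)
    if href == -1:
        return None
    start = href + len(marker)
    # Search for '.html' only after the start of the URL.
    html = tag.find('.html', start)
    if html == -1:
        return None
    return tag[start:html + len('.html')]


str0 = '<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">'
print(extract_link(str0))
print(extract_link('<a title="">no link here</a>'))  # None
```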

4. Opening the web page the way a browser would: urllib

Use the urllib library

#coding:utf-8
#<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>
import urllib
str0='<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">一个流传多年的谣言</a>'
title=str0.find(r'<a title')
print title
href=str0.find(r'href=')
print href
html=str0.find(r'.html')
print html

url=str0[href+6:html+5]
print url

content=urllib.urlopen(url).read()
print content

The content of the article has now been read.
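The urllib.urlopen call above exists only in Python 2; in Python 3 the same function lives in urllib.request. A minimal sketch that works on either version might look like this (the helper name fetch is an assumption for illustration, not part of the original script):

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2, as used in these notes


def fetch(url):
    """Download url and return the raw response body."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

It would be called as content = fetch(url). On Python 3 the result is a bytes object, so it has to be decoded before it can be handled as text.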

Next, save what was read to a file.

Set the file name first by changing the last two lines of the code

content=urllib.urlopen(url).read()
# print content
filename=url[-26:]
print filename

This takes the last 26 characters of the link address as the file name
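The fixed length of 26 works for this URL, but it would break if an article ID were longer or shorter. One possible alternative, shown here as a sketch, is to take everything after the last slash:

```python
url = 'http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html'

by_slice = url[-26:]               # fixed length, as in the notes
by_split = url.rsplit('/', 1)[-1]  # everything after the last '/'

print(by_slice)  # blog_4701280b0102e7pk.html
print(by_split)  # blog_4701280b0102e7pk.html
```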

Then add

# write the raw page content and close the file when done
with open(filename,'wb') as f:
    f.write(content)

which creates the file.

With that, one article picked at random from Han Han's blog front page has been downloaded.
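To move from one link towards the stated goal of collecting every article link, the start argument of find can be used to continue searching after each match. The sketch below is one way this might be done. It assumes that every article link on the list page follows the <a title ... href="....html" pattern from section 2; the function name find_links and the second URL in the sample string are hypothetical.

```python
def find_links(page):
    """Collect each href="....html" URL that follows an '<a title' marker."""
    links = []
    pos = 0
    while True:
        title = page.find('<a title', pos)
        if title == -1:
            break
        href = page.find('href="', title)
        if href == -1:
            break
        start = href + len('href="')
        html = page.find('.html', start)
        if html == -1:
            break
        end = html + len('.html')
        links.append(page[start:end])
        pos = end  # resume the search after this link
    return links


# Sample markup: the first link is from the notes, the second is made up.
sample = ('<div><a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html">A</a>'
          '<span>x</span>'
          '<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_hypothetical.html">B</a></div>')

for link in find_links(sample):
    print(link)
```

Each URL returned could then go through the same download and save steps shown above. Note that this simple scan would pair an <a title tag that has no href with the href of a later tag, so it is only a starting point.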