Python Web Crawler (Part 3)

Crawling an Entire Domain

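Six degrees of separation: the theory says that any two strangers are connected by a chain of at most six steps, which means you can reach any stranger through at most five intermediaries. Wikipedia lets us try the same idea with links: starting from one person's article, we follow links from page to page to reach the article of another person.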

1. Getting all the links on a page

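from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the article and parse it with Python's built-in HTML parser.
html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")

# Print the target of every <a> tag that has an href attribute.
for link in bsObj.find_all("a"):
    if "href" in link.attrs:
        print(link.attrs["href"])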
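Many of the href values printed this way are not full URLs: some are site-relative paths, some are fragments on the same page, and some are protocol-relative. As a small sketch, the standard library's urllib.parse.urljoin can resolve them against the page URL. The sample hrefs and the helper name to_absolute below are made up for illustration and are not taken from the live page.

```python
from urllib.parse import urljoin

# The page the links were found on.
base_url = "https://en.wikipedia.org/wiki/Kevin_Bacon"

# Hypothetical href values, chosen only to show three common shapes.
sample_hrefs = [
    "/wiki/Footloose_(1984_film)",             # site-relative path
    "#cite_note-1",                            # fragment on the same page
    "//commons.wikimedia.org/wiki/Main_Page",  # protocol-relative URL
]

def to_absolute(base, hrefs):
    """Resolve each href against base and return the absolute URLs."""
    return [urljoin(base, href) for href in hrefs]

for url in to_absolute(base_url, sample_hrefs):
    print(url)
```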

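2. Getting the articles linked from a person's Wikipedia page

1. Exclude the links that appear on every page: sidebar, footer, and header links, as well as links to category pages and talk pages.

2. Links from the current article to other articles have two things in common:

I They sit inside the div whose id is bodyContent

II Their URLs contain no colon and start with /wiki/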

from urllib.request import urlopen
import re

from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "html.parser")

# Article links start with /wiki/ and contain no colon.
article_link = re.compile("^(/wiki/)((?!:).)*$")

# Search only inside the main content div.
for link in bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=article_link):
    if "href" in link.attrs:
        print(link.attrs["href"])
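To see which hrefs the pattern accepts without touching the network, it can be run against a few strings directly. This is only a sketch: the candidate hrefs and the helper name article_links are hypothetical.

```python
import re

# Same pattern as in the crawler: starts with /wiki/ and contains no colon.
article_link = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical hrefs for illustration.
candidates = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_male_film_actors",
    "/wiki/Talk:Kevin_Bacon",
    "/w/index.php?title=Kevin_Bacon&action=edit",
    "https://www.example.com/wiki/Page",
]

def article_links(hrefs):
    """Keep only the hrefs that look like article links."""
    return [href for href in hrefs if article_link.match(href)]

print(article_links(candidates))
```

Only the first candidate passes: the category and talk links contain a colon, and the last two do not start with /wiki/.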

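3. Following links deeper

Simply listing the links on a single Wikipedia page is not very useful. It becomes much more useful if the crawler starts from one page and keeps following links.

1. Create a simple function that returns all the article links on the current page.

2. Create a main function that starts from one page, picks one of its links at random, and repeats the search from that new page until a page yields no links.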

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from random import choice
import re

from bs4 import BeautifulSoup

basename = "http://en.wikipedia.org"

def getLinks(pagename):
    """Return the article links found on the given page, or [] on failure."""
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
        links = bsObj.find("div", {"id": "bodyContent"}).find_all(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))
        return [link.attrs["href"] for link in links if "href" in link.attrs]
    except (HTTPError, URLError, AttributeError):
        # Return an empty list (not None) so the caller's loop ends cleanly.
        return []

def main():
    links = getLinks("/wiki/Kevin_Bacon")
    # Keep following a random link until a page has no article links.
    while links:
        nextpage = choice(links)
        print(nextpage)
        links = getLinks(nextpage)

main()
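On the live site almost every article has links, so a walk like this may run for a very long time. The sketch below shows the same random-walk idea with a step limit. It takes the link-fetching function as a parameter so it can be tried offline. FAKE_LINKS, random_walk, and max_steps are names invented for this example; in practice getLinks could be passed in as get_links.

```python
import random

# Hypothetical link graph standing in for live pages.
FAKE_LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],  # dead end: no article links
}

def random_walk(start, get_links, max_steps=10):
    """Follow random links from start; stop at a dead end or after max_steps."""
    path = [start]
    links = get_links(start)
    while links and len(path) <= max_steps:
        next_page = random.choice(links)
        path.append(next_page)
        links = get_links(next_page)
    return path

path = random_walk("/wiki/A", lambda page: FAKE_LINKS.get(page, []))
print(path)
```

With this graph every walk ends at /wiki/C, either directly or via /wiki/B.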

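4. Crawling the entire domain

1. To crawl a whole site, start from its home page.

2. Record the pages that have already been visited so the same address is not fetched twice.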

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
import re

from bs4 import BeautifulSoup

basename = "http://en.wikipedia.org"
visitedpages = set()  # addresses of pages that have already been visited

def visitelink(pagename):
    """Print every unvisited article link on the page, then crawl each in turn."""
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
        links = bsObj.find("div", {"id": "bodyContent"}).find_all(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))
        for eachlink in links:
            if "href" in eachlink.attrs:
                if eachlink.attrs["href"] not in visitedpages:
                    nextpage = eachlink.attrs["href"]
                    print(nextpage)
                    visitedpages.add(nextpage)
                    # Each new page adds a stack frame, so a large site will
                    # eventually hit Python's recursion limit.
                    visitelink(nextpage)
    except (HTTPError, URLError, AttributeError):
        return None

# An empty page name means the site's home page.
visitelink("")
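Because the recursive version adds one stack frame per page, a site as large as Wikipedia will exceed Python's recursion limit. One common alternative, sketched here rather than taken from the original post, keeps the pending pages in a queue and loops instead of recursing. FAKE_SITE, crawl, and max_pages are hypothetical names. The link-fetching function is passed in so the example runs offline.

```python
from collections import deque

# Hypothetical site map standing in for live pages; "" is the home page.
FAKE_SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": [],
}

def crawl(start, get_links, max_pages=1000):
    """Visit pages breadth-first, each at most once, up to max_pages."""
    visited = {start}          # every page ever queued
    queue = deque([start])     # pages waiting to be fetched
    order = []                 # pages in the order they were fetched
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

print(crawl("", lambda page: FAKE_SITE.get(page, [])))
```

The set plays the same role as visitedpages in the recursive version, and the max_pages cap is an added safety limit for large sites.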

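5. Collecting useful information from the site

1. Nothing special here: while visiting each page, print its h1 title and some of its body text.

2. A problem that appeared when calling print:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xa9' in position 24051: illegal multibyte sequence

Solution: encode the text before printing it, i.e. source_code.encode('GB18030').

Explanation: GB18030 is a superset of GBK, so it can encode characters that GBK cannot. Note that in Python 3, printing the encoded result shows a bytes literal (b'...') rather than readable text.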

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
import re

from bs4 import BeautifulSoup

basename = "http://en.wikipedia.org"
visitedpages = set()  # addresses of pages that have already been visited

def visitelink(pagename):
    """Print the page's title and first paragraph, then crawl its article links."""
    url = basename + pagename
    try:
        with urlopen(url) as html:
            bsObj = BeautifulSoup(html, "html.parser")
        try:
            print(bsObj.h1.get_text())
            # Encode as GB18030 so a GBK console does not raise UnicodeEncodeError.
            print(bsObj.find("div", {"id": "mw-content-text"}).find("p").get_text().encode("GB18030"))
        except AttributeError:
            # The page has no h1 or no paragraph in the expected place.
            print("AttributeError")
        links = bsObj.find("div", {"id": "bodyContent"}).find_all(
            "a", href=re.compile("^(/wiki/)((?!:).)*$"))
        for eachlink in links:
            if "href" in eachlink.attrs:
                if eachlink.attrs["href"] not in visitedpages:
                    nextpage = eachlink.attrs["href"]
                    print(nextpage)
                    visitedpages.add(nextpage)
                    visitelink(nextpage)
    except (HTTPError, URLError, AttributeError):
        return None

# An empty page name means the site's home page.
visitelink("")
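If readable output matters more than keeping every character, another option is to replace the characters the console encoding cannot represent. This is a sketch of that alternative, not the approach used above. The helper name printable and the sample string are made up for illustration, and "gbk" is assumed to be the console encoding, as in the error message.

```python
def printable(text, encoding="gbk"):
    """Return text with characters the encoding cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

# \xa9 is the copyright sign named in the error message.
sample = "Copyright \xa9 Wikipedia"

# Still a str, so it prints as normal text on a GBK console.
print(printable(sample))

# GB18030 can encode the character, but printing bytes shows a bytes literal.
print(sample.encode("gb18030"))
```

The trade-off is that the replaced characters are lost, whereas the GB18030 bytes keep them but are not human-readable when printed.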
