Reading notes on OReilly.Web.Scraping.with.Python.2015.6 --- Crawl
1. The function calls itself, so the crawl becomes a recursive loop, each page leading to the next:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        # Find the node with id=mw-content-text, then take the first ([0]) <p> tag inside it
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Inside id=ca-edit, drill down through <span> to <a> and read its href value
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
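The crawl above keeps only links whose href begins with /wiki/. A minimal offline check of that filter, reusing the same `re` pattern from the code (the sample hrefs are made up for illustration):

```python
import re

# The same pattern the crawler uses to keep only Wikipedia article links
wiki_link = re.compile("^(/wiki/)")

# Hypothetical hrefs, as they might appear in a page's <a> tags
hrefs = ["/wiki/Python_(programming_language)",
         "/wiki/Web_scraping",
         "#cite_note-1",               # in-page anchor: rejected
         "//en.wikipedia.org/wiki/X",  # protocol-relative: rejected, starts with "//"
         "/w/index.php?title=Talk"]    # non-article namespace: rejected

article_links = [h for h in hrefs if wiki_link.search(h)]
print(article_links)
```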
2. Process a URL by splitting its text on "/":
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)
Output:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004']  # only "http://" is replaced, so "https:" survives; nothing sits between the two slashes of "//", hence the ''
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)
Output:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
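As the two runs show, splitAddress only strips "http://", so an "https://" prefix survives as stray list parts. A more robust way to pull out the host is the standard library's urllib.parse (a sketch, not from the book):

```python
from urllib.parse import urlsplit

# urlsplit understands both schemes, so no manual replace() is needed
print(urlsplit("https://hao.360.cn/?a1004").netloc)                           # hao.360.cn
print(urlsplit("http://www.autohome.com.cn/wuhan/#pvareaid=100519").netloc)   # www.autohome.com.cn
```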
3. Crawl a site's internal links
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with a "/" or contain the site's own domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
print(internalLinks)
Output (all internal links found on the page):
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['https://www.oreilly.com', 'http://www.oreilly.com/ideas',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav',
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business',
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering',
'/topics/web-programming', 'https://www.oreilly.com/topics',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now',
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in',
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course',
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid',
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform',
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends',
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/',
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html',
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
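The internal-link regex `^(/|.*oreilly.com)` accepts both root-relative paths and absolute URLs that contain the site's own domain. A small offline check of that pattern (the sample hrefs are hypothetical):

```python
import re

includeUrl = "oreilly.com"  # what splitAddress(startingPage)[0] returns
internal = re.compile("^(/|.*" + includeUrl + ")")

hrefs = ["/topics/data",                      # root-relative: internal
         "https://www.oreilly.com/ideas",     # contains the domain: internal
         "http://twitter.com/oreillymedia"]   # other host: external, rejected

kept = [h for h in hrefs if internal.search(h)]
print(kept)
```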
4. Crawl a site's external links
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" that do
    # not contain the current URL
    for link in bsObj.findAll("a",
            href=re.compile("^(http|www)((?!" + excludeUrl + ").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
print(splitAddress(startingPage))
print(splitAddress(startingPage)[0])
externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
print(externalLinks)
Output:
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['oreilly.com']
oreilly.com
['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']
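The external-link regex relies on a negative lookahead: `^(http|www)((?!oreilly.com).)*$` accepts links that start with http or www but never contain the site's own domain anywhere after that. An offline check (the sample hrefs are hypothetical):

```python
import re

excludeUrl = "oreilly.com"
external = re.compile("^(http|www)((?!" + excludeUrl + ").)*$")

hrefs = ["http://twitter.com/oreillymedia",  # other host: external
         "https://www.oreilly.com/ideas",    # contains the site's own domain: skipped
         "/topics/data"]                     # relative link: skipped

kept = [h for h in hrefs if external.search(h)]
print(kept)
```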