Reading notes on OReilly.Web.Scraping.with.Python.2015.6 --- Crawl
1. The function calls itself, forming a crawl loop in which each page leads to the next:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        # Find the element with id="mw-content-text", then look up its <p> tags; [0] selects the first one
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Inside id="ca-edit", find the <span>, then its <a>, and read the href value
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
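A caveat with the recursive version: Python caps recursion depth (roughly 1,000 frames by default), so on a site as large as Wikipedia this will eventually raise RecursionError. A minimal iterative sketch of the same crawl, using an explicit list as a stack (my own rewrite, not from the book):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def crawl(startUrl):
    pages = set()
    toVisit = [startUrl]                      # an explicit stack replaces the call stack
    while toVisit:
        pageUrl = toVisit.pop()
        html = urlopen("http://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "lxml")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if 'href' in link.attrs and link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)            # remember the page so it is crawled only once
                toVisit.append(newPage)

crawl("")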
2. Processing an address by splitting the URL string on "/":
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)
Output:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004']   # replace("http://", "") does not match the "https://" prefix, so the scheme survives; there is nothing between the two slashes, hence the empty string ''
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)
Output:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
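String replacement like this is fragile: as the first run shows, replace("http://", "") does nothing for an https:// address. For anything beyond a quick note, the standard library's urllib.parse.urlparse extracts the host reliably; a small sketch (my own, not from the book):

from urllib.parse import urlparse

def getHost(address):
    # urlparse understands any scheme, so http and https both work
    return urlparse(address).netloc

print(getHost("https://hao.360.cn/?a1004"))                          # hao.360.cn
print(getHost("http://www.autohome.com.cn/wuhan/#pvareaid=100519"))  # www.autohome.com.cn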
3. Collecting a site's internal links
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with a "/" or that contain the current URL
    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
print(internalLinks)
Output (all internal links found on this page):
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['https://www.oreilly.com', 'http://www.oreilly.com/ideas',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav',
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business',
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering',
'/topics/web-programming', 'https://www.oreilly.com/topics',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now',
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in',
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course',
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid',
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform',
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends',
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/',
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html',
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
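Note that the result mixes absolute URLs with relative paths such as '/topics/ai'. Before feeding these back into a crawler, it helps to normalize everything to absolute URLs; a minimal sketch using urllib.parse.urljoin (my own addition, not from the book):

from urllib.parse import urljoin

def normalizeLinks(links, baseUrl):
    # urljoin leaves absolute URLs untouched and resolves relative paths against baseUrl
    return [urljoin(baseUrl, link) for link in links]

print(normalizeLinks(['/topics/ai', 'http://shop.oreilly.com/'], "http://oreilly.com"))
# ['http://oreilly.com/topics/ai', 'http://shop.oreilly.com/']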
4. Collecting a site's external links
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" that do
    # not contain the current URL (the (?!...) is a negative lookahead)
    for link in bsObj.findAll("a",
            href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
print(splitAddress(startingPage))
print(splitAddress(startingPage)[0])
externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
print(externalLinks)
Output:
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['oreilly.com']
oreilly.com
['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']
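Putting the pieces together, a crawler can hop from site to site by repeatedly picking one external link per page. The book builds a similar random walk; this condensed sketch (reusing getExternalLinks from above, with urlparse standing in for splitAddress so https sites also work) and its name followExternalLink are my own:

import random
from urllib.parse import urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

def followExternalLink(startingSite, hops=5):
    site = startingSite
    for _ in range(hops):
        bsObj = BeautifulSoup(urlopen(site), "lxml")
        # urlparse(...).netloc yields the bare host, e.g. "oreilly.com"
        externalLinks = getExternalLinks(bsObj, urlparse(site).netloc)
        if not externalLinks:
            print("No external links found on", site)
            break
        site = random.choice(externalLinks)   # follow one external link at random
        print("Next stop:", site)

followExternalLink("http://oreilly.com")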