Reading Notes on OReilly.Web.Scraping.with.Python.2015.6 --- Crawl

1. The function calls itself recursively, so each crawled page leads to the next in a chain:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        # locate the element with id="mw-content-text", then find its "p" tags; [0] selects the first one
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # get the href of the <a> tag inside the <span> tag inside the element with id="ca-edit"
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
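
One caveat worth noting (my own remark, not from the book): Python limits recursion depth to roughly 1000 calls by default, so a deep crawl like this can raise RecursionError. A minimal iterative sketch that uses an explicit stack instead, assuming the same Wikipedia starting point:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def crawl(startUrl):
    # explicit stack of pages still to visit, so no recursion is needed
    stack = [startUrl]
    while stack:
        pageUrl = stack.pop()
        html = urlopen("http://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "lxml")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            href = link.attrs.get('href')
            if href is not None and href not in pages:
                # a page we have not seen before: remember it and queue it
                pages.add(href)
                stack.append(href)

crawl("")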

2. Processing a URL by splitting the address string on "/":

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)

The output is:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004']   # there is nothing between the two slashes of "https://", so that position shows up as ''; 'https:' remains because only "http://" is stripped

  

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)

The output is:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
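
As the first example shows, replace("http://", "") only strips the http scheme. A more robust alternative (my own sketch, not from the book) is urllib.parse.urlparse, which handles both http:// and https:// uniformly:

from urllib.parse import urlparse

def splitAddress(address):
    # urlparse separates scheme, netloc, path, query and fragment,
    # so the scheme never leaks into the result
    parsed = urlparse(address)
    return [parsed.netloc] + [part for part in parsed.path.split("/") if part]

print(splitAddress("https://hao.360.cn/?a1004"))
# ['hao.360.cn']
print(splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519"))
# ['www.autohome.com.cn', 'wuhan']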

3. Collecting a site's internal links

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with a "/"
    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
print(internalLinks)

The output (all internal links found on this page):

runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['https://www.oreilly.com', 'http://www.oreilly.com/ideas',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav',
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business',
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering',
'/topics/web-programming', 'https://www.oreilly.com/topics',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now',
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in',
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course',
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid',
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform',
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends',
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/',
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html',
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
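
Note that the list mixes absolute URLs with site-relative paths such as '/topics/ai'. A small sketch (my own addition, not from the book) that normalizes them into absolute URLs with urllib.parse.urljoin before requesting them:

from urllib.parse import urljoin

def normalizeLinks(links, baseUrl):
    # relative hrefs like "/topics/ai" are resolved against the page they came from;
    # links that are already absolute are returned unchanged
    return [urljoin(baseUrl, link) for link in links]

print(normalizeLinks(['/topics/ai', 'https://www.oreilly.com'], "http://oreilly.com"))
# ['http://oreilly.com/topics/ai', 'https://www.oreilly.com']
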
4. Collecting a site's external links

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" that do
    # not contain the current URL
    for link in bsObj.findAll("a",
                              href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
print(splitAddress(startingPage))
print(splitAddress(startingPage)[0])
externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
print(externalLinks)

The output is:
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['oreilly.com']
oreilly.com
['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']
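
A possible next step (my own sketch, not part of these notes) is to pick one random external link from each page and follow it, hopping from site to site; it reuses getExternalLinks and splitAddress from above:

import random
from urllib.request import urlopen
from bs4 import BeautifulSoup

def followExternalOnly(startingSite, hops=5):
    # assumes getExternalLinks and splitAddress are already defined above
    site = startingSite
    for _ in range(hops):
        bsObj = BeautifulSoup(urlopen(site), "lxml")
        externalLinks = getExternalLinks(bsObj, splitAddress(site)[0])
        if not externalLinks:
            print("No external links found on", site)
            return
        site = random.choice(externalLinks)
        print("Next external link:", site)

followExternalOnly("http://oreilly.com")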
