Reading notes on OReilly.Web.Scraping.with.Python.2015.6 --- Crawl
1. The function calls itself, forming a crawl loop in which each page leads to the next:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        # Find the element with id="mw-content-text", then look up its <p> tags; [0] selects the first one
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Inside id="ca-edit", find the <span>, then its <a>, and read the href value
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
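A caveat with the recursive version: Python caps recursion depth (roughly 1,000 frames by default), so on a site as large as Wikipedia this will eventually raise RecursionError. A minimal iterative sketch of the same crawl, using an explicit list as a stack (my own rewrite, not from the book):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def crawl(startUrl):
    pages = set()
    toVisit = [startUrl]                      # an explicit stack replaces the call stack
    while toVisit:
        pageUrl = toVisit.pop()
        html = urlopen("http://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "lxml")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if 'href' in link.attrs and link.attrs['href'] not in pages:
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)            # remember the page so it is crawled only once
                toVisit.append(newPage)

crawl("")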
2. Processing an address by splitting the URL string on "/":
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)
Output:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004']   # replace("http://", "") does not match the "https://" prefix, so the scheme survives; there is nothing between the two slashes, hence the empty string ''
def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)
Output:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
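String replacement like this is fragile: as the first run shows, replace("http://", "") does nothing for an https:// address. For anything beyond a quick note, the standard library's urllib.parse.urlparse extracts the host reliably; a small sketch (my own, not from the book):

from urllib.parse import urlparse

def getHost(address):
    # urlparse understands any scheme, so http and https both work
    return urlparse(address).netloc

print(getHost("https://hao.360.cn/?a1004"))                          # hao.360.cn
print(getHost("http://www.autohome.com.cn/wuhan/#pvareaid=100519"))  # www.autohome.com.cn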
3. Collecting a site's internal links
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with a "/" or that contain the current URL
    for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
print(internalLinks)
Output (all internal links found on this page):
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['https://www.oreilly.com', 'http://www.oreilly.com/ideas',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav',
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business',
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering',
'/topics/web-programming', 'https://www.oreilly.com/topics',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now',
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in',
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course',
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid',
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform',
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends',
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/',
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html',
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
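Note that the result mixes absolute URLs with relative paths such as '/topics/ai'. Before feeding these back into a crawler, it helps to normalize everything to absolute URLs; a minimal sketch using urllib.parse.urljoin (my own addition, not from the book):

from urllib.parse import urljoin

def normalizeLinks(links, baseUrl):
    # urljoin leaves absolute URLs untouched and resolves relative paths against baseUrl
    return [urljoin(baseUrl, link) for link in links]

print(normalizeLinks(['/topics/ai', 'http://shop.oreilly.com/'], "http://oreilly.com"))
# ['http://oreilly.com/topics/ai', 'http://shop.oreilly.com/']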
4. Collecting a site's external links
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" that do
    # not contain the current URL (the (?!...) is a negative lookahead)
    for link in bsObj.findAll("a",
            href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html, "lxml")
print(splitAddress(startingPage))
print(splitAddress(startingPage)[0])
externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
print(externalLinks)
Output:
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['oreilly.com']
oreilly.com
['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']
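Putting the pieces together, a crawler can hop from site to site by repeatedly picking one external link per page. The book builds a similar random walk; this condensed sketch (reusing getExternalLinks from above, with urlparse standing in for splitAddress so https sites also work) and its name followExternalLink are my own:

import random
from urllib.parse import urlparse
from urllib.request import urlopen
from bs4 import BeautifulSoup

def followExternalLink(startingSite, hops=5):
    site = startingSite
    for _ in range(hops):
        bsObj = BeautifulSoup(urlopen(site), "lxml")
        # urlparse(...).netloc yields the bare host, e.g. "oreilly.com"
        externalLinks = getExternalLinks(bsObj, urlparse(site).netloc)
        if not externalLinks:
            print("No external links found on", site)
            break
        site = random.choice(externalLinks)   # follow one external link at random
        print("Next stop:", site)

followExternalLink("http://oreilly.com")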