阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl

1.函数调用它自身,这样就形成了一个循环,一环套一环:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
global pages
html = urlopen("http://en.wikipedia.org"+pageUrl)
bsObj = BeautifulSoup(html,"lxml")
try:
print(bsObj.h1.get_text())
print(bsObj.find(id ="mw-content-text").findAll("p")[0]) //找到网页中 id=mw-content-text,然后在这个基础上查找"p"这个标签的内容 [0]则代表选择第0个
print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href']) //找到id=ca-edit里面的span标签里面的a标签里面的href的值
except AttributeError:
print("This page is missing something! No worries though!") for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
if 'href' in link.attrs:
if link.attrs['href'] not in pages:
#We have encountered a new page
newPage = link.attrs['href']
print(newPage)
pages.add(newPage)
getLinks(newPage) getLinks("")

 2.对网址进行处理,通过"/"对网址中的字符进行分割

def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)

运行结果为:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004'] //两个//之间没有内容,所用用''表示

  

def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)

运行结果为:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']

3.抓取网站的内部链接

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re #Retrieves a list of all Internal links found on a page
def getInternalLinks(bsObj, includeUrl):
internalLinks = []
#Finds all links that begin with a "/"
for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
if link.attrs['href'] is not None:
if link.attrs['href'] not in internalLinks:
internalLinks.append(link.attrs['href'])
return internalLinks startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html,"lxml") def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
print(internalLinks)

运行结果为(此页面内的所有内部链接):

runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['https://www.oreilly.com', 'http://www.oreilly.com/ideas',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav',
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business',
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering',
'/topics/web-programming', 'https://www.oreilly.com/topics',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now',
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in',
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course',
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid',
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform',
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends',
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/',
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html',
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
4.抓取网站的外部链接
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re #Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
externalLinks = []
#Finds all links that start with "http" or "www" that do
#not contain the current URL
for link in bsObj.findAll("a",
href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
if link.attrs['href'] is not None:
if link.attrs['href'] not in externalLinks:
externalLinks.append(link.attrs['href'])
return externalLinks startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html,"lxml") def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts print(splitAddress(startingPage))
print(splitAddress(startingPage)[0]) externalLinks = getExternalLinks(bsObj,splitAddress(startingPage)[0])
print(externalLinks)
运行结果为:
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['oreilly.com']
oreilly.com
['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']

阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl的更多相关文章

  1. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href 1.查找以<a>开头的所有文本,然后判断href是否在<a> ...

  2. 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll

    阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...

  3. 首部讲Python爬虫电子书 Web Scraping with Python

    首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...

  4. Web Scraping with Python读书笔记及思考

    Web Scraping with Python读书笔记 标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用 一般的数据 ...

  5. <Web Scraping with Python>:Chapter 1 & 2

    <Web Scraping with Python> Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsi ...

  6. Web scraping with Python (part II) « Jean, aka Sig(gg)

    Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)

  7. Web Scraping with Python

    Python爬虫视频教程零基础小白到scrapy爬虫高手-轻松入门 https://item.taobao.com/item.htm?spm=a1z38n.10677092.0.0.482434a6E ...

  8. 《Web Scraping With Python》Chapter 2的学习笔记

    You Don't Always Need a Hammer When Michelangelo was asked how he could sculpt a work of art as mast ...

  9. Web Scraping using Python Scrapy_BS4 - using BeautifulSoup and Python

    Use BeautifulSoup and Python to scrap a website Lib: urllib Parsing HTML Data Web scraping script fr ...

随机推荐

  1. 反转链表(python3)

    问题描述: 反转一个单链表. 示例: 输入: 1->2->3->4->5->NULL 输出: 5->4->3->2->1->NULL解法1: ...

  2. SQL注入之Sqli-labs系列第十四关(基于双引号POST报错注入)

    开始挑战第十四关(Double Injection- Double quotes- String) 访问地址,输入报错语句 '  ''  ')  ") - 等使其报错 分析报错信息 很明显是 ...

  3. SpringMVC详细示例实战教程(较全开发教程)

    SpringMVC学习笔记---- 一.SpringMVC基础入门,创建一个HelloWorld程序 1.首先,导入SpringMVC需要的jar包. 2.添加Web.xml配置文件中关于Spring ...

  4. 第七十五课 图的遍历(DFS)

    添加DFS函数: #ifndef GRAPH_H #define GRAPH_H #include "Object.h" #include "SharedPointer. ...

  5. TrueCrypt 7.1a Hashes

    Here are the SHA256, SHA1, and MD5 hashes of all TrueCrypt version 7.1a files. The signature of the ...

  6. 数据文件resize扩容

    表空间不足 Alert日志报错 Mon Dec :: GMT+: Incremental checkpoint up to RBA[ox1af2d.3ddll.], current log tail ...

  7. Python之路PythonNet,第三篇,网络3

    pythonnet   网络3 udp 通信 recvfrom sendtofork 多进程并发threading 多线程并发socketserver 系统模块 套接字的属性 setsockopt g ...

  8. array的方法 没记住的

    reserve() 是倒叙: sort() 拍序,按字符编码排序,可以传一个参数 reduce() 实例:判断一个数组里参数的个数 var arr = ["apple"," ...

  9. 02 JDBC相关

    ====================================================================================JDBC JAVA Databa ...

  10. 小白python 安装

    小白python 安装: https://blog.csdn.net/qq_36667170/article/details/79275605 https://blog.csdn.net/nmjuzi ...