阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl
1.函数调用它自身,这样就形成了一个循环,一环套一环:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
global pages
html = urlopen("http://en.wikipedia.org"+pageUrl)
bsObj = BeautifulSoup(html,"lxml")
try:
print(bsObj.h1.get_text())
print(bsObj.find(id ="mw-content-text").findAll("p")[0]) //找到网页中 id=mw-content-text,然后在这个基础上查找"p"这个标签的内容 [0]则代表选择第0个
print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href']) //找到id=ca-edit里面的span标签里面的a标签里面的href的值
except AttributeError:
print("This page is missing something! No worries though!") for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
if 'href' in link.attrs:
if link.attrs['href'] not in pages:
#We have encountered a new page
newPage = link.attrs['href']
print(newPage)
pages.add(newPage)
getLinks(newPage) getLinks("")
2.对网址进行处理,通过"/"对网址中的字符进行分割
def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)
运行结果为:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004'] //两个//之间没有内容,所用用''表示
def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)
运行结果为:
runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
3.抓取网站的内部链接
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re #Retrieves a list of all Internal links found on a page
def getInternalLinks(bsObj, includeUrl):
internalLinks = []
#Finds all links that begin with a "/"
for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")):
if link.attrs['href'] is not None:
if link.attrs['href'] not in internalLinks:
internalLinks.append(link.attrs['href'])
return internalLinks startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html,"lxml") def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
print(internalLinks)
运行结果为(此页面内的所有内部链接):
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['https://www.oreilly.com', 'http://www.oreilly.com/ideas',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav',
'http://www.oreilly.com/conferences/', 'http://shop.oreilly.com/', 'http://members.oreilly.com', '/topics/ai', '/topics/business',
'/topics/data', '/topics/design', '/topics/economy', '/topics/operations', '/topics/security', '/topics/software-architecture', '/topics/software-engineering',
'/topics/web-programming', 'https://www.oreilly.com/topics',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+get+started+now',
'https://www.safaribooksonline.com/public/free-trial/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+start+free+trial',
'https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170710+homepage+sign+in',
'https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course',
'https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path',
'https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access', 'http://www.oreilly.com/live-training/?view=grid',
'https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform',
'https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends',
'https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles', 'http://www.oreilly.com/about/',
'http://www.oreilly.com/work-with-us.html', 'http://www.oreilly.com/careers/', 'http://shop.oreilly.com/category/customer-service.do', 'http://www.oreilly.com/about/contact.html',
'http://www.oreilly.com/emails/newsletters/', 'http://www.oreilly.com/terms/', 'http://www.oreilly.com/privacy.html', 'http://www.oreilly.com/about/editorial_independence.html']
4.抓取网站的外部链接
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re #Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
externalLinks = []
#Finds all links that start with "http" or "www" that do
#not contain the current URL
for link in bsObj.findAll("a",
href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")):
if link.attrs['href'] is not None:
if link.attrs['href'] not in externalLinks:
externalLinks.append(link.attrs['href'])
return externalLinks startingPage = "http://oreilly.com"
html = urlopen(startingPage)
bsObj = BeautifulSoup(html,"lxml") def splitAddress(address):
addressParts = address.replace("http://", "").split("/")
return addressParts print(splitAddress(startingPage))
print(splitAddress(startingPage)[0]) externalLinks = getExternalLinks(bsObj,splitAddress(startingPage)[0])
print(externalLinks)
运行结果为:
runfile('C:/Users/user/Desktop/untitled112.py', wdir='C:/Users/user/Desktop')
['oreilly.com']
oreilly.com
['https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf', 'http://twitter.com/oreillymedia', 'http://fb.co/OReilly', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia']
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---Crawl的更多相关文章
- 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---找出网页中所有的href 1.查找以<a>开头的所有文本,然后判断href是否在<a> ...
- 阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll
阅读OReilly.Web.Scraping.with.Python.2015.6笔记---BeautifulSoup---findAll 1..BeautifulSoup库的使用 Beautiful ...
- 首部讲Python爬虫电子书 Web Scraping with Python
首部python爬虫的电子书2015.6pdf<web scraping with python> http://pan.baidu.com/s/1jGL625g 可直接下载 waterm ...
- Web Scraping with Python读书笔记及思考
Web Scraping with Python读书笔记 标签(空格分隔): web scraping ,python 做数据抓取一定一定要明确:抓取\解析数据不是目的,目的是对数据的利用 一般的数据 ...
- <Web Scraping with Python>:Chapter 1 & 2
<Web Scraping with Python> Chapter 1 & 2: Your First Web Scraper & Advanced HTML Parsi ...
- Web scraping with Python (part II) « Jean, aka Sig(gg)
Web scraping with Python (part II) « Jean, aka Sig(gg) Web scraping with Python (part II)
- Web Scraping with Python
Python爬虫视频教程零基础小白到scrapy爬虫高手-轻松入门 https://item.taobao.com/item.htm?spm=a1z38n.10677092.0.0.482434a6E ...
- 《Web Scraping With Python》Chapter 2的学习笔记
You Don't Always Need a Hammer When Michelangelo was asked how he could sculpt a work of art as mast ...
- Web Scraping using Python Scrapy_BS4 - using BeautifulSoup and Python
Use BeautifulSoup and Python to scrap a website Lib: urllib Parsing HTML Data Web scraping script fr ...
随机推荐
- Python 实例方法
class Computer: # 实例方法 def play(self): print("电脑可以扫雷") # 在定义实例方法的时候. 必须给出一个参数 self # 形参的第一 ...
- Python网络爬虫第一弹《Python网络爬虫相关基础概念》
爬虫介绍 引入 之前在授课过程中,好多同学都问过我这样的一个问题:为什么要学习爬虫,学习爬虫能够为我们以后的发展带来那些好处?其实学习爬虫的原因和为我们以后发展带来的好处都是显而易见的,无论是从实际的 ...
- 获取列表中的最大的N项和最小的N项
获取列表中的最大的N项和最小的N项 #!/sur/bin/env python # -*- coding:utf-8 -*- # author:zengsf #time:2018/10/31 impo ...
- c日志宏
仅供参考,不推荐 #ifdef _DEBUG #define LOGDEBUG(format, ...)\ {\ FILE *fp = fopen("nccli.log", &qu ...
- SQLI DUMB SERIES-5
less5 (1)输入单引号,回显错误,说明存在注入点.输入的Id被一对单引号所包围,可以闭合单引号 (2)输入正常时:?id=1 说明没有显示位,因此不能使用联合查询了:可以使用报错注入,有两种方式 ...
- 嵌套for
- 基于Flask开发web微信
1. 获取二维码 app.py import re import time import requests from flask import Flask,render_template app = ...
- Chrome程序及数据位置变更到非系统盘
Chrome浏览器在Windows系统上安装过程,没有设置安装位置的步骤,所以默认是安装在C盘的.并且,若Chrome作为主要浏览器使用,随着时间的积累,数据文件会非常多.增加系统盘的负荷. Wind ...
- Go Example--关闭通道
package main import ( "fmt" ) func main() { jobs := make(chan int, 5) done := make(chan bo ...
- 文件I/0缓冲
设置stdio流缓冲模式 #include<stdio.h> int setvbuf(FILE *stream,char *buf,int mode,size_t size) int se ...