Spark persistence: cache, persist, checkpoint
1. cache
cache is effectively a shorthand for persist. It is lazily executed: nothing is stored until an action operator is triggered. The return value of cache() must be assigned to a variable, and subsequent jobs should operate on that variable.
cache example:
package SparkStreaming;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Persist_Cache {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local")
                .setAppName("Persist");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> stringJavaRDD = sc.textFile("E:/2018_cnic/learn/wordcount.txt");
        JavaRDD<String> cache = stringJavaRDD.cache();

        // First count: the RDD is not yet materialized, so this reads the file
        // from disk and populates the cache as a side effect.
        long startTime = System.currentTimeMillis();
        long count = cache.count();
        long endTime = System.currentTimeMillis();
        System.out.println("no cache duration:" + (endTime - startTime));

        // Second count: served from the cached blocks in memory.
        long startTime1 = System.currentTimeMillis();
        long count1 = cache.count();
        long endTime1 = System.currentTimeMillis();
        System.out.println("cache duration:" + (endTime1 - startTime1));

        sc.stop();
    }
}
Output:
// INFO DAGScheduler: Job finished: count at Persist_Cache.java, took 0.202060 s
no cache duration:248
// INFO SparkContext: Starting job: count at Persist_Cache.java
// INFO BlockManager: Found block rdd_1_0 locally
// INFO DAGScheduler: ResultStage (count at Persist_Cache.java) finished in 0.027 s
// INFO DAGScheduler: Job finished: count at Persist_Cache.java, took 0.028863 s
cache duration:
// INFO SparkContext: Invoking stop() from shutdown hook
Note the line "Found block rdd_1_0 locally": the second count reads the cached partition instead of re-reading the file, so the second job finishes far faster than the first.
2. persist
package SparkStreaming;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class Persist_Cache {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local")
                .setAppName("Persist");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> stringJavaRDD = sc.textFile("E:/2018_cnic/learn/wordcount.txt");
        // StorageLevel.NONE stores nothing, so both counts recompute the RDD.
        JavaRDD<String> persist = stringJavaRDD.persist(StorageLevel.NONE());

        long startTime = System.currentTimeMillis();
        long count = persist.count();
        long endTime = System.currentTimeMillis();
        System.out.println("no cache duration:" + (endTime - startTime));

        long startTime1 = System.currentTimeMillis();
        long count1 = persist.count();
        long endTime1 = System.currentTimeMillis();
        System.out.println("cache duration:" + (endTime1 - startTime1));

        sc.stop();
    }
}
Output: the second count is still faster, but the speedup comes from internal optimizations, not from persistence (StorageLevel.NONE stores nothing).
// INFO DAGScheduler: Job finished: count at Persist_Cache.java, took 0.228634 s
no cache duration:
// INFO SparkContext: Starting job: count at Persist_Cache.java
// INFO HadoopRDD: Input split: file:/E:/2018_cnic/learn/wordcount.txt
// INFO DAGScheduler: ResultStage (count at Persist_Cache.java) finished in 0.023 s
// INFO DAGScheduler: Job finished: count at Persist_Cache.java, took 0.025025 s
cache duration:
// INFO SparkContext: Invoking stop() from shutdown hook
Unlike the cache example, the second job still logs "HadoopRDD: Input split": the file is read again from disk because nothing was persisted.
3. persist source analysis:
class StorageLevel private(
    private var _useDisk: Boolean,      // store on disk
    private var _useMemory: Boolean,    // store in memory
    private var _useOffHeap: Boolean,   // use off-heap memory
    private var _deserialized: Boolean, // keep as deserialized objects (no serialization)
    private var _replication: Int = 1   // number of replicas
)

val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)       // not serialized
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)    // partitions that do not fit in memory spill to disk
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)        // uses off-heap memory
The unit of persistence is the partition.
For example, if an RDD has 3 partitions and the level is MEMORY_AND_DISK, as many whole partitions as fit are kept in memory (a partition is never half-stored), and the remainder go to disk.
MEMORY_AND_DISK_2: how many nodes hold the data? It is not fixed: where each partition lands is not deterministic, and neither is the location of its replica.
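To make the flag combinations concrete, here is a small dependency-free Java sketch that models the constructor above. `Level`, `StorageLevelDemo`, and the field names are illustrative stand-ins, not Spark's real org.apache.spark.storage.StorageLevel:

```java
// Illustrative only: a simplified stand-in for Spark's StorageLevel,
// showing what each constructor flag encodes.
class Level {
    final boolean useDisk, useMemory, useOffHeap, deserialized;
    final int replication;

    Level(boolean useDisk, boolean useMemory, boolean useOffHeap,
          boolean deserialized, int replication) {
        this.useDisk = useDisk;
        this.useMemory = useMemory;
        this.useOffHeap = useOffHeap;
        this.deserialized = deserialized;
        this.replication = replication;
    }
}

public class StorageLevelDemo {
    // Mirrors the flag combinations from the Spark source above.
    static final Level MEMORY_ONLY       = new Level(false, true, false, true, 1);
    static final Level MEMORY_ONLY_SER   = new Level(false, true, false, false, 1);
    static final Level MEMORY_AND_DISK_2 = new Level(true, true, false, true, 2);

    public static void main(String[] args) {
        // _SER variants differ from their base level only in the deserialized flag.
        System.out.println("MEMORY_ONLY deserialized: " + MEMORY_ONLY.deserialized);
        System.out.println("MEMORY_ONLY_SER deserialized: " + MEMORY_ONLY_SER.deserialized);
        // _2 variants keep a second replica of every partition.
        System.out.println("MEMORY_AND_DISK_2 replication: " + MEMORY_AND_DISK_2.replication);
    }
}
```

Reading the levels this way makes the naming scheme mechanical: `_SER` flips only the deserialized flag, and `_2` only raises replication to 2.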
4. cache source analysis: cache calls persist with the default storage level, StorageLevel.MEMORY_ONLY
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def cache(): this.type = persist()
5. checkpoint
checkpoint writes RDD data to HDFS, which is relatively safe because HDFS keeps replicas.
checkpoint is also lazily executed. Usage (the directory is set on the SparkContext via setCheckpointDir):
sc.setCheckpointDir("path")
rdd3.checkpoint()
Note: after the RDD's job completes (an action operator has been triggered):
1. Spark traces back from the final RDD, looking for RDDs on which checkpoint() was called, and marks them. It then launches a new job to recompute each marked RDD and writes the result to the configured HDFS directory.
2. At the same time, the lineage of the checkpointed RDD is truncated: its dependency is replaced with a CheckpointRDD.
Optimization when calling checkpoint:
Because the checkpointed RDD would otherwise be computed twice, call cache() on it before the action runs. The extra job launched after the main job then only has to copy the data from memory to HDFS, skipping the recomputation.
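The cache-before-checkpoint optimization can be sketched as follows. This is a minimal sketch, not a production job: `CheckpointDemo` is a hypothetical class name, the local paths are placeholders carried over from the examples above, and on a real cluster the checkpoint directory would be an HDFS URI:

    package SparkStreaming;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CheckpointDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setMaster("local").setAppName("Checkpoint");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // On a real cluster this would be an HDFS path, e.g. "hdfs://...".
            sc.setCheckpointDir("E:/2018_cnic/learn/checkpoint");

            JavaRDD<String> rdd3 = sc.textFile("E:/2018_cnic/learn/wordcount.txt");
            rdd3.cache();        // avoids recomputation by the extra checkpoint job
            rdd3.checkpoint();   // lazy: only marks the RDD; written after the first action

            rdd3.count();        // triggers the job; the follow-up checkpoint job then
                                 // copies the cached partitions to the checkpoint directory
            sc.stop();
        }
    }

The order matters: cache() and checkpoint() must both be called before the action, since the checkpoint job runs immediately after the action's job completes.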