9.spark Core 进阶2--Cashe

RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala,Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is:

Storage Level	Meaning
MEMORY_ONLY	Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK	Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER (Java and Scala)	Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER (Java and Scala)	Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY	Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.	Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental)	Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.

Note: In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, and DISK_ONLY_2.

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

Which Storage Level to Choose?

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

Removing Data

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

9.spark Core 进阶2--Cashe的更多相关文章

8.spark Core 进阶1
(e.g. standalone manager, Mesos, YARN) In "cluster" mode, the framework launches the ...
Spark 3.x Spark Core详解 & 性能优化
Spark Core 1. 概述 Spark 是一种基于内存的快速.通用.可扩展的大数据分析计算引擎 1.1 Hadoop vs Spark 上面流程对应Hadoop的处理流程,下面对应着Spark的 ...
Spark Streaming揭秘 Day35 Spark core思考
Spark Streaming揭秘 Day35 Spark core思考 Spark上的子框架,都是后来加上去的.都是在Spark core上完成的,所有框架一切的实现最终还是由Spark core来 ...
【Spark Core】任务运行机制和Task源代码浅析1
引言上一小节<TaskScheduler源代码与任务提交原理浅析2>介绍了Driver側将Stage进行划分.依据Executor闲置情况分发任务,终于通过DriverActor向exe ...
TypeError: Error #1034: 强制转换类型失败:无法将 mx.controls::DataGrid@9a7c0a1 转换为 spark.core.IViewport。
1.错误描述 TypeError: Error #1034: 强制转换类型失败:无法将 mx.controls::DataGrid@9aa90a1 转换为 spark.core.IViewport. ...
Spark Core
Spark Core DAG概念有向无环图 Spark会根据用户提交的计算逻辑中的RDD的转换(变换方法)和动作(action方法)来生成RDD之间的依赖关系,同时 ...
Spark Streaming 进阶与案例实战
Spark Streaming 进阶与案例实战 1.带状态的算子: UpdateStateByKey 2.实战:计算到目前位置累积出现的单词个数写入到MySql中 1.create table CRE ...
spark core （二）
一.Spark-Shell交互式工具 1.Spark-Shell交互式工具 Spark-Shell提供了一种学习API的简单方式, 以及一个能够交互式分析数据的强大工具. 在Scala语言环境下或Py ...
Spark Core 资源调度与任务调度（standalone client 流程描述）
Spark Core 资源调度与任务调度(standalone client 流程描述) Spark集群启动: 集群启动后,Worker会向Master汇报资源情况(实际上将Worker的资 ...

随机推荐

Feign 系列（01）最简使用姿态
目录 Feign 系列(01)最简使用姿态 1. 引入 maven 依赖 2. 基本用法 3. Feign 声明式注解 Feign 系列(01)最简使用姿态 Spring Cloud 系列目录(htt ...
初识微服务框架ServiceComb
https://blog.csdn.net/zengdongwen/article/details/93486257 后续跟进学习.
页面上有3个输入框：分别为max，min，num；三个按钮：分别为生成，排序，去重；在输入框输入三个数字后，先点击生成按钮，生成一个数组长度为num，值为max到min之间的随机整数点击排序，对当前数组进行排序，点击去重，对当前数组进行去重。每次点击之后使结果显示在控制台
<!DOCTYPE html> <html> <head> <!-- 页面上有3个输入框:分别为max,min,num:三个按钮:分别为生成,排序,去重: 在 ...
jdk源码阅读-ConcurrentLinkedQueue(一)
说明 concurrentLinkedQueue为无界非阻塞队列,是线程安全的内部结构为链表的形式, 内部使用cas保存线程安全.采用cas保证原子性什么是CAS CAS 操作包含三个操作数 —— ...
（一）环境搭建——Django
实验环境准备: 安装django # cmd中 pip install Django 第一个django项目HelloWorld # 在D:/Python test 下创建一个helloworld项目 ...
5. java运算符
1.算术运算符注意: % 取余数 (1)自增 (++)前自增:先自增完毕,再运算整个表达式,语句分号前面的都是运算表达式: 后自增,先运算完整个表达式(分号前面的都是表达式),再进行自增: 2.赋值 ...
bzoj1009题解
[解题思路] 先KMP出fail数组,再用fail数组求出M[i][j],表示上一次匹配到第i位,这次可以遇到多少种不同的字符,使之转而匹配到第j位. 设集合S=[1,m]∩N 又设f[i][j]表示 ...
NX二次开发-UFUN获取显示在NX交互界面的对象UF_OBJ_is_displayable
NX9+VS2012 #include <uf.h> #include <uf_disp.h> #include <uf_obj.h> #include <u ...
NX二次开发-UFUN编辑图层类别名字UF_LAYER_edit_category_name
NX11+VS2013 #include <uf.h> #include <uf_layer.h> UF_initialize(); //创建图层类别 UF_LAYER_cat ...
NX二次开发-设置功能区工具栏的可见性UF_UI_set_ribbon_vis
NX9+VS2012 1.打开D:\Program Files\Siemens\NX 9.0\UGII\menus\ug_main.men 找到装配和PMI,在中间加上一段 TOGGLE_BUTTON ...

9.spark Core 进阶2--Cashe

9.spark Core 进阶2--Cashe的更多相关文章

随机推荐

热门专题