Spark Programming Guide

Link: http://spark.apache.org/docs/2.2.0/rdd-programming-guide.html

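Every Spark application consists of a driver program, which runs the user's main function and executes various parallel operations on a cluster.

Spark's primary abstraction is the RDD (since 2.0 the Dataset is recommended instead): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel and that recovers automatically from node failures.

RDDs are created by (1) starting with a file in the Hadoop file system (or any other Hadoop-supported file system), (2) starting with an existing Scala collection in the driver program, or (3) transforming an existing RDD.

Users can also persist an RDD in memory so that it can be reused efficiently.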

By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.

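Spark's second abstraction is shared variables, which can be used in parallel operations. They are shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables (which cache a value in memory on all nodes) and accumulators (variables that are only "added" to, such as counters and sums).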

Linking with Spark

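To write a Spark application, add a Maven dependency on Spark. If you need to access an HDFS cluster, also add a dependency on hadoop-client for your version of HDFS. Then import the Spark classes into your program.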

Initializing Spark

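A SparkContext object, which tells Spark how to access a cluster, is created from a SparkConf object.

master is a Spark, Mesos or YARN cluster URL, or a special "local" string to run in local mode. In practice the master is usually supplied through the spark-submit script rather than hardcoded in the program.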
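A minimal sketch of this initialization, assuming Spark 2.2 on the classpath. The object name InitSketch and the helper buildConf are hypothetical; the master is only set when one is passed explicitly, so that spark-submit --master can supply it otherwise.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InitSketch {
  // Builds a SparkConf. The master is optional so it can come from spark-submit instead.
  def buildConf(appName: String, master: Option[String]): SparkConf = {
    val conf = new SparkConf().setAppName(appName)
    master.foreach(m => conf.setMaster(m))
    conf
  }

  def main(args: Array[String]): Unit = {
    // Pass e.g. "local[2]" as the first argument when not launching through spark-submit;
    // without any master, creating the SparkContext fails.
    val sc = new SparkContext(buildConf("InitSketch", args.headOption))
    try println(s"master = ${sc.master}, app = ${sc.appName}")
    finally sc.stop()
  }
}
```

Leaving the master out of the code keeps the same jar usable for local testing and for cluster runs.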

Using the Shell

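$SPARK_HOME/bin/spark-shell accepts several options, for example --master to choose the master the context connects to and --jars to add JARs to the classpath. Inside the shell a SparkContext is already available as sc.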

Resilient Distributed Datasets (RDDs)

Internally, each RDD is characterized by five main properties:

A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
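Most of these properties can be inspected on any RDD through public methods. The sketch below is an illustration under assumptions: the object name and sample data are made up, and local[2] is hardcoded only so it runs standalone. The per-split compute function is not called directly here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPropertiesSketch {
  // Returns (number of partitions of the base RDD, whether the reduced RDD has a
  // partitioner, number of dependencies of the reduced RDD).
  def describe(sc: SparkContext): (Int, Boolean, Int) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val summed = pairs.reduceByKey(_ + _)

    // Partitions and their preferred locations (empty for an in-memory collection).
    pairs.partitions.foreach { p =>
      println(s"partition ${p.index}, preferred locations: ${pairs.preferredLocations(p)}")
    }
    // Dependencies on parent RDDs and the optional partitioner of the key-value RDD.
    println(s"dependencies: ${summed.dependencies}")
    println(s"partitioner: ${summed.partitioner}")

    (pairs.partitions.length, summed.partitioner.isDefined, summed.dependencies.size)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddPropertiesSketch").setMaster("local[2]"))
    try println(describe(sc)) finally sc.stop()
  }
}
```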

There are two ways to create RDDs: (1) parallelizing an existing collection in your driver program

(2) referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

(1) Parallelized Collections

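SparkContext's parallelize method parallelizes a collection: the elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

parallelize accepts an optional second argument that specifies the number of partitions to cut the dataset into. Spark runs one task for each partition.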
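A small sketch of parallelize with an explicit partition count. The object name and the numbers are illustrative, and local[2] is hardcoded only so the sketch runs standalone.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeSketch {
  // Distributes a local array over numSlices partitions and sums it in parallel.
  // Returns (number of partitions, sum of the elements).
  def sumWithPartitions(sc: SparkContext, numSlices: Int): (Int, Int) = {
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data, numSlices) // second argument: number of partitions
    (distData.partitions.length, distData.reduce((a, b) => a + b))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ParallelizeSketch").setMaster("local[2]"))
    try println(sumWithPartitions(sc, 10)) finally sc.stop()
  }
}
```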

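(2) External Datasets

Spark can create RDDs from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

SparkContext's textFile(URI) method creates an RDD from a text file and reads it as a collection of lines.

Notes:

(1) When using a path on the local filesystem, the file must be accessible at the same path on every worker node. Either copy the file to all workers or use a network-mounted shared file system.

(2) All of Spark's file-based input methods support directories, compressed files, and wildcards.

(3) textFile accepts an optional second argument that controls the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.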
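The following sketch reads a text file with an explicit minimum partition count. It writes its own temporary file so that it needs no external data; the object name, the helper writeSample and the file contents are assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Reads a text file as an RDD of lines; minPartitions is the optional second argument.
  def countLines(sc: SparkContext, uri: String, minPartitions: Int): Long = {
    val lines = sc.textFile(uri, minPartitions)
    // The actual partition count is decided by the input format and may differ from the hint.
    println(s"partitions: ${lines.partitions.length}")
    lines.count()
  }

  // Writes a three-line temporary file and returns its URI.
  def writeSample(): String = {
    val path = Files.createTempFile("sample", ".txt")
    Files.write(path, "first line\nsecond line\nthird line\n".getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("TextFileSketch").setMaster("local[2]"))
    try println(countLines(sc, writeSample(), 4)) finally sc.stop()
  }
}
```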

Apart from text files, Spark also supports:

(1) SparkContext.wholeTextFiles, which reads a directory containing multiple small text files and returns each of them as (filename, content) pairs.

(2) For SequenceFiles, use SparkContext's sequenceFile[K, V] method where K and V are the types of key and values in the file.

(3) For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method.

(4) RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
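As an illustration of item (4), here is a sketch of a round trip through saveAsObjectFile and objectFile. The object name and the temporary output location are assumptions; the output directory must not exist before saving.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileSketch {
  // Saves an RDD as serialized Java objects under dir, then reads it back.
  def roundTrip(sc: SparkContext, dir: String): Seq[Int] = {
    sc.parallelize(1 to 5, 2).saveAsObjectFile(dir)
    // Sort after reading so the result does not depend on the order of the part files.
    sc.objectFile[Int](dir).collect().sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ObjectFileSketch").setMaster("local[2]"))
    // Use a fresh child of a temp directory, because the target must not exist yet.
    val dir = Files.createTempDirectory("objfile").resolve("out").toUri.toString
    try println(roundTrip(sc, dir)) finally sc.stop()
  }
}
```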

RDD Operations

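RDDs support two types of operations:

(1) transformations, which create a new dataset from an existing one

(2) actions, which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation and reduce is an action. (There is also a parallel reduceByKey, however, which returns a distributed dataset.)

For efficiency, all transformations in Spark are lazy. The transformations are only computed when an action requires a result to be returned to the driver program. For example, a dataset created through map is usually consumed by a reduce, so only the result of the reduce has to be returned to the driver, not the larger mapped dataset. By default, each transformed RDD may be recomputed each time you run an action on it, so for efficiency you can keep an RDD in memory with persist or cache. RDDs can also be persisted on disk or replicated across multiple nodes.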

Basics

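These notes refer to the guide's three-line example: lines = sc.textFile("data.txt"), lineLengths = lines.map(s => s.length), totalLength = lineLengths.reduce((a, b) => a + b).

The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file.

In the second line, lineLengths is not computed immediately, because transformations are lazy.

The reduce line is an action. At this point Spark breaks the computation into tasks to run on separate machines. Each machine runs its part of the map and a local reduction, and only the result is returned to the driver program.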
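The three steps can be written out as a runnable sketch. The object name is hypothetical, and the input path defaults to the guide's data.txt unless a path is passed as the first argument.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BasicsSketch {
  // Sums the lengths of all lines. Nothing is computed until reduce is called.
  def totalLength(lines: RDD[String]): Int = {
    val lineLengths = lines.map(s => s.length) // transformation: lazy, nothing runs yet
    lineLengths.reduce((a, b) => a + b)        // action: triggers the computation
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("BasicsSketch").setMaster("local[2]"))
    try {
      // Defines the base RDD; the file is not read at this point.
      val lines = sc.textFile(args.headOption.getOrElse("data.txt"))
      println(totalLength(lines))
    } finally sc.stop()
  }
}
```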

Passing Functions to Spark

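Spark's API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:

(1) Anonymous function syntax, which can be used for short pieces of code.

(2) Static methods in a global singleton object.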
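Both styles side by side, as a sketch. MyFunctions.normalize and the sample words are hypothetical names and data chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A global singleton object: its methods can be passed to Spark without shipping an enclosing instance.
object MyFunctions {
  def normalize(s: String): String = s.trim.toLowerCase
}

object PassingFunctionsSketch {
  // Returns the normalized words and their lengths.
  def run(sc: SparkContext): (Seq[String], Seq[Int]) = {
    val words = sc.parallelize(Seq(" Spark ", "RDD", " Scala"))
    val normalized = words.map(MyFunctions.normalize) // method of a singleton object
    val lengths = normalized.map(s => s.length)       // anonymous function
    (normalized.collect().toSeq, lengths.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PassingFunctionsSketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```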

Understanding closures

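Before tasks run on the executors, Spark computes the task's closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). The closure is serialized and sent to each executor. In the guide's example, a counter variable defined in the driver is incremented inside rdd.foreach. The variables within the closure sent to each executor are copies of counter, not the counter in the driver program, so the driver's counter is not visible to the executors. When that code runs on a cluster, the final value of counter is still 0. In local mode it may happen to produce the correct result, but do not rely on this. Use the Accumulator that Spark provides to perform a global aggregation.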
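A sketch of the accumulator-based replacement for the counter pattern. The object and accumulator names are illustrative; the unsafe pattern is shown only as a comment.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClosureSketch {
  // Sums the elements with an accumulator instead of a driver-side var.
  def sum(sc: SparkContext): Long = {
    val rdd = sc.parallelize(1 to 10)

    // Unsafe pattern from the notes (do not use):
    //   var counter = 0
    //   rdd.foreach(x => counter += x) // each executor updates its own copy

    val acc = sc.longAccumulator("sum")
    rdd.foreach(x => acc.add(x)) // tasks only add; updates are merged on the driver
    acc.value                    // read on the driver
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ClosureSketch").setMaster("local[2]"))
    try println(sum(sc)) finally sc.stop()
  }
}
```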

Printing elements of an RDD

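A common idiom is to print the elements of an RDD with rdd.foreach(println) or rdd.map(println).

On a single machine: this prints all of the RDD's elements.

In cluster mode: the output is written to each executor's stdout, not to the stdout of the driver node.

To see the elements on the driver node:

(1) rdd.collect().foreach(println) brings the entire RDD into the driver's memory, which can cause the driver to run out of memory.

(2) To print only some of the elements, for example 100 of them, use rdd.take(100).foreach(println).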

Working with Key-Value Pairs

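Most operations work on RDDs containing any type of object, but a few are only available on RDDs of key-value pairs. The most common ones are distributed shuffle operations, such as grouping or aggregating the elements by a key.

For example, the reduceByKey operation on key-value pairs can be used to count how many times each line of text occurs in a file.

You could then use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.

Note: when a user-defined type is used as the key, it must override the equals() method together with a matching hashCode() method.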
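The line-counting example described above, as a sketch. The object name and sample input are assumptions; the function takes an RDD of lines so that it works with textFile or with an in-memory collection.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineCountSketch {
  // Counts how often each distinct line occurs, sorted alphabetically by line.
  def countLines(lines: RDD[String]): Array[(String, Int)] = {
    val pairs = lines.map(s => (s, 1))              // key-value pairs (line, 1)
    val counts = pairs.reduceByKey((a, b) => a + b) // shuffle: combine the values per key
    counts.sortByKey().collect()                    // bring the sorted pairs to the driver
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LineCountSketch").setMaster("local[2]"))
    try countLines(sc.parallelize(Seq("b", "a", "b"))).foreach(println)
    finally sc.stop()
  }
}
```

If a custom key type is needed, a Scala case class is a convenient choice because the compiler generates consistent equals and hashCode methods for it.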

Transformations & Actions

http://spark.apache.org/docs/2.2.0/api/scala/index.html#package

Shuffle operations

(1) A shuffle redistributes data across partitions. It typically involves copying data across executors and machines, which makes it a complex and costly operation (disk I/O, data serialization, and network I/O). Operations that can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

(2) Each partition is computed by a single task. Taking reduceByKey as an example: because the values for a single key may reside on different partitions or machines, Spark must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key. This is called the shuffle.

(3) Shuffle operations can consume significant amounts of heap memory and disk space.
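One way to see which operations introduce a shuffle is to look at an RDD's dependencies. The sketch below is an illustration, not a diagnostic tool: the object name and data are made up, and it compares mapValues (no shuffle) with reduceByKey (shuffle) on an unpartitioned input.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object ShuffleSketch {
  // Returns (mapValues has a narrow dependency, reduceByKey has a shuffle dependency).
  def dependencyKinds(sc: SparkContext): (Boolean, Boolean) = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val incremented = pairs.mapValues(_ + 1) // computed partition by partition
    val summed = pairs.reduceByKey(_ + _)    // values of one key must be brought together

    val narrow = incremented.dependencies.head.isInstanceOf[NarrowDependency[_]]
    val shuffled = summed.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
    println(summed.toDebugString) // prints the lineage of the reduced RDD
    (narrow, shuffled)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ShuffleSketch").setMaster("local[2]"))
    try println(dependencyKinds(sc)) finally sc.stop()
  }
}
```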

RDD Persistence

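To reuse an RDD or Dataset in later operations, persist it with the persist or cache method. The cache is fault-tolerant: if any partition is lost, it is automatically recomputed using the transformations that originally created it. As for when the data is stored: the first time it is computed in an action, it will be kept in memory on the nodes.

A StorageLevel specifies where an RDD is persisted: on disk, in memory, in memory as serialized Java objects (to save space), or replicated across nodes.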
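A sketch of persisting with an explicit storage level and releasing it again. The object name, the data and the choice of MEMORY_ONLY_SER are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (storage level after persist, sum of squares, storage level after unpersist).
  def run(sc: SparkContext): (StorageLevel, Long, StorageLevel) = {
    val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

    squares.persist(StorageLevel.MEMORY_ONLY_SER) // only marks the RDD; nothing is cached yet
    val levelWhilePersisted = squares.getStorageLevel

    squares.count()                   // first action: computes and stores the partitions
    val total = squares.reduce(_ + _) // later actions can reuse the stored partitions

    squares.unpersist()               // manual removal instead of waiting for eviction
    (levelWhilePersisted, total, squares.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PersistSketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```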

Which Storage Level to Choose?

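Storage levels offer different trade-offs between memory usage and CPU efficiency. The guide recommends going through the following process to select one: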

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

Removing Data

Spark automatically monitors cache usage on each node and drops old data in a least-recently-used (LRU) fashion. To remove an RDD manually, use the RDD.unpersist() method.

Shared Variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient.

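In other words, functions are passed to RDD operations in the driver program, and the actual computation runs in the executors on each node. The variables used on those nodes are independent copies, and updates to them are not sent back to the driver program.

Spark provides two kinds of shared variables: (1) broadcast variables (2) accumulators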

Broadcast Variables

A broadcast variable is a read-only variable that is cached on each machine rather than shipped as a copy with every task.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Creation: SparkContext.broadcast(v)

After the broadcast variable has been created, use it in place of the original value v, and do not modify v afterwards.
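A sketch of a broadcast lookup table. The object name, the table contents and the "unknown" fallback are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Translates numbers to names through a broadcast lookup table.
  def run(sc: SparkContext): Seq[String] = {
    val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three")
    val broadcastLookup = sc.broadcast(lookup) // sent to each machine once, read-only

    sc.parallelize(Seq(1, 2, 3, 4))
      // Tasks read the broadcast variable through .value instead of capturing lookup itself.
      .map(i => broadcastLookup.value.getOrElse(i, "unknown"))
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```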

Accumulators

Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.

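A numeric accumulator is created through the SparkContext by calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator(), for Long and Double values respectively.

Tasks running on the cluster can add to the accumulator with the add method, but they cannot read its value. Only the driver program can read the value, through the accumulator's value method.

Users can implement their own accumulator by subclassing AccumulatorV2. The methods to override include: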

reset for resetting the accumulator to zero,

add for adding another value into the accumulator,

merge for merging another same-type accumulator into this one
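A sketch of a custom accumulator that collects distinct words. The class name DistinctWordsAccumulator is hypothetical. Besides reset, add and merge, the abstract class also requires isZero, copy and value, which are included here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2

// Accumulates the set of distinct strings added by the tasks.
class DistinctWordsAccumulator extends AccumulatorV2[String, Set[String]] {
  private var words: Set[String] = Set.empty

  override def isZero: Boolean = words.isEmpty
  override def copy(): DistinctWordsAccumulator = {
    val acc = new DistinctWordsAccumulator
    acc.words = words
    acc
  }
  override def reset(): Unit = { words = Set.empty }  // back to the zero value
  override def add(v: String): Unit = { words += v }  // add one value
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = {
    words ++= other.value                             // merge another accumulator of the same type
  }
  override def value: Set[String] = words
}

object CustomAccumulatorSketch {
  def distinctWords(sc: SparkContext): Set[String] = {
    val acc = new DistinctWordsAccumulator
    sc.register(acc, "distinctWords") // custom accumulators must be registered
    sc.parallelize(Seq("a", "b", "a", "c"), 2).foreach(w => acc.add(w))
    acc.value
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("CustomAccumulatorSketch").setMaster("local[2]"))
    try println(distinctWords(sc)) finally sc.stop()
  }
}
```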

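For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed.

In other words: inside an action, each task's update is applied exactly once, and a restarted task does not update the value again. Inside a transformation, an update may be applied several times when the task is re-executed.

Accumulators do not change Spark's lazy evaluation model. If an accumulator is updated within an operation on an RDD, its value only changes once that RDD is computed as part of an action. Calling map alone therefore does not change the accumulator's value.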
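A sketch of this behaviour: the accumulator is read on the driver once before and once after an action. The object and accumulator names are illustrative, and the sketch assumes no task is re-executed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyAccumulatorSketch {
  // Returns the accumulator value (before the action, after the action).
  def run(sc: SparkContext): (Long, Long) = {
    val accum = sc.longAccumulator("lazy")
    val mapped = sc.parallelize(Seq(1, 2, 3, 4)).map { x => accum.add(x); x }

    val before: Long = accum.value // map is lazy, so nothing has been added yet
    mapped.count()                 // the action forces the map to run
    val after: Long = accum.value

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyAccumulatorSketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```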
