Spark 学习笔记：（二）编程指引（Scala版）

参考：　　http://spark.apache.org/docs/latest/programming-guide.html　　　　后面懒得翻译了，英文记的，以后复习时再翻。

摘要：每个Spark application包含一个driver program 来运行main 函数，在集群上进行各种并行操作。 RDD是Spark的核心。除了RDD，Spark的另一个抽象时并行操作中使用的两种 shared variables： broadcast variables和accumulators.

Spark’的shell ： bin/spark-shell ( Scala ) ; bin/pyspark ( Python ).

0.Linking with spark＝>Initialing spark=>programming=>submit

首先要创建一个SparkContext object, 来告诉 Spark 怎样接入一个集群（cluster），创建一个SparkContext之前还要先创建一个SparkConf object t包含application信息，如下.

val conf = new SparkConf().setAppName(appName).setMaster(master)

new SparkContext(conf)

appName是你的app在集群UI上的名字。master参数包括： Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.

PS：在Spark shell中, 一个特别的 SparkContext 已经创建好，名字为sc。想让自己的SparkContext 工作需要使用--master 命令。

Once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.

1.RDD

1）Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

val data = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)

Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

scala> val distFile = sc.textFile("data.txt")

distFile: RDD[String] = MappedRDD@1d4cee08

SparkContext’s textFile method takes an URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines.Once created, distFile can be acted on by dataset operations.

2）RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory (also support disk, or replicated across multiple nodes) using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. Storage level: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, MEMORY_AND_DISK, DISK_ONLY...

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:

object MyFunctions {

  def func1(s: String): String = { ... }

}

myRdd.map(MyFunctions.func1)

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org.apache.spark.SparkContext._ in your program to enable Spark’s implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions.

val lines = sc.textFile("data.txt")

val pairs = lines.map(s => (s, 1))

val counts = pairs.reduceByKey((a, b) => a + b)

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Operations which can cause a shuffle include repartition operations , ‘ByKey operations (except for counting) , and join operations.To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce.

常用operations：（具体查文档）

map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cartsian

常用actions：

reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach

2.Shared Variables

As operations are lazy, read-write shared variables across tasks would be inefficient. So Spark provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums.

scala> val accum = sc.accumulator(0, "My Accumulator")

accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

...

10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value

res2: Int = 10

之后读一些例子。强烈推荐：http://dongxicheng.org/framework-on-yarn/spark-scala-writing-application/

以及doc上LR的一个完整例子：

import org.apache.spark.SparkContext

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}

import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0).cache()

val test = splits(1)

// Run training algorithm to build the model

val model = new LogisticRegressionWithLBFGS()

  .setNumClasses(10)

  .run(training)

// Compute raw scores on the test set.

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>

  val prediction = model.predict(features)

  (prediction, label)

}

// Get evaluation metrics.

val metrics = new MulticlassMetrics(predictionAndLabels)

val precision = metrics.precision

println("Precision = " + precision)

// Save and load model

model.save(sc, "myModelPath")

val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

import org.apache.spark.SparkContext

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}

import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0).cache()

val test = splits(1)

// Run training algorithm to build the model

val model = new LogisticRegressionWithLBFGS()

  .setNumClasses(10)

  .run(training)

// Compute raw scores on the test set.

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>

  val prediction = model.predict(features)

  (prediction, label)

}

// Get evaluation metrics.

val metrics = new MulticlassMetrics(predictionAndLabels)

val precision = metrics.precision

println("Precision = " + precision)

// Save and load model

model.save(sc, "myModelPath")

val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

Spark 学习笔记：（二）编程指引（Scala版）的更多相关文章

Spark学习笔记——RDD编程
1.RDD——弹性分布式数据集(Resilient Distributed Dataset) RDD是一个分布式的元素集合,在Spark中,对数据的操作就是创建RDD.转换已有的RDD和调用RDD操作 ...
Spark学习笔记3（IDEA编写scala代码并打包上传集群运行）
Spark学习笔记3 IDEA编写scala代码并打包上传集群运行我们在IDEA上的maven项目已经搭建完成了,现在可以写一个简单的spark代码并且打成jar包上传至集群,来检验一下我们的sp ...
学习笔记(二)--->《Java 8编程官方参考教程（第9版）.pdf》:第七章到九章学习笔记
注:本文声明事项. 本博文整理者:刘军本博文出自于: <Java8 编程官方参考教程>一书声明:1:转载请标注出处.本文不得作为商业活动.若有违本之,则本人不负法律责任.违法者自负一切 ...
Spark学习笔记之SparkRDD
Spark学习笔记之SparkRDD 一. 基本概念 RDD(resilient distributed datasets)弹性分布式数据集. 来自于两方面 ① 内存集合和外部存储系统 ② ...
spark学习笔记总结-spark入门资料精化
Spark学习笔记 Spark简介 spark 可以很容易和yarn结合,直接调用HDFS.Hbase上面的数据,和hadoop结合.配置很容易. spark发展迅猛,框架比hadoop更加灵活实用. ...
NumPy学习笔记二
NumPy学习笔记二 <NumPy学习笔记>系列将记录学习NumPy过程中的动手笔记,前期的参考书是<Python数据分析基础教程 NumPy学习指南>第二版.<数学分 ...
Spark学习笔记0——简单了解和技术架构
目录 Spark学习笔记0--简单了解和技术架构什么是Spark 技术架构和软件栈 Spark Core Spark SQL Spark Streaming MLlib GraphX 集群管理器受 ...
[Firefly引擎][学习笔记二][已完结]卡牌游戏开发模型的设计
源地址:http://bbs.9miao.com/thread-44603-1-1.html 在此补充一下Socket的验证机制:socket登陆验证.会采用session会话超时的机制做心跳接口验证 ...
OD调试学习笔记7—去除未注册版软件的使用次数限制
OD调试学习笔记7—去除未注册版软件的使用次数限制本节使用的软件链接 (想自己试验下的可以下载) 一:破解的思路仔细观察一个程序,我们会发现,无论在怎么加密,无论加密哪里,这个程序加密的目的就是需 ...
Spark学习笔记2（spark所需环境配置
Spark学习笔记2 配置spark所需环境 1.首先先把本地的maven的压缩包解压到本地文件夹中,安装好本地的maven客户端程序,版本没有什么要求不需要最新版的maven客户端. 解压完成之后 ...

随机推荐

HDU 5441 Travel
Travel Time Limit: 1500/1000 MS (Java/Others) Memory Limit: 131072/131072 K (Java/Others) Descrip ...
Leetcode 357.计算各个位数不同的数字个数
计算各个位数不同的数字个数给定一个非负整数 n,计算各位数字都不同的数字 x 的个数,其中 0 ≤ x < 10n . 示例: 输入: 2 输出: 91 解释: 答案应为除去 11,22,33 ...
main()中的参数argc, argv
转自:http://blog.csdn.net/eastmount/article/details/20413773 一.main()函数参数通常我们在写主函数时都是void main()或int ...
[uiautomator篇] 获取当前页面的方法
Uiautomator 在2.0之前的版本里就提供了getCurrentActivity()的方法,但返回内容不正确:2.0 版本今天尝试了下,还是返回有问题的: 有点没描述清楚啊,是在uiautom ...
mac上安装ruby
(转:http://www.cnblogs.com/daguo/p/4097263.html) 以下代码区域,带有 $ 打头的表示需要在控制台(终端)下面执行(不包括 $ 符号) 步骤0 - 安装系统 ...
【bzoj3671】[Noi2014]随机数生成器贪心
题目描述输入第1行包含5个整数,依次为 x_0,a,b,c,d ,描述小H采用的随机数生成算法所需的随机种子.第2行包含三个整数 N,M,Q ,表示小H希望生成一个1到 N×M 的排列来填入她 N ...
C/C++ 命令行参数的实现方法
解析从命令行提供的参数可以使用 getopt函数. To use this facility, your program must include the header file unistd.h u ...
HDU——4549M斐波那契数列（矩阵快速幂+快速幂+费马小定理）
M斐波那契数列 Time Limit: 3000/1000 MS (Java/Others) Memory Limit: 65535/32768 K (Java/Others) Total Su ...
P3694 邦邦的大合唱站队 (状压DP)
题目背景 BanG Dream!里的所有偶像乐队要一起大合唱,不过在排队上出了一些问题. 题目描述 N个偶像排成一列,他们来自M个不同的乐队.每个团队至少有一个偶像. 现在要求重新安排队列,使来自同一 ...
bzoj2144 跳跳棋二分
[bzoj2144]跳跳棋 Description 跳跳棋是在一条数轴上进行的.棋子只能摆在整点上.每个点不能摆超过一个棋子.我们用跳跳棋来做一个简单的游戏:棋盘上有3颗棋子,分别在a,b,c这三个位 ...

Spark 学习笔记：（二）编程指引（Scala版）

Spark 学习笔记：（二）编程指引（Scala版）的更多相关文章

随机推荐

热门专题