Spark Learning Notes (2): Programming Guide (Scala Edition)
Reference: http://spark.apache.org/docs/latest/programming-guide.html — the later parts were taken down in English rather than translated; I'll translate them when I review.
Summary: Every Spark application contains a driver program that runs the main function and performs various parallel operations on a cluster. RDDs are the core of Spark. Besides RDDs, Spark's other abstraction is the two kinds of shared variables used in parallel operations: broadcast variables and accumulators.
Spark's shells: bin/spark-shell (Scala); bin/pyspark (Python).
0. Linking with Spark => Initializing Spark => programming => submit
First, create a SparkContext object, which tells Spark how to access a cluster. Before creating a SparkContext, you need to create a SparkConf object that contains information about your application, as follows:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
appName is the name your app shows on the cluster UI. The master parameter is a Spark, Mesos or YARN cluster URL, or a special "local" string to run in local mode.
PS: In the Spark shell, a special SparkContext is already created for you, named sc. Making your own SparkContext will not work there; instead, you set which master the context connects to with the --master option when launching the shell.
Once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework's tearDown method, as Spark does not support two contexts running concurrently in the same program.
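A minimal sketch of such a test, not tied to any particular framework (the assertion, RDD contents, and variable names are illustrative, not from the original notes):

import org.apache.spark.{SparkConf, SparkContext}

// Build a local context, run a small job, and always stop the context afterwards.
val testSc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
try {
  val result = testSc.parallelize(1 to 4).map(_ * 2).collect()
  assert(result.sameElements(Array(2, 4, 6, 8)))
} finally {
  testSc.stop()  // tear down so another context can be created later
}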
1.RDD
1)Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
- Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
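As a quick check (a small sketch; the count of 10 is just an illustration), the resulting partition count can be read back from the RDD:

val distData10 = sc.parallelize(data, 10)  // data from the snippet above
println(distData10.partitions.length)      // prints 10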
- Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
SparkContext's textFile method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Once created, distFile can be acted on by dataset operations.
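For example, following the guide's map/reduce snippet, the lengths of all lines can be summed like this:

val totalLength = distFile.map(s => s.length).reduce((a, b) => a + b)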
2)RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
- All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory (disk storage and replication across multiple nodes are also supported) using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. Storage levels: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, ...
- Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)
- While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by a key. In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org.apache.spark.SparkContext._ in your program to enable Spark's implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
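We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.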
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for re-distributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Operations which can cause a shuffle include repartition operations, 'ByKey' operations (except for counting), and join operations. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce.
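A few illustrative shuffle-triggering calls, reusing the pairs RDD from above (the second RDD in the join is a placeholder):

val repartitioned = pairs.repartition(8)           // repartition operation: moves data between partitions
val grouped = pairs.groupByKey()                   // 'ByKey' operation: groups all values for each key
val other = sc.parallelize(Seq(("spark", 1)))
val joined = pairs.join(other)                     // join operation: matches keys across two RDDs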
Common transformations (see the docs for details):
map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cartesian
Common actions:
reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach
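A small sketch exercising a few of these operations and actions (the input values are illustrative):

val nums = sc.parallelize(Seq(3, 1, 2, 3, 4))
val squares = nums.distinct().map(x => x * x)  // transformations: distinct, map (lazy)
val howMany = squares.count()                  // action: 4 distinct elements
val firstTwo = squares.take(2)                 // action: first two elements as a local Array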
2.Shared Variables
As operations are lazy, read-write shared variables across tasks would be inefficient. So Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
- Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
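A hedged sketch of reading the broadcast value inside tasks (the index lookup is just for illustration):

val indices = sc.parallelize(Seq(0, 2))
// Reference broadcastVar.value inside the closure instead of capturing the raw Array.
val picked = indices.map(i => broadcastVar.value(i)).collect()  // Array(1, 3)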
- Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums.
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
scala> accum.value
res2: Int = 10
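A sketch of the counter pattern, e.g. counting blank lines while splitting a file into words (the file name and parsing are illustrative; the old accumulator API matches the snippet above):

val blankLines = sc.accumulator(0, "Blank Lines")
val words = sc.textFile("data.txt").flatMap { line =>
  if (line.trim.isEmpty) blankLines += 1   // updated inside tasks on the executors
  line.split(" ")
}
words.count()                              // an action must run before the updates happen
println(blankLines.value)                  // only the driver reads the accumulated value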
After this, read through some examples. Strongly recommended: http://dongxicheng.org/framework-on-yarn/spark-scala-writing-application/
As well as the complete LR (logistic regression) example from the docs:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision)

// Save and load model.
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")