Spark 学习笔记:(二)编程指引(Scala版)
参考: http://spark.apache.org/docs/latest/programming-guide.html 后面懒得翻译了,英文记的,以后复习时再翻。
摘要:每个Spark application包含一个driver program 来运行main 函数,在集群上进行各种并行操作。 RDD是Spark的核心。除了RDD,Spark的另一个抽象时并行操作中使用的两种 shared variables: broadcast variables和accumulators.
Spark’的shell : bin/spark-shell ( Scala ) ; bin/pyspark ( Python ).
0.Linking with spark=>Initialing spark=>programming=>submit
首先要创建一个SparkContext object, 来告诉 Spark 怎样接入一个集群(cluster),创建一个SparkContext之前还要先创建一个SparkConf object t包含application信息,如下.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
appName是你的app在集群UI上的名字。master参数包括: Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.
PS: 在Spark shell中, 一个特别的 SparkContext 已经创建好,名字为sc。想让自己的SparkContext 工作需要使用--master 命令。
Once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
1.RDD
1)Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
- Parallelized collections are created by calling
SparkContext’sparallelizemethod on an existing collection in your driver program (a ScalaSeq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
- Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
SparkContext’s textFile method takes an URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines.Once created, distFile can be acted on by dataset operations.
2)RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
- All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory (also support disk, or replicated across multiple nodes) using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. Storage level: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, MEMORY_AND_DISK, DISK_ONLY...
- Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define
object MyFunctionsand then passMyFunctions.func1, as follows:
object MyFunctions {
def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)
- While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing
(a, b)), as long as you importorg.apache.spark.SparkContext._in your program to enable Spark’s implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Operations which can cause a shuffle include repartition operations , ‘ByKey operations (except for counting) , and join operations.To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce.
常用operations:(具体查文档)
map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cartsian
常用actions:
reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach
2.Shared Variables
As operations are lazy, read-write shared variables across tasks would be inefficient. So Spark provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
- Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.After the broadcast variable is created, it should be used instead of the value
vin any functions run on the cluster so thatvis not shipped to the nodes more than once.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0) scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
- Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums.
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0 scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s scala> accum.value
res2: Int = 10
之后读一些例子。强烈推荐:http://dongxicheng.org/framework-on-yarn/spark-scala-writing-application/
以及doc上LR的一个完整例子:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils // Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") // Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1) // Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(10)
.run(training) // Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
} // Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision) // Save and load model
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils // Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") // Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1) // Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(10)
.run(training) // Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
} // Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val precision = metrics.precision
println("Precision = " + precision) // Save and load model
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
Spark 学习笔记:(二)编程指引(Scala版)的更多相关文章
- Spark学习笔记——RDD编程
1.RDD——弹性分布式数据集(Resilient Distributed Dataset) RDD是一个分布式的元素集合,在Spark中,对数据的操作就是创建RDD.转换已有的RDD和调用RDD操作 ...
- Spark学习笔记3(IDEA编写scala代码并打包上传集群运行)
Spark学习笔记3 IDEA编写scala代码并打包上传集群运行 我们在IDEA上的maven项目已经搭建完成了,现在可以写一个简单的spark代码并且打成jar包 上传至集群,来检验一下我们的sp ...
- 学习笔记(二)--->《Java 8编程官方参考教程(第9版).pdf》:第七章到九章学习笔记
注:本文声明事项. 本博文整理者:刘军 本博文出自于: <Java8 编程官方参考教程>一书 声明:1:转载请标注出处.本文不得作为商业活动.若有违本之,则本人不负法律责任.违法者自负一切 ...
- Spark学习笔记之SparkRDD
Spark学习笔记之SparkRDD 一. 基本概念 RDD(resilient distributed datasets)弹性分布式数据集. 来自于两方面 ① 内存集合和外部存储系统 ② ...
- spark学习笔记总结-spark入门资料精化
Spark学习笔记 Spark简介 spark 可以很容易和yarn结合,直接调用HDFS.Hbase上面的数据,和hadoop结合.配置很容易. spark发展迅猛,框架比hadoop更加灵活实用. ...
- NumPy学习笔记 二
NumPy学习笔记 二 <NumPy学习笔记>系列将记录学习NumPy过程中的动手笔记,前期的参考书是<Python数据分析基础教程 NumPy学习指南>第二版.<数学分 ...
- Spark学习笔记0——简单了解和技术架构
目录 Spark学习笔记0--简单了解和技术架构 什么是Spark 技术架构和软件栈 Spark Core Spark SQL Spark Streaming MLlib GraphX 集群管理器 受 ...
- [Firefly引擎][学习笔记二][已完结]卡牌游戏开发模型的设计
源地址:http://bbs.9miao.com/thread-44603-1-1.html 在此补充一下Socket的验证机制:socket登陆验证.会采用session会话超时的机制做心跳接口验证 ...
- OD调试学习笔记7—去除未注册版软件的使用次数限制
OD调试学习笔记7—去除未注册版软件的使用次数限制 本节使用的软件链接 (想自己试验下的可以下载) 一:破解的思路 仔细观察一个程序,我们会发现,无论在怎么加密,无论加密哪里,这个程序加密的目的就是需 ...
- Spark学习笔记2(spark所需环境配置
Spark学习笔记2 配置spark所需环境 1.首先先把本地的maven的压缩包解压到本地文件夹中,安装好本地的maven客户端程序,版本没有什么要求 不需要最新版的maven客户端. 解压完成之后 ...
随机推荐
- BNUOJ 1055 走迷宫2
走迷宫2 Time Limit: 1000ms Memory Limit: 65535KB 64-bit integer IO format: %lld Java class name: ...
- [android 应用框架api篇] bluetooth
bluetooth接口 android.bluetooth.jar 官网网址: 接口类: https://developer.android.com/reference/android/bluetoo ...
- HDU-4848 Wow! Such Conquering! 爆搜+剪枝
Wow! Such Conquering! 题意:一个n*n的数字格,Txy表示x到y的时间.最后一行n-1个数字代表分别到2-n的最晚时间,自己在1号点,求到达这些点的时间和的最少值,如果没有满足情 ...
- iOS阴影
但是如果把masksToBounds设置为yes就没有阴影了 UIButton *view = [[UIButton alloc]initWithFrame:CGRectMake(, , , ...
- 【bzoj4200】[Noi2015]小园丁与老司机 STL-map+dp+有上下界最小流
题目描述 小园丁 Mr. S 负责看管一片田野,田野可以看作一个二维平面.田野上有 nn 棵许愿树,编号 1,2,3,…,n1,2,3,…,n,每棵树可以看作平面上的一个点,其中第 ii 棵树 (1≤ ...
- NOJ——1642简单的图论问题?(BFS+优先队列)
[1642] 简单的图论问题? 时间限制: 5000 ms 内存限制: 65535 K 问题描述 给一个 n 行 m 列的迷宫,每个格子要么是障碍物要么是空地.每个空地里都有一个权值.你的 任务是从找 ...
- POJ——3126Prime Path(双向BFS+素数筛打表)
Prime Path Time Limit: 1000MS Memory Limit: 65536K Total Submissions: 16272 Accepted: 9195 Descr ...
- es6 递归 tree
function loop(data) { let office = data.map(item => { if(item.type == '1' ||item.type == '2') { i ...
- java连oracle
下载连接驱动 安装完oracle之后 D:\app\Administrator\product\11.2.0\dbhome_1\jdbc\lib 目录下拷贝 支持jdk1.6以上 From.java ...
- css可见性
overflow:hidden: 溢出隐藏 visibility:hidden: 隐藏元素,隐藏之后还占据原来的位置 display:none: 隐藏元 ...