Spark RDD: Simple Applications
Source: http://my.oschina.net/scipio/blog/284957#OSC_h5_11
1. Prepare the data file
wget http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data
2. Load the file
scala> val inFile = sc.textFile("/home/scipio/spam.data")
Output:
14/06/28 12:15:34 INFO MemoryStore: ensureFreeSpace(32880) called with curMem=65736, maxMem=311387750
14/06/28 12:15:34 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 32.1 KB, free 296.9 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[7] at textFile at <console>:12
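As a side note, textFile also accepts a minimum partition count as a second argument, which controls how the file is split across tasks. A minimal sketch (the variable name inFile4 is just illustrative):
// Load the same file, asking Spark for at least 4 partitions
val inFile4 = sc.textFile("/home/scipio/spam.data", 4)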
3. Display one line
scala> inFile.first()
Output:
14/06/28 12:15:39 INFO FileInputFormat: Total input paths to process : 1
14/06/28 12:15:39 INFO SparkContext: Starting job: first at <console>:15
14/06/28 12:15:39 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true)
14/06/28 12:15:39 INFO DAGScheduler: Final stage: Stage 0(first at <console>:15)
14/06/28 12:15:39 INFO DAGScheduler: Parents of final stage: List()
14/06/28 12:15:39 INFO DAGScheduler: Missing parents: List()
14/06/28 12:15:39 INFO DAGScheduler: Computing the requested partition locally
14/06/28 12:15:39 INFO HadoopRDD: Input split: file:/home/scipio/spam.data:0+349170
14/06/28 12:15:39 INFO SparkContext: Job finished: first at <console>:15, took 0.532360118 s
res2: String = 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.778 0 0 3.756 61 278 1
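If one line is not enough, take(n) returns the first n lines as an Array; a small sketch using the inFile RDD defined above:
// Fetch the first 3 lines of the file as Array[String] and print them
val firstLines = inFile.take(3)
firstLines.foreach(println)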
4. Using RDD operations
(1)map
scala> val nums = inFile.map(x=>x.split(' ').map(_.toDouble))
nums: org.apache.spark.rdd.RDD[Array[Double]] = MappedRDD[8] at map at <console>:14
scala> nums.first()
14/06/28 12:19:07 INFO SparkContext: Starting job: first at <console>:17
14/06/28 12:19:07 INFO DAGScheduler: Got job 1 (first at <console>:17) with 1 output partitions (allowLocal=true)
14/06/28 12:19:07 INFO DAGScheduler: Final stage: Stage 1(first at <console>:17)
14/06/28 12:19:07 INFO DAGScheduler: Parents of final stage: List()
14/06/28 12:19:07 INFO DAGScheduler: Missing parents: List()
14/06/28 12:19:07 INFO DAGScheduler: Computing the requested partition locally
14/06/28 12:19:07 INFO HadoopRDD: Input split: file:/home/scipio/spam.data:0+349170
14/06/28 12:19:07 INFO SparkContext: Job finished: first at <console>:17, took 0.011412903 s
res3: Array[Double] = Array(0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0, 1.0)
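Building on nums, here is a sketch that treats the last column as the class label (for this spam data set the final field is commonly the spam/non-spam indicator; verify against your copy) and counts how many rows fall into each class:
// Pull out the last value of each row and count occurrences of each label
val labelCounts = nums.map(row => row.last).countByValue()
labelCounts.foreach(println)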
(2)collect
scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:12
scala> val mapRdd = rdd.map(2*_)
mapRdd: org.apache.spark.rdd.RDD[Int] = MappedRDD[10] at map at <console>:14
scala> mapRdd.collect
14/06/28 12:24:45 INFO SparkContext: Job finished: collect at <console>:17, took 1.789249751 s
res4: Array[Int] = Array(2, 4, 6, 8, 10)
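collect materializes the whole RDD on the driver, which is fine for a five-element list but risky for large data. A sketch of a safer peek:
// take(n) only ships n elements back to the driver
val firstThree = mapRdd.take(3)   // Array(2, 4, 6)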
(3)filter
scala> val filterRdd = sc.parallelize(List(1,2,3,4,5)).map(_*2).filter(_>5)
filterRdd: org.apache.spark.rdd.RDD[Int] = FilteredRDD[13] at filter at <console>:12
scala> filterRdd.collect
14/06/28 12:27:45 INFO SparkContext: Job finished: collect at <console>:15, took 0.056086178 s
res5: Array[Int] = Array(6, 8, 10)
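filter is often followed directly by another action; a minimal sketch that counts the surviving elements instead of collecting them:
// count is an action: it returns how many elements passed the filter
val howMany = filterRdd.count()   // 3, matching Array(6, 8, 10) above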
(4)flatMap
scala> val rdd = sc.textFile("/home/scipio/README.md")
14/06/28 12:31:55 INFO MemoryStore: ensureFreeSpace(32880) called with curMem=98616, maxMem=311387750
14/06/28 12:31:55 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 32.1 KB, free 296.8 MB)
rdd: org.apache.spark.rdd.RDD[String] = MappedRDD[15] at textFile at <console>:12
scala> rdd.count
14/06/28 12:32:50 INFO SparkContext: Job finished: count at <console>:15, took 0.341167662 s
res6: Long = 127
scala> rdd.cache
res7: rdd.type = MappedRDD[15] at textFile at <console>:12
scala> rdd.count
14/06/28 12:33:00 INFO SparkContext: Job finished: count at <console>:15, took 0.32015745 s
res8: Long = 127
scala> val wordCount = rdd.flatMap(_.split(' ')).map(x=>(x,1)).reduceByKey(_+_)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[20] at reduceByKey at <console>:14
scala> wordCount.collect
res9: Array[(String, Int)] = Array((means,1), (under,2), (this,4), (Because,1), (Python,2), (agree,1), (cluster.,1), (its,1), (YARN,,3), (have,2), (pre-built,1), (MRv1,,1), (locally.,1), (locally,2), (changed,1), (several,1), (only,1), (sc.parallelize(1,1), (This,2), (basic,1), (first,1), (requests,1), (documentation,1), (Configuration,1), (MapReduce,2), (without,1), (setting,1), ("yarn-client",1), ([params]`.,1), (any,2), (application,1), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (version,3), (file,1), (documentation,,1), (test,1), (MASTER,1), (entry,1), (example,3), (are,2), (systems.,1), (params,1), (scala>,1), (<artifactId>hadoop-client</artifactId>,1), (refer,1), (configure,1), (Interactive,2), (artifact,1), (can,7), (file's,1), (build,3), (when,2), (2.0.X,,1), (Apac...
scala> wordCount.saveAsTextFile("/home/scipio/wordCountResult.txt")
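Note that saveAsTextFile writes a directory of part-* files rather than a single file. To inspect the most frequent words without saving, a sketch that swaps each pair and takes the largest entries (ordered by count, then word):
// top(10) returns the 10 largest (count, word) pairs on the driver
val top10 = wordCount.map(_.swap).top(10)
top10.foreach(println)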
(5)union
scala> val rdd = sc.parallelize(List(('a',1),('a',2)))
rdd: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:12
scala> val rdd2 = sc.parallelize(List(('b',1),('b',2)))
rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:12
scala> rdd union rdd2
res3: org.apache.spark.rdd.RDD[(Char, Int)] = UnionRDD[12] at union at <console>:17
scala> res3.collect
res4: Array[(Char, Int)] = Array((a,1), (a,2), (b,1), (b,2))
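union simply concatenates the two RDDs and keeps duplicates; a small sketch showing distinct on a self-union:
// The self-union has 4 elements; distinct collapses the duplicates back to 2
val doubled = rdd union rdd        // (a,1), (a,2), (a,1), (a,2)
doubled.distinct.collect           // Array((a,1), (a,2)) (order may vary)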
(6) join
scala> val rdd1 = sc.parallelize(List(('a',1),('a',2),('b',3),('b',4)))
rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:12
scala> val rdd2 = sc.parallelize(List(('a',5),('a',6),('b',7),('b',8)))
rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:12
scala> rdd1 join rdd2
res1: org.apache.spark.rdd.RDD[(Char, (Int, Int))] = FlatMappedValuesRDD[14] at join at <console>:17
scala> res1.collect
res2: Array[(Char, (Int, Int))] = Array((b,(3,7)), (b,(3,8)), (b,(4,7)), (b,(4,8)), (a,(1,5)), (a,(1,6)), (a,(2,5)), (a,(2,6)))
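join is an inner join and produces every value combination per key. If keys may be missing on one side, leftOuterJoin keeps all keys of the left RDD; a minimal sketch reusing rdd2 from above plus a hypothetical key 'c':
// 'c' has no match in rdd2, so its right-hand value becomes None
val rdd3 = sc.parallelize(List(('a',9),('c',10)))
rdd3.leftOuterJoin(rdd2).collect
// e.g. Array((a,(9,Some(5))), (a,(9,Some(6))), (c,(10,None)))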
(7)lookup
val rdd1 = sc.parallelize(List(('a',1),('a',2),('b',3),('b',4)))
rdd1.lookup('a')
res3: Seq[Int] = WrappedArray(1, 2)
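lookup is an action that returns all values for a key as a Seq; for a key that does not exist it simply returns an empty Seq, as in this sketch:
rdd1.lookup('c')   // Seq() -- no values for an absent key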
(8)groupByKey
val wc = sc.textFile("/home/scipio/README.md").flatMap(_.split(' ')).map((_,1)).groupByKey
wc.collect
14/06/28 12:56:14 INFO SparkContext: Job finished: collect at <console>:15, took 2.933392093 s
res0: Array[(String, Iterable[Int])] = Array((means,ArrayBuffer(1)), (under,ArrayBuffer(1, 1)), (this,ArrayBuffer(1, 1, 1, 1)), (Because,ArrayBuffer(1)), (Python,ArrayBuffer(1, 1)), (agree,ArrayBuffer(1)), (cluster.,ArrayBuffer(1)), (its,ArrayBuffer(1)), (YARN,,ArrayBuffer(1, 1, 1)), (have,ArrayBuffer(1, 1)), (pre-built,ArrayBuffer(1)), (MRv1,,ArrayBuffer(1)), (locally.,ArrayBuffer(1)), (locally,ArrayBuffer(1, 1)), (changed,ArrayBuffer(1)), (sc.parallelize(1,ArrayBuffer(1)), (only,ArrayBuffer(1)), (several,ArrayBuffer(1)), (This,ArrayBuffer(1, 1)), (basic,ArrayBuffer(1)), (first,ArrayBuffer(1)), (documentation,ArrayBuffer(1)), (Configuration,ArrayBuffer(1)), (MapReduce,ArrayBuffer(1, 1)), (requests,ArrayBuffer(1)), (without,ArrayBuffer(1)), ("yarn-client",ArrayBuffer(1)), ([params]`.,Ar...
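The grouped 1s can be summed to reproduce the word counts, though reduceByKey (as in step (4)) is usually preferred because it combines values on each partition before shuffling. A sketch:
// Sum each word's Iterable of 1s to get its count
val countsFromGroup = wc.mapValues(_.sum)
countsFromGroup.take(5).foreach(println)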
(9)sortByKey
val rdd = sc.textFile("/home/scipio/README.md")
val wordcount = rdd.flatMap(_.split(' ')).map((_,1)).reduceByKey(_+_)
val wcsort = wordcount.map(x => (x._2,x._1)).sortByKey(false).map(x => (x._2,x._1))
wcsort.saveAsTextFile("/home/scipio/sort.txt")
For an ascending sort, use sortByKey(true).
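The key/value swap above is needed because sortByKey only sorts on the key. If your Spark version provides RDD.sortBy (available in newer releases), the same descending sort can be written more directly; a sketch with an illustrative output path:
// Sort (word, count) pairs by the count field, descending
val wcsort2 = wordcount.sortBy(_._2, ascending = false)
wcsort2.saveAsTextFile("/home/scipio/sort_by.txt")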