Spark Shell Examples
Spark Shell
Example 1 - Process Data from List:
scala> val pairs = sc.parallelize( List(
("This", 2),
("is", 3),
("Spark", 5),
("is", 3)
) )
...
scala> pairs.collect().foreach(println)
(This,2)
(is,3)
(Spark,5)
(is,3)
// Reduce Pairs by Keys:
scala> val pair1 = pairs.reduceByKey((x,y) => x+y, 4)
...
scala> pair1.collect.foreach(println)
(Spark,5)
(is,6)
(This,2)
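The second argument to reduceByKey (4 here) sets the number of partitions of the resulting RDD. As a quick check (a sketch, assuming the pair1 RDD defined above and a reasonably recent Spark version), the shell can report that partition count:
// Check the number of partitions of the reduced RDD:
scala> pair1.getNumPartitions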
// Decrease values by 1:
scala> val pair2 = pairs.mapValues( x=>x-1 )
scala> pair2.collect.foreach(println)
(This,1)
(is,2)
(Spark,4)
(is,2)
// Group Values by Keys:
scala> pairs.groupByKey.collect().foreach(println)
(Spark,CompactBuffer(5))
(is,CompactBuffer(3, 3))
(This,CompactBuffer(2))
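As a side note, the grouped values can still be aggregated afterwards. The sketch below (using the pairs RDD above) sums each CompactBuffer and should match the reduceByKey result, although reduceByKey is normally preferred because it combines values before shuffling:
// Sum the grouped values per key:
scala> pairs.groupByKey.mapValues ( _.sum ).collect.foreach(println)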
Example 2 - Process Data from Local Text File
// Create an RDD from a local text file:
scala> val textFile = sc.textFile("file:///home/PATH_TO_SPARK_HOME/README.md")
RDD transformations and actions can now be applied to textFile.
// This will display the number of lines in this textFile:
scala> textFile.count()
// or simply:
scala> textFile.count
// Note: the parentheses can be omitted when a method takes no arguments
// This will display the first line:
scala> textFile.first
// Filter lines containing "Spark":
scala> val linesWithSpark = textFile.filter (
line => line.contains("Spark")
)
// or simply:
scala> val linesWithSpark = textFile.filter(_.contains ("Spark"))
// Note: the underscore "_" stands for each element (here, each line) of textFile
// Collect the content of linesWithSpark:
scala> linesWithSpark.collect ()
// Print lines of content of linesWithSpark:
scala> linesWithSpark.foreach (println)
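For a large file, collecting or printing every matching line can be expensive. A small sketch using take fetches only the first few matches:
// Print only the first 5 lines containing "Spark":
scala> linesWithSpark.take(5).foreach(println)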
// Map each line to the number of terms in it:
scala> val numOfTermsPerLine = textFile.map ( line => line.split(" ").size )
// or simply:
scala> val numOfTermsPerLine = textFile.map ( _.split(" ").size )
// Aggregate numOfTermsPerLine to find the maximum number of terms per line:
scala> numOfTermsPerLine.reduce ( (a, b) => if (a>b) a else b )
// or use Math.max from java.lang:
scala> import java.lang.Math
scala> numOfTermsPerLine.reduce ( (a, b) => Math.max(a, b))
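Alternatively (a sketch assuming the numOfTermsPerLine RDD from above), the built-in max action gives the same result without writing the reduce function by hand:
// or use the built-in max action:
scala> numOfTermsPerLine.max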
// Convert the RDD textFile to a 1-D collection of terms:
scala> val terms = textFile.flatMap ( _.split(" ") )
// Convert the RDD textFile to a 2-D collection (one array of terms per line):
scala> val terms_ = textFile.map ( _.split(" ") )
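A quick sketch using the two RDDs just defined makes the difference visible: flatMap flattens all lines into a single collection of terms, while map keeps one array per line, so the two counts below differ:
// terms has one element per term, terms_ has one element per line:
scala> terms.count
scala> terms_.count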
// Calculate the vocabulary size in textFile:
scala> terms.distinct().count()
// or simply:
scala> terms.distinct.count
// Find the longest line in textFile together with its length:
scala> val lineLengthPair = textFile.map (
line => (line, line.length) )
scala> val lineWithMaxLength = lineLengthPair.reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// alternatively, in a concise way:
scala> val lineWithMaxLength = textFile.map (
line => (line, line.length) ).reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// Find all lines containing "Spark" along with their line numbers (starting from 0)
// and output them in the format <line_no: line_content>
scala> val lineIndexPair = textFile.zipWithIndex()
scala> val lineIndexPairWithSpark = lineIndexPair.filter (
_._1.contains("Spark"))
scala> lineIndexPairWithSpark.foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
// alternatively, in a concise way:
scala> textFile.zipWithIndex().filter (
_._1.contains("Spark")).foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
Example 3 - Process Data from Local CSV File
Download the CSV file with:
wget --content-disposition https://webcms3.cse.unsw.edu.au/files/cc5bb4af124130f899cddad80af071f1ad478c3c8eb7440433291459bb603ff1/attachment
Define a field-name-to-column-index mapping for the CSV file:
scala> val aucid = 0
scala> val bid = 1
scala> val bidtime = 2
scala> val bidder = 3
scala> val bidderrate = 4
scala> val openbid = 5
scala> val price = 6
scala> val itemtype = 7
scala> val dtl = 8
// Create an RDD as a 2-D array from CSV file:
scala> val auctionRDD = sc.textFile("file:///home/PATH-TO-CSV-FILE/auction.csv")
.map ( _.split(",") )
// Count the total number of item types in the auction:
scala> auctionRDD.map ( _(itemtype) ).distinct.count
// itemtype was previously defined as 7 to index the 8th column
// Count the total number of bids per itemtype:
scala> auctionRDD.map ( line => ( line(itemtype), 1 ) )
.reduceByKey ( _ + _ , 4)
.foreach ( pair => println (pair._1 + "," + pair._2) )
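An alternative sketch for the same count uses countByKey, which returns the result to the driver as a Map instead of leaving it in an RDD:
// or, returning a Map of itemtype -> count to the driver:
scala> auctionRDD.map ( line => ( line(itemtype), 1 ) ).countByKey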
// Find the maximum number of bids among all auctions:
scala> auctionRDD.map ( line => ( line(aucid), 1 ) )
.reduceByKey ( _ + _ , 4)
.reduce ( (pair1, pair2) => if ( pair1._2 >= pair2._2 ) pair1 else pair2 )
._2
// Find the top-5 auctions with the most bids:
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.map ( _.swap )
.sortByKey (false)
.map ( _.swap )
.take (5)
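A more concise sketch for the same query uses the top action with an explicit Ordering on the bid count, which avoids the swap/sort/swap steps:
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.top (5)(Ordering.by(_._2))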
Example 4 - Word Count on HDFS Text File
Download & put data file to HDFS by:
wget --content-disposition https://webcms3.cse.unsw.edu.au/files/33c7707c8b646a686e33af7e2f2fc006b53ff8c13d8317976bd262d8c6daae66/attachment
hdfs dfs -put pg100.txt Input/
// Create an RDD from HDFS:
scala> val pg100RDD = sc.textFile ("hdfs://HOST-NAME:PORT/user/USER-NAME/Input/pg100.txt")
// Word count:
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.saveAsTextFile ( "OUTPUT-PATH" )
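Instead of saving straight to disk, the counts can also be inspected in the shell. The sketch below sorts by count in descending order and takes the 10 most frequent words:
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.sortBy ( _._2, false )
.take (10)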
Example N - Spark GraphX Programming
# Download graph data tiny-graph.txt
$ wget --content-disposition https://webcms3.cse.unsw.edu.au/files/ae6f45a3d64c0b35a3bd4d0c2740cc673f000dc60ec17d0e882faf6c20f74509/attachment
// Import the relevant GraphX classes:
scala> import org.apache.spark.graphx._
// Load graph data as RDD:
scala> val tinyGraphRDD = sc.textFile ("file:///home/PATH-TO-GRAPH-DATA/tiny-graph.txt")
// Convert raw data <index, srcVertex, destVertex, weight>
// into GraphX-readable edges:
scala> val edges = tinyGraphRDD.map ( _.split(" ") )
.map ( line =>
Edge ( line(1).toLong,
line(2).toLong,
line(3).toDouble
)
)
// Create a graph:
scala> val graph = Graph.fromEdges[Double, Double] (edges, 0.0)
// Now that the graph has been created,
// show its triplets:
scala> graph.triplets.collect.foreach ( println )
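Each triplet exposes the source vertex id, destination vertex id, and edge attribute. As a final sketch, they can be formatted into more readable edge descriptions:
// Print each edge as "src -> dst (weight w)":
scala> graph.triplets.map ( t => s"${t.srcId} -> ${t.dstId} (weight ${t.attr})" )
.collect.foreach ( println )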