Spark Shell Examples
Spark Shell
Example 1 - Process Data from List:
scala> val pairs = sc.parallelize( List(
("This", 2),
("is", 3),
("Spark", 5),
("is", 3)
) )
...
scala> pairs.collect().foreach(println)
(This,2)
(is,3)
(Spark,5)
(is,3)
// Reduce Pairs by Keys:
scala> val pair1 = pairs.reduceByKey((x,y) => x+y, 4)
...
scala> pair1.collect.foreach(println)
(Spark,5)
(is,6)
(This,2)
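The second argument to reduceByKey (4 here) sets the number of partitions of the resulting RDD. As a quick check (a sketch, assuming the pair1 RDD defined above and a reasonably recent Spark version), the shell can report that partition count:
// Check the number of partitions of the reduced RDD:
scala> pair1.getNumPartitions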
// Decrease values by 1:
scala> val pair2 = pairs.mapValues( x=>x-1 )
scala> pair2.collect.foreach(println)
(This,1)
(is,2)
(Spark,4)
(is,2)
// Group Values by Keys:
scala> pairs.groupByKey.collect().foreach(println)
(Spark,CompactBuffer(5))
(is,CompactBuffer(3, 3))
(This,CompactBuffer(2))
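As a side note, the grouped values can still be aggregated afterwards. The sketch below (using the pairs RDD above) sums each CompactBuffer and should match the reduceByKey result, although reduceByKey is normally preferred because it combines values before shuffling:
// Sum the grouped values per key:
scala> pairs.groupByKey.mapValues ( _.sum ).collect.foreach(println)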
Example 2 - Process Data from Local Text File
// Create an RDD from a local text file:
scala> val textFile = sc.textFile("file:///home/PATH_TO_SPARK_HOME/README.md")
RDD transformations and actions can now be applied to textFile.
// This will display the number of lines in this textFile:
scala> textFile.count()
// or simply:
scala> textFile.count
// Note: the parentheses can be omitted when a method takes no arguments
// This will display the first line:
scala> textFile.first
// Filter lines containing "Spark":
scala> val linesWithSpark = textFile.filter (
line => line.contains("Spark")
)
// or simply:
scala> val linesWithSpark = textFile.filter(_.contains ("Spark"))
// Note: the underscore "_" stands for each element (here, each line) of textFile
// Collect the content of linesWithSpark:
scala> linesWithSpark.collect ()
// Print lines of content of linesWithSpark:
scala> linesWithSpark.foreach (println)
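For a large file, collecting or printing every matching line can be expensive. A small sketch using take fetches only the first few matches:
// Print only the first 5 lines containing "Spark":
scala> linesWithSpark.take(5).foreach(println)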
// Map each line to the number of terms in it:
scala> val numOfTermsPerLine = textFile.map ( line => line.split(" ").size )
// or simply:
scala> val numOfTermsPerLine = textFile.map ( _.split(" ").size )
// Aggregate numOfTermsPerLine to find the maximum number of terms per line:
scala> numOfTermsPerLine.reduce ( (a, b) => if (a>b) a else b )
// or use Math.max from java.lang:
scala> import java.lang.Math
scala> numOfTermsPerLine.reduce ( (a, b) => Math.max(a, b))
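Alternatively (a sketch assuming the numOfTermsPerLine RDD from above), the built-in max action gives the same result without writing the reduce function by hand:
// or use the built-in max action:
scala> numOfTermsPerLine.max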
// Convert the RDD textFile to a 1-D collection of terms:
scala> val terms = textFile.flatMap ( _.split(" ") )
// Convert the RDD textFile to a 2-D collection (one array of terms per line):
scala> val terms_ = textFile.map ( _.split(" ") )
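A quick sketch using the two RDDs just defined makes the difference visible: flatMap flattens all lines into a single collection of terms, while map keeps one array per line, so the two counts below differ:
// terms has one element per term, terms_ has one element per line:
scala> terms.count
scala> terms_.count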
// Calculate the vocabulary size in textFile:
scala> terms.distinct().count()
// or simply:
scala> terms.distinct.count
// Find the longest line in textFile together with its length:
scala> val lineLengthPair = textFile.map (
line => (line, line.length) )
scala> val lineWithMaxLength = lineLengthPair.reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// alternatively, in a concise way:
scala> val lineWithMaxLength = textFile.map (
line => (line, line.length) ).reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// Find all lines containing "Spark" along with their line numbers (starting from 0)
// and output them in the format <line_no: line_content>
scala> val lineIndexPair = textFile.zipWithIndex()
scala> val lineIndexPairWithSpark = lineIndexPair.filter (
_._1.contains("Spark"))
scala> lineIndexPairWithSpark.foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
// alternatively, in a concise way:
scala> textFile.zipWithIndex().filter (
_._1.contains("Spark")).foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
Example 3 - Process Data from Local CSV File
Download the CSV file with:
wget --content-disposition https://webcms3.cse.unsw.edu.au/files/cc5bb4af124130f899cddad80af071f1ad478c3c8eb7440433291459bb603ff1/attachment
Define a field-name-to-column-index mapping for the CSV file:
scala> val aucid = 0
scala> val bid = 1
scala> val bidtime = 2
scala> val bidder = 3
scala> val bidderrate = 4
scala> val openbid = 5
scala> val price = 6
scala> val itemtype = 7
scala> val dtl = 8
// Create an RDD as a 2-D array from CSV file:
scala> val auctionRDD = sc.textFile("file:///home/PATH-TO-CSV-FILE/auction.csv")
.map ( _.split(",") )
// Count the total number of item types in the auction:
scala> auctionRDD.map ( _(itemtype) ).distinct.count
// itemtype was previously defined as 7 to index the 8th column
// Count the total number of bids per itemtype:
scala> auctionRDD.map ( line => ( line(itemtype), 1 ) )
.reduceByKey ( _ + _ , 4)
.foreach ( pair => println (pair._1 + "," + pair._2) )
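An alternative sketch for the same count uses countByKey, which returns the result to the driver as a Map instead of leaving it in an RDD:
// or, returning a Map of itemtype -> count to the driver:
scala> auctionRDD.map ( line => ( line(itemtype), 1 ) ).countByKey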
// Find the maximum number of bids among all auctions:
scala> auctionRDD.map ( line => ( line(aucid), 1 ) )
.reduceByKey ( _ + _ , 4)
.reduce ( (pair1, pair2) => if ( pair1._2 >= pair2._2 ) pair1 else pair2 )
._2
// Find the top-5 auctions with the most bids:
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.map ( _.swap )
.sortByKey (false)
.map ( _.swap )
.take (5)
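A more concise sketch for the same query uses the top action with an explicit Ordering on the bid count, which avoids the swap/sort/swap steps:
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.top (5)(Ordering.by(_._2))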
Example 4 - Word Count on HDFS Text File
Download & put data file to HDFS by:
wget --content-disposition https://webcms3.cse.unsw.edu.au/files/33c7707c8b646a686e33af7e2f2fc006b53ff8c13d8317976bd262d8c6daae66/attachment
hdfs dfs -put pg100.txt Input/
// Create an RDD from HDFS:
scala> val pg100RDD = sc.textFile ("hdfs://HOST-NAME:PORT/user/USER-NAME/Input/pg100.txt")
// Word count:
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.saveAsTextFile ( "OUTPUT-PATH" )
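Instead of saving straight to disk, the counts can also be inspected in the shell. The sketch below sorts by count in descending order and takes the 10 most frequent words:
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.sortBy ( _._2, false )
.take (10)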
Example N - Spark GraphX Programming
# Download graph data tiny-graph.txt
$ wget --content-disposition https://webcms3.cse.unsw.edu.au/files/ae6f45a3d64c0b35a3bd4d0c2740cc673f000dc60ec17d0e882faf6c20f74509/attachment
// Import the relevant GraphX classes:
scala> import org.apache.spark.graphx._
// Load graph data as RDD:
scala> val tinyGraphRDD = sc.textFile ("file:///home/PATH-TO-GRAPH-DATA/tiny-graph.txt")
// Convert raw data <index, srcVertex, destVertex, weight>
// into GraphX-readable edges:
scala> val edges = tinyGraphRDD.map ( _.split(" ") )
.map ( line =>
Edge ( line(1).toLong,
line(2).toLong,
line(3).toDouble
)
)
// Create a graph:
scala> val graph = Graph.fromEdges[Double, Double] (edges, 0.0)
// Now that the graph has been created,
// show its triplets:
scala> graph.triplets.collect.foreach ( println )
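Each triplet exposes the source vertex id, destination vertex id, and edge attribute. As a final sketch, they can be formatted into more readable edge descriptions:
// Print each edge as "src -> dst (weight w)":
scala> graph.triplets.map ( t => s"${t.srcId} -> ${t.dstId} (weight ${t.attr})" )
.collect.foreach ( println )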