Spark Shell Examples

Spark Shell

Example 1 - Process Data from List:

scala> val pairs = sc.parallelize( List(
         ("This", 2),
         ("is", 3),
         ("Spark", 5),
         ("is", 3)
       ) )

...

scala> pairs.collect().foreach(println)

(This,2)

(is,3)

(Spark,5)

(is,3)

// Reduce Pairs by Keys:

scala> val pair1 = pairs.reduceByKey((x,y) => x+y, 4)

...

scala> pair1.collect.foreach(println)

(Spark,5)

(is,6)

(This,2)

// Decrease values by 1:

scala> val pair2 = pairs.mapValues( x=>x-1 )

scala> pair2.collect.foreach(println)

(This,1)

(is,2)

(Spark,4)

(is,2)

// Group Values by Keys:

scala> pairs.groupByKey.collect().foreach(println)

(Spark,CompactBuffer(5))

(is,CompactBuffer(3, 3))

(This,CompactBuffer(2))
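
The key-wise results above can be double-checked without a cluster. The following is a minimal sketch on a plain Scala List, not Spark code: groupBy stands in for groupByKey, and summing each group gives what reduceByKey with addition returns. The names localPairs, grouped, reduced and decremented are illustrative and are not part of the original session.

```scala
// Local model of the pair operations above, using plain Scala collections
// (no SparkContext is needed; this is an assumption-free stand-in, not the RDD API).
val localPairs = List(("This", 2), ("is", 3), ("Spark", 5), ("is", 3))

// groupByKey analogue: collect all values that share a key.
val grouped: Map[String, List[Int]] =
  localPairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }

// reduceByKey(_ + _) analogue: fold the values of each group into one number.
val reduced: Map[String, Int] =
  grouped.map { case (key, values) => key -> values.sum }

// mapValues(x => x - 1) analogue: keys are untouched, values are decremented.
val decremented: List[(String, Int)] =
  localPairs.map { case (key, value) => (key, value - 1) }

println(grouped)
println(reduced)
println(decremented)
```

On a real RDD, reduceByKey is usually preferable to groupByKey followed by a sum, because it combines values within each partition before the shuffle.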

Example 2 - Process Data from Local Text File

// Create an RDD from a local text file:

scala> val textFile = sc.textFile("file:///home/PATH_TO_SPARK_HOME/README.md")

RDD transformations and actions can now be applied to textFile.

// This will display the number of lines in this textFile:

scala> textFile.count()

// or simply:

scala> textFile.count

// Note: a method that takes no arguments can be called without parentheses

// This will display the first line:

scala> textFile.first

// Filter lines containing "Spark":

scala> val linesWithSpark = textFile.filter ( line => line.contains("Spark") )

// or simply:

scala> val linesWithSpark = textFile.filter(_.contains ("Spark"))

// Note: the underscore "_" is a placeholder for the function's argument, here each line of textFile
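
The equivalence of the two filter forms can be seen on an ordinary Scala List as well. This is a small sketch with made-up sample lines (sampleLines, explicitForm and placeholderForm are hypothetical names), assuming nothing beyond the Scala standard library.

```scala
// Hypothetical sample data standing in for the lines of textFile.
val sampleLines = List("Apache Spark", "Hello world", "Spark shell")

// Explicit form: the function parameter is named.
val explicitForm = sampleLines.filter(line => line.contains("Spark"))

// Placeholder form: "_" stands for the single argument of the function.
val placeholderForm = sampleLines.filter(_.contains("Spark"))

// Both forms select the same elements.
println(explicitForm == placeholderForm)
```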

// Collect the content of linesWithSpark:

scala> linesWithSpark.collect ()

// Print each line of linesWithSpark (in local mode; on a cluster the output goes to the executors' stdout):

scala> linesWithSpark.foreach (println)

// Map each line to the number of terms in it:

scala> val numOfTermsPerLine = textFile.map ( line => line.split(" ").size )

// or simply:

scala> val numOfTermsPerLine = textFile.map ( _.split(" ").size )

// Reduce numOfTermsPerLine to the maximum number of terms in a line:

scala> numOfTermsPerLine.reduce ( (a, b) => if (a > b) a else b )

// or use Math.max from java.lang:

scala> import java.lang.Math

scala> numOfTermsPerLine.reduce ( (a, b) => Math.max(a, b))

// Flatten textFile into a single RDD of terms (RDD[String]):

scala> val terms = textFile.flatMap ( _.split(" ") )

// Keep one array of terms per line instead (RDD[Array[String]]):

scala> val terms_ = textFile.map ( _.split(" ") )

// Calculate the vocabulary size in textFile:

scala> terms.distinct().count()

// or simply:

scala> terms.distinct.count
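
The difference between flatMap and map, and the vocabulary count built on top of it, can be illustrated on a local List. This is a sketch with hypothetical input (verseLines) and illustrative names; it models the shape of the results rather than calling the RDD API.

```scala
// Hypothetical two-line input standing in for textFile.
val verseLines = List("to be or", "not to be")

// flatMap concatenates the per-line arrays into one flat list of terms.
val flatTerms: List[String] = verseLines.flatMap(_.split(" "))

// map keeps one collection of terms per line.
val termsPerLine: List[List[String]] = verseLines.map(_.split(" ").toList)

// Vocabulary size: the number of distinct terms.
val vocabularySize: Int = flatTerms.distinct.size

println(flatTerms)
println(termsPerLine)
println(vocabularySize)
```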

// Find the longest line in textFile together with its length:

scala> val lineLengthPair = textFile.map ( line => (line, line.length) )

scala> val lineWithMaxLength = lineLengthPair.reduce (
         (pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )

// alternatively, as a single expression:

scala> val lineWithMaxLength = textFile.map ( line => (line, line.length) ).reduce (
         (pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
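
The pairwise reduce used to find the longest line behaves the same way on a local List. The sketch below uses hypothetical sample lines (candidateLines) and shows that the result agrees with the standard library's maxBy; it is a local model, not Spark code.

```scala
// Hypothetical sample lines with a unique longest element.
val candidateLines = List("a", "abcd", "ab")

// Pair each line with its length, then keep the pair with the larger length.
val longest: (String, Int) =
  candidateLines
    .map(line => (line, line.length))
    .reduce((p1, p2) => if (p1._2 >= p2._2) p1 else p2)

// The same line found with maxBy from the Scala collections library.
val viaMaxBy: String = candidateLines.maxBy(_.length)

println(longest)
println(viaMaxBy)
```

When several lines share the maximum length, the line returned by reduce on an RDD depends on the order in which partitions are combined.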

// Find all lines containing "Spark" along with their line numbers (starting from 0)

// and print them in the format <line_no: line_content>:

scala> val lineIndexPair = textFile.zipWithIndex()

scala> val lineIndexPairWithSpark = lineIndexPair.filter ( _._1.contains("Spark") )

scala> lineIndexPairWithSpark.foreach ( pair => println ( pair._2 + ": " + pair._1 ) )

// alternatively, as a single expression:

scala> textFile.zipWithIndex().filter ( _._1.contains("Spark") ).foreach (
         pair => println ( pair._2 + ": " + pair._1 ) )
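
The line-numbering step can be tried on a local List before running it on an RDD. This is a minimal sketch with made-up lines (readmeLines is hypothetical); note that List.zipWithIndex produces Int indices, whereas the RDD version produces Long indices.

```scala
// Hypothetical sample lines standing in for the contents of textFile.
val readmeLines = List(
  "# Apache Spark",
  "",
  "Spark is a unified analytics engine",
  "See the docs"
)

// Attach a 0-based index to every line, keep the lines mentioning "Spark",
// and format each hit as <line_no: line_content>.
val numbered: List[String] =
  readmeLines.zipWithIndex
    .filter(_._1.contains("Spark"))
    .map { case (line, index) => s"$index: $line" }

numbered.foreach(println)
```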

Example 3 - Process Data from Local CSV file

Download the CSV file with:

wget --content-disposition https://webcms3.cse.unsw.edu.au/files/cc5bb4af124130f899cddad80af071f1ad478c3c8eb7440433291459bb603ff1/attachment

Define a mapping from field names to column indices in the CSV file:

scala> val aucid 		= 0

scala> val bid 			= 1

scala> val bidtime 		= 2

scala> val bidder		= 3

scala> val bidderrate 	= 4

scala> val openbid 		= 5

scala> val price 		= 6

scala> val itemtype 	= 7

scala> val dtl 			= 8

// Create an RDD of field arrays (one Array[String] per row) from the CSV file:

scala> val auctionRDD = sc.textFile("file:///home/PATH-TO-CSV-FILE/auction.csv").map ( _.split(",") )

// Count the total number of distinct item types in the auction:

scala> auctionRDD.map ( _(itemtype) ).distinct.count

// itemtype was defined above as 7, the index of the 8th column

// Count the total number of bids per itemtype
// (multi-line chains like this one are easiest to enter in :paste mode):

scala> auctionRDD.map ( line => ( line(itemtype), 1 ) )
                 .reduceByKey ( _ + _ , 4)
                 .foreach ( pair => println (pair._1 + "," + pair._2) )

// Find the maximum number of bids received by a single auction:

scala> auctionRDD.map ( line => ( line(aucid), 1 ) )
                 .reduceByKey ( _ + _ , 4)
                 .reduce ( (pair1, pair2) => if ( pair1._2 >= pair2._2 ) pair1 else pair2 )
                 ._2

// Find the 5 auctions with the most bids:

scala> auctionRDD.map ( line => (line(aucid), 1) )
                 .reduceByKey ( _ + _ , 4)
                 .map ( _.swap )
                 .sortByKey (false)
                 .map ( _.swap )
                 .take (5)
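
The CSV aggregations above can be sanity-checked on a handful of rows held in a local List. Everything in this sketch is an assumption made for illustration: the rows are invented, and only the column positions (0 for the auction id, 7 for the item type) follow the mapping defined earlier. It models the counting logic with plain Scala collections rather than with RDDs.

```scala
// Column positions, mirroring the aucid and itemtype indices defined above.
val aucidCol = 0
val itemtypeCol = 7

// Invented rows: aucid,bid,bidtime,bidder,bidderrate,openbid,price,itemtype,dtl
val rows: List[Array[String]] = List(
  "A1,100,0.5,alice,10,50,120,xbox,3",
  "A1,110,0.7,bob,3,50,120,xbox,3",
  "A2,20,0.1,carol,7,10,40,palm,5",
  "A1,120,0.9,dave,1,50,120,xbox,3",
  "A3,5,0.2,erin,2,1,9,cartier,7",
  "A2,30,0.4,frank,8,10,40,palm,5"
).map(_.split(","))

// Number of distinct item types.
val distinctItemTypes: Int = rows.map(_(itemtypeCol)).distinct.size

// Number of bids (rows) per auction id.
val bidsPerAuction: Map[String, Int] =
  rows.groupBy(_(aucidCol)).map { case (id, rs) => id -> rs.size }

// Largest bid count of any single auction.
val maxBids: Int = bidsPerAuction.values.max

// The two auctions with the most bids, highest first.
val top2: List[(String, Int)] =
  bidsPerAuction.toList.sortBy { case (_, n) => -n }.take(2)

println(distinctItemTypes)
println(maxBids)
println(top2)
```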

Example 4 - Word Count on HDFS Text File

Download the data file and put it into HDFS with:

wget --content-disposition https://webcms3.cse.unsw.edu.au/files/33c7707c8b646a686e33af7e2f2fc006b53ff8c13d8317976bd262d8c6daae66/attachment

hdfs dfs -put pg100.txt Input/

// Create an RDD from HDFS:

scala> val pg100RDD = sc.textFile ("hdfs://HOST-NAME:PORT/user/USER-NAME/Input/pg100.txt")

// Word count:

scala> pg100RDD.flatMap ( _.split(" ") )
               .map ( term => (term, 1) )
               .reduceByKey ( _ + _ , 3)
               .saveAsTextFile ( "OUTPUT-PATH" )
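
One detail of the word count is worth checking locally: splitting on a single space produces empty tokens wherever two spaces are adjacent, and those empty strings are then counted as a word. The sketch below shows this on invented input (textLines is hypothetical) and one possible way around it, splitting on runs of whitespace and dropping empty tokens. It is a local model with plain collections, not the original job.

```scala
// Invented input with a double space in the first line.
val textLines = List("to be  or not", "to be")

// Splitting on a single space leaves an empty token between the two spaces.
val naiveTokens: List[String] = textLines.flatMap(_.split(" "))

// Splitting on runs of whitespace and dropping empty tokens avoids that.
val cleanTokens: List[String] =
  textLines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

// Local word count: group equal words and count the size of each group.
val wordCounts: Map[String, Int] =
  cleanTokens.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

println(naiveTokens.contains(""))
println(wordCounts)
```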

Example N - Spark GraphX Programming

# Download graph data tiny-graph.txt

$ wget --content-disposition https://webcms3.cse.unsw.edu.au/files/ae6f45a3d64c0b35a3bd4d0c2740cc673f000dc60ec17d0e882faf6c20f74509/attachment

// Import the relevant GraphX classes:

scala> import org.apache.spark.graphx._

// Load graph data as RDD:

scala> val tinyGraphRDD = sc.textFile ("file:///home/PATH-TO-GRAPH-DATA/tiny-graph.txt")

// Convert the raw data <index, srcVertex, destVertex, weight>

// into GraphX edges:

scala> val edges = tinyGraphRDD.map ( _.split(" ") )
                               .map ( line => Edge ( line(1).toLong, line(2).toLong, line(3).toDouble ) )
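
The field-to-edge conversion can be tested on its own before involving GraphX. In this sketch, WeightedEdge and parseEdge are hypothetical stand-ins for GraphX's Edge and the anonymous function above, and the two sample lines are invented to match the stated format <index, srcVertex, destVertex, weight>.

```scala
// Hypothetical stand-in for org.apache.spark.graphx.Edge, so the sketch runs without Spark.
final case class WeightedEdge(src: Long, dst: Long, weight: Double)

// Parse one space-separated line; field 0 (the row index) is ignored.
def parseEdge(line: String): WeightedEdge = {
  val fields = line.split(" ")
  WeightedEdge(fields(1).toLong, fields(2).toLong, fields(3).toDouble)
}

// Invented sample lines in the format <index, srcVertex, destVertex, weight>.
val sampleGraphLines = List("0 10 20 1.5", "1 20 30 2.0")

val parsedEdges: List[WeightedEdge] = sampleGraphLines.map(parseEdge)

parsedEdges.foreach(println)
```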

// Create a graph:

scala> val graph = Graph.fromEdges[Double, Double] (edges, 0.0)

// Now that the graph has been created, show its triplets:

scala> graph.triplets.collect.foreach ( println )
