Spark Shell

Example 1 - Process Data from List

scala> val pairs = sc.parallelize( List(
("This", 2),
("is", 3),
("Spark", 5),
("is", 3)
) )
...
scala> pairs.collect().foreach(println)
(This,2)
(is,3)
(Spark,5)
(is,3)
// Reduce Pairs by Keys:
scala> val pair1 = pairs.reduceByKey((x,y) => x+y, 4)
...
scala> pair1.collect.foreach(println)
(Spark,5)
(is,6)
(This,2)
// Decrease values by 1:
scala> val pair2 = pairs.mapValues( x=>x-1 )
scala> pair2.collect.foreach(println)
(This,1)
(is,2)
(Spark,4)
(is,2)
// Group Values by Keys:
scala> pairs.groupByKey.collect().foreach(println)
(Spark,CompactBuffer(5))
(is,CompactBuffer(3, 3))
(This,CompactBuffer(2))
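For comparison (not in the original walkthrough), the same per-key sums can also be obtained from the grouped result by summing each value buffer; reduceByKey is normally preferred, since it combines values before shuffling:
// Sum the grouped values per key:
scala> pairs.groupByKey.mapValues( _.sum ).collect().foreach(println)
// prints the same per-key sums as pair1 above (output order may vary)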

Example 2 - Process Data from Local Text File

// Create an RDD from a local text file:
scala> val textFile = sc.textFile("file:///home/PATH_TO_SPARK_HOME/README.md")

RDD transformations and actions can now be applied to textFile

// This will display the number of lines in this textFile:
scala> textFile.count()
// or simply:
scala> textFile.count
// Note: parentheses can be omitted when the method takes no arguments
// This will display the first line:
scala> textFile.first
// Filter lines containing "Spark":
scala> val linesWithSpark = textFile.filter (
line => line.contains("Spark")
)
// or simply:
scala> val linesWithSpark = textFile.filter(_.contains ("Spark"))
// Note: the underscore "_" is a placeholder for each element of textFile
// Collect the content of linesWithSpark:
scala> linesWithSpark.collect ()
// Print each line of linesWithSpark:
scala> linesWithSpark.foreach (println)
// Map each line to #terms in it:
scala> val numOfTermsPerLine = textFile.map ( line => line.split(" ").size )
// or simply:
scala> val numOfTermsPerLine = textFile.map ( _.split(" ").size )
// Reduce numOfTermsPerLine to find the maximum #terms in a line:
scala> numOfTermsPerLine.reduce ( (a, b) => if (a > b) a else b )
// or use Math.max (the import below is optional, since java.lang is always in scope):
scala> import java.lang.Math
scala> numOfTermsPerLine.reduce ( (a, b) => Math.max(a, b))
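A shorter equivalent (a sketch assuming Spark's built-in max action on RDDs, available in recent versions):
scala> numOfTermsPerLine.max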
// Convert RDD textFile to a 1-D array of terms:
scala> val terms = textFile.flatMap ( _.split(" ") )
// Convert RDD textFile to a 2-D array (an array of terms per line):
scala> val terms_ = textFile.map ( _.split(" ") )
// Calculate the vocabulary size in textFile:
scala> terms.distinct().count() // or simply:
scala> terms.distinct.count
// Find the longest line in textFile together with its length:
scala> val lineLengthPair = textFile.map (
line => (line, line.length) )
scala> val lineWithMaxLength = lineLengthPair.reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// alternatively, in a concise way:
scala> val lineWithMaxLength = textFile.map (
line => (line, line.length) ).reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// Find all lines containing "Spark" along with their line numbers (starting from 0)
// and output in the format <line_no: line_content>
scala> val lineIndexPair = textFile.zipWithIndex()
scala> val lineIndexPairWithSpark = lineIndexPair.filter (
_._1.contains("Spark"))
scala> lineIndexPairWithSpark.foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
// alternatively, in a concise way:
scala> textFile.zipWithIndex().filter (
_._1.contains("Spark")).foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
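As a small extension (not in the original notes), the RDDs defined above can be combined to compute the average number of terms per line:
scala> val totalTerms = numOfTermsPerLine.reduce ( _ + _ )
scala> val avgTermsPerLine = totalTerms.toDouble / textFile.count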

Example 3 - Process Data from Local CSV File

Download the CSV file:

wget --content-disposition https://webcms3.cse.unsw.edu.au/files/cc5bb4af124130f899cddad80af071f1ad478c3c8eb7440433291459bb603ff1/attachment

Define a field-name-to-column-index mapping for the CSV file:

scala> val aucid = 0
scala> val bid = 1
scala> val bidtime = 2
scala> val bidder = 3
scala> val bidderrate = 4
scala> val openbid = 5
scala> val price = 6
scala> val itemtype = 7
scala> val dtl = 8
// Create an RDD as a 2-D array from CSV file:
scala> val auctionRDD = sc.textFile("file:///home/PATH-TO-CSV-FILE/auction.csv")
.map ( _.split(",") )
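A quick sanity check (a hypothetical extra step, not in the original notes) to confirm the split worked, reusing the first action from Example 2:
scala> auctionRDD.first.foreach ( println ) // prints each field of the first row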
// Count the number of distinct item types in the auction data:
scala> auctionRDD.map ( _(itemtype) ).distinct.count // itemtype was defined as 7 above, indexing the 8th column
// Count total number of bids per itemtype:
scala> auctionRDD.map ( line => ( line(itemtype), 1 ) )
.reduceByKey ( _ + _ , 4)
.foreach ( pair => println (pair._1 + "," + pair._2) )
// Find the maximum number of bids received by any single auction:
scala> auctionRDD.map ( line => ( line(aucid), 1 ) )
.reduceByKey ( _ + _ , 4)
.reduce ( (pair1, pair2) => if ( pair1._2 >= pair2._2 ) pair1 else pair2 )
._2
// Find the top-5 auctions with the most bids:
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.map ( _.swap )
.sortByKey (false)
.map ( _.swap )
.take (5)
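Alternatively (a sketch using RDD's sortBy to sort on the count directly, avoiding the two swaps):
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.sortBy ( _._2, false )
.take (5)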

Example 4 - Word Count on HDFS Text File

Download the data file and put it into HDFS:

wget --content-disposition https://webcms3.cse.unsw.edu.au/files/33c7707c8b646a686e33af7e2f2fc006b53ff8c13d8317976bd262d8c6daae66/attachment
hdfs dfs -put pg100.txt Input/
// Create an RDD from HDFS:
scala> val pg100RDD = sc.textFile ("hdfs://HOST-NAME:PORT/user/USER-NAME/Input/pg100.txt")
// Word count:
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.saveAsTextFile ( "OUTPUT-PATH" )
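To inspect the result in the shell instead of saving it, the swap/sortByKey pattern from Example 3 can be reused to print, say, the 10 most frequent terms (a sketch based on the same pipeline):
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.map ( _.swap )
.sortByKey (false)
.take (10)
.foreach ( pair => println (pair._2 + ": " + pair._1) )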

Example N - Spark GraphX Programming

# Download graph data tiny-graph.txt
$ wget --content-disposition https://webcms3.cse.unsw.edu.au/files/ae6f45a3d64c0b35a3bd4d0c2740cc673f000dc60ec17d0e882faf6c20f74509/attachment
// Import the relevant GraphX classes:
scala> import org.apache.spark.graphx._
// Load graph data as RDD:
scala> val tinyGraphRDD = sc.textFile ("file:///home/PATH-TO-GRAPH-DATA/tiny-graph.txt")
// Convert raw data <index, srcVertex, destVertex, weight>
// into GraphX-readable edges:
scala> val edges = tinyGraphRDD.map ( _.split(" ") )
.map ( line =>
Edge ( line(1).toLong,
line(2).toLong,
line(3).toDouble
)
)
// Create a graph:
scala> val graph = Graph.fromEdges[Double, Double] (edges, 0.0)
// Now that the graph has been created,
// show its triplets:
scala> graph.triplets.collect.foreach ( println )
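To explore the graph a little further (a sketch assuming the standard GraphX degree operators), the in- and out-degree of each vertex can be listed:
scala> graph.inDegrees.collect.foreach ( println )
scala> graph.outDegrees.collect.foreach ( println )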

Written with StackEdit.
