Exploring the Spark shell

Spark comes bundled with a REPL shell, which is a wrapper around the Scala shell. Though the Spark shell looks like a command line for simple tasks, complex queries can also be executed with it.

1. create the words directory

mkdir words

2. go into the words directory

cd words

3. create a sh.txt file

echo "to be or not to be" > sh.txt

4. start the Spark shell

spark-shell

5. load the words directory as an RDD (Resilient Distributed Dataset); the path below assumes the words directory is available in HDFS under /user/hduser

scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")

6. count the number of lines (result: 1)

scala> words.count

7. divide the line (or lines) into multiple words

scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))

8. convert each word to (word, 1)

scala> val wordsMap = wordsFlatMap.map(w => (w, 1))

9. add the number of occurrences for each word

scala> val wordCount = wordsMap.reduceByKey((a, b) => (a + b))

10. sort the results by key

scala> val wordCountSorted = wordCount.sortByKey(true)

11. print the RDD

scala> wordCountSorted.collect.foreach(println)

12. do all the operations in one step

scala> sc.textFile("hdfs://localhost:9000/user/hduser/words").flatMap(_.split("\\W+")).map(w => (w,1)).reduceByKey((a,b) => (a+b)).sortByKey(true).collect.foreach(println)

This gives us the following output (sorted by key in ascending order, because of sortByKey(true)):
(be,2)
(not,1)
(or,1)
(to,2)
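To see what each stage of the pipeline does without a running cluster, steps 7 to 10 can be sketched on plain Scala collections. This is only an illustration, not Spark code: `WordCountSketch` and `wordCount` are hypothetical names, `groupBy` plus a sum stands in for `reduceByKey`, `sortBy` stands in for `sortByKey(true)`, and the empty-token filter is an extra safeguard that the shell session does not have.

```scala
// Plain Scala collections stand in for the RDD here; no Spark is needed.
object WordCountSketch {
  // Mirrors steps 7 to 10: split, pair with 1, sum per word, sort by word.
  def wordCount(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split("\\W+"))   // step 7: lines -> words
      .filter(_.nonEmpty)         // addition: drop empty tokens (not in the shell session)
      .map(w => (w, 1))           // step 8: word -> (word, 1)
      .groupBy(_._1)              // local stand-in for reduceByKey: group pairs by word
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // step 9: sum the 1s per word
      .toSeq
      .sortBy(_._1)               // step 10: ascending by word, like sortByKey(true)

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be")).foreach(println)
}
```

On a real RDD the grouping and summing happen per partition and across the cluster, which is why Spark offers `reduceByKey` rather than a local `groupBy`.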

Developing Spark applications in Eclipse with Maven

Maven has two primary features:

1. Convention over configuration

/src/main/scala
/src/main/java
/src/main/resources
/src/test/scala
/src/test/java
/src/test/resources

2. Declarative dependency management

<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>

Install the Maven plugin for Eclipse:

1. Open Eclipse and navigate to Help | Install New Software

2. Click on the Work with drop-down menu

3. Select the <eclipse version> update site

4. Click on Collaboration tools

5. Check Maven integration with Eclipse

6. Click on Next and then click on Finish

Install the Scala plugin for Eclipse:

1. Open Eclipse and navigate to Help | Install New Software

2. Click on the Work with drop-down menu

3. Select the <eclipse version> update site

4. Type http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site

5. Press Enter

6. Select Scala IDE for Eclipse

7. Click on Next and then click on Finish

8. Navigate to Window | Open Perspective | Scala

Developing Spark applications in Eclipse with SBT

Simple Build Tool (SBT) is a build tool made especially for Scala-based development. SBT follows Maven-based naming conventions and declarative dependency management.

SBT provides the following enhancements over Maven:

1. Dependencies are in the form of key-value pairs in the build.sbt file as opposed to pom.xml in Maven

2. It provides a shell that makes it very handy to perform build operations

3. For simple projects without dependencies, you do not even need the build.sbt file
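As an illustration of the first point, the JUnit dependency shown earlier in Maven form could be written in build.sbt roughly as follows. This is a minimal sketch that reuses the same group, artifact and version; a real project would list its own dependencies here.

```scala
// build.sbt (sketch): the JUnit dependency from the Maven example, in SBT form.
// The three strings are groupId % artifactId % version.
libraryDependencies += "junit" % "junit" % "4.12"
```

One line replaces the five-line XML element, which is the main practical difference between the two declarations.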

In build.sbt, the first line is the project definition:

lazy val root = (project in file("."))

Each project has an immutable map of key-value pairs.

lazy val root = (project in file(".")).
  settings(
    name := "wordcount"
  )

Every change in the settings leads to a new map, as it's an immutable map.
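The idea of "a change produces a new map" can be illustrated with an ordinary Scala immutable Map. This is only an analogy for how settings behave, not SBT's actual settings API; `ImmutableSettingsSketch` is a hypothetical name and the "version" entry is an assumed example value.

```scala
// Analogy only: a plain immutable Map, not SBT's real settings type.
object ImmutableSettingsSketch {
  // The starting "settings": a single key-value pair.
  val base: Map[String, String] = Map("name" -> "wordcount")

  // Adding a key does not modify base; it returns a new map.
  val updated: Map[String, String] = base + ("version" -> "1.0") // "version" is an assumed example key

  def main(args: Array[String]): Unit = {
    println(base)    // base still holds only the name entry
    println(updated) // updated holds both entries
  }
}
```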

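To generate an Eclipse project from an SBT project, use the sbteclipse plugin:

1. add the sbteclipse plugin to the global plugin file

mkdir -p /home/hduser/.sbt/0.13/plugins
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")' > /home/hduser/.sbt/0.13/plugins/plugin.sbt

or add it to a specific project (SBT reads plugin definitions from the project directory under the project home)

cd <project-home>
mkdir -p project
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")' > project/plugin.sbt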

2. start the sbt shell

sbt

3. type eclipse and it will make an Eclipse-ready project

eclipse

4. navigate to File | Import | Import existing project into workspace to load the project into Eclipse
