Spark(2) - Developing Application with Spark
Exploring the Spark shell
Spark comes bundled with a REPL shell, which is a wrapper around the Scala shell. Though the Spark shell looks like a command line for simple things, in reality a lot of complex queries can also be executed using it.
1. create the words directory
mkdir words
2. go into the words directory
cd words
3. create a sh.txt file
echo "to be or not to be" > sh.txt
4. start the Spark shell
spark-shell
5. load the words directory as an RDD (Resilient Distributed Dataset); the hdfs:// path below assumes the directory has been copied to HDFS under /user/hduser
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
6. count the number of lines (result: 1)
scala> words.count
7. divide the line (or lines) into multiple words
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. convert each word to (word, 1)
scala> val wordsMap = wordsFlatMap.map(w => (w, 1))
9. add up the number of occurrences for each word
scala> val wordCount = wordsMap.reduceByKey((a, b) => (a + b))
10. sort the results by key
scala> val wordCountSorted = wordCount.sortByKey(true)
11. print the RDD
scala> wordCountSorted.collect.foreach(println)
12. do all the operations in one step
scala> sc.textFile("hdfs://localhost:9000/user/hduser/words").flatMap(_.split("\\W+")).map(w => (w,1)).reduceByKey((a,b) => (a+b)).sortByKey(true).collect.foreach(println)
Because of sortByKey(true), the pairs are printed in ascending key order:
(be,2)
(not,1)
(or,1)
(to,2)
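The same sequence of transformations can be tried without a cluster. The following is a minimal sketch, not Spark code: it mirrors the steps above with plain Scala collections, using groupBy followed by a reduce as a stand-in for reduceByKey. The object name WordCountSketch, the wordCount helper, and the filter that drops empty tokens are assumptions added for illustration.

```scala
object WordCountSketch {
  // Mirrors the shell session: split into words, pair each word with 1,
  // add up the counts per word, then sort by key.
  def wordCount(lines: Seq[String]): Seq[(String, Int)] = {
    val words = lines
      .flatMap(_.split("\\W+"))   // same regex as in the shell session
      .filter(_.nonEmpty)         // assumption: drop empty tokens from leading separators
    val pairs = words.map(w => (w, 1))
    val counts = pairs
      .groupBy(_._1)              // local stand-in for grouping by key
      .map { case (word, ps) => (word, ps.map(_._2).reduce(_ + _)) }
    counts.toSeq.sortBy(_._1)     // ascending, like sortByKey(true)
  }

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be")).foreach(println)
}
```

For the single line "to be or not to be", wordCount returns the pairs (be,2), (not,1), (or,1), (to,2) in that order.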
Developing Spark applications in Eclipse with Maven
Maven has two primary features:
1. Convention over configuration (a standard directory layout for sources, resources, and tests)
/src/main/scala
/src/main/java
/src/main/resources
/src/test/scala
/src/test/java
/src/test/resources
2. Declarative dependency management (dependencies are declared in pom.xml)
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.12</version>
</dependency>
Install Maven plugin for Eclipse:
1. Open Eclipse and navigate to Help | Install New Software
2. Click on the Work with drop-down menu
3. Select the <eclipse version> update site
4. Click on Collaboration tools
5. Check Maven Integration for Eclipse
6. Click on Next and then click on Finish
Install the Scala plugin for Eclipse:
1. Open Eclipse and navigate to Help | Install New Software
2. Click on the Work with drop-down menu
3. Select the <eclipse version> update site
4. Type http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site
5. Press Enter
6. Select Scala IDE for Eclipse
7. Click on Next and then click on Finish
8. Navigate to Window | Open Perspective | Scala
Developing Spark applications in Eclipse with SBT
Simple Build Tool (SBT) is a build tool made especially for Scala-based development. SBT follows Maven-based naming conventions and declarative dependency management.
SBT provides the following enhancements over Maven:
1. Dependencies are in the form of key-value pairs in the build.sbt file as opposed to pom.xml in Maven
2. It provides a shell that makes it very handy to perform build operations
3. For simple projects without dependencies, you do not even need the build.sbt file
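As a sketch of the first point, a build.sbt that declares a dependency could look like the following. It uses sbt's bare-settings style. The Spark artifact, the "provided" scope, and all version numbers are assumptions chosen to match the Scala 2.10 / sbt 0.13 setup referred to elsewhere in this article; adjust them to your installation.

```scala
// build.sbt (illustrative sketch; every version number here is an assumption)

name := "wordcount"

version := "0.1.0"

scalaVersion := "2.10.4"

// groupId %% artifactId % version % scope
// "provided" means the cluster supplies Spark at run time
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"
```

Each line binds a key (name, scalaVersion, libraryDependencies) to a value with := or +=, and the %% operator appends the Scala binary version to the artifact name.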
In build.sbt, the first line is the project definition:
lazy val root = (project in file("."))
Each project has an immutable map of key-value pairs.
lazy val root = (project in file(".")).
  settings(
    name := "wordcount"
  )
Every change in the settings leads to a new map, as it's an immutable map.
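The effect of immutability can be illustrated with an ordinary Scala Map. This is only an analogy and assumes nothing about sbt's internal data structures; the names SettingsMapSketch, base, and renamed are hypothetical.

```scala
object SettingsMapSketch {
  // An immutable map standing in for a project's settings.
  val base: Map[String, String] = Map("name" -> "root")

  // "Changing" a key returns a new map; base itself is left untouched.
  val renamed: Map[String, String] = base.updated("name", "wordcount")

  def main(args: Array[String]): Unit = {
    println(base("name"))
    println(renamed("name"))
  }
}
```

Here base still maps name to "root" after renamed is created, which is the behavior the sentence above describes for settings.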
1. add the sbteclipse plugin to the global plugin file
mkdir -p /home/hduser/.sbt/0.13/plugins
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")' > /home/hduser/.sbt/0.13/plugins/plugin.sbt
or add it to a specific project (plugin definitions live in the project subdirectory)
cd <project-home>
mkdir -p project
echo 'addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.5.0")' > project/plugin.sbt
2. start the sbt shell
sbt
3. type eclipse and it will make an Eclipse-ready project
eclipse
4. navigate to File | Import | Existing Projects into Workspace to load the project into Eclipse