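Lab 4: Introductory RDD Programming Practice
1. Interactive programming in spark-shell
(1) How many students are there in the department in total?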
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[4] at textFile at <console>:24
scala> val info = lines.map(row => row.split(",")(0))
info: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at map at <console>:25
scala> val latest = info.distinct()
latest: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at distinct at <console>:25
scala> latest.count
res0: Long = 265
(2) How many courses does the department offer?
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[4] at textFile at <console>:24
scala> val course = lines.map(row => row.split(",")(1))
course: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:25
scala> val course_num = course.distinct()
course_num: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at distinct at <console>:25
scala> course_num.count
res1: Long = 8
(3) What is Tom's average score across all his courses?
scala> val tom = lines.map(row => row.split(",")(0)=="Tom")
tom: org.apache.spark.rdd.RDD[Boolean] = MapPartitionsRDD[13] at map at <console>:25
scala> val tom = lines.filter(row => row.split(",")(0)=="Tom")
tom: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at filter at <console>:25
scala> tom.foreach(println)
Tom,DataBase,26
Tom,Algorithm,12
Tom,OperatingSystem,16
Tom,Python,40
Tom,Software,60
scala> tom.map(row => (row.split(",")(0),row.split(",")(2).toInt)).mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2+y._2)).mapValues(x => (x._1/x._2)).collect()
res6: Array[(String, Int)] = Array((Tom,30))
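Note that the average above is computed with Int division, so the fractional part is dropped: Tom's five scores sum to 154, and 154 / 5 yields 30 rather than 30.8. The sketch below uses plain Scala collections (no Spark) to reproduce both results from the scores printed above. The names tomScores, intAvg and exactAvg are illustrative only; on the RDD, the corresponding change would presumably be .mapValues(x => x._1.toDouble / x._2).

```scala
// Tom's five scores, as printed by tom.foreach(println) in the transcript.
val tomScores = Seq(26, 12, 16, 40, 60)

// Build a (sum, count) pair, the same shape that reduceByKey produces above.
val (sum, count) = tomScores.foldLeft((0, 0)) {
  case ((s, c), score) => (s + score, c + 1)
}

val intAvg = sum / count             // Int division truncates the fraction
val exactAvg = sum.toDouble / count  // Double division keeps it

println(s"Int average: $intAvg, Double average: $exactAvg")
```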
(4) How many courses has each student taken?
scala> val c_num = lines.map(row=>(row.split(",")(0),row.split(",")(1)))
c_num: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[21] at map at <console>:25
scala> c_num.mapValues(x => (x,1)).reduceByKey((x,y) => (" ",x._2 + y._2)).mapValues(x => x._2).foreach(println)
(Ford,3)
(Lionel,4)
(Verne,3)
(Lennon,4)
(Joshua,4)
(Marvin,3)
(Marsh,4)
(Bartholomew,5)
(Conrad,2)
(Armand,3)
(Jonathan,4)
(Broderick,3)
(Brady,5)
(Derrick,6)
(Rod,4)
(Willie,4)
(Walter,4)
(Boyce,2)
(Duncann,5)
(Elvis,2)
(Elmer,4)
(Bennett,6)
(Elton,5)
(Jo,5)
(Jim,4)
(Adonis,5)
(Abel,4)
(Peter,4)
(Alvis,6)
(Joseph,3)
(Raymondt,6)
(Kerwin,3)
(Wright,4)
(Adam,3)
(Borg,4)
(Sandy,1)
(Ben,4)
(Miles,6)
(Clyde,7)
(Francis,4)
(Dempsey,4)
(Ellis,4)
(Edward,4)
(Mick,4)
(Cleveland,4)
(Luthers,5)
(Virgil,5)
(Ivan,4)
(Alvin,5)
(Dick,3)
(Bevis,4)
(Leo,5)
(Saxon,7)
(Armstrong,2)
(Hogan,4)
(Sid,3)
(Blair,4)
(Colbert,4)
(Lucien,5)
(Kerr,4)
(Montague,3)
(Giles,7)
(Kevin,4)
(Uriah,1)
(Jeffrey,4)
(Simon,2)
(Elijah,4)
(Greg,4)
(Colin,5)
(Arlen,4)
(Maxwell,4)
(Payne,6)
(Kennedy,4)
(Spencer,5)
(Kent,4)
(Griffith,4)
(Jeremy,6)
(Alan,5)
(Andrew,4)
(Jerry,3)
(Donahue,5)
(Gilbert,3)
(Bishop,2)
(Bernard,2)
(Egbert,4)
(George,4)
(Noah,4)
(Bruce,3)
(Mike,3)
(Frank,3)
(Boris,6)
(Tony,3)
(Christ,2)
(Ken,3)
(Milo,2)
(Victor,2)
(Clare,4)
(Nigel,3)
(Christopher,4)
(Robin,4)
(Chad,6)
(Alfred,2)
(Woodrow,3)
(Rory,4)
(Dennis,4)
(Ward,4)
(Chester,6)
(Emmanuel,3)
(Stan,3)
(Jerome,3)
(Corey,4)
(Harvey,7)
(Herbert,3)
(Maurice,2)
(Merle,3)
(Les,6)
(Bing,6)
(Charles,3)
(Clement,5)
(Leopold,7)
(Brian,6)
(Horace,5)
(Sebastian,6)
(Bernie,3)
(Basil,4)
(Michael,5)
(Ernest,5)
(Tom,5)
(Vic,3)
(Eli,5)
(Duke,4)
(Alva,5)
(Lester,4)
(Hayden,3)
(Bertram,3)
(Bart,5)
(Adair,3)
(Sidney,5)
(Bowen,5)
(Roderick,4)
(Colby,4)
(Jay,6)
(Meredith,4)
(Harold,4)
(Max,3)
(Scott,3)
(Barton,1)
(Elliot,3)
(Matthew,2)
(Alexander,4)
(Todd,3)
(Wordsworth,4)
(Geoffrey,4)
(Devin,4)
(Donald,4)
(Roy,6)
(Harry,4)
(Abbott,3)
(Baron,6)
(Mark,7)
(Lewis,4)
(Rock,6)
(Eugene,1)
(Aries,2)
(Samuel,4)
(Glenn,6)
(Will,3)
(Gerald,4)
(Henry,2)
(Jesse,7)
(Bradley,2)
(Merlin,5)
(Monroe,3)
(Hobart,4)
(Ron,6)
(Archer,5)
(Nick,5)
(Louis,6)
(Len,5)
(Randolph,3)
(Benson,4)
(John,6)
(Abraham,3)
(Benedict,6)
(Marico,6)
(Berg,4)
(Aldrich,3)
(Lou,2)
(Brook,4)
(Ronald,3)
(Pete,3)
(Nicholas,5)
(Bill,2)
(Harlan,6)
(Tracy,3)
(Gordon,4)
(Alston,4)
(Andy,3)
(Bruno,5)
(Beck,4)
(Phil,3)
(Barry,5)
(Nelson,5)
(Antony,5)
(Rodney,3)
(Truman,3)
(Marlon,4)
(Don,2)
(Philip,2)
(Sean,6)
(Webb,7)
(Solomon,5)
(Aaron,4)
(Blake,4)
(Amos,5)
(Chapman,4)
(Jonas,4)
(Valentine,8)
(Angelo,2)
(Boyd,3)
(Benjamin,4)
(Winston,4)
(Allen,4)
(Evan,3)
(Albert,3)
(Newman,2)
(Jason,4)
(Hilary,4)
(William,6)
(Dean,7)
(Claude,2)
(Booth,6)
(Channing,4)
(Jeff,4)
(Webster,2)
(Marshall,4)
(Cliff,5)
(Dominic,4)
(Upton,5)
(Herman,3)
(Levi,2)
(Clark,6)
(Hiram,6)
(Drew,5)
(Bert,3)
(Alger,5)
(Brandon,5)
(Antonio,3)
(Elroy,5)
(Leonard,2)
(Adolph,4)
(Blithe,3)
(Kenneth,3)
(Perry,5)
(Matt,4)
(Eric,4)
(Archibald,5)
(Martin,3)
(Kim,4)
(Clarence,7)
(Vincent,5)
(Winfred,3)
(Christian,2)
(Bob,3)
(Enoch,3)
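The per-student count can also be written without carrying the course name through the reduction: emit (student, 1) for each row and sum the ones with reduceByKey. The following is a minimal sketch under stated assumptions, not the author's method. It runs on a handful of in-memory rows rather than Data01.txt; Tom's rows come from the transcript, while Jim's scores and the names sample and coursesPerStudent are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one is running; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("CourseCountSketch").setMaster("local[*]"))

// Illustrative rows in the name,course,score layout of Data01.txt.
// Jim's rows are hypothetical sample data.
val sample = sc.parallelize(Seq(
  "Tom,DataBase,26", "Tom,Algorithm,12", "Tom,Python,40",
  "Jim,DataBase,70", "Jim,Software,65"))

// Emit (student, 1) per row, then sum the ones for each student.
val coursesPerStudent = sample
  .map(row => (row.split(",")(0), 1))
  .reduceByKey(_ + _)
  .collect()
  .toMap

println(coursesPerStudent)
```

Like the original, this counts rows per student, so it equals the number of distinct courses only if no (student, course) pair is repeated in the data.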
(5) How many students in the department take the DataBase course?
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[4] at textFile at <console>:24
scala> val database_num = lines.filter(row => row.split(",")(1)=="DataBase")
database_num: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at filter at <console>:25
scala> database_num.count
res7: Long = 126
(6) What is the average score of each course?
scala> val ave = lines.map(row=>(row.split(",")(1),row.split(",")(2).toInt))
ave: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:25
scala> ave.mapValues(x=>(x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1/ x._2)).collect()
res9: Array[(String, Int)] = Array((CLanguage,50), (Software,50), (Python,57), (Algorithm,48), (DataStructure,47), (DataBase,50), (ComputerNetwork,51), (OperatingSystem,54))
(7) Use an accumulator to count how many students take the DataBase course
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val database_num = lines.filter(row=>row.split(",")(1)=="DataBase").map(row=>(row.split(",")(1),1))
database_num: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)
scala> database_num.values.foreach(x => accum.add(x))
scala> accum.value
res1: Long = 126
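The accumulator can also be updated directly inside foreach, without first building an intermediate pair RDD. Because foreach is an action, Spark applies each task's accumulator update only once, which is what makes the count reliable. The sketch below is a hedged variant on a few in-memory rows; Jim's row and the names sample and dbCount are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if one is running; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("AccumulatorSketch").setMaster("local[*]"))

// Illustrative rows in the name,course,score layout; Jim's row is hypothetical.
val sample = sc.parallelize(Seq(
  "Tom,DataBase,26", "Tom,Algorithm,12", "Jim,DataBase,70"))

val dbCount = sc.longAccumulator("DataBase enrolments")

// Add 1 for every row whose course field is DataBase.
sample.foreach { row =>
  if (row.split(",")(1) == "DataBase") dbCount.add(1L)
}

// Read the accumulated total on the driver.
println(dbCount.sum)
```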
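2. Write a standalone application to deduplicate data
Given two input files A and B, write a standalone Spark application that merges the two files and removes duplicate lines, producing a new file C. A sample of the input and output files is given below for reference.
Sample of input file A: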
20170101 x
20170102 y
20170103 x
20170104 y
20170105 z
20170106 z
Sample of input file B:
20170101 y
20170102 y
20170103 x
20170104 z
20170105 y
Sample of output file C, obtained by merging input files A and B:
20170101 x
20170101 y
20170102 y
20170103 x
20170104 y
20170104 z
20170105 y
20170105 z
20170106 z
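import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object lab04 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RemDup")
    val sc = new SparkContext(conf)
    // textFile accepts a comma-separated list of paths, so A and B are read into one RDD.
    val dataFile = "file:///usr/local/spark/sparklab/a.txt,file:///usr/local/spark/sparklab/b.txt"
    val data = sc.textFile(dataFile, 2)
    // Remove duplicate lines, then sort so the order matches sample C.
    val da = data.distinct().sortBy(line => line)
    // Collect to the driver before printing so the sorted order is preserved.
    // To write file C instead, call da.saveAsTextFile with an output directory of your choice.
    da.collect().foreach(println)
    sc.stop()
  }
}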
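The merge-and-deduplicate logic can be checked against the sample data with plain Scala collections, without starting Spark. This is only a sketch: the collection methods ++, distinct and sorted stand in for what union (or a multi-path textFile), distinct and sortBy would do on RDDs, and the names fileA, fileB and fileC are illustrative.

```scala
// Sample contents of input files A and B, copied from the listings above.
val fileA = Seq("20170101 x", "20170102 y", "20170103 x",
                "20170104 y", "20170105 z", "20170106 z")
val fileB = Seq("20170101 y", "20170102 y", "20170103 x",
                "20170104 z", "20170105 y")

// Merge, drop duplicate lines, then sort so the order matches sample C.
val fileC = (fileA ++ fileB).distinct.sorted

fileC.foreach(println)
```

The explicit sort matters: distinct on an RDD gives no ordering guarantee, so without it the lines may not appear in the order shown in sample C.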
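3. Write a standalone application to compute average scores
Each input file holds the scores of a class for one subject. Each line consists of two fields: the student's name and the student's score. Write a standalone Spark application that computes the average score of every student and writes the result to a new file. A sample of the input and output files is given below for reference.
Algorithm scores: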
小明 92
小红 87
小新 82
小丽 90
Database scores:
小明 95
小红 81
小新 89
小丽 85
Python scores:
小明 82
小红 83
小新 94
小丽 91
The average scores are as follows:
(小红,83.67)
(小新,88.33)
(小明,89.67)
(小丽,88.67)
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object lab043 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AvgScore")
    val sc = new SparkContext(conf)
    val dataFile = "file:///usr/local/spark/sparklab/lab043/1.txt,file:///usr/local/spark/sparklab/lab043/2.txt,file:///usr/local/spark/sparklab/lab043/3.txt"
    val data = sc.textFile(dataFile, 3)
    val score = data
      .map { line => val fields = line.split(" "); (fields(0), fields(1).toInt) }
      .mapValues(x => (x, 1))                             // (score, 1)
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))  // (sum, count)
      // Divide as Double and round to two decimals; Int division would give 83 instead of 83.67.
      .mapValues(x => math.round(x._1.toDouble / x._2 * 100) / 100.0)
    score.collect().foreach(println)
    // Uncomment to write the result to the "result" directory (it must not already exist).
    // score.saveAsTextFile("result")
    sc.stop()
  }
}
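As a sanity check on the expected output, the averages can be recomputed from the sample scores with plain Scala collections. This is a sketch without Spark, and the names records and averages are illustrative. The division has to be done in floating point: with Int division, 小红's average (251 / 3) would come out as 83 rather than 83.67.

```scala
// Sample scores from the three input files above (Algorithm, Database, Python).
val records = Seq(
  "小明 92", "小红 87", "小新 82", "小丽 90",
  "小明 95", "小红 81", "小新 89", "小丽 85",
  "小明 82", "小红 83", "小新 94", "小丽 91")

val averages: Map[String, Double] = records
  .map { line => val Array(name, score) = line.split(" "); (name, score.toInt) }
  .groupBy(_._1)
  .map { case (name, rows) =>
    val avg = rows.map(_._2).sum.toDouble / rows.size
    // Round to two decimal places, as in the expected output.
    name -> math.round(avg * 100) / 100.0
  }

averages.foreach { case (name, avg) => println(s"($name,$avg)") }
```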