A Tour of SparkSQL
1. Prepare the data: employee.txt
,Gong Shaocheng,
,Li Dachao,
,Qiu Xin,
,Cheng Jiangzhong,
,Wo Binggang,
Each line is a comma-separated record of the form employeeId,name,departmentId. Put the file into HDFS:
[root@jfp3- spark-studio]# hdfs dfs -put employee.txt /user/spark_studio
2. Start the Spark shell
[root@jfp3- spark-1.0.-bin-hadoop2]# ./bin/spark-shell --master spark://192.168.0.71:7077
3. Write the script
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Employee(employeeId: String, name: String, departmentId: Int)
// Create an RDD of Employee objects and register it as a table.
val employees = sc.textFile("hdfs://jfp3-1:8020/user/spark_studio/employee.txt").map(_.split(",")).map(p => Employee(p(0), p(1), p(2).trim.toInt))
employees.registerAsTable("employee")
// SQL statements can be run by using the sql method provided by sqlContext.
val fsis = sql("SELECT name FROM employee WHERE departmentId = 1")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
fsis.map(t => "Name: " + t(0)).collect().foreach(println)
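Because the SchemaRDD returned by sql() is also an ordinary RDD, the usual RDD operations apply to the query result as well; a quick illustrative sketch:

// Normal RDD operations work directly on the query result, e.g.:
fsis.count()   // number of rows matching the query
fsis.cache()   // keep the result in memory for reuse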
4. Run it
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> import sqlContext._
import sqlContext._

scala> case class Employee(employeeId: String, name: String, departmentId: Int)
defined class Employee

scala> val employees = sc.textFile("hdfs://jfp3-1:8020/user/spark_studio/employee.txt").map(_.split(",")).map(p => Employee(p(0), p(1), p(2).trim.toInt))
employees: org.apache.spark.rdd.RDD[Employee] = MappedRDD[...] at map at <console>

scala> employees.registerAsTable("employee")

scala> val fsis = sql("SELECT name FROM employee WHERE departmentId = 1")
fsis: org.apache.spark.sql.SchemaRDD =
SchemaRDD[...] at RDD at SchemaRDD.scala
== Query Plan ==
Project [name#...]
 Filter (departmentId#... = 1)
  ExistingRdd [employeeId#...,name#...,departmentId#...], MapPartitionsRDD[...] at mapPartitions at basicOperators.scala

scala> fsis.map(t => "Name: " + t(0)).collect().foreach(println)
[Spark scheduler log output omitted; the collect job finished in 1.386154401 s]
Name: Gong Shaocheng
Name: Li Dachao
Name: Qiu Xin
5. Save the data in Parquet format and run SQL on it
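The transcript below starts from reading employee.parquet back, so that file must already exist on HDFS. A minimal sketch of producing it from the table registered in step 3, assuming the target path matches the one read back below:

// Write the employees SchemaRDD out to HDFS in Parquet format;
// the schema (employeeId, name, departmentId) is stored along with the data.
employees.saveAsParquetFile("hdfs://jfp3-1:8020/user/spark_studio/employee.parquet")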
scala> val parquetFile = sqlContext.parquetFile("hdfs://jfp3-1:8020/user/spark_studio/employee.parquet")
parquetFile: org.apache.spark.sql.SchemaRDD =
SchemaRDD[...] at RDD at SchemaRDD.scala
== Query Plan ==
ParquetTableScan [employeeId#...,name#...,departmentId#...], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None
scala> parquetFile.registerAsTable("parquetFile")
scala> val telcos = sql("SELECT name FROM parquetFile WHERE departmentId = 3")
telcos: org.apache.spark.sql.SchemaRDD =
SchemaRDD[...] at RDD at SchemaRDD.scala
== Query Plan ==
Project [name#...]
 Filter (departmentId#... = 3)
  ParquetTableScan [name#...,departmentId#...], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None
scala> telcos.collect().foreach(println)
[Spark scheduler log output omitted; the collect job finished in 2.210887848 s]
[Wo Binggang]
6. DSL syntax support
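Spark SQL also provides a language-integrated DSL on SchemaRDDs, in which columns are referenced as Scala symbols. The transcript below collects a SchemaRDD named all whose definition does not appear in the log; judging from the query plan (a filter on departmentId over the employee RDD followed by a projection of name), it would have been built roughly as in the following sketch, where the threshold value of 1 is an assumption (any value no greater than the smallest departmentId returns all five rows):

// Hypothetical reconstruction of the DSL query behind `all`:
// keep every employee whose departmentId is at least 1 and project only the name column.
val all = employees.where('departmentId >= 1).select('name)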
scala> all.collect().foreach(println)
[Spark scheduler log output omitted; the submitted stage's plan is Project [name] over Filter (departmentId >= ...) on the existing employee RDD, and the collect job finished in 0.052556716 s]
[Gong Shaocheng]
[Li Dachao]
[Qiu Xin]
[Cheng Jiangzhong]
[Wo Binggang]
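For comparison, the Parquet-table query from step 5 can also be written in the DSL; a sketch, assuming the DSL's symbol-based equality operator (===), with telcosDsl as a name introduced here purely for illustration:

// DSL form of: SELECT name FROM employee WHERE departmentId = 3
val telcosDsl = employees.where('departmentId === 3).select('name)
telcosDsl.collect().foreach(println)   // should print [Wo Binggang]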