A Tour of Spark SQL
1. Prepare the data file employee.txt
,Gong Shaocheng,
,Li Dachao,
,Qiu Xin,
,Cheng Jiangzhong,
,Wo Binggang,
Put the data into HDFS
[root@jfp3- spark-studio]# hdfs dfs -put employee.txt /user/spark_studio
2. Start the Spark shell
[root@jfp3- spark-1.0.-bin-hadoop2]# ./bin/spark-shell --master spark://192.168.0.71:7077
3. Write the script
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// employeeId is a String here, matching the shell session in step 4 (p(0) is not converted to Int).
case class Employee(employeeId: String, name: String, departmentId: Int)
// Create an RDD of Employee objects and register it as a table.
val employees = sc.textFile("hdfs://jfp3-1:8020/user/spark_studio/employee.txt").map(_.split(",")).map(p => Employee(p(0), p(1), p(2).trim.toInt))
employees.registerAsTable("employee")
// SQL statements can be run by using the sql methods provided by sqlContext.
val fsis = sql("SELECT name FROM employee WHERE departmentId = 1")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
fsis.map(t => "Name: " + t(0)).collect().foreach(println)
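The script above assumes every line splits into three fields and that the third parses as an integer; p(2).trim.toInt throws on anything else. As a minimal sketch of the same parse-then-filter logic on plain Scala collections (no Spark needed), the following skips malformed lines instead of failing. The helper name parseEmployee and the sample lines are hypothetical and only for illustration; they are not part of the original script or data.

```scala
import scala.util.Try

// Same shape as the Employee case class used in the Spark script.
case class Employee(employeeId: String, name: String, departmentId: Int)

// Hypothetical helper: parse one comma-separated line.
// Returns None for malformed input instead of throwing.
def parseEmployee(line: String): Option[Employee] =
  line.split(",", -1) match {
    case Array(id, name, dept) =>
      Try(dept.trim.toInt).toOption.map(d => Employee(id.trim, name.trim, d))
    case _ => None
  }

// Hypothetical sample lines (IDs, names and departments are made up).
val sampleLines = Seq(
  "e1,Alice Example,1",
  "e2,Bob Example,3",
  ",Broken Line,",   // empty department field, so this line is skipped
  "not a record"     // wrong number of fields, also skipped
)

val parsedEmployees = sampleLines.flatMap(parseEmployee)

// Plain-collection equivalent of:
//   SELECT name FROM employee WHERE departmentId = 1
val namesInDept1 = parsedEmployees.filter(_.departmentId == 1).map(_.name)

namesInDept1.foreach(n => println("Name: " + n))
```

In the Spark script the same idea would amount to replacing the second map with a flatMap over such a helper, so that bad lines are dropped before the table is registered.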
4. Run it
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@
scala> import sqlContext._
import sqlContext._
scala> case class Employee(employeeId: String, name: String, departmentId: Int)
defined class Employee
scala> val employees = sc.textFile("hdfs://jfp3-1:8020/user/spark_studio/employee.txt").map(_.split(",")).map(p => Employee(p(0), p(1), p(2).trim.toInt))
// :: INFO MemoryStore: ensureFreeSpace() called with curMem=, maxMem=
// :: INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 294.8 MB)
employees: org.apache.spark.rdd.RDD[Employee] = MappedRDD[] at map at <console>:
scala> employees.registerAsTable("employee")
scala> val fsis = sql("SELECT name FROM employee WHERE departmentId = 1")
// :: INFO Analyzer: Max iterations () reached for batch MultiInstanceRelations
// :: INFO Analyzer: Max iterations () reached for batch CaseInsensitiveAttributeReferences
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Add exchange
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Prepare Expressions
fsis: org.apache.spark.sql.SchemaRDD =
SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ExistingRdd [employeeId#,name#,departmentId#], MapPartitionsRDD[] at mapPartitions at basicOperators.scala:
scala> fsis.map(t => "Name: " + t(0)).collect().foreach(println)
// :: INFO FileInputFormat: Total input paths to process :
// :: INFO SparkContext: Starting job: collect at <console>:
// :: INFO DAGScheduler: Got job (collect at <console>:) with output partitions (allowLocal=false)
// :: INFO DAGScheduler: Final stage: Stage (collect at <console>:)
// :: INFO DAGScheduler: Parents of final stage: List()
// :: INFO DAGScheduler: Missing parents: List()
// :: INFO DAGScheduler: Submitting Stage (MappedRDD[] at map at <console>:), which has no missing parents
// :: INFO DAGScheduler: Submitting missing tasks from Stage (MappedRDD[] at map at <console>:)
// :: INFO TaskSchedulerImpl: Adding task set 0.0 with tasks
// :: INFO TaskSetManager: Starting task 0.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 0.0: as bytes in ms
// :: INFO TaskSetManager: Starting task 0.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 0.0: as bytes in ms
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Stage (collect at <console>:) finished in 1.284 s
// :: INFO SparkContext: Job finished: collect at <console>:, took 1.386154401 s
Name: Gong Shaocheng
Name: Li Dachao
Name: Qiu Xin
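5. Save the data in Parquet format and run SQL against it
The employees RDD first has to be written out as a Parquet file. The save call is not captured in the session below; in Spark 1.0 it would presumably be employees.saveAsParquetFile("hdfs://jfp3-1:8020/user/spark_studio/employee.parquet"). The file is then loaded back and queried: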
scala> val parquetFile = sqlContext.parquetFile("hdfs://jfp3-1:8020/user/spark_studio/employee.parquet")
// :: INFO Analyzer: Max iterations () reached for batch MultiInstanceRelations
// :: INFO Analyzer: Max iterations () reached for batch CaseInsensitiveAttributeReferences
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Add exchange
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Prepare Expressions
parquetFile: org.apache.spark.sql.SchemaRDD =
SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
ParquetTableScan [employeeId#,name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None
scala> parquetFile.registerAsTable("parquetFile")
scala> val telcos = sql("SELECT name FROM parquetFile WHERE departmentId = 3")
// :: INFO Analyzer: Max iterations () reached for batch MultiInstanceRelations
// :: INFO Analyzer: Max iterations () reached for batch CaseInsensitiveAttributeReferences
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Add exchange
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Prepare Expressions
// :: INFO MemoryStore: ensureFreeSpace() called with curMem=, maxMem=
// :: INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 176.3 KB, free 294.6 MB)
telcos: org.apache.spark.sql.SchemaRDD =
SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ParquetTableScan [name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None
scala> telcos.collect().foreach(println)
// :: INFO FileInputFormat: Total input paths to process :
// :: INFO ParquetInputFormat: Total input paths to process :
// :: INFO ParquetFileReader: reading summary file: hdfs://jfp3-1:8020/user/spark_studio/employee.parquet/_metadata
// :: INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
// :: INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
// :: INFO SparkContext: Starting job: collect at <console>:
// :: INFO DAGScheduler: Got job (collect at <console>:) with output partitions (allowLocal=false)
// :: INFO DAGScheduler: Final stage: Stage (collect at <console>:)
// :: INFO DAGScheduler: Parents of final stage: List()
// :: INFO DAGScheduler: Missing parents: List()
// :: INFO DAGScheduler: Submitting Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ParquetTableScan [name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None), which has no missing parents
// :: INFO DAGScheduler: Submitting missing tasks from Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ParquetTableScan [name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None)
// :: INFO TaskSchedulerImpl: Adding task set 2.0 with tasks
// :: INFO TaskSetManager: Starting task 2.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 2.0: as bytes in ms
// :: INFO TaskSetManager: Starting task 2.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 2.0: as bytes in ms
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Stage (collect at <console>:) finished in 2.177 s
// :: INFO SparkContext: Job finished: collect at <console>:, took 2.210887848 s
[Wo Binggang]
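6. DSL syntax support
The definition of all is not captured in this session. Judging by the query plan in the log below (a Filter on departmentId >= followed by a Project on name), it was built with the language-integrated DSL, along the lines of employees.where('departmentId >= 1).select('name), where the threshold 1 is an assumption.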
scala> all.collect().foreach(println)
// :: INFO SparkContext: Starting job: collect at <console>:
// :: INFO DAGScheduler: Got job (collect at <console>:) with output partitions (allowLocal=false)
// :: INFO DAGScheduler: Final stage: Stage (collect at <console>:)
// :: INFO DAGScheduler: Parents of final stage: List()
// :: INFO DAGScheduler: Missing parents: List()
// :: INFO DAGScheduler: Submitting Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: >= )
ExistingRdd [employeeId#,name#,departmentId#], MapPartitionsRDD[] at mapPartitions at basicOperators.scala:), which has no missing parents
// :: INFO DAGScheduler: Submitting missing tasks from Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: >= )
ExistingRdd [employeeId#,name#,departmentId#], MapPartitionsRDD[] at mapPartitions at basicOperators.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 6.0 with tasks
// :: INFO TaskSetManager: Starting task 6.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 6.0: as bytes in ms
// :: INFO TaskSetManager: Starting task 6.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 6.0: as bytes in ms
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Stage (collect at <console>:) finished in 0.039 s
// :: INFO SparkContext: Job finished: collect at <console>:, took 0.052556716 s
[Gong Shaocheng]
[Li Dachao]
[Qiu Xin]
[Cheng Jiangzhong]
[Wo Binggang]