Spark 学习笔记：（三）Spark SQL

参考：https://spark.apache.org/docs/latest/sql-programming-guide.html#overview

　　http://www.csdn.net/article/2015-04-03/2824407

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.

1)在Spark中，DataFrame是一种以RDD为基础的分布式数据集，类似于传统数据库中的二维表格。DataFrame与RDD的主要区别在于，前者带有schema元信息，即DataFrame所表示的二维表数据集的每一列都带有名称和类型。这使得Spark SQL得以洞察更多的结构信息，从而对藏于DataFrame背后的数据源以及作用于DataFrame之上的变换进行了针对性的优化，最终达到大幅提升运行时效率的目标。反观RDD，由于无从得知所存数据元素的具体内部结构，Spark Core只能在stage层面进行简单、通用的流水线优化。

2)A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.

3)The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

val df = sqlContext.sql("SELECT * FROM table")   //sql接口

创建DataFrames：

With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

val sc: SparkContext // An existing SparkContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.

import sqlContext.implicits._

Two different methods for converting existing RDDs into DataFrames.
- The first method uses reflection to infer the schema of an RDD that contains specific types of objects.

The case class defines the schema of the table=>The names of the arguments to the case class are read using reflection and become the names of the columns=>This RDD can be implicitly converted to a DataFrame and then be registered as a table=>Tables can be used in subsequent SQL statements.

// sc is an existing SparkContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.

import sqlContext.implicits._

// Define the schema using a case class.

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,

// you can use custom classes that implement the Product interface.

case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()

people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.

// The columns of a row in the result can be accessed by ordinal.

teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

- Aprogrammatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.

From data sources：

val df = sqlContext.load("people.json", "json")

df.select("name", "age").save("namesAndAges.parquet", "parquet")

val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

Spark 学习笔记：（三）Spark SQL的更多相关文章

spark学习笔记总结-spark入门资料精化
Spark学习笔记 Spark简介 spark 可以很容易和yarn结合,直接调用HDFS.Hbase上面的数据,和hadoop结合.配置很容易. spark发展迅猛,框架比hadoop更加灵活实用. ...
Spark学习笔记-三种属性配置详细说明【转】
相关资料:Spark属性配置 http://www.cnblogs.com/chengxin1982/p/4023111.html 本文出处:转载自过往记忆(http://www.iteblog.c ...
Spark学习笔记-使用Spark History Server
在运行Spark应用程序的时候,driver会提供一个webUI给出应用程序的运行信息,但是该webUI随着应用程序的完成而关闭端口,也就是说,Spark应用程序运行完后,将无法查看应用程序的历史记 ...
Spark 学习笔记之 Spark history Server 搭建
在hdfs上建立文件夹/directory hadoop fs -mkdir /directory 进入conf目录 spark-env.sh 增加以下配置 export SPARK_HISTORY ...
Spark学习笔记之-Spark远程调试
Spark远程调试本例子介绍简单介绍spark一种远程调试方法,使用的IDE是IntelliJ IDEA. 1.了解jvm一些参数属性 -X ...
Spark学习笔记之SparkRDD
Spark学习笔记之SparkRDD 一. 基本概念 RDD(resilient distributed datasets)弹性分布式数据集. 来自于两方面 ① 内存集合和外部存储系统 ② ...
Spark学习笔记0——简单了解和技术架构
目录 Spark学习笔记0--简单了解和技术架构什么是Spark 技术架构和软件栈 Spark Core Spark SQL Spark Streaming MLlib GraphX 集群管理器受 ...
Oracle学习笔记三 SQL命令
SQL简介 SQL 支持下列类别的命令: 1.数据定义语言(DDL) 2.数据操纵语言(DML) 3.事务控制语言(TCL) 4.数据控制语言(DCL)
Spark学习笔记2（spark所需环境配置
Spark学习笔记2 配置spark所需环境 1.首先先把本地的maven的压缩包解压到本地文件夹中,安装好本地的maven客户端程序,版本没有什么要求不需要最新版的maven客户端. 解压完成之后 ...
Spark学习笔记3（IDEA编写scala代码并打包上传集群运行）
Spark学习笔记3 IDEA编写scala代码并打包上传集群运行我们在IDEA上的maven项目已经搭建完成了,现在可以写一个简单的spark代码并且打成jar包上传至集群,来检验一下我们的sp ...

随机推荐

ctype.h 第2章
ctype.h ctype.h是c标准函数库中的头文件定义了一批c语言字符分类函数 (c character classification functions) 用于测试字符是否属于特定的字 ...
python刷toj
1452 import sys a , b , c = map(int,sys.stdin.readline().split()) print ((a+b+c),(a*b*c), '%.2f' %(( ...
ansible部署
ansible的特性:基于Python语言实现,由paramiko,PyYAML和jinjia2三个关键模块部署简单,agentless 默认使用ssh协议 (1) 基于秘钥认证方式 ...
Python之静态语法检查
Python是一门动态语言.在给python传参数的时候并没有严格的类型限制.写python程序的时候,发现错误经常只能在执行的时候发现.有一些错误由于隐藏的比较深,只有特定逻辑才会触发,往往导致需要 ...
【Luogu】P2158仪仗队（欧拉函数）
题目链接首先来介绍欧拉函数. 设欧拉函数为f(n),则f(n)=1~n中与n互质的数的个数. 欧拉函数有三条引论: 1.若n为素数,则f(n)=n-1; 2.若n为pa,则f(n)=(p-1)*(p ...
BZOJ 2190 [SDOI2008]仪仗队 ——Dirichlet积
[题目分析] 考虑斜率为0和斜率不存在的两条线上只能看到3人. 其余的人能被看见,当且仅当gcd(x,y)=1 ,然后拿卷积算一算发现就是欧拉函数的前缀和的二倍. 注意2的情况要特判. [代码] # ...
FZU 2020 :组合【lucas】
Problem Description 给出组合数C(n,m), 表示从n个元素中选出m个元素的方案数.例如C(5,2) = 10, C(4,2) = 6.可是当n,m比较大的时候,C(n,m)很大! ...
Classloader中loadClass()方法和Class.forName()区别
Classloader中loadClass()方法和Class.forName()都能得到一个class对象,那这两者得到的class对象有什么区别呢 1.java类装载的过程 Java类装载有三个步 ...
mysql查询死锁，执行语句，服务器状态等语句集合
[转]http://blog.csdn.net/enweitech/article/details/52447006
PHP输出控制函数（ob系列函数）
PHP输出控制函数(ob系列函数) flush — 刷新输出缓冲ob_clean — 清空(擦掉)输出缓冲区ob_end_clean — 清空(擦除)缓冲区并关闭输出缓冲ob_end_flush — ...

Spark 学习笔记：（三）Spark SQL

Spark 学习笔记：（三）Spark SQL的更多相关文章

随机推荐

热门专题