Spark 学习笔记：（三）Spark SQL

参考：https://spark.apache.org/docs/latest/sql-programming-guide.html#overview

　　http://www.csdn.net/article/2015-04-03/2824407

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.

1)在Spark中，DataFrame是一种以RDD为基础的分布式数据集，类似于传统数据库中的二维表格。DataFrame与RDD的主要区别在于，前者带有schema元信息，即DataFrame所表示的二维表数据集的每一列都带有名称和类型。这使得Spark SQL得以洞察更多的结构信息，从而对藏于DataFrame背后的数据源以及作用于DataFrame之上的变换进行了针对性的优化，最终达到大幅提升运行时效率的目标。反观RDD，由于无从得知所存数据元素的具体内部结构，Spark Core只能在stage层面进行简单、通用的流水线优化。

2)A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.

3)The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

val df = sqlContext.sql("SELECT * FROM table")   //sql接口

创建DataFrames：

With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

val sc: SparkContext // An existing SparkContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.

import sqlContext.implicits._

Two different methods for converting existing RDDs into DataFrames.
- The first method uses reflection to infer the schema of an RDD that contains specific types of objects.

The case class defines the schema of the table=>The names of the arguments to the case class are read using reflection and become the names of the columns=>This RDD can be implicitly converted to a DataFrame and then be registered as a table=>Tables can be used in subsequent SQL statements.

// sc is an existing SparkContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.

import sqlContext.implicits._

// Define the schema using a case class.

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,

// you can use custom classes that implement the Product interface.

case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()

people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.

// The columns of a row in the result can be accessed by ordinal.

teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

- Aprogrammatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.

From data sources：

val df = sqlContext.load("people.json", "json")

df.select("name", "age").save("namesAndAges.parquet", "parquet")

val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

Spark 学习笔记：（三）Spark SQL的更多相关文章

spark学习笔记总结-spark入门资料精化
Spark学习笔记 Spark简介 spark 可以很容易和yarn结合,直接调用HDFS.Hbase上面的数据,和hadoop结合.配置很容易. spark发展迅猛,框架比hadoop更加灵活实用. ...
Spark学习笔记-三种属性配置详细说明【转】
相关资料:Spark属性配置 http://www.cnblogs.com/chengxin1982/p/4023111.html 本文出处:转载自过往记忆(http://www.iteblog.c ...
Spark学习笔记-使用Spark History Server
在运行Spark应用程序的时候,driver会提供一个webUI给出应用程序的运行信息,但是该webUI随着应用程序的完成而关闭端口,也就是说,Spark应用程序运行完后,将无法查看应用程序的历史记 ...
Spark 学习笔记之 Spark history Server 搭建
在hdfs上建立文件夹/directory hadoop fs -mkdir /directory 进入conf目录 spark-env.sh 增加以下配置 export SPARK_HISTORY ...
Spark学习笔记之-Spark远程调试
Spark远程调试本例子介绍简单介绍spark一种远程调试方法,使用的IDE是IntelliJ IDEA. 1.了解jvm一些参数属性 -X ...
Spark学习笔记之SparkRDD
Spark学习笔记之SparkRDD 一. 基本概念 RDD(resilient distributed datasets)弹性分布式数据集. 来自于两方面 ① 内存集合和外部存储系统 ② ...
Spark学习笔记0——简单了解和技术架构
目录 Spark学习笔记0--简单了解和技术架构什么是Spark 技术架构和软件栈 Spark Core Spark SQL Spark Streaming MLlib GraphX 集群管理器受 ...
Oracle学习笔记三 SQL命令
SQL简介 SQL 支持下列类别的命令: 1.数据定义语言(DDL) 2.数据操纵语言(DML) 3.事务控制语言(TCL) 4.数据控制语言(DCL)
Spark学习笔记2（spark所需环境配置
Spark学习笔记2 配置spark所需环境 1.首先先把本地的maven的压缩包解压到本地文件夹中,安装好本地的maven客户端程序,版本没有什么要求不需要最新版的maven客户端. 解压完成之后 ...
Spark学习笔记3（IDEA编写scala代码并打包上传集群运行）
Spark学习笔记3 IDEA编写scala代码并打包上传集群运行我们在IDEA上的maven项目已经搭建完成了,现在可以写一个简单的spark代码并且打成jar包上传至集群,来检验一下我们的sp ...

随机推荐

linux下ln命令
转自:http://www.cnblogs.com/peida/archive/2012/12/11/2812294.html ln是linux中又一个非常重要命令,它的功能是为某一个文件在另外一个位 ...
九度oj 题目1096：日期差值
题目描述: 有两个日期,求两个日期之间的天数,如果两个日期是连续的我们规定他们之间的天数为两天输入: 有多组数据,每组数据有两行,分别表示两个日期,形式为YYYYMMDD 输出: 每组数据输出一行, ...
hdu6069[素数筛法] 2017多校4
对于[l , r]内的每个数,根据唯一分解定理有所以有因为 //可根据唯一分解定理推导所以题目要求就可以运用它到上述公式 (注意不能暴力对l,r内的数一个个分解算贡献 ...
【bzoj4826】[Hnoi2017]影魔单调栈+可持久化线段树
题目描述影魔,奈文摩尔,据说有着一个诗人的灵魂.事实上,他吞噬的诗人灵魂早已成千上万.千百年来,他收集了各式各样的灵魂,包括诗人.牧师.帝王.乞丐.奴隶.罪人,当然,还有英雄.每一个灵魂,都有着自己 ...
LA 4728 旋转卡壳算法求凸包的最大直径
#include<iostream> #include<cstdio> #include<cmath> #include<vector> #includ ...
自己写了一个超级简便且傻瓜式的且功能强大的CSV组件（并且代码优雅，绝对没有一行多余的代码）
github地址: https://github.com/yangfeixxx/chipsCSV.git 解决的问题:解决了传统的CSV工具类对于实体类无法自动化封装为带表头的CSV文件的痛点,在读取 ...
Spring入门之setter DI注入
1.新建Java项目导入依赖jar包,参考前一章 2.以不同文件格式输出为例 3.定义接口IOutputGenerator.java package com.spring.output; public ...
angular中关于自定义指令——repeat渲染完成后执行动作
业务中有时需要在异步获取数据并用ng-repeat遍历渲染完页面后执行某个操作,angular本身并没有提供监听ng-repeat渲染完成的指令,所以需要自己动手写.有经验的同学都应该知道,在ng-r ...
excel批量导入数据库SQL server
思路: 第一是文件上传,可以参照Jakarta的FileUpload组件,用普通的Post也就行了.第二是Excel解析,用JSL或者POI都行第三是数据保存,这个应该简单吧,一个循环,一行对应一条数 ...
搭建Redis环境以及所遇问题(CentOS7+Redis 3.2.8)
一.安装步骤 1. 首先需要安装gcc,把下载好的redis-3.2.8-rc2.tar.gz 放到/usr/local文件夹下 2. 进行解压 tar -zxvf redis-3.2.8-rc2.t ...

Spark 学习笔记：（三）Spark SQL

Spark 学习笔记：（三）Spark SQL的更多相关文章

随机推荐

热门专题