1 Overview
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
 
2 DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
 
2.1 SQLContext
The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
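Several of the later examples rely on implicit conversions (for example, converting an RDD to a DataFrame with toDF). A minimal sketch of the import that brings them into scope in a standalone application:
// Import implicit conversions from the SQLContext; needed for toDF and related syntax
import sqlContext.implicits._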
 
2.2 Creating DataFrames
// Create a DataFrame from a JSON file (a local path is shown here; HDFS paths also work)
val df = sqlContext.read.json("/home/slh/data/people.json")
 
2.3 Operations
Some basic operations on the DataFrame created above:
// Display the contents of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// Select only the "name" column
df.select("name").show()
// Select everybody's name and increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Keep only people older than 13
df.filter(df("age") > 13).show()
// Count people by age
df.groupBy("age").count().show()
 
2.4 Running SQL Queries
The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
val df = sqlContext.sql("select * from table")
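The query above assumes a table named "table" is already visible to the SQLContext. As a rough sketch (the variable and table names below are illustrative, not from the original example), a DataFrame can be registered as a temporary table and then queried by name:
// peopleDF stands for a DataFrame such as the one read from people.json in section 2.2
peopleDF.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()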
 
2.5 Interoperating with RDDs
Spark SQL supports two different methods for converting existing RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating DataFrames is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.
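A rough sketch of both approaches, assuming a text file of comma-separated "name,age" records (the file path, case class, and column names are illustrative):
// Method 1: reflection - the schema is inferred from a case class
case class Person(name: String, age: Int)
import sqlContext.implicits._
val peopleDF = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// Method 2: programmatic - build a StructType and apply it to an RDD of Rows
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", StringType, true)))
val rowRDD = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim))
val peopleDF2 = sqlContext.createDataFrame(rowRDD, schema)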
 
3 Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as a normal RDD and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using the Spark data sources, and then goes into the specific options that are available for the built-in data sources.
 
3.1 Load/Save Functions
Generic load/save functions (the default data source, Parquet, is used unless configured otherwise):
val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
Manually specifying the data source format:
val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
Save modes (control what happens when data already exists at the target location):
SaveMode.ErrorIfExists (default) - throw an error if data already exists
SaveMode.Append - append the contents of the DataFrame to the existing data
SaveMode.Overwrite - overwrite the existing data with the contents of the DataFrame
SaveMode.Ignore - silently do nothing if data already exists
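A minimal sketch of passing a save mode to a write (the output path is illustrative):
import org.apache.spark.sql.SaveMode

// Overwrite any existing output instead of failing with the default ErrorIfExists
df.select("name", "age").write.mode(SaveMode.Overwrite).format("parquet").save("namesAndAges.parquet")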
 
3.2 Parquet Files
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
 
// people is an RDD of case class objects, implicitly converted to a DataFrame (via the implicits import), allowing it to be stored using Parquet.
people.write.parquet("people.parquet")

// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
val parquetFile = sqlContext.read.parquet("people.parquet")
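The resulting DataFrame can be used like any other; for example, it can be registered as a temporary table and queried with SQL (the table name below is illustrative):
// Register the Parquet-backed DataFrame as a temporary table, then query it
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.show()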

 
3.3 JSON Datasets
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame, either from a path of JSON files (one JSON object per line) or from an RDD of JSON strings:
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)

val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
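The inferred schemas can be inspected with printSchema(); a small usage sketch:
// Print the schemas Spark SQL inferred from the JSON documents above
people.printSchema()
anotherPeople.printSchema()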

 
3.4 Hive Tables
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark’s build. This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
 
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
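With a HiveContext, queries can be expressed in HiveQL; a minimal sketch along the lines of the standard Hive example (the table name and input file are illustrative):
// Create a Hive table, load data into it, and run a HiveQL query
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)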
