1. Overall Execution Flow

The following code is used to analyze the Spark SQL execution flow, to make clear the several states a LogicalPlan passes through, and to understand the overall Spark SQL execution process.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

(1) View the schema of teenagers

scala> teenagers.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)

(2) View the execution flow

scala> teenagers.queryExecution
res3: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

Code Generation: true

QueryExecution represents the overall Spark SQL execution flow. From the output above we can see that executing a SQL statement goes through the following stages (a sketch showing how to inspect each stage individually follows the listing):

== (1) Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== (2) Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== (3) Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== (4) Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

// dynamic bytecode generation (code generation, CG) is enabled to improve query efficiency
Code Generation: true
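
Each of these stages can also be inspected on its own through the fields of QueryExecution, which is how the rest of this article drills into individual plans. A minimal sketch of that kind of inspection, assuming the Spark 1.x API and the teenagers DataFrame defined above:

// the unresolved plan produced by the SQL parser
teenagers.queryExecution.logical
// the resolved plan produced by the Analyzer
teenagers.queryExecution.analyzed
// the plan after the rule-based Optimizer has run
teenagers.queryExecution.optimizedPlan
// the physical plan selected by the planner
teenagers.queryExecution.sparkPlan

// DataFrame.explain(true) prints the same set of plans to the console
teenagers.explain(true)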

2. Full Table Scan Execution Flow

Statement executed:

val all = sqlContext.sql("SELECT * FROM people")

Execution flow:

scala> all.queryExecution
res9: org.apache.spark.sql.SQLContext#QueryExecution =
// note that * is parsed as unresolvedalias(*)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'UnresolvedRelation [people], None

// unresolvedalias(*) is analyzed into all the fields in the schema
// UnresolvedRelation [people] is analyzed into Subquery people
== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Subquery people
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Scan PhysicalRDD[name#0,age#1]

Code Generation: true
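
The same plan can be reached without writing SQL at all: looking the registered table up as a DataFrame builds the logical plan directly (so there is no SQL parsing step), and the remaining stages are identical. A small sketch, assuming the people table registered in section 1 (the allDf name is only illustrative):

// look up the registered temporary table as a DataFrame
val allDf = sqlContext.table("people")

// analyzed, optimized and physical plans should match the SQL version above
allDf.queryExecution.optimizedPlan
allDf.queryExecution.sparkPlan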

3. Filter Query Execution Flow

Statement executed:

scala> val filterQuery= sqlContext.sql("SELECT * FROM people WHERE age >= 13 AND age <= 19")
filterQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]

Execution flow:

scala> filterQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter (('age >= 13) && ('age <= 19))
  'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
// compared with the full table scan, an extra Filter node appears here; the same holds for the later stages
 Filter ((age#1 >= 13) && (age#1 <= 19))
  Subquery people
   LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20

== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

Code Generation: true
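
The predicate can also be written with Column expressions in the DataFrame API; it ends up as the same Filter node in the plan. A minimal sketch, assuming the people DataFrame and the implicits import from section 1 (dfFilter is only an illustrative name):

// build the predicate from Column expressions instead of a SQL string
val dfFilter = people.where($"age" >= 13 && $"age" <= 19)

// the physical plan should again be a Filter on top of the scanned RDD
dfFilter.queryExecution.sparkPlan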

4. Join Query Execution Flow

Statement executed:

val joinQuery = sqlContext.sql("SELECT * FROM people a, people b where a.age=b.age")

View the overall execution flow:

scala> joinQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
// note the Filter node
// and the Join Inner node
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter ('a.age = 'b.age)
  'Join Inner, None
   'UnresolvedRelation [people], Some(a)
   'UnresolvedRelation [people], Some(b)

== Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#2,age#3]
 Filter (age#1 = age#3)
  Join Inner, None
   Subquery a
    Subquery people
     LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
   Subquery b
    Subquery people
     LogicalRDD [name#2,age#3], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Project [name#0,age#1,name#2,age#3]
 Join Inner, Some((age#1 = age#3))
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4]...

// view its Physical Plan
scala> joinQuery.queryExecution.sparkPlan
res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#2,age#3]
 SortMergeJoin [age#1], [age#3]
  Scan PhysicalRDD[name#0,age#1]
  Scan PhysicalRDD[name#2,age#3]

The previous example is equivalent to the following one; only the way the join is written differs slightly. Statement executed:

scala> val innerQuery= sqlContext.sql("SELECT * FROM people a inner join people b on a.age=b.age")
innerQuery: org.apache.spark.sql.DataFrame = [name: string, age: int, name: string, age: int]

View the overall execution flow:

scala> innerQuery.queryExecution
res2: org.apache.spark.sql.SQLContext#QueryExecution =
// note the Join Inner node
// also note that there is no Filter node here
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Join Inner, Some(('a.age = 'b.age))
  'UnresolvedRelation [people], Some(a)
  'UnresolvedRelation [people], Some(b)

== Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#4,age#5]
 Join Inner, Some((age#1 = age#5))
  Subquery a
   Subquery people
    LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
  Subquery b
   Subquery people
    LogicalRDD [name#4,age#5], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

// note that the Optimized Logical Plan is not specially optimized compared with the
// Analyzed Logical Plan; this is highlighted so it can be contrasted with the
// difference between Analyzed and Optimized in the subquery example later
== Optimized Logical Plan ==
Project [name#0,age#1,name#4,age#5]
 Join Inner, Some((age#1 = age#5))
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder ...

// view its Physical Plan
scala> innerQuery.queryExecution.sparkPlan
res14: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#6,age#7]
 SortMergeJoin [age#1], [age#7]
  Scan PhysicalRDD[name#0,age#1]
  Scan PhysicalRDD[name#6,age#7]
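
The same self-join can be built with the DataFrame API; the call skips SQL parsing but should produce an equivalent analyzed plan and go through the same optimizer and planner. A minimal sketch, assuming the people DataFrame and the implicits import from section 1 (the a/b aliases and the dfJoin name are only illustrative):

// alias the same DataFrame twice so the two sides of the self-join can be told apart
val a = people.as("a")
val b = people.as("b")
val dfJoin = a.join(b, $"a.age" === $"b.age")

// should show a Join Inner in the optimized plan and a SortMergeJoin in the physical plan
dfJoin.queryExecution.optimizedPlan
dfJoin.queryExecution.sparkPlan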

5. Subquery Execution Flow

Statement executed:

scala> val subQuery=sqlContext.sql("SELECT * FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19")
subQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]

View the overall execution flow:


scala> subQuery.queryExecution
res4: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
 'Filter ('a.age <= 19)
  'Subquery a
   'Project [unresolvedalias(*)]
    'Filter ('age >= 13)
     'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
 Filter (age#1 <= 19)
  Subquery a
   Project [name#0,age#1]
    Filter (age#1 >= 13)
     Subquery people
      LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

// note the difference between the Optimized and the Analyzed plan here:
// the two Filters have been merged into one by the optimizer
== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
 Scan PhysicalRDD[name#0,age#1]

Code Generation: true
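
The same merging of predicates can be seen with the DataFrame API by chaining two filter calls: the analyzed plan still contains two Filter nodes, and the optimizer collapses them into one. A minimal sketch, assuming the people DataFrame from section 1 (chained is only an illustrative name):

// two separate filters in the logical plan
val chained = people.filter("age >= 13").filter("age <= 19")

// before optimization: two Filter nodes
chained.queryExecution.analyzed
// after optimization: a single Filter ((age >= 13) && (age <= 19)), as in the subquery example above
chained.queryExecution.optimizedPlan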

6. Aggregate SQL Execution Flow

Statement executed:

scala> val aggregateQuery=sqlContext.sql("SELECT a.name,sum(a.age) FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19 group by a.name")
aggregateQuery: org.apache.spark.sql.DataFrame = [name: string, _c1: bigint]

View the execution flow:


scala> aggregateQuery.queryExecution
res6: org.apache.spark.sql.SQLContext#QueryExecution =
// note 'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]
// i.e. group by a.name is parsed as unresolvedalias('a.name)
== Parsed Logical Plan ==
'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]
 'Filter ('a.age <= 19)
  'Subquery a
   'Project [unresolvedalias(*)]
    'Filter ('age >= 13)
     'UnresolvedRelation [people], None

== Analyzed Logical Plan ==
name: string, _c1: bigint
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
 Filter (age#1 <= 19)
  Subquery a
   Project [name#0,age#1]
    Filter (age#1 >= 13)
     Subquery people
      LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22

== Optimized Logical Plan ==
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
 Filter ((age#1 >= 13) && (age#1 <= 19))
  LogicalRDD [name#0,age#1], MapPartitions...

// view its Physical Plan
scala> aggregateQuery.queryExecution.sparkPlan
res10: org.apache.spark.sql.execution.SparkPlan =
TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Final,isDistinct=false)], output=[name#0,_c1#14L])
 TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Partial,isDistinct=false)], output=[name#0,currentSum#17L])
  Filter ((age#1 >= 13) && (age#1 <= 19))
   Scan PhysicalRDD[name#0,age#1]
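
The physical plan splits the aggregation into a Partial and a Final TungstenAggregate: a map-side partial aggregation followed by a final merge after the shuffle. An equivalent DataFrame-API version, useful for comparing plans, is sketched below, assuming the people DataFrame from section 1 (the functions import and the dfAgg name are shown only for illustration):

import org.apache.spark.sql.functions.sum

// filter then group, mirroring the SQL above
val dfAgg = people.filter("age >= 13 AND age <= 19").groupBy("name").agg(sum("age"))

// should again show the Partial/Final TungstenAggregate pair over a Filter and a scan
dfAgg.queryExecution.sparkPlan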

Other SQL statements can be examined with the same method to view their execution flow and to grasp the basic ideas behind the Spark SQL implementation.
