1.总体执行流程

使用下列代码对SparkSQL流程进行分析。让大家明确LogicalPlan的几种状态,理解SparkSQL总体执行流程

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._ // Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int) // Create an RDD of Person objects and register it as a table.
val people = sc.textFile("/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people") // SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

(1)查看teenagers的Schema信息

scala> teenagers.printSchema
root
|-- name: string (nullable = true)
|-- age: integer (nullable = false)

(2)查看执行流程

scala> teenagers.queryExecution
res3: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
'Filter (('age >= 13) && ('age <= 19))
'UnresolvedRelation [people], None == Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
Filter ((age#1 >= 13) && (age#1 <= 19))
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
Scan PhysicalRDD[name#0,age#1] Code Generation: true

QueryExecution中表示的是总体Spark SQL执行流程,从上面的输出结果能够看到,一个SQL语句要执行须要经过下列步骤:

== (1)Parsed Logical Plan ==
'Project [unresolvedalias('name),unresolvedalias('age)]
'Filter (('age >= 13) && ('age <= 19))
'UnresolvedRelation [people], None == (2)Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
Filter ((age#1 >= 13) && (age#1 <= 19))
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == (3)Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == (4)Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
Scan PhysicalRDD[name#0,age#1] //启动动态字节码生成技术(bytecode generation。CG),提升查询效率
Code Generation: true

2.全表查询执行流程

执行语句:

val all= sqlContext.sql("SELECT * FROM people")

执行流程:

scala> all.queryExecution
res9: org.apache.spark.sql.SQLContext#QueryExecution =
//注意*号被解析为unresolvedalias(*)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'UnresolvedRelation [people], None == Analyzed Logical Plan ==
//unresolvedalias(*)被analyzed为Schema中全部的字段
//UnresolvedRelation [people]被analyzed为Subquery people
name: string, age: int
Project [name#0,age#1]
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Optimized Logical Plan ==
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Physical Plan ==
Scan PhysicalRDD[name#0,age#1] Code Generation: true

3. filter查询执行流程

执行语句:

scala> val filterQuery= sqlContext.sql("SELECT * FROM people WHERE age >= 13 AND age <= 19")
filterQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]

执行流程:

scala> filterQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter (('age >= 13) && ('age <= 19))
'UnresolvedRelation [people], None == Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
//多出了Filter。后同
Filter ((age#1 >= 13) && (age#1 <= 19))
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20 == Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:20 == Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
Scan PhysicalRDD[name#0,age#1] Code Generation: true

4. join查询执行流程

执行语句:

val joinQuery= sqlContext.sql("SELECT * FROM people a, people b where a.age=b.age")

查看总体执行流程

scala> joinQuery.queryExecution
res0: org.apache.spark.sql.SQLContext#QueryExecution =
//注意Filter
//Join Inner
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter ('a.age = 'b.age)
'Join Inner, None
'UnresolvedRelation [people], Some(a)
'UnresolvedRelation [people], Some(b) == Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#2,age#3]
Filter (age#1 = age#3)
Join Inner, None
Subquery a
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
Subquery b
Subquery people
LogicalRDD [name#2,age#3], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Optimized Logical Plan ==
Project [name#0,age#1,name#2,age#3]
Join Inner, Some((age#1 = age#3))
LogicalRDD [name#0,age#1], MapPartitionsRDD[4]... //查看其Physical Plan
scala> joinQuery.queryExecution.sparkPlan
res16: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#2,age#3]
SortMergeJoin [age#1], [age#3]
Scan PhysicalRDD[name#0,age#1]
Scan PhysicalRDD[name#2,age#3]

前面的样例与以下的样例等同,仅仅只是其执行方式略有不同,执行语句:

scala> val innerQuery= sqlContext.sql("SELECT * FROM people a inner join people b on a.age=b.age")
innerQuery: org.apache.spark.sql.DataFrame = [name: string, age: int, name: string, age: int]

查看总体执行流程:

scala> innerQuery.queryExecution
res2: org.apache.spark.sql.SQLContext#QueryExecution =
//注意Join Inner
//另外这里面没有Filter
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Join Inner, Some(('a.age = 'b.age))
'UnresolvedRelation [people], Some(a)
'UnresolvedRelation [people], Some(b) == Analyzed Logical Plan ==
name: string, age: int, name: string, age: int
Project [name#0,age#1,name#4,age#5]
Join Inner, Some((age#1 = age#5))
Subquery a
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22
Subquery b
Subquery people
LogicalRDD [name#4,age#5], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 //注意Optimized Logical Plan与Analyzed Logical Plan
//并没有进行特别的优化,突出这一点是为了比較后面的子查询
//其Analyzed和Optimized间的差别
== Optimized Logical Plan ==
Project [name#0,age#1,name#4,age#5]
Join Inner, Some((age#1 = age#5))
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder ... //查看其Physical Plan
scala> innerQuery.queryExecution.sparkPlan
res14: org.apache.spark.sql.execution.SparkPlan =
TungstenProject [name#0,age#1,name#6,age#7]
SortMergeJoin [age#1], [age#7]
Scan PhysicalRDD[name#0,age#1]
Scan PhysicalRDD[name#6,age#7]

5. 子查询执行流程

执行语句:

scala> val subQuery=sqlContext.sql("SELECT * FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19")
subQuery: org.apache.spark.sql.DataFrame = [name: string, age: int]

查看总体执行流程:


scala> subQuery.queryExecution
res4: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
'Filter ('a.age <= 19)
'Subquery a
'Project [unresolvedalias(*)]
'Filter ('age >= 13)
'UnresolvedRelation [people], None == Analyzed Logical Plan ==
name: string, age: int
Project [name#0,age#1]
Filter (age#1 <= 19)
Subquery a
Project [name#0,age#1]
Filter (age#1 >= 13)
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 //这里须要注意Optimized与Analyzed间的差别
//Filter被进行了优化
== Optimized Logical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Physical Plan ==
Filter ((age#1 >= 13) && (age#1 <= 19))
Scan PhysicalRDD[name#0,age#1] Code Generation: true

6. 聚合SQL执行流程

执行语句:

scala> val aggregateQuery=sqlContext.sql("SELECT a.name,sum(a.age) FROM (SELECT * FROM people WHERE age >= 13)a where a.age <= 19 group by a.name")
aggregateQuery: org.apache.spark.sql.DataFrame = [name: string, _c1: bigint]

执行流程查看:


scala> aggregateQuery.queryExecution
res6: org.apache.spark.sql.SQLContext#QueryExecution =
//注意'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]
//即group by a.name被 parsed为unresolvedalias('a.name)
== Parsed Logical Plan ==
'Aggregate ['a.name], [unresolvedalias('a.name),unresolvedalias('sum('a.age))]
'Filter ('a.age <= 19)
'Subquery a
'Project [unresolvedalias(*)]
'Filter ('age >= 13)
'UnresolvedRelation [people], None == Analyzed Logical Plan ==
name: string, _c1: bigint
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
Filter (age#1 <= 19)
Subquery a
Project [name#0,age#1]
Filter (age#1 >= 13)
Subquery people
LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at rddToDataFrameHolder at <console>:22 == Optimized Logical Plan ==
Aggregate [name#0], [name#0,sum(cast(age#1 as bigint)) AS _c1#9L]
Filter ((age#1 >= 13) && (age#1 <= 19))
LogicalRDD [name#0,age#1], MapPartitions... //查看其Physical Plan
scala> aggregateQuery.queryExecution.sparkPlan
res10: org.apache.spark.sql.execution.SparkPlan =
TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Final,isDistinct=false)], output=[name#0,_c1#14L])
TungstenAggregate(key=[name#0], functions=[(sum(cast(age#1 as bigint)),mode=Partial,isDistinct=false)], output=[name#0,currentSum#17L])
Filter ((age#1 >= 13) && (age#1 <= 19))
Scan PhysicalRDD[name#0,age#1]

其他SQL语句。大家能够使用相同的方法查看其执行流程。以掌握Spark SQL背后实现的基本思想。

Spark修炼之道(进阶篇)——Spark入门到精通:第九节 Spark SQL执行流程解析的更多相关文章

  1. Spark修炼之道——Spark学习路线、课程大纲

    课程内容 Spark修炼之道(基础篇)--Linux基础(15讲).Akka分布式编程(8讲) Spark修炼之道(进阶篇)--Spark入门到精通(30讲) Spark修炼之道(实战篇)--Spar ...

  2. Spark入门:第1节 Spark概述:1 - 4

    2.spark概述 2.1 什么是spark Apache Spark™ is a unified analytics engine for large-scale data processing. ...

  3. Spark入门:第4节 Spark程序:1 - 9

    五. Spark角色介绍 Spark是基于内存计算的大数据并行计算框架.因为其基于内存计算,比Hadoop中MapReduce计算框架具有更高的实时性,同时保证了高效容错性和可伸缩性.从2009年诞生 ...

  4. Spark入门:第2节 Spark集群安装:1 - 3;第3节 Spark HA高可用部署:1 - 2

    三. Spark集群安装 3.1 下载spark安装包 下载地址spark官网:http://spark.apache.org/downloads.html 这里我们使用 spark-2.1.3-bi ...

  5. 深入浅出Mybatis系列(十)---SQL执行流程分析(源码篇)

    最近太忙了,一直没时间继续更新博客,今天忙里偷闲继续我的Mybatis学习之旅.在前九篇中,介绍了mybatis的配置以及使用, 那么本篇将走进mybatis的源码,分析mybatis 的执行流程, ...

  6. 第九篇:Map/Reduce 工作机制分析 - 作业的执行流程

    前言 从运行我们的 Map/Reduce 程序,到结果的提交,Hadoop 平台其实做了很多事情. 那么 Hadoop 平台到底做了什么事情,让 Map/Reduce 程序可以如此 "轻易& ...

  7. 深入浅出Mybatis系列十-SQL执行流程分析(源码篇)

    注:本文转载自南轲梦 注:博主 Chloneda:个人博客 | 博客园 | Github | Gitee | 知乎 最近太忙了,一直没时间继续更新博客,今天忙里偷闲继续我的Mybatis学习之旅.在前 ...

  8. Spark修炼之道(基础篇)——Linux大数据开发基础:第二节:Linux文件系统、文件夹(一)

    本节主要内容 怎样获取帮助文档 Linux文件系统简单介绍 文件夹操作 訪问权限 1. 怎样获取帮助文档 在实际工作过程其中,常常会忘记命令的使用方式.比如ls命令后面能够跟哪些參数,此时能够使用ma ...

  9. Spark修炼之道(高级篇)——Spark源代码阅读:第十二节 Spark SQL 处理流程分析

    作者:周志湖 以下的代码演示了通过Case Class进行表Schema定义的样例: // sc is an existing SparkContext. val sqlContext = new o ...

随机推荐

  1. 三星R428 内存不兼容金士顿2G DDR3

    京东上买了个金士顿2G DDR3, 回家装上之后发现不兼容, 原机带的是三星DDR3 1066的2G条子,买的是 金士顿DDR3 2G 1333的条子,结果单独插任何一根都好使,两个插槽均无问题,但是 ...

  2. 这应该是目前最快速有效的ASP.NET Core学习方式(视频)

    ASP.NET Core都2.0了,它的普及还是不太好.作为一个.NET的老司机,我觉得.NET Core给我带来了很多的乐趣.Linux, Docker, CloudNative,MicroServ ...

  3. Altera FIFO IP核时序说明

    ALTERA在LPM(library of parameterized mudules)库中提供了参数可配置的单时钟FIFO(SCFIFO)和双时钟FIFO(DCFIFO).FIFO主要应用在需要数据 ...

  4. 【小技巧解决大问题】使用 frp 突破阿里云主机无弹性公网 IP 不能用作 Web 服务器的限制

    背景 今年 8 月份左右,打折价买了一个阿里云主机,比平常便宜了 2000 多块.买了之后,本想作为一个博客网站的,毕竟国内的服务器访问肯定快一些.满心欢喜的下单之后,却发现 http 服务,外网怎么 ...

  5. 父类清除浮动的原因、(清除浮动代码,置于CSS中方便调用)

    浮动因素在静态网页制作中经常被应用到,比如要让块级元素不独占一行,常常应用设置float的方式来实现.但是应用的时候会发现,设置了子类浮动后,未给父类清除浮动,这样就会造成一下问题: 1.浮动的元素会 ...

  6. android studio 默认 .gitignore 文件模板

    # built application files*.apk*.ap_ # files for the dex VM*.dex # Java class files*.class # generate ...

  7. CDH5.11..0安装

    1.参考: http://www.cnblogs.com/codedevelop/p/6762555.html grant all privileges on *.* to 'root'@'hostn ...

  8. ES6这些就够了

    刚开始用vue或者react,很多时候我们都会把ES6这个大兄弟加入我们的技术栈中.但是ES6那么多那么多特性,我们需要全部都掌握吗?秉着二八原则,掌握好常用的,有用的这个可以让我们快速起飞. 接下来 ...

  9. ShoneSharp语言(S#)的设计和使用介绍—数值Double

    ShoneSharp语言(S#)的设计和使用介绍 系列(5)- 数值Double 作者:Shone 声明:原创文章欢迎转载,但请注明出处,https://www.cnblogs.com/ShoneSh ...

  10. ##1.Centos7环境准备-- openstack pike

    ##1.Centos7环境准备 openstack pike 安装 目录汇总 http://www.cnblogs.com/elvi/p/7613861.html ##.Centos7环境准备 #Ce ...