Spark SQL Catalyst Source Code Analysis, Part 3: The Analyzer
/** Spark SQL source code analysis series */
The previous articles covered Spark SQL's core execution flow and how the Sql Parser in the Catalyst framework accepts a user's SQL, parses it, and produces an Unresolved Logical Plan. Recall that another core component of the execution flow is the Analyzer; this article looks at the role the Analyzer plays in Spark SQL.
The Analyzer lives in Catalyst's analysis package. Its main responsibility is to resolve the Logical Plan that the Sql Parser could not.
1. Analyzer Construction
The Analyzer uses a Catalog and a FunctionRegistry to turn UnresolvedAttribute and UnresolvedRelation nodes into fully typed objects inside Catalyst.
The Analyzer holds a fixedPoint object and a Seq[Batch]:
```scala
class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Boolean)
  extends RuleExecutor[LogicalPlan] with HiveTypeCoercion {

  // TODO: pass this in as a parameter.
  val fixedPoint = FixedPoint(100)

  val batches: Seq[Batch] = Seq(
    Batch("MultiInstanceRelations", Once,
      NewRelationInstances),
    Batch("CaseInsensitiveAttributeReferences", Once,
      (if (caseSensitive) Nil else LowercaseAttributeReferences :: Nil) : _*),
    Batch("Resolution", fixedPoint,
      ResolveReferences ::
      ResolveRelations ::
      NewRelationInstances ::
      ImplicitGenerate ::
      StarExpansion ::
      ResolveFunctions ::
      GlobalAggregates ::
      typeCoercionRules :_*),
    Batch("AnalysisOperators", fixedPoint,
      EliminateAnalysisOperators)
  )
```
 
A few of the objects used by the Analyzer, explained:
FixedPoint: effectively an upper bound on the number of iterations.
```scala
/** A strategy that runs until fix point or maxIterations times, whichever comes first. */
case class FixedPoint(maxIterations: Int) extends Strategy
```
 
Batch: a batch of rules. A Batch is a named sequence of Rules run under a given Strategy (a Strategy is essentially a name for how many times to iterate, e.g. Once).
```scala
/** A batch of rules. */
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
```
 
Rule: a rewrite rule. Rules are applied to the Logical Plan to turn UnResolved nodes into Resolved ones.
```scala
abstract class Rule[TreeType <: TreeNode[_]] extends Logging {

  /** Name for this rule, automatically inferred based on class name. */
  val ruleName: String = {
    val className = getClass.getName
    if (className endsWith "$") className.dropRight(1) else className
  }

  def apply(plan: TreeType): TreeType
}
```
 
Strategy: the maximum number of executions. If execution reaches a fix point before maxIterations, it stops and the rules are not applied again.
```scala
/**
 * An execution strategy for rules that indicates the maximum number of executions. If the
 * execution reaches fix point (i.e. converge) before maxIterations, it will stop.
 */
abstract class Strategy { def maxIterations: Int }
```
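For reference, the Once strategy used by the first two batches is itself just a Strategy whose maxIterations is fixed at 1; in the Spark source it sits alongside FixedPoint in RuleExecutor:

```scala
/** A strategy that only runs once. */
case object Once extends Strategy { val maxIterations = 1 }
```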
 
The Analyzer resolves the unresolved logical plan by applying the Rules defined in these Batches, each under its Batch's strategy.
The Analyzer class itself does not define an execution method; that comes from its parent class, RuleExecutor[LogicalPlan]. The Analyzer also mixes in HiveTypeCoercion, which models Hive's implicit type-compatibility conversions.
RuleExecutor: the execution environment for Rules. It runs the Batches, each containing a series of Rules, one after another, serially.
The concrete execution logic is defined in apply:
It is a while loop: within each batch, every rule is applied to the current plan in turn, iterating until the plan reaches a fix point or the batch's maximum number of iterations is hit.
```scala
def apply(plan: TreeType): TreeType = {
  var curPlan = plan

  batches.foreach { batch =>
    val batchStartPlan = curPlan
    var iteration = 1
    var lastPlan = curPlan
    var continue = true

    // Run until fix point (or the max number of iterations as specified in the strategy).
    while (continue) {
      curPlan = batch.rules.foldLeft(curPlan) {
        case (plan, rule) =>
          val result = rule(plan) // invoke each Rule's apply method to resolve UnresolvedRelations, Attributes and Functions
          if (!result.fastEquals(plan)) {
            logger.trace(
              s"""
                |=== Applying Rule ${rule.ruleName} ===
                |${sideBySide(plan.treeString, result.treeString).mkString("\n")}
              """.stripMargin)
          }
          result // return the plan after the rule has been applied
      }
      iteration += 1
      if (iteration > batch.strategy.maxIterations) { // stop once we exceed the strategy's maximum number of iterations
        logger.info(s"Max iterations ($iteration) reached for batch ${batch.name}")
        continue = false
      }

      if (curPlan.fastEquals(lastPlan)) { // stop once the plan no longer changes between iterations (fix point)
        logger.trace(s"Fixed point reached for batch ${batch.name} after $iteration iterations.")
        continue = false
      }
      lastPlan = curPlan
    }

    if (!batchStartPlan.fastEquals(curPlan)) {
      logger.debug(
        s"""
          |=== Result of Batch ${batch.name} ===
          |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")}
        """.stripMargin)
    } else {
      logger.trace(s"Batch ${batch.name} has no effect.")
    }
  }

  curPlan // return the resolved Logical Plan
}
```
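To make the control flow concrete, here is a minimal, self-contained sketch (not Spark code) of the same apply-rules-until-fix-point pattern on a toy expression tree:

```scala
sealed trait Expr
case class Num(n: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object ConstantFold {
  // One "rule": collapse Add(Num, Num) into Num, applied bottom-up.
  def rule(e: Expr): Expr = e match {
    case Add(l, r) =>
      (rule(l), rule(r)) match {
        case (Num(a), Num(b)) => Num(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case other => other
  }

  // Keep applying the rule until the tree stops changing (fix point)
  // or we hit maxIterations, mirroring RuleExecutor.apply.
  def run(e: Expr, maxIterations: Int = 100): Expr = {
    var cur = e
    var continue = true
    var iteration = 1
    while (continue) {
      val next = rule(cur)
      if (next == cur || iteration >= maxIterations) continue = false
      cur = next
      iteration += 1
    }
    cur
  }
}

// ConstantFold.run(Add(Num(1), Add(Num(2), Num(3)))) == Num(6)
```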
 
2. The Rules
2.1 MultiInstanceRelation
If the same relation instance appears more than once in a plan (as in a self-join), each occurrence needs its own expression ids, otherwise attribute references become ambiguous. The NewRelationInstances rule enforces this:
- Batch("MultiInstanceRelations", Once,
 - NewRelationInstances)
 
```scala
trait MultiInstanceRelation {
  def newInstance: this.type
}
```
 
```scala
object NewRelationInstances extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    // Apply a partial function over the plan to collect every MultiInstanceRelation
    val localRelations = plan collect { case l: MultiInstanceRelation => l }
    val multiAppearance = localRelations
      .groupBy(identity[MultiInstanceRelation]) // group identical relations together
      .filter { case (_, ls) => ls.size > 1 }   // keep only relations that appear more than once
      .map(_._1)
      .toSet

    // Rewrite the plan so that each instance gets a unique expression id
    plan transform {
      case l: MultiInstanceRelation if multiAppearance contains l => l.newInstance
    }
  }
}
```
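The groupBy(identity)/filter step is just the standard Scala idiom for finding duplicates; the same pattern on plain values, for intuition:

```scala
val xs = Seq("a", "b", "a", "c", "a", "b")
val multi = xs.groupBy(identity)             // Map("a" -> Seq("a","a","a"), ...)
  .filter { case (_, vs) => vs.size > 1 }    // keep keys that occur more than once
  .keySet
// multi == Set("a", "b")
```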
 
2.2 LowercaseAttributeReferences
When the analysis is case-insensitive, this rule normalizes every relation alias, subquery alias, attribute name, alias, and field name to lower case:
```scala
object LowercaseAttributeReferences extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UnresolvedRelation(databaseName, name, alias) =>
      UnresolvedRelation(databaseName, name, alias.map(_.toLowerCase))
    case Subquery(alias, child) => Subquery(alias.toLowerCase, child)
    case q: LogicalPlan => q transformExpressions {
      case s: Star => s.copy(table = s.table.map(_.toLowerCase))
      case UnresolvedAttribute(name) => UnresolvedAttribute(name.toLowerCase)
      case Alias(c, name) => Alias(c, name.toLowerCase)()
      case GetField(c, name) => GetField(c, name.toLowerCase)
    }
  }
}
```
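A stripped-down sketch of the same normalization idea, independent of Catalyst (ColRef and normalize are hypothetical names):

```scala
case class ColRef(table: Option[String], name: String)

// Lower-case names once, up front, so later lookups can compare directly.
def normalize(c: ColRef): ColRef =
  ColRef(c.table.map(_.toLowerCase), c.name.toLowerCase)

// normalize(ColRef(Some("T1"), "Mobile")) == ColRef(Some("t1"), "mobile")
```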
 
2.3 ResolveReferences
ResolveReferences walks the plan bottom-up and, once an operator's children are resolved, replaces each UnresolvedAttribute with a NamedExpression found by name in the operator's input:
```scala
object ResolveReferences extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case q: LogicalPlan if q.childrenResolved =>
      logger.trace(s"Attempting to resolve ${q.simpleString}")
      q transformExpressions {
        case u @ UnresolvedAttribute(name) =>
          // Leave unchanged if resolution fails. Hopefully will be resolved next round.
          val result = q.resolve(name).getOrElse(u) // resolve the name into a NamedExpression
          logger.debug(s"Resolving $u to $result")
          result
      }
  }
}
```
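What q.resolve(name) amounts to is a name lookup against the operator's input attributes; a hypothetical minimal version (Attribute and resolve here are illustrative, not Catalyst's actual types):

```scala
case class Attribute(name: String, exprId: Int)

// Find the input attribute whose name matches, if any.
def resolve(name: String, input: Seq[Attribute]): Option[Attribute] =
  input.find(_.name == name)

// resolve("mobile", Seq(Attribute("sid", 1), Attribute("mobile", 2)))
//   == Some(Attribute("mobile", 2))
```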
 
2.4 ResolveRelations
ResolveRelations replaces each UnresolvedRelation with the real relation looked up in the Catalog:
```scala
object ResolveRelations extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UnresolvedRelation(databaseName, name, alias) =>
      catalog.lookupRelation(databaseName, name, alias)
  }
}
```
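The Catalog is essentially a table-name-to-plan registry; a minimal hypothetical sketch of lookupRelation (generic in the plan type to stay self-contained):

```scala
class SimpleCatalog[Plan] {
  private val tables = scala.collection.mutable.Map.empty[String, Plan]

  def registerTable(name: String, plan: Plan): Unit =
    tables(name.toLowerCase) = plan

  // Look a relation up by name; fails if it was never registered.
  def lookupRelation(databaseName: Option[String], name: String): Plan =
    tables.getOrElse(name.toLowerCase,
      sys.error(s"Table not found: ${databaseName.map(_ + ".").getOrElse("")}$name"))
}
```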
 
2.5 ImplicitGenerate
If a projection list consists of a single aliased Generator expression (e.g. a table-generating function), the Project is rewritten into a Generate operator:
```scala
object ImplicitGenerate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(Seq(Alias(g: Generator, _)), child) =>
      Generate(g, join = false, outer = false, None, child)
  }
}
```
 
2.6 StarExpansion
Once the children are resolved, any Star (*) in a projection list, script transformation input, or aggregate expression list is expanded into the child's full output:
```scala
object StarExpansion extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Wait until children are resolved
    case p: LogicalPlan if !p.childrenResolved => p
    // If the projection list contains Stars, expand it.
    case p @ Project(projectList, child) if containsStar(projectList) =>
      Project(
        projectList.flatMap {
          case s: Star => s.expand(child.output) // expand(input: Seq[Attribute]) turns the star into a Seq[NamedExpression] over the input attributes
          case o => o :: Nil
        },
        child)
    case t: ScriptTransformation if containsStar(t.input) =>
      t.copy(
        input = t.input.flatMap {
          case s: Star => s.expand(t.child.output)
          case o => o :: Nil
        }
      )
    // If the aggregate function argument contains Stars, expand it.
    case a: Aggregate if containsStar(a.aggregateExpressions) =>
      a.copy(
        aggregateExpressions = a.aggregateExpressions.flatMap {
          case s: Star => s.expand(a.child.output)
          case o => o :: Nil
        }
      )
  }

  /**
   * Returns true if `exprs` contains a [[Star]].
   */
  protected def containsStar(exprs: Seq[Expression]): Boolean =
    exprs.collect { case _: Star => true }.nonEmpty
}
```
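The expansion itself is a plain flatMap: each Star turns into the whole child output, while everything else passes through unchanged. On toy values (Item, Star, and Col are illustrative types):

```scala
sealed trait Item
case object Star extends Item
case class Col(name: String) extends Item

val childOutput = Seq("a", "b", "c")
val projectList: Seq[Item] = Seq(Star, Col("d"))

val expanded = projectList.flatMap {
  case Star  => childOutput.map(Col(_)) // '*' becomes every output column
  case other => other :: Nil            // ordinary expressions pass through
}
// expanded == Seq(Col("a"), Col("b"), Col("c"), Col("d"))
```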
 
2.7 ResolveFunctions
Once a function's children are resolved, UnresolvedFunction is replaced by looking the function name up in the FunctionRegistry (i.e. checking whether the UDF is registered):
```scala
object ResolveFunctions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan =>
      q transformExpressions {
        case u @ UnresolvedFunction(name, children) if u.childrenResolved =>
          registry.lookupFunction(name, children) // check whether the current UDF is registered
      }
  }
}
```
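A FunctionRegistry pairs a function name with a builder that constructs the expression from its already-resolved children; a hypothetical minimal version for intuition:

```scala
class SimpleFunctionRegistry[Expr] {
  private val fns =
    scala.collection.mutable.Map.empty[String, Seq[Expr] => Expr]

  def registerFunction(name: String, builder: Seq[Expr] => Expr): Unit =
    fns(name.toLowerCase) = builder

  // Build the function expression from its children; fails if unregistered.
  def lookupFunction(name: String, children: Seq[Expr]): Expr =
    fns.getOrElse(name.toLowerCase,
      sys.error(s"Undefined function: $name"))(children)
}
```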
 
2.8 GlobalAggregates
If a projection list contains aggregate expressions but the plan has no explicit grouping (e.g. SELECT COUNT(1) FROM t), the Project is turned into an Aggregate with an empty grouping list, i.e. one global group:
```scala
object GlobalAggregates extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, child) if containsAggregates(projectList) =>
      Aggregate(Nil, projectList, child)
  }

  def containsAggregates(exprs: Seq[Expression]): Boolean = {
    exprs.foreach(_.foreach {
      case agg: AggregateExpression => return true
      case _ =>
    })
    false
  }
}
```
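Since expressions are TreeNodes, the same check can be written without the early return, using the collect method seen earlier (a sketch, not the actual Spark code):

```scala
def containsAggregates(exprs: Seq[Expression]): Boolean =
  exprs.exists(_.collect { case a: AggregateExpression => a }.nonEmpty)
```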
 
2.9 typeCoercionRules
These rules come from the HiveTypeCoercion trait and make mismatched types work together the way Hive does: propagating types, widening them, promoting strings, and inserting casts where needed:
```scala
val typeCoercionRules =
  PropagateTypes ::
  ConvertNaNs ::
  WidenTypes ::
  PromoteStrings ::
  BooleanComparisons ::
  BooleanCasts ::
  StringToIntegralCasts ::
  FunctionArgumentConversion ::
  CastNulls ::
  Nil
```
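The core idea behind WidenTypes is picking a common type both sides can be cast to, which is how the CAST(pfrom_id#5, DoubleType) below shows up in the analyzed plan; a toy sketch with hypothetical type names:

```scala
sealed trait DataType
case object IntType extends DataType
case object DoubleType extends DataType

// Choose a type both operands can safely be cast to, if one exists.
def widerType(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (x, y) if x == y                              => Some(x)
  case (IntType, DoubleType) | (DoubleType, IntType) => Some(DoubleType)
  case _                                             => None
}
```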
 
2.10 EliminateAnalysisOperators
Once resolution is done, analysis-only operators are removed. When the rule meets a Subquery or LowerCaseSchema node it returns the node's child instead of the node itself, which deletes the wrapper from the plan:
```scala
object EliminateAnalysisOperators extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Subquery(_, child) => child // return the child, not the Subquery itself, i.e. delete the node
    case LowerCaseSchema(child) => child
  }
}
```
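"Return the child instead of the node" is the standard way to delete a wrapper from a tree; on a toy tree (Node, Wrap, and Leaf are illustrative):

```scala
sealed trait Node
case class Leaf(v: Int) extends Node
case class Wrap(child: Node) extends Node

// Returning the child wherever a Wrap appears removes the wrapper node.
def eliminate(n: Node): Node = n match {
  case Wrap(child) => eliminate(child)
  case other       => other
}
// eliminate(Wrap(Wrap(Leaf(1)))) == Leaf(1)
```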
 
3. In Practice
With DEBUG logging enabled, running a query shows the batches at work:
```
val exec = sqlContext.sql("select mobile as mb, sid as id, mobile*2 multi2mobile, count(1) times from (select * from temp_shengli_mobile)a where pfrom_id=0.0 group by mobile, sid, mobile*2")
14/07/21 18:23:32 DEBUG SparkILoop$SparkILoopInterpreter: Invoking: public static java.lang.String $line47.$eval.$print()
14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/21 18:23:33 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'pfrom_id to pfrom_id#5
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'sid to sid#1
14/07/21 18:23:33 DEBUG Analyzer$ResolveReferences$: Resolving 'mobile to mobile#2
14/07/21 18:23:33 DEBUG Analyzer:
=== Result of Batch Resolution ===
!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L] Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]
! Filter ('pfrom_id = 0.0) Filter (CAST(pfrom_id#5, DoubleType) = 0.0)
Subquery a Subquery a
! Project [*] Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]
! UnresolvedRelation None, temp_shengli_mobile, None Subquery temp_shengli_mobile
! SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)
14/07/21 18:23:33 DEBUG Analyzer:
=== Result of Batch AnalysisOperators ===
!Aggregate ['mobile,'sid,('mobile * 2) AS c2#27], ['mobile AS mb#23,'sid AS id#24,('mobile * 2) AS multi2mobile#25,COUNT(1) AS times#26L] Aggregate [mobile#2,sid#1,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS c2#27], [mobile#2 AS mb#23,sid#1 AS id#24,(CAST(mobile#2, DoubleType) * CAST(2, DoubleType)) AS multi2mobile#25,COUNT(1) AS times#26L]
! Filter ('pfrom_id = 0.0) Filter (CAST(pfrom_id#5, DoubleType) = 0.0)
! Subquery a Project [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12]
! Project [*] SparkLogicalPlan (ExistingRdd [data_date#0,sid#1,mobile#2,pverify_type#3,create_time#4,pfrom_id#5,p_status#6,pvalidate_time#7,feffect_time#8,plastupdate_ip#9,update_time#10,status#11,preserve_int#12], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174)
! UnresolvedRelation None, temp_shengli_mobile, None
```
 
4. Summary
The Analyzer is, at heart, a RuleExecutor: it takes the Unresolved Logical Plan produced by the Sql Parser and repeatedly applies batches of rules (relation and reference resolution, star expansion, function lookup, type coercion, and cleanup of analysis-only operators) until each batch reaches a fix point or its iteration limit, yielding a fully Resolved Logical Plan.
This is an original article. If you repost it, please credit:
Reposted from OopsOutOfMemory (Sheng Li)'s blog, author: OopsOutOfMemory
Original link: http://blog.csdn.net/oopsoom/article/details/38025185
Note: this article is published under the Attribution-NonCommercial-NoDerivs 2.5 China Mainland (CC BY-NC-ND 2.5 CN) license. Reposting, sharing, and commenting are welcome, but please keep the author attribution and the article link. For commercial use or licensing matters, please contact me.
