Spark修炼之道 (Advanced Series), Spark Source Code Reading: Section 12, Analysis of the Spark SQL Processing Flow

Author: 周志湖

The following code shows an example of defining a table schema with a case class:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

(1) The sql method returns a DataFrame

  def sql(sqlText: String): DataFrame = {
    DataFrame(this, parseSql(sqlText))
  }

Here, the parseSql(sqlText) method produces the corresponding LogicalPlan. Its source code is as follows:

// Generate a LogicalPlan from the SQL statement that was passed in
protected[sql] def parseSql(sql: String): LogicalPlan = ddlParser.parse(sql, false)

The ddlParser object is defined as follows:

protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))
protected[sql] val ddlParser = new DDLParser(sqlParser.parse(_))
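
The constructor arguments above suggest a delegation chain: each parser is handed a function to which it can pass statements it does not handle itself. The following is a minimal, Spark-free sketch of that idea. Every name in it (ParserChainSketch, ToyDdlParser, DdlPlan, QueryPlan) is hypothetical, and the "recognize a DDL prefix" rule is an assumption made for illustration, not the logic of the real DDLParser.

```scala
// Toy model of a parser chain. All names are hypothetical, not Spark's.
object ParserChainSketch {
  sealed trait Plan
  case class DdlPlan(text: String) extends Plan
  case class QueryPlan(text: String) extends Plan

  // Stands in for the query parser (the role sqlParser plays in the source above).
  def parseQuery(sql: String): Plan = QueryPlan(sql.trim)

  // Stands in for the DDL parser: it keeps the statements it recognizes and
  // hands everything else to the fallback function it was constructed with.
  class ToyDdlParser(fallback: String => Plan) {
    def parse(sql: String): Plan =
      if (sql.trim.toUpperCase.startsWith("CREATE TEMPORARY TABLE")) DdlPlan(sql.trim)
      else fallback(sql)
  }

  // Mirrors the shape of `new DDLParser(sqlParser.parse(_))`.
  val ddlParser = new ToyDdlParser(parseQuery(_))
}
```

With this sketch, ParserChainSketch.ddlParser.parse("SELECT name FROM people") ends up in parseQuery, while a statement beginning with CREATE TEMPORARY TABLE is kept by the outer parser.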

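(2) Next, the apply method of the DataFrame companion object is called

private[sql] object DataFrame {
  def apply(sqlContext: SQLContext, logicalPlan: LogicalPlan): DataFrame = {
    new DataFrame(sqlContext, logicalPlan)
  }
}

As you can see, apply takes two parameters, a SQLContext and a LogicalPlan, and calls the DataFrame constructor. The source is as follows:

// Auxiliary DataFrame constructor. It automatically analyzes the LogicalPlan
// and passes the resulting QueryExecution object to the primary constructor.
def this(sqlContext: SQLContext, logicalPlan: LogicalPlan) = {
  this(sqlContext, {
    val qe = sqlContext.executePlan(logicalPlan)
    // If eager analysis is enabled, analyze the plan right away so that
    // any analysis error is thrown here rather than at execution time
    if (sqlContext.conf.dataFrameEagerAnalysis) {
      qe.assertAnalyzed()  // This should force analysis and throw errors if there are any
    }
    qe
  })
}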
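
To illustrate what the eager-analysis flag changes, here is a small Spark-free sketch. It assumes a toy "analysis" that only checks whether a table name is known. ToyExecution, build, and the knownTables parameter are hypothetical names introduced for this example.

```scala
object EagerAnalysisSketch {
  // Hypothetical stand-in for QueryExecution: analysis is a lazy val, so
  // nothing is validated until the value is first requested.
  class ToyExecution(knownTables: Set[String], table: String) {
    lazy val analyzed: String =
      if (knownTables.contains(table)) s"Relation($table)"
      else throw new IllegalArgumentException(s"Table not found: $table")

    // Forces the lazy val, surfacing any analysis error immediately.
    def assertAnalyzed(): Unit = require(analyzed.nonEmpty)
  }

  // Hypothetical stand-in for the auxiliary DataFrame constructor.
  def build(knownTables: Set[String], table: String, eager: Boolean): ToyExecution = {
    val qe = new ToyExecution(knownTables, table)
    if (eager) qe.assertAnalyzed() // fail at construction time
    qe
  }
}
```

With eager = true, a reference to an unknown table fails inside build. With eager = false, build succeeds and the same error appears only when analyzed is first accessed.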

(3) val qe = sqlContext.executePlan(logicalPlan) returns a QueryExecution. The source of sqlContext.executePlan is as follows:

protected[sql] def executePlan(plan: LogicalPlan) =
  new sparkexecution.QueryExecution(this, plan)

The QueryExecution class expresses the main workflow Spark follows to execute SQL, as shown below:

class QueryExecution(val sqlContext: SQLContext, val logical: LogicalPlan) {

  @VisibleForTesting
  def assertAnalyzed(): Unit = sqlContext.analyzer.checkAnalysis(analyzed)

  lazy val analyzed: LogicalPlan = sqlContext.analyzer.execute(logical)

  lazy val withCachedData: LogicalPlan = {
    assertAnalyzed()
    sqlContext.cacheManager.useCachedData(analyzed)
  }

  lazy val optimizedPlan: LogicalPlan = sqlContext.optimizer.execute(withCachedData)

  // TODO: Don't just pick the first one...
  lazy val sparkPlan: SparkPlan = {
    SparkPlan.currentContext.set(sqlContext)
    sqlContext.planner.plan(optimizedPlan).next()
  }

  // executedPlan should not be used to initialize any SparkPlan. It should be
  // only used for execution.
  lazy val executedPlan: SparkPlan = sqlContext.prepareForExecution.execute(sparkPlan)

  /** Internal version of the RDD. Avoids copies and has no schema */
  // Accessing toRdd executes the plan and returns the result as an RDD
  lazy val toRdd: RDD[InternalRow] = executedPlan.execute()

  protected def stringOrError[A](f: => A): String =
    try f.toString catch { case e: Throwable => e.toString }

  def simpleString: String = {
    s"""== Physical Plan ==
       |${stringOrError(executedPlan)}
      """.stripMargin.trim
  }

  override def toString: String = {
    def output =
      analyzed.output.map(o => s"${o.name}: ${o.dataType.simpleString}").mkString(", ")

    s"""== Parsed Logical Plan ==
       |${stringOrError(logical)}
       |== Analyzed Logical Plan ==
       |${stringOrError(output)}
       |${stringOrError(analyzed)}
       |== Optimized Logical Plan ==
       |${stringOrError(optimizedPlan)}
       |== Physical Plan ==
       |${stringOrError(executedPlan)}
       |Code Generation: ${stringOrError(executedPlan.codegenEnabled)}
    """.stripMargin.trim
  }
}

As you can see, SQL execution passes through the following stages:

1. Parsed Logical Plan: the LogicalPlan produced by parseSql

2. Analyzed Logical Plan: lazy val analyzed: LogicalPlan = sqlContext.analyzer.execute(logical)

3. Optimized Logical Plan: lazy val optimizedPlan: LogicalPlan = sqlContext.optimizer.execute(withCachedData)

4. Physical Plan: lazy val executedPlan: SparkPlan = sqlContext.prepareForExecution.execute(sparkPlan)
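
Because every stage in QueryExecution is a lazy val defined in terms of the previous one, requesting the last stage pulls the whole chain, in order, exactly once. The sketch below models that behaviour with plain strings instead of plans. ToyQueryExecution, stage, and log are hypothetical names, and the withCachedData step is left out for brevity.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyPipelineSketch {
  // Hypothetical stand-in for QueryExecution, with plans modelled as strings.
  class ToyQueryExecution(val logical: String) {
    // Records the order in which stages are actually computed.
    val log = ArrayBuffer.empty[String]

    private def stage(name: String, input: String): String = {
      log += name
      s"$name($input)"
    }

    // Each stage depends on the previous one and is computed on first access only.
    lazy val analyzed: String      = stage("analyzed", logical)
    lazy val optimizedPlan: String = stage("optimized", analyzed)
    lazy val sparkPlan: String     = stage("planned", optimizedPlan)
    lazy val executedPlan: String  = stage("prepared", sparkPlan)
  }
}
```

Constructing a ToyQueryExecution computes nothing. The first access to executedPlan runs the four stages from analyzed onward, and later accesses reuse the cached values.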

You can inspect these stages through results.queryExecution. Here results is not defined in this section; judging from the plans printed below, it is the DataFrame for a query that selects name from the people table.

scala> results.queryExecution
res1: org.apache.spark.sql.SQLContext#QueryExecution =
== Parsed Logical Plan ==
'Project [unresolvedalias('name)]
 'UnresolvedRelation [people], None
== Analyzed Logical Plan ==
name: string
Project [name#0]
 Subquery people
  LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at createDataFrame at <console>:47
== Optimized Logical Plan ==
Project [name#0]
 LogicalRDD [name#0,age#1], MapPartitionsRDD[4] at createDataFrame at <console>:47
== Physical Plan ==
TungstenProject [name#0]
 Scan PhysicalRDD[name#0,age#1]
Code Generation: true

(4) The primary constructor of DataFrame is then called to finish constructing the DataFrame

class DataFrame private[sql](
    @transient val sqlContext: SQLContext,
    @DeveloperApi @transient val queryExecution: QueryExecution) extends Serializable

(5) When a method such as collect is called on the DataFrame, execution of executedPlan is triggered

  def collect(): Array[Row] = withNewExecutionId {
    queryExecution.executedPlan.executeCollect()
  }

For example:

scala> results.collect
res6: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])

[Figure: overall Spark SQL processing flow]
