Spark 2.1.1

1. Startup command

The Spark Thrift server is started with:

$SPARK_HOME/sbin/start-thriftserver.sh

The script then runs:

org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

2. Startup process and code analysis

For the Hive Thrift code, see: https://www.cnblogs.com/barneywill/p/10185168.html

HiveThriftServer2 is the core class of Spark Thrift. It extends Hive's HiveServer2:

org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 extends org.apache.hive.service.server.HiveServer2

Startup sequence:

HiveThriftServer2.main

SparkSQLEnv.init (creates sparkConf, sparkSession, sparkContext, sqlContext)

HiveThriftServer2.init

addService(ThriftBinaryCLIService)

HiveThriftServer2.start

ThriftBinaryCLIService.run

TServer.serve

Class structure (interface or parent class -> subclass):

TServer->TThreadPoolServer

TProcessorFactory->SQLPlainProcessorFactory

TProcessor->TSetIpAddressProcessor

ThriftCLIService->ThriftBinaryCLIService

CLIService->SparkSQLCLIService (the core subclass)

Service initialization sequence:

CLIService.init

SparkSQLCLIService.init

addService(SparkSQLSessionManager)

initCompositeService

SparkSQLSessionManager.init

addService(SparkSQLOperationManager)

initCompositeService

SparkSQLOperationManager.init
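
The initialization chain above follows a composite-service pattern: each service registers its children with addService and then initializes them in registration order. The following is a minimal plain-Scala model of that pattern, not the Hive or Spark code. The types Service, LeafService and CompositeService, the registerChildren hook, and the log buffer are all hypothetical names introduced for illustration; only the three service names in the strings come from the sequence above.

```scala
import scala.collection.mutable.ArrayBuffer

object CompositeServiceSketch {
  // Hypothetical minimal service interface; the real one also takes a HiveConf.
  trait Service {
    def name: String
    def init(log: ArrayBuffer[String]): Unit
  }

  // A service without children: it only records that it was initialized.
  class LeafService(val name: String) extends Service {
    def init(log: ArrayBuffer[String]): Unit = log += s"$name.init"
  }

  // A composite records its own init, registers children, then initializes them in order.
  abstract class CompositeService(val name: String) extends Service {
    private val children = ArrayBuffer.empty[Service]
    protected def addService(child: Service): Unit = children += child
    // Hypothetical hook standing in for the addService calls made inside init.
    protected def registerChildren(): Unit
    def init(log: ArrayBuffer[String]): Unit = {
      log += s"$name.init"
      registerChildren()
      children.foreach(_.init(log))
    }
  }

  class OperationManager extends LeafService("SparkSQLOperationManager")

  class SessionManager extends CompositeService("SparkSQLSessionManager") {
    protected def registerChildren(): Unit = addService(new OperationManager)
  }

  class CliService extends CompositeService("SparkSQLCLIService") {
    protected def registerChildren(): Unit = addService(new SessionManager)
  }

  // Initializes the top-level service once and returns the order of init calls.
  def initOrder(): List[String] = {
    val log = ArrayBuffer.empty[String]
    new CliService().init(log)
    log.toList
  }
}
```

Calling CompositeServiceSketch.initOrder() walks the tree top-down, which is why replacing one child class is enough to change everything beneath it.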

3. DDL execution

DDL execution has to interact with the Hive metastore.

Start from the execution plan:

spark-sql> explain create table test_table(id string);
== Physical Plan ==
ExecutedCommand
+- CreateTableCommand CatalogTable(
Table: `test_table`
Created: Wed Dec 19 18:04:15 CST 2018
Last Access: Thu Jan 01 07:59:59 CST 1970
Type: MANAGED
Schema: [StructField(id,StringType,true)]
Provider: hive
Storage(InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
Time taken: 0.28 seconds, Fetched 1 row(s)

The execution plan names the concrete command, here CreateTableCommand.

org.apache.spark.sql.execution.command.CreateTableCommand (defined in tables.scala)

case class CreateTableCommand(table: CatalogTable, ifNotExists: Boolean) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    sparkSession.sessionState.catalog.createTable(table, ifNotExists)
    Seq.empty[Row]
  }
}

The command simply passes the request on to sparkSession.sessionState.catalog.

org.apache.spark.sql.internal.SessionState

  /**
   * Internal catalog for managing table and database states.
   */
  lazy val catalog = new SessionCatalog(
    sparkSession.sharedState.externalCatalog,
    sparkSession.sharedState.globalTempViewManager,
    functionResourceLoader,
    functionRegistry,
    conf,
    newHadoopConf())

The session catalog is built on sparkSession.sharedState.externalCatalog.

org.apache.spark.sql.internal.SharedState

  /**
   * A catalog that interacts with external systems.
   */
  val externalCatalog: ExternalCatalog =
    SharedState.reflect[ExternalCatalog, SparkConf, Configuration](
      SharedState.externalCatalogClassName(sparkContext.conf),
      sparkContext.conf,
      sparkContext.hadoopConfiguration)

...

  private val HIVE_EXTERNAL_CATALOG_CLASS_NAME = "org.apache.spark.sql.hive.HiveExternalCatalog"

  private def externalCatalogClassName(conf: SparkConf): String = {
    conf.get(CATALOG_IMPLEMENTATION) match {
      case "hive" => HIVE_EXTERNAL_CATALOG_CLASS_NAME
      case "in-memory" => classOf[InMemoryCatalog].getCanonicalName
    }
  }

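The external catalog is instantiated by reflection from the class name returned by externalCatalogClassName. When the catalog implementation is hive, that name is hard-coded as org.apache.spark.sql.hive.HiveExternalCatalog.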
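
The selection step can be sketched in plain Scala as follows. This is a simplified model, not the SharedState code: Catalog, HiveLikeCatalog and InMemoryLikeCatalog are hypothetical stand-ins, the configuration is a plain Map rather than a SparkConf, and the fallback to "in-memory" when the key is absent is an assumption of this sketch. The key spark.sql.catalogImplementation is the setting behind CATALOG_IMPLEMENTATION.

```scala
object CatalogReflectionSketch {
  // Hypothetical stand-ins for ExternalCatalog and its two implementations.
  trait Catalog { def kind: String }
  class HiveLikeCatalog(conf: Map[String, String]) extends Catalog { val kind = "hive" }
  class InMemoryLikeCatalog(conf: Map[String, String]) extends Catalog { val kind = "in-memory" }

  // Map the configured implementation to a class name.
  // Like the original match, any other value fails with a MatchError.
  def catalogClassName(conf: Map[String, String]): String =
    conf.getOrElse("spark.sql.catalogImplementation", "in-memory") match {
      case "hive"      => classOf[HiveLikeCatalog].getName
      case "in-memory" => classOf[InMemoryLikeCatalog].getName
    }

  // Load the class by name and call its single-argument constructor reflectively.
  def reflect(conf: Map[String, String]): Catalog =
    Class.forName(catalogClassName(conf))
      .getConstructor(classOf[Map[String, String]])
      .newInstance(conf)
      .asInstanceOf[Catalog]
}
```

Resolving the class by name keeps the caller free of a compile-time dependency on the Hive classes, which is presumably why the real code holds the Hive class name as a string.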

org.apache.spark.sql.hive.HiveExternalCatalog

  /**
   * A Hive client used to interact with the metastore.
   */
  val client: HiveClient = {
    HiveUtils.newClientForMetadata(conf, hadoopConf)
  }

  private def withClient[T](body: => T): T = synchronized {
    try {
      body
    } catch {
      case NonFatal(exception) if isClientException(exception) =>
        val e = exception match {
          // Since we are using shim, the exceptions thrown by the underlying method of
          // Method.invoke() are wrapped by InvocationTargetException
          case i: InvocationTargetException => i.getCause
          case o => o
        }
        throw new AnalysisException(
          e.getClass.getCanonicalName + ": " + e.getMessage, cause = Some(e))
    }
  }

  override def createDatabase(
      dbDefinition: CatalogDatabase,
      ignoreIfExists: Boolean): Unit = withClient {
    client.createDatabase(dbDefinition, ignoreIfExists)
  }

Every DDL method in this class runs inside withClient, and withClient is synchronized. The call itself is passed straight on to client, so the next question is what client is.
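
The withClient wrapper combines two things: a lock that serializes calls, and translation of low-level failures into one exception type. A minimal standalone sketch of that shape is shown here. CatalogException is a hypothetical stand-in for AnalysisException, and unlike the original this sketch translates every non-fatal exception instead of filtering with isClientException.

```scala
import java.lang.reflect.InvocationTargetException
import scala.util.control.NonFatal

object WithClientSketch {
  // Hypothetical stand-in for AnalysisException.
  final class CatalogException(message: String, cause: Throwable)
    extends RuntimeException(message, cause)

  // Runs body while holding this object's lock, so concurrent callers execute one at a time.
  // Non-fatal failures are rethrown as CatalogException (simplification: no isClientException filter).
  def withClient[T](body: => T): T = this.synchronized {
    try body
    catch {
      case NonFatal(exception) =>
        val e = exception match {
          // Reflective calls wrap the real failure; unwrap it when a cause is present.
          case i: InvocationTargetException if i.getCause != null => i.getCause
          case other => other
        }
        throw new CatalogException(e.getClass.getCanonicalName + ": " + e.getMessage, e)
    }
  }
}
```

Because the lock sits on the single shared catalog object, DDL statements from different sessions cannot overlap inside it.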

org.apache.spark.sql.hive.client.IsolatedClientLoader

  /** The isolated client interface to Hive. */
  private[hive] def createClient(): HiveClient = {
    if (!isolationOn) {
      return new HiveClientImpl(version, sparkConf, hadoopConf, config, baseClassLoader, this)
    }
    // Pre-reflective instantiation setup.
    logDebug("Initializing the logger to avoid disaster...")
    val origLoader = Thread.currentThread().getContextClassLoader
    Thread.currentThread.setContextClassLoader(classLoader)

    try {
      classLoader
        .loadClass(classOf[HiveClientImpl].getName)
        .getConstructors.head
        .newInstance(version, sparkConf, hadoopConf, config, classLoader, this)
        .asInstanceOf[HiveClient]
    } catch {

...

So client is an org.apache.spark.sql.hive.client.HiveClientImpl.

org.apache.spark.sql.hive.client.HiveClientImpl

  def withHiveState[A](f: => A): A = retryLocked {
    val original = Thread.currentThread().getContextClassLoader
    // Set the thread local metastore client to the client associated with this HiveClientImpl.
    Hive.set(client)
    // The classloader in clientLoader could be changed after addJar, always use the latest
    // classloader
    state.getConf.setClassLoader(clientLoader.classLoader)
    // setCurrentSessionState will use the classLoader associated
    // with the HiveConf in `state` to override the context class loader of the current
    // thread.
    shim.setCurrentSessionState(state)
    val ret = try f finally {
      Thread.currentThread().setContextClassLoader(original)
      HiveCatalogMetrics.incrementHiveClientCalls(1)
    }
    ret
  }

  private def retryLocked[A](f: => A): A = clientLoader.synchronized {

...

  override def createDatabase(
      database: CatalogDatabase,
      ignoreIfExists: Boolean): Unit = withHiveState {
    client.createDatabase(
      new HiveDatabase(
        database.name,
        database.description,
        database.locationUri,
        Option(database.properties).map(_.asJava).orNull),
        ignoreIfExists)
  }

Every DDL method in this class goes through withHiveState, which calls retryLocked, and retryLocked is synchronized on clientLoader. Here too the request is passed straight on to client, which in this class is Hive's org.apache.hadoop.hive.ql.metadata.Hive.
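
The core of withHiveState is a guarded swap of the thread's context class loader: take a lock, install the client's loader, run the call, and always restore the previous loader. The sketch below isolates just that part in plain Scala. It is a simplification under stated assumptions: the real code installs the loader indirectly through shim.setCurrentSessionState and also retries on failure, while this sketch sets the loader directly and does not retry. The names HiveStateSketch, lock and withLoader are hypothetical.

```scala
object HiveStateSketch {
  // Hypothetical lock standing in for clientLoader.synchronized.
  private val lock = new Object

  // Runs f with `loader` as the context class loader, one caller at a time,
  // and restores the previous loader even if f throws.
  def withLoader[A](loader: ClassLoader)(f: => A): A = lock.synchronized {
    val thread = Thread.currentThread()
    val original = thread.getContextClassLoader
    thread.setContextClassLoader(loader)
    try f finally thread.setContextClassLoader(original)
  }
}
```

Together with the synchronized withClient in HiveExternalCatalog, this means a metastore call holds two locks in sequence, so a slow DDL call can hold up other sessions that need the same client.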

4. DML execution

DML execution eventually ends up in spark.sql.

SQL execution sequence:

CLIService.executeStatement (returns an OperationHandle)

SessionManager.getSession

SessionManager.openSession

SparkSQLSessionManager.openSession

SparkSQLOperationManager.sessionToContexts.set (on openSession: maps the session to its sqlContext)

HiveSession.executeStatement

HiveSessionImpl.executeStatementInternal

OperationManager.newExecuteStatementOperation

SparkSQLOperationManager.newExecuteStatementOperation

SparkSQLOperationManager.sessionToContexts.get (looks up the sqlContext by session)

ExecuteStatementOperation.run

SparkExecuteStatementOperation.run

SparkExecuteStatementOperation.execute

SQLContext.sql (the familiar Spark SQL entry point)

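As the sequence shows, starting from the initialization of SparkSQLCLIService, each implementation class is replaced one by one with a Spark subclass, for example: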

org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager extends org.apache.hive.service.cli.session.SessionManager
org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager extends org.apache.hive.service.cli.operation.OperationManager
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation extends org.apache.hive.service.cli.operation.ExecuteStatementOperation

This is how the underlying implementation gets swapped out.
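
The substitution can be modeled in a few lines of plain Scala. This is an illustrative sketch, not the Thrift server code: SessionHandle, SqlContext, Operation and both manager classes are hypothetical stand-ins with heavily reduced signatures. It shows the two moves from the sequence above: a session-to-context map filled when a session opens, and an overridden factory method that uses that map to run the statement on the session's own context.

```scala
import java.util.concurrent.ConcurrentHashMap

object OperationManagerSketch {
  // Hypothetical stand-ins for a session handle and a per-session SQL context.
  final case class SessionHandle(id: String)
  final class SqlContext(val sessionName: String) {
    def sql(statement: String): String = s"[$sessionName] ran: $statement"
  }

  trait Operation { def run(): String }

  // Base manager, standing in for Hive's OperationManager.
  class OperationManager {
    def newExecuteStatementOperation(session: SessionHandle, statement: String): Operation =
      new Operation { def run(): String = s"hive ran: $statement" }
  }

  // Spark-style subclass: looks up the session's context and executes on it.
  class SparkOperationManager extends OperationManager {
    val sessionToContexts = new ConcurrentHashMap[SessionHandle, SqlContext]()

    override def newExecuteStatementOperation(session: SessionHandle, statement: String): Operation = {
      val ctx = sessionToContexts.get(session)
      require(ctx != null, s"Session $session has no context")
      new Operation { def run(): String = ctx.sql(statement) }
    }
  }

  // Called once per session, mirroring the mapping step performed in openSession.
  def openSession(manager: SparkOperationManager, id: String): SessionHandle = {
    val handle = SessionHandle(id)
    manager.sessionToContexts.put(handle, new SqlContext(id))
    handle
  }
}
```

A caller that only knows the base OperationManager type still ends up on the Spark path, because the override is selected at runtime.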

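As for why Hive's HiveServer2 can be extended this easily, see sql/hive-thriftserver in the Spark source tree. It appears to carry a heavily modified copy of the Hive 1.2 code, which will make future upgrades harder.
As for why Spark went to such lengths to extend HiveServer2 instead of reimplementing it, the likely reason is interface compatibility: users of Hive Thrift can migrate to Spark Thrift smoothly, because the only change is the connection URL. In practice, though, Spark Thrift and Hive Thrift still behave differently in many ways for the same SQL.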
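
Since both servers speak the HiveServer2 protocol, a JDBC client looks the same for either one. The sketch below uses only java.sql from the JDK. It assumes the Hive JDBC driver is on the classpath (on older setups it may need to be loaded first with Class.forName("org.apache.hive.jdbc.HiveDriver")) and that a server is listening on the given host and port; port 10000 and database "default" are assumed defaults here, and the helper names jdbcUrl and firstColumn are hypothetical.

```scala
import java.sql.DriverManager

object ThriftJdbcSketch {
  // Builds a HiveServer2-style JDBC URL; host, port and database are assumptions of the caller.
  def jdbcUrl(host: String, port: Int = 10000, database: String = "default"): String =
    s"jdbc:hive2://$host:$port/$database"

  // Runs one query and returns the first column of every row as a string.
  // Requires a reachable server and the Hive JDBC driver on the classpath.
  def firstColumn(url: String, user: String, query: String): List[String] = {
    val conn = DriverManager.getConnection(url, user, "")
    try {
      val stmt = conn.createStatement()
      try {
        val rs = stmt.executeQuery(query)
        val rows = List.newBuilder[String]
        while (rs.next()) rows += rs.getString(1)
        rows.result()
      } finally stmt.close()
    } finally conn.close()
  }
}
```

For example, ThriftJdbcSketch.firstColumn(ThriftJdbcSketch.jdbcUrl("localhost"), "hive", "select 1") would run against whichever server is bound to that port; pointing the same code at a Hive or a Spark Thrift server only changes the host and port in the URL.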

