关于hive on spark会话的共享状态

spark sql中有一个类：

org.apache.spark.sql.internal.SharedState

它是用来做：

1、元数据地址管理（warehousePath）

2、查询结果缓存管理（cacheManager）

3、程序中的执行状态和metrics的监控（statusStore）

4、默认元数据库的目录管理（externalCatalog）

5、全局视图管理（主要是防止元数据库中存在重复）（globalTempViewManager）

1：首先介绍元数据地址管理（warehousePath）

这块儿主要是获取spark sql元数据库的路径地址，那么一般情况，我们都是默认把hive默认作为spark sql的元数据库，因为

它首先去加载hive的配置文件"hive-site.xml" , 然后根据hive-site.xml中获取的信息来获取到hive元数据库的路径：

hive.metastore.warehouse.dir

那么有时候，我们不使用hive作为spark sql的元数据库，那么这个时候我们加载的hive元数据路径应该是null

val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")

如果hiveWarehouseDir是null，那么就去加载spark sql的自带的元数据管理地址（spark.sql.warehouse.dir），然后把这个地址的值赋予给hive.metastore.warehouse.dir

因此大概流程就是获取hiveWarehouseDir：

具体代码：

val warehousePath: String = {

    val configFile = Utils.getContextOrSparkClassLoader.getResource("hive-site.xml")

    if (configFile != null) {

      logInfo(s"loading hive config file: $configFile")

      sparkContext.hadoopConfiguration.addResource(configFile)

    }

    // hive.metastore.warehouse.dir only stay in hadoopConf

    sparkContext.conf.remove("hive.metastore.warehouse.dir")

    // Set the Hive metastore warehouse path to the one we use

    val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")

    if (hiveWarehouseDir != null && !sparkContext.conf.contains(WAREHOUSE_PATH.key)) {

      // If hive.metastore.warehouse.dir is set and spark.sql.warehouse.dir is not set,

      // we will respect the value of hive.metastore.warehouse.dir.

      sparkContext.conf.set(WAREHOUSE_PATH.key, hiveWarehouseDir)

      logInfo(s"${WAREHOUSE_PATH.key} is not set, but hive.metastore.warehouse.dir " +

        s"is set. Setting ${WAREHOUSE_PATH.key} to the value of " +

        s"hive.metastore.warehouse.dir ('$hiveWarehouseDir').")

      hiveWarehouseDir

    } else {

      // If spark.sql.warehouse.dir is set, we will override hive.metastore.warehouse.dir using

      // the value of spark.sql.warehouse.dir.

      // When neither spark.sql.warehouse.dir nor hive.metastore.warehouse.dir is set,

      // we will set hive.metastore.warehouse.dir to the default value of spark.sql.warehouse.dir.

      val sparkWarehouseDir = sparkContext.conf.get(WAREHOUSE_PATH)

      logInfo(s"Setting hive.metastore.warehouse.dir ('$hiveWarehouseDir') to the value of " +

        s"${WAREHOUSE_PATH.key} ('$sparkWarehouseDir').")

      sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", sparkWarehouseDir)

      sparkWarehouseDir

    }

  }

  logInfo(s"Warehouse path is '$warehousePath'.")

warehousePath

2：CacheManager

将查询结果缓存起来；这样的好处就是，如果后面还需要本次查询出来的内容，就不需要在查询一遍数据源了（这块儿有时间单独写篇文章记录）

具体代码：

  /**

   * Class for caching query results reused in future executions.

   */

  val cacheManager: CacheManager = new CacheManager

cacheManager

3：statusStore

代码：

  /**

   * A status store to query SQL status/metrics of this Spark application, based on SQL-specific

   * [[org.apache.spark.scheduler.SparkListenerEvent]]s.

   */

  val statusStore: SQLAppStatusStore = {

    val kvStore = sparkContext.statusStore.store.asInstanceOf[ElementTrackingStore]

    val listener = new SQLAppStatusListener(sparkContext.conf, kvStore, live = true)

    sparkContext.listenerBus.addToStatusQueue(listener)

    val statusStore = new SQLAppStatusStore(kvStore, Some(listener))

    sparkContext.ui.foreach(new SQLTab(statusStore, _))

    statusStore

  }

statusStore

这段代码其实说白了就是将sql的状态和一些metrics指标写入到监听器中。

那么问题来了，监听器一定是实时的去监听的（读取的），然后spark sql还要不断的往监听器中写入，那么按照传统的list，map这种结构，在读取数据的时候还要在修改结构，会出现错误的；

因此spark sql采用了写时复制容器：

private[this] val listenersPlusTimers = new CopyOnWriteArrayList[(L, Option[Timer])]

将信息不断的写入同时，还不影响读取；

4、externalCatalog

获取spark 会话的内部目录（就是hiveWarehouseDir），如果不存在的话，就按照hiveWarehouseDir创建一个，当然，spark会通过回调函数的方式去监控当前目录中的事件：

externalCatalog.addListener(new ExternalCatalogEventListener {

      override def onEvent(event: ExternalCatalogEvent): Unit = {

        sparkContext.listenerBus.post(event)

      }

    })

此处代码：

/**

   * A catalog that interacts with external systems.

   */

  lazy val externalCatalog: ExternalCatalog = {

    val externalCatalog = SharedState.reflect[ExternalCatalog, SparkConf, Configuration](

      SharedState.externalCatalogClassName(sparkContext.conf),

      sparkContext.conf,

      sparkContext.hadoopConfiguration)

    val defaultDbDefinition = CatalogDatabase(

      SessionCatalog.DEFAULT_DATABASE,

      "default database",

      CatalogUtils.stringToURI(warehousePath),

      Map())

    // Create default database if it doesn't exist

    if (!externalCatalog.databaseExists(SessionCatalog.DEFAULT_DATABASE)) {

      // There may be another Spark application creating default database at the same time, here we

      // set `ignoreIfExists = true` to avoid `DatabaseAlreadyExists` exception.

      externalCatalog.createDatabase(defaultDbDefinition, ignoreIfExists = true)

    }

    // Make sure we propagate external catalog events to the spark listener bus

    externalCatalog.addListener(new ExternalCatalogEventListener {

      override def onEvent(event: ExternalCatalogEvent): Unit = {

        sparkContext.listenerBus.post(event)

      }

    })

    externalCatalog

  }

externalCatalog

5、

此处就是防止spark执行过程中的临时数据库出现在externalCatalog中，因为如果spark的GLOBAL_TEMP_DATABASE出现在externalCatalog中的话。那么随着程序的执行，下一个线程想要获取元数据库地址的时候，就没法在里面创建hiveWarehouseDir。因此，如果在externalCatalog中存在GLOBAL_TEMP_DATABASE，那么就抛异常

  /**

   * A manager for global temporary views.

   */

  lazy val globalTempViewManager: GlobalTempViewManager = {

    // System preserved database should not exists in metastore. However it's hard to guarantee it

    // for every session, because case-sensitivity differs. Here we always lowercase it to make our

    // life easier.

    val globalTempDB = sparkContext.conf.get(GLOBAL_TEMP_DATABASE).toLowerCase(Locale.ROOT)

    if (externalCatalog.databaseExists(globalTempDB)) {

      throw new SparkException(

        s"$globalTempDB is a system preserved database, please rename your existing database " +

          "to resolve the name conflict, or set a different value for " +

          s"${GLOBAL_TEMP_DATABASE.key}, and launch your Spark application again.")

    }

    new GlobalTempViewManager(globalTempDB)

  }

globalTempViewManager

关于hive on spark会话的共享状态的更多相关文章

基于CDH 5.9.1 搭建 Hive on Spark 及相关配置和调优
Hive默认使用的计算框架是MapReduce,在我们使用Hive的时候通过写SQL语句,Hive会自动将SQL语句转化成MapReduce作业去执行,但是MapReduce的执行速度远差与Spark ...
Hive、Spark SQL、Impala比较
Hive.Spark SQL.Impala比较 Hive.Spark SQL和Impala三种分布式SQL查询引擎都是SQL-on-Hadoop解决方案,但又各有特点.前面已经讨论了Hi ...
Spark记录-源码编译spark2.2.0（结合Hive on Spark/Hive on MR2/Spark on Yarn）
#spark2.2.0源码编译 #组件:mvn-3.3.9 jdk-1.8 #wget http://mirror.bit.edu.cn/apache/spark/spark-2.2.0/spark- ...
hive on spark VS SparkSQL VS hive on tez
http://blog.csdn.net/wtq1993/article/details/52435563 http://blog.csdn.net/yeruby/article/details/51 ...
hive on spark 释放session资源
背景启动hive时,可以看到2.0以后的版本,将要弃用mr引擎,官方建议使用spark,tez等引擎. spark同时支持批式流式处理,可以减少学习成本.所以选用了spark作为执行引擎. hive ...
教你成为全栈工程师(Full Stack Developer) 四十五-一文读懂hadoop、hbase、hive、spark分布式系统架构
转载自http://www.shareditor.com/blogshow?blogId=96 机器学习.数据挖掘等各种大数据处理都离不开各种开源分布式系统,hadoop用于分布式存储和map-red ...
hive on spark：return code 30041 Failed to create Spark client for Spark session原因分析及解决方案探寻
最近在Hive中使用Spark引擎进行执行时(set hive.execution.engine=spark),经常遇到return code 30041的报错,为了深入探究其原因,阅读了官方issu ...
Hive on Spark安装配置详解（都是坑啊）
个人主页:http://www.linbingdong.com 简书地址:http://www.jianshu.com/p/a7f75b868568 简介本文主要记录如何安装配置Hive on Sp ...
Hive On Spark概述
Hive现有支持的执行引擎有mr和tez,默认的执行引擎是mr,Hive On Spark的目的是添加一个spark的执行引擎,让hive能跑在spark之上: 在执行hive ql脚本之前指定执行引 ...

随机推荐

pat L2--005 简单复习一下并差集
布置宴席最微妙的事情,就是给前来参宴的各位宾客安排座位.无论如何,总不能把两个死对头排到同一张宴会桌旁!这个艰巨任务现在就交给你,对任何一对客人,请编写程序告诉主人他们是否能被安排同席. 输入格式: ...
.NET CORE API 使用Postman中Post请求获取不到传参问题
开发中遇到个坑记录下. 使用Postman请求core api 接口时,按之前的使用方法(form-data , x-www-form-urlencoded)怎么设置都无法访问. 最后采用raw写入 ...
JS基础_质数练习的改进，提高程序执行效率
<!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <title> ...
Tika提取文件元数据
Tika可以从文件中提取元数据. 什么是元数据: 元数据是文件所提供的的附件信息即文件的属性. word文档的元数据: Tika提取元数据: 我们可以使用文件parse()方法提取元数据,传递一个空的 ...
js对象的所有方法
Object构造方法 Object.assign() 将所有可枚举的自身属性的值从一个或多个源对象复制到目标对象. Object.create() 用指定的原型对象和属性创建一个新对象. Object ...
Redis05——Redis高级运用（管道连接，发布订阅，布隆过滤器）
Redis高级运用一.管道连接redis(一次发送多个命令,节省往返时间) 1.安装nc yum install nc -y 2.通过nc连接redis nc localhost 6379 3.通过 ...
web开发中的支付宝支付和微信支付
https://www.jianshu.com/p/155757d2b9eb  <script src="https://res.wx ...
PHP通过php-java-bridge调用JAVA的jar包里class类
正文: 有的时候我们需要在PHP里调用JAVA平台封装好的jar包里的class类和方法,一般飘易推荐的做法是采用php-java-bridge做桥接,本文就来介绍一下大致的实现方法. 先简单说 ...
java线程基础巩固---多线程与JVM内存结构的关系及Thread构造函数StackSize的理解
继续学习一下Thread的构造函数,在上次[http://www.cnblogs.com/webor2006/p/7760422.html]已经对如下构造都已经学习过了: 多线程与JVM内存结构的关系 ...
libusb传输endpoint描述符
至于endpoint描述符,它是属于设置的,每个设置都会有endpoint描述符,也就是每个接口的设置都表示一种功能,既然是实现了功能,那就必须通过endpoint来传输数据,那到底是用到了几个end ...

关于hive on spark会话的共享状态

关于hive on spark会话的共享状态的更多相关文章

随机推荐

热门专题