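Hive 2.1

Hive can execute SQL in two ways:

Through the hive command, which has three modes: hive -e, hive -f, and the interactive shell;
Through the beeline command, which connects to a remote Thrift server (HiveServer2).

The sections below trace how SQL is executed in each of these cases.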

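1 The hive command

Startup command

The hive client is started with

$HIVE_HOME/bin/hive

which is equivalent to

$HIVE_HOME/bin/hive --service cli

which in turn invokes

$HIVE_HOME/bin/ext/cli.sh

The actual main class is org.apache.hadoop.hive.cli.CliDriver

Code walkthrough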

org.apache.hadoop.hive.cli.CliDriver

  public static void main(String[] args) throws Exception {
    int ret = new CliDriver().run(args);
    System.exit(ret);
  }

  public  int run(String[] args) throws Exception {
...
    // execute cli driver work
    try {
      return executeDriver(ss, conf, oproc);
    } finally {
      ss.resetThreadName();
      ss.close();
    }
...

  private int executeDriver(CliSessionState ss, HiveConf conf, OptionsProcessor oproc)
      throws Exception {
...
    if (ss.execString != null) {
      int cmdProcessStatus = cli.processLine(ss.execString);
      return cmdProcessStatus;
    }
...
    try {
      if (ss.fileName != null) {
        return cli.processFile(ss.fileName);
      }
    } catch (FileNotFoundException e) {
      System.err.println("Could not open input file for reading. (" + e.getMessage() + ")");
      return 3;
    }
...
    while ((line = reader.readLine(curPrompt + "> ")) != null) {
      if (!prefix.equals("")) {
        prefix += '\n';
      }
      if (line.trim().startsWith("--")) {
        continue;
      }
      if (line.trim().endsWith(";") && !line.trim().endsWith("\\;")) {
        line = prefix + line;
        ret = cli.processLine(line, true);
...

  public int processFile(String fileName) throws IOException {
...
      rc = processReader(bufferReader);
...

  public int processReader(BufferedReader r) throws IOException {
    String line;
    StringBuilder qsb = new StringBuilder();
    while ((line = r.readLine()) != null) {
      // Skipping through comments
      if (! line.startsWith("--")) {
        qsb.append(line + "\n");
      }
    }
    return (processLine(qsb.toString()));
  }

  public int processLine(String line, boolean allowInterrupting) {
...
        ret = processCmd(command);
...

  public int processCmd(String cmd) {
...
        CommandProcessor proc = CommandProcessorFactory.get(tokens, (HiveConf) conf);
        ret = processLocalCmd(cmd, proc, ss);
...

  int processLocalCmd(String cmd, CommandProcessor proc, CliSessionState ss) {
    int tryCount = 0;
    boolean needRetry;
    int ret = 0;
    do {
      try {
        needRetry = false;
        if (proc != null) {
          if (proc instanceof Driver) {
            Driver qp = (Driver) proc;
            PrintStream out = ss.out;
            long start = System.currentTimeMillis();
            if (ss.getIsVerbose()) {
              out.println(cmd);
            }
            qp.setTryCount(tryCount);
            ret = qp.run(cmd).getResponseCode();
...
              while (qp.getResults(res)) {
                for (String r : res) {
                  out.println(r);
                }
...

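CliDriver.main calls run, and run calls executeDriver. executeDriver handles the three modes mentioned above:

hive -e: ss.execString is non-null; the SQL string is executed and the process exits;
hive -f: ss.fileName is non-null; the SQL file is executed and the process exits;
Interactive mode: lines are read repeatedly via reader.readLine, and each complete statement is executed and its results are printed.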
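To make the three-way dispatch concrete, here is a minimal sketch in plain Java. MiniCli and all of its methods are hypothetical stand-ins, not Hive classes; the sketch assumes that file and interactive input arrive as lists of lines, and its processLine only records the statement instead of executing it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the three entry modes; not Hive's actual CliDriver. */
final class MiniCli {
  /** Statements "executed" so far, recorded so the flow can be inspected. */
  final List<String> executed = new ArrayList<>();

  /** Stand-in for processLine: records the statement and reports success. */
  int processLine(String statement) {
    executed.add(statement.trim());
    return 0;
  }

  /** Same order of checks as executeDriver: -e first, then -f, then interactive. */
  int dispatch(String execString, List<String> fileLines, List<String> interactiveLines) {
    if (execString != null) {
      return processLine(execString);
    }
    if (fileLines != null) {
      return processFile(fileLines);
    }
    return interactive(interactiveLines);
  }

  /** Like processReader: drop lines starting with "--", join the rest, process once. */
  int processFile(List<String> lines) {
    StringBuilder sb = new StringBuilder();
    for (String line : lines) {
      if (!line.startsWith("--")) {
        sb.append(line).append('\n');
      }
    }
    return processLine(sb.toString());
  }

  /** Like the interactive loop: buffer lines until one ends with an unescaped ';'. */
  int interactive(List<String> lines) {
    String prefix = "";
    int ret = 0;
    for (String line : lines) {
      if (!prefix.isEmpty()) {
        prefix += '\n';
      }
      String trimmed = line.trim();
      if (trimmed.startsWith("--")) {
        continue; // comment line, skipped
      }
      if (trimmed.endsWith(";") && !trimmed.endsWith("\\;")) {
        ret = processLine(prefix + line);
        prefix = "";
      } else {
        prefix += line; // statement not finished yet
      }
    }
    return ret;
  }
}
```

For instance, new MiniCli().dispatch(null, null, lines) with the lines "select 1", "from t;" records a single two-line statement, which is how a multi-line query typed at the prompt reaches processLine as one unit.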

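All three modes end up in processLine. processLine calls processCmd, which calls processLocalCmd; there, Driver.run is called first to execute the SQL, and then Driver.getResults is called to output the results. These are Driver's two most important interfaces; the Driver implementation itself is examined later.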
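The run-then-fetch contract can be sketched as follows. ToyDriver is a hypothetical stand-in that returns three fixed rows; only the calling pattern (run once, then call getResults until it returns false, clearing the buffer between batches) is meant to mirror what processLocalCmd does with the real Driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Toy driver with the same two-call shape as Driver.run / Driver.getResults. */
final class ToyDriver {
  private Iterator<String> rows = Collections.emptyIterator();
  private final int batchSize;

  ToyDriver(int batchSize) {
    this.batchSize = batchSize;
  }

  /** Stand-in for run: "executes" the command and returns a response code (0 = success). */
  int run(String command) {
    rows = Arrays.asList("row-1", "row-2", "row-3").iterator();
    return 0;
  }

  /** Stand-in for getResults: adds up to batchSize rows to res; false once exhausted. */
  boolean getResults(List<String> res) {
    if (!rows.hasNext()) {
      return false;
    }
    for (int i = 0; i < batchSize && rows.hasNext(); i++) {
      res.add(rows.next());
    }
    return true;
  }

  /** Caller-side pattern: run once, then drain the results batch by batch. */
  static List<String> runAndCollect(ToyDriver driver, String cmd) {
    List<String> out = new ArrayList<>();
    if (driver.run(cmd) != 0) {
      return out; // execution failed, nothing to fetch
    }
    List<String> batch = new ArrayList<>();
    while (driver.getResults(batch)) {
      out.addAll(batch);
      batch.clear(); // reuse the buffer for the next batch
    }
    return out;
  }
}
```

Fetching in batches rather than all at once keeps the client's memory use bounded regardless of the result size.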

2 The beeline command

beeline connects to the Hive Thrift server (HiveServer2), so we first look at how that server starts.
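beeline is a JDBC client, so the client side of this path can be approximated with the standard java.sql API. The sketch below is an illustration only: HiveJdbcSketch and its method names are hypothetical, it assumes the Hive JDBC driver jar is on the classpath, and the host, port, and database passed to it are placeholders for your own HiveServer2 endpoint.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing what a beeline-like JDBC client does. */
final class HiveJdbcSketch {
  /** Builds a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database. */
  static String jdbcUrl(String host, int port, String database) {
    return "jdbc:hive2://" + host + ":" + port + "/" + database;
  }

  /** Runs one query and returns the first column of every row as a string. */
  static List<String> firstColumn(String url, String user, String sql) throws SQLException {
    List<String> rows = new ArrayList<>();
    // Assumption: an empty password is accepted; adjust to your authentication setup.
    try (Connection conn = DriverManager.getConnection(url, user, "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql)) {
      while (rs.next()) {
        rows.add(rs.getString(1));
      }
    }
    return rows;
  }
}
```

On the server side, a statement sent this way reaches executeStatement and its rows come back through fetchResults, the two calls traced in the rest of this section.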

Hive Thrift server

Startup command

The Hive Thrift server is started with

$HIVE_HOME/bin/hiveserver2

which is equivalent to

$HIVE_HOME/bin/hive --service hiveserver2

which in turn invokes

$HIVE_HOME/bin/ext/hiveserver2.sh

The actual main class is org.apache.hive.service.server.HiveServer2

Startup sequence

HiveServer2.main
  startHiveServer2
    init
      addService: CLIService, ThriftBinaryCLIService
    start
      Service.start
        CLIService.start
        ThriftBinaryCLIService.start
          TThreadPoolServer.serve
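The startup sequence follows a composite-service pattern: child services are registered during init and then started in registration order. A minimal sketch under that assumption, where ToyCompositeService is a hypothetical class that logs events instead of starting real services:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy composite service: registers children in init, then starts them in order. */
final class ToyCompositeService {
  private final List<String> children = new ArrayList<>();
  /** Lifecycle events in the order they happen, for inspection. */
  final List<String> events = new ArrayList<>();

  /** Registers the child services, then initializes each of them. */
  void init() {
    children.add("CLIService");
    children.add("ThriftBinaryCLIService");
    for (String child : children) {
      events.add("init " + child);
    }
  }

  /** Starts the children in registration order. */
  void start() {
    for (String child : children) {
      events.add("start " + child);
    }
  }

  /** Same two steps as startHiveServer2: init, then start. */
  static ToyCompositeService startServer() {
    ToyCompositeService server = new ToyCompositeService();
    server.init();
    server.start();
    return server;
  }
}
```

In the real server, the last start in the chain is the one that ends in TThreadPoolServer.serve, which begins accepting client connections.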

Class structure: [interface or parent class -> subclass]

TServer->TThreadPoolServer
TProcessorFactory->SQLPlainProcessorFactory
TProcessor->TSetIpAddressProcessor
ThriftCLIService->ThriftBinaryCLIService
CLIService
HiveSession

Code walkthrough

org.apache.hive.service.cli.thrift.ThriftBinaryCLIService

  public ThriftBinaryCLIService(CLIService cliService, Runnable oomHook) {
    super(cliService, ThriftBinaryCLIService.class.getSimpleName());
    this.oomHook = oomHook;
  }

ThriftBinaryCLIService is a core class: it is what actually starts the Thrift server, and it wraps a CLIService to which every request is ultimately delegated. The CLIService code follows:

org.apache.hive.service.cli.CLIService

  @Override
  public OperationHandle executeStatement(SessionHandle sessionHandle, String statement,
      Map<String, String> confOverlay) throws HiveSQLException {
    OperationHandle opHandle =
        sessionManager.getSession(sessionHandle).executeStatement(statement, confOverlay);
    LOG.debug(sessionHandle + ": executeStatement()");
    return opHandle;
  }

  @Override
  public RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
                             long maxRows, FetchType fetchType) throws HiveSQLException {
    RowSet rowSet = sessionManager.getOperationManager().getOperation(opHandle)
        .getParentSession().fetchResults(opHandle, orientation, maxRows, fetchType);
    LOG.debug(opHandle + ": fetchResults()");
    return rowSet;
  }

CLIService has two key interfaces, executeStatement and fetchResults. Both are forwarded to a HiveSession: executeStatement looks the session up by its SessionHandle, while fetchResults finds it through the operation's parent session. The HiveSession implementation class follows:
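The handle-based forwarding can be sketched with plain maps. All class names below (ToySession, ToyCliService) are hypothetical; string and integer handles stand in for SessionHandle and OperationHandle, and "executing" a statement just stores one canned row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy session: owns the results of the operations it ran. */
final class ToySession {
  private final Map<Integer, List<String>> rowsByOp = new HashMap<>();

  void executeStatement(int opHandle, String statement) {
    List<String> rows = new ArrayList<>();
    rows.add("result of: " + statement);
    rowsByOp.put(opHandle, rows);
  }

  List<String> fetchResults(int opHandle) {
    return rowsByOp.get(opHandle);
  }
}

/** Toy service layer: resolves handles, then forwards to the owning session. */
final class ToyCliService {
  private final Map<String, ToySession> sessions = new HashMap<>();
  private final Map<Integer, ToySession> parentSessionByOp = new HashMap<>();
  private int nextOp = 1;

  void openSession(String sessionHandle) {
    sessions.put(sessionHandle, new ToySession());
  }

  /** Looks the session up by its handle and returns a new operation handle. */
  int executeStatement(String sessionHandle, String statement) {
    ToySession session = sessions.get(sessionHandle);
    if (session == null) {
      throw new IllegalArgumentException("unknown session: " + sessionHandle);
    }
    int opHandle = nextOp++;
    session.executeStatement(opHandle, statement);
    parentSessionByOp.put(opHandle, session);
    return opHandle;
  }

  /** Finds the operation's parent session and fetches from it. */
  List<String> fetchResults(int opHandle) {
    ToySession session = parentSessionByOp.get(opHandle);
    if (session == null) {
      throw new IllegalArgumentException("unknown operation: " + opHandle);
    }
    return session.fetchResults(opHandle);
  }
}
```

Returning a handle instead of the rows themselves is what lets a client execute a statement in one request and fetch its results in later ones.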

org.apache.hive.service.cli.session.HiveSessionImpl

  @Override
  public OperationHandle executeStatement(String statement, Map<String, String> confOverlay) throws HiveSQLException {
    return executeStatementInternal(statement, confOverlay, false, 0);
  }

  private OperationHandle executeStatementInternal(String statement,
      Map<String, String> confOverlay, boolean runAsync, long queryTimeout) throws HiveSQLException {
    acquire(true, true);
    ExecuteStatementOperation operation = null;
    OperationHandle opHandle = null;
    try {
      operation = getOperationManager().newExecuteStatementOperation(getSession(), statement,
          confOverlay, runAsync, queryTimeout);
      opHandle = operation.getHandle();
      operation.run();
...

  @Override
  public RowSet fetchResults(OperationHandle opHandle, FetchOrientation orientation,
      long maxRows, FetchType fetchType) throws HiveSQLException {
    acquire(true, false);
    try {
      if (fetchType == FetchType.QUERY_OUTPUT) {
        return operationManager.getOperationNextRowSet(opHandle, orientation, maxRows);
      }
      return operationManager.getOperationLogRowSet(opHandle, orientation, maxRows, sessionConf);
    } finally {
      release(true, false);
    }
  }

As shown:

HiveSessionImpl.executeStatement calls ExecuteStatementOperation.run (ExecuteStatementOperation is one kind of Operation);
HiveSessionImpl.fetchResults calls OperationManager.getOperationNextRowSet, which in turn calls Operation.getNextRowSet.

org.apache.hive.service.cli.operation.OperationManager

  public RowSet getOperationNextRowSet(OperationHandle opHandle,
      FetchOrientation orientation, long maxRows)
          throws HiveSQLException {
    return getOperation(opHandle).getNextRowSet(orientation, maxRows);
  }

Next, a closer look at Operation's run and getNextRowSet:

org.apache.hive.service.cli.operation.Operation

  public void run() throws HiveSQLException {
    beforeRun();
    try {
      Metrics metrics = MetricsFactory.getInstance();
      if (metrics != null) {
        try {
          metrics.incrementCounter(MetricsConstant.OPEN_OPERATIONS);
        } catch (Exception e) {
          LOG.warn("Error Reporting open operation to Metrics system", e);
        }
      }
      runInternal();
    } finally {
      afterRun();
    }
  }

  public RowSet getNextRowSet() throws HiveSQLException {
    return getNextRowSet(FetchOrientation.FETCH_NEXT, DEFAULT_FETCH_MAX_ROWS);
  }

Operation is an abstract class:

run calls the abstract method runInternal;
the no-argument getNextRowSet calls the abstract two-argument getNextRowSet(orientation, maxRows).

The subclass implementations of these two abstract methods are shown next; they ultimately rely on Driver's run and getResults.
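Operation.run is an instance of the template method pattern: the base class fixes the before/after steps and subclasses supply only runInternal. A minimal sketch with hypothetical classes ToyOperation and ToySqlOperation (the metrics step is omitted):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy base class: run() fixes the lifecycle, subclasses fill in runInternal(). */
abstract class ToyOperation {
  /** Records which lifecycle steps ran, in order. */
  final List<String> trace = new ArrayList<>();

  protected void beforeRun() {
    trace.add("beforeRun");
  }

  protected void afterRun() {
    trace.add("afterRun");
  }

  /** The only step a concrete operation has to implement. */
  protected abstract void runInternal() throws Exception;

  /** Template method: afterRun is in a finally block, so it runs even on failure. */
  public final void run() throws Exception {
    beforeRun();
    try {
      runInternal();
    } finally {
      afterRun();
    }
  }
}

/** Toy concrete operation that can be told to fail. */
final class ToySqlOperation extends ToyOperation {
  private final boolean fail;

  ToySqlOperation(boolean fail) {
    this.fail = fail;
  }

  @Override
  protected void runInternal() throws Exception {
    trace.add("runInternal");
    if (fail) {
      throw new Exception("query failed");
    }
  }
}
```

Because the cleanup step sits in finally, a failing operation still leaves the session in a consistent state, which is the point of routing every subclass through the same run method.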

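1) First, runInternal as implemented in the subclass HiveCommandOperation:

org.apache.hive.service.cli.operation.HiveCommandOperation

  @Override
  public void runInternal() throws HiveSQLException {
    setState(OperationState.RUNNING);
    try {
      String command = getStatement().trim();
      String[] tokens = statement.split("\\s");
      String commandArgs = command.substring(tokens[0].length()).trim();
      CommandProcessorResponse response = commandProcessor.run(commandArgs);
...

This calls CommandProcessor.run. Note that HiveCommandOperation handles non-SQL commands such as set, dfs, and add, whose processors are dedicated CommandProcessor implementations. Ordinary SQL statements are handled by SQLOperation, whose runInternal is the path that reaches Driver.run (Driver is also a CommandProcessor implementation).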
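Which Operation subclass handles a statement depends on its first token. The article does not show that selection code, so the sketch below is an assumption about its general shape: ToyOperationFactory is hypothetical, and the command set is an illustrative subset rather than Hive's full list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy selector: non-SQL commands go one way, everything else is treated as SQL. */
final class ToyOperationFactory {
  // Illustrative subset only; the real list of non-SQL commands is longer.
  private static final Set<String> HIVE_COMMANDS =
      new HashSet<>(Arrays.asList("set", "dfs", "add"));

  /** Returns the name of the operation class that would handle the statement. */
  static String operationFor(String statement) {
    String[] tokens = statement.trim().split("\\s+");
    String first = tokens[0].toLowerCase(Locale.ROOT);
    return HIVE_COMMANDS.contains(first) ? "HiveCommandOperation" : "SQLOperation";
  }
}
```

Under this sketch, a statement such as set hive.execution.engine=mr is routed to HiveCommandOperation, while select 1 is routed to SQLOperation and therefore to Driver.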

2) Next, getNextRowSet as implemented in the subclass SQLOperation:

org.apache.hive.service.cli.operation.SQLOperation

  public RowSet getNextRowSet(FetchOrientation orientation, long maxRows)
    throws HiveSQLException {
...
      driver.setMaxRows((int) maxRows);
      if (driver.getResults(convey)) {
        return decode(convey, rowSet);
      }
...

This calls Driver.getResults.

3 Driver

The code analysis above shows that both paths, the hive command line and beeline connecting to the Thrift server, ultimately rely on Driver.

Driver's two core interfaces are:

run
getResults

Code walkthrough

org.apache.hadoop.hive.ql.Driver

  @Override
  public CommandProcessorResponse run(String command)
      throws CommandNeedRetryException {
    return run(command, false);
  }

  public CommandProcessorResponse run(String command, boolean alreadyCompiled)
        throws CommandNeedRetryException {
    CommandProcessorResponse cpr = runInternal(command, alreadyCompiled);
...

  private CommandProcessorResponse runInternal(String command, boolean alreadyCompiled)
      throws CommandNeedRetryException {
...
        ret = compileInternal(command, true);
...
      ret = execute(true);
...

  private int compileInternal(String command, boolean deferClose) {
...
      ret = compile(command, true, deferClose);
...

  public int compile(String command, boolean resetTaskIds, boolean deferClose) {
...
      plan = new QueryPlan(queryStr, sem, perfLogger.getStartTime(PerfLogger.DRIVER_RUN), queryId,
        queryState.getHiveOperation(), schema);
...

  public int execute(boolean deferClose) throws CommandNeedRetryException {
...
      // Add root Tasks to runnable
      for (Task<? extends Serializable> tsk : plan.getRootTasks()) {
        // This should never happen, if it does, it's a bug with the potential to produce
        // incorrect results.
        assert tsk.getParentTasks() == null || tsk.getParentTasks().isEmpty();
        driverCxt.addToRunnable(tsk);
      }
...
      // Loop while you either have tasks running, or tasks queued up
      while (driverCxt.isRunning()) {
        // Launch upto maxthreads tasks
        Task<? extends Serializable> task;
        while ((task = driverCxt.getRunnable(maxthreads)) != null) {
          TaskRunner runner = launchTask(task, queryId, noName, jobname, jobs, driverCxt);
          if (!runner.isRunning()) {
            break;
          }
        }
        // poll the Tasks to see which one completed
        TaskRunner tskRun = driverCxt.pollFinished();
        if (tskRun == null) {
          continue;
        }
        hookContext.addCompleteTask(tskRun);
        queryDisplay.setTaskResult(tskRun.getTask().getId(), tskRun.getTaskResult());
        Task<? extends Serializable> tsk = tskRun.getTask();
        TaskResult result = tskRun.getTaskResult();
...
        if (tsk.getChildTasks() != null) {
          for (Task<? extends Serializable> child : tsk.getChildTasks()) {
            if (DriverContext.isLaunchable(child)) {
              driverCxt.addToRunnable(child);
            }
          }
        }
      }

  public boolean getResults(List res) throws IOException, CommandNeedRetryException {
    if (driverState == DriverState.DESTROYED || driverState == DriverState.CLOSED) {
      throw new IOException("FAILED: query has been cancelled, closed, or destroyed.");
    }
    if (isFetchingTable()) {
      /**
       * If resultset serialization to thrift object is enabled, and if the destination table is
       * indeed written using ThriftJDBCBinarySerDe, read one row from the output sequence file,
       * since it is a blob of row batches.
       */
      if (fetchTask.getWork().isUsingThriftJDBCBinarySerDe()) {
        maxRows = 1;
      }
      fetchTask.setMaxRows(maxRows);
      return fetchTask.fetch(res);
    }
...

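Driver.run calls runInternal, which first calls compileInternal to compile the SQL into a QueryPlan and then calls execute to run the plan's tasks: the root tasks are queued first, and a child task is queued once it becomes launchable after a task it depends on has finished;
Driver.getResults calls FetchTask.fetch to retrieve the results.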
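The scheduling loop in execute can be illustrated with a single-threaded simulation. ToyTask and ToyScheduler are hypothetical; the real code launches tasks on TaskRunner threads and polls for finished ones, whereas this sketch "runs" each task immediately and assumes a child becomes launchable once all of its parents are done.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy task node in a dependency graph. */
final class ToyTask {
  final String id;
  final List<ToyTask> parents = new ArrayList<>();
  final List<ToyTask> children = new ArrayList<>();

  ToyTask(String id) {
    this.id = id;
  }

  /** Declares that child depends on this task; returns child for chaining. */
  ToyTask then(ToyTask child) {
    children.add(child);
    child.parents.add(this);
    return child;
  }
}

/** Toy scheduler: start from the root tasks, release children as parents finish. */
final class ToyScheduler {
  static List<String> execute(List<ToyTask> rootTasks) {
    Deque<ToyTask> runnable = new ArrayDeque<>(rootTasks); // root tasks go in first
    Set<ToyTask> queued = new HashSet<>(rootTasks);
    Set<ToyTask> done = new HashSet<>();
    List<String> order = new ArrayList<>();
    while (!runnable.isEmpty()) {
      ToyTask task = runnable.poll();
      order.add(task.id); // "run" the task synchronously
      done.add(task);
      for (ToyTask child : task.children) {
        // Launchable: not queued yet and every parent has completed.
        if (!queued.contains(child) && done.containsAll(child.parents)) {
          queued.add(child);
          runnable.add(child);
        }
      }
    }
    return order;
  }
}
```

With a diamond-shaped plan (a feeds b and c, which both feed d), d is queued only after both b and c have finished, so it never runs on incomplete input.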

For the Hive SQL parsing process, see: https://www.cnblogs.com/barneywill/p/10186644.html
