[Original] Big Data Basics: Hive (5) Hive on Spark
Hive 2.3.4 on Spark 2.4.0
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine.
set hive.execution.engine=spark;
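You can confirm which engine is active at the Hive prompt; the sample output below is illustrative:
hive> set hive.execution.engine;
hive.execution.engine=spark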
1 version
Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is the list of Hive versions and their corresponding compatible Spark versions, per the Hive wiki:

Hive Version    Spark Version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0
2.2.x           1.6.0
2.1.x           1.6.0
2.0.x           1.5.0
1.2.x           1.3.1
1.1.x           1.2.0

The pairings above are the ones that have been tested; other versions may also work, but you need to verify them yourself.
2 yarn
Instead of the capacity scheduler, the fair scheduler is required. This fairly distributes an equal share of resources for jobs in the YARN cluster.
yarn-site.xml
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
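Optionally, the FairScheduler can read queue definitions from an allocation file pointed to by yarn.scheduler.fair.allocation.file. A minimal sketch (the queue name here is illustrative, not part of the original setup):
<!-- fair-scheduler.xml, referenced by yarn.scheduler.fair.allocation.file -->
<allocations>
    <queue name="hive">
        <weight>1.0</weight>
    </queue>
</allocations>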
3 spark
$ export SPARK_HOME=...
Note that you must have a version of Spark which does not include the Hive jars, i.e. one which was not built with the Hive profile. If you will use Parquet tables, it's recommended to also enable the "parquet-provided" profile; otherwise there could be conflicts in the Parquet dependency.
You cannot simply point at an existing Spark installation directory: the bundled Hive jars and the bundled Parquet jars are both frequent sources of conflicts.
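A quick way to check whether a Spark 2.x build is Hive-free (a sketch; the grep pattern is only illustrative):
$ ls $SPARK_HOME/jars | grep -i hive    # should print nothing for a without-hive build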
4 library
$ ln -s $SPARK_HOME/jars/scala-library-2.11.8.jar $HIVE_HOME/lib/scala-library-2.11.8.jar
$ ln -s $SPARK_HOME/jars/spark-core_2.11-2.0.2.jar $HIVE_HOME/lib/spark-core_2.11-2.0.2.jar
$ ln -s $SPARK_HOME/jars/spark-network-common_2.11-2.0.2.jar $HIVE_HOME/lib/spark-network-common_2.11-2.0.2.jar
Prior to Hive 2.2.0, link the spark-assembly jar to HIVE_HOME/lib
Spark versions before 2.x ship a single spark-assembly.jar; link that jar directly into HIVE_HOME/lib, as shown below.
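For example, on a Spark 1.6.x layout (the exact jar name depends on the build, so this is only a sketch):
$ ln -s $SPARK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar $HIVE_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar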
5 hive
$ hive
hive> set hive.execution.engine=spark;
The default is spark.master=yarn; more configuration options:
set spark.master=<Spark Master URL>
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>
set spark.executor.memory=512m;
set spark.executor.instances=10;
set spark.executor.cores=1;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
These properties can be set at the Hive prompt just like any other Hive config (as above), set in the Hive configuration (hive-site.xml), or placed in a "spark-defaults.conf" file under HIVE_CONF_DIR, i.e. on the Hive classpath; a sample file is sketched below.
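A minimal HIVE_CONF_DIR/spark-defaults.conf sketch mirroring the settings above (the event log directory is a placeholder and must already exist):
spark.master                yarn
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///spark-logs
spark.executor.memory       512m
spark.executor.instances    10
spark.executor.cores        1
spark.serializer            org.apache.spark.serializer.KryoSerializer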
6 errors
Running a SQL statement in Hive fails with:
FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client
The Hive execution log is located at /tmp/$user/hive.log.
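To follow it live (a sketch; $USER stands in for the literal $user above):
$ tail -f /tmp/$USER/hive.log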
Detailed error log:
2019-03-05 11:06:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
This happens because the Spark package was built with the Hive dependency; try a package built without Hive:
https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.4-without-hive.tgz
Running again, a Parquet version conflict is reported:
Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$MessageTypeBuilder.addFields([Lorg/apache/parquet/schema/Type;)Lorg/apache/parquet/schema/Types$BaseGroupBuilder;
The only remaining option is to build Spark from source, as sketched below.
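Before running make-distribution.sh, check out the matching source tag and give Maven enough memory (standard Spark build steps; the tag names below are the usual upstream ones):
$ git clone https://github.com/apache/spark.git
$ cd spark
$ git checkout v2.0.2        # or v2.4.0 for the 2.4 build
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"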
1) Spark 2.0-2.2
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
This produces spark-2.0.2-bin-hadoop2-without-hive.tgz.
2) Spark 2.3 and above
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
This produces spark-2.4.0-bin-hadoop2-without-hive.tgz.
Running again with spark-2.0.2-bin-hadoop2-without-hive.tgz, there is still an error:
2019-03-05T17:10:55,537 ERROR [901dc3cf-a990-4e8b-95ec-fcf6a9c9002c main] ql.Driver: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
Detailed error log:
2019-03-05T17:08:37,364 INFO [stderr-redir-1] client.SparkClientImpl: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
Jars are missing (the hadoop-provided build ships without the Hadoop jars); copy them in from a stock distribution such as spark-2.4.0-bin-hadoop2.6:
$ cd spark-2.0.2-bin-hadoop2-without-hive
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/hadoop-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/slf4j-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/log4j-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/guava-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/commons-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/protobuf-* jars/
$ cp ../spark-2.4.0-bin-hadoop2.6/jars/htrace-* jars/
This time it works; running the SQL produces:
Query ID = hadoop_20190305180847_e8b638c8-394c-496d-a43e-26a0a17f9e18
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Spark Job = d5fea72c-c67c-49ec-9f4c-650a795c74c3
Running with YARN Application = application_1551754784891_0008
Kill Command = $HADOOP_HOME/bin/yarn application -kill application_1551754784891_0008
Query Hive on Spark job[1] stages: [2, 3]
Status: Running (Hive on Spark job[1])
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-2 ........ 0 FINISHED 275 275 0 0 0
Stage-3 ........ 0 FINISHED 1009 1009 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 149.58 s
--------------------------------------------------------------------------------------
Status: Finished successfully in 149.58 seconds
OK
spark-2.4.0-bin-hadoop2-without-hive.tgz also works without problems.
References:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started