Hive on Spark Environment Setup
Building Spark from Source and Setting Up the Environment
Note that you must have a version of Spark which does not include the Hive jars.
Building Spark:
git clone https://github.com/apache/spark.git spark_src
cd spark_src
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
./make-distribution.sh --name "spark-without-hive" --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -Pyarn -DskipTests package
Spark setup: see the chapter on setting up the Spark environment.
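As a quick sanity check after the build (a sketch; the tarball name follows from the version and --name used above, and the target directory matches the install path used later in this walkthrough):
# unpack the distribution produced by make-distribution.sh
tar -zxvf spark-1.3.0-bin-spark-without-hive.tgz -C /home/spark/app/
# confirm that no Hive classes were bundled into the Spark assembly; this should print nothing
jar tf /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar | grep "org/apache/hadoop/hive" | head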
Building Hive from Source and Setting Up the Environment
Building Hive:
git clone https://github.com/apache/hive.git hive_on_spark
cd hive_on_spark
git checkout spark
mvn clean install -Phadoop-2,dist -DskipTests
After the build completes, the Hive distribution tarball is located under the source tree at packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz
Note that spark.version in pom.xml must match the version of Spark you built:
<spark.version>1.3.0</spark.version>
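To confirm the value quickly (run from the hive_on_spark source root; a simple check, not required by the build):
grep -n "<spark.version>" pom.xml
# if the value differs from the Spark you built, edit pom.xml so the two match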
Hive installation: see the chapter on setting up the Hive environment.
In this walkthrough, Spark and Hive are installed at the following paths:
Spark installation directory: /home/spark/app/spark-1.3.0-bin-spark-without-hive
Hive installation directory: /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin
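For convenience, the rest of this walkthrough assumes these two directories are exported as environment variables (a sketch; adjust the paths to your own layout):
export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive
export HIVE_HOME=/home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin
export PATH=$HIVE_HOME/bin:$SPARK_HOME/bin:$PATH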
Ways to make the Spark dependency available to Hive
Option 1: Set the property 'spark.home' to point to the Spark installation:
hive> set spark.home=/home/spark/app/spark-1.3.0-bin-spark-without-hive;
Option 2: Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:
export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive
Option 3: Put the spark-assembly jar on the Hive auxpath:
hive --auxpath /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar
Option 4: Add the spark-assembly jar for the current user session:
hive> add jar /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar;
Option 5: Link the spark-assembly jar into $HIVE_HOME/lib.
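For Option 5, a symlink along these lines works (using the install paths from this walkthrough; the exact assembly jar name depends on your Spark build):
ln -s /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin/lib/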
An error you may encounter when starting Hive:
[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:)
at jline.TerminalFactory.get(TerminalFactory.java:)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:)
at java.lang.reflect.Method.invoke(Method.java:)
at org.apache.hadoop.util.RunJar.main(RunJar.java:)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
Fix: export HADOOP_USER_CLASSPATH_FIRST=true
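An alternative workaround, assuming the usual cause is Hadoop's bundled jline 0.9.x shadowing Hive's jline 2.x (the exact paths and jar versions below are assumptions and may differ in your Hadoop distribution):
# move Hadoop's old jline out of the way and copy in the jline shipped with Hive
mv $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar /tmp/
cp $HIVE_HOME/lib/jline-*.jar $HADOOP_HOME/share/hadoop/yarn/lib/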
For fixes to errors in other scenarios, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
Another pitfall: the spark.eventLog.dir parameter must be set, for example:
set spark.eventLog.dir=hdfs://hadoop000:/directory
Otherwise every query fails with an error complaining that a directory such as /tmp/spark-event does not exist.
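Also make sure the directory actually exists on HDFS before running queries (using the path from the example above):
hdfs dfs -mkdir -p /directory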
After starting Hive, set the execution engine to Spark:
hive> set hive.execution.engine=spark;
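If you want Spark to be the default execution engine rather than setting it in every session, the same property can go into hive-site.xml (a minimal sketch):
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>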
Set Spark's run mode:
hive> set spark.master=spark://hadoop000:7077;
or, to run on YARN: spark.master=yarn
Configure Spark application configs for Hive
These can be configured in spark-defaults.conf or in hive-site.xml:
spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=... #Amount of memory to use per executor process.
spark.executor.cores=... #Number of cores per executor.
spark.yarn.executor.memoryOverhead=... #Extra off-heap memory (in MB) to request per executor on YARN.
spark.executor.instances=... #The number of executors assigned to each application.
spark.driver.memory=... #The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=... #Extra off-heap memory (in MB) to request for the driver on YARN.
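When these settings go into hive-site.xml instead of spark-defaults.conf, each one becomes a <property> element, for example (a sketch; the values are just the examples listed above):
<property>
  <name>spark.master</name>
  <value>spark://hadoop000:7077</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>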
For the full list of configuration parameters, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
After executing a SQL statement, you can inspect the jobs/stages and other details on the Spark monitoring web UI.
hive (default)> select city_id, count(*) c from page_views group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = ...
Launching Job ... out of ...
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
...
Query Hive on Spark job[...] stages: Status: Running (Hive on Spark job[...])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
... Stage-0_0: .../... Stage-1_0: .../... Stage-2_0: .../...
... Stage-0_0: .../... Finished Stage-1_0: .../... Stage-2_0: .../...
state = SUCCEEDED
... Stage-0_0: .../... Finished Stage-1_0: .../... Finished Stage-2_0: .../... Finished
Status: Finished successfully in 10.07 seconds
OK
city_id c
...
Time taken: 18.417 seconds, Fetched: ... row(s)
