Building Spark from Source and Setting Up the Environment

Note that you must have a version of Spark that does not include the Hive jars.

Building Spark:

git clone https://github.com/apache/spark.git spark_src
cd spark_src
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
./make-distribution.sh --name "spark-without-hive" --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -Pyarn -DskipTests package
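One way to sanity-check that the resulting build really is free of Hive classes is to scan the listing of the assembly jar. The following is only a minimal sketch: the function name has_hive_classes is hypothetical, and it assumes that bundled Hive classes would show up under org/apache/hive/ or org/apache/hadoop/hive/.

```shell
# has_hive_classes: reads a jar listing (one entry per line) on stdin and
# succeeds if any entry lives under org/apache/hive/ or org/apache/hadoop/hive/.
# (Hypothetical helper; the package prefixes are an assumption.)
has_hive_classes() {
  grep -Eq '(^|[[:space:]])org/apache/(hadoop/)?hive/'
}

# Possible use against a real jar (the jar path is an assumption):
#   unzip -Z1 lib/spark-assembly-<version>.jar | has_hive_classes \
#     && echo "Hive classes found, rebuild without the Hive profile"
```

The helper only inspects text, so it can be fed by any tool that prints jar entries, for example `jar tf`.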

Spark setup: see the Spark environment setup chapter.

Building Hive from Source and Setting Up the Environment

Building Hive:

git clone https://github.com/apache/hive.git hive_on_spark
cd hive_on_spark
git checkout spark
mvn clean install -Phadoop-2,dist -DskipTests

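After the build completes, the Hive package is located at (relative to the Hive source root): packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz

Note that spark.version in pom.xml must match the version of Spark you built: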

<spark.version>1.3.0</spark.version>
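The version check can be scripted before starting the Maven build. This is a rough sketch rather than part of the original procedure: the function names pom_spark_version and check_spark_version are hypothetical, and it assumes the property sits on a single line of pom.xml, as shown above.

```shell
# pom_spark_version POM: print the value of <spark.version> found in POM.
# Assumes the opening and closing tags are on the same line.
pom_spark_version() {
  sed -n 's:.*<spark\.version>\(.*\)</spark\.version>.*:\1:p' "$1" | head -n 1
}

# check_spark_version POM EXPECTED: succeed only if the versions match.
check_spark_version() {
  actual=$(pom_spark_version "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK: spark.version=$actual"
  else
    echo "MISMATCH: pom.xml has '$actual', expected '$2'" >&2
    return 1
  fi
}

# Possible use from the Hive source root:
#   check_spark_version pom.xml 1.3.0 && mvn clean install ...
```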

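Hive installation: see the Hive environment setup chapter.

In this example, Spark and Hive are installed at the following paths:

Spark installation directory: /home/spark/app/spark-1.3.0-bin-spark-without-hive

Hive installation directory: /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin

Ways to add the Spark dependency to Hive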

Option 1: Set the property 'spark.home' to point to the Spark installation:

hive> set spark.home=/home/spark/app/spark-1.3.0-bin-spark-without-hive;

Option 2: Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive

Option 3: Set the spark-assembly jar on the Hive auxpath:

hive --auxpath /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar

Option 4: Add the spark-assembly jar for the current user session:

hive> add jar /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar;

Option 5: Link the spark-assembly jar to $HIVE_HOME/lib.
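The last option, linking the jar, is the only one listed without a command. A minimal sketch of it follows, assuming the assembly jar sits under lib/ of the Spark installation as in the paths above. The function name link_spark_assembly is hypothetical.

```shell
# link_spark_assembly SPARK_HOME HIVE_HOME
# Symlinks every spark-assembly jar from SPARK_HOME/lib into HIVE_HOME/lib.
# Pass absolute paths so that the symlinks resolve from any directory.
link_spark_assembly() {
  spark_home=$1
  hive_home=$2
  for jar in "$spark_home"/lib/spark-assembly-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and does not exist.
    [ -e "$jar" ] || { echo "no spark-assembly jar under $spark_home/lib" >&2; return 1; }
    ln -sf "$jar" "$hive_home/lib/"
  done
}

# Possible use, with the installation paths from this example:
#   link_spark_assembly /home/spark/app/spark-1.3.0-bin-spark-without-hive \
#                       /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin
```

A symlink rather than a copy keeps Hive pointing at the same jar if Spark is rebuilt in place.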

An error you may hit when starting Hive:

[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
at jline.TerminalFactory.create(TerminalFactory.java:)
at jline.TerminalFactory.get(TerminalFactory.java:)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:)
at jline.console.ConsoleReader.<init>(ConsoleReader.java:)
at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:)
at java.lang.reflect.Method.invoke(Method.java:)
at org.apache.hadoop.util.RunJar.main(RunJar.java:)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Fix: export HADOOP_USER_CLASSPATH_FIRST=true
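This error is commonly reported when an older jline jar on Hadoop's classpath is loaded ahead of the newer one that Hive ships, which is why putting the user classpath first helps. To see which jline jars are present on a machine, something like the following sketch can be used. The function name find_jline_jars is hypothetical, and HADOOP_HOME and HIVE_HOME are assumed to point at the two installations.

```shell
# find_jline_jars DIR: list every jline jar found anywhere below DIR, sorted.
find_jline_jars() {
  find "$1" -name 'jline-*.jar' 2>/dev/null | sort
}

# Possible use (both variables are assumptions about the local setup):
#   find_jline_jars "$HADOOP_HOME"
#   find_jline_jars "$HIVE_HOME/lib"
```

If the two listings show different jline versions, the classpath ordering set by HADOOP_USER_CLASSPATH_FIRST decides which one wins.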

For fixes to errors in other scenarios, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

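One more pitfall: the spark.eventLog.dir parameter must be set, for example:

set spark.eventLog.dir=hdfs://hadoop000:/directory

Otherwise queries keep failing with an error saying that a directory such as /tmp/spark-event does not exist. This one took a long time to track down.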

After starting Hive, set the execution engine to Spark:

hive> set hive.execution.engine=spark;

Set the Spark run mode:

hive> set spark.master=spark://hadoop000:7077;

Or, to run on YARN: spark.master=yarn
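Rather than typing these settings in every session, they could be collected into an initialization file and passed to the Hive CLI with its -i option. A minimal sketch, in which the function name write_hive_init and the file name are hypothetical and the values are placeholders to be replaced with your own:

```shell
# write_hive_init FILE MASTER_URL EVENT_LOG_DIR
# Writes the three session settings discussed above into FILE.
write_hive_init() {
  cat > "$1" <<EOF
set hive.execution.engine=spark;
set spark.master=$2;
set spark.eventLog.dir=$3;
EOF
}

# Possible use (file name and event log directory are assumptions):
#   write_hive_init hive-on-spark.hql spark://hadoop000:7077 <event log dir>
#   hive -i hive-on-spark.hql
```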

Configure Spark-application configs for Hive

These can be configured in either spark-defaults.conf or hive-site.xml:

spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=... #Amount of memory to use per executor process.
spark.executor.cores=... #Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=... #The number of executors assigned to each application.
spark.driver.memory=... #The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=... #We recommend (MB).

For details on these parameters, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
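For the spark-defaults.conf route, the file takes one whitespace-separated key and value per line. The sketch below writes the fixed values from the list above and leaves the master URL and event log directory as arguments. The function name write_spark_defaults is hypothetical, and the memory and core settings marked "..." above are deliberately omitted because their values depend on the cluster.

```shell
# write_spark_defaults FILE MASTER_URL EVENT_LOG_DIR
# Writes a small spark-defaults.conf with the settings listed above.
write_spark_defaults() {
  cat > "$1" <<EOF
spark.master                     $2
spark.eventLog.enabled           true
spark.eventLog.dir               $3
spark.executor.memory            512m
spark.serializer                 org.apache.spark.serializer.KryoSerializer
EOF
}

# Possible use (the target path assumes SPARK_HOME is set):
#   write_spark_defaults "$SPARK_HOME/conf/spark-defaults.conf" \
#     spark://hadoop000:7077 <event log dir>
```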

After running a SQL statement, you can inspect the jobs, stages, and related information on the Spark monitoring page:

hive (default)> select city_id, count(*) c from page_views group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs =
Launching Job out of
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[] stages: Status: Running (Hive on Spark job[])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
-- ::, Stage-0_0: (+)/ Stage-1_0: / Stage-2_0: /
state = STARTED
state = STARTED
state = STARTED
-- ::, Stage-0_0: (+)/ Stage-1_0: / Stage-2_0: /
state = STARTED
state = STARTED
-- ::, Stage-0_0: / Finished Stage-1_0: (+)/ Stage-2_0: /
state = SUCCEEDED
-- ::, Stage-0_0: / Finished Stage-1_0: / Finished Stage-2_0: / Finished
Status: Finished successfully in 10.07 seconds
OK
city_id c
-
-
-
-
Time taken: 18.417 seconds, Fetched: row(s)
