1. Download the Spark 1.5.0 source code from the official website.

2. Build it following the official build instructions:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn
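For reference, a minimal sketch of the download step and of where the build output lands (the archive URL and the output tarball name are assumptions based on the standard Apache release layout and the --name flag above):

# fetch and unpack the Spark 1.5.0 source release (URL assumed from the Apache archive layout)
wget https://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0.tgz
tar -xzf spark-1.5.0.tgz && cd spark-1.5.0

# run the build/mvn and make-distribution.sh commands shown above from this directory;
# the distribution tarball should then appear in the source root, named after the --name value
ls spark-1.5.0-bin-custom-spark.tgz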

If you are using Scala 2.11, do the following instead:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

There is no need to run ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn again.

Then copy ./assembly/target/scala-2.11/spark-assembly-1.5.0-hadoop2.6.0.jar (about 137 MB) to $HIVE_HOME/lib. After starting Hive, run

set hive.execution.engine=spark;

to switch the execution engine to Spark.
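A minimal sketch of these two steps, assuming $HIVE_HOME is set and the Scala 2.11 build above succeeded; some_table is a hypothetical table used only to force a real Spark job:

# copy the assembly jar produced by the build into Hive's lib directory
cp ./assembly/target/scala-2.11/spark-assembly-1.5.0-hadoop2.6.0.jar "$HIVE_HOME/lib/"

# switch the session to the Spark engine and run a query that actually launches a job
hive -e "set hive.execution.engine=spark; select count(*) from some_table;"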

Problems encountered while debugging: be sure to tune YARN's memory settings, otherwise the job will not be able to obtain resources.

YARN: Diagnostic Messages for this Task:
Container [pid=7830,containerID=container_1397098636321_27548_01_000297] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.7 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1397098636321_27548_01_000297 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 7830 7816 7830 7830 (java) 2547 390 2924818432 539150 /export/servers/jdk1.6.0_25/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2224m -Djava.io.tmpdir=/data2/nm/local/usercache/admin/appcache/application_1397098636321_27548/container_1397098636321_27548_01_000297/tmp -Dlog4j.configuration=container-log4j.properties......

Check the job memory limit in yarn-site.xml:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>

Solutions:
1. Raise the yarn.scheduler.minimum-allocation-mb limit.
2. Set --hiveconf mapred.child.java.opts=-Xmx????m, where the heap size must be smaller than yarn.scheduler.minimum-allocation-mb (a sketch follows).
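A sketch of option 2, assuming the 2048 MB minimum allocation configured below; the 1536m heap is an arbitrary example value chosen to stay under that limit:

# example heap size; must stay below yarn.scheduler.minimum-allocation-mb
hive --hiveconf mapred.child.java.opts=-Xmx1536m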

If it is the virtual memory limit that is exceeded, as in the case below, adjust yarn.nodemanager.vmem-pmem-ratio.

The log shows no obvious ERROR, but contains entries like the following:

2012-05-16 13:08:20,876 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 18, cluster_timestamp: 1337134318909, }, attemptId: 1, }, id: 6, }, state: C_COMPLETE, diagnostics: "Container [pid=15641,containerID=container_1337134318909_0018_01_000006] is running beyond virtual memory limits. Current usage: 32.1mb of 1.0gb physical memory used; 6.2gb of 2.1gb virtual memory used. Killing container.\nDump of the process-tree for container_1337134318909_0018_01_000006 :\n\t|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE\n\t|- 15641 26354 15641 15641 (java) 36 2 6686339072 8207 /home/zhouchen.zm/jdk1.6.0_23/bin/java

Cause: this error comes from the way YARN computes the virtual memory limit. In the example above the application requested 1 GB of physical memory; YARN multiplies that by a ratio (2.1 by default) to obtain the allowed virtual memory, here 1 GB × 2.1 = 2.1 GB. When the virtual memory actually used by the application exceeds that computed limit (6.2 GB used here), the error above is reported. Adjusting the ratio fixes the problem; the parameter is yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml.

------ QIN XIAO YAN: reference yarn-site.xml ------

<!-- Site specific YARN configuration properties -->
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>qxy1</value>
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<description>List of directories to store localized files in. An
application's localized file directory will be found in:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
Individual containers' work directories, called container_${contid}, will
be subdirectories of this.
</description>
<name>yarn.nodemanager.local-dirs</name>
<value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
<property>
<description>Amount of physical memory, in MB, that can be allocated
for containers.</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<description>Ratio between virtual memory to physical memory when
setting memory limits for containers. Container allocations are
expressed in terms of physical memory, and virtual memory usage
is allowed to exceed this allocation by this ratio.
</description>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<description>Number of vcores that can be allocated
for containers. This is used by the RM scheduler when allocating
resources for containers. This is not used to limit the number of
physical cores used by YARN containers.</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property>
<property>
<description>The class to use as the resource scheduler.</description>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
<description>The minimum allocation for every container request at the RM,
in MBs. Memory requests lower than this will throw a
InvalidResourceRequestException.</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM,
in MBs. Memory requests higher than this will throw a
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<property>
<description>Path to file with nodes to include.</description>
<name>yarn.resourcemanager.nodes.include-path</name>
<value></value>
</property>
<property>
<description>
Where to store container logs. An application's localized log directory
will be found in ${yarn.nodemanager.log-dirs}/application_${appid}.
Individual containers' log directories will be below this, in directories
named container_{$contid}. Each container directory will contain the files
stderr, stdin, and syslog generated by that container.
</description>
<name>yarn.nodemanager.log-dirs</name>
<value>${yarn.log.dir}/userlogs</value>
</property>
<property>
<description>Time in seconds to retain user logs. Only applicable if
log aggregation is disabled
</description>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<description>The remote log dir will be created at
{yarn.nodemanager.remote-app-log-dir}/${user}/{thisParam}
</description>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
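After editing yarn-site.xml, the ResourceManager and NodeManagers must be restarted for the new limits to take effect; a minimal sketch, assuming a standard Hadoop 2.x layout with HADOOP_HOME set:

# restart YARN so the updated memory settings are picked up
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh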

Spark reports the following error at startup: Error: A JNI error has occurred, please check your installation and try again

Solution: set SPARK_DIST_CLASSPATH from the output of hadoop classpath, e.g. SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-2.7.2/bin/hadoop classpath). In spark-env.sh:

export SCALA_HOME=/opt/scala-2.11.8
export SPARK_MASTER_IP=192.168.233.159
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop-2.6.2/etc/hadoop
export JAVA_HOME=/opt/jdk1.8.0_77
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.2/bin/hadoop classpath)    # add this line
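A sketch of applying the fix, assuming a Spark install under /opt/spark-1.5.0 (the install path and the template-copy step are assumptions):

cd /opt/spark-1.5.0/conf
cp spark-env.sh.template spark-env.sh    # only if spark-env.sh does not exist yet
echo 'export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.2/bin/hadoop classpath)' >> spark-env.sh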
