hive on spark 编译时遇到的问题
1.官方网站下载spark 1.5.0的源码
2.根据官方编译即可。
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn
如你使用的版本是scala2.11 可以做以下操作
./dev/change-scala-version.sh 2.11 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
不用再执行 ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
然后将./assembly/target/scala-2.11/ spark-assembly-1.5.0-hadoop2.6.0.jar 大约137MB 将其拷贝到$HIVE_HOME/lib下 hive 启动后,
可以执行 set hive.execution.engine=spark; 即可
调试中遇到的问题: 一定要调YARN的内存,否则会获取不到资源
YARN: Diagnostic Messages for this Task: Container [pid=7830,containerID=container_1397098636321_27548_01_000297] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.7 GB of 4.2 GB virtual memory used. Killing container. Dump of the process-tree for container_1397098636321_27548_01_000297 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 7830 7816 7830 7830 (java) 2547 390 2924818432 539150 /export/servers/jdk1.6.0_25/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2224m -Djava.io.tmpdir=/data2/nm/local/usercache/admin/appcache/application_1397098636321_27548/container_1397098636321_27548_01_000297/tmp -Dlog4j.configuration=container-log4j.properties......
检查yarn-site-xml job内存限制 <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>2048</value> </property>
解决方法: 1.增加yarn.scheduler.minimum-allocation-mb内存上限。 2.--hiveconf mapred.child.java.opts=-Xmx????m 一定要小于yarn.scheduler.minimum-allocation-mb
如果是vm超了,如下:调整yarn.nodemanager.vmem-pmem-ratio
查看log没有明显的ERROR,但存在类似以下描述的日志 2012-05-16 13:08:20,876 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 18, cluster_timestamp: 1337134318909, }, attemptId: 1, }, id: 6, }, state: C_COMPLETE, diagnostics: "Container [pid=15641,containerID=container_1337134318909_0018_01_000006] is running beyond virtual memory limits. Current usage: 32.1mb of 1.0gb physical memory used; 6.2gb of 2.1gb virtual memory used. Killing container.\nDump of the process-tree for container_1337134318909_0018_01_000006 :\n\t|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE\n\t| - 15641 26354 15641 15641 (java) 36 2 6686339072 8207 /home/zhouchen.zm/jdk1.6.0_23/bin/java 原因: 该错误是YARN的虚拟内存计算方式导致,上例中用户程序申请的内存为1Gb,YARN根据此值乘以一个比例(默认为2.1)得出申请的虚拟内存的值, 当YARN计算的用户程序所需虚拟内存值大于计算出来的值时,就会报出以上错误。调节比例值可以解决该问题。具体参数为:yarn-site.xml中的yarn.nodemanager.vmem-pmem-ratio
------QIN XIAO YAN -------------- <!-- Site specific YARN configuration properties -->
<!-- Site specific YARN configuration properties -->
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>qxy1</value>
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<description>List of directories to store localized files in. An
application's localized file directory will be found in:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
Individual containers' work directories, called container_${contid}, will
be subdirectories of this.
</description>
<name>yarn.nodemanager.local-dirs</name>
<value>${hadoop.tmp.dir}/nm-local-dir</value>
</property> <property>
<description>Amount of physical memory, in MB, that can be allocated
for containers.</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property> <property>
<description>Ratio between virtual memory to physical memory when
setting memory limits for containers. Container allocations are
expressed in terms of physical memory, and virtual memory usage
is allowed to exceed this allocation by this ratio.
</description>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property> <property>
<description>Number of vcores that can be allocated
for containers. This is used by the RM scheduler when allocating
resources for containers. This is not used to limit the number of
physical cores used by YARN containers.</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property> <property>
<description>The class to use as the resource scheduler.</description>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property> <property>
<description>The minimum allocation for every container request at the RM,
in MBs. Memory requests lower than this will throw a
InvalidResourceRequestException.</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property> <property>
<description>The maximum allocation for every container request at the RM,
in MBs. Memory requests higher than this will throw a
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property> <property>
<description>Path to file with nodes to include.</description>
<name>yarn.resourcemanager.nodes.include-path</name>
<value></value>
</property> <property>
<description>
Where to store container logs. An application's localized log directory
will be found in ${yarn.nodemanager.log-dirs}/application_${appid}.
Individual containers' log directories will be below this, in directories
named container_{$contid}. Each container directory will contain the files
stderr, stdin, and syslog generated by that container.
</description>
<name>yarn.nodemanager.log-dirs</name>
<value>${yarn.log.dir}/userlogs</value>
</property> <property>
<description>Time in seconds to retain user logs. Only applicable if
log aggregation is disabled
</description>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
</property> <property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<description>The remote log dir will be created at
{yarn.nodemanager.remote-app-log-dir}/${user}/{thisParam}
</description>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
SRARK 启动时报如下错误: Error: A JNI error has occurred, please check your installation and try again
1. SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-2.7.2/bin/hadoop classpath)
2. 解决办法:
3. export SCALA_HOME=/opt/scala-2.11.8 4. export SPARK_MASTER_IP=192.168.233.159 5. export SPARK_WORKER_MEMORY=1g 6. export HADOOP_CONF_DIR=/opt/hadoop-2.6.2/etc/hadoop 7. export JAVA_HOME=/opt/jdk1.8.0_77 8. export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.2/bin/hadoop classpath) ##加这条
hive on spark 编译时遇到的问题的更多相关文章
- Hive扩展功能(七)--Hive On Spark
软件环境: linux系统: CentOS6.7 Hadoop版本: 2.6.5 zookeeper版本: 3.4.8 主机配置: 一共m1, m2, m3这五部机, 每部主机的用户名都为centos ...
- Hive On Spark环境搭建
Spark源码编译与环境搭建 Note that you must have a version of Spark which does not include the Hive jars; Spar ...
- Spark记录-源码编译spark2.2.0(结合Hive on Spark/Hive on MR2/Spark on Yarn)
#spark2.2.0源码编译 #组件:mvn-3.3.9 jdk-1.8 #wget http://mirror.bit.edu.cn/apache/spark/spark-2.2.0/spark- ...
- Spark入门实战系列--2.Spark编译与部署(下)--Spark编译安装
[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .编译Spark .时间不一样,SBT是白天编译,Maven是深夜进行的,获取依赖包速度不同 ...
- Spark编译及spark开发环境搭建
最近需要将生产环境的spark1.3版本升级到spark1.6(尽管spark2.0已经发布一段时间了,稳定可靠起见,还是选择了spark1.6),同时需要基于spark开发一些中间件,因此需要搭建一 ...
- 【原创】大数据基础之Hive(5)hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...
- hive使用spark引擎的几种情况
使用spark引擎查询hive有以下几种方式:1>使用spark-sql(spark sql cli)2>使用spark-thrift提交查询sql3>使用hive on spark ...
- Hive数据分析——Spark是一种基于rdd(弹性数据集)的内存分布式并行处理框架,比于Hadoop将大量的中间结果写入HDFS,Spark避免了中间结果的持久化
转自:http://blog.csdn.net/wh_springer/article/details/51842496 近十年来,随着Hadoop生态系统的不断完善,Hadoop早已成为大数据事实上 ...
- Hive On Spark保姆级攻略
声明: 此博客参考了官网的配置方式,并结合笔者在实践网上部分帖子时的踩坑经历整理而成 这里贴上官方配置说明: [官方]: https://cwiki.apache.org//confluence/di ...
随机推荐
- CVE-2019-0708远程桌面服务远程执行代码漏洞exp利用过程
CVE-2019-0708远程桌面服务远程执行代码漏洞 上边这洞是啥我就不多说了,描述类的自行百度. 受影响系统版本范围: Windows Server 2008 R2 Windows Server ...
- FJUT-1370 记录一次解题过程
题目在福工院的1370 首先看题目,好家伙,全英文 那么大致的题意就是.有几个城市同在一条线上(相当于在x轴上),max i是第i个城市到其他所有城市的距离中的最大值,min i也就是所有中最小的. ...
- ElasticSearch 倒排索引简析
内容概要 倒排索引是什么?为什么需要倒排索引? 倒排索引是怎么工作的? 1. 倒排索引是什么? 假设有一个交友网站,信息表如下: 美女1:"我要找在上海做 PHP 的哥哥." 需要 ...
- 3种基础的 REST 安全机制
安全是 RESTful web service 的基石,我们主要讨论以下3种主要的方法: Basic authentication Oauth 2.0 Oauth 2.0 + JWT 1. Basic ...
- 多级反向代理java获取真实IP地址
public static String getIpAddress(HttpServletRequest request){ String ip = request.getHeader("x ...
- javaweb-codereview 学习记录-3
Class类加载流程 实际上就是ClassLoader将会调用loadclass来尝试加载类,首先将会在jvm中尝试加载我们想要加载的类,如果jvm中没有的话,将调用自身的findclass,此时要是 ...
- 【java面试】算法篇之堆排序
一.堆的概念 堆是一棵顺序存储的完全二叉树.完全二叉树中所有非终端节点的值均不大于(或不小于)其左.右孩子节点的值. 其中每个节点的值小于等于其左.右孩子的值,这样的堆称为小根堆: 其中每个节点的值大 ...
- Linux起源
Linux起源 操作系统出现时间线: Unix1970年诞生 ,71年用C语言重写 Apple II 诞生于1976年 window诞生于1985年 Linux诞生于1991年,由大学生Linus T ...
- set集合迭代
1.迭代遍历 Set<String> set = new HashSet<String>(); Iterator<String> it = set.iterator ...
- Mixing .NET