hive on spark 编译时遇到的问题
1.官方网站下载spark 1.5.0的源码
2.根据官方编译即可。
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn
如你使用的版本是scala2.11 可以做以下操作
./dev/change-scala-version.sh 2.11 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
不用再执行 ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn
然后将./assembly/target/scala-2.11/ spark-assembly-1.5.0-hadoop2.6.0.jar 大约137MB 将其拷贝到$HIVE_HOME/lib下 hive 启动后,
可以执行 set hive.execution.engine=spark; 即可
调试中遇到的问题: 一定要调YARN的内存,否则会获取不到资源
YARN: Diagnostic Messages for this Task: Container [pid=7830,containerID=container_1397098636321_27548_01_000297] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.7 GB of 4.2 GB virtual memory used. Killing container. Dump of the process-tree for container_1397098636321_27548_01_000297 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 7830 7816 7830 7830 (java) 2547 390 2924818432 539150 /export/servers/jdk1.6.0_25/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2224m -Djava.io.tmpdir=/data2/nm/local/usercache/admin/appcache/application_1397098636321_27548/container_1397098636321_27548_01_000297/tmp -Dlog4j.configuration=container-log4j.properties......
检查yarn-site-xml job内存限制 <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>2048</value> </property>
解决方法: 1.增加yarn.scheduler.minimum-allocation-mb内存上限。 2.--hiveconf mapred.child.java.opts=-Xmx????m 一定要小于yarn.scheduler.minimum-allocation-mb
如果是vm超了,如下:调整yarn.nodemanager.vmem-pmem-ratio
查看log没有明显的ERROR,但存在类似以下描述的日志 2012-05-16 13:08:20,876 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 18, cluster_timestamp: 1337134318909, }, attemptId: 1, }, id: 6, }, state: C_COMPLETE, diagnostics: "Container [pid=15641,containerID=container_1337134318909_0018_01_000006] is running beyond virtual memory limits. Current usage: 32.1mb of 1.0gb physical memory used; 6.2gb of 2.1gb virtual memory used. Killing container.\nDump of the process-tree for container_1337134318909_0018_01_000006 :\n\t|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE\n\t| - 15641 26354 15641 15641 (java) 36 2 6686339072 8207 /home/zhouchen.zm/jdk1.6.0_23/bin/java 原因: 该错误是YARN的虚拟内存计算方式导致,上例中用户程序申请的内存为1Gb,YARN根据此值乘以一个比例(默认为2.1)得出申请的虚拟内存的值, 当YARN计算的用户程序所需虚拟内存值大于计算出来的值时,就会报出以上错误。调节比例值可以解决该问题。具体参数为:yarn-site.xml中的yarn.nodemanager.vmem-pmem-ratio
------QIN XIAO YAN -------------- <!-- Site specific YARN configuration properties -->
<!-- Site specific YARN configuration properties -->
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>qxy1</value>
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<description>List of directories to store localized files in. An
application's localized file directory will be found in:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
Individual containers' work directories, called container_${contid}, will
be subdirectories of this.
</description>
<name>yarn.nodemanager.local-dirs</name>
<value>${hadoop.tmp.dir}/nm-local-dir</value>
</property> <property>
<description>Amount of physical memory, in MB, that can be allocated
for containers.</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property> <property>
<description>Ratio between virtual memory to physical memory when
setting memory limits for containers. Container allocations are
expressed in terms of physical memory, and virtual memory usage
is allowed to exceed this allocation by this ratio.
</description>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property> <property>
<description>Number of vcores that can be allocated
for containers. This is used by the RM scheduler when allocating
resources for containers. This is not used to limit the number of
physical cores used by YARN containers.</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property> <property>
<description>The class to use as the resource scheduler.</description>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property> <property>
<description>The minimum allocation for every container request at the RM,
in MBs. Memory requests lower than this will throw a
InvalidResourceRequestException.</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property> <property>
<description>The maximum allocation for every container request at the RM,
in MBs. Memory requests higher than this will throw a
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property> <property>
<description>Path to file with nodes to include.</description>
<name>yarn.resourcemanager.nodes.include-path</name>
<value></value>
</property> <property>
<description>
Where to store container logs. An application's localized log directory
will be found in ${yarn.nodemanager.log-dirs}/application_${appid}.
Individual containers' log directories will be below this, in directories
named container_{$contid}. Each container directory will contain the files
stderr, stdin, and syslog generated by that container.
</description>
<name>yarn.nodemanager.log-dirs</name>
<value>${yarn.log.dir}/userlogs</value>
</property> <property>
<description>Time in seconds to retain user logs. Only applicable if
log aggregation is disabled
</description>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
</property> <property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<description>The remote log dir will be created at
{yarn.nodemanager.remote-app-log-dir}/${user}/{thisParam}
</description>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
SRARK 启动时报如下错误: Error: A JNI error has occurred, please check your installation and try again
1. SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-2.7.2/bin/hadoop classpath)
2. 解决办法:
3. export SCALA_HOME=/opt/scala-2.11.8 4. export SPARK_MASTER_IP=192.168.233.159 5. export SPARK_WORKER_MEMORY=1g 6. export HADOOP_CONF_DIR=/opt/hadoop-2.6.2/etc/hadoop 7. export JAVA_HOME=/opt/jdk1.8.0_77 8. export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.2/bin/hadoop classpath) ##加这条
hive on spark 编译时遇到的问题的更多相关文章
- Hive扩展功能(七)--Hive On Spark
软件环境: linux系统: CentOS6.7 Hadoop版本: 2.6.5 zookeeper版本: 3.4.8 主机配置: 一共m1, m2, m3这五部机, 每部主机的用户名都为centos ...
- Hive On Spark环境搭建
Spark源码编译与环境搭建 Note that you must have a version of Spark which does not include the Hive jars; Spar ...
- Spark记录-源码编译spark2.2.0(结合Hive on Spark/Hive on MR2/Spark on Yarn)
#spark2.2.0源码编译 #组件:mvn-3.3.9 jdk-1.8 #wget http://mirror.bit.edu.cn/apache/spark/spark-2.2.0/spark- ...
- Spark入门实战系列--2.Spark编译与部署(下)--Spark编译安装
[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .编译Spark .时间不一样,SBT是白天编译,Maven是深夜进行的,获取依赖包速度不同 ...
- Spark编译及spark开发环境搭建
最近需要将生产环境的spark1.3版本升级到spark1.6(尽管spark2.0已经发布一段时间了,稳定可靠起见,还是选择了spark1.6),同时需要基于spark开发一些中间件,因此需要搭建一 ...
- 【原创】大数据基础之Hive(5)hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...
- hive使用spark引擎的几种情况
使用spark引擎查询hive有以下几种方式:1>使用spark-sql(spark sql cli)2>使用spark-thrift提交查询sql3>使用hive on spark ...
- Hive数据分析——Spark是一种基于rdd(弹性数据集)的内存分布式并行处理框架,比于Hadoop将大量的中间结果写入HDFS,Spark避免了中间结果的持久化
转自:http://blog.csdn.net/wh_springer/article/details/51842496 近十年来,随着Hadoop生态系统的不断完善,Hadoop早已成为大数据事实上 ...
- Hive On Spark保姆级攻略
声明: 此博客参考了官网的配置方式,并结合笔者在实践网上部分帖子时的踩坑经历整理而成 这里贴上官方配置说明: [官方]: https://cwiki.apache.org//confluence/di ...
随机推荐
- [AI开发]小型数据集解决实际工程问题——交通拥堵、交通事故实时告警
这篇文章其实主要是想介绍在深度学习过程中如何使用小型数据集,这种数据集样本数量一般在1000以下,有时候甚至只有几百.一般提到神经网络,大家都会说数据量越丰富,准确性越高,但是实际工作中,可能收集不了 ...
- 快速回顾MySQL:汇总和分组
10.3 汇总数据 我们经常需要汇总数据而不用把它们实际检索处出来,为此MySQL提供了专门的函数.使用这些函数,MySQL查询可用于检索数据,以便分析和报表的生成.这种类型的检索例子有以下几种: 确 ...
- 常用crud
增:@Insert("insert into t_user (`last_name`, `sex`) values(#{lastName}, #{sex})") 删:@Del ...
- 防止过拟合的方法 预测鸾凤花(sklearn)
1. 防止过拟合的方法有哪些? 过拟合(overfitting)是指在模型参数拟合过程中的问题,由于训练数据包含抽样误差,训练时,复杂的模型将抽样误差也考虑在内,将抽样误差也进行了很好的拟合. 产生过 ...
- leetcode 最大水池
leetcode 11题 水池最大容积 题目描述 给定 n 个非负整数 a1,a2,...,an,每个数代表坐标中的一个点 (i, ai) .在坐标内画 n 条垂直线,垂直线 i 的两个端点分别为 ( ...
- Scrum.自用SCRUM项目开发流程
目前团队所使用的开发模式流程,请大家指正优化. 1.PM需求整理:在禅道整理出需求清单和原型图(标注清楚):需求评审.往复,直到需求定稿. 2.SprintMaster根据需求清单进行技术任务拆解,包 ...
- CodeSign error: no provisioning profile at path '/Users/zhht-2015/Library/MobileDevice/Provisioning Profiles/79693141-f98b-4ac4-8bb4-476c9475f265.mobileprovision'
解决方法: 1.关闭Xcode,找到项目中的**.xcodeproj文件,点击右键,show package contents(打开包内容). 2.打开后找到project.pbxproj文件,用文本 ...
- SqlServer分页存储过程(多表查询,多条件排序),Repeater控件呈现数据以及分页
存储过程(Stored Procedure)是在大型数据库系统中,一组为了完成特定功能的SQL 语句集,存储在数据库中,经过第一次编译后再次调用不需要再次编译,用户通过指定存储过程的名字并给出 ...
- 传递额外的值 Passing Extra Values |在视图中生成输出URL | 高级路由特性 | 精通ASP-NET-MVC-5-弗瑞曼
结果呢 <a href="/App/DoCustomVariable?id=Hello">This is an outgoing URL</a> 理解片段变 ...
- MATLAB2017 下载及安装教程
全文借鉴于软件安装管家 链接: https://pan.baidu.com/s/1-X1Hg35bDG1G1AX4MnHxnA 提取码: ri88 复制这段内容后打开百度网盘手机App,操作更方便哦 ...