【原创】大叔经验分享(2)为什么hive在大表上加条件后执行limit很慢
问题重现
select id from big_table where name = 'sdlkfjalksdjfla' limit 100;
首先看执行计划:
hive> explain select * from big_table where name = 'sdlkfjalksdjfla' limit 100;
OK
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: 100
Processor Tree:
TableScan
alias: big_table
Statistics: Num rows: 7497189457 Data size: 1499437891589 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (name = 'sdlkfjalksdjfla') (type: boolean)
Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 100
Statistics: Num rows: 100 Data size: 20000 Basic stats: COMPLETE Column stats: NONE
ListSink
Time taken: 0.668 seconds, Fetched: 23 row(s)
可见只有一个stage,即Fetch Operator,再看执行过程:
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000006c1e00cd8> (a sun.nio.ch.Util$2)
- locked <0x00000006c1e00cc8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000006c1e00aa0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:186)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:146)
- locked <0x000000076b9bccb0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.BlockReaderUtil.readAll(BlockReaderUtil.java:32)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readAll(RemoteBlockReader2.java:363)
at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1072)
at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1000)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1333)
at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:239)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:858)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:829)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1021)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1057)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:231)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:206)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:488)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
可见并没有提交远程job而是在本地直接做table scan,如果是在一个大表上加复杂查询条件再做limit就会很慢,因为极有可能需要全表扫描之后才能收集到所需结果(limit条数),这也是为什么对大表不加条件直接limit反而很快的原因。
如果想修改这种行为,需要修改如下配置:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
这个配置会尝试将query转换为一个fetch任务;
默认为more,将其改为none再执行上边的sql,就会提交到yarn上执行
set hive.fetch.task.conversion=none;
相关的配置还有一个
hive.fetch.task.conversion.threshold
Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.
默认为1073741824 (1 GB)
【原创】大叔经验分享(2)为什么hive在大表上加条件后执行limit很慢的更多相关文章
- 【原创】经验分享:一个小小emoji尽然牵扯出来这么多东西?
前言 之前也分享过很多工作中踩坑的经验: 一个线上问题的思考:Eureka注册中心集群如何实现客户端请求负载及故障转移? [原创]经验分享:一个Content-Length引发的血案(almost.. ...
- Hive优化-大表join大表优化
Hive优化-大表join大表优化 5.大表join大表优化 如果Hive优化实战2中mapjoin中小表dim_seller很大呢?比如超过了1GB大小?这种就是大表join大表的问题.首先引入一个 ...
- 【原创】大叔经验分享(26)hive通过外部表读写elasticsearch数据
hive通过外部表读写elasticsearch数据,和读写hbase数据差不多,差别是需要下载elasticsearch-hadoop-hive-6.6.2.jar,然后使用其中的EsStorage ...
- 【原创】大叔经验分享(25)hive通过外部表读写hbase数据
在hive中创建外部表: CREATE EXTERNAL TABLE hive_hbase_table(key string, name string,desc string) STORED BY ' ...
- 【原创】大叔经验分享(34)hive中文注释乱码
在hive中查看表结构时中文注释乱码,分为两种情况,一种是desc $table,一种是show create table $table 1 数据库字符集 检查 mysql> show vari ...
- 价值100W的经验分享: 基于JSPatch的iOS应用线上Bug的即时修复方案,附源码.
限于iOS AppStore的审核机制,一些新的功能的添加或者bug的修复,想做些节日专属的活动等,几乎都是不太可能的.从已有的经验来看,也是有了一些比较常用的解决方案.本文先是会简单说明对比大部分方 ...
- 对现有Hive的大表进行动态分区
分区是在处理大型事实表时常用的方法.分区的好处在于缩小查询扫描范围,从而提高速度.分区分为两种:静态分区static partition和动态分区dynamic partition.静态分区和动态分区 ...
- hive两大表关联优化试验
呼叫结果(call_result)与销售历史(sale_history)的join优化: CALL_RESULT: 32亿条/444G SALE_HISTORY:17亿条/439G 原逻辑 Map: ...
- 【原创】大叔经验分享(24)hive metastore的几种部署方式
hive及其他组件(比如spark.impala等)都会依赖hive metastore,依赖的配置文件位于hive-site.xml hive metastore重要配置 hive.metastor ...
随机推荐
- js把变量转换成json数据
var a="";var MessageList=JSON.stringify(a);
- codeforces#1132 F. Clear the String(神奇的区间dp)
题意:给出一个字符串S,|S|<=500.每次操作可以删除一段连续的相同字母的子串.问,最少操作多少次可以把这个字符串变成空串. 分析:刚开始的思路是,把连续的串给删除掉,然后再....贪心.完 ...
- PHP中的DateTime类
DataTime类跟date(),strtotime(),gmdate()等函数有相同的作用,都是用来处理日期和时间的,但DateTime类更加直观.方便, 所以在PHP5.2.0以后推荐使用Date ...
- RPC框架原理简述:从实现一个简易RPCFramework说起(转)
摘要: 本文阐述了RPC框架与远程调用的产生背景,介绍了RPC的基本概念和使用背景,之后手动实现了简易的RPC框架并佐以实例进行演示,以便让各位看官对RPC有一个感性.清晰和完整的认识,最后讨论了RP ...
- 在线解析JSON+ AsyncTaskLoader
效果图: 获取并解析Json package com.example.admin.quakereport; import android.text.TextUtils;import android.u ...
- jsp篇 之 脚本元素
jsp的脚本元素 : 第一种:表达式 (类似输出语句) 表达式 形式:<%= %> 看源码发现[翻译]到java文件中的位置: [out.print(..)]里面的参数. 所以System ...
- [HDU5969] 最大的位或
题目类型:位运算 传送门:>Here< 题意:给出\(l\)和\(r\),求最大的\(x|y\),其中\(x,y\)在\([l,r]\)范围内 解题思路 首先让我想到了前面那题\(Bits ...
- 【XSY3154】入门多项式 高斯消元
题目大意 给你一个 \(n\times n\)的矩阵 \(A\),求次数最小且最高次项为 \(1\) 的多项式 \(F(x)\),满足 \(F(A)=0\). 所有操作都对 \(p\) 取模. \(n ...
- Lua语法基础(二)
1. 函数 1.1 函数声明 默认为全局 局部函数使用local关键字声明 1.2 参数 ...等同于Python中*args三个点表示可变参数 1.3 获取参数长度的两种方式 (1)将传入的参数.. ...
- Day041--CSS, 盒模型, 浮动
内容回顾 表单标签 input type text 普通的文本 password 密码 radio 单选 默认选中添加checked 互斥的效果 给radio标签添加 相同的name checkbo ...