[Original] Uncle's Experience Sharing (2): Why Is LIMIT Slow on a Large Hive Table Once a Filter Condition Is Added?
Reproducing the problem
select id from big_table where name = 'sdlkfjalksdjfla' limit 100;
First, look at the execution plan:
hive> explain select id from big_table where name = 'sdlkfjalksdjfla' limit 100;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 100
      Processor Tree:
        TableScan
          alias: big_table
          Statistics: Num rows: 7497189457 Data size: 1499437891589 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (name = 'sdlkfjalksdjfla') (type: boolean)
            Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string)
              outputColumnNames: _col0
              Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE
              Limit
                Number of rows: 100
                Statistics: Num rows: 100 Data size: 20000 Basic stats: COMPLETE Column stats: NONE
                ListSink
Time taken: 0.668 seconds, Fetched: 23 row(s)
There is only one stage, a Fetch Operator, i.e. no distributed job at all. Now look at what the process is actually doing while the query runs:
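The thread dump below comes from the Hive CLI JVM itself. A dump like this can be captured with jstack; the commands below are a sketch, and the pid placeholder stands for whatever jps reports for the CliDriver process:

# locate the Hive CLI JVM, then dump its threads
jps -lm | grep CliDriver
jstack <pid>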
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000006c1e00cd8> (a sun.nio.ch.Util$2)
- locked <0x00000006c1e00cc8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000006c1e00aa0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:186)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:146)
- locked <0x000000076b9bccb0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.BlockReaderUtil.readAll(BlockReaderUtil.java:32)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readAll(RemoteBlockReader2.java:363)
at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1072)
at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1000)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1333)
at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:239)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:858)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:829)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1021)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1057)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:231)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:206)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:488)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
As the stack shows, no remote job is submitted; the table scan runs locally inside the Hive client. On a large table, adding a selective filter condition before the LIMIT can therefore be very slow: the local fetch task may well have to scan the entire table before it accumulates the required number of rows (the LIMIT count). This is also why a bare LIMIT on a large table, with no condition, comes back quickly: the first rows read already satisfy it.
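The contrast is easy to see with the two queries side by side (same big_table as above):

-- fast: with no filter, the local fetch task can stop after reading the first 100 rows
select id from big_table limit 100;

-- slow: the local fetch task keeps scanning until 100 rows match the highly
-- selective predicate, which may mean reading (almost) the whole table
select id from big_table where name = 'sdlkfjalksdjfla' limit 100;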
To change this behavior, adjust the following configuration property:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
This property tells Hive to try converting a query into a single fetch task.
The default is more; change it to none and re-run the SQL above, and the query is submitted to YARN as a regular distributed job:
set hive.fetch.task.conversion=none;
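The set command only affects the current session. To change the default for every session, the same property can be placed in hive-site.xml; a minimal sketch, to be adapted to your deployment:

<property>
  <name>hive.fetch.task.conversion</name>
  <value>none</value>
</property>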
There is one more related property:
hive.fetch.task.conversion.threshold
Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.
The default is 1073741824 (1 GB).
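A middle ground, rather than disabling conversion entirely, is to keep more but lower the threshold so that only small inputs are fetched locally; larger inputs then fall back to a distributed job automatically. The 128 MB value below is an illustrative choice, not a recommendation:

-- keep fetch-task conversion, but only for inputs up to 128 MB
set hive.fetch.task.conversion=more;
set hive.fetch.task.conversion.threshold=134217728;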