【原创】大叔经验分享(2)为什么hive在大表上加条件后执行limit很慢
问题重现
select id from big_table where name = 'sdlkfjalksdjfla' limit 100;
首先看执行计划:
hive> explain select * from big_table where name = 'sdlkfjalksdjfla' limit 100;
OK
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: 100
Processor Tree:
TableScan
alias: big_table
Statistics: Num rows: 7497189457 Data size: 1499437891589 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (name = 'sdlkfjalksdjfla') (type: boolean)
Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 100
Statistics: Num rows: 100 Data size: 20000 Basic stats: COMPLETE Column stats: NONE
ListSink
Time taken: 0.668 seconds, Fetched: 23 row(s)
可见只有一个stage,即Fetch Operator,再看执行过程:
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000006c1e00cd8> (a sun.nio.ch.Util$2)
- locked <0x00000006c1e00cc8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000006c1e00aa0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:186)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:146)
- locked <0x000000076b9bccb0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.BlockReaderUtil.readAll(BlockReaderUtil.java:32)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readAll(RemoteBlockReader2.java:363)
at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1072)
at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1000)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1333)
at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:239)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:858)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:829)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1021)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1057)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:231)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:206)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:488)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
可见并没有提交远程job而是在本地直接做table scan,如果是在一个大表上加复杂查询条件再做limit就会很慢,因为极有可能需要全表扫描之后才能收集到所需结果(limit条数),这也是为什么对大表不加条件直接limit反而很快的原因。
如果想修改这种行为,需要修改如下配置:
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.
Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
这个配置会尝试将query转换为一个fetch任务;
默认为more,将其改为none再执行上边的sql,就会提交到yarn上执行
set hive.fetch.task.conversion=none;
相关的配置还有一个
hive.fetch.task.conversion.threshold
Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it's not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.
默认为1073741824 (1 GB)
【原创】大叔经验分享(2)为什么hive在大表上加条件后执行limit很慢的更多相关文章
- 【原创】经验分享:一个小小emoji尽然牵扯出来这么多东西?
前言 之前也分享过很多工作中踩坑的经验: 一个线上问题的思考:Eureka注册中心集群如何实现客户端请求负载及故障转移? [原创]经验分享:一个Content-Length引发的血案(almost.. ...
- Hive优化-大表join大表优化
Hive优化-大表join大表优化 5.大表join大表优化 如果Hive优化实战2中mapjoin中小表dim_seller很大呢?比如超过了1GB大小?这种就是大表join大表的问题.首先引入一个 ...
- 【原创】大叔经验分享(26)hive通过外部表读写elasticsearch数据
hive通过外部表读写elasticsearch数据,和读写hbase数据差不多,差别是需要下载elasticsearch-hadoop-hive-6.6.2.jar,然后使用其中的EsStorage ...
- 【原创】大叔经验分享(25)hive通过外部表读写hbase数据
在hive中创建外部表: CREATE EXTERNAL TABLE hive_hbase_table(key string, name string,desc string) STORED BY ' ...
- 【原创】大叔经验分享(34)hive中文注释乱码
在hive中查看表结构时中文注释乱码,分为两种情况,一种是desc $table,一种是show create table $table 1 数据库字符集 检查 mysql> show vari ...
- 价值100W的经验分享: 基于JSPatch的iOS应用线上Bug的即时修复方案,附源码.
限于iOS AppStore的审核机制,一些新的功能的添加或者bug的修复,想做些节日专属的活动等,几乎都是不太可能的.从已有的经验来看,也是有了一些比较常用的解决方案.本文先是会简单说明对比大部分方 ...
- 对现有Hive的大表进行动态分区
分区是在处理大型事实表时常用的方法.分区的好处在于缩小查询扫描范围,从而提高速度.分区分为两种:静态分区static partition和动态分区dynamic partition.静态分区和动态分区 ...
- hive两大表关联优化试验
呼叫结果(call_result)与销售历史(sale_history)的join优化: CALL_RESULT: 32亿条/444G SALE_HISTORY:17亿条/439G 原逻辑 Map: ...
- 【原创】大叔经验分享(24)hive metastore的几种部署方式
hive及其他组件(比如spark.impala等)都会依赖hive metastore,依赖的配置文件位于hive-site.xml hive metastore重要配置 hive.metastor ...
随机推荐
- react的jsx语法
在webpack.config.js中配置解析的loader { test:/\.jsx?$/, use:{ loader:"babel-loader", options:{ pr ...
- C#直接使用DllImport调用C/C++动态库(dll文件)
1.C/C++动态库的编写 下面是我编写的一个比较简单的C++dll文件用来测试,关于如何编写dll文件,我这里便不再赘述,不懂得自行查询, (1).h文件 #ifdef MYDLL_EXPORTS ...
- vue-cli的跨域配置(自己总结)
- Python——Django-__init__.py的内容
一.告诉Django用pymysql来代替默认的MySQLdb(在__init__.py里) import pymysql #告诉Django用pymysql来代替默认的MySQLdb pymysql ...
- java从Swagger Api接口获取数据工具类
- DRF初识与序列化
一.Django的序列化方法 1.为什么要用序列化组件 做前后端分离的项目,我们前后端数据交互一般都选择JSON,JSON是一个轻量级的数据交互格式.那么我们给前端数据的时候都要转成json格式,那就 ...
- webpack学习记录 - 学习webpack-dev-server(三)
怎么用最简单的方式搭建一个服务器? 首先安装插件 npm i --save-dev webpack-dev-server 然后修改 packet.json 文件 "scripts" ...
- webpack学习记录-认识loader(二)
Loader 就像是一个翻译员,能把源文件经过转化后输出新的结果,并且一个文件还可以链式的经过多个翻译员翻译. loader参考文章:https://webpack.docschina.org/loa ...
- Vue笔记:使用 axios 中 this 指向问题
问题背景 在vue中使用axios做网络请求的时候,会遇到this不指向vue,而为undefined. 如下图所示,我们有一个 login 方法,希望在登录成功之后路由到主页,但通过 this.$r ...
- Linux 配置vim编辑器
最终效果 步骤1.下载NERDTree插件安装包(vim目录插件) https://www.vim.org/scripts/script.php?script_id=1658 步骤2.在家目录创建 . ...