5 Ways to Make Your Hive Queries Run Faster

Technique #1: Use Tez

Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. I won't go into detail about the many benefits of using Tez; instead, I want to make a simple recommendation: if it's not turned on by default in your environment, enable Tez with the following setting at the beginning of your Hive query:

    set hive.execution.engine=tez;

With the above setting, every Hive query you execute will take advantage of Tez.

Technique #2: Use ORCFile

Hive supports ORCFile, a columnar table storage format that delivers substantial speed improvements through techniques like predicate push-down, compression, and more. Using ORCFile for every Hive table should really be a no-brainer, and it is extremely beneficial for getting fast response times from your Hive queries.

As an example, consider two large tables A and B (stored as text files, with only some of their columns specified here), and a simple join:

    SELECT A.customerID, A.name, A.age, A.address,
           B.role, B.department, B.salary
    FROM A JOIN B
    ON A.customerID = B.customerID;

This query may take a long time to execute, since tables A and B are both stored as text. Converting them to ORCFile format will usually reduce query time significantly:

    CREATE TABLE A_ORC (
      customerID int, name string, age int, address string
    ) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

    INSERT INTO TABLE A_ORC SELECT * FROM A;

    CREATE TABLE B_ORC (
      customerID int, role string, salary float, department string
    ) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

    INSERT INTO TABLE B_ORC SELECT * FROM B;

    SELECT A_ORC.customerID, A_ORC.name,
           A_ORC.age, A_ORC.address,
           B_ORC.role, B_ORC.department, B_ORC.salary
    FROM A_ORC JOIN B_ORC
    ON A_ORC.customerID = B_ORC.customerID;

ORC supports compressed storage (with ZLIB or, as shown above, SNAPPY) as well as uncompressed storage.
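To see why ORC's columnar layout and predicate push-down pay off, here is a toy model in plain Python. It is an illustration of the idea, not of ORC's actual file format: the stripe size, the statistics layout, and the sample data are all simplifications I am assuming for this sketch. Data is stored column-by-column in stripes, each stripe carries min/max statistics, and a filter can skip whole stripes whose statistics rule them out, while a query that touches two columns never reads the other two.

```python
# Toy model of a columnar store with per-stripe min/max statistics.
# Illustrates the idea behind ORC; it is NOT ORC's real on-disk format.

STRIPE_ROWS = 3  # real ORC stripes hold many thousands of rows

rows = [  # (customerID, name, age, address)
    (1, "ann", 25, "x"), (2, "bob", 31, "y"), (3, "cam", 28, "z"),
    (4, "dee", 55, "x"), (5, "eli", 61, "y"), (6, "fay", 47, "z"),
]

# "Write" the table column-by-column, in stripes, recording min/max per stripe.
columns = ["customerID", "name", "age", "address"]
stripes = []
for start in range(0, len(rows), STRIPE_ROWS):
    chunk = rows[start:start + STRIPE_ROWS]
    stripe = {col: [r[i] for r in chunk] for i, col in enumerate(columns)}
    stripe["_stats"] = {"age": (min(stripe["age"]), max(stripe["age"]))}
    stripes.append(stripe)

# "Query": SELECT name FROM t WHERE age > 50.
# Predicate push-down: skip any stripe whose age range cannot match, and
# read only the 'age' and 'name' columns of the stripes we do scan.
result, stripes_read = [], 0
for stripe in stripes:
    lo, hi = stripe["_stats"]["age"]
    if hi <= 50:          # whole stripe eliminated by its statistics alone
        continue
    stripes_read += 1
    for name, age in zip(stripe["name"], stripe["age"]):
        if age > 50:
            result.append(name)

print(stripes_read, result)  # only the second stripe is ever scanned
```

With this data the first stripe (ages 25-31) is skipped without being read; that skipping, done per stripe on disk, is what predicate push-down buys at scale.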
Converting base tables to ORC is often the responsibility of your ingest team, and it may take them some time to change the complete ingestion process due to other priorities. The benefits of ORCFile are so tangible that I often recommend a do-it-yourself approach as demonstrated above: convert A into A_ORC and B into B_ORC, and do the join that way, so that you benefit from faster queries immediately, with no dependencies on other teams.

Technique #3: Use Vectorization

Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by performing them in batches of 1024 rows at a time instead of a single row at a time. Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:

    set hive.vectorized.execution.enabled = true;
    set hive.vectorized.execution.reduce.enabled = true;

Technique #4: Use Cost-Based Query Optimization

Hive optimizes each query's logical and physical execution plan before submitting it for final execution. Historically, these optimizations were not based on the cost of the query. A recent addition to Hive, cost-based optimization (CBO) performs further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others. To use CBO, set the following parameters at the beginning of your query:

    set hive.cbo.enable=true;
    set hive.compute.query.using.stats=true;
    set hive.stats.fetch.column.stats=true;
    set hive.stats.fetch.partition.stats=true;

Then, prepare the data for CBO by running Hive's ANALYZE command to collect statistics on the tables for which you want to use CBO.
For example, suppose that for a table tweets we want to collect statistics about the table itself and about two of its columns, sender and topic:

    analyze table tweets compute statistics;
    analyze table tweets compute statistics for columns sender, topic;

With Hive 0.14 (on HDP 2.2) the ANALYZE command works much faster, and you don't need to name each column, so you can simply issue:

    analyze table tweets compute statistics for columns;

That's it. Queries against this table should now get a different, faster execution plan thanks to Hive's cost calculation.

Technique #5: Write Good SQL

SQL is a powerful declarative language, and like other declarative languages, it offers more than one way to write a given statement. Although two formulations may be functionally identical, they can have strikingly different performance characteristics.

Let's look at an example. Consider a click-stream event table:

    CREATE TABLE clicks (
      timestamp date, sessionID string, url string, source_ip string
    ) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Each record represents a click event, and we would like to find the latest URL for each sessionID. One might consider the following approach:

    SELECT clicks.*
    FROM clicks INNER JOIN
      (SELECT sessionID, max(timestamp) AS max_ts
       FROM clicks
       GROUP BY sessionID) latest
    ON clicks.sessionID = latest.sessionID
    AND clicks.timestamp = latest.max_ts;

In the above query, we build a sub-query to collect the timestamp of the latest event in each session, and then use an inner join to filter out the rest.
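It helps to pin down exactly what the subquery-plus-join computes. Here is a small stdlib-Python sketch of the same logic on toy click data (the sessions, timestamps, and URLs are made up for illustration):

```python
# Mimic, in plain Python, the subquery-plus-join over toy data.
# Each row is (timestamp, sessionID, url).
clicks = [
    (1, "s1", "/home"), (2, "s1", "/search"), (5, "s1", "/checkout"),
    (3, "s2", "/home"), (4, "s2", "/about"),
]

# The sub-query: GROUP BY sessionID, max(timestamp).
max_ts = {}
for ts, sid, _url in clicks:
    max_ts[sid] = max(ts, max_ts.get(sid, ts))

# The inner join: keep only rows whose timestamp equals their session's max.
latest = [row for row in clicks if row[0] == max_ts[row[1]]]

print(sorted(latest))  # one row per session: the latest click
```

Note that the SQL version scans the clicks table twice: once to build the per-session maxima, and once to join against them.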
While that query is a reasonable solution from a functional point of view, it turns out there is a better way to write it:

    SELECT * FROM
      (SELECT *,
              RANK() OVER (PARTITION BY sessionID
                           ORDER BY timestamp DESC) AS rank
       FROM clicks) ranked_clicks
    WHERE ranked_clicks.rank = 1;

Here we use Hive's OLAP windowing functionality (OVER and RANK) to achieve the same thing, but without a join. Clearly, removing an unnecessary join will almost always result in better performance, and with big data this matters more than ever. I find many cases where queries are not optimal, so look carefully at every query and consider whether a rewrite can make it better and faster.

Summary

Apache Hive is a powerful tool in the hands of data analysts and data scientists, and it supports a variety of batch and interactive workloads. In this blog post, I've discussed some useful techniques, the ones I use most often and find most useful in my day-to-day work as a data scientist, to make Hive queries run faster.

Thankfully, the Hive community is not finished yet. Even between Hive 0.13 and Hive 0.14, there are dramatic improvements in ORCFile, vectorization, and CBO, and in how they positively impact query performance. I'm really excited about Stinger.next, which aims to bring query times down to the sub-second range. I can't wait.
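As a closing sanity check on the Technique #5 rewrite: the window-function formulation and the join formulation should return the same rows. This stdlib-Python sketch computes the latest click per session both ways on toy data (made-up sessions and URLs) and confirms they agree; as with SQL's RANK, it assumes no timestamp ties within a session, so rank 1 is a single row:

```python
from itertools import groupby

# Toy click data: (timestamp, sessionID, url).
clicks = [
    (1, "s1", "/home"), (2, "s1", "/search"), (5, "s1", "/checkout"),
    (3, "s2", "/home"), (4, "s2", "/about"),
]

# Join formulation: per-session max timestamp, then filter (two passes).
max_ts = {}
for ts, sid, _url in clicks:
    max_ts[sid] = max(ts, max_ts.get(sid, ts))
via_join = sorted(row for row in clicks if row[0] == max_ts[row[1]])

# Window formulation: PARTITION BY sessionID ORDER BY timestamp DESC,
# then keep the rank-1 row of each partition -- no join needed.
by_session = sorted(clicks, key=lambda r: (r[1], -r[0]))
via_rank = sorted(next(g) for _sid, g in groupby(by_session, key=lambda r: r[1]))

print(via_join == via_rank)  # True: same rows, but only one scan of the data
```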
