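Background

When two large tables are joined and the values of the filter column are highly skewed, consider filtering on the inverse (complementary) condition instead. The inverse condition matches only a handful of rows, which cuts the CPU time spent on the join.

Data preparation

-- The state table tp01_state stores several status flags for each row of the large table tp01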

kingbase=# select count(*) from tp01;
  count
----------
 10000000
(1 row)

-- issuc has a single dominant (highly skewed) value

kingbase=# select issuc,count(*) from tp01_state group by issuc order by issuc;
 issuc |  count
-------+---------
 N     |     100
 Y     | 9999900
(2 rows)

-- istype has several highly skewed values

kingbase=# select istype, count(*) from tp01_state group by istype order by istype;
 istype |  count
--------+---------
 A      |     100
 C      | 8999700
 G      |     100
 M      | 1000000
 W      |     100
(5 rows)
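The value counts above already show how small the inverse condition is. As a rough sketch of that arithmetic (the helper name `inverse_share` is hypothetical and not part of the original article), the following computes how many rows match a condition and how many match its complement:

```python
def inverse_share(value_counts, wanted):
    """Return (rows matching `wanted`, rows matching the inverse, inverse fraction)."""
    total = sum(value_counts.values())
    matched = sum(n for value, n in value_counts.items() if value in wanted)
    inverse = total - matched
    return matched, inverse, inverse / total


# Counts copied from the two GROUP BY results shown above.
issuc_counts = {"N": 100, "Y": 9_999_900}
istype_counts = {"A": 100, "C": 8_999_700, "G": 100, "M": 1_000_000, "W": 100}

for name, counts, wanted in [
    ("issuc = 'Y'", issuc_counts, {"Y"}),
    ("istype in ('C','M')", istype_counts, {"C", "M"}),
]:
    matched, inverse, fraction = inverse_share(counts, wanted)
    print(f"{name}: {matched} rows match, {inverse} rows match the inverse "
          f"({fraction:.4%} of the table)")
```

In both cases the inverse condition selects a few hundred rows out of ten million, which is what makes the rewrites in the following sections worthwhile.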

Querying rows with issuc = 'Y'

Baseline statement

Most rows match issuc = 'Y', so the plan is a hash join between two large tables.

select * from tp01 where id in (select id from tp01_state where issuc = 'Y');

-- or

select * from tp01 where exists (select 1 from tp01_state where tp01_state.id = tp01.id and issuc = 'Y');

-- QUERY PLAN
Hash Semi Join  (cost=338555.00..1033383.15 rows=10000000 width=241) (actual time=2398.867..5889.537 rows=9999900 loops=1)
  Hash Cond: (tp01.id = tp01_state.id)
  ->  Seq Scan on tp01  (cost=0.00..444828.12 rows=10000012 width=241) (actual time=0.005..611.596 rows=10000000 loops=1)
  ->  Hash  (cost=213555.00..213555.00 rows=10000000 width=4) (actual time=2384.857..2384.858 rows=9999900 loops=1)
        Buckets: 16777216  Batches: 1  Memory Usage: 482631kB
        ->  Seq Scan on tp01_state  (cost=0.00..213555.00 rows=10000000 width=4) (actual time=0.011..775.853 rows=9999900 loops=1)
              Filter: (issuc = 'Y'::text)
              Rows Removed by Filter: 100
Planning Time: 0.186 ms
Execution Time: 6137.233 ms

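Problem: tp01_state holds about ten million rows, so joining the two tables inevitably consumes a lot of CPU. The plan shows that most of the time goes into the hash join, that is, building and probing the hash table.

Optimization 1: use NOT IN instead of IN

Only a few rows match the inverse of issuc = 'Y', so rewriting the query with NOT IN shrinks the subquery result to those few rows and avoids the large join. The "or issuc is null" branch is needed because issuc <> 'Y' does not match NULL. The rewrite is guaranteed to return the same rows as the original when tp01_state has exactly one row per tp01.id and tp01_state.id is never NULL; otherwise the two forms can differ.

select * from tp01 where tp01.id not in (select id from tp01_state where issuc <> 'Y' or issuc is null);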

-- QUERY PLAN
Seq Scan on tp01  (cost=213555.00..683383.15 rows=5000006 width=241) (actual time=517.554..1629.795 rows=9999900 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 100
  SubPlan 1
    ->  Seq Scan on tp01_state  (cost=0.00..213555.00 rows=1 width=4) (actual time=271.143..517.503 rows=100 loops=1)
          Filter: (issuc <> 'Y'::text)
          Rows Removed by Filter: 9999900
Planning Time: 0.087 ms
Execution Time: 1870.376 ms

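Although the rewritten SQL evaluates the subquery as a filter (a hashed SubPlan), the result set of SubPlan 1 is tiny, so the query is still very efficient: execution time drops from about 6.1 s to about 1.9 s.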
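The IN and NOT IN forms are interchangeable only under certain data assumptions. The sketch below illustrates this on a six-row toy data set. It uses Python's built-in sqlite3 module as a stand-in for Kingbase (an assumption made purely so the example is self-contained; the toy rows and helper names are hypothetical), and it relies only on standard SQL three-valued logic:

```python
import sqlite3


def build(states):
    """Create tp01 with ids 1..6 and tp01_state from (id, issuc) pairs."""
    con = sqlite3.connect(":memory:")
    con.execute("create table tp01 (id integer primary key)")
    con.execute("create table tp01_state (id integer, issuc text)")
    con.executemany("insert into tp01 values (?)", [(i,) for i in range(1, 7)])
    con.executemany("insert into tp01_state values (?, ?)", states)
    return con


IN_SQL = ("select id from tp01 where id in "
          "(select id from tp01_state where issuc = 'Y') order by id")
NOT_IN_SQL = ("select id from tp01 where id not in "
              "(select id from tp01_state where issuc <> 'Y' or issuc is null) "
              "order by id")


def ids(con, sql):
    return [row[0] for row in con.execute(sql)]


# Case 1: exactly one state row per tp01 row, no NULL ids: both forms agree.
clean = build([(1, "Y"), (2, "Y"), (3, "N"), (4, "Y"), (5, None), (6, "Y")])

# Case 2: id 6 has no state row: IN drops it, NOT IN keeps it.
missing = build([(1, "Y"), (2, "Y"), (3, "N"), (4, "Y"), (5, None)])

# Case 3: a NULL id among the inverse rows: NOT IN returns no rows at all.
null_id = build([(1, "Y"), (2, "Y"), (None, "N"), (4, "Y"), (5, "Y"), (6, "Y")])

for label, con in [("clean", clean), ("missing", missing), ("null_id", null_id)]:
    print(label, ids(con, IN_SQL), ids(con, NOT_IN_SQL))
```

Before applying the rewrite to real data, it is worth checking that the state table has one row per id and that the id column is declared NOT NULL (or filtering NULL ids out of the subquery).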

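Optimization 2: use NOT BETWEEN instead of <>

The planner does not use a B-tree index for a <> predicate, but NOT BETWEEN 'Y' AND 'Y' is expanded into issuc < 'Y' OR issuc > 'Y', and each of these range conditions can use the index on issuc. This shortens the subquery's execution time.

select *
from tp01
where tp01.id not in (select id from tp01_state where issuc not between 'Y' and 'Y' or issuc is null);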

-- QUERY PLAN
Seq Scan on tp01  (cost=17.35..469845.50 rows=5000006 width=241) (actual time=0.098..1109.085 rows=9999900 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 100
  SubPlan 1
    ->  Bitmap Heap Scan on tp01_state  (cost=13.33..17.34 rows=1 width=4) (actual time=0.035..0.045 rows=100 loops=1)
          Recheck Cond: ((issuc < 'Y'::text) OR (issuc > 'Y'::text) OR (issuc IS NULL))
          Heap Blocks: exact=2
          ->  BitmapOr  (cost=13.33..13.33 rows=1 width=0) (actual time=0.028..0.030 rows=0 loops=1)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.020..0.020 rows=100 loops=1)
                      Index Cond: (issuc < 'Y'::text)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
                      Index Cond: (issuc > 'Y'::text)
                ->  Bitmap Index Scan on tp01_state_issuc  (cost=0.00..4.44 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                      Index Cond: (issuc IS NULL)
Planning Time: 0.109 ms
Execution Time: 1349.526 ms
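Replacing <> with NOT BETWEEN is safe because the two predicates are logically identical, including for NULL. A minimal check, again using sqlite3 as a self-contained stand-in for Kingbase (the value 'Z' and the helper `truth` are hypothetical additions for illustration), evaluates the three spellings on a few sample values:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 'N' and 'Y' come from the article; 'Z' and NULL are added to cover
# the "greater than" branch and the NULL case.
values = ["N", "Y", "Z", None]


def truth(expr, value):
    """Evaluate a predicate on one value; returns 1, 0, or None (SQL NULL)."""
    sql = f"select {expr} from (select ? as issuc)"
    return con.execute(sql, (value,)).fetchone()[0]


not_equal = [truth("issuc <> 'Y'", v) for v in values]
not_between = [truth("issuc not between 'Y' and 'Y'", v) for v in values]
two_ranges = [truth("issuc < 'Y' or issuc > 'Y'", v) for v in values]

print(not_equal, not_between, two_ranges)
```

All three lists are identical, and each yields NULL for a NULL input, which is why the "or issuc is null" branch stays in the rewritten query. This check says nothing about index usage; that part depends on the planner and is shown by the execution plan above.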

Querying rows with istype in ('C','M')

Baseline statement

Most rows match istype in ('C','M'), so the plan is again a hash join between two large tables.

explain analyze
select *
from tp01
where id in (select id from tp01_state where istype in ('C', 'M'));

-- QUERY PLAN
Hash Semi Join  (cost=307305.11..927445.94 rows=7500009 width=241) (actual time=2848.058..6398.654 rows=9999700 loops=1)
  Hash Cond: (tp01.id = tp01_state.id)
  ->  Seq Scan on tp01  (cost=0.00..444828.12 rows=10000012 width=241) (actual time=0.005..613.502 rows=10000000 loops=1)
  ->  Hash  (cost=213555.00..213555.00 rows=7500009 width=4) (actual time=2840.972..2840.972 rows=9999700 loops=1)
        Buckets: 16777216 (originally 8388608)  Batches: 1 (originally 1)  Memory Usage: 482624kB
        ->  Seq Scan on tp01_state  (cost=0.00..213555.00 rows=7500009 width=4) (actual time=0.034..1032.910 rows=9999700 loops=1)
              Filter: (istype = ANY ('{C,M}'::text[]))
              Rows Removed by Filter: 300
Planning Time: 0.193 ms
Execution Time: 6646.452 ms

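Optimization 1: use NOT IN instead of IN

Only a few rows match the inverse of istype in ('C','M'), so NOT IN again reduces the work to filtering tp01 against a small subquery result.

select *
from tp01
where id not in (select id from tp01_state where istype not in ('C', 'M') or istype is null);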

-- QUERY PLAN
Seq Scan on tp01  (cost=175497.98..645326.13 rows=5000006 width=241) (actual time=778.116..2699.271 rows=9999700 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 300
  SubPlan 1
    ->  Seq Scan on tp01_state  (cost=0.00..169248.00 rows=2499991 width=4) (actual time=0.006..767.589 rows=300 loops=1)
          Filter: ((istype <> ALL ('{C,M}'::text[])) OR (istype IS NULL))
          Rows Removed by Filter: 9999700
Planning Time: 0.101 ms
Execution Time: 2934.265 ms

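Optimization 2: use NOT BETWEEN instead of <>

With NOT BETWEEN the planner can use the index for the value with the best selectivity, which shortens the subquery. In the plan that follows, only the range conditions around 'C' become index conditions; the conditions on 'M' are applied as a filter after the heap fetch, which discards 1,000,000 rows.

select *
from tp01
where id not in (select id from tp01_state
                 where (istype not between 'C' and 'C' and istype not between 'M' and 'M') or istype is null);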

-- QUERY PLAN
Seq Scan on tp01  (cost=106721.48..576549.63 rows=5000006 width=241) (actual time=223.295..1862.006 rows=9999700 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 300
  SubPlan 1
    ->  Bitmap Heap Scan on tp01_state  (cost=22927.80..104507.84 rows=885454 width=4) (actual time=58.615..220.275 rows=300 loops=1)
          Recheck Cond: (((istype < 'C'::text) OR (istype > 'C'::text)) OR (istype IS NULL))
          Filter: ((((istype < 'C'::text) OR (istype > 'C'::text)) AND ((istype < 'M'::text) OR (istype > 'M'::text))) OR (istype IS NULL))
          Rows Removed by Filter: 1000000
          Heap Blocks: exact=4428
          ->  BitmapOr  (cost=22927.80..22927.80 rows=981652 width=0) (actual time=58.266..58.268 rows=0 loops=1)
                ->  BitmapOr  (cost=22701.99..22701.99 rows=981652 width=0) (actual time=58.262..58.263 rows=0 loops=1)
                      ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..5.69 rows=167 width=0) (actual time=0.026..0.027 rows=100 loops=1)
                            Index Cond: (istype < 'C'::text)
                      ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..22253.58 rows=981486 width=0) (actual time=58.235..58.235 rows=1000200 loops=1)
                            Index Cond: (istype > 'C'::text)
                ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..4.44 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
                      Index Cond: (istype IS NULL)
Planning Time: 0.350 ms
Execution Time: 2099.544 ms

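Optimization 3: use a combination of < and > range conditions instead of NOT BETWEEN

Splitting the NOT BETWEEN conditions into non-overlapping range conditions lets every branch use the index and removes the filter step.

select *
from tp01
where id not in (
    select id
    from tp01_state
    where (istype < 'C')
       or (istype > 'C' and istype < 'M')
       or (istype > 'M')
       or istype is null);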

-- QUERY PLAN
Seq Scan on tp01  (cost=350.11..470178.26 rows=5000006 width=241) (actual time=0.142..1099.829 rows=9999700 loops=1)
  Filter: (NOT (hashed SubPlan 1))
  Rows Removed by Filter: 300
  SubPlan 1
    ->  Bitmap Heap Scan on tp01_state  (cost=8.60..349.28 rows=334 width=4) (actual time=0.067..0.091 rows=300 loops=1)
          Recheck Cond: ((istype < 'C'::text) OR ((istype > 'C'::text) AND (istype < 'M'::text)) OR (istype > 'M'::text) OR (istype IS NULL))
          Heap Blocks: exact=2
          ->  BitmapOr  (cost=8.60..8.60 rows=334 width=0) (actual time=0.058..0.060 rows=0 loops=1)
                ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..2.69 rows=167 width=0) (actual time=0.019..0.019 rows=100 loops=1)
                      Index Cond: (istype < 'C'::text)
                ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..1.45 rows=1 width=0) (actual time=0.024..0.024 rows=100 loops=1)
                      Index Cond: ((istype > 'C'::text) AND (istype < 'M'::text))
                ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..2.69 rows=167 width=0) (actual time=0.014..0.014 rows=100 loops=1)
                      Index Cond: (istype > 'M'::text)
                ->  Bitmap Index Scan on tp01_state_istype  (cost=0.00..1.44 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
                      Index Cond: (istype IS NULL)
Planning Time: 0.183 ms
Execution Time: 1340.184 ms
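The range decomposition generalizes to any list of excluded values: take the gap before the smallest value, the gaps between neighbours, and the gap after the largest. The sketch below builds that predicate and checks it against the NOT IN form on a toy table. sqlite3 stands in for Kingbase, the helper `gap_predicate` is hypothetical, and Python's string ordering is assumed to match the database collation for these single-letter values:

```python
import sqlite3


def gap_predicate(column, excluded):
    """Build 'col < v1 or (col > v1 and col < v2) or ... or col > vn or col is null'.

    Literals are interpolated directly for readability; this is an
    illustration only and not safe for untrusted input.
    """
    vals = sorted(excluded)
    parts = [f"{column} < '{vals[0]}'"]
    parts += [f"({column} > '{lo}' and {column} < '{hi}')"
              for lo, hi in zip(vals, vals[1:])]
    parts.append(f"{column} > '{vals[-1]}'")
    parts.append(f"{column} is null")
    return " or ".join(parts)


con = sqlite3.connect(":memory:")
con.execute("create table tp01_state (id integer, istype text)")
con.executemany(
    "insert into tp01_state values (?, ?)",
    [(1, "A"), (2, "C"), (3, "G"), (4, "M"), (5, "W"), (6, None)],
)


def matching_ids(predicate):
    sql = f"select id from tp01_state where {predicate} order by id"
    return [row[0] for row in con.execute(sql)]


NOT_IN_FORM = "istype not in ('C', 'M') or istype is null"
RANGE_FORM = gap_predicate("istype", ["M", "C"])

print(RANGE_FORM)
print(matching_ids(NOT_IN_FORM), matching_ids(RANGE_FORM))
```

Both predicates select the same rows; the range form simply spells the condition in a way that maps onto index range scans, as the plan above shows.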

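Summary

The goal of query optimization is to touch less data and do less computation. Do not dismiss operators such as NOT IN just because they are usually considered hard to optimize.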

