列举GaussDB(DWS)常见的查询时索引失效场景

摘要：使用GaussDB(DWS)时，有时为了加快查询速度，需要对表建立索引。有时我们会遇到明明建立了索引，查询计划中却发现索引没有被使用的情况。本文将列举几种常见的场景和优化方法。

本文分享自华为云社区《GaussDB(DWS)之查询时索引失效原因》，作者：飞不起来的小松鼠。

使用GaussDB(DWS)时，有时为了加快查询速度，需要对表建立索引。有时我们会遇到明明建立了索引，查询计划中却发现索引没有被使用的情况。本文将列举几种常见的场景和优化方法。

1. 返回结果集很大

以行存表的Seq Scan和Index Scan为例：

Seq Scan：就是按照表的记录的排列顺序从头到尾依次检索扫描，每次扫描要取到所有的记录。这也是最简单最基础的扫表方式，扫描的代价比较大；

Index Scan：对于给定的查询，先扫描一遍索引，从索引中找到符合要求的记录的位置(指针)，再定位到表中具体的Page去取，即先走索引，再读表数据；

因此，根据两种扫描方式的特点可以看出，大多数情况下，Index Scan要比Seq Scan快。但是如果获取的结果集占所有数据的比重很大时（超过70%），这时Index Scan 因为要先扫描索引再读表数据反而不如直接全表扫描来的快。

2. 没有analyze

analyze会更新表的统计信息，如果表未做analyze或上次做完analyze之后表进行过数据量较大的增删操作，会导致统计信息不准，这时候也可能导致表没有走索引。

优化方法：对表进行analyze更新统计信息即可。

3. 过滤条件使用了函数或隐式类型转化导致没有走索引

如果再过滤条件中使用了计算、函数、隐式类型转化，都可能导致无法选择索引。

示例：create table test(a int, b text, c date); 且在a，b，c三列上都分别创建了索引。

场景1：使用计算

从下面的执行结果可以看到，where a = 101，where a = 102 - 1都能使用a列上的索引，但是where a + 1 = 102没有走索引。

postgres=# explain verbose select * from test where a + 1 = 102;

                            QUERY PLAN

-------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.19..18.25 rows=6 width=14)

   Output: a, b, c

   Node/s: All datanodes

   ->  Seq Scan on public.test  (cost=0.00..12.25 rows=6 width=14)

         Output: a, b, c

         Distribute Key: a

         Filter: ((test.a + 1) = 102)

(7 rows)

postgres=#

postgres=# explain verbose select * from test where a  = 101;

                                     QUERY PLAN

-------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)

   Output: a, b, c

   Node/s: datanode1

   ->  Index Scan using test_a_idx on public.test  (cost=0.00..8.27 rows=1 width=14)

         Output: a, b, c

         Distribute Key: a

         Index Cond: (test.a = 101)

(7 rows)

postgres=#

postgres=# explain verbose select * from test where a  = 102 - 1;

                                     QUERY PLAN

-------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)

   Output: a, b, c

   Node/s: datanode1

   ->  Index Scan using test_a_idx on public.test  (cost=0.00..8.27 rows=1 width=14)

         Output: a, b, c

         Distribute Key: a

         Index Cond: (test.a = 101)

(7 rows)

优化方式：此类场景的优化方式是尽量使用常量代替表达式，或者常量计算尽量写在等号的右侧。

场景2：使用函数

从下面的执行结果可以看到，在索引列上使用函数也会导致无法选择索引：

postgres=# explain verbose select * from test where to_char(c, 'yyyyMMdd') = to_char(CURRENT_DATE,'yyyyMMdd');

                                                           QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.19..21.00 rows=6 width=14)

   Output: a, b, c

   Node/s: All datanodes

   ->  Seq Scan on public.test  (cost=0.00..15.00 rows=6 width=14)

         Output: a, b, c

         Distribute Key: a

         Filter: (to_char(test.c, 'yyyyMMdd'::text) = to_char(('2021-03-16'::date)::timestamp with time zone, 'yyyyMMdd'::text))

(7 rows)

postgres=#

postgres=# explain verbose select * from test where c = current_date;

                                     QUERY PLAN

-------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)

   Output: a, b, c

   Node/s: All datanodes

   ->  Index Scan using test_c_idx on public.test  (cost=0.00..8.27 rows=1 width=14)

         Output: a, b, c

         Distribute Key: a

         Index Cond: (test.c = '2021-03-16'::date)

(7 rows)

优化方法：尽量减少索引列上没有必要的函数调用。

场景3：隐式类型转化

此类场景是经常遇到的场景，例如b的类型是text类型，过滤条件是where b = 2，在生成计划时，text类型会隐式转化为bigint类型，实际的过滤条件变成where b::bigint = 2，导致b列上的索引失效：

postgres=# explain verbose select * from test where b = 2;

                            QUERY PLAN

-------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.06..18.25 rows=1 width=14)

   Output: a, b, c

   Node/s: All datanodes

   ->  Seq Scan on public.test  (cost=0.00..12.25 rows=1 width=14)

         Output: a, b, c

         Distribute Key: a

         Filter: ((test.b)::bigint = 2)

(7 rows)

postgres=#

postgres=#

postgres=# explain verbose select * from test where b = '2';

                                     QUERY PLAN

-------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)

   Output: a, b, c

   Node/s: All datanodes

   ->  Index Scan using test_b_idx on public.test  (cost=0.00..8.27 rows=1 width=14)

         Output: a, b, c

         Distribute Key: a

         Index Cond: (test.b = '2'::text)

(7 rows)

postgres=#

优化方法：索引条件上的常量尽可能使用和索引列相同类型的常量，避免发生隐式类型转化。

4. 使用nestloop + indexscan 代替 hashjoin

此类语句的特征是两个表关联的时候，其中一个表上where条件过滤之后的结果集行数很小，同时，最终满足条件的结果集行数也很小。此时，使用nestloop+indexscan的效果往往要由于hashjoin。较优的执行计划如下：

可以看到，第5层的Index Cond: (t1.b = t2.b)已经把join条件下推到了基表扫描上。

postgres=# explain verbose select t1.a,t1.b from t1,t2 where t1.b=t2.b and t2.a=4;

 id |                    operation                     | E-rows | E-distinct | E-memory | E-width | E-costs

----+--------------------------------------------------+--------+------------+----------+---------+---------

  1 | ->  Streaming (type: GATHER)                     |     26 |            |          |       8 | 17.97

  2 |    ->  Nested Loop (3,5)                         |     26 |            | 1MB      |       8 | 11.97

  3 |       ->  Streaming(type: BROADCAST)             |      2 |            | 2MB      |       4 | 2.78

  4 |          ->  Seq Scan on public.t2               |      1 |            | 1MB      |       4 | 2.62

  5 |       ->  Index Scan using t1_b_idx on public.t1 |     26 |            | 1MB      |       8 | 9.05

(5 rows)

 Predicate Information (identified by plan id)

-----------------------------------------------

   4 --Seq Scan on public.t2

         Filter: (t2.a = 4)

   5 --Index Scan using t1_b_idx on public.t1

         Index Cond: (t1.b = t2.b)

(4 rows)

 Targetlist Information (identified by plan id)

------------------------------------------------

   1 --Streaming (type: GATHER)

         Output: t1.a, t1.b

         Node/s: All datanodes

   2 --Nested Loop (3,5)

         Output: t1.a, t1.b

   3 --Streaming(type: BROADCAST)

         Output: t2.b

         Spawn on: datanode2

         Consumer Nodes: All datanodes

   4 --Seq Scan on public.t2

         Output: t2.b

         Distribute Key: t2.a

   5 --Index Scan using t1_b_idx on public.t1

         Output: t1.a, t1.b

         Distribute Key: t1.a

(15 rows)

   ====== Query Summary =====

---------------------------------

 System available mem: 9262694KB

 Query Max mem: 9471590KB

 Query estimated mem: 5144KB

(3 rows)

如果优化器没有选择这种执行计划，可以通过以下方式优化：

set enable_index_nestloop = on;

set enable_hashjoin = off;

set enable_seqscan = off;

5. 使用hint指定索引时指定的索引方式不对

GaussDB(DWS)的plan hint当前支持指定的Scan方式有三种：tablescan、indexscan和indexonlyscan。

tablescan：全表扫描，比如行存表的Seq Scan，列存表的CStore Scan

indexscan：先扫索引，再根据索引取表记录

indexonlyscan：覆盖索引扫描，所需的返回结果能被所扫描的索引全部覆盖。与index scan相比，index only scan所包含的字段集合，囊括了我们查询语句中的字段，这样，提取出相应的index ，就不必再根据索引取表记录了。

因此，对于需要indexonlyscan的场景，如果hint指定了indexscan，该hint是无法生效的：

postgres=# explain verbose select/*+ indexscan(test)*/ b from test where b = '1';

WARNING:  unused hint: IndexScan(test)

                             QUERY PLAN

--------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=3.12..16.88 rows=100 width=2)

   Output: b

   Node/s: All datanodes

   ->  Seq Scan on public.test  (cost=0.00..10.88 rows=100 width=2)

         Output: b

         Distribute Key: a

         Filter: (test.b = '1'::text)

(7 rows)

postgres=#

postgres=# explain verbose select/*+ indexonlyscan(test)*/ b from test where b = '1';

                                         QUERY PLAN

--------------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=3.12..56.51 rows=100 width=2)

   Output: b

   Node/s: All datanodes

   ->  Index Only Scan using test_b_idx on public.test  (cost=0.00..50.51 rows=100 width=2)

         Output: b

         Distribute Key: a

         Index Cond: (test.b = '1'::text)

(7 rows)

优化方法：使用hint时正确指定indexscan和indexonlyscan。

6. 全文检索gin索引

为了加速文本搜索，进行全文检索时可以创建GIN索引：

create index idxb on test using gin(to_tsvector('english',b));

创建索引时，必须使用to_tsvector的两参数版本，并且只有当查询时也使用了两参数版本，且参数值与索引中相同时，才会使用该索引：

postgres=# explain verbose select  * from test where to_tsvector(b) @@ to_tsquery('cat') order by 1;

                                QUERY PLAN

--------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=22.23..27.87 rows=12 width=14)

   Output: a, b, c

   Merge Sort Key: test.a

   Node/s: All datanodes

   ->  Sort  (cost=21.86..21.87 rows=12 width=14)

         Output: a, b, c

         Sort Key: test.a

         ->  Seq Scan on public.test  (cost=0.00..21.78 rows=11 width=14)

               Output: a, b, c

               Distribute Key: a

               Filter: (to_tsvector(test.b) @@ '''cat'''::tsquery)

(11 rows)

postgres=#

postgres=# explain verbose select  * from test where to_tsvector('english',b) @@ to_tsquery('cat') order by 1;

                                            QUERY PLAN

---------------------------------------------------------------------------------------------------

 Streaming (type: GATHER)  (cost=16.09..22.03 rows=2 width=14)

   Output: a, b, c

   Merge Sort Key: test.a

   Node/s: All datanodes

   ->  Sort  (cost=16.03..16.03 rows=2 width=14)

         Output: a, b, c

         Sort Key: test.a

         ->  Bitmap Heap Scan on public.test  (cost=12.00..16.02 rows=1 width=14)

               Output: a, b, c

               Distribute Key: a

               Recheck Cond: (to_tsvector('english'::regconfig, test.b) @@ '''cat'''::tsquery)

               ->  Bitmap Index Scan on idxb  (cost=0.00..12.00 rows=1 width=0)

                     Index Cond: (to_tsvector('english'::regconfig, test.b) @@ '''cat'''::tsquery)

(13 rows)

优化方式：查询时也使用了两参数版本，且保证参数值与索引中相同。

点击关注，第一时间了解华为云新鲜技术~