列舉GaussDB(DWS)常見的查詢時索引失效場景

摘要：使用GaussDB(DWS)時，有時為了加快查詢速度，需要對表建立索引。有時我們會遇到明明建立了索引，查詢計劃中卻發現索引沒有被使用的情況。本文将列舉幾種常見的場景和優化方法。

本文分享自華為雲社群《GaussDB(DWS)之查詢時索引失效原因》，作者：飛不起來的小松鼠。

使用GaussDB(DWS)時，有時為了加快查詢速度，需要對表建立索引。有時我們會遇到明明建立了索引，查詢計劃中卻發現索引沒有被使用的情況。本文将列舉幾種常見的場景和優化方法。

1. 傳回結果集很大

以行存表的Seq Scan和Index Scan為例：

Seq Scan：就是按照表的記錄的排列順序從頭到尾依次檢索掃描，每次掃描要取到所有的記錄。這也是最簡單最基礎的掃表方式，掃描的代價比較大；

Index Scan：對于給定的查詢，先掃描一遍索引，從索引中找到符合要求的記錄的位置(指針)，再定位到表中具體的Page去取，即先走索引，再讀表資料；

是以，根據兩種掃描方式的特點可以看出，大多數情況下，Index Scan要比Seq Scan快。但是如果擷取的結果集占所有資料的比重很大時（超過70%），這時Index Scan 因為要先掃描索引再讀表資料反而不如直接全表掃描來的快。

2. 沒有analyze

analyze會更新表的統計資訊，如果表未做analyze或上次做完analyze之後表進行過資料量較大的增删操作，會導緻統計資訊不準，這時候也可能導緻表沒有走索引。

優化方法：對表進行analyze更新統計資訊即可。

3. 過濾條件使用了函數或隐式類型轉化導緻沒有走索引

如果再過濾條件中使用了計算、函數、隐式類型轉化，都可能導緻無法選擇索引。

示例：create table test(a int, b text, c date); 且在a，b，c三列上都分别建立了索引。

場景1：使用計算

從下面的執行結果可以看到，where a = 101，where a = 102 - 1都能使用a列上的索引，但是where a + 1 = 102沒有走索引。

postgres=# explain verbose select * from test where a + 1 = 102;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.19..18.25 rows=6 width=14)
   Output: a, b, c
   Node/s: All datanodes
   ->  Seq Scan on public.test  (cost=0.00..12.25 rows=6 width=14)
         Output: a, b, c
         Distribute Key: a
         Filter: ((test.a + 1) = 102)
(7 rows)

postgres=# 
postgres=# explain verbose select * from test where a  = 101;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)
   Output: a, b, c
   Node/s: datanode1
   ->  Index Scan using test_a_idx on public.test  (cost=0.00..8.27 rows=1 width=14)
         Output: a, b, c
         Distribute Key: a
         Index Cond: (test.a = 101)
(7 rows)

postgres=# 
postgres=# explain verbose select * from test where a  = 102 - 1;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)
   Output: a, b, c
   Node/s: datanode1
   ->  Index Scan using test_a_idx on public.test  (cost=0.00..8.27 rows=1 width=14)
         Output: a, b, c
         Distribute Key: a
         Index Cond: (test.a = 101)
(7 rows)

優化方式：此類場景的優化方式是盡量使用常量代替表達式，或者常量計算盡量寫在等号的右側。

場景2：使用函數

從下面的執行結果可以看到，在索引列上使用函數也會導緻無法選擇索引：

postgres=# explain verbose select * from test where to_char(c, 'yyyyMMdd') = to_char(CURRENT_DATE,'yyyyMMdd');
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.19..21.00 rows=6 width=14)
   Output: a, b, c
   Node/s: All datanodes
   ->  Seq Scan on public.test  (cost=0.00..15.00 rows=6 width=14)
         Output: a, b, c
         Distribute Key: a
         Filter: (to_char(test.c, 'yyyyMMdd'::text) = to_char(('2021-03-16'::date)::timestamp with time zone, 'yyyyMMdd'::text))
(7 rows)

postgres=# 
postgres=# explain verbose select * from test where c = current_date;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)
   Output: a, b, c
   Node/s: All datanodes
   ->  Index Scan using test_c_idx on public.test  (cost=0.00..8.27 rows=1 width=14)
         Output: a, b, c
         Distribute Key: a
         Index Cond: (test.c = '2021-03-16'::date)
(7 rows)

優化方法：盡量減少索引列上沒有必要的函數調用。

場景3：隐式類型轉化

此類場景是經常遇到的場景，例如b的類型是text類型，過濾條件是where b = 2，在生成計劃時，text類型會隐式轉化為bigint類型，實際的過濾條件變成where b::bigint = 2，導緻b列上的索引失效：

postgres=# explain verbose select * from test where b = 2;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.06..18.25 rows=1 width=14)
   Output: a, b, c
   Node/s: All datanodes
   ->  Seq Scan on public.test  (cost=0.00..12.25 rows=1 width=14)
         Output: a, b, c
         Distribute Key: a
         Filter: ((test.b)::bigint = 2)
(7 rows)

postgres=# 
postgres=# 
postgres=# explain verbose select * from test where b = '2';
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=0.06..14.27 rows=1 width=14)
   Output: a, b, c
   Node/s: All datanodes
   ->  Index Scan using test_b_idx on public.test  (cost=0.00..8.27 rows=1 width=14)
         Output: a, b, c
         Distribute Key: a
         Index Cond: (test.b = '2'::text)
(7 rows)

postgres=#

優化方法：索引條件上的常量盡可能使用和索引列相同類型的常量，避免發生隐式類型轉化。

4. 使用nestloop + indexscan 代替 hashjoin

此類語句的特征是兩個表關聯的時候，其中一個表上where條件過濾之後的結果集行數很小，同時，最終滿足條件的結果集行數也很小。此時，使用nestloop+indexscan的效果往往要由于hashjoin。較優的執行計劃如下：

可以看到，第5層的Index Cond: (t1.b = t2.b)已經把join條件下推到了基表掃描上。

postgres=# explain verbose select t1.a,t1.b from t1,t2 where t1.b=t2.b and t2.a=4;
 id |                    operation                     | E-rows | E-distinct | E-memory | E-width | E-costs 
----+--------------------------------------------------+--------+------------+----------+---------+---------
  1 | ->  Streaming (type: GATHER)                     |     26 |            |          |       8 | 17.97
  2 |    ->  Nested Loop (3,5)                         |     26 |            | 1MB      |       8 | 11.97
  3 |       ->  Streaming(type: BROADCAST)             |      2 |            | 2MB      |       4 | 2.78
  4 |          ->  Seq Scan on public.t2               |      1 |            | 1MB      |       4 | 2.62
  5 |       ->  Index Scan using t1_b_idx on public.t1 |     26 |            | 1MB      |       8 | 9.05
(5 rows)

 Predicate Information (identified by plan id) 
-----------------------------------------------
   4 --Seq Scan on public.t2
         Filter: (t2.a = 4)
   5 --Index Scan using t1_b_idx on public.t1
         Index Cond: (t1.b = t2.b)
(4 rows)

 Targetlist Information (identified by plan id) 
------------------------------------------------
   1 --Streaming (type: GATHER)
         Output: t1.a, t1.b
         Node/s: All datanodes
   2 --Nested Loop (3,5)
         Output: t1.a, t1.b
   3 --Streaming(type: BROADCAST)
         Output: t2.b
         Spawn on: datanode2
         Consumer Nodes: All datanodes
   4 --Seq Scan on public.t2
         Output: t2.b
         Distribute Key: t2.a
   5 --Index Scan using t1_b_idx on public.t1
         Output: t1.a, t1.b
         Distribute Key: t1.a
(15 rows)

   ====== Query Summary =====    
---------------------------------
 System available mem: 9262694KB
 Query Max mem: 9471590KB
 Query estimated mem: 5144KB
(3 rows)

如果優化器沒有選擇這種執行計劃，可以通過以下方式優化：

set enable_index_nestloop = on;

set enable_hashjoin = off;

set enable_seqscan = off;

5. 使用hint指定索引時指定的索引方式不對

GaussDB(DWS)的plan hint目前支援指定的Scan方式有三種：tablescan、indexscan和indexonlyscan。

tablescan：全表掃描，比如行存表的Seq Scan，列存表的CStore Scan

indexscan：先掃索引，再根據索引取表記錄

indexonlyscan：覆寫索引掃描，所需的傳回結果能被所掃描的索引全部覆寫。與index scan相比，index only scan所包含的字段集合，囊括了我們查詢語句中的字段，這樣，提取出相應的index ，就不必再根據索引取表記錄了。

是以，對于需要indexonlyscan的場景，如果hint指定了indexscan，該hint是無法生效的：

postgres=# explain verbose select/*+ indexscan(test)*/ b from test where b = '1';
WARNING:  unused hint: IndexScan(test)
                             QUERY PLAN                             
--------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=3.12..16.88 rows=100 width=2)
   Output: b
   Node/s: All datanodes
   ->  Seq Scan on public.test  (cost=0.00..10.88 rows=100 width=2)
         Output: b
         Distribute Key: a
         Filter: (test.b = '1'::text)
(7 rows)

postgres=# 
postgres=# explain verbose select/*+ indexonlyscan(test)*/ b from test where b = '1';
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=3.12..56.51 rows=100 width=2)
   Output: b
   Node/s: All datanodes
   ->  Index Only Scan using test_b_idx on public.test  (cost=0.00..50.51 rows=100 width=2)
         Output: b
         Distribute Key: a
         Index Cond: (test.b = '1'::text)
(7 rows)

優化方法：使用hint時正确指定indexscan和indexonlyscan。

6. 全文檢索gin索引

為了加速文本搜尋，進行全文檢索時可以建立GIN索引：

create index idxb on test using gin(to_tsvector('english',b));

建立索引時，必須使用to_tsvector的兩參數版本，并且隻有當查詢時也使用了兩參數版本，且參數值與索引中相同時，才會使用該索引：

postgres=# explain verbose select  * from test where to_tsvector(b) @@ to_tsquery('cat') order by 1;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=22.23..27.87 rows=12 width=14)
   Output: a, b, c
   Merge Sort Key: test.a
   Node/s: All datanodes
   ->  Sort  (cost=21.86..21.87 rows=12 width=14)
         Output: a, b, c
         Sort Key: test.a
         ->  Seq Scan on public.test  (cost=0.00..21.78 rows=11 width=14)
               Output: a, b, c
               Distribute Key: a
               Filter: (to_tsvector(test.b) @@ '''cat'''::tsquery)
(11 rows)

postgres=# 
postgres=# explain verbose select  * from test where to_tsvector('english',b) @@ to_tsquery('cat') order by 1;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Streaming (type: GATHER)  (cost=16.09..22.03 rows=2 width=14)
   Output: a, b, c
   Merge Sort Key: test.a
   Node/s: All datanodes
   ->  Sort  (cost=16.03..16.03 rows=2 width=14)
         Output: a, b, c
         Sort Key: test.a
         ->  Bitmap Heap Scan on public.test  (cost=12.00..16.02 rows=1 width=14)
               Output: a, b, c
               Distribute Key: a
               Recheck Cond: (to_tsvector('english'::regconfig, test.b) @@ '''cat'''::tsquery)
               ->  Bitmap Index Scan on idxb  (cost=0.00..12.00 rows=1 width=0)
                     Index Cond: (to_tsvector('english'::regconfig, test.b) @@ '''cat'''::tsquery)
(13 rows)

列舉GaussDB(DWS)常見的查詢時索引失效場景

1. 傳回結果集很大

2. 沒有analyze

3. 過濾條件使用了函數或隐式類型轉化導緻沒有走索引

場景1：使用計算

場景2：使用函數

場景3：隐式類型轉化

4. 使用nestloop + indexscan 代替 hashjoin

5. 使用hint指定索引時指定的索引方式不對

6. 全文檢索gin索引

繼續閱讀

MySQL和Lucene索引對比分析1. MySQL索引實作2. Lucene索引實作3. MySQL與Lucence對比參考：

一文搞定 MySQL 索引

MySQL（第五篇）—資料的完整性限制（2）（索引、引用完整性）資料的完整性限制（2）三、索引四、引用完整性

【MySQL索引】MySQL索引分類，90%的開發都不知道一、根據底層資料結構劃分二、根據索引字段個數劃分三、根據是否是在主鍵上建立的索引進行劃分四、根據資料與索引的存儲關聯性劃分五、其他分類六、總結

JS--------for循環删除數組元素

Python中字元串常見操作總結

Lucence的基本原理

sql執行計劃 explain 以及結合索引優化sql

MySQL - order by排序優化

.slice()取指定位置的值

2021-08-08 mysql索引

SQLServer 提升查詢速度

詳解SQL中幾種常用的表連接配接方式

oracle 中不使用已有的索引解決辦法

對first_name建立唯一索引uniq_idx_firstname問題描述Sql語句

記一次因MySQL編碼問題導緻的慢查詢排查