關于PostgreSQL中的多字段索引之三(gin篇)

前兩篇講的btree和gist的多字段索引，本篇順理成章地講一下gin的多字段索引。

前兩篇請參考這裡:

<a href="http://blog.chinaunix.net/uid-20726500-id-5088527.html" target="_blank">http://blog.chinaunix.net/uid-20726500-id-5088527.html</a>

<a href="http://blog.chinaunix.net/uid-20726500-id-5090166.html" target="_blank">http://blog.chinaunix.net/uid-20726500-id-5090166.html</a>

不像gist和btree，gin天生就适合做多字段索引，不管查詢條件覆寫所有索引字段還是僅僅覆寫一個子集，它都可以勝任。

<a href="http://www.postgres.cn/docs/9.3/indexes-multicolumn.html" target="_blank">http://www.postgres.cn/docs/9.3/indexes-multicolumn.html</a>

---------------------------------------------------------------

一個多字段的 GIN 索引可以用于那些查詢條件包含索引字段子集的查詢中。不像B-tree 或 GiST，除了查詢條件使用的索引字段外，索引的搜尋效率是相同的。

gin多字段索引的原理是，索引中的每個key由字段編号加上元素值構成，這樣的key作為一個整體構成反向索引樹。

參考：src/backend/access/gin/README

...

* In a single-column index, a key tuple just contains the key datum, but

in a multi-column index, a key tuple contains the pair (column number,

key datum) where the column number is stored as an int2. This is needed

to support different key data types in different columns. This much of

the tuple is built by index_form_tuple according to the usual rules.

The column number (if present) can never be null, but the key datum can

be, in which case a null bitmap is present as usual. (As usual for index

tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.)

為了和前面的btree和gist有個可比性，測試環境和資料還采用上一篇的标量資料，而不是采用gin特有的集合資料，是以要使用btree-gin擴充。

測試環境詳見

點選(此處)折疊或打開

postgres=# create extension btree_gin;

CREATE EXTENSION

postgres=# \timing

Timing is on.

postgres=# create index tb1_idx5 on tb1 using gin(c1,c2);

CREATE INDEX

Time: 23119.722 ms

postgres=# select pg_size_pretty(pg_relation_size('tb1_idx5'));

pg_size_pretty

----------------

129 MB

(1 row)

postgres=# explain (analyze,buffers) select count(*) from tb1 where c1=99 and c2=999;

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=404.70..404.71 rows=1 width=0) (actual time=15.797..15.798 rows=1 loops=1)

Buffers: shared hit=179 read=2

-> Bitmap Heap Scan on tb1 (cost=21.01..404.45 rows=99 width=0) (actual time=14.545..15.769 rows=92 loops=1)

Recheck Cond: ((c1 = 99) AND (c2 = 999))

Buffers: shared hit=179 read=2

-> Bitmap Index Scan on tb1_idx5 (cost=0.00..20.99 rows=99 width=0) (actual time=14.144..14.144 rows=92 loops=1)

Index Cond: ((c1 = 99) AND (c2 = 999))

Buffers: shared hit=87 read=2

Total runtime: 16.199 ms

(9 rows)

這個速度沒有btree和gist的多字段索引快，因為它的原理和BitmapAnd組合索引相似，要掃描2次索引再取交集，但比組合索引的50ms快。

3.3 c1+c2多字段gist索引處理單字段查詢

postgres=# explain (analyze,buffers) select count(*) from tb1 where c1=99;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=46880.08..46880.09 rows=1 width=0) (actual time=1569.289..1569.290 rows=1 loops=1)

Buffers: shared hit=142 read=39638

-> Bitmap Heap Scan on tb1 (cost=1097.08..46624.25 rows=102333 width=0) (actual time=37.484..1548.405 rows=99886 loops=1)

Recheck Cond: (c1 = 99)

Rows Removed by Index Recheck: 7363157

Buffers: shared hit=142 read=39638

-> Bitmap Index Scan on tb1_idx5 (cost=0.00..1071.50 rows=102333 width=0) (actual time=35.034..35.034 rows=99886 loops=1)

Index Cond: (c1 = 99)

Buffers: shared hit=78

Total runtime: 1569.548 ms

(10 rows)

postgres=# explain (analyze,buffers) select count(*) from tb1 where c2=999;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=23488.40..23488.41 rows=1 width=0) (actual time=79.362..79.362 rows=1 loops=1)

Buffers: shared hit=1974 read=6910

-> Bitmap Heap Scan on tb1 (cost=110.81..23464.27 rows=9652 width=0) (actual time=7.228..77.332 rows=9892 loops=1)

Recheck Cond: (c2 = 999)

Buffers: shared hit=1974 read=6910

-> Bitmap Index Scan on tb1_idx5 (cost=0.00..108.39 rows=9652 width=0) (actual time=4.468..4.468 rows=9892 loops=1)

Index Cond: (c2 = 999)

Buffers: shared hit=1 read=11

Total runtime: 79.408 ms

查詢速度和gist的多字段索引差不多。

PostgreSQL 9.4對gin的性能做了很大提升，下面再在9.4上重複一下上面的測試。

chenhj=# create table tb1(c1 int,c2 int);

CREATE TABLE

chenhj=# insert into tb1 select round(random()*100),round(random()*1000) from generate_series(1,10000000);

INSERT 0 10000000

chenhj=# select pg_size_pretty(pg_table_size('tb1'));

346 MB

chenhj=# create extension btree_gin;

chenhj=# create index tb1_idx5 on tb1 using gin(c1,c2);

chenhj=# select pg_size_pretty(pg_relation_size('tb1_idx5'));

47 MB

9.4裡gin索引的大小比9.3小了很多，隻有9.3的三分之一。

chenhj=# explain (analyze,buffers) select count(*) from tb1 where c1=99 and c2=999;

Aggregate (cost=423.95..423.96 rows=1 width=0) (actual time=3.399..3.399 rows=1 loops=1)

Buffers: shared hit=166

-> Bitmap Heap Scan on tb1 (cost=25.05..423.69 rows=103 width=0) (actual time=3.151..3.375 rows=117 loops=1)

Heap Blocks: exact=117

Buffers: shared hit=166

-> Bitmap Index Scan on tb1_idx5 (cost=0.00..25.03 rows=103 width=0) (actual time=3.125..3.125 rows=117 loops=1)

Buffers: shared hit=49

Planning time: 0.099 ms

Execution time: 3.448 ms

(11 rows)

比9.3快了很多，但仍然沒有btree和gist的多字段索引快，不過差别也不是太大。

查詢條件隻包含c1

chenhj=# explain (analyze,buffers) select count(*) from tb1 where c1=99;

------------------------------------------------------------------------------------------------------------------------------------

Aggregate (cost=46819.50..46819.51 rows=1 width=0) (actual time=590.436..590.437 rows=1 loops=1)

Buffers: shared hit=105 read=39583 written=700

-> Bitmap Heap Scan on tb1 (cost=981.50..46554.50 rows=106000 width=0) (actual time=49.895..570.300 rows=99790 loops=1)

Heap Blocks: exact=39665

Buffers: shared hit=105 read=39583 written=700

-> Bitmap Index Scan on tb1_idx5 (cost=0.00..955.00 rows=106000 width=0) (actual time=31.225..31.225 rows=99790 loops=1)

Buffers: shared hit=23

Planning time: 0.112 ms

Execution time: 590.496 ms

速度是9.3的2倍多，也比gist和btree都快。

查詢條件隻包含c2

chenhj=# explain (analyze,buffers) select count(*) from tb1 where c2=999;

Aggregate (cost=23539.39..23539.40 rows=1 width=0) (actual time=92.894..92.894 rows=1 loops=1)

Buffers: shared hit=1950 read=7130 written=3

-> Bitmap Heap Scan on tb1 (cost=95.12..23515.16 rows=9692 width=0) (actual time=5.672..90.431 rows=10178 loops=1)

Heap Blocks: exact=9073

Buffers: shared hit=1950 read=7130 written=3

-> Bitmap Index Scan on tb1_idx5 (cost=0.00..92.69 rows=9692 width=0) (actual time=2.671..2.671 rows=10178 loops=1)

Buffers: shared hit=6 read=1

Planning time: 0.187 ms

Execution time: 94.261 ms

和9.3差不多，和gist也差不多。

1）當查詢條件中同時出現多字段索引的所有條件時，多字段索引的速度優于多個字段單獨建索引然後通過BitmapAnd組合起來。btree和gist的多字段索引又快于gin多字段索引。

2）多字段索引用在隻包含全部索引字段的子集時，除了btree在查詢條件中多字段的第一個字段缺席時效率很差外，其它都表現良好。

3）9.4的gin索引的大小比btree和gist小了太多。（但gin有别的問題，在做比較查詢時有時效率不高，後面準備再通過例子講解）

4）9.4的gin索引比9.3的gin優化了太多了

最後引用一下手冊中關于使用多字段索引還是多索引組合的說明。加黑的那個限制似乎隻對btree的多字段索引有效。

-----------------------------------------------------------------------------------

關于PostgreSQL中的多字段索引之三(gin篇)

繼續閱讀

MySQL - order by排序優化

C#連接配接postgresql資料庫

2021-08-08 mysql索引

關于 java 轉碼 GBK -> UTF-8

postgresql替換換行符

查詢資料庫中存在的回車換行符

Django入門：第一個Django應用

SQLServer 提升查詢速度

postgresql 關于GBK和utf-8字元集問題

詳解SQL中幾種常用的表連接配接方式

oracle 中不使用已有的索引解決辦法

十四、MySQL備份和恢複資料庫1、備份和恢複的方法2、使用mysqldump導出3、恢複轉儲檔案4、字元編碼問題5、鎖表系列結語

對first_name建立唯一索引uniq_idx_firstname問題描述Sql語句

記一次因MySQL編碼問題導緻的慢查詢排查

PostgreSQL t_bits計算方法

為什麼要選擇UniDAC