作者
digoal
日期
2017-04-26
标簽
PostgreSQL , 模糊查詢 , 正則查詢 , pg_trgm , bytea , gin , 函數索引
背景
前模糊(有字首的模糊),後模糊(有字尾的模糊),前後模糊(無前字尾的模糊),正則比對都屬于文本搜尋領域常見的需求。
PostgreSQL擁有很強的文本搜尋能力,除了支援全文檢索,還支援模糊查詢、正則查詢。内置的pg_trgm插件是一般資料庫沒有的,可能很多人沒有聽說過。同時還内置了表達式索引、GIN索引的功能。
不同的模糊查詢需求,有不同的優化方法。
對于前模糊和後模糊,PostgreSQL則與其他資料庫一樣,可以使用btree來加速。後模糊可以使用反轉函數的函數索引來加速。
對于前後模糊和正則比對,一種方法是使用pg_trgm插件,利用GIN索引加速模糊和正則查詢(輸入3個或3個以上字元的模糊查詢效果很好)。另一種方法是自定義GIN表達式索引的方法,适合于定制的模糊查詢。
一、前模糊與後模糊的優化
- 前模糊(有字首的模糊)優化方法
使用b-tree可以支援前模糊的查詢。
1.1 當使用類型預設的index ops class時,僅适合于collate="C"的查詢(當資料庫預設的lc_collate<>C時,索引和查詢都需要明确指定collate "C")。
索引、查詢條件的collate必須一緻才能使用索引。
例子
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info collate "C");
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)
Output: id, info
Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.424 ms
Execution time: 0.124 ms
(7 rows)
1.2 當資料庫預設的lc_collate<>C時,還有一種方法讓b-tree索引支援模糊查詢。使用對應類型的pattern ops,使用pattern ops将使用字元的查詢方式而非binary的搜尋方式。
文檔中有如下解釋
https://www.postgresql.org/docs/9.6/static/indexes-opclass.htmlThe operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops
support B-tree indexes on the types text, varchar, and char respectively.
The difference from the default operator classes is that the values are compared strictly
character by character rather than according to the locale-specific collation rules.
This makes these operator classes suitable for use by queries involving pattern
matching expressions (LIKE or POSIX regular expressions) when the database
does not use the standard "C" locale.
test=# drop table test;
DROP TABLE
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN")
Buffers: shared hit=12 read=3
Planning time: 0.253 ms
Execution time: 0.081 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=15
Planning time: 0.141 ms
Execution time: 0.072 ms
(7 rows)
使用類型對應的pattern ops,索引搜尋不僅支援LIKE的寫法,還支援規則表達式的寫法,如下:
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~ '^abcd'::text)
Buffers: shared hit=15
Planning time: 0.213 ms
Execution time: 0.083 ms
(7 rows)
- 後模糊(有字尾的模糊)的優化方法
使用反轉函數(reverse)索引,可以支援後模糊的查詢。
2.1 當使用類型預設的index ops class時,僅适合于collate="C"的查詢(當資料庫預設的lc_collate<>C時,索引和查詢都需要明确指定collate "C")。
test=# create index idx1 on test(reverse(info) collate "C");
CREATE INDEX
test=# select * from test limit 1;
id | info
----+----------------------------------
1 | b3275976cdd437a033d4329775a52514
(1 row)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.128 ms
Execution time: 0.122 ms
(7 rows)
test=# select * from test where reverse(info) like '4152%' collate "C";
id | info
--------+----------------------------------
847904 | abe2ecd90393b5275df8e34a39702514
414702 | 97f66d26545329321164042657d02514
191232 | 7820972c6220c2b01d46c11ebb532514
752742 | 93232ac39c6632e2540df44627c42514
217302 | 39e518893a1a7b1e691619bd1fc42514
1 | b3275976cdd437a033d4329775a52514
615718 | 4948f94c484c13dc6c4fae8a3db52514
308815 | fc2918ceff7c7a4dafd2e04031062514
149521 | 546d963842ea5ca593e622c810262514
811093 | 4b6eca2eb6b665af67b2813e91a62514
209000 | 1dfd0d4e326715c1739f031cca992514
937616 | 8827fd81f5b673fb5afecbe3e11b2514
419553 | bd6e01ce360af16137e8b6abc8ab2514
998324 | 7dff51c19dc5e5d9979163e7d14c2514
771518 | 8a54e30003a48539fff0aedc73ac2514
691566 | f90368348e3b6bf983fcbe10db2d2514
652274 | 8bf4a97b5f122a5540a21fa85ead2514
233437 | 739ed715fc203d47e37e79b5bcbe2514
(18 rows)
2.2 當資料庫預設的lc_collate<>C時,還有一種方法讓b-tree索引支援模糊查詢。使用對應類型的pattern ops,使用pattern ops将使用字元的查詢方式而非binary的搜尋方式。
使用類型對應的pattern ops,索引搜尋不僅支援LIKE的寫法,還支援規則表達式的寫法。
test=# create index idx1 on test(reverse(info) text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text)
Buffers: shared hit=15
Planning time: 0.102 ms
Execution time: 0.072 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ '^4152';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))
Filter: (reverse(test.info) ~ '^4152'::text)
Buffers: shared hit=15
Planning time: 0.148 ms
Execution time: 0.087 ms
(7 rows)
- 前、後模糊的合體優化方法
使用pg_trgm索引,可以支援前、後模糊的查詢。
注意:
(有字首的模糊)至少輸入1個字元,(有字尾的模糊)至少輸入2個字元,才有好的索引過濾效果。
如果要高效支援多位元組字元(例如中文),資料庫lc_ctype不能為"C",隻有TOKEN分割正确效果才是OK的。(lc_ctype設定正确,才能夠正确的逐一分割多位元組字元串中的文字: LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。
test=# \l+ test
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+----------+----------+------------+------------+-------------------+--------+------------+-------------
test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default |
(1 row)
test=# create extension pg_trgm;
test=# create table test001(c1 text);
CREATE TABLE
生成随機中文字元串的函數
test=# create or replace function gen_hanzi(int) returns text as
$$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$
language plpgsql strict;
CREATE FUNCTION
生成随機資料
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);
INSERT 0 100000
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
CREATE INDEX
test=# select * from test001 limit 5;
c1
------------------------------------------
埳噪辦甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅
秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅
蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓
馳泅薬鐗愅撞竅浉滲蛁灎厀攚摐瞪拡擜詜隝緼
襳鋪煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤
(5 rows)
模糊查詢
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Heap Blocks: exact=3
Buffers: shared hit=7
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=4
Planning time: 0.119 ms
Execution time: 0.063 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%恧顤'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=2
Buffers: shared hit=6
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)
Index Cond: (test001.c1 ~~ '%恧顤'::text)
Buffers: shared hit=4
Planning time: 0.136 ms
Execution time: 0.062 ms
(11 rows)
二、前後均模糊的優化
使用pg_trgm插件,支援前後模糊的查詢。
如果要讓pg_trgm高效支援多位元組字元(例如中文),資料庫lc_ctype不能為"C",隻有TOKEN分割正确效果才是OK的。(lc_ctype設定正确,才能夠正确的逐一分割多位元組字元串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。
建議輸入3個或3個以上字元,否則效果不佳(後面會分析原因)。
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Buffers: shared hit=4
Planning time: 0.170 ms
Execution time: 0.076 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢%'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢%'::text)
Buffers: shared hit=13292
Planning time: 0.133 ms
Execution time: 178.265 ms
(11 rows)
三、正則比對的優化
PostgreSQL 正則比對的文法為 字元串 ~ 'pattern' 或 字元串 ~* 'pattern'
https://www.postgresql.org/docs/9.6/static/functions-matching.htmltest=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
Buffers: shared hit=50
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1)
Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Buffers: shared hit=49
Planning time: 0.452 ms
Execution time: 0.221 ms
(11 rows)
test01=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '宸朾啣' collate "zh_CN";
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=6.58..19.42 rows=10 width=61) (actual time=0.061..0.061 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN")
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..6.58 rows=10 width=0) (actual time=0.049..0.049 rows=1 loops=1)
Index Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN")
Buffers: shared hit=4
Planning time: 0.238 ms
Execution time: 0.082 ms
(10 rows)
正則比對索引原理參考contrib/pg_trgm/trgm_regexp.c
pg_trgm模糊查詢的原理
首先,pg_trgm将字元串的前端添加2個空格,末尾添加1個空格。
然後,每連續的3個字元為一個TOKEN,拆開。
最後,對TOKEN建立GIN反向索引。
檢視字元串的TOKEN,可以使用如下方法。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
pg_trgm前後模糊字元個數要求的原因
使用pg_trgm時,如果要獲得最好的效果,最好滿足這些條件。
- 有字首的模糊查詢,例如a%,至少需要提供1個字元。( 搜尋的是token=' a' )
- 有字尾的模糊查詢,例如%ab,至少需要提供2個字元。( 搜尋的是token='ab ' )
- 前後模糊查詢,例如%abcd%,至少需要提供3個字元。( 這個使用數組搜尋,搜尋的是token(s) 包含 {" a"," ab",abc,bcd,"cd "} )
原因是什麼呢?
因為pg_trgm生成的TOKEN是三個字元,隻有在以上三個條件下,才能比對到對應的TOKEN。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
四、小于3個輸入字元的模糊查詢的優化
當需要前後模糊搜尋1個或者2個字元時,pg_trgm無法滿足需求,但是我們可以使用表達式GIN索引。
使用表達式,将字元串拆成1個單字,兩個連續的字元的數組,對數組建立GIN索引即可。
test=# create or replace function split001(text) returns text[] as
$$
declare
res text[];
begin
select regexp_split_to_array($1,'') into res;
for i in 1..length($1)-1 loop
res := array_append(res, substring($1,i,2));
end loop;
return res;
end;
$$
language plpgsql strict immutable;
CREATE FUNCTION
test=# create index idx_test001_2 on test001 using gin (split001(c1));
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你好'];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1)
Index Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
Planning time: 0.104 ms
Execution time: 0.068 ms
(9 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你}'::text[])
Heap Blocks: exact=80
Buffers: shared hit=84
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1)
Index Cond: (split001(test001.c1) @> '{你}'::text[])
Buffers: shared hit=4
Planning time: 0.101 ms
Execution time: 0.217 ms
(10 rows)
test=# select * from test001 where split001(c1) @> array['你'];
c1
------------------------------------------
殐踨洪冨垓丩賢閚偉垢胸鍘崩你萭隡劭芛雫袰
靅慨熱臉罆淓寘鰻總襎戍謸枨陪丼倫柆套你仮
......
五、相似查詢優化
模糊查詢和正則比對都是找出完全符合條件的記錄,還有一種需求是相似查詢。
例如postgresql字元串,輸入 p0stgresgl 也能根據相似度比對到。
這裡同樣用到了pg_trgm插件,如果要支援中文,同樣有這樣的要求:
如果需要讓pg_trgm支援中文相似查詢,資料庫lc_ctype不能為"C",隻有TOKEN分割正确效果才是OK的。(lc_ctype設定正确,才能夠正确的逐一分割多位元組字元串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。
test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> '癷磛鹚蠌鰓蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1)
Output: t.*, ((c1 <-> '癷磛鹚蠌鰓蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text))
Buffers: shared hit=1631
-> Index Scan using idx_test001_3 on public.test001 t (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1)
Output: t.*, (c1 <-> '癷磛鹚蠌鰓蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)
Order By: (t.c1 <-> '癷磛鹚蠌鰓蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)
Buffers: shared hit=1631
Planning time: 0.089 ms
Execution time: 37.668 ms
(9 rows)
test=# SELECT t, c1 <-> '癷磛鹚蠌鰓蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
t | dist
--------------------------------------------+----------
(癷磛鹚蠌鰓蠲你鶡埀婎鳊苿奶垨惸溴蔻筴熝憡) | 0.307692
(坆桻悁斾耾瑚豌腏炁悿隖轲盃挜稐睟礓蜮鉛湆) | 0.976744
(癷鉜餯祂鼃恫蝅瓟顡廕梍蛸歡僷贊敔欓侑韌鐹) | 0.976744
(癷嚯鳬戚蹪熼胘檙佌欔韜挹樷覄惶蹝顼鑜鞖媗) | 0.976744
(癷饎瞲餿堒歃峽盾豼擔禞嵪豦咢脈馄竨濟隘緘) | 0.976744
(5 rows)
六、小結
- 如果隻有前模糊查詢需求(字元串 like 'xx%'),使用collate "C"的b-tree索引;當collate不為"C"時,可以使用類型對應的pattern ops(例如text_pattern_ops)建立b-tree索引。
- 如果隻有後模糊的查詢需求(字元串 like '%abc' 等價于 reverse(字元串) like 'cba%'),使用collate "C"的reverse()表達式的b-tree索引;當collate不為"C"時,可以使用類型對應的pattern ops(例如text_pattern_ops)建立b-tree索引。
- 如果有前後模糊查詢需求,并且包含中文,請使用lc_ctype <> "C"的資料庫,同時使用pg_trgm插件的gin索引。(隻有TOKEN分割正确效果才是OK的。(lc_ctype設定正确,才能夠正确的逐一分割多位元組字元串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。)
- 如果有前後模糊查詢需求,并且不包含中文,請使用pg_trgm插件的gin索引。
- 如果有正規表達式查詢需求,請使用pg_trgm插件的gin索引。
- 如果有輸入條件少于3個字元的模糊查詢需求,可以使用GIN表達式索引,通過數組包含的方式進行搜尋,性能一樣非常好。
七、性能
1億條記錄,每條記錄15個随機中文。測試前後模糊查詢性能。
- 生成測試資料
vi test.sql
insert into test001 select gen_hanzi(15) from generate_series(1,2500000);
pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test
test=# select count(*) from test001;
count
-----------
100000000
(1 row)
test=# select * from test001 limit 10;
c1
--------------------------------
釾笉皜鰈确艄騚馺腃彊釲忰采汦擇
槮搮圮墔婂蹾飄孡鶒鎮赀聵線麯櫕
孨鄈韞萅赫炧暤蟠檼駧餪崉娲譌筯
烸喖醝稦怩鷟棾奜妛曫仾飛饡繪韋
撐豁襉峊炠眏罱襄彊鰮莆壏妒辷阛
蜁愊鶱磹貳帵眲嚉榑蒼潵檐簄椰魨
瑄翁蠃巨躋壾蛸湗鑂顂櫟砣八癱栵
馇巍笿鞒裝棊嘢恓煓熴锠鋈蹃煿屓
訆韄踔牤嘇糺絢軿鵑燿螛梋鰢謇郼
撲蓨傷釱糕觩嬖蓷鰼繩圓醷熌靉掑
(10 rows)
- 建立索引
表和索引大小test=# set maintenance_work_mem ='32GB'; test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
test=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+---------------+-------+----------+---------+-------+-------------
public | idx_test001_1 | index | postgres | test001 | 12 GB |
(1 row)
test=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+---------+-------------
public | test001 | table | postgres | 7303 MB |
(1 row)
- 模糊查詢性能測試
3.1 前模糊
響應時間:9毫秒
傳回4701行
select * from test001 where c1 like '你%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Rows Removed by Index Recheck: 85
Heap Blocks: exact=4776
Buffers: shared hit=4784
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=8
Planning time: 0.099 ms
Execution time: 9.166 ms
(11 rows)
3.2 後模糊
響應時間:0.25毫秒
傳回2行
select * from test001 where c1 like '%靉掑';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%靉掑';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%靉掑'::text)
Rows Removed by Index Recheck: 87
Heap Blocks: exact=89
Buffers: shared hit=94
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1)
Index Cond: (test001.c1 ~~ '%靉掑'::text)
Buffers: shared hit=5
Planning time: 0.113 ms
Execution time: 0.249 ms
(11 rows)
3.3 前後模糊
響應時間:0.2毫秒
傳回1行
select * from test001 where c1 like '%螛梋鰢%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%螛梋鰢%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Rows Removed by Index Recheck: 81
Heap Blocks: exact=82
Buffers: shared hit=87
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1)
Index Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Buffers: shared hit=5
Planning time: 0.112 ms
Execution time: 0.201 ms
(11 rows)
小結
三字或以上模糊查詢,使用pg_trgm可以很好的解決。
1,2個字的模糊查詢,使用表達式索引也可以實作很好的性能。
postgres=# create or replace function split_12(text) returns text[] as
$$
declare
res text[];
begin
select regexp_split_to_array($1, '') into res;
for i in 1..length($1)-1 loop
res := array_append(res, substring($1, i, 2));
end loop;
return res;
end;
$$
language plpgsql strict immutable;
CREATE FUNCTION
postgres=# select split_12('abc你好');
split_12
------------------------------
{a,b,c,你,好,ab,bc,c你,你好}
(1 row)
create index idx2 on tbl using gin (split_12(col));
select * from tbl where split_12(col) @> array['單字或雙字'];
建議應用在查詢的時候,判斷一下詞的個數,選擇正确的SQL寫法。