背景
場景:
1、電商
商品、店鋪、描述等. 搜尋框模糊查詢
2、企業ERP、社交APP、網站
内容搜尋框, 模糊查詢
挑戰:
傳統資料庫不支援模糊查詢, 需要全表掃描, 性能極差, 少量并發查詢極可能導緻雪崩.
通常的解決方案需要将資料同步到搜尋引擎, 這種解決方案的弊端:
- 研發、軟硬體成本增加、
- 系統問題增多(同步延遲問題、同步異常問題、同步一緻性問題)、
- 開發靈活性下降(無法同時過濾模糊查詢條件與表的其他條件, 需要業務層交換資料)
- 不支援同時過濾like與其他條件
PG 解決方案:
- 反向索引GIN + btree_gin: 支援模糊查詢+任意字段組合過濾.
202105/20210502_01.md
《PostgreSQL 應用開發解決方案最佳實踐系列課程- 1. 中文分詞與模糊查詢》202009/20200913_01.md
《[直播]在資料庫中跑全文檢索、模糊查詢SQL會不會被開除?》202009/20200912_01.md
《PostgreSQL 模糊查詢、相似查詢(like '%xxx%') pg_bigm 比pg_trgm 優勢在哪?》202003/20200330_01.md
《PostgreSQL 模糊查詢增強插件pgroonga , pgbigm (含單字、雙字、多字、多位元組字元) - 支援JSON模糊查詢等》202001/20200116_01.md
《PostgreSQL+MySQL 聯合解決方案- 第12課視訊- 全文檢索、中文分詞、模糊查詢、相似文本查詢》201903/20190320_01.md
《PostgreSQL 數組裡面的元素,模糊搜尋,模糊查詢,like,前後百分号,正則查詢,反向索引》201805/20180502_01.md
《PostgreSQL 模糊查詢+大量重複值比對實踐- 分區索引= any (array())》201801/20180118_03.md
《PostgreSQL 模糊查詢與正則比對性能差異與SQL優化建議》201712/20171205_02.md
《用PostgreSQL 做實時高效搜尋引擎- 全文檢索、模糊查詢、正則查詢、相似查詢、ADHOC查詢》201711/20171107_13.md
《HTAP資料庫PostgreSQL 場景與性能測試之12 - (OLTP) 字元串搜尋- 前後模糊查詢》201711/20171107_10.md
《HTAP資料庫PostgreSQL 場景與性能測試之9 - (OLTP) 字元串模糊查詢- 含索引實時寫入》201710/20171020_01.md
《多國語言字元串的加密、全文檢索、模糊查詢的支援》201710/20171016_04.md
《Greenplum 模糊查詢實踐》201704/20170426_01.md
《PostgreSQL 模糊查詢最佳實踐- (含單字、雙字、多字模糊查詢方法)》201701/20170106_04.md
《PostgreSQL 全表全字段模糊查詢的毫秒級高效實作- 搜尋引擎顫抖了》201612/20161231_01.md
《從難纏的模糊查詢聊開- PostgreSQL獨門絕招之一GIN , GiST , SP-GiST , RUM 索引原理與技術背景》201605/20160506_02.md
《中文模糊查詢性能優化by PostgreSQL trgm》201603/20160302_01.md
《PostgreSQL 百億資料秒級響應正則及模糊查詢》例子
1、建立一個生成随機漢字字元串的函數
create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
------------------------------------------
埳噪辦甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅
秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅
蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓
馳泅薬鐗愅撞竅浉滲蛁灎厀攚摐瞪拡擜詜隝緼
襳鋪煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤
2、建立測試表
create unlogged table tbl (
id serial8 primary key,
gid int,
c1 text,
c2 text,
c3 text,
ts timestamp
);
3、寫入200萬記錄
insert into tbl (gid,c1,c2,c3,ts)
select random()*10000, gen_hanzi(16), gen_hanzi(32), gen_hanzi(128), clock_timestamp()
from generate_series(1,1000000);
select random()*10000, gen_hanzi(16)||'西出函谷關'||gen_hanzi(16), gen_hanzi(32), gen_hanzi(128), clock_timestamp()
from generate_series(1,100);
select random()*10000, gen_hanzi(16), gen_hanzi(32)||'西出函谷關'||gen_hanzi(16), gen_hanzi(128), clock_timestamp()
select random()*10000, gen_hanzi(16), gen_hanzi(32), gen_hanzi(128)||'西出函谷關'||gen_hanzi(16), clock_timestamp()
postgres=> select * from tbl limit 10;
id | gid | c1 | c2 | c3 | ts
----+------+----------------------------------+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------
1 | 8102 | 焼蒟郎犉閏甀鑹訓峔浧讵詆藕蔺柟蠬| 谂陱膁恚囓潽愪秾緌蟄錦姄肹囵縱錨醹狼憬榉烚态孠瘀觹枿瞴茒鲚煀踸槍| 齌壑落噴故傾虗靚蜃食晰跻年盹仿菣底依愔參鐞箍晀泭澴涭垧踟鹠膵硊杯剋蜩崚匵奊閷譳晍鯷瞇彇
敘奍绹妱刧鍾顄壩嵖斴敉崝欇銳蒺桰歆尒岊蟐鯿稕巑愬虊睬螈壓輝镄屙縿鳴唈餮袢鯖苁栛緬銪輟竈筗鶄族傷鱺師粂嘝鼝菞荃牶記拳熾凱凵恗摚慺帆皀饜亳唑耭烻更楿晰鰦淦說機倄梻讃悹擟吜缢搿| 2021-06-07 14:22:23.602992
2 | 6201 | 霿鴦瘖埔怘臏鏌齡媍嘦劌鴊獶趭槨獎| 猰鮍嘩騽椫梮立丄牸挬訟員喴鱯栆娕蘞鄤庯邀緘焇儮恧锜芁輊胒櫸蹀柚嵰| 秝坼惒肏痻嗍赺特噶珥鋧覹蹜掇秄樃鹉最杒醎蚩牓揱馽秪敶紵駝耑僝鶙粥聼婮峺桎裳錼狟軾瘨砉越
潓飂瘾恠腴侍鋞袖甁眎桘崪顁劑鐛轔罤試絑猈庼嫣胳邍睾桓磀卋甾駵嘪汨闳矜溕飖媻殎酎罥敫菚卑偸鸝鲃蘠褕阓癵龖缽俛糞瓗呒啲訳宐睙喽堯鶀梭郄平骞鏝峂鞷騢粳鸹鏠鎾傥註彆妡較騱鵯盌廠昂| 2021-06-07 14:22:23.603232
3 | 7696 | 岋趰赍酃応習瀢磴堵複趮嫳葬轫鱜卥| 誂寨辭剱犷懷狚憟勜暢嗦綋怱駦蘧定湜龑魝妕茲湂輾敾啭旭湥懴縤艾翪談| 員枤羵麄蒸枵絃饄簝啀縃犘排炗粶葈蕑耕肼跆灝皒烜栓風髕弦躡鵝楀璻猵菎厓撅煏猼蝙洘鏰躭坯垢
衺霽彰剡囶莬瑲騆怽謎旖鞹驟戦侕蛵筏缣舶猃急癪柞飌躰伩鈔媣瞗齊綏髦搆眑詂拄袡數藀蕃詳近彀諜肽喿鐠鵔株伅庰鎖蓫篴透泰熈窂蜿絾蛌筨闫鈃欥蠄基梬鷭組窷灎酲出謅攈橮蠪剓荇蔜嬽煡帍铈| 2021-06-07 14:22:23.603332
4 | 2499 | 哻篐榩劊滂藶颭窪股略腇憜媥蟁礌污| 罇膈孰堢茭彔刊多鼾瀹熘厴虛穘樓即跕體矄覲坾韻傂廸鸎酏鹹鰑鈌囨瘃聬| 杉浻膱煀鄙郩苧钰寴觸憱頮枏趘隸墐遟每蝀牯僵踓辂擉廈踾烴脜鶇譌遷薿鏱圽銘唈浼醸帿嶐蝵戮輼
劅鬩潊秇顦驑亓陶距裹貃誡髶趴鴟榩陦涼梚偉屜罴霱蜊蜰瑏袑腉莮蔭氘晝绤薻梏掵豨崧膋勸峭澠罼饋胢菎諃媏飱墅弐俓蹫莥鄂禢眅嬻钭籊嶄糞磻甤辒稈咾鱢蠁螪偸踧蒵豯韒阧蜁搆敷爰桌昖滃槬萌| 2021-06-07 14:22:23.603423
5 | 465 | 镦惈攂寷哘踧梧遜繫鑿偧襑夶剳東燙| 蠧刉蘆竼慤袰貃籘偺佬蠣颒蠸昲豭叞鞊鞻蒷葮黹褫孥咂黜芙耥扯繾炘魺廉| 邧霟搙趸濶拪迲贽兞鈆着泚蘀螆轀戶嘗肦択郵貙胊毞诶鴹詈庱梅騗麗磙萰醆樏骠贙機嶴謐撻金杶炱
鱢譫禹基涷絁苒仩祶譼詶镋甌粝皅刞噖怤租悳芃縭喐枩蒳佫笿韀飳哉籢弍阦壣僭崦桗玽覡緣鎼熗嶓詔煒謙糗讝忑椐旗覆歫墉瞫琷隙咓亷釚籡遼凷倁琧鶯醺齱角儷餗鱐紹撦鎠羨鱬崕珃鷧莊聊葸責仲| 2021-06-07 14:22:23.603521
6 | 7626 | 潖緎佸籐礙厫鈅猜咔啾釲悄瘛噭脩璣| 灙認舼飚霛樦镺搬巒羁爐縮貸銞戔俞嘺谥茝莊鐹摗兢欮釞份蕇諞啴綴驿篩| 鹦嘄濴愌漬暚列賎膧钔竕嫈奢妶佒镛扡沄葌鉙狊戓脀愫屯曯肏媼彸紺謨傳楋摭嚭贛秪觳訖艬噯鞀霫
芏姗斠殺覓齅穧爦癴吡歨碁傭鷗荘縮驥魭郓菂銘埿竟伓儖鰛鑼抟籶犋柍數粶蚷昃拝踤锇尣礩祳餹閘咘舴斂檖銀懦檱虥诋槻駍丣腡奠聃吭她膁濋嬝狆悮缁椢壟垃龐卣愄漕甝蕻鸐拄颾玎乓麏阹枑謃郛| 2021-06-07 14:22:23.603611
7 | 7876 | 習青枭釋贳鯽蓚舓閵擹轹嶨豌荂堀唎| 膬蔔簍跮鉚臚氠鵇峙秞胟桋壖蘪檧诒嵁孳迀锇汁磒鉻軡衍餌瘨茔饳毹詠藫| 蔵貉驊眶荔聛潢賂餶肞廏膜穐钛恌佚舼栛纏菧饑紉軜澢呅妁圅篢帒壱鳗抑诙舡箌淶稍絀廴菞籨捱孜
椓侖旾衛乆厵膣窴哓餙嘩囌鋀歇糧蒘師銎蜓崣瘄苚仟蹵笛枘峌饂摎疘艜栔遑撲檰仡喲帻赴杓泍蝓鲰欄霗蠚彇狩伒喁篾軡璮欣齝茖喫浕碞鋸葜蛐鴷苳奲魪傾潝捍腚竩蠷境儴攥峝膟攌趉哺锂涼諒脯募| 2021-06-07 14:22:23.6037
8 | 2515 | 鏩赗缃椚粓途阘蹸燱摢醕搌樫繮澚熜| 赿騍氧槴凍絗郉滆毬哮畻躩養蕿哔諴嵍畜鲿渦嚮団焐倎褛幯呧朥柏巜籍繯| 飛剄掞們瘂肌禡愀饻溆繜梜熸艵銹哴勇爮卉晐萂涖娿姟窶斝乒崦娯眝褃襗錏癸韓瞏觀灣猓泋凍熅橋
言襖褥欪蛻綛畲蹸喞鳥冶侲溉醪鲏楓銀鼧淯藫螗橛轸蘇汏嫑鮤摡驟紦鹹炔姷簥攬騙芡熙鸛顜或溜應檗掯門嫨鸐躐鎀搴狳箯跜擒輝奰劐鉸逕磸窟彀殶鮋詙誕疅舞逩淌暾褌髡踷隴湐峺佇水爝嚬胰臖愧| 2021-06-07 14:22:23.603789
9 | 4552 | 佼栫惾侊嗅戭豎靻寯腼踙紮渕寈婔柗| 母伄熖壯玊歕扄态轐櫞涏嶉筦灘蔔尌豦歶秦綑湸勭蘍簅賆隷辦铎撚蹄噉桋| 伨蓕絯秿掉锟體阗簻陷憏锬駊甁詙癰嬉閃妴膔佑咿荳柋懲侙蜏艷寠瓸臍愌譍豍泴廚喛趞迫骮匋簶瘴
蔪碌茰狛姉鋪碽蟠剡鹹懈嶃鷪褞棕與歯舩嶼烣忔揜惍螢芇楌燼嫵叒毤苐阄騞評橇呴廡餤奟鳥娋臭獍攫炲蓆礬娗魾黕珰往鑼埥驦垪踽隨枛讟鷘賚删跒鯵瑜螷窚洏攥月鰗敵貑嵛莢鞓郎釙酙慬眨鮒蠭柸| 2021-06-07 14:22:23.603886
10 | 4287 | 琮赙畍墎嫾甽冐淲釚圐錌慚襱蕬偃構| 磈崎榠齒嫆甋宺單鹺厃掱瑻掝豯掄峀軆讬鱫掙渭覯吐傖膚掰煂裷飴趱汮烅| 瑧裃棍躝衿儾鰥傢擰艣鴎礔縷仞田萹捔捒釦荅竌丣绫賂穧籀醰荙齥兇潰一铚徥锢蒂淾隍懋尞棻嵬諊
躣膣苽慟盬犲悻協碣劓嵋憏蹒圎襴夈諞墜撈騺婙爘葕錥卻叆飝釐鮜汑艴滅暬幛跗尵轎殣醇癢腙鳑唚芉鯉紀錰譢抉帀眰煾問甿擽釳澿死領貊粝轝澴鷖拳斘銤圄飳甉樴撷椩卉聗疾耓搶澅鉁禙繊錧鯫哢| 2021-06-07 14:22:23.603975
(10 rows)
4、傳統資料庫, LIKE查詢需要全表掃描
postgres=> explain (analyze) select * from tbl where c1 like '%西出函谷關%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on tbl (cost=0.00..178875.75 rows=200 width=554) (actual time=426.079..846.378 rows=100 loops=1)
Filter: (c1 ~~ '%西出函谷關%'::text)
Rows Removed by Filter: 2000200
Planning Time: 0.077 ms
Execution Time: 846.399 ms
(5 rows)
5、PG, 使用模糊查詢反向索引.
create extension pg_trgm;
create extension btree_gin;
注意, 如果show_trgm沒有傳回, 說明你的資料庫ctype設定有問題, 不能為C, 否則切不出詞. 即使用了GIN索引也無法達到過濾效果, 性能極差.
模糊查詢的原理是自動切詞和GIN倒排前加2空格,尾加1空格, 連續的三個字元為一個token進行切分.
select show_trgm('西出函谷關'); -- 正常傳回表示中文可以生效, 否則建議檢視你的ctype是否有問題
postgres=> select show_trgm('西出函谷關');
show_trgm
---------------------------------------------------------
{0x88199a,0x9d411e,0xa92fa7,0xd8e240,0xf3ecce,0x4ab21c}
(1 row)
這個索引支援gid查詢, 支援c1,c2,c3的like查詢.
create index idx_tbl_1 on tbl using gin (gid, c1 gin_trgm_ops, c2 gin_trgm_ops, c3 gin_trgm_ops);
6、任意字段like
select * from tbl where c1 like '%西出函谷關%';
select * from tbl where c2 like '%西出函谷關%' or c3 like '%西出函谷關%';
select * from tbl where c2 like '%西出函谷關%' and c3 like '%西出函谷關%';
select * from tbl where c3 like '%西出函谷關%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=25.75..247.53 rows=200 width=554) (actual time=0.039..0.077 rows=100 loops=1)
Recheck Cond: (c1 ~~ '%西出函谷關%'::text)
Heap Blocks: exact=9
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.033..0.033 rows=100 loops=1)
Index Cond: (c1 ~~ '%西出函谷關%'::text)
Planning Time: 0.070 ms
Execution Time: 0.096 ms
(7 rows)
postgres=> explain analyze select * from tbl where c2 like '%西出函谷關%' or c3 like '%西出函谷關%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=51.60..495.56 rows=400 width=554) (actual time=0.900..1.142 rows=200 loops=1)
Recheck Cond: ((c2 ~~ '%西出函谷關%'::text) OR (c3 ~~ '%西出函谷關%'::text))
Heap Blocks: exact=18
-> BitmapOr (cost=51.60..51.60 rows=400 width=0) (actual time=0.892..0.892 rows=0 loops=1)
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.382..0.382 rows=100 loops=1)
Index Cond: (c2 ~~ '%西出函谷關%'::text)
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.509..0.509 rows=100 loops=1)
Index Cond: (c3 ~~ '%西出函谷關%'::text)
Planning Time: 0.184 ms
Execution Time: 1.173 ms
postgres=> explain analyze select * from tbl where c2 like '%西出函谷關%' and c3 like '%西出函谷關%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbl (cost=47.30..48.42 rows=1 width=554) (actual time=0.059..0.059 rows=0 loops=1)
Recheck Cond: ((c2 ~~ '%西出函谷關%'::text) AND (c3 ~~ '%西出函谷關%'::text))
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..47.30 rows=1 width=0) (actual time=0.057..0.057 rows=0 loops=1)
Index Cond: ((c2 ~~ '%西出函谷關%'::text) AND (c3 ~~ '%西出函谷關%'::text))
Planning Time: 0.180 ms
Execution Time: 0.078 ms
(6 rows)
postgres=> explain analyze select * from tbl where c3 like '%西出函谷關%';
QUERY PLAN
Bitmap Heap Scan on tbl (cost=25.75..247.53 rows=200 width=554) (actual time=0.046..0.169 rows=100 loops=1)
Recheck Cond: (c3 ~~ '%西出函谷關%'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..25.70 rows=200 width=0) (actual time=0.038..0.038 rows=100 loops=1)
Index Cond: (c3 ~~ '%西出函谷關%'::text)
Planning Time: 0.181 ms
Execution Time: 0.192 ms
7、普通字段+其他字段like
select * from tbl where gid=1 and c3 like '%西出函谷關%';
postgres=> explain analyze select * from tbl where gid=32 and c1 like '%西出函谷關%';
QUERY PLAN
Bitmap Heap Scan on tbl (cost=31.90..33.02 rows=1 width=554) (actual time=0.032..0.033 rows=1 loops=1)
Recheck Cond: ((gid = 32) AND (c1 ~~ '%西出函谷關%'::text))
Heap Blocks: exact=1
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..31.90 rows=1 width=0) (actual time=0.027..0.028 rows=1 loops=1)
Index Cond: ((gid = 32) AND (c1 ~~ '%西出函谷關%'::text))
Planning Time: 0.082 ms
Execution Time: 0.051 ms
postgres=> explain analyze select * from tbl where gid=32 and c3 like '%西出函谷關%';
Bitmap Heap Scan on tbl (cost=31.90..33.02 rows=1 width=554) (actual time=0.037..0.037 rows=0 loops=1)
Recheck Cond: ((gid = 32) AND (c3 ~~ '%西出函谷關%'::text))
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..31.90 rows=1 width=0) (actual time=0.035..0.035 rows=0 loops=1)
Index Cond: ((gid = 32) AND (c3 ~~ '%西出函谷關%'::text))
Planning Time: 0.160 ms
Execution Time: 0.055 ms
相比全表掃描, 性能提升10000倍以上.
原理
1、token 倒排.
2、雙、單字like怎麼搞?