Tags
PostgreSQL, gin, inverted index, rum, gin_fuzzy_search_limit, random sampling, partitioned index, segmented index, score segmentation
## Background
There are several ways to optimize arbitrary-column combination queries:
1. Column store
2. RUM
3. GIN
4. BITMAP AND/OR SCAN across multiple indexes
5. BLOOM FILTER
《PostgreSQL 實踐 - 廣告位推薦 1》 used the RUM approach; the point of RUM there was to avoid GIN's CPU recheck cost. However, if the business side can accept GIN's sampling limit at query time, RUM is unnecessary.
《[未完待續] PostgreSQL 全文檢索 大結果集優化 - fuzzy match》
This article uses a new design to implement personalized e-commerce recommendation: when a user opens the home page, a shop, or any other page, the system recommends in real time, based on the user's behavior, the best-ranked content related to that page (the recommended objects may be products, categories, and so on).
## Design 1
The basic idea is to use a GIN inverted index together with the fuzzy-match parameter to filter the massive data set, then sort and return results within a much smaller candidate set.
Note that this approach is only viable if the business side accepts sampled (approximate) output.
1. Dictionary table
create table tbl_dict (
dim text, -- dimension
val int8 not null unique, -- mapped value within the dimension (unique across the whole value space, so that all dimensions can be packed into a single array)
info text -- original value / description
);
create index idx_tbl_dict_1 on tbl_dict(dim,info);
Get a dimension's mapped value:
select val from tbl_dict where dim=? and info=?;
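For example, supposing dimension 2 encoded city names (both the dimension and the value here are hypothetical, purely for illustration):
-- hypothetical lookup: the mapped value of city 'hangzhou' in dimension 2
select val from tbl_dict where dim='2' and info='hangzhou';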
2. Behavior tag table
create table tbl_lab (
id serial8 primary key, -- primary key
dict int8[], -- with N dimensions, this array holds N elements
score float4, -- score
itemid int8 -- e.g. a product ID (or whatever ID the end user ultimately needs)
);
-- Dimensions that cannot be discretized are still kept as regular columns in tbl_lab.
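For context, the write path builds each item's dict array by looking up the item's original values in tbl_dict. A minimal sketch, assuming hypothetical info values 'red' and 'shoes':
insert into tbl_lab (dict, score, itemid)
values (
array(
-- one (dim, info) pair for each dimension the item carries
select val from tbl_dict where (dim,info) in (('0','red'), ('1','shoes'))
),
99.5, -- score
12345 -- itemid
);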
The original way to filter the data:
select itemid from tbl_lab where dim1=? and dim10=? and dim12=? order by score desc limit 100;
is rewritten as:
set gin_fuzzy_search_limit=2000;
select * from tbl_lab where dict @> array(
select val from tbl_dict where (dim,info) in (('1',?), ('10',?), ('12',?))
)
order by score desc limit 100;
3. Create the GIN index
create index idx_tbl_lab_dict on tbl_lab using gin (dict);
4. Load test data
Assume there are 100 dimensions, each with some number of values in its value space, for a total of 10 million distinct values.
insert into tbl_dict select (random()*99)::int, generate_series(1,10000000), md5(random()::text);
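As an optional sanity check (not part of the original procedure), the 10 million dictionary entries should spread roughly evenly across the 100 dimensions:
-- roughly 100k entries per dimension; dims 0 and 99 get about half as many,
-- because (random()*99)::int rounds to the nearest integer
select dim, count(*) from tbl_dict group by dim order by dim::int;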
-- pick one value of the given dimension, using a block-level random sample for speed
create or replace function get_val(text) returns int8 as $$
select val from tbl_dict tablesample system (0.1) where dim=$1 limit 1;
$$ language sql strict;
-- build a random 100-element array: one dictionary value for each of dimensions 0..99
create or replace function get_vals() returns int8[] as $$
select array_agg(get_val(id::text)) from generate_series(0,99) t(id);
$$ language sql strict;
Load 100 million tag records
vi test.sql
-- score is generated in 0..100, matching the score range assumed by the segmented indexes in Design 2
insert into tbl_lab select get_vals(), random()*100, random()*100000000 from generate_series(1,100);
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -t 17857
Space usage:
public | tbl_lab | table | postgres | 81 GB |
public | idx_tbl_lab_dict | index | postgres | tbl_lab | 425 GB |
5. Filter the data, using fuzzy match to shrink the result set, then sort by score and output the TOP N
-- pick $1 random dimensions and return one dictionary value for each;
-- declared stable so the planner evaluates the array once per query rather than per row
create or replace function get_vals1(int) returns int8[] as $$
select array_agg(get_val(id::text)) from (select generate_series(0,99) order by random() limit $1) t(id);
$$ language sql strict stable;
set gin_fuzzy_search_limit=2000;
select * from tbl_lab where dict @> get_vals1(5)
order by score desc limit 100;
postgres=# set gin_fuzzy_search_limit =1;
SET
Time: 0.213 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
80
(1 row)
Time: 647.802 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
76
(1 row)
Time: 1087.094 ms (00:01.087)
postgres=# set gin_fuzzy_search_limit =10;
SET
Time: 0.174 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
83
(1 row)
Time: 198.663 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
3244
(1 row)
Time: 78.824 ms
postgres=# set gin_fuzzy_search_limit =100;
SET
Time: 0.202 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
4718
(1 row)
Time: 54.961 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
4881
(1 row)
Time: 49.879 ms
postgres=# set gin_fuzzy_search_limit =1000;
SET
Time: 0.176 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
5783
(1 row)
Time: 46.311 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
5784
(1 row)
Time: 45.930 ms
postgres=# set gin_fuzzy_search_limit =5000;
SET
Time: 0.219 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
9156
(1 row)
Time: 48.888 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
9382
(1 row)
Time: 49.479 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
9265
(1 row)
Time: 48.514 ms
postgres=# set gin_fuzzy_search_limit =20000;
SET
Time: 0.231 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
22432
(1 row)
Time: 58.063 ms
postgres=# select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
count
-------
22746
(1 row)
Time: 56.720 ms
A limit of around 5000 works best. Counterintuitively, when the limit is very small the scan visits more data, not less; this appears to be an artifact of the sampling algorithm (compare the plans below: 145,266 shared buffers hit at limit 10 versus 187 at limit 5000):
postgres=# set gin_fuzzy_search_limit =10;
SET
Time: 0.188 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1702903.64..1702903.65 rows=1 width=8) (actual time=135.104..135.104 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=145266
-> Bitmap Heap Scan on public.tbl_lab (cost=3868.90..1701675.35 rows=491316 width=0) (actual time=135.044..135.082 rows=78 loops=1)
Recheck Cond: (tbl_lab.dict @> '{122562}'::bigint[])
Heap Blocks: exact=78
Buffers: shared hit=145266
-> Bitmap Index Scan on idx_tbl_lab_dict (cost=0.00..3746.07 rows=491316 width=0) (actual time=96.252..96.252 rows=78 loops=1)
Index Cond: (tbl_lab.dict @> '{122562}'::bigint[])
Buffers: shared hit=145248
Planning Time: 0.190 ms
JIT:
Functions: 5
Generation Time: 1.091 ms
Inlining: true
Inlining Time: 5.746 ms
Optimization: true
Optimization Time: 22.590 ms
Emission Time: 10.321 ms
Execution Time: 136.271 ms
(20 rows)
Time: 136.887 ms
postgres=# set gin_fuzzy_search_limit =5000;
SET
Time: 0.222 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from tbl_lab where dict @> array[122562]::int8[] ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1702903.64..1702903.65 rows=1 width=8) (actual time=48.953..48.953 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=187
-> Bitmap Heap Scan on public.tbl_lab (cost=3868.90..1701675.35 rows=491316 width=0) (actual time=45.491..48.031 rows=9290 loops=1)
Recheck Cond: (tbl_lab.dict @> '{122562}'::bigint[])
Heap Blocks: exact=9223
Buffers: shared hit=187
-> Bitmap Index Scan on idx_tbl_lab_dict (cost=0.00..3746.07 rows=491316 width=0) (actual time=5.027..5.027 rows=9290 loops=1)
Index Cond: (tbl_lab.dict @> '{122562}'::bigint[])
Buffers: shared hit=166
Planning Time: 0.165 ms
JIT:
Functions: 5
Generation Time: 1.154 ms
Inlining: true
Inlining Time: 6.152 ms
Optimization: true
Optimization Time: 22.501 ms
Emission Time: 10.273 ms
Execution Time: 50.183 ms
(20 rows)
Time: 50.771 ms
6. Performance test
vacuum tbl_lab;
vi test.sql
select * from tbl_lab where dict @> get_vals1(5)
order by score desc limit 100;
pgbench -M extended -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 120
7. Update or delete dimension values and scores
Examples:
update tbl_lab set score=? where id=?;
update tbl_lab set dict=array_replace(dict, old_val, new_val) where id=?;
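A fuller, hypothetical example that resolves the old and new mapped values through the dictionary (the dim and info values here are made up for illustration):
-- replace item 1's value for dimension 10: look up both the old and the new
-- mapped value in tbl_dict, then swap them inside the dict array
update tbl_lab
set dict = array_replace(
dict,
(select val from tbl_dict where dim='10' and info='old info'),
(select val from tbl_dict where dim='10' and info='new info')
)
where id=1;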
## Design 2
Use segmented (stepped) RUM indexes to get maximum-throughput arbitrary-dimension TOP-K output without sacrificing precision.
1. Use stepped RUM indexes (similar to segmenting or partitioning)
Assume score ranges from 0 to 100. We split score into 100 segments, one point each. (In practice, choose the step size according to the business's score bands.)
The same technique has also been used in the following cases:
《PostgreSQL 相似搜尋設計與性能 - 位址、QA、POI等文本 毫秒級相似搜尋實踐》
《PostgreSQL 相似搜尋分布式架構設計與實踐 - dblink異步調用與多機并行(遠端 遊标+記錄 UDF執行個體)》
If the business can define a cutoff, e.g. records scoring below 50 never need to be shown, then the indexes only need to cover the records above 50.
do language plpgsql $$
declare
begin
for i in 1..100 loop
execute format('create index idx_tbl_lab__%s on tbl_lab using rum (dict rum_anyarray_ops) where score >%s and score <=%s', i, (i-1), i);
end loop;
end;
$$;
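A quick way to confirm the planner picks the matching partial index (reusing the dictionary value 122562 from the earlier session; the expected plan is a scan on idx_tbl_lab__100, whose predicate this query matches exactly):
explain select * from tbl_lab
where dict @> array[122562]::int8[] and score >99 and score <=100
order by score desc limit 100;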
2. Query via a UDF, scanning the segments from high score to low
If the business can define a cutoff, e.g. scores below 50 never need to be shown, the query loop likewise only needs to cover scores above 50 (see the variant sketched after the function below).
create or replace function get_dict (
int8[], -- dimension values
int -- how many rows to return
) returns setof tbl_lab as $$
declare
r_cnt int := 0;
v_tmp int := 0;
begin
-- walk the score segments from the highest, (99,100], down to (0,1]
for i in 1..100 loop
return query select * from tbl_lab where dict @> $1 and score >(100-i) and score<=(101-i)
order by score desc limit $2;
GET DIAGNOSTICS v_tmp = ROW_COUNT;
r_cnt := r_cnt + v_tmp;
-- stop once enough rows have accumulated; each segment contributes up to $2 rows,
-- so the function may return slightly more than $2 rows in total
if r_cnt >= $2 then
return;
end if;
end loop;
return;
end;
$$ language plpgsql strict;
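If the 50-point cutoff mentioned above applies, the loop can stop at the cutoff instead of walking all 100 segments. A minimal sketch (get_dict_min_score is a hypothetical name, not from the original design):
create or replace function get_dict_min_score (
int8[], -- dimension values
int, -- how many rows to return
int -- score cutoff, e.g. 50: segments at or below it are never scanned
) returns setof tbl_lab as $$
declare
r_cnt int := 0;
v_tmp int := 0;
begin
-- only walk the segments above the cutoff, from high score to low
for i in 1..(100-$3) loop
return query select * from tbl_lab where dict @> $1 and score >(100-i) and score<=(101-i)
order by score desc limit $2;
GET DIAGNOSTICS v_tmp = ROW_COUNT;
r_cnt := r_cnt + v_tmp;
if r_cnt >= $2 then
return;
end if;
end loop;
return;
end;
$$ language plpgsql strict;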
3. Performance test
postgres=# select score,itemid from get_dict(get_vals1(1),100);
score | itemid
---------+----------
99.9529 | 36742578
99.9507 | 69844786
99.941 | 83415934
99.9284 | 46894536
99.9181 | 24389328
...
98.105 | 62905250
98.1028 | 83484134
98.1006 | 67573139
98.0984 | 19020938
98.0983 | 90873124
98.093 | 4732945
98.0885 | 25186764
98.0316 | 97861252
98.0246 | 50682057
(173 rows)
Time: 6.397 ms
postgres=# select score,itemid from get_dict(get_vals1(1),20) ;
score | itemid
---------+----------
99.991 | 27411195
99.9559 | 20090883
99.9478 | 55444281
99.9341 | 70071418
99.9255 | 632316
99.9252 | 70685672
99.8897 | 36828714
99.8714 | 48261720
99.8506 | 92092732
99.811 | 57477121
99.7868 | 52143704
99.7526 | 13161677
99.7496 | 92728450
99.7318 | 73244372
99.6917 | 1948099
99.6274 | 48124431
99.5875 | 76672257
99.5636 | 7682029
99.5593 | 4137987
99.5535 | 93647650
(20 rows)
Time: 1.252 ms
Stress test
vi test.sql
select score,itemid from get_dict(get_vals1(1),100);
Stress test results
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 120
number of transactions actually processed: 964316
latency average = 6.969 ms
latency stddev = 14.058 ms
tps = 8029.326932 (including connections establishing)
tps = 8029.687576 (excluding connections establishing)
## Summary
1. gin_fuzzy_search_limit works as a throttle (with some randomness; it is not exact), but the performance gain is modest (around 5000 performs best), and the tests show that very small values of gin_fuzzy_search_limit actually hurt performance.
2. With segmented RUM indexes we get highly efficient filtering: in this round of testing on 100 million rows (81 GB of data, 81 GB of indexes), with 100 dimensions, sorting by SCORE and fetching a random TOP 100, we reached 8,029 QPS with exact, fully sorted results.
For higher concurrency, add more read-only nodes.
The test instance was the top RDS PG 10 specification, at 6,140 RMB per month (56 cores, 480 GB of memory, 2 TB of storage).
The only difference between this solution and
《PostgreSQL ADHoc(任意字段組合)查詢 與 字典化 (rum索引加速) - 實踐與方案1》
is that the output here must be sorted by SCORE, so we use segmented indexes plus a UDF for output, which neatly lifts the overall throughput.
The same segmentation technique has also been used in the following cases.