Background
Scenario:
Tag-based dynamic recommendation in short video, e-commerce, social networking, and similar businesses.
Challenges:
Many tags: fetching N rows per tag, in proportion to tag weight, means many round trips to the database.
The already-read list is huge: filtering it out with a traditional NOT IN consumes a lot of resources.
The result is poor efficiency.
PG solution:
Efficiency improved 1000x, potentially worth 100 million (RMB).
Techniques: arrays, a custom type, partial indexes to cut wasted filtering, hashed-SubPlan optimization of NOT IN, array_agg, jsonb.
Example
Take short-video recommendation as the example.
PG solution
1、Video resource pools
1.1、National pool
create table pool_global (
  vid int8,     -- video ID
  tag int,      -- tag
  score float4  -- tag weight
);
Load data: 10 million rows (1 million videos, 10 tags per video, tag values in 1-100).
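The load described above can be sketched as follows (a minimal sketch: tags are drawn at random, so a video may occasionally repeat a tag, which is acceptable for benchmark data):

```sql
insert into pool_global (vid, tag, score)
select vid, ceil(random()*100)::int, (random()*10)::float4
from generate_series(1, 1000000) as vid,  -- 1 million videos
     generate_series(1, 10) as t(n);      -- 10 tags each => 10 million rows
```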
1.2、Province pool
create table pool_province (
  pid int,      -- province ID
  vid int8,     -- video ID
  tag int,      -- tag
  score float4  -- tag weight
);
pid values in 1-36.
1.3、City pool
create table pool_city (
  cid int,      -- city ID
  vid int8,     -- video ID
  tag int,      -- tag
  score float4  -- tag weight
);
cid values in 1-10000.
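The article does not state row counts for the province and city pools, so the volumes below are placeholder assumptions; the disjoint vid ranges are inferred from the sample output further down (province vids fall in roughly 1-2 million, city vids in 2-3 million):

```sql
-- Assumed volumes: 1 million videos x 10 tags = 10 million rows per pool.
insert into pool_province (pid, vid, tag, score)
select ceil(random()*36)::int, vid, ceil(random()*100)::int, (random()*10)::float4
from generate_series(1000001, 2000000) as vid,  -- province vid range (inferred)
     generate_series(1, 10) as t(n);

insert into pool_city (cid, vid, tag, score)
select ceil(random()*10000)::int, vid, ceil(random()*100)::int, (random()*10)::float4
from generate_series(2000001, 3000000) as vid,  -- city vid range (inferred)
     generate_series(1, 10) as t(n);
```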
1.4、Partial indexes (partitioned indexes) avoid the massive wasted scans that a huge filter list would otherwise cause; this was covered earlier in 《重新發現PostgreSQL之美- 23 彭祖的長壽秘訣》.
do language plpgsql $$
declare
begin
  for i in 0..49 loop
    execute format('create index idx_pool_global_%s on pool_global (tag,score desc) include (vid) where abs(mod(hashint8(vid),50))=%s', i,i);
    execute format('create index idx_pool_province_%s on pool_province (pid,tag,score desc) include (vid) where abs(mod(hashint8(vid),50))=%s', i,i);
    execute format('create index idx_pool_city_%s on pool_city (cid,tag,score desc) include (vid) where abs(mod(hashint8(vid),50))=%s', i,i);
  end loop;
end;
$$;
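The planner only picks one of these partial indexes when the query restates the index predicate verbatim, which is exactly what the recommendation SQL further down does. A minimal probe of a single shard:

```sql
-- Restating the partial-index predicate lets the planner use
-- idx_pool_global_0 alone, instead of scanning all 10M rows:
select vid from pool_global
where tag = 1
  and abs(mod(hashint8(vid),50)) = 0
order by score desc
limit 10;
```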
2、User tag type
create type tag_score as (
  tag int,      -- tag
  score float4, -- tag weight
  limits int    -- how many VIDs to fetch for this tag
);
3、Users table
create table users (
  uid int8 primary key,    -- user ID
  pid int,                 -- province ID
  cid int,                 -- city ID
  tag_scores1 tag_score[], -- tags, weights, per-tag fetch counts; jsonb storage would also work
  tag_scores2 tag_score[], -- same structure, for the tags with limit = 0. The application recomputes tag_scores from both fields, mainly to reduce PG's computation load.
  readlist jsonb           -- already-read VIDs, sharded to match the partial-index partitions and stored as a jsonb array of arrays: readlist[0] is the array of vids with abs(mod(hashint8(vid),50))=0
);
Load data: 10 million users, each with 20 tags (tag values in 1-100), of which 5 tags have limit > 0 (limits summing to 100); an already-read list of 50,000 vids (values in 1-3,000,000).
insert into users
select generate_series(1,10000000), ceil(random()*36), ceil(random()*10000),
array(
select row(ceil(random()*100), random()*10, 40)::tag_score
union all
select row(ceil(random()*100), random()*10, 20)::tag_score
union all
select row(ceil(random()*100), random()*10, 15)::tag_score
union all
select row(ceil(random()*100), random()*10, 15)::tag_score
union all
select row(ceil(random()*100), random()*10, 10)::tag_score
),
array (
select row(ceil(random()*100), random()*10, 0)::tag_score from generate_series(1,15)
),
(
select jsonb_agg(x) as readlist from
(
select array (select x from
(select ceil(random()*3000000)::int8 x from generate_series(1,50000)) t
where abs(mod(hashint8(x),50))=i  -- shard on the same key as the partial indexes
) x
from generate_series(0,49) i
) t
) ;
4、How are a user's tag weights and per-tag fetch counts updated?
Principle: anything that can be computed in the application without adding extra round trips to the database should be computed in the application.
Fetch the row by UID, compute the new tag_scores in the application, and write the result back to the users table.
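A minimal sketch of the write-back, assuming the application binds the recomputed arrays as parameters ($1-$3 are placeholders):

```sql
update users
set tag_scores1 = $1::tag_score[],  -- tags with limits > 0, recomputed in the app
    tag_scores2 = $2::tag_score[]   -- tags with limits = 0
where uid = $3;
```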
5、SQL to fetch the recommended vids:
select
(
select array_agg(vid) from
(
select vid from pool_global t1
where t1.tag=t.tag
and t1.vid not in (select jsonb_array_elements_text( readlist[0] )::int8 from users where uid=1)
and abs(mod(hashint8(vid),50)) = 0
order by t1.score desc
limit ceil(t.limits*0.5)
) x -- national pool 50%
) as global_pool,
(
select array_agg(vid) from
(
select vid from pool_province t1
where t1.tag=t.tag
and t1.pid=(select pid from users where uid=1)
and t1.vid not in (select jsonb_array_elements_text( readlist[0] )::int8 from users where uid=1)
and abs(mod(hashint8(vid),50)) = 0
order by t1.score desc
limit ceil(t.limits*0.3)
) x -- province pool 30%
) as province_pool,
(
select array_agg(vid) from
(
select vid from pool_city t1
where t1.tag=t.tag
and t1.cid=(select cid from users where uid=1)
and t1.vid not in (select jsonb_array_elements_text( readlist[0] )::int8 from users where uid=1)
and abs(mod(hashint8(vid),50)) = 0
order by t1.score desc
limit ceil(t.limits*0.2)
) x -- city pool 20%
) as city_pool
from
(
select (unnest(tag_scores1)).tag as tag, (unnest(tag_scores1)).limits as limits from
users where uid=1
) t;
-[ RECORD 1 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool | {555234,30783,44877,893039,274638,811324,743142,233694,503619,977097,263781,350882,15000,961863,705252,823857,302978,919950,864090,633682}
province_pool | {1381822,1570117,1733796,1802258,1757745,1308796,1296608,1958019,1637076,1626698,1369964,1501167}
city_pool |
-[ RECORD 2 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool | {41356,470712,453025,997172,997806,520315,512094,523652,714477,526433}
province_pool | {1246470,1582571,1154589,1213147,1144821,1498216}
city_pool |
-[ RECORD 3 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool | {951192,518459,830710,34429,113691,362024,578173,574309}
province_pool | {1831028,1060594,1871276,1365273,1092971}
city_pool | {2100388}
-[ RECORD 4 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool | {235601,950720,269682,452868,622511,590893,602110,104605}
province_pool | {1791516,1039665,1669258,1663575,1126127}
city_pool |
-[ RECORD 5 ]-+-------------------------------------------------------------------------------------------------------------------------------------------
global_pool | {738016,548948,600423,559686,483213}
province_pool | {1185880,1916542,1336231}
city_pool |
                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.29..246.06 rows=10 width=96) (actual time=2.508..3.754 rows=5 loops=1)
-> Result (cost=0.29..2.71 rows=10 width=8) (actual time=0.020..0.023 rows=5 loops=1)
-> ProjectSet (cost=0.29..2.56 rows=10 width=32) (actual time=0.019..0.020 rows=5 loops=1)
-> Index Scan using users_pkey on users (cost=0.29..2.50 rows=1 width=224) (actual time=0.017..0.018 rows=1 loops=1)
Index Cond: (uid = 1)
SubPlan 2
-> Aggregate (cost=7.08..7.09 rows=1 width=32) (actual time=0.261..0.261 rows=1 loops=5)
-> Limit (cost=5.43..6.76 rows=25 width=12) (actual time=0.254..0.258 rows=10 loops=5)
-> Index Only Scan using idx_pool_global_0 on pool_global t1 (cost=5.43..18.63 rows=248 width=12) (actual time=0.254..0.257 rows=10 loops=5)
Index Cond: (tag = t.tag)
Filter: (NOT (hashed SubPlan 1))
Heap Fetches: 0
SubPlan 1
-> Result (cost=0.29..4.76 rows=100 width=8) (actual time=0.851..1.074 rows=1001 loops=1)
-> ProjectSet (cost=0.29..3.01 rows=100 width=32) (actual time=0.850..0.938 rows=1001 loops=1)
-> Index Scan using users_pkey on users users_1 (cost=0.29..2.50 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: (uid = 1)
SubPlan 5
-> Aggregate (cost=8.15..8.16 rows=1 width=32) (actual time=0.247..0.247 rows=1 loops=5)
-> Limit (cost=7.93..8.13 rows=1 width=12) (actual time=0.242..0.245 rows=6 loops=5)
InitPlan 3 (returns $3)
-> Index Scan using users_pkey on users users_2 (cost=0.29..2.50 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: (uid = 1)
-> Index Only Scan using idx_pool_province_0 on pool_province t1_1 (cost=5.43..6.85 rows=7 width=12) (actual time=0.241..0.243 rows=6 loops=5)
Index Cond: ((pid = $3) AND (tag = t.tag))
Filter: (NOT (hashed SubPlan 4))
Heap Fetches: 0
SubPlan 4
-> Result (cost=0.29..4.76 rows=100 width=8) (actual time=0.813..1.027 rows=1001 loops=1)
-> ProjectSet (cost=0.29..3.01 rows=100 width=32) (actual time=0.812..0.896 rows=1001 loops=1)
-> Index Scan using users_pkey on users users_3 (cost=0.29..2.50 rows=1 width=18) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (uid = 1)
SubPlan 8
-> Aggregate (cost=9.07..9.08 rows=1 width=32) (actual time=0.236..0.237 rows=1 loops=5)
-> Limit (cost=7.93..9.05 rows=1 width=12) (actual time=0.235..0.235 rows=0 loops=5)
InitPlan 6 (returns $7)
-> Index Scan using users_pkey on users users_4 (cost=0.29..2.50 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
Index Cond: (uid = 1)
-> Index Only Scan using idx_pool_city_0 on pool_city t1_2 (cost=5.43..6.55 rows=1 width=12) (actual time=0.234..0.234 rows=0 loops=5)
Index Cond: ((cid = $7) AND (tag = t.tag))
Filter: (NOT (hashed SubPlan 7))
Heap Fetches: 0
SubPlan 7
-> Result (cost=0.29..4.76 rows=100 width=8) (actual time=0.793..1.011 rows=1001 loops=1)
-> ProjectSet (cost=0.29..3.01 rows=100 width=32) (actual time=0.792..0.876 rows=1001 loops=1)
-> Index Scan using users_pkey on users users_5 (cost=0.29..2.50 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=1)
Index Cond: (uid = 1)
Planning Time: 0.999 ms
Execution Time: 3.825 ms
(49 rows)
6、Stress test
\set uid random(1,10000000)
\set mod random(0,49)
select
(
select array_agg(vid) from
(
select vid from pool_global t1
where t1.tag=t.tag
and t1.vid not in (select jsonb_array_elements_text( readlist[:mod] )::int8 from users where uid=:uid)
and abs(mod(hashint8(vid),50)) = :mod
order by t1.score desc
limit ceil(t.limits*0.5)
) x -- national pool 50%
) as global_pool,
(
select array_agg(vid) from
(
select vid from pool_province t1
where t1.tag=t.tag
and t1.pid=(select pid from users where uid=:uid)
and t1.vid not in (select jsonb_array_elements_text( readlist[:mod] )::int8 from users where uid=:uid)
and abs(mod(hashint8(vid),50)) = :mod
order by t1.score desc
limit ceil(t.limits*0.3)
) x -- province pool 30%
) as province_pool,
(
select array_agg(vid) from
(
select vid from pool_city t1
where t1.tag=t.tag
and t1.cid=(select cid from users where uid=:uid)
and t1.vid not in (select jsonb_array_elements_text( readlist[:mod] )::int8 from users where uid=:uid)
and abs(mod(hashint8(vid),50)) = :mod
order by t1.score desc
limit ceil(t.limits*0.2)
) x -- city pool 20%
) as city_pool
from
(
select (unnest(tag_scores1)).tag as tag, (unnest(tag_scores1)).limits as limits from
users where uid=:uid
) t;
pgbench (PostgreSQL) 14.0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 12
duration: 120 s
number of transactions actually processed: 99824
latency average = 14.426 ms
latency stddev = 12.608 ms
initial connection time = 13.486 ms
tps = 831.235738 (without initial connection time)
On a 2018 MacBook Pro (i5), how many video IDs can this recommend per second?
- about 83,100 (831 tps x roughly 100 vids per request, since each user's five active limits sum to 100)
Fallback method:
《重新發現PG之美- 4 随機漫步踏浪而來》
In forum and short-video businesses, editor-curated content and region-wide featured content are often pushed to users via random recommendation.
Random queries therefore face high-concurrency, low-latency requirements, yet the generic order by random() approach performs far too poorly to meet them.
PG provides the tablesample method(para) clause, which meets the high-concurrency requirement with performance thousands of times better.
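A minimal sketch of the tablesample approach (SYSTEM is built in; SYSTEM_ROWS comes from the tsm_system_rows extension shipped in contrib):

```sql
-- sample roughly 1% of the table's pages:
select vid from pool_global tablesample system (1) limit 10;

-- or fetch about N rows directly:
create extension if not exists tsm_system_rows;
select vid from pool_global tablesample system_rows(10);
```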
Video replay:
https://www.bilibili.com/video/BV1cy4y137WU/
Compression methods:
roaringbitmap | hll
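As a sketch of the compression idea, assuming the pg_roaringbitmap extension (the rb_build function and @> operator are that extension's API as I recall it; verify the names against its documentation before use):

```sql
create extension if not exists roaringbitmap;

-- store the read list as one compressed bitmap instead of a jsonb array of vids
alter table users add column readlist_rb roaringbitmap;
update users set readlist_rb = rb_build(ARRAY[1,2,3]::int[]) where uid = 1;

-- a bitmap membership test replaces the hashed NOT IN subplan:
select readlist_rb @> 2 as already_read from users where uid = 1;
```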
《重新發現PostgreSQL之美- 24 滑動視窗分析2000x》
Traditional approach: beyond saving.