PostgreSQL 内容随機推薦系統開發實踐 - 文章随機推薦

标簽

PostgreSQL , 數組 , 文章 , 随機推薦 , 論壇 , 電商

https://github.com/digoal/blog/blob/master/201808/20180810_01.md#%E8%83%8C%E6%99%AF 背景

内容推薦是蠻普遍的需求，例如論壇、電商、新聞用戶端等。

比較簡單的需求：編輯精選一些内容ID，生成推薦清單。（例如每天生成一個這樣的推薦清單。）然後随機的推薦給使用者（同時過濾已讀的内容）。

更進階的推薦需求：應該是根據不同口味産生的，例如對會員本身進行畫像，歸類。服務端針對不同口味生成不同的推薦清單。定向推薦。

本文介紹第一種需求的實踐，使用PostgreSQL，中等規格的PG執行個體(28核)，可以輕松達到每秒50萬篇内容吞吐的推薦。

https://github.com/digoal/blog/blob/master/201808/20180810_01.md#demo DEMO

以論壇為例，有文章，有編輯精選的文章清單，有會員，有會員的閱讀記錄，使用者打開頁面時，根據精選清單随機推薦20篇（同時過濾已讀内容）。

假設精選清單有2000篇文檔，有1000萬會員。

1、文章表

create table tbl_art (  
  artid int8,  -- 文章ID  
  content text,  -- 文章内容  
  crt_time timestamp  -- 文章建立時間  
  -- ...  -- 其他，标題，作者，。。。。  
);

會員表（略）。

2、推薦文章ID清單

create table tbl_art_list (  
  list_time timestamp primary key,  -- 清單生成時間  
  artid int8[] not null,   -- 包含哪些文章（ID），使用數組存儲  
  min_crt_time timestamp not null,  -- 這些文章中，時間最老的文章時間。取自tbl_art.crt_time 。 用于清理使用者閱讀日志。  
  arrlen int not null   -- artid 的長度（包含幾個元素，即幾篇精選文章）  
  -- classid int  如果會員有歸類（标簽），可以按歸類建立精選清單）
);

3、寫入精選清單，每次推薦時，擷取最後一個清單進行推薦。

如果有定向需求（根據會員标簽進行推薦，改一下表tbl_art_list結構，加個CLASS字段，同時會員表，增加CLASS字段。會員打開頁面時，通過CLASS比對tbl_art_list裡的最後一個清單）

如下，生成2000篇精選文章ID。一條記錄。

insert into tbl_art_list values (  
  now(),   
  array(select (random()*1000000)::int8 from generate_series(1,2000)),   
  now(),   
  2000  
) ;

4、已閱讀記錄

create table tbl_read_rec (  
  uid int8,  -- 會員ID  
  crt_time timestamp,  -- 閱讀時間  
  artid int8,  -- 文章ID  
  primary key(uid,artid)  -- 主鍵（一篇文檔，一個會員被閱讀後，僅記錄一次）  
);  
  
create index idx_crt_time_1 on tbl_read_rec (crt_time);

5、随機擷取推薦文章ID

使用者打開推薦頁面時，輸入使用者ID，GET多少篇精選文檔（從精選清單中，随機GET）。

傳回一個數組，即GET到的來自精選文章清單，并且過濾掉已讀過的，随機文章ID。

create or replace function get_artid(  
  i_uid int8,   -- 使用者ID  
  rows int  -- 随機擷取多少篇ID  
) returns int8[] as $$  
declare   
  v_artid int8[];  -- 精選ID清單  
  len int;  -- 精選ID清單文章個數  
  res int8[] := (array[])::int8[];   -- 結果  
  tmp int8;  -- 中間變量，從精選ID清單中得到的随機文章ID  
  loopi int := 0;  -- 循環變量，已擷取到多少篇符合條件的ID  
  loopx int := 0;  -- 循環變量，已循環多少次（上限，取決于精選ID清單文章個數，例如1.5倍len）  
begin   
  select artid,arrlen into v_artid,len   
    from tbl_art_list order by list_time desc limit 1;  -- 從編輯精選清單，擷取最後一條。  
  loop   
    if loopi >= rows or loopx >= 1.5*len then    -- 是否已周遊所有精選文章ID （随機周遊）  
      return res;    
    end if;  
    tmp := v_artid[floor(random()*len)+1];   -- 從精選文章IDs 擷取随機ID  
    perform 1 from tbl_read_rec where uid=i_uid and artid=tmp;   -- 判斷是否已讀  
    if not found then  
      res := array_append(res, tmp);  -- 未讀，APPEND到傳回結果  
      loopi := loopi +1 ;  -- 遞增  
    end if;  
    loopx := loopx +1 ;  -- 遞增  
  end loop;   
  return res;  
end;  
$$ language plpgsql strict;

6、清理閱讀記錄

使用 limit，每次清理若幹條，

使用skip locked，支援并行DELETE。

delete from tbl_read_rec where ctid = any (array(  
  select ctid from tbl_read_rec where crt_time < (select min_crt_time from tbl_art_list order by list_time desc limit 1) limit 10000 for update skip locked  
));

7、測試

GET精選文章ID，（滿足随機，過濾已讀）。

性能OK。

postgres=# select get_artid(1,20);  
                                                                  get_artid                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 {919755,3386,100126,761631,447551,511825,168645,211819,862572,330666,942247,600470,843042,511825,295568,829303,382312,452915,499113,164219}  
(1 row)  
  
Time: 0.377 ms  
postgres=# select get_artid(1,20);  
                                                                 get_artid                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------  
 {257929,796892,343984,363615,418506,326628,91731,958663,127918,794101,49124,410347,852461,922276,366815,926232,134506,153306,123694,67087}  
(1 row)  
  
Time: 0.347 ms

假設擷取到的随機ID，立即閱讀，擷取到的記錄全部寫入。（用于壓測）

postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.603 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.494 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.479 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.494 ms

8、壓測

vi test.sql  
  
\set uid random(1,10000000)  
insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 28 -j 28 -T 120

transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 28  
number of threads: 28  
duration: 120 s  
number of transactions actually processed: 3074866  
latency average = 1.093 ms  
latency stddev = 0.577 ms  
tps = 25623.620112 (including connections establishing)  
tps = 25625.634577 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.002  \set uid random(1,10000000)  
         1.091  insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;

120秒壓測後，已讀記錄表達到3GB。

postgres=# \dt+  
                          List of relations  
 Schema |     Name     | Type  |  Owner   |    Size    | Description   
--------+--------------+-------+----------+------------+-------------  
 public | tbl_art      | table | postgres | 8192 bytes |   
 public | tbl_art_list | table | postgres | 64 kB      |   
 public | tbl_read_rec | table | postgres | 3047 MB    |   
(3 rows)

已讀記錄6120萬。

postgres=# select count(*) from tbl_read_rec ;  
  count     
----------  
 61206909  
(1 row)

如下

postgres=# select uid,count(*) from tbl_read_rec group by 1 limit 10;  
 uid | count   
-----+-------  
   1 |  2000  
   6 |    20  
  11 |    20  
  12 |    20  
  14 |    19  
  21 |    20  
  22 |    40  
  26 |    19  
  31 |    20  
  34 |    19  
(10 rows)

對于已經全部閱讀的，則不再推薦，因為精選清單已全部已讀。

postgres=# select get_artid(1,20);  
 get_artid   
-----------  
 {}  
(1 row)

https://github.com/digoal/blog/blob/master/201808/20180810_01.md#%E6%80%A7%E8%83%BD 性能

tps : 25623

每秒推薦傳回 : 512460 篇ID

推薦部分還有優化空間，例如使用者對整個精選清單都已讀時(已讀越多，GET越慢)，來擷取清單會比較慢（因為需要周遊整個清單ID，都拿不到20條有效記錄，消耗較大，最後傳回NULL）。(實測這種情況下，GET約20毫秒)

這種情況下，建議可以給使用者打個标記，表示本次已推薦完所有内容（避開GET，那麼性能就會直線上升，幾十萬TPS沒問題），傳回其他内容。

PostgreSQL 内容随機推薦系統開發實踐 - 文章随機推薦

标簽

https://github.com/digoal/blog/blob/master/201808/20180810_01.md#%E8%83%8C%E6%99%AF 背景

https://github.com/digoal/blog/blob/master/201808/20180810_01.md#demo DEMO

https://github.com/digoal/blog/blob/master/201808/20180810_01.md#%E6%80%A7%E8%83%BD 性能

繼續閱讀

Mysql表結構轉Postgresql建表語句快速操作

postgis學習筆記

postgis 常用 sql 20210707

postgres 性能優化小技巧Performance Tips1) Using EXPLAIN2) Statistics Used by the Planner3) Controlling the Planner with Explicit JOIN Clauses4) Populating a Database5) Non-Durable Settings

使用eclipse 搭建 spring boot + mybatis 連接配接 postgresql資料庫

oracle sql 多表嵌套子查詢連接配接查詢， join where exist in 的差別

網站GUI自動化測試技術思考

C#連接配接postgresql資料庫

關于 java 轉碼 GBK -> UTF-8

postgresql替換換行符

查詢資料庫中存在的回車換行符

Django入門：第一個Django應用

postgresql 關于GBK和utf-8字元集問題

十四、MySQL備份和恢複資料庫1、備份和恢複的方法2、使用mysqldump導出3、恢複轉儲檔案4、字元編碼問題5、鎖表系列結語

PostgreSQL t_bits計算方法

為什麼要選擇UniDAC