天天看點

【重新發現PostgreSQL之美】- 24 滑動視窗分析 2000x

背景

場景:

遊戲、社交、電商場景.

流失使用者、新增使用者、UV計算.

滑動計算, 任意視窗.

挑戰:

資料量大、計算量大

傳統方案需要記錄明細, 才能支援滑動計算

PG解決方案:

采用HLL類型不需要存儲明細, 支援滑動, 交、并、差計算.

例子

傳統方案:

資料構造:

ts date,  -- 日期      
gid int,  -- 次元1 , 例如城市      
uid int8  -- 使用者ID      
);      

寫入1億條記錄, 跨度15天:

select current_date+(random()*15)::int, random()*100, random()*800000000      
from generate_series(1,100000000);      
public | t    | table | postgres | unlogged    | heap          | 4224 MB |      

1、查詢某一天的UV

count      
---------      
3326250      
(1 row)      
Time: 7268.339 ms (00:07.268)      

2、查詢某連續7天的UV

count      
----------      
42180699      
(1 row)      
Time: 25291.414 ms (00:25.291)      

3、查詢某一天相比前一天的新增使用者數

count      
---------      
6610943      
(1 row)      
Time: 19969.067 ms (00:19.969)      

4、查詢某一天相比前一天的流失使用者數

count      
---------      
3298421      
(1 row)      
Time: 19434.652 ms (00:19.435)      

5、查詢某7天相比前7天的新增使用者數

count      
----------      
42945970      
(1 row)      
Time: 90321.223 ms (01:30.321)      

6、查詢某7天相比前7天的流失使用者數

count      
----------      
39791334      
(1 row)      
Time: 93443.917 ms (01:33.444)      

7、查詢某14天的UV

count      
----------      
85126669      
(1 row)      
Time: 48258.861 ms (00:48.259)      

PG 解決方案:

ts date,  -- 日期      
gid int,  -- 次元1 , 例如城市      
uid hll  -- 使用者IDs      
);      

每天每個GID一條. 不需要原始資料.

insert into pt      
select ts, gid, hll_add_agg(hll_hash_bigint(uid)) from      
t group by ts,gid;      
INSERT 0 1616      
Time: 37344.032 ms (00:37.344)      
public | pt   | table | postgres | unlogged    | heap          | 2208 kB |      
?column?      
--------------------      
3422975.3781066863      
(1 row)      
Time: 1.530 ms      
?column?      
-------------------      
42551621.27768603      
(1 row)      
Time: 4.910 ms      
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),      
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )      
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;      
?column?      
-------------------      
6731386.388893194      
(1 row)      
Time: 2.330 ms      
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),      
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )      
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;      
?column?      
-------------------      
3290109.808110645      
(1 row)      
Time: 2.469 ms      
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),      
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )      
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;      
?column?      
--------------------      
42096480.700727895      
(1 row)      
Time: 8.762 ms      
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),      
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )      
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;      
?column?      
--------------------      
38055266.104507476      
(1 row)      
Time: 8.758 ms      
?column?      
-------------------      
84648101.97841392      
(1 row)      
Time: 8.739 ms      

總結

方法 存儲空間
傳統方法 4224 MB
PG 解決方案 2 MB
測試case 傳統方法速度 hll 速度 hll 精度
7268 ms 1 ms 97.17%
25291 ms 4 ms 99.13%
19969 ms 2 ms 98.21%
19434 ms 100.25%
90321 ms 8 ms 102.02%
93443 ms 104.56%
48258 ms 100.57%

關于精度:

https://hub.fastgit.org/citusdata/postgresql-hll