背景
場景:
遊戲、社交、電商場景.
流失使用者、新增使用者、UV計算.
滑動計算, 任意視窗.
挑戰:
資料量大、計算量大
傳統方案需要記錄明細, 才能支援滑動計算
PG解決方案:
采用HLL類型不需要存儲明細, 支援滑動, 交、并、差計算.
例子
傳統方案:
資料構造:
ts date, -- 日期
gid int, -- 次元1 , 例如城市
uid int8 -- 使用者ID
);
寫入1億條記錄, 跨度15天:
select current_date+(random()*15)::int, random()*100, random()*800000000
from generate_series(1,100000000);
public | t | table | postgres | unlogged | heap | 4224 MB |
1、查詢某一天的UV
count
---------
3326250
(1 row)
Time: 7268.339 ms (00:07.268)
2、查詢某連續7天的UV
count
----------
42180699
(1 row)
Time: 25291.414 ms (00:25.291)
3、查詢某一天相比前一天的新增使用者數
count
---------
6610943
(1 row)
Time: 19969.067 ms (00:19.969)
4、查詢某一天相比前一天的流失使用者數
count
---------
3298421
(1 row)
Time: 19434.652 ms (00:19.435)
5、查詢某7天相比前7天的新增使用者數
count
----------
42945970
(1 row)
Time: 90321.223 ms (01:30.321)
6、查詢某7天相比前7天的流失使用者數
count
----------
39791334
(1 row)
Time: 93443.917 ms (01:33.444)
7、查詢某14天的UV
count
----------
85126669
(1 row)
Time: 48258.861 ms (00:48.259)
PG 解決方案:
ts date, -- 日期
gid int, -- 次元1 , 例如城市
uid hll -- 使用者IDs
);
每天每個GID一條. 不需要原始資料.
insert into pt
select ts, gid, hll_add_agg(hll_hash_bigint(uid)) from
t group by ts,gid;
INSERT 0 1616
Time: 37344.032 ms (00:37.344)
public | pt | table | postgres | unlogged | heap | 2208 kB |
?column?
--------------------
3422975.3781066863
(1 row)
Time: 1.530 ms
?column?
-------------------
42551621.27768603
(1 row)
Time: 4.910 ms
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;
?column?
-------------------
6731386.388893194
(1 row)
Time: 2.330 ms
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;
?column?
-------------------
3290109.808110645
(1 row)
Time: 2.469 ms
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;
?column?
--------------------
42096480.700727895
(1 row)
Time: 8.762 ms
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;
?column?
--------------------
38055266.104507476
(1 row)
Time: 8.758 ms
?column?
-------------------
84648101.97841392
(1 row)
Time: 8.739 ms
總結
方法 | 存儲空間 |
傳統方法 | 4224 MB |
PG 解決方案 | 2 MB |
測試case | 傳統方法速度 | hll 速度 | hll 精度 |
7268 ms | 1 ms | 97.17% | |
25291 ms | 4 ms | 99.13% | |
19969 ms | 2 ms | 98.21% | |
19434 ms | 100.25% | ||
90321 ms | 8 ms | 102.02% | |
93443 ms | 104.56% | ||
48258 ms | 100.57% |
關于精度:
https://hub.fastgit.org/citusdata/postgresql-hll