天天看点

【重新发现PostgreSQL之美】- 24 滑动窗口分析 2000x

背景

场景:

游戏、社交、电商场景.

流失用户、新增用户、UV计算.

滑动计算, 任意窗口.

挑战:

数据量大、计算量大

传统方案需要记录明细, 才能支持滑动计算

PG解决方案:

采用HLL类型不需要存储明细, 支持滑动, 交、并、差计算.

例子

传统方案:

数据构造:

ts date,  -- 日期      
gid int,  -- 维度1 , 例如城市      
uid int8  -- 用户ID      
);      

写入1亿条记录, 跨度15天:

select current_date+(random()*15)::int, random()*100, random()*800000000      
from generate_series(1,100000000);      
public | t    | table | postgres | unlogged    | heap          | 4224 MB |      

1、查询某一天的UV

count      
---------      
3326250      
(1 row)      
Time: 7268.339 ms (00:07.268)      

2、查询某连续7天的UV

count      
----------      
42180699      
(1 row)      
Time: 25291.414 ms (00:25.291)      

3、查询某一天相比前一天的新增用户数

count      
---------      
6610943      
(1 row)      
Time: 19969.067 ms (00:19.969)      

4、查询某一天相比前一天的流失用户数

count      
---------      
3298421      
(1 row)      
Time: 19434.652 ms (00:19.435)      

5、查询某7天相比前7天的新增用户数

count      
----------      
42945970      
(1 row)      
Time: 90321.223 ms (01:30.321)      

6、查询某7天相比前7天的流失用户数

count      
----------      
39791334      
(1 row)      
Time: 93443.917 ms (01:33.444)      

7、查询某14天的UV

count      
----------      
85126669      
(1 row)      
Time: 48258.861 ms (00:48.259)      

PG 解决方案:

ts date,  -- 日期      
gid int,  -- 维度1 , 例如城市      
uid hll  -- 用户IDs      
);      

每天每个GID一条. 不需要原始数据.

insert into pt      
select ts, gid, hll_add_agg(hll_hash_bigint(uid)) from      
t group by ts,gid;      
INSERT 0 1616      
Time: 37344.032 ms (00:37.344)      
public | pt   | table | postgres | unlogged    | heap          | 2208 kB |      
?column?      
--------------------      
3422975.3781066863      
(1 row)      
Time: 1.530 ms      
?column?      
-------------------      
42551621.27768603      
(1 row)      
Time: 4.910 ms      
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),      
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )      
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;      
?column?      
-------------------      
6731386.388893194      
(1 row)      
Time: 2.330 ms      
a as ( select hll_union_agg(uid) uid from pt where ts=current_date+1 ),      
b as ( select hll_union_agg(uid) uid from pt where ts=current_date )      
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;      
?column?      
-------------------      
3290109.808110645      
(1 row)      
Time: 2.469 ms      
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),      
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )      
select (# hll_union(a.uid,b.uid)) - (# b.uid) from a,b;      
?column?      
--------------------      
42096480.700727895      
(1 row)      
Time: 8.762 ms      
a as ( select hll_union_agg(uid) uid from pt where ts>=current_date+7 and ts<current_date+14 ),      
b as ( select hll_union_agg(uid) uid from pt where ts>=current_date and ts<current_date+7 )      
select (# hll_union(a.uid,b.uid)) - (# a.uid) from a,b;      
?column?      
--------------------      
38055266.104507476      
(1 row)      
Time: 8.758 ms      
?column?      
-------------------      
84648101.97841392      
(1 row)      
Time: 8.739 ms      

总结

方法 存储空间
传统方法 4224 MB
PG 解决方案 2 MB
测试case 传统方法速度 hll 速度 hll 精度
7268 ms 1 ms 97.17%
25291 ms 4 ms 99.13%
19969 ms 2 ms 98.21%
19434 ms 100.25%
90321 ms 8 ms 102.02%
93443 ms 104.56%
48258 ms 100.57%

关于精度:

https://hub.fastgit.org/citusdata/postgresql-hll