業務需求分析中對資料按時序劃分為不同的片段,針對相應片段進行分析的場景也有不少:停車時長、運作時長、斷電時長等等。現結合實際需求的簡化版來分析下如何運用分桶算法
案例:運輸車輛上安裝的有一裝置可以監控到車輛啟停狀态,某天的監控狀态資料如下表:device_id為裝置id,device_time為裝置上傳資料的時間一秒一上傳,ac_state為車輛啟動停止的狀态(1啟動 0熄火),以下是模拟資料
device_id | device_time | ac_state |
... | ||
E1 | 1628317418 | 1 |
1628317419 | ||
1628317420 | ||
1628317421 | ||
1628317422 | ||
1628317423 | ||
1628317424 | ||
1628317425 | ||
1628317426 | ||
1628317427 | ||
1628317428 | ||
1628317429 | ||
1628317430 | ||
1628317431 | ||
1628317432 | ||
1628317433 | ||
1628317434 | ||
E2 | 1628317510 | |
1628317511 | ||
1628317512 | ||
1628317513 | ||
1628317514 | ||
1628317515 | ||
1628317516 | ||
1628317517 | ||
1628317518 | ||
1628317519 | ||
1628317520 | ||
1628317521 | ||
1628317522 | ||
1628317523 | ||
1628317524 | ||
1628317525 | ||
1628317526 | ||
需要分析某天車輛停車次數、停車時長及停車開始和結束時間,如下表所示
date | power_off_ct | sn | power_off_duration | start_time | end_time | |
2021-08-07 | 2 | 5 | 2021-08-07 14:23:41 | 2021-08-07 14:23:45 | ||
3 | 2021-08-07 14:23:49 | 2021-08-07 14:23:51 | ||||
分析:觀察資料就會發現ac_state字段已經分好組了,這在之前的分析就是一個标記列了(滿足條件标記1不滿足标記0),雖已經分好組但是不能直接根據這個組進行計算,我們需要将這個組重新分組并标注遞增的組好,如何重新分組呢;我們先看下将ac_state整體往下移動一條資料的距離,會發現不同分組資料有交叉,有了這個交叉之後,可以對資料重新标記

新标記的一列資料進行累加,0值相加還未0,遇到1就累積增1,這就行成了分組效果,也即是将資料劃分為不同的桶,可以利用sum(if)組合進行實作,這在之前的文章分析中已經直接用了但未做具體解釋
- 首先生成示例資料
with tb1 as (
select
device_id,
device_time,
ac_state
from values('E1',1628317418,1),
('E1',1628317419,1),
('E1',1628317420,1),
('E1',1628317421,0),
('E1',1628317422,0),
('E1',1628317423,0),
('E1',1628317424,0),
('E1',1628317425,0),
('E1',1628317426,1),
('E1',1628317427,1),
('E1',1628317428,1),
('E1',1628317429,0),
('E1',1628317430,0),
('E1',1628317431,0),
('E1',1628317432,1),
('E1',1628317433,1),
('E1',1628317434,1),
('E2',1628317510,0),
('E2',1628317511,0),
('E2',1628317512,1),
('E2',1628317513,1),
('E2',1628317514,1),
('E2',1628317515,0),
('E2',1628317516,0),
('E2',1628317517,0),
('E2',1628317518,0),
('E2',1628317519,1),
('E2',1628317520,1),
('E2',1628317521,0),
('E2',1628317522,0),
('E2',1628317523,0),
('E2',1628317524,0),
('E2',1628317525,0),
('E2',1628317526,0)
t(device_id,device_time,ac_state)
)
- 資料移動采用lag函數進行
tb2 as (
select
device_id,
device_time,
ac_state,
from_unixtime(device_time) datetime,
lag(ac_state,1,1) over(partition by device_id order by device_time) lag_ac_state
from tb1
)
- 使用sum(if)進行分桶
tb3 as (
select
device_id,
device_time,
ac_state,
datetime,
lag_ac_state,
sum(if(ac_state!=lag_ac_state,1,0)) over(partition by device_id order by device_time) flag
from tb2
where ac_state = 0 --過濾全為0的資料友善進行分桶
)
--結果展示如下
device_id device_time ac_state datetime lag_ac_state flag
E1 1628317421 0 2021-08-07 14:23:41 1 1
E1 1628317422 0 2021-08-07 14:23:42 0 1
E1 1628317423 0 2021-08-07 14:23:43 0 1
E1 1628317424 0 2021-08-07 14:23:44 0 1
E1 1628317425 0 2021-08-07 14:23:45 0 1
E1 1628317429 0 2021-08-07 14:23:49 1 2
E1 1628317430 0 2021-08-07 14:23:50 0 2
E1 1628317431 0 2021-08-07 14:23:51 0 2
E2 1628317510 0 2021-08-07 14:25:10 1 1
E2 1628317511 0 2021-08-07 14:25:11 0 1
E2 1628317515 0 2021-08-07 14:25:15 1 2
E2 1628317516 0 2021-08-07 14:25:16 0 2
E2 1628317517 0 2021-08-07 14:25:17 0 2
E2 1628317518 0 2021-08-07 14:25:18 0 2
E2 1628317521 0 2021-08-07 14:25:21 1 3
E2 1628317522 0 2021-08-07 14:25:22 0 3
E2 1628317523 0 2021-08-07 14:25:23 0 3
E2 1628317524 0 2021-08-07 14:25:24 0 3
E2 1628317525 0 2021-08-07 14:25:25 0 3
E2 1628317526 0 2021-08-07 14:25:26 0 3
- 計算停車次數
tb4 as (
select
device_id,
device_time,
ac_state,
datetime,
flag,
max(flag) over(partition by device_id) ct
from tb3
)
- 按裝置和分桶号進行分組統計結果
select
substr(min(datetime),1,10) as date,
device_id,
min(ct) as power_off_ct,
flag as sn,
max(device_time)-min(device_time) as power_off_duration,
min(datetime) as start_time,
max(datetime) as end_time
from tb4
group by device_id,flag;
--結果如下
date device_id power_off_ct sn power_off_duration start_time end_time
2021-08-07 E1 2 1 4 2021-08-07 14:23:41 2021-08-07 14:23:45
2021-08-07 E1 2 2 2 2021-08-07 14:23:49 2021-08-07 14:23:51
2021-08-07 E2 3 1 1 2021-08-07 14:25:10 2021-08-07 14:25:11
2021-08-07 E2 3 2 3 2021-08-07 14:25:15 2021-08-07 14:25:18
2021-08-07 E2 3 3 5 2021-08-07 14:25:21 2021-08-07 14:25:26
以上就是分析過程,在業務分析過程中該方法能很好的解決類似需求,舉一反三,希望能幫助到大家。
拜了個拜