0. Article Series Links
- Machine Learning in SLS (01): Time Series Statistical Modeling
- Machine Learning in SLS (02): Time Series Clustering Modeling
- Machine Learning in SLS (03): Time Series Anomaly Detection Modeling
- Machine Learning in SLS (04): Rule-Based Pattern Mining
- Machine Learning in SLS (05): Time Series Forecasting
1. High-Frequency Detection Scenarios
1.1 Scenario 1
A cluster has N machines, and each machine reports M time-series metrics (CPU, memory, IO, traffic, and so on). Modeling each curve individually means hand-writing a large amount of repetitive SQL and places a heavy computational load on the platform. How can we use SQL to handle this scenario more efficiently?
1.2 Scenario 2
After running anomaly detection over the system's N time-series curves, how do we quickly find out which of those curves actually contain anomalies?
2. Platform Experiments
2.1 Solution 1
For the problem described in Scenario 1, we impose the following constraints on the data. The data is stored in a Log Service LogStore with the following structure:
timestamp: unix_time_stamp
machine: name1
metricName: cpu0
metricValue: 50
---
timestamp: unix_time_stamp
machine: name1
metricName: cpu1
metricValue: 50
---
timestamp: unix_time_stamp
machine: name1
metricName: mem
metricValue: 50
---
timestamp: unix_time_stamp
machine: name2
metricName: mem
metricValue: 60
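For intuition, here is a small Python sketch of this flattened layout (the helper function and sample values are hypothetical, not part of the platform): every metric reading is an independent record carrying the same four fields, which is what lets a single LogStore hold all N × M curves.

```python
import time

# Hypothetical helper: build one flattened record per metric reading,
# mirroring the LogStore schema shown above.
def make_record(machine, metric_name, metric_value, ts=None):
    return {
        "timestamp": ts if ts is not None else int(time.time()),
        "machine": machine,
        "metricName": metric_name,
        "metricValue": metric_value,
    }

# Sample records matching the example above (timestamps are made up).
records = [
    make_record("name1", "cpu0", 50, ts=1550000000),
    make_record("name1", "cpu1", 50, ts=1550000000),
    make_record("name1", "mem", 50, ts=1550000000),
    make_record("name2", "mem", 60, ts=1550000000),
]
```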
From the LogStore above, we first retrieve the time-series data for all N metrics:
* | select timestamp - timestamp % 60 as time, machine, metricName, avg(metricValue) from log group by time, machine, metricName
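The expression `timestamp - timestamp % 60` floors each timestamp to its one-minute bucket. A rough Python equivalent of the query's grouping logic (for intuition only, not how the platform executes it; the sample data is invented):

```python
from collections import defaultdict

# Floor each timestamp to its 60-second bucket, then average metricValue
# per (bucket, machine, metricName) group, mirroring the SQL's GROUP BY.
def aggregate(records, bucket_seconds=60):
    groups = defaultdict(list)
    for r in records:
        bucket = r["timestamp"] - r["timestamp"] % bucket_seconds
        groups[(bucket, r["machine"], r["metricName"])].append(r["metricValue"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

records = [
    {"timestamp": 1550000005, "machine": "name1", "metricName": "cpu0", "metricValue": 40},
    {"timestamp": 1550000035, "machine": "name1", "metricName": "cpu0", "metricValue": 60},
    {"timestamp": 1550000065, "machine": "name1", "metricName": "cpu0", "metricValue": 70},
]
agg = aggregate(records)
# The first two readings share the bucket starting at 1549999980,
# so their values are averaged; the third falls into the next bucket.
```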
Now we run a batch time-series anomaly detection algorithm over those results, producing detection results for all N metrics:
* |
select machine, metricName, ts_predicate_arma(time, value, 5, 1, 1) as res from (
select
timestamp - timestamp % 60 as time,
machine, metricName,
avg(metricValue) as value
from log group by time, machine, metricName )
group by machine, metricName
The SQL above yields results with the following structure:
| machine | metricName | [[time, src, pred, upper, lower, prob]] |
| ------- | ---------- | --------------------------------------- |
Next, we apply a matrix transpose to convert these results into the format below; the concrete SQL is as follows:
* |
select
machine, metricName,
res[1] as ts, res[2] as ds, res[3] as preds, res[4] as uppers, res[5] as lowers, res[6] as probs
from ( select machine, metricName, array_transpose(ts_predicate_arma(time, value, 5, 1, 1)) as res from (
select
timestamp - timestamp % 60 as time,
machine, metricName,
avg(metricValue) as value
from log group by time, machine, metricName )
group by machine, metricName )
After transposing the two-dimensional array, we split each row's contents into separate columns and obtain the expected result, in the following format:
| machine | metricName | ts | ds | preds | uppers | lowers | probs |
| ------- | ---------- | -- | -- | ----- | ------ | ------ | ----- |
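To see what the transpose step is doing: per curve, `ts_predicate_arma` returns an array of rows of the form `[time, src, pred, upper, lower, prob]`, and `array_transpose` flips that array so each field becomes its own column, letting one result row carry the whole curve as parallel arrays. A minimal Python sketch of that flip (the numeric values are invented for illustration):

```python
# Transpose a list of rows into a list of columns, mirroring what
# array_transpose does to the 2-D result of ts_predicate_arma.
def array_transpose(rows):
    return [list(col) for col in zip(*rows)]

# Two hypothetical detection rows: [time, src, pred, upper, lower, prob].
rows = [
    [1550000040, 50.0, 49.5, 55.0, 44.0, 0.0],
    [1550000100, 51.0, 50.2, 56.0, 45.0, 0.0],
]
cols = array_transpose(rows)
# cols[0] is the ts column, cols[1] the ds (source) column, and so on,
# matching res[1] .. res[6] in the SQL above.
```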
2.2 Solution 2
Given the batch detection results, how do we quickly filter out the curves that contain a particular kind of anomaly? Log Service provides a filter operation for anomaly detection results.
select ts_anomaly_filter(lineName, ts, ds, preds, probs, nWatch, anomalyType)
The anomalyType parameter takes the following values:
- 0: report all anomalies
- 1: report rising-edge anomalies only
- -1: report falling-edge anomalies only
The nWatch parameter means:
- the number of observations to inspect, counting back nWatch points from the last valid observation in the actual time series.
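To make these parameter semantics concrete, here is an illustrative Python sketch. It is not the platform's actual implementation: the probability threshold and the direction rule (source value above the prediction counts as a rising edge) are assumptions made for illustration.

```python
# Illustrative only: scan the last n_watch observations of one curve and
# report whether any of them is an anomaly of the requested direction.
# ASSUMPTIONS: a point is anomalous when prob >= threshold, and its
# direction is +1 (rising) when src > pred, else -1 (falling).
def has_recent_anomaly(ds, preds, probs, n_watch, anomaly_type, threshold=0.5):
    for src, pred, prob in list(zip(ds, preds, probs))[-n_watch:]:
        if prob < threshold:
            continue
        direction = 1 if src > pred else -1
        if anomaly_type == 0 or anomaly_type == direction:
            return True
    return False

# A hypothetical curve whose last point spikes well above its prediction.
ds    = [50.0, 50.5, 90.0]
preds = [50.1, 50.4, 50.6]
probs = [0.0, 0.0, 0.99]
```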
A concrete usage example:
* |
select
ts_anomaly_filter(lineName, ts, ds, preds, probs, cast(5 as bigint), cast(1 as bigint))
from
( select
concat(machine, '-', metricName) as lineName,
res[1] as ts, res[2] as ds, res[3] as preds, res[4] as uppers, res[5] as lowers, res[6] as probs
from ( select machine, metricName, array_transpose(ts_predicate_arma(time, value, 5, 1, 1)) as res from (
select
timestamp - timestamp % 60 as time,
machine, metricName,
avg(metricValue) as value
from log group by time, machine, metricName )
group by machine, metricName ) )
The result above is a Row-typed value; we can extract its individual fields as follows:
* |
select
res.name, res.ts, res.ds, res.preds, res.probs
from
( select
ts_anomaly_filter(lineName, ts, ds, preds, probs, cast(5 as bigint), cast(1 as bigint)) as res
from
( select
concat(machine, '-', metricName) as lineName,
res[1] as ts, res[2] as ds, res[3] as preds, res[4] as uppers, res[5] as lowers, res[6] as probs
from (
select
machine, metricName, array_transpose(ts_predicate_arma(time, value, 5, 1, 1)) as res
from (
select
timestamp - timestamp % 60 as time,
machine, metricName, avg(metricValue) as value
from log group by time, machine, metricName )
group by machine, metricName ) ) )
With these operations, we can filter the results of batch anomaly detection, making it easier for users to configure alerts in bulk.
3. Shameless Plug
3.1 Going Further with Log Service
Here are demos of the various Log Service features:
An overall introduction to Log Service, with assorted demos.
For more advanced material, see:
the Log Service learning path.
3.2 Contact Us
For corrections, documentation help, or best-practice contributions, contact: 悟冥
For questions, join our DingTalk group: