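TODS: Detecting Different Types of Outliers from Time Series Data

Automatically building machine learning pipelines for time series outlier detection.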

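Time series outlier detection aims to identify unexpected or rare instances in data. As one of the most important tasks in data analysis, outlier detection has many applications on time series data, such as fraud detection, fault detection, and cybersecurity attack detection. For example, Yahoo [1] and Microsoft [2] have built their own time series outlier detection services to monitor their business data and trigger outlier alerts. On time series data, outliers fall into three categories: point-wise outliers, pattern-wise (collective) outliers, and system-wise outliers.

In this article, I introduce an open-source project for building machine learning pipelines that detect outliers in time series data. The article briefly describes the three common types of outliers and the corresponding detection strategies. It then provides example code based on the two supported APIs: the TODS API, for developing time series outlier detection pipelines, and the scikit-learn API, for experimenting with third-party packages.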

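Overview

TODS [3] is a full-stack machine learning system for outlier detection on multivariate time series data. TODS provides exhaustive modules for building machine-learning-based outlier detection systems, including data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module. The functionality provided through these modules includes general-purpose data preprocessing, time series smoothing and transformation, feature extraction from the time and frequency domains, a variety of detection algorithms, and the involvement of human expertise to calibrate the system. Three common outlier detection scenarios can be performed on time series data: point-wise detection (time points as outliers), pattern-wise detection (subsequences as outliers), and system-wise detection (sets of time series as outliers).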

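Point-wise outliers typically occur when there is a potential system failure or a small glitch in the time series. Such an outlier is a single data point that is anomalous either globally (compared with the data points of the whole time series) or locally (compared with neighboring points). Global outliers are usually obvious, and a common way to detect them is to compute statistics of the dataset (e.g., min/max/mean/standard deviation) and set a threshold for flagging outlying points. Local outliers usually appear in a specific context: a data point with the same value is not identified as an outlier when it occurs outside that context. A common strategy for detecting local outliers is to identify the context (through seasonal-trend decomposition or autocorrelation) and then apply statistical or machine learning methods (e.g., AutoRegression, IsolationForest, OneClassSVM) to detect the outliers.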
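As a rough illustration of the two point-wise cases, the sketch below flags global outliers with a mean and standard deviation threshold, and local outliers by first removing the context with a rolling median and then thresholding the residuals. The function names, the 3-sigma threshold, and the window size are illustrative assumptions, not part of TODS.

```python
import numpy as np

def global_outliers(series, n_sigmas=3.0):
    """Flag points farther than n_sigmas standard deviations from the mean."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    if std == 0:
        return np.zeros(series.shape, dtype=bool)
    return np.abs(series - series.mean()) > n_sigmas * std

def rolling_median(series, window=5):
    """Median of a centered window around each point (truncated at the edges)."""
    series = np.asarray(series, dtype=float)
    half = window // 2
    return np.array([np.median(series[max(0, i - half): i + half + 1])
                     for i in range(len(series))])

def local_outliers(series, window=5, n_sigmas=3.0):
    """Remove the local context, then threshold the residuals."""
    residuals = np.asarray(series, dtype=float) - rolling_median(series, window)
    return global_outliers(residuals, n_sigmas)

# Synthetic seasonal signal (assumed data, for illustration only)
t = np.arange(200)
base = 10 * np.sin(2 * np.pi * t / 50)

spiky = base.copy()
spiky[20] = 100.0          # extreme value: a global outlier
global_flags = global_outliers(spiky)

seasonal = base.copy()
seasonal[137] = 0.0        # ordinary value, but wrong for its position in the cycle
local_flags = local_outliers(seasonal)

print("global outliers at:", np.flatnonzero(global_flags))
print("local outliers at:", np.flatnonzero(local_flags))
```

The point at index 137 has a perfectly ordinary value for the series as a whole, so the global threshold misses it; it only stands out once the seasonal context has been subtracted.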

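Pattern-wise outliers typically occur when there is abnormal behavior in the data. A pattern-wise outlier is a subsequence (a run of consecutive points) of the time series whose behavior is unusual compared with other subsequences. Common practices for detecting pattern-wise outliers include discord analysis (e.g., matrix profile [6], HotSAX [7]) and subsequence clustering [4]. Discord analysis uses a sliding window to split the time series into many subsequences and computes the distance between subsequences (e.g., Euclidean distance) to find the discords in the time series. Subsequence clustering also applies subsequence segmentation to the time series and uses the subsequence as the feature vector of each time point, so the sliding-window size is the number of features. Unsupervised machine learning methods (e.g., KMeans clustering, PCA) or point-wise outlier detection algorithms are then applied to detect pattern-wise outliers.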
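The discord idea can be sketched with a brute-force nearest-neighbor search over sliding windows. This is only a simplified stand-in for the matrix profile: it uses raw (not z-normalized) Euclidean distance and quadratic time, and the function names and the exclusion-zone rule are assumptions made for illustration.

```python
import numpy as np

def sliding_windows(series, window):
    """Stack every length-`window` subsequence as one row."""
    series = np.asarray(series, dtype=float)
    return np.array([series[i:i + window]
                     for i in range(len(series) - window + 1)])

def discord_scores(series, window):
    """Distance from each subsequence to its nearest non-overlapping neighbor."""
    subs = sliding_windows(series, window)
    n = len(subs)
    scores = np.full(n, np.inf)
    for i in range(n):
        dists = np.linalg.norm(subs - subs[i], axis=1)
        # Ignore trivial matches: windows that overlap window i
        lo, hi = max(0, i - window + 1), min(n, i + window)
        dists[lo:hi] = np.inf
        scores[i] = dists.min()
    return scores

# Periodic signal with a flat segment where the pattern breaks (assumed data)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 20)
series[100:110] = 0.0

scores = discord_scores(series, window=10)
top = int(np.argmax(scores))
print("most unusual subsequence starts at index", top)
```

Subsequences that repeat elsewhere in the series get a score near zero, while the subsequence with the largest score (the discord) has no close match anywhere else.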

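System-wise outliers occur continually when one of many systems is in an abnormal state, where a system is defined as a multivariate time series. The goal of detecting system-wise outliers is to find the systems in an abnormal state among many similar systems, for example, detecting an abnormal production line in a factory with multiple production lines. A common approach is to perform point-wise and pattern-wise outlier detection to obtain an outlier score for each time point or subsequence, and then apply ensemble techniques to produce an overall outlier score for each system for comparison and detection.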
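A minimal sketch of the ensemble step follows, assuming each system is reduced to a single variable for brevity and that per-point scores are the absolute deviation from the cross-system median at each time step. Both choices, along with the function name `system_scores`, are illustrative assumptions; any point-wise or pattern-wise detector could supply the scores instead.

```python
import numpy as np

def system_scores(point_scores, how="mean"):
    """Aggregate per-time-point scores (n_systems x n_steps) into one score per system."""
    point_scores = np.asarray(point_scores, dtype=float)
    if how == "mean":
        return point_scores.mean(axis=1)
    if how == "max":
        return point_scores.max(axis=1)
    raise ValueError(f"unknown aggregation: {how}")

# Five similar "production lines"; line 3 drifts for 100 steps (assumed data)
rng = np.random.default_rng(0)
systems = rng.normal(0.0, 1.0, size=(5, 300))
systems[3, 100:200] += 4.0

# Point-wise score: deviation from what the other systems do at the same time step
point_scores = np.abs(systems - np.median(systems, axis=0))

overall = system_scores(point_scores, how="mean")
abnormal = int(np.argmax(overall))
print("overall scores:", np.round(overall, 2))
print("most abnormal system:", abnormal)
```

Averaging favors systems that misbehave for a sustained period, whereas `how="max"` would favor a single extreme reading; which one is appropriate depends on the kind of failure being monitored.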

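Experimenting with the Scikit-learn API

At the start of building a machine learning pipeline, a large number of experiments are needed to tune or analyze algorithms. In TODS, a Scikit-learn-like API is available for most modules, allowing users to flexibly call individual functions in their experiment scripts. Here is an example that calls the matrix profile to identify pattern-wise outliers using the UCR dataset [5].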

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from tods.sk_interface.detection_algorithm.MatrixProfile_skinterface import MatrixProfileSKI

# Prepare the data: the first 10,000 points are used for fitting, the rest for testing
data = np.loadtxt("./500_UCR_Anomaly_robotDOG1_10000_19280_19360.txt")

X_train = np.expand_dims(data[:10000], axis=1)
X_test = np.expand_dims(data[10000:], axis=1)

# Fit the matrix profile detector, then predict labels and outlier scores
transformer = MatrixProfileSKI()
transformer.fit(X_train)
prediction_labels_train = transformer.predict(X_train)
prediction_labels = transformer.predict(X_test)
prediction_score = transformer.predict_score(X_test)

# Note: this example uses the labels predicted on the training split as the
# reference (y_true), not ground-truth annotations
y_true = prediction_labels_train
y_pred = prediction_labels

print('Accuracy Score: ', accuracy_score(y_true, y_pred))

# Returns the confusion matrix; it is only displayed in an interactive session
confusion_matrix(y_true, y_pred)

print(classification_report(y_true, y_pred))

The results are as follows:

Accuracy Score:  0.89
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      9005
           1       0.21      0.04      0.06       995

    accuracy                           0.89     10000
   macro avg       0.55      0.51      0.50     10000
weighted avg       0.83      0.89      0.85     10000
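The script above takes its reference labels from predictions on the training split. If ground-truth labels are wanted instead, one option is to derive them from the data file name, which appears to follow the pattern `<id>_UCR_Anomaly_<name>_<train end>_<anomaly start>_<anomaly end>.txt`. The sketch below rests on that assumption; the helper names are hypothetical, and whether the indices are 0- or 1-based and inclusive should be checked against the dataset documentation.

```python
import os
import re

def parse_ucr_name(path):
    """Return (train_end, anomaly_start, anomaly_end) parsed from the file name."""
    name = os.path.basename(path)
    match = re.search(r"_(\d+)_(\d+)_(\d+)\.txt$", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return tuple(int(group) for group in match.groups())

def make_test_labels(path, n_points):
    """Build 0/1 labels for the test split (points after train_end)."""
    train_end, start, end = parse_ucr_name(path)
    labels = [0] * (n_points - train_end)
    # Assumption: the anomaly range is inclusive and uses 0-based indices
    for i in range(max(start, train_end), min(end + 1, n_points)):
        labels[i - train_end] = 1
    return labels

path = "./500_UCR_Anomaly_robotDOG1_10000_19280_19360.txt"
n_points = 20000  # illustrative value; in practice use len(data) from the loaded file
labels = make_test_labels(path, n_points)
print(parse_ucr_name(path), sum(labels))
```

The resulting list could be passed as `y_true` in place of `prediction_labels_train`, provided its length matches the test predictions.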

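Building Pipelines with the TODS API

In the later stages of pipeline exploration, experiments need to be managed in a reproducible way without additional development effort, because there will be many more combinations of hyperparameters and components. In TODS, the pipeline construction and execution API lets users generate a variety of reproducible pipelines with a single script. The generated pipelines are stored as description files such as .json or .yml, which can easily be reproduced or executed with different datasets and shared with colleagues. The example below uses the TODS API to create an autoencoder pipeline in .json format and runs it with the TODS backend engine to detect point-wise outliers in the Yahoo network intrusion dataset [1].

Step 1: Generate the pipeline description file

The pipeline generation script is provided below. Although it looks longer than the Scikit-learn interface, users can easily add hyperparameters with candidate values.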

from d3m import index
from d3m.metadata.base import ArgumentType
from d3m.metadata.pipeline import Pipeline, PrimitiveStep

# Creating pipeline
pipeline_description = Pipeline()
pipeline_description.add_input(name='inputs')

# Step 0: dataset_to_dataframe
step_0 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.data_processing.dataset_to_dataframe'))
step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
step_0.add_output('produce')
pipeline_description.add_step(step_0)

# Step 1: column_parser
step_1 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.data_processing.column_parser'))
step_1.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
step_1.add_output('produce')
pipeline_description.add_step(step_1)

# Step 2: extract_columns_by_semantic_types(attributes)
step_2 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.data_processing.extract_columns_by_semantic_types'))
step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce')
step_2.add_output('produce')
step_2.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE,
                          data=['https://metadata.datadrivendiscovery.org/types/Attribute'])
pipeline_description.add_step(step_2)

# Step 3: extract_columns_by_semantic_types(targets)
step_3 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.data_processing.extract_columns_by_semantic_types'))
step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
step_3.add_output('produce')
step_3.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE,
                          data=['https://metadata.datadrivendiscovery.org/types/TrueTarget'])
pipeline_description.add_step(step_3)

attributes = 'steps.2.produce'
targets = 'steps.3.produce'

# Step 4: processing
step_4 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.timeseries_processing.transformation.axiswise_scaler'))
step_4.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference=attributes)
step_4.add_output('produce')
pipeline_description.add_step(step_4)

# Step 5: algorithm
step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.detection_algorithm.pyod_ae'))
step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.4.produce')
step_5.add_output('produce')
pipeline_description.add_step(step_5)

# Step 6: Predictions
step_6 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.tods.data_processing.construct_predictions'))
step_6.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.5.produce')
step_6.add_argument(name='reference', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce')
step_6.add_output('produce')
pipeline_description.add_step(step_6)

# Final Output
pipeline_description.add_output(name='output predictions', data_reference='steps.6.produce')

# Output to json
data = pipeline_description.to_json()
with open('autoencoder_pipeline.json', 'w') as f:
    f.write(data)
    print(data)

Step 2: Run the pipeline

Once the pipeline description file has been created, we can run it and evaluate the unsupervised pipeline as follows:

import os
import pandas as pd

from tods import generate_dataset, load_pipeline, evaluate_pipeline

this_path = os.path.dirname(os.path.abspath(__file__))
table_path = os.path.join(this_path, '../../datasets/anomaly/raw_data/yahoo_sub_5.csv')  # file path to the dataset
target_index = 6  # which column is the label
pipeline_path = "./autoencoder_pipeline.json"
metric = "ALL"

# Read data and generate dataset
df = pd.read_csv(table_path)
dataset = generate_dataset(df, target_index)

# Load the pipeline description generated in Step 1
pipeline = load_pipeline(pipeline_path)

# Run the pipeline
pipeline_result = evaluate_pipeline(dataset, pipeline, metric)
print(pipeline_result.scores)

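Although this API requires a script to generate the pipeline description file, it provides a flexible interface for generating multiple pipelines.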
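One way to picture generating multiple pipelines is to enumerate every combination of candidate hyperparameter values and emit one description file per combination. The sketch below covers only the enumeration step in plain Python. The hyperparameter names (`contamination`, `hidden_neurons`), their values, and the output file names are assumptions for illustration and should be checked against the primitive's actual hyperparameters.

```python
import itertools
import json

def candidate_configs(grid):
    """Yield one dict per combination of candidate hyperparameter values."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical candidates for the detection step (names are assumptions)
grid = {
    "contamination": [0.01, 0.05, 0.1],
    "hidden_neurons": [[32, 16, 32], [64, 32, 64]],
}

configs = list(candidate_configs(grid))
for i, config in enumerate(configs):
    # In a generation script, each entry would be applied to the detection step
    # with add_hyperparameter(name=..., argument_type=ArgumentType.VALUE, data=...)
    # before writing the description to its own .json file.
    print(f"autoencoder_pipeline_{i}.json", json.dumps(config))
```

Each configuration would then be written to its own description file, so every variant can be rerun or shared independently.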

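Automated Model Discovery with Label Information

Besides manually creating pipelines, TODS also uses the TODS API to provide automated model discovery. The goal of automated model discovery is to search for the best pipeline based on the label information in a validation set and a given computation time limit.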

import pandas as pd
from axolotl.backend.simple import SimpleRunner
from tods import generate_dataset, generate_problem
from tods.searcher import BruteForceSearch

table_path = '../../datasets/anomaly/raw_data/yahoo_sub_5.csv'
target_index = 6  # which column is the target
time_limit = 30  # search time budget in seconds

metric = 'F1_MACRO'  # F1 averaged over both labels (0 and 1)

# Read data and generate dataset and problem
df = pd.read_csv(table_path)
dataset = generate_dataset(df, target_index=target_index)
problem_description = generate_problem(dataset, metric)

# Start backend
backend = SimpleRunner(random_seed=0)

# Start search algorithm
search = BruteForceSearch(problem_description=problem_description,
                          backend=backend)

# Find the best pipeline
best_runtime, best_pipeline_result = search.search_fit(input_data=[dataset], time_limit=time_limit)
best_pipeline = best_runtime.pipeline
best_output = best_pipeline_result.output

# Evaluate the best pipeline
best_scores = search.evaluate(best_pipeline).scores

# Print the score of every pipeline tried during the search
print('*' * 52)
print('Search History:')
for pipeline_result in search.history:
    print('-' * 52)
    print('Pipeline id:', pipeline_result.pipeline.id)
    print(pipeline_result.scores)
print('*' * 52)

print('')

# Print the best pipeline, its output, and its scores
print('*' * 52)
print('Best pipeline:')
print('-' * 52)
print('Pipeline id:', best_pipeline.id)
print('Pipeline json:', best_pipeline.to_json())
print('Output:')
print(best_output)
print('Scores:')
print(best_scores)
print('*' * 52)

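Once the pipeline search finishes, users can access all searched pipelines by pipeline id and save any pipeline description file for later use. More details can be found in the official demo Jupyter notebook: https://github.com/datamllab/tods/blob/master/examples/Demo%20Notebook/TODS%20Official%20Demo%20Notebook.ipynb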
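To make the idea of a label-guided, time-limited search concrete, here is a small plain-Python sketch. It is not the TODS `BruteForceSearch` implementation: the "pipelines" are just candidate thresholds for a toy detector, and the data, the function names, and the macro-F1 helper are assumptions made for illustration.

```python
import time

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores for labels 0 and 1."""
    per_label = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)

def brute_force_search(candidates, evaluate, time_limit):
    """Score candidates one by one until they run out or the time budget is spent."""
    deadline = time.monotonic() + time_limit
    history = []
    for candidate in candidates:
        if time.monotonic() > deadline:
            break
        history.append((candidate, evaluate(candidate)))
    best = max(history, key=lambda item: item[1]) if history else None
    return best, history

# Toy labeled validation data (assumed): two readings are true outliers
values = [1.0, 1.2, 0.9, 8.0, 1.1, 0.8, 7.5, 1.0]
labels = [0, 0, 0, 1, 0, 0, 1, 0]
thresholds = [0.5, 2.0, 7.8, 10.0]

def evaluate(threshold):
    predictions = [1 if value > threshold else 0 for value in values]
    return f1_macro(labels, predictions)

best, history = brute_force_search(thresholds, evaluate, time_limit=30)
for candidate, score in history:
    print(f"threshold={candidate}: F1_MACRO={score:.3f}")
print("best:", best)
```

Keeping the full history, rather than only the winner, mirrors the point above that every searched candidate remains available for inspection after the search ends.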

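Summary

The project team is actively developing more features for the project, including a graphical user interface with visualization tools, semi-supervised learning algorithms, and advanced pipeline searchers. The goal is to make outlier detection on time series data accessible and easier. I hope you enjoyed reading this article. In upcoming articles, I will describe in detail the common strategies for detecting different types of outliers in time series data and introduce the data synthesizer in TODS along with its synthesis criteria.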

References

[1] Thill, Markus, Wolfgang Konen, and Thomas Bäck. “Online anomaly detection on the webscope S5 dataset: A comparative study.” 2017 Evolving and Adaptive Intelligent Systems (EAIS). IEEE, 2017.
[2] Ren, Hansheng, et al. “Time-series anomaly detection service at microsoft.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.
[3] Lai, K.-H., Zha, D., Wang, G., Xu, J., Zhao, Y., Kumar, D., Chen, Y., Zumkhawaka, P., Wan, M., Martinez, D., & Hu, X. (2021). TODS: An Automated Time Series Outlier Detection System. Proceedings of the AAAI Conference on Artificial Intelligence, 35(18), 16060–16062.
[4] Keogh, Eamonn, et al. “Segmenting time series: A survey and novel approach.” Data mining in time series databases. 2004. 1–21.
[5] https://wu.renjie.im/research/anomaly-benchmarks-are-flawed/arxiv/
[6] Yeh, Chin-Chia Michael, et al. “Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets.” 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.
[7] Keogh, Eamonn, Jessica Lin, and Ada Fu. “Hot sax: Efficiently finding the most unusual time series subsequence.” Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE, 2005.

Author: Henry Lai

Original article: https://towardsdatascience.com/tods-detecting-outliers-from-time-series-data-2d4bd2e91381