
[Anomaly Detection Algorithms] A Python Implementation of Isolation Forest

Author: A data man's own place

What is Isolation Forest?

Many outlier detection methods first profile the normal data points and then flag observations that do not conform to the normal pattern. The Isolation Forest (IForest) proposed by Liu, Ting, and Zhou (2008) works differently: it isolates outliers directly instead of characterizing normal points first. It uses a tree structure to isolate every observation; outliers tend to be separated first, while normal points end up deep in the tree. The authors call each tree an Isolation Tree (iTree) and build an ensemble of iTrees. Outliers are the observations with a short average path length across the iTrees.


https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

The figure below uses a partition map and a tree to explain how data points are isolated. The red point is the farthest from the others, followed by the green point, and then the blue point. In the partition map, it takes only one "cut" to separate the red point from the rest; the second cut isolates the green point, the third the blue point, and so on. The more cuts it takes to separate a point, the deeper that point sits in the tree. The tree structure tells the same story: the red point is reached after one split, the green point after two, the blue point after three, and so on. The depth is therefore a good proxy for how anomalous a point is, and to follow the convention that outliers get high scores, the outlier score is defined as the reciprocal of the depth.
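For reference, the linked paper does not use the reciprocal of the depth literally. It normalizes the average path length E(h(x)) of a point x over all iTrees into the score

s(x, n) = 2^(-E(h(x)) / c(n)),  where  c(n) = 2H(n-1) - 2(n-1)/n  and  H(i) ≈ ln(i) + 0.5772156649,

with n the subsample size and c(n) the average path length of an unsuccessful search in a binary search tree. Scores close to 1 indicate outliers, while scores well below 0.5 indicate normal points.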

[Figure: iTree — isolating points with a partition map and a tree]

An iTree is a binary tree in which each node has zero or two children. The tree stops growing when a terminal node holds a single data point, when all values in a node are identical, or when the tree reaches the height limit set by the researcher. An iTree does not need to be grown until every terminal node holds a single point: since we care about the outliers close to the root, the tree normally stops once the depth reaches the set limit, and there is no need to build a large iTree because most of the data in it are normal points. A smaller sample size actually yields a better iTree, because the swamping and masking effects are reduced. The iTree algorithm differs from a decision tree in that it does not use a target variable to train the tree; it is an unsupervised learning method.
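To make the growth rules concrete, here is a toy iTree sketch in plain Python (illustrative only, not PyOD's implementation): split on a random feature at a random value until a node holds a single point, all rows are identical, or the height limit is reached.

import numpy as np

def grow_itree(X, height=0, height_limit=8):
    # Stop when one point is left, all rows are identical, or the limit is hit.
    if len(X) <= 1 or height >= height_limit or (X == X[0]).all():
        return {'size': len(X)}                      # external (leaf) node
    q = np.random.randint(X.shape[1])                # pick a random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                                     # this feature cannot split the node
        return {'size': len(X)}
    p = np.random.uniform(lo, hi)                    # pick a random split value
    return {'q': q, 'p': p,
            'left':  grow_itree(X[X[:, q] < p],  height + 1, height_limit),
            'right': grow_itree(X[X[:, q] >= p], height + 1, height_limit)}

def path_length(x, node, height=0):
    # Outliers reach a leaf after only a few splits, so their path is short.
    if 'size' in node:
        return height
    child = node['left'] if x[node['q']] < node['p'] else node['right']
    return path_length(x, child, height + 1)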

Why "Forest"

Random forests may be more familiar than isolation forests. "Forest" refers to the ensemble-learning idea of constructing many trees. Why do this? A single decision tree is known to overfit: the model predicts the training data well but generalizes poorly to new data. An ensemble overcomes this by building multiple trees and averaging their predictions.

Because Isolation Forest does not use any distance measure to detect outliers, it is fast and has a small memory footprint. This makes it suitable for both large data volumes and high-dimensional problems.

[Figure (B): Isolation Forest]

Figure (B) shows a data matrix in which each row is an observation with multi-dimensional values. The goal of IForest is to assign an outlier score to each observation. It first randomly selects some rows and some columns to create tables such as (1), (2), and (3); each observation appears in at least one table. An iTree is built for each table to produce outlier scores. Table (1) has 6 rows and 3 columns. The first point cut off in Table (1) is likely the 6th observation, because its values differ greatly from the others; the second is likely the 4th observation. Similarly, in Table (3) the first point cut off is likely the 6th observation (the third record in that table), and the second is the 4th observation (the first record in that table). In short, with N tables there are N iTrees, and an observation can receive up to N scores. IForest takes the arithmetic mean of these scores as the final score.
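Continuing the toy sketch above, a toy "forest" grows one iTree per random subsample and averages each observation's path length across the trees; a short average path means the point was easy to isolate (again purely illustrative, reusing the grow_itree and path_length helpers):

def toy_iforest_scores(X, n_trees=100, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    trees = [grow_itree(X[rng.choice(len(X), size=min(sample_size, len(X)),
                                     replace=False)])
             for _ in range(n_trees)]
    # Average path length per observation, negated so that higher = more anomalous.
    depths = np.array([[path_length(x, t) for t in trees] for x in X])
    return -depths.mean(axis=1)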

Modeling process

Step 1: Build the model

I generated a simulated dataset with six variables and 500 observations. The unsupervised model uses only the X variables; the Y variable is used only for validation. The percentage of outliers is set to 5% via contamination=0.05. You can plot the first two variables, with yellow points representing outliers and purple points representing normal data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyod.utils.data import generate_data

contamination = 0.05  # percentage of outliers
n_train = 500         # number of training points
n_test = 500          # number of testing points
n_features = 6        # number of features

X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train,
    n_test=n_test,
    n_features=n_features,
    contamination=contamination,
    random_state=123)

X_train_pd = pd.DataFrame(X_train)
X_train_pd.head()
[Table: first five rows of X_train_pd]

[Figure: scatter plot of the first two variables]
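The scatter plot above can be reproduced in a few lines (a sketch; with matplotlib's default colormap, label 1 renders yellow and label 0 purple):

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.title('First two variables (yellow = outlier, purple = normal)')
plt.show()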

The size of the tree

max_samples is set to 40 observations. In IForest, a smaller sample size produces better iTrees, so there is no need to specify a large tree size.

from pyod.models.iforest import IForest

isft = IForest(contamination=0.05, max_samples=40, behaviour='new')
isft.fit(X_train)

# Training data
y_train_scores = isft.decision_function(X_train)
y_train_pred = isft.predict(X_train)

# Test data
y_test_scores = isft.decision_function(X_test)
y_test_pred = isft.predict(X_test)  # outlier labels (0 or 1)

# Threshold for the defined contamination rate
print("The threshold for the defined contamination rate:", isft.threshold_)

def count_stat(vector):
    # Because the labels are '0' and '1', we can run a count statistic.
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

print("The training data:", count_stat(y_train_pred))
print("The test data:", count_stat(y_test_pred))
The threshold for the defined contamination rate: -3.986394547794703e-15
The training data: {0: 475, 1: 25}
The test data: {0: 470, 1: 30}

Contamination rate (contamination)

In practice, the percentage of outliers often cannot be determined in advance. Step 2 below explains how to choose a reasonable threshold when the contamination rate is unknown. PyOD defaults to a contamination rate of 10%; here I set it to 5% because that is the rate in the training sample. This parameter does not affect the calculation of the outlier scores. The built-in attribute threshold_ holds the threshold computed on the training data from the contamination rate; in this example it is -3.986e-15 (effectively 0.0) at a contamination rate of 0.05. The function decision_function() generates the outlier scores, and predict() assigns the labels (1 or 0) based on the threshold.
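As a quick sanity check, the labels from predict() can be reproduced by thresholding the scores yourself (a sketch, assuming PyOD's convention that scores strictly above threshold_ are labeled outliers):

manual_labels = (y_train_scores > isft.threshold_).astype(int)
print((manual_labels == y_train_pred).all())  # expected: True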

Hyperparameters

I will use .get_params() to list some of the important parameters:

isft.get_params()

{'behaviour': 'new',
 'bootstrap': False,
 'contamination': 0.05,
 'max_features': 1.0,
 'max_samples': 40,
 'n_estimators': 100,
 'n_jobs': 1,
 'random_state': None,
 'verbose': 0}
  • "max_samples"

    (Maximum number of samples) specifies the number of samples extracted for each base estimator of training, which is an important parameter;
  • "n_estimators"

    Represents the number of trees in the set, defaulting to 100;
  • "max_features"

    (Maximum Features) specifies the number of features extracted by each base estimator for training, defaulting to 1.0;
  • "n_jobs"

    Indicates the number of jobs running in parallel, and if set to -1, the number of jobs will be determined based on the number of cores.
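As an illustration only (the values below are hypothetical, not tuned recommendations), a configuration that grows more trees in parallel could look like this:

isft_tuned = IForest(contamination=0.05, n_estimators=200,
                     max_samples=64, max_features=1.0,
                     n_jobs=-1, random_state=42)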

Feature importance

IForest's tree structure makes it possible to measure the relative importance of each feature in determining outliers. The importance is measured with the Gini impurity index, and the values add up to 1.0.

isft_vi = isft.feature_importances_
isft_vi

array([0.17325335, 0.13573292, 0.17200832, 0.17157248, 0.17091259, 0.17652033])

You can plot the feature importances just like other tree models. The chart below shows the relative strength of each feature in determining outliers.

from matplotlib import pyplot as plt

for_plot = pd.DataFrame({'x_axis': X_train_pd.columns,
                         'y_axis': isft_vi}).sort_values(by='y_axis', ascending=True)
for_plot.plot.barh(x='x_axis', y='y_axis', legend=False)  # label bars with feature names
[Figure: IForest variable importance for outliers]

Step 2: Determine a reasonable threshold for the model

The threshold should be determined from the histogram of the outlier scores. The figure below suggests a threshold of about 0.0: most normal data points score below 0.0, while the scores of the abnormal data lie in the higher range.

import matplotlib.pyplot as plt
plt.hist(y_train_scores, bins='auto')  # arguments are passed to np.histogram
plt.title("Outlier score")
plt.show()
[Figure: histogram of the training outlier scores]
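If you prefer a numeric rule to eyeballing the histogram, a high quantile of the training scores gives a comparable cut-off (a sketch; the 95th percentile mirrors the 5% contamination assumption):

threshold_by_quantile = np.percentile(y_train_scores, 95)
print(threshold_by_quantile)  # close to the 0.0 suggested by the histogram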

Step 3: Display descriptive statistics for the normal and outlier groups

Analyzing the normal and outlier groups is a key step in verifying a model's plausibility. I wrote a short function, descriptive_stat_threshold(), that shows the sizes of the normal and outlier groups and the descriptive statistics of their features. Below I simply use the threshold implied by the contamination rate; you can test different thresholds to find a reasonable size for the outlier group.

threshold = isft.threshold_  # Or other value from the above histogram

def descriptive_stat_threshold(df, pred_score, threshold):
    # Label each observation as 'Normal' or 'Outlier' by the threshold.
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_score
    df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')

    # Show the summary statistics:
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(
        columns={'Anomaly_Score': 'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100  # The count and count %
    stat = df.groupby('Group').mean().round(2).reset_index()    # The averages
    stat = cnt.merge(stat, on='Group')  # Put the count and the averages together
    return stat

descriptive_stat_threshold(X_train, y_train_scores, threshold)
[Table: count, count %, and feature averages of the normal and outlier groups]

The table above contains the essentials for evaluating the model and presenting the results. For an effective presentation, make sure to label the features with their names.

  • Size of the outlier group: the size of the outlier group depends on the chosen threshold; a higher threshold yields a smaller group (demonstrated in the sketch after this list).
  • Feature statistics in each group: the feature statistics should be consistent with prior business knowledge. If a feature shows a puzzling result, it should be re-examined or dropped, and the model re-run until every feature is plausible.
  • Average anomaly score: the average score of the outlier group is much higher than that of the normal group (0.18 > -0.10); the scores themselves need no further interpretation.
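For instance, raising the threshold to a higher quantile of the scores shrinks the outlier group (a sketch reusing the function above; the 97.5th percentile is an arbitrary illustrative value):

higher_threshold = np.percentile(y_train_scores, 97.5)
descriptive_stat_threshold(X_train, y_train_scores, higher_threshold)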

Since we have the ground truth used to generate the data, we can produce a confusion matrix to assess the model's performance. The model performed well and identified all 25 outliers.

def confusion_matrix(actual, score, threshold):
    Actual_pred = pd.DataFrame({'Actual': actual, 'Pred': score})
    Actual_pred['Pred'] = np.where(Actual_pred['Pred'] <= threshold, 0, 1)
    cm = pd.crosstab(Actual_pred['Actual'], Actual_pred['Pred'])
    return cm

confusion_matrix(y_train, y_train_scores, threshold)
[Table: confusion matrix of the training data]
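As a cross-check, the same matrix can be produced with scikit-learn's metrics (a sketch; the local confusion_matrix function above is kept, so sklearn's is imported under an alias):

from sklearn.metrics import confusion_matrix as sk_cm
pred_labels = np.where(y_train_scores <= threshold, 0, 1)  # same rule as above
print(sk_cm(y_train, pred_labels))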

Achieving model stability by aggregating multiple models

The IForest algorithm is sensitive to outliers and can overfit. To obtain stable predictions, the scores of multiple models can be aggregated. Of all the hyperparameters, the number of trees, n_estimators, is probably the most critical. I will build 5 models over a range of tree counts and take the average of their predictions as the final score. The PyOD module provides four ways to combine the results; you only need one of them to produce the combined score. Note that these functions require installing combo:

pip install combo

from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.models.iforest import IForest

# Standardize data
X_train_norm, X_test_norm = standardizer(X_train, X_test)

# Test a range of the number of trees
k_list = [100, 200, 300, 400, 500]
n_clf = len(k_list)

# Prepare arrays to store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])

# Modeling
for i in range(n_clf):
    k = k_list[i]
    isft = IForest(contamination=0.05, n_estimators=k)
    isft.fit(X_train_norm)

    # Store the results in each column:
    train_scores[:, i] = isft.decision_function(X_train_norm)
    test_scores[:, i] = isft.decision_function(X_test_norm)

# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)

The scores of the 5 models are averaged to obtain the average outlier score ("y_by_average"). Its histogram is plotted below.

# Combination by average
# test_scores_norm is 500 x 5. The "average" function takes the average of
# the 5 columns, so "y_by_average" is a single column:
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)

import matplotlib.pyplot as plt
plt.hist(y_train_by_average, bins='auto')  # arguments are passed to np.histogram
plt.title("Combination by average")
plt.show()
[Figure: histogram of the scores combined by average]

The histogram above suggests a threshold of 1.0. With that threshold, the characteristics of the normal and outlier groups are tabulated below; 25 data points are identified as outliers.

descriptive_stat_threshold(X_train, y_train_by_average, 1.0)
[Table: descriptive statistics of the groups at a threshold of 1.0]
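For completeness, the other three combination functions imported earlier are called the same way (a sketch; aom and moa additionally take the number of subgroups to split the models into):

y_test_by_max = maximization(test_scores_norm)       # maximum across the 5 models
y_test_by_aom = aom(test_scores_norm, n_buckets=5)   # average of maximum
y_test_by_moa = moa(test_scores_norm, n_buckets=5)   # maximum of average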

Summary

Most model-based anomaly detection methods first construct a profile of the normal instances and then flag the instances that do not conform to it as outliers. IForest, by contrast, isolates the anomalous data points directly and explicitly. It uses a tree structure to isolate every data point, so outliers are singled out first while normal points sit deeper in the tree. Because Isolation Forest uses no distance measure to detect anomalies, it is fast and well suited to large data volumes and high-dimensional problems.
